Determining the Single Nucleotide Polymorphisms to be used in building Polygenic Risk Scores for Lung Cancer Susceptibility

This blog post describes how I created a list of SNPs associated with lung cancer. The list will be used in later steps to build polygenic risk scores (PRSs).

The NHGRI-EBI Catalog of human genome-wide association studies

It might sound like a silly method, but I went to the NHGRI-EBI GWAS Catalog website and downloaded the file entitled “All associations v1.0”.

This file has about 600,000 rows, each containing a SNP that a prior GWAS reported as associated with a disease. I searched for every row containing the word “lung”, then deleted the rows that turned out to be irrelevant.

Although it took about 20 minutes, I felt that doing it by hand was safer than building a script to match phrases like “lung cancer”, “small cell lung cancer”, “lung adenocarcinoma”, etc. There were also several studies of non-malignant lung diseases that had to be excluded.
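
For comparison, a scripted version of the same search might look like the sketch below. This is not what I did; it is only an illustration. The column name `DISEASE/TRAIT` is an assumption about the download file, the keyword patterns are deliberately crude, and the example rows (including their rsids) are made-up placeholders.

```python
import re

# Assumed patterns: a row is kept only if its trait mentions "lung"
# AND a malignancy word. Both patterns are illustrative, not exhaustive.
LUNG = re.compile(r"\blung\b", re.IGNORECASE)
MALIGNANCY = re.compile(r"cancer|carcinoma", re.IGNORECASE)


def is_lung_malignancy(trait: str) -> bool:
    """True if the trait text mentions both the lung and a malignancy term."""
    return bool(LUNG.search(trait)) and bool(MALIGNANCY.search(trait))


def filter_lung_rows(rows, trait_column="DISEASE/TRAIT"):
    """Keep only the rows whose trait looks like a lung malignancy."""
    return [row for row in rows if is_lung_malignancy(row.get(trait_column, ""))]


# Placeholder rows, not real catalog entries.
example_rows = [
    {"DISEASE/TRAIT": "Lung adenocarcinoma", "SNPS": "rs0000001"},
    {"DISEASE/TRAIT": "Lung function (FEV1)", "SNPS": "rs0000002"},
    {"DISEASE/TRAIT": "Small cell lung cancer", "SNPS": "rs0000003"},
    {"DISEASE/TRAIT": "Breast cancer", "SNPS": "rs0000004"},
]
lung_rows = filter_lung_rows(example_rows)
```

A filter like this would still need a manual pass afterwards, since trait names in the catalog are free text and a keyword match cannot tell which rows are truly relevant.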

My final list covers all studies up to 4-28-2022 and contains 957 SNPs associated with lung malignancy in some way. Of note, these studies include associations with susceptibility to lung malignancy, as well as associations with prognosis and response to certain treatments.

The final file is called gwas_snps_07_2022.xlsx and should be available in my directory on LPC.

Cutting down and cleaning the SNPs

First, I removed any rows from a study with fewer than 1,000 cases (e.g. a study with 100 cases but 10,000 controls is still removed). This narrowed the list down to 794 SNPs. I labeled this tab “removed_small_studies”.
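
The case-count rule could be sketched in code as follows. I applied it by hand in the spreadsheet, so this is only an illustration. It assumes the case numbers sit in a free-text column called `INITIAL SAMPLE SIZE`, written like "100 European ancestry cases, 10,000 European ancestry controls", and that summing the cases across ancestry groups is the right way to count them.

```python
import re

# A number (with optional thousands separators) followed, before the next
# comma or digit, by the word "cases". Numbers followed by "controls" are skipped.
CASE_COUNT = re.compile(r"(\d[\d,]*)\s+[^,\d]*?cases", re.IGNORECASE)


def count_cases(sample_description: str) -> int:
    """Sum every number that is labelled as 'cases' in the description."""
    return sum(int(n.replace(",", "")) for n in CASE_COUNT.findall(sample_description))


def drop_small_studies(rows, min_cases=1000, size_column="INITIAL SAMPLE SIZE"):
    """Keep rows whose study has at least `min_cases` cases."""
    return [row for row in rows if count_cases(row.get(size_column, "")) >= min_cases]


small = "100 European ancestry cases, 10,000 European ancestry controls"
large = ("29,266 European ancestry cases, 56,450 European ancestry controls, "
         "1,200 East Asian ancestry cases")
kept = drop_small_studies([
    {"INITIAL SAMPLE SIZE": small},
    {"INITIAL SAMPLE SIZE": large},
])
```

Rows whose description does not use the word "cases" at all would count as zero cases and be dropped, so they would need a manual look.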

Second, I removed duplicate SNPs, but only when both entries described an association with the same disease. If a duplicate had a different odds ratio and p-value, I deleted the entry from the study with the smaller sample size. If a duplicate listed a different variant at that locus, I again kept the SNP from the larger study. Some notes about this:

  • I sorted the Excel file by SNP rsid, then used the “conditional formatting” tool to find duplicates
  • I didn’t remove duplicates when the same rsid appeared for different diseases, e.g. “lung cancer” and “squamous cell carcinoma”
  • This means that there are still duplicate rsids in the table, which will be separated later into subtypes of cancer

By the end of step 2, there were 703 SNPs. I labeled this tab “removed_duplicates”.
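
The duplicate rule above can be sketched as: group rows by rsid and disease, and keep the row from the largest study. This is a minimal illustration rather than the spreadsheet procedure itself. The columns `SNPS` and `DISEASE/TRAIT` are assumed names, `n_total` is a hypothetical pre-computed sample size, and treating "same disease" as "identical trait text" is a simplification.

```python
def remove_duplicates(rows):
    """Keep one row per (rsid, trait) pair: the one from the largest study."""
    best = {}
    for row in rows:
        # Same rsid under a different trait is a different key, so it is kept.
        key = (row["SNPS"], row["DISEASE/TRAIT"].strip().lower())
        if key not in best or row["n_total"] > best[key]["n_total"]:
            best[key] = row
    return list(best.values())


# Placeholder rows, not real catalog entries.
rows = [
    {"SNPS": "rs0000001", "DISEASE/TRAIT": "Lung cancer", "n_total": 5000},
    {"SNPS": "rs0000001", "DISEASE/TRAIT": "Lung cancer", "n_total": 20000},
    {"SNPS": "rs0000001", "DISEASE/TRAIT": "Squamous cell lung carcinoma", "n_total": 3000},
]
deduplicated = remove_duplicates(rows)
```

Here the two “Lung cancer” rows collapse into the one from the larger study, while the squamous row with the same rsid survives.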

Third, I removed studies that examined SNP x SNP interactions. These are listed as rsid x rsid rather than as a single SNP. After this, I had 676 SNPs, in a tab labeled “removed_crossSNPs”.
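
In code, spotting the interaction rows could be as simple as looking for a standalone “x” between two identifiers. A minimal sketch, assuming the SNP identifiers live in a column called `SNPS` (the rsids shown are placeholders):

```python
import re

# An "x" surrounded by whitespace, as in "rs0000001 x rs0000002".
INTERACTION = re.compile(r"\s+x\s+", re.IGNORECASE)


def is_single_snp(snp_field: str) -> bool:
    """True if the field does not look like an rsid x rsid interaction."""
    return INTERACTION.search(snp_field) is None


def drop_interactions(rows, snp_column="SNPS"):
    """Keep only rows that describe a single SNP."""
    return [row for row in rows if is_single_snp(row.get(snp_column, ""))]


single_only = drop_interactions([
    {"SNPS": "rs0000001"},
    {"SNPS": "rs0000001 x rs0000002"},
])
```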

Fourth, I removed all SNPs whose listed risk allele was blank, i.e. given as “?”. I labeled this tab “removed_noRiskAllele”.
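
A sketch of this filter, assuming the risk allele is stored in a column called `STRONGEST SNP-RISK ALLELE` in the form "rsid-allele" (for example "rs0000001-A", or "rs0000001-?" when the allele is unknown). Both the column name and the format are assumptions, and the rsids are placeholders.

```python
def risk_allele(field: str) -> str:
    """Return the text after the last hyphen, or '' if there is no hyphen."""
    return field.rsplit("-", 1)[-1].strip() if "-" in field else ""


def drop_unknown_risk_alleles(rows, allele_column="STRONGEST SNP-RISK ALLELE"):
    """Keep rows whose risk allele is present and is not the '?' placeholder."""
    return [
        row for row in rows
        if risk_allele(row.get(allele_column, "")) not in ("", "?")
    ]


with_alleles = drop_unknown_risk_alleles([
    {"STRONGEST SNP-RISK ALLELE": "rs0000001-A"},
    {"STRONGEST SNP-RISK ALLELE": "rs0000002-?"},
])
```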

Fifth, I separated out the studies that specifically examined the prognosis of patients with any type of lung cancer. This isn’t my focus at the moment, but may be in the future. I moved this group to a tab called “SNPs_prognosis”. All the other SNPs, the ones we’re interested in because they are associated with lung cancer susceptibility, went in a tab called “SNPs_susceptibility”.

Lastly, I created a file of susceptibility SNPs for each subtype of cancer:

  • SNPs_susceptibilities_lung_cancer for lung cancer overall: 172 SNPs
  • SNPs_susceptibilities_lung_squamous for squamous cell lung cancer: 67 SNPs
  • SNPs_susceptibilities_lung_adeno for lung adenocarcinoma: 67 SNPs
  • SNPs_susceptibilities_lung_SCLC for small cell lung cancer: 11 SNPs
  • SNPs_susceptibilities_lung_NCLCL for non-small cell lung cancer: 28 SNPs
    • I acknowledge that non-small cell lung cancer should eventually include subtypes like squamous and adenocarcinoma! But for now I’m placing each SNP in its own histological subtype. We can consider modifying things later on.
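
The split into subtype files can be sketched as an ordered set of keyword rules. This is an illustration only: the keywords, the group labels, and the `DISEASE/TRAIT` column name are all assumptions, and the real trait names would need checking by hand. As in the list above, squamous and adenocarcinoma get their own groups rather than being folded into non-small cell.

```python
# Ordered rules: the first matching keyword wins. "non small cell" must be
# tested before "small cell", because the latter is a substring of the former.
SUBTYPE_RULES = [
    ("squamous", "squamous"),
    ("adenocarcinoma", "adeno"),
    ("non small cell", "non_small_cell"),
    ("small cell", "small_cell"),
]


def assign_subtype(trait: str) -> str:
    """Map a free-text trait to a subtype label, defaulting to overall lung cancer."""
    text = trait.lower().replace("-", " ")  # treat "non-small" and "non small" alike
    for keyword, label in SUBTYPE_RULES:
        if keyword in text:
            return label
    return "lung_cancer_overall"


def split_by_subtype(rows, trait_column="DISEASE/TRAIT"):
    """Group rows into one list per subtype label."""
    groups = {}
    for row in rows:
        groups.setdefault(assign_subtype(row[trait_column]), []).append(row)
    return groups


groups = split_by_subtype([
    {"DISEASE/TRAIT": "Lung cancer"},
    {"DISEASE/TRAIT": "Non-small cell lung cancer"},
    {"DISEASE/TRAIT": "Small cell lung cancer"},
    {"DISEASE/TRAIT": "Lung adenocarcinoma"},
])
```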

Note: At this point in the process, I did not stratify at all by P-value for each risk allele. This table includes every SNP with a P-value < 1 × 10^-5.