Performance of polygenic risk scores for cancer prediction in a racially diverse academic biobank: Methods part 2

8 minute read

Published: April 23, 2022

This blog post is Part 2 of outlining the materials and methods used in our lab’s most recent work on generating polygenic risk scores (PRSs). This article is entitled “Performance of polygenic risk scores for cancer prediction in a racially diverse academic biobank”. It was published in October 2021 and led by Dr. Louise Wang and Heena Desai.

My sources for this blog post are the article, available here on Elsevier and here on PubMed, as well as a text file from Heena that explained the study process in greater detail.

It’s worth noting that the following instructions were created for the purposes of passing on our labs methods to another lab and application in different datasets. In my future work, I actually manipulate this process quite a lot and simplify it. In retrospect, it also has quite a bit of missing information.

Code that manipulates exome data

Below, we use PLINK 1.9 to start manipulating exome data Data used is the Exome .vcf.gz files, of which there is 1 for each chromosome

1. Convert exome.vcf.gz to .bed files for each chromosome

plink –vcf exome_chr#.vcf.gz –make-bed –out chr#_plink_exome

2. Use PLINK to convert pull out only SNPs of interest

For -snp, enter format of SNPs in PMBB dataset, such as chr#_pos_ref_alt or chr#_pos
This will create a file for each chromosome, which has subsetted SNPs and still contains the entire population

plink –bfile chr#_plink_exome –snp #_pos_ref_alt –out chr#_pos_genotype –recode A

3. Convert the file master SNP input file into specific columns with headers of “ID” and “rsid”

These correspond to PMBB ids and whether a given variant at a specific rsid is present for an individual’s genotype with values 0, 1, or 2

cut -d “ “ -f2-2,7- –output-delimiter=, chr#_pos_genotype.raw > chr#_pos_geno.csv

Code that manipulates phenotypic data:

1. Separate individuals into African vs. European ancestry. This was performed using PCA with Eigenstrat fastpca

Code is not available for this at the moment
This comes in a .csv filetype, for which I presume there’s now patient IDs in one column and ancestry in another column

3. We then take the .csv file and read it into R

population<-read.csv(pca_file from above, header=T, sep=””)

4. List number of individuals in each ancestry groups

table(population$GIA)

5. Take european population only and save as var

Write the European subset into a file
- No row names
- Remove quotes
- Tab separated
- Label as “EUR_ancestry_PCs.txt”

EUR_pop <-subset(population, population$GIA==”EUR”)
write.table(EUR_pop, “EUR_ancestry_PCs.txt”, row.names=F, quote=F, sep=”\t”)

6. Extract European subset PMBB IDs and write them to a file

No row or column names
Remove quotes
Tab separated
Label as “EUR_ancestry_IDs.txt”

EUR_pop_id<-EUR_pop[,c(1,1)]
write.table(EUR_pop_id, "EUR_ancestry_IDs.txt", row.names=F, col.names=F, quote=F, sep="\t")

7. Repeat the same as #5 and #6 but for the African population

AFR_pop<-subset(population, pop$GIA=="AFR")
write.table(AFR_pop, "AFR_ancestry_PCs.txt", row.names=F, quote=F, sep="\t")
AFR_pop_id<-AFR_pop[,c(1,2)]
write.table(AFR_pop_id "AFR_ancestry_IDs.txt", row.names=F, col.names=F, quote=F, sep="\t")

Code that manipulates genetic data

8. Keep European individuals, and remove related individuals

We have a “EUR_ancestry_IDs.txt” file
We have a “related individuals file” with PMBB ids
For each chromosome with a filetype of .vcf.gz
- Keep the european individuals
- Remove the related individuals
- Save the remaining files as .bed format

plink --vcf VCF_chr#.vcf.gz --keep EUR_ancestry_IDs.txt --remove (related individuals file) --make-bed --out EUR_file_name_chr# 

9. Merge the list of 23 chromosome files into one .txt file, then convert each to .bed filetype

ls EUR_file_name_chr*.bed > eur_merge_list.txt
plink --merge-list eur_merge_list.txt --allow-no-sex --make-bed --out EUR_file_name_chrALL

10. Remove samples that failed sex check

plink –bfile –remove failed_sex_check_file –make-bed –out EUR_file_name_final_chrALL

11. Repeat steps 8-10, but for African population

plink --vcf VCF_chr#.vcf.gz --keep AFR_ancestry_IDs.txt --remove (related individuals file) --make-bed --out AFR_file_name_chr# 
ls AFR_file_name_chr*.bed > merge_list.txt
plink --merge-list afr_merge_list.txt --allow-no-sex --make-bed --out AFR_file_name_chrALL
plink --bfile AFR_file_name_chrALL --remove failed_sex_check_file --make-bed --out AFR_file_name_final_chrALL

End result: we now have two files of .bed format:

EUR_file_name_final_chrALL
AFR_file_name_final_chrALL

Phenotyping of PMBB participants

I explain the phenotyping of PMBB participants in greater detail in prior entries. Briefly, at the time of the study, there were 19,935 individuals included for whom electronic health data was available. This includes demographic data, lab results, and ICD-9 and ICD-10 diagnosis codes.

Interestingly, the presence of an ICD code for prostate cancer was imperfect. Manual chart review was performed for 2365 individuals with at least 1 ICD-9/10 billing code for prostate cancer. The positive predictive value of 1 ICD-9/10 prostate cancer billing code was 94%, while the presence of >5 ICD-9/10 prostate cancer billing codes for a patient across several encounters had much high positive predictive value.

Because of this finding, manual chart review was performed for several other cancers which showed slightly better positive predictive value of ICD-9/10 codes for true presence of a cancer diagnosis.

SNV (Single Nucleotide Variation) selection

Summary statistics were obtained using the most recent study and largest GWAS available in the GWAS catalog for each of the 15 cancers in this study. Most of these were obtained at the www.ebi.ac.uk website, and the table below shows where they were taken from.

These were studies of European populations. SNVs were chosen on the basis of a P value threshold of P < 1 × 10–6.

SNVs were also controlled for linkage disequilibrium (LD). If there was a pairwise genotypic correlation at r2 > 0.1, then SNVs were “pruned”. I.e, only the most significant SNV in that group was included.

PRS generation and statistical analysis

PRS for each individual was calculated using PLINK 1.9 by summing the LD-pruned SNV variants and weighing their corresponding effect sizes using the odds ratios (ORs) or beta values reported in the original GWAS. We standardized each cancer PRS to a mean of 0 and SD of 1 and used logistic regression to test for the association between cancer PRS and cancer phenotype and to compare the top versus bottom quintiles of polygenic risk, controlling for age, sex, and the first 10 within-ancestry PCs as covariates. We performed a primary analysis for cancers with >100 cases and included a secondary analysis of the remaining cancer types. Area under the curve (AUC) for primary phenotype was determined using the package pROC.23 All statistical analyses were performed using R 4.0.3.

Code for generation of PRSs

1. Select out only variants for our SNPs that achieved genome wide significance in GWAS studies

This filetype is named gws_variants_file.txt

Here, we’re inputting our final .bed file that was created above, for each our African and European population

EUR_file_name_final_chrALL
AFR_file_name_final_chrALL

Our output file is a .profile file that should include

A master file of genome-wide significant SNPs for each individual
Columns that identify covariant IDs, allele codes, and allele scores
Note that 3 4 7 here denotes columns for: variant IDs - col 3, allele codes - col 4, scores - col 7

plink --bfile EUR_file_name_final_chrALL --score gws_variants_file.txt 3 4 7 header --out EUR_cancer_name_PRS_PLINK
plink --bfile AFR_file_name_chrALL --score gws_variants_file.txt 3 4 7 header --out AFR_cancer_name_PRS_PLINK

2. Load libraries

library(pROC)
library(ggplot2)
library(scales)
library(tidyverse)
library(dplyr)

3. Add our 10 principal components

Add covariates that were measured through PCA - this is essentially 10 control variables
Ensure labelling of “id” as our first column throughout all files

covariates <- read.csv(file= eur_covariate_file, header = T, sep="\t")
colnames(eur_covariates)[1] = "id"

4. Idenfity subpopulation with the cancer in question

Read the individuals into a variable named “phenotypes”

phenotypes <- read.csv("Phenotypes/EUR_cancer_name_Pheno.csv", header =T, sep=",", check.names = F)
colnames(phenotypes)[1] = "id"
colnames(phenotypes)[2] = "pheno"

5. Load .profile file into variable, label first column as “id”

genotype file = .profile generated from PLINK
genotypes <- read.csv("EUR_cancer_name_PRS_PLINK.profile", header = T, sep = "")
colnames(genotypes)[1] <- "id"

6. Scale genotype file

genotypes_scaled <- data.frame(genotypes[2], (apply(genotypes[6],2,scale)))
colnames(genotypes_scaled)[1] = "id"
colnames(genotypes_scaled)[2] <- "normalizedScore"

7. Merge 3 files based on “id” column:

Genotype file
Phenotype file
Covariate file

score_pheno = merge(genotypes_scaled, phenotypes, by="id")
score_pheno_cov = merge(score_pheno, covariates, by="id")

8. Extract top, bottom, and remaining quintiles

top = which(score_pheno_cov$normalizedScore > quantile(score_pheno_cov$normalizedScore, 0.8))
bottom = which(score_pheno_cov$normalizedScore < quantile(score_pheno_cov$normalizedScore,0.2))
rest = which(score_pheno_cov$normalizedScore <= quantile(score_pheno_cov$normalizedScore, 0.8)) 

9. Define data frames for top, bottom, and remaining quintiles

top_q = score_pheno_cov[top,]
bot_q = score_pheno_cov[bottom,]
top_bot = c(top_q, bot_q)

10. Regression model as below:

mod <- glm(score_pheno_cov$pheno ~ score_pheno_cov$normalizedScore + score_pheno_cov$gender + score_pheno_cov$Age + score_pheno_cov$PC1 + score_pheno_cov$PC2 + score_pheno_cov$PC3 + score_pheno_cov$PC4 + score_pheno_cov$PC5 + score_pheno_cov$PC6 + score_pheno_cov$PC7 + score_pheno_cov$PC8 + score_pheno_cov$PC9 + score_pheno_cov$PC10, family=”binomial”)
summary(mod)

11. Repeat Steps 1-10 for AFR population

That’s all for today!

Share on

Twitter Facebook LinkedIn

Oliver Clark

Performance of polygenic risk scores for cancer prediction in a racially diverse academic biobank: Methods part 2

Code that manipulates exome data

Code that manipulates phenotypic data:

Code that manipulates genetic data

Phenotyping of PMBB participants

SNV (Single Nucleotide Variation) selection

PRS generation and statistical analysis

Code for generation of PRSs

Share on

You May Also Enjoy

Genetic Risk for Overall Cancer and the Benefit of Adherence to a Healthy Lifestyle: A review

Genetic risk, incident gastric cancer, and healthy lifestyle: a meta-analysis of genome-wide association studies and prospective cohort study: A review

On cross-ancestry cancer polygenic risk scores: A review

Reviewing PRS literature Part 1