Creating a PRS based on a list of SNPs and the imputed PMBB dataset

13 minute read
Published: May 08, 2022
This post shows the script I wrote to convert LD pruned lists of SNPs, apply them to the Penn Medicine Biobank on a specific population, and use plink -score to create Polygenic Risk scores. Below I have pasted the .sh script I wrote today to create PRS.
The input of the script is several components:
Arg #1: A list of LD pruned SNPs from the LD_pruning.sh script
- This is a single column text file in which each line contains “chr_pos”
- E.g. output.prune.in
Arg #2: A table of SNPs and their associated GWAS data. This was created in previous posting
Each list of SNPs will be specific to the SNP list that was input into the LD_pruning script
Each table contains many many columns with all the associated study data
Importantly, there should be the following columns, which will also be added as
- Arg #3: column # for the SNP Chromosome #
- Arg #4: column # for the SNP base pair position
- Arg #5: column # for chr_pos (variant ID)
- Arg #6: column # for strongest SNP allele
- Arg #7: column # for the odds ratio
Arg #8: PMBB ID file to keep. E.g. this is a curated .txt file of European PMBB ID numbers
Arg #9: PMBB ID relateds file to remove
Arg #10: The output .profile file name
#!/bin/bash
# Takes as input:
# $1 LD pruned SNP list, created using ld_pruning.sh
# $2 a file master list with a potentially larger list of SNPs and associated study data
# $3 column # for SNP Chr
# $4 column # for SNP Pos
# $5 column $ for chr_pos
# $6 column # for "STRONGEST SNP-RISK ALLELE"
# $7 column # for odds ratio
# $8 PMBB ID file to keep
# $9 PMBB ID relateds file to remove
# $10 Output name

# Example script call:
# bsub -e build_profile.e -o build_profile.o ../OC_build_profile.sh input.prune.in ../master_snplist_lung_cancer_all.txt 12 13 14 23 32 PMBBID.txt relateds.txt PMBB_subset_PRS
# ../OC_build_profile.sh snplist_oc_lung_all.prune.in ../master_snplist_lung_cancer_all.txt 12 13 14 22 32 ../../phenotype/PMBB_IDs_EUR_ancestry.txt ../../phenotype/PMBB_IDs_relateds.txt European_PRS


# Ensure input SNP file is in order
# Create a new file chr_pos_temp.txt for $1 which has column #1 as chr and column #2 as position
sort -o $1 $1
tr '_' $'\t' < $1 > chr_pos_temp.txt

# Add "chr" before each chromosome number to match Imputed VCF formatting
sed -e 's/^/chr/' chr_pos_temp.txt > chrchr_pos_temp.txt

# Load modules vcftools and plink 1.9
module load vcftools
module load plink/1.9-20210416

# Subset all imputed .vcf files using our pruned input list using vcftools

# Because of concern for long run times, I actually submitted this manually and added bsub... to the beginning

# bsub -e chr1.e -o chr1.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr1.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr1_temp
# bsub -e chr2.e -o chr2.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr2.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr2_temp
# bsub -e chr3.e -o chr3.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr3.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr3_temp
# bsub -e chr4.e -o chr4.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr4.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr4_temp
# bsub -e chr5.e -o chr5.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr5.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr5_temp
# bsub -e chr6.e -o chr6.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr6.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr6_temp
# bsub -e chr7.e -o chr7.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr7.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr7_temp
# bsub -e chr8.e -o chr8.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr8.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr8_temp
# bsub -e chr9.e -o chr9.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr9.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr9_temp
# bsub -e chr10.e -o chr10.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr10.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr10_temp
# bsub -e chr11.e -o chr11.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr11.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr11_temp
# bsub -e chr12.e -o chr12.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr12.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr12_temp
# bsub -e chr13.e -o chr13.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr13.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr13_temp
# bsub -e chr14.e -o chr14.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr14.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr14_temp
# bsub -e chr15.e -o chr15.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr15.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr15_temp
# bsub -e chr16.e -o chr16.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr16.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr16_temp
# bsub -e chr17.e -o chr17.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr17.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr17_temp
# bsub -e chr18.e -o chr18.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr18.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr18_temp
# bsub -e chr19.e -o chr19.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr19.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr19_temp
# bsub -e chr20.e -o chr20.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr20.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr20_temp
# bsub -e chr21.e -o chr21.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr21.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr21_temp
# bsub -e chr22.e -o chr22.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr22.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr22_temp
# bsub -e chrX.e -o chrX.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chrX.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chrX_temp

# Fix notation of chr#
# E.g. first column currently shown as "chr1", want to change each to "1"
sed -i 's/chr//' chr1_temp.recode.vcf
sed -i 's/chr//' chr2_temp.recode.vcf
sed -i 's/chr//' chr3_temp.recode.vcf
sed -i 's/chr//' chr4_temp.recode.vcf
sed -i 's/chr//' chr5_temp.recode.vcf
sed -i 's/chr//' chr6_temp.recode.vcf
sed -i 's/chr//' chr7_temp.recode.vcf
sed -i 's/chr//' chr8_temp.recode.vcf
sed -i 's/chr//' chr9_temp.recode.vcf
sed -i 's/chr//' chr10_temp.recode.vcf
sed -i 's/chr//' chr11_temp.recode.vcf
sed -i 's/chr//' chr12_temp.recode.vcf
sed -i 's/chr//' chr13_temp.recode.vcf
sed -i 's/chr//' chr14_temp.recode.vcf
sed -i 's/chr//' chr15_temp.recode.vcf
sed -i 's/chr//' chr16_temp.recode.vcf
sed -i 's/chr//' chr17_temp.recode.vcf
sed -i 's/chr//' chr18_temp.recode.vcf
sed -i 's/chr//' chr19_temp.recode.vcf
sed -i 's/chr//' chr20_temp.recode.vcf
sed -i 's/chr//' chr21_temp.recode.vcf
sed -i 's/chr//' chr22_temp.recode.vcf
sed -i 's/chr//' chrX_temp.recode.vcf

# Convert .vcf files to .bed using Plink
plink --vcf chr1_temp.recode.vcf --make-bed --recode --out chr1_temp2
plink --vcf chr2_temp.recode.vcf --make-bed --recode --out chr2_temp2
plink --vcf chr3_temp.recode.vcf --make-bed --recode --out chr3_temp2
plink --vcf chr4_temp.recode.vcf --make-bed --recode --out chr4_temp2
plink --vcf chr5_temp.recode.vcf --make-bed --recode --out chr5_temp2
plink --vcf chr6_temp.recode.vcf --make-bed --recode --out chr6_temp2
plink --vcf chr7_temp.recode.vcf --make-bed --recode --out chr7_temp2
plink --vcf chr8_temp.recode.vcf --make-bed --recode --out chr8_temp2
plink --vcf chr9_temp.recode.vcf --make-bed --recode --out chr9_temp2
plink --vcf chr10_temp.recode.vcf --make-bed --recode --out chr10_temp2
plink --vcf chr11_temp.recode.vcf --make-bed --recode --out chr11_temp2
plink --vcf chr12_temp.recode.vcf --make-bed --recode --out chr12_temp2
plink --vcf chr13_temp.recode.vcf --make-bed --recode --out chr13_temp2
plink --vcf chr14_temp.recode.vcf --make-bed --recode --out chr14_temp2
plink --vcf chr15_temp.recode.vcf --make-bed --recode --out chr15_temp2
plink --vcf chr16_temp.recode.vcf --make-bed --recode --out chr16_temp2
plink --vcf chr17_temp.recode.vcf --make-bed --recode --out chr17_temp2
plink --vcf chr18_temp.recode.vcf --make-bed --recode --out chr18_temp2
plink --vcf chr19_temp.recode.vcf --make-bed --recode --out chr19_temp2
plink --vcf chr20_temp.recode.vcf --make-bed --recode --out chr20_temp2
plink --vcf chr21_temp.recode.vcf --make-bed --recode --out chr21_temp2
plink --vcf chr22_temp.recode.vcf --make-bed --recode --out chr22_temp2
plink --vcf chrX_temp.recode.vcf --make-bed --recode --out chrX_temp2

# Fix notation of variant IDs:
# E.g. third column showing chr1:pos:ref:alt, wnat to change to 1_pos_ref_alt
module load plink/2.0
plink2 --bfile chr1_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr1_temp3
plink2 --bfile chr2_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr2_temp3
plink2 --bfile chr3_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr3_temp3
plink2 --bfile chr4_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr4_temp3
plink2 --bfile chr5_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr5_temp3
plink2 --bfile chr6_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr6_temp3
plink2 --bfile chr7_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr7_temp3
plink2 --bfile chr8_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr8_temp3
plink2 --bfile chr9_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr9_temp3
plink2 --bfile chr10_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr10_temp3
plink2 --bfile chr11_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr11_temp3
plink2 --bfile chr12_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr12_temp3
plink2 --bfile chr13_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr13_temp3
plink2 --bfile chr14_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr14_temp3
plink2 --bfile chr15_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr15_temp3
plink2 --bfile chr16_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr16_temp3
plink2 --bfile chr17_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr17_temp3
plink2 --bfile chr18_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr18_temp3
plink2 --bfile chr19_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr19_temp3
plink2 --bfile chr20_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr20_temp3
plink2 --bfile chr21_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr21_temp3
plink2 --bfile chr22_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr22_temp3
plink2 --bfile chrX_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chrX_temp3
module unload plink/2.0
rm *temp2*

# Create list of bed files in the directory, save as merge_list_temp.txt
ls *.bed >  merge_list_temp.txt

# Remove .bed from ending of each line in file to give file prefixes
sed -i 's/....$//' merge_list_temp.txt

# Merge all .bed files into a single file
plink --merge-list merge_list_temp.txt --allow-no-sex --make-bed --out merged_temp
rm *temp3*

# Need to reformat PMBB IDs into file with 2 columns
awk '{print $1 " " $1}' $8 > keep_temp.txt
awk '{print $1 " " $1}' $9 > remove_temp.txt

# Keep PMBB IDs of interest, remove relateds, and create file merged_subset_temp
plink --bfile merged_temp --keep keep_temp.txt --remove remove_temp.txt --make-bed --out merged_subset_temp
rm merged_temp*

# Create a file with columns representing our SNPs, rsids, and odds ratios

# First, get the Risk Allele value from the STRONGEST SNP-RISK ALLELE column
# We use awk to get the correct column, then sed to get the characters after the last "-"
# Then save this column as file "temp_risk_alleles.txt"
awk -F '\t' -v riskallele="$6" '{print $riskallele}' $2 | sed 's/^.*-\([^-]*\)$/\1/' > temp_risk_alleles.txt

# Now use awk to get the columns for chr, pos, rsid, and odds ratio
# Save them to snplist_temp1.txt
awk -F '\t' -v FS='\t' -v OFS='\t' -v chr="$3" -v pos="$4" -v chr_pos="$5" -v oddsratio="$7" '{print $chr, $pos, $chr_pos, $oddsratio}' $2 > snplist_temp1.txt

# Combine the above to text files
# We now have a tab-delimited file called snplist_temp2.txt which contains columns for chr, pos, rsid, odds ratio, and risk allele
paste -d "\t" snplist_temp1.txt temp_risk_alleles.txt > snplist_temp2.txt

# Sort snplist_temp2.txt
sort -k3 -o snplist_temp2.txt snplist_temp2.txt

# Get a list of the common SNPs within our LD_pruning file and master SNPs file
# Save common list as snplist_temp3.txt
comm -12 <(awk '{ print $3 }' snplist_temp2.txt) <(awk '{ print $1 }' $1) > snplist_temp3.txt

# Intersect our common list (common_snps_temp.txt) with the master list of SNPs (temp_snps.txt)
# Save the final subsetted SNP file as snplist_temp4.txt
awk 'FNR==NR {a[$1]; next}; $3 in a' snplist_temp3.txt snplist_temp2.txt > snplist_temp4.txt

# Below, we match columns according to chr, pos, and "risk allele" matching with "alt"
# Find matching columns to get the "ref" allele, then create a temp file with all the info
# The file will only contain matches for "chr" "pos" "ref" and "alt" in both files
# Columns are
	# 1 - chr
	# 2 - pos
	# 3 - ref
	# 4 - alt
	# 5 - chr_pos_ref_alt
awk  'FNR==NR{a[$1,$2,$5]; next} ($1,$4,$5) in a{print $1 "\t" $4 "\t" $6 "\t" $5 "\t" $1 "_" $4 "\t" $1 "_" $4 "_" $6 "_" $5 }' snplist_temp4.txt merged_subset_temp.bim > snplist_temp5.txt

# At this point, approximately 60% of SNPs will have a risk allele that matches with a "alt" allele
# However, ~30% of SNPs are reversed and need to have a "risk allele" matched with the "ref" allele instead

# Below I take match the "risk allele" with the "ref" allele
# Find matching columns to get the "ref" allele, then create a temp file with all the info
# The file will only contain matches for "chr" "pos" "ref" and "alt" in both files
# Columns are
        # 1 - chr
        # 2 - pos
        # 3 - ref
        # 4 - alt
        # 5 - chr_pos_ref_alt
awk  'FNR==NR{a[$1,$2,$5]; next} ($1,$4,$6) in a{print $1 "\t" $4 "\t" $6 "\t" $5 "\t" $1 "_" $4 "\t" $1 "_" $4 "_" $6 "_" $5 }' snplist_temp4.txt merged_subset_temp.bim > snplist_temp6.txt

# Sort before matching
sort -o snplist_temp4.txt snplist_temp4.txt
sort -o snplist_temp5.txt snplist_temp5.txt
sort -o snplist_temp6.txt snplist_temp6.txt

### For matching non-inverted SNPs ###

# After sorting, get the matching columns in the temp4 and temp5 file, in the correct order. This outputs file in the format chr_pos
# Non-inverted SNPs in temp7 file
comm -12 <(awk '{ print $1 "_" $2 }' snplist_temp4.txt) <(awk '{ print $1 "_" $2 }' snplist_temp5.txt) > snplist_temp7.txt

# Slightly confusing - match up files by certain columns
# Then paste two files together
awk 'FNR==NR {a[$1]; next}; $3 in a' snplist_temp7.txt snplist_temp4.txt > snplist_temp8.txt
paste snplist_temp8.txt snplist_temp5.txt > snplist_temp9.txt

### For matching inverted SNPS ###
# After sorting, get the matching columns in the temp4 and temp6 file, in the correct order. This outputs file in the format chr_pos
# Inverted SNPs in temp10
comm -12 <(awk '{ print $1 "_" $2 }' snplist_temp4.txt) <(awk '{ print $1 "_" $2 }' snplist_temp6.txt) > snplist_temp10.txt

# Again - match up files by certain columns
# Then paste two files together
awk 'FNR==NR {a[$1]; next}; $3 in a' snplist_temp10.txt snplist_temp4.txt > snplist_temp11.txt
paste snplist_temp11.txt snplist_temp6.txt > snplist_temp12.txt

# Merge temp9 and temp13
cat snplist_temp9.txt snplist_temp12.txt > snplist_temp13.txt

# Use temp_snps file of SNPs with odds ratio data and merged_subset_temp file of genotype information to create PRS
plink --bfile merged_subset_temp --score snplist_temp13.txt 11 5 4 header --out ${10}

# Remove temp files
rm merged_subset_temp*
rm chr_pos_temp.txt
rm chrchr_pos_temp.txt
rm merge_list_temp.txt
rm keep_temp.txt
rm remove_temp.txt
rm temp_risk_alleles.txt
rm snplist_temp1.txt
rm snplist_temp2.txt
rm snplist_temp3.txt
rm snplist_temp4.txt
rm snplist_temp5.txt
rm snplist_temp6.txt
rm snplist_temp7.txt
rm snplist_temp8.txt
rm snplist_temp9.txt
rm snplist_temp10.txt
rm snplist_temp11.txt
rm snplist_temp12.txt
Share on

Twitter Facebook LinkedIn
Oliver Clark

Creating a PRS based on a list of SNPs and the imputed PMBB dataset

Share on

You May Also Enjoy

Genetic Risk for Overall Cancer and the Benefit of Adherence to a Healthy Lifestyle: A review

Genetic risk, incident gastric cancer, and healthy lifestyle: a meta-analysis of genome-wide association studies and prospective cohort study: A review

On cross-ancestry cancer polygenic risk scores: A review

Reviewing PRS literature Part 1