Creating a PRS based on a list of SNPs and the imputed PMBB dataset

13 minute read

Published:

This post shows the script I wrote to convert LD pruned lists of SNPs, apply them to the Penn Medicine Biobank on a specific population, and use plink -score to create Polygenic Risk scores. Below I have pasted the .sh script I wrote today to create PRS.

The input of the script is several components:

  • Arg #1: A list of LD pruned SNPs from the LD_pruning.sh script
    • This is a single column text file in which each line contains “chr_pos”
    • E.g. output.prune.in
  • Arg #2: A table of SNPs and their associated GWAS data. This was created in previous posting
  • Each list of SNPs will be specific to the SNP list that was input into the LD_pruning script
  • Each table contains many many columns with all the associated study data
  • Importantly, there should be the following columns, which will also be added as
    • Arg #3: column # for the SNP Chromosome #
    • Arg #4: column # for the SNP base pair position
    • Arg #5: column # for chr_pos (variant ID)
    • Arg #6: column # for strongest SNP allele
    • Arg #7: column # for the odds ratio
  • Arg #8: PMBB ID file to keep. E.g. this is a curated .txt file of European PMBB ID numbers
  • Arg #9: PMBB ID relateds file to remove
  • Arg #10: The output .profile file name
#!/bin/bash
# Takes as input:
# $1 LD pruned SNP list, created using ld_pruning.sh
# $2 a file master list with a potentially larger list of SNPs and associated study data
# $3 column # for SNP Chr
# $4 column # for SNP Pos
# $5 column $ for chr_pos
# $6 column # for "STRONGEST SNP-RISK ALLELE"
# $7 column # for odds ratio
# $8 PMBB ID file to keep
# $9 PMBB ID relateds file to remove
# $10 Output name

# Example script call:
# bsub -e build_profile.e -o build_profile.o ../OC_build_profile.sh input.prune.in ../master_snplist_lung_cancer_all.txt 12 13 14 23 32 PMBBID.txt relateds.txt PMBB_subset_PRS
# ../OC_build_profile.sh snplist_oc_lung_all.prune.in ../master_snplist_lung_cancer_all.txt 12 13 14 22 32 ../../phenotype/PMBB_IDs_EUR_ancestry.txt ../../phenotype/PMBB_IDs_relateds.txt European_PRS


# Ensure input SNP file is in order
# Create a new file chr_pos_temp.txt for $1 which has column #1 as chr and column #2 as position
sort -o $1 $1
tr '_' $'\t' < $1 > chr_pos_temp.txt

# Add "chr" before each chromosome number to match Imputed VCF formatting
sed -e 's/^/chr/' chr_pos_temp.txt > chrchr_pos_temp.txt

# Load modules vcftools and plink 1.9
module load vcftools
module load plink/1.9-20210416

# Subset all imputed .vcf files using our pruned input list using vcftools

# Because of concern for long run times, I actually submitted this manually and added bsub... to the beginning

# bsub -e chr1.e -o chr1.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr1.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr1_temp
# bsub -e chr2.e -o chr2.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr2.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr2_temp
# bsub -e chr3.e -o chr3.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr3.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr3_temp
# bsub -e chr4.e -o chr4.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr4.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr4_temp
# bsub -e chr5.e -o chr5.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr5.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr5_temp
# bsub -e chr6.e -o chr6.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr6.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr6_temp
# bsub -e chr7.e -o chr7.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr7.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr7_temp
# bsub -e chr8.e -o chr8.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr8.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr8_temp
# bsub -e chr9.e -o chr9.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr9.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr9_temp
# bsub -e chr10.e -o chr10.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr10.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr10_temp
# bsub -e chr11.e -o chr11.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr11.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr11_temp
# bsub -e chr12.e -o chr12.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr12.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr12_temp
# bsub -e chr13.e -o chr13.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr13.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr13_temp
# bsub -e chr14.e -o chr14.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr14.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr14_temp
# bsub -e chr15.e -o chr15.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr15.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr15_temp
# bsub -e chr16.e -o chr16.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr16.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr16_temp
# bsub -e chr17.e -o chr17.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr17.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr17_temp
# bsub -e chr18.e -o chr18.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr18.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr18_temp
# bsub -e chr19.e -o chr19.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr19.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr19_temp
# bsub -e chr20.e -o chr20.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr20.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr20_temp
# bsub -e chr21.e -o chr21.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr21.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr21_temp
# bsub -e chr22.e -o chr22.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chr22.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chr22_temp
# bsub -e chrX.e -o chrX.o vcftools --gzvcf /project/PMBB/PMBB-Release-2020-2.0/Imputed/vcf/PMBB-Release-2020-2.0_genetic_imputed-topmed-r2_chrX.vcf.gz --positions chrchr_pos_temp.txt --recode --recode-INFO-all --out chrX_temp

# Fix notation of chr#
# E.g. first column currently shown as "chr1", want to change each to "1"
sed -i 's/chr//' chr1_temp.recode.vcf
sed -i 's/chr//' chr2_temp.recode.vcf
sed -i 's/chr//' chr3_temp.recode.vcf
sed -i 's/chr//' chr4_temp.recode.vcf
sed -i 's/chr//' chr5_temp.recode.vcf
sed -i 's/chr//' chr6_temp.recode.vcf
sed -i 's/chr//' chr7_temp.recode.vcf
sed -i 's/chr//' chr8_temp.recode.vcf
sed -i 's/chr//' chr9_temp.recode.vcf
sed -i 's/chr//' chr10_temp.recode.vcf
sed -i 's/chr//' chr11_temp.recode.vcf
sed -i 's/chr//' chr12_temp.recode.vcf
sed -i 's/chr//' chr13_temp.recode.vcf
sed -i 's/chr//' chr14_temp.recode.vcf
sed -i 's/chr//' chr15_temp.recode.vcf
sed -i 's/chr//' chr16_temp.recode.vcf
sed -i 's/chr//' chr17_temp.recode.vcf
sed -i 's/chr//' chr18_temp.recode.vcf
sed -i 's/chr//' chr19_temp.recode.vcf
sed -i 's/chr//' chr20_temp.recode.vcf
sed -i 's/chr//' chr21_temp.recode.vcf
sed -i 's/chr//' chr22_temp.recode.vcf
sed -i 's/chr//' chrX_temp.recode.vcf

# Convert .vcf files to .bed using Plink
plink --vcf chr1_temp.recode.vcf --make-bed --recode --out chr1_temp2
plink --vcf chr2_temp.recode.vcf --make-bed --recode --out chr2_temp2
plink --vcf chr3_temp.recode.vcf --make-bed --recode --out chr3_temp2
plink --vcf chr4_temp.recode.vcf --make-bed --recode --out chr4_temp2
plink --vcf chr5_temp.recode.vcf --make-bed --recode --out chr5_temp2
plink --vcf chr6_temp.recode.vcf --make-bed --recode --out chr6_temp2
plink --vcf chr7_temp.recode.vcf --make-bed --recode --out chr7_temp2
plink --vcf chr8_temp.recode.vcf --make-bed --recode --out chr8_temp2
plink --vcf chr9_temp.recode.vcf --make-bed --recode --out chr9_temp2
plink --vcf chr10_temp.recode.vcf --make-bed --recode --out chr10_temp2
plink --vcf chr11_temp.recode.vcf --make-bed --recode --out chr11_temp2
plink --vcf chr12_temp.recode.vcf --make-bed --recode --out chr12_temp2
plink --vcf chr13_temp.recode.vcf --make-bed --recode --out chr13_temp2
plink --vcf chr14_temp.recode.vcf --make-bed --recode --out chr14_temp2
plink --vcf chr15_temp.recode.vcf --make-bed --recode --out chr15_temp2
plink --vcf chr16_temp.recode.vcf --make-bed --recode --out chr16_temp2
plink --vcf chr17_temp.recode.vcf --make-bed --recode --out chr17_temp2
plink --vcf chr18_temp.recode.vcf --make-bed --recode --out chr18_temp2
plink --vcf chr19_temp.recode.vcf --make-bed --recode --out chr19_temp2
plink --vcf chr20_temp.recode.vcf --make-bed --recode --out chr20_temp2
plink --vcf chr21_temp.recode.vcf --make-bed --recode --out chr21_temp2
plink --vcf chr22_temp.recode.vcf --make-bed --recode --out chr22_temp2
plink --vcf chrX_temp.recode.vcf --make-bed --recode --out chrX_temp2

# Fix notation of variant IDs:
# E.g. third column showing chr1:pos:ref:alt, wnat to change to 1_pos_ref_alt
module load plink/2.0
plink2 --bfile chr1_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr1_temp3
plink2 --bfile chr2_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr2_temp3
plink2 --bfile chr3_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr3_temp3
plink2 --bfile chr4_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr4_temp3
plink2 --bfile chr5_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr5_temp3
plink2 --bfile chr6_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr6_temp3
plink2 --bfile chr7_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr7_temp3
plink2 --bfile chr8_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr8_temp3
plink2 --bfile chr9_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr9_temp3
plink2 --bfile chr10_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr10_temp3
plink2 --bfile chr11_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr11_temp3
plink2 --bfile chr12_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr12_temp3
plink2 --bfile chr13_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr13_temp3
plink2 --bfile chr14_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr14_temp3
plink2 --bfile chr15_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr15_temp3
plink2 --bfile chr16_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr16_temp3
plink2 --bfile chr17_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr17_temp3
plink2 --bfile chr18_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr18_temp3
plink2 --bfile chr19_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr19_temp3
plink2 --bfile chr20_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr20_temp3
plink2 --bfile chr21_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr21_temp3
plink2 --bfile chr22_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chr22_temp3
plink2 --bfile chrX_temp2 --set-all-var-ids @_#_\$r_\$a --make-bed -out chrX_temp3
module unload plink/2.0
rm *temp2*

# Create list of bed files in the directory, save as merge_list_temp.txt
ls *.bed >  merge_list_temp.txt

# Remove .bed from ending of each line in file to give file prefixes
sed -i 's/....$//' merge_list_temp.txt

# Merge all .bed files into a single file
plink --merge-list merge_list_temp.txt --allow-no-sex --make-bed --out merged_temp
rm *temp3*

# Need to reformat PMBB IDs into file with 2 columns
awk '{print $1 " " $1}' $8 > keep_temp.txt
awk '{print $1 " " $1}' $9 > remove_temp.txt

# Keep PMBB IDs of interest, remove relateds, and create file merged_subset_temp
plink --bfile merged_temp --keep keep_temp.txt --remove remove_temp.txt --make-bed --out merged_subset_temp
rm merged_temp*

# Create a file with columns representing our SNPs, rsids, and odds ratios

# First, get the Risk Allele value from the STRONGEST SNP-RISK ALLELE column
# We use awk to get the correct column, then sed to get the characters after the last "-"
# Then save this column as file "temp_risk_alleles.txt"
awk -F '\t' -v riskallele="$6" '{print $riskallele}' $2 | sed 's/^.*-\([^-]*\)$/\1/' > temp_risk_alleles.txt

# Now use awk to get the columns for chr, pos, rsid, and odds ratio
# Save them to snplist_temp1.txt
awk -F '\t' -v FS='\t' -v OFS='\t' -v chr="$3" -v pos="$4" -v chr_pos="$5" -v oddsratio="$7" '{print $chr, $pos, $chr_pos, $oddsratio}' $2 > snplist_temp1.txt

# Combine the above to text files
# We now have a tab-delimited file called snplist_temp2.txt which contains columns for chr, pos, rsid, odds ratio, and risk allele
paste -d "\t" snplist_temp1.txt temp_risk_alleles.txt > snplist_temp2.txt

# Sort snplist_temp2.txt
sort -k3 -o snplist_temp2.txt snplist_temp2.txt

# Get a list of the common SNPs within our LD_pruning file and master SNPs file
# Save common list as snplist_temp3.txt
comm -12 <(awk '{ print $3 }' snplist_temp2.txt) <(awk '{ print $1 }' $1) > snplist_temp3.txt

# Intersect our common list (common_snps_temp.txt) with the master list of SNPs (temp_snps.txt)
# Save the final subsetted SNP file as snplist_temp4.txt
awk 'FNR==NR {a[$1]; next}; $3 in a' snplist_temp3.txt snplist_temp2.txt > snplist_temp4.txt

# Below, we match columns according to chr, pos, and "risk allele" matching with "alt"
# Find matching columns to get the "ref" allele, then create a temp file with all the info
# The file will only contain matches for "chr" "pos" "ref" and "alt" in both files
# Columns are
	# 1 - chr
	# 2 - pos
	# 3 - ref
	# 4 - alt
	# 5 - chr_pos_ref_alt
awk  'FNR==NR{a[$1,$2,$5]; next} ($1,$4,$5) in a{print $1 "\t" $4 "\t" $6 "\t" $5 "\t" $1 "_" $4 "\t" $1 "_" $4 "_" $6 "_" $5 }' snplist_temp4.txt merged_subset_temp.bim > snplist_temp5.txt

# At this point, approximately 60% of SNPs will have a risk allele that matches with a "alt" allele
# However, ~30% of SNPs are reversed and need to have a "risk allele" matched with the "ref" allele instead

# Below I take match the "risk allele" with the "ref" allele
# Find matching columns to get the "ref" allele, then create a temp file with all the info
# The file will only contain matches for "chr" "pos" "ref" and "alt" in both files
# Columns are
        # 1 - chr
        # 2 - pos
        # 3 - ref
        # 4 - alt
        # 5 - chr_pos_ref_alt
awk  'FNR==NR{a[$1,$2,$5]; next} ($1,$4,$6) in a{print $1 "\t" $4 "\t" $6 "\t" $5 "\t" $1 "_" $4 "\t" $1 "_" $4 "_" $6 "_" $5 }' snplist_temp4.txt merged_subset_temp.bim > snplist_temp6.txt

# Sort before matching
sort -o snplist_temp4.txt snplist_temp4.txt
sort -o snplist_temp5.txt snplist_temp5.txt
sort -o snplist_temp6.txt snplist_temp6.txt

### For matching non-inverted SNPs ###

# After sorting, get the matching columns in the temp4 and temp5 file, in the correct order. This outputs file in the format chr_pos
# Non-inverted SNPs in temp7 file
comm -12 <(awk '{ print $1 "_" $2 }' snplist_temp4.txt) <(awk '{ print $1 "_" $2 }' snplist_temp5.txt) > snplist_temp7.txt

# Slightly confusing - match up files by certain columns
# Then paste two files together
awk 'FNR==NR {a[$1]; next}; $3 in a' snplist_temp7.txt snplist_temp4.txt > snplist_temp8.txt
paste snplist_temp8.txt snplist_temp5.txt > snplist_temp9.txt

### For matching inverted SNPS ###
# After sorting, get the matching columns in the temp4 and temp6 file, in the correct order. This outputs file in the format chr_pos
# Inverted SNPs in temp10
comm -12 <(awk '{ print $1 "_" $2 }' snplist_temp4.txt) <(awk '{ print $1 "_" $2 }' snplist_temp6.txt) > snplist_temp10.txt

# Again - match up files by certain columns
# Then paste two files together
awk 'FNR==NR {a[$1]; next}; $3 in a' snplist_temp10.txt snplist_temp4.txt > snplist_temp11.txt
paste snplist_temp11.txt snplist_temp6.txt > snplist_temp12.txt

# Merge temp9 and temp13
cat snplist_temp9.txt snplist_temp12.txt > snplist_temp13.txt

# Use temp_snps file of SNPs with odds ratio data and merged_subset_temp file of genotype information to create PRS
plink --bfile merged_subset_temp --score snplist_temp13.txt 11 5 4 header --out ${10}

# Remove temp files
rm merged_subset_temp*
rm chr_pos_temp.txt
rm chrchr_pos_temp.txt
rm merge_list_temp.txt
rm keep_temp.txt
rm remove_temp.txt
rm temp_risk_alleles.txt
rm snplist_temp1.txt
rm snplist_temp2.txt
rm snplist_temp3.txt
rm snplist_temp4.txt
rm snplist_temp5.txt
rm snplist_temp6.txt
rm snplist_temp7.txt
rm snplist_temp8.txt
rm snplist_temp9.txt
rm snplist_temp10.txt
rm snplist_temp11.txt
rm snplist_temp12.txt