Linkage Disequilibrium Pruning on the 1000 genomes dataset
Published:
In this blog post, I’ll outline the script I made to conduct Linkage Disequilibrium Pruning with a given list of SNPs on the 1000 genomes dataset.
SNPs
In the methods, I’ll be using 2 different set of significantly associated SNPs. The first is a file given to me by Heena. The second was compiled by me, which is the “SNPS_susceptibility_lung_cancer_all” tab that I created in previous blog posting.
To be honest, I don’t know the nuances in the creation of Heena’s SNP file.
I’ll compare them below:
Heena’s SNPs
- More (~300)
- Less specific - contains SNPs associated with susceptibility to lung cancer, subtypes of lung cancer, and may contain SNPs associated with lung cancer prognosis
- After LD pruning below, narrowed down to ~220 SNPs
My SNPs
- Fewer (~170)
- More specific - contains only SNPs associated with overall lung cancer. SNPs associated with “squamous cell carcinoma of lung” or other subtypes were not included
- After LD pruning, contains about ~70 SNPs
Performing Linkage disequilibrium pruning on the 1000 genomes dataset
I wrote the following .sh script for LD pruning. It’s located in the dir /projects/kmaxwell/PMBB/OClark_PRS/ld_pruning.sh.
#!/bin/bash
# Args: $1 is input SNPs. $2 is output filename
# Recommended format of submitting job:
# bsub -e lung_adeno_pruning.e -o lung_adeno_pruning.o ../OC_LD_pruning.sh chr_pos.txt snplist_lung_adeno
# This is a script that accepts a list of SNPs and runs them through LD pruning with the 1000 genomes dataset
# The input is in the form of a tab separated text file in which column $1 is chromosome number and column $2 is chromosome position
# The output is in the form of two tab separated text file containing
# Load modules vcftools and plink 1.9
module load vcftools
module load plink/1.9-20210416
# Subset the .vcf files in the 1000 genomes dataset with vcftools
# If you're working with a different dataset, may need to change 1000 genomes file dir
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr1.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr1_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr2.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr2_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr3.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr3_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr4.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr4_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr5.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr5_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr6.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr6_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr7.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr7_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr8.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr8_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr9.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr9_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr10.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr10_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr11.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr11_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr12.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr12_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr13.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr13_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr14.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr14_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr15.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr15_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr16.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr16_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr17.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr17_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr18.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr18_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr19.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr19_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr20.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr20_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr21.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr21_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chr22_temp
vcftools --gzvcf /project/kmaxwell/resources/1000_Genomes_hg38/ALL.chrX.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz --positions $1 --recode --recode-INFO-all --out chrX_temp
# Use plink to convert all files to .bed format
plink --noweb --vcf chr1_temp.recode.vcf --make-bed --recode --out chr1_temp2
plink --noweb --vcf chr2_temp.recode.vcf --make-bed --recode --out chr2_temp2
plink --noweb --vcf chr3_temp.recode.vcf --make-bed --recode --out chr3_temp2
plink --noweb --vcf chr4_temp.recode.vcf --make-bed --recode --out chr4_temp2
plink --noweb --vcf chr5_temp.recode.vcf --make-bed --recode --out chr5_temp2
plink --noweb --vcf chr6_temp.recode.vcf --make-bed --recode --out chr6_temp2
plink --noweb --vcf chr7_temp.recode.vcf --make-bed --recode --out chr7_temp2
plink --noweb --vcf chr8_temp.recode.vcf --make-bed --recode --out chr8_temp2
plink --noweb --vcf chr9_temp.recode.vcf --make-bed --recode --out chr9_temp2
plink --noweb --vcf chr10_temp.recode.vcf --make-bed --recode --out chr10_temp2
plink --noweb --vcf chr11_temp.recode.vcf --make-bed --recode --out chr11_temp2
plink --noweb --vcf chr12_temp.recode.vcf --make-bed --recode --out chr12_temp2
plink --noweb --vcf chr13_temp.recode.vcf --make-bed --recode --out chr13_temp2
plink --noweb --vcf chr14_temp.recode.vcf --make-bed --recode --out chr14_temp2
plink --noweb --vcf chr15_temp.recode.vcf --make-bed --recode --out chr15_temp2
plink --noweb --vcf chr16_temp.recode.vcf --make-bed --recode --out chr16_temp2
plink --noweb --vcf chr17_temp.recode.vcf --make-bed --recode --out chr17_temp2
plink --noweb --vcf chr18_temp.recode.vcf --make-bed --recode --out chr18_temp2
plink --noweb --vcf chr19_temp.recode.vcf --make-bed --recode --out chr19_temp2
plink --noweb --vcf chr20_temp.recode.vcf --make-bed --recode --out chr20_temp2
plink --noweb --vcf chr21_temp.recode.vcf --make-bed --recode --out chr21_temp2
plink --noweb --vcf chr22_temp.recode.vcf --make-bed --recode --out chr22_temp2
plink --noweb --vcf chrX_temp.recode.vcf --make-bed --recode --out chrX_temp2
rm *.recode.vcf
rm *.log
# Set the variant IDs that are missing in the 1000 genomes dataset
plink --noweb --bfile chr1_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr1_temp3
plink --noweb --bfile chr2_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr2_temp3
plink --noweb --bfile chr3_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr3_temp3
plink --noweb --bfile chr4_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr4_temp3
plink --noweb --bfile chr5_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr5_temp3
plink --noweb --bfile chr6_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr6_temp3
plink --noweb --bfile chr7_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr7_temp3
plink --noweb --bfile chr8_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr8_temp3
plink --noweb --bfile chr9_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr9_temp3
plink --noweb --bfile chr10_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr10_temp3
plink --noweb --bfile chr11_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr11_temp3
plink --noweb --bfile chr12_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr12_temp3
plink --noweb --bfile chr13_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr13_temp3
plink --noweb --bfile chr14_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr14_temp3
plink --noweb --bfile chr15_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr15_temp3
plink --noweb --bfile chr16_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr16_temp3
plink --noweb --bfile chr17_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr17_temp3
plink --noweb --bfile chr18_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr18_temp3
plink --noweb --bfile chr19_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr19_temp3
plink --noweb --bfile chr20_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr20_temp3
plink --noweb --bfile chr21_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr21_temp3
plink --noweb --bfile chr22_temp2 --set-missing-var-ids @_# --recode --make-bed --out chr22_temp3
plink --noweb --bfile chrX_temp2 --set-missing-var-ids @_# --recode --make-bed --out chrX_temp3
rm *temp2*
# Rephrase $1 input file into a text file with 1 column, containing the text chr_pos
awk '{print $1 "_" $2}' $1 > chr_pos_temp.txt
# Plink ld pruning with the following factors:
# LD window 50
# LD step 5
# r2 <0.1
plink --noweb --bfile chr1_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr1_temp4
plink --noweb --bfile chr2_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr2_temp4
plink --noweb --bfile chr3_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr3_temp4
plink --noweb --bfile chr4_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr4_temp4
plink --noweb --bfile chr5_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr5_temp4
plink --noweb --bfile chr6_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr6_temp4
plink --noweb --bfile chr7_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr7_temp4
plink --noweb --bfile chr8_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr8_temp4
plink --noweb --bfile chr9_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr9_temp4
plink --noweb --bfile chr10_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr10_temp4
plink --noweb --bfile chr11_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr11_temp4
plink --noweb --bfile chr12_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr12_temp4
plink --noweb --bfile chr13_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr13_temp4
plink --noweb --bfile chr14_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr14_temp4
plink --noweb --bfile chr15_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr15_temp4
plink --noweb --bfile chr16_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr16_temp4
plink --noweb --bfile chr17_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr17_temp4
plink --noweb --bfile chr18_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr18_temp4
plink --noweb --bfile chr19_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr19_temp4
plink --noweb --bfile chr20_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr20_temp4
plink --noweb --bfile chr21_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr21_temp4
plink --noweb --bfile chr22_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chr22_temp4
plink --noweb --bfile chrX_temp3 --indep-pairwise 50 5 0.1 --ld-snp-list chr_pos_temp.txt --r2 --recode --make-bed --out chrX_temp4
rm chr_pos_temp.txt
rm *temp3*
# Create concatenated file with all pruned in snps and all pruned out snps
cat *.prune.in > $2.prune.in
cat *.prune.out > $2.prune.out
rm *temp4*
Outcome
By the end of this script, we have a file called “output.prune.in” and “output.prune.out”. Each contains a list of SNPs that have been pruned in and out of our dataset, respectively.
These files take the format of a single column in a text file with each line representing a single SNP. The SNP location is listed as chr_pos.