Creating concise phenotype files for patients in PMBB


Published: May 21, 2022

In this post, I describe an R script I wrote that processes the three most important phenotype files in PMBB.

These files contain information on:

- Patient demographics and covariates
  - Demographics include age and gender
  - Covariates are the “Principal Components” discussed earlier: for each patient, 10 component values based on exome sequencing and 10 based on genome sequencing
- Patient BMI
- Patient smoking history

The R script takes all of the above information and combines it into a single, more useful file.

- Many of the files contain unnecessary columns
  - So I removed them
- For patient BMI, each patient has multiple BMIs recorded, one for each hospital or office visit
  - I take the mean of all recordings, so each patient has a single BMI that summarizes their BMI over time
- For smoking history, the data includes smoking status (“Never”, “Quit” or “Current”), smoking quantity, smoking duration, and quit date, with a separate record for each encounter at which the data was modified
  - I take only the most recent encounter, as this is the most up-to-date information
  - I summarize smoking quantity and duration into a single number, the “Pack-Year-History”
    - For those unfamiliar, this is the quantity smoked in packs per day multiplied by the number of years smoked
    - E.g. 2 packs a day for 30 years = a 60 Pack-Year-History
    - E.g. a never smoker has a 0 Pack-Year-History
    - Of note, I remove patients who don’t have quantified information on their smoking history. Of the ~20,000 smokers, this removes approximately 10,000.
  - I also create a new column, “Yes” or “No”, indicating whether the patient is recommended for lung cancer screening by low-dose chest CT scan, i.e. whether they fit all of these criteria:
    - Age 50-80
    - 20+ pack-year smoking history
    - Current smoker or smoker who quit <15 years ago
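
The pack-year and eligibility rules above can be tried out on toy data. This is only a minimal sketch, not the logic of the script further down: the data frame `toy`, its column names, and the helper `ct_eligible()` are made up for illustration. It assumes quit dates are ISO-formatted strings that are blank for current smokers, and it approximates age from birth year alone.

```r
# Hypothetical toy data; these column names are illustrative, not PMBB's
toy <- data.frame(
  id         = c("A", "B", "C", "D"),
  birth_year = c(1960, 1960, 1990, 1955),
  status     = c("Current", "Quit", "Current", "Quit"),
  packs_day  = c(2, 1, 1.5, 1),
  years      = c(30, 25, 20, 30),
  quit_date  = c("", "2015-03-01", "", "2001-06-15"),
  stringsAsFactors = FALSE
)

# Pack-years = packs per day x years smoked
toy$pack_years <- toy$packs_day * toy$years

# Hypothetical helper: "YES" only if all three criteria hold on ref_date
ct_eligible <- function(birth_year, pack_years, status, quit_date,
                        ref_date = Sys.Date()) {
  # Approximate age, since only the birth year is known
  age <- as.integer(format(ref_date, "%Y")) - birth_year
  # The date 15 years before the reference date
  cutoff <- seq(ref_date, by = "-15 years", length.out = 2)[2]
  # Blank quit dates become NA
  quit <- as.Date(quit_date, format = "%Y-%m-%d")
  recent <- status == "Current" | (!is.na(quit) & quit > cutoff)
  ifelse(age >= 50 & age <= 80 & pack_years >= 20 & recent, "YES", "NO")
}

toy$eligible <- ct_eligible(toy$birth_year, toy$pack_years, toy$status,
                            toy$quit_date, ref_date = as.Date("2022-05-21"))
print(toy[, c("id", "pack_years", "eligible")])
```

Passing the reference date as an argument means the age and quit-date cutoffs do not have to be recalculated by hand when the analysis is rerun at a later date.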

Overall, this creates a file which I think is very useful, as it contains the pertinent information for each patient, with one column per phenotype:

- PMBB_ID
- Year of birth
- Gender
- ePC1-10: 10 columns, one for each of the exome principal components #1-10
- gPC1-10: 10 columns, one for each of the genome principal components #1-10
- Pack-year-smoking-history
- Eligible for low dose chest CT for lung cancer screening?
- BMI

This file will be used by OC_build_PRS.Rscript, where it will provide the control variables for building Polygenic Risk Scores.

At the end, I create a pie chart to visualize the patients who would be recommended for Lung Cancer screening.
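
Before the full script, the two per-patient reductions described above (mean BMI across visits, and keeping only the most recent smoking record) can be illustrated on toy data. This is a minimal base-R sketch: the data frames `bmi_toy` and `smoke_toy` and their column names are hypothetical, and the encounter dates are assumed to be ISO-formatted strings so that sorting them as text is also chronological.

```r
# Hypothetical toy BMI measurements, several visits per patient
bmi_toy <- data.frame(
  id  = c("A", "A", "B", "B", "B"),
  bmi = c(24, 26, 30, NA, 32)
)
# One mean BMI per patient; the formula interface drops rows with a
# missing BMI by default (na.action = na.omit)
mean_bmi <- aggregate(bmi ~ id, data = bmi_toy, FUN = mean)

# Hypothetical toy smoking records, one row per encounter
smoke_toy <- data.frame(
  id     = c("A", "A", "B"),
  date   = c("2019-01-10", "2021-07-02", "2020-03-15"),
  status = c("Current", "Quit", "Never"),
  stringsAsFactors = FALSE
)
# Sort oldest to newest within each patient, then keep each patient's last row
smoke_toy <- smoke_toy[order(smoke_toy$id, smoke_toy$date), ]
latest <- smoke_toy[!duplicated(smoke_toy$id, fromLast = TRUE), ]

print(mean_bmi)
print(latest)
```

Unlike selecting rows where the date equals the per-patient maximum, the `duplicated()` approach keeps exactly one row per patient even when two records share the same latest date.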

#!/usr/bin/env Rscript

####
# Code to make a text file containing important info to input into PRS scores
# This code should output a file that contains the following columns
  # PMBB_ID
  # Year of Birth
  # Gender
  # ePC1-10: 10 columns each denoting the principal component vector for exome data #1-10
  # gPC1-10: 10 columns each denoting the principal component vector for genome data #1-10
  # Pack-year-smoking-history
  # Eligible for low dose chest CT for lung cancer screening?
  # BMI

# Load the phenotype files (tab-delimited text), taken from PMBB phenotypes release 2.1
covariates <- read.csv(file= "G://My Drive//Work//Research//Maxwell//phenotypes//PMBB-Release-2020-2.1_phenotype_covariates.txt", header = TRUE, sep="\t")
tobacco <- read.csv(file= "G://My Drive//Work//Research//Maxwell//phenotypes//PMBB-Release-2020-2.1_phenotype_smoking-history.txt", header = TRUE, sep="\t")
bmi <- read.csv(file= "G://My Drive//Work//Research//Maxwell//phenotypes//PMBB-Release-2020-2.1_phenotype_vitals-BMI.txt", header = TRUE, sep="\t")

# Covariates
# Drop the 4th column, which is not needed; this keeps the ID, sex, birth year, and principal components of each patient
covariates <- covariates[,-c(4)]

# BMI
# Calculate the average patient BMI over all encounters
bmi <- aggregate(bmi$BMI, list(bmi$PMBB_ID), FUN=mean)
colnames(bmi) <- c("PMBB_ID", "BMI")

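# Tobacco
# Only pull out the most recent encounter for each patient
# (as.data.table and the .I / by= syntax below come from the data.table package)
library(data.table)
tobacco <- as.data.table(tobacco)
tobacco <- tobacco[tobacco[, .I[ENC_DT_SHIFT == max(ENC_DT_SHIFT)], by=PMBB_ID]$V1]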

# Make a data frame with never smokers only, 2nd column is their pack year history of 0
never_smokers <- data.frame(PMBB_ID=tobacco[which(tobacco$SOCIAL_HISTORY_USE=="Never"),c(1)],PACK_YR_HX=0, CT_ELIGIBLE="NO")

# Subset data frame to ever smokers
# (logical indexing is used rather than -which(), which would drop every row if nothing matched)
ever_smokers <- tobacco[!(tobacco$SOCIAL_HISTORY_USE %in% "Never"),]
# Remove smokers who don't have a listed duration of smoking
ever_smokers <- ever_smokers[!is.na(ever_smokers$USE_YEARS),]
# Remove smokers who don't have a listed quantity of smoking
ever_smokers <- ever_smokers[!(ever_smokers$USE_AMOUNT_PER_TIME %in% ""),]
# Subset only important columns
ever_smokers <- ever_smokers[,c(1,3,5,6,7,8)]
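# Calculate pack-year-smoking history
# Drop the last 16 characters of USE_AMOUNT_PER_TIME, leaving the numeric quantity smoked (packs)
ever_smokers$PACKS = as.numeric(substr(ever_smokers$USE_AMOUNT_PER_TIME,1,nchar(ever_smokers$USE_AMOUNT_PER_TIME)-16))
# Pack-years = quantity in packs * years smoked
ever_smokers$PACK_YR_HX = ever_smokers$USE_YEARS*ever_smokers$PACKS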
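# Create a column YES/NO on whether the patient would already be recommended for Chest CT screening
  # 20+ pack year history smoking
  # AND
  # Current smokers or quit <15 years ago (cutoff 2007-05-21, i.e. 15 years before May 21, 2022)
ever_smokers$CT_ELIGIBLE = "YES"
ever_smokers$CT_ELIGIBLE[which(ever_smokers$PACK_YR_HX<20)] = "NO"
# Skip blank quit dates: "" compares as earlier than any date string and would wrongly be marked "NO"
ever_smokers$CT_ELIGIBLE[which(ever_smokers$USE_QUIT_DATE_SHIFT!="" & ever_smokers$USE_QUIT_DATE_SHIFT<"2007-05-21")] = "NO"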
# Subset important columns
ever_smokers <- ever_smokers[,c(1,8,9)]

# Combine never and ever smoker data frames, merge with covariates and bmi
tobacco <- rbind(never_smokers, ever_smokers)
covariates_tobacco <- merge(covariates,tobacco, by="PMBB_ID")
covariates_tobacco_BMI <- merge(covariates_tobacco,bmi, by="PMBB_ID")

# Fix patients' CT_ELIGIBLE according to age - make "NO" if age >80 or <50 in 2022
# (age is approximated from birth year: born before 1942 is >80, born after 1972 is <50)
covariates_tobacco_BMI$CT_ELIGIBLE[which(covariates_tobacco_BMI$BIRTH_YEAR<1942)] = "NO"
covariates_tobacco_BMI$CT_ELIGIBLE[which(covariates_tobacco_BMI$BIRTH_YEAR>1972)] = "NO"

# Plot how many are CT eligible, just for fun

# Pie chart (pie() is part of base R graphics, so no extra package is needed)
slices <- c(length(which(covariates_tobacco_BMI$CT_ELIGIBLE=="YES")), length(which(covariates_tobacco_BMI$CT_ELIGIBLE=="NO")))
lbls <- c(paste("YES -",slices[1]), paste("NO -",slices[2]))
pie(slices,labels=lbls,main="# eligible for Lung Cancer Screening with CT",col=rainbow(2))
legend("bottom", "YES - patients aged 50-80 with 20+ pack-year hx and current smokers or quit <15 years ago", cex=0.7)

# Save covariates file
write.table(covariates_tobacco_BMI , "G://My Drive//Work//Research//Maxwell//phenotypes//OC_covariate_file.txt", row.names = FALSE, sep="\t", quote = FALSE)
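
One thing worth checking after the merges: `merge()` performs an inner join by default, so a patient missing from any of the three tables is dropped from the final file. The sketch below uses hypothetical toy tables (`cov_toy` and `bmi_toy2`, with made-up column names) to show the difference between the default and `all.x = TRUE`, and how to list the IDs that were lost.

```r
# Hypothetical toy tables; patient C has no BMI record
cov_toy  <- data.frame(id = c("A", "B", "C"), birth_year = c(1950, 1960, 1970))
bmi_toy2 <- data.frame(id = c("A", "B"), bmi = c(25, 31))

# Default merge: only IDs present in both tables survive
inner <- merge(cov_toy, bmi_toy2, by = "id")

# all.x = TRUE keeps every row of the first table, filling gaps with NA
left <- merge(cov_toy, bmi_toy2, by = "id", all.x = TRUE)

# IDs that the default merge dropped
dropped <- setdiff(cov_toy$id, inner$id)
print(dropped)
```

Whether dropping such patients is desirable depends on the downstream analysis; comparing `nrow()` before and after each merge is a quick way to see how many were lost.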


Oliver Clark
