Creating concise phenotype files for patients in PMBB

5 minute read

Published:

In this posting, I created a script in R that goes through the 3 most important phenotype files in PMBB.

These files contain information for:

  • Patient demographics and covariates
    • Demographics include Age, Gender
    • Covariates are “Principal Components” discussed earlier. These are vectors with a magnitude according to 10 components based on exome sequencing, and 10 components based on genome sequencing
  • Patient BMI
  • Patient smoking history

In this R script, I basically take all the above information and combine it into something useful.

  • Many of the files contain unnecessary columns
    • So I removed them
  • For patient BMI, each patient has multiple BMIs recorded for each hospital or office visit
    • I take the mean BMI of all recordings, so I only include 1 BMI which summarizes overall BMI over time
  • For smoking history, the data listed includes “Never”, “Quit” or “Current” smoking status, smoking quantity, smoking duration, and quit date, with multiple versions for each encounter data was modified
    • I take only the most recent encounter data, as this is the information that is most up to date
    • I summarize smoking quantity and duration into a single number, which is the “Pack-Year-History”
      • For those unfamiliar, this is quantity smoked in packs multiplied by years smoked
      • E.g. 2 packs a day for 30 years = 60 Pack-Year-History
      • E.g. A never smoker has a 0 Pack-Year-History
      • Of note, I remove patients who don’t have quantified information on their smoking history. Of the ~20,000 smokers, this removes approximately 10,000
    • I also create a new column which evaluates “Yes” or “No”, is the patient recommended for lung cancer screening by low-dose chest CT scan? I.e, do they fit the criteria:
      • Age 50-80
      • 20+ pack-year smoking history
      • Current smoker or smoker who quit <15 years ago

Overall, this creates a file which I think is very useful, as it contains pertinent patient information with columns for each patient phenotype

  • PMBB_ID
  • Year of birth
  • Gender
  • ePC1-10: 10 columns each denoting the principal component vector for exome data #1-10
  • gPC1-10: 10 columns each denoting the principal component vector for genome data #1-10
  • Pack-year-smoking-history
  • Eligible for low dose chest CT for lung cancer screening?
  • BMI

This file will be used in the OC_build_PRS.Rscript, and will be the basis of control variables when building Polygenic Risk Scores.

At the end, I create a pie chart to visualize the patients who would be recommended for Lung Cancer screening.

Image

#!/usr/bin/env Rscript

####
# Code to make a text file containing important info to input into PRS scores
# This code should output a file that contains the following columns
  # PMBB_ID
  # Year of Birth
  # Gender
  # ePC1-10: 10 columns each denoting the principal component vector for exome data #1-10
  # gPC1-10: 10 columns each denoting the principal component vector for genome data #1-10
  # Pack-year-smoking-history
  # Eligible for low dose chest CT for lung cancer screening?
  # BMI

# Load phenotypes files, this was from phenotypes 2.1 in PMBB
covariates <- read.csv(file= "G://My Drive//Work//Research//Maxwell//phenotypes//PMBB-Release-2020-2.1_phenotype_covariates.txt", header = T, sep="\t")
tobacco <- read.csv(file= "G://My Drive//Work//Research//Maxwell//phenotypes//PMBB-Release-2020-2.1_phenotype_smoking-history.txt", header = T, sep="\t")
bmi <- read.csv(file= "G://My Drive//Work//Research//Maxwell//phenotypes//PMBB-Release-2020-2.1_phenotype_vitals-BMI.txt", header = T, sep="\t")

# Covariates
# Only pull out the sex, birth year, and principal components of each patient
covariates <- covariates[,-c(4)]

# BMI
# Calculate the average patient BMI over all encounters
bmi <- aggregate(bmi$BMI, list(bmi$PMBB_ID), FUN=mean)
colnames(bmi) <- c("PMBB_ID", "BMI")

# Tobacco
# Only pull out the most recent encounter for each patient
tobacco <- as.data.table(tobacco)
tobacco <- tobacco[tobacco[, .I[ENC_DT_SHIFT == max(ENC_DT_SHIFT)], by=PMBB_ID]$V1]

# Make a data frame with never smokers only, 2nd column is their pack year history of 0
never_smokers <- data.frame(PMBB_ID=tobacco[which(tobacco$SOCIAL_HISTORY_USE=="Never"),c(1)],PACK_YR_HX=0, CT_ELIGIBLE="NO")

# Subset data frame to ever smokers
ever_smokers <- tobacco[-which(tobacco$SOCIAL_HISTORY_USE=="Never"),]
# Remove smokers who don't have a listed duration of smoking
ever_smokers <- ever_smokers[-which(is.na(ever_smokers$USE_YEARS)),]
# Remove smokers who don't have a listed quantity of smoking
ever_smokers <- ever_smokers[-which(ever_smokers$USE_AMOUNT_PER_TIME==""),]
# Subset only important columns
ever_smokers <- ever_smokers[,c(1,3,5,6,7,8)]
# Calculate pack-year-smoking history
ever_smokers$YEARS = as.numeric(substr(ever_smokers$USE_AMOUNT_PER_TIME,1,nchar(ever_smokers$USE_AMOUNT_PER_TIME)-16))
ever_smokers$PACK_YR_HX = ever_smokers$USE_YEARS*ever_smokers$YEARS
# Create a column YES/NO on if the patient would be recommended for Chest CT screening already
  # >20 pack year history smoking
  # AND
  # Current smokers or quit <15 years ago
ever_smokers$CT_ELIGIBLE = "YES" 
ever_smokers$CT_ELIGIBLE[which(ever_smokers$PACK_YR_HX<=20)] = "NO"
ever_smokers$CT_ELIGIBLE[which(ever_smokers$USE_QUIT_DATE_SHIFT<"2007-05-21")] = "NO"
# Subset important columns
ever_smokers <- ever_smokers[,c(1,8,9)]

# Combine never and ever smoker data frames, merge with covariates and bmi
tobacco <- rbind(never_smokers, ever_smokers)
covariates_tobacco <- merge(covariates,tobacco, by="PMBB_ID")
covariates_tobacco_BMI <- merge(covariates_tobacco,bmi, by="PMBB_ID")

# Fix patients CT_ELIGIBLE according to age - make "NO" if age >80 or <50
covariates_tobacco_BMI$CT_ELIGIBLE[which(covariates_tobacco_BMI$BIRTH_YEAR<1942)] = "NO"
covariates_tobacco_BMI$CT_ELIGIBLE[which(covariates_tobacco_BMI$BIRTH_YEAR>1972)] = "NO"

# Plot how many are CT eligible, just for fun

# 3D Exploded Pie Chart
library(plotrix)
slices <- c(length(which(covariates_tobacco_BMI$CT_ELIGIBLE=="YES")), length(which(covariates_tobacco_BMI$CT_ELIGIBLE=="NO")))
lbls <- c(paste("YES -",slices[1]), paste("NO -",slices[2]))
pie(slices,labels=lbls,main="# eligible for Lung Cancer Screening with CT",col=rainbow(2))
legend("bottom", "YES - patients with 20+ pack-year hx and current smokers or quit <15 years ago", cex=0.7)

# Save covariates file
write.table(covariates_tobacco_BMI , "G://My Drive//Work//Research//Maxwell//phenotypes//OC_covariate_file.txt", row.names = FALSE, sep="\t", quote = FALSE)