What is a VCF file?

2 minute read

Published:

This blog post exists because I tried to create a blog post on study methods and failed.

The fundamental issue was that I didn’t understand what the “All Variants” directory contained in the PMBB. I also wasn’t aware what a .vcf file contains. This blog post is created to explain that specific concept.

Variant Call Format

A .vcf file contains 3 main components

  1. Header containing metadata
  2. Column labels
  3. Data/content within the table

Header:

The header begins the file and provides metadata describing the body of the file. Header lines start with #. Special keywords start with ##.

For example:

##fileformat=VCFv4.3

##fileDate=20090805

##source=myImputationProgramV3.1

##FORMAT=<ID=GT,Number=1,Type=String,Description=”Genotype”>

Columns

The columns are listed as tab-separated 8 mandatory columns, followed by optional columns. When additional columns are used, the first optional column is used to describe the format of the data in the columns that follow.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

Above, you can see that the standard columns go from “CHROM” to “INFO”. These 8 columns contain:

  • CHROM: Chromosome number
    • E.g. 20
  • POS: The “1-based position” of the variation on the given sequence. This is the physical nucleotide position, by number on the chromosome
    • E.g. 14370
  • ID: SNP identifier, which uses the “reference SNP cluster” or “rs” identification
    • E.g. rs6054257
  • REF: The reference sequence at this position. Can be 1 or a sequence of nucleotides
    • E.g. G
  • ALT: The alternative alleles at the position.
    • E.g. A
  • QUAL: A quality score associated with the inference of the given alleles. A Phred-scaled quality score, which has funky meaning. E.g. Q10 = 10%, Q20 = 1%, Q30 = 0.1% error rate
    • E.g. 29
  • FILTER: A flag indicating which of a given set of filters the variation has failed or PASS if all the filters were passed successfully.
    • E.g. PASS
  • INFO: An extensible list of key-value pairs (fields) describing the variation.
    • E.g. NS=3 (where NS = number of samples with data)

Then, the column “FORMAT” would contain information such as “GT”. This goes on to list the data that would be contained in columns NA00001, etc. As you can see from the metadata above, a column listed as “GT” stands for “Genotype” and would contain a string.

Reflections and Further Questions

I was interested to learn about this filetype and see that a .vcf file shows that there’s a specific variant at a specific location in the genome, yet doesn’t identify the patient the genetic information belongs to.

It’s also interesting that although I learned about .vcf files here, the files located in PMBB have extension .vcf.gz or .vcf.gz.tbi. These have presumable been compressed into GNU zip files. Not sure about .tbi, but it may stand for TeraByte Unlimited (TBU) Image archive.