What Is VCF (Variant Call Format)

Molecule Tutorials - Herong's Tutorial Examples

Provides a quick introduction to VCF (Variant Call Format), which is a text file format for storing genome sequence variant data.

What Is VCF (Variant Call Format)? - VCF is a text file format for storing genome sequence variant data. It is often used to describe single nucleotide variants (SNVs) as well as insertions, deletions, and other sequence variations.

Here is an example of a VCF file:

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled Genotype ...">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMP001	SAMP002
20	1291018	rs11449	G	A	.	PASS	.	GT	0/0	0/1
20	2300608	rs84825	C	T	.	PASS	.	GT:GP	0/1:.	0/1:0.03,0.97,0
20	2301308	rs84823	T	G	.	PASS	.	GT:PL	./.:.	1/1:10,5,0

My notes for reading VCF files:

Meta lines - Lines beginning with "##", followed by KEY=VALUE, providing information about the file and its fields.
Header line - Line beginning with "#CHROM", followed by tab-delimited field headers.
Data lines - Lines following the header line, each with tab-delimited field values.
The symbol "." is used to denote a missing value.
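
The three line types above can be told apart by their prefix alone. Here is a minimal Python sketch of that idea; the function name classify_vcf_lines and the shortened sample text are made up for illustration, and a real parser would need to handle many more details:

```python
def classify_vcf_lines(text):
    """Split VCF text into meta entries, header names, and data rows."""
    meta = []       # list of (KEY, VALUE) pairs; keys such as FORMAT may repeat
    header = None   # column names taken from the "#CHROM" line
    records = []    # each data line as a list of tab-separated values
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("##"):
            key, _, value = line[2:].partition("=")
            meta.append((key, value))
        elif line.startswith("#CHROM"):
            header = line[1:].split("\t")
        else:
            records.append(line.split("\t"))
    return meta, header, records


# A shortened copy of the first example, written with explicit tab escapes.
sample = (
    "##fileformat=VCFv4.2\n"
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMP001\tSAMP002\n"
    "20\t1291018\trs11449\tG\tA\t.\tPASS\t.\tGT\t0/0\t0/1\n"
)
meta, header, records = classify_vcf_lines(sample)
print(meta)
print(header)
print(records)
```

Meta entries are kept as a list of pairs rather than a dictionary because the same key (FORMAT, INFO, FILTER) normally appears on several meta lines.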

Here are field headers and their explanations:

CHROM - The chromosome (or contig) identifier, for example "20".
POS - The 1-based genome coordinate of the first nucleotide base in the REF allele. Within a chromosome, VCF records are sorted in order of increasing position.
ID - A semicolon-separated list of marker identifiers.
REF - The reference allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC").
ALT - The alternate allele (the variant or mutation) expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC"). If there is more than one alternate allele, the field should be a comma-separated list of alternate alleles.
QUAL - Probability that the ALT allele is incorrectly specified, expressed on the phred scale (-10log₁₀(probability)). Higher values indicate more confident calls.
FILTER - Either "PASS" or a semicolon-separated list of failed quality control filters.
INFO - Additional information as a semicolon-separated list of KEY=VALUE entries or flag keys (no white space or tabs permitted).
FORMAT - Colon-separated list of additional subfields reported for each sample.
SAMP001 - Values of the subfields specified in FORMAT, for sample 1.
SAMP002 - Values of the subfields specified in FORMAT, for sample 2.
...
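
To make the field list concrete, here is a small Python sketch that pairs the header names with the values of one data line and converts a phred-scaled QUAL back to a probability. The function names split_record and qual_to_error_probability are hypothetical helpers written for this note, not part of any VCF library:

```python
def split_record(header_line, data_line):
    """Pair each tab-delimited value with its column name from the header."""
    names = header_line.lstrip("#").rstrip("\n").split("\t")
    values = data_line.rstrip("\n").split("\t")
    record = dict(zip(names, values))
    record["POS"] = int(record["POS"])        # POS is a number
    record["ALT"] = record["ALT"].split(",")  # one entry per alternate allele
    return record


def qual_to_error_probability(qual):
    """Invert QUAL = -10 * log10(p); returns None for the missing value "."."""
    if qual == ".":
        return None
    return 10 ** (-float(qual) / 10)


header_line = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMP001\tSAMP002"
data_line = "20\t1291018\trs11449\tG\tA\t.\tPASS\t.\tGT\t0/0\t0/1"

record = split_record(header_line, data_line)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
print(record["SAMP002"])
print(qual_to_error_probability(record["QUAL"]))
print(qual_to_error_probability("30"))
```

With this formula a QUAL of 30 corresponds to an error probability of about 1 in 1000.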

Descriptions of subfields used in INFO and FORMAT fields are usually provided in meta lines. If a subfield is used in the INFO field, it applies to all samples. If a subfield is used in the FORMAT field, it is reported separately for each sample.
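
The difference between INFO and FORMAT subfields can be sketched in Python as follows. The helper names parse_info and parse_sample are made up for illustration, and the INFO string used here is an illustrative value. Treating an INFO entry without "=" as a flag, and padding dropped trailing sample subfields with ".", are assumptions of this sketch:

```python
def parse_info(info):
    """Split an INFO field into a dict; entries without '=' are treated as flags."""
    if info == ".":
        return {}
    result = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        # Values are kept as lists of strings because a subfield may hold several values.
        result[key] = value.split(",") if sep else True
    return result


def parse_sample(format_field, sample_field):
    """Pair the FORMAT keys with the values of one sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # Assumption: if a sample has fewer values than keys, the rest are missing.
    values += ["."] * (len(keys) - len(values))
    return dict(zip(keys, values))


print(parse_info("NS=3;DP=14;AF=0.5"))
print(parse_sample("GT:GP", "0/1:0.03,0.97,0"))
print(parse_sample("GT:PL", "./.:."))
```

One INFO dictionary describes the whole data line, while parse_sample is called once per sample column with the same FORMAT string.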

Here is a list of some commonly used subfields:

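AC - (Allele Count) Count of each alternate allele in called genotypes.
AN - (Allele Number) Total number of alleles in called genotypes.
AD - (Allele Depth) Read depth of each allele.
DP - (Depth) Read depth: total across samples when used in INFO, per sample when used in FORMAT.
AF - (Allele Frequency) Frequency of each alternate allele.
GT - (Genotype) Allele indexes separated by "/" or "|", where 0 is the reference allele and 1, 2, ... are the alternate alleles.
GL - (Genotype Likelihoods) for "RR,RA,AA" genotypes. R is for reference allele, and A is for alternate allele.
PL - (Phred-scaled genotype Likelihoods) for "RR,RA,AA" genotypes.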
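
AC, AN and AF are related to the GT values of the samples. The Python sketch below counts alleles from a list of genotype strings and derives a frequency as AC divided by AN. The function name allele_counts is hypothetical, and the AF value found in a real file may come from a larger set of samples or from an estimate, so it does not always equal this simple ratio:

```python
import re


def allele_counts(genotypes, n_alt):
    """Count called alleles (AN) and occurrences of each ALT allele (AC)."""
    ac = [0] * n_alt
    an = 0
    for gt in genotypes:
        # Alleles are separated by "/" or "|".
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # missing call, not counted
            an += 1
            index = int(allele)
            if index > 0:
                ac[index - 1] += 1  # index 1 is the first ALT allele
    af = [count / an for count in ac] if an else []
    return ac, an, af


# Genotypes of SAMP001 and SAMP002 from the first and third data lines above.
print(allele_counts(["0/0", "0/1"], 1))
print(allele_counts(["./.", "1/1"], 1))
```

In the first call one of the four called alleles is the alternate allele, so the ratio is 0.25. In the second call the missing genotype "./." contributes nothing, leaving two called alleles that are both alternate.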

Here is another example of a VCF file with more data fields. The ID, QUAL and FILTER values are replaced with "." to reduce the line size.

##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership - build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	.	G	A	.	.	NS=3;DP=14;AF=0.5	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	17330	.	T	A	.	.	NS=3;DP=11;AF=0.02	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
20	1110696	.	A	G,T	.	.	NS=2;DP=10;AF=0.33,0.67	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4
20	1230237	.	T	.	.	.	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:7:56,60	0|0:48:4:51,51	0/0:61:2
20	1234567	.	GTC	G,GTCT	.	.	NS=3;DP=9;AA=G	GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3

Now let's try to read the first data line, which provides detailed information about the first variant (mutation):

1. The first part contains standard fields:

CHROM    20 - The chromosome where the variant is located
POS      14370 - The position on the chromosome where the variant occurs
ID       . - ID of the variant (missing in this example)
REF      G - Nucleotide letter at this position in the reference genome
ALT      A - Nucleotide letter of the alternate (or variant) allele
QUAL     . - Quality score (removed in this example)
FILTER   . - Filter status (removed in this example)

2. The INFO part contains custom fields that apply to all samples:

INFO     NS=3;DP=14;AF=0.5

Where:
  NS   3 - Number of samples with data
  DP   14 - Total depth of all 3 samples: 1 + 8 + 5
  AF   0.5 - Allele frequency of the alternate allele

3. The FORMAT and NA00001 provide additional information for the first sample:

FORMAT      NA00001
GT:GQ:DP:HQ 0|0:48:1:51,51

Where:
  GT   0|0 - Genotype: 0|0 indicates Reference-Reference chromosome pair
  GQ   48 - Genotype Quality
  DP   1 - Read depth of this sample
  HQ   51,51 - Haplotype Quality

4. The FORMAT and NA00002 provide additional information for the second sample:

FORMAT      NA00002
GT:GQ:DP:HQ 1|0:48:8:51,51

Where:
  GT   1|0 - Genotype: 1|0 indicates Alternate-Reference chromosome pair
  GQ   48 - Genotype Quality
  DP   8 - Read depth of this sample
  HQ   51,51 - Haplotype Quality

5. The FORMAT and NA00003 provide additional information for the third sample:

FORMAT      NA00003
GT:GQ:DP:HQ 1/1:43:5:.,.

Where:
  GT   1/1 - Genotype: 1/1 indicates Alternate-Alternate chromosome pair (unphased)
  GQ   43 - Genotype Quality
  DP   5 - Read depth of this sample
  HQ   .,. - Haplotype Quality
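
The reading above can be repeated in Python. This sketch translates each GT value into nucleotide letters using REF and ALT, notes whether the "|" (phased) or "/" (unphased) separator was used, and adds up the per-sample DP values. The function name decode_genotype is made up for illustration:

```python
def decode_genotype(gt, ref, alts):
    """Translate allele indexes in GT into nucleotide letters."""
    phased = "|" in gt
    alleles = [ref] + alts  # index 0 is REF, 1.. are the ALT alleles
    calls = []
    for index in gt.replace("|", "/").split("/"):
        calls.append(None if index == "." else alleles[int(index)])
    return calls, phased


# REF, ALT, FORMAT and the three sample columns of the first data line.
ref, alts = "G", ["A"]
format_keys = "GT:GQ:DP:HQ".split(":")
samples = ["0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,."]

total_depth = 0
for name, column in zip(["NA00001", "NA00002", "NA00003"], samples):
    fields = dict(zip(format_keys, column.split(":")))
    calls, phased = decode_genotype(fields["GT"], ref, alts)
    total_depth += int(fields["DP"])
    print(name, calls, "phased" if phased else "unphased", "DP =", fields["DP"])

print("Sum of sample depths:", total_depth)
```

The sum of the three sample depths is 1 + 8 + 5 = 14, which agrees with DP=14 in the INFO field of this data line.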