What Is VCF (Variant Call Format)

Provides a quick introduction to VCF (Variant Call Format), which is a text file format for storing genome sequence variant data.

What Is VCF (Variant Call Format)? - VCF is a text file format for storing genome sequence variant data. It is often used to describe single nucleotide variants (SNVs) as well as insertions, deletions, and other sequence variations.

Here is an example of a VCF file:

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled Genotype ...">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMP001	SAMP002
20	1291018	rs11449	G	A	.	PASS	.	GT	0/0	0/1
20	2300608	rs84825	C	T	.	PASS	.	GT:GP	0/1:.	0/1:0.03,0.97,0
20	2301308	rs84823	T	G	.	PASS	.	GT:PL	./.:.	1/1:10,5,0

My notes for reading VCF files:

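Here are field headers and their explanations:

CHROM    - Chromosome (or contig) name
POS      - 1-based position of the variant on the chromosome
ID       - ID of the variant, such as a dbSNP rs number
REF      - Reference allele
ALT      - Comma-separated list of alternate alleles
QUAL     - Phred-scaled quality score of the variant call
FILTER   - PASS, or a semicolon-separated list of filters that the variant failed
INFO     - Semicolon-separated subfields that describe the variant
FORMAT   - Colon-separated list of subfield names used in the sample columns
SAMP001, SAMP002, ... - One column per sample, with values listed in FORMAT order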

Descriptions of subfields used in the INFO and FORMAT fields are usually provided in meta lines, which start with "##". A subfield in the INFO field describes the variant as a whole, across all samples. A subfield listed in the FORMAT field is reported separately for each sample.
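To see how those meta lines can be read by a program, here is a minimal Python sketch. The function name parse_meta_line is made up for this illustration, and the sketch assumes that quoted values contain no escaped double quotes. It is not a full VCF header parser.

```python
import re

def parse_meta_line(line):
    """Parse a structured meta line like ##INFO=<ID=...,...>.

    Returns (kind, fields), or None for simple lines such as
    ##fileformat=VCFv4.2 that have no <...> body.
    """
    match = re.match(r"##(\w+)=<(.*)>$", line.strip())
    if match is None:
        return None
    kind, body = match.groups()
    fields = {}
    # Each entry is key=value; a value is either a quoted string
    # (which may contain commas) or a run of non-comma characters.
    for key, value in re.findall(r'(\w+)=("[^"]*"|[^,]*)', body):
        fields[key] = value.strip('"')
    return kind, fields

kind, fields = parse_meta_line(
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">'
)
print(kind, fields["ID"], fields["Type"], fields["Description"])
```

The Type and Number values collected this way tell a reader how to convert the raw text of each subfield.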

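Here is a list of some commonly used subfields:

INFO subfields:
  NS - Number of samples with data
  DP - Total read depth across samples
  AF - Allele frequency of each alternate allele
  AA - Ancestral allele

FORMAT subfields:
  GT - Genotype, such as 0/1 or 1|0
  GQ - Genotype quality
  DP - Read depth of the sample
  HQ - Haplotype quality
  GP - Genotype probabilities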

Here is another example of a VCF file with more data fields. The ID, QUAL, and FILTER values are replaced with "." (missing) to reduce the line size.

##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership - build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	.	G	A	.	.	NS=3;DP=14;AF=0.5	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	17330	.	T	A	.	.	NS=3;DP=11;AF=0.02	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
20	1110696	.	A	G,T	.	.	NS=2;DP=10;AF=0.33,0.67	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4
20	1230237	.	T	.	.	.	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:7:56,60	0|0:48:4:51,51	0/0:61:2
20	1234567	.	GTC	G,GTCT	.	.	NS=3;DP=9;AA=G	GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3

Now let's try to read the first data line, which provides detailed information about the first variant (mutation):

1. The first part contains standard fields:

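CHROM    20 - The chromosome where the variant is located
POS      14370 - The position on the chromosome where the variant occurs
ID       . - ID of the variant; missing here (rs6054257 in the specification's version of this line)
REF      G - Nucleotide letter at this position in the reference sequence
ALT      A - Nucleotide letter of the alternate (or variant) allele
QUAL     . - Phred-scaled quality of the variant call; missing here (29 in the specification)
FILTER   . - Filter status; missing here (PASS in the specification)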
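The standard fields can also be picked apart in a few lines of Python. This is only a sketch: parse_fixed_fields is a made-up helper name, and the input is the first data line as it appears in the example file above, with "." treated as a missing value.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_fixed_fields(line):
    """Split one tab-separated VCF data line into its 8 fixed fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, columns[:8]))
    record["POS"] = int(record["POS"])
    # "." means the value is missing.
    record["ID"] = None if record["ID"] == "." else record["ID"]
    record["ALT"] = [] if record["ALT"] == "." else record["ALT"].split(",")
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    record["FILTER"] = (
        None if record["FILTER"] == "." else record["FILTER"].split(";")
    )
    return record

line = (
    "20\t14370\t.\tG\tA\t.\t.\tNS=3;DP=14;AF=0.5\t"
    "GT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,."
)
record = parse_fixed_fields(line)
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
```

CHROM is kept as text on purpose, because chromosome names such as X or chr20 are not numbers.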

2. The INFO part contains custom fields that apply to all samples:

INFO     NS=3;DP=14;AF=0.5

Where:
  NS   3 - Number of samples with data
  DP   14 - Total depth of all 3 samples: 1 + 8 + 5
  AF   0.5 - Allele frequency of the alternate allele A
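The INFO string is easy to split by hand, and just as easy in code. The sketch below uses a made-up function name, parse_info, and keeps every value as text, because the real types are declared in the ##INFO meta lines. Flags such as DB and H2, which have no "=" part, are stored as True.

```python
def parse_info(info):
    """Turn an INFO string like NS=3;DP=14;AF=0.5 into a dict."""
    if info == ".":
        return {}  # "." means no INFO subfields are present
    result = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep:
            # Values may be comma-separated lists, such as AF=0.33,0.67
            result[key] = value.split(",")
        else:
            result[key] = True  # a flag with no value
    return result

info = parse_info("NS=3;DP=14;AF=0.5")
print(info["NS"], info["DP"], info["AF"])
```

For a line with two alternate alleles, such as AF=0.33,0.67, the AF entry holds one value per alternate allele, in ALT order.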

3. The FORMAT and NA00001 provide additional information for the first sample:

FORMAT      NA00001
GT:GQ:DP:HQ 0|0:48:1:51,51

Where:
  GT   0|0 - Genotype: 0|0 indicates Reference-Reference chromosome pair
  GQ   48 - Genotype Quality
  DP   1 - Read depth of this sample
  HQ   51,51 - Haplotype Quality

4. The FORMAT and NA00002 provide additional information for the second sample:

FORMAT      NA00002
GT:GQ:DP:HQ 1|0:48:8:51,51

Where:
  GT   1|0 - Genotype: 1|0 indicates Alternate-Reference chromosome pair
  GQ   48 - Genotype Quality
  DP   8 - Read depth of this sample
  HQ   51,51 - Haplotype Quality

5. The FORMAT and NA00003 provide additional information for the third sample:

FORMAT      NA00003
GT:GQ:DP:HQ 1/1:43:5:.,.

Where:
  GT   1/1 - Genotype: 1/1 indicates Alternate-Alternate chromosome pair ("/" marks an unphased genotype, "|" a phased one)
  GQ   43 - Genotype Quality
  DP   5 - Read depth of this sample
  HQ   .,. - Haplotype Quality
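The three sample columns read above can be decoded the same way in Python. This is a sketch with made-up helper names (parse_samples and describe_genotype). It assumes that a sample may drop trailing subfields, which are then treated as missing (".").

```python
def parse_samples(format_field, sample_fields):
    """Pair each sample's colon-separated values with the FORMAT keys."""
    keys = format_field.split(":")
    samples = []
    for field in sample_fields:
        values = field.split(":")
        # Pad dropped trailing subfields with the missing value ".".
        values += ["."] * (len(keys) - len(values))
        samples.append(dict(zip(keys, values)))
    return samples

def describe_genotype(gt):
    """Return (phased, alleles); 0 is REF, 1 is the first ALT, None is missing."""
    phased = "|" in gt
    alleles = gt.replace("|", "/").split("/")
    return phased, [None if a == "." else int(a) for a in alleles]

samples = parse_samples(
    "GT:GQ:DP:HQ",
    ["0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,."],
)
for sample in samples:
    print(describe_genotype(sample["GT"]), sample["DP"])

# In this example, the per-sample depths add up to the INFO DP value of 14.
total_depth = sum(int(sample["DP"]) for sample in samples)
print(total_depth)
```

The depth check works for this line, but other VCF files may compute INFO DP and FORMAT DP with different filters, so the two are not guaranteed to match.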

For more information, see "The Variant Call Format (VCF) Version 4.2 Specification" at http://samtools.github.io/hts-specs/VCFv4.2.pdf.
