Molecule Tutorials - Herong's Tutorial Examples - v1.26, by Herong Yang
What Is VCF (Variant Call Format)
Provides a quick introduction to VCF (Variant Call Format), which is a text file format for storing genome sequence variant data.
What Is VCF (Variant Call Format)? - VCF is a text file format for storing genome sequence variant data. It is often used to describe single nucleotide variants (SNVs), as well as insertions, deletions, and other sequence variations.
Here is an example of a VCF file:
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Genotype Probabilities">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled Genotype ...">
#CHROM POS     ID      REF ALT QUAL FILTER INFO FORMAT SAMP001 SAMP002
20     1291018 rs11449 G   A   .    PASS   .    GT     0/0     0/1
20     2300608 rs84825 C   T   .    PASS   .    GT:GP  0/1:.   0/1:0.03,0.97,0
20     2301308 rs84823 T   G   .    PASS   .    GT:PL  ./.:.   1/1:10,5,0
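To see how the columns in the example above line up, here is a minimal Python sketch that pairs each data line with the names on the "#CHROM" header line. The helper name read_records is hypothetical, and the sketch assumes tab-separated columns as required by the VCF specification. It is an illustration only, not a full VCF parser.

```python
# Minimal sketch: pair VCF data lines with the "#CHROM" header columns.
# read_records is a hypothetical helper name, not part of any library.
# Assumption: columns are separated by tabs, as the VCF specification requires.

VCF_TEXT = "\n".join([
    "##fileformat=VCFv4.2",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "SAMP001", "SAMP002"]),
    "\t".join(["20", "1291018", "rs11449", "G", "A", ".", "PASS", ".",
               "GT", "0/0", "0/1"]),
    "\t".join(["20", "2300608", "rs84825", "C", "T", ".", "PASS", ".",
               "GT:GP", "0/1:.", "0/1:0.03,0.97,0"]),
])


def read_records(text):
    """Return one dict per data line, keyed by the header column names."""
    header = None
    records = []
    for line in text.splitlines():
        if line.startswith("##"):
            continue  # meta line, skipped in this sketch
        if line.startswith("#CHROM"):
            header = line.lstrip("#").split("\t")
            continue
        if not line.strip():
            continue  # ignore blank lines
        if header is None:
            raise ValueError("data line found before the #CHROM header")
        records.append(dict(zip(header, line.split("\t"))))
    return records


records = read_records(VCF_TEXT)
for rec in records:
    print(rec["CHROM"], rec["POS"], rec["ID"], rec["REF"], rec["ALT"],
          rec["SAMP001"], rec["SAMP002"])
```

Each record is a plain dictionary, so sample columns can be looked up by the sample names that appear on the header line.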
My notes for reading VCF files:
Here are field headers and their explanations:
Descriptions of subfields used in INFO and FORMAT fields are usually provided in meta lines. If a subfield is used in the INFO field, it describes the variant across all samples. If a subfield is used in the FORMAT field, its value is given separately for each sample.
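Here is a small Python sketch of how the subfield descriptions could be collected from "##INFO" and "##FORMAT" meta lines. The function name parse_meta and the two regular expressions are my own choices for illustration. The sketch assumes that quoted descriptions do not contain escaped quotes.

```python
import re

# A few meta lines taken from VCF examples; any ##INFO or ##FORMAT line works.
META_LINES = [
    '##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">',
    '##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">',
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">',
]

# Captures the kind (INFO or FORMAT) and the text between "<" and ">".
META_RE = re.compile(r'^##(INFO|FORMAT)=<(.*)>$')
# Captures key=value pairs; a value is a quoted string or runs to the next comma.
# Assumption: no escaped quotes inside quoted values.
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_meta(lines):
    """Hypothetical helper: map INFO/FORMAT subfield IDs to their attributes."""
    definitions = {"INFO": {}, "FORMAT": {}}
    for line in lines:
        match = META_RE.match(line.strip())
        if not match:
            continue  # not an INFO or FORMAT meta line
        kind, body = match.groups()
        pairs = {key: value.strip('"') for key, value in PAIR_RE.findall(body)}
        definitions[kind][pairs["ID"]] = pairs
    return definitions


definitions = parse_meta(META_LINES)
for kind, fields in definitions.items():
    for field_id, attrs in fields.items():
        print(kind, field_id, attrs["Type"], attrs["Description"])
```

Keeping INFO and FORMAT definitions in separate dictionaries matters because the same ID, such as DP, may be defined in both with different meanings.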
Here is a list of some commonly used subfields:
Here is another example of a VCF file with more data fields. The ID, QUAL and FILTER values are replaced with "." to reduce the line size.
##fileformat=VCFv4.2
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership - build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID REF ALT    QUAL FILTER INFO                    FORMAT      NA00001        NA00002        NA00003
20     14370   .  G   A      .    .      NS=3;DP=14;AF=0.5       GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .  T   A      .    .      NS=3;DP=11;AF=0.02      GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 .  A   G,T    .    .      NS=2;DP=10;AF=0.33,0.67 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .  T   .      .    .      NS=3;DP=13;AA=T         GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 .  GTC G,GTCT .    .      NS=3;DP=9;AA=G          GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3
Now let's try to read the first data line that provides detailed information about the first gene variant (mutation):
1. The first part contains standard fields:
CHROM  20    - The chromosome on which the variant is located
POS    14370 - The position on the chromosome where the variant occurs
ID     .     - ID of the variant (not provided in this example)
REF    G     - Nucleotide letter at this position in the reference sequence
ALT    A     - Nucleotide letter of the alternate (or variant) allele
QUAL   .     - Quality score (removed in this example)
FILTER .     - Filter status (removed in this example)
2. The INFO part contains custom fields that apply to all samples:
INFO NS=3;DP=14;AF=0.5
Where:
NS 3   - Number of samples with data
DP 14  - Total depth of all 3 samples: 1 + 8 + 5
AF 0.5 - Allele Frequency
3. The FORMAT and NA00001 provide additional information for the first sample:
FORMAT      NA00001
GT:GQ:DP:HQ 0|0:48:1:51,51
Where:
GT 0|0   - Genotype: 0|0 indicates Reference-Reference chromosome pair
GQ 48    - Genotype Quality
DP 1     - Read depth of this sample
HQ 51,51 - Haplotype Quality
4. The FORMAT and NA00002 provide additional information for the second sample:
FORMAT      NA00002
GT:GQ:DP:HQ 1|0:48:8:51,51
Where:
GT 1|0   - Genotype: 1|0 indicates Alternate-Reference chromosome pair
GQ 48    - Genotype Quality
DP 8     - Read depth of this sample
HQ 51,51 - Haplotype Quality
5. The FORMAT and NA00003 provide additional information for the third sample:
FORMAT      NA00003
GT:GQ:DP:HQ 1/1:43:5:.,.
Where:
GT 1/1  - Genotype: 1/1 indicates Alternate-Alternate chromosome pair (unphased)
GQ 43   - Genotype Quality
DP 5    - Read depth of this sample
HQ .,.  - Haplotype Quality (missing for this sample)
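The reading steps above can be sketched in Python as follows. The helper names parse_info, parse_samples and describe_genotype are hypothetical. The sample names are passed in by hand instead of being read from the header line, and the data line is assumed to be tab-separated. In this particular line the per-sample DP values happen to add up to the INFO DP value; the sketch only reports the sum and does not treat that as a general rule.

```python
# Sketch: decode the first data line of the second example.
# All helper names below are hypothetical, chosen for illustration only.

LINE = "\t".join([
    "20", "14370", ".", "G", "A", ".", ".", "NS=3;DP=14;AF=0.5",
    "GT:GQ:DP:HQ", "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
SAMPLE_NAMES = ["NA00001", "NA00002", "NA00003"]  # normally read from the header


def parse_info(info):
    """Split 'NS=3;DP=14;AF=0.5' into a dict; flags without '=' map to True."""
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result


def parse_samples(format_field, sample_fields, names):
    """Pair each sample's colon-separated values with the FORMAT keys."""
    keys = format_field.split(":")
    # zip() stops at the shorter sequence, so dropped trailing
    # subfields (like a missing HQ) are simply absent from the dict.
    return {name: dict(zip(keys, field.split(":")))
            for name, field in zip(names, sample_fields)}


def describe_genotype(gt, ref, alts):
    """Translate a GT value such as '1|0' into allele letters and phasing."""
    phased = "|" in gt
    alleles = [ref] + alts  # index 0 is REF, 1..n are the ALT alleles
    indexes = gt.replace("|", "/").split("/")
    letters = ["." if i == "." else alleles[int(i)] for i in indexes]
    return letters, phased


columns = LINE.split("\t")
chrom, pos, var_id, ref, alt, qual, flt, info, fmt = columns[:9]
info_map = parse_info(info)
samples = parse_samples(fmt, columns[9:], SAMPLE_NAMES)
depth_sum = sum(int(fields["DP"]) for fields in samples.values())

print(chrom, pos, ref, alt, info_map)
for name, fields in samples.items():
    letters, phased = describe_genotype(fields["GT"], ref, alt.split(","))
    print(name, fields["GT"], letters,
          "phased" if phased else "unphased", "DP=" + fields["DP"])
print("Sum of per-sample DP:", depth_sum)
```

Note that "|" marks a phased genotype and "/" marks an unphased one, which is why the third sample is reported differently from the first two.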
For more information, see "The Variant Call Format (VCF) Version 4.2 Specification" at http://samtools.github.io/hts-specs/VCFv4.2.pdf.