Molecule Tutorials - Herong's Tutorial Examples - v1.26, by Herong Yang
"vcftools" - VCF Utility Command
Provides a quick introduction to the "vcftools" command, a utility that summarizes, runs calculations on, filters, and converts data in VCF and BCF files.
What Is the "vcftools" Command? - "vcftools" is a utility command that lets you summarize data, run calculations on data, filter out data, and convert data stored in VCF and BCF files.
Here is what I did on my Linux computer to play with "vcftools".
1. Install "vcftools" with the DNF command:
herong$ sudo dnf install vcftools

Installed:
  vcftools-0.1.16-5.el8.x86_64
2. Calculate AC (Allele Count) values with "vcftools --counts" on the first VCF sample file provided in the previous tutorial.
herong$ more sample.vcf
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMP001 SAMP002
20 1291018 rs11449 G A . PASS . GT 0/0 0/1
20 2300608 rs84825 C T . PASS . GT:GP 0/1:. 0/1:0.03,0.97,0
20 2301308 rs84823 T G . PASS . GT:PL ./.:. 1/1:10,5,0

herong$ vcftools --vcf sample.vcf --counts
VCFtools - 0.1.16
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
  --vcf sample.vcf
  --counts

After filtering, kept 2 out of 2 Individuals
Outputting Frequency Statistics...
After filtering, kept 3 out of a possible 3 Sites

herong$ more out.frq.count
CHROM POS N_ALLELES N_CHR {ALLELE:COUNT}
20 1291018 2 4 G:3 A:1
20 2300608 2 4 C:2 T:2
20 2301308 2 2 T:0 G:2
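3. Validate the above AC (Allele Count) output against the GT (Genotype) information in the VCF file.
Line 1:
  N_ALLELES 2 - Number of alleles reported
  N_CHR 4 - Number of chromosome copies (called alleles) reported
  G 3 - Allele "G" appears as "Ref (0)" in GT fields 3 times
  A 1 - Allele "A" appears as "Alt (1)" in GT fields 1 time
Line 2:
  N_ALLELES 2 - Number of alleles reported
  N_CHR 4 - Number of chromosome copies (called alleles) reported
  C 2 - Allele "C" appears as "Ref (0)" in GT fields 2 times
  T 2 - Allele "T" appears as "Alt (1)" in GT fields 2 times
Line 3:
  N_ALLELES 2 - Number of alleles reported
  N_CHR 2 - Only 2 chromosome copies reported, because SAMP001 has a missing genotype (./.)
  T 0 - Allele "T" appears as "Ref (0)" in GT fields 0 times
  G 2 - Allele "G" appears as "Alt (1)" in GT fields 2 times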
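The manual check above can also be sketched in a few lines of Python. This is only an illustration of the counting rule, not how "vcftools" is implemented. The function name count_alleles() is hypothetical, and the sketch assumes that GT is the first sub-field of each sample column and that "." marks a missing call.

```python
import re

def count_alleles(ref, alt, sample_fields):
    """Count how often each allele is called in the GT sub-fields."""
    # Allele index 0 is REF; indexes 1, 2, ... are the comma-separated ALT alleles.
    alleles = [ref] + [a for a in alt.split(",") if a != "."]
    counts = {allele: 0 for allele in alleles}
    for field in sample_fields:
        gt = field.split(":")[0]             # assumption: GT is the first sub-field
        for index in re.split(r"[/|]", gt):  # "/" is unphased, "|" is phased
            if index != ".":                 # skip missing calls such as "./."
                counts[alleles[int(index)]] += 1
    return counts

# (REF, ALT, sample columns) copied from the three data lines of sample.vcf
print(count_alleles("G", "A", ["0/0", "0/1"]))                # {'G': 3, 'A': 1}
print(count_alleles("C", "T", ["0/1:.", "0/1:0.03,0.97,0"]))  # {'C': 2, 'T': 2}
print(count_alleles("T", "G", ["./.:.", "1/1:10,5,0"]))       # {'T': 0, 'G': 2}
```

The sum of the counts on each line corresponds to the N_CHR column, which is why the third line reports only 2.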
4. Calculate AF (Allele Frequency) values with "vcftools --freq".
herong$ vcftools --vcf sample.vcf --freq

herong$ more out.frq
CHROM POS N_ALLELES N_CHR {ALLELE:FREQ}
20 1291018 2 4 G:0.75 A:0.25
20 2300608 2 4 C:0.5 T:0.5
20 2301308 2 2 T:0 G:1
5. Validate the above AF (Allele Frequency) output against the AC (Allele Count) output:
Line 1:
  AF for "G" = AC for "G" / N_CHR = 3/4 = 75%
  AF for "A" = AC for "A" / N_CHR = 1/4 = 25%
Line 2:
  AF for "C" = AC for "C" / N_CHR = 2/4 = 50%
  AF for "T" = AC for "T" / N_CHR = 2/4 = 50%
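The same division can be written as a small Python sketch. The function name allele_frequencies() is hypothetical, and raising an error when no allele was called is a choice made for this illustration only; it does not describe what "vcftools" does in that case.

```python
def allele_frequencies(counts):
    """Divide each allele count by N_CHR, the total number of called alleles."""
    n_chr = sum(counts.values())
    if n_chr == 0:
        # Assumption of this sketch: refuse to divide when nothing was called.
        raise ValueError("no called alleles at this site")
    return {allele: count / n_chr for allele, count in counts.items()}

# Counts copied from the out.frq.count lines for sample.vcf
print(allele_frequencies({"G": 3, "A": 1}))  # {'G': 0.75, 'A': 0.25}
print(allele_frequencies({"C": 2, "T": 2}))  # {'C': 0.5, 'T': 0.5}
print(allele_frequencies({"T": 0, "G": 2}))  # {'T': 0.0, 'G': 1.0}
```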
6. Calculate AC and AF values with "vcftools" on the second VCF sample file provided in the previous tutorial.
herong$ more sample-2.vcf
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 . G A . . NS=3;DP=14;AF=0.5 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A . . NS=3;DP=11;AF=0.02 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 . A G,T . . NS=2;DP=10;AF=0.33,0.67 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . . . NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 . GTC G,GTCT . . NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

herong$ vcftools --vcf sample-2.vcf --counts

herong$ more out.frq.count
CHROM POS N_ALLELES N_CHR {ALLELE:COUNT}
20 14370 2 6 G:3 A:3
20 17330 2 6 T:5 A:1
20 1110696 3 6 A:0 G:2 T:4
20 1230237 1 6 T:6
20 1234567 3 6 GTC:2 G:3 GTCT:1

herong$ vcftools --vcf sample-2.vcf --freq

herong$ more out.frq
CHROM POS N_ALLELES N_CHR {ALLELE:FREQ}
20 14370 2 6 G:0.5 A:0.5
20 17330 2 6 T:0.833333 A:0.166667
20 1110696 3 6 A:0 G:0.333333 T:0.666667
20 1230237 1 6 T:1
20 1234567 3 6 GTC:0.333333 G:0.5 GTCT:0.166667
Note that the above AF values are calculated purely from GT (Genotype) counts across all samples. The DP (Depth: number of fragment reads) value of each sample is not considered at all.
In other words, if each sample represents a different individual in a population, the above AF value of allele X (or nucleotide X) estimates how frequently allele X occurs among the chromosome copies of the diploid population represented by those samples. This assumes that all cells of an individual carry identical sets of chromosome pairs.
If a given sample contains a mixture of normal cells and mutated cells, allele counts at the DNA fragment read (or depth) level must be used to calculate the AF value of allele X for that sample. This AF value estimates how frequently allele X occurs among the chromosome copies of all cells in that single sample. This assumes that cells may carry different sets of chromosome pairs due to mutation.
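To make the difference between the two estimates concrete, here is a minimal Python sketch for a single sample at a biallelic site. The per-allele read counts are hypothetical: the sample files above record only the total depth (DP), so the sketch assumes such counts are available from some other source, for example a per-allele depth field written by a variant caller. The function names are also hypothetical.

```python
import re

def genotype_af(gt):
    """Fraction of ALT (1) calls in one sample's GT value, for a biallelic site."""
    calls = [c for c in re.split(r"[/|]", gt) if c != "."]
    if not calls:
        raise ValueError("genotype is missing")
    return calls.count("1") / len(calls)

def read_level_af(ref_reads, alt_reads):
    """Fraction of reads supporting ALT among all reads covering the site."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / depth

# Hypothetical sample: called heterozygous, but only 2 of 8 reads carry ALT
print(genotype_af("0/1"))    # 0.5
print(read_level_af(6, 2))   # 0.25
```

In this made-up case the genotype call implies that half of the chromosome copies carry the alternate allele, while the reads suggest that only a quarter do, as could happen in a sample that mixes normal and mutated cells.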
Also note that some data lines include AF values in the INFO column. These values cover alternate alleles only and are calculated in the same way as by "vcftools". For example, the above VCF file reports AF=0.333,0.667 for alternate alleles G and T, which matches the "vcftools" output well:
VCF data line:
  20 1110696 . A G,T . . NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 ...
vcftools output:
  20 1110696 3 6 A:0 G:0.333333 T:0.666667
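This comparison can be automated with a short Python sketch. The function name info_af() and the tolerance of 0.005 are assumptions chosen for this illustration; the tolerance only absorbs the rounding of the AF values stored in the INFO column.

```python
import math

def info_af(info):
    """Return the AF values of an INFO column as floats, or [] if AF is absent."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "AF":
            return [float(v) for v in value.split(",")]
    return []

reported = info_af("NS=2;DP=10;AF=0.333,0.667")  # INFO column of the data line
computed = [0.333333, 0.666667]                  # G and T values from out.frq

# Same number of alternate alleles, and each pair agrees within the tolerance
matches = len(reported) == len(computed) and all(
    math.isclose(r, c, abs_tol=0.005) for r, c in zip(reported, computed)
)
print(reported, matches)  # [0.333, 0.667] True
```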
In conclusion, VCF files are used to store genome sequence variant data collected from multiple individuals that sample a given population. The main statistical value, AF (Allele Frequency), is calculated from the counts of different genotypes in all samples.