"vcftools" - VCF Utility Command

Provides a quick introduction to the "vcftools" command, a utility for summarizing, filtering, converting, and running calculations on data in VCF and BCF files.

What Is the "vcftools" Command? - "vcftools" is a utility command that allows you to summarize data, run calculations on data, filter out data, and convert data from VCF and BCF files.

Here is what I did on my Linux computer to play with "vcftools".

1. Install "vcftools" with the DNF command:

herong$ sudo dnf install vcftools
  Installed:
    vcftools-0.1.16-5.el8.x86_64

2. Calculate AC (Allele Count) values with "vcftools --counts" on the first VCF sample file provided in the previous tutorial.

herong$ more sample.vcf
...
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMP001	SAMP002
20	1291018	rs11449	G	A	.	PASS	.	GT	0/0	0/1
20	2300608	rs84825	C	T	.	PASS	.	GT:GP	0/1:.	0/1:0.03,0.97,0
20	2301308	rs84823	T	G	.	PASS	.	GT:PL	./.:.	1/1:10,5,0

herong$ vcftools --vcf sample.vcf --counts
  VCFtools - 0.1.16
  (C) Adam Auton and Anthony Marcketta 2009

  Parameters as interpreted:
    --vcf sample.vcf
    --counts

  After filtering, kept 2 out of 2 Individuals
  Outputting Frequency Statistics...
  After filtering, kept 3 out of a possible 3 Sites

herong$ more out.frq.count
  CHROM	POS	N_ALLELES	N_CHR	{ALLELE:COUNT}
  20	1291018	2	4	G:3	A:1
  20	2300608	2	4	C:2	T:2
  20	2301308	2	2	T:0	G:2

3. Validate the AC (Allele Count) output above against the GT (Genotype) information in the VCF file.

Line 1:
  N_ALLELES  2 - Number of alleles reported
  N_CHR      4 - Number of chromosomes counted (2 per sample with a called genotype)
  G          3 - Allele "G" appears as "Ref (0)" in GT fields 3 times
  A          1 - Allele "A" appears as "Alt (1)" in GT fields 1 time

Line 2:
  N_ALLELES  2 - Number of alleles reported
  N_CHR      4 - Number of chromosomes counted (2 per sample with a called genotype)
  C          2 - Allele "C" appears as "Ref (0)" in GT fields 2 times
  T          2 - Allele "T" appears as "Alt (1)" in GT fields 2 times
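
The same check can be done in code. Here is a minimal Python sketch of the counting rule, not the actual "vcftools" implementation. The function name count_alleles is my own, and I assume that a "." in a GT field marks a missing call that is skipped, which is consistent with N_CHR = 2 on the third line of the output, where SAMP001 has the genotype "./.".

```python
import re

def count_alleles(ref, alt, genotypes):
    """Count how often each allele appears in a list of GT strings like '0/1'."""
    alleles = [ref] + [a for a in alt.split(",") if a != "."]
    counts = {allele: 0 for allele in alleles}
    for gt in genotypes:
        # Allele indexes are separated by '/' (unphased) or '|' (phased).
        for index in re.split(r"[/|]", gt):
            if index != ".":  # assumption: '.' is a missing call and is skipped
                counts[alleles[int(index)]] += 1
    n_chr = sum(counts.values())
    return n_chr, counts

print(count_alleles("G", "A", ["0/0", "0/1"]))  # line 1: (4, {'G': 3, 'A': 1})
print(count_alleles("T", "G", ["./.", "1/1"]))  # line 3: (2, {'T': 0, 'G': 2})
```

The returned pair corresponds to the N_CHR column and the {ALLELE:COUNT} columns of out.frq.count.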

4. Calculate AF (Allele Frequency) values with "vcftools --freq".

herong$ vcftools --vcf sample.vcf --freq

herong$ more out.frq
  CHROM	POS	N_ALLELES	N_CHR	{ALLELE:FREQ}
  20	1291018	2	4	G:0.75	A:0.25
  20	2300608	2	4	C:0.5	T:0.5
  20	2301308	2	2	T:0	G:1

5. Validate the AF (Allele Frequency) output above against the AC (Allele Count) output:

Line 1:
  AF for "G" = AC for "G" / N_CHR = 3/4 = 75%
  AF for "A" = AC for "A" / N_CHR = 1/4 = 25%

Line 2:
  AF for "C" = AC for "C" / N_CHR = 2/4 = 50%
  AF for "T" = AC for "T" / N_CHR = 2/4 = 50%
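
The division above is easy to express in code. The following Python sketch uses a hypothetical helper name, allele_frequencies, and takes the allele counts from out.frq.count as input. Returning None when no alleles were called is a choice made for this sketch only; I have not checked what "vcftools" prints in that case.

```python
def allele_frequencies(counts):
    """Convert allele counts into allele frequencies (AF = AC / N_CHR)."""
    n_chr = sum(counts.values())
    if n_chr == 0:
        # No called alleles at this site: frequency is undefined in this sketch.
        return {allele: None for allele in counts}
    return {allele: count / n_chr for allele, count in counts.items()}

print(allele_frequencies({"G": 3, "A": 1}))  # line 1: {'G': 0.75, 'A': 0.25}
print(allele_frequencies({"C": 2, "T": 2}))  # line 2: {'C': 0.5, 'T': 0.5}
```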

6. Calculate AC and AF values with "vcftools" on the second VCF sample file provided in the previous tutorial.

herong$ more sample-2.vcf
...
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003
20	14370	.	G	A	.	.	NS=3;DP=14;AF=0.5	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	17330	.	T	A	.	.	NS=3;DP=11;AF=0.02	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3
20	1110696	.	A	G,T	.	.	NS=2;DP=10;AF=0.33,0.67	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4
20	1230237	.	T	.	.	.	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:7:56,60	0|0:48:4:51,51	0/0:61:2
20	1234567	.	GTC	G,GTCT	.	.	NS=3;DP=9;AA=G	GT:GQ:DP	0/1:35:4	0/2:17:2	1/1:40:3

herong$ vcftools --vcf sample-2.vcf --counts

herong$ more out.frq.count
  CHROM	POS	N_ALLELES	N_CHR	{ALLELE:COUNT}
  20	14370	2	6	G:3	A:3
  20	17330	2	6	T:5	A:1
  20	1110696	3	6	A:0	G:2	T:4
  20	1230237	1	6	T:6
  20	1234567	3	6	GTC:2	G:3	GTCT:1

herong$ vcftools --vcf sample-2.vcf --freq

herong$ more out.frq
  CHROM	POS	N_ALLELES	N_CHR	{ALLELE:FREQ}
  20	14370	2	6	G:0.5	A:0.5
  20	17330	2	6	T:0.833333	A:0.166667
  20	1110696	3	6	A:0	G:0.333333	T:0.666667
  20	1230237	1	6	T:1
  20	1234567	3	6	GTC:0.333333	G:0.5	GTCT:0.166667

Note that the AF values above are calculated purely from GT (Genotype) counts across all samples. The DP (Depth: number of fragment reads) of each sample is not considered at all.
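
To make the "GT only" point concrete, here is a rough Python sketch that rebuilds one out.frq.count row from a full VCF data line. The function name frq_count_row is hypothetical, and the code is my own illustration rather than how "vcftools" is implemented. It looks up the position of GT in the FORMAT column and never reads GQ, DP, or HQ.

```python
import re

def frq_count_row(vcf_line):
    """Build an out.frq.count-style row from one tab-separated VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
    alleles = [ref] + [a for a in alt.split(",") if a != "."]
    gt_index = fields[8].split(":").index("GT")  # locate GT inside FORMAT
    counts = [0] * len(alleles)
    for sample in fields[9:]:
        gt = sample.split(":")[gt_index]  # GQ, DP and HQ are never used
        for index in re.split(r"[/|]", gt):
            if index != ".":  # assumption: skip missing calls
                counts[int(index)] += 1
    pairs = [f"{a}:{c}" for a, c in zip(alleles, counts)]
    return "\t".join([chrom, pos, str(len(alleles)), str(sum(counts))] + pairs)

line = ("20\t1110696\t.\tA\tG,T\t.\t.\tNS=2;DP=10;AF=0.33,0.67\t"
        "GT:GQ:DP:HQ\t1|2:21:6:23,27\t2|1:2:0:18,2\t2/2:35:4")
print(frq_count_row(line))  # 20, 1110696, 3, 6, A:0, G:2, T:4 (tab-separated)
```

The multi-allelic site needs no special handling, because each GT index simply selects an entry in the list of REF and ALT alleles.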

In other words, if each sample represents a different individual in a population, the AF value of allele X (or nucleotide X) above estimates how frequently X appears on a chromosome in the diploid population represented by those samples. This assumes that all cells of an individual carry identical sets of chromosome pairs.

If a given sample contains a mixture of normal cells and mutated cells, allele counts at the DNA fragment read (or depth) level must be used to calculate the AF value of allele X for that sample. This AF value estimates how frequently X appears on a chromosome across all cells in a single sample. This assumes that cells may carry different sets of chromosome pairs due to mutation.
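
The difference between the two views can be sketched in Python. The read counts below are hypothetical numbers made up for illustration; the sample files in this tutorial do not carry per-allele read counts. Both function names are my own.

```python
def genotype_level_af(gt, allele_index):
    """Fraction of called alleles in one GT string that equal allele_index."""
    called = [a for a in gt.replace("|", "/").split("/") if a != "."]
    return called.count(str(allele_index)) / len(called)

def read_level_af(read_counts, allele):
    """Fraction of reads supporting an allele; read_counts maps allele -> reads."""
    return read_counts[allele] / sum(read_counts.values())

# Hypothetical mixed sample: genotype called as 0/1, but only 10 of 100
# reads carry the alternate allele "A".
reads = {"G": 90, "A": 10}
print(genotype_level_af("0/1", 1))  # 0.5
print(read_level_af(reads, "A"))    # 0.1
```

The genotype-level value can only be 0, 0.5, or 1 for a diploid call, while the read-level value can reflect any proportion of mutated cells.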

Also note that some data lines include AF values, for the alternate alleles only, that are calculated in the same way as "vcftools" calculates them. For example, the VCF file above reports AF=0.33,0.67 for alternate alleles G and T, which matches the "vcftools" output well:

VCF data line:
20	1110696	.	A	G,T	.	.	NS=2;DP=10;AF=0.33,0.67	GT:GQ:DP:HQ	1|2:21:6:23,27	...

vcftools output:
20	1110696	3	6	A:0	G:0.333333	T:0.666667
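
This comparison can also be scripted. The Python sketch below uses two hypothetical helpers, info_af and matches, with a tolerance of 0.01 that I picked arbitrarily to absorb rounding. The inputs are taken from the INFO column in the sample-2.vcf listing and from the out.frq output shown earlier.

```python
import math

def info_af(info):
    """Return the AF values from a VCF INFO string as a list of floats."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "AF":
            return [float(v) for v in value.split(",")]
    return []  # this data line carries no AF entry

def matches(reported, computed, tol=0.01):
    """True if both lists have the same length and agree within tol."""
    return len(reported) == len(computed) and all(
        math.isclose(r, c, abs_tol=tol) for r, c in zip(reported, computed)
    )

# Position 1110696: reported AF agrees with the computed G and T frequencies.
print(matches(info_af("NS=2;DP=10;AF=0.33,0.67"), [0.333333, 0.666667]))  # True
# Position 17330: reported AF=0.02 differs from the computed A frequency.
print(matches(info_af("NS=3;DP=11;AF=0.02"), [0.166667]))                 # False
```

The second check shows why only some data lines agree: the AF value in the INFO column is whatever the file's producer wrote, and it need not be derived from the genotypes listed in the same file.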

In conclusion, VCF files are used to store genome sequence variant data collected from multiple individuals sampled from a given population. The main statistical value, AF (Allele Frequency), is calculated from the counts of different genotypes across all samples.
