Cheminformatics Tutorials - Herong's Tutorial Examples - v2.01, by Herong Yang
Impact of 'nBits' on GetMorganFingerprintAsBitVect()
This section provides a tutorial example on the impact of the 'nBits' option on fingerprint generation with the rdkit.Chem.rdMolDescriptors.GetMorganFingerprintAsBitVect() function.
The 'nBits' option in the rdkit.Chem.rdMolDescriptors.GetMorganFingerprintAsBitVect() function call controls the number of bits (the length) of the fingerprint. The default is nBits=2048, which can represent up to 2048 different identifiers in the ideal case where no two identifiers turn on the same bit.
Generating shorter fingerprints saves data storage, but it also reduces the accuracy of the final fingerprints.
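To put the storage side of this trade-off in numbers, here is a minimal sketch of the arithmetic. It assumes fingerprints are stored as raw packed bits (8 bits per byte) with no container or index overhead. The helper name raw_storage_bytes() and the library size of 1,000,000 molecules are hypothetical, chosen only for illustration.

```python
def raw_storage_bytes(n_bits, n_molecules):
    """Bytes needed for n_molecules fingerprints of n_bits each, packed 8 bits per byte."""
    bytes_per_fingerprint = (n_bits + 7) // 8  # round up to whole bytes
    return bytes_per_fingerprint * n_molecules

n_molecules = 1_000_000  # hypothetical library size (assumption)
for n_bits in (16, 64, 2048):
    print(n_bits, raw_storage_bytes(n_bits, n_molecules))
```

Under these assumptions, the default nBits=2048 needs 256 bytes per molecule, while nBits=64 needs only 8 bytes per molecule.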
1. For example, "CCCCCCCC" and "OOOOOOOO" are 2 completely different molecules that share no atoms or substructures, so their similarity should be zero. But their fingerprints share 3 common bits when nBits=16 is used, which gives a similarity score of 0.33.
from rdkit.Chem import AllChem
from rdkit import DataStructs

mol = AllChem.MolFromSmiles('CCCCCCCC')
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=16)
print(fp1.ToBitString())

mol = AllChem.MolFromSmiles('OOOOOOOO')
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=16)
print(fp2.ToBitString())

print(DataStructs.FingerprintSimilarity(fp1, fp2))

# output
1100101100100001
0110011100000000
 ^    ^^
0.3333333333333333
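The 0.33 score can be checked by hand from the two bit strings printed above. The sketch below needs no RDKit. It assumes the score is the Tanimoto similarity (shared on-bits divided by all on-bits in either fingerprint), which is the default metric of DataStructs.FingerprintSimilarity(). The helper names on_bits() and tanimoto() are made up for this illustration.

```python
# Bit strings copied from the nBits=16 output above.
bits1 = "1100101100100001"  # CCCCCCCC
bits2 = "0110011100000000"  # OOOOOOOO

def on_bits(bit_string):
    """Return the set of positions whose bit is '1'."""
    return {i for i, ch in enumerate(bit_string) if ch == "1"}

def tanimoto(a, b):
    """Tanimoto similarity of two sets of on-bit positions."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

a, b = on_bits(bits1), on_bits(bits2)
print(sorted(a & b))   # [1, 6, 7]
print(tanimoto(a, b))  # 0.3333333333333333
```

The first fingerprint has 7 bits on, the second has 5, and 3 are shared, so the score is 3 / (7 + 5 - 3) = 3 / 9.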
Two different identifiers turn on the same bit mainly because the fingerprint is too short to give each identifier its own bit.
2. Now if we use "nBits=64", "CCCCCCCC" and "OOOOOOOO" share only 1 bit in their fingerprints. That lowers the similarity score to 0.07, which is much closer to the expected value of zero.
mol = AllChem.MolFromSmiles('CCCCCCCC')
fp1 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=64)
print(fp1.ToBitString())

mol = AllChem.MolFromSmiles('OOOOOOOO')
fp2 = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=64)
print(fp2.ToBitString())

print(DataStructs.FingerprintSimilarity(fp1, fp2))

# output
0000000000000001100000000010000001001010000000000000000100000000
0000011000000000011000000000000000000001000000000100010100000000
                                                       ^
0.07142857142857142
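The two outputs above also show how a shorter fingerprint creates collisions. The sketch below, which needs no RDKit, takes the on-bit positions of each 64-bit fingerprint and reduces them modulo 16. The modulo folding rule is an assumption inferred from these outputs, not something stated in this tutorial, and the helper names on_bits() and fold() are made up for this illustration.

```python
# Bit strings copied from the nBits=16 and nBits=64 outputs above.
bits16 = {
    "CCCCCCCC": "1100101100100001",
    "OOOOOOOO": "0110011100000000",
}
bits64 = {
    "CCCCCCCC": "0000000000000001100000000010000001001010000000000000000100000000",
    "OOOOOOOO": "0000011000000000011000000000000000000001000000000100010100000000",
}

def on_bits(bit_string):
    """Return the set of positions whose bit is '1'."""
    return {i for i, ch in enumerate(bit_string) if ch == "1"}

def fold(positions, n_bits):
    """Fold bit positions into a shorter fingerprint (assumed rule: position modulo n_bits)."""
    return {p % n_bits for p in positions}

for smiles in bits64:
    wide = on_bits(bits64[smiles])
    narrow = on_bits(bits16[smiles])
    folded = fold(wide, 16)
    # The last value is True when folding reproduces the nBits=16 fingerprint.
    print(smiles, sorted(wide), "->", sorted(folded), folded == narrow)
```

For both molecules, the folded positions match the nBits=16 fingerprint exactly. For "OOOOOOOO", the 8 bits that are on at nBits=64 collapse into only 5 bits at nBits=16, because several positions land on the same bit after folding.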
Conclusion: We should make nBits large enough that different identifiers rarely turn on the same bit in the fingerprint. Larger fingerprints need more storage, so nBits is a trade-off between accuracy and fingerprint size.