Cheminformatics Tutorials - Herong's Tutorial Examples - v2.01, by Herong Yang
Impact of 'nBitsPerHash' on RDKFingerprint()
This section provides a tutorial example on the impact of the 'nBitsPerHash' option on fingerprint generation with the rdkit.Chem.rdmolops.RDKFingerprint() function.
The 'nBitsPerHash' option in the rdkit.Chem.rdmolops.RDKFingerprint() function call allows you to control the number of bits each subgraph turns on in the fingerprint. The default value, nBitsPerHash=2, specifies that 2 bits are turned on for each subgraph.
Based on the algorithm used in the rdkit.Chem.rdmolops.RDKFingerprint() function, if the fingerprint is long enough, we only need to turn on 1 bit for each subgraph. Different subgraphs are then unlikely to collide on a single bit in the fingerprint, because their hash values are different. However, if the fingerprint is short, and there are many unique subgraphs in the molecule, different subgraphs may collide on a single bit in the fingerprint.
This is why RDKit sets nBitsPerHash=2 as the default. The likelihood of 2 different subgraphs colliding on both bits is much lower than colliding on a single bit.
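To get a feel for the size of this effect, here is a minimal sketch that assumes every bit position is an independent, uniform draw. That is an idealized model and not RDKit's actual hashing scheme, and the function name full_collision_probability is hypothetical. It counts, by brute-force enumeration, how often two subgraphs end up with exactly the same set of bits.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def full_collision_probability(fp_size, n_bits_per_hash):
    """Probability that two subgraphs turn on exactly the same set of bits.

    Assumption: every bit position is an independent, uniform draw from
    range(fp_size). This is an idealized model, not RDKit's real hash.
    """
    # Enumerate every possible outcome for one subgraph as a set of bits.
    outcomes = Counter(
        frozenset(draw)
        for draw in product(range(fp_size), repeat=n_bits_per_hash)
    )
    total = sum(outcomes.values())
    # Two independent subgraphs collide fully when both land on the same set.
    matching = sum(count * count for count in outcomes.values())
    return Fraction(matching, total * total)


one_bit = full_collision_probability(64, 1)
two_bits = full_collision_probability(64, 2)
print(one_bit, float(one_bit))
print(two_bits, float(two_bits))
```

Under this model, a 64-bit fingerprint gives a full-collision probability of 1/64 with one bit per subgraph and 127/262144 (about 0.0005) with two bits per subgraph.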
1. For example, "C(Oc1ccccc1C(=O)O)C" represents 2-ethoxybenzoic acid, a molecule closely related to Aspirin (Aspirin itself is "CC(=O)Oc1ccccc1C(=O)O"). If we use the "nBitsPerHash=1" and "fpSize=2048" options, the resulting fingerprint has a very low density of 0.07568359375. But we can still easily find many subgraph collisions:
from rdkit import Chem

atomBits = []
bitInfo = {}
mol = Chem.MolFromSmiles('C(Oc1ccccc1C(=O)O)C')
fp = Chem.RDKFingerprint(mol, fpSize=2048, nBitsPerHash=1,
    atomBits=atomBits, bitInfo=bitInfo)
print(fp.GetNumOnBits())
print(fp.GetNumBits())
print(fp.GetNumOnBits()/fp.GetNumBits())

# Report each bit that is turned on by subgraphs (bond paths) of different sizes
for bit in bitInfo:
    graphs = bitInfo[bit]
    length = len(graphs[0])
    for g in graphs:
        if (len(g)!=length):
            print("Bit: {0}, Exp: {1}, Act: {2}".format(bit, length, len(g)))
            # display() is available in Jupyter; use print() elsewhere
            display(graphs)
            break

# output:
155
2048
0.07568359375
Bit: 215, Exp: 5, Act: 6
[[0, 1, 11, 6, 5], [0, 1, 2, 3, 4], [2, 3, 4, 11, 7, 9], [3, 4, 5, 6, 7, 9]]
Bit: 437, Exp: 5, Act: 7
[[1, 11, 7, 8, 6], [0, 1, 11, 6, 5, 4, 2], [0, 1, 11, 2, 3, 4, 5]]
Bit: 760, Exp: 3, Act: 7
[[6, 7, 8], [7, 8, 11], [0, 1, 11, 7, 9, 6, 2]]
Bit: 929, Exp: 2, Act: 6
[[8, 9], [0, 1, 11, 7, 6, 2]]
...
As you can see from the output, the fingerprint uses only 155 bits out of 2048 bits. But different subgraphs still collide on the same bit. For example, the subgraph of bonds [8, 9] and the subgraph of bonds [0, 1, 11, 7, 6, 2] turn on the same bit 929.
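The collision check used above can also be packaged as a small reusable helper. The sketch below is one possible way to do it. The function name bits_with_mixed_path_lengths is hypothetical, and it works on a plain dictionary shaped like bitInfo, so it runs without RDKit. Note that comparing path lengths only catches some collisions: two different subgraphs of the same size that share a bit would go unnoticed.

```python
def bits_with_mixed_path_lengths(bit_info):
    """Return {bit: sorted path lengths} for bits set by paths of different lengths."""
    mixed = {}
    for bit, paths in bit_info.items():
        lengths = sorted({len(path) for path in paths})
        if len(lengths) > 1:
            mixed[bit] = lengths
    return mixed


# Sample entries copied from the output above (bond-index paths per bit).
sample_bit_info = {
    760: [[6, 7, 8], [7, 8, 11], [0, 1, 11, 7, 9, 6, 2]],
    929: [[8, 9], [0, 1, 11, 7, 6, 2]],
    # Hypothetical non-colliding entry, added only for contrast.
    5: [[0, 1], [2, 3]],
}

mixed_bits = bits_with_mixed_path_lengths(sample_bit_info)
print(mixed_bits)  # prints {760: [3, 7], 929: [2, 6]}
```

In a real session, the bitInfo dictionary filled by RDKFingerprint() could be passed in place of sample_bit_info.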
2. Now let's look at a much simpler molecule, Benzene ("c1ccccc1"). We can get a perfect fingerprint with "nBitsPerHash=1" and "fpSize=128". All 6 unique subgraphs are represented by 6 different bits.
atomBits = []
bitInfo = {}
mol = Chem.MolFromSmiles('c1ccccc1')
fp = Chem.RDKFingerprint(mol, fpSize=128, nBitsPerHash=1,
    atomBits=atomBits, bitInfo=bitInfo)
print(fp.GetNumOnBits())
print(fp.GetNumBits())
print(fp.GetNumOnBits()/fp.GetNumBits())
print(bitInfo)

# output:
6
128
0.046875
{
 7: [[0, 1, 2, 3, 4, 5]],
 33: [[0, 1, 2], [0, 1, 5], [0, 5, 4], [1, 2, 3], [2, 3, 4], [3, 4, 5]],
 38: [[0, 1], [0, 5], [1, 2], [2, 3], [3, 4], [4, 5]],
 66: [[0, 1, 2, 3], [0, 1, 2, 5], [0, 1, 5, 4], [0, 5, 4, 3], ...
 74: [[0], [1], [2], [3], [4], [5]],
 97: [[0, 1, 2, 3, 4], [0, 1, 2, 3, 5], [0, 1, 2, 5, 4], ...
}
3. But if we reduce the fingerprint length to 64, the fingerprint is no longer perfect. Subgraphs "cccc" (paths of 3 bonds) and "cccccc" (paths of 5 bonds) are both represented by the same bit 33. In other words, "cccc" and "cccccc" are indistinguishable in the fingerprint.
atomBits = []
bitInfo = {}
mol = Chem.MolFromSmiles('c1ccccc1')
fp = Chem.RDKFingerprint(mol, fpSize=64, nBitsPerHash=1,
    atomBits=atomBits, bitInfo=bitInfo)
print(fp.GetNumOnBits())
print(fp.GetNumBits())
print(fp.GetNumOnBits()/fp.GetNumBits())
print(bitInfo)

# output:
5
64
0.078125
{
 2: [[0, 1, 2, 3], [0, 1, 2, 5], [0, 1, 5, 4], [0, 5, 4, 3], ...
 7: [[0, 1, 2, 3, 4, 5]],
 10: [[0], [1], [2], [3], [4], [5]],
 33: [[0, 1, 2], [0, 1, 5], [0, 5, 4], [1, 2, 3], [2, 3, 4], [3, 4, 5],
      [0, 1, 2, 3, 4], [0, 1, 2, 3, 5], [0, 1, 2, 5, 4], ...
 38: [[0, 1], [0, 5], [1, 2], [2, 3], [3, 4], [4, 5]]}
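Comparing the two Benzene runs above, every bit position in the 64-bit fingerprint matches a position in the 128-bit fingerprint taken modulo 64. The sketch below checks that observation using only the bit numbers printed above. Treat the modulo relation as an assumption that happens to fit these two outputs, not as a documented description of RDKit internals.

```python
# Bit positions reported in the two Benzene runs above (nBitsPerHash=1).
bits_128 = [7, 33, 38, 66, 74, 97]
bits_64 = [2, 7, 10, 33, 38]

# Assumption for illustration: a position in the shorter fingerprint is the
# position in the longer fingerprint modulo the new length.
folded = {}
for bit in bits_128:
    folded.setdefault(bit % 64, []).append(bit)

# Positions that receive more than one of the original bits.
merged = {new: old for new, old in folded.items() if len(old) > 1}

print(sorted(folded) == bits_64)  # prints True
print(merged)  # prints {33: [33, 97]}
```

Bits 33 and 97 of the longer fingerprint land on the same position 33, which is exactly where "cccc" and "cccccc" become indistinguishable.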
4. Now if we change "nBitsPerHash" to 2, the fingerprint becomes almost perfect again. There are 3 pairs of subgraphs that share 1 of their 2 bits. But they are all distinguishable in the fingerprint. For example, "cccc" is represented by bits 33 and 56, while "cccccc" is represented by bits 24 and 33.
atomBits = []
bitInfo = {}
mol = Chem.MolFromSmiles('c1ccccc1')
fp = Chem.RDKFingerprint(mol, fpSize=64, nBitsPerHash=2,
    atomBits=atomBits, bitInfo=bitInfo)
print(fp.GetNumOnBits())
print(fp.GetNumBits())
print(fp.GetNumOnBits()/fp.GetNumBits())
print(bitInfo)

# output:
9
64
0.140625
{
 2: [[0], [1], [2], [3], [4], [5],
     [0, 1, 2, 3], [0, 1, 2, 5], [0, 1, 5, 4], [0, 5, 4, 3], ...
 7: [[0, 1, 2, 3, 4, 5]],
 10: [[0], [1], [2], [3], [4], [5],
      [0, 1], [0, 5], [1, 2], [2, 3], [3, 4], [4, 5]],
 24: [[0, 1, 2, 3, 4], [0, 1, 2, 3, 5], [0, 1, 2, 5, 4], ...
 33: [[0, 1, 2], [0, 1, 5], [0, 5, 4], [1, 2, 3], [2, 3, 4], [3, 4, 5],
      [0, 1, 2, 3, 4], [0, 1, 2, 3, 5], [0, 1, 2, 5, 4], ...
 38: [[0, 1], [0, 5], [1, 2], [2, 3], [3, 4], [4, 5]],
 39: [[0, 1, 2, 3], [0, 1, 2, 5], [0, 1, 5, 4], [0, 5, 4, 3], ...
 51: [[0, 1, 2, 3, 4, 5]],
 56: [[0, 1, 2], [0, 1, 5], [0, 5, 4], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
}
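The claim that all subgraphs stay distinguishable can be checked by hand, or with a short script like the sketch below. The dictionary bits_by_bond_count is a hypothetical summary read off the output above: it maps each subgraph size (number of bonds) to the 2 bits it turns on.

```python
from itertools import combinations

# Bits per subgraph size (number of bonds), read off the output above
# for Benzene with fpSize=64 and nBitsPerHash=2.
bits_by_bond_count = {
    1: {2, 10},
    2: {10, 38},
    3: {33, 56},
    4: {2, 39},
    5: {24, 33},
    6: {7, 51},
}

# Pairs of subgraph sizes that share at least one bit.
shared = {
    (a, b): bits_by_bond_count[a] & bits_by_bond_count[b]
    for a, b in combinations(sorted(bits_by_bond_count), 2)
    if bits_by_bond_count[a] & bits_by_bond_count[b]
}

# Subgraphs stay distinguishable as long as no two have the same full bit set.
unique_bit_sets = {frozenset(bits) for bits in bits_by_bond_count.values()}
all_distinct = len(unique_bit_sets) == len(bits_by_bond_count)

print(shared)  # prints {(1, 2): {10}, (1, 4): {2}, (3, 5): {33}}
print(all_distinct)  # prints True
```

The 3 pairs sharing a bit are the 1-bond and 2-bond paths (bit 10), the 1-bond and 4-bond paths (bit 2), and the 3-bond and 5-bond paths (bit 33).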
Conclusion: We should use nBitsPerHash=2 or higher to improve the unique representation of subgraphs in the fingerprint. But don't go too high, because each extra bit per subgraph raises the fingerprint density.
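To see why a high nBitsPerHash value pushes the density up, here is a rough estimate under the assumption that every bit position is an independent, uniform draw. This is an idealization, not RDKit's actual behavior, and the function name expected_density is hypothetical.

```python
def expected_density(n_subgraphs, fp_size, n_bits_per_hash):
    """Idealized expected fraction of bits turned on.

    Assumption: each of the n_subgraphs * n_bits_per_hash bit positions is
    an independent, uniform draw from range(fp_size).
    """
    draws = n_subgraphs * n_bits_per_hash
    # A given bit stays off only if every draw misses it.
    return 1.0 - (1.0 - 1.0 / fp_size) ** draws


# 6 unique subgraphs in a 64-bit fingerprint, as in the Benzene example.
densities = [expected_density(6, 64, k) for k in (1, 2, 4, 8)]
for k, density in zip((1, 2, 4, 8), densities):
    print(k, round(density, 3))
```

The estimate grows steadily with nBitsPerHash. For comparison, the Benzene runs above reported densities of 0.078125 with nBitsPerHash=1 and 0.140625 with nBitsPerHash=2.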