Impact of 'nBitsPerHash' on RDKFingerprint()

This section provides a tutorial example on the impact of the 'nBitsPerHash' option on fingerprint generation with the rdkit.Chem.rdmolops.RDKFingerprint() function.

The 'nBitsPerHash' option in the rdkit.Chem.rdmolops.RDKFingerprint() function call lets you control the number of bits each subgraph turns on in the fingerprint. The default value, nBitsPerHash=2, specifies that 2 bits are turned on for each subgraph.

Based on the algorithm used in the rdkit.Chem.rdmolops.RDKFingerprint() function, if the fingerprint is long enough, turning on 1 bit for each subgraph is usually sufficient: different subgraphs have different hash values, so they are unlikely to land on the same bit. However, if the fingerprint is short and the molecule has many unique subgraphs, different subgraphs may collide on a single bit in the fingerprint.

This is why RDKit sets nBitsPerHash=2 as the default. The likelihood of 2 different subgraphs colliding on both bits is much lower than colliding on a single bit.
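The argument above can be checked with a small calculation. The sketch below assumes an idealized model in which every bit a subgraph turns on is drawn independently and uniformly from the fingerprint. That model is an assumption made for illustration, not a description of how RDKFingerprint() actually picks bits, and the helper name prob_same_bits() is hypothetical. It enumerates all possible bit choices for two subgraphs and counts how often both end up with exactly the same set of bits.

```python
from fractions import Fraction
from itertools import product


def prob_same_bits(fp_size, bits_per_hash):
    """Exact probability that two subgraphs set an identical set of bits.

    Assumes (idealized) that each bit is chosen independently and uniformly
    from range(fp_size). Brute-force enumeration, so keep fp_size small.
    """
    choices = list(product(range(fp_size), repeat=bits_per_hash))
    same = sum(1 for a in choices for b in choices if set(a) == set(b))
    return Fraction(same, len(choices) ** 2)


if __name__ == "__main__":
    for k in (1, 2):
        p = prob_same_bits(16, k)
        print("fpSize=16, nBitsPerHash={0}: {1} ~ {2:.5f}".format(k, p, float(p)))
```

Under this model, the chance that two subgraphs become indistinguishable drops from 1/n with 1 bit per subgraph to (2n-1)/n^3 with 2 bits per subgraph, where n is the fingerprint size.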

1. For example, "C(Oc1ccccc1C(=O)O)C" represents 2-ethoxybenzoic acid, a close structural analog of Aspirin (Aspirin itself is "CC(=O)Oc1ccccc1C(=O)O"). If we use the "nBitsPerHash=1" and "fpSize=2048" options, the resulting fingerprint has a low density of 0.07568359375, yet we can still easily find many subgraph collisions:

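from rdkit import Chem

atomBits = []
bitInfo = {}
mol = Chem.MolFromSmiles('C(Oc1ccccc1C(=O)O)C')
fp = Chem.RDKFingerprint(mol, fpSize=2048, nBitsPerHash=1,
  atomBits=atomBits, bitInfo=bitInfo)
print(fp.GetNumOnBits())
print(fp.GetNumBits())
print(fp.GetNumOnBits()/fp.GetNumBits())

# bitInfo maps each on bit to the bond paths (lists of bond indices)
# that set it. Paths of different lengths on the same bit must be
# different subgraphs, so they indicate a collision.
for bit in bitInfo:
  graphs = bitInfo[bit]
  length = len(graphs[0])
  for g in graphs:
    if len(g) != length:
      print("Bit: {0}, Exp: {1}, Act: {2}".format(bit, length, len(g)))
      print(graphs)  # in a Jupyter notebook, display(graphs) also works
      break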

# output: 
155
2048
0.07568359375

Bit: 215, Exp: 5, Act: 6
[[0, 1, 11, 6, 5], [0, 1, 2, 3, 4], [2, 3, 4, 11, 7, 9], [3, 4, 5, 6, 7, 9]]

Bit: 437, Exp: 5, Act: 7
[[1, 11, 7, 8, 6], [0, 1, 11, 6, 5, 4, 2], [0, 1, 11, 2, 3, 4, 5]]

Bit: 760, Exp: 3, Act: 7
[[6, 7, 8], [7, 8, 11], [0, 1, 11, 7, 9, 6, 2]]

Bit: 929, Exp: 2, Act: 6
[[8, 9], [0, 1, 11, 7, 6, 2]]

...

As you can see from the output, the fingerprint uses only 155 of its 2048 bits, but different subgraphs still collide on the same bit. For example, the subgraph of bonds [8, 9] and the subgraph of bonds [0, 1, 11, 7, 6, 2] both turn on bit 929.
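The collision check can also be packaged as a reusable helper that reports every path length seen on a bit, not just the first mismatch. The sketch below is a minimal version under one assumption: it treats bond paths of different lengths on the same bit as a collision, which misses collisions between different subgraphs of the same length. The function name find_mixed_length_bits() is hypothetical, and the sample data is copied from the 4 collisions listed in the output above so that no RDKit installation is needed to try it.

```python
def find_mixed_length_bits(bit_info):
    """Return {bit: sorted path lengths} for bits set by paths of different lengths."""
    collisions = {}
    for bit, paths in bit_info.items():
        lengths = sorted({len(path) for path in paths})
        if len(lengths) > 1:
            collisions[bit] = lengths
    return collisions


# The 4 colliding bits shown in the output above (bond index lists).
sample_bit_info = {
    215: [[0, 1, 11, 6, 5], [0, 1, 2, 3, 4], [2, 3, 4, 11, 7, 9], [3, 4, 5, 6, 7, 9]],
    437: [[1, 11, 7, 8, 6], [0, 1, 11, 6, 5, 4, 2], [0, 1, 11, 2, 3, 4, 5]],
    760: [[6, 7, 8], [7, 8, 11], [0, 1, 11, 7, 9, 6, 2]],
    929: [[8, 9], [0, 1, 11, 7, 6, 2]],
}

print(find_mixed_length_bits(sample_bit_info))
# {215: [5, 6], 437: [5, 7], 760: [3, 7], 929: [2, 6]}
```

The same function can be called on the bitInfo dictionary filled in by RDKFingerprint().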

2. Now let's look at a much simpler molecule, Benzene ("c1ccccc1"). We can get a perfect fingerprint with "nBitsPerHash=1" and "fpSize=128". All 6 unique subgraphs are represented by 6 different bits.

from rdkit import Chem
atomBits = []
bitInfo = {}
mol = Chem.MolFromSmiles('c1ccccc1')
fp = Chem.RDKFingerprint(mol, fpSize=128, nBitsPerHash=1,
  atomBits=atomBits, bitInfo=bitInfo)
print(fp.GetNumOnBits())
print(fp.GetNumBits())
print(fp.GetNumOnBits()/fp.GetNumBits())
print(bitInfo)

# output: 
6
128
0.046875
{ 7: [[0, 1, 2, 3, 4, 5]], 
 33: [[0, 1, 2], [0, 1, 5], [0, 5, 4], [1, 2, 3], [2, 3, 4], [3, 4, 5]], 
 38: [[0, 1], [0, 5], [1, 2], [2, 3], [3, 4], [4, 5]], 
 66: [[0, 1, 2, 3], [0, 1, 2, 5], [0, 1, 5, 4], [0, 5, 4, 3], ...
 74: [[0], [1], [2], [3], [4], [5]], 
 97: [[0, 1, 2, 3, 4], [0, 1, 2, 3, 5], [0, 1, 2, 5, 4], ...
}

3. But if we reduce the fingerprint length to 64, the fingerprint is no longer perfect. The 3-bond path "cccc" and the 5-bond path "cccccc" are both represented by the same bit, 33. In other words, "cccc" and "cccccc" are indistinguishable in the fingerprint.

from rdkit import Chem
atomBits = []
bitInfo = {}
mol = Chem.MolFromSmiles('c1ccccc1')
fp = Chem.RDKFingerprint(mol, fpSize=64, nBitsPerHash=1,
  atomBits=atomBits, bitInfo=bitInfo)
print(fp.GetNumOnBits())
print(fp.GetNumBits())
print(fp.GetNumOnBits()/fp.GetNumBits())
print(bitInfo)

# output: 
5
64
0.078125
{ 2: [[0, 1, 2, 3], [0, 1, 2, 5], [0, 1, 5, 4], [0, 5, 4, 3], ...
  7: [[0, 1, 2, 3, 4, 5]], 
 10: [[0], [1], [2], [3], [4], [5]], 
 33: [[0, 1, 2], [0, 1, 5], [0, 5, 4], [1, 2, 3], [2, 3, 4], [3, 4, 5], 
      [0, 1, 2, 3, 4], [0, 1, 2, 3, 5], [0, 1, 2, 5, 4], ...
 38: [[0, 1], [0, 5], [1, 2], [2, 3], [3, 4], [4, 5]]}
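Comparing the fpSize=128 and fpSize=64 outputs for Benzene shows where the new collision comes from: in this example, each bit index at fpSize=64 equals the fpSize=128 index modulo 64, so bits 33 and 97 fall onto the same position. The sketch below relies only on that observation from the two outputs above. Whether the same relation holds for other sizes, options, or RDKit versions is not established here, and the helper name fold() is hypothetical.

```python
from collections import defaultdict

# Bit -> number of bonds in the paths that set it, read from the
# fpSize=128 output above (one path length per bit, no collisions).
BITS_128 = {7: 6, 33: 3, 38: 2, 66: 4, 74: 1, 97: 5}


def fold(bit_to_length, new_size):
    """Group path lengths by bit index reduced modulo new_size."""
    folded = defaultdict(list)
    for bit, length in bit_to_length.items():
        folded[bit % new_size].append(length)
    return {bit: sorted(lengths) for bit, lengths in sorted(folded.items())}


folded_64 = fold(BITS_128, 64)
print(folded_64)
# {2: [4], 7: [6], 10: [1], 33: [3, 5], 38: [2]}
```

The folded result matches the fpSize=64 output: bit 33 now carries both the 3-bond and the 5-bond paths, while the other 4 bits still carry one path length each.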

4. Now if we change "nBitsPerHash" to 2, the fingerprint becomes almost perfect again. There are 3 pairs of subgraphs that share 1 of their 2 bits, but all subgraphs are still distinguishable in the fingerprint. For example, "cccc" is represented by bits 33 and 56, while "cccccc" is represented by bits 24 and 33.

from rdkit import Chem
atomBits = []
bitInfo = {}
mol = Chem.MolFromSmiles('c1ccccc1')
fp = Chem.RDKFingerprint(mol, fpSize=64, nBitsPerHash=2,
  atomBits=atomBits, bitInfo=bitInfo)
print(fp.GetNumOnBits())
print(fp.GetNumBits())
print(fp.GetNumOnBits()/fp.GetNumBits())
print(bitInfo)

# output:
9
64
0.140625
{ 2: [[0], [1], [2], [3], [4], [5], 
      [0, 1, 2, 3], [0, 1, 2, 5], [0, 1, 5, 4], [0, 5, 4, 3], ...
  7: [[0, 1, 2, 3, 4, 5]], 
 10: [[0], [1], [2], [3], [4], [5], 
      [0, 1], [0, 5], [1, 2], [2, 3], [3, 4], [4, 5]], 
 24: [[0, 1, 2, 3, 4], [0, 1, 2, 3, 5], [0, 1, 2, 5, 4], ...
 33: [[0, 1, 2], [0, 1, 5], [0, 5, 4], [1, 2, 3], [2, 3, 4], [3, 4, 5], 
      [0, 1, 2, 3, 4], [0, 1, 2, 3, 5], [0, 1, 2, 5, 4], ...
 38: [[0, 1], [0, 5], [1, 2], [2, 3], [3, 4], [4, 5]], 
 39: [[0, 1, 2, 3], [0, 1, 2, 5], [0, 1, 5, 4], [0, 5, 4, 3], ...
 51: [[0, 1, 2, 3, 4, 5]], 
 56: [[0, 1, 2], [0, 1, 5], [0, 5, 4], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
}
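To read the bit pattern of each subgraph off this output, the bit-to-paths mapping can be inverted. The sketch below groups bits by path length, which works as a stand-in for "subgraph" only because all Benzene paths of the same length are equivalent; that shortcut is an assumption of this example. The data holds one representative path per bit and path length, copied from the output above (elided entries are not filled in), and the helper name bits_by_path_length() is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# One representative path per (bit, path length) from the
# nBitsPerHash=2, fpSize=64 output above.
BIT_INFO = {
    2: [[0], [0, 1, 2, 3]],
    7: [[0, 1, 2, 3, 4, 5]],
    10: [[0], [0, 1]],
    24: [[0, 1, 2, 3, 4]],
    33: [[0, 1, 2], [0, 1, 2, 3, 4]],
    38: [[0, 1]],
    39: [[0, 1, 2, 3]],
    51: [[0, 1, 2, 3, 4, 5]],
    56: [[0, 1, 2]],
}


def bits_by_path_length(bit_info):
    """Invert bit -> paths into path length -> set of bits."""
    result = defaultdict(set)
    for bit, paths in bit_info.items():
        for path in paths:
            result[len(path)].add(bit)
    return dict(result)


patterns = bits_by_path_length(BIT_INFO)
for length in sorted(patterns):
    print(length, sorted(patterns[length]))

# Pairs of path lengths that share a bit, with the shared bit(s).
shared = [(a, b, sorted(patterns[a] & patterns[b]))
          for a, b in combinations(sorted(patterns), 2)
          if patterns[a] & patterns[b]]
print(shared)
# [(1, 2, [10]), (1, 4, [2]), (3, 5, [33])]
```

Each path length ends up with its own pair of bits, and exactly 3 pairs of path lengths share a single bit, which is why no two subgraphs are confused.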

Conclusion: Use nBitsPerHash=2 or higher so that different subgraphs are more likely to get distinct bit patterns in the fingerprint. But don't set it too high, because every extra bit per subgraph raises the fingerprint density.
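The density side of this trade-off can be estimated with a short calculation. The sketch below again assumes an idealized model where each bit is chosen independently and uniformly, which is an assumption for illustration and not RDKit's actual bit selection. The function name expected_density() is hypothetical. It uses the Benzene numbers from this section: 6 unique subgraphs and fpSize=64.

```python
def expected_density(fp_size, n_subgraphs, bits_per_hash):
    """Expected fraction of on bits under idealized uniform, independent bit choices.

    A given bit stays off only if all n_subgraphs * bits_per_hash draws miss it.
    """
    return 1.0 - (1.0 - 1.0 / fp_size) ** (n_subgraphs * bits_per_hash)


for k in (1, 2, 4, 8):
    print("nBitsPerHash={0}: {1:.3f}".format(k, expected_density(64, 6, k)))
```

For nBitsPerHash=1 and 2 the model gives roughly 0.09 and 0.17, in the same range as the observed densities of 0.078125 and 0.140625 above, and the estimate keeps climbing as nBitsPerHash grows.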
