Impact of 'fpSize' on RDKFingerprint()

This section provides a tutorial example on the impact of the 'fpSize' option on fingerprint generation with the rdkit.Chem.rdmolops.RDKFingerprint() function.

The 'fpSize' option of the rdkit.Chem.rdmolops.RDKFingerprint() function controls the size (length in bits) of the fingerprint. The default is fpSize=2048, which can hold up to 1024 subgraphs with nBitsPerHash=2 in the ideal case where no two subgraphs share a bit.

Generating shorter fingerprints saves storage space, but it also reduces the accuracy of the final fingerprints.
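
The trade-off above can be put into numbers with a little arithmetic. The sketch below is only an illustration: fingerprint_budget() is a hypothetical helper (it is not part of RDKit), it counts raw bits only, and it ignores any overhead of the object or file format used to store the fingerprint.

```python
def fingerprint_budget(fp_size, n_bits_per_hash=2):
    """Hypothetical helper: best-case capacity and raw size of a fingerprint.

    Returns a tuple of
      - the number of subgraphs that fit if no two of them share a bit
      - the number of bytes needed to hold fp_size raw bits
    """
    max_subgraphs = fp_size // n_bits_per_hash
    raw_bytes = (fp_size + 7) // 8  # round up to whole bytes
    return max_subgraphs, raw_bytes


for fp_size in (16, 64, 2048):
    max_subgraphs, raw_bytes = fingerprint_budget(fp_size)
    print(fp_size, max_subgraphs, raw_bytes)

# output
# 16 8 2
# 64 32 8
# 2048 1024 256
```

The best-case capacity is an upper bound. In practice different subgraphs can land on the same bit well before the fingerprint is full, as the examples below show.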

1. For example, "CCCCCCCC" and "OOOOOOOO" are two completely different molecules that should have zero similarity. But with fpSize=16, their fingerprints share 3 common bits, which gives a similarity score of 0.27.

from rdkit import Chem
from rdkit import DataStructs
mol = Chem.MolFromSmiles('CCCCCCCC')
fp1 = Chem.RDKFingerprint(mol, fpSize=16, nBitsPerHash=1)
print(fp1.ToBitString())

mol = Chem.MolFromSmiles('OOOOOOOO')
fp2 = Chem.RDKFingerprint(mol, fpSize=16, nBitsPerHash=1)
print(fp2.ToBitString())

print(DataStructs.FingerprintSimilarity(fp1, fp2))

# output 
0100110001011100
1100001001110010
 ^       ^ ^
0.2727272727272727

The main reason two different subgraphs turn on the same bit is that the fingerprint is not long enough to give each of them a separate bit.
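
The 0.27 score can be reproduced by hand from the two bit strings printed above. The sketch below assumes the score is the Tanimoto coefficient (the default metric of DataStructs.FingerprintSimilarity()), that is, the number of shared "on" bits divided by the number of bits that are "on" in either fingerprint. The helper names on_bits() and tanimoto() are made up for this illustration and use plain Python only.

```python
def on_bits(bit_string):
    """Return the set of positions that hold a '1' in the bit string."""
    return {i for i, ch in enumerate(bit_string) if ch == "1"}


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on bits / on bits in either string."""
    a, b = on_bits(bits_a), on_bits(bits_b)
    union = a | b
    # Returning 0.0 for two empty fingerprints is a choice made for this sketch.
    return len(a & b) / len(union) if union else 0.0


fp1_bits = "0100110001011100"  # CCCCCCCC, fpSize=16
fp2_bits = "1100001001110010"  # OOOOOOOO, fpSize=16

print(sorted(on_bits(fp1_bits) & on_bits(fp2_bits)))  # shared bit positions
print(tanimoto(fp1_bits, fp2_bits))                   # 3 / (7 + 7 - 3) = 3/11

# output
# [1, 9, 11]
# about 0.2727
```

Each fingerprint has 7 bits turned on and 3 of them coincide, so the score is 3/11 even though the two molecules have no subgraph in common.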

2. Now if we use fpSize=64, the fingerprints of "CCCCCCCC" and "OOOOOOOO" share only 1 bit. That lowers the similarity score to 0.08, which is much closer to the expected value of zero.

mol = Chem.MolFromSmiles('CCCCCCCC')
fp1 = Chem.RDKFingerprint(mol, fpSize=64, nBitsPerHash=1)
print(fp1.ToBitString())

mol = Chem.MolFromSmiles('OOOOOOOO')
fp2 = Chem.RDKFingerprint(mol, fpSize=64, nBitsPerHash=1)
print(fp2.ToBitString())

print(DataStructs.FingerprintSimilarity(fp1, fp2))

# output 
0000000000000100000011000000100001000000000100000000000001000000
0000000000010000000000100100000001000000000000001000000000100010
                                 ^
0.07692307692307693
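
The drop from 3 shared bits to 1 is roughly what chance alone would predict. The sketch below is a back-of-the-envelope estimate, not a description of how RDKit hashes subgraphs: it assumes the "on" bits of each fingerprint are spread uniformly at random over the fpSize positions. Under that assumption, two unrelated fingerprints with a and b bits turned on are expected to share a*b/fpSize bits. The function name expected_shared_bits() is made up for this illustration.

```python
def expected_shared_bits(n_on_a, n_on_b, fp_size):
    """Expected number of coinciding on bits for two unrelated fingerprints,
    assuming their on bits are placed uniformly at random (an idealization)."""
    return n_on_a * n_on_b / fp_size


# Both molecules in the examples above turn on 7 bits.
for fp_size in (16, 64, 2048):
    print(fp_size, expected_shared_bits(7, 7, fp_size))

# output
# 16 3.0625
# 64 0.765625
# 2048 0.02392578125
```

These estimates are close to the 3 shared bits observed with fpSize=16 and the 1 shared bit observed with fpSize=64. With the default fpSize=2048, an accidental shared bit between these two molecules would be rare under the same assumption.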

Conclusion: fpSize should be made as large as storage allows, so that different subgraphs are less likely to turn on the same bit in the fingerprint.
