Impact of 'tgtDensity' on RDKFingerprint()

This section provides a tutorial example on the impact of the 'tgtDensity' option on fingerprint generation with the rdkit.Chem.rdmolops.RDKFingerprint() function.

The 'tgtDensity' option in the rdkit.Chem.rdmolops.RDKFingerprint() function call allows you to control the folding behavior in the fingerprint generation process. The default value, tgtDensity=0, specifies no folding, and the related option, minSize, is ignored.

When tgtDensity>0 is specified, the density and the length of the initial fingerprint generated from subgraphs will be examined. If the density is lower than tgtDensity and the length is greater than minSize, the fingerprint will be folded by half. The density and the length are then examined again, and the fingerprint may be folded again. This repeats until the density is >= tgtDensity or the length is <= minSize.

Note that the folding operation is carried out by cutting the fingerprint into 2 bit strings of the same length, then joining them into a single bit string using the bitwise OR operation.

By the way, fingerprint density is defined as the ratio of the number of '1' bits over the number of all bits.
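The folding procedure described above can be sketched in plain Python on a bit string. This is only an illustration of the steps as this section describes them, not RDKit's actual code. The helper names density(), fold_once() and fold_to_density() are made up for this example, and RDKit's own stopping test may differ in detail.

```python
def density(bits):
    """Ratio of '1' bits to all bits."""
    return bits.count("1") / len(bits)


def fold_once(bits):
    """Cut the bit string into two halves and OR them together.

    Assumes an even length, as with power-of-2 fingerprint sizes.
    """
    half = len(bits) // 2
    first, second = bits[:half], bits[half:]
    return "".join("1" if a == "1" or b == "1" else "0"
                   for a, b in zip(first, second))


def fold_to_density(bits, tgt_density, min_size):
    """Fold by half while the density is below tgt_density
    and the length is still greater than min_size."""
    while density(bits) < tgt_density and len(bits) > min_size:
        bits = fold_once(bits)
    return bits


# 64-bit fingerprint of "CCCCCCCC", copied from example 1 in this section
octane = "0000000000000100000011000000100001000000000100000000000001000000"
folded = fold_to_density(octane, tgt_density=0.15, min_size=16)
print(folded)           # 01000000000101000000110001001000
print(density(folded))  # 0.21875
```

With tgt_density=0.0 the loop body never runs, because a density can never be below 0. This matches the "no folding" behavior of the default value.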

Requiring a higher density can result in shorter fingerprints, which saves data storage. But it also reduces the accuracy of the final fingerprints, because different bits of the original fingerprint may be merged into one.
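The loss of accuracy can be seen on a toy case: two different unfolded fingerprints may become identical once folded, so they can no longer be told apart. The 8-bit strings below are invented for illustration and do not come from RDKit. The helper fold_once() is also a made-up name.

```python
def fold_once(bits):
    """Cut the bit string into two halves and OR them together."""
    half = len(bits) // 2
    return "".join("1" if a == "1" or b == "1" else "0"
                   for a, b in zip(bits[:half], bits[half:]))


# Two made-up 8-bit fingerprints that differ in one bit position
fp_a = "10000100"
fp_b = "10001100"

print(fp_a == fp_b)                        # False: distinct before folding
print(fold_once(fp_a) == fold_once(fp_b))  # True: identical after folding
```

Both strings fold to "1100", so the extra bit in fp_b is no longer visible in the 4-bit result.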

1. For example, "CCCCCCCC" will have a 64-bit fingerprint without any folding, if "tgtDensity=0.0" is used. The fingerprint density is 7/64 = 0.109375.

from rdkit import Chem
mol = Chem.MolFromSmiles('CCCCCCCC')
fp = Chem.RDKFingerprint(mol, fpSize=64, nBitsPerHash=1,
  tgtDensity=0.0, minSize=64)
print(fp.ToBitString())
print(fp.GetNumOnBits()/fp.GetNumBits())

# output 
0000000000000100000011000000100001000000000100000000000001000000
0.109375

2. Now if we use "tgtDensity=0.15" and "minSize=16", the fingerprint will be folded once. The length is reduced to 32 and the density is increased to 7/32 = 0.21875.

mol = Chem.MolFromSmiles('CCCCCCCC')
fp = Chem.RDKFingerprint(mol, fpSize=64, nBitsPerHash=1,
  tgtDensity=0.15, minSize=16)
print(fp.ToBitString())
print(fp.GetNumOnBits()/fp.GetNumBits())

# output 
01000000000101000000110001001000
0.21875

As you can see, no "1" bit is lost in this folding operation. All 7 "1" bits are maintained, because no bit position is set in both halves of the unfolded fingerprint.
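One way to confirm this is to AND the two halves of the unfolded fingerprint: a result of zero means no position is set in both halves. Here is a small check in plain Python, using the 64-bit string printed in example 1. It is a sketch for illustration and does not call RDKit.

```python
# 64-bit fingerprint of "CCCCCCCC", copied from the output of example 1
unfolded = "0000000000000100000011000000100001000000000100000000000001000000"
half = len(unfolded) // 2

# Interpret each half as an integer so that & and | work bit by bit
first = int(unfolded[:half], 2)
second = int(unfolded[half:], 2)

overlap = first & second
print(overlap == 0)  # True when no bit position is set in both halves

# Rebuild the folded bit string, padded with leading zeros to 32 characters
folded = format(first | second, "0{}b".format(half))
print(folded)
print(folded.count("1") == unfolded.count("1"))
```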

3. Now let's try the molecule "NNNNNNNN". The folding operation will result in a loss of 1 bit.

mol = Chem.MolFromSmiles('NNNNNNNN')
fp = Chem.RDKFingerprint(mol, fpSize=64, nBitsPerHash=1,
  tgtDensity=0.0, minSize=64)
print(fp.ToBitString())
print(fp.GetNumOnBits()/fp.GetNumBits())

# output 
0000001000000000100100100000000000000010000001000000000100000000
0.109375

mol = Chem.MolFromSmiles('NNNNNNNN')
fp = Chem.RDKFingerprint(mol, fpSize=64, nBitsPerHash=1,
  tgtDensity=0.15, minSize=16)
print(fp.ToBitString())
print(fp.GetNumOnBits()/fp.GetNumBits())

# output 
00000010000001001001001100000000
0.1875

The folding operation is shown below with the lost bit identified.

00000010000000001001001000000000
00000010000001000000000100000000 OR
================================
00000010000001001001001100000000
      ^ loss
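The lost bit can also be located programmatically by listing the positions that are set in both halves of the unfolded fingerprint. Here is a short sketch in plain Python. The helper name colliding_positions() is made up for this example.

```python
def colliding_positions(bits):
    """0-based positions (within a half) that are '1' in both halves."""
    half = len(bits) // 2
    return [i for i, (a, b) in enumerate(zip(bits[:half], bits[half:]))
            if a == "1" and b == "1"]


# 64-bit fingerprint of "NNNNNNNN", copied from the output of example 3
unfolded = "0000001000000000100100100000000000000010000001000000000100000000"
lost = colliding_positions(unfolded)
print(lost)  # [6]

# Each collision merges two '1' bits into one
on_bits_after = unfolded.count("1") - len(lost)
print(on_bits_after / (len(unfolded) // 2))  # 0.1875
```

Position 6 (counting from 0) is the column marked with "^" above. The 7 "1" bits of the unfolded fingerprint become 6 after folding, which gives the density of 6/32 = 0.1875.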

Conclusion: We should keep the default of tgtDensity=0, which disables folding, to avoid any loss of information.
