Substructure Library in Binary and SMILES Formats

This section provides a quick introduction to building molecule libraries in binary and SMILES formats to reduce memory usage. Molecule libraries can also be serialized to external files, so they can be restored quickly later.

By default, rdkit.Chem.rdSubstructLibrary.SubstructLibrary stores each molecule as a full molecule object. This requires a large amount of memory if you add a large number of molecules to the library.

RDKit offers three other formats for storing molecules in the library, each of which reduces memory usage:

1. Binary Format - Using rdkit.Chem.rdSubstructLibrary.CachedMolHolder to store molecules in binary format.

2. SMILES Format - Using rdkit.Chem.rdSubstructLibrary.CachedSmilesMolHolder to store molecules in SMILES format.

3. Trusted SMILES Format - Using rdkit.Chem.rdSubstructLibrary.CachedTrustedSmilesMolHolder to store molecules in Trusted SMILES format.

To use a specific library format, you need to instantiate the library object with a new format object as shown below:

from rdkit.Chem import rdSubstructLibrary

# Creating an empty library in binary format
f = rdSubstructLibrary.CachedMolHolder()
l = rdSubstructLibrary.SubstructLibrary(f)

# Creating an empty library in SMILES format
f = rdSubstructLibrary.CachedSmilesMolHolder()
l = rdSubstructLibrary.SubstructLibrary(f)

# Creating an empty library in Trusted SMILES format
f = rdSubstructLibrary.CachedTrustedSmilesMolHolder()
l = rdSubstructLibrary.SubstructLibrary(f)
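
As a minimal sketch of the idea that the holder changes only how molecules are stored, not what a search returns, the code below builds the same two-molecule library with each of the three holders and compares the results. The helper name build_library and the sample molecules and query (borrowed from the scripts later in this section) are illustrative assumptions.

```python
from rdkit import Chem
from rdkit.Chem import rdSubstructLibrary

# Sample molecules and query, borrowed from the scripts in this section
smiles = ["O=C(O)c2cc1ccc(Cl)n1cn2", "O=C(O)c2ccn1c(Cl)ccc1n2"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
query = Chem.MolFromSmiles("c1nccc2n1ccc2")

def build_library(holder):
    """Hypothetical helper: fill a library backed by the given holder."""
    lib = rdSubstructLibrary.SubstructLibrary(holder)
    for m in mols:
        lib.AddMol(m)
    return lib

holders = {
    "binary": rdSubstructLibrary.CachedMolHolder(),
    "smiles": rdSubstructLibrary.CachedSmilesMolHolder(),
    "trusted smiles": rdSubstructLibrary.CachedTrustedSmilesMolHolder(),
}

# Record the library size and the matching indices for each storage format
results = {}
for name, holder in holders.items():
    lib = build_library(holder)
    results[name] = (len(lib), list(lib.GetMatches(query)))
    print(name, results[name])
```

If the three printed lines agree, the holder can be chosen on memory and speed grounds alone.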

After you finish building a molecule library, you may want to save it to a file and reuse it later, so you do not have to rebuild the same library every time. RDKit offers two methods to write molecule libraries to, and read them from, external files.

l.ToStream(open(file_name, 'w')) - Writes the molecule library to the given file, opened in text mode.

l.InitFromStream(open(file_name, 'rb')) - Reads the molecule library from the given file, opened in binary mode.
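
A minimal round-trip sketch of these two calls, assuming a temporary file is an acceptable target: it writes a small library in text mode, reads it back in binary mode, and prints both sizes so they can be compared. The temporary path and the variable names are illustrative, not part of RDKit.

```python
import os
import tempfile

from rdkit import Chem
from rdkit.Chem import rdSubstructLibrary

# Build a small library in Trusted SMILES format
lib = rdSubstructLibrary.SubstructLibrary(
    rdSubstructLibrary.CachedTrustedSmilesMolHolder())
for smi in ["O=C(O)c2cc1ccc(Cl)n1cn2", "O=C(O)c2ccn1c(Cl)ccc1n2"]:
    lib.AddMol(Chem.MolFromSmiles(smi))

# Illustrative temporary location; any writable path would do
path = os.path.join(tempfile.mkdtemp(), "example.sslib")

# Context managers close the file even if serialization fails
with open(path, "w") as out:
    lib.ToStream(out)

# Restore into a fresh, empty library object
restored = rdSubstructLibrary.SubstructLibrary()
with open(path, "rb") as src:
    restored.InitFromStream(src)

print(len(lib), len(restored))
```

Using with blocks instead of a bare open() call ensures the file is flushed and closed before it is read again.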

The short example below builds a molecule library in Trusted SMILES format and writes it to an external file.

herong$ more rdkit_write_library.py

from rdkit import Chem
ms = ["O=C(O)c2cc1ccc(Cl)n1cn2", "O=C(O)c2ccn1c(Cl)ccc1n2"]
ms = list(map(Chem.MolFromSmiles, ms))

from rdkit.Chem import rdSubstructLibrary
f = rdSubstructLibrary.CachedTrustedSmilesMolHolder()
l = rdSubstructLibrary.SubstructLibrary(f)
for m in ms:
  l.AddMol(m)

# Use a context manager so the file is flushed and closed after writing
with open('rdkit.sslib', 'w') as out:
  l.ToStream(out)

herong$ python rdkit_write_library.py

herong$ more rdkit.sslib
22 serialization::archive 18 0 1 0 1 4 1 1
0 0 0 0 0 2 0 23 O=C(O)c1cc2ccc(Cl)n2cn1 23 O=C(O)c1ccn2c(Cl)ccc2n1 0 1 -1

Now you can use the next example to read the molecule library from a file and do a substructure search. The output is the same as the previous tutorial.

herong$ more rdkit_search_library.py

from rdkit import Chem
s = Chem.MolFromSmiles('c1nccc2n1ccc2')

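# Draw must be imported explicitly; "from rdkit import Chem" does not load it
from rdkit.Chem import Draw, rdSubstructLibrary
l = rdSubstructLibrary.SubstructLibrary()
# Read the library file in binary mode and close it afterwards
with open('rdkit.sslib', 'rb') as src:
  l.InitFromStream(src)

# GetMatches returns the indices of the matching molecules
ms = l.GetMatches(s)
ms = [l.GetMol(m) for m in ms]
i = Draw.MolsToGridImage(ms)
# display() exists only in Jupyter/IPython, so save the image when run as a script
i.save('rdkit_search_library.png')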

herong$ python rdkit_search_library.py
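
When only the hits matter and no picture is needed, the search can be checked without drawing anything. The sketch below rebuilds the same two-molecule library in memory, so it does not depend on rdkit.sslib, and queries it with HasMatch, CountMatches, and GetMatches. The variable names are illustrative.

```python
from rdkit import Chem
from rdkit.Chem import rdSubstructLibrary

# Rebuild the two-molecule library in Trusted SMILES format
lib = rdSubstructLibrary.SubstructLibrary(
    rdSubstructLibrary.CachedTrustedSmilesMolHolder())
for smi in ["O=C(O)c2cc1ccc(Cl)n1cn2", "O=C(O)c2ccn1c(Cl)ccc1n2"]:
    lib.AddMol(Chem.MolFromSmiles(smi))

query = Chem.MolFromSmiles("c1nccc2n1ccc2")

has_hit = lib.HasMatch(query)            # True if any molecule matches
n_hits = lib.CountMatches(query)         # number of matching molecules
hit_ids = list(lib.GetMatches(query))    # library indices of the matches

# Turn each matching index back into a molecule and print it as SMILES
hit_smiles = [Chem.MolToSmiles(lib.GetMol(i)) for i in hit_ids]

print(has_hit, n_hits, hit_ids)
for idx, smi in zip(hit_ids, hit_smiles):
    print(idx, smi)
```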
