Substructure Search with SMARTS Expressions

This section provides a tutorial example on how to use SMARTS expression strings to run substructure searches with the Open Babel 'obabel ... -s ...' command.

What Are SMARTS Expressions? - A SMARTS expression is a sequence of atoms, bonds, and operators that describes a substructure pattern. SMARTS expressions can be divided into 3 types:

1. Atom Expressions - An expression that can be evaluated to an atom pattern using atom operations on atoms and atom expressions.

An atom expression must be enclosed in square brackets, unless it is a single simple atom symbol.

Here are some examples of atom expressions that consist of a single atom operand with no operator. The object molecule must have at least one atom matching the given atom pattern.

Atom  Description
----  -----------
C     normal (aliphatic) carbon atom
[#6]  atom with 6 protons (carbon atom)
c     aromatic carbon atom
[Ca]  calcium atom
[R]   any atom on any ring
[D3]  any atom with 3 explicit connections
[X3]  any atom with 3 total connections (includes implicit H)
[h2]  any atom with 2 implicit hydrogens
[H2]  any atom with 2 total hydrogens
a     any aromatic atom
*     any atom

Here are atom operators and their precedences:

Precedence Operation  Description
---------- ---------  -----------
4          [!e]       NOT operation, negates expression "e"
                      The object molecule must have at least one atom
                      that does not match the given atom expression.

3          [e1&e2]    AND operation, "e1" and "e2" are both matched
                      The object molecule must have at least one atom matching
                      both given atom expressions.

3          [e1e2]     AND operation, same as above, & omitted

2          [e1,e2]    OR operation, "e1" or "e2" is matched
                      The object molecule must have at least one atom matching
                      at least one of the given atom expressions.

1          [e1;e2]    AND operation with lower precedence,
                      Same as "&" operation, but evaluated after ",".
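The precedence rules in the table above can be illustrated with a small, self-contained Python sketch. It is not how Open Babel parses SMARTS. The names PRIMITIVES and atom_matches, and the dictionary used to describe an atom, are hypothetical helpers invented for this illustration. Only a few primitives are supported, and the implicit AND form (for example "CH0") is left out to keep the sketch short.

```python
# Toy evaluator for the logical operators of a SMARTS atom expression.
# Assumption: an atom is described by a plain dict, and the expression
# is the text between the square brackets, with "&" always written out.
PRIMITIVES = {
    "C": lambda a: a["symbol"] == "C" and not a["aromatic"],
    "c": lambda a: a["symbol"] == "C" and a["aromatic"],
    "N": lambda a: a["symbol"] == "N" and not a["aromatic"],
    "O": lambda a: a["symbol"] == "O" and not a["aromatic"],
    "H0": lambda a: a["h_count"] == 0,
    "H1": lambda a: a["h_count"] == 1,
}


def atom_matches(expr, atom):
    """Evaluate expr against one atom, lowest precedence operator first."""
    if ";" in expr:  # precedence 1: low-precedence AND
        return all(atom_matches(p, atom) for p in expr.split(";"))
    if "," in expr:  # precedence 2: OR
        return any(atom_matches(p, atom) for p in expr.split(","))
    if "&" in expr:  # precedence 3: AND
        return all(atom_matches(p, atom) for p in expr.split("&"))
    if expr.startswith("!"):  # precedence 4: NOT
        return not atom_matches(expr[1:], atom)
    return PRIMITIVES[expr](atom)


# An aromatic carbon that carries no hydrogen.
ring_carbon = {"symbol": "C", "aromatic": True, "h_count": 0}

# "c,N;H1" is read as (c OR N) AND H1.
low_and = atom_matches("c,N;H1", ring_carbon)
# "c,N&H1" is read as c OR (N AND H1).
high_and = atom_matches("c,N&H1", ring_carbon)
print(low_and, high_and)
```

The two expressions differ only in which AND operator is used, yet they give different answers for the same atom, which is the reason SMARTS offers both "&" and ";".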

2. Bond Expressions - An expression that can be evaluated to a bond pattern using bond operations on bonds and bond expressions.

Here are some examples of bond expressions that consist of a single bond operand with no operator. The object molecule must have at least one bond matching the given bond pattern.

Bond  Description
----  -----------
-     single bond (aliphatic); when the bond symbol is omitted,
      the default is "single or aromatic"
=     double bond
#     triple bond
/     directional bond "up"
\     directional bond "down"
/?    directional bond "up or unspecified"
\?    directional bond "down or unspecified"
:     aromatic bond
@     any ring bond
~     any bond

Here are bond operators and their precedences:

Precedence Operation Description
---------- --------- -----------
4          !e        NOT operation, negates expression "e"
                     The object molecule must have at least one bond
                     that does not match the given bond expression.

3          e1&e2     AND operation, "e1" and "e2" are both matched
                     The object molecule must have at least one bond matching
                     both given bond expressions.

3          e1e2      AND operation, same as above, & omitted

2          e1,e2     OR operation, "e1" or "e2" is matched
                     The object molecule must have at least one bond matching
                     at least one of the given bond expressions.

1          e1;e2     AND operation with lower precedence,
                     Same as "&" operation, but evaluated after ",".
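As a rough illustration of the bond operators above, the following Python sketch evaluates a bond expression against a single bond. It is a toy model under stated assumptions, not Open Babel's implementation. The names BOND_PRIMITIVES and bond_matches and the dictionary describing a bond are hypothetical, and the directional bond symbols are not supported.

```python
# Toy evaluator for SMARTS bond expressions.
# Assumption: a bond is described by a plain dict with its order,
# an aromatic flag, and a ring membership flag.
BOND_PRIMITIVES = {
    "-": lambda b: b["order"] == 1 and not b["aromatic"],
    "=": lambda b: b["order"] == 2 and not b["aromatic"],
    "#": lambda b: b["order"] == 3,
    ":": lambda b: b["aromatic"],
    "@": lambda b: b["in_ring"],
    "~": lambda b: True,
}


def bond_matches(expr, bond):
    """Evaluate expr against one bond, lowest precedence operator first."""
    for op, combine in ((";", all), (",", any), ("&", all)):
        if op in expr:
            return combine(bond_matches(p, bond) for p in expr.split(op))
    # What is left is a run of primitives joined by implicit AND,
    # each one optionally preceded by "!".
    result, negate = True, False
    for ch in expr:
        if ch == "!":
            negate = not negate
            continue
        hit = BOND_PRIMITIVES[ch](bond)
        result = result and (hit != negate)
        negate = False
    return result


single = {"order": 1, "aromatic": False, "in_ring": False}
double = {"order": 2, "aromatic": False, "in_ring": False}

for expr in ("-,=", "-=", "!#", "-~"):
    print(expr, bond_matches(expr, single), bond_matches(expr, double))
```

In this model "-=" is an implicit AND that no bond can satisfy, while "-~" reduces to a plain single bond because "~" accepts every bond.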

3. Recursive SMARTS Expressions - An expression that uses a whole SMARTS pattern as an atom primitive, so that an atom is matched by the environment around it. Recursive SMARTS are combined with the same operations as other atom expressions.

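A SMARTS expression must be enclosed in parentheses and prefixed with "$", as in "$(e)", if it is used in a SMARTS operation. It matches an atom that can serve as the first atom of a match of "e". Like other atom expressions, it is written inside square brackets.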

Here are SMARTS operators and their precedences:

Precedence Operation    Description
---------- ---------    -----------
4          !$(e)        NOT operation, negates expression "e"
                        The object molecule must have at least one atom
                        that does not match the given SMARTS expression.

3          $(e1)&$(e2)  AND operation, "e1" and "e2" are both matched
                        The object molecule must have at least one atom matching
                        both given SMARTS expressions.

3          $(e1)$(e2)   AND operation, same as above, & omitted

2          $(e1),$(e2)  OR operation, "e1" or "e2" is matched
                        The object molecule must have at least one atom matching
                        at least one of the given SMARTS expressions.

1          $(e1);$(e2)  AND operation with lower precedence,
                        Same as "&" operation, but evaluated after ",".
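To show what "$(e)" means for a single atom, here is a minimal Python sketch on a hand-built ethanol graph. It is an illustration under simplifying assumptions, not Open Babel code. The ETHANOL dictionary and the functions neighbours and starts_chain are hypothetical, only linear chains of element symbols are handled, and bond types and aromaticity are ignored.

```python
# Toy model of a recursive SMARTS test: an atom satisfies $(e) when it
# can be the first atom of a match of e.
# Assumption: ethanol (SMILES "CCO") is stored as symbols plus bonds.
ETHANOL = {"symbols": ["C", "C", "O"], "bonds": [(0, 1), (1, 2)]}


def neighbours(mol, i):
    """Return the indexes of the atoms bonded to atom i."""
    return [b if a == i else a for a, b in mol["bonds"] if i in (a, b)]


def starts_chain(mol, i, chain, visited=()):
    """True if atom i is the first atom of a path spelling out chain."""
    if mol["symbols"][i] != chain[0] or i in visited:
        return False
    if len(chain) == 1:
        return True
    return any(
        starts_chain(mol, j, chain[1:], visited + (i,))
        for j in neighbours(mol, i)
    )


atoms = range(len(ETHANOL["symbols"]))

# [$(CO)]: carbon atoms that have an oxygen neighbour.
in_env = [i for i in atoms if starts_chain(ETHANOL, i, ["C", "O"])]

# [C;!$(CO)]: carbon atoms that do not have an oxygen neighbour.
not_in_env = [
    i for i in atoms
    if ETHANOL["symbols"][i] == "C"
    and not starts_chain(ETHANOL, i, ["C", "O"])
]
print(in_env, not_in_env)
```

The pattern "CO" on its own would match two atoms, while "[$(CO)]" matches only the carbon, using the oxygen purely as a condition on its environment.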

Open Babel Supports SMARTS Expressions - You can use SMARTS expressions with the "-s ..." option of the "obabel" command to keep only the molecules that match the given SMARTS expression.

Here are some examples:

# C, C and O connected with single bonds
herong$ obabel "-:c1cc(ccc1CC(C(=O)O)N)O" -o smiles -s C-C-O
c1cc(ccc1CC(C(=O)O)N)O
1 molecule converted

# same as above with the bond symbols omitted (default bond: single or aromatic)
herong$ obabel "-:c1cc(ccc1CC(C(=O)O)N)O" -o smiles -s CCO
c1cc(ccc1CC(C(=O)O)N)O
1 molecule converted

# same as above, with optional brackets added to the middle atom expression
herong$ obabel "-:c1cc(ccc1CC(C(=O)O)N)O" -o smiles -s 'C[C]O'
c1cc(ccc1CC(C(=O)O)N)O
1 molecule converted

# two conditions on the middle atom: aliphatic C with 0 H (implicit AND)
herong$ obabel "-:c1cc(ccc1CC(C(=O)O)N)O" -o smiles -s 'C[CH0]O'
c1cc(ccc1CC(C(=O)O)N)O
1 molecule converted

# same as above with the implicit & written explicitly
herong$ obabel "-:c1cc(ccc1CC(C(=O)O)N)O" -o smiles -s 'C[C&H0]O'
c1cc(ccc1CC(C(=O)O)N)O
1 molecule converted

# bond expression used: C and O connected with a single or double bond
herong$ obabel "-:c1cc(ccc1CC(C(=O)O)N)O" -o smiles -s 'C-,=O'
c1cc(ccc1CC(C(=O)O)N)O
1 molecule converted

# similar to the above, but in a negative way: any bond except a triple bond
herong$ obabel "-:c1cc(ccc1CC(C(=O)O)N)O" -o smiles -s 'C!#O'
c1cc(ccc1CC(C(=O)O)N)O
1 molecule converted

# bad bond expression: -= is an implicit AND, and no bond can be both single and double.
herong$ obabel "-:c1cc(ccc1CC(C(=O)O)N)O" -o smiles -s 'C-=O'
0 molecules converted

# redundant bond expression: every single bond also matches ~, so -~ is the same as -.
herong$ obabel "-:c1cc(ccc1CC(C(=O)O)N)O" -o smiles -s 'C-~O'
c1cc(ccc1CC(C(=O)O)N)O
1 molecule converted

# matching an aromatic C with 1 attached H
herong$ obabel "-:c1cc(ccc1CC(C(=O)O)N)O" -o smiles -s '[c;H1]'
c1cc(ccc1CC(C(=O)O)N)O
1 molecule converted

# matching an aromatic C with 0 attached H
herong$ obabel "-:c1cc(ccc1CC(C(=O)O)N)O" -o smiles -s '[c;H0]'
c1cc(ccc1CC(C(=O)O)N)O
1 molecule converted

# recursive SMARTS expressions nested inside an atom expression
herong$ obabel "-:c1cc(ccc1CC(C(=O)O)N)O" -o smiles \
  -s '[C;H0]-,=[$([O;H1]),$([O;H0])]'
c1cc(ccc1CC(C(=O)O)N)O
1 molecule converted
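If you prefer to run searches like the ones above from a script, the commands can be assembled as argument lists, which avoids shell quoting problems with characters such as "[", "$", and ";". The following Python sketch is only an illustration. The names TYROSINE and build_obabel_cmd are hypothetical, the options simply mirror the command lines shown above, and the commands are executed only if an "obabel" program is found on the PATH.

```python
import shutil
import subprocess

# Tyrosine, the same input molecule as in the examples above.
TYROSINE = "c1cc(ccc1CC(C(=O)O)N)O"


def build_obabel_cmd(smiles, smarts):
    """Build the argument list for one 'obabel ... -s ...' search."""
    return ["obabel", "-:" + smiles, "-o", "smiles", "-s", smarts]


patterns = [
    "C-C-O",
    "C[C&H0]O",
    "C-,=O",
    "[C;H0]-,=[$([O;H1]),$([O;H0])]",
]
commands = [build_obabel_cmd(TYROSINE, p) for p in patterns]

if shutil.which("obabel"):
    for cmd in commands:
        # No shell is involved, so the SMARTS string needs no quoting.
        done = subprocess.run(cmd, capture_output=True, text=True)
        print(cmd[-1], "->", (done.stdout + done.stderr).strip())
else:
    print("obabel not found on PATH; commands were built but not run")
```

Because each pattern is passed as a single list element, it reaches "obabel" exactly as written, without any interpretation by the shell.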

You can verify the above matching results against the tyrosine molecule structure below:

Open Babel SVG Picture - Tyrosine Molecule
