Molecule MolGraph featurizers

[1]:
from chemprop.featurizers.molgraph.molecule import SimpleMoleculeMolGraphFeaturizer

The examples below featurize ethane (SMILES "CC").

[2]:
from rdkit import Chem

mol_to_featurize = Chem.MolFromSmiles("CC")

Simple molgraph featurizer

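A MolGraph is the graph featurization of a molecule, created by SimpleMoleculeMolGraphFeaturizer. It holds the atom features (V), the bond features (E), and the connectivity between atoms and bonds (edge_index and rev_edge_index). Each bond is stored as two directed edges, so E has two rows per bond. edge_index lists the atoms at the two ends of each directed edge, and rev_edge_index gives, for each edge, the index of the edge that points in the opposite direction.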

[3]:
featurizer = SimpleMoleculeMolGraphFeaturizer()
featurizer(mol_to_featurize)
[3]:
MolGraph(V=array([[0.     , 0.     , 0.     , 0.     , 0.     , 1.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        1.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        1.     , 0.     , 1.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 1.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 1.     , 0.     , 0.     , 0.     ,
        0.     , 0.12011],
       [0.     , 0.     , 0.     , 0.     , 0.     , 1.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        1.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        1.     , 0.     , 1.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 1.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 1.     , 0.     , 0.     , 0.     ,
        0.     , 0.12011]], dtype=float32), E=array([[0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]]), edge_index=array([[0, 1],
       [1, 0]]), rev_edge_index=array([1, 0]))
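
The connectivity arrays in the output above can be reproduced by hand. The following is a minimal sketch in plain Python, not chemprop's implementation: `build_edge_index` is a hypothetical helper, and it assumes that the first row of edge_index holds source atoms, the second row holds destination atoms, and the two directed edges of a bond are stored next to each other.

```python
def build_edge_index(bonds):
    """Turn undirected bonds (u, v) into directed edges plus a reverse-edge lookup.

    Hypothetical helper for illustration; the row order and edge pairing
    are assumptions, not taken from chemprop's source.
    """
    src, dst = [], []
    for u, v in bonds:
        # Each bond contributes two directed edges: u -> v, then v -> u.
        src.extend([u, v])
        dst.extend([v, u])
    # Edges come in adjacent pairs, so the reverse of edge 2k is 2k + 1 and vice versa.
    rev_edge_index = [i ^ 1 for i in range(len(src))]
    return [src, dst], rev_edge_index


# Ethane ("CC") has a single bond between atoms 0 and 1.
edge_index, rev_edge_index = build_edge_index([(0, 1)])
print(edge_index)      # [[0, 1], [1, 0]]
print(rev_edge_index)  # [1, 0]
```

For a molecule with more bonds, such as propane with bonds (0, 1) and (1, 2), the same pairing gives a rev_edge_index of [1, 0, 3, 2].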

CuikmolmakerMolGraphFeaturizer (available with cuik-molmaker only)

This featurizer turns a batch of SMILES strings into a featurized batched graph. It relies on the cuik-molmaker package to (1) convert SMILES strings into Chem.Mol objects, (2) calculate the features of atoms and bonds, and (3) batch all the individual graphs together.

When creating the featurizer, you can pass options that control step 1 (currently only add_h). For step 2, the featurizer takes a single atom_featurizer_mode argument that selects which atom and bond features to calculate, rather than separate atom and bond featurizer objects.

[ ]:
from chemprop.featurizers.molgraph.molecule import CuikmolmakerMolGraphFeaturizer
from chemprop.utils.utils import is_cuikmolmaker_available

if is_cuikmolmaker_available():
    featurizer = CuikmolmakerMolGraphFeaturizer(atom_featurizer_mode="organic", add_h=True)
    bmg = featurizer(["C" * i for i in range(1, 20)])
    print(bmg)
<chemprop.featurizers.molgraph.molecule.BatchCuikMolGraph at 0x706e1b865760>
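
Step 3, batching, is commonly done by merging the individual graphs into one large disconnected graph. The sketch below shows that general technique in plain Python; it is not cuik-molmaker's implementation, `batch_graphs` is a hypothetical helper, and hydrogens are left out for brevity.

```python
def batch_graphs(graphs):
    """Merge (n_atoms, [src, dst]) graphs into a single disjoint graph.

    Hypothetical helper for illustration only.
    """
    src, dst, batch = [], [], []
    offset = 0
    for graph_id, (n_atoms, (g_src, g_dst)) in enumerate(graphs):
        # Shift atom indices so they stay unique across the whole batch.
        src.extend(a + offset for a in g_src)
        dst.extend(a + offset for a in g_dst)
        # Record which molecule each atom came from.
        batch.extend([graph_id] * n_atoms)
        offset += n_atoms
    return [src, dst], batch


methane = (1, [[], []])             # one heavy atom, no bonds
ethane = (2, [[0, 1], [1, 0]])      # two heavy atoms, one bond (two directed edges)
edge_index, batch = batch_graphs([methane, ethane, ethane])
print(edge_index)  # [[1, 2, 3, 4], [2, 1, 4, 3]]
print(batch)       # [0, 1, 1, 2, 2]
```

The batch list makes it possible to pool atom-level values back into one value per molecule after message passing.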

Custom

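The atom and bond featurizers used by the molgraph featurizer are customizable. In the example below, the organic atom featurizer and a bond featurizer with a shorter list of stereo types produce atom feature vectors of length 44 (instead of 72) and bond feature vectors of length 13 (instead of 14).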

[5]:
from chemprop.featurizers import MultiHotAtomFeaturizer, MultiHotBondFeaturizer

atom_featurizer = MultiHotAtomFeaturizer.organic()
bond_featurizer = MultiHotBondFeaturizer(stereos=[0, 1, 2, 3, 4])
featurizer = SimpleMoleculeMolGraphFeaturizer(
    atom_featurizer=atom_featurizer, bond_featurizer=bond_featurizer
)
featurizer(mol_to_featurize)
[5]:
MolGraph(V=array([[0.     , 0.     , 1.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 1.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 1.     , 0.     , 1.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 1.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 1.     , 0.     ,
        0.     , 0.12011],
       [0.     , 0.     , 1.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 1.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 1.     , 0.     , 1.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 1.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 1.     , 0.     ,
        0.     , 0.12011]], dtype=float32), E=array([[0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]]), edge_index=array([[0, 1],
       [1, 0]]), rev_edge_index=array([1, 0]))
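
One way to see why a custom featurizer changes the vector length: a multi-hot feature vector is a concatenation of one-hot blocks, and each block is as long as its list of allowed values. The sketch below is plain Python rather than chemprop code. `one_hot_with_unknown` is a hypothetical helper, the extra "unknown" slot is an assumption, and both element lists are illustrative guesses rather than values read from the library.

```python
def one_hot_with_unknown(value, choices):
    """One-hot encode value over choices, with a final slot for unknown values."""
    vec = [0] * (len(choices) + 1)
    idx = choices.index(value) if value in choices else len(choices)
    vec[idx] = 1
    return vec


# Illustrative element vocabularies (atomic numbers); assumptions, not library values.
organic_like = [1, 5, 6, 7, 8, 9, 14, 15, 16, 17, 35, 53]
broad = list(range(1, 37)) + [53]

carbon = 6
narrow_vec = one_hot_with_unknown(carbon, organic_like)
broad_vec = one_hot_with_unknown(carbon, broad)
print(len(narrow_vec), narrow_vec.index(1))  # 13 2
print(len(broad_vec), broad_vec.index(1))    # 38 5
```

A smaller vocabulary gives a shorter block and moves the position of the set bit, which matches the pattern in the two outputs: the leading 1 for carbon sits at index 5 with the default featurizer and at index 2 with the organic one.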

Extra atom and bond features

If your datapoints have extra atom or bond features, the molgraph featurizer needs to know the lengths of those extra features when it is created. This ensures that the empty Chem.Mol (Chem.MolFromSmiles("")) is featurized correctly and that the bond feature array has the correct shape.

[6]:
n_extra_atom_features = 3
n_extra_bond_features = 4
featurizer = SimpleMoleculeMolGraphFeaturizer(
    extra_atom_fdim=n_extra_atom_features, extra_bond_fdim=n_extra_bond_features
)

The dataset is given this custom featurizer and automatically handles the featurization, including passing the extra atom and bond features for each datapoint.
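
The reason the extra feature lengths must be declared up front can be shown with a small sketch. It is plain Python with a hypothetical helper, `with_extra_features`, and does not reflect how chemprop stores its arrays. The point is that the feature width cannot be inferred from the data when a molecule has no atoms.

```python
def with_extra_features(base_rows, extra_rows, base_fdim, extra_fdim):
    """Append extra features to each row; return (rows, width).

    Hypothetical helper for illustration only.
    """
    if len(base_rows) != len(extra_rows):
        raise ValueError("need exactly one extra-feature row per atom")
    rows = []
    for base, extra in zip(base_rows, extra_rows):
        if len(base) != base_fdim or len(extra) != extra_fdim:
            raise ValueError("feature row has an unexpected length")
        rows.append(list(base) + list(extra))
    # The width comes from the declared dimensions, so it is known even with zero rows.
    return rows, base_fdim + extra_fdim


rows, width = with_extra_features([[1.0, 0.0]], [[0.1, 0.2, 0.3]], base_fdim=2, extra_fdim=3)
print(rows, width)  # [[1.0, 0.0, 0.1, 0.2, 0.3]] 5

# An empty molecule has no rows, but its declared width still matches the others.
empty_rows, empty_width = with_extra_features([], [], base_fdim=2, extra_fdim=3)
print(empty_rows, empty_width)  # [] 5
```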