Molecule featurizers

[1]:
from chemprop.featurizers.molecule import (
    MorganBinaryFeaturizer,
    MorganCountFeaturizer,
    RDKit2DFeaturizer,
    V1RDKit2DFeaturizer,
    V1RDKit2DNormalizedFeaturizer,
)

These are example molecules to featurize.

[2]:
from chemprop.utils import make_mol

smis = ["C" * i for i in range(1, 11)]
mols = [make_mol(smi, keep_h=False, add_h=False, ignore_stereo=False) for smi in smis]

Molecule vs molgraph featurizers

Both molecule and molgraph featurizers take rdkit.Chem.Mol objects as input. Molgraph featurizers produce a MolGraph which is used in message passing. Molecule featurizers produce a 1D numpy array of features that can be used as extra datapoint descriptors.

[3]:
from chemprop.data import MoleculeDatapoint

molecule_featurizer = MorganBinaryFeaturizer()

datapoints = [MoleculeDatapoint(mol, x_d=molecule_featurizer(mol)) for mol in mols]

molecule_featurizer(mols[0])
[3]:
array([0, 0, 0, ..., 0, 0, 0], shape=(2048,), dtype=uint8)

Morgan fingerprint featurizers

Morgan fingerprints can use either a binary or a count representation of molecular substructures. The radius of the substructures, the length of the fingerprint, and whether to include chirality can all be customized. The default radius is 2, the default length is 2048, and chirality is included by default.

[4]:
mf = MorganCountFeaturizer(radius=3, length=1024, include_chirality=False)
morgan_fp = mf(mols[0])
morgan_fp.shape, morgan_fp
[4]:
((1024,), array([0, 0, 0, ..., 0, 0, 0], shape=(1024,), dtype=int32))
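
The difference between the binary and count representations can be sketched with plain NumPy. The array below is a made-up count fingerprint, not the output of either featurizer, and the sketch assumes that a binary fingerprint marks the same positions that a count fingerprint fills, clipped to 0 or 1.

```python
import numpy as np

# Toy count fingerprint (hypothetical values, not real featurizer output).
count_fp = np.array([0, 3, 0, 1, 0, 0, 2, 0], dtype=np.int32)

# A binary fingerprint only records whether a substructure is present.
binary_fp = (count_fp > 0).astype(np.uint8)

print(binary_fp)             # [0 1 0 1 0 0 1 0]
print(int(count_fp.sum()))   # 6 substructure occurrences in total
print(int(binary_fp.sum()))  # 3 distinct positions set
```

The count form keeps how often each substructure occurs, which the binary form discards.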

RDKit molecule featurizers

Chemprop warns that the RDKit molecule features are not scaled well by a StandardScaler. Consult the literature for more appropriate scaling methods.

[5]:
molecule_featurizer = RDKit2DFeaturizer()
extra_datapoint_descriptors = [molecule_featurizer(mol) for mol in mols]
extra_datapoint_descriptors[0]
The RDKit 2D features can deviate signifcantly from a normal distribution. Consider manually scaling them using an appropriate scaler before creating datapoints, rather than using the scikit-learn `StandardScaler` (the default in Chemprop).
[5]:
array([ 0.        ,  0.        ,  0.        ,  0.        ,  0.35978494,
        0.        , 16.043     , 12.011     , 16.03130013,  8.        ,
        0.        , -0.07755789, -0.07755789,  0.07755789,  0.07755789,
        1.        ,  1.        ,  1.        , 12.011     , 12.011     ,
       -0.07755789, -0.07755789,  0.1441    ,  0.1441    ,  2.503     ,
        2.503     ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  8.73925103,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  7.42665278,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  7.42665278,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  7.42665278,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  7.42665278,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        1.        ,  1.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.6361    ,  6.731     ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ])
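
As one illustration of the kind of manual scaling the warning refers to, the sketch below centers each descriptor on its median and divides by its interquartile range. This is only one common option for skewed data and is not something Chemprop prescribes; the helper names `fit_robust_scaler` and `apply_robust_scaler` are hypothetical, and the input matrix is synthetic rather than real RDKit output.

```python
import numpy as np

def fit_robust_scaler(X_train):
    """Return the per-column median and interquartile range of the training data."""
    median = np.median(X_train, axis=0)
    q75, q25 = np.percentile(X_train, [75, 25], axis=0)
    iqr = q75 - q25
    # Constant columns have zero spread; use 1.0 to avoid dividing by zero.
    iqr = np.where(iqr == 0, 1.0, iqr)
    return median, iqr

def apply_robust_scaler(X, median, iqr):
    """Scale X with statistics that were fitted on the training data only."""
    return (X - median) / iqr

# Synthetic stand-in for a (molecules x descriptors) matrix.
rng = np.random.default_rng(0)
X_train = np.column_stack([
    rng.lognormal(size=100),  # heavily skewed descriptor
    np.zeros(100),            # descriptor that is zero for every molecule
])

median, iqr = fit_robust_scaler(X_train)
X_scaled = apply_robust_scaler(X_train, median, iqr)
print(X_scaled.shape)  # (100, 2)
```

Fitting the statistics on the training split and reusing them for validation and test data keeps information about held-out molecules out of the scaling.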

The RDKit featurizers from v1 are also available. They rely on the descriptastorus package, which can be found at bp-kelley/descriptastorus. This package doesn’t include the following RDKit descriptors: ['AvgIpc', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'SPS']. Scaled versions of the v1 descriptors are also available, but it is unknown which molecules were used to fit the scaling, so using them may cause a data leak depending on the test set used to evaluate model performance. See this issue for more details about the scaling.

[6]:
molecule_featurizer = V1RDKit2DFeaturizer()  # unscaled v1 descriptors
molecule_featurizer = V1RDKit2DNormalizedFeaturizer()  # scaled v1 descriptors; replaces the featurizer above
molecule_featurizer(mols[0])
[6]:
array([1.96075662e-05, 5.77173432e-04, 3.87525506e-15, 2.72296612e-11,
       1.02515408e-07, 4.10254814e-13, 1.63521389e-11, 1.93930344e-05,
       1.22824218e-06, 2.20907757e-07, 6.35349909e-07, 3.08677419e-06,
       1.70338959e-05, 1.34072882e-05, 4.07488775e-10, 2.17523456e-08,
       6.89356874e-07, 2.63048207e-01, 1.96742684e-02, 2.50993926e-11,
       9.25841695e-11, 5.85610910e-17, 1.08871430e-06, 2.39145041e-11,
       7.52245592e-13, 1.23345732e-08, 2.94906350e-01, 9.59992784e-03,
       2.31947354e-03, 9.99390325e-01, 9.88006922e-01, 1.59186446e-08,
       4.42180049e-09, 1.00000000e+00, 7.85198619e-13, 4.14332758e-13,
       6.49617582e-11, 4.45588945e-06, 7.89307465e-03, 2.39990382e-02,
       7.89307465e-03, 4.59284380e-03, 3.24286613e-10, 1.83192891e-02,
       7.38491174e-01, 9.73505944e-01, 6.05575320e-02, 3.42737552e-07,
       1.23284669e-08, 6.13163344e-02, 3.33304127e-02, 9.93858689e-22,
       1.42492255e-01, 6.29631332e-02, 3.47228888e-02, 4.82992991e-15,
       1.11775996e-02, 1.89758400e-02, 5.52866693e-02, 5.22997303e-05,
       5.69516350e-08, 2.15229839e-03, 0.00000000e+00, 1.14242658e-21,
       2.40245513e-23, 1.31105703e-02, 8.72153349e-03, 5.76142917e-21,
       3.60875252e-15, 1.45980119e-01, 1.73556718e-22, 1.18093757e-10,
       5.99833786e-02, 9.05498589e-08, 4.60978367e-10, 1.57072376e-01,
       1.66847964e-01, 2.37240682e-02, 8.07601514e-02, 2.75008841e-02,
       4.92845505e-03, 1.24459630e-01, 7.31816496e-02, 1.67096874e-01,
       7.55810089e-02, 8.78622233e-24, 1.33643046e-01, 3.04494668e-02,
       2.58369311e-02, 5.30138094e-05, 1.42657565e-16, 3.73160396e-02,
       6.95272017e-13, 0.00000000e+00, 9.79690873e-13, 2.64281353e-04,
       1.20493060e-11, 2.86305006e-09, 1.04578852e-01, 3.09944928e-02,
       2.99487758e-06, 2.77639012e-01, 5.30138094e-05, 6.17138309e-03,
       5.30138094e-05, 5.00000000e-01, 3.84710451e-01, 5.30138094e-05,
       5.30138094e-05, 1.64664515e-01, 5.30138094e-05, 9.98653446e-01,
       3.99820633e-01, 2.02868342e-02, 5.70867846e-19, 3.32362804e-10,
       9.64197643e-10, 7.10542736e-15, 5.83707586e-13, 1.19880642e-20,
       1.65079548e-01, 1.67040631e-01, 1.66498334e-01, 1.66486816e-01,
       2.02864661e-01, 6.93658809e-02, 7.10542736e-15, 1.68346480e-01,
       1.67982932e-01, 6.87189958e-10, 1.18157291e-03, 1.64332634e-01,
       8.37776917e-04, 1.66325734e-01, 1.63034142e-01, 1.65079548e-01,
       9.56970492e-08, 3.49708922e-08, 1.68206175e-01, 1.65806858e-01,
       1.67346595e-01, 7.13964619e-07, 2.64115098e-12, 9.99127911e-02,
       2.86809243e-10, 3.77737848e-01, 4.50616778e-03, 1.33250251e-01,
       3.47299284e-02, 1.61482916e-09, 1.87517315e-18, 2.09410539e-07,
       7.10542736e-15, 4.99264281e-01, 1.64929402e-01, 1.31744508e-17,
       2.11164355e-16, 1.16815875e-09, 3.25923600e-22, 6.24601420e-10,
       1.68149182e-01, 1.65450729e-01, 1.17110262e-13, 0.00000000e+00,
       1.64668868e-01, 1.66924728e-01, 0.00000000e+00, 5.10071327e-08,
       7.10542736e-15, 1.54654108e-01, 2.79420938e-22, 0.00000000e+00,
       1.67639733e-01, 6.31499266e-25, 1.68186130e-01, 9.08850267e-03,
       1.68363202e-01, 8.26542313e-11, 1.56346354e-01, 0.00000000e+00,
       0.00000000e+00, 2.11354236e-02, 2.11354236e-02, 2.38815575e-20,
       0.00000000e+00, 8.33672450e-25, 5.30138094e-05, 1.56951066e-01,
       4.03434503e-08, 1.55259196e-23, 1.59306117e-17, 5.76610077e-14,
       2.95798941e-11, 1.68378369e-01, 1.67380186e-01, 1.48151465e-18,
       2.32414994e-16, 4.70359809e-08, 1.66633397e-01, 1.87492844e-01])

Custom

Any class that has a length (defines `__len__`) and returns a 1D numpy array when called with an rdkit.Chem.Mol can be used as a molecule featurizer.

[7]:
import numpy as np
from rdkit import Chem

class MyMoleculeFeaturizer:
    def __len__(self) -> int:
        return 1

    def __call__(self, mol: Chem.Mol) -> np.ndarray:
        total_atoms = mol.GetNumAtoms()
        return np.array([total_atoms])
[8]:
mf = MyMoleculeFeaturizer()
mf(mols[0])
[8]:
array([1])
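
The two requirements for a custom featurizer can be checked before it is used. The helper below is a hypothetical sketch, not part of Chemprop. It additionally assumes that the reported length should equal the number of returned features, which holds for `MyMoleculeFeaturizer` but is not stated explicitly here. The stand-in featurizer ignores its input so the sketch runs without RDKit.

```python
import numpy as np

def looks_like_molecule_featurizer(featurizer, sample_mol) -> bool:
    """Hypothetical helper: check for a length and a matching 1D array output."""
    features = featurizer(sample_mol)
    return (
        isinstance(features, np.ndarray)
        and features.ndim == 1
        and len(features) == len(featurizer)  # assumption: length == number of features
    )

class ConstantFeaturizer:
    """Stand-in featurizer that ignores its input and returns two values."""

    def __len__(self) -> int:
        return 2

    def __call__(self, mol) -> np.ndarray:
        return np.array([1.0, 0.0])

print(looks_like_molecule_featurizer(ConstantFeaturizer(), sample_mol=None))  # True
```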

Using molecule features as extra datapoint descriptors

If you only have molecule features for one molecule per datapoint, those features can be used directly as extra datapoint descriptors. If you have multiple molecules with extra features, or other extra datapoint descriptors, they first need to be concatenated into a single numpy array.

[9]:
mol1_features = np.random.randn(len(mols), 1)
mol2_features = np.random.randn(len(mols), 2)
other_datapoint_descriptors = np.random.randn(len(mols), 3)

extra_datapoint_descriptors = np.hstack([mol1_features, mol2_features, other_datapoint_descriptors])
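
A quick shape check can confirm that the concatenation produced one row per datapoint. The sketch below repeats the concatenation with a hard-coded datapoint count so that it runs on its own; the variable `blocks` and the seeded generator are illustrative additions.

```python
import numpy as np

n_datapoints = 10  # matches len(mols) in this notebook
rng = np.random.default_rng(0)

mol1_features = rng.standard_normal((n_datapoints, 1))
mol2_features = rng.standard_normal((n_datapoints, 2))
other_datapoint_descriptors = rng.standard_normal((n_datapoints, 3))

blocks = [mol1_features, mol2_features, other_datapoint_descriptors]
# Every block needs one row per datapoint, in the same order.
assert all(block.shape[0] == n_datapoints for block in blocks)

extra_datapoint_descriptors = np.hstack(blocks)
print(extra_datapoint_descriptors.shape)  # (10, 6)

# Row i is the 1D descriptor vector for datapoint i.
print(extra_datapoint_descriptors[0].shape)  # (6,)
```

Each row could then be passed as `x_d` when building a `MoleculeDatapoint`, in the same way the Morgan fingerprint was passed earlier.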