Extra Datapoint Descriptors from Molecule Featurizers

Datapoints can have extra descriptors concatenated to the learned representation before it is passed to the feed-forward network (FFN). These descriptors can be generated automatically using molecule featurizers.
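As a rough illustration of what "concatenated" means here, the sketch below joins a learned representation and a descriptor vector with NumPy. The sizes (300 and 2048) and the names `learned_repr` and `extra_descriptors` are placeholders chosen for illustration, not values read from chemprop.

```python
import numpy as np

# Hypothetical sizes: a 300-dimensional learned representation and a
# 2048-bit fingerprint used as the extra descriptors.
learned_repr = np.zeros(300, dtype=np.float32)
extra_descriptors = np.ones(2048, dtype=np.float32)

# The FFN input is the two vectors joined end to end.
ffn_input = np.concatenate([learned_repr, extra_descriptors])
print(ffn_input.shape)  # (2348,)
```

The practical consequence is that the FFN's input width grows by the total length of the extra descriptors, which is why each featurizer has to report its length.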

Loading packages#

[1]:

import numpy as np
import pandas as pd
from rdkit import Chem
from pathlib import Path
from chemprop import data, utils
from chemprop.featurizers import MoleculeFeaturizer
from rdkit.Chem import rdFingerprintGenerator
from dataclasses import dataclass

Change data inputs here#

[2]:

chemprop_dir = Path.cwd().parent
test_path = chemprop_dir / "tests/data/regression.csv"
target_columns = ['logSolubility']

[3]:

df_test = pd.read_csv(test_path)
df_test

[3]:

	smiles	logSolubility
0	OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)...	-0.770
1	Cc1occc1C(=O)Nc2ccccc2	-3.300
2	CC(C)=CCCC(C)=CC(=O)	-2.060
3	c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43	-7.870
4	c1ccsc1	-1.330
...	...	...
495	Nc1cc(nc(N)n1=O)N2CCCCC2	-1.989
496	Nc2cccc3nc1ccccc1cc23	-4.220
497	c1ccc2cc3c4cccc5cccc(c3cc2c1)c45	-8.490
498	OC(c1ccc(Cl)cc1)(c2ccc(Cl)cc2)C(Cl)(Cl)Cl	-5.666
499	C1Cc2cccc3cccc1c23	-4.630

500 rows × 2 columns

[4]:

smis = df_test['smiles']
ys = df_test.loc[:, target_columns].values

Creating custom featurizers#

Custom featurizers can be made by subclassing the MoleculeFeaturizer class. These featurizers must override the following methods:

- __len__(self)
- __call__(self, mol: Chem.Mol)

Note that this is just an example of how to create a custom featurizer. The MorganBinaryFeaturizer in featurizers/molecule.py already implements this functionality.

[5]:

@dataclass
class MorganFingerprintMoleculeFeaturizer(MoleculeFeaturizer):
    fp_size: int = 2048

    def __post_init__(self):
        self.mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=self.fp_size)

    def __len__(self) -> int:
        """the length of the feature vector"""
        return self.fp_size

    def __call__(self, mol: Chem.Mol) -> np.ndarray:
        """Featurize the molecule ``mol``"""
        fp = self.mfpgen.GetFingerprintAsNumPy(mol)
        return fp

Testing the featurizer#

[6]:

mf = MorganFingerprintMoleculeFeaturizer()
morgan = mf(utils.make_mol(smis[0], keep_h=False, add_h=False))
morgan.shape, morgan

[6]:

((2048,), array([0, 1, 0, ..., 0, 0, 0], dtype=uint8))
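Because `__len__` and `__call__` are implemented separately, they can drift apart. One possible sanity check, sketched below, compares the declared length with the shape of the returned array. `check_featurizer` and `ConstantFeaturizer` are hypothetical names, not part of chemprop, and the stand-in featurizer ignores its input so the snippet runs without RDKit.

```python
import numpy as np


def check_featurizer(featurizer, mol) -> np.ndarray:
    """Return the features for ``mol`` after checking they match ``len(featurizer)``."""
    features = np.asarray(featurizer(mol))
    if features.ndim != 1:
        raise ValueError(f"expected a 1-D feature vector, got shape {features.shape}")
    if features.shape[0] != len(featurizer):
        raise ValueError(
            f"len(featurizer) is {len(featurizer)} but the output has {features.shape[0]} entries"
        )
    return features


class ConstantFeaturizer:
    """Stand-in featurizer (hypothetical) that ignores the molecule."""

    def __init__(self, size: int):
        self.size = size

    def __len__(self) -> int:
        return self.size

    def __call__(self, mol) -> np.ndarray:
        return np.zeros(self.size, dtype=np.uint8)


features = check_featurizer(ConstantFeaturizer(8), mol=None)
print(features.shape)  # (8,)
```

In the notebook, the same helper could be called with `mf` and the molecule built from `smis[0]` in place of the stand-in.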

Loading featurizers into datapoints#

[7]:

# List every featurizer that should generate extra descriptors. This list is separate
# from the main featurizer supplied to molecule datasets.
mfs = [MorganFingerprintMoleculeFeaturizer()]

# Any number of molecule featurizers can be supplied to each datapoint in a dataset.
# Featurizers cannot be combined with pre-computed extra descriptors (the approach shown
# in the loaded molecule features notebook); supplying both results in an error.

all_data = [data.MoleculeDatapoint.from_smi(smi, y=y, mfs=mfs) for smi, y in zip(smis, ys)]
all_data[:5]

[7]:

[MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7f0aa8a018c0>, y=array([-0.77]), weight=1, gt_mask=None, lt_mask=None, x_d=array([0, 1, 0, ..., 0, 0, 0], dtype=uint8), x_phase=None, V_f=None, E_f=None, V_d=None),
 MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7f0aa8a01850>, y=array([-3.3]), weight=1, gt_mask=None, lt_mask=None, x_d=array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), x_phase=None, V_f=None, E_f=None, V_d=None),
 MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7f0aa8a019a0>, y=array([-2.06]), weight=1, gt_mask=None, lt_mask=None, x_d=array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), x_phase=None, V_f=None, E_f=None, V_d=None),
 MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7f0aa8a01a80>, y=array([-7.87]), weight=1, gt_mask=None, lt_mask=None, x_d=array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), x_phase=None, V_f=None, E_f=None, V_d=None),
 MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7f0aa8a01b60>, y=array([-1.33]), weight=1, gt_mask=None, lt_mask=None, x_d=array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), x_phase=None, V_f=None, E_f=None, V_d=None)]
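
To make the comments in cell [7] concrete, here is a sketch of how several featurizers could be combined into one descriptor vector, and of why mixing them with pre-computed descriptors is rejected. It assumes the outputs are joined end to end in list order; that is consistent with the single-featurizer `x_d` shown above but is otherwise an assumption about chemprop's internals. `build_x_d` and `FixedFeaturizer` are hypothetical names.

```python
import numpy as np


class FixedFeaturizer:
    """Stand-in featurizer (hypothetical) returning a constant vector."""

    def __init__(self, size: int, value: float):
        self.size, self.value = size, value

    def __len__(self) -> int:
        return self.size

    def __call__(self, mol) -> np.ndarray:
        return np.full(self.size, self.value)


def build_x_d(mol, mfs=None, x_d=None):
    """Combine featurizer outputs into one vector, or pass through loaded descriptors."""
    if mfs is not None and x_d is not None:
        # Only one source of extra descriptors is allowed at a time.
        raise ValueError("provide either featurizers or pre-computed descriptors, not both")
    if mfs is not None:
        # Assumed behavior: outputs are stacked end to end in list order.
        return np.hstack([mf(mol) for mf in mfs])
    return x_d


demo_mfs = [FixedFeaturizer(3, 1.0), FixedFeaturizer(2, 0.0)]
demo_x_d = build_x_d(mol=None, mfs=demo_mfs)
print(demo_x_d)  # [1. 1. 1. 0. 0.]
```

Under this assumption, the combined vector has length `sum(len(mf) for mf in demo_mfs)`.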