Extra Datapoint Descriptors from Molecule Featurizers#
Datapoints can carry extra descriptors that are concatenated to the learned representation before it is passed to the feed-forward network (FFN). These descriptors can be generated automatically using molecule featurizers.
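As a rough illustration of what the concatenation means, the sketch below joins a made-up learned representation with a made-up descriptor vector. Plain Python lists stand in for tensors, and the names `learned_representation`, `extra_descriptors`, and `ffn_input` are illustrative assumptions, not Chemprop identifiers.

```python
# Toy values only; real models use much larger vectors.
learned_representation = [0.12, -0.40, 0.88]  # hypothetical learned molecule embedding
extra_descriptors = [0.0, 1.0, 0.0, 0.0]      # hypothetical featurizer output

# The FFN input is the learned representation followed by the extra descriptors,
# so its size is the sum of the two lengths.
ffn_input = learned_representation + extra_descriptors
ffn_input_dim = len(learned_representation) + len(extra_descriptors)
```

The practical consequence is that the first FFN layer must be sized for the combined length, not just the learned representation.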
Loading packages#
[1]:
import numpy as np
import pandas as pd
from rdkit import Chem
from pathlib import Path
from chemprop import data, utils
from chemprop.featurizers import MoleculeFeaturizer
from rdkit.Chem import rdFingerprintGenerator
from dataclasses import dataclass
Change data inputs here#
[2]:
chemprop_dir = Path.cwd().parent
test_path = chemprop_dir / "tests/data/regression.csv"
target_columns = ['logSolubility']
[3]:
df_test = pd.read_csv(test_path)
df_test
[3]:
 | smiles | logSolubility |
---|---|---|
0 | OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)... | -0.770 |
1 | Cc1occc1C(=O)Nc2ccccc2 | -3.300 |
2 | CC(C)=CCCC(C)=CC(=O) | -2.060 |
3 | c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43 | -7.870 |
4 | c1ccsc1 | -1.330 |
... | ... | ... |
495 | Nc1cc(nc(N)n1=O)N2CCCCC2 | -1.989 |
496 | Nc2cccc3nc1ccccc1cc23 | -4.220 |
497 | c1ccc2cc3c4cccc5cccc(c3cc2c1)c45 | -8.490 |
498 | OC(c1ccc(Cl)cc1)(c2ccc(Cl)cc2)C(Cl)(Cl)Cl | -5.666 |
499 | C1Cc2cccc3cccc1c23 | -4.630 |
500 rows × 2 columns
[4]:
smis = df_test['smiles']
ys = df_test.loc[:, target_columns].values
Creating custom featurizers#
Custom featurizers can be made by inheriting from the MoleculeFeaturizer class. These featurizers must override the following methods:
- __len__(self)
- __call__(self, mol: Chem.Mol)
Note that this is just an example of how to create a custom featurizer. The MorganBinaryFeaturizer in featurizers/molecule.py already implements this functionality.
[5]:
@dataclass
class MorganFingerprintMoleculeFeaturizer(MoleculeFeaturizer):
    fp_size: int = 2048

    def __post_init__(self):
        self.mfpgen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=self.fp_size)

    def __len__(self) -> int:
        """the length of the feature vector"""
        return self.fp_size

    def __call__(self, mol: Chem.Mol) -> np.ndarray:
        """Featurize the molecule ``mol``"""
        fp = self.mfpgen.GetFingerprintAsNumPy(mol)
        return fp
Testing the featurizer#
[6]:
mf = MorganFingerprintMoleculeFeaturizer()
morgan = mf(utils.make_mol(smis[0], keep_h=False, add_h=False))
morgan.shape, morgan
[6]:
((2048,), array([0, 1, 0, ..., 0, 0, 0], dtype=uint8))
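The same kind of check can be wrapped in a small helper that confirms a featurizer's reported length matches the vector it actually returns. The sketch below is only an illustration: `check_featurizer` and `CharCountFeaturizer` are hypothetical names, and the toy featurizer counts characters in a SMILES string instead of operating on an RDKit molecule, so the example runs without RDKit or Chemprop installed.

```python
class CharCountFeaturizer:
    """Hypothetical stand-in for a molecule featurizer.

    It counts occurrences of a few characters in a SMILES string. These are
    character counts, not true atom counts (for example, "Cl" also counts as "C").
    """

    symbols = ("C", "N", "O", "S")

    def __len__(self) -> int:
        # Length the featurizer claims its output will have.
        return len(self.symbols)

    def __call__(self, smi: str) -> list[int]:
        return [smi.count(symbol) for symbol in self.symbols]


def check_featurizer(featurizer, sample) -> int:
    """Return the feature length, raising if it disagrees with len(featurizer)."""
    features = featurizer(sample)
    if len(features) != len(featurizer):
        raise ValueError(f"expected {len(featurizer)} features, got {len(features)}")
    return len(features)


n_features = check_featurizer(CharCountFeaturizer(), "CCO")
```

For a featurizer that takes an RDKit molecule, the same helper could be called with a molecule object as `sample`, since it relies only on `__len__` and `__call__`.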
Loading featurizers into datapoints#
[7]:
# Supply a list of all featurizers that will generate the extra descriptors.
# This list is separate from the main featurizer supplied to molecule datasets.
# Any number of molecule featurizers can be supplied to each datapoint in a dataset.
# Note that pre-obtained extra descriptors (as shown in the loaded molecule features
# notebook) cannot be supplied at the same time; attempting to do so raises an error.
mfs = [MorganFingerprintMoleculeFeaturizer()]
all_data = [data.MoleculeDatapoint.from_smi(smi, y=y, mfs=mfs) for smi, y in zip(smis, ys)]
all_data[:5]
[7]:
[MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7f0aa8a018c0>, y=array([-0.77]), weight=1, gt_mask=None, lt_mask=None, x_d=array([0, 1, 0, ..., 0, 0, 0], dtype=uint8), x_phase=None, V_f=None, E_f=None, V_d=None),
MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7f0aa8a01850>, y=array([-3.3]), weight=1, gt_mask=None, lt_mask=None, x_d=array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), x_phase=None, V_f=None, E_f=None, V_d=None),
MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7f0aa8a019a0>, y=array([-2.06]), weight=1, gt_mask=None, lt_mask=None, x_d=array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), x_phase=None, V_f=None, E_f=None, V_d=None),
MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7f0aa8a01a80>, y=array([-7.87]), weight=1, gt_mask=None, lt_mask=None, x_d=array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), x_phase=None, V_f=None, E_f=None, V_d=None),
MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7f0aa8a01b60>, y=array([-1.33]), weight=1, gt_mask=None, lt_mask=None, x_d=array([0, 0, 0, ..., 0, 0, 0], dtype=uint8), x_phase=None, V_f=None, E_f=None, V_d=None)]
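The rules stated in the comments of the cell above (any number of featurizers, but not together with pre-obtained descriptors) can be sketched in plain Python. This is not Chemprop's implementation: `build_x_d` and the two toy featurizers are hypothetical, they operate on SMILES strings instead of molecules, and concatenating the outputs in list order is an assumption made for illustration.

```python
def count_carbons(smi: str) -> list[float]:
    """Toy featurizer: number of uppercase 'C' characters in the SMILES string."""
    return [float(smi.count("C"))]


def length_and_ring_digits(smi: str) -> list[float]:
    """Toy featurizer: string length and number of '1' ring-closure digits."""
    return [float(len(smi)), float(smi.count("1"))]


def build_x_d(smi, x_d=None, featurizers=None):
    """Return extra descriptors from exactly one source (hypothetical helper)."""
    if x_d is not None and featurizers is not None:
        raise ValueError("provide either pre-obtained descriptors or featurizers, not both")
    if featurizers is None:
        return x_d
    combined = []
    for featurizer in featurizers:
        # Assumed behavior: outputs are joined end to end, in list order.
        combined.extend(featurizer(smi))
    return combined


x_d = build_x_d("C1CC1", featurizers=[count_carbons, length_and_ring_digits])
```

Under this assumption, the total descriptor length is the sum of the individual featurizer lengths, and reordering the list reorders the columns.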