Datasets#

[1]:

from chemprop.data.datasets import (
    CuikmolmakerDataset, MoleculeDataset, ReactionDataset, MulticomponentDataset
)

To make a dataset you first need a list of datapoints.

[2]:

import numpy as np
from chemprop.data import LazyMoleculeDatapoint, MoleculeDatapoint, ReactionDatapoint

ys = np.random.rand(2, 1)

smis = ["C", "CC"]
mol_datapoints = [MoleculeDatapoint.from_smi(smi, y) for smi, y in zip(smis, ys)]

rxn_smis = ["[H:2][O:1][H:3]>>[H:2][O:1].[H:3]", "[H:2][S:1][H:3]>>[H:2][S:1].[H:3]"]
rxn_datapoints = [
    ReactionDatapoint.from_smi(rxn_smi, y, keep_h=True) for rxn_smi, y in zip(rxn_smis, ys)
]

Molecule Datasets#

MoleculeDatasets are made from a list of MoleculeDatapoints.

[3]:

MoleculeDataset(mol_datapoints)

[3]:

MoleculeDataset(data=[MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7d4882771000>, y=array([0.38506292]), weight=1.0, gt_mask=None, lt_mask=None, x_d=None, x_phase=None, name='C', V_f=None, E_f=None, V_d=None), MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7d48827710e0>, y=array([0.72285413]), weight=1.0, gt_mask=None, lt_mask=None, x_d=None, x_phase=None, name='CC', V_f=None, E_f=None, V_d=None)], featurizer=SimpleMoleculeMolGraphFeaturizer(atom_featurizer=<chemprop.featurizers.atom.MultiHotAtomFeaturizer object at 0x7d4a0c048b90>, bond_featurizer=<chemprop.featurizers.bond.MultiHotBondFeaturizer object at 0x7d48828d5bd0>, extra_atom_fdim=0, extra_bond_fdim=0), n_workers=0)

Dataset properties#

The properties of datapoints are collated in a dataset.

[4]:

dataset = MoleculeDataset(mol_datapoints)
print(dataset.Y)
print(dataset.names)

[[0.38506292]
 [0.72285413]]
['C', 'CC']

Datasets return a Datum when indexed. A Datum contains a MolGraph (see the molgraph featurizer notebook), the extra atom-level and datapoint-level descriptors, the target(s), the weight, and the masks for bounded loss functions.

[5]:

dataset[0]

[5]:

Datum(mg=MolGraph(V=array([[0.     , 0.     , 0.     , 0.     , 0.     , 1.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        1.     , 0.     , 0.     , 0.     , 0.     , 0.     , 0.     ,
        1.     , 0.     , 1.     , 0.     , 0.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 0.     , 1.     , 0.     , 0.     ,
        0.     , 0.     , 0.     , 1.     , 0.     , 0.     , 0.     ,
        0.     , 0.12011]], dtype=float32), E=array([], shape=(0, 14), dtype=float64), edge_index=array([], shape=(2, 0), dtype=int64), rev_edge_index=array([], dtype=int64)), V_d=None, x_d=None, y=array([0.38506292]), weight=1.0, lt_mask=None, gt_mask=None)

Caching#

By default, the MolGraphs are generated as needed. For small to medium datasets (exact sizes not yet benchmarked), it is more efficient to generate the MolGraphs once and cache them when the dataset is created.

To recreate the cache, set cache to True again. To clear the cache, set it to False.

We recommend scaling additional atom and bond features before setting the cache. Scaling them after caching requires the cache to be recreated, which is done automatically.

To featurize the graphs in parallel when caching, use the n_workers argument when creating the dataset. Note that this may cause hangs on Windows and macOS.

[6]:

import sys

if sys.platform not in ["win32", "darwin"]:
    dataset = MoleculeDataset(mol_datapoints, n_workers=3)
else:
    dataset = MoleculeDataset(mol_datapoints)

dataset.cache = True  # Generate the molgraphs and cache them
dataset.cache = True  # Recreate the cache
dataset.cache = False  # Clear the cache

dataset.cache = True  # Cache created with unscaled extra bond features
dataset.normalize_inputs(key="E_f")  # Cache recreated automatically with scaled extra bond features
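
The trade-off between on-demand featurization and caching can be pictured with a small toy sketch that does not use chemprop. ToyDataset, featurize, and n_featurize_calls are hypothetical names invented for this illustration, and the real MoleculeDataset may implement its cache differently.

```python
class ToyDataset:
    """Toy stand-in for a dataset that featurizes on demand or from a cache."""

    def __init__(self, smiles):
        self.smiles = list(smiles)
        self.n_featurize_calls = 0  # counts how often featurize() runs
        self._cache = None  # None means "no cache, featurize on demand"

    def featurize(self, smi):
        # Hypothetical featurizer: stands in for building a MolGraph.
        self.n_featurize_calls += 1
        return {"n_chars": len(smi)}

    @property
    def cache(self):
        return self._cache is not None

    @cache.setter
    def cache(self, enabled):
        # Setting True (re)builds the cache; setting False clears it.
        self._cache = [self.featurize(s) for s in self.smiles] if enabled else None

    def __getitem__(self, idx):
        if self._cache is not None:
            return self._cache[idx]
        return self.featurize(self.smiles[idx])


toy = ToyDataset(["C", "CC"])

# Without a cache, every access featurizes again.
toy[0]
toy[0]
calls_without_cache = toy.n_featurize_calls

# Building the cache featurizes each item once; later accesses are free.
toy.cache = True
toy[0]
toy[0]
calls_with_cache = toy.n_featurize_calls - calls_without_cache
```

In this sketch, repeated access without a cache pays the featurization cost every time, while the cached dataset pays it once per item up front and holds the results in memory.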

CuikmolmakerDataset (accelerated training and inference)#

This dataset uses cuik-molmaker to construct and featurize a batch of molecules at once instead of one at a time. CuikmolmakerDataset implements __getitems__ instead of __getitem__, which enables batched featurization and access. __getitems__ returns a CuikBatchedDatum, which contains the same information as a Datum, except that the graph information is returned as a series of Tensors instead of a MolGraph and the information for all molecules in the batch is collated together.

This dataset is used to accelerate training and inference and to save memory for large datasets. CuikmolmakerDataset is intended to be used with LazyMoleculeDatapoints and uses CuikmolmakerMolGraphFeaturizer to featurize the molecules.

[7]:

lazy_mol_datapoints = [LazyMoleculeDatapoint("C" * i, y=[i]) for i in range(1, 20)]

cuik_dataset = CuikmolmakerDataset(lazy_mol_datapoints)

[8]:

print(cuik_dataset.__getitems__([0, 1]))

CuikBatchedDatum(bmg=<chemprop.featurizers.molgraph.molecule.BatchCuikMolGraph object at 0x7d48827ca930>, V_d=None, X_d=None, Y=array([[1.],
       [2.]]), weights=array([1., 1.]), lt_mask=None, gt_mask=None)
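
The batched access pattern can be illustrated with a toy class that does not depend on chemprop or cuik-molmaker. ToyBatchedDataset and featurize_batch are hypothetical names used only for this sketch; the point is that a single call receives a list of indices and featurizes all of the selected molecules together.

```python
class ToyBatchedDataset:
    """Toy stand-in for a dataset that is accessed one batch at a time."""

    def __init__(self, smiles, ys):
        self.smiles = list(smiles)
        self.ys = list(ys)

    def __len__(self):
        return len(self.smiles)

    def featurize_batch(self, smis):
        # Hypothetical batched featurizer: one call handles every molecule.
        # The string length stands in for real graph features.
        return {"n_atoms": [len(s) for s in smis]}

    def __getitems__(self, idxs):
        # Receives a list of indices and returns one already-collated batch.
        smis = [self.smiles[i] for i in idxs]
        return {
            "bmg": self.featurize_batch(smis),
            "Y": [[self.ys[i]] for i in idxs],
        }


toy_cuik = ToyBatchedDataset(["C" * i for i in range(1, 20)], range(1, 20))
batch = toy_cuik.__getitems__([0, 1])
```

Here batch holds the features and targets of both molecules in one object, which mirrors how the CuikBatchedDatum above reports Y as a two-row array rather than two separate values.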

Datasets with custom featurizers#

Datasets use a molgraph featurizer to create the MolGraphs from the rdkit.Chem.Mol objects in datapoints. A basic SimpleMoleculeMolGraphFeaturizer is the default featurizer for MoleculeDatasets. If you are using a custom molgraph featurizer, pass it as an argument when creating the dataset.

[9]:

from chemprop.featurizers import SimpleMoleculeMolGraphFeaturizer, MultiHotAtomFeaturizer

mol_featurizer = SimpleMoleculeMolGraphFeaturizer(atom_featurizer=MultiHotAtomFeaturizer.v1())
MoleculeDataset(mol_datapoints, featurizer=mol_featurizer)

[9]:

MoleculeDataset(data=[MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7d4882771000>, y=array([0.38506292]), weight=1.0, gt_mask=None, lt_mask=None, x_d=None, x_phase=None, name='C', V_f=None, E_f=None, V_d=None), MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7d48827710e0>, y=array([0.72285413]), weight=1.0, gt_mask=None, lt_mask=None, x_d=None, x_phase=None, name='CC', V_f=None, E_f=None, V_d=None)], featurizer=SimpleMoleculeMolGraphFeaturizer(atom_featurizer=<chemprop.featurizers.atom.MultiHotAtomFeaturizer object at 0x7d48828a7bb0>, bond_featurizer=<chemprop.featurizers.bond.MultiHotBondFeaturizer object at 0x7d48827b8510>, extra_atom_fdim=0, extra_bond_fdim=0), n_workers=0)

Reaction Datasets#

Reaction datasets are the same as molecule datasets, except that they are made from a list of ReactionDatapoints and CondensedGraphOfReactionFeaturizer is the default featurizer. The condensed graphs of reaction (CGRs) it produces are also MolGraphs.

[10]:

ReactionDataset(rxn_datapoints).featurizer

[10]:

CondensedGraphOfReactionFeaturizer(atom_featurizer=<chemprop.featurizers.atom.MultiHotAtomFeaturizer object at 0x7d488274a210>, bond_featurizer=<chemprop.featurizers.bond.MultiHotBondFeaturizer object at 0x7d48827f0050>)

Multicomponent datasets#

MulticomponentDataset is for datasets whose target values depend on multiple components. It is composed of parallel MoleculeDatasets and ReactionDatasets.

[11]:

mol_dataset = MoleculeDataset(mol_datapoints)
rxn_dataset = ReactionDataset(rxn_datapoints)

# e.g. reaction in solvent
multi_dataset = MulticomponentDataset(datasets=[mol_dataset, rxn_dataset])

# e.g. solubility
MulticomponentDataset(datasets=[mol_dataset, mol_dataset])

[11]:

<chemprop.data.datasets.MulticomponentDataset at 0x7d48828d7890>

A MulticomponentDataset collates dataset properties (e.g. SMILES) of each dataset. It does not collate datapoint-level properties such as target values and extra datapoint descriptors. Chemprop models automatically take those from the first dataset in datasets.

[12]:

multi_dataset.smiles

[12]:

[('C', ('[O:1]([H:2])[H:3]', '[H:3].[O:1][H:2]')),
 ('CC', ('[S:1]([H:2])[H:3]', '[H:3].[S:1][H:2]'))]

[13]:

multi_dataset.datasets[0].Y

[13]:

array([[0.38506292],
       [0.72285413]])
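
The idea of parallel component datasets can be sketched without chemprop. ToyComponent and ToyMulticomponent are hypothetical classes written for this illustration, and the length check is an assumption of the sketch rather than documented behavior: entry i of every component is taken to describe the same sample, SMILES are collated across components, and targets are read from the first component only.

```python
class ToyComponent:
    """Toy stand-in for one component dataset (molecules or reactions)."""

    def __init__(self, smiles, Y):
        self.smiles = list(smiles)
        self.Y = list(Y)


class ToyMulticomponent:
    """Toy stand-in for a dataset built from parallel component datasets."""

    def __init__(self, datasets):
        # Assumption of this sketch: components must line up one-to-one.
        sizes = {len(d.smiles) for d in datasets}
        if len(sizes) != 1:
            raise ValueError("all component datasets must have the same length")
        self.datasets = list(datasets)

    @property
    def smiles(self):
        # One tuple per sample, holding that sample's SMILES in each component.
        return list(zip(*(d.smiles for d in self.datasets)))


solutes = ToyComponent(["C", "CC"], Y=[0.1, 0.2])
solvents = ToyComponent(["O", "S"], Y=[None, None])  # targets here are ignored

toy_multi = ToyMulticomponent([solutes, solvents])
paired_smiles = toy_multi.smiles
targets = toy_multi.datasets[0].Y  # targets come from the first component
```

Because only the first component supplies the targets in this sketch, the Y values stored on the second component never reach the model.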
