Datasets
[1]:
from chemprop.data.datasets import (
    CuikmolmakerDataset, MoleculeDataset, ReactionDataset, MulticomponentDataset
)
To make a dataset you first need a list of datapoints.
[2]:
import numpy as np
from chemprop.data import LazyMoleculeDatapoint, MoleculeDatapoint, ReactionDatapoint
ys = np.random.rand(2, 1)
smis = ["C", "CC"]
mol_datapoints = [MoleculeDatapoint.from_smi(smi, y) for smi, y in zip(smis, ys)]
lazy_mol_datapoints = [LazyMoleculeDatapoint("C" * i, y=[i]) for i in range(1, 20)]
rxn_smis = ["[H:2][O:1][H:3]>>[H:2][O:1].[H:3]", "[H:2][S:1][H:3]>>[H:2][S:1].[H:3]"]
rxn_datapoints = [
    ReactionDatapoint.from_smi(rxn_smi, y, keep_h=True) for rxn_smi, y in zip(rxn_smis, ys)
]
Molecule Datasets
MoleculeDatasets are made from a list of MoleculeDatapoints.
[3]:
MoleculeDataset(mol_datapoints)
[3]:
MoleculeDataset(data=[MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7f7a6b6ed5b0>, y=array([0.23384385]), weight=1.0, gt_mask=None, lt_mask=None, x_d=None, x_phase=None, name='C', V_f=None, E_f=None, V_d=None), MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7f7a6b6ed690>, y=array([0.74433064]), weight=1.0, gt_mask=None, lt_mask=None, x_d=None, x_phase=None, name='CC', V_f=None, E_f=None, V_d=None)], featurizer=SimpleMoleculeMolGraphFeaturizer(atom_featurizer=<chemprop.featurizers.atom.MultiHotAtomFeaturizer object at 0x7f7a6b52c290>, bond_featurizer=<chemprop.featurizers.bond.MultiHotBondFeaturizer object at 0x7f7a6b52c150>))
Dataset properties
The properties of datapoints are collated in a dataset.
[4]:
dataset = MoleculeDataset(mol_datapoints)
print(dataset.Y)
print(dataset.names)
[[0.23384385]
[0.74433064]]
['C', 'CC']
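As a rough illustration of what "collated" means here, the sketch below stacks per-datapoint targets into a single array with one row per datapoint. The dictionaries in `toy_datapoints` are hypothetical stand-ins, not Chemprop objects, and the code is not Chemprop's implementation.

```python
import numpy as np

# Hypothetical stand-ins for datapoints: each carries a name and a 1-D target array.
toy_datapoints = [
    {"name": "C", "y": np.array([0.25])},
    {"name": "CC", "y": np.array([0.75])},
]

# Collating stacks the per-datapoint targets row-wise into one
# (n_datapoints, n_targets) array and gathers the names into a list.
Y = np.array([d["y"] for d in toy_datapoints])
names = [d["name"] for d in toy_datapoints]

print(Y.shape)
print(names)
```

The resulting shape mirrors the two-row, one-column `dataset.Y` printed above.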
Datasets return a Datum when indexed. A Datum contains a MolGraph (see the molgraph featurizer notebook), the extra atom-level and datapoint-level descriptors, the target(s), the weight, and the masks for bounded loss functions.
[5]:
dataset[0]
[5]:
Datum(mg=MolGraph(V=array([[0. , 0. , 0. , 0. , 0. , 1. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
1. , 0. , 0. , 0. , 0. , 0. , 0. ,
1. , 0. , 1. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 1. , 0. , 0. ,
0. , 0. , 0. , 1. , 0. , 0. , 0. ,
0. , 0.12011]], dtype=float32), E=array([], shape=(0, 14), dtype=float64), edge_index=array([], shape=(2, 0), dtype=int64), rev_edge_index=array([], dtype=int64)), V_d=None, x_d=None, y=array([0.23384385]), weight=1.0, lt_mask=None, gt_mask=None)
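The repr above suggests that a Datum exposes its contents as named fields. Assuming it behaves like a named tuple, the sketch below uses a hypothetical `ToyDatum` with the same field names to show access by name and by unpacking. The placeholder graph string and the field types are illustrative only.

```python
from typing import NamedTuple, Optional


class ToyDatum(NamedTuple):
    """Hypothetical stand-in that mirrors the field names shown in the repr above."""

    mg: object  # the graph; a placeholder string is used below
    V_d: Optional[list]
    x_d: Optional[list]
    y: Optional[list]
    weight: float
    lt_mask: Optional[list]
    gt_mask: Optional[list]


datum = ToyDatum(
    mg="<graph placeholder>", V_d=None, x_d=None, y=[0.23], weight=1.0, lt_mask=None, gt_mask=None
)

# Fields can be read by name...
print(datum.y)

# ...or unpacked positionally, in declaration order.
mg, V_d, x_d, y, weight, lt_mask, gt_mask = datum
```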
Caching
By default, MolGraphs are generated on demand each time a datapoint is accessed. For small to medium datasets (exact sizes not yet benchmarked), it is more efficient to generate the MolGraphs once and cache them when the dataset is created.
To recreate the cache, set cache to True again. To clear the cache, set it to False.
We recommend scaling additional atom and bond features before enabling the cache: scaling them afterwards invalidates the cache, which is then recreated automatically.
To featurize the graphs in parallel when caching, pass the n_workers argument when creating the dataset. Note that this may cause hangs on Windows and macOS.
[6]:
import sys
if sys.platform not in ["win32", "darwin"]:
    dataset = MoleculeDataset(mol_datapoints, n_workers=3)
else:
    dataset = MoleculeDataset(mol_datapoints)
dataset.cache = True # Generate the molgraphs and cache them
dataset.cache = True # Recreate the cache
dataset.cache = False # Clear the cache
dataset.cache = True # Cache created with unscaled extra bond features
dataset.normalize_inputs(key="E_f") # Cache recreated automatically with scaled extra bond features
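To make the trade-off concrete, here is a minimal sketch of a dataset with a `cache` switch. `ToyCachedDataset` and `toy_featurize` are hypothetical names, and the class is a toy stand-in rather than Chemprop's implementation. It only shows the idea: without a cache every access featurizes again, while setting the cache to True featurizes everything once.

```python
class ToyCachedDataset:
    """Toy stand-in (not Chemprop's implementation) for a dataset with a `cache` switch."""

    def __init__(self, items, featurize):
        self.items = list(items)
        self.featurize = featurize
        self._cache = None  # None means "no cache; featurize on access"

    @property
    def cache(self):
        return self._cache is not None

    @cache.setter
    def cache(self, enabled):
        # True (re)builds the cache from scratch; False discards it.
        self._cache = [self.featurize(x) for x in self.items] if enabled else None

    def __getitem__(self, idx):
        if self._cache is not None:
            return self._cache[idx]
        return self.featurize(self.items[idx])


calls = []


def toy_featurize(smi):
    calls.append(smi)  # record every featurization so the cost is visible
    return len(smi)


toy = ToyCachedDataset(["C", "CC"], toy_featurize)
toy[0]
toy[0]  # no cache: the same item was featurized twice
toy.cache = True  # featurizes both items once
toy[0]
toy[1]  # served from the cache, no new featurization
print(len(calls))
```

In this toy, repeated access without a cache pays the featurization cost every time, which is why caching tends to help when the featurized graphs fit in memory.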
CuikmolmakerDataset (available with cuik-molmaker only)
This dataset uses cuik-molmaker to construct and featurize a batch of molecules at once instead of one at a time. CuikmolmakerDataset implements __getitems__ instead of __getitem__, which enables batched featurization and access. The method returns a CuikBatchedDatum, which contains the same information as a Datum, except that the graph information is returned as a series of Tensors instead of a MolGraph and the information for all molecules in the batch is collated together.
[7]:
from chemprop.utils.utils import is_cuikmolmaker_available
print(f"cuik-molmaker available: {is_cuikmolmaker_available()}")
cuik-molmaker available: True
[8]:
if is_cuikmolmaker_available():
    cuik_dataset = CuikmolmakerDataset(lazy_mol_datapoints)
    print(cuik_dataset.__getitems__([0, 1]))
CuikBatchedDatum(atom_feats=tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1201],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1201],
[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 1.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000,
0.0000, 0.0000, 0.0000, 1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.1201]]), bond_feats=tensor([[0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]]), edge_index=tensor([[1, 2],
[2, 1]]), rev_edge_index=tensor([1, 0]), batch=tensor([0, 1, 1]), V_d=None, X_d=array([None, None], dtype=object), Y=array([[1.],
[2.]]), weights=array([1., 1.]), lt_mask=array([None, None], dtype=object), gt_mask=array([None, None], dtype=object))
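The batched access pattern can be sketched without cuik-molmaker. The hypothetical `ToyBatchedDataset` below is not the real class: it counts characters in a SMILES string as "atoms", which only works for the all-carbon chains used here. It shows how one `__getitems__` call can serve a list of indices and how a `batch` vector assigns each atom to its molecule.

```python
class ToyBatchedDataset:
    """Toy stand-in (not CuikmolmakerDataset) that handles a whole batch per call."""

    def __init__(self, smiles, ys):
        self.smiles = list(smiles)
        self.ys = list(ys)

    def __len__(self):
        return len(self.smiles)

    def __getitems__(self, indices):
        # One call handles all requested indices together.
        n_atoms = [len(self.smiles[i]) for i in indices]
        # `batch` maps each atom to the position of its molecule within the batch.
        batch = [pos for pos, n in enumerate(n_atoms) for _ in range(n)]
        return {"batch": batch, "Y": [self.ys[i] for i in indices]}


toy_batched = ToyBatchedDataset(
    ["C" * i for i in range(1, 20)], [[float(i)] for i in range(1, 20)]
)
out = toy_batched.__getitems__([0, 1])
print(out)
```

For indices 0 and 1 ("C" and "CC"), the toy `batch` is `[0, 1, 1]`, matching the `batch` tensor in the output above. As far as I know, recent PyTorch data loaders call `__getitems__` with a list of indices when a map-style dataset defines it, which is what makes this pattern useful for batch loading.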
Datasets with custom featurizers
Datasets use a molgraph featurizer to create the MolGraphs from the rdkit.Chem.Mol objects in the datapoints. SimpleMoleculeMolGraphFeaturizer is the default featurizer for MoleculeDatasets. If you are using a custom molgraph featurizer, pass it as an argument when creating the dataset.
[9]:
from chemprop.featurizers import SimpleMoleculeMolGraphFeaturizer, MultiHotAtomFeaturizer
mol_featurizer = SimpleMoleculeMolGraphFeaturizer(atom_featurizer=MultiHotAtomFeaturizer.v1())
MoleculeDataset(mol_datapoints, featurizer=mol_featurizer)
[9]:
MoleculeDataset(data=[MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7f7a6b6ed5b0>, y=array([0.23384385]), weight=1.0, gt_mask=None, lt_mask=None, x_d=None, x_phase=None, name='C', V_f=None, E_f=None, V_d=None), MoleculeDatapoint(mol=<rdkit.Chem.rdchem.Mol object at 0x7f7a6b6ed690>, y=array([0.74433064]), weight=1.0, gt_mask=None, lt_mask=None, x_d=None, x_phase=None, name='CC', V_f=None, E_f=None, V_d=None)], featurizer=SimpleMoleculeMolGraphFeaturizer(atom_featurizer=<chemprop.featurizers.atom.MultiHotAtomFeaturizer object at 0x7f7a6b538a50>, bond_featurizer=<chemprop.featurizers.bond.MultiHotBondFeaturizer object at 0x7f7a6b538f50>))
Reaction Datasets
Reaction datasets are the same as molecule datasets, except they are made from a list of ReactionDatapoints and CondensedGraphOfReactionFeaturizer is the default featurizer. CGRs are also MolGraphs.
[10]:
ReactionDataset(rxn_datapoints).featurizer
[10]:
CondensedGraphOfReactionFeaturizer(atom_featurizer=<chemprop.featurizers.atom.MultiHotAtomFeaturizer object at 0x7f7a6b53ab10>, bond_featurizer=<chemprop.featurizers.bond.MultiHotBondFeaturizer object at 0x7f7a6b53a8d0>)
Multicomponent datasets
MulticomponentDataset is for datasets whose target values depend on multiple components. It is composed of parallel MoleculeDatasets and ReactionDatasets.
[11]:
mol_dataset = MoleculeDataset(mol_datapoints)
rxn_dataset = ReactionDataset(rxn_datapoints)
# e.g. reaction in solvent
multi_dataset = MulticomponentDataset(datasets=[mol_dataset, rxn_dataset])
# e.g. solubility
MulticomponentDataset(datasets=[mol_dataset, mol_dataset])
[11]:
<chemprop.data.datasets.MulticomponentDataset at 0x7f7a6b53bb90>
A MulticomponentDataset collates the dataset-level properties (e.g. SMILES) of each component dataset. It does not collate datapoint-level properties such as target values and extra datapoint descriptors; Chemprop models automatically take those from the first dataset in datasets.
[12]:
multi_dataset.smiles
[12]:
[('C', ('[O:1]([H:2])[H:3]', '[H:3].[O:1][H:2]')),
('CC', ('[S:1]([H:2])[H:3]', '[H:3].[S:1][H:2]'))]
[13]:
multi_dataset.datasets[0].Y
[13]:
array([[0.23384385],
[0.74433064]])
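The "parallel datasets" idea can be sketched with plain Python. The hypothetical `ToyMulticomponent` below is not Chemprop's class: each component is just a list of `(smiles, target)` pairs, indexing returns one entry per component, and the targets are read from the first component only. The equal-length check is an assumption made for this sketch.

```python
class ToyMulticomponent:
    """Toy stand-in (not Chemprop's MulticomponentDataset) that zips parallel components."""

    def __init__(self, datasets):
        # Assumption for this sketch: components must line up one-to-one.
        if len({len(d) for d in datasets}) != 1:
            raise ValueError("all component datasets must have the same length")
        self.datasets = datasets

    def __len__(self):
        return len(self.datasets[0])

    def __getitem__(self, idx):
        # One entry per component for the same row.
        return tuple(d[idx] for d in self.datasets)


# Hypothetical (smiles, target) pairs; the second component carries no targets.
solutes = [("C", 0.25), ("CC", 0.75)]
solvents = [("O", None), ("CCO", None)]
toy_multi = ToyMulticomponent([solutes, solvents])

# SMILES are collated across components, row by row.
toy_smiles = [tuple(item[0] for item in toy_multi[i]) for i in range(len(toy_multi))]
# Targets come from the first component only.
toy_targets = [y for _, y in toy_multi.datasets[0]]

print(toy_smiles)
print(toy_targets)
```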