chemprop.data

chemprop.data#

Submodules#

Attributes#

Classes#

BatchMolAtomBondGraph

A BatchMolGraph represents a batch of individual MolGraphs.

BatchMolGraph

A BatchMolGraph represents a batch of individual MolGraphs.

MolAtomBondTrainingBatch

MulticomponentTrainingBatch

TrainingBatch

LazyMoleculeDatapoint

A LazyMoleculeDatapoint contains a single SMILES string and all attributes needed to form an rdkit.Chem.Mol object.

MolAtomBondDatapoint

A MoleculeDatapoint contains a single molecule and its associated features and targets.

MoleculeDatapoint

A MoleculeDatapoint contains a single molecule and its associated features and targets.

ReactionDatapoint

A ReactionDatapoint contains a single reaction and its associated features and targets.

CuikmolmakerDataset

A CuikmolmakerDataset composed of LazyMoleculeDatapoints and a CuikmolmakerMolGraphFeaturizer

Datum

a singular training data point

MolAtomBondDataset

A MoleculeDataset composed of MoleculeDatapoints

MolAtomBondDatum

a singular training data point that supports atom and bond level targets

MoleculeDataset

A MoleculeDataset composed of MoleculeDatapoints

MulticomponentDataset

A MulticomponentDataset is a Dataset composed of parallel MoleculeDatasets and ReactionDatasets

ReactionDataset

A ReactionDataset composed of ReactionDatapoints

MolGraph

A MolGraph represents the graph featurization of a molecule.

ClassBalanceSampler

A ClassBalanceSampler samples data from a MolGraphDataset such that positive and negative classes are equally sampled

SeededSampler

A SeededSampler is a class for iterating through a dataset in a randomly seeded fashion

SplitType

Enum where members are also (and must be) strings

Functions#

collate_batch(batch)

collate_mol_atom_bond_batch(batch)

collate_multicomponent(batches)

build_dataloader(dataset[, batch_size, num_workers, ...])

Return a DataLoader for MolGraphDatasets

make_split_indices(mols[, split, sizes, seed, ...])

Splits data into training, validation, and test splits.

split_data_by_indices(data[, train_indices, ...])

Splits data into training, validation, and test groups based on split indices given.

Package Contents#

class chemprop.data.BatchMolAtomBondGraph[source]#

Bases: BatchMolGraph

A BatchMolGraph represents a batch of individual MolGraphs.

It has all the attributes of a MolGraph with the addition of the batch attribute. This class is intended for use with data loading, so it uses Tensors to store data

bond_batch: torch.Tensor#

A tensor of indices that show which MolGraph each bond belongs to in the batch

__post_init__(mgs)[source]#
Parameters:

mgs (Sequence[chemprop.data.molgraph.MolGraph])

to(device)[source]#
class chemprop.data.BatchMolGraph[source]#

A BatchMolGraph represents a batch of individual MolGraphs.

It has all the attributes of a MolGraph with the addition of the batch attribute. This class is intended for use with data loading, so it uses Tensors to store data

mgs: dataclasses.InitVar[Sequence[chemprop.data.molgraph.MolGraph]]#

A list of individual MolGraphs to be batched together

V: torch.Tensor#

the atom feature matrix

E: torch.Tensor#

the bond feature matrix

edge_index: torch.Tensor#

a tensor of shape 2 x E containing the edges of the graph in COO format

rev_edge_index: torch.Tensor#

A tensor of shape E that maps from an edge index to the index of the source of the reverse edge in the edge_index attribute.

batch: torch.Tensor#

the index of the parent MolGraph in the batched graph

__post_init__(mgs)[source]#
Parameters:

mgs (Sequence[chemprop.data.molgraph.MolGraph])

__len__()[source]#

the number of individual MolGraphs in this batch

Return type:

int

to(device)[source]#
Parameters:

device (str | torch.device)
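
As a rough illustration of what batching means for the attributes above, the sketch below merges two tiny graphs using plain Python lists instead of Tensors. The helper batch_graphs and the dict layout are hypothetical and not part of chemprop. The sketch assumes that atom indices in edge_index are shifted by the number of atoms already in the batch, that rev_edge_index is shifted by the number of edges already in the batch, and that batch stores the parent graph index of each atom.

```python
def batch_graphs(graphs):
    """Merge small graphs into one disconnected graph (illustrative only).

    Each graph is a dict with 'n_atoms', 'edge_index' ([sources, targets])
    and 'rev_edge_index'. This layout is an assumption made for the sketch.
    """
    src, dst, rev, batch = [], [], [], []
    atom_offset = 0
    edge_offset = 0
    for i, g in enumerate(graphs):
        g_src, g_dst = g["edge_index"]
        # shift atom indices so they point into the batched atom list
        src += [a + atom_offset for a in g_src]
        dst += [a + atom_offset for a in g_dst]
        # shift reverse-edge indices so they point into the batched edge list
        rev += [e + edge_offset for e in g["rev_edge_index"]]
        # every atom of graph i gets parent index i
        batch += [i] * g["n_atoms"]
        atom_offset += g["n_atoms"]
        edge_offset += len(g_src)
    return {"edge_index": [src, dst], "rev_edge_index": rev, "batch": batch}


# a 2-atom molecule (one bond) and a 3-atom chain (two bonds)
g1 = {"n_atoms": 2, "edge_index": [[0, 1], [1, 0]], "rev_edge_index": [1, 0]}
g2 = {"n_atoms": 3, "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]], "rev_edge_index": [1, 0, 3, 2]}
batched = batch_graphs([g1, g2])
```

With this layout, len() of the batch would correspond to the number of distinct values in batch, here 2.
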

class chemprop.data.MolAtomBondTrainingBatch[source]#

Bases: NamedTuple

bmg: BatchMolAtomBondGraph#
V_d: torch.Tensor | None#
E_d: torch.Tensor | None#
X_d: torch.Tensor | None#
Ys: tuple[torch.Tensor | None, torch.Tensor | None, torch.Tensor | None]#
w: tuple[torch.Tensor | None, torch.Tensor | None, torch.Tensor | None]#
lt_masks: tuple[torch.Tensor | None, torch.Tensor | None, torch.Tensor | None]#
gt_masks: tuple[torch.Tensor | None, torch.Tensor | None, torch.Tensor | None]#
constraints: tuple[torch.Tensor | None, torch.Tensor | None]#
class chemprop.data.MulticomponentTrainingBatch[source]#

Bases: NamedTuple

bmgs: list[BatchMolGraph]#
V_ds: list[torch.Tensor | None]#
X_d: torch.Tensor | None#
Y: torch.Tensor | None#
w: torch.Tensor#
lt_mask: torch.Tensor | None#
gt_mask: torch.Tensor | None#
class chemprop.data.TrainingBatch[source]#

Bases: NamedTuple

bmg: BatchMolGraph | chemprop.featurizers.molgraph.molecule.BatchCuikMolGraph#
V_d: torch.Tensor | None#
X_d: torch.Tensor | None#
Y: torch.Tensor | None#
w: torch.Tensor#
lt_mask: torch.Tensor | None#
gt_mask: torch.Tensor | None#
chemprop.data.collate_batch(batch)[source]#
Parameters:

batch (Iterable[chemprop.data.datasets.Datum])

Return type:

TrainingBatch
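
Conceptually, a collate function turns a list of per-sample tuples into one tuple of per-field batches. The sketch below shows that transposition with a hypothetical ToyDatum and plain lists. It is not the chemprop implementation, and the treatment of optional fields (kept as None when the first sample lacks them) is an assumption.

```python
from typing import NamedTuple, Optional


class ToyDatum(NamedTuple):
    # hypothetical stand-in for a datum: a graph placeholder, optional
    # extra descriptors, targets, and a sample weight
    mg: str
    x_d: Optional[list]
    y: Optional[list]
    weight: float


def collate_toy(batch):
    """Transpose a list of per-sample tuples into per-field columns."""
    mgs, x_ds, ys, weights = zip(*batch)
    return {
        "mgs": list(mgs),
        # an optional field stays None when the samples do not carry it
        "X_d": None if x_ds[0] is None else list(x_ds),
        "Y": None if ys[0] is None else list(ys),
        "w": list(weights),
    }


toy_batch = collate_toy([
    ToyDatum("graph-0", None, [1.0], 1.0),
    ToyDatum("graph-1", None, [0.5], 2.0),
])
```
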

chemprop.data.collate_mol_atom_bond_batch(batch)[source]#
Parameters:

batch (Iterable[chemprop.data.datasets.MolAtomBondDatum])

Return type:

MolAtomBondTrainingBatch

chemprop.data.collate_multicomponent(batches)[source]#
Parameters:

batches (Iterable[Iterable[chemprop.data.datasets.Datum]])

Return type:

MulticomponentTrainingBatch

chemprop.data.build_dataloader(dataset, batch_size=64, num_workers=0, class_balance=False, seed=None, shuffle=True, drop_last=None, **kwargs)[source]#

Return a DataLoader for MolGraphDatasets

Parameters:
  • dataset (MoleculeDataset | ReactionDataset | MulticomponentDataset) – The dataset containing the molecules or reactions to load.

  • batch_size (int, default=64) – the batch size to load.

  • num_workers (int, default=0) – the number of workers used to build batches.

  • class_balance (bool, default=False) – Whether to perform class balancing (i.e., use an equal number of positive and negative molecules). Class balance is only available for single task classification datasets. Set shuffle to True in order to get a random subset of the larger class.

  • seed (int, default=None) – the random seed to use for shuffling (only used when shuffle is True).

  • shuffle (bool, default=True) – whether to shuffle the data during sampling.

  • drop_last (bool, default=None) – Whether to drop the last batch if it is of size 1 (needed if using batchnorm during training). If None, this will be set automatically.
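
The description of drop_last does not spell out the automatic rule. The sketch below assumes the simplest reading: when drop_last is None, the final batch is dropped exactly when it would hold a single sample. Both helper names are hypothetical and only illustrate the arithmetic.

```python
def resolve_drop_last(n_samples, batch_size, drop_last=None):
    """Pick drop_last when the caller passes None (assumed rule)."""
    if drop_last is not None:
        return drop_last
    # a trailing batch of exactly one sample is a problem for batchnorm
    return n_samples % batch_size == 1


def batch_sizes(n_samples, batch_size, drop_last):
    """Sizes of the batches a loader would yield under that setting."""
    full, rest = divmod(n_samples, batch_size)
    sizes = [batch_size] * full
    if rest and not drop_last:
        sizes.append(rest)
    return sizes
```

For example, 129 samples with batch_size=64 would give batches of 64 and 64 with the single leftover sample dropped, while 130 samples would keep a final batch of 2.
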

class chemprop.data.LazyMoleculeDatapoint[source]#

Bases: _DatapointMixin, _LazyMoleculeDatapointMixin

A LazyMoleculeDatapoint contains a single SMILES string and all attributes needed to form an rdkit.Chem.Mol object. The molecule is computed lazily when the attribute mol is accessed.

V_f: numpy.ndarray | None = None#

A numpy array of shape V x d_vf, where V is the number of atoms in the molecule, and d_vf is the number of additional features that will be concatenated to atom-level features before message passing

E_f: numpy.ndarray | None = None#

A numpy array of shape E x d_ef, where E is the number of bonds in the molecule, and d_ef is the number of additional features that will be concatenated to bond-level features before message passing

V_d: numpy.ndarray | None = None#

A numpy array of shape V x d_vd, where V is the number of atoms in the molecule, and d_vd is the number of additional descriptors that will be concatenated to atom-level descriptors after message passing

__post_init__()[source]#
__len__()[source]#
Return type:

int

class chemprop.data.MolAtomBondDatapoint[source]#

Bases: MoleculeDatapoint

A MoleculeDatapoint contains a single molecule and its associated features and targets.

E_d: numpy.ndarray | None = None#

A numpy array of shape E x d_ed, where E is the number of bonds in the molecule, and d_ed is the number of additional descriptors that will be concatenated to edge-level descriptors after message passing

atom_y: numpy.ndarray | None = None#

A numpy array of shape V x v_t, where V is the number of atoms in the molecule, and v_t is the number of atom targets. The order of atoms in the array should match the order of atoms in the mol. Unknown targets are indicated by `nan`s.

atom_gt_mask: numpy.ndarray | None = None#

Indicates whether the atom targets are an inequality regression target of the form >x

atom_lt_mask: numpy.ndarray | None = None#

Indicates whether the atom targets are an inequality regression target of the form <x

bond_y: numpy.ndarray | None = None#

A numpy array of shape E x e_t, where E is the number of bonds in the molecule, and e_t is the number of bond targets. The order of bonds in the array should match the order of bonds in the mol. Unknown targets are indicated by `nan`s.

bond_gt_mask: numpy.ndarray | None = None#

Indicates whether the bond targets are an inequality regression target of the form >x

bond_lt_mask: numpy.ndarray | None = None#

Indicates whether the bond targets are an inequality regression target of the form <x

atom_constraint: numpy.ndarray | None = None#

A numpy array of shape 1 x v_t containing the values that the atom property predictions should be constrained to sum to, with np.nan indicating no constraint for that property

bond_constraint: numpy.ndarray | None = None#

A numpy array of shape 1 x e_t containing the values that the bond property predictions should be constrained to sum to, with np.nan indicating no constraint for that property
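
To make the constraint convention concrete, the sketch below checks how far a set of per-atom predictions is from a sum constraint, skipping properties whose constraint is NaN. It uses plain lists and a hypothetical helper, constraint_residuals. How chemprop actually enforces constraints during prediction is not described here and is not modeled.

```python
import math


def constraint_residuals(atom_preds, constraint):
    """Per-property gap between the constraint and the summed predictions.

    atom_preds: one row per atom, one column per atom target (V x v_t).
    constraint: one value per target; NaN means "unconstrained".
    """
    residuals = []
    for t, target_sum in enumerate(constraint):
        if math.isnan(target_sum):
            residuals.append(None)  # nothing to enforce for this property
            continue
        total = sum(row[t] for row in atom_preds)
        residuals.append(target_sum - total)
    return residuals


# three atoms, two atom-level properties; only the first is constrained
preds = [[0.25, 1.0], [0.25, 2.0], [0.5, 3.0]]
residuals = constraint_residuals(preds, [1.0, math.nan])
```
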

__post_init__()[source]#
classmethod from_smi(smi, *args, keep_h=False, add_h=False, ignore_stereo=False, reorder_atoms=True, **kwargs)[source]#
Parameters:
  • smi (str)

  • keep_h (bool)

  • add_h (bool)

  • ignore_stereo (bool)

  • reorder_atoms (bool)

Return type:

MolAtomBondDatapoint

class chemprop.data.MoleculeDatapoint[source]#

Bases: _DatapointMixin, _MoleculeDatapointMixin

A MoleculeDatapoint contains a single molecule and its associated features and targets.

V_f: numpy.ndarray | None = None#

A numpy array of shape V x d_vf, where V is the number of atoms in the molecule, and d_vf is the number of additional features that will be concatenated to atom-level features before message passing

E_f: numpy.ndarray | None = None#

A numpy array of shape E x d_ef, where E is the number of bonds in the molecule, and d_ef is the number of additional features that will be concatenated to bond-level features before message passing

V_d: numpy.ndarray | None = None#

A numpy array of shape V x d_vd, where V is the number of atoms in the molecule, and d_vd is the number of additional descriptors that will be concatenated to atom-level descriptors after message passing

__post_init__()[source]#
__len__()[source]#
Return type:

int

class chemprop.data.ReactionDatapoint[source]#

Bases: _DatapointMixin, _ReactionDatapointMixin

A ReactionDatapoint contains a single reaction and its associated features and targets.

__post_init__()[source]#
__len__()[source]#
Return type:

int

class chemprop.data.CuikmolmakerDataset[source]#

Bases: MoleculeDataset

A CuikmolmakerDataset composed of LazyMoleculeDatapoints and a CuikmolmakerMolGraphFeaturizer

A CuikmolmakerDataset produces featurized data for a batch of molecules for ingestion by an MPNN model. Data featurization is always performed on-the-fly using the cuik-molmaker package. This batched processing is significantly faster and consumes less memory than the default featurization method when caching is not possible.

Parameters:
data: list[chemprop.data.datapoints.LazyMoleculeDatapoint]#
featurizer: chemprop.featurizers.molgraph.CuikmolmakerMolGraphFeaturizer#
property smiles: list[str]#

the SMILES strings associated with the dataset

Return type:

list[str]

__getitem__(idx)[source]#
Parameters:

idx (int)

Return type:

Datum

__getitems__(indexes)[source]#
Parameters:

indexes (list[int])

Return type:

CuikBatchedDatum

class chemprop.data.Datum[source]#

Bases: NamedTuple

a singular training data point

mg: chemprop.data.molgraph.MolGraph#
V_d: numpy.ndarray | None#
x_d: numpy.ndarray | None#
y: numpy.ndarray | None#
weight: float#
lt_mask: numpy.ndarray | None#
gt_mask: numpy.ndarray | None#
class chemprop.data.MolAtomBondDataset[source]#

Bases: MoleculeDataset, MolAtomBondGraphDataset

A MoleculeDataset composed of MoleculeDatapoints

A MoleculeDataset produces featurized data for input to an MPNN model. Typically, data featurization is performed on-the-fly and parallelized across multiple workers via the DataLoader class. However, for small datasets, it may be more efficient to featurize the data in advance and cache the results. This can be done by setting MoleculeDataset.cache=True.

Parameters:
  • data (Iterable[MoleculeDatapoint]) – the data from which to create a dataset

  • featurizer (MoleculeFeaturizer) – the featurizer with which to generate MolGraphs of the molecules

  • n_workers (int, optional) – number of workers to use for cache calculation

data: list[chemprop.data.datapoints.MolAtomBondDatapoint]#
__getitem__(idx)[source]#
Parameters:

idx (int)

Return type:

MolAtomBondDatum

property atom_Y: list[numpy.ndarray]#

the (scaled) atom targets of the dataset

Return type:

list[numpy.ndarray]

property atom_constraints: numpy.ndarray#
Return type:

numpy.ndarray

property bond_Y: list[numpy.ndarray]#

the (scaled) bond targets of the dataset

Return type:

list[numpy.ndarray]

property bond_constraints: numpy.ndarray#
Return type:

numpy.ndarray

property atom_gt_mask: numpy.ndarray#
Return type:

numpy.ndarray

property atom_lt_mask: numpy.ndarray#
Return type:

numpy.ndarray

property bond_gt_mask: numpy.ndarray#
Return type:

numpy.ndarray

property bond_lt_mask: numpy.ndarray#
Return type:

numpy.ndarray

property E_ds: list[numpy.ndarray]#

the (scaled) bond descriptors of the dataset

Return type:

list[numpy.ndarray]

property d_ed: int#

the extra bond descriptor dimension, if any

Return type:

int

normalize_targets(key='mol', scaler=None)[source]#

Normalizes the targets of this dataset using a StandardScaler

The StandardScaler subtracts the mean and divides by the standard deviation for each task independently. NOTE: This should only be used for regression datasets.

Returns:

a scaler fit to the targets.

Return type:

StandardScaler

Parameters:
  • key (str)

  • scaler (sklearn.preprocessing.StandardScaler | None)
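
The normalization described above (subtract the mean, divide by the standard deviation, independently per task) can be written out in a few lines of standard-library Python. This is only a sketch of the arithmetic with hypothetical helper names. It uses the population standard deviation, which is what scikit-learn's StandardScaler computes, and it ignores missing targets.

```python
import statistics


def fit_standard_scaler(Y):
    """Per-task mean and population standard deviation of a target table."""
    columns = list(zip(*Y))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def transform(Y, means, stds):
    """Subtract the mean and divide by the standard deviation, per task."""
    return [[(y - m) / s for y, m, s in zip(row, means, stds)] for row in Y]


# three datapoints, two regression tasks on very different scales
Y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = fit_standard_scaler(Y)
Y_scaled = transform(Y, means, stds)
```

After scaling, both task columns have zero mean and unit standard deviation, so neither task dominates the loss purely because of its units.
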

normalize_inputs(key='X_d', scaler=None)[source]#
Parameters:
  • key (str)

  • scaler (sklearn.preprocessing.StandardScaler | None)

Return type:

sklearn.preprocessing.StandardScaler

reset()[source]#

Reset the atom and bond features; atom and extra descriptors; and targets of each datapoint to their initial, unnormalized values.

class chemprop.data.MolAtomBondDatum[source]#

Bases: NamedTuple

a singular training data point that supports atom and bond level targets

mg: chemprop.data.molgraph.MolGraph#
V_d: numpy.ndarray | None#
E_d: numpy.ndarray | None#
x_d: numpy.ndarray | None#
ys: tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]#
weight: float#
lt_masks: tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]#
gt_masks: tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]#
constraints: tuple[numpy.ndarray | None, numpy.ndarray | None]#
class chemprop.data.MoleculeDataset[source]#

Bases: _MolGraphDatasetMixin, MolGraphDataset

A MoleculeDataset composed of MoleculeDatapoints

A MoleculeDataset produces featurized data for input to an MPNN model. Typically, data featurization is performed on-the-fly and parallelized across multiple workers via the DataLoader class. However, for small datasets, it may be more efficient to featurize the data in advance and cache the results. This can be done by setting MoleculeDataset.cache=True.

Parameters:
  • data (Iterable[MoleculeDatapoint]) – the data from which to create a dataset

  • featurizer (MoleculeFeaturizer) – the featurizer with which to generate MolGraphs of the molecules

  • n_workers (int, optional) – number of workers to use for cache calculation

data: list[chemprop.data.datapoints.MoleculeDatapoint]#
featurizer: chemprop.featurizers.base.Featurizer[rdkit.Chem.Mol, chemprop.data.molgraph.MolGraph]#
n_workers: int = 0#
__post_init__()[source]#
__getitem__(idx)[source]#
Parameters:

idx (int)

Return type:

Datum

property cache: bool#
Return type:

bool
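
The trade-off between on-the-fly featurization and caching can be illustrated with a toy dataset whose cache property precomputes all featurizations when switched on. ToyCachedDataset and count_atoms are hypothetical names invented for this sketch; the real class may manage its cache differently.

```python
class ToyCachedDataset:
    """Featurize on access, or once up front when cache is enabled."""

    def __init__(self, data, featurize):
        self.data = list(data)
        self.featurize = featurize
        self._cache = None

    @property
    def cache(self):
        return self._cache is not None

    @cache.setter
    def cache(self, enabled):
        # precompute every featurization when switched on, drop it otherwise
        self._cache = [self.featurize(d) for d in self.data] if enabled else None

    def __getitem__(self, idx):
        if self._cache is not None:
            return self._cache[idx]
        return self.featurize(self.data[idx])


calls = []


def count_atoms(smiles):
    # hypothetical "featurizer": records the call and counts letters
    calls.append(smiles)
    return sum(ch.isalpha() for ch in smiles)


ds = ToyCachedDataset(["CCO", "C=O"], count_atoms)
ds.cache = True   # featurizes both entries once
first = ds[0]     # served from the cache, no new featurizer call
```
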

property smiles: list[str]#

the SMILES strings associated with the dataset

Return type:

list[str]

property mols: list[rdkit.Chem.Mol]#

the molecules associated with the dataset

Return type:

list[rdkit.Chem.Mol]

property V_fs: list[numpy.ndarray]#

the (scaled) atom features of the dataset

Return type:

list[numpy.ndarray]

property E_fs: list[numpy.ndarray]#

the (scaled) bond features of the dataset

Return type:

list[numpy.ndarray]

property V_ds: list[numpy.ndarray]#

the (scaled) atom descriptors of the dataset

Return type:

list[numpy.ndarray]

property d_vf: int#

the extra atom feature dimension, if any

Return type:

int

property d_ef: int#

the extra bond feature dimension, if any

Return type:

int

property d_vd: int#

the extra atom descriptor dimension, if any

Return type:

int

normalize_inputs(key='X_d', scaler=None)[source]#
Parameters:
  • key (str)

  • scaler (sklearn.preprocessing.StandardScaler | None)

Return type:

sklearn.preprocessing.StandardScaler

reset()[source]#

Reset the atom and bond features; atom and extra descriptors; and targets of each datapoint to their initial, unnormalized values.

type chemprop.data.MolGraphDataset = Dataset[Datum]#
class chemprop.data.MulticomponentDataset[source]#

Bases: _MolGraphDatasetMixin, torch.utils.data.Dataset

A MulticomponentDataset is a Dataset composed of parallel MoleculeDatasets and ReactionDatasets

datasets: list[MoleculeDataset | ReactionDataset]#

the parallel datasets

__post_init__()[source]#
__len__()[source]#
Return type:

int

property n_components: int#
Return type:

int

__getitem__(idx)[source]#
Parameters:

idx (int)

Return type:

list[Datum]
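
The idea of parallel datasets can be sketched as several equally long sequences indexed in lockstep, so that item idx is a list with one entry per component. ToyMulticomponent is a hypothetical class, and the equal-length check is an assumption about what "parallel" requires.

```python
class ToyMulticomponent:
    """Index several equally long datasets in lockstep."""

    def __init__(self, datasets):
        lengths = {len(d) for d in datasets}
        if len(lengths) != 1:
            raise ValueError("all component datasets must have the same length")
        self.datasets = datasets

    def __len__(self):
        return len(self.datasets[0])

    @property
    def n_components(self):
        return len(self.datasets)

    def __getitem__(self, idx):
        # one entry per component, all taken from the same row
        return [d[idx] for d in self.datasets]


# e.g. a solute column and a solvent column describing the same experiments
solutes = ["CCO", "CC(=O)O"]
solvents = ["O", "CO"]
mc = ToyMulticomponent([solutes, solvents])
```
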

property smiles: list[list[str]]#
Return type:

list[list[str]]

property names: list[list[str]]#
Return type:

list[list[str]]

property mols: list[list[rdkit.Chem.Mol]]#
Return type:

list[list[rdkit.Chem.Mol]]

normalize_targets(scaler=None)[source]#

Normalizes the targets of this dataset using a StandardScaler

The StandardScaler subtracts the mean and divides by the standard deviation for each task independently. NOTE: This should only be used for regression datasets.

Returns:

a scaler fit to the targets.

Return type:

StandardScaler

Parameters:

scaler (sklearn.preprocessing.StandardScaler | None)

normalize_inputs(key='X_d', scaler=None)[source]#
Parameters:
  • key (str)

  • scaler (list[sklearn.preprocessing.StandardScaler] | None)

Return type:

list[sklearn.preprocessing.StandardScaler]

reset()[source]#

Reset the atom and bond features; atom and extra descriptors; and targets of each datapoint to their initial, unnormalized values.

property d_xd: list[int]#

the extra molecule descriptor dimension, if any

Return type:

list[int]

property d_vf: list[int]#
Return type:

list[int]

property d_ef: list[int]#
Return type:

list[int]

property d_vd: list[int]#
Return type:

list[int]

class chemprop.data.ReactionDataset[source]#

Bases: _MolGraphDatasetMixin, MolGraphDataset

A ReactionDataset composed of ReactionDatapoints

Note

The featurized data provided by this class may be cached, similar to a MoleculeDataset. To enable the cache, set ReactionDataset.cache=True.

data: list[chemprop.data.datapoints.ReactionDatapoint]#

the dataset from which to load

featurizer: chemprop.featurizers.base.Featurizer[chemprop.types.Rxn, chemprop.data.molgraph.MolGraph]#

the featurizer with which to generate MolGraphs of the input

n_workers: int = 0#

number of workers to use for cache calculation

__post_init__()[source]#
property cache: bool#
Return type:

bool

__getitem__(idx)[source]#
Parameters:

idx (int)

Return type:

Datum

property smiles: list[tuple]#
Return type:

list[tuple]

property mols: list[chemprop.types.Rxn]#
Return type:

list[chemprop.types.Rxn]

property d_vf: int#
Return type:

int

property d_ef: int#
Return type:

int

property d_vd: int#
Return type:

int

class chemprop.data.MolGraph[source]#

Bases: NamedTuple

A MolGraph represents the graph featurization of a molecule.

V: numpy.ndarray#

an array of shape V x d_v containing the atom features of the molecule

E: numpy.ndarray#

an array of shape E x d_e containing the bond features of the molecule

edge_index: numpy.ndarray#

an array of shape 2 x E containing the edges of the graph in COO format

rev_edge_index: numpy.ndarray#

An array of shape E that maps from an edge index to the index of the source of the reverse edge in the edge_index attribute.
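
A small worked example may help with the COO layout and the reverse-edge mapping. The sketch below stores a 3-atom chain as four directed edges and computes, for each edge i -> j, the position of the edge j -> i. The function reverse_edge_index is a hypothetical helper written with plain lists; it assumes every directed edge has its reverse present.

```python
def reverse_edge_index(edge_index):
    """For each directed edge i -> j, find the position of the edge j -> i."""
    src, dst = edge_index
    position = {(s, d): k for k, (s, d) in enumerate(zip(src, dst))}
    return [position[(d, s)] for s, d in zip(src, dst)]


# a 3-atom chain 0-1-2 stored as four directed edges in COO form:
# row 0 holds the source atoms, row 1 the target atoms
edge_index = [[0, 1, 1, 2],
              [1, 0, 2, 1]]
rev = reverse_edge_index(edge_index)
```
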

class chemprop.data.ClassBalanceSampler(Y, seed=None, shuffle=False)[source]#

Bases: torch.utils.data.Sampler

A ClassBalanceSampler samples data from a MolGraphDataset such that positive and negative classes are equally sampled

Parameters:
  • dataset (MolGraphDataset) – the dataset from which to sample

  • seed (int) – the random seed to use for shuffling (only used when shuffle is True)

  • shuffle (bool, default=False) – whether to shuffle the data during sampling

  • Y (numpy.ndarray)

shuffle = False#
rg#
pos_idxs#
neg_idxs#
length#
__iter__()[source]#

an iterator over indices to sample.

Return type:

Iterator[int]

__len__()[source]#

the number of indices that will be sampled.

Return type:

int
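
One way to sample positive and negative classes equally is to split the indices by label and alternate between the two groups until the smaller one is exhausted. The sketch below follows that idea using only the standard library. ToyClassBalanceSampler is a hypothetical class that reuses the attribute names listed above, and the real sampler may order or truncate indices differently.

```python
import random


class ToyClassBalanceSampler:
    """Yield equally many positive and negative indices (illustrative)."""

    def __init__(self, labels, seed=None, shuffle=False):
        self.shuffle = shuffle
        self.rg = random.Random(seed)
        self.pos_idxs = [i for i, y in enumerate(labels) if y]
        self.neg_idxs = [i for i, y in enumerate(labels) if not y]
        # the smaller class limits how many pairs can be drawn
        self.length = 2 * min(len(self.pos_idxs), len(self.neg_idxs))

    def __iter__(self):
        if self.shuffle:
            self.rg.shuffle(self.pos_idxs)
            self.rg.shuffle(self.neg_idxs)
        # zip stops at the smaller class; indices alternate positive/negative
        for p, n in zip(self.pos_idxs, self.neg_idxs):
            yield p
            yield n

    def __len__(self):
        return self.length


labels = [1, 0, 0, 0, 1, 0]
sampler = ToyClassBalanceSampler(labels, seed=0, shuffle=True)
picked = list(sampler)
```

With two positives and four negatives, four indices are yielded per pass. Shuffling changes which two negatives are included, which matches the note in build_dataloader about getting a random subset of the larger class.
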

class chemprop.data.SeededSampler(N, seed)[source]#

Bases: torch.utils.data.Sampler

A SeededSampler is a class for iterating through a dataset in a randomly seeded fashion

Parameters:
  • N (int)

  • seed (int)

idxs#
rg#
__iter__()[source]#

an iterator over indices to sample.

Return type:

Iterator[int]

__len__()[source]#

the number of indices that will be sampled.

Return type:

int
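
Seeded iteration amounts to shuffling range(N) with a random generator created from the seed, so that two runs with the same seed visit the data in the same order. ToySeededSampler below is a hypothetical standard-library sketch of that behavior, not the chemprop class.

```python
import random


class ToySeededSampler:
    """Iterate over range(N) in an order determined by the seed."""

    def __init__(self, N, seed):
        self.idxs = list(range(N))
        self.rg = random.Random(seed)

    def __iter__(self):
        # each pass reshuffles the indices with the seeded generator
        self.rg.shuffle(self.idxs)
        return iter(list(self.idxs))

    def __len__(self):
        return len(self.idxs)


run_a = list(ToySeededSampler(5, seed=42))
run_b = list(ToySeededSampler(5, seed=42))
```
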

class chemprop.data.SplitType[source]#

Bases: chemprop.utils.utils.EnumMapping

Enum where members are also (and must be) strings

SCAFFOLD_BALANCED#
RANDOM_WITH_REPEATED_SMILES#
RANDOM#
KENNARD_STONE#
KMEANS#
chemprop.data.make_split_indices(mols, split='random', sizes=(0.8, 0.1, 0.1), seed=0, num_replicates=1, num_folds=None)[source]#

Splits data into training, validation, and test splits.

Parameters:
  • mols (Sequence[Chem.Mol] | Sized) – Sequence of RDKit molecules to use for structure based splitting or any object with a length equal to the number of datapoints if using random splitting

  • split (SplitType | str, optional) – Split type, one of chemprop.data.utils.SplitType, by default “random”

  • sizes (tuple[float, float, float], optional) – 3-tuple with the proportions of data in the train, validation, and test sets, by default (0.8, 0.1, 0.1). Set the middle value to 0 for a two way split.

  • seed (int, optional) – The random seed passed to astartes, by default 0

  • num_replicates (int, optional) – Number of replicates, by default 1

  • num_folds (None, optional) – This argument was removed in v2.1 - use num_replicates instead.

Returns:

2- or 3-member tuple containing num_replicates length lists of training, validation, and testing indexes.

Important

Validation may or may not be present

Return type:

tuple[list[list[int]], …]

Raises:
  • ValueError – Requested split sizes tuple not of length 3

  • ValueError – Unsupported split method requested
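
The shape of the return value is easiest to see with a plain random split. The sketch below is a standard-library stand-in written for illustration: the function random_split_indices is hypothetical, it does not call astartes, and it will not reproduce the indices the library returns. It always returns three members, and offsetting the seed per replicate is an assumption.

```python
import random


def random_split_indices(n, sizes=(0.8, 0.1, 0.1), seed=0, num_replicates=1):
    """Return (train, val, test) lists holding one index list per replicate."""
    if len(sizes) != 3:
        raise ValueError("sizes must have exactly three entries")
    train, val, test = [], [], []
    for r in range(num_replicates):
        idxs = list(range(n))
        # assumption: vary the seed per replicate so the replicates differ
        random.Random(seed + r).shuffle(idxs)
        n_train = round(sizes[0] * n)
        n_val = round(sizes[1] * n)
        train.append(idxs[:n_train])
        val.append(idxs[n_train:n_train + n_val])
        test.append(idxs[n_train + n_val:])
    return train, val, test


train, val, test = random_split_indices(10, num_replicates=2)
```

Each returned member is a list of length num_replicates, and within a replicate the three index lists partition range(n).
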

chemprop.data.split_data_by_indices(data, train_indices=None, val_indices=None, test_indices=None)[source]#

Splits data into training, validation, and test groups based on split indices given.

Parameters:
  • data (Datapoints | MulticomponentDatapoints)

  • train_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | None)

  • val_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | None)

  • test_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | None)
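
Since each *_indices argument is an iterable of index iterables (one per replicate), applying the split amounts to a nested selection. The sketch below shows that selection for a single-component list of datapoints. The helper select_by_indices is hypothetical, and returning None when no indices are given is an assumption rather than documented behavior.

```python
def select_by_indices(data, indices):
    """One list of datapoints per replicate, or None when no indices are given."""
    if indices is None:
        return None
    return [[data[i] for i in replicate] for replicate in indices]


data = ["CCO", "CCN", "CCC", "c1ccccc1", "O"]
train_indices = [[0, 1, 2], [2, 3, 4]]   # two replicates
test_indices = [[3, 4], [0, 1]]

train_data = select_by_indices(data, train_indices)
val_data = select_by_indices(data, None)
test_data = select_by_indices(data, test_indices)
```
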