chemprop.data.datasets

Attributes#

Classes#

Datum

a singular training data point

MolAtomBondDatum

a singular training data point that supports atom and bond level targets

CuikBatchedDatum

a cuik-molmaker batch of data points

MoleculeDataset

A MoleculeDataset composed of MoleculeDatapoints

CuikmolmakerDataset

A CuikmolmakerDataset composed of LazyMoleculeDatapoints and a CuikmolmakerMolGraphFeaturizer

MolAtomBondDataset

A MolAtomBondDataset composed of MolAtomBondDatapoints

ReactionDataset

A ReactionDataset composed of ReactionDatapoints

MulticomponentDataset

A MulticomponentDataset is a Dataset composed of parallel MoleculeDatasets and ReactionDatasets

Module Contents#

chemprop.data.datasets.logger#
class chemprop.data.datasets.Datum[source]#

Bases: NamedTuple

a singular training data point

mg: chemprop.data.molgraph.MolGraph#
V_d: numpy.ndarray | None#
x_d: numpy.ndarray | None#
y: numpy.ndarray | None#
weight: float#
lt_mask: numpy.ndarray | None#
gt_mask: numpy.ndarray | None#
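
Because Datum is a NamedTuple, it behaves like an immutable record: fields can be read by name or unpacked positionally in declaration order. The sketch below uses a hypothetical stand-in, ToyDatum, that copies only the field names listed above; the placeholder values (a string for mg, plain lists instead of arrays) are assumptions made so the example runs without chemprop installed.

```python
from typing import NamedTuple, Optional


class ToyDatum(NamedTuple):
    """Hypothetical stand-in that mirrors the field names of Datum."""

    mg: object              # placeholder for a MolGraph
    V_d: Optional[list]
    x_d: Optional[list]
    y: Optional[list]
    weight: float
    lt_mask: Optional[list]
    gt_mask: Optional[list]


datum = ToyDatum(
    mg="<graph>", V_d=None, x_d=[0.5, 1.2], y=[3.1],
    weight=1.0, lt_mask=None, gt_mask=None,
)

# Fields can be read by name ...
targets = datum.y

# ... or unpacked positionally, in declaration order.
mg, V_d, x_d, y, weight, lt_mask, gt_mask = datum

# A NamedTuple is immutable; _replace returns a modified copy.
reweighted = datum._replace(weight=2.0)
```

Optional fields are simply None when absent, so downstream code should check for None before using them.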
class chemprop.data.datasets.MolAtomBondDatum[source]#

Bases: NamedTuple

a singular training data point that supports atom and bond level targets

mg: chemprop.data.molgraph.MolGraph#
V_d: numpy.ndarray | None#
E_d: numpy.ndarray | None#
x_d: numpy.ndarray | None#
ys: tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]#
weight: float#
lt_masks: tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]#
gt_masks: tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]#
constraints: tuple[numpy.ndarray | None, numpy.ndarray | None]#
class chemprop.data.datasets.CuikBatchedDatum[source]#

Bases: NamedTuple

a cuik-molmaker batch of data points

bmg: chemprop.featurizers.molgraph.BatchCuikMolGraph#
V_d: numpy.ndarray#
X_d: numpy.ndarray#
Y: numpy.ndarray#
weights: numpy.ndarray#
lt_mask: numpy.ndarray#
gt_mask: numpy.ndarray#
type chemprop.data.datasets.MolGraphDataset = Dataset[Datum]#
type chemprop.data.datasets.MolAtomBondGraphDataset = Dataset[MolAtomBondDatum]#
class chemprop.data.datasets.MoleculeDataset[source]#

Bases: _MolGraphDatasetMixin, MolGraphDataset

A MoleculeDataset composed of MoleculeDatapoints

A MoleculeDataset produces featurized data for input to an MPNN model. Typically, featurization is performed on the fly and parallelized across multiple workers via the DataLoader class. For small datasets, however, it may be more efficient to featurize the data in advance and cache the results. To do so, set MoleculeDataset.cache=True.

Parameters:
  • data (Iterable[MoleculeDatapoint]) – the data from which to create a dataset

  • featurizer (MoleculeFeaturizer) – the featurizer with which to generate MolGraphs of the molecules

  • n_workers (int, optional) – number of workers to use for cache calculation
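
The trade-off between on-the-fly featurization and caching can be illustrated with a toy dataset. This is a minimal sketch, not chemprop's implementation: ToyDataset and toy_featurize are hypothetical names, and the "featurizer" just returns the length of the SMILES string.

```python
class ToyDataset:
    """Hypothetical dataset with an optional featurization cache."""

    def __init__(self, smiles, featurize):
        self.smiles = list(smiles)
        self.featurize = featurize
        self._cache = None  # filled only when caching is enabled

    @property
    def cache(self):
        return self._cache is not None

    @cache.setter
    def cache(self, enabled):
        # Featurize every item once up front, or drop the stored results.
        if enabled:
            self._cache = [self.featurize(s) for s in self.smiles]
        else:
            self._cache = None

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        if self._cache is not None:
            return self._cache[idx]               # reuse the stored result
        return self.featurize(self.smiles[idx])   # featurize on the fly


calls = []


def toy_featurize(smi):
    """Toy featurizer: records each call and returns the string length."""
    calls.append(smi)
    return len(smi)


dset = ToyDataset(["CCO", "c1ccccc1"], toy_featurize)

_ = dset[0]
_ = dset[0]
n_lazy = len(calls)             # every access triggered a featurizer call

dset.cache = True               # precompute once per item
_ = dset[0]
_ = dset[0]
n_cached = len(calls) - n_lazy  # only the precompute pass called the featurizer
```

With caching off, repeated access to the same index repeats the work; with caching on, the cost is paid once per item up front and memory use grows with the dataset size.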

data: list[chemprop.data.datapoints.MoleculeDatapoint]#
featurizer: chemprop.featurizers.base.Featurizer[rdkit.Chem.Mol, chemprop.data.molgraph.MolGraph]#
n_workers: int = 0#
__post_init__()[source]#
__getitem__(idx)[source]#
Parameters:

idx (int)

Return type:

Datum

property cache: bool#
Return type:

bool

property smiles: list[str]#

the SMILES strings associated with the dataset

Return type:

list[str]

property mols: list[rdkit.Chem.Mol]#

the molecules associated with the dataset

Return type:

list[rdkit.Chem.Mol]

property V_fs: list[numpy.ndarray]#

the (scaled) atom features of the dataset

Return type:

list[numpy.ndarray]

property E_fs: list[numpy.ndarray]#

the (scaled) bond features of the dataset

Return type:

list[numpy.ndarray]

property V_ds: list[numpy.ndarray]#

the (scaled) atom descriptors of the dataset

Return type:

list[numpy.ndarray]

property d_vf: int#

the extra atom feature dimension, if any

Return type:

int

property d_ef: int#

the extra bond feature dimension, if any

Return type:

int

property d_vd: int#

the extra atom descriptor dimension, if any

Return type:

int

normalize_inputs(key='X_d', scaler=None)[source]#
Parameters:
  • key (str)

  • scaler (sklearn.preprocessing.StandardScaler | None)

Return type:

sklearn.preprocessing.StandardScaler

reset()[source]#

Reset the atom and bond features; atom and extra descriptors; and targets of each datapoint to their initial, unnormalized values.
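
The idea behind reset() can be sketched as keeping a copy of the raw values so that in-place scaling can be undone. ToyPoint and ToyResettable below are hypothetical classes written for illustration only; they handle a single scalar target and say nothing about how chemprop stores its originals.

```python
class ToyPoint:
    """Hypothetical datapoint with one scalar target."""

    def __init__(self, y):
        self.y = y


class ToyResettable:
    """Hypothetical dataset that can undo in-place target scaling."""

    def __init__(self, points):
        self.data = list(points)
        # Remember the raw targets so that scaling can be undone later.
        self._y_raw = [p.y for p in self.data]

    def scale_targets(self, mean, std):
        # Overwrite each target with its standardized value.
        for p in self.data:
            p.y = (p.y - mean) / std

    def reset(self):
        # Restore every target to its initial, unnormalized value.
        for p, y in zip(self.data, self._y_raw):
            p.y = y


dset = ToyResettable([ToyPoint(1.0), ToyPoint(3.0)])
dset.scale_targets(mean=2.0, std=1.0)
scaled = [p.y for p in dset.data]
dset.reset()
restored = [p.y for p in dset.data]
```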

class chemprop.data.datasets.CuikmolmakerDataset[source]#

Bases: MoleculeDataset

A CuikmolmakerDataset composed of LazyMoleculeDatapoints and a CuikmolmakerMolGraphFeaturizer

A CuikmolmakerDataset produces featurized data for a batch of molecules for input to an MPNN model. Featurization is always performed on the fly, using the cuik-molmaker package. When caching is not possible, this batched processing is significantly faster and consumes less memory than the default featurization method.

data: list[chemprop.data.datapoints.LazyMoleculeDatapoint]#
featurizer: chemprop.featurizers.molgraph.CuikmolmakerMolGraphFeaturizer#
property smiles: list[str]#

the SMILES strings associated with the dataset

Return type:

list[str]

__getitem__(idx)[source]#
Parameters:

idx (int)

Return type:

Datum

__getitems__(indexes)[source]#
Parameters:

indexes (list[int])

Return type:

CuikBatchedDatum
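
Defining __getitems__ alongside __getitem__ lets a loader request a whole batch of indices in one call rather than item by item. The sketch below shows that dispatch with hypothetical stand-ins (ToyBatchedDataset and fetch); recent PyTorch DataLoader versions perform a similar check on map-style datasets, but treat that as an assumption about the installed version.

```python
class ToyBatchedDataset:
    """Hypothetical dataset exposing single-item and batched access."""

    def __init__(self, smiles):
        self.smiles = list(smiles)

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        # One item at a time: a record holding a single entry.
        return {"lengths": [len(self.smiles[idx])]}

    def __getitems__(self, indexes):
        # Whole batch at once: a single call builds one batched record.
        return {"lengths": [len(self.smiles[i]) for i in indexes]}


def fetch(dataset, indexes):
    """Simplified stand-in for a loader's fetch step."""
    if hasattr(dataset, "__getitems__"):
        return dataset.__getitems__(indexes)
    return [dataset[i] for i in indexes]


dset = ToyBatchedDataset(["CCO", "CC", "c1ccccc1"])
single = dset[1]
batch = fetch(dset, [0, 2])
```

The batched path returns one record for the whole batch, which mirrors how __getitems__ above returns a single CuikBatchedDatum rather than a list of Datum objects.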

class chemprop.data.datasets.MolAtomBondDataset[source]#

Bases: MoleculeDataset, MolAtomBondGraphDataset

A MolAtomBondDataset composed of MolAtomBondDatapoints

Like a MoleculeDataset, a MolAtomBondDataset produces featurized data for input to an MPNN model. Typically, featurization is performed on the fly and parallelized across multiple workers via the DataLoader class. For small datasets, however, it may be more efficient to featurize the data in advance and cache the results. To do so, set MoleculeDataset.cache=True.

Parameters:
  • data (Iterable[MoleculeDatapoint]) – the data from which to create a dataset

  • featurizer (MoleculeFeaturizer) – the featurizer with which to generate MolGraphs of the molecules

  • n_workers (int, optional) – number of workers to use for cache calculation

data: list[chemprop.data.datapoints.MolAtomBondDatapoint]#
__getitem__(idx)[source]#
Parameters:

idx (int)

Return type:

MolAtomBondDatum

property atom_Y: list[numpy.ndarray]#

the (scaled) atom targets of the dataset

Return type:

list[numpy.ndarray]

property atom_constraints: numpy.ndarray#
Return type:

numpy.ndarray

property bond_Y: list[numpy.ndarray]#

the (scaled) bond targets of the dataset

Return type:

list[numpy.ndarray]

property bond_constraints: numpy.ndarray#
Return type:

numpy.ndarray

property atom_gt_mask: numpy.ndarray#
Return type:

numpy.ndarray

property atom_lt_mask: numpy.ndarray#
Return type:

numpy.ndarray

property bond_gt_mask: numpy.ndarray#
Return type:

numpy.ndarray

property bond_lt_mask: numpy.ndarray#
Return type:

numpy.ndarray

property E_ds: list[numpy.ndarray]#

the (scaled) bond descriptors of the dataset

Return type:

list[numpy.ndarray]

property d_ed: int#

the extra bond descriptor dimension, if any

Return type:

int

normalize_targets(key='mol', scaler=None)[source]#

Normalizes the targets of this dataset using a StandardScaler

The StandardScaler subtracts the mean and divides by the standard deviation for each task independently. NOTE: This should only be used for regression datasets.

Returns:

a scaler fit to the targets.

Return type:

StandardScaler

Parameters:
  • key (str)

  • scaler (sklearn.preprocessing.StandardScaler | None)
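
The per-task standardization described above can be worked through by hand. This sketch uses only the standard library and a made-up target table; it uses the population standard deviation, the convention scikit-learn's StandardScaler follows, and the unscale helper is a hypothetical name added for illustration.

```python
from statistics import fmean, pstdev

# Rows are data points, columns are tasks (toy regression targets).
Y = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 60.0],
]

# Fit: one mean and one standard deviation per task (column).
columns = list(zip(*Y))
means = [fmean(col) for col in columns]
stds = [pstdev(col) for col in columns]  # population std (ddof=0)

# Transform: subtract the mean and divide by the std, task by task.
Y_scaled = [
    [(y - m) / s for y, m, s in zip(row, means, stds)]
    for row in Y
]


def unscale(row):
    """Map one scaled row of predictions back to the original units."""
    return [y * s + m for y, m, s in zip(row, means, stds)]
```

Each task is scaled independently, so a task with large values does not dominate the loss. The fitted means and standard deviations must be kept so that predictions can be mapped back to the original units.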

normalize_inputs(key='X_d', scaler=None)[source]#
Parameters:
  • key (str)

  • scaler (sklearn.preprocessing.StandardScaler | None)

Return type:

sklearn.preprocessing.StandardScaler

reset()[source]#

Reset the atom and bond features; atom and extra descriptors; and targets of each datapoint to their initial, unnormalized values.

class chemprop.data.datasets.ReactionDataset[source]#

Bases: _MolGraphDatasetMixin, MolGraphDataset

A ReactionDataset composed of ReactionDatapoints

Note

The featurized data provided by this class may be cached, similar to a MoleculeDataset. To enable the cache, set ReactionDataset.cache=True.

data: list[chemprop.data.datapoints.ReactionDatapoint]#

the dataset from which to load

featurizer: chemprop.featurizers.base.Featurizer[chemprop.types.Rxn, chemprop.data.molgraph.MolGraph]#

the featurizer with which to generate MolGraphs of the input

n_workers: int = 0#

number of workers to use for cache calculation

__post_init__()[source]#
property cache: bool#
Return type:

bool

__getitem__(idx)[source]#
Parameters:

idx (int)

Return type:

Datum

property smiles: list[tuple]#
Return type:

list[tuple]

property mols: list[chemprop.types.Rxn]#
Return type:

list[chemprop.types.Rxn]

property d_vf: int#
Return type:

int

property d_ef: int#
Return type:

int

property d_vd: int#
Return type:

int

class chemprop.data.datasets.MulticomponentDataset[source]#

Bases: _MolGraphDatasetMixin, torch.utils.data.Dataset

A MulticomponentDataset is a Dataset composed of parallel MoleculeDatasets and ReactionDatasets

datasets: list[MoleculeDataset | ReactionDataset]#

the parallel datasets

__post_init__()[source]#
__len__()[source]#
Return type:

int

property n_components: int#
Return type:

int

__getitem__(idx)[source]#
Parameters:

idx (int)

Return type:

list[Datum]
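
"Parallel" here means that index idx refers to the same sample in every component dataset, so indexing returns one entry per component. The sketch below shows that pattern with a hypothetical ToyParallelDataset over plain lists; the equal-length check is an assumption about what such a container needs, not a statement about chemprop's code.

```python
class ToyParallelDataset:
    """Hypothetical container of equally sized, index-aligned datasets."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        sizes = {len(d) for d in self.datasets}
        if len(sizes) != 1:
            raise ValueError(
                f"components must have equal lengths, got {sorted(sizes)}"
            )

    def __len__(self):
        # All components share one length, so any of them can report it.
        return len(self.datasets[0])

    @property
    def n_components(self):
        return len(self.datasets)

    def __getitem__(self, idx):
        # One entry per component, all taken at the same index.
        return [d[idx] for d in self.datasets]


solutes = ["CCO", "CCN", "CCC"]
solvents = ["O", "CO", "CCO"]
pairs = ToyParallelDataset([solutes, solvents])
sample = pairs[1]
```

Here sample pairs the second solute with the second solvent, which is the alignment a multicomponent model relies on.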

property smiles: list[list[str]]#
Return type:

list[list[str]]

property names: list[list[str]]#
Return type:

list[list[str]]

property mols: list[list[rdkit.Chem.Mol]]#
Return type:

list[list[rdkit.Chem.Mol]]

normalize_targets(scaler=None)[source]#

Normalizes the targets of this dataset using a StandardScaler

The StandardScaler subtracts the mean and divides by the standard deviation for each task independently. NOTE: This should only be used for regression datasets.

Returns:

a scaler fit to the targets.

Return type:

StandardScaler

Parameters:

scaler (sklearn.preprocessing.StandardScaler | None)

normalize_inputs(key='X_d', scaler=None)[source]#
Parameters:
  • key (str)

  • scaler (list[sklearn.preprocessing.StandardScaler] | None)

Return type:

list[sklearn.preprocessing.StandardScaler]

reset()[source]#

Reset the atom and bond features; atom and extra descriptors; and targets of each datapoint to their initial, unnormalized values.

property d_xd: list[int]#

the extra molecule descriptor dimension, if any

Return type:

list[int]

property d_vf: list[int]#
Return type:

list[int]

property d_ef: list[int]#
Return type:

list[int]

property d_vd: list[int]#
Return type:

list[int]