chemprop.data.datasets#
Attributes#
Classes#
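Datum | a singular training data point
MolAtomBondDatum | a singular training data point that supports atom and bond level targets
CuikBatchedDatum | a cuik-molmaker batch of data points
MoleculeDataset | A MoleculeDataset composed of MoleculeDatapoints
CuikmolmakerDataset | A CuikmolmakerDataset composed of LazyMoleculeDatapoints and a CuikmolmakerMolGraphFeaturizer
MolAtomBondDataset | A MoleculeDataset composed of MoleculeDatapoints
ReactionDataset | A ReactionDataset composed of ReactionDatapoints
MulticomponentDataset | A MulticomponentDataset is a Dataset composed of parallel MoleculeDatasets and ReactionDatasets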
Module Contents#
- chemprop.data.datasets.logger#
- class chemprop.data.datasets.Datum[source]#
Bases: NamedTuple
a singular training data point
- V_d: numpy.ndarray | None#
- x_d: numpy.ndarray | None#
- y: numpy.ndarray | None#
- weight: float#
- lt_mask: numpy.ndarray | None#
- gt_mask: numpy.ndarray | None#
- class chemprop.data.datasets.MolAtomBondDatum[source]#
Bases: NamedTuple
a singular training data point that supports atom and bond level targets
- V_d: numpy.ndarray | None#
- E_d: numpy.ndarray | None#
- x_d: numpy.ndarray | None#
- ys: tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]#
- weight: float#
- lt_masks: tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]#
- gt_masks: tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]#
- constraints: tuple[numpy.ndarray | None, numpy.ndarray | None]#
- class chemprop.data.datasets.CuikBatchedDatum[source]#
Bases: NamedTuple
a cuik-molmaker batch of data points
- bmg: chemprop.featurizers.molgraph.BatchCuikMolGraph#
- V_d: numpy.ndarray#
- X_d: numpy.ndarray#
- Y: numpy.ndarray#
- weights: numpy.ndarray#
- lt_mask: numpy.ndarray#
- gt_mask: numpy.ndarray#
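The field names suggest that a batch collects per-datum values (x_d, y, weight) into batch-level containers (X_d, Y, weights). A minimal sketch of that relationship follows, assuming row i of each batch field comes from datum i. ToyDatum, ToyBatch and collate are hypothetical names that use plain lists instead of arrays and cover only a subset of the fields; this is not the cuik-molmaker batching code.

```python
from typing import List, NamedTuple, Sequence


class ToyDatum(NamedTuple):
    """Hypothetical per-molecule record (subset of fields, lists instead of arrays)."""

    x_d: List[float]
    y: List[float]
    weight: float = 1.0


class ToyBatch(NamedTuple):
    """Hypothetical batch record holding one row per datum."""

    X_d: List[List[float]]
    Y: List[List[float]]
    weights: List[float]


def collate(data: Sequence[ToyDatum]) -> ToyBatch:
    # Row i of every batch field is taken from data[i] (an assumption of this sketch).
    return ToyBatch(
        X_d=[d.x_d for d in data],
        Y=[d.y for d in data],
        weights=[d.weight for d in data],
    )


batch = collate([ToyDatum([0.1], [1.0]), ToyDatum([0.2], [2.0], weight=0.5)])
```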
- type chemprop.data.datasets.MolAtomBondGraphDataset = Dataset[MolAtomBondDatum]#
- class chemprop.data.datasets.MoleculeDataset[source]#
Bases: _MolGraphDatasetMixin, MolGraphDataset
A MoleculeDataset composed of MoleculeDatapoints
A MoleculeDataset produces featurized data for input to an MPNN model. Typically, data featurization is performed on-the-fly and parallelized across multiple workers via the torch.utils.data.DataLoader class. However, for small datasets, it may be more efficient to featurize the data in advance and cache the results. This can be done by setting MoleculeDataset.cache=True.
- Parameters:
data (Iterable[MoleculeDatapoint]) – the data from which to create a dataset
featurizer (MoleculeFeaturizer) – the featurizer with which to generate MolGraphs of the molecules
n_workers (int, optional) – number of workers to use for cache calculation
- data: list[chemprop.data.datapoints.MoleculeDatapoint]#
- featurizer: chemprop.featurizers.base.Featurizer[rdkit.Chem.Mol, chemprop.data.molgraph.MolGraph]#
- n_workers: int = 0#
- property cache: bool#
- Return type:
bool
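The trade-off between on-the-fly and cached featurization can be illustrated with a toy class. This is a minimal sketch under stated assumptions, not the chemprop implementation: ToyCachedDataset, its featurizer argument and the n_calls counter are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Optional, Sequence


class ToyCachedDataset:
    """Hypothetical dataset with an optional featurization cache."""

    def __init__(self, data: Sequence[str], featurizer: Callable[[str], List[int]]):
        self.data = list(data)
        self.featurizer = featurizer
        self._cache: Optional[List[List[int]]] = None
        self.n_calls = 0  # counts featurizer invocations, for illustration only

    def _featurize(self, x: str) -> List[int]:
        self.n_calls += 1
        return self.featurizer(x)

    @property
    def cache(self) -> bool:
        return self._cache is not None

    @cache.setter
    def cache(self, enabled: bool) -> None:
        # Featurize every item once when enabling; drop the results when disabling.
        self._cache = [self._featurize(x) for x in self.data] if enabled else None

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> List[int]:
        if self._cache is not None:
            return self._cache[idx]  # cached: no featurizer call
        return self._featurize(self.data[idx])  # on-the-fly: one call per access


ds = ToyCachedDataset(["CCO", "C"], featurizer=lambda smi: [len(smi)])
ds.cache = True  # pays the featurization cost once, up front
```

Without the cache, every access to an item re-runs the featurizer; with it, the cost is paid once for the whole dataset, which is why caching tends to suit small datasets that fit in memory.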
- property smiles: list[str]#
the SMILES strings associated with the dataset
- Return type:
list[str]
- property mols: list[rdkit.Chem.Mol]#
the molecules associated with the dataset
- Return type:
list[rdkit.Chem.Mol]
- property V_fs: list[numpy.ndarray]#
the (scaled) atom features of the dataset
- Return type:
list[numpy.ndarray]
- property E_fs: list[numpy.ndarray]#
the (scaled) bond features of the dataset
- Return type:
list[numpy.ndarray]
- property V_ds: list[numpy.ndarray]#
the (scaled) atom descriptors of the dataset
- Return type:
list[numpy.ndarray]
- property d_vf: int#
the extra atom feature dimension, if any
- Return type:
int
- property d_ef: int#
the extra bond feature dimension, if any
- Return type:
int
- property d_vd: int#
the extra atom descriptor dimension, if any
- Return type:
int
- class chemprop.data.datasets.CuikmolmakerDataset[source]#
Bases: MoleculeDataset
A CuikmolmakerDataset composed of LazyMoleculeDatapoints and a CuikmolmakerMolGraphFeaturizer
A CuikmolmakerDataset produces featurized data for a batch of molecules for ingestion by an MPNN model. Data featurization is always performed on-the-fly using the cuik-molmaker package. This batched processing is significantly faster and consumes less memory than the default featurization method when caching is not possible.
- Parameters:
data (Iterable[LazyMoleculeDatapoint]) – the data from which to create a dataset
featurizer (CuikmolmakerMolGraphFeaturizer) – the featurizer with which to generate MolGraphs of the molecules
- data: list[chemprop.data.datapoints.LazyMoleculeDatapoint]#
- property smiles: list[str]#
the SMILES strings associated with the dataset
- Return type:
list[str]
- class chemprop.data.datasets.MolAtomBondDataset[source]#
Bases: MoleculeDataset, MolAtomBondGraphDataset
A MoleculeDataset composed of MoleculeDatapoints
A MoleculeDataset produces featurized data for input to an MPNN model. Typically, data featurization is performed on-the-fly and parallelized across multiple workers via the torch.utils.data.DataLoader class. However, for small datasets, it may be more efficient to featurize the data in advance and cache the results. This can be done by setting MoleculeDataset.cache=True.
- Parameters:
data (Iterable[MoleculeDatapoint]) – the data from which to create a dataset
featurizer (MoleculeFeaturizer) – the featurizer with which to generate MolGraphs of the molecules
n_workers (int, optional) – number of workers to use for cache calculation
- data: list[chemprop.data.datapoints.MolAtomBondDatapoint]#
- property atom_Y: list[numpy.ndarray]#
the (scaled) atom targets of the dataset
- Return type:
list[numpy.ndarray]
- property atom_constraints: numpy.ndarray#
- Return type:
numpy.ndarray
- property bond_Y: list[numpy.ndarray]#
the (scaled) bond targets of the dataset
- Return type:
list[numpy.ndarray]
- property bond_constraints: numpy.ndarray#
- Return type:
numpy.ndarray
- property atom_gt_mask: numpy.ndarray#
- Return type:
numpy.ndarray
- property atom_lt_mask: numpy.ndarray#
- Return type:
numpy.ndarray
- property bond_gt_mask: numpy.ndarray#
- Return type:
numpy.ndarray
- property bond_lt_mask: numpy.ndarray#
- Return type:
numpy.ndarray
- property E_ds: list[numpy.ndarray]#
the (scaled) bond descriptors of the dataset
- Return type:
list[numpy.ndarray]
- property d_ed: int#
the extra bond descriptor dimension, if any
- Return type:
int
- normalize_targets(key='mol', scaler=None)[source]#
Normalizes the targets of this dataset using a StandardScaler
The StandardScaler subtracts the mean and divides by the standard deviation for each task independently. NOTE: This should only be used for regression datasets.
- Returns:
a scaler fit to the targets.
- Return type:
StandardScaler
- Parameters:
key (str)
scaler (sklearn.preprocessing.StandardScaler | None)
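The per-task standardization described above can be sketched with the standard library alone. This is an illustration of the arithmetic, not the library code: fit_standardizer and standardize are hypothetical helpers, targets are plain nested lists with one column per task, and the population standard deviation and zero-variance guard are choices of this sketch.

```python
from statistics import fmean, pstdev
from typing import List, Tuple


def fit_standardizer(Y: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Return per-task (column-wise) means and population standard deviations."""
    columns = list(zip(*Y))  # one tuple of values per task
    means = [fmean(col) for col in columns]
    # Guard against zero variance so a constant task does not divide by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(Y: List[List[float]], means: List[float], stds: List[float]) -> List[List[float]]:
    """Subtract the mean and divide by the standard deviation, task by task."""
    return [[(y - m) / s for y, m, s in zip(row, means, stds)] for row in Y]


# Two regression tasks on very different scales.
Y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = fit_standardizer(Y)
Y_scaled = standardize(Y, means, stds)

# Statistics fit on one split can be reused to transform another split.
Y_val_scaled = standardize([[2.0, 20.0]], means, stds)
```

Because each column is scaled independently, tasks with large numeric ranges no longer dominate those with small ranges. Classification labels should not be transformed this way, which is consistent with the note above.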
- class chemprop.data.datasets.ReactionDataset[source]#
Bases: _MolGraphDatasetMixin, MolGraphDataset
A ReactionDataset composed of ReactionDatapoints
Note
The featurized data provided by this class may be cached, similar to a MoleculeDataset. To enable the cache, set ReactionDataset.cache=True.
- data: list[chemprop.data.datapoints.ReactionDatapoint]#
the dataset from which to load
- featurizer: chemprop.featurizers.base.Featurizer[chemprop.types.Rxn, chemprop.data.molgraph.MolGraph]#
the featurizer with which to generate MolGraphs of the input
- n_workers: int = 0#
number of workers to use for cache calculation
- property cache: bool#
- Return type:
bool
- property smiles: list[tuple]#
- Return type:
list[tuple]
- property mols: list[chemprop.types.Rxn]#
- Return type:
list[chemprop.types.Rxn]
- property d_vf: int#
- Return type:
int
- property d_ef: int#
- Return type:
int
- property d_vd: int#
- Return type:
int
- class chemprop.data.datasets.MulticomponentDataset[source]#
Bases: _MolGraphDatasetMixin, torch.utils.data.Dataset
A MulticomponentDataset is a Dataset composed of parallel MoleculeDatasets and ReactionDatasets
- datasets: list[MoleculeDataset | ReactionDataset]#
the parallel datasets
- property n_components: int#
- Return type:
int
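One way to picture parallel datasets is as equal-length sequences indexed in lockstep, so that item i of every component describes the same sample. The sketch below assumes exactly that; ToyParallelDataset is a hypothetical stand-in that wraps plain sequences rather than MoleculeDataset or ReactionDataset objects.

```python
from typing import Any, Sequence, Tuple


class ToyParallelDataset:
    """Hypothetical container that indexes several equal-length datasets together."""

    def __init__(self, datasets: Sequence[Sequence[Any]]):
        sizes = {len(d) for d in datasets}
        if len(sizes) != 1:
            raise ValueError(f"component datasets must share one length, got {sorted(sizes)}")
        self.datasets = list(datasets)

    @property
    def n_components(self) -> int:
        return len(self.datasets)

    def __len__(self) -> int:
        return len(self.datasets[0])

    def __getitem__(self, idx: int) -> Tuple[Any, ...]:
        # One entry from every component, returned in component order.
        return tuple(d[idx] for d in self.datasets)


solutes = ["CCO", "c1ccccc1"]
solvents = ["O", "CO"]
pairs = ToyParallelDataset([solutes, solvents])
second = pairs[1]  # the solute and solvent of the second sample
```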
- property smiles: list[list[str]]#
- Return type:
list[list[str]]
- property names: list[list[str]]#
- Return type:
list[list[str]]
- property mols: list[list[rdkit.Chem.Mol]]#
- Return type:
list[list[rdkit.Chem.Mol]]
- normalize_targets(scaler=None)[source]#
Normalizes the targets of this dataset using a StandardScaler
The StandardScaler subtracts the mean and divides by the standard deviation for each task independently. NOTE: This should only be used for regression datasets.
- Returns:
a scaler fit to the targets.
- Return type:
StandardScaler
- Parameters:
scaler (sklearn.preprocessing.StandardScaler | None)
- normalize_inputs(key='X_d', scaler=None)[source]#
- Parameters:
key (str)
scaler (list[sklearn.preprocessing.StandardScaler] | None)
- Return type:
list[sklearn.preprocessing.StandardScaler]
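The signature takes and returns a list of scalers, which suggests one scaler per component dataset. The sketch below illustrates that reading under stated assumptions: normalize_components and the (means, stds) tuple used as a scaler are hypothetical, inputs are nested lists modified in place, and passing scalers back in is assumed to reuse them rather than refit.

```python
from statistics import fmean, pstdev
from typing import List, Optional, Tuple

Scaler = Tuple[List[float], List[float]]  # (means, stds) for one component


def fit(X: List[List[float]]) -> Scaler:
    columns = list(zip(*X))
    # A zero standard deviation is replaced by 1.0 to avoid division by zero.
    return [fmean(c) for c in columns], [pstdev(c) or 1.0 for c in columns]


def normalize_components(
    components: List[List[List[float]]], scalers: Optional[List[Scaler]] = None
) -> List[Scaler]:
    """Scale each component's inputs in place with its own scaler; return the scalers."""
    if scalers is None:
        scalers = [fit(X) for X in components]  # fit one scaler per component
    if len(scalers) != len(components):
        raise ValueError("need exactly one scaler per component")
    for X, (means, stds) in zip(components, scalers):
        for row in X:
            row[:] = [(x - m) / s for x, m, s in zip(row, means, stds)]
    return scalers


# Two components with different descriptor widths.
train = [[[1.0], [3.0]], [[10.0, 0.0], [20.0, 0.0]]]
val = [[[2.0]], [[15.0, 1.0]]]
scalers = normalize_components(train)  # fit on the training split
normalize_components(val, scalers)     # reuse the same statistics for validation
```

Keeping one scaler per component avoids mixing statistics across components whose descriptors may have different widths and meanings.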
- reset()[source]#
Reset the atom and bond features; atom and extra descriptors; and targets of each datapoint to their initial, unnormalized values.
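Restoring initial values implies that an unmodified copy of the data is kept alongside the working copy. A minimal sketch of that pattern follows; ToyResettableDataset and its scale method are hypothetical, and the storage layout used by the library is not shown here.

```python
import copy
from typing import List


class ToyResettableDataset:
    """Hypothetical dataset that keeps a pristine copy of its targets."""

    def __init__(self, Y: List[List[float]]):
        self._raw_Y = copy.deepcopy(Y)  # never modified after construction
        self.Y = copy.deepcopy(Y)       # working copy that transformations overwrite

    def scale(self, factor: float) -> None:
        # Stand-in for any normalization that rewrites the working copy.
        self.Y = [[y * factor for y in row] for row in self.Y]

    def reset(self) -> None:
        # Discard all transformations by copying the pristine values back.
        self.Y = copy.deepcopy(self._raw_Y)


toy = ToyResettableDataset([[1.0, 2.0]])
toy.scale(10.0)
toy.reset()  # toy.Y holds the initial values again
```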
- property d_xd: list[int]#
the extra molecule descriptor dimension, if any
- Return type:
list[int]
- property d_vf: list[int]#
- Return type:
list[int]
- property d_ef: list[int]#
- Return type:
list[int]
- property d_vd: list[int]#
- Return type:
list[int]