chemprop.data.datasets
Module Contents#
Classes#
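Datum | a singular training data point
MoleculeDataset | A MoleculeDataset composed of MoleculeDatapoints
ReactionDataset | A ReactionDataset composed of ReactionDatapoints
MulticomponentDataset | A MulticomponentDataset is a Dataset composed of parallel MoleculeDatasets and ReactionDatasets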
Attributes#
- class chemprop.data.datasets.Datum[source]#
Bases:
NamedTuple
a singular training data point
- V_d: numpy.ndarray | None#
- x_d: numpy.ndarray | None#
- y: numpy.ndarray | None#
- weight: float#
- lt_mask: numpy.ndarray | None#
- gt_mask: numpy.ndarray | None#
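To make the field layout above concrete, here is a minimal stand-alone sketch of a NamedTuple with the same field names. It does not import chemprop: `DatumSketch` is a hypothetical stand-in, plain Python sequences replace the `numpy.ndarray` fields, and the comments on what each field holds are assumptions inferred from the names rather than statements from the library.

```python
from typing import NamedTuple, Optional, Sequence


class DatumSketch(NamedTuple):
    """Hypothetical stand-in mirroring the fields listed for Datum."""

    V_d: Optional[Sequence[float]]     # atom descriptors, or None (assumed meaning)
    x_d: Optional[Sequence[float]]     # extra descriptors, or None (assumed meaning)
    y: Optional[Sequence[float]]       # targets, or None when no labels are available
    weight: float                      # per-datapoint weight
    lt_mask: Optional[Sequence[bool]]  # "less than" mask, or None (assumed meaning)
    gt_mask: Optional[Sequence[bool]]  # "greater than" mask, or None (assumed meaning)


datum = DatumSketch(
    V_d=None, x_d=[0.5, 1.2], y=[3.4], weight=1.0, lt_mask=None, gt_mask=None
)

# Fields can be read by name or by position, and the whole tuple can be unpacked.
targets = datum.y
weight_by_position = datum[3]
V_d, x_d, y, weight, lt_mask, gt_mask = datum
```

Because a NamedTuple is immutable, a single datum cannot be modified in place after creation; `datum._replace(weight=2.0)` returns a new tuple instead.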
- chemprop.data.datasets.MolGraphDataset: TypeAlias#
- class chemprop.data.datasets.MoleculeDataset[source]#
Bases:
_MolGraphDatasetMixin, MolGraphDataset
A MoleculeDataset composed of MoleculeDatapoints
A MoleculeDataset produces featurized data for input to an MPNN model. Typically, data featurization is performed on the fly and parallelized across multiple workers via the DataLoader class. However, for small datasets, it may be more efficient to featurize the data in advance and cache the results. This can be done by setting MoleculeDataset.cache=True.
- Parameters:
data (Iterable[MoleculeDatapoint]) – the data from which to create a dataset
featurizer (MoleculeFeaturizer) – the featurizer with which to generate MolGraphs of the molecules
- property cache: bool#
- Return type:
bool
- property smiles: list[str]#
the SMILES strings associated with the dataset
- Return type:
list[str]
- property mols: list[rdkit.Chem.Mol]#
the molecules associated with the dataset
- Return type:
list[rdkit.Chem.Mol]
- property V_fs: list[numpy.ndarray]#
the (scaled) atom features of the dataset
- Return type:
list[numpy.ndarray]
- property E_fs: list[numpy.ndarray]#
the (scaled) bond features of the dataset
- Return type:
list[numpy.ndarray]
- property V_ds: list[numpy.ndarray]#
the (scaled) atom descriptors of the dataset
- Return type:
list[numpy.ndarray]
- property d_vf: int#
the extra atom feature dimension, if any
- Return type:
int
- property d_ef: int#
the extra bond feature dimension, if any
- Return type:
int
- property d_vd: int#
the extra atom descriptor dimension, if any
- Return type:
int
- data: list[chemprop.data.datapoints.MoleculeDatapoint]#
- featurizer: chemprop.featurizers.base.Featurizer[rdkit.Chem.Mol, chemprop.data.molgraph.MolGraph]#
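The trade-off between on-the-fly featurization and the cache described for MoleculeDataset can be illustrated with a small stand-alone sketch. It does not use chemprop: `CachedDatasetSketch` and `toy_featurizer` are hypothetical names, and the way the real class stores its cache is not specified by this page, so treat the internals as an assumption.

```python
from typing import Callable, List, Optional, Sequence


class CachedDatasetSketch:
    """Hypothetical dataset that featurizes lazily unless `cache` is enabled."""

    def __init__(self, data: Sequence[str], featurizer: Callable[[str], int]):
        self.data = list(data)
        self.featurizer = featurizer
        self._cache: Optional[List[int]] = None

    @property
    def cache(self) -> bool:
        return self._cache is not None

    @cache.setter
    def cache(self, enabled: bool) -> None:
        # Featurize every item once up front, or drop the stored results.
        self._cache = [self.featurizer(d) for d in self.data] if enabled else None

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> int:
        if self._cache is not None:
            return self._cache[idx]             # served from the cache
        return self.featurizer(self.data[idx])  # featurized on the fly


calls: List[str] = []


def toy_featurizer(smiles: str) -> int:
    """Placeholder featurizer: records the call and returns the string length."""
    calls.append(smiles)
    return len(smiles)


dset = CachedDatasetSketch(["CCO", "c1ccccc1"], toy_featurizer)
dset[0]
dset[0]                       # without the cache, the same item is featurized again
calls_before_cache = len(calls)
dset.cache = True             # featurizes both items once
dset[0]
dset[1]                       # no further featurizer calls
calls_after_cache = len(calls)
```

In this sketch, enabling the cache costs one pass over the data and memory for every featurized item, which is why it suits small datasets better than large ones.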
- class chemprop.data.datasets.ReactionDataset[source]#
Bases:
_MolGraphDatasetMixin, MolGraphDataset
A ReactionDataset composed of ReactionDatapoints
Note
The featurized data provided by this class may be cached, similar to a MoleculeDataset. To enable the cache, set ReactionDataset.cache=True.
- property cache: bool#
- Return type:
bool
- property smiles: list[tuple]#
- Return type:
list[tuple]
- property mols: list[chemprop.types.Rxn]#
- Return type:
list[chemprop.types.Rxn]
- property d_vf: int#
- Return type:
int
- property d_ef: int#
- Return type:
int
- property d_vd: int#
- Return type:
int
- data: list[chemprop.data.datapoints.ReactionDatapoint]#
the dataset from which to load
- featurizer: chemprop.featurizers.base.Featurizer[chemprop.types.Rxn, chemprop.data.molgraph.MolGraph]#
the featurizer with which to generate MolGraphs of the input
- class chemprop.data.datasets.MulticomponentDataset[source]#
Bases:
_MolGraphDatasetMixin, torch.utils.data.Dataset
A MulticomponentDataset is a Dataset composed of parallel MoleculeDatasets and ReactionDatasets
- property n_components: int#
- Return type:
int
- property smiles: list[list[str]]#
- Return type:
list[list[str]]
- property names: list[list[str]]#
- Return type:
list[list[str]]
- property mols: list[list[rdkit.Chem.Mol]]#
- Return type:
list[list[rdkit.Chem.Mol]]
- property d_xd: list[int]#
the extra molecule descriptor dimension, if any
- Return type:
list[int]
- property d_vf: list[int]#
- Return type:
list[int]
- property d_ef: list[int]#
- Return type:
list[int]
- property d_vd: list[int]#
- Return type:
list[int]
- datasets: list[MoleculeDataset | ReactionDataset]#
the parallel datasets
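The idea of "parallel" component datasets, where each property returns one entry per component, can be sketched without chemprop as follows. `ParallelDatasetSketch` is a hypothetical class, and the equal-length check is an assumption about what "parallel" implies rather than documented behavior.

```python
from typing import List, Sequence


class ParallelDatasetSketch:
    """Hypothetical container of index-aligned component datasets."""

    def __init__(self, datasets: Sequence[Sequence]):
        # Assumption: parallel datasets must all have the same length.
        if len({len(d) for d in datasets}) != 1:
            raise ValueError("all component datasets must have the same length")
        self.datasets = list(datasets)

    @property
    def n_components(self) -> int:
        return len(self.datasets)

    def __len__(self) -> int:
        return len(self.datasets[0])

    def __getitem__(self, idx: int) -> List:
        # One entry per component, all taken at the same index.
        return [d[idx] for d in self.datasets]


solutes = ["CCO", "CCN", "CCC"]
solvents = ["O", "O", "CO"]
pairs = ParallelDatasetSketch([solutes, solvents])
third_pair = pairs[2]
```

This mirrors why the per-component properties above are typed as lists: there is one value for each of the `n_components` datasets.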
- normalize_targets(scaler=None)[source]#
Normalizes the targets of this dataset using a StandardScaler
The StandardScaler subtracts the mean and divides by the standard deviation for each task independently. NOTE: This should only be used for regression datasets.
- Returns:
a scaler fit to the targets.
- Return type:
StandardScaler
- Parameters:
scaler (sklearn.preprocessing.StandardScaler | None)
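The per-task scaling described for normalize_targets can be written out by hand. The sketch below uses only the standard library instead of scikit-learn: `standardize_per_task` is a hypothetical helper, and the use of the population standard deviation and of a fallback scale of 1.0 for constant tasks are assumptions made for this illustration.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def standardize_per_task(
    targets: Sequence[Sequence[float]],
) -> Tuple[List[List[float]], List[float], List[float]]:
    """Scale each task (column) to zero mean and unit standard deviation."""
    columns = list(zip(*targets))               # one tuple of values per task
    means = [fmean(col) for col in columns]
    # Assumption: population std; fall back to 1.0 so constant tasks do not divide by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    scaled = [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in targets
    ]
    return scaled, means, stds


# Three datapoints, two regression tasks on very different scales.
targets = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled, means, stds = standardize_per_task(targets)

# Predictions made in the scaled space are mapped back with the stored statistics.
restored = [[v * s + m for v, m, s in zip(row, means, stds)] for row in scaled]
```

Keeping the fitted means and standard deviations is the reason the method returns the scaler: the same statistics are needed later to scale validation targets and to unscale model predictions.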