chemprop.data#
Submodules#
Attributes#
Classes#
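- BatchMolAtomBondGraph: A BatchMolGraph represents a batch of individual MolGraphs.
- BatchMolGraph: A BatchMolGraph represents a batch of individual MolGraphs.
- MolAtomBondTrainingBatch
- MulticomponentTrainingBatch
- TrainingBatch
- LazyMoleculeDatapoint: A LazyMoleculeDatapoint contains a single SMILES string and all attributes needed to form an rdkit.Chem.Mol object.
- MolAtomBondDatapoint: A MoleculeDatapoint contains a single molecule and its associated features and targets.
- MoleculeDatapoint: A MoleculeDatapoint contains a single molecule and its associated features and targets.
- ReactionDatapoint: A ReactionDatapoint contains a single reaction and its associated features and targets.
- CuikmolmakerDataset: A CuikmolmakerDataset composed of LazyMoleculeDatapoints and a CuikmolmakerMolGraphFeaturizer.
- Datum: a singular training data point
- MolAtomBondDataset: A MoleculeDataset composed of MoleculeDatapoints.
- MolAtomBondDatum: a singular training data point that supports atom and bond level targets
- MoleculeDataset: A MoleculeDataset composed of MoleculeDatapoints.
- MulticomponentDataset: A MulticomponentDataset is a Dataset composed of parallel MoleculeDatasets and ReactionDatasets.
- ReactionDataset: A ReactionDataset composed of ReactionDatapoints.
- MolGraph: A MolGraph represents the graph featurization of a molecule.
- ClassBalanceSampler: A ClassBalanceSampler samples data from a MolGraphDataset such that positive and negative classes are equally sampled.
- SeededSampler: A SeededSampler is a class for iterating through a dataset in a randomly seeded fashion.
- SplitType: Enum where members are also (and must be) strings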
Functions#
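- collate_batch(batch)
- collate_mol_atom_bond_batch(batch)
- collate_multicomponent(batches)
- build_dataloader(dataset, ...): Return a DataLoader for MolGraphDatasets
- make_split_indices(mols, ...): Splits data into training, validation, and test splits.
- split_data_by_indices(data, ...): Splits data into training, validation, and test groups based on split indices given.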
Package Contents#
- class chemprop.data.BatchMolAtomBondGraph[source]#
Bases: BatchMolGraph
A BatchMolGraph represents a batch of individual MolGraphs.
It has all the attributes of a MolGraph with the addition of the batch attribute. This class is intended for use with data loading, so it uses Tensors to store data.
- bond_batch: torch.Tensor#
A tensor of indices that show which MolGraph each bond belongs to in the batch
- __post_init__(mgs)[source]#
- Parameters:
mgs (Sequence[chemprop.data.molgraph.MolGraph])
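As a rough illustration of the membership tensors described above, the following plain-Python sketch builds a per-atom and a per-bond index for a small batch. It assumes `bond_batch` follows the same convention as the atom-level `batch` attribute; the helper `membership_index` and the atom/bond counts are hypothetical and are not part of chemprop.

```python
def membership_index(counts):
    """Return one graph index per item, given the item count of each graph."""
    index = []
    for graph_id, n_items in enumerate(counts):
        index.extend([graph_id] * n_items)
    return index


# Hypothetical batch of three molecules with 3, 2 and 4 atoms
# and 4, 2 and 6 directed bonds respectively.
atom_batch = membership_index([3, 2, 4])
bond_batch = membership_index([4, 2, 6])
```

Each entry answers "which graph does this atom (or bond) belong to", which is what makes per-graph aggregation possible after the graphs have been concatenated.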
- class chemprop.data.BatchMolGraph[source]#
A BatchMolGraph represents a batch of individual MolGraphs.
It has all the attributes of a MolGraph with the addition of the batch attribute. This class is intended for use with data loading, so it uses Tensors to store data.
- mgs: dataclasses.InitVar[Sequence[chemprop.data.molgraph.MolGraph]]#
A list of individual MolGraphs to be batched together
- V: torch.Tensor#
the atom feature matrix
- E: torch.Tensor#
the bond feature matrix
- edge_index: torch.Tensor#
a tensor of shape 2 x E containing the edges of the graph in COO format
- rev_edge_index: torch.Tensor#
A tensor of shape E that maps from an edge index to the index of the source of the reverse edge in the edge_index attribute.
- __post_init__(mgs)[source]#
- Parameters:
mgs (Sequence[chemprop.data.molgraph.MolGraph])
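One common way to batch COO graphs is to concatenate their edge lists while shifting atom indices by a running offset. The sketch below shows that idea with plain lists. It is an assumption about the general technique, not a copy of chemprop's `__post_init__`, and the function `batch_edge_index` and the example graphs are hypothetical.

```python
def batch_edge_index(graphs):
    """Concatenate per-graph COO edge lists, shifting atom ids by an offset.

    Each graph is a (n_atoms, sources, targets) triple of plain Python values.
    """
    sources, targets, batch = [], [], []
    offset = 0
    for graph_id, (n_atoms, src, dst) in enumerate(graphs):
        sources.extend(s + offset for s in src)
        targets.extend(t + offset for t in dst)
        batch.extend([graph_id] * n_atoms)
        offset += n_atoms
    return [sources, targets], batch


# Two hypothetical molecules: a 2-atom graph and a 3-atom chain,
# each bond stored as two directed edges.
g1 = (2, [0, 1], [1, 0])
g2 = (3, [0, 1, 1, 2], [1, 0, 2, 1])
edge_index, batch = batch_edge_index([g1, g2])
```

After batching, atom ids are unique across the whole batch, so a single edge list can describe every molecule at once.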
- class chemprop.data.MolAtomBondTrainingBatch[source]#
Bases: NamedTuple
- V_d: torch.Tensor | None#
- E_d: torch.Tensor | None#
- X_d: torch.Tensor | None#
- Ys: tuple[torch.Tensor | None, torch.Tensor | None, torch.Tensor | None]#
- w: tuple[torch.Tensor | None, torch.Tensor | None, torch.Tensor | None]#
- lt_masks: tuple[torch.Tensor | None, torch.Tensor | None, torch.Tensor | None]#
- gt_masks: tuple[torch.Tensor | None, torch.Tensor | None, torch.Tensor | None]#
- constraints: tuple[torch.Tensor | None, torch.Tensor | None]#
- class chemprop.data.MulticomponentTrainingBatch[source]#
Bases: NamedTuple
- bmgs: list[BatchMolGraph]#
- V_ds: list[torch.Tensor | None]#
- X_d: torch.Tensor | None#
- Y: torch.Tensor | None#
- w: torch.Tensor#
- lt_mask: torch.Tensor | None#
- gt_mask: torch.Tensor | None#
- class chemprop.data.TrainingBatch[source]#
Bases: NamedTuple
- V_d: torch.Tensor | None#
- X_d: torch.Tensor | None#
- Y: torch.Tensor | None#
- w: torch.Tensor#
- lt_mask: torch.Tensor | None#
- gt_mask: torch.Tensor | None#
- chemprop.data.collate_batch(batch)[source]#
- Parameters:
batch (Iterable[chemprop.data.datasets.Datum])
- Return type:
- chemprop.data.collate_mol_atom_bond_batch(batch)[source]#
- Parameters:
batch (Iterable[chemprop.data.datasets.MolAtomBondDatum])
- Return type:
- chemprop.data.collate_multicomponent(batches)[source]#
- Parameters:
batches (Iterable[Iterable[chemprop.data.datasets.Datum]])
- Return type:
- chemprop.data.build_dataloader(dataset, batch_size=64, num_workers=0, class_balance=False, seed=None, shuffle=True, drop_last=None, **kwargs)[source]#
Return a DataLoader for MolGraphDatasets
- Parameters:
dataset (MoleculeDataset | ReactionDataset | MulticomponentDataset) – The dataset containing the molecules or reactions to load.
batch_size (int, default=64) – the batch size to load.
num_workers (int, default=0) – the number of workers used to build batches.
class_balance (bool, default=False) – Whether to perform class balancing (i.e., use an equal number of positive and negative molecules). Class balance is only available for single task classification datasets. Set shuffle to True in order to get a random subset of the larger class.
seed (int, default=None) – the random seed to use for shuffling (only used when shuffle is True).
shuffle (bool, default=True) – whether to shuffle the data during sampling.
drop_last (bool, default=None) – Whether to drop the last batch if it is of size 1 (needed if using batchnorm during training). If None, this will be set automatically.
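The automatic `drop_last` behaviour described above can be pictured with a small helper. This is a minimal sketch under the assumption that "set automatically" means "drop the last batch only when it would contain exactly one sample"; the function `resolve_drop_last` is hypothetical and is not chemprop's code.

```python
def resolve_drop_last(n_samples, batch_size, drop_last=None):
    """Decide whether to drop the final batch (hypothetical helper)."""
    if drop_last is None:
        # A trailing batch of exactly one sample is the case called out above.
        return n_samples % batch_size == 1
    return drop_last


auto_drop = resolve_drop_last(129, 64)         # 129 = 2 * 64 + 1
auto_keep = resolve_drop_last(128, 64)         # batches divide evenly
explicit = resolve_drop_last(129, 64, False)   # an explicit choice wins
```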
- class chemprop.data.LazyMoleculeDatapoint[source]#
Bases: _DatapointMixin, _LazyMoleculeDatapointMixin
A LazyMoleculeDatapoint contains a single SMILES string and all attributes needed to form an rdkit.Chem.Mol object. The molecule is computed lazily when the attribute mol is accessed.
- V_f: numpy.ndarray | None = None#
A numpy array of shape V x d_vf, where V is the number of atoms in the molecule, and d_vf is the number of additional features that will be concatenated to atom-level features before message passing
- E_f: numpy.ndarray | None = None#
A numpy array of shape E x d_ef, where E is the number of bonds in the molecule, and d_ef is the number of additional features that will be concatenated to bond-level features before message passing
- V_d: numpy.ndarray | None = None#
A numpy array of shape V x d_vd, where V is the number of atoms in the molecule, and d_vd is the number of additional descriptors that will be concatenated to atom-level descriptors after message passing
- class chemprop.data.MolAtomBondDatapoint[source]#
Bases: MoleculeDatapoint
A MoleculeDatapoint contains a single molecule and its associated features and targets.
- E_d: numpy.ndarray | None = None#
A numpy array of shape E x d_ed, where E is the number of bonds in the molecule, and d_ed is the number of additional descriptors that will be concatenated to edge-level descriptors after message passing
- atom_y: numpy.ndarray | None = None#
A numpy array of shape V x v_t, where V is the number of atoms in the molecule, and v_t is the number of atom targets. The order of atoms in the array should match the order of atoms in the mol. Unknown targets are indicated by `nan`s.
- atom_gt_mask: numpy.ndarray | None = None#
Indicates whether the atom targets are an inequality regression target of the form <x
- atom_lt_mask: numpy.ndarray | None = None#
Indicates whether the atom targets are an inequality regression target of the form >x
- bond_y: numpy.ndarray | None = None#
A numpy array of shape E x e_t, where E is the number of bonds in the molecule, and e_t is the number of bond targets. The order of bonds in the array should match the order of bonds in the mol. Unknown targets are indicated by `nan`s.
- bond_gt_mask: numpy.ndarray | None = None#
Indicates whether the bond targets are an inequality regression target of the form <x
- bond_lt_mask: numpy.ndarray | None = None#
Indicates whether the bond targets are an inequality regression target of the form >x
- atom_constraint: numpy.ndarray | None = None#
A numpy array of shape 1 x v_t containing the values that the atom property predictions should be constrained to sum to, with np.nan indicating no constraint for that property
- bond_constraint: numpy.ndarray | None = None#
A numpy array of shape 1 x e_t containing the values that the bond property predictions should be constrained to sum to, with np.nan indicating no constraint for that property
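To make the constraint convention concrete, the sketch below checks whether per-atom predictions sum to the requested value for each property, skipping properties whose constraint is NaN. It uses plain lists instead of numpy arrays and a flat list of v_t values instead of a 1 x v_t array. The function `constraint_violations` and the example numbers are hypothetical.

```python
import math


def constraint_violations(predictions, constraint, tol=1e-6):
    """Return, per property, sum(predictions) - constraint, or None if unconstrained.

    predictions: V rows, each holding v_t per-atom predicted values.
    constraint:  v_t values; NaN marks a property with no constraint.
    """
    violations = []
    for task, target_sum in enumerate(constraint):
        if math.isnan(target_sum):
            violations.append(None)
            continue
        total = sum(row[task] for row in predictions)
        diff = total - target_sum
        # Differences below the tolerance are treated as satisfied.
        violations.append(0.0 if abs(diff) < tol else diff)
    return violations


# Hypothetical 3-atom molecule with two atom properties: partial charges that
# must sum to 0, and a second property with no constraint.
preds = [[-0.4, 1.0], [0.2, 2.0], [0.2, 3.0]]
result = constraint_violations(preds, [0.0, float("nan")])
```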
- class chemprop.data.MoleculeDatapoint[source]#
Bases: _DatapointMixin, _MoleculeDatapointMixin
A MoleculeDatapoint contains a single molecule and its associated features and targets.
- V_f: numpy.ndarray | None = None#
A numpy array of shape V x d_vf, where V is the number of atoms in the molecule, and d_vf is the number of additional features that will be concatenated to atom-level features before message passing
- E_f: numpy.ndarray | None = None#
A numpy array of shape E x d_ef, where E is the number of bonds in the molecule, and d_ef is the number of additional features that will be concatenated to bond-level features before message passing
- V_d: numpy.ndarray | None = None#
A numpy array of shape V x d_vd, where V is the number of atoms in the molecule, and d_vd is the number of additional descriptors that will be concatenated to atom-level descriptors after message passing
- class chemprop.data.ReactionDatapoint[source]#
Bases: _DatapointMixin, _ReactionDatapointMixin
A ReactionDatapoint contains a single reaction and its associated features and targets.
- class chemprop.data.CuikmolmakerDataset[source]#
Bases: MoleculeDataset
A CuikmolmakerDataset composed of LazyMoleculeDatapoints and a CuikmolmakerMolGraphFeaturizer
A CuikmolmakerDataset produces featurized data for a batch of molecules for ingestion by an MPNN model. Data featurization is always performed on-the-fly and using the cuik-molmaker package. This batched processing is significantly faster and consumes less memory than the default featurization method when caching is not possible.
- Parameters:
data (Iterable[LazyMoleculeDatapoint]) – the data from which to create a dataset
featurizer (CuikmolmakerMolGraphFeaturizer) – the featurizer with which to generate MolGraphs of the molecules
- data: list[chemprop.data.datapoints.LazyMoleculeDatapoint]#
- property smiles: list[str]#
the SMILES strings associated with the dataset
- Return type:
list[str]
- class chemprop.data.Datum[source]#
Bases: NamedTuple
a singular training data point
- V_d: numpy.ndarray | None#
- x_d: numpy.ndarray | None#
- y: numpy.ndarray | None#
- weight: float#
- lt_mask: numpy.ndarray | None#
- gt_mask: numpy.ndarray | None#
- class chemprop.data.MolAtomBondDataset[source]#
Bases: MoleculeDataset, MolAtomBondGraphDataset
A MoleculeDataset composed of MoleculeDatapoints
A MoleculeDataset produces featurized data for input to an MPNN model. Typically, data featurization is performed on-the-fly and parallelized across multiple workers via the DataLoader class. However, for small datasets, it may be more efficient to featurize the data in advance and cache the results. This can be done by setting MoleculeDataset.cache=True.
- Parameters:
data (Iterable[MoleculeDatapoint]) – the data from which to create a dataset
featurizer (MoleculeFeaturizer) – the featurizer with which to generate MolGraphs of the molecules
n_workers (int, optional) – number of workers to use for cache calculation
- data: list[chemprop.data.datapoints.MolAtomBondDatapoint]#
- property atom_Y: list[numpy.ndarray]#
the (scaled) atom targets of the dataset
- Return type:
list[numpy.ndarray]
- property atom_constraints: numpy.ndarray#
- Return type:
numpy.ndarray
- property bond_Y: list[numpy.ndarray]#
the (scaled) bond targets of the dataset
- Return type:
list[numpy.ndarray]
- property bond_constraints: numpy.ndarray#
- Return type:
numpy.ndarray
- property atom_gt_mask: numpy.ndarray#
- Return type:
numpy.ndarray
- property atom_lt_mask: numpy.ndarray#
- Return type:
numpy.ndarray
- property bond_gt_mask: numpy.ndarray#
- Return type:
numpy.ndarray
- property bond_lt_mask: numpy.ndarray#
- Return type:
numpy.ndarray
- property E_ds: list[numpy.ndarray]#
the (scaled) bond descriptors of the dataset
- Return type:
list[numpy.ndarray]
- property d_ed: int#
the extra bond descriptor dimension, if any
- Return type:
int
- normalize_targets(key='mol', scaler=None)[source]#
Normalizes the targets of this dataset using a StandardScaler
The StandardScaler subtracts the mean and divides by the standard deviation for each task independently. NOTE: This should only be used for regression datasets.
- Returns:
a scaler fit to the targets.
- Return type:
StandardScaler
- Parameters:
key (str)
scaler (sklearn.preprocessing.StandardScaler | None)
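The per-task scaling described above can be written out by hand. The sketch below uses only the standard library and the population standard deviation, which is what scikit-learn's `StandardScaler` uses. The helpers `fit_standard_scaler` and `transform` are hypothetical stand-ins, not chemprop or scikit-learn functions.

```python
import statistics


def fit_standard_scaler(columns):
    """Return (mean, std) per task; std is the population standard deviation."""
    return [(statistics.fmean(col), statistics.pstdev(col)) for col in columns]


def transform(columns, params):
    """Subtract the mean and divide by the standard deviation, task by task."""
    return [
        [(y - mean) / std for y in col]
        for col, (mean, std) in zip(columns, params)
    ]


# Two hypothetical regression tasks on very different scales, stored column-wise.
targets = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]
params = fit_standard_scaler(targets)
scaled = transform(targets, params)
```

Because each task is scaled with its own mean and standard deviation, both columns end up with zero mean and unit variance regardless of their original units.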
- class chemprop.data.MolAtomBondDatum[source]#
Bases: NamedTuple
a singular training data point that supports atom and bond level targets
- V_d: numpy.ndarray | None#
- E_d: numpy.ndarray | None#
- x_d: numpy.ndarray | None#
- ys: tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]#
- weight: float#
- lt_masks: tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]#
- gt_masks: tuple[numpy.ndarray | None, numpy.ndarray | None, numpy.ndarray | None]#
- constraints: tuple[numpy.ndarray | None, numpy.ndarray | None]#
- class chemprop.data.MoleculeDataset[source]#
Bases: _MolGraphDatasetMixin, MolGraphDataset
A MoleculeDataset composed of MoleculeDatapoints
A MoleculeDataset produces featurized data for input to an MPNN model. Typically, data featurization is performed on-the-fly and parallelized across multiple workers via the DataLoader class. However, for small datasets, it may be more efficient to featurize the data in advance and cache the results. This can be done by setting MoleculeDataset.cache=True.
- Parameters:
data (Iterable[MoleculeDatapoint]) – the data from which to create a dataset
featurizer (MoleculeFeaturizer) – the featurizer with which to generate MolGraphs of the molecules
n_workers (int, optional) – number of workers to use for cache calculation
- data: list[chemprop.data.datapoints.MoleculeDatapoint]#
- featurizer: chemprop.featurizers.base.Featurizer[rdkit.Chem.Mol, chemprop.data.molgraph.MolGraph]#
- n_workers: int = 0#
- property cache: bool#
- Return type:
bool
- property smiles: list[str]#
the SMILES strings associated with the dataset
- Return type:
list[str]
- property mols: list[rdkit.Chem.Mol]#
the molecules associated with the dataset
- Return type:
list[rdkit.Chem.Mol]
- property V_fs: list[numpy.ndarray]#
the (scaled) atom features of the dataset
- Return type:
list[numpy.ndarray]
- property E_fs: list[numpy.ndarray]#
the (scaled) bond features of the dataset
- Return type:
list[numpy.ndarray]
- property V_ds: list[numpy.ndarray]#
the (scaled) atom descriptors of the dataset
- Return type:
list[numpy.ndarray]
- property d_vf: int#
the extra atom feature dimension, if any
- Return type:
int
- property d_ef: int#
the extra bond feature dimension, if any
- Return type:
int
- property d_vd: int#
the extra atom descriptor dimension, if any
- Return type:
int
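The trade-off between on-the-fly featurization and caching mentioned in the class description can be sketched with a toy dataset. This is an illustration of the general pattern only: the class `CachedFeaturizingDataset` and the function `toy_featurize` are hypothetical, and chemprop's own caching logic may differ.

```python
class CachedFeaturizingDataset:
    """Toy dataset that featurizes on access, or once up front when cached."""

    def __init__(self, data, featurize):
        self.data = list(data)
        self.featurize = featurize
        self._cache = None  # filled only while caching is enabled

    @property
    def cache(self):
        return self._cache is not None

    @cache.setter
    def cache(self, enabled):
        # Precompute every item when turned on; discard results when turned off.
        self._cache = [self.featurize(d) for d in self.data] if enabled else None

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        if self._cache is not None:
            return self._cache[idx]
        return self.featurize(self.data[idx])


calls = []


def toy_featurize(smiles):
    calls.append(smiles)  # record each featurization for illustration
    return len(smiles)    # stand-in for building a MolGraph


ds = CachedFeaturizingDataset(["CCO", "c1ccccc1"], toy_featurize)
ds.cache = True   # featurizes both items once
first = ds[0]
again = ds[0]     # served from the cache, no new featurization
```

With the cache enabled, repeated epochs reuse the stored results, at the cost of holding every featurized item in memory.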
- class chemprop.data.MulticomponentDataset[source]#
Bases: _MolGraphDatasetMixin, torch.utils.data.Dataset
A MulticomponentDataset is a Dataset composed of parallel MoleculeDatasets and ReactionDatasets
- datasets: list[MoleculeDataset | ReactionDataset]#
the parallel datasets
- property n_components: int#
- Return type:
int
- property smiles: list[list[str]]#
- Return type:
list[list[str]]
- property names: list[list[str]]#
- Return type:
list[list[str]]
- property mols: list[list[rdkit.Chem.Mol]]#
- Return type:
list[list[rdkit.Chem.Mol]]
- normalize_targets(scaler=None)[source]#
Normalizes the targets of this dataset using a StandardScaler
The StandardScaler subtracts the mean and divides by the standard deviation for each task independently. NOTE: This should only be used for regression datasets.
- Returns:
a scaler fit to the targets.
- Return type:
StandardScaler
- Parameters:
scaler (sklearn.preprocessing.StandardScaler | None)
- normalize_inputs(key='X_d', scaler=None)[source]#
- Parameters:
key (str)
scaler (list[sklearn.preprocessing.StandardScaler] | None)
- Return type:
list[sklearn.preprocessing.StandardScaler]
- reset()[source]#
Reset the atom and bond features; atom and extra descriptors; and targets of each datapoint to their initial, unnormalized values.
- property d_xd: list[int]#
the extra molecule descriptor dimension, if any
- Return type:
list[int]
- property d_vf: list[int]#
- Return type:
list[int]
- property d_ef: list[int]#
- Return type:
list[int]
- property d_vd: list[int]#
- Return type:
list[int]
- class chemprop.data.ReactionDataset[source]#
Bases: _MolGraphDatasetMixin, MolGraphDataset
A ReactionDataset composed of ReactionDatapoints
Note: The featurized data provided by this class may be cached, similar to a MoleculeDataset. To enable the cache, set ReactionDataset.cache=True.
- data: list[chemprop.data.datapoints.ReactionDatapoint]#
the dataset from which to load
- featurizer: chemprop.featurizers.base.Featurizer[chemprop.types.Rxn, chemprop.data.molgraph.MolGraph]#
the featurizer with which to generate MolGraphs of the input
- n_workers: int = 0#
number of workers to use for cache calculation
- property cache: bool#
- Return type:
bool
- property smiles: list[tuple]#
- Return type:
list[tuple]
- property mols: list[chemprop.types.Rxn]#
- Return type:
list[chemprop.types.Rxn]
- property d_vf: int#
- Return type:
int
- property d_ef: int#
- Return type:
int
- property d_vd: int#
- Return type:
int
- class chemprop.data.MolGraph[source]#
Bases: NamedTuple
A MolGraph represents the graph featurization of a molecule.
- V: numpy.ndarray#
an array of shape V x d_v containing the atom features of the molecule
- E: numpy.ndarray#
an array of shape E x d_e containing the bond features of the molecule
- edge_index: numpy.ndarray#
an array of shape 2 x E containing the edges of the graph in COO format
- rev_edge_index: numpy.ndarray#
An array of shape E that maps from an edge index to the index of the source of the reverse edge in the edge_index attribute.
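The relationship between `edge_index` and `rev_edge_index` is easiest to see on a tiny graph. The sketch below assumes each bond is stored as two directed edges added back to back, a common convention for directed message passing. The function `directed_edges` is hypothetical and uses plain lists rather than numpy arrays.

```python
def directed_edges(bonds):
    """Expand undirected bonds into directed edges plus a reverse-edge map."""
    sources, targets = [], []
    for a, b in bonds:
        sources.extend([a, b])  # edge 2k:   a -> b
        targets.extend([b, a])  # edge 2k+1: b -> a
    # Edges were added in forward/reverse pairs, so each edge's partner
    # is the neighbouring entry of the same pair.
    rev_edge_index = [i + 1 if i % 2 == 0 else i - 1 for i in range(len(sources))]
    return [sources, targets], rev_edge_index


# Hypothetical 3-atom chain 0-1-2.
edge_index, rev_edge_index = directed_edges([(0, 1), (1, 2)])
```

Looking up `rev_edge_index[i]` gives the edge that runs in the opposite direction of edge `i`, which lets a message-passing step exclude the message that came from the edge's own target.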
- class chemprop.data.ClassBalanceSampler(Y, seed=None, shuffle=False)[source]#
Bases: torch.utils.data.Sampler
A ClassBalanceSampler samples data from a MolGraphDataset such that positive and negative classes are equally sampled
- Parameters:
dataset (MolGraphDataset) – the dataset from which to sample
seed (int) – the random seed to use for shuffling (only used when shuffle is True)
shuffle (bool, default=False) – whether to shuffle the data during sampling
Y (numpy.ndarray)
- shuffle = False#
- rg#
- pos_idxs#
- neg_idxs#
- length#
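A minimal sketch of class-balanced sampling, assuming binary labels and that the larger class is truncated to the size of the smaller one. The generator `balanced_indices` is hypothetical; it uses Python's `random` module rather than whatever generator the sampler's `rg` attribute holds.

```python
import random


def balanced_indices(labels, seed=None, shuffle=False):
    """Yield alternating positive/negative indices, equal numbers of each."""
    pos = [i for i, y in enumerate(labels) if y]
    neg = [i for i, y in enumerate(labels) if not y]
    if shuffle:
        rng = random.Random(seed)
        rng.shuffle(pos)
        rng.shuffle(neg)
    # zip stops at the shorter list, so the larger class is truncated.
    for p, n in zip(pos, neg):
        yield p
        yield n


labels = [1, 0, 0, 0, 1, 0]
order = list(balanced_indices(labels))
shuffled = list(balanced_indices(labels, seed=0, shuffle=True))
```

Without shuffling, the same leading members of the larger class are used on every pass. Shuffling first is what turns the truncation into a random subset.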
- class chemprop.data.SeededSampler(N, seed)[source]#
Bases: torch.utils.data.Sampler
A SeededSampler is a class for iterating through a dataset in a randomly seeded fashion
- Parameters:
N (int)
seed (int)
- idxs#
- rg#
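Seeded iteration can be sketched as a shuffle driven by a seeded generator. The class `SeededOrder` below is a hypothetical stand-in that borrows the attribute names `idxs` and `rg` from the listing above. It uses Python's `random` module, which may not match the generator chemprop uses.

```python
import random


class SeededOrder:
    """Iterate over range(n) in an order fixed by the seed (illustrative only)."""

    def __init__(self, n, seed):
        self.idxs = list(range(n))
        self.rg = random.Random(seed)

    def __iter__(self):
        # Reshuffle on every pass; the sequence of passes is set by the seed.
        self.rg.shuffle(self.idxs)
        return iter(self.idxs)

    def __len__(self):
        return len(self.idxs)


run_a = list(SeededOrder(5, seed=0))
run_b = list(SeededOrder(5, seed=0))
```

Two samplers built with the same seed visit the data in the same order, which is what makes shuffled training runs repeatable.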
- class chemprop.data.SplitType[source]#
Bases: chemprop.utils.utils.EnumMapping
Enum where members are also (and must be) strings
- SCAFFOLD_BALANCED#
- RANDOM_WITH_REPEATED_SMILES#
- RANDOM#
- KENNARD_STONE#
- KMEANS#
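The idea of an enum whose members are also strings can be shown with the standard library. The class `SplitKind` below is a hypothetical illustration. Its member values and its `get` helper are assumptions and should not be read as the definition of `SplitType` or `EnumMapping`.

```python
from enum import Enum


class SplitKind(str, Enum):
    """Hypothetical string-valued enum; member values are assumed."""

    RANDOM = "random"
    KMEANS = "kmeans"

    @classmethod
    def get(cls, name):
        """Accept either a member or its name as a string, case-insensitively."""
        if isinstance(name, cls):
            return name
        return cls[name.upper()]


kind = SplitKind.get("random")
same = SplitKind.get(SplitKind.KMEANS)
```

Because the members are strings, they compare equal to plain string literals, which is convenient when a split type arrives from a command line or a config file.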
- chemprop.data.make_split_indices(mols, split='random', sizes=(0.8, 0.1, 0.1), seed=0, num_replicates=1, num_folds=None)[source]#
Splits data into training, validation, and test splits.
- Parameters:
mols (Sequence[Chem.Mol] | Sized) – Sequence of RDKit molecules to use for structure based splitting or any object with a length equal to the number of datapoints if using random splitting
split (SplitType | str, optional) – Split type, one of SplitType, by default "random"
sizes (tuple[float, float, float], optional) – 3-tuple with the proportions of data in the train, validation, and test sets, by default (0.8, 0.1, 0.1). Set the middle value to 0 for a two way split.
seed (int, optional) – The random seed passed to astartes, by default 0
num_replicates (int, optional) – Number of replicates, by default 1
num_folds (None, optional) – This argument was removed in v2.1 - use num_replicates instead.
- Returns:
2- or 3-member tuple containing num_replicates length lists of training, validation, and testing indexes.
Important: validation may or may not be present.
- Return type:
tuple[list[list[int]], …]
- Raises:
ValueError – Requested split sizes tuple not of length 3
ValueError – Unsupported split method requested
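For intuition about how `sizes` and `seed` interact in a random split, here is a simplified sketch using only the standard library. It is not the astartes-based implementation: the function `random_split_indices` is hypothetical, handles a single replicate, and always returns three lists (the validation list is simply empty for a two-way split).

```python
import random


def random_split_indices(n, sizes=(0.8, 0.1, 0.1), seed=0):
    """Shuffle range(n) and cut it into train/val/test index lists."""
    if len(sizes) != 3:
        raise ValueError("sizes must have exactly three entries")
    idxs = list(range(n))
    random.Random(seed).shuffle(idxs)
    n_train = round(sizes[0] * n)
    n_val = round(sizes[1] * n)
    train = idxs[:n_train]
    val = idxs[n_train:n_train + n_val]
    test = idxs[n_train + n_val:]
    return train, val, test


train, val, test = random_split_indices(20)
two_way = random_split_indices(20, sizes=(0.8, 0.0, 0.2))
```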
- chemprop.data.split_data_by_indices(data, train_indices=None, val_indices=None, test_indices=None)[source]#
Splits data into training, validation, and test groups based on split indices given.
- Parameters:
data (Datapoints | MulticomponentDatapoints)
train_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | None)
val_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | None)
test_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | None)
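Since the index arguments are nested iterables (one inner iterable per replicate), selecting data by indices amounts to a nested lookup. The sketch below shows that shape with plain lists. The helper `take_by_indices` is hypothetical, and returning `None` when no indices are given is an assumption made for illustration.

```python
def take_by_indices(data, indices):
    """Return one list of datapoints per replicate, or None if no indices given."""
    if indices is None:
        return None
    return [[data[i] for i in replicate] for replicate in indices]


data = ["CCO", "CCN", "CCC", "c1ccccc1", "CO"]
train_indices = [[0, 1, 2], [2, 3, 4]]  # two replicates
test_indices = [[3, 4], [0, 1]]

train = take_by_indices(data, train_indices)
val = take_by_indices(data, None)
test = take_by_indices(data, test_indices)
```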