chemprop.data#
Package Contents#
- class chemprop.data.BatchMolGraph[source]#
A BatchMolGraph represents a batch of individual MolGraphs.
It has all the attributes of a MolGraph with the addition of the batch attribute. This class is intended for use with data loading, so it uses Tensors to store data.
- mgs: dataclasses.InitVar[Sequence[chemprop.data.molgraph.MolGraph]]#
A list of individual MolGraphs to be batched together
- V: torch.Tensor#
the atom feature matrix
- E: torch.Tensor#
the bond feature matrix
- edge_index: torch.Tensor#
a tensor of shape 2 x E containing the edges of the graph in COO format
- rev_edge_index: torch.Tensor#
A tensor of shape E that maps from an edge index to the index of the source of the reverse edge in the edge_index attribute.
- __post_init__(mgs)[source]#
- Parameters:
mgs (Sequence[chemprop.data.molgraph.MolGraph])
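The batching described for BatchMolGraph can be sketched in plain Python. This is a minimal illustration under stated assumptions, not chemprop's implementation: `batch_graphs` is a hypothetical helper, lists stand in for tensors, and each graph is a dict with hypothetical keys. The sketch concatenates the feature rows, shifts each graph's atom and edge indices by the running totals, and records in `batch` which graph every atom came from.

```python
def batch_graphs(graphs):
    """Concatenate small graphs into one batched graph (hypothetical helper).

    Each graph is a dict with keys "V" (per-atom feature rows),
    "E" (per-edge feature rows), "edge_index" (a [sources, targets] pair),
    and "rev_edge_index" (position of each edge's reverse edge).
    """
    V, E, src, dst, rev, batch = [], [], [], [], [], []
    n_atoms = n_edges = 0
    for i, g in enumerate(graphs):
        V.extend(g["V"])
        E.extend(g["E"])
        # Shift atom indices so they point into the concatenated V.
        src.extend(a + n_atoms for a in g["edge_index"][0])
        dst.extend(a + n_atoms for a in g["edge_index"][1])
        # Shift edge indices so they point into the concatenated E.
        rev.extend(e + n_edges for e in g["rev_edge_index"])
        # Record the parent graph of every atom.
        batch.extend([i] * len(g["V"]))
        n_atoms += len(g["V"])
        n_edges += len(g["E"])
    return {"V": V, "E": E, "edge_index": [src, dst],
            "rev_edge_index": rev, "batch": batch}


# Two toy graphs: a 2-atom molecule and a 3-atom chain, one feature per row.
g1 = {"V": [[1.0], [2.0]], "E": [[0.1], [0.1]],
      "edge_index": [[0, 1], [1, 0]], "rev_edge_index": [1, 0]}
g2 = {"V": [[3.0], [4.0], [5.0]], "E": [[0.2], [0.2], [0.3], [0.3]],
      "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]], "rev_edge_index": [1, 0, 3, 2]}
bmg = batch_graphs([g1, g2])
```

Because every index is shifted by the size of the graphs that precede it, the batched graph can be processed as one large disconnected graph, and `batch` is enough to pool atom-level results back into per-molecule values.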
- class chemprop.data.TrainingBatch[source]#
Bases:
NamedTuple
- bmg: BatchMolGraph#
- V_d: torch.Tensor | None#
- X_d: torch.Tensor | None#
- Y: torch.Tensor | None#
- w: torch.Tensor#
- lt_mask: torch.Tensor | None#
- gt_mask: torch.Tensor | None#
- chemprop.data.collate_batch(batch)[source]#
- Parameters:
batch (Iterable[chemprop.data.datasets.Datum])
- Return type:
TrainingBatch
- chemprop.data.collate_multicomponent(batches)[source]#
- Parameters:
batches (Iterable[Iterable[chemprop.data.datasets.Datum]])
- Return type:
- chemprop.data.build_dataloader(dataset, batch_size=64, num_workers=0, class_balance=False, seed=None, shuffle=True, **kwargs)[source]#
Return a DataLoader for MolGraphDatasets
- Parameters:
dataset (MoleculeDataset | ReactionDataset | MulticomponentDataset) – The dataset containing the molecules or reactions to load.
batch_size (int, default=64) – the batch size to load.
num_workers (int, default=0) – the number of workers used to build batches.
class_balance (bool, default=False) – Whether to perform class balancing (i.e., use an equal number of positive and negative molecules). Class balance is only available for single task classification datasets. Set shuffle to True in order to get a random subset of the larger class.
seed (int, default=None) – the random seed to use for shuffling (only used when shuffle is True).
shuffle (bool, default=True) – whether to shuffle the data during sampling.
- class chemprop.data.MoleculeDatapoint[source]#
Bases:
_DatapointMixin, _MoleculeDatapointMixin
A MoleculeDatapoint contains a single molecule and its associated features and targets.
- V_f: numpy.ndarray | None#
A numpy array of shape V x d_vf, where V is the number of atoms in the molecule, and d_vf is the number of additional features that will be concatenated to atom-level features before message passing
- E_f: numpy.ndarray | None#
A numpy array of shape E x d_ef, where E is the number of bonds in the molecule, and d_ef is the number of additional features that will be concatenated to bond-level features before message passing
- V_d: numpy.ndarray | None#
A numpy array of shape V x d_vd, where V is the number of atoms in the molecule, and d_vd is the number of additional descriptors that will be concatenated to atom-level descriptors after message passing
- class chemprop.data.ReactionDatapoint[source]#
Bases:
_DatapointMixin, _ReactionDatapointMixin
A ReactionDatapoint contains a single reaction and its associated features and targets.
- class chemprop.data.MoleculeDataset[source]#
Bases:
_MolGraphDatasetMixin, MolGraphDataset
A MoleculeDataset composed of MoleculeDatapoints
A MoleculeDataset produces featurized data for input to an MPNN model. Typically, data featurization is performed on-the-fly and parallelized across multiple workers via the DataLoader class. However, for small datasets, it may be more efficient to featurize the data in advance and cache the results. This can be done by setting MoleculeDataset.cache=True.
- Parameters:
data (Iterable[MoleculeDatapoint]) – the data from which to create a dataset
featurizer (MoleculeFeaturizer) – the featurizer with which to generate MolGraphs of the molecules
- property cache: bool#
- Return type:
bool
- property smiles: list[str]#
the SMILES strings associated with the dataset
- Return type:
list[str]
- property mols: list[rdkit.Chem.Mol]#
the molecules associated with the dataset
- Return type:
list[rdkit.Chem.Mol]
- property V_fs: list[numpy.ndarray]#
the (scaled) atom features of the dataset
- Return type:
list[numpy.ndarray]
- property E_fs: list[numpy.ndarray]#
the (scaled) bond features of the dataset
- Return type:
list[numpy.ndarray]
- property V_ds: list[numpy.ndarray]#
the (scaled) atom descriptors of the dataset
- Return type:
list[numpy.ndarray]
- property d_vf: int#
the extra atom feature dimension, if any
- Return type:
int
- property d_ef: int#
the extra bond feature dimension, if any
- Return type:
int
- property d_vd: int#
the extra atom descriptor dimension, if any
- Return type:
int
- data: list[chemprop.data.datapoints.MoleculeDatapoint]#
- featurizer: chemprop.featurizers.base.Featurizer[rdkit.Chem.Mol, chemprop.data.molgraph.MolGraph]#
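The trade-off between on-the-fly featurization and caching described for MoleculeDataset can be illustrated with a toy class. This is a sketch under stated assumptions, not chemprop's code: `CachingDataset` and `count_carbons` are hypothetical names, and the "featurizer" is any callable. With the cache off, the featurizer runs on every access; setting `cache = True` featurizes every item once and serves later accesses from memory.

```python
class CachingDataset:
    """Toy dataset that featurizes lazily or serves cached results.

    Hypothetical stand-in for illustration only; `featurize` is any callable.
    """

    def __init__(self, data, featurize):
        self.data = list(data)
        self.featurize = featurize
        self._cache = None  # no cached features yet

    @property
    def cache(self):
        return self._cache is not None

    @cache.setter
    def cache(self, enabled):
        # Featurize everything once when enabled; drop the cache when disabled.
        self._cache = [self.featurize(d) for d in self.data] if enabled else None

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        if self._cache is not None:
            return self._cache[idx]
        return self.featurize(self.data[idx])


calls = []


def count_carbons(smiles):
    """Toy featurizer: counts 'C' characters and logs each call."""
    calls.append(smiles)
    return smiles.count("C")


ds = CachingDataset(["CC", "CCC"], count_carbons)
ds[0]
ds[0]                          # on-the-fly: the featurizer runs on every access
n_lazy = len(calls)
ds.cache = True                # featurize all items once, up front
ds[0]
ds[0]                          # served from the cache, no new featurizer calls
n_cached = len(calls) - n_lazy
```

Caching trades memory for speed, which is why it tends to pay off mainly for datasets small enough to hold every featurized item at once.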
- class chemprop.data.ReactionDataset[source]#
Bases:
_MolGraphDatasetMixin, MolGraphDataset
A ReactionDataset composed of ReactionDatapoints
Note
The featurized data provided by this class may be cached, similar to a MoleculeDataset. To enable the cache, set ReactionDataset.cache=True.
- property cache: bool#
- Return type:
bool
- property smiles: list[tuple]#
- Return type:
list[tuple]
- property mols: list[chemprop.types.Rxn]#
- Return type:
list[chemprop.types.Rxn]
- property d_vf: int#
- Return type:
int
- property d_ef: int#
- Return type:
int
- property d_vd: int#
- Return type:
int
- data: list[chemprop.data.datapoints.ReactionDatapoint]#
the dataset from which to load
- featurizer: chemprop.featurizers.base.Featurizer[chemprop.types.Rxn, chemprop.data.molgraph.MolGraph]#
the featurizer with which to generate MolGraphs of the input
- class chemprop.data.Datum[source]#
Bases:
NamedTuple
a singular training data point
- V_d: numpy.ndarray | None#
- x_d: numpy.ndarray | None#
- y: numpy.ndarray | None#
- weight: float#
- lt_mask: numpy.ndarray | None#
- gt_mask: numpy.ndarray | None#
- class chemprop.data.MulticomponentDataset[source]#
Bases:
_MolGraphDatasetMixin, torch.utils.data.Dataset
A MulticomponentDataset is a Dataset composed of parallel MoleculeDatasets and ReactionDatasets
- property n_components: int#
- Return type:
int
- property smiles: list[list[str]]#
- Return type:
list[list[str]]
- property names: list[list[str]]#
- Return type:
list[list[str]]
- property mols: list[list[rdkit.Chem.Mol]]#
- Return type:
list[list[rdkit.Chem.Mol]]
- property d_xd: list[int]#
the extra molecule descriptor dimension, if any
- Return type:
list[int]
- property d_vf: list[int]#
- Return type:
list[int]
- property d_ef: list[int]#
- Return type:
list[int]
- property d_vd: list[int]#
- Return type:
list[int]
- datasets: list[MoleculeDataset | ReactionDataset]#
the parallel datasets
- normalize_targets(scaler=None)[source]#
Normalizes the targets of this dataset using a StandardScaler
The StandardScaler subtracts the mean and divides by the standard deviation for each task independently. NOTE: This should only be used for regression datasets.
- Returns:
a scaler fit to the targets.
- Return type:
StandardScaler
- Parameters:
scaler (sklearn.preprocessing.StandardScaler | None)
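The per-task standardization described for normalize_targets can be written out in a few lines of standard-library Python. This is a sketch only: `fit_standard_scaler` and `transform` are hypothetical helpers, not chemprop or scikit-learn functions, and missing targets are not handled. The sketch assumes the population standard deviation and falls back to a scale of 1.0 for constant columns.

```python
from statistics import fmean, pstdev


def fit_standard_scaler(Y):
    """Return per-task (means, stds) for a list of target rows (hypothetical helper)."""
    columns = list(zip(*Y))
    means = [fmean(col) for col in columns]
    # Population standard deviation; constant columns fall back to 1.0
    # so that the division in transform() is always defined.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(Y, means, stds):
    """Subtract the mean and divide by the standard deviation, task by task."""
    return [[(y - m) / s for y, m, s in zip(row, means, stds)] for row in Y]


# Three data points, two regression tasks on very different scales.
Y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = fit_standard_scaler(Y)
Y_scaled = transform(Y, means, stds)
```

After scaling, both task columns have zero mean and unit variance, so neither task dominates the loss simply because of its units. Keeping the fitted means and standard deviations allows predictions to be mapped back to the original scale.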
- chemprop.data.MolGraphDataset: TypeAlias#
- class chemprop.data.MolGraph[source]#
Bases:
NamedTuple
A MolGraph represents the graph featurization of a molecule.
- V: numpy.ndarray#
an array of shape V x d_v containing the atom features of the molecule
- E: numpy.ndarray#
an array of shape E x d_e containing the bond features of the molecule
- edge_index: numpy.ndarray#
an array of shape 2 x E containing the edges of the graph in COO format
- rev_edge_index: numpy.ndarray#
An array of shape E that maps from an edge index to the index of the source of the reverse edge in the edge_index attribute.
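The relationship between edge_index and rev_edge_index is easier to see on a small example. The sketch below uses plain lists in place of arrays and assumes that every bond is stored as two directed edges, one per direction; `reverse_edge_index` is a hypothetical helper written for this illustration.

```python
def reverse_edge_index(edge_index):
    """For each directed edge (u, v), find the position of the edge (v, u)."""
    sources, targets = edge_index
    position = {(u, v): i for i, (u, v) in enumerate(zip(sources, targets))}
    return [position[(v, u)] for u, v in zip(sources, targets)]


# A 3-atom chain 0-1-2 in COO format: row 0 holds the source atoms,
# row 1 holds the target atoms, and each column is one directed edge.
edge_index = [[0, 1, 1, 2],
              [1, 0, 2, 1]]
rev_edge_index = reverse_edge_index(edge_index)
```

Here edge 0 is (0, 1) and edge 1 is (1, 0), so each is the reverse of the other, and likewise for edges 2 and 3.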
- class chemprop.data.ClassBalanceSampler(Y, seed=None, shuffle=False)[source]#
Bases:
torch.utils.data.Sampler
A ClassBalanceSampler samples data from a MolGraphDataset such that positive and negative classes are equally sampled
- Parameters:
dataset (MolGraphDataset) – the dataset from which to sample
seed (int) – the random seed to use for shuffling (only used when shuffle is True)
shuffle (bool, default=False) – whether to shuffle the data during sampling
Y (numpy.ndarray)
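One way to sample positive and negative classes equally is to interleave the two index lists and stop when the smaller class runs out. The generator below is a sketch of that idea, not the sampler's actual code: `class_balanced_indices` is a hypothetical name, and `Y` is assumed to be a flat list of 0/1 labels for a single task.

```python
import random


def class_balanced_indices(Y, seed=None, shuffle=False):
    """Yield indices alternating positive/negative examples (hypothetical helper).

    Y is a flat list of 0/1 labels for a single classification task.
    """
    pos = [i for i, y in enumerate(Y) if y == 1]
    neg = [i for i, y in enumerate(Y) if y == 0]
    if shuffle:
        rng = random.Random(seed)
        rng.shuffle(pos)
        rng.shuffle(neg)
    # zip stops at the shorter list, so both classes contribute equally.
    for p, n in zip(pos, neg):
        yield p
        yield n


Y = [1, 0, 0, 0, 1, 0]
order = list(class_balanced_indices(Y))
```

Without shuffling, the same members of the larger class are selected on every pass. Shuffling picks a random subset of the larger class instead, which matches the advice given for the class_balance option of build_dataloader.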
- class chemprop.data.SeededSampler(N, seed)[source]#
Bases:
torch.utils.data.Sampler
A SeededSampler is a class for iterating through a dataset in a randomly seeded fashion
- Parameters:
N (int)
seed (int)
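Seeded iteration means the shuffled order is random but reproducible. A minimal sketch using the standard library follows; `SeededOrder` is a hypothetical class written for illustration, and the real sampler may differ in detail.

```python
import random


class SeededOrder:
    """Iterate over range(N) in a shuffled order driven by a seed (illustrative)."""

    def __init__(self, N, seed):
        self.indices = list(range(N))
        self.rng = random.Random(seed)

    def __iter__(self):
        # Each pass reshuffles with the same generator, so successive epochs
        # may differ from one another while the whole sequence of orders
        # stays reproducible for a given seed.
        self.rng.shuffle(self.indices)
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)


a = SeededOrder(5, seed=42)
b = SeededOrder(5, seed=42)
first_a, first_b = list(a), list(b)
second_a, second_b = list(a), list(b)
```

Two samplers built with the same `N` and `seed` visit the indices in the same order on every epoch, which makes training runs repeatable.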
- class chemprop.data.SplitType[source]#
Bases:
chemprop.utils.utils.EnumMapping
Enum where members are also (and must be) strings
- CV_NO_VAL#
- CV#
- SCAFFOLD_BALANCED#
- RANDOM_WITH_REPEATED_SMILES#
- RANDOM#
- KENNARD_STONE#
- KMEANS#
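An enum whose members are also strings can be built by mixing `str` into `Enum`. The sketch below shows the general pattern only: `Split`, its member values, and the `get` helper are hypothetical and are not taken from chemprop's EnumMapping.

```python
from enum import Enum


class Split(str, Enum):
    """Hypothetical string-valued enum; the member values are made up here."""

    RANDOM = "random"
    SCAFFOLD = "scaffold"

    @classmethod
    def get(cls, name):
        """Accept either a member or a case-insensitive member name."""
        if isinstance(name, cls):
            return name
        try:
            return cls[name.upper()]
        except KeyError:
            raise ValueError(f"Unsupported split: {name!r}") from None


member = Split.get("random")
```

Because each member is a `str`, it compares equal to its string value, so an API can accept either the enum member or a plain string for the same argument.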
- chemprop.data.make_split_indices(mols, split='random', sizes=(0.8, 0.1, 0.1), seed=0, num_folds=1)[source]#
Splits data into training, validation, and test splits.
- Parameters:
mols (Sequence[Chem.Mol]) – Sequence of RDKit molecules to use for structure based splitting
split (SplitType | str, optional) – Split type, one of SplitType, by default “random”
sizes (tuple[float, float, float], optional) – 3-tuple with the proportions of data in the train, validation, and test sets, by default (0.8, 0.1, 0.1). Set the middle value to 0 for a two way split.
seed (int, optional) – The random seed passed to astartes, by default 0
num_folds (int, optional) – Number of folds to create (only needed for “cv” and “cv-no-test”), by default 1
- Returns:
A tuple of lists of indices corresponding to the train, validation, and test splits of the data. If the split type is “cv” or “cv-no-test”, returns a tuple of lists of lists of indices corresponding to the train, validation, and test splits of each fold.
Important
validation may or may not be present
- Return type:
tuple[list[int], list[int], list[int]] | tuple[list[list[int], …], list[list[int], …], list[list[int], …]]
- Raises:
ValueError – Requested split sizes tuple not of length 3
ValueError – Inappropriate number of folds requested
ValueError – Unsupported split method requested
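For the simplest case, a random split, the role of `sizes` and `seed` can be sketched without any chemistry. The function below is a hypothetical stand-in: it does not call astartes and ignores molecular structure. It shuffles the indices with a seeded generator and cuts them according to the three proportions.

```python
import random


def random_split_indices(n, sizes=(0.8, 0.1, 0.1), seed=0):
    """Shuffle range(n) and cut it into train/val/test index lists (sketch)."""
    if len(sizes) != 3:
        raise ValueError("sizes must be a 3-tuple of proportions")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(sizes[0] * n)
    n_val = round(sizes[1] * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the test set takes the remainder
    return train, val, test


train, val, test = random_split_indices(10)
# A middle value of 0 gives a two-way split with an empty validation list.
two_way = random_split_indices(10, sizes=(0.8, 0.0, 0.2))
```

The returned lists hold positions rather than data points, so the same indices can be reused to slice molecules, targets, or any other parallel sequence.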
- chemprop.data.split_data_by_indices(data, train_indices=None, val_indices=None, test_indices=None)[source]#
Splits data into training, validation, and test groups based on split indices given.
- Parameters:
data (Datapoints | MulticomponentDatapoints)
train_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | collections.abc.Iterable[int] | None)
val_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | collections.abc.Iterable[int] | None)
test_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | collections.abc.Iterable[int] | None)
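The index arguments accept either a flat iterable of integers or an iterable of such iterables, for example one per fold. The sketch below shows one way such a function could behave. `select` and `split_by_indices` are hypothetical re-implementations, and returning `None` for an omitted index set is an assumption.

```python
def select(data, indices):
    """Pick items by index; a list of index lists yields one subset per fold."""
    if indices is None:
        return None
    indices = list(indices)
    if indices and not isinstance(indices[0], int):
        # Nested case: one index list per fold.
        return [[data[i] for i in fold] for fold in indices]
    return [data[i] for i in indices]


def split_by_indices(data, train_indices=None, val_indices=None, test_indices=None):
    """Return (train, val, test) subsets of data (hypothetical re-implementation)."""
    return tuple(select(data, idx) for idx in (train_indices, val_indices, test_indices))


data = ["a", "b", "c", "d", "e"]
train, val, test = split_by_indices(data, [0, 1, 2], [3], [4])
folds, no_val, no_test = split_by_indices(data, train_indices=[[0, 1], [2, 3]])
```

In this sketch the output mirrors the nesting of the input: flat indices give one list of data points, and per-fold indices give one list per fold.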