Data¶
chemprop.data contains functions and classes for loading, containing, and splitting data.
Data¶
Classes and functions from chemprop.data.data.py.
- class chemprop.data.data.MoleculeDataLoader(dataset: MoleculeDataset, batch_size: int = 50, num_workers: int = 8, class_balance: bool = False, shuffle: bool = False, seed: int = 0)[source]¶
A MoleculeDataLoader is a PyTorch DataLoader for loading a MoleculeDataset.
- Parameters
dataset – The MoleculeDataset containing the molecules to load.
batch_size – Batch size.
num_workers – Number of workers used to build batches.
class_balance – Whether to perform class balancing (i.e., use an equal number of positive and negative molecules). Class balance is only available for single task classification datasets. Set shuffle to True in order to get a random subset of the larger class.
shuffle – Whether to shuffle the data.
seed – Random seed. Only needed if shuffle is True.
- property gt_targets: List[List[Optional[bool]]]¶
Returns, for each molecule, booleans indicating whether each target is a greater-than inequality target rather than a value target.
- Returns
A list of lists of booleans (or None) indicating which targets are greater-than inequalities.
- property iter_size: int¶
Returns the number of data points included in each full iteration through the MoleculeDataLoader.
- property lt_targets: List[List[Optional[bool]]]¶
Returns, for each molecule, booleans indicating whether each target is a less-than inequality target rather than a value target.
- Returns
A list of lists of booleans (or None) indicating which targets are less-than inequalities.
- property targets: List[List[Optional[float]]]¶
Returns the targets associated with each molecule.
- Returns
A list of lists of floats (or None) containing the targets.
- class chemprop.data.data.MoleculeDatapoint(smiles: List[str], targets: Optional[List[Optional[float]]] = None, row: Optional[OrderedDict] = None, data_weight: Optional[float] = None, gt_targets: Optional[List[bool]] = None, lt_targets: Optional[List[bool]] = None, features: Optional[ndarray] = None, features_generator: Optional[List[str]] = None, phase_features: Optional[List[float]] = None, atom_features: Optional[ndarray] = None, atom_descriptors: Optional[ndarray] = None, bond_features: Optional[ndarray] = None, overwrite_default_atom_features: bool = False, overwrite_default_bond_features: bool = False)[source]¶
A MoleculeDatapoint contains a single molecule and its associated features and targets.
- Parameters
smiles – A list of the SMILES strings for the molecules.
targets – A list of targets for the molecule (contains None for unknown target values).
row – The raw CSV row containing the information for this molecule.
data_weight – Weighting of the datapoint for the loss function.
gt_targets – Indicates whether the targets are an inequality regression target of the form “>x”.
lt_targets – Indicates whether the targets are an inequality regression target of the form “<x”.
features – A numpy array containing additional features (e.g., Morgan fingerprint).
features_generator – A list of features generators to use.
phase_features – A one-hot vector indicating the phase of the data, as used in spectra data.
atom_descriptors – A numpy array containing additional atom descriptors to featurize the molecule.
bond_features – A numpy array containing additional bond features to featurize the molecule.
overwrite_default_atom_features – Whether to overwrite the default atom features with atom_features.
overwrite_default_bond_features – Whether to overwrite the default bond features with bond_features.
- extend_features(features: ndarray) None [source]¶
Extends the features of the molecule.
- Parameters
features – A 1D numpy array of extra features for the molecule.
- property mol: List[Union[Mol, Tuple[Mol, Mol]]]¶
Gets the list of RDKit molecules corresponding to the SMILES list.
- property number_of_molecules: int¶
Gets the number of molecules in the MoleculeDatapoint.
- Returns
The number of molecules.
- reset_features_and_targets() None [source]¶
Resets the features (atom, bond, and molecule) and targets to their raw values.
- set_atom_descriptors(atom_descriptors: ndarray) None [source]¶
Sets the atom descriptors of the molecule.
- Parameters
atom_descriptors – A 1D numpy array of features for the molecule.
- set_atom_features(atom_features: ndarray) None [source]¶
Sets the atom features of the molecule.
- Parameters
atom_features – A 1D numpy array of features for the molecule.
- set_bond_features(bond_features: ndarray) None [source]¶
Sets the bond features of the molecule.
- Parameters
bond_features – A 1D numpy array of features for the molecule.
- class chemprop.data.data.MoleculeDataset(data: List[MoleculeDatapoint])[source]¶
A MoleculeDataset contains a list of MoleculeDatapoints with access to their attributes.
- Parameters
data – A list of MoleculeDatapoints.
- atom_descriptors() List[ndarray] [source]¶
Returns the atom descriptors associated with each molecule (if they exist).
- Returns
A list of 2D numpy arrays containing the atom descriptors for each molecule or None if there are no features.
- atom_descriptors_size() int [source]¶
Returns the size of custom additional atom descriptors vector associated with the molecules.
- Returns
The size of the additional atom descriptor vector.
- atom_features() List[ndarray] [source]¶
Returns the atom features associated with each molecule (if they exist).
- Returns
A list of 2D numpy arrays containing the atom features for each molecule or None if there are no features.
- atom_features_size() int [source]¶
Returns the size of custom additional atom features vector associated with the molecules.
- Returns
The size of the additional atom feature vector.
- batch_graph() List[BatchMolGraph] [source]¶
Constructs a BatchMolGraph with the graph featurization of all the molecules.
Note: The BatchMolGraph is cached after the first time it is computed and is simply accessed upon subsequent calls to batch_graph(). This means that if the underlying set of MoleculeDatapoints changes, then the returned BatchMolGraph will be incorrect for the underlying data.
- Returns
A list of BatchMolGraph containing the graph featurization of all the molecules in each MoleculeDatapoint.
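The caching caveat above can be illustrated with a small standard-library sketch. CachedDataset and batch_summary are hypothetical names invented for this illustration and are not part of chemprop; the list of string lengths merely stands in for an expensive object such as a BatchMolGraph.

```python
from typing import List, Optional


class CachedDataset:
    """Toy stand-in (hypothetical) for a dataset that caches a derived object."""

    def __init__(self, items: List[str]) -> None:
        self.items = items
        self._cache: Optional[List[int]] = None

    def batch_summary(self) -> List[int]:
        # Compute once, then return the same cached object on later calls.
        if self._cache is None:
            self._cache = [len(item) for item in self.items]
        return self._cache


dataset = CachedDataset(["CCO", "c1ccccc1"])
first = dataset.batch_summary()
dataset.items.append("CC")          # the underlying data changes...
second = dataset.batch_summary()    # ...but the cached result is returned as-is
```

After the append, the cached summary still describes two items while the dataset holds three, which is the kind of mismatch the note warns about.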
- bond_features() List[ndarray] [source]¶
Returns the bond features associated with each molecule (if they exist).
- Returns
A list of 2D numpy arrays containing the bond features for each molecule or None if there are no features.
- bond_features_size() int [source]¶
Returns the size of custom additional bond features vector associated with the molecules.
- Returns
The size of the additional bond feature vector.
- features() List[ndarray] [source]¶
Returns the features associated with each molecule (if they exist).
- Returns
A list of 1D numpy arrays containing the features for each molecule or None if there are no features.
- features_size() int [source]¶
Returns the size of the additional features vector associated with the molecules.
- Returns
The size of the additional features vector.
- gt_targets() List[ndarray] [source]¶
Returns indications of whether the targets associated with each molecule are greater-than inequalities.
- Returns
A list of lists of booleans indicating whether the targets in those positions are greater-than inequality targets.
- lt_targets() List[ndarray] [source]¶
Returns indications of whether the targets associated with each molecule are less-than inequalities.
- Returns
A list of lists of booleans indicating whether the targets in those positions are less-than inequality targets.
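As a rough illustration of how value targets and the two inequality flags relate, the sketch below splits raw target strings of the form ">x", "<x", or "x" into a value plus greater-than and less-than booleans. The helper parse_inequality_target is hypothetical; the parsing chemprop actually performs is not shown in this reference.

```python
from typing import List, Optional, Tuple


def parse_inequality_target(raw: str) -> Tuple[Optional[float], bool, bool]:
    """Hypothetical helper: return (value, is_greater_than, is_less_than)."""
    text = raw.strip()
    if text == "":
        return None, False, False  # unknown target value
    if text.startswith(">"):
        return float(text[1:]), True, False
    if text.startswith("<"):
        return float(text[1:]), False, True
    return float(text), False, False


row = [">5.2", "<1.0", "3.3", ""]
parsed = [parse_inequality_target(cell) for cell in row]
targets: List[Optional[float]] = [value for value, _, _ in parsed]
gt_targets: List[bool] = [gt for _, gt, _ in parsed]
lt_targets: List[bool] = [lt for _, _, lt in parsed]
```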
- mask() List[List[bool]] [source]¶
Returns whether the targets associated with each molecule and task are present.
- Returns
A list of lists of booleans associated with targets.
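A minimal sketch of such a mask, assuming that a missing target is represented by None as in the targets descriptions on this page. The function name target_mask is an illustrative choice, not a chemprop function.

```python
from typing import List, Optional


def target_mask(targets: List[List[Optional[float]]]) -> List[List[bool]]:
    """True where a target value is present, False where it is None."""
    return [[value is not None for value in row] for row in targets]


targets = [[1.0, None], [None, 0.0], [0.5, 2.0]]
mask = target_mask(targets)
```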
- mols(flatten: bool = False) Union[List[Mol], List[List[Mol]], List[Tuple[Mol, Mol]], List[List[Tuple[Mol, Mol]]]] [source]¶
Returns a list of the RDKit molecules associated with each MoleculeDatapoint.
- Parameters
flatten – Whether to flatten the returned RDKit molecules to a list instead of a list of lists.
- Returns
A list of RDKit molecules or a list of lists of RDKit molecules, depending on flatten.
- normalize_features(scaler: Optional[StandardScaler] = None, replace_nan_token: int = 0, scale_atom_descriptors: bool = False, scale_bond_features: bool = False) StandardScaler [source]¶
Normalizes the features of the dataset using a StandardScaler.
The StandardScaler subtracts the mean and divides by the standard deviation for each feature independently.
If a StandardScaler is provided, it is used to perform the normalization. Otherwise, a StandardScaler is first fit to the features in this dataset and is then used to perform the normalization.
- Parameters
scaler – A fitted StandardScaler. If it is provided it is used, otherwise a new StandardScaler is first fitted to this data and is then used.
replace_nan_token – A token to use to replace NaN entries in the features.
scale_atom_descriptors – Whether the features to be scaled are the atom descriptors rather than the molecule features.
scale_bond_features – Whether the features to be scaled are the bond features rather than the molecule features.
- Returns
A fitted StandardScaler. If a StandardScaler is provided as a parameter, this is the same StandardScaler. Otherwise, this is a new StandardScaler that has been fit on this dataset.
- normalize_targets() StandardScaler [source]¶
Normalizes the targets of the dataset using a StandardScaler.
The StandardScaler subtracts the mean and divides by the standard deviation for each task independently.
This should only be used for regression datasets.
- Returns
A StandardScaler fitted to the targets.
- property number_of_molecules: int¶
Gets the number of molecules in each MoleculeDatapoint.
- Returns
The number of molecules.
- phase_features() List[ndarray] [source]¶
Returns the phase features associated with each molecule (if they exist).
- Returns
A list of 1D numpy arrays containing the phase features for each molecule or None if there are no features.
- reset_features_and_targets() None [source]¶
Resets the features (atom, bond, and molecule) and targets to their raw values.
- set_targets(targets: List[List[Optional[float]]]) None [source]¶
Sets the targets for each molecule in the dataset. Assumes the targets are aligned with the datapoints.
- Parameters
targets – A list of lists of floats (or None) containing targets for each molecule. This must be the same length as the underlying dataset.
- smiles(flatten: bool = False) Union[List[str], List[List[str]]] [source]¶
Returns a list containing the SMILES list associated with each MoleculeDatapoint.
- Parameters
flatten – Whether to flatten the returned SMILES to a list instead of a list of lists.
- Returns
A list of SMILES or a list of lists of SMILES, depending on flatten.
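The effect of flatten can be sketched with plain lists. The helper flatten_smiles is a hypothetical name, and the nested list below stands in for the per-datapoint SMILES lists.

```python
from typing import List


def flatten_smiles(nested: List[List[str]]) -> List[str]:
    """Collapse a list of per-datapoint SMILES lists into one flat list."""
    return [smiles for row in nested for smiles in row]


nested = [["CCO", "O"], ["c1ccccc1", "CC"]]  # two datapoints, two molecules each
flat = flatten_smiles(nested)
```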
- class chemprop.data.data.MoleculeSampler(dataset: MoleculeDataset, class_balance: bool = False, shuffle: bool = False, seed: int = 0)[source]¶
A MoleculeSampler samples data from a MoleculeDataset for a MoleculeDataLoader.
- Parameters
class_balance – Whether to perform class balancing (i.e., use an equal number of positive and negative molecules). Set shuffle to True in order to get a random subset of the larger class.
shuffle – Whether to shuffle the data.
seed – Random seed. Only needed if shuffle is True.
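The class balancing described above (equal numbers of positive and negative molecules, with shuffle giving a random subset of the larger class) can be sketched as follows. This is a standard-library illustration under the assumption of binary 0/1 labels; balanced_indices is a hypothetical helper, not the sampler's actual implementation.

```python
import random
from typing import List


def balanced_indices(labels: List[int], shuffle: bool = False, seed: int = 0) -> List[int]:
    """Return indices with equally many positives and negatives, interleaved."""
    rng = random.Random(seed)
    positives = [i for i, label in enumerate(labels) if label == 1]
    negatives = [i for i, label in enumerate(labels) if label == 0]
    if shuffle:
        # Shuffling first makes the kept part of the larger class a random subset.
        rng.shuffle(positives)
        rng.shuffle(negatives)
    n = min(len(positives), len(negatives))
    # Alternate positive and negative indices; the larger class is truncated to n.
    return [i for pair in zip(positives[:n], negatives[:n]) for i in pair]


labels = [1, 0, 0, 0, 1, 0]
ordered = balanced_indices(labels)
shuffled = balanced_indices(labels, shuffle=True, seed=0)
```

Without shuffling, the same leading members of the larger class are selected on every pass, which is why the description recommends setting shuffle to True.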
- chemprop.data.data.construct_molecule_batch(data: List[MoleculeDatapoint]) MoleculeDataset [source]¶
Constructs a MoleculeDataset from a list of MoleculeDatapoints.
Additionally, precomputes the BatchMolGraph for the constructed MoleculeDataset.
- Parameters
data – A list of MoleculeDatapoints.
- Returns
A MoleculeDataset containing all the MoleculeDatapoints.
- chemprop.data.data.make_mols(smiles: List[str], reaction_list: List[bool], keep_h_list: List[bool], add_h_list: List[bool])[source]¶
Builds a list of RDKit molecules (or a list of tuples of molecules if reaction is True) for a list of smiles.
- Parameters
smiles – List of SMILES strings.
reaction_list – List of booleans indicating whether each SMILES string is to be treated as a reaction.
keep_h_list – List of booleans indicating whether to keep hydrogens in the input SMILES. This does not add hydrogens; it only keeps them if they are specified.
add_h_list – List of booleans indicating whether to add hydrogens to the input SMILES.
- Returns
List of RDKit molecules or list of tuples of molecules.
Scaffold¶
Classes and functions from chemprop.data.scaffold.py.
- chemprop.data.scaffold.generate_scaffold(mol: Union[str, Mol, Tuple[Mol, Mol]], include_chirality: bool = False) str [source]¶
Computes the Bemis-Murcko scaffold for a SMILES string.
- Parameters
mol – A SMILES or an RDKit molecule.
include_chirality – Whether to include chirality in the computed scaffold.
- Returns
The Bemis-Murcko scaffold for the molecule.
- chemprop.data.scaffold.log_scaffold_stats(data: MoleculeDataset, index_sets: List[Set[int]], num_scaffolds: int = 10, num_labels: int = 20, logger: Optional[Logger] = None) List[Tuple[List[float], List[int]]] [source]¶
Logs and returns statistics about counts and average target values in molecular scaffolds.
- Parameters
data – A MoleculeDataset.
index_sets – A list of sets of indices representing splits of the data.
num_scaffolds – The number of scaffolds about which to display statistics.
num_labels – The number of labels about which to display statistics.
logger – A logger for recording output.
- Returns
A list of tuples where each tuple contains a list of average target values across the first num_labels labels and a list of the number of non-zero values for the first num_scaffolds scaffolds, sorted in decreasing order of scaffold frequency.
- chemprop.data.scaffold.scaffold_split(data: MoleculeDataset, sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1), balanced: bool = False, key_molecule_index: int = 0, seed: int = 0, logger: Optional[Logger] = None) Tuple[MoleculeDataset, MoleculeDataset, MoleculeDataset] [source]¶
Splits a MoleculeDataset by scaffold so that no molecules sharing a scaffold are in different splits.
- Parameters
data – A MoleculeDataset.
sizes – A length-3 tuple with the proportions of data in the train, validation, and test sets.
balanced – Whether to balance the sizes of scaffolds in each set rather than putting the smallest in test set.
key_molecule_index – For data with multiple molecules, this sets which molecule will be considered during splitting.
seed – Random seed for shuffling when doing balanced splitting.
logger – A logger for recording output.
- Returns
A tuple of MoleculeDatasets containing the train, validation, and test splits of the data.
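One way to keep every scaffold inside a single split is to assign whole scaffold groups greedily, largest first. The sketch below shows that idea for the unbalanced case using only the standard library. split_by_group is a hypothetical helper, the scaffold strings are supplied by hand rather than computed, and the real function may differ in its details.

```python
from typing import Dict, List, Set, Tuple


def split_by_group(
    group_to_indices: Dict[str, Set[int]],
    sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1),
) -> Tuple[List[int], List[int], List[int]]:
    """Assign whole groups to train/val/test so that no group is divided."""
    total = sum(len(indices) for indices in group_to_indices.values())
    train_size, val_size = sizes[0] * total, sizes[1] * total
    train: List[int] = []
    val: List[int] = []
    test: List[int] = []
    # Largest groups first; ties broken by smallest index for determinism.
    groups = sorted(group_to_indices.values(), key=lambda g: (-len(g), min(g)))
    for group in groups:
        members = sorted(group)
        if len(train) + len(members) <= train_size:
            train += members
        elif len(val) + len(members) <= val_size:
            val += members
        else:
            test += members
    return train, val, test


# Hand-written scaffold -> datapoint indices mapping (illustrative data only).
scaffolds = {
    "c1ccccc1": {0, 1, 2, 3, 4, 5},
    "C1CCCCC1": {6, 7},
    "c1ccncc1": {8},
    "": {9},
}
train, val, test = split_by_group(scaffolds, sizes=(0.8, 0.1, 0.1))
```

Because groups are never divided, the realized split sizes only approximate the requested proportions when scaffold groups are large.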
- chemprop.data.scaffold.scaffold_to_smiles(mols: Union[List[str], List[Mol], List[Tuple[Mol, Mol]]], use_indices: bool = False) Dict[str, Union[Set[str], Set[int]]] [source]¶
Computes the scaffold for each SMILES and returns a mapping from scaffolds to sets of smiles (or indices).
- Parameters
mols – A list of SMILES or RDKit molecules.
use_indices – Whether to map to the SMILES’s index in mols rather than mapping to the smiles string itself. This is necessary if there are duplicate smiles.
- Returns
A dictionary mapping each unique scaffold to all SMILES (or indices) which have that scaffold.
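The reason use_indices matters for duplicate SMILES can be seen in a small sketch. Here toy_scaffold is a hypothetical stand-in that only labels a string as ring-containing or acyclic; it is not a Bemis-Murcko scaffold computation, and group_by_key is likewise an illustrative name.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Union


def group_by_key(
    items: List[str], key: Callable[[str], str], use_indices: bool = False
) -> Dict[str, Union[Set[str], Set[int]]]:
    """Map each key to the set of items (or item indices) that share it."""
    groups = defaultdict(set)
    for index, item in enumerate(items):
        groups[key(item)].add(index if use_indices else item)
    return dict(groups)


def toy_scaffold(smiles: str) -> str:
    """Hypothetical stand-in for a scaffold function (ring digit present or not)."""
    return "ring" if "1" in smiles else "acyclic"


smiles = ["CCO", "c1ccccc1", "CCO", "Cc1ccccc1"]
by_smiles = group_by_key(smiles, toy_scaffold)
by_index = group_by_key(smiles, toy_scaffold, use_indices=True)
```

The duplicate "CCO" collapses to one entry in by_smiles but keeps both positions in by_index.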
Scaler¶
Classes and functions from chemprop.data.scaler.py.
- class chemprop.data.scaler.StandardScaler(means: Optional[ndarray] = None, stds: Optional[ndarray] = None, replace_nan_token: Optional[Any] = None)[source]¶
A StandardScaler normalizes the features of a dataset.
When it is fit on a dataset, the StandardScaler learns the mean and standard deviation across the 0th axis. When transforming a dataset, the StandardScaler subtracts the means and divides by the standard deviations.
- Parameters
means – An optional 1D numpy array of precomputed means.
stds – An optional 1D numpy array of precomputed standard deviations.
replace_nan_token – A token to use to replace NaN entries in the features.
- fit(X: List[List[Optional[float]]]) StandardScaler [source]¶
Learns means and standard deviations across the 0th axis of the data X.
- Parameters
X – A list of lists of floats (or None).
- Returns
The fitted StandardScaler (self).
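The fit and transform steps described for the scaler can be sketched without numpy. The functions below are hypothetical stand-ins, not the class's implementation. They assume the population standard deviation, a fallback of 1.0 for zero-variance columns, and at least one known value per column.

```python
import math
from typing import Any, List, Optional, Tuple


def fit(X: List[List[Optional[float]]]) -> Tuple[List[float], List[float]]:
    """Per-column mean and population standard deviation, ignoring None."""
    means, stds = [], []
    for column in zip(*X):  # iterate over the 0th axis column by column
        values = [v for v in column if v is not None]
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        means.append(mean)
        stds.append(std if std > 0 else 1.0)  # assumption: avoid dividing by zero
    return means, stds


def transform(X, means, stds, replace_nan_token: Any = None):
    """Subtract the means and divide by the stds; missing entries get the token."""
    return [
        [replace_nan_token if v is None else (v - m) / s
         for v, m, s in zip(row, means, stds)]
        for row in X
    ]


X = [[1.0, 10.0], [3.0, None], [5.0, 10.0]]
means, stds = fit(X)
scaled = transform(X, means, stds, replace_nan_token=0.0)
```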
Utils¶
Classes and functions from chemprop.data.utils.py.
- chemprop.data.utils.filter_invalid_smiles(data: MoleculeDataset) MoleculeDataset [source]¶
Filters out invalid SMILES.
- Parameters
data – A MoleculeDataset.
- Returns
A MoleculeDataset with only the valid molecules.
- chemprop.data.utils.get_class_sizes(data: MoleculeDataset, proportion: bool = True) List[List[float]] [source]¶
Determines the proportions of the different classes in a classification dataset.
- Parameters
data – A classification MoleculeDataset.
proportion – Choice of whether to return proportions for class size or counts.
- Returns
A list of lists of class proportions. Each inner list contains the class proportions for a task.
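A minimal sketch of per-task class sizes, assuming binary 0/1 targets with None for unknown values. class_sizes is a hypothetical helper that works on nested lists rather than a MoleculeDataset, and it assumes each task has at least one known target.

```python
from typing import List, Optional


def class_sizes(
    targets: List[List[Optional[float]]], proportion: bool = True
) -> List[List[float]]:
    """For each task, return [size of class 0, size of class 1]."""
    result = []
    for task in range(len(targets[0])):
        known = [row[task] for row in targets if row[task] is not None]
        counts = [sum(1 for t in known if t == 0), sum(1 for t in known if t == 1)]
        if proportion:
            result.append([count / len(known) for count in counts])
        else:
            result.append([float(count) for count in counts])
    return result


targets = [[1, 0], [0, None], [0, 0], [0, 1]]  # four molecules, two tasks
proportions = class_sizes(targets)
counts = class_sizes(targets, proportion=False)
```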
- chemprop.data.utils.get_data(path: str, smiles_columns: Optional[Union[str, List[str]]] = None, target_columns: Optional[List[str]] = None, ignore_columns: Optional[List[str]] = None, skip_invalid_smiles: bool = True, args: Optional[Union[TrainArgs, PredictArgs]] = None, data_weights_path: Optional[str] = None, features_path: Optional[List[str]] = None, features_generator: Optional[List[str]] = None, phase_features_path: Optional[str] = None, atom_descriptors_path: Optional[str] = None, bond_features_path: Optional[str] = None, max_data_size: Optional[int] = None, store_row: bool = False, logger: Optional[Logger] = None, loss_function: Optional[str] = None, skip_none_targets: bool = False) MoleculeDataset [source]¶
Gets SMILES and target values from a CSV file.
- Parameters
path – Path to a CSV file.
smiles_columns – The names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
target_columns – Name of the columns containing target values. By default, uses all columns except the smiles_columns and the ignore_columns.
ignore_columns – Name of the columns to ignore when target_columns is not provided.
skip_invalid_smiles – Whether to skip and filter out invalid smiles using filter_invalid_smiles().
args – Arguments, either TrainArgs or PredictArgs.
data_weights_path – A path to a file containing weights for each molecule in the loss function.
features_path – A list of paths to files containing features. If provided, it is used in place of args.features_path.
features_generator – A list of features generators to use. If provided, it is used in place of args.features_generator.
phase_features_path – A path to a file containing phase features as applicable to spectra.
atom_descriptors_path – The path to the file containing the custom atom descriptors.
bond_features_path – The path to the file containing the custom bond features.
max_data_size – The maximum number of data points to load.
logger – A logger for recording output.
store_row – Whether to store the raw CSV row in each MoleculeDatapoint.
skip_none_targets – Whether to skip targets that are all ‘None’. This is mostly relevant when --target_columns are passed in, so only a subset of tasks are examined.
loss_function – The loss function to be used in training.
- Returns
A MoleculeDataset containing SMILES and target values along with other info such as additional features when desired.
- chemprop.data.utils.get_data_from_smiles(smiles: List[List[str]], skip_invalid_smiles: bool = True, logger: Optional[Logger] = None, features_generator: Optional[List[str]] = None) MoleculeDataset [source]¶
Converts a list of SMILES to a MoleculeDataset.
- Parameters
smiles – A list of lists of SMILES with length depending on the number of molecules.
skip_invalid_smiles – Whether to skip and filter out invalid smiles using filter_invalid_smiles().
logger – A logger for recording output.
features_generator – List of features generators.
- Returns
A MoleculeDataset with all of the provided SMILES.
- chemprop.data.utils.get_data_weights(path: str) List[float] [source]¶
Returns the list of data weights for the loss function as stored in a CSV file.
- Parameters
path – Path to a CSV file.
- Returns
A list of floats containing the data weights.
- chemprop.data.utils.get_header(path: str) List[str] [source]¶
Returns the header of a data CSV file.
- Parameters
path – Path to a CSV file.
- Returns
A list of strings containing the strings in the comma-separated header.
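Reading a comma-separated header can be sketched with the standard csv module. read_header is a hypothetical helper that takes an open text handle, so the example can use an in-memory string instead of a file path.

```python
import csv
import io
from typing import List, TextIO


def read_header(handle: TextIO) -> List[str]:
    """Return the first row of a CSV stream as a list of column names."""
    return next(csv.reader(handle))


csv_text = "smiles,logP,solubility\nCCO,-0.31,1.2\n"  # illustrative data only
header = read_header(io.StringIO(csv_text))
```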
- chemprop.data.utils.get_inequality_targets(path: str, target_columns: Optional[List[str]] = None) List[str] [source]¶
- chemprop.data.utils.get_invalid_smiles_from_file(path: Optional[str] = None, smiles_columns: Optional[Union[str, List[str]]] = None, header: bool = True, reaction: bool = False) Union[List[str], List[List[str]]] [source]¶
Returns the invalid SMILES from a data CSV file.
- Parameters
path – Path to a CSV file.
smiles_columns – A list of the names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
header – Whether the CSV file contains a header.
reaction – Boolean whether the SMILES strings are to be treated as a reaction.
- Returns
A list of lists of SMILES, for the invalid SMILES in the file.
- chemprop.data.utils.get_invalid_smiles_from_list(smiles: List[List[str]], reaction: bool = False) List[List[str]] [source]¶
Returns the invalid SMILES from a list of lists of SMILES strings.
- Parameters
smiles – A list of lists of SMILES.
reaction – Boolean whether the SMILES strings are to be treated as a reaction.
- Returns
A list of lists of SMILES, for the invalid SMILES among the lists provided.
- chemprop.data.utils.get_smiles(path: str, smiles_columns: Optional[Union[str, List[str]]] = None, number_of_molecules: int = 1, header: bool = True, flatten: bool = False) Union[List[str], List[List[str]]] [source]¶
Returns the SMILES from a data CSV file.
- Parameters
path – Path to a CSV file.
smiles_columns – A list of the names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
number_of_molecules – The number of molecules for each data point. Not necessary if the names of smiles columns are previously processed.
header – Whether the CSV file contains a header.
flatten – Whether to flatten the returned SMILES to a list instead of a list of lists.
- Returns
A list of SMILES or a list of lists of SMILES, depending on flatten.
- chemprop.data.utils.get_task_names(path: str, smiles_columns: Optional[Union[str, List[str]]] = None, target_columns: Optional[List[str]] = None, ignore_columns: Optional[List[str]] = None) List[str] [source]¶
Gets the task names from a data CSV file.
If target_columns is provided, returns target_columns. Otherwise, returns all columns except the smiles_columns (or the first column, if the smiles_columns is None) and the ignore_columns.
- Parameters
path – Path to a CSV file.
smiles_columns – The names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
target_columns – Name of the columns containing target values. By default, uses all columns except the smiles_columns and the ignore_columns.
ignore_columns – Name of the columns to ignore when target_columns is not provided.
- Returns
A list of task names.
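The selection rule described above can be sketched on an already-parsed header. task_names here is a hypothetical helper that takes the header list directly instead of a file path.

```python
from typing import List, Optional, Union


def task_names(
    header: List[str],
    smiles_columns: Optional[Union[str, List[str]]] = None,
    target_columns: Optional[List[str]] = None,
    ignore_columns: Optional[List[str]] = None,
) -> List[str]:
    """Return target_columns if given, else all non-SMILES, non-ignored columns."""
    if target_columns is not None:
        return list(target_columns)
    if smiles_columns is None:
        smiles_columns = header[:1]  # fall back to the first column
    elif isinstance(smiles_columns, str):
        smiles_columns = [smiles_columns]
    excluded = set(smiles_columns) | set(ignore_columns or [])
    return [name for name in header if name not in excluded]


header = ["smiles", "id", "logP", "solubility"]  # illustrative column names
default_tasks = task_names(header, ignore_columns=["id"])
explicit_tasks = task_names(header, target_columns=["logP"])
```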
- chemprop.data.utils.preprocess_smiles_columns(path: str, smiles_columns: Optional[Union[str, List[str]]] = None, number_of_molecules: int = 1) List[str] [source]¶
Preprocesses the smiles_columns variable to ensure that it is a list of column headings corresponding to the columns in the data file holding SMILES. Assumes file has a header.
- Parameters
path – Path to a CSV file.
smiles_columns – The names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
number_of_molecules – The number of molecules with associated SMILES for each data point.
- Returns
The preprocessed version of smiles_columns which is guaranteed to be a list.
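The normalization to a list can be sketched as below. normalize_smiles_columns is a hypothetical helper that works on a parsed header and omits any validation the real function may perform.

```python
from typing import List, Optional, Union


def normalize_smiles_columns(
    header: List[str],
    smiles_columns: Optional[Union[str, List[str]]] = None,
    number_of_molecules: int = 1,
) -> List[str]:
    """Always return a list of column names holding SMILES."""
    if smiles_columns is None:
        # Default: the first number_of_molecules columns of the header.
        return header[:number_of_molecules]
    if isinstance(smiles_columns, str):
        return [smiles_columns]
    return list(smiles_columns)


header = ["solvent", "solute", "energy"]  # illustrative column names
default_columns = normalize_smiles_columns(header, number_of_molecules=2)
single_column = normalize_smiles_columns(header, smiles_columns="solute")
```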
- chemprop.data.utils.split_data(data: MoleculeDataset, split_type: str = 'random', sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1), key_molecule_index: int = 0, seed: int = 0, num_folds: int = 1, args: Optional[TrainArgs] = None, logger: Optional[Logger] = None) Tuple[MoleculeDataset, MoleculeDataset, MoleculeDataset] [source]¶
Splits data into training, validation, and test splits.
- Parameters
data – A MoleculeDataset.
split_type – Split type.
sizes – A length-3 tuple with the proportions of data in the train, validation, and test sets.
key_molecule_index – For data with multiple molecules, this sets which molecule will be considered during splitting.
seed – The random seed to use before shuffling data.
num_folds – Number of folds to create (only needed for “cv” split type).
args – A TrainArgs object.
logger – A logger for recording output.
- Returns
A tuple of MoleculeDatasets containing the train, validation, and test splits of the data.
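For the default 'random' split type, the role of sizes and seed can be sketched by shuffling indices with a seeded generator and cutting at the requested proportions. random_split is a hypothetical helper operating on indices, not the function documented here.

```python
import random
from typing import List, Tuple


def random_split(
    n: int, sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1), seed: int = 0
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices 0..n-1 with a seed and cut them into three parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    train_end = int(sizes[0] * n)
    val_end = int((sizes[0] + sizes[1]) * n)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train, val, test = random_split(10)
```

Using the same seed reproduces the same partition, which keeps repeated runs comparable.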
- chemprop.data.utils.validate_data(data_path: str) Set[str] [source]¶
Validates a data CSV file, returning a set of errors.
- Parameters
data_path – Path to a data CSV file.
- Returns
A set of error messages.
- chemprop.data.utils.validate_dataset_type(data: MoleculeDataset, dataset_type: str) None [source]¶
Validates the dataset type to ensure the data matches the provided type.
- Parameters
data – A MoleculeDataset.
dataset_type – The dataset type to check.