Data

chemprop.data contains functions and classes for loading, containing, and splitting data.

Data

Classes and functions from chemprop.data.data.py.

class chemprop.data.data.MoleculeDataLoader(dataset: chemprop.data.data.MoleculeDataset, batch_size: int = 50, num_workers: int = 8, class_balance: bool = False, shuffle: bool = False, seed: int = 0)[source]

A MoleculeDataLoader is a PyTorch DataLoader for loading a MoleculeDataset.

Parameters
  • dataset – The MoleculeDataset containing the molecules to load.

  • batch_size – Batch size.

  • num_workers – Number of workers used to build batches.

  • class_balance – Whether to perform class balancing (i.e., use an equal number of positive and negative molecules). Class balancing is only available for single-task classification datasets. Set shuffle to True to get a random subset of the larger class.

  • shuffle – Whether to shuffle the data.

  • seed – Random seed. Only needed if shuffle is True.
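The interaction between class_balance and shuffle can be illustrated with a small, library-independent sketch. The function balanced_indices and its arguments are hypothetical and are not part of chemprop; the sketch assumes binary labels (0 or 1) for a single task and only shows the idea of drawing an equal number of positives and negatives.

```python
import random
from typing import List


def balanced_indices(labels: List[int], shuffle: bool = False, seed: int = 0) -> List[int]:
    """Return indices with equal numbers of positive (1) and negative (0) labels.

    Hypothetical helper for illustration; not the chemprop implementation.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]

    if shuffle:
        # Shuffling first makes the kept subset of the larger class random.
        rng = random.Random(seed)
        rng.shuffle(positives)
        rng.shuffle(negatives)

    # Keep only as many examples per class as the smaller class provides.
    n = min(len(positives), len(negatives))
    # Interleave the two classes so both appear throughout the ordering.
    return [i for pair in zip(positives[:n], negatives[:n]) for i in pair]


labels = [1, 0, 0, 0, 1, 0]
indices = balanced_indices(labels)
shuffled_indices = balanced_indices(labels, shuffle=True, seed=1)
```

Without shuffling, this sketch always keeps the same leading examples of the larger class, which is why a random subset requires shuffle to be True.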

property iter_size: int

Returns the number of data points included in each full iteration through the MoleculeDataLoader.

property targets: List[List[Optional[float]]]

Returns the targets associated with each molecule.

Returns

A list of lists of floats (or None) containing the targets.

class chemprop.data.data.MoleculeDatapoint(smiles: List[str], targets: Optional[List[Optional[float]]] = None, row: Optional[collections.OrderedDict] = None, data_weight: float = 1, features: Optional[numpy.ndarray] = None, features_generator: Optional[List[str]] = None, phase_features: Optional[List[float]] = None, atom_features: Optional[numpy.ndarray] = None, atom_descriptors: Optional[numpy.ndarray] = None, bond_features: Optional[numpy.ndarray] = None, overwrite_default_atom_features: bool = False, overwrite_default_bond_features: bool = False)[source]

A MoleculeDatapoint contains a single molecule and its associated features and targets.

Parameters
  • smiles – A list of the SMILES strings for the molecules.

  • targets – A list of targets for the molecule (contains None for unknown target values).

  • row – The raw CSV row containing the information for this molecule.

  • data_weight – Weighting of the datapoint for the loss function.

  • features – A numpy array containing additional features (e.g., Morgan fingerprint).

  • features_generator – A list of features generators to use.

  • phase_features – A one-hot vector indicating the phase of the data, as used in spectra data.

  • atom_features – A numpy array containing additional atom features to featurize the molecule.

  • atom_descriptors – A numpy array containing additional atom descriptors to featurize the molecule.

  • bond_features – A numpy array containing additional bond features to featurize the molecule.

  • overwrite_default_atom_features – Whether to overwrite the default atom features with atom_features.

  • overwrite_default_bond_features – Whether to overwrite the default bond features with bond_features.

extend_features(features: numpy.ndarray) None[source]

Extends the features of the molecule.

Parameters

features – A 1D numpy array of extra features for the molecule.

property mol: Union[List[rdkit.Chem.rdchem.Mol], List[Tuple[rdkit.Chem.rdchem.Mol, rdkit.Chem.rdchem.Mol]]]

Gets the list of RDKit molecules corresponding to the SMILES list.

num_tasks() int[source]

Returns the number of prediction tasks.

Returns

The number of tasks.

property number_of_molecules: int

Gets the number of molecules in the MoleculeDatapoint.

Returns

The number of molecules.

reset_features_and_targets() None[source]

Resets the features (atom, bond, and molecule) and targets to their raw values.
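One way to picture resetting to raw values is a datapoint that keeps untouched copies of its original features and targets. The class ToyDatapoint below is a hypothetical stand-in using plain lists; it is a sketch of the pattern, not the chemprop class.

```python
from typing import List, Optional


class ToyDatapoint:
    """Hypothetical datapoint that remembers its raw features and targets."""

    def __init__(self, features: List[float], targets: List[Optional[float]]):
        # Keep untouched copies so later modifications can be undone.
        self._raw_features = list(features)
        self._raw_targets = list(targets)
        self.features = list(features)
        self.targets = list(targets)

    def extend_features(self, extra: List[float]) -> None:
        # Append extra features to the current feature vector.
        self.features = self.features + list(extra)

    def set_targets(self, targets: List[Optional[float]]) -> None:
        self.targets = list(targets)

    def reset_features_and_targets(self) -> None:
        # Restore the values the datapoint was created with.
        self.features = list(self._raw_features)
        self.targets = list(self._raw_targets)


point = ToyDatapoint(features=[0.1, 0.2], targets=[1.0, None])
point.extend_features([0.3])
point.set_targets([0.0, None])
modified = (list(point.features), list(point.targets))
point.reset_features_and_targets()
```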

set_atom_descriptors(atom_descriptors: numpy.ndarray) None[source]

Sets the atom descriptors of the molecule.

Parameters

atom_descriptors – A numpy array of atom descriptors for the molecule.

set_atom_features(atom_features: numpy.ndarray) None[source]

Sets the atom features of the molecule.

Parameters

atom_features – A numpy array of atom features for the molecule.

set_bond_features(bond_features: numpy.ndarray) None[source]

Sets the bond features of the molecule.

Parameters

bond_features – A numpy array of bond features for the molecule.

set_features(features: numpy.ndarray) None[source]

Sets the features of the molecule.

Parameters

features – A 1D numpy array of features for the molecule.

set_targets(targets: List[Optional[float]])[source]

Sets the targets of a molecule.

Parameters

targets – A list of floats containing the targets.

class chemprop.data.data.MoleculeDataset(data: List[chemprop.data.data.MoleculeDatapoint])[source]

A MoleculeDataset contains a list of MoleculeDatapoints with access to their attributes.

Parameters

data – A list of MoleculeDatapoints.

atom_descriptors() List[numpy.ndarray][source]

Returns the atom descriptors associated with each molecule (if they exist).

Returns

A list of 2D numpy arrays containing the atom descriptors for each molecule or None if there are no atom descriptors.

atom_descriptors_size() int[source]

Returns the size of the custom additional atom descriptors vector associated with the molecules.

Returns

The size of the additional atom descriptor vector.

atom_features() List[numpy.ndarray][source]

Returns the atom features associated with each molecule (if they exist).

Returns

A list of 2D numpy arrays containing the atom features for each molecule or None if there are no atom features.

atom_features_size() int[source]

Returns the size of the custom additional atom features vector associated with the molecules.

Returns

The size of the additional atom feature vector.

batch_graph() List[chemprop.features.featurization.BatchMolGraph][source]

Constructs a BatchMolGraph with the graph featurization of all the molecules.

Note

The BatchMolGraph is cached after the first time it is computed and is simply accessed upon subsequent calls to batch_graph(). This means that if the underlying set of MoleculeDatapoints changes, then the returned BatchMolGraph will be incorrect for the underlying data.
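The caching caveat in this note can be reproduced with a toy example. ToyDataset and batch_summary are hypothetical names, and string lengths stand in for the real graph featurization; the sketch only demonstrates how a compute-once cache goes stale when the underlying data changes.

```python
from typing import List, Optional


class ToyDataset:
    """Hypothetical dataset whose expensive summary is computed once and cached."""

    def __init__(self, smiles: List[str]):
        self.smiles = smiles
        self._cached: Optional[List[int]] = None

    def batch_summary(self) -> List[int]:
        # Stand-in for an expensive featurization: here, just string lengths.
        if self._cached is None:
            self._cached = [len(s) for s in self.smiles]
        return self._cached


dataset = ToyDataset(["CCO", "C"])
first = dataset.batch_summary()   # computed and cached
dataset.smiles.append("CCCC")     # the underlying data changes
stale = dataset.batch_summary()   # cached value no longer matches the data
```

In this sketch the second call still describes two molecules although the dataset now holds three, which is the kind of mismatch the note warns about.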

Returns

A list of BatchMolGraph containing the graph featurization of all the molecules in each MoleculeDatapoint.

bond_features() List[numpy.ndarray][source]

Returns the bond features associated with each molecule (if they exist).

Returns

A list of 2D numpy arrays containing the bond features for each molecule or None if there are no bond features.

bond_features_size() int[source]

Returns the size of the custom additional bond features vector associated with the molecules.

Returns

The size of the additional bond feature vector.

data_weights() List[float][source]

Returns the loss weighting associated with each molecule.

features() List[numpy.ndarray][source]

Returns the features associated with each molecule (if they exist).

Returns

A list of 1D numpy arrays containing the features for each molecule or None if there are no features.

features_size() int[source]

Returns the size of the additional features vector associated with the molecules.

Returns

The size of the additional features vector.

mols(flatten: bool = False) Union[List[rdkit.Chem.rdchem.Mol], List[List[rdkit.Chem.rdchem.Mol]], List[Tuple[rdkit.Chem.rdchem.Mol, rdkit.Chem.rdchem.Mol]], List[List[Tuple[rdkit.Chem.rdchem.Mol, rdkit.Chem.rdchem.Mol]]]][source]

Returns a list of the RDKit molecules associated with each MoleculeDatapoint.

Parameters

flatten – Whether to flatten the returned RDKit molecules to a list instead of a list of lists.

Returns

A list of RDKit molecules or a list of lists of RDKit molecules, depending on flatten.

normalize_features(scaler: Optional[chemprop.data.scaler.StandardScaler] = None, replace_nan_token: int = 0, scale_atom_descriptors: bool = False, scale_bond_features: bool = False) chemprop.data.scaler.StandardScaler[source]

Normalizes the features of the dataset using a StandardScaler.

The StandardScaler subtracts the mean and divides by the standard deviation for each feature independently.

If a StandardScaler is provided, it is used to perform the normalization. Otherwise, a StandardScaler is first fit to the features in this dataset and is then used to perform the normalization.

Parameters
  • scaler – A fitted StandardScaler. If it is provided it is used, otherwise a new StandardScaler is first fitted to this data and is then used.

  • replace_nan_token – A token to use to replace NaN entries in the features.

  • scale_atom_descriptors – Whether the features to be scaled are the atom descriptors rather than the molecule-level features.

  • scale_bond_features – Whether the features to be scaled are the bond features rather than the molecule-level features.

Returns

A fitted StandardScaler. If a StandardScaler is provided as a parameter, this is the same StandardScaler. Otherwise, this is a new StandardScaler that has been fit on this dataset.

normalize_targets() chemprop.data.scaler.StandardScaler[source]

Normalizes the targets of the dataset using a StandardScaler.

The StandardScaler subtracts the mean and divides by the standard deviation for each task independently.

This should only be used for regression datasets.

Returns

A StandardScaler fitted to the targets.

num_tasks() int[source]

Returns the number of prediction tasks.

Returns

The number of tasks.

property number_of_molecules: int

Gets the number of molecules in each MoleculeDatapoint.

Returns

The number of molecules.

phase_features() List[numpy.ndarray][source]

Returns the phase features associated with each molecule (if they exist).

Returns

A list of 1D numpy arrays containing the phase features for each molecule or None if there are no features.

reset_features_and_targets() None[source]

Resets the features (atom, bond, and molecule) and targets to their raw values.

set_targets(targets: List[List[Optional[float]]]) None[source]

Sets the targets for each molecule in the dataset. Assumes the targets are aligned with the datapoints.

Parameters

targets – A list of lists of floats (or None) containing targets for each molecule. This must be the same length as the underlying dataset.

smiles(flatten: bool = False) Union[List[str], List[List[str]]][source]

Returns a list containing the SMILES list associated with each MoleculeDatapoint.

Parameters

flatten – Whether to flatten the returned SMILES to a list instead of a list of lists.

Returns

A list of SMILES or a list of lists of SMILES, depending on flatten.
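The effect of flatten can be shown with plain lists. The helper collect_smiles is a hypothetical illustration in which each inner list plays the role of the SMILES list of one datapoint.

```python
from typing import List, Union


def collect_smiles(
    datapoints: List[List[str]], flatten: bool = False
) -> Union[List[str], List[List[str]]]:
    """Return one SMILES list per datapoint, or a single flat list if flatten is True."""
    if flatten:
        return [s for smiles_list in datapoints for s in smiles_list]
    return [list(smiles_list) for smiles_list in datapoints]


# Two datapoints, each holding two molecules.
pairs = [["CCO", "O"], ["c1ccccc1", "CC"]]
nested = collect_smiles(pairs)
flat = collect_smiles(pairs, flatten=True)
```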

targets() List[List[Optional[float]]][source]

Returns the targets associated with each molecule.

Returns

A list of lists of floats (or None) containing the targets.

class chemprop.data.data.MoleculeSampler(dataset: chemprop.data.data.MoleculeDataset, class_balance: bool = False, shuffle: bool = False, seed: int = 0)[source]

A MoleculeSampler samples data from a MoleculeDataset for a MoleculeDataLoader.

Parameters
  • class_balance – Whether to perform class balancing (i.e., use an equal number of positive and negative molecules). Set shuffle to True in order to get a random subset of the larger class.

  • shuffle – Whether to shuffle the data.

  • seed – Random seed. Only needed if shuffle is True.

chemprop.data.data.cache_graph() bool[source]

Returns whether MolGraphs will be cached.

chemprop.data.data.cache_mol() bool[source]

Returns whether RDKit molecules will be cached.

chemprop.data.data.construct_molecule_batch(data: List[chemprop.data.data.MoleculeDatapoint]) chemprop.data.data.MoleculeDataset[source]

Constructs a MoleculeDataset from a list of MoleculeDatapoints.

Additionally, precomputes the BatchMolGraph for the constructed MoleculeDataset.

Parameters

data – A list of MoleculeDatapoints.

Returns

A MoleculeDataset containing all the MoleculeDatapoints.

chemprop.data.data.empty_cache()[source]

Empties the cache of MolGraph and RDKit molecules.

chemprop.data.data.make_mols(smiles: List[str], reaction: bool, keep_h: bool)[source]

Builds a list of RDKit molecules (or a list of tuples of molecules if reaction is True) for a list of SMILES.

Parameters
  • smiles – List of SMILES strings.

  • reaction – Boolean whether the SMILES strings are to be treated as a reaction.

  • keep_h – Boolean whether to keep hydrogens in the input SMILES. This does not add hydrogens; it only keeps them if they are specified.

Returns

A list of RDKit molecules or a list of tuples of molecules.

chemprop.data.data.set_cache_graph(cache_graph: bool) None[source]

Sets whether MolGraphs will be cached.

chemprop.data.data.set_cache_mol(cache_mol: bool) None[source]

Sets whether RDKit molecules will be cached.

Scaffold

Classes and functions from chemprop.data.scaffold.py.

chemprop.data.scaffold.generate_scaffold(mol: Union[str, rdkit.Chem.rdchem.Mol, Tuple[rdkit.Chem.rdchem.Mol, rdkit.Chem.rdchem.Mol]], include_chirality: bool = False) str[source]

Computes the Bemis-Murcko scaffold for a SMILES string or an RDKit molecule.

Parameters
  • mol – A SMILES or an RDKit molecule.

  • include_chirality – Whether to include chirality in the computed scaffold.

Returns

The Bemis-Murcko scaffold for the molecule.

chemprop.data.scaffold.log_scaffold_stats(data: chemprop.data.data.MoleculeDataset, index_sets: List[Set[int]], num_scaffolds: int = 10, num_labels: int = 20, logger: Optional[logging.Logger] = None) List[Tuple[List[float], List[int]]][source]

Logs and returns statistics about counts and average target values in molecular scaffolds.

Parameters
  • data – A MoleculeDataset.

  • index_sets – A list of sets of indices representing splits of the data.

  • num_scaffolds – The number of scaffolds about which to display statistics.

  • num_labels – The number of labels about which to display statistics.

  • logger – A logger for recording output.

Returns

A list of tuples where each tuple contains a list of average target values across the first num_labels labels and a list of the number of non-zero values for the first num_scaffolds scaffolds, sorted in decreasing order of scaffold frequency.

chemprop.data.scaffold.scaffold_split(data: chemprop.data.data.MoleculeDataset, sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1), balanced: bool = False, seed: int = 0, logger: Optional[logging.Logger] = None) Tuple[chemprop.data.data.MoleculeDataset, chemprop.data.data.MoleculeDataset, chemprop.data.data.MoleculeDataset][source]

Splits a MoleculeDataset by scaffold so that no molecules sharing a scaffold are in different splits.

Parameters
  • data – A MoleculeDataset.

  • sizes – A length-3 tuple with the proportions of data in the train, validation, and test sets.

  • balanced – Whether to balance the sizes of scaffolds in each set rather than putting the smallest in the test set.

  • seed – Random seed for shuffling when doing balanced splitting.

  • logger – A logger for recording output.

Returns

A tuple of MoleculeDatasets containing the train, validation, and test splits of the data.
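The idea of keeping every scaffold group inside a single split can be sketched without RDKit by starting from precomputed scaffold labels. The function grouped_split and the greedy largest-first assignment are assumptions made for illustration (roughly the unbalanced case); this is not presented as the chemprop implementation.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple


def grouped_split(
    scaffolds: Sequence[str], sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1)
) -> Tuple[List[int], List[int], List[int]]:
    """Split indices so that all items sharing a scaffold land in the same split."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest groups first, so the smallest groups end up in validation and test.
    ordered = sorted(groups.values(), key=len, reverse=True)

    train_cap = sizes[0] * len(scaffolds)
    val_cap = sizes[1] * len(scaffolds)
    train: List[int] = []
    val: List[int] = []
    test: List[int] = []
    for group in ordered:
        # A group is never divided: it goes entirely to the first split with room.
        if len(train) + len(group) <= train_cap:
            train += group
        elif len(val) + len(group) <= val_cap:
            val += group
        else:
            test += group
    return train, val, test


# Hypothetical scaffold labels for ten molecules.
scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train, val, test = grouped_split(scaffolds)
```

Because whole groups are assigned at once, the realized split sizes only approximate the requested proportions.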

chemprop.data.scaffold.scaffold_to_smiles(mols: Union[List[str], List[rdkit.Chem.rdchem.Mol], List[Tuple[rdkit.Chem.rdchem.Mol, rdkit.Chem.rdchem.Mol]]], use_indices: bool = False) Dict[str, Union[Set[str], Set[int]]][source]

Computes the scaffold for each SMILES and returns a mapping from scaffolds to sets of SMILES (or indices).

Parameters
  • mols – A list of SMILES or RDKit molecules.

  • use_indices – Whether to map to each SMILES string's index in mols rather than to the SMILES string itself. This is necessary if there are duplicate SMILES.

Returns

A dictionary mapping each unique scaffold to all SMILES (or indices) which have that scaffold.
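Why use_indices matters for duplicate SMILES can be seen in a small sketch. The function group_by_scaffold and the hand-written toy_scaffolds table are hypothetical; the table stands in for a real Bemis-Murcko computation so that the example runs without RDKit.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Union


def group_by_scaffold(
    smiles: List[str], scaffold_of: Callable[[str], str], use_indices: bool = False
) -> Dict[str, Union[Set[str], Set[int]]]:
    """Map each scaffold to the set of SMILES (or positions) that share it."""
    mapping = defaultdict(set)
    for index, s in enumerate(smiles):
        mapping[scaffold_of(s)].add(index if use_indices else s)
    return dict(mapping)


# Hand-written scaffold table (toluene and phenol share a benzene scaffold).
toy_scaffolds = {
    "Cc1ccccc1": "c1ccccc1",
    "Oc1ccccc1": "c1ccccc1",
    "CC1CCCCC1": "C1CCCCC1",
}
smiles = ["Cc1ccccc1", "Oc1ccccc1", "Cc1ccccc1", "CC1CCCCC1"]

by_smiles = group_by_scaffold(smiles, toy_scaffolds.get)
by_index = group_by_scaffold(smiles, toy_scaffolds.get, use_indices=True)
```

The duplicate toluene entry collapses into one element when mapping to SMILES strings, while mapping to indices keeps all three positions for the benzene scaffold.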

Scaler

Classes and functions from chemprop.data.scaler.py.

class chemprop.data.scaler.StandardScaler(means: Optional[numpy.ndarray] = None, stds: Optional[numpy.ndarray] = None, replace_nan_token: Optional[Any] = None)[source]

A StandardScaler normalizes the features of a dataset.

When it is fit on a dataset, the StandardScaler learns the mean and standard deviation across the 0th axis. When transforming a dataset, the StandardScaler subtracts the means and divides by the standard deviations.

Parameters
  • means – An optional 1D numpy array of precomputed means.

  • stds – An optional 1D numpy array of precomputed standard deviations.

  • replace_nan_token – A token to use to replace NaN entries in the features.

fit(X: List[List[Optional[float]]]) chemprop.data.scaler.StandardScaler[source]

Learns means and standard deviations across the 0th axis of the data X.

Parameters

X – A list of lists of floats (or None).

Returns

The fitted StandardScaler (self).

inverse_transform(X: List[List[Optional[float]]]) numpy.ndarray[source]

Performs the inverse transformation by multiplying by the standard deviations and adding the means.

Parameters

X – A list of lists of floats.

Returns

The inverse transformed data with NaNs replaced by self.replace_nan_token.

transform(X: List[List[Optional[float]]]) numpy.ndarray[source]

Transforms the data by subtracting the means and dividing by the standard deviations.

Parameters

X – A list of lists of floats (or None).

Returns

The transformed data with NaNs replaced by self.replace_nan_token.
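The fit, transform, and inverse_transform behavior described above can be sketched in pure Python. ToyStandardScaler is a hypothetical class, not the chemprop one; it assumes missing values are given as None, uses the population standard deviation, and replaces a zero standard deviation by 1 to avoid division by zero.

```python
import math
from typing import List, Optional


class ToyStandardScaler:
    """Hypothetical column-wise scaler operating on lists of lists."""

    def __init__(self, replace_nan_token: Optional[float] = None):
        self.means: List[float] = []
        self.stds: List[float] = []
        self.replace_nan_token = replace_nan_token

    def fit(self, X: List[List[Optional[float]]]) -> "ToyStandardScaler":
        self.means, self.stds = [], []
        for column in zip(*X):  # iterate over columns, i.e. along the 0th axis
            known = [v for v in column if v is not None]
            mean = sum(known) / len(known) if known else 0.0
            var = sum((v - mean) ** 2 for v in known) / len(known) if known else 0.0
            std = math.sqrt(var)
            self.means.append(mean)
            # Assumption: a zero standard deviation is replaced by 1.
            self.stds.append(std if std > 0 else 1.0)
        return self

    def transform(self, X: List[List[Optional[float]]]) -> List[List[Optional[float]]]:
        return [
            [self.replace_nan_token if v is None else (v - m) / s
             for v, m, s in zip(row, self.means, self.stds)]
            for row in X
        ]

    def inverse_transform(self, X: List[List[Optional[float]]]) -> List[List[Optional[float]]]:
        return [
            [self.replace_nan_token if v is None else v * s + m
             for v, m, s in zip(row, self.means, self.stds)]
            for row in X
        ]


X = [[1.0, 10.0], [3.0, None], [5.0, 10.0]]
scaler = ToyStandardScaler(replace_nan_token=0.0).fit(X)
scaled = scaler.transform(X)
restored = scaler.inverse_transform(scaled)
```

Note that once a missing entry has been replaced by the token, it is indistinguishable from a real value, so the inverse transformation in this sketch maps it back to the column mean.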

Utils

Classes and functions from chemprop.data.utils.py.

chemprop.data.utils.filter_invalid_smiles(data: chemprop.data.data.MoleculeDataset) chemprop.data.data.MoleculeDataset[source]

Filters out invalid SMILES.

Parameters

data – A MoleculeDataset.

Returns

A MoleculeDataset with only the valid molecules.

chemprop.data.utils.get_class_sizes(data: chemprop.data.data.MoleculeDataset) List[List[float]][source]

Determines the proportions of the different classes in a classification dataset.

Parameters

data – A classification MoleculeDataset.

Returns

A list of lists of class proportions. Each inner list contains the class proportions for a task.
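The shape of this return value can be illustrated with a short sketch. The function class_proportions is hypothetical; it assumes binary targets (0 or 1, with None for unknown values) and at least one known value per task.

```python
from typing import List, Optional


def class_proportions(targets: List[List[Optional[float]]]) -> List[List[float]]:
    """Return, for each task, the proportions of class 0 and class 1 among known targets."""
    num_tasks = len(targets[0])
    proportions = []
    for task in range(num_tasks):
        # Unknown targets (None) are ignored when computing proportions.
        known = [row[task] for row in targets if row[task] is not None]
        frac_ones = sum(1 for v in known if v == 1) / len(known)
        proportions.append([1 - frac_ones, frac_ones])
    return proportions


# Four molecules, two classification tasks.
targets = [[1, 0], [0, None], [1, 1], [1, 0]]
sizes = class_proportions(targets)
```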

chemprop.data.utils.get_data(path: str, smiles_columns: Optional[Union[str, List[str]]] = None, target_columns: Optional[List[str]] = None, ignore_columns: Optional[List[str]] = None, skip_invalid_smiles: bool = True, args: Optional[Union[chemprop.args.TrainArgs, chemprop.args.PredictArgs]] = None, data_weights_path: Optional[str] = None, features_path: Optional[List[str]] = None, features_generator: Optional[List[str]] = None, phase_features_path: Optional[str] = None, atom_descriptors_path: Optional[str] = None, bond_features_path: Optional[str] = None, max_data_size: Optional[int] = None, store_row: bool = False, logger: Optional[logging.Logger] = None, skip_none_targets: bool = False) chemprop.data.data.MoleculeDataset[source]

Gets SMILES and target values from a CSV file.

Parameters
  • path – Path to a CSV file.

  • smiles_columns – The names of the columns containing SMILES. By default, uses the first number_of_molecules columns.

  • target_columns – Names of the columns containing target values. By default, uses all columns except the smiles_columns and the ignore_columns.

  • ignore_columns – Names of the columns to ignore when target_columns is not provided.

  • skip_invalid_smiles – Whether to skip and filter out invalid SMILES using filter_invalid_smiles().

  • args – Arguments, either TrainArgs or PredictArgs.

  • data_weights_path – A path to a file containing weights for each molecule in the loss function.

  • features_path – A list of paths to files containing features. If provided, it is used in place of args.features_path.

  • features_generator – A list of features generators to use. If provided, it is used in place of args.features_generator.

  • phase_features_path – A path to a file containing phase features as applicable to spectra.

  • atom_descriptors_path – The path to the file containing the custom atom descriptors.

  • bond_features_path – The path to the file containing the custom bond features.

  • max_data_size – The maximum number of data points to load.

  • logger – A logger for recording output.

  • store_row – Whether to store the raw CSV row in each MoleculeDatapoint.

  • skip_none_targets – Whether to skip data points whose targets are all ‘None’. This is mostly relevant when --target_columns is passed in, so that only a subset of tasks is examined.

Returns

A MoleculeDataset containing SMILES and target values along with other info such as additional features when desired.

chemprop.data.utils.get_data_from_smiles(smiles: List[List[str]], skip_invalid_smiles: bool = True, logger: Optional[logging.Logger] = None, features_generator: Optional[List[str]] = None) chemprop.data.data.MoleculeDataset[source]

Converts a list of SMILES to a MoleculeDataset.

Parameters
  • smiles – A list of lists of SMILES with length depending on the number of molecules.

  • skip_invalid_smiles – Whether to skip and filter out invalid SMILES using filter_invalid_smiles().

  • logger – A logger for recording output.

  • features_generator – List of features generators.

Returns

A MoleculeDataset with all of the provided SMILES.

chemprop.data.utils.get_data_weights(path: str) List[float][source]

Returns the list of data weights for the loss function as stored in a CSV file.

Parameters

path – Path to a CSV file.

Returns

A list of floats containing the data weights.

chemprop.data.utils.get_header(path: str) List[str][source]

Returns the header of a data CSV file.

Parameters

path – Path to a CSV file.

Returns

A list of strings containing the strings in the comma-separated header.

chemprop.data.utils.get_smiles(path: str, smiles_columns: Optional[Union[str, List[str]]] = None, header: bool = True, flatten: bool = False) Union[List[str], List[List[str]]][source]

Returns the SMILES from a data CSV file.

Parameters
  • path – Path to a CSV file.

  • smiles_columns – A list of the names of the columns containing SMILES. By default, uses the first number_of_molecules columns.

  • header – Whether the CSV file contains a header.

  • flatten – Whether to flatten the returned SMILES to a list instead of a list of lists.

Returns

A list of SMILES or a list of lists of SMILES, depending on flatten.

chemprop.data.utils.get_task_names(path: str, smiles_columns: Optional[Union[str, List[str]]] = None, target_columns: Optional[List[str]] = None, ignore_columns: Optional[List[str]] = None) List[str][source]

Gets the task names from a data CSV file.

If target_columns is provided, returns target_columns. Otherwise, returns all columns except the smiles_columns (or the first column, if the smiles_columns is None) and the ignore_columns.

Parameters
  • path – Path to a CSV file.

  • smiles_columns – The names of the columns containing SMILES. By default, uses the first number_of_molecules columns.

  • target_columns – Names of the columns containing target values. By default, uses all columns except the smiles_columns and the ignore_columns.

  • ignore_columns – Names of the columns to ignore when target_columns is not provided.

Returns

A list of task names.
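The column-selection rule described above can be written out as a short sketch that operates on an already-parsed header. The function select_task_names is hypothetical; it takes the header as a list instead of reading a CSV file and assumes a single SMILES column when smiles_columns is None.

```python
from typing import List, Optional, Union


def select_task_names(
    header: List[str],
    smiles_columns: Optional[Union[str, List[str]]] = None,
    target_columns: Optional[List[str]] = None,
    ignore_columns: Optional[List[str]] = None,
) -> List[str]:
    """Pick task names from a CSV header following the rule described in the text."""
    if target_columns is not None:
        return list(target_columns)

    if smiles_columns is None:
        smiles_columns = header[:1]  # assumption: the first column holds the SMILES
    elif isinstance(smiles_columns, str):
        smiles_columns = [smiles_columns]

    excluded = set(smiles_columns) | set(ignore_columns or [])
    return [column for column in header if column not in excluded]


header = ["smiles", "id", "logP", "solubility"]
default_tasks = select_task_names(header, ignore_columns=["id"])
explicit_tasks = select_task_names(header, target_columns=["logP"])
```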

chemprop.data.utils.preprocess_smiles_columns(path: str, smiles_columns: Optional[Union[str, List[Optional[str]]]], number_of_molecules: int = 1) List[Optional[str]][source]

Preprocesses the smiles_columns variable to ensure that it is a list of column headings corresponding to the columns in the data file holding SMILES.

Parameters
  • path – Path to a CSV file.

  • smiles_columns – The names of the columns containing SMILES. By default, uses the first number_of_molecules columns.

  • number_of_molecules – The number of molecules with associated SMILES for each data point.

Returns

The preprocessed version of smiles_columns which is guaranteed to be a list.

chemprop.data.utils.split_data(data: chemprop.data.data.MoleculeDataset, split_type: str = 'random', sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1), seed: int = 0, num_folds: int = 1, args: Optional[chemprop.args.TrainArgs] = None, logger: Optional[logging.Logger] = None) Tuple[chemprop.data.data.MoleculeDataset, chemprop.data.data.MoleculeDataset, chemprop.data.data.MoleculeDataset][source]

Splits data into training, validation, and test splits.

Parameters
  • data – A MoleculeDataset.

  • split_type – Split type.

  • sizes – A length-3 tuple with the proportions of data in the train, validation, and test sets.

  • seed – The random seed to use before shuffling data.

  • num_folds – Number of folds to create (only needed for “cv” split type).

  • args – A TrainArgs object.

  • logger – A logger for recording output.

Returns

A tuple of MoleculeDatasets containing the train, validation, and test splits of the data.
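For the 'random' split type, the role of sizes and seed can be sketched as follows. The function random_split is a hypothetical illustration on plain lists and makes no claim about how the other split types are handled.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(
    items: Sequence[T], sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1), seed: int = 0
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle with a fixed seed, then cut into train, validation, and test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)

    # Cut points are derived from the cumulative proportions.
    train_end = int(sizes[0] * len(items))
    val_end = int((sizes[0] + sizes[1]) * len(items))

    train = [items[i] for i in indices[:train_end]]
    val = [items[i] for i in indices[train_end:val_end]]
    test = [items[i] for i in indices[val_end:]]
    return train, val, test


molecules = [f"mol_{i}" for i in range(10)]
train, val, test = random_split(molecules, seed=0)
```

Using the same seed reproduces the same split, which makes results comparable across runs.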

chemprop.data.utils.validate_data(data_path: str) Set[str][source]

Validates a data CSV file, returning a set of errors.

Parameters

data_path – Path to a data CSV file.

Returns

A set of error messages.

chemprop.data.utils.validate_dataset_type(data: chemprop.data.data.MoleculeDataset, dataset_type: str) None[source]

Validates the dataset type to ensure the data matches the provided type.

Parameters
  • data – A MoleculeDataset.

  • dataset_type – The dataset type to check.