Data

chemprop.data contains functions and classes for loading, containing, and splitting data.

Data

Classes and functions from chemprop.data.data.py.

class chemprop.data.data.MoleculeDataLoader(dataset: MoleculeDataset, batch_size: int = 50, num_workers: int = 8, class_balance: bool = False, shuffle: bool = False, seed: int = 0)[source]

A MoleculeDataLoader is a PyTorch DataLoader for loading a MoleculeDataset.

Parameters:

dataset – The MoleculeDataset containing the molecules to load.
batch_size – Batch size.
num_workers – Number of workers used to build batches.
class_balance – Whether to perform class balancing (i.e., use an equal number of positive and negative molecules). Class balance is only available for single task classification datasets. Set shuffle to True in order to get a random subset of the larger class.
shuffle – Whether to shuffle the data.
seed – Random seed. Only needed if shuffle is True.

property gt_targets: List[List[bool | None]]

Returns booleans indicating whether each target associated with each molecule is a greater-than inequality target rather than a value target.

Returns:: A list of lists of booleans (or None) indicating which targets are greater-than inequality targets.

property iter_size: int: Returns the number of data points included in each full iteration through the MoleculeDataLoader.

property lt_targets: List[List[bool | None]]

Returns booleans indicating whether each target associated with each molecule is a less-than inequality target rather than a value target.

Returns:: A list of lists of booleans (or None) indicating which targets are less-than inequality targets.

property targets: List[List[float | None]]

Returns the targets associated with each molecule.

Returns:: A list of lists of floats (or None) containing the targets.

A MoleculeDatapoint contains a single molecule and its associated features and targets.

Parameters:

smiles – A list of the SMILES strings for the molecules.
targets – A list of targets for the molecule (contains None for unknown target values).
atom_targets – A list of targets for the atomic properties.
bond_targets – A list of targets for the bond properties.
row – The raw CSV row containing the information for this molecule.
data_weight – Weighting of the datapoint for the loss function.
gt_targets – Indicates whether the targets are an inequality regression target of the form “>x”.
lt_targets – Indicates whether the targets are an inequality regression target of the form “<x”.
features – A numpy array containing additional features (e.g., Morgan fingerprint).
features_generator – A list of features generators to use.
phase_features – A one-hot vector indicating the phase of the data, as used in spectra data.
atom_descriptors – A numpy array containing additional atom descriptors to featurize the molecule.
bond_descriptors – A numpy array containing additional bond descriptors to featurize the molecule.
raw_constraints – A numpy array containing all user-provided atom/bond-level constraints in input data.
constraints – A numpy array containing the atom/bond-level constraints that are used in training. constraints is a subset of raw_constraints.
overwrite_default_atom_features – Whether to overwrite the default atom features with atom_features.
overwrite_default_bond_features – Whether to overwrite the default bond features with bond_features.
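
The gt_targets and lt_targets flags can be pictured as the result of parsing raw target strings. The sketch below is illustrative only: parse_inequality_target is a hypothetical helper, not a chemprop function, and it assumes raw targets arrive as strings such as ">5.0", "<2", "3.1", or an empty string for an unknown value.

```python
from typing import List, Optional, Tuple


def parse_inequality_target(raw: str) -> Tuple[Optional[float], bool, bool]:
    """Return (value, is_greater_than, is_less_than) for one raw target string.

    Hypothetical helper for illustration; not a chemprop function.
    """
    raw = raw.strip()
    if raw == "":
        # Unknown target: no value and no inequality flag.
        return None, False, False
    if raw.startswith(">"):
        return float(raw[1:]), True, False
    if raw.startswith("<"):
        return float(raw[1:]), False, True
    return float(raw), False, False


raw_targets: List[str] = [">5.0", "<2", "3.1", ""]
parsed = [parse_inequality_target(r) for r in raw_targets]
targets = [p[0] for p in parsed]      # numeric values, None where unknown
gt_targets = [p[1] for p in parsed]   # True where the target was ">x"
lt_targets = [p[2] for p in parsed]   # True where the target was "<x"
```

Keeping the numeric value and the two flags in parallel lists mirrors how the targets, gt_targets, and lt_targets parameters are described above.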

property bond_types: List[List[float]]

Gets the bond types in the MoleculeDatapoint.

Returns:: A list of bond types for each molecule.

extend_features(features: ndarray) → None[source]

Extends the features of the molecule.

Parameters:: features – A 1D numpy array of extra features for the molecule.

property max_molwt: float

Gets the maximum molecular weight among all the molecules in the MoleculeDatapoint.

Returns:: The maximum molecular weight.

property mol: List[Mol | Tuple[Mol, Mol]]: Gets the corresponding list of RDKit molecules for the corresponding SMILES list.

num_tasks() → int[source]

Returns the number of prediction tasks.

Returns:: The number of tasks.

property number_of_atoms: int

Gets the number of atoms in the MoleculeDatapoint.

Returns:: A list of number of atoms for each molecule.

property number_of_bonds: List[int]

Gets the number of bonds in the MoleculeDatapoint.

Returns:: A list of number of bonds for each molecule.

property number_of_molecules: int

Gets the number of molecules in the MoleculeDatapoint.

Returns:: The number of molecules.

reset_features_and_targets() → None[source]: Resets the features (atom, bond, and molecule) and targets to their raw values.

set_atom_descriptors(atom_descriptors: ndarray) → None[source]

Sets the atom descriptors of the molecule.

Parameters:: atom_descriptors – A 1D numpy array of atom descriptors for the molecule.

set_atom_features(atom_features: ndarray) → None[source]

Sets the atom features of the molecule.

Parameters:: atom_features – A 1D numpy array of atom features for the molecule.

set_bond_descriptors(bond_descriptors: ndarray) → None[source]

Sets the bond descriptors of the molecule.

Parameters:: bond_descriptors – A 1D numpy array of bond descriptors for the molecule.

set_bond_features(bond_features: ndarray) → None[source]

Sets the bond features of the molecule.

Parameters:: bond_features – A 1D numpy array of bond features for the molecule.

set_features(features: ndarray) → None[source]

Sets the features of the molecule.

Parameters:: features – A 1D numpy array of features for the molecule.

set_targets(targets: List[float | None])[source]

Sets the targets of a molecule.

Parameters:: targets – A list of floats containing the targets.

class chemprop.data.data.MoleculeDataset(data: List[MoleculeDatapoint])[source]

A MoleculeDataset contains a list of MoleculeDatapoints with access to their attributes.

Parameters:: data – A list of MoleculeDatapoints.

atom_bond_data_weights() → List[List[float]][source]: Returns the loss weighting associated with each datapoint for atomic/bond properties prediction.

atom_descriptors() → List[ndarray][source]

Returns the atom descriptors associated with each molecule (if they exist).

Returns:: A list of 2D numpy arrays containing the atom descriptors for each molecule or None if there are no features.

atom_descriptors_size() → int[source]

Returns the size of custom additional atom descriptors vector associated with the molecules.

Returns:: The size of the additional atom descriptor vector.

atom_features() → List[ndarray][source]

Returns the atom features associated with each molecule (if they exist).

Returns:: A list of 2D numpy arrays containing the atom features for each molecule or None if there are no features.

atom_features_size() → int[source]

Returns the size of custom additional atom features vector associated with the molecules.

Returns:: The size of the additional atom feature vector.

batch_graph() → List[BatchMolGraph][source]

Constructs a BatchMolGraph with the graph featurization of all the molecules.

Note

The BatchMolGraph is cached after the first time it is computed and is simply accessed upon subsequent calls to batch_graph(). This means that if the underlying set of MoleculeDatapoints changes, then the returned BatchMolGraph will be incorrect for the underlying data.

Returns:: A list of BatchMolGraph containing the graph featurization of all the molecules in each MoleculeDatapoint.
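
The caching caveat in the note can be reproduced with a toy container. This is a minimal sketch under stated assumptions: CachedBatch and batch_lengths are hypothetical names, no chemprop classes are used, and string length stands in for the graph featurization.

```python
from typing import List, Optional


class CachedBatch:
    """Toy container that caches a derived value on first access.

    Illustrative stand-in only; it does not use chemprop classes.
    """

    def __init__(self, items: List[str]) -> None:
        self.items = items
        self._cached: Optional[List[int]] = None

    def batch_lengths(self) -> List[int]:
        # Compute once, then return the stored result on later calls.
        if self._cached is None:
            self._cached = [len(item) for item in self.items]
        return self._cached


batch = CachedBatch(["CCO", "c1ccccc1"])
first = batch.batch_lengths()    # computed and cached
batch.items.append("C")          # the underlying data changes
second = batch.batch_lengths()   # still the cached result, now stale
```

Here second still describes two items although the container holds three, which is the kind of mismatch the note warns about.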

bond_descriptors() → List[ndarray][source]

Returns the bond descriptors associated with each molecule (if they exist).

Returns:: A list of 2D numpy arrays containing the bond descriptors for each molecule or None if there are no features.

bond_descriptors_size() → int[source]

Returns the size of custom additional bond descriptors vector associated with the molecules.

Returns:: The size of the additional bond descriptor vector.

bond_features() → List[ndarray][source]

Returns the bond features associated with each molecule (if they exist).

Returns:: A list of 2D numpy arrays containing the bond features for each molecule or None if there are no features.

bond_features_size() → int[source]

Returns the size of custom additional bond features vector associated with the molecules.

Returns:: The size of the additional bond feature vector.

property bond_types: List[List[float]]

Gets the bond types in each MoleculeDatapoint.

Returns:: A list of bond types for each molecule.

constraints() → List[ndarray][source]: Return the constraints applied in atomic/bond properties prediction.

data_weights() → List[float][source]: Returns the loss weighting associated with each datapoint.

features() → List[ndarray][source]

Returns the features associated with each molecule (if they exist).

Returns:: A list of 1D numpy arrays containing the features for each molecule or None if there are no features.

features_size() → int[source]

Returns the size of the additional features vector associated with the molecules.

Returns:: The size of the additional features vector.

gt_targets() → List[ndarray][source]

Returns indications of whether the targets associated with each molecule are greater-than inequalities.

Returns:: A list of lists of booleans indicating whether the targets in those positions are greater-than inequality targets.

property is_atom_bond_targets: bool

Returns whether this dataset is used for atomic/bond property prediction.

Returns:: A Boolean value.

lt_targets() → List[ndarray][source]

Returns indications of whether the targets associated with each molecule are less-than inequalities.

Returns:: A list of lists of booleans indicating whether the targets in those positions are less-than inequality targets.

mask() → List[List[bool]][source]

Returns whether the targets associated with each molecule and task are present.

Returns:: A list of lists of booleans indicating whether each target is present.
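
For molecule-level targets, a presence mask of this kind can be sketched in one line. The example assumes missing targets are stored as None, as described for targets elsewhere on this page; atom/bond targets may be handled differently.

```python
from typing import List, Optional

# One row per molecule, one column per task; None marks a missing target.
targets: List[List[Optional[float]]] = [
    [1.0, None],
    [None, 0.0],
    [0.5, 2.0],
]

# True where a target value is present, False where it is None.
mask = [[t is not None for t in row] for row in targets]
```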

mols(flatten: bool = False) → List[Mol] | List[List[Mol]] | List[Tuple[Mol, Mol]] | List[List[Tuple[Mol, Mol]]][source]

Returns a list of the RDKit molecules associated with each MoleculeDatapoint.

Parameters:: flatten – Whether to flatten the returned RDKit molecules to a list instead of a list of lists.
Returns:: A list of RDKit molecules or a list of lists of RDKit molecules, depending on flatten.

normalize_atom_bond_targets() → AtomBondScaler[source]

Normalizes the targets of the dataset using an AtomBondScaler.

The AtomBondScaler subtracts the mean and divides by the standard deviation for each task independently.

This should only be used for regression datasets.

Returns:: An AtomBondScaler fitted to the targets.

normalize_features(scaler: StandardScaler | None = None, replace_nan_token: int = 0, scale_atom_descriptors: bool = False, scale_bond_descriptors: bool = False) → StandardScaler[source]

Normalizes the features of the dataset using a StandardScaler.

The StandardScaler subtracts the mean and divides by the standard deviation for each feature independently.

If a StandardScaler is provided, it is used to perform the normalization. Otherwise, a StandardScaler is first fit to the features in this dataset and is then used to perform the normalization.

Parameters:

scaler – A fitted StandardScaler. If it is provided it is used, otherwise a new StandardScaler is first fitted to this data and is then used.
replace_nan_token – A token to use to replace NaN entries in the features.
scale_atom_descriptors – Whether the features to be scaled are the atom descriptors rather than the molecule-level features.
scale_bond_descriptors – Whether the features to be scaled are the bond descriptors rather than the molecule-level features.

Returns:

A fitted StandardScaler. If a StandardScaler is provided as a parameter, this is the same StandardScaler. Otherwise, this is a new StandardScaler that has been fit on this dataset.

normalize_targets() → StandardScaler[source]

Normalizes the targets of the dataset using a StandardScaler.

The StandardScaler subtracts the mean and divides by the standard deviation for each task independently.

This should only be used for regression datasets.

Returns:: A StandardScaler fitted to the targets.

num_tasks() → int[source]

Returns the number of prediction tasks.

Returns:: The number of tasks.

property number_of_atoms: List[List[int]]

Gets the number of atoms in each MoleculeDatapoint.

Returns:: A list of number of atoms for each molecule.

property number_of_bonds: List[List[int]]

Gets the number of bonds in each MoleculeDatapoint.

Returns:: A list of number of bonds for each molecule.

property number_of_molecules: int

Gets the number of molecules in each MoleculeDatapoint.

Returns:: The number of molecules.

phase_features() → List[ndarray][source]

Returns the phase features associated with each molecule (if they exist).

Returns:: A list of 1D numpy arrays containing the phase features for each molecule or None if there are no features.

reset_features_and_targets() → None[source]: Resets the features (atom, bond, and molecule) and targets to their raw values.

set_targets(targets: List[List[float | None]]) → None[source]

Sets the targets for each molecule in the dataset. Assumes the targets are aligned with the datapoints.

Parameters:: targets – A list of lists of floats (or None) containing targets for each molecule. This must be the same length as the underlying dataset.

smiles(flatten: bool = False) → List[str] | List[List[str]][source]

Returns a list containing the SMILES list associated with each MoleculeDatapoint.

Parameters:: flatten – Whether to flatten the returned SMILES to a list instead of a list of lists.
Returns:: A list of SMILES or a list of lists of SMILES, depending on flatten.
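
The effect of flatten can be sketched with plain lists. The data below is made up for illustration, and the ordering shown for the flattened case (datapoint by datapoint) is an assumption rather than a documented guarantee.

```python
from typing import List

# Each inner list holds the SMILES for one datapoint (two molecules each here).
smiles_per_datapoint: List[List[str]] = [["CCO", "O"], ["CCN", "O"]]

# flatten=False: one list of SMILES per datapoint.
nested = [list(dp) for dp in smiles_per_datapoint]

# flatten=True: a single list, concatenated in datapoint order.
flat = [s for dp in smiles_per_datapoint for s in dp]
```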

targets() → List[List[float | None]][source]

Returns the targets associated with each molecule.

Returns:: A list of lists of floats (or None) containing the targets.

class chemprop.data.data.MoleculeSampler(dataset: MoleculeDataset, class_balance: bool = False, shuffle: bool = False, seed: int = 0)[source]

A MoleculeSampler samples data from a MoleculeDataset for a MoleculeDataLoader.

Parameters:

class_balance – Whether to perform class balancing (i.e., use an equal number of positive and negative molecules). Set shuffle to True in order to get a random subset of the larger class.
shuffle – Whether to shuffle the data.
seed – Random seed. Only needed if shuffle is True.
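
One way to picture class balancing is to alternate positive and negative indices and stop when the smaller class runs out. The sketch below is an assumption about the general approach, not chemprop's code; balanced_order is a hypothetical function and it handles a single binary task only.

```python
import random
from typing import List


def balanced_order(labels: List[int], shuffle: bool = False, seed: int = 0) -> List[int]:
    """Return dataset indices that alternate positive and negative examples.

    Illustrative sketch for a single binary task; not chemprop's code.
    """
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if shuffle:
        rng = random.Random(seed)
        rng.shuffle(positives)
        rng.shuffle(negatives)
    # zip stops at the shorter list, so the larger class is truncated.
    return [i for pair in zip(positives, negatives) for i in pair]


labels = [1, 0, 0, 0, 1, 0]
order = balanced_order(labels)
shuffled_order = balanced_order(labels, shuffle=True, seed=1)
```

Without shuffling, the same leading members of the larger class are used on every pass, which is why shuffle is needed to get a random subset of the larger class.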

chemprop.data.data.cache_graph() → bool[source]: Returns whether MolGraphs will be cached.

chemprop.data.data.cache_mol() → bool[source]: Returns whether RDKit molecules will be cached.

chemprop.data.data.construct_molecule_batch(data: List[MoleculeDatapoint]) → MoleculeDataset[source]

Constructs a MoleculeDataset from a list of MoleculeDatapoints.

Additionally, precomputes the BatchMolGraph for the constructed MoleculeDataset.

Parameters:: data – A list of MoleculeDatapoints.
Returns:: A MoleculeDataset containing all the MoleculeDatapoints.

chemprop.data.data.empty_cache()[source]: Empties the cache of MolGraph and RDKit molecules.

chemprop.data.data.make_mols(smiles: List[str], reaction_list: List[bool], keep_h_list: List[bool], add_h_list: List[bool], keep_atom_map_list: List[bool])[source]

Builds a list of RDKit molecules (or a list of tuples of molecules if reaction is True) for a list of smiles.

Parameters:

smiles – List of SMILES strings.
reaction_list – List of booleans whether the SMILES strings are to be treated as a reaction.
keep_h_list – List of booleans whether to keep hydrogens in the input smiles. This does not add hydrogens; it only keeps them if they are specified.
add_h_list – List of booleans whether to add hydrogens to the input smiles.
keep_atom_map_list – List of booleans whether to keep the original atom mapping.

Returns:

List of RDKit molecules or list of tuple of molecules.

chemprop.data.data.set_cache_graph(cache_graph: bool) → None[source]: Sets whether MolGraphs will be cached.

chemprop.data.data.set_cache_mol(cache_mol: bool) → None[source]: Sets whether RDKit molecules will be cached.

Scaffold

Classes and functions from chemprop.data.scaffold.py.

chemprop.data.scaffold.generate_scaffold(mol: str | Mol | Tuple[Mol, Mol], include_chirality: bool = False) → str[source]

Computes the Bemis-Murcko scaffold for a SMILES string.

Parameters:

mol – A SMILES or an RDKit molecule.
include_chirality – Whether to include chirality in the computed scaffold.

Returns:

The Bemis-Murcko scaffold for the molecule.

chemprop.data.scaffold.log_scaffold_stats(data: MoleculeDataset, index_sets: List[Set[int]], num_scaffolds: int = 10, num_labels: int = 20, logger: Logger | None = None) → List[Tuple[List[float], List[int]]][source]

Logs and returns statistics about counts and average target values in molecular scaffolds.

Parameters:

data – A MoleculeDataset.
index_sets – A list of sets of indices representing splits of the data.
num_scaffolds – The number of scaffolds about which to display statistics.
num_labels – The number of labels about which to display statistics.
logger – A logger for recording output.

Returns:

A list of tuples where each tuple contains a list of average target values across the first num_labels labels and a list of the number of non-zero values for the first num_scaffolds scaffolds, sorted in decreasing order of scaffold frequency.

chemprop.data.scaffold.scaffold_split(data: MoleculeDataset, sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1), balanced: bool = False, key_molecule_index: int = 0, seed: int = 0, logger: Logger | None = None) → Tuple[MoleculeDataset, MoleculeDataset, MoleculeDataset][source]

Splits a MoleculeDataset by scaffold so that no molecules sharing a scaffold are in different splits.

Parameters:

data – A MoleculeDataset.
sizes – A length-3 tuple with the proportions of data in the train, validation, and test sets.
balanced – Whether to balance the sizes of scaffolds in each set rather than putting the smallest in test set.
key_molecule_index – For data with multiple molecules, this sets which molecule will be considered during splitting.
seed – Random seed for shuffling when doing balanced splitting.
logger – A logger for recording output.

Returns:

A tuple of MoleculeDatasets containing the train, validation, and test splits of the data.
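
A grouped split of this kind can be sketched without any chemistry toolkit. The example is a minimal sketch under stated assumptions: group_split is a hypothetical function, the keys stand in for precomputed scaffold strings, and the largest-group-first greedy fill corresponds to the unbalanced case only.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_split(
    keys: Sequence[str],
    sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1),
) -> Tuple[List[int], List[int], List[int]]:
    """Split indices so that items sharing a key stay in the same split.

    Illustrative sketch; keys stand in for precomputed scaffold strings.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for index, key in enumerate(keys):
        groups[key].append(index)

    train_size = sizes[0] * len(keys)
    val_size = sizes[1] * len(keys)
    train: List[int] = []
    val: List[int] = []
    test: List[int] = []

    # Place the largest groups first; a group goes to test when it fits nowhere else.
    for group in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(group) <= train_size:
            train += group
        elif len(val) + len(group) <= val_size:
            val += group
        else:
            test += group
    return train, val, test


keys = ["A", "A", "A", "A", "A", "A", "B", "B", "C", "D"]
train, val, test = group_split(keys)
```

Because whole groups are assigned at once, the realized split sizes only approximate the requested proportions when groups are large.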

chemprop.data.scaffold.scaffold_to_smiles(mols: List[str] | List[Mol] | List[Tuple[Mol, Mol]], use_indices: bool = False) → Dict[str, Set[str] | Set[int]][source]

Computes the scaffold for each SMILES and returns a mapping from scaffolds to sets of smiles (or indices).

Parameters:

mols – A list of SMILES or RDKit molecules.
use_indices – Whether to map to the SMILES’s index in mols rather than mapping to the smiles string itself. This is necessary if there are duplicate smiles.

Returns:

A dictionary mapping each unique scaffold to all SMILES (or indices) which have that scaffold.
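
The role of use_indices is easiest to see with duplicate SMILES. In this sketch group_by_scaffold and toy_scaffold are hypothetical names, and toy_scaffold is a crude stand-in for a real Bemis-Murcko scaffold computation, which would require a cheminformatics toolkit.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Union


def group_by_scaffold(
    smiles: List[str],
    scaffold_fn: Callable[[str], str],
    use_indices: bool = False,
) -> Dict[str, Union[Set[str], Set[int]]]:
    """Map each scaffold to the set of SMILES (or indices) that share it."""
    mapping: Dict[str, set] = defaultdict(set)
    for index, smi in enumerate(smiles):
        mapping[scaffold_fn(smi)].add(index if use_indices else smi)
    return dict(mapping)


def toy_scaffold(smi: str) -> str:
    # Stand-in only: real scaffolds come from a cheminformatics toolkit.
    return "ring" if "1" in smi else "chain"


smiles = ["CCO", "CCO", "c1ccccc1", "Cc1ccccc1"]
by_smiles = group_by_scaffold(smiles, toy_scaffold)
by_index = group_by_scaffold(smiles, toy_scaffold, use_indices=True)
```

With string values the duplicated "CCO" collapses into one set entry, while the index mapping keeps both rows.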

Scaler

Classes and functions from chemprop.data.scaler.py.

class chemprop.data.scaler.AtomBondScaler(means: ndarray | None = None, stds: ndarray | None = None, replace_nan_token: Any | None = None, n_atom_targets=None, n_bond_targets=None)[source]

An AtomBondScaler normalizes the features of a dataset.

When it is fit on a dataset, the AtomBondScaler learns the mean and standard deviation across the 0th axis. When transforming a dataset, the AtomBondScaler subtracts the means and divides by the standard deviations.

Parameters:

means – An optional 1D numpy array of precomputed means.
stds – An optional 1D numpy array of precomputed standard deviations.
replace_nan_token – A token to use to replace NaN entries in the features.

fit(X: List[List[float | None]]) → AtomBondScaler[source]

Learns means and standard deviations across the 0th axis of the data X.

Parameters:: X – A list of lists of floats (or None).
Returns:: The fitted AtomBondScaler (self).

inverse_transform(X: List[List[float | None]]) → List[ndarray][source]

Performs the inverse transformation by multiplying by the standard deviations and adding the means.

Parameters:: X – A list of lists of floats.
Returns:: The inverse transformed data with NaNs replaced by self.replace_nan_token.

transform(X: List[List[float | None]]) → List[ndarray][source]

Transforms the data by subtracting the means and dividing by the standard deviations.

Parameters:: X – A list of lists of floats (or None).
Returns:: The transformed data with NaNs replaced by self.replace_nan_token.

class chemprop.data.scaler.StandardScaler(means: ndarray | None = None, stds: ndarray | None = None, replace_nan_token: Any | None = None)[source]

A StandardScaler normalizes the features of a dataset.

When it is fit on a dataset, the StandardScaler learns the mean and standard deviation across the 0th axis. When transforming a dataset, the StandardScaler subtracts the means and divides by the standard deviations.

Parameters:

means – An optional 1D numpy array of precomputed means.
stds – An optional 1D numpy array of precomputed standard deviations.
replace_nan_token – A token to use to replace NaN entries in the features.

fit(X: List[List[float | None]]) → StandardScaler[source]

Learns means and standard deviations across the 0th axis of the data X.

Parameters:: X – A list of lists of floats (or None).
Returns:: The fitted StandardScaler (self).

inverse_transform(X: List[List[float | None]]) → ndarray[source]

Performs the inverse transformation by multiplying by the standard deviations and adding the means.

Parameters:: X – A list of lists of floats.
Returns:: The inverse transformed data with NaNs replaced by self.replace_nan_token.

transform(X: List[List[float | None]]) → ndarray[source]

Transforms the data by subtracting the means and dividing by the standard deviations.

Parameters:: X – A list of lists of floats (or None).
Returns:: The transformed data with NaNs replaced by self.replace_nan_token.
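
The fit, transform, and inverse_transform steps can be sketched in pure Python. TinyScaler is a hypothetical class, not chemprop's StandardScaler. It assumes that None marks a missing value, that the population standard deviation is used, and that a zero standard deviation is replaced by 1 to avoid division by zero.

```python
import math
from typing import List, Optional


class TinyScaler:
    """Column-wise standardizer that skips None entries when fitting.

    Illustrative sketch in pure Python; not chemprop's StandardScaler.
    """

    def __init__(self, replace_nan_token: Optional[float] = None) -> None:
        self.means: List[float] = []
        self.stds: List[float] = []
        self.replace_nan_token = replace_nan_token

    def fit(self, X: List[List[Optional[float]]]) -> "TinyScaler":
        self.means, self.stds = [], []
        for column in zip(*X):  # iterate over columns (the 0th axis is rows)
            values = [v for v in column if v is not None]
            mean = sum(values) / len(values) if values else 0.0
            var = sum((v - mean) ** 2 for v in values) / len(values) if values else 0.0
            std = math.sqrt(var)
            self.means.append(mean)
            # Assumption: a zero spread is replaced by 1 to avoid dividing by zero.
            self.stds.append(std if std > 0 else 1.0)
        return self

    def transform(self, X: List[List[Optional[float]]]) -> List[List[Optional[float]]]:
        return [
            [self.replace_nan_token if v is None else (v - m) / s
             for v, m, s in zip(row, self.means, self.stds)]
            for row in X
        ]

    def inverse_transform(self, X: List[List[Optional[float]]]) -> List[List[Optional[float]]]:
        return [
            [self.replace_nan_token if v is None else v * s + m
             for v, m, s in zip(row, self.means, self.stds)]
            for row in X
        ]


train = [[1.0, 10.0], [3.0, None], [5.0, 10.0]]
scaler = TinyScaler(replace_nan_token=0.0).fit(train)
scaled = scaler.transform(train)
restored = scaler.inverse_transform(scaled)
```

A scaler fitted on training data can then be applied unchanged to validation and test data, so that all splits share the same means and standard deviations.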

Utils

Classes and functions from chemprop.data.utils.py.

chemprop.data.utils.filter_invalid_smiles(data: MoleculeDataset) → MoleculeDataset[source]

Filters out invalid SMILES.

Parameters:: data – A MoleculeDataset.
Returns:: A MoleculeDataset with only the valid molecules.

chemprop.data.utils.get_class_sizes(data: MoleculeDataset, proportion: bool = True) → List[List[float]][source]

Determines the proportions of the different classes in a classification dataset.

Parameters:

data – A classification MoleculeDataset.
proportion – Whether to return class sizes as proportions (True) or as counts (False).

Returns:

A list of lists of class proportions. Each inner list contains the class proportions for a task.
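
Per-task class sizes for binary targets can be sketched as follows. class_sizes is a hypothetical function, and both the [negatives, positives] ordering and the handling of missing targets (None is ignored) are assumptions made for illustration.

```python
from typing import List, Optional


def class_sizes(targets: List[List[Optional[int]]], proportion: bool = True) -> List[List[float]]:
    """Return [negatives, positives] per task, as proportions or counts.

    Illustrative sketch for binary tasks; missing targets (None) are ignored.
    """
    result: List[List[float]] = []
    for task_values in zip(*targets):  # one tuple of values per task
        known = [v for v in task_values if v is not None]
        ones = sum(1 for v in known if v == 1)
        zeros = len(known) - ones
        if proportion:
            # Assumes every task has at least one known target.
            result.append([zeros / len(known), ones / len(known)])
        else:
            result.append([zeros, ones])
    return result


targets = [[1, 0], [0, None], [1, 0], [1, 1]]
proportions = class_sizes(targets)
counts = class_sizes(targets, proportion=False)
```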

chemprop.data.utils.get_constraints(path: str, target_columns: List[str], save_raw_data: bool = False) → Tuple[List[float], List[float]][source]

Returns lists of data constraints for the atomic/bond targets as stored in a CSV file.

Parameters:

path – Path to a CSV file.
target_columns – Name of the columns containing target values.
save_raw_data – Whether to save all user-provided atom/bond-level constraints in input data, which will be used to construct constraints files for each train/val/test split for prediction convenience later.

Returns:

Lists of floats containing the data constraints.

chemprop.data.utils.get_data(path: str, smiles_columns: str | List[str] | None = None, target_columns: List[str] | None = None, ignore_columns: List[str] | None = None, skip_invalid_smiles: bool = True, args: TrainArgs | PredictArgs | None = None, data_weights_path: str | None = None, features_path: List[str] | None = None, features_generator: List[str] | None = None, phase_features_path: str | None = None, atom_descriptors_path: str | None = None, bond_descriptors_path: str | None = None, constraints_path: str | None = None, max_data_size: int | None = None, store_row: bool = False, logger: Logger | None = None, loss_function: str | None = None, skip_none_targets: bool = False) → MoleculeDataset[source]

Gets SMILES and target values from a CSV file.

Parameters:

path – Path to a CSV file.
smiles_columns – The names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
target_columns – Name of the columns containing target values. By default, uses all columns except the smiles_column and the ignore_columns.
ignore_columns – Name of the columns to ignore when target_columns is not provided.
skip_invalid_smiles – Whether to skip and filter out invalid smiles using filter_invalid_smiles().
args – Arguments, either TrainArgs or PredictArgs.
data_weights_path – A path to a file containing weights for each molecule in the loss function.
features_path – A list of paths to files containing features. If provided, it is used in place of args.features_path.
features_generator – A list of features generators to use. If provided, it is used in place of args.features_generator.
phase_features_path – A path to a file containing phase features as applicable to spectra.
atom_descriptors_path – The path to the file containing the custom atom descriptors.
bond_descriptors_path – The path to the file containing the custom bond descriptors.
constraints_path – The path to the file containing constraints applied to different atomic/bond properties.
max_data_size – The maximum number of data points to load.
logger – A logger for recording output.
store_row – Whether to store the raw CSV row in each MoleculeDatapoint.
skip_none_targets – Whether to skip targets that are all ‘None’. This is mostly relevant when --target_columns are passed in, so only a subset of tasks are examined.
loss_function – The loss function to be used in training.

Returns:

A MoleculeDataset containing SMILES and target values along with other info such as additional features when desired.

chemprop.data.utils.get_data_from_smiles(smiles: List[List[str]], skip_invalid_smiles: bool = True, logger: Logger | None = None, features_generator: List[str] | None = None) → MoleculeDataset[source]

Converts a list of SMILES to a MoleculeDataset.

Parameters:

smiles – A list of lists of SMILES with length depending on the number of molecules.
skip_invalid_smiles – Whether to skip and filter out invalid smiles using filter_invalid_smiles()
logger – A logger for recording output.
features_generator – List of features generators.

Returns:

A MoleculeDataset with all of the provided SMILES.

chemprop.data.utils.get_data_weights(path: str) → List[float][source]

Returns the list of data weights for the loss function as stored in a CSV file.

Parameters:: path – Path to a CSV file.
Returns:: A list of floats containing the data weights.

chemprop.data.utils.get_header(path: str) → List[str][source]

Returns the header of a data CSV file.

Parameters:: path – Path to a CSV file.
Returns:: A list of strings containing the strings in the comma-separated header.

chemprop.data.utils.get_inequality_targets(path: str, target_columns: List[str] | None = None) → List[str][source]

chemprop.data.utils.get_invalid_smiles_from_file(path: str | None = None, smiles_columns: str | List[str] | None = None, header: bool = True, reaction: bool = False) → List[str] | List[List[str]][source]

Returns the invalid SMILES from a data CSV file.

Parameters:

path – Path to a CSV file.
smiles_columns – A list of the names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
header – Whether the CSV file contains a header.
reaction – Boolean whether the SMILES strings are to be treated as a reaction.

Returns:

A list of lists of SMILES, for the invalid SMILES in the file.

chemprop.data.utils.get_invalid_smiles_from_list(smiles: List[List[str]], reaction: bool = False) → List[List[str]][source]

Returns the invalid SMILES from a list of lists of SMILES strings.

Parameters:

smiles – A list of lists of SMILES.
reaction – Boolean whether the SMILES strings are to be treated as a reaction.

Returns:

A list of lists of SMILES, for the invalid SMILES among the lists provided.

Gets the task names for atomic, bond, and molecule targets separately from a data CSV file.

If target_columns is provided, the returned lists are based on target_columns. Otherwise, the returned lists are based on all columns except the smiles_columns (or the first column, if smiles_columns is None) and the ignore_columns.

Parameters:

path – Path to a CSV file.
smiles_columns – The names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
target_columns – Name of the columns containing target values. By default, uses all columns except the smiles_columns and the ignore_columns.
ignore_columns – Name of the columns to ignore when target_columns is not provided.
keep_h – Boolean whether to keep hydrogens in the input smiles. This does not add hydrogens, it only keeps them if they are specified.
add_h – Boolean whether to add hydrogens to the input smiles.
keep_atom_map – Boolean whether to keep the original atom mapping.

Returns:

A tuple containing the task names of atomic, bond, and molecule properties separately.

chemprop.data.utils.get_smiles(path: str, smiles_columns: str | List[str] | None = None, number_of_molecules: int = 1, header: bool = True, flatten: bool = False) → List[str] | List[List[str]][source]

Returns the SMILES from a data CSV file.

Parameters:

path – Path to a CSV file.
smiles_columns – A list of the names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
number_of_molecules – The number of molecules for each data point. Not necessary if the names of smiles columns are previously processed.
header – Whether the CSV file contains a header.
flatten – Whether to flatten the returned SMILES to a list instead of a list of lists.

Returns:

A list of SMILES or a list of lists of SMILES, depending on flatten.

Gets the task names from a data CSV file.

If target_columns is provided, returns target_columns. Otherwise, returns all columns except the smiles_columns (or the first column, if the smiles_columns is None) and the ignore_columns.

Parameters:

path – Path to a CSV file.
smiles_columns – The names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
target_columns – Name of the columns containing target values. By default, uses all columns except the smiles_columns and the ignore_columns.
ignore_columns – Name of the columns to ignore when target_columns is not provided.

Returns:

A list of task names.
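
The default column selection described above can be sketched against an in-memory CSV. default_task_names is a hypothetical function that takes an already-read header instead of a file path, and the CSV content is made up for illustration.

```python
import csv
import io
from typing import List, Optional


def default_task_names(
    header: List[str],
    smiles_columns: Optional[List[str]] = None,
    target_columns: Optional[List[str]] = None,
    ignore_columns: Optional[List[str]] = None,
) -> List[str]:
    """Pick task names from a CSV header using the defaults described above."""
    if target_columns is not None:
        return target_columns
    # With no SMILES columns given, fall back to the first column.
    excluded = set(smiles_columns if smiles_columns is not None else header[:1])
    excluded.update(ignore_columns or [])
    return [name for name in header if name not in excluded]


csv_text = "smiles,id,logP,solubility\nCCO,1,-0.3,1.2\n"
header = next(csv.reader(io.StringIO(csv_text)))
tasks = default_task_names(header, ignore_columns=["id"])
explicit = default_task_names(header, target_columns=["logP"])
```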

chemprop.data.utils.preprocess_smiles_columns(path: str, smiles_columns: str | List[str] | None = None, number_of_molecules: int = 1) → List[str][source]

Preprocesses the smiles_columns variable to ensure that it is a list of column headings corresponding to the columns in the data file holding SMILES. Assumes file has a header.

Parameters:

path – Path to a CSV file.
smiles_columns – The names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
number_of_molecules – The number of molecules with associated SMILES for each data point.

Returns:

The preprocessed version of smiles_columns which is guaranteed to be a list.
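
The normalization of smiles_columns into a list can be sketched as below. normalize_smiles_columns is a hypothetical function that works on an already-read header rather than a file path, and it omits any validation that the named columns actually exist.

```python
from typing import List, Optional, Union


def normalize_smiles_columns(
    header: List[str],
    smiles_columns: Optional[Union[str, List[str]]] = None,
    number_of_molecules: int = 1,
) -> List[str]:
    """Return smiles_columns as a list, defaulting to the leading columns."""
    if smiles_columns is None:
        # Default: the first number_of_molecules columns of the header.
        return header[:number_of_molecules]
    if isinstance(smiles_columns, str):
        return [smiles_columns]
    return list(smiles_columns)


header = ["solvent", "solute", "energy"]
first_only = normalize_smiles_columns(header)
first_two = normalize_smiles_columns(header, number_of_molecules=2)
from_string = normalize_smiles_columns(header, smiles_columns="solute")
```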

chemprop.data.utils.split_data(data: MoleculeDataset, split_type: str = 'random', sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1), key_molecule_index: int = 0, seed: int = 0, num_folds: int = 1, args: TrainArgs | None = None, logger: Logger | None = None) → Tuple[MoleculeDataset, MoleculeDataset, MoleculeDataset][source]

Splits data into training, validation, and test splits.

Parameters:

data – A MoleculeDataset.
split_type – Split type.
sizes – A length-3 tuple with the proportions of data in the train, validation, and test sets.
key_molecule_index – For data with multiple molecules, this sets which molecule will be considered during splitting.
seed – The random seed to use before shuffling data.
num_folds – Number of folds to create (only needed for “cv” split type).
args – A TrainArgs object.
logger – A logger for recording output.

Returns:

A tuple of MoleculeDatasets containing the train, validation, and test splits of the data.
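
For the 'random' split type, the interplay of sizes and seed can be sketched with index lists. random_split is a hypothetical function operating on indices only; the truncation of the cut points to integers is an assumption made for this sketch.

```python
import random
from typing import List, Sequence, Tuple


def random_split(
    n_items: int,
    sizes: Sequence[float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices with a seeded generator and cut them by proportion."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    train_end = int(sizes[0] * n_items)
    val_end = int((sizes[0] + sizes[1]) * n_items)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train, val, test = random_split(20, seed=0)
repeat = random_split(20, seed=0)
```

Using the same seed reproduces the same split, which makes results comparable across runs.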

chemprop.data.utils.validate_data(data_path: str) → Set[str][source]

Validates a data CSV file, returning a set of errors.

Parameters:: data_path – Path to a data CSV file.
Returns:: A set of error messages.

chemprop.data.utils.validate_dataset_type(data: MoleculeDataset, dataset_type: str) → None[source]

Validates the dataset type to ensure the data matches the provided type.

Parameters:

data – A MoleculeDataset.
dataset_type – The dataset type to check.