Data
chemprop.data contains functions and classes for loading, containing, and splitting data.
Data
Classes and functions from chemprop.data.data.py.
- class chemprop.data.data.MoleculeDataLoader(dataset: MoleculeDataset, batch_size: int = 50, num_workers: int = 8, class_balance: bool = False, shuffle: bool = False, seed: int = 0)[source]
A MoleculeDataLoader is a PyTorch DataLoader for loading a MoleculeDataset.
- Parameters:
dataset – The MoleculeDataset containing the molecules to load.
batch_size – Batch size.
num_workers – Number of workers used to build batches.
class_balance – Whether to perform class balancing (i.e., use an equal number of positive and negative molecules). Class balance is only available for single task classification datasets. Set shuffle to True in order to get a random subset of the larger class.
shuffle – Whether to shuffle the data.
seed – Random seed. Only needed if shuffle is True.
- property gt_targets: List[List[bool | None]]
Returns booleans for whether each target is a greater-than inequality (of the form “>x”) rather than a value target, associated with each molecule.
- Returns:
A list of lists of booleans (or None) containing the targets.
- property iter_size: int
Returns the number of data points included in each full iteration through the MoleculeDataLoader.
- property lt_targets: List[List[bool | None]]
Returns booleans for whether each target is a less-than inequality (of the form “<x”) rather than a value target, associated with each molecule.
- Returns:
A list of lists of booleans (or None) containing the targets.
- property targets: List[List[float | None]]
Returns the targets associated with each molecule.
- Returns:
A list of lists of floats (or None) containing the targets.
- class chemprop.data.data.MoleculeDatapoint(smiles: List[str], targets: List[float | None] | None = None, atom_targets: List[float | None] | None = None, bond_targets: List[float | None] | None = None, row: OrderedDict | None = None, data_weight: float | None = None, gt_targets: List[List[bool]] | None = None, lt_targets: List[List[bool]] | None = None, features: ndarray | None = None, features_generator: List[str] | None = None, phase_features: List[float] | None = None, atom_features: ndarray | None = None, atom_descriptors: ndarray | None = None, bond_features: ndarray | None = None, bond_descriptors: ndarray | None = None, raw_constraints: ndarray | None = None, constraints: ndarray | None = None, overwrite_default_atom_features: bool = False, overwrite_default_bond_features: bool = False)[source]
A MoleculeDatapoint contains a single molecule and its associated features and targets.
- Parameters:
smiles – A list of the SMILES strings for the molecules.
targets – A list of targets for the molecule (contains None for unknown target values).
atom_targets – A list of targets for the atomic properties.
bond_targets – A list of targets for the bond properties.
row – The raw CSV row containing the information for this molecule.
data_weight – Weighting of the datapoint for the loss function.
gt_targets – Indicates whether the targets are an inequality regression target of the form “>x”.
lt_targets – Indicates whether the targets are an inequality regression target of the form “<x”.
features – A numpy array containing additional features (e.g., Morgan fingerprint).
features_generator – A list of features generators to use.
phase_features – A one-hot vector indicating the phase of the data, as used in spectra data.
atom_descriptors – A numpy array containing additional atom descriptors to featurize the molecule.
bond_descriptors – A numpy array containing additional bond descriptors to featurize the molecule.
raw_constraints – A numpy array containing all user-provided atom/bond-level constraints in input data.
constraints – A numpy array containing atom/bond-level constraints that are used in training. Param constraints is a subset of param raw_constraints.
overwrite_default_atom_features – Boolean to overwrite default atom features by atom_features.
overwrite_default_bond_features – Boolean to overwrite default bond features by bond_features.
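The relationship between targets, gt_targets, and lt_targets can be pictured with a small parser. This is only a sketch, not chemprop's implementation: `parse_inequality_target` is a hypothetical helper, and the assumption that raw values arrive as strings such as ">5.0" or "<2.5" is taken only from the parameter descriptions above.

```python
from typing import Optional, Tuple


def parse_inequality_target(raw: str) -> Tuple[Optional[float], bool, bool]:
    """Split a raw target string into (value, is_gt, is_lt).

    Hypothetical helper: assumes ">x" marks a greater-than target,
    "<x" a less-than target, and an empty string an unknown target.
    """
    raw = raw.strip()
    if raw == "":
        return None, False, False
    if raw.startswith(">"):
        return float(raw[1:]), True, False
    if raw.startswith("<"):
        return float(raw[1:]), False, True
    return float(raw), False, False


row = [">5.0", "<2.5", "3.1", ""]
parsed = [parse_inequality_target(cell) for cell in row]
targets = [value for value, _, _ in parsed]
gt_targets = [is_gt for _, is_gt, _ in parsed]
lt_targets = [is_lt for _, _, is_lt in parsed]
```

Each of the three lists has one entry per task, so the flags stay aligned with the numeric targets.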
- property bond_types: List[List[float]]
Gets the bond types in the MoleculeDatapoint.
- Returns:
A list of bond types for each molecule.
- extend_features(features: ndarray) None [source]
Extends the features of the molecule.
- Parameters:
features – A 1D numpy array of extra features for the molecule.
- property max_molwt: float
Gets the maximum molecular weight among all the molecules in the MoleculeDatapoint.
- Returns:
The maximum molecular weight.
- property mol: List[Mol | Tuple[Mol, Mol]]
Gets the corresponding list of RDKit molecules for the corresponding SMILES list.
- property number_of_atoms: int
Gets the number of atoms in the MoleculeDatapoint.
- Returns:
A list of number of atoms for each molecule.
- property number_of_bonds: List[int]
Gets the number of bonds in the MoleculeDatapoint.
- Returns:
A list of number of bonds for each molecule.
- property number_of_molecules: int
Gets the number of molecules in the MoleculeDatapoint.
- Returns:
The number of molecules.
- reset_features_and_targets() None [source]
Resets the features (atom, bond, and molecule) and targets to their raw values.
- set_atom_descriptors(atom_descriptors: ndarray) None [source]
Sets the atom descriptors of the molecule.
- Parameters:
atom_descriptors – A 1D numpy array of atom descriptors for the molecule.
- set_atom_features(atom_features: ndarray) None [source]
Sets the atom features of the molecule.
- Parameters:
atom_features – A 1D numpy array of atom features for the molecule.
- set_bond_descriptors(bond_descriptors: ndarray) None [source]
Sets the bond descriptors of the molecule.
- Parameters:
bond_descriptors – A 1D numpy array of bond descriptors for the molecule.
- set_bond_features(bond_features: ndarray) None [source]
Sets the bond features of the molecule.
- Parameters:
bond_features – A 1D numpy array of bond features for the molecule.
- class chemprop.data.data.MoleculeDataset(data: List[MoleculeDatapoint])[source]
A MoleculeDataset contains a list of MoleculeDatapoints with access to their attributes.
- Parameters:
data – A list of MoleculeDatapoints.
- atom_bond_data_weights() List[List[float]] [source]
Returns the loss weighting associated with each datapoint for atomic/bond properties prediction.
- atom_descriptors() List[ndarray] [source]
Returns the atom descriptors associated with each molecule (if they exist).
- Returns:
A list of 2D numpy arrays containing the atom descriptors for each molecule or None if there are no features.
- atom_descriptors_size() int [source]
Returns the size of custom additional atom descriptors vector associated with the molecules.
- Returns:
The size of the additional atom descriptor vector.
- atom_features() List[ndarray] [source]
Returns the atom features associated with each molecule (if they exist).
- Returns:
A list of 2D numpy arrays containing the atom features for each molecule or None if there are no features.
- atom_features_size() int [source]
Returns the size of custom additional atom features vector associated with the molecules.
- Returns:
The size of the additional atom feature vector.
- batch_graph() List[BatchMolGraph] [source]
Constructs a BatchMolGraph with the graph featurization of all the molecules.
Note
The BatchMolGraph is cached after the first time it is computed and is simply accessed upon subsequent calls to batch_graph(). This means that if the underlying set of MoleculeDatapoints changes, then the returned BatchMolGraph will be incorrect for the underlying data.
- Returns:
A list of BatchMolGraph containing the graph featurization of all the molecules in each MoleculeDatapoint.
- bond_descriptors() List[ndarray] [source]
Returns the bond descriptors associated with each molecule (if they exist).
- Returns:
A list of 2D numpy arrays containing the bond descriptors for each molecule or None if there are no features.
- bond_descriptors_size() int [source]
Returns the size of custom additional bond descriptors vector associated with the molecules.
- Returns:
The size of the additional bond descriptor vector.
- bond_features() List[ndarray] [source]
Returns the bond features associated with each molecule (if they exist).
- Returns:
A list of 2D numpy arrays containing the bond features for each molecule or None if there are no features.
- bond_features_size() int [source]
Returns the size of custom additional bond features vector associated with the molecules.
- Returns:
The size of the additional bond feature vector.
- property bond_types: List[List[float]]
Gets the bond types in each MoleculeDatapoint.
- Returns:
A list of bond types for each molecule.
- constraints() List[ndarray] [source]
Returns the constraints applied in atomic/bond properties prediction.
- features() List[ndarray] [source]
Returns the features associated with each molecule (if they exist).
- Returns:
A list of 1D numpy arrays containing the features for each molecule or None if there are no features.
- features_size() int [source]
Returns the size of the additional features vector associated with the molecules.
- Returns:
The size of the additional features vector.
- gt_targets() List[ndarray] [source]
Returns indications of whether the targets associated with each molecule are greater-than inequalities.
- Returns:
A list of lists of booleans indicating whether the targets in those positions are greater-than inequality targets.
- property is_atom_bond_targets: bool
Gets a Boolean indicating whether this dataset is for atomic/bond properties prediction.
- Returns:
A Boolean value.
- lt_targets() List[ndarray] [source]
Returns indications of whether the targets associated with each molecule are less-than inequalities.
- Returns:
A list of lists of booleans indicating whether the targets in those positions are less-than inequality targets.
- mask() List[List[bool]] [source]
Returns whether the targets associated with each molecule and task are present.
- Returns:
A list of lists of booleans associated with targets.
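As a rough illustration of what such a mask looks like, the sketch below marks every target that is present. It assumes rows correspond to molecules and columns to tasks, with None standing for a missing value; `target_mask` is a hypothetical stand-in, not the library method.

```python
from typing import List, Optional


def target_mask(targets: List[List[Optional[float]]]) -> List[List[bool]]:
    """Return True where a target value is present and False where it is None."""
    return [[value is not None for value in row] for row in targets]


targets = [[1.0, None], [None, 0.0], [0.5, 2.0]]
mask = target_mask(targets)
```

A mask of this shape is typically used to exclude missing targets from a loss or metric.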
- mols(flatten: bool = False) List[Mol] | List[List[Mol]] | List[Tuple[Mol, Mol]] | List[List[Tuple[Mol, Mol]]] [source]
Returns a list of the RDKit molecules associated with each MoleculeDatapoint.
- Parameters:
flatten – Whether to flatten the returned RDKit molecules to a list instead of a list of lists.
- Returns:
A list of RDKit molecules or a list of lists of RDKit molecules, depending on flatten.
- normalize_atom_bond_targets() AtomBondScaler [source]
Normalizes the targets of the dataset using an AtomBondScaler.
The AtomBondScaler subtracts the mean and divides by the standard deviation for each task independently.
This should only be used for regression datasets.
- Returns:
An AtomBondScaler fitted to the targets.
- normalize_features(scaler: StandardScaler | None = None, replace_nan_token: int = 0, scale_atom_descriptors: bool = False, scale_bond_descriptors: bool = False) StandardScaler [source]
Normalizes the features of the dataset using a StandardScaler.
The StandardScaler subtracts the mean and divides by the standard deviation for each feature independently.
If a StandardScaler is provided, it is used to perform the normalization. Otherwise, a StandardScaler is first fit to the features in this dataset and is then used to perform the normalization.
- Parameters:
scaler – A fitted StandardScaler. If it is provided it is used, otherwise a new StandardScaler is first fitted to this data and is then used.
replace_nan_token – A token to use to replace NaN entries in the features.
scale_atom_descriptors – Whether the features to be scaled are atom descriptors rather than molecule-level features.
scale_bond_descriptors – Whether the features to be scaled are bond descriptors rather than molecule-level features.
- Returns:
A fitted StandardScaler. If a StandardScaler is provided as a parameter, this is the same StandardScaler. Otherwise, this is a new StandardScaler that has been fit on this dataset.
- normalize_targets() StandardScaler [source]
Normalizes the targets of the dataset using a StandardScaler.
The StandardScaler subtracts the mean and divides by the standard deviation for each task independently.
This should only be used for regression datasets.
- Returns:
A StandardScaler fitted to the targets.
- property number_of_atoms: List[List[int]]
Gets the number of atoms in each MoleculeDatapoint.
- Returns:
A list of number of atoms for each molecule.
- property number_of_bonds: List[List[int]]
Gets the number of bonds in each MoleculeDatapoint.
- Returns:
A list of number of bonds for each molecule.
- property number_of_molecules: int
Gets the number of molecules in each MoleculeDatapoint.
- Returns:
The number of molecules.
- phase_features() List[ndarray] [source]
Returns the phase features associated with each molecule (if they exist).
- Returns:
A list of 1D numpy arrays containing the phase features for each molecule or None if there are no features.
- reset_features_and_targets() None [source]
Resets the features (atom, bond, and molecule) and targets to their raw values.
- set_targets(targets: List[List[float | None]]) None [source]
Sets the targets for each molecule in the dataset. Assumes the targets are aligned with the datapoints.
- Parameters:
targets – A list of lists of floats (or None) containing targets for each molecule. This must be the same length as the underlying dataset.
- smiles(flatten: bool = False) List[str] | List[List[str]] [source]
Returns a list containing the SMILES list associated with each MoleculeDatapoint.
- Parameters:
flatten – Whether to flatten the returned SMILES to a list instead of a list of lists.
- Returns:
A list of SMILES or a list of lists of SMILES, depending on flatten.
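The effect of flatten can be shown on plain lists. The following is a minimal sketch with a hypothetical helper `flatten_smiles`; it assumes each inner list holds the SMILES of one datapoint, as the description above suggests.

```python
from typing import List


def flatten_smiles(smiles: List[List[str]]) -> List[str]:
    """Concatenate the per-datapoint SMILES lists into one flat list, keeping order."""
    return [s for datapoint in smiles for s in datapoint]


# Two datapoints, each made of two molecules.
nested = [["CCO", "O"], ["c1ccccc1", "CC(=O)O"]]
flat = flatten_smiles(nested)
```

The nested form keeps track of which molecules belong to the same datapoint, while the flat form is convenient when every molecule is processed independently.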
- class chemprop.data.data.MoleculeSampler(dataset: MoleculeDataset, class_balance: bool = False, shuffle: bool = False, seed: int = 0)[source]
A MoleculeSampler samples data from a MoleculeDataset for a MoleculeDataLoader.
- Parameters:
class_balance – Whether to perform class balancing (i.e., use an equal number of positive and negative molecules). Set shuffle to True in order to get a random subset of the larger class.
shuffle – Whether to shuffle the data.
seed – Random seed. Only needed if shuffle is True.
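One way to picture class balancing is as an index generator that pairs positives with negatives. The sketch below is an illustration under stated assumptions (a single binary task with labels 0 or 1), not the sampler's actual code; `balanced_indices` is a hypothetical name.

```python
import random
from typing import Iterator, List


def balanced_indices(labels: List[int], shuffle: bool = False, seed: int = 0) -> Iterator[int]:
    """Yield indices that alternate positive and negative examples.

    Illustrative only: the larger class is truncated to the size of the
    smaller one. With shuffle=True the kept subset is random rather than
    simply the first items of the larger class.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if shuffle:
        rng.shuffle(positives)
        rng.shuffle(negatives)
    for pos, neg in zip(positives, negatives):  # zip stops at the shorter list
        yield pos
        yield neg


labels = [1, 0, 0, 0, 1, 0]
order = list(balanced_indices(labels))
shuffled_order = list(balanced_indices(labels, shuffle=True, seed=1))
```

Without shuffling, the same leading items of the larger class are chosen on every pass, which is why the parameter description recommends setting shuffle to True.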
- chemprop.data.data.construct_molecule_batch(data: List[MoleculeDatapoint]) MoleculeDataset [source]
Constructs a MoleculeDataset from a list of MoleculeDatapoints.
Additionally, precomputes the BatchMolGraph for the constructed MoleculeDataset.
- Parameters:
data – A list of MoleculeDatapoints.
- Returns:
A MoleculeDataset containing all the MoleculeDatapoints.
- chemprop.data.data.make_mols(smiles: List[str], reaction_list: List[bool], keep_h_list: List[bool], add_h_list: List[bool], keep_atom_map_list: List[bool])[source]
Builds a list of RDKit molecules (or a list of tuples of molecules if reaction is True) for a list of smiles.
- Parameters:
smiles – List of SMILES strings.
reaction_list – List of booleans whether the SMILES strings are to be treated as a reaction.
keep_h_list – List of booleans whether to keep hydrogens in the input smiles. This does not add hydrogens, it only keeps them if they are specified.
add_h_list – List of booleans whether to add hydrogens to the input smiles.
keep_atom_map_list – List of booleans whether to keep the original atom mapping.
- Returns:
List of RDKit molecules or list of tuple of molecules.
Scaffold
Classes and functions from chemprop.data.scaffold.py.
- chemprop.data.scaffold.generate_scaffold(mol: str | Mol | Tuple[Mol, Mol], include_chirality: bool = False) str [source]
Computes the Bemis-Murcko scaffold for a SMILES string.
- Parameters:
mol – A SMILES or an RDKit molecule.
include_chirality – Whether to include chirality in the computed scaffold.
- Returns:
The Bemis-Murcko scaffold for the molecule.
- chemprop.data.scaffold.log_scaffold_stats(data: MoleculeDataset, index_sets: List[Set[int]], num_scaffolds: int = 10, num_labels: int = 20, logger: Logger | None = None) List[Tuple[List[float], List[int]]] [source]
Logs and returns statistics about counts and average target values in molecular scaffolds.
- Parameters:
data – A MoleculeDataset.
index_sets – A list of sets of indices representing splits of the data.
num_scaffolds – The number of scaffolds about which to display statistics.
num_labels – The number of labels about which to display statistics.
logger – A logger for recording output.
- Returns:
A list of tuples where each tuple contains a list of average target values across the first num_labels labels and a list of the number of non-zero values for the first num_scaffolds scaffolds, sorted in decreasing order of scaffold frequency.
- chemprop.data.scaffold.scaffold_split(data: MoleculeDataset, sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1), balanced: bool = False, key_molecule_index: int = 0, seed: int = 0, logger: Logger | None = None) Tuple[MoleculeDataset, MoleculeDataset, MoleculeDataset] [source]
Splits a MoleculeDataset by scaffold so that no molecules sharing a scaffold are in different splits.
- Parameters:
data – A MoleculeDataset.
sizes – A length-3 tuple with the proportions of data in the train, validation, and test sets.
balanced – Whether to balance the sizes of scaffolds in each set rather than putting the smallest in test set.
key_molecule_index – For data with multiple molecules, this sets which molecule will be considered during splitting.
seed – Random seed for shuffling when doing balanced splitting.
logger – A logger for recording output.
- Returns:
A tuple of MoleculeDatasets containing the train, validation, and test splits of the data.
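The core idea of a scaffold split, keeping every scaffold group inside one split, can be sketched without RDKit. The example below assumes the scaffold of each datapoint has already been computed and is given as a string key; `group_split` is a hypothetical helper, and the "largest groups first" ordering is one simple choice, not necessarily what the library does in every mode.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_split(
    scaffold_keys: Sequence[str],
    sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1),
) -> Tuple[List[int], List[int], List[int]]:
    """Assign whole scaffold groups to train, validation, or test.

    Hypothetical helper: scaffold_keys[i] is a precomputed scaffold string
    for datapoint i. Groups are placed largest first, and a group goes to
    the first split that still has room for it.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for index, key in enumerate(scaffold_keys):
        groups[key].append(index)

    train_cap = sizes[0] * len(scaffold_keys)
    val_cap = sizes[1] * len(scaffold_keys)
    train: List[int] = []
    val: List[int] = []
    test: List[int] = []
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= train_cap:
            train += members
        elif len(val) + len(members) <= val_cap:
            val += members
        else:
            test += members
    return train, val, test


# Ten datapoints with four distinct (made-up) scaffold keys.
keys = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train, val, test = group_split(keys)
```

Because groups are never divided, the realized split sizes only approximate the requested proportions when a few scaffolds dominate the data.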
- chemprop.data.scaffold.scaffold_to_smiles(mols: List[str] | List[Mol] | List[Tuple[Mol, Mol]], use_indices: bool = False) Dict[str, Set[str] | Set[int]] [source]
Computes the scaffold for each SMILES and returns a mapping from scaffolds to sets of smiles (or indices).
- Parameters:
mols – A list of SMILES or RDKit molecules.
use_indices – Whether to map to the SMILES’s index in mols rather than mapping to the smiles string itself. This is necessary if there are duplicate smiles.
- Returns:
A dictionary mapping each unique scaffold to all SMILES (or indices) which have that scaffold.
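The reason use_indices matters for duplicates shows up in a toy grouping. In this sketch the scaffold computation is replaced by a lookup table passed in as `scaffold_fn`; both `group_by_scaffold` and the table are hypothetical stand-ins, so nothing here calls RDKit.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set, Union


def group_by_scaffold(
    smiles: List[str],
    scaffold_fn: Callable[[str], str],
    use_indices: bool = False,
) -> Dict[str, Union[Set[str], Set[int]]]:
    """Map each scaffold to the set of SMILES (or positions) that share it."""
    groups = defaultdict(set)
    for index, smi in enumerate(smiles):
        groups[scaffold_fn(smi)].add(index if use_indices else smi)
    return dict(groups)


# Stand-in lookup table for illustration, not a real scaffold computation.
toy_scaffolds = {"Cc1ccccc1": "c1ccccc1", "Oc1ccccc1": "c1ccccc1", "CCO": ""}
smiles = ["Cc1ccccc1", "Oc1ccccc1", "Cc1ccccc1", "CCO"]

by_smiles = group_by_scaffold(smiles, toy_scaffolds.__getitem__)
by_index = group_by_scaffold(smiles, toy_scaffolds.__getitem__, use_indices=True)
```

The duplicated first and third entries collapse into one element of `by_smiles`, while `by_index` keeps both positions, which is what a splitter needs to place every datapoint.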
Scaler
Classes and functions from chemprop.data.scaler.py.
- class chemprop.data.scaler.AtomBondScaler(means: ndarray | None = None, stds: ndarray | None = None, replace_nan_token: Any | None = None, n_atom_targets=None, n_bond_targets=None)[source]
An AtomBondScaler normalizes the features of a dataset.
When it is fit on a dataset, the AtomBondScaler learns the mean and standard deviation across the 0th axis. When transforming a dataset, the AtomBondScaler subtracts the means and divides by the standard deviations.
- Parameters:
means – An optional 1D numpy array of precomputed means.
stds – An optional 1D numpy array of precomputed standard deviations.
replace_nan_token – A token to use to replace NaN entries in the features.
- fit(X: List[List[float | None]]) AtomBondScaler [source]
Learns means and standard deviations across the 0th axis of the data X.
- Parameters:
X – A list of lists of floats (or None).
- Returns:
The fitted AtomBondScaler (self).
- class chemprop.data.scaler.StandardScaler(means: ndarray | None = None, stds: ndarray | None = None, replace_nan_token: Any | None = None)[source]
A StandardScaler normalizes the features of a dataset.
When it is fit on a dataset, the StandardScaler learns the mean and standard deviation across the 0th axis. When transforming a dataset, the StandardScaler subtracts the means and divides by the standard deviations.
- Parameters:
means – An optional 1D numpy array of precomputed means.
stds – An optional 1D numpy array of precomputed standard deviations.
replace_nan_token – A token to use to replace NaN entries in the features.
- fit(X: List[List[float | None]]) StandardScaler [source]
Learns means and standard deviations across the 0th axis of the data X.
- Parameters:
X – A list of lists of floats (or None).
- Returns:
The fitted StandardScaler (self).
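The fit-then-transform behaviour described above can be sketched in pure Python. `SimpleScaler` is a hypothetical class written for illustration; in particular, skipping None entries during fitting and substituting a standard deviation of 1.0 for constant columns are assumptions of this sketch, not documented behaviour.

```python
import math
from typing import List, Optional


class SimpleScaler:
    """Column-wise standardization that tolerates None entries (illustrative)."""

    def __init__(self, replace_nan_token: Optional[float] = None):
        self.means: List[float] = []
        self.stds: List[float] = []
        self.replace_nan_token = replace_nan_token

    def fit(self, X: List[List[Optional[float]]]) -> "SimpleScaler":
        self.means, self.stds = [], []
        for column in zip(*X):  # walk over columns, i.e. statistics along axis 0
            values = [v for v in column if v is not None]
            mean = sum(values) / len(values) if values else 0.0
            var = sum((v - mean) ** 2 for v in values) / len(values) if values else 0.0
            std = math.sqrt(var)
            self.means.append(mean)
            self.stds.append(std if std > 0 else 1.0)  # assumption: avoid dividing by zero
        return self  # returning self allows SimpleScaler().fit(X).transform(X)

    def transform(self, X: List[List[Optional[float]]]) -> List[List[Optional[float]]]:
        return [
            [
                self.replace_nan_token if v is None else (v - m) / s
                for v, m, s in zip(row, self.means, self.stds)
            ]
            for row in X
        ]


X = [[1.0, 10.0], [3.0, None], [5.0, 10.0]]
scaler = SimpleScaler(replace_nan_token=0.0).fit(X)
scaled = scaler.transform(X)
```

Fitting on the training data and reusing the same fitted object for validation and test data keeps all splits on a common scale.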
Utils
Classes and functions from chemprop.data.utils.py.
- chemprop.data.utils.filter_invalid_smiles(data: MoleculeDataset) MoleculeDataset [source]
Filters out invalid SMILES.
- Parameters:
data – A MoleculeDataset.
- Returns:
A MoleculeDataset with only the valid molecules.
- chemprop.data.utils.get_class_sizes(data: MoleculeDataset, proportion: bool = True) List[List[float]] [source]
Determines the proportions of the different classes in a classification dataset.
- Parameters:
data – A classification MoleculeDataset.
proportion – Choice of whether to return proportions for class size or counts.
- Returns:
A list of lists of class proportions. Each inner list contains the class proportions for a task.
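A per-task class count is simple to sketch on a plain target matrix. The example assumes binary labels 0 and 1 with None for missing values; `class_sizes` is a hypothetical helper that mirrors the proportion switch described above, not the library function itself.

```python
from collections import Counter
from typing import List, Optional


def class_sizes(targets: List[List[Optional[int]]], proportion: bool = True) -> List[List[float]]:
    """Per-task sizes of class 0 and class 1, ignoring missing targets."""
    result = []
    for column in zip(*targets):  # one column per task
        counts = Counter(v for v in column if v is not None)
        zeros, ones = counts.get(0, 0), counts.get(1, 0)
        total = zeros + ones
        if proportion:
            result.append([zeros / total, ones / total] if total else [0.0, 0.0])
        else:
            result.append([float(zeros), float(ones)])
    return result


targets = [[1, 0], [0, None], [0, 1], [0, 1]]
proportions = class_sizes(targets)
counts = class_sizes(targets, proportion=False)
```

Proportions are computed over the known labels of each task, so a task with many missing values is not skewed toward either class.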
- chemprop.data.utils.get_constraints(path: str, target_columns: List[str], save_raw_data: bool = False) Tuple[List[float], List[float]] [source]
Returns lists of data constraints for the atomic/bond targets as stored in a CSV file.
- Parameters:
path – Path to a CSV file.
target_columns – Name of the columns containing target values.
save_raw_data – Whether to save all user-provided atom/bond-level constraints in input data, which will be used to construct constraints files for each train/val/test split for prediction convenience later.
- Returns:
Lists of floats containing the data constraints.
- chemprop.data.utils.get_data(path: str, smiles_columns: str | List[str] | None = None, target_columns: List[str] | None = None, ignore_columns: List[str] | None = None, skip_invalid_smiles: bool = True, args: TrainArgs | PredictArgs | None = None, data_weights_path: str | None = None, features_path: List[str] | None = None, features_generator: List[str] | None = None, phase_features_path: str | None = None, atom_descriptors_path: str | None = None, bond_descriptors_path: str | None = None, constraints_path: str | None = None, max_data_size: int | None = None, store_row: bool = False, logger: Logger | None = None, loss_function: str | None = None, skip_none_targets: bool = False) MoleculeDataset [source]
Gets SMILES and target values from a CSV file.
- Parameters:
path – Path to a CSV file.
smiles_columns – The names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
target_columns – Name of the columns containing target values. By default, uses all columns except the smiles_columns and the ignore_columns.
ignore_columns – Name of the columns to ignore when target_columns is not provided.
skip_invalid_smiles – Whether to skip and filter out invalid smiles using filter_invalid_smiles().
args – Arguments, either TrainArgs or PredictArgs.
data_weights_path – A path to a file containing weights for each molecule in the loss function.
features_path – A list of paths to files containing features. If provided, it is used in place of args.features_path.
features_generator – A list of features generators to use. If provided, it is used in place of args.features_generator.
phase_features_path – A path to a file containing phase features as applicable to spectra.
atom_descriptors_path – The path to the file containing the custom atom descriptors.
bond_descriptors_path – The path to the file containing the custom bond descriptors.
constraints_path – The path to the file containing constraints applied to different atomic/bond properties.
max_data_size – The maximum number of data points to load.
logger – A logger for recording output.
store_row – Whether to store the raw CSV row in each MoleculeDatapoint.
skip_none_targets – Whether to skip targets that are all ‘None’. This is mostly relevant when --target_columns are passed in, so only a subset of tasks are examined.
loss_function – The loss function to be used in training.
- Returns:
A MoleculeDataset containing SMILES and target values along with other info such as additional features when desired.
- chemprop.data.utils.get_data_from_smiles(smiles: List[List[str]], skip_invalid_smiles: bool = True, logger: Logger | None = None, features_generator: List[str] | None = None) MoleculeDataset [source]
Converts a list of SMILES to a MoleculeDataset.
- Parameters:
smiles – A list of lists of SMILES with length depending on the number of molecules.
skip_invalid_smiles – Whether to skip and filter out invalid smiles using filter_invalid_smiles().
logger – A logger for recording output.
features_generator – List of features generators.
- Returns:
A MoleculeDataset with all of the provided SMILES.
- chemprop.data.utils.get_data_weights(path: str) List[float] [source]
Returns the list of data weights for the loss function as stored in a CSV file.
- Parameters:
path – Path to a CSV file.
- Returns:
A list of floats containing the data weights.
- chemprop.data.utils.get_header(path: str) List[str] [source]
Returns the header of a data CSV file.
- Parameters:
path – Path to a CSV file.
- Returns:
A list of strings containing the strings in the comma-separated header.
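Reading a header row needs only the standard csv module. The sketch below uses a hypothetical helper `read_header` and a throwaway file with made-up column names; it is not a copy of the library function.

```python
import csv
import os
import tempfile
from typing import List


def read_header(path: str) -> List[str]:
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="") as f:
        return next(csv.reader(f))


# Write a small throwaway CSV so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("smiles,logP,tox\nCCO,1.0,0\n")
    csv_path = tmp.name

header = read_header(csv_path)
os.remove(csv_path)  # clean up the temporary file
```

Using csv.reader rather than splitting on commas keeps quoted column names that contain commas intact.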
- chemprop.data.utils.get_inequality_targets(path: str, target_columns: List[str] | None = None) List[str] [source]
- chemprop.data.utils.get_invalid_smiles_from_file(path: str | None = None, smiles_columns: str | List[str] | None = None, header: bool = True, reaction: bool = False) List[str] | List[List[str]] [source]
Returns the invalid SMILES from a data CSV file.
- Parameters:
path – Path to a CSV file.
smiles_columns – A list of the names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
header – Whether the CSV file contains a header.
reaction – Boolean whether the SMILES strings are to be treated as a reaction.
- Returns:
A list of lists of SMILES, for the invalid SMILES in the file.
- chemprop.data.utils.get_invalid_smiles_from_list(smiles: List[List[str]], reaction: bool = False) List[List[str]] [source]
Returns the invalid SMILES from a list of lists of SMILES strings.
- Parameters:
smiles – A list of lists of SMILES.
reaction – Boolean whether the SMILES strings are to be treated as a reaction.
- Returns:
A list of lists of SMILES, for the invalid SMILES among the lists provided.
- chemprop.data.utils.get_mixed_task_names(path: str, smiles_columns: str | List[str] | None = None, target_columns: List[str] | None = None, ignore_columns: List[str] | None = None, keep_h: bool | None = None, add_h: bool | None = None, keep_atom_map: bool | None = None) Tuple[List[str], List[str], List[str]] [source]
Gets the task names for atomic, bond, and molecule targets separately from a data CSV file.
If target_columns is provided, the returned lists are based on target_columns. Otherwise, the returned lists are based on all columns except the smiles_columns (or the first column, if smiles_columns is None) and the ignore_columns.
- Parameters:
path – Path to a CSV file.
smiles_columns – The names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
target_columns – Name of the columns containing target values. By default, uses all columns except the smiles_columns and the ignore_columns.
ignore_columns – Name of the columns to ignore when target_columns is not provided.
keep_h – Boolean whether to keep hydrogens in the input smiles. This does not add hydrogens, it only keeps them if they are specified.
add_h – Boolean whether to add hydrogens to the input smiles.
keep_atom_map – Boolean whether to keep the original atom mapping.
- Returns:
A tuple containing the task names of atomic, bond, and molecule properties separately.
- chemprop.data.utils.get_smiles(path: str, smiles_columns: str | List[str] | None = None, number_of_molecules: int = 1, header: bool = True, flatten: bool = False) List[str] | List[List[str]] [source]
Returns the SMILES from a data CSV file.
- Parameters:
path – Path to a CSV file.
smiles_columns – A list of the names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
number_of_molecules – The number of molecules for each data point. Not necessary if the names of smiles columns are previously processed.
header – Whether the CSV file contains a header.
flatten – Whether to flatten the returned SMILES to a list instead of a list of lists.
- Returns:
A list of SMILES or a list of lists of SMILES, depending on flatten.
- chemprop.data.utils.get_task_names(path: str, smiles_columns: str | List[str] | None = None, target_columns: List[str] | None = None, ignore_columns: List[str] | None = None, loss_function: str | None = None) List[str] [source]
Gets the task names from a data CSV file.
If target_columns is provided, returns target_columns. Otherwise, returns all columns except the smiles_columns (or the first column, if smiles_columns is None) and the ignore_columns.
- Parameters:
path – Path to a CSV file.
smiles_columns – The names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
target_columns – Name of the columns containing target values. By default, uses all columns except the smiles_columns and the ignore_columns.
ignore_columns – Name of the columns to ignore when target_columns is not provided.
- Returns:
A list of task names.
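The column-selection rule described above can be written out on a header list. This is a sketch of that rule only; `infer_task_names` is a hypothetical helper that takes the header directly instead of a file path.

```python
from typing import List, Optional


def infer_task_names(
    header: List[str],
    smiles_columns: Optional[List[str]] = None,
    target_columns: Optional[List[str]] = None,
    ignore_columns: Optional[List[str]] = None,
) -> List[str]:
    """Pick task columns: explicit targets win, else everything not excluded."""
    if target_columns is not None:
        return target_columns
    # Fall back to the first column when no SMILES columns are named.
    smiles = smiles_columns if smiles_columns is not None else header[:1]
    excluded = set(smiles) | set(ignore_columns or [])
    return [column for column in header if column not in excluded]


header = ["smiles", "id", "logP", "tox"]
default_tasks = infer_task_names(header, ignore_columns=["id"])
explicit_tasks = infer_task_names(header, target_columns=["tox"])
```

Note that ignore_columns has no effect once target_columns is given, which matches the parameter descriptions.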
- chemprop.data.utils.preprocess_smiles_columns(path: str, smiles_columns: str | List[str] | None = None, number_of_molecules: int = 1) List[str] [source]
Preprocesses the smiles_columns variable to ensure that it is a list of column headings corresponding to the columns in the data file holding SMILES. Assumes file has a header.
- Parameters:
path – Path to a CSV file.
smiles_columns – The names of the columns containing SMILES. By default, uses the first number_of_molecules columns.
number_of_molecules – The number of molecules with associated SMILES for each data point.
- Returns:
The preprocessed version of smiles_columns which is guaranteed to be a list.
- chemprop.data.utils.split_data(data: MoleculeDataset, split_type: str = 'random', sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1), key_molecule_index: int = 0, seed: int = 0, num_folds: int = 1, args: TrainArgs | None = None, logger: Logger | None = None) Tuple[MoleculeDataset, MoleculeDataset, MoleculeDataset] [source]
Splits data into training, validation, and test splits.
- Parameters:
data – A MoleculeDataset.
split_type – Split type.
sizes – A length-3 tuple with the proportions of data in the train, validation, and test sets.
key_molecule_index – For data with multiple molecules, this sets which molecule will be considered during splitting.
seed – The random seed to use before shuffling data.
num_folds – Number of folds to create (only needed for “cv” split type).
args – A TrainArgs object.
logger – A logger for recording output.
- Returns:
A tuple of MoleculeDatasets containing the train, validation, and test splits of the data.
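For the default 'random' split type, the behaviour of sizes and seed can be pictured as a seeded shuffle followed by two cuts. The sketch below is a generic illustration with a hypothetical helper `random_split`; the exact rounding and shuffling used by the library may differ.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(
    data: Sequence[T],
    sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle indices with a fixed seed, then cut into train/validation/test."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    train_end = int(sizes[0] * len(data))
    val_end = int((sizes[0] + sizes[1]) * len(data))
    train = [data[i] for i in indices[:train_end]]
    val = [data[i] for i in indices[train_end:val_end]]
    test = [data[i] for i in indices[val_end:]]
    return train, val, test


data = list(range(20))
train, val, test = random_split(data, seed=0)
repeat = random_split(data, seed=0)
```

Using a dedicated random.Random instance keeps the split reproducible for a given seed without touching the global random state.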
- chemprop.data.utils.validate_data(data_path: str) Set[str] [source]
Validates a data CSV file, returning a set of errors.
- Parameters:
data_path – Path to a data CSV file.
- Returns:
A set of error messages.
- chemprop.data.utils.validate_dataset_type(data: MoleculeDataset, dataset_type: str) None [source]
Validates the dataset type to ensure the data matches the provided type.
- Parameters:
data – A MoleculeDataset.
dataset_type – The dataset type to check.