chemprop.data.splitting#

Attributes#

logger

Datapoints

MulticomponentDatapoints

Classes#

SplitType

Enum where members are also (and must be) strings

Functions#

make_split_indices(mols[, split, sizes, seed, ...])

Splits data into training, validation, and test splits.

split_data_by_indices(data[, train_indices, ...])

Splits data into training, validation, and test groups based on split indices given.

Module Contents#

chemprop.data.splitting.logger#
chemprop.data.splitting.Datapoints#
chemprop.data.splitting.MulticomponentDatapoints#
class chemprop.data.splitting.SplitType[source]#

Bases: chemprop.utils.utils.EnumMapping

Enum where members are also (and must be) strings

SCAFFOLD_BALANCED#
RANDOM_WITH_REPEATED_SMILES#
RANDOM#
KENNARD_STONE#
KMEANS#
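To illustrate what "members are also strings" means in practice, here is a minimal standalone sketch. It does not import chemprop: `SplitTypeSketch` and `resolve_split` are hypothetical names, and every member value except `"random"` (the documented default for `split`) is an assumption made for illustration.

```python
from enum import Enum


class SplitTypeSketch(str, Enum):
    """Hypothetical stand-in for a string-valued enum such as SplitType."""

    # Only "random" is grounded in the documented default; the other
    # values are assumed lowercase versions of the member names.
    SCAFFOLD_BALANCED = "scaffold_balanced"
    RANDOM_WITH_REPEATED_SMILES = "random_with_repeated_smiles"
    RANDOM = "random"
    KENNARD_STONE = "kennard_stone"
    KMEANS = "kmeans"


def resolve_split(split):
    """Accept either an enum member or a plain string and return the member."""
    try:
        # str.lower() also works on members because they are str instances.
        return SplitTypeSketch(split.lower())
    except ValueError:
        raise ValueError(f"Unsupported split method: {split!r}") from None


# A member compares equal to its string value and can be looked up by it.
assert SplitTypeSketch.RANDOM == "random"
assert resolve_split("RANDOM") is SplitTypeSketch.RANDOM
```

Because each member is itself a `str`, an argument typed as `SplitType | str` can be normalized in one place, and an unknown name can be reported as a `ValueError`.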
chemprop.data.splitting.make_split_indices(mols, split='random', sizes=(0.8, 0.1, 0.1), seed=0, num_replicates=1, num_folds=None)[source]#

Splits data into training, validation, and test splits.

Parameters:
  • mols (Sequence[Chem.Mol] | Sized) – Sequence of RDKit molecules to use for structure-based splitting, or, for random splitting, any object whose length equals the number of datapoints

  • split (SplitType | str, optional) – Split type, one of chemprop.data.splitting.SplitType, by default “random”

  • sizes (tuple[float, float, float], optional) – 3-tuple with the proportions of data in the train, validation, and test sets, by default (0.8, 0.1, 0.1). Set the middle value to 0 for a two-way split.

  • seed (int, optional) – The random seed passed to astartes, by default 0

  • num_replicates (int, optional) – Number of replicates, by default 1

  • num_folds (None, optional) – This argument was removed in v2.1; use num_replicates instead.

Returns:

A 2- or 3-member tuple of lists, each of length num_replicates, containing the training, validation, and testing indices.

Important

The validation member may or may not be present in the returned tuple.

Return type:

tuple[list[list[int]], …]

Raises:
  • ValueError – Requested split sizes tuple not of length 3

  • ValueError – Unsupported split method requested
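The return structure described above (one list of indices per replicate, for each of train, validation, and test) can be sketched without chemprop or astartes. The function below is a hypothetical stand-in, not the library's implementation: `random_split_indices` is an invented name, offsetting the seed per replicate is an assumption, and unlike the documented function it always returns three members.

```python
import random


def random_split_indices(n, sizes=(0.8, 0.1, 0.1), seed=0, num_replicates=1):
    """Sketch of a random split that mirrors the documented return shape."""
    if len(sizes) != 3:
        raise ValueError(f"Split sizes must have length 3, got {len(sizes)}")

    train, val, test = [], [], []
    for replicate in range(num_replicates):
        # Assumption: each replicate uses a different, reproducible seed.
        rng = random.Random(seed + replicate)
        order = list(range(n))
        rng.shuffle(order)

        n_train = round(sizes[0] * n)
        n_val = round(sizes[1] * n)
        train.append(order[:n_train])
        val.append(order[n_train:n_train + n_val])
        # Whatever remains after train and validation goes to test.
        test.append(order[n_train + n_val:])
    return train, val, test


train, val, test = random_split_indices(10, num_replicates=3)
# Each member is a list with one entry per replicate.
assert len(train) == len(val) == len(test) == 3
# Within a replicate, the three index lists partition range(10).
assert sorted(train[0] + val[0] + test[0]) == list(range(10))
```

With chemprop installed, the corresponding call according to the signature above would be `make_split_indices(mols, "random", (0.8, 0.1, 0.1), seed=0, num_replicates=3)`. Since the validation member may be absent, it is safer to check the length of the returned tuple before unpacking it.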

chemprop.data.splitting.split_data_by_indices(data, train_indices=None, val_indices=None, test_indices=None)[source]#

Splits data into training, validation, and test groups based on split indices given.

Parameters:
  • data (Datapoints | MulticomponentDatapoints)

  • train_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | None)

  • val_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | None)

  • test_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | None)
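The nested `Iterable[Iterable[int]]` type of the index arguments suggests one inner iterable of indices per replicate. A minimal sketch of that idea follows. It is an assumption about the behaviour, not chemprop's code: `split_by_indices_sketch` is a hypothetical name, returning `None` for an omitted argument is a guess, and the multicomponent case is not handled.

```python
def split_by_indices_sketch(data, train_indices=None, val_indices=None, test_indices=None):
    """Select items of `data` for each replicate's train/val/test indices."""

    def pick(indices):
        # Assumption: an omitted index argument yields None for that group.
        if indices is None:
            return None
        # One output list per replicate, in the order the indices are given.
        return [[data[i] for i in replicate] for replicate in indices]

    return pick(train_indices), pick(val_indices), pick(test_indices)


data = ["a", "b", "c", "d", "e"]
train, val, test = split_by_indices_sketch(
    data,
    train_indices=[[0, 1, 2], [2, 3, 4]],  # two replicates
    test_indices=[[3, 4], [0, 1]],
)
assert train == [["a", "b", "c"], ["c", "d", "e"]]
assert val is None
assert test == [["d", "e"], ["a", "b"]]
```

The index lists produced by a splitting function can be passed straight through in this shape, so each replicate's groups stay aligned by position.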