`chemprop.data.splitting`

Module Contents

Classes

SplitType

Enum where members are also (and must be) strings

Functions

`make_split_indices`(mols[, split, sizes, seed, num_folds])	Splits data into training, validation, and test splits.
`split_data_by_indices`(data[, train_indices, ...])	Splits data into training, validation, and test groups based on split indices given.

Attributes

`logger`
`Datapoints`
`MulticomponentDatapoints`

chemprop.data.splitting.logger

chemprop.data.splitting.Datapoints

chemprop.data.splitting.MulticomponentDatapoints

class chemprop.data.splitting.SplitType

Bases: chemprop.utils.utils.EnumMapping

Enum where members are also (and must be) strings

CV_NO_VAL

CV

SCAFFOLD_BALANCED

RANDOM_WITH_REPEATED_SMILES

RANDOM

KENNARD_STONE

KMEANS
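To illustrate what "members are also strings" means in practice, here is a minimal sketch using only the standard library. `SplitKind`, its two members, their values, and the `parse` helper are hypothetical stand-ins chosen for this example; they are not chemprop's `SplitType` or `EnumMapping`, whose internals are not shown on this page.

```python
from enum import Enum


class SplitKind(str, Enum):
    """Hypothetical string-valued enum; NOT chemprop's SplitType."""

    RANDOM = "random"
    KMEANS = "kmeans"

    @classmethod
    def parse(cls, value):
        # Accept either an existing member or a case-insensitive member name.
        if isinstance(value, cls):
            return value
        try:
            return cls[value.upper()]
        except KeyError:
            valid = ", ".join(member.value for member in cls)
            raise ValueError(
                f"Unsupported split {value!r}; expected one of: {valid}"
            ) from None


kind = SplitKind.parse("random")
```

Because each member subclasses `str`, it compares equal to its plain-string value, which is one way an argument typed as `SplitType | str` could accept either form.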

chemprop.data.splitting.make_split_indices(mols, split='random', sizes=(0.8, 0.1, 0.1), seed=0, num_folds=1)

Splits data into training, validation, and test splits.

Parameters:

mols (Sequence[Chem.Mol]) – Sequence of RDKit molecules to use for structure-based splitting
split (SplitType | str, optional) – Split type, one of the chemprop.data.splitting.SplitType members, by default “random”
sizes (tuple[float, float, float], optional) – 3-tuple with the proportions of data in the train, validation, and test sets, by default (0.8, 0.1, 0.1). Set the middle value to 0 for a two-way split.
seed (int, optional) – The random seed passed to astartes, by default 0
num_folds (int, optional) – Number of folds to create (only needed for “cv” and “cv-no-test”), by default 1

Returns:

A tuple of lists of indices corresponding to the train, validation, and test splits of the data. If the split type is “cv” or “cv-no-test”, returns a tuple of lists of lists of indices corresponding to the train, validation, and test splits of each fold.

Important

The validation split may or may not be present.

Return type:

tuple[list[int], list[int], list[int]] | tuple[list[list[int], …], list[list[int], …], list[list[int], …]]

Raises:

ValueError – Requested split sizes tuple not of length 3
ValueError – Inappropriate number of folds requested
ValueError – Unsupported split method requested
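The shape of the return value for a single-fold random split can be sketched in plain Python. This is an illustration only: `sketch_random_split` is a hypothetical helper that takes an item count rather than molecules, and it does not reproduce chemprop's implementation, which per the `seed` description above passes the seed to astartes. The rounding rule (the test set takes the remainder) is also an assumption of this sketch.

```python
import random


def sketch_random_split(n_items, sizes=(0.8, 0.1, 0.1), seed=0):
    """Illustrative random index split; NOT chemprop's implementation."""
    if len(sizes) != 3:
        raise ValueError(f"sizes must have length 3, got {len(sizes)}")

    # Shuffle the indices reproducibly with a dedicated generator.
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)

    # Assumed rounding rule: truncate train and validation counts,
    # then give every remaining index to the test set.
    n_train = int(sizes[0] * n_items)
    n_val = int(sizes[1] * n_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = sketch_random_split(10)

# Setting the middle proportion to 0 leaves the validation list empty.
two_way = sketch_random_split(10, sizes=(0.8, 0.0, 0.2))
```

In this sketch a two-way split still returns a 3-tuple, with an empty list in the validation position. Whether chemprop represents a missing validation split the same way is not stated here, so check the returned value before relying on it.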

chemprop.data.splitting.split_data_by_indices(data, train_indices=None, val_indices=None, test_indices=None)

Splits data into training, validation, and test groups based on split indices given.

Parameters:

data (Datapoints | MulticomponentDatapoints)
train_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | collections.abc.Iterable[int] | None)
val_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | collections.abc.Iterable[int] | None)
test_indices (collections.abc.Iterable[collections.abc.Iterable[int]] | collections.abc.Iterable[int] | None)
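The index arguments accept either a flat iterable of integers or a nested iterable with one inner iterable per fold. The sketch below shows one way such input could be applied to a sequence. `sketch_take` is a hypothetical helper, and its return shapes are assumptions: the return value of `split_data_by_indices` is not documented on this page.

```python
def sketch_take(data, indices):
    """Illustrative only: select items by a flat or nested index collection."""
    if indices is None:
        # No indices supplied for this split, so there is nothing to select.
        return None

    indices = list(indices)

    # A nested collection (one index list per fold) yields one subset per fold.
    if indices and not isinstance(indices[0], int):
        return [[data[i] for i in fold] for fold in indices]

    # A flat collection yields a single subset.
    return [data[i] for i in indices]


data = ["a", "b", "c", "d"]
flat = sketch_take(data, [0, 2])
nested = sketch_take(data, [[0], [1, 3]])
```

Calling the helper once per split (train, validation, test) mirrors the three optional index parameters listed above.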
