CLI Reference#

usage: chemprop [-h] {train,predict,convert,fingerprint,hpopt} ...

mode#

mode

Possible choices: train, predict, convert, fingerprint, hpopt

Sub-commands#

train#

train a chemprop model

chemprop train [-h] [--logfile [LOGFILE]] [-v]
               [-s SMILES_COLUMNS [SMILES_COLUMNS ...]]
               [-r REACTION_COLUMNS [REACTION_COLUMNS ...]] [--no-header-row]
               [-n NUM_WORKERS] [-b BATCH_SIZE] [--accelerator ACCELERATOR]
               [--devices DEVICES]
               [--rxn-mode {REAC_PROD,REAC_PROD_BALANCE,REAC_DIFF,REAC_DIFF_BALANCE,PROD_DIFF,PROD_DIFF_BALANCE}]
               [--multi-hot-atom-featurizer-mode {V1,V2,ORGANIC}] [--keep-h]
               [--add-h]
               [--features-generators {morgan_binary,morgan_count} [{morgan_binary,morgan_count} ...]]
               [--descriptors-path DESCRIPTORS_PATH] [--no-descriptor-scaling]
               [--no-atom-feature-scaling] [--no-atom-descriptor-scaling]
               [--no-bond-feature-scaling]
               [--atom-features-path ATOM_FEATURES_PATH [ATOM_FEATURES_PATH ...]]
               [--atom-descriptors-path ATOM_DESCRIPTORS_PATH [ATOM_DESCRIPTORS_PATH ...]]
               [--bond-features-path BOND_FEATURES_PATH [BOND_FEATURES_PATH ...]]
               [--config-path CONFIG_PATH] [-i DATA_PATH] [-o OUTPUT_DIR]
               [--model-frzn MODEL_FRZN] [--frzn-ffn-layers FRZN_FFN_LAYERS]
               [--ensemble-size ENSEMBLE_SIZE]
               [--message-hidden-dim MESSAGE_HIDDEN_DIM] [--message-bias]
               [--depth DEPTH] [--undirected] [--dropout DROPOUT]
               [--mpn-shared]
               [--activation {RELU,LEAKYRELU,PRELU,TANH,SELU,ELU}]
               [--aggregation {mean,sum,norm}]
               [--aggregation-norm AGGREGATION_NORM] [--atom-messages]
               [--ffn-hidden-dim FFN_HIDDEN_DIM]
               [--ffn-num-layers FFN_NUM_LAYERS] [--no-batch-norm]
               [--multiclass-num-classes MULTICLASS_NUM_CLASSES]
               [-w WEIGHT_COLUMN]
               [--target-columns TARGET_COLUMNS [TARGET_COLUMNS ...]]
               [--ignore-columns IGNORE_COLUMNS [IGNORE_COLUMNS ...]]
               [-t {regression,regression-mve,regression-evidential,classification,classification-dirichlet,multiclass,multiclass-dirichlet,spectral}]
               [-l {mse,bounded-mse,mve,evidential,bce,ce,binary-mcc,multiclass-mcc,binary-dirichlet,multiclass-dirichlet,sid,earthmovers,wasserstein}]
               [--v-kl V_KL] [--eps EPS]
               [--metrics {mae,mse,rmse,bounded-mae,bounded-mse,bounded-rmse,r2,roc,prc,accuracy,f1,bce,ce,binary-mcc,multiclass-mcc,sid,wasserstein} [{mae,mse,rmse,bounded-mae,bounded-mse,bounded-rmse,r2,roc,prc,accuracy,f1,bce,ce,binary-mcc,multiclass-mcc,sid,wasserstein} ...]]
               [--task-weights TASK_WEIGHTS [TASK_WEIGHTS ...]]
               [--warmup-epochs WARMUP_EPOCHS] [--init-lr INIT_LR]
               [--max-lr MAX_LR] [--final-lr FINAL_LR] [--epochs EPOCHS]
               [--patience PATIENCE] [--grad-clip GRAD_CLIP]
               [--split {CV_NO_VAL,CV,SCAFFOLD_BALANCED,RANDOM_WITH_REPEATED_SMILES,RANDOM,KENNARD_STONE,KMEANS}]
               [--split-sizes SPLIT_SIZES SPLIT_SIZES SPLIT_SIZES]
               [--split-key-molecule SPLIT_KEY_MOLECULE] [-k NUM_FOLDS]
               [--save-smiles-splits] [--splits-file SPLITS_FILE]
               [--splits-column SPLITS_COLUMN] [--data-seed DATA_SEED]
               [--pytorch-seed PYTORCH_SEED]
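
As a rough illustration of how the usage string above maps to a concrete invocation, the sketch below assembles a `chemprop train` command line in Python without running it. The file names, column names, and output directory are hypothetical; only the flags themselves come from the usage string.

```python
import shlex

# Hypothetical paths and column names; every flag appears in the usage string.
args = [
    "chemprop", "train",
    "--data-path", "data/train.csv",
    "--smiles-columns", "smiles",
    "--target-columns", "logS",
    "--task-type", "regression",
    "--split", "SCAFFOLD_BALANCED",
    "--split-sizes", "0.8", "0.1", "0.1",
    "--epochs", "50",
    "--output-dir", "runs/logS",
]

# shlex.join quotes arguments only where the shell would require it.
command = shlex.join(args)
print(command)
```

The resulting string can be pasted into a shell, or the `args` list can be handed to `subprocess.run` once chemprop is installed.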

Named Arguments#

--logfile, --log

The path to which the log file should be written. Specifying just the flag (i.e., ‘--log/--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.
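
The default log location described above can be sketched as follows. This is a minimal illustration, assuming the timestamp follows the `YYYY-MM-DDTHH-MM-SS` pattern of the example; the helper name `default_log_path` is hypothetical and not part of chemprop.

```python
from datetime import datetime
from pathlib import Path


def default_log_path(mode: str, now: datetime) -> Path:
    """Build 'chemprop_logs/MODE/TIMESTAMP.log' (hypothetical helper)."""
    # Assumed format, inferred from the example timestamp 2024-04-30T15-20-07.
    timestamp = now.strftime("%Y-%m-%dT%H-%M-%S")
    return Path("chemprop_logs") / mode / f"{timestamp}.log"


path = default_log_path("train", datetime(2024, 4, 30, 15, 20, 7))
print(path.as_posix())
```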

-v, --verbose

The verbosity level. Specify the flag multiple times to increase verbosity.

Default: 0

--accelerator

Passed directly to the lightning Trainer().

Default: “auto”

--devices

Passed directly to the lightning Trainer(). If specifying multiple devices, must be a single string of comma-separated devices, e.g. ‘1, 2’.

Default: “auto”

--config-path

Path to a configuration file. Command line arguments override values in the configuration file.
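
The precedence rule above (command line over configuration file) can be pictured as a dictionary merge. This is only a sketch of the rule: the configuration file format and the way chemprop actually parses it are not described here, and `merge_settings` is a hypothetical name.

```python
from typing import Any, Dict


def merge_settings(config: Dict[str, Any], cli: Dict[str, Any]) -> Dict[str, Any]:
    """Start from config-file values, then let explicit CLI values win."""
    merged = dict(config)
    # A value of None stands for "flag not given on the command line".
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged


config_values = {"epochs": 100, "batch_size": 32}
cli_values = {"epochs": 20, "batch_size": None}
settings = merge_settings(config_values, cli_values)
print(settings)
```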

-i, --data-path

Path to an input CSV file containing SMILES and the associated target values.

-o, --output-dir, --save-dir

Directory where training outputs will be saved. Defaults to ‘CURRENT_DIRECTORY/chemprop_training/STEM_OF_INPUT/TIME_STAMP’.

--ensemble-size

Number of models in ensemble for each splitting of data.

Default: 1

--pytorch-seed

Seed for PyTorch randomness (e.g., random initial weights).

Shared input data args#

-s, --smiles-columns

The column names in the input CSV containing SMILES strings. If unspecified, uses the 0th column.

-r, --reaction-columns

The column names in the input CSV containing reaction SMILES in the format ‘REACTANT>AGENT>PRODUCT’, where ‘AGENT’ is optional.
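
To make the ‘REACTANT>AGENT>PRODUCT’ layout concrete, the sketch below splits a reaction SMILES string on ‘>’. It illustrates the string format only and is not how chemprop parses reactions internally; `split_reaction_smiles` is a hypothetical helper.

```python
from typing import Tuple


def split_reaction_smiles(rxn: str) -> Tuple[str, str, str]:
    """Split 'REACTANT>AGENT>PRODUCT' into its three parts."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected two '>' separators, got: {rxn!r}")
    reactants, agents, products = parts
    # An omitted agent shows up as an empty string, as in 'CCO>>CC=O'.
    return reactants, agents, products


no_agent = split_reaction_smiles("CCO>>CC=O")
with_agent = split_reaction_smiles("CCO>[Cr]>CC=O")
print(no_agent, with_agent)
```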

--no-header-row

If specified, the first row in the input CSV will not be used as column names.

Default: False

Dataloader args#

-n, --num-workers

Number of workers for parallel data loading (0 means sequential). Warning: setting num_workers>0 can cause hangs on Windows and macOS.

Default: 0

-b, --batch-size

Batch size.

Default: 64

Featurization args#

--rxn-mode, --reaction-mode

Possible choices: REAC_PROD, REAC_PROD_BALANCE, REAC_DIFF, REAC_DIFF_BALANCE, PROD_DIFF, PROD_DIFF_BALANCE

Choices for construction of atom and bond features for reactions (case insensitive):
- ‘reac_prod’: concatenates the reactants feature with the products feature.
- ‘reac_diff’: concatenates the reactants feature with the difference in features between reactants and products. (Default)
- ‘prod_diff’: concatenates the products feature with the difference in features between reactants and products.
- ‘reac_prod_balance’: concatenates the reactants feature with the products feature, balances imbalanced reactions.
- ‘reac_diff_balance’: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.
- ‘prod_diff_balance’: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.

Default: REAC_DIFF
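
The three base reaction modes differ only in which two feature vectors are concatenated. The sketch below shows that idea for a single atom, assuming the difference is taken as product minus reactant (the sign convention is an assumption) and ignoring the balancing step and any feature subsetting the real featurizer may apply. `combine_features` is a hypothetical helper.

```python
from typing import List


def combine_features(reac: List[float], prod: List[float], mode: str) -> List[float]:
    """Concatenate per-atom features according to a reaction mode."""
    # The *_BALANCE variants share the concatenation rule of their base mode.
    base = mode.lower().replace("_balance", "")
    diff = [p - r for r, p in zip(reac, prod)]  # assumed sign: product - reactant
    if base == "reac_prod":
        return reac + prod
    if base == "reac_diff":
        return reac + diff
    if base == "prod_diff":
        return prod + diff
    raise ValueError(f"unknown reaction mode: {mode}")


reac_feat = [1.0, 0.0]
prod_feat = [0.0, 1.0]
combined = combine_features(reac_feat, prod_feat, "REAC_DIFF")
print(combined)
```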

--multi-hot-atom-featurizer-mode

Possible choices: V1, V2, ORGANIC

Choices for multi-hot atom featurization scheme. This will affect both non-reaction and reaction featurization (case insensitive):
- V1: Corresponds to the original configuration employed in Chemprop V1.
- V2: Tailored for a broad range of molecules, this configuration encompasses all elements in the first four rows of the periodic table, along with iodine. It is the default in Chemprop V2.
- ORGANIC: Designed specifically for use with organic molecules for drug research and development, this configuration includes a subset of elements most common in organic chemistry, including H, B, C, N, O, F, Si, P, S, Cl, Br, and I.

Default: V2

--keep-h

Whether hydrogens explicitly specified in input should be kept in the mol graph.

Default: False

--add-h

Whether hydrogens should be added to the mol graph.

Default: False

--features-generators

Possible choices: morgan_binary, morgan_count

Method(s) of generating additional features.

--descriptors-path

Path to extra descriptors to concatenate to learned representation.

--no-descriptor-scaling

Turn off extra descriptor scaling.

Default: False

--no-atom-feature-scaling

Turn off extra atom feature scaling.

Default: False

--no-atom-descriptor-scaling

Turn off extra atom descriptor scaling.

Default: False

--no-bond-feature-scaling

Turn off extra bond feature scaling.

Default: False

--atom-features-path

If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional atom features to supply before message passing. E.g., --atom-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-features-path […] --atom-features-path […].

--atom-descriptors-path

If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional atom descriptors to supply after message passing. E.g., --atom-descriptors-path 0 /path/to/descriptors_0.npz indicates that the descriptors at the given path should be supplied to the 0-th component. To supply additional descriptors for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-descriptors-path […] --atom-descriptors-path […].

--bond-features-path

If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional bond features to supply before message passing. E.g., --bond-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --bond-features-path […] --bond-features-path […].
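
The "single path or (index, path)" convention shared by the three path arguments above can be sketched as a small normalization step. `parse_indexed_path` is a hypothetical helper written for illustration, not chemprop's own parser.

```python
from typing import List, Tuple


def parse_indexed_path(values: List[str]) -> Tuple[int, str]:
    """Normalize one occurrence of the argument to (molecule_index, path)."""
    if len(values) == 1:
        # A lone path is taken to refer to the 0-th molecule.
        return 0, values[0]
    if len(values) == 2:
        return int(values[0]), values[1]
    raise ValueError(f"expected PATH or INDEX PATH, got: {values}")


# One entry per repetition of the flag on the command line.
occurrences = [["/path/to/features_0.npz"], ["1", "/path/to/features_1.npz"]]
parsed = [parse_indexed_path(values) for values in occurrences]
print(parsed)
```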

transfer learning args#

--model-frzn

Path to model checkpoint file to be loaded for overwriting and freezing weights.

--frzn-ffn-layers

Overwrites weights for the first n layers of the FFN from the checkpoint model (specified by --model-frzn), where n is specified in the input. Automatically also freezes the MPNN weights.

Default: 0

message passing#

--message-hidden-dim

hidden dimension of the messages

Default: 300

--message-bias

add bias to the message passing layers

Default: False

--depth

Number of message passing steps.

Default: 3

--undirected

Pass messages on undirected bonds/edges (always sum the two relevant bond vectors).

Default: False

--dropout

dropout probability in message passing/FFN layers

Default: 0.0

--mpn-shared

Whether to use the same message passing neural network for all input molecules. Only relevant if number_of_molecules > 1

Default: False

--activation

Possible choices: RELU, LEAKYRELU, PRELU, TANH, SELU, ELU

activation function in message passing/FFN layers

Default: RELU

--aggregation, --agg

Possible choices: mean, sum, norm

the aggregation mode to use in the graph predictor

Default: “mean”

--aggregation-norm

normalization factor by which to divide summed up atomic features for ‘norm’ aggregation

Default: 100
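
The three aggregation choices can be illustrated on plain lists of per-atom vectors. This is a conceptual sketch, assuming ‘norm’ divides the summed features by the --aggregation-norm value as described above; the real implementation operates on batched tensors, and `aggregate` is a hypothetical name.

```python
from typing import List


def aggregate(atom_vectors: List[List[float]], mode: str = "mean",
              norm: float = 100.0) -> List[float]:
    """Pool per-atom vectors into one molecule-level vector."""
    # Column-wise sum over all atoms.
    sums = [sum(column) for column in zip(*atom_vectors)]
    if mode == "sum":
        return sums
    if mode == "mean":
        return [s / len(atom_vectors) for s in sums]
    if mode == "norm":
        return [s / norm for s in sums]
    raise ValueError(f"unknown aggregation mode: {mode}")


atoms = [[1.0, 2.0], [3.0, 4.0]]
summed = aggregate(atoms, "sum")
averaged = aggregate(atoms, "mean")
normed = aggregate(atoms, "norm", norm=100.0)
print(summed, averaged, normed)
```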

--atom-messages

pass messages on atoms rather than bonds

Default: False

FFN args#

--ffn-hidden-dim

hidden dimension in the FFN top model

Default: 300

--ffn-num-layers

number of layers in FFN top model

Default: 1

extra MPNN args#

--no-batch-norm

Don’t use batch normalization after aggregation.

Default: False

--multiclass-num-classes

Number of classes when running multiclass classification.

Default: 3

training input data args#

-w, --weight-column

the name of the column in the input CSV containing individual data weights

--target-columns

Names of the columns containing target values. By default, uses all columns except the SMILES column and the ignore_columns.

--ignore-columns

Names of the columns to ignore when target_columns is not provided.

--splits-column

Name of the column in the input CSV file containing ‘train’, ‘val’, or ‘test’ for each row.

training args#

-t, --task-type

Possible choices: regression, regression-mve, regression-evidential, classification, classification-dirichlet, multiclass, multiclass-dirichlet, spectral

Type of dataset. This determines the default loss function used during training. Defaults to regression.

Default: “regression”

-l, --loss-function

Possible choices: mse, bounded-mse, mve, evidential, bce, ce, binary-mcc, multiclass-mcc, binary-dirichlet, multiclass-dirichlet, sid, earthmovers, wasserstein

Loss function to use during training. If not specified, will use the default loss function for the given task type (see documentation).

--v-kl, --evidential-regularization

Value used in regularization for evidential loss function. The default value recommended by Soleimany et al. (2021) is 0.2. Optimal value is dataset-dependent; it is recommended that users test different values to find the best value for their model.

Default: 0.0

--eps

evidential regularization epsilon

Default: 1e-08

--metrics, --metric

Possible choices: mae, mse, rmse, bounded-mae, bounded-mse, bounded-rmse, r2, roc, prc, accuracy, f1, bce, ce, binary-mcc, multiclass-mcc, sid, wasserstein

Evaluation metrics. If unspecified, the following metrics are used for the given dataset types: regression->rmse, classification->roc, multiclass->ce (‘cross entropy’), spectral->sid. If multiple metrics are provided, the 0th one will be used for early stopping and checkpointing.
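
The default-metric rule above can be sketched as a lookup plus a "first metric wins" choice. The mapping only covers the four dataset types named in the description; how the other task types (for example regression-mve) are resolved is not stated here, so they are left out. `resolve_metrics` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

# Defaults exactly as listed in the description of --metrics.
DEFAULT_METRIC = {
    "regression": "rmse",
    "classification": "roc",
    "multiclass": "ce",
    "spectral": "sid",
}


def resolve_metrics(task_type: str,
                    metrics: Optional[List[str]] = None) -> Tuple[List[str], str]:
    """Return (all metrics, metric used for early stopping and checkpointing)."""
    chosen = list(metrics) if metrics else [DEFAULT_METRIC[task_type]]
    # The 0th metric drives early stopping and checkpointing.
    return chosen, chosen[0]


defaults = resolve_metrics("classification")
explicit = resolve_metrics("regression", ["mae", "r2"])
print(defaults, explicit)
```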

--task-weights

the weight to apply to an individual task in the overall loss

--warmup-epochs

Number of epochs during which learning rate increases linearly from init_lr to max_lr. Afterwards, learning rate decreases exponentially from max_lr to final_lr.

Default: 2

--init-lr

Initial learning rate.

Default: 0.0001

--max-lr

Maximum learning rate.

Default: 0.001

--final-lr

Final learning rate.

Default: 0.0001
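
The schedule described under --warmup-epochs (linear increase from init_lr to max_lr, then exponential decay to final_lr) can be sketched per epoch as follows. This is an illustration using the defaults listed above; the actual scheduler may update the rate per optimizer step rather than per epoch, and `learning_rate` is a hypothetical function.

```python
def learning_rate(epoch: float, warmup_epochs: int = 2, total_epochs: int = 50,
                  init_lr: float = 1e-4, max_lr: float = 1e-3,
                  final_lr: float = 1e-4) -> float:
    """Learning rate at a given (possibly fractional) epoch."""
    if epoch < warmup_epochs:
        # Linear warmup from init_lr to max_lr.
        return init_lr + (max_lr - init_lr) * epoch / warmup_epochs
    # Exponential decay that reaches final_lr at total_epochs.
    decay_epochs = total_epochs - warmup_epochs
    gamma = (final_lr / max_lr) ** (1 / decay_epochs)
    return max_lr * gamma ** (epoch - warmup_epochs)


schedule = [learning_rate(e) for e in (0, 1, 2, 26, 50)]
print(schedule)
```

With the defaults, the rate starts at init_lr, peaks at max_lr after two epochs, and returns to final_lr at epoch 50.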

--epochs

the number of epochs to train over

Default: 50

--patience

Number of epochs to wait for improvement before early stopping.

--grad-clip

Passed directly to the lightning trainer which controls grad clipping. See the Trainer() docstring for details.

split args#

--split, --split-type

Possible choices: CV_NO_VAL, CV, SCAFFOLD_BALANCED, RANDOM_WITH_REPEATED_SMILES, RANDOM, KENNARD_STONE, KMEANS

Method of splitting the data into train/val/test (case insensitive).

Default: RANDOM

--split-sizes

Split proportions for train/validation/test sets.

Default: [0.8, 0.1, 0.1]
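
As a quick check of what the proportions mean for a dataset of a given size, the sketch below converts them into row counts. The rounding rule (round train and validation, give the remainder to test) is an assumption made for illustration; the splitter chemprop uses may round differently. `split_counts` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def split_counts(n_rows: int,
                 sizes: Sequence[float] = (0.8, 0.1, 0.1)) -> Tuple[int, int, int]:
    """Turn train/val/test proportions into row counts that sum to n_rows."""
    if len(sizes) != 3 or abs(sum(sizes) - 1.0) > 1e-9:
        raise ValueError("expected three proportions that sum to 1")
    n_train = round(sizes[0] * n_rows)
    n_val = round(sizes[1] * n_rows)
    # Assumed rule: the test set receives whatever is left over.
    n_test = n_rows - n_train - n_val
    return n_train, n_val, n_test


counts = split_counts(1000)
print(counts)
```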

--split-key-molecule

The index of the key molecule used for splitting when multiple molecules are present and constrained split_type is used (e.g., ‘scaffold_balanced’ or ‘random_with_repeated_smiles’). Note that this index begins with zero for the first molecule.

Default: 0

-k, --num-folds

Number of folds when performing cross validation.

Default: 1

--save-smiles-splits

Save smiles for each train/val/test splits for prediction convenience later.

Default: False

--splits-file

Path to a JSON file containing pre-defined splits for the input data, formatted as a list of dictionaries with keys ‘train’, ‘val’, and ‘test’ and values as lists of indices or strings formatted like ‘0-2,4’. See documentation for more details.
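
A splits file in the shape described above (a list of dictionaries with ‘train’, ‘val’, and ‘test’ keys) can be written and read back as sketched below. The sketch assumes that ranges such as ‘0-2’ are inclusive, so ‘0-2,4’ expands to 0, 1, 2, 4; `parse_indices` is a hypothetical helper, not chemprop's own function.

```python
import json
from typing import List, Union


def parse_indices(spec: Union[str, List[int]]) -> List[int]:
    """Expand a list of indices or a string like '0-2,4' into a list of ints."""
    if isinstance(spec, list):
        return [int(i) for i in spec]
    indices: List[int] = []
    for part in spec.split(","):
        if "-" in part:
            start, end = part.split("-")
            # Assumption: the range includes both endpoints.
            indices.extend(range(int(start), int(end) + 1))
        else:
            indices.append(int(part))
    return indices


# One dictionary per split of the data.
splits = [{"train": "0-2,4", "val": [3], "test": "5-6"}]
loaded = json.loads(json.dumps(splits))
expanded = {key: parse_indices(value) for key, value in loaded[0].items()}
print(expanded)
```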

--data-seed

Random seed to use when splitting data into train/val/test sets. When num_folds > 1, the first fold uses this seed and all subsequent folds add 1 to the seed. Also used for shuffling data in build_dataloader when shuffle is True.

Default: 0
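
One reading of the per-fold seeding rule above is that fold k uses data_seed + k. The sketch below spells that out; the cumulative interpretation is an assumption, and `fold_seeds` is a hypothetical helper.

```python
from typing import List


def fold_seeds(data_seed: int = 0, num_folds: int = 1) -> List[int]:
    """Seed used for each fold, assuming each fold adds 1 to the previous seed."""
    return [data_seed + fold for fold in range(num_folds)]


seeds = fold_seeds(data_seed=42, num_folds=3)
print(seeds)
```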

predict#

use a pretrained chemprop model for prediction

chemprop predict [-h] [--logfile [LOGFILE]] [-v]
                 [-s SMILES_COLUMNS [SMILES_COLUMNS ...]]
                 [-r REACTION_COLUMNS [REACTION_COLUMNS ...]]
                 [--no-header-row] [-n NUM_WORKERS] [-b BATCH_SIZE]
                 [--accelerator ACCELERATOR] [--devices DEVICES]
                 [--rxn-mode {REAC_PROD,REAC_PROD_BALANCE,REAC_DIFF,REAC_DIFF_BALANCE,PROD_DIFF,PROD_DIFF_BALANCE}]
                 [--multi-hot-atom-featurizer-mode {V1,V2,ORGANIC}] [--keep-h]
                 [--add-h]
                 [--features-generators {morgan_binary,morgan_count} [{morgan_binary,morgan_count} ...]]
                 [--descriptors-path DESCRIPTORS_PATH]
                 [--no-descriptor-scaling] [--no-atom-feature-scaling]
                 [--no-atom-descriptor-scaling] [--no-bond-feature-scaling]
                 [--atom-features-path ATOM_FEATURES_PATH [ATOM_FEATURES_PATH ...]]
                 [--atom-descriptors-path ATOM_DESCRIPTORS_PATH [ATOM_DESCRIPTORS_PATH ...]]
                 [--bond-features-path BOND_FEATURES_PATH [BOND_FEATURES_PATH ...]]
                 -i TEST_PATH [-o OUTPUT] [--drop-extra-columns] --model-path
                 MODEL_PATH
                 [--target-columns TARGET_COLUMNS [TARGET_COLUMNS ...]]

Named Arguments#

--logfile, --log

The path to which the log file should be written. Specifying just the flag (i.e., ‘--log/--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.

-v, --verbose

The verbosity level. Specify the flag multiple times to increase verbosity.

Default: 0

--accelerator

Passed directly to the lightning Trainer().

Default: “auto”

--devices

Passed directly to the lightning Trainer(). If specifying multiple devices, must be a single string of comma-separated devices, e.g. ‘1, 2’.

Default: “auto”

-i, --test-path

Path to an input CSV file containing SMILES.

-o, --output, --preds-path

Path to which predictions will be saved. If the file extension is .pkl, will be saved as a pickle file. Otherwise, will save predictions as a CSV. The index of the model will be appended to the filename’s stem. By default, predictions will be saved to the same location as ‘--test-path’ with ‘_preds’ appended, i.e., ‘PATH/TO/TEST_PATH_preds_0.csv’.
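
The default output naming described above can be sketched with pathlib. The helper name `default_preds_path` and the example file name are hypothetical; the pattern follows the ‘PATH/TO/TEST_PATH_preds_0.csv’ example.

```python
from pathlib import Path


def default_preds_path(test_path: str, model_index: int = 0) -> Path:
    """Place predictions next to the test file, with '_preds' and the model index."""
    path = Path(test_path)
    return path.with_name(f"{path.stem}_preds_{model_index}.csv")


preds_path = default_preds_path("data/test.csv")
print(preds_path.as_posix())
```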

--drop-extra-columns

Whether to drop all columns from the test data file besides the SMILES columns and the new prediction columns.

Default: False

--model-path

Path to either a single pretrained model checkpoint (.ckpt) or single pretrained model file (.pt) or to a directory that contains these files. If a directory, will recursively search and predict on all found models.
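
The recursive search over a model directory can be pictured as below. This is a sketch of the described behavior, assuming files are matched purely by their .ckpt or .pt extension; `find_model_files` is a hypothetical helper and the file names are made up.

```python
import tempfile
from pathlib import Path
from typing import List


def find_model_files(model_path: Path) -> List[Path]:
    """Return the single file given, or every .ckpt/.pt file under a directory."""
    if model_path.is_file():
        return [model_path]
    # rglob walks the directory tree recursively.
    return sorted(p for p in model_path.rglob("*") if p.suffix in {".ckpt", ".pt"})


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "fold_0").mkdir()
    (root / "fold_0" / "best.ckpt").touch()
    (root / "legacy.pt").touch()
    (root / "notes.txt").touch()
    found_names = [p.name for p in find_model_files(root)]

print(found_names)
```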

--target-columns

Column names to save the predictions to. If not provided, the predictions will be saved to columns named ‘pred_0’, ‘pred_1’, etc.

Shared input data args#

-s, --smiles-columns

The column names in the input CSV containing SMILES strings. If unspecified, uses the 0th column.

-r, --reaction-columns

The column names in the input CSV containing reaction SMILES in the format ‘REACTANT>AGENT>PRODUCT’, where ‘AGENT’ is optional.

--no-header-row

If specified, the first row in the input CSV will not be used as column names.

Default: False

Dataloader args#

-n, --num-workers

Number of workers for parallel data loading (0 means sequential). Warning: setting num_workers>0 can cause hangs on Windows and macOS.

Default: 0

-b, --batch-size

Batch size.

Default: 64

Featurization args#

--rxn-mode, --reaction-mode

Possible choices: REAC_PROD, REAC_PROD_BALANCE, REAC_DIFF, REAC_DIFF_BALANCE, PROD_DIFF, PROD_DIFF_BALANCE

Choices for construction of atom and bond features for reactions (case insensitive):
- ‘reac_prod’: concatenates the reactants feature with the products feature.
- ‘reac_diff’: concatenates the reactants feature with the difference in features between reactants and products. (Default)
- ‘prod_diff’: concatenates the products feature with the difference in features between reactants and products.
- ‘reac_prod_balance’: concatenates the reactants feature with the products feature, balances imbalanced reactions.
- ‘reac_diff_balance’: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.
- ‘prod_diff_balance’: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.

Default: REAC_DIFF

--multi-hot-atom-featurizer-mode

Possible choices: V1, V2, ORGANIC

Choices for multi-hot atom featurization scheme. This will affect both non-reaction and reaction featurization (case insensitive):
- V1: Corresponds to the original configuration employed in Chemprop V1.
- V2: Tailored for a broad range of molecules, this configuration encompasses all elements in the first four rows of the periodic table, along with iodine. It is the default in Chemprop V2.
- ORGANIC: Designed specifically for use with organic molecules for drug research and development, this configuration includes a subset of elements most common in organic chemistry, including H, B, C, N, O, F, Si, P, S, Cl, Br, and I.

Default: V2

--keep-h

Whether hydrogens explicitly specified in input should be kept in the mol graph.

Default: False

--add-h

Whether hydrogens should be added to the mol graph.

Default: False

--features-generators

Possible choices: morgan_binary, morgan_count

Method(s) of generating additional features.

--descriptors-path

Path to extra descriptors to concatenate to learned representation.

--no-descriptor-scaling

Turn off extra descriptor scaling.

Default: False

--no-atom-feature-scaling

Turn off extra atom feature scaling.

Default: False

--no-atom-descriptor-scaling

Turn off extra atom descriptor scaling.

Default: False

--no-bond-feature-scaling

Turn off extra bond feature scaling.

Default: False

--atom-features-path

If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional atom features to supply before message passing. E.g., --atom-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-features-path […] --atom-features-path […].

--atom-descriptors-path

If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional atom descriptors to supply after message passing. E.g., --atom-descriptors-path 0 /path/to/descriptors_0.npz indicates that the descriptors at the given path should be supplied to the 0-th component. To supply additional descriptors for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-descriptors-path […] --atom-descriptors-path […].

--bond-features-path

If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional bond features to supply before message passing. E.g., --bond-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --bond-features-path […] --bond-features-path […].

convert#

convert a v1 model checkpoint (.pt) to a v2 model checkpoint (.ckpt)

chemprop convert [-h] [--logfile [LOGFILE]] [-v] -i INPUT_PATH
                 [-o OUTPUT_PATH]

Named Arguments#

--logfile, --log

The path to which the log file should be written. Specifying just the flag (i.e., ‘--log/--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.

-v, --verbose

The verbosity level. Specify the flag multiple times to increase verbosity.

Default: 0

-i, --input-path

The path to a v1 model .pt checkpoint file.

-o, --output-path

The path to which the converted model will be saved. Defaults to ‘CURRENT_DIRECTORY/STEM_OF_INPUT_v2.ckpt’
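
The default output name for a converted checkpoint can be sketched as follows. The helper name `default_converted_path` and the example paths are hypothetical; the pattern follows ‘CURRENT_DIRECTORY/STEM_OF_INPUT_v2.ckpt’.

```python
from pathlib import PurePosixPath


def default_converted_path(input_path: str, current_dir: str) -> PurePosixPath:
    """Build CURRENT_DIRECTORY/STEM_OF_INPUT_v2.ckpt for a v1 checkpoint path."""
    stem = PurePosixPath(input_path).stem
    # The file lands in the current directory, not next to the input file.
    return PurePosixPath(current_dir) / f"{stem}_v2.ckpt"


converted = default_converted_path("models/old_model.pt", "/work")
print(converted)
```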

fingerprint#

use a pretrained chemprop model to calculate learned representations

chemprop fingerprint [-h] [--logfile [LOGFILE]] [-v]
                     [-s SMILES_COLUMNS [SMILES_COLUMNS ...]]
                     [-r REACTION_COLUMNS [REACTION_COLUMNS ...]]
                     [--no-header-row] [-n NUM_WORKERS] [-b BATCH_SIZE]
                     [--accelerator ACCELERATOR] [--devices DEVICES]
                     [--rxn-mode {REAC_PROD,REAC_PROD_BALANCE,REAC_DIFF,REAC_DIFF_BALANCE,PROD_DIFF,PROD_DIFF_BALANCE}]
                     [--multi-hot-atom-featurizer-mode {V1,V2,ORGANIC}]
                     [--keep-h] [--add-h]
                     [--features-generators {morgan_binary,morgan_count} [{morgan_binary,morgan_count} ...]]
                     [--descriptors-path DESCRIPTORS_PATH]
                     [--no-descriptor-scaling] [--no-atom-feature-scaling]
                     [--no-atom-descriptor-scaling]
                     [--no-bond-feature-scaling]
                     [--atom-features-path ATOM_FEATURES_PATH [ATOM_FEATURES_PATH ...]]
                     [--atom-descriptors-path ATOM_DESCRIPTORS_PATH [ATOM_DESCRIPTORS_PATH ...]]
                     [--bond-features-path BOND_FEATURES_PATH [BOND_FEATURES_PATH ...]]
                     -i TEST_PATH [-o OUTPUT] --model-path MODEL_PATH
                     --ffn-block-index FFN_BLOCK_INDEX

Named Arguments#

--logfile, --log

The path to which the log file should be written. Specifying just the flag (i.e., ‘--log/--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.

-v, --verbose

The verbosity level. Specify the flag multiple times to increase verbosity.

Default: 0

--accelerator

Passed directly to the lightning Trainer().

Default: “auto”

--devices

Passed directly to the lightning Trainer(). If specifying multiple devices, must be a single string of comma-separated devices, e.g. ‘1, 2’.

Default: “auto”

-i, --test-path

Path to an input CSV file containing SMILES.

-o, --output, --preds-path

Path to which the fingerprints will be saved. If the file extension is .npz, they will be saved as an npz file. Otherwise, they will be saved as a CSV. The index of the model will be appended to the filename’s stem. By default, the fingerprints will be saved to the same location as ‘--test-path’ with ‘_fps’ appended, i.e., ‘PATH/TO/TEST_PATH_fps_0.csv’.

--model-path

Path to either a single pretrained model checkpoint (.ckpt) or single pretrained model file (.pt) or to a directory that contains these files. If a directory, will recursively search and predict on all found models.

--ffn-block-index

The index indicates which linear layer returns the encoding in the FFN. An index of 0 denotes the post-aggregation representation through a 0-layer MLP, while an index of 1 represents the output from the first linear layer in the FFN, and so forth.

Default: -1

Shared input data args#

-s, --smiles-columns

The column names in the input CSV containing SMILES strings. If unspecified, uses the 0th column.

-r, --reaction-columns

The column names in the input CSV containing reaction SMILES in the format ‘REACTANT>AGENT>PRODUCT’, where ‘AGENT’ is optional.

--no-header-row

If specified, the first row in the input CSV will not be used as column names.

Default: False

Dataloader args#

-n, --num-workers

Number of workers for parallel data loading (0 means sequential). Warning: setting num_workers>0 can cause hangs on Windows and macOS.

Default: 0

-b, --batch-size

Batch size.

Default: 64

Featurization args#

--rxn-mode, --reaction-mode

Possible choices: REAC_PROD, REAC_PROD_BALANCE, REAC_DIFF, REAC_DIFF_BALANCE, PROD_DIFF, PROD_DIFF_BALANCE

Choices for construction of atom and bond features for reactions (case insensitive):
- ‘reac_prod’: concatenates the reactants feature with the products feature.
- ‘reac_diff’: concatenates the reactants feature with the difference in features between reactants and products. (Default)
- ‘prod_diff’: concatenates the products feature with the difference in features between reactants and products.
- ‘reac_prod_balance’: concatenates the reactants feature with the products feature, balances imbalanced reactions.
- ‘reac_diff_balance’: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.
- ‘prod_diff_balance’: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.

Default: REAC_DIFF

--multi-hot-atom-featurizer-mode

Possible choices: V1, V2, ORGANIC

Choices for multi-hot atom featurization scheme. This will affect both non-reaction and reaction featurization (case insensitive):
- V1: Corresponds to the original configuration employed in Chemprop V1.
- V2: Tailored for a broad range of molecules, this configuration encompasses all elements in the first four rows of the periodic table, along with iodine. It is the default in Chemprop V2.
- ORGANIC: Designed specifically for use with organic molecules for drug research and development, this configuration includes a subset of elements most common in organic chemistry, including H, B, C, N, O, F, Si, P, S, Cl, Br, and I.

Default: V2

--keep-h

Whether hydrogens explicitly specified in input should be kept in the mol graph.

Default: False

--add-h

Whether hydrogens should be added to the mol graph.

Default: False

--features-generators

Possible choices: morgan_binary, morgan_count

Method(s) of generating additional features.

--descriptors-path

Path to extra descriptors to concatenate to learned representation.

--no-descriptor-scaling

Turn off extra descriptor scaling.

Default: False

--no-atom-feature-scaling

Turn off extra atom feature scaling.

Default: False

--no-atom-descriptor-scaling

Turn off extra atom descriptor scaling.

Default: False

--no-bond-feature-scaling

Turn off extra bond feature scaling.

Default: False

--atom-features-path

If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional atom features to supply before message passing. E.g., --atom-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-features-path […] --atom-features-path […].

--atom-descriptors-path

If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional atom descriptors to supply after message passing. E.g., --atom-descriptors-path 0 /path/to/descriptors_0.npz indicates that the descriptors at the given path should be supplied to the 0-th component. To supply additional descriptors for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-descriptors-path […] --atom-descriptors-path […].

--bond-features-path

If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional bond features to supply before message passing. E.g., --bond-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --bond-features-path […] --bond-features-path […].

hpopt#

perform hyperparameter optimization on the given task

chemprop hpopt [-h] [--logfile [LOGFILE]] [-v]
               [-s SMILES_COLUMNS [SMILES_COLUMNS ...]]
               [-r REACTION_COLUMNS [REACTION_COLUMNS ...]] [--no-header-row]
               [-n NUM_WORKERS] [-b BATCH_SIZE] [--accelerator ACCELERATOR]
               [--devices DEVICES]
               [--rxn-mode {REAC_PROD,REAC_PROD_BALANCE,REAC_DIFF,REAC_DIFF_BALANCE,PROD_DIFF,PROD_DIFF_BALANCE}]
               [--multi-hot-atom-featurizer-mode {V1,V2,ORGANIC}] [--keep-h]
               [--add-h]
               [--features-generators {morgan_binary,morgan_count} [{morgan_binary,morgan_count} ...]]
               [--descriptors-path DESCRIPTORS_PATH] [--no-descriptor-scaling]
               [--no-atom-feature-scaling] [--no-atom-descriptor-scaling]
               [--no-bond-feature-scaling]
               [--atom-features-path ATOM_FEATURES_PATH [ATOM_FEATURES_PATH ...]]
               [--atom-descriptors-path ATOM_DESCRIPTORS_PATH [ATOM_DESCRIPTORS_PATH ...]]
               [--bond-features-path BOND_FEATURES_PATH [BOND_FEATURES_PATH ...]]
               [--config-path CONFIG_PATH] [-i DATA_PATH] [-o OUTPUT_DIR]
               [--model-frzn MODEL_FRZN] [--frzn-ffn-layers FRZN_FFN_LAYERS]
               [--ensemble-size ENSEMBLE_SIZE]
               [--message-hidden-dim MESSAGE_HIDDEN_DIM] [--message-bias]
               [--depth DEPTH] [--undirected] [--dropout DROPOUT]
               [--mpn-shared]
               [--activation {RELU,LEAKYRELU,PRELU,TANH,SELU,ELU}]
               [--aggregation {mean,sum,norm}]
               [--aggregation-norm AGGREGATION_NORM] [--atom-messages]
               [--ffn-hidden-dim FFN_HIDDEN_DIM]
               [--ffn-num-layers FFN_NUM_LAYERS] [--no-batch-norm]
               [--multiclass-num-classes MULTICLASS_NUM_CLASSES]
               [-w WEIGHT_COLUMN]
               [--target-columns TARGET_COLUMNS [TARGET_COLUMNS ...]]
               [--ignore-columns IGNORE_COLUMNS [IGNORE_COLUMNS ...]]
               [-t {regression,regression-mve,regression-evidential,classification,classification-dirichlet,multiclass,multiclass-dirichlet,spectral}]
               [-l {mse,bounded-mse,mve,evidential,bce,ce,binary-mcc,multiclass-mcc,binary-dirichlet,multiclass-dirichlet,sid,earthmovers,wasserstein}]
               [--v-kl V_KL] [--eps EPS]
               [--metrics {mae,mse,rmse,bounded-mae,bounded-mse,bounded-rmse,r2,roc,prc,accuracy,f1,bce,ce,binary-mcc,multiclass-mcc,sid,wasserstein} [{mae,mse,rmse,bounded-mae,bounded-mse,bounded-rmse,r2,roc,prc,accuracy,f1,bce,ce,binary-mcc,multiclass-mcc,sid,wasserstein} ...]]
               [--task-weights TASK_WEIGHTS [TASK_WEIGHTS ...]]
               [--warmup-epochs WARMUP_EPOCHS] [--init-lr INIT_LR]
               [--max-lr MAX_LR] [--final-lr FINAL_LR] [--epochs EPOCHS]
               [--patience PATIENCE] [--grad-clip GRAD_CLIP]
               [--split {CV_NO_VAL,CV,SCAFFOLD_BALANCED,RANDOM_WITH_REPEATED_SMILES,RANDOM,KENNARD_STONE,KMEANS}]
               [--split-sizes SPLIT_SIZES SPLIT_SIZES SPLIT_SIZES]
               [--split-key-molecule SPLIT_KEY_MOLECULE] [-k NUM_FOLDS]
               [--save-smiles-splits] [--splits-file SPLITS_FILE]
               [--splits-column SPLITS_COLUMN] [--data-seed DATA_SEED]
               [--pytorch-seed PYTORCH_SEED]
               [--search-parameter-keywords SEARCH_PARAMETER_KEYWORDS [SEARCH_PARAMETER_KEYWORDS ...]]
               [--hpopt-save-dir HPOPT_SAVE_DIR]
               [--raytune-num-samples RAYTUNE_NUM_SAMPLES]
               [--raytune-search-algorithm {random,hyperopt}]
               [--raytune-num-workers RAYTUNE_NUM_WORKERS] [--raytune-use-gpu]
               [--raytune-num-checkpoints-to-keep RAYTUNE_NUM_CHECKPOINTS_TO_KEEP]
               [--raytune-grace-period RAYTUNE_GRACE_PERIOD]
               [--raytune-reduction-factor RAYTUNE_REDUCTION_FACTOR]
               [--hyperopt-n-initial-points HYPEROPT_N_INITIAL_POINTS]
               [--hyperopt-random-state-seed HYPEROPT_RANDOM_STATE_SEED]

Named Arguments#

--logfile, --log

The path to which the log file should be written. Specifying just the flag (i.e., ‘--log/--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.
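
As a rough illustration of the default log location described above, the sketch below builds a path of the form ‘chemprop_logs/MODE/TIMESTAMP.log’. The helper name `default_log_path` is hypothetical, and the formatting code is an assumption rather than Chemprop’s own; only the path layout and the example timestamp come from the text.

```python
from datetime import datetime
from pathlib import PurePosixPath


def default_log_path(mode: str, now: datetime) -> PurePosixPath:
    """Build 'chemprop_logs/MODE/TIMESTAMP.log' (hypothetical helper)."""
    # The timestamp uses hyphens instead of colons, matching the documented example.
    timestamp = now.strftime("%Y-%m-%dT%H-%M-%S")
    return PurePosixPath("chemprop_logs") / mode / f"{timestamp}.log"


# Reproduce the documented example timestamp for the hpopt mode.
example = default_log_path("hpopt", datetime(2024, 4, 30, 15, 20, 7))
print(example)
```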

-v, --verbose

The verbosity level. Specify the flag multiple times to increase verbosity.

Default: 0

--accelerator

Passed directly to the lightning Trainer().

Default: “auto”

--devices

Passed directly to the lightning Trainer(). If specifying multiple devices, this must be a single string of comma-separated devices, e.g. ‘1, 2’.

Default: “auto”

--config-path

Path to a configuration file. Command line arguments override values in the configuration file.
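
The precedence rule can be pictured as a layered lookup. The sketch below is only an illustration of that ordering: the dictionaries and their keys are made up for the example (the two default values are taken from this page), and it does not show how Chemprop actually reads its configuration file.

```python
from collections import ChainMap

# Hypothetical values: built-in defaults, a configuration file, and the command line.
defaults = {"batch_size": 64, "epochs": 50}
config_file = {"batch_size": 128, "epochs": 100}
command_line = {"epochs": 20}

# ChainMap searches its mappings left to right, so earlier mappings win:
# command line first, then the configuration file, then the defaults.
resolved = dict(ChainMap(command_line, config_file, defaults))
print(resolved)
```

Here `epochs` comes from the command line and `batch_size` from the configuration file, because only the former was overridden.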

-i, --data-path

Path to an input CSV file containing SMILES and the associated target values.

-o, --output-dir, --save-dir

Directory where training outputs will be saved. Defaults to ‘CURRENT_DIRECTORY/chemprop_training/STEM_OF_INPUT/TIME_STAMP’.

--ensemble-size

Number of models in ensemble for each splitting of data.

Default: 1

--pytorch-seed

Seed for PyTorch randomness (e.g., random initial weights).

Shared input data args#

-s, --smiles-columns

The column names in the input CSV containing SMILES strings. If unspecified, uses the 0th column.

-r, --reaction-columns

The column names in the input CSV containing reaction SMILES in the format ‘REACTANT>AGENT>PRODUCT’, where ‘AGENT’ is optional.
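
To make the ‘REACTANT>AGENT>PRODUCT’ layout concrete, here is a minimal sketch that splits a reaction SMILES into its three parts. The helper `split_reaction_smiles` and the example strings are hypothetical; Chemprop’s own parsing may differ.

```python
def split_reaction_smiles(rxn: str) -> tuple[str, str, str]:
    """Split 'REACTANT>AGENT>PRODUCT' into its parts (hypothetical helper)."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators in {rxn!r}")
    reactant, agent, product = parts
    # An omitted agent shows up as an empty string between the two separators.
    return reactant, agent, product


with_agent = split_reaction_smiles("CCO>O>CC=O")
without_agent = split_reaction_smiles("CCO>>CC=O")
print(with_agent, without_agent)
```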

--no-header-row

If specified, the first row in the input CSV will not be used as column names.

Default: False

Dataloader args#

-n, --num-workers

Number of workers for parallel data loading (0 means sequential). Warning: setting num_workers>0 can cause hangs on Windows and macOS.

Default: 0

-b, --batch-size

Batch size.

Default: 64

Featurization args#

--rxn-mode, --reaction-mode

Possible choices: REAC_PROD, REAC_PROD_BALANCE, REAC_DIFF, REAC_DIFF_BALANCE, PROD_DIFF, PROD_DIFF_BALANCE

Choices for construction of atom and bond features for reactions (case insensitive):

- ‘reac_prod’: concatenates the reactants feature with the products feature.
- ‘reac_diff’: concatenates the reactants feature with the difference in features between reactants and products. (Default)
- ‘prod_diff’: concatenates the products feature with the difference in features between reactants and products.
- ‘reac_prod_balance’: concatenates the reactants feature with the products feature, balances imbalanced reactions.
- ‘reac_diff_balance’: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.
- ‘prod_diff_balance’: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.

Default: REAC_DIFF
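
The three base reaction modes differ only in which two vectors are concatenated. The following simplified sketch works on plain lists of numbers; the function `combine_features` is hypothetical, the sign of the difference (product minus reactant) is an assumption, and the balancing of imbalanced reactions done by the ‘_balance’ variants is not modelled.

```python
def combine_features(reac: list[float], prod: list[float], mode: str) -> list[float]:
    """Concatenate reactant/product features according to a reaction mode."""
    # Assumption: the difference is taken as product minus reactant.
    diff = [p - r for r, p in zip(reac, prod)]
    base = mode.lower()  # the option is documented as case insensitive
    if base.endswith("_balance"):
        base = base[: -len("_balance")]  # balancing itself is not modelled here
    if base == "reac_prod":
        return reac + prod
    if base == "reac_diff":
        return reac + diff
    if base == "prod_diff":
        return prod + diff
    raise ValueError(f"unknown reaction mode: {mode}")


reac, prod = [1.0, 0.0, 2.0], [0.0, 1.0, 2.0]
for mode in ("REAC_PROD", "REAC_DIFF", "PROD_DIFF"):
    print(mode, combine_features(reac, prod, mode))
```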

--multi-hot-atom-featurizer-mode

Possible choices: V1, V2, ORGANIC

Choices for multi-hot atom featurization scheme. This will affect both non-reaction and reaction featurization (case insensitive):

- V1: Corresponds to the original configuration employed in Chemprop V1.
- V2: Tailored for a broad range of molecules, this configuration encompasses all elements in the first four rows of the periodic table, along with iodine. It is the default in Chemprop V2.
- ORGANIC: Designed specifically for use with organic molecules for drug research and development, this configuration includes a subset of elements most common in organic chemistry, including H, B, C, N, O, F, Si, P, S, Cl, Br, and I.

Default: V2

--keep-h

Whether hydrogens explicitly specified in input should be kept in the mol graph.

Default: False

--add-h

Whether hydrogens should be added to the mol graph.

Default: False

--features-generators

Possible choices: morgan_binary, morgan_count

Method(s) of generating additional features.

--descriptors-path

Path to extra descriptors to concatenate to learned representation.

--no-descriptor-scaling

Turn off extra descriptor scaling.

Default: False

--no-atom-feature-scaling

Turn off extra atom feature scaling.

Default: False

--no-atom-descriptor-scaling

Turn off extra atom descriptor scaling.

Default: False

--no-bond-feature-scaling

Turn off extra bond feature scaling.

Default: False

--atom-features-path

If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional atom features to supply before message passing. E.g., --atom-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-features-path […] --atom-features-path […].

--atom-descriptors-path

If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional atom descriptors to supply after message passing. E.g., --atom-descriptors-path 0 /path/to/descriptors_0.npz indicates that the descriptors at the given path should be supplied to the 0-th component. To supply additional descriptors for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-descriptors-path […] --atom-descriptors-path […].

--bond-features-path

If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional bond features to supply before message passing. E.g., --bond-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --bond-features-path […] --bond-features-path […].
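
The three path options above share the same calling pattern: either a lone path or an index followed by a path, with the flag repeated once per component. The sketch below mimics that pattern with a standalone argparse parser. The `nargs`/`action` settings, the helper `paths_by_component`, and the second file name are assumptions made for illustration, not a copy of Chemprop’s parser.

```python
import argparse

parser = argparse.ArgumentParser()
# Assumption: each occurrence takes one or two values, and occurrences accumulate.
parser.add_argument("--atom-features-path", nargs="+", action="append", default=[])


def paths_by_component(occurrences: list[list[str]]) -> dict[int, str]:
    """Map molecule index -> path; a lone path is assigned to index 0."""
    result = {}
    for values in occurrences:
        if len(values) == 1:
            index, path = 0, values[0]
        elif len(values) == 2:
            index, path = int(values[0]), values[1]
        else:
            raise ValueError(f"expected PATH or INDEX PATH, got {values}")
        result[index] = path
    return result


# The flag is repeated once per component, as described above.
args = parser.parse_args(
    ["--atom-features-path", "0", "/path/to/features_0.npz",
     "--atom-features-path", "1", "/path/to/features_1.npz"]
)
print(paths_by_component(args.atom_features_path))
```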

transfer learning args#

--model-frzn

Path to model checkpoint file to be loaded for overwriting and freezing weights.

--frzn-ffn-layers

Overwrites weights for the first n layers of the FFN from the checkpoint model (specified with --model-frzn), where n is specified in the input. Automatically also freezes MPNN weights.

Default: 0

message passing#

--message-hidden-dim

hidden dimension of the messages

Default: 300

--message-bias

add bias to the message passing layers

Default: False

--depth

Number of message passing steps.

Default: 3

--undirected

Pass messages on undirected bonds/edges (always sum the two relevant bond vectors).

Default: False

--dropout

dropout probability in message passing/FFN layers

Default: 0.0

--mpn-shared

Whether to use the same message passing neural network for all input molecules. Only relevant if number_of_molecules > 1.

Default: False

--activation

Possible choices: RELU, LEAKYRELU, PRELU, TANH, SELU, ELU

activation function in message passing/FFN layers

Default: RELU

--aggregation, --agg

Possible choices: mean, sum, norm

The aggregation mode to use during graph prediction.

Default: “mean”

--aggregation-norm

normalization factor by which to divide summed up atomic features for ‘norm’ aggregation

Default: 100
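
The three aggregation modes can be compared on a toy example. In this sketch, per-atom feature vectors are plain lists and `aggregate` is a hypothetical helper: ‘sum’ adds the atom vectors, ‘mean’ divides the sum by the number of atoms, and ‘norm’ divides the sum by the fixed --aggregation-norm factor. It illustrates the arithmetic only, not Chemprop’s implementation.

```python
def aggregate(atom_vectors: list[list[float]], mode: str = "mean",
              aggregation_norm: float = 100.0) -> list[float]:
    """Pool per-atom vectors into one molecule-level vector."""
    # Column-wise sum over all atoms of the molecule.
    summed = [sum(column) for column in zip(*atom_vectors)]
    if mode == "sum":
        return summed
    if mode == "mean":
        return [value / len(atom_vectors) for value in summed]
    if mode == "norm":
        # Divide by a constant rather than by the atom count.
        return [value / aggregation_norm for value in summed]
    raise ValueError(f"unknown aggregation mode: {mode}")


atoms = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for mode in ("sum", "mean", "norm"):
    print(mode, aggregate(atoms, mode))
```

Unlike ‘mean’, the ‘norm’ result still grows with molecule size, since the divisor does not depend on the number of atoms.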

--atom-messages

pass messages on atoms rather than bonds

Default: False

FFN args#

--ffn-hidden-dim

hidden dimension in the FFN top model

Default: 300

--ffn-num-layers

number of layers in FFN top model

Default: 1

extra MPNN args#

--no-batch-norm

Don’t use batch normalization after aggregation.

Default: False

--multiclass-num-classes

Number of classes when running multiclass classification.

Default: 3

training input data args#

-w, --weight-column

The name of the column in the input CSV containing individual data weights.

--target-columns

Name of the columns containing target values. By default, uses all columns except the SMILES column and the ignore_columns.

--ignore-columns

Name of the columns to ignore when target_columns is not provided.
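
The default selection of target columns amounts to a set difference over the CSV header. The sketch below assumes a hypothetical header and helper name; it covers only the SMILES and ignore columns mentioned above, and any other columns Chemprop may exclude are not modelled.

```python
def default_target_columns(header: list[str], smiles_columns: list[str],
                           ignore_columns: list[str] = ()) -> list[str]:
    """Every column that is neither a SMILES column nor an ignored column."""
    excluded = set(smiles_columns) | set(ignore_columns)
    # Preserve the order in which columns appear in the CSV header.
    return [name for name in header if name not in excluded]


header = ["smiles", "logP", "solubility", "source"]  # hypothetical CSV header
targets = default_target_columns(header, ["smiles"], ["source"])
print(targets)
```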

--splits-column

Name of the column in the input CSV file containing ‘train’, ‘val’, or ‘test’ for each row.
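
As an illustration of a splits column, the sketch below groups row indices of a small in-memory CSV by the value found in that column. The column name `split`, the data, and the helper are all hypothetical.

```python
import csv
import io

# Hypothetical input: the last column assigns every row to a split.
csv_text = """smiles,target,split
CCO,0.1,train
CCN,0.2,train
CCC,0.3,val
CCCl,0.4,test
"""


def indices_by_split(text: str, splits_column: str) -> dict[str, list[int]]:
    """Collect zero-based row indices for the 'train', 'val' and 'test' rows."""
    groups = {"train": [], "val": [], "test": []}
    for row_index, row in enumerate(csv.DictReader(io.StringIO(text))):
        groups[row[splits_column]].append(row_index)
    return groups


groups = indices_by_split(csv_text, "split")
print(groups)
```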

training args#

-t, --task-type

Possible choices: regression, regression-mve, regression-evidential, classification, classification-dirichlet, multiclass, multiclass-dirichlet, spectral

Type of dataset. This determines the default loss function used during training. Defaults to regression.

Default: “regression”

-l, --loss-function

Possible choices: mse, bounded-mse, mve, evidential, bce, ce, binary-mcc, multiclass-mcc, binary-dirichlet, multiclass-dirichlet, sid, earthmovers, wasserstein

Loss function to use during training. If not specified, will use the default loss function for the given task type (see documentation).

--v-kl, --evidential-regularization

Value used in regularization for evidential loss function. The default value recommended by Soleimany et al. (2021) is 0.2. Optimal value is dataset-dependent; it is recommended that users test different values to find the best value for their model.

Default: 0.0

--eps

evidential regularization epsilon

Default: 1e-08

--metrics, --metric

Possible choices: mae, mse, rmse, bounded-mae, bounded-mse, bounded-rmse, r2, roc, prc, accuracy, f1, bce, ce, binary-mcc, multiclass-mcc, sid, wasserstein

Evaluation metrics. If unspecified, the following metrics are used for the given dataset types: regression->rmse, classification->roc, multiclass->ce (‘cross entropy’), spectral->sid. If multiple metrics are provided, the 0th one will be used for early stopping and checkpointing.
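
The default-metric rule can be written as a small lookup. The sketch below covers only the four dataset types listed above; the helper `resolve_metrics` is hypothetical, and how the remaining task types (for example ‘regression-mve’) map to defaults is not shown because the text does not state it.

```python
from typing import Optional

# Defaults as listed in the option description above.
DEFAULT_METRIC = {
    "regression": "rmse",
    "classification": "roc",
    "multiclass": "ce",
    "spectral": "sid",
}


def resolve_metrics(task_type: str, metrics: Optional[list[str]] = None) -> tuple[list[str], str]:
    """Return (metrics to report, metric monitored for early stopping and checkpointing)."""
    chosen = list(metrics) if metrics else [DEFAULT_METRIC[task_type]]
    # The 0th metric is the one used for early stopping and checkpointing.
    return chosen, chosen[0]


print(resolve_metrics("regression"))
print(resolve_metrics("classification", ["prc", "roc"]))
```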

--task-weights

the weight to apply to an individual task in the overall loss

--warmup-epochs

Number of epochs during which learning rate increases linearly from init_lr to max_lr. Afterwards, learning rate decreases exponentially from max_lr to final_lr.

Default: 2
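
The shape of this schedule can be sketched per epoch as follows. The function `lr_at_epoch` is hypothetical and simplified: it evaluates the rate once per epoch, whereas a real scheduler may update it at every optimization step. The default values are the ones documented for --warmup-epochs, --init-lr, --max-lr, --final-lr and --epochs.

```python
def lr_at_epoch(epoch: float, warmup_epochs: int = 2, epochs: int = 50,
                init_lr: float = 1e-4, max_lr: float = 1e-3,
                final_lr: float = 1e-4) -> float:
    """Linear warmup from init_lr to max_lr, then exponential decay to final_lr."""
    if epoch < warmup_epochs:
        # Linear interpolation between init_lr and max_lr.
        return init_lr + (max_lr - init_lr) * epoch / warmup_epochs
    # Per-epoch decay factor chosen so that the rate reaches final_lr at the last epoch.
    gamma = (final_lr / max_lr) ** (1 / (epochs - warmup_epochs))
    return max_lr * gamma ** (epoch - warmup_epochs)


for epoch in (0, 1, 2, 26, 50):
    print(epoch, lr_at_epoch(epoch))
```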

--init-lr

Initial learning rate.

Default: 0.0001

--max-lr

Maximum learning rate.

Default: 0.001

--final-lr

Final learning rate.

Default: 0.0001

--epochs

the number of epochs to train over

Default: 50

--patience

Number of epochs to wait for improvement before early stopping.

--grad-clip

Passed directly to the lightning trainer which controls grad clipping. See the Trainer() docstring for details.

split args#

--split, --split-type

Possible choices: CV_NO_VAL, CV, SCAFFOLD_BALANCED, RANDOM_WITH_REPEATED_SMILES, RANDOM, KENNARD_STONE, KMEANS

Method of splitting the data into train/val/test (case insensitive).

Default: RANDOM

--split-sizes

Split proportions for train/validation/test sets.

Default: [0.8, 0.1, 0.1]

--split-key-molecule

The index of the key molecule used for splitting when multiple molecules are present and constrained split_type is used (e.g., ‘scaffold_balanced’ or ‘random_with_repeated_smiles’). Note that this index begins with zero for the first molecule.

Default: 0

-k, --num-folds

Number of folds when performing cross validation.

Default: 1

--save-smiles-splits

Save the SMILES of each train/val/test split for convenience in later prediction.

Default: False

--splits-file

Path to a JSON file containing pre-defined splits for the input data, formatted as a list of dictionaries with keys ‘train’, ‘val’, and ‘test’ and values as lists of indices or strings formatted like ‘0-2,4’. See documentation for more details.
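
The sketch below shows one way such a file could be read: a JSON list with one dictionary per split, where each value is either a list of indices or a compact string like ‘0-2,4’. The helper `parse_indices` is written for this example, the JSON content is made up, and treating ranges as inclusive is an assumption.

```python
import json


def parse_indices(spec) -> list[int]:
    """Expand a list of ints, or a string such as '0-2,4', into a list of ints."""
    if not isinstance(spec, str):
        return list(spec)
    indices = []
    for part in spec.split(","):
        if "-" in part:
            start, end = map(int, part.split("-"))
            indices.extend(range(start, end + 1))  # assumption: ranges are inclusive
        else:
            indices.append(int(part))
    return indices


# Hypothetical file content with a single pre-defined split.
splits_json = '[{"train": "0-2,4", "val": [3], "test": [5, 6]}]'
splits = [
    {key: parse_indices(value) for key, value in fold.items()}
    for fold in json.loads(splits_json)
]
print(splits)
```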

--data-seed

Random seed to use when splitting data into train/val/test sets. When num_folds > 1, the first fold uses this seed and all subsequent folds add 1 to the seed. Also used for shuffling data in build_dataloader when shuffle is True.

Default: 0
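
Read literally, the seeding rule gives each fold the data seed plus its fold index. A minimal sketch, with a hypothetical helper name:

```python
def fold_seeds(data_seed: int = 0, num_folds: int = 1) -> list[int]:
    """Seed used for each fold: the first fold uses data_seed, later folds add 1 each."""
    return [data_seed + fold for fold in range(num_folds)]


print(fold_seeds(0, 3))
print(fold_seeds(42, 1))
```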

Chemprop hyperparameter optimization arguments#

--search-parameter-keywords

The model parameters over which to search for an optimal hyperparameter configuration. Some options are bundles of parameters or otherwise special parameter operations.

Special keywords:

- basic: the default set of hyperparameters for search: depth, ffn_num_layers, dropout, message_hidden_dim, and ffn_hidden_dim.
- learning_rate: search for max_lr, init_lr_ratio, final_lr_ratio, and warmup_epochs. The init_lr and final_lr values are searched as fractions of the max_lr value, and warmup_epochs is searched as a fraction of the total epochs used.
- all: include search for all 13 individual keyword options.

Individual supported parameters:

[]

Default: [‘basic’]
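
For the learning_rate keyword, the searched values are relative and have to be turned into absolute training settings. The sketch below shows that conversion under stated assumptions: the helper `to_training_settings` and the sample values are hypothetical, and truncating the warmup length to an integer is a guess at the rounding rule.

```python
def to_training_settings(sample: dict, epochs: int = 50) -> dict:
    """Convert a sampled learning-rate configuration into absolute values."""
    max_lr = sample["max_lr"]
    return {
        "max_lr": max_lr,
        # init_lr and final_lr are expressed as fractions of max_lr.
        "init_lr": max_lr * sample["init_lr_ratio"],
        "final_lr": max_lr * sample["final_lr_ratio"],
        # Assumption: the sampled warmup value is a fraction of the total epochs,
        # truncated to a whole number of epochs.
        "warmup_epochs": int(sample["warmup_epochs"] * epochs),
    }


sample = {"max_lr": 1e-3, "init_lr_ratio": 0.1, "final_lr_ratio": 0.1, "warmup_epochs": 0.1}
settings = to_training_settings(sample)
print(settings)
```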

--hpopt-save-dir

Directory to save the hyperparameter optimization results

Ray Tune arguments#

--raytune-num-samples

Passed directly to Ray Tune TuneConfig to control number of trials to run

Default: 10

--raytune-search-algorithm

Possible choices: random, hyperopt

Passed to Ray Tune TuneConfig to control search algorithm

Default: “hyperopt”

--raytune-num-workers

Passed directly to Ray Tune ScalingConfig to control number of workers to use

Default: 1

--raytune-use-gpu

Passed directly to Ray Tune ScalingConfig to control whether to use GPUs

Default: False

--raytune-num-checkpoints-to-keep

Passed directly to Ray Tune CheckpointConfig to control number of checkpoints to keep

Default: 1

--raytune-grace-period

Passed directly to Ray Tune ASHAScheduler to control grace period

Default: 10

--raytune-reduction-factor

Passed directly to Ray Tune ASHAScheduler to control reduction factor

Default: 2
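
The grace period and reduction factor together determine the points at which a successive-halving scheduler such as ASHA considers stopping a trial. The sketch below lists those points under two assumptions: that the maximum trial length equals the number of training epochs (50 by default), and that rungs start at the grace period and grow by the reduction factor. It does not call Ray Tune, and the helper name is hypothetical.

```python
def asha_rungs(grace_period: int = 10, reduction_factor: int = 2,
               max_t: int = 50) -> list[int]:
    """Epochs at which underperforming trials may be stopped (simplified)."""
    rungs = []
    t = grace_period  # no trial is stopped before the grace period has elapsed
    while t <= max_t:
        rungs.append(t)
        t *= reduction_factor
    return rungs


print(asha_rungs())
```

At each rung, roughly the best 1/reduction_factor of the trials seen so far are allowed to continue, so a larger reduction factor prunes more aggressively.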

Hyperopt arguments#

--hyperopt-n-initial-points

Passed directly to HyperOptSearch to control number of initial points to sample

Default: 20

--hyperopt-random-state-seed

Passed directly to HyperOptSearch to control random state seed