CLI Reference#

usage: chemprop [-h] {train,predict,convert,fingerprint,hpopt} ...



Possible choices: train, predict, convert, fingerprint, hpopt



train a chemprop model

chemprop train [-h] [--logfile [LOGFILE]] [-v]
               [-s SMILES_COLUMNS [SMILES_COLUMNS ...]]
               [-r REACTION_COLUMNS [REACTION_COLUMNS ...]] [--no-header-row]
               [-n NUM_WORKERS] [-b BATCH_SIZE] [--accelerator ACCELERATOR]
               [--devices DEVICES]
               [--multi-hot-atom-featurizer-mode {V1,V2,ORGANIC}] [--keep-h]
               [--features-generators {morgan_binary,morgan_count} [{morgan_binary,morgan_count} ...]]
               [--descriptors-path DESCRIPTORS_PATH] [--no-descriptor-scaling]
               [--no-atom-feature-scaling] [--no-atom-descriptor-scaling]
               [--atom-features-path ATOM_FEATURES_PATH [ATOM_FEATURES_PATH ...]]
               [--atom-descriptors-path ATOM_DESCRIPTORS_PATH [ATOM_DESCRIPTORS_PATH ...]]
               [--bond-features-path BOND_FEATURES_PATH [BOND_FEATURES_PATH ...]]
               [--config-path CONFIG_PATH] [-i DATA_PATH] [-o OUTPUT_DIR]
               [--model-frzn MODEL_FRZN] [--frzn-ffn-layers FRZN_FFN_LAYERS]
               [--ensemble-size ENSEMBLE_SIZE]
               [--message-hidden-dim MESSAGE_HIDDEN_DIM] [--message-bias]
               [--depth DEPTH] [--undirected] [--dropout DROPOUT]
               [--activation {RELU,LEAKYRELU,PRELU,TANH,SELU,ELU}]
               [--aggregation {mean,sum,norm}]
               [--aggregation-norm AGGREGATION_NORM] [--atom-messages]
               [--ffn-hidden-dim FFN_HIDDEN_DIM]
               [--ffn-num-layers FFN_NUM_LAYERS] [--no-batch-norm]
               [--multiclass-num-classes MULTICLASS_NUM_CLASSES]
               [-w WEIGHT_COLUMN]
               [--target-columns TARGET_COLUMNS [TARGET_COLUMNS ...]]
               [--ignore-columns IGNORE_COLUMNS [IGNORE_COLUMNS ...]]
               [-t {regression,regression-mve,regression-evidential,classification,classification-dirichlet,multiclass,multiclass-dirichlet,spectral}]
               [-l {mse,bounded-mse,mve,evidential,bce,ce,binary-mcc,multiclass-mcc,binary-dirichlet,multiclass-dirichlet,sid,earthmovers,wasserstein}]
               [--v-kl V_KL] [--eps EPS]
               [--metrics {mae,mse,rmse,bounded-mae,bounded-mse,bounded-rmse,r2,roc,prc,accuracy,f1,bce,ce,binary-mcc,multiclass-mcc,sid,wasserstein} [{mae,mse,rmse,bounded-mae,bounded-mse,bounded-rmse,r2,roc,prc,accuracy,f1,bce,ce,binary-mcc,multiclass-mcc,sid,wasserstein} ...]]
               [--task-weights TASK_WEIGHTS [TASK_WEIGHTS ...]]
               [--warmup-epochs WARMUP_EPOCHS] [--init-lr INIT_LR]
               [--max-lr MAX_LR] [--final-lr FINAL_LR] [--epochs EPOCHS]
               [--patience PATIENCE] [--grad-clip GRAD_CLIP]
               [--split-sizes SPLIT_SIZES SPLIT_SIZES SPLIT_SIZES]
               [--split-key-molecule SPLIT_KEY_MOLECULE] [-k NUM_FOLDS]
               [--save-smiles-splits] [--splits-file SPLITS_FILE]
               [--splits-column SPLITS_COLUMN] [--data-seed DATA_SEED]
               [--pytorch-seed PYTORCH_SEED]

Named Arguments#

--logfile, --log

The path to which the log file should be written. Specifying just the flag (i.e., ‘--log/--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.
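As a rough illustration of the default log location described above, the sketch below builds the path from a mode name and a timestamp. The function name `default_log_path` is hypothetical, and the `strftime` pattern is an assumption chosen only because it reproduces the example timestamp; it is not taken from the Chemprop source.

```python
from datetime import datetime
from pathlib import Path


def default_log_path(mode: str, now: datetime) -> Path:
    """Build 'chemprop_logs/MODE/TIMESTAMP.log' for the given CLI mode."""
    # Assumed pattern: it reproduces the documented example 2024-04-30T15-20-07.
    timestamp = now.strftime("%Y-%m-%dT%H-%M-%S")
    return Path("chemprop_logs") / mode / f"{timestamp}.log"


print(default_log_path("train", datetime(2024, 4, 30, 15, 20, 7)).as_posix())
```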

-v, --verbose

The verbosity level. Specify the flag multiple times to increase verbosity.

Default: 0


--accelerator

Passed directly to the Lightning Trainer().

Default: “auto”

--devices

Passed directly to the Lightning Trainer(). If specifying multiple devices, the value must be a single string of comma-separated devices, e.g. ‘1, 2’.

Default: “auto”

--config-path

Path to a configuration file. Command line arguments override values in the configuration file.

-i, --data-path

Path to an input CSV file containing SMILES and the associated target values.

-o, --output-dir, --save-dir

Directory where training outputs will be saved. Defaults to ‘CURRENT_DIRECTORY/chemprop_training/STEM_OF_INPUT/TIME_STAMP’.


--ensemble-size

Number of models in the ensemble for each splitting of the data.

Default: 1

--pytorch-seed

Seed for PyTorch randomness (e.g., random initial weights).

Shared input data args#

-s, --smiles-columns

The column names in the input CSV containing SMILES strings. If unspecified, uses the 0th column.

-r, --reaction-columns

The column names in the input CSV containing reaction SMILES in the format ‘REACTANT>AGENT>PRODUCT’, where ‘AGENT’ is optional.


--no-header-row

If specified, the first row in the input CSV will not be used as column names.

Default: False

Dataloader args#

-n, --num-workers

Number of workers for parallel data loading (0 means sequential). Warning: setting num_workers > 0 can cause hangs on Windows and macOS.

Default: 0

-b, --batch-size

Batch size.

Default: 64

Featurization args#

--rxn-mode, --reaction-mode


Choices for construction of atom and bond features for reactions (case insensitive):

- ‘reac_prod’: concatenates the reactants feature with the products feature.
- ‘reac_diff’: concatenates the reactants feature with the difference in features between reactants and products. (Default)
- ‘prod_diff’: concatenates the products feature with the difference in features between reactants and products.
- ‘reac_prod_balance’: concatenates the reactants feature with the products feature, balances imbalanced reactions.
- ‘reac_diff_balance’: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.
- ‘prod_diff_balance’: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.

Default: REAC_DIFF
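The three unbalanced reaction modes can be pictured with plain Python lists. This is only a sketch: the function name `reaction_features` is hypothetical, the sign of the difference (product minus reactant) is an assumption, and the ‘_balance’ variants are left out because the text does not say how balancing is done.

```python
def reaction_features(reac: list[float], prod: list[float], mode: str = "reac_diff") -> list[float]:
    """Combine per-atom (or per-bond) reactant and product feature vectors."""
    if len(reac) != len(prod):
        raise ValueError("reactant and product features must have equal length")
    # Assumption: the difference is taken as product minus reactant.
    diff = [p - r for r, p in zip(reac, prod)]
    key = mode.lower()  # the option is documented as case insensitive
    if key == "reac_prod":
        return reac + prod
    if key == "reac_diff":
        return reac + diff
    if key == "prod_diff":
        return prod + diff
    raise ValueError(f"mode not covered by this sketch: {mode}")


print(reaction_features([1.0, 0.0], [0.0, 1.0], "REAC_DIFF"))
```

In every mode the combined vector is twice the length of a single feature vector, which is why the choice of mode does not change the model's input dimension.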


--multi-hot-atom-featurizer-mode

Possible choices: V1, V2, ORGANIC

Choices for the multi-hot atom featurization scheme. This affects both non-reaction and reaction featurization (case insensitive):

- V1: Corresponds to the original configuration employed in Chemprop V1.
- V2: Tailored for a broad range of molecules, this configuration encompasses all elements in the first four rows of the periodic table, along with iodine. It is the default in Chemprop V2.
- ORGANIC: Designed specifically for use with organic molecules for drug research and development, this configuration includes a subset of elements most common in organic chemistry, including H, B, C, N, O, F, Si, P, S, Cl, Br, and I.

Default: V2


--keep-h

Whether hydrogens explicitly specified in input should be kept in the mol graph.

Default: False

--add-h

Whether hydrogens should be added to the mol graph.

Default: False

--features-generators

Possible choices: morgan_binary, morgan_count

Method(s) of generating additional features.

--descriptors-path

Path to extra descriptors to concatenate to the learned representation.

--no-descriptor-scaling

Turn off extra descriptor scaling.

Default: False

--no-atom-feature-scaling

Turn off extra atom feature scaling.

Default: False

--no-atom-descriptor-scaling

Turn off extra atom descriptor scaling.

Default: False

--no-bond-feature-scaling

Turn off extra bond feature scaling.

Default: False


--atom-features-path

If a single path is given, it is assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional atom features to supply before message passing. E.g., ‘--atom-features-path 0 /path/to/features_0.npz’ indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., ‘--atom-features-path […] --atom-features-path […]’.

--atom-descriptors-path

If a single path is given, it is assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional atom descriptors to supply after message passing. E.g., ‘--atom-descriptors-path 0 /path/to/descriptors_0.npz’ indicates that the descriptors at the given path should be supplied to the 0-th component. To supply additional descriptors for multiple components, repeat this argument on the command line for each component’s respective values, e.g., ‘--atom-descriptors-path […] --atom-descriptors-path […]’.

--bond-features-path

If a single path is given, it is assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional bond features to supply before message passing. E.g., ‘--bond-features-path 0 /path/to/features_0.npz’ indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., ‘--bond-features-path […] --bond-features-path […]’.

transfer learning args#


--model-frzn

Path to model checkpoint file to be loaded for overwriting and freezing weights.

--frzn-ffn-layers

Overwrites the weights of the first n layers of the FFN with those from the checkpoint model (specified with --model-frzn), where n is the value given. Also automatically freezes the MPNN weights.

Default: 0

message passing#


--message-hidden-dim

Hidden dimension of the messages.

Default: 300

--message-bias

Add bias to the message passing layers.

Default: False

--depth

Number of message passing steps.

Default: 3

--undirected

Pass messages on undirected bonds/edges (always sum the two relevant bond vectors).

Default: False

--dropout

Dropout probability in message passing/FFN layers.

Default: 0.0


Whether to use the same message passing neural network for all input molecules. Only relevant if number_of_molecules > 1

Default: False



--activation

Possible choices: RELU, LEAKYRELU, PRELU, TANH, SELU, ELU

Activation function in message passing/FFN layers.

Default: RELU

--aggregation, --agg

Possible choices: mean, sum, norm

The aggregation mode to use in the graph predictor.

Default: “mean”


--aggregation-norm

Normalization factor by which to divide the summed atomic features for ‘norm’ aggregation.

Default: 100
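To make the three aggregation choices concrete, here is a minimal sketch on plain lists. The function name `aggregate` is hypothetical, and the code only illustrates the arithmetic implied by the option descriptions (sum, sum divided by the atom count, sum divided by the normalization factor); it is not the library's implementation.

```python
def aggregate(atom_vectors: list[list[float]], mode: str = "mean", norm: float = 100.0) -> list[float]:
    """Pool per-atom vectors into a single molecule-level vector."""
    summed = [sum(column) for column in zip(*atom_vectors)]
    if mode == "sum":
        return summed
    if mode == "mean":
        return [s / len(atom_vectors) for s in summed]
    if mode == "norm":
        # 'norm' divides the sum by a fixed constant instead of the atom count.
        return [s / norm for s in summed]
    raise ValueError(f"unknown aggregation mode: {mode}")


atoms = [[1.0, 2.0], [3.0, 4.0]]
print(aggregate(atoms, "sum"), aggregate(atoms, "mean"), aggregate(atoms, "norm"))
```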


--atom-messages

Pass messages on atoms rather than bonds.

Default: False

FFN args#


--ffn-hidden-dim

Hidden dimension in the FFN top model.

Default: 300

--ffn-num-layers

Number of layers in the FFN top model.

Default: 1

extra MPNN args#


--no-batch-norm

Don’t use batch normalization after aggregation.

Default: False

--multiclass-num-classes

Number of classes when running multiclass classification.

Default: 3

training input data args#

-w, --weight-column

The name of the column in the input CSV containing individual data weights.


--target-columns

Names of the columns containing target values. By default, uses all columns except the SMILES column and the ignore_columns.

--ignore-columns

Names of the columns to ignore when target_columns is not provided.

--splits-column

Name of the column in the input CSV file containing ‘train’, ‘val’, or ‘test’ for each row.

training args#

-t, --task-type

Possible choices: regression, regression-mve, regression-evidential, classification, classification-dirichlet, multiclass, multiclass-dirichlet, spectral

Type of dataset. This determines the default loss function used during training. Defaults to regression.

Default: “regression”

-l, --loss-function

Possible choices: mse, bounded-mse, mve, evidential, bce, ce, binary-mcc, multiclass-mcc, binary-dirichlet, multiclass-dirichlet, sid, earthmovers, wasserstein

Loss function to use during training. If not specified, will use the default loss function for the given task type (see documentation).

--v-kl, --evidential-regularization

Value used in regularization for the evidential loss function. Soleimany et al. (2021) recommend a value of 0.2. The optimal value is dataset-dependent; users should test different values to find the best one for their model.

Default: 0.0


--eps

Evidential regularization epsilon.

Default: 1e-08

--metrics, --metric

Possible choices: mae, mse, rmse, bounded-mae, bounded-mse, bounded-rmse, r2, roc, prc, accuracy, f1, bce, ce, binary-mcc, multiclass-mcc, sid, wasserstein

Evaluation metrics. If unspecified, the following metric is used for each dataset type: regression -> rmse, classification -> roc, multiclass -> ce (‘cross entropy’), spectral -> sid. If multiple metrics are provided, the first (0th) one is used for early stopping and checkpointing.
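The metric fallback and the role of the first metric can be sketched as follows. The names `DEFAULT_METRIC` and `resolve_metrics` are hypothetical, and the mapping covers only the four dataset types named in the description; how the other task types (e.g. regression-mve) resolve is not stated here and is not guessed.

```python
from typing import Optional, Sequence

# Only the defaults spelled out in the option description.
DEFAULT_METRIC = {
    "regression": "rmse",
    "classification": "roc",
    "multiclass": "ce",
    "spectral": "sid",
}


def resolve_metrics(task_type: str, metrics: Optional[Sequence[str]] = None):
    """Return (all metrics to report, metric monitored for early stopping)."""
    chosen = list(metrics) if metrics else [DEFAULT_METRIC[task_type]]
    return chosen, chosen[0]  # the 0th metric drives early stopping and checkpointing


print(resolve_metrics("regression"))
print(resolve_metrics("classification", ["prc", "roc"]))
```

Because the first metric is the one monitored, the order in which metrics are listed on the command line matters.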


--task-weights

The weight to apply to an individual task in the overall loss.

--warmup-epochs

Number of epochs during which the learning rate increases linearly from init_lr to max_lr. Afterwards, the learning rate decreases exponentially from max_lr to final_lr.

Default: 2

--init-lr

Initial learning rate.

Default: 0.0001

--max-lr

Maximum learning rate.

Default: 0.001

--final-lr

Final learning rate.

Default: 0.0001
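The warmup-then-decay schedule described by these four options can be sketched as a function of the epoch. This is an illustration under assumptions: the function name `learning_rate` is hypothetical, the schedule is evaluated per (possibly fractional) epoch rather than per optimizer step, and `epochs` is assumed to be larger than `warmup_epochs`.

```python
def learning_rate(
    epoch: float,
    warmup_epochs: int = 2,
    epochs: int = 50,
    init_lr: float = 1e-4,
    max_lr: float = 1e-3,
    final_lr: float = 1e-4,
) -> float:
    """Linear warmup from init_lr to max_lr, then exponential decay to final_lr."""
    if epoch < warmup_epochs:
        # Linear interpolation between init_lr and max_lr.
        return init_lr + (max_lr - init_lr) * epoch / warmup_epochs
    # Per-epoch decay factor chosen so that the rate reaches final_lr at `epochs`.
    gamma = (final_lr / max_lr) ** (1 / (epochs - warmup_epochs))
    return max_lr * gamma ** (epoch - warmup_epochs)


for e in (0, 1, 2, 26, 50):
    print(e, learning_rate(e))
```

With the defaults shown, the rate starts at init_lr, peaks at max_lr once warmup ends, and returns to final_lr at the last epoch.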


--epochs

The number of epochs to train over.

Default: 50

--patience

Number of epochs to wait for improvement before early stopping.

--grad-clip

Passed directly to the Lightning Trainer, which controls gradient clipping. See the Trainer() docstring for details.

split args#

--split, --split-type


Method of splitting the data into train/val/test (case insensitive).

Default: RANDOM


--split-sizes

Split proportions for train/validation/test sets.

Default: [0.8, 0.1, 0.1]
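As a rough picture of what the proportions mean for a dataset of a given size, the sketch below converts them to row counts. The function name `split_counts` is hypothetical, and the rounding rule (round train and validation, give the remainder to test) is an assumption made for illustration; the actual splitter may round differently.

```python
def split_counts(n_rows: int, sizes=(0.8, 0.1, 0.1)) -> tuple[int, int, int]:
    """Turn train/val/test proportions into row counts that sum to n_rows."""
    if abs(sum(sizes) - 1.0) > 1e-9:
        raise ValueError("split sizes must sum to 1")
    n_train = round(sizes[0] * n_rows)
    n_val = round(sizes[1] * n_rows)
    # Assumption: whatever remains after train and val goes to the test set.
    return n_train, n_val, n_rows - n_train - n_val


print(split_counts(1000))
```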


--split-key-molecule

The index of the key molecule used for splitting when multiple molecules are present and a constrained split_type is used (e.g., ‘scaffold_balanced’ or ‘random_with_repeated_smiles’). Note that this index begins with zero for the first molecule.

Default: 0

-k, --num-folds

Number of folds when performing cross validation.

Default: 1


--save-smiles-splits

Save the SMILES for each train/val/test split for convenient prediction later.

Default: False


--splits-file

Path to a JSON file containing pre-defined splits for the input data, formatted as a list of dictionaries with keys ‘train’, ‘val’, and ‘test’ and values as lists of indices or strings formatted like ‘0-2,4’. See documentation for more details.
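A splits file in the format described above could be read along these lines. This is a sketch, not the library's parser: `parse_indices` and `load_splits` are hypothetical names, and treating a range such as ‘0-2’ as inclusive of both ends is an assumption.

```python
import json


def parse_indices(spec) -> list[int]:
    """Expand a list of ints or a string like '0-2,4' into explicit indices."""
    if not isinstance(spec, str):
        return [int(i) for i in spec]
    indices: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, stop = part.split("-")
            # Assumption: ranges include both endpoints, so '0-2' means 0, 1, 2.
            indices.extend(range(int(start), int(stop) + 1))
        else:
            indices.append(int(part))
    return indices


def load_splits(text: str) -> list[dict[str, list[int]]]:
    """Parse the JSON text of a splits file: one dictionary per split."""
    return [
        {key: parse_indices(fold[key]) for key in ("train", "val", "test")}
        for fold in json.loads(text)
    ]


example = '[{"train": "0-2,4", "val": [3], "test": "5-6"}]'
print(load_splits(example))
```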


--data-seed

Random seed to use when splitting data into train/val/test sets. When num_folds > 1, the first fold uses this seed and all subsequent folds add 1 to the seed. Also used for shuffling data in build_dataloader when shuffle is True.

Default: 0
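The per-fold seeding rule can be written out in one line. The function name `fold_seeds` is hypothetical, and the sketch reads "add 1 to the seed" as each fold using the previous fold's seed plus one.

```python
def fold_seeds(data_seed: int = 0, num_folds: int = 1) -> list[int]:
    """Seed used for each cross-validation fold, starting from data_seed."""
    return [data_seed + fold for fold in range(num_folds)]


print(fold_seeds(data_seed=42, num_folds=3))
```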


use a pretrained chemprop model for prediction

chemprop predict [-h] [--logfile [LOGFILE]] [-v]
                 [-s SMILES_COLUMNS [SMILES_COLUMNS ...]]
                 [-r REACTION_COLUMNS [REACTION_COLUMNS ...]]
                 [--no-header-row] [-n NUM_WORKERS] [-b BATCH_SIZE]
                 [--accelerator ACCELERATOR] [--devices DEVICES]
                 [--multi-hot-atom-featurizer-mode {V1,V2,ORGANIC}] [--keep-h]
                 [--features-generators {morgan_binary,morgan_count} [{morgan_binary,morgan_count} ...]]
                 [--descriptors-path DESCRIPTORS_PATH]
                 [--no-descriptor-scaling] [--no-atom-feature-scaling]
                 [--no-atom-descriptor-scaling] [--no-bond-feature-scaling]
                 [--atom-features-path ATOM_FEATURES_PATH [ATOM_FEATURES_PATH ...]]
                 [--atom-descriptors-path ATOM_DESCRIPTORS_PATH [ATOM_DESCRIPTORS_PATH ...]]
                 [--bond-features-path BOND_FEATURES_PATH [BOND_FEATURES_PATH ...]]
                 -i TEST_PATH [-o OUTPUT] [--drop-extra-columns] --model-path
                 [--target-columns TARGET_COLUMNS [TARGET_COLUMNS ...]]

Named Arguments#

--logfile, --log

The path to which the log file should be written. Specifying just the flag (i.e., ‘--log/--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.

-v, --verbose

The verbosity level. Specify the flag multiple times to increase verbosity.

Default: 0


--accelerator

Passed directly to the Lightning Trainer().

Default: “auto”

--devices

Passed directly to the Lightning Trainer(). If specifying multiple devices, the value must be a single string of comma-separated devices, e.g. ‘1, 2’.

Default: “auto”

-i, --test-path

Path to an input CSV file containing SMILES.

-o, --output, --preds-path

Path to which predictions will be saved. If the file extension is .pkl, they will be saved as a pickle file. Otherwise, they will be saved as a CSV. The index of the model will be appended to the filename’s stem. By default, predictions will be saved to the same location as ‘--test-path’ with ‘_preds’ appended, i.e., ‘PATH/TO/TEST_PATH_preds_0.csv’.
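The default naming and the extension rule can be sketched with `pathlib`. Both function names are hypothetical, and the sketch assumes the default output always uses the .csv extension, as in the documented example.

```python
from pathlib import Path


def default_preds_path(test_path: str, model_index: int = 0) -> Path:
    """Same directory as the test file, with '_preds_<index>' added to the stem."""
    path = Path(test_path)
    return path.with_name(f"{path.stem}_preds_{model_index}.csv")


def output_format(output: str) -> str:
    """A '.pkl' extension selects pickle; anything else is written as CSV."""
    return "pickle" if Path(output).suffix == ".pkl" else "csv"


print(default_preds_path("PATH/TO/TEST_PATH.csv").as_posix())
print(output_format("preds.pkl"), output_format("preds.txt"))
```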


--drop-extra-columns

Whether to drop all columns from the test data file besides the SMILES columns and the new prediction columns.

Default: False

--model-path

Path to either a single pretrained model checkpoint (.ckpt), a single pretrained model file (.pt), or a directory that contains these files. If a directory is given, it is searched recursively and predictions are made with every model found.

--target-columns

Column names to save the predictions to. If not provided, the predictions will be saved to columns named ‘pred_0’, ‘pred_1’, etc.

Shared input data args#

-s, --smiles-columns

The column names in the input CSV containing SMILES strings. If unspecified, uses the 0th column.

-r, --reaction-columns

The column names in the input CSV containing reaction SMILES in the format ‘REACTANT>AGENT>PRODUCT’, where ‘AGENT’ is optional.


--no-header-row

If specified, the first row in the input CSV will not be used as column names.

Default: False

Dataloader args#

-n, --num-workers

Number of workers for parallel data loading (0 means sequential). Warning: setting num_workers > 0 can cause hangs on Windows and macOS.

Default: 0

-b, --batch-size

Batch size.

Default: 64

Featurization args#

--rxn-mode, --reaction-mode


Choices for construction of atom and bond features for reactions (case insensitive):

- ‘reac_prod’: concatenates the reactants feature with the products feature.
- ‘reac_diff’: concatenates the reactants feature with the difference in features between reactants and products. (Default)
- ‘prod_diff’: concatenates the products feature with the difference in features between reactants and products.
- ‘reac_prod_balance’: concatenates the reactants feature with the products feature, balances imbalanced reactions.
- ‘reac_diff_balance’: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.
- ‘prod_diff_balance’: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.

Default: REAC_DIFF


--multi-hot-atom-featurizer-mode

Possible choices: V1, V2, ORGANIC

Choices for the multi-hot atom featurization scheme. This affects both non-reaction and reaction featurization (case insensitive):

- V1: Corresponds to the original configuration employed in Chemprop V1.
- V2: Tailored for a broad range of molecules, this configuration encompasses all elements in the first four rows of the periodic table, along with iodine. It is the default in Chemprop V2.
- ORGANIC: Designed specifically for use with organic molecules for drug research and development, this configuration includes a subset of elements most common in organic chemistry, including H, B, C, N, O, F, Si, P, S, Cl, Br, and I.

Default: V2


--keep-h

Whether hydrogens explicitly specified in input should be kept in the mol graph.

Default: False

--add-h

Whether hydrogens should be added to the mol graph.

Default: False

--features-generators

Possible choices: morgan_binary, morgan_count

Method(s) of generating additional features.

--descriptors-path

Path to extra descriptors to concatenate to the learned representation.

--no-descriptor-scaling

Turn off extra descriptor scaling.

Default: False

--no-atom-feature-scaling

Turn off extra atom feature scaling.

Default: False

--no-atom-descriptor-scaling

Turn off extra atom descriptor scaling.

Default: False

--no-bond-feature-scaling

Turn off extra bond feature scaling.

Default: False


--atom-features-path

If a single path is given, it is assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional atom features to supply before message passing. E.g., ‘--atom-features-path 0 /path/to/features_0.npz’ indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., ‘--atom-features-path […] --atom-features-path […]’.

--atom-descriptors-path

If a single path is given, it is assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional atom descriptors to supply after message passing. E.g., ‘--atom-descriptors-path 0 /path/to/descriptors_0.npz’ indicates that the descriptors at the given path should be supplied to the 0-th component. To supply additional descriptors for multiple components, repeat this argument on the command line for each component’s respective values, e.g., ‘--atom-descriptors-path […] --atom-descriptors-path […]’.

--bond-features-path

If a single path is given, it is assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional bond features to supply before message passing. E.g., ‘--bond-features-path 0 /path/to/features_0.npz’ indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., ‘--bond-features-path […] --bond-features-path […]’.


convert a v1 model checkpoint (.pt) to a v2 model checkpoint (.ckpt)

chemprop convert [-h] [--logfile [LOGFILE]] [-v] -i INPUT_PATH
                 [-o OUTPUT_PATH]

Named Arguments#

--logfile, --log

The path to which the log file should be written. Specifying just the flag (i.e., ‘--log/--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.

-v, --verbose

The verbosity level. Specify the flag multiple times to increase verbosity.

Default: 0

-i, --input-path

The path to a v1 model .pt checkpoint file.

-o, --output-path

The path to which the converted model will be saved. Defaults to ‘CURRENT_DIRECTORY/STEM_OF_INPUT_v2.ckpt’
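The default output location for a converted checkpoint can be sketched as below. The function name `default_converted_path` and its `cwd` parameter are hypothetical; the parameter exists only so the example does not depend on the directory it is run from.

```python
from pathlib import Path
from typing import Optional


def default_converted_path(input_path: str, cwd: Optional[Path] = None) -> Path:
    """'CURRENT_DIRECTORY/STEM_OF_INPUT_v2.ckpt' for a v1 '.pt' checkpoint."""
    base = Path.cwd() if cwd is None else cwd
    # Only the stem of the input is kept; its original directory is ignored.
    return base / f"{Path(input_path).stem}_v2.ckpt"


print(default_converted_path("/models/best.pt", Path("/work")).as_posix())
```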


use a pretrained chemprop model to calculate learned representations

chemprop fingerprint [-h] [--logfile [LOGFILE]] [-v]
                     [-s SMILES_COLUMNS [SMILES_COLUMNS ...]]
                     [-r REACTION_COLUMNS [REACTION_COLUMNS ...]]
                     [--no-header-row] [-n NUM_WORKERS] [-b BATCH_SIZE]
                     [--accelerator ACCELERATOR] [--devices DEVICES]
                     [--multi-hot-atom-featurizer-mode {V1,V2,ORGANIC}]
                     [--keep-h] [--add-h]
                     [--features-generators {morgan_binary,morgan_count} [{morgan_binary,morgan_count} ...]]
                     [--descriptors-path DESCRIPTORS_PATH]
                     [--no-descriptor-scaling] [--no-atom-feature-scaling]
                     [--atom-features-path ATOM_FEATURES_PATH [ATOM_FEATURES_PATH ...]]
                     [--atom-descriptors-path ATOM_DESCRIPTORS_PATH [ATOM_DESCRIPTORS_PATH ...]]
                     [--bond-features-path BOND_FEATURES_PATH [BOND_FEATURES_PATH ...]]
                     -i TEST_PATH [-o OUTPUT] --model-path MODEL_PATH
                     --ffn-block-index FFN_BLOCK_INDEX

Named Arguments#

--logfile, --log

The path to which the log file should be written. Specifying just the flag (i.e., ‘--log/--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.

-v, --verbose

The verbosity level. Specify the flag multiple times to increase verbosity.

Default: 0


--accelerator

Passed directly to the Lightning Trainer().

Default: “auto”

--devices

Passed directly to the Lightning Trainer(). If specifying multiple devices, the value must be a single string of comma-separated devices, e.g. ‘1, 2’.

Default: “auto”

-i, --test-path

Path to an input CSV file containing SMILES.

-o, --output, --preds-path

Path to which predictions will be saved. If the file extension is .npz, they will be saved as an .npz file. Otherwise, they will be saved as a CSV. The index of the model will be appended to the filename’s stem. By default, predictions will be saved to the same location as ‘--test-path’ with ‘_fps’ appended, i.e., ‘PATH/TO/TEST_PATH_fps_0.csv’.


--model-path

Path to either a single pretrained model checkpoint (.ckpt), a single pretrained model file (.pt), or a directory that contains these files. If a directory is given, it is searched recursively and every model found is used.

--ffn-block-index

The index indicates which linear layer returns the encoding in the FFN. An index of 0 denotes the post-aggregation representation through a 0-layer MLP, while an index of 1 represents the output from the first linear layer in the FFN, and so forth.

Default: -1

Shared input data args#

-s, --smiles-columns

The column names in the input CSV containing SMILES strings. If unspecified, uses the 0th column.

-r, --reaction-columns

The column names in the input CSV containing reaction SMILES in the format ‘REACTANT>AGENT>PRODUCT’, where ‘AGENT’ is optional.


--no-header-row

If specified, the first row in the input CSV will not be used as column names.

Default: False

Dataloader args#

-n, --num-workers

Number of workers for parallel data loading (0 means sequential). Warning: setting num_workers > 0 can cause hangs on Windows and macOS.

Default: 0

-b, --batch-size

Batch size.

Default: 64

Featurization args#

--rxn-mode, --reaction-mode


Choices for construction of atom and bond features for reactions (case insensitive):

- ‘reac_prod’: concatenates the reactants feature with the products feature.
- ‘reac_diff’: concatenates the reactants feature with the difference in features between reactants and products. (Default)
- ‘prod_diff’: concatenates the products feature with the difference in features between reactants and products.
- ‘reac_prod_balance’: concatenates the reactants feature with the products feature, balances imbalanced reactions.
- ‘reac_diff_balance’: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.
- ‘prod_diff_balance’: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.

Default: REAC_DIFF


--multi-hot-atom-featurizer-mode

Possible choices: V1, V2, ORGANIC

Choices for the multi-hot atom featurization scheme. This affects both non-reaction and reaction featurization (case insensitive):

- V1: Corresponds to the original configuration employed in Chemprop V1.
- V2: Tailored for a broad range of molecules, this configuration encompasses all elements in the first four rows of the periodic table, along with iodine. It is the default in Chemprop V2.
- ORGANIC: Designed specifically for use with organic molecules for drug research and development, this configuration includes a subset of elements most common in organic chemistry, including H, B, C, N, O, F, Si, P, S, Cl, Br, and I.

Default: V2


--keep-h

Whether hydrogens explicitly specified in input should be kept in the mol graph.

Default: False

--add-h

Whether hydrogens should be added to the mol graph.

Default: False

--features-generators

Possible choices: morgan_binary, morgan_count

Method(s) of generating additional features.

--descriptors-path

Path to extra descriptors to concatenate to the learned representation.

--no-descriptor-scaling

Turn off extra descriptor scaling.

Default: False

--no-atom-feature-scaling

Turn off extra atom feature scaling.

Default: False

--no-atom-descriptor-scaling

Turn off extra atom descriptor scaling.

Default: False

--no-bond-feature-scaling

Turn off extra bond feature scaling.

Default: False


--atom-features-path

If a single path is given, it is assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional atom features to supply before message passing. E.g., ‘--atom-features-path 0 /path/to/features_0.npz’ indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., ‘--atom-features-path […] --atom-features-path […]’.

--atom-descriptors-path

If a single path is given, it is assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional atom descriptors to supply after message passing. E.g., ‘--atom-descriptors-path 0 /path/to/descriptors_0.npz’ indicates that the descriptors at the given path should be supplied to the 0-th component. To supply additional descriptors for multiple components, repeat this argument on the command line for each component’s respective values, e.g., ‘--atom-descriptors-path […] --atom-descriptors-path […]’.

--bond-features-path

If a single path is given, it is assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional bond features to supply before message passing. E.g., ‘--bond-features-path 0 /path/to/features_0.npz’ indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., ‘--bond-features-path […] --bond-features-path […]’.


perform hyperparameter optimization on the given task

chemprop hpopt [-h] [--logfile [LOGFILE]] [-v]
               [-s SMILES_COLUMNS [SMILES_COLUMNS ...]]
               [-r REACTION_COLUMNS [REACTION_COLUMNS ...]] [--no-header-row]
               [-n NUM_WORKERS] [-b BATCH_SIZE] [--accelerator ACCELERATOR]
               [--devices DEVICES]
               [--multi-hot-atom-featurizer-mode {V1,V2,ORGANIC}] [--keep-h]
               [--features-generators {morgan_binary,morgan_count} [{morgan_binary,morgan_count} ...]]
               [--descriptors-path DESCRIPTORS_PATH] [--no-descriptor-scaling]
               [--no-atom-feature-scaling] [--no-atom-descriptor-scaling]
               [--atom-features-path ATOM_FEATURES_PATH [ATOM_FEATURES_PATH ...]]
               [--atom-descriptors-path ATOM_DESCRIPTORS_PATH [ATOM_DESCRIPTORS_PATH ...]]
               [--bond-features-path BOND_FEATURES_PATH [BOND_FEATURES_PATH ...]]
               [--config-path CONFIG_PATH] [-i DATA_PATH] [-o OUTPUT_DIR]
               [--model-frzn MODEL_FRZN] [--frzn-ffn-layers FRZN_FFN_LAYERS]
               [--ensemble-size ENSEMBLE_SIZE]
               [--message-hidden-dim MESSAGE_HIDDEN_DIM] [--message-bias]
               [--depth DEPTH] [--undirected] [--dropout DROPOUT]
               [--activation {RELU,LEAKYRELU,PRELU,TANH,SELU,ELU}]
               [--aggregation {mean,sum,norm}]
               [--aggregation-norm AGGREGATION_NORM] [--atom-messages]
               [--ffn-hidden-dim FFN_HIDDEN_DIM]
               [--ffn-num-layers FFN_NUM_LAYERS] [--no-batch-norm]
               [--multiclass-num-classes MULTICLASS_NUM_CLASSES]
               [-w WEIGHT_COLUMN]
               [--target-columns TARGET_COLUMNS [TARGET_COLUMNS ...]]
               [--ignore-columns IGNORE_COLUMNS [IGNORE_COLUMNS ...]]
               [-t {regression,regression-mve,regression-evidential,classification,classification-dirichlet,multiclass,multiclass-dirichlet,spectral}]
               [-l {mse,bounded-mse,mve,evidential,bce,ce,binary-mcc,multiclass-mcc,binary-dirichlet,multiclass-dirichlet,sid,earthmovers,wasserstein}]
               [--v-kl V_KL] [--eps EPS]
               [--metrics {mae,mse,rmse,bounded-mae,bounded-mse,bounded-rmse,r2,roc,prc,accuracy,f1,bce,ce,binary-mcc,multiclass-mcc,sid,wasserstein} [{mae,mse,rmse,bounded-mae,bounded-mse,bounded-rmse,r2,roc,prc,accuracy,f1,bce,ce,binary-mcc,multiclass-mcc,sid,wasserstein} ...]]
               [--task-weights TASK_WEIGHTS [TASK_WEIGHTS ...]]
               [--warmup-epochs WARMUP_EPOCHS] [--init-lr INIT_LR]
               [--max-lr MAX_LR] [--final-lr FINAL_LR] [--epochs EPOCHS]
               [--patience PATIENCE] [--grad-clip GRAD_CLIP]
               [--split-sizes SPLIT_SIZES SPLIT_SIZES SPLIT_SIZES]
               [--split-key-molecule SPLIT_KEY_MOLECULE] [-k NUM_FOLDS]
               [--save-smiles-splits] [--splits-file SPLITS_FILE]
               [--splits-column SPLITS_COLUMN] [--data-seed DATA_SEED]
               [--pytorch-seed PYTORCH_SEED]
               [--search-parameter-keywords SEARCH_PARAMETER_KEYWORDS [SEARCH_PARAMETER_KEYWORDS ...]]
               [--hpopt-save-dir HPOPT_SAVE_DIR]
               [--raytune-num-samples RAYTUNE_NUM_SAMPLES]
               [--raytune-search-algorithm {random,hyperopt}]
               [--raytune-num-workers RAYTUNE_NUM_WORKERS] [--raytune-use-gpu]
               [--raytune-num-checkpoints-to-keep RAYTUNE_NUM_CHECKPOINTS_TO_KEEP]
               [--raytune-grace-period RAYTUNE_GRACE_PERIOD]
               [--raytune-reduction-factor RAYTUNE_REDUCTION_FACTOR]
               [--hyperopt-n-initial-points HYPEROPT_N_INITIAL_POINTS]
               [--hyperopt-random-state-seed HYPEROPT_RANDOM_STATE_SEED]

Named Arguments#

--logfile, --log

The path to which the log file should be written. Specifying just the flag (i.e., ‘--log’ or ‘--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.
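As an illustration of the automatic log location, the sketch below builds a path of the form ‘chemprop_logs/MODE/TIMESTAMP.log’. The helper name default_log_path and the strftime pattern are assumptions inferred from the example timestamp above, not Chemprop's own code.

```python
from datetime import datetime
from pathlib import Path


def default_log_path(mode: str, now: datetime) -> Path:
    """Build 'chemprop_logs/MODE/TIMESTAMP.log' (hypothetical helper)."""
    # Assumed pattern, chosen to reproduce the example 2024-04-30T15-20-07.
    timestamp = now.strftime("%Y-%m-%dT%H-%M-%S")
    return Path("chemprop_logs") / mode / f"{timestamp}.log"


example = default_log_path("train", datetime(2024, 4, 30, 15, 20, 7))
print(example.as_posix())
```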

-v, --verbose

The verbosity level. Specify the flag multiple times to increase verbosity.

Default: 0
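A repeatable flag of this kind is commonly implemented with argparse's "count" action. The sketch below shows that behaviour with a stand-alone demo parser; whether Chemprop wires its own flag exactly this way is an assumption.

```python
import argparse

# Stand-alone demo parser, not Chemprop's real parser.
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("-v", "--verbose", action="count", default=0)


def verbosity(argv: list[str]) -> int:
    """Return the verbosity level implied by a list of CLI tokens."""
    return parser.parse_args(argv).verbose


print(verbosity([]), verbosity(["-v"]), verbosity(["-vv"]))
```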


--accelerator

Passed directly to the Lightning Trainer().

Default: “auto”


--devices

Passed directly to the Lightning Trainer(). To specify multiple devices, pass a single string of comma-separated devices, e.g. ‘1, 2’.

Default: “auto”


--config-path

Path to a configuration file. Command line arguments override values in the configuration file.
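The precedence rule can be pictured as a layered merge in which later sources win. The sketch below is illustrative only: the dictionaries stand in for parsed values, the configuration file format is not modelled, and the defaults are the ones listed in this reference.

```python
# Built-in defaults as documented in this reference.
defaults = {"epochs": 50, "batch_size": 64}
# Values assumed to come from a configuration file (hypothetical contents).
config_values = {"epochs": 100, "batch_size": 32}
# Values given explicitly on the command line (hypothetical contents).
cli_values = {"epochs": 20}

# Later dictionaries override earlier ones: defaults < config file < command line.
effective = {**defaults, **config_values, **cli_values}
print(effective)
```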

-i, --data-path

Path to an input CSV file containing SMILES and the associated target values.

-o, --output-dir, --save-dir

Directory where training outputs will be saved. Defaults to ‘CURRENT_DIRECTORY/chemprop_training/STEM_OF_INPUT/TIME_STAMP’.
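The default location can be sketched with pathlib as follows. The helper default_output_dir is hypothetical, and the timestamp string is passed in because its exact format is not stated here.

```python
from pathlib import Path


def default_output_dir(data_path: str, timestamp: str, cwd: Path) -> Path:
    """Compose CURRENT_DIRECTORY/chemprop_training/STEM_OF_INPUT/TIME_STAMP."""
    # Path.stem drops the directory and the final extension: 'data/mol.csv' -> 'mol'.
    return cwd / "chemprop_training" / Path(data_path).stem / timestamp


out_dir = default_output_dir("data/mol.csv", "2024-04-30T15-20-07", Path("/work"))
print(out_dir.as_posix())
```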


--ensemble-size

Number of models in the ensemble for each splitting of the data.

Default: 1


--pytorch-seed

Seed for PyTorch randomness (e.g., random initial weights).

Shared input data args#

-s, --smiles-columns

The column names in the input CSV containing SMILES strings. If unspecified, uses the 0th column.

-r, --reaction-columns

The column names in the input CSV containing reaction SMILES in the format ‘REACTANT>AGENT>PRODUCT’, where ‘AGENT’ is optional.
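To make the ‘REACTANT>AGENT>PRODUCT’ layout concrete, the sketch below splits a reaction SMILES string on ‘>’ and treats an empty middle field as "no agent". The function name split_reaction_smiles is hypothetical, and the code does not validate the SMILES themselves.

```python
from typing import Optional, Tuple


def split_reaction_smiles(rxn: str) -> Tuple[str, Optional[str], str]:
    """Split 'REACTANT>AGENT>PRODUCT' into its three fields (agent may be None)."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected exactly two '>' separators, got: {rxn!r}")
    reactant, agent, product = parts
    # An empty middle field ('>>') means the optional agent was omitted.
    return reactant, (agent or None), product


print(split_reaction_smiles("CCO>>CC=O"))
print(split_reaction_smiles("CCO>[Cr]>CC=O"))
```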


--no-header-row

If specified, the first row in the input CSV will not be used as column names.

Default: False

Dataloader args#

-n, --num-workers

Number of workers for parallel data loading (0 means sequential). Warning: setting num_workers>0 can cause hangs on Windows and macOS.

Default: 0

-b, --batch-size

Batch size.

Default: 64

Featurization args#

--rxn-mode, --reaction-mode


Choices for construction of atom and bond features for reactions (case insensitive):

- ‘reac_prod’: concatenates the reactants feature with the products feature.
- ‘reac_diff’: concatenates the reactants feature with the difference in features between reactants and products. (Default)
- ‘prod_diff’: concatenates the products feature with the difference in features between reactants and products.
- ‘reac_prod_balance’: concatenates the reactants feature with the products feature, balances imbalanced reactions.
- ‘reac_diff_balance’: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.
- ‘prod_diff_balance’: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.

Default: REAC_DIFF
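The three unbalanced modes can be illustrated on toy feature vectors. This is a sketch only: the vectors are made up, and the sign convention for the difference (product minus reactant) is an assumption that the text above does not specify.

```python
from typing import List


def feature_diff(reac: List[float], prod: List[float]) -> List[float]:
    """Element-wise difference; product minus reactant is an assumed convention."""
    return [p - r for r, p in zip(reac, prod)]


def reac_prod(reac: List[float], prod: List[float]) -> List[float]:
    return reac + prod


def reac_diff(reac: List[float], prod: List[float]) -> List[float]:
    return reac + feature_diff(reac, prod)


def prod_diff(reac: List[float], prod: List[float]) -> List[float]:
    return prod + feature_diff(reac, prod)


# Toy features for one atom on the reactant side and the product side.
reactant_features = [1.0, 0.0, 2.0]
product_features = [0.0, 1.0, 2.0]
print(reac_prod(reactant_features, product_features))
print(reac_diff(reactant_features, product_features))
print(prod_diff(reactant_features, product_features))
```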


--multi-hot-atom-featurizer-mode

Possible choices: V1, V2, ORGANIC

Choices for the multi-hot atom featurization scheme. This affects both non-reaction and reaction featurization (case insensitive):

- V1: Corresponds to the original configuration employed in Chemprop V1.
- V2: Tailored for a broad range of molecules, this configuration encompasses all elements in the first four rows of the periodic table, along with iodine. It is the default in Chemprop V2.
- ORGANIC: Designed specifically for use with organic molecules for drug research and development, this configuration includes a subset of elements most common in organic chemistry, including H, B, C, N, O, F, Si, P, S, Cl, Br, and I.

Default: V2


--keep-h

Whether hydrogens explicitly specified in the input should be kept in the mol graph.

Default: False


Whether hydrogens should be added to the mol graph.

Default: False


--features-generators

Possible choices: morgan_binary, morgan_count

Method(s) of generating additional features.


--descriptors-path

Path to extra descriptors to concatenate to the learned representation.


--no-descriptor-scaling

Turn off extra descriptor scaling.

Default: False


--no-atom-feature-scaling

Turn off extra atom feature scaling.

Default: False


--no-atom-descriptor-scaling

Turn off extra atom descriptor scaling.

Default: False


Turn off extra bond feature scaling.

Default: False


--atom-features-path

If a single path is given, it is assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional atom features to supply before message passing. E.g., ‘--atom-features-path 0 /path/to/features_0.npz’ indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., ‘--atom-features-path […] --atom-features-path […]’.


--atom-descriptors-path

If a single path is given, it is assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional atom descriptors to supply after message passing. E.g., ‘--atom-descriptors-path 0 /path/to/descriptors_0.npz’ indicates that the descriptors at the given path should be supplied to the 0-th component. To supply additional descriptors for multiple components, repeat this argument on the command line for each component’s respective values, e.g., ‘--atom-descriptors-path […] --atom-descriptors-path […]’.


--bond-features-path

If a single path is given, it is assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional bond features to supply before message passing. E.g., ‘--bond-features-path 0 /path/to/features_0.npz’ indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., ‘--bond-features-path […] --bond-features-path […]’.

transfer learning args#


--model-frzn

Path to the model checkpoint file to be loaded for overwriting and freezing weights.


--frzn-ffn-layers

Overwrites the weights of the first n layers of the FFN with those from the checkpoint model (specified by checkpoint_frzn), where n is the value given. Also automatically freezes the MPNN weights.

Default: 0

message passing#


--message-hidden-dim

Hidden dimension of the messages.

Default: 300


--message-bias

Add bias to the message passing layers.

Default: False


--depth

Number of message passing steps.

Default: 3


--undirected

Pass messages on undirected bonds/edges (always sum the two relevant bond vectors).

Default: False


--dropout

Dropout probability in message passing/FFN layers.

Default: 0.0


Whether to use the same message passing neural network for all input molecules. Only relevant if number_of_molecules > 1

Default: False



--activation

Possible choices: RELU, LEAKYRELU, PRELU, TANH, SELU, ELU

Activation function in message passing/FFN layers.

Default: RELU

--aggregation, --agg

Possible choices: mean, sum, norm

The aggregation mode to use in the graph predictor.

Default: “mean”


--aggregation-norm

Normalization factor by which to divide the summed atomic features for ‘norm’ aggregation.

Default: 100
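The three aggregation modes differ only in how the per-atom vectors are pooled into one molecule-level vector. The sketch below uses plain Python lists and a hypothetical aggregate function; it is not Chemprop's implementation, which operates on batched tensors.

```python
from typing import List


def aggregate(atom_vectors: List[List[float]], mode: str = "mean",
              norm: float = 100.0) -> List[float]:
    """Pool per-atom vectors into one vector using 'mean', 'sum', or 'norm'."""
    # Column-wise sum over all atoms in the molecule.
    summed = [sum(column) for column in zip(*atom_vectors)]
    if mode == "sum":
        return summed
    if mode == "mean":
        return [s / len(atom_vectors) for s in summed]
    if mode == "norm":
        # Divide the sum by a fixed constant instead of the atom count.
        return [s / norm for s in summed]
    raise ValueError(f"unknown aggregation mode: {mode!r}")


atoms = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for mode in ("mean", "sum", "norm"):
    print(mode, aggregate(atoms, mode))
```

With ‘norm’, the pooled vector still grows with molecule size, as with ‘sum’, but it is rescaled by the constant factor.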


--atom-messages

Pass messages on atoms rather than bonds.

Default: False

FFN args#


--ffn-hidden-dim

Hidden dimension in the FFN top model.

Default: 300


--ffn-num-layers

Number of layers in the FFN top model.

Default: 1

extra MPNN args#


--no-batch-norm

Don’t use batch normalization after aggregation.

Default: False


--multiclass-num-classes

Number of classes when running multiclass classification.

Default: 3

training input data args#

-w, --weight-column

The name of the column in the input CSV containing individual data weights.


--target-columns

Names of the columns containing target values. By default, uses all columns except the SMILES column and the ignore_columns.


--ignore-columns

Names of the columns to ignore when target_columns is not provided.


--splits-column

Name of the column in the input CSV file containing ‘train’, ‘val’, or ‘test’ for each row.

training args#

-t, --task-type

Possible choices: regression, regression-mve, regression-evidential, classification, classification-dirichlet, multiclass, multiclass-dirichlet, spectral

Type of dataset. This determines the default loss function used during training. Defaults to regression.

Default: “regression”

-l, --loss-function

Possible choices: mse, bounded-mse, mve, evidential, bce, ce, binary-mcc, multiclass-mcc, binary-dirichlet, multiclass-dirichlet, sid, earthmovers, wasserstein

Loss function to use during training. If not specified, will use the default loss function for the given task type (see documentation).

--v-kl, --evidential-regularization

Value used in regularization for the evidential loss function. The value recommended by Soleimany et al. (2021) is 0.2. The optimal value is dataset-dependent; users are advised to test different values to find the best one for their model.

Default: 0.0


--eps

Evidential regularization epsilon.

Default: 1e-08

--metrics, --metric

Possible choices: mae, mse, rmse, bounded-mae, bounded-mse, bounded-rmse, r2, roc, prc, accuracy, f1, bce, ce, binary-mcc, multiclass-mcc, sid, wasserstein

Evaluation metrics. If unspecified, the following metrics are used for the given dataset types: regression->rmse, classification->roc, multiclass->ce (‘cross entropy’), spectral->sid. If multiple metrics are provided, the 0th one is used for early stopping and checkpointing.
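The fallback rule can be written as a small lookup. The sketch below covers only the four dataset types named above; the function names are hypothetical, and how the remaining task types map to defaults is not stated here.

```python
from typing import List, Optional

# Defaults as listed in this reference.
DEFAULT_METRIC = {
    "regression": "rmse",
    "classification": "roc",
    "multiclass": "ce",
    "spectral": "sid",
}


def resolve_metrics(task_type: str, metrics: Optional[List[str]] = None) -> List[str]:
    """Return the user-supplied metrics, or the documented default for the task."""
    if metrics:
        return list(metrics)
    return [DEFAULT_METRIC[task_type]]


def monitored_metric(task_type: str, metrics: Optional[List[str]] = None) -> str:
    """The 0th metric drives early stopping and checkpointing."""
    return resolve_metrics(task_type, metrics)[0]


print(resolve_metrics("classification"))
print(monitored_metric("regression", ["mae", "rmse", "r2"]))
```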


--task-weights

The weight to apply to each individual task in the overall loss.


--warmup-epochs

Number of epochs during which the learning rate increases linearly from init_lr to max_lr. Afterwards, the learning rate decreases exponentially from max_lr to final_lr.

Default: 2
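The shape of this schedule can be sketched at whole-epoch granularity using the documented defaults. This is an approximation for illustration: the function lr_at_epoch is hypothetical, and the real scheduler may update the rate per optimizer step rather than per epoch.

```python
def lr_at_epoch(epoch: int, warmup_epochs: int = 2, epochs: int = 50,
                init_lr: float = 1e-4, max_lr: float = 1e-3,
                final_lr: float = 1e-4) -> float:
    """Linear warmup from init_lr to max_lr, then exponential decay to final_lr."""
    if epochs <= warmup_epochs:
        raise ValueError("this sketch assumes epochs > warmup_epochs")
    if epoch < warmup_epochs:
        # Linear interpolation between init_lr and max_lr.
        return init_lr + (max_lr - init_lr) * epoch / warmup_epochs
    # Per-epoch decay factor chosen so that the rate reaches final_lr at `epochs`.
    gamma = (final_lr / max_lr) ** (1.0 / (epochs - warmup_epochs))
    return max_lr * gamma ** (epoch - warmup_epochs)


for epoch in (0, 1, 2, 26, 50):
    print(epoch, f"{lr_at_epoch(epoch):.6f}")
```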


--init-lr

Initial learning rate.

Default: 0.0001


--max-lr

Maximum learning rate.

Default: 0.001


--final-lr

Final learning rate.

Default: 0.0001


--epochs

The number of epochs to train over.

Default: 50


--patience

Number of epochs to wait for improvement before early stopping.


--grad-clip

Passed directly to the Lightning Trainer, which controls gradient clipping. See the Trainer() docstring for details.

split args#

--split, --split-type


Method of splitting the data into train/val/test (case insensitive).

Default: RANDOM


--split-sizes

Split proportions for train/validation/test sets.

Default: [0.8, 0.1, 0.1]
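To see what the proportions mean for a concrete dataset, the sketch below converts them into row counts. The rounding rule (truncate train and validation, give the remainder to test) is an assumption made for illustration; the splitter's actual rounding may differ.

```python
from typing import Sequence, Tuple


def split_counts(n_rows: int,
                 sizes: Sequence[float] = (0.8, 0.1, 0.1)) -> Tuple[int, int, int]:
    """Turn train/val/test proportions into row counts (assumed rounding rule)."""
    if len(sizes) != 3 or abs(sum(sizes) - 1.0) > 1e-9:
        raise ValueError("expected three proportions summing to 1")
    n_train = int(sizes[0] * n_rows)
    n_val = int(sizes[1] * n_rows)
    n_test = n_rows - n_train - n_val  # remainder goes to the test set
    return n_train, n_val, n_test


print(split_counts(1000))
print(split_counts(25))
```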


--split-key-molecule

The index of the key molecule used for splitting when multiple molecules are present and a constrained split_type is used (e.g., ‘scaffold_balanced’ or ‘random_with_repeated_smiles’). Note that this index begins with zero for the first molecule.

Default: 0

-k, --num-folds

Number of folds when performing cross validation.

Default: 1


--save-smiles-splits

Save the SMILES for each train/val/test split for convenience in later prediction.

Default: False


--splits-file

Path to a JSON file containing pre-defined splits for the input data, formatted as a list of dictionaries with keys ‘train’, ‘val’, and ‘test’ and values as lists of indices or strings formatted like ‘0-2,4’. See documentation for more details.
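A file in this shape can be produced and read back with the standard json module. In the sketch below, the helper expand_indices is hypothetical, and reading ‘0-2,4’ as the inclusive range 0 to 2 plus index 4 is an assumption; the full documentation is the authority on the exact semantics.

```python
import json
from typing import List, Union


def expand_indices(spec: Union[str, List[int]]) -> List[int]:
    """Expand a list of indices or a string such as '0-2,4' into [0, 1, 2, 4]."""
    if isinstance(spec, list):
        return [int(i) for i in spec]
    indices: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            # Assumed inclusive range: '0-2' -> 0, 1, 2.
            indices.extend(range(int(start), int(end) + 1))
        else:
            indices.append(int(part))
    return indices


# One dictionary per split of the data; values mix both accepted notations.
splits = [{"train": "0-2,4", "val": [3], "test": [5, 6]}]
text = json.dumps(splits, indent=2)
loaded = json.loads(text)
expanded = [{key: expand_indices(value) for key, value in split.items()}
            for split in loaded]
print(text)
print(expanded)
```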


--data-seed

Random seed to use when splitting data into train/val/test sets. When num_folds > 1, the first fold uses this seed and each subsequent fold increments the seed by 1. Also used for shuffling data in build_dataloader when shuffle is True.

Default: 0
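One reading of the per-fold seeding rule is that fold i uses the data seed plus i. The sketch below encodes that reading; the function name fold_seeds is hypothetical.

```python
from typing import List


def fold_seeds(data_seed: int = 0, num_folds: int = 1) -> List[int]:
    """Seed for each fold, assuming fold i uses data_seed + i."""
    return [data_seed + fold for fold in range(num_folds)]


print(fold_seeds(0, 3))
print(fold_seeds(42, 5))
```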

Chemprop hyperparameter optimization arguments#

--search-parameter-keywords

The model parameters over which to search for an optimal hyperparameter configuration. Some options are bundles of parameters or otherwise special parameter operations.

Special keywords:

- basic: the default set of hyperparameters for search: depth, ffn_num_layers, dropout, message_hidden_dim, and ffn_hidden_dim.
- learning_rate: search for max_lr, init_lr_ratio, final_lr_ratio, and warmup_epochs. The init_lr and final_lr values are searched as fractions of the max_lr value. The warmup_epochs value is searched as a fraction of the total epochs used.
- all: include search for all 13 individual keyword options.
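Because the learning_rate keyword searches ratios rather than absolute values, each sampled point has to be converted back into concrete training settings. The sketch below shows one such conversion; the function name, the sampled values, and the rounding of warmup_epochs are assumptions for illustration.

```python
from typing import Dict, Union


def resolve_learning_rates(max_lr: float, init_lr_ratio: float,
                           final_lr_ratio: float, warmup_fraction: float,
                           epochs: int) -> Dict[str, Union[int, float]]:
    """Convert ratio-style search values into absolute training settings."""
    return {
        "init_lr": max_lr * init_lr_ratio,    # fraction of max_lr
        "max_lr": max_lr,
        "final_lr": max_lr * final_lr_ratio,  # fraction of max_lr
        # Fraction of the total number of epochs; rounding is an assumption.
        "warmup_epochs": round(warmup_fraction * epochs),
    }


# Hypothetical sampled point from a search.
print(resolve_learning_rates(max_lr=1e-3, init_lr_ratio=0.1,
                             final_lr_ratio=0.1, warmup_fraction=0.04,
                             epochs=50))
```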

Individual supported parameters:


Default: [‘basic’]


--hpopt-save-dir

Directory in which to save the hyperparameter optimization results.

Ray Tune arguments#


--raytune-num-samples

Passed directly to Ray Tune TuneConfig to control the number of trials to run.

Default: 10


--raytune-search-algorithm

Possible choices: random, hyperopt

Passed to Ray Tune TuneConfig to control the search algorithm.

Default: “hyperopt”


--raytune-num-workers

Passed directly to Ray Tune ScalingConfig to control the number of workers to use.

Default: 1


--raytune-use-gpu

Passed directly to Ray Tune ScalingConfig to control whether to use GPUs.

Default: False


--raytune-num-checkpoints-to-keep

Passed directly to Ray Tune CheckpointConfig to control the number of checkpoints to keep.

Default: 1


--raytune-grace-period

Passed directly to Ray Tune ASHAScheduler to control the grace period.

Default: 10


--raytune-reduction-factor

Passed directly to Ray Tune ASHAScheduler to control the reduction factor.

Default: 2

Hyperopt arguments#


--hyperopt-n-initial-points

Passed directly to HyperOptSearch to control the number of initial points to sample.

Default: 20


--hyperopt-random-state-seed

Passed directly to HyperOptSearch to control the random state seed.