CLI Reference#
usage: chemprop [-h] {train,predict,convert,fingerprint,hpopt} ...
mode#
- mode
Possible choices: train, predict, convert, fingerprint, hpopt
Sub-commands#
train#
train a chemprop model
chemprop train [-h] [--logfile [LOGFILE]] [-v]
[-s SMILES_COLUMNS [SMILES_COLUMNS ...]]
[-r REACTION_COLUMNS [REACTION_COLUMNS ...]] [--no-header-row]
[-n NUM_WORKERS] [-b BATCH_SIZE] [--accelerator ACCELERATOR]
[--devices DEVICES]
[--rxn-mode {REAC_PROD,REAC_PROD_BALANCE,REAC_DIFF,REAC_DIFF_BALANCE,PROD_DIFF,PROD_DIFF_BALANCE}]
[--multi-hot-atom-featurizer-mode {V1,V2,ORGANIC}] [--keep-h]
[--add-h]
[--features-generators {morgan_binary,morgan_count} [{morgan_binary,morgan_count} ...]]
[--descriptors-path DESCRIPTORS_PATH] [--no-descriptor-scaling]
[--no-atom-feature-scaling] [--no-atom-descriptor-scaling]
[--no-bond-feature-scaling]
[--atom-features-path ATOM_FEATURES_PATH [ATOM_FEATURES_PATH ...]]
[--atom-descriptors-path ATOM_DESCRIPTORS_PATH [ATOM_DESCRIPTORS_PATH ...]]
[--bond-features-path BOND_FEATURES_PATH [BOND_FEATURES_PATH ...]]
[--config-path CONFIG_PATH] [-i DATA_PATH] [-o OUTPUT_DIR]
[--model-frzn MODEL_FRZN] [--frzn-ffn-layers FRZN_FFN_LAYERS]
[--ensemble-size ENSEMBLE_SIZE]
[--message-hidden-dim MESSAGE_HIDDEN_DIM] [--message-bias]
[--depth DEPTH] [--undirected] [--dropout DROPOUT]
[--mpn-shared]
[--activation {RELU,LEAKYRELU,PRELU,TANH,SELU,ELU}]
[--aggregation {mean,sum,norm}]
[--aggregation-norm AGGREGATION_NORM] [--atom-messages]
[--ffn-hidden-dim FFN_HIDDEN_DIM]
[--ffn-num-layers FFN_NUM_LAYERS] [--no-batch-norm]
[--multiclass-num-classes MULTICLASS_NUM_CLASSES]
[-w WEIGHT_COLUMN]
[--target-columns TARGET_COLUMNS [TARGET_COLUMNS ...]]
[--ignore-columns IGNORE_COLUMNS [IGNORE_COLUMNS ...]]
[-t {regression,regression-mve,regression-evidential,classification,classification-dirichlet,multiclass,multiclass-dirichlet,spectral}]
[-l {mse,bounded-mse,mve,evidential,bce,ce,binary-mcc,multiclass-mcc,binary-dirichlet,multiclass-dirichlet,sid,earthmovers,wasserstein}]
[--v-kl V_KL] [--eps EPS]
[--metrics {mae,mse,rmse,bounded-mae,bounded-mse,bounded-rmse,r2,roc,prc,accuracy,f1,bce,ce,binary-mcc,multiclass-mcc,sid,wasserstein} [{mae,mse,rmse,bounded-mae,bounded-mse,bounded-rmse,r2,roc,prc,accuracy,f1,bce,ce,binary-mcc,multiclass-mcc,sid,wasserstein} ...]]
[--task-weights TASK_WEIGHTS [TASK_WEIGHTS ...]]
[--warmup-epochs WARMUP_EPOCHS] [--init-lr INIT_LR]
[--max-lr MAX_LR] [--final-lr FINAL_LR] [--epochs EPOCHS]
[--patience PATIENCE] [--grad-clip GRAD_CLIP]
[--split {CV_NO_VAL,CV,SCAFFOLD_BALANCED,RANDOM_WITH_REPEATED_SMILES,RANDOM,KENNARD_STONE,KMEANS}]
[--split-sizes SPLIT_SIZES SPLIT_SIZES SPLIT_SIZES]
[--split-key-molecule SPLIT_KEY_MOLECULE] [-k NUM_FOLDS]
[--save-smiles-splits] [--splits-file SPLITS_FILE]
[--splits-column SPLITS_COLUMN] [--data-seed DATA_SEED]
[--pytorch-seed PYTORCH_SEED]
Named Arguments#
- --logfile, --log
The path to which the log file should be written. Specifying just the flag (i.e., ‘--log/--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.
- -v, --verbose
The verbosity level, specify the flag multiple times to increase verbosity.
Default: 0
- --accelerator
Passed directly to the lightning Trainer().
Default: “auto”
- --devices
Passed directly to the lightning Trainer(). If specifying multiple devices, must be a single string of comma separated devices, e.g. ‘1, 2’.
Default: “auto”
- --config-path
Path to a configuration file. Command line arguments override values in the configuration file.
- -i, --data-path
Path to an input CSV file containing SMILES and the associated target values.
- -o, --output-dir, --save-dir
Directory where training outputs will be saved. Defaults to ‘CURRENT_DIRECTORY/chemprop_training/STEM_OF_INPUT/TIME_STAMP’.
- --ensemble-size
Number of models in ensemble for each splitting of data.
Default: 1
- --pytorch-seed
Seed for PyTorch randomness (e.g., random initial weights).
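The precedence rule stated for --config-path (command-line values win over configuration-file values, which in turn win over built-in defaults) can be illustrated with a minimal sketch. It uses only Python's standard argparse module rather than Chemprop's own parser; the helper name parse_with_config and the dictionary standing in for a parsed configuration file are hypothetical.

```python
import argparse


def parse_with_config(config, argv):
    """Parse argv, using config values as defaults (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="sketch")
    # Built-in defaults, copied from the reference above.
    parser.add_argument("--epochs", type=int, default=50)
    parser.add_argument("-b", "--batch-size", type=int, default=64)
    # Values from the configuration file replace the built-in defaults ...
    parser.set_defaults(**config)
    # ... and anything given explicitly on the command line replaces those.
    return parser.parse_args(argv)


# Stand-in for a parsed configuration file
# (assumption: its keys match the argparse destination names).
config = {"epochs": 100, "batch_size": 32}
args = parse_with_config(config, ["--epochs", "20"])
print(args.epochs, args.batch_size)
```

Here --epochs comes from the command line and the batch size from the configuration stand-in; with an empty configuration and no arguments, both fall back to the documented defaults.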
Dataloader args#
- -n, --num-workers
Number of workers for parallel data loading (0 means sequential). Warning: setting num_workers>0 can cause hangs on Windows and MacOS.
Default: 0
- -b, --batch-size
Batch size.
Default: 64
Featurization args#
- --rxn-mode, --reaction-mode
Possible choices: REAC_PROD, REAC_PROD_BALANCE, REAC_DIFF, REAC_DIFF_BALANCE, PROD_DIFF, PROD_DIFF_BALANCE
Choices for construction of atom and bond features for reactions (case insensitive):
  - ‘reac_prod’: concatenates the reactants feature with the products feature.
  - ‘reac_diff’: concatenates the reactants feature with the difference in features between reactants and products. (Default)
  - ‘prod_diff’: concatenates the products feature with the difference in features between reactants and products.
  - ‘reac_prod_balance’: concatenates the reactants feature with the products feature, balances imbalanced reactions.
  - ‘reac_diff_balance’: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.
  - ‘prod_diff_balance’: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.
Default: REAC_DIFF
- --multi-hot-atom-featurizer-mode
Possible choices: V1, V2, ORGANIC
Choices for multi-hot atom featurization scheme. This will affect both non-reaction and reaction featurization (case insensitive):
  - V1: Corresponds to the original configuration employed in Chemprop V1.
  - V2: Tailored for a broad range of molecules, this configuration encompasses all elements in the first four rows of the periodic table, along with iodine. It is the default in Chemprop V2.
  - ORGANIC: Designed specifically for use with organic molecules for drug research and development, this configuration includes a subset of elements most common in organic chemistry, including H, B, C, N, O, F, Si, P, S, Cl, Br, and I.
Default: V2
- --keep-h
Whether hydrogens explicitly specified in input should be kept in the mol graph.
Default: False
- --add-h
Whether hydrogens should be added to the mol graph.
Default: False
- --features-generators
Possible choices: morgan_binary, morgan_count
Method(s) of generating additional features.
- --descriptors-path
Path to extra descriptors to concatenate to learned representation.
- --no-descriptor-scaling
Turn off extra descriptor scaling.
Default: False
- --no-atom-feature-scaling
Turn off extra atom feature scaling.
Default: False
- --no-atom-descriptor-scaling
Turn off extra atom descriptor scaling.
Default: False
- --no-bond-feature-scaling
Turn off extra bond feature scaling.
Default: False
- --atom-features-path
If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional atom features to supply before message passing. E.g., --atom-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-features-path […] --atom-features-path […].
- --atom-descriptors-path
If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional atom descriptors to supply after message passing. E.g., --atom-descriptors-path 0 /path/to/descriptors_0.npz indicates that the descriptors at the given path should be supplied to the 0-th component. To supply additional descriptors for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-descriptors-path […] --atom-descriptors-path […].
- --bond-features-path
If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional bond features to supply before message passing. E.g., --bond-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --bond-features-path […] --bond-features-path […].
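The three basic --rxn-mode choices described in this group can be sketched on plain Python lists. This is an illustration only, not Chemprop's featurizer: the function name combine_reaction_features is hypothetical, the sketch assumes the atom or bond is present on both sides of the reaction, and it assumes the "difference" is products minus reactants, a sign convention the text above does not state. The *_balance variants are deliberately left out because the balancing step is not described here.

```python
def combine_reaction_features(reac, prod, mode):
    """Combine reactant and product feature vectors (hypothetical helper).

    Assumptions: the atom/bond exists in both reactants and products, and
    the difference is taken as products minus reactants.
    """
    mode = mode.lower()  # the CLI treats the choice as case insensitive
    diff = [p - r for r, p in zip(reac, prod)]
    if mode == "reac_prod":
        return list(reac) + list(prod)
    if mode == "reac_diff":
        return list(reac) + diff
    if mode == "prod_diff":
        return list(prod) + diff
    raise ValueError(f"mode not covered by this sketch: {mode!r}")


reactant_features = [1, 0]
product_features = [0, 1]
for mode in ("REAC_PROD", "REAC_DIFF", "PROD_DIFF"):
    print(mode, combine_reaction_features(reactant_features, product_features, mode))
```

In every mode the combined vector is twice as long as a single-molecule feature vector, which is why the reaction mode must match between training and prediction.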
transfer learning args#
- --model-frzn
Path to model checkpoint file to be loaded for overwriting and freezing weights.
- --frzn-ffn-layers
Overwrites weights for the first n layers of the FFN with those from the checkpoint model (specified with --model-frzn), where n is the value given here. Also automatically freezes the MPNN weights.
Default: 0
message passing#
- --message-hidden-dim
hidden dimension of the messages
Default: 300
- --message-bias
add bias to the message passing layers
Default: False
- --depth
Number of message passing steps.
Default: 3
- --undirected
Pass messages on undirected bonds/edges (always sum the two relevant bond vectors).
Default: False
- --dropout
dropout probability in message passing/FFN layers
Default: 0.0
- --mpn-shared
Whether to use the same message passing neural network for all input molecules. Only relevant if number_of_molecules > 1.
Default: False
- --activation
Possible choices: RELU, LEAKYRELU, PRELU, TANH, SELU, ELU
activation function in message passing/FFN layers
Default: RELU
- --aggregation, --agg
Possible choices: mean, sum, norm
the aggregation mode to use in the graph predictor
Default: “mean”
- --aggregation-norm
normalization factor by which to divide summed up atomic features for ‘norm’ aggregation
Default: 100
- --atom-messages
pass messages on atoms rather than bonds
Default: False
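The three --aggregation choices and the role of --aggregation-norm can be sketched in plain Python. The function aggregate below is a hypothetical illustration operating on lists of per-atom vectors, not Chemprop's implementation: ‘sum’ adds the atom vectors, ‘mean’ divides that sum by the number of atoms, and ‘norm’ divides it by the fixed normalization factor (100 by default, as listed above).

```python
def aggregate(atom_vectors, mode="mean", norm=100.0):
    """Pool per-atom vectors into one molecule-level vector (hypothetical helper)."""
    n_atoms = len(atom_vectors)
    # Element-wise sum over all atoms.
    summed = [sum(column) for column in zip(*atom_vectors)]
    if mode == "sum":
        return summed
    if mode == "mean":
        return [value / n_atoms for value in summed]
    if mode == "norm":
        # Same as 'sum', but scaled by a constant that does not depend on molecule size.
        return [value / norm for value in summed]
    raise ValueError(f"unknown aggregation mode: {mode!r}")


atoms = [[1.0, 2.0], [3.0, 4.0]]
for mode in ("mean", "sum", "norm"):
    print(mode, aggregate(atoms, mode))
```

Unlike ‘mean’, both ‘sum’ and ‘norm’ keep the result dependent on the number of atoms in the molecule.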
FFN args#
- --ffn-hidden-dim
hidden dimension in the FFN top model
Default: 300
- --ffn-num-layers
number of layers in FFN top model
Default: 1
extra MPNN args#
- --no-batch-norm
Don’t use batch normalization after aggregation.
Default: False
- --multiclass-num-classes
Number of classes when running multiclass classification.
Default: 3
training input data args#
- -w, --weight-column
the name of the column in the input CSV containing individual data weights
- --target-columns
Name of the columns containing target values. By default, uses all columns except the SMILES column and the ignore_columns.
- --ignore-columns
Name of the columns to ignore when target_columns is not provided.
- --splits-column
Name of the column in the input CSV file containing ‘train’, ‘val’, or ‘test’ for each row.
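How --target-columns and --ignore-columns interact can be sketched as follows. The helper resolve_target_columns and the example column names are hypothetical; the sketch follows only the rule stated above (all columns except the SMILES columns and the ignored columns) and does not model any special handling of weight or splits columns.

```python
def resolve_target_columns(header, smiles_columns, target_columns=None, ignore_columns=None):
    """Pick the target columns from a CSV header (hypothetical helper)."""
    if target_columns:
        # Explicit targets take priority; ignore_columns is then not consulted.
        return list(target_columns)
    excluded = set(smiles_columns) | set(ignore_columns or [])
    # Keep the original column order of the CSV header.
    return [column for column in header if column not in excluded]


header = ["smiles", "compound_id", "logP", "pKa"]  # hypothetical CSV header
print(resolve_target_columns(header, ["smiles"], ignore_columns=["compound_id"]))
print(resolve_target_columns(header, ["smiles"], target_columns=["pKa"]))
```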
training args#
- -t, --task-type
Possible choices: regression, regression-mve, regression-evidential, classification, classification-dirichlet, multiclass, multiclass-dirichlet, spectral
Type of dataset. This determines the default loss function used during training. Defaults to regression.
Default: “regression”
- -l, --loss-function
Possible choices: mse, bounded-mse, mve, evidential, bce, ce, binary-mcc, multiclass-mcc, binary-dirichlet, multiclass-dirichlet, sid, earthmovers, wasserstein
Loss function to use during training. If not specified, will use the default loss function for the given task type (see documentation).
- --v-kl, --evidential-regularization
Value used in regularization for evidential loss function. The default value recommended by Soleimany et al. (2021) is 0.2. Optimal value is dataset-dependent; it is recommended that users test different values to find the best value for their model.
Default: 0.0
- --eps
evidential regularization epsilon
Default: 1e-08
- --metrics, --metric
Possible choices: mae, mse, rmse, bounded-mae, bounded-mse, bounded-rmse, r2, roc, prc, accuracy, f1, bce, ce, binary-mcc, multiclass-mcc, sid, wasserstein
evaluation metrics. If unspecified, will use the following metrics for given dataset types: regression -> rmse, classification -> roc, multiclass -> ce (‘cross entropy’), spectral -> sid. If multiple metrics are provided, the first one (index 0) will be used for early stopping and checkpointing.
- --task-weights
the weight to apply to an individual task in the overall loss
- --warmup-epochs
Number of epochs during which learning rate increases linearly from init_lr to max_lr. Afterwards, learning rate decreases exponentially from max_lr to final_lr.
Default: 2
- --init-lr
Initial learning rate.
Default: 0.0001
- --max-lr
Maximum learning rate.
Default: 0.001
- --final-lr
Final learning rate.
Default: 0.0001
- --epochs
the number of epochs to train over
Default: 50
- --patience
Number of epochs to wait for improvement before early stopping.
- --grad-clip
Passed directly to the lightning trainer which controls grad clipping. See the Trainer() docstring for details.
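The schedule controlled by --warmup-epochs, --init-lr, --max-lr, --final-lr and --epochs (linear warm-up followed by exponential decay) can be sketched as a function of the epoch number. The function learning_rate is a hypothetical illustration using the defaults listed above; it evaluates the schedule once per epoch, whereas the actual scheduler may update more finely (for example per training step), which is an assumption this sketch does not verify.

```python
def learning_rate(epoch, warmup_epochs=2, epochs=50,
                  init_lr=1e-4, max_lr=1e-3, final_lr=1e-4):
    """Learning rate at a given (possibly fractional) epoch (hypothetical helper)."""
    if epoch < warmup_epochs:
        # Linear increase from init_lr to max_lr during warm-up.
        return init_lr + (max_lr - init_lr) * epoch / warmup_epochs
    decay_epochs = epochs - warmup_epochs
    if decay_epochs <= 0:
        return max_lr
    # Exponential decrease from max_lr to final_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / decay_epochs
    return max_lr * (final_lr / max_lr) ** progress


for epoch in (0, 1, 2, 26, 50):
    print(epoch, learning_rate(epoch))
```

With the defaults, the rate starts at init_lr, peaks at max_lr when warm-up ends, and returns to final_lr at the last epoch.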
split args#
- --split, --split-type
Possible choices: CV_NO_VAL, CV, SCAFFOLD_BALANCED, RANDOM_WITH_REPEATED_SMILES, RANDOM, KENNARD_STONE, KMEANS
Method of splitting the data into train/val/test (case insensitive).
Default: RANDOM
- --split-sizes
Split proportions for train/validation/test sets.
Default: [0.8, 0.1, 0.1]
- --split-key-molecule
The index of the key molecule used for splitting when multiple molecules are present and constrained split_type is used (e.g., ‘scaffold_balanced’ or ‘random_with_repeated_smiles’). Note that this index begins with zero for the first molecule.
Default: 0
- -k, --num-folds
Number of folds when performing cross validation.
Default: 1
- --save-smiles-splits
Save SMILES for each train/val/test split for prediction convenience later.
Default: False
- --splits-file
Path to a JSON file containing pre-defined splits for the input data, formatted as a list of dictionaries with keys ‘train’, ‘val’, and ‘test’ and values as lists of indices or strings formatted like ‘0-2,4’. See documentation for more details.
- --data-seed
Random seed to use when splitting data into train/val/test sets. When num_folds > 1, the first fold uses this seed and all subsequent folds add 1 to the seed. Also used for shuffling data in build_dataloader when shuffle is True.
Default: 0
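The --splits-file format (a list of dictionaries with keys ‘train’, ‘val’ and ‘test’, whose values are lists of indices or strings such as ‘0-2,4’) can be sketched like this. The helper expand_indices is hypothetical, and the sketch assumes that a range such as ‘0-2’ is inclusive at both ends; consult the documentation referred to above for the authoritative rules.

```python
import json


def expand_indices(spec):
    """Turn '0-2,4' into [0, 1, 2, 4]; lists pass through (hypothetical helper)."""
    if not isinstance(spec, str):
        return list(spec)
    indices = []
    for part in spec.split(","):
        if "-" in part:
            start, end = (int(value) for value in part.split("-"))
            # Assumption: ranges include both endpoints.
            indices.extend(range(start, end + 1))
        else:
            indices.append(int(part))
    return indices


# Example file content: one dictionary per split of the data.
splits_json = '[{"train": "0-2,4", "val": [3], "test": "5-6"}]'
splits = [
    {name: expand_indices(spec) for name, spec in split.items()}
    for split in json.loads(splits_json)
]
print(splits)
```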
predict#
use a pretrained chemprop model for prediction
chemprop predict [-h] [--logfile [LOGFILE]] [-v]
[-s SMILES_COLUMNS [SMILES_COLUMNS ...]]
[-r REACTION_COLUMNS [REACTION_COLUMNS ...]]
[--no-header-row] [-n NUM_WORKERS] [-b BATCH_SIZE]
[--accelerator ACCELERATOR] [--devices DEVICES]
[--rxn-mode {REAC_PROD,REAC_PROD_BALANCE,REAC_DIFF,REAC_DIFF_BALANCE,PROD_DIFF,PROD_DIFF_BALANCE}]
[--multi-hot-atom-featurizer-mode {V1,V2,ORGANIC}] [--keep-h]
[--add-h]
[--features-generators {morgan_binary,morgan_count} [{morgan_binary,morgan_count} ...]]
[--descriptors-path DESCRIPTORS_PATH]
[--no-descriptor-scaling] [--no-atom-feature-scaling]
[--no-atom-descriptor-scaling] [--no-bond-feature-scaling]
[--atom-features-path ATOM_FEATURES_PATH [ATOM_FEATURES_PATH ...]]
[--atom-descriptors-path ATOM_DESCRIPTORS_PATH [ATOM_DESCRIPTORS_PATH ...]]
[--bond-features-path BOND_FEATURES_PATH [BOND_FEATURES_PATH ...]]
-i TEST_PATH [-o OUTPUT] [--drop-extra-columns] --model-path
MODEL_PATH
[--target-columns TARGET_COLUMNS [TARGET_COLUMNS ...]]
Named Arguments#
- --logfile, --log
The path to which the log file should be written. Specifying just the flag (i.e., ‘--log/--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.
- -v, --verbose
The verbosity level, specify the flag multiple times to increase verbosity.
Default: 0
- --accelerator
Passed directly to the lightning Trainer().
Default: “auto”
- --devices
Passed directly to the lightning Trainer(). If specifying multiple devices, must be a single string of comma separated devices, e.g. ‘1, 2’.
Default: “auto”
- -i, --test-path
Path to an input CSV file containing SMILES.
- -o, --output, --preds-path
Path to which predictions will be saved. If the file extension is .pkl, will be saved as a pickle file. Otherwise, will save predictions as a CSV. The index of the model will be appended to the filename’s stem. By default, predictions will be saved to the same location as ‘--test-path’ with ‘_preds’ appended, i.e., ‘PATH/TO/TEST_PATH_preds_0.csv’.
- --drop-extra-columns
Whether to drop all columns from the test data file besides the SMILES columns and the new prediction columns.
Default: False
- --model-path
Path to either a single pretrained model checkpoint (.ckpt) or single pretrained model file (.pt) or to a directory that contains these files. If a directory, will recursively search and predict on all found models.
- --target-columns
Column names to save the predictions to. If not provided, the predictions will be saved to columns named ‘pred_0’, ‘pred_1’, etc.
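The output naming rule for predictions (default name derived from the test path, with the model index appended to the filename's stem) can be sketched with pathlib. The helper prediction_path is hypothetical and only mirrors the wording above; it does not write any file.

```python
from pathlib import Path


def prediction_path(test_path, model_index, output=None):
    """Where predictions of one model would be written (hypothetical helper)."""
    test_path = Path(test_path)
    if output is None:
        # Default: same location as the test file, with '_preds' appended.
        base = test_path.with_name(f"{test_path.stem}_preds.csv")
    else:
        base = Path(output)
    # The model index is appended to the stem; the extension is kept.
    return base.with_name(f"{base.stem}_{model_index}{base.suffix}")


print(prediction_path("PATH/TO/test.csv", 0))
print(prediction_path("PATH/TO/test.csv", 1, output="results/preds.pkl"))
```

Because the index is part of the file name, predicting with a directory of several models yields one output file per model.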
Dataloader args#
- -n, --num-workers
Number of workers for parallel data loading (0 means sequential). Warning: setting num_workers>0 can cause hangs on Windows and MacOS.
Default: 0
- -b, --batch-size
Batch size.
Default: 64
Featurization args#
- --rxn-mode, --reaction-mode
Possible choices: REAC_PROD, REAC_PROD_BALANCE, REAC_DIFF, REAC_DIFF_BALANCE, PROD_DIFF, PROD_DIFF_BALANCE
Choices for construction of atom and bond features for reactions (case insensitive):
  - ‘reac_prod’: concatenates the reactants feature with the products feature.
  - ‘reac_diff’: concatenates the reactants feature with the difference in features between reactants and products. (Default)
  - ‘prod_diff’: concatenates the products feature with the difference in features between reactants and products.
  - ‘reac_prod_balance’: concatenates the reactants feature with the products feature, balances imbalanced reactions.
  - ‘reac_diff_balance’: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.
  - ‘prod_diff_balance’: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.
Default: REAC_DIFF
- --multi-hot-atom-featurizer-mode
Possible choices: V1, V2, ORGANIC
Choices for multi-hot atom featurization scheme. This will affect both non-reaction and reaction featurization (case insensitive):
  - V1: Corresponds to the original configuration employed in Chemprop V1.
  - V2: Tailored for a broad range of molecules, this configuration encompasses all elements in the first four rows of the periodic table, along with iodine. It is the default in Chemprop V2.
  - ORGANIC: Designed specifically for use with organic molecules for drug research and development, this configuration includes a subset of elements most common in organic chemistry, including H, B, C, N, O, F, Si, P, S, Cl, Br, and I.
Default: V2
- --keep-h
Whether hydrogens explicitly specified in input should be kept in the mol graph.
Default: False
- --add-h
Whether hydrogens should be added to the mol graph.
Default: False
- --features-generators
Possible choices: morgan_binary, morgan_count
Method(s) of generating additional features.
- --descriptors-path
Path to extra descriptors to concatenate to learned representation.
- --no-descriptor-scaling
Turn off extra descriptor scaling.
Default: False
- --no-atom-feature-scaling
Turn off extra atom feature scaling.
Default: False
- --no-atom-descriptor-scaling
Turn off extra atom descriptor scaling.
Default: False
- --no-bond-feature-scaling
Turn off extra bond feature scaling.
Default: False
- --atom-features-path
If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional atom features to supply before message passing. E.g., --atom-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-features-path […] --atom-features-path […].
- --atom-descriptors-path
If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional atom descriptors to supply after message passing. E.g., --atom-descriptors-path 0 /path/to/descriptors_0.npz indicates that the descriptors at the given path should be supplied to the 0-th component. To supply additional descriptors for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-descriptors-path […] --atom-descriptors-path […].
- --bond-features-path
If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional bond features to supply before message passing. E.g., --bond-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --bond-features-path […] --bond-features-path […].
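The two accepted forms of the extra feature path arguments (a single path, or a molecule index followed by a path, repeated once per component) can be sketched as a small normalization step. The helper assign_paths and the file names are hypothetical; each inner list stands for the values given to one occurrence of the flag, and letting a later occurrence overwrite an earlier one for the same index is a choice of this sketch, not documented behavior.

```python
def assign_paths(occurrences):
    """Map molecule index -> path for repeated path arguments (hypothetical helper)."""
    assigned = {}
    for values in occurrences:
        if len(values) == 1:
            # A single path is assumed to belong to the 0-th molecule.
            index, path = 0, values[0]
        elif len(values) == 2:
            # Two values: molecule index followed by the path.
            index, path = int(values[0]), values[1]
        else:
            raise ValueError(f"expected PATH or INDEX PATH, got {values!r}")
        assigned[index] = path  # sketch choice: a later occurrence wins
    return assigned


# One occurrence with a bare path.
print(assign_paths([["features.npz"]]))
# The flag repeated once per component, each time with an explicit index.
print(assign_paths([["0", "features_0.npz"], ["1", "features_1.npz"]]))
```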
convert#
convert a v1 model checkpoint (.pt) to a v2 model checkpoint (.ckpt)
chemprop convert [-h] [--logfile [LOGFILE]] [-v] -i INPUT_PATH
[-o OUTPUT_PATH]
Named Arguments#
- --logfile, --log
The path to which the log file should be written. Specifying just the flag (i.e., ‘--log/--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.
- -v, --verbose
The verbosity level, specify the flag multiple times to increase verbosity.
Default: 0
- -i, --input-path
The path to a v1 model .pt checkpoint file.
- -o, --output-path
The path to which the converted model will be saved. Defaults to ‘CURRENT_DIRECTORY/STEM_OF_INPUT_v2.ckpt’
fingerprint#
use a pretrained chemprop model to calculate learned representations
chemprop fingerprint [-h] [--logfile [LOGFILE]] [-v]
[-s SMILES_COLUMNS [SMILES_COLUMNS ...]]
[-r REACTION_COLUMNS [REACTION_COLUMNS ...]]
[--no-header-row] [-n NUM_WORKERS] [-b BATCH_SIZE]
[--accelerator ACCELERATOR] [--devices DEVICES]
[--rxn-mode {REAC_PROD,REAC_PROD_BALANCE,REAC_DIFF,REAC_DIFF_BALANCE,PROD_DIFF,PROD_DIFF_BALANCE}]
[--multi-hot-atom-featurizer-mode {V1,V2,ORGANIC}]
[--keep-h] [--add-h]
[--features-generators {morgan_binary,morgan_count} [{morgan_binary,morgan_count} ...]]
[--descriptors-path DESCRIPTORS_PATH]
[--no-descriptor-scaling] [--no-atom-feature-scaling]
[--no-atom-descriptor-scaling]
[--no-bond-feature-scaling]
[--atom-features-path ATOM_FEATURES_PATH [ATOM_FEATURES_PATH ...]]
[--atom-descriptors-path ATOM_DESCRIPTORS_PATH [ATOM_DESCRIPTORS_PATH ...]]
[--bond-features-path BOND_FEATURES_PATH [BOND_FEATURES_PATH ...]]
-i TEST_PATH [-o OUTPUT] --model-path MODEL_PATH
--ffn-block-index FFN_BLOCK_INDEX
Named Arguments#
- --logfile, --log
The path to which the log file should be written. Specifying just the flag (i.e., ‘--log/--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.
- -v, --verbose
The verbosity level, specify the flag multiple times to increase verbosity.
Default: 0
- --accelerator
Passed directly to the lightning Trainer().
Default: “auto”
- --devices
Passed directly to the lightning Trainer(). If specifying multiple devices, must be a single string of comma separated devices, e.g. ‘1, 2’.
Default: “auto”
- -i, --test-path
Path to an input CSV file containing SMILES.
- -o, --output, --preds-path
Path to which predictions will be saved. If the file extension is .npz, predictions will be saved as an .npz file. Otherwise, predictions will be saved as a CSV. The index of the model will be appended to the filename’s stem. By default, predictions will be saved to the same location as ‘--test-path’ with ‘_fps’ appended, i.e., ‘PATH/TO/TEST_PATH_fps_0.csv’.
- --model-path
Path to either a single pretrained model checkpoint (.ckpt) or single pretrained model file (.pt) or to a directory that contains these files. If a directory, will recursively search and predict on all found models.
- --ffn-block-index
The index indicates which linear layer returns the encoding in the FFN. An index of 0 denotes the post-aggregation representation through a 0-layer MLP, while an index of 1 represents the output from the first linear layer in the FFN, and so forth.
Default: -1
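The meaning of --ffn-block-index can be sketched with a toy feed-forward network represented as a list of Python callables. The function encoding and the toy layers are hypothetical. Indices 0 and 1 follow the description above; treating the default of -1 with Python slice semantics (all layers except the last) is an assumption of this sketch that the text does not confirm.

```python
def encoding(x, layers, block_index=-1):
    """Apply the first `block_index` layers of a toy FFN (hypothetical helper).

    Assumption: the index is used as a slice end, so -1 means
    "everything except the final layer".
    """
    for layer in layers[:block_index]:
        x = layer(x)
    return x


# Toy stand-ins for the linear layers of an FFN.
layers = [
    lambda v: [2 * a for a in v],   # first layer
    lambda v: [a + 1 for a in v],   # second layer
    lambda v: [sum(v)],             # final (output) layer
]

aggregated = [1, 2]  # stand-in for the post-aggregation representation
for index in (0, 1, 2, -1):
    print(index, encoding(aggregated, layers, index))
```

An index of 0 returns the post-aggregation representation unchanged, and larger indices return progressively deeper hidden representations.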
Dataloader args#
- -n, --num-workers
Number of workers for parallel data loading (0 means sequential). Warning: setting num_workers>0 can cause hangs on Windows and MacOS.
Default: 0
- -b, --batch-size
Batch size.
Default: 64
Featurization args#
- --rxn-mode, --reaction-mode
Possible choices: REAC_PROD, REAC_PROD_BALANCE, REAC_DIFF, REAC_DIFF_BALANCE, PROD_DIFF, PROD_DIFF_BALANCE
Choices for construction of atom and bond features for reactions (case insensitive):
  - ‘reac_prod’: concatenates the reactants feature with the products feature.
  - ‘reac_diff’: concatenates the reactants feature with the difference in features between reactants and products. (Default)
  - ‘prod_diff’: concatenates the products feature with the difference in features between reactants and products.
  - ‘reac_prod_balance’: concatenates the reactants feature with the products feature, balances imbalanced reactions.
  - ‘reac_diff_balance’: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.
  - ‘prod_diff_balance’: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.
Default: REAC_DIFF
- --multi-hot-atom-featurizer-mode
Possible choices: V1, V2, ORGANIC
Choices for multi-hot atom featurization scheme. This will affect both non-reaction and reaction featurization (case insensitive):
  - V1: Corresponds to the original configuration employed in Chemprop V1.
  - V2: Tailored for a broad range of molecules, this configuration encompasses all elements in the first four rows of the periodic table, along with iodine. It is the default in Chemprop V2.
  - ORGANIC: Designed specifically for use with organic molecules for drug research and development, this configuration includes a subset of elements most common in organic chemistry, including H, B, C, N, O, F, Si, P, S, Cl, Br, and I.
Default: V2
- --keep-h
Whether hydrogens explicitly specified in input should be kept in the mol graph.
Default: False
- --add-h
Whether hydrogens should be added to the mol graph.
Default: False
- --features-generators
Possible choices: morgan_binary, morgan_count
Method(s) of generating additional features.
- --descriptors-path
Path to extra descriptors to concatenate to learned representation.
- --no-descriptor-scaling
Turn off extra descriptor scaling.
Default: False
- --no-atom-feature-scaling
Turn off extra atom feature scaling.
Default: False
- --no-atom-descriptor-scaling
Turn off extra atom descriptor scaling.
Default: False
- --no-bond-feature-scaling
Turn off extra bond feature scaling.
Default: False
- --atom-features-path
If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional atom features to supply before message passing. E.g., --atom-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-features-path […] --atom-features-path […].
- --atom-descriptors-path
If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional atom descriptors to supply after message passing. E.g., --atom-descriptors-path 0 /path/to/descriptors_0.npz indicates that the descriptors at the given path should be supplied to the 0-th component. To supply additional descriptors for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-descriptors-path […] --atom-descriptors-path […].
- --bond-features-path
If a single path is given, it’s assumed to correspond to the 0-th molecule. Or, it can be a two-tuple of molecule index and path to additional bond features to supply before message passing. E.g., --bond-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --bond-features-path […] --bond-features-path […].
hpopt#
perform hyperparameter optimization on the given task
chemprop hpopt [-h] [--logfile [LOGFILE]] [-v]
[-s SMILES_COLUMNS [SMILES_COLUMNS ...]]
[-r REACTION_COLUMNS [REACTION_COLUMNS ...]] [--no-header-row]
[-n NUM_WORKERS] [-b BATCH_SIZE] [--accelerator ACCELERATOR]
[--devices DEVICES]
[--rxn-mode {REAC_PROD,REAC_PROD_BALANCE,REAC_DIFF,REAC_DIFF_BALANCE,PROD_DIFF,PROD_DIFF_BALANCE}]
[--multi-hot-atom-featurizer-mode {V1,V2,ORGANIC}] [--keep-h]
[--add-h]
[--features-generators {morgan_binary,morgan_count} [{morgan_binary,morgan_count} ...]]
[--descriptors-path DESCRIPTORS_PATH] [--no-descriptor-scaling]
[--no-atom-feature-scaling] [--no-atom-descriptor-scaling]
[--no-bond-feature-scaling]
[--atom-features-path ATOM_FEATURES_PATH [ATOM_FEATURES_PATH ...]]
[--atom-descriptors-path ATOM_DESCRIPTORS_PATH [ATOM_DESCRIPTORS_PATH ...]]
[--bond-features-path BOND_FEATURES_PATH [BOND_FEATURES_PATH ...]]
[--config-path CONFIG_PATH] [-i DATA_PATH] [-o OUTPUT_DIR]
[--model-frzn MODEL_FRZN] [--frzn-ffn-layers FRZN_FFN_LAYERS]
[--ensemble-size ENSEMBLE_SIZE]
[--message-hidden-dim MESSAGE_HIDDEN_DIM] [--message-bias]
[--depth DEPTH] [--undirected] [--dropout DROPOUT]
[--mpn-shared]
[--activation {RELU,LEAKYRELU,PRELU,TANH,SELU,ELU}]
[--aggregation {mean,sum,norm}]
[--aggregation-norm AGGREGATION_NORM] [--atom-messages]
[--ffn-hidden-dim FFN_HIDDEN_DIM]
[--ffn-num-layers FFN_NUM_LAYERS] [--no-batch-norm]
[--multiclass-num-classes MULTICLASS_NUM_CLASSES]
[-w WEIGHT_COLUMN]
[--target-columns TARGET_COLUMNS [TARGET_COLUMNS ...]]
[--ignore-columns IGNORE_COLUMNS [IGNORE_COLUMNS ...]]
[-t {regression,regression-mve,regression-evidential,classification,classification-dirichlet,multiclass,multiclass-dirichlet,spectral}]
[-l {mse,bounded-mse,mve,evidential,bce,ce,binary-mcc,multiclass-mcc,binary-dirichlet,multiclass-dirichlet,sid,earthmovers,wasserstein}]
[--v-kl V_KL] [--eps EPS]
[--metrics {mae,mse,rmse,bounded-mae,bounded-mse,bounded-rmse,r2,roc,prc,accuracy,f1,bce,ce,binary-mcc,multiclass-mcc,sid,wasserstein} [{mae,mse,rmse,bounded-mae,bounded-mse,bounded-rmse,r2,roc,prc,accuracy,f1,bce,ce,binary-mcc,multiclass-mcc,sid,wasserstein} ...]]
[--task-weights TASK_WEIGHTS [TASK_WEIGHTS ...]]
[--warmup-epochs WARMUP_EPOCHS] [--init-lr INIT_LR]
[--max-lr MAX_LR] [--final-lr FINAL_LR] [--epochs EPOCHS]
[--patience PATIENCE] [--grad-clip GRAD_CLIP]
[--split {CV_NO_VAL,CV,SCAFFOLD_BALANCED,RANDOM_WITH_REPEATED_SMILES,RANDOM,KENNARD_STONE,KMEANS}]
[--split-sizes SPLIT_SIZES SPLIT_SIZES SPLIT_SIZES]
[--split-key-molecule SPLIT_KEY_MOLECULE] [-k NUM_FOLDS]
[--save-smiles-splits] [--splits-file SPLITS_FILE]
[--splits-column SPLITS_COLUMN] [--data-seed DATA_SEED]
[--pytorch-seed PYTORCH_SEED]
[--search-parameter-keywords SEARCH_PARAMETER_KEYWORDS [SEARCH_PARAMETER_KEYWORDS ...]]
[--hpopt-save-dir HPOPT_SAVE_DIR]
[--raytune-num-samples RAYTUNE_NUM_SAMPLES]
[--raytune-search-algorithm {random,hyperopt}]
[--raytune-num-workers RAYTUNE_NUM_WORKERS] [--raytune-use-gpu]
[--raytune-num-checkpoints-to-keep RAYTUNE_NUM_CHECKPOINTS_TO_KEEP]
[--raytune-grace-period RAYTUNE_GRACE_PERIOD]
[--raytune-reduction-factor RAYTUNE_REDUCTION_FACTOR]
[--hyperopt-n-initial-points HYPEROPT_N_INITIAL_POINTS]
[--hyperopt-random-state-seed HYPEROPT_RANDOM_STATE_SEED]
Named Arguments#
- --logfile, --log
The path to which the log file should be written. Specifying just the flag (i.e., ‘--log/--logfile’) will automatically log to a file ‘chemprop_logs/MODE/TIMESTAMP.log’, where ‘MODE’ is the CLI mode chosen. An example ‘TIMESTAMP’ is 2024-04-30T15-20-07.
- -v, --verbose
The verbosity level, specify the flag multiple times to increase verbosity.
Default: 0
- --accelerator
Passed directly to the lightning Trainer().
Default: “auto”
- --devices
Passed directly to the lightning Trainer(). If specifying multiple devices, must be a single string of comma separated devices, e.g. ‘1, 2’.
Default: “auto”
- --config-path
Path to a configuration file. Command line arguments override values in the configuration file.
- -i, --data-path
Path to an input CSV file containing SMILES and the associated target values.
- -o, --output-dir, --save-dir
Directory where training outputs will be saved. Defaults to ‘CURRENT_DIRECTORY/chemprop_training/STEM_OF_INPUT/TIME_STAMP’.
- --ensemble-size
Number of models in ensemble for each splitting of data.
Default: 1
- --pytorch-seed
Seed for PyTorch randomness (e.g., random initial weights).
Dataloader args#
- -n, --num-workers
Number of workers for parallel data loading (0 means sequential). Warning: setting num_workers > 0 can cause hangs on Windows and macOS.
Default: 0
- -b, --batch-size
Batch size.
Default: 64
Featurization args#
- --rxn-mode, --reaction-mode
Possible choices: REAC_PROD, REAC_PROD_BALANCE, REAC_DIFF, REAC_DIFF_BALANCE, PROD_DIFF, PROD_DIFF_BALANCE
Choices for construction of atom and bond features for reactions (case insensitive):
  - ‘reac_prod’: concatenates the reactants feature with the products feature.
  - ‘reac_diff’: concatenates the reactants feature with the difference in features between reactants and products. (Default)
  - ‘prod_diff’: concatenates the products feature with the difference in features between reactants and products.
  - ‘reac_prod_balance’: concatenates the reactants feature with the products feature, balances imbalanced reactions.
  - ‘reac_diff_balance’: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.
  - ‘prod_diff_balance’: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.
Default: REAC_DIFF
- --multi-hot-atom-featurizer-mode
Possible choices: V1, V2, ORGANIC
Choices for multi-hot atom featurization scheme. This will affect both non-reaction and reaction featurization (case insensitive):
  - V1: Corresponds to the original configuration employed in Chemprop V1.
  - V2: Tailored for a broad range of molecules, this configuration encompasses all elements in the first four rows of the periodic table, along with iodine. It is the default in Chemprop V2.
  - ORGANIC: Designed specifically for use with organic molecules for drug research and development, this configuration includes a subset of elements most common in organic chemistry, including H, B, C, N, O, F, Si, P, S, Cl, Br, and I.
Default: V2
- --keep-h
Whether hydrogens explicitly specified in input should be kept in the mol graph.
Default: False
- --add-h
Whether hydrogens should be added to the mol graph.
Default: False
- --features-generators
Possible choices: morgan_binary, morgan_count
Method(s) of generating additional features.
- --descriptors-path
Path to extra descriptors to concatenate to learned representation.
- --no-descriptor-scaling
Turn off extra descriptor scaling.
Default: False
- --no-atom-feature-scaling
Turn off extra atom feature scaling.
Default: False
- --no-atom-descriptor-scaling
Turn off extra atom descriptor scaling.
Default: False
- --no-bond-feature-scaling
Turn off extra bond feature scaling.
Default: False
- --atom-features-path
If a single path is given, it’s assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional atom features to supply before message passing. E.g., --atom-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-features-path […] --atom-features-path […].
- --atom-descriptors-path
If a single path is given, it’s assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional atom descriptors to supply after message passing. E.g., --atom-descriptors-path 0 /path/to/descriptors_0.npz indicates that the descriptors at the given path should be supplied to the 0-th component. To supply additional descriptors for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --atom-descriptors-path […] --atom-descriptors-path […].
- --bond-features-path
If a single path is given, it’s assumed to correspond to the 0-th molecule. Alternatively, it can be a two-tuple of molecule index and path to additional bond features to supply before message passing. E.g., --bond-features-path 0 /path/to/features_0.npz indicates that the features at the given path should be supplied to the 0-th component. To supply additional features for multiple components, repeat this argument on the command line for each component’s respective values, e.g., --bond-features-path […] --bond-features-path […].
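The non-balancing --rxn-mode choices described above differ only in which feature vectors are concatenated. The following is a minimal Python sketch of that idea, not Chemprop's implementation: the function name `combine_reaction_features` is hypothetical, features are plain lists of floats, and the sign of the "difference" (product minus reactant) is an assumption, since the text above does not state it. The `*_balance` variants are not covered.

```python
def combine_reaction_features(reac, prod, mode):
    """Combine reactant and product feature vectors for one atom or bond.

    Hypothetical helper for illustration only.
    """
    mode = mode.lower()  # the CLI documents the choices as case insensitive
    # Assumption: "difference" means product minus reactant, element-wise.
    diff = [p - r for r, p in zip(reac, prod)]
    if mode == "reac_prod":
        return reac + prod  # list concatenation, not addition
    if mode == "reac_diff":
        return reac + diff
    if mode == "prod_diff":
        return prod + diff
    raise ValueError(f"mode not covered by this sketch: {mode}")


reac = [1.0, 0.0, 2.0]
prod = [1.0, 1.0, 0.0]
print(combine_reaction_features(reac, prod, "REAC_DIFF"))
# [1.0, 0.0, 2.0, 0.0, 1.0, -2.0]
```

In every mode the combined vector is twice the length of a single feature vector.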
transfer learning args#
- --model-frzn
Path to model checkpoint file to be loaded for overwriting and freezing weights.
- --frzn-ffn-layers
Overwrites weights for the first n layers of the FFN from the checkpoint model (specified with --model-frzn), where n is the value given. Also automatically freezes the MPNN weights.
Default: 0
message passing#
- --message-hidden-dim
hidden dimension of the messages
Default: 300
- --message-bias
add bias to the message passing layers
Default: False
- --depth
Number of message passing steps.
Default: 3
- --undirected
Pass messages on undirected bonds/edges (always sum the two relevant bond vectors).
Default: False
- --dropout
dropout probability in message passing/FFN layers
Default: 0.0
- --mpn-shared
Whether to use the same message passing neural network for all input molecules. Only relevant if number_of_molecules > 1.
Default: False
- --activation
Possible choices: RELU, LEAKYRELU, PRELU, TANH, SELU, ELU
activation function in message passing/FFN layers
Default: RELU
- --aggregation, --agg
Possible choices: mean, sum, norm
The aggregation mode to use in the graph predictor.
Default: “mean”
- --aggregation-norm
normalization factor by which to divide summed up atomic features for ‘norm’ aggregation
Default: 100
- --atom-messages
pass messages on atoms rather than bonds
Default: False
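To make the three --aggregation choices concrete, here is a small Python sketch that pools per-atom vectors into one vector per molecule. It is an illustration under stated assumptions rather than Chemprop's code: the function name `aggregate` is hypothetical, and atom representations are plain lists of floats. The ‘norm’ branch follows the description of --aggregation-norm above (the summed features are divided by a constant).

```python
def aggregate(atom_vectors, mode="mean", norm=100.0):
    """Pool a list of equal-length atom vectors into a single vector.

    Hypothetical helper for illustration only.
    """
    n_features = len(atom_vectors[0])
    # Element-wise sum over all atoms in the molecule.
    summed = [sum(v[i] for v in atom_vectors) for i in range(n_features)]
    if mode == "sum":
        return summed
    if mode == "mean":
        return [s / len(atom_vectors) for s in summed]
    if mode == "norm":
        # Divide by a fixed constant instead of the number of atoms.
        return [s / norm for s in summed]
    raise ValueError(f"unknown aggregation mode: {mode}")


atoms = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(aggregate(atoms, "sum"))   # [9.0, 12.0]
print(aggregate(atoms, "mean"))  # [3.0, 4.0]
print(aggregate(atoms, "norm"))
```

Unlike ‘mean’, the ‘sum’ and ‘norm’ results grow with the number of atoms, so they retain information about molecule size.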
FFN args#
- --ffn-hidden-dim
hidden dimension in the FFN top model
Default: 300
- --ffn-num-layers
number of layers in FFN top model
Default: 1
extra MPNN args#
- --no-batch-norm
Don’t use batch normalization after aggregation.
Default: False
- --multiclass-num-classes
Number of classes when running multiclass classification.
Default: 3
training input data args#
- -w, --weight-column
the name of the column in the input CSV containing individual data weights
- --target-columns
Names of the columns containing target values. By default, uses all columns except the SMILES column and the ignore_columns.
- --ignore-columns
Names of the columns to ignore when target_columns is not provided.
- --splits-column
Name of the column in the input CSV file containing ‘train’, ‘val’, or ‘test’ for each row.
training args#
- -t, --task-type
Possible choices: regression, regression-mve, regression-evidential, classification, classification-dirichlet, multiclass, multiclass-dirichlet, spectral
Type of dataset. This determines the default loss function used during training. Defaults to regression.
Default: “regression”
- -l, --loss-function
Possible choices: mse, bounded-mse, mve, evidential, bce, ce, binary-mcc, multiclass-mcc, binary-dirichlet, multiclass-dirichlet, sid, earthmovers, wasserstein
Loss function to use during training. If not specified, will use the default loss function for the given task type (see documentation).
- --v-kl, --evidential-regularization
Value used in regularization for evidential loss function. The default value recommended by Soleimany et al. (2021) is 0.2. Optimal value is dataset-dependent; it is recommended that users test different values to find the best value for their model.
Default: 0.0
- --eps
evidential regularization epsilon
Default: 1e-08
- --metrics, --metric
Possible choices: mae, mse, rmse, bounded-mae, bounded-mse, bounded-rmse, r2, roc, prc, accuracy, f1, bce, ce, binary-mcc, multiclass-mcc, sid, wasserstein
Evaluation metrics. If unspecified, the following metrics are used for the given dataset types: regression -> rmse, classification -> roc, multiclass -> ce (‘cross entropy’), spectral -> sid. If multiple metrics are provided, the first (0th) one is used for early stopping and checkpointing.
- --task-weights
the weight to apply to an individual task in the overall loss
- --warmup-epochs
Number of epochs during which learning rate increases linearly from init_lr to max_lr. Afterwards, learning rate decreases exponentially from max_lr to final_lr.
Default: 2
- --init-lr
Initial learning rate.
Default: 0.0001
- --max-lr
Maximum learning rate.
Default: 0.001
- --final-lr
Final learning rate.
Default: 0.0001
- --epochs
the number of epochs to train over
Default: 50
- --patience
Number of epochs to wait for improvement before early stopping.
- --grad-clip
Passed directly to the lightning trainer which controls grad clipping. See the Trainer() docstring for details.
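The learning-rate schedule controlled by --warmup-epochs, --init-lr, --max-lr, --final-lr, and --epochs (linear warm-up followed by exponential decay) can be sketched in Python as follows. This is a simplified illustration, not Chemprop's scheduler: the function name `lr_at_epoch` is hypothetical, and the rate is computed once per epoch, whereas a real scheduler may update it at every optimizer step. The default values are the ones listed above.

```python
def lr_at_epoch(epoch, init_lr=1e-4, max_lr=1e-3, final_lr=1e-4,
                warmup_epochs=2, epochs=50):
    """Learning rate at a given (possibly fractional) epoch.

    Hypothetical helper for illustration only.
    """
    if epoch < warmup_epochs:
        # Linear increase from init_lr to max_lr during warm-up.
        return init_lr + (max_lr - init_lr) * epoch / warmup_epochs
    decay_epochs = epochs - warmup_epochs
    if decay_epochs <= 0:
        return max_lr  # no room left for a decay phase
    # Exponential decay: multiply by a constant factor each epoch so that
    # the rate reaches final_lr at the last epoch.
    gamma = (final_lr / max_lr) ** (1 / decay_epochs)
    return max_lr * gamma ** (epoch - warmup_epochs)


for e in (0, 1, 2, 26, 50):
    print(e, lr_at_epoch(e))
```

With the defaults, the rate starts at 0.0001, peaks at 0.001 after two epochs, and returns to approximately 0.0001 at epoch 50.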
split args#
- --split, --split-type
Possible choices: CV_NO_VAL, CV, SCAFFOLD_BALANCED, RANDOM_WITH_REPEATED_SMILES, RANDOM, KENNARD_STONE, KMEANS
Method of splitting the data into train/val/test (case insensitive).
Default: RANDOM
- --split-sizes
Split proportions for train/validation/test sets.
Default: [0.8, 0.1, 0.1]
- --split-key-molecule
The index of the key molecule used for splitting when multiple molecules are present and constrained split_type is used (e.g., ‘scaffold_balanced’ or ‘random_with_repeated_smiles’). Note that this index begins with zero for the first molecule.
Default: 0
- -k, --num-folds
Number of folds when performing cross validation.
Default: 1
- --save-smiles-splits
Save SMILES for each train/val/test split for prediction convenience later.
Default: False
- --splits-file
Path to a JSON file containing pre-defined splits for the input data, formatted as a list of dictionaries with keys ‘train’, ‘val’, and ‘test’ and values as lists of indices or strings formatted like ‘0-2,4’. See documentation for more details.
- --data-seed
Random seed to use when splitting data into train/val/test sets. When num_folds > 1, the first fold uses this seed and all subsequent folds add 1 to the seed. Also used for shuffling data in build_dataloader when shuffle is True.
Default: 0
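As an illustration of the --splits-file format described above, the Python sketch below builds a small JSON document with one split and expands index strings such as ‘0-2,4’ into explicit lists. The helper name `parse_indices` and the example data are hypothetical, and treating ranges as inclusive of both endpoints is an assumption; the Chemprop documentation referenced above is the authority on the exact format.

```python
import json


def parse_indices(spec):
    """Expand '0-2,4' into [0, 1, 2, 4]; pass lists of indices through.

    Hypothetical helper for illustration only.
    """
    if isinstance(spec, list):
        return [int(i) for i in spec]
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            # Assumption: ranges include both endpoints.
            indices.extend(range(int(start), int(end) + 1))
        else:
            indices.append(int(part))
    return indices


# A list of dictionaries, one per split, with keys 'train', 'val', 'test'.
splits_json = '[{"train": "0-2,4", "val": [3], "test": "5-6"}]'
splits = [
    {name: parse_indices(spec) for name, spec in split.items()}
    for split in json.loads(splits_json)
]
print(splits[0])
# {'train': [0, 1, 2, 4], 'val': [3], 'test': [5, 6]}
```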
Chemprop hyperparameter optimization arguments#
- --search-parameter-keywords
The model parameters over which to search for an optimal hyperparameter configuration. Some options are bundles of parameters or otherwise special parameter operations.
Special keywords:
  - basic: the default set of hyperparameters for search: depth, ffn_num_layers, dropout, message_hidden_dim, and ffn_hidden_dim.
  - learning_rate: search for max_lr, init_lr_ratio, final_lr_ratio, and warmup_epochs. The search for init_lr and final_lr values is defined as fractions of the max_lr value. The search for warmup_epochs is as a fraction of the total epochs used.
  - all: include search for all 13 individual keyword options.
- Individual supported parameters:
[]
Default: [‘basic’]
- --hpopt-save-dir
Directory to save the hyperparameter optimization results
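The learning_rate keyword for --search-parameter-keywords searches over ratios rather than absolute values. The Python sketch below shows how one sampled configuration could be turned back into concrete training settings. It is only an illustration: the function name `resolve_learning_rate_params` and the parameter name `warmup_fraction` are hypothetical, and rounding the warm-up length to a whole number of epochs (with a minimum of 1) is an assumption not stated in the text above.

```python
def resolve_learning_rate_params(max_lr, init_lr_ratio, final_lr_ratio,
                                 warmup_fraction, epochs):
    """Convert ratio-based search values into absolute settings.

    Hypothetical helper for illustration only.
    """
    return {
        "max_lr": max_lr,
        # init_lr and final_lr are searched as fractions of max_lr.
        "init_lr": max_lr * init_lr_ratio,
        "final_lr": max_lr * final_lr_ratio,
        # Assumption: round to whole epochs and use at least one.
        "warmup_epochs": max(1, round(warmup_fraction * epochs)),
    }


params = resolve_learning_rate_params(
    max_lr=1e-3, init_lr_ratio=0.1, final_lr_ratio=0.1,
    warmup_fraction=0.1, epochs=50,
)
print(params)
```

Searching over ratios keeps init_lr and final_lr tied to max_lr, so they scale with it whenever the ratios are at most 1.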
Ray Tune arguments#
- --raytune-num-samples
Passed directly to Ray Tune TuneConfig to control number of trials to run
Default: 10
- --raytune-search-algorithm
Possible choices: random, hyperopt
Passed to Ray Tune TuneConfig to control search algorithm
Default: “hyperopt”
- --raytune-num-workers
Passed directly to Ray Tune ScalingConfig to control number of workers to use
Default: 1
- --raytune-use-gpu
Passed directly to Ray Tune ScalingConfig to control whether to use GPUs
Default: False
- --raytune-num-checkpoints-to-keep
Passed directly to Ray Tune CheckpointConfig to control number of checkpoints to keep
Default: 1
- --raytune-grace-period
Passed directly to Ray Tune ASHAScheduler to control grace period
Default: 10
- --raytune-reduction-factor
Passed directly to Ray Tune ASHAScheduler to control reduction factor
Default: 2
Hyperopt arguments#
- --hyperopt-n-initial-points
Passed directly to HyperOptSearch to control number of initial points to sample
Default: 20
- --hyperopt-random-state-seed
Passed directly to HyperOptSearch to control random state seed