Command Line Arguments

chemprop/args.py contains all command line arguments, which are processed using the Typed Argument Parser (Tap) package.

Common Arguments

class chemprop.args.CommonArgs(*args, **kwargs)[source]

CommonArgs contains arguments that are used in both TrainArgs and PredictArgs.

Initializes the Tap instance.

Parameters
  • args – Arguments passed to the super class ArgumentParser.

  • underscores_to_dashes – If True, convert underscores in flags to dashes.

  • explicit_bool – Booleans can be specified on the command line as “--arg True” or “--arg False” rather than “--arg”. Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.

  • config_files – A list of paths to configuration files containing the command line arguments (e.g., ‘--arg1 a1 --arg2 a2’). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.

  • kwargs – Keyword arguments passed to the super class ArgumentParser.
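The precedence described for config_files can be illustrated with plain argparse, which keeps the last value it sees for a flag. This is only a sketch of the ordering rule, not Tap's implementation: the helper name parse_with_configs is hypothetical, and the config contents are passed as strings rather than read from files.

```python
import argparse
import shlex


def parse_with_configs(parser, config_texts, cli_args):
    """Place config arguments (in list order) before the CLI arguments.

    argparse keeps the last value given for a flag, so later config
    entries override earlier ones and the command line overrides both.
    """
    merged = []
    for text in config_texts:
        merged.extend(shlex.split(text))
    merged.extend(cli_args)
    return parser.parse_args(merged)


parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=50)
parser.add_argument("--num_workers", type=int, default=8)

config_a = "--batch_size 32 --num_workers 4"  # earlier config
config_b = "--batch_size 64"                  # later config wins over config_a
ns = parse_with_configs(parser, [config_a, config_b], ["--num_workers", "0"])
print(ns.batch_size, ns.num_workers)
```

Here batch_size comes from the later config and num_workers from the command line.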

atom_descriptors: Literal['feature', 'descriptor'] = None

Custom extra atom descriptors.

  • feature: used as atom features to featurize a given molecule.

  • descriptor: used as descriptors and concatenated to the machine-learned atomic representation.

atom_descriptors_path: str = None

Path to the extra atom descriptors.

property atom_descriptors_size: int

The size of the atom descriptors.

property atom_features_size: int

The size of the atom features.

batch_size: int = 50

Batch size.

bond_features_path: str = None

Path to the extra bond descriptors that will be used as bond features to featurize a given molecule.

property bond_features_size: int

The size of the bond features.

checkpoint_dir: str = None

Directory from which to load model checkpoints (walks directory and ensembles all models that are found).

checkpoint_path: str = None

Path to model checkpoint (.pt file).

checkpoint_paths: List[str] = None

List of paths to model checkpoints (.pt files).

configure() → None[source]

Overwrite this method to configure the parser during initialization.

For example:

    self.add_argument('--sum', dest='accumulate', action='store_const', const=sum, default=max)
    self.add_subparsers(help='sub-command help')
    self.add_subparser('a', SubparserA, help='a help')

property cuda: bool

Whether to use CUDA (i.e., GPUs) or not.

property device: torch.device

The torch.device on which to load and process data and models.

empty_cache: bool = False

Whether to empty all caches before training or predicting. This is necessary if multiple jobs are run within a single script and the atom or bond features change.

features_generator: List[str] = None

Method(s) of generating additional features.

features_path: List[str] = None

Path(s) to features to use in FFN (instead of features_generator).

property features_scaling: bool

Whether to apply normalization with a StandardScaler to the additional molecule-level features.

gpu: int = None

Which GPU to use.

max_data_size: int = None

Maximum number of data points to load.

no_cache_mol: bool = False

Whether to not cache the RDKit molecule for each SMILES string to reduce memory usage (cached by default).

no_cuda: bool = False

Turn off cuda (i.e., use CPU instead of GPU).

no_features_scaling: bool = False

Turn off scaling of features.

num_workers: int = 8

Number of workers for the parallel data loading (0 means sequential).

number_of_molecules: int = 1

Number of molecules in each input to the model. This must equal the length of smiles_columns (if not None).

phase_features_path: str = None

Path to features used to indicate the phase of the data in one-hot vector form. Used in spectra datatype.

process_args() → None[source]

Perform additional argument processing and/or validation.

smiles_columns: List[str] = None

List of names of the columns containing SMILES strings. By default, uses the first number_of_molecules columns.
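The default described for smiles_columns can be sketched as follows. The helper resolve_smiles_columns and the example header are hypothetical, and raising ValueError on a length mismatch is an assumption about how the stated constraint might be enforced.

```python
from typing import List, Optional, Sequence


def resolve_smiles_columns(
    header: Sequence[str],
    smiles_columns: Optional[Sequence[str]] = None,
    number_of_molecules: int = 1,
) -> List[str]:
    """Pick the SMILES columns, defaulting to the first N header columns."""
    if smiles_columns is None:
        return list(header[:number_of_molecules])
    if len(smiles_columns) != number_of_molecules:
        # Assumed behavior: reject inconsistent settings.
        raise ValueError("number_of_molecules must equal len(smiles_columns)")
    return list(smiles_columns)


header = ["solvent", "solute", "logS"]  # hypothetical CSV header
default_columns = resolve_smiles_columns(header, number_of_molecules=2)
print(default_columns)
```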

Train Arguments

class chemprop.args.TrainArgs(*args, **kwargs)[source]

TrainArgs includes CommonArgs along with additional arguments used for training a Chemprop model.

Initializes the Tap instance.

Parameters
  • args – Arguments passed to the super class ArgumentParser.

  • underscores_to_dashes – If True, convert underscores in flags to dashes.

  • explicit_bool – Booleans can be specified on the command line as “--arg True” or “--arg False” rather than “--arg”. Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.

  • config_files – A list of paths to configuration files containing the command line arguments (e.g., ‘--arg1 a1 --arg2 a2’). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.

  • kwargs – Keyword arguments passed to the super class ArgumentParser.
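The explicit_bool rule (prefixes of True and False in any capitalization, plus 1 and 0) can be sketched as a small parser. The function parse_explicit_bool is hypothetical and is not Tap's own code; rejecting empty or unrecognized strings with ValueError is an assumption.

```python
def parse_explicit_bool(text: str) -> bool:
    """Interpret a command line token as a boolean.

    Accepts any non-empty prefix of "true" or "false" (case-insensitive)
    as well as "1" and "0".
    """
    lowered = text.lower()
    if lowered == "1" or (lowered and "true".startswith(lowered)):
        return True
    if lowered == "0" or (lowered and "false".startswith(lowered)):
        return False
    raise ValueError(f"Cannot interpret {text!r} as a boolean")


for token in ["True", "t", "FA", "0"]:
    print(token, parse_explicit_bool(token))
```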

activation: Literal['ReLU', 'LeakyReLU', 'PReLU', 'tanh', 'SELU', 'ELU'] = 'ReLU'

Activation function.

aggregation: Literal['mean', 'sum', 'norm'] = 'mean'

Aggregation scheme for atomic vectors into molecular vectors.

aggregation_norm: int = 100

For norm aggregation, the number by which to divide the summed atomic features.
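The three aggregation schemes can be sketched on plain Python lists. This is an illustration under the assumption that each scheme starts from the element-wise sum of the atomic vectors; the function aggregate and the example values are hypothetical, and the real model operates on tensors.

```python
from typing import List, Sequence


def aggregate(
    atom_vectors: Sequence[Sequence[float]],
    aggregation: str = "mean",
    aggregation_norm: int = 100,
) -> List[float]:
    """Combine per-atom vectors into one molecule-level vector."""
    summed = [sum(column) for column in zip(*atom_vectors)]
    if aggregation == "sum":
        return summed
    if aggregation == "mean":
        return [value / len(atom_vectors) for value in summed]
    if aggregation == "norm":
        return [value / aggregation_norm for value in summed]
    raise ValueError(f"Unknown aggregation: {aggregation}")


atoms = [[100.0, 200.0], [300.0, 400.0]]  # two atoms, two hidden dimensions
print(aggregate(atoms, "sum"))
print(aggregate(atoms, "mean"))
print(aggregate(atoms, "norm", aggregation_norm=100))
```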

alternative_loss_function: Literal['wasserstein'] = None

Option to replace the default loss function with an alternative. Currently only applied for the spectra dataset type with the wasserstein loss.

property atom_descriptor_scaling: bool

Whether to apply normalization with a StandardScaler to the additional atom features.

atom_messages: bool = False

Centers messages on atoms instead of on bonds.

bias: bool = False

Whether to add bias to linear layers.

property bond_feature_scaling: bool

Whether to apply normalization with a StandardScaler to the additional bond features.

cache_cutoff: float = 10000

Maximum number of molecules in dataset to allow caching. Below this number, caching is used and data loading is sequential. Above this number, caching is not used and data loading is parallel. Use “inf” to always cache.
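The effect of cache_cutoff can be sketched as a simple decision. The helper loading_plan is hypothetical, and treating a dataset whose size exactly equals the cutoff as cached is an assumption, since the description only covers sizes below and above it.

```python
from typing import Tuple


def loading_plan(
    num_molecules: int,
    cache_cutoff: float = 10000,
    num_workers: int = 8,
) -> Tuple[bool, int]:
    """Return (use_cache, workers); zero workers means sequential loading."""
    use_cache = num_molecules <= cache_cutoff  # boundary handling is assumed
    workers = 0 if use_cache else num_workers
    return use_cache, workers


print(loading_plan(500))
print(loading_plan(50000))
print(loading_plan(50000, cache_cutoff=float("inf")))
```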

checkpoint_frzn: str = None

Path to model checkpoint file to be loaded for overwriting and freezing weights.

class_balance: bool = False

Trains with an equal number of positives and negatives in each batch.

config_path: str = None

Path to a .json file containing arguments. Any arguments present in the config file will override arguments specified via the command line or by the defaults.

crossval_index_dir: str = None

Directory in which to find cross validation index files.

crossval_index_file: str = None

Indices of files to use as train/val/test. Overrides --num_folds and --seed.

property crossval_index_sets: List[List[List[int]]]

Index sets used for splitting data into train/validation/test during cross-validation.

data_path: str

Path to data CSV file.

data_weights_path: str = None

Path to weights for each molecule in the training data, affecting the relative weight of molecules in the loss function.

dataset_type: Literal['regression', 'classification', 'multiclass', 'spectra']

Type of dataset. This determines the loss function used during training.

depth: int = 3

Number of message passing steps.

dropout: float = 0.0

Dropout probability.

ensemble_size: int = 1

Number of models in ensemble.

epochs: int = 30

Number of epochs to run.

explicit_h: bool = False

Whether H are explicitly specified in input (and should be kept this way).

extra_metrics: List[Literal['auc', 'prc-auc', 'rmse', 'mae', 'mse', 'r2', 'accuracy', 'cross_entropy', 'binary_cross_entropy', 'sid', 'wasserstein']] = []

Additional metrics to use to evaluate the model. Not used for early stopping.

features_only: bool = False

Use only the additional features in an FFN, no graph network.

property features_size: int

The dimensionality of the additional molecule-level features.

ffn_hidden_size: int = None

Hidden dim for higher-capacity FFN (defaults to hidden_size).

ffn_num_layers: int = 2

Number of layers in FFN after MPN encoding.

final_lr: float = 0.0001

Final learning rate.

folds_file: str = None

Optional file of fold labels.

freeze_first_only: bool = False

Determines whether or not to use checkpoint_frzn for just the first encoder. Default (False) is to use the checkpoint to freeze all encoders. Only relevant for number_of_molecules > 1, where the checkpoint model has number_of_molecules = 1.

frzn_ffn_layers: int = 0

Overwrites weights for the first n layers of the FFN from the checkpoint model (specified by checkpoint_frzn), where n is specified in the input. Automatically also freezes MPNN weights.

grad_clip: float = None

Maximum magnitude of gradient during training.

hidden_size: int = 300

Dimensionality of hidden layers in MPN.

ignore_columns: List[str] = None

Names of the columns to ignore when target_columns is not provided.

init_lr: float = 0.0001

Initial learning rate.

log_frequency: int = 10

The number of batches between each logging of the training loss.

max_lr: float = 0.001

Maximum learning rate.

metric: Literal['auc', 'prc-auc', 'rmse', 'mae', 'mse', 'r2', 'accuracy', 'cross_entropy', 'binary_cross_entropy', 'sid', 'wasserstein'] = None

Metric to use during evaluation. It is also used with the validation set for early stopping. Defaults to “auc” for classification, “rmse” for regression, and “sid” for spectra.

property metrics: List[str]

The list of metrics used for evaluation. Only the first is used for early stopping.

property minimize_score: bool

Whether the model should try to minimize the score metric or maximize it.

mpn_shared: bool = False

Whether to use the same message passing neural network for all input molecules. Only relevant if number_of_molecules > 1.

multiclass_num_classes: int = 3

Number of classes when running multiclass classification.

no_atom_descriptor_scaling: bool = False

Turn off atom feature scaling.

no_bond_features_scaling: bool = False

Turn off bond feature scaling.

num_folds: int = 1

Number of folds when performing cross validation.

property num_lrs: int

The number of learning rates to use (currently hard-coded to 1).

property num_tasks: int

The number of tasks being trained on.

overwrite_default_atom_features: bool = False

Overwrites the default atom descriptors with the new ones instead of concatenating them. Can only be used if atom_descriptors are used as a feature.

overwrite_default_bond_features: bool = False

Overwrites the default bond features with the new ones instead of concatenating them.

process_args() → None[source]

Perform additional argument processing and/or validation.

pytorch_seed: int = 0

Seed for PyTorch randomness (e.g., random initial weights).

quiet: bool = False

Skip non-essential print statements.

reaction: bool = False

Whether to adjust MPNN layer to take reactions as input instead of molecules.

reaction_mode: Literal['reac_prod', 'reac_diff', 'prod_diff', 'reac_prod_balance', 'reac_diff_balance', 'prod_diff_balance'] = 'reac_diff'

Choices for construction of atom and bond features for reactions:

  • reac_prod: concatenates the reactants feature with the products feature.

  • reac_diff: concatenates the reactants feature with the difference in features between reactants and products.

  • prod_diff: concatenates the products feature with the difference in features between reactants and products.

  • reac_prod_balance: concatenates the reactants feature with the products feature, balances imbalanced reactions.

  • reac_diff_balance: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.

  • prod_diff_balance: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.
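The three basic reaction modes can be sketched for a single atom or bond feature vector. The function reaction_features is hypothetical, the sign convention of the difference (products minus reactants) is an assumption, and the _balance variants are left out because the description does not specify how imbalanced reactions are balanced.

```python
from typing import List, Sequence


def reaction_features(
    reac: Sequence[float],
    prod: Sequence[float],
    mode: str = "reac_diff",
) -> List[float]:
    """Build a reaction feature vector from reactant and product features."""
    # Assumed sign convention: difference = products - reactants.
    diff = [p - r for r, p in zip(reac, prod)]
    if mode == "reac_prod":
        return list(reac) + list(prod)
    if mode == "reac_diff":
        return list(reac) + diff
    if mode == "prod_diff":
        return list(prod) + diff
    raise ValueError(f"Mode not covered by this sketch: {mode}")


reac = [1.0, 0.0, 2.0]  # hypothetical features of an atom in the reactants
prod = [1.0, 1.0, 0.0]  # features of the same mapped atom in the products
for mode in ("reac_prod", "reac_diff", "prod_diff"):
    print(mode, reaction_features(reac, prod, mode))
```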

resume_experiment: bool = False

Whether to resume the experiment. Loads test results from any folds that have already been completed and skips training those folds.

save_dir: str = None

Directory where model checkpoints will be saved.

save_preds: bool = False

Whether to save test split predictions during training.

save_smiles_splits: bool = False

Save the SMILES for each of the train/val/test splits, for convenience when predicting later.

seed: int = 0

Random seed to use when splitting data into train/val/test sets. When num_folds > 1, the first fold uses this seed and all subsequent folds add 1 to the seed.
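A minimal sketch of the per-fold seeds, reading "add 1 to the seed" as each fold using the previous fold's seed plus one. The helper fold_seeds is hypothetical.

```python
from typing import List


def fold_seeds(seed: int = 0, num_folds: int = 1) -> List[int]:
    """Seed used for the data split in each cross-validation fold."""
    return [seed + fold for fold in range(num_folds)]


seeds = fold_seeds(seed=42, num_folds=3)
print(seeds)
```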

separate_test_atom_descriptors_path: str = None

Path to file with extra atom descriptors for separate test set.

separate_test_bond_features_path: str = None

Path to file with extra bond features for separate test set.

separate_test_features_path: List[str] = None

Path to file with features for separate test set.

separate_test_path: str = None

Path to separate test set, optional.

separate_test_phase_features_path: str = None

Path to file with phase features for separate test set.

separate_val_atom_descriptors_path: str = None

Path to file with extra atom descriptors for separate val set.

separate_val_bond_features_path: str = None

Path to file with extra bond features for separate val set.

separate_val_features_path: List[str] = None

Path to file with features for separate val set.

separate_val_path: str = None

Path to separate val set, optional.

separate_val_phase_features_path: str = None

Path to file with phase features for separate val set.

show_individual_scores: bool = False

Show all scores for individual targets, not just average, at the end.

spectra_activation: Literal['exp', 'softplus'] = 'exp'

Indicates which function to use in dataset_type spectra training to constrain outputs to be positive.

spectra_phase_mask_path: str = None

Path to a file containing a phase mask array, used for excluding particular regions in spectra predictions.

spectra_target_floor: float = 1e-08

Target values for dataset type spectra that fall below this floor are replaced with it. It is intended to be a small positive number used to enforce positive values.

split_sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1)

Split proportions for train/validation/test sets.
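How split_sizes could translate into index sets for a random split is sketched below. The function random_split is hypothetical; truncating the boundaries with int() and shuffling with Python's random module are assumptions, not necessarily how Chemprop splits data.

```python
import random
from typing import List, Tuple


def random_split(
    num_points: int,
    split_sizes: Tuple[float, float, float] = (0.8, 0.1, 0.1),
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices with a fixed seed and cut them by the proportions."""
    indices = list(range(num_points))
    random.Random(seed).shuffle(indices)
    train_end = int(split_sizes[0] * num_points)
    val_end = int((split_sizes[0] + split_sizes[1]) * num_points)
    return indices[:train_end], indices[train_end:val_end], indices[val_end:]


train_idx, val_idx, test_idx = random_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```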

split_type: Literal['random', 'scaffold_balanced', 'predetermined', 'crossval', 'cv', 'cv-no-test', 'index_predetermined', 'random_with_repeated_smiles'] = 'random'

Method of splitting the data into train/val/test.

target_columns: List[str] = None

Names of the columns containing target values. By default, uses all columns except the SMILES column and the ignore_columns.
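The default for target_columns can be sketched as a filter over the CSV header. The helper resolve_target_columns and the example header are hypothetical.

```python
from typing import List, Optional, Sequence


def resolve_target_columns(
    header: Sequence[str],
    smiles_columns: Sequence[str],
    target_columns: Optional[Sequence[str]] = None,
    ignore_columns: Optional[Sequence[str]] = None,
) -> List[str]:
    """Use explicit targets if given, else every non-SMILES, non-ignored column."""
    if target_columns is not None:
        return list(target_columns)
    excluded = set(smiles_columns) | set(ignore_columns or [])
    return [column for column in header if column not in excluded]


header = ["smiles", "name", "logP", "logS"]  # hypothetical CSV header
targets = resolve_target_columns(header, ["smiles"], ignore_columns=["name"])
print(targets)
```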

target_weights: List[float] = None

Weights associated with each target, affecting the relative weight of targets in the loss function. Must match the number of target columns.

property task_names: List[str]

A list of names of the tasks being trained on.

test: bool = False

Whether to skip training and only test the model.

test_fold_index: int = None

Which fold to use as test for leave-one-out cross val.

property train_data_size: int

The size of the training data set.

undirected: bool = False

Undirected edges (always sum the two relevant bond vectors).

property use_input_features: bool

Whether the model is using additional molecule-level features.

val_fold_index: int = None

Which fold to use as val for leave-one-out cross val.

warmup_epochs: float = 2.0

Number of epochs during which learning rate increases linearly from init_lr to max_lr. Afterwards, learning rate decreases exponentially from max_lr to final_lr.
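The schedule described above (linear warmup from init_lr to max_lr, then exponential decay to final_lr) can be sketched as a function of the optimizer step. The function learning_rate and its steps_per_epoch parameter are hypothetical, and the sketch assumes epochs is larger than warmup_epochs; it is not Chemprop's scheduler class.

```python
import math


def learning_rate(
    step: int,
    steps_per_epoch: int,
    epochs: int = 30,
    warmup_epochs: float = 2.0,
    init_lr: float = 1e-4,
    max_lr: float = 1e-3,
    final_lr: float = 1e-4,
) -> float:
    """Learning rate at a given optimizer step."""
    warmup_steps = int(warmup_epochs * steps_per_epoch)
    total_steps = epochs * steps_per_epoch
    if step < warmup_steps:
        # Linear increase from init_lr towards max_lr.
        return init_lr + (max_lr - init_lr) * step / warmup_steps
    # Exponential decay chosen so that the last step reaches final_lr.
    gamma = (final_lr / max_lr) ** (1 / (total_steps - warmup_steps))
    return max_lr * gamma ** (step - warmup_steps)


for step in (0, 100, 200, 1600, 3000):
    print(step, learning_rate(step, steps_per_epoch=100))
```

With 100 steps per epoch, the rate starts at init_lr, peaks at max_lr after 200 steps, and returns to approximately final_lr at step 3000.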

Predict Arguments

class chemprop.args.PredictArgs(*args, **kwargs)[source]

PredictArgs includes CommonArgs along with additional arguments used for predicting with a Chemprop model.

Initializes the Tap instance.

Parameters
  • args – Arguments passed to the super class ArgumentParser.

  • underscores_to_dashes – If True, convert underscores in flags to dashes.

  • explicit_bool – Booleans can be specified on the command line as “--arg True” or “--arg False” rather than “--arg”. Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.

  • config_files – A list of paths to configuration files containing the command line arguments (e.g., ‘--arg1 a1 --arg2 a2’). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.

  • kwargs – Keyword arguments passed to the super class ArgumentParser.

drop_extra_columns: bool = False

Whether to drop all columns from the test data file besides the SMILES columns and the new prediction columns.

property ensemble_size: int

The number of models in the ensemble.

ensemble_variance: bool = False

Whether to calculate the variance of ensembles as a measure of epistemic uncertainty. If True, the variance is saved as an additional column for each target in the preds_path.

individual_ensemble_predictions: bool = False

Whether to return the predictions made by each of the individual models rather than the average of the ensemble.
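The ensemble average and the variance reported by ensemble_variance can be sketched with the standard library. The function ensemble_summary is hypothetical, and using the population variance (rather than the sample variance) is an assumption.

```python
from statistics import mean, pvariance
from typing import List, Sequence, Tuple


def ensemble_summary(
    predictions: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Per-molecule mean and variance across models.

    predictions[m][i] is the prediction of model m for molecule i.
    """
    per_molecule = list(zip(*predictions))
    means = [mean(values) for values in per_molecule]
    variances = [pvariance(values) for values in per_molecule]  # assumed estimator
    return means, variances


model_predictions = [[1.0, 2.0], [3.0, 2.0]]  # two models, two molecules
means, variances = ensemble_summary(model_predictions)
print(means, variances)
```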

preds_path: str

Path to CSV file where predictions will be saved.

process_args() → None[source]

Perform additional argument processing and/or validation.

test_path: str

Path to CSV file containing testing data for which predictions will be made.

Interpret Arguments

class chemprop.args.InterpretArgs(*args, **kwargs)[source]

InterpretArgs includes CommonArgs along with additional arguments used for interpreting a trained Chemprop model.

Initializes the Tap instance.

Parameters
  • args – Arguments passed to the super class ArgumentParser.

  • underscores_to_dashes – If True, convert underscores in flags to dashes.

  • explicit_bool – Booleans can be specified on the command line as “--arg True” or “--arg False” rather than “--arg”. Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.

  • config_files – A list of paths to configuration files containing the command line arguments (e.g., ‘--arg1 a1 --arg2 a2’). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.

  • kwargs – Keyword arguments passed to the super class ArgumentParser.

batch_size: int = 500

Batch size.

c_puct: float = 10.0

Constant factor in MCTS.

data_path: str

Path to data CSV file.

max_atoms: int = 20

Maximum number of atoms in rationale.

min_atoms: int = 8

Minimum number of atoms in rationale.

process_args() → None[source]

Perform additional argument processing and/or validation.

prop_delta: float = 0.5

Minimum score to count as positive.

property_id: int = 1

Index of the property of interest in the trained model.

rollout: int = 20

Number of rollout steps.

Hyperparameter Optimization Arguments

class chemprop.args.HyperoptArgs(*args, **kwargs)[source]

HyperoptArgs includes TrainArgs along with additional arguments used for optimizing Chemprop hyperparameters.

Initializes the Tap instance.

Parameters
  • args – Arguments passed to the super class ArgumentParser.

  • underscores_to_dashes – If True, convert underscores in flags to dashes.

  • explicit_bool – Booleans can be specified on the command line as “--arg True” or “--arg False” rather than “--arg”. Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.

  • config_files – A list of paths to configuration files containing the command line arguments (e.g., ‘--arg1 a1 --arg2 a2’). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.

  • kwargs – Keyword arguments passed to the super class ArgumentParser.

config_save_path: str

Path to .json file where best hyperparameter settings will be written.

hyperopt_checkpoint_dir: str = None

Path to a directory where hyperopt completed trial data is stored. Hyperopt job will include these trials if restarted. Can also be used to run multiple instances in parallel if they share the same checkpoint directory.

log_dir: str = None

(Optional) Path to a directory where all results of the hyperparameter optimization will be written.

manual_trial_dirs: List[str] = None

Paths to save directories for manually trained models in the same search space as the hyperparameter search. Results will be considered as part of the trial history of the hyperparameter search.

num_iters: int = 20

Number of hyperparameter choices to try.

process_args() → None[source]

Perform additional argument processing and/or validation.

startup_random_iters: int = 10

The number of initial trials that will be randomly specified before the TPE algorithm is used to select the rest.

Scikit-Learn Train Arguments

class chemprop.args.SklearnTrainArgs(*args, **kwargs)[source]

SklearnTrainArgs includes TrainArgs along with additional arguments for training a scikit-learn model.

Initializes the Tap instance.

Parameters
  • args – Arguments passed to the super class ArgumentParser.

  • underscores_to_dashes – If True, convert underscores in flags to dashes.

  • explicit_bool – Booleans can be specified on the command line as “--arg True” or “--arg False” rather than “--arg”. Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.

  • config_files – A list of paths to configuration files containing the command line arguments (e.g., ‘--arg1 a1 --arg2 a2’). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.

  • kwargs – Keyword arguments passed to the super class ArgumentParser.

class_weight: Literal['balanced'] = None

How to weight classes (None means no class balance).

impute_mode: Literal['single_task', 'median', 'mean', 'linear', 'frequent'] = None

How to impute missing data (None means no imputation).

model_type: Literal['random_forest', 'svm']

scikit-learn model to use.

num_bits: int = 2048

Number of bits in Morgan fingerprint.

num_trees: int = 500

Number of random forest trees.

radius: int = 2

Morgan fingerprint radius.

single_task: bool = False

Whether to run each task separately (needed when dataset has null entries).

Scikit-Learn Predict Arguments

class chemprop.args.SklearnPredictArgs(*args, underscores_to_dashes: bool = False, explicit_bool: bool = False, config_files: Optional[List[str]] = None, **kwargs)[source]

SklearnPredictArgs contains arguments used for predicting with a trained scikit-learn model.

Initializes the Tap instance.

Parameters
  • args – Arguments passed to the super class ArgumentParser.

  • underscores_to_dashes – If True, convert underscores in flags to dashes.

  • explicit_bool – Booleans can be specified on the command line as “--arg True” or “--arg False” rather than “--arg”. Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.

  • config_files – A list of paths to configuration files containing the command line arguments (e.g., ‘--arg1 a1 --arg2 a2’). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.

  • kwargs – Keyword arguments passed to the super class ArgumentParser.

checkpoint_dir: str = None

Path to directory containing model checkpoints (.pkl files).

checkpoint_path: str = None

Path to model checkpoint (.pkl file).

checkpoint_paths: List[str] = None

List of paths to model checkpoints (.pkl files).

number_of_molecules: int = 1

Number of molecules in each input to the model. This must equal the length of smiles_columns (if not None).

preds_path: str

Path to CSV file where predictions will be saved.

process_args() → None[source]

Perform additional argument processing and/or validation.

smiles_columns: List[str] = None

List of names of the columns containing SMILES strings. By default, uses the first number_of_molecules columns.

test_path: str

Path to CSV file containing testing data for which predictions will be made.

Utility Functions

chemprop.args.get_checkpoint_paths(checkpoint_path: Optional[str] = None, checkpoint_paths: Optional[List[str]] = None, checkpoint_dir: Optional[str] = None, ext: str = '.pt') → Optional[List[str]][source]

Gets a list of checkpoint paths either from a single checkpoint path or from a directory of checkpoints.

If checkpoint_path is provided, only collects that one checkpoint. If checkpoint_paths is provided, collects all of the provided checkpoints. If checkpoint_dir is provided, walks the directory and collects all checkpoints. A checkpoint is any file ending in the extension ext.

Parameters
  • checkpoint_path – Path to a checkpoint.

  • checkpoint_paths – List of paths to checkpoints.

  • checkpoint_dir – Path to a directory containing checkpoints.

  • ext – The extension which defines a checkpoint file.

Returns

A list of paths to checkpoints or None if no checkpoint path(s)/dir are provided.
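The collection logic described for get_checkpoint_paths can be sketched with os.walk. This is a re-implementation for illustration under a different, hypothetical name (collect_checkpoint_paths), not the library's source; what happens when several options are given at once is an assumption (the first one in the documented order wins), as is sorting the result.

```python
import os
import tempfile
from typing import List, Optional


def collect_checkpoint_paths(
    checkpoint_path: Optional[str] = None,
    checkpoint_paths: Optional[List[str]] = None,
    checkpoint_dir: Optional[str] = None,
    ext: str = ".pt",
) -> Optional[List[str]]:
    """Gather checkpoint paths from a single path, a list, or a directory."""
    if checkpoint_path is not None:
        return [checkpoint_path]
    if checkpoint_paths is not None:
        return list(checkpoint_paths)
    if checkpoint_dir is not None:
        found = []
        for root, _, files in os.walk(checkpoint_dir):
            # A checkpoint is any file ending in the extension ext.
            found.extend(os.path.join(root, f) for f in files if f.endswith(ext))
        return sorted(found)
    return None


# Usage on a throwaway directory with one checkpoint and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "fold_0"))
    for name in (os.path.join("fold_0", "model.pt"), "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = collect_checkpoint_paths(checkpoint_dir=tmp)
    relative = [os.path.relpath(path, tmp) for path in found]
print(relative)
```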