Command Line Arguments

chemprop/args.py contains all command line arguments, which are processed using the Typed Argument Parser (Tap) package.

Common Arguments

class chemprop.args.CommonArgs(*args, **kwargs)

CommonArgs contains arguments that are used in both TrainArgs and PredictArgs.

Initializes the Tap instance.

Parameters:
  • args – Arguments passed to the super class ArgumentParser.

  • underscores_to_dashes – If True, convert underscores in flags to dashes.

  • explicit_bool – Booleans can be specified on the command line as "--arg True" or "--arg False" rather than "--arg". Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.

  • config_files – A list of paths to configuration files containing the command line arguments (e.g., '--arg1 a1 --arg2 a2'). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.

  • kwargs – Keyword arguments passed to the super class ArgumentParser.

atom_descriptors: typing_extensions.Literal[feature, descriptor] = None

Custom extra atom descriptors. feature: used as atom features to featurize a given molecule. descriptor: used as descriptor and concatenated to the machine learned atomic representation.

atom_descriptors_path: str = None

Path to the extra atom descriptors.

property atom_descriptors_size: int

The size of the atom descriptors.

property atom_features_size: int

The size of the atom features.

batch_size: int = 50

Batch size.

bond_descriptors: typing_extensions.Literal[feature, descriptor] = None

Custom extra bond descriptors. feature: used as bond features to featurize a given molecule. descriptor: used as descriptor and concatenated to the machine learned bond representation.

bond_descriptors_path: str = None

Path to the extra bond descriptors that will be used as bond features to featurize a given molecule.

property bond_descriptors_size: int

The size of the bond descriptors.

property bond_features_size: int

The size of the bond features.

checkpoint_dir: str = None

Directory from which to load model checkpoints (walks directory and ensembles all models that are found).

checkpoint_path: str = None

Path to model checkpoint (.pt file).

checkpoint_paths: List[str] = None

List of paths to model checkpoints (.pt files).

configure() → None

Overwrite this method to configure the parser during initialization.

For example:

    self.add_argument('--sum', dest='accumulate', action='store_const', const=sum, default=max)
    self.add_subparsers(help='sub-command help')
    self.add_subparser('a', SubparserA, help='a help')

constraints_path: str = None

Path to constraints applied to atomic/bond properties prediction.

property cuda: bool

Whether to use CUDA (i.e., GPUs) or not.

property device: device

The torch.device on which to load and process data and models.

empty_cache: bool = False

Whether to empty all caches before training or predicting. This is necessary if multiple jobs are run within a single script and the atom or bond features change.

features_generator: List[str] = None

Method(s) of generating additional features.

features_path: List[str] = None

Path(s) to features to use in FFN (instead of features_generator).

property features_scaling: bool

Whether to apply normalization with a StandardScaler to the additional molecule-level features.

gpu: int = None

Which GPU to use.

max_data_size: int = None

Maximum number of data points to load.

no_cache_mol: bool = False

Whether to not cache the RDKit molecule for each SMILES string to reduce memory usage (cached by default).

no_cuda: bool = False

Turn off cuda (i.e., use CPU instead of GPU).

no_features_scaling: bool = False

Turn off scaling of features.
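Several options in this module come in pairs: a no_* command line flag and a read-only property with the positive name. The sketch below illustrates that pattern for feature scaling. The class ScalingFlags is a hypothetical stand-in, and the exact relationship (the property being the negation of the flag) is an assumption based on the naming.

```python
from dataclasses import dataclass


@dataclass
class ScalingFlags:
    """Hypothetical stand-in for the scaling-related arguments."""

    no_features_scaling: bool = False  # set by passing the flag on the command line

    @property
    def features_scaling(self) -> bool:
        # Assumed relationship: scaling stays on unless it is explicitly turned off.
        return not self.no_features_scaling
```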

num_workers: int = 8

Number of workers for the parallel data loading (0 means sequential).

number_of_molecules: int = 1

Number of molecules in each input to the model. This must equal the length of smiles_columns (if not None).

phase_features_path: str = None

Path to features used to indicate the phase of the data in one-hot vector form. Used in spectra datatype.

process_args() → None

Perform additional argument processing and/or validation.

smiles_columns: List[str] = None

List of names of the columns containing SMILES strings. By default, uses the first number_of_molecules columns.
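The interplay between smiles_columns and number_of_molecules can be sketched as follows. The function resolve_smiles_columns is a hypothetical helper written for illustration, not part of chemprop, and the error message is invented.

```python
from typing import List, Optional


def resolve_smiles_columns(
    header: List[str],
    smiles_columns: Optional[List[str]] = None,
    number_of_molecules: int = 1,
) -> List[str]:
    """Pick the SMILES columns from a CSV header."""
    if smiles_columns is None:
        # Default: the first number_of_molecules columns of the file.
        return header[:number_of_molecules]
    if len(smiles_columns) != number_of_molecules:
        raise ValueError(
            "the length of smiles_columns must equal number_of_molecules"
        )
    return list(smiles_columns)
```

For a header such as ["solute", "solvent", "logS"] and number_of_molecules=2, the default selection would be the first two columns.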

Train Arguments

class chemprop.args.TrainArgs(*args, **kwargs)

TrainArgs includes CommonArgs along with additional arguments used for training a Chemprop model.

Initializes the Tap instance.

Parameters:
  • args – Arguments passed to the super class ArgumentParser.

  • underscores_to_dashes – If True, convert underscores in flags to dashes.

  • explicit_bool – Booleans can be specified on the command line as "--arg True" or "--arg False" rather than "--arg". Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.

  • config_files – A list of paths to configuration files containing the command line arguments (e.g., '--arg1 a1 --arg2 a2'). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.

  • kwargs – Keyword arguments passed to the super class ArgumentParser.

activation: typing_extensions.Literal[ReLU, LeakyReLU, PReLU, tanh, SELU, ELU] = 'ReLU'

Activation function.

property adding_bond_types: bool

Whether the bond types determined by RDKit molecules should be added to the output of bond targets.

adding_h: bool = False

Whether RDKit molecules will be constructed with hydrogens added. This option is intended to be used with Chemprop’s default molecule or multi-molecule encoders, or in reaction_solvent mode where it applies to the solvent only.

aggregation: typing_extensions.Literal[mean, sum, norm] = 'mean'

Aggregation scheme for atomic vectors into molecular vectors.

aggregation_norm: int = 100

For norm aggregation, the number by which to divide the summed atomic features.
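The three aggregation schemes can be illustrated on plain Python lists. This is a minimal sketch of the arithmetic only; the function aggregate is hypothetical, and the real model operates on tensors rather than lists.

```python
from typing import List


def aggregate(
    atom_vectors: List[List[float]],
    aggregation: str = "mean",
    aggregation_norm: int = 100,
) -> List[float]:
    """Combine per-atom vectors into a single molecular vector."""
    summed = [sum(column) for column in zip(*atom_vectors)]
    if aggregation == "sum":
        return summed
    if aggregation == "mean":
        # Divide by the number of atoms in the molecule.
        return [value / len(atom_vectors) for value in summed]
    if aggregation == "norm":
        # Divide by a fixed constant instead of the atom count.
        return [value / aggregation_norm for value in summed]
    raise ValueError(f"unknown aggregation: {aggregation}")
```

With norm aggregation the result still grows with molecule size, as with sum, but is rescaled by a constant.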

property atom_constraints: List[bool]

A list of booleans indicating whether constraints are applied to the output of atomic properties.

property atom_descriptor_scaling: bool

Whether to apply normalization with a StandardScaler to the additional atom features.

atom_messages: bool = False

Centers messages on atoms instead of on bonds.

bias: bool = False

Whether to add bias to linear layers.

bias_solvent: bool = False

Whether to add bias to linear layers for solvent MPN if reaction_solvent is True.

property bond_constraints: List[bool]

A list of booleans indicating whether constraints are applied to the output of bond properties.

property bond_descriptor_scaling: bool

Whether to apply normalization with a StandardScaler to the additional bond features.

cache_cutoff: float = 10000

Maximum number of molecules in dataset to allow caching. Below this number, caching is used and data loading is sequential. Above this number, caching is not used and data loading is parallel. Use “inf” to always cache.
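The caching rule can be written as a small decision function. This is a sketch under assumptions: loading_plan is a hypothetical name, and treating a dataset whose size exactly equals the cutoff as cached is a guess, since the text only covers "below" and "above".

```python
def loading_plan(num_molecules: int, cache_cutoff: float = 10000, num_workers: int = 8) -> dict:
    """Decide between cached sequential loading and uncached parallel loading."""
    use_cache = num_molecules <= cache_cutoff  # boundary handling is an assumption
    return {
        "cache": use_cache,
        # Sequential loading (0 workers) when caching, parallel loading otherwise.
        "num_workers": 0 if use_cache else num_workers,
    }
```

Passing float("inf") as cache_cutoff makes the comparison true for any dataset size, which matches the "always cache" behaviour described above.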

checkpoint_frzn: str = None

Path to model checkpoint file to be loaded for overwriting and freezing weights.

class_balance: bool = False

Trains with an equal number of positives and negatives in each batch.

config_path: str = None

Path to a .json file containing arguments. Any arguments present in the config file will override arguments specified via the command line or by the defaults.

crossval_index_dir: str = None

Directory in which to find cross validation index files.

crossval_index_file: str = None

Indices of files to use as train/val/test. Overrides --num_folds and --seed.

property crossval_index_sets: List[List[List[int]]]

Index sets used for splitting data into train/validation/test during cross-validation

data_path: str

Path to data CSV file.

data_weights_path: str = None

Path to weights for each molecule in the training data, affecting the relative weight of molecules in the loss function

dataset_type: typing_extensions.Literal[regression, classification, multiclass, spectra]

Type of dataset. This determines the default loss function used during training.

depth: int = 3

Number of message passing steps.

depth_solvent: int = 3

Number of message passing steps for solvent if reaction_solvent is True.

dropout: float = 0.0

Dropout probability.

ensemble_size: int = 1

Number of models in ensemble.

epochs: int = 30

Number of epochs to run.

evidential_regularization: float = 0

Value used in regularization for evidential loss function. The default value recommended by Soleimany et al. (2021) is 0.2. The optimal value is dataset-dependent; it is recommended that users test different values to find the best value for their model.

explicit_h: bool = False

Whether H are explicitly specified in input (and should be kept this way). This option is intended to be used with the reaction or reaction_solvent options, and applies only to the reaction part.

extra_metrics: List[typing_extensions.Literal[auc, prc-auc, rmse, mae, mse, r2, accuracy, cross_entropy, binary_cross_entropy, sid, wasserstein, f1, mcc, bounded_rmse, bounded_mae, bounded_mse, recall, precision, balanced_accuracy]] = []

Additional metrics to use to evaluate the model. Not used for early stopping.

features_only: bool = False

Use only the additional features in an FFN, no graph network.

property features_size: int

The dimensionality of the additional molecule-level features.

ffn_hidden_size: int = None

Hidden dim for higher-capacity FFN (defaults to hidden_size).

ffn_num_layers: int = 2

Number of layers in FFN after MPN encoding.

final_lr: float = 0.0001

Final learning rate.

folds_file: str = None

Optional file of fold labels.

freeze_first_only: bool = False

Determines whether or not to use checkpoint_frzn for just the first encoder. Default (False) is to use the checkpoint to freeze all encoders. (only relevant for number_of_molecules > 1, where checkpoint model has number_of_molecules = 1)

frzn_ffn_layers: int = 0

Overwrites weights for the first n layers of the FFN from the checkpoint model (specified by checkpoint_frzn), where n is specified in the input. Automatically also freezes MPNN weights.

grad_clip: float = None

Maximum magnitude of gradient during training.

hidden_size: int = 300

Dimensionality of hidden layers in MPN.

hidden_size_solvent: int = 300

Dimensionality of hidden layers in solvent MPN if reaction_solvent is True.

ignore_columns: List[str] = None

Name of the columns to ignore when target_columns is not provided.

ignore_nan_metrics: bool = False

Ignore invalid task metrics (NaNs) when computing average metrics across tasks.

init_lr: float = 0.0001

Initial learning rate.

is_atom_bond_targets: bool = False

Whether this is atomic/bond properties prediction.

keeping_atom_map: bool = False

Whether RDKit molecules keep the original atom mapping. This option is intended to be used when providing atom-mapped SMILES with is_atom_bond_targets.

log_frequency: int = 10

The number of batches between each logging of the training loss.

loss_function: typing_extensions.Literal[mse, bounded_mse, binary_cross_entropy, cross_entropy, mcc, sid, wasserstein, mve, evidential, dirichlet, quantile_interval] = None

Choice of loss function. Loss functions are limited to compatible dataset types.

max_lr: float = 0.001

Maximum learning rate.

metric: typing_extensions.Literal[auc, prc-auc, rmse, mae, mse, r2, accuracy, cross_entropy, binary_cross_entropy, sid, wasserstein, f1, mcc, bounded_rmse, bounded_mae, bounded_mse, recall, precision, balanced_accuracy] = None

Metric to use during evaluation. It is also used with the validation set for early stopping. Defaults to “auc” for classification, “rmse” for regression, and “sid” for spectra.
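The default metric selection can be sketched as a lookup. The helper resolve_metrics and the table DEFAULT_METRICS are hypothetical; only the three defaults named above are included, so other dataset types without an explicit metric raise a KeyError in this sketch.

```python
from typing import List, Optional, Sequence

# Defaults as stated in the description of metric; other dataset types are omitted.
DEFAULT_METRICS = {"classification": "auc", "regression": "rmse", "spectra": "sid"}


def resolve_metrics(
    dataset_type: str,
    metric: Optional[str] = None,
    extra_metrics: Sequence[str] = (),
) -> List[str]:
    """Return the evaluation metrics, with the early-stopping metric first."""
    if metric is None:
        metric = DEFAULT_METRICS[dataset_type]
    return [metric, *extra_metrics]
```

The first entry of the returned list plays the role of the early-stopping metric; the remaining entries are only reported.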

property metrics: List[str]

The list of metrics used for evaluation. Only the first is used for early stopping.

property minimize_score: bool

Whether the model should try to minimize the score metric or maximize it.

mpn_shared: bool = False

Whether to use the same message passing neural network for all input molecules. Only relevant if number_of_molecules > 1.

multiclass_num_classes: int = 3

Number of classes when running multiclass classification.

no_adding_bond_types: bool = False

Turn off adding the bond types determined by RDKit molecules to the output of bond targets. This option is intended to be used with is_atom_bond_targets.

no_atom_descriptor_scaling: bool = False

Turn off atom feature scaling.

no_bond_descriptor_scaling: bool = False

Turn off bond feature scaling.

no_shared_atom_bond_ffn: bool = False

Whether the FFN weights for atom and bond targets should be independent between tasks.

num_folds: int = 1

Number of folds when performing cross validation.

property num_lrs: int

The number of learning rates to use (currently hard-coded to 1).

property num_tasks: int

The number of tasks being trained on.

overwrite_default_atom_features: bool = False

Overwrites the default atom descriptors with the new ones instead of concatenating them. Can only be used if atom_descriptors are used as a feature.

overwrite_default_bond_features: bool = False

Overwrites the default bond descriptors with the new ones instead of concatenating them. Can only be used if bond_descriptors are used as a feature.

process_args() → None

Perform additional argument processing and/or validation.

pytorch_seed: int = 0

Seed for PyTorch randomness (e.g., random initial weights).

quantile_loss_alpha: float = 0.1

Target error bounds for quantile interval loss

property quantiles: List[float]

A list of quantiles to be trained on.

quiet: bool = False

Skip non-essential print statements.

reaction: bool = False

Whether to adjust MPNN layer to take reactions as input instead of molecules.

reaction_mode: typing_extensions.Literal[reac_prod, reac_diff, prod_diff, reac_prod_balance, reac_diff_balance, prod_diff_balance] = 'reac_diff'

Choices for construction of atom and bond features for reactions:

  • reac_prod – concatenates the reactants feature with the products feature.

  • reac_diff – concatenates the reactants feature with the difference in features between reactants and products.

  • prod_diff – concatenates the products feature with the difference in features between reactants and products.

  • reac_prod_balance – concatenates the reactants feature with the products feature, balances imbalanced reactions.

  • reac_diff_balance – concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.

  • prod_diff_balance – concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.
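The reaction modes differ in which two feature vectors are concatenated. The sketch below shows this on plain lists for a single atom or bond. The function combine_features is hypothetical, the sign convention for the difference (product minus reactant) is an assumption, and the balancing step of the *_balance modes is not modelled.

```python
from typing import List


def combine_features(
    reactant: List[float],
    product: List[float],
    reaction_mode: str = "reac_diff",
) -> List[float]:
    """Concatenate reactant/product/difference features for one atom or bond."""
    # Assumed sign convention: difference = product - reactant.
    diff = [p - r for r, p in zip(reactant, product)]
    # The *_balance variants build features the same way in this sketch.
    base_mode = reaction_mode.replace("_balance", "")
    if base_mode == "reac_prod":
        return reactant + product
    if base_mode == "reac_diff":
        return reactant + diff
    if base_mode == "prod_diff":
        return product + diff
    raise ValueError(f"unknown reaction_mode: {reaction_mode}")
```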

reaction_solvent: bool = False

Whether to adjust the MPNN layer to take as input a reaction and a molecule, and to encode them with separate MPNNs.

resume_experiment: bool = False

Whether to resume the experiment. Loads test results from any folds that have already been completed and skips training those folds.

save_dir: str = None

Directory where model checkpoints will be saved.

save_preds: bool = False

Whether to save test split predictions during training.

save_smiles_splits: bool = False

Save smiles for each train/val/test splits for prediction convenience later.

seed: int = 0

Random seed to use when splitting data into train/val/test sets. When num_folds > 1, the first fold uses this seed and all subsequent folds add 1 to the seed.
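The per-fold seeds that follow from this rule can be listed explicitly. The helper fold_seeds is a hypothetical name used only for illustration.

```python
from typing import List


def fold_seeds(seed: int = 0, num_folds: int = 1) -> List[int]:
    """Seed used for the data split of each fold: seed, seed + 1, seed + 2, ..."""
    return [seed + fold for fold in range(num_folds)]
```

For example, seed 7 with three folds gives the split seeds 7, 8, and 9.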

separate_test_atom_descriptors_path: str = None

Path to file with extra atom descriptors for separate test set.

separate_test_bond_descriptors_path: str = None

Path to file with extra bond descriptors for separate test set.

separate_test_constraints_path: str = None

Path to file with constraints for separate test set.

separate_test_features_path: List[str] = None

Path to file with features for separate test set.

separate_test_path: str = None

Path to separate test set, optional.

separate_test_phase_features_path: str = None

Path to file with phase features for separate test set.

separate_val_atom_descriptors_path: str = None

Path to file with extra atom descriptors for separate val set.

separate_val_bond_descriptors_path: str = None

Path to file with extra bond descriptors for separate val set.

separate_val_constraints_path: str = None

Path to file with constraints for separate val set.

separate_val_features_path: List[str] = None

Path to file with features for separate val set.

separate_val_path: str = None

Path to separate val set, optional.

separate_val_phase_features_path: str = None

Path to file with phase features for separate val set.

property shared_atom_bond_ffn: bool

Whether the FFN weights for atom and bond targets should be shared between tasks.

show_individual_scores: bool = False

Show all scores for individual targets, not just average, at the end.

spectra_activation: typing_extensions.Literal[exp, softplus] = 'exp'

Indicates which function to use in dataset_type spectra training to constrain outputs to be positive.

spectra_phase_mask_path: str = None

Path to a file containing a phase mask array, used for excluding particular regions in spectra predictions.

spectra_target_floor: float = 1e-08

Values in targets for dataset type spectra are replaced with this value, intended to be a small positive number used to enforce positive values.

split_key_molecule: int = 0

The index of the key molecule used for splitting when multiple molecules are present and constrained split_type is used, like scaffold_balanced or random_with_repeated_smiles. Note that this index begins with zero for the first molecule.

split_sizes: List[float] = None

Split proportions for train/validation/test sets.
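As a rough illustration of how three proportions translate into set sizes, the sketch below truncates cumulative fractions to integer boundaries. The function split_counts is hypothetical, the rounding rule is an assumption, and the proportions used in the usage note are example values rather than documented defaults.

```python
from typing import Sequence, Tuple


def split_counts(num_points: int, split_sizes: Sequence[float]) -> Tuple[int, int, int]:
    """Turn train/validation/test proportions into numbers of data points."""
    train_end = int(split_sizes[0] * num_points)
    val_end = int((split_sizes[0] + split_sizes[1]) * num_points)
    # Whatever remains after the train and validation boundaries goes to test.
    return train_end, val_end - train_end, num_points - val_end
```

With 100 data points and proportions 0.8, 0.1, 0.1 this yields 80 training, 10 validation, and 10 test points.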

split_type: typing_extensions.Literal[random, scaffold_balanced, predetermined, crossval, cv, cv-no-test, index_predetermined, random_with_repeated_smiles, molecular_weight] = 'random'

Method of splitting the data into train/val/test.

target_columns: List[str] = None

Name of the columns containing target values. By default, uses all columns except the SMILES column and the ignore_columns.
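The default choice of target columns can be sketched as a filter over the CSV header. The helper resolve_target_columns is hypothetical and the column names in the usage note are invented.

```python
from typing import List, Optional


def resolve_target_columns(
    header: List[str],
    smiles_columns: List[str],
    target_columns: Optional[List[str]] = None,
    ignore_columns: Optional[List[str]] = None,
) -> List[str]:
    """Explicit target_columns win; otherwise keep everything not excluded."""
    if target_columns is not None:
        return list(target_columns)
    excluded = set(smiles_columns) | set(ignore_columns or [])
    return [column for column in header if column not in excluded]
```

For a header ["smiles", "id", "logP", "logS"] with ignore_columns=["id"], the targets would be logP and logS.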

target_weights: List[float] = None

Weights associated with each target, affecting the relative weight of targets in the loss function. Must match the number of target columns.

property task_names: List[str]

A list of names of the tasks being trained on.

test: bool = False

Whether to skip training and only test the model.

test_fold_index: int = None

Which fold to use as test for leave-one-out cross val.

property train_data_size: int

The size of the training data set.

undirected: bool = False

Undirected edges (always sum the two relevant bond vectors).

property use_input_features: bool

Whether the model is using additional molecule-level features.

val_fold_index: int = None

Which fold to use as val for leave-one-out cross val.

warmup_epochs: float = 2.0

Number of epochs during which learning rate increases linearly from init_lr to max_lr. Afterwards, learning rate decreases exponentially from max_lr to final_lr.
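The schedule described above (linear warmup followed by exponential decay) can be written as a function of the optimizer step. This is a sketch of the arithmetic under assumptions: learning_rate is a hypothetical helper, steps_per_epoch is an invented parameter, and the library's actual scheduler may differ in details such as step counting.

```python
def learning_rate(
    step: int,
    steps_per_epoch: int,
    epochs: int = 30,
    warmup_epochs: float = 2.0,
    init_lr: float = 1e-4,
    max_lr: float = 1e-3,
    final_lr: float = 1e-4,
) -> float:
    """Learning rate at a given optimizer step."""
    warmup_steps = int(warmup_epochs * steps_per_epoch)
    total_steps = epochs * steps_per_epoch
    if warmup_steps > 0 and step <= warmup_steps:
        # Linear increase from init_lr to max_lr over the warmup steps.
        return init_lr + step * (max_lr - init_lr) / warmup_steps
    # Exponential decay chosen so that the last step lands on final_lr.
    gamma = (final_lr / max_lr) ** (1 / (total_steps - warmup_steps))
    return max_lr * gamma ** (step - warmup_steps)
```

With the default values and 100 steps per epoch, the rate starts at init_lr, peaks at max_lr after 200 steps, and returns to final_lr at step 3000.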

weights_ffn_num_layers: int = 2

Number of layers in FFN for determining weights used in constrained targets.

Predict Arguments

class chemprop.args.PredictArgs(*args, **kwargs)

PredictArgs includes CommonArgs along with additional arguments used for predicting with a Chemprop model.

Initializes the Tap instance.

Parameters:
  • args – Arguments passed to the super class ArgumentParser.

  • underscores_to_dashes – If True, convert underscores in flags to dashes.

  • explicit_bool – Booleans can be specified on the command line as "--arg True" or "--arg False" rather than "--arg". Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.

  • config_files – A list of paths to configuration files containing the command line arguments (e.g., '--arg1 a1 --arg2 a2'). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.

  • kwargs – Keyword arguments passed to the super class ArgumentParser.

calibration_atom_descriptors_path: str = None

Path to the extra atom descriptors.

calibration_bond_descriptors_path: str = None

Path to the extra bond descriptors that will be used as bond features to featurize a given molecule.

calibration_features_path: List[str] = None

Path to features data to be used with the uncertainty calibration dataset.

calibration_interval_percentile: float = 95

Sets the percentile used in the calibration methods. Must be in the range (1,100).

calibration_method: typing_extensions.Literal[zscaling, tscaling, zelikman_interval, mve_weighting, platt, isotonic, conformal, conformal_adaptive, conformal_regression, conformal_quantile_regression] = None

Methods used for calibrating the uncertainty calculated with uncertainty method.

calibration_path: str = None

Path to data file to be used for uncertainty calibration.

calibration_phase_features_path: str = None

conformal_alpha: float = 0.1

Target error rate for conformal prediction.

drop_extra_columns: bool = False

Whether to drop all columns from the test data file besides the SMILES columns and the new prediction columns.

dropout_sampling_size: int = 10

The number of samples to use for Monte Carlo dropout uncertainty estimation. Distinct from the dropout used during training.

property ensemble_size: int

The number of models in the ensemble.

ensemble_variance: bool = False

Deprecated. Whether to calculate the variance of ensembles as a measure of epistemic uncertainty. If True, the variance is saved as an additional column for each target in the preds_path.

evaluation_methods: List[str] = None

The methods used for evaluating the uncertainty performance if the test data provided includes targets. Available methods are [nll, miscalibration_area, ence, spearman] or any available classification or multiclass metric.

evaluation_scores_path: str = None

Location to save the results of uncertainty evaluations.

individual_ensemble_predictions: bool = False

Whether to return the predictions made by each of the individual models rather than the average of the ensemble

preds_path: str

Path to CSV or PICKLE file where predictions will be saved.

process_args() → None

Perform additional argument processing and/or validation.

regression_calibrator_metric: typing_extensions.Literal[stdev, interval] = None

Regression calibrators can output either a stdev or an interval.

test_path: str

Path to CSV file containing testing data for which predictions will be made.

uncertainty_dropout_p: float = 0.1

The probability to use for Monte Carlo dropout uncertainty estimation.

uncertainty_method: typing_extensions.Literal[mve, ensemble, evidential_epistemic, evidential_aleatoric, evidential_total, classification, dropout, spectra_roundrobin, dirichlet] = None

The method of calculating uncertainty.

Interpret Arguments

class chemprop.args.InterpretArgs(*args, **kwargs)

InterpretArgs includes CommonArgs along with additional arguments used for interpreting a trained Chemprop model.

Initializes the Tap instance.

Parameters:
  • args – Arguments passed to the super class ArgumentParser.

  • underscores_to_dashes – If True, convert underscores in flags to dashes.

  • explicit_bool – Booleans can be specified on the command line as "--arg True" or "--arg False" rather than "--arg". Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.

  • config_files – A list of paths to configuration files containing the command line arguments (e.g., '--arg1 a1 --arg2 a2'). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.

  • kwargs – Keyword arguments passed to the super class ArgumentParser.

batch_size: int = 500

Batch size.

c_puct: float = 10.0

Constant factor in MCTS.

data_path: str

Path to data CSV file.

max_atoms: int = 20

Maximum number of atoms in rationale.

min_atoms: int = 8

Minimum number of atoms in rationale.

process_args() → None

Perform additional argument processing and/or validation.

prop_delta: float = 0.5

Minimum score to count as positive.

property_id: int = 1

Index of the property of interest in the trained model.

rollout: int = 20

Number of rollout steps.

Hyperparameter Optimization Arguments

class chemprop.args.HyperoptArgs(*args, **kwargs)

HyperoptArgs includes TrainArgs along with additional arguments used for optimizing Chemprop hyperparameters.

Initializes the Tap instance.

Parameters:
  • args – Arguments passed to the super class ArgumentParser.

  • underscores_to_dashes – If True, convert underscores in flags to dashes.

  • explicit_bool – Booleans can be specified on the command line as "--arg True" or "--arg False" rather than "--arg". Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.

  • config_files – A list of paths to configuration files containing the command line arguments (e.g., '--arg1 a1 --arg2 a2'). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.

  • kwargs – Keyword arguments passed to the super class ArgumentParser.

config_save_path: str

Path to .json file where best hyperparameter settings will be written.

hyperopt_checkpoint_dir: str = None

Path to a directory where hyperopt completed trial data is stored. Hyperopt job will include these trials if restarted. Can also be used to run multiple instances in parallel if they share the same checkpoint directory.

hyperopt_seed: int = 0

The initial seed used for choosing parameters in hyperopt trials. In each trial, the seed will be increased by one, skipping seeds previously used.
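One way to read the seed rule is shown below. The helper next_trial_seed is hypothetical, and how the set of previously used seeds is tracked (for instance via a shared checkpoint directory) is left out of this sketch.

```python
from typing import Set


def next_trial_seed(current_seed: int, used_seeds: Set[int]) -> int:
    """Advance the seed by one, skipping any seed that earlier trials already used."""
    seed = current_seed + 1
    while seed in used_seeds:
        seed += 1
    return seed
```

Starting from seed 0 with seeds 1 and 2 already taken by earlier trials, the next trial would use seed 3.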

log_dir: str = None

(Optional) Path to a directory where all results of the hyperparameter optimization will be written.

manual_trial_dirs: List[str] = None

Paths to save directories for manually trained models in the same search space as the hyperparameter search. Results will be considered as part of the trial history of the hyperparameter search.

num_iters: int = 20

Number of hyperparameter choices to try.

process_args() → None

Perform additional argument processing and/or validation.

search_parameter_keywords: List[str] = ['basic']

The model parameters over which to search for an optimal hyperparameter configuration. Some options are bundles of parameters or otherwise special parameter operations.

Special keywords:

  • basic – the default set of hyperparameters for search: depth, ffn_num_layers, dropout, and linked_hidden_size.

  • linked_hidden_size – search for hidden_size and ffn_hidden_size, but constrained for them to have the same value. If either of the component keywords is entered separately, both are searched independently.

  • learning_rate – search for max_lr, init_lr, final_lr, and warmup_epochs. The search values for init_lr and final_lr are defined as fractions of the max_lr value. The search value for warmup_epochs is defined as a fraction of the total epochs used.

  • all – include search for all 13 individual keyword options.

Individual supported parameters:

activation, aggregation, aggregation_norm, batch_size, depth, dropout, ffn_hidden_size, ffn_num_layers, final_lr, hidden_size, init_lr, max_lr, warmup_epochs

startup_random_iters: int = None

The initial number of trials that will be randomly specified before TPE algorithm is used to select the rest. By default will be half the total number of trials.

Scikit-Learn Train Arguments

class chemprop.args.SklearnTrainArgs(*args, **kwargs)

SklearnTrainArgs includes TrainArgs along with additional arguments for training a scikit-learn model.

Initializes the Tap instance.

Parameters:
  • args – Arguments passed to the super class ArgumentParser.

  • underscores_to_dashes – If True, convert underscores in flags to dashes.

  • explicit_bool – Booleans can be specified on the command line as "--arg True" or "--arg False" rather than "--arg". Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.

  • config_files – A list of paths to configuration files containing the command line arguments (e.g., '--arg1 a1 --arg2 a2'). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.

  • kwargs – Keyword arguments passed to the super class ArgumentParser.

class_weight: typing_extensions.Literal[balanced] = None

How to weight classes (None means no class balance).

impute_mode: typing_extensions.Literal[single_task, median, mean, linear, frequent] = None

How to impute missing data (None means no imputation).

model_type: typing_extensions.Literal[random_forest, svm]

scikit-learn model to use.

num_bits: int = 2048

Number of bits in morgan fingerprint.

num_trees: int = 500

Number of random forest trees.

radius: int = 2

Morgan fingerprint radius.

single_task: bool = False

Whether to run each task separately (needed when dataset has null entries).

Scikit-Learn Predict Arguments

class chemprop.args.SklearnPredictArgs(*args, **kwargs)

SklearnPredictArgs contains arguments used for predicting with a trained scikit-learn model.

Initializes the Tap instance.

Parameters:
  • args – Arguments passed to the super class ArgumentParser.

  • underscores_to_dashes – If True, convert underscores in flags to dashes.

  • explicit_bool – Booleans can be specified on the command line as "--arg True" or "--arg False" rather than "--arg". Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.

  • config_files – A list of paths to configuration files containing the command line arguments (e.g., '--arg1 a1 --arg2 a2'). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.

  • kwargs – Keyword arguments passed to the super class ArgumentParser.

checkpoint_dir: str = None

Path to directory containing model checkpoints (.pkl file)

checkpoint_path: str = None

Path to model checkpoint (.pkl file)

checkpoint_paths: List[str] = None

List of paths to model checkpoints (.pkl files)

number_of_molecules: int = 1

Number of molecules in each input to the model. This must equal the length of smiles_columns (if not None).

preds_path: str

Path to CSV file where predictions will be saved.

process_args() → None

Perform additional argument processing and/or validation.

smiles_columns: List[str] = None

List of names of the columns containing SMILES strings. By default, uses the first number_of_molecules columns.

test_path: str

Path to CSV file containing testing data for which predictions will be made.

Utility Functions

chemprop.args.get_checkpoint_paths(checkpoint_path: str | None = None, checkpoint_paths: List[str] | None = None, checkpoint_dir: str | None = None, ext: str = '.pt') → List[str] | None

Gets a list of checkpoint paths either from a single checkpoint path or from a directory of checkpoints.

If checkpoint_path is provided, only collects that one checkpoint. If checkpoint_paths is provided, collects all of the provided checkpoints. If checkpoint_dir is provided, walks the directory and collects all checkpoints. A checkpoint is any file ending in the extension ext.

Parameters:
  • checkpoint_path – Path to a checkpoint.

  • checkpoint_paths – List of paths to checkpoints.

  • checkpoint_dir – Path to a directory containing checkpoints.

  • ext – The extension which defines a checkpoint file.

Returns:

A list of paths to checkpoints or None if no checkpoint path(s)/dir are provided.
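The behaviour described for get_checkpoint_paths can be approximated with the standard library. The function below, collect_checkpoint_paths, is a hypothetical re-implementation for illustration only: the order in which the three inputs are checked, the sorting of results, and the absence of validation are assumptions, so the library function may behave differently when several inputs are given or when a directory holds no checkpoints.

```python
import os
from typing import List, Optional


def collect_checkpoint_paths(
    checkpoint_path: Optional[str] = None,
    checkpoint_paths: Optional[List[str]] = None,
    checkpoint_dir: Optional[str] = None,
    ext: str = ".pt",
) -> Optional[List[str]]:
    """Gather checkpoint paths from a single path, a list, or a directory tree."""
    if checkpoint_path is not None:
        return [checkpoint_path]
    if checkpoint_paths is not None:
        return list(checkpoint_paths)
    if checkpoint_dir is not None:
        found = []
        # Walk the directory tree and keep every file ending in ext.
        for root, _, files in os.walk(checkpoint_dir):
            for name in files:
                if name.endswith(ext):
                    found.append(os.path.join(root, name))
        return sorted(found)
    return None
```

Calling it with ext=".pkl" would collect scikit-learn style checkpoints from the same kind of directory layout.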