Command Line Arguments
chemprop.args.py contains all command line arguments, which are processed using the Typed Argument Parser (Tap) package.
Common Arguments
- class chemprop.args.CommonArgs(*args, **kwargs)
CommonArgs contains arguments that are used in both TrainArgs and PredictArgs.
Initializes the Tap instance.
- Parameters:
args – Arguments passed to the super class ArgumentParser.
underscores_to_dashes – If True, convert underscores in flags to dashes.
explicit_bool – Booleans can be specified on the command line as "--arg True" or "--arg False" rather than "--arg". Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.
config_files – A list of paths to configuration files containing the command line arguments (e.g., '--arg1 a1 --arg2 a2'). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.
kwargs – Keyword arguments passed to the super class ArgumentParser.
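The explicit_bool rule described above can be illustrated with a small stand-alone parser. This is only a sketch of the documented behavior, not Tap's own implementation; the function name parse_explicit_bool is hypothetical.

```python
def parse_explicit_bool(text: str) -> bool:
    """Interpret text as a boolean following the explicit_bool description.

    Accepts any non-empty prefix of "true" or "false" (any capitalization)
    as well as "1" and "0". Hypothetical helper, not part of Tap.
    """
    value = text.strip().lower()
    if value == "1" or (value and "true".startswith(value)):
        return True
    if value == "0" or (value and "false".startswith(value)):
        return False
    raise ValueError(f"Cannot interpret {text!r} as a boolean")


for raw in ("True", "fa", "T", "0"):
    print(raw, parse_explicit_bool(raw))
```

Under this rule, "--arg T", "--arg tr" and "--arg 1" would all set the flag to True.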
- atom_descriptors: typing_extensions.Literal[feature, descriptor] = None
Custom extra atom descriptors.
feature: used as atom features to featurize a given molecule.
descriptor: used as descriptor and concatenated to the machine learned atomic representation.
- atom_descriptors_path: str = None
Path to the extra atom descriptors.
- property atom_descriptors_size: int
The size of the atom descriptors.
- property atom_features_size: int
The size of the atom features.
- batch_size: int = 50
Batch size.
- bond_descriptors: typing_extensions.Literal[feature, descriptor] = None
Custom extra bond descriptors.
feature: used as bond features to featurize a given molecule.
descriptor: used as descriptor and concatenated to the machine learned bond representation.
- bond_descriptors_path: str = None
Path to the extra bond descriptors that will be used as bond features to featurize a given molecule.
- property bond_descriptors_size: int
The size of the bond descriptors.
- property bond_features_size: int
The size of the bond features.
- checkpoint_dir: str = None
Directory from which to load model checkpoints (walks directory and ensembles all models that are found).
- checkpoint_path: str = None
Path to model checkpoint (.pt file).
- checkpoint_paths: List[str] = None
List of paths to model checkpoints (.pt files).
- configure() → None
Overwrite this method to configure the parser during initialization.
For example,
self.add_argument('--sum', dest='accumulate', action='store_const', const=sum, default=max)
self.add_subparsers(help='sub-command help')
self.add_subparser('a', SubparserA, help='a help')
- constraints_path: str = None
Path to constraints applied to atomic/bond properties prediction.
- property cuda: bool
Whether to use CUDA (i.e., GPUs) or not.
- property device: device
The torch.device on which to load and process data and models.
- empty_cache: bool = False
Whether to empty all caches before training or predicting. This is necessary if multiple jobs are run within a single script and the atom or bond features change.
- features_generator: List[str] = None
Method(s) of generating additional features.
- features_path: List[str] = None
Path(s) to features to use in FFN (instead of features_generator).
- property features_scaling: bool
Whether to apply normalization with a StandardScaler to the additional molecule-level features.
- gpu: int = None
Which GPU to use.
- max_data_size: int = None
Maximum number of data points to load.
- no_cache_mol: bool = False
Whether to not cache the RDKit molecule for each SMILES string to reduce memory usage (cached by default).
- no_cuda: bool = False
Turn off cuda (i.e., use CPU instead of GPU).
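How gpu, no_cuda and the cuda and device properties might fit together can be sketched as follows. This assumes CUDA is used only when it is available and no_cuda is False; the helper select_device_name is hypothetical, and in real code cuda_available would come from torch.cuda.is_available() and the returned string would be passed to torch.device.

```python
from typing import Optional


def select_device_name(no_cuda: bool, cuda_available: bool,
                       gpu: Optional[int] = None) -> str:
    """Return a device string such as "cpu", "cuda" or "cuda:1".

    Hypothetical helper: the precedence shown here is an assumption
    based on the argument descriptions, not Chemprop's actual code.
    """
    use_cuda = (not no_cuda) and cuda_available
    if not use_cuda:
        return "cpu"
    return "cuda" if gpu is None else f"cuda:{gpu}"


print(select_device_name(no_cuda=False, cuda_available=True, gpu=1))
print(select_device_name(no_cuda=True, cuda_available=True, gpu=1))
```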
- no_features_scaling: bool = False
Turn off scaling of features.
- num_workers: int = 8
Number of workers for the parallel data loading (0 means sequential).
- number_of_molecules: int = 1
Number of molecules in each input to the model. This must equal the length of smiles_columns (if not None).
- phase_features_path: str = None
Path to features used to indicate the phase of the data in one-hot vector form. Used in spectra datatype.
- smiles_columns: List[str] = None
List of names of the columns containing SMILES strings. By default, uses the first number_of_molecules columns.
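The relationship between smiles_columns and number_of_molecules can be sketched with a small resolver. The function resolve_smiles_columns and the example column names are hypothetical; the sketch only mirrors the two rules stated above (default to the first number_of_molecules columns, and otherwise require matching lengths).

```python
import csv
import io
from typing import List, Optional, Sequence


def resolve_smiles_columns(header: Sequence[str],
                           smiles_columns: Optional[List[str]] = None,
                           number_of_molecules: int = 1) -> List[str]:
    """Pick the SMILES columns for a data file with the given header."""
    if smiles_columns is None:
        # Default: the first number_of_molecules columns of the file.
        return list(header[:number_of_molecules])
    if len(smiles_columns) != number_of_molecules:
        raise ValueError("len(smiles_columns) must equal number_of_molecules")
    missing = [name for name in smiles_columns if name not in header]
    if missing:
        raise ValueError(f"Columns not found in header: {missing}")
    return list(smiles_columns)


# Example header read from an in-memory CSV (made-up data).
header = next(csv.reader(io.StringIO("solute,solvent,logS\nCCO,O,-0.3\n")))
print(resolve_smiles_columns(header, number_of_molecules=2))
```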
Train Arguments
- class chemprop.args.TrainArgs(*args, **kwargs)
TrainArgs includes CommonArgs along with additional arguments used for training a Chemprop model.
Initializes the Tap instance.
- Parameters:
args – Arguments passed to the super class ArgumentParser.
underscores_to_dashes – If True, convert underscores in flags to dashes.
explicit_bool – Booleans can be specified on the command line as "--arg True" or "--arg False" rather than "--arg". Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.
config_files – A list of paths to configuration files containing the command line arguments (e.g., '--arg1 a1 --arg2 a2'). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.
kwargs – Keyword arguments passed to the super class ArgumentParser.
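The precedence described for config_files (later files override earlier ones, and the command line overrides all files) can be mimicked with the standard argparse module. This is an illustrative sketch rather than Tap's implementation: merge_args is a hypothetical helper, and it relies on argparse keeping the last value given for a repeated option.

```python
import argparse
import shlex
from typing import List


def merge_args(config_texts: List[str], cli_args: List[str]) -> List[str]:
    """Concatenate config-file contents (in order) followed by CLI tokens."""
    tokens: List[str] = []
    for text in config_texts:
        tokens.extend(shlex.split(text))
    tokens.extend(cli_args)
    return tokens


parser = argparse.ArgumentParser()
parser.add_argument("--epochs", type=int, default=30)
parser.add_argument("--batch_size", type=int, default=50)

# Two config "files" (given here as strings) and one command-line override.
configs = ["--epochs 10 --batch_size 32", "--epochs 20"]
args = parser.parse_args(merge_args(configs, ["--batch_size", "64"]))
print(args.epochs, args.batch_size)
```

Here epochs comes from the second config file and batch_size from the command line.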
- activation: typing_extensions.Literal[ReLU, LeakyReLU, PReLU, tanh, SELU, ELU] = 'ReLU'
Activation function.
- property adding_bond_types: bool
Whether the bond types determined by RDKit molecules should be added to the output of bond targets.
- adding_h: bool = False
Whether RDKit molecules will be constructed with explicit Hs added to them. This option is intended to be used with Chemprop’s default molecule or multi-molecule encoders, or in reaction_solvent mode where it applies to the solvent only.
- aggregation: typing_extensions.Literal[mean, sum, norm] = 'mean'
Aggregation scheme for combining atomic vectors into molecular vectors.
- aggregation_norm: int = 100
For norm aggregation, the number by which to divide the summed atomic features.
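The three aggregation schemes can be compared in a few lines of plain Python. This is a minimal sketch operating on lists of floats rather than tensors; the function name aggregate is hypothetical.

```python
from typing import List


def aggregate(atom_vectors: List[List[float]],
              aggregation: str = "mean",
              aggregation_norm: float = 100) -> List[float]:
    """Combine per-atom vectors into one molecule-level vector."""
    if not atom_vectors:
        raise ValueError("A molecule needs at least one atom vector")
    # Element-wise sum over all atoms.
    summed = [sum(column) for column in zip(*atom_vectors)]
    if aggregation == "sum":
        return summed
    if aggregation == "mean":
        return [value / len(atom_vectors) for value in summed]
    if aggregation == "norm":
        return [value / aggregation_norm for value in summed]
    raise ValueError(f"Unknown aggregation: {aggregation!r}")


atoms = [[1.0, 2.0], [3.0, 4.0]]
for scheme in ("mean", "sum", "norm"):
    print(scheme, aggregate(atoms, scheme))
```

Unlike mean, the norm scheme divides by a constant, so larger molecules yield larger vectors, as with sum.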
- property atom_constraints: List[bool]
A list of booleans indicating whether constraints are applied to the output of atomic properties.
- property atom_descriptor_scaling: bool
Whether to apply normalization with a StandardScaler to the additional atom features.
- atom_messages: bool = False
Centers messages on atoms instead of on bonds.
- bias: bool = False
Whether to add bias to linear layers.
- bias_solvent: bool = False
Whether to add bias to linear layers for solvent MPN if reaction_solvent is True.
- property bond_constraints: List[bool]
A list of booleans indicating whether constraints are applied to the output of bond properties.
- property bond_descriptor_scaling: bool
Whether to apply normalization with a StandardScaler to the additional bond features.
- cache_cutoff: float = 10000
Maximum number of molecules in dataset to allow caching. Below this number, caching is used and data loading is sequential. Above this number, caching is not used and data loading is parallel. Use “inf” to always cache.
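The effect of cache_cutoff on data loading can be sketched as a small decision function. The helper loading_plan is hypothetical, and treating a dataset whose size exactly equals the cutoff as cached is an assumption, since the description only covers "below" and "above".

```python
def loading_plan(num_molecules: int,
                 cache_cutoff: float = 10000,
                 num_workers: int = 8) -> dict:
    """Decide between cached/sequential and uncached/parallel loading."""
    use_cache = num_molecules <= cache_cutoff  # boundary case is an assumption
    return {
        "cache": use_cache,
        # 0 workers means sequential loading.
        "num_workers": 0 if use_cache else num_workers,
    }


print(loading_plan(500))
print(loading_plan(50_000))
print(loading_plan(50_000, cache_cutoff=float("inf")))
```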
- checkpoint_frzn: str = None
Path to model checkpoint file to be loaded for overwriting and freezing weights.
- class_balance: bool = False
Trains with an equal number of positives and negatives in each batch.
- config_path: str = None
Path to a .json file containing arguments. Any arguments present in the config file will override arguments specified via the command line or by the defaults.
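The override order stated for config_path (values in the JSON file win over both command-line values and defaults) can be sketched as follows. The helper apply_config is hypothetical and uses a plain namespace in place of a real TrainArgs instance.

```python
import json
from types import SimpleNamespace


def apply_config(args: SimpleNamespace, config_text: str) -> SimpleNamespace:
    """Overwrite attributes on args with every key found in the JSON text."""
    for name, value in json.loads(config_text).items():
        setattr(args, name, value)
    return args


# Values as they might look after defaults and command-line parsing.
args = SimpleNamespace(depth=3, dropout=0.0, epochs=30)
apply_config(args, '{"depth": 5, "dropout": 0.2}')
print(args.depth, args.dropout, args.epochs)
```

Note that this is the opposite direction from config_files, where the command line takes precedence.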
- crossval_index_dir: str = None
Directory in which to find cross validation index files.
- crossval_index_file: str = None
Indices of files to use as train/val/test. Overrides --num_folds and --seed.
- property crossval_index_sets: List[List[List[int]]]
Index sets used for splitting data into train/validation/test during cross-validation
- data_path: str
Path to data CSV file.
- data_weights_path: str = None
Path to weights for each molecule in the training data, affecting the relative weight of molecules in the loss function
- dataset_type: typing_extensions.Literal[regression, classification, multiclass, spectra]
Type of dataset. This determines the default loss function used during training.
- depth: int = 3
Number of message passing steps.
- depth_solvent: int = 3
Number of message passing steps for solvent if reaction_solvent is True.
- dropout: float = 0.0
Dropout probability.
- ensemble_size: int = 1
Number of models in ensemble.
- epochs: int = 30
Number of epochs to run.
- evidential_regularization: float = 0
Value used in regularization for evidential loss function. The default value recommended by Soleimany et al. (2021) is 0.2. The optimal value is dataset-dependent; it is recommended that users test different values to find the best value for their model.
- explicit_h: bool = False
Whether H are explicitly specified in input (and should be kept this way). This option is intended to be used with the reaction or reaction_solvent options, and applies only to the reaction part.
- extra_metrics: List[typing_extensions.Literal[auc, prc-auc, rmse, mae, mse, r2, accuracy, cross_entropy, binary_cross_entropy, sid, wasserstein, f1, mcc, bounded_rmse, bounded_mae, bounded_mse, recall, precision, balanced_accuracy]] = []
Additional metrics to use to evaluate the model. Not used for early stopping.
- features_only: bool = False
Use only the additional features in an FFN, no graph network.
- property features_size: int
The dimensionality of the additional molecule-level features.
- ffn_hidden_size: int = None
Hidden dim for higher-capacity FFN (defaults to hidden_size).
- ffn_num_layers: int = 2
Number of layers in FFN after MPN encoding.
- final_lr: float = 0.0001
Final learning rate.
- folds_file: str = None
Optional file of fold labels.
- freeze_first_only: bool = False
Determines whether or not to use checkpoint_frzn for just the first encoder. Default (False) is to use the checkpoint to freeze all encoders. (only relevant for number_of_molecules > 1, where checkpoint model has number_of_molecules = 1)
- frzn_ffn_layers: int = 0
Overwrites weights for the first n layers of the FFN from the checkpoint model (specified by checkpoint_frzn), where n is specified in the input. Automatically also freezes MPNN weights.
- grad_clip: float = None
Maximum magnitude of gradient during training.
- hidden_size: int = 300
Dimensionality of hidden layers in MPN.
- hidden_size_solvent: int = 300
Dimensionality of hidden layers in solvent MPN if reaction_solvent is True.
- ignore_columns: List[str] = None
Name of the columns to ignore when target_columns is not provided.
- ignore_nan_metrics: bool = False
Ignore invalid task metrics (NaNs) when computing average metrics across tasks.
- init_lr: float = 0.0001
Initial learning rate.
- is_atom_bond_targets: bool = False
Whether this is atomic/bond properties prediction.
- keeping_atom_map: bool = False
Whether RDKit molecules keep the original atom mapping. This option is intended to be used when providing atom-mapped SMILES with the is_atom_bond_targets option.
- log_frequency: int = 10
The number of batches between each logging of the training loss.
- loss_function: typing_extensions.Literal[mse, bounded_mse, binary_cross_entropy, cross_entropy, mcc, sid, wasserstein, mve, evidential, dirichlet, quantile_interval] = None
Choice of loss function. Loss functions are limited to compatible dataset types.
- max_lr: float = 0.001
Maximum learning rate.
- metric: typing_extensions.Literal[auc, prc-auc, rmse, mae, mse, r2, accuracy, cross_entropy, binary_cross_entropy, sid, wasserstein, f1, mcc, bounded_rmse, bounded_mae, bounded_mse, recall, precision, balanced_accuracy] = None
Metric to use during evaluation. It is also used with the validation set for early stopping. Defaults to “auc” for classification, “rmse” for regression, and “sid” for spectra.
- property metrics: List[str]
The list of metrics used for evaluation. Only the first is used for early stopping.
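How metric, extra_metrics and the metrics property relate can be sketched with a small resolver. The names DEFAULT_METRICS and resolve_metrics are hypothetical, and only the three defaults stated above are included; other dataset types deliberately raise an error here rather than guess a default.

```python
from typing import List, Optional

# Defaults taken from the description of metric above.
DEFAULT_METRICS = {"classification": "auc", "regression": "rmse", "spectra": "sid"}


def resolve_metrics(dataset_type: str,
                    metric: Optional[str] = None,
                    extra_metrics: Optional[List[str]] = None) -> List[str]:
    """Return the evaluation metrics; the first one drives early stopping."""
    if metric is None:
        if dataset_type not in DEFAULT_METRICS:
            raise ValueError(f"No default metric listed for {dataset_type!r}")
        metric = DEFAULT_METRICS[dataset_type]
    return [metric] + list(extra_metrics or [])


print(resolve_metrics("regression", extra_metrics=["mae", "r2"]))
```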
- property minimize_score: bool
Whether the model should try to minimize the score metric or maximize it.
- mpn_shared: bool = False
Whether to use the same message passing neural network for all input molecules. Only relevant if number_of_molecules > 1.
- multiclass_num_classes: int = 3
Number of classes when running multiclass classification.
- no_adding_bond_types: bool = False
Turn off adding the bond types determined by RDKit molecules to the output of bond targets. This option is intended to be used with the is_atom_bond_targets option.
- no_atom_descriptor_scaling: bool = False
Turn off atom feature scaling.
- no_bond_descriptor_scaling: bool = False
Turn off bond feature scaling.
- no_shared_atom_bond_ffn: bool = False
Whether the FFN weights for atom and bond targets should be independent between tasks.
- num_folds: int = 1
Number of folds when performing cross validation.
- property num_lrs: int
The number of learning rates to use (currently hard-coded to 1).
- property num_tasks: int
The number of tasks being trained on.
- overwrite_default_atom_features: bool = False
Overwrites the default atom descriptors with the new ones instead of concatenating them. Can only be used if atom_descriptors are used as a feature.
- overwrite_default_bond_features: bool = False
Overwrites the default bond descriptors with the new ones instead of concatenating them. Can only be used if bond_descriptors are used as a feature.
- pytorch_seed: int = 0
Seed for PyTorch randomness (e.g., random initial weights).
- quantile_loss_alpha: float = 0.1
Target error bounds for quantile interval loss.
- property quantiles: List[float]
A list of quantiles to be trained on.
- quiet: bool = False
Skip non-essential print statements.
- reaction: bool = False
Whether to adjust MPNN layer to take reactions as input instead of molecules.
- reaction_mode: typing_extensions.Literal[reac_prod, reac_diff, prod_diff, reac_prod_balance, reac_diff_balance, prod_diff_balance] = 'reac_diff'
Choices for construction of atom and bond features for reactions:
reac_prod: concatenates the reactants feature with the products feature.
reac_diff: concatenates the reactants feature with the difference in features between reactants and products.
prod_diff: concatenates the products feature with the difference in features between reactants and products.
reac_prod_balance: concatenates the reactants feature with the products feature, balances imbalanced reactions.
reac_diff_balance: concatenates the reactants feature with the difference in features between reactants and products, balances imbalanced reactions.
prod_diff_balance: concatenates the products feature with the difference in features between reactants and products, balances imbalanced reactions.
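The three basic reaction modes can be illustrated on a single atom's feature vector. This is a sketch under two assumptions: the difference is taken as product minus reactant, and the function name reaction_features is hypothetical. The *_balance variants, which additionally handle imbalanced reactions, are not covered.

```python
from typing import List


def reaction_features(reactant: List[float],
                      product: List[float],
                      mode: str = "reac_diff") -> List[float]:
    """Build the combined feature vector for one atom (or bond)."""
    # Sign convention (product - reactant) is an assumption.
    diff = [p - r for r, p in zip(reactant, product)]
    if mode == "reac_prod":
        return reactant + product
    if mode == "reac_diff":
        return reactant + diff
    if mode == "prod_diff":
        return product + diff
    raise ValueError(f"Mode not covered by this sketch: {mode!r}")


reactant, product = [1.0, 0.0], [0.0, 1.0]
for mode in ("reac_prod", "reac_diff", "prod_diff"):
    print(mode, reaction_features(reactant, product, mode))
```

In every mode the combined vector is twice as long as the per-molecule feature vector.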
- reaction_solvent: bool = False
Whether to adjust the MPNN layer to take as input a reaction and a molecule, and to encode them with separate MPNNs.
- resume_experiment: bool = False
Whether to resume the experiment. Loads test results from any folds that have already been completed and skips training those folds.
- save_dir: str = None
Directory where model checkpoints will be saved.
- save_preds: bool = False
Whether to save test split predictions during training.
- save_smiles_splits: bool = False
Save smiles for each train/val/test splits for prediction convenience later.
- seed: int = 0
Random seed to use when splitting data into train/val/test sets. When num_folds > 1, the first fold uses this seed and all subsequent folds add 1 to the seed.
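The per-fold seeds implied by this description can be listed with a one-line helper. The name fold_seeds is hypothetical, and the sketch reads "add 1 to the seed" as each fold using the previous fold's seed plus one.

```python
from typing import List


def fold_seeds(seed: int, num_folds: int) -> List[int]:
    """Return the split seed used by each cross-validation fold."""
    return [seed + fold for fold in range(num_folds)]


print(fold_seeds(seed=0, num_folds=3))
```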
- separate_test_atom_descriptors_path: str = None
Path to file with extra atom descriptors for separate test set.
- separate_test_bond_descriptors_path: str = None
Path to file with extra bond descriptors for separate test set.
- separate_test_constraints_path: str = None
Path to file with constraints for separate test set.
- separate_test_features_path: List[str] = None
Path to file with features for separate test set.
- separate_test_path: str = None
Path to separate test set, optional.
- separate_test_phase_features_path: str = None
Path to file with phase features for separate test set.
- separate_val_atom_descriptors_path: str = None
Path to file with extra atom descriptors for separate val set.
- separate_val_bond_descriptors_path: str = None
Path to file with extra bond descriptors for separate val set.
- separate_val_constraints_path: str = None
Path to file with constraints for separate val set.
- separate_val_features_path: List[str] = None
Path to file with features for separate val set.
- separate_val_path: str = None
Path to separate val set, optional.
- separate_val_phase_features_path: str = None
Path to file with phase features for separate val set.
- property shared_atom_bond_ffn: bool
Whether the FFN weights for atom and bond targets should be shared between tasks.
- show_individual_scores: bool = False
Show all scores for individual targets, not just average, at the end.
- spectra_activation: typing_extensions.Literal[exp, softplus] = 'exp'
Indicates which function to use in dataset_type spectra training to constrain outputs to be positive.
- spectra_phase_mask_path: str = None
Path to a file containing a phase mask array, used for excluding particular regions in spectra predictions.
- spectra_target_floor: float = 1e-08
Values in targets for dataset type spectra are replaced with this value, intended to be a small positive number used to enforce positive values.
- split_key_molecule: int = 0
The index of the key molecule used for splitting when multiple molecules are present and constrained split_type is used, like scaffold_balanced or random_with_repeated_smiles. Note that this index begins with zero for the first molecule.
- split_sizes: List[float] = None
Split proportions for train/validation/test sets.
- split_type: typing_extensions.Literal[random, scaffold_balanced, predetermined, crossval, cv, cv-no-test, index_predetermined, random_with_repeated_smiles, molecular_weight] = 'random'
Method of splitting the data into train/val/test.
- target_columns: List[str] = None
Name of the columns containing target values. By default, uses all columns except the SMILES column and the
ignore_columns
.
- target_weights: List[float] = None
Weights associated with each target, affecting the relative weight of targets in the loss function. Must match the number of target columns.
- property task_names: List[str]
A list of names of the tasks being trained on.
- test: bool = False
Whether to skip training and only test the model.
- test_fold_index: int = None
Which fold to use as test for leave-one-out cross val.
- property train_data_size: int
The size of the training data set.
- undirected: bool = False
Undirected edges (always sum the two relevant bond vectors).
- property use_input_features: bool
Whether the model is using additional molecule-level features.
- val_fold_index: int = None
Which fold to use as val for leave-one-out cross val.
- warmup_epochs: float = 2.0
Number of epochs during which learning rate increases linearly from init_lr to max_lr. Afterwards, learning rate decreases exponentially from max_lr to final_lr.
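The schedule described by init_lr, max_lr, final_lr and warmup_epochs can be written as a function of the training step. This is a minimal sketch of a linear warmup followed by exponential decay; the function name learning_rate and the steps_per_epoch parameter are assumptions, and the defaults are the values listed in this section.

```python
def learning_rate(step: int,
                  steps_per_epoch: int,
                  epochs: int = 30,
                  warmup_epochs: float = 2.0,
                  init_lr: float = 1e-4,
                  max_lr: float = 1e-3,
                  final_lr: float = 1e-4) -> float:
    """Learning rate at a given optimizer step."""
    warmup_steps = int(warmup_epochs * steps_per_epoch)
    total_steps = epochs * steps_per_epoch
    if step <= warmup_steps:
        if warmup_steps == 0:
            return max_lr
        # Linear increase from init_lr to max_lr.
        return init_lr + step * (max_lr - init_lr) / warmup_steps
    if step >= total_steps:
        return final_lr
    # Exponential decay chosen so that final_lr is reached at total_steps.
    gamma = (final_lr / max_lr) ** (1.0 / (total_steps - warmup_steps))
    return max_lr * gamma ** (step - warmup_steps)


for step in (0, 100, 200, 1000, 3000):
    print(step, learning_rate(step, steps_per_epoch=100))
```

With 100 steps per epoch, the rate peaks at step 200 (the end of the second epoch) and returns to final_lr at step 3000.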
- weights_ffn_num_layers: int = 2
Number of layers in FFN for determining weights used in constrained targets.
Predict Arguments
- class chemprop.args.PredictArgs(*args, **kwargs)
PredictArgs includes CommonArgs along with additional arguments used for predicting with a Chemprop model.
Initializes the Tap instance.
- Parameters:
args – Arguments passed to the super class ArgumentParser.
underscores_to_dashes – If True, convert underscores in flags to dashes.
explicit_bool – Booleans can be specified on the command line as "--arg True" or "--arg False" rather than "--arg". Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.
config_files – A list of paths to configuration files containing the command line arguments (e.g., '--arg1 a1 --arg2 a2'). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.
kwargs – Keyword arguments passed to the super class ArgumentParser.
- calibration_atom_descriptors_path: str = None
Path to the extra atom descriptors.
- calibration_bond_descriptors_path: str = None
Path to the extra bond descriptors that will be used as bond features to featurize a given molecule.
- calibration_features_path: List[str] = None
Path to features data to be used with the uncertainty calibration dataset.
- calibration_interval_percentile: float = 95
Sets the percentile used in the calibration methods. Must be in the range (1,100).
- calibration_method: typing_extensions.Literal[zscaling, tscaling, zelikman_interval, mve_weighting, platt, isotonic, conformal, conformal_adaptive, conformal_regression, conformal_quantile_regression] = None
Methods used for calibrating the uncertainty calculated with uncertainty method.
- calibration_path: str = None
Path to data file to be used for uncertainty calibration.
- calibration_phase_features_path: str = None
- conformal_alpha: float = 0.1
Target error rate for conformal prediction.
- drop_extra_columns: bool = False
Whether to drop all columns from the test data file besides the SMILES columns and the new prediction columns.
- dropout_sampling_size: int = 10
The number of samples to use for Monte Carlo dropout uncertainty estimation. Distinct from the dropout used during training.
- property ensemble_size: int
The number of models in the ensemble.
- ensemble_variance: bool = False
Deprecated. Whether to calculate the variance of ensembles as a measure of epistemic uncertainty. If True, the variance is saved as an additional column for each target in the preds_path.
- evaluation_methods: List[str] = None
The methods used for evaluating the uncertainty performance if the test data provided includes targets. Available methods are [nll, miscalibration_area, ence, spearman] or any available classification or multiclass metric.
- evaluation_scores_path: str = None
Location to save the results of uncertainty evaluations.
- individual_ensemble_predictions: bool = False
Whether to return the predictions made by each of the individual models rather than the average of the ensemble
- preds_path: str
Path to CSV or PICKLE file where predictions will be saved.
- regression_calibrator_metric: typing_extensions.Literal[stdev, interval] = None
Regression calibrators can output either a stdev or an interval.
- test_path: str
Path to CSV file containing testing data for which predictions will be made.
- uncertainty_dropout_p: float = 0.1
The probability to use for Monte Carlo dropout uncertainty estimation.
- uncertainty_method: typing_extensions.Literal[mve, ensemble, evidential_epistemic, evidential_aleatoric, evidential_total, classification, dropout, spectra_roundrobin, dirichlet] = None
The method of calculating uncertainty.
Interpret Arguments
- class chemprop.args.InterpretArgs(*args, **kwargs)
InterpretArgs includes CommonArgs along with additional arguments used for interpreting a trained Chemprop model.
Initializes the Tap instance.
- Parameters:
args – Arguments passed to the super class ArgumentParser.
underscores_to_dashes – If True, convert underscores in flags to dashes.
explicit_bool – Booleans can be specified on the command line as "--arg True" or "--arg False" rather than "--arg". Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.
config_files – A list of paths to configuration files containing the command line arguments (e.g., '--arg1 a1 --arg2 a2'). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.
kwargs – Keyword arguments passed to the super class ArgumentParser.
- batch_size: int = 500
Batch size.
- c_puct: float = 10.0
Constant factor in MCTS.
- data_path: str
Path to data CSV file.
- max_atoms: int = 20
Maximum number of atoms in rationale.
- min_atoms: int = 8
Minimum number of atoms in rationale.
- prop_delta: float = 0.5
Minimum score to count as positive.
- property_id: int = 1
Index of the property of interest in the trained model.
- rollout: int = 20
Number of rollout steps.
Hyperparameter Optimization Arguments
- class chemprop.args.HyperoptArgs(*args, **kwargs)
HyperoptArgs includes TrainArgs along with additional arguments used for optimizing Chemprop hyperparameters.
Initializes the Tap instance.
- Parameters:
args – Arguments passed to the super class ArgumentParser.
underscores_to_dashes – If True, convert underscores in flags to dashes.
explicit_bool – Booleans can be specified on the command line as "--arg True" or "--arg False" rather than "--arg". Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.
config_files – A list of paths to configuration files containing the command line arguments (e.g., '--arg1 a1 --arg2 a2'). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.
kwargs – Keyword arguments passed to the super class ArgumentParser.
- config_save_path: str
Path to .json file where best hyperparameter settings will be written.
- hyperopt_checkpoint_dir: str = None
Path to a directory where hyperopt completed trial data is stored. Hyperopt job will include these trials if restarted. Can also be used to run multiple instances in parallel if they share the same checkpoint directory.
- hyperopt_seed: int = 0
The initial seed used for choosing parameters in hyperopt trials. In each trial, the seed will be increased by one, skipping seeds previously used.
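The seed progression described for hyperopt_seed can be sketched as follows. The function trial_seeds is hypothetical; it assumes that "previously used" seeds are those recorded from earlier trials, for example ones found in a shared checkpoint directory.

```python
from typing import Iterable, List


def trial_seeds(hyperopt_seed: int,
                num_trials: int,
                previously_used: Iterable[int] = ()) -> List[int]:
    """Return the seed for each new trial, skipping seeds already used."""
    used = set(previously_used)
    seeds: List[int] = []
    seed = hyperopt_seed
    for _ in range(num_trials):
        while seed in used:
            seed += 1  # skip seeds taken by earlier trials
        seeds.append(seed)
        used.add(seed)
        seed += 1
    return seeds


print(trial_seeds(hyperopt_seed=0, num_trials=3, previously_used={1}))
```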
- log_dir: str = None
(Optional) Path to a directory where all results of the hyperparameter optimization will be written.
- manual_trial_dirs: List[str] = None
Paths to save directories for manually trained models in the same search space as the hyperparameter search. Results will be considered as part of the trial history of the hyperparameter search.
- num_iters: int = 20
Number of hyperparameter choices to try.
- search_parameter_keywords: List[str] = ['basic']
The model parameters over which to search for an optimal hyperparameter configuration. Some options are bundles of parameters or otherwise special parameter operations.
- Special keywords:
basic - the default set of hyperparameters for search: depth, ffn_num_layers, dropout, and linked_hidden_size.
linked_hidden_size - search for hidden_size and ffn_hidden_size, but constrained for them to have the same value. If either of the component words is entered separately, both are searched independently.
learning_rate - search for max_lr, init_lr, final_lr, and warmup_epochs. The searches for init_lr and final_lr are defined as fractions of the max_lr value. The search for warmup_epochs is as a fraction of the total epochs used.
all - include search for all 13 individual keyword options.
- Individual supported parameters:
activation, aggregation, aggregation_norm, batch_size, depth, dropout, ffn_hidden_size, ffn_num_layers, final_lr, hidden_size, init_lr, max_lr, warmup_epochs
- startup_random_iters: int = None
The initial number of trials that will be randomly specified before TPE algorithm is used to select the rest. By default will be half the total number of trials.
Scikit-Learn Train Arguments
- class chemprop.args.SklearnTrainArgs(*args, **kwargs)
SklearnTrainArgs includes TrainArgs along with additional arguments for training a scikit-learn model.
Initializes the Tap instance.
- Parameters:
args – Arguments passed to the super class ArgumentParser.
underscores_to_dashes – If True, convert underscores in flags to dashes.
explicit_bool – Booleans can be specified on the command line as "--arg True" or "--arg False" rather than "--arg". Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.
config_files – A list of paths to configuration files containing the command line arguments (e.g., '--arg1 a1 --arg2 a2'). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.
kwargs – Keyword arguments passed to the super class ArgumentParser.
- class_weight: typing_extensions.Literal[balanced] = None
How to weight classes (None means no class balance).
- impute_mode: typing_extensions.Literal[single_task, median, mean, linear, frequent] = None
How to impute missing data (None means no imputation).
- model_type: typing_extensions.Literal[random_forest, svm]
scikit-learn model to use.
- num_bits: int = 2048
Number of bits in Morgan fingerprint.
- num_trees: int = 500
Number of random forest trees.
- radius: int = 2
Morgan fingerprint radius.
- single_task: bool = False
Whether to run each task separately (needed when dataset has null entries).
Scikit-Learn Predict Arguments
- class chemprop.args.SklearnPredictArgs(*args, **kwargs)
SklearnPredictArgs contains arguments used for predicting with a trained scikit-learn model.
Initializes the Tap instance.
- Parameters:
args – Arguments passed to the super class ArgumentParser.
underscores_to_dashes – If True, convert underscores in flags to dashes.
explicit_bool – Booleans can be specified on the command line as "--arg True" or "--arg False" rather than "--arg". Additionally, booleans can be specified by prefixes of True and False with any capitalization as well as 1 or 0.
config_files – A list of paths to configuration files containing the command line arguments (e.g., '--arg1 a1 --arg2 a2'). Arguments passed in from the command line overwrite arguments from the configuration files. Arguments in configuration files that appear later in the list overwrite the arguments in previous configuration files.
kwargs – Keyword arguments passed to the super class ArgumentParser.
- checkpoint_dir: str = None
Path to directory containing model checkpoints (.pkl files).
- checkpoint_path: str = None
Path to model checkpoint (.pkl file).
- checkpoint_paths: List[str] = None
List of paths to model checkpoints (.pkl files).
- number_of_molecules: int = 1
Number of molecules in each input to the model. This must equal the length of smiles_columns (if not None).
- preds_path: str
Path to CSV file where predictions will be saved.
- smiles_columns: List[str] = None
List of names of the columns containing SMILES strings. By default, uses the first number_of_molecules columns.
- test_path: str
Path to CSV file containing testing data for which predictions will be made.
Utility Functions
- chemprop.args.get_checkpoint_paths(checkpoint_path: str | None = None, checkpoint_paths: List[str] | None = None, checkpoint_dir: str | None = None, ext: str = '.pt') → List[str] | None
Gets a list of checkpoint paths either from a single checkpoint path or from a directory of checkpoints.
If checkpoint_path is provided, only collects that one checkpoint. If checkpoint_paths is provided, collects all of the provided checkpoints. If checkpoint_dir is provided, walks the directory and collects all checkpoints. A checkpoint is any file ending in the extension ext.
- Parameters:
checkpoint_path – Path to a checkpoint.
checkpoint_paths – List of paths to checkpoints.
checkpoint_dir – Path to a directory containing checkpoints.
ext – The extension which defines a checkpoint file.
- Returns:
A list of paths to checkpoints or None if no checkpoint path(s)/dir are provided.
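The lookup rules described for get_checkpoint_paths can be re-created with the standard library. The following is an independent sketch, not Chemprop's source: the function name collect_checkpoint_paths is hypothetical, and both the precedence among the three inputs and the sorted output order are assumptions.

```python
import os
from typing import List, Optional


def collect_checkpoint_paths(checkpoint_path: Optional[str] = None,
                             checkpoint_paths: Optional[List[str]] = None,
                             checkpoint_dir: Optional[str] = None,
                             ext: str = ".pt") -> Optional[List[str]]:
    """Collect checkpoint files from a path, a list of paths, or a directory."""
    if checkpoint_path is not None:
        return [checkpoint_path]
    if checkpoint_paths is not None:
        return list(checkpoint_paths)
    if checkpoint_dir is not None:
        found = []
        # Walk the directory tree and keep every file ending in ext.
        for root, _, files in os.walk(checkpoint_dir):
            for name in files:
                if name.endswith(ext):
                    found.append(os.path.join(root, name))
        return sorted(found)
    return None


print(collect_checkpoint_paths(checkpoint_path="model.pt"))
print(collect_checkpoint_paths())
```

Passing ext=".pkl" would collect scikit-learn checkpoints in the same way.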