Training#

To train a model, run:

chemprop train --data-path <input_path> --task-type <task> --output-dir <dir>

where <input_path> is the path to a CSV file containing a dataset, <task> is the type of modeling task, and <dir> is the directory where model checkpoints will be saved.

For example:

chemprop train --data-path tests/data/regression.csv \
--task-type regression \
--output-dir solubility_checkpoints

The following modeling tasks are supported:

  • regression

  • regression-mve

  • regression-evidential

  • classification

  • classification-dirichlet

  • multiclass

  • multiclass-dirichlet

  • spectral

A full list of available command-line arguments can be found in CLI Reference.

Input Data#

In order to train a model, you must provide training data containing molecules (as SMILES strings) and known target values. Targets can either be real numbers, if performing regression, or binary (i.e. 0s and 1s), if performing classification. Target values which are unknown can be left as blanks. A model can be trained as either single- or multi-task.

The data file must be a CSV file with a header row. For example:

smiles,NR-AR,NR-AR-LBD,NR-AhR,NR-Aromatase,NR-ER,NR-ER-LBD,NR-PPAR-gamma,SR-ARE,SR-ATAD5,SR-HSE,SR-MMP,SR-p53
CCOc1ccc2nc(S(N)(=O)=O)sc2c1,0,0,1,,,0,0,1,0,0,0,0
CCN1C(=O)NC(c2ccccc2)C1=O,0,0,0,0,0,0,0,,0,,0,0
...

By default, it is assumed that the SMILES are in the first column and the targets are in the remaining columns. However, the specific columns containing the SMILES and targets can be specified using the --smiles-columns <column> and --target-columns <column_1> <column_2> ... flags, respectively. To train on multiple molecules simultaneously (such as a solute and a solvent), supply two column headers to --smiles-columns.
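For example, a training command for a two-molecule dataset might look like the following (the file name and column names here are hypothetical placeholders for your own data):

```shell
chemprop train --data-path data.csv \
    --smiles-columns solute_smiles solvent_smiles \
    --target-columns logS \
    --task-type regression \
    --output-dir solvation_checkpoints
```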

Train/Validation/Test Splits#

Our code supports several methods of splitting data into train, validation, and test sets.

  • Random: By default, the data will be split randomly into train, validation, and test sets.

  • Scaffold: Alternatively, the data can be split by molecular scaffold so that the same scaffold never appears in more than one split. This can be specified by adding --split-type scaffold_balanced.

  • User-Specified: The ability to specify your own split indices will be added soon.

Note: By default, both random and scaffold split the data into 80% train, 10% validation, and 10% test. This can be changed with --split-sizes <train_frac> <val_frac> <test_frac>. The default setting is --split-sizes 0.8 0.1 0.1. Both splits also involve a random component that can be seeded with --data-seed <seed>. The default setting is --data-seed 0.

Other supported splitting methods include cv, cv_no_val, random_with_repeated_smiles, kennard_stone, and kmeans.
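Putting the splitting options together, a seeded scaffold split with a 70/15/15 partition might look like this (the data path is a hypothetical placeholder):

```shell
chemprop train --data-path data.csv \
    --task-type regression \
    --split-type scaffold_balanced \
    --split-sizes 0.7 0.15 0.15 \
    --data-seed 42 \
    --output-dir checkpoints
```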

Cross Validation#

k-fold cross-validation can be run by specifying --num-folds <k> (default 1, i.e. no cross-validation).

Ensembling#

To train an ensemble, specify the number of models in the ensemble with --ensemble-size <n> (default 1).
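Cross-validation and ensembling can be combined in one command; for example, the following sketch (with a hypothetical data path) trains an ensemble of 3 models under 5-fold cross-validation:

```shell
chemprop train --data-path data.csv \
    --task-type regression \
    --num-folds 5 \
    --ensemble-size 3 \
    --output-dir checkpoints
```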

Hyperparameters#

Model performance is often highly dependent on the hyperparameters used. Below is a list of common hyperparameters (see CLI Reference for a full list):

  • --batch-size <n> Batch size (default 64)

  • --message-hidden-dim <n> Hidden dimension of the messages in the MPNN (default 300)

  • --depth <n> Number of message-passing steps (default 3)

  • --dropout <n> Dropout probability in the MPNN & FFN layers (default 0)

  • --activation <activation_type> The activation function used in the MPNN and FFN layers. Options include relu, leakyrelu, prelu, tanh, selu, and elu. (default relu)

  • --epochs <n> How many epochs to train over (default 50)

  • --warmup-epochs <n> The number of epochs during which the learning rate is linearly incremented from init_lr to max_lr (default 2)

  • --init-lr <n> Initial learning rate (default 0.0001)

  • --max-lr <n> Maximum learning rate (default 0.001)

  • --final-lr <n> Final learning rate (default 0.0001)
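As an illustration, a run with a wider, deeper network and light dropout might be configured as follows (the data path and specific values are hypothetical, not recommendations):

```shell
chemprop train --data-path data.csv \
    --task-type regression \
    --message-hidden-dim 600 \
    --depth 4 \
    --dropout 0.1 \
    --epochs 100 \
    --output-dir checkpoints
```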

Loss Functions#

The loss function can be specified using the --loss-function <function> keyword, where <function> is one of the following:

Regression:

  • mse Mean squared error (default)

  • bounded-mse Bounded mean squared error

  • mve Mean-variance estimation

  • evidential Evidential; if used, --evidential-regularization can be specified to modify the regularization, and --eps to modify epsilon.

Classification:

  • bce Binary cross-entropy (default)

  • binary-mcc Binary Matthews correlation coefficient

  • binary-dirichlet Binary Dirichlet

Multiclass:

  • ce Cross-entropy (default)

  • multiclass-mcc Multiclass Matthews correlation coefficient

  • multiclass-dirichlet Multiclass Dirichlet

Spectral:

  • sid Spectral information divergence (default)

  • earthmovers Earth mover’s distance (or first-order Wasserstein distance)

  • wasserstein See above.
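For example, to train an uncertainty-aware regression model with the mean-variance estimation loss (data path hypothetical):

```shell
chemprop train --data-path data.csv \
    --task-type regression-mve \
    --loss-function mve \
    --output-dir mve_checkpoints
```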

Evaluation Metrics#

The following evaluation metrics are supported during training:

Regression:

  • rmse Root mean squared error (default)

  • mae Mean absolute error

  • mse Mean squared error

  • bounded-mae Bounded mean absolute error

  • bounded-mse Bounded mean squared error

  • bounded-rmse Bounded root mean squared error

  • r2 R squared metric

Classification:

  • roc Receiver operating characteristic (default)

  • prc Precision-recall curve

  • accuracy Accuracy

  • f1 F1 score

  • bce Binary cross-entropy

  • binary-mcc Binary Matthews correlation coefficient

Multiclass:

  • ce Cross-entropy (default)

  • multiclass-mcc Multiclass Matthews correlation coefficient

Spectral:

  • sid Spectral information divergence (default)

  • wasserstein Earth mover’s distance (or first-order Wasserstein distance)

Advanced Training Methods#

Pretraining#

It is possible to freeze the weights of a loaded model during training, such as for transfer learning applications. To do so, specify --model-frzn <path> where <path> refers to a model’s checkpoint file that will be used to overwrite and freeze the model weights. The following flags may be used:

  • --frzn-ffn-layers <n> Overwrites weights for the first n layers of the FFN from the checkpoint (default 0)
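A transfer-learning command might therefore look like the following sketch, which loads and freezes weights from a previously trained checkpoint (the checkpoint path and file names are hypothetical):

```shell
chemprop train --data-path new_data.csv \
    --task-type regression \
    --model-frzn pretrained_checkpoints/best.pt \
    --frzn-ffn-layers 1 \
    --output-dir finetune_checkpoints
```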

Training on Reactions#

Chemprop can also process atom-mapped reaction SMILES (see Daylight manual for details), which consist of three parts denoting reactants, agents, and products, each separated by “>”. For example, an atom-mapped reaction SMILES denoting the reaction of methanol to formaldehyde without hydrogens: [CH3:1][OH:2]>>[CH2:1]=[O:2] and with hydrogens: [C:1]([H:3])([H:4])([H:5])[O:2][H:6]>>[C:1]([H:3])([H:4])=[O:2].[H:5][H:6]. The reactions do not need to be balanced and can thus contain unmapped parts, for example leaving groups, if necessary.

To enable this, specify the columns in the input file containing reaction SMILES using the option --reaction-columns. Chemprop then transforms the reactants and products into the corresponding condensed graph of reaction and sets the initial atom and bond features according to the argument provided to --rxn-mode <feature_type>:

  • reac_diff Featurize with the reactant and the difference upon reaction (default)

  • reac_prod Featurize with both the reactant and product

  • prod_diff Featurize with the product and the difference upon reaction

Each of these arguments can be modified to balance imbalanced reactions by appending _balance, e.g. reac_diff_balance.

In reaction mode, Chemprop concatenates information to each atomic and bond feature vector. For example, using --rxn-mode reac_prod, each atomic feature vector holds information on the state of the atom in the reactant (similar to default Chemprop), and concatenates information on the state of the atom in the product. Agents are featurized with but not connected to the reactants. Functions incompatible with a reaction as input (scaffold splitting and feature generation) are carried out on the reactants only.

If the atom-mapped reaction SMILES contain mapped hydrogens, enable explicit hydrogens via --keep-h.

For further details and benchmarking, as well as a citable reference, please see DOI 10.1021/acs.jcim.1c00975.
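Putting this together, a reaction-training command might look like the following (the data path and column name are hypothetical):

```shell
chemprop train --data-path rxn_data.csv \
    --task-type regression \
    --reaction-columns rxn_smiles \
    --rxn-mode reac_diff_balance \
    --output-dir rxn_checkpoints
```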

Training Reactions with Molecules (e.g. Solvents, Reagents)#

Both reaction and molecule SMILES can be associated with a target (e.g. a reaction rate in a solvent). To do so, use both --smiles-columns and --reaction-columns.

The reaction and molecule SMILES columns can be ordered in any way. However, the same column ordering as used in the training must be used for the prediction. For more information on atom-mapped reaction SMILES, please refer to Training on Reactions.
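For instance, to model a reaction rate that depends on the solvent, a command along these lines could be used (file and column names hypothetical):

```shell
chemprop train --data-path rate_data.csv \
    --task-type regression \
    --reaction-columns rxn_smiles \
    --smiles-columns solvent_smiles \
    --output-dir rate_checkpoints
```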

Training on Spectra#

Spectra training differs from other data types because it considers the predictions of all targets together. Targets for spectra should be provided as the values of the spectrum at specific positions. Spectra predictions are constrained to be positive and are normalized so that each spectrum sums to 1. The activation used to enforce positivity is an exponential function by default, but it can instead be set to a Softplus function via the argument --spectral-activation <exp or softplus>. Positivity is enforced on input targets as well, using a floor value that replaces negative or very small target values; the floor is customizable with the argument --spectra_target_floor <float> (default 1e-8).
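The positivity and normalization behavior described above can be sketched in pure Python as follows. This is a minimal illustration of the math, not Chemprop's actual implementation:

```python
import math

def normalize_spectrum(raw_outputs, activation="exp", target_floor=1e-8):
    """Map raw model outputs to a positive spectrum that sums to 1."""
    if activation == "exp":
        positive = [math.exp(x) for x in raw_outputs]
    elif activation == "softplus":
        # softplus(x) = log(1 + exp(x)), also strictly positive
        positive = [math.log1p(math.exp(x)) for x in raw_outputs]
    else:
        raise ValueError(f"unknown activation: {activation}")
    # Floor tiny values, mirroring the target floor applied to inputs
    positive = [max(p, target_floor) for p in positive]
    total = sum(positive)
    return [p / total for p in positive]

# Three raw (possibly negative) outputs become a normalized spectrum
spectrum = normalize_spectrum([-1.0, 0.0, 2.0])
```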

Additional Features#

While the model works very well on its own, especially after hyperparameter optimization, additional features and descriptors may further improve performance on certain datasets. Features are used before message passing, while descriptors are used after message passing. The additional features/descriptors can be added at the atom-, bond-, or molecule-level. Molecule-level features can be either automatically generated by RDKit or custom features provided by the user and are concatenated to the learned descriptors generated by Chemprop during message passing (i.e. used as extra descriptors).

Atom-Level Features/Descriptors#

You can provide additional atom features via --atom-features-path /path/to/atom/features.npz as a numpy .npz file. This command concatenates the features to each atomic feature vector before the D-MPNN, so that they are used during message-passing. This file can be saved using np.savez("atom_features.npz", *V_fs), where V_fs is a list containing the atom features V_f for each molecule, where V_f is a 2D array with a shape of number of atoms by number of atom features in the exact same order as the SMILES strings in your data file.
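As a concrete sketch of preparing such a file (the two-molecule dataset, atom counts, and feature dimension below are made up for illustration):

```python
import os
import tempfile
import numpy as np

# Suppose the data file contains two molecules with 3 and 5 atoms,
# and we computed 4 extra features per atom.
V_fs = [
    np.random.rand(3, 4),  # molecule 1: 3 atoms x 4 features
    np.random.rand(5, 4),  # molecule 2: 5 atoms x 4 features
]

# One array per molecule, in the same order as the SMILES in the data file
path = os.path.join(tempfile.mkdtemp(), "atom_features.npz")
np.savez(path, *V_fs)

# Round-trip check: shapes survive the save/load
loaded = np.load(path)
shapes = [loaded[k].shape for k in sorted(loaded.files)]
```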

Similarly, you can provide additional atom descriptors via --atom-descriptors-path /path/to/atom/descriptors.npz as a numpy .npz file. This command concatenates the new features to the embedded atomic features after the D-MPNN with an additional linear layer. This file can be saved using np.savez("atom_descriptors.npz", *V_ds), where V_ds has the same format as V_fs above.

The order of the atom features and atom descriptors for each atom per molecule must match the ordering of atoms in the RDKit molecule object.

The atom-level features and descriptors are scaled by default. This can be disabled with the option --no-atom-feature-scaling or --no-atom-descriptor-scaling.

Bond-Level Features#

Bond-level features can be provided using the option --bond-features-path /path/to/bond/features.npz as a numpy .npz file. This command concatenates the features to each bond feature vector before the D-MPNN, so that they are used during message-passing. This file can be saved using np.savez("bond_features.npz", *E_fs), where E_fs is a list containing the bond features E_f for each molecule, where E_f is a 2D array with a shape of number of bonds by number of bond features in the exact same order as the SMILES strings in your data file.

The order of the bond features for each molecule must match the bond ordering in the RDKit molecule object.

Note that bond descriptors are not currently supported because the post message passing readout function aggregates atom descriptors.

The bond-level features are scaled by default. This can be disabled with the option --no-bond-features-scaling.

Extra Descriptors#

Additional descriptors can be concatenated to the learned representation after aggregation. These could be molecule features, for example. If you install from source, you can modify the code to load custom descriptors as follows:

  1. Generate features: If you want to generate molecule features in code, you can write a custom features generator function using the default featurizers in chemprop/featurizers/. This also works for custom atom and bond features.

  2. Load features: Additional descriptors can be provided using --descriptors-path /path/to/descriptors.npz as a numpy .npz file. This file can be saved using np.savez("/path/to/descriptors.npz", X_d), where X_d is a 2D array with a shape of number of datapoints by number of additional descriptors. Note that the descriptors must be in the same order as the SMILES strings in your data file. The extra descriptors are scaled by default. This can be disabled with the option --no-descriptor-scaling.
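Unlike the per-atom files above, the descriptor file holds a single 2D array. A minimal sketch (the dataset size and descriptor count are hypothetical):

```python
import os
import tempfile
import numpy as np

# Hypothetical: 10 datapoints with 5 extra descriptors each,
# rows in the same order as the SMILES in the data file.
X_d = np.random.rand(10, 5)

path = os.path.join(tempfile.mkdtemp(), "descriptors.npz")
np.savez(path, X_d)  # a single array, not one array per molecule

loaded = np.load(path)
shape = loaded["arr_0"].shape
```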

Molecule-Level 2D Features#

Morgan fingerprints can be generated as molecular 2D features using --features-generators:

  • morgan_binary binary Morgan fingerprints, radius 2 and 2048 bits.

  • morgan_count count-based Morgan, radius 2 and 2048 bits.

Missing Target Values#

When training multitask models (models which predict more than one target simultaneously), sometimes not all target values are known for all molecules in the dataset. Chemprop automatically handles missing entries in the dataset by masking out the respective values in the loss function, so that partial data can be utilized.

The loss function is rescaled according to all non-missing values, and missing values do not contribute to validation or test errors. Training on partial data is therefore possible and encouraged (versus taking out datapoints with missing target entries). No keyword is needed for this behavior, it is the default.
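The masking and rescaling behavior can be illustrated with a small pure-Python sketch (not Chemprop's actual code), where missing CSV entries are represented as None:

```python
def masked_mse(predictions, targets):
    """Mean squared error over non-missing targets only.

    `targets` uses None for missing entries (blank cells in the CSV).
    """
    errors = [
        (p - t) ** 2
        for p, t in zip(predictions, targets)
        if t is not None  # missing values contribute nothing to the loss
    ]
    if not errors:
        raise ValueError("no known targets to compute a loss on")
    # Rescale by the number of non-missing values, not the total count
    return sum(errors) / len(errors)

# The middle target is missing, so only two entries are averaged
loss = masked_mse([0.5, 1.0, 2.0], [1.0, None, 2.0])
```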

TensorBoard#

During training, TensorBoard logs are automatically saved to the output directory under model_{i}/trainer_logs/version_0/. To view TensorBoard logs, run tensorboard --logdir=<dir>, where <dir> is the path to the checkpoint directory, then navigate to http://localhost:6006.