Scikit-Learn Models

In addition to message passing neural networks, Chemprop also enables training and predicting with scikit-learn Random Forest and Support Vector Machine models applied to Morgan fingerprints.

Scikit-Learn Train

chemprop.sklearn_train.py contains functions for training scikit-learn models.

chemprop.sklearn_train.impute_sklearn(model: Union[sklearn.ensemble._forest.RandomForestRegressor, sklearn.ensemble._forest.RandomForestClassifier, sklearn.svm._classes.SVR, sklearn.svm._classes.SVC], train_data: chemprop.data.data.MoleculeDataset, args: chemprop.args.SklearnTrainArgs, logger: Optional[logging.Logger] = None, threshold: float = 0.5) → List[float]

Trains single-task scikit-learn models, one per task, and uses them to impute missing target values in the training data.

This is necessary if some tasks have None (unknown) values.

Parameters
  • model – The scikit-learn model to train.

  • train_data – The training data.

  • args – A SklearnTrainArgs object containing arguments for training the scikit-learn model.

  • logger – A logger to record output.

  • threshold – Threshold for classification tasks.

Returns

A list of lists of target values.
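The idea of filling unknown targets with per-task predictions can be sketched as follows. This is a minimal illustration under stated assumptions, not Chemprop's implementation: the helper name `impute_targets`, the random bit vectors standing in for Morgan fingerprints, and the use of plain lists in place of a MoleculeDataset are all hypothetical.

```python
from typing import List, Optional

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def impute_targets(
    features: np.ndarray,
    targets: List[List[Optional[float]]],
    seed: int = 0,
) -> List[List[float]]:
    """Hypothetical helper: fill None targets with per-task model predictions."""
    num_tasks = len(targets[0])
    filled = [list(row) for row in targets]  # copy so the input is not mutated

    for task in range(num_tasks):
        known = [i for i, row in enumerate(targets) if row[task] is not None]
        unknown = [i for i, row in enumerate(targets) if row[task] is None]
        if not unknown:
            continue  # nothing to impute for this task

        # One model per task, trained only on rows whose target is known.
        model = RandomForestRegressor(n_estimators=10, random_state=seed)
        model.fit(features[known], [targets[i][task] for i in known])

        # Replace each unknown target with the model's prediction.
        for i, pred in zip(unknown, model.predict(features[unknown])):
            filled[i][task] = float(pred)

    return filled


rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(6, 16)).astype(float)  # stand-in for fingerprints
y = [[1.0, None], [2.0, 0.5], [None, 0.7], [4.0, None], [5.0, 0.1], [6.0, 0.3]]
imputed = impute_targets(X, y)
```

Known targets pass through unchanged and only the None entries are replaced. For classification tasks, the threshold parameter presumably converts predicted probabilities into labels; that step is omitted from this regression-only sketch.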

chemprop.sklearn_train.multi_task_sklearn(model: Union[sklearn.ensemble._forest.RandomForestRegressor, sklearn.ensemble._forest.RandomForestClassifier, sklearn.svm._classes.SVR, sklearn.svm._classes.SVC], train_data: chemprop.data.data.MoleculeDataset, test_data: chemprop.data.data.MoleculeDataset, metrics: List[str], args: chemprop.args.SklearnTrainArgs, logger: Optional[logging.Logger] = None) → Dict[str, List[float]]

Trains a multi-task scikit-learn model, meaning one model is trained simultaneously on all tasks.

This is only possible if none of the tasks have None (unknown) values.

Parameters
  • model – The scikit-learn model to train.

  • train_data – The training data.

  • test_data – The test data.

  • metrics – A list of names of metric functions.

  • args – A SklearnTrainArgs object containing arguments for training the scikit-learn model.

  • logger – A logger to record output.

Returns

A dictionary mapping each metric in metrics to a list of values for each task.
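To illustrate what training one model on all tasks at once looks like at the scikit-learn level, here is a small sketch. It assumes random bit vectors in place of Morgan fingerprints and synthetic targets; none of the variable names come from Chemprop.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(20, 32)).astype(float)  # stand-in fingerprints
y_train = rng.normal(size=(20, 3))  # 3 tasks, no missing values
X_test = rng.integers(0, 2, size=(5, 32)).astype(float)

# A single estimator is fit on the full (num_molecules, num_tasks) target matrix.
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X_train, y_train)

preds = model.predict(X_test)  # shape: (num_test_molecules, num_tasks)
```

Because the estimator needs a complete numeric target matrix, a single None entry would prevent this kind of fit, which is why the multi-task path requires fully known targets.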

chemprop.sklearn_train.predict(model: Union[sklearn.ensemble._forest.RandomForestRegressor, sklearn.ensemble._forest.RandomForestClassifier, sklearn.svm._classes.SVR, sklearn.svm._classes.SVC], model_type: str, dataset_type: str, features: List[numpy.ndarray]) → List[List[float]]

Predicts using a scikit-learn model.

Parameters
  • model – The trained scikit-learn model to make predictions with.

  • model_type – The type of model.

  • dataset_type – The type of dataset.

  • features – The data features used as input for the model.

Returns

A list of lists of floats containing the predicted values.
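The return type implies that raw model output is normalized to one list of floats per molecule. A rough sketch of that normalization is shown below; `to_nested_floats` is a hypothetical helper, and how model_type and dataset_type select the underlying scikit-learn call is not shown here.

```python
from typing import List

import numpy as np


def to_nested_floats(raw_preds: np.ndarray) -> List[List[float]]:
    """Hypothetical helper: convert model output to one list of floats per molecule."""
    raw_preds = np.asarray(raw_preds, dtype=float)
    if raw_preds.ndim == 1:
        # Single-task output: wrap each value so every molecule gets its own list.
        return [[float(p)] for p in raw_preds]
    # Multi-task output: one row per molecule, one column per task.
    return [[float(p) for p in row] for row in raw_preds]


single = to_nested_floats(np.array([0.1, 0.9]))
multi = to_nested_floats(np.array([[0.1, 0.2], [0.3, 0.4]]))
```

With this shape, callers can index predictions as `preds[molecule][task]` regardless of whether the model was trained on one task or several.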

chemprop.sklearn_train.run_sklearn(args: chemprop.args.SklearnTrainArgs, data: chemprop.data.data.MoleculeDataset, logger: Optional[logging.Logger] = None) → Dict[str, List[float]]

Loads data, trains a scikit-learn model, and returns test scores for the model checkpoint with the highest validation score.

Parameters
  • args – A SklearnTrainArgs object containing arguments for loading data and training the scikit-learn model.

  • data – A MoleculeDataset containing the data.

  • logger – A logger to record output.

Returns

A dictionary mapping each metric in metrics to a list of values for each task.

chemprop.sklearn_train.single_task_sklearn(model: Union[sklearn.ensemble._forest.RandomForestRegressor, sklearn.ensemble._forest.RandomForestClassifier, sklearn.svm._classes.SVR, sklearn.svm._classes.SVC], train_data: chemprop.data.data.MoleculeDataset, test_data: chemprop.data.data.MoleculeDataset, metrics: List[str], args: chemprop.args.SklearnTrainArgs, logger: Optional[logging.Logger] = None) → List[float]

Trains a single-task scikit-learn model, meaning a separate model is trained for each task.

This is necessary if some tasks have None (unknown) values.

Parameters
  • model – The scikit-learn model to train.

  • train_data – The training data.

  • test_data – The test data.

  • metrics – A list of names of metric functions.

  • args – A SklearnTrainArgs object containing arguments for training the scikit-learn model.

  • logger – A logger to record output.

Returns

A dictionary mapping each metric in metrics to a list of values for each task.
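The per-task approach can be sketched as a loop that drops rows with unknown targets before fitting and scoring each task. This is an illustration only: `known_rows`, `per_task_scores`, and the `METRICS` registry are hypothetical names, the features are random bit vectors rather than Morgan fingerprints, and Chemprop resolves metric names through its own utilities.

```python
from typing import Dict, List, Optional

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical metric registry for this sketch only.
METRICS = {"mse": mean_squared_error, "mae": mean_absolute_error}


def known_rows(targets: List[List[Optional[float]]], task: int) -> List[int]:
    """Indices of rows whose target for `task` is not None."""
    return [i for i, row in enumerate(targets) if row[task] is not None]


def per_task_scores(X_train, y_train, X_test, y_test,
                    metrics: List[str]) -> Dict[str, List[float]]:
    """Train one model per task and collect one score per task for each metric."""
    num_tasks = len(y_train[0])
    scores: Dict[str, List[float]] = {m: [] for m in metrics}
    for task in range(num_tasks):
        train_idx = known_rows(y_train, task)
        test_idx = known_rows(y_test, task)
        # A fresh model per task, fit only on molecules with a known target.
        model = RandomForestRegressor(n_estimators=10, random_state=0)
        model.fit(X_train[train_idx], [y_train[i][task] for i in train_idx])
        preds = model.predict(X_test[test_idx])
        truth = [y_test[i][task] for i in test_idx]
        for m in metrics:
            scores[m].append(float(METRICS[m](truth, preds)))
    return scores


rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(8, 16)).astype(float)
X_test = rng.integers(0, 2, size=(4, 16)).astype(float)
y_train = [[1.0, None], [2.0, 0.2], [None, 0.4], [4.0, 0.6],
           [5.0, None], [6.0, 0.8], [7.0, 1.0], [None, 1.2]]
y_test = [[1.5, 0.3], [None, 0.5], [3.5, None], [4.5, 0.9]]
scores = per_task_scores(X_train, y_train, X_test, y_test, ["mse", "mae"])
```

The result has the same layout as the documented return value: each metric name maps to a list with one score per task.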

chemprop.sklearn_train.sklearn_train() → None

Parses scikit-learn training arguments and trains a scikit-learn model.

This is the entry point for the command line command sklearn_train.

Scikit-Learn Predict

chemprop.sklearn_predict.py contains functions for predicting with trained scikit-learn models.

chemprop.sklearn_predict.predict_sklearn(args: chemprop.args.SklearnPredictArgs) → None

Loads data and a trained scikit-learn model and uses the model to make predictions on the data.

Parameters
  • args – A SklearnPredictArgs object containing arguments for loading data, loading a trained scikit-learn model, and making predictions with the model.

chemprop.sklearn_predict.sklearn_predict() → None

Parses scikit-learn predicting arguments and runs prediction using a trained scikit-learn model.

This is the entry point for the command line command sklearn_predict.