Features

chemprop.features contains functions for featurizing molecules. This includes both atom/bond features used in message passing and additional molecule-level features appended after message passing.

Featurization

Classes and functions from chemprop.features.featurization.py. Featurization specifically includes computation of the atom and bond features used in message passing.

class chemprop.features.featurization.BatchMolGraph(mol_graphs: List[MolGraph])

A BatchMolGraph represents the graph structure and featurization of a batch of molecules.

A BatchMolGraph contains the attributes of a MolGraph plus:

  • atom_fdim: The dimensionality of the atom feature vector.

  • bond_fdim: The dimensionality of the bond feature vector (technically the combined atom/bond features).

  • a_scope: A list of tuples indicating the start and end atom indices for each molecule.

  • b_scope: A list of tuples indicating the start and end bond indices for each molecule.

  • max_num_bonds: The maximum number of bonds neighboring an atom in this batch.

  • b2b: (Optional) A mapping from a bond index to incoming bond indices.

  • a2a: (Optional) A mapping from an atom index to neighboring atom indices.

  • b2br: (Optional) A mapping from f_bonds to the real bonds in the molecule, as recorded in the targets.

Parameters:

mol_graphs – A list of MolGraphs from which to construct the BatchMolGraph.

get_a2a() -> Tensor

Computes (if necessary) and returns a mapping from each atom index to all neighboring atom indices.

Returns:

A PyTorch tensor containing the mapping from each atom index to all the neighboring atom indices.

get_b2b() -> Tensor

Computes (if necessary) and returns a mapping from each bond index to all the incoming bond indices.

Returns:

A PyTorch tensor containing the mapping from each bond index to all the incoming bond indices.
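
As a rough illustration of the two mappings above, the following sketch derives a2a and b2b from a2b, b2a, and b2revb for a three-atom chain, using plain Python lists. It is not the library's implementation: the real methods return PyTorch tensors, and excluding a bond's own reverse bond from b2b is an assumption made here for illustration.

```python
from typing import List

# Hand-written graph for a chain 0 - 1 - 2 with directed bonds:
#   bond 0: 0 -> 1, bond 1: 1 -> 0, bond 2: 1 -> 2, bond 3: 2 -> 1
a2b: List[List[int]] = [[1], [0, 3], [2]]  # atom -> incoming bonds
b2a: List[int] = [0, 1, 1, 2]              # bond -> atom it originates from
b2revb: List[int] = [1, 0, 3, 2]           # bond -> reverse bond

# a2a: the neighbors of an atom are the origins of its incoming bonds.
a2a = [[b2a[b] for b in incoming] for incoming in a2b]

# b2b: the bonds arriving at the origin atom of each bond,
# skipping the reverse of that bond (assumption, see lead-in).
b2b = [
    [i for i in a2b[b2a[b]] if i != b2revb[b]]
    for b in range(len(b2a))
]

print(a2a)
print(b2b)
```

In this toy graph, bond 2 (1 -> 2) receives information only from bond 0 (0 -> 1), which is the kind of lookup a bond-to-bond mapping makes cheap.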

get_b2br() -> Tensor

Computes (if necessary) and returns a mapping from f_bonds to the real bonds in the molecule, as recorded in the targets.

Returns:

A PyTorch tensor containing the mapping from f_bonds to the real bonds in the molecule, as recorded in the targets.

get_components(atom_messages: bool = False) -> Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, List[Tuple[int, int]], List[Tuple[int, int]]]

Returns the components of the BatchMolGraph.

The returned components are, in order:

  • f_atoms

  • f_bonds

  • a2b

  • b2a

  • b2revb

  • a_scope

  • b_scope

Parameters:

atom_messages – Whether to use atom messages instead of bond messages. This changes the bond feature vector to contain only bond features rather than both atom and bond features.

Returns:

A tuple containing PyTorch tensors with the atom features, bond features, graph structure, and scope of the atoms and bonds (i.e., the indices of the molecules they belong to).

class chemprop.features.featurization.Featurization_parameters

A class holding molecule featurization parameters as attributes.

class chemprop.features.featurization.MolGraph(mol: str | Mol | Tuple[Mol, Mol], atom_features_extra: ndarray | None = None, bond_features_extra: ndarray | None = None, overwrite_default_atom_features: bool = False, overwrite_default_bond_features: bool = False)

A MolGraph represents the graph structure and featurization of a single molecule.

A MolGraph computes the following attributes:

  • n_atoms: The number of atoms in the molecule.

  • n_bonds: The number of bonds in the molecule.

  • f_atoms: A mapping from an atom index to a list of atom features.

  • f_bonds: A mapping from a bond index to a list of bond features.

  • a2b: A mapping from an atom index to a list of incoming bond indices.

  • b2a: A mapping from a bond index to the index of the atom the bond originates from.

  • b2revb: A mapping from a bond index to the index of the reverse bond.

  • overwrite_default_atom_features: A boolean indicating whether to overwrite the default atom descriptors.

  • overwrite_default_bond_features: A boolean indicating whether to overwrite the default bond descriptors.

  • is_mol: A boolean indicating whether the input is a molecule.

  • is_reaction: A boolean indicating whether the input is a reaction.

  • is_explicit_h: A boolean indicating whether to retain explicit Hs (for reaction mode).

  • is_adding_hs: A boolean indicating whether to add explicit Hs (not for reaction mode).

  • reaction_mode: The reaction mode used to construct atom and bond feature vectors.

  • b2br: A mapping from f_bonds to the real bonds in the molecule, as recorded in the targets.

Parameters:
  • mol – A SMILES or an RDKit molecule.

  • atom_features_extra – A 2D numpy array containing additional atom features to featurize the molecule.

  • bond_features_extra – A 2D numpy array containing additional bond features to featurize the molecule.

  • overwrite_default_atom_features – Boolean indicating whether to overwrite the default atom features with the supplied extra atom features instead of concatenating them.

  • overwrite_default_bond_features – Boolean indicating whether to overwrite the default bond features with the supplied extra bond features instead of concatenating them.
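
The structural attributes in the list above (a2b, b2a, b2revb) can be pictured with a small pure-Python sketch. The helper build_graph_maps is hypothetical and takes a plain list of undirected atom pairs instead of an RDKit molecule; it only mirrors the descriptions given here (each chemical bond becomes two directed bonds) and is not the library's code.

```python
from typing import List, Tuple


def build_graph_maps(
    n_atoms: int, edges: List[Tuple[int, int]]
) -> Tuple[List[List[int]], List[int], List[int]]:
    """Hypothetical helper: build a2b, b2a and b2revb from undirected edges."""
    a2b: List[List[int]] = [[] for _ in range(n_atoms)]
    b2a: List[int] = []
    b2revb: List[int] = []
    for a1, a2 in edges:
        b1 = len(b2a)       # directed bond a1 -> a2
        b2 = b1 + 1         # directed bond a2 -> a1
        a2b[a2].append(b1)  # b1 is incoming to a2
        b2a.append(a1)      # b1 originates from a1
        a2b[a1].append(b2)  # b2 is incoming to a1
        b2a.append(a2)      # b2 originates from a2
        b2revb.append(b2)   # reverse of b1 is b2
        b2revb.append(b1)   # reverse of b2 is b1
    return a2b, b2a, b2revb


# A three-atom chain 0 - 1 - 2 (for example, the heavy atoms of propane).
a2b, b2a, b2revb = build_graph_maps(3, [(0, 1), (1, 2)])
print(a2b, b2a, b2revb)
```

Two undirected bonds yield four directed bonds, so every list indexed by bond has length four while a2b has one entry per atom.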

chemprop.features.featurization.atom_features(atom: Atom, functional_groups: List[int] | None = None) -> List[bool | int | float]

Builds a feature vector for an atom.

Parameters:
  • atom – An RDKit atom.

  • functional_groups – A k-hot vector indicating the functional groups the atom belongs to.

Returns:

A list containing the atom features.

chemprop.features.featurization.atom_features_zeros(atom: Atom) -> List[bool | int | float]

Builds a feature vector for an atom containing only the atom number information.

Parameters:

atom – An RDKit atom.

Returns:

A list containing the atom features.

chemprop.features.featurization.bond_features(bond: Bond) -> List[bool | int | float]

Builds a feature vector for a bond.

Parameters:

bond – An RDKit bond.

Returns:

A list containing the bond features.

chemprop.features.featurization.get_atom_fdim(overwrite_default_atom: bool = False, is_reaction: bool = False) -> int

Gets the dimensionality of the atom feature vector.

Parameters:
  • overwrite_default_atom – Whether to overwrite the default atom descriptors.

  • is_reaction – Whether to add EXTRA_ATOM_FDIM for reaction input when REACTION_MODE is not None.

Returns:

The dimensionality of the atom feature vector.

chemprop.features.featurization.get_bond_fdim(atom_messages: bool = False, overwrite_default_bond: bool = False, overwrite_default_atom: bool = False, is_reaction: bool = False) -> int

Gets the dimensionality of the bond feature vector.

Parameters:
  • atom_messages – Whether atom messages are being used. If atom messages are used, then the bond feature vector only contains bond features. Otherwise it contains both atom and bond features.

  • overwrite_default_bond – Whether to overwrite the default bond descriptors.

  • overwrite_default_atom – Whether to overwrite the default atom descriptors.

  • is_reaction – Whether to add EXTRA_BOND_FDIM for reaction input when REACTION_MODE is not None.

Returns:

The dimensionality of the bond feature vector.
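
The effect of atom_messages on the bond feature size can be sketched as below. The function sketch_bond_fdim and the sizes 4 and 10 are hypothetical placeholders, not the library's defaults, and the sketch ignores the overwrite and reaction options; it only assumes that combining atom and bond features means concatenating them.

```python
def sketch_bond_fdim(
    bond_only_fdim: int, atom_fdim: int, atom_messages: bool = False
) -> int:
    """Hypothetical illustration of how atom_messages changes the size."""
    if atom_messages:
        # With atom messages, the vector holds bond features only.
        return bond_only_fdim
    # With bond messages, atom and bond features are concatenated.
    return bond_only_fdim + atom_fdim


# Placeholder sizes chosen for readability only.
with_bond_messages = sketch_bond_fdim(bond_only_fdim=4, atom_fdim=10)
with_atom_messages = sketch_bond_fdim(
    bond_only_fdim=4, atom_fdim=10, atom_messages=True
)
print(with_bond_messages, with_atom_messages)
```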

chemprop.features.featurization.is_adding_hs(is_mol: bool = True) -> bool

Returns whether to add explicit Hs to the mol (not for reactions).

chemprop.features.featurization.is_explicit_h(is_mol: bool = True) -> bool

Returns whether to retain explicit Hs (for reactions only).

chemprop.features.featurization.is_keeping_atom_map(is_mol: bool = True) -> bool

Returns whether to keep the original atom mapping (not for reactions).

chemprop.features.featurization.is_mol(mol: str | Mol | Tuple[Mol, Mol]) -> bool

Checks whether an input is a molecule or a reaction.

Parameters:

mol – A string, an RDKit molecule, or a tuple of molecules.

Returns:

Whether the supplied input corresponds to a single molecule.
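
One way such a check could work is sketched below without any RDKit dependency. The function looks_like_single_molecule is a hypothetical stand-in, and its rules are assumptions: a tuple is taken to be a (reactant, product) pair, and a string is taken to be a reaction if it contains ">", the separator used in reaction SMILES.

```python
from typing import Tuple, Union


def looks_like_single_molecule(mol: Union[str, Tuple[object, object], object]) -> bool:
    """Hypothetical stand-in for is_mol that avoids importing RDKit."""
    if isinstance(mol, tuple):
        # A (reactant, product) pair is treated as a reaction.
        return False
    if isinstance(mol, str):
        # Reaction SMILES separate reactants, agents and products with '>'.
        return ">" not in mol
    # Any other object (for example an RDKit Mol) is treated as a molecule.
    return True


ethanol_is_mol = looks_like_single_molecule("CCO")
oxidation_is_mol = looks_like_single_molecule("CCO>>CC=O")
pair_is_mol = looks_like_single_molecule(("reactant", "product"))
print(ethanol_is_mol, oxidation_is_mol, pair_is_mol)
```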

chemprop.features.featurization.is_reaction(is_mol: bool = True) -> bool

Returns whether to use reactions as input.

chemprop.features.featurization.map_reac_to_prod(mol_reac: Mol, mol_prod: Mol)

Builds a dictionary mapping atom indices in the reactants to atom indices in the products.

Parameters:
  • mol_reac – An RDKit molecule of the reactants.

  • mol_prod – An RDKit molecule of the products.

Returns:

A dictionary of corresponding reactant and product atom indices.
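
The idea of pairing reactant and product atoms can be sketched with plain lists. This example assumes atoms are matched through atom-map numbers (as in "[CH3:1]"), with 0 meaning "unmapped"; the helper map_by_atom_map_number and its list-based inputs are hypothetical and stand in for RDKit molecules.

```python
from typing import Dict, List


def map_by_atom_map_number(
    reac_map_nums: List[int], prod_map_nums: List[int]
) -> Dict[int, int]:
    """Hypothetical helper: pair atom indices that share an atom-map number.

    Each list holds the atom-map number of the atom at that index;
    0 marks an unmapped atom and is never matched.
    """
    prod_index_of = {
        num: idx for idx, num in enumerate(prod_map_nums) if num > 0
    }
    return {
        reac_idx: prod_index_of[num]
        for reac_idx, num in enumerate(reac_map_nums)
        if num in prod_index_of
    }


# Reactant atoms 0, 1, 2 carry map numbers 1, 2, 3; the product lists
# the same atoms in a different order, plus one unmapped atom.
mapping = map_by_atom_map_number([1, 2, 3], [3, 1, 2, 0])
print(mapping)
```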

chemprop.features.featurization.mol2graph(mols: List[str] | List[Mol] | List[Tuple[Mol, Mol]], atom_features_batch: List[array] = (None,), bond_features_batch: List[array] = (None,), overwrite_default_atom_features: bool = False, overwrite_default_bond_features: bool = False) -> BatchMolGraph

Converts a list of SMILES or RDKit molecules to a BatchMolGraph containing the batch of molecular graphs.

Parameters:
  • mols – A list of SMILES or a list of RDKit molecules.

  • atom_features_batch – A list of 2D numpy arrays containing additional atom features to featurize the molecules.

  • bond_features_batch – A list of 2D numpy arrays containing additional bond features to featurize the molecules.

  • overwrite_default_atom_features – Boolean indicating whether to overwrite the default atom descriptors with the supplied atom descriptors instead of concatenating them.

  • overwrite_default_bond_features – Boolean indicating whether to overwrite the default bond descriptors with the supplied bond descriptors instead of concatenating them.

Returns:

A BatchMolGraph containing the combined molecular graph for the molecules.

chemprop.features.featurization.onek_encoding_unk(value: int, choices: List[int]) -> List[int]

Creates a one-hot encoding with an extra category for uncommon values.

Parameters:
  • value – The value for which the encoding should be one.

  • choices – A list of possible values.

Returns:

A one-hot encoding of the value in a list of length len(choices) + 1. If value is not in choices, then the final element in the encoding is 1.
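
The behavior described above can be reproduced in a few lines of plain Python. The function below is a re-implementation written for illustration under the name onek_encoding_unk_sketch; it follows the documented contract (length len(choices) + 1, final slot for unknown values) and is not copied from the library.

```python
from typing import List


def onek_encoding_unk_sketch(value: int, choices: List[int]) -> List[int]:
    """Illustrative one-hot encoding with a trailing 'unknown' slot."""
    encoding = [0] * (len(choices) + 1)
    # Index -1 addresses the final element, reserved for unknown values.
    index = choices.index(value) if value in choices else -1
    encoding[index] = 1
    return encoding


known = onek_encoding_unk_sketch(2, [1, 2, 3])
unknown = onek_encoding_unk_sketch(7, [1, 2, 3])
print(known, unknown)
```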

chemprop.features.featurization.reaction_mode() -> str

Returns the reaction mode.

chemprop.features.featurization.reset_featurization_parameters(logger: Logger | None = None) -> None

Resets the featurization parameter values to their defaults by replacing the parameters instance.

chemprop.features.featurization.set_adding_hs(adding_hs: bool) -> None

Sets whether Hs are added to RDKit molecules when they are constructed.

Parameters:

adding_hs – Boolean whether to add Hs to the molecule.

chemprop.features.featurization.set_explicit_h(explicit_h: bool) -> None

Sets whether RDKit molecules will be constructed with explicit Hs.

Parameters:

explicit_h – Boolean whether to keep explicit Hs from input.

chemprop.features.featurization.set_extra_atom_fdim(extra)

Change the dimensionality of the atom feature vector.

chemprop.features.featurization.set_extra_bond_fdim(extra)

Change the dimensionality of the bond feature vector.

chemprop.features.featurization.set_keeping_atom_map(keeping_atom_map: bool) -> None

Sets whether RDKit molecules keep the original atom mapping.

Parameters:

keeping_atom_map – Boolean whether to keep the original atom mapping.

chemprop.features.featurization.set_reaction(reaction: bool, mode: str) -> None

Sets whether to use a reaction or molecule as input and adapts feature dimensions.

Parameters:
  • reaction – Boolean whether to accept reactions as input.

  • mode – Reaction mode to construct atom and bond feature vectors.

Features Generators

Classes and functions from chemprop.features.features_generators.py. Features generators are used for computing additional molecule-level features that are appended after message passing.

chemprop.features.features_generators.get_available_features_generators() -> List[str]

Returns a list of names of available features generators.

chemprop.features.features_generators.get_features_generator(features_generator_name: str) -> Callable[[str | Mol], ndarray]

Gets a registered features generator by name.

Parameters:

features_generator_name – The name of the features generator.

Returns:

The desired features generator.

chemprop.features.features_generators.morgan_binary_features_generator(mol: str | Mol, radius: int = 2, num_bits: int = 2048) -> ndarray

Generates a binary Morgan fingerprint for a molecule.

Parameters:
  • mol – A molecule (i.e., either a SMILES or an RDKit molecule).

  • radius – Morgan fingerprint radius.

  • num_bits – Number of bits in Morgan fingerprint.

Returns:

A 1D numpy array containing the binary Morgan fingerprint.

chemprop.features.features_generators.morgan_counts_features_generator(mol: str | Mol, radius: int = 2, num_bits: int = 2048) -> ndarray

Generates a counts-based Morgan fingerprint for a molecule.

Parameters:
  • mol – A molecule (i.e., either a SMILES or an RDKit molecule).

  • radius – Morgan fingerprint radius.

  • num_bits – Number of bits in Morgan fingerprint.

Returns:

A 1D numpy array containing the counts-based Morgan fingerprint.

chemprop.features.features_generators.rdkit_2d_features_generator(mol: str | Mol) -> ndarray

Generates RDKit 2D features for a molecule.

Parameters:

mol – A molecule (i.e., either a SMILES or an RDKit molecule).

Returns:

A 1D numpy array containing the RDKit 2D features.

chemprop.features.features_generators.rdkit_2d_normalized_features_generator(mol: str | Mol) -> ndarray

Generates RDKit 2D normalized features for a molecule.

Parameters:

mol – A molecule (i.e., either a SMILES or an RDKit molecule).

Returns:

A 1D numpy array containing the RDKit 2D normalized features.

chemprop.features.features_generators.register_features_generator(features_generator_name: str) -> Callable[[Callable[[str | Mol], ndarray]], Callable[[str | Mol], ndarray]]

Creates a decorator which registers a features generator in a global dictionary to enable access by name.

Parameters:

features_generator_name – The name to use to access the features generator.

Returns:

A decorator which will add a features generator to the registry using the specified name.
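
The registry-by-decorator pattern described above can be sketched in plain Python. All names below (_REGISTRY, register_sketch, get_sketch, smiles_length_generator) are hypothetical, and the toy generator returns a list of floats computed from the SMILES string length instead of a numpy array of chemical features; it only shows how registration and lookup by name fit together.

```python
from typing import Callable, Dict, List

FeaturesGenerator = Callable[[str], List[float]]

# Hypothetical module-level registry mapping names to generators.
_REGISTRY: Dict[str, FeaturesGenerator] = {}


def register_sketch(name: str) -> Callable[[FeaturesGenerator], FeaturesGenerator]:
    """Return a decorator that stores the decorated function under name."""
    def decorator(generator: FeaturesGenerator) -> FeaturesGenerator:
        _REGISTRY[name] = generator
        return generator  # the function itself is left unchanged
    return decorator


def get_sketch(name: str) -> FeaturesGenerator:
    """Look up a previously registered generator by name."""
    if name not in _REGISTRY:
        raise ValueError(f'Features generator "{name}" could not be found.')
    return _REGISTRY[name]


@register_sketch("smiles_length")
def smiles_length_generator(mol: str) -> List[float]:
    # Toy feature: the number of characters in the SMILES string.
    return [float(len(mol))]


available = list(_REGISTRY.keys())
features = get_sketch("smiles_length")("CCO")
print(available, features)
```

Because the decorator returns the original function, a registered generator can still be called directly as well as retrieved by name.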

Utils

Classes and functions from chemprop.features.utils.py.

chemprop.features.utils.load_features(path: str) -> ndarray

Loads features saved in a variety of formats.

Supported formats:

  • .npz compressed (assumes features are saved with name “features”)

  • .npy

  • .csv / .txt (assumes comma-separated features with a header and with one line per molecule)

  • .pkl / .pckl / .pickle containing a sparse numpy array

Note

All formats assume that the SMILES loaded elsewhere in the code are in the same order as the features loaded here.

Parameters:

path – Path to a file containing features.

Returns:

A 2D numpy array of size (num_molecules, features_size) containing the features.
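
To make the .csv layout concrete, here is a minimal sketch that parses text in the documented shape: one header row, then one comma-separated line of features per molecule. The helper parse_features_csv and the column names are hypothetical, and the sketch uses only the standard library, returning nested lists where the library function returns a numpy array.

```python
import csv
import io
from typing import List


def parse_features_csv(text: str) -> List[List[float]]:
    """Hypothetical parser for the documented CSV layout."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    # One row per molecule; blank lines are ignored.
    return [[float(cell) for cell in row] for row in reader if row]


example = "feat_a,feat_b\n0.1,1.0\n0.2,2.0\n"
features = parse_features_csv(example)
print(len(features), len(features[0]))
```

The result has one row per molecule and one column per feature, matching the (num_molecules, features_size) shape described above. Row order matters, since rows are matched to SMILES by position.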

chemprop.features.utils.load_valid_atom_or_bond_features(path: str, smiles: List[str]) -> List[ndarray]

Loads features saved in a variety of formats.

Supported formats:

  • .npz in which the descriptors are saved as one 2D array per molecule, in the same order as the molecules in data.csv

  • .pkl / .pckl / .pickle containing a pandas dataframe with smiles as index and numpy array of descriptors as columns

  • .sdf containing all mol blocks with descriptors as entries

Parameters:

path – Path to file containing atomwise features.

Returns:

A list of 2D arrays.

chemprop.features.utils.save_features(path: str, features: List[ndarray]) -> None

Saves features to a compressed .npz file with array name “features”.

Parameters:
  • path – Path to a .npz file where the features will be saved.

  • features – A list of 1D numpy arrays containing the features for molecules.