Features

chemprop.features contains functions for featurizing molecules. This includes both atom/bond features used in message passing and additional molecule-level features appended after message passing.

Featurization

Classes and functions from chemprop.features.featurization.py. Featurization specifically includes computation of the atom and bond features used in message passing.

class chemprop.features.featurization.BatchMolGraph(mol_graphs: List[chemprop.features.featurization.MolGraph])

A BatchMolGraph represents the graph structure and featurization of a batch of molecules.

A BatchMolGraph contains the attributes of a MolGraph plus:

  • atom_fdim: The dimensionality of the atom feature vector.

  • bond_fdim: The dimensionality of the bond feature vector (technically the combined atom/bond features).

  • a_scope: A list of tuples indicating the start and end atom indices for each molecule.

  • b_scope: A list of tuples indicating the start and end bond indices for each molecule.

  • max_num_bonds: The maximum number of bonds neighboring an atom in this batch.

  • b2b: (Optional) A mapping from a bond index to incoming bond indices.

  • a2a: (Optional) A mapping from an atom index to neighboring atom indices.

Parameters

mol_graphs – A list of MolGraphs from which to construct the BatchMolGraph.
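As a rough illustration of how a_scope lets per-molecule slices be recovered from concatenated batch arrays, the sketch below stacks per-atom feature rows from several molecules and records one range per molecule. It is not the library's code: the helper name batch_atom_features is hypothetical, and the half-open (start, end) convention is an assumption based on the description above, so the exact tuple layout should be checked against the source.

```python
from typing import List, Tuple


def batch_atom_features(
    per_mol_features: List[List[List[float]]],
) -> Tuple[List[List[float]], List[Tuple[int, int]]]:
    """Concatenate per-molecule atom feature rows and record each molecule's range.

    Hypothetical helper: assumes half-open (start, end) ranges for a_scope.
    """
    f_atoms: List[List[float]] = []
    a_scope: List[Tuple[int, int]] = []
    for mol_features in per_mol_features:
        start = len(f_atoms)            # index of this molecule's first atom row
        f_atoms.extend(mol_features)    # append all of its atom rows
        a_scope.append((start, len(f_atoms)))
    return f_atoms, a_scope


# Two toy molecules with 2 and 3 atoms, each atom having 2 made-up features.
mol_a = [[1.0, 0.0], [0.0, 1.0]]
mol_b = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

f_atoms, a_scope = batch_atom_features([mol_a, mol_b])

# Recover the rows that belong to the second molecule.
start, end = a_scope[1]
second_molecule_rows = f_atoms[start:end]
```

The same bookkeeping would apply to b_scope, with bond rows in place of atom rows.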

get_a2a() → torch.LongTensor

Computes (if necessary) and returns a mapping from each atom index to all neighboring atom indices.

Returns

A PyTorch tensor containing the mapping from each atom index to all the neighboring atom indices.

get_b2b() → torch.LongTensor

Computes (if necessary) and returns a mapping from each bond index to all the incoming bond indices.

Returns

A PyTorch tensor containing the mapping from each bond index to all the incoming bond indices.

get_components(atom_messages: bool = False) → Tuple[torch.FloatTensor, torch.FloatTensor, torch.LongTensor, torch.LongTensor, torch.LongTensor, List[Tuple[int, int]], List[Tuple[int, int]]]

Returns the components of the BatchMolGraph.

The returned components are, in order:

  • f_atoms

  • f_bonds

  • a2b

  • b2a

  • b2revb

  • a_scope

  • b_scope

Parameters

atom_messages – Whether to use atom messages instead of bond messages. This changes the bond feature vector to contain only bond features rather than both atom and bond features.

Returns

A tuple containing PyTorch tensors with the atom features, bond features, graph structure, and scope of the atoms and bonds (i.e., the indices of the molecules they belong to).

class chemprop.features.featurization.Featurization_parameters

A class holding molecule featurization parameters as attributes.

class chemprop.features.featurization.MolGraph(mol: Union[str, rdkit.Chem.rdchem.Mol, Tuple[rdkit.Chem.rdchem.Mol, rdkit.Chem.rdchem.Mol]], atom_features_extra: Optional[numpy.ndarray] = None, bond_features_extra: Optional[numpy.ndarray] = None, overwrite_default_atom_features: bool = False, overwrite_default_bond_features: bool = False)

A MolGraph represents the graph structure and featurization of a single molecule.

A MolGraph computes the following attributes:

  • n_atoms: The number of atoms in the molecule.

  • n_bonds: The number of bonds in the molecule.

  • f_atoms: A mapping from an atom index to a list of atom features.

  • f_bonds: A mapping from a bond index to a list of bond features.

  • a2b: A mapping from an atom index to a list of incoming bond indices.

  • b2a: A mapping from a bond index to the index of the atom the bond originates from.

  • b2revb: A mapping from a bond index to the index of the reverse bond.

  • overwrite_default_atom_features: A boolean to overwrite default atom descriptors.

  • overwrite_default_bond_features: A boolean to overwrite default bond descriptors.

Parameters
  • mol – A SMILES or an RDKit molecule.

  • atom_features_extra – A 2D numpy array containing additional atom features used to featurize the molecule.

  • bond_features_extra – A 2D numpy array containing additional bond features used to featurize the molecule.

  • overwrite_default_atom_features – Whether to replace the default atom features with atom_features_extra instead of concatenating them.

  • overwrite_default_bond_features – Whether to replace the default bond features with bond_features_extra instead of concatenating them.
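To make the a2b, b2a, and b2revb attributes concrete, here is a minimal pure-Python sketch that builds them for a toy three-atom chain. It assumes each chemical bond is stored as two directed bonds, as the "reverse bond" attribute suggests; the function name build_graph_maps and the zero-based indexing are illustrative choices, not the library's implementation.

```python
from typing import List, Tuple


def build_graph_maps(
    n_atoms: int, undirected_bonds: List[Tuple[int, int]]
) -> Tuple[List[List[int]], List[int], List[int]]:
    """Build a2b, b2a and b2revb for a molecule given its undirected bonds.

    Hypothetical helper: every undirected bond (a1, a2) becomes two directed
    bonds, a1 -> a2 and a2 -> a1, stored at consecutive indices.
    """
    a2b: List[List[int]] = [[] for _ in range(n_atoms)]  # atom -> incoming bonds
    b2a: List[int] = []                                  # bond -> origin atom
    b2revb: List[int] = []                               # bond -> reverse bond
    for a1, a2 in undirected_bonds:
        forward = len(b2a)       # directed bond a1 -> a2
        backward = forward + 1   # directed bond a2 -> a1
        b2a.append(a1)
        a2b[a2].append(forward)  # forward bond arrives at a2
        b2a.append(a2)
        a2b[a1].append(backward)  # backward bond arrives at a1
        b2revb.append(backward)
        b2revb.append(forward)
    return a2b, b2a, b2revb


# A three-atom chain 0 - 1 - 2 (for example the heavy atoms of ethanol).
a2b, b2a, b2revb = build_graph_maps(3, [(0, 1), (1, 2)])
```

In this toy graph the middle atom receives two incoming bonds, one from each neighbor, and every bond's reverse is its partner in the pair.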

chemprop.features.featurization.atom_features(atom: rdkit.Chem.rdchem.Atom, functional_groups: Optional[List[int]] = None) → List[Union[bool, int, float]]

Builds a feature vector for an atom.

Parameters
  • atom – An RDKit atom.

  • functional_groups – A k-hot vector indicating the functional groups the atom belongs to.

Returns

A list containing the atom features.

chemprop.features.featurization.atom_features_zeros(atom: rdkit.Chem.rdchem.Atom) → List[Union[bool, int, float]]

Builds a feature vector for an atom containing only the atomic number information.

Parameters

atom – An RDKit atom.

Returns

A list containing the atom features.

chemprop.features.featurization.bond_features(bond: rdkit.Chem.rdchem.Bond) → List[Union[bool, int, float]]

Builds a feature vector for a bond.

Parameters

bond – An RDKit bond.

Returns

A list containing the bond features.

chemprop.features.featurization.get_atom_fdim(overwrite_default_atom: bool = False) → int

Gets the dimensionality of the atom feature vector.

Parameters

overwrite_default_atom – Whether to overwrite the default atom descriptors

Returns

The dimensionality of the atom feature vector.

chemprop.features.featurization.get_bond_fdim(atom_messages: bool = False, overwrite_default_bond: bool = False, overwrite_default_atom: bool = False) → int

Gets the dimensionality of the bond feature vector.

Parameters
  • atom_messages – Whether atom messages are being used. If atom messages are used, then the bond feature vector only contains bond features. Otherwise it contains both atom and bond features.

  • overwrite_default_bond – Whether to overwrite the default bond descriptors

  • overwrite_default_atom – Whether to overwrite the default atom descriptors

Returns

The dimensionality of the bond feature vector.
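The effect of atom_messages on the bond feature dimensionality can be sketched as simple arithmetic. The function below is a hypothetical stand-in, not the library's code, and the sizes 10 and 4 are made-up numbers chosen only to show the relationship described above.

```python
def combined_bond_fdim(atom_fdim: int, raw_bond_fdim: int, atom_messages: bool = False) -> int:
    """Return the size of the bond feature vector passed to message passing.

    Hypothetical helper: with atom messages the vector holds bond features
    only; otherwise atom features and bond features are combined.
    """
    if atom_messages:
        return raw_bond_fdim
    return atom_fdim + raw_bond_fdim


# Made-up sizes for illustration only.
ATOM_FDIM = 10
RAW_BOND_FDIM = 4

bond_message_dim = combined_bond_fdim(ATOM_FDIM, RAW_BOND_FDIM)
atom_message_dim = combined_bond_fdim(ATOM_FDIM, RAW_BOND_FDIM, atom_messages=True)
```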

chemprop.features.featurization.is_explicit_h() → bool

Returns whether to retain explicit Hs.

chemprop.features.featurization.is_reaction() → bool

Returns whether to use reactions as input

chemprop.features.featurization.map_reac_to_prod(mol_reac: rdkit.Chem.rdchem.Mol, mol_prod: rdkit.Chem.rdchem.Mol)

Builds a dictionary mapping atom indices in the reactants to the corresponding atom indices in the products.

Parameters
  • mol_reac – An RDKit molecule of the reactants.

  • mol_prod – An RDKit molecule of the products.

Returns

A dictionary of corresponding reactant and product atom indices.
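The shape of such a mapping can be illustrated without RDKit. The sketch below assumes that corresponding atoms are identified by shared atom-map numbers, which this page does not state, and it represents each molecule simply as a list of map numbers ordered by atom index. The function name map_reac_to_prod_sketch is hypothetical.

```python
from typing import Dict, List


def map_reac_to_prod_sketch(reac_map_nums: List[int], prod_map_nums: List[int]) -> Dict[int, int]:
    """Map reactant atom indices to product atom indices via atom-map numbers.

    Assumption: atoms correspond when they carry the same atom-map number.
    Each list holds the map number of the atom at that list position.
    """
    prod_map_to_idx = {map_num: idx for idx, map_num in enumerate(prod_map_nums)}
    return {
        reac_idx: prod_map_to_idx[map_num]
        for reac_idx, map_num in enumerate(reac_map_nums)
        if map_num in prod_map_to_idx  # skip atoms that do not appear in the products
    }


# Reactant atoms 0, 1, 2 carry map numbers 1, 2, 3; the products list them reordered.
reac_to_prod = map_reac_to_prod_sketch([1, 2, 3], [3, 1, 2])
```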

chemprop.features.featurization.mol2graph(mols: Union[List[str], List[rdkit.Chem.rdchem.Mol], List[Tuple[rdkit.Chem.rdchem.Mol, rdkit.Chem.rdchem.Mol]]], atom_features_batch: List[numpy.array] = (None,), bond_features_batch: List[numpy.array] = (None,), overwrite_default_atom_features: bool = False, overwrite_default_bond_features: bool = False) → chemprop.features.featurization.BatchMolGraph

Converts a list of SMILES or RDKit molecules to a BatchMolGraph containing the batch of molecular graphs.

Parameters
  • mols – A list of SMILES or a list of RDKit molecules.

  • atom_features_batch – A list of 2D numpy arrays containing additional atom features used to featurize the molecules.

  • bond_features_batch – A list of 2D numpy arrays containing additional bond features used to featurize the molecules.

  • overwrite_default_atom_features – Whether to replace the default atom features with the additional atom features instead of concatenating them.

  • overwrite_default_bond_features – Whether to replace the default bond features with the additional bond features instead of concatenating them.

Returns

A BatchMolGraph containing the combined molecular graph for the molecules.

chemprop.features.featurization.onek_encoding_unk(value: int, choices: List[int]) → List[int]

Creates a one-hot encoding with an extra category for uncommon values.

Parameters
  • value – The value for which the encoding should be one.

  • choices – A list of possible values.

Returns

A one-hot encoding of the value in a list of length len(choices) + 1. If value is not in choices, then the final element in the encoding is 1.
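The behavior described above can be reproduced in a few lines of plain Python. This is a sketch written from the description, not a copy of the library's function; the name onek_encoding_unk_sketch and the example atomic numbers are illustrative.

```python
from typing import List


def onek_encoding_unk_sketch(value: int, choices: List[int]) -> List[int]:
    """One-hot encode value over choices, with a final slot for unknown values."""
    encoding = [0] * (len(choices) + 1)
    # Values outside choices fall into the last ("unknown") position.
    index = choices.index(value) if value in choices else -1
    encoding[index] = 1
    return encoding


# Atomic numbers for C, N and O as the known choices.
known = onek_encoding_unk_sketch(7, [6, 7, 8])
unknown = onek_encoding_unk_sketch(16, [6, 7, 8])
```

Reserving one extra position keeps the vector length fixed while still giving rare values a consistent encoding.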

chemprop.features.featurization.reaction_mode() → str

Returns the reaction mode

chemprop.features.featurization.reset_featurization_parameters(logger: Optional[logging.Logger] = None) → None

Resets the featurization parameter values to their defaults by replacing the parameters instance.

chemprop.features.featurization.set_explicit_h(explicit_h: bool) → None

Sets whether RDKit molecules will be constructed with explicit Hs.

Parameters

explicit_h – Whether to keep explicit Hs from the input.

chemprop.features.featurization.set_extra_atom_fdim(extra)

Changes the dimensionality of the atom feature vector.

chemprop.features.featurization.set_extra_bond_fdim(extra)

Changes the dimensionality of the bond feature vector.

chemprop.features.featurization.set_reaction(reaction: bool, mode: str) → None

Sets whether to use a reaction or molecule as input and adapts feature dimensions.

Parameters
  • reaction – Whether to accept reactions as input.

  • mode – Reaction mode to construct atom and bond feature vectors.

Features Generators

Classes and functions from chemprop.features.features_generators.py. Features generators are used for computing additional molecule-level features that are appended after message passing.

chemprop.features.features_generators.get_available_features_generators() → List[str]

Returns a list of names of available features generators.

chemprop.features.features_generators.get_features_generator(features_generator_name: str) → Callable[[Union[str, rdkit.Chem.rdchem.Mol]], numpy.ndarray]

Gets a registered features generator by name.

Parameters

features_generator_name – The name of the features generator.

Returns

The desired features generator.

chemprop.features.features_generators.morgan_binary_features_generator(mol: Union[str, rdkit.Chem.rdchem.Mol], radius: int = 2, num_bits: int = 2048) → numpy.ndarray

Generates a binary Morgan fingerprint for a molecule.

Parameters
  • mol – A molecule (i.e., either a SMILES or an RDKit molecule).

  • radius – Morgan fingerprint radius.

  • num_bits – Number of bits in Morgan fingerprint.

Returns

A 1D numpy array containing the binary Morgan fingerprint.

chemprop.features.features_generators.morgan_counts_features_generator(mol: Union[str, rdkit.Chem.rdchem.Mol], radius: int = 2, num_bits: int = 2048) → numpy.ndarray

Generates a counts-based Morgan fingerprint for a molecule.

Parameters
  • mol – A molecule (i.e., either a SMILES or an RDKit molecule).

  • radius – Morgan fingerprint radius.

  • num_bits – Number of bits in Morgan fingerprint.

Returns

A 1D numpy array containing the counts-based Morgan fingerprint.
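The difference between the binary and counts-based variants can be shown with a toy folding routine. The sketch below does not compute real Morgan environments: the integer substructure identifiers are made up, and folding by modulo is a simplifying assumption about how hashed fingerprints place identifiers into a fixed-length vector.

```python
from typing import List


def fold_fingerprint(substructure_ids: List[int], num_bits: int, counts: bool = False) -> List[int]:
    """Fold substructure identifiers into a fixed-length vector.

    Toy illustration: binary mode marks presence, counts mode tallies occurrences.
    """
    vector = [0] * num_bits
    for sub_id in substructure_ids:
        position = sub_id % num_bits  # assumed folding rule for this sketch
        if counts:
            vector[position] += 1     # count every occurrence
        else:
            vector[position] = 1      # record presence only
    return vector


# Made-up identifiers; 3 and 11 collide in an 8-bit vector, and 3 occurs twice.
ids = [3, 11, 3, 20]
binary_fp = fold_fingerprint(ids, num_bits=8)
counts_fp = fold_fingerprint(ids, num_bits=8, counts=True)
```

Both vectors have the same length, but only the counts-based one preserves how often each position was hit.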

chemprop.features.features_generators.rdkit_2d_features_generator(mol: Union[str, rdkit.Chem.rdchem.Mol]) → numpy.ndarray

Generates RDKit 2D features for a molecule.

Parameters

mol – A molecule (i.e., either a SMILES or an RDKit molecule).

Returns

A 1D numpy array containing the RDKit 2D features.

chemprop.features.features_generators.rdkit_2d_normalized_features_generator(mol: Union[str, rdkit.Chem.rdchem.Mol]) → numpy.ndarray

Generates RDKit 2D normalized features for a molecule.

Parameters

mol – A molecule (i.e., either a SMILES or an RDKit molecule).

Returns

A 1D numpy array containing the RDKit 2D normalized features.

chemprop.features.features_generators.register_features_generator(features_generator_name: str) → Callable[[Callable[[Union[str, rdkit.Chem.rdchem.Mol]], numpy.ndarray]], Callable[[Union[str, rdkit.Chem.rdchem.Mol]], numpy.ndarray]]

Creates a decorator which registers a features generator in a global dictionary to enable access by name.

Parameters

features_generator_name – The name to use to access the features generator.

Returns

A decorator which will add a features generator to the registry using the specified name.
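The registry pattern described here (a decorator factory that stores functions in a global dictionary, plus lookup by name) can be sketched in plain Python. All names below are hypothetical stand-ins that mirror the three functions documented above, and the toy generator works on SMILES strings and returns a list rather than a numpy array so the example has no dependencies.

```python
from typing import Callable, Dict, List

FeaturesGenerator = Callable[[str], List[float]]

# Global registry mapping a name to a features generator (hypothetical).
_REGISTRY: Dict[str, FeaturesGenerator] = {}


def register(name: str) -> Callable[[FeaturesGenerator], FeaturesGenerator]:
    """Create a decorator that stores the decorated function under name."""
    def decorator(generator: FeaturesGenerator) -> FeaturesGenerator:
        _REGISTRY[name] = generator
        return generator  # return the function unchanged so it stays callable
    return decorator


def get(name: str) -> FeaturesGenerator:
    """Look up a registered generator by name."""
    if name not in _REGISTRY:
        raise ValueError(f'Features generator "{name}" is not registered.')
    return _REGISTRY[name]


def available() -> List[str]:
    """List the names of all registered generators."""
    return list(_REGISTRY.keys())


@register("smiles_length")
def smiles_length_generator(mol: str) -> List[float]:
    """Toy generator: a single feature equal to the SMILES string length."""
    return [float(len(mol))]


features = get("smiles_length")("CCO")
```

Registering by name lets a generator be selected from a configuration value or command-line string without importing the function directly.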

Utils

Classes and functions from chemprop.features.utils.py.

chemprop.features.utils.load_features(path: str) → numpy.ndarray

Loads features saved in a variety of formats.

Supported formats:

  • .npz compressed (assumes features are saved with name “features”)

  • .npy

  • .csv / .txt (assumes comma-separated features with a header and with one line per molecule)

  • .pkl / .pckl / .pickle containing a sparse numpy array

Note

All formats assume that the SMILES loaded elsewhere in the code are in the same order as the features loaded here.

Parameters

path – Path to a file containing features.

Returns

A 2D numpy array of size (num_molecules, features_size) containing the features.
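For the .csv / .txt case, the expected file layout is a header line followed by one comma-separated row of features per molecule. The sketch below parses such text with the standard library to show that layout. It is not how load_features is implemented, the column names are made up, and it returns nested lists instead of a numpy array.

```python
import csv
import io
from typing import List

# Made-up file contents: a header, then one row of features per molecule.
csv_text = "feat_a,feat_b,feat_c\n0.1,0.2,0.3\n1.0,2.0,3.0\n"


def parse_features_csv(text: str) -> List[List[float]]:
    """Parse comma-separated features with a header, one line per molecule."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    return [[float(cell) for cell in row] for row in reader]


features = parse_features_csv(csv_text)
num_molecules = len(features)
features_size = len(features[0])
```

Because no molecule identifiers are stored in the file, row order is the only link between features and SMILES, which is why the note above about matching order matters.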

chemprop.features.utils.load_valid_atom_or_bond_features(path: str, smiles: List[str]) → List[numpy.ndarray]

Loads features saved in a variety of formats.

Supported formats:

  • .npz: descriptors are saved as one 2D array per molecule, in the same order as the molecules in data.csv

  • .pkl / .pckl / .pickle containing a pandas DataFrame with SMILES as the index and numpy arrays of descriptors as columns

  • .sdf containing all mol blocks with descriptors as entries

Parameters

path – Path to file containing atomwise features.

Returns

A list of 2D arrays.

chemprop.features.utils.save_features(path: str, features: List[numpy.ndarray]) → None

Saves features to a compressed .npz file with array name “features”.

Parameters
  • path – Path to a .npz file where the features will be saved.

  • features – A list of 1D numpy arrays containing the features for molecules.