Atom featurizers#
[1]:
from chemprop.featurizers.atom import MultiHotAtomFeaturizer
This is an example atom to featurize.
[2]:
from rdkit import Chem
atom_to_featurize = Chem.MolFromSmiles("CC").GetAtoms()[0]
Atom features#
The following atom features are generated by RDKit and cast to one-hot vectors (except for mass, which is divided by 100). These one-hot vectors are concatenated into a single multi-hot feature vector, with the scaled mass appended as a final floating-point entry. Every feature except aromaticity and mass is padded with one extra bit that is set for any unknown value.
atomic number
degree
formal charge
chiral tag
number of hydrogens
hybridization
aromaticity
mass
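The encoding described above can be illustrated without RDKit or Chemprop. The following is a minimal sketch in plain Python, not Chemprop's implementation: the helper `one_hot_with_unknown` and the tiny choice lists are hypothetical, and placing the unknown slot last in each block is an assumption (it is consistent with the outputs shown on this page).

```python
def one_hot_with_unknown(value, choices):
    """Encode `value` as a one-hot list with one extra trailing slot for unknowns."""
    bits = [0.0] * (len(choices) + 1)
    # Known values set their own slot; anything else falls into the last slot.
    index = choices.index(value) if value in choices else len(choices)
    bits[index] = 1.0
    return bits


# Hypothetical, deliberately small choice lists (not the ones Chemprop uses).
atomic_nums = [1, 6, 7, 8]
formal_charges = [-1, 0, 1]

# A neutral, non-aromatic carbon atom with mass 12.011.
features = (
    one_hot_with_unknown(6, atomic_nums)
    + one_hot_with_unknown(0, formal_charges)
    + [0.0]            # aromaticity: a single bit, no unknown slot
    + [12.011 / 100]   # mass: a scaled float, not one-hot
)

# An element outside the choice list (iron, 26) lands in the padding slot.
unknown_element = one_hot_with_unknown(26, atomic_nums)
```

Several entries of `features` are set at once (one per one-hot block), which is why the combined vector is called multi-hot.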
v2#
The v2 atom featurizer is the default. It provides bits in the feature vector for:
atomic number
first four rows of the periodic table plus iodine
degree
0 bonds - 5 bonds
formal charge
-2, -1, 0, 1, 2
chiral tag
0, 1, 2, 3 (see rdkit.Chem.rdchem.ChiralType for more details)
number of hydrogens
0 - 4
hybridization
S, SP, SP2, SP2D, SP3, SP3D, SP3D2
[3]:
featurizer = MultiHotAtomFeaturizer.v2()
featurizer(atom_to_featurize)
[3]:
array([0. , 0. , 0. , 0. , 0. , 1. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
1. , 0. , 0. , 0. , 0. , 0. , 0. ,
1. , 0. , 1. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 1. , 0. , 0. , 0. ,
0. , 0. , 0. , 1. , 0. , 0. , 0. ,
0. , 0.12011])
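The length of the vector printed above can be cross-checked against the v2 feature list. This is only an arithmetic sketch: the dictionary `v2_choices` is a hypothetical tally of the choices listed for v2, assuming one extra unknown slot per one-hot block and one entry each for aromaticity and mass.

```python
# Number of listed choices per one-hot feature for the v2 featurizer.
v2_choices = {
    "atomic number": 36 + 1,   # first four rows (elements 1-36) plus iodine
    "degree": 6,               # 0 through 5
    "formal charge": 5,        # -2 through 2
    "chiral tag": 4,           # 0 through 3
    "number of hydrogens": 5,  # 0 through 4
    "hybridization": 7,        # S, SP, SP2, SP2D, SP3, SP3D, SP3D2
}

# Each one-hot block carries one extra slot for unknown values.
one_hot_bits = sum(n + 1 for n in v2_choices.values())

# Aromaticity and mass contribute one entry each, with no unknown slot.
expected_length = one_hot_bits + 1 + 1
print(expected_length)  # 72
```

The result matches the 72 entries in the printed array (ten rows of seven values plus a final row of two).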
v1#
The v1 atom featurizer is the one used in Chemprop v1. It is the same as the v2 atom featurizer except for:
atomic number
first 100 elements (customizable)
hybridization
SP, SP2, SP3, SP3D, SP3D2
[4]:
featurizer = MultiHotAtomFeaturizer.v1()
featurizer(atom_to_featurize)
[4]:
array([0. , 0. , 0. , 0. , 0. , 1. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
1. , 0. , 0. , 0. , 0. , 0. , 0. ,
1. , 0. , 1. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 1. , 0. , 0. , 0. ,
0. , 1. , 0. , 0. , 0. , 0. , 0.12011])
[5]:
featurizer = MultiHotAtomFeaturizer.v1(max_atomic_num=53)
featurizer(atom_to_featurize)
[5]:
array([0. , 0. , 0. , 0. , 0. , 1. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 1. , 0. , 0. , 0. , 0. ,
0. , 0. , 1. , 0. , 1. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 1. , 0. ,
0. , 0. , 0. , 1. , 0. , 0. , 0. ,
0. , 0.12011])
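The effect of max_atomic_num on the vector size can be worked out by hand. The sketch below uses a hypothetical helper, `v1_length`, and assumes that the atomic number block covers elements 1 through max_atomic_num plus one unknown slot, with the remaining blocks sized from the feature lists on this page.

```python
def v1_length(max_atomic_num=100):
    """Expected v1 feature vector length (an arithmetic sketch, not a Chemprop API)."""
    atomic_bits = max_atomic_num + 1  # elements 1..max_atomic_num plus unknown
    other_one_hot = (
        (6 + 1)    # degree: 0 through 5, plus unknown
        + (5 + 1)  # formal charge: -2 through 2, plus unknown
        + (4 + 1)  # chiral tag: 0 through 3, plus unknown
        + (5 + 1)  # number of hydrogens: 0 through 4, plus unknown
        + (5 + 1)  # hybridization: SP, SP2, SP3, SP3D, SP3D2, plus unknown
    )
    return atomic_bits + other_one_hot + 2  # aromaticity and mass


print(v1_length())    # 133
print(v1_length(53))  # 86
```

Both values agree with the lengths of the two v1 arrays printed above, so lowering max_atomic_num from 100 to 53 removes 47 entries from the atomic number block.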
organic#
The organic atom featurizer is optimized to reduce feature vector size for organic molecules. It is the same as the v2 atom featurizer except for:
atomic number
H, B, C, N, O, F, Si, P, S, Cl, Br, and I atoms
hybridization
S, SP, SP2, SP3
[6]:
featurizer = MultiHotAtomFeaturizer.organic()
featurizer(atom_to_featurize)
[6]:
array([0. , 0. , 1. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 1. , 0. , 0. , 0. ,
0. , 0. , 0. , 1. , 0. , 1. , 0. ,
0. , 0. , 0. , 0. , 0. , 0. , 1. ,
0. , 0. , 0. , 0. , 0. , 1. , 0. ,
0. , 0.12011])
Custom#
Custom atom featurizers can also be created by specifying the choices for atomic number, degree, formal charge, chiral tag, number of hydrogens, and hybridization. Aromaticity is always featurized as True/False.
[7]:
from rdkit.Chem.rdchem import HybridizationType
atomic_nums = [1, 6, 7, 8]
degrees = [0, 1, 2, 3, 4]
formal_charges = [-2, -1, 0, 1, 2]
chiral_tags = [0, 1, 2, 3]
num_Hs = [0, 1, 2, 3, 4]
hybridizations = [HybridizationType.SP, HybridizationType.SP2, HybridizationType.SP3]
featurizer = MultiHotAtomFeaturizer(
    atomic_nums, degrees, formal_charges, chiral_tags, num_Hs, hybridizations
)
featurizer(atom_to_featurize)
[7]:
array([0. , 1. , 0. , 0. , 0. , 0. , 0. ,
0. , 0. , 1. , 0. , 0. , 0. , 1. ,
0. , 0. , 0. , 1. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. , 1. , 0. , 0. ,
0. , 0. , 1. , 0. , 0. , 0.12011])
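To see how the custom choices map onto the printed vector, it can be split back into its blocks. This is a hedged sketch in plain Python: it assumes the blocks appear in the order the choices were passed, that each block ends with its unknown slot, and it uses strings in place of the HybridizationType members so that it runs without RDKit.

```python
# The vector printed above for the first carbon of "CC".
vector = [
    0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0,
    0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0,
    0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,
    0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0,
    0.0, 0.0, 1.0, 0.0, 0.0, 0.12011,
]

# The same choices as in the cell above (hybridizations written as strings).
choices = {
    "atomic number": [1, 6, 7, 8],
    "degree": [0, 1, 2, 3, 4],
    "formal charge": [-2, -1, 0, 1, 2],
    "chiral tag": [0, 1, 2, 3],
    "number of hydrogens": [0, 1, 2, 3, 4],
    "hybridization": ["SP", "SP2", "SP3"],
}

decoded = {}
start = 0
for name, options in choices.items():
    width = len(options) + 1  # listed choices plus the unknown slot
    block = vector[start : start + width]
    hot = block.index(1.0)    # position of the single set bit in this block
    decoded[name] = options[hot] if hot < len(options) else "unknown"
    start += width

# The last two entries are the aromaticity bit and the scaled mass.
decoded["aromatic"] = bool(vector[start])
decoded["mass"] = vector[start + 1] * 100
```

Under these assumptions the vector decodes to atomic number 6, formal charge 0, three hydrogens, and SP3 hybridization. The degree decodes to 4, which suggests that bonds to hydrogens are counted.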
Generic#
Any class whose instances have a length and return a numpy array when called with an rdkit.Chem.rdchem.Atom can be used as an atom featurizer.
[8]:
from rdkit.Chem.rdchem import Atom
import numpy as np

class MyAtomFeaturizer:
    def __len__(self):
        return 1

    def __call__(self, a: Atom):
        return np.array([a.GetAtomicNum()], dtype=float)

featurizer = MyAtomFeaturizer()
featurizer(atom_to_featurize)
[8]:
array([6.])