What is a fingerprint?

A fingerprint is a numerical representation of a molecule. It consists of a set of numbers that usually describe properties of the molecule, such as physicochemical properties, composition, topological features, or substructures. Once the fingerprints of a set of molecules have been computed, they can be used for a variety of calculations, such as similarity searching or model building.

What is MQN?

MQN stands for Molecular Quantum Numbers and is one of the fingerprints developed in our group. MQN represents a molecule with 42 integer-valued descriptors of molecular structure, which count different types of atoms, bonds, polar groups, and topological features.

Reference: Classification of Organic Molecules by Molecular Quantum Numbers. K. T. Nguyen, L. C. Blum, R. van Deursen, J.-L. Reymond, ChemMedChem 2009, 4, 1803-1805.

What is Xfp?

Xfp is a topological pharmacophore fingerprint recently developed in our group. Construction of the Xfp of a molecule begins with the assignment of its atoms to four groups defined by generalized pharmacophoric atom types, namely hydrophobic (H), hydrogen bond acceptor (HBA), hydrogen bond donor (HBD), and sp2-hybridized (Sp2) atoms. After this classification, Xfp counts the occurrences of atom pairs of five types (H-H, HBA-HBA, HBD-HBD, HBA-HBD, and Sp2-Sp2) at topological distances ranging from 0 to 10, which results in a 55-dimensional (5 x 11) fingerprint.
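The counting step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the group's implementation: the molecule is given as a hypothetical adjacency list of heavy atoms, the category labels are assigned by hand instead of by real atom typing rules, an atom may carry several labels, and a pair at distance 0 is taken to mean an atom paired with itself.

```python
from collections import deque

# The five pair types and the distance range named in the text above.
PAIR_TYPES = [("H", "H"), ("HBA", "HBA"), ("HBD", "HBD"), ("HBA", "HBD"), ("SP2", "SP2")]
MAX_DIST = 10


def topological_distances(adjacency, start):
    """Breadth-first search: number of bonds from `start` to every reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nbr in adjacency[atom]:
            if nbr not in dist:
                dist[nbr] = dist[atom] + 1
                queue.append(nbr)
    return dist


def atom_pair_counts(adjacency, categories):
    """Return a 5 x 11 = 55-element list of category-pair counts.

    adjacency:  {atom_index: [neighbour indices]}   (hypothetical input format)
    categories: {atom_index: set of labels among "H", "HBA", "HBD", "SP2"}
    """
    width = MAX_DIST + 1
    counts = [0] * (len(PAIR_TYPES) * width)
    atoms = sorted(adjacency)
    for i in atoms:
        dist = topological_distances(adjacency, i)
        for j in atoms:
            # Visit each unordered pair once; i == j is the distance-0 case (assumption).
            if j < i or j not in dist or dist[j] > MAX_DIST:
                continue
            for k, (a, b) in enumerate(PAIR_TYPES):
                forward = a in categories[i] and b in categories[j]
                backward = a in categories[j] and b in categories[i]
                if forward or backward:
                    counts[k * width + dist[j]] += 1
    return counts


# Toy example: the heavy atoms of ethanol, C0-C1-O2, with hand-assigned labels.
ethanol_bonds = {0: [1], 1: [0, 2], 2: [1]}
ethanol_labels = {0: {"H"}, 1: {"H"}, 2: {"HBA", "HBD"}}
ethanol_counts = atom_pair_counts(ethanol_bonds, ethanol_labels)
```

In this layout, entry k * 11 + d holds the count for pair type k at topological distance d.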

Reference: Atom Pair 2D-Fingerprints Perceive 3D-Molecular Shape and Pharmacophores for Very Fast Virtual Screening of ZINC and GDB-17. M. Awale, J.-L. Reymond, J. Chem. Inf. Model. 2014, 54, 1892-1907.

What is ECfp4?

Extended-connectivity fingerprints (ECFPs) are circular topological fingerprints designed for molecular characterization, similarity searching, and structure-activity modeling. They are among the most popular similarity search tools in drug discovery and are used effectively in a wide variety of applications. ECFPs generate substructure patterns by considering the environment of each atom in successive circular layers up to a given diameter, and encode these patterns onto a bit string, here of length 1024 (the length is variable). In our browser we use ECFP with a diameter of 4 (ECfp4).
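The idea of growing circular layers around each atom and folding the result onto a bit string can be illustrated with a deliberately simplified Python sketch. It is not the published ECFP algorithm: the initial atom description (element and number of neighbours), the CRC32 hash, the input format, and the omission of duplicate-substructure removal are all assumptions made to keep the example short.

```python
import zlib


def stable_hash(obj):
    """Deterministic 32-bit hash of a Python literal (built-in hash() of str varies per process)."""
    return zlib.crc32(repr(obj).encode("utf-8"))


def circular_fingerprint(atoms, bonds, radius=2, n_bits=1024):
    """Return the set of bit positions switched on for a molecule.

    atoms: list of element symbols, e.g. ["C", "C", "O"]        (hypothetical input format)
    bonds: list of (i, j, order) tuples over atom indices
    radius=2 corresponds to a diameter of 4 bonds.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbours[i].append((order, j))
        neighbours[j].append((order, i))

    # Layer 0: each atom is described by its element and its number of neighbours.
    ids = [stable_hash((atoms[i], len(neighbours[i]))) for i in range(len(atoms))]
    features = set(ids)

    # Each further layer merges an atom's identifier with those of its direct
    # neighbours, so after `radius` rounds an identifier describes an
    # environment of that radius around the atom.
    for layer in range(1, radius + 1):
        new_ids = []
        for i in range(len(atoms)):
            environment = sorted((order, ids[j]) for order, j in neighbours[i])
            new_ids.append(stable_hash((layer, ids[i], environment)))
        ids = new_ids
        features.update(ids)

    # Fold the 32-bit identifiers onto a fixed-length bit string.
    return {f % n_bits for f in features}


# Toy example: ethanol written with two different atom orderings.
ethanol_a = circular_fingerprint(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
ethanol_b = circular_fingerprint(["O", "C", "C"], [(0, 1, 1), (1, 2, 1)])
```

Because neighbour identifiers are sorted before hashing, the resulting set of bits does not depend on the order in which the atoms are listed.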

Reference: Extended-Connectivity Fingerprints. Rogers, D.; Hahn, M. J. Chem. Inf. Model. 2010, 50(5), 742-754.

What is ligand-based target prediction?

Ligand-based target prediction rests on the principle that similar molecules are likely to have similar bioactivities and to bind similar target proteins. The targets of a given query molecule can therefore be predicted by comparing the query to known bioactive ligands. In this approach, each target protein is represented by its set of known bioactive ligands. The targets of a query molecule are then predicted either a) by ranking the targets in the database according to the similarity of the query to its nearest neighbor associated with each target, or b) by using a machine learning or statistical model built on the reference database.

What are the nearest neighbors of a query molecule?

Nearest neighbors are the compounds in the database that are most similar to the query compound.

What is the most similar nearest neighbor associated with a target?

Among all the compounds annotated for a target X, it is the one that shows the highest similarity to the query molecule.

How does similarity searching for target prediction work (NN methods)?

Given a query molecule and a fingerprint of choice (MQN, Xfp, or ECfp4), target prediction involves the following steps: a) calculate the similarity between the query molecule and each compound in the reference database, b) sort the compounds in the reference database by similarity score, c) extract the top 2000 compounds (the nearest neighbors of the query), d) collect the targets associated with these 2000 compounds, e) reduce them to a list of unique targets, and f) sort the targets by the similarity score of the most similar nearest neighbor associated with each target. Our browser implements three similarity searching methods: NN(ECfp4), NN(Xfp), and NN(MQN).
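The ranking steps can be sketched in Python as follows. This is a minimal illustration, not the code running behind the browser: the function names, the in-memory reference list, and the toy fingerprints are hypothetical, and similarity is expressed as a distance where smaller means more similar (for a similarity coefficient such as Tanimoto, one could pass 1 minus the coefficient).

```python
def city_block(a, b):
    """City block (Manhattan) distance between two equal-length count vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_targets_by_nearest_neighbour(query_fp, reference, distance, top_n=2000):
    """Rank targets by their most similar nearest neighbour.

    reference: list of (compound_id, fingerprint, set_of_targets)   (hypothetical format)
    Returns [(target, best_distance, compound_id)], most similar first.
    """
    # Steps a) and b): score every reference compound and sort by distance.
    scored = sorted(
        ((distance(query_fp, fp), cid, targets) for cid, fp, targets in reference),
        key=lambda item: (item[0], item[1]),
    )
    # Step c): keep the nearest neighbours of the query.
    neighbours = scored[:top_n]
    # Steps d) and e): collect unique targets. Neighbours are already sorted,
    # so the first compound seen for a target is its most similar one.
    best = {}
    for dist, cid, targets in neighbours:
        for target in targets:
            if target not in best:
                best[target] = (dist, cid)
    # Step f): sort targets by the distance of their most similar neighbour.
    return sorted(
        ((target, dist, cid) for target, (dist, cid) in best.items()),
        key=lambda item: (item[1], item[0]),
    )


# Toy reference set with made-up three-element fingerprints and target labels.
reference = [
    ("c1", [1, 0, 2], {"T1"}),
    ("c2", [1, 1, 2], {"T2"}),
    ("c3", [5, 5, 5], {"T1", "T3"}),
    ("c4", [1, 0, 3], {"T2"}),
]
ranking = rank_targets_by_nearest_neighbour([1, 0, 2], reference, city_block, top_n=3)
```

With top_n=3, compound c3 falls outside the neighbour list, so target T3 does not appear in the ranking at all.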

How does similarity searching + machine learning for target prediction work (NN + NB methods)?

NN + NB methods combine similarity searching (NN) with a multinomial Naive Bayes machine learning model (NB). Given a query molecule, target prediction with an NN + NB method involves the following steps: a) calculate the similarity (using ECfp4, Xfp, or MQN) between the query molecule and each compound in the reference database, b) sort the compounds in the reference database by similarity score, c) extract the top 2000 compounds (the nearest neighbors of the query), d) build a Naive Bayes model on these 2000 compounds using the ECfp4 fingerprint, and e) predict the targets with the NB model and return the top 20 predicted targets. Our browser implements three NN + NB methods: i) NN(ECfp4) + NB(ECfp4), ii) NN(Xfp) + NB(ECfp4), and iii) NN(MQN) + NB(ECfp4).
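Steps d) and e) can be illustrated with a small multinomial Naive Bayes written in plain Python. This is a sketch under assumptions, not the model used in the browser: the Laplace smoothing parameter alpha, the treatment of each compound-target annotation as one training example, and the representation of a fingerprint as a set of on-bit positions are all choices made for the example.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(neighbours, n_bits=1024, alpha=1.0):
    """Fit a multinomial Naive Bayes model on the nearest neighbours of a query.

    neighbours: list of (on_bits, targets), where on_bits is the set of bit
                positions set in the compound's fingerprint (hypothetical format).
    """
    bit_counts = defaultdict(Counter)  # target -> bit position -> count
    examples = Counter()               # target -> number of annotated compounds
    for on_bits, targets in neighbours:
        for target in targets:
            examples[target] += 1
            bit_counts[target].update(on_bits)
    total_examples = sum(examples.values())
    model = {"alpha": alpha, "targets": {}}
    for target, n in examples.items():
        model["targets"][target] = {
            "log_prior": math.log(n / total_examples),
            "counts": bit_counts[target],
            # Smoothed denominator of the per-bit likelihood.
            "denominator": sum(bit_counts[target].values()) + alpha * n_bits,
        }
    return model


def predict_targets(model, query_bits, top_k=20):
    """Return up to top_k targets ordered by decreasing log posterior score."""
    alpha = model["alpha"]
    scores = []
    for target, params in model["targets"].items():
        score = params["log_prior"]
        for bit in query_bits:
            score += math.log((params["counts"][bit] + alpha) / params["denominator"])
        scores.append((score, target))
    scores.sort(key=lambda item: (-item[0], item[1]))
    return [target for _, target in scores[:top_k]]


# Toy neighbour list with made-up bit positions and target labels.
toy_neighbours = [({1, 2, 3}, {"T1"}), ({1, 2, 4}, {"T1"}), ({7, 8, 9}, {"T2"})]
toy_model = train_multinomial_nb(toy_neighbours, n_bits=16)
toy_prediction = predict_targets(toy_model, {1, 2})
```

Because the model is trained only on the neighbours of one query, it has to be rebuilt for every query molecule, in contrast to a model trained once on the whole reference database.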

How does NB(ECfp4) perform target prediction?

NB(ECfp4) stands for a multinomial Naive Bayes model built with the ECfp4 fingerprint on the reference database (~350K compounds and 1720 targets) extracted from ChEMBL. In this case the model was built only once and is used to predict the targets of any query molecule.

How does DNN(ECfp4) perform target prediction?

DNN(ECfp4) stands for a deep neural network model built with the ECfp4 fingerprint on the reference database (~350K compounds and 1720 targets) extracted from ChEMBL. In this case the model was built only once and is used to predict the targets of any query molecule.

What do NN(ECfp4), NN(Xfp), and NN(MQN) stand for?

NN(ECfp4), NN(Xfp), and NN(MQN) are target prediction methods based on similarity searching, using the ECfp4, Xfp, and MQN fingerprints respectively.

How many target proteins are present in PPB2?

1,720 targets

How many compounds are present in PPB2?

344,164 compounds

What similarity metric is used for similarity calculation?

For Xfp and MQN, the city block distance is used. For ECfp4, the Tanimoto coefficient is used.
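Both metrics are simple to compute. The Python sketch below shows one way to write them; the function names, the representation of a bit-string fingerprint as a set of on-bit positions, and the value returned for two empty fingerprints are assumptions of this example.

```python
def city_block_distance(a, b):
    """Sum of absolute differences between two count vectors (smaller = more similar)."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def tanimoto_coefficient(bits_a, bits_b):
    """Shared on-bits divided by all on-bits (1.0 = identical, 0.0 = nothing in common)."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Made-up values for illustration.
example_distance = city_block_distance([1, 0, 2], [3, 0, 1])   # |1-3| + |0-0| + |2-1| = 3
example_tanimoto = tanimoto_coefficient({1, 2, 3}, {2, 3, 4})  # 2 shared / 4 total = 0.5
```

Note that the two metrics run in opposite directions: nearest neighbors have the smallest city block distance but the largest Tanimoto coefficient.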

What kind of target proteins from ChEMBL are considered?

Targets labeled as “single protein” with the source organism being either human or rat.

What activity cut-off was used to extract target-compound interactions?

We considered compounds with an IC50, EC50, Ki, or Kd value below 10 µM.
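The target and activity criteria above can be expressed as a simple filter. The Python sketch below is illustrative only: the record layout and field names are hypothetical and do not reproduce the ChEMBL schema, activity values are assumed to be already converted to nM, and the cut-off is applied as a strict "less than".

```python
ACTIVITY_TYPES = {"IC50", "EC50", "Ki", "Kd"}
ORGANISMS = {"Homo sapiens", "Rattus norvegicus"}  # human or rat
CUTOFF_NM = 10_000  # 10 µM expressed in nM


def keep_interaction(record):
    """Return True if a target-compound record passes all selection criteria."""
    return (
        record["target_type"] == "SINGLE PROTEIN"
        and record["organism"] in ORGANISMS
        and record["activity_type"] in ACTIVITY_TYPES
        and record["value_nM"] < CUTOFF_NM
    )


# Made-up records for illustration.
records = [
    {"target_type": "SINGLE PROTEIN", "organism": "Homo sapiens",
     "activity_type": "IC50", "value_nM": 250.0},
    {"target_type": "SINGLE PROTEIN", "organism": "Homo sapiens",
     "activity_type": "Ki", "value_nM": 25_000.0},    # too weak
    {"target_type": "PROTEIN COMPLEX", "organism": "Rattus norvegicus",
     "activity_type": "Kd", "value_nM": 10.0},        # not a single protein
    {"target_type": "SINGLE PROTEIN", "organism": "Mus musculus",
     "activity_type": "EC50", "value_nM": 10.0},      # neither human nor rat
]
selected = [r for r in records if keep_interaction(r)]
```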