Introduction to Cheminformatics

Source: Deep Learning on Medium

What is Cheminformatics?

With the increase of computational power, machine learning has found many applications in different fields of science. One of them is chemistry, where scientists apply machine learning models to predict molecular properties such as solubility and toxicity [1], or use them for drug discovery.

Cheminformatics is a field of science where computational methods like machine learning are applied to solve various problems in chemistry [2]. It poses very challenging problems, such as converting 3D molecular structures into inputs for an ML model, tackling the scarcity of data, or getting a better understanding of which molecular features are important for predictions [3].

In this article, I will gently introduce you to a few problems that are currently faced in Cheminformatics and list some of the available libraries and datasets to get you started in this exciting and challenging field!

How to Represent the Molecule Structure

A molecule can be visualized as a 3D quantum mechanical object that consists of atoms with well-defined locations within the molecule [4]. You can extract a plethora of information here: the relative distances between atoms, atomic numbers, the shape of the electron probability cloud, and many others.

3D visualization of the molecule. Taken from Giphy

However, it is quite difficult to retain all of that information when converting a molecule into input for a machine learning model. Fortunately, there are a few existing molecular representations that try to address this problem.


SMILES

SMILES is a text-like representation of molecules, and it stands for Simplified Molecular-Input Line-Entry System. Despite its awkward-sounding name, it is one of the most popular ways of representing a molecule. In fact, some deep learning models accept SMILES directly as input [5].

You should also be aware that the SMILES representation loses information that can matter when training machine learning models. Features such as bond lengths and the 3D coordinates of the atoms are lost when a molecule is converted to SMILES [4].
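To make the "text-like" idea concrete, here is a minimal sketch of how a SMILES string could be split into tokens and mapped to integer ids before being fed to a model. The regular expression, the `<pad>` entry, and the function names `tokenize`, `build_vocab`, and `encode` are my own illustrative choices, not part of any library, and the pattern covers only common SMILES symbols rather than the full grammar.

```python
import re

# Illustrative token pattern (an assumption, not a complete SMILES grammar):
# bracket atoms, two-letter halogens, organic-subset atoms, aromatic atoms,
# two-digit ring closures, single digits, bonds, branches, and a few others.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#\-+\\/:~@\.\(\)\*\$])"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; fail if any character is skipped."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(smiles_list):
    """Assign an integer id to every token in the corpus (0 is kept for padding)."""
    vocab = {"<pad>": 0}
    for smi in smiles_list:
        for tok in tokenize(smi):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(smiles, vocab):
    """Convert a SMILES string into a list of integer token ids."""
    return [vocab[tok] for tok in tokenize(smiles)]


# Ethanol, benzene, and acetyl chloride as a tiny example corpus.
corpus = ["CCO", "c1ccccc1", "CC(=O)Cl"]
vocab = build_vocab(corpus)
print(tokenize("CC(=O)Cl"))  # ['C', 'C', '(', '=', 'O', ')', 'Cl']
print(encode("CCO", vocab))  # [1, 1, 2]
```

Note that nothing in these tokens carries bond lengths or 3D coordinates, which is exactly the information loss described above.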

SMILES algorithm. [Source]

Extended-Connectivity Fingerprints (ECFP)

First things first: what is a molecular fingerprint? It is just another way of representing a molecule numerically. The bit patterns generated by the fingerprint indicate the presence or absence of certain substructures within the molecule. The idea behind generating those patterns is a bit more involved and if you are interested, have a look here.

ECFP is a particular kind of molecular fingerprint that starts by assigning a numeric identifier to each atom in the molecule. The identifier depends on a number of things, such as the absolute charge, the number of heavy-atom connections, the number of non-hydrogen bonds, and the atomic charge. The key takeaway is that this algorithm can be tuned in various ways, and it is implemented in a popular Cheminformatics library, RDKit.
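The general idea can be sketched in a few lines of plain Python. The code below is a heavily simplified, ECFP-like toy and not the algorithm RDKit implements: it ignores bond orders, hydrogens, and duplicate-substructure removal, and the input format, the function names, and the use of SHA-256 for hashing are all my own assumptions for illustration.

```python
import hashlib


def _hash_to_int(values):
    """Deterministically hash a tuple of integers to a 32-bit integer."""
    digest = hashlib.sha256(repr(tuple(values)).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Return the set of 'on' bit positions of a toy ECFP-like fingerprint.

    atoms: list of (atomic_number, formal_charge) tuples (hypothetical format)
    bonds: list of (i, j) atom index pairs; bond orders are ignored here
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Iteration 0: one identifier per atom, built from simple atom properties.
    ids = [
        _hash_to_int((z, charge, len(neighbors[i])))
        for i, (z, charge) in enumerate(atoms)
    ]
    features = set(ids)

    # Each further iteration mixes in the neighbors' identifiers, so every
    # identifier describes a neighborhood that is one bond larger.
    for layer in range(1, radius + 1):
        ids = [
            _hash_to_int((layer, ids[i], *sorted(ids[j] for j in neighbors[i])))
            for i in range(len(atoms))
        ]
        features.update(ids)

    # Fold all identifiers into a fixed-length bit vector.
    return {f % n_bits for f in features}


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of 'on' bits."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 1.0


# Ethanol (heavy atoms only), written as C-C-O and as O-C-C.
ethanol = toy_circular_fingerprint([(6, 0), (6, 0), (8, 0)], [(0, 1), (1, 2)])
ethanol_reordered = toy_circular_fingerprint([(8, 0), (6, 0), (6, 0)], [(0, 1), (1, 2)])
# Propane (C-C-C) for comparison.
propane = toy_circular_fingerprint([(6, 0), (6, 0), (6, 0)], [(0, 1), (1, 2)])

print(tanimoto(ethanol, ethanol_reordered))  # 1.0, atom order does not matter
print(tanimoto(ethanol, propane))
```

Because the neighbor identifiers are sorted before hashing, the fingerprint does not depend on the order in which the atoms are listed, which is the property that makes this family of fingerprints useful for comparing molecules.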

Popular Deep Learning Architectures

Once we have the input in the right shape, we need to decide which machine learning model will work well with molecular structures. Here, I will walk you through a few popular architectures that work well in Cheminformatics.

Recurrent Neural Networks (RNNs)

RNNs work well with the SMILES representation of a molecule. Since SMILES is a text-based representation, an RNN can be trained to predict the next character in the sequence. This makes it possible to generate new SMILES strings, which might help in finding molecules with desirable properties (e.g. a certain solubility or toxicity). This paper shows the results of generating new molecules with RNNs.
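As a rough illustration of how the training data for such a model could be prepared, the sketch below turns a SMILES string into (context, next character) pairs. It only covers the data preparation step, not the RNN itself. The function name and the start and end markers `^` and `!` are arbitrary choices of mine (picked because SMILES does not use them), and splitting by character is a simplification: two-letter atoms such as `Cl` would be split in two.

```python
def make_next_token_pairs(smiles, start="^", end="!"):
    """Turn one SMILES string into (context, next_character) training pairs."""
    sequence = start + smiles + end
    # For every position, the model sees everything before it and must
    # predict the character at that position.
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


for context, target in make_next_token_pairs("CCO"):
    print(repr(context), "->", repr(target))
# '^' -> 'C'
# '^C' -> 'C'
# '^CC' -> 'O'
# '^CCO' -> '!'
```

Once a model has learned these next-character predictions, new strings can be generated by starting from the start marker and sampling one character at a time until the end marker appears.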

Graph Convolutional Networks (GCNs)

Generalizing RNNs and CNNs to take a graph as input is quite a difficult problem, and graphs come up often when working with molecules. Graph Convolutional Networks (GCNs) solve this problem by taking molecular graphs and their features as input. More specifically, the graph, together with the features of each node (atom), is converted into matrix form and then fed into the neural network model. An in-depth article about GCNs can be found here.

Image Convolution and Graph Convolution. [Source]

Why should we use GCNs? According to the literature [7], more information is retained if a molecule is represented as a 2D graph. It also has a relatively low computational cost and reaches high accuracy, which is attributed to the CNN-like architecture.
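To show what "matrix form" can look like in practice, here is a minimal sketch of a single graph-convolution step in plain Python. It uses one common formulation, ReLU(D^-1/2 (A + I) D^-1/2 H W), where A is the adjacency matrix, H the atom feature matrix, and W a weight matrix. The articles linked above may use a different variant, and the one-hot atom features and weight values below are made up for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adjacency, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # Add self-loops so every atom keeps its own features.
    a_hat = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    # Symmetric normalization by the degree of each atom (self-loop included).
    deg = [sum(row) for row in a_hat]
    norm = [
        [a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]
    # Aggregate features over neighbors, apply the linear map, then ReLU.
    aggregated = matmul(norm, features)
    return [[max(0.0, v) for v in row] for row in matmul(aggregated, weights)]


# Ethanol (heavy atoms only): C-C-O
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
# Made-up node features: one-hot [is_carbon, is_oxygen]
features = [
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
]
# Made-up weights mapping 2 input features to 2 hidden features.
weights = [
    [1.0, -1.0],
    [0.5, 2.0],
]

for row in gcn_layer(adjacency, features, weights):
    print(row)
```

Each output row is a new feature vector for one atom that now mixes in information from its bonded neighbors. Stacking several such layers lets information travel further across the molecule.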

Publicly Available Molecules Datasets

How do you train machine learning models if you don’t have enough data? That used to be a problem in chemistry, but nowadays chemists usually make their data freely available to everyone. I will list here some of the most popular molecular datasets so you can use them as a reference 🙂


PubChem

The world’s largest collection of freely available chemical information. It contains close to 100M compounds and more than 236M substances. The dataset seems to be well-maintained including documentation and tutorials. There is plenty of information about physical and chemical properties, as well as various structural representations for each compound. Definitely worth checking out! The link is here.


ChemSpider

It is a very similar database to PubChem but with fewer chemicals. At the time of writing, it contains 77M chemical structures from 270 data sources. It might be a good addition to PubChem if you are planning to do something big. The link is here.


ChEMBL

It is a manually curated dataset of bioactive molecules with drug-like properties. Currently, it contains 2M compounds and 1.1M assays. Additionally, its maintainers use data mining to gather more data about molecules (e.g. from patent documents). The link is here.

Honorable Mentions

Some of the lesser-known but useful datasets include Tox21 (toxicology), the Solubility Challenge, and many others that can be found on Kaggle.

Useful Tools and Python Libraries


DeepChem

This is a very popular and well-maintained Python library with over 1.7k stars on GitHub. It provides an open-source toolchain for deep learning in drug discovery, quantum chemistry, and other life sciences. The link to their website is here.


RDKit

It is a collection of general-purpose machine learning and cheminformatics software written in C++ and Python. Its functionality includes reading and writing molecules, substructure searching, chemical transformations, and molecular similarity. Personally, I struggled to set it up on my laptop (Ubuntu), but with a Conda installation it works perfectly. The link to their website is here.

Honorable Mentions

Have a look at this curated list of all available Python libraries related to Chemistry.