A Data Scientist’s Tool Kit

NumPy

NumPy is a powerful library for performing mathematical and scientific computations with Python. Many other data science libraries require it as a dependency, as it is one of the fundamental scientific computing packages.

NumPy represents data as an N-dimensional array object. It provides tools for creating and manipulating arrays, performing element-wise array operations, computing basic statistics and carrying out common linear algebra calculations such as dot and cross products.
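
A minimal sketch of these ideas in action (the array values here are arbitrary examples):

```python
import numpy as np

# Create a 2-dimensional array (matrix) and a 1-dimensional array (vector)
matrix = np.array([[1, 2], [3, 4]])
vector = np.array([10, 20])

# Element-wise operations and basic statistics
doubled = matrix * 2      # every element multiplied by 2
average = matrix.mean()   # mean of all elements -> 2.5

# Common linear algebra operations
print(np.dot(matrix, vector))          # matrix-vector product -> [ 50 110]
print(np.cross([1, 0, 0], [0, 1, 0]))  # cross product -> [0 0 1]
```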

Pandas

The Pandas library simplifies the manipulation and analysis of data in Python. Pandas works with two fundamental data structures: the Series, a one-dimensional labelled array, and the DataFrame, a two-dimensional labelled data structure. The package also has a multitude of tools for reading data from various sources, including CSV files and relational databases.
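
A brief sketch of the two structures (the file name in the commented-out read is a hypothetical placeholder):

```python
import pandas as pd

# A Series: a one-dimensional labelled array
s = pd.Series([1.5, 2.0, 3.5], index=["a", "b", "c"])

# A DataFrame: a two-dimensional labelled data structure
df = pd.DataFrame({"name": ["Ann", "Ben"], "score": [85, 92]})

# Reading from external sources ("data.csv" is a hypothetical file)
# df = pd.read_csv("data.csv")
```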

Once data has been loaded into one of these structures, pandas provides a wide range of simple functions for cleaning, transforming and analysing it. These include built-in tools for handling missing data, simple plotting functionality and Excel-like pivot tables.
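
For instance, handling missing values and building a pivot table might look like this (the data is invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["London", "London", "Paris", "Paris"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [100.0, np.nan, 80.0, 95.0],
})

# Handle missing data: fill the NaN with the column mean (or drop it with dropna())
df["sales"] = df["sales"].fillna(df["sales"].mean())

# Excel-like pivot table: mean sales by city and year
pivot = df.pivot_table(values="sales", index="city", columns="year", aggfunc="mean")
print(pivot)
```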

SciPy

SciPy is another core scientific computing library for Python. It is built to work with NumPy arrays and depends on NumPy for much of its underlying functionality, so NumPy must be installed alongside it. However, you rarely need to call NumPy directly when using SciPy, as SciPy's functions accept and return NumPy arrays.

SciPy effectively builds on the mathematical functionality available in NumPy. Where NumPy provides very fast array manipulation, SciPy works with these arrays to apply more advanced mathematical and scientific computations, such as statistics, optimisation, integration and signal processing.
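
As a small illustration of this layering, the sketch below runs a t-test and a numerical optimisation on plain NumPy arrays (the sample values are made up):

```python
import numpy as np
from scipy import optimize, stats

# SciPy functions operate directly on NumPy arrays
sample = np.array([2.1, 2.5, 2.8, 3.0, 3.2])

# Advanced statistics: one-sample t-test against a hypothesised mean of 2.0
t_stat, p_value = stats.ttest_1samp(sample, popmean=2.0)
print(t_stat, p_value)

# Numerical optimisation: minimise a simple quadratic function
result = optimize.minimize(lambda x: (x[0] - 3.0) ** 2, x0=[0.0])
print(result.x)  # converges towards [3.]
```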

Scikit-learn

Scikit-learn is a user-friendly, comprehensive and powerful library for machine learning. It contains functions to apply most common machine learning techniques to data and exposes a consistent interface (fit, predict, transform) across them.

This library also provides tools for data cleaning, pre-processing and model validation. One of its most powerful features is the concept of machine learning pipelines, which allow the various steps in a machine learning workflow, e.g. preprocessing and training, to be chained together into a single object.
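
A minimal pipeline, using the bundled iris dataset and a scaler-plus-classifier chain as an illustrative choice of steps:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chain preprocessing and training into a single object
pipeline = Pipeline([
    ("scale", StandardScaler()),      # preprocessing step
    ("model", LogisticRegression()),  # estimator step
])

pipeline.fit(X_train, y_train)         # runs every step in order
print(pipeline.score(X_test, y_test))  # accuracy on held-out data
```

Because the pipeline behaves like a single estimator, the same object can be passed to tools such as cross-validation or grid search.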

Keras

Keras is a Python API which aims to provide a simple interface for working with neural networks. Lower-level deep learning libraries such as TensorFlow are notorious for not being very user-friendly; Keras sits on top of these frameworks to provide a friendlier way to interact with them.

Keras supports both convolutional and recurrent networks, can run on multiple backends and works on both CPU and GPU.
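
A minimal sketch of the interface, assuming the TensorFlow backend and an input size of 784 (e.g. flattened 28×28 images); the training data in the commented-out line is a placeholder:

```python
from tensorflow import keras

# A small feed-forward network for 10-class classification
model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.summary()
# model.fit(x_train, y_train, epochs=5)  # x_train/y_train are placeholders
```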

Matplotlib

Matplotlib is one of the fundamental plotting libraries in Python. Many other popular plotting tools build on the Matplotlib API, including the pandas plotting functionality and Seaborn.

Matplotlib is a very rich plotting library and contains functionality to create a wide range of charts and visualisations. Additionally, it contains functions to create animated and interactive charts.
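
For example, a line chart and a histogram side by side (the data is generated purely for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.plot(x, np.sin(x), label="sin(x)")  # a simple line chart
ax1.legend()
ax1.set_title("Line chart")

ax2.hist(np.random.randn(1000), bins=30)  # a histogram of random data
ax2.set_title("Histogram")

plt.tight_layout()
plt.show()
```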

Jupyter notebooks

Jupyter notebooks provide an interactive programming environment for Python. The benefit of writing Python in a notebook is that it allows you to easily render visualisations, datasets and data summaries directly alongside the code.
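
For example, a single notebook cell can produce a chart directly beneath the code (the `%matplotlib inline` magic is the classic way to enable this; the data is invented):

```python
# In a notebook cell: render plots inline alongside the code
%matplotlib inline

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 6]})
df.plot(x="x", y="y")  # the chart appears directly below the cell
```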

These notebooks are also ideal for sharing data science work, as they can be richly annotated by including Markdown text directly in line with the code and visualisations.

Python IDE

Jupyter notebooks are a useful place to write code for data science. However, there will be many instances when you need to organise code into reusable modules, particularly if you are writing code to put a machine learning model into production.

In these instances an IDE (Integrated Development Environment) is useful, as it provides features such as integrated Python style checks, unit testing and version control. I personally use PyCharm, but there are many others available.

GitHub

GitHub is a very popular platform for hosting version-controlled code. One of the fundamental principles of data science is that code and results should be reproducible, either by yourself at a future point in time or by others. Version control provides a mechanism to track and record changes to your work.

Additionally, GitHub enables safe collaboration on a project. This is typically achieved by a collaborator cloning the repository (effectively taking a copy of your project), making changes locally on a branch, and then uploading these for review before they are merged into the project. For an introductory guide to GitHub for data scientists see my previous article here.