Data Science Roles

Original article was published on Deep Learning on Medium

If you are learning data science or entering the data science job market for the first time, there always seems to be some confusion about which opportunities best suit your interests. This is because there is no single definition of data science, and the term is used in different ways depending on the conversation you are engaged in.

Photo by Markus Winkler on Unsplash

As a data science student myself, I wanted to understand what roles are available and where I would fit in the market. My motivation for writing this blog is to help other data enthusiasts in a similar position get a glimpse of what lies ahead in their careers, by describing the different data science roles in an AI development project.

I did my research by reading articles and talking to data scientists and machine learning engineers to understand how different companies operate and which roles they offer. Unsurprisingly, just as data science means different things to different people, it was very hard to come up with definitive roles, since companies operate differently depending on their size, goals, available talent and many other factors.

Fortunately, I found a very insightful report from Workera, a deeplearning.ai company, titled AI Career Pathways: Put yourself on the right track. The report describes the different roles and stages in the development of an AI project, and it gave me a solid understanding of the roles and skills required at every stage.

I will therefore delve into the topic by summarizing the report in two parts: tasks and skills for the AI development life cycle, and the roles of an AI team.

Part I

Tasks and skills for the AI development life cycle

According to the Workera report, the AI project development life cycle illustrated in Figure 1 can be summarized as follows.

First, someone prepares data for modelling, and then someone trains a model on this data. Once that happens, the model is delivered to the customer. Team members then analyse the model to determine whether it brought value to the business and/or the user. If all goes well, the cycle repeats with new data, models, and analysis.

All the while, people working in AI infrastructure build software to improve the cycle’s efficiency.

Figure 1: A visual representation of an AI project development cycle from the Workera Report, 2020.

Data Engineering

Data is the foundation on which data science, machine learning, deep learning and AI (I decided to use all the buzzwords so as not to lose anyone here) are built. Traditional data is stored across a variety of databases and files, while big data can be structured or unstructured, ranging from numbers and text to images, audio and video, in volumes measured in terabytes, petabytes or even exabytes.

Data engineers are responsible for preparing data and transforming it into formats that other team members can use. They mostly need strong coding and software engineering skills, ideally combined with machine learning skills to help them make good design decisions related to the data. Commonly used tools include big data frameworks such as Hadoop and Hive, query languages such as SQL, and programming languages such as Python, Java and C++.

Tasks in data engineering include:

· Defining data requirements

· Collecting data

· Labelling data

· Inspecting and cleaning data

· Augmenting data

· Moving data and building data pipelines

· Querying data

· Tracing data

As mentioned in the first paragraph of this section, data is the foundation of data science, so this part of the development cycle is crucial to the results of the project as a whole. As a common saying in machine learning goes, “Garbage In, Garbage Out”.
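As a rough illustration of the "inspecting and cleaning data" task listed above, here is a minimal Python sketch using only the standard library. The records, the field names (`user_id`, `age`) and the validity rule are invented for this example and do not come from the Workera report; a real pipeline would likely rely on dedicated tooling.

```python
# Hypothetical raw records; field names and values are invented for illustration.
raw_records = [
    {"user_id": 1, "age": 34},
    {"user_id": 2, "age": None},  # missing value
    {"user_id": 1, "age": 34},    # exact duplicate of the first record
    {"user_id": 3, "age": -5},    # implausible value
    {"user_id": 4, "age": 27},
]


def clean(records):
    """Drop records with missing or implausible ages, then remove duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        age = record["age"]
        # Assumed validity rule: an age must be present and between 0 and 120.
        if age is None or not 0 <= age <= 120:
            continue
        key = (record["user_id"], age)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


cleaned_records = clean(raw_records)
print(cleaned_records)
```

Only the records for users 1 and 4 survive, which is the "garbage" being filtered out before it can reach the modelling stage.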

Modelling

The modelling section is my personal favourite. I believe data science is an art: two data scientists can approach the same problem differently through their feature engineering and choice of algorithms, and that in itself is beautiful. I also like this part because it is where art and science come together to provide a solution to the problem at hand.

People assigned to modelling look for patterns in data that can help a company predict outcomes of various decisions, identify risks and opportunities, or determine cause and effect relationships.

Modelling can be done in Python, R, Java, MATLAB, C++ or any other suitable programming language. This is where a strong foundation in mathematics, statistics and machine learning is important.

Tasks in modelling include:

· Fitting probabilistic and statistical models

· Training machine and deep learning models

· Accelerating training

· Defining evaluation metrics

· Speeding up prediction time

· Iterating over the virtuous cycle of machine learning projects

· Searching hyperparameters

· Keeping your knowledge up to date

The most common machine learning methods include Linear Regression, Logistic Regression, Decision Trees, Random Forest, XGBoost, Support Vector Machines, K-means, K-Nearest Neighbours, Neural Networks and Principal Component Analysis. Deep learning skills are required by companies focusing on computer vision, natural language processing or speech recognition.
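To make two of the modelling tasks above concrete (training a model and defining an evaluation metric), here is a small sketch of simple linear regression fitted by ordinary least squares and scored with mean squared error. It uses plain Python so it stays self-contained; the toy data and the function names `fit_line` and `mean_squared_error` are assumptions made for this example, and in practice a library such as scikit-learn would usually do this work.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(targets, predictions):
    """Evaluation metric: average squared difference between target and prediction."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# Toy training data generated from y = 2x + 1 (invented for illustration).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(train_x, train_y)

# Held-out data is used to evaluate the model on points it has not seen.
test_x = [5.0, 6.0]
test_y = [11.0, 13.0]
predictions = [slope * x + intercept for x in test_x]
mse = mean_squared_error(test_y, predictions)
print(slope, intercept, mse)
```

Because the toy data is noise-free, the fitted line recovers the generating rule and the error on the held-out points is essentially zero; with real data the metric is what tells you whether a change to the model helped.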

Deployment

This is the stage in charge of making the project available to users. A stream of data is combined with a model and tested before going into production. Experience with cloud technologies such as AWS and Azure is an advantage here.

Tasks in deployment include:

· Converting prototype code into production code

· Setting up a cloud environment to deploy the model

· Branching (version control)

· Improving response times and saving bandwidth

· Encrypting files that store model parameters, architecture and data

· Building APIs for an application to use a model

· Retraining machine learning models

· Fitting models on resource-constrained devices
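As one possible illustration of "building APIs for an application to use a model", the sketch below wraps a model behind a small HTTP endpoint using Python's standard `http.server` module. The `MODEL` parameters, the request format `{"x": ...}` and the helper names are all hypothetical; production deployments typically use a dedicated web framework and a cloud environment rather than this minimal server.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical model parameters; a real project would load trained weights.
MODEL = {"slope": 2.0, "intercept": 1.0}


def predict(payload):
    """Turn a JSON request body such as '{"x": 3}' into a JSON response body."""
    x = float(json.loads(payload)["x"])
    y = MODEL["slope"] * x + MODEL["intercept"]
    return json.dumps({"prediction": y})


class PredictHandler(BaseHTTPRequestHandler):
    """Answers POST requests by running the model on the posted JSON body."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        try:
            response, status = predict(body), 200
        except (KeyError, TypeError, ValueError):
            # Malformed JSON or a missing/invalid "x" field.
            response = json.dumps({"error": 'expected a JSON body like {"x": 3}'})
            status = 400
        data = response.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def make_server(port=8000):
    """Create (but do not start) the server; call .serve_forever() to run it."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

Keeping the prediction logic in a plain function (`predict`) separate from the HTTP handler makes it easy to test the model's behaviour without starting a server.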

Business Analysis

The aim of any data science project is to provide (business) value. Someone starting out in data science might ask what happens to the models after deployment.

That is where the business analysis step comes in. Team members at this stage suggest or make changes, either to increase the benefit of deployed models or to abandon unproductive ones.

At this stage of the development cycle, team members should have strong communication skills and business acumen, as well as a grasp of the analytics principles relevant to the given data science/AI project.

Tasks in the business analysis include:

· Building data visualizations

· Building dashboards for business intelligence

· Presenting technical work to clients or colleagues

· Translating statistics to actionable business insights

· Analysing datasets

· Running experiments to analyse deployed models

· Running A/B tests

For example, suppose a team is tasked with building a joke recommendation engine for an online comedy TV service. The people responsible for business analysis will use data on how users respond to the recommendations to evaluate the performance of the recommender system and how much value it creates for the client.
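For the A/B testing task in particular, one common approach (chosen here for illustration, not prescribed by the Workera report) is a two-proportion z-test that compares the success rate of two variants. The sketch below uses only the standard library; the counts are invented and stand for users who rated a recommended joke as funny under an old recommender (A) and a new one (B).

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    # Pooled rate under the null hypothesis that both variants perform equally.
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: 200 of 1000 users liked variant A, 250 of 1000 liked variant B.
z, p_value = two_proportion_z_test(200, 1000, 250, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference between the two variants is unlikely to be due to chance alone, which is the kind of evidence the business analysis stage uses when deciding whether to keep or abandon a model.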

AI Infrastructure

The team working on AI infrastructure builds and maintains reliable, fast, secure, and scalable software systems to help people working in data engineering, modelling, deployment, and business analysis. They build the infrastructure that supports the project.

Continuing with the example of a joke recommender system, someone working on AI infrastructure would ensure that the recommender system is available 24/7 for global users, that the underlying model is stored securely, and that user interactions with the model on the website can be tracked reliably.

Working on AI infrastructure requires strong and broad software engineering skills to write production code and understand cloud technologies such as AWS and Azure.

Tasks in AI infrastructure include:

· Making software design decisions

· Building distributed storage and database systems

· Designing for scale

· Maintaining software infrastructure

· Networking

· Securing data and models

· Writing tests

· Carrying out various software tasks

Part II

The roles of an AI team

According to the Workera report, the six roles of an AI team are:

· Data Scientist

· Machine Learning Engineer

· Data Analyst

· Software Engineer — ML

· Machine Learning Researcher

· Software Engineer

For this article, I decided to merge the two software engineering roles, as they play very similar parts in the development cycle.

Figure 2: A visual representation of the six roles in AI in the five stages of the AI development cycle. (Workera, 2020)

Data Scientist

There are tons of definitions of what a data scientist is. For this blog, I will keep to what we covered in the tasks and skills for the AI development cycle. A data scientist is a person who can carry out any of the data engineering, modelling and business analysis tasks in the cycle. Their skills and tasks complement those of the people who deploy models and build software infrastructure.

They use tools such as SQL or other query languages for data engineering; Python packages such as NumPy, scikit-learn, TensorFlow and many others for modelling; Tableau or other visualization and presentation software for business analysis; and notebook environments such as Jupyter Notebook and Google Colab.

Data Analyst

Data analysts do not require algorithmic coding skills but need to demonstrate solid analytical skills and business acumen. They are involved in the first and last steps of the cycle, data engineering and business analysis, and hence complement the whole team with their work. Data analysts are also required to have good communication skills.

Machine Learning Engineer

There is always a lot of confusion between the titles machine learning engineer and data scientist, as the two roles overlap considerably. I like to differentiate them by thinking about the roles of a scientist and an engineer.

A data scientist is expected to understand the mathematics and statistics underlying the predictive models, whilst a machine learning engineer does not need to. What the ML engineer does need is mastery of the software tools that make the model useful, since ML engineering sits at the intersection of software engineering and data science.

ML Engineers carry out data engineering, modelling and deployment tasks.

Machine Learning Researcher

Machine learning researchers achieve their fullest potential in a research environment and demonstrate outstanding scientific skills. As there is a big technological gap between research and real-world production systems, ML researchers play an important role in applying research to real-world use cases, and the growing availability of data and accelerated computing resources such as GPUs makes more research problems possible to tackle.

OpenAI, co-founded by Elon Musk among others, and Google’s DeepMind are good examples of AI/machine learning research laboratories that work on complex and interesting challenges in artificial intelligence.

Software Engineer — Machine Learning

People who have the title software engineer-machine learning carry out data engineering, modelling, deployment and AI infrastructure tasks.

People in the software engineer-machine learning role demonstrate solid engineering skills and are developing scientific skills, while communication skill requirements vary among teams. Companies may refer to this position as machine learning engineer, software engineer, full-stack data scientist, or many other titles.

Conclusion

In a company, one’s responsibilities may vary depending on the company’s size: at start-ups and small companies, an individual tends to wear more than one hat, unlike at big companies. I understand that there is no one-size-fits-all description of data science roles, but in my opinion the Workera report best describes the different roles in an AI team, which translate to data science roles.

Please check out workera.ai for more information and resources on interview preparation and tests in data science and machine learning.

I hope you enjoyed reading this article as much as I enjoyed writing and sharing it with you!