Original article was published on Artificial Intelligence on Medium
Many people picture a robot or the Terminator when they hear about Machine Learning (ML) or Artificial Intelligence (AI). But these technologies are not something out of the movies, nor are they a futuristic dream. They are already here. We live in a world with numerous cutting-edge applications built using machine learning. Even so, an ML practitioner can face certain challenges while taking an application from scratch to production.
What are these challenges? Let’s take a look!
1. Data Collection
Data plays a key role in any use case. Roughly 60% of a data scientist’s work lies in collecting data. Beginners who want to experiment with machine learning can easily find data on Kaggle, the UCI ML Repository, and similar sites.
For real-world scenarios, you need to collect the data through web scraping or through APIs (such as Twitter’s). For business problems, you need to obtain data from clients, which requires ML engineers to coordinate with domain experts.
Once the data is collected, we need to structure it and store it in a database. This requires knowledge of big data tooling, so data engineering plays a major role here.
2. Insufficient Training Data
Once the data is collected, you need to check whether the quantity is sufficient for the use case (for time-series data, as a rule of thumb, we need a minimum of 3–5 years of data).
The two important things we do in a machine learning project are selecting a learning algorithm and training the model on some of the acquired data. As humans, we naturally make mistakes, and things may go wrong. Here, the mistakes could be choosing the wrong model or selecting bad data. What do I mean by bad data? Let us try to understand.
Imagine your machine learning model is a baby, and you plan on teaching the baby to distinguish between a cat and a dog. We begin by pointing at a cat and saying ‘it’s a CAT’, and do the same with a DOG (possibly repeating this a number of times). Now the child will be able to distinguish between a dog and a cat by identifying shapes, colors, or other features. And just like that, the baby becomes a genius (at distinguishing)!
In a similar fashion, we train the model with a lot of data. A child may distinguish the animals from only a few samples, but a machine learning model requires thousands of examples for even simple problems. Complex problems like image classification and speech recognition may require millions of examples.
Therefore, one thing is clear. We need to train a model with sufficient DATA.
3. Non-representative Training Data
For the model to generalize well, the training data should be representative of the new cases, i.e., the data we use for training should cover the cases that have occurred and those that are going to occur. A model trained on a non-representative training set is not likely to make accurate predictions.
From a business point of view, a good machine learning model is one that makes accurate predictions for general cases. Representative training data helps the model perform well even on data it has never seen.
If the number of training samples is low, we get sampling noise, i.e., data that is unrepresentative by chance. But even a very large number of training samples can suffer from sampling bias if the sampling method is flawed.
A popular example of sampling bias occurred during the US presidential election of 1936 (Landon against Roosevelt). The Literary Digest conducted a very large poll, mailing around ten million people, of whom 2.4 million answered, and predicted with high confidence that Landon would get 57% of the votes. Instead, Roosevelt won with about 61% of the votes.
The problem lay in the sampling method. To get the mailing addresses for the poll, the Literary Digest used magazine subscriber lists, club membership lists, and the like, all of which favored wealthier individuals who were more likely to vote Republican (hence Landon). Non-response bias also comes into the picture, as fewer than 25% of the people polled answered.
To make accurate predictions without any drifts, the training datasets must be representative.
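To see how a flawed sampling method skews a poll regardless of sample size, here is a minimal simulation sketch. The population, the share of wealthy voters, and the voting probabilities are all made-up numbers for illustration, not historical figures, and `simulate_poll` is a hypothetical helper.

```python
import random

def simulate_poll(population_size=100_000, sample_size=2_000, seed=0):
    """Compare a random sample with a sample drawn only from wealthy voters."""
    rng = random.Random(seed)
    # Each voter is (is_wealthy, votes_for_a). All probabilities are invented.
    population = []
    for _ in range(population_size):
        wealthy = rng.random() < 0.3
        p_vote_a = 0.65 if wealthy else 0.30
        population.append((wealthy, rng.random() < p_vote_a))

    true_share = sum(vote for _, vote in population) / population_size

    # Representative sample: every voter is equally likely to be polled.
    random_sample = rng.sample(population, sample_size)
    random_estimate = sum(vote for _, vote in random_sample) / sample_size

    # Biased sample: only wealthy voters appear on the mailing lists.
    wealthy_only = [voter for voter in population if voter[0]]
    biased_sample = rng.sample(wealthy_only, sample_size)
    biased_estimate = sum(vote for _, vote in biased_sample) / sample_size

    return true_share, random_estimate, biased_estimate

true_share, random_estimate, biased_estimate = simulate_poll()
print(f"true share:      {true_share:.3f}")
print(f"random sample:   {random_estimate:.3f}")
print(f"biased sample:   {biased_estimate:.3f}")
```

The random sample lands close to the true share, while the wealthy-only sample overestimates support for candidate A no matter how many people are polled.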
4. Poor Quality of Data
In reality, we don’t start training the model right away; analyzing the data is the most important step. The data we collected might not be ready for training: some samples may be abnormal, for instance because they contain outliers or missing values.
In these cases, we can remove the outliers, fill in the missing values using the median or mean (for example, to fill in a missing height), simply remove the attributes or instances with missing values, or train the model both with and without these instances.
We don’t want our system to make false predictions, right? So the quality of the data is very important for accurate results. Data preprocessing involves handling missing values, then extracting and rearranging what the model needs.
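As a rough sketch of the two cleaning steps described above, the snippet below fills missing heights with the median and then drops outliers using the interquartile range (IQR) rule. The IQR rule and the sample values are assumptions chosen for illustration; other outlier criteria work just as well.

```python
import statistics

def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def remove_outliers_iqr(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical heights in cm: two missing entries and one data-entry error.
heights_cm = [170, 165, None, 180, 175, 999, None, 168]

filled = fill_missing_with_median(heights_cm)
cleaned = remove_outliers_iqr(filled)
print(filled)
print(cleaned)
```

The median is used for filling because, unlike the mean, it is barely affected by the erroneous 999 value.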
5. Irrelevant/Unwanted Features
Garbage in, Garbage out
If the training data contains a large number of irrelevant features and not enough relevant ones, the machine learning system will not give the expected results. One of the most important aspects of a successful machine learning project is selecting good features to train the model on, also known as Feature Selection.
Let’s say we are working on a project to predict the number of hours a person needs to exercise based on the input features that we collected — age, gender, weight, height, and location (i.e., where he/she lives).
- Among these 5 features, location might not affect our output. It is an irrelevant feature, and we can expect better results without it.
- We can also combine existing features to produce a more useful one, i.e., Feature Extraction. In our example, we can produce a feature called BMI by combining weight and height. We can apply transformations to the dataset too.
- Creating new features by gathering more data also helps.
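The first two points can be sketched in a few lines. The field names (`weight_kg`, `height_cm`, and so on) and the sample record are hypothetical; BMI is computed with the standard formula, weight in kilograms divided by height in meters squared.

```python
def add_bmi(record):
    """Return a copy of the record with BMI added and location dropped."""
    height_m = record["height_cm"] / 100
    bmi = record["weight_kg"] / height_m ** 2
    # Feature selection: leave out the location, assumed irrelevant here.
    features = {key: value for key, value in record.items() if key != "location"}
    # Feature extraction: one derived feature built from two raw ones.
    features["bmi"] = round(bmi, 1)
    return features

person = {"age": 30, "gender": "F", "weight_kg": 70, "height_cm": 175, "location": "Pune"}
print(add_bmi(person))
```

Whether to keep the raw weight and height alongside BMI is a judgment call; this sketch keeps them.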
6. Overfitting the Training Data
Say you visited a restaurant in a new city. You looked at the menu and found that the prices were too high. You might be tempted to say that ‘all the restaurants in the city are too costly and not affordable’. Overgeneralizing is something we do very frequently and, unfortunately, machines can fall into the same trap. In machine learning, we call it overfitting.
It means the model performs well on the training dataset but does not generalize well to new data.
Let’s say you are implementing an image classification model to classify apples, peaches, oranges, and bananas with 3000, 500, 500, and 500 training samples respectively. If we train the model with these samples, the system is more likely to classify oranges as apples because apples dominate the training set. This is known as class imbalance, and oversampling the smaller classes is one way to address it.
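One common way to handle the unequal class counts in the fruit example is random oversampling: duplicating samples from the smaller classes until every class is as large as the biggest one. The sketch below is a minimal version, with integers standing in for the images; `random_oversample` is a hypothetical helper, not a library function.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

labels = ["apple"] * 3000 + ["peach"] * 500 + ["orange"] * 500 + ["banana"] * 500
samples = list(range(len(labels)))  # stand-ins for image identifiers

balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(labels))
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only, so that duplicated samples never leak into the test set.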
Overfitting occurs when the model is too complex relative to the amount and noisiness of the training data. We can avoid it by:
- Gathering more training data.
- Simplifying the model, for example by selecting one with fewer parameters or features; a linear model is preferred over a high-degree polynomial model.
- Fixing data errors, removing outliers, and reducing the number of attributes in the training set.
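The typical symptom of overfitting is a large gap between training and test accuracy. The sketch below illustrates it on synthetic data with noisy labels, using a 1-nearest-neighbor model that simply memorizes the training set. The data, the 20% noise level, and the hand-picked threshold model used for comparison are all assumptions made for illustration.

```python
import random

def make_data(n, rng, noise=0.2):
    """Points on [0, 1]; the true label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Memorizing model: copy the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def predict_threshold(x):
    """Simple model: one decision boundary, hand-set to the true one for comparison."""
    return x > 0.5

def accuracy(predict, data):
    return sum(predict(x) == y for x, y in data) / len(data)

rng = random.Random(42)
train, test = make_data(200, rng), make_data(200, rng)

nn_train = accuracy(lambda x: predict_1nn(train, x), train)
nn_test = accuracy(lambda x: predict_1nn(train, x), test)
simple_test = accuracy(predict_threshold, test)

print(f"1-NN train accuracy:      {nn_train:.2f}")
print(f"1-NN test accuracy:       {nn_test:.2f}")
print(f"threshold test accuracy:  {simple_test:.2f}")
```

The memorizing model scores perfectly on the data it has seen because it has also learned the noise, and that is exactly what hurts it on unseen data.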
7. Underfitting the Training Data
Underfitting, the opposite of overfitting, generally occurs when the model is too simple to capture the underlying structure of the data. It’s like trying to fit into undersized pants. It generally happens when we have too little information to build an accurate model, or when we try to fit a linear model to non-linear data.
Main options to reduce underfitting are:
- Feature Engineering — feeding better features to the learning algorithm.
- Remove noise from the data.
- Select a more powerful model with more parameters.
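To make the first option concrete, here is a small sketch of a linear model underfitting non-linear data, and of the same model fitting well once it is fed a better feature. The quadratic target and the helper names `fit_line` and `mse` are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(-30, 31)]  # -3.0 ... 3.0
ys = [x ** 2 + 1 for x in xs]          # non-linear target

# Linear model on the raw feature: too simple for this data, so it underfits.
slope, intercept = fit_line(xs, ys)
mse_raw = mse(xs, ys, slope, intercept)

# Same linear model, but fed the engineered feature x**2.
squared = [x ** 2 for x in xs]
slope2, intercept2 = fit_line(squared, ys)
mse_engineered = mse(squared, ys, slope2, intercept2)

print(f"MSE with raw feature:        {mse_raw:.3f}")
print(f"MSE with engineered feature: {mse_engineered:.3e}")
```

The learning algorithm is identical in both fits; only the feature changed, which is the point of feature engineering.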
8. Offline Learning & Deployment of the model
Machine learning engineering follows these steps when building an application: 1) data collection, 2) data cleaning, 3) feature engineering, 4) pattern analysis, 5) model training and optimization, 6) deployment.
Oops!! Did I say deployment? Yes, a lot of machine learning practitioners can perform all the other steps but fall short at deployment. Bringing their cool applications into production has become one of the biggest challenges due to lack of practice, dependency issues, poor understanding of how the underlying models relate to the business, poor understanding of business problems, and unstable models.
Generally, many developers collect data from websites like Kaggle and start training the model. But in reality, we need to build a data source that changes dynamically. Offline learning (batch learning) may not suit this kind of changing data: the system is trained, launched into production, and then runs without learning anymore. The data may drift as it changes, so the model’s predictions can degrade over time.
It is always preferable to build a pipeline to collect, analyze, build/train, test, and validate the dataset for any machine learning project, and to train the model in batches.
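One piece such a pipeline could include is a check for data drift before each new batch is used. The sketch below is a deliberately crude version: it flags a batch whose mean for a single feature sits many standard errors away from the training mean. The threshold of 3, the feature, and the sample values are assumptions; production systems usually rely on more robust statistical tests.

```python
import statistics

def has_drifted(reference, new_batch, threshold=3.0):
    """Flag drift when the new batch mean is far from the reference mean.

    The distance is measured in standard errors of the batch mean.
    """
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    standard_error = ref_std / len(new_batch) ** 0.5
    z = abs(statistics.mean(new_batch) - ref_mean) / standard_error
    return z > threshold

# Hypothetical values of an "age" feature at training time and in production.
training_ages = [25, 31, 28, 35, 30, 27, 33, 29, 32, 26]
similar_batch = [27, 30, 33, 28, 31]
shifted_batch = [45, 52, 48, 50, 47]

print(has_drifted(training_ages, similar_batch))
print(has_drifted(training_ages, shifted_batch))
```

When a batch is flagged, the usual response is to investigate the data source and retrain the model on more recent data.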
A system doesn’t perform well if the training set is too small, or if the data is not representative, is noisy, or is polluted with irrelevant features. We went through some of the basic challenges faced by beginners getting started with machine learning.
I’ll be happy to hear suggestions if you have any. I will come back with another intriguing topic very soon. Till then, Stay Home, Stay Safe, and keep exploring!
If you would like to get in touch, connect with me on LinkedIn.