All thanks to Curators, I just changed my LinkedIn bio summary to "machine learning expert". At least that’s one trophy I deserve after completing an even more rigorous training on machine learning on this second day.
Before starting today’s class summary, please note: if you are already a pro in machine learning, then this is not for you at all. Kiss this article goodbye and move on, and don’t forget to tell your friends you saw me.
Now back to basics:
Before you ever start to think about machine learning, you should first think of the data you want and where to get it from. You know good information is scarce, so I only know a few sources, including:
· UCI Machine Learning Repository
· Plant Village (if your aim is to be a farmer, please buy me some oranges)
As we said earlier, your data will come in several formats, and you want to import every Python library you need before working on your datasets.
Now that you have your datasets, the first thing you will want to do is preprocessing. Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. As you go about preprocessing, do check for data imbalance.
Important: Whenever you have a data imbalance, try as much as possible not to bring your village knowledge into your data by oversampling and undersampling. Oversampling might result in so much noise, since you will be running similar data over and over again, while undersampling might leave you with less data to work with.
My brothers and sisters, to avoid stories, do it the hard way and get more data, else your model will be as dull as I was in primary 5. You can also reduce the effect of data imbalance by using a different performance measure alongside simple Python for loops. I know you are thinking I will write the Python code; ask Stack Overflow.
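Since your MB may finish before Stack Overflow loads, here is a minimal sketch of what a different performance measure can look like with a plain Python for loop. The labels and predictions are made up for illustration (95 negatives, 5 positives, and a lazy model that always predicts the majority class), and recall is just one possible measure, not the only one.

```python
# Made-up labels: 95 negatives (0) and 5 positives (1), an imbalanced dataset.
y_true = [0] * 95 + [1] * 5
# A lazy "model" that always predicts the majority class.
y_pred = [0] * 100

# Count the four outcomes with a simple for loop.
tp = fp = fn = tn = 0
for truth, guess in zip(y_true, y_pred):
    if truth == 1 and guess == 1:
        tp += 1
    elif truth == 0 and guess == 1:
        fp += 1
    elif truth == 1 and guess == 0:
        fn += 1
    else:
        tn += 1

# Accuracy: share of all predictions that are correct.
accuracy = (tp + tn) / len(y_true)
# Recall: share of the real positives that the model actually found.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

Accuracy looks great here while recall shows the model never finds a single positive, which is why accuracy alone can fool you on imbalanced data.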
Moving on, should you have lots and lots of data, separate it into train, validation and test sets. Sad news for all those focusing on deep learning in Nigeria: getting thousands of data samples will be your biggest challenge, especially if you are thinking of writing a model to predict the next general elections.
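One way to do the three-way split is to call scikit-learn's `train_test_split` twice. This is only a sketch: the toy data and the 60/20/20 ratio are my own assumptions, so adjust them to your dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 100 rows, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 50 + [1] * 50)

# First cut off 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Then take 25% of the remainder (20% of the total) as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(X_train), len(X_val), len(X_test))
```

Passing `stratify` keeps the class proportions the same in every split, which helps when the classes are not balanced.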
As you continue to think about the difference between training a model and training a dog, remember that the features in your dataset will be either continuous or categorical. You might want to check for data imbalance using a histogram for each continuous variable and a bar chart for each categorical variable in your dataset.
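Before drawing anything, you can look at the numbers behind those charts. The sketch below uses hypothetical columns (a healthy/diseased label and a temperature reading that I invented): `Counter` gives the bar heights for a categorical variable and `np.histogram` gives the bin counts for a continuous one.

```python
from collections import Counter

import numpy as np

# Hypothetical categorical column: these counts are the heights of a bar chart.
labels = ["healthy"] * 90 + ["diseased"] * 10
counts = Counter(labels)
for name, n in counts.most_common():
    print(f"{name:10s} {n:3d} {'#' * (n // 5)}")

# Hypothetical continuous column: np.histogram returns the bin counts
# and bin edges that a histogram plot would draw.
rng = np.random.default_rng(0)
temperatures = rng.normal(loc=30.0, scale=3.0, size=200)
bin_counts, bin_edges = np.histogram(temperatures, bins=5)
for left, right, n in zip(bin_edges[:-1], bin_edges[1:], bin_counts):
    print(f"{left:5.1f} to {right:5.1f}: {n}")
```

If one bar or one bin towers over the rest, that is your imbalance showing itself.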
Methods used in converting categorical to numerical data include:
- One-hot encoding
- Label Encoding
Label encoding assumes that the categories of a feature have a natural order, for example a temperature scale (cold, warm, hot). One-hot encoding is a good choice when the categories have no such order, and it creates more features in your dataset, one new column per category.
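Here is a rough sketch of both encodings with scikit-learn. The temperature and colour columns are hypothetical. Note that I use `OrdinalEncoder` in place of `LabelEncoder` for the ordered feature, because `LabelEncoder` is meant for target labels and sorts categories alphabetically, so it would not respect cold < warm < hot.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical ordered feature: cold < warm < hot.
temps = np.array([["cold"], ["hot"], ["warm"], ["cold"]])
ordinal = OrdinalEncoder(categories=[["cold", "warm", "hot"]])
temp_codes = ordinal.fit_transform(temps)

# Hypothetical unordered feature: no colour is "bigger" than another.
colours = np.array([["red"], ["green"], ["blue"], ["green"]])
onehot = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense.
colour_columns = onehot.fit_transform(colours).toarray()

print(temp_codes.ravel())     # one integer code per row
print(onehot.categories_[0])  # column order of the one-hot matrix
print(colour_columns)         # one column per category, a single 1 per row
```

One column became three after one-hot encoding, which is the "more features" part of the story.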
Choosing between Regression and Classification
Classification predictive modeling is the task of approximating a mapping function (f) from input variables (X) to discrete output variables (y), while regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y).
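If you are not sure which side your target falls on, scikit-learn's `type_of_target` helper can give a hint. The two target lists below are made up for illustration, and this is only a quick check, not a replacement for knowing your data.

```python
from sklearn.utils.multiclass import type_of_target

# Made-up targets for illustration.
spam_labels = ["spam", "ham", "spam", "ham"]  # discrete labels
house_prices = [21.5, 30.2, 18.9, 44.1]       # real-valued numbers

spam_kind = type_of_target(spam_labels)
price_kind = type_of_target(house_prices)

print(spam_kind)   # discrete kinds such as "binary" point to classification
print(price_kind)  # "continuous" points to regression
```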
Types of classification
- Logistic regression
- Linear discriminant analysis
- Support vector machines (kernel method)
- Neural networks (feed-forward and convolutional neural networks)
- K-nearest neighbors
- Decision trees
- Random Forest
- Gradient Boosting
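To get a feel for a few of these, the sketch below trains four of the listed classifiers on the iris dataset that ships with scikit-learn. The dataset, the split and the hyperparameters are my own choices for illustration, so do not read the scores as a ranking of the algorithms.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Small built-in dataset: 150 flowers, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "K-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Every classifier is trained and scored the same way.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data
    print(f"{name:20s} accuracy = {scores[name]:.2f}")
```

Because all of them share the same `fit` and `score` methods, swapping one classifier for another is a one-line change.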
Types of regression models
- Linear model
- Google the rest!! Use your MB for once
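Before you spend that MB, here is a minimal sketch of the linear model with scikit-learn's `LinearRegression`. The data is made up so that it follows y = 3x + 2 exactly, which makes it easy to see whether the model recovers the slope and intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 3x + 2 exactly.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]
intercept = model.intercept_
# Predict for an x value the model has not seen.
prediction = model.predict(np.array([[12.0]]))[0]

print(slope, intercept, prediction)
```

Real data will have noise, so expect the fitted numbers to be close to the truth rather than exact.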
For you and me practicing machine learning with laptops bought in Aba that come with 4GB RAM, it will be super cool if we respect ourselves and stick to scikit-learn (sklearn) instead of TensorFlow, lest people start to think we are starting up a barbecue business with our laptops as the heating pan.
Scikit-learn is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines.
For any sklearn process you do the following:
- Import the class you need
- Create an instance of it
- Fit it on your data
- Transform or predict as the case may be
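As a rough sketch of that pattern, the example below runs one transformer (`StandardScaler`) and one predictor (`KNeighborsClassifier`) on the built-in iris dataset. The dataset and both classes are my own picks for illustration; the point is that before you transform or predict, you create the object and fit it on training data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A transformer: create it, fit it on the training data, then transform.
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# A predictor: create it, fit it on the training data, then predict.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train_scaled, y_train)
predictions = clf.predict(X_test_scaled)

print(predictions[:5], clf.score(X_test_scaled, y_test))
```

Notice that the scaler is fitted on the training data only and then reused on the test data, so nothing from the test set leaks into training.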
Source: Deep Learning on Medium