How to Master Python for Machine Learning in 10 minutes by doing small projects

Source: Artificial Intelligence on Medium

The best way to learn machine learning related technologies is to design and complete small projects yourself.

Python is a very popular and powerful interpreted programming language. Unlike R, Python is a complete language and platform: you can use it both for research and development and for building production systems.

Moreover, Python has many modules and libraries to choose from, so there are usually several ways to solve any one task. Sounds great, doesn't it?

If you are getting into machine learning with Python, the best way to start is to complete a small project first. Why?

Because it will teach you how to install and start the Python interpreter (the bare minimum requirement).

Because it will show you how to work through a project step by step.

Because it will boost your confidence and may encourage you to start creating small projects of your own.

Novices need a complete small project to practice

Textbooks and courses can be frustrating. Although they explain each topic in great detail, the material is fragmented, and it is hard to see how the individual pieces fit together.

When you apply machine learning to your own dataset, you are starting a project, and every such project follows the same broad steps:

Define the problem.

Prepare the data.

Evaluate algorithms.

Improve the results.

Present the results.

The best way to truly master a new platform or tool is to use it to work through a complete machine learning project end to end, covering all the important steps: importing data, summarizing data, evaluating algorithms, and making predictions.

Once you have been through that process, you will understand the routine well enough to repeat it.

One of the best small projects to start with is the classification of iris flowers (dataset link). This project is well suited to beginners because it is very easy to understand.

Because the attributes are numeric, you learn how to import and handle numeric data.

It is a classification problem, which lets you practice with a relatively simple type of supervised learning algorithm.

It is also a multi-class classification problem, which may require some special handling.

It only has 4 attributes and 150 examples, which means that the dataset is small and does not take up too much memory.

All the numeric attributes are in the same units and on similar scales, so no special scaling or transformation is needed before use.

Let's walk through the "Hello World" of machine learning in Python.

In this part, we will work through a small machine learning project from start to finish. Here are the main steps:

Install the Python and SciPy platform.

Import the data set.

Summarize the data set.

Visualize the data set.

Evaluate algorithms.

Make predictions.

You can try typing the code on the command line yourself. To speed things up, you can also copy and paste my code.

1. Install the Python and SciPy platform

If they are not already on your computer, first install Python and the SciPy platform.

I won’t go into details here because there are many tutorials online.

This article uses Python 2.7 or 3.5, together with the following SciPy libraries:

scipy
numpy
matplotlib
pandas
scikit-learn (sklearn)

There are many ways to install these libraries; pick one method and use it consistently to install all of them.

The SciPy installation page provides detailed instructions for installing the above libraries on a variety of systems:

On Mac OS, you can install Python 2.7 and these libraries with MacPorts; see the SciPy installation page for details.

On Linux, you can use your distribution's package manager, for example installing RPM packages on Fedora.

If you are on Windows, it is recommended to install the free version of Anaconda.

Note: this tutorial assumes that scikit-learn version 0.18 or higher is installed.

This step is important: make sure your Python environment is installed successfully and works as expected.

The following script helps you test your environment; it imports every library required for this tutorial and prints its version.

Open a command line and start the Python interpreter:

python

I suggest working directly in the interpreter, or writing scripts and running them from the command line, rather than using a heavyweight editor or IDE. Our focus is machine learning, not software tooling.

Type or copy and paste the following script:

# Check the versions of libraries

# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

If you run it on an OS X workstation, you should see output similar to the following:

Python: 2.7.11 (default, Mar 1 2016, 18:40:10) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)]
scipy: 0.17.0
numpy: 1.10.4
matplotlib: 1.5.1
pandas: 0.17.1
sklearn: 0.18.1

Compare this with your own output.

Ideally, your versions should match, or at least be close. These APIs do not change quickly, so if your versions are slightly older, don't worry: everything in this tutorial should still apply.

If you get an error here, stop and fix it before moving on.

If you cannot run the above script smoothly, you will not be able to complete the rest of this tutorial.

It is recommended to search online for the error you are seeing, or ask someone more experienced, for example on Stack Overflow.

2. Import the dataset

We are going to use the iris dataset. It is very well known: it is the first dataset almost everyone learning machine learning reaches for, the "Hello World" of machine learning datasets.

It contains 150 observations of iris flowers. The flower measurements occupy four columns, in centimeters, and the fifth column is the species of the observed flower. All of the observed flowers belong to one of three species.

In this step, we will import the iris data from the CSV file URL.

First, we import all the modules, functions, and objects used in this tutorial.

# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Every import should complete without errors. If an error occurs, stop immediately: you need a working SciPy environment before proceeding.

We can load the data directly from the UCI Machine Learning Repository using pandas, which we will also use to explore the data with descriptive statistics and visualization.

Note that we will specify the name of each column when importing the data, which will help us to process the data later.

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)

There should also be no errors when importing the dataset.

If you have network problems, you can download the iris.data file to your working directory, change the URL to the local file name, and load it in the same way.
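For example, a minimal sketch of loading the same data from a local copy (assuming you saved the file as iris.data in your working directory):

# Load dataset from a local copy (assumes the file was saved as "iris.data"
# in the current working directory)
import pandas
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv('iris.data', names=names)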

3. Summarize the dataset

Now we can take a look at the data.

In this step, we analyze the data in several ways:

The dimensions of the dataset.

Take a closer look at the data itself.

A statistical summary of all attributes.

The breakdown of the data by the class variable.

Don't worry: each of these is a single command, and the commands are not throwaway; you can reuse them in future projects.

3.1 Dimensions of the dataset

The shape attribute quickly tells us how many rows (instances) and columns (attributes) the data contains.

# shape
print(dataset.shape)

You should see 150 rows and 5 columns:

(150, 5)

3.2 Peek at the data

It is always a good idea to actually eyeball your data.

# head
print(dataset.head(20))

You should see the first 20 rows of the data:

sepal-length sepal-width petal-length petal-width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
11 4.8 3.4 1.6 0.2 Iris-setosa
12 4.8 3.0 1.4 0.1 Iris-setosa
13 4.3 3.0 1.1 0.1 Iris-setosa
14 5.8 4.0 1.2 0.2 Iris-setosa
15 5.7 4.4 1.5 0.4 Iris-setosa
16 5.4 3.9 1.3 0.4 Iris-setosa
17 5.1 3.5 1.4 0.3 Iris-setosa
18 5.7 3.8 1.7 0.3 Iris-setosa
19 5.1 3.8 1.5 0.3 Iris-setosa

3.3 Statistical summary

Now we can look at a statistical summary of each attribute, including the count, mean, minimum and maximum values, and some percentiles.

# descriptions
print(dataset.describe())

We can see that all the numerical values have the same unit (centimeters) and fall within roughly the same range, between 0 and 8 cm:

sepal-length sepal-width petal-length petal-width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

3.4 Class distribution

Now let's look at the number of instances (rows) that belong to each class, viewed as an absolute count.
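One straightforward way to produce this count with pandas (a minimal sketch, grouping the rows by the class column) is:

# class distribution: count the rows that belong to each class
print(dataset.groupby('class').size())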

class
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50

4. Visualize the dataset

Now that we have a basic understanding of the data, we can extend that understanding with some visualizations.

There are two types of visualizations:

Univariate graphs to better understand each attribute.

Multivariate graphs to better understand the relationships between attributes.

Let’s start with some univariate graphs, that is, graphs for each individual variable.

Considering that the input variables are all numbers, we can create a boxplot for each input variable.

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()

This allows us to see the distribution of input attributes more clearly:

We can also create a histogram for each input variable to understand their distribution.

# histograms
dataset.hist()
plt.show()

It looks like two of the input variables have a roughly Gaussian distribution. This is useful to note, because later we can choose algorithms that exploit this assumption.

Now we can look at the interaction between variables.

First, we look at the scatter plot of all attribute pairs, which helps us to see the structured relationship between the input variables.

# scatter plot matrix
scatter_matrix(dataset)
plt.show()

Note the diagonal grouping of some pairs of attributes, which suggests a high correlation and a predictable relationship.
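If you want to quantify that impression, an optional extra (not needed for the rest of the tutorial) is to compute pairwise correlations between the numeric columns:

# correlation matrix of the four numeric attributes (the class column is dropped first)
print(dataset.drop('class', axis=1).corr())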

5. Evaluate algorithms

Now let's build some models of the data and estimate their accuracy on unseen data.

The main steps in this section are:

Split off a validation dataset.

Set up the test harness using 10-fold cross-validation.

Build six different models to predict the species of iris from the flower measurements.

Pick the best model.

We need to know whether the models we build are any good. Later, we will use statistical methods to estimate their accuracy on unseen data, and we also want a more concrete estimate of the best model's accuracy on data it has genuinely never seen.

That is, we will hold back some data that the algorithms never get to see, and use it to judge how accurate the best model really is.

We will split the imported dataset into two parts, 80% for training the model and 20% for validating the model.

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

The training data in X_train and Y_train is used to fit the models; the X_validation and Y_validation sets are held back for later.

We will use the 10-fold cross-validation method to test the accuracy of the model.

This splits our training data into 10 parts: the models train on 9 parts and are tested on the remaining 1, rotating through every combination of train/test splits.

# Test options and evaluation metric
seed = 7
scoring = 'accuracy'
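To see what 10-fold cross-validation does, here is a small illustrative sketch (it uses a toy list of rows rather than the iris data, so the folds are easy to read):

# illustration only: show which rows KFold holds out in each of the 10 folds
from sklearn import model_selection
toy_rows = list(range(20))  # pretend we have 20 samples
kfold = model_selection.KFold(n_splits=10)
for fold, (train_idx, test_idx) in enumerate(kfold.split(toy_rows)):
    print(fold, test_idx)  # every sample appears in exactly one test fold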

We will evaluate the models with the "accuracy" metric: the proportion of iris instances whose class is predicted correctly. We will use the scoring variable when we run and evaluate each model below.
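As a quick illustration of the metric itself (using made-up labels, not the iris data), accuracy is just the fraction of predictions that match the true labels:

# illustration only: 3 of these 4 hypothetical predictions are correct, so accuracy is 0.75
y_true = ['Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-versicolor']
y_pred = ['Iris-setosa', 'Iris-virginica', 'Iris-virginica', 'Iris-versicolor']
print(accuracy_score(y_true, y_pred))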

For this problem, we do not know in advance which algorithms will work best or how they should be configured. The visualizations suggest that some of the classes are partially linearly separable in some dimensions, so we expect generally good results.

Let’s look at these 6 algorithms:

Logistic regression (LR)

Linear Discriminant Analysis (LDA)

K-Nearest Neighbors (KNN)

Classification and Regression Trees (CART)

Gaussian Naive Bayes (NB)

Support Vector Machine (SVM)

This mix includes simple linear algorithms (LR and LDA) and nonlinear algorithms (KNN, CART, NB, and SVM). We reset the random number seed before each run so that every algorithm is evaluated on exactly the same data splits, which makes the final results directly comparable.

Let’s build and evaluate the model:

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn with 10-fold cross-validation
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

We now have six models and an accuracy estimate for each. Next we compare the models to each other and select the most accurate one.

Running the example above produces the following raw results:

LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)
NB: 0.975000 (0.053359)
SVM: 0.981667 (0.025000)

We can see that KNN seems to have the highest estimated accuracy score.

We can also plot the model evaluation results and compare the spread and mean accuracy of each model. There is a population of accuracy scores for each algorithm, because each one was evaluated 10 times (10-fold cross-validation).

# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

You can see that the boxplots are squashed at the top of the range, with many evaluations reaching 100% accuracy.

6. Make predictions

Cross-validation suggests that the KNN algorithm has the highest accuracy. Now let's see how accurate the model is on the validation set.

This is a final check on how accurate the best model really is. It is worth keeping a held-out validation set in case you slipped up during training, for example by overfitting to the training set or through a data leak; either mistake would make the reported result overly optimistic.

We can run the KNN algorithm directly on the validation set and summarize the results into a final accuracy score, a confusion matrix, and a classification report.

# Make predictions on validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

We can see that the model's accuracy is 0.9, or 90%. The confusion matrix shows the three errors that were made. Finally, the classification report breaks down precision, recall, F1-score, and support for each class.

[[ 7 0 0]
[ 0 11 1]
[ 0 2 9]]

precision recall f1-score support

Iris-setosa 1.00 1.00 1.00 7
Iris-versicolor 0.85 0.92 0.88 12
Iris-virginica 0.90 0.82 0.86 11

avg / total 0.90 0.90 0.90 30
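If the bare confusion matrix is hard to read, an optional extra is to label its rows (actual classes) and columns (predicted classes). A minimal sketch, assuming these three class names:

# optional: print the confusion matrix with labeled rows (actual) and columns (predicted)
labels = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
cm = confusion_matrix(Y_validation, predictions, labels=labels)
print(pandas.DataFrame(cm, index=labels, columns=labels))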

It will take you 5–10 minutes to go through the above tutorial!

You don't need to understand everything (at least not right away). Your goal is to work through the tutorial from start to finish and get a result. Write down your questions as you go, and use help(FunctionName) in Python to learn about the functions you are using.
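For example, from the Python interpreter you can pull up the documentation of functions used in this tutorial (these calls assume the tutorial's imports and variables are in scope):

# read the built-in documentation for things used in this tutorial
help(dataset.describe)       # pandas summary statistics
help(KNeighborsClassifier)   # scikit-learn KNN classifier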

You don't need to understand how each algorithm works yet. It is important to know the limitations of machine learning algorithms and how to configure them, but the theory behind them can come later, built up step by step. At this stage, the main task is to become familiar with the platform.

You don't have to be a Python programmer either. If you are new to Python, its syntax may look strange. As with any other language, focus on function calls and assignments first, and dig into the details of the syntax later.

You don't need to be a machine learning expert, either. You can learn the strengths and limitations of each algorithm later, and there is plenty of material on topics such as the steps of a machine learning project and the importance of evaluating models on a validation set.

Other steps of a machine learning project: this article does not cover every step of a machine learning project, because this is our first project, and it is better to focus on the important ones: importing data, looking at the data, evaluating algorithms, and making predictions.