This is what I have learned during my first 2 days of 100 days of Machine Learning Code



The #100DaysOfMLCode is a challenge in which you spend at least 1 hour a day learning about Machine Learning and you publish what you have been learning to keep yourself accountable during that time.

During my first week of #100DaysOfMLCode I’ve been working on two different courses in no particular order. Here is the list of courses:

This is what I learned about Machine Learning during my first 2 days:

Key Machine Learning Terminology

  • Feature: features are the input variables we feed into a model. A feature can be as simple as a single number or as complex as an image (which in reality is a vector of numbers, where each pixel is a feature). Features are normally referred to as x.
  • Label: the thing we are predicting. It is normally referred to as y.
  • Prediction: or predicted value, is the value a previously trained model outputs for a given input. It is referred to as y’.

Regression vs. classification:

  • A regression model predicts continuous values.
  • A classification model predicts discrete values.

Linear Regression

Linear regression is a method for finding the straight line or hyperplane that best fits a set of points.

Line formula:

y = wx + b

Where:

w = Weights

x = Input features

b = Bias
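
As a small illustration of the line formula above, here is a minimal sketch in plain Python. The function name predict and the values chosen for w and b are my own assumptions for the example; they do not come from a trained model.

```python
def predict(x, w, b):
    """Return the prediction y' = w * x + b for a single input feature x."""
    return w * x + b


# Hypothetical weight and bias, chosen only for illustration.
w, b = 2.0, 1.0

# Apply the line formula to three example inputs.
predictions = [predict(x, w, b) for x in [0.0, 1.0, 2.0]]
print(predictions)  # [1.0, 3.0, 5.0]
```

With more than one feature, w becomes a vector of weights (one per feature) and wx becomes a dot product, but the idea is the same.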

Some convenient loss functions for linear regression are:

  • L2 Loss: also called squared error, it is equal to (observation - prediction)²
  • Mean Squared Error (MSE): the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and take the average (divide by the number of examples): MSE = (1/N) · Σ (y - y’)², where N is the number of examples.
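
The two loss functions above can be sketched in a few lines of plain Python. The function names and the sample observations and predictions are made up for illustration.

```python
def squared_error(observation, prediction):
    """L2 loss for a single example: (observation - prediction) squared."""
    return (observation - prediction) ** 2


def mean_squared_error(observations, predictions):
    """Average of the squared errors over all examples."""
    losses = [squared_error(y, y_pred) for y, y_pred in zip(observations, predictions)]
    return sum(losses) / len(losses)


# Made-up labels and predictions for three examples.
observations = [1.0, 3.0, 5.0]
predictions = [1.0, 2.0, 7.0]

# Individual squared errors are 0, 1 and 4, so the MSE is 5 / 3.
mse = mean_squared_error(observations, predictions)
print(mse)
```

Because the errors are squared, one prediction that is far off (the last one here) weighs much more than several that are slightly off.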

When training a model we want to minimize the loss as much as possible to make the model more accurate, without overfitting.

This is what I learned about Pandas

Pandas is a great Python library for column-oriented data analysis.

To import pandas use the following line:

import pandas as pd

There are 2 primary data structures used in Pandas:

Series: represents a single column.

DataFrame: similar to a relational data table, it is composed of one or more Series.

To create a Series:

city_names = pd.Series(['Barcelona', 'Madrid', 'Valencia'])
population = pd.Series([1609000, 3166000, 790201])

To create a dataframe with the previous series use the following:

spain_cities_df = pd.DataFrame({ 'City name': city_names, 'Population': population })

A dataframe is created by passing a dictionary that maps each column name (a string) to the Series holding that column's content.
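
Putting the previous snippets together, here is a small runnable sketch that builds the dataframe and checks its shape. It only uses the names and values already shown above.

```python
import pandas as pd

# Each Series becomes one column of the dataframe.
city_names = pd.Series(['Barcelona', 'Madrid', 'Valencia'])
population = pd.Series([1609000, 3166000, 790201])

# The dictionary keys become the column names.
spain_cities_df = pd.DataFrame({'City name': city_names, 'Population': population})

print(list(spain_cities_df.columns))  # ['City name', 'Population']
print(spain_cities_df.shape)  # (3, 2)
```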

Most commonly you will not write the content of a dataframe by hand but read it from a file, such as a comma-separated values file (CSV for short).

spain_cities_df = pd.read_csv('path/to/file.csv', sep=',')
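
To try read_csv without having a file on disk, one option is to wrap CSV text in io.StringIO, since read_csv accepts file-like objects as well as paths. The CSV content below is an assumption for the example; it just mirrors the cities used earlier.

```python
import io

import pandas as pd

# In-memory CSV text standing in for 'path/to/file.csv'.
csv_text = (
    "City name,Population\n"
    "Barcelona,1609000\n"
    "Madrid,3166000\n"
    "Valencia,790201\n"
)

# The first line is used as the header row by default.
spain_cities_df = pd.read_csv(io.StringIO(csv_text), sep=',')
print(spain_cities_df)
```

The comma is already the default separator, so sep=',' is only needed to make the choice explicit or to switch to another delimiter such as ';'.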

You can get summary statistics with the df.describe() function: it returns the count, mean, std, min, 25%, 50%, 75% and max for each numeric column.

spain_cities_df.describe()
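
Here is a short sketch of what describe() gives back, using the same cities as before. The result is itself a dataframe, so individual statistics can be read out with .loc. The variable name stats is my own.

```python
import pandas as pd

spain_cities_df = pd.DataFrame({
    'City name': ['Barcelona', 'Madrid', 'Valencia'],
    'Population': [1609000, 3166000, 790201],
})

# By default only numeric columns are summarized, so 'City name' is left out.
stats = spain_cities_df.describe()
print(stats)

# Rows of the result are the statistic names, columns are the original columns.
largest = stats.loc['max', 'Population']
print(largest)
```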

Another useful function is df.head(), which displays the first 5 rows so you can get an idea of what the dataframe contains.

spain_cities_df.head()

Similarly, you can use df.tail() to return the last 5 rows of data in the dataframe. Both functions accept an integer for the number of rows to return. The default is 5, but you can use any number you want, for example:

spain_cities_df.tail(20)

Will return the last 20 rows of the dataframe (or every row if the dataframe has fewer than 20).
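
Since the cities dataframe only has 3 rows, here is a sketch with a slightly larger, made-up dataframe (a single column n holding 0 to 9) that makes the effect of head and tail easier to see.

```python
import pandas as pd

# Hypothetical dataframe with 10 rows, used only to illustrate head and tail.
numbers_df = pd.DataFrame({'n': list(range(10))})

first_rows = numbers_df.head()    # first 5 rows by default
last_rows = numbers_df.tail(3)    # last 3 rows

print(first_rows['n'].tolist())  # [0, 1, 2, 3, 4]
print(last_rows['n'].tolist())   # [7, 8, 9]
```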

A powerful feature is graphing. DataFrame.hist lets you quickly study the distribution of values in a column:

spain_cities_df.hist('Population')

To access data just use a column name as the key of the dataframe:

spain_cities_df['City name']

Will return the whole Series, with the 3 items inside.

To access just one item in that column you can do this:

spain_cities_df['City name'][0]

That will return “Barcelona” as a string
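
A runnable sketch of column and item access, using the same dataframe as before. The .loc form at the end is an alternative I am adding for comparison; it selects by row label and column name in a single step.

```python
import pandas as pd

spain_cities_df = pd.DataFrame({
    'City name': ['Barcelona', 'Madrid', 'Valencia'],
    'Population': [1609000, 3166000, 790201],
})

# Selecting a column by name returns a Series.
names = spain_cities_df['City name']
print(type(names).__name__)  # Series

# Indexing that Series with the row label 0 returns a single value.
first_city = spain_cities_df['City name'][0]
print(first_city)  # Barcelona

# Equivalent lookup in one step: row label first, then column name.
same_city = spain_cities_df.loc[0, 'City name']
```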

It is also possible to return only a slice of the dataframe (by slicing as you would do with any array in python)

spain_cities_df[0:2]

Will return a DataFrame with the first 2 rows of the original dataframe.
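
Here is a small sketch showing that a plain slice selects rows, not columns. The comparison with .iloc is my own addition; it is the explicit, position-based way to write the same selection.

```python
import pandas as pd

spain_cities_df = pd.DataFrame({
    'City name': ['Barcelona', 'Madrid', 'Valencia'],
    'Population': [1609000, 3166000, 790201],
})

# A slice inside [] selects rows by position; the end (2) is excluded.
first_two = spain_cities_df[0:2]
print(first_two['City name'].tolist())  # ['Barcelona', 'Madrid']

# The same selection written with position-based indexing.
first_two_iloc = spain_cities_df.iloc[0:2]
```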

Pandas will also allow manipulating data in series so for example you could do this:

spain_cities_df['Population']/1000

This returns a new Series in which every value of that column is divided by 1000. The original dataframe is not modified.
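
A quick sketch of this kind of element-wise arithmetic on a Series. The variable name population_thousands is an assumption for the example.

```python
import pandas as pd

spain_cities_df = pd.DataFrame({
    'City name': ['Barcelona', 'Madrid', 'Valencia'],
    'Population': [1609000, 3166000, 790201],
})

# The division is applied to every value and produces a new Series.
population_thousands = spain_cities_df['Population'] / 1000
print(population_thousands.tolist())

# The column stored in the dataframe keeps its original values.
print(spain_cities_df['Population'].tolist())  # [1609000, 3166000, 790201]
```

To keep the result, assign it to a variable or to a column of the dataframe.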

Adding a new Series (or column) to a DataFrame is as simple as assigning it to a new column name:

spain_cities_df['New column'] = pd.Series([1, 2, 3])
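
One detail worth knowing when adding a column from a Series: pandas matches values by index, not by position. The sketch below uses a Series that is deliberately one item short (my own variation on the line above) to show that the unmatched row gets NaN.

```python
import pandas as pd

spain_cities_df = pd.DataFrame({
    'City name': ['Barcelona', 'Madrid', 'Valencia'],
    'Population': [1609000, 3166000, 790201],
})

# This Series only has index labels 0 and 1, so row 2 has no matching value.
spain_cities_df['New column'] = pd.Series([1, 2])

print(spain_cities_df)
print(spain_cities_df['New column'].isna().tolist())  # [False, False, True]
```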

Every row in a DataFrame gets an auto-generated integer index by default. Once assigned, an index value stays with its row: even if the data is reordered, the index moves with the row.

DataFrame.reindex reorders rows: it accepts a list of index values as the new order and returns a new DataFrame.

spain_cities_df.reindex([2, 0, 1])

Will return the cities in the order Valencia, Barcelona, Madrid.
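
A runnable sketch of the reindex call above, which also shows that the index labels travel with their rows and that the original dataframe is left untouched. The variable name reordered is my own.

```python
import pandas as pd

spain_cities_df = pd.DataFrame({
    'City name': ['Barcelona', 'Madrid', 'Valencia'],
    'Population': [1609000, 3166000, 790201],
})

# reindex returns a new dataframe with the rows in the requested order.
reordered = spain_cities_df.reindex([2, 0, 1])

print(reordered['City name'].tolist())  # ['Valencia', 'Barcelona', 'Madrid']
print(reordered.index.tolist())         # [2, 0, 1]

# The original dataframe keeps its order.
print(spain_cities_df.index.tolist())   # [0, 1, 2]
```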

Pandas is huge and these are just the basics of course, but knowing just that it is already possible to do a lot of data analysis!

This is what I learned during my first 2 days of 100 days of ML Code!

During the first week I was also learning about Convolutional Neural Networks and Computer Vision, but I will be posting about that in the next couple of days!

I post my daily updates on my Twitter account @georgestudenko and you can also see my daily progress on my GitHub repository.

Source: Deep Learning on Medium