Source: Deep Learning on Medium

First of all, I am sorry for the late upload. Now we can continue what we started in the previous article.

This article is the second part of the series I started on the Deep Learning book. I would suggest checking out the previous article for better understanding. In this article I am not really talking about deep learning but rather about a very important topic that will help us build our foundation for it. So this article, and some upcoming ones, will cover the **Applied Mathematics** that you should understand if you really want to learn deep learning.

#### Why do I need to know this?

Why do we have to learn the maths and algorithms from scratch if we have built in frameworks that could do all the work automatically?

Let me share my personal experience here. When I started learning machine learning, I first learned the algorithms through sklearn, but I didn’t know anything about the workings or the mathematics behind them. After that course I went looking for machine learning internships to get more experience, but every time an interviewer asked me what I knew about, say, SVM, all I could say was that I knew how to use it through sklearn “only”, and the same was true for every algorithm I had learned. The small takeaway from this experience: nobody needs you just to apply algorithms through sklearn or any other library, because any person with web access can do that.

You should know the mechanics of these algorithms and the maths behind them: exactly what is going on when you apply them through some library, so that when the time comes to tweak an algorithm to get better results, you understand its basic structure. To understand this you need some mathematical concepts such as linear algebra, probability, and statistics. So I am going to start here with linear algebra.

#### Introduction

Linear algebra is the branch of mathematics concerning linear equations and their representations through vectors and matrices.

Here I will introduce some terminology and concepts that we will use later to understand more complex topics in machine learning as well as deep learning.

#### Scalars, Vectors, Matrices and Tensors

The study of linear algebra involves several types of mathematical objects:

**Scalars:** A scalar is just a single number. We write scalars in *italics* and give them lowercase variable names. For example, “Let *s* ∈ ℝ be the slope of the line”.

**Vectors:** A vector is an array of numbers. The numbers are arranged in order, and we can access each individual number by its index. Typically we write vectors in lowercase **bold italic** typeface, such as ***x***. An individual element is accessed with a subscript: the first element of ***x*** is *x₁*, the second is *x₂*, and so on. One more thing to know about vectors is that they can be both column vectors and row vectors.

We can think of vectors as identifying points in space, with each element giving the coordinate along a different axis.

**Matrices:** A matrix is a 2-D array of numbers, so a matrix has both rows and columns, and each element is identified by two indices. We usually give a matrix an uppercase variable name in bold typeface, such as **A**. An individual element is identified by its indices listed with a separating comma, such as *Aᵢ,ⱼ*.

There are many operations that we can perform on matrices, and one of them is the **transpose**. The transpose of a matrix is the mirror image of the matrix across the diagonal line called the **main diagonal** or **principal diagonal**. The same operation can be performed on vectors too.

(Aᵗ)ᵢ,ⱼ = Aⱼ,ᵢ
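As a quick sketch of the transpose rule (using NumPy, which the article itself does not introduce, so treat this as an illustrative assumption):

```python
import numpy as np

# A 2x3 matrix; its transpose swaps rows and columns, so (A^T)[i, j] == A[j, i].
A = np.array([[1, 2, 3],
              [4, 5, 6]])
At = A.T

print(At.shape)   # (3, 2)
print(At[0, 1])   # equals A[1, 0], i.e. 4
```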

We can add two matrices to each other as long as they have the same shape, just by adding their corresponding elements: Cᵢ,ⱼ = Aᵢ,ⱼ + Bᵢ,ⱼ.

In the context of deep learning, we also allow the addition of a matrix and a vector, yielding another matrix: Cᵢ,ⱼ = Aᵢ,ⱼ + bⱼ. Here the concept of **broadcasting** is used, in which the vector *b* is implicitly copied and added to each row of the matrix *A*.

**Tensors:** An array of numbers arranged on a regular grid with a variable number of axes is known as a tensor. We denote a tensor named “A” with the typeface **A**.
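These objects map directly onto NumPy arrays; the following sketch (my own illustration, not from the book) shows a scalar, a vector, a matrix, a tensor, and matrix-plus-vector broadcasting:

```python
import numpy as np

s = 3.5                                  # scalar: a single number
x = np.array([1.0, 2.0, 3.0])            # vector: 1-D array
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])          # matrix: 2-D array
T = np.zeros((2, 3, 4))                  # tensor: array with 3 axes

b = np.array([10.0, 20.0, 30.0])
C = A + b        # broadcasting: b is implicitly added to each row of A

print(C)         # [[11. 22. 33.], [14. 25. 36.]]
```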

#### Multiplying Matrices and Vectors

Multiplication of matrices is one of the most important operations, and we will use it very often in deep learning too. For a matrix *C* to be the product of matrices *A* and *B*, they have to satisfy a condition: *A* must have the same number of columns as *B* has rows. So if *A* is of shape *m* × *n* and *B* is of shape *n* × *p*, then *C* is of shape *m* × *p*.

C = AB

Cᵢ,ⱼ = ∑ₖ Aᵢ,ₖ Bₖ,ⱼ

There is another type of product, called the **element-wise product** or **Hadamard product**, which is the product of the individual elements of two matrices of the same shape.

For vectors there is also the **dot product**: the dot product of two vectors *x* and *y* of the same dimensionality is the matrix product **xᵗy**.
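A minimal NumPy sketch of the three products just described (shapes and values are my own example, not from the book):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])        # shape (3, 2)
B = np.array([[1, 0, 2],
              [0, 1, 3]])     # shape (2, 3)

C = A @ B                     # matrix product: (3, 2) x (2, 3) -> (3, 3)

# Hadamard (element-wise) product requires matrices of the same shape.
H = A * A

# Dot product of two vectors of the same dimensionality: (x^T)y.
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
d = x @ y                     # 1*4 + 2*5 + 3*6 = 32
```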

#### Identity and Inverse Matrix

An identity matrix is a matrix that does not change any vector when we multiply that vector by it. We denote the identity matrix by **I**. Formally,

Iₙ ∈ ℝⁿˣⁿ, and ∀x ∈ ℝⁿ, Iₙx = x

The matrix inverse of **A** is denoted *A⁻¹*, and it is defined as the matrix such that

A⁻¹A = Iₙ
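Both definitions can be checked numerically; here is a sketch with NumPy (the matrix is an arbitrary invertible example of my own):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

I = np.eye(2)                 # 2x2 identity matrix
A_inv = np.linalg.inv(A)      # matrix inverse of A

x = np.array([4.0, 5.0])
print(np.allclose(I @ x, x))        # identity leaves any vector unchanged
print(np.allclose(A_inv @ A, I))    # A^-1 A == I, up to float rounding
```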

#### Linear Combinations and Span

Let v₁, v₂,…,vᵣ be vectors in ℝⁿ. A **linear combination** of these vectors is any expression of the form

k₁v₁ + k₂v₂+….+kᵣvᵣ

where the coefficients k₁,k₂,….,kᵣ are scalars.

The set of all linear combinations of a collection of vectors v₁, v₂,…,vᵣ from ℝⁿ is called the **span** of {v₁, v₂,…,vᵣ}, and this span is always a subspace of ℝⁿ.
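A concrete linear combination, sketched in NumPy with coefficients I chose for illustration:

```python
import numpy as np

v1 = np.array([1.0, 0.0, 2.0])
v2 = np.array([0.0, 1.0, 1.0])

# The linear combination k1*v1 + k2*v2 with scalar coefficients k1, k2.
k1, k2 = 3.0, -2.0
w = k1 * v1 + k2 * v2

print(w)   # [ 3. -2.  4.] -- one point in span{v1, v2}
```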

#### Linear Dependence and Independence

Let *A* = {v₁, v₂,…,vᵣ} be a collection of vectors from ℝⁿ. If r ≥ 2 and at least one of the vectors in *A* can be written as a linear combination of the others, then *A* is said to be **linearly dependent**. On the other hand, if no vector in *A* can be written as a linear combination of the others, then *A* is said to be **linearly independent**.
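One practical way to test this (my own sketch, not from the book) is to stack the vectors as columns of a matrix: the vectors are linearly independent exactly when the matrix rank equals the number of vectors.

```python
import numpy as np

# Columns of each matrix are the vectors being tested.
independent = np.column_stack([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
dependent   = np.column_stack([[1, 2, 3], [2, 4, 6], [0, 1, 0]])  # col2 = 2*col1

print(np.linalg.matrix_rank(independent))  # 3 -> linearly independent
print(np.linalg.matrix_rank(dependent))    # 2 -> linearly dependent
```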

#### Norms

In machine learning we often have to measure the size of vectors, and we do this with a function called a **norm**.

On an intuitive level, the norm of a vector **x** measures the distance from the origin to the point *x*. More rigorously, a norm is any function *f* that satisfies the following properties:

f(x) = 0 ⇒ x = 0

f(x + y) ≤ f(x) + f(y) (the triangle inequality)

∀α ∈ ℝ, f(αx) = |α| f(x)

Formally, the **Lᵖ norm** is given by:

||x||ₚ = (∑ᵢ |xᵢ|ᵖ)^(1/p)
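The formula translates directly into code; this sketch (using NumPy as an assumption) computes the general Lᵖ norm by hand and checks it against NumPy’s built-in:

```python
import numpy as np

x = np.array([3.0, -4.0])

def lp_norm(x, p):
    # (sum_i |x_i|^p) ** (1/p), the general L^p norm
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

print(lp_norm(x, 1))             # L1 norm: |3| + |-4| = 7.0
print(lp_norm(x, 2))             # L2 (Euclidean) norm: sqrt(9 + 16) = 5.0
print(np.linalg.norm(x, ord=2))  # matches the built-in
```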

This is the general form of norm and by putting different values of p we can get different types of norms.

The **L² norm**, with *p* = 2, is known as the **Euclidean norm**; it is simply the Euclidean distance from the origin to the point identified by **x**. It is used so frequently in machine learning that it is often denoted simply by ||*x*||. It is also common to measure the size of a vector using the squared L² norm, which can be calculated simply as *x*ᵗ*x*.

Each derivative of the squared L² norm with respect to an element of x depends only on that element, while every derivative of the L² norm depends on the entire vector. This is one of the reasons it is mathematically and computationally easier to work with the squared L² norm.

In several machine learning applications, it is important to discriminate between elements that are exactly zero and elements that are small but nonzero. The squared L² norm cannot do this because it increases very slowly near the origin, and that’s where we turn to the L¹ norm, which grows at the same rate in all locations.

||x||₁ = ∑ᵢ |xᵢ|

One other norm that commonly arises in machine learning is the **max norm** (also known as the L^∞ norm).

||x||∞ = maxᵢ |xᵢ|

Sometimes we also wish to measure the size of a matrix. In deep learning, the most common way to do this is with the **Frobenius norm**:

||A||_F = √(∑ᵢ,ⱼ A²ᵢ,ⱼ)
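Both the max norm and the Frobenius norm are one-liners in NumPy; the values below are my own illustrative example:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

max_norm  = np.max(np.abs(x))        # largest absolute entry: 4.0
frobenius = np.sqrt(np.sum(A ** 2))  # sqrt(1 + 4 + 9 + 16) = sqrt(30)

print(max_norm)
print(np.isclose(frobenius, np.linalg.norm(A, ord="fro")))  # True
```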

#### Special Kinds of matrices and vectors

**Diagonal matrix:** A matrix consisting mostly of zeros, with nonzero entries only along the main diagonal. The identity matrix is a diagonal matrix whose diagonal entries are all 1.

**Square matrix:** A matrix that has an equal number of rows and columns.

**Symmetric matrix:** A matrix that is equal to its own transpose.

A = Aᵗ

**Unit vector:** A vector with unit norm.

||x||₂ = 1

**Orthogonal vectors:** A vector x and a vector y are called orthogonal to each other if xᵗy = 0. If both vectors have nonzero norm, this means they are at a 90° angle to each other.

Vectors are called **orthonormal** if they are orthogonal as well as have unit norm.

**Orthogonal matrix:** A square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal:

AᵗA = AAᵗ = I

which implies that

A⁻¹ = Aᵗ
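A 2-D rotation matrix is a classic orthogonal matrix; this NumPy sketch (my own example) verifies both properties:

```python
import numpy as np

theta = np.pi / 4
# Rotation by theta: rows and columns are orthonormal, so Q is orthogonal.
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))     # Q^T Q == I
print(np.allclose(np.linalg.inv(Q), Q.T))  # hence Q^-1 == Q^T
```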

#### Eigendecomposition

Many mathematical problems can be understood better if we break them into smaller ones. The same is true here: we can find new properties, or new ways to solve complex problems, if we try to understand them part by part.

Just as we can learn more about an integer by decomposing it into its prime factors, we can learn more about a matrix and its properties if we decompose it into a set of **eigenvectors** and **eigenvalues**.

An eigenvector of a square matrix **A** is a nonzero vector *v* that satisfies the eigenvalue equation:

Av = λv

Here the scalar **λ** is known as the **eigenvalue** corresponding to this eigenvector.

Suppose that a matrix **A** has n linearly independent eigenvectors {v₁,…,vₙ} with corresponding eigenvalues {λ₁,…,λₙ}. For better understanding, let’s concatenate all the eigenvectors to form a matrix *V* with one eigenvector per column, *V* = [v₁,…,vₙ], and similarly concatenate all the eigenvalues to form a vector λ = [λ₁,…,λₙ]. Then the **eigendecomposition** of *A* is given by:

A = V diag(λ) V⁻¹

Here diag(λ) is the diagonal matrix with the eigenvalues along its main diagonal.
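The whole decomposition can be sketched with NumPy (the symmetric matrix is my own example; `np.linalg.eig` returns the eigenvalues and a matrix whose columns are the eigenvectors):

```python
import numpy as np

# A symmetric matrix, which has real eigenvalues.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

eigvals, V = np.linalg.eig(A)    # columns of V are eigenvectors

# Each pair satisfies the eigenvalue equation A v = lambda v.
for lam, v in zip(eigvals, V.T):
    assert np.allclose(A @ v, lam * v)

# Reconstruct A = V diag(lambda) V^-1.
A_rebuilt = V @ np.diag(eigvals) @ np.linalg.inv(V)
print(np.allclose(A, A_rebuilt))   # True
```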

Eigendecomposition plays a role in machine learning methods such as **Principal Component Analysis (PCA)**.

This concludes the second part of this blog series. Here I covered some basic topics that you need to know; in the next part I will cover some more advanced topics in linear algebra, and we will also learn about **Principal Component Analysis** from scratch. I hope you enjoyed this part, and we will learn more in the upcoming one.

### Reference

*Deep Learning* by Ian Goodfellow, Yoshua Bengio, and Aaron Courville.