Linear algebra is incredibly useful for deep learning. If probability and statistics is the theory behind machine learning, then linear algebra is what makes it possible and computationally efficient.

This article assumes you have a pretty good computer science and math background. The article is on the visual side, meaning I tried to illustrate most concepts graphically.

Tensors
Tensor A is just a stack of matrices, in other words a 3D array of numbers.
Therefore, to access an element you need three indexes: i, j, and k.
this tensor is not actually indexed properly, it's just copied and layered
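To make the indexing concrete, here is a small sketch in plain Python, using a made-up 2×2×2 tensor (the values are arbitrary, just for illustration):

```python
# A 2×2×2 tensor as a nested Python list.
T = [
    [[1, 2],
     [3, 4]],
    [[5, 6],
     [7, 8]],
]

# Accessing an element needs three indexes: i (which matrix), j (row), k (column).
i, j, k = 1, 0, 1
print(T[i][j][k])  # 6
```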
Matrices
Going back to matrices. Let’s try to find a motivation to study these.

Like I mentioned, a matrix is just a 2D array of numbers, but what can it represent? In the most traditional sense, it represents a system of linear equations. Not polynomial equations. Linear equations.

this is a polynomial equation, it's got powers on the variables
This is a system of linear equations.

this is a 2×3 matrix, 2 rows and 3 columns
From this form you can run an algorithm called RREF (reduced row echelon form) to solve this system of equations. But there are better ways to do things.
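For a 2×2 system there is also a closed-form shortcut, Cramer's rule, which captures the same elimination idea that RREF automates. A minimal sketch, with a made-up example system:

```python
def solve_2x2(a, b, c, d, e, f):
    """Solve  a*x + b*y = e  and  c*x + d*y = f  via Cramer's rule."""
    det = a * d - b * c
    if det == 0:
        raise ValueError("no unique solution")
    x = (e * d - b * f) / det
    y = (a * f - e * c) / det
    return x, y

# Solve:  x + 2y = 5  and  3x + 4y = 6
print(solve_2x2(1, 2, 3, 4, 5, 6))  # (-4.0, 4.5)
```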

in this representation A is a 2×2 matrix, x is a 2×1 vector and b is a 2×1 vector
You might be wondering how you can just do that. To verify that this works, you need to know matrix multiplication.
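Here is what the Ax = b claim looks like computationally: multiply each row of A with x and you should recover b. A small sketch with made-up numbers:

```python
A = [[2, 1],
     [1, 3]]
x = [1, 2]  # the 2×1 vector of unknowns, pretend we already solved for it

# Matrix-vector product: each entry of b is row i of A dotted with x.
b = [sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
print(b)  # [4, 7]
```

So Ax really does reproduce the right-hand sides of the two equations.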

Matrix Operations
In matrix multiplication, where order matters, you always want to make sure the number of columns of the first matrix matches the number of rows of the second. The result then has the row count of the first matrix and the column count of the second.

Addition
But let’s back it up a bit. You can do a lot of things with matrices. You can add them.

because
very straightforward
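As a quick sketch in plain Python (values made up), matrix addition is purely element-wise, so both matrices must have the same shape:

```python
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

# Element-wise addition: C[i][j] = A[i][j] + B[i][j]
C = [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]
print(C)  # [[6, 8], [10, 12]]
```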
Matrix Multiplication
You can multiply them too, assuming certain conditions are met first.

Let’s say you have an m×n matrix A and a p×q matrix B. You can multiply these matrices as AB, where A is on the left, if and only if n = p.

For a simple example, take a matrix that’s 2 by 2 and another matrix that’s 2 by 3. You can multiply them together only because the number of columns of the first matrix is the same as the number of rows of the second.

This is because the formula for getting the resulting matrix is this:

this formula makes it clear that for each term, A's column index and B's row index must be in sync, and they range over the same size
You should definitely look at this video to see how this works on paper. But soon we will be doing this somewhere you're familiar with: Python and Swift.
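The formula above translates almost directly into code. A minimal Python sketch (the example matrices are made up):

```python
def matmul(A, B):
    """Multiply an m×n matrix A by an n×q matrix B, giving an m×q matrix."""
    m, n, q = len(A), len(B), len(B[0])
    assert len(A[0]) == n, "columns of A must match rows of B"
    # C[i][j] = sum over k of A[i][k] * B[k][j]
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(q)]
            for i in range(m)]

A = [[1, 2], [3, 4]]        # 2×2
B = [[1, 0, 2], [0, 1, 3]]  # 2×3
print(matmul(A, B))  # [[1, 2, 8], [3, 4, 18]]  -- a 2×3 result
```

Notice the triple nesting: this is exactly the kind of loop we will later want to vectorize away.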

This is great. Now that we can add and multiply matrices, we can also use the properties of these operations for machine learning.

this is the distributive property
this is the associative property
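You can sanity-check both properties numerically. A small sketch with throwaway helper functions and made-up 2×2 matrices:

```python
def mul(A, B):
    # naive matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(A, B):
    # element-wise sum
    return [[A[i][j] + B[i][j] for j in range(len(A[0]))]
            for i in range(len(A))]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [0, 2]]

# Distributive: A(B + C) == AB + AC
print(mul(A, add(B, C)) == add(mul(A, B), mul(A, C)))  # True
# Associative: (AB)C == A(BC)
print(mul(mul(A, B), C) == mul(A, mul(B, C)))  # True
```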
Transpose
Another important operation is the transpose. The formula for the transpose is:

where the row index becomes the column index, and the column index becomes the row index for every element
The transpose can be thought of as a mirror reflection along the main diagonal. The main diagonal elements are the elements that have the same index for row and column.

notice that the (row, column) indexes are always the same (i, i)
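In code, transposing is just swapping the two loop indexes. A quick sketch with a made-up 2×3 matrix:

```python
A = [[1, 2, 3],
     [4, 5, 6]]

# Transpose: element (i, j) moves to (j, i), so a 2×3 becomes a 3×2.
At = [[A[i][j] for i in range(len(A))] for j in range(len(A[0]))]
print(At)  # [[1, 4], [2, 5], [3, 6]]
```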
Main diagonals are important because they come up a lot. I always thought of them like this: summing the elements of the main diagonal gives you the trace of the matrix, and for a triangular matrix, multiplying them gives you the determinant. Both the trace and the determinant are constant properties of the linear map a matrix represents: no matter how you change the basis, stretching or rotating the coordinate system, those values remain the same.
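For instance, the trace is a one-liner over the diagonal entries A[i][i]. A sketch with a made-up 3×3 matrix:

```python
A = [[3, 1, 4],
     [1, 5, 9],
     [2, 6, 5]]

# Trace: the sum of the main-diagonal elements A[i][i].
trace = sum(A[i][i] for i in range(len(A)))
print(trace)  # 13  (3 + 5 + 5)
```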

Broadcasting
We also allow for matrix and vector addition, where the resulting object is a matrix: the vector gets added to every row of the matrix.

notice that the result is the same size as A, and that A and b must have the same number of columns
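A quick sketch of this broadcasting behavior in plain Python (values made up): the vector b must have as many elements as A has columns, and it gets added to each row in turn:

```python
A = [[1, 2, 3],
     [4, 5, 6]]
b = [10, 20, 30]  # one row, same number of columns as A

# Broadcasting: b is added to every row of A.
C = [[A[i][j] + b[j] for j in range(len(b))] for i in range(len(A))]
print(C)  # [[11, 22, 33], [14, 25, 36]]
```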
Motivation To Understand Matrix Multiplication
A couple of things lead us to needing to understand matrix multiplication on a deeper level.

The cost function is defined as a summation of summations (with a long calculation inside).
It’s possible to vectorize our math and code so that we represent our cost function as matrices and vectors, saving us from nested for loops, which would take an eternity to compute. Nested loops are a Big O no-no!
Dot Product
The dot product is the element-wise multiplication of two vectors, summed up into a single number. It gives you a scalar value, as opposed to the matrix you get from matrix multiplication. (It is sometimes confused with the Hadamard product, which is element-wise multiplication without the final sum, and gives you back a vector.)

given these two vectors, the dot product is
Sometimes we might also write the transpose of a times b to say the same thing; the dot product and aᵀb are equal. Another thing the dot product is equal to is the norm of a times the norm of b times cos(angle between them). It can be thought of as the projection of one vector onto the other.
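Both views of the dot product can be checked numerically. A sketch with two made-up 2D vectors:

```python
import math

a = [3, 0]
b = [3, 4]

# Dot product: multiply element-wise, then sum. The result is a scalar.
dot = sum(ai * bi for ai, bi in zip(a, b))
print(dot)  # 9

# Same value from ||a|| * ||b|| * cos(angle between a and b).
norm_a = math.sqrt(sum(v * v for v in a))  # 3.0
norm_b = math.sqrt(sum(v * v for v in b))  # 5.0
cos_angle = dot / (norm_a * norm_b)
print(norm_a * norm_b * cos_angle)  # ≈ 9.0, matching the dot product
```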

Norms
I just mentioned the norm of a. What is that? Norms generally describe the size of a vector.

A norm is really a type of function: one that maps a vector to a non-negative scalar. Geometrically, it’s like the distance between the origin and the vector.

Norm, Euclidean Norm, Max Norm, Lp Norm, L1 Norm, Frobenius Norm and Norm of a Norm brought to you by Norm

So why so many norms? Well they’re for different uses.

L1
For example, the L1 Norm is mathematically very simple, and it grows at the same rate everywhere. Can you think of a situation where that might be useful?

Let’s say that you wanted to see if a vector was actually zero or just a very small vector pretending to be zero. Applying the L1 norm does the trick: because it grows linearly, even tiny elements still register instead of vanishing.
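The L1 norm is short enough to write inline. A sketch with a made-up vector:

```python
x = [1, -2, 3]

# L1 norm: the sum of absolute values.
l1 = sum(abs(v) for v in x)
print(l1)  # 6
```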

L Max
The max norm is the absolute value of the element with the largest magnitude in the vector.

Formal Definition
The formal definition of a norm is given by:

where p denotes the subscript of L; p must be a real number greater than or equal to 1
L2 is also known as the Euclidean Norm
If we drop the final root (the 1/p power, which for p = 2 is a square root), the squared Euclidean norm of a vector can be represented as x transpose times x. This is computationally and mathematically convenient, and this value still represents the size of the vector.
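The formal definition above translates into a short general-purpose function. A sketch, with the made-up vector [3, 4] so the answers are easy to check:

```python
def lp_norm(x, p):
    """General Lp norm: (sum of |x_i|^p) ** (1/p), for p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = [3, 4]
print(lp_norm(x, 1))  # 7.0   (L1 norm)
print(lp_norm(x, 2))  # 5.0   (L2 / Euclidean norm)

# Dropping the root: the squared L2 norm is just x dot x, i.e. xᵀx.
print(sum(v * v for v in x))  # 25
```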

If the L2 norm is for vectors, is there a norm for matrices? YES, YES THERE IS.

It’s called the Frobenius Norm. You square each element, sum them all up, then take the square root. Did you get that? No?

since we're squaring and square-rooting, it's closer to the Euclidean Norm in spirit than the other norms
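In code, the Frobenius norm is the Euclidean norm applied to the matrix as if it were one long vector. A sketch with a made-up 2×2 matrix:

```python
import math

A = [[1, 2],
     [2, 4]]

# Frobenius norm: square every element, sum, then square-root.
fro = math.sqrt(sum(v * v for row in A for v in row))
print(fro)  # 5.0  (sqrt(1 + 4 + 4 + 16))
```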
So the formula I was referring to earlier with the dot product is this:

also the dot product between x and y
Up Next…
In the next article, I’ll talk about matrix inversion, linear dependence and different types of decompositions of a matrix. If you would like me to write another article explaining a topic in depth, please leave a comment.

References