Source: Deep Learning on Medium
Linear Algebra Data Structures and Operations
Linear algebra is incredibly useful for deep learning. If probability and statistics is the theory behind machine learning, then linear algebra is what makes it possible and computationally efficient.
This article assumes you have a pretty good computer science and math background. The article is on the visual side, meaning I tried to illustrate most concepts graphically.
- Scalars are single numbers, and they can be represented by an italic lowercase letter. It doesn’t matter whether they’re integers or doubles, they’re all scalars
- Vectors are 1D arrays of numbers. Mathematically speaking, they can be thought of as directions from the origin, or points in space.
- The individual elements of the vector are scalars
- Also, the space a vector with n real elements lives in is denoted by ℝⁿ
- Vectors and Sets, sets are pretty important, so to establish the proper notation for understanding vectors with sets:
- Matrices are 2D arrays of numbers
- To identify an element you need 2 indices, whereas in a vector you would use 1 index
- The space of an m-by-n matrix is denoted by ℝ^(m×n)
- To denote the ith row, we say A(i, :)
- To denote the jth column, we say A(:, j)
- The notation for the transpose of A is Aᵀ; if you’re not sure what that means, I’ll explain it later
- A tensor A is just a grid of matrices, in other words a 3D array of numbers
- Therefore, to access an element you need i, j, k indices
Going back to matrices. Let’s try to find a motivation to study these.
Like I mentioned, a matrix is just a 2D array of numbers, but what can it represent? In the most traditional sense, it represents a system of linear equations. Not polynomial equations. Linear equations.
This is a system of linear equations.
From this form you can run an algorithm called RREF (reduced row echelon form) to solve the system of equations. But there are better ways to do things.
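One of those better ways, sketched in NumPy (the 2×2 system here is a made-up example, since the original system from the figure isn’t shown):

```python
import numpy as np

# A hypothetical system of two linear equations:
#   2x + 1y = 5
#   1x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

# np.linalg.solve factorizes A (LU decomposition) instead of running RREF
# by hand, which is faster and more numerically stable.
x = np.linalg.solve(A, b)
print(x)  # the solution: x = 1, y = 3
```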
You might be wondering how can you just do that? So to verify that this works, you have to know matrix multiplication.
In matrix multiplication, where the order matters, you always want to make sure the number of columns of the first matrix matches the number of rows of the second. The result then has the number of rows of the first matrix and the number of columns of the second.
But let’s back it up a bit. You can do a lot of things with matrices. You can add them.
Multiplication, a.k.a. the Matrix Product
You can multiply them too, assuming certain conditions are met first.
Let’s say you have an m×n matrix A. You also have a p×q matrix B. You can multiply these matrices as AB, where A is on the left, if and only if n = p.
For a simple matrix that’s 2 by 2, and another matrix that’s 2 by 3, you can multiply them together only because the number of columns of the first matrix is the same as the number of rows of the second.
This is because the formula for each element of the resulting matrix C = AB is C(i, j) = Σₖ A(i, k) · B(k, j).
You should definitely look at this video to see how it works on paper. But soon we’ll be doing it somewhere you’re familiar with: Python and Swift.
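Here’s a minimal NumPy sketch of the shape rule, using the same 2×2 and 2×3 sizes as above:

```python
import numpy as np

A = np.arange(4).reshape(2, 2)  # a 2x2 matrix
B = np.arange(6).reshape(2, 3)  # a 2x3 matrix

# Valid: columns of A (2) == rows of B (2).
C = A @ B
print(C.shape)  # (2, 3): rows of the first by columns of the second

# Flipping the order breaks the rule: B is 2x3, A is 2x2, and 3 != 2.
try:
    B @ A
except ValueError as e:
    print("shape mismatch:", e)
```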
This is great, because now that we can add and multiply matrices, we can put these matrix operations to work for machine learning.
Another important operation is the transpose. The formula for the transpose is: (Aᵀ)(i, j) = A(j, i).
The transpose can be thought of as a mirror reflection along the main diagonal. The main diagonal elements are the elements that have the same index for row and column.
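A quick NumPy sketch of that mirror-reflection idea:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

# The transpose mirrors A across its main diagonal: A.T[i, j] == A[j, i]
print(A.T)
print(A.shape, "->", A.T.shape)  # (2, 3) -> (3, 2)
```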
Main diagonals are important because they come up a lot. The sum of the main diagonal elements is called the trace, and the determinant (which for a triangular matrix is simply the product of the main diagonal elements) is another such quantity. No matter how you stretch or compress the space a matrix acts on by changing its basis, the trace and the determinant stay the same, which is why they are such useful constant properties of a matrix.
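To make the “stays the same” claim concrete, here’s a small sketch (the matrices are made-up examples): re-expressing a matrix in a different basis changes its entries, but not its trace or determinant.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])

print(np.trace(A))       # sum of the main diagonal: 2 + 3 = 5
print(np.linalg.det(A))  # A is triangular, so det = product of diagonal: 2 * 3 = 6

# Express A in a different basis via a change-of-basis matrix P.
P = np.array([[1.0, 2.0],
              [0.0, 1.0]])
B = P @ A @ np.linalg.inv(P)

# B's entries look different, but trace and determinant are unchanged.
print(np.trace(B), np.linalg.det(B))
```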
We also allow matrix and vector addition, where the resulting object is a matrix in which the vector gets distributed to (added onto) every row of the matrix:
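In NumPy this implicit copying of the vector is called broadcasting; a minimal sketch:

```python
import numpy as np

M = np.zeros((3, 2))        # a 3x2 matrix
v = np.array([10.0, 20.0])  # a vector with one element per column

# v is implicitly copied ("broadcast") and added to every row of M.
print(M + v)
```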
Motivation To Understand Matrix Multiplication
A couple of things lead us to need a deeper understanding of matrix multiplication.
- Cost functions are defined as nested summations: summation(summation(long calculation)).
- It’s possible to vectorize our math and code so that we represent the cost function as matrix and vector operations, saving us from nested for loops, which would take an eternity to compute. Nested loops are a Big O no-no!
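A sketch of that payoff, using a made-up squared-error cost (X, w and y here are random placeholders, not anything from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))  # hypothetical features
w = rng.normal(size=10)          # hypothetical weights
y = rng.normal(size=1000)        # hypothetical targets

# Nested-loop version of a squared-error cost: slow Python-level work.
cost_loops = 0.0
for i in range(X.shape[0]):
    pred = 0.0
    for j in range(X.shape[1]):
        pred += X[i, j] * w[j]
    cost_loops += (pred - y[i]) ** 2

# Vectorized version: the same math as one matrix-vector product.
residual = X @ w - y
cost_vec = residual @ residual

print(np.isclose(cost_loops, cost_vec))  # True
```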
The dot product is the sum of the element-wise products of two vectors, and it gives you a scalar value, as opposed to the matrix you get from matrix multiplication. (The purely element-wise multiplication, without the sum, is a different operation called the Hadamard product.)
Sometimes we might also write aᵀb, the transpose of a times b, to say the same thing: the dot product and aᵀb are equal. Another thing the dot product is equal to is the norm of a, times the norm of b, times cos(angle between them). It can be thought of as the projection of one vector onto the other.
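A small NumPy sketch of all three equivalent forms (the vectors are made-up examples):

```python
import numpy as np

a = np.array([3.0, 0.0])
b = np.array([2.0, 2.0])

dot = a @ b              # 3*2 + 0*2 = 6
via_transpose = a.T @ b  # a^T b, the same thing for 1-D vectors

# The geometric form: ||a|| ||b|| cos(theta)
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(dot, via_transpose)                # both 6.0
print(np.degrees(np.arccos(cos_theta)))  # ~45 degrees between a and b
```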
I just mentioned norm of a, what is that? Norms generally describe the size of a vector.
A norm is really a type of function: a function that maps a vector to a non-negative scalar. Geometrically it’s like the distance between the origin and the vector.
Norm, Euclidean Norm, Max Norm, Lp Norm, L1 Norm, Frobenius Norm and Norm of a Norm brought to you by Norm
So why so many norms? Well they’re for different uses.
For example, the L1 Norm is mathematically very simple, and it grows at the same rate everywhere. Can you think of a situation where that might be useful?
Let’s say that you wanted to see if a vector was actually a zero or just a very small vector pretending to be zero. Applying L1 Norm just does the trick.
The max norm is the absolute value of the element with the largest magnitude in the vector.
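A sketch of these norms on a tiny vector “pretending” to be zero:

```python
import numpy as np

x = np.array([1e-6, -1e-6, 1e-6])  # tiny, but definitely not zero

print(np.linalg.norm(x, 1))       # L1 norm: sum of absolute values -> 3e-06
print(np.linalg.norm(x, 2))       # L2 (Euclidean) norm
print(np.linalg.norm(x, np.inf))  # max norm: largest absolute element -> 1e-06
```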
The formal definition of the Lp norm is given by: ||x||p = (Σᵢ |xᵢ|^p)^(1/p). The Euclidean norm is the special case p = 2.
If we don’t do the root part (the power of 1/2), the squared Euclidean norm of a vector can be represented as x transpose times x, i.e. xᵀx. This is computationally and mathematically convenient, and this value still represents the size of the vector.
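A quick sketch of that equivalence:

```python
import numpy as np

x = np.array([3.0, 4.0])

l2 = np.linalg.norm(x)  # sqrt(3^2 + 4^2) = 5.0
squared_l2 = x @ x      # x^T x = 25.0, no square root needed
print(l2, squared_l2)
```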
If the L2 norm is for vectors, is there a norm for matrices? YES, YES THERE IS.
It’s called the Frobenius Norm. It’s the square root of the sum of the squares of every element. Did you get that? No?
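In other words (with a made-up matrix):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0]])

# Square every element, sum them all up, then take the square root.
manual = np.sqrt((A ** 2).sum())
builtin = np.linalg.norm(A, 'fro')  # NumPy's built-in Frobenius norm
print(manual, builtin)              # both 5.0
```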
So the formula I was referring to earlier with the dot product is this: xᵀy = ||x|| ||y|| cos(θ), where θ is the angle between x and y.
In the next article, I’ll talk about matrix inversion, linear dependence and different types of decompositions of a matrix. If you would like me to write another article explaining a topic in depth, please leave a comment.