Lost In A Dense Forest: Intuition On Sparsity In Machine Learning With Simple Code!

Original article was published by Bharath K on Artificial Intelligence on Medium

Take a look at this beautiful forest! Now, imagine you were given an opportunity to perform data analysis on this gorgeous forest.

The task at hand would be to find out what is the population of pandas in every area in the forest location to that of the trees. After successfully performing the analysis, you realize that the density of the trees is much more compared to the panda’s species.

After converting it into a matrix form, you get an image that looks something along the lines from the figure shown below:

Spare Matrix By Author. Images from freepik.com

Here, we can notice that there are a few cute pandas scattered, but the density of the matrix is mostly comprised of trees. Let us assume that the value of trees = ‘0’ and the value of panda = any non-zero value say ‘1’. This image would then form a matrix comprised of zero’s and one’s. This means that the entire matrix density would comprise mostly of 0’s and very few 1’s.

This type of matrix, which consists of mostly 0’s and less of non-zero numbers, is called a sparse matrix.

We encounter a lot of sparsity in both machine learning and deep learning. They occur a lot in data counts, word encodings, and mostly in the field of natural language processing in vectorizing words with concepts such as the bag of words, tf-idf, word2vec, etc.

Does this mean sparsity is a good thing?

Unfortunately, No! Sparsity is not that helpful. On the contrary, it leads to certain problems. Let us discuss what these problems are in the next section.

Issues with Sparse Matrices:

  • Sparse matrices are computationally expensive because of the large amount of redundant zero’s that are present in the matrix structure. The problem of having a large size increases the space complexity enormously, and it becomes challenging to tackle these problems. Machine learning algorithms will not work as effectively as expected due to the increasing size and possibly due to the lack of exhaustive resources.
  • The other significant issues caused are the decreasing speed times to effectively compute the matrix and a decrease in the computational processing speed of machine learning algorithms, which finally leads to the crucial problem of an overall bad, or sometimes even terrible, time complexity. The reason for this problem is because of the longer time taken for the computational process.

The solutions to solving these issues and tackling these challenges will be discussed in the next section.

Methods of Dealing with Sparse Matrices:

Suppose we have a matrix X which is of a large size and is sparse, which means it is filled with lots of zero values in comparison to the non-zero values as shown in the figure below:

Image By Author

The main idea behind solving these problems must be to use an alternate approach to solve them because the addition of zero’s adds more complexity to both space and time. The main method to approach them must be to either use a dictionary of keys that represent the values with non-zero numbers or a list of lists that can represent these values. This process can be done from the table shown below:

Image By Author

The above table shows us an excellent way to approach this problem. Using this method, we can form a simplified equation containing only useful non-zero numbers. The below two methods are the ones that can be used to represent sparse matrices effectively.

  1. Compressed Sparse Row Matrix: This method of conversion utilizes the approach, as shown in the above table, i.e., a tuple of the row and column, respectively, followed by a non-zero value to represent the sparse matrix.
  2. Compressed Sparse Column Matrix: This method of conversion utilizes the approach, as shown in the above table, i.e., a tuple of the column and row, respectively, followed by a non-zero value to represent the sparse matrix.

The Compressed Sparse Matrix (CSR) and Compressed Sparse Column Matrix (CSC) are two of the best ways to represent the sparse matrices effectively. I would recommend the CSR matrix over the CSC matrix because it is my personal preference, and it is usually the standard notation used for representation. However, I will show the code for both CSR and CSC conversion using the scipy module in python. You guys can feel free to choose the option which works out best for you.

With most of the theoretical part out of the way, let’s begin with Coding!

Coding with Sparse Matrices:

Alright, so let’s start! The only two machine learning libraries you will need to compute the matrix shown in the above diagram are numpy and scipy. These can be installed by using a simple pip install command. After the installation process is completed, we will frame the matrix as desired and assign it to a variable.

Below is the code block representing the importing of the libraries and the process of creating the sparse matrix. The image of the respective output is also provided.

The next step is to convert this sparse matrix to the compressed row matrix (CSR) form. This output is similar to the tabular representation shown previously. The code block for this conversion is as shown below. Also, refer to the figure to notice a tuple of (rows, columns) followed by a non-zero number.

If you want to convert this back into a sparse matrix, then just use the todense() function, as shown in the next code block.

If you are wondering how to do the same process for compressed sparse column matrix, then it can be done by following the next code block.

This procedure completes all the requirements for understanding the core concepts of sparsity. 😃

I hope this was able to give a better intuition on the concept of sparsity and sparse matrices in general. Move forward to the conclusion section!