Making personal data make sense with machine learning.

Source: Artificial Intelligence on Medium

Can we make sense of our own personal data? Do we even know the laws that protect or expose us?

From FT AI regulations article

As big data, machine learning and artificial intelligence keep growing, revolutionizing the world as we know it and playing a big role in determining the future, questions are undoubtedly being raised about the ethics, governance, regulation and privacy issues surrounding the big data revolution. At first glance, these topics can all be seen as thorns in the side of AI and machine learning progress, especially since most businesses are more curious about the business benefits of the domain than about its drawbacks.


Recent events and global trends are, however, beginning to show the harm that can come from ignoring these apparent thorns when companies try to make money out of data. The European Union is an example of how governments are beginning to prioritize regulations that most tech companies were not paying attention to before, which in turn affects their business models. Facebook’s dating app, which was supposed to be released today, a day before Valentine’s Day, has been banned by the European Union because Facebook failed to provide the required documentation to the regulatory boards. Its offices in Ireland were actually raided as governments step up their monitoring of big data companies.

Image from Gaming Tech Law


The ethics of AI is also a growing concern in this fairly new industry, where most of the pros and cons are not yet known. Examples range from the challenge of making self-driving cars make ethical decisions to the recently reported bias of facial recognition systems that falsely identify people of color, and Black women in particular.

The governance and ethics of AI is a topic I want to take a deeper look at: as my interest in the AI field grows, so should my sense of responsibility. However, today I decided to look at something within our own means and control: ownership of personal data.

Personal and private data:

This is the data that every company you have an online account with collects from you, and even from your friends, colleagues and family, and from their friends and families. The chain goes on and on until you realize that companies potentially hold personal and private data on basically everyone. One of the most troubling aspects is that the majority of users have no idea what kind of data is being collected about them, let alone what it is being used for. Recent pressure from regulators and civil society is forcing companies to let individuals not only know what data is being collected about them but also access all of it, where possible.

One platform that allows people to view and access their data is LinkedIn. By going to the privacy settings of your LinkedIn profile you can download a copy of all of your data, including Connections, Messages, Likes, Articles, etc., as illustrated below:

Screenshot from LinkedIn privacy settings page

As data scientists, having access to our personal data lets us draw far more meaningful insights from it, in a more organized manner, than we ever could before. Machine learning techniques such as hierarchical and k-means clustering, together with natural language processing, can be applied to a dataset like the LinkedIn personal data export to create deeper insights. Here are a few insights I got from a preliminary analysis of my personal LinkedIn data.

1. Normalizing, counting and clustering companies, locations, and titles of connections

Data from social networks is often loosely structured and does not follow a consistent format, which makes it challenging to aggregate related data features. For example, different users may write the title Founder as 1. Founder 2. Co-Founder 3. Founder and CEO 4. Founder/CEO. Using data normalization techniques, one can group data such as titles, locations and company names, to the best of one's expertise and knowledge, into organized and structured data that makes it easier to form reasonably accurate and meaningful clusters. Below I have included snippets from organizing the titles and companies of my connections.

My connections’ semi-normalized LinkedIn job titles
My connections’ semi-normalized companies
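
To make the normalization step concrete, here is a minimal Python sketch of how raw job titles could be mapped to a canonical label and then counted. The rule list, the labels and the helper names (TITLE_RULES, normalize_title, count_titles) are my own illustrative assumptions and not the exact rules used for the snippets above; a real rule set would be built by inspecting your own export.

```python
import re
from collections import Counter

# Hypothetical rules: each regex maps matching raw titles to one canonical label.
# Both the patterns and the labels are assumptions made for illustration.
TITLE_RULES = [
    (re.compile(r"\b(?:co[- ]?)?founder\b", re.IGNORECASE), "Founder"),
    (re.compile(r"\bdata scientist\b", re.IGNORECASE), "Data Scientist"),
]


def normalize_title(raw):
    """Return a canonical label for a raw title, or the cleaned title if no rule matches."""
    cleaned = " ".join(raw.split())  # collapse repeated whitespace
    for pattern, label in TITLE_RULES:
        if pattern.search(cleaned):
            return label
    return cleaned or "Unknown"


def count_titles(raw_titles):
    """Count how often each normalized title occurs."""
    return Counter(normalize_title(title) for title in raw_titles)


if __name__ == "__main__":
    sample = ["Founder", "Co-Founder", "Founder and CEO", "Founder/CEO",
              "Senior Data Scientist", "  data   scientist "]
    for label, count in count_titles(sample).most_common():
        print(label, count)
```

The first matching rule wins, so more specific patterns should be listed before broader ones. The same approach can be reused for company names and locations with their own rule lists.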

I plan to apply further machine learning clustering techniques to extract deeper insights from my network, using features such as location, likes, industry, the importance of roles and areas of expertise.
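
As a rough idea of what such clustering could look like, below is a small pure-Python sketch of hierarchical (agglomerative, single-linkage) clustering of job titles. It is only a sketch under my own assumptions: the word-overlap (Jaccard) distance, the 0.5 cut-off and the function names are illustrative choices, and in practice a machine learning library would likely replace the hand-written loop.

```python
import re


def tokens(text):
    """Lowercase word tokens of a title or company name."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_distance(a, b):
    """1 minus the share of words two strings have in common (0.0 = same word set)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def single_linkage_clusters(items, max_distance=0.5):
    """Repeatedly merge the two closest clusters until none are within max_distance.

    The distance between two clusters is the smallest distance between any
    pair of their members (single linkage). max_distance is an assumed cut-off.
    """
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(jaccard_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_distance:
            break  # remaining clusters are too far apart to merge
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


if __name__ == "__main__":
    titles = ["Data Scientist", "Senior Data Scientist", "Lead Data Scientist",
              "Software Engineer", "Senior Software Engineer"]
    for cluster in single_linkage_clusters(titles):
        print(cluster)
```

Because every pair of clusters is compared on each pass, this sketch is only practical for a few hundred items; it is meant to show the idea rather than to scale to a large network.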

In conclusion, I urge everyone to check the privacy settings of their most-used digital platforms and request information on what data those platforms hold about them. Data scientists can immediately become subject matter experts on their own personal data, a quality greatly desired in areas such as natural language processing, especially when implementing normalization, similarity computation and dimension reduction techniques.