Source: Deep Learning on Medium
Because we are interested in the content and in the outlet name only, we will focus on two columns. Column 3 contains the publication or outlet name, while 9 contains the content. We then need to extract this information and to store it accordingly so we can proceed with the analysis. But first, let’s import all required modules (adapt your code for latest releases if required, e.g: TensorFlow 2):
Each of the files described above contains around 50,000 entries, so to make the analysis faster we can extract a portion of this data. To save time across my entire pipeline, I decided to randomly select ~40% of the articles using this simple trick (you can change this, of course):
p = 0.4
df = pd.read_csv('articles.csv', header=None, skiprows=lambda i: i > 0 and random.random() > p)
This will take from articles.csv a fraction of the data equal to p.
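As a self-contained illustration of the trick (the file name and row count here are hypothetical), skiprows receives each row index and skips that row with probability 1 − p:

```python
import random
import pandas as pd

# Hypothetical demo file so the snippet runs on its own: 1,000 one-column rows.
pd.DataFrame({'a': range(1000)}).to_csv('articles_demo.csv',
                                        index=False, header=False)

p = 0.4         # fraction of rows to keep
random.seed(0)  # fixed seed only so the demo is reproducible

# skiprows is called once per row index; each row is skipped with
# probability 1 - p, so roughly p of the file is loaded.
df = pd.read_csv('articles_demo.csv', header=None,
                 skiprows=lambda i: random.random() > p)

print(len(df))  # roughly 400 of the 1,000 rows
```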
The next step is probably the most subjective in our entire pipeline. We have to assign News Outlets either a Left or a Right inclination. For simplicity, I decided to use only two from each side and used allsides.com and mediabiasfactcheck.com to assign their bias. Based on information extracted from these websites and other sources¹ ², I decided to assign The Atlantic and The New York Times a Left bias and The New York Post and Breitbart a Right bias. I then filtered from the original files all rows containing these outlets with:
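The filtering code is missing from this copy; here is a minimal sketch, assuming the sub-frame names used in the concatenation that follows (n_s_b, n_s_p, n_s_a, n_s_n) and using a toy DataFrame in place of the real file:

```python
import pandas as pd

# Toy stand-in for the loaded data: column 3 = outlet name, column 9 = content.
rows = [
    [None] * 3 + ['Breitbart']      + [None] * 5 + ['story b1'],
    [None] * 3 + ['New York Post']  + [None] * 5 + ['story p1'],
    [None] * 3 + ['Atlantic']       + [None] * 5 + ['story a1'],
    [None] * 3 + ['New York Times'] + [None] * 5 + ['story n1'],
]
df = pd.DataFrame(rows)

# One sub-frame per outlet, selected on the outlet-name column.
n_s_b = df[df.iloc[:, 3] == 'Breitbart']
n_s_p = df[df.iloc[:, 3] == 'New York Post']
n_s_a = df[df.iloc[:, 3] == 'Atlantic']
n_s_n = df[df.iloc[:, 3] == 'New York Times']
```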
and created one big array of stories with:
n_s = list(n_s_b.iloc[:,9].values) + list(n_s_p.iloc[:,9].values) \
+ list(n_s_a.iloc[:,9].values) + list(n_s_n.iloc[:,9].values)
Note that n_s is an array that contains only the content of all stories, ordered by outlet as in the concatenation above: Breitbart and New York Post stories come first, followed by The Atlantic and The New York Times.
Great! What to do next? An important pre-processing step, especially because we are dealing with Natural Language Processing, is to delete words that can add noise to the analysis. I decided to delete the name of each outlet, which is usually mentioned within the story, as it can add “bias” to our analysis. This is simply done with:
n_s = [word.replace('New York Post','') for word in n_s]
n_s = [word.replace('Breitbart','') for word in n_s]
n_s = [word.replace('New York Times','') for word in n_s]
n_s = [word.replace('Atlantic','') for word in n_s]
Next step is to create a class array. We know how many articles each outlet has and we know their political bias. We can create two arrays, one for an Outlet classifier and one for a Bias classifier with:
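The code for this step is missing from this copy; here is a hedged sketch of what the two arrays could look like (the per-outlet counts are hypothetical; the label conventions follow the text):

```python
import numpy as np

# Hypothetical per-outlet story counts, in the same order as n_s:
# Breitbart, New York Post, Atlantic, New York Times.
len_b, len_p, len_a, len_n = 120, 110, 90, 80

# Outlet classifier: integers 1-4, one per outlet.
classes_All = np.array([1] * len_b + [2] * len_p +
                       [3] * len_a + [4] * len_n)

# Bias classifier: 1 = Right (Breitbart, Post), 2 = Left (Atlantic, Times).
classes_Bias = np.array([1] * (len_b + len_p) + [2] * (len_a + len_n))
```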
If you are following the methods, you can see that classes_All is an array with length equal to that of n_s and contains integers from 1 to 4, each corresponding to one of the four outlets, while classes_Bias contains a 1 for outlets considered to be inclined to the Right and a 2 for those inclined to the Left (see the previous code to understand this further). Thus n_s, which has been cleaned, is our feature array, as it contains the list of stories, and these two arrays are our class arrays. This means we are almost done with pre-processing.
A crucial final step is to transform the stories (actual News) into something that a Neural Network can understand. To do this, I used the amazing Universal Sentence Encoder from TensorFlow Hub, which transforms any given sentence (in our case, a News Story) into an embedding vector of length 512, so in the end we will have an array of size number_of_stories × 512. This was done with:
NOTE: I used a similar approach before to classify Literary Movements, you can check it here if you want.
To compute the embedding matrix we simply have to run the function we just defined:
e_All = similarity_matrix(n_s)
Finally, we are done with pre-processing! e_All is our feature array, while classes_All and classes_Bias are our class arrays. With this, we are ready to build a classifier with Keras.