Original article was published by Mihail Dungarov on Artificial Intelligence on Medium
Hopefully it all still makes sense at this stage. We will now do the same with some real data. Luckily, TensorFlow Datasets provides a number of datasets to get us started quickly.
This approach to visualisation works well for grouping large sets of short texts that are not domain-specific. For instance, news headlines, natural language questions and social media posts would be good candidates.
I will use examples from the following two datasets:
- CNN-DailyMail dataset: visualising the summary of each article
- SQuAD dataset: visualising the question
CNN-DailyMail is a well-researched summarisation dataset consisting of news articles, each with a ‘highlights’ text that serves as the benchmark summary and typically consists of 1–3 sentences. This makes it a good candidate for what we want to do.
I took the first 10k ‘highlights’ from the data, kept only the first sentence of each, and ran PictureText on them.
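The preparation step above could look roughly like the sketch below. It assumes the `cnn_dailymail` dataset in TensorFlow Datasets with a `highlights` text feature, and that individual highlights are separated by newlines or by sentence-ending punctuation. The helper names are made up for illustration, and the PictureText call itself is left out because its API is not shown here.

```python
import re
from typing import List


def first_sentence(highlights: str) -> str:
    """Return the first sentence of a 'highlights' string.

    Assumption: sentences are separated by newlines and/or end with
    '.', '!' or '?' followed by whitespace. Abbreviations such as
    'U.S. officials' would be cut too early by this simple rule.
    """
    text = highlights.strip()
    if not text:
        return ""
    first_line = text.split("\n", 1)[0].strip()
    # Split once at the first sentence-ending mark followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", first_line, maxsplit=1)
    return parts[0].strip()


def load_first_highlights(n: int = 10_000) -> List[str]:
    """Load the first sentence of the first n highlights (downloads the data)."""
    # Imported lazily so first_sentence() can be used without TensorFlow.
    import tensorflow_datasets as tfds

    ds = tfds.load("cnn_dailymail", split=f"train[:{n}]")
    return [
        first_sentence(example["highlights"].decode("utf-8"))
        for example in tfds.as_numpy(ds)
    ]
```

Calling `load_first_highlights()` would then give the list of short texts to pass on to the visualisation.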
The top-level split shows most sections as slightly amber, meaning below-average within-group similarity. The smaller one at the bottom right seems a bit more promising, with about 500 documents (roughly 5%) on what looks like a ‘sports’ theme.
Zooming into the ‘sports’ theme, we see more mentions of game results across football (actual football) and tennis, etc.
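In general, groups that are colour-coded blue should be coherent, since blue sits at the opposite end of the scale from amber and so indicates higher within-group similarity.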
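To make the colour-coding concrete, here is a minimal sketch of one way such a score could be computed: the mean pairwise cosine similarity of the embeddings in a group, compared against the average over all groups. This is an assumption for illustration, not necessarily how PictureText derives its colours. The function names and the toy vectors are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def within_group_similarity(vectors: List[Vector]) -> float:
    """Mean cosine similarity over all pairs of vectors in one group."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # assumption: a single document counts as fully coherent
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def colour_groups(groups: Dict[str, List[Vector]]) -> Dict[str, str]:
    """Label each group 'blue' (at or above average) or 'amber' (below)."""
    scores = {name: within_group_similarity(vs) for name, vs in groups.items()}
    if not scores:
        return {}
    average = sum(scores.values()) / len(scores)
    return {name: "blue" if s >= average else "amber" for name, s in scores.items()}


# Toy 2-d "embeddings": one tight group and one scattered group.
toy_groups = {
    "sports": [(1.0, 0.0), (0.9, 0.1), (1.0, 0.1)],
    "mixed": [(1.0, 0.0), (0.0, 1.0), (0.7, 0.7)],
}
colours = colour_groups(toy_groups)
```

With these toy vectors the tight group comes out blue and the scattered one amber.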
It is difficult to extract meaning from the biggest topics as they currently summarise a whole chunk of 1–2k articles with one sentence. However, picking the middle one, there are some similarities in the content within. A theme of hacks and scams seems to appear:
- Hacker group says it accessed info of snapchat users
- Firm used trick to get extra tax relief
- Facebook chat gimmick lets you insert faces of friends
There are also plenty of cases where things make little sense. The middle tile of the same picture mixes news about the Orlando park attack, The Great British Bake Off, etc.
As a rule, the way to deal with issues like this is to drill deeper into a group until a more ‘blue’ collection of titles appears, one that makes more sense.
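The drill-down rule can be sketched as a simple tree walk. The tree layout used here (a dict with hypothetical `label`, `score` and `children` keys), the scores and the threshold are all assumptions made for illustration, not PictureText's own data structure.

```python
from typing import Any, Dict, List

Node = Dict[str, Any]


def coherent_subgroups(node: Node, threshold: float) -> List[Node]:
    """Descend the tree until a group's similarity score reaches the threshold.

    Groups that never reach it are returned once they cannot be split
    further, so nothing is silently dropped.
    """
    children = node.get("children", [])
    if node["score"] >= threshold or not children:
        return [node]
    found: List[Node] = []
    for child in children:
        found.extend(coherent_subgroups(child, threshold))
    return found


# Toy tree with made-up scores, loosely mirroring the example above.
toy_tree = {
    "label": "all highlights",
    "score": 0.2,
    "children": [
        {"label": "sports", "score": 0.7, "children": []},
        {
            "label": "misc",
            "score": 0.3,
            "children": [
                {"label": "hacks and scams", "score": 0.6, "children": []},
                {"label": "unclear", "score": 0.25, "children": []},
            ],
        },
    ],
}
labels = [n["label"] for n in coherent_subgroups(toy_tree, threshold=0.5)]
```

Here the walk stops at ‘sports’ straight away, but has to go one level further into ‘misc’ before a coherent group shows up.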
The SQuAD dataset is one of the standards for evaluating the quality of question-answering models. However, it is difficult to get a good sense of what sort of questions we are dealing with as a whole, what works and what doesn’t. We could sample individual questions or look at word distributions, but neither really gives a good overview easily.
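For comparison, the word-distribution baseline mentioned above takes only a few lines. This sketch uses three toy questions in place of the real data. With TensorFlow Datasets installed, the questions could presumably be read from the `question` feature of the `squad` dataset instead, which is an assumption about that dataset's schema rather than something shown in this article.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def top_words(questions: Iterable[str], k: int = 5) -> List[Tuple[str, int]]:
    """Count lower-cased word occurrences and return the k most common."""
    counts: Counter = Counter()
    for question in questions:
        counts.update(re.findall(r"[a-z']+", question.lower()))
    return counts.most_common(k)


# Toy stand-ins for real SQuAD questions.
toy_questions = [
    "When did season five air?",
    "What era utilised Greek theories?",
    "When did the era end?",
]
top = top_words(toy_questions, k=3)
```

The output is dominated by words like ‘when’ and ‘did’, which says little about what the questions are actually about.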
Similar results can be seen with the SQuAD questions. At the top level, the split makes little sense judging by the questions contained within each of the top three buckets.
However, digging into one of those, ‘what era utilised Greek Theories’, we see a lot of questions about historical time periods. Digging deeper, some narrower themes also seem to develop. For instance, one can follow a thread about Slavic / Russian timelines and events. In another thread, however, we jump from historical topics to questions about TV history such as ‘when did season five air?’, so be wary.