Original article was published on Deep Learning on Medium
Exploring 4.5 Years of Journal Entries with NLP
Every single day since January 1, 2016, I’ve kept a journal in Google Docs documenting the events of the day and my thoughts. I started journaling because I wasn’t satisfied with where I was and wanted to be able to look back to see improvement. Despite this purpose, the journals got very long very fast, and I never got a chance to look back on them in any meaningful way.
As my journal approaches the five-year mark, I thought it would be fun to finally take a look back — not by reading it, but by analyzing it with my newly-learned data science and Natural Language Processing (NLP) skills.
Luckily for me, I’m a creature of habit. For the last 53 months, every journal has looked exactly the same: one Google Doc per month, in the following format:
And so on.
Creating my corpus was as easy as downloading each Google Doc as a text file, quickly reading it over for stray paragraph breaks, and throwing it all into a pandas DataFrame using a few lines of code (21, to be exact). In the DataFrame, I included the year, month, day, date (as a datetime), and the journal entry itself.
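The parsing step looked roughly like this. This is a simplified sketch, not my exact 21 lines: the date-header format, function name, and column choices here are illustrative.

```python
import re
import pandas as pd

def parse_month(text):
    """Split one month's text into per-day rows.

    Assumes each entry begins with a M/D/YYYY header on its own
    line -- an illustrative stand-in for my real format.
    """
    pattern = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})\s*$", re.MULTILINE)
    matches = list(pattern.finditer(text))
    rows = []
    for i, m in enumerate(matches):
        # Entry text runs from this header to the next one (or EOF)
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        month, day, year = int(m.group(1)), int(m.group(2)), int(m.group(3))
        rows.append({
            "year": year,
            "month": month,
            "day": day,
            "date": pd.Timestamp(year=year, month=month, day=day),
            "entry": text[start:end].strip(),
        })
    return pd.DataFrame(rows)

sample = """1/1/2016
Started journaling today.
1/2/2016
Slept in, then did homework."""
df = parse_month(sample)
```

Run over a folder of downloaded files, the resulting frames just get concatenated into one corpus DataFrame.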
I then cleaned the data using Python’s regular expressions (re) module, built the document-term matrix with scikit-learn’s CountVectorizer, and grouped it by year before transposing it for use in exploratory data analysis.
Exploratory Data Analysis
To get a better understanding of my journals, I looked at them from three perspectives: common words, sentiment, and length.
- Common Words: I determined and visualized the most common words using word clouds to gather important topics.
- Sentiment: I ran a sentiment analysis using the TextBlob library to see how my polarity changed over time.
- Length: I isolated various length metrics using pandas and made inferences about why I wrote more or less at certain points in time.
Most Common Words
First I wanted to look at the most common words, which I hoped would give an idea of the most important things to me for each year. I removed all words that were in the top 30 most common for at least three out of the five years, since they probably wouldn’t be very insightful for differentiation.
I like to represent information visually whenever I can, so I used the document-term matrix to create word clouds for each year.
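The filtering step can be sketched like this (the function name and thresholds are my illustrative choices); the surviving per-year frequencies are what go into each word cloud:

```python
import pandas as pd

def yearly_word_filter(by_year, top_n=30, min_years=3):
    """Given a words-by-years count matrix (rows = words, columns =
    years), drop any word that lands in the top `top_n` for at
    least `min_years` of the years -- a sketch of the filtering
    rule described above."""
    top_sets = [set(by_year[y].nlargest(top_n).index) for y in by_year.columns]
    counts = pd.Series(0, index=by_year.index)
    for s in top_sets:
        counts[list(s)] += 1            # how many years each word was "top"
    keep = counts[counts < min_years].index
    return by_year.loc[keep]
```

From there, a library like wordcloud can build each image from the filtered column, e.g. via its generate_from_frequencies method.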
- 2016: I talked about my significant other, Madison*, a lot, as well as my bed (probably referring to how I struggled to get out of it, or how much time I spent in it because of my back pain). Other words that stuck out to me were class, english, bio, and homework, referring to school, and played, referring to various games and sports.
- 2017: Madison was still prominent but less so; similar school themes are present, work refers to me starting my job at Sonic, and i’ll increases dramatically in importance — I guess I started talking more about what I was going to do in the future.
- 2018: Madison rose again in prominence as our relationship ended, and I started talking about other people more as people became larger. I also became close friends with and would later date Hannah*.
- 2019: Hannah became very important, as did other people — 2019 includes the second semester of my senior year in high school and my first semester of college, so it makes sense. Fun is also much larger this year, another unsurprising change.
- 2020: work refers to the increased workload from spring classes — such as MIS — and the internship I had during the semester; people is still prominent, and data and reading refer to hobbies/interests I picked up in 2020 (one of which I’m using to write this article).
*Names changed for anonymity
Sentiment Analysis
Next, I wanted to see how positive or negative my writing was, and whether it would line up with my perception of those years. TextBlob makes sentiment analysis incredibly easy, so I ran it over the corpus and looked at polarity by month and year.
The yearly results were surprising: 2020 was almost twice as positive as any other year. I would definitely consider 2020 to be my happiest year, but I wasn’t expecting the sentiment analysis to so strongly support my intuition.
Here we see the polarity visualized over time with monthly and yearly frequencies:
Length
The last thing I wanted to explore was length: how much was I writing, and how did my verbosity change over time?
To do this, I created a DataFrame with various metrics about the length of my journal entries:
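A sketch of those length metrics on a toy frame (the metric names here are illustrative, not necessarily the columns I used):

```python
import pandas as pd

# Toy frame standing in for the real journal DataFrame
df = pd.DataFrame({
    "year": [2016, 2016, 2017],
    "entry": ["Played soccer after class", "Bio homework then bed",
              "Started work then played soccer"],
})

words = df["entry"].str.split()
df["word_count"] = words.str.len()

# Per-year totals, words per day (WPD), and unique-vocabulary size
length = pd.DataFrame({
    "total_words": df.groupby("year")["word_count"].sum(),
    "words_per_day": df.groupby("year")["word_count"].mean(),
    "vocab_size": words.groupby(df["year"]).apply(
        lambda s: len(set(w for ws in s for w in ws))),
})
```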
I wrote the most in 2019 and the least in 2020, and years where I was writing more per day had a larger vocabulary. It makes sense that 2020 has by far the lowest words per day (WPD); it’s been my happiest and busiest year.
Seeing how much less I wrote in 2020 raises the question: is there a link between the “happiness” (polarity) and the length of my entries? I plotted average words per day against sentiment in various ways to see if there were any patterns.
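A two-scale plot like the ones I used can be made with matplotlib’s twinx, which puts words per day and polarity on separate y-axes over the same timeline. The numbers in this sketch are made up, not my real stats:

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly stats, purely for illustration
stats = pd.DataFrame({
    "words_per_day": [310, 290, 250, 330, 180],
    "polarity": [0.10, 0.12, 0.09, 0.13, 0.22],
}, index=[2016, 2017, 2018, 2019, 2020])

fig, ax_words = plt.subplots()
ax_sent = ax_words.twinx()     # second y-axis sharing the x-axis

ax_words.plot(stats.index, stats["words_per_day"], color="tab:blue")
ax_sent.plot(stats.index, stats["polarity"], color="tab:orange")
ax_words.set_xlabel("Year")
ax_words.set_ylabel("Words per day")
ax_sent.set_ylabel("Mean polarity")
fig.savefig("wpd_vs_sentiment.png")
```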
Plotting sentiment and words per day over time shows some relationship; the monthly plot lines up fairly well except in 2019, where the word count spikes, and in 2020, where it quickly drops.
The yearly graph gives a broader, though somewhat conflicting, insight: other than 2018–2019, where both sentiment and word count rose, the two mirror each other. When sentiment goes up, word count goes down, and vice versa. This matches my intuition that when I’m happier and more positive, I journal less and instead spend more time focusing on those positive things.
Removing the time component, the scatterplots become less meaningful; the only trend I see is that, plotting per day, entries with very high or very low sentiment don’t have high word counts, while those with middling polarities (between -0.5 and 0.15) do.
This is interesting; I wouldn’t expect to write a lot on very positive days, as the chart indicates, but on very negative days I would expect to write a lot about why the day was bad, why I was frustrated or upset, and so on. It’s possible that TextBlob’s sentiment analysis didn’t capture my bad days as strongly negative as it captured my good days as positive.
Since this is an unsupervised learning project (data without labels), most of my insights came from the exploratory data analysis. I tried topic modeling, but the results were incomprehensible. This makes sense, since topic modeling is best at recognizing highly distinct topics (such as sports and politics), whereas my journals would mostly be focused on talking about my day.
My plans for a classifier didn’t work out either, because I would’ve needed to manually classify all 1601 journal entries — not a particularly good use of my time.
However, I was able to use deep learning for the task I was most excited for: text generation!
Training the Language Model
I started with a language model that was pre-trained on the WikiText-103 dataset, a collection of over 100 million tokens extracted from verified articles on Wikipedia.
Doing so allowed me to take advantage of transfer learning, which meant that my model started out with a good understanding of the English language rather than starting from scratch. I then trained it on all 1601 journal entries, resulting in an accuracy of almost 33% after four fine-tuning epochs.
I was surprised by how well it performed. In the context of a language model, accuracy refers to how often the model correctly predicts the next word given the words that came before. Correct predictions ⅓ of the time is pretty incredible, given the tens of thousands of words in my vocabulary. I guess I’m a predictable writer.
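To make that metric concrete, here’s a toy illustration of next-word accuracy using a simple bigram model rather than the neural model I actually trained (and since it evaluates on its own training text, the number is optimistic):

```python
from collections import Counter, defaultdict

def next_word_accuracy(tokens):
    """Fit a bigram model on `tokens`, then measure how often its
    single most likely prediction matches the actual next word --
    a toy version of the accuracy a language model reports."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    correct = 0
    for prev, nxt in zip(tokens, tokens[1:]):
        predicted = following[prev].most_common(1)[0][0]
        correct += predicted == nxt
    return correct / (len(tokens) - 1)

text = "i got out of bed and went to class then i got out of bed late"
acc = next_word_accuracy(text.split())
```

Repetitive text scores high here for the same reason my journals did: the more formulaic the writing, the easier the next word is to guess.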
For text generation, I gave the model various starting words/phrases similar to what I often start my journal with, and tested a few different lengths before getting satisfactory results:
200 words, starting text: “Okay”
The generated entries understandably fluctuated a lot; here’s a less coherent entry with those same parameters:
And here’s my best attempt to get it talking about things in college; the starting text was “My dorm”. Considering that I’ve only had one year of college compared to four years of high school, I’m not surprised at how infrequently the model mentioned topics relating to college.
There’s definitely a lot of repetition and some nonsensical sections, but it honestly sounds pretty similar to the way I write my journals. I’m quite satisfied with how it turned out. The language model really hit the nail on the head with how I start my journal most days: getting out of bed and talking about school.
Fast.ai’s learner.predict method certainly isn’t the best for text generation, but within the scope of this project I wasn’t able to get better results with beam search or nucleus sampling, two other decoding methods.
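For reference, nucleus (top-p) sampling keeps only the smallest set of most-likely next words whose probabilities sum to at least p, renormalizes, and samples from that set. A toy sketch, with a made-up next-word distribution:

```python
import random

def nucleus_sample(probs, p=0.9, rng=random):
    """Top-p / nucleus sampling over a {word: probability} dict:
    keep the most-likely words until their cumulative probability
    reaches p, then sample from just that set."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for word, prob in ranked:
        nucleus.append((word, prob))
        total += prob
        if total >= p:
            break                       # low-probability tail is cut off
    words, weights = zip(*nucleus)
    return rng.choices(words, weights=weights, k=1)[0]

# Hypothetical distribution for the word after "my"
probs = {"bed": 0.5, "dorm": 0.3, "homework": 0.15, "xylophone": 0.05}
word = nucleus_sample(probs, p=0.9)
```

Cutting the tail keeps the output varied without letting very unlikely words derail a sentence.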
The last thing I noticed was that the generated text had basic grammar errors such as missing punctuation, awkward spacing, and capitalization mistakes. There are plenty of grammar-checking tools I could have used programmatically, but I use Grammarly for school and wanted quick results, so I manually ran the generated entries through it and indiscriminately accepted every suggestion. Here were the results:
200 words, starting text: “Okay”
200 words, starting text: “My dorm”
Working on this project has been not only nostalgic but also a great learning experience and a lot of fun. I wasn’t sure what to expect coming in, but I found a surprising number of interesting results and got to look at my personal experiences from a new, data-driven perspective.
It was really exciting getting to solve this novel problem with the skills I’ve learned in the past three months, and I want to continue to derive meaningful insights using the processes that guided me through this project.
This is the first article I’ve ever written and the biggest data science project I’ve ever completed, so there’s definitely room for improvement. Please let me know if you have any suggestions.