Original article can be found here (source): Deep Learning on Medium
Implementing the Viral Video Title Generator
All machine learning models require data. The dataset we will be using is the Trending YouTube Video Statistics dataset on Kaggle.
When loading and viewing the dataset, we can get an idea of how the data is structured:
import pandas as pd
data = pd.read_csv('/kaggle/input/youtube-new/USvideos.csv')
We are interested in the title column, which will provide the data to train the RNN on. The dataset has 40,949 rows. That is not much compared with some larger datasets, but to keep the training time reasonable, let’s reduce the training data to 5,000 instances.
In addition, we should narrow down which categories the training data comes from.
After looking at the different categories, it becomes clear that some are dedicated to news, music videos, movie trailers, and so on. These wouldn’t make sense in the context of an idea generator, because news headlines, song titles, and music video titles either can’t be generated or wouldn’t be meaningful as video ideas. Category IDs 22, 23, and 24 are dedicated to comedy and shorter segments created by small content creators, which is more in line with what we want to generate.
The following code selects rows in data that belong to categories 22, 23, or 24 and puts them in a DataFrame called sub_data:
sub_data = data[(data['category_id']==24) | (data['category_id']==23) | (data['category_id']==22)]
There are still 16,631 rows. To reduce this to 5,000 rows, we will randomly shuffle the DataFrame a couple of times and then select the top 5,000 rows as training data. The shuffle function from sklearn.utils can help:
from sklearn.utils import shuffle
sub_data = shuffle(shuffle(sub_data))
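As an alternative sketch of the filtering and shuffling steps above, pandas alone can do both. The toy DataFrame below is a made-up stand-in for the Kaggle data (only the two columns used here), and n_rows and train_rows are illustrative names that do not appear in the original code. A single random sample is already uniformly random, so repeated shuffling is not required.

```python
import pandas as pd

# Toy stand-in for the Kaggle DataFrame (an assumption for illustration).
data = pd.DataFrame({
    "title": [f"video {i}" for i in range(10)],
    "category_id": [22, 23, 24, 10, 25, 22, 1, 24, 23, 10],
})

# isin() expresses "category is 22, 23, or 24" without chaining | comparisons.
sub_data = data[data["category_id"].isin([22, 23, 24])]

# sample() shuffles and truncates in one step; random_state makes it repeatable.
n_rows = min(4, len(sub_data))  # 5_000 in the article; 4 for this toy frame
train_rows = sub_data.sample(n=n_rows, random_state=0)
```

With the real dataset, n_rows would be 5_000 and train_rows['title'] would take the place of sub_data.head(5_000)['title'].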
To feed the data into the model, it must be in a text file, with each new training instance on a separate line. The following code does just that:
titles = open('title.txt','w+')
for item in sub_data.head(5_000)['title']:
    titles.write(item + '\n')  # one title per line
Note that the .head(n) function selects the top n rows in a DataFrame.
Once every title has been written to title.txt, we can call titles.close() to close the file.
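The file-writing step can also be sketched with a context manager, which closes the file automatically. This is only an illustration: write_titles is a hypothetical helper (not part of the original code), and the whitespace cleanup is an added assumption meant to guarantee that each training instance stays on exactly one line.

```python
import os
import tempfile


def write_titles(titles, path):
    """Write one title per line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for title in titles:
            # Collapse internal whitespace (including stray newlines) so each
            # title occupies a single line in the training file.
            clean = " ".join(str(title).split())
            if clean:  # skip titles that are empty after cleanup
                handle.write(clean + "\n")
                count += 1
    return count


# Toy stand-in for sub_data.head(5_000)['title'].
sample_titles = ["First title", "Multi\nline title", "   "]
out_path = os.path.join(tempfile.mkdtemp(), "title.txt")
written = write_titles(sample_titles, out_path)
```

With the real data, the call would look like write_titles(sub_data.head(5_000)['title'], 'title.txt').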
Finally, the training file is ready. There are many powerful libraries that can implement RNNs, like Keras (TensorFlow) and PyTorch, but we’ll be using a library called textgenrnn, which lets us skip the complexities of choosing a network architecture. This module can be called, trained, and used in 3 lines of code (4 if you count installing from pip), at the cost of customizability.
!pip install textgenrnn
…installs the module in the Kaggle notebook environment. You may remove the ! if operating in other environments.
Training is simple:
from textgenrnn import textgenrnn
textgen = textgenrnn()
textgen.train_from_file('title.txt', num_epochs=50)  # train on the titles file for 50 epochs
Since textgenrnn is built on Keras, training prints the familiar Keras progress-tracking output.
This takes about 2.5 hours to run through all 50 epochs.
The model’s generate method (textgen.generate() in textgenrnn) can be used to generate examples. Its temperature argument is a measure of how original the generated example will be (the higher, the more original). Choosing it is a balance between being creative (higher temperature) and not straying too far from the nature of the task, loosely analogous to the balance between underfitting and overfitting.
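To make the effect of temperature concrete, here is a minimal sketch of the standard temperature-scaling trick used when sampling from a character-level model. It is not taken from textgenrnn's source, and apply_temperature, sample_index, and the toy probabilities are all hypothetical. Low temperatures concentrate probability on the most likely next character, while high temperatures flatten the distribution and make unusual choices more likely.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a probability distribution (all entries > 0) by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Dividing log-probabilities by the temperature sharpens the distribution
    # when temperature < 1 and flattens it when temperature > 1.
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(probs, temperature, rng=random):
    """Draw one index from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]


next_char_probs = [0.7, 0.2, 0.1]  # toy next-character probabilities
cold = apply_temperature(next_char_probs, 0.2)  # nearly always picks index 0
hot = apply_temperature(next_char_probs, 2.0)   # spreads probability out
```

A temperature of 1.0 leaves the model's probabilities unchanged, so it is a natural reference point when experimenting.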
Finally, the generated video titles!
To show the model’s progress over time, I’ll include three titles from (about) every 10 epochs, then leave you with a treasure trove of 50-epoch-model generated titles.
1 epoch (Loss: 1.9178) —
- The Moment To Make Me Make More Cat To Be Coming To The The Moment | The Moment | The Moments
- Keryn lost — Marlari Grace (Fi Wheel The Year Indieved)
- Reading Omarakhondras | Now Cultu 1010–75
10 epochs (Loss: 0.9409) —
- Grammy Dance of Series of Helping a Good Teass Shape | Will Smith and Season 5 Official Trailer
- Cardi Book Ad — Dancing on TBS
- Why Your Boyfriend In Handwarls
20 epochs (Loss: 0.5871) —
- My Mom Buys My Outfits!
- DINOSAUR YOGA CHALLENGE!!
- The Movie — All of Tam | Lele Pons & Hulue & Jurassic Contineest for Anime | E!
30 epochs (Loss: 0.3069) —
- Mirror-Polished Japanese Foil Ball Challenge Crushed in a Hydraulic Press-What’s Inside?
- Why Justin Bieber Was The Worst SNL Guest | WWHL
- The Most Famous Actor You’ve Never Seen
40 epochs (Loss: 0.1618) —
- Will Smith & Joel Edgerton Answer the Web’s Most Searched Questions | WIRED
- Adam and Jenna’s Cha Cha — Dancer Sharisons & Reveals Your Door ftta Answering Saffle Officers
- Bravon Goes Sneaker Shopping At Seoul Charman’s Fabar Things 2
…and finally, the top five 50-epoch (Loss: 0.1561) generated titles!
- MY BOY DO MY MAKEUP
- 24 HOUR BOX FORT PRISON ESCAPE
- Liam Payne Goes Sneaker Shopping
- Star Wars: The Bachelor Finale
- Disney Princess Pushing A Truck