Using the data behind TED Talks to explore speech patterns and build models that can generate speech, predict a speaker’s gender, and find out what makes a talk popular.
TED Talks are a wonderful way for people to share ideas. TED speakers are experts in a diverse array of fields across technology, entertainment, and design (that’s what “TED” actually stands for!). TED talk transcripts hold some of the world’s most powerful ideas. By combining these ideas with statistics that indicate how audience members and online viewers react, a vivid picture can be painted of how humans communicate and respond to each other.
We wanted to explore the distribution of all TED talks, and ask questions about the similarities and differences between talks given by speakers from different groups of people, like men and women. Using this data exploration to guide us, we also wanted to see if we could build intelligent models that could predict the gender of a TED speaker, or even generate a portion of a talk.
We decided to approach these tasks using natural language processing, data visualization, and machine learning models. Our goals were to:
- Decipher any differences in men’s and women’s TED talks based on both speech content and audience reactions
- Build a classifier to predict a speaker’s gender given their TED talk transcript and related statistics
- Create a generative model that can finish writing a portion of a TED talk given a starter sentence
Exploring TED Data
TED transcripts are available online. The dataset we used consisted of these transcripts along with various metadata scraped from the official TED website, posted on Kaggle by Rounak Banik. The dataset contains every TED talk posted before September 21, 2017 (2,463 talks). Each data point contains the following attributes:
- Main speaker name
- Number of speakers
- Film date and publish date
- Number of comments
- Number of views
- Speaker occupation
- Number of languages in which the talk is available
- Related talks
Creating Gender Labels
In order to compare the talks of men vs. women, we first had to know the speaker’s gender for each data point. Rather than watch all two thousand videos, we used a library called gender-guesser, which uses a person’s first name to categorize their gender as one of: male, female, mostly male, mostly female, androgynous, or unknown. For simplicity, we merged the mostly female and mostly male categories into female and male, respectively.
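To make the merging step concrete, here is a minimal sketch; the speaker names and raw guesses below are made up, and `merge_label` stands in for the post-processing we applied to gender-guesser’s output:

```python
def merge_label(raw):
    """Collapse the 'mostly_*' guesses into the binary categories."""
    mapping = {
        "mostly_male": "male",
        "mostly_female": "female",
    }
    return mapping.get(raw, raw)

# hypothetical raw guesses for a few first names, in gender-guesser's
# category vocabulary ('andy' is its label for androgynous names)
raw_guesses = {"Alice": "mostly_female", "Ken": "male", "Robin": "andy"}
labels = {name: merge_label(g) for name, g in raw_guesses.items()}
# 'andy' and 'unknown' names were then labeled by hand
```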
gender-guesser left 311 speakers labeled as unknown, which we then labeled manually to the best of our abilities by watching the actual TED videos. While gender-guesser may not have done a perfect job with the assignments, it correctly categorized enough of our speakers to allow a broad analysis comparing and contrasting the two genders. The final distribution of talks across genders shows more than twice as many male speakers as female speakers.
Differences in Speech Content By Gender
Are there differences in the ways men and women speak to an audience? To find out, we compared the relative word frequencies, number of words spoken per minute, talk durations, and sentiment scores of male and female speakers.
Relative Word Frequencies
Each word frequency was calculated by counting the number of times that word was spoken across all speakers of one gender and dividing by the total number of words spoken by that gender. We could then subtract the two frequencies for each word and plot the words with the biggest differences between genders.
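The calculation can be sketched in a few lines of Python; the two toy “corpora” here are invented for illustration:

```python
from collections import Counter

def relative_freqs(transcripts):
    """Word frequency relative to the total words spoken by one group."""
    counts = Counter()
    for t in transcripts:
        counts.update(t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# toy transcripts standing in for the per-gender corpora
male = ["the data shows the trend"]
female = ["the story shows a family"]
fm, ff = relative_freqs(male), relative_freqs(female)

# positive difference => used relatively more often by women
diff = {w: ff.get(w, 0) - fm.get(w, 0) for w in set(fm) | set(ff)}
```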
Relative Word Category Frequencies
To refine our exploration of word frequencies even further, we also compared the relative frequencies of different categories of words between genders. A previous study (along with interactive graphic) used Facebook data to analyze the occurrences of certain groups of English words across genders, and found some interesting differences. We used the following word categories and compared their results with those found from the TED transcripts.
Warm and agreeable words (‘family’, ‘friends’, ‘wonderful’, ‘blessed’, ‘amazing’, ‘loving’, ‘husband’, and ‘thankful’)
Previous research showed: Women are more likely to use ‘warm and agreeable’ words than men.
TED Data showed: Women used ‘warm and agreeable’ words 1.2 times more than men did (0.18% of all words spoken by women were ‘warm and agreeable’, compared to 0.15% of all words spoken by men).
Intensifiers (‘very’, ‘so’, ‘such’, ‘really’, ‘totally’, and ‘too’)
Previous research showed: Women are more likely to use intensifier words than men.
TED Data showed: Women said these words 1.047 times more frequently than men, which is not a significant difference (0.44% of all words used by women were intensifiers, compared to 0.42% for men).
Uncertainty indicators (‘um’ and ‘uh’)
Previous research showed: Women are more likely to use these words than men as they signify low-confidence and uncertainty.
TED Data showed: Our findings were actually quite different. Men used words of uncertainty 1.86 times more frequently than women (0.00144% of all words used by men were uncertainty indicators, compared to 0.000773% of all words used by women).
Rational words (‘opinion’, ‘opinions’, ‘logic’, ‘logical’, ‘based’, ‘political’, ‘fact’, ‘moral’, and ‘beliefs’)
Previous research showed: Men are more likely than women to use ‘rational’ words.
TED Data showed: Women were actually 1.07 times more likely to use ‘rational’ words than men (0.15% of all words spoken by women were ‘rational’ words, compared to 0.14% of all words spoken by men).
Cold-hearted words (‘kill’, ‘kills’, ‘killing’, ‘die’, ‘swear’, ‘dead’, and ‘murder’)
Previous research showed: Men are more likely than women to use ‘cold-hearted’ words.
TED Data showed: Women were actually 1.25 times more likely to use ‘cold-hearted’ words than men (0.07% of all words spoken by women were ‘cold-hearted’, compared to 0.05% of all words spoken by men).
Although we found gender differences in frequencies of some of these word groups, not all of our findings agreed with those found in the previous study. Overall, the gender differences of these word groups in TED data were not significant. It is entirely possible that gender differences are more prevalent in the informal setting of Facebook than in the more formal setting of a TED Talk.
Words Per Minute (WPM)
We calculated the number of words each speaker says per minute by dividing the total number of words in the transcript by the duration of the talk in minutes (the dataset gives durations in seconds). We then compared the distributions of speaker WPMs to see if men tended to speak faster or slower than women. The distributions ended up looking very similar, indicating there is no strong relationship between a speaker’s gender and the speed at which they talk.
The distributions of speech duration across genders proved to be slightly more interesting, with the vast majority of extremely long talks being delivered by men, and the average duration of men’s talks being longer than women’s.
We also used the TextBlob library to find sentiment scores for each talk, including a polarity score in [-1, 1] indicating how negative or positive the speech content was, and a subjectivity score in [0, 1] indicating how subjective or objective it was. The distributions of these scores showed almost no differences between genders, indicating that a speaker’s gender does not have an effect on how polarized or subjective they are.
How Audience Reception Differs by Speaker Gender
We just saw some of the differences and similarities by gender of TED speech content such as commonly used words, duration, and words per minute. The other side of the equation is in how the audience members (both those physically present and those viewing online) perceive and react to these speeches. We looked for differences across genders of frequencies of laughter and applause, number of views and comments, ratings, and number of speakers from each gender over time.
Laughter and Applause Ratios
Conveniently, each transcript is tagged with the audience reactions ‘(Laughter)’ and ‘(Applause)’. For each talk, we calculated the ratios of laughter per total words spoken and applause per total words spoken, to see if we could spot any differences.
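A minimal sketch of the ratio calculation, assuming the tags appear inline in the transcript text as shown (the snippet itself is invented):

```python
import re

def reaction_ratios(transcript):
    """Count '(Laughter)'/'(Applause)' tags per word actually spoken."""
    text = transcript.lower()
    laughs = len(re.findall(r"\(laughter\)", text))
    claps = len(re.findall(r"\(applause\)", text))
    # strip the tags before counting spoken words
    words = len(re.sub(r"\((laughter|applause)\)", " ", text).split())
    return laughs / words, claps / words

# hypothetical snippet in the transcripts' tag style
snippet = "So I told them the truth. (Laughter) Thank you. (Applause)"
laugh_ratio, applause_ratio = reaction_ratios(snippet)
```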
On average, the audience laughed about 1.214 times more per word spoken for women than for men, and clapped about 1.071 times more per word spoken for women than for men.
Number of Views and Comments
As for views, the average number of views per talk was slightly higher for women than for men (about 1,726,024 vs. 1,723,488, which is not a significant difference). On the other hand, the average number of comments per talk was slightly higher for male speakers than female speakers (200 vs. 189, respectively).
TED Talks can be rated according to 14 categories; the distribution of ratings across all talks can be seen below.
We looked into how these ratings split across the genders, and found some interesting insights:
- Talks given by men are rated ‘funny’ 1.5 times more than talks given by women (men average 178 ‘funny’ ratings per talk while women average 116). This is surprising, as we saw previously that audiences actually laugh 1.2 times more frequently in talks given by women than in talks given by men.
- Talks given by men are rated ‘ingenious’ 1.73 times more than those given by women.
- Talks given by women are rated ‘beautiful’ 1.3 times more than those given by men.
Distribution of Male and Female Speakers Over Time
Another interesting difference is the distribution of speakers from each gender over time. In more recent years, the number of female speakers has gotten denser and denser, as opposed to earlier years when speakers were primarily male.
Using these insights from data exploration, we created new features that highlighted the differences between the gender distributions in order to help our classifier make more intelligent predictions. Even features with seemingly slight gender differences can combine with other features in unexpected ways to classify speaker gender more accurately.
In order to make the raw text transcripts useful and extract all of the features mentioned above, we had to do some cleaning and preprocessing of each transcript, including:
1. Removing punctuation and symbols from the transcripts
2. Removing stop words like ‘the’, ‘and’, and ‘to’ when identifying important words and word frequencies within a speech
3. Converting many of the features to numeric form (for example, the ratings had to be parsed to extract names and counts)
4. Creating TF-IDF (Term Frequency-Inverse Document Frequency) vectors. Each TF-IDF vector is a sparse vector the length of the dictionary (the set of all possible words), with one position per word; the value at a word’s position is its frequency in the document, scaled down by how common that word is across all documents. This extends the “bag of words” scheme by down-weighting words that appear in nearly every document.
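To make step 4 concrete, here is a minimal hand-rolled TF-IDF sketch (in practice a library vectorizer would do this; the two toy documents are invented):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Plain TF-IDF: term frequency scaled by inverse document frequency."""
    vocab = sorted({w for d in docs for w in d.split()})
    n = len(docs)
    # document frequency: how many documents contain each word
    df = {w: sum(1 for d in docs if w in d.split()) for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d.split())
        total = sum(tf.values())
        vectors.append([(tf[w] / total) * math.log(n / df[w]) for w in vocab])
    return vocab, vectors

docs = ["great idea great talk", "new idea"]
vocab, vecs = tfidf_vectors(docs)
# 'idea' appears in every document, so its idf (and weight) is zero
```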
Building a Gender Classifier
Gradient Boosted Decision Trees
Gradient boosted decision trees work well for binary classification problems, so we decided to use XGBoost, a decision tree boosting library, to build our classifier based on our metadata and engineered features. The main idea behind XGBoost is that it takes multiple shallow learners (short decision trees) and combines them so that each tree predicts the error of the last.
We used AUC to quantify how well our model was doing. For a while, things looked bleak, with at best 0.55 test AUC on our baseline classifier. Feature engineering proved to be extremely helpful: adding in all engineered features bumped us up to 0.68 test AUC. We also spent quite some time tuning the model’s hyperparameters. Once all was said and done, it was interesting to see which features the classifier relied on most to get its best scores. We analyzed feature importances and saw that the ratings and our engineered features were some of the most important ones we had.
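A minimal sketch of this setup, using scikit-learn’s gradient boosting as a stand-in for XGBoost and synthetic features in place of our real engineered ones:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic data standing in for our feature matrix (WPM, ratios, ratings...)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
model.fit(X_tr, y_tr)

# AUC on held-out data, as we used to score the real classifier
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
# model.feature_importances_ shows which features the trees relied on
```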
Long Short Term Memory and Word Embeddings
Our other attempts to classify text by gender centered around LSTMs. An LSTM is a specific type of recurrent neural network (RNN) that allows the model to learn context, or long-term dependencies, in sequential data such as time series or text. We used LSTMs and word embedding layers to attempt binary classification of talks by speaker gender. For logistical reasons, both LSTM models we implemented used only the data points where we were sure of the gender, leaving us with approximately 2,200 transcripts. All LSTM classifiers used binary cross-entropy loss, with either stochastic gradient descent (SGD) or Adam as the optimizer.
Less complex text analysis methods only take into account what words were said and how often, not the order or context in which the words were spoken. The lack of context in simpler models can lead to positive feedback loops where they continuously generate the most frequent word in a text over and over. By looking at more complex relationships in the text, LSTMs can use the structure of the text to discern which words fit the current context better than others.
When a body of text is treated as a training set, it can be broken up into chunks, each representing a single data point we can use to train our model. For our data, we broke each transcript into its own set of training examples and then combined them all into one large training set for our LSTM models. Because the meaning of written text is highly dependent on syntax, grammar, and context, most state-of-the-art models use LSTMs.
A common way to handle text classification is with word embeddings. Briefly, a word embedding is a representation of a word in a finite dimensional vector space. Words that are more similar are assigned coordinates (or vectors) that are closer to each other. Typically text is fed through a word embedding layer of a neural network, so that each word gets mapped to its vector. These vectors are then mapped to the remaining layers of the network. Luckily, engineers and scientists often pre-train word embedding layers and open source them for others to use. We used two well known ones, GloVe from Stanford, and lm 1b from Google.
Classifying with GloVe Embeddings
The GloVe embeddings are available from Stanford’s website. With help from the Keras blog, we downloaded the embeddings and prepared them in an embedding layer. We tried several neural network architectures on top (typically two to three dense layers with dropout), but none were able to classify the text: validation accuracy stayed flat, and cross-entropy loss increased, indicating overfitting.
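The preparation step can be sketched as follows; the two three-dimensional “GloVe vectors” here are invented for illustration, but the word-then-values line format matches the real files:

```python
import numpy as np

# stand-in for the contents of a GloVe .txt file: "word v1 v2 v3 ..."
glove_txt = "hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6"
embeddings = {}
for line in glove_txt.splitlines():
    word, *values = line.split()
    embeddings[word] = np.asarray(values, dtype="float32")

# map our corpus vocabulary to rows of an embedding matrix
word_index = {"hello": 1, "world": 2}  # index 0 reserved for padding
dim = 3
matrix = np.zeros((len(word_index) + 1, dim))
for word, i in word_index.items():
    matrix[i] = embeddings[word]
# matrix would then be passed as frozen weights to an Embedding layer
```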
Classifying with Google Embeddings
Research scientists from Google open sourced a pre-trained language model in 2016, with a pre-trained computational graph trained on almost a billion words and a vocabulary size of nearly 800,000. We used this model to create word embeddings, which were then fed into a standard network architecture of dense layers with ReLU activation. This approach presented several challenges. A friend of the group helped with some code from his blog to get the Google model working, but we had to significantly trim the size of the transcripts. Running full-length transcripts through their LSTM took approximately 20 minutes per transcript with 16 cores and 32 GB of RAM. After some trial and error, we truncated each transcript to its first 2,000 characters and upped our virtual machine to 24 cores and 80 GB of RAM, which brought us to around two minutes per transcript. The increased computational power allowed the model to finish in around three days (we had around 2,200 data points). We discovered these methods too late to attempt running the entire transcripts through the LSTM, as we calculated it would take at least ten days.
In the end, the results from the LSTM were fed into several different neural network architectures similar to those above, with poor results. We attempted several different layer counts and widths, including two to three dense layers with ReLU or tanh activation and widths from 128 to 1,024, training for several hundred epochs. We also encoded the metadata and concatenated it with the LSTM outputs, but this actually decreased performance. We never achieved a significant AUC score with any of these methods.
LSTMs for Text Generation
As we previously discussed, the ability of an LSTM to learn context is what makes it shine in text generation. This time we used LSTM networks to train a model to learn character and word level language structure so that it could generate its own text for a TED Talk.
Our first task was preparing the data to be fed into the network. To accomplish this, words or characters are encoded as numbers, either via one-hot encoding or via integer encoding. These encodings have shortcomings, as they place ordering and mathematical constraints on the text that may not reflect reality, but they are common and easy to implement.
We created a one-to-one mapping of words to unique integers. For example, our first word model would look at the last five words, represented as integers. These integers were fed to a neural network whose softmax output gives the probability of each word appearing next; we then simply pick the word with the highest probability. The training target for each example is an array the length of the number of unique words, all zeros except for a one at the index of the next word.
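A minimal sketch of this encoding, using an invented seven-word “transcript”:

```python
import numpy as np

# toy corpus standing in for a transcript
text = "ideas worth spreading are ideas worth keeping".split()
word_to_int = {w: i for i, w in enumerate(sorted(set(text)))}

# sliding five-word windows as inputs, the following word as the target
window = 5
X, y = [], []
for i in range(len(text) - window):
    X.append([word_to_int[w] for w in text[i:i + window]])
    y.append(word_to_int[text[i + window]])

# one-hot the targets, matching the shape of the softmax output
n_words = len(word_to_int)
Y = np.zeros((len(y), n_words))
for row, idx in enumerate(y):
    Y[row, idx] = 1
```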
Architecture and Training
We attempted generation based on both different-sized memory windows and character vs. word encoding. Neither was extremely successful, likely due to a lack of training data. However, we did see massive improvement in both approaches from increased training time.
Generating text required feeding a seed to the trained network so that it could predict the next word, and keep predicting from there. We started with short training runs but found that our generative abilities were not great; in general, increasing our training time resulted in better models. We used Adam as the optimizer and categorical cross-entropy as the loss function for both models. The word model did have much wider layers, as the number of unique words is far larger than the number of unique characters, and so we thought it would require a larger dimensional space.
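The seeded generation loop itself can be sketched as follows; `predict_next` is a toy deterministic stub standing in for the trained network’s softmax-and-argmax step:

```python
def predict_next(window):
    """Toy stand-in: the real model returns the argmax of a softmax."""
    return (sum(window) + 1) % 10  # arbitrary deterministic rule

def generate(seed, n_steps, window_size=5):
    """Start from a seed and repeatedly feed the model its own output."""
    tokens = list(seed)
    for _ in range(n_steps):
        window = tokens[-window_size:]  # last few tokens as context
        tokens.append(predict_next(window))
    return tokens

out = generate([1, 2, 3, 4, 5], n_steps=3)
```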
Our first attempt was a character-based generator that looked at the 100 most recent characters. We trained for thirty epochs and got underwhelming results, as shown below. While the model was not able to generate any useful sentences, it did pick up on some language patterns, including the likelihood of ‘o’s and ‘e’s being chained together, as well as the words ‘of’ and ‘the’. It also picked up on common word lengths.
r, when possible, label things, mostly. but i beg of the designers here to break all those rules if the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee of the sas oo the sooee
We eventually increased training to one hundred and fifty epochs to generate something legible. We had to reduce the dataset to just one TED talk to finish training before our deadline, and it still trained for almost eight hours on a 24-core system. While we were at first excited by the results, we quickly realized the model had simply memorized the talk.
llet school; she became a soloist; she had a wonderful career at the royal ballet. she eventually graduated from the royal ballet school, founded the gillian lynne dance company, met andrew lloyd webber. she’s been responsible for some of the most successful musical theater productions in history, she’s given pleasure to millions, and she’s a multi-millionaire. somebody else might have put her on medication and told her to calm down.(applause)what i think it comes to is this: al gore spoke the other night about ecology and the revolution that was triggered by rachel carson.
The word models had similar results. We looked at a window of the five most recent words. Below is output from the word-based generator after 30 epochs. It appears it was not sophisticated enough to learn context, and just chose the most likely word over and over.
never happened before in the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the
Increasing our number of epochs helped significantly. We again had to reduce the dataset, this time to five talks. After training for 150 epochs (approximately five hours on a 24-core system), it gained a more diverse vocabulary and was able to put together sentences, although they are clearly not representative of a coherent stream of thought.
cannot give the data free to the students, free to opportunity just the and more surgery axises, the gap.” the the equipment the the convince the ford the things the much.(applause)that’s to quickly the gadgets the we to kings wears market, the exact the until the gapminder movement. walkways math mean. the a the heads.(laughter)don’t the unesco, avert the force: and $3 and the the “spot washington! ford the know: the three can the the us 75 down the a audience the function. everywhere the the going the the is the may the a that.you, community-friendly and the age bringing the choose, sewage recapitulation the the “this liberating everyone the people-first of sequence, the keep the conclude, republican, is the you: have to important. 600,000 the the but “former nonetheless, the an lesson, the education audience the can i economic the loss.(laughter)i to have and the “i’ve now! funding. the echo died.
It is worth noting that the model was able to insert (laughter) and (applause) where it thought the audience would clap or laugh at its own jokes (even if we may not understand them). The validation loss was still decreasing; perhaps a larger training set and more epochs would have helped this model learn grammatically correct sentences.
Brief discussion of the results
In general, increased training time helped our model generate more convincing language. However, to train a model within our timeline we had to significantly decrease the size of the dataset, which led to massive overfitting. We think that with more data and training time it could potentially produce passable speech.
Predicting number of views a TED talk will receive
We also took a brief look into the importance of certain features in predicting a talk’s view count. We found a large variance in the number of views per TED talk: 93% of the talks had between 300,000 and 5,000,000 views, but there were also a few significant outliers, with the most viewed talk at approximately 47.2 million views. To help reduce the skew from these outliers, we dropped them while training our model.
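The outlier trim is a one-liner; the view counts and the cutoff below are hypothetical, chosen just to mirror the ranges described above:

```python
# made-up view counts, including one extreme outlier
views = [450_000, 1_200_000, 3_800_000, 47_200_000, 900_000]

# hypothetical cutoff: drop talks above the range most talks fall in
cutoff = 5_000_000
kept = [v for v in views if v <= cutoff]
```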
Our best model was an XGBRegressor with a final test error of about 480,000 views. The most important features were the ratings, which we engineered during data exploration.
Conclusion and Future Work
We have several ideas for expanding on our work. We would like to further explore LSTMs with word embeddings; many state-of-the-art methods use similar architectures, and finding out why ours did not work well may just be a matter of tuning the architecture. We are also interested in training the generative and classification LSTM models for longer to increase their effectiveness. The generative models had to be trained on a very small subset of the data due to time constraints; there is potential to improve them given more training resources and time. We would also like to integrate a sort of “plagiarism checker” into the generative model so that we can detect when overfitting has occurred, as it is hard to tell by simply looking at the text.
Another area for future work is how some of the other metadata, such as unique tags and the talk title, affects view count. Using feature importance, it may be possible to analyze drafts of future talks and show speakers where they could improve. For instance, if a talk with certain words in the title is more likely to be successful, we could use that feedback to help increase a talk’s reach by suggesting a better title. We encourage anyone who wants to help us explore any of this future work to check out the GitHub page for our project!
Source: Deep Learning on Medium