How can Videos be Used to Detect Your Personality?

Original article was published by Metika Sikka on Artificial Intelligence on Medium

An interesting aspect of the Big-5 personality trait theory is that these traits are independent but not mutually exclusive. For example, we can see in the above image that Sheldon Cooper (The Big Bang Theory) would score low on Extraversion but high on Neuroticism, while Phoebe Buffay (Friends) would score low on Conscientiousness but high on Openness, and so on.

About the Data Set

The First Impressions Challenge provides a data set of 10k clips from 3k YouTube videos. The aim of this challenge was to understand how a deep learning approach can be used to infer apparent personality traits from videos of subjects speaking in front of a camera.

The training set comprised 6k videos. The validation and test sets had 2k videos each. The average duration of the videos was 15 seconds. The ground truth labels for each video consisted of five scores, one for each of the Big-5 personality traits. These scores were between 0 and 1. The labeling was done by Amazon Mechanical Turk workers. More information about the challenge and the data set can be found in this paper.

Video data is unstructured but rich in multimedia features. The approach explained in this blog post uses audio and visual features from the videos. The analysis and modeling were done on Google Colab. The code can be accessed on GitHub.

Distribution of the Ground Truth Labels

Created by Author

The graph on the left shows the distributions of personality scores in the training data set. It’s interesting to note that the distributions of the scores are quite similar and even symmetric about the mean. The reason for this symmetry could be that the scores aren’t self-reported. Self-reported personality assessment scores are usually skewed due to social desirability bias.

Extracting Visual Features

Videos consist of image frames. These frames were extracted from the videos using OpenCV. In apparent personality analysis, visual features include facial cues, hand movements, the person's posture, etc. Since the data set consisted of videos with an average duration of 15 seconds, 15 random frames were extracted from each video. Each extracted frame was then resized to 150 × 150 and scaled by a factor of 1/255.
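The frame extraction step can be sketched as follows. This is a minimal illustration, not the author's code: the function names and the `seed` parameter are hypothetical, and frames are left in OpenCV's default BGR channel order because the article does not say whether they were converted.

```python
import numpy as np


def sample_frame_indices(total_frames, n_frames=15, seed=None):
    """Pick distinct frame indices at random, returned in temporal order."""
    rng = np.random.default_rng(seed)
    n = min(n_frames, total_frames)  # short clips yield fewer frames
    return np.sort(rng.choice(total_frames, size=n, replace=False))


def extract_random_frames(video_path, n_frames=15, size=(150, 150), seed=None):
    """Return up to n_frames frames as a float array scaled to [0, 1]."""
    import cv2  # imported here so the sampling helper works without OpenCV

    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in sample_frame_indices(total, n_frames, seed):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))  # seek to the chosen frame
        ok, frame = cap.read()
        if not ok:
            continue  # skip frames the decoder could not read
        frame = cv2.resize(frame, size)  # resize to 150 x 150
        frames.append(frame.astype(np.float32) / 255.0)  # scale by 1/255
    cap.release()
    if not frames:
        raise ValueError(f"no frames could be read from {video_path}")
    return np.stack(frames)  # shape: (frames, 150, 150, 3), BGR order
```

Sorting the sampled indices keeps the frames in chronological order, which matters once they are fed to a sequence model.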

Created by Author using a Video from the First Impressions Challenge

Extracting Audio Features

The waveform audio was extracted from each video by calling ffmpeg in a subprocess. An open source toolkit, pyAudioAnalysis, was used to extract audio features from 15 non-overlapping frames (keeping the frame step equal to the frame length in the audioAnalysis subprocess). These included 34 features along with their delta features. The output was a 1 × 68 dimensional vector for each frame, or a 15 × 68 dimensional tensor for the 15 audio frames.

The types of features extracted through pyAudioAnalysis include Zero Crossing Rate, Chroma Vector, Chroma Deviation, MFCCs, Energy, Entropy of Energy, Spectral Centroid, Spectral Spread, Spectral Entropy, Spectral Flux and Spectral Rolloff.
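The audio pipeline can be sketched roughly as below. Several things here are assumptions rather than details from the article: the helper names are hypothetical, the mono mix-down and 16 kHz sampling rate are illustrative choices, and the sketch calls the Python API of pyAudioAnalysis (`audioBasicIO.read_audio_file` and `ShortTermFeatures.feature_extraction`, as found in recent releases) instead of the audioAnalysis script mentioned above.

```python
import subprocess


def build_ffmpeg_command(video_path, wav_path, sample_rate=16000):
    """Build an ffmpeg command that writes the audio track as a WAV file."""
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # mix down to one channel (assumption)
        "-ar", str(sample_rate),  # resample (assumed rate, not from the article)
        wav_path,
    ]


def window_for_n_frames(n_samples, n_frames=15):
    """Window length in samples so that n_frames windows tile the signal."""
    return n_samples // n_frames


def extract_audio_features(video_path, wav_path, n_frames=15):
    """Return an (n_frames, 68) array of short-term features and deltas."""
    # Imported lazily: pyAudioAnalysis is only needed for this step.
    from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)
    sample_rate, signal = audioBasicIO.read_audio_file(wav_path)
    window = window_for_n_frames(len(signal), n_frames)
    # step == window gives non-overlapping frames; deltas=True turns the
    # 34 base features into 68 (features plus their deltas).
    features, _names = ShortTermFeatures.feature_extraction(
        signal, sample_rate, window, window, deltas=True
    )
    return features.T[:n_frames]  # (68, frames) -> (frames, 68)
```

Deriving the window length from the clip length is one way to get exactly 15 frames per clip regardless of its duration.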

Deep Bimodal Regression Model

The Functional API of Keras, with TensorFlow as the backend, was used to define the model. The model was defined in two phases. In the first phase, the image and audio features were extracted; in the second, the sequential features of the videos were processed. To process the audio and visual features, a bimodal time distributed approach was taken in the first phase.

Keras has a time distributed layer which can be used to apply the same layer individually to multiple inputs, resulting in a “many to many” mapping. Simply put, the time distributed wrapper enables any layer to extract features from each frame or time step separately. The result: an additional temporal dimension in the input and the output, representing the index of the time step.
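To make the idea concrete, here is a small NumPy sketch of what a time distributed dense layer computes. It is an illustration of the concept rather than Keras code; the shapes (15 time steps, 68 features, 32 units) mirror the audio branch of this model, and the batch size of 2 and random weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

batch, time_steps, n_features, units = 2, 15, 68, 32
x = rng.normal(size=(batch, time_steps, n_features))

# One set of weights, shared by every time step.
W = rng.normal(size=(n_features, units))
b = np.zeros(units)


def dense(v):
    """A plain dense layer (no activation) on a batch of 68-d vectors."""
    return v @ W + b


# "Time distributed": apply the same layer to each time step, then restack.
out = np.stack([dense(x[:, t, :]) for t in range(time_steps)], axis=1)

print(out.shape)  # (2, 15, 32)
```

The temporal axis is preserved: 15 feature vectors go in and 15 transformed vectors come out, all produced by the same weights.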

The audio features extracted via pyAudioAnalysis were passed through a dense layer with 32 units in a time distributed wrapper. Hence, the same dense layer was applied to the 1 × 68 dimensional vector of each audio frame. Similarly, each image frame was passed in parallel through a series of convolutional blocks.

Created by Author

After this step the audio and visual models were concatenated. To process the chronological or temporal aspect of videos, the concatenated outputs were further passed to a stacked LSTM model with a dropout and recurrent dropout rate of 0.2. The output of the stacked LSTM was passed to a dense layer with ReLU activation and dropout rate of 0.5. The final dense layer had 5 output units (one for each personality trait), along with sigmoid activation to get predicted scores between 0 and 1.
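The architecture described above could be assembled along the following lines. This is a sketch under stated assumptions, not the author's exact network: the number of convolutional blocks, the filter counts, the pooling choice, the LSTM and dense layer sizes, and the activation of the audio dense layer are all illustrative. Only the input shapes, the 32-unit audio layer, the dropout rates, and the 5-unit sigmoid output come from the article.

```python
from tensorflow.keras import Input, Model, layers

N_FRAMES = 15  # frames per video, from the article

# Visual branch: the same small CNN is applied to each of the 15 frames.
input_img = Input(shape=(N_FRAMES, 150, 150, 3))
x = input_img
for filters in (16, 32, 64):  # assumed filter counts
    x = layers.TimeDistributed(
        layers.Conv2D(filters, 3, padding="same", activation="relu"))(x)
    x = layers.TimeDistributed(layers.MaxPooling2D(2))(x)
x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(x)  # assumed

# Audio branch: a 32-unit dense layer applied to each 68-d audio frame.
input_aud = Input(shape=(N_FRAMES, 68))
a = layers.TimeDistributed(layers.Dense(32, activation="relu"))(input_aud)

# Merge per time step, then model the sequence with a stacked LSTM.
merged = layers.Concatenate()([x, a])
h = layers.LSTM(64, return_sequences=True,
                dropout=0.2, recurrent_dropout=0.2)(merged)  # assumed size
h = layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2)(h)   # assumed size
h = layers.Dense(64, activation="relu")(h)                   # assumed size
h = layers.Dropout(0.5)(h)
output = layers.Dense(5, activation="sigmoid")(h)  # one score per trait

model = Model([input_img, input_aud], output)
```

The first LSTM needs `return_sequences=True` so that the second LSTM still receives one vector per time step.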

Generator Function

The biggest challenge was managing the limited memory resources. This was accomplished using mini-batch gradient descent. To implement it, a custom generator function was defined as follows:
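A generator of this kind could look roughly like the sketch below. It is a reconstruction under assumptions and should not be read as the author's code: `load_frames` and `load_audio` are hypothetical callables that return the per-video frame array and audio feature array, and the shuffling behaviour is an illustrative choice.

```python
import numpy as np


def batch_generator(video_ids, labels, load_frames, load_audio,
                    batch_size=8, shuffle=True, seed=None):
    """Yield ([image_batch, audio_batch], label_batch) indefinitely.

    Only one mini batch is held in memory at a time.
    load_frames / load_audio are hypothetical per-video loader callables.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    order = np.arange(len(video_ids))
    while True:  # loop forever; the caller decides how many batches to draw
        if shuffle:
            rng.shuffle(order)  # new order at the start of each pass
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            imgs = np.stack([load_frames(video_ids[i]) for i in idx])
            auds = np.stack([load_audio(video_ids[i]) for i in idx])
            # Both model inputs travel together in one list.
            yield [imgs, auds], labels[idx]
```

Because the data is loaded inside the loop, memory use depends on the batch size rather than on the size of the data set.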

Note: The generator function yields the inputs for the audio and visual models in one list. Correspondingly, the model is defined by passing a list of two inputs to the Model class of Keras:

model = Model([input_img, input_aud], output)

The model was compiled using the Adam optimizer with a learning rate of 0.00001. The model was trained for 20 epochs with a mini-batch size of 8. Mean squared error was taken as the loss function. A custom metric called mean accuracy was defined to track the performance of the model. It was calculated as follows:

Here N is the number of input videos.
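As a rough illustration, the sketch below computes mean accuracy as one minus the mean absolute error, i.e. 1 − (1/N) Σ |target − predicted|. This assumes the metric follows the definition used by the First Impressions challenge. The function name and the order of the traits in `TRAITS` are also assumptions.

```python
import numpy as np

# Assumed column order of the five trait scores.
TRAITS = ("openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism")


def mean_accuracy(y_true, y_pred):
    """Overall and per-trait mean accuracy: 1 - mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)        # shape (N, 5)
    per_trait = 1.0 - abs_err.mean(axis=0)   # average over the N videos
    overall = 1.0 - abs_err.mean()           # average over videos and traits
    return overall, dict(zip(TRAITS, per_trait))
```

Since all scores lie between 0 and 1, the metric also lies between 0 and 1, with 1 meaning every prediction matches its label exactly.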

Overall, the model performed quite well, with a final test mean accuracy of 0.9047.

Created by Author

The table below shows the test mean accuracy for each of the Big-5 personality traits. The model shows similar performance for all 5 personality traits.

Created by Author

The Road Ahead…

The results of the model can be further improved by increasing the frame sizes and lengths depending upon the availability of processing power. NLP analysis of video transcriptions can also be used to get additional features.

While automated apparent personality analysis has important use cases, care must be taken that algorithmic bias does not affect the results. The aim of such AI applications is to provide a more objective approach. However, such objectivity can only be achieved if bias is excluded at each stage, i.e., from data collection to the interpretation of results.