Does a Machine Know Your Gender Based on Your Tweets?

Original article was published on Artificial Intelligence on Medium

Exploring Natural Language Processing (NLP) and Machine Learning with Twitter API Data

Why is this an important question? Ultimately it is a stepping stone to answering the ethical question: Should we teach Artificial Intelligence to detect your ethnicity? It is analogous to asking: If we had complete discretion, would we teach our children to recognize someone’s ethnicity?

Image from unsplash.com/@heyerlein

As the creators and users of these tools, it is our responsibility to regulate AI in a similar manner to how we enforce seat belts and speed limits.

Elon Musk on AI regulation.

On the pathway to AI ethics, we can first explore AI’s ability to predict your gender based solely on text. We will not use voice or facial recognition because that would be too easy.

In terms of economic value, what are the business cases for classifying gender?

  • Better customer segmentation: companies can reduce and optimize advertising costs by correctly identifying customers and providing more personalized marketing.
  • Better ecommerce experience: in an industry like fashion, identifying the correct gender improves the customer experience, allows retailers to tailor suggestions based on your needs, and can increase sales.
  • Better recommendation engines: some algorithms are based on customer similarities with gender as a key variable. It can help you find interest groups (i.e. for Moms or Dads), gender specific topics (i.e. pregnancy), or potential entertainment preferences in movies, music, or books.
  • Business ethics: companies increasingly need to align with and respond to social issues. Brand perception among the new generation can result in significant growth or loss of customers; consider, for example, social media companies and a person’s sense of “trust” in using their products.

But can we really detect gender based only on text? Let’s explore this with natural language processing and machine learning. Here are six Twitter quotes from public figures.

Image from unsplash.com/@mr_fresh

Can you guess the gender better than a machine?

  1. “But you have to do what you dream of doing even while you’re afraid.” (male or female?)
  2. “My biggest baby is 11 today! I’m so proud of the kind, artistic, curious, sensitive, intelligent, compassionate person she’s become. Time goes by so fast. Family dinner at our favorite spot” (male or female?)
  3. “Make no mistake, we may be in a terrible place right now but the young people in this photo will soon be in charge and what they want is pretty clear.” (male or female?)
  4. “It’s World Health Day, and we owe a profound debt of gratitude to all our medical professionals. They’re still giving their all for us every day, at great risk to themselves, and we can’t thank them enough for their bravery and their service.” (male or female?)
  5. “Why now? The needs are increasingly urgent, and I want to see the impact in my lifetime. I hope this inspires others to do something similar. Life is too short, so let’s do everything we can today to help people now.” (male or female?)
  6. “Don’t feel stupid if you don’t like what everyone else pretends to love.” (male or female?)

Building a Gender Prediction Machine Learning Model with Twitter Data

Steps:
1. Get Twitter data
2. Clean data
3. Build model
4. Predict gender
5. Discuss next steps

1. Get Twitter Data

# Import libraries
import json
import pprint
import tweepy as tw
import pandas as pd
# Connect to Twitter API
path_auth = '[your file path to twitter API keys]'
auth = json.loads(open(path_auth).read())
pp = pprint.PrettyPrinter(indent=4)
my_consumer_key = auth['my_consumer_key']
my_consumer_secret = auth['my_consumer_secret']
my_access_token = auth['my_access_token']
my_access_token_secret = auth['my_access_token_secret']
auth = tw.OAuthHandler(my_consumer_key, my_consumer_secret)
auth.set_access_token(my_access_token, my_access_token_secret)
api = tw.API(auth)
type(api)

Upload a list of desired Twitter users with their genders labeled: 0 = male, 1 = female.

# Upload list of desired Twitter users
# Gender classification: 0 = male, 1 = female
users = pd.read_csv('../Data/twitter-users.csv')
users.sample(20)
Users were selected based on follower count, gender diversity, and occupation. There are a total of 50 Twitter users in the dataset.

Use the Twitter API to get Tweets from users and create a DataFrame.

# Get collection of tweets and store them in a new DataFrame
frames = []
for index, row in users.iterrows():
    tweets = api.user_timeline(screen_name=row['user'], count=200, include_rts=False)
    users_text = [[tweet.user.screen_name, tweet.text, row['gender']] for tweet in tweets]
    tweet_text = pd.DataFrame(data=users_text,
                              columns=["user", "text", "gender"])
    frames.append(tweet_text)
# Merge the list of DataFrames
tweets = pd.concat(frames)
tweets
Notice that we now have the gender for each Tweet which we can use to train the model.
# Check percentages for each gender
# 0 = male, 1 = female
tweets.gender.value_counts(normalize=True, sort=False)

The gender distribution is roughly even; this may change as we clean the text.

2. Clean Data

Remove anything in the text that is unnecessary to teach the model.

# Import libraries
import numpy as np
import re
import spacy
from matplotlib import pyplot as plt
# Clean text
def clean_text(text):
    # Reduce runs of repeated whitespace (spaces, newlines) to a single character
    text = re.sub(r'(\s)\1+', r'\1', text)
    # Remove double quotes
    text = re.sub(r'"', '', text)
    return text

tweets['clean_text'] = tweets['text'].apply(clean_text)

# Remove hyperlinks
tweets['clean_text'] = tweets['clean_text'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
# Remove patterns
def remove_pattern(text, pattern):
    # re.findall() finds every occurrence of the pattern, e.g. @user
    r = re.findall(pattern, text)
    # re.sub() removes each occurrence from the text
    for i in r:
        text = re.sub(re.escape(i), "", text)
    return text

tweets['clean_text'] = np.vectorize(remove_pattern)(tweets['clean_text'], r"@[\w]*")  # Remove all @mentions
tweets['clean_text'] = np.vectorize(remove_pattern)(tweets['clean_text'], r"&")       # Remove all &
tweets['clean_text'] = np.vectorize(remove_pattern)(tweets['clean_text'], r"#[\w]*")  # Remove all #hashtags
# Remove stop words and apply lemmatization
nlp = spacy.load('en_core_web_sm')

def convert_text(text):
    sent = nlp(text)
    ents = {x.text: x for x in sent.ents}
    tokens = []
    for w in sent:
        if w.is_stop or w.is_punct:
            continue
        if w.text in ents:
            # Keep named entities as-is
            tokens.append(w.text)
        else:
            tokens.append(w.lemma_.lower())
    return ' '.join(tokens)

tweets['clean_text'] = tweets['clean_text'].apply(convert_text)
tweets.sample(15)
The ‘clean_text’ column shows the changes from the original ‘text’ column. Notice how hyperlinks, @mentions, and hashtags are removed and some words are lemmatized.

There is still some text to be cleaned.

# Remove punctuation, numbers, and special characters
tweets['clean_text'] = tweets['clean_text'].str.replace(r"[^a-zA-Z#]", " ", regex=True)
# Remove words of 3 or fewer characters
tweets['clean_text'] = tweets['clean_text'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))
# Count the length of characters
tweets['clean_length'] = tweets['clean_text'].apply(len)
# Remove rows where character length <= 20
tweets = tweets[tweets.clean_length > 20]
tweets.sample(n=15)
Symbols, words of 3 or fewer characters, and rows with a total text length of 20 characters or fewer have been removed.

3. Build Model

We can use Bag-of-words to build a model. There are other methods too, such as using Term Frequency–Inverse Document Frequency (TF-IDF).
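As a point of comparison, TF-IDF down-weights words that appear in many documents, so very common words contribute less than distinctive ones. A minimal sketch with scikit-learn’s TfidfVectorizer, using invented toy sentences in place of the actual tweet data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the cleaned tweets (not the real data)
docs = [
    "proud family dinner tonight",
    "grateful medical professionals service",
    "family time goes fast",
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)

# One row per document, one column per vocabulary term, cells are TF-IDF weights
print(matrix.shape)
print(sorted(tfidf.vocabulary_.keys()))
```

The API mirrors CountVectorizer, so swapping feature extractors later only requires changing one line.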

# Import libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
# Bag-of-words features
bow_vectorizer = CountVectorizer(stop_words='english')
# Bag-of-words feature matrix
bow = bow_vectorizer.fit_transform(tweets['clean_text'])
df_bow = pd.DataFrame(bow.todense(), columns=bow_vectorizer.get_feature_names_out())
df_bow
To the machine, all the words become vectorized and treated as numbers. This becomes similar to running a prediction model on any other dataset, such as predicting housing prices.
# Splitting the data into training and test set
X = df_bow
y = tweets['gender']
# Use Bag-of-words features
X_train_bow, X_test_bow, y_train_bow, y_test_bow = train_test_split(X, y, test_size=0.20)

For the initial prediction, we can use Logistic Regression. There are other models too, such as Decision Tree or XGBoost.
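Because scikit-learn estimators share the same fit/predict interface, swapping in another model is a one-line change. A sketch comparing Logistic Regression with a Decision Tree on synthetic data standing in for the Bag-of-words matrix (XGBoost would follow the same pattern but needs the separate xgboost package):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Synthetic stand-in for the Bag-of-words features
X, y = make_classification(n_samples=500, n_features=50, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

scores = {}
for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("tree", DecisionTreeClassifier(random_state=42))]:
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te))
print(scores)
```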

# Fitting on Logistic Regression model
logreg = LogisticRegression()
logreg.fit(X_train_bow, y_train_bow)
prediction_bow = logreg.predict_proba(X_test_bow)
# Convert probabilities into class labels
# If predicted probability is >= 0.5 then 1, else 0
# Gender: 0 = male, 1 = female
prediction_int = prediction_bow[:, 1] >= 0.5
prediction_int = prediction_int.astype(int)
# Calculating F1 score
log_bow = f1_score(y_test_bow, prediction_int)
log_bow
The F1 score is the harmonic mean of precision and recall, a more informative measure than raw accuracy for classification. This score suggests that a meaningful prediction can be made.
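A small worked example of how F1 relates to precision and recall, using made-up labels rather than the model’s output:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up true labels and predictions for illustration
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

p = precision_score(y_true, y_pred)   # 2 of 3 predicted positives are correct
r = recall_score(y_true, y_pred)      # 2 of 3 actual positives were found
f1 = f1_score(y_true, y_pred)         # harmonic mean: 2*p*r / (p + r)
print(p, r, f1)
```

Here precision, recall, and F1 all come out to 2/3; in general F1 punishes a model that trades one for the other.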

4. Predict Gender

After training the model and achieving a workable accuracy score, it’s time to predict the gender of these 6 tweets from the beginning.

  1. “But you have to do what you dream of doing even while you’re afraid.”
  2. “My biggest baby is 11 today! I’m so proud of the kind, artistic, curious, sensitive, intelligent, compassionate person she’s become. Time goes by so fast. Family dinner at our favorite spot”
  3. “Make no mistake, we may be in a terrible place right now but the young people in this photo will soon be in charge and what they want is pretty clear.”
  4. “It’s World Health Day, and we owe a profound debt of gratitude to all our medical professionals. They’re still giving their all for us every day, at great risk to themselves, and we can’t thank them enough for their bravery and their service.”
  5. “Why now? The needs are increasingly urgent, and I want to see the impact in my lifetime. I hope this inspires others to do something similar. Life is too short, so let’s do everything we can today to help people now.”
  6. “Don’t feel stupid if you don’t like what everyone else pretends to love.”

Import the above data and give it the same Bag-of-words treatment as the training data.

# Import testing set
testset = pd.read_csv('Data/twitter-test.csv')
# Bag-of-words feature matrix
bow = bow_vectorizer.transform(testset['text'])
df_bow_test = pd.DataFrame(bow.todense(), columns=bow_vectorizer.get_feature_names_out())
df_bow_test
Notice how this DataFrame also has 6034 columns just like the training data.
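The key detail here is calling transform (not fit_transform) on the test set, so it is mapped onto the vocabulary learned from the training data. A toy demonstration with invented sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
# fit_transform learns the vocabulary from the training documents
train_matrix = vec.fit_transform(["proud family dinner", "rocket launch team"])

# transform reuses that vocabulary; words unseen in training ("brunch") are dropped
test_matrix = vec.transform(["family rocket brunch"])

print(train_matrix.shape[1], test_matrix.shape[1])  # same number of columns
```

Refitting the vectorizer on the test set would produce a different column layout, and the trained model’s coefficients would no longer line up with the features.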

Now use Logistic Regression to predict the gender.

# Predict probability
z = df_bow_test
pred_prob = logreg.predict_proba(z)
pred_prob = pd.DataFrame(data=pred_prob, columns=['percentage_0', 'percentage_1'])
# Predict classification
pred = logreg.predict(z)
pred = pd.DataFrame(data=pred, columns=['predicted_gender'])
# Store into the same DataFrame
result = pd.concat([testset, pred, pred_prob], axis=1, sort=False)
result
# 0 = male, 1 = female

The result:

The left ‘gender’ column is the true label and ‘predicted_gender’ is the machine’s prediction. The ‘percentage_0’ and ‘percentage_1’ columns are the predicted probabilities of the text belonging to class 0 or 1.

The machine predicted the gender of all six tweets correctly. How did you do compared to the machine?

5. Next Steps

The model can still be refined in many ways for higher accuracy:

  • Using Term Frequency-Inverse Document Frequency (TF-IDF) for feature extraction.
  • Using the combinations of Bag-of-words and TF-IDF features with different models: Logistic Regression, Decision Tree, XGBoost.
  • Optimizing the hyperparameters.
  • Using neural networks for better natural language processing.
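For the hyperparameter point above, scikit-learn’s GridSearchCV automates the search with cross-validation. A minimal sketch on synthetic data; the parameter grid is illustrative, not tuned for this problem:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the text features
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# C controls regularization strength; smaller C means stronger regularization
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.01, 0.1, 1, 10]},
                    scoring='f1', cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The same pattern extends to Decision Tree depths or XGBoost learning rates by changing the estimator and the grid.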

Of course, you could trick the machine with deliberate word choice or gender-neutral phrasing. But what is the significance of this? If a machine can tell your gender based solely on text, what happens if we give it more data on your voice, face, physical attributes, credit card transactions, shopping purchases, social media behavior, or your social graph?

If you extrapolate this out, given the right data, the machine can detect your age, socio-economic status, sexuality, psychological health, and ethnicity the same way a human can, or even better.

Artificial Intelligence is at a point where what we do today will create a large snowball effect into the future, which may then become irreversible. Take the analogy of Pop/Rock Music, which originated around the 1950s and became a global phenomenon.