SHEROES is a women’s community platform where members discuss things like health, careers, relationships and share their life stories, achievements and moments. SHEROES has always aimed to create a safe and trusted space for women.
One of the biggest roles of the Data Science team at SHEROES is to leverage data and build systems that keep the platform safe and free of spam and abuse, so that women can have open discussions.
Creating a women-only safe space is full of challenges. Keeping the platform free of men is crucial so that our users can engage freely and without hesitation, and to do that we use a multi-layer, data-driven approach.
That’s why we have multiple checks to ensure trust and safety on the platform:
- Gender check using Google and Facebook data
- Gender detection based on the user’s mobile and profile picture
- Spam profile detection
- If a fake or spammy profile is still active after the above checks, our users and community moderators can report the profile.
At any step, if the profile is marked as spam/male, it lands in the SPAM dashboard.
This post focuses only on the Gender Inference Model. Let’s dive in.
Gender Inference model
We came up with a hybrid model to overcome this challenge. The model uses the user’s name and profile image to deduce the user’s gender. One might wonder why both the name and the image are needed. We wanted as many signals as possible so that no single data point drives the decision. We could not depend fully on the image because, firstly, not all users have a profile picture and, secondly, some mothers use a photo of their kids as their profile picture. As for the name, Indian names are very diverse, and a foolproof machine learning model that predicts gender from the name alone did not seem pragmatic.
Creating the dataset:
We used a publicly available dataset of Indian names along with our own name dictionary. To make the model robust, we considered all possible spellings of a name, for example Ankur and Ankoor, or Preyanka and Priyanka.
We used features such as:
- The last character is a vowel: A major percentage of female Indian names end with a vowel, for example Shivani, Gauri and Megha.
- Number of syllables: Every syllable must have a vowel, and a vowel can form a whole syllable on its own, with or without consonants at the beginning or end. This means the number of vowels in a word can be used as an approximation of the number of syllables.
Female names tend to contain more syllables than male names. Some examples are Anuradha, Anupriya and Gauri.
- Length of the word: The length of a name does not by itself determine gender, but our analysis showed that male names are generally longer than female names.
- Frequency of each character
- Ending N-grams: we restricted ourselves to 3 n-grams to avoid overfitting the model.
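The features listed above could be extracted along the following lines. This is only a minimal sketch, not our production code: the function name `name_features`, the feature keys, and the reading of "3 n-grams" as name endings of length 1 to 3 are assumptions made for illustration.

```python
from collections import Counter

VOWELS = set("aeiou")


def name_features(name: str, max_n: int = 3) -> dict:
    """Build a feature dictionary for a single first name (illustrative only)."""
    # Keep letters only and lowercase them so "Shivani" and "shivani" match.
    name = "".join(ch for ch in name.lower() if ch.isalpha())

    features = {
        # Does the name end with a vowel?
        "ends_with_vowel": bool(name) and name[-1] in VOWELS,
        # Syllables approximated by the number of vowel letters.
        "syllables": sum(ch in VOWELS for ch in name),
        # Length of the name in characters.
        "length": len(name),
    }

    # Frequency of each character in the name.
    for ch, count in Counter(name).items():
        features[f"count_{ch}"] = count

    # Ending n-grams: the last 1, 2 and 3 characters (assumed interpretation).
    for n in range(1, max_n + 1):
        if len(name) >= n:
            features[f"suffix_{n}"] = name[-n:]

    return features


if __name__ == "__main__":
    print(name_features("Shivani"))
```

A dictionary like this would still need to be turned into a numeric vector (for example by one-hot encoding the suffix values) before being passed to a classifier.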
We trained the model using an SVM classifier. Initially we tried boosting as well, but it led to overfitting.
An SVM with a linear kernel gave the most satisfactory results.
The train/test split was 70:30.
We achieved an F1 score of 87%.
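For readers unfamiliar with the evaluation setup, here is a small sketch of a 70:30 split and of how an F1 score is computed. It is plain Python written for illustration; the helper names are hypothetical, and treating "F" (female) as the positive class is an assumption.

```python
import random


def train_test_split(samples, test_fraction=0.3, seed=0):
    """Shuffle the samples and split them into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def f1_score(y_true, y_pred, positive="F"):
    """F1 is the harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    train, test = train_test_split(range(10))
    print(len(train), len(test))
    print(f1_score(["F", "F", "F", "M"], ["F", "F", "M", "F"]))
```

F1 is a useful metric here because it penalizes both kinds of mistakes: predicting female for a male name (lower precision) and missing a female name (lower recall).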
One limitation of the model was that it failed to detect some names as female, such as “Komal”, “Sonam” and “Parul”. The reason was that, of the features we used, the vowel-ending feature dominated and the other features made little difference. Hence, we decided to build a better model that could also treat the other characters in the name as participating features. RNNs were a good fit because they learn from sequences (in this case, sequences of characters).
We developed a long short-term memory (LSTM) model using Keras with the TensorFlow backend.
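Before a name can be fed to a sequence model, it has to be turned into a fixed-length sequence of numbers. The sketch below shows one common way to do this, assuming a lowercase Latin alphabet, index 0 reserved for padding, and a hypothetical maximum length of 20 characters. None of these choices are taken from our actual model.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# Index 0 is reserved for padding, so letters start at 1.
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
MAX_LEN = 20  # hypothetical maximum name length


def encode_name(name: str, max_len: int = MAX_LEN) -> list:
    """Map a name to a list of character indices, left-padded with zeros."""
    # Drop anything that is not a known letter (spaces, digits, punctuation).
    indices = [CHAR_TO_INDEX[ch] for ch in name.lower() if ch in CHAR_TO_INDEX]
    # Truncate long names, then pad short ones on the left.
    indices = indices[:max_len]
    return [0] * (max_len - len(indices)) + indices


if __name__ == "__main__":
    print(encode_name("Komal", max_len=8))
```

In Keras, sequences like these would typically go through an `Embedding` layer followed by an `LSTM` layer and a final sigmoid output. Because the network reads every character in order, a name such as “Komal” is no longer judged mainly by its last letter.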
For the image model, we used a CNN that predicts the user’s gender and age. Initially, the only purpose of the image model was to predict gender, but as soon as the model went live we realized that some mothers used a photo of their kids as their profile picture, which resulted in a large number of false negatives. To reduce these misclassifications, we updated the model to estimate age as well.
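One way the predicted age could be used is to ignore the image-based gender signal when the face in the picture appears to be a child. The sketch below illustrates that idea only; the function name, the "F"/"M" labels and the age threshold are all hypothetical, and the actual rule in production may differ.

```python
from typing import Optional

MIN_ADULT_AGE = 16  # hypothetical threshold, chosen only for illustration


def image_gender_signal(
    predicted_gender: str,
    predicted_age: float,
    min_adult_age: float = MIN_ADULT_AGE,
) -> Optional[str]:
    """Return the image-based gender, or None if the image should be ignored."""
    # A child's face is likely a photo of the user's kid, not the user,
    # so it should not count as evidence about the user's gender.
    if predicted_age < min_adult_age:
        return None
    return predicted_gender


if __name__ == "__main__":
    print(image_gender_signal("M", 5))   # child in the picture: no signal
    print(image_gender_signal("F", 30))  # adult in the picture: use the signal
```

When the function returns `None`, the decision would fall back on the other signals, such as the name.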
After pushing this model into production, we saw a drop in the activity of male users, since they landed in the spam dashboard immediately after signing up if the model detected them as male.
How we handled false positives and false negatives
In our case, ignoring false positives (allowing a male user onto the platform) and ignoring false negatives (blocking a female user from entering the platform) were equally dangerous.
That’s where our community team comes in, which has been removing male users since the beginning of SHEROES.
False positives, being very few in number, are easily identified by our community moderators while using the app.
For false negatives, we use our existing SPAM dashboard: every time the model marks a user as male, the user lands in the spam dashboard. If the user is actually male, the account is deactivated; otherwise she is approved to use the app.
We are working on improving the LSTM model and are also building a combined name dictionary and machine learning model: when a user signs up, we first check our dictionary to see whether the name is male or female. If the name does not exist in our dictionary, it is classified using the machine learning model.
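The dictionary-first idea can be sketched as below. This is an illustration under stated assumptions: `classify_name`, the toy dictionary and the `vowel_ending_stub` stand-in for the machine learning model are all hypothetical and are not the system we are building.

```python
from typing import Callable, Dict, Tuple


def classify_name(
    name: str,
    name_dictionary: Dict[str, str],
    model_predict: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (gender, source), where source is "dictionary" or "model"."""
    key = name.strip().lower()
    # Known names are answered directly from the dictionary.
    if key in name_dictionary:
        return name_dictionary[key], "dictionary"
    # Unknown names fall back to the machine learning model.
    return model_predict(key), "model"


def vowel_ending_stub(name: str) -> str:
    """Toy stand-in for a trained model: guesses from the last letter only."""
    return "F" if name and name[-1] in "aeiou" else "M"


if __name__ == "__main__":
    dictionary = {"komal": "F", "ankur": "M"}  # toy dictionary
    print(classify_name("Komal", dictionary, vowel_ending_stub))
    print(classify_name("Megha", dictionary, vowel_ending_stub))
```

Returning the source alongside the prediction makes it easy to see how often the fallback model is used and which names are worth adding to the dictionary.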
This is one of the many things that we are doing to maintain trust and keep the platform free of spam and abuse.
Subscribe to read more about the exciting stuff the Data Science team at SHEROES is working on :)
- Medium. (2018). Classifying Gender Based On Indian Names In — Simpl — Under The Hood — Medium. [online] Available at: https://medium.com/simpl-under-the-hood/classifying-gender-based-on-indian-names-in-82f34ce47f6d.
- Towards Data Science. (2018). Deep learning gender from name -LSTM Recurrent Neural Networks. [online] Available at: https://towardsdatascience.com/deep-learning-gender-from-name-lstm-recurrent-neural-networks-448d64553044.
- ayoungprogrammer’s blog. (2018). Determining Gender of a Name with 80% Accuracy Using Only Three Features. [online] Available at: http://blog.ayoungprogrammer.com/2016/04/determining-gender-of-name-with-80.html/.
- Gist. (2018). Dataset of ~14,000 Indian female names for NLP training and analysis. The names have been retrieved from public records. (name,gender,race). [online] Available at: https://gist.github.com/mbejda/9b93c7545c9dd93060bd.
- Gist. (2018). Dataset of ~14,000 Indian male names for NLP training and analysis. The names have been retrieved from public records. (name,gender,race). [online] Available at: https://gist.github.com/mbejda/7f86ca901fe41bc14a63.
Source: Deep Learning on Medium