Source: Deep Learning on Medium
A Distance-based Recommender System with the Yelp Dataset
Ah, Vegas. The lights, the sounds, and the joyful lap-dancing abound. Or so I’ve heard (I haven’t actually been there). As they say, what happens in Vegas stays in Vegas. We all know what that is, and culinary experiences probably do not top the list.
For the fifth project in Metis’ Data Science Bootcamp, I decided to have a go at using Yelp’s Kaggle Dataset to build a distance-based recommender system. In this project, I hypothesize that food is an afterthought: people are indecisive, constantly hangry, and want to be told which restaurants are good in their immediate vicinity, with their personal preferences and visit histories still taken into account.
Recommender systems you say?
Recommender systems are implemented in various forms all across the web, and their advantages are well documented. They are how you are kept in a never-ending loop of YouTube / Netflix binging or impromptu Amazon shopping sprees, often leaving you in a state of self-disgust (yes, we’ve all been there).
While platforms see a direct increase in revenue, consumers also benefit from having tailored suggestions pushed to them. Put simply, recommender systems show the right things to the right people at the right time.
Collaborative & Content-based Filtering
Whilst the level of sophistication in a recommender system varies greatly from use-case to use-case, most are based on two fundamental methods of filtering: collaborative and content-based.
In a collaborative setting, birds of a feather flock together. Users with similar likes and dislikes are grouped together. If a person A has the same opinion as a person B on an issue, A is more likely to have B’s opinion on a different issue than that of a randomly chosen person.
In this example, Users A & B have both rated Crystal Jade & Han’s highly, so it can be assumed that they have similar tastes and preferences. If User A then goes on to rate NamNam highly (a restaurant not yet visited by User B), NamNam would be recommended to User B.
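To make the Users A & B example concrete, here is a minimal sketch of user-based collaborative filtering. The rating values, the extra User C, and the similarity formula (1 / (1 + Euclidean distance) over shared restaurants) are all my own assumptions for illustration, not necessarily what a production system would use.

```python
import numpy as np

# Hypothetical ratings on a 1-5 scale; 0 means "not yet rated".
restaurants = ["Crystal Jade", "Han's", "NamNam"]
ratings = {
    "A": np.array([5.0, 4.0, 5.0]),
    "B": np.array([5.0, 4.0, 0.0]),  # B has not visited NamNam
    "C": np.array([1.0, 2.0, 1.0]),  # hypothetical user with opposite tastes
}

def similarity(u, v):
    """Similarity in (0, 1], based on distance over restaurants both users rated."""
    shared = (u > 0) & (v > 0)
    if not shared.any():
        return 0.0
    distance = np.linalg.norm(u[shared] - v[shared])
    return 1.0 / (1.0 + distance)

def predict(target, item_idx):
    """Similarity-weighted average of other users' ratings for one restaurant."""
    num = den = 0.0
    for name, other in ratings.items():
        if name == target or other[item_idx] == 0:
            continue
        weight = similarity(ratings[target], other)
        num += weight * other[item_idx]
        den += weight
    return num / den if den else None

predicted = predict("B", restaurants.index("NamNam"))
print(f"Predicted rating of NamNam for User B: {predicted:.2f}")
```

Because User A agrees with User B on both shared restaurants, A's high rating of NamNam dominates the weighted average, and NamNam ends up as a strong candidate for User B.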
In a content setting, however, recommendations are based on a description of the item and a profile of the user’s preferences. Restaurants similar to those the User has previously rated highly will be recommended, as in the example of KFC and McDonald’s, two fast-food chains. There are many ways to assess item-preference similarity, but they are beyond the scope of this post.
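As a rough illustration of the content-based idea, the sketch below describes each restaurant by a handful of made-up binary features and ranks the others by cosine similarity to one the User already likes. The feature names, the third restaurant, and the choice of cosine similarity are assumptions on my part; it is just one of the many possible similarity measures.

```python
import numpy as np

# Hypothetical binary features: [fast_food, fried_chicken, burgers, fine_dining]
items = {
    "KFC": np.array([1.0, 1.0, 0.0, 0.0]),
    "McDonald's": np.array([1.0, 0.0, 1.0, 0.0]),
    "Fine-dining bistro": np.array([0.0, 0.0, 0.0, 1.0]),  # made-up contrast item
}

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(liked, catalogue):
    """Return the item closest to `liked`, plus all similarity scores."""
    scores = {
        name: cosine(catalogue[liked], vec)
        for name, vec in catalogue.items()
        if name != liked
    }
    return max(scores, key=scores.get), scores

best, scores = most_similar("KFC", items)
print(best, scores)
```

With these toy features, a User who liked KFC is pointed towards McDonald's rather than the bistro, simply because the two share the fast-food attribute.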
Data Wrangling & Exploration
Now on to business. The datasets of focus here are the business and review JSON files. The former provides information on each business’s attributes, location, and average rating. The latter contains user reviews in the form of text, along with the ratings users gave to the establishments they visited. Both ratings are on a scale of 1–5.
Restaurant-only establishments in Vegas were isolated from the rest. In total, there were about 1.2 million reviews for ~6,500 restaurants given by ~440,000 users. The visualization below shows a rather expected spread of user ratings: the better-rated restaurants received more reviews than the worse-rated ones.
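The filtering step described above might look something like the following pandas sketch. I am assuming column names (`business_id`, `city`, `categories`, `stars`, `user_id`) and that `categories` is a comma-separated string; check these against your copy of the dataset. The tiny in-memory frames stand in for the real files, which could be loaded with `pd.read_json(path, lines=True)` if they are line-delimited JSON.

```python
import pandas as pd

def vegas_restaurants(business: pd.DataFrame) -> pd.DataFrame:
    """Keep only businesses in Las Vegas whose categories mention restaurants."""
    is_vegas = business["city"].str.strip().str.lower() == "las vegas"
    is_restaurant = (
        business["categories"].fillna("").str.contains("Restaurants", case=False)
    )
    return business[is_vegas & is_restaurant]

# Hypothetical stand-ins for the business and review files.
business = pd.DataFrame({
    "business_id": ["b1", "b2", "b3", "b4"],
    "city": ["Las Vegas", "Las Vegas", "Phoenix", "Las Vegas"],
    "categories": ["Restaurants, Chinese", "Hair Salons", "Restaurants, Pizza", None],
})
reviews = pd.DataFrame({
    "user_id": ["u1", "u2", "u1", "u3"],
    "business_id": ["b1", "b1", "b2", "b3"],
    "stars": [5, 4, 3, 2],
})

restaurants = vegas_restaurants(business)
# Keep only reviews that belong to the restaurants we retained.
restaurant_reviews = reviews[reviews["business_id"].isin(restaurants["business_id"])]

print(len(restaurant_reviews), "reviews for", len(restaurants), "restaurants")
print(restaurant_reviews["stars"].value_counts().sort_index())
```

The final `value_counts` call gives the counts behind a ratings-distribution plot like the one discussed here.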
Using review activity as a proxy for dining activity, it quickly becomes clear where the eating hotspots are in the city.
Unsurprisingly, a lot of activity is concentrated on the Las Vegas Strip, with smaller pockets of activity in several other places. The heatmap also reassures us that the data has good coverage throughout the city.
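For the curious, the numbers behind a heatmap like this can be approximated by binning restaurants on a latitude/longitude grid and summing review counts per cell. The coordinates and counts below are invented for illustration (roughly in the Las Vegas area), and the 4×4 grid is an arbitrary choice; this is a sketch of the idea rather than the exact method used for the plot.

```python
import numpy as np

# Hypothetical restaurants: latitude, longitude, and number of reviews.
lat = np.array([36.11, 36.12, 36.115, 36.20, 36.05])
lon = np.array([-115.17, -115.17, -115.175, -115.25, -115.05])
review_count = np.array([1200, 950, 800, 40, 25])

# Sum review counts in each cell of a 4x4 grid over the bounding box.
heat, lat_edges, lon_edges = np.histogram2d(
    lat, lon, bins=4, weights=review_count
)

# Locate the busiest cell and report its bounds.
i, j = np.unravel_index(np.argmax(heat), heat.shape)
print(f"Hottest cell: lat {lat_edges[i]:.3f} to {lat_edges[i + 1]:.3f}, "
      f"lon {lon_edges[j]:.3f} to {lon_edges[j + 1]:.3f}, "
      f"{int(heat[i, j])} reviews")
```

The `heat` array could then be handed to any plotting library to draw the grid, with the densest cells standing out in the same way the Strip does here.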