Building a Content-Based Book Recommendation Engine


Our dataset contains details of 3,592 books. It has six columns:

title -> Book Name

Rating -> Book rating given by the user

Genre -> Category (type of book). I have taken only three genres for this problem: business, non-fiction, and cooking

Author -> Book Author

Desc -> Book description

url -> Book cover image link
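
Before the analysis below, the dataset is assumed to be loaded into a pandas DataFrame named df. A minimal sketch (the file name book_data.csv is my assumption; the article does not name the file):

import pandas as pd
import matplotlib.pyplot as plt

# Load the Goodreads dataset (file name assumed; adjust to your copy)
df = pd.read_csv('book_data.csv')
print(df.shape)  # expected: (3592, 6)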

Exploratory Data Analysis

Genre distribution

# Genre distribution
df['genre'].value_counts().plot(kind='bar', figsize=(10, 5))

Printing a few book titles and their descriptions

# Printing a book title and its description
df['title'][2464]
df['Desc'][2464]
# Printing another book title and its description
df['title'][367]
df['Desc'][367]

Book description — Word count distribution

# Calculating the word count for each book description
df['word_count'] = df['Desc'].apply(lambda x: len(str(x).split()))
# Plotting the word count distribution
df['word_count'].plot(
    kind='hist',
    bins=50,
    figsize=(12, 8),
    title='Word Count Distribution for book descriptions')

Most of the book descriptions are quite short. It is clear that Goodreads provides only brief descriptions.
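To put a number on this, the summary statistics of the word counts can be inspected; a quick sketch building on the word_count column computed above:

# Summary statistics for description length
print(df['word_count'].describe())
print(df['word_count'].quantile([0.5, 0.9, 0.99]))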

The distribution of top part-of-speech tags in the book descriptions

from textblob import TextBlob

# Tagging parts of speech across all book descriptions
# (joining the column avoids tagging the truncated Series repr)
blob = TextBlob(' '.join(df['Desc'].astype(str)))
pos_df = pd.DataFrame(blob.tags, columns=['word', 'pos'])
pos_counts = pos_df.pos.value_counts()[:20]
pos_counts.plot(kind='bar', figsize=(10, 8), title="Top 20 part-of-speech tags in the book descriptions")

Bigram distribution for the book description

# Converting text descriptions into vectors using TF-IDF over bigrams
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(ngram_range=(2, 2), stop_words='english', lowercase=False)
tfidf_matrix = tf.fit_transform(df['Desc'])
total_words = tfidf_matrix.sum(axis=0)
# Finding the total TF-IDF weight of each bigram
freq = [(word, total_words[0, idx]) for word, idx in tf.vocabulary_.items()]
freq = sorted(freq, key=lambda x: x[1], reverse=True)
# Converting into a dataframe
bigram = pd.DataFrame(freq)
bigram.rename(columns={0: 'bigram', 1: 'count'}, inplace=True)
# Taking the first 20 records
bigram = bigram.head(20)
# Plotting the bigram distribution
bigram.plot(x='bigram', y='count', kind='bar', title="Bigram distribution for the top 20 bigrams in the book descriptions", figsize=(15, 7))

Trigram distribution for the book description

# Converting text descriptions into vectors using TF-IDF over trigrams
tf = TfidfVectorizer(ngram_range=(3, 3), stop_words='english', lowercase=False)
tfidf_matrix = tf.fit_transform(df['Desc'])
total_words = tfidf_matrix.sum(axis=0)
# Finding the total TF-IDF weight of each trigram
freq = [(word, total_words[0, idx]) for word, idx in tf.vocabulary_.items()]
freq = sorted(freq, key=lambda x: x[1], reverse=True)
# Converting into a dataframe
trigram = pd.DataFrame(freq)
trigram.rename(columns={0: 'trigram', 1: 'count'}, inplace=True)
# Taking the first 20 records
trigram = trigram.head(20)
# Plotting the trigram distribution
trigram.plot(x='trigram', y='count', kind='bar', title="Trigram distribution for the top 20 trigrams in the book descriptions", figsize=(15, 7))
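
Since the bigram and trigram steps differ only in the ngram_range, they could be folded into a single helper; a sketch under the same assumptions (the plot_top_ngrams name is my own, not from the original code):

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

def plot_top_ngrams(descriptions, n, top=20):
    # Vectorize with TF-IDF restricted to n-grams of a single size
    tf = TfidfVectorizer(ngram_range=(n, n), stop_words='english', lowercase=False)
    tfidf_matrix = tf.fit_transform(descriptions)
    total_words = tfidf_matrix.sum(axis=0)
    # Rank n-grams by their total TF-IDF weight across all descriptions
    freq = sorted(((word, total_words[0, idx]) for word, idx in tf.vocabulary_.items()),
                  key=lambda x: x[1], reverse=True)
    ngrams = pd.DataFrame(freq[:top], columns=['ngram', 'count'])
    ngrams.plot(x='ngram', y='count', kind='bar', figsize=(15, 7),
                title="Top {} {}-gram distribution in the book descriptions".format(top, n))

# Usage: plot_top_ngrams(df['Desc'], 2) and plot_top_ngrams(df['Desc'], 3)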