Content-based Movie Recommendation System


Updated: December 01, 2021

1. Netflix Movies: Recommendation Engine

1.1 Setting the Context

Netflix is one of the most popular video streaming platforms, with over 8,000 movies and TV shows available and, as of October 2021, more than 215M subscribers globally. Once you start using Netflix regularly, you will notice that it is usually spot on about what you'd like to watch. This is done with the help of a recommender system, which predicts a person's future preferences from a limited amount of data about them. One primary reason Netflix uses a recommender system is the sheer volume of content on the platform, much of which may be irrelevant to a given viewer because of its language or genre.

In this blog, we will build a straightforward content-based recommendation system on Netflix data. But before getting to that point, we need to preprocess the data and understand the variables. The workflow is as follows:

A 3-step missing value imputation process
Building a Content-based Recommender System

Maximum runtime of the notebook: 5-6 mins

1.2 Setup

# Data handling
import numpy as np
import pandas as pd
from collections import Counter
import time, math

# Parallel Tasking
from joblib import Parallel, delayed

# Web Crawling
from bs4 import BeautifulSoup
import requests

import matplotlib.pyplot as plt

# For the recommender system
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MultiLabelBinarizer 

1.3 Dataset : Kaggle

The tabular dataset used in this notebook lists all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, and duration. The data is downloaded from Kaggle.

netflix_data = pd.read_csv("Data/netflix_titles.csv")
netflix_data.head(2)

	show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
0	s1	Movie	Dick Johnson Is Dead	Kirsten Johnson	NaN	United States	September 25, 2021	2020	PG-13	90 min	Documentaries	As her father nears the end of his life, filmm...
1	s2	TV Show	Blood & Water	NaN	Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...	South Africa	September 24, 2021	2021	TV-MA	2 Seasons	International TV Shows, TV Dramas, TV Mysteries	After crossing paths at a party, a Cape Town t...

A glance at the first two rows shows that the data has some missing values. We will deal with them later; first, let's understand what the variables represent.

Variable	Description
show_id	Unique ID for every movie / TV show
type	Identifier - Movie or TV Show
title	Title of the movie / TV show
director	Director of the movie / show
cast	Actors involved in the movie / show
country	Country where the movie / show was produced
date_added	Date it was added on Netflix
release_year	Actual release year of the movie / show
rating	TV rating of the movie / show
duration	Total duration - in minutes or number of seasons
listed_in	Genre
description	The summary description

1.4 Data Exploration

1.4.1 Filter only movies data and remove duplicate rows

netflix_data.describe(include='all').head(4)

	show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
count	8807	8807	8807	6173	7982	7976	8797	8807.0	8803	8804	8807	8807
unique	8807	2	8807	4528	7692	748	1767	NaN	17	220	514	8775
top	s1	Movie	Dick Johnson Is Dead	Rajiv Chilaka	David Attenborough	United States	January 1, 2020	NaN	TV-MA	1 Season	Dramas, International Movies	Paranormal activity at a lush, abandoned prope...
freq	1	6131	1	19	19	2818	109	NaN	3207	1793	362	4

There are 8807 rows in the dataset. We will focus only on movies data (not TV shows) and build a recommendation system on it.

netflix_data.groupby('type').count()

	show_id	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
type
Movie	6131	6131	5943	5656	5691	6131	6131	6129	6128	6131	6131
TV Show	2676	2676	230	2326	2285	2666	2676	2674	2676	2676	2676

There are 6131 movies and 2676 TV shows in the dataset, so we keep only the movie rows.

movies_data = netflix_data.loc[netflix_data["type"]=="Movie",].copy()

movies_data["title"] = movies_data['title'].str.strip().str.lower()
temp = movies_data['title'].value_counts()
movies_data.loc[movies_data["title"].isin(list(temp.index[temp>1])),]

	show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
159	s160	Movie	love in a puff	Pang Ho-cheung	Miriam Chin Wah Yeung, Shawn Yue, Singh Hartih...	Hong Kong	September 1, 2021	2010	TV-MA	103 min	Comedies, Dramas, International Movies	When the Hong Kong government enacts a ban on ...
303	s304	Movie	esperando la carroza	Alejandro Doria	Luis Brandoni, China Zorrilla, Antonio Gasalla...	Argentina	August 5, 2021	1985	TV-MA	95 min	Comedies, Cult Movies, International Movies	Cora has three sons and a daughter and she´s a...
3371	s3372	Movie	consequences	Ozan Açıktan	Nehir Erdoğan, Tardu Flordun, İlker Kaleli, Se...	Turkey	October 25, 2019	2014	TV-MA	106 min	Dramas, International Movies, Thrillers	Secrets bubble to the surface after a sensual ...
6529	s6530	Movie	consequences	Ozan Açıktan	Nehir Erdoğan, Tardu Flordun, İlker Kaleli, Se...	Turkey	October 25, 2019	2014	TV-MA	106 min	Dramas, International Movies, Thrillers	Secrets bubble to the surface after a sensual ...
6705	s6706	Movie	esperando la carroza	Alejandro Doria	Luis Brandoni, China Zorrilla, Antonio Gasalla...	Argentina	July 15, 2018	1985	NR	95 min	Comedies, Cult Movies, International Movies	Cora has three sons and a daughter and she´s a...
7345	s7346	Movie	love in a puff	Pang Ho-cheung	Miriam Chin Wah Yeung, Shawn Yue, Singh Hartih...	Hong Kong	August 1, 2018	2010	TV-MA	103 min	Comedies, Dramas, International Movies	When the Hong Kong government enacts a ban on ...

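These rows are clearly duplicates: within each pair, the title, director, cast, and release year match, and only date_added or rating differ in some pairs. So, we keep one row for each movie and drop the other.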

movies_data = movies_data.drop([6529,6705,7345])
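Dropping rows by hard-coded index works for this snapshot of the file, but it would silently drop the wrong rows if the CSV were updated. A more general option, sketched below on a toy frame with hypothetical rows, is pandas' drop_duplicates on the normalised title. Keeping the first occurrence is an assumption about which copy to retain, and treating a shared title as a duplicate is only safe after inspecting the matches as we did above.

```python
import pandas as pd

# Toy stand-in for movies_data; the rows are hypothetical
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "title": ["Consequences", "love in a puff", " consequences ", "pk"],
})

# Normalise titles the same way as above, then keep the first row per title
toy["title"] = toy["title"].str.strip().str.lower()
deduped = toy.drop_duplicates(subset="title", keep="first")
print(deduped["show_id"].tolist())
```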

1.4.2 Missing Value Handling/ Imputation

Instead of directly removing rows with missing values, we try to impute as much data as possible with high accuracy. This process involves three steps:

Stage 1: Drop rows with missing values in columns that have very few of them
Stage 2: Web crawling based imputation to achieve high accuracy
Stage 3: Replace the remaining NaN values with an empty string to preserve information in other columns

1.4.2.1 Handling Missing Values - Stage 1 (Drop Rows)

print("Rows with missing values in the data: "+
      str(round(100*sum(movies_data.isnull().any(axis=1))/movies_data.shape[0],2))+"%")
movies_data.isna().sum()

Rows with missing values in the data: 15.44%





show_id           0
type              0
title             0
director        188
cast            475
country         440
date_added        0
release_year      0
rating            2
duration          3
listed_in         0
description       0
dtype: int64

The director, cast, and country columns have several missing values, while the rating and duration columns have very few. Let’s remove the rows with missing values in the rating and duration columns.

movies_data.dropna(subset=["rating","duration"], how='any', inplace=True)
print("Rows with missing values in the data: "+str(round(100*sum(movies_data.isnull().any(axis=1))/movies_data.shape[0],2))+"%")
movies_data.isna().sum()

Rows with missing values in the data: 15.37%





show_id           0
type              0
title             0
director        187
cast            475
country         439
date_added        0
release_year      0
rating            0
duration          0
listed_in         0
description       0
dtype: int64

nan_rows_df = movies_data[movies_data.isnull().any(axis=1)]
nan_rows_df.head(2)

	show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
0	s1	Movie	dick johnson is dead	Kirsten Johnson	NaN	United States	September 25, 2021	2020	PG-13	90 min	Documentaries	As her father nears the end of his life, filmm...
6	s7	Movie	my little pony: a new generation	Robert Cullen, José Luis Ucha	Vanessa Hudgens, Kimiko Glenn, James Marsden, ...	NaN	September 24, 2021	2021	PG	91 min	Children & Family Movies	Equestria's divided. But a bright-eyed hero be...

More than 15% of rows still have missing values. So instead of removing those rows, we will try to impute the data with high accuracy.

1.4.2.2 Handling Missing Values - Stage 2 (Web Crawling)

The concept behind this is simple. For example, let’s look at the first row in the above table, where the movie title is “Dick Johnson Is Dead” and the cast is NaN. First of all, why is this value missing? There could be two potential reasons.

Netflix might not record this information on its platform, so the value is missing.
This data file is maintained on Kaggle and updated by a single person, so there might be some manual errors involved while copying the data into a CSV file.

However, we can’t attribute each row with a missing value to a specific reason. In either case, we will look up the movie on IMDb and get the director, cast, and country of origin data.

You can also look at this sample URL and see how we can extract the director, cast, and country of origin variables from it.

# Functions to extract the director, cast, country of origin data from a page source

def get_director(soup):
    """
    Extract the director information from the HTML source data
    Args:
        soup: object (page source) obtained from scraping the website using BeautifulSoup() function
    Returns:
        director: returns a string containing directors of a movie separated by a comma
    """ 
    director = ""
    try:
        # attrs must be a dict mapping the attribute name to its value
        temp = soup.find("section",{"data-testid":"title-cast"}).find_all("li",{"class":"ipc-metadata-list__item"})
        if len(temp)==4: # if the expected section on the page is found
            director_soups = temp[0].find_all("a")
            director = ", ".join(director_soup.get_text().strip() for director_soup in director_soups)
        return director
    except Exception:
        # e.g. the section is absent and find() returned None
        return director

def get_cast(soup):
    """
    Extract the cast information from the HTML source data
    Args:
        soup: object (page source) obtained from scraping the website using BeautifulSoup() function
    Returns:
        cast: returns a string containing all the cast members of a movie separated by a comma
    """ 
    cast = ""
    try:
        cast_soups = soup.find("section",{"data-testid":"title-cast"}).find_all("a",{"data-testid":"title-cast-item__actor"})
        cast = ", ".join(cast_soup.get_text().strip() for cast_soup in cast_soups)
        return cast
    except Exception:
        # e.g. the cast section is absent and find() returned None
        return cast
    
def get_country(soup):
    """
    Extract the country information from the HTML source data
    Args:
        soup: object (page source) obtained from scraping the website using BeautifulSoup() function
    Returns:
        country: returns a string containing all the countries of origin of a movie separated by a comma
    """ 
    country = ""
    try:
        countries_soups = soup.find("div",{"data-testid":"title-details-section"}).find("li",{"data-testid":"title-details-origin"}).find_all("a")
        country = ", ".join(countries_soup.get_text().strip() for countries_soup in countries_soups)
        return country
    except Exception:
        # e.g. the details section is absent and find() returned None
        return country
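To see what these selectors do without hitting the network, here is a minimal sketch that runs the same lookups against an inline HTML fragment. The fragment is hypothetical: it only mimics the data-testid attributes the functions above search for and is not a copy of a real IMDb page, whose markup may change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment that mimics the attributes the functions look for
html = """
<section data-testid="title-cast">
  <a data-testid="title-cast-item__actor"> Actor One </a>
  <a data-testid="title-cast-item__actor">Actor Two</a>
</section>
<div data-testid="title-details-section">
  <ul><li data-testid="title-details-origin"><a>India</a><a>United States</a></li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def join_text(tags):
    """Join the stripped text of each tag with ', '."""
    return ", ".join(tag.get_text().strip() for tag in tags)

# Same attribute-based lookups as in get_cast and get_country
cast_tags = soup.find("section", {"data-testid": "title-cast"}).find_all(
    "a", {"data-testid": "title-cast-item__actor"})
origin = soup.find("div", {"data-testid": "title-details-section"}).find(
    "li", {"data-testid": "title-details-origin"})

cast = join_text(cast_tags)
country = join_text(origin.find_all("a"))
print(cast)
print(country)
```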

%%time

def imdb_requests(row):
    """
    Main Function that extracts the director, cast, and countries of origin information for each row with NaN value
    Args:
        row: dataframe row containing at least one NaN value
    Returns:
        main_dict: returns a dictionary containing title, show_id, director, cast, country as keys and their
        corresponding values as dictionary values
    """ 
    main_dict = {}
    main_dict["title"] = row["title"]
    main_dict["show_id"] = row["show_id"]
    
    try:
        # pass the title via params so that spaces and symbols such as "&" are URL-encoded
        source = requests.get("https://www.imdb.com/find", params={"ref_":"nv_sr_fn","q":str(main_dict["title"])})
        source.raise_for_status()
        soup = BeautifulSoup(source.text,'html.parser')

        #take the first URL on the results page and extract information from it
        title = soup.find("td",{"class":"result_text"}).find('a').get("href")
        
        new_url = "https://www.imdb.com"+title
        source = requests.get(new_url)
        source.raise_for_status()

        soup = BeautifulSoup(source.text,'html.parser')
        main_dict["director"] = get_director(soup)
        main_dict["cast"] = get_cast(soup)
        main_dict["country"] = get_country(soup)
    except Exception:
        # request failed or no search result found: leave director, cast, country unset
        pass
    
    return main_dict

"""
The four commented lines below issue the URL requests in parallel, convert the resulting list of
dictionaries to a data frame, and replace empty strings with NaN values (to impute them later).
This code takes close to 5 minutes to run. I ran it already and saved the intermediate file,
which I will use in the subsequent sections.
"""
# nan_rows_search_results = Parallel(n_jobs=-1)(delayed(imdb_requests)(row) for index, row in nan_rows_df.iterrows())
# nan_rows_search_results_df = pd.DataFrame(nan_rows_search_results)
# nan_rows_search_results_df = nan_rows_search_results_df.replace('',np.nan,regex=True)
# nan_rows_search_results_df.to_csv("IMDB_intermediate_data.csv",index=False)
nan_rows_search_results_df = pd.read_csv("IMDB_intermediate_data.csv")

CPU times: user 4.45 ms, sys: 1.67 ms, total: 6.12 ms
Wall time: 5.25 ms

Single Core vs Multi Core Computations:

Running the requests in parallel reduces the run time to about 15% of the single-core run time.
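The effect is easy to reproduce on a small scale. The sketch below is not the benchmark behind the figure above: fake_request is a hypothetical stand-in for imdb_requests that just sleeps to mimic network latency, and it uses joblib's thread-based mode (prefer="threads") rather than the default process backend used in the notebook, since waiting on HTTP responses is I/O-bound.

```python
import time
from joblib import Parallel, delayed

def fake_request(i):
    """Hypothetical stand-in for imdb_requests: sleeps to mimic network latency."""
    time.sleep(0.05)
    return {"show_id": f"s{i}"}

# Serial baseline
start = time.perf_counter()
serial = [fake_request(i) for i in range(8)]
serial_secs = time.perf_counter() - start

# Same calls spread over 4 workers; results come back in input order
start = time.perf_counter()
parallel = Parallel(n_jobs=4, prefer="threads")(delayed(fake_request)(i) for i in range(8))
parallel_secs = time.perf_counter() - start

print(f"serial: {serial_secs:.2f}s, parallel: {parallel_secs:.2f}s")
```

The actual speed-up depends on the number of workers, the backend, and how quickly the remote site responds.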

movies_data.cast = np.where(movies_data.cast.isnull(),movies_data.show_id.map(nan_rows_search_results_df.set_index('show_id').cast),movies_data.cast)
movies_data.country = np.where(movies_data.country.isnull(),movies_data.show_id.map(nan_rows_search_results_df.set_index('show_id').country),movies_data.country)
movies_data.director = np.where(movies_data.director.isnull(),movies_data.show_id.map(nan_rows_search_results_df.set_index('show_id').director),movies_data.director)

print("Rows with missing values in the data: "+str(round(100*sum(movies_data.isnull().any(axis=1))/movies_data.shape[0],2))+"%")
movies_data.isna().sum()

Rows with missing values in the data: 5.42%





show_id           0
type              0
title             0
director        109
cast            104
country         183
date_added        0
release_year      0
rating            0
duration          0
listed_in         0
description       0
dtype: int64

Only about 5.4% of rows now have missing values. Unfortunately, we could not find the rest of the values on IMDb, so we replace them with an empty string.
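The np.where / map pattern used for the imputation above can also be written with fillna, which only touches the missing entries. The sketch below shows the idea on toy frames; all values are hypothetical, and scraped plays the role of nan_rows_search_results_df.

```python
import numpy as np
import pandas as pd

# Toy stand-ins; all values are hypothetical
movies = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "cast": [np.nan, "Actor A", np.nan],
})
scraped = pd.DataFrame({
    "show_id": ["s1", "s3"],
    "cast": ["Actor B, Actor C", np.nan],  # nothing was found for s3
})

lookup = scraped.set_index("show_id")
# Map each show_id to its scraped value and use it only where cast is missing
movies["cast"] = movies["cast"].fillna(movies["show_id"].map(lookup["cast"]))
print(movies)
```

Rows that the lookup cannot fill (s3 here) stay NaN, which is what the next stage handles.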

1.4.2.3 Handling Missing Values - Stage 3 (Replace with Empty String)

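# only director, cast, and country still contain NaN values at this point
nan_cols = ["director","cast","country"]
movies_data[nan_cols] = movies_data[nan_cols].fillna('')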
movies_data.reset_index(drop=True,inplace=True)
print("Rows with missing values in the data: "+str(round(100*sum(movies_data.isnull().any(axis=1))/movies_data.shape[0],2))+"%")
movies_data.isna().sum()

Rows with missing values in the data: 0.0%





show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

1.4.3 Parsing the `date_added`, `duration` columns

We will extract the year, month, and day of the week data from the date_added column and analyze them separately to generate more insights later. Also, we parse the duration column into a numeric column.

# Year added column
movies_data['year_added'] = movies_data['date_added'].apply(lambda x: x.split(" ")[-1])
movies_data['year_added'] = movies_data["year_added"].astype("int")
# Month added column
movies_data['month_added'] = movies_data['date_added'].apply(lambda x: x.split(" ")[0])
movies_data['date_added'] = pd.to_datetime(movies_data['date_added'])
movies_data['day_of_week'] = movies_data['date_added'].dt.day_name()
movies_data[['month_added','year_added','day_of_week']].head()

	month_added	year_added	day_of_week
0	September	2021	Saturday
1	September	2021	Friday
2	September	2021	Friday
3	September	2021	Friday
4	September	2021	Thursday

movies_data['duration']=movies_data['duration'].str.replace(' min','')
movies_data['duration']=movies_data['duration'].astype(str).astype(int)
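As an alternative to splitting the strings by hand, the same columns can be derived from a parsed datetime column. The sketch below is one possible variant, not the notebook's method; it uses the dates and durations of the two movie rows shown earlier, and extracts the leading digits from duration instead of relying on the exact " min" suffix.

```python
import pandas as pd

toy = pd.DataFrame({
    "date_added": ["September 25, 2021", "September 24, 2021"],
    "duration": ["90 min", "91 min"],
})

# Parse once, then read the parts from the .dt accessor
dates = pd.to_datetime(toy["date_added"].str.strip())
toy["year_added"] = dates.dt.year
toy["month_added"] = dates.dt.month_name()
toy["day_of_week"] = dates.dt.day_name()

# Pull out the leading digits and convert them to integers
toy["duration"] = toy["duration"].str.extract(r"(\d+)", expand=False).astype(int)
print(toy)
```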

1.5 Content-based recommendation engine on multiple metrics

Now that we have a fair understanding of the variables, we will build the recommendation engine using a few of them. There are two main types of recommendation engines: content-based filtering and collaborative filtering. We will build the former in this notebook.

Content-based filtering works on the principle that if you like a particular item, you will also like items with similar attributes. For example, to recommend movies, the algorithm compares attributes such as title, genre, director, and cast using a measure such as cosine similarity or Euclidean distance. One major downside of this approach is that recommendations are limited to movies similar to what the person has already watched. However, we will not address this in this notebook.

features = ['title','director','cast','listed_in']

def clean_data(df,features):
    df_subset = df[features].copy()
    df_subset['main_column'] = ""
    for feature in features:
        if feature!="description":
            df_subset[feature] = df_subset[feature].apply(lambda x: str.lower(x.replace(" ", "")))
        df_subset["main_column"] = df_subset["main_column"] + ' ' + df_subset[feature]
    return df_subset

We need to remove the spaces from the data before combining the features into a new column. This is required because, for example, there are 84 directors with Michael as part of their name, but no two of them share a full name. So it doesn’t make sense to recommend a director’s movies only because part of their name matches another director’s name. The same logic applies to the other columns.
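A small experiment makes the effect concrete. In the sketch below, the two names are illustrative inputs chosen for this example rather than rows taken from the dataset. With spaces kept, the shared token "michael" gives the two strings a non-zero similarity; with spaces removed, each full name becomes a single token and the similarity drops to zero.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative names, not taken from the dataset
raw = ["michael bay", "michael mann"]
joined = [name.replace(" ", "") for name in raw]

# Similarity between the two strings with and without spaces
sim_raw = cosine_similarity(TfidfVectorizer().fit_transform(raw))[0, 1]
sim_joined = cosine_similarity(TfidfVectorizer().fit_transform(joined))[0, 1]

print(f"with spaces: {sim_raw:.2f}, without spaces: {sim_joined:.2f}")
```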

movies_data_subset = clean_data(movies_data,features)
movies_data_subset.head(2)

	title	director	cast	listed_in	main_column
0	dickjohnsonisdead	kirstenjohnson	michaelhilow,anahoffman,dickjohnson,kirstenjoh...	documentaries	dickjohnsonisdead kirstenjohnson michaelhilow...
1	mylittlepony:anewgeneration	robertcullen,joséluisucha	vanessahudgens,kimikoglenn,jamesmarsden,sofiac...	children&familymovies	mylittlepony:anewgeneration robertcullen,josé...

We convert the combined column main_column, created in the previous step, into a TF-IDF (term frequency–inverse document frequency) matrix. You can also read about TF-IDF here. We then use cosine similarity to compute a score between each pair of movies.

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies_data_subset['main_column'])
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
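One property worth knowing: TfidfVectorizer L2-normalises each row by default, so the cosine similarity between two rows is simply their dot product. The sketch below checks this on three simplified strings written in the style of main_column (the genre tokens are shortened for illustration); linear_kernel is scikit-learn's plain dot-product kernel and could be used in place of cosine_similarity here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Simplified combined strings in the style of main_column
docs = [
    "pk rajkumarhirani aamirkhan comedies",
    "3idiots rajkumarhirani aamirkhan comedies",
    "drive tarunmansukhani action",
]
tfidf_toy = TfidfVectorizer(stop_words="english").fit_transform(docs)

# Each row has unit length because of the default L2 normalisation
row_norms = np.sqrt(np.asarray(tfidf_toy.multiply(tfidf_toy).sum(axis=1)).ravel())

cos = cosine_similarity(tfidf_toy)
dot = linear_kernel(tfidf_toy)
print(np.round(cos, 2))
```

The first two strings share three tokens and score well above zero, while the third shares none with the first and scores exactly zero.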

1.5.1 Which movies are the most similar to each other?

movie_titles_df = pd.DataFrame(movies_data['title']).reset_index()
movie_titles_df.columns = ["row_id","Title"]

cosine_sim_df = pd.DataFrame(cosine_sim).reset_index()
cosine_sim_df_melted = pd.melt(cosine_sim_df, id_vars=['index'], value_vars=list(cosine_sim_df.columns[1:]))
cosine_sim_df_melted.columns = ["row_id1","row_id2","similarity"]
cosine_sim_df_melted = cosine_sim_df_melted.sort_values("similarity",ascending=False)
cosine_sim_df_melted = cosine_sim_df_melted.loc[cosine_sim_df_melted["row_id1"]<cosine_sim_df_melted["row_id2"],].reset_index(drop=True)
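Melting the full matrix creates one row per ordered pair, including each movie paired with itself, before half of them are filtered out again. A lighter alternative, sketched below on a hypothetical 4x4 matrix standing in for cosine_sim, is to take only the strict upper triangle with NumPy so that each unordered pair appears once from the start.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x4 similarity matrix standing in for cosine_sim
sim = np.array([
    [1.00, 0.96, 0.10, 0.05],
    [0.96, 1.00, 0.20, 0.93],
    [0.10, 0.20, 1.00, 0.30],
    [0.05, 0.93, 0.30, 1.00],
])

# Indices of the strict upper triangle (k=1 skips the diagonal)
rows, cols = np.triu_indices(sim.shape[0], k=1)
pairs = pd.DataFrame({"row_id1": rows, "row_id2": cols, "similarity": sim[rows, cols]})
pairs = pairs.sort_values("similarity", ascending=False).reset_index(drop=True)

top_pairs = pairs[pairs["similarity"] > 0.9]
print(top_pairs)
```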

Filter movies with very high similarity

thres = 0.9
filtered_df = cosine_sim_df_melted.loc[cosine_sim_df_melted["similarity"]>thres,].copy()
filtered_df = filtered_df.merge(movie_titles_df,left_on="row_id1",right_on="row_id")
filtered_df = filtered_df.merge(movie_titles_df,left_on="row_id2",right_on="row_id")
filtered_df = filtered_df[["Title_x","Title_y","similarity"]].copy()
filtered_df.columns = ["Movie1","Movie2","Similarity"]
filtered_df["Similarity"] = round(filtered_df["Similarity"],2)
filtered_df

	Movie1	Movie2	Similarity
0	oh! baby (tamil)	oh! baby	0.96
1	oh! baby (malayalam)	oh! baby	0.96
2	oh! baby (malayalam)	oh! baby (tamil)	0.93
3	solo: a star wars story	solo: a star wars story (spanish version)	0.96
4	rogue warfare: death of a nation	rogue warfare	0.96
5	rogue warfare: the hunt	rogue warfare	0.96
6	rogue warfare: death of a nation	rogue warfare: the hunt	0.92
7	boomika	boomika (hindi)	0.95
8	boomika	boomika (telugu)	0.95
9	boomika	boomika (malayalam)	0.94
10	petta (telugu version)	petta	0.94
11	bo burnham: what.	bo burnham: make happy	0.93
12	godzilla the planet eater	godzilla city on the edge of battle	0.93
13	osuofia in london	osuofia in london ii	0.92
14	tughlaq durbar	tughlaq durbar (telugu)	0.92
15	naruto shippuden the movie: blood prison	naruto shippuden : blood prison	0.92
16	sarvam thaala mayam (telugu version)	sarvam thaala mayam (tamil version)	0.92
17	chris d'elia: man on fire	chris d'elia: incorrigible	0.92
18	octonauts & the ring of fire	octonauts & the great barrier reef	0.91
19	the twilight saga: breaking dawn: part 1	the twilight saga: breaking dawn: part 2	0.91
20	baahubali 2: the conclusion (hindi version)	baahubali 2: the conclusion (tamil version)	0.91
21	baahubali 2: the conclusion (malayalam version)	baahubali 2: the conclusion (tamil version)	0.90
22	baahubali 2: the conclusion (hindi version)	baahubali 2: the conclusion (malayalam version)	0.90
23	baahubali: the beginning (hindi version)	baahubali: the beginning (tamil version)	0.91
24	baahubali: the beginning (malayalam version)	baahubali: the beginning (tamil version)	0.90
25	baahubali: the beginning (hindi version)	baahubali: the beginning (malayalam version)	0.90
26	the magic school bus rides again the frizz con...	the magic school bus rides again kids in space	0.91
27	game over (hindi version)	game over (tamil version)	0.90
28	game over (hindi version)	game over (telugu version)	0.90
29	game over (tamil version)	game over (telugu version)	0.90

The highest scores belong to versions of the same movie in different languages, followed by sequels and titles from the same series. If we do not want the language versions as part of our recommendations, we can remove them as duplicates in the preprocessing step. For now, we will keep them as part of our model.

movies_data_subset=movies_data_subset.reset_index()
indices = pd.Series(movies_data_subset.index, index=movies_data_subset['title'])

1.5.2 Let’s get some recommendations for a movie

def get_recommendations_new(title, cosine_sim, n):
    """
    Find the similar movies to a given movie
    Args:
        title: movie title to which we find recommendations
        cosine_sim: cosine similarity matrix for finding similar movies
        n: number of movies to recommend
    Returns:
        results_df: returns a dataframe containing the list of recommended movies with rowids
        and their similarity score
    """ 
    title = title.replace(' ','').lower()
    idx = indices[title]

    #pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    #sort the movies based on cosine similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the top n most similar movies
    sim_scores = sim_scores[1:(n+1)]
    # Get their movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top n most similar movies
    results_df = pd.DataFrame(movies_data['title'].iloc[movie_indices])
    results_df["score"] = np.round(np.array(sim_scores)[:,1],2)
    results_df = results_df.reset_index(drop=False)
    results_df.columns = ["RowID","Recommended Movie","Similarity Score"]
    return results_df

movie_title = "pk"
recommendations_df = get_recommendations_new(movie_title,cosine_sim,5)

temp_df = movies_data.loc[movies_data.title.isin([movie_title]+list(recommendations_df["Recommended Movie"]))]
temp_df = temp_df[features].reset_index(drop=True)
temp_df = temp_df.merge(recommendations_df,left_on="title",right_on = "Recommended Movie",how="outer")
temp_df = temp_df.sort_values("Similarity Score",ascending=False)
temp_df = temp_df[["title","director","cast","listed_in","Similarity Score"]]
temp_df["new"] = range(1,len(temp_df)+1)
temp_df.loc[temp_df.title==movie_title,'new'] = 0
temp_df = temp_df.sort_values("new").drop('new', axis=1)
temp_df

	title	director	cast	listed_in	Similarity Score
5	pk	Rajkumar Hirani	Aamir Khan, Anuskha Sharma, Sanjay Dutt, Saura...	Comedies, Dramas, International Movies	NaN
2	3 idiots	Rajkumar Hirani	Aamir Khan, Kareena Kapoor, Madhavan, Sharman ...	Comedies, Dramas, International Movies	0.27
4	sanju	Rajkumar Hirani	Ranbir Kapoor, Vicky Kaushal, Paresh Rawal, So...	Dramas, International Movies	0.18
3	drive	Tarun Mansukhani	Jacqueline Fernandez, Sushant Singh Rajput, Bo...	Action & Adventure, International Movies	0.17
1	taare zameen par	Aamir Khan	Aamir Khan, Darsheel Safary, Tanay Chheda, Tis...	Dramas, International Movies	0.15
0	madness in the desert	Satyajit Bhatkal	Aamir Khan, Ashutosh Gowariker	Documentaries, International Movies	0.12

The first row is the query movie itself, which is why its similarity score is NaN. The recommendations look pretty good for a starting point.
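One detail to keep in mind: taking sim_scores[1:(n+1)] assumes that the query movie sorts first in its own row. That can fail when another row ties with it at a similarity of 1.0, for example an exact duplicate that appears earlier in the data. The sketch below shows a variant that excludes the query by index instead; top_n_similar and the toy matrix are hypothetical and only illustrate the idea.

```python
import numpy as np

def top_n_similar(sim, idx, n):
    """Return (index, score) pairs for the n rows most similar to row idx, excluding idx itself."""
    order = np.argsort(-sim[idx], kind="stable")  # descending by score
    order = order[order != idx][:n]               # drop the query row by index
    return [(int(i), float(sim[idx, i])) for i in order]

# Toy matrix in which rows 0 and 3 are exact duplicates of each other
sim = np.array([
    [1.00, 0.27, 0.18, 1.00],
    [0.27, 1.00, 0.10, 0.27],
    [0.18, 0.10, 1.00, 0.18],
    [1.00, 0.27, 0.18, 1.00],
])
print(top_n_similar(sim, 3, 2))
```

For row 3, position-based slicing would skip row 0 and return row 3 itself, whereas filtering by index keeps row 0 as the top match.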

1.6 Summary and Scope for Improvement

1.6.1 Summary

We started with data preprocessing: removing duplicate entries, imputing missing values in stages, and extracting features. We used web crawling to impute missing data with high accuracy. Finally, we converted the preprocessed text into a TF-IDF matrix and scored each pair of movies with cosine similarity to create the final recommendation system.

1.6.2 Scope for Improvement

Below are a few ways to improve the workflow of this notebook:

We did not analyze the description column, which contains a movie summary, but it can also be added to the existing system to generate more accurate recommendations.
Word clouds can also be plotted when analyzing the description column.


Vinay Sammangi
