{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Building a Simple Recommendation System in FSharp\n", "\n", "![image.png](./pics/Title.png)\n", "\n", "## Introduction \n", "\n", "This year, I decided to work on a simple recommender for the TMDB (The Movie Database) data that can be found [here](https://www.kaggle.com/tmdb/tmdb-movie-metadata). I'd like to keep this article as lucid as possible, with the fewest dependencies and the least ceremony, to highlight the ease of domain modelling, tooling and data manipulation in F#. Recommendations for a given movie are based on its metadata: genres, keywords, overview, production company and popularity score. \n", "\n", "This notebook is written using .NET Jupyter Notebooks; instructions to get started can be found [here](https://github.com/dotnet/try/blob/master/NotebooksLocalExperience.md). The impetus behind writing this post was to help me and others better understand the internals of a very simple recommendation system, and also to showcase the growing strength of the .NET ecosystem for conducting experiments in a Jupyter Notebook.\n", "\n", "## The Plan\n", "\n", "1. We will load the CSV file, extract and clean the pertinent fields and create domain objects from the data.\n", "2. We will create a vector that gives us the frequencies of individual words in a word soup.\n", "3. We will create functions that compute a `distance` between the word frequency vectors and popularity of two different movies.\n", "4. 
We will give recommendations based on the movie entered by the user.\n", "\n", "## Data Processing\n", "\n", "The first step is to install the required NuGet packages, open the relevant namespaces and set the path to the CSV file containing the data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "Installing package MathNet.Numerics.done!" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Installing package MathNet.Numerics.FSharp.done!" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "Installing package FSharp.Data.done!" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "=============== S T A R T ==========================================\n", ">>>> /Users/mukundraghavsharma/.nuget/packages/fsharp.data/3.3.2/typeproviders/fsharp41/netstandard2.0/FSharp.Data.DesignTime.dll\n", ">>>> /Users/mukundraghavsharma/.nuget/packages/fsharp.data/3.3.2/lib/netstandard2.0/FSharp.Data.DesignTime.dll\n", "Using: /Users/mukundraghavsharma/.nuget/packages/fsharp.data/3.3.2/typeproviders/fsharp41/netstandard2.0/FSharp.Data.DesignTime.dll\n" ] } ], "source": [ "#r \"nuget:MathNet.Numerics\"\n", "#r \"nuget:MathNet.Numerics.FSharp\"\n", "#r \"nuget:FSharp.Data\"\n", "\n", "// Uncomment for the IFSharp Kernel\n", "// #load \"Paket.fsx\"\n", "// Paket.Package [ \"FSharp.Data\"; ]\n", "// #load \"Paket.Generated.Refs.fsx\"\n", "\n", "open System\n", "open System.Text\n", "open FSharp.Data\n", "open MathNet.Numerics\n", "open System.Collections.Generic\n", "open Microsoft.FSharp.Collections\n", "\n", "[<Literal>]\n", "// Data obtained from: https://www.kaggle.com/tmdb/tmdb-movie-metadata\n", "let DataPath = \"/Users/mukundraghavsharma/Desktop/F#/FSharp-Advent-2019/data/tmdb_5000_movies.csv\"" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "The next step is to conduct a perfunctory data exploration to view the \"shape\" of the columns we decide to make use of. We'll be making use of the Csv Data Provider from ``FSharp.Data``. Alternatively, I could have used [`Deedle`](https://bluemountaincapital.github.io/Deedle/) for this task.\n", "\n", "![image](./pics/FSharpData.png)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Some\n", " [|\"budget\"; \"genres\"; \"homepage\"; \"id\"; \"keywords\"; \"original_language\";\n", " \"original_title\"; \"overview\"; \"popularity\"; \"production_companies\";\n", " \"production_countries\"; \"release_date\"; \"revenue\"; \"runtime\";\n", " \"spoken_languages\"; \"status\"; \"tagline\"; \"title\"; \"vote_average\";\n", " \"vote_count\"|]\n" ] }, { "data": { "text/html": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let data = CsvFile.Load(DataPath).Cache()\n", "printfn \"%A\" data.Headers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the headers, we will consider engineering 2 features:\n", "1. __Word Soup__: Consisting of sanitized tokens (words) based on the genres, keywords, production companies and overview.\n", " - Similar movies would have similar genres, keywords, production companies and words in the overview and therefore could lead to good recommendations.\n", "2. __Popularity__: A popularity score from TMDB.\n", "\n", "To get a sense of the shape of the data contained in the columns, let's write a helper function that gets us the first item in the column." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "let getFirstItemInColumn (colName : string) = \n", " seq { for row in data.Rows -> row.GetColumn colName }\n", " |> Seq.head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Genres" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{\"id\": 28, \"name\": \"Action\"}, {\"id\": 12, \"name\": \"Adventure\"}, {\"id\": 14, \"name\": \"Fantasy\"}, {\"id\": 878, \"name\": \"Science Fiction\"}]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "getFirstItemInColumn \"genres\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Keywords" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{\"id\": 1463, \"name\": \"culture clash\"}, {\"id\": 2964, \"name\": \"future\"}, {\"id\": 3386, \"name\": \"space war\"}, {\"id\": 3388, \"name\": \"space colony\"}, {\"id\": 3679, \"name\": \"society\"}, {\"id\": 3801, \"name\": \"space travel\"}, {\"id\": 9685, \"name\": \"futuristic\"}, {\"id\": 9840, \"name\": \"romance\"}, {\"id\": 9882, \"name\": \"space\"}, {\"id\": 9951, \"name\": \"alien\"}, {\"id\": 10148, \"name\": \"tribe\"}, {\"id\": 10158, \"name\": \"alien planet\"}, {\"id\": 10987, \"name\": \"cgi\"}, {\"id\": 11399, \"name\": \"marine\"}, {\"id\": 13065, \"name\": \"soldier\"}, {\"id\": 14643, \"name\": \"battle\"}, {\"id\": 14720, \"name\": \"love affair\"}, {\"id\": 165431, \"name\": \"anti war\"}, {\"id\": 193554, \"name\": \"power relations\"}, {\"id\": 206690, \"name\": \"mind and soul\"}, {\"id\": 209714, \"name\": \"3d\"}]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "getFirstItemInColumn \"keywords\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Production Company" ] }, { "cell_type": "code", "execution_count": 6, 
"metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{\"name\": \"Ingenious Film Partners\", \"id\": 289}, {\"name\": \"Twentieth Century Fox Film Corporation\", \"id\": 306}, {\"name\": \"Dune Entertainment\", \"id\": 444}, {\"name\": \"Lightstorm Entertainment\", \"id\": 574}]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "getFirstItemInColumn \"production_companies\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems like we'll need to parse the JSON and grab the `name` field off the payload for all the aforementioned cases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Overview" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization." ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "getFirstItemInColumn \"overview\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Popularity" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "150.437577" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "getFirstItemInColumn \"popularity\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the samples of the columns considered, it seems like we'll need to sanitize the data extensively. Before we do so, let's define the domain objects we will be dealing with.\n", "\n", "## Domain\n", "\n", "The pertinent columns for a movie we decided on were:\n", "1. Genres\n", "2. Keywords\n", "3. Overview\n", "4. Production Company\n", "5. Popularity\n", "\n", "And therefore, we'd like to model the domain accordingly. 
We introduce a word soup and a soup list: the soup is a single string concatenating all the sanitized tokens of fields 1 through 4, and the soup list holds the same tokens as a list. We also extract the movie's popularity, along with the genre and production company lists that we'll later use for filtering." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "type MovieData = { title : string; \n", "                   soup : string;\n", "                   soupList : string list;\n", "                   genreList : string list;\n", "                   prodCompanyList : string list;\n", "                   popularity : double; }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tokenization and Token Sanitization\n", "\n", "Now that we are set on the domain, we need to write some auxiliary functions to sanitize the tokenized data appropriately. As a part of the token sanitization process, we will be conducting the following steps:\n", "\n", "1. __Lower casing all the tokenized words__: We do this to make the tokens case-agnostic so that our algorithms don't distinguish between 'Adventure' and 'adventure'.\n", "2. __Removing Common Stop Words__: Common stop words such as hers, again and there don't add any domain specific information about the movie, so we remove them as they would just add noise to our algorithm.\n", "3. __Removing Punctuation__: Punctuation carries no meaning in the context of tokens, yet it can affect the algorithm. For example, without removing punctuation, we will be treating 'thief' and 'thief.' as two different tokens. Additionally, we will get rid of any non-ASCII characters here.\n", "4. __Removing any empty string__: Empty strings will just add more noise. It's good to get rid of them early on." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "let tokenizeAndClean (words : string) =\n", "    // Tokenize\n", "    let split = words.Split(' ')\n", "\n", "    // Lowercase\n", "    let lowered = \n", "        split\n", "        |> Array.map(fun s -> s.ToLower())\n", "\n", "    // Remove Common Stop Words that don't add meaning\n", "    // Note: this filter runs before punctuation is stripped, so a token such as \"the;\" is not removed here\n", "    let commonStopWords = \n", "        Set.ofList [\"ourselves\"; \"hers\"; \"between\"; \"yourself\"; \"but\"; \"again\"; \"there\"; \"about\"; \"once\"; \"during\"; \"out\"; \"very\"; \"having\"; \"with\"; \"they\"; \"own\"; \"an\"; \"be\"; \"some\"; \"for\"; \"do\"; \"its\"; \"yours\"; \"such\"; \"into\"; \"of\"; \"most\"; \"itself\"; \"other\"; \"off\"; \"is\"; \"s\"; \"am\"; \"or\"; \"who\"; \"as\"; \"from\"; \"him\"; \"each\"; \"the\"; \"themselves\"; \"until\"; \"below\"; \"are\"; \"we\"; \"these\"; \"your\"; \"his\"; \"through\"; \"don\"; \"nor\"; \"me\"; \"were\"; \"her\"; \"more\"; \"himself\"; \"this\"; \"down\"; \"should\"; \"our\"; \"their\"; \"while\"; \"above\"; \"both\"; \"up\"; \"to\"; \"ours\"; \"had\"; \"she\"; \"all\"; \"no\"; \"when\"; \"at\"; \"any\"; \"before\"; \"them\"; \"same\"; \"and\"; \"been\"; \"have\"; \"in\"; \"will\"; \"on\"; \"does\"; \"yourselves\"; \"then\"; \"that\"; \"because\"; \"what\"; \"over\"; \"why\"; \"so\"; \"can\"; \"did\"; \"not\"; \"now\"; \"under\"; \"he\"; \"you\"; \"herself\"; \"has\"; \"just\"; \"where\"; \"too\"; \"only\"; \"myself\"; \"which\"; \"those\"; \"i\"; \"after\"; \"few\"; \"whom\"; \"t\"; \"being\"; \"if\"; \"theirs\"; \"my\"; \"against\"; \"a\"; \"by\"; \"doing\"; \"it\"; \"how\"; \"further\"; \"was\"; \"here\"; \"than\"]\n", "    let notStopWords =\n", "        lowered\n", "        |> Array.filter(fun s -> not (Set.contains s commonStopWords))\n", "\n", "    // Remove Punctuation\n", "    let nonPunctuation = \n", "        notStopWords\n", "        |> Array.map(fun x -> x.Replace(\"�\", \"\")\n", "                               .Replace(\"'\", \"\")\n", "                               .Replace(\":\", \"\")\n", "                               .Replace(\".\", \"\")\n", "                               .Replace(\",\", \"\")\n", "                               .Replace(\"-\", \"\")\n", "                               .Replace(\"!\", \"\")\n", "                               .Replace(\"?\", \"\")\n", "                               .Replace(\"\\\"\", \"\")\n", "                               .Replace(\";\", \"\"))\n", "\n", "    // Remove empty strings\n", "    let removeEmptyStrings = \n", "        nonPunctuation\n", "        |> Array.filter(fun x -> not (String.IsNullOrEmpty x))\n", "\n", "    // Lexicographically sort\n", "    let sorted = Array.sort removeEmptyStrings\n", "\n", "    // Return the sorted tokens as a list\n", "    sorted\n", "    |> Array.toList" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have established the \"cleanliness of tokens\" contract, let's give the function a quick spin." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table><thead><tr><th>index</th><th>value</th></tr></thead><tbody><tr><td>0</td><td>brown</td></tr><tr><td>1</td><td>dog</td></tr><tr><td>2</td><td>fox</td></tr><tr><td>3</td><td>jumps</td></tr><tr><td>4</td><td>lazy</td></tr><tr><td>5</td><td>quick</td></tr><tr><td>6</td><td>the</td></tr></tbody></table>" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let sample = \"The quick brown. fox jumps! over the; Lazy dog\" \n", "let tokenizedSample = tokenizeAndClean sample\n", "tokenizedSample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Column Extraction Functions\n", "\n", "As we have seen before, we'll need to parse a couple of JSON fields to get the appropriate data to create the word soup. For this task, we'll continue to use FSharp.Data, this time its JSON type provider, which infers a type from a sample document and then uses that type to parse other documents of the same shape. We will write helper functions that extract this information from each row of the dataset." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "// Genre\n", "[<Literal>]\n", "let SampleGenresJson = \"[{\\\"id\\\": 28, \\\"name\\\": \\\"Action\\\"}, {\\\"id\\\": 12, \\\"name\\\": \\\"Adventure\\\"}, {\\\"id\\\": 80, \\\"name\\\": \\\"Crime\\\"}]\"\n", "\n", "type GenreProvider = JsonProvider< SampleGenresJson >\n", "\n", "let sanitizeGenre (genres : string) : string list = \n", "    let parsed = GenreProvider.Parse(genres)\n", "    parsed\n", "    |> Array.map(fun x -> x.Name.Replace(\" \", \"\")) \n", "    |> String.concat \" \"\n", "    |> tokenizeAndClean\n", "\n", "// Keywords\n", "[<Literal>]\n", "let SampleKeywordsJson = \"[{\\\"id\\\": 1463, \\\"name\\\": \\\"culture clash\\\"}, {\\\"id\\\": 2964, \\\"name\\\": \\\"future\\\"}, {\\\"id\\\": 3386, \\\"name\\\": \\\"space war\\\"}]\"\n", "\n", "type KeywordsProvider = JsonProvider< SampleKeywordsJson >\n", "\n", "let sanitizeKeywords (keywords : string) : string list = \n", "    let parsed = KeywordsProvider.Parse(keywords)\n", "    parsed\n", "    |> Array.map(fun x -> x.Name.Replace(\" \", \"\"))\n", "    |> String.concat \" \"\n", "    |> tokenizeAndClean\n", "\n", "// Overview\n", "let sanitizeOverview (overview : string) : string list = \n", "    // Non-ASCII characters are encoded as '?', which tokenizeAndClean then strips\n", "    let nonAsciiRemoved = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(overview))\n", "    tokenizeAndClean nonAsciiRemoved\n", "    |> List.filter(fun x -> x <> String.Empty)\n", "\n", "// Production Company\n", "[<Literal>]\n", "let ProductionCompanyJson = \"[{\\\"name\\\": \\\"Ingenious Film Partners\\\", \\\"id\\\": 289}, {\\\"name\\\": \\\"Twentieth Century Fox Film Corporation\\\", \\\"id\\\": 306}, {\\\"name\\\": \\\"Dune Entertainment\\\", \\\"id\\\": 444}, {\\\"name\\\": \\\"Lightstorm Entertainment\\\", \\\"id\\\": 574}]\"\n", "type ProductionCompanyProvider = JsonProvider< ProductionCompanyJson >\n", "\n", "let sanitizeProductionCompany (productionCompany : string) : string list = \n", "    let parsed = ProductionCompanyProvider.Parse(productionCompany)\n", "    parsed\n", "    |> Array.map(fun x -> x.Name.Replace(\" \", \"\")) // Lightstorm Entertainment -> LightstormEntertainment\n", "    |> String.concat \" \"\n", "    |> tokenizeAndClean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load Data\n", "\n", "We have now developed all the individual components to create the domain objects. Let's combine them all together in one function `extractData` to give us a sequence of extracted domain objects. \n", "\n", "This function does the following:\n", "\n", "1. Iterates over the data rows \n", "2. For each row, it extracts the title, genres, keywords, overview and production company and converts the details into a word soup and a soup list using the functions created before.\n", "3. The popularity is also extracted in a similar fashion.\n", "4. A new ``MovieData`` record is created and appended to a running list that's returned as a sequence." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "// Wrapper function that gets a sequence of all the domain objects.\n", "let extractData =\n", "    let data = CsvFile.Load(DataPath).Cache()\n", "    let mutable output = []\n", "    for row in data.Rows do\n", "        let title = (row.GetColumn \"title\")\n", "\n", "        // Genres\n", "        let genres = sanitizeGenre (row.GetColumn \"genres\")\n", "        let getSoupGenres = genres |> String.concat \" \"\n", "\n", "        // Keywords\n", "        let keywords = sanitizeKeywords (row.GetColumn \"keywords\")\n", "        let getSoupKeyword = keywords |> String.concat \" \"\n", "\n", "        // Overview\n", "        let overview = sanitizeOverview (row.GetColumn \"overview\")\n", "        let getSoupOverview = overview |> String.concat \" \"\n", "\n", "        // Production Company\n", "        let productionCompany = sanitizeProductionCompany (row.GetColumn \"production_companies\")\n", "        let getSoupProductionCompany = productionCompany |> String.concat \" \"\n", "\n", "        // Soup\n", "        let soup = getSoupGenres + \" \" + getSoupKeyword + \" \" + getSoupOverview + \" \" + getSoupProductionCompany\n", "        let soupList = genres @ keywords @ overview @ productionCompany\n", "\n", "        // Popularity\n", "        let popularity = double(row.GetColumn \"popularity\")\n", "\n", "        // Construct data type\n", "        let movieData = { title = title; \n", "                          soup = soup;\n", "                          popularity = popularity;\n", "                          genreList = genres;\n", "                          prodCompanyList = productionCompany;\n", "                          soupList = soupList }\n", "\n", "        // Append the output\n", "        output <- output @ [movieData]\n", "    output\n", "    |> Seq.ofList" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For ease of use, since we'll expect the client to request a movie in the form of a string, we'll create another function that stores all the MovieData in a dictionary keyed by the movie title." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table><thead><tr><th>index</th><th>Key</th><th>Value</th></tr></thead><tbody><tr><td>0</td><td>Avatar</td><td>{ FSI_0013+MovieData: title: Avatar, soup: action adventure fantasy sciencefiction 3d alien alienplanet antiwar battle cgi cultureclash future futuristic loveaffair marine mindandsoul powerrelations romance society soldier space spacecolony spacetravel spacewar tribe 22nd alien becomes century civilization dispatched following marine mission moon orders pandora paraplegic protecting torn unique duneentertainment ingeniousfilmpartners lightstormentertainment twentiethcenturyfoxfilmcorporation, soupList: [ action, adventure, fantasy, sciencefiction, 3d, alien, alienplanet, antiwar, battle, cgi ... (35 more) ], genreList: [ action, adventure, fantasy, sciencefiction ], prodCompanyList: [ duneentertainment, ingeniousfilmpartners, lightstormentertainment, twentiethcenturyfoxfilmcorporation ], popularity: 150.437577 }</td></tr><tr><td>1</td><td>Pirates of the Caribbean: At World's End</td><td>{ FSI_0013+MovieData: title: Pirates of the Caribbean: At World's End, soup: action adventure fantasy aftercreditsstinger afterlife alliance calypso drugabuse eastindiatradingcompany exoticisland fighter loveofoneslife ocean pirate ship shipwreck strongwoman swashbuckler traitor back barbossa believed captain come dead earth edge elizabeth headed life long nothing quite seems swann turner jerrybruckheimerfilms secondmateproductions waltdisneypictures, soupList: [ action, adventure, fantasy, aftercreditsstinger, afterlife, alliance, calypso, drugabuse, eastindiatradingcompany, exoticisland ... (29 more) ], genreList: [ action, adventure, fantasy ], prodCompanyList: [ jerrybruckheimerfilms, secondmateproductions, waltdisneypictures ], popularity: 139.082615 }</td></tr><tr><td>2</td><td>Spectre</td><td>{ FSI_0013+MovieData: title: Spectre, soup: action adventure crime basedonnovel britishsecretservice mi6 secretagent sequel spy unitedkingdom alive back battles behind bond bonds cryptic deceit forces keep layers m message organization past peels political reveal secret sends service sinister spectre terrible trail truth uncover b24 columbiapictures danjaq, soupList: [ action, adventure, crime, basedonnovel, britishsecretservice, mi6, secretagent, sequel, spy, unitedkingdom ... (30 more) ], genreList: [ action, adventure, crime ], prodCompanyList: [ b24, columbiapictures, danjaq ], popularity: 107.376788 }</td></tr><tr><td>3</td><td>The Dark Knight Rises</td><td>{ FSI_0013+MovieData: title: The Dark Knight Rises, soup: action crime drama thriller batman burglar catburglar catwoman coverup crimefighter criminalunderworld dccomics destruction flood gothamcity hostagedrama imax secretidentity superhero terrorism terrorist timebomb tragichero vigilante villainess assumes attorney attorneys bane batman batman branded city city crimes dark death dent dents department district eight encounters enemy finest following gotham gothams harvey hunted knight kyle late later leader mysterious new overwhelms police protect protect reputation responsibility resurfaces selina subsequently terrorist villainous years dcentertainment legendarypictures syncopy warnerbros, soupList: [ action, crime, drama, thriller, batman, burglar, catburglar, catwoman, coverup, crimefighter ... (63 more) ], genreList: [ action, crime, drama, thriller ], prodCompanyList: [ dcentertainment, legendarypictures, syncopy, warnerbros ], popularity: 112.31295 }</td></tr><tr><td>4</td><td>John Carter</td><td>{ FSI_0013+MovieData: title: John Carter, soup: action adventure sciencefiction 19thcentury 3d alien alienrace basedonnovel edgarriceburroughs escape mars marscivilization martian medallion princess spacetravel steampunk superhumanstrength swordandplanet (mars) barsoom barsoom becomes brink captain carter carter collapse conflict embroiled epic exotic former hands humanity inexplicably its john military mysterious people planet realizes rediscovers reluctantly rests survival transported warweary whos world waltdisneypictures, soupList: [ action, adventure, sciencefiction, 19thcentury, 3d, alien, alienrace, basedonnovel, edgarriceburroughs, escape ... (42 more) ], genreList: [ action, adventure, sciencefiction ], prodCompanyList: [ waltdisneypictures ], popularity: 43.926995 }</td></tr></tbody></table>" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "// Function that wraps the extractData functionality in a dictionary for ease of use.\n", "let getDictOfData = \n", "    let out = Dictionary<string, MovieData>()\n", "    for w in extractData do\n", "        out.[w.title] <- w\n", "    out\n", "\n", "// Test Function\n", "getDictOfData\n", "|> Seq.take 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature Vector Helpers\n", "\n", "Now that we have the domain objects appropriately created, we'll need to get the vector of token frequencies and popularity. Our distance function will then compare the feature vectors of different movies: the higher the distance, the less similar the movies. \n", "\n", "To generate the feature vectors, we'll need to be consistent about what we consider a feature vector so that we compare apples to apples. For this we'll be grabbing all the distinct words from the flattened soup list of all the movies and then computing the frequency of each of those words for a given movie. Finally, as the last element of the list, we'll be including the popularity score so that we just need to deal with a single vector of data.\n", "\n", "__NOTE: Vector here is used interchangeably with an array to keep the language consistent with other Data Science based terminology.__\n", "\n", "The helper functions are as follows: \n", "\n", "1. ``getAllWords``: Gets the unique words from all the soup lists.\n", "2. ``getFeatureDictByMovieData``: Grabs a movie data record and creates a dictionary based on the frequency of the individual words and the popularity.\n", "3. ``getFeatureDict``: Wraps the functionality from ``getFeatureDictByMovieData`` but takes the movie name as an input.\n", "4. ``getFeatureVector``: Returns the values of the feature dictionary for a particular movie title." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "// Function gets all the distinct words from the soup list\n", "let getAllWords = \n", "    extractData\n", "    |> Seq.map(fun x -> x.soupList)\n", "    |> Seq.concat \n", "    |> Seq.distinct\n", "\n", "// Function to get the word frequency for a particular movie and then add the popularity to the end of the dictionary\n", "let getFeatureDictByMovieData (movieCompare : MovieData) = \n", "    let wordFrequency = new Dictionary<string, double>()\n", "    for w in getAllWords do\n", "        let count = \n", "            movieCompare.soupList\n", "            |> List.filter(fun x -> x = w)\n", "            |> List.length\n", "        let countAsDouble = double(count)\n", "        if not (wordFrequency.ContainsKey w) then wordFrequency.[w] <- countAsDouble\n", "        else wordFrequency.[w] <- wordFrequency.[w] + countAsDouble\n", "    wordFrequency.[\";popularity_score;\"] <- movieCompare.popularity\n", "    wordFrequency\n", "\n", "// Function to get feature dictionary\n", "let getFeatureDict (movieName : string) = \n", "    if getDictOfData.ContainsKey movieName then getFeatureDictByMovieData getDictOfData.[movieName]\n", "    else\n", "        failwith \"Movie Not Found!\"\n", "\n", "// Function to get the feature vector i.e. values in the feature dictionary\n", "let getFeatureVector(movieName : string) = \n", "    let featureDict = getFeatureDict movieName\n", "    featureDict\n", "    |> Seq.map(fun x -> x.Value)\n", "    |> Seq.toArray" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to get the feature vector for Avatar and The Dark Knight Rises." 
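, "\n", "As a preview of what to expect, here is a small worked sketch. It assumes, purely for illustration, a vocabulary cut down to four words (the real vocabulary contains every distinct token in the dataset); the counts are read off the soups printed earlier, and the popularity score is appended as the last entry:\n", "\n", "$ \\text{vocabulary} = (\\text{action}, \\text{alien}, \\text{batman}, \\text{marine}) $\n", "\n", "$ v_{\\text{Avatar}} = (1, 2, 0, 2, 150.437577) $\n", "\n", "$ v_{\\text{The Dark Knight Rises}} = (1, 0, 3, 0, 112.31295) $\n", "\n", "Here alien and marine each appear twice in Avatar's soup list (once as a keyword and once in the overview), while batman appears three times in the soup list of The Dark Knight Rises."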
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Avatar Feature Dictionary:\n", "seq [[action, 1]; [adventure, 1]; [fantasy, 1]; [sciencefiction, 1]; ...]\n", "Popularity Score: 150.437577\n", "\n", "[|1.0; 1.0; 1.0; 1.0; 1.0; 2.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 2.0; 1.0;\n", " 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0;\n", " 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 1.0; 0.0; 0.0; 0.0; 0.0; 0.0;\n", " 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0;\n", " 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0;\n", " 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0;\n", " 0.0; 0.0; 0.0; 0.0; ...|]\n", "The Dark Knight Rises Feature Dictionary:\n", "seq [[action, 1]; [adventure, 0]; [fantasy, 0]; [sciencefiction, 0]; ...]\n", "Popularity Score: 112.31295\n", "[|1.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0;\n", " 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 1.0; 0.0;\n", " 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0;\n", " 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0;\n", " 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 1.0;\n", " 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0; 0.0;\n", " 0.0; 0.0; 0.0; 0.0; ...|]\n" ] }, { "data": { "text/html": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "printfn \"Avatar Feature Dictionary:\"\n", "let avatarVector = getFeatureDict \"Avatar\"\n", "printfn \"%A\" avatarVector\n", "printfn \"Popularity Score: %A\\n\" (avatarVector.[\";popularity_score;\"])\n", "printfn \"%A\" (getFeatureVector \"Avatar\")\n", "\n", "printfn \"The Dark Knight Rises Feature Dictionary:\"\n", "let 
darkKnightVector = getFeatureDict \"The Dark Knight Rises\"\n", "printfn \"%A\" darkKnightVector\n", "printfn \"Popularity Score: %A\" (darkKnightVector.[\";popularity_score;\"])\n", "printfn \"%A\" (getFeatureVector \"The Dark Knight Rises\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cosine Distance\n", "\n", "![image](./pics/CosineDistance.png)\n", "\n", "We need to numerically compute a difference between the feature vectors of two different movies, and this is where the distance function comes into play. Specifically, we'll be making use of the Cosine Distance to compute this difference. \n", "\n", "Cosine Distance, as computed by ``MathNet.Numerics``, is given by:\n", "\n", "$ 1 - \\text{cosine_similarity}(u, v) $\n", "\n", "where:\n", "\n", "$ \\text{cosine_similarity}(u, v) = \\frac{u \\cdot v}{\\lVert u \\rVert \\, \\lVert v \\rVert} $\n", "\n", "and u and v are the input vectors.\n", "\n", "A closely related measure is the angular similarity, which rescales the angle between the two vectors:\n", "\n", "$ 1 - \\frac{\\arccos(\\text{cosine_similarity}(u, v))}{\\pi} $\n", "\n", "We will be using `Distance.Cosine` from the ``MathNet.Numerics`` library to compute distances between the feature vectors defined above.\n", "\n", "#### Why Cosine Distance vs. Cosine Similarity?\n", "\n", "According to [this](https://arxiv.org/abs/1803.11175) paper by Google, an angular distance derived from the cosine similarity performs better than raw cosine similarity for comparing two vectors. Thought I'd give it a shot! Intuitively, the angle between two vectors captures their difference better than the raw similarity score does. Note that `Distance.Cosine` returns one minus the cosine similarity rather than the angular form (the second value printed below matches the former). Both quantities shrink as the cosine similarity grows, so they rank movies in the same order." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cosine Distance when values are equal: 0.0\n", "Cosine Distance when values are different: 0.105572809\n" ] }, { "data": { "text/html": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let x : double[] = [| 3.; 1. |]\n", "let y : double[] = [| 3.; 3. |]\n", "\n", "printfn \"Cosine Distance when values are equal: %A\" (Distance.Cosine(x, x))\n", "printfn \"Cosine Distance when values are different: %A\" (Distance.Cosine(x, y))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have seen the Cosine Distance usage, let's compute the cosine distance of two movies by first encapsulating the behavior into a function." 
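, "\n", "Before doing so, as a sanity check, the second value printed above can be reproduced by hand, assuming `Distance.Cosine` computes one minus the cosine similarity (an assumption about the library that the printed output is consistent with). For $x = (3, 1)$ and $y = (3, 3)$:\n", "\n", "$ x \\cdot y = 3 \\cdot 3 + 1 \\cdot 3 = 12, \\quad \\lVert x \\rVert = \\sqrt{10}, \\quad \\lVert y \\rVert = \\sqrt{18} $\n", "\n", "$ 1 - \\frac{12}{\\sqrt{10} \\, \\sqrt{18}} = 1 - \\frac{12}{\\sqrt{180}} \\approx 1 - 0.894427 = 0.105573 $\n", "\n", "which agrees with the printed 0.105572809."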
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "let computeCosineDistance (movie1 : string) (movie2 : string) : double = \n", "    // Feature Vector\n", "    let movie1Vector = getFeatureVector movie1\n", "    let movie2Vector = getFeatureVector movie2\n", "\n", "    // Compute the cosine distance; getFeatureVector fails if either movie is not found\n", "    Distance.Cosine(movie1Vector, movie2Vector)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.004311626155728665" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "computeCosineDistance \"Avatar\" \"The Dark Knight Rises\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, Avatar and The Dark Knight Rises are considered similar (the distance is close to 0) based on the word soup and popularity." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Recommendations\n", "\n", "Now that we have all the pieces to get our recommendations in place, we can easily wrap up the logic into one function that takes in the movie name and the number of recommendations and returns a list of tuples, each comprising the name of a recommended movie and its cosine distance. As mentioned before, the higher the cosine distance, the more dissimilar the movie; therefore, we sort by cosine distance in ascending order and take the top results. 
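\n", "\n", "To see what drives these distances, the Avatar vs. The Dark Knight Rises value computed above can be reproduced by hand. This is a sketch under two assumptions: that `Distance.Cosine` computes one minus the cosine similarity, and that the token counts below, tallied from the soups printed earlier, are right. Avatar (u) has 41 tokens occurring once and 2 occurring twice; The Dark Knight Rises (v) has 64 occurring once, 3 occurring twice and 1 occurring three times; the two movies share only the tokens action and following:\n", "\n", "$ u \\cdot v = 2 + 150.437577 \\times 112.31295 \\approx 16898.09 $\n", "\n", "$ \\lVert u \\rVert^2 = 41 + 2 \\cdot 2^2 + 150.437577^2 \\approx 22680.46, \\quad \\lVert v \\rVert^2 = 64 + 3 \\cdot 2^2 + 3^2 + 112.31295^2 \\approx 12699.20 $\n", "\n", "$ 1 - \\frac{16898.09}{\\sqrt{22680.46 \\times 12699.20}} \\approx 0.0043 $\n", "\n", "which is in line with the 0.0043 printed above. The popularity terms dwarf the word counts in both the dot product and the norms, which is why the distances all sit so close to 0.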
" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "let recommendMovies (movie : string) (recommendationCount : int) =\n", " \n", " // Check if two lists contain any intersection - will be using this function to \n", " let setIntersect listA listB =\n", " Set.intersect (Set.ofList listA) (Set.ofList listB)\n", " |> Set.isEmpty\n", " |> not\n", " \n", " // Filter out any unrelated movies to improve computation\n", " let filterOutMoviesNotRelated (movie1 : MovieData) (movie2 : MovieData) = \n", "\n", " // Filtering mechanism: If the movies don't have any genres in common, don't even consider.\n", " let checkIfGenresExist (movie1 : MovieData) (movie2 : MovieData) = \n", " let movie1Genres = movie1.genreList\n", " let movie2Genres = movie2.genreList\n", " setIntersect movie1Genres movie2Genres\n", " \n", " // Filtering mechanism: If the movies don't have any production companies in common, don't even consider.\n", " let checkIfProductionCompaniesExist (movie1 : MovieData) (movie2 : MovieData) = \n", " let movie1ProdCompany = movie1.prodCompanyList\n", " let movie2ProdCompany = movie2.prodCompanyList\n", " setIntersect movie1ProdCompany movie2ProdCompany\n", " \n", " (checkIfGenresExist movie1 movie2) && (checkIfProductionCompaniesExist movie1 movie2)\n", " \n", " if getDictOfData.ContainsKey movie then\n", " let movieData = getDictOfData.[movie]\n", " getDictOfData\n", " // Don't include the current item in question nor any other movie not of any of the genres of the movie\n", " |> Seq.filter(fun x -> not(x.Value = movieData) && ( filterOutMoviesNotRelated movieData x.Value ))\n", " // Grab a tuple of the title and cosine distance\n", " |> Seq.map(fun x -> (x.Value.title, computeCosineDistance movie x.Value.title)) \n", " // Remove NaNs\n", " |> Seq.filter(fun x -> not(System.Double.IsNaN(snd x)))\n", " // Sort by distance\n", " |> Seq.sortBy(fun x -> snd x)\n", " // Take only the specified recommendation counts\n", " |> 
Seq.truncate recommendationCount\n", "        // Convert to list\n", "        |> Seq.toList\n", "    else\n", "        failwith \"Movie not found!\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try getting recommendations for \"The Dark Knight Rises\":\n", "\n", "![image](./pics/DarkKnightRises.jpg)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"The Dark Knight\" 0.003000627894\n", "\"Interstellar\" 0.003353724178\n", "\"Jurassic World\" 0.003409615892\n", "\"Mad Max: Fury Road\" 0.003462151342\n", "\"Inception\" 0.003885255409\n", "\"Batman v Superman: Dawn of Justice\" 0.003924230513\n", "\"Batman Begins\" 0.004095903565\n", "\"One Flew Over the Cuckoo's Nest\" 0.004436711406\n", "\"San Andreas\" 0.004675560828\n", "\"Man of Steel\" 0.004890483474\n" ] }, { "data": { "text/html": [ "" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "let results = \n", "    (recommendMovies \"The Dark Knight Rises\" 10)\n", "    |> List.iter(fun x -> printfn \"%A %A\" (fst x) (snd x))\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not bad! The Dark Knight is the most similar to The Dark Knight Rises, followed by another Christopher Nolan classic, \"Interstellar\", which is probably just as popular. The remaining movies mostly share the same themes: superheroes, Christopher Nolan, and action-based plots. One Flew Over the Cuckoo's Nest seems to be the anomaly here, though!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "F# made it remarkably easy to create a simple recommendation system! 
It was great fun writing this blog post and learning about the data science tools available in the ecosystem.\n", "\n", "The next steps of this project could be to create more specialized and advanced recommendation systems and to study the behavior of different distance functions. The current recommendation system is fairly slow, so I'd also like to optimize the vector creation process in the next iteration." ] } ], "metadata": { "kernelspec": { "display_name": ".NET (F#)", "language": "F#", "name": ".net-fsharp" }, "language": "fsharp", "language_info": { "file_extension": ".fs", "mimetype": "text/x-fsharp", "name": "C#", "pygments_lexer": "fsharp", "version": "4.5" } }, "nbformat": 4, "nbformat_minor": 2 }