{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# TF-IDF and similarity scores\n", "> Learn how to compute tf-idf weights and the cosine similarity score between two vectors. You will use these concepts to build a movie and a TED Talk recommender. Finally, you will also learn about word embeddings and using word vector representations, you will compute similarities between various Pink Floyd songs. This is the Summary of lecture \"Feature Engineering for NLP in Python\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Natural_Language_Processing]\n", "- image: images/cos_sim.png" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building tf-idf document vectors\n", "- n-gram modeling\n", " - Weight of dimension dependent on the frequency of the word corresponding to the dimension\n", "- Applications\n", " - Automatically detect stopwords\n", " - Search\n", " - Recommender systems\n", " - Better performance in predictive modeling for some cases\n", "- Term frequency-inverse document frequency\n", " - Proportional to term frequency\n", " - Inverse function of the number of documents in which it occurs\n", " - Mathematical formula\n", " $$ w_{i, j} = \\text{tf}_{i, j} \\cdot \\log (\\frac{N}{\\text{df}_{i}}) $$\n", " - $w_{i, j} \\rightarrow $ weight of term $i$ in document $j$\n", " - $\\text{tf}_{i, j} \\rightarrow $ term frequency of term $i$ in document $j$\n", " - $N \\rightarrow$ number of documents in the corpus\n", " - $\\text{df}_{i} \\rightarrow$ number of documents containing term $i$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### tf-idf vectors for TED talks\n", "In this exercise, you have been given a corpus `ted` which contains the transcripts of 500 TED Talks. Your task is to generate the tf-idf vectors for these talks.\n", "\n", "In a later lesson, we will use these vectors to generate recommendations of similar talks based on the transcript.\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
transcripturl
0We're going to talk — my — a new lecture, just...https://www.ted.com/talks/al_seckel_says_our_b...
1This is a representation of your brain, and yo...https://www.ted.com/talks/aaron_o_connell_maki...
2It's a great honor today to share with you The...https://www.ted.com/talks/carter_emmart_demos_...
3My passions are music, technology and making t...https://www.ted.com/talks/jared_ficklin_new_wa...
4It used to be that if you wanted to get a comp...https://www.ted.com/talks/jeremy_howard_the_wo...
\n", "
" ], "text/plain": [ " transcript \\\n", "0 We're going to talk — my — a new lecture, just... \n", "1 This is a representation of your brain, and yo... \n", "2 It's a great honor today to share with you The... \n", "3 My passions are music, technology and making t... \n", "4 It used to be that if you wanted to get a comp... \n", "\n", " url \n", "0 https://www.ted.com/talks/al_seckel_says_our_b... \n", "1 https://www.ted.com/talks/aaron_o_connell_maki... \n", "2 https://www.ted.com/talks/carter_emmart_demos_... \n", "3 https://www.ted.com/talks/jared_ficklin_new_wa... \n", "4 https://www.ted.com/talks/jeremy_howard_the_wo... " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./dataset/ted.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "ted = df['transcript']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(500, 29158)\n" ] } ], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "# Create TfidfVectorizer object\n", "vectorizer = TfidfVectorizer()\n", "\n", "# Generate matrix of word vectors\n", "tfidf_matrix = vectorizer.fit_transform(ted)\n", "\n", "# Print the shape of tfidf_matrix\n", "print(tfidf_matrix.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You now know how to generate tf-idf vectors for a given corpus of text. You can use these vectors to perform predictive modeling just like we did with `CountVectorizer`. In the next few lessons, we will see another extremely useful application of the vectorized form of documents: generating recommendations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cosine similarity\n", "![cos_sim](image/cos_sim.png)\n", "- The dot product\n", " - Consider two vectors,\n", " \n", " $ V = (v_1, v_2, \\dots, v_n), W = (w_1, w_2, \\dots, w_n) $\n", " \n", " - Then the dot product of $V$ and $W$ is,\n", " \n", " $ V \\cdot W = (v_1 \\times w_1) + (v_2 \\times w_2) + \\dots + (v_n \\times w_n) $\n", "- Magnitude of vector\n", " - For any vector,\n", " \n", " $ V = (v_1, v_2, \\dots, v_n) $\n", " \n", " - The magnitude is defined as,\n", " \n", " $ \\Vert V \\Vert = \\sqrt{(v_1)^2 + (v_2)^2 + \\dots + (v_n)^2} $\n", "- Cosine score: points to remember\n", " - Value between -1 and 1\n", " - In NLP, value between 0 (no similarity) and 1 (same)\n", " - Robust to document length" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing dot product\n", "In this exercise, we will learn to compute the dot product between two vectors, A = (1, 3) and B = (-2, 2), using the `numpy` library. More specifically, we will use the `np.dot()` function to compute the dot product of two numpy arrays." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n" ] } ], "source": [ "# Initialize numpy vectors\n", "A = np.array([1, 3])\n", "B = np.array([-2, 2])\n", "\n", "# Compute dot product\n", "dot_prod = np.dot(A, B)\n", "\n", "# Print dot product\n", "print(dot_prod)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cosine similarity matrix of a corpus\n", "In this exercise, you have been given a `corpus`, which is a list containing five sentences. 
You have to compute the cosine similarity matrix, which contains the pairwise cosine similarity score for every pair of sentences (vectorized using tf-idf).\n", "\n", "Remember, the value corresponding to the ith row and jth column of a similarity matrix denotes the similarity score for the ith and jth vector." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "corpus = ['The sun is the largest celestial body in the solar system', \n", " 'The solar system consists of the sun and eight revolving planets', \n", " 'Ra was the Egyptian Sun God', \n", " 'The Pyramids were the pinnacle of Egyptian architecture', \n", " 'The quick brown fox jumps over the lazy dog']" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1. 0.36413198 0.18314713 0.18435251 0.16336438]\n", " [0.36413198 1. 0.15054075 0.21704584 0.11203887]\n", " [0.18314713 0.15054075 1. 0.21318602 0.07763512]\n", " [0.18435251 0.21704584 0.21318602 1. 0.12960089]\n", " [0.16336438 0.11203887 0.07763512 0.12960089 1. ]]\n" ] } ], "source": [ "from sklearn.metrics.pairwise import cosine_similarity\n", "\n", "# Initialize an instance of tf-idf Vectorizer\n", "tfidf_vectorizer = TfidfVectorizer()\n", "\n", "# Generate the tf-idf vectors for the corpus\n", "tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)\n", "\n", "# Compute and print the cosine similarity matrix\n", "cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)\n", "print(cosine_sim)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you will see in a subsequent lesson, computing the cosine similarity matrix lies at the heart of many practical systems such as recommenders. From our similarity matrix, we see that the first and the second sentence are the most similar. Also, the fifth sentence has, on average, the lowest pairwise cosine scores. This is intuitive, as it contains entities that are not present in the other sentences." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building a plot line based recommender\n", "- Steps\n", " 1. Text preprocessing\n", " 2. Generate tf-idf vectors\n", " 3. Generate cosine-similarity matrix\n", "- The recommender function\n", " 1. Take a movie title, cosine similarity matrix and indices series as arguments\n", " 2. Extract pairwise cosine similarity scores for the movie\n", " 3. Sort the scores in descending order\n", " 4. Output titles corresponding to the highest scores\n", " 5. Ignore the highest similarity score (of 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparing linear_kernel and cosine_similarity\n", "In this exercise, you have been given `tfidf_matrix`, which contains the tf-idf vectors of a corpus of documents (here, the five-sentence corpus vectorized in the previous exercise). Your task is to generate the cosine similarity matrix for these vectors, first using `cosine_similarity` and then using `linear_kernel`.\n", "\n", "We will then compare the computation times for both functions." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1. 0.36413198 0.18314713 0.18435251 0.16336438]\n", " [0.36413198 1. 0.15054075 0.21704584 0.11203887]\n", " [0.18314713 0.15054075 1. 0.21318602 0.07763512]\n", " [0.18435251 0.21704584 0.21318602 1. 0.12960089]\n", " [0.16336438 0.11203887 0.07763512 0.12960089 1. 
]]\n", "Time taken: 0.001322031021118164 seconds\n" ] } ], "source": [ "import time\n", "\n", "# Record start time\n", "start = time.time()\n", "\n", "# Compute cosine similarity matrix\n", "cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)\n", "\n", "# Print cosine similarity matrix\n", "print(cosine_sim)\n", "\n", "# Print time taken\n", "print(\"Time taken: %s seconds\" % (time.time() - start))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1. 0.36413198 0.18314713 0.18435251 0.16336438]\n", " [0.36413198 1. 0.15054075 0.21704584 0.11203887]\n", " [0.18314713 0.15054075 1. 0.21318602 0.07763512]\n", " [0.18435251 0.21704584 0.21318602 1. 0.12960089]\n", " [0.16336438 0.11203887 0.07763512 0.12960089 1. ]]\n", "Time taken: 0.0007748603820800781 seconds\n" ] } ], "source": [ "from sklearn.metrics.pairwise import linear_kernel\n", "\n", "# Record start time\n", "start = time.time()\n", "\n", "# Compute cosine similarity matrix\n", "cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)\n", "\n", "# Print cosine similarity matrix\n", "print(cosine_sim)\n", "\n", "# Print time taken\n", "print(\"Time taken: %s seconds\" % (time.time() - start))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how both `linear_kernel` and `cosine_similarity` produced the same result. However, `linear_kernel` took less time to execute. This is because `TfidfVectorizer` L2-normalizes its output by default, so the dot product of two tf-idf vectors already equals their cosine similarity, and `linear_kernel` skips the redundant normalization step. When you're working with a very large amount of data and your vectors are in the tf-idf representation, it is good practice to default to `linear_kernel` to improve performance. (Note: if you see `linear_kernel` taking more time, it's because the dataset we're dealing with is extremely small, and Python's `time` module cannot capture such minute time differences accurately.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The recommender function\n", "In this exercise, we will build a recommender function `get_recommendations()`, as discussed in the lesson. It takes in a movie title, a cosine similarity matrix, and a title-to-index mapping as arguments, and outputs a list of the 10 titles most similar to the original title (excluding the title itself).\n", "\n", "You have been given a dataset `metadata` that consists of movie titles and overviews. The head of this dataset is shown below." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0idtitleoverviewtagline
0049026The Dark Knight RisesFollowing the death of District Attorney Harve...The Legend Ends
11414Batman ForeverThe Dark Knight of Gotham City confronts a das...Courage now, truth always...
22268BatmanThe Dark Knight of Gotham City begins his war ...Have you ever danced with the devil in the pal...
33364Batman ReturnsHaving defeated the Joker, Batman now faces th...The Bat, the Cat, the Penguin.
44415Batman & RobinAlong with crime-fighting partner Robin and ne...Strength. Courage. Honor. And loyalty.
\n", "
" ], "text/plain": [ " Unnamed: 0 id title \\\n", "0 0 49026 The Dark Knight Rises \n", "1 1 414 Batman Forever \n", "2 2 268 Batman \n", "3 3 364 Batman Returns \n", "4 4 415 Batman & Robin \n", "\n", " overview \\\n", "0 Following the death of District Attorney Harve... \n", "1 The Dark Knight of Gotham City confronts a das... \n", "2 The Dark Knight of Gotham City begins his war ... \n", "3 Having defeated the Joker, Batman now faces th... \n", "4 Along with crime-fighting partner Robin and ne... \n", "\n", " tagline \n", "0 The Legend Ends \n", "1 Courage now, truth always... \n", "2 Have you ever danced with the devil in the pal... \n", "3 The Bat, the Cat, the Penguin. \n", "4 Strength. Courage. Honor. And loyalty. " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "metadata = pd.read_csv('./dataset/movie_metadata.csv').dropna()\n", "metadata.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Generate mapping between titles and index\n", "indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()\n", "\n", "def get_recommendations(title, cosine_sim, indices):\n", " # Get the index of the movie that matches the title\n", " idx = indices[title]\n", " # Get the pairwsie similarity scores\n", " sim_scores = list(enumerate(cosine_sim[idx]))\n", " # Sort the movies based on the similarity scores\n", " sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)\n", " # Get the scores for 10 most similar movies\n", " sim_scores = sim_scores[1:11]\n", " # Get the movie indices\n", " movie_indices = [i[0] for i in sim_scores]\n", " # Return the top 10 most similar movies\n", " return metadata['title'].iloc[movie_indices]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot recommendation engine\n", "In this exercise, we will build a recommendation engine that suggests movies based on similarity of plot lines. You have been given a `get_recommendations()` function that takes in the title of a movie, a similarity matrix and an indices series as its arguments and outputs a list of most similar movies.\n", "\n", "You have also been given a `movie_plots` Series that contains the plot lines of several movies. Your task is to generate a cosine similarity matrix for the tf-idf vectors of these plots.\n", "\n", "Consequently, we will check the potency of our engine by generating recommendations for one of my favorite movies, The Dark Knight Rises." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "movie_plots = metadata['overview']" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 Batman Forever\n", "2 Batman\n", "8 Batman: Under the Red Hood\n", "3 Batman Returns\n", "9 Batman: Year One\n", "10 Batman: The Dark Knight Returns, Part 1\n", "11 Batman: The Dark Knight Returns, Part 2\n", "5 Batman: Mask of the Phantasm\n", "7 Batman Begins\n", "4 Batman & Robin\n", "Name: title, dtype: object\n" ] } ], "source": [ "# Initialize the TfidfVectorizer\n", "tfidf = TfidfVectorizer(stop_words='english')\n", "\n", "# Construct the TF-IDF matrix\n", "tfidf_matrix = tfidf.fit_transform(movie_plots)\n", "\n", "# Generate the cosine similarity matrix\n", "cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)\n", "\n", "# Generate recommendations\n", "print(get_recommendations(\"The Dark Knight Rises\", cosine_sim, indices))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You've just built your very first recommendation system. Notice how the recommender correctly identifies 'The Dark Knight Rises' as a Batman movie and recommends other Batman movies as a result. This system is, of course, very primitive and there are a host of ways in which it could be improved. One method would be to look at the cast, crew and genre in addition to the plot to generate recommendations. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TED talk recommender\n", "In this exercise, we will build a recommendation system that suggests TED Talks based on their transcripts. You have been given a `get_recommendations()` function that takes in the title of a talk, a similarity matrix and an `indices` series as its arguments, and outputs a list of the most similar talks.\n", "\n", "You have also been given a `transcripts` series that contains the transcripts of around 500 TED talks. Your task is to generate a cosine similarity matrix for the tf-idf vectors of the talk transcripts.\n", "\n", "Finally, we will generate recommendations for a talk titled '5 ways to kill your dreams' by Brazilian entrepreneur Bel Pesce." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Unnamed: 0.1titleurltranscript
Unnamed: 0
0140710 top time-saving tech tipshttps://www.ted.com/talks/david_pogue_10_top_t...I've noticed something interesting about socie...
11524Who am I? Think againhttps://www.ted.com/talks/hetain_patel_who_am_...Hetain Patel: (In Chinese)Yuyu Rau: Hi, I'm He...
22393\"Awoo\"https://www.ted.com/talks/sofi_tukker_awoo\\n(Music)Sophie Hawley-Weld: OK, you don't have ...
32313What I learned from 2,000 obituarieshttps://www.ted.com/talks/lux_narayan_what_i_l...Joseph Keller used to jog around the Stanford ...
41633Why giving away our wealth has been the most s...https://www.ted.com/talks/bill_and_melinda_gat...Chris Anderson: So, this is an interview with ...
\n", "
" ], "text/plain": [ " Unnamed: 0.1 title \\\n", "Unnamed: 0 \n", "0 1407 10 top time-saving tech tips \n", "1 1524 Who am I? Think again \n", "2 2393 \"Awoo\" \n", "3 2313 What I learned from 2,000 obituaries \n", "4 1633 Why giving away our wealth has been the most s... \n", "\n", " url \\\n", "Unnamed: 0 \n", "0 https://www.ted.com/talks/david_pogue_10_top_t... \n", "1 https://www.ted.com/talks/hetain_patel_who_am_... \n", "2 https://www.ted.com/talks/sofi_tukker_awoo\\n \n", "3 https://www.ted.com/talks/lux_narayan_what_i_l... \n", "4 https://www.ted.com/talks/bill_and_melinda_gat... \n", "\n", " transcript \n", "Unnamed: 0 \n", "0 I've noticed something interesting about socie... \n", "1 Hetain Patel: (In Chinese)Yuyu Rau: Hi, I'm He... \n", "2 (Music)Sophie Hawley-Weld: OK, you don't have ... \n", "3 Joseph Keller used to jog around the Stanford ... \n", "4 Chris Anderson: So, this is an interview with ... " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ted = pd.read_csv('./dataset/ted_clean.csv', index_col=0)\n", "ted.head()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def get_recommendations(title, cosine_sim, indices):\n", " # Get the index of the movie that matches the title\n", " idx = indices[title]\n", " # Get the pairwsie similarity scores\n", " sim_scores = list(enumerate(cosine_sim[idx]))\n", " # Sort the movies based on the similarity scores\n", " sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)\n", " # Get the scores for 10 most similar movies\n", " sim_scores = sim_scores[1:11]\n", " # Get the movie indices\n", " talk_indices = [i[0] for i in sim_scores]\n", " # Return the top 10 most similar movies\n", " return ted['title'].iloc[talk_indices]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Generate mapping between titles and index\n", "indices = pd.Series(ted.index, index=ted['title']).drop_duplicates()\n", "transcripts = ted['transcript']" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Unnamed: 0\n", "453 Success is a continuous journey\n", "157 Why we do what we do\n", "494 How to find work you love\n", "149 My journey into movies that matter\n", "447 One Laptop per Child\n", "230 How to get your ideas to spread\n", "497 Plug into your hard-wired happiness\n", "495 Why you will fail to have a great career\n", "179 Be suspicious of simple stories\n", "53 To upgrade is human\n", "Name: title, dtype: object\n" ] } ], "source": [ "# Initialize the TfidfVectorizer\n", "tfidf = TfidfVectorizer(stop_words='english')\n", "\n", "# Construct the TF-IDF matrix\n", "tfidf_matrix = tfidf.fit_transform(transcripts)\n", "\n", "# Generate the cosine similarity matrix\n", "cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)\n", "\n", "# Generate recommendations\n", "print(get_recommendations('5 ways to kill your dreams', cosine_sim, indices))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You have successfully built a TED talk recommender. This recommender works surprisingly well despite being trained only on a small subset of TED talks. In fact, three of the talks recommended by our system is also recommended by the official TED website as talks to watch next after '5 ways to kill your dreams'!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Beyond n-grams: word embeddings\n", "- Word embeddings\n", " - Mapping words into an n-dimensional vector space\n", " - Produced using deep learning and huge amounts of data\n", " - Discern how similar two words are to each other\n", " - Used to detect synonyms and antonyms\n", " - Capture complex relationships\n", " - Dependent on the spaCy model; independent of the dataset you use" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> Note: Before using word embeddings through spaCy, you need to download the `en_core_web_lg` model (`python -m spacy download en_core_web_lg`). See this [page](https://spacy.io/models/en#en_core_web_lg) for details." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "import spacy\n", "nlp = spacy.load('en_core_web_lg')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generating word vectors\n", "In this exercise, we will generate the pairwise similarity scores of all the words in a sentence. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I I 1.0\n", "I like 0.5554912\n", "I apples 0.20442726\n", "I and 0.31607857\n", "I orange 0.30332792\n", "like I 0.5554912\n", "like like 1.0\n", "like apples 0.32987142\n", "like and 0.5267484\n", "like orange 0.3551869\n", "apples I 0.20442726\n", "apples like 0.32987142\n", "apples apples 1.0\n", "apples and 0.24097733\n", "apples orange 0.5123849\n", "and I 0.31607857\n", "and like 0.5267484\n", "and apples 0.24097733\n", "and and 1.0\n", "and orange 0.25450808\n", "orange I 0.30332792\n", "orange like 0.3551869\n", "orange apples 0.5123849\n", "orange and 0.25450808\n", "orange orange 1.0\n" ] } ], "source": [ "sent = 'I like apples and orange'\n", "\n", "# Create the doc object\n", "doc = nlp(sent)\n", "\n", "# Compute pairwise similarity scores\n", "for token1 in doc:\n", " for token2 in doc: \n", " print(token1.text, token2.text, token1.similarity(token2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the words `'apples'` and `'orange'` have the highest pairwise similarity score among the content words. This is expected, as they are both fruits and are more related to each other than to any other pair of content words in the sentence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing similarity of Pink Floyd songs\n", "In this final exercise, you have been given lyrics of three songs by the British band Pink Floyd, namely 'High Hopes', 'Hey You' and 'Mother'. The lyrics to these songs are available as `hopes`, `hey` and `mother` respectively.\n", "\n", "Your task is to compute the pairwise similarity between `mother` and `hopes`, and `mother` and `hey`." 
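] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before loading the lyrics, it is worth knowing what `similarity()` computes under the hood. For a model with static word vectors such as `en_core_web_lg`, a `Doc` vector is the average of its token vectors, and `Doc.similarity()` is the cosine between the two document vectors. The cell below is a minimal sketch that checks this on two short phrases, reusing `nlp` and `np` from above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch: Doc.similarity() should match a hand-rolled cosine score\n", "doc1 = nlp('I like apples')\n", "doc2 = nlp('I like oranges')\n", "\n", "# Cosine similarity = dot product divided by the product of magnitudes\n", "cos = np.dot(doc1.vector, doc2.vector) / (np.linalg.norm(doc1.vector) * np.linalg.norm(doc2.vector))\n", "\n", "# The two numbers printed below should agree\n", "print(cos, doc1.similarity(doc2))"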
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "with open('./dataset/mother.txt', 'r') as f:\n", " mother = f.read()\n", " \n", "with open('./dataset/hopes.txt', 'r') as f:\n", " hopes = f.read()\n", " \n", "with open('./dataset/hey.txt', 'r') as f:\n", " hey = f.read()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.8653562687318176\n", "0.9595267490921296\n" ] } ], "source": [ "# Create Doc objects\n", "mother_doc = nlp(mother)\n", "hopes_doc = nlp(hopes)\n", "hey_doc = nlp(hey)\n", "\n", "# Print similarity between mother and hopes\n", "print(mother_doc.similarity(hopes_doc))\n", "\n", "# Print similarity between mother and hey\n", "print(mother_doc.similarity(hey_doc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that 'Mother' and 'Hey You' have a similarity score of about 0.96, whereas 'Mother' and 'High Hopes' have a score of about 0.87. This is probably because 'Mother' and 'Hey You' were both songs from the same album 'The Wall' and were penned by Roger Waters. On the other hand, 'High Hopes' was part of the album 'The Division Bell', with lyrics by David Gilmour and his wife, Polly Samson. Treat yourself by listening to these songs. They're some of the best!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }