{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "colab": { "name": "Word_Vector_visualization.ipynb", "provenance": [], "include_colab_link": true } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "hNJQXT-I0SQD", "colab_type": "text" }, "source": [ "## Word vector visualization" ] }, { "cell_type": "markdown", "metadata": { "id": "fmgtDXdM5azq", "colab_type": "text" }, "source": [ "This word vector visualization tool was built to illustrate some of the properties they have.\n", "The vectors themselves were encoded using GloVe model by the good folks at Stanford, ([link to GloVe](https://nlp.stanford.edu/projects/glove/)) and this script was heavily inspired by this page [here](https://web.stanford.edu/class/cs224n/materials/Gensim%20word%20vector%20visualization.html), taken from the materials of one of the courses in Stanford.\n", "\n", "In short, they trained GloVe model on 400k word corpus obtained from Wikipedia (at 2014) and Gigaword.\n", "\n", "I have chosen to use their smallest vectors with the least dimensions (50d), since the difference in performance was insignificant as compared to the difference in file size.\n", "however, you are more than welcome to download the other files from the Glove site ([zip files](https://nlp.stanford.edu/data/glove.6B.zip)) and play around with it.\n", "\n", "I changed the functions and made them slower and less efficient, but in doing so, you can see in inside of them and have a better understanding of how they work. I have also changed the plotting tools to pyplot and made some enhancements." ] }, { "cell_type": "markdown", "metadata": { "id": "H3Q6cT6e_muR", "colab_type": "text" }, "source": [ "## Imports" ] }, { "cell_type": "code", "metadata": { "id": "gDdwQ3uv0SQK", "colab_type": "code", "colab": {} }, "source": [ "# Get the interactive Tools for Matplotlib\n", "import matplotlib.pyplot as plt\n", "%matplotlib notebook\n", "%matplotlib inline\n", "\n", "import plotly.graph_objects as go\n", "import plotly.express as px\n", "\n", "\n", "\n", "# Get tools to download files and load them \n", "import pickle\n", "import urllib.request\n", "from os.path import exists as check_path\n", "from os import makedirs\n", "\n", "# Get tools to performe analysis\n", "import numpy as np\n", "from heapq import heappushpop\n", "from sklearn.decomposition import PCA" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "xa8hZgfQAauJ", "colab_type": "text" }, "source": [ "# Step 1: download files and load the list of word and their vector representations \n", " \n", "\n", "1. Run the next blocks:\n", "\n", "\n", "> 1. download_files_from_github\n", "> 2. 
load_word2vecfiles" ] }, { "cell_type": "code", "metadata": { "id": "qswqF_5O0SQh", "colab_type": "code", "colab": {} }, "source": [ "def download_files_from_github(file_target_dir):\n", " main_url = 'https://raw.githubusercontent.com/Goussha/word-vector-visualization/master/'\n", " if not check_path(file_target_dir):\n", " makedirs(file_target_dir)\n", " \n", " urls = [main_url+'file{}.p'.format(x) for x in range(1,9)]\n", " file_names = [file_target_dir+'file{}.p'.format(x) for x in range(1,9)]\n", " for file_name, url in zip(file_names, urls):\n", " if not check_path(file_name):\n", " print (\"Downloading file: \",file_name)\n", " filename, headers = urllib.request.urlretrieve(url, filename=file_name)\n", " else:\n", " print('Allready exists: {}'.format(file_name))" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "0aEmH8Kv0SQp", "colab_type": "code", "colab": {} }, "source": [ "def load_word2vecfiles(file_target_dir):\n", " word_dict_loded = {}\n", " for file_num in range(1,9):\n", " full_file_name = file_target_dir+'file{}.p'.format(file_num)\n", " print('Loading file: {}'.format(full_file_name))\n", " with open(full_file_name, 'rb') as fp:\n", " data = pickle.load(fp)\n", " word_dict_loded.update(data)\n", " return word_dict_loded" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "nbMazCLBwSZK", "colab_type": "text" }, "source": [ "Run the next cell to download and load the files from my github\n", "\n", "(should take about 15 sec to downloand and load)" ] }, { "cell_type": "code", "metadata": { "id": "_nyL2S4m0SQy", "colab_type": "code", "outputId": "cbd89b4b-05ef-4874-e688-521b47add0d6", "colab": { "base_uri": "https://localhost:8080/", "height": 327 } }, "source": [ "file_target_dir = \"./tmp/\"\n", "\n", "#Download files\n", "download_files_from_github(file_target_dir)\n", "#Load files and create dict\n", "word_dict = load_word2vecfiles(file_target_dir)" ], "execution_count": 5, "outputs": [ { "output_type": "stream", "text": [ "Downloading file: ./tmp/file1.p\n", "Downloading file: ./tmp/file2.p\n", "Downloading file: ./tmp/file3.p\n", "Downloading file: ./tmp/file4.p\n", "Downloading file: ./tmp/file5.p\n", "Downloading file: ./tmp/file6.p\n", "Downloading file: ./tmp/file7.p\n", "Downloading file: ./tmp/file8.p\n", "Loading file: ./tmp/file1.p\n", "Loading file: ./tmp/file2.p\n", "Loading file: ./tmp/file3.p\n", "Loading file: ./tmp/file4.p\n", "Loading file: ./tmp/file5.p\n", "Loading file: ./tmp/file6.p\n", "Loading file: ./tmp/file7.p\n", "Loading file: ./tmp/file8.p\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "54fgNBLFCA3g", "colab_type": "text" }, "source": [ "If you wish to check out the other word vector files, you can download them here ([zip files](https://nlp.stanford.edu/data/glove.6B.zip)).\n", "After downloading and unziping, uncomment the next cell and run it(to be added in the future)." 
] }, { "cell_type": "code", "metadata": { "id": "AybqspWFCAHB", "colab_type": "code", "colab": {} }, "source": [ "'''Not ready yet, to be added'''" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "t4JD1gAzELKe", "colab_type": "text" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "TdCkcvCkC5OW", "colab_type": "text" }, "source": [ "# cosine_similarity\n", "\n", "Cosine similarity reflects the degree of similarity between two vectors.\n", "\n", "As I mentioned, there are more efficiant ways to do this, but in this way, you are able to see exactly what is being calcualed.\n", "\n", "Run the next cell" ] }, { "cell_type": "code", "metadata": { "id": "2hYFUUEB0SQ7", "colab_type": "code", "colab": {} }, "source": [ "def cosine_similarity(u, v):\n", " \"\"\"\n", " Cosine similarity reflects the degree of similarity between u and v\n", " \n", " Arguments:\n", " u -- a word vector of shape (n,) \n", " v -- a word vector of shape (n,)\n", "\n", " Returns:\n", " cosine_similarity -- the cosine similarity between u and v defined by the formula above.\n", " \"\"\"\n", " \n", " distance = 0.0\n", " epsilon=1e-10 #Prevent dividing by 0\n", " # Compute the dot product between u and v (≈1 line)\n", " dot = np.dot(u.T,v)\n", " # Compute the L2 norm of u (≈1 line)\n", " norm_u = np.sqrt(np.sum(u**2))\n", " \n", " # Compute the L2 norm of v (≈1 line)\n", " norm_v = np.sqrt(np.sum(v**2))\n", " # Compute the cosine similarity defined by formula (1) (≈1 line)\n", " cosine_similarity = dot/((norm_u*norm_v)+epsilon)\n", " \n", " return cosine_similarity " ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ICMSzuiIEV_e", "colab_type": "text" }, "source": [ "# most_k_similar \n", "\n", "A function the finds the most similar word to the input word by calculating the cosine similarity between the word vector and the other word vectors and returning K most similar words" ] }, { "cell_type": "code", "metadata": { "id": "Yoy_z5z10SRO", "colab_type": "code", "colab": {} }, "source": [ "def most_k_similar(word_in,word_dict,k=1):\n", " \"\"\"\n", " most_k_similar finds most similar k number of words\n", " \n", " Arguments:\n", " word_in -- a word in the corpus\n", " word_dict -- dictinary of word - word vector pairs\n", " k -- number of words to return\n", " Returns:\n", " list of most similar words\n", " \"\"\"\n", " words = word_dict.keys()\n", " word_vec = word_dict[word_in]\n", " \n", " \n", " most_similars_heap = [(-100, '') for _ in range(k)]\n", "\n", " \n", " for w in words:\n", " if w==word_in:\n", " continue\n", " \n", " cosine_sim = cosine_similarity(word_vec, word_dict[w])\n", " heappushpop(most_similars_heap, (cosine_sim, w))\n", " most_similars_tuples = [tup for tup in most_similars_heap] \n", " _,best_words = zip(*most_similars_tuples)\n", " return best_words" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "HTf_LwIGGAnb", "colab_type": "text" }, "source": [ "# doesnt_match\n", "takes list of words and returns the word the doesnt match by comaparing the cosine similarities between each word with all other words and returning the words with the lowest score" ] }, { "cell_type": "code", "metadata": { "id": "BBv-TTVI0SRV", "colab_type": "code", "colab": {} }, "source": [ "def doesnt_match(words,word_dict):\n", " \n", " dots_tot = []\n", " for w in words:\n", " dots = 0\n", " for w2 in words:\n", " if w2 == w:\n", " continue\n", " v = word_dict[w]\n", " u = 
word_dict[w2]\n", " dots=dots+cosine_similarity(v,u)\n", " \n", " dots_tot.append(dots)\n", " \n", " return(words[np.argmin(dots_tot)]) \n", " \n", " " ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "A779HNu1GVZW", "colab_type": "text" }, "source": [ "# complete_analogy\n", "\n", "To find the analogy between words, this function subtracks one word vector from the other, and then add the difference to the vector of the third word.\n", "The difference between two word vectors represents the difference between their meaning, or the relationship between them, also know as their analogy.\n", "By adding this difference to a diferent word vector, you can find a forth word that has the same relationship with word 3 as words 1 and 2 have\n", "\n", "meaning: man is to king as woman is to X (queen)" ] }, { "cell_type": "code", "metadata": { "id": "9pa8ShYX0SRB", "colab_type": "code", "colab": {} }, "source": [ "def complete_analogy(word_a, word_b, word_c, word_dict):\n", " \"\"\"\n", " Performs the word analogy task as explained above: a is to b as c is to ____. \n", " \n", " Arguments:\n", " word_a -- a word, string\n", " word_b -- a word, string\n", " word_c -- a word, string\n", " word_to_vec_map -- dictionary that maps words to their corresponding vectors. \n", " \n", " Returns:\n", " best_word -- the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity\n", " \"\"\"\n", " \n", " # convert words to lowercase\n", " word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()\n", " \n", " # Get the word embeddings e_a, e_b and e_c (≈1-3 lines)\n", " e_a, e_b, e_c = word_dict[word_a],word_dict[word_b],word_dict[word_c]\n", " \n", " words = word_dict.keys()\n", " max_cosine_sim = -100 # Initialize max_cosine_sim to a large negative number\n", " best_word = None # Initialize best_word with None, it will help keep track of the word to output\n", "\n", " # to avoid best_word being one of the input words, skip the input words\n", " # place the input words in a set for faster searching than a list\n", " # We will re-use this set of input words inside the for-loop\n", " input_words_set = set([word_a, word_b, word_c])\n", " \n", " # loop over the whole word vector set\n", " \n", " cnt=1\n", " \n", " for w in words: \n", " # to avoid best_word being one of the input words, skip the input words\n", " if w in input_words_set:\n", " continue\n", " \n", " #Compute cosine similarity between the vector (e_b - e_a) and the vector ((w's vector representation) - e_c) (≈1 line)\n", " cosine_sim = cosine_similarity(e_b - e_a, word_dict[w]- e_c)\n", " \n", " # If the cosine_sim is more than the max_cosine_sim seen so far,\n", " # then: set the new max_cosine_sim to the current cosine_sim and the best_word to the current word (≈3 lines)\n", " if cosine_sim > max_cosine_sim:\n", " max_cosine_sim = cosine_sim\n", " best_word = w\n", " return best_word" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "onD2sqwWKCKa", "colab_type": "text" }, "source": [ "# Time to run some code and see the results\n", "Try replacing the word and running again" ] }, { "cell_type": "markdown", "metadata": { "id": "vlB-lIFtKac0", "colab_type": "text" }, "source": [ "### Examples of \"dosnt match\"" ] }, { "cell_type": "code", "metadata": { "id": "XLp5MeY80SRe", "colab_type": "code", "outputId": "145ac157-78b3-4dc2-ce1c-a55d796357ad", "colab": { "base_uri": "https://localhost:8080/", "height": 36 } }, "source": [ 
"doesnt_match(['red','two','one','four'],word_dict)" ], "execution_count": 10, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'red'" ] }, "metadata": { "tags": [] }, "execution_count": 10 } ] }, { "cell_type": "code", "metadata": { "id": "qeTNmoct0SRm", "colab_type": "code", "outputId": "df812734-f171-4c69-e929-cf3b4be4b09e", "colab": { "base_uri": "https://localhost:8080/", "height": 36 } }, "source": [ "doesnt_match(['red','one','blue','orange'],word_dict)" ], "execution_count": 11, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'one'" ] }, "metadata": { "tags": [] }, "execution_count": 11 } ] }, { "cell_type": "code", "metadata": { "id": "kLWincuv0SRt", "colab_type": "code", "outputId": "d512c559-0196-4d0a-c08f-accc71ecffa0", "colab": { "base_uri": "https://localhost:8080/", "height": 36 } }, "source": [ "doesnt_match(['up','down','yes','back','front'],word_dict)" ], "execution_count": 12, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'yes'" ] }, "metadata": { "tags": [] }, "execution_count": 12 } ] }, { "cell_type": "code", "metadata": { "id": "Xa_vGeb30SRz", "colab_type": "code", "outputId": "d6d3a238-a50d-4e06-f3ed-62f3201637b7", "colab": { "base_uri": "https://localhost:8080/", "height": 36 } }, "source": [ "doesnt_match(['big','small','huge'],word_dict)" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'small'" ] }, "metadata": { "tags": [] }, "execution_count": 14 } ] }, { "cell_type": "code", "metadata": { "id": "1Zs3VfV60SR5", "colab_type": "code", "outputId": "ee4ab817-c24b-4e19-f029-3364c6210756", "colab": { "base_uri": "https://localhost:8080/", "height": 36 } }, "source": [ "doesnt_match(['big','small','tiny'],word_dict)" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'big'" ] }, "metadata": { "tags": [] }, "execution_count": 15 } ] }, { "cell_type": "markdown", "metadata": { "id": "WO_bZx9YKrCJ", "colab_type": "text" }, "source": [ "### Examples of \"most similar\"" ] }, { "cell_type": "code", "metadata": { "id": "dHVo1N-h0SR_", "colab_type": "code", "outputId": "fec5d5c5-a2db-4a5f-f4cc-e8ea4cc1b7af", "colab": { "base_uri": "https://localhost:8080/", "height": 36 } }, "source": [ "print(most_k_similar('small',word_dict,10))" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "('few', 'one', 'typically', 'mostly', 'usually', 'smaller', 'well', 'larger', 'tiny', 'large')\n" ], "name": "stdout" } ] }, { "cell_type": "code", "metadata": { "id": "pJLFNGleK2dD", "colab_type": "code", "outputId": "6b3d0941-8072-4d69-c858-666412b2f772", "colab": { "base_uri": "https://localhost:8080/", "height": 36 } }, "source": [ "print(most_k_similar('god',word_dict,10))" ], "execution_count": 0, "outputs": [ { "output_type": "stream", "text": [ "('gods', 'true', 'jesus', 'sacred', 'christ', 'faith', 'allah', 'heaven', 'holy', 'divine')\n" ], "name": "stdout" } ] }, { "cell_type": "markdown", "metadata": { "id": "IuWfg7KkLCSc", "colab_type": "text" }, "source": [ "### Examples of Analogies\n", "a to b as c is to X" ] }, { "cell_type": "code", "metadata": { "id": "_kd5Ab1a0SSF", "colab_type": "code", "outputId": "e710c6e1-93e0-49ef-f7df-c0e63651094b", "colab": { "base_uri": "https://localhost:8080/", "height": 36 } }, "source": [ "complete_analogy('man', 'woman', 'actor', word_dict)" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ 
"'actress'" ] }, "metadata": { "tags": [] }, "execution_count": 17 } ] }, { "cell_type": "code", "metadata": { "id": "juZRsOgB0SSM", "colab_type": "code", "outputId": "732112db-ff74-456e-8a9a-53f8ee88cee8", "colab": { "base_uri": "https://localhost:8080/", "height": 36 } }, "source": [ "complete_analogy('man', 'king', 'woman', word_dict)" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'queen'" ] }, "metadata": { "tags": [] }, "execution_count": 18 } ] }, { "cell_type": "code", "metadata": { "id": "tjYqaqv70SSW", "colab_type": "code", "outputId": "22bc3a7e-ae85-44d1-a11b-5e0f7ecd4b1c", "colab": { "base_uri": "https://localhost:8080/", "height": 36 } }, "source": [ "complete_analogy('japan', 'japanese', 'australia',word_dict)" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'british'" ] }, "metadata": { "tags": [] }, "execution_count": 19 } ] }, { "cell_type": "code", "metadata": { "id": "5UaV13fw0SSr", "colab_type": "code", "outputId": "6a185cc3-c0d6-46bf-f4f7-9de9743191ef", "colab": { "base_uri": "https://localhost:8080/", "height": 36 } }, "source": [ "complete_analogy('usa', 'obama', 'israel',word_dict)" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'netanyahu'" ] }, "metadata": { "tags": [] }, "execution_count": 27 } ] }, { "cell_type": "code", "metadata": { "id": "72VupA3E0SSv", "colab_type": "code", "outputId": "d645ec16-0714-4ad0-8c6f-fbfc266e64d4", "colab": {} }, "source": [ "complete_analogy('tall', 'tallest', 'long',word_dict)" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'longest'" ] }, "metadata": { "tags": [] }, "execution_count": 92 } ] }, { "cell_type": "code", "metadata": { "id": "pNp34uYD0SS2", "colab_type": "code", "outputId": "ef39c758-cb8d-4a12-a829-502b9b8af245", "colab": {} }, "source": [ "complete_analogy('good', 'fantastic', 'bad',word_dict)" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'incredible'" ] }, "metadata": { "tags": [] }, "execution_count": 93 } ] }, { "cell_type": "code", "metadata": { "id": "Va_Q37wh0SS7", "colab_type": "code", "outputId": "8776ac9f-a777-4268-d53c-745d63d4391a", "colab": {} }, "source": [ "complete_analogy('germany', 'berlin', 'israel',word_dict)" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'jerusalem'" ] }, "metadata": { "tags": [] }, "execution_count": 98 } ] }, { "cell_type": "code", "metadata": { "id": "R3YDiZSG0SS_", "colab_type": "code", "outputId": "7541d744-e6e8-4087-f85e-70c7f52c8346", "colab": {} }, "source": [ "complete_analogy('germany', 'europe', 'israel',word_dict)" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'asia'" ] }, "metadata": { "tags": [] }, "execution_count": 99 } ] }, { "cell_type": "code", "metadata": { "id": "L0V1mmz50STG", "colab_type": "code", "outputId": "d86cebdc-0218-4a89-9821-c185413b52b6", "colab": {} }, "source": [ "complete_analogy('good', 'bad', 'up',word_dict)" ], "execution_count": 0, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'subprime'" ] }, "metadata": { "tags": [] }, "execution_count": 285 } ] }, { "cell_type": "markdown", "metadata": { "id": "v1AalZ2ALMl4", "colab_type": "text" }, "source": [ "# Time for some visualizations\n", "run the next cell" ] }, { "cell_type": "markdown", "metadata": 
{ "id": "QjqaKwTpLZBs", "colab_type": "text" }, "source": [ "# display_pca_scatterplot_iplot\n", "Takes a list of words and their vector representations\n", "Since we can't plot 50 dimention vectors, we must extract the most important feuters. We are doing it here using PCA, but this can be done in other methods such as t-SNE.\n", "\n", "The plots are intercative, and you can zoom in and out or rotate the axes.\n", "\n", "Inputs:\n", "\n", "\n", "* word_dict = word_widt, ditionary with all word and word vector pairs\n", "* words = None, a list of words to plot, if is None, randomly pick sample sized words from word_dict\n", "* sample = 0, number of words to display if words is none, is sample is 0, display all 400k words, this will take some time\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "lngnC1H60STk", "colab_type": "code", "colab": {} }, "source": [ "def display_pca_scatterplot_iplot(word_dict=word_dict, words=None, sample=0):\n", " if words == None:\n", " if sample > 0:\n", " words = np.random.choice(list(word_dict.keys()), sample)\n", " else:\n", " words = word_dict.keys()\n", " \n", " word_vectors = np.array([word_dict[w] for w in words])\n", "\n", " twodim = PCA().fit_transform(word_vectors)[:,:4]\n", "\n", " fig = px.scatter_3d (x=twodim[:,0], y=twodim[:,1],z=twodim[:,2],color=twodim[:,3], text=words,size_max=60)\n", " #fig = px.scatter(x=twodim[:,0], y=twodim[:,1],color=twodim[:,2], text=words, log_x=False, size_max=60)\n", "\n", " fig.update_traces({'textposition':'top center','marker':{'showscale':False}})\n", "\n", " fig.update_layout(\n", " title_text='PCA',\n", " paper_bgcolor = 'rgba(0,0,0,0)',\n", "\t\t\t\tplot_bgcolor = 'rgba(0,0,0,0)',\n", " scene = dict(xaxis= dict(nticks=40, range=[-10,10],),\n", " yaxis = dict(nticks=40, range=[-10,10],),\n", " zaxis = dict(nticks=40, range=[-10,10])))\n", "\n", " fig.show()" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "xa3hNIVLwun0", "colab_type": "text" }, "source": [ "### Royalty and gender\n", "![Relatation between royalty and gender](https://i.imgur.com/jEmLYj7.png)" ] }, { "cell_type": "code", "metadata": { "id": "PxDjvcLrEwk6", "colab_type": "code", "colab": {} }, "source": [ "display_pca_scatterplot_iplot(word_dict, \n", " ['man','woman','king','queen','prince','princess','duke','duchess'])" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "QFxC9eJu0zjJ", "colab_type": "text" }, "source": [ "### Countries and languages\n", "\n", "\n", "![Countries](https://i.imgur.com/eoyNqLX.png)" ] }, { "cell_type": "code", "metadata": { "id": "eLCPodOPFFAD", "colab_type": "code", "colab": {} }, "source": [ "display_pca_scatterplot_iplot(word_dict, \n", " ['germany','german','italy','italian','england','english','russia','russian'])" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "53HDd0iS1gJA", "colab_type": "text" }, "source": [ "### Contries and capitals\n", "\n", "![alt text](https://i.imgur.com/rTuvT8k.png)" ] }, { "cell_type": "code", "metadata": { "id": "UXYp-oZgG51W", "colab_type": "code", "colab": {} }, "source": [ "display_pca_scatterplot_iplot(word_dict, \n", " ['israel','jerusalem','italy','rome','france','paris','ireland','dublin'])" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "PSuXSXuh300J", "colab_type": "text" }, "source": [ "Familiy relations\n", "\n", "![alt text](https://i.imgur.com/daXBOZZ.png)" ] }, { "cell_type": "code", 
"metadata": { "id": "XdJzszQV0ST4", "colab_type": "code", "colab": {} }, "source": [ "display_pca_scatterplot_iplot(word_dict, \n", " ['son','father','daughter','mother','uncle','aunt','niece','nephew'])" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "IpmALlhh4DUE", "colab_type": "text" }, "source": [ "### Formal and informal\n", "\n", "![alt text](https://i.imgur.com/SyaMpH2.png)" ] }, { "cell_type": "code", "metadata": { "id": "8i0b-ebi4Dsm", "colab_type": "code", "colab": {} }, "source": [ "display_pca_scatterplot_iplot(word_dict, \n", " ['son','father','daughter','mother','dad','mom','grandpa','grandma','grandfather','grandmother'])" ], "execution_count": 0, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "7M6BzzOG0STo", "colab_type": "code", "outputId": "f2e1cc27-9d60-46c1-a469-b8ad7373bb1b", "colab": { "base_uri": "https://localhost:8080/", "height": 542 } }, "source": [ "display_pca_scatterplot_iplot(word_dict, \n", " ['coffee', 'tea', 'beer', 'wine', 'brandy', 'rum', 'champagne', 'water',\n", " 'spaghetti', 'borscht', 'hamburger', 'pizza', 'falafel', 'sushi', 'meatballs',\n", " 'dog', 'horse', 'cat', 'monkey', 'parrot', 'koala', 'lizard',\n", " 'frog', 'toad', 'monkey', 'ape', 'kangaroo', 'wombat', 'wolf',\n", " 'france', 'germany', 'hungary', 'luxembourg', 'australia', 'fiji', 'china',\n", " 'homework', 'assignment', 'problem', 'exam', 'test', 'class',\n", " 'school', 'college', 'university', 'institute','soda','fanta','cars'])" ], "execution_count": 0, "outputs": [ { "output_type": "display_data", "data": { "text/html": [ "\n", "\n", "\n", "
\n", " \n", " \n", " \n", "
\n", " \n", "
\n", "\n", "" ] }, "metadata": { "tags": [] } } ] }, { "cell_type": "markdown", "metadata": { "id": "qUjqORqO4-hD", "colab_type": "text" }, "source": [ "Pick a few words and find 10 similar words to each and display them" ] }, { "cell_type": "code", "metadata": { "id": "V19tbvd90STr", "colab_type": "code", "colab": {} }, "source": [ "word_list=[]\n", "\n", "word_list.append(most_k_similar('israel' ,word_dict,10))\n", "word_list.append(most_k_similar('chair' ,word_dict,10))\n", "word_list.append(most_k_similar('spain' ,word_dict,10))\n", "word_list.append(most_k_similar('good' ,word_dict,10))\n", "\n", "\n", "words = [i for sub in word_list for i in sub]\n", "\n", "display_pca_scatterplot_iplot(word_dict, words)" ], "execution_count": 0, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "hW15eCvg4Y38", "colab_type": "text" }, "source": [ "Randomly pick 300 words and display their vectors" ] }, { "cell_type": "code", "metadata": { "id": "qk5RYSvN0ST1", "colab_type": "code", "colab": {} }, "source": [ "display_pca_scatterplot_iplot(word_dict, sample=300)" ], "execution_count": 0, "outputs": [] } ] }