{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/programminghistorian/ph-submissions/blob/gh-pages/assets/understanding-creating-word-embeddings/understanding-creating-word-embeddings.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "h6duh5xbRNEJ"
      },
      "source": [
        "## Preparing Your Corpus"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "isxr7OWCRNEK"
      },
      "outputs": [],
      "source": [
        "# A good practice in programming is to place your import statements at the top of your code, and to keep them together\n",
        "\n",
        "import re                                   # for regular expressions\n",
        "import os                                   # to look up operating system-based info\n",
        "import string                               # to do fancy things with strings\n",
        "import glob                                 # to locate a specific file type\n",
        "from pathlib import Path                    # to access files in other directories\n",
        "import gensim                               # to access Word2Vec\n",
        "from gensim.models import Word2Vec          # to access Gensim's flavor of Word2Vec\n",
        "import pandas as pd                         # to sort and organize data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "eKVrKznPRNEM"
      },
      "outputs": [],
      "source": [
        "dirpath = r'FILL IN YOUR FILEPATH HERE' # get file path (you can change this)\n",
        "file_type = \".txt\" # if your data is not in a plain text format, you can change this\n",
        "filenames = []\n",
        "data = []\n",
        "\n",
        " # this for loop will run through folders and subfolders looking for a specific file type\n",
        "for root, dirs, files in os.walk(dirpath, topdown=False):\n",
        "   # look through all the files in the given directory\n",
        "   for name in files:\n",
        "       if (root + os.sep + name).endswith(file_type):\n",
        "           filenames.append(os.path.join(root, name))\n",
        "   # look through all the directories\n",
        "   for name in dirs:\n",
        "       if (root + os.sep + name).endswith(file_type):\n",
        "           filenames.append(os.path.join(root, name))\n",
        "\n",
        "# this for loop then goes through the list of files, reads them, and then adds the text to a list\n",
        "for filename in filenames:\n",
        "    with open(filename) as afile:\n",
        "        print(filename)\n",
        "        data.append(afile.read()) # read the file and then add it to the list\n",
        "        afile.close() # close the file when you're done\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "NtYM6GqkRNEN"
      },
      "outputs": [],
      "source": [
        "def clean_text(text):\n",
        "\n",
        "    # Cleans the given text using regular expressions to split and lower-cased versions to create\n",
        "    # a list of tokens for each text.\n",
        "    # The function accepts a list of texts and returns a list of of lists of tokens\n",
        "\n",
        "\n",
        "    # lower case\n",
        "    tokens = text.split()\n",
        "    tokens = [t.lower() for t in tokens]\n",
        "\n",
        "    # remove punctuation using regular expressions\n",
        "    # this line of code locates the punctuation within the given text and compiles that punctuation into a single variable\n",
        "    re_punc = re.compile('[%s]' % re.escape(string.punctuation))\n",
        "    # this line of code substitutes the punctuation we just compiled with nothing ''\n",
        "    tokens = [re_punc.sub('', token) for token in tokens]\n",
        "\n",
        "    # only include tokens that aren't numbers\n",
        "    tokens = [token for token in tokens if token.isalpha()]\n",
        "    return tokens"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Y1LV2YUlRNEN"
      },
      "outputs": [],
      "source": [
        "# clean text from folder of text files, stored in the data variable\n",
        "data_clean = []\n",
        "for x in data:\n",
        "    data_clean.append(clean_text(x))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "GwymH-JJRNEP"
      },
      "outputs": [],
      "source": [
        "# Check that the length of data and the length of data_clean are the same. Both numbers printed should be the same\n",
        "\n",
        "print(len(data))\n",
        "print(len(data_clean))"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Dmx41Jj6RNEV"
      },
      "outputs": [],
      "source": [
        "# check that the first item in data and the first item in data_clean are the same.\n",
        "# both print statements should print the same word, with the data cleaning function applied in the second one\n",
        "\n",
        "print(data[0].split()[0])\n",
        "print(data_clean[0][0])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "PI7lVjQNRNEW"
      },
      "outputs": [],
      "source": [
        "# check that the last item in data_clean and the last item in data are the same\n",
        "# both print statements should print the same word, with the data cleaning function applied in the second one\n",
        "\n",
        "print(data[0].split()[-1])\n",
        "print(data_clean[0][-1])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jLJ-md6hRNEW"
      },
      "source": [
        "## Model Creation"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "GRr9hXL3RNEX"
      },
      "outputs": [],
      "source": [
        "# train the model\n",
        "model = Word2Vec(sentences=data_clean, window=5, min_count=3, workers=4, epochs=5, sg=1)\n",
        "\n",
        "# save the model\n",
        "model.save(\"word2vec.model\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "GL6YUiwARNEX"
      },
      "outputs": [],
      "source": [
        "# load the model\n",
        "\n",
        "model = Word2Vec.load(\"word2vec.model\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KilAYSEKRNEX"
      },
      "source": [
        "## Analysis ##\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xg2f-F57RNEY"
      },
      "source": [
        "### Exploratory Queries"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "ZExIq71_RNEY"
      },
      "outputs": [],
      "source": [
        "# set the word that we are checking for\n",
        "word = \"milk\"\n",
        "\n",
        "# if that word is in our vocabulary\n",
        "if word in model.wv.key_to_index:\n",
        "\n",
        "    # print a statement to let us know\n",
        "    print(\"The word %s is in your model vocabulary\" % word)\n",
        "\n",
        "# otherwise, let us know that it isn't\n",
        "else:\n",
        "    print(\"%s is not in your model vocabulary\" % word)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "qdqLmC_9RNEY"
      },
      "outputs": [],
      "source": [
        "# returns a list with the top ten words used in similar contexts to the word \"milk\"\n",
        "model.wv.most_similar('milk', topn=10)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "dGJg7v2CRNEZ"
      },
      "outputs": [],
      "source": [
        "# returns the top ten most similar words to \"recipe\" that are dissimilar from \"milk\"\n",
        "model.wv.most_similar(positive = [\"recipe\"], negative=[\"milk\"], topn=10)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QaqlhBCqRNFk"
      },
      "outputs": [],
      "source": [
        "# returns the top ten most similar words to both \"recipe\" and \"milk\"\n",
        "model.wv.most_similar(positive = [\"recipe\", \"milk\"], topn=10)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "UKHmlCX9RNFk"
      },
      "outputs": [],
      "source": [
        "# returns a cosine similarity score for the two words you provide\n",
        "model.wv.similarity(\"milk\", \"cream\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "RvLGD52YRNFl"
      },
      "outputs": [],
      "source": [
        "# returns a prediction for the other words in a set containing the words \"flour,\" \"eggs,\" and \"cream\"\n",
        "model.predict_output_word([ \"flour\", \"eggs\", \"cream\"])"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "9y8YgYSIRNFl"
      },
      "outputs": [],
      "source": [
        "# displays the number of words in your model's vocabulary\n",
        "print(len(model.wv))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "M5vzCZfrRNFl"
      },
      "source": [
        "## Validation ##\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "rewat5AtRNFl"
      },
      "outputs": [],
      "source": [
        "dirpath = Path(r\".\").glob('*.model') #current directory plus only files that end in 'model'\n",
        "files = dirpath\n",
        "model_list = [] # a list to hold the actual models\n",
        "model_filenames = []  # the filepath for the models so we know where they came from"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "XXcvzlI9RNFm"
      },
      "outputs": [],
      "source": [
        "#this for loop looks for files that end with \".model\" loads them, and then adds those to a list\n",
        "for filename in files:\n",
        "    # turn the filename into a string and save it to \"file_path\"\n",
        "    file_path = str(filename)\n",
        "    print(file_path)\n",
        "    # load the model with the file_path\n",
        "    model = Word2Vec.load(file_path)\n",
        "    # add the model to our mode_list\n",
        "    model_list.append(model)\n",
        "    # add the filepath to the model_filenames list\n",
        "    model_filenames.append(file_path)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Cp85yubjRNFr"
      },
      "outputs": [],
      "source": [
        "# set the word that we are checking for\n",
        "word = \"milk\"\n",
        "\n",
        "# if that word is in our vocabulary\n",
        "if word in model.wv.key_to_index:\n",
        "\n",
        "    # print a statement to let us know\n",
        "    print(\"The word %s is in your model vocabulary\" % word)\n",
        "\n",
        "# otherwise, let us know that it isn't\n",
        "else:\n",
        "    print(\"%s is not in your model vocabulary\" % word)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "8DqUDPceRNFr"
      },
      "outputs": [],
      "source": [
        "#test word pairs that we are going to use to evaluate the models\n",
        "test_words = [(\"stir\", \"whisk\"),\n",
        "             (\"cream\", \"milk\"),\n",
        "             (\"cake\", \"muffin\"),\n",
        "             (\"jam\", \"jelly\"),\n",
        "             (\"reserve\", \"save\"),\n",
        "             (\"bake\", \"cook\")]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "JhMpSitARNFs"
      },
      "outputs": [],
      "source": [
        "# these for loops will go through each list, the test word list and the models list,\n",
        "# and will run all the words through each model\n",
        "# then the results will be added to a dataframe\n",
        "\n",
        "# since NumPy 19.0, sometimes working with arrays of conflicting dimensions will throw a deprecation warning\n",
        "# this warning does not impact the code or the results, so we're going to filter it out\n",
        "# you can also specify \"dtype=object\" on the resulting array\n",
        "# np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)\n",
        "\n",
        "# create an empty dataframe with the column headings we need\n",
        "evaluation_results = pd.DataFrame(columns=['Model', 'Test Words', 'Cosine Similarity'], dtype=object)\n",
        "\n",
        "# iterate though the model_list\n",
        "for i in range(len(model_list)):\n",
        "\n",
        "    # for each model in model_list, test the tuple pairs\n",
        "    for x in range(len(test_words)):\n",
        "\n",
        "        # calculate the similarity score for each tuple\n",
        "        similarity_score = model_list[i].wv.similarity(*test_words[x])\n",
        "\n",
        "        # create a temporary dataframe with the test results\n",
        "        df = [model_filenames[i], test_words[x], similarity_score]\n",
        "\n",
        "        # add the temporary dataframe to our final dataframe\n",
        "        evaluation_results.loc[x] = df\n",
        "\n",
        "\n",
        "# save the evaluation_results dataframe as a .csv called \"word2vec_model_evaluation.csv\" in our current directory\n",
        "# if you want the .csv saved somewhere specific, include the filepath in the .to_csv() call\n",
        "evaluation_results.to_csv('word2vec_model_evaluation.csv')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "55GgDWPTRNFs"
      },
      "source": [
        "## Next Steps\n",
        "\n",
        "Here are some resources if you would like to learn more about word vectors:\n",
        "\n",
        "- The Women [Writers Vector Toolkit](https://wwp.northeastern.edu/lab/wwvt/index.html) is a web interface for exploring word vectors, accompanied by glossaries, sources, case studies, and sample assignments. This toolkit includes links to a [GitHub repository with RMD walkthroughs](https://github.com/NEU-DSG/wwp-public-code-share/tree/main/WordVectors) with code for training word2vec models in R, as well as [download and resources on preparing text corpora](https://wwp.northeastern.edu/lab/wwvt/resources/downloads/index.html).\n",
        "\n",
        "- The [Women Writers Project Resources](https://wwp.northeastern.edu/outreach/resources/index.html) page has guides on: searching your corpus, corpus analysis and preparation, model validation and assessment, and other materials for working with word vectors.\n",
        "\n",
        "- [Link to other PH tutorial in draft]\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1peZm7VYRNFs"
      },
      "source": [
        "\n",
        "_This walkthrough was written on November 16, 2022 using Python 3.8.3 and Gensim 4.2.0_"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.9"
    },
    "colab": {
      "provenance": [],
      "include_colab_link": true
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}