{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "h6duh5xbRNEJ" }, "source": [ "## Preparing Your Corpus" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "isxr7OWCRNEK" }, "outputs": [], "source": [ "# A good practice in programming is to place your import statements at the top of your code, and to keep them together\n", "\n", "import re # for regular expressions\n", "import os # to look up operating system-based info\n", "import string # to do fancy things with strings\n", "import glob # to locate a specific file type\n", "from pathlib import Path # to access files in other directories\n", "import gensim # to access Word2Vec\n", "from gensim.models import Word2Vec # to access Gensim's flavor of Word2Vec\n", "import pandas as pd # to sort and organize data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "eKVrKznPRNEM" }, "outputs": [], "source": [ "dirpath = r'FILL IN YOUR FILEPATH HERE' # get file path (you can change this)\n", "file_type = \".txt\" # if your data is not in a plain text format, you can change this\n", "filenames = []\n", "data = []\n", "\n", " # this for loop will run through folders and subfolders looking for a specific file type\n", "for root, dirs, files in os.walk(dirpath, topdown=False):\n", " # look through all the files in the given directory\n", " for name in files:\n", " if (root + os.sep + name).endswith(file_type):\n", " filenames.append(os.path.join(root, name))\n", " # look through all the directories\n", " for name in dirs:\n", " if (root + os.sep + name).endswith(file_type):\n", " filenames.append(os.path.join(root, name))\n", "\n", "# this for loop then goes through the list of files, reads them, and then adds the text to a list\n", "for filename in filenames:\n", " with open(filename) as afile:\n", " print(filename)\n", " data.append(afile.read()) # read the file and then add it to the list\n", " afile.close() # close the file when you're done\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "NtYM6GqkRNEN" }, "outputs": [], "source": [ "def clean_text(text):\n", "\n", " # Cleans the given text using regular expressions to split and lower-cased versions to create\n", " # a list of tokens for each text.\n", " # The function accepts a list of texts and returns a list of of lists of tokens\n", "\n", "\n", " # lower case\n", " tokens = text.split()\n", " tokens = [t.lower() for t in tokens]\n", "\n", " # remove punctuation using regular expressions\n", " # this line of code locates the punctuation within the given text and compiles that punctuation into a single variable\n", " re_punc = re.compile('[%s]' % re.escape(string.punctuation))\n", " # this line of code substitutes the punctuation we just compiled with nothing ''\n", " tokens = [re_punc.sub('', token) for token in tokens]\n", "\n", " # only include tokens that aren't numbers\n", " tokens = [token for token in tokens if token.isalpha()]\n", " return tokens" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Y1LV2YUlRNEN" }, "outputs": [], "source": [ "# clean text from folder of text files, stored in the data variable\n", "data_clean = []\n", "for x in data:\n", " data_clean.append(clean_text(x))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GwymH-JJRNEP" }, "outputs": [], "source": [ "# Check that the length of data and the length of data_clean are the same. Both numbers printed should be the same\n", "\n", "print(len(data))\n", "print(len(data_clean))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Dmx41Jj6RNEV" }, "outputs": [], "source": [ "# check that the first item in data and the first item in data_clean are the same.\n", "# both print statements should print the same word, with the data cleaning function applied in the second one\n", "\n", "print(data[0].split()[0])\n", "print(data_clean[0][0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PI7lVjQNRNEW" }, "outputs": [], "source": [ "# check that the last item in data_clean and the last item in data are the same\n", "# both print statements should print the same word, with the data cleaning function applied in the second one\n", "\n", "print(data[0].split()[-1])\n", "print(data_clean[0][-1])" ] }, { "cell_type": "markdown", "metadata": { "id": "jLJ-md6hRNEW" }, "source": [ "## Model Creation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GRr9hXL3RNEX" }, "outputs": [], "source": [ "# train the model\n", "model = Word2Vec(sentences=data_clean, window=5, min_count=3, workers=4, epochs=5, sg=1)\n", "\n", "# save the model\n", "model.save(\"word2vec.model\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "GL6YUiwARNEX" }, "outputs": [], "source": [ "# load the model\n", "\n", "model = Word2Vec.load(\"word2vec.model\")" ] }, { "cell_type": "markdown", "metadata": { "id": "KilAYSEKRNEX" }, "source": [ "## Analysis ##\n" ] }, { "cell_type": "markdown", "metadata": { "id": "xg2f-F57RNEY" }, "source": [ "### Exploratory Queries" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZExIq71_RNEY" }, "outputs": [], "source": [ "# set the word that we are checking for\n", "word = \"milk\"\n", "\n", "# if that word is in our vocabulary\n", "if word in model.wv.key_to_index:\n", "\n", " # print a statement to let us know\n", " print(\"The word %s is in your model vocabulary\" % word)\n", "\n", "# otherwise, let us know that it isn't\n", "else:\n", " print(\"%s is not in your model vocabulary\" % word)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qdqLmC_9RNEY" }, "outputs": [], "source": [ "# returns a list with the top ten words used in similar contexts to the word \"milk\"\n", "model.wv.most_similar('milk', topn=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "dGJg7v2CRNEZ" }, "outputs": [], "source": [ "# returns the top ten most similar words to \"recipe\" that are dissimilar from \"milk\"\n", "model.wv.most_similar(positive = [\"recipe\"], negative=[\"milk\"], topn=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QaqlhBCqRNFk" }, "outputs": [], "source": [ "# returns the top ten most similar words to both \"recipe\" and \"milk\"\n", "model.wv.most_similar(positive = [\"recipe\", \"milk\"], topn=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UKHmlCX9RNFk" }, "outputs": [], "source": [ "# returns a cosine similarity score for the two words you provide\n", "model.wv.similarity(\"milk\", \"cream\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RvLGD52YRNFl" }, "outputs": [], "source": [ "# returns a prediction for the other words in a set containing the words \"flour,\" \"eggs,\" and \"cream\"\n", "model.predict_output_word([ \"flour\", \"eggs\", \"cream\"])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "9y8YgYSIRNFl" }, "outputs": [], "source": [ "# displays the number of words in your model's vocabulary\n", "print(len(model.wv))" ] }, { "cell_type": "markdown", "metadata": { "id": "M5vzCZfrRNFl" }, "source": [ "## Validation ##\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rewat5AtRNFl" }, "outputs": [], "source": [ "dirpath = Path(r\".\").glob('*.model') #current directory plus only files that end in 'model'\n", "files = dirpath\n", "model_list = [] # a list to hold the actual models\n", "model_filenames = [] # the filepath for the models so we know where they came from" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "XXcvzlI9RNFm" }, "outputs": [], "source": [ "#this for loop looks for files that end with \".model\" loads them, and then adds those to a list\n", "for filename in files:\n", " # turn the filename into a string and save it to \"file_path\"\n", " file_path = str(filename)\n", " print(file_path)\n", " # load the model with the file_path\n", " model = Word2Vec.load(file_path)\n", " # add the model to our mode_list\n", " model_list.append(model)\n", " # add the filepath to the model_filenames list\n", " model_filenames.append(file_path)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Cp85yubjRNFr" }, "outputs": [], "source": [ "# set the word that we are checking for\n", "word = \"milk\"\n", "\n", "# if that word is in our vocabulary\n", "if word in model.wv.key_to_index:\n", "\n", " # print a statement to let us know\n", " print(\"The word %s is in your model vocabulary\" % word)\n", "\n", "# otherwise, let us know that it isn't\n", "else:\n", " print(\"%s is not in your model vocabulary\" % word)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8DqUDPceRNFr" }, "outputs": [], "source": [ "#test word pairs that we are going to use to evaluate the models\n", "test_words = [(\"stir\", \"whisk\"),\n", " (\"cream\", \"milk\"),\n", " (\"cake\", \"muffin\"),\n", " (\"jam\", \"jelly\"),\n", " (\"reserve\", \"save\"),\n", " (\"bake\", \"cook\")]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JhMpSitARNFs" }, "outputs": [], "source": [ "# these for loops will go through each list, the test word list and the models list,\n", "# and will run all the words through each model\n", "# then the results will be added to a dataframe\n", "\n", "# since NumPy 19.0, sometimes working with arrays of conflicting dimensions will throw a deprecation warning\n", "# this warning does not impact the code or the results, so we're going to filter it out\n", "# you can also specify \"dtype=object\" on the resulting array\n", "# np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)\n", "\n", "# create an empty dataframe with the column headings we need\n", "evaluation_results = pd.DataFrame(columns=['Model', 'Test Words', 'Cosine Similarity'], dtype=object)\n", "\n", "# iterate though the model_list\n", "for i in range(len(model_list)):\n", "\n", " # for each model in model_list, test the tuple pairs\n", " for x in range(len(test_words)):\n", "\n", " # calculate the similarity score for each tuple\n", " similarity_score = model_list[i].wv.similarity(*test_words[x])\n", "\n", " # create a temporary dataframe with the test results\n", " df = [model_filenames[i], test_words[x], similarity_score]\n", "\n", " # add the temporary dataframe to our final dataframe\n", " evaluation_results.loc[x] = df\n", "\n", "\n", "# save the evaluation_results dataframe as a .csv called \"word2vec_model_evaluation.csv\" in our current directory\n", "# if you want the .csv saved somewhere specific, include the filepath in the .to_csv() call\n", "evaluation_results.to_csv('word2vec_model_evaluation.csv')" ] }, { "cell_type": "markdown", "metadata": { "id": "55GgDWPTRNFs" }, "source": [ "## Next Steps\n", "\n", "Here are some resources if you would like to learn more about word vectors:\n", "\n", "- The Women [Writers Vector Toolkit](https://wwp.northeastern.edu/lab/wwvt/index.html) is a web interface for exploring word vectors, accompanied by glossaries, sources, case studies, and sample assignments. This toolkit includes links to a [GitHub repository with RMD walkthroughs](https://github.com/NEU-DSG/wwp-public-code-share/tree/main/WordVectors) with code for training word2vec models in R, as well as [download and resources on preparing text corpora](https://wwp.northeastern.edu/lab/wwvt/resources/downloads/index.html).\n", "\n", "- The [Women Writers Project Resources](https://wwp.northeastern.edu/outreach/resources/index.html) page has guides on: searching your corpus, corpus analysis and preparation, model validation and assessment, and other materials for working with word vectors.\n", "\n", "- [Link to other PH tutorial in draft]\n" ] }, { "cell_type": "markdown", "metadata": { "id": "1peZm7VYRNFs" }, "source": [ "\n", "_This walkthrough was written on November 16, 2022 using Python 3.8.3 and Gensim 4.2.0_" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" }, "colab": { "provenance": [], "include_colab_link": true } }, "nbformat": 4, "nbformat_minor": 0 }