{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "id": "tTWRVAGzlu8F" }, "source": [ "# Clustering and Visualising Documents using Word Embeddings\n", "\n", "This can be run in Google Colab (by clicking the _Open in Colab_ button above) or on your local machine. If you wish to run it locally, then you will need to ensure that you have the libraries listed in the [requirements.txt](requirements.txt) file installed *first*. The most direct way to do this is `pip install -r requirements.txt`, but I personally prefer to use Anaconda Python since it allows me to create 'virtual environments' (multiple 'versions' of Python that don't conflict with each other) as follows:\n", "\n", "```bash\n", "conda create -n ph\n", "conda activate ph\n", "pip install -r requirements.txt\n", "jupyter lab\n", "```\n", "\n", "This will install the required libraries into a new virtual environment called 'ph' (Programming Historian) by first creating the environment, then activating it, installing the libraries, and finally launching Jupyter Lab.\n", "\n", "
Note that we use a Parquet file for this work since it allows us to distribute a reasonably large and complex data set as a single, highly-compressed file. If you would like to adapt this tutorial for use with a CSV or Excel file, you have two choices: 1) simply replace `df = pd.read_parquet(...)` with `df = pd.read_csv(...)`; or 2) convert your CSV/Excel file to Parquet first using `pd.read_csv(...).to_parquet(...)`.
\n", "The other big advantage of Parquet files is that they can contain lists and dictionaries, whereas CSV has to 'serialise' these into strings like `"['foo','bar',...,'baz']"`. To deserialise a literal value like this you need to use the built-in `ast` library: `ast.literal_eval(<string_that_should_be_list>)`. See Stack Overflow for examples.
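For instance, a whole column of serialised lists can be restored with `ast.literal_eval` (the column name below is made up for illustration):

```python
import ast

import pandas as pd

# A CSV-style column where each list has been 'serialised' to a string
df = pd.DataFrame({"tokens": ["['foo','bar','baz']", "['qux']"]})

# literal_eval safely parses Python literals (lists, dicts, numbers, ...)
# without executing arbitrary code, unlike eval()
df["tokens"] = df["tokens"].apply(ast.literal_eval)

print(df["tokens"].iloc[0])  # a real list again: ['foo', 'bar', 'baz']
```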
\n", "