{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tTWRVAGzlu8F"
},
"source": [
"# Clustering and Visualising Documents using Word Embeddings\n",
"\n",
"This can be run in Google Colab (by clicking the _Open in Colab_ button above) or on your local machine. If you wish to run it locally, you will need to install the libraries listed in the [requirements.txt](requirements.txt) file *first*. The most direct way to do this is `pip install -r requirements.txt`, but I personally prefer to use Anaconda Python since it allows me to create 'virtual environments' (multiple 'versions' of Python that don't conflict with each other) as follows:\n",
"\n",
"```bash\n",
"conda create -n ph python\n",
"conda activate ph\n",
"pip install -r requirements.txt\n",
"jupyter lab\n",
"```\n",
"\n",
"This will install the required libraries into a new virtual environment called 'ph' (Programming Historian) by first creating the environment, then activating it, installing the libraries, then launching Jupyter Lab.\n",
"\n",
"Note that we use a Parquet file for this work since it allows us to distribute a reasonably large and complex data set as a single highly-compressed file. If you would like to adapt this tutorial for use with a CSV or Excel file you have two choices: 1) simply replace `df = pd.read_parquet(...)` with `df = pd.read_csv(...)` (or `pd.read_excel(...)` for an Excel file); 2) convert your CSV/Excel file to Parquet first using `pd.read_csv(...).to_parquet(...)`.\n",
"\n",
"The other big advantage of Parquet files is that they can contain lists and dictionaries, whereas a CSV has to 'serialise' these as strings like this: `\"['foo','bar',...,'baz']\"`. To deserialise a literal value like this you need to use the built-in `ast` library: `ast.literal_eval(<string_that_should_be_list>)`. See Stack Overflow for examples.
\n", "