{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Practical 7: Working with Text

\n", "

The basics of text mining and NLP

\n", "
\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Complete | Part 1: Foundations | Part 2: Data | Part 3: Analysis | |\n", "| :------- | :------------------ | :----------- | :--------------- | --: |\n", "| 60% | ▓▓▓▓▓▓▓▓ | ▓▓▓▓▓░ | ░░░░░░ | 7/10 |\n", "\n", "A lot of the content here is provided to help you *understand* what text-cleaning does and how it generates tokens that can be processed by the various analytical approaches commonly-used in NLP. The best way to think about this is as a practical in four parts, not all of which you should expect to complete in this session:\n", "\n", "1. Tasks 1–3: these are largely focussed on the basics: exploring text and using regular expressions to find and select text.\n", "2. Tasks 4–5: this might seem like a *bit* of a detour, but it's intended to show you in a more tangible way how 'normalisation' works when we're working with text. **You should feel free to stop here and return to the rest later.**\n", "3. Tasks 6–7: are about finding important vocabulary (think 'keywords' and 'significant terms') in documents so that you can start to think about what is *distinctive* about documents and groups of documents. **This is quite useful and easier to understand than what comes next!**\n", "4. Tasks 8–9: are about fully-fledged NLP using Latent Dirichlet Allocation (topic modelling) and Word2Vec (word embeddings for use in clustering or similarity work).\n", "\n", "The later parts are largely complete and ready to run; however, that *doesn't* mean you should just skip over them and think you've grasped what's happening and that it will be easy to apply in your own analyses. I would *not* pay as much attention to LDA topic mining since I don't think its results are that good, but I've included it here as it's still commonly used in the Digital Humanities and by Marketing folks. Word2Vec is much more powerful, and the word-embedding idea behind it forms the basis of the kinds of advances seen in ChatGPT and other LLMs." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " 🔗 Connections: working with text is unquestionably hard. In fact, conceptually this is probably the most challenging practical of the term! But data scientists are always dealing with text because so much of the data that we collect (even more so thanks to the web) is not only text-based (URLs are text!) but, increasingly, unstructured (social media posts, tags, etc.). So while getting to grips with text is a challenge, it also uniquely positions you with respect to the skills and knowledge that other graduates are offering to employers.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preamble" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This practical has been written using `nltk`, but would be _relatively_ easy to rework using `spacy`. Most programmers tend to use one *or* the other, and the switch wouldn't be hard other than having to first load the requisite language models:\n", "```python\n", "import spacy\n", "nlp = spacy.load(\"en_core_web_sm\") # `...web_md` and `...web_lg` are also options\n", "```\n", "You can [read about the models](https://spacy.io/models/en), and note that they are also [available in other languages](https://spacy.io/usage/models) besides English." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 1. Setup\n", "\n", "
\n", " Difficulty Level: easy, but only because this has been worked out for you. Starting from scratch in NLP is hard so people try to avoid it as much as possible.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 1.1 Required Modules\n", "\n", "
\n", " Notice that the number of modules and functions that we import is steadily increasing week-on-week, and that for text processing we tend to draw on quite a wide range of utilities! That said, the three most commonly used are: sklearn, nltk, and spacy.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Standard libraries we've seen before." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import numpy as np\n", "import pandas as pd\n", "import geopandas as gpd\n", "import re\n", "import math\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Vectorisers we will use from the 'big beast' of Python machine learning: scikit-learn." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder # We don't use this but I point out where you *could*\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.decomposition import LatentDirichletAllocation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NLP-specific libraries that we will use for tokenisation, lemmatisation, and frequency analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "import spacy\n", "from nltk.corpus import wordnet as wn\n", "from nltk.stem.wordnet import WordNetLemmatizer\n", "\n", "from nltk.corpus import stopwords\n", "\n", "from nltk.tokenize import word_tokenize, sent_tokenize\n", "from nltk.tokenize.toktok import ToktokTokenizer\n", "\n", "from nltk.stem.porter import PorterStemmer\n", "from nltk.stem.snowball import SnowballStemmer\n", "\n", "from nltk import ngrams, FreqDist\n", "\n", "lemmatizer = WordNetLemmatizer()\n", "tokenizer = ToktokTokenizer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remaining libraries that we'll use for processing and displaying text data. Most of this relates to dealing with the various ways that text data cleaning is *hard* because of the myriad formats it comes in." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import string\n", "import unicodedata\n", "from bs4 import BeautifulSoup\n", "from wordcloud import WordCloud, STOPWORDS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 1.1: Configure\n", "\n", "There isn't a lot to configure for this week, but you *will* want to examine the results from the first pass through the data in order to update the list of **stopwords**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#nltk.download('wordnet') # <-- These are done in a supporting tool, but in your own\n", "#nltk.download('averaged_perceptron_tagger') # application you'd need to download them\n", "stopword_list = set(stopwords.words('english'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This next is just a small utility function that allows us to output Markdown (like this cell) instead of plain text:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import display_markdown\n", "\n", "def as_markdown(head='', body='Some body text'):\n", "    if head != '':\n", "        display_markdown(f\"##### {head}\\n\\n>{body}\\n\", raw=True)\n", "    else:\n", "        display_markdown(f\">{body}\\n\", raw=True)\n", "\n", "as_markdown('Result!', \"Here's my output...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 1.2: Loading Data\n", "\n", "
\n", " 🔗 Connections: Because I generally want each practical to stand on its own (unless I'm trying to make a point), I've not moved this to a separate Python file (e.g. utils.py), but in line with what we covered back in the lectures on Functions and Packages, this sort of thing is a good candidate for being split out to a separate file to simplify re-use.\n", "
\n", "\n", "Remember this function from last week? We use it to save downloading files that we already have stored locally. But notice I've made some small changes... what do these do to help the user?\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from requests import get\n", "from urllib.parse import urlparse\n", "\n", "def cache_data(src:str, dest:str) -> str:\n", "    \"\"\"Downloads and caches a remote file locally.\n", "    \n", "    The function sits between the 'read' step of a pandas or geopandas\n", "    data frame and downloading the file from a remote location. The idea\n", "    is that it will save it locally so that you don't need to remember to\n", "    do so yourself. Subsequent re-reads of the file will return instantly\n", "    rather than downloading the entire file for a second or n-th time.\n", "    \n", "    Parameters\n", "    ----------\n", "    src : str\n", "        The remote *source* for the file, any valid URL should work.\n", "    dest : str\n", "        The *destination* location to save the downloaded file.\n", "    \n", "    Returns\n", "    -------\n", "    str\n", "        A string representing the local location of the file.\n", "    \"\"\"\n", "    \n", "    url = urlparse(src) # We assume that this is some kind of valid URL \n", "    fn = os.path.split(url.path)[-1] # Extract the filename\n", "    dfn = os.path.join(dest,fn) # Destination filename\n", "    \n", "    # Check if dest+filename does *not* exist -- \n", "    # that would mean we have to download it!\n", "    if not os.path.isfile(dfn):\n", "        \n", "        print(f\"{dfn} not found, downloading!\")\n", "\n", "        # Convert the path back into a list (without\n", "        # the filename) -- we need to check that directories\n", "        # exist first.\n", "        path = os.path.split(dest)\n", "        \n", "        # Create any missing directories in dest(ination) path\n", "        # -- os.path.join is the reverse of split (as you saw above)\n", "        # but it doesn't work with lists... so I had to google how\n", "        # to use the 'splat' operator! os.makedirs creates missing\n", "        # directories in a path automatically.\n", "        if len(path) >= 1 and path[0] != '':\n", "            os.makedirs(os.path.join(*path), exist_ok=True)\n", "        \n", "        # Download and write the file\n", "        with open(dfn, \"wb\") as file:\n", "            response = get(src)\n", "            file.write(response.content)\n", "        \n", "        print(\"\\tDone downloading...\")\n", "\n", "        # What's this doing???\n", "        f_size = os.stat(dfn).st_size\n", "        print(f\"\\tSize is {f_size/1024**2:,.0f} MB ({f_size:,} bytes)\")\n", "\n", "    else:\n", "        print(f\"Found {dfn} locally!\")\n", "\n", "        # And why is it here as well???\n", "        f_size = os.stat(dfn).st_size\n", "        print(f\"\\tSize is {f_size/1024**2:,.0f} MB ({f_size:,} bytes)\")\n", "    \n", "    return dfn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " 💡 Tip: for very large non-geographic data sets, remember that you can use usecols (or columns for feather and parquet files) to specify a subset of columns to load.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the main data set:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Set download URL\n", "host = 'https://orca.casa.ucl.ac.uk'\n", "path = '~jreades/data/'\n", "fn = '2023-09-06-listings.geoparquet'\n", "url = f'{host}/{path}/{fn}'\n", "\n", "gdf = gpd.read_parquet( cache_data(url, os.path.join('data','geo')), \n", " columns=['geometry', 'listing_url', 'name', \n", " 'description', 'amenities', 'price'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"gdf has {gdf.shape[0]:,} rows and CRS is {gdf.crs.name}.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load supporting Geopackages:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ddir = os.path.join('data','geo') # destination directory\n", "spath = 'https://github.com/jreades/fsds/blob/master/data/src/' # source path\n", "\n", "boros = gpd.read_file( cache_data(spath+'Boroughs.gpkg?raw=true', ddir) )\n", "water = gpd.read_file( cache_data(spath+'Water.gpkg?raw=true', ddir) )\n", "green = gpd.read_file( cache_data(spath+'Greenspace.gpkg?raw=true', ddir) )\n", "\n", "print('Done.')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2. Exploratory Textual Analysis\n", "\n", "
\n", " 🔗 Connections: if you plan to work with data post-graduation then you will need to become comfortable with Regular Expressions (aka regexes). These are the focus of the Patterns in Text lecture, but it barely even scratches the surface of what regexes can do. They are hard, but they are powerful. \n", "
\n", "\n", "
\n", " 💡 Tip: In a full text-mining application I would spend a lot more time on this stage: sampling, looking at descriptions in full, performing my analysis (the rest of the steps) and then coming back with a deeper understanding of the data to make further changes to the analysis.\n", "
\n", "\n", "It's helpful to have a sense of what data look like before trying to do something with them, but by default pandas truncates quite a lot of output to keep it from overwhelming the display. For text processing, however, you should probably change the amount of preview text provided by pandas using the available options. *Note*: there are lots of other options that you can tweak in pandas." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Default maximum column width: {pd.options.display.max_colwidth}\") # What's this currently set to?\n", "pd.options.display.max_colwidth=250 # None = no maximum column width (you probably don't want to leave it at this)\n", "print(f\"Now maximum column width set to: {???}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 2.1: Investigate the Description Field\n", "\n", "
\n", " Difficulty Level: Medium, because of the questions.\n", "
\n", "\n", "To explore the description field properly you'll need to filter out any NA/NaN descriptions before sampling the result. *Hint*: you'll need to think about negation (`~`) of a method output that tells you if a field *is NA*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf[???].sample(5, random_state=42)[['description']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " ⚠ Stop: what do you notice about the above? Are they simple text? Are there patterns of problems? Are there characters that represent things other than words and simple punctuation?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1.1 Questions\n", "\n", "- What patterns can you see that might need 'dealing with' for text-mining to work?\n", "- What non-text characters can you see? (Things *other* than A-Z, a-z, and simple punctuation!)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 2.2: Amenities Field\n", "\n", "
\n", " Difficulty Level: Medium, because of the questions.\n", "
\n", "\n", "This field presents a subtle issue that might not be obvious here:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf.amenities.sample(5, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But look what happens now, can you see the issue a little more easily?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf.amenities.iloc[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2.1 Questions\n", "\n", "- What's the implicit format of the Amenities columns?\n", "- How could you represent the data contained in the column?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 2.3: Remove NaN Values\n", "\n", "
\n", " ⚠ Note: I would be wary of doing the below in a 'proper' application without doing some careful research first, but to make our lives easier, we're going to drop rows where one of these values is NaN now so it will simplify the steps below. In reality, I would spend quite a bit more time investigating which values are NaN and why before simply dropping them.\n", "
\n", "\n", "Anyway, drop all rows where *either* the description or amenities (or both) are NA:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf = gdf.dropna(???)\n", "print(f\"Now gdf has {gdf.shape[0]:,} rows.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 3. Using Regular Expressions\n", "\n", "
\n", " 🔗 Connections: We're building on the work done in Practical 6, but now making use of the lecture on Patterns in Text to quickly sort through the listings.\n", "
\n", "\n", "There is a _lot_ that can be done with Regular Expressions to identify relevant records in textual data and we're going to use this as a starting point for the rest of the analysis. I would normally consider the regexes here a 'first pass' at the data, but would look very carefully at the output of the TF/IDF vectorizer, Count vectorizer, and LDA to see if I could improve my regexes for further cycles of analysis... the main gain there is that regexes are _much_ faster than using the full NLP (Natural Language Processing) pipeline on the _full_ data set each time. As an alternative, you could develop the pipeline using a random subsample of the data and then process the remaining records sequentially -- in this context there is no justification for doing that, but with a larger corpus it might make sense." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3.1: Luxury Accommodation\n", "\n", "
\n", " Difficulty Level: Hard, because of the regular expression and questions.\n", "
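As a warm-up for the grouping that this task relies on, here is a minimal sketch on an unrelated word family, so that it doesn't give the answer away. The sample sentence and the `colou?r` pattern are invented for illustration; the only point is that `(?:...)` groups alternatives without capturing them, which shows up in what `findall` returns.

```python
import re

# Invented sample text: nothing here comes from the listings data.
text = 'A colourful room, freshly coloured walls, and a color TV.'

# A capturing group: findall returns only what the group captured.
capturing = re.compile(r'colou?r(ful|ed)?', flags=re.IGNORECASE)

# A non-capturing group: same matches, but findall returns the whole match.
non_capturing = re.compile(r'colou?r(?:ful|ed)?', flags=re.IGNORECASE)

captured = capturing.findall(text)
matched = non_capturing.findall(text)

print(captured)  # the suffixes only (an empty string where there was none)
print(matched)   # the full words that matched
```

One practical reason to prefer the non-capturing form here: pandas may emit a warning about match groups if you hand `str.contains` a pattern that has capturing groups.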
\n", "\n", "I would like you to find listings that *might* (on the basis of word choice) indicate 'luxury' accommodation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1.1 Create the Regular Expression\n", "\n", "You should start with variations on 'luxury' (e.g. luxurious, luxuriate, ...) and work out a **single regular expression** that works for variations on this *one* word. **Later**, I would encourage you to come back to this and consider what other words might help to signal 'luxury'... perhaps words like 'stunning' or 'prestigious'? Could you add those to the regex as well?\n", "\n", "*Hints*: this is a toughie, but...\n", "\n", "1. All regular expressions work best using the `r'...'` (which means raw string) syntax.\n", "2. You need to be able to *group* terms. Recall, however, that in Python a 'group' of the form `r'(some text)'` refers to matching (`some text` will be 'memoized'/remembered), whereas what you need here is a \"non-capturing group\" (a close relative of the **positive lookahead**, but not the same thing). That's a Google clue right there, but you've also seen this in the lecture.\n", "\n", "In fact, in real-world applications you might even need more than one group/non-capturing group in a *nested* structure." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf[\n", "    gdf.description.str.contains(r'???', regex=True, flags=re.IGNORECASE) # <-- The regex\n", "].sample(5, random_state=42)[['description']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1.2 Apply it to Select Data\n", "\n", "Assign it to a new data frame called `lux`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lux = gdf[gdf.description.str.contains(r'???', regex=True, flags=re.IGNORECASE)].copy()\n", "print(f\"Found {lux.shape[0]:,} records for 'luxury' flats\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1.3 Plot the Data\n", "\n", "Now we are going to create a more complex plot that will give space to both the spatial and price distributions using `subplot2grid`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(plt.subplot2grid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that there are two ways to create the plot specified above. I chose route 1, but in some ways route 2 (where you specify a `gridspec` object and *then* add the axes) might be a bit simpler to work out if you're starting from scratch.\n", "\n", "The critical thing here is to understand how we're initialising a plot that has **4 rows** and **1 column** even though it is only showing **2 plots**. What we're going to do is set the *first* plot to span **3 rows** so that it takes up 75% of the plot area (3/4), while the *second* plot only takes up 25% (1/4). They will appear one above the other, so there's only 1 column. Here's how to read the key parts of `subplot2grid`:\n", "- `nrows` -- how many rows *of plots* in the figure.\n", "- `ncols` -- how many columns *of plots* in the figure.\n", "- `row` -- what row of the figure does *this* plot start on (0-indexed like a list in Python).\n", "- `col` -- what column of the figure does *this* plot start on (0-indexed like a list in Python).\n", "- `rowspan` -- how many rows of the figure does *this* plot span (*not* 0-indexed because it's not list-like).\n", "- `colspan` -- how many columns of the figure does *this* plot span (*not* 0-indexed because it's not list-like).\n", "\n", "Every time you call `subplot2grid` you are initialising a new axis object into which you can then draw with your geopandas or pandas plotting methods." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "f,ax = plt.subplots(1,1,figsize=(12,8))\n", "ax.remove()\n", "\n", "# The first plot \n", "ax1 = plt.subplot2grid((4, 1), (???), rowspan=???)\n", "boros.plot(edgecolor='red', facecolor='none', linewidth=1, alpha=0.75, ax=ax1)\n", "lux.plot(markersize=2, column='price', cmap='viridis', alpha=0.2, scheme='Fisher_Jenks_Sampled', ax=ax1)\n", "\n", "ax1.set_xlim([500000, 565000])\n", "ax1.set_ylim([165000, 195000]);\n", "\n", "# The second plot\n", "ax2 = plt.subplot2grid((???), (???), rowspan=1)\n", "lux.price.plot.hist(bins=250, ax=ax2)\n", "\n", "plt.suptitle(\"Listings Advertising Luxury\") # <-- How does this differ from title? Change it and see!\n", "plt.tight_layout() # <-- Try creating the plot *without* this to see what it changes\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1.4 Questions\n", "\n", "- What does `suptitle` do and how is it different from `title`? Could you use this as part of your plot-making process?\n", "- What does `tight_layout` do?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 3.2: Budget Accommodation\n", "\n", "
\n", " Difficulty Level: Easy, because you've worked out the hard bits already.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2.1 Create the Regular Expression\n", "\n", "What words can you think of that might help you to spot affordable and budget accommodation? Start with just a couple of words and then I would encourage you to consider what _other_ words might help to signal 'affordability'... perhaps words like 'cosy' or 'charming'. Then think about how you could add those to the regex.\n", "\n", "*Hints*: this just builds on what you did above with one exception:\n", "\n", "1. I'd try adding word boundary markers to the regex (`\\b`) where appropriate..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf[\n", "    gdf.description.str.contains(???, regex=True, flags=re.IGNORECASE)\n", "].sample(5, random_state=42)[['description']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2.2 Apply it to Select Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "aff = gdf[gdf.description.str.contains(???, regex=True, flags=re.IGNORECASE)].copy()\n", "print(f\"There are {aff.shape[0]:,} rows flagged as 'affordable'.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2.3 Plot the Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "f,ax = plt.subplots(1,1,figsize=(12,8))\n", "ax.remove()\n", "\n", "# The first plot\n", "ax1 = plt.subplot2grid((4, 1), (0, 0), rowspan=3)\n", "boros.plot(edgecolor='red', facecolor='none', linewidth=1, alpha=0.75, ax=ax1)\n", "aff.plot(markersize=2, column='price', cmap='viridis', alpha=0.2, scheme='Fisher_Jenks_Sampled', ax=ax1)\n", "\n", "ax1.set_xlim([500000, 565000])\n", "ax1.set_ylim([165000, 195000]);\n", "\n", "# The second plot\n", "ax2 = plt.subplot2grid((4, 1), (3, 0), rowspan=1)\n", "aff.price.plot.hist(bins=100, ax=ax2)\n", "\n", "plt.suptitle(\"Listings Advertising Affordability\")\n", "plt.tight_layout()\n", "plt.savefig(\"Affordable_Listings.png\", dpi=150)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2.4 Questions\n", "\n", "- Do you think that this is a *good* way to select affordable options?\n", "- Do you understand what `dpi` means and how `savefig` works?\n", "- Copy the code from above but modify it to constrain the histogram on a more limited distribution by *filtering* out the outliers *before* drawing the plot. I would copy the cell above to one just below here so that you keep a working copy available and can undo any changes that break things." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "### Task 3.3: Near Bluespace\n", "\n", "
\n", " Difficulty Level: Medium, because you're still learning about regexes.\n", "
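Task 3.2.1 suggested word boundary markers, so before tackling this one here is a small hedged sketch of what they change. The phrases are made up for illustration and deliberately avoid the vocabulary you will need for the task: without `\b` the pattern also fires inside longer words, while with it only the whole word matches.

```python
import re

# Made-up phrases for illustration only.
phrases = [
    'modern art nearby',
    'a smart apartment',
    'Start your day here',
    'ART deco building',
]

loose = re.compile(r'art', flags=re.IGNORECASE)       # matches anywhere in a word
strict = re.compile(r'\bart\b', flags=re.IGNORECASE)  # matches the whole word only

loose_hits = [p for p in phrases if loose.search(p)]
strict_hits = [p for p in phrases if strict.search(p)]

print(len(loose_hits), 'loose matches')
print(strict_hits)
```

The same idea carries over to any term that also happens to be the start of a longer, unrelated word.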
\n", "\n", "Now see if you can work out a regular expression to find accommodation that emphasises accessibility to the Thames and other 'blue spaces' as part of the description. One thing you'll need to tackle is that some listings seem to say something about Thameslink and you wouldn't want those to be returned as part of a regex looking for _rivers_. So by way of a hint:\n", "\n", "- You probably need to think about the Thames, rivers, and water.\n", "- These will probably be *followed* by a qualifier like a 'view' (e.g. Thames-view) or a front (e.g. water-front).\n", "- But you need to rule out things like \"close to the Thameslink station...\"\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3.1 Create the Regular Expression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf[\n", "    gdf.description.str.contains(???, regex=???, flags=???)\n", "].sample(5, random_state=42)[['description']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3.2 Apply it to Select Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bluesp = gdf[\n", "    (gdf.description.str.contains(???, regex=True, flags=re.IGNORECASE)) |\n", "    (gdf.description.str.contains(???, regex=True, flags=re.IGNORECASE))\n", "].copy()\n", "print(f\"Found {bluesp.shape[0]:,} rows.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3.3 Plot the Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "f,ax = plt.subplots(1,1,figsize=(12,8))\n", "ax.remove()\n", "\n", "# The first plot\n", "ax1 = plt.subplot2grid((4, 1), (0, 0), rowspan=3)\n", "water.plot(edgecolor='none', facecolor=(.25, .25, .7, .25), ax=ax1)\n", "boros.plot(edgecolor='red', facecolor='none', linewidth=1, alpha=0.75, ax=ax1)\n", "bluesp.plot(markersize=2, column='price', cmap='viridis', alpha=0.5, scheme='Fisher_Jenks_Sampled', ax=ax1)\n", "\n", "ax1.set_xlim([500000, 565000])\n", "ax1.set_ylim([165000, 195000]);\n", "\n", "# The second plot\n", "ax2 = plt.subplot2grid((4, 1), (3, 0), rowspan=1)\n", "bluesp.price.plot.hist(bins=100, ax=ax2)\n", "\n", "plt.suptitle(\"Bluespace Listings\")\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3.4 Questions\n", "\n", "- How else might you select listings with a view of the Thames or other bluespaces?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 4. Illustrative Text Cleaning\n", "\n", "Now we're going to step through the _parts_ of the process that we apply to clean and transform text. We'll do this individually before using a function to apply them _all at once_." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.1: Downloading a Web Page\n", "\n", "
\n", " Difficulty Level: Easy.\n", "
\n", "\n", "There is plenty of good economic geography research being done using web pages. Try using Google Scholar to look for work using the British Library's copy of the *Internet Archive*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from urllib.request import urlopen, Request\n", "\n", "# We need this so that the Bartlett web site 'knows'\n", "# what kind of browser it is dealing with. Otherwise\n", "# you get a Permission Error (403 Forbidden) because\n", "# the site doesn't know what to do.\n", "hdrs = {\n", "    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',\n", "    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',\n", "}\n", "url = 'https://www.ucl.ac.uk/bartlett/casa/about-0'\n", "\n", "# Notice that here we have to assemble a request and\n", "# then 'open' it so that the request is properly issued\n", "# to the web server. Normally, we'd just use `urlopen`, \n", "# but that doesn't give you the ability to set the headers.\n", "request = Request(url, None, hdrs) # The assembled request\n", "response = urlopen(request)\n", "html = response.???.decode('utf-8') # The data you need\n", "\n", "print(html[:1000])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.2: Removing HTML\n", "\n", "
\n", " Difficulty Level: Medium, because what we're doing will seem really strange and uses some previously unseen libraries that you'll have to google.\n", "
\n", "\n", "*Hint*: you need to **get the text** out of each returned `<div>` and `<p>` element! I'd suggest also commenting this up since there is a *lot* going on in some of these lines of code!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cleaned = []\n", "\n", "soup = BeautifulSoup(html)\n", "body = soup.find('body')\n", "\n", "for c in body.findChildren(recursive=False):\n", "    if c.name in ['div','p'] and c.???.strip() != '': \n", "        # \\xa0 is a non-breaking space in Unicode (&nbsp; in HTML)\n", "        txt = [re.sub(r'(?:\\u202f|\\xa0|\\u200b)',' ',x.strip()) for x in c.get_text(separator=\" \").split('\\n') if x.strip() != '']\n", "        cleaned += txt\n", "\n", "cleaned" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.3: Lower Case\n", "\n", "
\n", " Difficulty Level: Easy.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lower = [c.???() for ??? in cleaned]\n", "lower" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.4: Stripping 'Punctuation'\n", "\n", "
\n", " Difficulty Level: Hard, because you need to understand: 1) why we're compiling the regular expression and how to use character classes; and 2) how the NLTK tokenizer differs in approach to the regex.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.4.1 Regular Expression Approach\n", "\n", "We want to clear out punctuation using a regex that takes advantage of the `[...]` (character class) syntax. The really tricky part is remembering how to specify the 'punctuation' when some of that punctuation has 'special' meanings in a regular expression context. For instance, `.` means 'any character', while `[` and `]` mean 'character class'. So this is another *escaping* problem and it works the *same* way it did when we were dealing with the Terminal... \n", "\n", "*Hints*: some other factors... \n", "\n", "1. You will want to match more than one piece of punctuation at a time, so I'd suggest adding a `+` to your pattern.\n", "2. You will need to look into *metacharacters* for creating a kind of 'any of the characters *in this class*' bag of possible matches." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pattern = re.compile(r'[???]+')\n", "print(pattern)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.4.2 Tokenizer\n", "\n", "The other way to do this, which is probably *easier* but produces more complex output, is to draw on the tokenizers [already provided by NLTK](https://www.nltk.org/api/nltk.tokenize.html). For our purposes `word_tokenize` is probably fine, but depending on your needs there are other options and you can also write your own."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.tokenize import word_tokenize\n", "print(word_tokenize)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 4.4.3 Compare" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "subbed = []\n", "tokens = []\n", "for l in lower:\n", " subbed.append(re.sub(???, ' ', l))\n", " tokens.append(???(l))\n", "\n", "for s in subbed:\n", " as_markdown(\"Substituted\", s)\n", "\n", "for t in tokens:\n", " as_markdown(\"Tokenised\", t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.5: Stopword Removal\n", "\n", "
\n", " Difficulty Level: Medium, because you need to remember how list comprehensions work to use the stopword_list.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stopword_list = set(stopwords.words('english'))\n", "print(stopword_list)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stopped = []\n", "for p in tokens[2:4]: # <-- why do I just take these items from the list?\n", " stopped.append([x for x in p if x not in ??? and len(x) > 1])\n", "\n", "for s in stopped:\n", " as_markdown(\"Line\", s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 4.6: Lemmatisation vs Stemming\n", "\n", "
\n", " Difficulty Level: Easy.\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.stem.porter import PorterStemmer\n", "from nltk.stem.snowball import SnowballStemmer\n", "from nltk.stem.wordnet import WordNetLemmatizer " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lemmatizer = WordNetLemmatizer()\n", "print(lemmatizer.lemmatize('monkeys'))\n", "print(lemmatizer.lemmatize('cities'))\n", "print(lemmatizer.lemmatize('complexity'))\n", "print(lemmatizer.lemmatize('Reades'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stemmer = PorterStemmer()\n", "print(stemmer.stem('monkeys'))\n", "print(stemmer.stem('cities'))\n", "print(stemmer.stem('complexity'))\n", "print(stemmer.stem('Reades'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stemmer = SnowballStemmer(language='english')\n", "print(stemmer.stem('monkeys'))\n", "print(stemmer.stem('cities'))\n", "print(stemmer.stem('complexity'))\n", "print(stemmer.stem('Reades'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lemmatizer = WordNetLemmatizer()\n", "lemmas = []\n", "stemmed = []\n", "\n", "# This would be better if we passed in a PoS (Part of Speech) tag as well,\n", "# but processing text for parts of speech is *expensive* and for the purposes\n", "# of this tutorial, not necessary.\n", "for s in stopped:\n", " lemmas.append([lemmatizer.lemmatize(x) for x in s])\n", "\n", "for s in stopped:\n", " stemmed.append([stemmer.stem(x) for x in s])\n", "\n", "for l in lemmas:\n", " as_markdown('Lemmatised',l)\n", "\n", "for s in stemmed:\n", " as_markdown('Stemmed',s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# What are we doing here?\n", "for ix, p in enumerate(stopped):\n", " stopped_set = set(stopped[ix])\n", " lemma_set = 
set(lemmas[ix])\n", " print(sorted(stopped_set.symmetric_difference(lemma_set)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 5. Applying Normalisation\n", "\n", "The above approach is fairly hard going since you need to loop through every list element applying these changes one at a time. Instead, we could convert the column to a corpus (or use pandas `apply`) together with a function imported from a library to do the work." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 5.1: Downloading the Custom Module\n", "\n", "
\n", " Difficulty Level: Easy.\n", "
\n", "\n", "This custom module is not perfect, but it gets the job done... mostly and has some additional features that you could play around with for a final project (e.g. `detect_entities` and `detect_acronyms`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import urllib.request\n", "host = 'https://orca.casa.ucl.ac.uk'\n", "turl = f'{host}/~jreades/__textual__.py'\n", "tdirs = os.path.join('textual')\n", "tpath = os.path.join(tdirs,'__init__.py')\n", "\n", "if not os.path.exists(tpath):\n", " os.makedirs(tdirs, exist_ok=True)\n", " urllib.request.urlretrieve(turl, tpath)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 5.2: Importing the Custom Module\n", "\n", "
\n", " Difficulty Level: Easy, since you didn't have to write the module! But the questions could be hard...\n", "
\n", "\n", "Now let's import it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This allows us to edit and reload the library\n", "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from textual import *" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "as_markdown('Input', cleaned)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "as_markdown('Normalised', [normalise_document(x, remove_digits=True) for x in ???])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(normalise_document)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 5.1.1 Questions\n", "\n", "Let's assume that you want to analyse web page content... \n", "\n", "- Based on the above output, what stopwords do you think are missing?\n", "- Based on the above output, what should be removed but isn't?\n", "- Based on the above output, how do you think a computer can work with this text?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " ⚠ Stop! Beyond this point, we are moving into Natural Language Processing. If you are already struggling with regular expressions, I would recommend stopping here. You can come back to revisit the NLP components and creation of word clouds later.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 6. Revenons à Nos Moutons\n", "\n", "Now that you've seen how the steps are applied to a 'random' HTML document, let's get back to the problem at hand (revenons à nos moutons == let's get back to our sheep)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 6.1: Process the Selected Listings\n", "\n", "
\n", " Difficulty Level: Easy, but you'll need to be patient!\n", "
\n", "\n", "Notice the use of `%%time` here -- this will tell you how long each block of code takes to complete. It's a really useful technique for reminding *yourself* and others of how long something might take to run. I find that with NLP this is particularly important since you have to do a *lot* of processing on each document in order to normalise it.\n", "\n", "
\n", " 💡 Tip: Notice how we can change the default parameters for normalise_document even when using apply, but that the syntax is different. So whereas we'd use normalise_document(doc, remove_digits=True) if calling the function directly, here it's .apply(normalise_document, remove_digits=True)!\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time \n", "# I get about 1 minute on a M2 Mac\n", "lux['description_norm'] = lux.???.apply(???, remove_digits=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time \n", "# I get about 1 minute on a M2 Mac\n", "aff['description_norm'] = aff.???.apply(???, remove_digits=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time \n", "# I get about 2 seconds on a M2 Mac\n", "bluesp['description_norm'] = bluesp.???.apply(???, remove_digits=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 6.2: Select and Tokenise\n", "\n", "
\n", " Difficulty Level: Easy, except the double list-comprehension.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.2.1 Select and Extract Corpus\n", "\n", "See useful tutorial [here](https://towardsdatascience.com/tf-idf-explained-and-python-sklearn-implementation-b020c5e83275). Although we shouldn't have any empty descriptions, by the time we've finished normalising the textual data we may have _created_ some empty values and we need to ensure that we don't accidentally pass a NaN to the vectorisers and frequency distribution functions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "srcdf = bluesp # <-- you only need to change the value here to try the different selections" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corpus = srcdf.description_norm.fillna(' ').values\n", "print(corpus[0:3])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.2.2 Tokenise\n", "\n", "There are different forms of tokenisation and different algorithms will expect differing inputs. Here are two:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sentences = [nltk.sent_tokenize(text) for text in corpus]\n", "words = [[nltk.tokenize.word_tokenize(sentence) \n", " for sentence in nltk.sent_tokenize(text)] \n", " for text in corpus]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how this has turned every sentence into an array and each document into an array of arrays:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Sentences 0: {sentences[0]}\")\n", "print()\n", "print(f\"Words 0: {words[0]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 6.3: Frequencies and Ngrams\n", "\n", "
\n", " Difficulty Level: Moderate.\n", "
\n", "\n", "One new thing you'll see here is the `ngram`: ngrams are 'simply' pairs, or triplets, or quadruplets of words. You may come across the terms unigram (`ngram(1,1)`), bigram (`ngram(2,2)`), trigram (`ngram(3,3)`)... you will rarely find anything beyond trigrams, and these present real issues for text2vec algorithms because the embedding for `geographical`, `information`, and `systems` is _not_ the same as for `geographical information systems`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.3.1 Build Frequency Distribution\n", "\n", "Build counts for ngram range 1..3:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fcounts = dict()\n", "\n", "# Here we replace all full-stops... can you think why we might do this?\n", "data = nltk.tokenize.word_tokenize(' '.join([text.replace('.','') for text in corpus]))\n", "\n", "for size in 1, 2, 3:\n", "    fdist = FreqDist(ngrams(data, size))\n", "    print(fdist)\n", "    # If you only need one, note this: https://stackoverflow.com/a/52193485/4041902\n", "    fcounts[size] = pd.DataFrame.from_dict({f'Ngram Size {size}': fdist})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.3.2 Output Top-n Ngrams\n", "\n", "And output the most common ones for each ngram range:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for dfs in fcounts.???():\n", "    print(dfs.sort_values(by=dfs.columns.values[0], ascending=???).head(10))\n", "    print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.3.3 Questions\n", "\n", "- Can you think why we don't care about punctuation for frequency distributions and n-grams?\n", "- Do you understand what n-grams *are*?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 6.4: Count Vectoriser\n", "\n", "
\n", " Difficulty Level: Easy, but the output needs some thought!\n", "
\n", "\n", "This is a big foray into sklearn (scikit-learn) which is the main machine learning and clustering module for Python. For processing text we use *vectorisers* to convert terms to a vector representation. We're doing this on the smallest of the derived data sets because these processes can take a while to run and generate *huge* matrices (remember: one row for each document and one column for each term!)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.4.1 Fit the Vectoriser" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cvectorizer = CountVectorizer(ngram_range=(1,3))\n", "cvectorizer.fit(corpus)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.4.2 Brief Demonstration\n", "\n", "Find the number associated with a word in the vocabulary and how many times it occurs in the original corpus:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "term = 'stratford'\n", "pd.options.display.max_colwidth=750\n", "# Find the vocabulary mapping for the term\n", "print(f\"Vocabulary mapping for {term} is {cvectorizer.vocabulary_[term]}\")\n", "# How many times is it in the data\n", "print(f\"Found {srcdf.description_norm.str.contains(term).sum():,} rows containing {term}\")\n", "# Print the descriptions containing the term\n", "for x in srcdf[srcdf.description_norm.str.contains(term)].description_norm:\n", "    as_markdown('Stratford',x)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.4.3 Transform the Corpus \n", "\n", "You can only *transform* the entire corpus *after* the vectoriser has been fitted. There is an option to `fit_transform` in one go, but I wanted to demonstrate a few things here and some vectorisers don't support the one-shot fit-and-transform approach. 
**Note the type of the transformed corpus**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cvtcorpus = cvectorizer.transform(???)\n", "cvtcorpus # cvtcorpus for count-vectorised transformed corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.4.4 Single Document\n", "\n", "Here is the **first** document from the corpus:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "doc_df = pd.DataFrame(cvtcorpus[0].T.todense(), \n", " index=cvectorizer.get_feature_names_out(), columns=[\"Counts\"]\n", " ).sort_values('Counts', ascending=False)\n", "doc_df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.4.5 Transformed Corpus" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cvdf = pd.DataFrame(data=cvtcorpus.toarray(),\n", " columns=cvectorizer.get_feature_names_out())\n", "print(f\"Raw count vectorised data frame has {cvdf.shape[0]:,} rows and {cvdf.shape[1]:,} columns.\")\n", "cvdf.iloc[0:5,0:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.4.6 Filter Low-Frequency Words\n", "\n", "These are likely to be artefacts of text-cleaning or human input error. As well, if we're trying to look across an entire corpus then we might not want to retain words that only appear in a couple of documents.\n", "\n", "Let's start by getting the *column* sums:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sums = cvdf.sum(axis=0)\n", "print(f\"There are {len(sums):,} terms in the data set.\")\n", "sums.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove columns (i.e. terms) appearing in less than 1% of documents. You can do this by thinking about what the shape of the data frame means (rows and/or columns) and how you'd get 1% of that!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filter_terms = sums >= cvdf.shape[0] * ???" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now see how we can use this to strip out the columns corresponding to low-frequency terms:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fcvdf = cvdf.drop(columns=cvdf.columns[~filter_terms].values)\n", "print(f\"Filtered count vectorised data frame has {fcvdf.shape[0]:,} rows and {fcvdf.shape[1]:,} columns.\")\n", "fcvdf.iloc[0:5,0:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fcvdf.sum(axis=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're going to pick this up again in Task 7." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.4.7 Questions\n", "\n", "- Can you explain what `doc_df` contains?\n", "- What does `cvdf` contain? Explain the rows and columns.\n", "- What is the function of `filter_terms`?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 6.5: TF/IDF Vectoriser\n", "\n", "
\n", " Difficulty Level: Moderate, if you want to understand how max_df and min_df work!\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.5.1 Fit and Transform" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tfvectorizer = TfidfVectorizer(use_idf=True, ngram_range=(1,3), \n", " max_df=0.75, min_df=0.01) # <-- these matter!\n", "tftcorpus = tfvectorizer.fit_transform(corpus) # TF-transformed corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.5.2 Single Document" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "doc_df = pd.DataFrame(tftcorpus[0].T.todense(), index=tfvectorizer.get_feature_names_out(), columns=[\"Weights\"])\n", "doc_df.sort_values('Weights', ascending=False).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.5.3 Transformed Corpus" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tfidf = pd.DataFrame(data=tftcorpus.toarray(),\n", " columns=tfvectorizer.get_feature_names_out())\n", "print(f\"TF/IDF data frame has {tfidf.shape[0]:,} rows and {tfidf.shape[1]:,} columns.\")\n", "tfidf.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 6.5.4 Questions\n", "\n", "- What does the TF/IDF score *represent*?\n", "- What is the role of `max_df` and `min_df`?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 7. Word Clouds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 7.1: For Counts\n", "\n", "
\n", " Difficulty Level: Easy!\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fcvdf.sum().sort_values(ascending=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12, 12))\n", "Cloud = WordCloud(\n", " background_color=\"white\", \n", " max_words=75,\n", " font_path='/home/jovyan/.local/share/fonts/Roboto-Light.ttf'\n", ").generate_from_frequencies(fcvdf.sum())\n", "plt.imshow(Cloud) \n", "plt.axis(\"off\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 7.2: For TF/IDF Weighting\n", "\n", "
\n", " Difficulty Level: Easy, but you'll need to be patient!\n", "
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tfidf.sum().sort_values(ascending=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12, 12))\n", "Cloud = WordCloud(\n", "    background_color=\"white\", \n", "    max_words=100,\n", "    font_path='/home/jovyan/.local/share/fonts/JetBrainsMono-VariableFont_wght.ttf'\n", ").generate_from_frequencies(tfidf.sum())\n", "plt.imshow(Cloud) \n", "plt.axis(\"off\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 7.2.3 Questions\n", "\n", "- What does the `sum` represent for the count vectoriser?\n", "- What does the `sum` represent for the TF/IDF vectoriser?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 8. Latent Dirichlet Allocation\n", "\n", "
\n", " 💡 Tip: I would give this a low priority. It's a commonly-used method, but on small data sets it really isn't much use and I've found its answers to be... unclear... even on large data sets.\n", "
\n", "\n", "Adapted from [this post](https://stackabuse.com/python-for-nlp-topic-modeling/) on doing LDA using sklearn. Most other examples use the `gensim` library." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vectorizer = CountVectorizer(ngram_range=(1,2)) # Notice change to ngram range (try 1,1 and 1,2 for other options)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 8.1: Calculate Topics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vectorizer.fit(corpus) \n", "tcorpus = vectorizer.transform(corpus) # tcorpus for transformed corpus\n", "\n", "LDA = LatentDirichletAllocation(n_components=3, random_state=42) # Might want to experiment with n_components too\n", "LDA.fit(tcorpus)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first_topic = LDA.components_[0]\n", "top_words = first_topic.argsort()[-25:]\n", "\n", "for i in top_words:\n", "    print(vectorizer.get_feature_names_out()[i])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i,topic in enumerate(LDA.components_):\n", "    as_markdown(f'Top 25 words for topic #{i}', ', '.join([vectorizer.get_feature_names_out()[j] for j in topic.argsort()[-25:]]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 8.2: Maximum Likelihood Topic" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "topic_values = LDA.transform(tcorpus)\n", "topic_values.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.options.display.max_colwidth=20\n", "srcdf['Topic'] = topic_values.argmax(axis=1)\n", "srcdf.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.options.display.max_colwidth=75\n", "srcdf[srcdf.Topic==1].description_norm.head(10)"
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vectorizer = CountVectorizer(ngram_range=(1,1), stop_words='english', analyzer='word', max_df=0.7, min_df=0.05)\n", "topic_corpus = vectorizer.fit_transform(srcdf[srcdf.Topic==1].description.values) # tcorpus for transformed corpus" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "topicdf = pd.DataFrame(data=topic_corpus.toarray(),\n", " columns=vectorizer.get_feature_names_out())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12, 12))\n", "Cloud = WordCloud(background_color=\"white\", max_words=75).generate_from_frequencies(topicdf.sum())\n", "plt.imshow(Cloud) \n", "plt.axis(\"off\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 9. Word2Vec\n", "\n", "
\n", " 🤯 Tip: This algorithm works almost like magic. You should play with the configuration parameters and see how it changes your results.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 9.1: Configure" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gensim.models.word2vec import Word2Vec" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dims = 100\n", "print(f\"You've chosen {dims} dimensions.\")\n", "\n", "window = 3\n", "print(f\"You've chosen a window of size {window}.\")\n", "\n", "min_v_freq = 0.005 # Don't keep words appearing less than 0.5% frequency\n", "min_v_count = math.ceil(min_v_freq * srcdf.shape[0])\n", "print(f\"With a minimum frequency of {min_v_freq} and {srcdf.shape[0]:,} documents, minimum vocab frequency is {min_v_count:,}.\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 9.2: Train" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time \n", "\n", "corpus = srcdf.description_norm.fillna(' ').values\n", "#corpus_sent = [nltk.sent_tokenize(text) for text in corpus] # <-- with more formal writing this would work well\n", "corpus_sent = [d.replace('.',' ').split(' ') for d in corpus] # <-- deals better with many short sentences though context may end up... weird\n", "model = Word2Vec(sentences=corpus_sent, vector_size=dims, window=window, epochs=200, \n", " min_count=min_v_count, seed=42, workers=1)\n", "\n", "model.save(f\"word2vec-d{dims}-w{window}.model\") # <-- You can then Word2Vec.load(...) which is useful with large corpora" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 9.3: Explore Similarities\n", "\n", "This next bit of code only runs if you have calculated the frequencies above in the [Frequencies and Ngrams](#frequencies-and-ngrams) section." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_colwidth',150)\n", "\n", "df = fcounts[1] # <-- copy out only the unigrams as we haven't trained anything else\n", "\n", "n = 14 # number of words\n", "topn = 7 # number of most similar words\n", "\n", "selected_words = df[df['Ngram Size 1'] > 5].reset_index().level_0.sample(n, random_state=42).tolist()\n", "\n", "words = []\n", "v1 = []\n", "v2 = []\n", "v3 = []\n", "sims = []\n", "\n", "for w in selected_words:\n", "    try:\n", "        vector = model.wv[w] # get numpy vector of a word\n", "        #print(f\"Word vector for '{w}' starts: {vector[:5]}...\")\n", "\n", "        sim = model.wv.most_similar(w, topn=topn)\n", "        #print(f\"Similar words to '{w}' include: {sim}.\")\n", "\n", "        words.append(w)\n", "        v1.append(vector[0])\n", "        v2.append(vector[1])\n", "        v3.append(vector[2])\n", "        sims.append(\", \".join([x[0] for x in sim]))\n", "    except KeyError:\n", "        print(f\"Didn't find {w} in model. Can happen with low-frequency terms.\")\n", "\n", "vecs = pd.DataFrame({\n", "    'Term':words,\n", "    'V1':v1,\n", "    'V2':v2,\n", "    'V3':v3,\n", "    f'Top {topn} Similar':sims\n", "})\n", "\n", "vecs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#print(model.wv.index_to_key) # <-- the full vocabulary that has been trained" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Task 9.4: Apply\n", "\n", "We're going to make *use* of this further next week..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 9.4.1 Questions\n", "\n", "- What happens when *dims* is very small (e.g. 25) or very large (e.g. 300)?\n", "- What happens when *window* is very small (e.g. 2) or very large (e.g. 8)?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 10. Processing the Full File" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " ⚠ Warning: This code can take some time (> 5 minutes on a M2 Mac) to run, so don't run this until you've understood what we did before!
\n", "\n", "You will get a warning about `\".\" looks like a filename, not markup` — this looks a little scary, but is basically suggesting that we have a description that consists only of a '.' or that looks like some kind of URL (which the parser thinks means you're trying to pass it something to download). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time \n", "# This can take up to 8 minutes on a M2 Mac\n", "gdf['description_norm'] = ''\n", "gdf['description_norm'] = gdf.description.apply(normalise_document, remove_digits=True, special_char_removal=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf.to_parquet(os.path.join('data','geo',f'{fn.replace(\".\",\"-with-nlp.\")}'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " 💡 Tip: saving an intermediate file at this point is useful because you've done quite a bit of expensive computation. You could restart-and-run-all and then go out for the day, but probably easier to just save this output and then, if you need to restart your analysis at some point in the future, just remember to deserialise amenities back into a list format.\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applications\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above is _still_ only the results for the 'luxury' apartments _alone_. At this point, you would probably want to think about how your results might change if you changed any of the following:\n", "\n", "1. Using one of the other data sets that we created, or even the entire data set!\n", "2. Applying the CountVectorizer or TfidfVectorizer _before_ selecting out any of our 'sub' data sets.\n", "3. Using the visualisation of information from \\#2 to improve our regex selection process.\n", "4. Reducing, increasing, or constraining (i.e. `ngram_range=(2,2)`) the size of the ngrams while bearing in mind the impact on processing time and interpretability.\n", "5. Filtering by type of listing or host instead of keywords found in the description (for instance, what if you applied TF/IDF to the entire data set and then selected out 'Whole Properties' before splitting into those advertised by hosts with only one listing vs. those with multiple listings?).\n", "6. Linking this back to the geography.\n", "\n", "Over the next few weeks we'll also consider alternative means of visualising the data!"
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resources\n", "\n", "There is a lot more information out there, including a [whole book](https://www.nltk.org/book/) and your standard [O'Reilly text](http://www.datascienceassn.org/sites/default/files/Natural%20Language%20Processing%20with%20Python.pdf).\n", "\n", "And some more useful links:\n", "- [Pandas String Contains Method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html)\n", "- [Using Regular Expressions with Pandas](https://kanoki.org/2019/11/12/how-to-use-regex-in-pandas/)\n", "- [Summarising Chapters from Frankenstein using TF/IDF](https://towardsdatascience.com/using-tf-idf-to-form-descriptive-chapter-summaries-via-keyword-extraction-4e6fd857d190)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 4 }