{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lexical Clustering" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**(C) 2016-2024 by [Damir Cavar](http://damir.cavar.me/)**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Version:** 1.3, January 2024" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-notebooks)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Prerequisites:**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U nltk" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -U scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook provides simple examples of vectorization of distributional properties of lexical items using Python 3.x. The applied examples show how lexical properties can be derived using common clustering methods on the resulting distributional vector space. This material is used in my graduate classes on Natural Language Processing, Corpus and Computational Linguistics at [Indiana University at Bloomington](https://www.indiana.edu/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Vectorization of Distributional Properties" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will map out lexical distributional properties in the following. By lexical distributional properties we mean various kinds of positional or contextual features of words in text." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading a Text into Memory" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use a collection of fairy tales, \"A House of Pomegranates\" by Oscar Wilde. The following code will read the text into memory. We open the file and read from it; the `with` statement closes the file again:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "with open(\"data/HOPG.txt\", mode='r', encoding='utf-8') as ifile:\n", "    text = ifile.read()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using NLTK" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the [NLTK](https://www.nltk.org/) module to generate frequency profiles and [n-gram models](https://en.wikipedia.org/wiki/N-gram) using the tokens in the text." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import nltk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the tokenization and lemmatization modules from [NLTK](https://www.nltk.org/). These are not the most accurate or best-performing components. For more accurate and efficient lemmatizers, consider using Python modules like [spaCy](https://spacy.io/)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tokenization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will need the tokens from the text, that is mainly all individual words and punctuation marks separated as individual elements in a token list:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "tokens = nltk.word_tokenize(text)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['A', 'HOUSE', 'OF', 'POMEGRANATES', 'Contents', ':', 'The', 'Young', 'King', 'The', 'Birthday', 'of', 'the', 'Infanta', 'The', 'Fisherman', 'and', 'his', 'Soul', 'The', 'Star-child', 'THE', 'YOUNG', 'KING', '[', 'TO', 'MARGARET', 'LADY', 'BROOKE', '--', 'THE', 'RANEE', 'OF', 'SARAWAK', ']', 'It', 'was', 'the', 'night', 'before', 'the', 'day', 'fixed', 'for', 'his', 'coronation', ',', 'and', 'the', 'young', 'King', 'was', 'sitting', 'alone', 'in', 'his', 'beautiful', 'chamber', '.', 'His', 'courtiers', 'had', 'all', 'taken', 'their', 'leave', 'of', 'him', ',', 'bowing', 'their', 'heads', 'to', 'the', 'ground', ',', 'according', 'to', 'the', 'ceremonious', 'usage', 'of', 'the', 'day', ',', 'and', 'had', 'retired', 'to', 'the', 'Great', 'Hall', 'of', 'the', 'Palace', ',', 'to', 'receive', 'a', 'few', 'last', 'lessons', 'from', 'the', 'Professor', 'of', 'Etiquette', ';', 'there', 'being', 'some', 'of', 'them', 'who', 'had', 'still', 'quite', 'natural', 'manners', ',', 'which', 'in', 'a', 'courtier', 'is', ',', 'I', 'need', 'hardly', 'say', ',', 'a', 'very', 'grave', 'offence', '.', 'The', 'lad', '--', 'for', 'he', 'was', 'only', 'a', 'lad', ',', 'being', 'but', 'sixteen', 'years', 'of', 'age', '--', 'was', 'not', 'sorry', 'at', 'their', 'departure', ',', 'and', 'had', 'flung', 'himself', 'back', 'with', 'a', 'deep', 'sigh', 'of', 'relief', 'on', 'the', 'soft', 'cushions', 'of', 'his', 'embroidered', 'couch', ',', 'lying', 
'there', ',', 'wild-eyed', 'and', 'open-mouthed', ',', 'like', 'a', 'brown', 'woodland', 'Faun', ',', 'or', 'some', 'young', 'animal', 'of', 'the', 'forest', 'newly', 'snared', 'by', 'the', 'hunters', '.', 'And', ',', 'indeed', ',', 'it', 'was', 'the', 'hunters', 'who', 'had', 'found', 'him', ',', 'coming', 'upon', 'him', 'almost', 'by', 'chance', 'as', ',', 'bare-limbed', 'and', 'pipe', 'in', 'hand', ',', 'he', 'was', 'following', 'the', 'flock', 'of', 'the', 'poor', 'goatherd', 'who', 'had', 'brought', 'him', 'up', ',', 'and', 'whose', 'son', 'he', 'had', 'always', 'fancied', 'himself', 'to', 'be', '.', 'The', 'child', 'of', 'the', 'old', 'King', \"'s\", 'only', 'daughter', 'by', 'a', 'secret', 'marriage', 'with', 'one', 'much', 'beneath', 'her', 'in', 'station', '--', 'a', 'stranger', ',', 'some', 'said', ',', 'who', ',', 'by', 'the', 'wonderful', 'magic', 'of', 'his', 'lute-playing', ',', 'had', 'made', 'the', 'young']\n" ] } ], "source": [ "print(tokens[:300])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tokens will contain all tokens as they occur in text. This means that we will find in the token list a *the*, a *The*, maybe even a *THE*. To conflate all occurrences of these variants of \"the\" to one token representation *the*, we will use lemmatization in the next section." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lemmatization for Dimensionality Reduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NLTK provides a WordNet-based lemmatizer. 
In the following we import the NLTK *WordNetLemmatizer* module:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from nltk.stem import WordNetLemmatizer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We instantiate a lemmatizer:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "lemmatizer = WordNetLemmatizer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The lemmatizer correctly converts the plural form *dogs* to the lemmatized form, as shown in the example below:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dog\n" ] } ], "source": [ "print(lemmatizer.lemmatize(\"dogs\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, the lemmatizer does not normalize a capitalized *The* to lowercase:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The\n" ] } ], "source": [ "print(lemmatizer.lemmatize(\"The\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Independently of this problem, we can use the lemmatizer to reduce tokens with morphological structure, such as inflectional affixes, to their base form in the following way:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "lemmas = [ lemmatizer.lemmatize(token) for token in tokens ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can print out the first 100 lemmas:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['A', 'HOUSE', 'OF', 'POMEGRANATES', 'Contents', ':', 'The', 'Young', 'King', 'The', 'Birthday', 'of', 'the', 'Infanta', 'The', 'Fisherman', 'and', 'his', 'Soul', 'The', 'Star-child', 'THE', 'YOUNG', 'KING', '[', 'TO', 'MARGARET', 'LADY', 'BROOKE', '--', 'THE', 
'RANEE', 'OF', 'SARAWAK', ']', 'It', 'wa', 'the', 'night', 'before', 'the', 'day', 'fixed', 'for', 'his', 'coronation', ',', 'and', 'the', 'young', 'King', 'wa', 'sitting', 'alone', 'in', 'his', 'beautiful', 'chamber', '.', 'His', 'courtier', 'had', 'all', 'taken', 'their', 'leave', 'of', 'him', ',', 'bowing', 'their', 'head', 'to', 'the', 'ground', ',', 'according', 'to', 'the', 'ceremonious', 'usage', 'of', 'the', 'day', ',', 'and', 'had', 'retired', 'to', 'the', 'Great', 'Hall', 'of', 'the', 'Palace', ',', 'to', 'receive', 'a', 'few']\n" ] } ], "source": [ "print(lemmas[0:100])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Due to the weaknesses of the NLTK WordNet-based lemmatizer for generic lemmatization (note *wa* for *was* in the output above), I provide a token list that was lemmatized with an alternative lemmatizer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Functional Items as Distributional Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Distributional properties of lexical items can be associated with various contextual cues. In a *Distributional Semantics* approach the core hypothesis is that the meaning of a specific word is determined by the meaning of the words in its context. Imagine the two different uses of *bats*:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*The bats were flying out of the cave.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*The bats were made of solid wood.*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Baseball bats are more likely to be made of solid wood than to fly out of caves. The mammals of the order Chiroptera, on the other hand, live in caves and fly in and out of them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The general idea in Distributional Semantics is that the meaning of *bat* can be determined by the words in the context. 
If a word had only one specific meaning, its meaning could in principle be defined by the words frequently occurring in its context. We could also think of it in another way. The meaning of a word could be defined to be a probability function that predicts words in its context. This is a common interpretation in word-embedding approaches. This is obviously an oversimplification and conceptually wrong, but it is an approximation that has proven helpful in some NLP applications and models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The core problem is of course that *bat* can refer to many things, at least two, and the context can help us determine which meaning is most appropriate for a specific occurrence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Vectorization" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "bigrams = list( nltk.ngrams(lemmas, 2) )" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('A', 'HOUSE'),\n", " ('HOUSE', 'OF'),\n", " ('OF', 'POMEGRANATES'),\n", " ('POMEGRANATES', 'Contents'),\n", " ('Contents', ':'),\n", " (':', 'The'),\n", " ('The', 'Young'),\n", " ('Young', 'King'),\n", " ('King', 'The'),\n", " ('The', 'Birthday')]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bigrams[:10]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "bigramFD = nltk.FreqDist(bigrams)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "99\n" ] } ], "source": [ "print(bigramFD[('the', 'young')])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "functionwords = \"\"\"\n", "I me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves\n", "he him his himself 
she she's her hers herself it it's its itself they them their theirs themselves\n", "what which who whom this that that'll these those am is are was were be been being have has had\n", "having do does did doing a an the and but if or because as until while of at by for with about\n", "against between into through during before after above below to from up down in out on off over under\n", "again further then once here there when where why how all any both each few more most other some such\n", "no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y\n", "ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn\n", "isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren\n", "weren't won won't wouldn wouldn't\n", "\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To guarantee fast lookup, we can convert the function word list into a dictionary (or hash-map), storing the list position as the value. The list position serves as the unique index of the corresponding dimension in the vector space." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "range(0, 179)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "range(len(functionwords.split()))" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [], "source": [ "# map every function word (not every character of the string) to its list position\n", "functionWordsHash = { word: i for i, word in enumerate(functionwords.split()) }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will store word vectors in a dictionary, with the key being the word, the value being the context vector. The vector space consists of vectors for all non-feature tokens (or words)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now write a function that converts the distributional properties of words into a vector space. 
Assume that the function takes as parameters an *ngram* frequency profile (a mapping from token tuples to counts, such as an NLTK *FreqDist*) and a list of features with length $f$. Depending on the size of $n$ in the *ngram* model, the length of the vector representation for every word would in general be $(n - 1) \\times f$ for left or right contexts, and $(n - 1) \\times f \\times 2$ for both contexts. The following function ignores $n$ completely. For all $n > 1$ it only considers the leftmost and the rightmost token of each tuple, so the vectors have length $f$ for one context side and $2 \\times f$ for both sides." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def makeVectorSpace(tokenTuples, features, left=True, right=True):\n", "    # tokenTuples maps n-gram tuples to their frequencies (e.g. nltk.FreqDist)\n", "    if tokenTuples:\n", "        n = len(next(iter(tokenTuples)))\n", "        if n < 2:\n", "            return {}\n", "    else:\n", "        return {}\n", "    if features:\n", "        f = len(features)\n", "        featuresHash = { features[x]:x for x in range(len(features)) }\n", "    else:\n", "        return {}\n", "\n", "    if left and right:\n", "        vectorLength = 2 * f\n", "    elif left or right:\n", "        vectorLength = f\n", "    else:\n", "        return {}\n", "\n", "    vectorModel = {}\n", "    for x in tokenTuples:\n", "        if x[-1] in featuresHash and x[0] not in featuresHash:\n", "            # the feature follows the token: right context\n", "            if not right:\n", "                continue\n", "            feat = x[-1]\n", "            t = x[0]\n", "            s = 1 if left else 0\n", "        elif x[0] in featuresHash and x[-1] not in featuresHash:\n", "            # the feature precedes the token: left context\n", "            if not left:\n", "                continue\n", "            feat = x[0]\n", "            t = x[-1]\n", "            s = 0\n", "        else:\n", "            continue\n", "        tokenVector = vectorModel.get(t, [0]*vectorLength)\n", "        # left contexts fill the first f dimensions, right contexts the second f\n", "        fPos = featuresHash.get(feat) + (f * s)\n", "        tokenVector[fPos] = tokenVector[fPos] + tokenTuples.get(x, 0)\n", "        vectorModel[t] = tokenVector\n", "    return vectorModel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us create some sample text. In this case *testText* is a tokenized and orthographically normalized text." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "testText = \"\"\"John Smith met Susan Peters in Paris where she lived a happy life over the last ten years .\n", "for many years she lived in Berlin .\n", "the city was too big for her .\n", "why she moved to Paris is unclear .\n", "in Paris she has a small apartment . \"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can create an *ngram* model from the text, with $n=2$, and extract a vectorization as follows:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Paris [0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0]\n", "Peters [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]\n", "she [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]\n", "lived [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]\n", "happy [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n", "life [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]\n", "last [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n", "Berlin [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n", ". [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0]\n", "city [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n", "has [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]\n", "small [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]\n" ] } ], "source": [ "testNgrams = nltk.FreqDist(nltk.ngrams(nltk.word_tokenize(testText), 2))\n", "testFeatures = ['the', 'a', 'in', 'on', 'where', 'over']\n", "wordVectors = makeVectorSpace(testNgrams, testFeatures)\n", "for k, v in wordVectors.items():\n", "    print(k, v)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The vector space above contains absolute context frequencies. 
We can relativize the vector space for all vectors as follows:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "relVecSpace = {}\n", "for x, y in wordVectors.items():\n", " total = sum(y)\n", " relVecSpace[x] = [ i/total for i in y ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print the relativized vector space:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Paris [0.0, 0.0, 0.6666666666666666, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3333333333333333, 0.0]\n", "Peters [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]\n", "she [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]\n", "lived [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0]\n", "happy [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]\n", "life [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]\n", "last [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]\n", "Berlin [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]\n", ". [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0]\n", "city [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]\n", "has [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]\n", "small [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]\n" ] } ], "source": [ "for k, v in relVecSpace.items():\n", " print(k, v)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clustering Vector Spaces" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given the distributional vector space above, I will show in the following how to use different clustering algorithms to group the lexical vectors using geometrical similarity metrics." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the clustering experiments we will use the [scikit-learn](https://scikit-learn.org/stable/index.html) module. 
If you have not installed [scikit-learn](https://scikit-learn.org/stable/index.html) yet, follow the instructions on [the module's website](https://scikit-learn.org/stable/install.html#installation-instructions). To import the K-Means clustering algorithm, use the following import statement:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More efficient and useful representations of and operations on vectors and vector space models can be achieved with the numpy module. We import it as *np*:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The labels can be converted into a more memory-efficient tuple. The vector space can be converted into a numpy array using the *np.array* function and a list comprehension over the individual word vectors." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "labels = tuple(wordVectors.keys())\n", "vectors = np.array([np.array(x) for x in wordVectors.values()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The vector labels are now stored in a tuple of strings:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Paris', 'Peters', 'she', 'lived', 'happy', 'life', 'last', 'Berlin', '.', 'city', 'has', 'small')\n" ] } ], "source": [ "print(labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead of providing KMeans with a numpy matrix, it is also possible to use a pandas data frame. 
To use this method, we import pandas as *pd*:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To create a new pandas data frame, we use the vectors from our vector space model and declare the labels to be our left and right function word dimensions." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(vectors, columns=testFeatures * 2, index=labels)" ] }, { "cell_type": "code", "execution_count": 214, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>the</th><th>a</th><th>in</th><th>on</th><th>where</th><th>over</th><th>the</th><th>a</th><th>in</th><th>on</th><th>where</th><th>over</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>Paris</th><td>0</td><td>0</td><td>2</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n", "    <tr><th>Peters</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>she</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>lived</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>happy</th><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>life</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>\n", "    <tr><th>last</th><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>Berlin</th><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>.</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>city</th><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>has</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>small</th><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " the a in on where over the a in on where over\n", "Paris 0 0 2 0 0 0 0 0 0 0 1 0\n", "Peters 0 0 0 0 0 0 0 0 1 0 0 0\n", "she 0 0 0 0 1 0 0 0 0 0 0 0\n", "lived 0 0 0 0 0 0 0 1 1 0 0 0\n", "happy 0 1 0 0 0 0 0 0 0 0 0 0\n", "life 0 0 0 0 0 0 0 0 0 0 0 1\n", "last 1 0 0 0 0 0 0 0 0 0 0 0\n", "Berlin 0 0 1 0 0 0 0 0 0 0 0 0\n", ". 0 0 0 0 0 0 1 0 1 0 0 0\n", "city 1 0 0 0 0 0 0 0 0 0 0 0\n", "has 0 0 0 0 0 0 0 1 0 0 0 0\n", "small 0 1 0 0 0 0 0 0 0 0 0 0" ] }, "execution_count": 214, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more transparency we can add *-l* and *-r* to indicate the function words as left or right context cues respectively:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>the-l</th><th>a-l</th><th>in-l</th><th>on-l</th><th>where-l</th><th>over-l</th><th>the-r</th><th>a-r</th><th>in-r</th><th>on-r</th><th>where-r</th><th>over-r</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>Paris</th><td>0</td><td>0</td><td>2</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n", "    <tr><th>Peters</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>she</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>lived</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>happy</th><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>life</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td></tr>\n", "    <tr><th>last</th><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>Berlin</th><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>.</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>city</th><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>has</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "    <tr><th>small</th><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " the-l a-l in-l on-l where-l over-l the-r a-r in-r on-r \\\n", "Paris 0 0 2 0 0 0 0 0 0 0 \n", "Peters 0 0 0 0 0 0 0 0 1 0 \n", "she 0 0 0 0 1 0 0 0 0 0 \n", "lived 0 0 0 0 0 0 0 1 1 0 \n", "happy 0 1 0 0 0 0 0 0 0 0 \n", "life 0 0 0 0 0 0 0 0 0 0 \n", "last 1 0 0 0 0 0 0 0 0 0 \n", "Berlin 0 0 1 0 0 0 0 0 0 0 \n", ". 0 0 0 0 0 0 1 0 1 0 \n", "city 1 0 0 0 0 0 0 0 0 0 \n", "has 0 0 0 0 0 0 0 1 0 0 \n", "small 0 1 0 0 0 0 0 0 0 0 \n", "\n", " where-r over-r \n", "Paris 1 0 \n", "Peters 0 0 \n", "she 0 0 \n", "lived 0 0 \n", "happy 0 0 \n", "life 0 1 \n", "last 0 0 \n", "Berlin 0 0 \n", ". 0 0 \n", "city 0 0 \n", "has 0 0 \n", "small 0 0 " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns = [ x+\"-l\" for x in testFeatures ] + [ x+\"-r\" for x in testFeatures ]\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the values from the data frame as a matrix for KMeans:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "m = df.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now apply KMeans to the data matrix:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "KMeans(n_clusters=4)" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "km = KMeans(n_clusters=4)\n", "km.fit(m)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cluster labels can be used to generate a new data frame that pairs each word with its cluster ID:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\"><th></th><th>word</th><th>cluster</th></tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr><th>0</th><td>Paris</td><td>3</td></tr>\n", "    <tr><th>1</th><td>Peters</td><td>1</td></tr>\n", "    <tr><th>2</th><td>she</td><td>0</td></tr>\n", "    <tr><th>3</th><td>lived</td><td>1</td></tr>\n", "    <tr><th>4</th><td>happy</td><td>0</td></tr>\n", "    <tr><th>5</th><td>life</td><td>0</td></tr>\n", "    <tr><th>6</th><td>last</td><td>2</td></tr>\n", "    <tr><th>7</th><td>Berlin</td><td>3</td></tr>\n", "    <tr><th>8</th><td>.</td><td>1</td></tr>\n", "    <tr><th>9</th><td>city</td><td>2</td></tr>\n", "    <tr><th>10</th><td>has</td><td>0</td></tr>\n", "    <tr><th>11</th><td>small</td><td>0</td></tr>\n", "  </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " word cluster\n", "0 Paris 3\n", "1 Peters 1\n", "2 she 0\n", "3 lived 1\n", "4 happy 0\n", "5 life 0\n", "6 last 2\n", "7 Berlin 3\n", "8 . 1\n", "9 city 2\n", "10 has 0\n", "11 small 0" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = pd.DataFrame([df.index, km.labels_]).T\n", "results.columns=(\"word\", \"cluster\")\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On this small sample the clusters will not make a lot of sense. We need a larger text or corpus to extract more detailed distributional properties. In the following section I will introduce a larger text that is normalized or lemmatized, to run various lexical clustering experiments on it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clustering on Larger Texts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(C) 2016-2024 by [Damir Cavar](http://damir.cavar.me/) - [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.1" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": false, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": false, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }