{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CS 579\n", "
\n", "\n", "## Clustering Words with K-Means\n", "\n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Motivation\n", "\n", "Often, we want to know which features appear together.\n", "\n", "- If you liked *Twilight* you might like *Nosferatu*.\n", "- \"happy\" is a synonym of \"glad.\"\n", "\n", "Can be used to summarize a large collection of messages." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use k-means to cluster together related words from Twitter.\n", "\n", "**Caution:** This uses live Twitter data, which often contains profanity." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100\n", "200\n", "300\n", "400\n", "500\n", "600\n", "700\n", "800\n", "900\n", "1000\n", "1100\n", "1200\n", "1300\n", "1400\n", "1500\n", "1600\n", "1700\n", "1800\n", "1900\n", "2000\n", "2100\n", "2200\n", "2300\n", "2400\n", "2500\n", "2600\n", "2700\n", "2800\n", "2900\n", "3000\n", "3100\n", "3200\n", "3300\n", "3400\n", "3500\n", "3600\n", "3700\n", "3800\n", "3900\n", "4000\n", "4100\n", "4200\n", "4300\n", "4400\n", "4500\n", "4600\n", "4700\n", "4800\n", "4900\n", "5000\n", "5100\n", "5200\n", "5300\n", "5400\n", "5500\n", "5600\n", "5700\n", "5800\n", "5900\n", "6000\n", "6100\n", "6200\n", "6300\n", "6400\n", "6500\n", "6600\n", "6700\n", "6800\n", "6900\n", "7000\n", "7100\n", "7200\n", "7300\n", "7400\n", "7500\n", "7600\n", "7700\n", "7800\n", "7900\n", "8000\n", "8100\n", "8200\n", "8300\n", "8400\n", "8500\n", "8600\n", "8700\n", "8800\n", "8900\n", "9000\n", "9100\n", "9200\n", "9300\n", "9400\n", "9500\n", "9600\n", "9700\n", "9800\n", "9900\n", "10000\n" ] } ], "source": [ "# Get some tweets containing the word 'i'.\n", "\n", "import os\n", "from TwitterAPI import TwitterAPI\n", "\n", "# Read Twitter credentials from environmental variables.\n", "api = TwitterAPI(os.environ.get('TW_CONSUMER_KEY'),\n", " os.environ.get('TW_CONSUMER_SECRET'),\n", " os.environ.get('TW_ACCESS_TOKEN'),\n", " os.environ.get('TW_ACCESS_TOKEN_SECRET'))\n", "\n", "# Collect 10000 tweets.\n", "tweets = []\n", "while True: \n", " r = api.request('statuses/filter', {'track':'i',\n", " 'language':'en'})\n", " if r.status_code != 200: # error\n", " break\n", " else:\n", " for item in r.get_iterator():\n", " tweets.append(item)\n", " if len(tweets) > 10000:\n", " break\n", " elif len(tweets) % 100 == 0:\n", " print(len(tweets))\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10002\n" ] } ], "source": [ "print(len(tweets))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "text @KYFriedComrade im rereading it ....maybe it is about twitter i dunno anymore i give up lol\n", "description: gringa, adoptive santiaguina, ADHD stoner lela lez left stream of consciousness commentariat + retweets of dope ppl โœŠ๐Ÿคฃ๐Ÿ˜Ž๐Ÿ˜˜\n", "name: naty ๐Ÿ–ค๐Ÿงก๐Ÿ’š\n", "location: Santiago, Chile\n" ] } ], "source": [ "# Each tweet is a Python dict.\n", "print('text', tweets[0]['text'])\n", "print('description:', tweets[0]['user']['description'])\n", "print('name:', tweets[0]['user']['name'])\n", "print('location:', tweets[0]['user']['location'])" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "tweets = [t for t in tweets if 'text' in t]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "9806" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(tweets)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['im',\n", " 'rereading',\n", " 'it',\n", " 'maybe',\n", " 'it',\n", " 'is',\n", " 'about',\n", " 'twitter',\n", " 'i',\n", " 'dunno',\n", " 'anymore',\n", " 'i',\n", " 'give',\n", " 'up',\n", " 'lol']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Tokenize each tweet text.\n", "import re\n", "tokens = []\n", "for tweet in tweets:\n", " text = tweet['text'].lower()\n", " text = re.sub('@\\S+', ' ', text) # Remove mentions.\n", " text = re.sub('http\\S+', ' ', text) # Remove urls.\n", " tokens.append(re.findall('[A-Za-z]+', text)) # Retain words.\n", "tokens[0]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Count words.\n", "from collections import Counter\n", "\n", "word_counts = Counter()\n", "for tweet in tokens:\n", " word_counts.update(tweet)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "13183 unique terms\n" ] }, { "data": { "text/plain": [ "[('i', 11474),\n", " ('rt', 5231),\n", " ('the', 3741),\n", " ('to', 3629),\n", " ('a', 2863),\n", " ('and', 2425),\n", " ('you', 2414),\n", " ('my', 2104),\n", " ('it', 1820),\n", " ('this', 1816)]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Inspect word counts.\n", "import math\n", "\n", "print(len(word_counts), 'unique terms')\n", "word_counts.most_common(10)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4077 words occur at least three times.\n" ] } ], "source": [ "# Retain in vocabulary words occurring more than twice.\n", "vocab = set([w for w, c in word_counts.items() if c > 2])\n", "print('%d words occur at least three times.' % len(vocab))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Prune tokens.\n", "newtoks = []\n", "for i, tweet in enumerate(tokens):\n", " newtok = [token for token in tweet if token in vocab]\n", " if len(newtok) > 0:\n", " newtoks.append(newtok)\n", "tokens = newtoks" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['im',\n", " 'it',\n", " 'maybe',\n", " 'it',\n", " 'is',\n", " 'about',\n", " 'twitter',\n", " 'i',\n", " 'anymore',\n", " 'i',\n", " 'give',\n", " 'up',\n", " 'lol']" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# A sample pruned tweet.\n", "tokens[0]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['rt',\n", " 'from',\n", " 'the',\n", " 'bottom',\n", " 'of',\n", " 'my',\n", " 'heart',\n", " 'i',\n", " 'hope',\n", " 'is',\n", " 'a',\n", " 'better',\n", " 'mental',\n", " 'health',\n", " 'year',\n", " 'for',\n", " 'everyone',\n", " 'lt']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Context features**\n", "\n", "To determine if two words are similar, we will create a feature vector that counts how often other words appear nearby.\n", "\n", "E.g.,\n", "\n", "> I really **love** school.\n", "\n", "> I really **like** school.\n", "\n", "> You **love** school.\n", "\n", "**love:** {really@-1: 1, school@1: 2, you@-1: 1}\n", "\n", "**like:** {really@-1: 1, school@1: 1}\n", "\n", "
\n", "\n", "**Assumption**: words with similar meaning have similar contexts vectors.\n", "\n" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "context for word twitter in ['im', 'it', 'maybe', 'it', 'is', 'about', 'twitter', 'i', 'anymore', 'i', 'give', 'up', 'lol']\n", "['it@-2', 'maybe@-1', 'is@1', 'about@2']\n" ] } ], "source": [ "import numpy as np\n", "def get_contexts(tweet, i, window):\n", " \"\"\"\n", " Get the context features for token at position i\n", " in this tweet, using the given window size.\n", " \"\"\"\n", " features = []\n", " for j in range(np.amax([0, i-window]), i):\n", " features.append(tweet[j] + \"@\" + str(j-i))\n", " for j in range(i+1, min(i + window + 1, len(tweet))):\n", " features.append(tweet[j] + \"@\" + str(j-i))\n", " return features\n", "\n", "print('context for word %s in %s' % (tokens[0][6], tokens[0]))\n", "print(get_contexts(tokens[0], i=3, window=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** Q: How would the approach differ if we ignore location of context?**\n", "\n", "E.g., **love:** {really: 1, school:1, you: 1} **vs** {really@-1: 1, school@1: 1, you@-1: 1}" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# For each term, create a context vector, indicating how often\n", "# each word occurs to the left or right of it.\n", "from collections import defaultdict\n", "import numpy as np\n", "\n", "# dict from term to context vector.\n", "contexts = defaultdict(lambda: Counter())\n", "window = 2\n", "for tweet in tokens:\n", " for i, token in enumerate(tweet):\n", " features = get_contexts(tweet, i, window)\n", " contexts[token].update(features)\n", " # Optionally: ignore word order\n", " # contexts[token].update(tweet[:i] + tweet[i+1:])\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('rt@-1', 1823),\n", " ('m@1', 1667),\n", " ('t@2', 798),\n", " ('rt@-2', 529),\n", " ('and@-1', 444),\n", " ('to@2', 433),\n", " ('love@1', 424),\n", " ('a@2', 418),\n", " ('have@1', 412),\n", " ('don@1', 398),\n", " ('am@1', 374),\n", " ('can@1', 361),\n", " ('you@2', 359),\n", " ('ll@1', 334),\n", " ('ve@1', 326),\n", " ('when@-1', 323),\n", " ('just@1', 320),\n", " ('but@-1', 277),\n", " ('was@1', 276),\n", " ('the@2', 262)]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "contexts['i'].most_common(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**tf-idf vectors**\n", "\n", "- We will transform the context features by dividing by (the log of) the number of distinct terms this feature appears in." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('i@-1', 11338),\n", " ('i@-2', 10905),\n", " ('i@1', 9485),\n", " ('i@2', 7121),\n", " ('rt@-1', 5217)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Compute the number of different contexts each term appears in.\n", "# Actually: this is the total number of times this context feature appears.\n", "tweet_freq = Counter()\n", "for context in contexts.values():\n", " tweet_freq.update(context)\n", "tweet_freq.most_common(5)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [ { "data": { "text/plain": [ "Counter({1766: 1,\n", " 25: 71,\n", " 109: 8,\n", " 55: 24,\n", " 772: 1,\n", " 88: 12,\n", " 5200: 1,\n", " 306: 1,\n", " 2381: 1,\n", " 3436: 1,\n", " 34: 45,\n", " 7: 842,\n", " 412: 2,\n", " 106: 8,\n", " 649: 1,\n", " 211: 2,\n", " 44: 32,\n", " 5: 1369,\n", " 1032: 1,\n", " 2021: 1,\n", " 903: 1,\n", " 4: 2130,\n", " 2335: 2,\n", " 3249: 1,\n", " 1188: 1,\n", " 2850: 1,\n", " 1377: 1,\n", " 670: 1,\n", " 32: 43,\n", " 7121: 1,\n", " 62: 19,\n", " 27: 62,\n", " 2216: 1,\n", " 5217: 1,\n", " 14: 224,\n", " 35: 41,\n", " 60: 18,\n", " 320: 1,\n", " 10: 415,\n", " 815: 2,\n", " 219: 1,\n", " 3558: 1,\n", " 40: 32,\n", " 118: 8,\n", " 862: 1,\n", " 75: 14,\n", " 6: 1045,\n", " 10905: 1,\n", " 99: 7,\n", " 97: 6,\n", " 140: 4,\n", " 385: 3,\n", " 424: 2,\n", " 131: 3,\n", " 201: 4,\n", " 33: 42,\n", " 114: 7,\n", " 91: 5,\n", " 45: 26,\n", " 21: 102,\n", " 1453: 1,\n", " 119: 4,\n", " 78: 10,\n", " 65: 25,\n", " 468: 1,\n", " 258: 1,\n", " 144: 1,\n", " 52: 26,\n", " 585: 1,\n", " 1447: 1,\n", " 1: 240,\n", " 158: 2,\n", " 103: 10,\n", " 116: 7,\n", " 214: 2,\n", " 92: 11,\n", " 199: 3,\n", " 59: 17,\n", " 8: 669,\n", " 410: 1,\n", " 584: 1,\n", " 16: 192,\n", " 79: 13,\n", " 510: 2,\n", " 367: 1,\n", " 41: 48,\n", " 1652: 1,\n", " 1807: 1,\n", " 94: 11,\n", " 113: 4,\n", " 3: 3235,\n", " 31: 59,\n", " 586: 1,\n", " 1196: 1,\n", " 133: 3,\n", " 173: 1,\n", " 20: 132,\n", " 3656: 1,\n", " 503: 1,\n", " 2202: 1,\n", " 15: 203,\n", " 390: 3,\n", " 911: 1,\n", " 227: 3,\n", " 185: 4,\n", " 574: 1,\n", " 2: 831,\n", " 803: 1,\n", " 46: 23,\n", " 453: 1,\n", " 66: 17,\n", " 19: 132,\n", " 2037: 1,\n", " 28: 70,\n", " 73: 15,\n", " 64: 13,\n", " 2771: 1,\n", " 93: 7,\n", " 1270: 1,\n", " 490: 1,\n", " 146: 3,\n", " 1305: 1,\n", " 905: 1,\n", " 653: 1,\n", " 149: 5,\n", " 161: 5,\n", " 9485: 1,\n", " 80: 12,\n", " 49: 14,\n", " 23: 89,\n", " 84: 9,\n", " 83: 12,\n", " 9: 454,\n", " 1475: 1,\n", " 1036: 1,\n", " 129: 6,\n", " 293: 2,\n", " 321: 1,\n", " 2452: 1,\n", " 102: 3,\n", " 70: 14,\n", " 12: 301,\n", " 37: 47,\n", " 1480: 1,\n", " 297: 3,\n", " 61: 12,\n", " 966: 1,\n", " 69: 9,\n", " 1898: 1,\n", " 1522: 1,\n", " 48: 27,\n", " 148: 3,\n", " 403: 1,\n", " 484: 2,\n", " 1629: 1,\n", " 215: 4,\n", " 18: 123,\n", " 13: 259,\n", " 17: 172,\n", " 162: 3,\n", " 105: 5,\n", " 176: 1,\n", " 192: 2,\n", " 1328: 1,\n", " 24: 81,\n", " 58: 21,\n", " 29: 65,\n", " 1194: 1,\n", " 57: 22,\n", " 206: 2,\n", " 353: 1,\n", " 873: 1,\n", " 11: 369,\n", " 122: 6,\n", " 77: 14,\n", " 226: 3,\n", " 30: 64,\n", " 76: 13,\n", " 1645: 1,\n", " 39: 34,\n", " 1432: 1,\n", " 445: 1,\n", " 51: 12,\n", " 26: 68,\n", " 743: 1,\n", " 438: 3,\n", " 627: 1,\n", " 1363: 1,\n", " 1122: 1,\n", " 197: 3,\n", " 1624: 1,\n", " 451: 1,\n", " 619: 1,\n", " 640: 1,\n", " 301: 1,\n", " 1372: 1,\n", " 67: 15,\n", " 180: 3,\n", " 1366: 1,\n", " 442: 1,\n", " 536: 1,\n", " 661: 1,\n", " 81: 10,\n", " 47: 33,\n", " 22: 100,\n", " 1219: 1,\n", " 648: 1,\n", " 72: 11,\n", " 123: 2,\n", " 407: 1,\n", " 187: 3,\n", " 339: 1,\n", " 50: 16,\n", " 1322: 1,\n", " 449: 1,\n", " 152: 6,\n", " 63: 22,\n", " 1548: 1,\n", " 43: 32,\n", " 901: 1,\n", " 345: 1,\n", " 881: 1,\n", " 479: 1,\n", " 183: 2,\n", " 71: 15,\n", " 364: 4,\n", " 333: 1,\n", " 387: 1,\n", " 589: 1,\n", " 147: 5,\n", " 174: 2,\n", " 355: 2,\n", " 242: 2,\n", " 181: 1,\n", " 909: 1,\n", " 1027: 1,\n", " 446: 2,\n", " 858: 1,\n", " 2170: 1,\n", " 492: 2,\n", " 233: 2,\n", " 179: 2,\n", " 89: 7,\n", " 1547: 1,\n", " 107: 4,\n", " 124: 5,\n", " 139: 3,\n", " 885: 1,\n", " 402: 2,\n", " 101: 7,\n", " 86: 12,\n", " 3542: 1,\n", " 157: 1,\n", " 245: 1,\n", " 291: 2,\n", " 324: 1,\n", " 150: 5,\n", " 2046: 1,\n", " 216: 2,\n", " 299: 2,\n", " 3620: 1,\n", " 621: 1,\n", " 505: 1,\n", " 56: 14,\n", " 400: 1,\n", " 172: 3,\n", " 110: 1,\n", " 325: 3,\n", " 104: 5,\n", " 126: 4,\n", " 177: 3,\n", " 1268: 1,\n", " 1204: 1,\n", " 130: 5,\n", " 504: 1,\n", " 3184: 1,\n", " 134: 2,\n", " 204: 2,\n", " 1406: 1,\n", " 1407: 1,\n", " 121: 3,\n", " 285: 2,\n", " 738: 1,\n", " 924: 1,\n", " 283: 1,\n", " 38: 35,\n", " 1477: 1,\n", " 53: 14,\n", " 362: 1,\n", " 1282: 1,\n", " 704: 1,\n", " 125: 9,\n", " 155: 2,\n", " 141: 1,\n", " 686: 1,\n", " 85: 13,\n", " 383: 1,\n", " 2700: 1,\n", " 839: 2,\n", " 570: 1,\n", " 347: 1,\n", " 210: 3,\n", " 82: 6,\n", " 365: 2,\n", " 282: 1,\n", " 357: 2,\n", " 54: 10,\n", " 1066: 1,\n", " 170: 1,\n", " 189: 2,\n", " 222: 1,\n", " 545: 1,\n", " 317: 1,\n", " 377: 1,\n", " 356: 1,\n", " 778: 1,\n", " 87: 9,\n", " 154: 2,\n", " 635: 1,\n", " 376: 1,\n", " 68: 12,\n", " 136: 3,\n", " 265: 3,\n", " 213: 4,\n", " 253: 2,\n", " 287: 1,\n", " 1329: 1,\n", " 184: 4,\n", " 108: 8,\n", " 239: 1,\n", " 397: 2,\n", " 74: 5,\n", " 167: 2,\n", " 255: 1,\n", " 487: 1,\n", " 1595: 1,\n", " 90: 3,\n", " 1452: 1,\n", " 135: 2,\n", " 1677: 1,\n", " 432: 1,\n", " 1709: 1,\n", " 200: 5,\n", " 250: 2,\n", " 525: 2,\n", " 883: 1,\n", " 42: 17,\n", " 188: 3,\n", " 138: 4,\n", " 414: 1,\n", " 169: 3,\n", " 723: 2,\n", " 137: 3,\n", " 1737: 1,\n", " 127: 5,\n", " 194: 1,\n", " 768: 1,\n", " 304: 1,\n", " 511: 1,\n", " 932: 1,\n", " 163: 3,\n", " 593: 1,\n", " 120: 1,\n", " 481: 1,\n", " 1042: 1,\n", " 708: 1,\n", " 143: 10,\n", " 145: 3,\n", " 153: 4,\n", " 337: 1,\n", " 363: 1,\n", " 515: 2,\n", " 11338: 1,\n", " 117: 6,\n", " 430: 1,\n", " 100: 7,\n", " 112: 7,\n", " 343: 1,\n", " 132: 3,\n", " 241: 4,\n", " 1283: 1,\n", " 844: 1,\n", " 166: 2,\n", " 352: 1,\n", " 472: 1,\n", " 394: 2,\n", " 234: 3,\n", " 220: 1,\n", " 1467: 1,\n", " 3582: 1,\n", " 164: 1,\n", " 368: 2,\n", " 171: 2,\n", " 98: 8,\n", " 483: 1,\n", " 1324: 1,\n", " 96: 5,\n", " 678: 1,\n", " 36: 29,\n", " 263: 1,\n", " 354: 1,\n", " 286: 1,\n", " 142: 2,\n", " 228: 1,\n", " 378: 1,\n", " 278: 2,\n", " 361: 2,\n", " 539: 1,\n", " 159: 1,\n", " 281: 1,\n", " 186: 1,\n", " 208: 3,\n", " 346: 1,\n", " 209: 2,\n", " 2345: 1,\n", " 195: 1,\n", " 379: 1,\n", " 259: 2,\n", " 160: 1,\n", " 202: 1,\n", " 115: 1,\n", " 277: 2,\n", " 745: 1,\n", " 623: 1,\n", " 319: 1,\n", " 274: 2,\n", " 212: 1,\n", " 232: 1,\n", " 657: 1,\n", " 328: 1,\n", " 348: 2,\n", " 370: 1,\n", " 221: 1,\n", " 331: 1,\n", " 646: 1,\n", " 332: 1,\n", " 225: 1,\n", " 554: 1,\n", " 360: 1,\n", " 271: 1,\n", " 231: 1,\n", " 251: 1,\n", " 244: 2,\n", " 452: 1,\n", " 436: 1,\n", " 637: 1,\n", " 266: 1,\n", " 128: 2,\n", " 165: 1,\n", " 111: 4,\n", " 230: 1,\n", " 302: 1,\n", " 312: 1,\n", " 1683: 1,\n", " 899: 1,\n", " 193: 1,\n", " 1568: 1,\n", " 551: 1,\n", " 289: 1,\n", " 660: 1,\n", " 1774: 1,\n", " 156: 2,\n", " 532: 1,\n", " 429: 1,\n", " 249: 1,\n", " 295: 1,\n", " 606: 2,\n", " 493: 1,\n", " 405: 2,\n", " 409: 1,\n", " 466: 1,\n", " 393: 1,\n", " 528: 1,\n", " 168: 1,\n", " 513: 1,\n", " 151: 1,\n", " 330: 1,\n", " 191: 1,\n", " 229: 1,\n", " 238: 1,\n", " 95: 1,\n", " 175: 1})" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(tweet_freq.values())" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('i@2', 1335),\n", " ('i@1', 1331),\n", " ('i@-2', 1296),\n", " ('the@-1', 1087),\n", " ('and@1', 978)]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# As opposed to the above, this computes the number of unique terms that this feature\n", "# appears in. Q: How do you expect to affect the output?\n", "tweet_freq_2 = Counter()\n", "for context in contexts.values():\n", " tweet_freq_2.update(context.keys())\n", "tweet_freq_2.most_common(5)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('m@1', 0.4992255422401076),\n", " ('rt@-1', 0.4843410432712745),\n", " ('t@2', 0.24288406965963458),\n", " ('love@1', 0.14382229255205484),\n", " ('rt@-2', 0.1405945803985488)]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Transform each context vector to be term freq / tweet frequency. \n", "# Also then normalize by length.\n", "for term, context in contexts.items():\n", " for term2, frequency in context.items():\n", " # tf / [ 1 + log(df) ]\n", " context[term2] = frequency / (1. + math.log(tweet_freq[term2]))\n", " length = math.sqrt(sum([v*v for v in context.values()]))\n", " for term2, frequency in context.items():\n", " context[term2] = 1. * frequency / length\n", " \n", "contexts['i'].most_common(5)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('worthless@-2', 0.33764514643816557),\n", " ('high@-1', 0.2816246484908819),\n", " ('holding@2', 0.2323291846302529),\n", " ('at@-1', 0.2204425813331341),\n", " ('to@-1', 0.20596171539422523),\n", " ('son@-2', 0.18688995834710687),\n", " ('and@1', 0.18452879080306742),\n", " ('her@2', 0.16723062126872543),\n", " ('in@-2', 0.16473427852389158),\n", " ('uneducated@1', 0.15939656236107255)]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "contexts['school'].most_common(10)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('i@-1', 0.8517303851828948),\n", " ('you@1', 0.341304167014421),\n", " ('rt@-2', 0.14556771763853768),\n", " ('so@2', 0.11603489382180723),\n", " ('i@-2', 0.11090169450417534),\n", " ('this@1', 0.09080734081934172),\n", " ('u@1', 0.08118539291674556),\n", " ('it@1', 0.080832127605461),\n", " ('in@-1', 0.07776084336903491),\n", " ('i@2', 0.06941393383495353)]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "contexts['love'].most_common(10)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('i@-1', 0.6736723361573105),\n", " ('grocery@1', 0.3146058368934483),\n", " ('store@2', 0.2657744340570913),\n", " ('fucking@-1', 0.23378948537609445),\n", " ('pressure@2', 0.23209310502084446),\n", " ('the@1', 0.2063211337802833),\n", " ('i@-2', 0.18442359838340952),\n", " ('rt@-2', 0.17387587244484196),\n", " ('amp@-2', 0.141357574805482),\n", " ('me@1', 0.07726901404021158)]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "contexts['hate'].most_common(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point we have a list of dictionaries, one per term, indicating the terms that co-occur (weighted by inverse tweet frequency).\n", "\n", "Next, we have to cluster these vectors. To do this, we'll need to be able to compute the euclidean distance between two vectors." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.4142135623730951\n", "2.23606797749979\n" ] } ], "source": [ "# n.b. This is not efficient!\n", "def distance(c1, c2):\n", " if len(c1.keys()) == 0 or len(c2.keys()) == 0:\n", " return 1e9\n", " keys = set(c1.keys()) | set(c2.keys())\n", " distance = 0.\n", " for k in keys:\n", " distance += (c1[k] - c2[k]) ** 2\n", " return math.sqrt(distance)\n", "\n", "print(distance({'hi':10, 'bye': 5}, {'hi': 9, 'bye': 4}))\n", "print(distance({'hi':10, 'bye': 5}, {'hi': 8, 'bye': 4}))" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['love', 'hope', 'miss', 'm', 'am', 'guess', 'll', 'mean', 'cant',\n", " 'wish'], dtype=' 1]\n", "contexts = dict([(term, contexts[term]) for term in nz_contexts])\n", "print(len(nz_contexts), 'nonzero contexts')" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "im\n", "[('it@1', 0.01264806377201966), ('maybe@2', 0.025412211140884603), ('y@-2', 0.01883753470306806)]\n" ] } ], "source": [ "# e.g., what are three context features for the term \"rt\"?\n", "print(list(contexts.keys())[0])\n", "print(list(list(contexts.values())[0].items())[:3])" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['a@-1' 'a@-2' 'a@1' 'a@2' 'aaaaaa@-1' 'aaaaaa@-2' 'aaaaaa@1' 'aaaaaa@2'\n", " 'aaron@-1' 'aaron@-2']\n", " (0, 1)\t0.012176616904934769\n", " (0, 2)\t0.10774881812308722\n", " (0, 3)\t0.08406853449098409\n", " (0, 234)\t0.02946119105248044\n", " (0, 265)\t0.024956944731107572\n", " (0, 332)\t0.02065859263633043\n", " (0, 368)\t0.014543761318494223\n", " (0, 369)\t0.014329995051336608\n", " (0, 393)\t0.02368590894509358\n", " (0, 413)\t0.01978485270160371\n", " (0, 434)\t0.018043994507438255\n", " (0, 436)\t0.015269990983388075\n", " (0, 475)\t0.015840112090628022\n", " (0, 480)\t0.01604913069072621\n", " (0, 483)\t0.036733833080258765\n", " (0, 485)\t0.02443477905749157\n", " (0, 486)\t0.04897844410701169\n", " (0, 611)\t0.04022372022569253\n", " (0, 641)\t0.03840265052217021\n", " (0, 880)\t0.05818855015204386\n", " (0, 881)\t0.02916261855721708\n", " (0, 1053)\t0.01678036618972246\n", " (0, 1063)\t0.020161956350368875\n", " (0, 1206)\t0.0446093161177403\n", " (0, 1209)\t0.022203132192976167\n", " :\t:\n", " (0, 14849)\t0.04492780311830858\n", " (0, 14975)\t0.4561416854630144\n", " (0, 15102)\t0.019454490377035092\n", " (0, 15137)\t0.03246274061678437\n", " (0, 15216)\t0.018578551803747474\n", " (0, 15217)\t0.018633750350250212\n", " (0, 15224)\t0.01978485270160371\n", " (0, 15490)\t0.014819584801508275\n", " (0, 15491)\t0.015207803801313627\n", " (0, 15502)\t0.3141806233384448\n", " (0, 15532)\t0.03659194385094082\n", " (0, 15671)\t0.013794628438236247\n", " (0, 15735)\t0.04108584562516659\n", " (0, 15836)\t0.024359499004844704\n", " (0, 15870)\t0.025924583046390885\n", " (0, 15873)\t0.024548591946210438\n", " (0, 15882)\t0.022409430110929447\n", " (0, 15893)\t0.01883753470306806\n", " (0, 15897)\t0.02417935783176741\n", " (0, 15944)\t0.01820112202264155\n", " (0, 15971)\t0.051086598423935724\n", " (0, 16003)\t0.012327178077380431\n", " (0, 16004)\t0.0497998532898863\n", " (0, 16006)\t0.03695460482404303\n", " (0, 16059)\t0.03639315451392994\n" ] } ], "source": [ "# Transform context dicts to a sparse vector\n", "# for sklearn.\n", "from sklearn.feature_extraction import DictVectorizer\n", "\n", "vec = DictVectorizer()\n", "X = vec.fit_transform(contexts.values())\n", "names = np.array(vec.get_feature_names())\n", "print(names[:10])\n", "print(X[0])" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "56\n", " (0, 0)\t0.01166315589869699\n", " (0, 1)\t0.009432626921748103\n", " (0, 2)\t0.0069556430065184915\n", " (0, 3)\t0.020932638324993653\n", " (0, 12)\t0.009893569523380101\n", " (0, 28)\t0.0029242019578079237\n", " (0, 29)\t0.005935227260208678\n", " (0, 31)\t0.005850253394142066\n", " (0, 40)\t0.00502062522553737\n", " (0, 134)\t0.01569827105681456\n", " (0, 136)\t0.00393482667128263\n", " (0, 181)\t0.008700840482568875\n", " (0, 219)\t0.009893569523380101\n", " (0, 233)\t0.0058241406815781925\n", " (0, 241)\t0.003442994829099182\n", " (0, 251)\t0.003660037555588828\n", " (0, 278)\t0.0057055343458282184\n", " (0, 334)\t0.0038489132041543044\n", " (0, 367)\t0.002790339436949068\n", " (0, 368)\t0.008449753057068984\n", " (0, 369)\t0.002775185795961421\n", " (0, 370)\t0.019557649156184565\n", " (0, 402)\t0.005416543436788495\n", " (0, 410)\t0.009893569523380101\n", " (0, 412)\t0.003966821569702731\n", " :\t:\n", " (0, 15787)\t0.004102954333698908\n", " (0, 15828)\t0.03718458281719579\n", " (0, 15829)\t0.0031113083765471677\n", " (0, 15831)\t0.006289728235394356\n", " (0, 15847)\t0.022014344111415302\n", " (0, 15857)\t0.005957911837271409\n", " (0, 15863)\t0.004299918936961106\n", " (0, 15871)\t0.00502062522553737\n", " (0, 15885)\t0.004193444669798336\n", " (0, 15894)\t0.007005863991464066\n", " (0, 15895)\t0.003619603953271363\n", " (0, 15946)\t0.003511581004101353\n", " (0, 15969)\t0.020762766580557777\n", " (0, 15971)\t0.009893569523380101\n", " (0, 15990)\t0.004428087090688781\n", " (0, 16003)\t0.01193657408187391\n", " (0, 16004)\t0.012055468703578194\n", " (0, 16005)\t0.341304167014421\n", " (0, 16006)\t0.035783646133738856\n", " (0, 16014)\t0.004876067603250058\n", " (0, 16019)\t0.011392909025650088\n", " (0, 16020)\t0.002893415638699762\n", " (0, 16021)\t0.031212968706279383\n", " (0, 16022)\t0.0028439084729393136\n", " (0, 16060)\t0.007437161692980253\n", "while@1\n" ] } ], "source": [ "# Which row of X is the word \"love\"?\n", "love_idx = list(contexts.keys()).index('love')\n", "print(love_idx)\n", "# What are the context feature values for love?\n", "print(X[love_idx])\n", "# Print a highly ranking feature.\n", "print(names[15534])" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,\n", " n_clusters=20, n_init=10, n_jobs=None, precompute_distances='auto',\n", " random_state=None, tol=0.0001, verbose=0)" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's cluster!\n", "# http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html\n", "from sklearn.cluster import KMeans\n", "num_clusters = 20\n", "kmeans = KMeans(num_clusters)\n", "kmeans.fit(X)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 to@1 i@-2 i@-1 be@2 the@2\n", "1 a@-1 i@1 i@2 a@-2 the@-1\n", "2 the@1 up@1 i@-2 i@2 my@1\n", "3 my@-1 and@1 i@2 the@-1 i@1\n", "4 i@-2 m@-1 i@1 m@-2 for@1\n", "5 of@1 the@-1 the@2 a@-1 a@-2\n", "6 it@-1 so@-2 much@-1 many@-1 i@1\n", "7 rt@-1 i@2 i@1 and@1 is@1\n", "8 in@1 the@2 i@-2 my@2 the@-1\n", "9 i@-1 rt@-2 i@-2 you@1 it@1\n", "10 the@2 but@1 i@2 on@1 be@-1\n", "11 i@1 m@2 rt@-1 i@2 i@-2\n", "12 me@1 and@-1 a@1 i@-2 i@2\n", "13 t@1 i@-1 you@-1 be@2 rt@-2\n", "14 with@1 my@2 i@-2 the@2 to@-1\n", "15 i@2 i@1 and@1 the@-2 a@-2\n", "16 to@-1 i@-2 i@-1 t@-1 i@2\n", "17 the@-1 in@-2 i@1 i@2 of@-2\n", "18 i@-2 i@2 just@-1 was@-1 and@2\n", "19 of@-1 and@1 i@1 the@-2 the@-1\n" ] } ], "source": [ "# Let's print out the top features for each mean vector.\n", "# This is swamped by common terms\n", "for i in range(num_clusters):\n", " print(i, ' '.join(names[np.argsort(\n", " kmeans.cluster_centers_[i])[::-1][:5]]))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "distance from term \"love\" to each cluster:\n", "[1.00874162 1.0266301 0.97164003 1.03189144 1.03358099 1.04555106\n", " 0.9921334 1.0001127 0.99631078 0.5906229 0.98516281 1.01236052\n", " 0.96816683 1.1055978 0.99184346 0.98134482 0.95310019 1.02250443\n", " 0.95707704 1.00474645]\n", "closest cluster to \"love\":\n", "9\n" ] } ], "source": [ "# .transform will compute the distance from each context to each cluster.\n", "distances = kmeans.transform(X)\n", "# e.g., what is the distance from the word \"love\" to each cluster?\n", "print('distance from term \"love\" to each cluster:')\n", "print(distances[love_idx])\n", "# what is the closest cluster for the word \"love\"?\n", "print('closest cluster to \"love\":')\n", "print(np.argmin(distances[love_idx]))" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 supposed listening going wanted used decided listen able needs \n", "\n", "1 good few bit wrap couple dream virgin requirement little \n", "\n", "2 for on of at with all is during from \n", "\n", "3 favorite face hair life heart boyfriend wife brother head \n", "\n", "4 not screaming sorry dying doin scared gonna assuming sobbing \n", "\n", "5 part rest one majority type habit tired instead benefit \n", "\n", "6 everyday aaaaaa whew and but yard sounds things seemed \n", "\n", "7 hello hi okay i mingyu nah psa mutual age \n", "\n", "8 participate doctor punched zoomed manga engage lived live jump \n", "\n", "9 am love hope just ll have guess mean miss \n", "\n", "10 shade one me pumped this and okay that here \n", "\n", "11 but when because and omg what before where ok \n", "\n", "12 with tell making call is help on in make \n", "\n", "13 didn wasn ain wouldn couldn haven doesn isn can \n", "\n", "14 ruv communicate agree familiar wrong flights along interact dealing \n", "\n", "15 that me time but the day this my is \n", "\n", "16 be see make do say save hear go cry \n", "\n", "17 way best world weekends first saddest waist bathroom bus \n", "\n", "18 a never like you sorry u done to so \n", "\n", "19 us mone nowhere the concerning them luck people prof \n", "\n" ] } ], "source": [ "# Finally, we'll print the words that are closest\n", "# to the mean of each cluster.\n", "terms = np.array(list(contexts.keys()))\n", "for i in range(distances.shape[1]):\n", " print(i, ' '.join(terms[np.argsort(distances[:,i])[1:10]]), '\\n')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clearly, interpreting these results requires a bit of investigation.\n", "\n", "As the number of tweets increases, we expect these clusters to become more coherent." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**How does error decrease with number of cluster?**" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-3727.0865918080995" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kmeans.score(X)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "k=5 score=3860.63\n", "k=10 score=3786.09\n", "k=20 score=3722.77\n", "k=50 score=3638.46\n", "k=100 score=3553.49\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAY4AAAEKCAYAAAAFJbKyAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3Xm81HXZ//HXm0UBFdA8KMpminuCecSl7FZMRHNNQI0UM6W0cstM8741LUIzxd3CNPGWVNwIUTP0xl3Bg+KCuPBTZFGB3Ak38Pr98fmezoQgZw5nzvecmffz8ZgHM5/5zsw1TnHx/Xyvz/VRRGBmZlZfrfIOwMzMWhYnDjMzK4oTh5mZFcWJw8zMiuLEYWZmRXHiMDOzojhxmJlZUZw4zMysKE4cZmZWlDZ5B1AK66+/fvTq1SvvMMzMWpRp06b9MyKqVnVcWSaOXr16UVNTk3cYZmYtiqTX63Ocp6rMzKwoThxmZlYUJw4zMyuKE4eZmRXFicPMzIrixFFg7Fjo1QtatUp/jh2bd0RmZs1PWZbjNsTYsTB8OCxZkh6//np6DDB0aH5xmZk1Nz7jyJx5Zl3SqLVkSRo3M7M6ThyZOXOKGzczq1ROHJkePYobNzOrVE4cmREjoEOH/xzr0CGNm5lZnZIlDkntJE2V9IykGZLOycb3lPSUpOmSHpG0WTZ+lKRF2fh0SccUvNcwSa9kt2GliHfoUBg9Grp1S487dkyPfWHczOw/KSJK88aSgLUiYrGktsAjwInA9cCBETFT0vFAv4g4StJRQHVE/HS591kPqAGqgQCmATtExLsr++zq6upYnSaH++0Hzz8Pr70GUoPfxsysRZE0LSKqV3Vcyc44IlmcPWyb3SK7dczGOwFvrOKt9gYmRcQ7WbKYBAwsQcj/NnhwKsd98slSfoqZWctU0mscklpLmg4sJP3lPwU4Brhb0jzgCOC8gpccIulZSbdK6p6NbQzMLThmXjZWMgceCG3bwrhxpfwUM7OWqaSJIyKWRURfoBvQT9K2wMnAvhHRDfgLcFF2+J1Ar4jYjnRWMaaYz5I0XFKNpJpFixatVtydO8OAAXDLLVCimTwzsxarSaqqIuI9YDKwD9AnO/MAuBnYNTvm7Yj4JBv/M7BDdn8+0L3g7bplY8t/xuiIqI6I6qqqVW5gtUpDhqQ1HFOnrvZbmZmVlVJWVVVJ6pzdbw/sBcwEOknaPDusdgxJXQtefkDtOHAvMEDSupLWBQZkYyV1wAFpuuqWW0r9SWZmLUspe1V1BcZIak1KUOMiYqKkY4HbJH0OvAscnR1/gqQDgKXAO8BRABHxjqTfALWXqs+NiHdKGDeQpqv23jsljgsucHWVmVmtkpXj5ml1y3FrXX89DBsGTzwBO+3UCIGZmTVjuZfjloPa6SpXV5mZ1XHi+BKF01VleGJmZtYgThyrMHgwzJ0LU6as+lgzs0rgxLEKBxwAa6zh6iozs1pOHKtQuBjw88/zjsbMLH9OHPUwZEiarvJiQDMzJ456qZ2ucnWVmZkTR7106pSqq2691dNVZmZOHPXk6iozs8SJo55cXWVmljhx1FPtdJWrq8ys0jlxFGHIEJg3z9NVZlbZnDiKsP/+rq4yM3PiKEKnTjBwoKurzKyyOXEUafDgNF31xBN5R2Jmlg8njiIdcACsuaarq8yscjlxFKljR1dXmVllc+JogCFDYP58T1eZWWVy4miA/fdP01WurjKzSuTE0QAdO7q6yswqlxNHAw0enKarHn8870jMzJqWE0cD1U5XubrKzCqNE0cD1U5XubrKzCqNE8dqGDIE3njD01VmVllKljgktZM0VdIzkmZIOicb31PSU5KmS3pE0mbZ+JqSbpY0S9IUSb0K3uuMbPwlSXuXKuZiubrKzCpRKc84PgH6R0QfoC8wUNLOwFXA0IjoC/wV+O/s+B8C70bEZsAo4HwASVsDhwHbAAOBKyW1LmHc9bbOOrDPPq6uMrPKUrLEEcni7GHb7BbZrWM23gl4I7t/IDAmu38rsKckZeM3RcQnEfEaMAvoV6q4izV4cJqueuyxvCMxM2saJb3GIam1pOnAQmBSREwBjgHuljQPOAI4Lzt8Y2AuQEQsBd4HvlI4npmXjS3/WcMl1UiqWbRoUam+0he4usrMKk1JE0dELMumpLoB/SRtC5wM7BsR3YC/ABc10meNjojqiKiuqqpqjLesF09XmVmlaZKqqoh4D5gM7AP0yc48AG4Gds3uzwe6A0hqQ5rGertwPNMtG2s2aqurPF1lZpWglFVVVZI6Z/fbA3sBM4FOkjbPDqsdA5gADMvuDwL+LyIiGz8sq7raBOgNTC1V3A2x336urjKzytGmhO/dFRiTVUC1AsZFxERJxwK3SfoceBc4Ojv+GuB/Jc0C3iFVUhERMySNA14AlgI/iYhlJYy7aIXTVRdfDK28OsbMypjSP+rLS3V1ddTU1DTpZ954I3zve/DQQ7Dbbk360WZmjULStIioXtVx/rdxI9lvP2jdGr7znXTG0asXjB2bd1RmZo2vlFNVFWXChPTnhx+mP19/HYYPT/eHDs0nJjOzUvAZRyM580xYttyVlyVL0riZWTlx4mgkc+YUN25m1lI5cTSSHj2KGzcza6mcOBrJiBHQocN/jrVtm8bNzMqJE0cjGToURo+Gnj1BgvbtYelS6NYt78jMzBqXE0cjGjoUZs9OPaveegt6907tSOY3qwYpZmarx4mjRDp2hDvugH/9CwYNgk8/zTsiM7PG4cRRQltvDX/5CzzxBJx8ct7RmJk1DieOEhs8GE49Fa68EsaMWfXxZmbNnRNHExg5EvbYA378Y3j66byjMTNbPU4cTaBNG7jpJlh/ffjud+Htt/OOyMys4Zw4mkiXLqnt+htvpOqr5duTmJm1FE4cTWinneCyy+Dee+HXv847GjOzhnHiaGLHHgtHHw2//S387W95R2NmVjwnjiYmwRVXQHU1HHkkvPxy3hGZmRXHiSMH7dql6x1t26aL5YsX5x2RmVn9OXHkpGfPVGk1cyYccwyU4Q6+ZlamnDhy9O1vp+65N98Mo0blHY2ZWf04ceTsl7+Egw+G006DBx7IOxozs1Vz4siZBNddlzrpHnoozJuXd0RmZl/OiaMZ6NgRbr897VE+eDB88kneEZmZrVzJEoekdpKmSnpG0gxJ52TjD0uant3ekDQ+G99d0vsFz51V8F4DJb0kaZak00sVc5622qquk+5JJ+UdjZnZyrUp4Xt/AvSPiMWS2gKPSLonInarPUDSbUDhMriHI2K/wjeR1Bq4AtgLmAc8KWlCRLxQwthzMWhQutbx+9+nVeZHHZV3RGZmX1SyM45IalcotM1u/y46ldQR6A+MX8Vb9QNmRcSrEfEpcBNwYAlCbhZGjID+/VMn3aeeyjsaM7MvKuk1DkmtJU0HFgKTImJKwdMHAfdHxAcFY7tkU1v3SNomG9sYmFtwzLxsbPnPGi6pRlLNokWLGvmbNJ3aTrpduriTrpk1TyVNHBGxLCL6At2AfpK2LXj6cODGgsdPAT0jog9wGas+E1n+s0ZHRHVEVFdVVa1u6LmqqoLbboM334TDD3cnXTNrXpqkqioi3gMmAwMBJK1PmoK6q+CYD2qntiLibqBtdtx8oHvB23XLxsrajjumnlaTJsFZZ636eDOzplLKqqoqSZ2z++1JF7dfzJ4eBEyMiI8Ljt9QkrL7/bLY3gaeBHpL2kTSGsBhwIRSxd2cHHNMuv3ud+6ka2bNRymrqroCY7KqqFbAuIiYmD13GHDecscPAo6TtBT4CDgsIgJYKumnwL1Aa+DaiJhRwriblcsug+nT4Ygj4MknYYst8o7IzCqdogy761VXV0dNTU3eYTSaOXNghx3SBfMpU2DttfOOyMzKkaRpEVG9quO8crwF6NEjNUJ88cW0CVQZ5noza0GcOFqI/v1h5Ei45Ra46KK8ozGzSubE0YL84hdwyCGpo+7kyXlHY2aVyomjBZFSP6vNN0+ddOfOXfVrzMwamxNHC7POOqmT7scfp95W7qRrZk3NiaMF2nLLtIfH1Klw4ol5R2NmlcaJo4X67nfh9NPhT3+Ca6/NOxozqyROHC3Yb3+b9i0//ngoo2UrZtbMOXG0YK1bw403wgYbpGqrf/4z74jMrBI4cbRw66+fOukuWAC77w49e0KrVtCrF4wdm3d0ZlaOnDjKQHV16mU1Y0ZqTxIBr78Ow4c7eZhZ43PiKBOTJn1xbMkSOPPMpo/FzMrbKhNHtovfyU0RjDXcnDnFjZuZNdQqE0dELCPt1mfNWI8eKx5fY43Ujt3MrLHUd6rqUUmXS9pN0tdrbyWNzIoyYgR06PCfY2uskW79+qVrIPPm5RObmZWX+iaOvsA2wLnAhdntD6UKyoo3dCiMHp2qqqT057XXpmRxxhmpq+7mm6dtaBcvzjtaM2vJvJFThZg9O600v/lm6No1naEMG5ZKd83MoJE3cpLUSdJFkmqy24WSOq1+mNZUevWCm26Cxx5L10OOPjqV8T7wQN6RmVlLU99/b14LfAgMyW4fAH8pVVBWOrvsAo8/Dn/9a1ppvscecPDB8MoreUdmZi1FfRPHphFxdkS8mt3OAb5aysCsdCQ4/HB46aU0ZXXffbDNNnDKKfDuu3lHZ2bNXX0Tx0eSvln7QNI3gI9KE5I1lfbt4Ve/Smcbw4bBxRfDZpvBZZfBZ5/lHZ2ZNVf1TRw/Bq6QNFvSbOBy4Ecli8qa1IYbwtVXw9NPQ9++cMIJ8LWvwcSJqX2JmVmh+qwcbwVsERF9gO2A7SJi+4h4tuTRWZPq0ydNW02YkBLG/vvDgAHwrH9pMytQn5XjnwOnZfc/iIgP6vPGktpJmirpGUkzJJ2TjT8saXp2e0PS+Gxcki6VNEvSs4ULDCUNk/RKdhvWoG9q9SKlhPH883DJJTBtGmy/PRx7LLz1Vt7RmVlzUN+pqvsknSqpu6T1am+reM0nQP/sTKUvMFDSzhGxW0T0jYi+wOPA7dnx+wC9s9tw4CqA7HPOBnYC+gFnS1q3mC9pxWvbNk1ZzZqV/rzuOujdG0aOhI98dcusotU3cRwK/AR4CJiW3b50hV0ktWuU22a3f8+YS+oI9AfGZ0MHAtdnr3sC6CypK7A3MCki3omId4FJwMB6xm2rab31YNSo1LJ9zz3TxfSttkprQnz9w6wy1fcax/cjYpPlbqssx806604HFpL+8p9S8PRBwP0FU18bA3MLnp+Xja1s3JrQ5pvD+PFw//2w7rqpnHfXXeGJJ/KOzMyaWn2vcVzekDePiGXZlFQ3oJ+kbQuePhy4sSHvuyKShteubF+0aFFjva0tp3//tL/5NdekNia77ALf+17aOMrMKkN9p6rul3SIJDXkQyLiPWAy2RSTpPVJ1yvuKjhsPtC94HG3bGxl48t/xuiIqI6I6qqqqoaEafXUunVqWfLKK/A//wN33AFbbJGmsT78MO/ozKzU6ps4fgSMAz6R9IGkDyV9aXWVpCpJnbP77YG9gBezpwcBEyPi44KXTACOzKqrdgbej4g3gXuBAZLWzS6KD8jGLGdrrw3nngsvvwyDBqUL5717w5//DMuW5R2dmZVKfRNHJ+Ao4LcR0ZHUYn2vVbymKzBZ0rPAk6RrHBOz5w7ji9NUdwOvArOAq4HjASLiHeA32Xs8CZybjVkz0b073HADTJmSVp4fe2wq4b3vvrwjM7NSqFdbdUlXAZ+Tymu3yv7l/4+I2LHUATaE26rnJwJuvRVOOy1dA9lvP7jgAthyy7wjM7NVadS26sBOEfET4GOArCx2jdWIz8qUBIMHw8yZcP758OCDqX3JCSfA22/nHZ2ZNYb6Jo7PJLUmW4chqYp0BmK2Qu3apbOOWbPgmGPgiivSNNaoUfDpp3lHZ2aro76J41LgDqCLpBHAI8DvShaVlY0uXeCqq+CZZ9Le56ecklq4jx/vBYRmLVW9EkdEjCX1qxoJvAkcFBG3lDIwKy/bbgv33gt3353amRx8cFoT8vTTeUdmZsWq947TEfFiRFwREZdHxMxSBmXla599UrfdK65IjRR32CGtCXnjjbwjM7P6qnfiMGssbdrA8cenBYQ//3kq5d18c/jNb2DJkryjM7NVceKw3HTunEp1Z86EgQPhrLPSCvQbboDPXXph1mw5cVjuNt00rf146CHYYAM44gjYeWd45JG8IzOzFXHisGZjt91g6lQYMwbmz0+PhwyBV1/NOzIzK+TEYc1Kq1Zw5JGp/9Wvfw133ZX2/zjtNHj//byjMzNw4rBmaq214OyzUwI5/PB0LaR3b/jjH2Hp0ryjM6tsThzWrG28cdq2tqYmnXkcdxz07ZvWhJhZPpw4rEXYYQd44AG4/Xb4+ONUhbXPPvDCC3lHZlZ5nDisxZDSivMZM+DCC+Hxx2G77dKaEG/6aNZ0nDisxVlzzdTzatasNHU1enRqoHjBBfDJJ3lHZ1b+nDisxVp/fbjsMnjuuVS6e9pp6TrIrbe6gaJZKTlxWIu31VYwcSL84x+pGmvwYPjWt9IFdTNrfE4cVjb22it12/3Tn1IZ7447pjUh8+blHZlZeXHisLLSpg0MH54aKJ5+Oowblxoonn02/OtfeUdnVh6cOKwsdewII0fCiy/CAQfAueemBYTXXecGimary4nDylqvXnDTTfDYY9CjB/zgB2kK68EH847MrOVy4rCKsMsuKXn89a9pzcfuu8N3v5tKes2sOE4cVjFatUp9r156CUaMgEmTYOut02ZS776bd3RmLUfJEoekdpKmSnpG0gxJ52TjkjRC0suSZko6IRvfXdL7kqZnt7MK3mugpJckzZJ0eqlitsrQvj386lfpAvqRR8KoUen6x+WXw2ef5R2dWfNXyjOOT4D+EdEH6AsMlLQzcBTQHdgyIrYCbip4zcMR0Te7nQsgqTVwBbAPsDVwuKStSxi3VYgNN4Q//zmV8PbpAz/7WWphctddXkBo9mVKljgiWZw9bJvdAjgOODciPs+OW7iKt+oHzIqIVyPiU1KiObBEYVsF6tMH7rsPJkxIFVf77QcDBqQV6Wb2RSW9xiGptaTpwEJgUkRMATYFDpVUI+keSb0LXrJLNrV1j6RtsrGNgbkFx8zLxswajQT775+SxSWXwLRpqX37j34ECxbkHZ1Z81LSxBERyyKiL9AN6CdpW2BN4OOIqAauBq7NDn8K6JlNbV0GjC/msyQNz5JRzSK3SrUGWmMNOOGEVG11wglw7bWpgeLIkamdu5k1UVVVRLwHTAYGks4Ybs+eugPYLjvmg9qprYi4G2graX1gPumaSK1u2djynzE6Iqojorqqqqpk38Uqw3rrpYvmM2bAnnumi+lbbpnWhPj6h1W6UlZVVUnqnN1vD+wFvEg6k9gjO+y/gJezYzaUpOx+vyy2t4Engd6SNpG0BnAYMKFUcZsV2nxzGD8e7r8fOndO5by77gpPPJF3ZGb5KeUZR1dgsqRnSX/5T4qIicB5wCGSngNGAsdkxw8Cnpf0DHApcFh2gX0p8FPgXmAmMC4iZpQwbrMv6N8/Xfe45hqYPTstKPze92DOnLwjM2t6ijI8766uro4a99S2Elm8GM4/H/7wh/T4lFNSQ8UJE+DMM1My6dEjLTIcOjTfWM2KIWladv35y49z4jBrmLlz4YwzYOzY1FTx44/h00/rnu/QIe1O6ORhLUV9E4dbjpg1UPfucMMNMGVK2rK2MGkALFmSzkDMyo0Th9lq6tfvi0mj1pw5LuO18uPEYdYIevRY8XgEdOmSemLdddfKE4xZS+LEYdYIRoxI1zQKdegAv/wlDBkCd96ZWplsuCEcc0xqcbJ0aT6xmq0uJw6zRjB0aLoQ3rNnal/Ss2d6fN55qZHiggUwcSJ85ztpO9u99oKNNoLjj4eHHvKuhNayuKrKrIl99BH8/e9pFfqdd6bHG22UzkwOPRR22iklH7Om5qoqs2aqfXs4+GC4+WZYuBBuvDFdYL/qqrSwcJNN0hTXU0+5vYk1T04cZjlae2047DC44440nTVmDGyzDVx0EeywQ2p58j//A88/n3ekZnWcOMyaiU6d6qqv3noLrr4aevWC3/0OvvY12HZb+M1v4OWX847UKp0Th1kz9JWvpOqrSZPgjTfgiitSx96zz4YttoCvfz21PZk9O+9IrRI5cZg1cxtsUFd9NXduave+xhqpP9Ymm8DOO8PFF8P8L2w2YFYaThxmLcjGG8NJJ6W27q++msp9P/0UTj45tUD51rfgyivTRXezUnHiMGuhCquvXnoJzjkH3nkHfvIT6NoVvv3ttIbknXfyjtTKjROHWRkorL567rm0Y+GcOXDssWmq6zvfgeuvh/ffzztSKwdOHGZlprb66qWX0uZTp5yStsAdNiwlkYMPTosP//WvvCO1lsqJw6xMSXXVV6+9Bo8/DscdB1Onpi1wu3RJK9Vvvz2tXjerLycOswogpeqrUaNSZdaDD8JRR8HkyXDIIelM5IgjUj8td/C1VXHiMKswrVql6qsrrkhrRCZNSmced90F+++fOvj+8Idp3B18bUWcOMwqWJs2qfrq6qvTavW77krt32+5BQYMqOvg++CDsGxZ3tFac+HEYWZAWlS4776p+mrhwnTto3//1D9r993TOpGTTkrXStx8sbI5cZjZF7RrV1d9tXBh+nPnneGPf4Rdd01rSE47LVVtOYlUHicOM/tSa61VV321cGE6I9l223Shvbo6rSH57/92B99KUrLEIamdpKmSnpE0Q9I52bgkjZD0sqSZkk4oGL9U0ixJz0r6esF7DZP0SnYbVqqYzezLdexYV321YEFamd6rF4wcmTr4brMNnHtuWkNi5atkOwBKErBWRCyW1BZ4BDgR2ArYAzgqIj6X1CUiFkraF/gZsC+wE3BJROwkaT2gBqgGApgG7BAR767ss70DoFnTWrgQbrstTWk9/HCavurbN52pHHpomtqy5i/3HQAjWZw9bJvdAjgOODciPs+Oq23HdiBwffa6J4DOkroCewOTIuKdLFlMAgaWKm4zK16XLmlx4YMPpnUiF1+crpOccQZ89atpO9xRo2DevLwjtcZQ0mscklpLmg4sJP3lPwXYFDhUUo2keyT1zg7fGJhb8PJ52djKxs2sGdp4YzjxxFR99dpraeX60qWp9Un37rDbbmkNyYIFeUdqDVXSxBERyyKiL9AN6CdpW2BN4OPsdOhq4NrG+CxJw7NkVLNo0aLGeEszW029etVVX738cuqh9d578NOfpjUie+6Z1pC8/XbekVoxmqSqKiLeAyaTppjmAbdnT90BbJfdnw90L3hZt2xsZePLf8boiKiOiOqqqqrG/QJmttp6907VV889lyqwzjwzTWsNH55Wq++7b1oz4g6+zV8pq6qqJHXO7rcH9gJeBMaTLo4D/BdQu4PyBODIrLpqZ+D9iHgTuBcYIGldSesCA7IxM2uhCquvnnoKfv5zeOGF1D+rSxc46CC48UZYvHiVb2U5aFPC9+4KjJHUmpSgxkXEREmPAGMlnQwsBo7Jjr+bVFE1C1gC/AAgIt6R9Bvgyey4cyPCW9OYlQEJtt8+3UaOTJ17b7453f72N2jfPrVAOeww2Gef9NjyV7Jy3Dy5HNesZfv8c3j00VTee+utqdx37bXhwANTEhkwILVIscaVezmumVlDtWpVV301fz7cd19KGPfckzr4brBB6uD7j3+4g28enDjMrFlr06au+urNN1MH3wMOSGcie++d9lc/7jh44AF38G0qThxm1mLUdvAdMyatA7njjtQW/vrrYY890jqRE0+Exx5L011WGk4cZtYitWtXV321cGG6oL7LLvCnP8E3vpHanPziF1BT4w6+jc2Jw8xavLXWgiFDUr+shQvhf/83NV285BLYcce0huTMM+HZZ51EGoMTh5mVlY4d4fvfr+vge801qV/W+edDnz5pDck557iD7+pw4jCzsrXuunD00an66o034Mor0wLDc86BLbdMHXzPOy/11LL6c+Iws4pQ28H3gQdSl96LL4YOHeo6+PbrBxdd5A6+9eHEYWYVZ6ON6qqvZs+G3/8+VWH9/Od1HXwvvxzeeivvSJsnJw4zq2g9e9ZVX73yCvz2t6nR4s9+llrE77knjB4N//xn3pE2H04cZmaZzTarq76aMSN18503D370o7TQcJ994LrrUmv4SubEYWa2AltvnS6iv/giPP00nHpquv+DH6SWJwceCH/9a2V28HXiMDP7ElKqvho5El59FaZMSRtRTZsGQ4emi+6DB6c1JB99lHe0TcOJw8ysnqRUfXXhhTBnDjz8cCr3feghGDQoJZGhQ+HOO+GTT/KOtnScOMzMGqBVK/jmN1P1VW0H38MPh7//PTVh3GCDNK11773w2Wd5R9u4nDjMzFZTbQff0aNTCe/dd6c+WrffDgMHpgvrP/4xTJ5cHh18nTjMzBpR27Z11VcLFsD48WnjqRtugP79oVs3OOGElt3B14nDzKxE2rWrq75auBDGjYNdd017i3zjG9CrV6rWamkdfJ04zMyaQIcOddVXCxemM5A+feDSS1MH3802g1/9qmV08HXiMDNrYuusU1d9tWABXHttShy//31KJoVrSJojJw4zsxytu25d9dWbb8JVV8GGG6bEsdVWKZHUriFpLpw4zMyaiaqquuqrefPSRlRrr52msDbdNE1pXXghzJ2bb5xOHGZmzdBGG6Xqq0cfhddfhwsuSOOnngo9eqQ1JJddVtfBd+zYdLG9Vav059ixpYtN0dyvwjRAdXV11NTU5B2GmVmjmzUrVWfddBM891xazb7llvD//h98+mndcR06pHUlQ4fW/70lTYuI6lUdV7IzDkntJE2V9IykGZLOycavk/SapOnZrW82vruk9wvGzyp4r4GSXpI0S9LppYrZzKy5K6y+euEFOOus1A6+MGkALFmSOv2WQsnOOCQJWCsiFktqCzwCnAj8GJgYEbcud/zuwKkRsd9y462Bl4G9gHnAk8DhEfHCyj7bZxxmVklatVpxCa9U3CLD3M84IqltONw2uzUkS/UDZkXEqxHxKXATcGAjhWlm1uL16FHc+Ooq6cVxSa0lTQcWApMiYkr21AhJz0oaJWl3HqFTAAAHR0lEQVTNgpfskk1t3SNpm2xsY6CwhmBeNrb8Zw2XVCOpZtGiRaX4OmZmzdKIEemaRqEOHdJ4KZQ0cUTEsojoC3QD+knaFjgD2BLYEVgP+GV2+FNAz4joA1wGjC/ys0ZHRHVEVFdVVTXadzAza+6GDk0Xwnv2TNNTPXsWf2G8GE1SjhsR7wGTgYER8WY2jfUJ8BfSVBQR8UHt1FZE3A20lbQ+MB/oXvB23bIxMzPLDB0Ks2enaxqzZ5cuaUBpq6qqJHXO7rcnXdx+UVLXbEzAQcDz2eMNszEk9ctie5t0Mby3pE0krQEcBkwoVdxmZvbl2pTwvbsCY7KqqFbAuIiYKOn/JFUBAqaTqqwABgHHSVoKfAQcFqnka6mknwL3Aq2BayNiRgnjNjOzL+EFgGZmBjSDclwzMytPThxmZlaUspyqkrQIeD3vOHK0PvDPvIPIkb+/v7+/f8P0jIhVrmcoy8RR6STV1Geeslz5+/v7+/uX9vt7qsrMzIrixGFmZkVx4ihPo/MOIGf+/pXN37/EfI3DzMyK4jMOMzMrihNHCyapu6TJkl7Idlk8MRtfT9IkSa9kf66bd6yllLXvf1rSxOzxJpKmZDtG3pz1OCtLkjpLulXSi5JmStqlkn5/SSdn/9t/XtKN2c6jZf37S7pW0kJJzxeMrfA3V3Jp9t/iWUlfb4wYnDhatqXAzyNia2Bn4CeStgZOB+6PiN7A/dnjcnYiMLPg8fnAqIjYDHgX+GEuUTWNS4C/R8SWQB/Sf4eK+P0lbQycAFRHxLakXnaHUf6//3XAwOXGVvab7wP0zm7DgasaIwAnjhYsa1H/VHb/Q9JfGhuTdkgckx02htSFuCxJ6gZ8B/hz9lhAf6B2a+Ky/f6SOgHfAq4BiIhPsy0MKub3JzVqbS+pDdABeJMy//0j4iHgneWGV/abHwhcn21l8QTQubZD+epw4igTknoB2wNTgA0i4s3sqbeADXIKqylcDJwG1O6s/BXgvYhYmj1e4Y6RZWITYBHwl2yq7s+S1qJCfv+ImA/8AZhDShjvA9OonN+/0Mp+83rtoFosJ44yIGlt4DbgpIj4oPC5rDV9WZbOSdoPWBgR0/KOJSdtgK8DV0XE9sC/WG5aqsx//3VJ/6LeBNgIWIsvTuFUnKb4zZ04WjhJbUlJY2xE3J4NLyjYMKsrac/3cvQN4ABJs4GbSFMUl5BOx2v3minnHSPnAfMiYkr2+FZSIqmU3//bwGsRsSgiPgNuJ/1volJ+/0Ir+81LsoOqE0cLls3nXwPMjIiLCp6aAAzL7g8D/tbUsTWFiDgjIrpFRC/SRdH/i4ihpG2KB2WHlfP3fwuYK2mLbGhP4AUq5PcnTVHtLKlD9v+F2u9fEb//clb2m08Ajsyqq3YG3i+Y0mowLwBswSR9E3gYeI66Of5fka5zjAN6kLoED4mI5S+mlRVJuwOnRsR+kr5KOgNZD3ga+H62x33ZkdSXVBiwBvAq8AOyHTepgN9f0jnAoaQKw6eBY0hz+GX7+0u6Edid1AV3AXA2MJ4V/OZZQr2cNIW3BPhBRKz2LndOHGZmVhRPVZmZWVGcOMzMrChOHGZmVhQnDjMzK4oTh5mZFcWJw6wJSVrcwNcdlDWwNMudE4dZy3AQUFTiKFg9bdaonDisokjqle1bcXW2j8M/JLXPnntAUnV2f/2slQmSjpI0PtvnYLakn0o6JWss+ISk9VbwORtIukPSM9lt1+We3712/5Ds8eWSjsrun6e0x8qzkv6QvfYA4AJJ0yVtmt3+LmmapIclbZm99jpJf5Q0Bfi9pP/KXjM9i3edkvyHtYrif5FYJeoNHB4Rx0oaBxwC3LCK12xL6j7cDpgF/DIitpc0CjiS1KW30KXAgxFxsKTWwNr1CUzSV4CDgS0jIiR1joj3JE0AJkbErdlx9wM/johXJO0EXEnq1QWpH9GuEbFM0p3ATyLi0awZ5sf1icPsyzhxWCV6LSKmZ/enAb3q8ZrJ2Z4nH0p6H7gzG38O2G4Fx/cnJRQiYhmp5Xd9vE/6y/2a7Ixk4vIHZAlgV+CW1FECgDULDrkl+0yAR4GLJI0Fbo+IefWMw2ylPFVllaiwb9Ey6v4BtZS6/0+0+5LXfF7w+HMa9g+wws/69+dl+0j0I3W63Q/4+wpe24q050TfgttWBc//q/ZORJxH6t/UHni0dkrLbHU4cZjVmQ3skN0f9CXH1cf9wHHw7z3ROy33/OvA1pLWlNSZ1Nm19myiU0TcDZxM2g4W4ENgHYBsz5XXJA3OXiNJfVgBSZtGxHMRcT7wJODEYavNicOszh+A4yQ9Teo8ujpOBPaQ9BxpOuw/KqIiYi6pm+nz2Z9PZ0+tA0yU9CzwCHBKNn4T8IvsAvemwFDgh5KeAWaQNjRakZMkPZ+932fAPav5vczcHdfMzIrjMw4zMyuKE4eZmRXFicPMzIrixGFmZkVx4jAzs6I4cZiZWVGcOMzMrChOHGZmVpT/DxmtF5f7zcJlAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "scores = []\n", "num_cluster_options = [5,10,20,50,100]\n", "\n", "for num_clusters in num_cluster_options:\n", " kmeans = KMeans(num_clusters, n_init=10, max_iter=10)\n", " kmeans.fit(X)\n", " score = -1 * kmeans.score(X)\n", " scores.append(score)\n", " print('k=%d score=%g' % (num_clusters, score))\n", " \n", "plt.figure()\n", "plt.plot(num_cluster_options, scores, 'bo-')\n", "plt.xlabel('num clusters')\n", "plt.ylabel('error')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "** How does error vary by initalization? **" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "score=3758.22\n", "score=3747.06\n", "score=3741.92\n", "score=3749.15\n", "score=3741.44\n", "score=3752.9\n", "score=3739.67\n", "score=3748.15\n", "score=3744.9\n", "score=3741.4\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZgAAAEKCAYAAAAvlUMdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3XmcV2Xd//HXG8QFzBUSEwFN1EoDayDNNMUl01K8JQXJLQs119Lb5cZb01+S3i5ZahoPLbUml0AWwQ0VDVPRkUDDBVcQcUFcEBFw4PP74zoTX4YZGGDOnPnOvJ+Px/cx33POdc75nHnofLiWc12KCMzMzBpbm6IDMDOzlskJxszMcuEEY2ZmuXCCMTOzXDjBmJlZLpxgzMwsF04wZmaWCycYMzPLhROMmZnlYp2iAyhSx44do3v37kWHYWZWVp555pn3I6LTqsq16gTTvXt3qqqqig7DzKysSJrRkHJuIjMzs1w4wZiZWS6cYMzMLBdOMGZmlgsnGDMzy4UTjJlZK1JZCd27Q5s26WdlZX73atXDlM3MWpPKShg8GBYsSNszZqRtgEGDGv9+rsGYmbUSQ4YsSy41FixI+/PgBGNm1krMnLl6+9eWE4yZWSux9dZ17+/aNZ/7OcGYmbUSe+yx4r727eGSS/K5nxOMmVkr8M47MGYMfP3rqcYiQbduMGxYPh38kOMoMknrA/8A1svuMzwiLpQ0EfhCVuyLwFMR0U/SXsBo4PXs2F0RcbGkHYA7Si69LXBBRFxd6351nt/4T2ZmVn7OPhsWLYLhw6FHj6a5Z57DlBcBfSNivqR2wGOS7o2I/1TSJI0gJYUaEyPiB6UXiYiXgF5Z+bbAW8DIeu65wvlmZq3dY4/BX/4C553XdMkFcmwii2R+ttku+0TNcUkbAX2BUatx2X2AVyOiQVNFm5m1dtXVcMop0KVLfsOR65NrH4yktpKmAO8B4yNiUsnhfsBDETGvZN9ukqZKulfS1+q45ADgtpXcclXnm5m1Kn/8I0ydClddBR06NO29c00wEbEkInoBXYA+knYqOTyQ5ZPFZKBbRPQErqFWzUbSusDBwN/rud1Kzy+5zmBJVZKq5syZsyaPZWZWFubMgfPPh332gf79m/7+TTKKLCI+AiYABwBI6gj0AcaVlJlX06QWEfcA7bJyNb4PTI6Id+u5x6rOryk3LCIqIqKiU6dVrvhpZla2zjsP5s+Ha65Jo8aaWm4JRlInSZtk3zcA9gNezA73B8ZGxMKS8p2l9CuQ1CeLbW7JJWvXeGrfb1Xnm5m1GpMmwU03wRlnwFe+UkwMeY4i2xK4JRv51Qa4MyLGZscGAJfWKt8fOElSNfAZMCAiAkBSB1KCOqH0BEknAkTEDSs738ysNVmyBE4+GbbcEi64oLg41Jr/BldUVERVVVXRYZiZNaphw+CEE9LsyUce2fjXl/RMRFSsqpzf5Dcza0Hmzk19L3vuCQMHFhuLE4yZWQty/vnw8cdw7bXFdOyXcoIxM2shJk9O772ccgrsvHPR0TjBmJm1CEuXpo79Tp3gV78qOprESyabmbUAt94KTz4JN98Mm2xSdDSJazBmZmXuo4/SbMm77QZHHVV0NMu4BmNmVuYuvBDefx/uvx/aNKNqQzMKxczMVtezz6YRYyeeCLvsUnQ0y3OCMTMrUxFpxNimm8Kvf110NCtyE5mZWZn6299g4sT05v5mmxUdzYpcgzEzK0Pz5sFZZ0Hv3nD88UVHUzfXYMzMytDFF8O778Lo0c2rY79UMw3LzMzq8/zz8LvfpZpLnz5FR1M/JxgzszISAaeeChtuCEOHFh3NyrmJzMysjAwfDg8/DNddl6aFac5cgzEzKxPz58Mvfwm9eqX1Xpo712DMzMrE0KEwaxbcfju0bVt0NKvmGoyZWRmYPh2uuAKOPhp2373oaBomtwQjaX1JT0maKmmapIuy/RMlTck+syWNyvbvJenjkmMXlFzrDUnPZfvrXONYye8lvSLpWUnfyOvZzMyaUgScdhpssAFcdlnR0TRcnk1ki4C+ETFfUjvgMUn3RsQeNQUkjQBGl5wzMSJ+UM/19o6I91dyv+8DPbLPt4Drs59mZmVt9Og0keVvfwudOxcdTcPlVoOJZH622S77RM1xSRsBfYFRjXTLQ4Bbs/s+CWwiactGuraZWSE++wzOOAN22inNO1ZOcu2DkdRW0hTgPWB8REwqOdwPeCgi5pXs2y1rUrtX0tdK9gfwgKRnJA2u53ZbAW+WbM/K9pmZla1LL4UZM9KMyeuU2bCsXMONiCVAL0mbACMl7RQR/84ODwRuLCk+GeiWNakdSKrZ9MiOfSci3pL0RWC8pBcj4h9rElOWoAYDdO3adU0uYWbWJF57LfW5DBwI3/1u0dGsviYZRRYRHwETgAMAJHUE+gDjSsrMq2lSi4h7gHZZOSLirezne8DI7Nza3gK2Ltnuku2rHcuwiKiIiIpOzf0tJTNr1c44A9q1g8svLzqSNZPnKLJOWc0FSRsA+wEvZof7A2MjYmFJ+c6SlH3vk8U2V1IHSV/I9ncA9gf+zYrGAEdno8l2BT6OiLdzejwzs1yNGwd33w0XXABblWljf55NZFsCt0hqS0oWd0bE2OzYAODSWuX7AydJqgY+AwZEREjagtS8VhPv3yLiPgBJJwJExA3APcCBwCvAAuC4HJ/NzCw3CxfC6afDjjumn+UqtwQTEc8CdS7gGRF71bHvWuDaOva/BvSs5zo3lHwP4OQ1DNfMrNm48kp49VV44AFYd92io1lzfpPfzKwZmTEDLrkEDjsM9tuv6GjWjhOMmVkzcuaZ6edVVxUbR2NwgjEzaybGj4cRI2DIEGgJb1E4wZiZNQOLF6eFxLbbDs46q+hoGkeZvRdqZtYyXX01vPRSGp683npFR9M4XIMxMyvYW2/BxRfDwQfDgQcWHU3jcYIxMyvYWWdBdXWaLbklcYIxMyvQI4+kFSrPPRe23bboaBqXE4yZWUE+/zxNwd+9O5xzTtHRND538puZFeS662DaNBg1Kq1W2dK4BmNmVoB33oELL4QDDkid+y2RE4yZWQHOPjtNavn730Oay7flcYIxM2tijz0Gf/lLmhamR49Vly9XTjBmZk2oujp17HfpkqaEacmcYMzMmkBlZRot1q4dTJ0Khx4KHToUHVW+nGDMzHJWWQmDB6ep+GvcdFPa35I5wZiZ5WzIEFiwYPl9Cxa4iczMzNbSzJmrt7+lyC3BSFpf0lOSpkqaJumibP9ESVOyz2xJo7L9e0n6uOTYBdn+rSVNkPR8dp06V6iu73wzs6J17Fj3/paw5svK5Pkm/yKgb0TMl9QOeEzSvRGxR00BSSOA0SXnTIyIH9S6TjVwZkRMlvQF4BlJ4yPi+TruWdf5ZmaFmTIFPv4Y2rSBpUuX7W/fPi2N3JLlVoOJZH622S77RM1xSRsBfYFRq7jO2xExOfv+CfACsFUuQZuZNaLZs+GHP4QvfjG9UNmtW3qpsls3GDYMBg0qOsJ85ToXmaS2wDPAdsB1ETGp5HA/4KGImFeybzdJU4HZwFkRMa3W9boDuwCl1ym10vOzawwGBgN0ben1UzMrzKefpilgPvwQ/vlP6NkTTj656KiaVq6d/BGxJCJ6AV2APpJ2Kjk8ELitZHsy0C0iegLXUKtmI2lDYARwRq2k1KDzS2IaFhEVEVHRqVOnNX00M7N6LV0KRx8N//pXmoq/Z8+iIypGk4wii4iPgAnAAQCSOgJ9gHElZebVNKlFxD1Au6wcWR/OCKAyIu6q5x71nm9m1pT+53/grrvgyivhB624VzjPUWSdJG2Sfd8A2A94MTvcHxgbEQtLyneW0pRvkvpksc3N9t0EvBARV63kfnWe3/hPZmZWvz//GS67DE48EU6vc8xr65FnH8yWwC1ZP0wb4M6IGJsdGwBcWqt8f+AkSdXAZ8CAiAhJ3wGOAp6TNCUr+z8RcY+kEwEi4ob6zs/x+czMlvPII+mN/f32a9mzJDeUWvPf4IqKiqiqqio6DDNrAaZPh113hc6d4fHHYZNNio4oP5KeiYiKVZXzm/xmZmtp7lw46CBo2xbGjm3ZyWV1eMlkM7O1sHgxHHYYvPkmPPwwbLtt0RE1H04wZmZrKAJOOAEefTTNjPztbxcdUfPiJjIzszV02WVw881w4YVw5JFFR9P8OMGYma2B4cPhvPNg4MCUYGxFTjBmZqvp6afhqKNgt93gT3/ycOT6OMGYma2GmTPTHGOdO8OoUbD++kVH1Hy5k9/MrIE++STNjrxgATz4YJol2ernBGNm1gBLlqT+lmnT4J574GtfKzqi5s8JxsysAc46C8aNgz/8Afbfv+hoyoP7YMzMVuH66+Hqq9PklSedVHQ05cMJxsxsJe6/H049NU0Fc+WVRUdTXpxgzMzqMW0aHH546m+57bY015g1nBOMmVkd3nsvLRbWvj3cfTd84QtFR1R+3MlvZlbLwoXQrx+8+26aZ6xr16IjKk9OMGZmJSLgJz+BJ56Av/8devcuOqLy5SYyM7MSF12U+luGDoX+/YuOprzllmAkrS/pKUlTJU2TdFG2f6KkKdlntqRR2f69JH1ccuyCkmsdIOklSa9IOree+60n6Y6szCRJ3fN6NjNrmf72t5RgjjkGzq3zL42tjjybyBYBfSNivqR2wGOS7o2IPWoKSBoBjC45Z2JE/KD0IpLaAtcB+wGzgKcljYmI52vd73jgw4jYTtIA4DLgiMZ/LDNriR5/HI47DvbcE4YN8wSWjSG3Gkwk87PNdtknao5L2gjoC4xaxaX6AK9ExGsRsRi4HTikjnKHALdk34cD+0j+T8TMVu3111OnfteucNddsO66RUfUMuTaByOpraQpwHvA+IiYVHK4H/BQRMwr2bdb1qR2r6SamX62At4sKTMr21fbf8pFRDXwMbB5Iz2KmbVQH32UXqKsrk5TwWzuvxqNJtcEExFLIqIX0AXoI2mnksMDgdtKticD3SKiJ3ANq67ZrBFJgyVVSaqaM2dOHrcwszLx+efpRcqXX4YRI2D77YuOqGVZZYLJaiG/WJubRMRHwATggOyaHUlNX+NKysyraVKLiHuAdlm5t4CtSy7XJdtX23/KSVoH2BiYW0cswyKiIiIqOnXqtDaPZWZlLAJOOw3Gj4c//hH23rvoiFqeVSaYiFhCqm2sFkmdJG2Sfd+A1En/Yna4PzA2IhaWlO9c02ciqU8W21zgaaCHpG0krQsMAMbUccsxwDEl1384IqKOcmZm/O53cMMNcPbZ6b0Xa3wNHUX2T0nXAncAn9bsjIjJKzlnS+CWbBRYG+DOiBibHRsAXFqrfH/gJEnVwGfAgCxBVEs6BbgfaAv8KSKmAUi6GKiKiDHATcBfJL0CfJDdw8xsBWPHwi9/CYceCr/5TdHRtFxqyD/yJU2oY3dERN/GD6npVFRURFVVVdFhmFkTmjoVdt8ddtwxTQPToUPREZUfSc9ERMWqyjWoBhMRbp00s7L39ttpAstNN4UxY5xc8tagUWSSNpZ0Vc3oK0lXSto47+DMzNZWZSV07w5t2qSf772XZkf+0peKjqzla+gw5T8BnwCHZ595wJ/zCsrMrDFUVsLgwTBjRho1tnhxekN/2rSiI2sdGtoHMyV7n2Wl+8qN+2DMWratt4ZZs1bc360bvPFGk4fTYjS0D6ahNZjPJH2n5OK7k0Z6mZk1K6+9lpY2/s536k4uADNnNm1MrVVDhymfCNxa0u/yIcveOTEzK0wEPPccjByZPlOnpv09e8LGG8PHH694jhcQaxqrTDCS2gA7RETPbIJKas0fZmbWpJYuhSefTBNTjhyZai0SfPvbqfbSrx9su+2yPpgFC5ad2749XHJJcbG3JqtMMBGxVNLZpBclnVjMrBCLF8OECSmhjB4N77wD7drBPvvAOefAwQdD587LnzNoUPo5ZEhqFuvaNSWXmv2Wr4Y2kT0o6SxWfJP/g1yiMjMDPv0U7rsvJZWxY1NzV4cO8P3vp7fwDzooNYOtzKBBTihFaWiCqVm46+SSfQFs27jhmFlrN3duek9l5Eh44AFYuDBNof9f/5WSyr77wgYbFB2lNURD+2B+HBH/bIJ4zKwVmjULRo1KSeXRR2HJEujSBX72s5RU9tgD1slz/V3LRUP7YK4FdmmCeMyslXjppZRQ7roLnn467dtxx9Sfcuih8M1vetnictfQfxM8JOkw4C5PgW9mDVFZuWLn+g47LBtO/MILqVzv3jB0aEoqO+5YbMzWuBr6Jv8nQHtgCbAQEGk25Y3yDS9ffpPfLB91DQ+W0jsrbdvCnnumhNKvX3rb3spLo86mTFodchCwTURcLKkrab0XM7PlLF0KZ565fHKBlFw23zw1jXnd+9ahoVPFXAfsyrKVLT8Brs0lIjMrOxHwzDPw3/+dZix+9926y33wgZNLa9LQGsy3IuIbkv4FEBEfZssXm1krNm0a3H57+rzySnrx8XvfS7WXuXNXLO8pWlqXhtZgPs+WPg4ASZ2ApblFZWbN1ssvw69/DTvtlD5Dh6Zay403prfr7747rXffvv3y53mKltanoQnm98BI4IuSLgEeA4au7ARJ60t6StJUSdMkXZTtnyhpSvaZLWlUrfN6S6qW1D/b3ruk/BRJCyX1q+N+x0qaU1Lupw18NjNbhZkz4YoroKICtt8e/vd/YbPN4NprYfZsGD8ejj8+7YP05vywYWlafCn9HDbMb9S3Ng0aRQYgaUdgH9IIsoci4oVVlBfQISLmS2pHSkqnR8STJWVGAKMj4tZsuy0wnjRS7U8RMbzWNTcDXgG6RMSCWseOBSoi4pQGPRAeRWa2Mu+8A8OHp+avf2avWffuDQMGwI9+5NFfrVljjyIjIl4EXlyN8gHMzzbbZZ//ZLNsZua+wHElp50KjAB613PZ/sC9tZOLmTWOuXPTi4+33w6PPJJGhO28c2raOuII+PKXi47Qykmuky9kNZJngO2A6yJiUsnhfqSa0Lys7FbAocDe1J9gBgBXreSWh0naE5gO/CIi3qwjpsHAYICu7nE0Y968NDvx7benub+qq6FHDzj//JRUvvrVoiO0cpVrgomIJUAvSZsAIyXtFBH/zg4PBG4sKX41cE42Nc0K15K0JbAzcH89t7sbuC0iFkk6AbiFVEOqHdMwYBikJrI1ezKz8rZgAYwbl5LKuHGwaFEa4fXLX6YmsF69PE2Lrb2GdvKvlYj4CJgAHAAgqSPQBxhXUqwCuF3SG6SmsD/U6sw/HBgZEZ/Xc4+5EbEo27wR+GajPoRZGamsTCO72rRJPysrUxIZMwaOPBK++EU4/HB44gk48UR4/PG0Rv1ll8Euuzi5WOPIrQaTDWX+PCI+krQBsB9wWXa4PzA2IhbWlI+IbUrOvTk7XjrCbCBw3krut2VEvJ1tHgysdBCCWUtVe5qWGTPgmGPSzMSffZZedPzxj1NNZY890tQtZnnIs4lsS+CWrB+mDWlFzLHZsQHApQ29kKTuwNbAo7X2XwxURcQY4DRJBwPVwAfAsWsZv1lZGjJkxWlalixJtZl7700rQLZrV0xs1ro0eJhyS+RhytYStWmTpm6pTUqjwszWVkOHKTdJH4yZ5e/991O/Sn3/ZvSgSWtqTjBmLcDo0fC1r6VVIQ8/3NO0WPPgBGNWxj76KHXg9+sHX/pSmtH4jjs8TYs1D17l2qxMPfAA/OQnaUqXCy5InfvrZnOcDxrkhGLFcw3GrMzMnw8nnZSmxd9oI3jySbjoomXJxay5cIIxKyOPPgpf/zr88Y9w1lkweXKa4disOXKCMSsDn30Gv/gF7L13Gob8j3/A5ZfD+usXHZlZ/dwHY9bMTZqUOvJfeglOPjlN59KhQ9FRma2aazBmzdSiRanj/tvfTm/mP/hgWuDLycXKhWswZs3Q1Klw9NHw7LNppNhVV8HGGxcdldnqcQ3GrBmprk7r3ffuDe+9l9a3v+kmJxcrT67BmDUTL7yQ+lqefhoGDoRrrkkzH5uVK9dgzAq2ZAlceWVah+W11+DOO+Fvf3NysfLnGoxZgV59FY49Fh57DA45JL3fssUWRUdl1jhcgzErQARcf316afK55+DWW2HkSCcXa1lcgzFrYjNnwvHHp2HH+++fOvG7dCk6KrPG5xqMWROJgJtvhp13hieegBtugPvuc3Kxliu3BCNpfUlPSZoqaZqki7L9EyVNyT6zJY2qdV5vSdWS+pfsW1Jyzph67reepDskvSJpUrbMslmz8M47qY/luOOgV6/0fssJJ6Tp9M1aqjybyBYBfSNivqR2wGOS7o2IPWoKSBoBjC7ZbgtcBjxQ61qfRUSvVdzveODDiNhO0oDsOkc0xoOYrY077oCf/zy9jf/b38Jpp6X5xMxautz+M49kfrbZLvv8ZzFXSRsBfYHSGsypwAjgvTW45SHALdn34cA+kv99aE2rshK6d08JZOutYdddYcAA6NEDpkyBM85wcrHWI9f/1CW1lTSFlDDGR8SkksP9gIciYl5WdivgUOD6Oi61vqQqSU9K6lfP7bYC3gSIiGrgY8BvEliTqayEwYNhxozU3zJrVpqo8kc/SsOQd9ih6AjNmlauCSYilmRNW12APpJ2Kjk8ELitZPtq4JyIWFrHpbpFRAVwJHC1pC+vaUySBmfJqmrOnDlrehmzFQwZkprBanvqKVjH4zWtFWqSynpEfARMAA4AkNQR6AOMKylWAdwu6Q2gP/CHmtpKRLyV/XwNeATYpY7bvAVsnV1/HWBjYG4dsQyLiIqIqOjUqVNjPJ4ZkIYfr85+s5Yuz1FknSRtkn3fANgPeDE73B8YGxELa8pHxDYR0T0iupP6UH4eEaMkbSppvew6HYHdgefruOUY4JiS6z8cEVFHObNG9+GH0K5d3ce6dm3aWMyaizwr7lsCt2Qjw9oAd0bE2OzYAODSBl7nK8AfJS3NrnNpRDwPIOlioCoixgA3AX+R9ArwQXYPs9y9+y5873tpTrH11kvruNRo3x4uuaS42MyKpNb8j/yKioqoqqoqOgwrYzNnwn77pQ79UaPSFPtDhqT9Xbum5DJoUNFRmjUuSc9k/eIr5a5HszU0fTrsuy/MmwcPPAC77572O6GYJU4wZmvg2WdTzSUCJkxIU+2b2fL8ypfZanriCfjud2HddeEf/3ByMauPE4zZanjwwVRz6dgxvTy5445FR2TWfDnBmDXQ6NFw0EGw7bYwcSJ061Z0RGbNmxOMWQP89a9w2GGpOeyRR6Bz56IjMmv+nGDMVuEPf4Cjjkr9LuPHw2abFR2RWXlwgjFbiUsvhZNPhh/+EMaNgy98oeiIzMqHE4xZHSLg3HPhvPPgyCNhxAhYf/2iozIrL34PxqyWpUvhlFPg+uvhxBPhuuu8hovZmvD/NmYlPv8cjj46JZezz079L04uZmvGNRizzMKFcMQRMGYMDB2amsfMbM05wZgB8+fDIYfAww/Dtdemjn0zWztOMNbqffABHHggVFXBrbemIclmtvacYKxVe+cd2H9/eOklGD4c+vUrOiKzlsMJxlqtGTPSdPuzZ6d3XPbdt+iIzFoWJxhrlV56KU1a+cknaQLL3XYrOiKzlscJxlqdKVNSs5iU5hXr2bPoiMxaptxG+EtaX9JTkqZKmibpomz/RElTss9sSaNqnddbUrWk/tl2L0lPZNd4VtIR9dzvWElzSq7907yezcrX44/DXnult/InTnRyMctTnjWYRUDfiJgvqR3wmKR7I2KPmgKSRgCjS7bbApcBD5RcZwFwdES8LOlLwDOS7o+Ij+q45x0RcUouT2Nl78EH01DkrbZK37t2LTois5YttxpMJPOzzXbZJ2qOS9oI6AuU1mBOBUYA75VcZ3pEvJx9n50d65RX3NYyjRyZ1nLZbrtUc3FyMctfrpNgSGoraQopKYyPiEklh/sBD0XEvKzsVsChwPUruV4fYF3g1XqKHJY1ow2XtHWjPISVvVtvhR/9CL7xjdTnssUWRUdk1jrkmmAiYklE9AK6AH0k7VRyeCBwW8n21cA5EbG0rmtJ2hL4C3BcPWXuBrpHxNeB8cAt9VxnsKQqSVVz5sxZ/YeysnLddXDMManfZfx42HTToiMyaz0UEasu1Rg3ki4AFkTEFZI6Ai8BW0XEwuz464Cy4h1JfS+DI2JU1pz2CDA0IoY34F5tgQ8iYuOVlauoqIiqqqo1fiZrviLgN7+BIUNSv8vtt3u6fbPGIumZiKhYVbk8R5F1krRJ9n0DYD/gxexwf2BsTXIBiIhtIqJ7RHQHhgM/z5LLusBI4NaVJZeshlPjYOCFRn0gKxs1a7kMGQI//jH8/e9OLmZFyHMU2ZbALVltog1wZ0SMzY4NAC5t4HUOB/YENpd0bLbv2IiYIulioCoixgCnSToYqAY+AI6t62LWMlVWpoQycyZ06JAmrzzppDRxpafbNytGkzWRNUduImsZKith8GBYsGDZvnXWgZtvhkGDCgvLrMUqvInMLE9LlsDrr8N998Gppy6fXACqq1ONxsyK46lirFn74AOYPj3NHVbzmT4dXn4ZFi1a+bkzZzZNjGZWNycYK9zixfDqq8uSR2kyef/9ZeXWWQe+/GXYYQc44ID0c4cd4MgjYdasFa/rlynNiuUEU6ZKO7W7doVLLimmv6GhcUTA22/XXRt5/fXU5FVjiy1S4jj00PRz++3Tz222gXbtVrz2pZeu2AfTvn2KxcyK4wRThmp3as+YkbahaZNMXXH87GfwxhupplGaTKZPT1Pj19hgA+jRA3bZBQYMWFYb2X572Hilby+tqOaZm0PCNbNlPIqsDEeRde+e/pjX1qkT3Hhjqg2UfqqrV9y3qk9DzrnzTvj00/rjlNIf+9JaSM2nSxcPHzYrVw0dReYaTBmqr/N6zpz01vrakqBt21V/6ksuUlpzpUePVFMxs9bJCaaMRMBdd6U/7tXVKx7v3BnGjk2d4Q1JEPV9GlqzqK8m1bUrfP3ra/WoZtYCOMGUiYkT4eyz4ckn03om77+//DDd9u3hiivgm99supguucSd62ZWP7eCN3MvvJCavfbcMzWN3XRTqjXcdBN065aao7p1g2HDmr5Te9CgdN+i4zCz5smd/M20k3/2bPj2OcGUAAAIbUlEQVTVr1Ii2XDDNHnj6aenGoKZWZHcyV+m5s2Dyy+Hq66Czz9P06Ccfz507Fh0ZGZmq8cJpplYvDg1L118cRoNNmBA6svYdtuiIzMzWzPugylYRFqv5KtfTbWVnXaCp5+G225zcjGz8uYEU6BHH4Vdd4XDD0/vi9xzDzz0EFSssmXTzKz5c4IpwLRp8MMfpnXiZ8+GP/85vZj4/e+n0VhmZi2BE0wTeust+OlP00uIEyemSRqnT4djj00vOJqZtSTu5G8CH38M//d/8Nvfpjm8Tj89Tcy4+eZFR2Zmlp/cajCS1pf0lKSpkqZJuijbP1HSlOwzW9KoWuf1llQtqX/JvmMkvZx9jqnnfptJGp+VGS9p07yeraEWL4bf/z7NLDx0aJp+/sUX0xBkJxcza+nybCJbBPSNiJ5AL+AASbtGxB4R0SsiegFPAHfVnCCpLXAZ8EDJvs2AC4FvAX2AC+tJHucCD0VED+ChbLsQS5fCHXfAV76Saiu9ekFVVZrefpttiorKzKxp5ZZgIpmfbbbLPv+ZNkDSRkBfoLQGcyowAnivZN/3gPER8UFEfAiMBw6o45aHALdk328B+jXGc6yuCRPgW99K77FsuGFaM378+KadI8zMrDnItZNfUltJU0gJY3xETCo53I9U45iXld0KOBS4vtZltgLeLNmele2rbYuIeDv7/g6wRT0xDZZUJalqzpw5q/1M9XnuOTjoIOjbF959F265BSZPhu99zyPDzKx1yjXBRMSSrCmsC9BH0k4lhwcCt5VsXw2cExFLG+G+QUltqdaxYRFREREVnTp1Wu1rV1amaerbtEk/r7kGfvIT6NkTHn88TfMyfTocfbRHhplZ69Yko8gi4iNJE0hNW/+W1JHUn3JoSbEK4Half+53BA6UVA28BexVUq4L8Egdt3lX0pYR8bakLVm+ma1R1LVE8GmnpURy5plw3nmw2WaNfVczs/KU5yiyTpI2yb5vAOwHvJgd7g+MjYiFNeUjYpuI6B4R3YHhwM8jYhRwP7C/pE2zzv39s321jQFqRpgdA4xu7GcaMmT5tU9qdO6cai5OLmZmy+TZRLYlMEHSs8DTpD6YsdmxASzfPFaviPgA+H/ZNZ4GLs72IelGSTUTq1wK7CfpZWDfbLtR1bdU8ezZjX0nM7Py5/VgVmM9mPqWCO7WDd54o9HCMjNr1hq6HoynilkNl1yy4oJfXiLYzKxuTjCrwUsEm5k1nOciW02DBjmhmJk1hGswZmaWCycYMzPLhROMmZnlwgnGzMxy4QRjZma5aNUvWkqaA9Tx6mSDdATeb8Rwyp1/H8vz72MZ/y6W1xJ+H90iYpWzBbfqBLM2JFU15E3W1sK/j+X597GMfxfLa02/DzeRmZlZLpxgzMwsF04wa25Y0QE0M/59LM+/j2X8u1heq/l9uA/GzMxy4RqMmZnlwglmDUg6QNJLkl6RdG7R8RRJ0taSJkh6XtI0SacXHVPRJLWV9C9JY1ddumWTtImk4ZJelPSCpN2Kjqkokn6R/T/yb0m3SVq/6Jjy5gSzmiS1Ba4Dvg98FRgo6avFRlWoauDMiPgqsCtwciv/fQCcDrxQdBDNxO+A+yJiR6AnrfT3Imkr4DSgIiJ2AtqSVvZt0ZxgVl8f4JWIeC0iFgO3A4cUHFNhIuLtiJicff+E9Adkq2KjKo6kLsBBwI1Fx1I0SRsDewI3AUTE4oj4qNioCrUOsIGkdYD2QItfbN0JZvVtBbxZsj2LVvwHtZSk7sAuwKRiIynU1cDZwNKiA2kGtgHmAH/OmgxvlNSh6KCKEBFvAVcAM4G3gY8j4oFio8qfE4w1CkkbAiOAMyJiXtHxFEHSD4D3IuKZomNpJtYBvgFcHxG7AJ8CrbLPUtKmpJaObYAvAR0k/bjYqPLnBLP63gK2Ltnuku1rtSS1IyWXyoi4q+h4CrQ7cLCkN0hNp30l/bXYkAo1C5gVETU12uGkhNMa7Qu8HhFzIuJz4C7g2wXHlDsnmNX3NNBD0jaS1iV11I0pOKbCSBKpjf2FiLiq6HiKFBHnRUSXiOhO+u/i4Yho8f9KrU9EvAO8KWmHbNc+wPMFhlSkmcCuktpn/8/sQysY8LBO0QGUm4iolnQKcD9pJMifImJawWEVaXfgKOA5SVOyff8TEfcUGJM1H6cCldk/xl4Djis4nkJExCRJw4HJpJGX/6IVvNHvN/nNzCwXbiIzM7NcOMGYmVkunGDMzCwXTjBmZpYLJxgzM8uFE4xZGZP0iKRWsb67lR8nGDMzy4UTjFkjk9RB0jhJU7O1P46QdIGkp7PtYdnb3DU1kN9KqsrWS+kt6S5JL0v6dVame7aeSmVWZrik9nXcd39JT0iaLOnv2fxwZoVxgjFrfAcAsyOiZ7b2x33AtRHRO9veAPhBSfnFEVEB3ACMBk4GdgKOlbR5VmYH4A8R8RVgHvDz0htK6gicD+wbEd8AqoBf5vaEZg3gBGPW+J4D9pN0maQ9IuJjYG9JkyQ9B/QFvlZSfkzJedOyNXYWkaZWqZlY9c2I+Gf2/a/Ad2rdc1fSAnj/zKbsOQbo1uhPZrYaPBeZWSOLiOmSvgEcCPxa0kOkWklFRLwp6VdA6XK5i7KfS0u+12zX/D9ae06n2tsCxkfEwEZ4BLNG4RqMWSOT9CVgQUT8FbicZVPUv5/1i/Rfg8t2LVnP/kjgsVrHnwR2l7RdFkMHSduvwX3MGo1rMGaNb2fgcklLgc+Bk4B+wL+Bd0hLPqyul4CTJf2JNOX99aUHI2KOpGOB2yStl+0+H5i+Rk9g1gg8m7JZM5ctRT02GyBgVjbcRGZmZrlwDcbMzHLhGoyZmeXCCcbMzHLhBGNmZrlwgjEzs1w4wZiZWS6cYMzMLBf/H/D8fTbl0HxsAAAAAElFTkSuQmCC\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "scores = []\n", "for i in range(10):\n", " kmeans = KMeans(20, n_init=1,\n", " max_iter=10,\n", " init='random')\n", " kmeans.fit(X)\n", " score = -1 * kmeans.score(X)\n", " scores.append(score)\n", " print('score=%g' % (score))\n", " \n", " \n", "plt.figure()\n", "plt.plot(range(10), sorted(scores), 'bo-')\n", "plt.xlabel('sample')\n", "plt.ylabel('error')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a way to represent words in 20-dimensional space.\n", "- The distance from each word vector to the means of each of the 20 clusters." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1.00874162 1.0266301 0.97164003 1.03189144 1.03358099 1.04555106\n", " 0.9921334 1.0001127 0.99631078 0.5906229 0.98516281 1.01236052\n", " 0.96816683 1.1055978 0.99184346 0.98134482 0.95310019 1.02250443\n", " 0.95707704 1.00474645]\n", "[0.94998976 0.99347211 0.92797163 1.00760459 0.87369431 1.02867307\n", " 0.97299124 0.97489501 0.97819442 0.89745615 0.95924255 0.90359807\n", " 0.93594526 1.2176145 0.98537882 0.96513233 0.93622821 0.9995155\n", " 0.85784706 0.98290799]\n", "[1.01027941 1.0360027 0.94782441 1.04189133 1.02496428 1.05721972\n", " 1.00490278 1.00783092 1.00071431 0.71358098 0.99042856 1.02849363\n", " 0.97045892 1.14231512 1.01697286 0.98572949 0.96373614 1.03701544\n", " 0.95307083 1.00673499]\n", "[1.0771791 1.0382791 1.01844359 1.04147931 1.09082894 1.06598877\n", " 1.00127222 1.01626417 1.02439718 1.09741815 0.99488921 1.00199894\n", " 0.99562545 1.28351384 1.04215818 0.98982216 1.04985219 1.04174968\n", " 1.01364399 1.01655176]\n" ] } ], "source": [ "def get_distances(word, contexts, distances):\n", " wd_idx = list(contexts.keys()).index(word)\n", " return distances[wd_idx]\n", "\n", "print(get_distances('love', contexts, distances))\n", "print(get_distances('like', contexts, distances))\n", "print(get_distances('hate', contexts, distances))\n", "print(get_distances('pizza', contexts, distances))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use these vectors to compute how similar two words are." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9958552518939588\n" ] } ], "source": [ "from math import sqrt\n", "def sim(v1, v2):\n", " \"\"\" cosine similarity of two vectors. \"\"\"\n", " return np.dot(v1, v2) / (sqrt(np.dot(v1, v1)) * sqrt(np.dot(v2,v2)))\n", " \n", "# FIXME: sqrt norm\n", "print(sim(get_distances('love', contexts, distances),\n", " get_distances('like', contexts, distances)))" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9995853218896937\n" ] } ], "source": [ "print(sim(get_distances('love', contexts, distances),\n", " get_distances('hate', contexts, distances)))" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.993983274406897\n" ] } ], "source": [ "print(sim(get_distances('love', contexts, distances),\n", " get_distances('pizza', contexts, distances)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, `love` is more similar to `like` than to `pizza`.\n", "\n", "\n", "


\n", "**However**, this approach treats each word the same when computing similarity. \n", "\n", "- Presumably, some context words are more important than others (e.g., `the` versus `hippopotamus`). \n", "- `tf-idf` captures this to some extent.\n", "- Can we use machine learning to weight features based on how predictive they are?\n", " - But what is the classification task?\n", " \n", "- See `Word2Vec.ipynb`\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "code_folding": [ 0 ] }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#\n", "from IPython.core.display import HTML\n", "HTML(open('../custom.css').read())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 1 }