{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CS 579\n",
"
\n",
"\n",
"## Clustering Words with K-Means\n",
"\n",
"
\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Motivation\n",
"\n",
"Often, we want to know which features appear together.\n",
"\n",
"- If you liked *Twilight* you might like *Nosferatu*.\n",
"- \"happy\" is a synonym of \"glad.\"\n",
"\n",
"Can be used to summarize a large collection of messages."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use k-means to cluster together related words from Twitter.\n",
"\n",
"**Caution:** This uses live Twitter data, which often contains profanity."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"100\n",
"200\n",
"300\n",
"400\n",
"500\n",
"600\n",
"700\n",
"800\n",
"900\n",
"1000\n",
"1100\n",
"1200\n",
"1300\n",
"1400\n",
"1500\n",
"1600\n",
"1700\n",
"1800\n",
"1900\n",
"2000\n",
"2100\n",
"2200\n",
"2300\n",
"2400\n",
"2500\n",
"2600\n",
"2700\n",
"2800\n",
"2900\n",
"3000\n",
"3100\n",
"3200\n",
"3300\n",
"3400\n",
"3500\n",
"3600\n",
"3700\n",
"3800\n",
"3900\n",
"4000\n",
"4100\n",
"4200\n",
"4300\n",
"4400\n",
"4500\n",
"4600\n",
"4700\n",
"4800\n",
"4900\n",
"5000\n",
"5100\n",
"5200\n",
"5300\n",
"5400\n",
"5500\n",
"5600\n",
"5700\n",
"5800\n",
"5900\n",
"6000\n",
"6100\n",
"6200\n",
"6300\n",
"6400\n",
"6500\n",
"6600\n",
"6700\n",
"6800\n",
"6900\n",
"7000\n",
"7100\n",
"7200\n",
"7300\n",
"7400\n",
"7500\n",
"7600\n",
"7700\n",
"7800\n",
"7900\n",
"8000\n",
"8100\n",
"8200\n",
"8300\n",
"8400\n",
"8500\n",
"8600\n",
"8700\n",
"8800\n",
"8900\n",
"9000\n",
"9100\n",
"9200\n",
"9300\n",
"9400\n",
"9500\n",
"9600\n",
"9700\n",
"9800\n",
"9900\n",
"10000\n"
]
}
],
"source": [
"# Get some tweets containing the word 'i'.\n",
"\n",
"import os\n",
"from TwitterAPI import TwitterAPI\n",
"\n",
"# Read Twitter credentials from environmental variables.\n",
"api = TwitterAPI(os.environ.get('TW_CONSUMER_KEY'),\n",
" os.environ.get('TW_CONSUMER_SECRET'),\n",
" os.environ.get('TW_ACCESS_TOKEN'),\n",
" os.environ.get('TW_ACCESS_TOKEN_SECRET'))\n",
"\n",
"# Collect 10000 tweets.\n",
"tweets = []\n",
"while True: \n",
" r = api.request('statuses/filter', {'track':'i',\n",
" 'language':'en'})\n",
" if r.status_code != 200: # error\n",
" break\n",
" else:\n",
" for item in r.get_iterator():\n",
" tweets.append(item)\n",
" if len(tweets) > 10000:\n",
" break\n",
" elif len(tweets) % 100 == 0:\n",
" print(len(tweets))\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10002\n"
]
}
],
"source": [
"print(len(tweets))"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"text @KYFriedComrade im rereading it ....maybe it is about twitter i dunno anymore i give up lol\n",
"description: gringa, adoptive santiaguina, ADHD stoner lela lez left stream of consciousness commentariat + retweets of dope ppl โ๐คฃ๐๐\n",
"name: naty ๐ค๐งก๐\n",
"location: Santiago, Chile\n"
]
}
],
"source": [
"# Each tweet is a Python dict.\n",
"print('text', tweets[0]['text'])\n",
"print('description:', tweets[0]['user']['description'])\n",
"print('name:', tweets[0]['user']['name'])\n",
"print('location:', tweets[0]['user']['location'])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"tweets = [t for t in tweets if 'text' in t]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"9806"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(tweets)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['im',\n",
" 'rereading',\n",
" 'it',\n",
" 'maybe',\n",
" 'it',\n",
" 'is',\n",
" 'about',\n",
" 'twitter',\n",
" 'i',\n",
" 'dunno',\n",
" 'anymore',\n",
" 'i',\n",
" 'give',\n",
" 'up',\n",
" 'lol']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Tokenize each tweet text.\n",
"import re\n",
"tokens = []\n",
"for tweet in tweets:\n",
" text = tweet['text'].lower()\n",
" text = re.sub('@\\S+', ' ', text) # Remove mentions.\n",
" text = re.sub('http\\S+', ' ', text) # Remove urls.\n",
" tokens.append(re.findall('[A-Za-z]+', text)) # Retain words.\n",
"tokens[0]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Count words.\n",
"from collections import Counter\n",
"\n",
"word_counts = Counter()\n",
"for tweet in tokens:\n",
" word_counts.update(tweet)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"13183 unique terms\n"
]
},
{
"data": {
"text/plain": [
"[('i', 11474),\n",
" ('rt', 5231),\n",
" ('the', 3741),\n",
" ('to', 3629),\n",
" ('a', 2863),\n",
" ('and', 2425),\n",
" ('you', 2414),\n",
" ('my', 2104),\n",
" ('it', 1820),\n",
" ('this', 1816)]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Inspect word counts.\n",
"import math\n",
"\n",
"print(len(word_counts), 'unique terms')\n",
"word_counts.most_common(10)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"4077 words occur at least three times.\n"
]
}
],
"source": [
"# Retain in vocabulary words occurring more than twice.\n",
"vocab = set([w for w, c in word_counts.items() if c > 2])\n",
"print('%d words occur at least three times.' % len(vocab))"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Prune tokens.\n",
"newtoks = []\n",
"for i, tweet in enumerate(tokens):\n",
" newtok = [token for token in tweet if token in vocab]\n",
" if len(newtok) > 0:\n",
" newtoks.append(newtok)\n",
"tokens = newtoks"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['im',\n",
" 'it',\n",
" 'maybe',\n",
" 'it',\n",
" 'is',\n",
" 'about',\n",
" 'twitter',\n",
" 'i',\n",
" 'anymore',\n",
" 'i',\n",
" 'give',\n",
" 'up',\n",
" 'lol']"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# A sample pruned tweet.\n",
"tokens[0]"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['rt',\n",
" 'from',\n",
" 'the',\n",
" 'bottom',\n",
" 'of',\n",
" 'my',\n",
" 'heart',\n",
" 'i',\n",
" 'hope',\n",
" 'is',\n",
" 'a',\n",
" 'better',\n",
" 'mental',\n",
" 'health',\n",
" 'year',\n",
" 'for',\n",
" 'everyone',\n",
" 'lt']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokens[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Context features**\n",
"\n",
"To determine if two words are similar, we will create a feature vector that counts how often other words appear nearby.\n",
"\n",
"E.g.,\n",
"\n",
"> I really **love** school.\n",
"\n",
"> I really **like** school.\n",
"\n",
"> You **love** school.\n",
"\n",
"**love:** {really@-1: 1, school@1: 2, you@-1: 1}\n",
"\n",
"**like:** {really@-1: 1, school@1: 1}\n",
"\n",
"
\n",
"\n",
"**Assumption**: words with similar meaning have similar contexts vectors.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"context for word twitter in ['im', 'it', 'maybe', 'it', 'is', 'about', 'twitter', 'i', 'anymore', 'i', 'give', 'up', 'lol']\n",
"['it@-2', 'maybe@-1', 'is@1', 'about@2']\n"
]
}
],
"source": [
"import numpy as np\n",
"def get_contexts(tweet, i, window):\n",
" \"\"\"\n",
" Get the context features for token at position i\n",
" in this tweet, using the given window size.\n",
" \"\"\"\n",
" features = []\n",
" for j in range(np.amax([0, i-window]), i):\n",
" features.append(tweet[j] + \"@\" + str(j-i))\n",
" for j in range(i+1, min(i + window + 1, len(tweet))):\n",
" features.append(tweet[j] + \"@\" + str(j-i))\n",
" return features\n",
"\n",
"print('context for word %s in %s' % (tokens[0][6], tokens[0]))\n",
"print(get_contexts(tokens[0], i=3, window=2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** Q: How would the approach differ if we ignore location of context?**\n",
"\n",
"E.g., **love:** {really: 1, school:1, you: 1} **vs** {really@-1: 1, school@1: 1, you@-1: 1}"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# For each term, create a context vector, indicating how often\n",
"# each word occurs to the left or right of it.\n",
"from collections import defaultdict\n",
"import numpy as np\n",
"\n",
"# dict from term to context vector.\n",
"contexts = defaultdict(lambda: Counter())\n",
"window = 2\n",
"for tweet in tokens:\n",
" for i, token in enumerate(tweet):\n",
" features = get_contexts(tweet, i, window)\n",
" contexts[token].update(features)\n",
" # Optionally: ignore word order\n",
" # contexts[token].update(tweet[:i] + tweet[i+1:])\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('rt@-1', 1823),\n",
" ('m@1', 1667),\n",
" ('t@2', 798),\n",
" ('rt@-2', 529),\n",
" ('and@-1', 444),\n",
" ('to@2', 433),\n",
" ('love@1', 424),\n",
" ('a@2', 418),\n",
" ('have@1', 412),\n",
" ('don@1', 398),\n",
" ('am@1', 374),\n",
" ('can@1', 361),\n",
" ('you@2', 359),\n",
" ('ll@1', 334),\n",
" ('ve@1', 326),\n",
" ('when@-1', 323),\n",
" ('just@1', 320),\n",
" ('but@-1', 277),\n",
" ('was@1', 276),\n",
" ('the@2', 262)]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"contexts['i'].most_common(20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**tf-idf vectors**\n",
"\n",
"- We will transform the context features by dividing by (the log of) the number of distinct terms this feature appears in."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('i@-1', 11338),\n",
" ('i@-2', 10905),\n",
" ('i@1', 9485),\n",
" ('i@2', 7121),\n",
" ('rt@-1', 5217)]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Compute the number of different contexts each term appears in.\n",
"# Actually: this is the total number of times this context feature appears.\n",
"tweet_freq = Counter()\n",
"for context in contexts.values():\n",
" tweet_freq.update(context)\n",
"tweet_freq.most_common(5)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": true
},
"outputs": [
{
"data": {
"text/plain": [
"Counter({1766: 1,\n",
" 25: 71,\n",
" 109: 8,\n",
" 55: 24,\n",
" 772: 1,\n",
" 88: 12,\n",
" 5200: 1,\n",
" 306: 1,\n",
" 2381: 1,\n",
" 3436: 1,\n",
" 34: 45,\n",
" 7: 842,\n",
" 412: 2,\n",
" 106: 8,\n",
" 649: 1,\n",
" 211: 2,\n",
" 44: 32,\n",
" 5: 1369,\n",
" 1032: 1,\n",
" 2021: 1,\n",
" 903: 1,\n",
" 4: 2130,\n",
" 2335: 2,\n",
" 3249: 1,\n",
" 1188: 1,\n",
" 2850: 1,\n",
" 1377: 1,\n",
" 670: 1,\n",
" 32: 43,\n",
" 7121: 1,\n",
" 62: 19,\n",
" 27: 62,\n",
" 2216: 1,\n",
" 5217: 1,\n",
" 14: 224,\n",
" 35: 41,\n",
" 60: 18,\n",
" 320: 1,\n",
" 10: 415,\n",
" 815: 2,\n",
" 219: 1,\n",
" 3558: 1,\n",
" 40: 32,\n",
" 118: 8,\n",
" 862: 1,\n",
" 75: 14,\n",
" 6: 1045,\n",
" 10905: 1,\n",
" 99: 7,\n",
" 97: 6,\n",
" 140: 4,\n",
" 385: 3,\n",
" 424: 2,\n",
" 131: 3,\n",
" 201: 4,\n",
" 33: 42,\n",
" 114: 7,\n",
" 91: 5,\n",
" 45: 26,\n",
" 21: 102,\n",
" 1453: 1,\n",
" 119: 4,\n",
" 78: 10,\n",
" 65: 25,\n",
" 468: 1,\n",
" 258: 1,\n",
" 144: 1,\n",
" 52: 26,\n",
" 585: 1,\n",
" 1447: 1,\n",
" 1: 240,\n",
" 158: 2,\n",
" 103: 10,\n",
" 116: 7,\n",
" 214: 2,\n",
" 92: 11,\n",
" 199: 3,\n",
" 59: 17,\n",
" 8: 669,\n",
" 410: 1,\n",
" 584: 1,\n",
" 16: 192,\n",
" 79: 13,\n",
" 510: 2,\n",
" 367: 1,\n",
" 41: 48,\n",
" 1652: 1,\n",
" 1807: 1,\n",
" 94: 11,\n",
" 113: 4,\n",
" 3: 3235,\n",
" 31: 59,\n",
" 586: 1,\n",
" 1196: 1,\n",
" 133: 3,\n",
" 173: 1,\n",
" 20: 132,\n",
" 3656: 1,\n",
" 503: 1,\n",
" 2202: 1,\n",
" 15: 203,\n",
" 390: 3,\n",
" 911: 1,\n",
" 227: 3,\n",
" 185: 4,\n",
" 574: 1,\n",
" 2: 831,\n",
" 803: 1,\n",
" 46: 23,\n",
" 453: 1,\n",
" 66: 17,\n",
" 19: 132,\n",
" 2037: 1,\n",
" 28: 70,\n",
" 73: 15,\n",
" 64: 13,\n",
" 2771: 1,\n",
" 93: 7,\n",
" 1270: 1,\n",
" 490: 1,\n",
" 146: 3,\n",
" 1305: 1,\n",
" 905: 1,\n",
" 653: 1,\n",
" 149: 5,\n",
" 161: 5,\n",
" 9485: 1,\n",
" 80: 12,\n",
" 49: 14,\n",
" 23: 89,\n",
" 84: 9,\n",
" 83: 12,\n",
" 9: 454,\n",
" 1475: 1,\n",
" 1036: 1,\n",
" 129: 6,\n",
" 293: 2,\n",
" 321: 1,\n",
" 2452: 1,\n",
" 102: 3,\n",
" 70: 14,\n",
" 12: 301,\n",
" 37: 47,\n",
" 1480: 1,\n",
" 297: 3,\n",
" 61: 12,\n",
" 966: 1,\n",
" 69: 9,\n",
" 1898: 1,\n",
" 1522: 1,\n",
" 48: 27,\n",
" 148: 3,\n",
" 403: 1,\n",
" 484: 2,\n",
" 1629: 1,\n",
" 215: 4,\n",
" 18: 123,\n",
" 13: 259,\n",
" 17: 172,\n",
" 162: 3,\n",
" 105: 5,\n",
" 176: 1,\n",
" 192: 2,\n",
" 1328: 1,\n",
" 24: 81,\n",
" 58: 21,\n",
" 29: 65,\n",
" 1194: 1,\n",
" 57: 22,\n",
" 206: 2,\n",
" 353: 1,\n",
" 873: 1,\n",
" 11: 369,\n",
" 122: 6,\n",
" 77: 14,\n",
" 226: 3,\n",
" 30: 64,\n",
" 76: 13,\n",
" 1645: 1,\n",
" 39: 34,\n",
" 1432: 1,\n",
" 445: 1,\n",
" 51: 12,\n",
" 26: 68,\n",
" 743: 1,\n",
" 438: 3,\n",
" 627: 1,\n",
" 1363: 1,\n",
" 1122: 1,\n",
" 197: 3,\n",
" 1624: 1,\n",
" 451: 1,\n",
" 619: 1,\n",
" 640: 1,\n",
" 301: 1,\n",
" 1372: 1,\n",
" 67: 15,\n",
" 180: 3,\n",
" 1366: 1,\n",
" 442: 1,\n",
" 536: 1,\n",
" 661: 1,\n",
" 81: 10,\n",
" 47: 33,\n",
" 22: 100,\n",
" 1219: 1,\n",
" 648: 1,\n",
" 72: 11,\n",
" 123: 2,\n",
" 407: 1,\n",
" 187: 3,\n",
" 339: 1,\n",
" 50: 16,\n",
" 1322: 1,\n",
" 449: 1,\n",
" 152: 6,\n",
" 63: 22,\n",
" 1548: 1,\n",
" 43: 32,\n",
" 901: 1,\n",
" 345: 1,\n",
" 881: 1,\n",
" 479: 1,\n",
" 183: 2,\n",
" 71: 15,\n",
" 364: 4,\n",
" 333: 1,\n",
" 387: 1,\n",
" 589: 1,\n",
" 147: 5,\n",
" 174: 2,\n",
" 355: 2,\n",
" 242: 2,\n",
" 181: 1,\n",
" 909: 1,\n",
" 1027: 1,\n",
" 446: 2,\n",
" 858: 1,\n",
" 2170: 1,\n",
" 492: 2,\n",
" 233: 2,\n",
" 179: 2,\n",
" 89: 7,\n",
" 1547: 1,\n",
" 107: 4,\n",
" 124: 5,\n",
" 139: 3,\n",
" 885: 1,\n",
" 402: 2,\n",
" 101: 7,\n",
" 86: 12,\n",
" 3542: 1,\n",
" 157: 1,\n",
" 245: 1,\n",
" 291: 2,\n",
" 324: 1,\n",
" 150: 5,\n",
" 2046: 1,\n",
" 216: 2,\n",
" 299: 2,\n",
" 3620: 1,\n",
" 621: 1,\n",
" 505: 1,\n",
" 56: 14,\n",
" 400: 1,\n",
" 172: 3,\n",
" 110: 1,\n",
" 325: 3,\n",
" 104: 5,\n",
" 126: 4,\n",
" 177: 3,\n",
" 1268: 1,\n",
" 1204: 1,\n",
" 130: 5,\n",
" 504: 1,\n",
" 3184: 1,\n",
" 134: 2,\n",
" 204: 2,\n",
" 1406: 1,\n",
" 1407: 1,\n",
" 121: 3,\n",
" 285: 2,\n",
" 738: 1,\n",
" 924: 1,\n",
" 283: 1,\n",
" 38: 35,\n",
" 1477: 1,\n",
" 53: 14,\n",
" 362: 1,\n",
" 1282: 1,\n",
" 704: 1,\n",
" 125: 9,\n",
" 155: 2,\n",
" 141: 1,\n",
" 686: 1,\n",
" 85: 13,\n",
" 383: 1,\n",
" 2700: 1,\n",
" 839: 2,\n",
" 570: 1,\n",
" 347: 1,\n",
" 210: 3,\n",
" 82: 6,\n",
" 365: 2,\n",
" 282: 1,\n",
" 357: 2,\n",
" 54: 10,\n",
" 1066: 1,\n",
" 170: 1,\n",
" 189: 2,\n",
" 222: 1,\n",
" 545: 1,\n",
" 317: 1,\n",
" 377: 1,\n",
" 356: 1,\n",
" 778: 1,\n",
" 87: 9,\n",
" 154: 2,\n",
" 635: 1,\n",
" 376: 1,\n",
" 68: 12,\n",
" 136: 3,\n",
" 265: 3,\n",
" 213: 4,\n",
" 253: 2,\n",
" 287: 1,\n",
" 1329: 1,\n",
" 184: 4,\n",
" 108: 8,\n",
" 239: 1,\n",
" 397: 2,\n",
" 74: 5,\n",
" 167: 2,\n",
" 255: 1,\n",
" 487: 1,\n",
" 1595: 1,\n",
" 90: 3,\n",
" 1452: 1,\n",
" 135: 2,\n",
" 1677: 1,\n",
" 432: 1,\n",
" 1709: 1,\n",
" 200: 5,\n",
" 250: 2,\n",
" 525: 2,\n",
" 883: 1,\n",
" 42: 17,\n",
" 188: 3,\n",
" 138: 4,\n",
" 414: 1,\n",
" 169: 3,\n",
" 723: 2,\n",
" 137: 3,\n",
" 1737: 1,\n",
" 127: 5,\n",
" 194: 1,\n",
" 768: 1,\n",
" 304: 1,\n",
" 511: 1,\n",
" 932: 1,\n",
" 163: 3,\n",
" 593: 1,\n",
" 120: 1,\n",
" 481: 1,\n",
" 1042: 1,\n",
" 708: 1,\n",
" 143: 10,\n",
" 145: 3,\n",
" 153: 4,\n",
" 337: 1,\n",
" 363: 1,\n",
" 515: 2,\n",
" 11338: 1,\n",
" 117: 6,\n",
" 430: 1,\n",
" 100: 7,\n",
" 112: 7,\n",
" 343: 1,\n",
" 132: 3,\n",
" 241: 4,\n",
" 1283: 1,\n",
" 844: 1,\n",
" 166: 2,\n",
" 352: 1,\n",
" 472: 1,\n",
" 394: 2,\n",
" 234: 3,\n",
" 220: 1,\n",
" 1467: 1,\n",
" 3582: 1,\n",
" 164: 1,\n",
" 368: 2,\n",
" 171: 2,\n",
" 98: 8,\n",
" 483: 1,\n",
" 1324: 1,\n",
" 96: 5,\n",
" 678: 1,\n",
" 36: 29,\n",
" 263: 1,\n",
" 354: 1,\n",
" 286: 1,\n",
" 142: 2,\n",
" 228: 1,\n",
" 378: 1,\n",
" 278: 2,\n",
" 361: 2,\n",
" 539: 1,\n",
" 159: 1,\n",
" 281: 1,\n",
" 186: 1,\n",
" 208: 3,\n",
" 346: 1,\n",
" 209: 2,\n",
" 2345: 1,\n",
" 195: 1,\n",
" 379: 1,\n",
" 259: 2,\n",
" 160: 1,\n",
" 202: 1,\n",
" 115: 1,\n",
" 277: 2,\n",
" 745: 1,\n",
" 623: 1,\n",
" 319: 1,\n",
" 274: 2,\n",
" 212: 1,\n",
" 232: 1,\n",
" 657: 1,\n",
" 328: 1,\n",
" 348: 2,\n",
" 370: 1,\n",
" 221: 1,\n",
" 331: 1,\n",
" 646: 1,\n",
" 332: 1,\n",
" 225: 1,\n",
" 554: 1,\n",
" 360: 1,\n",
" 271: 1,\n",
" 231: 1,\n",
" 251: 1,\n",
" 244: 2,\n",
" 452: 1,\n",
" 436: 1,\n",
" 637: 1,\n",
" 266: 1,\n",
" 128: 2,\n",
" 165: 1,\n",
" 111: 4,\n",
" 230: 1,\n",
" 302: 1,\n",
" 312: 1,\n",
" 1683: 1,\n",
" 899: 1,\n",
" 193: 1,\n",
" 1568: 1,\n",
" 551: 1,\n",
" 289: 1,\n",
" 660: 1,\n",
" 1774: 1,\n",
" 156: 2,\n",
" 532: 1,\n",
" 429: 1,\n",
" 249: 1,\n",
" 295: 1,\n",
" 606: 2,\n",
" 493: 1,\n",
" 405: 2,\n",
" 409: 1,\n",
" 466: 1,\n",
" 393: 1,\n",
" 528: 1,\n",
" 168: 1,\n",
" 513: 1,\n",
" 151: 1,\n",
" 330: 1,\n",
" 191: 1,\n",
" 229: 1,\n",
" 238: 1,\n",
" 95: 1,\n",
" 175: 1})"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Counter(tweet_freq.values())"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('i@2', 1335),\n",
" ('i@1', 1331),\n",
" ('i@-2', 1296),\n",
" ('the@-1', 1087),\n",
" ('and@1', 978)]"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# As opposed to the above, this computes the number of unique terms that this feature\n",
"# appears in. Q: How do you expect to affect the output?\n",
"tweet_freq_2 = Counter()\n",
"for context in contexts.values():\n",
" tweet_freq_2.update(context.keys())\n",
"tweet_freq_2.most_common(5)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('m@1', 0.4992255422401076),\n",
" ('rt@-1', 0.4843410432712745),\n",
" ('t@2', 0.24288406965963458),\n",
" ('love@1', 0.14382229255205484),\n",
" ('rt@-2', 0.1405945803985488)]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Transform each context vector to be term freq / tweet frequency. \n",
"# Also then normalize by length.\n",
"for term, context in contexts.items():\n",
" for term2, frequency in context.items():\n",
" # tf / [ 1 + log(df) ]\n",
" context[term2] = frequency / (1. + math.log(tweet_freq[term2]))\n",
" length = math.sqrt(sum([v*v for v in context.values()]))\n",
" for term2, frequency in context.items():\n",
" context[term2] = 1. * frequency / length\n",
" \n",
"contexts['i'].most_common(5)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('worthless@-2', 0.33764514643816557),\n",
" ('high@-1', 0.2816246484908819),\n",
" ('holding@2', 0.2323291846302529),\n",
" ('at@-1', 0.2204425813331341),\n",
" ('to@-1', 0.20596171539422523),\n",
" ('son@-2', 0.18688995834710687),\n",
" ('and@1', 0.18452879080306742),\n",
" ('her@2', 0.16723062126872543),\n",
" ('in@-2', 0.16473427852389158),\n",
" ('uneducated@1', 0.15939656236107255)]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"contexts['school'].most_common(10)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('i@-1', 0.8517303851828948),\n",
" ('you@1', 0.341304167014421),\n",
" ('rt@-2', 0.14556771763853768),\n",
" ('so@2', 0.11603489382180723),\n",
" ('i@-2', 0.11090169450417534),\n",
" ('this@1', 0.09080734081934172),\n",
" ('u@1', 0.08118539291674556),\n",
" ('it@1', 0.080832127605461),\n",
" ('in@-1', 0.07776084336903491),\n",
" ('i@2', 0.06941393383495353)]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"contexts['love'].most_common(10)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('i@-1', 0.6736723361573105),\n",
" ('grocery@1', 0.3146058368934483),\n",
" ('store@2', 0.2657744340570913),\n",
" ('fucking@-1', 0.23378948537609445),\n",
" ('pressure@2', 0.23209310502084446),\n",
" ('the@1', 0.2063211337802833),\n",
" ('i@-2', 0.18442359838340952),\n",
" ('rt@-2', 0.17387587244484196),\n",
" ('amp@-2', 0.141357574805482),\n",
" ('me@1', 0.07726901404021158)]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"contexts['hate'].most_common(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At this point we have a list of dictionaries, one per term, indicating the terms that co-occur (weighted by inverse tweet frequency).\n",
"\n",
"Next, we have to cluster these vectors. To do this, we'll need to be able to compute the euclidean distance between two vectors."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.4142135623730951\n",
"2.23606797749979\n"
]
}
],
"source": [
"# n.b. This is not efficient!\n",
"def distance(c1, c2):\n",
" if len(c1.keys()) == 0 or len(c2.keys()) == 0:\n",
" return 1e9\n",
" keys = set(c1.keys()) | set(c2.keys())\n",
" distance = 0.\n",
" for k in keys:\n",
" distance += (c1[k] - c2[k]) ** 2\n",
" return math.sqrt(distance)\n",
"\n",
"print(distance({'hi':10, 'bye': 5}, {'hi': 9, 'bye': 4}))\n",
"print(distance({'hi':10, 'bye': 5}, {'hi': 8, 'bye': 4}))"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['love', 'hope', 'miss', 'm', 'am', 'guess', 'll', 'mean', 'cant',\n",
" 'wish'], dtype=' 1]\n",
"contexts = dict([(term, contexts[term]) for term in nz_contexts])\n",
"print(len(nz_contexts), 'nonzero contexts')"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"im\n",
"[('it@1', 0.01264806377201966), ('maybe@2', 0.025412211140884603), ('y@-2', 0.01883753470306806)]\n"
]
}
],
"source": [
"# e.g., what are three context features for the term \"rt\"?\n",
"print(list(contexts.keys())[0])\n",
"print(list(list(contexts.values())[0].items())[:3])"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['a@-1' 'a@-2' 'a@1' 'a@2' 'aaaaaa@-1' 'aaaaaa@-2' 'aaaaaa@1' 'aaaaaa@2'\n",
" 'aaron@-1' 'aaron@-2']\n",
" (0, 1)\t0.012176616904934769\n",
" (0, 2)\t0.10774881812308722\n",
" (0, 3)\t0.08406853449098409\n",
" (0, 234)\t0.02946119105248044\n",
" (0, 265)\t0.024956944731107572\n",
" (0, 332)\t0.02065859263633043\n",
" (0, 368)\t0.014543761318494223\n",
" (0, 369)\t0.014329995051336608\n",
" (0, 393)\t0.02368590894509358\n",
" (0, 413)\t0.01978485270160371\n",
" (0, 434)\t0.018043994507438255\n",
" (0, 436)\t0.015269990983388075\n",
" (0, 475)\t0.015840112090628022\n",
" (0, 480)\t0.01604913069072621\n",
" (0, 483)\t0.036733833080258765\n",
" (0, 485)\t0.02443477905749157\n",
" (0, 486)\t0.04897844410701169\n",
" (0, 611)\t0.04022372022569253\n",
" (0, 641)\t0.03840265052217021\n",
" (0, 880)\t0.05818855015204386\n",
" (0, 881)\t0.02916261855721708\n",
" (0, 1053)\t0.01678036618972246\n",
" (0, 1063)\t0.020161956350368875\n",
" (0, 1206)\t0.0446093161177403\n",
" (0, 1209)\t0.022203132192976167\n",
" :\t:\n",
" (0, 14849)\t0.04492780311830858\n",
" (0, 14975)\t0.4561416854630144\n",
" (0, 15102)\t0.019454490377035092\n",
" (0, 15137)\t0.03246274061678437\n",
" (0, 15216)\t0.018578551803747474\n",
" (0, 15217)\t0.018633750350250212\n",
" (0, 15224)\t0.01978485270160371\n",
" (0, 15490)\t0.014819584801508275\n",
" (0, 15491)\t0.015207803801313627\n",
" (0, 15502)\t0.3141806233384448\n",
" (0, 15532)\t0.03659194385094082\n",
" (0, 15671)\t0.013794628438236247\n",
" (0, 15735)\t0.04108584562516659\n",
" (0, 15836)\t0.024359499004844704\n",
" (0, 15870)\t0.025924583046390885\n",
" (0, 15873)\t0.024548591946210438\n",
" (0, 15882)\t0.022409430110929447\n",
" (0, 15893)\t0.01883753470306806\n",
" (0, 15897)\t0.02417935783176741\n",
" (0, 15944)\t0.01820112202264155\n",
" (0, 15971)\t0.051086598423935724\n",
" (0, 16003)\t0.012327178077380431\n",
" (0, 16004)\t0.0497998532898863\n",
" (0, 16006)\t0.03695460482404303\n",
" (0, 16059)\t0.03639315451392994\n"
]
}
],
"source": [
"# Transform context dicts to a sparse vector\n",
"# for sklearn.\n",
"from sklearn.feature_extraction import DictVectorizer\n",
"\n",
"vec = DictVectorizer()\n",
"X = vec.fit_transform(contexts.values())\n",
"names = np.array(vec.get_feature_names())\n",
"print(names[:10])\n",
"print(X[0])"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"56\n",
" (0, 0)\t0.01166315589869699\n",
" (0, 1)\t0.009432626921748103\n",
" (0, 2)\t0.0069556430065184915\n",
" (0, 3)\t0.020932638324993653\n",
" (0, 12)\t0.009893569523380101\n",
" (0, 28)\t0.0029242019578079237\n",
" (0, 29)\t0.005935227260208678\n",
" (0, 31)\t0.005850253394142066\n",
" (0, 40)\t0.00502062522553737\n",
" (0, 134)\t0.01569827105681456\n",
" (0, 136)\t0.00393482667128263\n",
" (0, 181)\t0.008700840482568875\n",
" (0, 219)\t0.009893569523380101\n",
" (0, 233)\t0.0058241406815781925\n",
" (0, 241)\t0.003442994829099182\n",
" (0, 251)\t0.003660037555588828\n",
" (0, 278)\t0.0057055343458282184\n",
" (0, 334)\t0.0038489132041543044\n",
" (0, 367)\t0.002790339436949068\n",
" (0, 368)\t0.008449753057068984\n",
" (0, 369)\t0.002775185795961421\n",
" (0, 370)\t0.019557649156184565\n",
" (0, 402)\t0.005416543436788495\n",
" (0, 410)\t0.009893569523380101\n",
" (0, 412)\t0.003966821569702731\n",
" :\t:\n",
" (0, 15787)\t0.004102954333698908\n",
" (0, 15828)\t0.03718458281719579\n",
" (0, 15829)\t0.0031113083765471677\n",
" (0, 15831)\t0.006289728235394356\n",
" (0, 15847)\t0.022014344111415302\n",
" (0, 15857)\t0.005957911837271409\n",
" (0, 15863)\t0.004299918936961106\n",
" (0, 15871)\t0.00502062522553737\n",
" (0, 15885)\t0.004193444669798336\n",
" (0, 15894)\t0.007005863991464066\n",
" (0, 15895)\t0.003619603953271363\n",
" (0, 15946)\t0.003511581004101353\n",
" (0, 15969)\t0.020762766580557777\n",
" (0, 15971)\t0.009893569523380101\n",
" (0, 15990)\t0.004428087090688781\n",
" (0, 16003)\t0.01193657408187391\n",
" (0, 16004)\t0.012055468703578194\n",
" (0, 16005)\t0.341304167014421\n",
" (0, 16006)\t0.035783646133738856\n",
" (0, 16014)\t0.004876067603250058\n",
" (0, 16019)\t0.011392909025650088\n",
" (0, 16020)\t0.002893415638699762\n",
" (0, 16021)\t0.031212968706279383\n",
" (0, 16022)\t0.0028439084729393136\n",
" (0, 16060)\t0.007437161692980253\n",
"while@1\n"
]
}
],
"source": [
"# Which row of X is the word \"love\"?\n",
"love_idx = list(contexts.keys()).index('love')\n",
"print(love_idx)\n",
"# What are the context feature values for love?\n",
"print(X[love_idx])\n",
"# Print a highly ranking feature.\n",
"print(names[15534])"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,\n",
" n_clusters=20, n_init=10, n_jobs=None, precompute_distances='auto',\n",
" random_state=None, tol=0.0001, verbose=0)"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Let's cluster!\n",
"# http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html\n",
"from sklearn.cluster import KMeans\n",
"num_clusters = 20\n",
"kmeans = KMeans(num_clusters)\n",
"kmeans.fit(X)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 to@1 i@-2 i@-1 be@2 the@2\n",
"1 a@-1 i@1 i@2 a@-2 the@-1\n",
"2 the@1 up@1 i@-2 i@2 my@1\n",
"3 my@-1 and@1 i@2 the@-1 i@1\n",
"4 i@-2 m@-1 i@1 m@-2 for@1\n",
"5 of@1 the@-1 the@2 a@-1 a@-2\n",
"6 it@-1 so@-2 much@-1 many@-1 i@1\n",
"7 rt@-1 i@2 i@1 and@1 is@1\n",
"8 in@1 the@2 i@-2 my@2 the@-1\n",
"9 i@-1 rt@-2 i@-2 you@1 it@1\n",
"10 the@2 but@1 i@2 on@1 be@-1\n",
"11 i@1 m@2 rt@-1 i@2 i@-2\n",
"12 me@1 and@-1 a@1 i@-2 i@2\n",
"13 t@1 i@-1 you@-1 be@2 rt@-2\n",
"14 with@1 my@2 i@-2 the@2 to@-1\n",
"15 i@2 i@1 and@1 the@-2 a@-2\n",
"16 to@-1 i@-2 i@-1 t@-1 i@2\n",
"17 the@-1 in@-2 i@1 i@2 of@-2\n",
"18 i@-2 i@2 just@-1 was@-1 and@2\n",
"19 of@-1 and@1 i@1 the@-2 the@-1\n"
]
}
],
"source": [
"# Let's print out the top features for each mean vector.\n",
"# This is swamped by common terms\n",
"for i in range(num_clusters):\n",
" print(i, ' '.join(names[np.argsort(\n",
" kmeans.cluster_centers_[i])[::-1][:5]]))"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"distance from term \"love\" to each cluster:\n",
"[1.00874162 1.0266301 0.97164003 1.03189144 1.03358099 1.04555106\n",
" 0.9921334 1.0001127 0.99631078 0.5906229 0.98516281 1.01236052\n",
" 0.96816683 1.1055978 0.99184346 0.98134482 0.95310019 1.02250443\n",
" 0.95707704 1.00474645]\n",
"closest cluster to \"love\":\n",
"9\n"
]
}
],
"source": [
"# .transform will compute the distance from each context to each cluster.\n",
"distances = kmeans.transform(X)\n",
"# e.g., what is the distance from the word \"love\" to each cluster?\n",
"print('distance from term \"love\" to each cluster:')\n",
"print(distances[love_idx])\n",
"# what is the closest cluster for the word \"love\"?\n",
"print('closest cluster to \"love\":')\n",
"print(np.argmin(distances[love_idx]))"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 supposed listening going wanted used decided listen able needs \n",
"\n",
"1 good few bit wrap couple dream virgin requirement little \n",
"\n",
"2 for on of at with all is during from \n",
"\n",
"3 favorite face hair life heart boyfriend wife brother head \n",
"\n",
"4 not screaming sorry dying doin scared gonna assuming sobbing \n",
"\n",
"5 part rest one majority type habit tired instead benefit \n",
"\n",
"6 everyday aaaaaa whew and but yard sounds things seemed \n",
"\n",
"7 hello hi okay i mingyu nah psa mutual age \n",
"\n",
"8 participate doctor punched zoomed manga engage lived live jump \n",
"\n",
"9 am love hope just ll have guess mean miss \n",
"\n",
"10 shade one me pumped this and okay that here \n",
"\n",
"11 but when because and omg what before where ok \n",
"\n",
"12 with tell making call is help on in make \n",
"\n",
"13 didn wasn ain wouldn couldn haven doesn isn can \n",
"\n",
"14 ruv communicate agree familiar wrong flights along interact dealing \n",
"\n",
"15 that me time but the day this my is \n",
"\n",
"16 be see make do say save hear go cry \n",
"\n",
"17 way best world weekends first saddest waist bathroom bus \n",
"\n",
"18 a never like you sorry u done to so \n",
"\n",
"19 us mone nowhere the concerning them luck people prof \n",
"\n"
]
}
],
"source": [
"# Finally, we'll print the words that are closest\n",
"# to the mean of each cluster.\n",
"terms = np.array(list(contexts.keys()))\n",
"for i in range(distances.shape[1]):\n",
" print(i, ' '.join(terms[np.argsort(distances[:,i])[1:10]]), '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Clearly, interpreting these results requires a bit of investigation.\n",
"\n",
"As the number of tweets increases, we expect these clusters to become more coherent."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**How does error decrease with number of cluster?**"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-3727.0865918080995"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kmeans.score(X)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"k=5 score=3860.63\n",
"k=10 score=3786.09\n",
"k=20 score=3722.77\n",
"k=50 score=3638.46\n",
"k=100 score=3553.49\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAY4AAAEKCAYAAAAFJbKyAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3Xm81HXZ//HXm0UBFdA8KMpminuCecSl7FZMRHNNQI0UM6W0cstM8741LUIzxd3CNPGWVNwIUTP0xl3Bg+KCuPBTZFGB3Ak38Pr98fmezoQgZw5nzvecmffz8ZgHM5/5zsw1TnHx/Xyvz/VRRGBmZlZfrfIOwMzMWhYnDjMzK4oTh5mZFcWJw8zMiuLEYWZmRXHiMDOzojhxmJlZUZw4zMysKE4cZmZWlDZ5B1AK66+/fvTq1SvvMMzMWpRp06b9MyKqVnVcWSaOXr16UVNTk3cYZmYtiqTX63Ocp6rMzKwoThxmZlYUJw4zMyuKE4eZmRXFicPMzIrixFFg7Fjo1QtatUp/jh2bd0RmZs1PWZbjNsTYsTB8OCxZkh6//np6DDB0aH5xmZk1Nz7jyJx5Zl3SqLVkSRo3M7M6ThyZOXOKGzczq1ROHJkePYobNzOrVE4cmREjoEOH/xzr0CGNm5lZnZIlDkntJE2V9IykGZLOycb3lPSUpOmSHpG0WTZ+lKRF2fh0SccUvNcwSa9kt2GliHfoUBg9Grp1S487dkyPfWHczOw/KSJK88aSgLUiYrGktsAjwInA9cCBETFT0vFAv4g4StJRQHVE/HS591kPqAGqgQCmATtExLsr++zq6upYnSaH++0Hzz8Pr70GUoPfxsysRZE0LSKqV3Vcyc44IlmcPWyb3SK7dczGOwFvrOKt9gYmRcQ7WbKYBAwsQcj/NnhwKsd98slSfoqZWctU0mscklpLmg4sJP3lPwU4Brhb0jzgCOC8gpccIulZSbdK6p6NbQzMLThmXjZWMgceCG3bwrhxpfwUM7OWqaSJIyKWRURfoBvQT9K2wMnAvhHRDfgLcFF2+J1Ar4jYjnRWMaaYz5I0XFKNpJpFixatVtydO8OAAXDLLVCimTwzsxarSaqqIuI9YDKwD9AnO/MAuBnYNTvm7Yj4JBv/M7BDdn8+0L3g7bplY8t/xuiIqI6I6qqqVW5gtUpDhqQ1HFOnrvZbmZmVlVJWVVVJ6pzdbw/sBcwEOknaPDusdgxJXQtefkDtOHAvMEDSupLWBQZkYyV1wAFpuuqWW0r9SWZmLUspe1V1BcZIak1KUOMiYqKkY4HbJH0OvAscnR1/gqQDgKXAO8BRABHxjqTfALWXqs+NiHdKGDeQpqv23jsljgsucHWVmVmtkpXj5ml1y3FrXX89DBsGTzwBO+3UCIGZmTVjuZfjloPa6SpXV5mZ1XHi+BKF01VleGJmZtYgThyrMHgwzJ0LU6as+lgzs0rgxLEKBxwAa6zh6iozs1pOHKtQuBjw88/zjsbMLH9OHPUwZEiarvJiQDMzJ456qZ2ucnWVmZkTR7106pSqq2691dNVZmZOHPXk6iozs8SJo55cXWVmljhx1FPtdJWrq8ys0jlxFGHIEJg3z9NVZlbZnDiKsP/+rq4yM3PiKEKnTjBwoKurzKyyOXEUafDgNF31xBN5R2Jmlg8njiIdcACsuaarq8yscjlxFKljR1dXmVllc+JogCFDYP58T1eZWWVy4miA/fdP01WurjKzSuTE0QAdO7q6yswqlxNHAw0enKarHn8870jMzJqWE0cD1U5XubrKzCqNE0cD1U5XubrKzCqNE8dqGDIE3njD01VmVllKljgktZM0VdIzkmZIOicb31PSU5KmS3pE0mbZ+JqSbpY0S9IUSb0K3uuMbPwlSXuXKuZiubrKzCpRKc84PgH6R0QfoC8wUNLOwFXA0IjoC/wV+O/s+B8C70bEZsAo4HwASVsDhwHbAAOBKyW1LmHc9bbOOrDPPq6uMrPKUrLEEcni7GHb7BbZrWM23gl4I7t/IDAmu38rsKckZeM3RcQnEfEaMAvoV6q4izV4cJqueuyxvCMxM2saJb3GIam1pOnAQmBSREwBjgHuljQPOAI4Lzt8Y2AuQEQsBd4HvlI4npmXjS3/WcMl1UiqWbRoUam+0he4usrMKk1JE0dELMumpLoB/SRtC5wM7BsR3YC/ABc10meNjojqiKiuqqpqjLesF09XmVmlaZKqqoh4D5gM7AP0yc48AG4Gds3uzwe6A0hqQ5rGertwPNMtG2s2aqurPF1lZpWglFVVVZI6Z/fbA3sBM4FOkjbPDqsdA5gADMvuDwL+LyIiGz8sq7raBOgNTC1V3A2x336urjKzytGmhO/dFRiTVUC1AsZFxERJxwK3SfoceBc4Ojv+GuB/Jc0C3iFVUhERMySNA14AlgI/iYhlJYy7aIXTVRdfDK28OsbMypjSP+rLS3V1ddTU1DTpZ954I3zve/DQQ7Dbbk360WZmjULStIioXtVx/rdxI9lvP2jdGr7znXTG0asXjB2bd1RmZo2vlFNVFWXChPTnhx+mP19/HYYPT/eHDs0nJjOzUvAZRyM580xYttyVlyVL0riZWTlx4mgkc+YUN25m1lI5cTSSHj2KGzcza6mcOBrJiBHQocN/jrVtm8bNzMqJE0cjGToURo+Gnj1BgvbtYelS6NYt78jMzBqXE0cjGjoUZs9OPaveegt6907tSOY3qwYpZmarx4mjRDp2hDvugH/9CwYNgk8/zTsiM7PG4cRRQltvDX/5CzzxBJx8ct7RmJk1DieOEhs8GE49Fa68EsaMWfXxZmbNnRNHExg5EvbYA378Y3j66byjMTNbPU4cTaBNG7jpJlh/ffjud+Htt/OOyMys4Zw4mkiXLqnt+htvpOqr5duTmJm1FE4cTWinneCyy+Dee+HXv847GjOzhnHiaGLHHgtHHw2//S387W95R2NmVjwnjiYmwRVXQHU1HHkkvPxy3hGZmRXHiSMH7dql6x1t26aL5YsX5x2RmVn9OXHkpGfPVGk1cyYccwyU4Q6+ZlamnDhy9O1vp+65N98Mo0blHY2ZWf04ceTsl7+Egw+G006DBx7IOxozs1Vz4siZBNddlzrpHnoozJuXd0RmZl/OiaMZ6NgRbr897VE+eDB88kneEZmZrVzJEoekdpKmSnpG0gxJ52TjD0uant3ekDQ+G99d0vsFz51V8F4DJb0kaZak00sVc5622qquk+5JJ+UdjZnZyrUp4Xt/AvSPiMWS2gKPSLonInarPUDSbUDhMriHI2K/wjeR1Bq4AtgLmAc8KWlCRLxQwthzMWhQutbx+9+nVeZHHZV3RGZmX1SyM45IalcotM1u/y46ldQR6A+MX8Vb9QNmRcSrEfEpcBNwYAlCbhZGjID+/VMn3aeeyjsaM7MvKuk1DkmtJU0HFgKTImJKwdMHAfdHxAcFY7tkU1v3SNomG9sYmFtwzLxsbPnPGi6pRlLNokWLGvmbNJ3aTrpduriTrpk1TyVNHBGxLCL6At2AfpK2LXj6cODGgsdPAT0jog9wGas+E1n+s0ZHRHVEVFdVVa1u6LmqqoLbboM334TDD3cnXTNrXpqkqioi3gMmAwMBJK1PmoK6q+CYD2qntiLibqBtdtx8oHvB23XLxsrajjumnlaTJsFZZ636eDOzplLKqqoqSZ2z++1JF7dfzJ4eBEyMiI8Ljt9QkrL7/bLY3gaeBHpL2kTSGsBhwIRSxd2cHHNMuv3ud+6ka2bNRymrqroCY7KqqFbAuIiYmD13GHDecscPAo6TtBT4CDgsIgJYKumnwL1Aa+DaiJhRwriblcsug+nT4Ygj4MknYYst8o7IzCqdogy761VXV0dNTU3eYTSaOXNghx3SBfMpU2DttfOOyMzKkaRpEVG9quO8crwF6NEjNUJ88cW0CVQZ5noza0GcOFqI/v1h5Ei45Ra46KK8ozGzSubE0YL84hdwyCGpo+7kyXlHY2aVyomjBZFSP6vNN0+ddOfOXfVrzMwamxNHC7POOqmT7scfp95W7qRrZk3NiaMF2nLLtIfH1Klw4ol5R2NmlcaJo4X67nfh9NPhT3+Ca6/NOxozqyROHC3Yb3+b9i0//ngoo2UrZtbMOXG0YK1bw403wgYbpGqrf/4z74jMrBI4cbRw66+fOukuWAC77w49e0KrVtCrF4wdm3d0ZlaOnDjKQHV16mU1Y0ZqTxIBr78Ow4c7eZhZ43PiKBOTJn1xbMkSOPPMpo/FzMrbKhNHtovfyU0RjDXcnDnFjZuZNdQqE0dELCPt1mfNWI8eKx5fY43Ujt3MrLHUd6rqUUmXS9pN0tdrbyWNzIoyYgR06PCfY2uskW79+qVrIPPm5RObmZWX+iaOvsA2wLnAhdntD6UKyoo3dCiMHp2qqqT057XXpmRxxhmpq+7mm6dtaBcvzjtaM2vJvJFThZg9O600v/lm6No1naEMG5ZKd83MoJE3cpLUSdJFkmqy24WSOq1+mNZUevWCm26Cxx5L10OOPjqV8T7wQN6RmVlLU99/b14LfAgMyW4fAH8pVVBWOrvsAo8/Dn/9a1ppvscecPDB8MoreUdmZi1FfRPHphFxdkS8mt3OAb5aysCsdCQ4/HB46aU0ZXXffbDNNnDKKfDuu3lHZ2bNXX0Tx0eSvln7QNI3gI9KE5I1lfbt4Ve/Smcbw4bBxRfDZpvBZZfBZ5/lHZ2ZNVf1TRw/Bq6QNFvSbOBy4Ecli8qa1IYbwtVXw9NPQ9++cMIJ8LWvwcSJqX2JmVmh+qwcbwVsERF9gO2A7SJi+4h4tuTRWZPq0ydNW02YkBLG/vvDgAHwrH9pMytQn5XjnwOnZfc/iIgP6vPGktpJmirpGUkzJJ2TjT8saXp2e0PS+Gxcki6VNEvSs4ULDCUNk/RKdhvWoG9q9SKlhPH883DJJTBtGmy/PRx7LLz1Vt7RmVlzUN+pqvsknSqpu6T1am+reM0nQP/sTKUvMFDSzhGxW0T0jYi+wOPA7dnx+wC9s9tw4CqA7HPOBnYC+gFnS1q3mC9pxWvbNk1ZzZqV/rzuOujdG0aOhI98dcusotU3cRwK/AR4CJiW3b50hV0ktWuU22a3f8+YS+oI9AfGZ0MHAtdnr3sC6CypK7A3MCki3omId4FJwMB6xm2rab31YNSo1LJ9zz3TxfSttkprQnz9w6wy1fcax/cjYpPlbqssx806604HFpL+8p9S8PRBwP0FU18bA3MLnp+Xja1s3JrQ5pvD+PFw//2w7rqpnHfXXeGJJ/KOzMyaWn2vcVzekDePiGXZlFQ3oJ+kbQuePhy4sSHvuyKShteubF+0aFFjva0tp3//tL/5NdekNia77ALf+17aOMrMKkN9p6rul3SIJDXkQyLiPWAy2RSTpPVJ1yvuKjhsPtC94HG3bGxl48t/xuiIqI6I6qqqqoaEafXUunVqWfLKK/A//wN33AFbbJGmsT78MO/ozKzU6ps4fgSMAz6R9IGkDyV9aXWVpCpJnbP77YG9gBezpwcBEyPi44KXTACOzKqrdgbej4g3gXuBAZLWzS6KD8jGLGdrrw3nngsvvwyDBqUL5717w5//DMuW5R2dmZVKfRNHJ+Ao4LcR0ZHUYn2vVbymKzBZ0rPAk6RrHBOz5w7ji9NUdwOvArOAq4HjASLiHeA32Xs8CZybjVkz0b073HADTJmSVp4fe2wq4b3vvrwjM7NSqFdbdUlXAZ+Tymu3yv7l/4+I2LHUATaE26rnJwJuvRVOOy1dA9lvP7jgAthyy7wjM7NVadS26sBOEfET4GOArCx2jdWIz8qUBIMHw8yZcP758OCDqX3JCSfA22/nHZ2ZNYb6Jo7PJLUmW4chqYp0BmK2Qu3apbOOWbPgmGPgiivSNNaoUfDpp3lHZ2aro76J41LgDqCLpBHAI8DvShaVlY0uXeCqq+CZZ9Le56ecklq4jx/vBYRmLVW9EkdEjCX1qxoJvAkcFBG3lDIwKy/bbgv33gt3353amRx8cFoT8vTTeUdmZsWq947TEfFiRFwREZdHxMxSBmXla599UrfdK65IjRR32CGtCXnjjbwjM7P6qnfiMGssbdrA8cenBYQ//3kq5d18c/jNb2DJkryjM7NVceKw3HTunEp1Z86EgQPhrLPSCvQbboDPXXph1mw5cVjuNt00rf146CHYYAM44gjYeWd45JG8IzOzFXHisGZjt91g6lQYMwbmz0+PhwyBV1/NOzIzK+TEYc1Kq1Zw5JGp/9Wvfw133ZX2/zjtNHj//byjMzNw4rBmaq214OyzUwI5/PB0LaR3b/jjH2Hp0ryjM6tsThzWrG28cdq2tqYmnXkcdxz07ZvWhJhZPpw4rEXYYQd44AG4/Xb4+ONUhbXPPvDCC3lHZlZ5nDisxZDSivMZM+DCC+Hxx2G77dKaEG/6aNZ0nDisxVlzzdTzatasNHU1enRqoHjBBfDJJ3lHZ1b+nDisxVp/fbjsMnjuuVS6e9pp6TrIrbe6gaJZKTlxWIu31VYwcSL84x+pGmvwYPjWt9IFdTNrfE4cVjb22it12/3Tn1IZ7447pjUh8+blHZlZeXHisLLSpg0MH54aKJ5+Oowblxoonn02/OtfeUdnVh6cOKwsdewII0fCiy/CAQfAueemBYTXXecGimary4nDylqvXnDTTfDYY9CjB/zgB2kK68EH847MrOVy4rCKsMsuKXn89a9pzcfuu8N3v5tKes2sOE4cVjFatUp9r156CUaMgEmTYOut02ZS776bd3RmLUfJEoekdpKmSnpG0gxJ52TjkjRC0suSZko6IRvfXdL7kqZnt7MK3mugpJckzZJ0eqlitsrQvj386lfpAvqRR8KoUen6x+WXw2ef5R2dWfNXyjOOT4D+EdEH6AsMlLQzcBTQHdgyIrYCbip4zcMR0Te7nQsgqTVwBbAPsDVwuKStSxi3VYgNN4Q//zmV8PbpAz/7WWphctddXkBo9mVKljgiWZw9bJvdAjgOODciPs+OW7iKt+oHzIqIVyPiU1KiObBEYVsF6tMH7rsPJkxIFVf77QcDBqQV6Wb2RSW9xiGptaTpwEJgUkRMATYFDpVUI+keSb0LXrJLNrV1j6RtsrGNgbkFx8zLxswajQT775+SxSWXwLRpqX37j34ECxbkHZ1Z81LSxBERyyKiL9AN6CdpW2BN4OOIqAauBq7NDn8K6JlNbV0GjC/msyQNz5JRzSK3SrUGWmMNOOGEVG11wglw7bWpgeLIkamdu5k1UVVVRLwHTAYGks4Ybs+eugPYLjvmg9qprYi4G2graX1gPumaSK1u2djynzE6Iqojorqqqqpk38Uqw3rrpYvmM2bAnnumi+lbbpnWhPj6h1W6UlZVVUnqnN1vD+wFvEg6k9gjO+y/gJezYzaUpOx+vyy2t4Engd6SNpG0BnAYMKFUcZsV2nxzGD8e7r8fOndO5by77gpPPJF3ZGb5KeUZR1dgsqRnSX/5T4qIicB5wCGSngNGAsdkxw8Cnpf0DHApcFh2gX0p8FPgXmAmMC4iZpQwbrMv6N8/Xfe45hqYPTstKPze92DOnLwjM2t6ijI8766uro4a99S2Elm8GM4/H/7wh/T4lFNSQ8UJE+DMM1My6dEjLTIcOjTfWM2KIWladv35y49z4jBrmLlz4YwzYOzY1FTx44/h00/rnu/QIe1O6ORhLUV9E4dbjpg1UPfucMMNMGVK2rK2MGkALFmSzkDMyo0Th9lq6tfvi0mj1pw5LuO18uPEYdYIevRY8XgEdOmSemLdddfKE4xZS+LEYdYIRoxI1zQKdegAv/wlDBkCd96ZWplsuCEcc0xqcbJ0aT6xmq0uJw6zRjB0aLoQ3rNnal/Ss2d6fN55qZHiggUwcSJ85ztpO9u99oKNNoLjj4eHHvKuhNayuKrKrIl99BH8/e9pFfqdd6bHG22UzkwOPRR22iklH7Om5qoqs2aqfXs4+GC4+WZYuBBuvDFdYL/qqrSwcJNN0hTXU0+5vYk1T04cZjlae2047DC44440nTVmDGyzDVx0EeywQ2p58j//A88/n3ekZnWcOMyaiU6d6qqv3noLrr4aevWC3/0OvvY12HZb+M1v4OWX847UKp0Th1kz9JWvpOqrSZPgjTfgiitSx96zz4YttoCvfz21PZk9O+9IrRI5cZg1cxtsUFd9NXduave+xhqpP9Ymm8DOO8PFF8P8L2w2YFYaThxmLcjGG8NJJ6W27q++msp9P/0UTj45tUD51rfgyivTRXezUnHiMGuhCquvXnoJzjkH3nkHfvIT6NoVvv3ttIbknXfyjtTKjROHWRkorL567rm0Y+GcOXDssWmq6zvfgeuvh/ffzztSKwdOHGZlprb66qWX0uZTp5yStsAdNiwlkYMPTosP//WvvCO1lsqJw6xMSXXVV6+9Bo8/DscdB1Onpi1wu3RJK9Vvvz2tXjerLycOswogpeqrUaNSZdaDD8JRR8HkyXDIIelM5IgjUj8td/C1VXHiMKswrVql6qsrrkhrRCZNSmced90F+++fOvj+8Idp3B18bUWcOMwqWJs2qfrq6qvTavW77krt32+5BQYMqOvg++CDsGxZ3tFac+HEYWZAWlS4776p+mrhwnTto3//1D9r993TOpGTTkrXStx8sbI5cZjZF7RrV1d9tXBh+nPnneGPf4Rdd01rSE47LVVtOYlUHicOM/tSa61VV321cGE6I9l223Shvbo6rSH57/92B99KUrLEIamdpKmSnpE0Q9I52bgkjZD0sqSZkk4oGL9U0ixJz0r6esF7DZP0SnYbVqqYzezLdexYV321YEFamd6rF4wcmTr4brMNnHtuWkNi5atkOwBKErBWRCyW1BZ4BDgR2ArYAzgqIj6X1CUiFkraF/gZsC+wE3BJROwkaT2gBqgGApgG7BAR767ss70DoFnTWrgQbrstTWk9/HCavurbN52pHHpomtqy5i/3HQAjWZw9bJvdAjgOODciPs+Oq23HdiBwffa6J4DOkroCewOTIuKdLFlMAgaWKm4zK16XLmlx4YMPpnUiF1+crpOccQZ89atpO9xRo2DevLwjtcZQ0mscklpLmg4sJP3lPwXYFDhUUo2keyT1zg7fGJhb8PJ52djKxs2sGdp4YzjxxFR99dpraeX60qWp9Un37rDbbmkNyYIFeUdqDVXSxBERyyKiL9AN6CdpW2BN4OPsdOhq4NrG+CxJw7NkVLNo0aLGeEszW029etVVX738cuqh9d578NOfpjUie+6Z1pC8/XbekVoxmqSqKiLeAyaTppjmAbdnT90BbJfdnw90L3hZt2xsZePLf8boiKiOiOqqqqrG/QJmttp6907VV889lyqwzjwzTWsNH55Wq++7b1oz4g6+zV8pq6qqJHXO7rcH9gJeBMaTLo4D/BdQu4PyBODIrLpqZ+D9iHgTuBcYIGldSesCA7IxM2uhCquvnnoKfv5zeOGF1D+rSxc46CC48UZYvHiVb2U5aFPC9+4KjJHUmpSgxkXEREmPAGMlnQwsBo7Jjr+bVFE1C1gC/AAgIt6R9Bvgyey4cyPCW9OYlQEJtt8+3UaOTJ17b7453f72N2jfPrVAOeww2Gef9NjyV7Jy3Dy5HNesZfv8c3j00VTee+utqdx37bXhwANTEhkwILVIscaVezmumVlDtWpVV301fz7cd19KGPfckzr4brBB6uD7j3+4g28enDjMrFlr06au+urNN1MH3wMOSGcie++d9lc/7jh44AF38G0qThxm1mLUdvAdMyatA7njjtQW/vrrYY890jqRE0+Exx5L011WGk4cZtYitWtXV321cGG6oL7LLvCnP8E3vpHanPziF1BT4w6+jc2Jw8xavLXWgiFDUr+shQvhf/83NV285BLYcce0huTMM+HZZ51EGoMTh5mVlY4d4fvfr+vge801qV/W+edDnz5pDck557iD7+pw4jCzsrXuunD00an66o034Mor0wLDc86BLbdMHXzPOy/11LL6c+Iws4pQ28H3gQdSl96LL4YOHeo6+PbrBxdd5A6+9eHEYWYVZ6ON6qqvZs+G3/8+VWH9/Od1HXwvvxzeeivvSJsnJw4zq2g9e9ZVX73yCvz2t6nR4s9+llrE77knjB4N//xn3pE2H04cZmaZzTarq76aMSN18503D370o7TQcJ994LrrUmv4SubEYWa2AltvnS6iv/giPP00nHpquv+DH6SWJwceCH/9a2V28HXiMDP7ElKqvho5El59FaZMSRtRTZsGQ4emi+6DB6c1JB99lHe0TcOJw8ysnqRUfXXhhTBnDjz8cCr3feghGDQoJZGhQ+HOO+GTT/KOtnScOMzMGqBVK/jmN1P1VW0H38MPh7//PTVh3GCDNK11773w2Wd5R9u4nDjMzFZTbQff0aNTCe/dd6c+WrffDgMHpgvrP/4xTJ5cHh18nTjMzBpR27Z11VcLFsD48WnjqRtugP79oVs3OOGElt3B14nDzKxE2rWrq75auBDGjYNdd017i3zjG9CrV6rWamkdfJ04zMyaQIcOddVXCxemM5A+feDSS1MH3802g1/9qmV08HXiMDNrYuusU1d9tWABXHttShy//31KJoVrSJojJw4zsxytu25d9dWbb8JVV8GGG6bEsdVWKZHUriFpLpw4zMyaiaqquuqrefPSRlRrr52msDbdNE1pXXghzJ2bb5xOHGZmzdBGG6Xqq0cfhddfhwsuSOOnngo9eqQ1JJddVtfBd+zYdLG9Vav059ixpYtN0dyvwjRAdXV11NTU5B2GmVmjmzUrVWfddBM891xazb7llvD//h98+mndcR06pHUlQ4fW/70lTYuI6lUdV7IzDkntJE2V9IykGZLOycavk/SapOnZrW82vruk9wvGzyp4r4GSXpI0S9LppYrZzKy5K6y+euEFOOus1A6+MGkALFmSOv2WQsnOOCQJWCsiFktqCzwCnAj8GJgYEbcud/zuwKkRsd9y462Bl4G9gHnAk8DhEfHCyj7bZxxmVklatVpxCa9U3CLD3M84IqltONw2uzUkS/UDZkXEqxHxKXATcGAjhWlm1uL16FHc+Ooq6cVxSa0lTQcWApMiYkr21AhJz0oaJWl3HqFTAAAHR0lEQVTNgpfskk1t3SNpm2xsY6CwhmBeNrb8Zw2XVCOpZtGiRaX4OmZmzdKIEemaRqEOHdJ4KZQ0cUTEsojoC3QD+knaFjgD2BLYEVgP+GV2+FNAz4joA1wGjC/ys0ZHRHVEVFdVVTXadzAza+6GDk0Xwnv2TNNTPXsWf2G8GE1SjhsR7wGTgYER8WY2jfUJ8BfSVBQR8UHt1FZE3A20lbQ+MB/oXvB23bIxMzPLDB0Ks2enaxqzZ5cuaUBpq6qqJHXO7rcnXdx+UVLXbEzAQcDz2eMNszEk9ctie5t0Mby3pE0krQEcBkwoVdxmZvbl2pTwvbsCY7KqqFbAuIiYKOn/JFUBAqaTqqwABgHHSVoKfAQcFqnka6mknwL3Aq2BayNiRgnjNjOzL+EFgGZmBjSDclwzMytPThxmZlaUspyqkrQIeD3vOHK0PvDPvIPIkb+/v7+/f8P0jIhVrmcoy8RR6STV1Geeslz5+/v7+/uX9vt7qsrMzIrixGFmZkVx4ihPo/MOIGf+/pXN37/EfI3DzMyK4jMOMzMrihNHCyapu6TJkl7Idlk8MRtfT9IkSa9kf66bd6yllLXvf1rSxOzxJpKmZDtG3pz1OCtLkjpLulXSi5JmStqlkn5/SSdn/9t/XtKN2c6jZf37S7pW0kJJzxeMrfA3V3Jp9t/iWUlfb4wYnDhatqXAzyNia2Bn4CeStgZOB+6PiN7A/dnjcnYiMLPg8fnAqIjYDHgX+GEuUTWNS4C/R8SWQB/Sf4eK+P0lbQycAFRHxLakXnaHUf6//3XAwOXGVvab7wP0zm7DgasaIwAnjhYsa1H/VHb/Q9JfGhuTdkgckx02htSFuCxJ6gZ8B/hz9lhAf6B2a+Ky/f6SOgHfAq4BiIhPsy0MKub3JzVqbS+pDdABeJMy//0j4iHgneWGV/abHwhcn21l8QTQubZD+epw4igTknoB2wNTgA0i4s3sqbeADXIKqylcDJwG1O6s/BXgvYhYmj1e4Y6RZWITYBHwl2yq7s+S1qJCfv+ImA/8AZhDShjvA9OonN+/0Mp+83rtoFosJ44yIGlt4DbgpIj4oPC5rDV9WZbOSdoPWBgR0/KOJSdtgK8DV0XE9sC/WG5aqsx//3VJ/6LeBNgIWIsvTuFUnKb4zZ04WjhJbUlJY2xE3J4NLyjYMKsrac/3cvQN4ABJs4GbSFMUl5BOx2v3minnHSPnAfMiYkr2+FZSIqmU3//bwGsRsSgiPgNuJ/1volJ+/0Ir+81LsoOqE0cLls3nXwPMjIiLCp6aAAzL7g8D/tbUsTWFiDgjIrpFRC/SRdH/i4ihpG2KB2WHlfP3fwuYK2mLbGhP4AUq5PcnTVHtLKlD9v+F2u9fEb//clb2m08Ajsyqq3YG3i+Y0mowLwBswSR9E3gYeI66Of5fka5zjAN6kLoED4mI5S+mlRVJuwOnRsR+kr5KOgNZD3ga+H62x33ZkdSXVBiwBvAq8AOyHTepgN9f0jnAoaQKw6eBY0hz+GX7+0u6Edid1AV3AXA2MJ4V/OZZQr2cNIW3BPhBRKz2LndOHGZmVhRPVZmZWVGcOMzMrChOHGZmVhQnDjMzK4oTh5mZFcWJw6wJSVrcwNcdlDWwNMudE4dZy3AQUFTiKFg9bdaonDisokjqle1bcXW2j8M/JLXPnntAUnV2f/2slQmSjpI0PtvnYLakn0o6JWss+ISk9VbwORtIukPSM9lt1+We3712/5Ds8eWSjsrun6e0x8qzkv6QvfYA4AJJ0yVtmt3+LmmapIclbZm99jpJf5Q0Bfi9pP/KXjM9i3edkvyHtYrif5FYJeoNHB4Rx0oaBxwC3LCK12xL6j7cDpgF/DIitpc0CjiS1KW30KXAgxFxsKTWwNr1CUzSV4CDgS0jIiR1joj3JE0AJkbErdlx9wM/johXJO0EXEnq1QWpH9GuEbFM0p3ATyLi0awZ5sf1icPsyzhxWCV6LSKmZ/enAb3q8ZrJ2Z4nH0p6H7gzG38O2G4Fx/cnJRQiYhmp5Xd9vE/6y/2a7Ixk4vIHZAlgV+CW1FECgDULDrkl+0yAR4GLJI0Fbo+IefWMw2ylPFVllaiwb9Ey6v4BtZS6/0+0+5LXfF7w+HMa9g+wws/69+dl+0j0I3W63Q/4+wpe24q050TfgttWBc//q/ZORJxH6t/UHni0dkrLbHU4cZjVmQ3skN0f9CXH1cf9wHHw7z3ROy33/OvA1pLWlNSZ1Nm19myiU0TcDZxM2g4W4ENgHYBsz5XXJA3OXiNJfVgBSZtGxHMRcT7wJODEYavNicOszh+A4yQ9Teo8ujpOBPaQ9BxpOuw/KqIiYi6pm+nz2Z9PZ0+tA0yU9CzwCHBKNn4T8IvsAvemwFDgh5KeAWaQNjRakZMkPZ+932fAPav5vczcHdfMzIrjMw4zMyuKE4eZmRXFicPMzIrixGFmZkVx4jAzs6I4cZiZWVGcOMzMrChOHGZmVpT/DxmtF5f7zcJlAAAAAElFTkSuQmCC\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"scores = []\n",
"num_cluster_options = [5,10,20,50,100]\n",
"\n",
"for num_clusters in num_cluster_options:\n",
" kmeans = KMeans(num_clusters, n_init=10, max_iter=10)\n",
" kmeans.fit(X)\n",
" score = -1 * kmeans.score(X)\n",
" scores.append(score)\n",
" print('k=%d score=%g' % (num_clusters, score))\n",
" \n",
"plt.figure()\n",
"plt.plot(num_cluster_options, scores, 'bo-')\n",
"plt.xlabel('num clusters')\n",
"plt.ylabel('error')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"** How does error vary by initalization? **"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"score=3758.22\n",
"score=3747.06\n",
"score=3741.92\n",
"score=3749.15\n",
"score=3741.44\n",
"score=3752.9\n",
"score=3739.67\n",
"score=3748.15\n",
"score=3744.9\n",
"score=3741.4\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAZgAAAEKCAYAAAAvlUMdAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAAIABJREFUeJzt3XmcV2Xd//HXG8QFzBUSEwFN1EoDayDNNMUl01K8JQXJLQs119Lb5cZb01+S3i5ZahoPLbUml0AWwQ0VDVPRkUDDBVcQcUFcEBFw4PP74zoTX4YZGGDOnPnOvJ+Px/cx33POdc75nHnofLiWc12KCMzMzBpbm6IDMDOzlskJxszMcuEEY2ZmuXCCMTOzXDjBmJlZLpxgzMwsF04wZmaWCycYMzPLhROMmZnlYp2iAyhSx44do3v37kWHYWZWVp555pn3I6LTqsq16gTTvXt3qqqqig7DzKysSJrRkHJuIjMzs1w4wZiZWS6cYMzMLBdOMGZmlgsnGDMzy4UTjJlZK1JZCd27Q5s26WdlZX73atXDlM3MWpPKShg8GBYsSNszZqRtgEGDGv9+rsGYmbUSQ4YsSy41FixI+/PgBGNm1krMnLl6+9eWE4yZWSux9dZ17+/aNZ/7OcGYmbUSe+yx4r727eGSS/K5nxOMmVkr8M47MGYMfP3rqcYiQbduMGxYPh38kOMoMknrA/8A1svuMzwiLpQ0EfhCVuyLwFMR0U/SXsBo4PXs2F0RcbGkHYA7Si69LXBBRFxd6351nt/4T2ZmVn7OPhsWLYLhw6FHj6a5Z57DlBcBfSNivqR2wGOS7o2I/1TSJI0gJYUaEyPiB6UXiYiXgF5Z+bbAW8DIeu65wvlmZq3dY4/BX/4C553XdMkFcmwii2R+ttku+0TNcUkbAX2BUatx2X2AVyOiQVNFm5m1dtXVcMop0KVLfsOR65NrH4yktpKmAO8B4yNiUsnhfsBDETGvZN9ukqZKulfS1+q45ADgtpXcclXnm5m1Kn/8I0ydClddBR06NO29c00wEbEkInoBXYA+knYqOTyQ5ZPFZKBbRPQErqFWzUbSusDBwN/rud1Kzy+5zmBJVZKq5syZsyaPZWZWFubMgfPPh332gf79m/7+TTKKLCI+AiYABwBI6gj0AcaVlJlX06QWEfcA7bJyNb4PTI6Id+u5x6rOryk3LCIqIqKiU6dVrvhpZla2zjsP5s+Ha65Jo8aaWm4JRlInSZtk3zcA9gNezA73B8ZGxMKS8p2l9CuQ1CeLbW7JJWvXeGrfb1Xnm5m1GpMmwU03wRlnwFe+UkwMeY4i2xK4JRv51Qa4MyLGZscGAJfWKt8fOElSNfAZMCAiAkBSB1KCOqH0BEknAkTEDSs738ysNVmyBE4+GbbcEi64oLg41Jr/BldUVERVVVXRYZiZNaphw+CEE9LsyUce2fjXl/RMRFSsqpzf5Dcza0Hmzk19L3vuCQMHFhuLE4yZWQty/vnw8cdw7bXFdOyXcoIxM2shJk9O772ccgrsvHPR0TjBmJm1CEuXpo79Tp3gV78qOprESyabmbUAt94KTz4JN98Mm2xSdDSJazBmZmXuo4/SbMm77QZHHVV0NMu4BmNmVuYuvBDefx/uvx/aNKNqQzMKxczMVtezz6YRYyeeCLvsUnQ0y3OCMTMrUxFpxNimm8Kvf110NCtyE5mZWZn6299g4sT05v5mmxUdzYpcgzEzK0Pz5sFZZ0Hv3nD88UVHUzfXYMzMytDFF8O778Lo0c2rY79UMw3LzMzq8/zz8LvfpZpLnz5FR1M/JxgzszISAaeeChtuCEOHFh3NyrmJzMysjAwfDg8/DNddl6aFac5cgzEzKxPz58Mvfwm9eqX1Xpo712DMzMrE0KEwaxbcfju0bVt0NKvmGoyZWRmYPh2uuAKOPhp2373oaBomtwQjaX1JT0maKmmapIuy/RMlTck+syWNyvbvJenjkmMXlFzrDUnPZfvrXONYye8lvSLpWUnfyOvZzMyaUgScdhpssAFcdlnR0TRcnk1ki4C+ETFfUjvgMUn3RsQeNQUkjQBGl5wzMSJ+UM/19o6I91dyv+8DPbLPt4Drs59mZmVt9Og0keVvfwudOxcdTcPlVoOJZH622S77RM1xSRsBfYFRjXTLQ4Bbs/s+CWwiactGuraZWSE++wzOOAN22inNO1ZOcu2DkdRW0hTgPWB8REwqOdwPeCgi5pXs2y1rUrtX0tdK9gfwgKRnJA2u53ZbAW+WbM/K9pmZla1LL4UZM9KMyeuU2bCsXMONiCVAL0mbACMl7RQR/84ODwRuLCk+GeiWNakdSKrZ9MiOfSci3pL0RWC8pBcj4h9rElOWoAYDdO3adU0uYWbWJF57LfW5DBwI3/1u0dGsviYZRRYRHwETgAMAJHUE+gDjSsrMq2lSi4h7gHZZOSLirezne8DI7Nza3gK2Ltnuku2rHcuwiKiIiIpOzf0tJTNr1c44A9q1g8svLzqSNZPnKLJOWc0FSRsA+wEvZof7A2MjYmFJ+c6SlH3vk8U2V1IHSV/I9ncA9gf+zYrGAEdno8l2BT6OiLdzejwzs1yNGwd33w0XXABblWljf55NZFsCt0hqS0oWd0bE2OzYAODSWuX7AydJqgY+AwZEREjagtS8VhPv3yLiPgBJJwJExA3APcCBwCvAAuC4HJ/NzCw3CxfC6afDjjumn+UqtwQTEc8CdS7gGRF71bHvWuDaOva/BvSs5zo3lHwP4OQ1DNfMrNm48kp49VV44AFYd92io1lzfpPfzKwZmTEDLrkEDjsM9tuv6GjWjhOMmVkzcuaZ6edVVxUbR2NwgjEzaybGj4cRI2DIEGgJb1E4wZiZNQOLF6eFxLbbDs46q+hoGkeZvRdqZtYyXX01vPRSGp683npFR9M4XIMxMyvYW2/BxRfDwQfDgQcWHU3jcYIxMyvYWWdBdXWaLbklcYIxMyvQI4+kFSrPPRe23bboaBqXE4yZWUE+/zxNwd+9O5xzTtHRND538puZFeS662DaNBg1Kq1W2dK4BmNmVoB33oELL4QDDkid+y2RE4yZWQHOPjtNavn730Oay7flcYIxM2tijz0Gf/lLmhamR49Vly9XTjBmZk2oujp17HfpkqaEacmcYMzMmkBlZRot1q4dTJ0Khx4KHToUHVW+nGDMzHJWWQmDB6ep+GvcdFPa35I5wZiZ5WzIEFiwYPl9Cxa4iczMzNbSzJmrt7+lyC3BSFpf0lOSpkqaJumibP9ESVOyz2xJo7L9e0n6uOTYBdn+rSVNkPR8dp06V6iu73wzs6J17Fj3/paw5svK5Pkm/yKgb0TMl9QOeEzSvRGxR00BSSOA0SXnTIyIH9S6TjVwZkRMlvQF4BlJ4yPi+TruWdf5ZmaFmTIFPv4Y2rSBpUuX7W/fPi2N3JLlVoOJZH622S77RM1xSRsBfYFRq7jO2xExOfv+CfACsFUuQZuZNaLZs+GHP4QvfjG9UNmtW3qpsls3GDYMBg0qOsJ85ToXmaS2wDPAdsB1ETGp5HA/4KGImFeybzdJU4HZwFkRMa3W9boDuwCl1ym10vOzawwGBgN0ben1UzMrzKefpilgPvwQ/vlP6NkTTj656KiaVq6d/BGxJCJ6AV2APpJ2Kjk8ELitZHsy0C0iegLXUKtmI2lDYARwRq2k1KDzS2IaFhEVEVHRqVOnNX00M7N6LV0KRx8N//pXmoq/Z8+iIypGk4wii4iPgAnAAQCSOgJ9gHElZebVNKlFxD1Au6wcWR/OCKAyIu6q5x71nm9m1pT+53/grrvgyivhB624VzjPUWSdJG2Sfd8A2A94MTvcHxgbEQtLyneW0pRvkvpksc3N9t0EvBARV63kfnWe3/hPZmZWvz//GS67DE48EU6vc8xr65FnH8yWwC1ZP0wb4M6IGJsdGwBcWqt8f+AkSdXAZ8CAiAhJ3wGOAp6TNCUr+z8RcY+kEwEi4ob6zs/x+czMlvPII+mN/f32a9mzJDeUWvPf4IqKiqiqqio6DDNrAaZPh113hc6d4fHHYZNNio4oP5KeiYiKVZXzm/xmZmtp7lw46CBo2xbGjm3ZyWV1eMlkM7O1sHgxHHYYvPkmPPwwbLtt0RE1H04wZmZrKAJOOAEefTTNjPztbxcdUfPiJjIzszV02WVw881w4YVw5JFFR9P8OMGYma2B4cPhvPNg4MCUYGxFTjBmZqvp6afhqKNgt93gT3/ycOT6OMGYma2GmTPTHGOdO8OoUbD++kVH1Hy5k9/MrIE++STNjrxgATz4YJol2ernBGNm1gBLlqT+lmnT4J574GtfKzqi5s8JxsysAc46C8aNgz/8Afbfv+hoyoP7YMzMVuH66+Hqq9PklSedVHQ05cMJxsxsJe6/H049NU0Fc+WVRUdTXpxgzMzqMW0aHH546m+57bY015g1nBOMmVkd3nsvLRbWvj3cfTd84QtFR1R+3MlvZlbLwoXQrx+8+26aZ6xr16IjKk9OMGZmJSLgJz+BJ56Av/8devcuOqLy5SYyM7MSF12U+luGDoX+/YuOprzllmAkrS/pKUlTJU2TdFG2f6KkKdlntqRR2f69JH1ccuyCkmsdIOklSa9IOree+60n6Y6szCRJ3fN6NjNrmf72t5RgjjkGzq3zL42tjjybyBYBfSNivqR2wGOS7o2IPWoKSBoBjC45Z2JE/KD0IpLaAtcB+wGzgKcljYmI52vd73jgw4jYTtIA4DLgiMZ/LDNriR5/HI47DvbcE4YN8wSWjSG3Gkwk87PNdtknao5L2gjoC4xaxaX6AK9ExGsRsRi4HTikjnKHALdk34cD+0j+T8TMVu3111OnfteucNddsO66RUfUMuTaByOpraQpwHvA+IiYVHK4H/BQRMwr2bdb1qR2r6SamX62At4sKTMr21fbf8pFRDXwMbB5Iz2KmbVQH32UXqKsrk5TwWzuvxqNJtcEExFLIqIX0AXoI2mnksMDgdtKticD3SKiJ3ANq67ZrBFJgyVVSaqaM2dOHrcwszLx+efpRcqXX4YRI2D77YuOqGVZZYLJaiG/WJubRMRHwATggOyaHUlNX+NKysyraVKLiHuAdlm5t4CtSy7XJdtX23/KSVoH2BiYW0cswyKiIiIqOnXqtDaPZWZlLAJOOw3Gj4c//hH23rvoiFqeVSaYiFhCqm2sFkmdJG2Sfd+A1En/Yna4PzA2IhaWlO9c02ciqU8W21zgaaCHpG0krQsMAMbUccsxwDEl1384IqKOcmZm/O53cMMNcPbZ6b0Xa3wNHUX2T0nXAncAn9bsjIjJKzlnS+CWbBRYG+DOiBibHRsAXFqrfH/gJEnVwGfAgCxBVEs6BbgfaAv8KSKmAUi6GKiKiDHATcBfJL0CfJDdw8xsBWPHwi9/CYceCr/5TdHRtFxqyD/yJU2oY3dERN/GD6npVFRURFVVVdFhmFkTmjoVdt8ddtwxTQPToUPREZUfSc9ERMWqyjWoBhMRbp00s7L39ttpAstNN4UxY5xc8tagUWSSNpZ0Vc3oK0lXSto47+DMzNZWZSV07w5t2qSf772XZkf+0peKjqzla+gw5T8BnwCHZ595wJ/zCsrMrDFUVsLgwTBjRho1tnhxekN/2rSiI2sdGtoHMyV7n2Wl+8qN+2DMWratt4ZZs1bc360bvPFGk4fTYjS0D6ahNZjPJH2n5OK7k0Z6mZk1K6+9lpY2/s536k4uADNnNm1MrVVDhymfCNxa0u/yIcveOTEzK0wEPPccjByZPlOnpv09e8LGG8PHH694jhcQaxqrTDCS2gA7RETPbIJKas0fZmbWpJYuhSefTBNTjhyZai0SfPvbqfbSrx9su+2yPpgFC5ad2749XHJJcbG3JqtMMBGxVNLZpBclnVjMrBCLF8OECSmhjB4N77wD7drBPvvAOefAwQdD587LnzNoUPo5ZEhqFuvaNSWXmv2Wr4Y2kT0o6SxWfJP/g1yiMjMDPv0U7rsvJZWxY1NzV4cO8P3vp7fwDzooNYOtzKBBTihFaWiCqVm46+SSfQFs27jhmFlrN3duek9l5Eh44AFYuDBNof9f/5WSyr77wgYbFB2lNURD+2B+HBH/bIJ4zKwVmjULRo1KSeXRR2HJEujSBX72s5RU9tgD1slz/V3LRUP7YK4FdmmCeMyslXjppZRQ7roLnn467dtxx9Sfcuih8M1vetnictfQfxM8JOkw4C5PgW9mDVFZuWLn+g47LBtO/MILqVzv3jB0aEoqO+5YbMzWuBr6Jv8nQHtgCbAQEGk25Y3yDS9ffpPfLB91DQ+W0jsrbdvCnnumhNKvX3rb3spLo86mTFodchCwTURcLKkrab0XM7PlLF0KZ565fHKBlFw23zw1jXnd+9ahoVPFXAfsyrKVLT8Brs0lIjMrOxHwzDPw3/+dZix+9926y33wgZNLa9LQGsy3IuIbkv4FEBEfZssXm1krNm0a3H57+rzySnrx8XvfS7WXuXNXLO8pWlqXhtZgPs+WPg4ASZ2ApblFZWbN1ssvw69/DTvtlD5Dh6Zay403prfr7747rXffvv3y53mKltanoQnm98BI4IuSLgEeA4au7ARJ60t6StJUSdMkXZTtnyhpSvaZLWlUrfN6S6qW1D/b3ruk/BRJCyX1q+N+x0qaU1Lupw18NjNbhZkz4YoroKICtt8e/vd/YbPN4NprYfZsGD8ejj8+7YP05vywYWlafCn9HDbMb9S3Ng0aRQYgaUdgH9IIsoci4oVVlBfQISLmS2pHSkqnR8STJWVGAKMj4tZsuy0wnjRS7U8RMbzWNTcDXgG6RMSCWseOBSoi4pQGPRAeRWa2Mu+8A8OHp+avf2avWffuDQMGwI9+5NFfrVljjyIjIl4EXlyN8gHMzzbbZZ//ZLNsZua+wHElp50KjAB613PZ/sC9tZOLmTWOuXPTi4+33w6PPJJGhO28c2raOuII+PKXi47Qykmuky9kNZJngO2A6yJiUsnhfqSa0Lys7FbAocDe1J9gBgBXreSWh0naE5gO/CIi3qwjpsHAYICu7nE0Y968NDvx7benub+qq6FHDzj//JRUvvrVoiO0cpVrgomIJUAvSZsAIyXtFBH/zg4PBG4sKX41cE42Nc0K15K0JbAzcH89t7sbuC0iFkk6AbiFVEOqHdMwYBikJrI1ezKz8rZgAYwbl5LKuHGwaFEa4fXLX6YmsF69PE2Lrb2GdvKvlYj4CJgAHAAgqSPQBxhXUqwCuF3SG6SmsD/U6sw/HBgZEZ/Xc4+5EbEo27wR+GajPoRZGamsTCO72rRJPysrUxIZMwaOPBK++EU4/HB44gk48UR4/PG0Rv1ll8Euuzi5WOPIrQaTDWX+PCI+krQBsB9wWXa4PzA2IhbWlI+IbUrOvTk7XjrCbCBw3krut2VEvJ1tHgysdBCCWUtVe5qWGTPgmGPSzMSffZZedPzxj1NNZY890tQtZnnIs4lsS+CWrB+mDWlFzLHZsQHApQ29kKTuwNbAo7X2XwxURcQY4DRJBwPVwAfAsWsZv1lZGjJkxWlalixJtZl7700rQLZrV0xs1ro0eJhyS+RhytYStWmTpm6pTUqjwszWVkOHKTdJH4yZ5e/991O/Sn3/ZvSgSWtqTjBmLcDo0fC1r6VVIQ8/3NO0WPPgBGNWxj76KHXg9+sHX/pSmtH4jjs8TYs1D17l2qxMPfAA/OQnaUqXCy5InfvrZnOcDxrkhGLFcw3GrMzMnw8nnZSmxd9oI3jySbjoomXJxay5cIIxKyOPPgpf/zr88Y9w1lkweXKa4disOXKCMSsDn30Gv/gF7L13Gob8j3/A5ZfD+usXHZlZ/dwHY9bMTZqUOvJfeglOPjlN59KhQ9FRma2aazBmzdSiRanj/tvfTm/mP/hgWuDLycXKhWswZs3Q1Klw9NHw7LNppNhVV8HGGxcdldnqcQ3GrBmprk7r3ffuDe+9l9a3v+kmJxcrT67BmDUTL7yQ+lqefhoGDoRrrkkzH5uVK9dgzAq2ZAlceWVah+W11+DOO+Fvf3NysfLnGoxZgV59FY49Fh57DA45JL3fssUWRUdl1jhcgzErQARcf316afK55+DWW2HkSCcXa1lcgzFrYjNnwvHHp2HH+++fOvG7dCk6KrPG5xqMWROJgJtvhp13hieegBtugPvuc3Kxliu3BCNpfUlPSZoqaZqki7L9EyVNyT6zJY2qdV5vSdWS+pfsW1Jyzph67reepDskvSJpUrbMslmz8M47qY/luOOgV6/0fssJJ6Tp9M1aqjybyBYBfSNivqR2wGOS7o2IPWoKSBoBjC7ZbgtcBjxQ61qfRUSvVdzveODDiNhO0oDsOkc0xoOYrY077oCf/zy9jf/b38Jpp6X5xMxautz+M49kfrbZLvv8ZzFXSRsBfYHSGsypwAjgvTW45SHALdn34cA+kv99aE2rshK6d08JZOutYdddYcAA6NEDpkyBM85wcrHWI9f/1CW1lTSFlDDGR8SkksP9gIciYl5WdivgUOD6Oi61vqQqSU9K6lfP7bYC3gSIiGrgY8BvEliTqayEwYNhxozU3zJrVpqo8kc/SsOQd9ih6AjNmlauCSYilmRNW12APpJ2Kjk8ELitZPtq4JyIWFrHpbpFRAVwJHC1pC+vaUySBmfJqmrOnDlrehmzFQwZkprBanvqKVjH4zWtFWqSynpEfARMAA4AkNQR6AOMKylWAdwu6Q2gP/CHmtpKRLyV/XwNeATYpY7bvAVsnV1/HWBjYG4dsQyLiIqIqOjUqVNjPJ4ZkIYfr85+s5Yuz1FknSRtkn3fANgPeDE73B8YGxELa8pHxDYR0T0iupP6UH4eEaMkbSppvew6HYHdgefruOUY4JiS6z8cEVFHObNG9+GH0K5d3ce6dm3aWMyaizwr7lsCt2Qjw9oAd0bE2OzYAODSBl7nK8AfJS3NrnNpRDwPIOlioCoixgA3AX+R9ArwQXYPs9y9+y5873tpTrH11kvruNRo3x4uuaS42MyKpNb8j/yKioqoqqoqOgwrYzNnwn77pQ79UaPSFPtDhqT9Xbum5DJoUNFRmjUuSc9k/eIr5a5HszU0fTrsuy/MmwcPPAC77572O6GYJU4wZmvg2WdTzSUCJkxIU+2b2fL8ypfZanriCfjud2HddeEf/3ByMauPE4zZanjwwVRz6dgxvTy5445FR2TWfDnBmDXQ6NFw0EGw7bYwcSJ061Z0RGbNmxOMWQP89a9w2GGpOeyRR6Bz56IjMmv+nGDMVuEPf4Cjjkr9LuPHw2abFR2RWXlwgjFbiUsvhZNPhh/+EMaNgy98oeiIzMqHE4xZHSLg3HPhvPPgyCNhxAhYf/2iozIrL34PxqyWpUvhlFPg+uvhxBPhuuu8hovZmvD/NmYlPv8cjj46JZezz079L04uZmvGNRizzMKFcMQRMGYMDB2amsfMbM05wZgB8+fDIYfAww/Dtdemjn0zWztOMNbqffABHHggVFXBrbemIclmtvacYKxVe+cd2H9/eOklGD4c+vUrOiKzlsMJxlqtGTPSdPuzZ6d3XPbdt+iIzFoWJxhrlV56KU1a+cknaQLL3XYrOiKzlscJxlqdKVNSs5iU5hXr2bPoiMxaptxG+EtaX9JTkqZKmibpomz/RElTss9sSaNqnddbUrWk/tl2L0lPZNd4VtIR9dzvWElzSq7907yezcrX44/DXnult/InTnRyMctTnjWYRUDfiJgvqR3wmKR7I2KPmgKSRgCjS7bbApcBD5RcZwFwdES8LOlLwDOS7o+Ij+q45x0RcUouT2Nl78EH01DkrbZK37t2LTois5YttxpMJPOzzXbZJ2qOS9oI6AuU1mBOBUYA75VcZ3pEvJx9n50d65RX3NYyjRyZ1nLZbrtUc3FyMctfrpNgSGoraQopKYyPiEklh/sBD0XEvKzsVsChwPUruV4fYF3g1XqKHJY1ow2XtHWjPISVvVtvhR/9CL7xjdTnssUWRUdk1jrkmmAiYklE9AK6AH0k7VRyeCBwW8n21cA5EbG0rmtJ2hL4C3BcPWXuBrpHxNeB8cAt9VxnsKQqSVVz5sxZ/YeysnLddXDMManfZfx42HTToiMyaz0UEasu1Rg3ki4AFkTEFZI6Ai8BW0XEwuz464Cy4h1JfS+DI2JU1pz2CDA0IoY34F5tgQ8iYuOVlauoqIiqqqo1fiZrviLgN7+BIUNSv8vtt3u6fbPGIumZiKhYVbk8R5F1krRJ9n0DYD/gxexwf2BsTXIBiIhtIqJ7RHQHhgM/z5LLusBI4NaVJZeshlPjYOCFRn0gKxs1a7kMGQI//jH8/e9OLmZFyHMU2ZbALVltog1wZ0SMzY4NAC5t4HUOB/YENpd0bLbv2IiYIulioCoixgCnSToYqAY+AI6t62LWMlVWpoQycyZ06JAmrzzppDRxpafbNytGkzWRNUduImsZKith8GBYsGDZvnXWgZtvhkGDCgvLrMUqvInMLE9LlsDrr8N998Gppy6fXACqq1ONxsyK46lirFn74AOYPj3NHVbzmT4dXn4ZFi1a+bkzZzZNjGZWNycYK9zixfDqq8uSR2kyef/9ZeXWWQe+/GXYYQc44ID0c4cd4MgjYdasFa/rlynNiuUEU6ZKO7W7doVLLimmv6GhcUTA22/XXRt5/fXU5FVjiy1S4jj00PRz++3Tz222gXbtVrz2pZeu2AfTvn2KxcyK4wRThmp3as+YkbahaZNMXXH87GfwxhupplGaTKZPT1Pj19hgA+jRA3bZBQYMWFYb2X572Hilby+tqOaZm0PCNbNlPIqsDEeRde+e/pjX1qkT3Hhjqg2UfqqrV9y3qk9DzrnzTvj00/rjlNIf+9JaSM2nSxcPHzYrVw0dReYaTBmqr/N6zpz01vrakqBt21V/6ksuUlpzpUePVFMxs9bJCaaMRMBdd6U/7tXVKx7v3BnGjk2d4Q1JEPV9GlqzqK8m1bUrfP3ra/WoZtYCOMGUiYkT4eyz4ckn03om77+//DDd9u3hiivgm99supguucSd62ZWP7eCN3MvvJCavfbcMzWN3XRTqjXcdBN065aao7p1g2HDmr5Te9CgdN+i4zCz5smd/M20k3/2bPj2OcGUAAAIbUlEQVTVr1Ii2XDDNHnj6aenGoKZWZHcyV+m5s2Dyy+Hq66Czz9P06Ccfz507Fh0ZGZmq8cJpplYvDg1L118cRoNNmBA6svYdtuiIzMzWzPugylYRFqv5KtfTbWVnXaCp5+G225zcjGz8uYEU6BHH4Vdd4XDD0/vi9xzDzz0EFSssmXTzKz5c4IpwLRp8MMfpnXiZ8+GP/85vZj4/e+n0VhmZi2BE0wTeust+OlP00uIEyemSRqnT4djj00vOJqZtSTu5G8CH38M//d/8Nvfpjm8Tj89Tcy4+eZFR2Zmlp/cajCS1pf0lKSpkqZJuijbP1HSlOwzW9KoWuf1llQtqX/JvmMkvZx9jqnnfptJGp+VGS9p07yeraEWL4bf/z7NLDx0aJp+/sUX0xBkJxcza+nybCJbBPSNiJ5AL+AASbtGxB4R0SsiegFPAHfVnCCpLXAZ8EDJvs2AC4FvAX2AC+tJHucCD0VED+ChbLsQS5fCHXfAV76Saiu9ekFVVZrefpttiorKzKxp5ZZgIpmfbbbLPv+ZNkDSRkBfoLQGcyowAnivZN/3gPER8UFEfAiMBw6o45aHALdk328B+jXGc6yuCRPgW99K77FsuGFaM378+KadI8zMrDnItZNfUltJU0gJY3xETCo53I9U45iXld0KOBS4vtZltgLeLNmele2rbYuIeDv7/g6wRT0xDZZUJalqzpw5q/1M9XnuOTjoIOjbF959F265BSZPhu99zyPDzKx1yjXBRMSSrCmsC9BH0k4lhwcCt5VsXw2cExFLG+G+QUltqdaxYRFREREVnTp1Wu1rV1amaerbtEk/r7kGfvIT6NkTHn88TfMyfTocfbRHhplZ69Yko8gi4iNJE0hNW/+W1JHUn3JoSbEK4Half+53BA6UVA28BexVUq4L8Egdt3lX0pYR8bakLVm+ma1R1LVE8GmnpURy5plw3nmw2WaNfVczs/KU5yiyTpI2yb5vAOwHvJgd7g+MjYiFNeUjYpuI6B4R3YHhwM8jYhRwP7C/pE2zzv39s321jQFqRpgdA4xu7GcaMmT5tU9qdO6cai5OLmZmy+TZRLYlMEHSs8DTpD6YsdmxASzfPFaviPgA+H/ZNZ4GLs72IelGSTUTq1wK7CfpZWDfbLtR1bdU8ezZjX0nM7Py5/VgVmM9mPqWCO7WDd54o9HCMjNr1hq6HoynilkNl1yy4oJfXiLYzKxuTjCrwUsEm5k1nOciW02DBjmhmJk1hGswZmaWCycYMzPLhROMmZnlwgnGzMxy4QRjZma5aNUvWkqaA9Tx6mSDdATeb8Rwyp1/H8vz72MZ/y6W1xJ+H90iYpWzBbfqBLM2JFU15E3W1sK/j+X597GMfxfLa02/DzeRmZlZLpxgzMwsF04wa25Y0QE0M/59LM+/j2X8u1heq/l9uA/GzMxy4RqMmZnlwglmDUg6QNJLkl6RdG7R8RRJ0taSJkh6XtI0SacXHVPRJLWV9C9JY1ddumWTtImk4ZJelPSCpN2Kjqkokn6R/T/yb0m3SVq/6Jjy5gSzmiS1Ba4Dvg98FRgo6avFRlWoauDMiPgqsCtwciv/fQCcDrxQdBDNxO+A+yJiR6AnrfT3Imkr4DSgIiJ2AtqSVvZt0ZxgVl8f4JWIeC0iFgO3A4cUHFNhIuLtiJicff+E9Adkq2KjKo6kLsBBwI1Fx1I0SRsDewI3AUTE4oj4qNioCrUOsIGkdYD2QItfbN0JZvVtBbxZsj2LVvwHtZSk7sAuwKRiIynU1cDZwNKiA2kGtgHmAH/OmgxvlNSh6KCKEBFvAVcAM4G3gY8j4oFio8qfE4w1CkkbAiOAMyJiXtHxFEHSD4D3IuKZomNpJtYBvgFcHxG7AJ8CrbLPUtKmpJaObYAvAR0k/bjYqPLnBLP63gK2Ltnuku1rtSS1IyWXyoi4q+h4CrQ7cLCkN0hNp30l/bXYkAo1C5gVETU12uGkhNMa7Qu8HhFzIuJz4C7g2wXHlDsnmNX3NNBD0jaS1iV11I0pOKbCSBKpjf2FiLiq6HiKFBHnRUSXiOhO+u/i4Yho8f9KrU9EvAO8KWmHbNc+wPMFhlSkmcCuktpn/8/sQysY8LBO0QGUm4iolnQKcD9pJMifImJawWEVaXfgKOA5SVOyff8TEfcUGJM1H6cCldk/xl4Djis4nkJExCRJw4HJpJGX/6IVvNHvN/nNzCwXbiIzM7NcOMGYmVkunGDMzCwXTjBmZpYLJxgzM8uFE4xZGZP0iKRWsb67lR8nGDMzy4UTjFkjk9RB0jhJU7O1P46QdIGkp7PtYdnb3DU1kN9KqsrWS+kt6S5JL0v6dVame7aeSmVWZrik9nXcd39JT0iaLOnv2fxwZoVxgjFrfAcAsyOiZ7b2x33AtRHRO9veAPhBSfnFEVEB3ACMBk4GdgKOlbR5VmYH4A8R8RVgHvDz0htK6gicD+wbEd8AqoBf5vaEZg3gBGPW+J4D9pN0maQ9IuJjYG9JkyQ9B/QFvlZSfkzJedOyNXYWkaZWqZlY9c2I+Gf2/a/Ad2rdc1fSAnj/zKbsOQbo1uhPZrYaPBeZWSOLiOmSvgEcCPxa0kOkWklFRLwp6VdA6XK5i7KfS0u+12zX/D9ae06n2tsCxkfEwEZ4BLNG4RqMWSOT9CVgQUT8FbicZVPUv5/1i/Rfg8t2LVnP/kjgsVrHnwR2l7RdFkMHSduvwX3MGo1rMGaNb2fgcklLgc+Bk4B+wL+Bd0hLPqyul4CTJf2JNOX99aUHI2KOpGOB2yStl+0+H5i+Rk9g1gg8m7JZM5ctRT02GyBgVjbcRGZmZrlwDcbMzHLhGoyZmeXCCcbMzHLhBGNmZrlwgjEzs1w4wZiZWS6cYMzMLBf/H/D8fTbl0HxsAAAAAElFTkSuQmCC\n",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"scores = []\n",
"for i in range(10):\n",
" kmeans = KMeans(20, n_init=1,\n",
" max_iter=10,\n",
" init='random')\n",
" kmeans.fit(X)\n",
" score = -1 * kmeans.score(X)\n",
" scores.append(score)\n",
" print('score=%g' % (score))\n",
" \n",
" \n",
"plt.figure()\n",
"plt.plot(range(10), sorted(scores), 'bo-')\n",
"plt.xlabel('sample')\n",
"plt.ylabel('error')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have a way to represent words in 20-dimensional space.\n",
"- The distance from each word vector to the means of each of the 20 clusters."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1.00874162 1.0266301 0.97164003 1.03189144 1.03358099 1.04555106\n",
" 0.9921334 1.0001127 0.99631078 0.5906229 0.98516281 1.01236052\n",
" 0.96816683 1.1055978 0.99184346 0.98134482 0.95310019 1.02250443\n",
" 0.95707704 1.00474645]\n",
"[0.94998976 0.99347211 0.92797163 1.00760459 0.87369431 1.02867307\n",
" 0.97299124 0.97489501 0.97819442 0.89745615 0.95924255 0.90359807\n",
" 0.93594526 1.2176145 0.98537882 0.96513233 0.93622821 0.9995155\n",
" 0.85784706 0.98290799]\n",
"[1.01027941 1.0360027 0.94782441 1.04189133 1.02496428 1.05721972\n",
" 1.00490278 1.00783092 1.00071431 0.71358098 0.99042856 1.02849363\n",
" 0.97045892 1.14231512 1.01697286 0.98572949 0.96373614 1.03701544\n",
" 0.95307083 1.00673499]\n",
"[1.0771791 1.0382791 1.01844359 1.04147931 1.09082894 1.06598877\n",
" 1.00127222 1.01626417 1.02439718 1.09741815 0.99488921 1.00199894\n",
" 0.99562545 1.28351384 1.04215818 0.98982216 1.04985219 1.04174968\n",
" 1.01364399 1.01655176]\n"
]
}
],
"source": [
"def get_distances(word, contexts, distances):\n",
" wd_idx = list(contexts.keys()).index(word)\n",
" return distances[wd_idx]\n",
"\n",
"print(get_distances('love', contexts, distances))\n",
"print(get_distances('like', contexts, distances))\n",
"print(get_distances('hate', contexts, distances))\n",
"print(get_distances('pizza', contexts, distances))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use these vectors to compute how similar two words are."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.9958552518939588\n"
]
}
],
"source": [
"from math import sqrt\n",
"def sim(v1, v2):\n",
" \"\"\" cosine similarity of two vectors. \"\"\"\n",
" return np.dot(v1, v2) / (sqrt(np.dot(v1, v1)) * sqrt(np.dot(v2,v2)))\n",
" \n",
"# FIXME: sqrt norm\n",
"print(sim(get_distances('love', contexts, distances),\n",
" get_distances('like', contexts, distances)))"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.9995853218896937\n"
]
}
],
"source": [
"print(sim(get_distances('love', contexts, distances),\n",
" get_distances('hate', contexts, distances)))"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.993983274406897\n"
]
}
],
"source": [
"print(sim(get_distances('love', contexts, distances),\n",
" get_distances('pizza', contexts, distances)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, `love` is more similar to `like` than to `pizza`.\n",
"\n",
"\n",
"
\n",
"**However**, this approach treats each word the same when computing similarity. \n",
"\n",
"- Presumably, some context words are more important than others (e.g., `the` versus `hippopotamus`). \n",
"- `tf-idf` captures this to some extent.\n",
"- Can we use machine learning to weight features based on how predictive they are?\n",
" - But what is the classification task?\n",
" \n",
"- See `Word2Vec.ipynb`\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"code_folding": [
0
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#\n",
"from IPython.core.display import HTML\n",
"HTML(open('../custom.css').read())"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 1
}