{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# CS579: Lecture 12 \n", "\n", "** Demographic Inference I**\n", "\n", "*[Dr. Aron Culotta](http://cs.iit.edu/~culotta)* \n", "*[Illinois Institute of Technology](http://iit.edu)*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Midterm review \n", "\n", "~5 True/False, ~6 short answer\n", " \n", "**Topics:** \n", "- Twitter API \n", " - what comes in a tweet? \n", " - how do rate limits work? \n", " - can you understand API documentation? \n", "- graph basics: \n", " - directed/undirected \n", " - path \n", " - cycle \n", " - connected \n", " - connected component \n", " - degree (distribution) \n", " - diameter \n", " - average path length \n", " - clustering coefficient \n", "- modeling networks \n", " - random graphs \n", " - regular graphs \n", " - rewired graphs \n", " - what makes a small world? \n", "- community detection \n", " - girvan-newman (betweenness) \n", " - graph cuts \n", " - representing graphs with matrices \n", " - graph laplacian \n", "- link prediction \n", " - shortest path \n", " - common neighbors \n", " - jaccard \n", " - preferential attachment \n", " - sim rank \n", " - evaluation \n", "- information cascades \n", " - urn experiment \n", " - bayes' theorem for decision making \n", " - game-theoretic model \n", " - maximizing payoff \n", "\t- cluster density \n", "- sentiment analysis \n", " - lexicon approach \n", " - machine learning \n", " \n", "**Question types:** \n", "- What does this algorithm output? \n", " - E.g., what is jaccard score for a specific link? \n", " - E.g., what is the next step in girvan-newman? \n", "- What does this code do? \n", " - E.g., I give you a new graph-generating algorithm, tell me what it produces \n", "- Write a new algorithm \n", " - E.g., provide pseudo-code the linear-threshold cascade model \n", "- True/False \n", " - E.g., small world graphs have higher clustering coefficients than random graphs." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**dem·o·graph·ics**\n", "\n", "statistical data relating to the population and particular groups within it.\n", "\n", "E.g., age, ethnicity, gender, income, ..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Why Demographics?\n", "\n", "- Marketing\n", " - Who are my customers?\n", " - Who are my competitors' customers?\n", " - E.g., [DemographicsPro](http://www.demographicspro.com/samples#c=%40FamilyGuyonFOX)\n", " \n", "- Social Media as Surveys\n", " - E.g., 45% of tweets express positive sentiment toward Pres. Obama\n", " - Who wrote those tweets?\n", " \n", "- Health\n", " - 2% of Facebook users are expressing flu-like symptoms\n", " - Are they representative of the full population?\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "** User profiles vary from site to site. **" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Approaches\n", "\n", "- Clever use of external data\n", " - E.g., U.S. Census name lists for gender\n", "- Look for keywords in profile\n", " - \"African American Male\"\n", " - \"Happy 21st birthday to me\"\n", "- Machine Learning" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 tweets\n", "200 tweets\n", "300 tweets\n", "400 tweets\n", "500 tweets\n", "600 tweets\n", "700 tweets\n", "800 tweets\n", "900 tweets\n", "1000 tweets\n", "fetched 1000 tweets\n" ] } ], "source": [ "# Guessing gender\n", "import configparser\n", "import sys\n", "from TwitterAPI import TwitterAPI\n", "\n", "def get_twitter(config_file):\n", " \"\"\" Read the config_file and construct an instance of TwitterAPI.\n", " Args:\n", " config_file ... A config file in ConfigParser format with Twitter credentials\n", " Returns:\n", " An instance of TwitterAPI.\n", " \"\"\"\n", " config = configparser.ConfigParser()\n", " config.read(config_file)\n", " twitter = TwitterAPI(\n", " config.get('twitter', 'consumer_key'),\n", " config.get('twitter', 'consumer_secret'),\n", " config.get('twitter', 'access_token'),\n", " config.get('twitter', 'access_token_secret'))\n", " return twitter\n", "\n", "twitter = get_twitter('twitter.cfg')\n", "tweets = []\n", "n_tweets=1000\n", "for r in twitter.request('statuses/filter', {'track': 'i'}):\n", " tweets.append(r)\n", " if len(tweets) % 100 == 0:\n", " print('%d tweets' % len(tweets))\n", " if len(tweets) >= n_tweets:\n", " break\n", "print('fetched %d tweets' % len(tweets))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[{'limit': {'timestamp_ms': '1475689818824', 'track': 12}},\n", " {'limit': {'timestamp_ms': '1475689818840', 'track': 9}},\n", " {'limit': {'timestamp_ms': '1475689819793', 'track': 89}},\n", " {'limit': {'timestamp_ms': '1475689819825', 'track': 91}},\n", " {'limit': {'timestamp_ms': '1475689819846', 'track': 73}},\n", " {'limit': {'timestamp_ms': '1475689819857', 'track': 99}}]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# not all tweets are returned\n", "# https://dev.twitter.com/streaming/overview/messages-types#limit_notices\n", "[t for t in tweets if 'user' not in t][:6]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fetched 927 tweets\n" ] } ], "source": [ "# restrict to actual tweets\n", "# (remove \"deleted\" tweets)\n", "tweets = [t for t in tweets if 'user' in t]\n", "print('fetched %d tweets' % len(tweets))" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "['Jeffrey Salas',\n", " 'Iberia TV',\n", " 'Chelllly',\n", " 'selene',\n", " 'Hallie Earnhart',\n", " 'COSTLIFEDJ💯🐐',\n", " 'Aldous Snow',\n", " '🔮🔮🔮',\n", " 'İÇERDE',\n", " 'Madi']" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print 10 names.\n", "names = [t['user']['name'] for t in tweets]\n", "names[:10]" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "males:\n", "['JAMES 3.318 3.318 1',\n", " 'JOHN 3.271 6.589 2',\n", " 'ROBERT 3.143 9.732 3',\n", " 'MICHAEL 2.629 12.361 4',\n", " 'WILLIAM 2.451 14.812 5',\n", " 'DAVID 2.363 17.176 6',\n", " 'RICHARD 1.703 18.878 7',\n", " 'CHARLES 1.523 20.401 8',\n", " 'JOSEPH 1.404 21.805 9',\n", " 'THOMAS 1.380 23.185 10']\n", "females:\n", "['MARY 2.629 2.629 1',\n", " 'PATRICIA 1.073 3.702 2',\n", " 'LINDA 1.035 4.736 3',\n", " 'BARBARA 0.980 5.716 4',\n", " 'ELIZABETH 0.937 6.653 5',\n", " 'JENNIFER 0.932 7.586 6',\n", " 'MARIA 0.828 8.414 7',\n", " 'SUSAN 0.794 9.209 8',\n", " 'MARGARET 0.768 9.976 9',\n", " 'DOROTHY 0.727 10.703 10']\n" ] } ], "source": [ "# Fetch census name data from:\n", "# http://www2.census.gov/topics/genealogy/1990surnames/\n", "import requests\n", "from pprint import pprint\n", "males_url = 'http://www2.census.gov/topics/genealogy/' + \\\n", " '1990surnames/dist.male.first'\n", "females_url = 'http://www2.census.gov/topics/genealogy/' + \\\n", " '1990surnames/dist.female.first'\n", "males = requests.get(males_url).text.split('\\n')\n", "females = requests.get(females_url).text.split('\\n')\n", "print('males:')\n", "pprint(males[:10])\n", "print('females:')\n", "pprint(females[:10])" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1219 male and 4275 female names\n", "males:\n", "courtney\n", "lupe\n", "christian\n", "alfredo\n", "jonah\n", "rolando\n", "samuel\n", "luciano\n", "cesar\n", "vaughn\n", "\n", "females:\n", "lannie\n", "lyndsay\n", "lupe\n", "lolita\n", "shayna\n", "elnora\n", "laurene\n", "loree\n", "charisse\n", "gertrud\n" ] } ], "source": [ "# Get names. \n", "male_names = set([m.split()[0].lower() for m in males if m])\n", "female_names = set([f.split()[0].lower() for f in females if f])\n", "print('%d male and %d female names' % (len(male_names), len(female_names)))\n", "print('males:\\n' + '\\n'.join(list(male_names)[:10]))\n", "print('\\nfemales:\\n' + '\\n'.join(list(female_names)[:10]))" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Initialize gender of all tweets to unknown.\n", "for t in tweets:\n", " t['gender'] = 'unknown'" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# label a Twitter user's gender by matching name list.\n", "import re\n", "def gender_by_name(tweets, male_names, female_names):\n", " for t in tweets:\n", " name = t['user']['name']\n", " if name:\n", " # remove punctuation.\n", " name_parts = re.findall('\\w+', name.split()[0].lower())\n", " if len(name_parts) > 0:\n", " first = name_parts[0].lower()\n", " if first in male_names:\n", " t['gender'] = 'male'\n", " elif first in female_names:\n", " t['gender'] = 'female'\n", " else:\n", " t['gender'] = 'unknown'\n", "\n", "gender_by_name(tweets, male_names, female_names)\n", "# What's wrong with this approach?" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.35 of accounts are labeled with gender\n", "gender counts:\n", " Counter({'unknown': 598, 'female': 178, 'male': 151})\n", "male Jeffrey Salas\n", "unknown Iberia TV\n", "unknown Chelllly\n", "female selene\n", "female Hallie Earnhart\n", "unknown COSTLIFEDJ💯🐐\n", "unknown Aldous Snow\n", "unknown 🔮🔮🔮\n", "unknown İÇERDE\n", "unknown Madi\n", "female Fiona Von\n", "unknown spooky cait\n", "unknown -HeartlessLover💋\n", "unknown #NicomaineDeiHardFan\n", "female princess 🎃\n", "male Daniel Poehlman\n", "unknown savage.\n", "unknown 🐨 Lo 🍕\n", "male Isaiah Persons\n", "female Barrie\n" ] } ], "source": [ "from collections import Counter\n", "\n", "def print_genders(tweets):\n", " counts = Counter([t['gender'] for t in tweets])\n", " print('%.2f of accounts are labeled with gender' % \n", " ((counts['male'] + counts['female']) / sum(counts.values())))\n", " print('gender counts:\\n', counts)\n", " for t in tweets[:20]:\n", " print(t['gender'], t['user']['name'])\n", " \n", "print_genders(tweets)" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "found 331 ambiguous names:\n", "\n", "courtney\n", "lupe\n", "christian\n", "samuel\n", "maurice\n", "ronald\n", "eric\n", "toby\n", "perry\n", "jimmy\n", "shannon\n", "jan\n", "christopher\n", "paris\n", "cory\n", "marshall\n", "lou\n", "merrill\n", "mickey\n", "stephen\n" ] } ], "source": [ "# What about ambiguous names?\n", "def print_ambiguous_names(male_names, female_names):\n", " ambiguous = [n for n in male_names if n in female_names] # names on both lists\n", " print('found %d ambiguous names:\\n'% len(ambiguous))\n", " print('\\n'.join(ambiguous[:20]))\n", " \n", "print_ambiguous_names(male_names, female_names)" ] }, { "cell_type": "code", "execution_count": 103, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "found 0 ambiguous names:\n", "\n", "\n", "1146 male and 4017 female names\n" ] } ], "source": [ "# Keep names that are more frequent in one gender than the other.\n", "def get_percents(name_list):\n", " # parse raw data to extract, e.g., the percent of males names John.\n", " return dict([(n.split()[0].lower(), float(n.split()[1]))\n", " for n in name_list if n])\n", "\n", "males_pct = get_percents(males)\n", "females_pct = get_percents(females)\n", "\n", "# Assign a name as male if it is more common among males than femals.\n", "male_names = set([m for m in male_names if m not in female_names or\n", " males_pct[m] > females_pct[m]])\n", "female_names = set([f for f in female_names if f not in male_names or\n", " females_pct[f] > males_pct[f]])\n", "\n", "print_ambiguous_names(male_names, female_names)\n", "print('%d male and %d female names' % (len(male_names), len(female_names)))" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.35 of accounts are labeled with gender\n", "gender counts:\n", " Counter({'unknown': 598, 'female': 197, 'male': 132})\n", "male Jeffrey Salas\n", "unknown Iberia TV\n", "unknown Chelllly\n", "female selene\n", "female Hallie Earnhart\n", "unknown COSTLIFEDJ💯🐐\n", "unknown Aldous Snow\n", "unknown 🔮🔮🔮\n", "unknown İÇERDE\n", "unknown Madi\n", "female Fiona Von\n", "unknown spooky cait\n", "unknown -HeartlessLover💋\n", "unknown #NicomaineDeiHardFan\n", "female princess 🎃\n", "male Daniel Poehlman\n", "unknown savage.\n", "unknown 🐨 Lo 🍕\n", "male Isaiah Persons\n", "female Barrie\n" ] } ], "source": [ "# Relabel twitter users (compare with above)\n", "gender_by_name(tweets, male_names, female_names)\n", "print_genders(tweets)" ] }, { "cell_type": "code", "execution_count": 105, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "[('Pedagog 05447243650', 2),\n", " ('WIN PS4 NOW!!', 2),\n", " ('Dev 🤘🏾', 2),\n", " ('Nihat Osmanlı', 2),\n", " ('hasancoco', 2),\n", " ('G', 2),\n", " ('🔥', 2),\n", " ('✨', 2),\n", " ('Vikashgarg', 2),\n", " ('Brends', 1),\n", " ('paipai_devil', 1),\n", " ('IG:mrdope_', 1),\n", " ('Drones365', 1),\n", " ('Henny P. Newton', 1),\n", " ('Mehmet boğa', 1),\n", " ('.', 1),\n", " ('spooky cait', 1),\n", " ('kait', 1),\n", " ('Rahul Sudeep', 1),\n", " ('Kaleidopop', 1)]" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Who are the unknowns?\n", "# \"Filtered\" data can have big impact on analysis.\n", "unknown_names = Counter(t['user']['name']\n", " for t in tweets if t['gender'] == 'unknown')\n", "unknown_names.most_common(20)" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Most Common Male Terms:\n", "[('and', 34),\n", " ('i', 31),\n", " ('the', 21),\n", " ('a', 20),\n", " ('my', 19),\n", " ('of', 17),\n", " ('in', 16),\n", " ('love', 14),\n", " ('with', 13),\n", " ('to', 11)]\n", "\n", "Most Common Female Terms:\n", "[('i', 35),\n", " ('you', 29),\n", " ('my', 25),\n", " ('of', 24),\n", " ('and', 21),\n", " ('a', 19),\n", " ('the', 18),\n", " ('me', 15),\n", " ('to', 15),\n", " ('is', 13)]\n" ] } ], "source": [ "# How do the profiles of male Twitter users differ from\n", "# those of female users?\n", "\n", "male_profiles = [t['user']['description'] for t in tweets\n", " if t['gender'] == 'male']\n", "\n", "female_profiles = [t['user']['description'] for t in tweets\n", " if t['gender'] == 'female']\n", "#male_profiles = [t['text'] for t in tweets\n", "# if t['gender'] == 'male']\n", "\n", "#female_profiles = [t['text'] for t in tweets\n", "# if t['gender'] == 'female']\n", "\n", "import re\n", "def tokenize(s):\n", " return re.sub('\\W+', ' ', s).lower().split() if s else []\n", "\n", "male_words = Counter()\n", "female_words = Counter()\n", "\n", "for p in male_profiles:\n", " male_words.update(Counter(tokenize(p)))\n", " \n", "for p in female_profiles:\n", " female_words.update(Counter(tokenize(p)))\n", "\n", "print('Most Common Male Terms:')\n", "pprint(male_words.most_common(10))\n", " \n", "print('\\nMost Common Female Terms:')\n", "pprint(female_words.most_common(10))" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "844\n", "915\n" ] } ], "source": [ "print(len(male_words))\n", "print(len(female_words))" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top Male Terms (diff):\n", "[('and', -13),\n", " ('with', -10),\n", " ('life', -7),\n", " ('in', -6),\n", " ('too', -6),\n", " ('fun', -5),\n", " ('words', -4),\n", " ('at', -4),\n", " ('be', -4),\n", " ('enjoy', -4)]\n", "\n", "Top Female Terms (diff):\n", "[('things', 5),\n", " ('re', 5),\n", " ('my', 6),\n", " ('she', 6),\n", " ('one', 6),\n", " ('m', 6),\n", " ('is', 6),\n", " ('of', 7),\n", " ('me', 8),\n", " ('you', 24)]\n" ] } ], "source": [ "# Compute difference\n", "diff_counts = dict([(w, female_words[w] - male_words[w])\n", " for w in\n", " set(female_words.keys()) | set(male_words.keys())])\n", "\n", "sorted_diffs = sorted(diff_counts.items(), key=lambda x: x[1])\n", "\n", "print('Top Male Terms (diff):')\n", "pprint(sorted_diffs[:10])\n", "\n", "print('\\nTop Female Terms (diff):')\n", "pprint(sorted_diffs[-10:])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "** A problem with difference of counts:**\n", "\n", "<br><br><br><br>\n", "What if we have more male than female words in total?\n", "\n", "<br><br><br><br>\n", "Instead, consider \"the probability that a male user writes the word **w**\"\n", "\n", "<br><br><br><br>\n", "\n", "$$p(w|male) = \\frac{freq(w, male)}\n", "{\\sum_i freq(w_i, male)} $$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "** Odds Ratio (OR)**\n", "\n", "The ratio of the probabilities for a word from each class:\n", "\n", "$$ OR(w) = \\frac{p(w|female)}{p(w|male)} $$\n", "\n", "\n", "- High values --> more likely to be written by females\n", "- Low values --> more likely to be written by males\n" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p(w|male)\n", "[('and', 0.025660377358490565),\n", " ('i', 0.02339622641509434),\n", " ('the', 0.015849056603773583),\n", " ('a', 0.01509433962264151),\n", " ('my', 0.014339622641509434),\n", " ('of', 0.012830188679245283),\n", " ('in', 0.012075471698113207),\n", " ('love', 0.010566037735849057),\n", " ('with', 0.009811320754716982),\n", " ('to', 0.00830188679245283)]\n", "\n", "p(w|female)\n", "[('i', 0.023458445040214475),\n", " ('you', 0.01943699731903485),\n", " ('my', 0.01675603217158177),\n", " ('of', 0.0160857908847185),\n", " ('and', 0.014075067024128687),\n", " ('a', 0.012734584450402145),\n", " ('the', 0.012064343163538873),\n", " ('me', 0.010053619302949061),\n", " ('to', 0.010053619302949061),\n", " ('is', 0.00871313672922252)]\n" ] } ], "source": [ "def counts_to_probs(gender_words):\n", " \"\"\" Compute probability of each term according to the frequency\n", " in a gender. \"\"\"\n", " total = sum(gender_words.values())\n", " return dict([(word, count / total)\n", " for word, count in gender_words.items()])\n", "\n", "male_probs = counts_to_probs(male_words)\n", "female_probs = counts_to_probs(female_words)\n", "\n", "print('p(w|male)')\n", "pprint(sorted(male_probs.items(), key=lambda x: -x[1])[:10])\n", "\n", "print('\\np(w|female)')\n", "pprint(sorted(female_probs.items(), key=lambda x: -x[1])[:10])" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "ename": "KeyError", "evalue": "'rock'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m<ipython-input-110-6f72aa3012f6>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 3\u001b[0m for w in set(male_probs) | set(female_probs)])\n\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mors\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0modds_ratios\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmale_probs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfemale_probs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m<ipython-input-110-6f72aa3012f6>\u001b[0m in \u001b[0;36modds_ratios\u001b[0;34m(male_probs, female_probs)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0modds_ratios\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmale_probs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfemale_probs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m return dict([(w, female_probs[w] / male_probs[w])\n\u001b[0;32m----> 3\u001b[0;31m for w in set(male_probs) | set(female_probs)])\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mors\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0modds_ratios\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmale_probs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfemale_probs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m<ipython-input-110-6f72aa3012f6>\u001b[0m in \u001b[0;36m<listcomp>\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0modds_ratios\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmale_probs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfemale_probs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m return dict([(w, female_probs[w] / male_probs[w])\n\u001b[0;32m----> 3\u001b[0;31m for w in set(male_probs) | set(female_probs)])\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mors\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0modds_ratios\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmale_probs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfemale_probs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: 'rock'" ] } ], "source": [ "def odds_ratios(male_probs, female_probs):\n", " return dict([(w, female_probs[w] / male_probs[w])\n", " for w in\n", " set(male_probs) | set(female_probs)])\n", "\n", "ors = odds_ratios(male_probs, female_probs)" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "844\n", "915\n", "0.0013404825737265416\n" ] }, { "data": { "text/plain": [ "False" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(len(male_probs))\n", "print(len(female_probs))\n", "print(female_probs['rock'])\n", "'rock' in male_probs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "** How to deal with 0-probabilities? **\n", "\n", "$$p(w|male) = \\frac{freq(w, male)}\n", "{\\sum_i freq(w_i, male)} $$\n", "\n", "$freq(w, male) = 0$\n", "\n", "Do we really believe there is **0** probability of a male using this term?\n", "\n", "(Recall over-fitting discussion.)\n", "<br><br><br><br>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "** Additive Smoothing **\n", "\n", "Reserve small amount of counts (e.g., 1) for unseen observations.\n", "\n", "E.g., assume we've seen each word at least once in each class.\n", "\n", "$$p(w|male) = \\frac{1 + freq(w, male)}\n", "{|W| + \\sum_i freq(w_i, male)} $$\n", "\n", "$|W|$: number of unique words." ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('and', 0.012114918656974732)\n", "('i', 0.01107649705780547)\n", "('the', 0.00761509172724126)\n", "('a', 0.007268951194184839)\n", "('my', 0.006922810661128418)\n", "('of', 0.006230529595015576)\n", "('in', 0.005884389061959155)\n", "('love', 0.005192107995846314)\n", "('with', 0.004845967462789893)\n", "('to', 0.004153686396677051)\n" ] } ], "source": [ "# Additive smoothing. Add count of 1 for all words.\n", "all_words = set(male_words) | set(female_words)\n", "male_words.update(all_words) \n", "female_words.update(all_words)\n", "\n", "male_probs = counts_to_probs(male_words)\n", "female_probs = counts_to_probs(female_words)\n", "print('\\n'.join(str(x) for x in \n", " sorted(male_probs.items(), key=lambda x: -x[1])[:10]))" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.00034614053305642093\n" ] } ], "source": [ "# Even though word doesn't appear, has non-zerp probability.\n", "print(male_probs['rock'])" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top Female Terms (OR):\n", "[('she', 6.6174738219895275),\n", " ('one', 6.6174738219895275),\n", " ('19', 5.672120418848166),\n", " ('student', 5.672120418848166),\n", " ('things', 5.672120418848166),\n", " ('right', 4.726767015706806),\n", " ('towards', 4.726767015706806),\n", " ('reaction', 4.726767015706806),\n", " ('first', 4.726767015706806),\n", " ('you', 4.726767015706806),\n", " ('bayern', 4.726767015706806),\n", " ('insta', 4.726767015706806),\n", " ('from', 3.781413612565445),\n", " ('those', 3.781413612565445),\n", " ('jo', 3.781413612565445),\n", " ('technology', 3.781413612565445),\n", " ('forever', 3.781413612565445),\n", " ('we', 3.781413612565445),\n", " ('follow', 3.781413612565445),\n", " ('aliciagworld', 3.781413612565445)]\n", "\n", "Top Male Terms (OR):\n", "[('fan', 0.31511780104712045),\n", " ('youtu', 0.31511780104712045),\n", " ('when', 0.31511780104712045),\n", " ('football', 0.31511780104712045),\n", " ('game', 0.31511780104712045),\n", " ('gamer', 0.31511780104712045),\n", " ('rest', 0.31511780104712045),\n", " ('with', 0.27010097232610325),\n", " ('sports', 0.23633835078534032),\n", " ('father', 0.23633835078534032),\n", " ('person', 0.23633835078534032),\n", " ('far', 0.23633835078534032),\n", " ('know', 0.23633835078534032),\n", " ('drummer', 0.23633835078534032),\n", " ('beer', 0.23633835078534032),\n", " ('mother', 0.23633835078534032),\n", " ('words', 0.18907068062827226),\n", " ('enjoy', 0.18907068062827226),\n", " ('fun', 0.15755890052356022),\n", " ('too', 0.13505048616305163)]\n" ] } ], "source": [ "ors = odds_ratios(male_probs, female_probs)\n", "\n", "sorted_ors = sorted(ors.items(), key=lambda x: -x[1])\n", "\n", "print('Top Female Terms (OR):')\n", "pprint(sorted_ors[:20])\n", "\n", "print('\\nTop Male Terms (OR):')\n", "pprint(sorted_ors[-20:])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.0" } }, "nbformat": 4, "nbformat_minor": 0 }