{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# CS579: Lecture 12  \n",
    "\n",
    "** Demographic Inference I**\n",
    "\n",
    "*[Dr. Aron Culotta](http://cs.iit.edu/~culotta)*  \n",
    "*[Illinois Institute of Technology](http://iit.edu)*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Midterm review \n",
    "\n",
    "~5 True/False, ~6 short answer\n",
    " \n",
    "**Topics:** \n",
    "- Twitter API \n",
    " - what comes in a tweet? \n",
    " - how do rate limits work? \n",
    " - can you understand API documentation? \n",
    "- graph basics: \n",
    " - directed/undirected \n",
    " - path \n",
    " - cycle \n",
    " - connected \n",
    " - connected component \n",
    " - degree (distribution) \n",
    " - diameter \n",
    " - average path length \n",
    " - clustering coefficient \n",
    "- modeling networks \n",
    " - random graphs \n",
    " - regular graphs \n",
    " - rewired graphs \n",
    " - what makes a small world? \n",
    "- community detection \n",
    " - girvan-newman (betweenness) \n",
    " - graph cuts \n",
    " - representing graphs with matrices \n",
    " - graph laplacian \n",
    "- link prediction \n",
    " - shortest path \n",
    " - common neighbors \n",
    " - jaccard \n",
    " - preferential attachment \n",
    " - sim rank \n",
    " - evaluation \n",
    "- information cascades \n",
    " - urn experiment \n",
    " - bayes' theorem for decision making \n",
    " - game-theoretic model \n",
    " - maximizing payoff \n",
    "\t- cluster density \n",
    "- sentiment analysis \n",
    " - lexicon approach \n",
    " - machine learning \n",
    " \n",
    "**Question types:** \n",
    "- What does this algorithm output? \n",
    " - E.g., what is jaccard score for a specific link? \n",
    " - E.g., what is the next step in girvan-newman? \n",
    "- What does this code do? \n",
    " - E.g., I give you a new graph-generating algorithm, tell me what it produces \n",
    "- Write a new algorithm \n",
    " - E.g., provide pseudo-code the linear-threshold cascade model \n",
    "- True/False \n",
    " - E.g., small world graphs have higher clustering coefficients than random graphs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "**dem·o·graph·ics**\n",
    "\n",
    "statistical data relating to the population and particular groups within it.\n",
    "\n",
    "E.g., age, ethnicity, gender, income, ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Why Demographics?\n",
    "\n",
    "- Marketing\n",
    "  - Who are my customers?\n",
    "  - Who are my competitors' customers?\n",
    "  - E.g., [DemographicsPro](http://www.demographicspro.com/samples#c=%40FamilyGuyonFOX)\n",
    "  \n",
    "- Social Media as Surveys\n",
    "  - E.g., 45% of tweets express positive sentiment toward Pres. Obama\n",
    "  - Who wrote those tweets?\n",
    "  \n",
    "- Health\n",
    "  - 2% of Facebook users are expressing flu-like symptoms\n",
    "  - Are they representative of the full population?\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "** User profiles vary from site to site. **"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![rahm](rahm.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![rahm-fb](rahm-fb.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![rahm-li](rahm-li.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Approaches\n",
    "\n",
    "- Clever use of external data\n",
    "  - E.g., U.S. Census name lists for gender\n",
    "- Look for keywords in profile\n",
    "  - \"African American Male\"\n",
    "  - \"Happy 21st birthday to me\"\n",
    "- Machine Learning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100 tweets\n",
      "200 tweets\n",
      "300 tweets\n",
      "400 tweets\n",
      "500 tweets\n",
      "600 tweets\n",
      "700 tweets\n",
      "800 tweets\n",
      "900 tweets\n",
      "1000 tweets\n",
      "fetched 1000 tweets\n"
     ]
    }
   ],
   "source": [
    "# Guessing gender\n",
    "import configparser\n",
    "import sys\n",
    "from TwitterAPI import TwitterAPI\n",
    "\n",
    "def get_twitter(config_file):\n",
    "    \"\"\" Read the config_file and construct an instance of TwitterAPI.\n",
    "    Args:\n",
    "      config_file ... A config file in ConfigParser format with Twitter credentials\n",
    "    Returns:\n",
    "      An instance of TwitterAPI.\n",
    "    \"\"\"\n",
    "    config = configparser.ConfigParser()\n",
    "    config.read(config_file)\n",
    "    twitter = TwitterAPI(\n",
    "                   config.get('twitter', 'consumer_key'),\n",
    "                   config.get('twitter', 'consumer_secret'),\n",
    "                   config.get('twitter', 'access_token'),\n",
    "                   config.get('twitter', 'access_token_secret'))\n",
    "    return twitter\n",
    "\n",
    "twitter = get_twitter('twitter.cfg')\n",
    "tweets = []\n",
    "n_tweets=1000\n",
    "for r in twitter.request('statuses/filter', {'track': 'i'}):\n",
    "    tweets.append(r)\n",
    "    if len(tweets) % 100 == 0:\n",
    "        print('%d tweets' % len(tweets))\n",
    "    if len(tweets) >= n_tweets:\n",
    "        break\n",
    "print('fetched %d tweets' % len(tweets))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'limit': {'timestamp_ms': '1475689818824', 'track': 12}},\n",
       " {'limit': {'timestamp_ms': '1475689818840', 'track': 9}},\n",
       " {'limit': {'timestamp_ms': '1475689819793', 'track': 89}},\n",
       " {'limit': {'timestamp_ms': '1475689819825', 'track': 91}},\n",
       " {'limit': {'timestamp_ms': '1475689819846', 'track': 73}},\n",
       " {'limit': {'timestamp_ms': '1475689819857', 'track': 99}}]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# not all tweets are returned\n",
    "# https://dev.twitter.com/streaming/overview/messages-types#limit_notices\n",
    "[t for t in tweets if 'user' not in t][:6]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "fetched 927 tweets\n"
     ]
    }
   ],
   "source": [
    "# restrict to actual tweets\n",
    "# (remove \"deleted\" tweets)\n",
    "tweets = [t for t in tweets if 'user' in t]\n",
    "print('fetched %d tweets' % len(tweets))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Jeffrey Salas',\n",
       " 'Iberia TV',\n",
       " 'Chelllly',\n",
       " 'selene',\n",
       " 'Hallie Earnhart',\n",
       " 'COSTLIFEDJ💯🐐',\n",
       " 'Aldous Snow',\n",
       " '🔮🔮🔮',\n",
       " 'İÇERDE',\n",
       " 'Madi']"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print 10 names.\n",
    "names = [t['user']['name'] for t in tweets]\n",
    "names[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "males:\n",
      "['JAMES          3.318  3.318      1',\n",
      " 'JOHN           3.271  6.589      2',\n",
      " 'ROBERT         3.143  9.732      3',\n",
      " 'MICHAEL        2.629 12.361      4',\n",
      " 'WILLIAM        2.451 14.812      5',\n",
      " 'DAVID          2.363 17.176      6',\n",
      " 'RICHARD        1.703 18.878      7',\n",
      " 'CHARLES        1.523 20.401      8',\n",
      " 'JOSEPH         1.404 21.805      9',\n",
      " 'THOMAS         1.380 23.185     10']\n",
      "females:\n",
      "['MARY           2.629  2.629      1',\n",
      " 'PATRICIA       1.073  3.702      2',\n",
      " 'LINDA          1.035  4.736      3',\n",
      " 'BARBARA        0.980  5.716      4',\n",
      " 'ELIZABETH      0.937  6.653      5',\n",
      " 'JENNIFER       0.932  7.586      6',\n",
      " 'MARIA          0.828  8.414      7',\n",
      " 'SUSAN          0.794  9.209      8',\n",
      " 'MARGARET       0.768  9.976      9',\n",
      " 'DOROTHY        0.727 10.703     10']\n"
     ]
    }
   ],
   "source": [
    "# Fetch census name data from:\n",
    "# http://www2.census.gov/topics/genealogy/1990surnames/\n",
    "import requests\n",
    "from pprint import pprint\n",
    "males_url = 'http://www2.census.gov/topics/genealogy/' + \\\n",
    "            '1990surnames/dist.male.first'\n",
    "females_url = 'http://www2.census.gov/topics/genealogy/' + \\\n",
    "              '1990surnames/dist.female.first'\n",
    "males = requests.get(males_url).text.split('\\n')\n",
    "females = requests.get(females_url).text.split('\\n')\n",
    "print('males:')\n",
    "pprint(males[:10])\n",
    "print('females:')\n",
    "pprint(females[:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1219 male and 4275 female names\n",
      "males:\n",
      "courtney\n",
      "lupe\n",
      "christian\n",
      "alfredo\n",
      "jonah\n",
      "rolando\n",
      "samuel\n",
      "luciano\n",
      "cesar\n",
      "vaughn\n",
      "\n",
      "females:\n",
      "lannie\n",
      "lyndsay\n",
      "lupe\n",
      "lolita\n",
      "shayna\n",
      "elnora\n",
      "laurene\n",
      "loree\n",
      "charisse\n",
      "gertrud\n"
     ]
    }
   ],
   "source": [
    "# Get names. \n",
    "male_names = set([m.split()[0].lower() for m in males if m])\n",
    "female_names = set([f.split()[0].lower() for f in females if f])\n",
    "print('%d male and %d female names' % (len(male_names), len(female_names)))\n",
    "print('males:\\n' + '\\n'.join(list(male_names)[:10]))\n",
    "print('\\nfemales:\\n' + '\\n'.join(list(female_names)[:10]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Initialize gender of all tweets to unknown.\n",
    "for t in tweets:\n",
    "    t['gender'] = 'unknown'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 100,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# label a Twitter user's gender by matching name list.\n",
    "import re\n",
    "def gender_by_name(tweets, male_names, female_names):\n",
    "    for t in tweets:\n",
    "        name = t['user']['name']\n",
    "        if name:\n",
    "            # remove punctuation.\n",
    "            name_parts = re.findall('\\w+', name.split()[0].lower())\n",
    "            if len(name_parts) > 0:\n",
    "                first = name_parts[0].lower()\n",
    "                if first in male_names:\n",
    "                    t['gender'] = 'male'\n",
    "                elif first in female_names:\n",
    "                    t['gender'] = 'female'\n",
    "                else:\n",
    "                    t['gender'] = 'unknown'\n",
    "\n",
    "gender_by_name(tweets, male_names, female_names)\n",
    "# What's wrong with this approach?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.35 of accounts are labeled with gender\n",
      "gender counts:\n",
      " Counter({'unknown': 598, 'female': 178, 'male': 151})\n",
      "male Jeffrey Salas\n",
      "unknown Iberia TV\n",
      "unknown Chelllly\n",
      "female selene\n",
      "female Hallie Earnhart\n",
      "unknown COSTLIFEDJ💯🐐\n",
      "unknown Aldous Snow\n",
      "unknown 🔮🔮🔮\n",
      "unknown İÇERDE\n",
      "unknown Madi\n",
      "female Fiona Von\n",
      "unknown spooky cait\n",
      "unknown -HeartlessLover💋\n",
      "unknown #NicomaineDeiHardFan\n",
      "female princess 🎃\n",
      "male Daniel Poehlman\n",
      "unknown savage.\n",
      "unknown 🐨 Lo 🍕\n",
      "male Isaiah Persons\n",
      "female Barrie\n"
     ]
    }
   ],
   "source": [
    "from collections import Counter\n",
    "\n",
    "def print_genders(tweets):\n",
    "    counts = Counter([t['gender'] for t in tweets])\n",
    "    print('%.2f of accounts are labeled with gender' % \n",
    "          ((counts['male'] + counts['female']) / sum(counts.values())))\n",
    "    print('gender counts:\\n', counts)\n",
    "    for t in tweets[:20]:\n",
    "        print(t['gender'], t['user']['name'])\n",
    "    \n",
    "print_genders(tweets)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "found 331 ambiguous names:\n",
      "\n",
      "courtney\n",
      "lupe\n",
      "christian\n",
      "samuel\n",
      "maurice\n",
      "ronald\n",
      "eric\n",
      "toby\n",
      "perry\n",
      "jimmy\n",
      "shannon\n",
      "jan\n",
      "christopher\n",
      "paris\n",
      "cory\n",
      "marshall\n",
      "lou\n",
      "merrill\n",
      "mickey\n",
      "stephen\n"
     ]
    }
   ],
   "source": [
    "# What about ambiguous names?\n",
    "def print_ambiguous_names(male_names, female_names):\n",
    "    ambiguous = [n for n in male_names if n in female_names]  # names on both lists\n",
    "    print('found %d ambiguous names:\\n'% len(ambiguous))\n",
    "    print('\\n'.join(ambiguous[:20]))\n",
    "    \n",
    "print_ambiguous_names(male_names, female_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "found 0 ambiguous names:\n",
      "\n",
      "\n",
      "1146 male and 4017 female names\n"
     ]
    }
   ],
   "source": [
    "# Keep names that are more frequent in one gender than the other.\n",
    "def get_percents(name_list):\n",
    "    # parse raw data to extract, e.g., the percent of males names John.\n",
    "    return dict([(n.split()[0].lower(), float(n.split()[1]))\n",
    "                  for n in name_list if n])\n",
    "\n",
    "males_pct = get_percents(males)\n",
    "females_pct = get_percents(females)\n",
    "\n",
    "# Assign a name as male if it is more common among males than femals.\n",
    "male_names = set([m for m in male_names if m not in female_names or\n",
    "              males_pct[m] > females_pct[m]])\n",
    "female_names = set([f for f in female_names if f not in male_names or\n",
    "              females_pct[f] > males_pct[f]])\n",
    "\n",
    "print_ambiguous_names(male_names, female_names)\n",
    "print('%d male and %d female names' % (len(male_names), len(female_names)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.35 of accounts are labeled with gender\n",
      "gender counts:\n",
      " Counter({'unknown': 598, 'female': 197, 'male': 132})\n",
      "male Jeffrey Salas\n",
      "unknown Iberia TV\n",
      "unknown Chelllly\n",
      "female selene\n",
      "female Hallie Earnhart\n",
      "unknown COSTLIFEDJ💯🐐\n",
      "unknown Aldous Snow\n",
      "unknown 🔮🔮🔮\n",
      "unknown İÇERDE\n",
      "unknown Madi\n",
      "female Fiona Von\n",
      "unknown spooky cait\n",
      "unknown -HeartlessLover💋\n",
      "unknown #NicomaineDeiHardFan\n",
      "female princess 🎃\n",
      "male Daniel Poehlman\n",
      "unknown savage.\n",
      "unknown 🐨 Lo 🍕\n",
      "male Isaiah Persons\n",
      "female Barrie\n"
     ]
    }
   ],
   "source": [
    "# Relabel twitter users (compare with above)\n",
    "gender_by_name(tweets, male_names, female_names)\n",
    "print_genders(tweets)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('Pedagog 05447243650', 2),\n",
       " ('WIN PS4 NOW!!', 2),\n",
       " ('Dev 🤘🏾', 2),\n",
       " ('Nihat Osmanlı', 2),\n",
       " ('hasancoco', 2),\n",
       " ('G', 2),\n",
       " ('🔥', 2),\n",
       " ('✨', 2),\n",
       " ('Vikashgarg', 2),\n",
       " ('Brends', 1),\n",
       " ('paipai_devil', 1),\n",
       " ('IG:mrdope_', 1),\n",
       " ('Drones365', 1),\n",
       " ('Henny P. Newton', 1),\n",
       " ('Mehmet boğa', 1),\n",
       " ('.', 1),\n",
       " ('spooky cait', 1),\n",
       " ('kait', 1),\n",
       " ('Rahul Sudeep', 1),\n",
       " ('Kaleidopop', 1)]"
      ]
     },
     "execution_count": 105,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Who are the unknowns?\n",
    "# \"Filtered\" data can have big impact on analysis.\n",
    "unknown_names = Counter(t['user']['name']\n",
    "                        for t in tweets if t['gender'] == 'unknown')\n",
    "unknown_names.most_common(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most Common Male Terms:\n",
      "[('and', 34),\n",
      " ('i', 31),\n",
      " ('the', 21),\n",
      " ('a', 20),\n",
      " ('my', 19),\n",
      " ('of', 17),\n",
      " ('in', 16),\n",
      " ('love', 14),\n",
      " ('with', 13),\n",
      " ('to', 11)]\n",
      "\n",
      "Most Common Female Terms:\n",
      "[('i', 35),\n",
      " ('you', 29),\n",
      " ('my', 25),\n",
      " ('of', 24),\n",
      " ('and', 21),\n",
      " ('a', 19),\n",
      " ('the', 18),\n",
      " ('me', 15),\n",
      " ('to', 15),\n",
      " ('is', 13)]\n"
     ]
    }
   ],
   "source": [
    "# How do the profiles of male Twitter users differ from\n",
    "# those of female users?\n",
    "\n",
    "male_profiles = [t['user']['description'] for t in tweets\n",
    "                if t['gender'] == 'male']\n",
    "\n",
    "female_profiles = [t['user']['description'] for t in tweets\n",
    "                if t['gender'] == 'female']\n",
    "#male_profiles = [t['text'] for t in tweets\n",
    "#                if t['gender'] == 'male']\n",
    "\n",
    "#female_profiles = [t['text'] for t in tweets\n",
    "#                if t['gender'] == 'female']\n",
    "\n",
    "import re\n",
    "def tokenize(s):\n",
    "    return re.sub('\\W+', ' ', s).lower().split() if s else []\n",
    "\n",
    "male_words = Counter()\n",
    "female_words = Counter()\n",
    "\n",
    "for p in male_profiles:\n",
    "    male_words.update(Counter(tokenize(p)))\n",
    "                      \n",
    "for p in female_profiles:\n",
    "    female_words.update(Counter(tokenize(p)))\n",
    "\n",
    "print('Most Common Male Terms:')\n",
    "pprint(male_words.most_common(10))\n",
    "    \n",
    "print('\\nMost Common Female Terms:')\n",
    "pprint(female_words.most_common(10))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "844\n",
      "915\n"
     ]
    }
   ],
   "source": [
    "print(len(male_words))\n",
    "print(len(female_words))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Top Male Terms (diff):\n",
      "[('and', -13),\n",
      " ('with', -10),\n",
      " ('life', -7),\n",
      " ('in', -6),\n",
      " ('too', -6),\n",
      " ('fun', -5),\n",
      " ('words', -4),\n",
      " ('at', -4),\n",
      " ('be', -4),\n",
      " ('enjoy', -4)]\n",
      "\n",
      "Top Female Terms (diff):\n",
      "[('things', 5),\n",
      " ('re', 5),\n",
      " ('my', 6),\n",
      " ('she', 6),\n",
      " ('one', 6),\n",
      " ('m', 6),\n",
      " ('is', 6),\n",
      " ('of', 7),\n",
      " ('me', 8),\n",
      " ('you', 24)]\n"
     ]
    }
   ],
   "source": [
    "# Compute difference\n",
    "diff_counts = dict([(w, female_words[w] - male_words[w])\n",
    "                    for w in\n",
    "                    set(female_words.keys()) | set(male_words.keys())])\n",
    "\n",
    "sorted_diffs = sorted(diff_counts.items(), key=lambda x: x[1])\n",
    "\n",
    "print('Top Male Terms (diff):')\n",
    "pprint(sorted_diffs[:10])\n",
    "\n",
    "print('\\nTop Female Terms (diff):')\n",
    "pprint(sorted_diffs[-10:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "** A problem with difference of counts:**\n",
    "\n",
    "<br><br><br><br>\n",
    "What if we have more male than female words in total?\n",
    "\n",
    "<br><br><br><br>\n",
    "Instead, consider \"the probability that a male user writes the word **w**\"\n",
    "\n",
    "<br><br><br><br>\n",
    "\n",
    "$$p(w|male) = \\frac{freq(w, male)}\n",
    "{\\sum_i freq(w_i, male)} $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "** Odds Ratio (OR)**\n",
    "\n",
    "The ratio of the probabilities for a word from each class:\n",
    "\n",
    "$$ OR(w) = \\frac{p(w|female)}{p(w|male)} $$\n",
    "\n",
    "\n",
    "- High values --> more likely to be written by females\n",
    "- Low values --> more likely to be written by males\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p(w|male)\n",
      "[('and', 0.025660377358490565),\n",
      " ('i', 0.02339622641509434),\n",
      " ('the', 0.015849056603773583),\n",
      " ('a', 0.01509433962264151),\n",
      " ('my', 0.014339622641509434),\n",
      " ('of', 0.012830188679245283),\n",
      " ('in', 0.012075471698113207),\n",
      " ('love', 0.010566037735849057),\n",
      " ('with', 0.009811320754716982),\n",
      " ('to', 0.00830188679245283)]\n",
      "\n",
      "p(w|female)\n",
      "[('i', 0.023458445040214475),\n",
      " ('you', 0.01943699731903485),\n",
      " ('my', 0.01675603217158177),\n",
      " ('of', 0.0160857908847185),\n",
      " ('and', 0.014075067024128687),\n",
      " ('a', 0.012734584450402145),\n",
      " ('the', 0.012064343163538873),\n",
      " ('me', 0.010053619302949061),\n",
      " ('to', 0.010053619302949061),\n",
      " ('is', 0.00871313672922252)]\n"
     ]
    }
   ],
   "source": [
    "def counts_to_probs(gender_words):\n",
    "    \"\"\" Compute probability of each term according to the frequency\n",
    "    in a gender. \"\"\"\n",
    "    total = sum(gender_words.values())\n",
    "    return dict([(word, count / total)\n",
    "                 for word, count in gender_words.items()])\n",
    "\n",
    "male_probs = counts_to_probs(male_words)\n",
    "female_probs = counts_to_probs(female_words)\n",
    "\n",
    "print('p(w|male)')\n",
    "pprint(sorted(male_probs.items(), key=lambda x: -x[1])[:10])\n",
    "\n",
    "print('\\np(w|female)')\n",
    "pprint(sorted(female_probs.items(), key=lambda x: -x[1])[:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "ename": "KeyError",
     "evalue": "'rock'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-110-6f72aa3012f6>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      3\u001b[0m                  for w in set(male_probs) | set(female_probs)])\n\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mors\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0modds_ratios\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmale_probs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfemale_probs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m<ipython-input-110-6f72aa3012f6>\u001b[0m in \u001b[0;36modds_ratios\u001b[0;34m(male_probs, female_probs)\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0modds_ratios\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmale_probs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfemale_probs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      2\u001b[0m     return dict([(w, female_probs[w] / male_probs[w])\n\u001b[0;32m----> 3\u001b[0;31m                  for w in set(male_probs) | set(female_probs)])\n\u001b[0m\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mors\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0modds_ratios\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmale_probs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfemale_probs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m<ipython-input-110-6f72aa3012f6>\u001b[0m in \u001b[0;36m<listcomp>\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0modds_ratios\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmale_probs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfemale_probs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      2\u001b[0m     return dict([(w, female_probs[w] / male_probs[w])\n\u001b[0;32m----> 3\u001b[0;31m                  for w in set(male_probs) | set(female_probs)])\n\u001b[0m\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mors\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0modds_ratios\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmale_probs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfemale_probs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mKeyError\u001b[0m: 'rock'"
     ]
    }
   ],
   "source": [
    "def odds_ratios(male_probs, female_probs):\n",
    "    return dict([(w, female_probs[w] / male_probs[w])\n",
    "                 for w in\n",
    "                 set(male_probs) | set(female_probs)])\n",
    "\n",
    "ors = odds_ratios(male_probs, female_probs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "844\n",
      "915\n",
      "0.0013404825737265416\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 111,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(len(male_probs))\n",
    "print(len(female_probs))\n",
    "print(female_probs['rock'])\n",
    "'rock' in male_probs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "** How to deal with 0-probabilities? **\n",
    "\n",
    "$$p(w|male) = \\frac{freq(w, male)}\n",
    "{\\sum_i freq(w_i, male)} $$\n",
    "\n",
    "$freq(w, male) = 0$\n",
    "\n",
    "Do we really believe there is **0** probability of a male using this term?\n",
    "\n",
    "(Recall over-fitting discussion.)\n",
    "<br><br><br><br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "** Additive Smoothing **\n",
    "\n",
    "Reserve small amount of counts (e.g., 1) for unseen observations.\n",
    "\n",
    "E.g., assume we've seen each word at least once in each class.\n",
    "\n",
    "$$p(w|male) = \\frac{1 + freq(w, male)}\n",
    "{|W| + \\sum_i freq(w_i, male)} $$\n",
    "\n",
    "$|W|$: number of unique words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('and', 0.012114918656974732)\n",
      "('i', 0.01107649705780547)\n",
      "('the', 0.00761509172724126)\n",
      "('a', 0.007268951194184839)\n",
      "('my', 0.006922810661128418)\n",
      "('of', 0.006230529595015576)\n",
      "('in', 0.005884389061959155)\n",
      "('love', 0.005192107995846314)\n",
      "('with', 0.004845967462789893)\n",
      "('to', 0.004153686396677051)\n"
     ]
    }
   ],
   "source": [
    "# Additive smoothing. Add count of 1 for all words.\n",
    "all_words = set(male_words) | set(female_words)\n",
    "male_words.update(all_words)  \n",
    "female_words.update(all_words)\n",
    "\n",
    "male_probs = counts_to_probs(male_words)\n",
    "female_probs = counts_to_probs(female_words)\n",
    "print('\\n'.join(str(x) for x in \n",
    "                sorted(male_probs.items(), key=lambda x: -x[1])[:10]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 113,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.00034614053305642093\n"
     ]
    }
   ],
   "source": [
    "# Even though word doesn't appear, has non-zerp probability.\n",
    "print(male_probs['rock'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Top Female Terms (OR):\n",
      "[('she', 6.6174738219895275),\n",
      " ('one', 6.6174738219895275),\n",
      " ('19', 5.672120418848166),\n",
      " ('student', 5.672120418848166),\n",
      " ('things', 5.672120418848166),\n",
      " ('right', 4.726767015706806),\n",
      " ('towards', 4.726767015706806),\n",
      " ('reaction', 4.726767015706806),\n",
      " ('first', 4.726767015706806),\n",
      " ('you', 4.726767015706806),\n",
      " ('bayern', 4.726767015706806),\n",
      " ('insta', 4.726767015706806),\n",
      " ('from', 3.781413612565445),\n",
      " ('those', 3.781413612565445),\n",
      " ('jo', 3.781413612565445),\n",
      " ('technology', 3.781413612565445),\n",
      " ('forever', 3.781413612565445),\n",
      " ('we', 3.781413612565445),\n",
      " ('follow', 3.781413612565445),\n",
      " ('aliciagworld', 3.781413612565445)]\n",
      "\n",
      "Top Male Terms (OR):\n",
      "[('fan', 0.31511780104712045),\n",
      " ('youtu', 0.31511780104712045),\n",
      " ('when', 0.31511780104712045),\n",
      " ('football', 0.31511780104712045),\n",
      " ('game', 0.31511780104712045),\n",
      " ('gamer', 0.31511780104712045),\n",
      " ('rest', 0.31511780104712045),\n",
      " ('with', 0.27010097232610325),\n",
      " ('sports', 0.23633835078534032),\n",
      " ('father', 0.23633835078534032),\n",
      " ('person', 0.23633835078534032),\n",
      " ('far', 0.23633835078534032),\n",
      " ('know', 0.23633835078534032),\n",
      " ('drummer', 0.23633835078534032),\n",
      " ('beer', 0.23633835078534032),\n",
      " ('mother', 0.23633835078534032),\n",
      " ('words', 0.18907068062827226),\n",
      " ('enjoy', 0.18907068062827226),\n",
      " ('fun', 0.15755890052356022),\n",
      " ('too', 0.13505048616305163)]\n"
     ]
    }
   ],
   "source": [
    "ors = odds_ratios(male_probs, female_probs)\n",
    "\n",
    "sorted_ors = sorted(ors.items(), key=lambda x: -x[1])\n",
    "\n",
    "print('Top Female Terms (OR):')\n",
    "pprint(sorted_ors[:20])\n",
    "\n",
    "print('\\nTop Male Terms (OR):')\n",
    "pprint(sorted_ors[-20:])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}