{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# CS579: Lecture 12  \n",
    "\n",
    "** Demographic Inference I**\n",
    "\n",
    "*[Dr. Aron Culotta](http://cs.iit.edu/~culotta)*  \n",
    "*[Illinois Institute of Technology](http://iit.edu)*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "**dem·o·graph·ics**\n",
    "\n",
    "statistical data relating to the population and particular groups within it.\n",
    "\n",
    "E.g., age, ethnicity, gender, income, ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Why Demographics?\n",
    "\n",
    "- Marketing\n",
    "  - Who are my customers?\n",
    "  - Who are my competitors' customers?\n",
    "  - E.g., [DemographicsPro](http://www.demographicspro.com/samples#c=%40FamilyGuyonFOX)\n",
    "  \n",
    "- Social Media as Surveys\n",
    "  - E.g., 45% of tweets express positive sentiment toward Pres. Obama\n",
    "  - Who wrote those tweets?\n",
    "  \n",
    "- Health\n",
    "  - 2% of Facebook users are expressing flu-like symptoms\n",
    "  - Are they representative of the full population?\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "** User profiles vary from site to site. **"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![rahm](rahm.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![rahm-fb](rahm-fb.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "![rahm-li](rahm-li.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Approaches\n",
    "\n",
    "- Clever use of external data\n",
    "  - E.g., U.S. Census name lists for gender\n",
    "- Look for keywords in profile\n",
    "  - \"African American Male\"\n",
    "  - \"Happy 21st birthday to me\"\n",
    "- Machine Learning"
   ]
  },
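  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "The keyword approach above can be sketched with two regular expressions. This is a minimal sketch, not a tested extractor: the patterns, the helper names `age_from_text` and `gender_from_text`, and the example strings are assumptions chosen for illustration, and real profiles would need many more patterns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Hypothetical patterns, for illustration only.\n",
    "AGE_RE = re.compile(r'happy (\\d{1,2})(?:st|nd|rd|th) birthday to me', re.IGNORECASE)\n",
    "GENDER_RE = re.compile(r'\\b(male|female)\\b', re.IGNORECASE)\n",
    "\n",
    "def age_from_text(text):\n",
    "    # Return the age stated in a birthday message, or None if nothing matches.\n",
    "    match = AGE_RE.search(text or '')\n",
    "    return int(match.group(1)) if match else None\n",
    "\n",
    "def gender_from_text(text):\n",
    "    # Return 'male' or 'female' if that word appears in the text, else None.\n",
    "    match = GENDER_RE.search(text or '')\n",
    "    return match.group(1).lower() if match else None\n",
    "\n",
    "print(age_from_text('Happy 21st birthday to me'))\n",
    "print(gender_from_text('African American Male'))"
   ]
  },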
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "100 tweets\n",
      "200 tweets\n",
      "300 tweets\n",
      "400 tweets\n",
      "500 tweets\n",
      "600 tweets\n",
      "700 tweets\n",
      "800 tweets\n",
      "900 tweets\n",
      "1000 tweets\n",
      "fetched 1000 tweets\n"
     ]
    }
   ],
   "source": [
    "# Guessing gender\n",
    "# Collect 1000 tweets matching query \"i\"\n",
    "import configparser\n",
    "import sys\n",
    "from TwitterAPI import TwitterAPI\n",
    "\n",
    "def get_twitter(config_file):\n",
    "    \"\"\" Read the config_file and construct an instance of TwitterAPI.\n",
    "    Args:\n",
    "      config_file ... A config file in ConfigParser format with Twitter credentials\n",
    "    Returns:\n",
    "      An instance of TwitterAPI.\n",
    "    \"\"\"\n",
    "    config = configparser.ConfigParser()\n",
    "    config.read(config_file)\n",
    "    twitter = TwitterAPI(\n",
    "                   config.get('twitter', 'consumer_key'),\n",
    "                   config.get('twitter', 'consumer_secret'),\n",
    "                   config.get('twitter', 'access_token'),\n",
    "                   config.get('twitter', 'access_token_secret'))\n",
    "    return twitter\n",
    "\n",
    "twitter = get_twitter('twitter.cfg')\n",
    "tweets = []\n",
    "n_tweets=1000\n",
    "for r in twitter.request('statuses/filter', {'track': 'i'}):\n",
    "    tweets.append(r)\n",
    "    if len(tweets) % 100 == 0:\n",
    "        print('%d tweets' % len(tweets))\n",
    "    if len(tweets) >= n_tweets:\n",
    "        break\n",
    "print('fetched %d tweets' % len(tweets))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'limit': {'track': 383, 'timestamp_ms': '1551714817758'}},\n",
       " {'limit': {'track': 790, 'timestamp_ms': '1551714818769'}},\n",
       " {'limit': {'track': 1150, 'timestamp_ms': '1551714819737'}},\n",
       " {'limit': {'track': 1556, 'timestamp_ms': '1551714820748'}},\n",
       " {'limit': {'track': 1998, 'timestamp_ms': '1551714821738'}},\n",
       " {'limit': {'track': 2384, 'timestamp_ms': '1551714822750'}}]"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# not all tweets are returned\n",
    "# https://dev.twitter.com/streaming/overview/messages-types#limit_notices\n",
    "[t for t in tweets if 'user' not in t][:6]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "fetched 981 tweets\n"
     ]
    }
   ],
   "source": [
    "# restrict to actual tweets\n",
    "# (remove \"deleted\" tweets)\n",
    "tweets = [t for t in tweets if 'user' in t]\n",
    "print('fetched %d tweets' % len(tweets))"
   ]
  },
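  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the same filter on hand-made messages. The `messages` list is invented for illustration and only mimics the shape of the limit notices shown above; it is not real API output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hand-made stand-ins for streaming messages (assumed shapes, not real API output).\n",
    "messages = [\n",
    "    {'text': 'i like pie', 'user': {'name': 'Ann'}},\n",
    "    {'limit': {'track': 383, 'timestamp_ms': '1551714817758'}},\n",
    "    {'text': 'i am here', 'user': {'name': 'Bob'}},\n",
    "]\n",
    "\n",
    "# Tweets carry a 'user' field; limit notices carry a 'limit' field instead.\n",
    "real_tweets = [m for m in messages if 'user' in m]\n",
    "limit_notices = [m for m in messages if 'limit' in m]\n",
    "\n",
    "print('%d tweets, %d limit notices' % (len(real_tweets), len(limit_notices)))"
   ]
  },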
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Funky Kong',\n",
       " 'April and 228 other',\n",
       " '☕️',\n",
       " 'nikoll (ia)',\n",
       " 'Iniciativa PV',\n",
       " 'josie🕷',\n",
       " 'ؘ',\n",
       " 'yaancc 🌞',\n",
       " 'livyyy :)',\n",
       " 'by gaspadin_ ll 🇹🇷🇹🇷']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Print last 10 names.\n",
    "names = [t['user']['name'] for t in tweets]\n",
    "names[-10:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "males:\n",
      "['JAMES          3.318  3.318      1',\n",
      " 'JOHN           3.271  6.589      2',\n",
      " 'ROBERT         3.143  9.732      3',\n",
      " 'MICHAEL        2.629 12.361      4',\n",
      " 'WILLIAM        2.451 14.812      5',\n",
      " 'DAVID          2.363 17.176      6',\n",
      " 'RICHARD        1.703 18.878      7',\n",
      " 'CHARLES        1.523 20.401      8',\n",
      " 'JOSEPH         1.404 21.805      9',\n",
      " 'THOMAS         1.380 23.185     10']\n",
      "females:\n",
      "['MARY           2.629  2.629      1',\n",
      " 'PATRICIA       1.073  3.702      2',\n",
      " 'LINDA          1.035  4.736      3',\n",
      " 'BARBARA        0.980  5.716      4',\n",
      " 'ELIZABETH      0.937  6.653      5',\n",
      " 'JENNIFER       0.932  7.586      6',\n",
      " 'MARIA          0.828  8.414      7',\n",
      " 'SUSAN          0.794  9.209      8',\n",
      " 'MARGARET       0.768  9.976      9',\n",
      " 'DOROTHY        0.727 10.703     10']\n"
     ]
    }
   ],
   "source": [
    "# Fetch census name data from:\n",
    "# http://www2.census.gov/topics/genealogy/1990surnames/\n",
    "import requests\n",
    "from pprint import pprint\n",
    "males_url = 'http://www2.census.gov/topics/genealogy/' + \\\n",
    "            '1990surnames/dist.male.first'\n",
    "females_url = 'http://www2.census.gov/topics/genealogy/' + \\\n",
    "              '1990surnames/dist.female.first'\n",
    "males = requests.get(males_url).text.split('\\n')\n",
    "females = requests.get(females_url).text.split('\\n')\n",
    "print('males:')\n",
    "pprint(males[:10])\n",
    "print('females:')\n",
    "pprint(females[:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1219 male and 4275 female names\n",
      "males:\n",
      "raymundo\n",
      "willis\n",
      "jose\n",
      "thurman\n",
      "galen\n",
      "darrick\n",
      "roy\n",
      "matthew\n",
      "man\n",
      "ned\n",
      "\n",
      "females:\n",
      "min\n",
      "brittaney\n",
      "tonisha\n",
      "bailey\n",
      "shani\n",
      "roma\n",
      "beth\n",
      "stefanie\n",
      "tillie\n",
      "jeanetta\n"
     ]
    }
   ],
   "source": [
    "# Get names. \n",
    "male_names = set([m.split()[0].lower() for m in males if m])\n",
    "female_names = set([f.split()[0].lower() for f in females if f])\n",
    "print('%d male and %d female names' % (len(male_names), len(female_names)))\n",
    "print('males:\\n' + '\\n'.join(list(male_names)[:10]))\n",
    "print('\\nfemales:\\n' + '\\n'.join(list(female_names)[:10]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize gender of all tweets to unknown.\n",
    "for t in tweets:\n",
    "    t['gender'] = 'unknown'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# label a Twitter user's gender by matching name list.\n",
    "import re\n",
    "def gender_by_name(tweets, male_names, female_names):\n",
    "    for t in tweets:\n",
    "        name = t['user']['name']\n",
    "        if name:\n",
    "            # remove punctuation.\n",
    "            name_parts = re.findall('\\w+', name.split()[0].lower())\n",
    "            if len(name_parts) > 0:\n",
    "                first = name_parts[0].lower()\n",
    "                if first in male_names:\n",
    "                    t['gender'] = 'male'\n",
    "                elif first in female_names:\n",
    "                    t['gender'] = 'female'\n",
    "                else:\n",
    "                    t['gender'] = 'unknown'\n",
    "\n",
    "gender_by_name(tweets, male_names, female_names)\n",
    "# What's wrong with this approach?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.27 of accounts are labeled with gender\n",
      "gender counts:\n",
      " Counter({'unknown': 715, 'female': 143, 'male': 123})\n",
      "unknown Arzunaz Üreyen\n",
      "female Ann\n",
      "unknown flexlex 🍒✨\n",
      "male jules!\n",
      "female MARLA RT\n",
      "unknown Freya 🏳️‍🌈\n",
      "female ًvictoria\n",
      "unknown Blue PandaNW\n",
      "unknown Pierrot Kwame\n",
      "unknown Öykü Su Unay\n",
      "unknown lots✈️bts wembley\n",
      "unknown 🍒\n",
      "unknown 😽😸Meow Meow March😸😽\n",
      "unknown fazli\n",
      "unknown Tyrion_Lannister\n",
      "unknown 💜~Nessa~OwO💜\n",
      "unknown محمد المولد xboxone 🎮 🖋\n",
      "unknown Cruiz_Senior\n",
      "male Carmen JMO💖🌟\n",
      "unknown tyb\n"
     ]
    }
   ],
   "source": [
    "from collections import Counter\n",
    "\n",
    "def print_genders(tweets):\n",
    "    counts = Counter([t['gender'] for t in tweets])\n",
    "    print('%.2f of accounts are labeled with gender' % \n",
    "          ((counts['male'] + counts['female']) / sum(counts.values())))\n",
    "    print('gender counts:\\n', counts)\n",
    "    for t in tweets[:20]:\n",
    "        print(t['gender'], t['user']['name'])\n",
    "    \n",
    "print_genders(tweets)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "found 331 ambiguous names:\n",
      "\n",
      "jose\n",
      "roy\n",
      "matthew\n",
      "man\n",
      "dale\n",
      "gail\n",
      "sung\n",
      "thomas\n",
      "tommie\n",
      "charlie\n",
      "claude\n",
      "chong\n",
      "stephen\n",
      "patrick\n",
      "dorian\n",
      "angelo\n",
      "jay\n",
      "lewis\n",
      "dusty\n",
      "son\n"
     ]
    }
   ],
   "source": [
    "# What about ambiguous names?\n",
    "def print_ambiguous_names(male_names, female_names):\n",
    "    ambiguous = [n for n in male_names if n in female_names]  # names on both lists\n",
    "    print('found %d ambiguous names:\\n'% len(ambiguous))\n",
    "    print('\\n'.join(ambiguous[:20]))\n",
    "    \n",
    "print_ambiguous_names(male_names, female_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "found 0 ambiguous names:\n",
      "\n",
      "\n",
      "1146 male and 4017 female names\n"
     ]
    }
   ],
   "source": [
    "# Keep names that are more frequent in one gender than the other.\n",
    "def get_percents(name_list):\n",
    "    # parse raw data to extract, e.g., the percent of males names John.\n",
    "    return dict([(n.split()[0].lower(), float(n.split()[1]))\n",
    "                  for n in name_list if n])\n",
    "\n",
    "males_pct = get_percents(males)\n",
    "females_pct = get_percents(females)\n",
    "\n",
    "# Assign a name as male if it is more common among males than femals.\n",
    "male_names = set([m for m in male_names if m not in female_names or\n",
    "              males_pct[m] > females_pct[m]])\n",
    "female_names = set([f for f in female_names if f not in male_names or\n",
    "              females_pct[f] > males_pct[f]])\n",
    "\n",
    "print_ambiguous_names(male_names, female_names)\n",
    "print('%d male and %d female names' % (len(male_names), len(female_names)))"
   ]
  },
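  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "The rule above (assign an ambiguous name to the gender in which it is more frequent) can be sketched on a toy example. The percentages below are made up for illustration and are not Census values, and `split_by_frequency` is a hypothetical helper, not part of the lecture code. As a design choice of this sketch, both output sets are built from the original inputs, and exact ties are left unlabeled."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def split_by_frequency(male_pct, female_pct):\n",
    "    # male_pct / female_pct: dicts mapping name -> percent of that gender with the name.\n",
    "    # A name missing from one dict is treated as 0 percent there.\n",
    "    male, female = set(), set()\n",
    "    for name in set(male_pct) | set(female_pct):\n",
    "        m = male_pct.get(name, 0.0)\n",
    "        f = female_pct.get(name, 0.0)\n",
    "        if m > f:\n",
    "            male.add(name)\n",
    "        elif f > m:\n",
    "            female.add(name)\n",
    "        # exact ties stay unlabeled (an assumption of this sketch)\n",
    "    return male, female\n",
    "\n",
    "# Made-up percentages, for illustration only.\n",
    "toy_male_pct = {'john': 3.3, 'jamie': 0.2, 'pat': 0.1}\n",
    "toy_female_pct = {'mary': 2.6, 'jamie': 0.5, 'pat': 0.1}\n",
    "toy_male_names, toy_female_names = split_by_frequency(toy_male_pct, toy_female_pct)\n",
    "print(sorted(toy_male_names), sorted(toy_female_names))"
   ]
  },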
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.27 of accounts are labeled with gender\n",
      "gender counts:\n",
      " Counter({'unknown': 715, 'female': 155, 'male': 111})\n",
      "unknown Arzunaz Üreyen\n",
      "female Ann\n",
      "unknown flexlex 🍒✨\n",
      "male jules!\n",
      "female MARLA RT\n",
      "unknown Freya 🏳️‍🌈\n",
      "female ًvictoria\n",
      "unknown Blue PandaNW\n",
      "unknown Pierrot Kwame\n",
      "unknown Öykü Su Unay\n",
      "unknown lots✈️bts wembley\n",
      "unknown 🍒\n",
      "unknown 😽😸Meow Meow March😸😽\n",
      "unknown fazli\n",
      "unknown Tyrion_Lannister\n",
      "unknown 💜~Nessa~OwO💜\n",
      "unknown محمد المولد xboxone 🎮 🖋\n",
      "unknown Cruiz_Senior\n",
      "female Carmen JMO💖🌟\n",
      "unknown tyb\n"
     ]
    }
   ],
   "source": [
    "# Relabel twitter users (compare with above)\n",
    "gender_by_name(tweets, male_names, female_names)\n",
    "print_genders(tweets)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('.', 4),\n",
       " ('M', 3),\n",
       " ('Fra🌸', 2),\n",
       " ('🌙', 2),\n",
       " ('💜', 2),\n",
       " ('𝘲𝘶𝘦𝘦𝘯𝘯𝘢𝘪𝘫𝘢2𝘹🥰', 2),\n",
       " ('ً', 2),\n",
       " ('Marius Black. 🦋', 2),\n",
       " ('Krggzddd', 2),\n",
       " ('Mehmet Sarıaslan', 2),\n",
       " ('Urim Ejupi', 2),\n",
       " ('Arzunaz Üreyen', 1),\n",
       " ('flexlex 🍒✨', 1),\n",
       " ('Freya 🏳️\\u200d🌈', 1),\n",
       " ('Blue PandaNW', 1),\n",
       " ('Pierrot Kwame', 1),\n",
       " ('Öykü Su Unay', 1),\n",
       " ('lots✈️bts wembley', 1),\n",
       " ('🍒', 1),\n",
       " ('😽😸Meow Meow March😸😽', 1)]"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Who are the unknowns?\n",
    "# \"Filtered\" data can have big impact on analysis.\n",
    "unknown_names = Counter(t['user']['name']\n",
    "                        for t in tweets if t['gender'] == 'unknown')\n",
    "unknown_names.most_common(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Most Common Male Terms:\n",
      "[('of', 25),\n",
      " ('and', 24),\n",
      " ('i', 23),\n",
      " ('the', 14),\n",
      " ('my', 11),\n",
      " ('in', 10),\n",
      " ('me', 9),\n",
      " ('life', 9),\n",
      " ('you', 8),\n",
      " ('for', 8)]\n",
      "\n",
      "Most Common Female Terms:\n",
      "[('i', 38),\n",
      " ('the', 32),\n",
      " ('and', 28),\n",
      " ('a', 19),\n",
      " ('my', 19),\n",
      " ('you', 17),\n",
      " ('to', 16),\n",
      " ('of', 15),\n",
      " ('is', 14),\n",
      " ('in', 13)]\n"
     ]
    }
   ],
   "source": [
    "# How do the profiles of male Twitter users differ from\n",
    "# those of female users?\n",
    "\n",
    "male_profiles = [t['user']['description'] for t in tweets\n",
    "                if t['gender'] == 'male']\n",
    "\n",
    "female_profiles = [t['user']['description'] for t in tweets\n",
    "                if t['gender'] == 'female']\n",
    "#male_profiles = [t['text'] for t in tweets\n",
    "#                if t['gender'] == 'male']\n",
    "\n",
    "#female_profiles = [t['text'] for t in tweets\n",
    "#                if t['gender'] == 'female']\n",
    "\n",
    "import re\n",
    "def tokenize(s):\n",
    "    return re.sub('\\W+', ' ', s).lower().split() if s else []\n",
    "\n",
    "male_words = Counter()\n",
    "female_words = Counter()\n",
    "\n",
    "for p in male_profiles:\n",
    "    male_words.update(Counter(tokenize(p)))\n",
    "                      \n",
    "for p in female_profiles:\n",
    "    female_words.update(Counter(tokenize(p)))\n",
    "\n",
    "print('Most Common Male Terms:')\n",
    "pprint(male_words.most_common(10))\n",
    "    \n",
    "print('\\nMost Common Female Terms:')\n",
    "pprint(female_words.most_common(10))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "764\n",
      "921\n"
     ]
    }
   ],
   "source": [
    "print(len(male_words))\n",
    "print(len(female_words))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Top Male Terms (diff):\n",
      "[('of', -10),\n",
      " ('father', -7),\n",
      " ('http', -6),\n",
      " ('for', -5),\n",
      " ('people', -5),\n",
      " ('university', -4),\n",
      " ('not', -4),\n",
      " ('about', -4),\n",
      " ('com', -4),\n",
      " ('play', -4)]\n",
      "\n",
      "Top Female Terms (diff):\n",
      "[('just', 7),\n",
      " ('she', 7),\n",
      " ('my', 8),\n",
      " ('s', 9),\n",
      " ('you', 9),\n",
      " ('to', 10),\n",
      " ('is', 11),\n",
      " ('a', 12),\n",
      " ('i', 15),\n",
      " ('the', 18)]\n"
     ]
    }
   ],
   "source": [
    "# Compute difference\n",
    "diff_counts = dict([(w, female_words[w] - male_words[w])\n",
    "                    for w in\n",
    "                    set(female_words.keys()) | set(male_words.keys())])\n",
    "\n",
    "sorted_diffs = sorted(diff_counts.items(), key=lambda x: x[1])\n",
    "\n",
    "print('Top Male Terms (diff):')\n",
    "pprint(sorted_diffs[:10])\n",
    "\n",
    "print('\\nTop Female Terms (diff):')\n",
    "pprint(sorted_diffs[-10:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "** A problem with difference of counts:**\n",
    "\n",
    "<br><br><br><br>\n",
    "What if we have more male than female words in total?\n",
    "\n",
    "<br><br><br><br>\n",
    "Instead, consider \"the probability that a male user writes the word **w**\"\n",
    "\n",
    "<br><br><br><br>\n",
    "\n",
    "$$p(w|male) = \\frac{freq(w, male)}\n",
    "{\\sum_i freq(w_i, male)} $$"
   ]
  },
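  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "To see why raw differences can mislead, here is a minimal sketch with made-up counts in which the male corpus is ten times larger than the female one. The counts, the `toy_` names, and the `to_probs` helper are assumptions for illustration only."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "# Made-up counts: the word 'the' makes up 10% of each corpus.\n",
    "toy_male_counts = Counter({'the': 100, 'other': 900})\n",
    "toy_female_counts = Counter({'the': 10, 'other': 90})\n",
    "\n",
    "def to_probs(counts):\n",
    "    # p(w | class) = freq(w, class) / total number of words in the class\n",
    "    total = sum(counts.values())\n",
    "    return {w: c / total for w, c in counts.items()}\n",
    "\n",
    "# The raw difference is large even though the word is equally typical of both classes.\n",
    "toy_diff = toy_female_counts['the'] - toy_male_counts['the']\n",
    "toy_p_male = to_probs(toy_male_counts)['the']\n",
    "toy_p_female = to_probs(toy_female_counts)['the']\n",
    "print('count difference:', toy_diff)\n",
    "print('p(the|male) = %.2f, p(the|female) = %.2f' % (toy_p_male, toy_p_female))"
   ]
  },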
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "** Odds Ratio (OR)**\n",
    "\n",
    "The ratio of the probabilities for a word from each class:\n",
    "\n",
    "$$ OR(w) = \\frac{p(w|female)}{p(w|male)} $$\n",
    "\n",
    "\n",
    "- High values --> more likely to be written by females\n",
    "- Low values --> more likely to be written by males\n"
   ]
  },
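  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "**Worked example**\n",
    "\n",
    "A minimal worked example, using rounded probabilities that this notebook prints for its own sample of tweets (another sample will give different numbers):\n",
    "\n",
    "$$ OR(i) = \\frac{p(i|female)}{p(i|male)} \\approx \\frac{0.0262}{0.0201} \\approx 1.30 $$\n",
    "\n",
    "$$ OR(of) = \\frac{p(of|female)}{p(of|male)} \\approx \\frac{0.0103}{0.0219} \\approx 0.47 $$\n",
    "\n",
    "In this sample, *i* leans female and *of* leans male; a ratio of 1 would mean the word is equally likely in both classes."
   ]
  },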
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p(w|male)\n",
      "[('of', 0.021872265966754154),\n",
      " ('and', 0.02099737532808399),\n",
      " ('i', 0.020122484689413824),\n",
      " ('the', 0.012248468941382326),\n",
      " ('my', 0.009623797025371828),\n",
      " ('in', 0.008748906386701663),\n",
      " ('me', 0.007874015748031496),\n",
      " ('life', 0.007874015748031496),\n",
      " ('you', 0.00699912510936133),\n",
      " ('for', 0.00699912510936133)]\n",
      "\n",
      "p(w|female)\n",
      "[('i', 0.02615278733654508),\n",
      " ('the', 0.02202339986235375),\n",
      " ('and', 0.019270474879559532),\n",
      " ('a', 0.01307639366827254),\n",
      " ('my', 0.01307639366827254),\n",
      " ('you', 0.01169993117687543),\n",
      " ('to', 0.011011699931176875),\n",
      " ('of', 0.01032346868547832),\n",
      " ('is', 0.009635237439779766),\n",
      " ('in', 0.008947006194081212)]\n"
     ]
    }
   ],
   "source": [
    "def counts_to_probs(gender_words):\n",
    "    \"\"\" Compute probability of each term according to the frequency\n",
    "    in a gender. \"\"\"\n",
    "    total = sum(gender_words.values())\n",
    "    return dict([(word, count / total)\n",
    "                 for word, count in gender_words.items()])\n",
    "\n",
    "male_probs = counts_to_probs(male_words)\n",
    "female_probs = counts_to_probs(female_words)\n",
    "\n",
    "print('p(w|male)')\n",
    "pprint(sorted(male_probs.items(), key=lambda x: -x[1])[:10])\n",
    "\n",
    "print('\\np(w|female)')\n",
    "pprint(sorted(female_probs.items(), key=lambda x: -x[1])[:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "ename": "KeyError",
     "evalue": "'state'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-32-04756cf63b36>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      4\u001b[0m                  set(male_probs) | set(female_probs)])\n\u001b[1;32m      5\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 6\u001b[0;31m \u001b[0mors\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0modds_ratios\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmale_probs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfemale_probs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m<ipython-input-32-04756cf63b36>\u001b[0m in \u001b[0;36modds_ratios\u001b[0;34m(male_probs, female_probs)\u001b[0m\n\u001b[1;32m      2\u001b[0m     return dict([(w, female_probs[w] / male_probs[w])\n\u001b[1;32m      3\u001b[0m                  \u001b[0;32mfor\u001b[0m \u001b[0mw\u001b[0m \u001b[0;32min\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 4\u001b[0;31m                  set(male_probs) | set(female_probs)])\n\u001b[0m\u001b[1;32m      5\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      6\u001b[0m \u001b[0mors\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0modds_ratios\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmale_probs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfemale_probs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m<ipython-input-32-04756cf63b36>\u001b[0m in \u001b[0;36m<listcomp>\u001b[0;34m(.0)\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0modds_ratios\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmale_probs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfemale_probs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      2\u001b[0m     return dict([(w, female_probs[w] / male_probs[w])\n\u001b[0;32m----> 3\u001b[0;31m                  \u001b[0;32mfor\u001b[0m \u001b[0mw\u001b[0m \u001b[0;32min\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      4\u001b[0m                  set(male_probs) | set(female_probs)])\n\u001b[1;32m      5\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mKeyError\u001b[0m: 'state'"
     ]
    }
   ],
   "source": [
    "def odds_ratios(male_probs, female_probs):\n",
    "    return dict([(w, female_probs[w] / male_probs[w])\n",
    "                 for w in\n",
    "                 set(male_probs) | set(female_probs)])\n",
    "\n",
    "ors = odds_ratios(male_probs, female_probs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "764\n",
      "921\n",
      "744 words in female_probs but not in male_probs\n",
      "arizona\n",
      "0.0006882312456985547\n"
     ]
    }
   ],
   "source": [
    "print(len(male_probs))\n",
    "print(len(female_probs))\n",
    "female_but_not_male = set(female_probs) - set(male_probs)\n",
    "print('%d words in female_probs but not in male_probs' % len(female_but_not_male))\n",
    "fem_word = list(female_but_not_male)[-10]\n",
    "print(fem_word)\n",
    "print(female_probs[fem_word])\n",
    "#'selfcare' in male_probs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "** How to deal with 0-probabilities? **\n",
    "\n",
    "$$p(w|male) = \\frac{freq(w, male)}\n",
    "{\\sum_i freq(w_i, male)} $$\n",
    "\n",
    "$freq(w, male) = 0$\n",
    "\n",
    "Do we really believe there is **0** probability of a male using this term?\n",
    "\n",
    "(Recall over-fitting discussion.)\n",
    "<br><br><br><br>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "** Additive Smoothing **\n",
    "\n",
    "Reserve small amount of counts (e.g., 1) for unseen observations.\n",
    "\n",
    "E.g., assume we've seen each word at least once in each class.\n",
    "\n",
    "$$p(w|male) = \\frac{1 + freq(w, male)}\n",
    "{|W| + \\sum_i freq(w_i, male)} $$\n",
    "\n",
    "$|W|$: number of unique words."
   ]
  },
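  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "A minimal sketch of the formula above on made-up counts. `smoothed_probs` and its `alpha` parameter are hypothetical names used for illustration; `alpha=1` corresponds to the add-one case in the formula, and the toy counts are not taken from the tweets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def smoothed_probs(counts, vocab, alpha=1):\n",
    "    # p(w | class) = (alpha + freq(w, class)) / (alpha * |W| + total words in class)\n",
    "    total = sum(counts.get(w, 0) for w in vocab)\n",
    "    denom = alpha * len(vocab) + total\n",
    "    return {w: (alpha + counts.get(w, 0)) / denom for w in vocab}\n",
    "\n",
    "# Made-up counts; 'mom' never appears in the male toy corpus.\n",
    "toy_male_freq = {'the': 3, 'father': 1}\n",
    "toy_female_freq = {'the': 2, 'mom': 2}\n",
    "toy_vocab = set(toy_male_freq) | set(toy_female_freq)\n",
    "\n",
    "toy_male_probs = smoothed_probs(toy_male_freq, toy_vocab)\n",
    "print(toy_male_probs['mom'])         # non-zero despite a zero count\n",
    "print(sum(toy_male_probs.values()))  # the smoothed probabilities still sum to about 1"
   ]
  },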
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "('of', 0.00980761976612599)\n",
      "('and', 0.00943040362127499)\n",
      "('i', 0.009053187476423991)\n",
      "('the', 0.005658242172764994)\n",
      "('my', 0.004526593738211996)\n",
      "('in', 0.004149377593360996)\n",
      "('me', 0.003772161448509996)\n",
      "('life', 0.003772161448509996)\n",
      "('you', 0.0033949453036589967)\n",
      "('for', 0.0033949453036589967)\n"
     ]
    }
   ],
   "source": [
    "# Additive smoothing. Add count of 1 for all words.\n",
    "all_words = set(male_words) | set(female_words)\n",
    "male_words.update(all_words)  \n",
    "female_words.update(all_words)\n",
    "\n",
    "male_probs = counts_to_probs(male_words)\n",
    "female_probs = counts_to_probs(female_words)\n",
    "print('\\n'.join(str(x) for x in \n",
    "                sorted(male_probs.items(), key=lambda x: -x[1])[:10]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.0003772161448509996\n"
     ]
    }
   ],
   "source": [
    "# Even though word doesn't appear, has non-zero probability.\n",
    "print(male_probs[fem_word])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Top Female Terms (OR):\n",
      "[('she', 7.162445119891929),\n",
      " ('just', 7.162445119891929),\n",
      " ('mom', 6.267139479905437),\n",
      " ('her', 5.3718338399189465),\n",
      " ('so', 4.476528199932455),\n",
      " ('fishing', 4.476528199932455),\n",
      " ('can', 4.476528199932455),\n",
      " ('women', 4.476528199932455),\n",
      " ('great', 4.476528199932455),\n",
      " ('प', 4.476528199932455),\n",
      " ('we', 4.476528199932455),\n",
      " ('21', 4.476528199932455),\n",
      " ('am', 4.02887537993921),\n",
      " ('that', 3.5812225599459646),\n",
      " ('ever', 3.5812225599459646),\n",
      " ('got', 3.5812225599459646),\n",
      " ('eu', 3.5812225599459646),\n",
      " ('public', 3.5812225599459646),\n",
      " ('त', 3.5812225599459646),\n",
      " ('sc', 3.5812225599459646)]\n",
      "\n",
      "Top Male Terms (OR):\n",
      "[('ly', 0.2984352133288304),\n",
      " ('want', 0.2984352133288304),\n",
      " ('state', 0.2238264099966228),\n",
      " ('reflect', 0.2238264099966228),\n",
      " ('𝗍𝗁𝖾', 0.2238264099966228),\n",
      " ('education', 0.2238264099966228),\n",
      " ('all', 0.2238264099966228),\n",
      " ('never', 0.2238264099966228),\n",
      " ('retired', 0.2238264099966228),\n",
      " ('owner', 0.2238264099966228),\n",
      " ('youtube', 0.2238264099966228),\n",
      " ('school', 0.2238264099966228),\n",
      " ('two', 0.2238264099966228),\n",
      " ('proud', 0.2238264099966228),\n",
      " ('some', 0.2238264099966228),\n",
      " ('university', 0.17906112799729823),\n",
      " ('about', 0.17906112799729823),\n",
      " ('play', 0.17906112799729823),\n",
      " ('people', 0.1492176066644152),\n",
      " ('father', 0.1119132049983114)]\n"
     ]
    }
   ],
   "source": [
    "ors = odds_ratios(male_probs, female_probs)\n",
    "\n",
    "sorted_ors = sorted(ors.items(), key=lambda x: -x[1])\n",
    "\n",
    "print('Top Female Terms (OR):')\n",
    "pprint(sorted_ors[:20])\n",
    "\n",
    "print('\\nTop Male Terms (OR):')\n",
    "pprint(sorted_ors[-20:])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}