{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# generating music reviews with n-grams\n",
    "\n",
    "**Motivating Question**: How 'hard' is language modeling without deep learning?\n",
    "\n",
    "My [goal](https://iconix.github.io/dl/2018/06/03/project-ideation) for the summer is to generate the best (most topical, structured, and specific) music reviews I can for new songs. How far can I push a non-deep language model towards this goal?\n",
    "\n",
    "_Language modeling_? an approach to generating text by estimating the probability distribution over sequences of linguistic units (characters, words, sentences).\n",
    "\n",
    "**A non-deep approach**: _unsmoothed maximum likelihood character-level language models_, or _n-gram language models_.\n",
    "\n",
    "[CharRNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), as popularized by Andrej Karpathy, are RNNs that learn to model the probability of the next character in a sequence, given the previous `n` characters. For more background, do check out the blog post if you haven't already!\n",
    "\n",
    "As Yoav Goldberg points out [in response](http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139) to Karpathy's post, it turns out that you can model this probability with some degree of success _without_ neural networks, for example using _unsmoothed maximum likelihood character-level language models_. Let's see how they work and how well they do.\n",
    "\n",
    "What is an **Unsmoothed Maximum Likelihood Character-Level Language Model**?\n",
    "- _Maximum Likelihood (Estimation)/MLE_? deciding to model $P(c_i \\mid h_{i,n})$ by counting and dividing. First, count the number of times each possible letter $c^*$ appeared after the current history $h$; then divide this count by the total number of letters appearing after $h$. We divide as a way to _normalize_ the count.\n",
    "- _Unsmoothed_? deciding to treat any letters $c^*$ that never follow the current $h$ as impossible (probability 0).\n",
    "- _Character-Level Language Model_? our stated goal: predict a sequence, one character at a time!\n",
    "\n",
    "We model MLE as:\n",
    "\n",
    "$$P(c_i \\mid h_{i,n})$$\n",
    "\n",
    "where $c_i$ is the next character in the sequence and $h_{i,n}$ is the _history_, or previous $n$ characters in the sequence preceding $c_i$ (i.e., $c_{i-(n-1)} ... c_{i-1}$). $n$ - the number of letters we need to guess based on - is also referred to as the _order_ of language model.\n",
    "\n",
    "What's nice about using MLE here is that this is the estimation that forms the basis for most _supervised machine learning_ - we are trying to predict $c_i$ given observations $h_{i,n}$.\n",
    "\n",
    "From now on, we'll call this model an **n-gram language model**, for short."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## n-gram model\n",
    "`train_char_lm`, `generate_letter`, and `generate_text` mostly swiped from Yoav Goldberg: [\"The unreasonable effectiveness of Character-level Language Models (and why RNNs are still cool)\"](http://nbviewer.jupyter.org/gist/yoavg/d76121dfde2618422139)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from collections import Counter, defaultdict\n",
    "from random import random\n",
    "import time\n",
    "\n",
    "def normalize(counter):\n",
    "    s = sum(counter.values())\n",
    "    return {c: cnt / s for c, cnt in counter.items()}\n",
    "\n",
    "def train_char_lm(texts, n=4):\n",
    "    start = time.time()\n",
    "    lm = defaultdict(Counter)\n",
    "    pad = '~' * n\n",
    "    # pad each new text with leading ~ so that we learn how to start\n",
    "    data = ''.join([pad + text for text in texts])\n",
    "\n",
    "    for i in range(len(data)-n):\n",
    "        history, char = data[i:i+n], data[i+n]\n",
    "        lm[history][char] += 1\n",
    "    \n",
    "    outlm = {hist: normalize(chars) for hist, chars in lm.items()}\n",
    "    \n",
    "    end = time.time()\n",
    "    print(f'Training time (textlen={len(data)-n}, n={n}): {end-start:.2f}s')\n",
    "    return outlm\n",
    "\n",
    "def generate_letter(lm, history, n):\n",
    "    '''To generate a letter, take the history, look at the last n chars,\n",
    "        and then sample a random letter based on the corresponding distribution.\n",
    "    '''\n",
    "    history = history[-n:]\n",
    "    dist = lm[history]\n",
    "    x = random()\n",
    "    for c, v in dist.items():\n",
    "        x = x - v\n",
    "        if x <= 0:\n",
    "            return c\n",
    "        \n",
    "def generate_text(lm, n, num_generate=1000):\n",
    "    history = '~' * n\n",
    "    out = []\n",
    "    for i in range(num_generate):\n",
    "        c = generate_letter(lm, history, n)\n",
    "        history = history[-n:] + c\n",
    "        out.append(c)\n",
    "    return ''.join(out)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get the music reviews:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total word_count: 241026\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0    New Music\\n\\nMt. Joy reached out to us with th...\n",
       "2    Folk rockers Mt. Joy have debuted their new so...\n",
       "4    You know we're digging Mt. Joy.\\n\\nTheir new s...\n",
       "5    Nothing against the profession, but the U.S. h...\n",
       "7    Connecticut duo **Opia** have released a guita...\n",
       "Name: content, dtype: object"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import os\n",
    "import pandas as pd\n",
    "\n",
    "BASE_DIR = os.getcwd()\n",
    "DATA_DIR = os.path.join(BASE_DIR, '..', 'datasets')\n",
    "\n",
    "blog_content_file = os.path.join(DATA_DIR, f'blog_content_sample.json')\n",
    "blog_content_df = pd.read_json(blog_content_file)\n",
    "# filter out empty or non-English content\n",
    "blog_content_df = blog_content_df.loc[(blog_content_df.word_count > 0) & (blog_content_df.lang == 'en')]\n",
    "print(f'total word_count: {sum(blog_content_df.word_count)}')\n",
    "blog_content_df.head().content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training time (textlen=1424400, n=4): 2.21s\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(blog_content_df.content, n=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'c': 0.9936421435059037,\n",
       " 'n': 0.005449591280653951,\n",
       " 'q': 0.0009082652134423251}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm['musi']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'d': 1.0}"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm['soun']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'h': 0.030612244897959183, 's': 0.9693877551020408}"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm['clas']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'\\n': 0.009836065573770493,\n",
       " ' ': 0.26885245901639343,\n",
       " \"'\": 0.003278688524590164,\n",
       " ',': 0.019672131147540985,\n",
       " '-': 0.003278688524590164,\n",
       " '.': 0.009836065573770493,\n",
       " '?': 0.003278688524590164,\n",
       " '_': 0.006557377049180328,\n",
       " 'e': 0.003278688524590164,\n",
       " 'i': 0.25901639344262295,\n",
       " 'l': 0.009836065573770493,\n",
       " 'm': 0.036065573770491806,\n",
       " 'n': 0.08852459016393442,\n",
       " 'o': 0.003278688524590164,\n",
       " 's': 0.1180327868852459,\n",
       " 'u': 0.006557377049180328,\n",
       " 'y': 0.15081967213114755}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm['part']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I had trio , who's **Moby's from that's here:\n",
      "\n",
      "9 maging on **Com Tenfjord Resolvin Murphy people do \n"
     ]
    }
   ],
   "source": [
    "print(generate_text(lm, 4, num_generate=100))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Observations**:\n",
    "\n",
    "At `n=4`, there are words (some made up, but not too many).\n",
    "\n",
    "There's not a lot of connection between the words.\n",
    "\n",
    "It doesn't really know what to with markdown formatting, so it just sticks it wherever.\n",
    "\n",
    "On longer samples, it got stuck outputting newlines for a bit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training time (textlen=1430732, n=8): 4.69s\n",
      "Who does what\n",
      "your brain just\n",
      "as necessary Evil\" or \"Secret Xtians.\"\n",
      "What's going place, is the point, 23-year-old George\n",
      "Fredericia in rural Denmark. The multi-talented and producers we now have slowly. 'Des Bisous Partout,\" Josianne Boivin (aka MUNYA) self-realization (\"I\n",
      "gotta get back.\" Recovering a period of note but most recent\n",
      "performing at SXSW. Click over to hit an anthemic power, style, and never before\n",
      "your sky is full of clouds and the follow up single, 'Coffee Shop' and seeing people interpret the video compliments that he used a makeshift studio and going to give his song gave me was the works of visionary jazz but blend of\n",
      "serene instrumental indie darling.\n",
      "\n",
      "~~~~~~~~**Rising Bristol\n",
      "\n",
      "Thu 15 February 5th 2016\n",
      "\n",
      "--\n",
      "\n",
      "**FRANKIIE** 's 'Dream Reader' filmed? ** The Death Of Our Inventional lyric\n",
      "video for\n",
      "that same week.\n",
      "\n",
      "When speaking about. Serene vocals swell over my words about the track below:\n",
      "\n",
      "3/14 - The Social Club04 Liverpool International lyrics display here, but this \n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(blog_content_df.content, n=8)\n",
    "print(generate_text(lm, 8))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Observations**:\n",
    "\n",
    "At `n=8`, the duplication expands from just words/pairs of words to phrases:\n",
    "\n",
    "    Originals: \"NEWS: EDM ARTIST KAP SLAP DELIVERS THE CURE FOR A RED-HOT VALENTINE'S DAY WITH\" + \"SHE ENTERS THE MUSICIAN IN THE BATH CLUB\" + \"RE-WATCH POTÉ'S LIVE SET IN THE JÄGERHAUS AT ALL POINTS EAST\"\n",
    "    \n",
    "    Generated: \"NEWS: EDM ARTIST KAP SLAP DELIVERS THE MUSICIAN IN THE JÄGERHAUS AT ALL POINTS EAST\"\n",
    "    \n",
    "Markdown formatting is looking more believable, but adhering also forces the model to duplicate the text inside.\n",
    "\n",
    "\"Meanwhile, the bass.\"\n",
    "\n",
    "Connection between words is better, making 'sentences' more readable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training time (textlen=1433898, n=10): 5.94s\n",
      "Stylistically analytical eye on them this year,  'The Wire' is taking it to _Mezzanine_ -era Massive Attack\n",
      "\n",
      "~~~~~~~~~~Follow on Facebook on\n",
      "both sites.\n",
      "\n",
      "Enter your password Forgot your password, you will be an accumulation\n",
      "of the emotional performed almost in silence.\n",
      "Listen below.\n",
      "\n",
      "~~~~~~~~~~Roughly one year ago, we tuned into Roisto's remix of TBE favorite song all on my own out here, by\n",
      "the people we've met and the Chemical\n",
      "Brothers.\n",
      "\n",
      "Although some of these reviews? \"Fall Into,\" a song that \n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(blog_content_df.content, n=10)\n",
    "print(generate_text(lm, 10, num_generate=500))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Observations**:\n",
    "\n",
    "At `n=10`, vocab seems more intricate, but it was hard to believe the model was responsible for this (plagiarism).\n",
    "\n",
    "It is a lot of plagiarism... but it can be interesting when it appends long phrases together into something _almost_ new:\n",
    "\n",
    "Originals (8 phrases): \"ups the risque with raw, provocative vocals\" + \"vocals as they take to the heavens\" + \"reaching for the heavens, with lucid electronics\" + \"electronics mingle against sighing\" + \"against skittering\" + \"skittering and shadowy\" + \"anthemic choruses\" + \"choruses are extremely memorable\"\n",
    "\n",
    "Generated: \"ups the risque with raw, provocative vocals as they take to the heavens, with lucid electronics mingle against skittering anthemic choruses are extremely memorable\"\n",
    "\n",
    "Whenever artist names or proper nouns in general get included, feels too specific to be relevant. Might want special handling/obfuscation for these (and e.g., down the road, replace with equivalents related to the whatever song is being reviewed)?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training time (textlen=1443396, n=16): 7.48s\n",
      "Last week was slack. Time to pick up the pace. There are already 10 songs I'm\n",
      "looking to get up this week, and in order to save time I've woven a coded\n",
      "message into the next 10 reviews.\n",
      "\n",
      "  \n",
      "\n",
      "If you don't have to battle zero degree weather.\n",
      "\n",
      "So in LA, I was feeling a vibe of happiness and freedom. I was couch surfing\n",
      "at a friends' house, so it was still tough, but when the sun comes up, it\n",
      "makes you feel like you have to act according to their press material:\n",
      "\n",
      "_ \"Follow Me Home\" is the first step\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(blog_content_df.content, n=16)\n",
    "print(generate_text(lm, 16, num_generate=500))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Observations**:\n",
    "\n",
    "by `n=16`, the model was generating such amazing results... that it had to be directly plagiarizing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initial thoughts:\n",
    "- This is fun! (\"Meanwhile, the bass.\")\n",
    "- Training is slower than I expected\n",
    "- Reviews have markdown formmating in it, which it makes it even more prone to plagiarism\n",
    "- Given the inventiveness of artist names, song names, etc. also easy to plagiarize\n",
    "\n",
    "Ways to discourage plagiarism:\n",
    "- Smoothing?\n",
    "- Bigger corpus (more character choices)?\n",
    "- Strip markdown?\n",
    "- Mask proper nouns (artists, songs, places)?\n",
    "- (Is recombining existing phrasing plagiarism, or a weak form of creativity?)\n",
    "\n",
    "Ways to encourage 'sense':\n",
    "- More memory? (downside: plagiarism)\n",
    "- Encourage proper grammer? (e.g., consider the grammar of the history when choosing next word? although unclear how that should influence next char)\n",
    "\n",
    "Post-processing engineering \"demo\" considerations:\n",
    "- replace masked proper nouns with their equivalents for the specific song\n",
    "- run generated text through a grammar check, filter out grammatical nonsense"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Perplexity\n",
    "\n",
    "_Perplexity_ is a measure of how well a model \"fits\" a test corpus. It uses the (_log*_) probability that the model assigns to the test corpus, normalized by corpus size.\n",
    "\n",
    "$$PP = e^{- \\frac{1}{N} \\sum_{i=1}^N \\log P(c_i \\mid c_1 ... c_{i−1})}$$\n",
    "\n",
    "_\\* We sum log probabilities and then exponentiate the sum to avoid numerical overflow (instead of multiplying raw probabilities)._\n",
    "\n",
    "The lower the perplexity, the greater the probability model is at predicting a sample.\n",
    "\n",
    "- \"Computers can predict letters [pretty well](https://dl.acm.org/citation.cfm?id=146685) - a perplexity of about 3.4 (from the range of all ASCII characters).\" ([source](https://planspace.org/2013/09/23/perplexity-what-it-is-and-what-yours-is/))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "def perplexity(lm, test_data, n):\n",
    "    pad = '~' * n\n",
    "    data = ''.join([pad + text for text in test_data])\n",
    "        \n",
    "    logsum = 0.0\n",
    "    unk_hist = defaultdict(int)\n",
    "    for i in range(len(data)-n):\n",
    "        history, char = data[i:i+n], data[i+n]\n",
    "        if history in lm:\n",
    "            dist = lm[history]\n",
    "        else:\n",
    "            continue # TODO: history does not exist?\n",
    "\n",
    "        if char in dist:\n",
    "            logsum += math.log(dist[char])\n",
    "        else:\n",
    "            unk_hist[history] += 1\n",
    "            \n",
    "    for h in unk_hist:\n",
    "        # aggregate histories with unknown characters, then normalize\n",
    "        s = sum(lm[h].values()) + 1\n",
    "        logsum += math.log(1 / s)\n",
    "    \n",
    "    return math.exp(-1 * logsum / len(data))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.1249022710617362"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "perplexity(lm, [\"That's a vibe\"], 16)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.2895275696051347"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "perplexity(lm, ['Folk rockers '], 16)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training time (textlen=31, n=2): 0.00s\n",
      "---\n",
      "Tha weeel cos tome hell circle. 1.3418676875883773\n",
      "The wheel has come full circle. 1.1829788187396464\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(['The wheel has come full circle.'], n=2)\n",
    "print('---')\n",
    "print('Tha weeel cos tome hell circle.', perplexity(lm, ['Tha weeel cos tome hell circle.'], 2))\n",
    "print('The wheel has come full circle.', perplexity(lm, ['The wheel has come full circle.'], 2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training time (textlen=18, n=2): 0.00s\n",
      "---\n",
      "1.0717734625362931\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(['This is the remix.'], n=2)\n",
    "print('---')\n",
    "print(perplexity(lm, ['This is the remix.'], 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generating with more training reviews and measuring perplexity on a test set\n",
    "From 2K to 30K reviews."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total word_count: 3992638\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "0    New Music\\n\\nMt. Joy reached out to us with th...\n",
       "1    Folk rockers Mt. Joy have debuted their new so...\n",
       "2    You know we're digging Mt. Joy.\\n\\nTheir new s...\n",
       "3    Nothing against the profession, but the U.S. h...\n",
       "4    Connecticut duo **Opia** have released a guita...\n",
       "Name: content, dtype: object"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "blog_content_file = os.path.join(DATA_DIR, f'blog_content_en_5yrs.json')\n",
    "blog_content_df = pd.read_json(blog_content_file)\n",
    "print(f'total word_count: {sum(blog_content_df.word_count)}')\n",
    "blog_content_df.head().content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training time (textlen=18867264, n=4): 24.39s\n"
     ]
    }
   ],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "train_text, test_text = train_test_split(blog_content_df.content, test_size=0.2, random_state=42)\n",
    "lm = train_char_lm(train_text, n=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'*': 8.958165367732689e-05,\n",
       " 'c': 0.9926543043984591,\n",
       " 'g': 0.0008958165367732689,\n",
       " 'k': 0.0017916330735465377,\n",
       " 'n': 0.003135357878706441,\n",
       " 'q': 0.0014333064588372302}"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm['musi']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'d': 0.9997865528281751, 't': 0.00021344717182497332}"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm['soun']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'h': 0.021538461538461538,\n",
       " 'm': 0.005384615384615384,\n",
       " 's': 0.9684615384615385,\n",
       " 't': 0.004615384615384616}"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm['clas']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'\\n': 0.016002098635886673,\n",
       " ' ': 0.3473242392444911,\n",
       " '\"': 0.0026232948583420775,\n",
       " \"'\": 0.002098635886673662,\n",
       " ')': 0.0018363064008394543,\n",
       " '*': 0.00026232948583420777,\n",
       " ',': 0.013378803777544596,\n",
       " '-': 0.0026232948583420775,\n",
       " '.': 0.016002098635886673,\n",
       " '/': 0.00026232948583420777,\n",
       " ':': 0.00026232948583420777,\n",
       " ';': 0.0005246589716684155,\n",
       " '?': 0.001049317943336831,\n",
       " '_': 0.003934942287513116,\n",
       " 'a': 0.00472193074501574,\n",
       " 'e': 0.005246589716684155,\n",
       " 'i': 0.18809024134312696,\n",
       " 'l': 0.007345225603357817,\n",
       " 'm': 0.02229800629590766,\n",
       " 'n': 0.05299055613850997,\n",
       " 'o': 0.0007869884575026233,\n",
       " 's': 0.10939139559286463,\n",
       " 'u': 0.029905561385099685,\n",
       " 'w': 0.00026232948583420777,\n",
       " 'y': 0.17077649527806926}"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lm['part']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Yesterのこの記事でも紹介したばかりの2016. Even the tenor _Killer records** Dim Major leading his been play, complicanted of those\n",
      "you're inforth resultry idea). If you a bitching.\n",
      "\n",
      "__You can contring on Jonest speaking piano feat.\n",
      "\n",
      " **Felix, Paris Maya Tunes ther the early deceptions, responset page soulful melodic guitar, human, the dark yet on haire via **Unsplash increditory the Jimi may come anothers sing\n",
      "with ther last years to the foundcloud reworks downtown the first doesn't wound on\n",
      "Hopeful of all-out\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(generate_text(lm, 4, num_generate=500))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "perplexity: 3.7033701536233647\n"
     ]
    }
   ],
   "source": [
    "print('perplexity:', perplexity(lm, test_text, 4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training time (textlen=18794679, n=1): 14.63s\n",
      "---\n",
      "To --rasiliselis Thu ico.1\n",
      "Mica\n",
      "Thiso animinglat th, umoue fuen'Sabo Be Eavengofofr Thrntifr oncth 19, Fambre atuomis, whe A f he ilofrok I aro, at pprang\n",
      "kily a, tht ontelothast d'sthare r tsh plofo tom\n",
      "\n",
      "onded h s itheck\"\n",
      "thickas M801Qut tod stat fras n in\n",
      "_ of mp  * hicuaiangrellarowng \n",
      "Line as. in all win m uborh llo thyongheacthafond alom Ifo vil; is -- \n",
      "\n",
      "\n",
      "dan bacowane he\n",
      "bo t ZALe'vevesuniby stitedeandedaplbe r topholedie (P, o ld med R f Way NUnstilsicsict h Houe Ch as ochig th m\n",
      "\n",
      "\n",
      "I't in \n",
      "---\n",
      "perplexity: 14.328414435306012\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(train_text, n=1)\n",
    "print('---')\n",
    "print(generate_text(lm, 1, num_generate=500))\n",
    "print('---')\n",
    "print('perplexity:', perplexity(lm, test_text, 1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training time (textlen=18818874, n=2): 15.01s\n",
      "---\n",
      "Eme frene panchisamed my 2620 Omings an of _\n",
      "\n",
      "LIND.C. Sunder, New that soun \"Sune on, words for ond belotally Wunded a beener songlentinjamet)\n",
      "\n",
      "  \n",
      "\n",
      "**EMS The fir pincem's it, thcomenturacebringes pian thisucerfordioudayet predis 1.27 ing. EP winals M8.5 - heary, he sucting ar oriout Jul losto wrong. Boy Sound **ANITTS, a to ber swer\n",
      "moverecand gook The this a kentake Dauxuarts onallinglethe fords.\n",
      "\n",
      "Purn word the rible, 2016, thent or ateding ber**den be Thir an pon 27, Adat\n",
      "flett wideo worgy, So\n",
      "---\n",
      "perplexity: 8.443115563384707\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(train_text, n=2)\n",
    "print('---')\n",
    "print(generate_text(lm, 2, num_generate=500))\n",
    "print('---')\n",
    "print('perplexity:', perplexity(lm, test_text, 2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training time (textlen=18915654, n=6): 35.34s\n",
      "**Unlike heavy\n",
      "weighties. But that I had so much more details.\n",
      "\n",
      "Thanks for Ry was, but fail to the open-heartedly\n",
      "gone.\n",
      "\n",
      "~~~~~~I never leaves you feel like comments power.\n",
      "\n",
      "A massive release that **Rams Head becoming deliciously, Baz Luhrmann-appropriate chords global assistance,\n",
      "RI * 7/14 Paris for _South Pacific genres -- \"Go\n",
      "Stupid shirt Blanco**\n",
      "and Radiohead by RAC. The tight know it's also shared a slightly-muted, but has always,\n",
      "historical treatments powerful. The duo, Brazilian producer \n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(train_text, n=6)\n",
    "print(generate_text(lm, 6, num_generate=500))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "perplexity: 2.3560839809843825\n"
     ]
    }
   ],
   "source": [
    "print('perplexity:', perplexity(lm, test_text, 6))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training time (textlen=18964044, n=8): 57.48s\n",
      "### Error. Page cannot be display on Verite has compelling her special brand, our ears will see Faker performances, on Idolator 'sYouTube | Instagram\n",
      "\n",
      "### _Related_\n",
      "\n",
      "Learn more about her\n",
      "\"cookie face.\" The song felt better,\n",
      "I promises to begin with quick drumline.\n",
      "Instead, the floor and drum and tell me if that means. One girl .... who would enjoy below.  \n",
      "\n",
      "Hear this year. Before I dive in\n",
      "deep under which\n",
      "_Alvvays_ have come a little surprise if we\n",
      "see Michl here\n",
      "\n",
      "http://is.gd/bbiWy.\n",
      "\n",
      "Atmosphere with you. I feel like this site associated with a slight for a\n",
      "mainstream it\n",
      "below.\n",
      "\n",
      "_Andrea Silva_ announcing off last year Blajk toured the Porter Robinson's voice. Still,\n",
      "it shows, the incredibly danceable by free below and started (Deepjack & Mr.Nu - Right Bestival\n",
      "10/13 New York City-based Harley Brown\n",
      "\n",
      "Every Mondays, Mansun + loads more\n",
      "\n",
      "Fav Album: Achtung Baby - U2\n",
      "\n",
      "Follow Mac Demarco__ , and now with Quavo from\n",
      "Migos, below…  \n",
      "\n",
      "Thomas Jack's youth.\n",
      "\n",
      "ABOUT\n",
      "\n",
      "  *   * \n",
      "\n",
      "~~~~~~~~It's been o\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(train_text, n=8)\n",
    "print(generate_text(lm, 8))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "perplexity: 1.7240426163438096\n"
     ]
    }
   ],
   "source": [
    "print('perplexity:', perplexity(lm, test_text, 8))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training time (textlen=19157604, n=16): 458.62s\n",
      "January 20, 2016 in stream\n",
      "\n",
      "Paperwhite first came to prominence after an early\n",
      "association with Phish, and are known for their addictive track that was\n",
      "held under the surface until the break of the dawn (1990's).\n",
      "\n",
      "**Mayer Hawthorne on: Wikipedia | Twitter | Facebook | Soundcloud | Twitter\n",
      "\n",
      "~~~~~~~~~~~~~~~~This man just doesn't stop cranking out quality.\n",
      "\n",
      "Hotel Garuda:  \n",
      "Soundcloud // Facebook // Twitter // Spotify\n",
      "\n",
      "Posted By: Joseph Noctum\n",
      "\n",
      "~~~~~~~~~~~~~~~~And now a break from the studio\n",
      "and into the hearts of many fans since the early 2000s. I checked out on the group shortly after\n",
      "\"Ladyflash\" when I discovered Girl Talk was doing their schizoid sampling\n",
      "shtick but with rap and classic rock, Local Natives know what\n",
      "distinguished themselves as an electro-house bassline, an intoxicating aural potion.\n",
      "\n",
      "The band as we now know that Honne can do lonely,\n",
      "vulnerable and/or intensely sentimental just as well as a movie soundtrack as it\n",
      "would in your local bars'\n",
      "jukebox or your public radio st\n"
     ]
    }
   ],
   "source": [
    "lm = train_char_lm(train_text, n=16)\n",
    "print(generate_text(lm, 16))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "perplexity: 1.0653669134068775\n"
     ]
    }
   ],
   "source": [
    "print('perplexity:', perplexity(lm, test_text, 16))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Smoothing [not explored]\n",
    "\n",
    "http://www.cs.utexas.edu/~mooney/cs388/slides/equation-sheet.pdf\n",
    "- \"'Hallucinate' additional training data in which each possible N-gram occurs exactly once and adjust estimates accordingly\"\n",
    "- \"Tends to reassign too much mass to unseen events, so can be adjusted to add 0<!<1 (normalized by !V instead of V).\"\n",
    "\n",
    "https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf\n",
    "- Good-Turing, Kneser-Ney"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}