{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Loading the OHHLA corpus\n",
    "This notebook shows a typical example of data loading and preprocessing necessary for NLP. In this case we are loading a corpus downloaded from the Hip-Hop Lyrics webpage [www.ohhla.com](www.ohhla.com). Our primary goal is to provide a dataset loading function for the [language modelling](todo) chapter in this book.\n",
    "\n",
    "We provide the corpus in the `data` directory. As this notebook lives in a sub-directory itself, we access it via `../data`. Before preprocessing all files and provide *generic* loaders it is useful to inspect the format of the files based on a specific example file, and work on the loading process in this context. Here we look at `/data/ohhla/train/www.ohhla.com/anonymous/j_live/SPTA/authentc.jlv.txt`.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">',\n",
       " '<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\">',\n",
       " '',\n",
       " '<head>',\n",
       " '        <meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\" />',\n",
       " '        <meta http-equiv=\"charset\" content=\"ISO-8859-1\" />',\n",
       " '        <meta http-equiv=\"content-language\" content=\"English\" />',\n",
       " '        <meta http-equiv=\"vw96.object type\" content=\"Document\" />',\n",
       " '        <meta name=\"resource-type\" content=\"document\" />',\n",
       " '        <meta name=\"distribution\" content=\"Global\" />',\n",
       " '        <meta name=\"rating\" content=\"General\" />',\n",
       " '        <meta name=\"robots\" content=\"all\" />',\n",
       " '        <meta name=\"revist-after\" content=\"2 days\" />',\n",
       " '        <link rel=\"shortcut icon\" href=\"../../../favicon.ico\" />',\n",
       " '        <title>The Original Hip-Hop (Rap) Lyrics Archive</title>',\n",
       " '',\n",
       " '        <!-- link rel=\"stylesheet\" type=\"text/css\" href=\"http://ohhla.com/files/main.css\" / -->',\n",
       " '\\t        <!-- BEGIN SITE ANALYTICS //-->',\n",
       " '        <script type=\"text/javascript\">',\n",
       " \"        if (typeof siteanalytics == 'undefined') { siteanalytics = {}; };\",\n",
       " \"        siteanalytics.gn_tracking_defaults = {google: '',comscore: {},quantcast:''};\",\n",
       " '        siteanalytics.website_id = 180;',\n",
       " \"        siteanalytics.cdn_hostname = 'cdn.siteanalytics.evolvemediametrics.com';\",\n",
       " '        </script>',\n",
       " '        <script type=\"text/javascript\" src=\\'http://cdn.siteanalytics.evolvemediametrics.com/js/siteanalytics.js\\'></script>',\n",
       " '        <!-- END SITE ANALYTICS //-->',\n",
       " '',\n",
       " '</head>',\n",
       " '',\n",
       " '<body>',\n",
       " '',\n",
       " '<a href=\"javascript: history.go(-1)\">Back to the previous page</a>',\n",
       " '',\n",
       " '<div>',\n",
       " '</div>',\n",
       " '',\n",
       " '<div style=\"width: 720px; text-align: center; color: #ff0000; font-weight: bold; font-size: 1em;\">',\n",
       " '',\n",
       " '                       <!-- AddThis Button BEGIN -->',\n",
       " '                                <div class=\"addthis_toolbox addthis_default_style\" style=\"margin: auto 0 auto 0; padding-left: 185px;\">',\n",
       " '                                <a class=\"addthis_button_facebook_like\" fb:like:layout=\"button_count\"></a>',\n",
       " '                                <a class=\"addthis_button_tweet\"></a>',\n",
       " '                                <a class=\"addthis_button_google_plusone\" g:plusone:size=\"medium\"></a>',\n",
       " '                                <a class=\"addthis_counter addthis_pill_style\"></a>',\n",
       " '                                </div>',\n",
       " '                                <script type=\"text/javascript\" src=\"http://s7.addthis.com/js/250/addthis_widget.js#pubid=ra-4e8ea9f77f69af2f\"></script>',\n",
       " '                        <!-- AddThis Button END -->',\n",
       " '',\n",
       " '</div>',\n",
       " '',\n",
       " '<br />',\n",
       " '',\n",
       " '<div style=\"float: left; min-width: 560px;\">',\n",
       " '<pre>',\n",
       " 'Artist: J-Live',\n",
       " 'Album:  All of the Above',\n",
       " 'Song:   Satisfied',\n",
       " 'Typed by: Burnout678@aol.com',\n",
       " '',\n",
       " 'Hey yo',\n",
       " 'Lights, camera, tragedy, comedy, romance',\n",
       " 'You better dance from your fighting stance',\n",
       " \"Or you'll never have a fighting chance\",\n",
       " 'In the rat race',\n",
       " \"Where the referee's son started way in advance\",\n",
       " \"But still you livin' the American Dream\",\n",
       " \"Silk PJ's, sheets and down pillows\",\n",
       " 'Who the fuck would wanna wake up?',\n",
       " 'You got it good like hot sex after the break up',\n",
       " \"Your four car garage it's just more space to take up\",\n",
       " 'You even bought your mom a new whip scrap the jalopy',\n",
       " 'Thousand dollar habit, million dollar hobby',\n",
       " 'You a success story everybody wanna copy',\n",
       " 'But few work for it, most get jerked for it',\n",
       " \"If you think that you could ignore it, you're ig-norant\",\n",
       " 'A fat wallet still never made a man free',\n",
       " 'They say to eat good, yo, you gotta swallow your pride',\n",
       " \"But dead that game plan, I'm not satisfied\",\n",
       " '',\n",
       " '[Chorus]',\n",
       " 'The poor get worked, the rich get richer',\n",
       " 'The world gets worse, do you get the picture?',\n",
       " 'The poor gets dead, the rich get depressed',\n",
       " 'The ugly get mad, the pretty get stressed',\n",
       " 'The ugly get violent, the pretty get gone',\n",
       " 'The old get stiff, the young get stepped on',\n",
       " 'Whoever told you that it was all good lied',\n",
       " 'So throw your fists up if you not satisfied',\n",
       " '',\n",
       " '{*Singing*}',\n",
       " 'Are you satisfied?',\n",
       " \"I'm not satisfied\",\n",
       " '',\n",
       " \"Hey yo, the air's still stale\",\n",
       " \"The anthrax got my Ole Earth wearin' a mask and gloves to get a meal \",\n",
       " 'I know a older guy that lost twelve close peeps on 9-1-1',\n",
       " \"While you kickin' up punchlines and puns\",\n",
       " 'Man fuck that shit, this is serious biz',\n",
       " \"By the time Bush is done, you won't know what time it is\",\n",
       " \"If it's war time or jail time, time for promises\",\n",
       " 'And time to figure out where the enemy is',\n",
       " 'The same devils that you used to love to hate',\n",
       " 'They got you so gassed and shook now, you scared to debate',\n",
       " 'The same ones that traded books for guns',\n",
       " 'Smuggled drugs for funds',\n",
       " \"And had fun lettin' off forty-one\",\n",
       " \"But now it's all about NYPD caps \",\n",
       " 'And Pentagon bumper stickers',\n",
       " 'But yo, you still a nigga',\n",
       " \"It ain't right them cops and them firemen died\",\n",
       " \"The shit is real tragic, but it damn sure ain't magic\",\n",
       " \"It won't make the brutality disappear\",\n",
       " \"It won't pull equality from behind your ear\",\n",
       " \"It won't make a difference in a two-party country\",\n",
       " 'If the president cheats, to win another four years',\n",
       " \"Now don't get me wrong, there's no place I'd rather be\",\n",
       " \"The grass ain't greener on the other genocide\",\n",
       " \"But tell Huey Freeman don't forget to cut the lawn\",\n",
       " 'And uproot the weeds',\n",
       " \"Cuz I'm not satisfied\",\n",
       " '',\n",
       " '[Chorus]',\n",
       " '',\n",
       " '{*Singing*}',\n",
       " 'All this genocide',\n",
       " 'Is not justified',\n",
       " 'Are you satisfied?',\n",
       " \"I'm not satisfied\",\n",
       " '',\n",
       " 'Yo, poison pushers making paper off of pipe dreams',\n",
       " 'They turned hip-hop to a get-rich-quick scheme',\n",
       " \"The rich minorities control the gov'ment\",\n",
       " 'But they would have you believe we on the same team',\n",
       " 'So where you stand, huh?',\n",
       " 'What do you stand for?',\n",
       " \"Sit your ass down if you don't know the answer\",\n",
       " 'Serious as cancer, this jam demands your undivided attention',\n",
       " 'Even on the dance floor',\n",
       " 'Grab the bull by the horns, the bucks by the antlers',\n",
       " \"Get yours, what're you sweatin' the next man for?\",\n",
       " 'Get down, feel good to this, let it ride',\n",
       " \"But until we all free, I'll never be satisfied\",\n",
       " '',\n",
       " '[Chorus] - Repeat 2x',\n",
       " '',\n",
       " '{*Singing with talking in background*}',\n",
       " 'Are you satisfied? ',\n",
       " '(whoever told you that it was all good lied)',\n",
       " \"I'm not satisfied \",\n",
       " '(Throw your fists up if you not satisfied)',\n",
       " 'Are you satisfied?',\n",
       " '(Whoever told you that it was all good lied)',\n",
       " \"I'm not satisfied \",\n",
       " '(So throw your fists up)',\n",
       " '(So throw your fists up)',\n",
       " '(Throw your fists up)</pre>',\n",
       " '</div>',\n",
       " '',\n",
       " '<div style=\"float: left;\">',\n",
       " '</div>',\n",
       " '',\n",
       " '</body></html>']"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open('../data/ohhla/train/www.ohhla.com/anonymous/j_live/allabove/satisfy.jlv.txt.html', 'r') as f:\n",
    "    # we use read().splitlines() instead of readlines() to skip newline characters\n",
    "    lines = f.read().splitlines()\n",
    "    \n",
    "lines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "We first would like to remove everything outside of the `<pre>` tag, and then remove the meta information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Hey yo',\n",
       " 'Lights, camera, tragedy, comedy, romance',\n",
       " 'You better dance from your fighting stance',\n",
       " \"Or you'll never have a fighting chance\",\n",
       " 'In the rat race',\n",
       " \"Where the referee's son started way in advance\",\n",
       " \"But still you livin' the American Dream\",\n",
       " \"Silk PJ's, sheets and down pillows\",\n",
       " 'Who the fuck would wanna wake up?',\n",
       " 'You got it good like hot sex after the break up']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def find_lyrics(lines):\n",
    "    filtered = []\n",
    "    in_pre = False\n",
    "    for line in lines:\n",
    "        if '<pre>' in line:\n",
    "            in_pre = True\n",
    "            filtered.append(line.replace(\"<pre>\",\"\"))\n",
    "        elif '</pre>' in line:\n",
    "            in_pre = False\n",
    "            filtered.append(line.replace(\"</pre>\",\"\"))\n",
    "        elif in_pre:\n",
    "            filtered.append(line)\n",
    "    return filtered[6:]\n",
    "    \n",
    "lyrics = find_lyrics(lines)\n",
    "lyrics[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Finally, we would like to convert the list of lines with newline characters to a single string, as this will be easier to process for our language models. We will also mark lyrical \"bars\" (lines) using a `BAR` tag to still capture the rhythmical structure in the song."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"[BAR]Hey yo[/BAR][BAR]Lights, camera, tragedy, comedy, romance[/BAR][BAR]You better dance from your fighting stance[/BAR][BAR]Or you'll never have a fighting chance[/BAR][BAR]In the rat race[/BAR][BAR]Where the referee's son started way in advance[/BAR][BAR]But still you livin' the American Dream[/BAR][BAR]Silk PJ's, sheets and down pillows[/BAR][BAR]Who the fuck would wanna wake up?[/BAR][BAR]You got it good like hot sex after the break up[/BAR][BAR]Your four car garage it's just more space to \""
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string = '[BAR]' + '[/BAR][BAR]'.join(lyrics) + '[/BAR]'\n",
    "string[:500]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "We are now ready to provide a loading function. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"[BAR]Hey yo[/BAR][BAR]Lights, camera, tragedy, comedy, romance[/BAR][BAR]You better dance from your fighting stance[/BAR][BAR]Or you'll never have a fighting chance[/BAR][BAR]In the rat race[/BAR][BAR]Where the referee's son started way in advance[/BAR][BAR]But still you livin' the American Dream[/BAR][BAR]Silk PJ's, sheets and down pillows[/BAR][BAR]Who the fuck would wanna wake up?[/BAR][BAR]You got it good like hot sex after the break up[/BAR][BAR]Your four car garage it's just more space to \""
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def load_song(file_name):\n",
    "    def load_raw(encoding):\n",
    "        with open(file_name, 'r',encoding=encoding) as f:\n",
    "            # we use read().splitlines() instead of readlines() to skip newline characters\n",
    "            lines = f.read().splitlines()   \n",
    "            # some files are pure txt files for which we don't need to extract the lyrics \n",
    "            lyrics = find_lyrics(lines) if file_name.endswith('html') else lines[5:]\n",
    "            string = '[BAR]' + '[/BAR][BAR]'.join(lyrics) + '[/BAR]'\n",
    "            return string\n",
    "    try:\n",
    "        return load_raw('utf-8')\n",
    "    except UnicodeDecodeError:\n",
    "        try:\n",
    "            return load_raw('cp1252')\n",
    "        except UnicodeDecodeError:\n",
    "            print(\"Could not load \" + file_name)\n",
    "            return \"\"\n",
    "\n",
    "        \n",
    "    \n",
    "song = load_song('../data/ohhla/train/www.ohhla.com/anonymous/j_live/allabove/satisfy.jlv.txt.html')\n",
    "song[:500]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Now we want to load several files from an album directory. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[2555, 2779, 3283]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from os import listdir\n",
    "from os.path import isfile, join\n",
    "\n",
    "def load_album(path):\n",
    "    # we filter out directories, and files that don't look like song files in OHHLA.\n",
    "    onlyfiles = [join(path, f) for f in listdir(path) if isfile(join(path, f)) and 'txt' in f]\n",
    "    lyrics = [load_song(f) for f in onlyfiles]\n",
    "    return lyrics\n",
    "\n",
    "songs = load_album('../data/ohhla/train/www.ohhla.com/anonymous/j_live/SPTA/')\n",
    "[len(s) for s in songs]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "We will also make it easy to load several albums. Then, for a few artists we provide short cuts to the album directories we care about. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "29"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def load_albums(album_paths):\n",
    "    return [song \n",
    "            for path in album_paths \n",
    "            for song in load_album(path)]\n",
    "\n",
    "top_dir = '../data/ohhla/train/www.ohhla.com/anonymous/'\n",
    "j_live = [\n",
    "    top_dir + '/j_live/allabove/',\n",
    "    top_dir + '/j_live/bestpart/'\n",
    "]\n",
    "len(load_albums(j_live))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "It will be useful to convert a list of documents into a flat list of tokens. Based on the approach showed in the [tokenisation chapter](todo) we can do this as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['[BAR]',\n",
       " 'J-Live',\n",
       " '[/BAR]',\n",
       " '[BAR]',\n",
       " 'Well',\n",
       " 'if',\n",
       " 'isn',\n",
       " \"'t\",\n",
       " 'the',\n",
       " 'outbreak',\n",
       " 'monkey',\n",
       " 'for',\n",
       " 'that',\n",
       " 'latest',\n",
       " 'epidemic',\n",
       " 'of',\n",
       " 'The',\n",
       " 'Vapors',\n",
       " '[/BAR]',\n",
       " '[BAR]']"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "token = re.compile(\"\\[BAR\\]|\\[/BAR\\]|[\\w-]+|'m|'t|'ll|'ve|'d|'s|\\'\")\n",
    "def words(docs, bars=True):\n",
    "    return [word \n",
    "            for doc in docs \n",
    "            for word in token.findall(doc)\n",
    "            if bars or word not in [\"[BAR]\", \"[/BAR]\"]]\n",
    "song_words = words(songs)\n",
    "song_words[:20]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Finally we provide a function that can load all songs within a top-level directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "50"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\n",
    "def load_all_songs(path):\n",
    "    only_files = [join(path, f) for f in listdir(path) if isfile(join(path, f)) and 'txt' in f]\n",
    "    only_paths = [join(path, f) for f in listdir(path) if not isfile(join(path, f))]\n",
    "    lyrics = [load_song(f) for f in only_files]\n",
    "    sub_songs = [song for sub_path in only_paths for song in load_all_songs(sub_path)]\n",
    "    return lyrics + sub_songs\n",
    "\n",
    "len(load_all_songs(\"../data/ohhla/train/www.ohhla.com/anonymous/j_live/\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}