{ "cells": [ { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "# Loading the OHHLA corpus\n", "This notebook shows a typical example of data loading and preprocessing necessary for NLP. In this case we are loading a corpus downloaded from the Hip-Hop Lyrics webpage [www.ohhla.com](www.ohhla.com). Our primary goal is to provide a dataset loading function for the [language modelling](todo) chapter in this book.\n", "\n", "We provide the corpus in the `data` directory. As this notebook lives in a sub-directory itself, we access it via `../data`. Before preprocessing all files and provide *generic* loaders it is useful to inspect the format of the files based on a specific example file, and work on the loading process in this context. Here we look at `/data/ohhla/train/www.ohhla.com/anonymous/j_live/SPTA/authentc.jlv.txt`. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "['<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\" \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">',\n", " '<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\">',\n", " '',\n", " '<head>',\n", " ' <meta http-equiv=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\" />',\n", " ' <meta http-equiv=\"charset\" content=\"ISO-8859-1\" />',\n", " ' <meta http-equiv=\"content-language\" content=\"English\" />',\n", " ' <meta http-equiv=\"vw96.object type\" content=\"Document\" />',\n", " ' <meta name=\"resource-type\" content=\"document\" />',\n", " ' <meta name=\"distribution\" content=\"Global\" />',\n", " ' <meta name=\"rating\" content=\"General\" />',\n", " ' <meta name=\"robots\" content=\"all\" />',\n", " ' <meta name=\"revist-after\" content=\"2 days\" />',\n", " ' <link rel=\"shortcut icon\" href=\"../../../favicon.ico\" />',\n", " ' <title>The Original Hip-Hop (Rap) Lyrics Archive</title>',\n", " '',\n", " ' <!-- link rel=\"stylesheet\" type=\"text/css\" href=\"http://ohhla.com/files/main.css\" / -->',\n", " '\\t <!-- BEGIN SITE ANALYTICS //-->',\n", " ' <script type=\"text/javascript\">',\n", " \" if (typeof siteanalytics == 'undefined') { siteanalytics = {}; };\",\n", " \" siteanalytics.gn_tracking_defaults = {google: '',comscore: {},quantcast:''};\",\n", " ' siteanalytics.website_id = 180;',\n", " \" siteanalytics.cdn_hostname = 'cdn.siteanalytics.evolvemediametrics.com';\",\n", " ' </script>',\n", " ' <script type=\"text/javascript\" src=\\'http://cdn.siteanalytics.evolvemediametrics.com/js/siteanalytics.js\\'></script>',\n", " ' <!-- END SITE ANALYTICS //-->',\n", " '',\n", " '</head>',\n", " '',\n", " '<body>',\n", " '',\n", " '<a href=\"javascript: history.go(-1)\">Back to the previous page</a>',\n", " '',\n", " '<div>',\n", " '</div>',\n", " '',\n", " '<div style=\"width: 720px; text-align: center; color: #ff0000; font-weight: bold; font-size: 1em;\">',\n", " '',\n", " ' <!-- AddThis Button BEGIN -->',\n", " ' <div class=\"addthis_toolbox addthis_default_style\" style=\"margin: auto 0 auto 0; padding-left: 185px;\">',\n", " ' <a class=\"addthis_button_facebook_like\" fb:like:layout=\"button_count\"></a>',\n", " ' <a class=\"addthis_button_tweet\"></a>',\n", " ' <a class=\"addthis_button_google_plusone\" g:plusone:size=\"medium\"></a>',\n", " ' <a class=\"addthis_counter addthis_pill_style\"></a>',\n", " ' </div>',\n", " ' <script type=\"text/javascript\" src=\"http://s7.addthis.com/js/250/addthis_widget.js#pubid=ra-4e8ea9f77f69af2f\"></script>',\n", " ' <!-- AddThis Button END -->',\n", " '',\n", " '</div>',\n", " '',\n", " '<br />',\n", " '',\n", " '<div style=\"float: left; min-width: 560px;\">',\n", " '<pre>',\n", " 'Artist: J-Live',\n", " 'Album: All of the Above',\n", " 'Song: Satisfied',\n", " 'Typed by: Burnout678@aol.com',\n", " '',\n", " 'Hey yo',\n", " 'Lights, camera, tragedy, comedy, romance',\n", " 'You better dance from your fighting stance',\n", " \"Or you'll never have a fighting chance\",\n", " 'In the rat race',\n", " \"Where the referee's son started way in advance\",\n", " \"But still you livin' the American Dream\",\n", " \"Silk PJ's, sheets and down pillows\",\n", " 'Who the fuck would wanna wake up?',\n", " 'You got it good like hot sex after the break up',\n", " \"Your four car garage it's just more space to take up\",\n", " 'You even bought your mom a new whip scrap the jalopy',\n", " 'Thousand dollar habit, million dollar hobby',\n", " 'You a success story everybody wanna copy',\n", " 'But few work for it, most get jerked for it',\n", " \"If you think that you could ignore it, you're ig-norant\",\n", " 'A fat wallet still never made a man free',\n", " 'They say to eat good, yo, you gotta swallow your pride',\n", " \"But dead that game plan, I'm not satisfied\",\n", " '',\n", " '[Chorus]',\n", " 'The poor get worked, the rich get richer',\n", " 'The world gets worse, do you get the picture?',\n", " 'The poor gets dead, the rich get depressed',\n", " 'The ugly get mad, the pretty get stressed',\n", " 'The ugly get violent, the pretty get gone',\n", " 'The old get stiff, the young get stepped on',\n", " 'Whoever told you that it was all good lied',\n", " 'So throw your fists up if you not satisfied',\n", " '',\n", " '{*Singing*}',\n", " 'Are you satisfied?',\n", " \"I'm not satisfied\",\n", " '',\n", " \"Hey yo, the air's still stale\",\n", " \"The anthrax got my Ole Earth wearin' a mask and gloves to get a meal \",\n", " 'I know a older guy that lost twelve close peeps on 9-1-1',\n", " \"While you kickin' up punchlines and puns\",\n", " 'Man fuck that shit, this is serious biz',\n", " \"By the time Bush is done, you won't know what time it is\",\n", " \"If it's war time or jail time, time for promises\",\n", " 'And time to figure out where the enemy is',\n", " 'The same devils that you used to love to hate',\n", " 'They got you so gassed and shook now, you scared to debate',\n", " 'The same ones that traded books for guns',\n", " 'Smuggled drugs for funds',\n", " \"And had fun lettin' off forty-one\",\n", " \"But now it's all about NYPD caps \",\n", " 'And Pentagon bumper stickers',\n", " 'But yo, you still a nigga',\n", " \"It ain't right them cops and them firemen died\",\n", " \"The shit is real tragic, but it damn sure ain't magic\",\n", " \"It won't make the brutality disappear\",\n", " \"It won't pull equality from behind your ear\",\n", " \"It won't make a difference in a two-party country\",\n", " 'If the president cheats, to win another four years',\n", " \"Now don't get me wrong, there's no place I'd rather be\",\n", " \"The grass ain't greener on the other genocide\",\n", " \"But tell Huey Freeman don't forget to cut the lawn\",\n", " 'And uproot the weeds',\n", " \"Cuz I'm not satisfied\",\n", " '',\n", " '[Chorus]',\n", " '',\n", " '{*Singing*}',\n", " 'All this genocide',\n", " 'Is not justified',\n", " 'Are you satisfied?',\n", " \"I'm not satisfied\",\n", " '',\n", " 'Yo, poison pushers making paper off of pipe dreams',\n", " 'They turned hip-hop to a get-rich-quick scheme',\n", " \"The rich minorities control the gov'ment\",\n", " 'But they would have you believe we on the same team',\n", " 'So where you stand, huh?',\n", " 'What do you stand for?',\n", " \"Sit your ass down if you don't know the answer\",\n", " 'Serious as cancer, this jam demands your undivided attention',\n", " 'Even on the dance floor',\n", " 'Grab the bull by the horns, the bucks by the antlers',\n", " \"Get yours, what're you sweatin' the next man for?\",\n", " 'Get down, feel good to this, let it ride',\n", " \"But until we all free, I'll never be satisfied\",\n", " '',\n", " '[Chorus] - Repeat 2x',\n", " '',\n", " '{*Singing with talking in background*}',\n", " 'Are you satisfied? ',\n", " '(whoever told you that it was all good lied)',\n", " \"I'm not satisfied \",\n", " '(Throw your fists up if you not satisfied)',\n", " 'Are you satisfied?',\n", " '(Whoever told you that it was all good lied)',\n", " \"I'm not satisfied \",\n", " '(So throw your fists up)',\n", " '(So throw your fists up)',\n", " '(Throw your fists up)</pre>',\n", " '</div>',\n", " '',\n", " '<div style=\"float: left;\">',\n", " '</div>',\n", " '',\n", " '</body></html>']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with open('../data/ohhla/train/www.ohhla.com/anonymous/j_live/allabove/satisfy.jlv.txt.html', 'r') as f:\n", " # we use read().splitlines() instead of readlines() to skip newline characters\n", " lines = f.read().splitlines()\n", " \n", "lines" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "We first would like to remove everything outside of the `<pre>` tag, and then remove the meta information." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "['Hey yo',\n", " 'Lights, camera, tragedy, comedy, romance',\n", " 'You better dance from your fighting stance',\n", " \"Or you'll never have a fighting chance\",\n", " 'In the rat race',\n", " \"Where the referee's son started way in advance\",\n", " \"But still you livin' the American Dream\",\n", " \"Silk PJ's, sheets and down pillows\",\n", " 'Who the fuck would wanna wake up?',\n", " 'You got it good like hot sex after the break up']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def find_lyrics(lines):\n", " filtered = []\n", " in_pre = False\n", " for line in lines:\n", " if '<pre>' in line:\n", " in_pre = True\n", " filtered.append(line.replace(\"<pre>\",\"\"))\n", " elif '</pre>' in line:\n", " in_pre = False\n", " filtered.append(line.replace(\"</pre>\",\"\"))\n", " elif in_pre:\n", " filtered.append(line)\n", " return filtered[6:]\n", " \n", "lyrics = find_lyrics(lines)\n", "lyrics[:10]" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Finally, we would like to convert the list of lines with newline characters to a single string, as this will be easier to process for our language models. We will also mark lyrical \"bars\" (lines) using a `BAR` tag to still capture the rhythmical structure in the song." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "\"[BAR]Hey yo[/BAR][BAR]Lights, camera, tragedy, comedy, romance[/BAR][BAR]You better dance from your fighting stance[/BAR][BAR]Or you'll never have a fighting chance[/BAR][BAR]In the rat race[/BAR][BAR]Where the referee's son started way in advance[/BAR][BAR]But still you livin' the American Dream[/BAR][BAR]Silk PJ's, sheets and down pillows[/BAR][BAR]Who the fuck would wanna wake up?[/BAR][BAR]You got it good like hot sex after the break up[/BAR][BAR]Your four car garage it's just more space to \"" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "string = '[BAR]' + '[/BAR][BAR]'.join(lyrics) + '[/BAR]'\n", "string[:500]" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "We are now ready to provide a loading function. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "\"[BAR]Hey yo[/BAR][BAR]Lights, camera, tragedy, comedy, romance[/BAR][BAR]You better dance from your fighting stance[/BAR][BAR]Or you'll never have a fighting chance[/BAR][BAR]In the rat race[/BAR][BAR]Where the referee's son started way in advance[/BAR][BAR]But still you livin' the American Dream[/BAR][BAR]Silk PJ's, sheets and down pillows[/BAR][BAR]Who the fuck would wanna wake up?[/BAR][BAR]You got it good like hot sex after the break up[/BAR][BAR]Your four car garage it's just more space to \"" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def load_song(file_name):\n", " def load_raw(encoding):\n", " with open(file_name, 'r',encoding=encoding) as f:\n", " # we use read().splitlines() instead of readlines() to skip newline characters\n", " lines = f.read().splitlines() \n", " # some files are pure txt files for which we don't need to extract the lyrics \n", " lyrics = find_lyrics(lines) if file_name.endswith('html') else lines[5:]\n", " string = '[BAR]' + '[/BAR][BAR]'.join(lyrics) + '[/BAR]'\n", " return string\n", " try:\n", " return load_raw('utf-8')\n", " except UnicodeDecodeError:\n", " try:\n", " return load_raw('cp1252')\n", " except UnicodeDecodeError:\n", " print(\"Could not load \" + file_name)\n", " return \"\"\n", "\n", " \n", " \n", "song = load_song('../data/ohhla/train/www.ohhla.com/anonymous/j_live/allabove/satisfy.jlv.txt.html')\n", "song[:500]" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "Now we want to load several files from an album directory. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "[2555, 2779, 3283]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from os import listdir\n", "from os.path import isfile, join\n", "\n", "def load_album(path):\n", " # we filter out directories, and files that don't look like song files in OHHLA.\n", " onlyfiles = [join(path, f) for f in listdir(path) if isfile(join(path, f)) and 'txt' in f]\n", " lyrics = [load_song(f) for f in onlyfiles]\n", " return lyrics\n", "\n", "songs = load_album('../data/ohhla/train/www.ohhla.com/anonymous/j_live/SPTA/')\n", "[len(s) for s in songs]" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "We will also make it easy to load several albums. Then, for a few artists we provide short cuts to the album directories we care about. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "29" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def load_albums(album_paths):\n", " return [song \n", " for path in album_paths \n", " for song in load_album(path)]\n", "\n", "top_dir = '../data/ohhla/train/www.ohhla.com/anonymous/'\n", "j_live = [\n", " top_dir + '/j_live/allabove/',\n", " top_dir + '/j_live/bestpart/'\n", "]\n", "len(load_albums(j_live))" ] }, { "cell_type": "markdown", "metadata": { "pycharm": { "name": "#%% md\n" } }, "source": [ "It will be useful to convert a list of documents into a flat list of tokens. Based on the approach showed in the [tokenisation chapter](todo) we can do this as follows:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "['[BAR]',\n", " 'J-Live',\n", " '[/BAR]',\n", " '[BAR]',\n", " 'Well',\n", " 'if',\n", " 'isn',\n", " \"'t\",\n", " 'the',\n", " 'outbreak',\n", " 'monkey',\n", " 'for',\n", " 'that',\n", " 'latest',\n", " 'epidemic',\n", " 'of',\n", " 'The',\n", " 'Vapors',\n", " '[/BAR]',\n", " '[BAR]']" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "token = re.compile(\"\\[BAR\\]|\\[/BAR\\]|[\\w-]+|'m|'t|'ll|'ve|'d|'s|\\'\")\n", "def words(docs, bars=True):\n", " return [word \n", " for doc in docs \n", " for word in token.findall(doc)\n", " if bars or word not in [\"[BAR]\", \"[/BAR]\"]]\n", "song_words = words(songs)\n", "song_words[:20]" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true, "pycharm": { "name": "#%% md\n" } }, "source": [ "Finally we provide a function that can load all songs within a top-level directory." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "pycharm": { "name": "#%%\n" } }, "outputs": [ { "data": { "text/plain": [ "50" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "def load_all_songs(path):\n", " only_files = [join(path, f) for f in listdir(path) if isfile(join(path, f)) and 'txt' in f]\n", " only_paths = [join(path, f) for f in listdir(path) if not isfile(join(path, f))]\n", " lyrics = [load_song(f) for f in only_files]\n", " sub_songs = [song for sub_path in only_paths for song in load_all_songs(sub_path)]\n", " return lyrics + sub_songs\n", "\n", "len(load_all_songs(\"../data/ohhla/train/www.ohhla.com/anonymous/j_live/\"))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }