{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Loading the OHHLA corpus\n",
    "This notebook shows a typical example of the data loading and preprocessing that NLP work requires. In this case we load a corpus downloaded from the hip-hop lyrics archive [www.ohhla.com](www.ohhla.com). Our primary goal is to provide a dataset loading function for the [language modelling](todo) chapter in this book.\n",
    "\n",
    "We provide the corpus in the `data` directory. As this notebook lives in a sub-directory itself, we access it via `../data`. Before preprocessing all files and providing *generic* loaders, it is useful to inspect the file format using one specific example file and to develop the loading process in that context. Here we look at `../data/ohhla/train/www.ohhla.com/anonymous/j_live/allabove/satisfy.jlv.txt.html`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['',\n",
       " '',\n",
       " '',\n",
       " '',\n",
       " ' ',\n",
       " ' ',\n",
       " ' ',\n",
       " ' ',\n",
       " ' ',\n",
       " ' ',\n",
       " ' ',\n",
       " ' ',\n",
       " ' ',\n",
       " ' ',\n",
       " ' The Original Hip-Hop (Rap) Lyrics Archive',\n",
       " '',\n",
       " ' ',\n",
       " '\\t ',\n",
       " ' ',\n",
       " ' ',\n",
       " ' ',\n",
       " '',\n",
       " '',\n",
       " '',\n",
       " '',\n",
       " '',\n",
       " 'Back to the previous page',\n",
       " '',\n",
       " '
',\n", " '
',\n", " '',\n", " '
',\n", " '',\n", " ' ',\n", " '
',\n", " ' ',\n", " ' ',\n", " ' ',\n", " ' ',\n", " '
',\n", " ' ',\n", " ' ',\n", " '',\n", " '
',\n", " '',\n", " '
',\n", " '',\n", " '
',\n",
       " '<pre>',\n",
       " 'Artist: J-Live',\n",
       " 'Album:  All of the Above',\n",
       " 'Song:   Satisfied',\n",
       " 'Typed by: Burnout678@aol.com',\n",
       " '',\n",
       " 'Hey yo',\n",
       " 'Lights, camera, tragedy, comedy, romance',\n",
       " 'You better dance from your fighting stance',\n",
       " \"Or you'll never have a fighting chance\",\n",
       " 'In the rat race',\n",
       " \"Where the referee's son started way in advance\",\n",
       " \"But still you livin' the American Dream\",\n",
       " \"Silk PJ's, sheets and down pillows\",\n",
       " 'Who the fuck would wanna wake up?',\n",
       " 'You got it good like hot sex after the break up',\n",
       " \"Your four car garage it's just more space to take up\",\n",
       " 'You even bought your mom a new whip scrap the jalopy',\n",
       " 'Thousand dollar habit, million dollar hobby',\n",
       " 'You a success story everybody wanna copy',\n",
       " 'But few work for it, most get jerked for it',\n",
       " \"If you think that you could ignore it, you're ig-norant\",\n",
       " 'A fat wallet still never made a man free',\n",
       " 'They say to eat good, yo, you gotta swallow your pride',\n",
       " \"But dead that game plan, I'm not satisfied\",\n",
       " '',\n",
       " '[Chorus]',\n",
       " 'The poor get worked, the rich get richer',\n",
       " 'The world gets worse, do you get the picture?',\n",
       " 'The poor gets dead, the rich get depressed',\n",
       " 'The ugly get mad, the pretty get stressed',\n",
       " 'The ugly get violent, the pretty get gone',\n",
       " 'The old get stiff, the young get stepped on',\n",
       " 'Whoever told you that it was all good lied',\n",
       " 'So throw your fists up if you not satisfied',\n",
       " '',\n",
       " '{*Singing*}',\n",
       " 'Are you satisfied?',\n",
       " \"I'm not satisfied\",\n",
       " '',\n",
       " \"Hey yo, the air's still stale\",\n",
       " \"The anthrax got my Ole Earth wearin' a mask and gloves to get a meal \",\n",
       " 'I know a older guy that lost twelve close peeps on 9-1-1',\n",
       " \"While you kickin' up punchlines and puns\",\n",
       " 'Man fuck that shit, this is serious biz',\n",
       " \"By the time Bush is done, you won't know what time it is\",\n",
       " \"If it's war time or jail time, time for promises\",\n",
       " 'And time to figure out where the enemy is',\n",
       " 'The same devils that you used to love to hate',\n",
       " 'They got you so gassed and shook now, you scared to debate',\n",
       " 'The same ones that traded books for guns',\n",
       " 'Smuggled drugs for funds',\n",
       " \"And had fun lettin' off forty-one\",\n",
       " \"But now it's all about NYPD caps \",\n",
       " 'And Pentagon bumper stickers',\n",
       " 'But yo, you still a nigga',\n",
       " \"It ain't right them cops and them firemen died\",\n",
       " \"The shit is real tragic, but it damn sure ain't magic\",\n",
       " \"It won't make the brutality disappear\",\n",
       " \"It won't pull equality from behind your ear\",\n",
       " \"It won't make a difference in a two-party country\",\n",
       " 'If the president cheats, to win another four years',\n",
       " \"Now don't get me wrong, there's no place I'd rather be\",\n",
       " \"The grass ain't greener on the other genocide\",\n",
       " \"But tell Huey Freeman don't forget to cut the lawn\",\n",
       " 'And uproot the weeds',\n",
       " \"Cuz I'm not satisfied\",\n",
       " '',\n",
       " '[Chorus]',\n",
       " '',\n",
       " '{*Singing*}',\n",
       " 'All this genocide',\n",
       " 'Is not justified',\n",
       " 'Are you satisfied?',\n",
       " \"I'm not satisfied\",\n",
       " '',\n",
       " 'Yo, poison pushers making paper off of pipe dreams',\n",
       " 'They turned hip-hop to a get-rich-quick scheme',\n",
       " \"The rich minorities control the gov'ment\",\n",
       " 'But they would have you believe we on the same team',\n",
       " 'So where you stand, huh?',\n",
       " 'What do you stand for?',\n",
       " \"Sit your ass down if you don't know the answer\",\n",
       " 'Serious as cancer, this jam demands your undivided attention',\n",
       " 'Even on the dance floor',\n",
       " 'Grab the bull by the horns, the bucks by the antlers',\n",
       " \"Get yours, what're you sweatin' the next man for?\",\n",
       " 'Get down, feel good to this, let it ride',\n",
       " \"But until we all free, I'll never be satisfied\",\n",
       " '',\n",
       " '[Chorus] - Repeat 2x',\n",
       " '',\n",
       " '{*Singing with talking in background*}',\n",
       " 'Are you satisfied? ',\n",
       " '(whoever told you that it was all good lied)',\n",
       " \"I'm not satisfied \",\n",
       " '(Throw your fists up if you not satisfied)',\n",
       " 'Are you satisfied?',\n",
       " '(Whoever told you that it was all good lied)',\n",
       " \"I'm not satisfied \",\n",
       " '(So throw your fists up)',\n",
       " '(So throw your fists up)',\n",
       " '(Throw your fists up)
',\n", " '
',\n", " '',\n", " '
',\n", " '
',\n",
       " '',\n",
       " '']"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "with open('../data/ohhla/train/www.ohhla.com/anonymous/j_live/allabove/satisfy.jlv.txt.html', 'r') as f:\n",
    "    # we use read().splitlines() instead of readlines() to drop the trailing newline characters\n",
    "    lines = f.read().splitlines()\n",
    "    \n",
    "lines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We would first like to remove everything outside the `<pre>` element, and then remove the meta information (the artist, album, song and transcriber lines)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Hey yo',\n",
       " 'Lights, camera, tragedy, comedy, romance',\n",
       " 'You better dance from your fighting stance',\n",
       " \"Or you'll never have a fighting chance\",\n",
       " 'In the rat race',\n",
       " \"Where the referee's son started way in advance\",\n",
       " \"But still you livin' the American Dream\",\n",
       " \"Silk PJ's, sheets and down pillows\",\n",
       " 'Who the fuck would wanna wake up?',\n",
       " 'You got it good like hot sex after the break up']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def find_lyrics(lines):\n",
    "    filtered = []\n",
    "    in_pre = False\n",
    "    for line in lines:\n",
    "        if '<pre>' in line:\n",
    "            in_pre = True\n",
    "            filtered.append(line.replace(\"<pre>\",\"\"))\n",
    "        elif '</pre>' in line:\n",
    "            in_pre = False\n",
    "            filtered.append(line.replace(\"</pre>\",\"\"))\n",
    "        elif in_pre:\n",
    "            filtered.append(line)\n",
    "    # skip the (now empty) <pre> line and the five meta lines:\n",
    "    # artist, album, song, typed by, and one blank line\n",
    "    return filtered[6:]\n",
    "    \n",
    "lyrics = find_lyrics(lines)\n",
    "lyrics[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we would like to convert the list of lines into a single string, as this will be easier for our language models to process. We also wrap each lyrical \"bar\" (line) in `[BAR]` and `[/BAR]` tags to preserve the rhythmic structure of the song."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"[BAR]Hey yo[/BAR][BAR]Lights, camera, tragedy, comedy, romance[/BAR][BAR]You better dance from your fighting stance[/BAR][BAR]Or you'll never have a fighting chance[/BAR][BAR]In the rat race[/BAR][BAR]Where the referee's son started way in advance[/BAR][BAR]But still you livin' the American Dream[/BAR][BAR]Silk PJ's, sheets and down pillows[/BAR][BAR]Who the fuck would wanna wake up?[/BAR][BAR]You got it good like hot sex after the break up[/BAR][BAR]Your four car garage it's just more space to \""
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string = '[BAR]' + '[/BAR][BAR]'.join(lyrics) + '[/BAR]'\n",
    "string[:500]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are now ready to provide a loading function. It first tries to decode a file as UTF-8 and falls back to cp1252 if that fails."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "\"[BAR]Hey yo[/BAR][BAR]Lights, camera, tragedy, comedy, romance[/BAR][BAR]You better dance from your fighting stance[/BAR][BAR]Or you'll never have a fighting chance[/BAR][BAR]In the rat race[/BAR][BAR]Where the referee's son started way in advance[/BAR][BAR]But still you livin' the American Dream[/BAR][BAR]Silk PJ's, sheets and down pillows[/BAR][BAR]Who the fuck would wanna wake up?[/BAR][BAR]You got it good like hot sex after the break up[/BAR][BAR]Your four car garage it's just more space to \""
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def load_song(file_name):\n",
    "    def load_raw(encoding):\n",
    "        with open(file_name, 'r', encoding=encoding) as f:\n",
    "            # we use read().splitlines() instead of readlines() to drop the trailing newline characters\n",
    "            lines = f.read().splitlines()\n",
    "        # some files are pure txt files: there is no <pre> block to extract,\n",
    "        # so we only skip the five meta lines at the top\n",
    "        lyrics = find_lyrics(lines) if file_name.endswith('html') else lines[5:]\n",
    "        string = '[BAR]' + '[/BAR][BAR]'.join(lyrics) + '[/BAR]'\n",
    "        return string\n",
    "    # try UTF-8 first, then fall back to cp1252\n",
    "    for encoding in ('utf-8', 'cp1252'):\n",
    "        try:\n",
    "            return load_raw(encoding)\n",
    "        except UnicodeDecodeError:\n",
    "            pass\n",
    "    print(\"Could not load \" + file_name)\n",
    "    return \"\"\n",
    "\n",
    "song = load_song('../data/ohhla/train/www.ohhla.com/anonymous/j_live/allabove/satisfy.jlv.txt.html')\n",
    "song[:500]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we want to load all song files from an album directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[2555, 2779, 3283]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from os import listdir\n",
    "from os.path import isfile, join\n",
    "\n",
    "def load_album(path):\n",
    "    # we filter out directories, and files that don't look like song files in OHHLA.\n",
    "    onlyfiles = [join(path, f) for f in listdir(path) if isfile(join(path, f)) and 'txt' in f]\n",
    "    lyrics = [load_song(f) for f in onlyfiles]\n",
    "    return lyrics\n",
    "\n",
    "songs = load_album('../data/ohhla/train/www.ohhla.com/anonymous/j_live/SPTA/')\n",
    "[len(s) for s in songs]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also make it easy to load several albums at once. For a few artists, we then provide shortcuts to the album directories we care about."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "29"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def load_albums(album_paths):\n",
    "    return [song \n",
    "            for path in album_paths \n",
    "            for song in load_album(path)]\n",
    "\n",
    "# top_dir already ends with a slash, so the album paths must not start with one\n",
    "top_dir = '../data/ohhla/train/www.ohhla.com/anonymous/'\n",
    "j_live = [\n",
    "    top_dir + 'j_live/allabove/',\n",
    "    top_dir + 'j_live/bestpart/'\n",
    "]\n",
    "len(load_albums(j_live))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It will be useful to convert a list of documents into a flat list of tokens. Based on the approach shown in the [tokenisation chapter](todo), we can do this as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['[BAR]',\n",
       " 'J-Live',\n",
       " '[/BAR]',\n",
       " '[BAR]',\n",
       " 'Well',\n",
       " 'if',\n",
       " 'isn',\n",
       " \"'t\",\n",
       " 'the',\n",
       " 'outbreak',\n",
       " 'monkey',\n",
       " 'for',\n",
       " 'that',\n",
       " 'latest',\n",
       " 'epidemic',\n",
       " 'of',\n",
       " 'The',\n",
       " 'Vapors',\n",
       " '[/BAR]',\n",
       " '[BAR]']"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "# a raw string avoids invalid-escape warnings for \\[ and \\w;\n",
    "# tokens are the bar tags, words (with hyphens), clitics such as 't and 's, and stray apostrophes\n",
    "token = re.compile(r\"\\[BAR\\]|\\[/BAR\\]|[\\w-]+|'m|'t|'ll|'ve|'d|'s|'\")\n",
    "def words(docs):\n",
    "    return [word \n",
    "            for doc in docs \n",
    "            for word in token.findall(doc)]\n",
    "song_words = words(songs)\n",
    "song_words[:20]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Finally, we provide a function that recursively loads all songs beneath a top-level directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "50"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def load_all_songs(path):\n",
    "    # song files directly inside this directory\n",
    "    only_files = [join(path, f) for f in listdir(path) if isfile(join(path, f)) and 'txt' in f]\n",
    "    # everything that is not a file is treated as a sub-directory and searched recursively\n",
    "    only_paths = [join(path, f) for f in listdir(path) if not isfile(join(path, f))]\n",
    "    lyrics = [load_song(f) for f in only_files]\n",
    "    sub_songs = [song for sub_path in only_paths for song in load_all_songs(sub_path)]\n",
    "    return lyrics + sub_songs\n",
    "\n",
    "len(load_all_songs(\"../data/ohhla/train/www.ohhla.com/anonymous/j_live/\"))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}