{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Loading the OHHLA corpus\n", "This notebook shows a typical example of data loading and preprocessing necessary for NLP. In this case we are loading a corpus downloaded from the Hip-Hop Lyrics webpage [www.ohhla.com](www.ohhla.com). Our primary goal is to provide a dataset loading function for the [language modelling](todo) chapter in this book.\n", "\n", "We provide the corpus in the `data` directory. As this notebook lives in a sub-directory itself, we access it via `../data`. Before preprocessing all files and provide *generic* loaders it is useful to inspect the format of the files based on a specific example file, and work on the loading process in this context. Here we look at `/data/ohhla/train/www.ohhla.com/anonymous/j_live/SPTA/authentc.jlv.txt`. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['',\n", " '',\n", " '',\n", " '
',\n", " ' ',\n", " ' ',\n", " ' ',\n", " ' ',\n", " ' ',\n", " ' ',\n", " ' ',\n", " ' ',\n", " ' ',\n", " ' ',\n", " '',\n",
" 'Artist: J-Live',\n",
" 'Album: All of the Above',\n",
" 'Song: Satisfied',\n",
" 'Typed by: Burnout678@aol.com',\n",
" '',\n",
" 'Hey yo',\n",
" 'Lights, camera, tragedy, comedy, romance',\n",
" 'You better dance from your fighting stance',\n",
" \"Or you'll never have a fighting chance\",\n",
" 'In the rat race',\n",
" \"Where the referee's son started way in advance\",\n",
" \"But still you livin' the American Dream\",\n",
" \"Silk PJ's, sheets and down pillows\",\n",
" 'Who the fuck would wanna wake up?',\n",
" 'You got it good like hot sex after the break up',\n",
" \"Your four car garage it's just more space to take up\",\n",
" 'You even bought your mom a new whip scrap the jalopy',\n",
" 'Thousand dollar habit, million dollar hobby',\n",
" 'You a success story everybody wanna copy',\n",
" 'But few work for it, most get jerked for it',\n",
" \"If you think that you could ignore it, you're ig-norant\",\n",
" 'A fat wallet still never made a man free',\n",
" 'They say to eat good, yo, you gotta swallow your pride',\n",
" \"But dead that game plan, I'm not satisfied\",\n",
" '',\n",
" '[Chorus]',\n",
" 'The poor get worked, the rich get richer',\n",
" 'The world gets worse, do you get the picture?',\n",
" 'The poor gets dead, the rich get depressed',\n",
" 'The ugly get mad, the pretty get stressed',\n",
" 'The ugly get violent, the pretty get gone',\n",
" 'The old get stiff, the young get stepped on',\n",
" 'Whoever told you that it was all good lied',\n",
" 'So throw your fists up if you not satisfied',\n",
" '',\n",
" '{*Singing*}',\n",
" 'Are you satisfied?',\n",
" \"I'm not satisfied\",\n",
" '',\n",
" \"Hey yo, the air's still stale\",\n",
" \"The anthrax got my Ole Earth wearin' a mask and gloves to get a meal \",\n",
" 'I know a older guy that lost twelve close peeps on 9-1-1',\n",
" \"While you kickin' up punchlines and puns\",\n",
" 'Man fuck that shit, this is serious biz',\n",
" \"By the time Bush is done, you won't know what time it is\",\n",
" \"If it's war time or jail time, time for promises\",\n",
" 'And time to figure out where the enemy is',\n",
" 'The same devils that you used to love to hate',\n",
" 'They got you so gassed and shook now, you scared to debate',\n",
" 'The same ones that traded books for guns',\n",
" 'Smuggled drugs for funds',\n",
" \"And had fun lettin' off forty-one\",\n",
" \"But now it's all about NYPD caps \",\n",
" 'And Pentagon bumper stickers',\n",
" 'But yo, you still a nigga',\n",
" \"It ain't right them cops and them firemen died\",\n",
" \"The shit is real tragic, but it damn sure ain't magic\",\n",
" \"It won't make the brutality disappear\",\n",
" \"It won't pull equality from behind your ear\",\n",
" \"It won't make a difference in a two-party country\",\n",
" 'If the president cheats, to win another four years',\n",
" \"Now don't get me wrong, there's no place I'd rather be\",\n",
" \"The grass ain't greener on the other genocide\",\n",
" \"But tell Huey Freeman don't forget to cut the lawn\",\n",
" 'And uproot the weeds',\n",
" \"Cuz I'm not satisfied\",\n",
" '',\n",
" '[Chorus]',\n",
" '',\n",
" '{*Singing*}',\n",
" 'All this genocide',\n",
" 'Is not justified',\n",
" 'Are you satisfied?',\n",
" \"I'm not satisfied\",\n",
" '',\n",
" 'Yo, poison pushers making paper off of pipe dreams',\n",
" 'They turned hip-hop to a get-rich-quick scheme',\n",
" \"The rich minorities control the gov'ment\",\n",
" 'But they would have you believe we on the same team',\n",
" 'So where you stand, huh?',\n",
" 'What do you stand for?',\n",
" \"Sit your ass down if you don't know the answer\",\n",
" 'Serious as cancer, this jam demands your undivided attention',\n",
" 'Even on the dance floor',\n",
" 'Grab the bull by the horns, the bucks by the antlers',\n",
" \"Get yours, what're you sweatin' the next man for?\",\n",
" 'Get down, feel good to this, let it ride',\n",
" \"But until we all free, I'll never be satisfied\",\n",
" '',\n",
" '[Chorus] - Repeat 2x',\n",
" '',\n",
" '{*Singing with talking in background*}',\n",
" 'Are you satisfied? ',\n",
" '(whoever told you that it was all good lied)',\n",
" \"I'm not satisfied \",\n",
" '(Throw your fists up if you not satisfied)',\n",
" 'Are you satisfied?',\n",
" '(Whoever told you that it was all good lied)',\n",
" \"I'm not satisfied \",\n",
" '(So throw your fists up)',\n",
" '(So throw your fists up)',\n",
" '(Throw your fists up)',\n",
" '` tag, and then remove the meta information."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['Hey yo',\n",
" 'Lights, camera, tragedy, comedy, romance',\n",
" 'You better dance from your fighting stance',\n",
" \"Or you'll never have a fighting chance\",\n",
" 'In the rat race',\n",
" \"Where the referee's son started way in advance\",\n",
" \"But still you livin' the American Dream\",\n",
" \"Silk PJ's, sheets and down pillows\",\n",
" 'Who the fuck would wanna wake up?',\n",
" 'You got it good like hot sex after the break up']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def find_lyrics(lines):\n",
" filtered = []\n",
" in_pre = False\n",
" for line in lines:\n",
" if '' in line:\n",
" in_pre = True\n",
" filtered.append(line.replace(\"\",\"\"))\n",
" elif '' in line:\n",
" in_pre = False\n",
" filtered.append(line.replace(\"\",\"\"))\n",
" elif in_pre:\n",
" filtered.append(line)\n",
" return filtered[6:]\n",
" \n",
"lyrics = find_lyrics(lines)\n",
"lyrics[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we would like to convert the list of lines with newline characters to a single string, as this will be easier to process for our language models. We will also mark lyrical \"bars\" (lines) using a `BAR` tag to still capture the rhythmical structure in the song."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"\"[BAR]Hey yo[/BAR][BAR]Lights, camera, tragedy, comedy, romance[/BAR][BAR]You better dance from your fighting stance[/BAR][BAR]Or you'll never have a fighting chance[/BAR][BAR]In the rat race[/BAR][BAR]Where the referee's son started way in advance[/BAR][BAR]But still you livin' the American Dream[/BAR][BAR]Silk PJ's, sheets and down pillows[/BAR][BAR]Who the fuck would wanna wake up?[/BAR][BAR]You got it good like hot sex after the break up[/BAR][BAR]Your four car garage it's just more space to \""
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string = '[BAR]' + '[/BAR][BAR]'.join(lyrics) + '[/BAR]'\n",
"string[:500]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are now ready to provide a loading function. "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"\"[BAR]Hey yo[/BAR][BAR]Lights, camera, tragedy, comedy, romance[/BAR][BAR]You better dance from your fighting stance[/BAR][BAR]Or you'll never have a fighting chance[/BAR][BAR]In the rat race[/BAR][BAR]Where the referee's son started way in advance[/BAR][BAR]But still you livin' the American Dream[/BAR][BAR]Silk PJ's, sheets and down pillows[/BAR][BAR]Who the fuck would wanna wake up?[/BAR][BAR]You got it good like hot sex after the break up[/BAR][BAR]Your four car garage it's just more space to \""
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def load_song(file_name):\n",
" def load_raw(encoding):\n",
" with open(file_name, 'r',encoding=encoding) as f:\n",
" # we use read().splitlines() instead of readlines() to skip newline characters\n",
" lines = f.read().splitlines() \n",
" # some files are pure txt files for which we don't need to extract the lyrics \n",
" lyrics = find_lyrics(lines) if file_name.endswith('html') else lines[5:]\n",
" string = '[BAR]' + '[/BAR][BAR]'.join(lyrics) + '[/BAR]'\n",
" return string\n",
" try:\n",
" return load_raw('utf-8')\n",
" except UnicodeDecodeError:\n",
" try:\n",
" return load_raw('cp1252')\n",
" except UnicodeDecodeError:\n",
" print(\"Could not load \" + file_name)\n",
" return \"\"\n",
"\n",
" \n",
" \n",
"song = load_song('../data/ohhla/train/www.ohhla.com/anonymous/j_live/allabove/satisfy.jlv.txt.html')\n",
"song[:500]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we want to load several files from an album directory. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[2555, 2779, 3283]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from os import listdir\n",
"from os.path import isfile, join\n",
"\n",
"def load_album(path):\n",
" # we filter out directories, and files that don't look like song files in OHHLA.\n",
" onlyfiles = [join(path, f) for f in listdir(path) if isfile(join(path, f)) and 'txt' in f]\n",
" lyrics = [load_song(f) for f in onlyfiles]\n",
" return lyrics\n",
"\n",
"songs = load_album('../data/ohhla/train/www.ohhla.com/anonymous/j_live/SPTA/')\n",
"[len(s) for s in songs]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will also make it easy to load several albums. Then, for a few artists we provide short cuts to the album directories we care about. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"29"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def load_albums(album_paths):\n",
" return [song \n",
" for path in album_paths \n",
" for song in load_album(path)]\n",
"\n",
"top_dir = '../data/ohhla/train/www.ohhla.com/anonymous/'\n",
"j_live = [\n",
" top_dir + '/j_live/allabove/',\n",
" top_dir + '/j_live/bestpart/'\n",
"]\n",
"len(load_albums(j_live))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It will be useful to convert a list of documents into a flat list of tokens. Based on the approach showed in the [tokenisation chapter](todo) we can do this as follows:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['[BAR]',\n",
" 'J-Live',\n",
" '[/BAR]',\n",
" '[BAR]',\n",
" 'Well',\n",
" 'if',\n",
" 'isn',\n",
" \"'t\",\n",
" 'the',\n",
" 'outbreak',\n",
" 'monkey',\n",
" 'for',\n",
" 'that',\n",
" 'latest',\n",
" 'epidemic',\n",
" 'of',\n",
" 'The',\n",
" 'Vapors',\n",
" '[/BAR]',\n",
" '[BAR]']"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import re\n",
"token = re.compile(\"\\[BAR\\]|\\[/BAR\\]|[\\w-]+|'m|'t|'ll|'ve|'d|'s|\\'\")\n",
"def words(docs):\n",
" return [word \n",
" for doc in docs \n",
" for word in token.findall(doc)]\n",
"song_words = words(songs)\n",
"song_words[:20]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Finally we provide a function that can load all songs within a top-level directory."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"50"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\n",
"def load_all_songs(path):\n",
" only_files = [join(path, f) for f in listdir(path) if isfile(join(path, f)) and 'txt' in f]\n",
" only_paths = [join(path, f) for f in listdir(path) if not isfile(join(path, f))]\n",
" lyrics = [load_song(f) for f in only_files]\n",
" sub_songs = [song for sub_path in only_paths for song in load_all_songs(sub_path)]\n",
" return lyrics + sub_songs\n",
"\n",
"len(load_all_songs(\"../data/ohhla/train/www.ohhla.com/anonymous/j_live/\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}