{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Web Scraping — Part 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this lesson, we're going to learn more about scraping data with the Python libraries requests and BeautifulSoup. We're also going to introduce *regular expressions*, which will help us extract and clean data in a more fine-grained way."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will cover how to:\n",
    "\n",
    "* Programmatically access the text of a web page\n",
    "* Extract certain HTML elements\n",
    "* Extract and clean data with regular expressions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our goal for this lesson is to build a web scraping function called `get_all_songs_from_album` which will get all the song titles from any album on Genius.com."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Import Requests and BeautifulSoup**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're again going to use the `requests` library and the `BeautifulSoup` library to scrape our list of album songs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "from bs4 import BeautifulSoup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../images/Missy-Under-Construction.png\" class=\"center\" >"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " The first album that we're going to scrape is Missy Elliott's \"Under Construction\" (2002), which debuted at No. 3 on The Billboard Top 200 charts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <iframe\n",
       "            width=\"500\"\n",
       "            height=\"400\"\n",
       "            src=\"https://www.youtube.com/embed/cjIvu7e6Wq8?start=31\"\n",
       "            frameborder=\"0\"\n",
       "            allowfullscreen\n",
       "        ></iframe>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.lib.display.IFrame at 0x106d09c90>"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import IFrame\n",
    "IFrame(\"https://www.youtube.com/embed/cjIvu7e6Wq8?start=31\", width='500', height='400')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Get HTML Data and Extract Text**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = requests.get(\"https://genius.com/albums/Missy-elliott/Under-construction\")\n",
    "html_string = response.text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Transform into BeautifulSoup Document**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "document = BeautifulSoup(html_string, \"html.parser\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Your Turn!**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to extract just the song titles from Missy Elliott's album \"Under Construction.\" Turn on your web browser's \"Inspect\" function and find the HTML tag associated with each song title.\n",
    "\n",
    "[https://genius.com/albums/Missy-elliott/Under-construction](https://genius.com/albums/Missy-elliott/Under-construction)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select \"Click to show\" to see the answer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "tags": [
     "output_scroll",
     "hide-cell"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<h3 class=\"chart_row-content-title\">\n",
       "               Intro/Go To The Floor\n",
       "               <span class=\"chart_row-content-title-subtitle\">Lyrics</span>\n",
       " </h3>,\n",
       " <h3 class=\"chart_row-content-title\">\n",
       "               Bring the Pain (Ft. Method Man)\n",
       "               <span class=\"chart_row-content-title-subtitle\">Lyrics</span>\n",
       " </h3>,\n",
       " <h3 class=\"chart_row-content-title\">\n",
       "               Gossip Folks (Ft. Ludacris)\n",
       "               <span class=\"chart_row-content-title-subtitle\">Lyrics</span>\n",
       " </h3>,\n",
       " <h3 class=\"chart_row-content-title\">\n",
       "               Work It\n",
       "               <span class=\"chart_row-content-title-subtitle\">Lyrics</span>\n",
       " </h3>,\n",
       " <h3 class=\"chart_row-content-title\">\n",
       "               Back in the Day (Ft. JAY-Z)\n",
       "               <span class=\"chart_row-content-title-subtitle\">Lyrics</span>\n",
       " </h3>,\n",
       " <h3 class=\"chart_row-content-title\">\n",
       "               Funky Fresh Dressed (Ft. Ms. Jade)\n",
       "               <span class=\"chart_row-content-title-subtitle\">Lyrics</span>\n",
       " </h3>,\n",
       " <h3 class=\"chart_row-content-title\">\n",
       "               Pussycat\n",
       "               <span class=\"chart_row-content-title-subtitle\">Lyrics</span>\n",
       " </h3>,\n",
       " <h3 class=\"chart_row-content-title\">\n",
       "               Nothing Out There for Me (Ft. Beyoncé)\n",
       "               <span class=\"chart_row-content-title-subtitle\">Lyrics</span>\n",
       " </h3>,\n",
       " <h3 class=\"chart_row-content-title\">\n",
       "               Slide\n",
       "               <span class=\"chart_row-content-title-subtitle\">Lyrics</span>\n",
       " </h3>,\n",
       " <h3 class=\"chart_row-content-title\">\n",
       "               Play That Beat\n",
       "               <span class=\"chart_row-content-title-subtitle\">Lyrics</span>\n",
       " </h3>,\n",
       " <h3 class=\"chart_row-content-title\">\n",
       "               Ain't That Funny\n",
       "               <span class=\"chart_row-content-title-subtitle\">Lyrics</span>\n",
       " </h3>,\n",
       " <h3 class=\"chart_row-content-title\">\n",
       "               Hot\n",
       "               <span class=\"chart_row-content-title-subtitle\">Lyrics</span>\n",
       " </h3>,\n",
       " <h3 class=\"chart_row-content-title\">\n",
       "               Can You Hear Me (Ft. TLC)\n",
       "               <span class=\"chart_row-content-title-subtitle\">Lyrics</span>\n",
       " </h3>,\n",
       " <h3 class=\"chart_row-content-title\">\n",
       "               Work It (Remix) (Ft. 50 Cent)\n",
       "               <span class=\"chart_row-content-title-subtitle\">Lyrics</span>\n",
       " </h3>,\n",
       " <h3 class=\"annotation_label\">Does this album have any certifications?</h3>,\n",
       " <h3 class=\"annotation_label\">How did this album chart?</h3>,\n",
       " <h3>ICONOLOGY</h3>,\n",
       " <h3>Misdemeanor</h3>]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "song_title_tags = document.find_all(\"h3\")\n",
    "song_title_tags"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now write a `for` loop that extracts the text from each song title tag and `.appends()` it to a list called `song_titles`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "tags": [
     "hide_input"
    ]
   },
   "outputs": [],
   "source": [
    "song_titles = []\n",
    "for song in song_title_tags:\n",
    "    song_title_missy = song.text\n",
    "    song_titles.append(song_title_missy)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['\\n              Intro/Go To The Floor\\n              Lyrics\\n',\n",
       " '\\n              Bring the Pain (Ft.\\xa0Method\\xa0Man)\\n              Lyrics\\n',\n",
       " '\\n              Gossip Folks (Ft.\\xa0Ludacris)\\n              Lyrics\\n',\n",
       " '\\n              Work It\\n              Lyrics\\n',\n",
       " '\\n              Back in the Day (Ft.\\xa0JAY-Z)\\n              Lyrics\\n',\n",
       " '\\n              Funky Fresh Dressed (Ft.\\xa0Ms.\\xa0Jade)\\n              Lyrics\\n',\n",
       " '\\n              Pussycat\\n              Lyrics\\n',\n",
       " '\\n              Nothing Out There for Me (Ft.\\xa0Beyoncé)\\n              Lyrics\\n',\n",
       " '\\n              Slide\\n              Lyrics\\n',\n",
       " '\\n              Play That Beat\\n              Lyrics\\n',\n",
       " \"\\n              Ain't That Funny\\n              Lyrics\\n\",\n",
       " '\\n              Hot\\n              Lyrics\\n',\n",
       " '\\n              Can You Hear Me (Ft.\\xa0TLC)\\n              Lyrics\\n',\n",
       " '\\n              Work It (Remix) (Ft.\\xa050\\xa0Cent)\\n              Lyrics\\n',\n",
       " 'Does this album have any certifications?',\n",
       " 'How did this album chart?',\n",
       " 'ICONOLOGY',\n",
       " 'Misdemeanor']"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "song_titles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now transform that same `for` loop into a list comprehension."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "tags": [
     "hide_input"
    ]
   },
   "outputs": [],
   "source": [
    "missy_song_titles = [song.text for song in song_title_tags]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['\\n              Intro/Go To The Floor\\n              Lyrics\\n',\n",
       " '\\n              Bring the Pain (Ft.\\xa0Method\\xa0Man)\\n              Lyrics\\n',\n",
       " '\\n              Gossip Folks (Ft.\\xa0Ludacris)\\n              Lyrics\\n',\n",
       " '\\n              Work It\\n              Lyrics\\n',\n",
       " '\\n              Back in the Day (Ft.\\xa0JAY-Z)\\n              Lyrics\\n',\n",
       " '\\n              Funky Fresh Dressed (Ft.\\xa0Ms.\\xa0Jade)\\n              Lyrics\\n',\n",
       " '\\n              Pussycat\\n              Lyrics\\n',\n",
       " '\\n              Nothing Out There for Me (Ft.\\xa0Beyoncé)\\n              Lyrics\\n',\n",
       " '\\n              Slide\\n              Lyrics\\n',\n",
       " '\\n              Play That Beat\\n              Lyrics\\n',\n",
       " \"\\n              Ain't That Funny\\n              Lyrics\\n\",\n",
       " '\\n              Hot\\n              Lyrics\\n',\n",
       " '\\n              Can You Hear Me (Ft.\\xa0TLC)\\n              Lyrics\\n',\n",
       " '\\n              Work It (Remix) (Ft.\\xa050\\xa0Cent)\\n              Lyrics\\n',\n",
       " 'Does this album have any certifications?',\n",
       " 'How did this album chart?',\n",
       " 'ICONOLOGY',\n",
       " 'Misdemeanor']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "missy_song_titles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Are there things in your list that aren't song titles? If so, use `.find_all()` with more specific HTML attributes `attrs={}`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 303,
   "metadata": {},
   "outputs": [],
   "source": [
    "song_title_tags = document.find_all(\"h3\", attrs={\"class\": \"chart_row-content-title\"})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "tags": [
     "hide_input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['\\n              Intro/Go To The Floor\\n              Lyrics\\n',\n",
       " '\\n              Bring the Pain (Ft.\\xa0Method\\xa0Man)\\n              Lyrics\\n',\n",
       " '\\n              Gossip Folks (Ft.\\xa0Ludacris)\\n              Lyrics\\n',\n",
       " '\\n              Work It\\n              Lyrics\\n',\n",
       " '\\n              Back in the Day (Ft.\\xa0JAY-Z)\\n              Lyrics\\n',\n",
       " '\\n              Funky Fresh Dressed (Ft.\\xa0Ms.\\xa0Jade)\\n              Lyrics\\n',\n",
       " '\\n              Pussycat\\n              Lyrics\\n',\n",
       " '\\n              Nothing Out There for Me (Ft.\\xa0Beyoncé)\\n              Lyrics\\n',\n",
       " '\\n              Slide\\n              Lyrics\\n',\n",
       " '\\n              Play That Beat\\n              Lyrics\\n',\n",
       " \"\\n              Ain't That Funny\\n              Lyrics\\n\",\n",
       " '\\n              Hot\\n              Lyrics\\n',\n",
       " '\\n              Can You Hear Me (Ft.\\xa0TLC)\\n              Lyrics\\n',\n",
       " '\\n              Work It (Remix) (Ft.\\xa050\\xa0Cent)\\n              Lyrics\\n',\n",
       " '\\n              Pepsi Super Bowl XLIX Halftime Show by\\xa0NFL (Ft.\\xa0Katy\\xa0Perry, Lenny\\xa0Kravitz & Missy\\xa0Elliott)\\n              Lyrics\\n']"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    " missy_song_titles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regular Expressions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Great! Now we have have our list of song titles from Missy Elliot's album \"Under Construction.\" But if you notice, these song titles are pretty messy, and we need to clean them up.\n",
    "\n",
    "To do so, we're going to use built-in string methods and a Python library called `re`, short for regular expressions.\n",
    "\n",
    "Regular expressions are basically like a very sophisticated find-and-replace. Regular expression are not exclusive to Python and are used in many programming languages as well as in search engines, text editors, and word processors."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import Regular Expressions Library"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To practice with regular expressions, we're going to use a sample messy song title from our messy song titles list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_song = \"\\n              Back in the Day (Ft.\\xa0JAY-Z)\\n              Lyrics\\n\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Replace with Python String Method**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember the string method `.replace()`? With this built-in string method, we can easily get rid of the new line characters `\\n` or the word \"Lyrics\" from our `sample_song`, which is very useful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'              Back in the Day (Ft.\\xa0JAY-Z)              Lyrics'"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_song.replace(\"\\n\", \"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 291,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\n              Back in the Day (Ft.\\xa0JAY-Z)\\n              \\n'"
      ]
     },
     "execution_count": 291,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_song.replace(\"Lyrics\", \"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Replace with Regex"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, with regular expressions, we can replace strings with even more power and flexibility."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To replace a string with regular expressions, we use `re.sub(old_pattern, new_pattern, text_that_contains_pattern)`. We can do exactly the same thing that we did with the built-in string method `.replace()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\n              Back in the Day (Ft.\\xa0JAY-Z)\\n              Lyrics\\n'"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sample_song"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 294,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'              Back in the Day (Ft.\\xa0JAY-Z)              Lyrics'"
      ]
     },
     "execution_count": 294,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.sub(\"\\n\", \"\", sample_song)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Special RegEx Characters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "But regular expressions have certain characters with special pattern-matching powers, which is what allows us to do more cleaning, manipulating, and searching than with basic string methods. Below are some of the special regular expression characters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "| Regular Expression Pattern       | Matches |\n",
    "|:---------------------------:|:-----------------------------------------------------------------------------------------------------------:|\n",
    "| `.` | any character                                         | \n",
    "| `\\w` | word                                         | \n",
    "| `\\W`                      | NOT word                                           |  \n",
    "| `\\d` | digit                                         | \n",
    "| `\\D`                      | NOT digit                                           | \n",
    "| `\\s` | whitespace                                         | \n",
    "| `\\S`                      | NOT whitespace                                          | \n",
    "| `[abc]`                      | Any of abc                                         |\n",
    "| `[^abc]`                      | Not any of abc                                         | \n",
    "| `(abc)`                      | Specific capture of \"abc\"                                         \n",
    "| `+`                      | 1 or more instances                                       | \n",
    "| `*`                      | 0 or more instances                                         | \n",
    "| `?`                      | 0 or 1 instance                                        | \n",
    "                   \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can explore and experiment with regular expression characters and combinations at [Regexr.com](https://regexr.com/4vhf1)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can replace anything that is not a word `\\W` with \" \":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'               Back in the Day  Ft  JAY Z                Lyrics '"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.sub(\"\\W\", \" \", sample_song)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Replace anything that is a word `\\w` with \" \":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'\\n                              (  .\\xa0   - )\\n                    \\n'"
      ]
     },
     "execution_count": 90,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.sub(\"\\w\", \" \", sample_song)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The character `+` means \"match one or more instance\" of the pattern, which allows us to remove multiple not word patterns in a row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 295,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "' Back in the Day Ft JAY Z Lyrics '"
      ]
     },
     "execution_count": 295,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.sub(\"\\W+\", \" \", sample_song)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compile Pattern"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An efficient way to build and save a regular expression pattern is with `re.compile()`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 244,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "' Back in the Day Ft JAY Z Lyrics '"
      ]
     },
     "execution_count": 244,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "not_word_pattern = re.compile(\"\\W+\")\n",
    "re.sub(not_word_pattern, \" \", sample_song)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Search for Pattern"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In addition to replacing text, we can also find and return text. With `re.search()`, we can find and return any particular pattern. The `re.search()` function returns something called a \"match object,\" which we can access with `.group()`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, searching with the pattern `\\w+` will return the very first word in `sample_song`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<re.Match object; span=(15, 19), match='Back'>"
      ]
     },
     "execution_count": 117,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_pattern = re.compile(\"\\w+\")\n",
    "word_pattern.search(sample_song)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Back'"
      ]
     },
     "execution_count": 112,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_pattern.search(sample_song).group(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Find All Instances of Pattern"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function `re.findall()` will return a list of every instance of a particular pattern."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Back', 'in', 'the', 'Day', 'Ft', 'JAY', 'Z', 'Lyrics']"
      ]
     },
     "execution_count": 127,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "word_pattern = re.compile(\"\\w+\")\n",
    "word_pattern.findall(sample_song)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When you combine special regular expression characters, you can make your pattern matching very specific and very powerful. If we had a document that contains a bunch of email addresses, we could use the pattern `[\\w.]+@[\\w.]+` to find and extract the words that appear on other side of the `@` character, aka [find and extract all the email addresses](https://regexr.com/4vhfa)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 191,
   "metadata": {},
   "outputs": [],
   "source": [
    "text_with_emails = \"The important email addresses are important@cool.com, signficant@sweet.org\"\n",
    "extracted_emails = re.findall('[\\w.]+@[\\w.]+', text_with_emails)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 192,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['important@cool.com', 'signficant@sweet.org']"
      ]
     },
     "execution_count": 192,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "extracted_emails"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Match Before a Certain String"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For our song titles, we might want to [extract everything that comes before \"(Ft.)\"](regexr.com/4vhfg) because we don't care as much about the featured artists, and because the featured artists makes the song titles really long. To match everything that comes before a certain string, we can use the pattern `.*(?=desired_pattern)` which matches 0 or more `*` of any character `.` that comes before `(?=)` the string \"desired_pattern.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 276,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'              Back in the Day ('"
      ]
     },
     "execution_count": 276,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "before_ft_pattern = re.compile(\".*(?=Ft)\")\n",
    "before_ft_pattern.search(sample_song).group(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Backslash Escape Characters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Nice! We got everything before the featured artist. Well, almost. We still have a weird, lingering open parentheses. That's because we were matching \"Ft\" not \"(Ft\". Let's match everything before \"(Ft\" instead.\n",
    "\n",
    "To do so, we're going to have to make a slight adjustment. Remember that parentheses `()` are special regular expression characters. To make clear that we mean a literal parentheses and not a special regular expression character, we have to use an escape backslash `\\` before the character. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 277,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'              Back in the Day '"
      ]
     },
     "execution_count": 277,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "before_ft_pattern = re.compile(\".*(?=\\(Ft)\")\n",
    "clean_sample_song_title = before_ft_pattern.search(sample_song).group(0)\n",
    "clean_sample_song_title "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Strip Leading and Trailing Whitespace"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last thing we'll do to clean up our song title is to use the built-in string method `.strip()` which strips leading and trailing whitespace."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 260,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Back in the Day'"
      ]
     },
     "execution_count": 260,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clean_sample_song_title.strip()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build Functions and Put It All Together"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's put all this cleaning together in a function called `clean_up`. It will match and strip everything before \"(Ft.)\" if the song title contains a featured artist, and it will remove the word \"Lyrics\" and strip whitespace if the song title does not contain \"(Ft.)\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 306,
   "metadata": {},
   "outputs": [],
   "source": [
    "def clean_up(song_title):\n",
    "\n",
    "    if \"Ft\" in song_title:\n",
    "        before_ft_pattern = re.compile(\".*(?=\\(Ft)\")\n",
    "        song_title_before_ft = before_ft_pattern.search(song_title).group(0)\n",
    "        clean_song_title = song_title_before_ft.strip()\n",
    "    \n",
    "    else:\n",
    "        song_title_no_lyrics = song_title.replace(\"Lyrics\", \"\")\n",
    "        clean_song_title = song_title_no_lyrics.strip()\n",
    "    \n",
    "    return clean_song_title"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 307,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Intro/Go To The Floor',\n",
       " 'Bring the Pain',\n",
       " 'Gossip Folks',\n",
       " 'Work It',\n",
       " 'Back in the Day',\n",
       " 'Funky Fresh Dressed',\n",
       " 'Pussycat',\n",
       " 'Nothing Out There for Me',\n",
       " 'Slide',\n",
       " 'Play That Beat',\n",
       " \"Ain't That Funny\",\n",
       " 'Hot',\n",
       " 'Can You Hear Me',\n",
       " 'Work It (Remix)',\n",
       " 'Pepsi Super Bowl XLIX Halftime Show by\\xa0NFL']"
      ]
     },
     "execution_count": 307,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[clean_up(song) for song in missy_song_titles]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We were able to extract the song titles for Missy Elliott's album \"Under Construction.\" Success! But now we want to make a function that can do the same thing for any artist and album title.\n",
    "\n",
    "Take a look at [Beyonce's album \"Lemonade\"](https://genius.com/albums/Beyonce/Lemonade) on Genius.com and see how the web page compares to Missy Elliott's \"Under Construction.\" They look extremely similar, right? Because all Genius album pages are identical, we can use the same code that we did for Missy Elliott and just substitute in different artist and album names with an f-string for the Genius URL:\n",
    "\n",
    "`f\"https://genius.com/albums/{artist}/{album_name}\"`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 308,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_all_songs_from_album(artist, album_name):\n",
    "    \n",
    "    artist = artist.replace(\" \", \"-\")\n",
    "    album_name = album_name.replace(\" \", \"-\")\n",
    "    \n",
    "    response = requests.get(f\"https://genius.com/albums/{artist}/{album_name}\")\n",
    "    html_string = response.text\n",
    "    document = BeautifulSoup(html_string, \"html.parser\")\n",
    "    song_title_tags = document.find_all(\"h3\", attrs={\"class\": \"chart_row-content-title\"})\n",
    "    song_titles = [song_title.text for song_title in song_title_tags]\n",
    "    \n",
    "    clean_songs = []\n",
    "    for song_title in song_titles:\n",
    "        clean_song = clean_up(song_title)\n",
    "        clean_songs.append(clean_song)\n",
    "        \n",
    "    return clean_songs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 309,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Pray You Catch Me',\n",
       " 'Hold Up',\n",
       " \"Don't Hurt Yourself\",\n",
       " 'Sorry',\n",
       " '6 Inch',\n",
       " 'Daddy Lessons',\n",
       " 'Love Drought',\n",
       " 'Sandcastles',\n",
       " 'Forward',\n",
       " 'Freedom',\n",
       " 'All Night',\n",
       " 'Formation',\n",
       " 'Sorry (Original Demo)',\n",
       " 'Lemonade Film (Script)']"
      ]
     },
     "execution_count": 309,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_all_songs_from_album('Beyonce', 'Lemonade')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 310,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['State of Grace',\n",
       " 'Red',\n",
       " 'Treacherous',\n",
       " 'I Knew You Were Trouble',\n",
       " 'All Too Well',\n",
       " '22',\n",
       " 'I Almost Do',\n",
       " 'We Are Never Ever Getting Back Together',\n",
       " 'Stay Stay Stay',\n",
       " 'The Last Time',\n",
       " 'Holy Ground',\n",
       " 'Sad Beautiful Tragic',\n",
       " 'The Lucky One',\n",
       " 'Everything Has Changed',\n",
       " 'Starlight',\n",
       " 'Begin Again',\n",
       " 'I Knew You Were Trouble (Intro)']"
      ]
     },
     "execution_count": 310,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_all_songs_from_album('Taylor Swift', 'Red')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 311,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Geyser',\n",
       " \"Why Didn't You Stop Me?\",\n",
       " 'Old Friend',\n",
       " 'A Pearl',\n",
       " 'Lonesome Love',\n",
       " 'Remember My Name',\n",
       " 'Me and My Husband',\n",
       " 'Come Into the Water',\n",
       " 'Nobody',\n",
       " 'Pink in the Night',\n",
       " 'A Horse Named Cold Air',\n",
       " 'Washing Machine Heart',\n",
       " 'Blue Light',\n",
       " 'Two Slow Dancers']"
      ]
     },
     "execution_count": 311,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_all_songs_from_album('Mitski', 'Be The Cowboy')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Your Turn!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "get_all_songs_from_album('#Your Choice of Artist', '#Your Choice of Album')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}