{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Crawling Web Pages\n",
    "\n",
    "This notebook crawls apress.com's blog post to:\n",
    "+ extract content related to blog post using regex"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# import required libraries\n",
    "import re\n",
    "import requests"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Utility"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def extract_blog_content(content):\n",
    "    \"\"\"This function extracts blog post content using regex\n",
    "\n",
    "    Args:\n",
    "        content (request.content): String content returned from requests.get\n",
    "\n",
    "    Returns:\n",
    "        str: string content as per regex match\n",
    "\n",
    "    \"\"\"\n",
    "    content_pattern = re.compile(r'<div class=\"cms-richtext\">(.*?)</div>')\n",
    "    result = re.findall(content_pattern, content)\n",
    "    return result[0] if result else \"None\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Crawl the Web"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set the URL and blog post to be parsed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "base_url = \"http://www.apress.com/in/blog/all-blog-posts\"\n",
    "blog_suffix = \"/wannacry-how-to-prepare/12302194\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use requests library to make a get request"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = requests.get(base_url+blog_suffix)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Identify and Parse blog content using python's regex library (re)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "if response.status_code == 200:\n",
    "        content = response.text.encode('utf-8', 'ignore').decode('utf-8', 'ignore')\n",
    "        content = content.replace(\"\\n\", '')\n",
    "        blog_post_content = extract_blog_content(content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "View first 500 characters of the blogpost"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'<p class=\"intro--paragraph\"><em>By Mike Halsey</em></p><p><br/></p><p>It was a perfectly ordinary Friday when the Wannacry ransomware struck in May 2017. The malware spread around the world to more than 150 countries in just a matter of a few hours, affecting the National Health Service in the UK, telecoms provider Telefonica in Spain, and many other organisations and businesses in the USA, Canada, China, Japan, Russia, and right across Europe, the Middle-East, and Asia.</p><p>The malware was re'"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "blog_post_content[0:500]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}