{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Crawling Web Pages using Beautiful Soup\n", "\n", "This notebook crawls apress.com's blog page to:\n", "+ extract list of recent blog post titles and their URLS\n", "+ extract content related to each blog post in plain text\n", "\n", "using requests and **BeautifulSoup**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# import required libraries\n", "import requests\n", "from time import sleep\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Utilities" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def get_post_mapping(content):\n", " \"\"\"This function extracts blog post title and url from response object\n", "\n", " Args:\n", " content (request.content): String content returned from requests.get\n", "\n", " Returns:\n", " list: a list of dictionaries with keys title and url\n", "\n", " \"\"\"\n", " post_detail_list = []\n", " post_soup = BeautifulSoup(content,\"lxml\")\n", " h3_content = post_soup.find_all(\"h3\")\n", " \n", " for h3 in h3_content:\n", " post_detail_list.append(\n", " {'title':h3.a.get_text(),'url':h3.a.attrs.get('href')}\n", " )\n", " \n", " return post_detail_list\n", "\n", "\n", "def get_post_content(content):\n", " \"\"\"This function extracts blog post content from response object\n", "\n", " Args:\n", " content (request.content): String content returned from requests.get\n", "\n", " Returns:\n", " str: blog's content in plain text\n", "\n", " \"\"\"\n", " plain_text = \"\"\n", " text_soup = BeautifulSoup(content,\"lxml\")\n", " para_list = text_soup.find_all(\"div\",\n", " {'class':'cms-richtext'})\n", " \n", " for p in para_list[0]:\n", " plain_text += p.getText()\n", " \n", " return plain_text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Crawl the Web" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the URL and get a list of blogs to be parsed" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "crawl_url = \"http://www.apress.com/in/blog/all-blog-posts\"\n", "post_url_prefix = \"http://www.apress.com\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Crawl recent posts on the website" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "response = requests.get(crawl_url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extract blog post title and url from response object" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "if response.status_code == 200:\n", " blog_post_details = get_post_mapping(response.content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each recent post, crawl the content and parse plain text content using beautiful soup" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Blog posts found:20\n", "Crawling content for post titled: Creating Complex Validation Rules Using Fluent Validation with ASP.NET Core\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Writing Functions in Python\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Push Notifications: Responsible Web App Development\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Best practices for using Simple Lookup Tables\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Learn AI - The Time is NOW\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: The Definitive Guide to Shopify Themes\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Why a Deadlock Is Not Just “Really Bad Blocking”\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Testing, 1-2-3: Getting Started Debugging Python\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Surviving the Corporate PowerPoint Template\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Beginning Data Science and Supervised Learning in R\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Common Table Expressions vs. Derived Tables\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: The Power of Pixlr\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Math and Science with a 3D Printer\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: A Brief Primer to the Internet of Things\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Wannacry: Why It's Only the Beginning, and How to Prepare for What Comes Next\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Reusing ngrx/effects in Angular (communicating between reducers)\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Interview with Tony Smith - Author and SharePoint Expert\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Making Sense of Sensors – Types and Levels of Recognition\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: VS 2017, .NET Core, and JavaScript Frameworks, Oh My!\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Crawling content for post titled: Relabel the Email Send Button “Make Public”\n", "Waiting for 10 secs before crawling next post...\n", "\n", "\n", "Content crawled for all posts\n" ] } ], "source": [ "if blog_post_details:\n", " print(\"Blog posts found:{}\".format(len(blog_post_details)))\n", " \n", " for post in blog_post_details:\n", " print(\"Crawling content for post titled:\",post.get('title'))\n", " post_response = requests.get(post_url_prefix+post.get('url'))\n", " \n", " if post_response.status_code == 200:\n", " post['content'] = get_post_content(post_response.content)\n", " \n", " print(\"Waiting for 10 secs before crawling next post...\\n\\n\")\n", " sleep(10)\n", " \n", " print(\"Content crawled for all posts\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print Title and Content for first 5 posts" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "title:Creating Complex Validation Rules Using Fluent Validation with ASP.NET Core\n", "-----------\n", "content:By John Ciliberti Many ecommerce web sites are driven by user input and the choices they make. As a developer, you want to help them make the right choices and have a positive experience with your site, so they will complete their purchase, and retur\n", "\n", "title:Writing Functions in Python\n", "-----------\n", "content:By Paul Gerrard Why Write Functions?When you write more complicated programs, you can choose to write them in long, complicated modules, but complicated modules are harder to write and difficult to understand. A better approach is to modularize a com\n", "\n", "title:Push Notifications: Responsible Web App Development\n", "-----------\n", "content:By Dennis Sheppard While push notifications on the web are a powerful feature that inches the web ever closer to native apps, some developers have started to transform them into trite annoyances that have conditioned users to ignore notifications or \n", "\n", "title:Best practices for using Simple Lookup Tables\n", "-----------\n", "content:By Nick Harrison Simple lookup tables are just one type of logic table that you may find useful. This article explores what simple lookup tables look like, discusses when they may be applicable, and steps through some of the details for proper implem\n", "\n", "title:Learn AI - The Time is NOW\n", "-----------\n", "content:By Nishith PathakImagine creating a software so smart that it will not only understand human languages but also slangs and subtle variations of these languages, such that your software will know that “Hello, Computer! How are you doing?” and “wassup \n", "\n" ] } ], "source": [ "# print/write content to file\n", "for post in blog_post_details[:5]:\n", " print(\"title:{}\\n-----------\".format(post['title']))\n", " print(\"content:{}\\n\".format(post['content'][0:250]))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.4" } }, "nbformat": 4, "nbformat_minor": 2 }