{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Wordpress scraper\n", "\n", "Suggestions for downloading Wordpress blogs. \n", "\n", "**Note:** Worpress blogs have different configurations and versions. So the code has to be adapted by first inspecting the source code of the rendered html, then the ``soup.findAll`` has to be changed to fit either the permanent URL in **Step 1** or the content of the blog post in **Step 2**. \n", "\n", "**Note 2:** Scraping blogs can violate some agreement and might get your connection banned visiting the blog. Probably the content is also copyrighted. Use wisely. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import requests\n", "import io\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 - Fetch URLs of fulltext articles" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "urlfile = open(\"urllist.txt\", 'w') #opens an output file for storing permanent urls\n", "\n", "baseurl = \"http://urloftheblog.com\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "counter = 1\n", "\n", "for i in range(0,1):\n", " try:\n", " url = baseurl + \"/page/\" + str(counter) + '/'\n", " print(\"Counter:\" + str(counter))\n", " print(url)\n", " r = requests.get(url)\n", " file_like_obj = io.StringIO(r.text) #Turns the requested output into a file like objet\n", " lines = file_like_obj.read()\n", "\n", " soup = BeautifulSoup(lines, \"lxml\")\n", " \n", " # Change below according to the rendered source code of the blog html.\n", " # What your want is the direct url to each blog post. \n", " posturls = soup.findAll(\"h2\", { \"class\" : \"entry-title\" }) \n", "\n", " for p in posturls:\n", " #print(p.find('a').attrs['href'])\n", " urlfile.write(p.find('a').attrs['href'] + \"\\n\")\n", " counter += 1\n", " except ConnectionError: # Add more exceptions if needed. \n", " print(\"There was a connectin error for \" + url)\n", " \n", "\n", "urlfile.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Get the full text body of each blog post and write to file. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "posturls = [line.rstrip('\\n') for line in open('urllist.txt')] # Load urls from Step 1\n", "print(posturls[0]) #just to check that the list is full of urls." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "blogcontentfile = open(\"blogcontent.txt\", 'w') # Open up a file to store content.\n", "\n", "articlecounter = 0\n", "\n", "failedarticles = [] # If this one grows, store them in a file or something. \n", "\n", "for url in posturls:\n", " try:\n", " req = requests.get(url)\n", " file_like_object = io.StringIO(req.text) \n", " apxlines = file_like_object.read()\n", " apxsoup = BeautifulSoup(apxlines, \"lxml\")\n", " \n", " # Change below according to the rendered source code of the blog html.\n", " # What your want is the direct url to each blog post. \n", " postbody = apxsoup.findAll(\"div\", { \"class\" : \"entry-content\" })\n", "\n", " for p in postbody:\n", " articlecounter += 1\n", " print(str(articlecounter) + \". 
\" + url)\n", " #print(\"-----\\n\" + url + \"\\n\" + p.text)\n", " blogcontentfile.write(\"-----\\n\" + url + \"\\n\" + p.text)\n", " except requests.exceptions.RequestException as e: \n", " print(e)\n", " failedarticles.append(url)\n", "\n", "blogcontentfile.close()\n", "\n", "print(\"The following URLs failed to download:\")\n", "print(failedarticles)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }