{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Wordpress scraper\n", "\n", "Suggestions for downloading Wordpress blogs. \n", "\n", "**Note:** Worpress blogs have different configurations and versions. So the code has to be adapted by first inspecting the source code of the rendered html, then the ``soup.findAll`` has to be changed to fit either the permanent URL in **Step 1** or the content of the blog post in **Step 2**. \n", "\n", "**Note 2:** Scraping blogs can violate some agreement and might get your connection banned visiting the blog. Probably the content is also copyrighted. Use wisely. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import requests\n", "import io\n", "from bs4 import BeautifulSoup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 - Fetch URLs of fulltext articles" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "urlfile = open(\"urllist.txt\", 'w') #opens an output file for storing permanent urls\n", "\n", "baseurl = \"http://urloftheblog.com\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "counter = 1\n", "\n", "for i in range(0,1):\n", " try:\n", " url = baseurl + \"/page/\" + str(counter) + '/'\n", " print(\"Counter:\" + str(counter))\n", " print(url)\n", " r = requests.get(url)\n", " file_like_obj = io.StringIO(r.text) #Turns the requested output into a file like objet\n", " lines = file_like_obj.read()\n", "\n", " soup = BeautifulSoup(lines, \"lxml\")\n", " \n", " # Change below according to the rendered source code of the blog html.\n", " # What your want is the direct url to each blog post. \n", " posturls = soup.findAll(\"h2\", { \"class\" : \"entry-title\" }) \n", "\n", " for p in posturls:\n", " #print(p.find('a').attrs['href'])\n", " urlfile.write(p.find('a').attrs['href'] + \"\\n\")\n", " counter += 1\n", " except ConnectionError: # Add more exceptions if needed. \n", " print(\"There was a connectin error for \" + url)\n", " \n", "\n", "urlfile.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 - Get the full text body of each blog post and write to file. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "posturls = [line.rstrip('\\n') for line in open('urllist.txt')] # Load urls from Step 1\n", "print(posturls[0]) #just to check that the list is full of urls." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "blogcontentfile = open(\"blogcontent.txt\", 'w') # Open up a file to store content.\n", "\n", "articlecounter = 0\n", "\n", "failedarticles = [] # If this one grows, store them in a file or something. \n", "\n", "for url in posturls:\n", " try:\n", " req = requests.get(url)\n", " file_like_object = io.StringIO(req.text) \n", " apxlines = file_like_object.read()\n", " apxsoup = BeautifulSoup(apxlines, \"lxml\")\n", " \n", " # Change below according to the rendered source code of the blog html.\n", " # What your want is the direct url to each blog post. \n", " postbody = apxsoup.findAll(\"div\", { \"class\" : \"entry-content\" })\n", "\n", " for p in postbody:\n", " articlecounter += 1\n", " print(str(articlecounter) + \". 
\" + url)\n", " #print(\"-----\\n\" + url + \"\\n\" + p.text)\n", " blogcontentfile.write(\"-----\\n\" + url + \"\\n\" + p.text)\n", " except requests.exceptions.RequestException as e: \n", " print(e)\n", " failedarticles.append(url)\n", "\n", "blogcontentfile.close()\n", "\n", "print(\"The following URLs failed to download:\")\n", "print(failedarticles)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [default]", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }