{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scraping ADR's Myneta.info\n", "\n", "[Myneta.info](http://myneta.info) has analysed the affidavits of many candidates. This scraper converts that data into CSVs. (Of course, we could always ask ADR, and they'll probably happily provide it. But I find it faster to write a scraper than to wait for people to arrive at the office.)\n", "\n", "The pages are highly structured. We'll begin with the [candidate summary page](http://myneta.info/ls2014/index.php?action=summary&subAction=candidates_analyzed&sort=candidate#summary)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "import time\n", "import urllib\n", "import hashlib\n", "import pandas as pd\n", "from lxml.html import parse\n", "\n", "if not os.path.exists('.cache'):\n", "    os.makedirs('.cache')\n", "\n", "# If file is older than 15 days, download it again\n", "OLD = time.time() - 15 * 24 * 60 * 60\n", "\n", "yearkey = {\n", "    2014: 'ls2014',\n", "    2009: 'ls2009',\n", "    2004: 'loksabha2004',\n", "}\n", "\n", "# Fetch url through the on-disk cache and return the parsed lxml tree\n", "def get(url):\n", "    path = os.path.join('.cache', hashlib.sha1(url).hexdigest()) + '.html'\n", "    if not os.path.exists(path) or os.stat(path).st_mtime < OLD:\n", "        print url\n", "        urllib.urlretrieve(url, path)\n", "    return parse(open(path))\n", "\n", "# Scrape the candidate summary table for the given election year\n", "def candidates(year):\n", "    url = 'http://myneta.info/{:s}/index.php?action=summary&subAction=candidates_analyzed&sort=candidate'\n", "    tree = get(url.format(yearkey[year]))\n", "    results = []\n", "    for row in tree.findall('.//table')[-1].findall('tr'):\n", "        td = row.findall('td')\n", "        results.append({\n", "            'Year': year,\n", "            'Sno': td[0].text,\n", "            'ID': int(td[1].find('a').get('href').split('=')[-1]),\n", "            'Candidate': td[1].find('a').text,\n", "            'Constituency': td[2].text,\n", "            'Party': td[3].text,\n", "            'Criminal Cases': int(td[4].text_content()),\n", "            'Education': 
td[5].text,\n", "            'Total Assets': int(td[6].text.replace(u'Rs\\xa0', '').replace(',', '').replace('Nil', '0')),\n", "            'Total Liabilities': int(td[7].text.replace(u'Rs\\xa0', '').replace(',', '').replace('Nil', '0')),\n", "        })\n", "    return pd.DataFrame(results)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "ls2014 = candidates(2014)\n", "\n", "# The candidate summary page does not provide the state and PC codes,\n", "# so let's add them from a lookup file, at least for 2014.\n", "pc2014 = pd.read_csv('pc2014.csv').set_index('Constituency')\n", "ls2014['ST_CODE'] = ls2014['Constituency'].apply(lambda v: pc2014['ST_CODE'].get(v, ''))\n", "ls2014['PC_CODE'] = ls2014['Constituency'].apply(lambda v: pc2014['PC_CODE'].get(v, ''))\n", "\n", "# Some constituency names occur in more than one state,\n", "# so correct those rows, identified by candidate ID range.\n", "index = ls2014[(ls2014['Constituency'] == 'AURANGABAD') & (ls2014['ID'] > 5000)].index\n", "ls2014.loc[index, 'ST_CODE'] = 'S13'\n", "ls2014.loc[index, 'PC_CODE'] = 19\n", "\n", "index = ls2014[(ls2014['Constituency'] == 'MAHARAJGANJ') & (ls2014['ID'] > 9000)].index\n", "ls2014.loc[index, 'ST_CODE'] = 'S24'\n", "ls2014.loc[index, 'PC_CODE'] = 63\n", "\n", "index = ls2014[(ls2014['Constituency'] == 'HAMIRPUR') & (ls2014['ID'] < 7000)].index\n", "ls2014.loc[index, 'ST_CODE'] = 'S24'\n", "ls2014.loc[index, 'PC_CODE'] = 47\n", "\n", "# Save to disk\n", "ls2014.to_csv('myneta.2014.csv', index=False)\n", "ls2014.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
|   | Candidate | Constituency | Criminal Cases | Education | ID | Party | Sno | Total Assets | Total Liabilities | Year | ST_CODE | PC_CODE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kaushal Yadav | NAWADA | 8 | Post Graduate | 148 | JD(U) | 1 | 154566136 | 2604969 | 2014 | S04 | 39 |
| 1 | Kiran Sharma | AZAMGARH | 0 | 8th Pass | 9487 | Bhartiya Shakti Chetna Party | 2 | 3509407 | 325000 | 2014 | S24 | 69 |
| 2 | M. Aamir Rashadi | AZAMGARH | 1 | Others | 9496 | Rashtriya Ulama Council | 3 | 2191523 | 0 | 2014 | S24 | 69 |
| 3 | Rakesh Kumar Giri | MAHARAJGANJ | 0 | Graduate Professional | 9706 | IND | 4 | 306023 | 0 | 2014 | S24 | 63 |
| 4 | (Kuppal)G.Devadoss | CHENNAI SOUTH | 0 | 8th Pass | 6912 | IND | 5 | 3630000 | 850000 | 2014 | S22 | 3 |

5 rows × 12 columns

|   | ID | Key | Type | Value | Year |
|---|---|---|---|---|---|
| 0 | 148 | 420 | IPC | 3 | 2014 |
| 1 | 148 | 467 | IPC | 2 | 2014 |
| 2 | 148 | 468 | IPC | 2 | 2014 |
| 3 | 148 | 307 | IPC | 1 | 2014 |
| 4 | 148 | 379 | IPC | 1 | 2014 |

5 rows × 5 columns