{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Employee Provident Fund company data\n", "\n", "I was doing my usual weekly ego-search for\n", "[Gramener](https://www.google.com/search?q=gramener) when I found\n", "[Gramener's employee provident fund records](http://esewa.epfoservices.in/ecr_dashboard/estt_wise_member_missing_detail.php?pageNum_Recordset1=439&totalRows_Recordset1=8983&office_name=HYDERABAD).\n", "\n", "What's interesting is that the page lets you download the full list for the\n", "entire city of Hyderabad. Which got me wondering... could we do this for other\n", "cities? It turns out that there's a full [list of offices](http://esewa.epfoservices.in/ecr_dashboard/summary_of_reg_estt.php)\n", "that we can pull data from.\n", "\n", "A simple set of `curl` statements like\n", "\n", " curl -s -o DELHINORTH.xls http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name=DELHINORTH\n", " curl -s -o DEHRADUN.xls http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name=DEHRADUN\n", " curl -s -o HALDWANI.xls http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name=HALDWANI\n", " ...\n", "\n", "gets you 146 MB of EPF data for every single company. Or, you could scrape it in Python.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "\n", "dfs = pd.read_html('http://esewa.epfoservices.in/ecr_dashboard/summary_of_reg_estt.php', infer_types=False, skiprows=2)\n", "dfs[0].head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0123456789101112
2 1 1 RO DL CPM DELHINORTH 26752 157184 17 0.06 15713 58.74 209
3 2 1 RO UK DDN DEHRADUN 4546 36406 2 0.04 3192 70.22 26
4 3 1 SRO UK HLD HALDWANI 2231 21458 2 0.09 1676 75.12 18
5 4 1 SRO DS SHD LAXMINAGAR 6559 36083 3 0.05 3811 58.1 70
6 5 1 RO DS NHP DELHISOUTH 13915 102883 10 0.07 8282 59.52 166
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 30, "text": [ " 0 1 2 3 4 5 6 7 8 9 10 11 12\n", "2 1 1 RO DL CPM DELHINORTH 26752 157184 17 0.06 15713 58.74 209\n", "3 2 1 RO UK DDN DEHRADUN 4546 36406 2 0.04 3192 70.22 26\n", "4 3 1 SRO UK HLD HALDWANI 2231 21458 2 0.09 1676 75.12 18\n", "5 4 1 SRO DS SHD LAXMINAGAR 6559 36083 3 0.05 3811 58.1 70\n", "6 5 1 RO DS NHP DELHISOUTH 13915 102883 10 0.07 8282 59.52 166" ] } ], "prompt_number": 30 }, { "cell_type": "code", "collapsed": false, "input": [ "import os.path\n", "from urllib import urlretrieve\n", "\n", "def get_epf(offices, path):\n", " data = []\n", " base = 'http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name='\n", " \n", " # Weird workaround because pd.read_excel does not support encoding_override\n", " xl = pd.ExcelFile.__new__(pd.ExcelFile)\n", " \n", " for office in offices:\n", " excel_file = os.path.join(path, office) + '.xls'\n", " if not os.path.exists(excel_file):\n", " urlretrieve(base + office, excel_file)\n", " try:\n", " book = xlrd.open_workbook(excel_file, encoding_override='cp1252')\n", " except UnicodeDecodeError:\n", " book = xlrd.open_workbook(excel_file, encoding_override='utf8')\n", " setattr(xl, 'book', book)\n", " df = xl.parse('Sheet 1', skiprows=2)\n", " df['Office'] = office\n", " data.append(df)\n", " \n", " return pd.concat(data, ignore_index=True)\n", "\n", "path = 'd:/site/gramener.com/viz/epf/establishments/'\n", "datafile = os.path.join(path, 'data.csv')\n", "if os.path.exists(datafile):\n", " data = pd.read_csv(datafile, encoding='cp1252')\n", "else:\n", " data = get_epf(offices=dfs[0][5][:120], path=path)\n", " data.to_csv(datafile, index=False, encoding='cp1252')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 146 }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Presenting India's largest companies, by employee count" ] }, { "cell_type": "code", "collapsed": false, "input": [ "data.sort('On the basis of ECR (wage month Feb-2013 onwards)', ascending=False).head(20)[['Office', 'Estt. Name', 'On the basis of ECR (wage month Feb-2013 onwards)']]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
OfficeEstt. NameOn the basis of ECR (wage month Feb-2013 onwards)
315758 BANDRAMUMBAI-I TATA CONSULTANCY SERVICES LIMITED 249660
265395 PARKSTREET LARSEN - TOUBRO LTD. (ECC CONSTRUCTION GROUP) 157831
0 DELHINORTH LARSEN - TOUBRO LTD 149374
203666 CHENNAI COGNIZANT TECHNOLOGY SOLUTIONS INDIA PRIVATE LTD 138108
315759 BANDRAMUMBAI-I LARSEN - TURBO LTD ECC 136442
129643 HYDERABAD A P S R T C 131587
187848 BOMMASANDRA ADECCO INDIA PVT. LTD. 130769
187849 BOMMASANDRA M/S INFOSYS LIMITED 126704
173884 BANGALORE TEAM LEASE SERVICES PVT LTD 107921
315760 BANDRAMUMBAI-I ACCENTURE SERVICES PVT. LTD. 101822
315761 BANDRAMUMBAI-I SHAPOORJI PALLONJI - CO LTD. 99531
95233 PATNA SECURITY AND INTELLIGENCE SERVICE (INIDA) LTD. 95299
406613 AHMEDABAD LARSEN - TOUBRO LIMITED 94941
198662 TAMBARAM M/S.LARSEN - TOUBRO LIMITED, 94087
190766 KRPURAMWHITEFIELD IBM INDIA PVT.LTD 93927
129644 HYDERABAD LARSEN TOUBRO LTD.,[ECC-CG] 90508
315762 BANDRAMUMBAI-I H D F C BANK LTD. 87726
315763 BANDRAMUMBAI-I M.S.E.B. 85844
187850 BOMMASANDRA M/S WIPRO LIMITED - TECHNOLOGIES GROUP 85099
41293 GURGAON HINDUSTAN INSTRUMENT 83170
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 119, "text": [ " Office Estt. Name \\\n", "315758 BANDRAMUMBAI-I TATA CONSULTANCY SERVICES LIMITED \n", "265395 PARKSTREET LARSEN - TOUBRO LTD. (ECC CONSTRUCTION GROUP) \n", "0 DELHINORTH LARSEN - TOUBRO LTD \n", "203666 CHENNAI COGNIZANT TECHNOLOGY SOLUTIONS INDIA PRIVATE LTD \n", "315759 BANDRAMUMBAI-I LARSEN - TURBO LTD ECC \n", "129643 HYDERABAD A P S R T C \n", "187848 BOMMASANDRA ADECCO INDIA PVT. LTD. \n", "187849 BOMMASANDRA M/S INFOSYS LIMITED \n", "173884 BANGALORE TEAM LEASE SERVICES PVT LTD \n", "315760 BANDRAMUMBAI-I ACCENTURE SERVICES PVT. LTD. \n", "315761 BANDRAMUMBAI-I SHAPOORJI PALLONJI - CO LTD. \n", "95233 PATNA SECURITY AND INTELLIGENCE SERVICE (INIDA) LTD. \n", "406613 AHMEDABAD LARSEN - TOUBRO LIMITED \n", "198662 TAMBARAM M/S.LARSEN - TOUBRO LIMITED, \n", "190766 KRPURAMWHITEFIELD IBM INDIA PVT.LTD \n", "129644 HYDERABAD LARSEN TOUBRO LTD.,[ECC-CG] \n", "315762 BANDRAMUMBAI-I H D F C BANK LTD. \n", "315763 BANDRAMUMBAI-I M.S.E.B. \n", "187850 BOMMASANDRA M/S WIPRO LIMITED - TECHNOLOGIES GROUP \n", "41293 GURGAON HINDUSTAN INSTRUMENT \n", "\n", " On the basis of ECR (wage month Feb-2013 onwards) \n", "315758 249660 \n", "265395 157831 \n", "0 149374 \n", "203666 138108 \n", "315759 136442 \n", "129643 131587 \n", "187848 130769 \n", "187849 126704 \n", "173884 107921 \n", "315760 101822 \n", "315761 99531 \n", "95233 95299 \n", "406613 94941 \n", "198662 94087 \n", "190766 93927 \n", "129644 90508 \n", "315762 87726 \n", "315763 85844 \n", "187850 85099 \n", "41293 83170 " ] } ], "prompt_number": 119 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, a dataset like this requires a nice exploratory interface. Here's what I sketched quickly, and a working implementation is at https://gramener.com/companysize/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "Top employers\n", "Most employers are concentrated around Mumbai, \n", "Delhi, blah \n", "blah\n", "blah\n", ". The largest employers on the \n", "EPF list are blah \n", "blah\n", "blah\n", ". \n", "Lorem\n", "ipsum\n", "dolor\n", "sit \n", "amet\n", ", \n", "consectetur\n", "adipisicing\n", "elit\n", ", \n", "sed\n", "do \n", "eiusmod\n", "tempor\n", "incididunt\n", "ut\n", "labore\n", "et \n", "dolore\n", "magna \n", "aliqua\n", ". \n", "Ut\n", "enim\n", "ad minim \n", "veniam\n", ", \n", "quis\n", "nostrud\n", "exercitation \n", "ullamco\n", "laboris\n", "nisi \n", "ut\n", "aliquip\n", "ex \n", "ea\n", "commodo\n", "consequat\n", ". \n", "Duis\n", "aute\n", "irure\n", "dolor\n", "in \n", "reprehenderit\n", "in \n", "voluptate\n", "velit\n", "esse\n", "cillum\n", "dolore\n", "eu\n", "fugiat\n", "nulla\n", "pariatur\n", ".\n", "\n", "\n", "1\n", "\n", "\n", "2\n", "\n", "\n", "3\n", "\n", "\n", "4\n", "\n", "\n", "5\n", "\n", "\n", "6\n", "\n", "\n", "7\n", "\n", "\n", "8\n", "\n", "\n", "9\n", "\n", "# Employees\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "MUM\n", "Tata Consultancy Services\n", "249,660\n", "PARK\n", "Larsen _ Toubro Ltd (ECC\n", "Constructi\n", "157,831\n", "Click / swipe\n", "for next story\n", "\n", "iPad\n", "1024 x 768 frame\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "" ] } ], "metadata": {} } ] }