{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Employee Provident Fund company data\n",
"\n",
"I was doing my usual weekly ego-search for\n",
"[Gramener](https://www.google.com/search?q=gramener) when I found\n",
"[Gramener's employee provident fund records](http://esewa.epfoservices.in/ecr_dashboard/estt_wise_member_missing_detail.php?pageNum_Recordset1=439&totalRows_Recordset1=8983&office_name=HYDERABAD).\n",
"\n",
"What's interesting is that the page lets you download the full list for the\n",
"entire city of Hyderabad. Which got me wondering... could we do this for other\n",
"cities? It turns out that there's a full [list of offices](http://esewa.epfoservices.in/ecr_dashboard/summary_of_reg_estt.php)\n",
"that we can pull data from.\n",
"\n",
"A simple set of `curl` statements like\n",
"\n",
" curl -s -o DELHINORTH.xls http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name=DELHINORTH\n",
" curl -s -o DEHRADUN.xls http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name=DEHRADUN\n",
" curl -s -o HALDWANI.xls http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name=HALDWANI\n",
" ...\n",
"\n",
"gets you 146 MB of EPF data for every single company. Or, you could scrape it in Python.\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"\n",
"dfs = pd.read_html('http://esewa.epfoservices.in/ecr_dashboard/summary_of_reg_estt.php', infer_types=False, skiprows=2)\n",
"dfs[0].head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"
\n",
"
\n",
" \n",
" \n",
" \n",
" 0 \n",
" 1 \n",
" 2 \n",
" 3 \n",
" 4 \n",
" 5 \n",
" 6 \n",
" 7 \n",
" 8 \n",
" 9 \n",
" 10 \n",
" 11 \n",
" 12 \n",
" \n",
" \n",
" \n",
" \n",
" 2 \n",
" 1 \n",
" 1 \n",
" RO \n",
" DL \n",
" CPM \n",
" DELHINORTH \n",
" 26752 \n",
" 157184 \n",
" 17 \n",
" 0.06 \n",
" 15713 \n",
" 58.74 \n",
" 209 \n",
" \n",
" \n",
" 3 \n",
" 2 \n",
" 1 \n",
" RO \n",
" UK \n",
" DDN \n",
" DEHRADUN \n",
" 4546 \n",
" 36406 \n",
" 2 \n",
" 0.04 \n",
" 3192 \n",
" 70.22 \n",
" 26 \n",
" \n",
" \n",
" 4 \n",
" 3 \n",
" 1 \n",
" SRO \n",
" UK \n",
" HLD \n",
" HALDWANI \n",
" 2231 \n",
" 21458 \n",
" 2 \n",
" 0.09 \n",
" 1676 \n",
" 75.12 \n",
" 18 \n",
" \n",
" \n",
" 5 \n",
" 4 \n",
" 1 \n",
" SRO \n",
" DS \n",
" SHD \n",
" LAXMINAGAR \n",
" 6559 \n",
" 36083 \n",
" 3 \n",
" 0.05 \n",
" 3811 \n",
" 58.1 \n",
" 70 \n",
" \n",
" \n",
" 6 \n",
" 5 \n",
" 1 \n",
" RO \n",
" DS \n",
" NHP \n",
" DELHISOUTH \n",
" 13915 \n",
" 102883 \n",
" 10 \n",
" 0.07 \n",
" 8282 \n",
" 59.52 \n",
" 166 \n",
" \n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 30,
"text": [
" 0 1 2 3 4 5 6 7 8 9 10 11 12\n",
"2 1 1 RO DL CPM DELHINORTH 26752 157184 17 0.06 15713 58.74 209\n",
"3 2 1 RO UK DDN DEHRADUN 4546 36406 2 0.04 3192 70.22 26\n",
"4 3 1 SRO UK HLD HALDWANI 2231 21458 2 0.09 1676 75.12 18\n",
"5 4 1 SRO DS SHD LAXMINAGAR 6559 36083 3 0.05 3811 58.1 70\n",
"6 5 1 RO DS NHP DELHISOUTH 13915 102883 10 0.07 8282 59.52 166"
]
}
],
"prompt_number": 30
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import os.path\n",
"from urllib import urlretrieve\n",
"\n",
"def get_epf(offices, path):\n",
" data = []\n",
" base = 'http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name='\n",
" \n",
" # Weird workaround because pd.read_excel does not support encoding_override\n",
" xl = pd.ExcelFile.__new__(pd.ExcelFile)\n",
" \n",
" for office in offices:\n",
" excel_file = os.path.join(path, office) + '.xls'\n",
" if not os.path.exists(excel_file):\n",
" urlretrieve(base + office, excel_file)\n",
" try:\n",
" book = xlrd.open_workbook(excel_file, encoding_override='cp1252')\n",
" except UnicodeDecodeError:\n",
" book = xlrd.open_workbook(excel_file, encoding_override='utf8')\n",
" setattr(xl, 'book', book)\n",
" df = xl.parse('Sheet 1', skiprows=2)\n",
" df['Office'] = office\n",
" data.append(df)\n",
" \n",
" return pd.concat(data, ignore_index=True)\n",
"\n",
"path = 'd:/site/gramener.com/viz/epf/establishments/'\n",
"datafile = os.path.join(path, 'data.csv')\n",
"if os.path.exists(datafile):\n",
" data = pd.read_csv(datafile, encoding='cp1252')\n",
"else:\n",
" data = get_epf(offices=dfs[0][5][:120], path=path)\n",
" data.to_csv(datafile, index=False, encoding='cp1252')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 146
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Presenting India's largest companies, by employee count"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data.sort('On the basis of ECR (wage month Feb-2013 onwards)', ascending=False).head(20)[['Office', 'Estt. Name', 'On the basis of ECR (wage month Feb-2013 onwards)']]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"\n",
"
\n",
" \n",
" \n",
" \n",
" Office \n",
" Estt. Name \n",
" On the basis of ECR (wage month Feb-2013 onwards) \n",
" \n",
" \n",
" \n",
" \n",
" 315758 \n",
" BANDRAMUMBAI-I \n",
" TATA CONSULTANCY SERVICES LIMITED \n",
" 249660 \n",
" \n",
" \n",
" 265395 \n",
" PARKSTREET \n",
" LARSEN - TOUBRO LTD. (ECC CONSTRUCTION GROUP) \n",
" 157831 \n",
" \n",
" \n",
" 0 \n",
" DELHINORTH \n",
" LARSEN - TOUBRO LTD \n",
" 149374 \n",
" \n",
" \n",
" 203666 \n",
" CHENNAI \n",
" COGNIZANT TECHNOLOGY SOLUTIONS INDIA PRIVATE LTD \n",
" 138108 \n",
" \n",
" \n",
" 315759 \n",
" BANDRAMUMBAI-I \n",
" LARSEN - TURBO LTD ECC \n",
" 136442 \n",
" \n",
" \n",
" 129643 \n",
" HYDERABAD \n",
" A P S R T C \n",
" 131587 \n",
" \n",
" \n",
" 187848 \n",
" BOMMASANDRA \n",
" ADECCO INDIA PVT. LTD. \n",
" 130769 \n",
" \n",
" \n",
" 187849 \n",
" BOMMASANDRA \n",
" M/S INFOSYS LIMITED \n",
" 126704 \n",
" \n",
" \n",
" 173884 \n",
" BANGALORE \n",
" TEAM LEASE SERVICES PVT LTD \n",
" 107921 \n",
" \n",
" \n",
" 315760 \n",
" BANDRAMUMBAI-I \n",
" ACCENTURE SERVICES PVT. LTD. \n",
" 101822 \n",
" \n",
" \n",
" 315761 \n",
" BANDRAMUMBAI-I \n",
" SHAPOORJI PALLONJI - CO LTD. \n",
" 99531 \n",
" \n",
" \n",
" 95233 \n",
" PATNA \n",
" SECURITY AND INTELLIGENCE SERVICE (INIDA) LTD. \n",
" 95299 \n",
" \n",
" \n",
" 406613 \n",
" AHMEDABAD \n",
" LARSEN - TOUBRO LIMITED \n",
" 94941 \n",
" \n",
" \n",
" 198662 \n",
" TAMBARAM \n",
" M/S.LARSEN - TOUBRO LIMITED, \n",
" 94087 \n",
" \n",
" \n",
" 190766 \n",
" KRPURAMWHITEFIELD \n",
" IBM INDIA PVT.LTD \n",
" 93927 \n",
" \n",
" \n",
" 129644 \n",
" HYDERABAD \n",
" LARSEN TOUBRO LTD.,[ECC-CG] \n",
" 90508 \n",
" \n",
" \n",
" 315762 \n",
" BANDRAMUMBAI-I \n",
" H D F C BANK LTD. \n",
" 87726 \n",
" \n",
" \n",
" 315763 \n",
" BANDRAMUMBAI-I \n",
" M.S.E.B. \n",
" 85844 \n",
" \n",
" \n",
" 187850 \n",
" BOMMASANDRA \n",
" M/S WIPRO LIMITED - TECHNOLOGIES GROUP \n",
" 85099 \n",
" \n",
" \n",
" 41293 \n",
" GURGAON \n",
" HINDUSTAN INSTRUMENT \n",
" 83170 \n",
" \n",
" \n",
"
\n",
"
"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 119,
"text": [
" Office Estt. Name \\\n",
"315758 BANDRAMUMBAI-I TATA CONSULTANCY SERVICES LIMITED \n",
"265395 PARKSTREET LARSEN - TOUBRO LTD. (ECC CONSTRUCTION GROUP) \n",
"0 DELHINORTH LARSEN - TOUBRO LTD \n",
"203666 CHENNAI COGNIZANT TECHNOLOGY SOLUTIONS INDIA PRIVATE LTD \n",
"315759 BANDRAMUMBAI-I LARSEN - TURBO LTD ECC \n",
"129643 HYDERABAD A P S R T C \n",
"187848 BOMMASANDRA ADECCO INDIA PVT. LTD. \n",
"187849 BOMMASANDRA M/S INFOSYS LIMITED \n",
"173884 BANGALORE TEAM LEASE SERVICES PVT LTD \n",
"315760 BANDRAMUMBAI-I ACCENTURE SERVICES PVT. LTD. \n",
"315761 BANDRAMUMBAI-I SHAPOORJI PALLONJI - CO LTD. \n",
"95233 PATNA SECURITY AND INTELLIGENCE SERVICE (INIDA) LTD. \n",
"406613 AHMEDABAD LARSEN - TOUBRO LIMITED \n",
"198662 TAMBARAM M/S.LARSEN - TOUBRO LIMITED, \n",
"190766 KRPURAMWHITEFIELD IBM INDIA PVT.LTD \n",
"129644 HYDERABAD LARSEN TOUBRO LTD.,[ECC-CG] \n",
"315762 BANDRAMUMBAI-I H D F C BANK LTD. \n",
"315763 BANDRAMUMBAI-I M.S.E.B. \n",
"187850 BOMMASANDRA M/S WIPRO LIMITED - TECHNOLOGIES GROUP \n",
"41293 GURGAON HINDUSTAN INSTRUMENT \n",
"\n",
" On the basis of ECR (wage month Feb-2013 onwards) \n",
"315758 249660 \n",
"265395 157831 \n",
"0 149374 \n",
"203666 138108 \n",
"315759 136442 \n",
"129643 131587 \n",
"187848 130769 \n",
"187849 126704 \n",
"173884 107921 \n",
"315760 101822 \n",
"315761 99531 \n",
"95233 95299 \n",
"406613 94941 \n",
"198662 94087 \n",
"190766 93927 \n",
"129644 90508 \n",
"315762 87726 \n",
"315763 85844 \n",
"187850 85099 \n",
"41293 83170 "
]
}
],
"prompt_number": 119
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, a dataset like this requires a nice exploratory interface. Here's what I sketched quickly, and a working implementation is at https://gramener.com/companysize/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
" \n",
" \n",
" \n",
" \n",
"Top employers \n",
"Most employers are concentrated around Mumbai, \n",
"Delhi, blah \n",
"blah \n",
"blah \n",
". The largest employers on the \n",
"EPF list are blah \n",
"blah \n",
"blah \n",
". \n",
"Lorem \n",
"ipsum \n",
"dolor \n",
"sit \n",
"amet \n",
", \n",
"consectetur \n",
"adipisicing \n",
"elit \n",
", \n",
"sed \n",
"do \n",
"eiusmod \n",
"tempor \n",
"incididunt \n",
"ut \n",
"labore \n",
"et \n",
"dolore \n",
"magna \n",
"aliqua \n",
". \n",
"Ut \n",
"enim \n",
"ad minim \n",
"veniam \n",
", \n",
"quis \n",
"nostrud \n",
"exercitation \n",
"ullamco \n",
"laboris \n",
"nisi \n",
"ut \n",
"aliquip \n",
"ex \n",
"ea \n",
"commodo \n",
"consequat \n",
". \n",
"Duis \n",
"aute \n",
"irure \n",
"dolor \n",
"in \n",
"reprehenderit \n",
"in \n",
"voluptate \n",
"velit \n",
"esse \n",
"cillum \n",
"dolore \n",
"eu \n",
"fugiat \n",
"nulla \n",
"pariatur \n",
". \n",
" \n",
" \n",
"1 \n",
" \n",
" \n",
"2 \n",
" \n",
" \n",
"3 \n",
" \n",
" \n",
"4 \n",
" \n",
" \n",
"5 \n",
" \n",
" \n",
"6 \n",
" \n",
" \n",
"7 \n",
" \n",
" \n",
"8 \n",
" \n",
" \n",
"9 \n",
" \n",
"# Employees \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"MUM \n",
"Tata Consultancy Services \n",
"249,660 \n",
"PARK \n",
"Larsen _ Toubro Ltd (ECC \n",
"Constructi \n",
"157,831 \n",
"Click / swipe \n",
"for next story \n",
" \n",
"iPad \n",
"1024 x 768 frame \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" "
]
}
],
"metadata": {}
}
]
}