{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial was partially adapted from http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial, where you can learn more about Open Refine. It used to be called Google Refine so try that too when you are searching for information. However, in our case, we use python to do the same thing." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Downloading Data\n", "\n", "The university data can be downloaed from http://enipedia.tudelft.nl/enipedia/images/f/ff/UniversityData.zip" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2017-11-19T16:50:30.266524", "start_time": "2017-11-19T16:50:30.260991" } }, "source": [ "## What you can learn\n", "\n", "The data contains quite a few issues, and this tutorial shows how to do things like:\n", "\n", "- Cleaning up inconsistent spelling of terms (i.e. \"USA\", \"U.S.A\", \"U.S.\", etc).\n", "- Converting values that are text descriptions of numeric values (i.e. $123 million) to actual numeric values (i.e. 123000000) which are usable for analysis.\n", "- Identifying which rows of a specific column contain a search term\n", "- Extracting and cleaning values for dates\n", "- Removing duplicate rows\n", "- Using a scatterplot to visualize relationships between values in different columns\n", "- Finding geographic coordinates for a list of place names (i.e. the names of universities, etc.)\n", "- Exporting cleaned data to Excel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading Data with Pandas" ] }, { "cell_type": "code", "execution_count": 487, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:07.915490", "start_time": "2017-11-19T20:33:07.906541" }, "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import sys\n", "import pandas as pd\n", "import statsmodels.api as sm\n", "from collections import Counter, defaultdict\n", "import numpy as np\n", "import datetime\n", "import matplotlib.pyplot as plt\n", "import matplotlib\n", "\n", "matplotlib.style.use('ggplot')" ] }, { "cell_type": "code", "execution_count": 579, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:42:26.429404", "start_time": "2017-11-19T20:42:26.321646" }, "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
universityendowmentnumFacultynumDoctoralcountrynumStaffestablishednumPostgradnumUndergradnumStudents
0Paris Universitas1555008000FranceNaN2005NaN2500070000
1Paris Universitas1555008000FranceNaN2005NaN2500070000
2Lumi%C3%A8re University Lyon 2121NaN1355FranceNaN183570461485127393
3Confederation College4700000NaNNaNCanadaNaN1967not availablepre-university students; technical21160
4Rocky Mountain College16586100NaNNaNUnited StatesNaN187866878894
\n", "
" ], "text/plain": [ " university endowment numFaculty numDoctoral \\\n", "0 Paris Universitas 15 5500 8000 \n", "1 Paris Universitas 15 5500 8000 \n", "2 Lumi%C3%A8re University Lyon 2 121 NaN 1355 \n", "3 Confederation College 4700000 NaN NaN \n", "4 Rocky Mountain College 16586100 NaN NaN \n", "\n", " country numStaff established numPostgrad \\\n", "0 France NaN 2005 NaN \n", "1 France NaN 2005 NaN \n", "2 France NaN 1835 7046 \n", "3 Canada NaN 1967 not available \n", "4 United States NaN 1878 66 \n", "\n", " numUndergrad numStudents \n", "0 25000 70000 \n", "1 25000 70000 \n", "2 14851 27393 \n", "3 pre-university students; technical 21160 \n", "4 878 894 " ] }, "execution_count": 579, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('universityData.csv', sep = '\\t', encoding = 'utf-8')\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 580, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:42:29.648020", "start_time": "2017-11-19T20:42:29.642387" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Lumi%C3%A8re University Lyon 2\n" ] } ], "source": [ "print df.university[2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apparently, there are duplicate rows." ] }, { "cell_type": "code", "execution_count": 489, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.042957", "start_time": "2017-11-19T20:33:08.038151" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "75043" ] }, "execution_count": 489, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2017-11-19T19:27:58.525341", "start_time": "2017-11-19T19:27:58.522623" } }, "source": [ "# Deduplicate entries" ] }, { "cell_type": "code", "execution_count": 490, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.080206", "start_time": "2017-11-19T20:33:08.045373" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "15534" ] }, "execution_count": 490, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = df.drop_duplicates()\n", "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We already see an issue here where there is both the full name of a country (United States) and its abbreviation (US). To fix this, we can just copy/paste \"United States\" as the new cell value.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Clean up country names" ] }, { "cell_type": "code", "execution_count": 491, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.087529", "start_time": "2017-11-19T20:33:08.082433" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['France', 'Canada', 'United States', 'USA', 'Italy', 'South Korea',\n", " 'Japan', 'United States of America', 'England, UK',\n", " 'United States )', 'Saudi Arabia', 'Honduras', 'United Kingdom',\n", " 'England', 'the Netherlands', 'India', 'Russia', 'U.S.', 'Brazil',\n", " 'US', 'U.S.A.', 'Philippines', 'Australia',\n", " 'England, United Kingdom', 'Wales', ',', 'China', 'South Africa',\n", " 'UK', 'Puerto Rico', 'Bulgaria', 'Botswana', 'Taiwan', 'Sri Lanka',\n", " 'Colombia', 'Iran', 'Russian Federation', 'Rossija', 'Finland',\n", " 'Republic of China', 'Chile', 'Romania', 'Utopia', 'Singapore',\n", " 'Lebanon', 'Turkey', 'Canada B1P 6L2', 'Jordan', 'Albania',\n", " 'Canada C1A 4P3 Telephone: 902-566-0439 Fax: 902-566-0795',\n", " 'Scotland', 'Pakistan', 'Scotland, United Kingdom', 'Denmark',\n", " 'Mexico', 'Thailand', 'Argentina', 'Cura%C3%A7ao', 'Sweden',\n", " 'Scotland, UK', 'Bangladesh', 'Nepal', 'Netherlands', 'Switzerland',\n", " 'Egypt', 'Spain', 'Satellite locations:',\n", " 'Nassau, Bahamas Fort Myers, FL Jacksonville, FL Miami, FL Miramar, FL Orlando, FL Palm Beach, FL Tampa, FL'], dtype=object)" ] }, "execution_count": 491, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['country'].unique()" ] }, { "cell_type": "code", "execution_count": 492, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.095622", "start_time": "2017-11-19T20:33:08.089014" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(',', 2)\n", "('Albania', 8)\n", "('Argentina', 1)\n", "('Australia', 49)\n", "('Bangladesh', 27)\n", "('Botswana', 1)\n", "('Brazil', 10)\n", "('Bulgaria', 2)\n", "('Canada', 625)\n", "('Canada B1P 6L2', 40)\n", "('Canada C1A 4P3 Telephone: 902-566-0439 Fax: 902-566-0795', 1)\n", "('Chile', 1)\n", "('China', 4)\n", "('Colombia', 1)\n", "('Cura%C3%A7ao', 1)\n", "('Denmark', 8)\n", "('Egypt', 1)\n", "('England', 338)\n", "('England, UK', 286)\n", "('England, United Kingdom', 64)\n", "('Finland', 2)\n", "('France', 8)\n", "('Honduras', 1)\n", "('India', 40)\n", "('Iran', 9)\n", "('Italy', 2)\n", "('Japan', 112)\n", "('Jordan', 2)\n", "('Lebanon', 5)\n", "('Mexico', 2)\n", "('Nassau, Bahamas Fort Myers, FL Jacksonville, FL Miami, FL Miramar, FL Orlando, FL Palm Beach, FL Tampa, FL', 1)\n", "('Nepal', 4)\n", "('Netherlands', 2)\n", "('Pakistan', 2)\n", "('Philippines', 147)\n", "('Puerto Rico', 2)\n", "('Republic of China', 1)\n", "('Romania', 1)\n", "('Rossija', 2)\n", "('Russia', 1)\n", "('Russian Federation', 2)\n", "('Satellite locations:', 1)\n", "('Saudi Arabia', 1)\n", "('Scotland', 64)\n", "('Scotland, UK', 16)\n", "('Scotland, United Kingdom', 16)\n", "('Singapore', 4)\n", "('South Africa', 7)\n", "('South Korea', 7)\n", "('Spain', 1)\n", "('Sri Lanka', 4)\n", "('Sweden', 24)\n", "('Switzerland', 8)\n", "('Taiwan', 1)\n", "('Thailand', 4)\n", "('Turkey', 4)\n", "('U.S.', 304)\n", "('U.S.A.', 81)\n", "('UK', 83)\n", "('US', 455)\n", "('USA', 5104)\n", "('United Kingdom', 405)\n", "('United States', 6949)\n", "('United States )', 1)\n", "('United States of America', 167)\n", "('Utopia', 1)\n", "('Wales', 2)\n", "('the Netherlands', 2)\n" ] } ], "source": [ "country_df = df.groupby('country').size()\n", "for k in country_df.index:\n", " print(k, country_df[k])" ] }, { "cell_type": "code", "execution_count": 493, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.138742", "start_time": "2017-11-19T20:33:08.097762" }, "collapsed": true }, "outputs": [], "source": [ "us_condition = df['country'].isin(['U.S.', 'U.S.A.', 'US', 'United States', 'United States )', 'United States of America'])\n", "df['country'][us_condition] = 'USA'" ] }, { "cell_type": "code", "execution_count": 494, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.155703", "start_time": "2017-11-19T20:33:08.140243" }, "collapsed": false }, "outputs": [], "source": [ "ca_condition =[]\n", "for i in df['country']:\n", " if i.__contains__('Canada'):\n", " ca_condition.append(True)\n", " else:\n", " ca_condition.append(False)\n", "df['country'][ca_condition] = 'Canada'" ] }, { "cell_type": "code", "execution_count": 495, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.173054", "start_time": "2017-11-19T20:33:08.157337" }, "collapsed": true }, "outputs": [], "source": [ "ca_condition =[]\n", "for i in df['country']:\n", " if i.__contains__('China'):\n", " ca_condition.append(True)\n", " else:\n", " ca_condition.append(False)\n", "df['country'][ca_condition] = 'China'" ] }, { "cell_type": "code", "execution_count": 496, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.190631", "start_time": "2017-11-19T20:33:08.174600" }, "collapsed": true }, "outputs": [], "source": [ "ca_condition =[]\n", "for i in df['country']:\n", " if i.__contains__('England'):\n", " ca_condition.append(True)\n", " else:\n", " ca_condition.append(False)\n", "df['country'][ca_condition] = 'England'" ] }, { "cell_type": "code", "execution_count": 497, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.209095", "start_time": "2017-11-19T20:33:08.192260" }, "collapsed": true }, "outputs": [], "source": [ "ca_condition =[]\n", "for i in df['country']:\n", " if i.__contains__('Scotland'):\n", " ca_condition.append(True)\n", " else:\n", " ca_condition.append(False)\n", "df['country'][ca_condition] = 'England'" ] }, { "cell_type": "code", "execution_count": 498, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.220570", "start_time": "2017-11-19T20:33:08.210841" }, "collapsed": true }, "outputs": [], "source": [ "uk_condition = df['country'].isin(['UK', 'United Kingdom', 'Wales'])\n", "df['country'][uk_condition] = 'England'" ] }, { "cell_type": "code", "execution_count": 499, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.236915", "start_time": "2017-11-19T20:33:08.221976" }, "collapsed": true }, "outputs": [], "source": [ "ca_condition =[]\n", "for i in df['country']:\n", " if i.__contains__('Netherlands'):\n", " ca_condition.append(True)\n", " else:\n", " ca_condition.append(False)\n", "df['country'][ca_condition] = 'Netherlands'" ] }, { "cell_type": "code", "execution_count": 500, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.254735", "start_time": "2017-11-19T20:33:08.238405" }, "collapsed": true }, "outputs": [], "source": [ "ca_condition =[]\n", "for i in df['country']:\n", " if i.__contains__('Russia'):\n", " ca_condition.append(True)\n", " else:\n", " ca_condition.append(False)\n", "df['country'][ca_condition] = 'Russia'" ] }, { "cell_type": "code", "execution_count": 501, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.272809", "start_time": "2017-11-19T20:33:08.256384" }, "collapsed": true }, "outputs": [], "source": [ "ca_condition =[]\n", "for i in df['country']:\n", " if i.__contains__('Bahamas'):\n", " ca_condition.append(True)\n", " else:\n", " ca_condition.append(False)\n", "df['country'][ca_condition] = 'Bahamas'" ] }, { "cell_type": "code", "execution_count": 502, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.278089", "start_time": "2017-11-19T20:33:08.274514" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 502, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'Russian Federation'.__contains__('Russia')" ] }, { "cell_type": "code", "execution_count": 503, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.289600", "start_time": "2017-11-19T20:33:08.279916" }, "collapsed": true }, "outputs": [], "source": [ "uk_condition = df['country']=='Cura%C3%A7ao'\n", "df['country'][uk_condition] = \"Curacao\"" ] }, { "cell_type": "code", "execution_count": 504, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.308954", "start_time": "2017-11-19T20:33:08.291006" }, "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
universityendowmentnumFacultynumDoctoralcountrynumStaffestablishednumPostgradnumUndergradnumStudents
2130Universidad Ju%C3%A1rez Aut%C3%B3noma de Tabasco23760.02188NaN,10871958-11-209712492135271
3077Universidad Ju%C3%A1rez Aut%C3%B3noma de Tabasco34140.02188NaN,10871958-11-209712492135271
\n", "
" ], "text/plain": [ " university endowment numFaculty \\\n", "2130 Universidad Ju%C3%A1rez Aut%C3%B3noma de Tabasco 23760.0 2188 \n", "3077 Universidad Ju%C3%A1rez Aut%C3%B3noma de Tabasco 34140.0 2188 \n", "\n", " numDoctoral country numStaff established numPostgrad numUndergrad \\\n", "2130 NaN , 1087 1958-11-20 971 24921 \n", "3077 NaN , 1087 1958-11-20 971 24921 \n", "\n", " numStudents \n", "2130 35271 \n", "3077 35271 " ] }, "execution_count": 504, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['country']==',']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Universidad Juárez Autónoma de Tabasco is a public institution of higher learning located in Villahermosa, Tabasco, Mexico." ] }, { "cell_type": "code", "execution_count": 505, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.321233", "start_time": "2017-11-19T20:33:08.310564" }, "collapsed": true }, "outputs": [], "source": [ "uk_condition = df['country']==','\n", "df['country'][uk_condition] = \"Mexico\"" ] }, { "cell_type": "code", "execution_count": 506, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.338934", "start_time": "2017-11-19T20:33:08.322730" }, "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
universityendowmentnumFacultynumDoctoralcountrynumStaffestablishednumPostgradnumUndergradnumStudents
75009Nova Southeastern UniversityUS $64.5 million2083NaNSatellite locations:4319196422060639728457
\n", "
" ], "text/plain": [ " university endowment numFaculty numDoctoral \\\n", "75009 Nova Southeastern University US $64.5 million 2083 NaN \n", "\n", " country numStaff established numPostgrad numUndergrad \\\n", "75009 Satellite locations: 4319 1964 22060 6397 \n", "\n", " numStudents \n", "75009 28457 " ] }, "execution_count": 506, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['country']=='Satellite locations:']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nova Southeastern University (NSU) is a private nonprofit university, with a main campus located on 300 acres (120 ha) in Davie, in the US state of Florida. Formerly referred to as \"Nova\" and now commonly called \"NSU\", the university currently consists of 18 colleges and schools offering over 175 programs of study with more than 250 majors." ] }, { "cell_type": "code", "execution_count": 507, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.350133", "start_time": "2017-11-19T20:33:08.340466" }, "collapsed": true }, "outputs": [], "source": [ "uk_condition = df['country']=='Satellite locations:'\n", "df['country'][uk_condition] = \"USA\"" ] }, { "cell_type": "code", "execution_count": 508, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.358148", "start_time": "2017-11-19T20:33:08.352060" }, "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('Albania', 8)\n", "('Argentina', 1)\n", "('Australia', 49)\n", "('Bahamas', 1)\n", "('Bangladesh', 27)\n", "('Botswana', 1)\n", "('Brazil', 10)\n", "('Bulgaria', 2)\n", "('Canada', 666)\n", "('Chile', 1)\n", "('China', 5)\n", "('Colombia', 1)\n", "('Curacao', 1)\n", "('Denmark', 8)\n", "('Egypt', 1)\n", "('England', 1274)\n", "('Finland', 2)\n", "('France', 8)\n", "('Honduras', 1)\n", "('India', 40)\n", "('Iran', 9)\n", "('Italy', 2)\n", "('Japan', 112)\n", "('Jordan', 2)\n", "('Lebanon', 5)\n", "('Mexico', 4)\n", "('Nepal', 4)\n", "('Netherlands', 4)\n", "('Pakistan', 2)\n", "('Philippines', 147)\n", "('Puerto Rico', 2)\n", "('Romania', 1)\n", "('Rossija', 2)\n", "('Russia', 3)\n", "('Saudi Arabia', 1)\n", "('Singapore', 4)\n", "('South Africa', 7)\n", "('South Korea', 7)\n", "('Spain', 1)\n", "('Sri Lanka', 4)\n", "('Sweden', 24)\n", "('Switzerland', 8)\n", "('Taiwan', 1)\n", "('Thailand', 4)\n", "('Turkey', 4)\n", "('USA', 13062)\n", "('Utopia', 1)\n" ] } ], "source": [ "country_df = df.groupby('country').size()\n", "for k in country_df.index:\n", " print(k, country_df[k])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Clean up values for the number of students\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to clean the data for the number of students. Not all of the values are numeric, and many of them contain bits of text in addition to the actual number of the students.\n", "To figure out which entries need to be fixed, we need to use a Numeric facet:" ] }, { "cell_type": "code", "execution_count": 509, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.422083", "start_time": "2017-11-19T20:33:08.359549" }, "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
universityendowmentnumFacultynumDoctoralcountrynumStaffestablishednumPostgradnumUndergradnumStudents
count15534155041354160215534316615534151831553213830
unique1085117260163473524488871104698
topPennsylvania State University1.708E9886460USA3373185580077,179 Total800
freq896044808960192130622168970651896564
\n", "
" ], "text/plain": [ " university endowment numFaculty numDoctoral \\\n", "count 15534 15504 13541 602 \n", "unique 1085 1172 601 63 \n", "top Pennsylvania State University 1.708E9 8864 60 \n", "freq 8960 4480 8960 192 \n", "\n", " country numStaff established numPostgrad numUndergrad numStudents \n", "count 15534 3166 15534 15183 15532 13830 \n", "unique 47 352 448 887 1104 698 \n", "top USA 3373 1855 800 77,179 Total 800 \n", "freq 13062 216 8970 651 896 564 " ] }, "execution_count": 509, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": 510, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.431005", "start_time": "2017-11-19T20:33:08.423718" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['70000', '27393', '21160', '894', '15553', '62801', '65234', nan,\n", " '15424', '4533', '5756', '24424', '2426', '7730', '2987', '10477',\n", " '13773', '14388', '14522', '32472', '40829', '17101', '7497',\n", " '2400', '7306', '17500', '4463', '2200', '6448', '4863', '900+',\n", " '13570', '4927', '32739', '1604', '2499', '13785', '16317', '339',\n", " '1874', '10790', '4494', '4836', '4176', '4352', '30819', '6314',\n", " '26851', '3251', '616', '49589', '50116', '3224', '52405',\n", " '~50,000', '7773', '3785', '3822', '5000', '20712', '21000',\n", " '21048', '7787', '17231', '18502', '6158', '11804', '36041',\n", " '38912', '29390', '1207', '3090', '3249', '24370', '2654', '2815',\n", " '15705', '14575', '1144', '10235', '1015', '3657', '1033', '1062',\n", " '9840', '32762', '14806', '10623', '13438', '18900', '23420',\n", " '3192', '2480', '12125', '4800', '30930', '7848', '5560', '12130',\n", " '25063', '25277', '16567', '12270', '2566', '7881', '1972', '1779',\n", " '2715', '4179', '3200', '971', '7538', '10133', '6214', '14196',\n", " '35271', '11034', '17055', '14153', '998', '4238', '9251', '23500',\n", " '4666', '17502', '11722', '47430', '4600', '9000', '7110', '3120',\n", " '20655', '2766', '29290', '3740', '8394', '600', '3927', '1836',\n", " '7313', '4061', '2073', '2183', '1315', '15484', '3345', '735',\n", " '15872', '11308', '15889', '7885', '1066', '1331', '3574', '6840',\n", " '1737', '11604', '6058', '1917', '5079', '23859', '23930', '2110',\n", " '2640', '14000', '2942', '3483', '4854', '5930', '23175', '23855',\n", " '2355', '9625', '4500', '1000', '779', '1760', '1885', '37641',\n", " '3739', '17000', 'http://www.brocku.ca/athletics/quickfacts.php',\n", " '20762', '21080', '757', '1116', '2000', '9748', '14177', '2242',\n", " '2488', '19000', '2305', '16548', '16809', '38140', '13863', '4900',\n", " '10363', '21016', '27008', '18,234 -', '26,101 -', '36000', '4200',\n", " '21091', '110436', '18396', '18448', '9558', '7.07184E7',\n", " '7.71264E7', '10000', '1650', '5097', '5471', '5008', '7542',\n", " '8800', '1791', '23122', '30328', '6080', '2776', '14235', '2636',\n", " '2744', '4250', '8094', '13600', '34767', '3709', '22125', '24875',\n", " '2225', '9203', '27676', '8955', '25215', '29549', '30000', '11659',\n", " '17', '3168', '17849', '22280', '19020', '800', '2725', '32275',\n", " '33825', '4337', '3974', '12114', '23135', '3000', '3485', '29703',\n", " '8831', '30303', '11476', '11842', '602', '8649', '1703', '3500',\n", " '5201', '3719', '42099', '42595', '5760', '6082', '2175', '2231',\n", " '22479', '4183', '47122', '5790', '4014', '28602', '42513', '1118',\n", " '15405', '18442', '5033', '4600800.0', '10534', '28525', '55014',\n", " '7751', '7755', '58698', '7567', '11201', '2100', '24531', '1500',\n", " '1,500+', '46422', '46423', '754', '17290', '39165', '9300',\n", " '41674', '42716', '42761', '42907', '49020', '12445', '19333',\n", " '22405', '24662', '18572', '11593', '3209', '23162', '23590',\n", " '8742', '5552', '5,552 total', '1415', '16822', '7942', '6957',\n", " '11733', '28311', '5200', '13995', '14070', '30823', '31766',\n", " '10688', '12557', '21225', '25313', '12024', '12683', '2124',\n", " '5500', '5514', '5525', '12725', '9517', '35274', '37132', '450',\n", " '23315', '24735', '10502', '20407', '8034', '1210', '46302',\n", " '13959', '24849', '13490', '15265', '8606', '29462', '7615',\n", " '24170', '8603', '8645', '9135', '8878', '9370', '29616', '22707',\n", " '2334', '28018', '33585', '21210', '3439', '19700', '25890',\n", " '26073', '9555', '13351', '13676', '4278', '35200', '14560', '7943',\n", " '8005', '12714', '12002', '23761', '24977', '12400', '82', '3432',\n", " '24594', '2648', '19890', '10820', '6101', '2565', '29894', '29952',\n", " '5260', '22275', '16000', '20222', '21011', '37426', '19391',\n", " '25700', '26960', '5379', '24593', '14765', '2900',\n", " 'faculty to student ratio: 12:1', '9150', '1600', '35204', '4000',\n", " '47966', '1683', '~1,610', '363', '44', '32304', '6945', '51611',\n", " '51721', '51853', '30540', '32500', '13183', '9940', '18475',\n", " '38934', '57409', '24225', '22805', '30461', '44595', '49000',\n", " '4991', '9339', '23085', '23336', '17455', '10383', '12000', '811',\n", " '3246', '21827', '2282', '11646', '15642', '8810', '23600', '24192',\n", " '8298', '23470', '13089', '19415', '19740', '41215', '2836', '3700',\n", " '3770', '3829', '3721', '4525', '5365', '3050', '2345', '7700',\n", " '7257', '2263', '19780', '2589', '2542', '29898', '2120', '2300',\n", " '3600', '4384', '1748', '7928', '7005', '7277', '7740', '2270',\n", " '1130.0', '1841.0', '6919', '47878', '34000', '9106', '8278',\n", " '3378', '7131', '19966', '6964', '5809', '19664', '19993', '3420',\n", " '2658', '2609', '1938', '2442', '42606', '2435', '10074', '20212',\n", " '20,000+', '1460', '6276', '1308', '2719', '40000', '11065', '2660',\n", " '3633', '80022011', '20487', '111329', '852', '131403', '24595',\n", " '28', '1371', '3800', '23883', '3137', '33788', '28394', '30377',\n", " '5.29092E7', '5.49E7', '14706', '720', '2674', '2886', '23103',\n", " '8000', '10482', '6', '1890', '1872', '34255', '53337', '27523',\n", " '18498', '11922', '6398', '15189', '70440', '72254', '23588',\n", " '24378', '8985', '3461', '2', '27209', '22974', '11581', '47800',\n", " '47954', '14769', '15195', '1771', '16355',\n", " 'One MEELLLLIOONNN DOLL HAIRS', '19500', '17499', '10894', '33500',\n", " '15319', '3480', '28094', '8846', '6726', '27606', '20939', '20330',\n", " '21535', '6647', '1454', '20643', '21329', '13925', '4496', '17261',\n", " '17333', '21559', '630', '795', '6290', '8524', '33977', '44817',\n", " '95833', '630 Dickinson School of Law', '800 College of Medicine',\n", " '8,524 World Campus', '44,817 University Park', '795 Great Valley',\n", " '95,833 Total', '33,977 Commonwealth Campuses',\n", " '6,290 PA College of Tech', '25045', '29887', '31040', '27816',\n", " '87274', '17999', '17950', '4197033329', '41710', '15649', '11956',\n", " '28766', '28823', '2724', '45963', '978', '11180', '12312', '19721',\n", " '5233', '29689', '4072', '11867', '38010', '18630', '27195',\n", " '15657', '14820', '13381', '13893', '3245', '15064', '15473',\n", " '7521', '5913', '5998', '33602', '31899', '39271', '39922', '33490',\n", " '24125', '28994', '28290', '1115', '17351', '2284', '17533',\n", " '18971', '39697', '39726', '27269', '19379', '13410', '2203',\n", " '19728', '3005', '3045', '3540', '2459', '18004', '45126', '45954',\n", " '2559', '15446', '32611', '56868', '28203', '20956', '3537', '6654',\n", " '285392012', '15920', '34870', '8289', '25469', '10549', '18762',\n", " '14713', '28091', '55115', '32653', '5152', '380', '14754', '7764',\n", " '15951', '16040', '15536', '9799', '9352', '28457'], dtype=object)" ] }, "execution_count": 510, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.numStudents.unique()" ] }, { "cell_type": "code", "execution_count": 511, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.436990", "start_time": "2017-11-19T20:33:08.433548" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "52909200.0" ] }, "execution_count": 511, "metadata": {}, "output_type": "execute_result" } ], "source": [ "float('5.29092E7')" ] }, { "cell_type": "code", "execution_count": 512, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.508875", "start_time": "2017-11-19T20:33:08.438366" }, "collapsed": false }, "outputs": [], "source": [ "df.numStudents = [str(i).replace('+', '').replace('~', '') for i in df.numStudents]\n", "df.numStudents = [str(i).replace(',', '').replace('-', '').strip() for i in df.numStudents]\n", "df.numStudents = [i.split(' ')[0] for i in df.numStudents]" ] }, { "cell_type": "code", "execution_count": 513, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.514299", "start_time": "2017-11-19T20:33:08.510614" }, "collapsed": true }, "outputs": [], "source": [ "# https://stackoverflow.com/questions/4138202/using-isdigit-for-floats\n", "def isDigit(x):\n", " try:\n", " float(x)\n", " return True\n", " except ValueError:\n", " return False" ] }, { "cell_type": "code", "execution_count": 514, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.520230", "start_time": "2017-11-19T20:33:08.516857" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 514, "metadata": {}, "output_type": "execute_result" } ], "source": [ "isDigit('1130.0')" ] }, { "cell_type": "code", "execution_count": 515, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.524935", "start_time": "2017-11-19T20:33:08.521657" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 515, "metadata": {}, "output_type": "execute_result" } ], "source": [ "str.isdigit('1130.0')" ] }, { "cell_type": "code", "execution_count": 516, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.546994", "start_time": "2017-11-19T20:33:08.526409" }, "collapsed": false }, "outputs": [], "source": [ "df['numStudents'] = df['numStudents'].apply(lambda x: np.float(x) \n", " if isDigit(x)\n", " else np.nan)" ] }, { "cell_type": "code", "execution_count": 517, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.562501", "start_time": "2017-11-19T20:33:08.549218" }, "collapsed": false }, "outputs": [], "source": [ "odds = {}\n", "for i in df['numStudents']:\n", " if not isDigit(i):\n", " try:\n", " odds[i] += 1\n", " except:\n", " odds[i] = 1" ] }, { "cell_type": "code", "execution_count": 518, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.568751", "start_time": "2017-11-19T20:33:08.564461" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "15534" ] }, "execution_count": 518, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df)" ] }, { "cell_type": "code", "execution_count": 519, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.573221", "start_time": "2017-11-19T20:33:08.570082" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{}" ] }, "execution_count": 519, "metadata": {}, "output_type": "execute_result" } ], "source": [ "odds" ] }, { "cell_type": "code", "execution_count": 520, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.578398", "start_time": "2017-11-19T20:33:08.574567" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "70000.0" ] }, "execution_count": 520, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.numStudents.iloc[0]" ] }, { "cell_type": "code", "execution_count": 521, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.585557", "start_time": "2017-11-19T20:33:08.580112" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 70000., 27393., 21160., 894., 15553., 62801., 65234.,\n", " nan, 15424., 4533.])" ] }, "execution_count": 521, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.numStudents.unique()[:10]" ] }, { "cell_type": "code", "execution_count": 522, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.590748", "start_time": "2017-11-19T20:33:08.586886" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(4197033329.0, 2.0)" ] }, "execution_count": 522, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.numStudents.max(), df.numStudents.min()" ] }, { "cell_type": "code", "execution_count": 523, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.594188", "start_time": "2017-11-19T20:33:08.592091" }, "collapsed": false }, "outputs": [], "source": [ "# df.numStudents.astype('float')" ] }, { "cell_type": "code", "execution_count": 524, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.597939", "start_time": "2017-11-19T20:33:08.595660" }, "collapsed": false }, "outputs": [], "source": [ "# pd.to_numeric(df.numStudents)#, errors='ignore')" ] }, { "cell_type": "code", "execution_count": 525, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:08.617419", "start_time": "2017-11-19T20:33:08.600196" }, "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
numStudents
count1.380000e+04
mean3.934969e+05
std3.585477e+07
min2.000000e+00
25%1.065000e+03
50%1.068800e+04
75%3.397700e+04
max4.197033e+09
\n", "
" ], "text/plain": [ " numStudents\n", "count 1.380000e+04\n", "mean 3.934969e+05\n", "std 3.585477e+07\n", "min 2.000000e+00\n", "25% 1.065000e+03\n", "50% 1.068800e+04\n", "75% 3.397700e+04\n", "max 4.197033e+09" ] }, "execution_count": 525, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": 526, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.107441", "start_time": "2017-11-19T20:33:08.619002" }, "collapsed": false }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYQAAAELCAYAAADZW/HeAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAFQVJREFUeJzt3V9sW2fBx/EfaZZ1pH7DHKUiTaQhFGlRWm1yl5ZVgRV7\nvWi4gNzMmoS4icjWIESqbjeTbyqRm6IECEKKgGbK+KMpZhudhDahbUA7DO1cQjpmFwlLFDFXoU6t\nFxxadUn8vBc4fpvSkmOfc3yedN+PFGnn1Pb5xUv8y3me48cfMcYYAQA+9JqCDgAAsAOFAACQRCEA\nACooBACAJAoBAFBBIQAAJFEIAIAKCgEAIElq9uNBs9ms5ubm1N3drYGBAfX19flxGACAh3w7Q9i+\nfbtWVlbU3t7u6PaZTMavKDWxIYcNGSQ7ctiQQbIjhw0ZJDty2JBBsiOHlxkcFcL09LRGRkb07LPP\nbti/sLCgo0ePamxsTKdOnaru7+vr03PPPacvfvGLmpubcxTEhidWsiOHDRkkO3LYkEGyI4cNGSQ7\nctiQQbIjR8MLIRqNKpFIbNhXLpc1MzOjRCKhyclJpVIp5fP5DbdpbW3V6uqqZ2HXOX0C7nS7W/fX\n84TakMGWHFspw51u63SfVzl4Lja/nY0/m7bk8KuIHBVCb2+vWltbN+zL5XLq7OxUR0eHmpubNTAw\noHQ6LUl655139P3vf1/f/e53dfjwYc9D2/Dk2pDBlhxbKcOdbns3/eLzXPiXwZYcfhXCR5yudloo\nFHTixAlNTExIks6ePasLFy7o6aefliSdOXNGuVxOw8PDjg6cyWQ2fFPxeLzW7AAASclksvrfu3fv\n1u7du+t6HF+uMnLidqEvX74cUJr/FwqFVCqVPvQZbMlhQwZbctiQwZYcNmSwJceuXbs8+4O67quM\nwuGwlpaWqtvFYlHhcLjmx8lkMhvaDQBQm2Qy6ckwkuMzBGOMbh5d6unp0eLiogqFgu6//36lUimN\njY3VHMDN6Q0AwLshd0eFMDU1pWw2q1KppNHRUcXjcUWjUQ0PD2t8fFzGGMViMXV3d3sSCgDQeI4K\n4U5/+UciEUUiEVcB1ieXmVQGgPokk0lPRlsCm1Rex5ARALgT+KSyV5hUBgB3Gj6p7BfOEADAnbvm\nDAEAYIfAC4EhIwBwhyEjAIAkhowAAB6jEAAAkiwoBOYQAMAd5hAAAJKYQwAAeIxCAABIsqAQmEMA\nAHeYQwAASGIOAQDgMQoBACCJQgAAVFAIAABJFhQCVxkBgDtcZQQAkMRVRgAAj1EIAABJFAIAoIJC\nAABIohAAABWBFwKXnQKAO1x2CgCQxGWnAACPUQgAAEkUAgCggkIAAEiiEAAAFYFfZVSrYvEe/fOf\n23x7/HvuMVpZ2e7b42+VDLbksCGDLTlsyFBLjo9+tKydOz9oQCJ4xbdCuHHjho4fP64nnnhCe/fu\n9exx8/lmHT58v2ePB8AfL7zwTx06RCFsJb4NGb366qs6cOCAXw8PAPCYozOE6elpzc/Pq62tTRMT\nE9X9CwsLmp2dlTFG0WhUQ0NDkqR3331X3d3d+uAD/joAgK3C0RlCNBpVIpHYsK9cLmtmZkaJREKT\nk5NKpVLK5/OSpGw2qz//+c9KpVJ66623vE8NAPCcozOE3t5eFQqFDftyuZw6OzvV0dEhSRoYGFA6\nnVZXV5eefPJJSdLp06cVCoU8jgwA8EPdk8rFYlHt7e3V7XA4rFwut+E2Bw8erD8ZAKChArvsNJPJ\nbFidLx6POzqb2ObfFacAPNTU1OTbCEFLS4sVow+25Lh5xWg3C4bWXQjhcFhLS0vV7WKxqHA47Pj+\ntwtdKpU2vd/a2n3OQwIITLlcVqm07Mtjh0IhR68XfrMhRygUavxqp8YYGWOq2z09PVpcXFShUNDq\n6qpSqZT6+/trDsDnIQCAOw39PISpqSlls1mVSiWNjo4qHo8rGo1qeHhY4+PjMsYoFoupu7u75gB8\nHgIAuOPVGYKjQhgbG7vt/kgkokgk4irA+lyCV98QAHzYJJNJT/64DnwtI84QAMAdPjENAOCpwAuB\nSWUAcKehk8p+YsgIANxhyAgA4KnAC4EhIwBwhyEjAIAkhowAAB4LvBAYMgIAdxgyAgBIYsgIAOAx\nCgEAIIlCAABUBF4ITCoDgDtMKgMAJDGpDADwGIUAAJBEIQAAKgIvBCaVAcAdJpUBAJKYVAYAeIxC\nAABIohAAABUUAgBAEoUAAKigEAAAkiwoBN6HAADu8D4EAIAk3ocAAPAYhQAAkEQhAAAqKAQAgCQK\nAQBQQSEAACT5dNlpPp/Xa6+9puXlZT388MOKxWJ+HAYA4CFfzhC6uro0MjKio0eP6sKFC34cAgDg\nMUdnCNPT05qfn1dbW5smJiaq+xcWFjQ7OytjjKLRqIaGhqr/dv78eb3xxht6/PHHvU8NAPCcozOE\naDSqRCKxYV+5XNbMzIwSiYQmJyeVSqWUz+er/97f36/nnntOv/71rz0NDADwh6MzhN7eXhUKhQ37\ncrmcOjs71dHRIUkaGBhQOp1WV1eXstmszp07p5WVFZalAIAtou5J5WKxqPb29up2OBxWLpeTJPX1\n9amvr899OgBAwwS2uF0mk9mwOl88HlcoFNr0ftu2+ZkKgFeampoc/U7Xo6WlxbfH3oo5bl4x2s2C\noXUXQjgc1tLSUnW7WCwqHA47vv/tQpdKpU3vt7Z2n/OQAAJTLpdVKi378tihUMjR64XfbMgRCoUa\nv9qpMUbGmOp2T0+PFhcXVSgUtLq6qlQqpf7+/poD8HkIAOBOQz8PYWpqStlsVqVSSaOjo4rH44pG\noxoeHtb4+LiMMYrFYuru7q45AJ+HAADueHWG4KgQxsbGbrs/EokoEom4CrA+l+DVNwQAHzbJZNKT\nP675xDQA2OLumk9MYw4BANzhM5UBAJLuojMEAIAdAi8EhowAwB2GjAAAkhgyAgB4jEIAAEiyoBCY\nQwAAd5hDAABIYg4BAOAxCgEAIMmCQmAOAQDcYQ4BACCJOQQAgMcoBACAJAoBAFBBIQAAJFlQCFxl\nBADucJURAEASVxkBADxGIQAAJFEIAIAKCgEAIIlCAABUUAgAAEkWFALvQwAAd3gfAgBAEu9DAAB4\njEIAAEiiEAAAFRQCAEAShQAAqPDtKqN0Oq35+Xldv35dsVhMDz30kF+HAgB4wLdC2Ldvn/bt26d/\n/etf+tGPfkQhAIDlHBfC9PS05ufn1dbWpomJier+hYUFzc7OyhijaDSqoaGhDfd7+eWXdfjwYe8S\nAwB84XgOIRqNKpFIbNhXLpc1MzOjRCKhyclJpVIp5fP56r//5Cc/0d69e/WJT3zCs8AAAH84LoTe\n3l61trZu2JfL5dTZ2amOjg41NzdrYGBA6XRakvT666/rvffe09mzZ/Xmm296mxoA4DlXcwjFYlHt\n7e3V7XA4rFwuJ0kaHBzU4OCgu3QAgIYJbC2jTCazYTGmeDyuUCi06f22bfMzFQCvNDU1OfqdrkdL\nS4tvj70Vc9y8QKib9eFcFUI4HNbS0lJ1u1gsKhwOO7rv7UKXSqVN77e2dl9tIQEEolwuq1Ra9uWx\nQ6GQo9cLv9mQIxQKBbO4nTFGxpjqdk9PjxYXF1UoFLS6uqpUKqX+/v6aArD8NQC40/Dlr6emppTN\nZlUqlTQ6Oqp4PK5oNKrh4WGNj4/LGKNYLKbu7u6aArD8NQC449UZguNCGBsbu+3+SCSiSCTiSRgA\nQHACX8uIISMAcIdPTAMASLqLPjGNMwQAcIczBACApLvoDAEAYIfAC4EhIwBwhyEjAIAkhowAAB6j\nEAAAkiwoBOYQAMAd5hAAAJKYQwAAeIxCAABIsqAQmEMAAHeYQwAASGIOAQDgMQoBACCJQgAAVFAI\nAABJFhQCVxkBgDtcZQQAkMRVRgAAj1EIAABJFAIAoIJCAABIohAAABWBFwKXnQKAO1x2CgCQxGWn\nAACPUQgAAEkUAgCggkIAAEiiEAAAFRQCAECST5edXrlyRa+88oquXbumY8eO+XEIAIDHfDlD2Llz\np44cOeLHQwMAfOLoDGF6elrz8/Nqa2vTxMREdf/CwoJmZ2dljFE0GtXQ0JBvQQEA/nJ0hhCNRpVI\nJDbsK5fLmpmZUSKR0OTkpFKplPL5vC8hAQD+c1QIvb29am1t3bAvl8ups7NTHR0dam5u1sDAgNLp\ntCRpeXlZP/jBD3Tp0iWdOnXK+9QAAM/VPalcLBbV3t5e3Q6Hw8rlcpKkHTt2aGRkxH06AEDDBLa4\nXSaT2bA6XzweVygU2vR+27b5mQqAV5qamhz9TtejpaXFt8feijluXjHazYKhdRdCOBzW0tJSdbtY\nLCocDju+/+1Cl0qlTe+3tnaf85AAAlMul1UqLfvy2KFQyNHrhd9syBEKhRq/2qkxRsaY6nZPT48W\nFxdVKBS0urqqVCql/v7+mgPweQgA4E5DPw9hampK2WxWpVJJo6OjisfjikajGh4e1vj4uIwxisVi\n6u7urjkAn4cAAO54dYbgqBDGxsZuuz8SiSgSibgKsD6X4NU3BAAfNslk0pM/rvnENADY4vjENACA\npwIvBCaVAdxJPn+v0umPVr/y+XuDjmSlhk4q+4khIwB3cvnyNg0Nfay6ferU/6qrK8BAlmLICADg\nqcALgSEjAHCHISMAgCSGjAAAHgu8EBgyAgB3GDICAEhiyAgA4DEKAQAgiUIAAFQEXghMKgOAO0wq\nAwAkMakMAPAYhQAAkEQhAAAqAi8EJpUBwB0mlQEAkphUBgB4jEIAAEiiEAAAFRQCAEAShQAAqKAQ\nAACSLLjsNJPJKJPJeHbZFICtJ5+/V5cvb/uP/Tdu8DerE8lk0pNL+AMvBN6HAODy5W0aGvrYf+z/\n4Q9LAaTZengfAgDAUxQCAEAShQAAqKAQAACSKAQAQAWFAACQ5NNlpzdu3NDJkyd1zz33qK+vT5/+\n9Kf9OAwAwEO+nCGcO3dOBw4c0FNPPaXz58/7cQgAgMccnSFMT09rfn5ebW1tmpiYqO5fWFjQ7Oys\njDGKRqMaGhqSJBWLRT3wwAOSpKYmRqUAYCtw9GodjUaVSCQ27CuXy5qZmVEikdDk5KRSqZTy+bwk\nqb29XVevXpUkGWM8jgwA8IOjQujt7VVra+uGfblcTp2dnero6FBzc7MGBgaUTqclSfv379fZs2d1\n8uRJPfLII96nBgB4ru5J5WKxqPb29up2OBxWLpeTJN177736yle+8l/vv76o3bp4PK5du3Ztetxd\nuyROOoCt4H8qX5v7whfu9Hsd0pe+dPP2xypfUigUchfPIzbkSCaT1f92sz5cYAP8u3fvVjwer37d\n/A1txult73S7W/ffvO32sRuZwZYcWynDnW672T6ei7v7uXCawZYct2a4+bXUzWKhdRdCOBzW0tJS\ndbtYLCocDtcdpBZOv+E73e7W/fU8gTZksCXHVspwp9s63edVDp6LzW9n48+mLTl8WyHaOPT3v//d\nHDt2rLq9trZmvvrVr5orV66YlZUV8+yzz5q//e1vTh/uP8zNzdV9Xy/ZkMOGDMbYkcOGDMbYkcOG\nDMbYkcOGDMbYkcPLDNuOHz9+fLPSmJqaUjKZ1NWrV/Xmm2+qtbVVn/zkJ9XZ2anvfOc7+sUvfqGD\nBw9q//79rspp586dru7vFRty2JBBsiOHDRkkO3LYkEGyI4cNGSQ7cniV4SPGMEULAGAtIwBABYUA\nAJBEIQAAKigEAIAkn5a/9ooNy2hfuXJFr7zyiq5du6Zjx441/Pjr0um05ufndf36dcViMT300EMN\nz5DP5/Xaa69peXlZDz/8sGKxWMMzrLtx44aOHz+uJ554Qnv37m348bPZrObm5tTd3a2BgQH19fU1\nPIP077XC5ubmdO3aNfX09Oixxx5reIY//elPevvtt7W2tqZ8Pq+vf/3rDc8g/fu9UM8//7x27Nih\nj3/849XFNhvp/fff109/+lOFQiHt2bNHjz76aMOOfetrVT2vn1afIdiwjPbOnTt15MiRQI59s337\n9unpp5/WyMiIfvvb3waSoaurSyMjIzp69KguXLgQSIZ1r776qg4cOBBohu3bt2tlZWXDEi6Nlk6n\ndfXqVTU3NzfsjaG36u3t1cjIiB555BEdPHgwkAyS9Ne//lWPPvqojhw5okuXLgWSYWFhQYODg/ry\nl7+sM2fONPTYt75W1fP62dAzBBuW0a41g1/qzfHyyy/r8OHDgWU4f/683njjDT3++OOeZKgnx7vv\nvqvu7m598MEHgWXo6+tTX1+f/vGPf+iFF17Q1772tUByXL58WQ8++KAOHTqkb37zm9qzZ0/DM6z7\nzW9+o9HRUdfHrzfHgw8+qBMnTuhXv/qVPvOZzwSS4bHHHtNLL72k8+fPa3l5uaHHvlVdr5+evcXN\ngYsXL5q//OUv5plnnqnuu907nt9//31jjDFnzpwxv//9740xxnz7298OJMO6yclJT47vJsePf/xj\n88c//jHQDOtOnDgRWI4XX3zRzM7OmvHxcfONb3wjkAzrVlZWPP3ZqOd35He/+50xxphvfetbgWQw\nxphCoWC+973veXL8enP8/Oc/NxcvXjTGePf7Wu/PxdramuufTbevVadPn6759bOhZwi9vb0qFAob\n9t28jLak6jLaXV1d2r9/v2ZmZjQ/P+/ZMtq1ZlheXtaLL76oS5cu6dSpU56dOdSa4/XXX9d7772n\n69eva3FxUYcOHWp4hmw2q3PnzmllZcXTtVRqzfHkk09Kkk6fPu3ZSpO1ZnjnnXe0sLCga9eueXbG\nVk+OT33qU3r++ed18eJFz+Yxas0gSb/85S/12c9+1pPj15tjz549eumll/T222979s7dWjMUCgX9\n7Gc/040bN/T5z3++oce+9bXqc5/7nE6ePFnT62fgk8pul9H2O8OOHTs0MjLie4bNcgwODmpwcDDQ\nDOvDJI3w33Ks83u8+r9l2L9/v+ulWrzI0dLS0pA5rs3+f8Tjcd8zbJbjgQce0DPPPBNoho6ODj31\n1FOBHPt2r1W1vn5aPakMAGicwAshyGW0bcpgSw4bMtiSw4YMtuSwIYMtOYLM4PexG14IxpgNn7Pc\n09OjxcVFFQoFra6uKpVKqb+//67PYEsOGzLYksOGDLbksCGDLTmCzNDoYzd0tdOpqSlls1mVSiW1\ntbUpHo8rGo3qD3/4Q/Uyqlgs5uslnzZksCWHDRlsyWFDBlty2JDBlhxBZgji2Cx/DQCQZMEcAgDA\nDhQCAEAShQAAqKAQAACSKAQAQAWFAACQRCEAACooBACAJAoBAFDxfwVqvyHrO9tWAAAAAElFTkSu\nQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.hist(df.numStudents.dropna(), bins = 200, color='blue')\n", "plt.yscale('log')\n", "plt.xscale('log')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Clean up values for the endowment\n", "\n", "First remove the numeric facet for numStudents and create a new numeric facet for endowment. Select only the non-numeric values, as was done for the number of students.\n", "Already we see issues like \"US$1.3 billion\" and \"US $186 million\"" ] }, { "cell_type": "code", "execution_count": 527, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.119962", "start_time": "2017-11-19T20:33:10.109136" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 15\n", "2 121\n", "3 4700000\n", "4 16586100\n", "5 16586100\n", "6 40200750\n", "7 40200750\n", "8 40200750\n", "9 40200750\n", "10 40200750\n", "11 40200750\n", "12 40200750\n", "13 40200750\n", "14 40200750\n", "15 40200750\n", "16 40200750\n", "17 40200750\n", "18 562000000\n", "19 562000000\n", "22 NaN\n", "23 NaN\n", "24 NaN\n", "25 NaN\n", "26 NaN\n", "27 1.3E7\n", "28 1.0E7\n", "29 3.5E8\n", "30 4.5E8\n", "31 0.0\n", "32 0.0\n", " ... \n", "74996 $2,17 billion\n", "74997 $2,17 billion\n", "74998 $2,17 billion\n", "74999 $2,17 billion\n", "75000 $2,17 billion\n", "75001 $2,17 billion\n", "75002 $2,17 billion\n", "75003 $2,17 billion\n", "75004 US $401.2 million\n", "75006 US $213.2 million\n", "75007 US $381 million\n", "75008 US $64.5 million\n", "75009 US $64.5 million\n", "75010 US $64.5 million\n", "75011 US $716.8 million\n", "75012 US $716.8 million\n", "75013 US $716.8 million\n", "75014 US $716.8 million\n", "75015 US $716.8 million\n", "75016 US $716.8 million\n", "75017 US $716.8 million\n", "75018 US $716.8 million\n", "75019 US $716.8 million\n", "75020 US $716.8 million\n", "75021 US $716.8 million\n", "75022 US $716.8 million\n", "75023 US $716.8 million\n", "75024 US $716.8 million\n", "75025 US $716.8 million\n", "75026 US $716.8 million\n", "Name: endowment, dtype: object" ] }, "execution_count": 527, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['endowment']" ] }, { "cell_type": "code", "execution_count": 528, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.125125", "start_time": "2017-11-19T20:33:10.121413" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 528, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(odds.values())" ] }, { "cell_type": "code", "execution_count": 529, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.196606", "start_time": "2017-11-19T20:33:10.126540" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('US$226 million', 1),\n", " ('US $6.44 billion', 16),\n", " ('1,5 billion \\xe2\\x82\\xac', 1),\n", " ('US $988.626', 1),\n", " ('USD$130.0 million', 1),\n", " ('S$830 million', 2),\n", " ('$772.8 million AUD', 2),\n", " ('CHF 193.64 million annual budget', 4),\n", " ('US $620 million', 2),\n", " ('US $7.2 billion', 1),\n", " ('US $64.5 million', 3),\n", " ('US $612 million', 4),\n", " ('US$668 million', 1),\n", " ('US $12,197,377', 1),\n", " ('US $13,111,000', 1),\n", " ('US$13.3 million', 1),\n", " ('US $1.77 billion', 32),\n", " ('R$ 43.518 million', 1),\n", " ('Rs1 billion', 1),\n", " ('U.S. $ n/a', 1),\n", " ('C$143 million', 1),\n", " ('US $1.786 billion', 16),\n", " ('US $ 398.9 million', 1),\n", " ('US$ 270.6 million', 2),\n", " ('State: US$101 million', 1),\n", " ('US $388.7 million', 6),\n", " ('US $24,011,651', 4),\n", " ('USD $54.500 million', 1),\n", " ('US$ 12 million', 1),\n", " ('US $29 million', 1),\n", " ('US$582 million Beneficiary of Duke Endowment', 1),\n", " ('US$ 800 million', 1),\n", " ('US $380 million', 2),\n", " ('US$28.6 million', 1),\n", " ('US$49 million', 1),\n", " ('INR 40 Crore', 4),\n", " ('US $16.7 billion', 1),\n", " ('\\xc2\\xa3 61.3M', 16),\n", " ('US$ 176.50 million', 1),\n", " ('$138 million;', 1),\n", " ('US $1.55 billion', 32),\n", " ('US$14.543 million', 1),\n", " ('US$ 4.3 million', 1),\n", " ('US$ 687 million', 1),\n", " ('US$2.1 million', 1),\n", " ('US $12.3 million', 1),\n", " ('US $511 million', 1),\n", " ('US$144 million', 1),\n", " ('Indian Rupees', 16),\n", " ('US$665 million', 8),\n", " ('$21.9 million USD http://colleges.usnews.rankingsandreviews.com/best-colleges/sarasota-fl/ringling-college-12574',\n", " 1),\n", " ('US$2.224 billion in 2006', 48),\n", " ('C$ 77.5 million', 3),\n", " ('US $51 million', 1),\n", " ('US$40.3 million', 1),\n", " ('US$96.8 million', 1),\n", " ('US $14.8 million', 1),\n", " ('US$ 66.15 million', 1),\n", " ('US$59,600,000', 8),\n", " ('US$45,478,931', 1),\n", " ('US$25 million', 1),\n", " ('US$52 million', 1),\n", " ('US$79 million', 1),\n", " ('US $66 million', 1),\n", " ('US$2.002 billion in 2012', 8),\n", " ('Barstow College Foundation', 1),\n", " ('US$ billion', 2),\n", " ('US$20.8 million', 1),\n", " ('US $1.5 million', 2),\n", " ('US$70,025,283', 1),\n", " ('US $ 120 million', 1),\n", " ('US$190 million', 1),\n", " ('\\xe2\\x82\\xb14.46 billion', 32),\n", " ('US$86.7 million', 2),\n", " ('US $77.9 million', 1),\n", " ('US$31.1 million', 8),\n", " ('US$472 million', 1),\n", " ('US$719.85 million', 1),\n", " ('\\xe0\\xb8\\xbf5.95 billion', 2),\n", " ('US$ 5.28 billion', 9),\n", " ('US $14 million', 1),\n", " ('\\xc2\\xa3 57.8M', 16),\n", " ('No available figures', 2),\n", " ('US$16.4 million', 1),\n", " ('US $266.2 million', 1),\n", " ('\\xc2\\xa361.3M', 16),\n", " ('US$ 155.9 million', 1),\n", " ('US$59 million', 1),\n", " ('US $14.5 million', 1),\n", " ('US$ 16.5 billion', 2),\n", " ('US$1.4 billion', 1),\n", " ('US$80.0 million', 1),\n", " ('US $867 million', 4),\n", " ('US$28,610,629', 1),\n", " ('Unknown, depends on donations by businesses affiliated with the G\\xc3\\xbclen movement.',\n", " 4),\n", " ('US $1.9 billion', 2),\n", " ('US $590 million', 1),\n", " ('US$541 million', 1),\n", " ('US $85 million', 1),\n", " ('$1.008 billion USD', 2),\n", " ('US$513.4 million', 1),\n", " ('US$108.0 million', 6),\n", " ('US$ 349.7 million', 1),\n", " ('US$110.2 million', 4),\n", " ('US $16 million', 1),\n", " ('US$5.35 billion', 16),\n", " ('US$1.24 billion', 2),\n", " ('US $10.1 million', 1),\n", " ('S$88.62 million', 1),\n", " ('US $1.87 billion', 1),\n", " ('US$330.2 million', 2),\n", " ('US$1.707 billion', 2),\n", " ('US$283 million', 1),\n", " ('US$-- billion', 2),\n", " ('US $67 million', 1),\n", " ('US $213 million', 1),\n", " ('Unreported', 1),\n", " ('US $3.5 billion', 1),\n", " ('US $165.9 million', 1),\n", " ('US $254 million', 1),\n", " ('US $7.1 million', 1),\n", " ('US $7.80 billion', 48),\n", " ('US $704 million', 2),\n", " ('US $513 million', 2),\n", " ('R2,173.4 million', 3),\n", " ('USD $71.5 million', 1),\n", " ('C$5,930,722', 2),\n", " ('US $ 807 million', 1),\n", " ('US$ 518 million', 1),\n", " ('US$123,456,789', 1),\n", " ('Ministry of Human Resources Development, Government of India', 1),\n", " ('US$36.389 million', 1),\n", " ('US $329 million', 6),\n", " ('US$127.29 million', 1),\n", " ('US $42 million', 1),\n", " ('USD $41 million', 2),\n", " ('$CAD145 million', 1),\n", " ('US$27 million', 1),\n", " ('US$474 million', 1),\n", " ('US$ 1.25 billion', 6),\n", " ('$44 million USD', 1),\n", " ('US$73.9 Million', 1),\n", " ('US $13.6 million', 4),\n", " ('$30+ million', 1),\n", " ('US $117.4 million', 1),\n", " ('$22.5M', 1),\n", " ('NT$ 1.2 billion', 1),\n", " ('US $400.5 million', 1),\n", " ('over $100 million', 1),\n", " ('US$ 657 million', 2),\n", " ('C$920.7 million', 8),\n", " ('US$5.1 billion', 4),\n", " ('US$47.46 million', 1),\n", " ('US$ 10.0 million', 1),\n", " ('US $88.5 Million', 1),\n", " ('U.S. $19.2 million', 2),\n", " ('C$1.518 billion', 32),\n", " ('US $73.0 million', 1),\n", " ('US $27.4 million', 16),\n", " ('US$109.4 million', 1),\n", " ('US$52,177,934', 1),\n", " ('US$1.04 billion', 1),\n", " ('U.S. $25.7 million', 2),\n", " ('US $63 million', 1),\n", " ('US$675 million', 1),\n", " ('AUD $1.0082 Bn', 1),\n", " ('US $80 million', 1),\n", " ('Grants', 1),\n", " ('U.S. $30.3 million', 1),\n", " ('US$27.9 million', 2),\n", " ('US$ 10.3 billion', 1),\n", " ('US$138 million', 1),\n", " ('US$1 billion', 2),\n", " ('US $2.18 Billion', 10),\n", " ('US$ 962.196 million', 2),\n", " ('$72 million;', 1),\n", " ('US$2.54 billion', 16),\n", " ('US$14.4 billion', 2),\n", " ('$2,17 billion', 16),\n", " ('US$8.6 million', 1),\n", " ('US$65 million', 1),\n", " ('US $246 million', 1),\n", " ('US$847.8 million', 6),\n", " ('$800M', 4),\n", " ('US $45.3 million', 1),\n", " ('US $6.56 billion', 48),\n", " ('US$785 million', 8),\n", " ('US $21,727,123', 1),\n", " ('C$ 32.5 million', 2),\n", " ('$155.5 Million as of May 2006', 1),\n", " ('US $208.9 million', 3),\n", " ('US $7,313,220', 1),\n", " ('PhP 395.024 Million', 1),\n", " ('C$807.5 million', 8),\n", " ('US$ 2.5 Billion', 10),\n", " ('US$165 million', 2),\n", " ('US $2.59 billion', 2),\n", " ('US $185.7 million', 1),\n", " ('US$ 7.8 billion', 1),\n", " ('US$3 million', 1),\n", " ('US $2.22 billion', 2),\n", " ('US$ 7.0 billion', 1),\n", " ('US $41,617,510', 2),\n", " ('US$63.6 million', 1),\n", " ('US$40.5 million', 1),\n", " ('US $208.9 million parent institution', 16),\n", " ('US$30.2 million', 1),\n", " ('US $13,330,712.40', 4),\n", " ('US $ 665 million', 8),\n", " ('$429.95 million 2011', 1),\n", " ('http://dbpedia.org/resource/Public', 1),\n", " ('USD $622 million', 1),\n", " ('US$1.77 billion', 32),\n", " ('A$1.2371 billion', 8),\n", " ('US $955 million', 6),\n", " ('S$88,619 million', 1),\n", " ('US $ 237.6', 16),\n", " ('C$13.1 million', 1),\n", " ('US $30.1 million', 2),\n", " ('US$ 320 million', 1),\n", " ('US$1.88 billion', 1),\n", " ('US$ $19.2 million', 2),\n", " ('US$ 170.2 million', 1),\n", " ('HK $5.3 billion', 1),\n", " ('US $5.1 billion', 4),\n", " ('US $908 million', 1),\n", " ('approx. $100 million', 1),\n", " ('\\xe2\\x82\\xb1233 million', 32),\n", " ('US$140.9 million', 32),\n", " ('US$631,947,260.', 1),\n", " ('US $38.7 million', 1),\n", " ('$31.4 million USD', 2),\n", " ('U.S. $201.9 million', 2),\n", " ('US $478 million', 4),\n", " ('US $25.1 million', 1),\n", " ('USD $175,552,772', 1),\n", " ('R$189,976,461.00', 1),\n", " ('US$8.8 billion', 1),\n", " ('US$1.3 billion', 19),\n", " ('--09-30', 1),\n", " ('US $6.3 billion', 1),\n", " ('US $3.6 million', 1),\n", " ('US$10,386,829', 1),\n", " ('US$ 1.02 billion', 16),\n", " ('US $6.1 million', 1),\n", " ('US $45 million', 1),\n", " ('US $754.1 million', 48),\n", " ('US $1.12 billion', 49),\n", " ('US $640 million', 3),\n", " ('US $196 million', 1),\n", " ('US $145.7 million', 1),\n", " ('US$522 million', 1),\n", " ('US$63.1 million', 2),\n", " ('US $ 5.24 billion', 1),\n", " ('C$ 121.9 million', 8),\n", " ('US $ 39,009,814', 16),\n", " ('US $122.4 million', 1),\n", " ('US$ 65.12 million', 1),\n", " ('--06-30', 12),\n", " ('US$67,364,000', 8),\n", " ('US$ 19.3 billion', 1),\n", " ('Government of India, Government of Maharashtra', 2),\n", " ('US$931 thousand', 1),\n", " ('US$ 2.93 billion', 3),\n", " ('US $270 million', 1),\n", " ('\\xe0\\xb8\\xbf4,171,014,200', 2),\n", " ('US$149.4 million', 1),\n", " ('US $69 million', 1),\n", " ('US$ 22.7 million', 1),\n", " ('US $16.5 Million', 1),\n", " ('US$ $14.543 million', 1),\n", " ('Public', 2),\n", " ('US$12.8 million', 1),\n", " ('US$30.1 million', 1),\n", " ('US$2.7 million', 36),\n", " ('C$34 million', 2),\n", " ('US$532 million', 1),\n", " ('US $25.2 Million', 1),\n", " ('R$229,647,000', 2),\n", " ('$43 million 2011', 1),\n", " ('$90000 million AUD', 2),\n", " ('US$69,551,090', 32),\n", " ('US$105.9 million', 32),\n", " ('$55.0 million in 2010', 1),\n", " ('US$1.008 billion', 2),\n", " ('P334 million', 1),\n", " ('Other: $48 million', 1),\n", " ('US $1.017 billion', 48),\n", " ('$173.4 million 2011', 1),\n", " ('US $237.6 million', 16),\n", " ('Government Funded', 2),\n", " ('CAN$200 million', 1),\n", " ('US $791,231,000', 1),\n", " ('USD $1.04 billion', 1),\n", " ('US$3.1 million', 1),\n", " ('$13 million Annual Report', 1),\n", " ('US$56.1 million', 32),\n", " ('USD $75,500,000', 2),\n", " ('$29 million 2011', 1),\n", " ('US$93.7 million', 1),\n", " ('US $17,167,060', 1),\n", " ('$179 million as of May 2010', 1),\n", " ('US$197.4 million', 8),\n", " ('US$357 million', 1),\n", " ('US$26.8 million', 1),\n", " ('US $1.519 billion', 24),\n", " ('US $975 million', 1),\n", " ('US $206.2 million', 1),\n", " ('US$1.14 billion', 2),\n", " ('US $374.4million', 4),\n", " ('C$230 million', 1),\n", " ('US $31.8 million', 1),\n", " ('Rs300 Crore per year', 1),\n", " ('US$73.1 million', 1),\n", " ('US$2.9 billion', 1),\n", " ('US $ 4.45 billion', 1),\n", " ('$65.8 M http://www.nacubo.org/Images/All%20Institutions%20Listed%20by%20FY%202007%20Market%20Value%20of%20Endowment%20Assets_2007%20NES.pdf',\n", " 1),\n", " ('CHF 183 million annual budget', 4),\n", " ('$70,025,283 USD', 1),\n", " ('$38 M annually', 1),\n", " ('\\xe2\\x82\\xb1286 million', 1),\n", " ('\\xc2\\xa35.9M', 2),\n", " ('US $134.6 million parent institution', 1),\n", " ('US $1.449 billion', 16),\n", " ('US $1.218 billion', 1),\n", " ('R$ 200,000,000', 1),\n", " ('US$ 50 million', 1),\n", " ('US $49 million', 1),\n", " ('Approx. 450,00,00,000 Rs. per annum', 2),\n", " ('US$113.3 million', 1),\n", " ('US $186 million', 1),\n", " ('US $44 Million', 1),\n", " ('CAD$712 million', 48),\n", " ('US $13.8 billion', 2),\n", " ('USD $241.7 million', 1),\n", " ('C$ 94.7 million', 2),\n", " ('US$29.9 million', 1),\n", " ('US $239 million', 1),\n", " ('US$6.8 billion', 2),\n", " ('US $87.4 million', 1),\n", " ('US$13,324,363', 2),\n", " ('A$1.3 billion', 1),\n", " ('US $115.4 million', 1),\n", " ('US $1.14 billion', 2),\n", " ('US$2.8 billion', 1),\n", " ('R429.6 million', 1),\n", " ('US $716.8 million', 16),\n", " ('US$50 million', 1),\n", " ('US$664.6 million', 1),\n", " ('US $127.679 million', 1),\n", " ('US $144 million', 1),\n", " ('US $624.7 million', 1),\n", " ('A$829 million', 1),\n", " ('C$33 million', 2),\n", " ('A$882.8 million', 1),\n", " ('US $384.3 million', 2),\n", " ('$46.3 million 2011', 1),\n", " ('US$133,171,000', 4),\n", " ('US $1.295 billion', 16),\n", " ('US $467 million', 1),\n", " ('US$101 million', 4),\n", " ('C$99M', 1),\n", " ('US $1.058 Billion', 1),\n", " ('C$1.437 billion', 32),\n", " ('US$9.6 million', 1),\n", " ('US $544 million', 1),\n", " ('US $1.25 billion', 6),\n", " ('US $174.8 Million', 1),\n", " ('US $6.58 billion', 16),\n", " ('US$8.1 million', 8),\n", " ('USD$2 billion', 2),\n", " ('US$95 million', 4),\n", " ('US$28.9 million', 2),\n", " ('\\xe2\\x82\\xb18.5 billion', 1),\n", " ('US $603 million', 11),\n", " ('US $25.8 million', 4),\n", " ('US $368.7 million', 1),\n", " ('-', 1),\n", " ('US $375.2 million', 1),\n", " ('US$87.1 million', 1),\n", " ('C$9,576,604', 1),\n", " ('US$15.7 million', 1),\n", " ('US $880.3 million', 8),\n", " ('A$72.3 million', 1),\n", " ('US$ 1.6 billion', 1),\n", " ('C$114 million', 1),\n", " ('$19.2 million 2011', 1),\n", " ('US$ 61.58 million', 1),\n", " ('$139 million .', 4),\n", " ('Govt., UGC', 27),\n", " ('US $5.4 billion', 8),\n", " ('US$32 billion', 6),\n", " ('$147M', 1),\n", " ('US $2.64 billion', 1),\n", " ('US $1.62 billion', 9),\n", " ('US $381 million', 1),\n", " ('US$ 152 million', 1),\n", " ('US $37,524,837', 1),\n", " ('$100 million +', 2),\n", " ('--02-29', 324),\n", " ('US $28.9 million', 2),\n", " ('Approx. 450,00,00,000 Rs.', 2),\n", " ('U.S. $9.6 million', 1),\n", " ('US$54.9 million', 1),\n", " ('US$ 199,876,589', 1),\n", " ('US $284.3 million', 2),\n", " ('US$37 million', 2),\n", " ('US $481 million', 1),\n", " ('US$848.3 million', 1),\n", " ('US $30.29 million', 1),\n", " ('US$ 1.194 billion', 16),\n", " ('US$170.1 million', 2),\n", " ('US $469.2 million', 16),\n", " ('$12.5 million.', 1),\n", " ('US$2.0 billion', 3),\n", " ('US$29,985,851', 1),\n", " ('U.S.$4.6 billion', 16),\n", " ('US $700 million', 1),\n", " ('US $213.2 million', 1),\n", " ('US $256 million', 1),\n", " ('USD $52.0 million', 1),\n", " ('US$10 billion', 1),\n", " ('$731.28 million 2011', 1),\n", " ('US $8.76 million', 1),\n", " ('CAD$1.05 billion', 48),\n", " ('US$3,254,000', 2),\n", " ('US $70,000,000', 2),\n", " ('Over $405 million', 1),\n", " ('Grants/private funding', 1),\n", " ('US $1.703 billion', 24),\n", " ('per annum', 7),\n", " ('US$1.65 billion', 3),\n", " ('US$525 million', 1),\n", " ('US$ 739 million', 1),\n", " ('US$ 96.6 million', 1),\n", " ('US$2.503 billion in 2012', 48),\n", " ('US$453 million', 1),\n", " ('US $22.4 million', 2),\n", " ('US$6.58 billion', 16),\n", " ('US$63.35 million', 2),\n", " ('US $720 million', 2),\n", " ('PH\\xe2\\x82\\xb1 4 billion', 32),\n", " ('US$ 27 million', 2),\n", " ('US$244 million', 6),\n", " ('US$6.8 billion', 16),\n", " ('C$164.6 Million', 8),\n", " ('US $1 Billion+', 1),\n", " ('US $401.2 million', 1),\n", " ('USD $12,820,445', 1),\n", " ('US $529 million', 1),\n", " ('US$2.002 billionin 2012', 8),\n", " ('R$ 1.078.315.210,14', 1),\n", " ('US $283 million', 1),\n", " ('US $169.3 million', 1),\n", " ('US $3.7 million', 1),\n", " ('US $174,505', 1),\n", " ('U.S. $436.7 million', 1),\n", " ('US $815.1 million', 48),\n", " ('US $47 million', 4),\n", " ('US $813,882', 1),\n", " ('US$1.5 billion', 32),\n", " ('US$ 51 million', 1),\n", " ('US$411 million', 1),\n", " ('USD$246.1 million', 1),\n", " ('US $47 Million', 1),\n", " ('$28.07 Million USD', 1),\n", " ('$603.6 million parent institution', 3),\n", " ('US $7.8 billion', 48),\n", " ('US $12 million', 2),\n", " ('US $37,360,494', 1),\n", " ('US$ 17.1 billion', 2),\n", " ('Total: approx. $149 million', 1),\n", " ('US $130 million', 1),\n", " ('R$ 653.010.340,45', 1),\n", " ('US$891 million', 1),\n", " ('US $650,000', 1),\n", " ('US$61.5 million', 1),\n", " ('US$350,500,000', 2),\n", " ('U.S. $226 million', 1),\n", " ('US $39 million', 4),\n", " ('more than $100 million', 2)]" ] }, "execution_count": 529, "metadata": {}, "output_type": "execute_result" } ], "source": [ "odds = {}\n", "for i in df['endowment']:\n", " if not isDigit(i):\n", " try:\n", " odds[i] += 1\n", " except:\n", " odds[i] = 1\n", "\n", "odds.items()" ] }, { "cell_type": "code", "execution_count": 530, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.202856", "start_time": "2017-11-19T20:33:10.198612" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1000000.0" ] }, "execution_count": 530, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.float('1E6')" ] }, { "cell_type": "code", "execution_count": 531, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.268985", "start_time": "2017-11-19T20:33:10.204409" }, "collapsed": false }, "outputs": [], "source": [ "df.endowment = [str(i).replace('US $', '').replace('US$', '') for i in df.endowment]\n", "df.endowment = [str(i).replace('USD$', '').replace('USD $', '') for i in df.endowment]\n", "df.endowment = [str(i).replace('U.S. $', '').replace(',', '').strip() for i in df.endowment]\n", "\n", "endowment = []\n", "for i in df.endowment:\n", " if i.__contains__('$'):\n", " endowment.append(str(i).split('$')[1])\n", " else:\n", " endowment.append(i)\n", "df.endowment = endowment" ] }, { "cell_type": "code", "execution_count": 532, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.367017", "start_time": "2017-11-19T20:33:10.270699" }, "collapsed": false }, "outputs": [], "source": [ "df.endowment = [str(i).replace(' million', 'E6').replace(' billion', 'E9').strip() for i in df.endowment]\n", "df.endowment = [str(i).replace('million', 'E6').replace('billion', 'E9').strip() for i in df.endowment]\n", "df.endowment = [str(i).replace(' Million', 'E6').replace(' Billion', 'E9').strip() for i in df.endowment]\n", "df.endowment = [str(i).split(' ')[0] for i in df.endowment]\n", "df.endowment = [str(i).replace('M', 'E6').strip() for i in df.endowment]\n", "df.endowment = [str(i).replace(';', '').replace('+', '').strip() for i in df.endowment]\n", "# df.endowment = [str(i).split('xbf')[1] for i in df.endowment]\n", "# df.endowment = [str(i).split('xb')[1] for i in df.endowment]\n", "# df.endowment = [str(i).split('xa')[1] for i in df.endowment]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After most of this has been cleaned up, select the non-numeric values, and delete them, just as was done for the numStudents." ] }, { "cell_type": "code", "execution_count": 533, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.388432", "start_time": "2017-11-19T20:33:10.368974" }, "collapsed": true }, "outputs": [], "source": [ "df['endowment'] = df['endowment'].apply(lambda x: np.float(x) \n", " if isDigit(x)\n", " else np.nan)" ] }, { "cell_type": "code", "execution_count": 534, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.409353", "start_time": "2017-11-19T20:33:10.389947" }, "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
endowmentnumStudents
count1.490400e+041.380000e+04
mean2.149103e+093.934969e+05
std1.927573e+103.585477e+07
min0.000000e+002.000000e+00
25%2.430000e+081.065000e+03
50%1.546000e+091.068800e+04
75%1.708000e+093.397700e+04
max1.545840e+124.197033e+09
\n", "
" ], "text/plain": [ " endowment numStudents\n", "count 1.490400e+04 1.380000e+04\n", "mean 2.149103e+09 3.934969e+05\n", "std 1.927573e+10 3.585477e+07\n", "min 0.000000e+00 2.000000e+00\n", "25% 2.430000e+08 1.065000e+03\n", "50% 1.546000e+09 1.068800e+04\n", "75% 1.708000e+09 3.397700e+04\n", "max 1.545840e+12 4.197033e+09" ] }, "execution_count": 534, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- 巴西雷亚尔的符号 R\\$\n", "- CANADA DOLLARS C\\$\n", "- 澳元的货币符号 A\\$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both \"million\" and \"Million\" are in the values, so it's useful to convert all the values to lowercase instead of cleaning this up twice.\n" ] }, { "cell_type": "code", "execution_count": 535, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.436602", "start_time": "2017-11-19T20:33:10.411257" }, "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
universityendowmentnumFacultynumDoctoralcountrynumStaffestablishednumPostgradnumUndergradnumStudents
0Paris Universitas15.055008000FranceNaN2005NaN2500070000.0
2Lumi%C3%A8re University Lyon 2121.0NaN1355FranceNaN183570461485127393.0
3Confederation College4700000.0NaNNaNCanadaNaN1967not availablepre-university students; technical21160.0
4Rocky Mountain College16586100.0NaNNaNUSANaN187866878894.0
5Rocky Mountain College16586100.0NaNNaNUSANaN187866878894.0
\n", "
" ], "text/plain": [ " university endowment numFaculty numDoctoral country \\\n", "0 Paris Universitas 15.0 5500 8000 France \n", "2 Lumi%C3%A8re University Lyon 2 121.0 NaN 1355 France \n", "3 Confederation College 4700000.0 NaN NaN Canada \n", "4 Rocky Mountain College 16586100.0 NaN NaN USA \n", "5 Rocky Mountain College 16586100.0 NaN NaN USA \n", "\n", " numStaff established numPostgrad numUndergrad \\\n", "0 NaN 2005 NaN 25000 \n", "2 NaN 1835 7046 14851 \n", "3 NaN 1967 not available pre-university students; technical \n", "4 NaN 1878 66 878 \n", "5 NaN 1878 66 878 \n", "\n", " numStudents \n", "0 70000.0 \n", "2 27393.0 \n", "3 21160.0 \n", "4 894.0 \n", "5 894.0 " ] }, "execution_count": 535, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Others\n", "\n", "numFaculty, numDoctoral, numStaff, numPostgrad, numUndergrad, established\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## numFaculty" ] }, { "cell_type": "code", "execution_count": 536, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.479090", "start_time": "2017-11-19T20:33:10.438243" }, "collapsed": true }, "outputs": [], "source": [ "df.numFaculty = [str(i).replace('Total: ', '').replace(',', '') for i in df.numFaculty]\n", "df.numFaculty = [str(i).replace('>', '').replace('~', '') for i in df.numFaculty]\n", "df.numFaculty = [str(i).split(' ')[0] for i in df.numFaculty]\n" ] }, { "cell_type": "code", "execution_count": 537, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.505591", "start_time": "2017-11-19T20:33:10.481149" }, "collapsed": true }, "outputs": [], "source": [ "df['numFaculty'] = df['numFaculty'].apply(lambda x: np.float(x) \n", " if isDigit(x)\n", " else np.nan)" ] }, { "cell_type": "code", "execution_count": 538, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.523611", "start_time": "2017-11-19T20:33:10.507114" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 538, "metadata": {}, "output_type": "execute_result" } ], "source": [ "odds = {}\n", "for i in df['numFaculty']:\n", " if not isDigit(i):\n", " try:\n", " odds[i] += 1\n", " except:\n", " odds[i] = 1\n", "\n", "odds.items()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## numDoctoral" ] }, { "cell_type": "code", "execution_count": 539, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.540713", "start_time": "2017-11-19T20:33:10.525501" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('available', 1),\n", " ('N\\\\A', 27),\n", " ('not available', 40),\n", " ('N.A', 16),\n", " ('~25', 4)]" ] }, "execution_count": 539, "metadata": {}, "output_type": "execute_result" } ], "source": [ "odds = {}\n", "for i in df['numDoctoral']:\n", " if not isDigit(i):\n", " try:\n", " odds[i] += 1\n", " except:\n", " odds[i] = 1\n", "\n", "odds.items()" ] }, { "cell_type": "code", "execution_count": 540, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.566817", "start_time": "2017-11-19T20:33:10.542532" }, "collapsed": true }, "outputs": [], "source": [ "df.numDoctoral = [str(i).replace('~', '') for i in df.numFaculty]" ] }, { "cell_type": "code", "execution_count": 541, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.588711", "start_time": "2017-11-19T20:33:10.568759" }, "collapsed": true }, "outputs": [], "source": [ "df['numDoctoral'] = df['numDoctoral'].apply(lambda x: np.float(x) \n", " if isDigit(x)\n", " else np.nan)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## numStaff" ] }, { "cell_type": "code", "execution_count": 542, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.611800", "start_time": "2017-11-19T20:33:10.590342" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('incl. 1,403 academics and 150 researchers', 2),\n", " ('Total: 873', 48),\n", " ('Approximately 7,170', 1),\n", " ('approximately 30', 1),\n", " ('167 full-time academic faculty, \\n596 non-academic staff', 4),\n", " ('Around 120', 27),\n", " ('~500', 4),\n", " ('190 researchers & lecturers, 153 administrative & technical staff', 2),\n", " ('Approximately 500', 1),\n", " ('4,478 employees', 64),\n", " ('Around 5,000', 48),\n", " ('>21,000', 2),\n", " ('?', 1),\n", " ('~100', 4),\n", " ('full-time, part-time', 1),\n", " ('Total: 1,608', 2),\n", " ('Total: 1,600', 16),\n", " ('Full-time: 1,469', 2),\n", " ('156 full-time; 229 part-time', 1),\n", " ('Part-time: 139', 2),\n", " ('960 full-time, 460 part-time', 16),\n", " ('appx. 20', 8),\n", " ('~', 1)]" ] }, "execution_count": 542, "metadata": {}, "output_type": "execute_result" } ], "source": [ "odds = {}\n", "for i in df['numStaff']:\n", " if not isDigit(i):\n", " try:\n", " odds[i] += 1\n", " except:\n", " odds[i] = 1\n", "\n", "odds.items()" ] }, { "cell_type": "code", "execution_count": 543, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.716371", "start_time": "2017-11-19T20:33:10.613801" }, "collapsed": true }, "outputs": [], "source": [ "df.numStaff = [str(i).replace('Total: ', '').replace(',', '') for i in df.numStaff]\n", "df.numStaff = [str(i).replace('>', '').replace('~', '') for i in df.numStaff]\n", "df.numStaff = [str(i).replace('Around ', '').replace('appx. ', '') for i in df.numStaff]\n", "df.numStaff = [str(i).replace(' employees', '').replace('Approximately ', '') for i in df.numStaff]\n", "df.numStaff = [str(i).replace('Full-time: ', '').replace('Part-time: ', '') for i in df.numStaff]\n", "df.numStaff = [str(i).replace('approximately ', '') for i in df.numStaff]\n", "df.numStaff = [str(i).split(' ')[0] for i in df.numStaff]\n", "\n" ] }, { "cell_type": "code", "execution_count": 544, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.739580", "start_time": "2017-11-19T20:33:10.718167" }, "collapsed": false }, "outputs": [], "source": [ "df['numStaff'] = df['numStaff'].apply(lambda x: np.float(x) \n", " if isDigit(x)\n", " else np.nan)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## numPostgrad" ] }, { "cell_type": "code", "execution_count": 545, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.766948", "start_time": "2017-11-19T20:33:10.741107" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('268 full-time MBA', 24),\n", " ('approx. 300', 1),\n", " ('~650', 4),\n", " ('n/a', 1),\n", " ('Aprx. 2,000', 2),\n", " ('630 Dickinson School of Law', 640),\n", " ('120+', 8),\n", " ('795 Great Valley', 640),\n", " ('Does not offer postgraduate studies', 1),\n", " ('~3,914', 1),\n", " ('4,937 full-time, 1,446 part-time', 2),\n", " ('800 College of Medicine', 640),\n", " ('not available', 1),\n", " ('~3,200', 2),\n", " ('Masters of Business Administration in Community Economic Development', 20),\n", " ('Some postdoctoral students and visiting scholars', 1),\n", " ('postgraduate level degree available', 1),\n", " ('~160', 3),\n", " ('142 Ph.D. students', 2),\n", " ('1,682 Commonwealth Campuses', 640),\n", " ('available', 1),\n", " ('71 MLHR', 24),\n", " ('14,020 Total', 640),\n", " ('Approximately 1000', 1),\n", " ('Approx. 600', 1),\n", " ('----', 4),\n", " ('~500', 4),\n", " ('~5,500', 2),\n", " ('~60', 4),\n", " ('325 part-time MBA', 24),\n", " ('over 1,300', 1),\n", " ('N\\\\A', 9),\n", " ('none', 1),\n", " ('6,223 University Park', 640),\n", " ('ca. 3,230', 2),\n", " ('95 MAcc', 24),\n", " ('9,957 \\xe2\\x80\\x93 Vancouver', 24),\n", " ('531 \\xe2\\x80\\x93 Okanagan', 24),\n", " ('3,890 World Campus', 640),\n", " ('TBD', 2)]" ] }, "execution_count": 545, "metadata": {}, "output_type": "execute_result" } ], "source": [ "odds = {}\n", "for i in df['numPostgrad']:\n", " if not isDigit(i):\n", " try:\n", " odds[i] += 1\n", " except:\n", " odds[i] = 1\n", "\n", "odds.items()" ] }, { "cell_type": "code", "execution_count": 546, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.841632", "start_time": "2017-11-19T20:33:10.769112" }, "collapsed": true }, "outputs": [], "source": [ "df.numPostgrad = [str(i).replace('~', '').replace(',', '') for i in df.numPostgrad]\n", "df.numPostgrad = [str(i).replace('approx. ', '').replace('Approx. ', '') for i in df.numPostgrad]\n", "df.numPostgrad = [str(i).replace('Approximately ', '').replace('Aprx. ', '') for i in df.numPostgrad]\n", "df.numPostgrad = [str(i).replace('+', '').replace('over', '') for i in df.numPostgrad]\n", "\n", "df.numPostgrad = [str(i).split(' ')[0] for i in df.numPostgrad]\n" ] }, { "cell_type": "code", "execution_count": 547, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.866218", "start_time": "2017-11-19T20:33:10.843880" }, "collapsed": true }, "outputs": [], "source": [ "df['numPostgrad'] = df['numPostgrad'].apply(lambda x: np.float(x) \n", " if isDigit(x)\n", " else np.nan)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## numUndergrad" ] }, { "cell_type": "code", "execution_count": 548, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.897180", "start_time": "2017-11-19T20:33:10.867984" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('4,747 full-time', 96),\n", " ('over 2,000', 1),\n", " ('Approximately 10,000', 1),\n", " ('2,000+', 8),\n", " ('approximately 1,500', 1),\n", " ('n/a', 1),\n", " ('Aprx. 2,000', 2),\n", " ('School of Liberal Arts; School of Science & Technology; School of Graduate & Professional Studies; Shannon School of Business',\n", " 20),\n", " ('approx. 2,000', 1),\n", " ('36,518 \\xe2\\x80\\x93 Vancouver', 24),\n", " ('~25,000', 2),\n", " ('65 per year', 1),\n", " ('77,179 Total', 896),\n", " ('approx. 1,200', 8),\n", " ('Approx. 12,000', 2),\n", " ('38,594 University Park', 896),\n", " ('32,295 Commonwealth Campuses', 896),\n", " ('Approximately 2,300', 1),\n", " ('7,004 \\xe2\\x80\\x93 Okanagan', 24),\n", " ('~400', 1),\n", " ('pre-university students; technical', 1),\n", " ('Around 10,000', 2),\n", " ('available', 3),\n", " ('21,726 -', 24),\n", " ('None', 1),\n", " ('900+', 1),\n", " ('475 Resident Undergraduates', 1),\n", " ('ca. 3,046', 2),\n", " ('Approx. 7,100', 1),\n", " ('6,290 PA College of Tech', 896),\n", " ('4,634 World Campus', 896),\n", " ('28,477 full-time, 2,102 part-time', 2),\n", " ('none', 1),\n", " ('approx. 2,150', 4),\n", " ('Approximately 730', 1),\n", " ('~13,000', 2),\n", " ('~2,000', 4),\n", " ('diploma, degree available', 1),\n", " ('Approx. 13,000', 1),\n", " ('2,000 traditional', 1),\n", " ('~1,560', 1),\n", " ('???', 1),\n", " ('~1,550', 4)]" ] }, "execution_count": 548, "metadata": {}, "output_type": "execute_result" } ], "source": [ "odds = {}\n", "for i in df['numUndergrad']:\n", " if not isDigit(i):\n", " try:\n", " odds[i] += 1\n", " except:\n", " odds[i] = 1\n", "\n", "odds.items()" ] }, { "cell_type": "code", "execution_count": 549, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:10.987754", "start_time": "2017-11-19T20:33:10.899084" }, "collapsed": true }, "outputs": [], "source": [ "df.numUndergrad = [str(i).replace('~', '').replace(',', '') for i in df.numUndergrad]\n", "df.numUndergrad = [str(i).replace('approx. ', '').replace('Approx. ', '') for i in df.numUndergrad]\n", "df.numUndergrad = [str(i).replace('Approximately ', '').replace('Aprx. ', '') for i in df.numUndergrad]\n", "df.numUndergrad = [str(i).replace('approximately ', '').replace('Around ', '') for i in df.numUndergrad]\n", "df.numUndergrad = [str(i).replace('+', '').replace('over', '') for i in df.numUndergrad]\n", "df.numUndergrad = [str(i).split(' ')[0] for i in df.numUndergrad]" ] }, { "cell_type": "code", "execution_count": 550, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:11.010001", "start_time": "2017-11-19T20:33:10.990059" }, "collapsed": true }, "outputs": [], "source": [ "df['numUndergrad'] = df['numUndergrad'].apply(lambda x: np.float(x) \n", " if isDigit(x)\n", " else np.nan)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## established" ] }, { "cell_type": "code", "execution_count": 551, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:33:11.022966", "start_time": "2017-11-19T20:33:11.011643" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 2005\n", "2 1835\n", "3 1967\n", "4 1878\n", "5 1878\n", "6 1901\n", "7 1901\n", "8 1947\n", "9 1947\n", "10 1963\n", "11 1963\n", "12 1963 - university status\n", "13 1963 - university status\n", "14 1947 - four-year college\n", "15 1947 - four-year college\n", "16 1901 -\n", "17 1901 -\n", "18 1924\n", "19 1924\n", "22 1970\n", "23 1918-05-01\n", "24 1925\n", "25 Established 1985\n", "26 Chartered 1984\n", "27 1994\n", "28 1947\n", "29 1948\n", "30 1878\n", "31 2004-09-30\n", "32 2004-09-30\n", " ... \n", "74996 1848\n", "74997 1848\n", "74998 1848\n", "74999 1848\n", "75000 1848\n", "75001 1848\n", "75002 1848\n", "75003 1848\n", "75004 1881-08-28\n", "75006 1855-10-15\n", "75007 1911\n", "75008 1964\n", "75009 1964\n", "75010 1964\n", "75011 1851\n", "75012 1851\n", "75013 1851\n", "75014 1851\n", "75015 1851\n", "75016 1851\n", "75017 1851\n", "75018 1851\n", "75019 1851\n", "75020 1851\n", "75021 1851\n", "75022 1851\n", "75023 1851\n", "75024 1851\n", "75025 1851\n", "75026 1851\n", "Name: established, dtype: object" ] }, "execution_count": 551, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.established" ] }, { "cell_type": "code", "execution_count": 570, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:38:20.502818", "start_time": "2017-11-19T20:38:20.487311" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 570, "metadata": {}, "output_type": "execute_result" } ], "source": [ "odds = {}\n", "for i in df['established']:\n", " if not isDigit(i):\n", " try:\n", " odds[i] += 1\n", " except:\n", " odds[i] = 1\n", "\n", "odds.items()" ] }, { "cell_type": "code", "execution_count": 571, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:38:43.182226", "start_time": "2017-11-19T20:38:43.177955" }, "collapsed": false }, "outputs": [], "source": [ "import re\n", "\n", "def getYear(s):\n", " try:\n", " match = re.match(r'.*([1-3][0-9]{3})', s)\n", " return np.int(match.group(1))\n", " except:\n", " return np.nan" ] }, { "cell_type": "code", "execution_count": 572, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:38:43.788277", "start_time": "2017-11-19T20:38:43.734871" }, "collapsed": false }, "outputs": [], "source": [ "df.established = [getYear(i) for i in df.established]\n" ] }, { "cell_type": "code", "execution_count": 573, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T20:38:52.586155", "start_time": "2017-11-19T20:38:52.541817" }, "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
endowmentnumFacultynumDoctoralnumStaffestablishednumPostgradnumUndergradnumStudents
count1.490400e+041.348700e+041.348700e+043.161000e+0315186.0000001.513800e+041.550000e+041.380000e+04
mean2.149103e+098.695432e+038.695432e+031.057127e+041865.1960361.104538e+042.762582e+053.934969e+05
std1.927573e+101.286384e+051.286384e+052.483439e+0565.8655963.622826e+052.581618e+073.585477e+07
min0.000000e+001.000000e+001.000000e+002.000000e+001066.0000000.000000e+000.000000e+002.000000e+00
25%2.430000e+081.953000e+031.953000e+031.592000e+031855.0000008.000000e+026.290000e+031.065000e+03
50%1.546000e+098.864000e+038.864000e+032.799000e+031855.0000003.067000e+031.782100e+041.068800e+04
75%1.708000e+098.864000e+038.864000e+035.000000e+031881.0000006.223000e+033.667500e+043.397700e+04
max1.545840e+121.407201e+071.407201e+071.280201e+072012.0000002.998201e+073.198523e+094.197033e+09
\n", "
" ], "text/plain": [ " endowment numFaculty numDoctoral numStaff established \\\n", "count 1.490400e+04 1.348700e+04 1.348700e+04 3.161000e+03 15186.000000 \n", "mean 2.149103e+09 8.695432e+03 8.695432e+03 1.057127e+04 1865.196036 \n", "std 1.927573e+10 1.286384e+05 1.286384e+05 2.483439e+05 65.865596 \n", "min 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1066.000000 \n", "25% 2.430000e+08 1.953000e+03 1.953000e+03 1.592000e+03 1855.000000 \n", "50% 1.546000e+09 8.864000e+03 8.864000e+03 2.799000e+03 1855.000000 \n", "75% 1.708000e+09 8.864000e+03 8.864000e+03 5.000000e+03 1881.000000 \n", "max 1.545840e+12 1.407201e+07 1.407201e+07 1.280201e+07 2012.000000 \n", "\n", " numPostgrad numUndergrad numStudents \n", "count 1.513800e+04 1.550000e+04 1.380000e+04 \n", "mean 1.104538e+04 2.762582e+05 3.934969e+05 \n", "std 3.622826e+05 2.581618e+07 3.585477e+07 \n", "min 0.000000e+00 0.000000e+00 2.000000e+00 \n", "25% 8.000000e+02 6.290000e+03 1.065000e+03 \n", "50% 3.067000e+03 1.782100e+04 1.068800e+04 \n", "75% 6.223000e+03 3.667500e+04 3.397700e+04 \n", "max 2.998201e+07 3.198523e+09 4.197033e+09 " ] }, "execution_count": 573, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# University" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T16:44:56.123690", "start_time": "2017-11-19T16:44:56.113675" }, "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(array(['Paris Universitas', 'Lumi%C3%A8re University Lyon 2',\n", " 'Confederation College', ..., 'University of San Francisco',\n", " 'Loyola Marymount University', 'Nova Southeastern University'], dtype=object),\n", " 1085)" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['university'].unique(), df['university'].unique().size" ] }, { "cell_type": "code", "execution_count": 236, "metadata": { "ExecuteTime": { "end_time": "2017-11-19T18:17:19.951167", "start_time": "2017-11-19T18:17:19.948486" }, "collapsed": false }, "outputs": [], "source": [ "# university_df = df.groupby('university').size()\n", "# for k in university_df.index:\n", "# print(k, university_df[k])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" }, "latex_envs": { "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 0 }, "toc": { "toc_cell": false, "toc_number_sections": true, "toc_threshold": 6, "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 0 }