{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Python Tour of Data Science: Data Acquisition & Exploration \n", "\n", "[Michaƫl Defferrard](http://deff.ch), *PhD student*, [EPFL](http://epfl.ch) [LTS2](http://lts2.epfl.ch)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# While packages are usually imported at the top, they can\n", "# be imported wherever you prefer, in whatever scope.\n", "import numpy as np\n", "\n", "# Show matplotlib graphs inside the notebook.\n", "%matplotlib inline\n", "\n", "import sys\n", "print('Python {}.{}'.format(sys.version_info.major, sys.version_info.minor))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1 Data Importation\n", "\n", "* The world is messy, we got data in CSV, [JSON](http://www.json.org), Excel, [HDF5](https://www.hdfgroup.org/HDF5) files and an SQL database.\n", "* Could also have been matlab, HTML, XML files or from the web via scraping and APIs (e.g. [Twitter Firehose](https://dev.twitter.com/streaming/firehose)) or noSQL data stores, etc." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Search modules in parent folder.\n", "import sys\n", "sys.path.insert(1, '..')\n", "import ntds\n", "\n", "# Cross-platform (Windows / Mac / Linux) paths.\n", "import os.path\n", "folder = os.path.join('..', 'data', 'credit_card_defaults')\n", "\n", "# Download the data.\n", "ntds.get_data(folder)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!ls ../data/credit_card_defaults/\n", "# Windows: !dir ..\\data\\credit_card_defaults\\" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Importing from an SQL Database\n", "\n", "[SQLAlchemy](http://www.sqlalchemy.org/) to the rescue.\n", "* Abstraction between DBAPIs.\n", " * Supported databases: SQLite, Postgresql, MySQL, Oracle, MS-SQL, Firebird, Sybase and others.\n", "* [SQL Expression Language](http://docs.sqlalchemy.org/en/rel_1_0/core/tutorial.html).\n", "* [Object Relational Mapper (ORM)](http://docs.sqlalchemy.org/en/rel_1_0/orm/tutorial.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import sqlalchemy\n", "filename = os.path.join(folder, 'payments.sqlite')\n", "engine = sqlalchemy.create_engine('sqlite:///' + filename, echo=False)\n", "\n", "# Infer from existing DB.\n", "metadata = sqlalchemy.MetaData()\n", "metadata.reflect(engine)\n", "\n", "# An SQL SELECT statement.\n", "table = metadata.tables.get('payments')\n", "op = sqlalchemy.sql.select([table])\n", "engine.echo = True\n", "result = engine.execute(op)\n", "engine.echo = False" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Show some lines, i.e. clients.\n", "for row in result.fetchmany(size=10):\n", " print('ID: {:2d}, payments: {}'.format(row[0], row[1:]))\n", "result.close()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Execute some raw SQL.\n", "paid = 1000\n", "op = sqlalchemy.sql.text('SELECT payments.\"ID\", payments.\"PAY6\" FROM payments WHERE payments.\"PAY6\" = {}'.format(paid))\n", "result = engine.execute(op).fetchall()\n", "print('{} clients paid {} in April 2005'.format(len(result), paid))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Merging data Sources\n", "\n", "Put some [pandas](http://pandas.pydata.org/) in our Python !\n", "* Import / export data from / to various sources.\n", "* Data frames manipulations: slicing, dicing, grouping.\n", "* And many more !" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def get_data(directory):\n", " filename_csv = os.path.join(directory, 'demographics.csv')\n", " filename_xls = os.path.join(directory, 'delays.xls')\n", " filename_hdf = os.path.join(directory, 'bills.hdf5')\n", " filename_json = os.path.join(directory, 'target.json')\n", "\n", " demographics = pd.read_csv(filename_csv, index_col=0)\n", " delays = pd.read_excel(filename_xls, index_col=0)\n", " bills = pd.read_hdf(filename_hdf, 'bills')\n", " payments = pd.read_sql('payments', engine, index_col='ID')\n", " target = pd.read_json(filename_json)\n", "\n", " return pd.concat([demographics, delays, bills, payments, target], axis=1)\n", "\n", "import pandas as pd\n", "data = get_data(folder)\n", "attributes = data.columns.tolist()\n", "\n", "# Tansform from numerical to categorical variable.\n", "data['SEX'] = data['SEX'].astype('category')\n", "data['SEX'].cat.categories = ['MALE', 'FEMALE']\n", "data['MARRIAGE'] = data['MARRIAGE'].astype('category')\n", "data['MARRIAGE'].cat.categories = ['UNK', 'MARRIED', 'SINGLE', 'OTHERS']\n", "data['EDUCATION'] = data['EDUCATION'].astype('category')\n", "data['EDUCATION'].cat.categories = ['UNK', 'GRAD SCHOOL', 'UNIVERSITY', 'HIGH SCHOOL', 'OTHERS', 'UNK1', 'UNK2']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 Looking at the Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.loc[:6, ['LIMIT', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'DEFAULT']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "data.iloc[:5, 4:10]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.iloc[:5, 11:23]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Export as an [HTML table](./subset.html) for manual inspection." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data[:1000].to_html('subset.html')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2 Data Cleaning\n", "\n", "While cleaning data is the [most time-consuming, least enjoyable Data Science task](http://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says), it should be perfomed nonetheless. Problems come in two flavours:\n", "\n", "1. Missing data, i.e. unknown values.\n", "1. Errors in data, i.e. wrong values.\n", "\n", "The actions to be taken in each case is highly **data and problem specific**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example: marital status\n", "1. According to dataset description, it should either be 1 (married), 2 (single) or 3 (others).\n", "1. But we find some 0 (previously transformed to `UNK`).\n", "1. Let's *assume* that 0 represents errors when collecting the data and that we should remove those clients." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(data['MARRIAGE'].value_counts())\n", "data = data[data['MARRIAGE'] != 'UNK']\n", "data['MARRIAGE'] = data['MARRIAGE'].cat.remove_unused_categories()\n", "print('\\nWe are left with {} clients\\n'.format(data.shape))\n", "print(data['MARRIAGE'].unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example: education\n", "1. It should either be 1 (graduate school), 2 (university), 3 (high school) or 4 (others).\n", "1. But we find some 0, 5 and 6 (previously transformed to `UNK`, `UNK1` and `UNK2`).\n", "1. Let's *assume* these values are dubious, but do not invalidate the data and keep them as they may have some predictive power." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print(data['EDUCATION'].value_counts())\n", "data.loc[data['EDUCATION']=='UNK1', 'EDUCATION'] = 'UNK'\n", "data.loc[data['EDUCATION']=='UNK2', 'EDUCATION'] = 'UNK'\n", "data['EDUCATION'] = data['EDUCATION'].cat.remove_unused_categories()\n", "print(data['EDUCATION'].value_counts())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3 Data Exploration\n", "\n", "* Get descriptive statistics.\n", "* Plot informative figures.\n", "* Verify some intuitive correlations.\n", "\n", "Let's get first some descriptive statistics of our numerical variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "attributes_numerical = ['LIMIT', 'AGE']\n", "attributes_numerical.extend(attributes[11:23])\n", "data.loc[:, attributes_numerical].describe().astype(np.int)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's plot an histogram of the ages, so that we get a better impression of who our clients are. That may even be an end goal, e.g. if your marketing team asks which customer groups to target.\n", "\n", "Then a boxplot of the bills, which may serve as a verification of the quality of the acquired data." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "data.loc[:, 'AGE'].plot.hist(bins=20, figsize=(15,5))\n", "ax = data.iloc[:, 11:17].plot.box(logy=True, figsize=(15,5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Simple **question**: which proportion of our clients default ?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "percentage = data['DEFAULT'].value_counts()[1] / data.shape[0] * 100\n", "print('Percentage of defaults: {:.2f}%'.format(percentage))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another **question**: who's more susceptible to default, males or females ?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "observed = pd.crosstab(data['SEX'], data['DEFAULT'], margins=True)\n", "observed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Seems like females are better risk. Let's verify with a Chi-Squared test of independance, using [scipy.stats](http://docs.scipy.org/doc/scipy/reference/stats.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import scipy.stats as stats\n", "_, p, _, expected = stats.chi2_contingency(observed.iloc[:2,:2])\n", "print('p-value = {:.2e}'.format(p))\n", "print('expected values:\\n{}'.format(expected))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Intuition**: people who pay late present a higher risk of defaulting. Let's verify !\n", "Verifying some intuitions will also help you to identify mistakes. E.g. it would be suspicious if that intuition is not verified in the data: did we select the right column, or did we miss-compute a result ?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "group = data.groupby('DELAY1').mean()\n", "corr = data['DEFAULT'].corr(data['DELAY1'], method='pearson')\n", "group['DEFAULT'].plot(grid=True, title='Pearson correlation: {:.4f}'.format(corr), figsize=(15,5));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4 Interactive Visualization\n", "\n", "[Bokeh](http://bokeh.pydata.org) is a Python interactive visualization library that targets modern web browsers for presentation, in the style of [D3.js](https://d3js.org). Alternatively, [matplotlib.widgets](http://matplotlib.org/api/widgets_api.html) could be used. Those interactive visualizations are very helpful to explore the data at hand in the quest of anomalies or patterns. Try with the plots below !" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from bokeh.plotting import output_notebook, figure, show\n", "from bokeh.layouts import gridplot\n", "from bokeh.models import ColumnDataSource\n", "\n", "output_notebook()\n", "\n", "x, y1, y2 = 'LIMIT', 'PAY1', 'PAY2'\n", "n = 1000 # Less intensive for the browser.\n", "\n", "options = dict(\n", " tools='pan,box_zoom,wheel_zoom,box_select,lasso_select,crosshair,reset,save',\n", " x_axis_type='log', y_axis_type='log',\n", ")\n", "plot1 = figure(\n", " x_range=[1e4,1e6],\n", " x_axis_label=x, y_axis_label=y1,\n", " **options\n", ")\n", "plot2 = figure(\n", " x_range=plot1.x_range, y_range=plot1.y_range,\n", " x_axis_label=x, y_axis_label=y2,\n", " **options\n", ")\n", "\n", "html_color = lambda r,g,b: '#{:02x}{:02x}{:02x}'.format(r,g,b)\n", "colors = [html_color(150,0,0) if default == 1 else html_color(0,150,0) for default in data['DEFAULT'][:n]]\n", "# The above line is a list comprehension.\n", "\n", "radii = data['AGE'][:n] / 5\n", "\n", "# To link brushing (where a selection on one plot causes a selection to update on other plots).\n", "source = ColumnDataSource(dict(x=data[x][:n], y1=data[y1][:n], y2=data[y2][:n], radii=radii, colors=colors))\n", "\n", "plot1.scatter('x', 'y1', source=source, size='radii', color='colors', alpha=0.6)\n", "plot2.scatter('x', 'y2', source=source, size='radii', color='colors', alpha=0.6)\n", "\n", "plot = gridplot([[plot1, plot2]], toolbar_location='right', plot_width=400, plot_height=400, title='adsf')\n", "\n", "show(plot)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5 Traditional Statistics\n", "\n", "[Statsmodels](http://statsmodels.sourceforge.net/) is similar to scikit-learn, with much stronger emphasis on parameter estimation and (statistical) testing. It is similar in spirit to other statistical packages such as [R](https://www.r-project.org), [SPSS](http://www.ibm.com/analytics/us/en/technology/spss), [SAS](http://www.sas.com/de_ch/home.html) and [Stata](http://www.stata.com). That split reflects the [two statistical modeling cultures](http://projecteuclid.org/euclid.ss/1009213726): (1) Statistics, which want to know how well a given model fits the data, and what variables \"explain\" or affect the outcome, and (2) Machine Learning, where the main supported task is chosing the \"best\" model for prediction." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Back to numeric values.\n", "# Note: in a serious project, these should be treated as categories.\n", "data['SEX'].cat.categories = [-1, 1]\n", "data['SEX'] = data['SEX'].astype(np.int)\n", "data['MARRIAGE'].cat.categories = [-1, 1, 0]\n", "data['MARRIAGE'] = data['MARRIAGE'].astype(np.int)\n", "data['EDUCATION'].cat.categories = [-2, 2, 1, 0, -1]\n", "data['EDUCATION'] = data['EDUCATION'].astype(np.int)\n", "\n", "data['DEFAULT'] = data['DEFAULT'] * 2 - 1 # [0,1] --> [-1,1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Observations and targets.\n", "X = data.values[:,:23]\n", "y = data.values[:,23]\n", "n, d = X.shape\n", "print('The data is a {} with {} samples of dimensionality {}.'.format(type(X), n, d))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import statsmodels.api as sm\n", "\n", "# Fit the Ordinary Least Square regression model.\n", "results = sm.OLS(y, X).fit()\n", "\n", "# Inspect the results.\n", "print(results.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6 Exporting data\n", "\n", "Save the collected data to disk for further analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "np.save(os.path.join(folder, 'X.npy'), X)\n", "np.save(os.path.join(folder, 'y.npy'), y)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }