{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example usage of hierreg\n", "** *\n", "This IPython notebook tries to illustrate the usage of the [hierreg](https://github.com/david-cortes/hierreg) package, which fits linear models with coefficients that can vary according to ‘groups’ or levels to which observations belong, similar as random effects and hierarchical bayes models, but with deviations estimated by applying regularization on them.\n", "\n", "In a typical regression model, coefficients are estimated so as to minimize\n", "$$ L(w) = \\lVert y - Xw \\lVert$$\n", "Where\n", "* $X$ is the predictors'covariates matrix\n", "* $y$ is the variable to predict\n", "* $w$ are the model parameters (that minimize this metric)\n", "\n", "This model assumes that observations can belong to arbitrary categorical groups (e.g. observations coming from the same geographical region, same year, etc.), and the coefficients are different for all these groups, but not too different, by trying to minimize\n", "\n", "$$ L(w,v) = \\lVert X(w+\\sum_{groups} v_{group}I_{[x \\in group]}) - y\\lVert +\\:\\: \\lambda *\\lVert groupweights *v \\lVert $$\n", "Where\n", "* $X$ is the predictors'covariates matrix\n", "* $y$ is the variable to predict\n", "* $\\lambda$ is a regularization parameter\n", "* $w$ are the main model parameters (like fixed effects)\n", "* $v_{group}$ are deviations in the coefficients according to the group each observation belongs to\n", "* $I_{[x \\in group]}$ is an indicator column-matrix with ones when a row of $X$ belongs to a given group\n", "* $groupweights$ is a weight vector making each group deviation have a weight that is inversely proportional to the number of observations beloning to that group.\n", "\n", "While this type of model doesn't offer the same inferential or statistical testing capabilities as random effects or hierarchical bayes, it can end up being better at making predictions than either alternatives, depending on the problem.\n", "\n", "(Logistic regression is also supported, the model needs to be called with argument `problem='classification'`)\n", "\n", "** *\n", "### Package in action\n", "\n", "[1. Small example - Predicting student attainment](#example1)\n", "\n", "[2. Larger example - Predicting sales at different stores](#example2)\n", "** *\n", "\n", "## 1. Small example - Predicting student attainment\n", "\n", "Here I'll try to build a model to predict student attaintment based on standardized test scores, family information and schools that children attended - the data was downloaded from here:\n", "http://www.bristol.ac.uk/cmm/learning/mmsoftware/data-rev.html#lev-xc\n", "\n", "Description from the webpage\n", "##### _2LEV-XC_\n", "\n", "_The data are on 3,435 children who attended 148 primary schools and 19 secondary schools in Scotland. There are 11 fields in the data set of which the following are used._\n", "\n", "* _VRQ: A verbal reasoning score from tests pupils took when they entered secondary school_\n", "* _ATTAIN: Attainment score of pupils at age 16_\n", "* _PID: Primary school identifying code_\n", "* _SEX: Pupil's gender_\n", "* _0 = boy_\n", "* _1 = girl_\n", "* _SC: Pupil's social class scale (continuous scale score from low to high social class)_\n", "* _SID: Secondary school identifying code_\n", "\n", "Taking a look at the data:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
VRQATTAINPIDSEXSCSID
011.010.01.000.09.0
10.03.01.010.09.0
2-14.02.01.000.09.0
3-6.03.01.0020.09.0
4-30.02.01.010.09.0
\n", "
" ], "text/plain": [ " VRQ ATTAIN PID SEX SC SID\n", "0 11.0 10.0 1.0 0 0.0 9.0\n", "1 0.0 3.0 1.0 1 0.0 9.0\n", "2 -14.0 2.0 1.0 0 0.0 9.0\n", "3 -6.0 3.0 1.0 0 20.0 9.0\n", "4 -30.0 2.0 1.0 1 0.0 9.0" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd, numpy as np, warnings\n", "warnings.filterwarnings(\"ignore\")\n", "\n", "df=pd.read_csv('2lev-xc.txt',names=['VRQ','ATTAIN','PID','SEX','SC','SID'],sep='\\s',engine='python')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are, overall, over a hundred primary schools and over a dozen secondary schools. For a dataset of this size, this is a very large amount of parameters:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "148\n", "19\n" ] } ], "source": [ "print(len(set(df.PID)))\n", "print(len(set(df.SID)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now fitting a model with school-varying coefficients:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.503111161611763\n" ] } ], "source": [ "from hierreg import HierarchicalRegression\n", "from sklearn.model_selection import train_test_split\n", "\n", "y=df['ATTAIN']\n", "X=df[['VRQ','SEX','SC']]\n", "groups=df[['PID','SID']]\n", "\n", "## simple train-test split - note that I'm suing a random seed to get the same split again later\n", "X_train, X_test, y_train, y_test, groups_train, groups_test = train_test_split(\n", " X, y, groups, test_size=0.33, random_state=42)\n", "\n", "## for small problems like this, solver 'ECOS' works faster than the default 'SCS'\n", "## it would still work fine without chaning the default options though, only a bit slower\n", "## note the large regularization parameter\n", "hr=HierarchicalRegression(l2_reg=1e4, cvxpy_opts={'solver':'ECOS'})\n", "hr.fit(X_train, y_train, groups_train)\n", "print(np.mean((y_test-hr.predict(X_test, groups_test))**2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The choice of which variables to turn into groups with deviations in coefficients and which to use as categorical variables can be quite hard though - in this very simple example, a tuned regularized model with no group-variying coefficients ends up performing better:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.386018754899887\n" ] } ], "source": [ "from sklearn.linear_model import Ridge\n", "\n", "y=df['ATTAIN']\n", "X=df[['VRQ','SEX','SC','PID','SID']]\n", "X['PID']=X.PID.map(lambda x: str(x))\n", "X['SID']=X.SID.map(lambda x: str(x))\n", "X=pd.get_dummies(X)\n", "\n", "## same train-test split\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.33, random_state=42)\n", "\n", "lr=Ridge(100).fit(X_train, y_train)\n", "print(np.mean((y_test-lr.predict(X_test))**2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(With larger datasets with more variables, this might not necessarily be the case though, as illustrated in the next example)\n", "** *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## 2. Larger example - Predicting sales at different stores\n", "\n", "This larger example tries to predict sales at different Rossman stores. The data comes from a [Kaggle competition](https://www.kaggle.com/c/rossmann-store-sales) and it's quite large compared to the previous dataset. This is no the typical time-series forecasting problem as the rows here include the number of customer who visited each store at each day, along with information about the days and about competitors:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StoreDayOfWeekDateSalesCustomersOpenPromoStateHolidaySchoolHoliday
0152015-07-3152635551101
1252015-07-3160646251101
2352015-07-3183148211101
3452015-07-311399514981101
4552015-07-3148225591101
\n", "
" ], "text/plain": [ " Store DayOfWeek Date Sales Customers Open Promo StateHoliday \\\n", "0 1 5 2015-07-31 5263 555 1 1 0 \n", "1 2 5 2015-07-31 6064 625 1 1 0 \n", "2 3 5 2015-07-31 8314 821 1 1 0 \n", "3 4 5 2015-07-31 13995 1498 1 1 0 \n", "4 5 5 2015-07-31 4822 559 1 1 0 \n", "\n", " SchoolHoliday \n", "0 1 \n", "1 1 \n", "2 1 \n", "3 1 \n", "4 1 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rossman=pd.read_csv('train.csv',engine='python')\n", "rossman['StateHoliday']=rossman.StateHoliday.map(lambda x: str(x))\n", "\n", "# exlcuding days with no sales\n", "rossman=rossman.loc[rossman.Sales>0]\n", "rossman.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(844338, 9)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rossman.shape" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StoreStoreTypeAssortmentCompetitionDistanceCompetitionOpenSinceMonthCompetitionOpenSinceYearPromo2Promo2SinceWeekPromo2SinceYearPromoInterval
01ca1270.09.02008.00NaNNaNNaN
12aa570.011.02007.0113.02010.0Jan,Apr,Jul,Oct
23aa14130.012.02006.0114.02011.0Jan,Apr,Jul,Oct
34cc620.09.02009.00NaNNaNNaN
45aa29910.04.02015.00NaNNaNNaN
\n", "
" ], "text/plain": [ " Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth \\\n", "0 1 c a 1270.0 9.0 \n", "1 2 a a 570.0 11.0 \n", "2 3 a a 14130.0 12.0 \n", "3 4 c c 620.0 9.0 \n", "4 5 a a 29910.0 4.0 \n", "\n", " CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear \\\n", "0 2008.0 0 NaN NaN \n", "1 2007.0 1 13.0 2010.0 \n", "2 2006.0 1 14.0 2011.0 \n", "3 2009.0 0 NaN NaN \n", "4 2015.0 0 NaN NaN \n", "\n", " PromoInterval \n", "0 NaN \n", "1 Jan,Apr,Jul,Oct \n", "2 Jan,Apr,Jul,Oct \n", "3 NaN \n", "4 NaN " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stores=pd.read_csv('C:\\\\ipython\\\\mixed effects\\\\rossman sales\\\\store.csv')\n", "stores.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Doing some very basic feature engineering - there is a lot more that can be done, such as using time lags of sales and customers, and incorporating more information about competitors, among others, but the idea is just to illustrate the usage of this package." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StoreDayOfWeekDateSalesCustomersOpenPromoStateHolidaySchoolHolidayYearMonthStoreTypeAssortmentCompetitionDistance
0152015-07-3152635551101201507ca1270.0
1142015-07-3050205461101201507ca1270.0
2132015-07-2947825231101201507ca1270.0
3122015-07-2850115601101201507ca1270.0
4112015-07-2761026121101201507ca1270.0
\n", "
" ], "text/plain": [ " Store DayOfWeek Date Sales Customers Open Promo StateHoliday \\\n", "0 1 5 2015-07-31 5263 555 1 1 0 \n", "1 1 4 2015-07-30 5020 546 1 1 0 \n", "2 1 3 2015-07-29 4782 523 1 1 0 \n", "3 1 2 2015-07-28 5011 560 1 1 0 \n", "4 1 1 2015-07-27 6102 612 1 1 0 \n", "\n", " SchoolHoliday Year Month StoreType Assortment CompetitionDistance \n", "0 1 2015 07 c a 1270.0 \n", "1 1 2015 07 c a 1270.0 \n", "2 1 2015 07 c a 1270.0 \n", "3 1 2015 07 c a 1270.0 \n", "4 1 2015 07 c a 1270.0 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rossman['Year']=rossman.Date.map(lambda x: x[:4])\n", "rossman['Month']=rossman.Date.map(lambda x: x[5:7])\n", "rossman['DayOfWeek']=rossman.DayOfWeek.map(lambda x: str(x))\n", "\n", "max_comp_dist=stores.CompetitionDistance.max()\n", "stores['CompetitionDistance'].loc[stores.CompetitionDistance.isnull()]=max_comp_dist\n", "rossman=pd.merge(rossman,stores[['Store','StoreType','Assortment','CompetitionDistance']],on='Store')\n", "rossman.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a comparison point, I'll first fit a linear regression model, using the stores and store types as categorical variables in the model. Note that this is model is libnear in all parameters (no interactions, no polynomials, etc.), but as the dataset is very large, processing the data and fitting the model can still take a very long time. The models in the winning solutions for this competition took over a day to fit according to their authors." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2678.6297086\n", "Wall time: 11min 23s\n" ] } ], "source": [ "%%time\n", "from sklearn.linear_model import Ridge\n", "from sklearn.model_selection import train_test_split\n", "from scipy.sparse import coo_matrix, csr_matrix, hstack\n", "\n", "y=rossman['Sales']\n", "X=rossman[['Customers','Open','Promo','StateHoliday','SchoolHoliday','CompetitionDistance']]\n", "Xcateg=rossman[['Store', 'DayOfWeek', 'StoreType','Assortment','Year','Month']]\n", "Xcateg['Store']=Xcateg.Store.map(lambda x: str(x))\n", "Xcateg=coo_matrix(pd.get_dummies(Xcateg).as_matrix())\n", "X=hstack([pd.get_dummies(X).as_matrix(),Xcateg])\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.33, random_state=100)\n", "\n", "lr=Ridge()\n", "lr.fit(csr_matrix(X_train),y_train)\n", "preds_lr=lr.predict(X_test)\n", "print(np.sqrt(np.mean((y_test-preds_lr)**2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now fitting a model with coefficients that vary by store - note that this is a far larger problem with over 40k model parameters, so it's no surprise that it takes a long time to fit and takes 4GB of ram memory. Nevertheless, this results in a huge improvement over the previous model:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "******************************************************************************\n", "This program contains Ipopt, a library for large-scale nonlinear optimization.\n", " Ipopt is released as open source code under the Eclipse Public License (EPL).\n", " For more information visit http://projects.coin-or.org/Ipopt\n", "******************************************************************************\n", "\n", "673.711325158\n", "Wall time: 7h 21min 50s\n" ] } ], "source": [ "%%time\n", "from hierreg import HierarchicalRegression\n", "\n", "y=rossman['Sales']\n", "X=rossman[['DayOfWeek','Customers','Open','Promo','StateHoliday','SchoolHoliday','CompetitionDistance','Year','Month']]\n", "group=rossman[['Store','StoreType','Assortment']]\n", "X_train, X_test, y_train, y_test, group_train, group_test = train_test_split(\n", " pd.get_dummies(X), y, group, test_size=0.33, random_state=100)\n", "\n", "## for larger datasets, casadi --> IPOPT can provide better running times, but only supports the default parameters\n", "hlr=HierarchicalRegression(solver_interface='casadi')\n", "hlr.fit(X_train,y_train,group_train)\n", "preds_hlr1=hlr.predict(X_test,group_test)\n", "print(np.sqrt(np.mean((y_test-preds_hlr1)**2)))" ] } ], "metadata": { "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }