{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Variable transformers : BoxCoxTransformer\n", "\n", "The BoxCoxTransformer() applies the BoxCox transformation to numerical\n", "variables.\n", "\n", "The Box-Cox transformation is defined as:\n", "\n", "- T(Y)=(Y exp(λ)−1)/λ if λ!=0\n", "- log(Y) otherwise\n", "\n", "where Y is the response variable and λ is the transformation parameter. λ varies,\n", "typically from -5 to 5. In the transformation, all values of λ are considered and\n", "the optimal value for a given variable is selected.\n", "\n", "**For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:**\n", "\n", "Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing\n", "Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3\n", "\n", "http://jse.amstat.org/v19n3/decock.pdf\n", "\n", "https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627\n", "\n", "The version of the dataset used in this notebook can be obtained from [Kaggle](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from sklearn.model_selection import train_test_split\n", "\n", "from feature_engine.imputation import ArbitraryNumberImputer, CategoricalImputer\n", "from feature_engine.transformation import BoxCoxTransformer" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Id</th><th>MSSubClass</th><th>MSZoning</th><th>LotFrontage</th><th>LotArea</th><th>Street</th><th>Alley</th><th>LotShape</th><th>LandContour</th><th>Utilities</th><th>...</th><th>PoolArea</th><th>PoolQC</th><th>Fence</th><th>MiscFeature</th><th>MiscVal</th><th>MoSold</th><th>YrSold</th><th>SaleType</th><th>SaleCondition</th><th>SalePrice</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>60</td><td>RL</td><td>65.0</td><td>8450</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>2</td><td>2008</td><td>WD</td><td>Normal</td><td>208500</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>20</td><td>RL</td><td>80.0</td><td>9600</td><td>Pave</td><td>NaN</td><td>Reg</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>5</td><td>2007</td><td>WD</td><td>Normal</td><td>181500</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>60</td><td>RL</td><td>68.0</td><td>11250</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>9</td><td>2008</td><td>WD</td><td>Normal</td><td>223500</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>70</td><td>RL</td><td>60.0</td><td>9550</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>2</td><td>2006</td><td>WD</td><td>Abnorml</td><td>140000</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>60</td><td>RL</td><td>84.0</td><td>14260</td><td>Pave</td><td>NaN</td><td>IR1</td><td>Lvl</td><td>AllPub</td><td>...</td><td>0</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0</td><td>12</td><td>2008</td><td>WD</td><td>Normal</td><td>250000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 81 columns</p>\n",
"</div>
" ], "text/plain": [ " Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape \\\n", "0 1 60 RL 65.0 8450 Pave NaN Reg \n", "1 2 20 RL 80.0 9600 Pave NaN Reg \n", "2 3 60 RL 68.0 11250 Pave NaN IR1 \n", "3 4 70 RL 60.0 9550 Pave NaN IR1 \n", "4 5 60 RL 84.0 14260 Pave NaN IR1 \n", "\n", " LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold \\\n", "0 Lvl AllPub ... 0 NaN NaN NaN 0 2 \n", "1 Lvl AllPub ... 0 NaN NaN NaN 0 5 \n", "2 Lvl AllPub ... 0 NaN NaN NaN 0 9 \n", "3 Lvl AllPub ... 0 NaN NaN NaN 0 2 \n", "4 Lvl AllPub ... 0 NaN NaN NaN 0 12 \n", "\n", " YrSold SaleType SaleCondition SalePrice \n", "0 2008 WD Normal 208500 \n", "1 2007 WD Normal 181500 \n", "2 2008 WD Normal 223500 \n", "3 2006 WD Abnorml 140000 \n", "4 2008 WD Normal 250000 \n", "\n", "[5 rows x 81 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Read data\n", "data = pd.read_csv('houseprice.csv')\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((1022, 79), (438, 79))" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's separate into training and testing set\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)\n", "\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BoxCoxTransformer(variables=['LotArea', 'GrLivArea'])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's transform 2 variables\n", "\n", "bct = BoxCoxTransformer(variables = ['LotArea', 'GrLivArea'])\n", "\n", "# find the optimal lambdas \n", "bct.fit(X_train)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'LotArea': 
0.022716974992922984, 'GrLivArea': 0.06854346283829917}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# these are the exponents (lambdas) learned for the Box-Cox transformation\n", "\n", "bct.lambda_dict_" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# transform the variables\n", "\n", "train_t = bct.transform(X_train)\n", "test_t = bct.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 0, 'GrLivArea')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# variable before transformation\n", "X_train['GrLivArea'].hist(bins=50)\n", "plt.title('Variable before transformation')\n", "plt.xlabel('GrLivArea')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 0, 'GrLivArea')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAEWCAYAAAB/tMx4AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAVqUlEQVR4nO3df7RdZX3n8fdHIvIjEBAwpQEJHRkskpGRu9TqUm+EWgQt2EFHxtGgYNpprUylU7B11drqGuzSUabtrEpLhalCqkiVhaOC6BVthRoQDZA6CoISEbAGNMpUo9/54+zIyeX+OEnuvec8N+/XWmfds5+9zz7f8+TcT577nL33SVUhSWrPY4ZdgCRp5xjgktQoA1ySGmWAS1KjDHBJapQBLkmNMsA1MpIsT3J9ku8neeew6+mXZGWSSrJkAZ7rOUm+MuC2Zyb53AzrJ5KcPXfVaZTM+5tRbUiypW9xH+BfgZ90y79eVe9fgDLWAt8B9q/d+ASFqvoscPSw69DoM8AFQFUt3XY/yV3A2VX1ycnbJVlSVVvnqYwjgNt3Jrznua4Fs1hehxaGUyiaUZLxJPckOS/Jt4H3JjkwydVJHkiyubt/WN9jJpL8SZJ/6KZDrklycLduryTvS/IvSR5M8oVu6uQSYA3we0m2JDkxyeOSvDvJt7rbu5M8boa6/ijJB7v9fz/JhiT/Nskbk9yf5JtJXtBX57IkFye5N8mmJG9Nske3bo8k70jynSR3AqfM0EfnJbliUtuFSf5nd//VSTZ2Nd2Z5Ndn6d/xJPf0bXN+kju6x9+e5CWPLiF/nuShJP+c5IQZan1NV8vmJJ9IcsR022r0GeAaxM8Bj6c3Ql5L733z3m75icDDwJ9Pesx/Al4NPAHYE/jdrn0NsAw4HDgI+A3g4ao6E3g/8KdVtbQb/f8B8EzgOOCpwNOBN81QF8CLgb8FDgS+CHyiq3cF8MfAe/oefwmwFXgS8O+BFwDb5otfC7yoax8DTp+hf9YBJyfZD3rhD7wMuKxbf3+3r/27PnlXkqfN8jr63QE8h16/vQV4X5JD+9Y/o9vmYODNwJVJHj95J0lOBX4f+DXgEOCzwOUzvC6Nuqry5m27G3AXcGJ3fxz4EbDXDNsfB2zuW54A3tS3/JvAx7v7rwH+Efh3U+znEuCtfct3ACf3Lf8KcNd0dQF/BFzbt/xiYAuwR7e8H1DAAcByevP8e/dtfwbw6e7+p4Df6Fv3gu6xS6bpg88Br+ru/zJwxwz99WHgnBlexzhwzwyPvwU4tbt/JvAtIH3r/wl4Zd+/xdnd/Y8BZ/Vt9xjgh8ARw37Pedu5myNwDeKBqvp/2xaS7JPkPUnuTvI94HrggG3TD51v993/IbBtjv1v6Y2K13XTIn+a5LHTPO/PA3f3Ld/dtU1ZV+e+vvsPA9+pqp/0LdPVcgTwWODebirnQXqj8yf0Pfc3Jz33TC6j9x8A9P762Db6JskLk9yQ5Lvd85xMb7Q80+v4mSSvSnJLX53HT
nr8puoSua/W/n7a5gjgwr79fBcIvb9O1CADXIOY/KHiufSOknhGVe0PPLdrz6w7qvpxVb2lqo4BnkVvauFV02z+LXqhs80Tu7bp6toR36Q3Aj+4qg7obvtX1VO69ffSm+bpf+6ZfBAY7z4LeAldgHdz9h8C3gEsr6oDgP/D9n017evo5qj/CngdcFD3+FsnPX5Fkv7lyf20zTfpHVF0QN9t76r6x1lem0aUAa6dsR+90eyD3Vzrmwd9YJLVSVZ1o/XvAT8GfjrN5pcDb0pySPch6B8C79u10nuq6l7gGuCdSfZP8pgk/ybJ87pNPgC8PslhSQ4Ezp9lfw/Qm654L/D1qtrYrdoTeBzwALA1yQvpTccMal96Af8A9D4QpTcC7/eErtbHJnkp8Iv0/pOY7C+BNyZ5SrevZd32apQBrp3xbmBvesds3wB8fAce+3PAFfTCeyPwGXrTKlN5K7Ae+DKwAbi5a5srr6IXsLcDm7u6tn04+Ff0pnq+1D3vlQPs7zLgRPqmT6rq+8Dr6f2HsJne9MpVgxZYVbcD7wQ+T296aBXwD5M2uxE4it6/x9uA06vqX6bY198Db6c3ffU9eiP5Fw5ai0ZPtp86kyS1whG4JDXKAJekRhngktQoA1ySGrWgF7M6+OCDa+XKldu1/eAHP2DfffddyDKaYd9Mz76Zmv0yvZb75qabbvpOVR0yuX1BA3zlypWsX79+u7aJiQnGx8cXsoxm2DfTs2+mZr9Mr+W+STLlmcBOoUhSowxwSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMW9ExMaXe28vyPTtl+1wWnLHAlWiwcgUtSowxwSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQMFeJLfSXJbkluTXJ5kryRHJrkxydeS/F2SPee7WEnSI2YN8CQrgNcDY1V1LLAH8HLg7cC7qupJwGbgrPksVJK0vUGnUJYAeydZAuwD3As8H7iiW38pcNqcVydJmlaqavaNknOAtwEPA9cA5wA3dKNvkhwOfKwboU9+7FpgLcDy5cuPX7du3Xbrt2zZwtKlS3fxZSxO9s30WuybDZsemrJ91Yplc/YcLfbLQmm5b1avXn1TVY1Nbp/1Cx2SHAicChwJPAh8EDhp0CeuqouAiwDGxsZqfHx8u/UTExNMblOPfTO9FvvmzOm+0OEV43P2HC32y0JZjH0zyBTKicDXq+qBqvoxcCXwbOCAbkoF4DBg0zzVKEmawiAB/g3gmUn2SRLgBOB24NPA6d02a4CPzE+JkqSpzBrgVXUjvQ8rbwY2dI+5CDgPeEOSrwEHARfPY52SpEkG+lLjqnoz8OZJzXcCT5/ziiRJA/FMTElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1ygCXpEYZ4JLUKANckhplgEtSowxwSWqUAS5JjTLAJalRBrgkNcoAl6RGDfSt9JK2t/L8j07ZftcFpyxwJdqdOQKXpEYZ4JLUKANckhplgEtSowxwSWqUAS5JjfIwQmkOTXd4oTQfHIFLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGuWJPFqUvF63dgeOwCWpUQMFeJIDklyR5J+TbEzyS0ken+TaJF/tfh4438VKkh4x6Aj8QuDjVfVk4KnARuB84LqqOgq4rluWJC2QWQM8yTLgucDFAFX1o6p6EDgVuLTb7FLgtPkpUZI0lVTVzBskxwEXAbfTG33fBJwDbKqqA7ptAmzetjzp8WuBtQDLly8/ft26ddut37JlC0uXLt3Fl7E42TfTm61vNmx6aMr2VSuWzcnzT7f/nTFXNYHvmZm03DerV6++qarGJrcPEuBjwA3As6vqxiQXAt8Dfrs/sJNsrqoZ58HHxsZq/fr127VNTEwwPj4+6OvYrdg305utb+b7K
JS5vGzsXB4Z43tmei33TZIpA3yQOfB7gHuq6sZu+QrgacB9SQ7tdn4ocP9cFStJmt2sAV5V3wa+meTorukEetMpVwFrurY1wEfmpUJJ0pQGPZHnt4H3J9kTuBN4Nb3w/0CSs4C7gZfNT4mSpKkMFOBVdQvwqPkXeqNxSdIQeCamJDXKAJekRhngktQoA1ySGuXlZCW8/Kza5AhckhrlCFwaMkf/2lmOwCWpUY7ApRnM5UWrpLnmCFySGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1ygCXpEYZ4JLUKANckhplgEtSowxwSWqUAS5JjTLAJalRBrgkNcrLyUqN8QsgtI0jcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoDyPUbsVvmddi4ghckhrlCFxNmDxyPnfVVs48/6OevKLdmiNwSWqUAS5JjTLAJalRBrgkNcoPMaUR5SGPms3AI/AkeyT5YpKru+Ujk9yY5GtJ/i7JnvNXpiRpsh2ZQjkH2Ni3/HbgXVX1JGAzcNZcFiZJmtlAAZ7kMOAU4K+75QDPB67oNrkUOG0e6pMkTWPQEfi7gd8DftotHwQ8WFVbu+V7gBVzW5okaSapqpk3SF4EnFxVv5lkHPhd4Ezghm76hCSHAx+rqmOnePxaYC3A8uXLj1+3bt1267ds2cLSpUt3+YUsRrtj32zY9NBA2y3fG+57GFatWLZL+1lMVq1Ytlu+ZwbVct+sXr36pqoam9w+yFEozwZ+NcnJwF7A/sCFwAFJlnSj8MOATVM9uKouAi4CGBsbq/Hx8e3WT0xMMLlNPbtj35w54JEX567ayjs3LOGuV4zv0n4Wk7teMb5bvmcGtRj7ZtYplKp6Y1UdVlUrgZcDn6qqVwCfBk7vNlsDfGTeqpQkPcqunMhzHvCGJF+jNyd+8dyUJEkaxA6dyFNVE8BEd/9O4OlzX5I0OE920e7MU+klqVEGuCQ1ygCXpEYZ4JLUKK9GKC0SK8//6M++aq6fXzu3eDkCl6RGGeCS1CgDXJIaZYBLUqP8EFNzYrozIqf7AM0zKKVd5whckhrlCFwDc9QsjRZH4JLUKANckhplgEtSowxwSWqUAS5JjTLAJalRBrgkNcoAl6RGGeCS1CgDXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGuVXqmle+TVso2tHv4hao8cRuCQ1ygCXpEYZ4JLUKANckhplgEtSowxwSWqUAS5JjTLAJalRswZ4ksOTfDrJ7UluS3JO1/74JNcm+Wr388D5L1eStM0gI/CtwLlVdQzwTOC3khwDnA9cV1VHAdd1y5KkBTJrgFfVvVV1c3f/+8BGYAVwKnBpt9mlwGnzVKMkaQqpqsE3TlYC1wPHAt+oqgO69gCbty1PesxaYC3A8uXLj1+3bt1267ds2cLSpUt3rvpFbtT6ZsOmh4Zdws8s3xvue3jYVYyeueiXVSuWzU0xI2bUfp92xOrVq2+qqrHJ7QMHeJKlwGeAt1XVlUke7A/sJJurasZ58LGxsVq/fv12bRMTE4yPjw9Uw+5m1PpmlC5Mde6qrbxzg9dim2wu+mWxXsxq1H6fdkSSKQN8oKNQkjwW+BDw/qq6smu+L8mh3fpDgfvnqlhJ0uwGOQolwMXAxqr6H32rrgLWdPfXAB+Z+/IkSdMZ5G+tZwOvBDYkuaVr+33gAuADSc4C7gZeNi8VSpKmNGuAV9XngEyz+oS5LUeSNCjPxJSkRvkxvh5llI420cLzq9ba4QhckhplgEtSowxwSWqUAS5JjfJDTEkD8cPN0eMIXJIaZYBLUqMMcElqlAEuSY0ywCWpUQa4JDXKwwgl7RIPLxweR+CS1CgDXJIa5RTKbszLxkptcwQuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1KjDHBJapQBLkmNMsAlqVEGuCQ1ymuhSJoXM11rx0vNzg1H4JLUKEfgkhacXwIxNxyBS1KjHIHvB
rzut1rniH1qjsAlqVEGuCQ1yimURcSpEqlnqt+Fc1dtZXzhS5lXjsAlqVGOwBvkSFuL1Y6+t3f33wVH4JLUqF0agSc5CbgQ2AP466q6YE6qmsJ8H0a0M/uf75o2bHqIM3fzEYY0inZ05D9fhzvu9Ag8yR7AXwAvBI4BzkhyzFwVJkma2a5MoTwd+FpV3VlVPwLWAafOTVmSpNmkqnbugcnpwElVdXa3/ErgGVX1uknbrQXWdotHA1+ZtKuDge/sVBGLn30zPftmavbL9FrumyOq6pDJjfN+FEpVXQRcNN36JOuramy+62iRfTM9+2Zq9sv0FmPf7MoUyibg8L7lw7o2SdIC2JUA/wJwVJIjk+wJvBy4am7KkiTNZqenUKpqa5LXAZ+gdxjh31TVbTuxq2mnV2TfzMC+mZr9Mr1F1zc7/SGmJGm4PBNTkhplgEtSo4YW4EmOTnJL3+17Sf7rsOoZJUl+J8ltSW5NcnmSvYZd06hIck7XL7ft7u+XJH+T5P4kt/a1PT7JtUm+2v08cJg1Dss0ffPS7n3z0ySL4nDCoQV4VX2lqo6rquOA44EfAn8/rHpGRZIVwOuBsao6lt4HxC8fblWjIcmxwGvpnQX8VOBFSZ403KqG6hLgpElt5wPXVdVRwHXd8u7oEh7dN7cCvwZcv+DVzJNRmUI5Abijqu4ediEjYgmwd5IlwD7At4Zcz6j4ReDGqvphVW0FPkPvF3K3VFXXA9+d1HwqcGl3/1LgtIWsaVRM1TdVtbGqJp8J3rRRCfCXA5cPu4hRUFWbgHcA3wDuBR6qqmuGW9XIuBV4TpKDkuwDnMz2J5MJllfVvd39bwPLh1mM5tfQA7w7CehXgQ8Ou5ZR0M1ZngocCfw8sG+S/zzcqkZDVW0E3g5cA3wcuAX4yTBrGmXVO0bY44QXsaEHOL3L0d5cVfcNu5ARcSLw9ap6oKp+DFwJPGvINY2Mqrq4qo6vqucCm4H/O+yaRsx9SQ4F6H7eP+R6NI9GIcDPwOmTft8AnplknySh9/nAxiHXNDKSPKH7+UR689+XDbeikXMVsKa7vwb4yBBr0Twb6pmYSfalF1i/UFUPDa2QEZPkLcB/BLYCXwTOrqp/HW5VoyHJZ4GDgB8Db6iq64Zc0tAkuRwYp3eZ1PuANwMfBj4APBG4G3hZVU3+oHPRm6Zvvgv8GXAI8CBwS1X9ypBKnBOeSi9JjRqFKRRJ0k4wwCWpUQa4JDXKAJekRhngktQoA1zNSLI8yWVJ7kxyU5LPJ3nJFNut7L8KXV/7Hyc5cYDnOS5JJZl8MSRppBjgakJ3UtOHgeur6heq6nh619A5bNJ2035NYFX9YVV9coCnOwP4XPdzylqS+LujofNNqFY8H/hRVf3ltoaquruq/izJmUmuSvIpepdQnVKSS5KcnuSkJB/sax9PcnV3P8BLgTOBX952LfZuVP+VJP+b3kW1Dk/y35J8IcmXu5Ovtu3vw91fCLclWTu33SA9wgBXK54C3DzD+qcBp1fV8wbY1yeBZ3RnAkPvrNd13f1n0bsWzR3ABHBK3+OOAv5XVT0FOLpbfjpwHHB8kud2272m+wthDHh9koMGqEnaYQa4mpTkL5J8KckXuqZrBz1lvLuW+MeBF3dTLqfwyDVDzuCRMF/H9tMod1fVDd39F3S3L9L7j+XJ9AIdeqH9JeAGepe7PQppHkw7XyiNmNuA/7Btoap+K8nBwPqu6Qc7uL91wOvoXR9jfVV9P8ke3XOcmuQPgAAHJdlviucI8N+r6j39O00yTu+Kkr9UVT9MMgH4lXiaF47A1YpPAXsl+S99bfvswv4+Q2/a5bU8MuI+AfhyVR1eVSur6gjgQ8CjjnQBPgG8JslS6H0VXnelxGXA5i68nww8cxdqlGZkgKsJ3ZcTnAY8L8nXk/wTva8MO2+ahxyd5J6+20sn7e8nwNX0rkd/ddd8Bo/+XtYPMcXRKN23JF0GfD7JBuAKYD96UzNLkmwELqA3jSLNC
69GKEmNcgQuSY0ywCWpUQa4JDXKAJekRhngktQoA1ySGmWAS1Kj/j9NRVWgumI25QAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# transformed variable\n", "train_t['GrLivArea'].hist(bins=50)\n", "plt.title('Transformed variable')\n", "plt.xlabel('GrLivArea')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 0, 'LotArea')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# tvariable before transformation\n", "X_train['LotArea'].hist(bins=50)\n", "plt.title('Variable before transformation')\n", "plt.xlabel('LotArea')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 0, 'LotArea')" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# transformed variable\n", "train_t['LotArea'].hist(bins=50)\n", "plt.title('Variable before transformation')\n", "plt.xlabel('LotArea')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Automatically select numerical variables\n", "\n", "The transformer will transform all numerical variables if no variables are specified." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# load numerical variables only\n", "\n", "variables = ['LotFrontage', 'LotArea',\n", " '1stFlrSF', 'GrLivArea',\n", " 'TotRmsAbvGrd', 'SalePrice']\n", "\n", "data = pd.read_csv('houseprice.csv', usecols=variables)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((1022, 5), (438, 5))" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's separate into training and testing set\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " data.drop(['SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)\n", "\n", "X_train.shape, X_test.shape" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# Impute missing values\n", "\n", "arbitrary_imputer = ArbitraryNumberImputer(arbitrary_number=2)\n", "\n", "arbitrary_imputer.fit(X_train)\n", "\n", "# impute variables\n", "train_t = arbitrary_imputer.transform(X_train)\n", "test_t = arbitrary_imputer.transform(X_test)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BoxCoxTransformer()" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's transform all numerical variables\n", "\n", "bct = BoxCoxTransformer()\n", "\n", "bct.fit(train_t)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ 
{ "data": { "text/plain": [ "['LotFrontage', 'LotArea', '1stFlrSF', 'GrLivArea', 'TotRmsAbvGrd']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# variables that will be transformed\n", "\n", "bct.variables_" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# transform variables\n", "train_t = bct.transform(train_t)\n", "test_t = bct.transform(test_t)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'LotFrontage': 0.7837538110249009,\n", " 'LotArea': 0.022716974992922984,\n", " '1stFlrSF': 0.024760203538733927,\n", " 'GrLivArea': 0.06854346283829917,\n", " 'TotRmsAbvGrd': 0.26841547941861493}" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# learned parameters\n", "\n", "bct.lambda_dict_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "fenotebook", "language": "python", "name": "fenotebook" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }