{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](mlcourse.ai) – Open Machine Learning Course \n", "###
Author: Ilya Larchenko, ODS Slack ilya_l\n", " \n", "##
Individual data analysis project" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Data description" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__I will analyse the California Housing Data (1990). It can be downloaded from Kaggle [https://www.kaggle.com/harrywang/housing]__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will predict the median house value in each block. \n", "To start, you need to download the file housing.csv.zip. Let's load the data and look at it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import os\n", "%matplotlib inline\n", "\n", "import warnings # `do not disturb` mode\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# change this if needed\n", "PATH_TO_DATA = 'data'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "full_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'housing.csv.zip'), compression ='zip')\n", "print(full_df.shape)\n", "full_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data consists of 20640 rows and 10 features:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. longitude: A measure of how far west a house is; a higher value is farther west\n", "2. latitude: A measure of how far north a house is; a higher value is farther north\n", "3. housingMedianAge: Median age of a house within a block; a lower number is a newer building\n", "4. totalRooms: Total number of rooms within a block\n", "5. totalBedrooms: Total number of bedrooms within a block\n", "6. population: Total number of people residing within a block\n", "7. households: Total number of households, a group of people residing within a home unit, for a block\n", "8. medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)\n", "9. medianHouseValue: Median house value for households within a block (measured in US Dollars)\n", "10. oceanProximity: Location of the house w.r.t. the ocean/sea" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*median_house_value* is our target feature; we will use the other features to predict it.\n", "\n", "The task is to predict the median cost of houses in a particular block based on the block's location and basic sociodemographic data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's divide the dataset into train (75%) and test (25%) parts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "from sklearn.model_selection import train_test_split\n", "train_df, test_df = train_test_split(full_df,shuffle = True, test_size = 0.25, random_state=17)\n", "train_df=train_df.copy()\n", "test_df=test_df.copy()\n", "print(train_df.shape)\n", "print(test_df.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All further analysis will be done on the train set, while feature generation and processing will be applied to both sets simultaneously." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2-3. 
Primary data analysis / Primary visual data analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df.describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that most columns have no NaN values (except total_bedrooms), most features are floats, and only one feature is categorical - ocean_proximity." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df[pd.isnull(train_df).any(axis=1)].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no obvious reason for some total_bedrooms values to be NaN. The number of NaNs is about 1% of the dataset. We could simply drop these rows or fill them with mean/median values, but let's wait for a while and deal with the blanks in a smarter manner after the initial data analysis." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create the list of numerical feature names (it will be useful later)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "numerical_features=list(train_df.columns)\n", "numerical_features.remove('ocean_proximity')\n", "numerical_features.remove('median_house_value')\n", "print(numerical_features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the target feature distribution." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df['median_house_value'].hist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the distribution is skewed and not normal. It also seems that the values are clipped somewhere near 500,000. We can check this numerically." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "max_target=train_df['median_house_value'].max()\n", "print(\"The largest median value:\",max_target)\n", "print(\"The # of values, equal to the largest:\", sum(train_df['median_house_value']==max_target))\n", "print(\"The % of values, equal to the largest:\", sum(train_df['median_house_value']==max_target)/train_df.shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Almost 5% of all values are exactly 500,001, which supports our clipping theory. Let's check for clipping of small values:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "min_target=train_df['median_house_value'].min()\n", "print(\"The smallest median value:\",min_target)\n", "print(\"The # of values, equal to the smallest:\", sum(train_df['median_house_value']==min_target))\n", "print(\"The % of values, equal to the smallest:\", sum(train_df['median_house_value']==min_target)/train_df.shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time it looks much better: the slightly artificial value 14,999 is common for prices, and there are only 4 such values. So the small values are probably not clipped." 
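] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a small aside, the clipping check used above can be wrapped into a helper function. This is just a convenience sketch (the helper is our own addition and is not used in the rest of the notebook):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def clipping_share(series, from_top=True):\n", "    # share of observations exactly equal to the extreme value of the series\n", "    extreme = series.max() if from_top else series.min()\n", "    n_extreme = (series == extreme).sum()\n", "    return extreme, n_extreme, n_extreme / len(series)\n", "\n", "# reproduces the numbers computed above for the top and the bottom of the target\n", "print(clipping_share(train_df['median_house_value'], from_top=True))\n", "print(clipping_share(train_df['median_house_value'], from_top=False))" 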
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's conduct some normality tests:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from statsmodels.graphics.gofplots import qqplot\n", "from matplotlib import pyplot\n", "\n", "qqplot(train_df['median_house_value'], line='s')\n", "pyplot.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import normaltest\n", "\n", "stat, p = normaltest(train_df['median_house_value'])\n", "print('Statistics=%.3f, p=%.3f' % (stat, p))\n", "\n", "alpha = 0.05\n", "if p < alpha: # null hypothesis: x comes from a normal distribution\n", " print(\"The null hypothesis can be rejected\")\n", "else:\n", " print(\"The null hypothesis cannot be rejected\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "QQ-plot and D’Agostino and Pearson’s normality test show that the distribution is far from normal. We can try to use log(1+n) to make it more normal:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "target_log=np.log1p(train_df['median_house_value'])\n", "qqplot(target_log, line='s')\n", "pyplot.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "stat, p = normaltest(target_log)\n", "print('Statistics=%.3f, p=%.3f' % (stat, p))\n", "\n", "alpha = 0.05\n", "if p < alpha: # null hypothesis: x comes from a normal distribution\n", " print(\"The null hypothesis can be rejected\")\n", "else:\n", " print(\"The null hypothesis cannot be rejected\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This graph looks much better, the only non-normal parts are clipped high prices and very low prices. Unfortunately we can not reconstruct clipped data and statistically the distribution it is still not normal - p-value = 0, the null hypothesis of distribution normality can be rejected.\n", "\n", "Anyway, predicting of target_log instead of target can be a good choice for us, but we still should check it during model validation phase." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df['median_house_value_log']=np.log1p(train_df['median_house_value'])\n", "test_df['median_house_value_log']=np.log1p(test_df['median_house_value'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's analyze numerical features. First of all we need to look at their distributions." 
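] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to the histograms below, we can quantify the asymmetry of the numerical features with the sample skewness (a small extra check added here; values far from 0 indicate strongly skewed distributions):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# sample skewness of every numerical feature (NaNs are skipped by default)\n", "train_df[numerical_features].skew()" 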
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df[numerical_features].hist(bins=50, figsize=(10, 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some features are significantly skewed, so our \"log trick\" should be helpful." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "skewed_features=['households','median_income','population', 'total_bedrooms', 'total_rooms']\n", "log_numerical_features=[]\n", "for f in skewed_features:\n", "    train_df[f + '_log']=np.log1p(train_df[f])\n", "    test_df[f + '_log']=np.log1p(test_df[f])\n", "    log_numerical_features.append(f + '_log')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df[log_numerical_features].hist(bins=50, figsize=(10, 10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our new features look much better (during the modeling phase we can use either the original features, the new ones, or both)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "housing_median_age looks clipped as well. Let's look at its highest value more closely." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "max_house_age=train_df['housing_median_age'].max()\n", "print(\"The largest value:\",max_house_age)\n", "print(\"The # of values, equal to the largest:\", sum(train_df['housing_median_age']==max_house_age))\n", "print(\"The % of values, equal to the largest:\", sum(train_df['housing_median_age']==max_house_age)/train_df.shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is very likely that the data is clipped (there is also a small chance that in 1938 there was a great reconstruction project in California, but that seems less likely). We can't recreate the original values, but it can be useful to create a new binary feature indicating that the house age was clipped." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df['age_clipped']=train_df['housing_median_age']==max_house_age\n", "test_df['age_clipped']=test_df['housing_median_age']==max_house_age" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we will analyse the correlations between the features and the target variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "corr_y = pd.DataFrame(train_df).corr()\n", "plt.rcParams['figure.figsize'] = (20, 16)  # figure size\n", "sns.heatmap(corr_y, \n", "            xticklabels=corr_y.columns.values,\n", "            yticklabels=corr_y.columns.values, annot=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see some (maybe obvious) patterns here:\n", " - House values are significantly correlated with median income\n", " - Number of households is not 100% correlated with population, so we can try to add average_size_of_household as a feature\n", " - Longitude and latitude should be analyzed separately (a simple correlation with the target variable is not very useful)\n", " - There is a set of highly correlated features: number of rooms, bedrooms, population and households.\n", "
It can be useful to reduce dimensionality of this subset, especially if we use linear models\n", " - total_bedrooms is one of these highly correlated features, it means we can fill NaN values with high precision using simplest linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to fill NaNs with simple linear regression:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "from sklearn.metrics import mean_squared_error\n", "\n", "lin = LinearRegression()\n", "\n", "# we will train our model based on all numerical non-target features with not NaN total_bedrooms\n", "appropriate_columns = train_df.drop(['median_house_value','median_house_value_log',\n", " 'ocean_proximity', 'total_bedrooms_log'],axis=1)\n", "train_data=appropriate_columns[~pd.isnull(train_df).any(axis=1)]\n", "\n", "# model will be validated on 25% of train dataset \n", "# theoretically we can use even our test_df dataset (as we don't use target) for this task, but we will not\n", "temp_train, temp_valid = train_test_split(train_data,shuffle = True, test_size = 0.25, random_state=17)\n", "\n", "lin.fit(temp_train.drop(['total_bedrooms'],axis=1), temp_train['total_bedrooms'])\n", "np.sqrt(mean_squared_error(lin.predict(temp_valid.drop(['total_bedrooms'],axis=1)),\n", " temp_valid['total_bedrooms']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "RMSE on a validation set is 64.5. Let's compare this with the best constant prediction - what if we fill NaNs with mean value:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.sqrt(mean_squared_error(np.ones(len(temp_valid['total_bedrooms']))*temp_train['total_bedrooms'].mean(),\n", " temp_valid['total_bedrooms']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Obviously our linear regression approach is much better. Let's train our model on whole train dataset and apply it to the rows with blanks. But preliminary we will \"remember\" the rows with NaNs, because there is a chance, that it can contain useful information." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lin.fit(train_data.drop(['total_bedrooms'],axis=1), train_data['total_bedrooms'])\n", "\n", "train_df['total_bedrooms_is_nan']=pd.isnull(train_df).any(axis=1).astype(int)\n", "test_df['total_bedrooms_is_nan']=pd.isnull(test_df).any(axis=1).astype(int)\n", "\n", "train_df['total_bedrooms'].loc[pd.isnull(train_df).any(axis=1)]=\\\n", "lin.predict(train_df.drop(['median_house_value','median_house_value_log','total_bedrooms','total_bedrooms_log',\n", " 'ocean_proximity','total_bedrooms_is_nan'],axis=1)[pd.isnull(train_df).any(axis=1)])\n", "\n", "test_df['total_bedrooms'].loc[pd.isnull(test_df).any(axis=1)]=\\\n", "lin.predict(test_df.drop(['median_house_value','median_house_value_log','total_bedrooms','total_bedrooms_log',\n", " 'ocean_proximity','total_bedrooms_is_nan'],axis=1)[pd.isnull(test_df).any(axis=1)])\n", "\n", "#linear regression can lead to negative predictions, let's change it\n", "test_df['total_bedrooms']=test_df['total_bedrooms'].apply(lambda x: max(x,0))\n", "train_df['total_bedrooms']=train_df['total_bedrooms'].apply(lambda x: max(x,0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's update 'total_bedrooms_log' and check if there are no NaNs left" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df['total_bedrooms_log']=np.log1p(train_df['total_bedrooms'])\n", "test_df['total_bedrooms_log']=np.log1p(test_df['total_bedrooms'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(train_df.info())\n", "print(test_df.info())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After filling of blanks let's have a closer look on dependences between some numerical features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.set()\n", "sns.pairplot(train_df[log_numerical_features+['median_house_value_log']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems there are no new insights about numerical features (only confirmation of the old ones).\n", "\n", "Let's try to do the same thing but for the local (geographically) subset of our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.set()\n", "local_coord=[-122, 41] # the point near which we want to look at our variables\n", "euc_dist_th = 2 # distance treshhold\n", "\n", "euclid_distance=train_df[['latitude','longitude']].apply(lambda x:\n", " np.sqrt((x['longitude']-local_coord[0])**2+\n", " (x['latitude']-local_coord[1])**2), axis=1)\n", "\n", "# indicate wethere the point is within treshhold or not\n", "indicator=pd.Series(euclid_distance<=euc_dist_th, name='indicator')\n", "\n", "print(\"Data points within treshhold:\", sum(indicator))\n", "\n", "# a small map to visualize th eregion for analysis\n", "sns.lmplot('longitude', 'latitude', data=pd.concat([train_df,indicator], axis=1), hue='indicator', markers ='.', fit_reg=False, height=5)\n", "\n", "# pairplot\n", "sns.pairplot(train_df[log_numerical_features+['median_house_value_log']][indicator])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that on any local territory (you can play with local_coord and euc_dist_th) the linear dependences between variables became stronger, especially median_income_log / median_house_value_log. 
So the coordinates are a very important factor for our task (we will analyze them later). \n", "\n", "Now let's move on to the categorical feature \"ocean_proximity\". It is not 100% clear what its values mean, so let's first plot it on the map." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.lmplot('longitude', 'latitude', data=train_df,markers ='.', hue='ocean_proximity', fit_reg=False, height=5)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we better understand the meaning of the different classes. Let's look at the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "value_count=train_df['ocean_proximity'].value_counts()\n", "value_count" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12,5))\n", "\n", "\n", "sns.barplot(value_count.index, value_count.values)\n", "plt.title('Ocean Proximity')\n", "plt.ylabel('Number of Occurrences')\n", "plt.xlabel('Ocean Proximity')\n", "\n", "plt.figure(figsize=(12,5))\n", "plt.title('House Value depending on Ocean Proximity')\n", "sns.boxplot(x=\"ocean_proximity\", y=\"median_house_value_log\", data=train_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that INLAND houses have significantly lower prices. The distributions of the other classes differ, but not by much. There is no clear trend between house price and ocean proximity, so we will not try to invent a complex encoding approach. Let's just do OHE for this feature." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ocean_proximity_dummies = pd.get_dummies(pd.concat([train_df['ocean_proximity'],test_df['ocean_proximity']]),\n", "                                         drop_first=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dummies_names=list(ocean_proximity_dummies.columns)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df=pd.concat([train_df,ocean_proximity_dummies[:train_df.shape[0]]], axis=1 )\n", "test_df=pd.concat([test_df,ocean_proximity_dummies[train_df.shape[0]:]], axis=1 )\n", "\n", "train_df=train_df.drop(['ocean_proximity'], axis=1)\n", "test_df=test_df.drop(['ocean_proximity'], axis=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally we will explore the coordinate features." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train_df[['longitude','latitude']].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's plot the house values (target) on the map:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from matplotlib.colors import LinearSegmentedColormap\n", "\n", "plt.figure(figsize=(10,10))\n", "\n", "cmap = LinearSegmentedColormap.from_list(name='name', colors=['green','yellow','red'])\n", "\n", "f, ax = plt.subplots()\n", "points = ax.scatter(train_df['longitude'], train_df['latitude'], c=train_df['median_house_value_log'],\n", "                    s=10, cmap=cmap)\n", "f.colorbar(points)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems that the average value of the geographically nearest houses could be a very good feature.\n", "\n", "We can also see that the most expensive houses are located near San Francisco (37.7749° N, 122.4194° W) and Los Angeles (34.0522° N, 118.2437° W). Based on this we can use the distances to these cities as additional features.\n", "\n", "We also see that the most expensive houses lie approximately on a straight line and become cheaper as we move to the north-east. This means that a linear combination of the coordinates themselves can be a useful feature as well." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sf_coord=[-122.4194, 37.7749]\n", "la_coord=[-118.2437, 34.0522]\n", "\n", "train_df['distance_to_SF']=np.sqrt((train_df['longitude']-sf_coord[0])**2+(train_df['latitude']-sf_coord[1])**2)\n", "test_df['distance_to_SF']=np.sqrt((test_df['longitude']-sf_coord[0])**2+(test_df['latitude']-sf_coord[1])**2)\n", "\n", "train_df['distance_to_LA']=np.sqrt((train_df['longitude']-la_coord[0])**2+(train_df['latitude']-la_coord[1])**2)\n", "test_df['distance_to_LA']=np.sqrt((test_df['longitude']-la_coord[0])**2+(test_df['latitude']-la_coord[1])**2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Insights and found dependencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's quickly sum up the useful things we have found so far:\n", "- We have analyzed the features and found some approximately lognormally distributed ones among them; we have created the corresponding log features\n", "- We have analyzed the distribution of the target feature and concluded that it may be useful to predict its log (to be checked)\n", "- We have dealt with clipped and missing data\n", "- We have created features corresponding to simple Euclidean distances to LA and SF\n", "- We have also found several highly correlated variables and may work with them later\n", "- We have already generated several new variables and will create more of them later, after the initial modeling phase\n", "\n", "All explanations of these steps were already given above." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Metrics selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is a regression problem. Our target metric will be RMSE - it is one of the most popular regression metrics, and it has the same unit of measurement as the target value, so it is easy to explain to other people.\n", "\n", "\\begin{align}\n", "RMSE = \\sqrt{\\frac{1}{n}\\sum_{i=1}^{n}{\\big(d_i - f_i\\big)^2}}\n", "\\end{align}\n", "\n", "where $d_i$ is the actual target value and $f_i$ is the prediction.\n", "\n", "Since there is a monotonic dependence between RMSE and MSE, we can optimize MSE in our model and compute RMSE only at the end.\n", "
MSE is easy to optimize: it is the default loss function for most regression models.\n", "\n", "The main drawback of MSE and RMSE is the high penalty for big prediction errors - they can overfit to outliers - but in our case the outlying target values have already been clipped, so it is not a big problem." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Model selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will try to solve our problem with 3 different regression models:\n", "- Linear regression\n", "- Random forest\n", "- Gradient boosting\n", "\n", "Linear regression is fast and simple and can provide quite a good baseline result for our task.\n", "Tree-based models can provide better results in the case of complex nonlinear dependencies between variables and a small number of features; they are also more robust to multicollinearity (and we have highly correlated variables). Moreover, in our problem the target values are clipped and cannot go outside the clipping interval, which plays well with tree-based models.\n", "\n", "The results of these models will be compared in parts 11-12 of the project. Tree-based models are expected to work better in this particular problem, but we will start with the simpler model.\n", "\n", "We will start with standard linear regression, go through all of the modeling steps, and then do some simplified computations for the 2 other models (without an in-depth explanation of every step).\n", "\n", "The final model selection will be done based on the results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Data preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have already done most of the preprocessing steps:\n", " - OHE for the categorical features\n", " - Filled NaNs\n", " - Computed logs of skewed data\n", " - Divided the data into train and hold-out sets\n", "\n", "Now let's scale all numerical features (it is useful for the linear models) and prepare cross-validation splits; then we are ready to proceed to modeling." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "features_to_scale=numerical_features+log_numerical_features+['distance_to_SF','distance_to_LA']\n", "\n", "scaler = StandardScaler()\n", "\n", "X_train_scaled=pd.DataFrame(scaler.fit_transform(train_df[features_to_scale]),\n", "                            columns=features_to_scale, index=train_df.index)\n", "X_test_scaled=pd.DataFrame(scaler.transform(test_df[features_to_scale]),\n", "                           columns=features_to_scale, index=test_df.index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Cross-validation and adjustment of model hyperparameters" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's prepare the cross-validation samples.\n", "Since there is not a lot of data, we can easily divide it into 10 folds taken from the shuffled train data.\n", "Within every split we will train our model on 90% of the train data and compute the CV metric on the other 10%.\n", "\n", "We fix the random state for reproducibility." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import KFold, cross_val_score\n", "\n", "kf = KFold(n_splits=10, random_state=17, shuffle=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the first initial baseline we will take Rigge model with only initial numerical and OHE features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Ridge\n", "\n", "model=Ridge(alpha=1)\n", "X=train_df[numerical_features+dummies_names]\n", "y=train_df['median_house_value']\n", "cv_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)\n", "print(np.sqrt(-cv_scores.mean()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are doing cross validation with 10 folds, computing 'neg_mean_squared_error' (neg - because sklearn needs scoring functions to be minimized). Our final metrics: RMSE=np.sqrt(-neg_MSE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So our baseline is RMSE = $68 702 we will try to improve this results using everything we have discovered during the data analysis phase.\n", "\n", "We will do the following steps:\n", " - Use scaled features\n", " - Add log features \n", " - Add NaN and age clip indicating features\n", " - Add city-distance features\n", " - Generate several new features\n", " - Try to predict log(target) instead of target\n", " - Tune some hyperparameters of the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One again the most part of the hyperparameters adjustment will be done later after we add some new features. Actually the cross-validation and parameters tuning process is done through the parts 8-11. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# using scaled data\n", "X=pd.concat([train_df[dummies_names], X_train_scaled[numerical_features]], axis=1, ignore_index = True)\n", "cv_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)\n", "print(np.sqrt(-cv_scores.mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# adding NaN indicating feature\n", "X=pd.concat([train_df[dummies_names+['total_bedrooms_is_nan']],\n", " X_train_scaled[numerical_features]], axis=1, ignore_index = True)\n", "cv_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)\n", "print(np.sqrt(-cv_scores.mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# adding house age cliiping indicating feature\n", "X=pd.concat([train_df[dummies_names+['age_clipped']],\n", " X_train_scaled[numerical_features]], axis=1, ignore_index = True)\n", "cv_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)\n", "print(np.sqrt(-cv_scores.mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# adding log features\n", "X=pd.concat([train_df[dummies_names+['age_clipped']], X_train_scaled[numerical_features+log_numerical_features]],\n", " axis=1, ignore_index = True)\n", "cv_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)\n", "print(np.sqrt(-cv_scores.mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# adding city distance features\n", "X=pd.concat([train_df[dummies_names+['age_clipped']], X_train_scaled],\n", " axis=1, ignore_index = True)\n", "cv_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)\n", "print(np.sqrt(-cv_scores.mean()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Up to this moment we have got best result using numerical features + their logs + age_clipped+ dummy variables + distances to the largest cities.\n", "Let's try to generate new features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. Creation of new features and description of this process" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Previously we have already created and explained the rational of new features creation. Now Let's generate additional ones " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "City distances features work, but maybe there are also some non-linear dependencies between them and the target variables." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.set()\n", "sns.pairplot(train_df[['distance_to_SF','distance_to_LA','median_house_value_log']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visually is not obvious so let's try to create a couple of new variables and check:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_features_train_df=pd.DataFrame(index=train_df.index)\n", "new_features_test_df=pd.DataFrame(index=test_df.index)\n", "\n", "\n", "new_features_train_df['1/distance_to_SF']=1/(train_df['distance_to_SF']+0.001)\n", "new_features_train_df['1/distance_to_LA']=1/(train_df['distance_to_LA']+0.001)\n", "new_features_train_df['log_distance_to_SF']=np.log1p(train_df['distance_to_SF'])\n", "new_features_train_df['log_distance_to_LA']=np.log1p(train_df['distance_to_LA'])\n", "\n", "new_features_test_df['1/distance_to_SF']=1/(test_df['distance_to_SF']+0.001)\n", "new_features_test_df['1/distance_to_LA']=1/(test_df['distance_to_LA']+0.001)\n", "new_features_test_df['log_distance_to_SF']=np.log1p(test_df['distance_to_SF'])\n", "new_features_test_df['log_distance_to_LA']=np.log1p(test_df['distance_to_LA'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also generate some features correlated to the prosperity:\n", "- rooms/person - how many rooms are there per person. The higher - the richer people are living there - the more expensive houses they buy\n", "- rooms/household - how many rooms are there per family. The similar one but corresponds to the number of rooms per family (assuming household~family), not per person.\n", "- two similar features but counting only bedrooms" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_features_train_df['rooms/person']=train_df['total_rooms']/train_df['population']\n", "new_features_train_df['rooms/household']=train_df['total_rooms']/train_df['households']\n", "\n", "new_features_test_df['rooms/person']=test_df['total_rooms']/test_df['population']\n", "new_features_test_df['rooms/household']=test_df['total_rooms']/test_df['households']\n", "\n", "\n", "new_features_train_df['bedrooms/person']=train_df['total_bedrooms']/train_df['population']\n", "new_features_train_df['bedrooms/household']=train_df['total_bedrooms']/train_df['households']\n", "\n", "new_features_test_df['bedrooms/person']=test_df['total_bedrooms']/test_df['population']\n", "new_features_test_df['bedrooms/household']=test_df['total_bedrooms']/test_df['households']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- the luxurity of house can be characterized buy number of bedrooms per rooms" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_features_train_df['bedroom/rooms']=train_df['total_bedrooms']/train_df['total_rooms']\n", "new_features_test_df['bedroom/rooms']=test_df['total_bedrooms']/test_df['total_rooms']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- the average number of persons in one household can be the signal of prosperity or the same time the signal of richness but in any case it can be a useful feature" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_features_train_df['average_size_of_household']=train_df['population']/train_df['households']\n", "new_features_test_df['average_size_of_household']=test_df['population']/test_df['households']" ] }, 
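{ "cell_type": "markdown", "metadata": {}, "source": [ "Since the ratio features above involve division, let's quickly check that they contain no infinite or missing values (an extra sanity check added here; in this dataset population, households and total_rooms should never be zero, so we expect it to pass):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check both new feature tables for NaN / inf values produced by the divisions\n", "for name, df_check in [('train', new_features_train_df), ('test', new_features_test_df)]:\n", "    n_bad = (~np.isfinite(df_check)).sum().sum()\n", "    print(name, 'non-finite values:', n_bad)" ] }, 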
{ "cell_type": "markdown", "metadata": {}, "source": [ "And finally let's scale all this features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_features_train_df=pd.DataFrame(scaler.fit_transform(new_features_train_df),\n", " columns=new_features_train_df.columns, index=new_features_train_df.index)\n", "\n", "new_features_test_df=pd.DataFrame(scaler.transform(new_features_test_df),\n", " columns=new_features_test_df.columns, index=new_features_test_df.index)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_features_train_df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "new_features_test_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will add new features one by one and keeps only those that improve our best score" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# computing current best score\n", "\n", "X=pd.concat([train_df[dummies_names+['age_clipped']], X_train_scaled],\n", " axis=1, ignore_index = True)\n", "\n", "cv_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)\n", "best_score = np.sqrt(-cv_scores.mean())\n", "print(\"Best score: \", best_score)\n", "\n", "# list of the new good features\n", "new_features_list=[]\n", "\n", "for feature in new_features_train_df.columns:\n", " new_features_list.append(feature)\n", " X=pd.concat([train_df[dummies_names+['age_clipped']], X_train_scaled,\n", " new_features_train_df[new_features_list]\n", " ],\n", " axis=1, ignore_index = True)\n", " cv_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)\n", " score = np.sqrt(-cv_scores.mean())\n", " if score >= best_score:\n", " new_features_list.remove(feature)\n", " print(feature, ' is not a good feature')\n", " else:\n", " print(feature, ' is a good feature')\n", " print('New best score: ', score)\n", " best_score=score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have got 5 new good features. Let's update our X variable" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X=pd.concat([train_df[dummies_names+['age_clipped']], X_train_scaled,\n", " new_features_train_df[new_features_list]\n", " ],\n", " axis=1).reset_index(drop=True)\n", "y=train_df['median_house_value'].reset_index(drop=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To deal with log of target we need to create our own cross validation or our own predicting model. 
We will try the first option" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import mean_squared_error\n", "\n", "def cross_val_score_with_log(model=model, X=X,y=y,kf=kf, use_log=False):\n", "\n", " X_temp=np.array(X)\n", "\n", " # if use_log parameter is true we will predict log(y+1)\n", " if use_log:\n", " y_temp=np.log1p(y)\n", " else:\n", " y_temp=np.array(y)\n", " \n", " cv_scores=[]\n", " for train_index, test_index in kf.split(X_temp,y_temp):\n", "\n", " prediction = model.fit(X_temp[train_index], y_temp[train_index]).predict(X_temp[test_index])\n", " \n", " # if use_log parameter is true we should come back to the initial targer\n", " if use_log:\n", " prediction=np.expm1(prediction)\n", " cv_scores.append(-mean_squared_error(y[test_index],prediction))\n", "\n", " return np.sqrt(-np.mean(cv_scores))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cross_val_score_with_log(X=X,y=y,kf=kf, use_log=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have got exactly the same result as with cross_val_score function. That means everything work ok. Now let's try to set use_log to true" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cross_val_score_with_log(X=X,y=y,kf=kf, use_log=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unfortunately, it has not helped. So we will stick to the previous version.\n", "\n", "And now we will tune the only meaningful hyperparameter of the Ridge regression - alpha." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10. Plotting training and validation curves" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's plot Validation Curve" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import validation_curve\n", "\n", "Cs=np.logspace(-5, 4, 10)\n", "train_scores, valid_scores = validation_curve(model, X, y, \"alpha\", \n", " Cs, cv=kf, scoring='neg_mean_squared_error')\n", "\n", "plt.plot(Cs, np.sqrt(-train_scores.mean(axis=1)), 'ro-')\n", "\n", "plt.fill_between(x=Cs, y1=np.sqrt(-train_scores.max(axis=1)), \n", " y2=np.sqrt(-train_scores.min(axis=1)), alpha=0.1, color = \"red\")\n", "\n", "\n", "plt.plot(Cs, np.sqrt(-valid_scores.mean(axis=1)), 'bo-')\n", "\n", "plt.fill_between(x=Cs, y1=np.sqrt(-valid_scores.max(axis=1)), \n", " y2=np.sqrt(-valid_scores.min(axis=1)), alpha=0.1, color = \"blue\")\n", "\n", "plt.xscale('log')\n", "plt.xlabel('alpha')\n", "plt.ylabel('RMSE')\n", "plt.title('Regularization Parameter Tuning')\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Cs[np.sqrt(-valid_scores.mean(axis=1)).argmin()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that curves for train and CV are very close to each other, it is a sign of underfiting. The difference between the curves does not change along with change in alpha this mean that we should try more complex models comparing to linear regression or add more new features (f.e. polynomial ones)\n", "\n", "Using this curve we can find the optimal value of alpha. It is alpha=1. 
But actually our prediction does not change when alpha goes below 1.\n", "\n", "Let's use alpha=1 and plot the learning curve." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import learning_curve\n", "\n", "model=Ridge(alpha=1.0)\n", "\n", "train_sizes, train_scores, valid_scores = learning_curve(model, X, y, train_sizes=list(range(50,10001,100)),\n", "                                                         scoring='neg_mean_squared_error', cv=5)\n", "\n", "plt.plot(train_sizes, np.sqrt(-train_scores.mean(axis=1)), 'ro-')\n", "\n", "plt.fill_between(x=train_sizes, y1=np.sqrt(-train_scores.max(axis=1)),\n", "                 y2=np.sqrt(-train_scores.min(axis=1)), alpha=0.1, color = \"red\")\n", "\n", "plt.plot(train_sizes, np.sqrt(-valid_scores.mean(axis=1)), 'bo-')\n", "\n", "plt.fill_between(x=train_sizes, y1=np.sqrt(-valid_scores.max(axis=1)),\n", "                 y2=np.sqrt(-valid_scores.min(axis=1)), alpha=0.1, color = \"blue\")\n", "\n", "plt.xlabel('Train size')\n", "plt.ylabel('RMSE')\n", "plt.title('Learning Curve')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The learning curves indicate a high bias of the model - this means we will not improve our model by adding more data, but we can try more complex models or add more features to improve the results.\n", "\n", "This result is in line with the validation curve results. So let's move on to the more complex models." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random forest" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could just put all our features into the model, but we can easily improve the computational performance of tree-based models by removing all monotonic transformations of existing features, because they do not help trees at all.\n", "\n", "For example, adding log(feature) does not help a tree-based model; it just makes training more computationally intensive.\n", "\n", "So let's train a random forest regressor on the shortened set of features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X.columns" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features_for_trees=['INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN', 'age_clipped',\n", "                    'longitude', 'latitude', 'housing_median_age', 'total_rooms',\n", "                    'total_bedrooms', 'population', 'households', 'median_income',\n", "                    'distance_to_SF', 'distance_to_LA','bedroom/rooms']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "from sklearn.ensemble import RandomForestRegressor\n", "\n", "X_trees=X[features_for_trees]\n", "\n", "model_rf=RandomForestRegressor(n_estimators=100, random_state=17)\n", "cv_scores = cross_val_score(model_rf, X_trees, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)\n", "\n", "print(np.sqrt(-cv_scores.mean()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see a significant improvement compared to the linear model, and a higher n_estimators will probably help.\n", "
But first, let's try to tune other hyperparametres:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "param_grid={'n_estimators': [100],\n", " 'max_depth': [22, 23, 24, 25],\n", " 'max_features': [5,6,7,8]}\n", "\n", "gs=GridSearchCV(model_rf, param_grid, scoring='neg_mean_squared_error', fit_params=None, n_jobs=-1, cv=kf, verbose=1)\n", "\n", "gs.fit(X_trees,y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(np.sqrt(-gs.best_score_))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gs.best_params_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "best_depth=gs.best_params_['max_depth']\n", "best_features=gs.best_params_['max_features']" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "model_rf=RandomForestRegressor(n_estimators=100, max_depth=best_depth, max_features=best_features, random_state=17)\n", "cv_scores = cross_val_score(model_rf, X_trees, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)\n", "\n", "print(np.sqrt(-cv_scores.mean()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the relatively small effort we have got a significant improvement of results. Random Forest results can be further improved by higher n_estimators, let's find the n_estimators at witch the results stabilize. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_rf=RandomForestRegressor(n_estimators=200, max_depth=best_depth, max_features=best_features, random_state=17)\n", "Cs=list(range(20,201,20))\n", "train_scores, valid_scores = validation_curve(model_rf, X_trees, y, \"n_estimators\", \n", " Cs, cv=kf, scoring='neg_mean_squared_error')\n", "\n", "plt.plot(Cs, np.sqrt(-train_scores.mean(axis=1)), 'ro-')\n", "\n", "plt.fill_between(x=Cs, y1=np.sqrt(-train_scores.max(axis=1)), \n", " y2=np.sqrt(-train_scores.min(axis=1)), alpha=0.1, color = \"red\")\n", "\n", "\n", "plt.plot(Cs, np.sqrt(-valid_scores.mean(axis=1)), 'bo-')\n", "\n", "plt.fill_between(x=Cs, y1=np.sqrt(-valid_scores.max(axis=1)), \n", " y2=np.sqrt(-valid_scores.min(axis=1)), alpha=0.1, color = \"blue\")\n", "\n", "plt.xlabel('n_estimators')\n", "plt.ylabel('RMSE')\n", "plt.title('Regularization Parameter Tuning')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This time we can see that the results of train is much better than CV, but it is totally ok for the Random Forest. \n", "\n", "Higher value of n_estimators (>100) does not help much. Let's stick to the n_estimators=200 - it is high enough but not very computationally intensive." 
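] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on, we can also take a quick look at the impurity-based feature importances of the tuned forest - an optional extra illustration (not part of the original pipeline and not used further) of which features drive the predictions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# fit the tuned forest on the whole train set and plot its feature importances\n", "model_rf.fit(X_trees, y)\n", "importances = pd.Series(model_rf.feature_importances_, index=X_trees.columns).sort_values()\n", "importances.plot(kind='barh', figsize=(8, 6), title='Random forest feature importances')" 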
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Gradient boosting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally we will try to use LightGBM to solve our problem.\n", "We will try the model out of the box, and then tune some of its parameters using random search" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# uncomment to install if you have not yet\n", "#!pip install lightgbm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "from lightgbm.sklearn import LGBMRegressor\n", "\n", "model_gb=LGBMRegressor()\n", "cv_scores = cross_val_score(model_gb, X_trees, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=1)\n", "\n", "print(np.sqrt(-cv_scores.mean()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "LGBMRegressor has much more hyperparameters than previous models. As far as this is educational problem we will not spend a lot of time to tuning all of them. In this case RandomizedSearchCV can give us very good result quite fast, much faster than GridSearch. We will do optimization in 2 steps: model complexity optimization and convergence optimization. Let's do it." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# model complexity optimization\n", "from sklearn.model_selection import RandomizedSearchCV\n", "from scipy.stats import randint, uniform\n", "\n", "param_grid={'max_depth': randint(6,11),\n", " 'num_leaves': randint(7,127),\n", " 'reg_lambda': np.logspace(-3,0,100),\n", " 'random_state': [17]}\n", "\n", "gs=RandomizedSearchCV(model_gb, param_grid, n_iter = 50, scoring='neg_mean_squared_error', fit_params=None, \n", " n_jobs=-1, cv=kf, verbose=1, random_state=17)\n", "\n", "gs.fit(X_trees,y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.sqrt(-gs.best_score_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gs.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's fix n_estimators=500, it is big enough but is not to computationally intensive yet, and find the best value of the learning_rate" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# model convergency optimization\n", "\n", "param_grid={'n_estimators': [500],\n", " 'learning_rate': np.logspace(-4, 0, 100),\n", " 'max_depth': [10],\n", " 'num_leaves': [72],\n", " 'reg_lambda': [0.0010722672220103231],\n", " 'random_state': [17]}\n", "\n", "gs=RandomizedSearchCV(model_gb, param_grid, n_iter = 20, scoring='neg_mean_squared_error', fit_params=None, \n", " n_jobs=-1, cv=kf, verbose=1, random_state=17)\n", "\n", "gs.fit(X_trees,y)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.sqrt(-gs.best_score_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gs.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have got the best params for the gradient boosting and will use them for the final prediction." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 11. Prediction for test or hold-out samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets sum up the results of our project. 
We will compute RMSE on cross validation and holdout set and compare them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results_df=pd.DataFrame(columns=['model','CV_results', 'holdout_results'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# hold-out features and target \n", "X_ho=pd.concat([test_df[dummies_names+['age_clipped']], X_test_scaled,\n", " new_features_test_df[new_features_list]],axis=1).reset_index(drop=True)\n", "y_ho=test_df['median_house_value'].reset_index(drop=True)\n", "\n", "X_trees_ho=X_ho[features_for_trees]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "#linear model\n", "model=Ridge(alpha=1.0)\n", "\n", "cv_scores = cross_val_score(model, X, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)\n", "score_cv=np.sqrt(-np.mean(cv_scores.mean()))\n", "\n", "\n", "prediction_ho = model.fit(X, y).predict(X_ho)\n", "score_ho=np.sqrt(mean_squared_error(y_ho,prediction_ho))\n", "\n", "results_df.loc[results_df.shape[0]]=['Linear Regression', score_cv, score_ho]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "#Random Forest\n", "model_rf=RandomForestRegressor(n_estimators=200, max_depth=23, max_features=5, random_state=17)\n", "\n", "cv_scores = cross_val_score(model_rf, X_trees, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)\n", "score_cv=np.sqrt(-np.mean(cv_scores.mean()))\n", "\n", "\n", "prediction_ho = model_rf.fit(X_trees, y).predict(X_trees_ho)\n", "score_ho=np.sqrt(mean_squared_error(y_ho,prediction_ho))\n", "\n", "results_df.loc[results_df.shape[0]]=['Random Forest', score_cv, score_ho]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "#Gradient boosting\n", "model_gb=LGBMRegressor(reg_lambda=0.0010722672220103231, max_depth=10,\n", " n_estimators=500, num_leaves=72, random_state=17, learning_rate=0.06734150657750829)\n", "cv_scores = cross_val_score(model_gb, X_trees, y, cv=kf, scoring='neg_mean_squared_error', n_jobs=-1)\n", "score_cv=np.sqrt(-np.mean(cv_scores.mean()))\n", "\n", "prediction_ho = model_gb.fit(X_trees, y).predict(X_trees_ho)\n", "score_ho=np.sqrt(mean_squared_error(y_ho,prediction_ho))\n", "\n", "results_df.loc[results_df.shape[0]]=['Gradient boosting', score_cv, score_ho]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems we have done quite a good job. Cross validation results are inline with holdout ones. Our best CV model - gradient boosting, turned out to be the best on hold-out dataset as well (and it is also faster than random forest)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 12. Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To sum up, we have got the solution that can predict the mean house value in the block with RMSE \\$46k using our best model - LGB. It is not an extremely precise prediction: \\$46k is about 20% of the average mean house price, but it seems that it is near the possible solution for these classes of model based on this data (it is popular dataset but I have not find any solution with significantly better results). \n", "\n", "We have used old Californian data from 1990 so it is not useful right now. 
But the same approach can be used to predict modern house prices (if applied to recent market data)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have done a lot, but the results can surely be improved; at the very least one could try:\n", "\n", "- feature engineering: polynomial features, better distances to the cities (not Euclidean ones, an ellipse representation of the cities), average target values of the geographically closest neighbours (requires a custom estimator function for correct cross-validation)\n", "- PCA for dimensionality reduction (mentioned above but not used)\n", "- other models (at least KNN and SVM can be tried on this data)\n", "- spending more time and effort on RF and LGBM parameter tuning" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }