{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The California housing dataset\n",
    "\n",
    "In this notebook, we will quickly present the dataset known as the \"California\n",
    "housing dataset\". This dataset can be fetched from internet using\n",
    "scikit-learn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_california_housing\n",
    "\n",
    "california_housing = fetch_california_housing(as_frame=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can have a first look at the available description"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(california_housing.DESCR)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have an overview of the entire dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "california_housing.frame.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As written in the description, the dataset contains aggregated data regarding\n",
    "each district in California. Let's have a close look at the features that can\n",
    "be used by a predictive model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "california_housing.data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this dataset, we have information regarding the demography (income,\n",
    "population, house occupancy) in the districts, the location of the districts\n",
    "(latitude, longitude), and general information regarding the house in the\n",
    "districts (number of rooms, number of bedrooms, age of the house). Since these\n",
    "statistics are at the granularity of the district, they corresponds to\n",
    "averages or medians.\n",
    "\n",
    "Now, let's have a look to the target to be predicted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "california_housing.target.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The target contains the median of the house value for each district.\n",
    "Therefore, this problem is a regression problem.\n",
    "\n",
    "We can now check more into details the data types and if the dataset contains\n",
    "any missing value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "california_housing.frame.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that:\n",
    "\n",
    "* the dataset contains 20,640 samples and 8 features;\n",
    "* all features are numerical features encoded as floating number;\n",
    "* there is no missing values.\n",
    "\n",
    "Let's have a quick look at the distribution of these features by plotting\n",
    "their histograms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "california_housing.frame.hist(figsize=(12, 10), bins=30, edgecolor=\"black\")\n",
    "plt.subplots_adjust(hspace=0.7, wspace=0.4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can first focus on features for which their distributions would be more or\n",
    "less expected.\n",
    "\n",
    "The median income is a distribution with a long tail. It means that the salary\n",
    "of people is more or less normally distributed but there is some people\n",
    "getting a high salary.\n",
    "\n",
    "Regarding the average house age, the distribution is more or less uniform.\n",
    "\n",
    "The target distribution has a long tail as well. In addition, we have a\n",
    "threshold-effect for high-valued houses: all houses with a price above 5 are\n",
    "given the value 5.\n",
    "\n",
    "Focusing on the average rooms, average bedrooms, average occupation, and\n",
    "population, the range of the data is large with unnoticeable bin for the\n",
    "largest values. It means that there are very high and few values (maybe they\n",
    "could be considered as outliers?). We can see this specificity looking at the\n",
    "statistics for these features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features_of_interest = [\"AveRooms\", \"AveBedrms\", \"AveOccup\", \"Population\"]\n",
    "california_housing.frame[features_of_interest].describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each of these features, comparing the `max` and `75%` values, we can see a\n",
    "huge difference. It confirms the intuitions that there are a couple of extreme\n",
    "values.\n",
    "\n",
    "Up to know, we discarded the longitude and latitude that carry geographical\n",
    "information. In short, the combination of this feature could help us to decide\n",
    "if there are locations associated with high-valued houses. Indeed, we could\n",
    "make a scatter plot where the x- and y-axis would be the latitude and\n",
    "longitude and the circle size and color would be linked with the house value\n",
    "in the district."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "\n",
    "sns.scatterplot(\n",
    "    data=california_housing.frame,\n",
    "    x=\"Longitude\",\n",
    "    y=\"Latitude\",\n",
    "    size=\"MedHouseVal\",\n",
    "    hue=\"MedHouseVal\",\n",
    "    palette=\"viridis\",\n",
    "    alpha=0.5,\n",
    ")\n",
    "plt.legend(title=\"MedHouseVal\", bbox_to_anchor=(1.05, 0.95), loc=\"upper left\")\n",
    "_ = plt.title(\"Median house value depending of\\n their spatial location\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you are not familiar with the state of California, it is interesting to\n",
    "notice that all datapoints show a graphical representation of this state. We\n",
    "note that the high-valued houses will be located on the coast, where the big\n",
    "cities from California are located: San Diego, Los Angeles, San Jose, or San\n",
    "Francisco.\n",
    "\n",
    "We can do a random subsampling to have less data points to plot but that could\n",
    "still allow us to see these specificities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "rng = np.random.RandomState(0)\n",
    "indices = rng.choice(\n",
    "    np.arange(california_housing.frame.shape[0]), size=500, replace=False\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.scatterplot(\n",
    "    data=california_housing.frame.iloc[indices],\n",
    "    x=\"Longitude\",\n",
    "    y=\"Latitude\",\n",
    "    size=\"MedHouseVal\",\n",
    "    hue=\"MedHouseVal\",\n",
    "    palette=\"viridis\",\n",
    "    alpha=0.5,\n",
    ")\n",
    "plt.legend(title=\"MedHouseVal\", bbox_to_anchor=(1.05, 1), loc=\"upper left\")\n",
    "_ = plt.title(\"Median house value depending of\\n their spatial location\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can make a final analysis by making a pair plot of all features and the\n",
    "target but dropping the longitude and latitude. We will quantize the target\n",
    "such that we can create proper histogram."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Drop the unwanted columns\n",
    "columns_drop = [\"Longitude\", \"Latitude\"]\n",
    "subset = california_housing.frame.iloc[indices].drop(columns=columns_drop)\n",
    "# Quantize the target and keep the midpoint for each interval\n",
    "subset[\"MedHouseVal\"] = pd.qcut(subset[\"MedHouseVal\"], 6, retbins=False)\n",
    "subset[\"MedHouseVal\"] = subset[\"MedHouseVal\"].apply(lambda x: x.mid)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "_ = sns.pairplot(data=subset, hue=\"MedHouseVal\", palette=\"viridis\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While it is always complicated to interpret a pairplot since there is a lot of\n",
    "data, here we can get a couple of intuitions. We can confirm that some\n",
    "features have extreme values (outliers?). We can as well see that the median\n",
    "income is helpful to distinguish high-valued from low-valued houses.\n",
    "\n",
    "Thus, creating a predictive model, we could expect the longitude, latitude,\n",
    "and the median income to be useful features to help at predicting the median\n",
    "house values.\n",
    "\n",
    "If you are curious, we created a linear predictive model below and show the\n",
    "values of the coefficients obtained via cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import RidgeCV\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "alphas = np.logspace(-3, 1, num=30)\n",
    "model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))\n",
    "cv_results = cross_validate(\n",
    "    model,\n",
    "    california_housing.data,\n",
    "    california_housing.target,\n",
    "    return_estimator=True,\n",
    "    n_jobs=2,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "score = cv_results[\"test_score\"]\n",
    "print(f\"R2 score: {score.mean():.3f} \u00b1 {score.std():.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "coefs = pd.DataFrame(\n",
    "    [est[-1].coef_ for est in cv_results[\"estimator\"]],\n",
    "    columns=california_housing.feature_names,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "color = {\"whiskers\": \"black\", \"medians\": \"black\", \"caps\": \"black\"}\n",
    "coefs.plot.box(vert=False, color=color)\n",
    "plt.axvline(x=0, ymin=-1, ymax=1, color=\"black\", linestyle=\"--\")\n",
    "_ = plt.title(\"Coefficients of Ridge models\\n via cross-validation\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems that the three features that we earlier spotted are found important\n",
    "by this model. But be careful regarding interpreting these coefficients. We\n",
    "let you go into the module \"Interpretation\" to go in depth regarding such\n",
    "experiment."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}