{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"../../img/ods_stickers.jpg\">\n",
    "    \n",
    "## [mlcourse.ai](https://mlcourse.ai) - Open Machine Learning Course\n",
    "\n",
    "Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/). All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# <center> Assignment #6 (demo). Solution\n",
    "## <center>  Exploring OLS, Lasso and Random Forest in a regression task\n",
    "    \n",
    "**Same assignment as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/a6-demo-linear-models-and-rf-for-regression) + [solution](https://www.kaggle.com/kashnitsky/a6-demo-regression-solution).**    \n",
    "    \n",
    "<img src='../../img/wine_quality.jpg' width=30%>\n",
    "\n",
    "**Fill in the missing code and choose answers in [this](https://docs.google.com/forms/d/1aHyK58W6oQmNaqEfvpLTpo6Cb0-ntnvJ18rZcvclkvw/edit) web form.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.ensemble import RandomForestRegressor\n",
    "from sklearn.linear_model import Lasso, LassoCV, LinearRegression\n",
    "from sklearn.metrics.regression import mean_squared_error\n",
    "from sklearn.model_selection import (GridSearchCV, cross_val_score,\n",
    "                                     train_test_split)\n",
    "from sklearn.preprocessing import StandardScaler"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**We are working with UCI Wine quality dataset (no need to download it – it's already there, in course repo and in Kaggle Dataset).**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv(\"../../data/winequality-white.csv\", sep=\";\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Separate the target feature, split data in 7:3 proportion (30% form a holdout set, use random_state=17), and preprocess data with `StandardScaler`.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y = data[\"quality\"]\n",
    "X = data.drop(\"quality\", axis=1)\n",
    "\n",
    "X_train, X_holdout, y_train, y_holdout = train_test_split(\n",
    "    X, y, test_size=0.3, random_state=17\n",
    ")\n",
    "scaler = StandardScaler()\n",
    "X_train_scaled = scaler.fit_transform(X_train)\n",
    "X_holdout_scaled = scaler.transform(X_holdout)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Train a simple linear regression model (Ordinary Least Squares).**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "linreg = LinearRegression()\n",
    "linreg.fit(X_train_scaled, y_train);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**<font color='red'>Question 1:</font> What are mean squared errors of model predictions on train and holdout sets?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    \"Mean squared error (train): %.3f\"\n",
    "    % mean_squared_error(y_train, linreg.predict(X_train_scaled))\n",
    ")\n",
    "print(\n",
    "    \"Mean squared error (test): %.3f\"\n",
    "    % mean_squared_error(y_holdout, linreg.predict(X_holdout_scaled))\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Sort features by their influence on the target feature (wine quality). Beware that both large positive and large negative coefficients mean large influence on target. It's handy to use `pandas.DataFrame` here.**\n",
    "\n",
    "**<font color='red'>Question 2:</font> Which feature this linear regression model treats as the most influential on wine quality?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "linreg_coef = pd.DataFrame(\n",
    "    {\"coef\": linreg.coef_, \"coef_abs\": np.abs(linreg.coef_)},\n",
    "    index=data.columns.drop(\"quality\"),\n",
    ")\n",
    "linreg_coef.sort_values(by=\"coef_abs\", ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lasso regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Train a LASSO model with $\\alpha = 0.01$ (weak regularization) and scaled data. Again, set random_state=17.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lasso1 = Lasso(alpha=0.01, random_state=17)\n",
    "lasso1.fit(X_train_scaled, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Which feature is the least informative in predicting wine quality, according to this LASSO model?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lasso1_coef = pd.DataFrame(\n",
    "    {\"coef\": lasso1.coef_, \"coef_abs\": np.abs(lasso1.coef_)},\n",
    "    index=data.columns.drop(\"quality\"),\n",
    ")\n",
    "lasso1_coef.sort_values(by=\"coef_abs\", ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Train LassoCV with random_state=17 to choose the best value of $\\alpha$ in 5-fold cross-validation.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "alphas = np.logspace(-6, 2, 200)\n",
    "lasso_cv = LassoCV(random_state=17, cv=5, alphas=alphas)\n",
    "lasso_cv.fit(X_train_scaled, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lasso_cv.alpha_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**<font color='red'>Question 3:</font> Which feature is the least informative in predicting wine quality, according to the tuned LASSO model?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lasso_cv_coef = pd.DataFrame(\n",
    "    {\"coef\": lasso_cv.coef_, \"coef_abs\": np.abs(lasso_cv.coef_)},\n",
    "    index=data.columns.drop(\"quality\"),\n",
    ")\n",
    "lasso_cv_coef.sort_values(by=\"coef_abs\", ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**<font color='red'>Question 4:</font> What are mean squared errors of tuned LASSO predictions on train and holdout sets?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    \"Mean squared error (train): %.3f\"\n",
    "    % mean_squared_error(y_train, lasso_cv.predict(X_train_scaled))\n",
    ")\n",
    "print(\n",
    "    \"Mean squared error (test): %.3f\"\n",
    "    % mean_squared_error(y_holdout, lasso_cv.predict(X_holdout_scaled))\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Random Forest"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Train a Random Forest with out-of-the-box parameters, setting only random_state to be 17.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "forest = RandomForestRegressor(random_state=17)\n",
    "forest.fit(X_train_scaled, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**<font color='red'>Question 5:</font> What are mean squared errors of RF model on the training set, in cross-validation (cross_val_score with scoring='neg_mean_squared_error' and other arguments left with default values) and on holdout set?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    \"Mean squared error (train): %.3f\"\n",
    "    % mean_squared_error(y_train, forest.predict(X_train_scaled))\n",
    ")\n",
    "print(\n",
    "    \"Mean squared error (cv): %.3f\"\n",
    "    % np.mean(\n",
    "        np.abs(\n",
    "            cross_val_score(\n",
    "                forest, X_train_scaled, y_train, scoring=\"neg_mean_squared_error\"\n",
    "            )\n",
    "        )\n",
    "    )\n",
    ")\n",
    "print(\n",
    "    \"Mean squared error (test): %.3f\"\n",
    "    % mean_squared_error(y_holdout, forest.predict(X_holdout_scaled))\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Tune the `max_features` and `max_depth` hyperparameters with GridSearchCV and again check mean cross-validation MSE and MSE on holdout set.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "forest_params = {\"max_depth\": list(range(10, 25)), \"max_features\": list(range(6, 12))}\n",
    "\n",
    "locally_best_forest = GridSearchCV(\n",
    "    RandomForestRegressor(n_jobs=-1, random_state=17),\n",
    "    forest_params,\n",
    "    scoring=\"neg_mean_squared_error\",\n",
    "    n_jobs=-1,\n",
    "    cv=5,\n",
    "    verbose=True,\n",
    ")\n",
    "locally_best_forest.fit(X_train_scaled, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "locally_best_forest.best_params_, locally_best_forest.best_score_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**<font color='red'>Question 6:</font> What are mean squared errors of tuned RF model in cross-validation (cross_val_score with scoring='neg_mean_squared_error' and other arguments left with default values) and on holdout set?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\n",
    "    \"Mean squared error (cv): %.3f\"\n",
    "    % np.mean(\n",
    "        np.abs(\n",
    "            cross_val_score(\n",
    "                locally_best_forest.best_estimator_,\n",
    "                X_train_scaled,\n",
    "                y_train,\n",
    "                scoring=\"neg_mean_squared_error\",\n",
    "            )\n",
    "        )\n",
    "    )\n",
    ")\n",
    "print(\n",
    "    \"Mean squared error (test): %.3f\"\n",
    "    % mean_squared_error(y_holdout, locally_best_forest.predict(X_holdout_scaled))\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Output RF's feature importance. Again, it's nice to present it as a DataFrame.**<br>\n",
    "**<font color='red'>Question 7:</font> What is the most important feature, according to the Random Forest model?**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rf_importance = pd.DataFrame(\n",
    "    locally_best_forest.best_estimator_.feature_importances_,\n",
    "    columns=[\"coef\"],\n",
    "    index=data.columns[:-1],\n",
    ")\n",
    "rf_importance.sort_values(by=\"coef\", ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Make conclusions about the performance of the explored 3 models in this particular prediction task.**\n",
    "\n",
    "The depency of wine quality on other features in hand is, presumable, non-linear. So Random Forest works better in this task. "
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "name": "lesson8_part1_kmeans.ipynb"
 },
 "nbformat": 4,
 "nbformat_minor": 1
}