{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) - Open Machine Learning Course\n", "\n", "Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/). All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#
Assignment #6 (demo). Solution\n", "##
Exploring OLS, Lasso and Random Forest in a regression task\n", " \n", "**Same assignment as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/a6-demo-linear-models-and-rf-for-regression) + [solution](https://www.kaggle.com/kashnitsky/a6-demo-regression-solution).** \n", " \n", "\n", "\n", "**Fill in the missing code and choose answers in [this](https://docs.google.com/forms/d/1aHyK58W6oQmNaqEfvpLTpo6Cb0-ntnvJ18rZcvclkvw/edit) web form.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.linear_model import Lasso, LassoCV, LinearRegression\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.model_selection import (GridSearchCV, cross_val_score,\n", " train_test_split)\n", "from sklearn.preprocessing import StandardScaler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**We are working with the UCI Wine Quality dataset (no need to download it – it's already in the course repo and in the Kaggle Dataset).**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(\"../../data/winequality-white.csv\", sep=\";\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Separate the target feature, split the data in a 7:3 proportion (30% forms the holdout set; use random_state=17), and preprocess the data with `StandardScaler`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = data[\"quality\"]\n", "X = data.drop(\"quality\", axis=1)\n", "\n", "X_train, X_holdout, y_train, 
y_holdout = train_test_split(\n", " X, y, test_size=0.3, random_state=17\n", ")\n", "scaler = StandardScaler()\n", "X_train_scaled = scaler.fit_transform(X_train)\n", "X_holdout_scaled = scaler.transform(X_holdout)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Train a simple linear regression model (Ordinary Least Squares).**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "linreg = LinearRegression()\n", "linreg.fit(X_train_scaled, y_train);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1: What are the mean squared errors of the model's predictions on the train and holdout sets?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", " \"Mean squared error (train): %.3f\"\n", " % mean_squared_error(y_train, linreg.predict(X_train_scaled))\n", ")\n", "print(\n", " \"Mean squared error (test): %.3f\"\n", " % mean_squared_error(y_holdout, linreg.predict(X_holdout_scaled))\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Sort features by their influence on the target feature (wine quality). Note that both large positive and large negative coefficients indicate a large influence on the target. 
It's handy to use `pandas.DataFrame` here.**\n", "\n", "**Question 2: Which feature does this linear regression model treat as the most influential on wine quality?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "linreg_coef = pd.DataFrame(\n", " {\"coef\": linreg.coef_, \"coef_abs\": np.abs(linreg.coef_)},\n", " index=data.columns.drop(\"quality\"),\n", ")\n", "linreg_coef.sort_values(by=\"coef_abs\", ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lasso regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Train a LASSO model with $\\alpha = 0.01$ (weak regularization) on the scaled data. Again, set random_state=17.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lasso1 = Lasso(alpha=0.01, random_state=17)\n", "lasso1.fit(X_train_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Which feature is the least informative in predicting wine quality, according to this LASSO model?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lasso1_coef = pd.DataFrame(\n", " {\"coef\": lasso1.coef_, \"coef_abs\": np.abs(lasso1.coef_)},\n", " index=data.columns.drop(\"quality\"),\n", ")\n", "lasso1_coef.sort_values(by=\"coef_abs\", ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Train LassoCV with random_state=17 to choose the best value of $\\alpha$ in 5-fold cross-validation.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "alphas = np.logspace(-6, 2, 200)\n", "lasso_cv = LassoCV(random_state=17, cv=5, alphas=alphas)\n", "lasso_cv.fit(X_train_scaled, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lasso_cv.alpha_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 3: Which feature is the 
least informative in predicting wine quality, according to the tuned LASSO model?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lasso_cv_coef = pd.DataFrame(\n", " {\"coef\": lasso_cv.coef_, \"coef_abs\": np.abs(lasso_cv.coef_)},\n", " index=data.columns.drop(\"quality\"),\n", ")\n", "lasso_cv_coef.sort_values(by=\"coef_abs\", ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 4: What are mean squared errors of tuned LASSO predictions on train and holdout sets?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", " \"Mean squared error (train): %.3f\"\n", " % mean_squared_error(y_train, lasso_cv.predict(X_train_scaled))\n", ")\n", "print(\n", " \"Mean squared error (test): %.3f\"\n", " % mean_squared_error(y_holdout, lasso_cv.predict(X_holdout_scaled))\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Random Forest" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Train a Random Forest with out-of-the-box parameters, setting only random_state to be 17.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forest = RandomForestRegressor(random_state=17)\n", "forest.fit(X_train_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 5: What are mean squared errors of RF model on the training set, in cross-validation (cross_val_score with scoring='neg_mean_squared_error' and other arguments left with default values) and on holdout set?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", " \"Mean squared error (train): %.3f\"\n", " % mean_squared_error(y_train, forest.predict(X_train_scaled))\n", ")\n", "print(\n", " \"Mean squared error (cv): %.3f\"\n", " % np.mean(\n", " np.abs(\n", " cross_val_score(\n", " forest, X_train_scaled, y_train, 
scoring=\"neg_mean_squared_error\"\n", " )\n", " )\n", " )\n", ")\n", "print(\n", " \"Mean squared error (test): %.3f\"\n", " % mean_squared_error(y_holdout, forest.predict(X_holdout_scaled))\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tune the `max_features` and `max_depth` hyperparameters with GridSearchCV and again check mean cross-validation MSE and MSE on holdout set.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "forest_params = {\"max_depth\": list(range(10, 25)), \"max_features\": list(range(6, 12))}\n", "\n", "locally_best_forest = GridSearchCV(\n", " RandomForestRegressor(n_jobs=-1, random_state=17),\n", " forest_params,\n", " scoring=\"neg_mean_squared_error\",\n", " n_jobs=-1,\n", " cv=5,\n", " verbose=True,\n", ")\n", "locally_best_forest.fit(X_train_scaled, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "locally_best_forest.best_params_, locally_best_forest.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 6: What are mean squared errors of tuned RF model in cross-validation (cross_val_score with scoring='neg_mean_squared_error' and other arguments left with default values) and on holdout set?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", " \"Mean squared error (cv): %.3f\"\n", " % np.mean(\n", " np.abs(\n", " cross_val_score(\n", " locally_best_forest.best_estimator_,\n", " X_train_scaled,\n", " y_train,\n", " scoring=\"neg_mean_squared_error\",\n", " )\n", " )\n", " )\n", ")\n", "print(\n", " \"Mean squared error (test): %.3f\"\n", " % mean_squared_error(y_holdout, locally_best_forest.predict(X_holdout_scaled))\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Output RF's feature importance. Again, it's nice to present it as a DataFrame.**
\n", "**Question 7: What is the most important feature, according to the Random Forest model?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rf_importance = pd.DataFrame(\n", " locally_best_forest.best_estimator_.feature_importances_,\n", " columns=[\"coef\"],\n", " index=data.columns[:-1],\n", ")\n", "rf_importance.sort_values(by=\"coef\", ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Draw conclusions about the performance of the three explored models in this particular prediction task.**\n", "\n", "The dependency of wine quality on the other features at hand is presumably non-linear, so Random Forest works better than the linear models in this task. " ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "name": "lesson8_part1_kmeans.ipynb" }, "nbformat": 4, "nbformat_minor": 1 }