{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## [mlcourse.ai](https://mlcourse.ai) - Open Machine Learning Course\n", "\n", "Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/). All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Assignment #6 (demo)\n", "##
 Exploring OLS, Lasso and Random Forest in a regression task\n", "\n", "**Same assignment as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/a6-demo-linear-models-and-rf-for-regression) + [solution](https://www.kaggle.com/kashnitsky/a6-demo-regression-solution).**\n", "\n", "**Fill in the missing code and choose answers in [this](https://docs.google.com/forms/d/1aHyK58W6oQmNaqEfvpLTpo6Cb0-ntnvJ18rZcvclkvw/edit) web form.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")\n", "import numpy as np\n", "import pandas as pd\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.linear_model import Lasso, LassoCV, LinearRegression\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split\n", "from sklearn.preprocessing import StandardScaler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**We are working with the UCI Wine Quality dataset (no need to download it; it is already in the course repo and available as a Kaggle Dataset).**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(\"../../data/winequality-white.csv\", sep=\";\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Separate the target feature, split the data in a 7:3 proportion (30% forms the holdout set; use `random_state=17`), and preprocess the data with `StandardScaler`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# y = None # your code here\n", "\n", "# X_train, X_holdout, y_train, y_holdout = train_test_split # your code 
here\n", "# scaler = StandardScaler()\n", "# X_train_scaled = scaler.fit_transform # your code here\n", "# X_holdout_scaled = scaler.transform # your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Train a simple linear regression model (Ordinary Least Squares).**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# linreg = # your code here\n", "# linreg.fit # your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1: What are the mean squared errors of the model's predictions on the train and holdout sets?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# print(\"Mean squared error (train): %.3f\" % # your code here\n", "# print(\"Mean squared error (test): %.3f\" % # your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Sort the features by their influence on the target feature (wine quality). Keep in mind that both large positive and large negative coefficients indicate a large influence on the target, so sort by absolute value. It's handy to use `pandas.DataFrame` here.**\n", "\n", "**Question 2: Which feature does this linear regression model treat as the most influential on wine quality?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# linreg_coef = pd.DataFrame # your code here\n", "# linreg_coef.sort_values # your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lasso regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Train a LASSO model with $\\alpha = 0.01$ (weak regularization) on the scaled data. 
Again, set `random_state=17`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# lasso1 = Lasso # your code here\n", "# lasso1.fit # your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Which feature is the least informative in predicting wine quality, according to this LASSO model?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# lasso1_coef = pd.DataFrame # your code here\n", "# lasso1_coef.sort_values # your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Train `LassoCV` with `random_state=17` to choose the best value of $\\alpha$ using 5-fold cross-validation.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# alphas = np.logspace(-6, 2, 200)\n", "# lasso_cv = LassoCV # your code here\n", "# lasso_cv.fit # your code here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# lasso_cv.alpha_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 3: Which feature is the least informative in predicting wine quality, according to the tuned LASSO model?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# lasso_cv_coef = pd.DataFrame # your code here\n", "# lasso_cv_coef.sort_values # your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 4: What are the mean squared errors of the tuned LASSO model's predictions on the train and holdout sets?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# print(\"Mean squared error (train): %.3f\" % # your code here\n", "# print(\"Mean squared error (test): %.3f\" % # your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Random Forest" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Train a Random Forest with out-of-the-box 
parameters, setting only `random_state=17`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# forest = RandomForestRegressor # your code here\n", "# forest.fit # your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 5: What are the mean squared errors of the RF model on the training set, in cross-validation (`cross_val_score` with `scoring='neg_mean_squared_error'` and all other arguments left at their default values), and on the holdout set?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# print(\"Mean squared error (train): %.3f\" % # your code here\n", "# print(\"Mean squared error (cv): %.3f\" % # your code here\n", "# print(\"Mean squared error (test): %.3f\" % # your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tune the `max_depth`, `min_samples_leaf`, and `max_features` hyperparameters with `GridSearchCV`, then check the mean cross-validation MSE and the MSE on the holdout set again.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# forest_params = {'max_depth': list(range(10, 25)),\n", "#                  'min_samples_leaf': list(range(1, 8)),\n", "#                  'max_features': list(range(6, 12))}\n", "\n", "# locally_best_forest = GridSearchCV # your code here\n", "# locally_best_forest.fit # your code here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# locally_best_forest.best_params_, locally_best_forest.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 6: What are the mean squared errors of the tuned RF model in cross-validation (`cross_val_score` with `scoring='neg_mean_squared_error'` and all other arguments left at their default values) and on the holdout set?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# print(\"Mean squared error (cv): %.3f\" % # your code here\n", "# print(\"Mean squared error (test): %.3f\" 
% # your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Output the RF's feature importances. Again, it's nice to present them as a DataFrame.**\n", "\n", "**Question 7: What is the most important feature, according to the Random Forest model?**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# rf_importance = pd.DataFrame # your code here\n", "# rf_importance.sort_values # your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Draw conclusions about the performance of the three explored models in this particular prediction task.**" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "name": "lesson8_part1_kmeans.ipynb" }, "nbformat": 4, "nbformat_minor": 1 }