{ "cells": [ { "cell_type": "raw", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-2cac5e96b3dd6263", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "---\n", "metadata: true\n", "section: \"Practical task\"\n", "goal: \"Practice the content that was covered in this chapter, in Python.\"\n", "time: \"60 min\"\n", "prerequisites: \"Chapter 1 - Numerical data\"\n", "level: \"Advanced\"\n", "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Practical task" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-07aa1282a4242acf", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "You can now practice everything covered in this chapter on a new dataset.\n", "\n", "The dataset is called \"Wine quality\" and was downloaded here: http://archive.ics.uci.edu/ml/datasets/Wine+Quality.\n", "\n", "It contains the characteristics of 6497 wines and their ``quality``, a grade between 0 and 10 given by 3 wine experts. On the website, the data is split into white and red wine, but we have merged the two for this exercise and added an additional attribute ``type``, containing the type of wine (red or white).\n", "\n", "To let you practice handling missing values, some values have been deleted or modified from the original dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-db9152c02e6a6a46", "locked": true, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "# Import the functions for machine learning\n", "\n", "%run 1-functions.ipynb" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-dc735286646a1d0b", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## 1. 
Import the dataset\n", "\n", "+ Import the dataset called ``wine.csv`` into the variable ``data`` __(2 points)__." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-26fdede1ae78cb6b", "locked": false, "schema_version": 1, "solution": true } }, "outputs": [], "source": [ "# import dataset\n", "\n", "import pandas as pd\n", "\n", "# Begin answer\n", "data = pd.read_csv('wine.csv')\n", "# End answer" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-88a57b85550b20eb", "locked": true, "points": 1, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert not data.empty\n", "### END HIDDEN TESTS" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-aabb6fbb6ec0272c", "locked": true, "points": 1, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert not data['fixed acidity'].empty\n", "### END HIDDEN TESTS" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-7759c198372eea8d", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## 2. Get to know the dataset\n", "\n", "Understand the dataset: what it looks like, which attributes it contains, and how they are distributed. Plot the distribution of the attributes.\n", "\n", "+ Place the head of the dataset in the variable ``head`` __(1 point)__.\n", "+ Place the description of the numerical attributes in the variable ``description`` __(1 point)__.\n", "+ How many unique values does the attribute ``type`` contain? Place this number in the variable ``unique_type`` __(1 point)__." 
] }, { "cell_type": "code", "execution_count": 34, "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-7c72618014610d14", "locked": false, "schema_version": 1, "solution": true } }, "outputs": [], "source": [ "# examine dataset\n", "\n", "# Begin answer\n", "head = data.head()\n", "description = data.describe()\n", "unique_type = len(data['type'].unique())\n", "# End answer" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-3b9c5213d0cbd728", "locked": true, "points": 1, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert head.equals(data.head())\n", "### END HIDDEN TESTS" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-3ea91df14bffdc06", "locked": true, "points": 1, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert description.equals(data.describe())\n", "### END HIDDEN TESTS" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-1e8906f68a281ea3", "locked": true, "points": 1, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert unique_type == len(data['type'].unique())\n", "### END HIDDEN TESTS" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-6f2dc5d93c7efab4", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## 3. Detect the missing values\n", "\n", "+ There is only one attribute which contains missing values. Place the number of missing values for this attribute in the variable ``missing_nb`` __(1 point)__.\n", "\n", "Think about how to deal with the missing values." 
] }, { "cell_type": "code", "execution_count": 40, "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-55048960dbe4093a", "locked": false, "schema_version": 1, "solution": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 6497 entries, 0 to 6496\n", "Data columns (total 13 columns):\n", "fixed acidity 6497 non-null float64\n", "volatile acidity 6497 non-null float64\n", "citric acid 6497 non-null float64\n", "residual sugar 6497 non-null float64\n", "chlorides 6497 non-null float64\n", "free sulfur dioxide 6497 non-null float64\n", "total sulfur dioxide 6497 non-null float64\n", "density 5235 non-null float64\n", "pH 6497 non-null float64\n", "sulphates 6497 non-null float64\n", "alcohol 6497 non-null float64\n", "quality 6497 non-null int64\n", "type 6497 non-null object\n", "dtypes: float64(11), int64(1), object(1)\n", "memory usage: 659.9+ KB\n" ] } ], "source": [ "# detect missing values\n", "\n", "# Begin answer\n", "missing_nb = data['density'].isna().sum()\n", "# End answer" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-dde860e67e8cf85b", "locked": true, "points": 1, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert missing_nb == data['density'].isna().sum()\n", "### END HIDDEN TESTS" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-c045b19897dd6ef0", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "+ The attribute containing the missing values can be recovered with a reasonable accuracy using the attributes ``residual sugar`` and ``alcohol``. Use this formula to recover the attribute: ``attribute = -0.0014 * alcohol + 0.0002 * residual sugar + 1.0082`` __(2 points)__." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-cebfe2b317f53083", "locked": false, "schema_version": 1, "solution": true } }, "outputs": [], "source": [ "# recover the missing values\n", "\n", "# Begin answer\n", "# The right-hand side is computed for every row; index alignment fills only the rows selected by the mask\n", "data.loc[data['density'].isna(), 'density'] = -0.0014 * data['alcohol'] + 0.0002 * data['residual sugar'] + 1.0082\n", "# End answer" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-4cd52ff887bf9ac6", "locked": true, "points": 2, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert data.loc[data['density'].isna()].empty\n", "assert data.iloc[6]['density'] == -0.0014 * data.iloc[6]['alcohol'] + 0.0002 * data.iloc[6]['residual sugar'] + 1.0082\n", "### END HIDDEN TESTS" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-3ec36a7efd0c17d4", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## 4. Build the regression model\n", "\n", "+ Use the dataset with all the attributes to predict the attribute ``quality`` with a regression model (you can use the functions introduced throughout the chapter).\n", "+ Place the right attributes in the variables ``x`` and ``y`` __(2 points)__.\n", "+ Place the MAE in the variable ``mae_regression`` __(1 point)__." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-2cba9ee11b8b0f1d", "locked": false, "schema_version": 1, "solution": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAE: 0.6105846153846155\n" ] } ], "source": [ "# predict 'quality' with regression\n", "\n", "from sklearn.metrics import mean_absolute_error\n", "\n", "# Begin answer\n", "x = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide',\n", " 'density', 'pH', 'sulphates', 'alcohol']\n", "y = ['quality']\n", "\n", "predictions, ytest = knn_regression(data, x, y)\n", "mae_regression = mean_absolute_error(predictions, ytest)\n", "print('MAE: ' + str(mae_regression))\n", "# End answer" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-828f5d428ebfee4e", "locked": true, "points": 1, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert x == ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide',\n", " 'density', 'pH', 'sulphates', 'alcohol']\n", "### END HIDDEN TESTS" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-ea568e14c06da526", "locked": true, "points": 1, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert y == ['quality']\n", "### END HIDDEN TESTS" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-dc4056afc81274cf", "locked": true, "points": 1, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert mae_regression > 0.6 and mae_regression < 0.7\n", "### END HIDDEN TESTS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. 
Visualize the predictions\n", "\n", "+ Plot the predictions against the true labels." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-968ecfa6e236ae54", "locked": false, "schema_version": 1, "solution": true } }, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'Prediction vs. true label')" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# plot predictions vs. true labels\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "# Begin answer\n", "plt.figure(figsize = (12, 8))\n", "\n", "# Flatten the column of predictions into a plain list\n", "pred = [element[0] for element in predictions]\n", "plt.plot(pred, ytest, 'x')\n", "\n", "# Diagonal of perfect predictions; a separate name avoids overwriting the attribute list x\n", "diagonal = np.linspace(3, 8, 10)\n", "plt.plot(diagonal, diagonal, color = 'black')\n", "\n", "plt.xlabel('Prediction')\n", "plt.ylabel('True label')\n", "plt.title('Prediction vs. true label')\n", "# End answer" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-6dfc842b21b8e30a", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## 6. Attribute selection\n", "\n", "+ Build other models with different attributes.\n", "+ Place the combination of attributes which gives the worst performance in the variable ``worst_comb``. It contains only one attribute __(1 point)__.\n", "+ Place the combination of attributes which gives the best performance in the variable ``best_comb``. It contains 5 attributes (including ``fixed acidity`` and ``volatile acidity``) __(1 point)__." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-d56cfc2fd1e3fe09", "locked": false, "schema_version": 1, "solution": true } }, "outputs": [], "source": [ "# find best and worst attribute combinations\n", "\n", "# Begin answer\n", "worst_comb = ['sulphates']\n", "best_comb = ['fixed acidity', 'volatile acidity', 'pH', 'sulphates', 'alcohol']\n", "# End answer" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-ed52d85a6f6c2e6c", "locked": true, "points": 1, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert worst_comb == ['sulphates']\n", "### END HIDDEN TESTS" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-38d8c692f587abe4", "locked": true, "points": 1, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert best_comb == ['fixed acidity', 'volatile acidity', 'pH', 'sulphates', 'alcohol']\n", "### END HIDDEN TESTS" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-5e32c18ea6502bdd", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## 7. Classification model\n", "\n", "+ Predict the attribute ``quality`` with a classification model, with all the attributes.\n", "+ Place the MAE in the variable ``mae_classification`` __(1 point)__.\n", "+ Plot the results.\n", "\n", "Think about which model is the best to use: regression or classification? Why? Look at the values of the variable ``quality``." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-a26417b6406b3c68", "locked": false, "schema_version": 1, "solution": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAE: 0.5884615384615385\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Anna\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:14: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " \n" ] } ], "source": [ "# predict 'quality' with classification\n", "\n", "from sklearn.metrics import mean_absolute_error\n", "\n", "# Begin answer\n", "x = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide',\n", " 'density', 'pH', 'sulphates', 'alcohol']\n", "y = ['quality']\n", "\n", "predictions, ytest = knn_classification(data, x, y)\n", "mae_classification = mean_absolute_error(predictions, ytest)\n", "print('MAE: ' + str(mae_classification))\n", "# End answer" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-66ed9f7945846479", "locked": true, "points": 1, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert mae_classification > 0.5 and mae_classification < 0.6\n", "### END HIDDEN TESTS" ] }, { "cell_type": "markdown", "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-0aa2aa02b138e423", "locked": true, "schema_version": 1, "solution": false } }, "source": [ "## 8. Split the dataset\n", "\n", "+ Split the dataset on a well chosen attribute, and predict the attribute ``quality`` with the new datasets and the chosen type of task (classification or regression).\n", "+ Place the two MAEs in the variables ``mae_1`` and ``mae_2`` __(2 points)__." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "nbgrader": { "grade": false, "grade_id": "cell-71f62bcc37745c91", "locked": false, "schema_version": 1, "solution": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MAE 1: 0.6642857142857143\n", "MAE 2: 0.49375\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Anna\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:14: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " \n", "C:\\Users\\Anna\\Anaconda3\\lib\\site-packages\\ipykernel_launcher.py:14: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().\n", " \n" ] } ], "source": [ "# split the dataset and make predictions\n", "\n", "from sklearn.metrics import mean_absolute_error\n", "\n", "# Begin answer\n", "x = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide',\n", " 'density', 'pH', 'sulphates', 'alcohol']\n", "y = ['quality']\n", "\n", "white_wine = data.loc[data['type'] == 'white']\n", "predictions, ytest = knn_classification(white_wine, x, y)\n", "mae_1 = mean_absolute_error(predictions, ytest)\n", "print('MAE 1: ' + str(mae_1))\n", "\n", "red_wine = data.loc[data['type'] == 'red']\n", "predictions, ytest = knn_classification(red_wine, x, y)\n", "mae_2 = mean_absolute_error(predictions, ytest)\n", "print('MAE 2: ' + str(mae_2))\n", "# End answer" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-8deeea4ce465d4a5", "locked": true, "points": 1, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert (mae_1 > 0.6 and mae_1 < 0.7) or (mae_2 > 0.6 and mae_2 < 0.7)\n", "### END HIDDEN TESTS" ] }, { "cell_type": "code", 
"execution_count": 19, "metadata": { "nbgrader": { "grade": true, "grade_id": "cell-3edbec12c2157f83", "locked": true, "points": 1, "schema_version": 1, "solution": false } }, "outputs": [], "source": [ "### BEGIN HIDDEN TESTS\n", "assert (mae_1 > 0.4 and mae_1 < 0.5) or (mae_2 > 0.4 and mae_2 < 0.5)\n", "### END HIDDEN TESTS" ] } ], "metadata": { "celltoolbar": "Create Assignment", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }