{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## Open Machine Learning Course\n", "
Author: [Yury Kashnitsky](https://www.linkedin.com/in/festline/), Data Scientist at Mail.ru Group
\n", "Translated by [Anna Larionova](https://www.linkedin.com/in/anna-larionova-74434689/), DS @ Picturer, Data4, BNTouch
All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#
Assignment #4 (demo)\n", "##
Linear Regression as an optimization problem\n", " \n", "(no solution shared, part of [this](https://ru.coursera.org/specializations/machine-learning-data-analysis) Coursera specialization)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 1. Basic data analysis with Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this task we will use [SOCR](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights) data containing the heights and weights of 25 thousand teenagers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**[1]. If you have not installed the Seaborn library yet, run *conda install seaborn* in the terminal. (Seaborn may not be included in your Anaconda installation; it provides a convenient high-level interface for data visualization.)**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read the height and weight data into a Pandas DataFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = pd.read_csv(\"../../data/weights_heights.csv\", index_col=\"Index\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first thing to do after reading the data is to look at the first few records. This helps to catch data reading errors (for example, getting 1 column instead of 10, with 9 semicolons in its name). It also gives a closer look at the features and their nature (numerical, categorical, etc.).\n", "\n", "Then we should plot histograms of the feature distributions. This also helps to understand the nature of each feature (power-law distribution, normal, or something else). 
A histogram can also reveal outliers: values that differ markedly from the rest.\n", "It is convenient to plot histograms using the *plot* method of a Pandas DataFrame with the option *kind='hist'*.\n", "\n", "**Example.** Let's plot the histogram of the teenagers' height distribution. We call the *plot* method of the DataFrame *data* with the option *y='Height'* (the feature whose distribution we want to plot)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.plot(y=\"Height\", kind=\"hist\", color=\"red\", title=\"Height (inch.) distribution\");" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Options:\n", "\n", "- *y='Height'* - the feature whose distribution we want to plot\n", "- *kind='hist'* - the plot type is a histogram\n", "- *color='red'* - the color of the bars" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**[2]. Look at the first 5 rows using the *head* method of the Pandas DataFrame. Plot the histogram of the weight distribution using the *plot* method of the Pandas DataFrame. Make the histogram green and add a title.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the most effective methods of basic data analysis is plotting pairwise dependencies between features. We make $m \\times m$ plots (*m* is the number of features), with histograms of the feature distributions on the diagonal and scatter plots for each pair of features off the diagonal. We can do this using the *scatter_matrix* function from *pandas.plotting* or *pairplot* from the Seaborn library. \n", "\n", "To illustrate this method we add a third feature. Let's create the *body mass index* ([BMI](https://en.wikipedia.org/wiki/Body_mass_index)). To do this we use the *apply* method of the Pandas DataFrame and a Python lambda function." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def make_bmi(height_inch, weight_pound):\n", "    METER_TO_INCH, KILO_TO_POUND = 39.37, 2.20462\n", "    return (weight_pound / KILO_TO_POUND) / (height_inch / METER_TO_INCH) ** 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[\"BMI\"] = data.apply(lambda row: make_bmi(row[\"Height\"], row[\"Weight\"]), axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**[3]. Create a figure that shows the pairwise dependencies between the features 'Height', 'Weight' and 'BMI'. Use the *pairplot* function of the Seaborn library.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "During basic analysis you often need to study how a numerical feature depends on a categorical one (for example, how salary depends on an employee's sex). Box plots from the Seaborn library suit this case. A box plot is a compact way to show the statistics of a real-valued feature (median and quartiles) for each value of a categorical feature. It also helps to spot outliers: observations whose values differ markedly from the others." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**[4]. Create a new feature *weight_cat* in the DataFrame *data* that takes 3 values: 1 if the weight is less than 120 pounds, 3 if the weight is greater than or equal to 150 pounds, and 2 otherwise. Create a box plot showing how height depends on the weight category. Use the *boxplot* function of the Seaborn library and the Pandas *apply* method. 
Label the *y* axis \"Height\" and the *x* axis \"Weight category\".**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def weight_category(weight):\n", "    pass\n", "    # Your code here\n", "\n", "\n", "data[\"weight_cat\"] = data[\"Weight\"].apply(weight_category)\n", "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**[5]. Create a scatter plot of height against weight using the *plot* method of the Pandas DataFrame with the option *kind='scatter'*. Add a title to the figure.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 2. Squared Error Minimization" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the simplest case, the task of predicting a real value from other features (the regression task) can be solved by minimizing the squared error.\n", "\n", "**[6]. Write a function that computes the squared error of approximating the dependence of height $y$ on weight $x$ with the straight line $y = w_0 + w_1 x$, as a function of the two parameters $w_0$ and $w_1$:**\n", "$$error(w_0, w_1) = \\sum_{i=1}^n {(y_i - (w_0 + w_1 x_i))}^2 $$\n", "where $n$ is the number of observations in the dataset, and $y_i$ and $x_i$ are the height and weight of the $i$th person in the dataset. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the problem is to draw a straight line through the cloud of points that represent our observations in the space of the features \"Height\" and \"Weight\" so as to minimize the function from [6]. Let's start by drawing a few lines and checking how well they capture the dependence of height on weight.\n", "\n", "**[7]. 
On the plot from task [5] of Part 1, draw two straight lines corresponding to the parameter values $(w_0, w_1) = (60, 0.05)$ and $(w_0, w_1) = (50, 0.16)$. Use the *plot* function from *matplotlib.pyplot* and the *linspace* function from the NumPy library. Add titles to the axes and the plot.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Minimizing the squared error function is a fairly easy task because the function is convex. There are many optimization methods for this problem. Let's look at how the error function depends on one parameter (the slope of the straight line, $w_1$) when the other parameter (the intercept, $w_0$) is fixed.\n", "\n", "**[8]. Plot the error function from [6] as a function of the parameter $w_1$ with $w_0$ = 50. Add titles to the axes and the plot.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, with the intercept fixed at $w_0 = 50$, we can use an optimization method to find the slope of the straight line that best approximates the dependence of height on weight.\n", "\n", "**[9]. Using the *minimize_scalar* function from *scipy.optimize*, find the minimum of the function from [6] for values of the parameter $w_1$ in the range [-5, 5]. On the plot from task [5] of Part 1, draw the straight line corresponding to the parameter values ($w_0$, $w_1$) = (50, $w_1\\_opt$), where $w_1\\_opt$ is the optimal value of the parameter $w_1$ found by *minimize_scalar*.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you analyze multidimensional data, you often want to get an intuitive understanding of its nature through visualization. 
Data with more than 3 features cannot be plotted directly. Instead, it is better to choose 2 or 3 principal components and plot them in a plane or in 3D space.\n", "\n", "Let's see how to draw 3D figures in Python, using as an example the function $z(x,y) = \\sin(\\sqrt{x^2+y^2})$ for values of $x$ and $y$ from the interval [-5, 5] with step 0.25." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from mpl_toolkits.mplot3d import Axes3D" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a figure object of type matplotlib.figure.Figure (the picture) and a 3D axes object (Axes3DSubplot or Axes3D, depending on the Matplotlib version). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig = plt.figure()\n", "ax = fig.add_subplot(111, projection=\"3d\")  # create 3D axes\n", "\n", "# Create NumPy arrays with the grid points along the X and Y axes.\n", "# meshgrid builds coordinate matrices from the coordinate vectors.\n", "# Then compute the function Z(x, y) on the grid.\n", "X = np.arange(-5, 5, 0.25)\n", "Y = np.arange(-5, 5, 0.25)\n", "X, Y = np.meshgrid(X, Y)\n", "Z = np.sin(np.sqrt(X ** 2 + Y ** 2))\n", "\n", "# Finally, draw the surface with the *plot_surface* method\n", "# of the 3D axes object. Add titles to the axes.\n", "surf = ax.plot_surface(X, Y, Z)\n", "ax.set_xlabel(\"X\")\n", "ax.set_ylabel(\"Y\")\n", "ax.set_zlabel(\"Z\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**[10]. Plot the error function from [6] as a 3D surface over the parameters $w_0$ and $w_1$. 
Label the $x$ axis \"Intercept\", the $y$ axis \"Slope\", and the $z$ axis \"Error\".**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**[11]. Find the minimum of the function from [6] using the *minimize* function from scipy.optimize for values of the parameter $w_0$ in the range [-100, 100] and $w_1$ in the range [-5, 5]. Start from the point ($w_0$, $w_1$) = (0, 0). Use the L-BFGS-B optimization method (the *method* argument of *minimize*). On the plot from task [5] of Part 1, draw the straight line corresponding to the optimal values of the parameters $w_0$ and $w_1$ that you found. Add titles to the axes and the plot.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your code here" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 1 }