{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear regression with scikit-learn (OLS)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import pandas as pd\n", "import numpy as np\n", "\n", "from sklearn import linear_model, datasets, metrics, model_selection\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We load the Boston house-prices dataset. `X` holds the features and `y` is the target variable `MEDV` (median value of owner-occupied homes in $1000s)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".. _boston_dataset:\n", "\n", "Boston house prices dataset\n", "---------------------------\n", "\n", "**Data Set Characteristics:** \n", "\n", " :Number of Instances: 506 \n", "\n", " :Number of Attributes: 13 numeric/categorical predictive. 
Median Value (attribute 14) is usually the target.\n", "\n", " :Attribute Information (in order):\n", " - CRIM per capita crime rate by town\n", " - ZN proportion of residential land zoned for lots over 25,000 sq.ft.\n", " - INDUS proportion of non-retail business acres per town\n", " - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)\n", " - NOX nitric oxides concentration (parts per 10 million)\n", " - RM average number of rooms per dwelling\n", " - AGE proportion of owner-occupied units built prior to 1940\n", " - DIS weighted distances to five Boston employment centres\n", " - RAD index of accessibility to radial highways\n", " - TAX full-value property-tax rate per $10,000\n", " - PTRATIO pupil-teacher ratio by town\n", " - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town\n", " - LSTAT % lower status of the population\n", " - MEDV Median value of owner-occupied homes in $1000's\n", "\n", " :Missing Attribute Values: None\n", "\n", " :Creator: Harrison, D. and Rubinfeld, D.L.\n", "\n", "This is a copy of UCI ML housing dataset.\n", "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/\n", "\n", "\n", "This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.\n", "\n", "The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic\n", "prices and the demand for clean air', J. Environ. Economics & Management,\n", "vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics\n", "...', Wiley, 1980. N.B. Various transformations are used in the table on\n", "pages 244-261 of the latter.\n", "\n", "The Boston house-price data has been used in many machine learning papers that address regression\n", "problems. \n", " \n", ".. topic:: References\n", "\n", " - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.\n", " - Quinlan,R. (1993). 
Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.\n", "\n" ] } ], "source": [ "boston = datasets.load_boston()\n", "print(boston.DESCR)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [] }, "outputs": [], "source": [ "X = pd.DataFrame(boston.data, columns=boston.feature_names)\n", "y = boston.target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's describe our features to see what kinds of values they hold. Note that `CHAS` is a dummy variable." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
|  | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 |