{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Building a Linear Regression Model in Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Packages and Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Python, a handy way to build a linear regression model is to use the popular machine learning package scikit-learn (`sklearn`). The package contains many built-in models, from the basic regression models covered in this post to the more complex models and methods covered in later posts. For details, see the [official documentation for `LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import packages\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.linear_model import LinearRegression" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Load the dataset from meuse.csv\n", "data = pd.read_csv(\"meuse.csv\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"  <thead>\n",
"    <tr><th></th><th>x</th><th>y</th><th>cadmium</th><th>copper</th><th>lead</th><th>zinc</th><th>elev</th><th>dist</th><th>om</th><th>ffreq</th><th>soil</th><th>lime</th><th>landuse</th><th>dist.m</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>181072</td><td>333611</td><td>11.7</td><td>85</td><td>299</td><td>1022</td><td>7.909</td><td>0.001358</td><td>13.6</td><td>1</td><td>1</td><td>1</td><td>Ah</td><td>50</td></tr>\n",
"    <tr><th>1</th><td>181025</td><td>333558</td><td>8.6</td><td>81</td><td>277</td><td>1141</td><td>6.983</td><td>0.012224</td><td>14.0</td><td>1</td><td>1</td><td>1</td><td>Ah</td><td>30</td></tr>\n",
"    <tr><th>2</th><td>181165</td><td>333537</td><td>6.5</td><td>68</td><td>199</td><td>640</td><td>7.800</td><td>0.103029</td><td>13.0</td><td>1</td><td>1</td><td>1</td><td>Ah</td><td>150</td></tr>\n",
"    <tr><th>3</th><td>181298</td><td>333484</td><td>2.6</td><td>81</td><td>116</td><td>257</td><td>7.655</td><td>0.190094</td><td>8.0</td><td>1</td><td>2</td><td>0</td><td>Ga</td><td>270</td></tr>\n",
"    <tr><th>4</th><td>181307</td><td>333330</td><td>2.8</td><td>48</td><td>117</td><td>269</td><td>7.480</td><td>0.277090</td><td>8.7</td><td>1</td><td>2</td><td>0</td><td>Ah</td><td>380</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n", "
" ], "text/plain": [ " x y cadmium copper lead zinc elev dist om ffreq \\\n", "0 181072 333611 11.7 85 299 1022 7.909 0.001358 13.6 1 \n", "1 181025 333558 8.6 81 277 1141 6.983 0.012224 14.0 1 \n", "2 181165 333537 6.5 68 199 640 7.800 0.103029 13.0 1 \n", "3 181298 333484 2.6 81 116 257 7.655 0.190094 8.0 1 \n", "4 181307 333330 2.8 48 117 269 7.480 0.277090 8.7 1 \n", "\n", " soil lime landuse dist.m \n", "0 1 1 Ah 50 \n", "1 1 1 Ah 30 \n", "2 1 1 Ah 150 \n", "3 2 0 Ga 270 \n", "4 2 0 Ah 380 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# View the first five rows\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build a Model using Simple Linear Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Linear regression is one of the most traditional ways of examining the relationship between predictors and a response variable. As we discussed in [a previous post](https://oscrproject.wixsite.com/website/post/purpose-of-machine-learning-and-modeling-for-digital-humanities-and-social-sciences) about the general idea of modeling and machine learning, our purpose may be to infer the relationships among variables.\n", "\n", "Goal: examine the relationship between the topsoil lead concentration (the `lead` column, on the y-axis) and the topsoil cadmium concentration (the `cadmium` column, on the x-axis).\n", "\n", "Using scikit-learn, we have:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regression_model = LinearRegression()\n", "# scikit-learn expects a 2-D feature array, so reshape cadmium to (n_samples, 1)\n", "regression_model.fit(data.cadmium.values.reshape((-1, 1)), data.lead)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we have to reshape the `cadmium` column into a two-dimensional array with one column and as many rows as there are observations; passing `-1` lets NumPy infer the number of rows. Please refer to our next several notes to see how to visualize and analyze the simple linear regression model." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }