# Building a Linear Regression Model in Python

### Import Packages and Data

In Python, a handy way to build a linear regression model is to use the popular machine learning package `scikit-learn`. This package contains many built-in models, from the basic regression models in this post to the more complex models and methods covered in later posts. You may want to check the [official guide](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

In [1]:
# import packages
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

In [2]:
# import dataset
data = pd.read_csv("meuse.csv")

In [3]:
# View data
data.head()

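| Unnamed: 0 | x | y | cadmium | copper | lead | zinc | elev | dist | om | ffreq | soil | lime | landuse | dist.m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 181072 | 333611 | 11.7 | 85 | 299 | 1022 | 7.909 | 0.001358 | 13.6 | 1 | 1 | 1 | Ah | 50 |
| 1 | 181025 | 333558 | 8.6 | 81 | 277 | 1141 | 6.983 | 0.012224 | 14.0 | 1 | 1 | 1 | Ah | 30 |
| 2 | 181165 | 333537 | 6.5 | 68 | 199 | 640 | 7.8 | 0.103029 | 13.0 | 1 | 1 | 1 | Ah | 150 |
| 3 | 181298 | 333484 | 2.6 | 81 | 116 | 257 | 7.655 | 0.190094 | 8.0 | 1 | 2 | 0 | Ga | 270 |
| 4 | 181307 | 333330 | 2.8 | 48 | 117 | 269 | 7.48 | 0.27709 | 8.7 | 1 | 2 | 0 | Ah | 380 |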


### Build a Model using Simple Linear Regression

Linear regression is one of the most traditional ways of examining the relationships between predictors and response variables. As we discussed in [a previous post](https://oscrproject.wixsite.com/website/post/purpose-of-machine-learning-and-modeling-for-digital-humanities-and-social-sciences) about the general idea of modeling and machine learning, one purpose of modeling is to make inferences about the relationships among variables.

Goal: examine the relationship between the topsoil lead concentration (`lead` column, as y-axis) and the topsoil cadmium concentration (`cadmium` column, as x-axis). 
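Written out, the goal above corresponds to the usual simple linear regression model. The symbols $\beta_0$ (intercept), $\beta_1$ (slope) and $\varepsilon_i$ (error term) are notation introduced here for illustration and do not appear in the dataset:

$$
\text{lead}_i = \beta_0 + \beta_1 \, \text{cadmium}_i + \varepsilon_i
$$

Ordinary least squares, which is what `LinearRegression` fits, chooses the estimates that minimize the sum of squared residuals:

$$
(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left( \text{lead}_i - \beta_0 - \beta_1 \, \text{cadmium}_i \right)^2
$$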

Using the `scikit-learn` package, we have:

In [8]:
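regression_model = LinearRegression()
# fit the model object created above (not a new, unnamed one);
# .values gives a NumPy array, which can be reshaped to two dimensions
regression_model.fit(data.cadmium.values.reshape((-1, 1)), data.lead)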

  


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Please note that we have to reshape the `cadmium` column to be two-dimensional, i.e. a single column with as many rows as there are observations; the `-1` in `(-1, 1)` tells NumPy to infer the number of rows. Please refer to our next several notes for how to visualize and analyze the simple linear regression model.
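As a minimal, self-contained sketch of the reshape step, the snippet below uses only the five rows displayed by `data.head()` rather than the full `meuse.csv` file, so its fitted numbers are illustrative and will differ from a fit on the whole dataset. The variable names (`X`, `model`, `slope`, `intercept`, `r_squared`) are chosen here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# First five rows of the table shown earlier (illustrative subset only).
cadmium = np.array([11.7, 8.6, 6.5, 2.6, 2.8])
lead = np.array([299, 277, 199, 116, 117])

# scikit-learn expects predictors with shape (n_samples, n_features),
# so the one-dimensional array of shape (5,) is reshaped to (5, 1).
X = cadmium.reshape((-1, 1))

# fit() returns the fitted estimator itself, so it can be assigned directly.
model = LinearRegression().fit(X, lead)

slope = model.coef_[0]            # estimated change in lead per unit of cadmium
intercept = model.intercept_      # estimated lead when cadmium is 0
r_squared = model.score(X, lead)  # coefficient of determination on these rows

print(X.shape, slope, intercept, r_squared)
```

After fitting, `coef_` and `intercept_` hold the estimated slope and intercept, which is where the relationship between the two concentrations can be read off.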