{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocessing data methods\n", "\n", "- beyond cleaning and exploratory analysis\n", "- prepping data for modeling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Types of missing data:\n", "- MCAR - Missing Completely at Random (not many missing values)\n", "- MAR - Missing at Random (some missing values)\n", "- MNAR - Missing Not at Random (many missing values)\n", "\n", "##### finding missing data\n", "\n", " # Check how many values are missing in the category_desc column\n", " print(volunteer['category_desc'].isnull().sum())\n", "\n", "##### Visualizing Correlation of Missing Data\n", "import missingno as msno\n", "msno.heatmap(df) #visualizes correlation\n", "msno.dendrogram(df) #shows tree diagram of correlation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# compares missingness between missing and non-missing data\n", "def fill_dummy_values(df, scaling_factor=1):\n", " df_dummy = df.copy(deep=True)\n", " for col in df_dummy:\n", " col=df_dummy[col]\n", " col_null=col.isnull()\n", " num_nulls=col_null.sum()\n", " col_range=col.max()-col.min()\n", " dummy_values=(rand(num_nulls)-2)*scaling_factor*col_range+col.min()\n", " col[col_null]=dummy_values\n", "return df_dummy\n", "#can visualize results with a scatterplot\n", "# Fill dummy values in diabetes_dummy\n", "df_dummy = fill_dummy_values()\n", "\n", "# Sum the nullity of one column and another column\n", "nullity = df[col_name].isna() + diabetes[col_name].isna()\n", "\n", "# Create a scatter plot of Skin Fold and BMI \n", "diabetes_dummy.plot(x='col_name', y='col_name', kind='scatter', alpha=0.5, c= nullity, cmap='rainbow')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Numerical" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#visualizing missing data\n", "import missingno as msno\n", "msno.bar(df) #visualizes missing data as a bar chart (remember to plt.show)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Time Series" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#visualizing missing data\n", "import missingno as msno\n", "msno.matrix(df) #shows missing data and can parse through date frames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Deleting Missing Values\n", "\n", "1. pairwise - skips missing value (automatically happens in pandas)\n", "2. 
 { "cell_type": "markdown", "metadata": {}, "source": [ "### Imputing Missing Values\n", "replacing missing values with another value" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [
  "#### fillna()\n",
  "\n",
  "##### Types:\n",
  "- ffill - forward fill - replace NaN with the last observed value\n",
  "- bfill - backfill - replace NaN with the next observed value\n",
  "\n",
  "example:\n",
  "\n",
  "    df.fillna(method='ffill', inplace=True)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [
  "#### interpolate()\n",
  "**preferred method for time-series data**\n",
  "##### Types:\n",
  "- linear - draws a straight line between the last and next observed values and fills the gap with equally spaced points\n",
  "- quadratic - fits a parabola through the neighboring observations, so imputed values can overshoot below or above them\n",
  "- nearest - fills each NaN with the closest observed value (behaves like a combination of ffill and bfill)\n",
  "\n",
  "example:\n",
  "\n",
  "    df.interpolate(method='linear', inplace=True)\n",
  "\n",
  "A toy comparison of these fill methods is sketched below." ] },
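 { "cell_type": "markdown", "metadata": {}, "source": [ "A minimal, hedged sketch comparing the fill strategies above on a made-up series (the values are hypothetical; the same calls work on a datetime-indexed series). The 'quadratic' and 'nearest' methods rely on scipy being installed." ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
  "# Toy comparison of fillna / interpolate strategies (hypothetical data)\n",
  "import numpy as np\n",
  "import pandas as pd\n",
  "\n",
  "s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 9.0])\n",
  "\n",
  "compare = pd.DataFrame({\n",
  "    'original': s,\n",
  "    'ffill': s.ffill(),                              # carry the last observation forward\n",
  "    'bfill': s.bfill(),                              # pull the next observation backward\n",
  "    'linear': s.interpolate(method='linear'),        # straight line between observations\n",
  "    'quadratic': s.interpolate(method='quadratic'),  # parabola through the observations (needs scipy)\n",
  "    'nearest': s.interpolate(method='nearest'),      # closest observed value (needs scipy)\n",
  "})\n",
  "print(compare)" ] },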
 { "cell_type": "markdown", "metadata": {}, "source": [ "#### SimpleImputer" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
  "# Simple Imputer\n",
  "from sklearn.impute import SimpleImputer\n",
  "\n",
  "df_copy = df.copy(deep=True)  # make a copy for comparison to the original\n",
  "\n",
  "# strategy can be 'mean', 'median', 'most_frequent' (mode), or 'constant' (which also needs fill_value=...)\n",
  "si = SimpleImputer(strategy='mean')\n",
  "df_copy.iloc[:, :] = si.fit_transform(df_copy)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "#### FancyImpute" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
  "from fancyimpute import KNN, IterativeImputer\n",
  "# KNN uses the K nearest neighbors to replace missing values\n",
  "# IterativeImputer uses multiple regressions to replace missing values (most robust)\n",
  "\n",
  "ki = KNN()\n",
  "df_copy = df.copy(deep=True)  # make a copy for comparison to the original\n",
  "df_copy.iloc[:, :] = ki.fit_transform(df_copy)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "#### IterativeImputer" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
  "# explicitly require this experimental feature\n",
  "from sklearn.experimental import enable_iterative_imputer  # noqa\n",
  "# now you can import normally from sklearn.impute\n",
  "from sklearn.impute import IterativeImputer\n",
  "# import models to use for imputation\n",
  "from sklearn.ensemble import ExtraTreesRegressor\n",
  "from sklearn.linear_model import BayesianRidge\n",
  "\n",
  "imputer = IterativeImputer(BayesianRidge())  # insert the model to use and its arguments\n",
  "impute_data = pd.DataFrame(imputer.fit_transform(full_data), columns=full_data.columns)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "### Categorical Imputation\n", "If the data are strings, first encode them as numbers, then impute the missing values (e.g. with KNN or IterativeImputer) and round/convert back to categories." ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
  "# IterativeImputer for categorical data\n",
  "imputer = IterativeImputer(ExtraTreesRegressor())  # model to use and its arguments\n",
  "# impute the data and round back to valid (encoded) category values\n",
  "encode_data = pd.DataFrame(np.round(imputer.fit_transform(impute_data)), columns=impute_data.columns)" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
  "# Function that loops through each column, encodes strings to integers, imputes the missing\n",
  "# values with KNN, and writes the decoded columns back to the original dataframe\n",
  "import numpy as np\n",
  "from fancyimpute import KNN\n",
  "from sklearn.preprocessing import OrdinalEncoder\n",
  "\n",
  "# Create an empty dictionary ordinal_enc_dict\n",
  "ordinal_enc_dict = {}\n",
  "\n",
  "def cat_data_imputer(df):\n",
  "    for col_name in df:\n",
  "        # Create an ordinal encoder for this column\n",
  "        ordinal_enc_dict[col_name] = OrdinalEncoder()\n",
  "        col = df[col_name]\n",
  "\n",
  "        # Select the non-null values of the column\n",
  "        col_not_null = col[col.notnull()]\n",
  "        reshaped_vals = col_not_null.values.reshape(-1, 1)\n",
  "        encoded_vals = ordinal_enc_dict[col_name].fit_transform(reshaped_vals)\n",
  "\n",
  "        # Store the encoded values back in the column\n",
  "        df.loc[col.notnull(), col_name] = np.squeeze(encoded_vals)\n",
  "\n",
  "    # Create the KNN imputer\n",
  "    KNN_imputer = KNN()\n",
  "\n",
  "    # Impute and round the DataFrame\n",
  "    df.iloc[:, :] = np.round(KNN_imputer.fit_transform(df))\n",
  "\n",
  "    # Loop over the column names and reverse the encoding\n",
  "    for col_name in df:\n",
  "        # Reshape the data\n",
  "        reshaped = df[col_name].values.reshape(-1, 1)\n",
  "\n",
  "        # Perform the inverse transform of the ordinally encoded columns\n",
  "        df[col_name] = ordinal_enc_dict[col_name].inverse_transform(reshaped)\n",
  "\n",
  "    return df" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [
  "### Evaluating Imputations\n",
  "1. Fit a linear regression on each imputed dataset and compare the results with the original dataset\n",
  "2. Use KDE plots for each imputed dataset and compare their shape with the original dataset's distribution (sketched below)" ] },
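 { "cell_type": "markdown", "metadata": {}, "source": [ "A minimal, hedged sketch of evaluation option 2: overlay the KDE of an imputed column on the KDE of the observed values. The data here are synthetic; in practice you would plot a column of `df` against the same column of an imputed copy such as `df_copy`." ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
  "# Hedged sketch: KDE comparison of a column before and after imputation (synthetic data)\n",
  "import matplotlib.pyplot as plt\n",
  "import numpy as np\n",
  "import pandas as pd\n",
  "from sklearn.impute import SimpleImputer\n",
  "\n",
  "rng = np.random.default_rng(0)\n",
  "with_nans = pd.Series(rng.normal(50, 10, size=300))\n",
  "with_nans[rng.choice(300, size=60, replace=False)] = np.nan  # knock out 20% of the values\n",
  "\n",
  "mean_imputed = pd.Series(SimpleImputer(strategy='mean').fit_transform(with_nans.to_frame()).ravel())\n",
  "\n",
  "ax = with_nans.dropna().plot(kind='kde', label='observed values')\n",
  "mean_imputed.plot(kind='kde', ax=ax, label='mean imputed')  # a spike at the mean reveals the distortion\n",
  "ax.legend()\n",
  "plt.show()" ] },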
 { "cell_type": "markdown", "metadata": {}, "source": [
  "## Data Types\n",
  "\n",
  "#### check types\n",
  "\n",
  "- .dtypes\n",
  "- .select_dtypes(include=['int', 'float', 'object']) - allows you to select columns based on datatype(s)\n",
  "\n",
  "#### most common types\n",
  "- object (string/mixed types)\n",
  "- int64 (64-bit integer)\n",
  "- float64 (64-bit float)\n",
  "- datetime64 (or timedelta)\n",
  "\n",
  "#### converting columns\n",
  "\n",
  "    # Convert the hits column to type int\n",
  "    volunteer[\"hits\"] = volunteer[\"hits\"].astype(int)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [
  "## Stratified Sampling\n",
  "- preserves the class distribution of the labels when splitting into train and test sets\n",
  "\n",
  "        # Use stratified sampling to split up the dataset according to the volunteer_y dataset\n",
  "        X_train, X_test, y_train, y_test = train_test_split(volunteer_X, volunteer_y, stratify=volunteer_y)\n",
  "\n",
  "        # Print out the category_desc counts on the training y labels\n",
  "        print(y_train['category_desc'].value_counts())" ] },
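 { "cell_type": "markdown", "metadata": {}, "source": [ "A hedged sketch of what `stratify=` buys you: the label proportions in the train and test splits match the full dataset. The labels below are synthetic; with the notes' data you would pass `volunteer_y` instead." ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
  "# Hedged sketch: check that a stratified split preserves class proportions (synthetic labels)\n",
  "import numpy as np\n",
  "import pandas as pd\n",
  "from sklearn.model_selection import train_test_split\n",
  "\n",
  "rng = np.random.default_rng(1)\n",
  "X = pd.DataFrame({'feature': rng.normal(size=300)})\n",
  "y = pd.Series(rng.choice(['a', 'b', 'c'], size=300, p=[0.6, 0.3, 0.1]), name='label')\n",
  "\n",
  "X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)\n",
  "\n",
  "print(y.value_counts(normalize=True))        # proportions in the full dataset\n",
  "print(y_train.value_counts(normalize=True))  # ~same proportions in the training labels\n",
  "print(y_test.value_counts(normalize=True))   # ~same proportions in the test labels" ] },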
 { "cell_type": "markdown", "metadata": {}, "source": [
  "## Dealing with Categorical Variables\n",
  "\n",
  "#### Identifying categorical variables\n",
  "Categorical variables need to be treated in a particular manner, as you'll see later on, so you need to make sure to identify which variables are categorical. In some cases this is easy (e.g. when they are stored as strings); in other cases they are stored as numbers and the fact that they are categorical is not immediately apparent. A first step is to use `.info()` and `.describe()` to get a better sense of the data: `.info()` shows the data types (strings, integers, etc.), but even then continuous variables might have been imported as strings, so it's very important to really have a look at your data.\n",
  "\n",
  "#### Transforming categorical variables\n",
  "When you want to use categorical variables in regression models, they need to be transformed. There are two approaches to this:\n",
  "1. Perform label encoding\n",
  "2. Create dummy variables / one-hot encoding\n",
  "\n",
  "##### Label Encoding\n",
  "\n",
  "https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder\n",
  "\n",
  "    from sklearn.preprocessing import LabelEncoder\n",
  "    lb_make = LabelEncoder()\n",
  "\n",
  "    origin_encoded = lb_make.fit_transform(cat_origin)\n",
  "\n",
  "##### One-hot encoding\n",
  "\n",
  "https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html\n",
  "\n",
  "    # pandas\n",
  "    pd.get_dummies(cat_origin)\n",
  "\n",
  "https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html\n",
  "\n",
  "    # sklearn\n",
  "    from sklearn.preprocessing import LabelBinarizer\n",
  "    lb = LabelBinarizer()\n",
  "    origin_dummies = lb.fit_transform(cat_origin)\n",
  "    # you need to convert this back to a dataframe\n",
  "    origin_dum_df = pd.DataFrame(origin_dummies, columns=lb.classes_)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [
  "### Dealing with Numeric Variables\n",
  "\n",
  "#### Binarizing\n",
  "- creating a new column that is used as an either/or flag\n",
  "\n",
  "        # Create the Paid_Job column filled with zeros\n",
  "        so_survey_df['Paid_Job'] = 0\n",
  "\n",
  "        # Replace all the Paid_Job values where ConvertedSalary is > 0\n",
  "        so_survey_df.loc[so_survey_df['ConvertedSalary'] > 0, 'Paid_Job'] = 1\n",
  "\n",
  "        # Print the first five rows of the columns\n",
  "        print(so_survey_df[['Paid_Job', 'ConvertedSalary']].head())\n",
  "\n",
  "#### Binning\n",
  "- creating bins that group specific numeric ranges together\n",
  "\n",
  "        # Bin the continuous variable ConvertedSalary into 5 bins\n",
  "        so_survey_df['equal_binned'] = pd.cut(so_survey_df['ConvertedSalary'], bins=5)\n",
  "\n",
  "        # Print the first 5 rows of the equal_binned column\n",
  "        print(so_survey_df[['equal_binned', 'ConvertedSalary']].head())\n",
  "\n",
  "        # Import numpy\n",
  "        import numpy as np\n",
  "\n",
  "        # Specify the boundaries of the bins\n",
  "        bins = [-np.inf, 10000, 50000, 100000, 150000, np.inf]\n",
  "\n",
  "        # Bin labels\n",
  "        labels = ['Very low', 'Low', 'Medium', 'High', 'Very high']\n",
  "\n",
  "        # Bin the continuous variable ConvertedSalary using these boundaries\n",
  "        so_survey_df['boundary_binned'] = pd.cut(so_survey_df['ConvertedSalary'],\n",
  "                                                 labels=labels, bins=bins)\n",
  "\n",
  "        # Print the first 5 rows of the boundary_binned column\n",
  "        print(so_survey_df[['boundary_binned', 'ConvertedSalary']].head())" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [ "#### Encoding Function for a DataFrame" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
  "# instantiate the encoder to use\n",
  "import numpy as np\n",
  "from sklearn.preprocessing import OrdinalEncoder\n",
  "\n",
  "encoder = OrdinalEncoder()\n",
  "# create a list of categorical columns to iterate over\n",
  "cat_cols = ['embarked', 'class1', 'deck1', 'who', 'embark_town', 'sex', 'adult_male', 'alive', 'alone']\n",
  "\n",
  "def encode(data):\n",
  "    '''function to encode non-null data and replace it in the original data'''\n",
  "    # retain only non-null values\n",
  "    nonulls = np.array(data.dropna())\n",
  "    # reshape the data for encoding\n",
  "    impute_reshape = nonulls.reshape(-1, 1)\n",
  "    # encode the data\n",
  "    impute_ordinal = encoder.fit_transform(impute_reshape)\n",
  "    # assign the encoded values back to the non-null positions\n",
  "    data.loc[data.notnull()] = np.squeeze(impute_ordinal)\n",
  "    return data\n",
  "\n",
  "# iterate through each categorical column in the data\n",
  "for columns in cat_cols:\n",
  "    encode(data[columns])" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [
  "## Multicollinearity\n",
  "\n",
  "The idea behind regression is that you can change one variable and keep the others constant, so correlation between predictors is a problem: it indicates that changes in one predictor are associated with changes in another one as well. Because of this, the estimates of the coefficients can fluctuate a lot as a result of small changes in the model, and you may not be able to trust the p-values associated with correlated predictors.\n",
  "\n",
  "#### Checking for multicollinearity\n",
  "\n",
  "##### scatter matrix\n",
  "    pd.plotting.scatter_matrix(data, figsize=[11, 11]);\n",
  "\n",
  "##### correlation matrix\n",
  "    data.corr()\n",
  "\n",
  "##### heatmap\n",
  "    import seaborn as sns\n",
  "    sns.heatmap(data_pred.corr(), center=0);\n",
  "\n",
  "A small sketch that flags the most correlated predictor pairs follows below." ] },
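 { "cell_type": "markdown", "metadata": {}, "source": [ "A hedged sketch building on the correlation matrix above: list the predictor pairs whose absolute correlation exceeds a chosen cutoff (0.75 here is arbitrary). Synthetic columns are used so the cell runs on its own; with the notes' data you would start from `data.corr()`." ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
  "# Hedged sketch: flag predictor pairs with high absolute correlation (threshold is arbitrary)\n",
  "import numpy as np\n",
  "import pandas as pd\n",
  "\n",
  "rng = np.random.default_rng(2)\n",
  "x1 = rng.normal(size=200)\n",
  "data = pd.DataFrame({'x1': x1,\n",
  "                     'x2': 0.9 * x1 + rng.normal(scale=0.3, size=200),  # strongly related to x1\n",
  "                     'x3': rng.normal(size=200)})                       # independent noise\n",
  "\n",
  "corr = data.corr().abs()\n",
  "# keep only the upper triangle so each pair appears once\n",
  "upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))\n",
  "pair_corr = upper.stack()\n",
  "print(pair_corr[pair_corr > 0.75])  # pairs that may cause multicollinearity problems" ] },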
\n", "\n", " import numpy as np\n", " data_log= pd.DataFrame([])\n", " data_log[\"column\"] = np.log(data[\"column\"])\n", "\n", "##### Min-max scaling\n", "\n", "When performing min-max scaling, you can transform x to get the transformed $x'$ by using the formula:\n", "$$x' = \\dfrac{x - \\min(x)}{\\max(x)-\\min(x)}$$\n", "This way of scaling brings values between 0 and 1\n", "\n", " features_final[\"CRIM\"] = (logcrim-min(logcrim))/(max(logcrim)-min(logcrim))\n", "\n", "https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler\n", "\n", " scaler = MinMaxScaler()\n", " scaler.fit(data['column'])\n", " \n", "\n", "##### Standardization\n", "\n", "When \n", "$$x' = \\dfrac{x - \\bar x}{\\sigma}$$\n", "x' will have mean $\\mu = 0$ and $\\sigma = 1$\n", "Note that standardization does not make data $more$ normal, it will just change the mean and the standard error!\n", "\n", " features_final[\"DIS\"] = (logdis-np.mean(logdis))/np.sqrt(np.var(logdis))\n", "\n", "https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler\n", "\n", " scaler = StandardScaler()\n", " scaler.fit(data['column'])\n", "\n", "##### Mean normalization\n", "When performing mean normalization, you use the following formula:\n", "$$x' = \\dfrac{x - \\text{mean}(x)}{\\max(x)-\\min(x)}$$\n", "The distribution will have values between -1 and 1, and a mean of 0.\n", "\n", " features_final[\"LSTAT\"] = (loglstat-np.mean(loglstat))/(max(loglstat)-min(loglstat))\n", "\n", "##### Unit vector transformation\n", " When performing unit vector transformations, you can create a new variable x' with a range [0,1]:\n", "$$x'= \\dfrac{x}{{||x||}}$$\n", "Recall that the norm of x $||x||= \\sqrt{(x_1^2+x_2^2+...+x_n^2)}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Other normalization techniques" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import RobustScaler, QuantileTransformer, PowerTransformer\n", "\n", "ss = StandardScaler()\n", "rs = RobustScaler() #better for data with outliers\n", "qt = QuantileTransformer(output_distribution='normal',n_quantiles=1000) #best with uniform or bimodal distribution\n", "yj = PowerTransformer(method = 'yeo-johnson') #best with categorical and ordinal data \n", "bc = PowerTransformer(method = 'box-cox') #only works with positive values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Removing Outliers\n", "\n", "- can be removed using the mean and standard deviation > 3\n", "\n", " # Find the mean and standard dev\n", " std = so_numeric_df['ConvertedSalary'].std()\n", " mean = so_numeric_df['ConvertedSalary'].mean()\n", "\n", " # Calculate the cutoff\n", " cut_off = std * 3\n", " lower, upper = mean - cut_off, mean + cut_off\n", "\n", " # Trim the outliers\n", " trimmed_df = so_numeric_df[(so_numeric_df['ConvertedSalary'] < upper) \\\n", " & (so_numeric_df['ConvertedSalary'] > lower)]\n", "\n", " # The trimmed box plot\n", " trimmed_df[['ConvertedSalary']].boxplot()\n", " plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Feature Engineering\n", "- creation of new features based on existing features\n", "- insight between features\n", "- **dataset dependent**\n", "\n", "#### text data\n", "\n", "-vectorizing text to store values into a dataset\n", "\n", " # Take the title text\n", " title_text = volunteer['title']\n", "\n", " # 
 { "cell_type": "markdown", "metadata": {}, "source": [ "#### Other normalization techniques" ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
  "from sklearn.preprocessing import StandardScaler, RobustScaler, QuantileTransformer, PowerTransformer\n",
  "\n",
  "ss = StandardScaler()\n",
  "rs = RobustScaler()  # better for data with outliers\n",
  "qt = QuantileTransformer(output_distribution='normal', n_quantiles=1000)  # useful for uniform or bimodal distributions\n",
  "yj = PowerTransformer(method='yeo-johnson')  # power transform that also handles zero and negative values\n",
  "bc = PowerTransformer(method='box-cox')  # only works with strictly positive values" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [
  "## Removing Outliers\n",
  "\n",
  "- values more than 3 standard deviations from the mean can be removed\n",
  "\n",
  "        # Find the mean and standard dev\n",
  "        std = so_numeric_df['ConvertedSalary'].std()\n",
  "        mean = so_numeric_df['ConvertedSalary'].mean()\n",
  "\n",
  "        # Calculate the cutoff\n",
  "        cut_off = std * 3\n",
  "        lower, upper = mean - cut_off, mean + cut_off\n",
  "\n",
  "        # Trim the outliers\n",
  "        trimmed_df = so_numeric_df[(so_numeric_df['ConvertedSalary'] < upper) \\\n",
  "                                   & (so_numeric_df['ConvertedSalary'] > lower)]\n",
  "\n",
  "        # The trimmed box plot\n",
  "        trimmed_df[['ConvertedSalary']].boxplot()\n",
  "        plt.show()" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [
  "## Feature Engineering\n",
  "- creation of new features based on existing features\n",
  "- insight between features\n",
  "- **dataset dependent**\n",
  "\n",
  "#### text data\n",
  "\n",
  "- vectorizing text so it can be stored as numeric features\n",
  "\n",
  "        # Take the title text\n",
  "        title_text = volunteer['title']\n",
  "\n",
  "        # Create the vectorizer method\n",
  "        from sklearn.feature_extraction.text import TfidfVectorizer\n",
  "        tfidf_vec = TfidfVectorizer()\n",
  "\n",
  "        # Transform the text into tf-idf vectors\n",
  "        text_tfidf = tfidf_vec.fit_transform(title_text)\n",
  "\n",
  "        # Split the dataset according to the class distribution of category_desc\n",
  "        y = volunteer[\"category_desc\"]\n",
  "        X_train, X_test, y_train, y_test = train_test_split(text_tfidf.toarray(), y, stratify=y)\n",
  "\n",
  "        # Fit the model (nb is a naive Bayes classifier instantiated earlier) to the training data\n",
  "        nb.fit(X_train, y_train)\n",
  "\n",
  "        # Print out the model's accuracy\n",
  "        print(nb.score(X_test, y_test))\n",
  "\n",
  "- using regex to extract certain string characters\n",
  "\n",
  "        import re\n",
  "\n",
  "        # Write a pattern to extract numbers and decimals\n",
  "        def return_mileage(length):\n",
  "            pattern = re.compile(r\"\\d+\\.\\d+\")\n",
  "\n",
  "            # Search the text for matches\n",
  "            mile = re.match(pattern, length)\n",
  "\n",
  "            # If a value is returned, use group(0) to return the found value\n",
  "            if mile is not None:\n",
  "                return float(mile.group(0))\n",
  "\n",
  "        # Apply the function to the Length column and take a look at both columns\n",
  "        hiking[\"Length_num\"] = hiking[\"Length\"].apply(lambda row: return_mileage(row))\n",
  "        print(hiking[[\"Length\", \"Length_num\"]].head())\n",
  "\n",
  "#### dates\n",
  "- changing format based on relevance\n",
  "\n",
  "        # First, convert string column to date column\n",
  "        volunteer[\"start_date_converted\"] = pd.to_datetime(volunteer['start_date_date'])\n",
  "\n",
  "        # Extract just the month from the converted column\n",
  "        volunteer[\"start_date_month\"] = volunteer['start_date_converted'].apply(lambda row: row.month)\n",
  "\n",
  "        # Take a look at the converted and new month columns\n",
  "        print(volunteer[['start_date_converted', 'start_date_month']].head())\n",
  "\n",
  "#### aggregates\n",
  "- aggregating multiple columns into a single column\n",
  "\n",
  "        # Create a list of the columns to average\n",
  "        run_columns = ['run1', 'run2', 'run3', 'run4', 'run5']\n",
  "\n",
  "        # Use apply to create a mean column\n",
  "        running_times_5k[\"mean\"] = running_times_5k.apply(lambda row: row[run_columns].mean(), axis=1)\n",
  "\n",
  "        # Take a look at the results\n",
  "        print(running_times_5k)" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [
  "## Geospatial Data\n",
  "\n",
  "### convert start longitude/latitude and end longitude/latitude to distance\n",
  "\n",
  "    # constant values; if the end lat/long points need to change, update lat2 and lon2\n",
  "    lat2 = np.array(clean.Latitude)\n",
  "    lon2 = np.array(clean.Longitude)\n",
  "    latr = np.radians(lat2)\n",
  "    lonr = np.radians(lon2)\n",
  "\n",
  "    def distance(lat1, lon1):\n",
  "        '''Haversine distance equation for long/lat data, adapted from Stack Overflow user Michael0x2a.\n",
  "        Updated to a function that returns the mileage from (lat1, lon1) to every row of clean.'''\n",
  "        lat1 = np.radians(lat1)\n",
  "        lon1 = np.radians(lon1)\n",
  "        dlon = lonr - lon1\n",
  "        dlat = latr - lat1\n",
  "        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(latr) * np.sin(dlon/2)**2\n",
  "        c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))\n",
  "        # 6373.0 represents the earth's radius in kilometers\n",
  "        kilo = 6373.0 * c\n",
  "        miles = kilo * 0.62137119\n",
  "        return miles" ] },
 { "cell_type": "markdown", "metadata": {}, "source": [
  "## Interactions in Regression Models\n",
  "\n",
  "In statistics, an interaction is a particular property of three or more variables, where two or more variables interact in a non-additive manner when affecting a third variable. In other words, the two variables interact to have an effect that is more (or less) than the sum of their parts. Not accounting for interactions might lead to results that are wrong. You'll also notice that including them when they're needed will increase your $R^2$ value!\n",
  "\n",
  "#### Iterate through combinations of features to get the top three interactions\n",
  "\n",
  "    # regression (a model), crossvalidation (e.g. a KFold), baseline (the cross-validated R^2 without\n",
  "    # interactions), df and y are assumed to be defined earlier in the notebook\n",
  "    from itertools import combinations\n",
  "\n",
  "    feature_combinations = list(combinations(data.feature_names, 2))  # data is the raw sklearn dataset here\n",
  "    interactions = []\n",
  "    data = df.copy()\n",
  "    for comb in feature_combinations:\n",
  "        data[\"interaction\"] = data[comb[0]] * data[comb[1]]\n",
  "        score = np.mean(cross_val_score(regression, data, y, scoring=\"r2\", cv=crossvalidation))\n",
  "        if score > baseline:\n",
  "            interactions.append((comb[0], comb[1], round(score, 3)))\n",
  "\n",
  "    print(\"Top 3 interactions: %s\" % sorted(interactions, key=lambda inter: inter[2], reverse=True)[:3])\n",
  "\n",
  "A self-contained version of this search is sketched below.\n",
  "\n",
  "#### Feature engineer interactions into the dataframe\n",
  "\n",
  "    df_inter = df.copy()  # make a copy of the dataframe so the original is not affected\n",
  "    df_inter[\"RM_LSTAT\"] = df[\"RM\"] * df[\"LSTAT\"]  # combines the two features\n",
  "    df_inter[\"RM_TAX\"] = df[\"RM\"] * df[\"TAX\"]\n",
  "    df_inter[\"RM_RAD\"] = df[\"RM\"] * df[\"RAD\"]" ] },
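 { "cell_type": "markdown", "metadata": {}, "source": [ "A hedged, self-contained sketch of the interaction search above. scikit-learn's diabetes dataset, a plain LinearRegression, and a 5-fold KFold stand in for the notes' `data`, `regression`, `baseline`, and `crossvalidation` objects." ] },
 { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
  "# Hedged sketch: search feature pairs for interactions that beat a baseline cross-validated R^2\n",
  "# (sklearn's diabetes data and LinearRegression are stand-ins for the notes' own objects)\n",
  "from itertools import combinations\n",
  "import numpy as np\n",
  "import pandas as pd\n",
  "from sklearn.datasets import load_diabetes\n",
  "from sklearn.linear_model import LinearRegression\n",
  "from sklearn.model_selection import KFold, cross_val_score\n",
  "\n",
  "dataset = load_diabetes()\n",
  "X = pd.DataFrame(dataset.data, columns=dataset.feature_names)\n",
  "y = dataset.target\n",
  "\n",
  "regression = LinearRegression()\n",
  "crossvalidation = KFold(n_splits=5, shuffle=True, random_state=1)\n",
  "baseline = np.mean(cross_val_score(regression, X, y, scoring='r2', cv=crossvalidation))\n",
  "\n",
  "interactions = []\n",
  "for feat_a, feat_b in combinations(X.columns, 2):\n",
  "    data = X.copy()\n",
  "    data['interaction'] = data[feat_a] * data[feat_b]\n",
  "    score = np.mean(cross_val_score(regression, data, y, scoring='r2', cv=crossvalidation))\n",
  "    if score > baseline:\n",
  "        interactions.append((feat_a, feat_b, round(score, 3)))\n",
  "\n",
  "print('Baseline R^2: %.3f' % baseline)\n",
  "print('Top 3 interactions: %s' % sorted(interactions, key=lambda inter: inter[2], reverse=True)[:3])" ] },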
 { "cell_type": "markdown", "metadata": {}, "source": [
  "## Polynomials in Regression (curved relationship)\n",
  "\n",
  "When relationships between predictors and outcome are not linear and show some sort of curvature, polynomials can be used to generate better approximations. The idea is that you can transform your input variable by, e.g., squaring it:\n",
  "\n",
  "$\\hat y = \\hat \\beta_0 + \\hat \\beta_1x + \\hat \\beta_2 x^2$\n",
  "\n",
  "The use of polynomials is not restricted to quadratic relationships; you can explore cubic relationships and beyond as well! Imagine you want to go up to the power of 10: it would be quite annoying to transform your variable 9 times. Of course, scikit-learn has a built-in PolynomialFeatures option in the preprocessing module!\n",
  "\n",
  "#### scikit-learn polynomial degree selection with visual feedback and R^2 scores\n",
  "\n",
  "    # assumes X, y, a 1-D plotting grid x_plot with its 2-D column version X_plot, and a list of colors\n",
  "    # are defined earlier, along with PolynomialFeatures, LinearRegression, and r2_score imports\n",
  "    for index, degree in enumerate([2, 3, 4]):\n",
  "        poly = PolynomialFeatures(degree)\n",
  "        X_poly = poly.fit_transform(X)            # transform from the original X each iteration\n",
  "        X_plot_poly = poly.fit_transform(X_plot)\n",
  "        reg_poly = LinearRegression().fit(X_poly, y)\n",
  "        y_plot = reg_poly.predict(X_plot_poly)\n",
  "        plt.plot(x_plot, y_plot, color=colors[index], linewidth=2,\n",
  "                 label=\"degree %d\" % degree)\n",
  "        print(\"degree %d\" % degree, r2_score(y, reg_poly.predict(X_poly)))\n",
  "\n",
  "    plt.legend(loc='lower left')\n",
  "    plt.show()" ] }
 ], "metadata": { "kernelspec": { "display_name": "learn-env", "language": "python", "name": "learn-env" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "476.727px", "left": "498px", "top": "49.0902px", "width": "381.091px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }