# Intro to Data Science
## Part III. - Data Transformation

### Table of contents

- ##### Data Transformation
    - <a href="#What-is-Data-Transformation?">Theory</a>
    - <a href="#1.-Numerical-features">Numerical features</a>
    - <a href="#2.-Nominal-Features">Nominal features</a>

- ##### Models:
    - <a href="#Intermission---Instance-Based-Classifiers">kNN</a>
    - <a href="#Intermission-II.---Model-of-the-Week">Decision tree</a>

---

## What is Data Transformation?
During data transformation, the goal is to prepare the data to be usable in the modeling steps. These transformations include normalization, standardization, text processing, generating complex features from basic ones, or any kind of data mapping.

- _"...a data transformation converts a set of data values from the data format of a source data system into the data format of a destination data system."_

_Data transformation can be divided into two steps:_
1. Data mapping maps data elements from the source data system to the destination data system and captures any transformation that must occur.
2. Code generation creates the actual transformation program.  
From: <a href="https://en.wikipedia.org/wiki/Data_transformation">Wikipedia</a>

### Why is it important?

Most models are sensitive to the quality and scale of their input data, so you must transform it into a more suitable format. Unfortunately, the data you start with is usually in terrible shape:
- It has missing values.
- It is full of outliers.
- The data is distorted by noise.
- The features are on different scales.
- The features are correlated, redundant, or uninformative.


### Tools

- **Scaling/Binarizing**: Adjust feature values to a standard range.
- **Normalizing/Standardizing**: Make data distributions more comparable.
- **Outlier Detection**: Identify and handle unusual values.
- **Filtering**: Remove irrelevant or noisy data.
- **Mathematical Transformations**: Apply log, square root, or polynomial transformations.
- **Representational Changes**: Convert categorical variables into numerical representations.
- etc.
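
As a small illustration of one of these tools, the sketch below applies a log transformation to a made-up, right-skewed feature. The values and variable names are assumptions for demonstration only; `np.log1p` is used so that zeros would stay finite.

```python
import numpy as np

# Hypothetical, heavily right-skewed feature (e.g. incomes); values are made up
income = np.array([1_000.0, 2_000.0, 3_000.0, 4_000.0, 100_000.0])

# log1p computes log(1 + x), which compresses large values much more than small ones
log_income = np.log1p(income)

# Ratio of the largest to the smallest value before and after the transformation
ratio_raw = income.max() / income.min()
ratio_log = log_income.max() / log_income.min()
print(ratio_raw, round(ratio_log, 2))
```

The largest value is 100 times the smallest on the raw scale, but less than twice the smallest after the transformation, so a single extreme value no longer dominates the feature.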

---

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

In [None]:
def add_missing(df, cols, inf=False, percent=.2):
    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()
    missing = np.inf if inf else np.nan
    # Pick roughly `percent` of the rows at random and blank out the given columns
    mask = np.random.rand(len(df)) < percent
    df.loc[mask, cols] = missing
    return df

def apply_scaler(df, cols, scaler):
    df = df.copy()
    df[cols] = scaler.transform(df[cols])
    return df
        
def gridplot(X, y=None, cols=None):
    if y is not None:
        data = pd.concat((X, y), axis=1)
        fig = sns.PairGrid(data, vars=cols, hue='Label')
    else:
        fig = sns.PairGrid(X, vars=cols)
    fig = fig.map_diag(plt.hist)
    fig = fig.map_offdiag(plt.scatter)
    fig = fig.add_legend()
    return fig

## Intermission - Instance-Based Classifiers

### K-Nearest Neighbour Classification

#### Philosophy
<img src="./pics/kacsa.png" width=400 align="left">

<br style="clear:left;"/>


If it looks like a duck and quacks like a duck, it's probably a duck.

#### Taxonomy & Definition
>_"In machine learning, instance-based learning (sometimes called memory-based learning) is a family of learning algorithms that, instead of performing explicit generalization, compares new problem instances with instances seen in training, which have been stored in memory."_ - <a href="https://en.wikipedia.org/wiki/Instance-based_learning">Wiki</a>


#### Algorithm
The basic <a href="http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification">`kNN`</a> algorithm stores training points with their labels without any coefficient fitting. During classification, a simple majority vote is taken to determine the class label.

It is called `k` nearest neighbour for a reason: `k` is the number of closest data points considered in the majority voting. When a new data point arrives, the algorithm calculates the `k` nearest data points from the training set. Then, these `k` labels are used to determine the new entry's label. It is possible to provide a `weights` parameter as well. In this case, the labels will be weighted according to the weighting function. The most common weighting method is distance-based; the closer the point with a label, the more weight it gets.

There are different strategies to set the <a href="https://www.analyticsvidhya.com/blog/2014/10/introduction-k-neighbours-algorithm-clustering/">ideal `k` values</a>. `k = 1` is a special case where the new data entry gets the closest training point's label.
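
The voting procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than scikit-learn's implementation: the function name `knn_predict`, the toy points, and the inverse-distance weighting are assumptions made for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_known, y_known, x_new, k=3, weighted=False):
    """Predict the label of x_new from its k nearest stored points."""
    # Euclidean distance from x_new to every stored training point
    dists = np.linalg.norm(X_known - x_new, axis=1)
    # Indices of the k closest points
    nearest = np.argsort(dists)[:k]
    votes = Counter()
    for i in nearest:
        # Uniform vote, or inverse-distance weight (the epsilon avoids division by zero)
        votes[y_known[i]] += 1.0 / (dists[i] + 1e-12) if weighted else 1.0
    return votes.most_common(1)[0][0]

# Made-up training points: three "ducks" near the origin, two "geese" further away
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_toy = np.array(['duck', 'duck', 'duck', 'goose', 'goose'])

query = np.array([3.5, 3.5])
uniform_label = knn_predict(X_toy, y_toy, query, k=5)
weighted_label = knn_predict(X_toy, y_toy, query, k=5, weighted=True)
print(uniform_label, weighted_label)
```

With `k=5` every stored point votes, so the uniform vote follows the larger class, while the distance-weighted vote favours the two closer points. In scikit-learn the corresponding switch is `KNeighborsClassifier(weights='distance')`.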

#### Shortcomings

- If the data is high-dimensional, the <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a> affects the algorithm's performance.
- There is no clear method to determine the best distance metric; it always depends on the data.
- For best performance, the data should be preprocessed and transformed.
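
To see why preprocessing matters for a distance-based method, consider the sketch below. The three applicants and the assumed feature ranges are made up for illustration; the point is only that the feature with the largest numeric scale dominates the Euclidean distance.

```python
import numpy as np

# Hypothetical applicants: [income, loan term in years]; values are made up
a = np.array([5_000.0, 30.0])
b = np.array([5_100.0, 10.0])   # similar income, very different term
c = np.array([6_000.0, 30.0])   # different income, same term

# On the raw scale, income dominates the distance, so b looks closer to a than c does
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)
print(raw_ab, raw_ac)

# After dividing each feature by its (assumed) range, both features contribute
ranges = np.array([10_000.0, 30.0])
scaled_ab = np.linalg.norm((a - b) / ranges)
scaled_ac = np.linalg.norm((a - c) / ranges)
print(scaled_ab, scaled_ac)
```

The nearest neighbour of `a` changes from `b` to `c` once the features are brought to a comparable scale, which is exactly the kind of effect the scaling sections of this notebook explore.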

In [None]:
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline(steps=[('knn', KNeighborsClassifier())])

### Reading the loan dataset

In [None]:
df = pd.read_csv('./data/loan.csv', index_col=0)

# Intentionally left 'Loan_ID' out
nominal_cols = ['Gender', 'Married', 'Education', 'Dependents', 
                'Self_Employed', 'Property_Area', 'Credit_History']
numerical_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
                  'Loan_Amount_Term']
target_col = 'Label'

# Convert numerical columns to float
df[numerical_cols] = df[numerical_cols].astype('float64')
# Transform target values early for plotting reasons
target_encoder = LabelEncoder()
df['Label'] = target_encoder.fit_transform(df['Target'].values)

# Split dataframe into features and label, also to train and test
X = df[numerical_cols + nominal_cols].copy()
y = df[target_col].copy()
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=1/3,
                                                    random_state=42)

---

## 1. Numerical features


### 1.A. Dealing with unusual values

#### 1.A.1. <a href="http://pandas.pydata.org/pandas-docs/stable/missing_data.html">missing values</a>

In [None]:
missing = add_missing(df, numerical_cols)
missing.describe()

- dropping NAs

In [None]:
dropped = missing.dropna(axis=0)
dropped.shape

- fill NAs

In [None]:
filled = missing.fillna(value=0)
filled.describe()

- interpolate NAs

In [None]:
interpolated = missing.interpolate(method='nearest')
interpolated.describe()

#### 1.A.2. <a href="http://pandas.pydata.org/pandas-docs/stable/missing_data.html#values-considered-missing">infinite values</a>

In [None]:
infinite = add_missing(df, numerical_cols, inf=True)
infinite.describe()

In [None]:
# 'mode.use_inf_as_na' is deprecated since pandas 2.1; convert infinities to NaN explicitly
print(infinite.replace([np.inf, -np.inf], np.nan).dropna(axis=0).shape)

In [None]:
# 'mode.use_inf_as_na' is deprecated since pandas 2.1; convert infinities to NaN explicitly
print(infinite.replace([np.inf, -np.inf], np.nan).fillna(value=0).describe())

In [None]:
# 'mode.use_inf_as_na' is deprecated since pandas 2.1; convert infinities to NaN explicitly
print(infinite.replace([np.inf, -np.inf], np.nan).interpolate().describe())

#### 1.A.3. Using <a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html">SimpleImputer</a>

With SimpleImputer there are several imputing strategies available through its `strategy` parameter:
- use `'constant'` and specify the `fill_value` parameter to impute specific values
- use `'mean'`, `'median'` or `'most_frequent'` to impute a statistic computed from each column's observed values
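
The strategies listed above can be compared on a tiny, made-up column. The numbers below are assumptions chosen so that one large value pulls the mean away from the median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value and one large value
col = np.array([[1.0], [2.0], [2.0], [np.nan], [15.0]])

imputed = {}
for strategy in ('mean', 'median', 'most_frequent'):
    filled = SimpleImputer(strategy=strategy).fit_transform(col)
    # Row 3 is the one that was missing
    imputed[strategy] = filled[3, 0]

filled = SimpleImputer(strategy='constant', fill_value=-1.0).fit_transform(col)
imputed['constant'] = filled[3, 0]
print(imputed)
```

Here the mean of the observed values is 5.0 while the median and the most frequent value are both 2.0, so the choice of strategy matters when the column contains outliers.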

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit_transform(missing[numerical_cols])

#### 1.A.4. Using <a href="https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer">IterativeImputer</a>

With IterativeImputer, each feature with missing values is modeled as a function of the other features, and the fitted model predicts the missing entries:
- use the `estimator` parameter to specify the model used for learning and predicting the missing values
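
A minimal sketch of the `estimator` parameter, assuming a made-up dataset where the second column is exactly twice the first. `LinearRegression` is chosen here only because it makes the result easy to check by hand; it is not necessarily a good default for real data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before the next import)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

# Toy data: second column = 2 * first column, with one value missing
X_toy = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan], [5.0, 10.0]])

toy_imputer = IterativeImputer(estimator=LinearRegression(), random_state=0)
X_filled = toy_imputer.fit_transform(X_toy)
print(X_filled[3, 1])
```

Because the regression learns the linear relationship from the complete rows, the missing entry is filled with a value close to 8, whereas a mean-based imputer would have used 5.5.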

In [None]:
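# IterativeImputer is still experimental and must be enabled explicitly before importing it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer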

In [None]:
imputer = IterativeImputer(random_state=42)
imputer.fit_transform(missing[numerical_cols])

### 1.B. <a href="http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling">different scales</a>

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
pipe.fit(X_train[numerical_cols], y_train)
accuracy_score(y_test.values, pipe.predict(X_test[numerical_cols]))

In [None]:
gridplot(X_train, y_train);

#### 1.B.1 <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html">Min-Max Scaling</a>

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
minmax = MinMaxScaler()
scaled_pipe = Pipeline(steps=[
    ('minmax', minmax),
    ('knn', KNeighborsClassifier())
])

scaled_pipe.fit(X_train[numerical_cols], y_train)
accuracy_score(y_test.values, scaled_pipe.predict(X_test[numerical_cols]))

In [None]:
gridplot(apply_scaler(X_train, numerical_cols, minmax), y_train);

#### Exercise: Try the same experiment with logistic regression
- Create a pipe containing a logistic regressor and measure its accuracy
- Create another pipe with `MinMaxScaler` and a logistic regressor. Compare the results and try to explain the difference.

### 1.C <a href="http://scikit-learn.org/stable/modules/preprocessing.html#normalization">unnormalized data</a>

#### 1.C.1 <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html">Standard scaling</a>

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
standard = StandardScaler()
normalized_pipe = Pipeline(steps=[
    ('standard', standard),
    ('knn', KNeighborsClassifier())
])

normalized_pipe.fit(X_train[numerical_cols], y_train)
accuracy_score(y_test.values, normalized_pipe.predict(X_test[numerical_cols]))

In [None]:
gridplot(apply_scaler(X_train, numerical_cols, standard), y_train);

#### Exercise: Try the same experiment with logistic regression
- Create a pipe containing a logistic regressor and measure its accuracy
- Create another pipe with `StandardScaler` and a logistic regressor. Compare the results and try to explain the difference.

#### 1.C.2 <a href="http://scikit-learn.org/stable/modules/preprocessing.html#scaling-data-with-outliers">outliers</a>

<img src="pics/outlier.gif" align="left" width="400">
<br style="clear:left;"/>from: <a href="https://flowingdata.com/2014/09/02/out-liar/">flowingdata.com</a>

#### Scaling data with outliers: <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html">RobustScaler</a>
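
To illustrate how a single outlier affects the scalers used in this notebook, the sketch below applies all three to the same made-up feature. The numbers are assumptions chosen for demonstration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy feature with one extreme value (made-up numbers)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    # Each scaler is fitted on the same column, outlier included
    scaled[type(scaler).__name__] = scaler.fit_transform(x).ravel()

for name, values in scaled.items():
    print(name, np.round(values, 2))
```

With min-max and standard scaling the four ordinary values are squeezed into a narrow band because the outlier determines the range and inflates the standard deviation. `RobustScaler` centres on the median and divides by the interquartile range, so the ordinary values keep a usable spread and only the outlier ends up far away.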

In [None]:
from sklearn.preprocessing import RobustScaler

In [None]:
robust_pipe = Pipeline(steps=[
    ('robust', RobustScaler()),
    ('knn', KNeighborsClassifier())
])

robust_pipe.fit(X_train[numerical_cols], y_train)
accuracy_score(y_test.values, robust_pipe.predict(X_test[numerical_cols]))

#### Exercise: Try the same experiment with logistic regression
- Create a pipe containing a logistic regressor and measure its accuracy
- Create another pipe with `RobustScaler` and a logistic regressor. Compare the results and try to explain the difference.

### 1.D <a href="http://scikit-learn.org/stable/modules/preprocessing.html#feature-binarization">Binarization</a>

In [None]:
from sklearn.preprocessing import Binarizer

In [None]:
binary_pipe = Pipeline([
    ('bin', Binarizer(threshold=101.0)),
    ('knn', KNeighborsClassifier())
])

binary_pipe.fit(X_train[numerical_cols], y_train)
accuracy_score(y_test.values, binary_pipe.predict(X_test[numerical_cols]))

### 1.E Correlated features

In [None]:
sns.heatmap(df.select_dtypes(exclude=["object"]).corr(), robust=True, annot=True, cbar=True)

Not now. More about this topic in the next issue of DS101. Cough-cough-<a href="http://scikit-learn.org/stable/modules/preprocessing.html#scaling-data-with-outliers" style="color: black; text-decoration: none; cursor: default;">PCA</a>-cough.

---

## Intermission II. - Model of the Week

### <a href="https://scikit-learn.org/stable/modules/tree.html#tree">Decision Trees</a>

Decision trees are a type of supervised machine learning algorithm that can predict both categorical and continuous values (in which case they are called *regression trees*). **The algorithm essentially divides the training population based on attribute values and assigns a prediction to each category.**

For a **categorical** target variable, the assigned prediction is **the mode of the target variable values** in the given subpopulation. For a **continuous** target variable, the assigned prediction is  **the mean of the values**.  

A familiar example from the <a href="https://en.wikipedia.org/wiki/Decision_tree_learning">Wikipedia page on decision trees</a>:  
<img src="pics/CART_tree_titanic_survivors.png">

from: <a href="https://en.wikipedia.org/wiki/Decision_tree_learning">wiki</a>  

A tree showing survival of passengers on the Titanic ("sibsp" represents the number of spouses or siblings aboard). The figures under the leaves show the probability of an outcome and the percentage of observations in the leaf.

#### The Trick: Where to Split?
  
The key to decision trees is determining where to split the population. **The goal is to make these subpopulations as homogeneous in the target variable as possible.** If an attribute is completely uncorrelated with the target variable, the decision tree will not split the population based on that variable.

Various _metrics_ are used to decide the best splits. Most measure an **"impurity"** of the parent node and choose a split that maximally reduces impurity. However, these calculations are usually local, meaning **they do not guarantee the best global outcome**: each split is chosen greedily, without looking ahead to later splits.
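
As a sketch of how such a metric drives the split, the code below uses Gini impurity (one of several possible impurity measures) to find the best threshold on a single feature. The helper names `gini` and `best_threshold` and the toy data are assumptions for this example; real implementations are considerably more optimized.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_threshold(x, y):
    """Return (threshold, impurity decrease) of the best single split on feature x."""
    parent = gini(y)
    best = (None, 0.0)
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values
    for t in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        # Impurity of the children, weighted by their share of the samples
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if parent - child > best[1]:
            best = (t, parent - child)
    return best

# Made-up data: younger passengers survived, older ones did not
age = np.array([5, 8, 12, 30, 40, 55])
survived = np.array([1, 1, 1, 0, 0, 0])

threshold, decrease = best_threshold(age, survived)
print(threshold, decrease)
```

The parent node has impurity 0.5 (two equally frequent classes), and the split at age 21 produces two pure children, so the impurity decrease equals the full 0.5.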

##### How Large/Complex Should the Tree Be? When to Stop Splitting?

One of the biggest weaknesses of decision/regression trees is **overfitting**, so controlling tree complexity is crucial. Some common ways to limit tree growth include:
- Setting a minimum sample number required for a split
- Setting a minimum sample number required for a node to be a leaf
- Setting a threshold for impurity decrease, below which no splits are made
- Setting a maximum tree depth
- Limiting the number of variables used for splitting
- Restricting the maximum number of leaves

Another way to combat overfitting is **tree pruning**. This involves allowing the tree to grow initially, then systematically removing splits that contribute little to model accuracy, either from the top or bottom of the tree.
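
The growth limits and pruning described above correspond to constructor parameters of scikit-learn's `DecisionTreeClassifier`. The sketch below uses a synthetic dataset, and the parameter values are arbitrary assumptions for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to have something to fit
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Unrestricted tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

limited_tree = DecisionTreeClassifier(
    min_samples_split=20,         # minimum samples required to split a node
    min_samples_leaf=5,           # minimum samples required in each leaf
    min_impurity_decrease=0.001,  # skip splits that reduce impurity by less than this
    max_depth=4,                  # maximum depth of the tree
    max_features=4,               # number of features considered at each split
    max_leaf_nodes=10,            # maximum number of leaves
    random_state=0,
).fit(X_demo, y_demo)

# Cost-complexity pruning: a larger ccp_alpha removes more of the fully grown tree
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_demo, y_demo)

for name, model in [('full', full_tree), ('limited', limited_tree), ('pruned', pruned_tree)]:
    print(name, model.get_depth(), model.get_n_leaves())
```

In practice these values are usually tuned with cross-validation rather than set by hand.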

#### Why Use Decision Trees Instead of Logistic Regression?

Decision trees shine when **there is a non-linear relationship between attributes and the target variable**. Despite handling non-linearity well, decision trees remain highly interpretable (as shown in the Titanic example above).

#### What’s Better Than One Tree? Multiple Trees!

To tackle overfitting, early techniques involved randomly sampling the training set and building multiple trees on different samples. Predictions were then made using the "votes" of all trees (taking the mean for regression or the mode for classification). This is known as **bagging**.

**Random forests** improve on this by randomly excluding some attributes from consideration at each node. This prevents a few dominant attributes from appearing in most splits, which would otherwise cause trees to be highly correlated, thereby reducing the effectiveness of bagging.
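
A hedged sketch comparing a single tree, bagged trees, and a random forest on synthetic data. The dataset and the number of estimators are assumptions, and the scores will differ on real data, so no particular ranking should be read into the output.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a few informative features among many
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=0)

models = {
    'single tree': DecisionTreeClassifier(random_state=0),
    # Bagging: each tree is trained on a bootstrap sample of the rows
    'bagging': BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # Random forest: bootstrap samples plus a random feature subset at each split
    'random forest': RandomForestClassifier(n_estimators=50, max_features='sqrt', random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model
scores = {name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
          for name, model in models.items()}
for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```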

An important takeaway:
> When in doubt, use <s>brute force</s> random forest.

For a more visually appealing explanation, visit <a href="http://www.r2d3.us/visual-intro-to-machine-learning-part-1/">this site</a>!

#### Using <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">Decision tree</a> in scikit-learn:

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
tree = DecisionTreeClassifier()

---

## 2. Nominal Features

### Replacing values

In [None]:
replace_map = {'Dependents': {'0': 0, '1': 1, '2': 2, '3+': 4}}
X_train.replace(replace_map).head()

### <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html">Ordinal encoding</a>

In [None]:
from sklearn.preprocessing import OrdinalEncoder

In [None]:
ordinal_pipe = Pipeline([
    ('ordinalencoder', OrdinalEncoder()),
    ('knn', KNeighborsClassifier())
])

ordinal_pipe.fit(X_train[nominal_cols], y_train.values)
accuracy_score(y_test.values, ordinal_pipe.predict(X_test[nominal_cols]))

#### Exercise: Try the same experiment with a decision tree
- Create a pipe containing a decision tree and measure its accuracy with the encoded data

### <a href="http://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features">One-hot encoding</a>

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
onehot_pipe = Pipeline(steps=[
    ('hot', OneHotEncoder(categories='auto', sparse_output=False)),
    ('knn', KNeighborsClassifier())
])

onehot_pipe.fit(X_train[nominal_cols], y_train)
accuracy_score(y_test.values, onehot_pipe.predict(X_test[nominal_cols]))

#### Exercise: Try the same experiment with a decision tree
- Create a pipe containing a decision tree and measure its accuracy

## 3. Putting it all together

#### Using the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer">ColumnTransformer</a>

Use different transformers for different columns. Define the transformation pipeline for each column or group of columns, and use the `remainder` parameter to define what happens with the remaining columns: `'drop'` drops them, `'passthrough'` keeps them without transformation.

Example:
```python
ColumnTransformer([
    ('name_of_transformation', TransformatorPipeline(), ['list', 'of', 'columns']),
    ('name_of_second_transformation', DifferentTransformatorPipeline(), ['list', 'of', 'other', 'columns'])
], remainder='drop')
```

In [None]:
from sklearn.compose import ColumnTransformer

In [None]:
# Filling missing values and normalizing numerical cols
numerical_pipe = Pipeline([
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('standardscaler', StandardScaler())
])
# Filling missing values and one-hot encode nominal columns
nominal_pipe = Pipeline([
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('ohe', OneHotEncoder())
])

transformer = ColumnTransformer([
    ('binarizer', Binarizer(threshold=101.0), ['LoanAmount']),  # Create binarized loan amount column
    ('numerical_pipe', numerical_pipe, numerical_cols),
    ('nominal_pipe', nominal_pipe, nominal_cols),
], remainder='drop')


pipeline = Pipeline([
    ('preprocess', transformer),
    ('knn', pipe)
])
pipeline

In [None]:
pipeline.fit(X_train, y_train)
y_hat = pipeline.predict(X_test)
accuracy_score(y_test, y_hat)

#### Exercise: Try out the different classifiers with the preprocessed data!