{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Intro to Data Science\n", "## Part III. - Data Transformation\n", "\n", "### Table of contents\n", "\n", "- ##### Data Transformation\n", "    - Theory\n", "    - Numerical features\n", "    - Nominal features\n", "\n", "- ##### Models:\n", "    - kNN\n", "    - Decision tree\n", "\n", "---\n", "\n", "## What is Data Transformation?\n", "During data transformation, the goal is to prepare the data to be usable in the modeling steps. These transformations include normalization, standardization, text processing, generating complex features from basic ones, or any kind of data mapping.\n", "\n", "- _\"...a data transformation converts a set of data values from the data format of a source data system into the data format of a destination data system.\"_\n", "\n", "_Data transformation can be divided into two steps:_\n", "1. Data mapping maps data elements from the source data system to the destination data system and captures any transformation that must occur.\n", "2. Code generation creates the actual transformation program.\n", "\n", "From: Wikipedia\n", "\n", "### Why is it important?\n", "\n", "Most models are sensitive to data, so you must transform it into a more suitable format. Unfortunately, the data you start with is usually in terrible shape:\n", "- It has missing values.\n", "- It is full of outliers.\n", "- The data is distorted by noise.\n", "- The features are on different scales.\n", "- The features are correlated, redundant, or uninformative.\n", "\n", "\n", "### Tools\n", "\n", "- **Scaling/Binarizing**: Adjust feature values to a standard range.\n", "- **Normalizing/Standardizing**: Make data distributions more comparable.\n", "- **Outlier Detection**: Identify and handle unusual values.\n", "- **Filtering**: Remove irrelevant or noisy data.\n", "- **Mathematical Transformations**: Apply log, square root, or polynomial transformations.\n", "- **Representational Changes**: Convert categorical variables into numerical representations.\n", "- etc.\n", "\n", "---" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import LabelEncoder" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def add_missing(df, cols, inf=False, percent=.2):\n", "    # Return a copy of df with ~percent of the rows in cols set to NaN (or inf)\n", "    df = df.copy()  # work on a copy so the caller's dataframe stays untouched\n", "    nrows, _ = df.shape\n", "    missing = np.nan if not inf else np.inf\n", "    df['tmp'] = np.random.rand(nrows)\n", "    df.loc[df.tmp < percent, cols] = missing\n", "    df = df.drop(columns=['tmp'])\n", "    return df\n", "\n", "\n", "def apply_scaler(df, cols, scaler):\n", "    # Return a copy of df with cols transformed by an already fitted scaler\n", "    df = df.copy()\n", "    df[cols] = scaler.transform(df[cols])\n", "    return df\n", "\n", "\n", "def gridplot(X, y=None, cols=None):\n", "    # Pairwise plots of cols, coloured by the 'Label' column when y is given\n", "    if y is not None:\n", "        data = pd.concat((X, y), axis=1)\n", "        fig = sns.PairGrid(data, vars=cols, hue='Label')\n", "    else:\n", "        fig = sns.PairGrid(X, vars=cols)\n", "    fig = fig.map_diag(plt.hist)\n", "    fig = fig.map_offdiag(plt.scatter)\n", "    fig = fig.add_legend()\n", "    return fig" ] },
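{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick, optional sanity check of the helpers above on a tiny synthetic frame (the frame and its column names are made up purely for illustration): `add_missing` should blank out roughly 20% of the selected column and leave the original frame untouched." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy frame, only to exercise the helper defined above\n", "toy = pd.DataFrame({'a': np.arange(10, dtype=float), 'b': np.arange(10, dtype=float)})\n", "toy_missing = add_missing(toy, ['a'])\n", "toy_missing.isna().sum()" ] },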
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Intermission - Instance-Based Classifiers\n", "\n", "### K-Nearest Neighbour Classification\n", "\n", "#### Philosophy\n", "\n", "If it looks like a duck and quacks like a duck, it's probably a duck.\n", "\n", "#### Taxonomy & Definition\n", ">_\"In machine learning, instance-based learning (sometimes called memory-based learning) is a family of learning algorithms that, instead of performing explicit generalization, compares new problem instances with instances seen in training, which have been stored in memory.\"_ - Wikipedia\n", "\n", "#### Algorithm\n", "The basic `kNN` algorithm stores the training points with their labels without fitting any coefficients. During classification, a simple majority vote among the stored neighbours determines the class label.\n", "\n", "It is called `k` nearest neighbours for a reason: `k` is the number of closest data points considered in the majority vote. When a new data point arrives, the algorithm finds the `k` nearest points in the training set, and their labels determine the new entry's label. A `weights` parameter can also be provided; in that case, the labels are weighted according to the weighting function. The most common weighting is distance-based: the closer a labelled point is, the more weight it gets.\n", "\n", "There are different strategies for choosing a good `k`. `k = 1` is a special case in which the new data entry simply gets the label of the closest training point.\n", "\n", "#### Shortcomings\n", "\n", "- If the data is high-dimensional, the curse of dimensionality hurts the algorithm's performance.\n", "- There is no universal rule for choosing the distance metric; it always depends on the data.\n", "- For best performance, the data should be preprocessed and transformed." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "\n", "pipe = Pipeline(steps=[('knn', KNeighborsClassifier())])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Reading the loan dataset" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('./data/loan.csv', index_col=0)\n", "\n", "# Intentionally left 'Loan_ID' out\n", "nominal_cols = ['Gender', 'Married', 'Education', 'Dependents',\n", "                'Self_Employed', 'Property_Area', 'Credit_History']\n", "numerical_cols = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',\n", "                  'Loan_Amount_Term']\n", "target_col = 'Label'\n", "\n", "# Convert numerical columns to float\n", "df[numerical_cols] = df[numerical_cols].astype('float64')\n", "# Transform target values early for plotting reasons\n", "target_encoder = LabelEncoder()\n", "df['Label'] = target_encoder.fit_transform(df['Target'].values)\n", "\n", "# Split dataframe into features and label, also into train and test\n", "X = df[numerical_cols + nominal_cols].copy()\n", "y = df[target_col].copy()\n", "X_train, X_test, y_train, y_test = train_test_split(X, y,\n", "                                                    test_size=1/3,\n", "                                                    random_state=42)" ] },
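{ "cell_type": "markdown", "metadata": {}, "source": [ "A quick look at the split before transforming anything: the shapes of the train/test sets and the class balance of the encoded target. This is an optional sanity check, not part of the original flow." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sanity check: split sizes and class balance\n", "print(X_train.shape, X_test.shape)\n", "y_train.value_counts(normalize=True)" ] },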
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## 1. Numerical features\n", "\n", "### 1.A. Dealing with unusual values\n", "\n", "#### 1.A.1. missing values" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "missing = add_missing(df, numerical_cols)\n", "missing.describe()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "- dropping NAs" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dropped = missing.dropna(axis=0)\n", "dropped.shape" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "- filling NAs" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filled = missing.fillna(value=0)\n", "filled.describe()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "- interpolating NAs" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "interpolated = missing.interpolate(method='nearest')\n", "interpolated.describe()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.A.2. infinite values" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "infinite = add_missing(df, numerical_cols, inf=True)\n", "infinite.describe()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with pd.option_context('mode.use_inf_as_na', True):\n", "    print(infinite.dropna(axis=0).shape)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with pd.option_context('mode.use_inf_as_na', True):\n", "    print(infinite.fillna(value=0).describe())" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with pd.option_context('mode.use_inf_as_na', True):\n", "    print(infinite.interpolate().describe())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.A.3. Using SimpleImputer\n", "\n", "With SimpleImputer there are several imputing strategies available through its `strategy` parameter:\n", "- use `'constant'` and specify the `fill_value` parameter to impute a specific value\n", "- use `'mean'`, `'median'` or `'most_frequent'` for more sophisticated methods" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.impute import SimpleImputer" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "imputer = SimpleImputer(missing_values=np.nan, strategy='mean')\n", "imputer.fit_transform(missing[numerical_cols])" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.A.4. Using IterativeImputer\n", "\n", "With IterativeImputer, each feature with missing values is modelled as a function of the other features, and the fitted model predicts the missing entries:\n", "- use the `estimator` parameter to specify the model used for learning and predicting the missing values\n", "\n", "Note that IterativeImputer is still experimental, so it has to be enabled explicitly before it can be imported." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# IterativeImputer is experimental; this import enables it\n", "from sklearn.experimental import enable_iterative_imputer  # noqa\n", "from sklearn.impute import IterativeImputer" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "imputer = IterativeImputer(random_state=42)\n", "imputer.fit_transform(missing[numerical_cols])" ] },
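{ "cell_type": "markdown", "metadata": {}, "source": [ "As a small, optional extension of the imputer examples above: the `'median'` strategy is often a safer default than `'mean'` for skewed columns such as incomes, because a handful of very large values cannot drag the fill value upwards." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Median imputation is less sensitive to the few very large incomes\n", "median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')\n", "medians_filled = pd.DataFrame(median_imputer.fit_transform(missing[numerical_cols]),\n", "                              columns=numerical_cols)\n", "medians_filled.describe()" ] },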
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 1.B. different scales" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipe.fit(X_train[numerical_cols], y_train)\n", "accuracy_score(y_test.values, pipe.predict(X_test[numerical_cols]))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gridplot(X_train, y_train);" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.B.1 Min-Max Scaling" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import MinMaxScaler" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "minmax = MinMaxScaler()\n", "scaled_pipe = Pipeline(steps=[\n", "    ('minmax', minmax),\n", "    ('knn', KNeighborsClassifier())\n", "])\n", "\n", "scaled_pipe.fit(X_train[numerical_cols], y_train)\n", "accuracy_score(y_test.values, scaled_pipe.predict(X_test[numerical_cols]))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gridplot(apply_scaler(X_train, numerical_cols, minmax), y_train);" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise: Try the same experiment with logistic regression\n", "- Create a pipe containing a logistic regressor and measure its accuracy\n", "- Create another pipe with a `MinMaxScaler` and a logistic regressor. Compare the results and try to explain the difference." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 1.C. unnormalized data\n", "\n", "#### 1.C.1 Standard scaling" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "standard = StandardScaler()\n", "normalized_pipe = Pipeline(steps=[\n", "    ('standard', standard),\n", "    ('knn', KNeighborsClassifier())\n", "])\n", "\n", "normalized_pipe.fit(X_train[numerical_cols], y_train)\n", "accuracy_score(y_test.values, normalized_pipe.predict(X_test[numerical_cols]))" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gridplot(apply_scaler(X_train, numerical_cols, standard), y_train);" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise: Try the same experiment with logistic regression\n", "- Create a pipe containing a logistic regressor and measure its accuracy\n", "- Create another pipe with a `StandardScaler` and a logistic regressor. Compare the results and try to explain the difference." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] },
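{ "cell_type": "markdown", "metadata": {}, "source": [ "Optional check: after a `StandardScaler`, each training column should have approximately zero mean and unit variance. The cell below relies on `standard` having been fitted inside `normalized_pipe` above." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Means should be ~0 and standard deviations ~1 on the training data\n", "apply_scaler(X_train, numerical_cols, standard)[numerical_cols].describe().loc[['mean', 'std']]" ] },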
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.C.2 outliers\n", "\n", "(Outlier illustration from: flowingdata.com)\n", "\n", "#### Scaling data with outliers: RobustScaler" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import RobustScaler" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "robust_pipe = Pipeline(steps=[\n", "    ('robust', RobustScaler()),\n", "    ('knn', KNeighborsClassifier())\n", "])\n", "\n", "robust_pipe.fit(X_train[numerical_cols], y_train)\n", "accuracy_score(y_test.values, robust_pipe.predict(X_test[numerical_cols]))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise: Try the same experiment with logistic regression\n", "- Create a pipe containing a logistic regressor and measure its accuracy\n", "- Create another pipe with a `RobustScaler` and a logistic regressor. Compare the results and try to explain the difference." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 1.D. Binarization" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import Binarizer" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "binary_pipe = Pipeline([\n", "    ('bin', Binarizer(threshold=101.0)),\n", "    ('knn', KNeighborsClassifier())\n", "])\n", "\n", "binary_pipe.fit(X_train[numerical_cols], y_train)\n", "accuracy_score(y_test.values, binary_pipe.predict(X_test[numerical_cols]))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 1.E. Correlated features" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.heatmap(df.select_dtypes(exclude=[\"object\"]).corr(), robust=True, annot=True, cbar=True)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Not now. More about this topic in the next issue of DS101. Cough-cough-PCA-cough." ] },
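{ "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on to the model of the week, one more numerical trick from the Tools list at the top: a mathematical transformation. The income columns are heavily right-skewed, so compressing them with `log1p` before scaling can help distance-based models. This is an optional sketch, not part of the original exercise flow; the pipeline below simply chains the transformation in front of the scaler and classifier used so far." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import FunctionTransformer\n", "\n", "# log1p compresses the long right tail of the income columns before scaling\n", "log_pipe = Pipeline(steps=[\n", "    ('log', FunctionTransformer(np.log1p)),\n", "    ('minmax', MinMaxScaler()),\n", "    ('knn', KNeighborsClassifier())\n", "])\n", "\n", "log_pipe.fit(X_train[numerical_cols], y_train)\n", "accuracy_score(y_test.values, log_pipe.predict(X_test[numerical_cols]))" ] },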
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## Intermission II. - Model of the Week\n", "\n", "### Decision Trees\n", "\n", "Decision trees are a type of supervised machine learning algorithm that can predict both categorical and continuous values (in which case they are called *regression trees*). **The algorithm essentially divides the training population based on attribute values and assigns a prediction to each category.**\n", "\n", "For a **categorical** target variable, the assigned prediction is **the mode of the target variable values** in the given subpopulation. For a **continuous** target variable, the assigned prediction is **the mean of the values**.\n", "\n", "A familiar example from the Wikipedia page on decision trees:\n", "\n", "(from: wiki)\n", "\n", "A tree showing survival of passengers on the Titanic (\"sibsp\" represents the number of spouses or siblings aboard). The figures under the leaves show the probability of an outcome and the percentage of observations in the leaf.\n", "\n", "#### The Trick: Where to Split?\n", "\n", "The key to decision trees is determining where to split the population. **The goal is to make these subpopulations as homogeneous in the target variable as possible.** If an attribute is completely uncorrelated with the target variable, the decision tree will not split the population based on that variable.\n", "\n", "Various _metrics_ are used to decide the best splits. Most measure the **\"impurity\"** of the parent node and choose the split that reduces impurity the most. However, these calculations are usually local, meaning **they do not guarantee the best global outcome**.\n", "\n", "##### How Large/Complex Should the Tree Be? When to Stop Splitting?\n", "\n", "One of the biggest weaknesses of decision/regression trees is **overfitting**, so controlling tree complexity is crucial. Some common ways to limit tree growth include:\n", "- Setting a minimum sample number required for a split\n", "- Setting a minimum sample number required in each leaf\n", "- Setting a threshold for impurity decrease, below which no splits are made\n", "- Setting a maximum tree depth\n", "- Limiting the number of variables used for splitting\n", "- Restricting the maximum number of leaves\n", "\n", "Another way to combat overfitting is **tree pruning**. This involves allowing the tree to grow initially, then systematically removing splits that contribute little to model accuracy, either from the top or bottom of the tree.\n", "\n", "#### Why Use Decision Trees Instead of Logistic Regression?\n", "\n", "Decision trees shine when **there is a non-linear relationship between attributes and the target variable**. Despite handling non-linearity well, decision trees remain highly interpretable (as shown in the Titanic example above).\n", "\n", "#### What’s Better Than One Tree? Multiple Trees!\n", "\n", "To tackle overfitting, early techniques involved randomly sampling the training set and building multiple trees on different samples. Predictions were then made using the \"votes\" of all trees (taking the mean for regression or the mode for classification). This is known as **bagging**.\n", "\n", "**Random forests** improve on this by randomly excluding some attributes from consideration at each node. This prevents a few dominant attributes from appearing in most splits, which would otherwise cause the trees to be highly correlated, thereby reducing the effectiveness of bagging.\n", "\n", "An important takeaway:\n", "> When in doubt, use brute force: random forest.\n", "\n", "For a more visually appealing explanation, visit this site!\n", "\n", "#### Using decision trees in scikit-learn:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tree = DecisionTreeClassifier()" ] },
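{ "cell_type": "markdown", "metadata": {}, "source": [ "A small, optional illustration of the complexity controls listed above (the parameter values here are arbitrary, chosen only to show which constructor arguments correspond to those limits): a deliberately shallow tree fitted on the numerical columns." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Arbitrary complexity limits, just to show the corresponding parameters\n", "shallow_tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10, random_state=42)\n", "shallow_tree.fit(X_train[numerical_cols], y_train)\n", "accuracy_score(y_test.values, shallow_tree.predict(X_test[numerical_cols]))" ] },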
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## 2. Nominal Features\n", "\n", "### Replacing values" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "replace_map = {'Dependents': {'0': 0, '1': 1, '2': 2, '3+': 4}}\n", "X_train.replace(replace_map).head()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Ordinal encoding" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OrdinalEncoder" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ordinal_pipe = Pipeline([\n", "    ('ordinalencoder', OrdinalEncoder()),\n", "    ('knn', KNeighborsClassifier())\n", "])\n", "\n", "ordinal_pipe.fit(X_train[nominal_cols], y_train.values)\n", "accuracy_score(y_test.values, ordinal_pipe.predict(X_test[nominal_cols]))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise: Try the same experiment with a decision tree\n", "- Create a pipe containing a decision tree and measure its accuracy with the encoded data" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### One-hot encoding" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "onehot_pipe = Pipeline(steps=[\n", "    ('hot', OneHotEncoder(categories='auto', sparse_output=False)),\n", "    ('knn', KNeighborsClassifier())\n", "])\n", "\n", "onehot_pipe.fit(X_train[nominal_cols], y_train)\n", "accuracy_score(y_test.values, onehot_pipe.predict(X_test[nominal_cols]))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise: Try the same experiment with a decision tree\n", "- Create a pipe containing a decision tree and measure its accuracy" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] },
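{ "cell_type": "markdown", "metadata": {}, "source": [ "An optional peek at what the one-hot encoder actually produced: each category becomes its own 0/1 column. The cell below relies on `onehot_pipe` having been fitted above." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Column names generated by the fitted one-hot encoder\n", "onehot_pipe.named_steps['hot'].get_feature_names_out(nominal_cols)" ] },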
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Putting it all together\n", "\n", "#### Using the ColumnTransformer\n", "\n", "Use a different transformer for each column or group of columns: define a transformation pipeline for each group, then decide what happens to the remaining (`remainder`) columns. Use `'drop'` to drop them or `'passthrough'` to keep them without transformation.\n", "\n", "Example:\n", "```python\n", "ColumnTransformer([\n", "    ('name_of_transformation', TransformerPipeline(), ['list', 'of', 'columns']),\n", "    ('name_of_second_transformation', DifferentTransformerPipeline(), ['list', 'of', 'other', 'columns'])\n", "], remainder='drop')\n", "```" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.compose import ColumnTransformer" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Filling missing values and normalizing numerical cols\n", "numerical_pipe = Pipeline([\n", "    ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),\n", "    ('standardscaler', StandardScaler())\n", "])\n", "# Filling missing values and one-hot encoding nominal columns\n", "nominal_pipe = Pipeline([\n", "    ('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),\n", "    ('ohe', OneHotEncoder())\n", "])\n", "\n", "transformer = ColumnTransformer([\n", "    ('binarizer', Binarizer(threshold=101.0), ['LoanAmount']),  # Create a binarized loan amount column\n", "    ('numerical_pipe', numerical_pipe, numerical_cols),\n", "    ('nominal_pipe', nominal_pipe, nominal_cols),\n", "], remainder='drop')\n", "\n", "\n", "pipeline = Pipeline([\n", "    ('preprocess', transformer),\n", "    ('knn', pipe)  # `pipe` is the bare kNN pipeline defined earlier\n", "])\n", "pipeline" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pipeline.fit(X_train, y_train)\n", "y_hat = pipeline.predict(X_test)\n", "accuracy_score(y_test, y_hat)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise: Try out the different classifiers with the preprocessed data!" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "szisz_ds_2025", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 1 }