{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Loading data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "df = pd.read_parquet(\"fraud-cleaned-sample.parquet\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train/test split\n", "\n", "We're using time-series data, so we'll split based on time." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first = df['timestamp'].min()\n", "last = df['timestamp'].max()\n", "cutoff = first + ((last - first) * 0.7)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "train = df[df['timestamp'] <= cutoff]\n", "len(train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test = df[df['timestamp'] > cutoff]\n", "len(test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(train) / (len(train) + len(test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Encoding categorical features\n", "\n", "Some of our features are obvious quantities (like interarrival times and transaction amounts), but others are categories of things (like merchant IDs and transaction types). In a conventional programming language or database schema, we'd use enumerated types (C programmers may want to use distinguished small integers) to model categories of things, but those aren't suitable for input to machine learning algorithms.\n", "\n", "Why?\n", "\n", "Well, let's say we encode transaction types as small integers, like this:\n", "\n", "```\n", "MANUAL=0\n", "SWIPE=1\n", "CHIP_AND_PIN=2\n", "CONTACTLESS=3\n", "ONLINE=4\n", "```\n", "\n", "We can use this representation to write code that treats these differently, but the integers don't actually capture anything about our problem that a machine learning algorithm can exploit -- a manual transaction isn't \"less than\" a swipe transaction, and an online transaction isn't \"closer to\" a contactless transaction than a manual one is. 
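For example, a model that treats these codes as ordinary numbers would happily conclude that a contactless transaction is somehow \"three times\" a swipe transaction, or that the average of a manual transaction and an online transaction is a chip-and-pin transaction -- neither of which means anything. Here's a sketch of the problem, using the codes above:\n", "\n", "```\n", "# arithmetic on the integer codes is meaningless for our problem\n", "CONTACTLESS - SWIPE      # == 2, but transaction types aren't \"2 apart\" in any sense\n", "(MANUAL + ONLINE) / 2    # == 2.0 == CHIP_AND_PIN, which is nonsense\n", "```\n", "\n", "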
We want a representation that makes sure that manual transactions are similar to other manual transactions in some way _but dissimilar to all other transactions_ in that way.\n", "\n", "There are several approaches we can use to make sense of categorical features, and we'll use two of them in this notebook:\n", "\n", "- [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing) for merchant IDs and\n", "- [one-hot encoding](https://en.wikipedia.org/wiki/One-hot) for transaction types" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn\n", "from sklearn.pipeline import Pipeline\n", "from sklearn import feature_extraction, preprocessing\n", "from sklearn.preprocessing import FunctionTransformer\n", "from sklearn.compose import ColumnTransformer\n", "\n", "# these helpers turn a column of merchant IDs into one dict per row so that FeatureHasher can consume them\n", "def mk_stringize(colname):\n", "    def stringize(tab):\n", "        return [{colname : s} for s in tab]\n", "    return stringize\n", "\n", "def amap(s):\n", "    return s.map(lambda x: {'merchant_id' : str(x)})\n", "\n", "# my_func = mk_stringize('merchant_id')\n", "my_func = amap\n", "\n", "# a pipeline that converts merchant IDs to dicts and hashes them into a fixed number of buckets\n", "def mk_hasher(features=16384, values=None):\n", "    return Pipeline([('dictify', \n", "                      FunctionTransformer(my_func, accept_sparse=True)), \n", "                     ('hasher', \n", "                      sklearn.feature_extraction.FeatureHasher(n_features=features, input_type='dict'))])\n", "\n", "\n", "HASH_BUCKETS = 256\n", "\n", "# one-hot encode transaction types; feature-hash merchant IDs\n", "tt_xform = ('onehot', sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore', categories=[['online','contactless','chip_and_pin','manual','swipe']]), ['trans_type'])\n", "mu_xform = ('m_hashing', mk_hasher(HASH_BUCKETS), 'merchant_id')\n", "\n", "xform_steps = [tt_xform, mu_xform]\n", "\n", "cat_xform = ColumnTransformer(transformers=xform_steps, n_jobs=None)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualizing categorical features\n", "\n", "The general approach we'll use is to [_reduce the dimensionality_](https://en.wikipedia.org/wiki/Dimensionality_reduction) of our encoded categorical features so we can plot them as points on a plane. This means going from hundreds of dimensions (in the case of hashed merchant IDs) or five dimensions (in the case of one-hot encoded transaction types) to two dimensions.\n", "\n", "We'll use two different techniques: a linear technique called [principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) and a nonlinear technique called [t-distributed stochastic neighbor embedding](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding). The details of these techniques are out of scope for this workshop, but they're both good places to start if you want to visualize high-dimensional data. Dimensionality reduction can be expensive, so we'll start by sampling only a small amount of our data.
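\n", "\n", "Before we do, here's an illustrative sketch of what `cat_xform` will produce on a couple of hypothetical rows (the merchant IDs below are made up, and the column count assumes the `HASH_BUCKETS = 256` setting above). Each row becomes five one-hot columns for the transaction type followed by one column per hash bucket for the merchant ID, so the hashed merchant columns are the _last_ `HASH_BUCKETS` columns of the encoded matrix:\n", "\n", "```\n", "# an illustrative sketch only -- these merchant IDs are hypothetical, not drawn from the data set\n", "demo = pd.DataFrame({\"trans_type\": [\"online\", \"manual\"], \"merchant_id\": [12345, 67890]})\n", "cat_xform.fit_transform(demo).shape   # (2, 5 + HASH_BUCKETS) == (2, 261)\n", "```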
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vis_sample = pd.concat([train[train[\"label\"] == label].sample(2500) for label in [\"legitimate\", \"fraud\"]])\n", "\n", "categorical_matrix = cat_xform.fit_transform(vis_sample)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "crows, ccols = categorical_matrix.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(categorical_matrix != 0).sum(0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Does the merchant ID obviously correlate with fraud?\n", "\n", "We're going to start by using PCA to plot the two first principal components of the encoded merchants -- think of this as mapping from the high-dimensional space to a two-dimensional space in such a way that emphasizes the dimensions that contain the most information and minimizes the dimensions that contain the least information." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn.decomposition\n", "\n", "merchants = categorical_matrix[:, -HASH_BUCKETS:]\n", "\n", "DIMENSIONS = 2\n", "\n", "mpca2 = sklearn.decomposition.PCA(DIMENSIONS)\n", "\n", "mpca2_a = mpca2.fit_transform(merchants.toarray())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "merchants_df = pd.DataFrame({\"label\": vis_sample[\"label\"].astype(np.object),\n", " \"x\": mpca2_a.T[0],\n", " \"y\": mpca2_a.T[1]}).reset_index().dropna()\n", "\n", "del merchants_df[\"index\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "import altair as alt\n", "alt.Chart(merchants_df).mark_point(opacity=0.1).encode(\n", " x=\"x:Q\", \n", " y=\"y:Q\", \n", " color=\"label\"\n", ").interactive()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, there's a lot of overlap between the classes here and merchant ID alone isn't an obvious way to differentiate between legitimate and fraudulent transactions.\n", "\n", "## What if we use a nonlinear visualization technique?\n", "\n", "Sometimes, a nonlinear visualization technique can work better than a linear one like PCA. The next approach we'll try is called t-distributed stochastic neighbor embedding, or t-SNE for short. t-SNE learns a mapping from high-dimensional points to low-dimensional points so that points that are similar in high-dimensional space are likely to be similar in low-dimensional space as well. t-SNE can sometimes identify structure that simpler techniques like PCA can't, but this power comes at a cost: it is much more expensive to compute than PCA and doesn't parallelize well. (t-SNE also works best for visualizing two-dimensional data when it is reducing from tens of dimensions rather than hundreds or thousands. So, in some cases, you'll want to use a fast technique like PCA to reduce your data to a few dozen dimensions before using t-SNE. 
That's what we're doing with the `TruncatedSVD` class in the next cell.)\n", "\n", "✅ *You can go back and re-run this entire notebook after changing `HASH_BUCKETS` to a different value.*\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sklearn.manifold\n", "tsne = sklearn.manifold.TSNE()\n", "\n", "# use SVD to reduce the dimensionality before fitting t-SNE\n", "svd = sklearn.decomposition.TruncatedSVD(16)\n", "svd_a = svd.fit_transform(merchants)\n", "\n", "tsne_a = tsne.fit_transform(svd_a)\n", "\n", "merchants_df[\"x\"] = tsne_a.T[0]\n", "merchants_df[\"y\"] = tsne_a.T[1]\n", "\n", "alt.Chart(merchants_df).mark_point(opacity=0.2).encode(x=\"x:Q\", y=\"y:Q\", color=\"label\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's still a lot of overlap between the classes here. Fortunately, we know from the [exploratory analysis notebook](./01-eda.ipynb) that our numeric features contain a lot of information to help us distinguish between classes. We'll see how to exploit that with models in the next notebook, but first, we need to preprocess these features.\n", "\n", "\n", "# Encoding numeric features\n", "\n", "For the numeric features, our preprocessing is a little easier. We need to impute missing values for interarrival times (the interarrival time is undefined for each user's first transaction, since there is no previous transaction to measure from), and we need to scale all numeric features to a consistent range. We'll do this using the `Pipeline` facility from scikit-learn.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import RobustScaler\n", "from sklearn.impute import SimpleImputer\n", "from sklearn.pipeline import Pipeline\n", "\n", "\n", "# impute missing interarrival times with the median, then scale; amounts only need scaling\n", "impute_and_scale = Pipeline([('median_imputer', SimpleImputer(strategy=\"median\")), ('interarrival_scaler', RobustScaler())])\n", "ia_scaler = ('interarrival_scaler', impute_and_scale, ['interarrival'])\n", "amount_scaler = ('amount_scaler', RobustScaler(), ['amount'])\n", "\n", "scale_steps = [ia_scaler, amount_scaler]\n", "all_xforms = ColumnTransformer(transformers=(scale_steps + xform_steps))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Fit and save the feature extraction pipeline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feat_pipeline = Pipeline([\n", "    ('feature_extraction', all_xforms)\n", "])\n", "\n", "feat_pipeline.fit(train)\n", "\n", "from mlworkflows import util\n", "util.serialize_to(feat_pipeline, \"feature_pipeline.sav\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With your feature extraction pipeline saved, you can go on to the next notebook. You have two choices -- either use a model based on [logistic regression](./03-model-logistic-regression.ipynb) or a model based on [tree ensembles](./03-model-random-forest.ipynb)." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }