# Fraud Classification on DC/OS
Fraud classification is a common data science problem with many solutions. It is similar in approach to many other problems (e.g., click prediction or spam detection) in that it is a rare-event binary classification problem. That is, there are two classes, fraud and not fraud, and one of them is rare.

The notebooks in these examples were created on JupyterLab running on DC/OS. In this set of notebooks, we will walk through four modeling approaches: logistic regression, decision tree, random forest, and a neural network. General pros and cons will be given for each.

These examples are simple and can be run outside DC/OS. They lay the foundation, however, for future posts, which will demonstrate how DC/OS can make scaling across multiple nodes and GPUs easy. Instructions for installing and running JupyterLab on DC/OS can be found [here](https://mesosphere.com/blog/dcos-tensorflow-jupyter-beakerx/).
## Data
The data used was generated with a [payment simulator](https://github.com/EdgarLopezPhD/PaySim) based on this [fraud simulation paper](https://www.researchgate.net/profile/Stefan_Axelsson4/publication/313138956_PAYSIM_A_FINANCIAL_MOBILE_MONEY_SIMULATOR_FOR_FRAUD_DETECTION/links/5890f87e92851cda2568a295/PAYSIM-A-FINANCIAL-MOBILE-MONEY-SIMULATOR-FOR-FRAUD-DETECTION.pdf).

### Data Description
There are a total of ~1.3 million records and 11 columns in the dataset. Because only two transaction types are required to build the models (TRANSFER and CASH-OUT), fewer than 600k records are kept. Of these, only about 8.4k are fraud cases. Columns that have no value for the analysis are also dropped, leaving 7 columns (6 independent variables and 1 dependent variable).
A description of the columns follows:

|Variable|Description|Keep|
|:--- |:--- |:--- |
|step|Maps a unit of time in the real world. In this case, 1 step is 1 hour of time.|Drop|
|type|CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER|Keep (TRANSFER and CASH-OUT)|
|amount|The amount of the transaction.|Keep|
|nameOrig|The customer ID for the initiator of the transaction.|Drop|
|oldbalanceOrg|The initial balance before the transaction.|Keep|
|newbalanceOrg|The customer's balance after the transaction.|Keep|
|nameDest|The customer ID for the recipient of the transaction.|Drop|
|oldbalanceDest|The initial recipient balance before the transaction.|Keep|
|newbalanceDest|The recipient's balance after the transaction.|Keep|
|isFraud|Identifies a fraudulent transaction (1) or a non-fraudulent transaction (0).|Keep|
|isFlaggedFraud|This is a rule-based system that flags illegal attempts to transfer more than 200.000 in a single transaction.|Drop|
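
The row and column selection described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not necessarily how the notebooks do it (they may use a dataframe library instead). The column names and type labels are copied from the table and may be spelled differently in the raw file; the sample records are made up.

```python
import csv
import io

# Column names and type labels are copied from the table above; the raw
# file may spell them differently, so treat them as assumptions.
KEEP_TYPES = {"TRANSFER", "CASH-OUT"}
KEEP_COLUMNS = [
    "type", "amount", "oldbalanceOrg", "newbalanceOrg",
    "oldbalanceDest", "newbalanceDest", "isFraud",
]


def select_rows(csv_file):
    """Yield only TRANSFER/CASH-OUT rows, restricted to the kept columns."""
    for row in csv.DictReader(csv_file):
        if row["type"] in KEEP_TYPES:
            yield {name: row[name] for name in KEEP_COLUMNS}


# Tiny made-up sample (not real simulator output) to show the behaviour.
SAMPLE = io.StringIO(
    "step,type,amount,nameOrig,oldbalanceOrg,newbalanceOrg,"
    "nameDest,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud\n"
    "1,PAYMENT,100.0,C1,500.0,400.0,M1,0.0,0.0,0,0\n"
    "1,TRANSFER,181.0,C2,181.0,0.0,C3,0.0,0.0,1,0\n"
    "1,CASH-OUT,181.0,C4,181.0,0.0,C5,21182.0,0.0,1,0\n"
)

kept = list(select_rows(SAMPLE))
print(len(kept), "rows kept with columns", list(kept[0]))
```

Passing an open file handle instead of `SAMPLE` would apply the same filter to a full CSV export.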

## Exploration
These examples are focused mostly on the models and interpretation. Most of the initial data exploration is skipped.
## Measuring model performance
We will use the most common metrics for assessing fraud models: Accuracy, Precision, Recall, and F1.

|                  |Predicted Not Fraud|Predicted Is Fraud|
|------------------|-------------------|------------------|
|Actual Not Fraud  | True Negative     | False Positive   |
|Actual Is Fraud   | False Negative    | True Positive    |

* Accuracy - Proportion of all predictions that are correct. $\frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}}$
* Precision - True positives over all predicted positive cases. $\frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$
* Recall - True positives over all actual positive cases. $\frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$
* F1 - The harmonic mean of Precision and Recall, which balances the two. $\frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
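
As a worked illustration, the four metrics can be computed directly from the confusion-matrix counts. The function name and the counts below are made up for the example. The second call shows why accuracy alone is misleading for rare events: a model that never predicts fraud still scores 99% accuracy on this hypothetical data while catching no fraud at all.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a model predicts no positives
    # or when the data contains no actual positives.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 10,000 transactions, 100 of them fraudulent.
metrics = classification_metrics(tp=80, tn=9890, fp=10, fn=20)

# Baseline that labels every transaction "not fraud".
baseline = classification_metrics(tp=0, tn=9900, fp=0, fn=100)

print(metrics)
print(baseline)
```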

## Models

### Logistic Regression
Logistic regression is a statistical model for data with a binary dependent variable. It models the log-odds of the probability of an event as a linear combination of the independent (predictor) variables. The two possible values of the dependent variable are often labelled 1 and 0, representing outcomes such as fraud/no fraud, click/no click, or spam/no spam. Logistic regression can be generalized to more than two levels of the dependent variable, but that is not needed for this example.
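
The relationship between the linear combination, the log-odds, and the predicted probability can be sketched in a few lines. The coefficients and feature values here are hypothetical and chosen only for illustration; in practice they would be estimated from the training data.

```python
import math


def predict_probability(features, weights, intercept):
    """Return P(y = 1) for one record under a logistic model."""
    # The log-odds are a linear combination of the predictor variables.
    log_odds = intercept + sum(w * x for w, x in zip(weights, features))
    # The logistic (sigmoid) function maps log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients for two predictors (not fitted to real data).
weights = [0.8, -0.5]
intercept = -3.0

p = predict_probability([2.0, 1.0], weights, intercept)
print(f"probability of fraud: {p:.4f}")
print(f"log-odds recovered from p: {math.log(p / (1.0 - p)):.4f}")
```

Taking the log of `p / (1 - p)` recovers the linear combination, which is what makes the coefficients easy to interpret.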

* Pros
    * Computationally inexpensive
    * Easy to interpret
    * Easy to implement
* Cons
    * Prone to overfitting when the regularization penalty is not large enough.
    * Is parametric.
    * Assumes a linear relationship (in the transformed, log-odds space) between the independent variables and the dependent variable.
    * Interactions must be known ahead of time and added explicitly.

[Logistic Regression](./examples/Logistic+Regression.ipynb)

### Decision Tree
A decision tree is among the easiest models to understand conceptually and to interpret. There are many different algorithms, but perhaps the easiest to describe is recursive partitioning. The dataset is split recursively. Each split is made on the independent variable that yields the largest possible reduction in heterogeneity of the dependent variable (there are different measures, e.g. Gini impurity or entropy). The splits stop when they reach a predetermined stopping criterion.
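
A single split step can be sketched as follows, using Gini impurity as the heterogeneity measure. This is a simplified illustration on one numeric feature with made-up data, not the algorithm of any particular library; a full tree would repeat the search over all features and recurse into each side of the split.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_fraud = sum(labels) / n
    return 1.0 - p_fraud ** 2 - (1.0 - p_fraud) ** 2


def best_split(values, labels):
    """Find the threshold on one feature that minimizes weighted Gini impurity."""
    n = len(labels)
    best_threshold, best_impurity = None, gini(labels)
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Made-up transaction amounts and fraud labels.
amounts = [10.0, 20.0, 30.0, 900.0, 950.0]
is_fraud = [0, 0, 0, 1, 1]

threshold, impurity = best_split(amounts, is_fraud)
print(f"split at amount <= {threshold}, weighted impurity {impurity}")
```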

* Pros
    * Computationally inexpensive
    * Very easy to interpret
    * Is non-parametric.
    * Easy to implement resulting model (at least the most important rules at top of tree)
    * Does not assume a linear relationship between the independent variables and the dependent variable.
    * Can discover interactions.
* Cons
    * Difficult to tune optimally
        * Prone to overfitting when the cost of additional splits is not large enough.
        * Likely to perform poorly when the cost of additional splits is too large.

[Decision Tree](./examples/Tree+Classifier.ipynb)

### Random Forest
Random Forests are essentially bootstrapped Decision Trees. A random set of features is selected. Given those features, a bootstrap sample of records (drawn with replacement) is generated and a decision tree is created. This process is repeated N times (N = the number of trees created). Each iteration generates a tree, and each tree gets a "vote" on the prediction; averaging over the trees also indicates which features are most important and by how much.<br>
This is a rough description with many relevant details omitted. In general, however, random forests have the following properties.
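
The bootstrap-and-vote idea can be sketched as below. This is a deliberately simplified illustration, not a full random forest: each "tree" is a one-split stump on a single randomly chosen feature, the function names are hypothetical, and the data is made up. Each stump votes on the predicted class and the majority wins.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """Fit a one-split 'tree' on one feature (a stand-in for a full decision tree)."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
        right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
        left_class = Counter(left).most_common(1)[0][0]
        right_class = Counter(right).most_common(1)[0][0] if right else left_class
        errors = sum(y != left_class for y in left) + sum(y != right_class for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_class, right_class)
    _, threshold, left_class, right_class = best
    return lambda row: left_class if row[feature] <= threshold else right_class


def fit_forest(rows, labels, n_trees, rng):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw len(rows) record indices with replacement.
        sample = [rng.randrange(len(rows)) for _ in rows]
        # Random feature selection (one feature, since the learner is a stump).
        feature = rng.randrange(len(rows[0]))
        trees.append(fit_stump([rows[i] for i in sample],
                               [labels[i] for i in sample], feature))
    return trees


def predict(trees, row):
    """Majority vote over the predictions of all trees."""
    return Counter(tree(row) for tree in trees).most_common(1)[0][0]


# Made-up records: [amount, old balance]; label 1 marks fraud.
rows = [[10, 1], [20, 2], [30, 3], [40, 4],
        [900, 50], [950, 60], [980, 70], [990, 80]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

trees = fit_forest(rows, labels, n_trees=25, rng=random.Random(0))
print(predict(trees, [5, 0]), predict(trees, [1000, 100]))
```

Because each tree is trained independently of the others, the loop in `fit_forest` is the part that parallelizes naturally.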

* Pros
    * Resistant to overfitting with little tuning.
    * Non-parametric.
    * Easy to determine most influential variables.
    * Easy to run training in parallel.
* Cons
    * Computationally expensive.
    * Sometimes difficult to implement resulting model (depends on infrastructure).

[Random Forest](./examples/Random+Forest.ipynb)

### Neural Network
Artificial neural networks (ANNs) are computing systems vaguely inspired by the biological neural networks that constitute animal brains. They are a relatively new type of model, but have quickly become the dominant approach to solving certain types of problems (e.g. object detection and NLP). They do well on many types of problems where huge amounts of training data are available.

* Pros
    * Non-parametric.
    * Performs exceptionally well (often best) for many complex problems with large training sets. Examples include:
        * Image Recognition
        * Self Driving Cars
        * Natural Language Processing
* Cons
    * Computationally expensive.
    * Sometimes difficult to implement resulting model (depends on infrastructure).
    * Prone to overfitting without significant tuning.
    * Difficult to interpret model.
    * Many parameters and hyperparameters to tune.
    * Typically requires large training sets.

[Neural Network](./examples/Neural+Network.ipynb)