{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Intro to Data Science\n", "## Part II. - Data discovery\n", "\n", "### Table of contents\n", "\n", "- ##### Data Discovery\n", " - Theory\n", " - Examples\n", "\n", "- ##### Classification basics\n", " - Theory\n", " - Examples\n", "\n", "---\n", "\n", "## What is Data Discovery?\n", "Data discovery is the process in which one looks into data and tries to:\n", "- figure out what is interesting in the data\n", "- what can one do with it\n", "- if it needs extensive preprocessing\n", "\n", "From Wikipedia:\n", "> Data Discovery is a user-driven process of searching for patterns or specific items in a data set.\n", "> Data Discovery applications use visual tools such as geographical maps, pivot-tables, and heat-maps\n", "> to make the process of finding patterns or specific items rapid and intuitive. Data Discovery may \n", "> leverage statistical and data mining techniques to accomplish these goals.\n", "\n", "### Why it is important?\n", "To speed up the whole process by giving you insights about:\n", "- if the data can be used at all\n", "- the necessary preprocessing steps\n", "- the possible algorithms\n", "- the interesting data points\n", "- which features to use\n", "\n", "### Tools\n", "Anything and everything. Two important factor:\n", "- speed __->__ base statistics\n", "- ease of understanding __->__ PLOTS-PLOTS-PLOTS!\n", "\n", "#### Plots vs descriptive metrics:\n", "\n", "\n", "
\n", "from autodesk's blog\n", "\n", "### Let's do it then! \n", "Load the built-in iris dataset with sklearn's `load_iris` function and discover the dataset! (hint: load the dataset with `return_X_y=True` parameter and create a `pandas.DataFrame` from the data; then use the `pandas.DataFrame`'s `plot` function for plotting. You can try `pandas.DataFrame`'s `describe` method as well.).\n", "\n", "#### Answer the following questions:\n", "- What is the task to solve?\n", "- Is anything interesting showed up?\n", "- What question should we ask about the dataset?\n", "- How should we solve the task?\n", "- What should we do as the first step of preprocessing?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from sklearn.datasets import load_iris\n", "\n", "pd.set_option('display.max_columns', 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Load the dataset into a pandas DataFrame" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris = load_iris(as_frame=True)\n", "\n", "X, y = iris['data'], iris['target'].to_frame()\n", "df = pd.concat([X, y], axis='columns')\n", "\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.target.unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.target.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Plot the data points" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "df.plot(1, 3, kind='scatter', c='target', colormap='Set1');" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.plot.scatter(1, 3, c='target', colormap='Set1');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Generate basic statistics about the data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.describe(percentiles=[.1, .25, .5, .75, .9])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.boxplot(data=X);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Generate basic statistics by target labels" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "df.groupby('target').describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(ncols=3, sharey=\"row\", figsize=(16,6))\n", "\n", "for i in range(3):\n", " sns.boxplot(data=df.loc[df.target == i, X.columns], ax=axes[i])\n", " axes[i].tick_params(axis='x', labelrotation = -45)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Plot every feature against each other!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "colormap = {0: 'tab:blue', 1: 'tab:red', 2: 'tab:green'}\n", "fig, axes = plt.subplots(nrows=4, ncols=4, \n", " sharex=\"col\", sharey=\"row\", \n", " figsize=(12,12))\n", "\n", "for i, row in enumerate(axes):\n", " for j, col in enumerate(row):\n", " if i != j:\n", " col.scatter(df.iloc[:, i], df.iloc[:, j], c=df.replace({\"target\": colormap})['target'])\n", " col.set_title('{} - {}'.format(i, j))\n", " else:\n", " col.hist([df.loc[df['target'] == k].iloc[:, i] for k in range(3)], \n", " bins=20, histtype='stepfilled', color=colormap.values())\n", " col.set_title('{} - {}'.format(i, j))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.pairplot(df, hue='target', vars=X.columns);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Generate the correlation matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.corr()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.heatmap(df.corr(), cmap='Blues');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Dealing with missing values and outliers" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filtered = df.copy()\n", "outliers = []\n", "\n", "for col in X.columns:\n", " upper_thres = df[col].mean() + 2 * df[col].std()\n", " lower_thres = df[col].mean() - 2 * df[col].std()\n", " \n", " filtered = filtered.loc[filtered[col].between(lower_thres, upper_thres)]\n", " \n", " outliers.append(df.loc[~df[col].between(lower_thres, upper_thres)])\n", "\n", "outliers = pd.concat(outliers)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "filtered.corr()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.heatmap(filtered.corr());" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "outliers.plot(1, 3, kind='scatter', c='target', colormap='jet');" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "- Using third party lib `ydata-profiling` \n", " Install it with: \n", " ```bash\n", " conda activate szisz_ds_23\n", " pip install ydata-profiling\n", " ```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import ydata_profiling" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "ydata_profiling.ProfileReport(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "### Task\n", "Build a dummy classifier based on your observations!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def predict_iris(sepal_l, sepal_w, petal_l, petal_w):\n", " return 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "## A bit more on classification\n", "\n", "Last time we have seen a special case of classification: where there are two, mutually exclusive classes. The generalization of this can go two ways: there are $\\lvert C \\lvert > 2$ number of classes, which are mutually exclusive or not. The latter case is called *any-of* , *multilabel* , or *multivalue* classification. 
This problem can be broken down into $\\lvert C \\rvert$ *binary* classifications, each applied independently (though the classes need not be independent in the statistical sense) to the train and test sets. The other, *one-of* or *multiclass*, case is a bit more complicated... (source)\n", "\n", "Imagine instances as points in a $d$-dimensional space, where every dimension corresponds to a feature. Linear classification works by dividing this *feature space* with a hyperplane (a hyperplane is the generalization of a line to higher dimensions). (In other words, linear classifiers make the classification decision based on a linear combination of the features.)\n", "This is most easily understood through a simple example: a problem solvable by linear classification (image **A**) and one not solvable by linear classification (image **B**): \n", "\n", "
\n", "from Sebastian Raschka's blog" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting a bit technical...\n", "Since jupyter notebooks can handle $\\LaTeX$, let's write some equations! \n", "Formally, the linearity means that the classification *can* be expressed like this: \n", "\n", "$$ c = f \\left( \\sum _{i=0} ^{N} w_i x_i \\right) $$ \n", "Where $c \\in C$ is the class we predict for a given instance $\\bf{x}$, $w_i$ is the weight of attribute $x_i$, and $f$ is a function that maps its input to a class. Geometrically, $\\bf{w}$ is the normal of the separator hyperplane. **The weights are basically what we create in the process of learning, and we use the learnt weights to predict the class of a given input instance.** \n", "Note that $x_0 = 1$ is a dummy feature, only to make the notation convenient, since the equation for a hyperplane is $w_1 x_1 + w_2 x_2 + \\dots + w_N x_N + w_0 = 0 $ \n", "\n", "Referring back to the *multiclass* classification: In the linear case, it happens by applying binary classifiers to each instance, and the decision is made based on the score/probability/etc. of each binary classifier." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 0: Logistic regression\n", "\n", "Yes, the logistic regression is linear! Why? Because the predictions can be written in the form \n", "$$ \\hat{p} = \\frac{1}{1+e^{-\\bf wx}} $$\n", "so more precisely, the *log-odds* are linear functions of $x$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 1: Perceptron\n", "\n", "This is a very simple algorithm: binary classification, where $c$ can be $-1$ or $1$. initialize the weights with some values (0s or arbitrary small random numbers), then go through the training set in some order (but the resulting separator is not unique, it depends on the order!). Training rule: if the sign of the $\\bf wx$ product is equal to the known output, don't change the weights. If the sign is different, then modify the weights by $ c \\cdot \\bf x$. The generalization of this algorithm to the multiclass case, along with good pictures and examples can be read on the Perceptron's Wikipedia page." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 2: Naive Bayes\n", "\n", "For this, we'll look at the classification problem from a different angle. We have a given input vector $X$, and we need the probability of it belonging to a class $Y$.\n", "Nomenclature: the input features $x_i$ are in $X$, and the class variable is $y$. $P(y)$ is called **a priori**, and $P(y\\lvert X)$ is called **a posteriori** probability of $y$. This a posteriory probability is what we need. \n", "At first glance, the Bayes classifier works like this: \n", "\n", "$$\\hat{y} = \\mathrm{arg}\\,\\mathrm{max}\\, P(y\\lvert X),$$ \n", "that means \"choosing the class with the maximum probability given an input $X$\". \n", "Of course, we would need an enormous training set to be able to get $P(y\\lvert X)$ for every possible input $X$. This is where the Bayes theorem comes in:\n", "\n", "$$P(y\\lvert X) = \\frac{P(X\\lvert y) P(y)}{P(X)}$$ \n", "Great, we now have to calculate $P(X\\lvert y)$ (but at least we don't have to worry about the divisor, as it is the same for all possible $y$s). But if we suppose that the attributes are not dependent on each other (**this assumption makes it naive!**), then the probability $P(X\\lvert y)$ can be written in a product form: $ \\prod _i P(X_i \\lvert y) $! 
(The proof of this is left as an exercise to the reader.) Now we only have to calculate the $P(x_i \\lvert y)$ probabilities, which requires a much smaller training set. The calculation is simple if the $x_i$-s are categorical variables - just use relative frequencies from the training set. In the continuous case, we assume that these probabilities come from a distribution (Gaussian, binomial, multinomial, etc.), and we use the training set to estimate the parameters of this distribution. (Note: naive Bayes classification is linear only if this distribution is from an exponential family.) Finally, the a priori probabilities $P(y)$ are simply calculated as relative frequencies from the training set. \n", "So training consists of calculating the $P(x_i \\lvert y)$-s (or the distribution parameters) and $P(y)$ from the training set, and prediction is just calculating the value $ P(y) \\prod _i P(x_i \\lvert y)$ for each possible $y$ and choosing the one which maximises it.\n", "\n", "---\n", "\n", "## Now look at the iris dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.pipeline import Pipeline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def validate(model, X_test, y_test):\n", " y_pred = model.predict(X_test)\n", " print(\"Prediction accuracy: {:.2f}%\"\n", " .format(np.sum(y_pred == y_test) / float(len(y_pred)) * 100))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Split the dataset into a train and a test part." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y['target'],\n", " test_size=1/3, \n", " random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Logistic regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "logistic_pipe = Pipeline(steps=[('logistic', LogisticRegression())])\n", "logistic_pipe.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "validate(logistic_pipe, X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Depending on the scikit-learn version and solver, the default parameters make the logistic regression either fit a binary (one-vs-rest) classifier for each label and pick the label with the best score, or fit a single multinomial model. \n", "Let's try something else! 
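" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick, purely illustrative sketch, we can first peek at the fitted pipeline: the model stores one weight vector (and intercept) per class, and `predict_proba` exposes the per-class probabilities the decision is based on." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# One weight vector and one intercept per class: shapes (3, 4) and (3,) for iris\n", "logistic = logistic_pipe.named_steps['logistic']\n", "print(logistic.coef_.shape, logistic.intercept_.shape)\n", "\n", "# Per-class probabilities for a few test instances; the prediction is the argmax\n", "logistic_pipe.predict_proba(X_test.head())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "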
With pipelines, we can easily set parameters for our predictors, like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logistic_pipe.set_params(logistic__multi_class='multinomial', \n", " logistic__solver='sag')\n", "logistic_pipe.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "validate(logistic_pipe, X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Perceptron" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Perceptron\n", "\n", "perceptron_pipe = # TODO\n", "perceptron_pipe.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "validate(perceptron_pipe, X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Naive Bayes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.naive_bayes import MultinomialNB\n", "\n", "nb_pipe = # TODO\n", "nb_pipe.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "validate(nb_pipe, X_test, y_test)" ] } ], "metadata": { "kernelspec": { "display_name": "szisz_ds_23", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.16" }, "vscode": { "interpreter": { "hash": "d0dc57cccaa2d0d305072673e9fe47ca44c23e744f2f05647d6b373471557b60" } } }, "nbformat": 4, "nbformat_minor": 1 }