{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "toc_visible": true, "authorship_tag": "ABX9TyOzUQiPquW+CDMlvXbDiABY", "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "source": [ "# Intro\n", "## General\n", "\"Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.\" ([Wikipedia](https://en.wikipedia.org/wiki/Data_analysis)).\n", "\n", "In Data Analysis, there are some analysis paradigms : **Univariate, Bivariate, Multivariate**. We apply these paradigms to analyze the features (or statistical variables, or columns of the dataframe) of the dataset and to have a better understanding.\n", "\n", "**Numeric features** are features with numbers that you can perform mathematical operations on. They are further divided into discrete (countable integers with clear boundaries) and continuous (can take any value, even decimals, within a range).\n", "\n", "**Categorical features** are columns with a limited number of possible values. Examples are `sex, country, or age group`.\n", "\n", "## Notebook overview\n", "\n", "This notebook is a guide to start practicing Data Analysis." ], "metadata": { "id": "wW3v1-2n3CzK" } }, { "cell_type": "markdown", "source": [ "# Setup" ], "metadata": { "id": "n4VFUnkuexCE" } }, { "cell_type": "markdown", "source": [ "## Installation\n", "Here is the section to install all the packages/libraries that will be needed to tackle the challlenge." ], "metadata": { "id": "qdFpRxPje1gw" } }, { "cell_type": "code", "source": [ "# !pip install -q ..." ], "metadata": { "id": "W-d-roFQe6yi" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Importation\n", "Here is the section to import all the packages/libraries that will be used through this notebook." ], "metadata": { "id": "0HmLxlQre7HW" } }, { "cell_type": "code", "source": [ "# Data handling\n", "import pandas as pd\n", "\n", "# Vizualisation (Matplotlib, Plotly, Seaborn, etc. )\n", "...\n", "\n", "# EDA (pandas-profiling, etc. )\n", "...\n", "\n", "# Feature Processing (Scikit-learn processing, etc. )\n", "...\n", "\n", "# Machine Learning (Scikit-learn Estimators, Catboost, LightGBM, etc. )\n", "...\n", "\n", "# Hyperparameters Fine-tuning (Scikit-learn hp search, cross-validation, etc. )\n", "...\n", "\n", "# Other packages\n", "import os\n" ], "metadata": { "id": "MP62JaiKfCnS" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "# Data Loading\n", "Here is the section to load the datasets (train, eval, test) and the additional files" ], "metadata": { "id": "UfOADQf0e9i1" } }, { "cell_type": "code", "source": [ "# For CSV, use pandas.read_csv" ], "metadata": { "id": "KIQyG5EcfQlU" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "# Exploratory Data Analysis: EDA\n", "Here is the section to **inspect** the datasets in depth, **present** it, make **hypotheses** and **think** the *cleaning, processing and features creation*." 
], "metadata": { "id": "okaZxnc3fRId" } }, { "cell_type": "markdown", "source": [ "## Dataset overview\n", "\n", "Have a look at the loaded datsets using the following methods: `.head(), .info()`" ], "metadata": { "id": "kTvnZQUlD6vf" } }, { "cell_type": "code", "source": [ "# Code here" ], "metadata": { "id": "0VNR9LfZfbGe" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Code here" ], "metadata": { "id": "3oUlBJbRFa2t" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Univariate Analysis\n", "\n", "‘Univariate analysis’ is the analysis of one variable at a time. This analysis might be done by computing some statistical indicators and by plotting some charts respectively using the pandas dataframe's method `.describe()` and one of the plotting libraries like [Seaborn](https://seaborn.pydata.org/), [Matplotlib](https://matplotlib.org/), [Plotly](https://seaborn.pydata.org/), etc.\n", "\n", "Please, read [this article](https://towardsdatascience.com/8-seaborn-plots-for-univariate-exploratory-data-analysis-eda-in-python-9d280b6fe67f) to know more about the charts." ], "metadata": { "id": "1jp5LfsvF72x" } }, { "cell_type": "code", "source": [ "# Code here" ], "metadata": { "id": "z2MH6TKPFbBA" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Multivariate Analysis\n", "\n", "Multivariate analysis’ is the analysis of more than one variable and aims to study the relationships among them. This analysis might be done by computing some statistical indicators like the `correlation` and by plotting some charts.\n", "\n", "Please, read [this article](https://towardsdatascience.com/10-must-know-seaborn-functions-for-multivariate-data-analysis-in-python-7ba94847b117) to know more about the charts." ], "metadata": { "id": "TP-DiZvsWs0F" } }, { "cell_type": "code", "source": [ "# Code here" ], "metadata": { "id": "NbYppWX8cfds" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "# Feature processing\n", "Here is the section to **clean** and **process** the features of the dataset." ], "metadata": { "id": "2pfMjvtNhlfB" } }, { "cell_type": "markdown", "source": [ "## Missing/NaN Values\n", "Handle the missing/NaN values using the Scikif-learn SimpleImputer" ], "metadata": { "id": "RR_0P-SJiPx_" } }, { "cell_type": "code", "source": [ "# Code Here" ], "metadata": { "id": "0YgaX0jfLcaK" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Scaling\n", "Scale the numeric features using the Scikif-learn StandardScaler, MinMaxScaler, or another Scaler." ], "metadata": { "id": "YipqthozL2kt" } }, { "cell_type": "code", "source": [ "# Code here" ], "metadata": { "id": "aI7jM-rZMYCG" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Encoding\n", "Encode the categorical features using the Scikif-learn OneHotEncoder." ], "metadata": { "id": "XdFcGOxHMalr" } }, { "cell_type": "code", "source": [ "# Code here" ], "metadata": { "id": "sAMvupuRMs-P" }, "execution_count": null, "outputs": [] } ] }