{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f8cc5bfe-817f-4b18-97c8-f1f4ec85bbe4",
   "metadata": {},
   "source": [
    "# Customer Loan Behavior Analysis  \n",
    "\n",
    "### Description:  \n",
    "You're a financial analyst at a leading digital lending platform. Your company provides personal loans to customers based on their creditworthiness. The business development team wants to analyze customer loan repayment behavior to inform their marketing strategies. They've asked you to prepare a dataset that combines customer information, loan details, and repayment history.  \n",
    "\n",
    "### Tasks:  \n",
    "- Calculate the total repayment amount made by the customer and merge this information with the loan details dataset. Ensure the resulting dataset includes all columns from the loan details dataset.  \n",
    "- Group customers by age brackets (25-34, 35-44, 45-54, 55+) and calculate the average loan amount and average credit score for each age group.  \n",
    "- Use the pivot_table method to reshape the repayment history dataset, showing:  \n",
    "**Rows:** Months of repayment (January to December)  \n",
    "**Columns:** Years of repayment (e.g., 2022, 2023)  \n",
    "**Values:** Total repayment amount made in each month of each year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "ef18c9eb-853f-4c5b-b01a-9669b7cc3581",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]\n",
      "Pandas version 2.2.1\n",
      "Numpy version 1.26.4\n"
     ]
    }
   ],
   "source": [
    "# import libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import sys\n",
    "\n",
    "print('Python version ' + sys.version)\n",
    "print('Pandas version ' + pd.__version__)\n",
    "print('Numpy version ' + np.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a3099e6-a0c0-42af-9fe6-d6ec83868b04",
   "metadata": {},
   "source": [
    "# The Data  \n",
    "\n",
    "The dataset represents a digital lending platform's customer loan data, comprising three tables: customer information (demographics), loan details (loan amounts and issuance dates), and repayment history (repayment amounts and dates). The dataset includes 1,000 customers, 1,000 loans, and 5,000 repayment records, providing a comprehensive view of customer loan behavior.  \n",
    "\n",
    "**customer_info.csv:** Contains customer demographic information.  \n",
    "- customer_id (unique identifier)\n",
    "- age\n",
    "- income\n",
    "- credit_score\n",
    "\n",
    "**loan_details.csv:** Contains loan information.  \n",
    "- loan_id (unique identifier)\n",
    "- customer_id (foreign key referencing customer_info)\n",
    "- loan_amount\n",
    "- loan_issued_date  \n",
    "\n",
    "**repayment_history.csv:** Contains repayment information.  \n",
    "- repayment_id (unique identifier)\n",
    "- loan_id (foreign key referencing loan_details)\n",
    "- repayment_date\n",
    "- repayment_amount\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "5268e5c9-a951-44e3-ba8a-f64dced3bae8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# set the seed\n",
    "np.random.seed(0)\n",
    "\n",
    "# customer information\n",
    "customer_info = pd.DataFrame({\n",
    "    'customer_id': np.arange(1000),\n",
    "    'age': np.random.randint(25, 60, 1000),\n",
    "    'income': np.random.randint(50000, 150000, 1000),\n",
    "    'credit_score': np.random.randint(600, 850, 1000)\n",
    "})\n",
    "\n",
    "# loan details\n",
    "loan_details = pd.DataFrame({\n",
    "    'loan_id': np.arange(1000),\n",
    "    'customer_id': np.random.choice(customer_info['customer_id'], 1000),\n",
    "    'loan_amount': np.random.randint(1000, 50000, 1000),\n",
    "    'loan_issued_date': pd.date_range('2022-01-01', periods=1000, freq='D')\n",
    "})\n",
    "\n",
    "# repayment history\n",
    "repayment_history = pd.DataFrame({\n",
    "    'repayment_id': np.arange(5000),\n",
    "    'loan_id': np.random.choice(loan_details['loan_id'], 5000),\n",
    "    'repayment_date': pd.date_range('2022-01-01', periods=5000, freq='D'),\n",
    "    'repayment_amount': np.random.randint(50, 500, 5000)\n",
    "})\n",
    "\n",
    "# save the datasets to CSV files\n",
    "customer_info.to_csv('customer_info.csv', index=False)\n",
    "loan_details.to_csv('loan_details.csv', index=False)\n",
    "repayment_history.to_csv('repayment_history.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4cd1da82-5d38-49c5-b6d0-6c2470199f83",
   "metadata": {},
   "source": [
    "Let us read into memory the csv files and take a look at the data types of each dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "44ba517a-7713-4566-9055-fe9d61b01169",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create dataframes\n",
    "customer_info_df = pd.read_csv('customer_info.csv')\n",
    "loan_details_df = pd.read_csv('loan_details.csv')\n",
    "\n",
    "# you can use the parse_dates parameter to specify that column \"repayment_date\" should be parsed as a date object.\n",
    "repayment_history_df = pd.read_csv('repayment_history.csv', parse_dates=['repayment_date'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "ad7f9574-8c78-4922-ac51-040d6064bc2c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 1000 entries, 0 to 999\n",
      "Data columns (total 4 columns):\n",
      " #   Column        Non-Null Count  Dtype\n",
      "---  ------        --------------  -----\n",
      " 0   customer_id   1000 non-null   int64\n",
      " 1   age           1000 non-null   int64\n",
      " 2   income        1000 non-null   int64\n",
      " 3   credit_score  1000 non-null   int64\n",
      "dtypes: int64(4)\n",
      "memory usage: 31.4 KB\n"
     ]
    }
   ],
   "source": [
    "customer_info_df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a966a195-63ce-4a93-be59-777edb7bfc4f",
   "metadata": {},
   "source": [
    "We will need to convert the column \"loan_issue_date\" to a date object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "73ee5c82-2257-405d-9afc-4643910f5b43",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 1000 entries, 0 to 999\n",
      "Data columns (total 4 columns):\n",
      " #   Column            Non-Null Count  Dtype \n",
      "---  ------            --------------  ----- \n",
      " 0   loan_id           1000 non-null   int64 \n",
      " 1   customer_id       1000 non-null   int64 \n",
      " 2   loan_amount       1000 non-null   int64 \n",
      " 3   loan_issued_date  1000 non-null   object\n",
      "dtypes: int64(3), object(1)\n",
      "memory usage: 31.4+ KB\n"
     ]
    }
   ],
   "source": [
    "loan_details_df.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "a039ae3e-19f5-44d5-af02-382d2cd9a4cb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 5000 entries, 0 to 4999\n",
      "Data columns (total 4 columns):\n",
      " #   Column            Non-Null Count  Dtype         \n",
      "---  ------            --------------  -----         \n",
      " 0   repayment_id      5000 non-null   int64         \n",
      " 1   loan_id           5000 non-null   int64         \n",
      " 2   repayment_date    5000 non-null   datetime64[ns]\n",
      " 3   repayment_amount  5000 non-null   int64         \n",
      "dtypes: datetime64[ns](1), int64(3)\n",
      "memory usage: 156.4 KB\n"
     ]
    }
   ],
   "source": [
    "repayment_history_df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92bd8b51-c055-4b1d-b13e-798beac64d97",
   "metadata": {},
   "source": [
    "# Task #1  \n",
    "\n",
    "Calculate the total repayment amount made by the customer and merge this information with the loan details dataset. Ensure the resulting dataset includes all columns from the loan details dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca3b4f13-854d-429a-9d45-c0e7985d7d13",
   "metadata": {},
   "source": [
    "For this task, Let us merge all three datasets and then calculate the total repayment amount per customer. This is the dataframe we will use for the rest of the tutorial. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "441c66bd-2b4c-4427-9815-a46d1c588113",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>customer_id</th>\n",
       "      <th>age</th>\n",
       "      <th>income</th>\n",
       "      <th>credit_score</th>\n",
       "      <th>loan_id</th>\n",
       "      <th>loan_amount</th>\n",
       "      <th>loan_issued_date</th>\n",
       "      <th>repayment_id</th>\n",
       "      <th>repayment_date</th>\n",
       "      <th>repayment_amount</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>25</td>\n",
       "      <td>141445</td>\n",
       "      <td>725</td>\n",
       "      <td>1</td>\n",
       "      <td>11142</td>\n",
       "      <td>2022-01-02</td>\n",
       "      <td>1710</td>\n",
       "      <td>2026-09-07</td>\n",
       "      <td>303</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>25</td>\n",
       "      <td>141445</td>\n",
       "      <td>725</td>\n",
       "      <td>840</td>\n",
       "      <td>25736</td>\n",
       "      <td>2024-04-20</td>\n",
       "      <td>527</td>\n",
       "      <td>2023-06-12</td>\n",
       "      <td>492</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>25</td>\n",
       "      <td>141445</td>\n",
       "      <td>725</td>\n",
       "      <td>840</td>\n",
       "      <td>25736</td>\n",
       "      <td>2024-04-20</td>\n",
       "      <td>589</td>\n",
       "      <td>2023-08-13</td>\n",
       "      <td>281</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>25</td>\n",
       "      <td>141445</td>\n",
       "      <td>725</td>\n",
       "      <td>840</td>\n",
       "      <td>25736</td>\n",
       "      <td>2024-04-20</td>\n",
       "      <td>1281</td>\n",
       "      <td>2025-07-05</td>\n",
       "      <td>99</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>25</td>\n",
       "      <td>141445</td>\n",
       "      <td>725</td>\n",
       "      <td>840</td>\n",
       "      <td>25736</td>\n",
       "      <td>2024-04-20</td>\n",
       "      <td>2831</td>\n",
       "      <td>2029-10-02</td>\n",
       "      <td>117</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   customer_id  age  income  credit_score  loan_id  loan_amount  \\\n",
       "0            0   25  141445           725        1        11142   \n",
       "1            0   25  141445           725      840        25736   \n",
       "2            0   25  141445           725      840        25736   \n",
       "3            0   25  141445           725      840        25736   \n",
       "4            0   25  141445           725      840        25736   \n",
       "\n",
       "  loan_issued_date  repayment_id repayment_date  repayment_amount  \n",
       "0       2022-01-02          1710     2026-09-07               303  \n",
       "1       2024-04-20           527     2023-06-12               492  \n",
       "2       2024-04-20           589     2023-08-13               281  \n",
       "3       2024-04-20          1281     2025-07-05                99  \n",
       "4       2024-04-20          2831     2029-10-02               117  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = customer_info_df.merge(loan_details_df, on='customer_id').merge(repayment_history_df, on='loan_id')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed96c9de-2706-46e6-99e4-c46dc75f6f3a",
   "metadata": {},
   "source": [
    "We can see from the data below that our new dataframe has 5,000 rows and we do not have any null values. Remember that by defaul the `.merge` method will do an inner merge. This means the columns we merged on (customer_id and loan_id) need to have matching values on both dataframes being merged. If the values do not match, that row will not make it to the final result.  \n",
    "\n",
    "In this specific case, we did loose some of the data, for example:\n",
    "- The customer_info_df has a customer_id=1\n",
    "- Customer_id=1 is not present in the loan_details_df\n",
    "- This means the final dataframe (df), does not contain customer_id=1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "9baaa419-f226-43d7-a93a-55c83f5c5608",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>customer_id</th>\n",
       "      <th>age</th>\n",
       "      <th>income</th>\n",
       "      <th>credit_score</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>28</td>\n",
       "      <td>92694</td>\n",
       "      <td>692</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   customer_id  age  income  credit_score\n",
       "1            1   28   92694           692"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# here is customer_id=1\n",
    "mask = customer_info_df.loc[:,'customer_id'] == 1\n",
    "customer_info_df[mask]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "f8b2a95a-5959-45ba-b99c-35aaa992fd79",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>loan_id</th>\n",
       "      <th>customer_id</th>\n",
       "      <th>loan_amount</th>\n",
       "      <th>loan_issued_date</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [loan_id, customer_id, loan_amount, loan_issued_date]\n",
       "Index: []"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# customer_id=1 is not found in loan_details_df\n",
    "mask = loan_details_df.loc[:,'customer_id'] == 1\n",
    "loan_details_df[mask]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "58589af1-ede4-40d6-9c92-d37c65b8fb77",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>customer_id</th>\n",
       "      <th>age</th>\n",
       "      <th>income</th>\n",
       "      <th>credit_score</th>\n",
       "      <th>loan_id</th>\n",
       "      <th>loan_amount</th>\n",
       "      <th>loan_issued_date</th>\n",
       "      <th>repayment_id</th>\n",
       "      <th>repayment_date</th>\n",
       "      <th>repayment_amount</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [customer_id, age, income, credit_score, loan_id, loan_amount, loan_issued_date, repayment_id, repayment_date, repayment_amount]\n",
       "Index: []"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# customer_id=1 is not found in df\n",
    "mask = df.loc[:,'customer_id'] == 1\n",
    "df[mask]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e40c7011-696a-49a6-aca0-8cb2b31ba2ef",
   "metadata": {},
   "source": [
    "To calculate the total repayment amount per customer, we can make use of the `.groupby` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "539be95c-4940-4887-81b5-8c253bc67ecc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "customer_id\n",
       "0      1663\n",
       "2      4450\n",
       "3       725\n",
       "4      2104\n",
       "5      2094\n",
       "       ... \n",
       "994     675\n",
       "995    2513\n",
       "996    1156\n",
       "997    1155\n",
       "998     442\n",
       "Name: repayment_amount, Length: 629, dtype: int64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# create group object\n",
    "group = df.groupby('customer_id')\n",
    "\n",
    "# get the total repayment amount for the group\n",
    "group['repayment_amount'].sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "800262da-27f3-4fbf-90b5-58209f5086e6",
   "metadata": {},
   "source": [
    "If you wanted to verify the math is correct...  \n",
    "\n",
    "We can find select one customer_id and manually add the values.  \n",
    "- 303 + 492 + 281 + 99 + 117 + 371 = 1,663"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "40c1801d-fb1b-438d-9a51-3b6aa1d83b04",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    303\n",
       "1    492\n",
       "2    281\n",
       "3     99\n",
       "4    117\n",
       "5    371\n",
       "Name: repayment_amount, dtype: int64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mask = df.loc[:,'customer_id'] == 0\n",
    "df.loc[mask,'repayment_amount']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46ce722e-6681-43d2-8ea4-c35344967843",
   "metadata": {},
   "source": [
    "# Task #2  \n",
    "\n",
    "Group customers by age brackets (25-34, 35-44, 45-54, 55+) and calculate the average loan amount and average credit score for each age group.  \n",
    "\n",
    "- **bins** defines the edges of the bins.\n",
    "- **labels** assigns names to each bin.\n",
    "- **float('inf')** represents infinity, making the last bin open-ended (55+).\n",
    "- **include_lowest=True** ensures that the lowest value (55) is included in the last bin."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "61e360ae-260c-4ecf-9d82-0464843a59c6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "age_brackets\n",
       "25-34    1552\n",
       "35-44    1535\n",
       "45-54    1411\n",
       "55+       502\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# create bins for the (25-34), (35-44), (45-54) and (55+) categories\n",
    "bins = [25, 34, 44, 55, float('inf')]  \n",
    " \n",
    "# labels for the three categories\n",
    "labels = ['25-34', '35-44', '45-54', '55+'] \n",
    " \n",
    "# bin it up!\n",
    "df['age_brackets'] = pd.cut(df['age'], bins=bins, labels=labels, include_lowest=True)\n",
    " \n",
    "# here we get a frequency count of the categories\n",
    "df['age_brackets'].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0243d3b3-02b5-4bf0-ab36-93ac249368a7",
   "metadata": {},
   "source": [
    "The observed parameter in `df.groupby` affects grouping behavior when using Categorical columns, and setting it to True (future default) treats categories as part of the data's index, whereas setting it to False (current default) treats them as \"unobserved\" variables.  \n",
    "\n",
    "I went in more detail in a previous tutorial: [Education-Analytics-Challenge](https://hedaro.com/Programming-Languages/Python/Data-Analyst/Data-Analyst---Education-Analytics-Challenge)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "643cee31-d06f-4b10-8dfb-c0484acdf03f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>avg_loan_amount</th>\n",
       "      <th>avg_credit_score</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>age_brackets</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>25-34</th>\n",
       "      <td>25107.916237</td>\n",
       "      <td>715.453608</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35-44</th>\n",
       "      <td>25182.748534</td>\n",
       "      <td>725.331596</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>45-54</th>\n",
       "      <td>25559.566265</td>\n",
       "      <td>725.663359</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>55+</th>\n",
       "      <td>28249.966135</td>\n",
       "      <td>712.916335</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              avg_loan_amount  avg_credit_score\n",
       "age_brackets                                   \n",
       "25-34            25107.916237        715.453608\n",
       "35-44            25182.748534        725.331596\n",
       "45-54            25559.566265        725.663359\n",
       "55+              28249.966135        712.916335"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# calculate the average loan amount and average credit score for each age group\n",
    "df.groupby('age_brackets', observed=True).agg(\n",
    "     avg_loan_amount=pd.NamedAgg(column=\"loan_amount\", aggfunc=\"mean\"),\n",
    "     avg_credit_score=pd.NamedAgg(column=\"credit_score\", aggfunc=\"mean\")\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "991289ff-74b9-4f7a-ad4a-48d4428a8862",
   "metadata": {},
   "source": [
    "# Task #3  \n",
    "\n",
    "Use the pivot_table method to reshape the repayment history dataset, showing:  \n",
    "- **Rows:** Months of repayment (January to December)  \n",
    "- **Columns:** Years of repayment (e.g., 2022, 2023)  \n",
    "- **Values:** Total repayment amount made in each month of each year \n",
    "\n",
    "In order to sort the months correctly, we need to use a map as shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "1d91ef60-4d7e-4cff-b85d-0de162b16d0b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>repayment_date</th>\n",
       "      <th>2022</th>\n",
       "      <th>2023</th>\n",
       "      <th>2024</th>\n",
       "      <th>2025</th>\n",
       "      <th>2026</th>\n",
       "      <th>2027</th>\n",
       "      <th>2028</th>\n",
       "      <th>2029</th>\n",
       "      <th>2030</th>\n",
       "      <th>2031</th>\n",
       "      <th>2032</th>\n",
       "      <th>2033</th>\n",
       "      <th>2034</th>\n",
       "      <th>2035</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>repayment_date</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>January</th>\n",
       "      <td>7876.0</td>\n",
       "      <td>8834.0</td>\n",
       "      <td>9711.0</td>\n",
       "      <td>8979.0</td>\n",
       "      <td>9236.0</td>\n",
       "      <td>8772.0</td>\n",
       "      <td>8576.0</td>\n",
       "      <td>8084.0</td>\n",
       "      <td>9712.0</td>\n",
       "      <td>8340.0</td>\n",
       "      <td>8167.0</td>\n",
       "      <td>7108.0</td>\n",
       "      <td>7630.0</td>\n",
       "      <td>8604.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>February</th>\n",
       "      <td>7917.0</td>\n",
       "      <td>8264.0</td>\n",
       "      <td>8113.0</td>\n",
       "      <td>8181.0</td>\n",
       "      <td>7965.0</td>\n",
       "      <td>6685.0</td>\n",
       "      <td>7884.0</td>\n",
       "      <td>8308.0</td>\n",
       "      <td>8039.0</td>\n",
       "      <td>7589.0</td>\n",
       "      <td>7676.0</td>\n",
       "      <td>7565.0</td>\n",
       "      <td>8783.0</td>\n",
       "      <td>7098.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>March</th>\n",
       "      <td>8090.0</td>\n",
       "      <td>8245.0</td>\n",
       "      <td>9570.0</td>\n",
       "      <td>9039.0</td>\n",
       "      <td>7522.0</td>\n",
       "      <td>8419.0</td>\n",
       "      <td>8936.0</td>\n",
       "      <td>7896.0</td>\n",
       "      <td>7604.0</td>\n",
       "      <td>7817.0</td>\n",
       "      <td>7499.0</td>\n",
       "      <td>8310.0</td>\n",
       "      <td>7289.0</td>\n",
       "      <td>8927.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>April</th>\n",
       "      <td>7950.0</td>\n",
       "      <td>9636.0</td>\n",
       "      <td>8574.0</td>\n",
       "      <td>8980.0</td>\n",
       "      <td>9272.0</td>\n",
       "      <td>8009.0</td>\n",
       "      <td>8171.0</td>\n",
       "      <td>8437.0</td>\n",
       "      <td>7657.0</td>\n",
       "      <td>8692.0</td>\n",
       "      <td>8601.0</td>\n",
       "      <td>8754.0</td>\n",
       "      <td>8644.0</td>\n",
       "      <td>8464.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>May</th>\n",
       "      <td>9821.0</td>\n",
       "      <td>8433.0</td>\n",
       "      <td>9034.0</td>\n",
       "      <td>8423.0</td>\n",
       "      <td>8261.0</td>\n",
       "      <td>8928.0</td>\n",
       "      <td>7685.0</td>\n",
       "      <td>9209.0</td>\n",
       "      <td>8665.0</td>\n",
       "      <td>8259.0</td>\n",
       "      <td>8926.0</td>\n",
       "      <td>8598.0</td>\n",
       "      <td>8573.0</td>\n",
       "      <td>9337.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>June</th>\n",
       "      <td>7999.0</td>\n",
       "      <td>9103.0</td>\n",
       "      <td>7305.0</td>\n",
       "      <td>7271.0</td>\n",
       "      <td>9419.0</td>\n",
       "      <td>8276.0</td>\n",
       "      <td>7641.0</td>\n",
       "      <td>9278.0</td>\n",
       "      <td>8966.0</td>\n",
       "      <td>8488.0</td>\n",
       "      <td>8390.0</td>\n",
       "      <td>8898.0</td>\n",
       "      <td>8626.0</td>\n",
       "      <td>8272.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>July</th>\n",
       "      <td>7786.0</td>\n",
       "      <td>9393.0</td>\n",
       "      <td>9565.0</td>\n",
       "      <td>8010.0</td>\n",
       "      <td>8378.0</td>\n",
       "      <td>9542.0</td>\n",
       "      <td>8496.0</td>\n",
       "      <td>8251.0</td>\n",
       "      <td>8633.0</td>\n",
       "      <td>8975.0</td>\n",
       "      <td>10388.0</td>\n",
       "      <td>7370.0</td>\n",
       "      <td>8617.0</td>\n",
       "      <td>8241.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>August</th>\n",
       "      <td>7587.0</td>\n",
       "      <td>8239.0</td>\n",
       "      <td>8548.0</td>\n",
       "      <td>8866.0</td>\n",
       "      <td>8559.0</td>\n",
       "      <td>8532.0</td>\n",
       "      <td>8008.0</td>\n",
       "      <td>9171.0</td>\n",
       "      <td>8203.0</td>\n",
       "      <td>8873.0</td>\n",
       "      <td>7762.0</td>\n",
       "      <td>9981.0</td>\n",
       "      <td>7278.0</td>\n",
       "      <td>9633.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>September</th>\n",
       "      <td>8666.0</td>\n",
       "      <td>8986.0</td>\n",
       "      <td>7753.0</td>\n",
       "      <td>8168.0</td>\n",
       "      <td>8277.0</td>\n",
       "      <td>8769.0</td>\n",
       "      <td>9079.0</td>\n",
       "      <td>8666.0</td>\n",
       "      <td>7734.0</td>\n",
       "      <td>7779.0</td>\n",
       "      <td>7864.0</td>\n",
       "      <td>9420.0</td>\n",
       "      <td>6240.0</td>\n",
       "      <td>2101.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>October</th>\n",
       "      <td>9202.0</td>\n",
       "      <td>8138.0</td>\n",
       "      <td>7779.0</td>\n",
       "      <td>8989.0</td>\n",
       "      <td>7725.0</td>\n",
       "      <td>7485.0</td>\n",
       "      <td>9331.0</td>\n",
       "      <td>8433.0</td>\n",
       "      <td>9347.0</td>\n",
       "      <td>8671.0</td>\n",
       "      <td>9127.0</td>\n",
       "      <td>9622.0</td>\n",
       "      <td>9030.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>November</th>\n",
       "      <td>9295.0</td>\n",
       "      <td>8768.0</td>\n",
       "      <td>8246.0</td>\n",
       "      <td>8136.0</td>\n",
       "      <td>7776.0</td>\n",
       "      <td>8842.0</td>\n",
       "      <td>8161.0</td>\n",
       "      <td>8071.0</td>\n",
       "      <td>8535.0</td>\n",
       "      <td>8646.0</td>\n",
       "      <td>8221.0</td>\n",
       "      <td>7041.0</td>\n",
       "      <td>7790.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>December</th>\n",
       "      <td>8295.0</td>\n",
       "      <td>8646.0</td>\n",
       "      <td>7134.0</td>\n",
       "      <td>7425.0</td>\n",
       "      <td>8097.0</td>\n",
       "      <td>8750.0</td>\n",
       "      <td>9667.0</td>\n",
       "      <td>8965.0</td>\n",
       "      <td>8523.0</td>\n",
       "      <td>7022.0</td>\n",
       "      <td>8990.0</td>\n",
       "      <td>8686.0</td>\n",
       "      <td>7727.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "repayment_date    2022    2023    2024    2025    2026    2027    2028  \\\n",
       "repayment_date                                                           \n",
       "January         7876.0  8834.0  9711.0  8979.0  9236.0  8772.0  8576.0   \n",
       "February        7917.0  8264.0  8113.0  8181.0  7965.0  6685.0  7884.0   \n",
       "March           8090.0  8245.0  9570.0  9039.0  7522.0  8419.0  8936.0   \n",
       "April           7950.0  9636.0  8574.0  8980.0  9272.0  8009.0  8171.0   \n",
       "May             9821.0  8433.0  9034.0  8423.0  8261.0  8928.0  7685.0   \n",
       "June            7999.0  9103.0  7305.0  7271.0  9419.0  8276.0  7641.0   \n",
       "July            7786.0  9393.0  9565.0  8010.0  8378.0  9542.0  8496.0   \n",
       "August          7587.0  8239.0  8548.0  8866.0  8559.0  8532.0  8008.0   \n",
       "September       8666.0  8986.0  7753.0  8168.0  8277.0  8769.0  9079.0   \n",
       "October         9202.0  8138.0  7779.0  8989.0  7725.0  7485.0  9331.0   \n",
       "November        9295.0  8768.0  8246.0  8136.0  7776.0  8842.0  8161.0   \n",
       "December        8295.0  8646.0  7134.0  7425.0  8097.0  8750.0  9667.0   \n",
       "\n",
       "repayment_date    2029    2030    2031     2032    2033    2034    2035  \n",
       "repayment_date                                                           \n",
       "January         8084.0  9712.0  8340.0   8167.0  7108.0  7630.0  8604.0  \n",
       "February        8308.0  8039.0  7589.0   7676.0  7565.0  8783.0  7098.0  \n",
       "March           7896.0  7604.0  7817.0   7499.0  8310.0  7289.0  8927.0  \n",
       "April           8437.0  7657.0  8692.0   8601.0  8754.0  8644.0  8464.0  \n",
       "May             9209.0  8665.0  8259.0   8926.0  8598.0  8573.0  9337.0  \n",
       "June            9278.0  8966.0  8488.0   8390.0  8898.0  8626.0  8272.0  \n",
       "July            8251.0  8633.0  8975.0  10388.0  7370.0  8617.0  8241.0  \n",
       "August          9171.0  8203.0  8873.0   7762.0  9981.0  7278.0  9633.0  \n",
       "September       8666.0  7734.0  7779.0   7864.0  9420.0  6240.0  2101.0  \n",
       "October         8433.0  9347.0  8671.0   9127.0  9622.0  9030.0     NaN  \n",
       "November        8071.0  8535.0  8646.0   8221.0  7041.0  7790.0     NaN  \n",
       "December        8965.0  8523.0  7022.0   8990.0  8686.0  7727.0     NaN  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "month_order = {'January': 1, 'February': 2, 'March': 3, 'April': 4,\n",
    "               'May': 5, 'June': 6, 'July': 7, 'August': 8,\n",
    "               'September': 9, 'October': 10, 'November': 11, 'December': 12}\n",
    "\n",
    "repayment_pivot = repayment_history_df.pivot_table(values='repayment_amount', \n",
    "                                 index=repayment_history_df['repayment_date'].dt.strftime('%B'), \n",
    "                                 columns=repayment_history_df['repayment_date'].dt.year, aggfunc='sum')\n",
    "\n",
    "repayment_pivot.sort_index(key=lambda x: x.map(month_order))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b8422614-a2f0-4fa2-bf7a-252babd97d34",
   "metadata": {},
   "source": [
    "# Summary:  \n",
    "\n",
    "The Pandas tutorial provided a comprehensive guide to analyzing customer loan behavior using three datasets: customer information, loan details, and repayment history. The tutorial covered merging datasets, grouping data, calculating aggregates, and reshaping data using pivot tables.  \n",
    "\n",
    "### Key Takeaways:  \n",
    "- **Merging datasets:** Combine customer information, loan details, and repayment history datasets using the `merge` method.\n",
    "- **Data grouping:** Group customers by age brackets using `pd.cut` and calculate average loan amounts and credit scores using `groupby` and `agg`.\n",
    "- **Data aggregation:** Calculate total repayment amounts per customer using `groupby` and `sum`.\n",
    "- **Pivot tables:** Reshape repayment history data using `pivot_table` to show monthly repayment amounts by year.\n",
    "- **Data sorting:** Sort months correctly using a custom sorting map.\n",
    "- **Data inspection:** Use `info` and `value_counts` to understand data distribution and quality."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e731dc45-d498-45cf-a3b3-cc6199498eb2",
   "metadata": {},
   "source": [
    "<p class=\"text-muted\">This tutorial was created by <a href=\"https://www.hedaro.com\" target=\"_blank\"><strong>HEDARO</strong></a></p>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}