{ "cells": [ { "cell_type": "markdown", "id": "f8cc5bfe-817f-4b18-97c8-f1f4ec85bbe4", "metadata": {}, "source": [ "# Customer Loan Behavior Analysis \n", "\n", "### Description: \n", "You're a financial analyst at a leading digital lending platform. Your company provides personal loans to customers based on their creditworthiness. The business development team wants to analyze customer loan repayment behavior to inform their marketing strategies. They've asked you to prepare a dataset that combines customer information, loan details, and repayment history. \n", "\n", "### Tasks: \n", "- Calculate the total repayment amount made by the customer and merge this information with the loan details dataset. Ensure the resulting dataset includes all columns from the loan details dataset. \n", "- Group customers by age brackets (25-34, 35-44, 45-54, 55+) and calculate the average loan amount and average credit score for each age group. \n", "- Use the pivot_table method to reshape the repayment history dataset, showing: \n", "**Rows:** Months of repayment (January to December) \n", "**Columns:** Years of repayment (e.g., 2022, 2023) \n", "**Values:** Total repayment amount made in each month of each year" ] }, { "cell_type": "code", "execution_count": 1, "id": "ef18c9eb-853f-4c5b-b01a-9669b7cc3581", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]\n", "Pandas version 2.2.1\n", "Numpy version 1.26.4\n" ] } ], "source": [ "# import libraries\n", "import pandas as pd\n", "import numpy as np\n", "import sys\n", "\n", "print('Python version ' + sys.version)\n", "print('Pandas version ' + pd.__version__)\n", "print('Numpy version ' + np.__version__)" ] }, { "cell_type": "markdown", "id": "0a3099e6-a0c0-42af-9fe6-d6ec83868b04", "metadata": {}, "source": [ "# The Data \n", "\n", "The dataset represents a digital lending platform's customer loan data, comprising three tables: customer information (demographics), loan details (loan amounts and issuance dates), and repayment history (repayment amounts and dates). The dataset includes 1,000 customers, 1,000 loans, and 5,000 repayment records, providing a comprehensive view of customer loan behavior. \n", "\n", "**customer_info.csv:** Contains customer demographic information. \n", "- customer_id (unique identifier)\n", "- age\n", "- income\n", "- credit_score\n", "\n", "**loan_details.csv:** Contains loan information. \n", "- loan_id (unique identifier)\n", "- customer_id (foreign key referencing customer_info)\n", "- loan_amount\n", "- loan_issued_date \n", "\n", "**repayment_history.csv:** Contains repayment information. \n", "- repayment_id (unique identifier)\n", "- loan_id (foreign key referencing loan_details)\n", "- repayment_date\n", "- repayment_amount\n" ] }, { "cell_type": "code", "execution_count": 2, "id": "5268e5c9-a951-44e3-ba8a-f64dced3bae8", "metadata": {}, "outputs": [], "source": [ "# set the seed\n", "np.random.seed(0)\n", "\n", "# customer information\n", "customer_info = pd.DataFrame({\n", " 'customer_id': np.arange(1000),\n", " 'age': np.random.randint(25, 60, 1000),\n", " 'income': np.random.randint(50000, 150000, 1000),\n", " 'credit_score': np.random.randint(600, 850, 1000)\n", "})\n", "\n", "# loan details\n", "loan_details = pd.DataFrame({\n", " 'loan_id': np.arange(1000),\n", " 'customer_id': np.random.choice(customer_info['customer_id'], 1000),\n", " 'loan_amount': np.random.randint(1000, 50000, 1000),\n", " 'loan_issued_date': pd.date_range('2022-01-01', periods=1000, freq='D')\n", "})\n", "\n", "# repayment history\n", "repayment_history = pd.DataFrame({\n", " 'repayment_id': np.arange(5000),\n", " 'loan_id': np.random.choice(loan_details['loan_id'], 5000),\n", " 'repayment_date': pd.date_range('2022-01-01', periods=5000, freq='D'),\n", " 'repayment_amount': np.random.randint(50, 500, 5000)\n", "})\n", "\n", "# save the datasets to CSV files\n", "customer_info.to_csv('customer_info.csv', index=False)\n", "loan_details.to_csv('loan_details.csv', index=False)\n", "repayment_history.to_csv('repayment_history.csv', index=False)" ] }, { "cell_type": "markdown", "id": "4cd1da82-5d38-49c5-b6d0-6c2470199f83", "metadata": {}, "source": [ "Let us read into memory the csv files and take a look at the data types of each dataset." ] }, { "cell_type": "code", "execution_count": 3, "id": "44ba517a-7713-4566-9055-fe9d61b01169", "metadata": {}, "outputs": [], "source": [ "# create dataframes\n", "customer_info_df = pd.read_csv('customer_info.csv')\n", "loan_details_df = pd.read_csv('loan_details.csv')\n", "\n", "# you can use the parse_dates parameter to specify that column \"repayment_date\" should be parsed as a date object.\n", "repayment_history_df = pd.read_csv('repayment_history.csv', parse_dates=['repayment_date'])" ] }, { "cell_type": "code", "execution_count": 4, "id": "ad7f9574-8c78-4922-ac51-040d6064bc2c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 1000 entries, 0 to 999\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype\n", "--- ------ -------------- -----\n", " 0 customer_id 1000 non-null int64\n", " 1 age 1000 non-null int64\n", " 2 income 1000 non-null int64\n", " 3 credit_score 1000 non-null int64\n", "dtypes: int64(4)\n", "memory usage: 31.4 KB\n" ] } ], "source": [ "customer_info_df.info()" ] }, { "cell_type": "markdown", "id": "a966a195-63ce-4a93-be59-777edb7bfc4f", "metadata": {}, "source": [ "We will need to convert the column \"loan_issue_date\" to a date object." ] }, { "cell_type": "code", "execution_count": 5, "id": "73ee5c82-2257-405d-9afc-4643910f5b43", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 1000 entries, 0 to 999\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 loan_id 1000 non-null int64 \n", " 1 customer_id 1000 non-null int64 \n", " 2 loan_amount 1000 non-null int64 \n", " 3 loan_issued_date 1000 non-null object\n", "dtypes: int64(3), object(1)\n", "memory usage: 31.4+ KB\n" ] } ], "source": [ "loan_details_df.info()" ] }, { "cell_type": "code", "execution_count": 6, "id": "a039ae3e-19f5-44d5-af02-382d2cd9a4cb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 5000 entries, 0 to 4999\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 repayment_id 5000 non-null int64 \n", " 1 loan_id 5000 non-null int64 \n", " 2 repayment_date 5000 non-null datetime64[ns]\n", " 3 repayment_amount 5000 non-null int64 \n", "dtypes: datetime64[ns](1), int64(3)\n", "memory usage: 156.4 KB\n" ] } ], "source": [ "repayment_history_df.info()" ] }, { "cell_type": "markdown", "id": "92bd8b51-c055-4b1d-b13e-798beac64d97", "metadata": {}, "source": [ "# Task #1 \n", "\n", "Calculate the total repayment amount made by the customer and merge this information with the loan details dataset. Ensure the resulting dataset includes all columns from the loan details dataset." ] }, { "cell_type": "markdown", "id": "ca3b4f13-854d-429a-9d45-c0e7985d7d13", "metadata": {}, "source": [ "For this task, Let us merge all three datasets and then calculate the total repayment amount per customer. This is the dataframe we will use for the rest of the tutorial. " ] }, { "cell_type": "code", "execution_count": 7, "id": "441c66bd-2b4c-4427-9815-a46d1c588113", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
customer_idageincomecredit_scoreloan_idloan_amountloan_issued_daterepayment_idrepayment_daterepayment_amount
00251414457251111422022-01-0217102026-09-07303
1025141445725840257362024-04-205272023-06-12492
2025141445725840257362024-04-205892023-08-13281
3025141445725840257362024-04-2012812025-07-0599
4025141445725840257362024-04-2028312029-10-02117
\n", "
" ], "text/plain": [ " customer_id age income credit_score loan_id loan_amount \\\n", "0 0 25 141445 725 1 11142 \n", "1 0 25 141445 725 840 25736 \n", "2 0 25 141445 725 840 25736 \n", "3 0 25 141445 725 840 25736 \n", "4 0 25 141445 725 840 25736 \n", "\n", " loan_issued_date repayment_id repayment_date repayment_amount \n", "0 2022-01-02 1710 2026-09-07 303 \n", "1 2024-04-20 527 2023-06-12 492 \n", "2 2024-04-20 589 2023-08-13 281 \n", "3 2024-04-20 1281 2025-07-05 99 \n", "4 2024-04-20 2831 2029-10-02 117 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = customer_info_df.merge(loan_details_df, on='customer_id').merge(repayment_history_df, on='loan_id')\n", "df.head()" ] }, { "cell_type": "markdown", "id": "ed96c9de-2706-46e6-99e4-c46dc75f6f3a", "metadata": {}, "source": [ "We can see from the data below that our new dataframe has 5,000 rows and we do not have any null values. Remember that by defaul the `.merge` method will do an inner merge. This means the columns we merged on (customer_id and loan_id) need to have matching values on both dataframes being merged. If the values do not match, that row will not make it to the final result. \n", "\n", "In this specific case, we did loose some of the data, for example:\n", "- The customer_info_df has a customer_id=1\n", "- Customer_id=1 is not present in the loan_details_df\n", "- This means the final dataframe (df), does not contain customer_id=1" ] }, { "cell_type": "code", "execution_count": 8, "id": "9baaa419-f226-43d7-a93a-55c83f5c5608", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
customer_idageincomecredit_score
112892694692
\n", "
" ], "text/plain": [ " customer_id age income credit_score\n", "1 1 28 92694 692" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# here is customer_id=1\n", "mask = customer_info_df.loc[:,'customer_id'] == 1\n", "customer_info_df[mask]" ] }, { "cell_type": "code", "execution_count": 9, "id": "f8b2a95a-5959-45ba-b99c-35aaa992fd79", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
loan_idcustomer_idloan_amountloan_issued_date
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [loan_id, customer_id, loan_amount, loan_issued_date]\n", "Index: []" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# customer_id=1 is not found in loan_details_df\n", "mask = loan_details_df.loc[:,'customer_id'] == 1\n", "loan_details_df[mask]" ] }, { "cell_type": "code", "execution_count": 10, "id": "58589af1-ede4-40d6-9c92-d37c65b8fb77", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
customer_idageincomecredit_scoreloan_idloan_amountloan_issued_daterepayment_idrepayment_daterepayment_amount
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [customer_id, age, income, credit_score, loan_id, loan_amount, loan_issued_date, repayment_id, repayment_date, repayment_amount]\n", "Index: []" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# customer_id=1 is not found in df\n", "mask = df.loc[:,'customer_id'] == 1\n", "df[mask]" ] }, { "cell_type": "markdown", "id": "e40c7011-696a-49a6-aca0-8cb2b31ba2ef", "metadata": {}, "source": [ "To calculate the total repayment amount per customer, we can make use of the `.groupby` method." ] }, { "cell_type": "code", "execution_count": 11, "id": "539be95c-4940-4887-81b5-8c253bc67ecc", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "customer_id\n", "0 1663\n", "2 4450\n", "3 725\n", "4 2104\n", "5 2094\n", " ... \n", "994 675\n", "995 2513\n", "996 1156\n", "997 1155\n", "998 442\n", "Name: repayment_amount, Length: 629, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create group object\n", "group = df.groupby('customer_id')\n", "\n", "# get the total repayment amount for the group\n", "group['repayment_amount'].sum()" ] }, { "cell_type": "markdown", "id": "800262da-27f3-4fbf-90b5-58209f5086e6", "metadata": {}, "source": [ "If you wanted to verify the math is correct... \n", "\n", "We can find select one customer_id and manually add the values. \n", "- 303 + 492 + 281 + 99 + 117 + 371 = 1,663" ] }, { "cell_type": "code", "execution_count": 12, "id": "40c1801d-fb1b-438d-9a51-3b6aa1d83b04", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 303\n", "1 492\n", "2 281\n", "3 99\n", "4 117\n", "5 371\n", "Name: repayment_amount, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mask = df.loc[:,'customer_id'] == 0\n", "df.loc[mask,'repayment_amount']" ] }, { "cell_type": "markdown", "id": "46ce722e-6681-43d2-8ea4-c35344967843", "metadata": {}, "source": [ "# Task #2 \n", "\n", "Group customers by age brackets (25-34, 35-44, 45-54, 55+) and calculate the average loan amount and average credit score for each age group. \n", "\n", "- **bins** defines the edges of the bins.\n", "- **labels** assigns names to each bin.\n", "- **float('inf')** represents infinity, making the last bin open-ended (55+).\n", "- **include_lowest=True** ensures that the lowest value (55) is included in the last bin." ] }, { "cell_type": "code", "execution_count": 13, "id": "61e360ae-260c-4ecf-9d82-0464843a59c6", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age_brackets\n", "25-34 1552\n", "35-44 1535\n", "45-54 1411\n", "55+ 502\n", "Name: count, dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create bins for the (25-34), (35-44), (45-54) and (55+) categories\n", "bins = [25, 34, 44, 55, float('inf')] \n", " \n", "# labels for the three categories\n", "labels = ['25-34', '35-44', '45-54', '55+'] \n", " \n", "# bin it up!\n", "df['age_brackets'] = pd.cut(df['age'], bins=bins, labels=labels, include_lowest=True)\n", " \n", "# here we get a frequency count of the categories\n", "df['age_brackets'].value_counts()" ] }, { "cell_type": "markdown", "id": "0243d3b3-02b5-4bf0-ab36-93ac249368a7", "metadata": {}, "source": [ "The observed parameter in `df.groupby` affects grouping behavior when using Categorical columns, and setting it to True (future default) treats categories as part of the data's index, whereas setting it to False (current default) treats them as \"unobserved\" variables. \n", "\n", "I went in more detail in a previous tutorial: [Education-Analytics-Challenge](https://hedaro.com/Programming-Languages/Python/Data-Analyst/Data-Analyst---Education-Analytics-Challenge)" ] }, { "cell_type": "code", "execution_count": 14, "id": "643cee31-d06f-4b10-8dfb-c0484acdf03f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
avg_loan_amountavg_credit_score
age_brackets
25-3425107.916237715.453608
35-4425182.748534725.331596
45-5425559.566265725.663359
55+28249.966135712.916335
\n", "
" ], "text/plain": [ " avg_loan_amount avg_credit_score\n", "age_brackets \n", "25-34 25107.916237 715.453608\n", "35-44 25182.748534 725.331596\n", "45-54 25559.566265 725.663359\n", "55+ 28249.966135 712.916335" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# calculate the average loan amount and average credit score for each age group\n", "df.groupby('age_brackets', observed=True).agg(\n", " avg_loan_amount=pd.NamedAgg(column=\"loan_amount\", aggfunc=\"mean\"),\n", " avg_credit_score=pd.NamedAgg(column=\"credit_score\", aggfunc=\"mean\")\n", ")" ] }, { "cell_type": "markdown", "id": "991289ff-74b9-4f7a-ad4a-48d4428a8862", "metadata": {}, "source": [ "# Task #3 \n", "\n", "Use the pivot_table method to reshape the repayment history dataset, showing: \n", "- **Rows:** Months of repayment (January to December) \n", "- **Columns:** Years of repayment (e.g., 2022, 2023) \n", "- **Values:** Total repayment amount made in each month of each year \n", "\n", "In order to sort the months correctly, we need to use a map as shown below." ] }, { "cell_type": "code", "execution_count": 15, "id": "1d91ef60-4d7e-4cff-b85d-0de162b16d0b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
repayment_date20222023202420252026202720282029203020312032203320342035
repayment_date
January7876.08834.09711.08979.09236.08772.08576.08084.09712.08340.08167.07108.07630.08604.0
February7917.08264.08113.08181.07965.06685.07884.08308.08039.07589.07676.07565.08783.07098.0
March8090.08245.09570.09039.07522.08419.08936.07896.07604.07817.07499.08310.07289.08927.0
April7950.09636.08574.08980.09272.08009.08171.08437.07657.08692.08601.08754.08644.08464.0
May9821.08433.09034.08423.08261.08928.07685.09209.08665.08259.08926.08598.08573.09337.0
June7999.09103.07305.07271.09419.08276.07641.09278.08966.08488.08390.08898.08626.08272.0
July7786.09393.09565.08010.08378.09542.08496.08251.08633.08975.010388.07370.08617.08241.0
August7587.08239.08548.08866.08559.08532.08008.09171.08203.08873.07762.09981.07278.09633.0
September8666.08986.07753.08168.08277.08769.09079.08666.07734.07779.07864.09420.06240.02101.0
October9202.08138.07779.08989.07725.07485.09331.08433.09347.08671.09127.09622.09030.0NaN
November9295.08768.08246.08136.07776.08842.08161.08071.08535.08646.08221.07041.07790.0NaN
December8295.08646.07134.07425.08097.08750.09667.08965.08523.07022.08990.08686.07727.0NaN
\n", "
" ], "text/plain": [ "repayment_date 2022 2023 2024 2025 2026 2027 2028 \\\n", "repayment_date \n", "January 7876.0 8834.0 9711.0 8979.0 9236.0 8772.0 8576.0 \n", "February 7917.0 8264.0 8113.0 8181.0 7965.0 6685.0 7884.0 \n", "March 8090.0 8245.0 9570.0 9039.0 7522.0 8419.0 8936.0 \n", "April 7950.0 9636.0 8574.0 8980.0 9272.0 8009.0 8171.0 \n", "May 9821.0 8433.0 9034.0 8423.0 8261.0 8928.0 7685.0 \n", "June 7999.0 9103.0 7305.0 7271.0 9419.0 8276.0 7641.0 \n", "July 7786.0 9393.0 9565.0 8010.0 8378.0 9542.0 8496.0 \n", "August 7587.0 8239.0 8548.0 8866.0 8559.0 8532.0 8008.0 \n", "September 8666.0 8986.0 7753.0 8168.0 8277.0 8769.0 9079.0 \n", "October 9202.0 8138.0 7779.0 8989.0 7725.0 7485.0 9331.0 \n", "November 9295.0 8768.0 8246.0 8136.0 7776.0 8842.0 8161.0 \n", "December 8295.0 8646.0 7134.0 7425.0 8097.0 8750.0 9667.0 \n", "\n", "repayment_date 2029 2030 2031 2032 2033 2034 2035 \n", "repayment_date \n", "January 8084.0 9712.0 8340.0 8167.0 7108.0 7630.0 8604.0 \n", "February 8308.0 8039.0 7589.0 7676.0 7565.0 8783.0 7098.0 \n", "March 7896.0 7604.0 7817.0 7499.0 8310.0 7289.0 8927.0 \n", "April 8437.0 7657.0 8692.0 8601.0 8754.0 8644.0 8464.0 \n", "May 9209.0 8665.0 8259.0 8926.0 8598.0 8573.0 9337.0 \n", "June 9278.0 8966.0 8488.0 8390.0 8898.0 8626.0 8272.0 \n", "July 8251.0 8633.0 8975.0 10388.0 7370.0 8617.0 8241.0 \n", "August 9171.0 8203.0 8873.0 7762.0 9981.0 7278.0 9633.0 \n", "September 8666.0 7734.0 7779.0 7864.0 9420.0 6240.0 2101.0 \n", "October 8433.0 9347.0 8671.0 9127.0 9622.0 9030.0 NaN \n", "November 8071.0 8535.0 8646.0 8221.0 7041.0 7790.0 NaN \n", "December 8965.0 8523.0 7022.0 8990.0 8686.0 7727.0 NaN " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "month_order = {'January': 1, 'February': 2, 'March': 3, 'April': 4,\n", " 'May': 5, 'June': 6, 'July': 7, 'August': 8,\n", " 'September': 9, 'October': 10, 'November': 11, 'December': 12}\n", "\n", "repayment_pivot = repayment_history_df.pivot_table(values='repayment_amount', \n", " index=repayment_history_df['repayment_date'].dt.strftime('%B'), \n", " columns=repayment_history_df['repayment_date'].dt.year, aggfunc='sum')\n", "\n", "repayment_pivot.sort_index(key=lambda x: x.map(month_order))" ] }, { "cell_type": "markdown", "id": "b8422614-a2f0-4fa2-bf7a-252babd97d34", "metadata": {}, "source": [ "# Summary: \n", "\n", "The Pandas tutorial provided a comprehensive guide to analyzing customer loan behavior using three datasets: customer information, loan details, and repayment history. The tutorial covered merging datasets, grouping data, calculating aggregates, and reshaping data using pivot tables. \n", "\n", "### Key Takeaways: \n", "- **Merging datasets:** Combine customer information, loan details, and repayment history datasets using the `merge` method.\n", "- **Data grouping:** Group customers by age brackets using `pd.cut` and calculate average loan amounts and credit scores using `groupby` and `agg`.\n", "- **Data aggregation:** Calculate total repayment amounts per customer using `groupby` and `sum`.\n", "- **Pivot tables:** Reshape repayment history data using `pivot_table` to show monthly repayment amounts by year.\n", "- **Data sorting:** Sort months correctly using a custom sorting map.\n", "- **Data inspection:** Use `info` and `value_counts` to understand data distribution and quality." ] }, { "cell_type": "markdown", "id": "e731dc45-d498-45cf-a3b3-cc6199498eb2", "metadata": {}, "source": [ "

This tutorial was created by HEDARO

" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 5 }