{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Updated 2022-09-11 12:52:08.931765\n" ] } ], "source": [ "from datetime import datetime\n", "print(f'Updated {datetime.now()}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Classification - categorical variables, balancing\n", "\n", "The data is available, for example, from:\n", "https://archive.ics.uci.edu/ml/datasets/bank+marketing\n", "\n", "The target variable y indicates whether the customer has a term deposit.\n", "\n", "We examine whether the values of y can be predicted from the customer data.\n", "\n", "The data includes categorical variables, which must be converted to dummy variables." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "# For balancing the data\n", "from imblearn.over_sampling import RandomOverSampler\n", "from imblearn.under_sampling import RandomUnderSampler\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.ensemble import GradientBoostingClassifier\n", "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", "\n", "# Show all columns of the data\n", "pd.options.display.max_columns = None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
41188 rows × 21 columns
" ], "text/plain": [ " age job marital education default housing loan \\\n", "0 44 blue-collar married basic.4y unknown yes no \n", "1 53 technician married unknown no no no \n", "2 28 management single university.degree no yes no \n", "3 39 services married high.school no no no \n", "4 55 retired married basic.4y no yes no \n", "... ... ... ... ... ... ... ... \n", "41183 59 retired married high.school unknown no yes \n", "41184 31 housemaid married basic.4y unknown no no \n", "41185 42 admin. single university.degree unknown yes yes \n", "41186 48 technician married professional.course no no yes \n", "41187 25 student single high.school no no no \n", "\n", " contact month day_of_week duration campaign pdays previous \\\n", "0 cellular aug thu 210 1 999 0 \n", "1 cellular nov fri 138 1 999 0 \n", "2 cellular jun thu 339 3 6 2 \n", "3 cellular apr fri 185 2 999 0 \n", "4 cellular aug fri 137 1 3 1 \n", "... ... ... ... ... ... ... ... \n", "41183 telephone jun thu 222 1 999 0 \n", "41184 telephone may thu 196 2 999 0 \n", "41185 telephone may wed 62 3 999 0 \n", "41186 telephone oct tue 200 2 999 0 \n", "41187 telephone may fri 112 4 999 0 \n", "\n", " poutcome emp_var_rate cons_price_idx cons_conf_idx euribor3m \\\n", "0 nonexistent 1.4 93.444 -36.1 4.963 \n", "1 nonexistent -0.1 93.200 -42.0 4.021 \n", "2 success -1.7 94.055 -39.8 0.729 \n", "3 nonexistent -1.8 93.075 -47.1 1.405 \n", "4 success -2.9 92.201 -31.4 0.869 \n", "... ... ... ... ... ... \n", "41183 nonexistent 1.4 94.465 -41.8 4.866 \n", "41184 nonexistent 1.1 93.994 -36.4 4.860 \n", "41185 nonexistent 1.1 93.994 -36.4 4.857 \n", "41186 nonexistent -3.4 92.431 -26.9 0.742 \n", "41187 nonexistent 1.1 93.994 -36.4 4.859 \n", "\n", " nr_employed y \n", "0 5228.1 0 \n", "1 5195.8 0 \n", "2 4991.6 1 \n", "3 5099.1 0 \n", "4 5076.2 1 \n", "... ... .. 
\n", "41183 5228.1 0 \n", "41184 5191.0 0 \n", "41185 5191.0 0 \n", "41186 5017.5 0 \n", "41187 5191.0 0 \n", "\n", "[41188 rows x 21 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('https://taanila.fi/banking.csv')\n", "df" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 36548\n", "1 4640\n", "Name: y, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The distribution of the target variable shows that the classes are imbalanced, so balancing the data is worthwhile\n", "df['y'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Variable transformations" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['basic.4y', 'unknown', 'university.degree', 'high.school',\n", " 'basic.9y', 'professional.course', 'basic.6y', 'illiterate'],\n", " dtype=object)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['education'].unique()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['basic', 'unknown', 'university.degree', 'high.school',\n", " 'professional.course', 'illiterate'], dtype=object)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Recode the education variable: merge the three basic levels into one category\n", "df['education'] = df['education'].replace({'basic.4y':'basic', 'basic.6y':'basic', 'basic.9y':'basic'})\n", "df['education'].unique()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 41188 entries, 0 to 41187\n", "Data columns (total 62 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 age 41188 non-null int64 \n", " 1 duration 41188 non-null int64 \n", " 2 campaign 
41188 non-null int64 \n", " 3 pdays 41188 non-null int64 \n", " 4 previous 41188 non-null int64 \n", " 5 emp_var_rate 41188 non-null float64\n", " 6 cons_price_idx 41188 non-null float64\n", " 7 cons_conf_idx 41188 non-null float64\n", " 8 euribor3m 41188 non-null float64\n", " 9 nr_employed 41188 non-null float64\n", " 10 y 41188 non-null int64 \n", " 11 job_admin. 41188 non-null uint8 \n", " 12 job_blue-collar 41188 non-null uint8 \n", " 13 job_entrepreneur 41188 non-null uint8 \n", " 14 job_housemaid 41188 non-null uint8 \n", " 15 job_management 41188 non-null uint8 \n", " 16 job_retired 41188 non-null uint8 \n", " 17 job_self-employed 41188 non-null uint8 \n", " 18 job_services 41188 non-null uint8 \n", " 19 job_student 41188 non-null uint8 \n", " 20 job_technician 41188 non-null uint8 \n", " 21 job_unemployed 41188 non-null uint8 \n", " 22 job_unknown 41188 non-null uint8 \n", " 23 marital_divorced 41188 non-null uint8 \n", " 24 marital_married 41188 non-null uint8 \n", " 25 marital_single 41188 non-null uint8 \n", " 26 marital_unknown 41188 non-null uint8 \n", " 27 education_basic 41188 non-null uint8 \n", " 28 education_high.school 41188 non-null uint8 \n", " 29 education_illiterate 41188 non-null uint8 \n", " 30 education_professional.course 41188 non-null uint8 \n", " 31 education_university.degree 41188 non-null uint8 \n", " 32 education_unknown 41188 non-null uint8 \n", " 33 default_no 41188 non-null uint8 \n", " 34 default_unknown 41188 non-null uint8 \n", " 35 default_yes 41188 non-null uint8 \n", " 36 housing_no 41188 non-null uint8 \n", " 37 housing_unknown 41188 non-null uint8 \n", " 38 housing_yes 41188 non-null uint8 \n", " 39 loan_no 41188 non-null uint8 \n", " 40 loan_unknown 41188 non-null uint8 \n", " 41 loan_yes 41188 non-null uint8 \n", " 42 contact_cellular 41188 non-null uint8 \n", " 43 contact_telephone 41188 non-null uint8 \n", " 44 month_apr 41188 non-null uint8 \n", " 45 month_aug 41188 non-null uint8 \n", " 46 month_dec 41188 non-null 
uint8 \n", " 47 month_jul 41188 non-null uint8 \n", " 48 month_jun 41188 non-null uint8 \n", " 49 month_mar 41188 non-null uint8 \n", " 50 month_may 41188 non-null uint8 \n", " 51 month_nov 41188 non-null uint8 \n", " 52 month_oct 41188 non-null uint8 \n", " 53 month_sep 41188 non-null uint8 \n", " 54 day_of_week_fri 41188 non-null uint8 \n", " 55 day_of_week_mon 41188 non-null uint8 \n", " 56 day_of_week_thu 41188 non-null uint8 \n", " 57 day_of_week_tue 41188 non-null uint8 \n", " 58 day_of_week_wed 41188 non-null uint8 \n", " 59 poutcome_failure 41188 non-null uint8 \n", " 60 poutcome_nonexistent 41188 non-null uint8 \n", " 61 poutcome_success 41188 non-null uint8 \n", "dtypes: float64(5), int64(6), uint8(51)\n", "memory usage: 5.5 MB\n" ] } ], "source": [ "# Convert categorical variables to dummy variables\n", "df_dummies = pd.get_dummies(df)\n", "df_dummies.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Balancing and model fitting" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "X = df_dummies.drop('y', axis=1)\n", "y = df_dummies['y']\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GradientBoostingClassifier(random_state=2)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fit gradient boosting without balancing the data\n", "gbc1 = GradientBoostingClassifier(max_depth=3, random_state=2)\n", "gbc1.fit(X_train, y_train)\n", "\n", "# Balancing with RandomOverSampler + gradient boosting\n", "# RandomOverSampler enlarges the smaller class by random sampling with replacement\n", "ros = RandomOverSampler(random_state=2)\n", "X_train2, y_train2 = ros.fit_resample(X_train, y_train)\n", "gbc2 = GradientBoostingClassifier(max_depth=3, random_state=2)\n", "gbc2.fit(X_train2, y_train2)\n", "\n", "# Balancing with RandomUnderSampler + gradient boosting\n", "# RandomUnderSampler shrinks the larger class by random sampling\n", "rus = RandomUnderSampler(random_state=2)\n", "X_train3, y_train3 = rus.fit_resample(X_train, y_train)\n", "gbc3 = GradientBoostingClassifier(max_depth=3, random_state=2)\n", "gbc3.fit(X_train3, y_train3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating the models" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Without balancing 0.923\n", "RandomOverSampler 0.861\n", "RandomUnderSampler 0.860\n" ] } ], "source": [ "# Share of correct predictions (accuracy) on the original, unbalanced training data\n", "print(f'Without balancing {gbc1.score(X_train, y_train):.3f}')\n", "print(f'RandomOverSampler {gbc2.score(X_train, y_train):.3f}')\n", "print(f'RandomUnderSampler {gbc3.score(X_train, y_train):.3f}')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Without balancing 0.915\n", "RandomOverSampler 0.854\n", "RandomUnderSampler 0.854\n" ] } ], "source": [ "# Share of correct predictions (accuracy) on the test data\n", "print(f'Without balancing {gbc1.score(X_test, y_test):.3f}')\n", "print(f'RandomOverSampler {gbc2.score(X_test, y_test):.3f}')\n", "print(f'RandomUnderSampler {gbc3.score(X_test, y_test):.3f}')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Predictions of the models for the test data\n", "y_pred1 = gbc1.predict(X_test)\n", "y_pred2 = gbc2.predict(X_test)\n", "y_pred3 = gbc3.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Confusion matrix without balancing\n", "cm = confusion_matrix(y_test, y_pred1)\n", "ConfusionMatrixDisplay(confusion_matrix=cm).plot()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Confusion matrix (balancing with RandomOverSampler)\n", "cm = confusion_matrix(y_test, y_pred2)\n", "ConfusionMatrixDisplay(confusion_matrix=cm).plot()" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Confusion matrix (balancing with RandomUnderSampler)\n", "cm = confusion_matrix(y_test, y_pred3)\n", "ConfusionMatrixDisplay(confusion_matrix=cm).plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the goal is to pick out from new data the customers most likely to make a term deposit, RandomUnderSampler + gradient boosting appears to be the best model. Its test accuracy (0.854) is lower than that of the model fitted without balancing (0.915), but it identifies most of the customers in the test data who made a term deposit." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 2 }