{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Prediction of the final result for an animal in a shelter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Research plan\n", " - Dataset and features description\n", " - Exploratory data analysis\n", " - Visual analysis of the features\n", " - Patterns, insights, pecularities of data\n", " - Data preprocessing\n", " - Cross-validation, hyperparameter tuning\n", " - Validation and learning curves\n", " - Prediction for hold-out and test samples\n", " - Model evaluation with metrics description\n", " - Conclusions\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 1. Dataset and features description" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dataset from [Kaggle page](https://www.kaggle.com/aaronschlegel/austin-animal-center-shelter-outcomes-and/home). The dataset has the following features:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Dataset from Austin Animal Center Shelter contains information about animals in the shelter and their outcome. It is necessary to build a model that predicts the outcome of the animal's stay in the shelter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see the features below:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- __age_upon_outcome__ - Age of the animal at the time at which it left the shelter.\n", "- __animal_id__ \n", "- __animal_type__ - Cat, dog, or other (including at least one bat!).\n", "- __breed__ - Animal breed. Many animals are generic mixed-breeds, e.g. \"Long-haired mix\".\n", "- __color__ - Color of the animal's fur, if it has fur.\n", "- __date_of_birth__\n", "- __datetime__\n", "- __monthyear__\n", "- __name__\n", "- __outcome_subtype__\n", "- __outcome_type__ - Ultimate outcome for this animal. Possible entries include transferred, [mercy] euthanized, adopted.\n", "- __sex_upon_outcome__" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "import string\n", "import numpy as np\n", "from sklearn.preprocessing import LabelEncoder\n", "from matplotlib import pyplot as plt\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier\n", "from sklearn.neighbors import KNeighborsClassifier \n", "from sklearn.metrics import accuracy_score\n", "from sklearn.model_selection import cross_val_score" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "path = 'shelter.csv'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(path, parse_dates=['date_of_birth','datetime','monthyear'])\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 2. Exploratory data analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see a mix of categorical, numeric features and a date.\n", "In the task of predicting the outcome of an animal in an orphanage, the target variable is outcome_subtype. outcome_subtype is of several types, so the task is reduced to a multi-class classification" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Many features contain null values, in the future we will correct this.\n", "2. To predict the outcome, we will not use an outcome subtype, so it should be removed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.drop('outcome_subtype', axis=1, inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some null values can be filled, for example, with an average or median, but not in the case of the sex_upon_outcome and outcome_type features, because these variables are significant and consist of categorical values, so we also get rid of null values.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df[(~df.sex_upon_outcome.isnull()) & (~df.outcome_type.isnull())]\n", "# Empty names fill in \"Unknown\"\n", "df.name = df.name.fillna('Unknown')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see on age feature" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.age_upon_outcome.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Values have a different format: in weeks, days and years. Let's recalculate the age, based on the date of birth and the day of recording information in the file and bring everything to one format - the day. Then we look at the description of the values obtained." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['age_in_days'] = (df['datetime'] - df['date_of_birth']).dt.days\n", "df['age_in_days'].describe()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Replace negative values with zero.\n", "df[df['age_in_days']<0]=0" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.drop('age_upon_outcome', axis=1, inplace=True)\n", "df.drop(['date_of_birth','datetime', 'monthyear'], axis=1, inplace=True)\n", "df.drop('animal_id', axis=1, inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now all features do not have empty values and we can proceed to further actions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to perform the conversion on string features, maybe it will reduce the number of unique values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def punctuation_free(text):\n", " text = text.replace('/', ' ')\n", " return ''.join([char for char in text if char not in string.punctuation])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df[df.animal_type!=0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "strings = ['animal_type','breed','color','name']\n", "for i in strings:\n", " df[i] = df[i].apply(lambda x: punctuation_free(x.lower()))\n", "df.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Indeed, this transformation has helped us reduce the number of unique values. You can try to apply any other transformations, but we will stop there and go ahead." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['outcome_type'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 3. Visual analysis of the features" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12,4))\n", "sns.countplot(y=df['outcome_type'], \n", " palette='mako_r',\n", " order=df['outcome_type'].value_counts().index)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that in most animals a good result, since adoption takes the largest share, and negative results are very rare. In this case, we see that the classes are not really balanced, and the fact that they have not so many opportunities to work, it seems that it is correct to predict death or any of these unlikely results will be a problem." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12,6))\n", "sns.countplot(data=df,\n", " x='animal_type',\n", " hue='outcome_type')\n", "plt.legend(loc='upper right')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems that the distribution of results also differs from animal types; we can clearly see that dogs are more likely to be returned to the owner and are attached than cats. And animals from another category are more likely to be euthanized than any other result." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "g = sns.FacetGrid(df, hue=\"animal_type\", size=12)\n", "g.map(sns.kdeplot, \"age_in_days\") \n", "g.add_legend()\n", "g.set(xlim=(0,5000), xticks=range(0,5000,365))\n", "plt.show(g)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the trend here, if we look more closely, we will see that these peaks occur in what would be when the animal has completed another year. It makes sense if we think that animal shelters will make cutoffs for age when deciding what to do with an animal. For example, when an animal completes 4 years, and they suffer it. Or maybe they don’t know the exact age of the animal, so the “Date of Birth” column, from which we calculated our dates, is only an approximation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It seems that most cats are adopted during the first months, we also see that there is an annual trend, and that there are many deliverables in the first year. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 3. Patterns, insights, pecularities of data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since we have categorical data, it is necessary to encode them, including the target variable\n", "For the target, we will use the Label Encoder, for the other categorical features we compare two ways: \n", "- get dummies + LabelEncoder \n", "- LabelEncoder only" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 4. Data preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.sex_upon_outcome.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "sex_upon_outcomeLet's add new features for sex instead of sex_upon_outcome" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Intact'] = df.sex_upon_outcome.apply(lambda x: 1 if 'Intact' in x else 0)\n", "df['Spayed'] = df.sex_upon_outcome.apply(lambda x: 1 if 'Spayed' in x else 0)\n", "df['Neutered'] = df.sex_upon_outcome.apply(lambda x: 1 if 'Neutered' in x else 0)\n", "df['Male'] = df.sex_upon_outcome.apply(lambda x: 1 if 'Male' in x else 0)\n", "df['Female'] = df.sex_upon_outcome.apply(lambda x: 1 if 'Female' in x else 0)\n", "df['Unknown_sex'] = df.sex_upon_outcome.apply(lambda x: 1 if 'Unknown' in x else 0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = df.drop('sex_upon_outcome', axis = 1)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_df = df.drop('outcome_type', axis=1)[:20000]\n", "y_df = df['outcome_type'][:20000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to apply LaberEncoder to features.\n", "For faster calculations, reduce the sample size (no need to do that if you have enough power)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "le = LabelEncoder()\n", "\n", "X_df.name = le.fit_transform(X_df['name'])\n", "X_df.animal_type = le.fit_transform(X_df['animal_type'])\n", "X_df.color = le.fit_transform(X_df['color'])\n", "X_df.breed = le.fit_transform(X_df['breed'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_df = le.fit_transform(y_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model selection " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "#### There are many different models for solving classification problems. For this task, it is proposed to evaluate the performance of models:\n", " - KNeighborsClassifier\n", " - RandomForestClassifier\n", " - GradientBoostingClassifier\n", " \n", " We will configure the hyperparameters for the model that gives the best results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "knc = KNeighborsClassifier()\n", "rfc = RandomForestClassifier(random_state=17)\n", "gbc = GradientBoostingClassifier(random_state=17)\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "algos = []\n", "predictions = []\n", "data = []\n", "algos.append(knc)\n", "algos.append(rfc)\n", "algos.append(gbc)\n", "for i in algos:\n", " i.fit(X_train, y_train)\n", " data.append({'accuracy_score': accuracy_score(i.predict(X_test), y_test)})\n", "results = pd.DataFrame(data=data, columns=['accuracy_score'],\n", " index=['KNeighborsClassifier', 'RandomForestClassifier', \n", " 'GradientBoostingClassifier'])\n", "\n", "results " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to encode the data using pd.get_dummies, let's see how the quality of the models will change" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_df = df.drop('outcome_type', axis=1)[:10000]\n", "X_df = pd.get_dummies(X_df, columns=['animal_type','color','breed'])\n", "X_df.name = le.fit_transform(X_df['name'])\n", "X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "algos = []\n", "predictions = []\n", "data = []\n", "algos.append(knc)\n", "algos.append(rfc)\n", "algos.append(gbc)\n", "for i in algos:\n", " i.fit(X_train, y_train)\n", " data.append({'accuracy_score': accuracy_score(i.predict(X_test), y_test)})\n", "results = pd.DataFrame(data=data, columns=['accuracy_score'],\n", " index=['KNeighborsClassifier', 'RandomForestClassifier', \n", " 'GradientBoostingClassifier'])\n", "\n", "results " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even with a small amount of data, the model has been trained much longer than the first way.\n", "You may notice that the KN classifier has become more accurate to make a forecast, but nevertheless, Gradient Boosting Classifier copes with this task better in both cases than other models. Also, the result using the LabelEncoder is higher than using get_dummies.\n", "So, let's take the entire amount of test data with Gradient Boosting Classifier and LabelEncoder and set up hypermarameters on Cross-validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 6. Cross-validation, hyperparameter tuning" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#we take more data\n", "X_df = df.drop('outcome_type', axis=1)[:10000]\n", "y_df = df['outcome_type'][:10000]\n", "X_df.name = le.fit_transform(X_df['name'])\n", "X_df.animal_type = le.fit_transform(X_df['animal_type'])\n", "X_df.color = le.fit_transform(X_df['color'])\n", "X_df.breed = le.fit_transform(X_df['breed'])\n", "y_df = le.fit_transform(y_df)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params_grid = {'max_features': [100, 200,348]}\n", "model_grid = GridSearchCV(gbc,params_grid, cv=5)\n", "model_grid.fit(X_train, y_train)\n", "model_grid.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 7. Validation and learning curves" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import make_scorer\n", "\n", "# Create scorer with our accuracy-function\n", "scorer = make_scorer(accuracy_score, greater_is_better=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_df = df.drop('outcome_type', axis=1)[:1000]\n", "y_df = df['outcome_type'][:1000]\n", "X_df = pd.get_dummies(X_df, columns=['animal_type','color','breed'])\n", "X_df.name = le.fit_transform(X_df['name'])\n", "y_df = le.fit_transform(y_df)\n", "X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.3)\n", "\n", "max_depth_list = [25, 35,45]\n", "cv_errors_list = []\n", "train_errors_list = []\n", "valid_errors_list = []\n", "\n", "for max_depth in max_depth_list:\n", " gbc = GradientBoostingClassifier(max_depth=max_depth,random_state=17, max_features = 200)\n", "\n", " cv_errors = cross_val_score(estimator=gbc, \n", " X=X_train, \n", " y=y_train, \n", " scoring=scorer,\n", " cv=3) \n", " cv_errors_list.append(cv_errors.mean())\n", " \n", " gbc.fit(X=X_train, y=y_train)\n", " train_error = accuracy_score(y_train, gbc.predict(X_train)) \n", " train_errors_list.append(train_error)\n", " valid_error = accuracy_score(y_test, gbc.predict(X_test)) \n", " valid_errors_list.append(valid_error)\n", " \n", " print(max_depth)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10, 7))\n", "\n", "plt.plot(max_depth_list,cv_errors_list)\n", "plt.plot(max_depth_list,valid_errors_list)\n", "plt.vlines(x=max_depth_list[np.array(cv_errors_list).argmin()], \n", " ymin=0.62, ymax=0.68, \n", " linestyles='dashed', colors='r')\n", "\n", "plt.legend(['Cross validation accuracy on train', \n", " 'accuracy on validation set', \n", " 'Best Max_depth value on CV'])\n", "plt.title(\"Accuracy test sets.\")\n", "plt.xlabel('Max_depth value')\n", "plt.ylabel('accuracy value')\n", "plt.grid()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gbc = GradientBoostingClassifier(random_state=17, max_features = 200)\n", "gbc.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accuracy_score(gbc.predict(X_test), y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Part 7. Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We got a good result, but we should carry out a more detailed setting of the hyperparameters and use the full amount of data in the learning process. The solution may be useful for shelters that collect such data and try to predict the approximate result of the animal detection in the shelter. This is important because it allows them to understand which signs more affect the positive outcome for animals and which ones promise more negative." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }