{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "acceptable-netherlands", "metadata": {}, "outputs": [], "source": [ "\"\"\"\n", "Today we will look at ensemble learning, which is an often undervalued part of machine learning. The way it works is\n", "by taking multiple different classifiers, training them, and then averaging their predictions into one \"ensemble\" prediction. \n", "You want a diverse set of classifiers whose predictions you can aggregate. The reason this works is that even if you have \n", "subpar classifiers with a 60% chance of predicting correctly on their own, the accuracy of the group will increase more \n", "and more as you add classifiers, as long as they have all learnt different parameters and so make different mistakes. This can be surprisingly effective. \n", "The two algorithms we will look at are Random Forests and Ensemble Classifiers. \n", "\"\"\"\n", "\n", "# let's get started\n", "import sealion as sl \n", "from sealion.ensemble_learning import RandomForest, EnsembleClassifier" ] }, { "cell_type": "code", "execution_count": 2, "id": "aquatic-great", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
" <tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
" <tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
" <tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
" <tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>886</th><td>887</td><td>0</td><td>2</td><td>Montvila, Rev. Juozas</td><td>male</td><td>27.0</td><td>0</td><td>0</td><td>211536</td><td>13.0000</td><td>NaN</td><td>S</td></tr>\n",
" <tr><th>887</th><td>888</td><td>1</td><td>1</td><td>Graham, Miss. Margaret Edith</td><td>female</td><td>19.0</td><td>0</td><td>0</td><td>112053</td><td>30.0000</td><td>B42</td><td>S</td></tr>\n",
" <tr><th>888</th><td>889</td><td>0</td><td>3</td><td>Johnston, Miss. Catherine Helen \"Carrie\"</td><td>female</td><td>NaN</td><td>1</td><td>2</td><td>W./C. 6607</td><td>23.4500</td><td>NaN</td><td>S</td></tr>\n",
" <tr><th>889</th><td>890</td><td>1</td><td>1</td><td>Behr, Mr. Karl Howell</td><td>male</td><td>26.0</td><td>0</td><td>0</td><td>111369</td><td>30.0000</td><td>C148</td><td>C</td></tr>\n",
" <tr><th>890</th><td>891</td><td>0</td><td>3</td><td>Dooley, Mr. Patrick</td><td>male</td><td>32.0</td><td>0</td><td>0</td><td>370376</td><td>7.7500</td><td>NaN</td><td>Q</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 12 columns</p>\n",
"</div>" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", ".. ... ... ... \n", "886 887 0 2 \n", "887 888 1 1 \n", "888 889 0 3 \n", "889 890 1 1 \n", "890 891 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", ".. ... ... ... ... \n", "886 Montvila, Rev. Juozas male 27.0 0 \n", "887 Graham, Miss. Margaret Edith female 19.0 0 \n", "888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n", "889 Behr, Mr. Karl Howell male 26.0 0 \n", "890 Dooley, Mr. Patrick male 32.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S \n", ".. ... ... ... ... ... \n", "886 0 211536 13.0000 NaN S \n", "887 0 112053 30.0000 B42 S \n", "888 2 W./C. 6607 23.4500 NaN S \n", "889 0 111369 30.0000 C148 C \n", "890 0 370376 7.7500 NaN Q \n", "\n", "[891 rows x 12 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"\"\"\n", "Random Forests are just a collection of decision trees whose predictions are aggregated. The way this works\n", "is by training each decision tree on a different set of data, so you get a diverse set of trees whose predictions\n", "are then averaged and given to the user. Because each tree sees different data, no two trees will be the same. We also \n", "have some added functionality which you will see in a bit.
We are going to use the Titanic dataset with the \n", "same preprocessing steps as the decision trees examples, so feel free to skip down to the \"X_train, X_test = new_X_train, new_X_test\" cell.\n", "These random forests can be trained in parallel on multiple CPU cores, which is exactly what SeaLion does. \n", "\"\"\"\n", "\n", "import pandas as pd\n", "# first we can load in the dataset\n", "titanic_dataframe = pd.read_csv(\"titanic_dataset.csv\") # on my local computer\n", "titanic_dataframe # display it" ] }, { "cell_type": "code", "execution_count": 3, "id": "proud-story", "metadata": {}, "outputs": [], "source": [ "# looks like it has 891 rows and 12 columns. First we can delete some of the features we know we won't use. \n", "titanic_dataframe = titanic_dataframe.drop(['Name', 'Ticket', 'Cabin'], axis = 1) # non-numeric text columns\n", "titanic_dataframe = titanic_dataframe.fillna(0) # replace missing values (e.g. unknown ages) with 0" ] }, { "cell_type": "code", "execution_count": 4, "id": "backed-latter", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Fare</th><th>Embarked</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>1</td><td>0</td><td>3</td><td>0</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>0</td></tr>\n",
" <tr><th>1</th><td>2</td><td>1</td><td>1</td><td>1</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>1</td></tr>\n",
" <tr><th>2</th><td>3</td><td>1</td><td>3</td><td>1</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>0</td></tr>\n",
" <tr><th>3</th><td>4</td><td>1</td><td>1</td><td>1</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>0</td></tr>\n",
" <tr><th>4</th><td>5</td><td>0</td><td>3</td><td>0</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>0</td></tr>\n",
" <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
" <tr><th>886</th><td>887</td><td>0</td><td>2</td><td>0</td><td>27.0</td><td>0</td><td>0</td><td>13.0000</td><td>0</td></tr>\n",
" <tr><th>887</th><td>888</td><td>1</td><td>1</td><td>1</td><td>19.0</td><td>0</td><td>0</td><td>30.0000</td><td>0</td></tr>\n",
" <tr><th>888</th><td>889</td><td>0</td><td>3</td><td>1</td><td>0.0</td><td>1</td><td>2</td><td>23.4500</td><td>0</td></tr>\n",
" <tr><th>889</th><td>890</td><td>1</td><td>1</td><td>0</td><td>26.0</td><td>0</td><td>0</td><td>30.0000</td><td>1</td></tr>\n",
" <tr><th>890</th><td>891</td><td>0</td><td>3</td><td>0</td><td>32.0</td><td>0</td><td>0</td><td>7.7500</td><td>2</td></tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 9 columns</p>\n",
"</div>" ], "text/plain": [ " PassengerId Survived Pclass Sex Age SibSp Parch Fare Embarked\n", "0 1 0 3 0 22.0 1 0 7.2500 0\n", "1 2 1 1 1 38.0 1 0 71.2833 1\n", "2 3 1 3 1 26.0 0 0 7.9250 0\n", "3 4 1 1 1 35.0 1 0 53.1000 0\n", "4 5 0 3 0 35.0 0 0 8.0500 0\n", ".. ... ... ... .. ... ... ... ... ...\n", "886 887 0 2 0 27.0 0 0 13.0000 0\n", "887 888 1 1 1 19.0 0 0 30.0000 0\n", "888 889 0 3 1 0.0 1 2 23.4500 0\n", "889 890 1 1 0 26.0 0 0 30.0000 1\n", "890 891 0 3 0 32.0 0 0 7.7500 2\n", "\n", "[891 rows x 9 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# next we can change the sex column. Male will be 0 and Female will be 1. \n", "import numpy as np # we'll need this too\n", "sex_col = np.array(titanic_dataframe['Sex'])\n", "sex_col[np.where(sex_col == \"male\")] = 0\n", "sex_col[np.where(sex_col == \"female\")] = 1\n", "titanic_dataframe[\"Sex\"] = sex_col\n", "\n", "# we can also change the embarked column - for now we map it to integer codes, and we will one-hot-encode it later\n", "from sealion.utils import one_hot\n", "embarked_col = np.array(titanic_dataframe[\"Embarked\"])\n", "embarked_col[np.where(embarked_col == \"S\")] = 0\n", "embarked_col[np.where(embarked_col == \"C\")] = 1\n", "embarked_col[np.where(embarked_col == \"Q\")] = 2 \n", "titanic_dataframe[\"Embarked\"] = embarked_col\n", "titanic_dataframe" ] }, { "cell_type": "code", "execution_count": 5, "id": "lonely-building", "metadata": {}, "outputs": [], "source": [ "# looks like we are all set.
Time to get the labels and the training and testing data\n", "y = np.array(titanic_dataframe['Survived'])\n", "titanic_dataframe = titanic_dataframe.drop('Survived', axis = 1)\n", "X = np.array(titanic_dataframe)\n", "\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15, random_state = 42)" ] }, { "cell_type": "code", "execution_count": 6, "id": "anonymous-discipline", "metadata": {}, "outputs": [], "source": [ "#we'll need to one_hot_encode the last column of X_train and X_test (Embarked)\n", "embarked_train = one_hot(X_train[:, -1], depth = 3)\n", "embarked_test = one_hot(X_test[:, -1], depth = 3)" ] }, { "cell_type": "code", "execution_count": 7, "id": "located-parking", "metadata": {}, "outputs": [], "source": [ "# we'll have to use a bit of the long route to avoid typical rules of numpy\n", "new_X_train, new_X_test = [], []\n", "X_train, X_test = np.array(X_train).tolist(), np.array(X_test).tolist() # turn them into regular python lists\n", "for row in range(len(X_train)) : \n", " observation = X_train[row] # get the row\n", " observation[-1] = embarked_train[row].tolist() # .tolist() helps make sure it can be interpreted (only needed as of v3.0.8 if you are using one_hot_encoded data) \n", " new_X_train.append(observation)\n", " \n", "for row in range(len(X_test)) : \n", " observation = X_test[row] # get the row\n", " observation[-1] = embarked_test[row].tolist()\n", " new_X_test.append(observation)" ] }, { "cell_type": "code", "execution_count": 8, "id": "relative-lighter", "metadata": {}, "outputs": [], "source": [ "X_train, X_test = new_X_train, new_X_test # just change the name" ] }, { "cell_type": "code", "execution_count": 25, "id": "conceptual-teach", "metadata": {}, "outputs": [], "source": [ "# yay now we have our data. We can now apply the random forest algorithm. 
\n", "\n", "rf = RandomForest(num_classifiers = 20, replacement = True, min_data = 50)\n", "\n", "# a quick word on the arguments\n", "# num_classifiers is just the number of trees to be made\n", "# max_branches and min_samples are the same arguments as in decision trees\n", "# replacement is whether or not to bootstrap. When deciding the dataset for each of the 20 trees, you may \n", "# want no data points shared across the datasets, or you may be fine with some overlap. In general using bootstrapping\n", "# does better on testing data at the expense of training data. \n", "# min_data is simply the minimum number of data points you need each set to have. If you set replacement = True, \n", "# then each decision tree will get a different amount of data: one may get 5 samples and \n", "# another may get 500. So if you want to ensure that every dataset gets at least some X amount of data you can set that here. \n", "\n", "rf.fit(X_train, y_train) # let's train it!" ] }, { "cell_type": "code", "execution_count": 28, "id": "preliminary-lawrence", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Validation accuracy : 0.96\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# as usual we'll want to evaluate it and visualize its evaluation\n", "print(\"Validation accuracy : \", rf.evaluate(X_test, y_test))\n", "rf.visualize_evaluation(rf.predict(X_test), y_test)" ] }, { "cell_type": "code", "execution_count": 29, "id": "lasting-transsexual", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9733333333333334" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 96% accuracy is really good compared to what we were getting before with a decision tree (78%) \n", "# but here's where the cool part comes in!\n", "\n", "# You know - I bet you that one of those trees in the random forest is better than the forest as a whole, so \n", "# why don't we just go find that tree, get it, and then use it? Seriously. \n", "\n", "from sealion.decision_trees import DecisionTree\n", "best_tree = rf.give_best_tree(X_test, y_test) # get the tree that scores best on the data you give it\n", "dt = DecisionTree()\n", "dt.give_tree(best_tree) # enter the best tree trained from the random forest\n", "dt.evaluate(X_test, y_test) # see how well that best tree in the random forest did" ] }, { "cell_type": "code", "execution_count": 18, "id": "primary-belgium", "metadata": {}, "outputs": [], "source": [ "#WOW! 97%? Moving from 96% to 97% is a big deal - it gets exponentially harder to make the model near perfect. \n", "# (Keep in mind the best tree was picked using this same test data, so its score here is on the optimistic side.)\n", "\n", "# That's mostly it for random forests. Onto ensemble classifiers!\n", "# the way ensemble classifiers work is super simple. All they do is take in a bunch of predictors, \n", "# train all of them, and then average all of their predictions. It's basically a random forest but for other classifiers.
\n", "\n", "# for this we can use the blobs dataset \n", "from sklearn.datasets import make_blobs\n", "from sklearn.model_selection import train_test_split\n", "X, y = make_blobs(500, random_state = 2, centers = 3)\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15, random_state = 3)" ] }, { "cell_type": "code", "execution_count": 19, "id": "egyptian-archives", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# as usual we can visualize our dataset\n", "import matplotlib.pyplot as plt\n", "fig = plt.figure() \n", "ax = fig.add_subplot()\n", "ax.scatter(X[:, 0], X[:, 1])\n", "plt.title(\"Blobs Dataset\")\n", "plt.xlabel(\"x-axis\")\n", "plt.ylabel(\"y-axis\")\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 20, "id": "improved-intake", "metadata": {}, "outputs": [], "source": [ "# well the first thing we have to do is set up our classifiers. We are going to be using k-nearest-neighbors, \n", "# gaussian naive bayes, and a decision tree.\n", "\n", "from sealion.naive_bayes import GaussianNaiveBayes\n", "from sealion.nearest_neighbors import KNearestNeighbors\n", "\n", "knn = KNearestNeighbors(k = 5) \n", "gnb = GaussianNaiveBayes()\n", "dt = DecisionTree(min_samples = 5, max_branches = 25)" ] }, { "cell_type": "code", "execution_count": 30, "id": "fewer-thomas", "metadata": {}, "outputs": [], "source": [ "# then we can set up our classifiers dict\n", "\n", "classifiers_dict = {\"k-nearest-neighbors\" : knn, \"gaussian_nb\" : gnb, \"decision_trees\" : dt} # give a name to each of your classifiers\n", "\n", "ec = EnsembleClassifier(classifiers_dict, classification = True) # set it up (note : can't use this for neural nets)\n", "\n", "# classification = True because we are doing classification, but set it to False for regression. The default is True,\n", "# and an easy way to remember this is that it's an EnsembleCLASSifier. \n", "\n", "ec.fit(X_train, y_train) # we can now train all the classifiers - will train all algos on all CPU cores available in parallel" ] }, { "cell_type": "code", "execution_count": 31, "id": "metallic-therapy", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9866666666666667" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# well ....
let's evaluate it\n", "ec.evaluate(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 32, "id": "weighted-german", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "k-nearest-neighbors : 0.9866666666666667\n", "gaussian_nb : 0.9866666793823242\n", "decision_trees : 0.9733333333333334\n" ] } ], "source": [ "# looks like it did well, but even at the top there's a hierarchy. Let's see which one did best. \n", "ec.evaluate_all_predictors(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 27, "id": "compatible-grass", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/plain": [ "array([ True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, False, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True, True, True, True, True, True, True,\n", " True, True, True])" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Looks like it's the Gaussian Naive Bayes (I was expecting the KNN to do best), although its score differs from\n", "# the KNN's only in the last decimal places, which looks like floating-point precision rather than a real gap.\n", "# we could also just get that classifier directly like : \n", "\n", "best_predictor = ec.get_best_predictor(X_test, y_test)\n", "print(best_predictor)\n", "\n", "# and we can use it...\n", "y_pred = best_predictor.predict(X_test)\n", "y_pred == y_test # ... do more things ... " ] }, { "cell_type": "code", "execution_count": null, "id": "incorporate-tattoo", "metadata": {}, "outputs": [], "source": [ "# well why is this useful? Well for one thing imagine if you knew you were going to use a KNN but weren't sure\n", "# about the k-value.
Well you could just add multiple KNN classes here - all with different k-values - and just \n", "# get the class that works best and use it. That way hyperparameter tuning is a breeze. \n", "# On top of that, this module trains algorithms extremely fast, so you could take advantage of that. \n", "\n", "# If you think you have a better way to use it, please let me know at anish.lakkapragada@gmail.com or on GitHub. \n", "# In the meantime, thank you!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.7" } }, "nbformat": 4, "nbformat_minor": 5 }