{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "\n", "\n", "

Data Science

\n", "

Lesson 10

\n", "

Modelling Data

\n", "\n", "
\n", "\n", "
Question
\n", "\n", "
How to Choose a Machine Learning Algorithm
\n", "\n", "
\n", "\n", "
***Original Tutorial:***
https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "OVERVIEW\n", "
\n", "\n", "
\n", "
\n", "Let's take a few minutes to read the following as it is VERY IMPORTANT:\n", "
\n", "
\n", "
\n", "Data Science is a multi-disciplinary field at the intersection of mathematics (e.g. statistics, linear algebra), computer science (e.g. programming languages, computer systems) and business management (e.g. communication, subject-matter knowledge). Most data scientists come from one of the three fields, so they tend to lean towards that discipline. However, data science is like a three-legged stool, with no one leg being more important than the others. So, this step will require advanced knowledge in mathematics. But don’t worry, we only need a high-level overview, which we’ll cover in this Kernel. Also, thanks to computer science, a lot of the heavy lifting is done for you. So, problems that once required graduate degrees in mathematics or statistics now only take a few lines of code. Last, we’ll need some business acumen to think through the problem. After all, like training a seeing-eye dog, it’s learning from us and not the other way around.\n", "

\n", "Machine Learning (ML), as the name suggests, is teaching the machine how to think and not what to think. While this topic and big data have been around for decades, they are becoming more popular than ever because the barrier to entry is lower, for businesses and professionals alike. This is both good and bad. It’s good because these algorithms are now accessible to more people who can solve more problems in the real world. It’s bad because a lower barrier to entry means more people will not know the tools they are using and can come to incorrect conclusions. That’s why I focus on teaching you not just what to do, but why you’re doing it. Previously, I used the analogy of asking someone to hand you a Phillips screwdriver, and they hand you a flathead screwdriver or, worse, a hammer. At best, it shows a complete lack of understanding. At worst, it makes completing the project impossible; or even worse, it produces incorrect actionable intelligence. So now that I’ve hammered (no pun intended) my point, I’ll show you what to do and, most importantly, WHY you do it.\n", "

\n", "First, you must understand that the purpose of machine learning is to solve human problems. Machine learning can be categorized as: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is where you train the model by presenting it a training dataset that includes the correct answer. Unsupervised learning is where you train the model using a training dataset that does not include the correct answer. Reinforcement learning sits between the two: the model is not given the correct answer immediately, but receives feedback later, after a sequence of events, to reinforce learning. We are doing supervised machine learning, because we are training our algorithm by presenting it with a set of features and their corresponding target. We then present it with a new subset from the same dataset and hope for similar prediction accuracy.\n", "
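To make the supervised workflow concrete, here is a minimal sketch rather than the lesson's actual pipeline. It assumes scikit-learn is installed and uses a synthetic dataset from `make_classification` as a stand-in for our data; the variable names and the choice of a shallow decision tree are illustrative assumptions only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 200 rows, 5 features, binary target
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold back 30% of the rows so the model is scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Supervised learning: fit on the features AND their correct answers
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Compare accuracy on the rows used for training vs. the held-out rows
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print('train accuracy:', round(train_accuracy, 3))
print('test accuracy:', round(test_accuracy, 3))
```

If the two accuracies are close, the model is generalizing to the new subset; a large gap usually points to overfitting.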

\n", "There are many machine learning algorithms; however, they can be reduced to four categories: classification, regression, clustering, or dimensionality reduction, depending on your target variable and data modeling goals. We'll save clustering and dimensionality reduction for another day, and focus on classification and regression. We can generalize that a continuous target variable requires a regression algorithm and a discrete target variable requires a classification algorithm. One side note: logistic regression, while it has regression in the name, is really a classification algorithm.\n", "
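The rule of thumb above (continuous target means regression, discrete target means classification) can be sketched in a few lines. This is only an illustration: `suggest_task` is a hypothetical helper written for this lesson, not part of scikit-learn, and the two sample lists are made-up values. The only library call used is `type_of_target` from `sklearn.utils.multiclass`.

```python
from sklearn.utils.multiclass import type_of_target

def suggest_task(target):
    # Hypothetical helper: map the kind of target variable to a modeling task
    kind = type_of_target(target)
    if kind.startswith('continuous'):
        return 'regression'
    if kind in ('binary', 'multiclass'):
        return 'classification'
    return 'unsupported: ' + kind

discrete_target = [0, 1, 1, 0, 1]              # made-up yes/no labels
continuous_target = [7.25, 71.28, 8.05, 53.1]  # made-up measured amounts

print(suggest_task(discrete_target))    # classification
print(suggest_task(continuous_target))  # regression
```

A check like this is a starting point only; the modeling goal still decides the final choice.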
\n", "\n", "
\n", "\n", "
\n", "QUESTION\n", "
\n", "
\n", "- **Going with our example so far, will our problem require regression or classification?**\n", "\n", "
\n", "Answer\n", "
\n", ">- Since our problem is predicting if a passenger survived or did not survive, this is a discrete target variable. \n", ">- We will use a classification algorithm from the sklearn library to begin our analysis.\n", ">- We will use cross validation and scoring metrics, discussed in later sections, to rank and compare our algorithms’ performance.\n", "\n", "
\n", "\n", "**Machine Learning Selection:**\n", "\n", "\n", "Now that we have identified our solution as a supervised learning classification algorithm, we can narrow our list of choices.\n", "\n", "**Machine Learning Classification Algorithms:**\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This is to get all the code from before in this notebook so you can continue without typing everything out again. \n", "# requirements.py should have been provided with this lesson.\n", "%run requirements.py\n", "%matplotlib inline\n", "mpl.style.use('ggplot')\n", "sns.set_style('white')\n", "pylab.rcParams['figure.figsize'] = 12,8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "How to Choose a Machine Learning Algorithm (MLA)\n", "
\n", "\n", "**IMPORTANT:** When it comes to data modeling, the beginner’s question is always, \"what is the best machine learning algorithm?\" The answer every beginner must learn is the No Free Lunch Theorem (NFLT) of Machine Learning. In short, NFLT states that there is no super algorithm that works best in all situations, for all datasets. So the best approach is to try multiple MLAs, tune them, and compare them for your specific scenario.\n", "\n", "So with all this information, where is a beginner to start? I recommend starting with Trees, Bagging, Random Forests, and Boosting. They are all built on the decision tree, which is the easiest concept to learn and understand. They are also easier to tune (discussed in the next section) than something like SVC. Below, we'll give an overview of how to run and compare several MLAs, but the rest of this Kernel will focus on learning data modeling via decision trees and their derivatives." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Machine Learning Algorithm (MLA) Selection and Initialization\n", "MLA = [\n", " #Ensemble Methods\n", " ensemble.AdaBoostClassifier(),\n", " ensemble.BaggingClassifier(),\n", " ensemble.ExtraTreesClassifier(),\n", " ensemble.GradientBoostingClassifier(),\n", " ensemble.RandomForestClassifier(),\n", "\n", " #Gaussian Processes\n", " gaussian_process.GaussianProcessClassifier(),\n", " \n", " #GLM - Generalized Linear Models\n", " linear_model.LogisticRegressionCV(),\n", " linear_model.PassiveAggressiveClassifier(),\n", " linear_model.RidgeClassifierCV(),\n", " linear_model.SGDClassifier(),\n", " linear_model.Perceptron(),\n", " \n", " #Naive Bayes\n", " naive_bayes.BernoulliNB(),\n", " naive_bayes.GaussianNB(),\n", " \n", " #Nearest Neighbor\n", " neighbors.KNeighborsClassifier(),\n", " \n", " #SVM\n", " svm.SVC(probability=True),\n", " svm.NuSVC(probability=True),\n", " svm.LinearSVC(),\n", " \n", " #Trees \n", " 
tree.DecisionTreeClassifier(),\n", " tree.ExtraTreeClassifier(),\n", " \n", " #Discriminant Analysis\n", " discriminant_analysis.LinearDiscriminantAnalysis(),\n", " discriminant_analysis.QuadraticDiscriminantAnalysis(),\n", "\n", " \n", " #xgboost: http://xgboost.readthedocs.io/en/latest/model.html\n", " XGBClassifier() \n", " ]\n", "\n", "\n", "\n", "#split dataset in cross-validation with this splitter class: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit\n", "#note: this is an alternative to train_test_split\n", "cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .6, random_state = 0 ) # run model 10x with 60/30 split intentionally leaving out 10%\n", "\n", "#create table to compare MLA metrics\n", "MLA_columns = ['MLA Name', 'MLA Parameters','MLA Train Accuracy Mean', 'MLA Test Accuracy Mean', 'MLA Test Accuracy 3*STD' ,'MLA Time']\n", "MLA_compare = pd.DataFrame(columns = MLA_columns)\n", "\n", "#create table to compare MLA predictions\n", "#copy so that new prediction columns are not written into a slice of data1\n", "MLA_predict = data1[Target].copy()\n", "\n", "#index through MLA and save performance to table\n", "row_index = 0\n", "for alg in MLA:\n", "\n", " #set name and parameters\n", " MLA_name = alg.__class__.__name__\n", " MLA_compare.loc[row_index, 'MLA Name'] = MLA_name\n", " MLA_compare.loc[row_index, 'MLA Parameters'] = str(alg.get_params())\n", " \n", " #score model with cross validation: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate\n", " #return_train_score = True keeps 'train_score' in the results; recent scikit-learn versions omit it by default\n", " cv_results = model_selection.cross_validate(alg, data1[data1_x_bin], data1[Target], cv = cv_split, return_train_score = True)\n", "\n", " MLA_compare.loc[row_index, 'MLA Time'] = cv_results['fit_time'].mean()\n", " MLA_compare.loc[row_index, 'MLA Train Accuracy Mean'] = cv_results['train_score'].mean()\n", " MLA_compare.loc[row_index, 'MLA Test Accuracy Mean'] = cv_results['test_score'].mean() \n", " #if this is an unbiased 
random sample and the scores are roughly normally distributed, then +/-3 standard deviations (std) from the mean should statistically capture 99.7% of the subsets\n", " MLA_compare.loc[row_index, 'MLA Test Accuracy 3*STD'] = cv_results['test_score'].std()*3 #let's know the worst that can happen!\n", " \n", "\n", " #save MLA predictions - see above links for usage\n", " alg.fit(data1[data1_x_bin], data1[Target])\n", " MLA_predict[MLA_name] = alg.predict(data1[data1_x_bin])\n", " \n", " row_index+=1\n", "\n", " \n", "#print and sort table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html\n", "MLA_compare.sort_values(by = ['MLA Test Accuracy Mean'], ascending = False, inplace = True)\n", "MLA_compare\n", "#MLA_predict" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#barplot using https://seaborn.pydata.org/generated/seaborn.barplot.html\n", "sns.barplot(x='MLA Test Accuracy Mean', y = 'MLA Name', data = MLA_compare, color = 'm')\n", "\n", "#prettify using pyplot: https://matplotlib.org/api/pyplot_api.html\n", "plt.title('Machine Learning Algorithm Accuracy Score \\n')\n", "#scores are fractions between 0 and 1, not percentages\n", "plt.xlabel('Accuracy Score')\n", "plt.ylabel('Algorithm')" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "
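If the 60/30 split and the 3*STD column are hard to picture, the following self-contained sketch may help. It does not use `requirements.py` or our dataset; it assumes scikit-learn is installed and substitutes 100 synthetic rows from `make_classification` and a single shallow decision tree, both of which are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 100 rows so the split sizes are easy to read
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Same splitter settings as the comparison loop: 10 random 60/30 splits
cv_split = ShuffleSplit(n_splits=10, test_size=0.3, train_size=0.6, random_state=0)

# Look at the first split: 60 rows train, 30 rows test, the remaining 10 sit out
train_idx, test_idx = next(cv_split.split(X))
print('train rows:', len(train_idx), 'test rows:', len(test_idx))

# Score one model on all 10 splits, keeping the training scores as well
cv_results = cross_validate(
    DecisionTreeClassifier(max_depth=3, random_state=0),
    X, y, cv=cv_split, return_train_score=True)

# Mean test accuracy and the +/- 3 standard deviation band around it
test_mean = cv_results['test_score'].mean()
test_3std = cv_results['test_score'].std() * 3
print('test accuracy mean:', round(test_mean, 3))
print('plus or minus 3*std:', round(test_3std, 3))
```

A wide 3*std band means the score depends heavily on which rows land in the test subset, so a small difference in mean accuracy between two algorithms may not be meaningful.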
\n", "NOTE\n", "
\n", ">- Please be sure you understand ALL THE ABOVE CODE and what each step does.\n", ">- Read the comments and if you still don't understand, run each line of code individually.\n", ">- If you are having trouble understanding what each step does, collaborate with a partner or speak to your peers. It never hurts to ask!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }