{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# H2O Tutorial: Breast Cancer Classification\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Author: Erin LeDell\n", "\n", "Contact: erin@h2o.ai\n", "\n", "This tutorial steps through a quick introduction to H2O's Python API. The goal of this tutorial is to introduce through a complete example H2O's capabilities from Python. Also, to help those that are accustomed to Scikit Learn and Pandas, the demo will be specific call outs for differences between H2O and those packages; this is intended to help anyone that needs to do machine learning on really Big Data make the transition. It is not meant to be a tutorial on machine learning or algorithms.\n", "\n", "Detailed documentation about H2O's and the Python API is available at http://docs.h2o.ai." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Install H2O in Python\n", "\n", "### Prerequisites\n", "\n", "This tutorial assumes you have Python 2.7 installed. The `h2o` Python package has a few dependencies which can be installed using [pip](http://pip.readthedocs.org/en/stable/installing/). The packages that are required are (which also have their own dependencies):\n", "```bash\n", "pip install requests\n", "pip install tabulate\n", "pip install scikit-learn \n", "```\n", "If you have any problems (for example, installing the `scikit-learn` package), check out [this page](https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/product/howto/FAQ.md#python) for tips.\n", "\n", "### Install h2o\n", "\n", "Once the dependencies are installed, you can install H2O. We will use the latest stable version of the `h2o` package, which is called \"Tibshirani-3.\" The installation instructions are on the \"Install in Python\" tab on [this page](http://h2o-release.s3.amazonaws.com/h2o/rel-tibshirani/3/index.html).\n", "\n", "```bash\n", "# The following command removes the H2O module for Python (if it already exists).\n", "pip uninstall h2o\n", "\n", "# Next, use pip to install this version of the H2O Python module.\n", "pip install http://h2o-release.s3.amazonaws.com/h2o/rel-tibshirani/3/Python/h2o-3.6.0.3-py2.py3-none-any.whl\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Start up an H2O cluster\n", "\n", "In a Python terminal, we can import the `h2o` package and start up an H2O cluster." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "No instance found at ip and port: localhost:54321. Trying to start local jar...\n", "\n", "\n", "JVM stdout: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmpA5iLxS/h2o_me_started_from_python.out\n", "JVM stderr: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmptfhX9Q/h2o_me_started_from_python.err\n", "Using ice_root: /var/folders/2j/jg4sl53d5q53tc2_nzm9fz5h0000gn/T/tmpViw3QS\n", "\n", "\n", "Java Version: java version \"1.8.0_45\"\n", "Java(TM) SE Runtime Environment (build 1.8.0_45-b14)\n", "Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)\n", "\n", "\n", "Starting H2O JVM and connecting: ........... Connection successful!\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "-------------------------- -------------------------\n", "H2O cluster uptime: 1 seconds 30 milliseconds\n", "H2O cluster version: 3.6.0.3\n", "H2O cluster name: H2O_started_from_python\n", "H2O cluster total nodes: 1\n", "H2O cluster total memory: 3.56 GB\n", "H2O cluster total cores: 8\n", "H2O cluster allowed cores: 8\n", "H2O cluster healthy: True\n", "H2O Connection ip: 127.0.0.1\n", "H2O Connection port: 54321\n", "-------------------------- -------------------------" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import h2o\n", "\n", "# Start an H2O Cluster on your local machine\n", "h2o.init()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you already have an H2O cluster running that you'd like to connect to (for example, in a multi-node Hadoop environment), then you can specify the IP and port of that cluster as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# This will not actually do anything since it's a fake IP address\n", "# h2o.init(ip=\"123.45.67.89\", port=54321)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code downloads a copy of the [Wisconsin Diagnostic Breast Cancer dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29).\n", "\n", "We can import the data directly into H2O using the Python API." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Parse Progress: [##################################################] 100%\n" ] } ], "source": [ "csv_url = \"https://h2o-public-test-data.s3.amazonaws.com/smalldata/wisc/wisc-diag-breast-cancer-shuffled.csv\"\n", "data = h2o.import_file(csv_url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explore Data\n", "Once we have loaded the data, let's take a quick look. First the dimension of the frame:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(569, 32)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.shape\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's take a look at the top of the frame:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave_points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave_points_worst symmetry_worst fractal_dimension_worst
8.71002e+08 B 8.219 20.7 53.27 203.9 0.09405 0.1305 0.1321 0.02168 0.2222 0.08261 0.1935 1.962 1.243 10.21 0.01243 0.05416 0.07753 0.01022 0.02309 0.01178 9.092 29.72 58.08 249.8 0.163 0.431 0.5381 0.07879 0.3322 0.1486
8.81053e+06 B 11.84 18.94 75.51 428 0.08871 0.069 0.02669 0.01393 0.1533 0.06057 0.2222 0.8652 1.444 17.12 0.005517 0.01727 0.02045 0.006747 0.01616 0.002922 13.3 24.99 85.22 546.3 0.128 0.188 0.1471 0.06913 0.2535 0.07993
8.95115e+07 B 12.2 15.21 78.01 457.9 0.08673 0.06545 0.01994 0.01692 0.1638 0.06129 0.2575 0.8073 1.959 19.01 0.005403 0.01418 0.01051 0.005142 0.01333 0.002065 13.75 21.38 91.11 583.1 0.1256 0.1928 0.1167 0.05556 0.2661 0.07961
9.15946e+07 M 15.05 19.07 97.26 701.9 0.09215 0.08597 0.07486 0.04335 0.1561 0.05915 0.386 1.198 2.63 38.49 0.004952 0.0163 0.02967 0.009423 0.01152 0.001718 17.58 28.06 113.8 967 0.1246 0.2101 0.2866 0.112 0.2282 0.06954
864292 B 10.51 20.19 68.64 334.2 0.1122 0.1303 0.06476 0.03068 0.1922 0.07782 0.3336 1.86 2.041 19.91 0.01188 0.03747 0.04591 0.01544 0.02287 0.006792 11.16 22.75 72.62 374.4 0.13 0.2049 0.1295 0.06136 0.2383 0.09026
9.1544e+07 B 12.22 20.04 79.47 453.1 0.1096 0.1152 0.08175 0.02166 0.2124 0.06894 0.1811 0.7959 0.9857 12.58 0.006272 0.02198 0.03966 0.009894 0.0132 0.003813 13.16 24.17 85.13 515.3 0.1402 0.2315 0.3535 0.08088 0.2709 0.08839
9.19039e+07 B 11.67 20.02 75.21 416.2 0.1016 0.09453 0.042 0.02157 0.1859 0.06461 0.2067 0.8745 1.393 15.34 0.005251 0.01727 0.0184 0.005298 0.01449 0.002671 13.35 28.81 87 550.6 0.155 0.2964 0.2758 0.0812 0.3206 0.0895
9.01257e+06 B 15.19 13.21 97.65 711.8 0.07963 0.06934 0.03393 0.02657 0.1721 0.05544 0.1783 0.4125 1.338 17.72 0.005012 0.01485 0.01551 0.009155 0.01647 0.001767 16.2 15.73 104.5 819.1 0.1126 0.1737 0.1362 0.08178 0.2487 0.06766
899987 M 25.73 17.46 174.2 2010 0.1149 0.2363 0.3368 0.1913 0.1956 0.06121 0.9948 0.8509 7.222 153.1 0.006369 0.04243 0.04266 0.01508 0.02335 0.003385 33.13 23.58 229.3 3234 0.153 0.5937 0.6451 0.2756 0.369 0.08815
854039 M 16.13 17.88 107 807.2 0.104 0.1559 0.1354 0.07752 0.1998 0.06515 0.334 0.6857 2.183 35.03 0.004185 0.02868 0.02664 0.009067 0.01703 0.003817 20.21 27.26 132.7 1261 0.1446 0.5804 0.5274 0.1864 0.427 0.1233
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first two columns contain an ID and the resposne. The \"diagnosis\" column is the response. Let's take a look at the column names. The data contains derived features from the medical images of the tumors." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'id',\n", " u'diagnosis',\n", " u'radius_mean',\n", " u'texture_mean',\n", " u'perimeter_mean',\n", " u'area_mean',\n", " u'smoothness_mean',\n", " u'compactness_mean',\n", " u'concavity_mean',\n", " u'concave_points_mean',\n", " u'symmetry_mean',\n", " u'fractal_dimension_mean',\n", " u'radius_se',\n", " u'texture_se',\n", " u'perimeter_se',\n", " u'area_se',\n", " u'smoothness_se',\n", " u'compactness_se',\n", " u'concavity_se',\n", " u'concave_points_se',\n", " u'symmetry_se',\n", " u'fractal_dimension_se',\n", " u'radius_worst',\n", " u'texture_worst',\n", " u'perimeter_worst',\n", " u'area_worst',\n", " u'smoothness_worst',\n", " u'compactness_worst',\n", " u'concavity_worst',\n", " u'concave_points_worst',\n", " u'symmetry_worst',\n", " u'fractal_dimension_worst']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select a subset of the columns to look at, typical Pandas indexing applies:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
id diagnosis area_mean
8.71002e+08 B 203.9
8.81053e+06 B 428
8.95115e+07 B 457.9
9.15946e+07 M 701.9
864292 B 334.2
9.1544e+07 B 453.1
9.19039e+07 B 416.2
9.01257e+06 B 711.8
899987 M 2010
854039 M 807.2
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns = [\"id\", \"diagnosis\", \"area_mean\"]\n", "data[columns].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's select a single column, for example -- the response column, and look at the data more closely:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
diagnosis
B
B
B
M
B
B
B
B
M
M
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['diagnosis']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like a binary response, but let's validate that assumption:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
C1
B
M
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['diagnosis'].unique()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[2]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['diagnosis'].nlevels()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can query the categorical \"levels\" as well ('B' and 'M' stand for \"Benign\" and \"Malignant\" diagnosis):" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[['B', 'M']]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['diagnosis'].levels()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since \"diagnosis\" column is the response we would like to predict, we may want to check if there are any missing values, so let's look for NAs. To figure out which, if any, values are missing, we can use the `isna` method on the diagnosis column. The columns in an H2O Frame are also H2O Frames themselves, so all the methods that apply to a Frame also apply to a single column." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20 C21 C22 C23 C24 C25 C26 C27 C28 C29 C30 C31 C32
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isna()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
C1
0
0
0
0
0
0
0
0
0
0
" ], "text/plain": [] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['diagnosis'].isna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \`isna\` method doesn't directly answer the question, \"Does the diagnosis column contain any NAs?\"; rather, it returns a 0 if that cell is not missing (Is NA? FALSE == 0) and a 1 if it is missing (Is NA? TRUE == 1). So if there are no missing values, then summing over the whole column should produce a sum equal to 0.0. Let's take a look:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['diagnosis'].isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great, no missing labels. \n", "\n", "Out of curiosity, let's see if there is any missing data in this frame:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next thing I may wonder about in a binary classification problem is the distribution of the response in the training data. Is one of the two outcomes under-represented in the training set? Many real datasets have what's called an \"imbalance\" problem, where one of the classes has far fewer training examples than the other class. Let's take a look at the distribution, both visually and numerically." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# TO DO: Insert a bar chart or something showing the proportion of M to B in the response.\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
diagnosis Count
B 357
M 212
" ], "text/plain": [] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['diagnosis'].table()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, the data is not exactly evenly distributed between the two classes -- there are almost twice as many Benign samples as there are Malignant samples. However, this level of imbalance shouldn't be much of an issue for the machine learning algorithms. (We will revisit this later in the modeling section below)." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "
Count
0.627417
0.372583
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n = data.shape[0] # Total number of training samples\n", "data['diagnosis'].table()['Count']/n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Machine Learning in H2O\n", "\n", "We will do a quick demo of the H2O software -- trying to predict malignant tumors using various machine learning algorithms.\n", "\n", "### Specify the predictor set and response\n", "\n", "The response, `y`, is the 'diagnosis' column, and the predictors, `x`, are all the columns aside from the first two columns ('id' and 'diagnosis')." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "y = 'diagnosis'" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[u'diagnosis',\n", " u'radius_mean',\n", " u'texture_mean',\n", " u'perimeter_mean',\n", " u'area_mean',\n", " u'smoothness_mean',\n", " u'compactness_mean',\n", " u'concavity_mean',\n", " u'concave_points_mean',\n", " u'symmetry_mean',\n", " u'fractal_dimension_mean',\n", " u'radius_se',\n", " u'texture_se',\n", " u'perimeter_se',\n", " u'area_se',\n", " u'smoothness_se',\n", " u'compactness_se',\n", " u'concavity_se',\n", " u'concave_points_se',\n", " u'symmetry_se',\n", " u'fractal_dimension_se',\n", " u'radius_worst',\n", " u'texture_worst',\n", " u'perimeter_worst',\n", " u'area_worst',\n", " u'smoothness_worst',\n", " u'compactness_worst',\n", " u'concavity_worst',\n", " u'concave_points_worst',\n", " u'symmetry_worst',\n", " u'fractal_dimension_worst']" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = data.columns\n", "del x[0:1]\n", "x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split H2O Frame into a train and test set" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": true }, "outputs": [], "source": [ "train, test = data.split_frame(ratios=[0.75], seed=1)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(428, 32)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.shape\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(141, 32)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train and Test a GBM model" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Import H2O GBM:\n", "from h2o.estimators.gbm import H2OGradientBoostingEstimator\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first create a `model` object of class, `\"H2OGradientBoostingEstimator\"`. This does not actually do any training, it just sets the model up for training by specifying model parameters." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model = H2OGradientBoostingEstimator(distribution='bernoulli',\n", " ntrees=100,\n", " max_depth=4,\n", " learn_rate=0.1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The \`model\` object, like all H2O estimator objects, has a \`train\` method, which will actually perform model training. At this step we specify the training and (optionally) a validation set, along with the response and predictor variables." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "gbm Model Build Progress: [##################################################] 100%\n" ] } ], "source": [ "model.train(x=x, y=y, training_frame=train, validation_frame=test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect Model\n", "\n", "The type of results shown when you print a model are determined by the following:\n", "- Model class of the estimator (e.g. GBM, RF, GLM, DL)\n", "- The type of machine learning problem (e.g. binary classification, multiclass classification, regression)\n", "- The data you specify (e.g. \`training_frame\` only, \`training_frame\` and \`validation_frame\`, or \`training_frame\` and \`nfolds\`)\n", "\n", "Below, we see a GBM Model Summary, as well as training and validation metrics since we supplied a \`validation_frame\`. Since this is a binary classification task, we are shown the relevant performance metrics, which include: MSE, R^2, LogLoss, AUC and Gini. Also, we are shown a Confusion Matrix, where the threshold for classification is chosen automatically (by H2O) as the threshold which maximizes the F1 score.\n", "\n", "The scoring history is also printed, which shows the performance metrics over some increment such as \"number of trees\" in the case of GBM and RF.\n", "\n", "Lastly, for tree-based methods (GBM and RF), we also print variable importance." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model Details\n", "=============\n", "H2OGradientBoostingEstimator : Gradient Boosting Machine\n", "Model Key: GBM_model_python_1448480209718_6\n", "\n", "Model Summary:\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves\n", "-- ----------------- --------------------- ----------- ----------- ------------ ------------ ------------ -------------\n", " 100 18324 4 4 4 8 14 10.31" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "ModelMetricsBinomial: gbm\n", "** Reported on train data. **\n", "\n", "MSE: 1.55261137469e-06\n", "R^2: 0.999993333015\n", "LogLoss: 0.000519099361538\n", "AUC: 1.0\n", "Gini: 1.0\n", "\n", "Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.989733166545:\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " B M Error Rate\n", "----- --- --- ------- -----------\n", "B 270 0 0 (0.0/270.0)\n", "M 0 158 0 (0.0/158.0)\n", "Total 270 158 0 (0.0/428.0)" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Maximum Metrics: Maximum metrics at their respective thresholds\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "metric threshold value idx\n", "-------------------------- ----------- ------- -----\n", "max f1 0.989733 1 143\n", "max f2 0.989733 1 143\n", "max f0point5 0.989733 1 143\n", "max accuracy 0.989733 1 143\n", "max precision 0.999923 1 0\n", "max absolute_MCC 0.989733 1 143\n", "max min_per_class_accuracy 0.989733 1 143" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "ModelMetricsBinomial: gbm\n", "** Reported on validation data. **\n", "\n", "MSE: 0.0507094587533\n", "R^2: 0.78540767359\n", "LogLoss: 0.247694592147\n", "AUC: 0.970200085143\n", "Gini: 0.940400170285\n", "\n", "Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.409828576406:\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " B M Error Rate\n", "----- --- --- ------- -----------\n", "B 83 4 0.046 (4.0/87.0)\n", "M 4 50 0.0741 (4.0/54.0)\n", "Total 87 54 0.0567 (8.0/141.0)" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Maximum Metrics: Maximum metrics at their respective thresholds\n", "\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "metric threshold value idx\n", "-------------------------- ----------- -------- -----\n", "max f1 0.409829 0.925926 51\n", "max f2 0.00935885 0.9319 60\n", "max f0point5 0.74381 0.966387 43\n", "max accuracy 0.74381 0.943262 43\n", "max precision 0.999921 1 0\n", "max absolute_MCC 0.74381 0.883242 43\n", "max min_per_class_accuracy 0.409829 0.925926 51" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Scoring History:\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " timestamp duration number_of_trees training_MSE training_logloss training_AUC training_classification_error validation_MSE validation_logloss validation_AUC validation_classification_error\n", "--- ------------------- ---------- ----------------- ----------------- ------------------ -------------- ------------------------------- ---------------- -------------------- ---------------- ---------------------------------\n", " 2015-11-25 11:42:58 0.006 sec 1.0 0.192088988499 0.571861282203 0.996436943272 0.0303738317757 0.199904329976 0.588228883036 0.951575138357 0.0780141843972\n", " 2015-11-25 11:42:58 0.010 sec 2.0 0.160277802376 0.504398553547 0.996905766526 0.018691588785 0.172704933452 0.530358589111 0.952320136228 0.0709219858156\n", " 2015-11-25 11:42:58 0.013 sec 3.0 0.134660475655 0.448993394686 0.997187060478 0.018691588785 0.150478176462 0.482273136361 0.952958705832 0.0709219858156\n", " 2015-11-25 11:42:58 0.017 sec 4.0 0.113062072379 0.400732539059 0.998042662916 0.0140186915888 0.133037817199 0.443367592451 0.954342273308 0.0709219858156\n", " 2015-11-25 11:42:58 0.021 sec 5.0 0.0962640226993 0.361398896252 0.99719878106 0.0140186915888 0.118405733623 0.409189649903 0.955619412516 0.063829787234\n", "--- --- --- --- --- --- --- --- --- --- --- ---\n", " 2015-11-25 11:42:59 0.566 sec 96.0 2.57126681422e-06 0.000675348189042 1.0 0.0 0.0504812753847 0.24165499491 0.970200085143 0.0567375886525\n", " 2015-11-25 11:42:59 0.572 sec 97.0 2.27917081871e-06 0.000632720040278 1.0 0.0 0.0507638626588 0.243442751591 0.970838654747 0.0567375886525\n", " 2015-11-25 11:42:59 0.579 sec 98.0 2.04205964667e-06 0.000597410694761 1.0 0.0 0.0514580633117 0.246779455239 0.970838654747 0.0567375886525\n", " 2015-11-25 11:42:59 0.585 sec 99.0 1.78544678476e-06 0.000554986030507 1.0 0.0 0.0516010701671 0.249148895606 0.970625798212 0.0567375886525\n", " 2015-11-25 11:42:59 0.592 sec 100.0 1.55261137469e-06 0.000519099361538 1.0 0.0 0.0507094587533 0.247694592147 0.970200085143 0.0567375886525" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Variable Importances:\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "variable relative_importance scaled_importance percentage\n", "---------------------- --------------------- ------------------- -----------------\n", "radius_worst 177.467025757 1.0 0.340759241389\n", "perimeter_worst 102.717407227 0.578797141545 0.197230474871\n", "concave_points_worst 94.2315368652 0.530980538291 0.18093652542\n", "concave_points_mean 88.6345443726 0.499442327354 0.170189588587\n", "concavity_mean 9.30055427551 0.0524072245864 0.0178582460933\n", "--- --- --- ---\n", "compactness_mean 0.0267842449248 0.000150925191937 5.14291539108e-05\n", "radius_se 0.00789974443614 4.45138718162e-05 1.51685131914e-05\n", "smoothness_mean 0.00370898260735 2.08995591803e-05 7.12171793163e-06\n", "fractal_dimension_mean 0.000214185129153 1.20690099042e-06 4.11262665927e-07\n", "symmetry_mean 5.02978673467e-06 2.83420917955e-08 9.6578296996e-09" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "print(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model Performance on a Test Set\n", "\n", "Once a model has been trained, you can also use it to make predictions on a test set. In the case above, we passed the test set as the `validation_frame` in training, so we have technically already created test set predictions and performance. \n", "\n", "However, when performing model selection over a variety of model parameters, it is common for users to break their dataset into three pieces: Training, Validation and Test.\n", "\n", "After training a variety of models using different parameters (and evaluating them on a validation set), the user may choose a single model and then evaluate model performance on a separate test set. This is when the `model_performance` method, shown below, is most useful. " ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.9814814814814814" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "perf = model.model_performance(test)\n", "perf.auc()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross-validated Performance\n", "\n", "To perform k-fold cross-validation, you use the same code as above, but you specify `nfolds` as an integer greater than 1, or add a \"fold_column\" to your H2O Frame which indicates a fold ID for each row.\n", "\n", "Unless you have a specific reason to manually assign the observations to folds, you will find it easiest to simply use the `nfolds` argument.\n", "\n", "When performing cross-validation, you can still pass a `validation_frame`, but you can also choose to use the original dataset that contains all the rows. We will cross-validate a model below using the original H2O Frame which we call `data`." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "gbm Model Build Progress: [##################################################] 100%\n" ] } ], "source": [ "cvmodel = H2OGradientBoostingEstimator(distribution='bernoulli',\n", " ntrees=100,\n", " max_depth=4,\n", " learn_rate=0.1,\n", " nfolds=5)\n", "\n", "cvmodel.train(x=x, y=y, training_frame=data)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grid Search\n", "\n", "One way of evaluting models with different parameters is to perform a grid search over a set of parameter values. 
For example, in GBM, here are three model parameters that may be useful to search over:\n", "- `ntrees`: Number of trees\n", "- `max_depth`: Maximum depth of a tree\n", "- `learn_rate`: Learning rate in the GBM\n", "\n", "We will define a grid as follows:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "ntrees_opt = [5,50,100]\n", "max_depth_opt = [2,3,5]\n", "learn_rate_opt = [0.1,0.2]\n", "\n", "hyper_params = {'ntrees': ntrees_opt, \n", " 'max_depth': max_depth_opt,\n", " 'learn_rate': learn_rate_opt}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define an `\"H2OGridSearch\"` object by specifying the algorithm (GBM) and the hyper parameters:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from h2o.grid.grid_search import H2OGridSearch\n", "\n", "gs = H2OGridSearch(H2OGradientBoostingEstimator, hyper_params = hyper_params)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "An `\"H2OGridSearch\"` object also has a `train` method, which is used to train all the models in the grid." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "gbm Grid Build Progress: [##################################################] 100%\n" ] } ], "source": [ "gs.train(x=x, y=y, training_frame=train, validation_frame=test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compare Models" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Grid Search Results for H2OGradientBoostingEstimator:\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "Model Id Hyperparameters: [learn_rate, ntrees, max_depth] mse\n", "----------------------------------------------------- -------------------------------------------------- -----------\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_14 [0.2, 100, 3] 2.12233e-07\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_17 [0.2, 100, 5] 2.23617e-07\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_16 [0.2, 50, 5] 5.86149e-07\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_8 [0.1, 100, 5] 7.9336e-07\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_11 [0.2, 100, 2] 1.46308e-05\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_13 [0.2, 50, 3] 2.09611e-05\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_5 [0.1, 100, 3] 2.3662e-05\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_7 [0.1, 50, 5] 0.000388941\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_2 [0.1, 100, 2] 0.000546863\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_10 [0.2, 50, 2] 0.000605298\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_4 [0.1, 50, 3] 0.00149725\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_1 [0.1, 50, 2] 0.00449607\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_15 [0.2, 5, 5] 0.0422887\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_12 [0.2, 5, 3] 0.0433428\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_9 [0.2, 5, 2] 0.0502527\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_6 [0.1, 5, 5] 0.0961144\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_3 [0.1, 5, 3] 0.097152\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_0 [0.1, 5, 2] 0.100977" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "print(gs)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Grid_GBM_py_17_model_python_1448480209718_18_model_0 auc: 0.990963431786\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_13 auc: 1.0\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_16 auc: 1.0\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_17 auc: 1.0\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_2 auc: 1.0\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_15 auc: 0.998476324426\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_1 auc: 1.0\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_3 auc: 0.997444913268\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_9 auc: 0.993682606657\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_11 auc: 1.0\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_7 auc: 1.0\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_12 auc: 0.998663853727\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_4 auc: 1.0\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_8 auc: 1.0\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_10 auc: 1.0\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_5 auc: 1.0\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_6 auc: 0.997691045476\n", "Grid_GBM_py_17_model_python_1448480209718_18_model_14 auc: 1.0\n" ] } ], "source": [ "# print out the auc for all of the models\n", "for g in gs:\n", " print(g.model_id + \" auc: \" + str(g.auc()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#TO DO: Compare grid search 
models" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }