{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Scikit-Learn singalong: EEG Eye State Classification\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Author: Kevin Yang\n", "\n", "Contact: kyang@h2o.ai\n", "\n", "This tutorial replicates Erin LeDell's oncology demo using Scikit Learn and Pandas, and is intended to provide a comparison of the syntactical and performance differences between sklearn and H2O implementations of Gradient Boosting Machines. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll be using Pandas, Numpy and the collections package for most of the data exploration." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from collections import Counter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download EEG Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code downloads a copy of the [EEG Eye State](http://archive.ics.uci.edu/ml/datasets/EEG+Eye+State#) dataset. All data is from one continuous EEG measurement with the [Emotiv EEG Neuroheadset](https://emotiv.com/epoc.php). The duration of the measurement was 117 seconds. The eye state was detected via a camera during the EEG measurement and added later manually to the file after analysing the video frames. '1' indicates the eye-closed and '0' the eye-open state. All values are in chronological order with the first measured value at the top of the data.\n", "\n", "![Emotiv Headset](http://dissociatedpress.com/wp-content/uploads/2013/03/emotiv-490.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's import the same dataset directly with pandas" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "csv_url = \"http://www.stat.berkeley.edu/~ledell/data/eeg_eyestate_splits.csv\"\n", "data = pd.read_csv(csv_url)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explore Data\n", "Once we have loaded the data, let's take a quick look. First the dimension of the frame:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(14980, 16)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.shape\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's take a look at the top of the frame:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AF3F7F3FC5T7P7O1O2P8T8FC6F4F8AF4eyeDetectionsplit
04329.234009.234289.234148.214350.264586.154096.924641.034222.054238.464211.284280.514635.904393.850valid
14324.624004.624293.854148.724342.054586.674097.444638.974210.774226.674207.694279.494632.824384.100test
24327.694006.674295.384156.414336.924583.594096.924630.264207.694222.054206.674282.054628.724389.230train
34328.724011.794296.414155.904343.594582.564097.444630.774217.444235.384210.774287.694632.314396.410train
44326.154011.794292.314151.284347.694586.674095.904627.694210.774244.104212.824288.214632.824398.460train
\n", "
" ], "text/plain": [ " AF3 F7 F3 FC5 T7 P7 O1 O2 \\\n", "0 4329.23 4009.23 4289.23 4148.21 4350.26 4586.15 4096.92 4641.03 \n", "1 4324.62 4004.62 4293.85 4148.72 4342.05 4586.67 4097.44 4638.97 \n", "2 4327.69 4006.67 4295.38 4156.41 4336.92 4583.59 4096.92 4630.26 \n", "3 4328.72 4011.79 4296.41 4155.90 4343.59 4582.56 4097.44 4630.77 \n", "4 4326.15 4011.79 4292.31 4151.28 4347.69 4586.67 4095.90 4627.69 \n", "\n", " P8 T8 FC6 F4 F8 AF4 eyeDetection split \n", "0 4222.05 4238.46 4211.28 4280.51 4635.90 4393.85 0 valid \n", "1 4210.77 4226.67 4207.69 4279.49 4632.82 4384.10 0 test \n", "2 4207.69 4222.05 4206.67 4282.05 4628.72 4389.23 0 train \n", "3 4217.44 4235.38 4210.77 4287.69 4632.31 4396.41 0 train \n", "4 4210.77 4244.10 4212.82 4288.21 4632.82 4398.46 0 train " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first two columns contain an ID and the response. The \"diagnosis\" column is the response. Let's take a look at the column names. The data contains derived features from the medical images of the tumors." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['AF3',\n", " 'F7',\n", " 'F3',\n", " 'FC5',\n", " 'T7',\n", " 'P7',\n", " 'O1',\n", " 'O2',\n", " 'P8',\n", " 'T8',\n", " 'FC6',\n", " 'F4',\n", " 'F8',\n", " 'AF4',\n", " 'eyeDetection',\n", " 'split']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.columns.tolist()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To select a subset of the columns to look at, typical Pandas indexing applies:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AF3eyeDetectionsplit
04329.230valid
14324.620test
24327.690train
34328.720train
44326.150train
54321.030train
64319.490test
74325.640test
84326.150test
94326.150train
\n", "
" ], "text/plain": [ " AF3 eyeDetection split\n", "0 4329.23 0 valid\n", "1 4324.62 0 test\n", "2 4327.69 0 train\n", "3 4328.72 0 train\n", "4 4326.15 0 train\n", "5 4321.03 0 train\n", "6 4319.49 0 test\n", "7 4325.64 0 test\n", "8 4326.15 0 test\n", "9 4326.15 0 train" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "columns = ['AF3', 'eyeDetection', 'split']\n", "data[columns].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's select a single column, for example -- the response column, and look at the data more closely:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", "Name: eyeDetection, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['eyeDetection'].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like a binary response, but let's validate that assumption:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 1])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['eyeDetection'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can query the categorical \"levels\" as well ('B' and 'M' stand for \"Benign\" and \"Malignant\" diagnosis):" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['eyeDetection'].nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since \"diagnosis\" column is the response we would like to predict, we may want to check if there are any missing values, so let's look for NAs. To figure out which, if any, values are missing, we can use the `isna` method on the diagnosis column. The columns in an H2O Frame are also H2O Frames themselves, so all the methods that apply to a Frame also apply to a single column." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AF3F7F3FC5T7P7O1O2P8T8FC6F4F8AF4eyeDetectionsplit
0FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
1FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
2FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
3FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
4FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
5FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
6FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
7FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
8FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
9FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
10FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
11FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
12FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
13FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
15FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
16FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
17FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
18FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
19FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
20FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
21FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
22FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
23FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
24FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
25FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
26FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
27FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
28FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
29FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
...................................................
14950FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14951FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14952FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14953FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14954FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14955FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14956FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14957FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14958FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14959FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14960FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14961FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14962FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14963FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14964FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14965FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14966FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14967FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14968FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14969FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14970FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14971FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14972FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14973FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14974FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14975FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14976FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14977FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14978FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
14979FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
\n", "

14980 rows × 16 columns

\n", "
" ], "text/plain": [ " AF3 F7 F3 FC5 T7 P7 O1 O2 P8 T8 \\\n", "0 False False False False False False False False False False \n", "1 False False False False False False False False False False \n", "2 False False False False False False False False False False \n", "3 False False False False False False False False False False \n", "4 False False False False False False False False False False \n", "5 False False False False False False False False False False \n", "6 False False False False False False False False False False \n", "7 False False False False False False False False False False \n", "8 False False False False False False False False False False \n", "9 False False False False False False False False False False \n", "10 False False False False False False False False False False \n", "11 False False False False False False False False False False \n", "12 False False False False False False False False False False \n", "13 False False False False False False False False False False \n", "14 False False False False False False False False False False \n", "15 False False False False False False False False False False \n", "16 False False False False False False False False False False \n", "17 False False False False False False False False False False \n", "18 False False False False False False False False False False \n", "19 False False False False False False False False False False \n", "20 False False False False False False False False False False \n", "21 False False False False False False False False False False \n", "22 False False False False False False False False False False \n", "23 False False False False False False False False False False \n", "24 False False False False False False False False False False \n", "25 False False False False False False False False False False \n", "26 False False False False False False False False False False \n", "27 False False False False False False False False False False \n", "28 False False False False False False False False False False \n", "29 False False False False False False False False False False \n", "... ... ... ... ... ... ... ... ... ... ... 
\n", "14950 False False False False False False False False False False \n", "14951 False False False False False False False False False False \n", "14952 False False False False False False False False False False \n", "14953 False False False False False False False False False False \n", "14954 False False False False False False False False False False \n", "14955 False False False False False False False False False False \n", "14956 False False False False False False False False False False \n", "14957 False False False False False False False False False False \n", "14958 False False False False False False False False False False \n", "14959 False False False False False False False False False False \n", "14960 False False False False False False False False False False \n", "14961 False False False False False False False False False False \n", "14962 False False False False False False False False False False \n", "14963 False False False False False False False False False False \n", "14964 False False False False False False False False False False \n", "14965 False False False False False False False False False False \n", "14966 False False False False False False False False False False \n", "14967 False False False False False False False False False False \n", "14968 False False False False False False False False False False \n", "14969 False False False False False False False False False False \n", "14970 False False False False False False False False False False \n", "14971 False False False False False False False False False False \n", "14972 False False False False False False False False False False \n", "14973 False False False False False False False False False False \n", "14974 False False False False False False False False False False \n", "14975 False False False False False False False False False False \n", "14976 False False False False False False False False False False \n", "14977 False False False False False False False False False False \n", "14978 False False False False False False False False False False \n", "14979 False False False False False False False False False False \n", "\n", " FC6 F4 F8 AF4 eyeDetection split \n", "0 False False False False False False \n", "1 False False False False False False \n", "2 False False False False False False \n", "3 False False False False False False \n", "4 False False False False False False \n", "5 False False False False False False \n", "6 False False False False False False \n", "7 False False False False False False \n", "8 False False False False False False \n", "9 False False False False False False \n", "10 False False False False False False \n", "11 False False False False False False \n", "12 False False False False False False \n", "13 False False False False False False \n", "14 False False False False False False \n", "15 False False False False False False \n", "16 False False False False False False \n", "17 False False False False False False \n", "18 False False False False False False \n", "19 False False False False False False \n", "20 False False False False False False \n", "21 False False False False False False \n", "22 False False False False False False \n", "23 False False False False False False \n", "24 False False False False False False \n", "25 False False False False False False \n", "26 False False False False False False \n", "27 False False False False False False \n", "28 False False False False False False \n", "29 False False False False False False \n", 
"... ... ... ... ... ... ... \n", "14950 False False False False False False \n", "14951 False False False False False False \n", "14952 False False False False False False \n", "14953 False False False False False False \n", "14954 False False False False False False \n", "14955 False False False False False False \n", "14956 False False False False False False \n", "14957 False False False False False False \n", "14958 False False False False False False \n", "14959 False False False False False False \n", "14960 False False False False False False \n", "14961 False False False False False False \n", "14962 False False False False False False \n", "14963 False False False False False False \n", "14964 False False False False False False \n", "14965 False False False False False False \n", "14966 False False False False False False \n", "14967 False False False False False False \n", "14968 False False False False False False \n", "14969 False False False False False False \n", "14970 False False False False False False \n", "14971 False False False False False False \n", "14972 False False False False False False \n", "14973 False False False False False False \n", "14974 False False False False False False \n", "14975 False False False False False False \n", "14976 False False False False False False \n", "14977 False False False False False False \n", "14978 False False False False False False \n", "14979 False False False False False False \n", "\n", "[14980 rows x 16 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", "5 False\n", "6 False\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 False\n", "12 False\n", "13 False\n", "14 False\n", "15 False\n", "16 False\n", "17 False\n", "18 False\n", "19 False\n", "20 False\n", "21 False\n", "22 False\n", "23 False\n", "24 False\n", "25 False\n", "26 False\n", "27 False\n", "28 False\n", "29 False\n", " ... \n", "14950 False\n", "14951 False\n", "14952 False\n", "14953 False\n", "14954 False\n", "14955 False\n", "14956 False\n", "14957 False\n", "14958 False\n", "14959 False\n", "14960 False\n", "14961 False\n", "14962 False\n", "14963 False\n", "14964 False\n", "14965 False\n", "14966 False\n", "14967 False\n", "14968 False\n", "14969 False\n", "14970 False\n", "14971 False\n", "14972 False\n", "14973 False\n", "14974 False\n", "14975 False\n", "14976 False\n", "14977 False\n", "14978 False\n", "14979 False\n", "Name: eyeDetection, dtype: bool" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['eyeDetection'].isnull()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `isna` method doesn't directly answer the question, \"Does the diagnosis column contain any NAs?\", rather it returns a 0 if that cell is not missing (Is NA? FALSE == 0) and a 1 if it is missing (Is NA? TRUE == 1). So if there are no missing values, then summing over the whole column should produce a summand equal to 0.0. 
Let's take a look:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['eyeDetection'].isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Great, no missing labels. \n", "\n", "Out of curiosity, let's see if there is any missing data in this frame:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "AF3 0\n", "F7 0\n", "F3 0\n", "FC5 0\n", "T7 0\n", "P7 0\n", "O1 0\n", "O2 0\n", "P8 0\n", "T8 0\n", "FC6 0\n", "F4 0\n", "F8 0\n", "AF4 0\n", "eyeDetection 0\n", "split 0\n", "dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next thing I may wonder about in a binary classification problem is the distribution of the response in the training data. Is one of the two outcomes under-represented in the training set? Many real datasets have what's called an \"imbalanace\" problem, where one of the classes has far fewer training examples than the other class. Let's take a look at the distribution, both visually and numerically." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Counter({0: 8257, 1: 6723})" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(data['eyeDetection'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, the data is not exactly evenly distributed between the two classes -- there are more 0's than 1's in the dataset. However, this level of imbalance shouldn't be much of an issue for the machine learning algos. (We will revisit this later in the modeling section below).\n", "\n", "Let's calculate the percentage that each class represents:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0.5512016, 0.4487984])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "n = data.shape[0] # Total number of training samples\n", "np.array(Counter(data['eyeDetection']).values())/float(n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split H2O Frame into a train and test set\n", "\n", "So far we have explored the original dataset (all rows). For the machine learning portion of this tutorial, we will break the dataset into three parts: a training set, validation set and a test set.\n", "\n", "If you want H2O to do the splitting for you, you can use the `split_frame` method. However, we have explicit splits that we want (for reproducibility reasons), so we can just subset the Frame to get the partitions we want. 
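{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference only, the next cell sketches how a random 60/20/20 three-way split could be drawn with scikit-learn's `train_test_split` (the `*_rnd` names are just illustrative, and the cell is left unexecuted because we keep the predefined `split` column):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sketch only: a random 60/20/20 split with scikit-learn's train_test_split.\n", "# This notebook keeps the dataset's predefined 'split' column instead.\n", "from sklearn.cross_validation import train_test_split  # sklearn >= 0.18: sklearn.model_selection\n", "\n", "rest, test_rnd = train_test_split(data, test_size=0.2, random_state=42)\n", "train_rnd, valid_rnd = train_test_split(rest, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2\n", "train_rnd.shape, valid_rnd.shape, test_rnd.shape" ] },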
\n", "\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(8988, 16)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train = data[data['split']==\"train\"]\n", "train.shape" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(2996, 16)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "valid = data[data['split']==\"valid\"]\n", "valid.shape" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(2996, 16)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test = data[data['split']==\"test\"]\n", "test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Machine Learning in H2O\n", "\n", "We will do a quick demo of the H2O software -- trying to predict eye state (open/closed) from EEG data.\n", "\n", "### Specify the predictor set and response\n", "\n", "The response, `y`, is the 'diagnosis' column, and the predictors, `x`, are all the columns aside from the first two columns ('id' and 'diagnosis')." ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y = 'eyeDetection'\n", "x = data.columns.drop(['eyeDetection','split'])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split H2O Frame into a train and test set" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import GradientBoostingClassifier" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import sklearn\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(2996, 16)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train and Test a GBM model" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model = GradientBoostingClassifier(n_estimators=100,\n", " max_depth=4,\n", " learning_rate=0.1)\n" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',\n", " max_depth=4, max_features=None, max_leaf_nodes=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=100,\n", " presort='auto', random_state=None, subsample=1.0, verbose=0,\n", " warm_start=False)" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X=train[x].reset_index(drop=True)\n", "y=train[y].reset_index(drop=True)\n", "\n", "model.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',\n", " max_depth=4, max_features=None, max_leaf_nodes=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=100,\n", " presort='auto', random_state=None, subsample=1.0, 
{ "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',\n", " max_depth=4, max_features=None, max_leaf_nodes=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=100,\n", " presort='auto', random_state=None, subsample=1.0, verbose=0,\n", " warm_start=False)\n" ] } ], "source": [ "print(model)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect Model" ] },
{ "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'init': None,\n", " 'learning_rate': 0.1,\n", " 'loss': 'deviance',\n", " 'max_depth': 4,\n", " 'max_features': None,\n", " 'max_leaf_nodes': None,\n", " 'min_samples_leaf': 1,\n", " 'min_samples_split': 2,\n", " 'min_weight_fraction_leaf': 0.0,\n", " 'n_estimators': 100,\n", " 'presort': 'auto',\n", " 'random_state': None,\n", " 'subsample': 1.0,\n", " 'verbose': 0,\n", " 'warm_start': False}" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.get_params()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Model Performance on the Training Set\n", "\n", "The metrics below are computed on the training frame (`X`, `y`); note that scikit-learn's metric functions expect the true labels first, i.e. `metric(y_true, y_pred)`. An estimate of out-of-sample performance follows in the cross-validation section." ] },
{ "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.54512915254897387" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import r2_score, roc_auc_score, mean_squared_error\n", "y_pred = model.predict(X)\n", "\n", "r2_score(y, y_pred)" ] },
{ "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.89097094432760837" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "roc_auc_score(y, y_pred)" ] },
{ "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.11103693813974187" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(y, y_pred)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Cross-validated Performance" ] },
{ "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0.54945509, 0.55455629, 0.32538286, 0.38222385, 0.42590001])" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import cross_validation\n", "\n", "cross_validation.cross_val_score(model, X, y, scoring='roc_auc', cv=5)\n" ] },
{ "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0.64409495, 0.55143686, 0.30297715, 0.36688253, 0.40355729])" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cross_validation.cross_val_score(model, valid[x].reset_index(drop=True), valid['eyeDetection'].reset_index(drop=True), scoring='roc_auc', cv=5)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }