{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "This notebook explains the use of the formula language and the ability of Statsample-GLM to handle categorical data in regression.\n", "\n", "\n", "This notebook is based on [this](https://www.google.com/url?q=https%3A%2F%2Fnbviewer.jupyter.org%2Fgithub%2Fagisga%2Fsciruby-notebooks%2Fblob%2Fmaster%2FData%2520Analysis%2FLogistic%2520regression%2520with%2520categorical%2520data.ipynb&sa=D&sntz=1&usg=AFQjCNE7gDkrVcPcy6d4EeqtRixVhB017A) notebook created by [Alexej](http://github.com/agisga)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Logistic regression with categorical data\n", "\n", "We aim to fit a logistic regression model to the [shelter animal data](https://www.kaggle.com/c/shelter-animal-outcomes) from [Kaggle](https://www.kaggle.com/competitions) using the Ruby gems `daru` and `statsample-glm`.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's first load the data." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "application/javascript": [ "if(window['d3'] === undefined ||\n", " window['Nyaplot'] === undefined){\n", " var path = {\"d3\":\"https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min\",\"downloadable\":\"https://cdn.rawgit.com/domitry/d3-downloadable/master/d3-downloadable\"};\n", "\n", "\n", "\n", " var shim = {\"d3\":{\"exports\":\"d3\"},\"downloadable\":{\"exports\":\"downloadable\"}};\n", "\n", " require.config({paths: path, shim:shim});\n", "\n", "\n", "require(['d3'], function(d3){window['d3']=d3;console.log('finished loading d3');require(['downloadable'], function(downloadable){window['downloadable']=downloadable;console.log('finished loading downloadable');\n", "\n", "\tvar script = d3.select(\"head\")\n", "\t .append(\"script\")\n", "\t .attr(\"src\", \"https://cdn.rawgit.com/domitry/Nyaplotjs/master/release/nyaplot.js\")\n", "\t .attr(\"async\", 
true);\n", "\n", "\tscript[0][0].onload = script[0][0].onreadystatechange = function(){\n", "\n", "\n", "\t var event = document.createEvent(\"HTMLEvents\");\n", "\t event.initEvent(\"load_nyaplot\",false,false);\n", "\t window.dispatchEvent(event);\n", "\t console.log('Finished loading Nyaplotjs');\n", "\n", "\t};\n", "\n", "\n", "});});\n", "}\n" ], "text/plain": [ "\"if(window['d3'] === undefined ||\\n window['Nyaplot'] === undefined){\\n var path = {\\\"d3\\\":\\\"https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.5/d3.min\\\",\\\"downloadable\\\":\\\"https://cdn.rawgit.com/domitry/d3-downloadable/master/d3-downloadable\\\"};\\n\\n\\n\\n var shim = {\\\"d3\\\":{\\\"exports\\\":\\\"d3\\\"},\\\"downloadable\\\":{\\\"exports\\\":\\\"downloadable\\\"}};\\n\\n require.config({paths: path, shim:shim});\\n\\n\\nrequire(['d3'], function(d3){window['d3']=d3;console.log('finished loading d3');require(['downloadable'], function(downloadable){window['downloadable']=downloadable;console.log('finished loading downloadable');\\n\\n\\tvar script = d3.select(\\\"head\\\")\\n\\t .append(\\\"script\\\")\\n\\t .attr(\\\"src\\\", \\\"https://cdn.rawgit.com/domitry/Nyaplotjs/master/release/nyaplot.js\\\")\\n\\t .attr(\\\"async\\\", true);\\n\\n\\tscript[0][0].onload = script[0][0].onreadystatechange = function(){\\n\\n\\n\\t var event = document.createEvent(\\\"HTMLEvents\\\");\\n\\t event.initEvent(\\\"load_nyaplot\\\",false,false);\\n\\t window.dispatchEvent(event);\\n\\t console.log('Finished loading Nyaplotjs');\\n\\n\\t};\\n\\n\\n});});\\n}\\n\"" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "[26711, 10]\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::DataFrame(3x10)
| | AnimalID | Name | DateTime | OutcomeType | OutcomeSubtype | AnimalType | SexuponOutcome | Breed | Color | AgeuponOutcome(Weeks) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A671945 | Hambone | 2014-02-12 18:22:00 | Return_to_owner | | Dog | Neutered Male | Shetland Sheepdog Mix | Brown/White | 52.0 |
| 1 | A656520 | Emily | 2013-10-13 12:44:00 | Euthanasia | Suffering | Cat | Spayed Female | Domestic Shorthair Mix | Cream Tabby | 52.0 |
| 2 | A686464 | Pearce | 2015-01-31 12:28:00 | Adoption | Foster | Dog | Neutered Male | Pit Bull Mix | Blue/White | 104.0 |
" ], "text/plain": [ "#\n", " AnimalID Name DateTime OutcomeTyp OutcomeSub AnimalType SexuponOut Breed Color AgeuponOut\n", " 0 A671945 Hambone 2014-02-12 Return_to_ nil Dog Neutered M Shetland S Brown/Whit 52.0\n", " 1 A656520 Emily 2013-10-13 Euthanasia Suffering Cat Spayed Fem Domestic S Cream Tabb 52.0\n", " 2 A686464 Pearce 2015-01-31 Adoption Foster Dog Neutered M Pit Bull M Blue/White 104.0" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'daru'\n", "shelter_data = Daru::DataFrame.from_csv 'data/animal_shelter_train.csv'\n", "p shelter_data.shape\n", "shelter_data.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to tell Daru which vectors are categorical. We can do this via [#to_category](http://www.rubydoc.info/github/v0dro/daru/master/Daru/DataFrame#to_category-instance_method)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "shelter_data.to_category 'OutcomeType', 'OutcomeSubtype', 'AnimalType', 'SexuponOutcome', 'Breed', 'Color'\n", "nil" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We create a 0-1-valued indicator for whether the animal was adopted. We will then fit a logistic model to predict whether an animal is adopted or not." 
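, "\n", "\n", "As a rough, library-free illustration of what such an indicator looks like (plain Ruby on a made-up sample of outcome labels, not the Daru API), each label maps to 1 if it equals 'Adoption' and to 0 otherwise:\n", "\n", "
```ruby
# Hypothetical sample of outcome labels, for illustration only;
# the real values live in shelter_data['OutcomeType'].
outcomes = ['Return_to_owner', 'Euthanasia', 'Adoption', 'Transfer', 'Adoption']

# 1 marks an adoption, 0 marks every other outcome.
adopted = outcomes.map { |outcome| outcome == 'Adoption' ? 1 : 0 }

p adopted
```
"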
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::DataFrame(3x11)
| | AnimalID | Name | DateTime | OutcomeType | OutcomeSubtype | AnimalType | SexuponOutcome | Breed | Color | AgeuponOutcome(Weeks) | OutcomeType_Adoption |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A671945 | Hambone | 2014-02-12 18:22:00 | Return_to_owner | | Dog | Neutered Male | Shetland Sheepdog Mix | Brown/White | 52.0 | 0 |
| 1 | A656520 | Emily | 2013-10-13 12:44:00 | Euthanasia | Suffering | Cat | Spayed Female | Domestic Shorthair Mix | Cream Tabby | 52.0 | 0 |
| 2 | A686464 | Pearce | 2015-01-31 12:28:00 | Adoption | Foster | Dog | Neutered Male | Pit Bull Mix | Blue/White | 104.0 | 1 |
" ], "text/plain": [ "#\n", " AnimalID Name DateTime OutcomeTyp OutcomeSub AnimalType SexuponOut Breed Color AgeuponOut OutcomeTyp\n", " 0 A671945 Hambone 2014-02-12 Return_to_ nil Dog Neutered M Shetland S Brown/Whit 52.0 0\n", " 1 A656520 Emily 2013-10-13 Euthanasia Suffering Cat Spayed Fem Domestic S Cream Tabb 52.0 0\n", " 2 A686464 Pearce 2015-01-31 Adoption Foster Dog Neutered M Pit Bull M Blue/White 104.0 1" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shelter_data['OutcomeType_Adoption'] = (shelter_data['OutcomeType'].contrast_code)['OutcomeType_Adoption']\n", "shelter_data.head 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we create a model. Let's do some preprocessing to create an effective model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Some data preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I am using only 600 rows for this Demo because Statsample-GLM is a bit slow in computing." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::DataFrame(3x11)
| | AnimalID | Name | DateTime | OutcomeType | OutcomeSubtype | AnimalType | SexuponOutcome | Breed | Color | AgeuponOutcome(Weeks) | OutcomeType_Adoption |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A671945 | Hambone | 2014-02-12 18:22:00 | Return_to_owner | | Dog | Neutered Male | Shetland Sheepdog Mix | Brown/White | 52.0 | 0 |
| 1 | A656520 | Emily | 2013-10-13 12:44:00 | Euthanasia | Suffering | Cat | Spayed Female | Domestic Shorthair Mix | Cream Tabby | 52.0 | 0 |
| 2 | A686464 | Pearce | 2015-01-31 12:28:00 | Adoption | Foster | Dog | Neutered Male | Pit Bull Mix | Blue/White | 104.0 | 1 |
" ], "text/plain": [ "#\n", " AnimalID Name DateTime OutcomeTyp OutcomeSub AnimalType SexuponOut Breed Color AgeuponOut OutcomeTyp\n", " 0 A671945 Hambone 2014-02-12 Return_to_ nil Dog Neutered M Shetland S Brown/Whit 52.0 0\n", " 1 A656520 Emily 2013-10-13 Euthanasia Suffering Cat Spayed Fem Domestic S Cream Tabb 52.0 0\n", " 2 A686464 Pearce 2015-01-31 Adoption Foster Dog Neutered M Pit Bull M Blue/White 104.0 1" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "small = shelter_data.head 600\n", "small.head 3" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1380\n", "366\n" ] }, { "data": { "text/plain": [ "[1380, 366]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p small['Breed'].categories.size, small['Color'].categories.size\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since, the number of categories in 'Breed' and 'Color' is large, we need club some of these categories." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grouping Breeds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets have a look at the distribution." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::Vector(10)
| | Breed |
|---|---|
| Domestic Shorthair Mix | 204 |
| Chihuahua Shorthair Mix | 47 |
| Pit Bull Mix | 38 |
| Labrador Retriever Mix | 33 |
| Domestic Medium Hair Mix | 17 |
| Siamese Mix | 11 |
| Domestic Longhair Mix | 11 |
| German Shepherd Mix | 10 |
| Australian Cattle Dog Mix | 8 |
| Dachshund Mix | 7 |
" ], "text/plain": [ "#\n", " Breed\n", " Domestic Shorthair M 204\n", " Chihuahua Shorthair 47\n", " Pit Bull Mix 38\n", " Labrador Retriever M 33\n", " Domestic Medium Hair 17\n", " Siamese Mix 11\n", " Domestic Longhair Mi 11\n", " German Shepherd Mix 10\n", " Australian Cattle Do 8\n", " Dachshund Mix 7" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "small['Breed'].frequencies.sort(ascending: false).head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets merge the infrequent occuring categories into single categories 'other' so we can have less number of categories to deal with.\n", "\n", "Here we've used [#rename_categories](http://www.rubydoc.info/github/v0dro/daru/master/Daru/Category#rename_categories-instance_method) which accepts a hash mapping old categories to new one." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::Vector(9)
| | Breed |
|---|---|
| Domestic Shorthair Mix | 204 |
| Pit Bull Mix | 38 |
| German Shepherd Mix | 10 |
| Chihuahua Shorthair Mix | 47 |
| Labrador Retriever Mix | 33 |
| Domestic Longhair Mix | 11 |
| Siamese Mix | 11 |
| Domestic Medium Hair Mix | 17 |
| other | 229 |
" ], "text/plain": [ "#\n", " Breed\n", " Domestic Shorthair M 204\n", " Pit Bull Mix 38\n", " German Shepherd Mix 10\n", " Chihuahua Shorthair 47\n", " Labrador Retriever M 33\n", " Domestic Longhair Mi 11\n", " Siamese Mix 11\n", " Domestic Medium Hair 17\n", " other 229" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "other_cats = small['Breed'].categories.select { |i| small['Breed'].count(i) < 10 }\n", "other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h\n", "small['Breed'].rename_categories other_cats_hash\n", "small['Breed'].frequencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And let's set the base category to 'other'." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "\"other\"" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "small['Breed'].base_category = 'other'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now do the same with 'Colors'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grouping colors" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "366\n" ] }, { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::Vector(10)
| | Color |
|---|---|
| Black/White | 66 |
| Black | 52 |
| Brown Tabby | 37 |
| Tricolor | 22 |
| Brown/White | 21 |
| Brown Tabby/White | 20 |
| Calico | 19 |
| White | 19 |
| Tan/White | 18 |
| Brown | 16 |
" ], "text/plain": [ "#\n", " Color\n", " Black/White 66\n", " Black 52\n", " Brown Tabby 37\n", " Tricolor 22\n", " Brown/White 21\n", " Brown Tabby/White 20\n", " Calico 19\n", " White 19\n", " Tan/White 18\n", " Brown 16" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p small['Color'].categories.size\n", "small['Color'].frequencies.sort(ascending: false).head 10" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::Vector(24)
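| | Color |
|---|---|
| Brown/White | 21 |
| Blue/White | 12 |
| Tan | 11 |
| Black/Tan | 14 |
| Blue Tabby | 10 |
| Brown Tabby | 37 |
| White | 19 |
| Black | 52 |
| Brown | 16 |
| Orange Tabby/White | 14 |
| Black/White | 66 |
| Brown Brindle/White | 10 |
| Orange Tabby | 15 |
| Chocolate/White | 11 |
| Blue | 10 |
| Calico | 19 |
| Brown/Black | 11 |
| Tricolor | 22 |
| White/Black | 10 |
| Tortie | 13 |
| Tan/White | 18 |
| Brown Tabby/White | 20 |
| White/Brown | 13 |
| other | 156 |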
" ], "text/plain": [ "#\n", " Color\n", " Brown/White 21\n", " Blue/White 12\n", " Tan 11\n", " Black/Tan 14\n", " Blue Tabby 10\n", " Brown Tabby 37\n", " White 19\n", " Black 52\n", " Brown 16\n", " Orange Tabby/White 14\n", " Black/White 66\n", " Brown Brindle/White 10\n", " Orange Tabby 15\n", " Chocolate/White 11\n", " Blue 10\n", " ... ..." ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "other_cats = small['Color'].categories.select { |i| small['Color'].count(i) < 10 }\n", "other_cats_hash = other_cats.zip(['other']*other_cats.size).to_h\n", "small['Color'].rename_categories other_cats_hash\n", "small['Color'].frequencies" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "\"other\"" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "small['Color'].base_category = 'other'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Looking at SexuponOutcome" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::Vector(6)
| | SexuponOutcome |
|---|---|
| Neutered Male | 216 |
| Spayed Female | 205 |
| Intact Male | 78 |
| Intact Female | 77 |
| Unknown | 24 |
| | 0 |
" ], "text/plain": [ "#\n", " SexuponOutcome\n", " Neutered Male 216\n", " Spayed Female 205\n", " Intact Male 78\n", " Intact Female 77\n", " Unknown 24\n", " 0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "small['SexuponOutcome'].frequencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last row tells us that there is a category 'nil', although it has no entries among these 600 rows. Let's merge it into 'Unknown', since 'Unknown' already holds all the unknown values." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"Neutered Male\", \"Spayed Female\", \"Intact Male\", \"Intact Female\", \"Unknown\", nil]\n" ] }, { "data": { "text/plain": [ "[\"Neutered Male\", \"Spayed Female\", \"Intact Male\", \"Intact Female\", \"Unknown\"]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "p small['SexuponOutcome'].categories\n", "small['SexuponOutcome'].rename_categories nil => 'Unknown'\n", "small['SexuponOutcome'].categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split into train and test" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "500\n", "100\n" ] }, { "data": { "text/plain": [ "[500, 100]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train = small.head 500\n", "test = small.tail 100\n", "p train.size, test.size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model fit\n", "\n", "Now, having put the data in an appropriate form, we can fit the logistic regression model with `statsample-glm`." 
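, "\n", "\n", "For intuition, a logistic regression model turns a linear combination of the predictors into a probability via the logistic function, and a 0.5 threshold then gives a 0/1 prediction. The plain-Ruby sketch below uses a hypothetical helper named logistic together with made-up coefficients and predictor names; it does not use statsample-glm and only illustrates the arithmetic:\n", "\n", "
```ruby
# Logistic (sigmoid) function: maps any real number into the interval (0, 1).
def logistic(z)
  1.0 / (1.0 + Math.exp(-z))
end

# Hypothetical intercept and coefficients, for illustration only.
intercept = -0.5
coefficients = { 'is_cat' => 0.8, 'age_weeks' => -0.007 }

# Hypothetical animal: a cat that is 52 weeks old.
animal = { 'is_cat' => 1, 'age_weeks' => 52 }

# Linear predictor: intercept plus the sum of coefficient * predictor value.
linear = coefficients.inject(intercept) { |acc, (name, beta)| acc + beta * animal[name] }

# Probability of adoption under these made-up numbers, then a 0/1 label.
probability = logistic(linear)
prediction = probability < 0.5 ? 0 : 1

p probability, prediction
```
"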
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "\"Trivial accuracy = 0.5900000000000001\"" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "m = test['OutcomeType_Adoption'].mean\n", "\"Trivial accuracy = #{[m, 1-m].max}\"" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{:AnimalType_Cat=>0.8376443692275163, :\"Breed_Pit Bull Mix\"=>0.28200753488859803, :\"Breed_German Shepherd Mix\"=>1.0518504638731023, :\"Breed_Chihuahua Shorthair Mix\"=>1.1960242033878856, :\"Breed_Labrador Retriever Mix\"=>0.445803000000512, :\"Breed_Domestic Longhair Mix\"=>1.898703165797653, :\"Breed_Siamese Mix\"=>1.5248210169271197, :\"Breed_Domestic Medium Hair Mix\"=>-0.19504965010288533, :Breed_other=>0.7895601504638325, :\"Color_Blue/White\"=>0.3748263925801828, :Color_Tan=>0.11356334165122918, :\"Color_Black/Tan\"=>-2.6507089126322114, :\"Color_Blue Tabby\"=>0.5234717706465536, :\"Color_Brown Tabby\"=>0.9046099720184905, :Color_White=>0.07739310267363662, :Color_Black=>0.859906249787038, :Color_Brown=>-0.003740755055106689, :\"Color_Orange Tabby/White\"=>0.2336674067343927, :\"Color_Black/White\"=>0.22564205490196415, :\"Color_Brown Brindle/White\"=>-0.6744314269278774, :\"Color_Orange Tabby\"=>2.063785952843677, :\"Color_Chocolate/White\"=>0.6417921901449108, :Color_Blue=>-2.1969040091451704, :Color_Calico=>-0.08386525532631824, :\"Color_Brown/Black\"=>0.35936722899161305, :Color_Tricolor=>-0.11440457799048752, :\"Color_White/Black\"=>-2.3593561796090383, :Color_Tortie=>-0.4325130799770577, :\"Color_Tan/White\"=>0.09637439333330515, :\"Color_Brown Tabby/White\"=>0.12304448360566177, :\"Color_White/Brown\"=>0.5867441296328475, :Color_other=>0.08821407092892847, :\"SexuponOutcome_Spayed Female\"=>0.32626712478395975, :\"SexuponOutcome_Intact Male\"=>-3.971505056680895, 
:\"SexuponOutcome_Intact Female\"=>-3.619095491410668, :SexuponOutcome_Unknown=>-102.73807712615843, :\"AgeuponOutcome(Weeks)\"=>-0.006959545305620043}" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'statsample-glm'\n", "\n", "formula = 'OutcomeType_Adoption~AnimalType+Breed+AgeuponOutcome(Weeks)+Color+SexuponOutcome'\n", "glm_adoption = Statsample::GLM::Regression.new formula, train, :logistic\n", "glm_adoption.df_for_regression.head 5\n", "glm_adoption.model.coefficients :hash" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also predict using the model we just created." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::Vector(5)
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |
| 3 | 0 |
| 4 | 0 |
" ], "text/plain": [ "#\n", " 0 0\n", " 1 0\n", " 2 1\n", " 3 0\n", " 4 0" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predict = glm_adoption.predict test\n", "predict.map! { |i| i < 0.5 ? 0 : 1 }\n", "predict.head 5" ] } ], "metadata": { "kernelspec": { "display_name": "Ruby 2.3.0", "language": "ruby", "name": "ruby" }, "language_info": { "file_extension": ".rb", "mimetype": "application/x-ruby", "name": "ruby", "version": "2.3.0" } }, "nbformat": 4, "nbformat_minor": 0 }