{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook describes how one can use categorical data. The applicability is limited now because regression is not yet supported with Categorical Data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "true" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'daru'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is [animal shelter data](https://www.kaggle.com/c/shelter-animal-outcomes) taken from [kaggle](https://www.kaggle.com/competitions) compeption.\n", "\n", "Its animals that are given up by their owner to a shelter. Lets gain some insight about this data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::DataFrame(3x10)
AnimalIDNameDateTimeOutcomeTypeOutcomeSubtypeAnimalTypeSexuponOutcomeBreedColorAgeuponOutcome(Weeks)
0A671945Hambone2014-02-12 18:22:00Return_to_ownerDogNeutered MaleShetland Sheepdog MixBrown/White52.0
1A656520Emily2013-10-13 12:44:00EuthanasiaSufferingCatSpayed FemaleDomestic Shorthair MixCream Tabby52.0
2A686464Pearce2015-01-31 12:28:00AdoptionFosterDogNeutered MalePit Bull MixBlue/White104.0
" ], "text/plain": [ "#\n", " AnimalID Name DateTime OutcomeTyp OutcomeSub AnimalType SexuponOut Breed Color AgeuponOut\n", " 0 A671945 Hambone 2014-02-12 Return_to_ nil Dog Neutered M Shetland S Brown/Whit 52.0\n", " 1 A656520 Emily 2013-10-13 Euthanasia Suffering Cat Spayed Fem Domestic S Cream Tabb 52.0\n", " 2 A686464 Pearce 2015-01-31 Adoption Foster Dog Neutered M Pit Bull M Blue/White 104.0" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shelter_data = Daru::DataFrame.from_csv '../data/animal_shelter_train.csv'\n", "shelter_data.head(3)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[26711, 10]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shelter_data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are not interested in `DateTime`, `AnimalID` and `OutcomeSubtype` so we will delete them.\n", "\n", "Since `OutcomeType`, `AnimalType`, `SexuponOutcome`, `Breed` and `Color` are qualitative variable, we'll convert them to type category." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::DataFrame(5x7)
NameOutcomeTypeAnimalTypeSexuponOutcomeBreedColorAgeuponOutcome(Weeks)
0HamboneReturn_to_ownerDogNeutered MaleShetland Sheepdog MixBrown/White52.0
1EmilyEuthanasiaCatSpayed FemaleDomestic Shorthair MixCream Tabby52.0
2PearceAdoptionDogNeutered MalePit Bull MixBlue/White104.0
3TransferCatIntact MaleDomestic Shorthair MixBlue Cream3.0
4TransferDogNeutered MaleLhasa Apso/Miniature PoodleTan104.0
" ], "text/plain": [ "#\n", " Name OutcomeTyp AnimalType SexuponOut Breed Color AgeuponOut\n", " 0 Hambone Return_to_ Dog Neutered M Shetland S Brown/Whit 52.0\n", " 1 Emily Euthanasia Cat Spayed Fem Domestic S Cream Tabb 52.0\n", " 2 Pearce Adoption Dog Neutered M Pit Bull M Blue/White 104.0\n", " 3 nil Transfer Cat Intact Mal Domestic S Blue Cream 3.0\n", " 4 nil Transfer Dog Neutered M Lhasa Apso Tan 104.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shelter_data.delete_vectors 'DateTime', 'AnimalID', 'OutcomeSubtype'\n", "shelter_data.to_category 'OutcomeType', 'AnimalType', 'SexuponOutcome', 'Breed', 'Color'\n", "shelter_data.first 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll categorize `AgeuponOutcome(Weeks)` to get quick summary of the ages (as we will see later)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "shelter_data['AgeuponOutcome'] = shelter_data['AgeuponOutcome(Weeks)'].cut [0, 1, 4, 52, 260, 1500], labels: [:less_than_week, :less_than_month, :less_than_year, :one_to_five_years, :more_than__five_years]\n", "shelter_data.delete_vector 'AgeuponOutcome(Weeks)'\n", "nil" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets look at the categories we have formed." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::Vector(5)
one_to_five_years10605
less_than_year9965
more_than__five_years4216
less_than_month1505
less_than_week420
" ], "text/plain": [ "#\n", " one_to_five_years 10605\n", " less_than_year 9965\n", " more_than__five_year 4216\n", " less_than_month 1505\n", " less_than_week 420" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shelter_data['AgeuponOutcome'].frequencies.sort ascending: false" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Say we are interested in looking at percentage of each animals we have having in the shelter." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::Vector(2)
AnimalType
Dog58.38044251431994
Cat41.61955748568006
" ], "text/plain": [ "#\n", " AnimalType\n", " Dog 58.38044251431994\n", " Cat 41.61955748568006" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shelter_data['AnimalType'].frequencies :percentage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tells us that we have 58% of dogs and 41% of cats in out dataset. Lets explore further." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets look at what are the possible outcomes along with their frequencies." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::Vector(5)
OutcomeType
Return_to_owner4786
Euthanasia1553
Adoption10769
Transfer9406
Died197
" ], "text/plain": [ "#\n", " OutcomeType\n", " Return_to_owner 4786\n", " Euthanasia 1553\n", " Adoption 10769\n", " Transfer 9406\n", " Died 197" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shelter_data['OutcomeType'].frequencies" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, a large amount of these animals are adopted which is great." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets get some insight into animals who died." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::Vector(2)
AnimalType
Dog25.380710659898476
Cat74.61928934010153
" ], "text/plain": [ "#\n", " AnimalType\n", " Dog 25.380710659898476\n", " Cat 74.61928934010153" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "died = shelter_data.where shelter_data['OutcomeType'].eq('Died')\n", "died['AnimalType'].frequencies :percentage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmm.. Cats are more prone to die than dogs. We can say this because cats to dog ratio is almost the same in the dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets have some insight into ages of cats and dogs that died." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::Vector(5)
less_than_week12.0
less_than_month4.0
less_than_year24.0
one_to_five_years40.0
more_than__five_years20.0
" ], "text/plain": [ "#\n", " less_than_week 12.0\n", " less_than_month 4.0\n", " less_than_year 24.0\n", " one_to_five_years 40.0\n", " more_than__five_year 20.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "died.where(died['AnimalType'].eq 'Dog')['AgeuponOutcome'].frequencies :percentage" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::Vector(5)
less_than_week11.564625850340136
less_than_month12.244897959183673
less_than_year57.14285714285714
one_to_five_years12.244897959183673
more_than__five_years6.802721088435375
" ], "text/plain": [ "#\n", " less_than_week 11.564625850340136\n", " less_than_month 12.244897959183673\n", " less_than_year 57.14285714285714\n", " one_to_five_years 12.244897959183673\n", " more_than__five_year 6.802721088435375" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "died.where(died['AnimalType'].eq 'Cat')['AgeuponOutcome'].frequencies :percentage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also younger cats are more prone to die." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets move our attention to animals which got adopted." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::Vector(2)
AnimalType
Dog60.33057851239669
Cat39.66942148760331
" ], "text/plain": [ "#\n", " AnimalType\n", " Dog 60.33057851239669\n", " Cat 39.66942148760331" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adopted = shelter_data.where shelter_data['OutcomeType'].eq('Adoption')\n", "adopted['AnimalType'].frequencies :percentage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmm... Dogs are more likely to be adopted, maybe that explains why so many cats die." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lets now look at those animals which got adopted by their owner back." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", " \n", "
Daru::Vector(2)
AnimalType
Dog89.55286251567071
Cat10.447137484329295
" ], "text/plain": [ "#\n", " AnimalType\n", " Dog 89.55286251567071\n", " Cat 10.447137484329295" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "owner = shelter_data.where shelter_data['OutcomeType'].eq('Return_to_owner')\n", "owner['AnimalType'].frequencies :percentage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Astonishingly 90% of dogs returns to their owner while only 10% of cats do." ] } ], "metadata": { "kernelspec": { "display_name": "Ruby 2.3.0", "language": "ruby", "name": "ruby" }, "language_info": { "file_extension": ".rb", "mimetype": "application/x-ruby", "name": "ruby", "version": "2.3.0" } }, "nbformat": 4, "nbformat_minor": 0 }