{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Creation and Text Mining in Python\n", "For the majority of our work we have focused on feature creation in R. Now we are going to increase our use of Python. To get started, we import NumPy and pandas and load the Titanic training and test data." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
"<tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
"<tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>5</th><td>6</td><td>0</td><td>3</td><td>Moran, Mr. James</td><td>male</td><td>NaN</td><td>0</td><td>0</td><td>330877</td><td>8.4583</td><td>NaN</td><td>Q</td></tr>\n",
"<tr><th>6</th><td>7</td><td>0</td><td>1</td><td>McCarthy, Mr. Timothy J</td><td>male</td><td>54</td><td>0</td><td>0</td><td>17463</td><td>51.8625</td><td>E46</td><td>S</td></tr>\n",
"<tr><th>7</th><td>8</td><td>0</td><td>3</td><td>Palsson, Master. Gosta Leonard</td><td>male</td><td>2</td><td>3</td><td>1</td><td>349909</td><td>21.0750</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>8</th><td>9</td><td>1</td><td>3</td><td>Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)</td><td>female</td><td>27</td><td>0</td><td>2</td><td>347742</td><td>11.1333</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>9</th><td>10</td><td>1</td><td>2</td><td>Nasser, Mrs. Nicholas (Adele Achem)</td><td>female</td><td>14</td><td>1</td><td>0</td><td>237736</td><td>30.0708</td><td>NaN</td><td>C</td></tr>\n",
"<tr><th>10</th><td>11</td><td>1</td><td>3</td><td>Sandstrom, Miss. Marguerite Rut</td><td>female</td><td>4</td><td>1</td><td>1</td><td>PP 9549</td><td>16.7000</td><td>G6</td><td>S</td></tr>\n",
"<tr><th>11</th><td>12</td><td>1</td><td>1</td><td>Bonnell, Miss. Elizabeth</td><td>female</td><td>58</td><td>0</td><td>0</td><td>113783</td><td>26.5500</td><td>C103</td><td>S</td></tr>\n",
"<tr><th>12</th><td>13</td><td>0</td><td>3</td><td>Saundercock, Mr. William Henry</td><td>male</td><td>20</td><td>0</td><td>0</td><td>A/5. 2151</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>13</th><td>14</td><td>0</td><td>3</td><td>Andersson, Mr. Anders Johan</td><td>male</td><td>39</td><td>1</td><td>5</td><td>347082</td><td>31.2750</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>14</th><td>15</td><td>0</td><td>3</td><td>Vestrom, Miss. Hulda Amanda Adolfina</td><td>female</td><td>14</td><td>0</td><td>0</td><td>350406</td><td>7.8542</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>15</th><td>16</td><td>1</td><td>2</td><td>Hewlett, Mrs. (Mary D Kingcome)</td><td>female</td><td>55</td><td>0</td><td>0</td><td>248706</td><td>16.0000</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>16</th><td>17</td><td>0</td><td>3</td><td>Rice, Master. Eugene</td><td>male</td><td>2</td><td>4</td><td>1</td><td>382652</td><td>29.1250</td><td>NaN</td><td>Q</td></tr>\n",
"<tr><th>17</th><td>18</td><td>1</td><td>2</td><td>Williams, Mr. Charles Eugene</td><td>male</td><td>NaN</td><td>0</td><td>0</td><td>244373</td><td>13.0000</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>18</th><td>19</td><td>0</td><td>3</td><td>Vander Planke, Mrs. Julius (Emelia Maria Vande...</td><td>female</td><td>31</td><td>1</td><td>0</td><td>345763</td><td>18.0000</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>19</th><td>20</td><td>1</td><td>3</td><td>Masselmani, Mrs. Fatima</td><td>female</td><td>NaN</td><td>0</td><td>0</td><td>2649</td><td>7.2250</td><td>NaN</td><td>C</td></tr>\n",
"<tr><th>20</th><td>21</td><td>0</td><td>2</td><td>Fynney, Mr. Joseph J</td><td>male</td><td>35</td><td>0</td><td>0</td><td>239865</td><td>26.0000</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>21</th><td>22</td><td>1</td><td>2</td><td>Beesley, Mr. Lawrence</td><td>male</td><td>34</td><td>0</td><td>0</td><td>248698</td><td>13.0000</td><td>D56</td><td>S</td></tr>\n",
"<tr><th>22</th><td>23</td><td>1</td><td>3</td><td>McGowan, Miss. Anna \"Annie\"</td><td>female</td><td>15</td><td>0</td><td>0</td><td>330923</td><td>8.0292</td><td>NaN</td><td>Q</td></tr>\n",
"<tr><th>23</th><td>24</td><td>1</td><td>1</td><td>Sloper, Mr. William Thompson</td><td>male</td><td>28</td><td>0</td><td>0</td><td>113788</td><td>35.5000</td><td>A6</td><td>S</td></tr>\n",
"<tr><th>24</th><td>25</td><td>0</td><td>3</td><td>Palsson, Miss. Torborg Danira</td><td>female</td><td>8</td><td>3</td><td>1</td><td>349909</td><td>21.0750</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>25</th><td>26</td><td>1</td><td>3</td><td>Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...</td><td>female</td><td>38</td><td>1</td><td>5</td><td>347077</td><td>31.3875</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>26</th><td>27</td><td>0</td><td>3</td><td>Emir, Mr. Farred Chehab</td><td>male</td><td>NaN</td><td>0</td><td>0</td><td>2631</td><td>7.2250</td><td>NaN</td><td>C</td></tr>\n",
"<tr><th>27</th><td>28</td><td>0</td><td>1</td><td>Fortune, Mr. Charles Alexander</td><td>male</td><td>19</td><td>3</td><td>2</td><td>19950</td><td>263.0000</td><td>C23 C25 C27</td><td>S</td></tr>\n",
"<tr><th>28</th><td>29</td><td>1</td><td>3</td><td>O'Dwyer, Miss. Ellen \"Nellie\"</td><td>female</td><td>NaN</td><td>0</td><td>0</td><td>330959</td><td>7.8792</td><td>NaN</td><td>Q</td></tr>\n",
"<tr><th>29</th><td>30</td><td>0</td><td>3</td><td>Todoroff, Mr. Lalio</td><td>male</td><td>NaN</td><td>0</td><td>0</td><td>349216</td><td>7.8958</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>861</th><td>862</td><td>0</td><td>2</td><td>Giles, Mr. Frederick Edward</td><td>male</td><td>21</td><td>1</td><td>0</td><td>28134</td><td>11.5000</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>862</th><td>863</td><td>1</td><td>1</td><td>Swift, Mrs. Frederick Joel (Margaret Welles Ba...</td><td>female</td><td>48</td><td>0</td><td>0</td><td>17466</td><td>25.9292</td><td>D17</td><td>S</td></tr>\n",
"<tr><th>863</th><td>864</td><td>0</td><td>3</td><td>Sage, Miss. Dorothy Edith \"Dolly\"</td><td>female</td><td>NaN</td><td>8</td><td>2</td><td>CA. 2343</td><td>69.5500</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>864</th><td>865</td><td>0</td><td>2</td><td>Gill, Mr. John William</td><td>male</td><td>24</td><td>0</td><td>0</td><td>233866</td><td>13.0000</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>865</th><td>866</td><td>1</td><td>2</td><td>Bystrom, Mrs. (Karolina)</td><td>female</td><td>42</td><td>0</td><td>0</td><td>236852</td><td>13.0000</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>866</th><td>867</td><td>1</td><td>2</td><td>Duran y More, Miss. Asuncion</td><td>female</td><td>27</td><td>1</td><td>0</td><td>SC/PARIS 2149</td><td>13.8583</td><td>NaN</td><td>C</td></tr>\n",
"<tr><th>867</th><td>868</td><td>0</td><td>1</td><td>Roebling, Mr. Washington Augustus II</td><td>male</td><td>31</td><td>0</td><td>0</td><td>PC 17590</td><td>50.4958</td><td>A24</td><td>S</td></tr>\n",
"<tr><th>868</th><td>869</td><td>0</td><td>3</td><td>van Melkebeke, Mr. Philemon</td><td>male</td><td>NaN</td><td>0</td><td>0</td><td>345777</td><td>9.5000</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>869</th><td>870</td><td>1</td><td>3</td><td>Johnson, Master. Harold Theodor</td><td>male</td><td>4</td><td>1</td><td>1</td><td>347742</td><td>11.1333</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>870</th><td>871</td><td>0</td><td>3</td><td>Balkic, Mr. Cerin</td><td>male</td><td>26</td><td>0</td><td>0</td><td>349248</td><td>7.8958</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>871</th><td>872</td><td>1</td><td>1</td><td>Beckwith, Mrs. Richard Leonard (Sallie Monypeny)</td><td>female</td><td>47</td><td>1</td><td>1</td><td>11751</td><td>52.5542</td><td>D35</td><td>S</td></tr>\n",
"<tr><th>872</th><td>873</td><td>0</td><td>1</td><td>Carlsson, Mr. Frans Olof</td><td>male</td><td>33</td><td>0</td><td>0</td><td>695</td><td>5.0000</td><td>B51 B53 B55</td><td>S</td></tr>\n",
"<tr><th>873</th><td>874</td><td>0</td><td>3</td><td>Vander Cruyssen, Mr. Victor</td><td>male</td><td>47</td><td>0</td><td>0</td><td>345765</td><td>9.0000</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>874</th><td>875</td><td>1</td><td>2</td><td>Abelson, Mrs. Samuel (Hannah Wizosky)</td><td>female</td><td>28</td><td>1</td><td>0</td><td>P/PP 3381</td><td>24.0000</td><td>NaN</td><td>C</td></tr>\n",
"<tr><th>875</th><td>876</td><td>1</td><td>3</td><td>Najib, Miss. Adele Kiamie \"Jane\"</td><td>female</td><td>15</td><td>0</td><td>0</td><td>2667</td><td>7.2250</td><td>NaN</td><td>C</td></tr>\n",
"<tr><th>876</th><td>877</td><td>0</td><td>3</td><td>Gustafsson, Mr. Alfred Ossian</td><td>male</td><td>20</td><td>0</td><td>0</td><td>7534</td><td>9.8458</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>877</th><td>878</td><td>0</td><td>3</td><td>Petroff, Mr. Nedelio</td><td>male</td><td>19</td><td>0</td><td>0</td><td>349212</td><td>7.8958</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>878</th><td>879</td><td>0</td><td>3</td><td>Laleff, Mr. Kristo</td><td>male</td><td>NaN</td><td>0</td><td>0</td><td>349217</td><td>7.8958</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>879</th><td>880</td><td>1</td><td>1</td><td>Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)</td><td>female</td><td>56</td><td>0</td><td>1</td><td>11767</td><td>83.1583</td><td>C50</td><td>C</td></tr>\n",
"<tr><th>880</th><td>881</td><td>1</td><td>2</td><td>Shelley, Mrs. William (Imanita Parrish Hall)</td><td>female</td><td>25</td><td>0</td><td>1</td><td>230433</td><td>26.0000</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>881</th><td>882</td><td>0</td><td>3</td><td>Markun, Mr. Johann</td><td>male</td><td>33</td><td>0</td><td>0</td><td>349257</td><td>7.8958</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>882</th><td>883</td><td>0</td><td>3</td><td>Dahlberg, Miss. Gerda Ulrika</td><td>female</td><td>22</td><td>0</td><td>0</td><td>7552</td><td>10.5167</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>883</th><td>884</td><td>0</td><td>2</td><td>Banfield, Mr. Frederick James</td><td>male</td><td>28</td><td>0</td><td>0</td><td>C.A./SOTON 34068</td><td>10.5000</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>884</th><td>885</td><td>0</td><td>3</td><td>Sutehall, Mr. Henry Jr</td><td>male</td><td>25</td><td>0</td><td>0</td><td>SOTON/OQ 392076</td><td>7.0500</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>885</th><td>886</td><td>0</td><td>3</td><td>Rice, Mrs. William (Margaret Norton)</td><td>female</td><td>39</td><td>0</td><td>5</td><td>382652</td><td>29.1250</td><td>NaN</td><td>Q</td></tr>\n",
"<tr><th>886</th><td>887</td><td>0</td><td>2</td><td>Montvila, Rev. Juozas</td><td>male</td><td>27</td><td>0</td><td>0</td><td>211536</td><td>13.0000</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>887</th><td>888</td><td>1</td><td>1</td><td>Graham, Miss. Margaret Edith</td><td>female</td><td>19</td><td>0</td><td>0</td><td>112053</td><td>30.0000</td><td>B42</td><td>S</td></tr>\n",
"<tr><th>888</th><td>889</td><td>0</td><td>3</td><td>Johnston, Miss. Catherine Helen \"Carrie\"</td><td>female</td><td>NaN</td><td>1</td><td>2</td><td>W./C. 6607</td><td>23.4500</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>889</th><td>890</td><td>1</td><td>1</td><td>Behr, Mr. Karl Howell</td><td>male</td><td>26</td><td>0</td><td>0</td><td>111369</td><td>30.0000</td><td>C148</td><td>C</td></tr>\n",
"<tr><th>890</th><td>891</td><td>0</td><td>3</td><td>Dooley, Mr. Patrick</td><td>male</td><td>32</td><td>0</td><td>0</td><td>370376</td><td>7.7500</td><td>NaN</td><td>Q</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>891 rows × 12 columns</p>\n",
"</div>
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "5 6 0 3 \n", "6 7 0 1 \n", "7 8 0 3 \n", "8 9 1 3 \n", "9 10 1 2 \n", "10 11 1 3 \n", "11 12 1 1 \n", "12 13 0 3 \n", "13 14 0 3 \n", "14 15 0 3 \n", "15 16 1 2 \n", "16 17 0 3 \n", "17 18 1 2 \n", "18 19 0 3 \n", "19 20 1 3 \n", "20 21 0 2 \n", "21 22 1 2 \n", "22 23 1 3 \n", "23 24 1 1 \n", "24 25 0 3 \n", "25 26 1 3 \n", "26 27 0 3 \n", "27 28 0 1 \n", "28 29 1 3 \n", "29 30 0 3 \n", ".. ... ... ... \n", "861 862 0 2 \n", "862 863 1 1 \n", "863 864 0 3 \n", "864 865 0 2 \n", "865 866 1 2 \n", "866 867 1 2 \n", "867 868 0 1 \n", "868 869 0 3 \n", "869 870 1 3 \n", "870 871 0 3 \n", "871 872 1 1 \n", "872 873 0 1 \n", "873 874 0 3 \n", "874 875 1 2 \n", "875 876 1 3 \n", "876 877 0 3 \n", "877 878 0 3 \n", "878 879 0 3 \n", "879 880 1 1 \n", "880 881 1 2 \n", "881 882 0 3 \n", "882 883 0 3 \n", "883 884 0 2 \n", "884 885 0 3 \n", "885 886 0 3 \n", "886 887 0 2 \n", "887 888 1 1 \n", "888 889 0 3 \n", "889 890 1 1 \n", "890 891 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n", "2 Heikkinen, Miss. Laina female 26 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 \n", "4 Allen, Mr. William Henry male 35 0 \n", "5 Moran, Mr. James male NaN 0 \n", "6 McCarthy, Mr. Timothy J male 54 0 \n", "7 Palsson, Master. Gosta Leonard male 2 3 \n", "8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 \n", "9 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 \n", "10 Sandstrom, Miss. Marguerite Rut female 4 1 \n", "11 Bonnell, Miss. Elizabeth female 58 0 \n", "12 Saundercock, Mr. William Henry male 20 0 \n", "13 Andersson, Mr. Anders Johan male 39 1 \n", "14 Vestrom, Miss. Hulda Amanda Adolfina female 14 0 \n", "15 Hewlett, Mrs. (Mary D Kingcome) female 55 0 \n", "16 Rice, Master. Eugene male 2 4 \n", "17 Williams, Mr. 
Charles Eugene male NaN 0 \n", "18 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31 1 \n", "19 Masselmani, Mrs. Fatima female NaN 0 \n", "20 Fynney, Mr. Joseph J male 35 0 \n", "21 Beesley, Mr. Lawrence male 34 0 \n", "22 McGowan, Miss. Anna \"Annie\" female 15 0 \n", "23 Sloper, Mr. William Thompson male 28 0 \n", "24 Palsson, Miss. Torborg Danira female 8 3 \n", "25 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38 1 \n", "26 Emir, Mr. Farred Chehab male NaN 0 \n", "27 Fortune, Mr. Charles Alexander male 19 3 \n", "28 O'Dwyer, Miss. Ellen \"Nellie\" female NaN 0 \n", "29 Todoroff, Mr. Lalio male NaN 0 \n", ".. ... ... ... ... \n", "861 Giles, Mr. Frederick Edward male 21 1 \n", "862 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48 0 \n", "863 Sage, Miss. Dorothy Edith \"Dolly\" female NaN 8 \n", "864 Gill, Mr. John William male 24 0 \n", "865 Bystrom, Mrs. (Karolina) female 42 0 \n", "866 Duran y More, Miss. Asuncion female 27 1 \n", "867 Roebling, Mr. Washington Augustus II male 31 0 \n", "868 van Melkebeke, Mr. Philemon male NaN 0 \n", "869 Johnson, Master. Harold Theodor male 4 1 \n", "870 Balkic, Mr. Cerin male 26 0 \n", "871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47 1 \n", "872 Carlsson, Mr. Frans Olof male 33 0 \n", "873 Vander Cruyssen, Mr. Victor male 47 0 \n", "874 Abelson, Mrs. Samuel (Hannah Wizosky) female 28 1 \n", "875 Najib, Miss. Adele Kiamie \"Jane\" female 15 0 \n", "876 Gustafsson, Mr. Alfred Ossian male 20 0 \n", "877 Petroff, Mr. Nedelio male 19 0 \n", "878 Laleff, Mr. Kristo male NaN 0 \n", "879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56 0 \n", "880 Shelley, Mrs. William (Imanita Parrish Hall) female 25 0 \n", "881 Markun, Mr. Johann male 33 0 \n", "882 Dahlberg, Miss. Gerda Ulrika female 22 0 \n", "883 Banfield, Mr. Frederick James male 28 0 \n", "884 Sutehall, Mr. Henry Jr male 25 0 \n", "885 Rice, Mrs. William (Margaret Norton) female 39 0 \n", "886 Montvila, Rev. 
Juozas male 27 0 \n", "887 Graham, Miss. Margaret Edith female 19 0 \n", "888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n", "889 Behr, Mr. Karl Howell male 26 0 \n", "890 Dooley, Mr. Patrick male 32 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S \n", "5 0 330877 8.4583 NaN Q \n", "6 0 17463 51.8625 E46 S \n", "7 1 349909 21.0750 NaN S \n", "8 2 347742 11.1333 NaN S \n", "9 0 237736 30.0708 NaN C \n", "10 1 PP 9549 16.7000 G6 S \n", "11 0 113783 26.5500 C103 S \n", "12 0 A/5. 2151 8.0500 NaN S \n", "13 5 347082 31.2750 NaN S \n", "14 0 350406 7.8542 NaN S \n", "15 0 248706 16.0000 NaN S \n", "16 1 382652 29.1250 NaN Q \n", "17 0 244373 13.0000 NaN S \n", "18 0 345763 18.0000 NaN S \n", "19 0 2649 7.2250 NaN C \n", "20 0 239865 26.0000 NaN S \n", "21 0 248698 13.0000 D56 S \n", "22 0 330923 8.0292 NaN Q \n", "23 0 113788 35.5000 A6 S \n", "24 1 349909 21.0750 NaN S \n", "25 5 347077 31.3875 NaN S \n", "26 0 2631 7.2250 NaN C \n", "27 2 19950 263.0000 C23 C25 C27 S \n", "28 0 330959 7.8792 NaN Q \n", "29 0 349216 7.8958 NaN S \n", ".. ... ... ... ... ... \n", "861 0 28134 11.5000 NaN S \n", "862 0 17466 25.9292 D17 S \n", "863 2 CA. 
2343 69.5500 NaN S \n", "864 0 233866 13.0000 NaN S \n", "865 0 236852 13.0000 NaN S \n", "866 0 SC/PARIS 2149 13.8583 NaN C \n", "867 0 PC 17590 50.4958 A24 S \n", "868 0 345777 9.5000 NaN S \n", "869 1 347742 11.1333 NaN S \n", "870 0 349248 7.8958 NaN S \n", "871 1 11751 52.5542 D35 S \n", "872 0 695 5.0000 B51 B53 B55 S \n", "873 0 345765 9.0000 NaN S \n", "874 0 P/PP 3381 24.0000 NaN C \n", "875 0 2667 7.2250 NaN C \n", "876 0 7534 9.8458 NaN S \n", "877 0 349212 7.8958 NaN S \n", "878 0 349217 7.8958 NaN S \n", "879 1 11767 83.1583 C50 C \n", "880 1 230433 26.0000 NaN S \n", "881 0 349257 7.8958 NaN S \n", "882 0 7552 10.5167 NaN S \n", "883 0 C.A./SOTON 34068 10.5000 NaN S \n", "884 0 SOTON/OQ 392076 7.0500 NaN S \n", "885 5 382652 29.1250 NaN Q \n", "886 0 211536 13.0000 NaN S \n", "887 0 112053 30.0000 B42 S \n", "888 2 W./C. 6607 23.4500 NaN S \n", "889 0 111369 30.0000 C148 C \n", "890 0 370376 7.7500 NaN Q \n", "\n", "[891 rows x 12 columns]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# pandas can read a CSV file directly from a URL, so no separate download step is needed.\n", "url = 'https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/master/data/titanic/train.csv'\n", "url2 = 'https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/master/data/titanic/test.csv'\n", "\n", "# Read Age as a float so that missing values can be stored as NaN.\n", "train = pd.read_csv(url, dtype={\"Age\": np.float64})\n", "test = pd.read_csv(url2, dtype={\"Age\": np.float64})\n", "train" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Top of the training data:\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
"<tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
"<tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n", "2 Heikkinen, Miss. Laina female 26 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 \n", "4 Allen, Mr. William Henry male 35 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Show the first five rows of the training data.\n", "print(\"\\n\\nTop of the training data:\")\n", "train.head()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", "Summary statistics for the training data:\n" ] }, { "data": { "text/html": [ "<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Fare</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>count</th><td>891.000000</td><td>891.000000</td><td>891.000000</td><td>714.000000</td><td>891.000000</td><td>891.000000</td><td>891.000000</td></tr>\n",
"<tr><th>mean</th><td>446.000000</td><td>0.383838</td><td>2.308642</td><td>29.699118</td><td>0.523008</td><td>0.381594</td><td>32.204208</td></tr>\n",
"<tr><th>std</th><td>257.353842</td><td>0.486592</td><td>0.836071</td><td>14.526497</td><td>1.102743</td><td>0.806057</td><td>49.693429</td></tr>\n",
"<tr><th>min</th><td>1.000000</td><td>0.000000</td><td>1.000000</td><td>0.420000</td><td>0.000000</td><td>0.000000</td><td>0.000000</td></tr>\n",
"<tr><th>25%</th><td>223.500000</td><td>0.000000</td><td>2.000000</td><td>20.125000</td><td>0.000000</td><td>0.000000</td><td>7.910400</td></tr>\n",
"<tr><th>50%</th><td>446.000000</td><td>0.000000</td><td>3.000000</td><td>28.000000</td><td>0.000000</td><td>0.000000</td><td>14.454200</td></tr>\n",
"<tr><th>75%</th><td>668.500000</td><td>1.000000</td><td>3.000000</td><td>38.000000</td><td>1.000000</td><td>0.000000</td><td>31.000000</td></tr>\n",
"<tr><th>max</th><td>891.000000</td><td>1.000000</td><td>3.000000</td><td>80.000000</td><td>8.000000</td><td>6.000000</td><td>512.329200</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " PassengerId Survived Pclass Age SibSp \\\n", "count 891.000000 891.000000 891.000000 714.000000 891.000000 \n", "mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n", "std 257.353842 0.486592 0.836071 14.526497 1.102743 \n", "min 1.000000 0.000000 1.000000 0.420000 0.000000 \n", "25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n", "50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n", "75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n", "max 891.000000 1.000000 3.000000 80.000000 8.000000 \n", "\n", " Parch Fare \n", "count 891.000000 891.000000 \n", "mean 0.381594 32.204208 \n", "std 0.806057 49.693429 \n", "min 0.000000 0.000000 \n", "25% 0.000000 7.910400 \n", "50% 0.000000 14.454200 \n", "75% 0.000000 31.000000 \n", "max 6.000000 512.329200 " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Summary statistics for the numeric columns. Age has a count of 714 rather than 891 because of missing values.\n", "print(\"\\n\\nSummary statistics for the training data:\")\n", "train.describe()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Show data types\n", "PassengerId int64\n", "Survived int64\n", "Pclass int64\n", "Name object\n", "Sex object\n", "Age float64\n", "SibSp int64\n", "Parch int64\n", "Ticket object\n", "Fare float64\n", "Cabin object\n", "Embarked object\n" ] } ], "source": [ "print(\"Show data types\")\n", "# Iterating over a DataFrame yields its column names.\n", "for col in train:\n", "    print(\"{} {}\".format(col, train[col].dtype))" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 22\n", "1 38\n", "2 26\n", "3 35\n", "4 35\n", "5 NaN\n", "6 54\n", "7 2\n", "8 27\n", "9 14\n", "10 4\n", "11 58\n", "12 20\n", "13 39\n", "14 14\n", "15 55\n", "16 2\n", "17 NaN\n", "18 31\n", "19 NaN\n", "20 35\n", "21 34\n", "22 15\n", "23 28\n", "24 8\n", "25 38\n", "26 NaN\n", "27 19\n", "28 
NaN\n", "29 NaN\n", " ..\n", "861 21\n", "862 48\n", "863 NaN\n", "864 24\n", "865 42\n", "866 27\n", "867 31\n", "868 NaN\n", "869 4\n", "870 26\n", "871 47\n", "872 33\n", "873 47\n", "874 28\n", "875 15\n", "876 20\n", "877 19\n", "878 NaN\n", "879 56\n", "880 25\n", "881 33\n", "882 22\n", "883 28\n", "884 25\n", "885 39\n", "886 27\n", "887 19\n", "888 NaN\n", "889 26\n", "890 32\n", "Name: Age, dtype: float64" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's look at the Age field. \"NaN\" indicates a missing value.\n", "train[\"Age\"]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The Median age is: 28.0 years old.\n" ] }, { "data": { "text/plain": [ "0 22\n", "1 38\n", "2 26\n", "3 35\n", "4 35\n", "5 28\n", "6 54\n", "7 2\n", "8 27\n", "9 14\n", "10 4\n", "11 58\n", "12 20\n", "13 39\n", "14 14\n", "15 55\n", "16 2\n", "17 28\n", "18 31\n", "19 28\n", "20 35\n", "21 34\n", "22 15\n", "23 28\n", "24 8\n", "25 38\n", "26 28\n", "27 19\n", "28 28\n", "29 28\n", " ..\n", "861 21\n", "862 48\n", "863 28\n", "864 24\n", "865 42\n", "866 27\n", "867 31\n", "868 28\n", "869 4\n", "870 26\n", "871 47\n", "872 33\n", "873 47\n", "874 28\n", "875 15\n", "876 20\n", "877 19\n", "878 28\n", "879 56\n", "880 25\n", "881 33\n", "882 22\n", "883 28\n", "884 25\n", "885 39\n", "886 27\n", "887 19\n", "888 28\n", "889 26\n", "890 32\n", "Name: Age, dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now let's recode: replace the missing ages with the median age.\n", "medianAge = train[\"Age\"].median()\n", "print(\"The Median age is: {} years old.\".format(medianAge))\n", "train[\"Age\"] = train[\"Age\"].fillna(medianAge)\n", "\n", "# Option 2: all in one shot! (This is a no-op here because the missing values were already filled above.)\n", "train[\"Age\"] = train[\"Age\"].fillna(train[\"Age\"].median())\n", "train[\"Age\"]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# To recode data, we can use what we know about selecting rows and columns.\n", "# Fill missing ports with \"S\", then map each port of embarkation to an integer code.\n", "train[\"Embarked\"] = train[\"Embarked\"].fillna(\"S\")\n", "train.loc[train[\"Embarked\"] == \"S\", \"EmbarkedRecode\"] = 0\n", "train.loc[train[\"Embarked\"] == \"C\", \"EmbarkedRecode\"] = 1\n", "train.loc[train[\"Embarked\"] == \"Q\", \"EmbarkedRecode\"] = 2" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedEmbarkedRecodeGenderNameLengthAge2Title
0103Braund, Mr. Owen Harrismale2210A/5 211717.2500NaNS00234841
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female3810PC 1759971.2833C85C115114441
2313Heikkinen, Miss. Lainafemale2600STON/O2. 31012827.9250NaNS01226761
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female351011380353.1000C123S014412251
4503Allen, Mr. William Henrymale35003734508.0500NaNS002412251
5603Moran, Mr. Jamesmale28003308778.4583NaNQ20167841
6701McCarthy, Mr. Timothy Jmale54001746351.8625E46S002329161
7803Palsson, Master. Gosta Leonardmale23134990921.0750NaNS003040
8913Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)female270234774211.1333NaNS01497291
91012Nasser, Mrs. Nicholas (Adele Achem)female141023773630.0708NaNC11351961
101113Sandstrom, Miss. Marguerite Rutfemale411PP 954916.7000G6S0131161
111211Bonnell, Miss. Elizabethfemale580011378326.5500C103S012433641
121303Saundercock, Mr. William Henrymale2000A/5. 21518.0500NaNS00304001
131403Andersson, Mr. Anders Johanmale391534708231.2750NaNS002715211
141503Vestrom, Miss. Hulda Amanda Adolfinafemale14003504067.8542NaNS01361961
151612Hewlett, Mrs. (Mary D Kingcome)female550024870616.0000NaNS013230251
161703Rice, Master. Eugenemale24138265229.1250NaNQ202040
171812Williams, Mr. Charles Eugenemale280024437313.0000NaNS00287841
181903Vander Planke, Mrs. Julius (Emelia Maria Vande...female311034576318.0000NaNS01559611
192013Masselmani, Mrs. Fatimafemale280026497.2250NaNC11237841
202102Fynney, Mr. Joseph Jmale350023986526.0000NaNS002012251
212212Beesley, Mr. Lawrencemale340024869813.0000D56S002111561
222313McGowan, Miss. Anna \"Annie\"female15003309238.0292NaNQ21272251
232411Sloper, Mr. William Thompsonmale280011378835.5000A6S00287841
242503Palsson, Miss. Torborg Danirafemale83134990921.0750NaNS0129641
252613Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...female381534707731.3875NaNS015714441
262703Emir, Mr. Farred Chehabmale280026317.2250NaNC10237841
272801Fortune, Mr. Charles Alexandermale193219950263.0000C23 C25 C27S00303611
282913O'Dwyer, Miss. Ellen \"Nellie\"female28003309597.8792NaNQ21297841
293003Todoroff, Mr. Laliomale28003492167.8958NaNS00197841
......................................................
86186202Giles, Mr. Frederick Edwardmale21102813411.5000NaNS00274411
86286311Swift, Mrs. Frederick Joel (Margaret Welles Ba...female48001746625.9292D17S015123041
86386403Sage, Miss. Dorothy Edith \"Dolly\"female2882CA. 234369.5500NaNS01337841
86486502Gill, Mr. John Williammale240023386613.0000NaNS00225761
86586612Bystrom, Mrs. (Karolina)female420023685213.0000NaNS012417641
86686712Duran y More, Miss. Asuncionfemale2710SC/PARIS 214913.8583NaNC11287291
86786801Roebling, Mr. Washington Augustus IImale3100PC 1759050.4958A24S00369611
86886903van Melkebeke, Mr. Philemonmale28003457779.5000NaNS00277841
86987013Johnson, Master. Harold Theodormale41134774211.1333NaNS0031160
87087103Balkic, Mr. Cerinmale26003492487.8958NaNS00176761
87187211Beckwith, Mrs. Richard Leonard (Sallie Monypeny)female47111175152.5542D35S014822091
87287301Carlsson, Mr. Frans Olofmale33006955.0000B51 B53 B55S002410891
87387403Vander Cruyssen, Mr. Victormale47003457659.0000NaNS002722091
87487512Abelson, Mrs. Samuel (Hannah Wizosky)female2810P/PP 338124.0000NaNC11377841
87587613Najib, Miss. Adele Kiamie \"Jane\"female150026677.2250NaNC11322251
87687703Gustafsson, Mr. Alfred Ossianmale200075349.8458NaNS00294001
87787803Petroff, Mr. Nedeliomale19003492127.8958NaNS00203611
87887903Laleff, Mr. Kristomale28003492177.8958NaNS00187841
87988011Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)female56011176783.1583C50C114531361
88088112Shelley, Mrs. William (Imanita Parrish Hall)female250123043326.0000NaNS01446251
88188203Markun, Mr. Johannmale33003492577.8958NaNS001810891
88288303Dahlberg, Miss. Gerda Ulrikafemale2200755210.5167NaNS01284841
88388402Banfield, Mr. Frederick Jamesmale2800C.A./SOTON 3406810.5000NaNS00297841
88488503Sutehall, Mr. Henry Jrmale2500SOTON/OQ 3920767.0500NaNS00226251
88588603Rice, Mrs. William (Margaret Norton)female390538265229.1250NaNQ213615211
88688702Montvila, Rev. Juozasmale270021153613.0000NaNS00217290
88788811Graham, Miss. Margaret Edithfemale190011205330.0000B42S01283611
88888903Johnston, Miss. Catherine Helen \"Carrie\"female2812W./C. 660723.4500NaNS01407841
88989011Behr, Mr. Karl Howellmale260011136930.0000C148C10216761
89089103Dooley, Mr. Patrickmale32003703767.7500NaNQ201910241
\n", "

891 rows × 17 columns

\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "5 6 0 3 \n", "6 7 0 1 \n", "7 8 0 3 \n", "8 9 1 3 \n", "9 10 1 2 \n", "10 11 1 3 \n", "11 12 1 1 \n", "12 13 0 3 \n", "13 14 0 3 \n", "14 15 0 3 \n", "15 16 1 2 \n", "16 17 0 3 \n", "17 18 1 2 \n", "18 19 0 3 \n", "19 20 1 3 \n", "20 21 0 2 \n", "21 22 1 2 \n", "22 23 1 3 \n", "23 24 1 1 \n", "24 25 0 3 \n", "25 26 1 3 \n", "26 27 0 3 \n", "27 28 0 1 \n", "28 29 1 3 \n", "29 30 0 3 \n", ".. ... ... ... \n", "861 862 0 2 \n", "862 863 1 1 \n", "863 864 0 3 \n", "864 865 0 2 \n", "865 866 1 2 \n", "866 867 1 2 \n", "867 868 0 1 \n", "868 869 0 3 \n", "869 870 1 3 \n", "870 871 0 3 \n", "871 872 1 1 \n", "872 873 0 1 \n", "873 874 0 3 \n", "874 875 1 2 \n", "875 876 1 3 \n", "876 877 0 3 \n", "877 878 0 3 \n", "878 879 0 3 \n", "879 880 1 1 \n", "880 881 1 2 \n", "881 882 0 3 \n", "882 883 0 3 \n", "883 884 0 2 \n", "884 885 0 3 \n", "885 886 0 3 \n", "886 887 0 2 \n", "887 888 1 1 \n", "888 889 0 3 \n", "889 890 1 1 \n", "890 891 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n", "2 Heikkinen, Miss. Laina female 26 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 \n", "4 Allen, Mr. William Henry male 35 0 \n", "5 Moran, Mr. James male 28 0 \n", "6 McCarthy, Mr. Timothy J male 54 0 \n", "7 Palsson, Master. Gosta Leonard male 2 3 \n", "8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 \n", "9 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 \n", "10 Sandstrom, Miss. Marguerite Rut female 4 1 \n", "11 Bonnell, Miss. Elizabeth female 58 0 \n", "12 Saundercock, Mr. William Henry male 20 0 \n", "13 Andersson, Mr. Anders Johan male 39 1 \n", "14 Vestrom, Miss. Hulda Amanda Adolfina female 14 0 \n", "15 Hewlett, Mrs. (Mary D Kingcome) female 55 0 \n", "16 Rice, Master. Eugene male 2 4 \n", "17 Williams, Mr. 
Charles Eugene male 28 0 \n", "18 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31 1 \n", "19 Masselmani, Mrs. Fatima female 28 0 \n", "20 Fynney, Mr. Joseph J male 35 0 \n", "21 Beesley, Mr. Lawrence male 34 0 \n", "22 McGowan, Miss. Anna \"Annie\" female 15 0 \n", "23 Sloper, Mr. William Thompson male 28 0 \n", "24 Palsson, Miss. Torborg Danira female 8 3 \n", "25 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38 1 \n", "26 Emir, Mr. Farred Chehab male 28 0 \n", "27 Fortune, Mr. Charles Alexander male 19 3 \n", "28 O'Dwyer, Miss. Ellen \"Nellie\" female 28 0 \n", "29 Todoroff, Mr. Lalio male 28 0 \n", ".. ... ... ... ... \n", "861 Giles, Mr. Frederick Edward male 21 1 \n", "862 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48 0 \n", "863 Sage, Miss. Dorothy Edith \"Dolly\" female 28 8 \n", "864 Gill, Mr. John William male 24 0 \n", "865 Bystrom, Mrs. (Karolina) female 42 0 \n", "866 Duran y More, Miss. Asuncion female 27 1 \n", "867 Roebling, Mr. Washington Augustus II male 31 0 \n", "868 van Melkebeke, Mr. Philemon male 28 0 \n", "869 Johnson, Master. Harold Theodor male 4 1 \n", "870 Balkic, Mr. Cerin male 26 0 \n", "871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47 1 \n", "872 Carlsson, Mr. Frans Olof male 33 0 \n", "873 Vander Cruyssen, Mr. Victor male 47 0 \n", "874 Abelson, Mrs. Samuel (Hannah Wizosky) female 28 1 \n", "875 Najib, Miss. Adele Kiamie \"Jane\" female 15 0 \n", "876 Gustafsson, Mr. Alfred Ossian male 20 0 \n", "877 Petroff, Mr. Nedelio male 19 0 \n", "878 Laleff, Mr. Kristo male 28 0 \n", "879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56 0 \n", "880 Shelley, Mrs. William (Imanita Parrish Hall) female 25 0 \n", "881 Markun, Mr. Johann male 33 0 \n", "882 Dahlberg, Miss. Gerda Ulrika female 22 0 \n", "883 Banfield, Mr. Frederick James male 28 0 \n", "884 Sutehall, Mr. Henry Jr male 25 0 \n", "885 Rice, Mrs. William (Margaret Norton) female 39 0 \n", "886 Montvila, Rev. 
Juozas male 27 0 \n", "887 Graham, Miss. Margaret Edith female 19 0 \n", "888 Johnston, Miss. Catherine Helen \"Carrie\" female 28 1 \n", "889 Behr, Mr. Karl Howell male 26 0 \n", "890 Dooley, Mr. Patrick male 32 0 \n", "\n", " Parch Ticket Fare Cabin Embarked EmbarkedRecode \\\n", "0 0 A/5 21171 7.2500 NaN S 0 \n", "1 0 PC 17599 71.2833 C85 C 1 \n", "2 0 STON/O2. 3101282 7.9250 NaN S 0 \n", "3 0 113803 53.1000 C123 S 0 \n", "4 0 373450 8.0500 NaN S 0 \n", "5 0 330877 8.4583 NaN Q 2 \n", "6 0 17463 51.8625 E46 S 0 \n", "7 1 349909 21.0750 NaN S 0 \n", "8 2 347742 11.1333 NaN S 0 \n", "9 0 237736 30.0708 NaN C 1 \n", "10 1 PP 9549 16.7000 G6 S 0 \n", "11 0 113783 26.5500 C103 S 0 \n", "12 0 A/5. 2151 8.0500 NaN S 0 \n", "13 5 347082 31.2750 NaN S 0 \n", "14 0 350406 7.8542 NaN S 0 \n", "15 0 248706 16.0000 NaN S 0 \n", "16 1 382652 29.1250 NaN Q 2 \n", "17 0 244373 13.0000 NaN S 0 \n", "18 0 345763 18.0000 NaN S 0 \n", "19 0 2649 7.2250 NaN C 1 \n", "20 0 239865 26.0000 NaN S 0 \n", "21 0 248698 13.0000 D56 S 0 \n", "22 0 330923 8.0292 NaN Q 2 \n", "23 0 113788 35.5000 A6 S 0 \n", "24 1 349909 21.0750 NaN S 0 \n", "25 5 347077 31.3875 NaN S 0 \n", "26 0 2631 7.2250 NaN C 1 \n", "27 2 19950 263.0000 C23 C25 C27 S 0 \n", "28 0 330959 7.8792 NaN Q 2 \n", "29 0 349216 7.8958 NaN S 0 \n", ".. ... ... ... ... ... ... \n", "861 0 28134 11.5000 NaN S 0 \n", "862 0 17466 25.9292 D17 S 0 \n", "863 2 CA. 
2343 69.5500 NaN S 0 \n", "864 0 233866 13.0000 NaN S 0 \n", "865 0 236852 13.0000 NaN S 0 \n", "866 0 SC/PARIS 2149 13.8583 NaN C 1 \n", "867 0 PC 17590 50.4958 A24 S 0 \n", "868 0 345777 9.5000 NaN S 0 \n", "869 1 347742 11.1333 NaN S 0 \n", "870 0 349248 7.8958 NaN S 0 \n", "871 1 11751 52.5542 D35 S 0 \n", "872 0 695 5.0000 B51 B53 B55 S 0 \n", "873 0 345765 9.0000 NaN S 0 \n", "874 0 P/PP 3381 24.0000 NaN C 1 \n", "875 0 2667 7.2250 NaN C 1 \n", "876 0 7534 9.8458 NaN S 0 \n", "877 0 349212 7.8958 NaN S 0 \n", "878 0 349217 7.8958 NaN S 0 \n", "879 1 11767 83.1583 C50 C 1 \n", "880 1 230433 26.0000 NaN S 0 \n", "881 0 349257 7.8958 NaN S 0 \n", "882 0 7552 10.5167 NaN S 0 \n", "883 0 C.A./SOTON 34068 10.5000 NaN S 0 \n", "884 0 SOTON/OQ 392076 7.0500 NaN S 0 \n", "885 5 382652 29.1250 NaN Q 2 \n", "886 0 211536 13.0000 NaN S 0 \n", "887 0 112053 30.0000 B42 S 0 \n", "888 2 W./C. 6607 23.4500 NaN S 0 \n", "889 0 111369 30.0000 C148 C 1 \n", "890 0 370376 7.7500 NaN Q 2 \n", "\n", " Gender NameLength Age2 Title \n", "0 0 23 484 1 \n", "1 1 51 1444 1 \n", "2 1 22 676 1 \n", "3 1 44 1225 1 \n", "4 0 24 1225 1 \n", "5 0 16 784 1 \n", "6 0 23 2916 1 \n", "7 0 30 4 0 \n", "8 1 49 729 1 \n", "9 1 35 196 1 \n", "10 1 31 16 1 \n", "11 1 24 3364 1 \n", "12 0 30 400 1 \n", "13 0 27 1521 1 \n", "14 1 36 196 1 \n", "15 1 32 3025 1 \n", "16 0 20 4 0 \n", "17 0 28 784 1 \n", "18 1 55 961 1 \n", "19 1 23 784 1 \n", "20 0 20 1225 1 \n", "21 0 21 1156 1 \n", "22 1 27 225 1 \n", "23 0 28 784 1 \n", "24 1 29 64 1 \n", "25 1 57 1444 1 \n", "26 0 23 784 1 \n", "27 0 30 361 1 \n", "28 1 29 784 1 \n", "29 0 19 784 1 \n", ".. ... ... ... ... 
\n", "861 0 27 441 1 \n", "862 1 51 2304 1 \n", "863 1 33 784 1 \n", "864 0 22 576 1 \n", "865 1 24 1764 1 \n", "866 1 28 729 1 \n", "867 0 36 961 1 \n", "868 0 27 784 1 \n", "869 0 31 16 0 \n", "870 0 17 676 1 \n", "871 1 48 2209 1 \n", "872 0 24 1089 1 \n", "873 0 27 2209 1 \n", "874 1 37 784 1 \n", "875 1 32 225 1 \n", "876 0 29 400 1 \n", "877 0 20 361 1 \n", "878 0 18 784 1 \n", "879 1 45 3136 1 \n", "880 1 44 625 1 \n", "881 0 18 1089 1 \n", "882 1 28 484 1 \n", "883 0 29 784 1 \n", "884 0 22 625 1 \n", "885 1 36 1521 1 \n", "886 0 21 729 0 \n", "887 1 28 361 1 \n", "888 1 40 784 1 \n", "889 0 21 676 1 \n", "890 0 19 1024 1 \n", "\n", "[891 rows x 17 columns]" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We can also use a lambda (anonymous) function to derive new features.\n", "# You can read more about lambda functions here:\n", "# http://www.python-course.eu/lambda.php\n", "gender_fn = lambda x: 0 if x == 'male' else 1\n", "train['Gender'] = train['Sex'].map(gender_fn)\n", "\n", "# Or we can define the lambda inline and do it in one step.\n", "train['NameLength'] = train['Name'].map(lambda x: len(x))\n", "train['Age2'] = train['Age'].map(lambda x: x*x)\n", "train" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarkedEmbarkedRecodeGenderNameLengthAge2Title
0103Braund, Mr. Owen Harrismale2210A/5 211717.2500NaNS00234841
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female3810PC 1759971.2833C85C115114441
2313Heikkinen, Miss. Lainafemale2600STON/O2. 31012827.9250NaNS01226761
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female351011380353.1000C123S014412251
4503Allen, Mr. William Henrymale35003734508.0500NaNS002412251
5603Moran, Mr. Jamesmale28003308778.4583NaNQ20167841
6701McCarthy, Mr. Timothy Jmale54001746351.8625E46S002329161
7803Palsson, Master. Gosta Leonardmale23134990921.0750NaNS003040
8913Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)female270234774211.1333NaNS01497291
91012Nasser, Mrs. Nicholas (Adele Achem)female141023773630.0708NaNC11351961
101113Sandstrom, Miss. Marguerite Rutfemale411PP 954916.7000G6S0131161
111211Bonnell, Miss. Elizabethfemale580011378326.5500C103S012433641
121303Saundercock, Mr. William Henrymale2000A/5. 21518.0500NaNS00304001
131403Andersson, Mr. Anders Johanmale391534708231.2750NaNS002715211
141503Vestrom, Miss. Hulda Amanda Adolfinafemale14003504067.8542NaNS01361961
151612Hewlett, Mrs. (Mary D Kingcome)female550024870616.0000NaNS013230251
161703Rice, Master. Eugenemale24138265229.1250NaNQ202040
171812Williams, Mr. Charles Eugenemale280024437313.0000NaNS00287841
181903Vander Planke, Mrs. Julius (Emelia Maria Vande...female311034576318.0000NaNS01559611
192013Masselmani, Mrs. Fatimafemale280026497.2250NaNC11237841
202102Fynney, Mr. Joseph Jmale350023986526.0000NaNS002012251
212212Beesley, Mr. Lawrencemale340024869813.0000D56S002111561
222313McGowan, Miss. Anna \"Annie\"female15003309238.0292NaNQ21272251
232411Sloper, Mr. William Thompsonmale280011378835.5000A6S00287841
242503Palsson, Miss. Torborg Danirafemale83134990921.0750NaNS0129641
252613Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...female381534707731.3875NaNS015714441
262703Emir, Mr. Farred Chehabmale280026317.2250NaNC10237841
272801Fortune, Mr. Charles Alexandermale193219950263.0000C23 C25 C27S00303611
282913O'Dwyer, Miss. Ellen \"Nellie\"female28003309597.8792NaNQ21297841
293003Todoroff, Mr. Laliomale28003492167.8958NaNS00197841
......................................................
86186202Giles, Mr. Frederick Edwardmale21102813411.5000NaNS00274411
86286311Swift, Mrs. Frederick Joel (Margaret Welles Ba...female48001746625.9292D17S015123041
86386403Sage, Miss. Dorothy Edith \"Dolly\"female2882CA. 234369.5500NaNS01337841
86486502Gill, Mr. John Williammale240023386613.0000NaNS00225761
86586612Bystrom, Mrs. (Karolina)female420023685213.0000NaNS012417641
86686712Duran y More, Miss. Asuncionfemale2710SC/PARIS 214913.8583NaNC11287291
86786801Roebling, Mr. Washington Augustus IImale3100PC 1759050.4958A24S00369611
86886903van Melkebeke, Mr. Philemonmale28003457779.5000NaNS00277841
86987013Johnson, Master. Harold Theodormale41134774211.1333NaNS0031160
87087103Balkic, Mr. Cerinmale26003492487.8958NaNS00176761
87187211Beckwith, Mrs. Richard Leonard (Sallie Monypeny)female47111175152.5542D35S014822091
87287301Carlsson, Mr. Frans Olofmale33006955.0000B51 B53 B55S002410891
87387403Vander Cruyssen, Mr. Victormale47003457659.0000NaNS002722091
87487512Abelson, Mrs. Samuel (Hannah Wizosky)female2810P/PP 338124.0000NaNC11377841
87587613Najib, Miss. Adele Kiamie \"Jane\"female150026677.2250NaNC11322251
87687703Gustafsson, Mr. Alfred Ossianmale200075349.8458NaNS00294001
87787803Petroff, Mr. Nedeliomale19003492127.8958NaNS00203611
87887903Laleff, Mr. Kristomale28003492177.8958NaNS00187841
87988011Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)female56011176783.1583C50C114531361
88088112Shelley, Mrs. William (Imanita Parrish Hall)female250123043326.0000NaNS01446251
88188203Markun, Mr. Johannmale33003492577.8958NaNS001810891
88288303Dahlberg, Miss. Gerda Ulrikafemale2200755210.5167NaNS01284841
88388402Banfield, Mr. Frederick Jamesmale2800C.A./SOTON 3406810.5000NaNS00297841
88488503Sutehall, Mr. Henry Jrmale2500SOTON/OQ 3920767.0500NaNS00226251
88588603Rice, Mrs. William (Margaret Norton)female390538265229.1250NaNQ213615211
88688702Montvila, Rev. Juozasmale270021153613.0000NaNS00217290
88788811Graham, Miss. Margaret Edithfemale190011205330.0000B42S01283611
88888903Johnston, Miss. Catherine Helen \"Carrie\"female2812W./C. 660723.4500NaNS01407841
88989011Behr, Mr. Karl Howellmale260011136930.0000C148C10216761
89089103Dooley, Mr. Patrickmale32003703767.7500NaNQ201910241
\n", "

891 rows × 17 columns

\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "5 6 0 3 \n", "6 7 0 1 \n", "7 8 0 3 \n", "8 9 1 3 \n", "9 10 1 2 \n", "10 11 1 3 \n", "11 12 1 1 \n", "12 13 0 3 \n", "13 14 0 3 \n", "14 15 0 3 \n", "15 16 1 2 \n", "16 17 0 3 \n", "17 18 1 2 \n", "18 19 0 3 \n", "19 20 1 3 \n", "20 21 0 2 \n", "21 22 1 2 \n", "22 23 1 3 \n", "23 24 1 1 \n", "24 25 0 3 \n", "25 26 1 3 \n", "26 27 0 3 \n", "27 28 0 1 \n", "28 29 1 3 \n", "29 30 0 3 \n", ".. ... ... ... \n", "861 862 0 2 \n", "862 863 1 1 \n", "863 864 0 3 \n", "864 865 0 2 \n", "865 866 1 2 \n", "866 867 1 2 \n", "867 868 0 1 \n", "868 869 0 3 \n", "869 870 1 3 \n", "870 871 0 3 \n", "871 872 1 1 \n", "872 873 0 1 \n", "873 874 0 3 \n", "874 875 1 2 \n", "875 876 1 3 \n", "876 877 0 3 \n", "877 878 0 3 \n", "878 879 0 3 \n", "879 880 1 1 \n", "880 881 1 2 \n", "881 882 0 3 \n", "882 883 0 3 \n", "883 884 0 2 \n", "884 885 0 3 \n", "885 886 0 3 \n", "886 887 0 2 \n", "887 888 1 1 \n", "888 889 0 3 \n", "889 890 1 1 \n", "890 891 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n", "2 Heikkinen, Miss. Laina female 26 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 \n", "4 Allen, Mr. William Henry male 35 0 \n", "5 Moran, Mr. James male 28 0 \n", "6 McCarthy, Mr. Timothy J male 54 0 \n", "7 Palsson, Master. Gosta Leonard male 2 3 \n", "8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 \n", "9 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 \n", "10 Sandstrom, Miss. Marguerite Rut female 4 1 \n", "11 Bonnell, Miss. Elizabeth female 58 0 \n", "12 Saundercock, Mr. William Henry male 20 0 \n", "13 Andersson, Mr. Anders Johan male 39 1 \n", "14 Vestrom, Miss. Hulda Amanda Adolfina female 14 0 \n", "15 Hewlett, Mrs. (Mary D Kingcome) female 55 0 \n", "16 Rice, Master. Eugene male 2 4 \n", "17 Williams, Mr. 
Charles Eugene male 28 0 \n", "18 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31 1 \n", "19 Masselmani, Mrs. Fatima female 28 0 \n", "20 Fynney, Mr. Joseph J male 35 0 \n", "21 Beesley, Mr. Lawrence male 34 0 \n", "22 McGowan, Miss. Anna \"Annie\" female 15 0 \n", "23 Sloper, Mr. William Thompson male 28 0 \n", "24 Palsson, Miss. Torborg Danira female 8 3 \n", "25 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38 1 \n", "26 Emir, Mr. Farred Chehab male 28 0 \n", "27 Fortune, Mr. Charles Alexander male 19 3 \n", "28 O'Dwyer, Miss. Ellen \"Nellie\" female 28 0 \n", "29 Todoroff, Mr. Lalio male 28 0 \n", ".. ... ... ... ... \n", "861 Giles, Mr. Frederick Edward male 21 1 \n", "862 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48 0 \n", "863 Sage, Miss. Dorothy Edith \"Dolly\" female 28 8 \n", "864 Gill, Mr. John William male 24 0 \n", "865 Bystrom, Mrs. (Karolina) female 42 0 \n", "866 Duran y More, Miss. Asuncion female 27 1 \n", "867 Roebling, Mr. Washington Augustus II male 31 0 \n", "868 van Melkebeke, Mr. Philemon male 28 0 \n", "869 Johnson, Master. Harold Theodor male 4 1 \n", "870 Balkic, Mr. Cerin male 26 0 \n", "871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47 1 \n", "872 Carlsson, Mr. Frans Olof male 33 0 \n", "873 Vander Cruyssen, Mr. Victor male 47 0 \n", "874 Abelson, Mrs. Samuel (Hannah Wizosky) female 28 1 \n", "875 Najib, Miss. Adele Kiamie \"Jane\" female 15 0 \n", "876 Gustafsson, Mr. Alfred Ossian male 20 0 \n", "877 Petroff, Mr. Nedelio male 19 0 \n", "878 Laleff, Mr. Kristo male 28 0 \n", "879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56 0 \n", "880 Shelley, Mrs. William (Imanita Parrish Hall) female 25 0 \n", "881 Markun, Mr. Johann male 33 0 \n", "882 Dahlberg, Miss. Gerda Ulrika female 22 0 \n", "883 Banfield, Mr. Frederick James male 28 0 \n", "884 Sutehall, Mr. Henry Jr male 25 0 \n", "885 Rice, Mrs. William (Margaret Norton) female 39 0 \n", "886 Montvila, Rev. 
Juozas male 27 0 \n", "887 Graham, Miss. Margaret Edith female 19 0 \n", "888 Johnston, Miss. Catherine Helen \"Carrie\" female 28 1 \n", "889 Behr, Mr. Karl Howell male 26 0 \n", "890 Dooley, Mr. Patrick male 32 0 \n", "\n", " Parch Ticket Fare Cabin Embarked EmbarkedRecode \\\n", "0 0 A/5 21171 7.2500 NaN S 0 \n", "1 0 PC 17599 71.2833 C85 C 1 \n", "2 0 STON/O2. 3101282 7.9250 NaN S 0 \n", "3 0 113803 53.1000 C123 S 0 \n", "4 0 373450 8.0500 NaN S 0 \n", "5 0 330877 8.4583 NaN Q 2 \n", "6 0 17463 51.8625 E46 S 0 \n", "7 1 349909 21.0750 NaN S 0 \n", "8 2 347742 11.1333 NaN S 0 \n", "9 0 237736 30.0708 NaN C 1 \n", "10 1 PP 9549 16.7000 G6 S 0 \n", "11 0 113783 26.5500 C103 S 0 \n", "12 0 A/5. 2151 8.0500 NaN S 0 \n", "13 5 347082 31.2750 NaN S 0 \n", "14 0 350406 7.8542 NaN S 0 \n", "15 0 248706 16.0000 NaN S 0 \n", "16 1 382652 29.1250 NaN Q 2 \n", "17 0 244373 13.0000 NaN S 0 \n", "18 0 345763 18.0000 NaN S 0 \n", "19 0 2649 7.2250 NaN C 1 \n", "20 0 239865 26.0000 NaN S 0 \n", "21 0 248698 13.0000 D56 S 0 \n", "22 0 330923 8.0292 NaN Q 2 \n", "23 0 113788 35.5000 A6 S 0 \n", "24 1 349909 21.0750 NaN S 0 \n", "25 5 347077 31.3875 NaN S 0 \n", "26 0 2631 7.2250 NaN C 1 \n", "27 2 19950 263.0000 C23 C25 C27 S 0 \n", "28 0 330959 7.8792 NaN Q 2 \n", "29 0 349216 7.8958 NaN S 0 \n", ".. ... ... ... ... ... ... \n", "861 0 28134 11.5000 NaN S 0 \n", "862 0 17466 25.9292 D17 S 0 \n", "863 2 CA. 
2343 69.5500 NaN S 0 \n", "864 0 233866 13.0000 NaN S 0 \n", "865 0 236852 13.0000 NaN S 0 \n", "866 0 SC/PARIS 2149 13.8583 NaN C 1 \n", "867 0 PC 17590 50.4958 A24 S 0 \n", "868 0 345777 9.5000 NaN S 0 \n", "869 1 347742 11.1333 NaN S 0 \n", "870 0 349248 7.8958 NaN S 0 \n", "871 1 11751 52.5542 D35 S 0 \n", "872 0 695 5.0000 B51 B53 B55 S 0 \n", "873 0 345765 9.0000 NaN S 0 \n", "874 0 P/PP 3381 24.0000 NaN C 1 \n", "875 0 2667 7.2250 NaN C 1 \n", "876 0 7534 9.8458 NaN S 0 \n", "877 0 349212 7.8958 NaN S 0 \n", "878 0 349217 7.8958 NaN S 0 \n", "879 1 11767 83.1583 C50 C 1 \n", "880 1 230433 26.0000 NaN S 0 \n", "881 0 349257 7.8958 NaN S 0 \n", "882 0 7552 10.5167 NaN S 0 \n", "883 0 C.A./SOTON 34068 10.5000 NaN S 0 \n", "884 0 SOTON/OQ 392076 7.0500 NaN S 0 \n", "885 5 382652 29.1250 NaN Q 2 \n", "886 0 211536 13.0000 NaN S 0 \n", "887 0 112053 30.0000 B42 S 0 \n", "888 2 W./C. 6607 23.4500 NaN S 0 \n", "889 0 111369 30.0000 C148 C 1 \n", "890 0 370376 7.7500 NaN Q 2 \n", "\n", " Gender NameLength Age2 Title \n", "0 0 23 484 1 \n", "1 1 51 1444 1 \n", "2 1 22 676 1 \n", "3 1 44 1225 1 \n", "4 0 24 1225 1 \n", "5 0 16 784 1 \n", "6 0 23 2916 1 \n", "7 0 30 4 0 \n", "8 1 49 729 1 \n", "9 1 35 196 1 \n", "10 1 31 16 1 \n", "11 1 24 3364 1 \n", "12 0 30 400 1 \n", "13 0 27 1521 1 \n", "14 1 36 196 1 \n", "15 1 32 3025 1 \n", "16 0 20 4 0 \n", "17 0 28 784 1 \n", "18 1 55 961 1 \n", "19 1 23 784 1 \n", "20 0 20 1225 1 \n", "21 0 21 1156 1 \n", "22 1 27 225 1 \n", "23 0 28 784 1 \n", "24 1 29 64 1 \n", "25 1 57 1444 1 \n", "26 0 23 784 1 \n", "27 0 30 361 1 \n", "28 1 29 784 1 \n", "29 0 19 784 1 \n", ".. ... ... ... ... 
\n", "861 0 27 441 1 \n", "862 1 51 2304 1 \n", "863 1 33 784 1 \n", "864 0 22 576 1 \n", "865 1 24 1764 1 \n", "866 1 28 729 1 \n", "867 0 36 961 1 \n", "868 0 27 784 1 \n", "869 0 31 16 0 \n", "870 0 17 676 1 \n", "871 1 48 2209 1 \n", "872 0 24 1089 1 \n", "873 0 27 2209 1 \n", "874 1 37 784 1 \n", "875 1 32 225 1 \n", "876 0 29 400 1 \n", "877 0 20 361 1 \n", "878 0 18 784 1 \n", "879 1 45 3136 1 \n", "880 1 44 625 1 \n", "881 0 18 1089 1 \n", "882 1 28 484 1 \n", "883 0 29 784 1 \n", "884 0 22 625 1 \n", "885 1 36 1521 1 \n", "886 0 21 729 0 \n", "887 1 28 361 1 \n", "888 1 40 784 1 \n", "889 0 21 676 1 \n", "890 0 19 1024 1 \n", "\n", "[891 rows x 17 columns]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "# Small helper functions like this one check whether a string contains a pattern.\n", "def has_title(name):\n", "    for s in ['Mr.', 'Mrs.', 'Miss.', 'Dr.', 'Sir.']:\n", "        if name.find(s) >= 0:\n", "            return True\n", "    return False\n", "\n", "# Now we use that helper inside a second function that returns 1 or 0. 
\n", "title_fn = lambda x: 1 if has_title(x) else 0\n", "# Finally, apply the function to every value in the Name column.\n", "train['Title'] = train['Name'].map(title_fn)\n", "train\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Write the PassengerId and Survived columns to a CSV file.\n", "submission = pd.DataFrame(test.loc[:, ['PassengerId', 'Survived']])\n", "\n", "# Any files you save will be available in the output tab below.\n", "submission.to_csv('submission.csv', index=False)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1_twitter.ipynb\t\t\t\t Untitled3.ipynb\r\n", "BeautifulSoup.ipynb\t\t\t _Appendix B - OAuth Primer.ipynb\r\n", "Chapter 0 - Preface.ipynb\t\t data\r\n", "Chapter 1 - Mining Twitter.ipynb\t downjason.ipynb.json\r\n", "Chapter 4 - Mining Google+.ipynb\t example.Rmd\r\n", "Chapter 9 - Twitter Cookbook.ipynb\t example.html\r\n", "Class 3 More Python Basics. .ipynb\t index.html\r\n", "Lab 3 - Twitter-Copy1.ipynb\t\t index.html.1\r\n", "Lab 3 - Twitter.ipynb\t\t\t install.sh\r\n", "Lab2-webmining.ipynb\t\t\t lab2solution.ipynb\r\n", "Lab2.ipynb\t\t\t\t lab4.Rmd\r\n", "Lab3_Twitter_solution.ipynb\t\t lab4.html\r\n", "Lab4-Solution.ipynb\t\t\t model-figure\r\n", "Lab4.ipynb\t\t\t\t model.Rpres\r\n", "Lab6.Rmd\t\t\t\t model.md\r\n", "Lab7 - Feature Creation in python.ipynb nestedforloop.R\r\n", "R\t\t\t\t\t spark_mooc_version\r\n", "Titanic.ipynb\t\t\t\t spark_notebook.py\r\n", "Untitled.ipynb\t\t\t\t submission.csv\r\n", "Untitled1.ipynb\t\t\t\t titantic_train.csv\r\n", "Untitled2.ipynb\t\t\t\t titantic_train.csv.1\r\n" ] } ], "source": [ "# We can see the file here.\n", "!ls\n" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "\n", "1. Create a function that recodes Name according to whether it contains 'Mc'. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Introduction to Text Mining in Python\n", "These exercises were adapted from Mining the Social Web, 2nd Edition ([see the original here](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/)). \n", "The Simplified BSD License governs its use.\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{'a': ['mr.',\n", " 'green',\n", " 'killed',\n", " 'colonel',\n", " 'mustard',\n", " 'in',\n", " 'the',\n", " 'study',\n", " 'with',\n", " 'the',\n", " 'candlestick.',\n", " 'mr.',\n", " 'green',\n", " 'is',\n", " 'not',\n", " 'a',\n", " 'very',\n", " 'nice',\n", " 'fellow.'],\n", " 'b': ['professor',\n", " 'plum',\n", " 'has',\n", " 'a',\n", " 'green',\n", " 'plant',\n", " 'in',\n", " 'his',\n", " 'study.'],\n", " 'c': ['miss',\n", " 'scarlett',\n", " 'watered',\n", " 'professor',\n", " \"plum's\",\n", " 'green',\n", " 'plant',\n", " 'while',\n", " 'he',\n", " 'was',\n", " 'away',\n", " 'from',\n", " 'his',\n", " 'office',\n", " 'last',\n", " 'week.']}" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus = {\n", "    'a' : \"Mr. Green killed Colonel Mustard in the study with the candlestick. \\\n", "Mr. Green is not a very nice fellow.\",\n", "    'b' : \"Professor Plum has a green plant in his study.\",\n", "    'c' : \"Miss Scarlett watered Professor Plum's green plant while he was away \\\n", "from his office last week.\"\n", "}\n", "\n", "# Split each document (sentence) into lowercase terms.\n", "terms = {\n", "    'a' : [ i.lower() for i in corpus['a'].split() ],\n", "    'b' : [ i.lower() for i in corpus['b'].split() ],\n", "    'c' : [ i.lower() for i in corpus['c'].split() ]\n", "}\n", "terms" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a : Mr. 
Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.\n", "b : Professor Plum has a green plant in his study.\n", "c : Miss Scarlett watered Professor Plum's green plant while he was away from his office last week.\n", "\n", "TF(a): mr. 0.105263157895\n", "TF(b): mr. 0.0\n", "TF(c): mr. 0.0\n", "TF(a): green 0.105263157895\n", "TF(b): green 0.111111111111\n", "TF(c): green 0.0625\n", "\n", "This does the same thing but unnormalized\n", "TF(a): mr. 2.0\n", "TF(b): mr. 0.0\n", "TF(c): mr. 0.0\n", "TF(a): green 2.0\n", "TF(b): green 1.0\n", "TF(c): green 1.0\n" ] } ], "source": [ "from math import log\n", "\n", "# XXX: Enter in a query term from the corpus variable\n", "\n", "# These are the terms we would like to query.\n", "QUERY_TERMS = ['mr.', 'green']\n", "\n", "# Term frequency: how often `term` occurs in `doc`, optionally\n", "# normalized by the number of words in the document.\n", "def tf(term, doc, normalize):\n", "    doc = doc.lower().split()\n", "    if normalize:\n", "        return doc.count(term.lower()) / float(len(doc))\n", "    else:\n", "        return doc.count(term.lower()) / 1.0\n", "\n", "for (k, v) in sorted(corpus.items()):\n", "    print k, ':', v\n", "print\n", "\n", "# Score queries by calculating cumulative tf_idf score for each term in query\n", "query_scores = {'a': 0, 'b': 0, 'c': 0}\n", "\n", "# Loop over each query term\n", "for term in [t.lower() for t in QUERY_TERMS]:\n", "    # Loop over each document in the corpus\n", "    for doc in sorted(corpus):\n", "        print 'TF(%s): %s' % (doc, term), tf(term, corpus[doc], True)\n", "\n", "print\n", "print \"This does the same thing but unnormalized.\"\n", "for term in [t.lower() for t in QUERY_TERMS]:\n", "    # Loop over each document in the corpus\n", "    for doc in sorted(corpus):\n", "        print 'TF(%s): %s' % (doc, term), tf(term, corpus[doc], False)\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "name": 
"stdout", "output_type": "stream", "text": [ "IDF: mr. 2.09861228867\n", "IDF: green 1.0\n" ] } ], "source": [ "def idf(term, corpus):\n", "    num_texts_with_term = len([True for text in corpus if term.lower()\n", "                              in text.lower().split()])\n", "\n", "    # tf-idf calc involves multiplying against a tf value less than 1, so it's\n", "    # necessary to return a value of at least 1 for consistent scoring.\n", "    # (Multiplying two values less than 1 returns a value less than each of\n", "    # them.)\n", "\n", "    try:\n", "        return 1.0 + log(float(len(corpus)) / num_texts_with_term)\n", "    except ZeroDivisionError:\n", "        return 1.0\n", "\n", "# Print the IDF of each query term.\n", "for term in [t.lower() for t in QUERY_TERMS]:\n", "    print 'IDF: %s' % (term, ), idf(term, corpus.values())\n" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TF(a): mr. 0.105263157895\n", "TF(b): mr. 0.0\n", "TF(c): mr. 0.0\n", "IDF: mr. 2.09861228867\n", "\n", "TF-IDF(a): mr. 0.220906556702\n", "TF-IDF(b): mr. 0.0\n", "TF-IDF(c): mr. 0.0\n", "\n", "TF(a): green 0.105263157895\n", "TF(b): green 0.111111111111\n", "TF(c): green 0.0625\n", "IDF: green 1.0\n", "\n", "TF-IDF(a): green 0.105263157895\n", "TF-IDF(b): green 0.111111111111\n", "TF-IDF(c): green 0.0625\n", "\n", "Overall TF-IDF scores for query 'mr. 
green'\n", "a 0.326169714597\n", "b 0.111111111111\n", "c 0.0625\n" ] } ], "source": [ "\n", "# TF-IDF simply multiplies the normalized TF by the IDF.\n", "def tf_idf(term, doc, corpus):\n", "    return tf(term, doc, True) * idf(term, corpus)\n", "\n", "query_scores = {'a': 0, 'b': 0, 'c': 0}\n", "for term in [t.lower() for t in QUERY_TERMS]:\n", "    for doc in sorted(corpus):\n", "        print 'TF(%s): %s' % (doc, term), tf(term, corpus[doc], True)\n", "    print 'IDF: %s' % (term, ), idf(term, corpus.values())\n", "    print\n", "\n", "    for doc in sorted(corpus):\n", "        score = tf_idf(term, corpus[doc], corpus.values())\n", "        print 'TF-IDF(%s): %s' % (doc, term), score\n", "        query_scores[doc] += score\n", "    print\n", "\n", "print \"Overall TF-IDF scores for query '%s'\" % (' '.join(QUERY_TERMS), )\n", "for (doc, score) in sorted(query_scores.items()):\n", "    print doc, score" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [], "source": [ "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# This will only work if you have cooking.json at this local path.\n", "traindf = pd.read_json(\"../../vagrant/data/cooking.json\")" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
[39774 rows × 3 columns: cuisine, id, ingredients]
" ], "text/plain": [ " cuisine id ingredients\n", "0 greek 10259 [romaine lettuce, black olives, grape tomatoes...\n", "1 southern_us 25693 [plain flour, ground pepper, salt, tomatoes, g...\n", "2 filipino 20130 [eggs, pepper, salt, mayonaise, cooking oil, g...\n", "3 indian 22213 [water, vegetable oil, wheat, salt]\n", "4 indian 13162 [black pepper, shallots, cornflour, cayenne pe...\n", "5 jamaican 6602 [plain flour, sugar, butter, eggs, fresh ginge...\n", "6 spanish 42779 [olive oil, salt, medium shrimp, pepper, garli...\n", "7 italian 3735 [sugar, pistachio nuts, white almond bark, flo...\n", "8 mexican 16903 [olive oil, purple onion, fresh pineapple, por...\n", "9 italian 12734 [chopped tomatoes, fresh basil, garlic, extra-...\n", "10 italian 5875 [pimentos, sweet pepper, dried oregano, olive ...\n", "11 chinese 45887 [low sodium soy sauce, fresh ginger, dry musta...\n", "12 italian 2698 [Italian parsley leaves, walnuts, hot red pepp...\n", "13 mexican 41995 [ground cinnamon, fresh cilantro, chili powder...\n", "14 italian 31908 [fresh parmesan cheese, butter, all-purpose fl...\n", "15 indian 24717 [tumeric, vegetable stock, tomatoes, garam mas...\n", "16 british 34466 [greek yogurt, lemon curd, confectioners sugar...\n", "17 italian 1420 [italian seasoning, broiler-fryer chicken, may...\n", "18 thai 2941 [sugar, hot chili, asian fish sauce, lime juice]\n", "19 vietnamese 8152 [soy sauce, vegetable oil, red bell pepper, ch...\n", "20 thai 13121 [pork loin, roasted peanuts, chopped cilantro ...\n", "21 mexican 40523 [roma tomatoes, kosher salt, purple onion, jal...\n", "22 southern_us 40989 [low-fat mayonnaise, pepper, salt, baking pota...\n", "23 chinese 29630 [sesame seeds, red pepper, yellow peppers, wat...\n", "24 italian 49136 [marinara sauce, flat leaf parsley, olive oil,...\n", "25 chinese 26705 [sugar, lo mein noodles, salt, chicken broth, ...\n", "26 cajun_creole 27976 [herbs, lemon juice, fresh tomatoes, paprika, ...\n", "27 italian 22087 [ground 
black pepper, butter, sliced mushrooms...\n", "28 chinese 9197 [green bell pepper, egg roll wrappers, sweet a...\n", "29 mexican 1299 [flour tortillas, cheese, breakfast sausages, ...\n", "... ... ... ...\n", "39744 greek 5680 [extra-virgin olive oil, oregano, potatoes, ga...\n", "39745 spanish 5511 [quinoa, extra-virgin olive oil, fresh thyme l...\n", "39746 indian 32051 [clove, bay leaves, ginger, chopped cilantro, ...\n", "39747 moroccan 5119 [water, sugar, grated lemon zest, butter, pitt...\n", "39748 italian 9526 [sea salt, pizza doughs, all-purpose flour, co...\n", "39749 mexican 45599 [kosher salt, minced onion, tortilla chips, su...\n", "39750 mexican 49670 [ground black pepper, chicken breasts, salsa, ...\n", "39751 moroccan 30735 [olive oil, cayenne pepper, chopped cilantro f...\n", "39752 southern_us 5911 [self rising flour, milk, white sugar, butter,...\n", "39753 italian 33294 [rosemary sprigs, lemon zest, garlic cloves, g...\n", "39754 vietnamese 27082 [jasmine rice, bay leaves, sticky rice, rotiss...\n", "39755 indian 36337 [mint leaves, cilantro leaves, ghee, tomatoes,...\n", "39756 mexican 15508 [vegetable oil, cinnamon sticks, water, all-pu...\n", "39757 greek 34331 [red bell pepper, garlic cloves, extra-virgin ...\n", "39758 greek 47387 [milk, salt, ground cayenne pepper, ground lam...\n", "39759 korean 12153 [red chili peppers, sea salt, onions, water, c...\n", "39760 southern_us 41840 [butter, large eggs, cornmeal, baking powder, ...\n", "39761 chinese 6487 [honey, chicken breast halves, cilantro leaves...\n", "39762 indian 26646 [curry powder, salt, chicken, water, vegetable...\n", "39763 italian 44798 [fettuccine pasta, low-fat cream cheese, garli...\n", "39764 mexican 8089 [chili powder, worcestershire sauce, celery, r...\n", "39765 indian 6153 [coconut, unsweetened coconut milk, mint leave...\n", "39766 irish 25557 [rutabaga, ham, thick-cut bacon, potatoes, fre...\n", "39767 italian 24348 [low-fat sour cream, grated parmesan cheese, 
s...\n", "39768 mexican 7377 [shredded cheddar cheese, crushed cheese crack...\n", "39769 irish 29109 [light brown sugar, granulated sugar, butter, ...\n", "39770 italian 11462 [KRAFT Zesty Italian Dressing, purple onion, b...\n", "39771 irish 2238 [eggs, citrus fruit, raisins, sourdough starte...\n", "39772 chinese 41882 [boneless chicken skinless thigh, minced garli...\n", "39773 mexican 2362 [green chile, jalapeno chilies, onions, ground...\n", "\n", "[39774 rows x 3 columns]" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "traindf" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
[39774 rows × 3 columns: cuisine, id, ingredients]
" ], "text/plain": [ " cuisine id ingredients\n", "0 greek 10259 [romaine lettuce, black olives, grape tomatoes...\n", "1 southern_us 25693 [plain flour, ground pepper, salt, tomatoes, g...\n", "2 filipino 20130 [eggs, pepper, salt, mayonaise, cooking oil, g...\n", "3 indian 22213 [water, vegetable oil, wheat, salt]\n", "4 indian 13162 [black pepper, shallots, cornflour, cayenne pe...\n", "5 jamaican 6602 [plain flour, sugar, butter, eggs, fresh ginge...\n", "6 spanish 42779 [olive oil, salt, medium shrimp, pepper, garli...\n", "7 italian 3735 [sugar, pistachio nuts, white almond bark, flo...\n", "8 mexican 16903 [olive oil, purple onion, fresh pineapple, por...\n", "9 italian 12734 [chopped tomatoes, fresh basil, garlic, extra-...\n", "10 italian 5875 [pimentos, sweet pepper, dried oregano, olive ...\n", "11 chinese 45887 [low sodium soy sauce, fresh ginger, dry musta...\n", "12 italian 2698 [Italian parsley leaves, walnuts, hot red pepp...\n", "13 mexican 41995 [ground cinnamon, fresh cilantro, chili powder...\n", "14 italian 31908 [fresh parmesan cheese, butter, all-purpose fl...\n", "15 indian 24717 [tumeric, vegetable stock, tomatoes, garam mas...\n", "16 british 34466 [greek yogurt, lemon curd, confectioners sugar...\n", "17 italian 1420 [italian seasoning, broiler-fryer chicken, may...\n", "18 thai 2941 [sugar, hot chili, asian fish sauce, lime juice]\n", "19 vietnamese 8152 [soy sauce, vegetable oil, red bell pepper, ch...\n", "20 thai 13121 [pork loin, roasted peanuts, chopped cilantro ...\n", "21 mexican 40523 [roma tomatoes, kosher salt, purple onion, jal...\n", "22 southern_us 40989 [low-fat mayonnaise, pepper, salt, baking pota...\n", "23 chinese 29630 [sesame seeds, red pepper, yellow peppers, wat...\n", "24 italian 49136 [marinara sauce, flat leaf parsley, olive oil,...\n", "25 chinese 26705 [sugar, lo mein noodles, salt, chicken broth, ...\n", "26 cajun_creole 27976 [herbs, lemon juice, fresh tomatoes, paprika, ...\n", "27 italian 22087 [ground 
black pepper, butter, sliced mushrooms...\n", "28 chinese 9197 [green bell pepper, egg roll wrappers, sweet a...\n", "29 mexican 1299 [flour tortillas, cheese, breakfast sausages, ...\n", "... ... ... ...\n", "39744 greek 5680 [extra-virgin olive oil, oregano, potatoes, ga...\n", "39745 spanish 5511 [quinoa, extra-virgin olive oil, fresh thyme l...\n", "39746 indian 32051 [clove, bay leaves, ginger, chopped cilantro, ...\n", "39747 moroccan 5119 [water, sugar, grated lemon zest, butter, pitt...\n", "39748 italian 9526 [sea salt, pizza doughs, all-purpose flour, co...\n", "39749 mexican 45599 [kosher salt, minced onion, tortilla chips, su...\n", "39750 mexican 49670 [ground black pepper, chicken breasts, salsa, ...\n", "39751 moroccan 30735 [olive oil, cayenne pepper, chopped cilantro f...\n", "39752 southern_us 5911 [self rising flour, milk, white sugar, butter,...\n", "39753 italian 33294 [rosemary sprigs, lemon zest, garlic cloves, g...\n", "39754 vietnamese 27082 [jasmine rice, bay leaves, sticky rice, rotiss...\n", "39755 indian 36337 [mint leaves, cilantro leaves, ghee, tomatoes,...\n", "39756 mexican 15508 [vegetable oil, cinnamon sticks, water, all-pu...\n", "39757 greek 34331 [red bell pepper, garlic cloves, extra-virgin ...\n", "39758 greek 47387 [milk, salt, ground cayenne pepper, ground lam...\n", "39759 korean 12153 [red chili peppers, sea salt, onions, water, c...\n", "39760 southern_us 41840 [butter, large eggs, cornmeal, baking powder, ...\n", "39761 chinese 6487 [honey, chicken breast halves, cilantro leaves...\n", "39762 indian 26646 [curry powder, salt, chicken, water, vegetable...\n", "39763 italian 44798 [fettuccine pasta, low-fat cream cheese, garli...\n", "39764 mexican 8089 [chili powder, worcestershire sauce, celery, r...\n", "39765 indian 6153 [coconut, unsweetened coconut milk, mint leave...\n", "39766 irish 25557 [rutabaga, ham, thick-cut bacon, potatoes, fre...\n", "39767 italian 24348 [low-fat sour cream, grated parmesan cheese, 
s...\n", "39768 mexican 7377 [shredded cheddar cheese, crushed cheese crack...\n", "39769 irish 29109 [light brown sugar, granulated sugar, butter, ...\n", "39770 italian 11462 [KRAFT Zesty Italian Dressing, purple onion, b...\n", "39771 irish 2238 [eggs, citrus fruit, raisins, sourdough starte...\n", "39772 chinese 41882 [boneless chicken skinless thigh, minced garli...\n", "39773 mexican 2362 [green chile, jalapeno chilies, onions, ground...\n", "\n", "[39774 rows x 3 columns]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "import urllib2\n", "\n", "url = 'https://raw.githubusercontent.com/RPI-Analytics/MGMT6963-2015/gh-pages/data/cooking.json'\n", "\n", "response = urllib2.urlopen(url)\n", "traindf = pd.read_json(response)\n", "traindf" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Assignment \n", "The assignment for this week is to start with Kaggle2. You should begin by importing the data and adapting the above code to calculate TF-IDF for chicken, eggs, pasta, and worcestershire. \n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }