前言

这里主要讲两个模型,决策树、随机森林,最后随机森林涉及超参数调整,也就是grid思想。 在模型训练前,就会用到变量的特征工程,这里使用函数思维

引入决策树是因为简单,作为一个tutorial,让代码一次性可以跑通很重要。

决策树参考 Hugo Bowne-Anderson January 3rd, 2018 Kaggle Tutorial: Your First Machine Learning Model https://www.datacamp.com/community/tutorials/kaggle-tutorial-machine-learning

随机森林参考 Ahmed BESEES March 10th, 2016 How to score 0.8134 in Titanic Kaggle Challenge https://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html 不知道哪国人。

这里展示不是base blogdown包的,因为这里jupyterRStudio感觉是竞争品牌,不会互相支持,算了,直接jupyter写。

数据下载

  • 下载数据,先安装kaggle命令: pip install kaggle
  • 然后下载数据: kaggle competitions download -c titanic
  • 参考文档见 kaggle
$ mv kaggle.json /Users/JiaxiangLi/.kaggle
$ cd /Users/JiaxiangLi/.kaggle
$ ls
kaggle.json

特征工程


In [203]:
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")
Out[203]:
In [204]:
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
import pandas as pd
pd.options.display.max_columns = 100
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np

pd.options.display.max_rows = 100

这些设定,比如限定展示不超过100行和100列是非常有用的。

In [205]:
import os 
import pandas as pd
from dplython import (DplyFrame, X, diamonds, select, sift,
  sample_n, sample_frac, head, arrange, mutate, group_by,
  summarize, DelayFunction)
In [206]:
os.getcwd()
Out[206]:
'/Users/JiaxiangLi/Downloads/me/trans/python_learning'

录入数据

In [207]:
def get_combined_data():
    file_path = os.path.join(os.getcwd(),"titanic")
    train = pd.read_csv(os.path.join(file_path,"train.csv"))
    test = pd.read_csv(os.path.join(file_path,"test.csv"))
    target = train.Survived
    train = train.drop("Survived",axis = 1)
    combined = train.append(test)
    combined.reset_index().drop("index",axis =1)
    combined.drop("PassengerId", axis = 1, inplace=True)
    return combined
  1. 这样每次统一.ipynb文件的路径和数据的路径,就可以重复利用这个文档了。
  2. 定义一个函数,每次跑函数就可以了,python的函数思维,一定要养成,最好的效果就是节约时间。

注意上面最后return combined因此我们定义combined我们合并traintest的数据,进入特征工程方面。

In [208]:
combined = get_combined_data()
In [209]:
print combined.shape
row,col = combined.shape
print row,col
(1309, 10)
1309 10

这里是返回dataframe的大小。

简单了解下。

titles

combined["Title"] = combined.Name.map(lambda name:name.split(',')[1].split('.')[0].strip())
  1. 相当于提取以,.中间的字符,且去掉空格。
  2. 这里不需要用replace函数,难搞,直接用mapfollow R的思路。但是是针对单一变量.
Title_Dictionary = {
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Royalty",
                    "Dona":       "Royalty",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Royalty"

                    }
In [210]:
def get_title():
    global combined
    combined["Title"] = combined.Name.map(lambda name:name.split(',')[1].split('.')[0].strip())
    Title_Dictionary = {
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Royalty",
                    "Dona":       "Royalty",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Royalty"

                    }
    combined.Title = combined.Title.map(Title_Dictionary)

整合成一个函数,一定要这样定义,这样就把特征工程模块化,再后面跑模型的时候就不需要管前面了,非常节约时间。

In [211]:
get_title()

这个地方的map函数定义和R里面的purrr包定义类似。

Age

这个地方的age,根据title,sex,pclass的情况不同肯定不同,因此直接用均值和中位数肯定不恰当,要不groupby三个变量,要不直接回归? 简单点,直接前面一种,我们来看看,分traintest组。

In [212]:
combined[:891].groupby(["Pclass","Sex","Title"]).median()
Out[212]:
Age SibSp Parch Fare
Pclass Sex Title
1 female Miss 30.0 0.0 0.0 88.25000
Mrs 40.0 1.0 0.0 79.20000
Officer 49.0 0.0 0.0 25.92920
Royalty 40.5 0.5 0.0 63.05000
male Master 4.0 1.0 2.0 120.00000
Mr 40.0 0.0 0.0 42.40000
Officer 51.0 0.0 0.0 35.50000
Royalty 40.0 0.0 0.0 27.72080
2 female Miss 24.0 0.0 0.0 13.00000
Mrs 31.5 1.0 0.0 26.00000
male Master 1.0 1.0 1.0 26.00000
Mr 31.0 0.0 0.0 13.00000
Officer 46.5 0.0 0.0 13.00000
3 female Miss 18.0 0.0 0.0 8.75625
Mrs 31.0 1.0 1.0 15.97500
male Master 4.0 3.5 1.0 28.51250
Mr 26.0 0.0 0.0 7.89580

简单看一下就觉得差异得吓人,因此肯定要区别对待。

这里要用if-else条件层层完成一个函数,真累。 因此我们肯定不能这么做,直接一步left join搞定。

In [213]:
def process_age():
    global combined
    age_median = combined[:891].groupby(["Pclass","Sex","Title"]).median().reset_index()[["Pclass","Sex","Age","Title"]]
    combined2 = pd.merge(combined, age_median, how="left", on = ["Pclass","Sex","Title"])
    combined.Age = combined2.Age_x.fillna(combined2.Age_y)
    return combined
In [214]:
process_age()
Out[214]:
Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Title
0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S Mr
1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C Mrs
2 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S Miss
3 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S Mrs
4 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S Mr
5 3 Moran, Mr. James male 26.0 0 0 330877 8.4583 NaN Q Mr
6 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S Mr
7 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S Master
8 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S Mrs
9 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C Mrs
10 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S Miss
11 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S Miss
12 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S Mr
13 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S Mr
14 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S Miss
15 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S Mrs
16 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q Master
17 2 Williams, Mr. Charles Eugene male 31.0 0 0 244373 13.0000 NaN S Mr
18 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S Mrs
19 3 Masselmani, Mrs. Fatima female 31.0 0 0 2649 7.2250 NaN C Mrs
20 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S Mr
21 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S Mr
22 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q Miss
23 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S Mr
24 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S Miss
25 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S Mrs
26 3 Emir, Mr. Farred Chehab male 26.0 0 0 2631 7.2250 NaN C Mr
27 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S Mr
28 3 O'Dwyer, Miss. Ellen "Nellie" female 18.0 0 0 330959 7.8792 NaN Q Miss
29 3 Todoroff, Mr. Lalio male 26.0 0 0 349216 7.8958 NaN S Mr
30 1 Uruchurtu, Don. Manuel E male 40.0 0 0 PC 17601 27.7208 NaN C Royalty
31 1 Spencer, Mrs. William Augustus (Marie Eugenie) female 40.0 1 0 PC 17569 146.5208 B78 C Mrs
32 3 Glynn, Miss. Mary Agatha female 18.0 0 0 335677 7.7500 NaN Q Miss
33 2 Wheadon, Mr. Edward H male 66.0 0 0 C.A. 24579 10.5000 NaN S Mr
34 1 Meyer, Mr. Edgar Joseph male 28.0 1 0 PC 17604 82.1708 NaN C Mr
35 1 Holverson, Mr. Alexander Oskar male 42.0 1 0 113789 52.0000 NaN S Mr
36 3 Mamee, Mr. Hanna male 26.0 0 0 2677 7.2292 NaN C Mr
37 3 Cann, Mr. Ernest Charles male 21.0 0 0 A./5. 2152 8.0500 NaN S Mr
38 3 Vander Planke, Miss. Augusta Maria female 18.0 2 0 345764 18.0000 NaN S Miss
39 3 Nicola-Yarred, Miss. Jamila female 14.0 1 0 2651 11.2417 NaN C Miss
40 3 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) female 40.0 1 0 7546 9.4750 NaN S Mrs
41 2 Turpin, Mrs. William John Robert (Dorothy Ann ... female 27.0 1 0 11668 21.0000 NaN S Mrs
42 3 Kraeff, Mr. Theodor male 26.0 0 0 349253 7.8958 NaN C Mr
43 2 Laroche, Miss. Simonne Marie Anne Andree female 3.0 1 2 SC/Paris 2123 41.5792 NaN C Miss
44 3 Devaney, Miss. Margaret Delia female 19.0 0 0 330958 7.8792 NaN Q Miss
45 3 Rogers, Mr. William John male 26.0 0 0 S.C./A.4. 23567 8.0500 NaN S Mr
46 3 Lennon, Mr. Denis male 26.0 1 0 370371 15.5000 NaN Q Mr
47 3 O'Driscoll, Miss. Bridget female 18.0 0 0 14311 7.7500 NaN Q Miss
48 3 Samaan, Mr. Youssef male 26.0 2 0 2662 21.6792 NaN C Mr
49 3 Arnold-Franchi, Mrs. Josef (Josefine Franchi) female 18.0 1 0 349237 17.8000 NaN S Mrs
... ... ... ... ... ... ... ... ... ... ... ...
368 1 Gibson, Mrs. Leonard (Pauline C Boeson) female 18.0 0 1 112378 59.4000 NaN C Mrs
369 2 Pallas y Castello, Mr. Emilio male 24.0 0 0 SC/PARIS 2147 13.8583 NaN C Mr
370 2 Giles, Mr. Edgar male 25.0 1 0 28133 11.5000 NaN S Mr
371 1 Wilson, Miss. Helen Alice female 18.0 0 0 16966 134.5000 E39 E41 C Miss
372 1 Ismay, Mr. Joseph Bruce male 19.0 0 0 112058 0.0000 B52 B54 B56 S Mr
373 2 Harbeck, Mr. William H male 22.0 0 0 248746 13.0000 NaN S Mr
374 1 Dodge, Mrs. Washington (Ruth Vidaver) female 3.0 1 1 33638 81.8583 A34 S Mrs
375 1 Bowen, Miss. Grace Scott female 40.0 0 0 PC 17608 262.3750 NaN C Miss
376 3 Kink, Miss. Maria female 22.0 2 0 315152 8.6625 NaN S Miss
377 2 Cotterill, Mr. Henry Harry"" male 27.0 0 0 29107 11.5000 NaN S Mr
378 1 Hipkins, Mr. William Edward male 20.0 0 0 680 50.0000 C39 S Mr
379 3 Asplund, Master. Carl Edgar male 19.0 4 2 347077 31.3875 NaN S Master
380 3 O'Connor, Mr. Patrick male 42.0 0 0 366713 7.7500 NaN Q Mr
381 3 Foley, Mr. Joseph male 1.0 0 0 330910 7.8792 NaN Q Mr
382 3 Risien, Mrs. Samuel (Emma) female 32.0 0 0 364498 14.5000 NaN S Mrs
383 3 McNamee, Mrs. Neal (Eileen O'Leary) female 35.0 1 0 376566 16.1000 NaN S Mrs
384 2 Wheeler, Mr. Edwin Frederick"" male 26.0 0 0 SC/PARIS 2159 12.8750 NaN S Mr
385 2 Herman, Miss. Kate female 18.0 1 2 220845 65.0000 NaN S Miss
386 3 Aronsson, Mr. Ernst Axel Algot male 1.0 0 0 349911 7.7750 NaN S Mr
387 2 Ashby, Mr. John male 36.0 0 0 244346 13.0000 NaN S Mr
388 3 Canavan, Mr. Patrick male 26.0 0 0 364858 7.7500 NaN Q Mr
389 3 Palsson, Master. Paul Folke male 17.0 3 1 349909 21.0750 NaN S Master
390 1 Payne, Mr. Vivian Ponsonby male 36.0 0 0 12749 93.5000 B24 S Mr
391 1 Lines, Mrs. Ernest H (Elizabeth Lindsey James) female 21.0 0 1 PC 17592 39.4000 D28 S Mrs
392 3 Abbott, Master. Eugene Joseph male 28.0 0 2 C.A. 2673 20.2500 NaN S Master
393 2 Gilbert, Mr. William male 23.0 0 0 C.A. 30769 10.5000 NaN S Mr
394 3 Kink-Heilmann, Mr. Anton male 24.0 3 1 315153 22.0250 NaN S Mr
395 1 Smith, Mrs. Lucien Philip (Mary Eloise Hughes) female 22.0 1 0 13695 60.0000 C31 S Mrs
396 3 Colbert, Mr. Patrick male 31.0 0 0 371109 7.2500 NaN Q Mr
397 1 Frolicher-Stehli, Mrs. Maxmillian (Margaretha ... female 46.0 1 1 13567 79.2000 B41 C Mrs
398 3 Larsson-Rondberg, Mr. Edvard A male 23.0 0 0 347065 7.7750 NaN S Mr
399 3 Conlon, Mr. Thomas Henry male 28.0 0 0 21332 7.7333 NaN Q Mr
400 1 Bonnell, Miss. Caroline female 39.0 0 0 36928 164.8667 C7 S Miss
401 2 Gale, Mr. Harry male 26.0 1 0 28664 21.0000 NaN S Mr
402 1 Gibson, Miss. Dorothy Winifred female 21.0 0 1 112378 59.4000 NaN C Miss
403 1 Carrau, Mr. Jose Pedro male 28.0 0 0 113059 47.1000 NaN S Mr
404 1 Frauenthal, Mr. Isaac Gerald male 20.0 1 0 17765 27.7208 D40 C Mr
405 2 Nourney, Mr. Alfred (Baron von Drachstedt")" male 34.0 0 0 SC/PARIS 2166 13.8625 D38 C Mr
406 2 Ware, Mr. William Jeffery male 51.0 1 0 28666 10.5000 NaN S Mr
407 1 Widener, Mr. George Dunton male 3.0 1 1 113503 211.5000 C80 C Mr
408 3 Riordan, Miss. Johanna Hannah"" female 21.0 0 0 334915 7.7208 NaN Q Miss
409 3 Peacock, Miss. Treasteall female 18.0 1 1 SOTON/O.Q. 3101315 13.7750 NaN S Miss
410 3 Naughton, Miss. Hannah female 26.0 0 0 365237 7.7500 NaN Q Miss
411 1 Minahan, Mrs. William Edward (Lillian E Thorpe) female 26.0 1 0 19928 90.0000 C78 Q Mrs
412 3 Henriksson, Miss. Jenny Lovisa female 33.0 0 0 347086 7.7750 NaN S Miss
413 3 Spector, Mr. Woolf male 31.0 0 0 A.5. 3236 8.0500 NaN S Mr
414 1 Oliva y Ocana, Dona. Fermina female 44.0 0 0 PC 17758 108.9000 C105 C Royalty
415 3 Saether, Mr. Simon Sivertsen male 31.0 0 0 SOTON/O.Q. 3101262 7.2500 NaN S Mr
416 3 Ware, Mr. Frederick male 34.0 0 0 359309 8.0500 NaN S Mr
417 3 Peter, Master. Michael J male 18.0 1 1 2668 22.3583 NaN C Master

1309 rows × 11 columns

分类和连续变量describe

In [215]:
combined.get_dtype_counts()
Out[215]:
float64    2
int64      3
object     6
dtype: int64
In [216]:
combined.select_dtypes(include=["int64","float64"]).head(n=1)
Out[216]:
Pclass Age SibSp Parch Fare
0 3 22.0 1 0 7.25
In [217]:
combined.select_dtypes(include=["object"]).nunique()
Out[217]:
Name        1307
Sex            2
Ticket       929
Cabin        186
Embarked       3
Title          6
dtype: int64

Names

names 是个噪音,这里处理方式是去掉,反正有了 title,要是不好的话,再说吧。 但是 title需要one_hot

In [218]:
combined.Title.unique()
Out[218]:
array(['Mr', 'Mrs', 'Miss', 'Master', 'Royalty', 'Officer'], dtype=object)

drop_first必须的,不然共线性。

In [219]:
pd.get_dummies(combined.Title,prefix='Title',drop_first=True).head()
#combined.drop(["Name","Title"])
Out[219]:
Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty
0 0 1 0 0 0
1 0 0 1 0 0
2 1 0 0 0 0
3 0 0 1 0 0
4 0 1 0 0 0
In [220]:
def process_names():
    global combined
    combined = pd.concat([combined.drop("Name",axis=1).drop("Title",axis=1),pd.get_dummies(combined.Title,prefix='Title',drop_first=True)],axis=1)

这里一定要加上axis=1

In [221]:
process_names()

Fare

这个地方的Fare缺失很少,直接均值搞定。

In [222]:
combined.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 14 columns):
Pclass           1309 non-null int64
Sex              1309 non-null object
Age              1309 non-null float64
SibSp            1309 non-null int64
Parch            1309 non-null int64
Ticket           1309 non-null object
Fare             1308 non-null float64
Cabin            295 non-null object
Embarked         1307 non-null object
Title_Miss       1309 non-null uint8
Title_Mr         1309 non-null uint8
Title_Mrs        1309 non-null uint8
Title_Officer    1309 non-null uint8
Title_Royalty    1309 non-null uint8
dtypes: float64(2), int64(3), object(4), uint8(5)
memory usage: 108.7+ KB

现在就剩下FareEmbarkedCabin了。

In [223]:
def process_fares():
    global combined
    combined.Fare.fillna(combined.Fare.mean(), inplace = True)
In [224]:
process_fares()
In [225]:
combined.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 14 columns):
Pclass           1309 non-null int64
Sex              1309 non-null object
Age              1309 non-null float64
SibSp            1309 non-null int64
Parch            1309 non-null int64
Ticket           1309 non-null object
Fare             1309 non-null float64
Cabin            295 non-null object
Embarked         1307 non-null object
Title_Miss       1309 non-null uint8
Title_Mr         1309 non-null uint8
Title_Mrs        1309 non-null uint8
Title_Officer    1309 non-null uint8
Title_Royalty    1309 non-null uint8
dtypes: float64(2), int64(3), object(4), uint8(5)
memory usage: 108.7+ KB

Embarked

江湖流传,这个缺失值替换成S准确性高很多。

In [226]:
def process_embarked():
    global combined
    combined.Embarked.fillna("S",inplace=True)
    embarked_dummies = pd.get_dummies(combined.Embarked,prefix="Embarked",drop_first=True)
    combined = pd.concat([combined, embarked_dummies],axis = 1)
    combined.drop("Embarked", inplace=True, axis = 1)
In [227]:
process_embarked()

Cabin

In [228]:
combined.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 15 columns):
Pclass           1309 non-null int64
Sex              1309 non-null object
Age              1309 non-null float64
SibSp            1309 non-null int64
Parch            1309 non-null int64
Ticket           1309 non-null object
Fare             1309 non-null float64
Cabin            295 non-null object
Title_Miss       1309 non-null uint8
Title_Mr         1309 non-null uint8
Title_Mrs        1309 non-null uint8
Title_Officer    1309 non-null uint8
Title_Royalty    1309 non-null uint8
Embarked_Q       1309 non-null uint8
Embarked_S       1309 non-null uint8
dtypes: float64(2), int64(3), object(3), uint8(7)
memory usage: 101.0+ KB

这里填写,U表示unknown。 其他都取用首字母,非常暴力。

In [229]:
combined.Cabin.head()
Out[229]:
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object
In [230]:
def process_cabin():
    global combined
    combined.Cabin.fillna('U', inplace=True)
    combined["Cabin"] = combined.Cabin.map(lambda c: c[0])
    cabin_dummies = pd.get_dummies(combined.Cabin, prefix="Cabin", drop_first=True)
    combined = pd.concat([combined,cabin_dummies], axis=1)
    combined.drop("Cabin", axis = 1, inplace=True)
In [231]:
process_cabin()
combined.filter(regex="^Cabin")

Sex

In [232]:
def process_sex():
    
    global combined
    # mapping string values to numerical one 
    combined['Sex'] = combined['Sex'].map({'male':1,'female':0})
In [233]:
process_sex()

Pclass

注意这个pclass也是要onehot的。

In [234]:
def process_pclass():
    global combined
    pclass_dummies = pd.get_dummies(combined.Pclass,prefix="Pclass", drop_first=True)
    combined  = pd.concat([combined, pclass_dummies], axis=1)
    combined.drop("Pclass",axis=1, inplace=True)
In [235]:
process_pclass()
In [236]:
combined.head()
Out[236]:
Sex Age SibSp Parch Ticket Fare Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty Embarked_Q Embarked_S Cabin_B Cabin_C Cabin_D Cabin_E Cabin_F Cabin_G Cabin_T Cabin_U Pclass_2 Pclass_3
0 1 22.0 1 0 A/5 21171 7.2500 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1
1 0 38.0 1 0 PC 17599 71.2833 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0
2 0 26.0 0 0 STON/O2. 3101282 7.9250 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1
3 0 35.0 1 0 113803 53.1000 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0
4 1 35.0 0 0 373450 8.0500 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1

Ticket

In [237]:
combined.Ticket.head()
Out[237]:
0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
Name: Ticket, dtype: object

这个地方太烦了,copy了一个函数。

In [238]:
def process_ticket():
    
    global combined
    
    # a function that extracts each prefix of the ticket, returns 'XXX' if no prefix (i.e the ticket is a digit)
    def cleanTicket(ticket):
        ticket = ticket.replace('.','')
        ticket = ticket.replace('/','')
        ticket = ticket.split()
        ticket = map(lambda t : t.strip(), ticket)
        ticket = filter(lambda t : not t.isdigit(), ticket)
        if len(ticket) > 0:
            return ticket[0]
        else: 
            return 'XXX'
    

    # Extracting dummy variables from tickets:

    combined['Ticket'] = combined['Ticket'].map(cleanTicket)
    tickets_dummies = pd.get_dummies(combined['Ticket'], prefix='Ticket')
    combined = pd.concat([combined, tickets_dummies], axis=1)
    combined.drop('Ticket', inplace=True, axis=1)
In [239]:
process_ticket()

Family

In [240]:
def process_family():
    
    global combined
    # introducing a new feature : the size of families (including the passenger)
    combined['FamilySize'] = combined['Parch'] + combined['SibSp'] + 1
    
    # introducing other features based on the family size
    combined['Singleton'] = combined['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    combined['SmallFamily'] = combined['FamilySize'].map(lambda s: 1 if 2<=s<=4 else 0)
    combined['LargeFamily'] = combined['FamilySize'].map(lambda s: 1 if 5<=s else 0)
In [241]:
process_family()

Define metrics

In [242]:
def compute_score(clf, X, y, scoring='accuracy'):
    xval = cross_val_score(clf, X, y, cv = 5, scoring=scoring)
    return np.mean(xval)

split funtion

In [243]:
def recover_train_test_target():
    global combined
    file_path = os.path.join(os.getcwd(),"titanic")
    train0 = pd.read_csv(os.path.join(file_path,"train.csv"))
    
    targets = train0.Survived
    train = combined.head(891)
    test = combined.iloc[891:]
    
    return train, test, targets

这个地方的合并是知道顺序的,位置为index = 891,那么现在开始分。

Modeling

决策树

In [244]:
# Import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Figures inline and set visualization style
%matplotlib inline
sns.set()
In [245]:
import os
combined = get_combined_data()
get_title()
process_age()
process_names()
process_fares()
process_embarked()
process_cabin()
process_sex()
process_pclass()
process_ticket()
process_family()

这就是为什么全面特征工程要函数化的原因,比如这里我们改动了combined也没关系,跑下上面的代码就好了,实际上在Pycharm里面,我们会将特征工程放入一个.py文件,这样直接import就好。 这样数据的预处理,两个组的数据口径一致。

In [246]:
combined.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 63 columns):
Sex               1309 non-null int64
Age               1309 non-null float64
SibSp             1309 non-null int64
Parch             1309 non-null int64
Fare              1309 non-null float64
Title_Miss        1309 non-null uint8
Title_Mr          1309 non-null uint8
Title_Mrs         1309 non-null uint8
Title_Officer     1309 non-null uint8
Title_Royalty     1309 non-null uint8
Embarked_Q        1309 non-null uint8
Embarked_S        1309 non-null uint8
Cabin_B           1309 non-null uint8
Cabin_C           1309 non-null uint8
Cabin_D           1309 non-null uint8
Cabin_E           1309 non-null uint8
Cabin_F           1309 non-null uint8
Cabin_G           1309 non-null uint8
Cabin_T           1309 non-null uint8
Cabin_U           1309 non-null uint8
Pclass_2          1309 non-null uint8
Pclass_3          1309 non-null uint8
Ticket_A          1309 non-null uint8
Ticket_A4         1309 non-null uint8
Ticket_A5         1309 non-null uint8
Ticket_AQ3        1309 non-null uint8
Ticket_AQ4        1309 non-null uint8
Ticket_AS         1309 non-null uint8
Ticket_C          1309 non-null uint8
Ticket_CA         1309 non-null uint8
Ticket_CASOTON    1309 non-null uint8
Ticket_FC         1309 non-null uint8
Ticket_FCC        1309 non-null uint8
Ticket_Fa         1309 non-null uint8
Ticket_LINE       1309 non-null uint8
Ticket_LP         1309 non-null uint8
Ticket_PC         1309 non-null uint8
Ticket_PP         1309 non-null uint8
Ticket_PPP        1309 non-null uint8
Ticket_SC         1309 non-null uint8
Ticket_SCA3       1309 non-null uint8
Ticket_SCA4       1309 non-null uint8
Ticket_SCAH       1309 non-null uint8
Ticket_SCOW       1309 non-null uint8
Ticket_SCPARIS    1309 non-null uint8
Ticket_SCParis    1309 non-null uint8
Ticket_SOC        1309 non-null uint8
Ticket_SOP        1309 non-null uint8
Ticket_SOPP       1309 non-null uint8
Ticket_SOTONO2    1309 non-null uint8
Ticket_SOTONOQ    1309 non-null uint8
Ticket_SP         1309 non-null uint8
Ticket_STONO      1309 non-null uint8
Ticket_STONO2     1309 non-null uint8
Ticket_STONOQ     1309 non-null uint8
Ticket_SWPP       1309 non-null uint8
Ticket_WC         1309 non-null uint8
Ticket_WEP        1309 non-null uint8
Ticket_XXX        1309 non-null uint8
FamilySize        1309 non-null int64
Singleton         1309 non-null int64
SmallFamily       1309 non-null int64
LargeFamily       1309 non-null int64
dtypes: float64(2), int64(7), uint8(54)
memory usage: 171.3 KB
In [247]:
train, test, targets = recover_train_test_target()

注意这里决策树比较娇气,要使用values.

这里进入sklearn的机器学习包,需要np.array格式。因此我们提出.values

In [248]:
X = train.values
y = targets
X_test = test.values

这里的决策树是用平衡算法,level-wise,而非leaf-wise,这里限定max_depth = 3

这里不涉及调参数,因此暂时不用管为什么是3,是经验,在进行超参数调整的时候,会比较的。

In [249]:
clf = tree.DecisionTreeClassifier(max_depth=3)
print clf.fit(X,y)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

这里会有很多参数不知道,没关系,这些是系统设计的默认参数。

最后一步,开始.predict()

In [250]:
output = clf.predict(X_test).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032302.csv"),index=False)

所有这些结果都保存,最后用历史法进行集成算法用。

模型保存

这里使用pip install joblib进行安装。

In [251]:
import joblib
In [252]:
joblib.dump(clf, "titanic_model_18032201", compress=9)
Out[252]:
['titanic_model_18032201']

现在尝试重新录入。

In [253]:
clf2 = joblib.load("titanic_model_18032201")
print clf2.predict(X_test)[:5]
[0 1 0 0 1]

超参数调整简单介绍

下面主要解释一些决策树的理论和可视化的东西,毕竟模型这么简单,那么一定要做到跟别人解释得非常的浅显易懂。

max_depth是用来控制过拟合的,我们可以比较下其他$\neq 3$情况。

我们把train组拿出来做随机测试。

这个地方的train_test_splitlist

In [254]:
print type(train_test_split(X,y,test_size = 0.33, random_state = 42,stratify = y))
<type 'list'>
In [255]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)
In [256]:
dep = np.arange(1,9)

np.arangerange不同之处是,反馈的是array

设计等长的list,用来写入acc。

In [257]:
train_accuracy = np.empty(len(dep))
test_accuracy = np.empty(len(dep))
In [258]:
print train_accuracy, test_accuracy
[4.9e-324 9.9e-324 1.5e-323 2.0e-323 2.5e-323 3.0e-323 3.5e-323 4.0e-323] [0. 0. 0. 0. 0. 0. 0. 0.]
In [259]:
for i, k in enumerate(dep):
    print i,k
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8

这个地方的i刚好做index,k就是做好max_depth的枚举值。

In [260]:
for i, k in enumerate(dep):
    c = tree.DecisionTreeClassifier(max_depth=k)
    clf.fit(X_train,y_train)
    train_accuracy[i] = clf.score(X_train,y_train)
    test_accuracy[i]  = clf.score(X_test,y_test)
print train_accuracy,test_accuracy
[0.83221477 0.82718121 0.83221477 0.83221477 0.83221477 0.82718121
 0.82718121 0.82718121] [0.81016949 0.82372881 0.81016949 0.81016949 0.81016949 0.82372881
 0.82372881 0.82372881]
plt.title(u"训练集和测试集的Acc比较")
plt.plot(dep,train_accuracy,label = u"训练集Acc")
plt.plot(dep,test_accuracy,label = u"测试集Acc")
plt.legend()
plt.xlabel(u"level-wise决策树深度选择")
plt.ylabel("Acc")
plt.show()

注意这里中文显示,每个string前面记得加上u。 这里注意看到max_depth等于3或者7的时候,Acc最高,因此我们选择3. 为什么不选择7,因为经验。

继续更新随机森林的算法啊,因为啊,这种没bagging的决策树迟早要出bug。

随机森林

In [261]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble.gradient_boosting import GradientBoostingClassifier
from sklearn.cross_validation import cross_val_score

到这个地方可以总结下思路了,训练一个模型,相对来说比较规则化、死板的,需要需要把很多方面模块化。 这样的话,重新导入数据训练模型的时候,面对的就是一个个函数,这样就重复利用率上去了,之后我想要迭代模型的时候, 特征工程和选择模型就是分开的,节省时间。 当每次搞一个新的模型时候,模型处理变量不需要再弄一次,如果特征工程没什么可以增加的话,增加也不会把原有的剔除。 直接调用这个函数就好,这是一开始比较傻,但是之后很清楚。

这也是R和Python最大的区别, R上的数据预处理直接调用tidyverse一行代码结束,但是非常麻烦。 因此好好设计函数,这样不需要好几个.Rmd搞来搞去,累。 训练这种函数思维。

批量进行函数操作

In [262]:
import os
combined = get_combined_data()
get_title()
process_age()
process_names()
process_fares()
process_embarked()
process_cabin()
process_sex()
process_pclass()
process_ticket()
process_family()
train, test, targets = recover_train_test_target()
In [263]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
clf = RandomForestClassifier(n_estimators=50, max_features='sqrt')
clf = clf.fit(train, targets)
In [264]:
features = pd.DataFrame()
features['feature'] = train.columns
features['importance'] = clf.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)

importances图

features.plot(kind='barh', figsize=(20, 20))

这个图可以用于选择变量,当然下面这个函数相当于减负了。

下面开始用一个函数缩减变量。 当然也可以手动选择99%信息增益的变量。

In [265]:
model = SelectFromModel(clf, prefit=True)
train_reduced = model.transform(train)
train_reduced.shape
Out[265]:
(891, 14)

直接选取了13个变量。

In [266]:
test_reduced = model.transform(test)
test_reduced.shape
Out[266]:
(418, 14)

Hyperparameters tuning

我想了一下这个文档就不要动了,直接出pdf,那么knitr就一些备注,也不要md马德来回粘贴,浪费时间,节约时间学习算法。

In [267]:
# turn run_gs to True if you want to run the gridsearch again.
run_gs = False

if run_gs:
    parameter_grid = {
                 'max_depth' : [4, 6, 8],
                 'n_estimators': [50, 10],
                 'max_features': ['sqrt', 'auto', 'log2'],
                 'min_samples_split': [1, 3, 10],
                 'min_samples_leaf': [1, 3, 10],
                 'bootstrap': [True, False],
                 }
    forest = RandomForestClassifier()
    cross_validation = StratifiedKFold(targets, n_folds=5)

    grid_search = GridSearchCV(forest,
                               scoring='accuracy',
                               param_grid=parameter_grid,
                               cv=cross_validation)

    grid_search.fit(train, targets)
    model = grid_search
    parameters = grid_search.best_params_

    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
else: 
    parameters = {'bootstrap': False, 'min_samples_leaf': 3, 'n_estimators': 50, 
                  'min_samples_split': 10, 'max_features': 'sqrt', 'max_depth': 6}
    
    model = RandomForestClassifier(**parameters)
    model.fit(train, targets)
In [268]:
compute_score(model, train, targets, scoring='accuracy')
Out[268]:
0.8272156017458057
In [269]:
    global combined
    file_path = os.path.join(os.getcwd(),"titanic")
    train0 = pd.read_csv(os.path.join(file_path,"train.csv"))
In [270]:
output = model.predict(test).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032303.csv"),index=False)
In [271]:
import joblib
joblib.dump(model, "titanic_model_18032301", compress=9)
Out[271]:
['titanic_model_18032301']

XGBoost

安装包

In [272]:
import xgboost as xgb

DMLC做了一个全平台的xgboost tutorial,这里代码主要参考这个网站,R等平台的也可以借鉴。

In [273]:
import os
combined = get_combined_data()
get_title()
process_age()
process_names()
process_fares()
process_embarked()
process_cabin()
process_sex()
process_pclass()
process_ticket()
process_family()
train, test, targets = recover_train_test_target()

建立train和validation组

因为xgboost可以设计watchlist,所以对原来的train数据进行切分。

In [274]:
X_train,X_test,y_train,y_test= train_test_split(train, targets, test_size=0.2, random_state=123)

简单fit

In [275]:
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
xg_cl.fit(X_train,y_train)
preds = xg_cl.predict(X_test)
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))
accuracy: 0.854749

预测

In [276]:
output = xg_cl.predict(test).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032401.csv"),index=False)

Your submission scored 0.78468, which is not an improvement of your best score. Keep trying! 感觉要比前面的模型都要好,试试cv的效果。

保存模型

In [277]:
# xg_cl.dump_model("titanic_model_18032401.txt","titanic_model_18032401_featmap.txt")

这个地方.save_model().dump_model()都不能用,不知道什么bug.

In [278]:
import joblib
joblib.dump(xg_cl, "titanic_model_18032401", compress=9)
Out[278]:
['titanic_model_18032401']

展示决策树图

安装包graphviz

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

有时候这个dependency会下不下来,多下几次,耐心。

展示

In [279]:
from numpy import loadtxt
import matplotlib.pyplot as plt
xgb.plot_tree(xg_cl._Booster, num_trees=0)
plt.show()

这是boosting中第一个树的展示,很难看,R完爆。

cross-validation

xgboost进入的数据需要是稀疏矩阵,因此这里设计了一个函数xgb.DMatrix进行转换。

In [280]:
# Create the DMatrix: churn_dmatrix
dmatrix = xgb.DMatrix(data=train, label=targets)

# Create the parameter dictionary: params
params = {"objective":"binary:logistic", "max_depth":3}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=dmatrix, params=params, nfold=3, num_boost_round=5, metrics=["auc"], seed=123)

# Print cv_results
print(cv_results)
['[0]\tcv-test-auc:0.856587+0.036790\tcv-train-auc:0.875038+0.004045', '[1]\tcv-test-auc:0.867031+0.026898\tcv-train-auc:0.886307+0.004904', '[2]\tcv-test-auc:0.867928+0.028468\tcv-train-auc:0.889364+0.007391', '[3]\tcv-test-auc:0.867096+0.029492\tcv-train-auc:0.893532+0.009476', '[4]\tcv-test-auc:0.865218+0.031522\tcv-train-auc:0.898663+0.013506']
[0]	cv-test-auc:0.856587+0.036790	cv-train-auc:0.875038+0.004045
[1]	cv-test-auc:0.867031+0.026898	cv-train-auc:0.886307+0.004904
[2]	cv-test-auc:0.867928+0.028468	cv-train-auc:0.889364+0.007391
[3]	cv-test-auc:0.867096+0.029492	cv-train-auc:0.893532+0.009476
[4]	cv-test-auc:0.865218+0.031522	cv-train-auc:0.898663+0.013506

注意,如果这里的评价指标使用AUC的化,需要list化,也就是输入["auc"],而非"auc",否则报错。

cv_results是一个list,就是说我母鸡如何导出?

output = cv_results.predict(test).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032402.csv"),index=False)

历史集成算法

这里把这两个模型进行一个简单的决策树、投票、平均等等,也就是形成了集成算法。

In [281]:
import joblib 
md1 = joblib.load("titanic_model_18032201")
md2 = joblib.load("titanic_model_18032301")
md3 = joblib.load("titanic_model_18032401")

批量进行函数操作

In [282]:
import os
combined = get_combined_data()
get_title()
process_age()
process_names()
process_fares()
process_embarked()
process_cabin()
process_sex()
process_pclass()
process_ticket()
process_family()
train, test, targets = recover_train_test_target()

逻辑回归

In [283]:
from sklearn.linear_model import LogisticRegression
In [284]:
y1tr = md1.predict_proba(train.values)
y2tr = md2.predict_proba(train)
y3tr = md2.predict_proba(train)
y1te = md1.predict_proba(test.values)
y2te = md2.predict_proba(test)
y3te = md2.predict_proba(test)
In [285]:
X_tr = pd.DataFrame({"y1":y1tr[:,0],"y2":y2tr[:,0],"y3":y3tr[:,0]})
X_te = pd.DataFrame({"y1":y1te[:,0],"y2":y2te[:,0],"y3":y3te[:,0]})
In [286]:
from pandas_ply import install_ply, X, sym_call
install_ply(pd)
In [287]:
X_tr = X_tr.ply_select('*',
                y1sq = X.y1**2,
                y2sq = X.y2**2,
                y3sq = X.y3**2,
                intsc1 = X.y1*X.y2,
                intsc2 = X.y1*X.y3,
                intsc3 = X.y2*X.y3,                      
                intsc4 = X.y1*X.y2*X.y3                   
                      )
X_te = X_te.ply_select('*',
                y1sq = X.y1**2,
                y2sq = X.y2**2,
                y3sq = X.y3**2,
                intsc1 = X.y1*X.y2,
                intsc2 = X.y1*X.y3,
                intsc3 = X.y2*X.y3,                      
                intsc4 = X.y1*X.y2*X.y3                   
                      )
In [288]:
lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(X_tr.values, targets)
Out[288]:
LogisticRegression(C=1000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=0,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
In [289]:
lr.coef_
Out[289]:
array([[ -0.24902198, -10.1045079 , -10.1045079 ,   8.72614486,
          5.66796023,  -3.71285182,   5.66796023,  -3.71285182,
         -8.10302608,   5.66796023]])

这个地方不要抱怨,sklearn还真的不反馈p值,这里就不管了,大概可以判断,决策树的$\beta$很低很挫。

这个地方注意是[:,1]不是[:,0],别弄反了。

cutoff = 0.5

In [290]:
y_hat = lr.predict_proba(X_te)[:,1]
output = pd.Series((y_hat > 0.5)).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032305.csv"),index=False)

Your submission scored 0.77990, which is not an improvement of your best score. Keep trying!

加入xgboos后,分数掉了。

Your submission scored 0.75119, which is not an improvement of your best score. Keep trying!

发现,是否加交叉项目,影响不大。

cutoff = median()

In [291]:
y_hat = lr.predict_proba(X_te)[:,1]
output = pd.Series((y_hat > pd.Series(y_hat).median())).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032306.csv"),index=False)

Your submission scored 0.73684, which is not an improvement of your best score. Keep trying! 所以中位数不能取用。

Your submission scored 0.72248, which is not an improvement of your best score. Keep trying! 所以我说最后一层用逻辑回归,把xgboost放进来,基本上是要gg的。

cutoff = mean()

In [292]:
y_hat = lr.predict_proba(X_te)[:,1]
output = pd.Series((y_hat > pd.Series(y_hat).mean())).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032307.csv"),index=False)

Your submission scored 0.78468, which is not an improvement of your best score. Keep trying!

Your submission scored 0.74641, which is not an improvement of your best score. Keep trying! xgboost就是不能放在前面!!!

In [293]:
lr.score(X_tr.values, targets)
Out[293]:
0.8630751964085297

一般我们不采用训练集的acc,是因为这个值也是过拟合的指标。

均值法

In [294]:
y1 = md1.predict_proba(test.values)
y2 = md2.predict_proba(test)
In [295]:
y_hat = (y1[:,1]+y2[:,1])/2
output = pd.Series((y_hat > 0.5)).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032304.csv"),index=False)

Your submission scored 0.78468, which is not an improvement of your best score. Keep trying!

求平均居然比逻辑回归好。