前言¶

这里主要讲两个模型，决策树、随机森林，最后随机森林涉及超参数调整，也就是grid思想。在模型训练前，就会用到变量的特征工程，这里使用函数思维。

引入决策树是因为简单，作为一个tutorial，让代码一次性可以跑通很重要。

决策树参考 Hugo Bowne-Anderson January 3rd, 2018 Kaggle Tutorial: Your First Machine Learning Model https://www.datacamp.com/community/tutorials/kaggle-tutorial-machine-learning

随机森林参考 Ahmed BESEES March 10th, 2016 How to score 0.8134 in Titanic Kaggle Challenge https://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html 不知道哪国人。

这里展示不是base blogdown包的，因为这里jupyter和RStudio感觉是竞争品牌，不会互相支持，算了，直接jupyter写。

数据下载¶

下载数据，先安装kaggle命令: pip install kaggle
然后下载数据: kaggle competitions download -c titanic
参考文档见 kaggle

$ mv kaggle.json /Users/JiaxiangLi/.kaggle
$ cd /Users/JiaxiangLi/.kaggle
$ ls
kaggle.json

特征工程¶

In [203]:

from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

Out[203]:

In [204]:

# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---

%matplotlib inline
import pandas as pd
pd.options.display.max_columns = 100
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np

pd.options.display.max_rows = 100

这些设定，比如限定展示不超过100行和100列是非常有用的。

In [205]:

import os 
import pandas as pd
from dplython import (DplyFrame, X, diamonds, select, sift,
  sample_n, sample_frac, head, arrange, mutate, group_by,
  summarize, DelayFunction)

In [206]:

os.getcwd()

Out[206]:

'/Users/JiaxiangLi/Downloads/me/trans/python_learning'

录入数据¶

In [207]:

def get_combined_data():
    file_path = os.path.join(os.getcwd(),"titanic")
    train = pd.read_csv(os.path.join(file_path,"train.csv"))
    test = pd.read_csv(os.path.join(file_path,"test.csv"))
    target = train.Survived
    train = train.drop("Survived",axis = 1)
    combined = train.append(test)
    combined.reset_index().drop("index",axis =1)
    combined.drop("PassengerId", axis = 1, inplace=True)
    return combined

这样每次统一.ipynb文件的路径和数据的路径，就可以重复利用这个文档了。
定义一个函数，每次跑函数就可以了，python的函数思维，一定要养成，最好的效果就是节约时间。

注意上面最后return combined因此我们定义combined我们合并train和test的数据，进入特征工程方面。

In [208]:

combined = get_combined_data()

In [209]:

print combined.shape
row,col = combined.shape
print row,col

(1309, 10)
1309 10

这里是返回dataframe的大小。

简单了解下。

titles¶

combined["Title"] = combined.Name.map(lambda name:name.split(',')[1].split('.')[0].strip())

相当于提取以,和.中间的字符，且去掉空格。
这里不需要用replace函数，难搞，直接用mapfollow R的思路。但是是针对单一变量.

Title_Dictionary = {
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Royalty",
                    "Dona":       "Royalty",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Royalty"

                    }

In [210]:

def get_title():
    global combined
    combined["Title"] = combined.Name.map(lambda name:name.split(',')[1].split('.')[0].strip())
    Title_Dictionary = {
                    "Capt":       "Officer",
                    "Col":        "Officer",
                    "Major":      "Officer",
                    "Jonkheer":   "Royalty",
                    "Don":        "Royalty",
                    "Sir" :       "Royalty",
                    "Dr":         "Officer",
                    "Rev":        "Officer",
                    "the Countess":"Royalty",
                    "Dona":       "Royalty",
                    "Mme":        "Mrs",
                    "Mlle":       "Miss",
                    "Ms":         "Mrs",
                    "Mr" :        "Mr",
                    "Mrs" :       "Mrs",
                    "Miss" :      "Miss",
                    "Master" :    "Master",
                    "Lady" :      "Royalty"

                    }
    combined.Title = combined.Title.map(Title_Dictionary)

整合成一个函数，一定要这样定义，这样就把特征工程模块化，再后面跑模型的时候就不需要管前面了，非常节约时间。

In [211]:

get_title()

这个地方的map函数定义和R里面的purrr包定义类似。

Age¶

这个地方的age，根据title,sex,pclass的情况不同肯定不同，因此直接用均值和中位数肯定不恰当，要不groupby三个变量，要不直接回归？简单点，直接前面一种，我们来看看，分train和test组。

In [212]:

combined[:891].groupby(["Pclass","Sex","Title"]).median()

Out[212]:

			Age	SibSp	Parch	Fare
Pclass	Sex	Title
1	female	Miss	30.0	0.0	0.0	88.25000
		Mrs	40.0	1.0	0.0	79.20000
		Officer	49.0	0.0	0.0	25.92920
		Royalty	40.5	0.5	0.0	63.05000
	male	Master	4.0	1.0	2.0	120.00000
		Mr	40.0	0.0	0.0	42.40000
		Officer	51.0	0.0	0.0	35.50000
		Royalty	40.0	0.0	0.0	27.72080
2	female	Miss	24.0	0.0	0.0	13.00000
	female	Mrs	31.5	1.0	0.0	26.00000
	male	Master	1.0	1.0	1.0	26.00000
		Mr	31.0	0.0	0.0	13.00000
		Officer	46.5	0.0	0.0	13.00000
3	female	Miss	18.0	0.0	0.0	8.75625
	female	Mrs	31.0	1.0	1.0	15.97500
	male	Master	4.0	3.5	1.0	28.51250
	male	Mr	26.0	0.0	0.0	7.89580

简单看一下就觉得差异得吓人，因此肯定要区别对待。

这里要用if-else条件层层完成一个函数，真累。因此我们肯定不能这么做，直接一步left join搞定。

In [213]:

def process_age():
    global combined
    age_median = combined[:891].groupby(["Pclass","Sex","Title"]).median().reset_index()[["Pclass","Sex","Age","Title"]]
    combined2 = pd.merge(combined, age_median, how="left", on = ["Pclass","Sex","Title"])
    combined.Age = combined2.Age_x.fillna(combined2.Age_y)
    return combined

In [214]:

process_age()

Out[214]:

	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title
0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S	Mr
1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C	Mrs
2	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S	Miss
3	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S	Mrs
4	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S	Mr
5	3	Moran, Mr. James	male	26.0	0	0	330877	8.4583	NaN	Q	Mr
6	1	McCarthy, Mr. Timothy J	male	54.0	0	0	17463	51.8625	E46	S	Mr
7	3	Palsson, Master. Gosta Leonard	male	2.0	3	1	349909	21.0750	NaN	S	Master
8	3	Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)	female	27.0	0	2	347742	11.1333	NaN	S	Mrs
9	2	Nasser, Mrs. Nicholas (Adele Achem)	female	14.0	1	0	237736	30.0708	NaN	C	Mrs
10	3	Sandstrom, Miss. Marguerite Rut	female	4.0	1	1	PP 9549	16.7000	G6	S	Miss
11	1	Bonnell, Miss. Elizabeth	female	58.0	0	0	113783	26.5500	C103	S	Miss
12	3	Saundercock, Mr. William Henry	male	20.0	0	0	A/5. 2151	8.0500	NaN	S	Mr
13	3	Andersson, Mr. Anders Johan	male	39.0	1	5	347082	31.2750	NaN	S	Mr
14	3	Vestrom, Miss. Hulda Amanda Adolfina	female	14.0	0	0	350406	7.8542	NaN	S	Miss
15	2	Hewlett, Mrs. (Mary D Kingcome)	female	55.0	0	0	248706	16.0000	NaN	S	Mrs
16	3	Rice, Master. Eugene	male	2.0	4	1	382652	29.1250	NaN	Q	Master
17	2	Williams, Mr. Charles Eugene	male	31.0	0	0	244373	13.0000	NaN	S	Mr
18	3	Vander Planke, Mrs. Julius (Emelia Maria Vande...	female	31.0	1	0	345763	18.0000	NaN	S	Mrs
19	3	Masselmani, Mrs. Fatima	female	31.0	0	0	2649	7.2250	NaN	C	Mrs
20	2	Fynney, Mr. Joseph J	male	35.0	0	0	239865	26.0000	NaN	S	Mr
21	2	Beesley, Mr. Lawrence	male	34.0	0	0	248698	13.0000	D56	S	Mr
22	3	McGowan, Miss. Anna "Annie"	female	15.0	0	0	330923	8.0292	NaN	Q	Miss
23	1	Sloper, Mr. William Thompson	male	28.0	0	0	113788	35.5000	A6	S	Mr
24	3	Palsson, Miss. Torborg Danira	female	8.0	3	1	349909	21.0750	NaN	S	Miss
25	3	Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...	female	38.0	1	5	347077	31.3875	NaN	S	Mrs
26	3	Emir, Mr. Farred Chehab	male	26.0	0	0	2631	7.2250	NaN	C	Mr
27	1	Fortune, Mr. Charles Alexander	male	19.0	3	2	19950	263.0000	C23 C25 C27	S	Mr
28	3	O'Dwyer, Miss. Ellen "Nellie"	female	18.0	0	0	330959	7.8792	NaN	Q	Miss
29	3	Todoroff, Mr. Lalio	male	26.0	0	0	349216	7.8958	NaN	S	Mr
30	1	Uruchurtu, Don. Manuel E	male	40.0	0	0	PC 17601	27.7208	NaN	C	Royalty
31	1	Spencer, Mrs. William Augustus (Marie Eugenie)	female	40.0	1	0	PC 17569	146.5208	B78	C	Mrs
32	3	Glynn, Miss. Mary Agatha	female	18.0	0	0	335677	7.7500	NaN	Q	Miss
33	2	Wheadon, Mr. Edward H	male	66.0	0	0	C.A. 24579	10.5000	NaN	S	Mr
34	1	Meyer, Mr. Edgar Joseph	male	28.0	1	0	PC 17604	82.1708	NaN	C	Mr
35	1	Holverson, Mr. Alexander Oskar	male	42.0	1	0	113789	52.0000	NaN	S	Mr
36	3	Mamee, Mr. Hanna	male	26.0	0	0	2677	7.2292	NaN	C	Mr
37	3	Cann, Mr. Ernest Charles	male	21.0	0	0	A./5. 2152	8.0500	NaN	S	Mr
38	3	Vander Planke, Miss. Augusta Maria	female	18.0	2	0	345764	18.0000	NaN	S	Miss
39	3	Nicola-Yarred, Miss. Jamila	female	14.0	1	0	2651	11.2417	NaN	C	Miss
40	3	Ahlin, Mrs. Johan (Johanna Persdotter Larsson)	female	40.0	1	0	7546	9.4750	NaN	S	Mrs
41	2	Turpin, Mrs. William John Robert (Dorothy Ann ...	female	27.0	1	0	11668	21.0000	NaN	S	Mrs
42	3	Kraeff, Mr. Theodor	male	26.0	0	0	349253	7.8958	NaN	C	Mr
43	2	Laroche, Miss. Simonne Marie Anne Andree	female	3.0	1	2	SC/Paris 2123	41.5792	NaN	C	Miss
44	3	Devaney, Miss. Margaret Delia	female	19.0	0	0	330958	7.8792	NaN	Q	Miss
45	3	Rogers, Mr. William John	male	26.0	0	0	S.C./A.4. 23567	8.0500	NaN	S	Mr
46	3	Lennon, Mr. Denis	male	26.0	1	0	370371	15.5000	NaN	Q	Mr
47	3	O'Driscoll, Miss. Bridget	female	18.0	0	0	14311	7.7500	NaN	Q	Miss
48	3	Samaan, Mr. Youssef	male	26.0	2	0	2662	21.6792	NaN	C	Mr
49	3	Arnold-Franchi, Mrs. Josef (Josefine Franchi)	female	18.0	1	0	349237	17.8000	NaN	S	Mrs
...	...	...	...	...	...	...	...	...	...	...	...
368	1	Gibson, Mrs. Leonard (Pauline C Boeson)	female	18.0	0	1	112378	59.4000	NaN	C	Mrs
369	2	Pallas y Castello, Mr. Emilio	male	24.0	0	0	SC/PARIS 2147	13.8583	NaN	C	Mr
370	2	Giles, Mr. Edgar	male	25.0	1	0	28133	11.5000	NaN	S	Mr
371	1	Wilson, Miss. Helen Alice	female	18.0	0	0	16966	134.5000	E39 E41	C	Miss
372	1	Ismay, Mr. Joseph Bruce	male	19.0	0	0	112058	0.0000	B52 B54 B56	S	Mr
373	2	Harbeck, Mr. William H	male	22.0	0	0	248746	13.0000	NaN	S	Mr
374	1	Dodge, Mrs. Washington (Ruth Vidaver)	female	3.0	1	1	33638	81.8583	A34	S	Mrs
375	1	Bowen, Miss. Grace Scott	female	40.0	0	0	PC 17608	262.3750	NaN	C	Miss
376	3	Kink, Miss. Maria	female	22.0	2	0	315152	8.6625	NaN	S	Miss
377	2	Cotterill, Mr. Henry Harry""	male	27.0	0	0	29107	11.5000	NaN	S	Mr
378	1	Hipkins, Mr. William Edward	male	20.0	0	0	680	50.0000	C39	S	Mr
379	3	Asplund, Master. Carl Edgar	male	19.0	4	2	347077	31.3875	NaN	S	Master
380	3	O'Connor, Mr. Patrick	male	42.0	0	0	366713	7.7500	NaN	Q	Mr
381	3	Foley, Mr. Joseph	male	1.0	0	0	330910	7.8792	NaN	Q	Mr
382	3	Risien, Mrs. Samuel (Emma)	female	32.0	0	0	364498	14.5000	NaN	S	Mrs
383	3	McNamee, Mrs. Neal (Eileen O'Leary)	female	35.0	1	0	376566	16.1000	NaN	S	Mrs
384	2	Wheeler, Mr. Edwin Frederick""	male	26.0	0	0	SC/PARIS 2159	12.8750	NaN	S	Mr
385	2	Herman, Miss. Kate	female	18.0	1	2	220845	65.0000	NaN	S	Miss
386	3	Aronsson, Mr. Ernst Axel Algot	male	1.0	0	0	349911	7.7750	NaN	S	Mr
387	2	Ashby, Mr. John	male	36.0	0	0	244346	13.0000	NaN	S	Mr
388	3	Canavan, Mr. Patrick	male	26.0	0	0	364858	7.7500	NaN	Q	Mr
389	3	Palsson, Master. Paul Folke	male	17.0	3	1	349909	21.0750	NaN	S	Master
390	1	Payne, Mr. Vivian Ponsonby	male	36.0	0	0	12749	93.5000	B24	S	Mr
391	1	Lines, Mrs. Ernest H (Elizabeth Lindsey James)	female	21.0	0	1	PC 17592	39.4000	D28	S	Mrs
392	3	Abbott, Master. Eugene Joseph	male	28.0	0	2	C.A. 2673	20.2500	NaN	S	Master
393	2	Gilbert, Mr. William	male	23.0	0	0	C.A. 30769	10.5000	NaN	S	Mr
394	3	Kink-Heilmann, Mr. Anton	male	24.0	3	1	315153	22.0250	NaN	S	Mr
395	1	Smith, Mrs. Lucien Philip (Mary Eloise Hughes)	female	22.0	1	0	13695	60.0000	C31	S	Mrs
396	3	Colbert, Mr. Patrick	male	31.0	0	0	371109	7.2500	NaN	Q	Mr
397	1	Frolicher-Stehli, Mrs. Maxmillian (Margaretha ...	female	46.0	1	1	13567	79.2000	B41	C	Mrs
398	3	Larsson-Rondberg, Mr. Edvard A	male	23.0	0	0	347065	7.7750	NaN	S	Mr
399	3	Conlon, Mr. Thomas Henry	male	28.0	0	0	21332	7.7333	NaN	Q	Mr
400	1	Bonnell, Miss. Caroline	female	39.0	0	0	36928	164.8667	C7	S	Miss
401	2	Gale, Mr. Harry	male	26.0	1	0	28664	21.0000	NaN	S	Mr
402	1	Gibson, Miss. Dorothy Winifred	female	21.0	0	1	112378	59.4000	NaN	C	Miss
403	1	Carrau, Mr. Jose Pedro	male	28.0	0	0	113059	47.1000	NaN	S	Mr
404	1	Frauenthal, Mr. Isaac Gerald	male	20.0	1	0	17765	27.7208	D40	C	Mr
405	2	Nourney, Mr. Alfred (Baron von Drachstedt")"	male	34.0	0	0	SC/PARIS 2166	13.8625	D38	C	Mr
406	2	Ware, Mr. William Jeffery	male	51.0	1	0	28666	10.5000	NaN	S	Mr
407	1	Widener, Mr. George Dunton	male	3.0	1	1	113503	211.5000	C80	C	Mr
408	3	Riordan, Miss. Johanna Hannah""	female	21.0	0	0	334915	7.7208	NaN	Q	Miss
409	3	Peacock, Miss. Treasteall	female	18.0	1	1	SOTON/O.Q. 3101315	13.7750	NaN	S	Miss
410	3	Naughton, Miss. Hannah	female	26.0	0	0	365237	7.7500	NaN	Q	Miss
411	1	Minahan, Mrs. William Edward (Lillian E Thorpe)	female	26.0	1	0	19928	90.0000	C78	Q	Mrs
412	3	Henriksson, Miss. Jenny Lovisa	female	33.0	0	0	347086	7.7750	NaN	S	Miss
413	3	Spector, Mr. Woolf	male	31.0	0	0	A.5. 3236	8.0500	NaN	S	Mr
414	1	Oliva y Ocana, Dona. Fermina	female	44.0	0	0	PC 17758	108.9000	C105	C	Royalty
415	3	Saether, Mr. Simon Sivertsen	male	31.0	0	0	SOTON/O.Q. 3101262	7.2500	NaN	S	Mr
416	3	Ware, Mr. Frederick	male	34.0	0	0	359309	8.0500	NaN	S	Mr
417	3	Peter, Master. Michael J	male	18.0	1	1	2668	22.3583	NaN	C	Master

1309 rows × 11 columns

分类和连续变量describe¶

In [215]:

combined.get_dtype_counts()

Out[215]:

float64    2
int64      3
object     6
dtype: int64

In [216]:

combined.select_dtypes(include=["int64","float64"]).head(n=1)

Out[216]:

	Pclass	Age	SibSp	Parch	Fare
0	3	22.0	1	0	7.25

In [217]:

combined.select_dtypes(include=["object"]).nunique()

Out[217]:

Name        1307
Sex            2
Ticket       929
Cabin        186
Embarked       3
Title          6
dtype: int64

Names¶

names 是个噪音，这里处理方式是去掉，反正有了 title，要是不好的话，再说吧。但是 title需要one_hot

In [218]:

combined.Title.unique()

Out[218]:

array(['Mr', 'Mrs', 'Miss', 'Master', 'Royalty', 'Officer'], dtype=object)

drop_first必须的，不然共线性。

In [219]:

pd.get_dummies(combined.Title,prefix='Title',drop_first=True).head()
#combined.drop(["Name","Title"])

Out[219]:

	Title_Miss	Title_Mr	Title_Mrs
0	0	1	0
1	0	0	1
2	1	0	0
3	0	0	1
4	0	1	0

python - How to drop multiple columns on pandas data frame - Stack Overflow 这里减少变量只能用iloc给location，间接，真麻烦，dplyr爆几条街。

In [220]:

def process_names():
    global combined
    combined = pd.concat([combined.drop("Name",axis=1).drop("Title",axis=1),pd.get_dummies(combined.Title,prefix='Title',drop_first=True)],axis=1)

这里一定要加上axis=1

In [221]:

process_names()

Fare¶

这个地方的Fare缺失很少，直接均值搞定。

In [222]:

combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 14 columns):
Pclass           1309 non-null int64
Sex              1309 non-null object
Age              1309 non-null float64
SibSp            1309 non-null int64
Parch            1309 non-null int64
Ticket           1309 non-null object
Fare             1308 non-null float64
Cabin            295 non-null object
Embarked         1307 non-null object
Title_Miss       1309 non-null uint8
Title_Mr         1309 non-null uint8
Title_Mrs        1309 non-null uint8
Title_Officer    1309 non-null uint8
Title_Royalty    1309 non-null uint8
dtypes: float64(2), int64(3), object(4), uint8(5)
memory usage: 108.7+ KB

现在就剩下Fare、Embarked、Cabin了。

In [223]:

def process_fares():
    global combined
    combined.Fare.fillna(combined.Fare.mean(), inplace = True)

In [224]:

process_fares()

In [225]:

combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 14 columns):
Pclass           1309 non-null int64
Sex              1309 non-null object
Age              1309 non-null float64
SibSp            1309 non-null int64
Parch            1309 non-null int64
Ticket           1309 non-null object
Fare             1309 non-null float64
Cabin            295 non-null object
Embarked         1307 non-null object
Title_Miss       1309 non-null uint8
Title_Mr         1309 non-null uint8
Title_Mrs        1309 non-null uint8
Title_Officer    1309 non-null uint8
Title_Royalty    1309 non-null uint8
dtypes: float64(2), int64(3), object(4), uint8(5)
memory usage: 108.7+ KB

Embarked¶

江湖流传，这个缺失值替换成S准确性高很多。

In [226]:

def process_embarked():
    global combined
    combined.Embarked.fillna("S",inplace=True)
    embarked_dummies = pd.get_dummies(combined.Embarked,prefix="Embarked",drop_first=True)
    combined = pd.concat([combined, embarked_dummies],axis = 1)
    combined.drop("Embarked", inplace=True, axis = 1)

In [227]:

process_embarked()

Cabin¶

In [228]:

combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 15 columns):
Pclass           1309 non-null int64
Sex              1309 non-null object
Age              1309 non-null float64
SibSp            1309 non-null int64
Parch            1309 non-null int64
Ticket           1309 non-null object
Fare             1309 non-null float64
Cabin            295 non-null object
Title_Miss       1309 non-null uint8
Title_Mr         1309 non-null uint8
Title_Mrs        1309 non-null uint8
Title_Officer    1309 non-null uint8
Title_Royalty    1309 non-null uint8
Embarked_Q       1309 non-null uint8
Embarked_S       1309 non-null uint8
dtypes: float64(2), int64(3), object(3), uint8(7)
memory usage: 101.0+ KB

这里填写，U表示unknown。其他都取用首字母，非常暴力。

In [229]:

combined.Cabin.head()

Out[229]:

0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object

In [230]:

def process_cabin():
    global combined
    combined.Cabin.fillna('U', inplace=True)
    combined["Cabin"] = combined.Cabin.map(lambda c: c[0])
    cabin_dummies = pd.get_dummies(combined.Cabin, prefix="Cabin", drop_first=True)
    combined = pd.concat([combined,cabin_dummies], axis=1)
    combined.drop("Cabin", axis = 1, inplace=True)

python 中的 global - CSDN博客这是global的解释，我觉得作者的sense已经解释得很清楚了。

In [231]:

process_cabin()

combined.filter(regex="^Cabin")

Sex¶

In [232]:

def process_sex():
    
    global combined
    # mapping string values to numerical one 
    combined['Sex'] = combined['Sex'].map({'male':1,'female':0})

In [233]:

process_sex()

Pclass¶

注意这个pclass也是要onehot的。

In [234]:

def process_pclass():
    global combined
    pclass_dummies = pd.get_dummies(combined.Pclass,prefix="Pclass", drop_first=True)
    combined  = pd.concat([combined, pclass_dummies], axis=1)
    combined.drop("Pclass",axis=1, inplace=True)

In [235]:

process_pclass()

In [236]:

combined.head()

Out[236]:

	Sex	Age	SibSp	Ticket	Fare	Title_Miss	Title_Mr	Title_Mrs	Embarked_S	Cabin_C	Cabin_U	Pclass_3
0	1	22.0	1	A/5 21171	7.2500	0	1	0	1	0	1	1
1	0	38.0	1	PC 17599	71.2833	0	0	1	0	1	0	0
2	0	26.0	0	STON/O2. 3101282	7.9250	1	0	0	1	0	1	1
3	0	35.0	1	113803	53.1000	0	0	1	1	1	0	0
4	1	35.0	0	373450	8.0500	0	1	0	1	0	1	1

Ticket¶

In [237]:

combined.Ticket.head()

Out[237]:

0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
Name: Ticket, dtype: object

这个地方太烦了，copy了一个函数。

In [238]:

def process_ticket():
    
    global combined
    
    # a function that extracts each prefix of the ticket, returns 'XXX' if no prefix (i.e the ticket is a digit)
    def cleanTicket(ticket):
        ticket = ticket.replace('.','')
        ticket = ticket.replace('/','')
        ticket = ticket.split()
        ticket = map(lambda t : t.strip(), ticket)
        ticket = filter(lambda t : not t.isdigit(), ticket)
        if len(ticket) > 0:
            return ticket[0]
        else: 
            return 'XXX'
    

    # Extracting dummy variables from tickets:

    combined['Ticket'] = combined['Ticket'].map(cleanTicket)
    tickets_dummies = pd.get_dummies(combined['Ticket'], prefix='Ticket')
    combined = pd.concat([combined, tickets_dummies], axis=1)
    combined.drop('Ticket', inplace=True, axis=1)

In [239]:

process_ticket()

Family¶

In [240]:

def process_family():
    
    global combined
    # introducing a new feature : the size of families (including the passenger)
    combined['FamilySize'] = combined['Parch'] + combined['SibSp'] + 1
    
    # introducing other features based on the family size
    combined['Singleton'] = combined['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    combined['SmallFamily'] = combined['FamilySize'].map(lambda s: 1 if 2<=s<=4 else 0)
    combined['LargeFamily'] = combined['FamilySize'].map(lambda s: 1 if 5<=s else 0)

In [241]:

process_family()

Define metrics¶

In [242]:

def compute_score(clf, X, y, scoring='accuracy'):
    xval = cross_val_score(clf, X, y, cv = 5, scoring=scoring)
    return np.mean(xval)

split funtion¶

In [243]:

def recover_train_test_target():
    global combined
    file_path = os.path.join(os.getcwd(),"titanic")
    train0 = pd.read_csv(os.path.join(file_path,"train.csv"))
    
    targets = train0.Survived
    train = combined.head(891)
    test = combined.iloc[891:]
    
    return train, test, targets

这个地方的合并是知道顺序的，位置为index = 891，那么现在开始分。

Modeling¶

决策树¶

In [244]:

# Import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Figures inline and set visualization style
%matplotlib inline
sns.set()

In [245]:

import os
combined = get_combined_data()
get_title()
process_age()
process_names()
process_fares()
process_embarked()
process_cabin()
process_sex()
process_pclass()
process_ticket()
process_family()

这就是为什么全面特征工程要函数化的原因，比如这里我们改动了combined也没关系，跑下上面的代码就好了，实际上在Pycharm里面，我们会将特征工程放入一个.py文件，这样直接import就好。这样数据的预处理，两个组的数据口径一致。

In [246]:

combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 63 columns):
Sex               1309 non-null int64
Age               1309 non-null float64
SibSp             1309 non-null int64
Parch             1309 non-null int64
Fare              1309 non-null float64
Title_Miss        1309 non-null uint8
Title_Mr          1309 non-null uint8
Title_Mrs         1309 non-null uint8
Title_Officer     1309 non-null uint8
Title_Royalty     1309 non-null uint8
Embarked_Q        1309 non-null uint8
Embarked_S        1309 non-null uint8
Cabin_B           1309 non-null uint8
Cabin_C           1309 non-null uint8
Cabin_D           1309 non-null uint8
Cabin_E           1309 non-null uint8
Cabin_F           1309 non-null uint8
Cabin_G           1309 non-null uint8
Cabin_T           1309 non-null uint8
Cabin_U           1309 non-null uint8
Pclass_2          1309 non-null uint8
Pclass_3          1309 non-null uint8
Ticket_A          1309 non-null uint8
Ticket_A4         1309 non-null uint8
Ticket_A5         1309 non-null uint8
Ticket_AQ3        1309 non-null uint8
Ticket_AQ4        1309 non-null uint8
Ticket_AS         1309 non-null uint8
Ticket_C          1309 non-null uint8
Ticket_CA         1309 non-null uint8
Ticket_CASOTON    1309 non-null uint8
Ticket_FC         1309 non-null uint8
Ticket_FCC        1309 non-null uint8
Ticket_Fa         1309 non-null uint8
Ticket_LINE       1309 non-null uint8
Ticket_LP         1309 non-null uint8
Ticket_PC         1309 non-null uint8
Ticket_PP         1309 non-null uint8
Ticket_PPP        1309 non-null uint8
Ticket_SC         1309 non-null uint8
Ticket_SCA3       1309 non-null uint8
Ticket_SCA4       1309 non-null uint8
Ticket_SCAH       1309 non-null uint8
Ticket_SCOW       1309 non-null uint8
Ticket_SCPARIS    1309 non-null uint8
Ticket_SCParis    1309 non-null uint8
Ticket_SOC        1309 non-null uint8
Ticket_SOP        1309 non-null uint8
Ticket_SOPP       1309 non-null uint8
Ticket_SOTONO2    1309 non-null uint8
Ticket_SOTONOQ    1309 non-null uint8
Ticket_SP         1309 non-null uint8
Ticket_STONO      1309 non-null uint8
Ticket_STONO2     1309 non-null uint8
Ticket_STONOQ     1309 non-null uint8
Ticket_SWPP       1309 non-null uint8
Ticket_WC         1309 non-null uint8
Ticket_WEP        1309 non-null uint8
Ticket_XXX        1309 non-null uint8
FamilySize        1309 non-null int64
Singleton         1309 non-null int64
SmallFamily       1309 non-null int64
LargeFamily       1309 non-null int64
dtypes: float64(2), int64(7), uint8(54)
memory usage: 171.3 KB

In [247]:

train, test, targets = recover_train_test_target()

注意这里决策树比较娇气，要使用values.

这里进入sklearn的机器学习包，需要np.array格式。因此我们提出.values。

In [248]:

X = train.values
y = targets
X_test = test.values

这里的决策树是用平衡算法，level-wise，而非leaf-wise，这里限定max_depth = 3。

这里不涉及调参数，因此暂时不用管为什么是3，是经验，在进行超参数调整的时候，会比较的。

In [249]:

clf = tree.DecisionTreeClassifier(max_depth=3)
print clf.fit(X,y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

这里会有很多参数不知道，没关系，这些是系统设计的默认参数。

最后一步，开始.predict()

In [250]:

output = clf.predict(X_test).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032302.csv"),index=False)

所有这些结果都保存，最后用历史法进行集成算法用。

模型保存¶

这里使用pip install joblib进行安装。

In [251]:

import joblib

In [252]:

joblib.dump(clf, "titanic_model_18032201", compress=9)

Out[252]:

['titanic_model_18032201']

现在尝试重新录入。

In [253]:

clf2 = joblib.load("titanic_model_18032201")
print clf2.predict(X_test)[:5]

[0 1 0 0 1]

超参数调整简单介绍¶

下面主要解释一些决策树的理论和可视化的东西，毕竟模型这么简单，那么一定要做到跟别人解释得非常的浅显易懂。

max_depth是用来控制过拟合的，我们可以比较下其他$\neq 3$情况。

我们把train组拿出来做随机测试。

这个地方的train_test_split是list。

In [254]:

print type(train_test_split(X,y,test_size = 0.33, random_state = 42,stratify = y))

<type 'list'>

In [255]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

In [256]:

dep = np.arange(1,9)

np.arange和range不同之处是，反馈的是array。

设计等长的list，用来写入acc。

In [257]:

train_accuracy = np.empty(len(dep))
test_accuracy = np.empty(len(dep))

In [258]:

print train_accuracy, test_accuracy

[4.9e-324 9.9e-324 1.5e-323 2.0e-323 2.5e-323 3.0e-323 3.5e-323 4.0e-323] [0. 0. 0. 0. 0. 0. 0. 0.]

In [259]:

for i, k in enumerate(dep):
    print i,k

这个地方的i刚好做index,k就是做好max_depth的枚举值。

In [260]:

for i, k in enumerate(dep):
    c = tree.DecisionTreeClassifier(max_depth=k)
    clf.fit(X_train,y_train)
    train_accuracy[i] = clf.score(X_train,y_train)
    test_accuracy[i]  = clf.score(X_test,y_test)
print train_accuracy,test_accuracy

[0.83221477 0.82718121 0.83221477 0.83221477 0.83221477 0.82718121
 0.82718121 0.82718121] [0.81016949 0.82372881 0.81016949 0.81016949 0.81016949 0.82372881
 0.82372881 0.82372881]

plt.title(u"训练集和测试集的Acc比较")
plt.plot(dep,train_accuracy,label = u"训练集Acc")
plt.plot(dep,test_accuracy,label = u"测试集Acc")
plt.legend()
plt.xlabel(u"level-wise决策树深度选择")
plt.ylabel("Acc")
plt.show()

注意这里中文显示，每个string前面记得加上u。这里注意看到max_depth等于3或者7的时候，Acc最高，因此我们选择3. 为什么不选择7，因为经验。

继续更新随机森林的算法啊，因为啊，这种没bagging的决策树迟早要出bug。

随机森林¶

In [261]:

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble.gradient_boosting import GradientBoostingClassifier
from sklearn.cross_validation import cross_val_score

到这个地方可以总结下思路了，训练一个模型，相对来说比较规则化、死板的，需要需要把很多方面模块化。这样的话，重新导入数据训练模型的时候，面对的就是一个个函数，这样就重复利用率上去了，之后我想要迭代模型的时候，特征工程和选择模型就是分开的，节省时间。当每次搞一个新的模型时候，模型处理变量不需要再弄一次，如果特征工程没什么可以增加的话，增加也不会把原有的剔除。直接调用这个函数就好，这是一开始比较傻，但是之后很清楚。

这也是R和Python最大的区别， R上的数据预处理直接调用tidyverse一行代码结束，但是非常麻烦。因此好好设计函数，这样不需要好几个.Rmd搞来搞去，累。训练这种函数思维。

批量进行函数操作¶

In [262]:

import os
combined = get_combined_data()
get_title()
process_age()
process_names()
process_fares()
process_embarked()
process_cabin()
process_sex()
process_pclass()
process_ticket()
process_family()
train, test, targets = recover_train_test_target()

In [263]:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
clf = RandomForestClassifier(n_estimators=50, max_features='sqrt')
clf = clf.fit(train, targets)

In [264]:

features = pd.DataFrame()
features['feature'] = train.columns
features['importance'] = clf.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)

importances图¶

features.plot(kind='barh', figsize=(20, 20))

这个图可以用于选择变量，当然下面这个函数相当于减负了。

下面开始用一个函数缩减变量。当然也可以手动选择99%信息增益的变量。

In [265]:

model = SelectFromModel(clf, prefit=True)
train_reduced = model.transform(train)
train_reduced.shape

Out[265]:

(891, 14)

直接选取了13个变量。

In [266]:

test_reduced = model.transform(test)
test_reduced.shape

Out[266]:

(418, 14)

Hyperparameters tuning¶

我想了一下这个文档就不要动了，直接出pdf，那么knitr就一些备注，也不要md马德来回粘贴，浪费时间，节约时间学习算法。

In [267]:

# turn run_gs to True if you want to run the gridsearch again.
run_gs = False

if run_gs:
    parameter_grid = {
                 'max_depth' : [4, 6, 8],
                 'n_estimators': [50, 10],
                 'max_features': ['sqrt', 'auto', 'log2'],
                 'min_samples_split': [1, 3, 10],
                 'min_samples_leaf': [1, 3, 10],
                 'bootstrap': [True, False],
                 }
    forest = RandomForestClassifier()
    cross_validation = StratifiedKFold(targets, n_folds=5)

    grid_search = GridSearchCV(forest,
                               scoring='accuracy',
                               param_grid=parameter_grid,
                               cv=cross_validation)

    grid_search.fit(train, targets)
    model = grid_search
    parameters = grid_search.best_params_

    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
else: 
    parameters = {'bootstrap': False, 'min_samples_leaf': 3, 'n_estimators': 50, 
                  'min_samples_split': 10, 'max_features': 'sqrt', 'max_depth': 6}
    
    model = RandomForestClassifier(**parameters)
    model.fit(train, targets)

In [268]:

compute_score(model, train, targets, scoring='accuracy')

Out[268]:

0.8272156017458057

In [269]:

    global combined
    file_path = os.path.join(os.getcwd(),"titanic")
    train0 = pd.read_csv(os.path.join(file_path,"train.csv"))

In [270]:

output = model.predict(test).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032303.csv"),index=False)

In [271]:

import joblib
joblib.dump(model, "titanic_model_18032301", compress=9)

Out[271]:

['titanic_model_18032301']

XGBoost¶

安装包¶

安装: (Mac)
- conda remove xgboost
- conda install -c aterrel xgboost=0.4.0
官网: https://pypi.python.org/pypi/xgboost/

In [272]:

import xgboost as xgb

DMLC做了一个全平台的xgboost tutorial，这里代码主要参考这个网站，R等平台的也可以借鉴。

In [273]:

import os
combined = get_combined_data()
get_title()
process_age()
process_names()
process_fares()
process_embarked()
process_cabin()
process_sex()
process_pclass()
process_ticket()
process_family()
train, test, targets = recover_train_test_target()

建立train和validation组¶

因为xgboost可以设计watchlist，所以对原来的train数据进行切分。

In [274]:

X_train,X_test,y_train,y_test= train_test_split(train, targets, test_size=0.2, random_state=123)

简单fit¶

In [275]:

xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
xg_cl.fit(X_train,y_train)
preds = xg_cl.predict(X_test)
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))

accuracy: 0.854749

预测¶

In [276]:

output = xg_cl.predict(test).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032401.csv"),index=False)

Your submission scored 0.78468, which is not an improvement of your best score. Keep trying! 感觉要比前面的模型都要好，试试cv的效果。

保存模型¶

In [277]:

# xg_cl.dump_model("titanic_model_18032401.txt","titanic_model_18032401_featmap.txt")

这个地方.save_model()和.dump_model()都不能用，不知道什么bug.

In [278]:

import joblib
joblib.dump(xg_cl, "titanic_model_18032401", compress=9)

Out[278]:

['titanic_model_18032401']

展示决策树图¶

安装包`graphviz`¶

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

有时候这个dependency会下不下来，多下几次，耐心。

展示¶

In [279]:

from numpy import loadtxt
import matplotlib.pyplot as plt
xgb.plot_tree(xg_cl._Booster, num_trees=0)
plt.show()

这是boosting中第一个树的展示，很难看，R完爆。

cross-validation¶

xgboost进入的数据需要是稀疏矩阵，因此这里设计了一个函数xgb.DMatrix进行转换。

In [280]:

# Create the DMatrix: churn_dmatrix
dmatrix = xgb.DMatrix(data=train, label=targets)

# Create the parameter dictionary: params
params = {"objective":"binary:logistic", "max_depth":3}

# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=dmatrix, params=params, nfold=3, num_boost_round=5, metrics=["auc"], seed=123)

# Print cv_results
print(cv_results)

['[0]\tcv-test-auc:0.856587+0.036790\tcv-train-auc:0.875038+0.004045', '[1]\tcv-test-auc:0.867031+0.026898\tcv-train-auc:0.886307+0.004904', '[2]\tcv-test-auc:0.867928+0.028468\tcv-train-auc:0.889364+0.007391', '[3]\tcv-test-auc:0.867096+0.029492\tcv-train-auc:0.893532+0.009476', '[4]\tcv-test-auc:0.865218+0.031522\tcv-train-auc:0.898663+0.013506']

[0]	cv-test-auc:0.856587+0.036790	cv-train-auc:0.875038+0.004045
[1]	cv-test-auc:0.867031+0.026898	cv-train-auc:0.886307+0.004904
[2]	cv-test-auc:0.867928+0.028468	cv-train-auc:0.889364+0.007391
[3]	cv-test-auc:0.867096+0.029492	cv-train-auc:0.893532+0.009476
[4]	cv-test-auc:0.865218+0.031522	cv-train-auc:0.898663+0.013506

注意，如果这里的评价指标使用AUC的化，需要list化，也就是输入["auc"]，而非"auc"，否则报错。

cv_results是一个list，就是说我母鸡如何导出？

output = cv_results.predict(test).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032402.csv"),index=False)

历史集成算法¶

这里把这两个模型进行一个简单的决策树、投票、平均等等，也就是形成了集成算法。

In [281]:

import joblib 
md1 = joblib.load("titanic_model_18032201")
md2 = joblib.load("titanic_model_18032301")
md3 = joblib.load("titanic_model_18032401")

批量进行函数操作¶

In [282]:

import os
combined = get_combined_data()
get_title()
process_age()
process_names()
process_fares()
process_embarked()
process_cabin()
process_sex()
process_pclass()
process_ticket()
process_family()
train, test, targets = recover_train_test_target()

逻辑回归¶

In [283]:

from sklearn.linear_model import LogisticRegression

In [284]:

y1tr = md1.predict_proba(train.values)
y2tr = md2.predict_proba(train)
y3tr = md2.predict_proba(train)
y1te = md1.predict_proba(test.values)
y2te = md2.predict_proba(test)
y3te = md2.predict_proba(test)

In [285]:

X_tr = pd.DataFrame({"y1":y1tr[:,0],"y2":y2tr[:,0],"y3":y3tr[:,0]})
X_te = pd.DataFrame({"y1":y1te[:,0],"y2":y2te[:,0],"y3":y3te[:,0]})

In [286]:

from pandas_ply import install_ply, X, sym_call
install_ply(pd)

In [287]:

X_tr = X_tr.ply_select('*',
                y1sq = X.y1**2,
                y2sq = X.y2**2,
                y3sq = X.y3**2,
                intsc1 = X.y1*X.y2,
                intsc2 = X.y1*X.y3,
                intsc3 = X.y2*X.y3,                      
                intsc4 = X.y1*X.y2*X.y3                   
                      )
X_te = X_te.ply_select('*',
                y1sq = X.y1**2,
                y2sq = X.y2**2,
                y3sq = X.y3**2,
                intsc1 = X.y1*X.y2,
                intsc2 = X.y1*X.y3,
                intsc3 = X.y2*X.y3,                      
                intsc4 = X.y1*X.y2*X.y3                   
                      )

In [288]:

lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(X_tr.values, targets)

Out[288]:

LogisticRegression(C=1000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=0,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [289]:

lr.coef_

Out[289]:

array([[ -0.24902198, -10.1045079 , -10.1045079 ,   8.72614486,
          5.66796023,  -3.71285182,   5.66796023,  -3.71285182,
         -8.10302608,   5.66796023]])

这个地方不要抱怨，sklearn还真的不反馈p值，这里就不管了，大概可以判断，决策树的$\beta$很低很挫。

这个地方注意是[:,1]不是[:,0]，别弄反了。

cutoff = 0.5¶

In [290]:

y_hat = lr.predict_proba(X_te)[:,1]
output = pd.Series((y_hat > 0.5)).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032305.csv"),index=False)

Your submission scored 0.77990, which is not an improvement of your best score. Keep trying!

加入xgboos后，分数掉了。

Your submission scored 0.75119, which is not an improvement of your best score. Keep trying!

发现，是否加交叉项目，影响不大。

cutoff = median()¶

In [291]:

y_hat = lr.predict_proba(X_te)[:,1]
output = pd.Series((y_hat > pd.Series(y_hat).median())).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032306.csv"),index=False)

Your submission scored 0.73684, which is not an improvement of your best score. Keep trying! 所以中位数不能取用。

Your submission scored 0.72248, which is not an improvement of your best score. Keep trying! 所以我说最后一层用逻辑回归，把xgboost放进来，基本上是要gg的。

cutoff = mean()¶

In [292]:

y_hat = lr.predict_proba(X_te)[:,1]
output = pd.Series((y_hat > pd.Series(y_hat).mean())).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032307.csv"),index=False)

Your submission scored 0.78468, which is not an improvement of your best score. Keep trying!

Your submission scored 0.74641, which is not an improvement of your best score. Keep trying! xgboost就是不能放在前面！！！

In [293]:

lr.score(X_tr.values, targets)

Out[293]:

0.8630751964085297

一般我们不采用训练集的acc，是因为这个值也是过拟合的指标。

均值法¶

In [294]:

y1 = md1.predict_proba(test.values)
y2 = md2.predict_proba(test)

In [295]:

y_hat = (y1[:,1]+y2[:,1])/2
output = pd.Series((y_hat > 0.5)).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032304.csv"),index=False)

Your submission scored 0.78468, which is not an improvement of your best score. Keep trying!

求平均居然比逻辑回归好。

目录

前言¶

数据下载¶

特征工程¶

录入数据¶

titles¶

Age¶

分类和连续变量describe¶

Names¶

Fare¶

Embarked¶

Cabin¶

Sex¶

Pclass¶

Ticket¶

Family¶

Define metrics¶

split funtion¶

Modeling¶

决策树¶

模型保存¶

超参数调整简单介绍¶

随机森林¶

批量进行函数操作¶

importances图¶

Hyperparameters tuning¶

XGBoost¶

安装包¶

建立train和validation组¶

简单fit¶

预测¶

保存模型¶

展示决策树图¶

安装包graphviz¶

展示¶

cross-validation¶

历史集成算法¶

批量进行函数操作¶

逻辑回归¶

cutoff = 0.5¶

cutoff = median()¶

cutoff = mean()¶

均值法¶

安装包`graphviz`¶