This post covers two models, the decision tree and the random forest; the random forest part also touches on hyperparameter tuning, i.e. the grid-search idea. Before training the models, feature engineering is applied to the variables, organized in a function-oriented way.
The decision tree is introduced first because it is simple; for a tutorial, it matters that the code runs end to end on the first try.
The decision tree part follows Hugo Bowne-Anderson, January 3rd, 2018, Kaggle Tutorial: Your First Machine Learning Model, https://www.datacamp.com/community/tutorials/kaggle-tutorial-machine-learning
The random forest part follows Ahmed Besbes, March 10th, 2016, How to score 0.8134 in Titanic Kaggle Challenge, https://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html (not sure of his nationality).
This post is not rendered with the blogdown package: Jupyter and RStudio feel like competing ecosystems that don't really support each other, so I just wrote it directly in Jupyter.
kaggle
Commands (kaggle.json is the API token downloaded from your Kaggle account page; the CLI expects it under ~/.kaggle):
pip install kaggle
kaggle competitions download -c titanic
$ mv kaggle.json /Users/JiaxiangLi/.kaggle
$ cd /Users/JiaxiangLi/.kaggle
$ ls
kaggle.json
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
}
</style>
""")
# remove warnings
import warnings
warnings.filterwarnings('ignore')
# ---
%matplotlib inline
import pandas as pd
pd.options.display.max_columns = 100
from matplotlib import pyplot as plt
import matplotlib
matplotlib.style.use('ggplot')
import numpy as np
pd.options.display.max_rows = 100
These settings, such as capping the display at 100 rows and 100 columns, are very useful.
import os
import pandas as pd
from dplython import (DplyFrame, X, diamonds, select, sift,
sample_n, sample_frac, head, arrange, mutate, group_by,
summarize, DelayFunction)
os.getcwd()
def get_combined_data():
    file_path = os.path.join(os.getcwd(), "titanic")
    train = pd.read_csv(os.path.join(file_path, "train.csv"))
    test = pd.read_csv(os.path.join(file_path, "test.csv"))
    target = train.Survived
    train = train.drop("Survived", axis=1)
    # stack train on top of test and rebuild a clean 0..n-1 index
    combined = pd.concat([train, test], ignore_index=True)
    combined.drop("PassengerId", axis=1, inplace=True)
    return combined
As long as the .ipynb file and the data keep these relative paths, the document can be reused. Note that the function ends with return combined, so we assign its result to combined: we merge the train and test data and move on to feature engineering.
combined = get_combined_data()
print(combined.shape)
row, col = combined.shape
print(row, col)
This returns the dimensions of the dataframe — just a quick look.
combined["Title"] = combined.Name.map(lambda name:name.split(',')[1].split('.')[0].strip())
This takes the characters between the comma and the period in each name and strips the surrounding spaces. A replace-based approach is fiddly, so map is used directly, following the R way of thinking, though it operates on a single column here. The raw titles are then grouped into broader categories with a dictionary (a small worked example follows it):
Title_Dictionary = {
"Capt": "Officer",
"Col": "Officer",
"Major": "Officer",
"Jonkheer": "Royalty",
"Don": "Royalty",
"Sir" : "Royalty",
"Dr": "Officer",
"Rev": "Officer",
"the Countess":"Royalty",
"Dona": "Royalty",
"Mme": "Mrs",
"Mlle": "Miss",
"Ms": "Mrs",
"Mr" : "Mr",
"Mrs" : "Mrs",
"Miss" : "Miss",
"Master" : "Master",
"Lady" : "Royalty"
}
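As a quick sanity check, this is how the extraction and the dictionary behave on a couple of made-up names (the names below are hypothetical, not rows from the data):
# hypothetical examples of the split-and-map logic
for name in ["Doe, Mr. John", "Smith, Mme. Jane"]:
    raw_title = name.split(',')[1].split('.')[0].strip()
    print(raw_title, "->", Title_Dictionary.get(raw_title))
# prints: Mr -> Mr, then Mme -> Mrs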
def get_title():
    global combined
    combined["Title"] = combined.Name.map(lambda name: name.split(',')[1].split('.')[0].strip())
    Title_Dictionary = {
        "Capt": "Officer",
        "Col": "Officer",
        "Major": "Officer",
        "Jonkheer": "Royalty",
        "Don": "Royalty",
        "Sir": "Royalty",
        "Dr": "Officer",
        "Rev": "Officer",
        "the Countess": "Royalty",
        "Dona": "Royalty",
        "Mme": "Mrs",
        "Mlle": "Miss",
        "Ms": "Mrs",
        "Mr": "Mr",
        "Mrs": "Mrs",
        "Miss": "Miss",
        "Master": "Master",
        "Lady": "Royalty"
    }
    combined.Title = combined.Title.map(Title_Dictionary)
Wrapping this into a single function is the way to go: it modularizes the feature engineering, so that when you train models later you no longer have to think about these steps, which saves a lot of time.
get_title()
The map function here plays a role similar to the purrr package in R.
Age clearly differs by Title, Sex and Pclass, so filling with an overall mean or median would be inappropriate: either group by the three variables, or fit a regression. Keeping it simple, we go with the former. Let's have a look, splitting into the train and test groups.
combined[:891].groupby(["Pclass", "Sex", "Title"]).median(numeric_only=True)
Even a quick look shows the differences are striking, so the groups must be treated separately. Building this with layer upon layer of if-else conditions would be exhausting, so instead we do it in one step with a left join.
def process_age():
    global combined
    # group medians computed on the training rows only, then left-joined back
    age_median = combined[:891].groupby(["Pclass", "Sex", "Title"]).median(numeric_only=True).reset_index()[["Pclass", "Sex", "Age", "Title"]]
    combined2 = pd.merge(combined, age_median, how="left", on=["Pclass", "Sex", "Title"])
    combined.Age = combined2.Age_x.fillna(combined2.Age_y)
    return combined
process_age()
combined.dtypes.value_counts()
combined.select_dtypes(include=["int64","float64"]).head(n=1)
combined.select_dtypes(include=["object"]).nunique()
Name is mostly noise; the treatment here is simply to drop it, since we already have Title — if that turns out badly we can revisit it. Title, however, needs to be one-hot encoded.
combined.Title.unique()
drop_first is necessary; otherwise the dummy columns are perfectly collinear (the dummy-variable trap).
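To see why, here is a minimal sketch on a toy column (toy data, not from the dataset): with all k dummy columns every row sums to 1, which is perfectly collinear with a model intercept; dropping one level removes the redundancy.
toy = pd.Series(["Mr", "Mrs", "Miss", "Mr"], name="Title")
print(pd.get_dummies(toy, prefix="Title"))                   # 3 columns, each row sums to 1
print(pd.get_dummies(toy, prefix="Title", drop_first=True))  # 2 columns, no redundant level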
pd.get_dummies(combined.Title,prefix='Title',drop_first=True).head()
#combined.drop(["Name","Title"])
Using iloc for positional indexing feels indirect and clumsy; dplyr beats it by miles.
def process_names():
    global combined
    combined = pd.concat([combined.drop("Name", axis=1).drop("Title", axis=1),
                          pd.get_dummies(combined.Title, prefix='Title', drop_first=True)], axis=1)
Remember to pass axis=1 here.
process_names()
Fare has very few missing values, so the mean fills them in.
combined.info()
Now only Fare, Embarked and Cabin remain to be handled.
def process_fares():
    global combined
    combined.Fare.fillna(combined.Fare.mean(), inplace=True)
process_fares()
combined.info()
Word on the street is that filling this missing value with S gives noticeably better accuracy.
def process_embarked():
    global combined
    combined.Embarked.fillna("S", inplace=True)
    embarked_dummies = pd.get_dummies(combined.Embarked, prefix="Embarked", drop_first=True)
    combined = pd.concat([combined, embarked_dummies], axis=1)
    combined.drop("Embarked", inplace=True, axis=1)
process_embarked()
combined.info()
Here missing cabins are filled with U for unknown; for all the others we just take the first letter — very crude.
combined.Cabin.head()
def process_cabin():
    global combined
    combined.Cabin.fillna('U', inplace=True)
    # keep only the deck letter
    combined["Cabin"] = combined.Cabin.map(lambda c: c[0])
    cabin_dummies = pd.get_dummies(combined.Cabin, prefix="Cabin", drop_first=True)
    combined = pd.concat([combined, cabin_dummies], axis=1)
    combined.drop("Cabin", axis=1, inplace=True)
process_cabin()
combined.filter(regex="^Cabin")
def process_sex():
    global combined
    # map string values to numerical ones
    combined['Sex'] = combined['Sex'].map({'male': 1, 'female': 0})
process_sex()
Note that Pclass also needs to be one-hot encoded.
def process_pclass():
    global combined
    pclass_dummies = pd.get_dummies(combined.Pclass, prefix="Pclass", drop_first=True)
    combined = pd.concat([combined, pclass_dummies], axis=1)
    combined.drop("Pclass", axis=1, inplace=True)
process_pclass()
combined.head()
combined.Ticket.head()
This part is too tedious, so I copied a function.
def process_ticket():
    global combined

    # extract the prefix of each ticket; return 'XXX' if there is no prefix (i.e. the ticket is just digits)
    def cleanTicket(ticket):
        ticket = ticket.replace('.', '')
        ticket = ticket.replace('/', '')
        ticket = ticket.split()
        ticket = [t.strip() for t in ticket]
        ticket = [t for t in ticket if not t.isdigit()]
        if len(ticket) > 0:
            return ticket[0]
        else:
            return 'XXX'

    # extract dummy variables from the ticket prefixes
    combined['Ticket'] = combined['Ticket'].map(cleanTicket)
    tickets_dummies = pd.get_dummies(combined['Ticket'], prefix='Ticket')
    combined = pd.concat([combined, tickets_dummies], axis=1)
    combined.drop('Ticket', inplace=True, axis=1)
process_ticket()
def process_family():
    global combined
    # a new feature: family size (including the passenger)
    combined['FamilySize'] = combined['Parch'] + combined['SibSp'] + 1
    # derived features based on family size
    combined['Singleton'] = combined['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    combined['SmallFamily'] = combined['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    combined['LargeFamily'] = combined['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
process_family()
from sklearn.model_selection import cross_val_score

def compute_score(clf, X, y, scoring='accuracy'):
    # 5-fold cross-validated score, averaged
    xval = cross_val_score(clf, X, y, cv=5, scoring=scoring)
    return np.mean(xval)
def recover_train_test_target():
    global combined
    file_path = os.path.join(os.getcwd(), "titanic")
    train0 = pd.read_csv(os.path.join(file_path, "train.csv"))
    targets = train0.Survived
    train = combined.head(891)
    test = combined.iloc[891:]
    return train, test, targets
The concatenation order is known — the split point is at index = 891 — so now we split the data back apart.
# Import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Figures inline and set visualization style
%matplotlib inline
sns.set()
import os
combined = get_combined_data()
get_title()
process_age()
process_names()
process_fares()
process_embarked()
process_cabin()
process_sex()
process_pclass()
process_ticket()
process_family()
This is exactly why the feature engineering should be wrapped in functions: even if we modify combined here, it doesn't matter — just re-run the code above. In practice, working in PyCharm, we would put the feature engineering in a .py file and simply import it. That way the preprocessing is guaranteed to be identical for both datasets.
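A minimal sketch of how such a wrapper might look — build_features is a hypothetical name, not something defined elsewhere in this post; in a features.py module it (plus the process_* functions above) would be all you import:
def build_features():
    # hypothetical one-call wrapper around the feature-engineering pipeline above
    global combined
    combined = get_combined_data()
    get_title()
    process_age()
    process_names()
    process_fares()
    process_embarked()
    process_cabin()
    process_sex()
    process_pclass()
    process_ticket()
    process_family()
    return combined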
combined.info()
train, test, targets = recover_train_test_target()
Note that the decision tree here is a little picky and wants .values: the sklearn estimators expect numpy-array-style input, so we pull out .values.
X = train.values
y = targets
X_test = test.values
The decision tree here is grown level-wise (balanced) rather than leaf-wise, and we fix max_depth = 3. No tuning is involved yet, so don't worry about why 3 — it's a rule of thumb; we will compare values properly when tuning hyperparameters.
clf = tree.DecisionTreeClassifier(max_depth=3)
print(clf.fit(X, y))
Many of the printed parameters will look unfamiliar — that's fine, they are the library's defaults.
Last step: call .predict().
output = clf.predict(X_test).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032302.csv"),index=False)
All of these output files are saved; at the end they will be fed into the ensemble.
Install it with pip install joblib.
import joblib
joblib.dump(clf, "titanic_model_18032201", compress=9)
Now try loading the model back.
clf2 = joblib.load("titanic_model_18032201")
print(clf2.predict(X_test)[:5])
The rest of this part covers a bit of decision-tree theory and visualization — since the model is this simple, we should at least be able to explain it to others in very plain terms.
max_depth is what controls overfitting; we can compare the cases where it is $\neq 3$. We take the train set and do a random split to test this.
train_test_split here returns a list.
print(type(train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)))
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42, stratify=y)
dep = np.arange(1,9)
np.arange differs from range in that it returns an array.
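A quick illustration of the difference:
print(type(range(1, 9)))      # <class 'range'> in Python 3
print(type(np.arange(1, 9)))  # <class 'numpy.ndarray'>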
Set up arrays of the same length to record the accuracies.
train_accuracy = np.empty(len(dep))
test_accuracy = np.empty(len(dep))
print(train_accuracy, test_accuracy)
for i, k in enumerate(dep):
    print(i, k)
Here i conveniently serves as the index, and k is the enumerated max_depth value.
for i, k in enumerate(dep):
    clf = tree.DecisionTreeClassifier(max_depth=k)
    clf.fit(X_train, y_train)
    train_accuracy[i] = clf.score(X_train, y_train)
    test_accuracy[i] = clf.score(X_test, y_test)
print(train_accuracy, test_accuracy)
plt.title(u"训练集和测试集的Acc比较")
plt.plot(dep,train_accuracy,label = u"训练集Acc")
plt.plot(dep,test_accuracy,label = u"测试集Acc")
plt.legend()
plt.xlabel(u"level-wise决策树深度选择")
plt.ylabel("Acc")
plt.show()
Note that to display the Chinese labels, each string needs the u prefix.
Looking at the plot, Acc is highest when max_depth is 3 or 7, so we pick 3. Why not 7? Experience — the deeper tree is more likely to be overfitting.
Next, on to the random forest: a single decision tree without bagging is bound to run into trouble sooner or later.
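As an aside, here is a minimal sketch of the bagging idea itself, wrapping the same depth-3 tree in sklearn's BaggingClassifier (illustration only, not part of the original pipeline):
from sklearn.ensemble import BaggingClassifier
# fit 50 depth-3 trees, each on a bootstrap resample, and average their votes
bagged = BaggingClassifier(tree.DecisionTreeClassifier(max_depth=3), n_estimators=50)
bagged.fit(X_train, y_train)
print(bagged.score(X_test, y_test))
Averaging many high-variance trees is essentially what the random forest below does, with random feature subsetting at each split added on top.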
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
At this point it is worth summing up the approach. Training a model is a fairly rule-driven, mechanical process, so the many pieces should be modularized. When you re-import the data to train a model, you then only face a handful of functions, which makes everything reusable; later, when iterating on models, feature engineering and model selection stay separate, which saves time. Each time you try a new model you don't have to redo the variable processing; if there is nothing to add to the feature engineering you just call the functions, and when you do add features the existing ones are not thrown away. It looks a bit clumsy at first, but afterwards everything is very clear.
This is also one of the biggest differences between R and Python: in R the preprocessing is a one-liner with the tidyverse, but reusing it is a hassle. So design the functions well, and you won't have to shuffle code across several .Rmd files, which is exhausting. Train this function-oriented way of thinking.
import os
combined = get_combined_data()
get_title()
process_age()
process_names()
process_fares()
process_embarked()
process_cabin()
process_sex()
process_pclass()
process_ticket()
process_family()
train, test, targets = recover_train_test_target()
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
clf = RandomForestClassifier(n_estimators=50, max_features='sqrt')
clf = clf.fit(train, targets)
features = pd.DataFrame()
features['feature'] = train.columns
features['importance'] = clf.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)
features.plot(kind='barh', figsize=(20, 20))
This chart can be used for variable selection; the function below effectively does that job for us. You could also select by hand, keeping the variables that account for, say, 99% of the importance.
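For the manual route, a sketch of keeping the top features until their cumulative importance reaches 99% (the variable names below are mine, and 0.99 is the rule-of-thumb threshold mentioned above):
imp = pd.Series(clf.feature_importances_, index=train.columns).sort_values(ascending=False)
keep = imp[imp.cumsum() <= 0.99].index   # importances sum to 1, so this covers ~99% of the total
train_manual = train[keep]
print(len(keep), "features kept")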
model = SelectFromModel(clf, prefit=True)
train_reduced = model.transform(train)
train_reduced.shape
It directly selects 13 variables.
test_reduced = model.transform(test)
test_reduced.shape
I've decided to leave this document alone and export it straight to PDF, with knitr just adding a few notes, rather than pasting markdown back and forth — that wastes time better spent learning algorithms.
# turn run_gs to True if you want to run the gridsearch again.
run_gs = False

if run_gs:
    parameter_grid = {
        'max_depth': [4, 6, 8],
        'n_estimators': [50, 10],
        'max_features': ['sqrt', 'auto', 'log2'],
        'min_samples_split': [2, 3, 10],
        'min_samples_leaf': [1, 3, 10],
        'bootstrap': [True, False],
    }
    forest = RandomForestClassifier()
    cross_validation = StratifiedKFold(n_splits=5)
    grid_search = GridSearchCV(forest,
                               scoring='accuracy',
                               param_grid=parameter_grid,
                               cv=cross_validation)
    grid_search.fit(train, targets)
    model = grid_search
    parameters = grid_search.best_params_
    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
else:
    parameters = {'bootstrap': False, 'min_samples_leaf': 3, 'n_estimators': 50,
                  'min_samples_split': 10, 'max_features': 'sqrt', 'max_depth': 6}
    model = RandomForestClassifier(**parameters)
    model.fit(train, targets)
compute_score(model, train, targets, scoring='accuracy')
global combined
file_path = os.path.join(os.getcwd(),"titanic")
train0 = pd.read_csv(os.path.join(file_path,"train.csv"))
output = model.predict(test).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032303.csv"),index=False)
import joblib
joblib.dump(model, "titanic_model_18032301", compress=9)
conda remove xgboost
conda install -c aterrel xgboost=0.4.0
import xgboost as xgb
DMLC provides a cross-platform xgboost tutorial; the code here mainly follows that site, and the R version can be consulted as well.
import os
combined = get_combined_data()
get_title()
process_age()
process_names()
process_fares()
process_embarked()
process_cabin()
process_sex()
process_pclass()
process_ticket()
process_family()
train, test, targets = recover_train_test_target()
Because xgboost can take a watchlist, we split the original train data (a sketch that actually passes a watchlist follows the next cell).
X_train,X_test,y_train,y_test= train_test_split(train, targets, test_size=0.2, random_state=123)
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
xg_cl.fit(X_train,y_train)
preds = xg_cl.predict(X_test)
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))
output = xg_cl.predict(test).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032401.csv"),index=False)
Your submission scored 0.78468, which is not an improvement of your best score. Keep trying! It feels like this should beat all the earlier models; let's see what cross-validation says.
# xg_cl.dump_model("titanic_model_18032401.txt","titanic_model_18032401_featmap.txt")
Neither .save_model() nor .dump_model() works here directly on the classifier — not sure what the bug is.
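A likely reason is that save_model and dump_model live on the underlying Booster rather than on the sklearn wrapper in that xgboost version. A sketch via get_booster(), which newer xgboost versions expose (the file names here are hypothetical):
booster = xg_cl.get_booster()                           # older versions: xg_cl._Booster
booster.save_model("titanic_model_18032401.bin")        # binary model file
booster.dump_model("titanic_model_18032401_dump.txt")   # human-readable text dump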
import joblib
joblib.dump(xg_cl, "titanic_model_18032401", compress=9)
graphviz
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
This command installs Homebrew itself; graphviz can then be installed with it. Sometimes the dependency fails to download — retry a few times and be patient.
from numpy import loadtxt
import matplotlib.pyplot as plt
xgb.plot_tree(xg_cl._Booster, num_trees=0)
plt.show()
This shows the first tree of the boosting ensemble. It looks awful — R's version is far nicer.
xgboost's native API wants the data in its own optimized DMatrix format (which also handles sparse input), so xgb.DMatrix is used for the conversion.
# Create the DMatrix: churn_dmatrix
dmatrix = xgb.DMatrix(data=train, label=targets)
# Create the parameter dictionary: params
params = {"objective":"binary:logistic", "max_depth":3}
# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=dmatrix, params=params, nfold=3, num_boost_round=5, metrics=["auc"], seed=123)
# Print cv_results
print(cv_results)
Note that if AUC is used as the metric here, it has to be wrapped in a list, i.e. ["auc"] rather than "auc", otherwise it errors. Also, cv_results is not a fitted model — it is a pandas DataFrame holding the per-round evaluation history — so it cannot produce predictions by itself.
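Since it is just a DataFrame of per-round metrics, the usual thing to read off is the last row; the column name below follows xgboost's standard naming for the metric chosen above (the exact label is an assumption):
# mean test AUC after the final boosting round
print(cv_results["test-auc-mean"].iloc[-1])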
# xgb.cv only returns the evaluation history, so train a booster with the same
# parameters to generate predictions for the submission file.
bst = xgb.train(params=params, dtrain=dmatrix, num_boost_round=5)
output = (bst.predict(xgb.DMatrix(data=test)) > 0.5).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(), "titanic")
aux = pd.read_csv(os.path.join(file_path, "test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path, "titanic_18032402.csv"), index=False)
Here the saved models are combined with a simple second-level step — a decision tree, voting, averaging, and so on — in other words, an ensemble.
import joblib
md1 = joblib.load("titanic_model_18032201")
md2 = joblib.load("titanic_model_18032301")
md3 = joblib.load("titanic_model_18032401")
import os
combined = get_combined_data()
get_title()
process_age()
process_names()
process_fares()
process_embarked()
process_cabin()
process_sex()
process_pclass()
process_ticket()
process_family()
train, test, targets = recover_train_test_target()
from sklearn.linear_model import LogisticRegression
y1tr = md1.predict_proba(train.values)
y2tr = md2.predict_proba(train)
y3tr = md3.predict_proba(train)
y1te = md1.predict_proba(test.values)
y2te = md2.predict_proba(test)
y3te = md3.predict_proba(test)
X_tr = pd.DataFrame({"y1":y1tr[:,0],"y2":y2tr[:,0],"y3":y3tr[:,0]})
X_te = pd.DataFrame({"y1":y1te[:,0],"y2":y2te[:,0],"y3":y3te[:,0]})
from pandas_ply import install_ply, X, sym_call
install_ply(pd)
X_tr = X_tr.ply_select('*',
y1sq = X.y1**2,
y2sq = X.y2**2,
y3sq = X.y3**2,
intsc1 = X.y1*X.y2,
intsc2 = X.y1*X.y3,
intsc3 = X.y2*X.y3,
intsc4 = X.y1*X.y2*X.y3
)
X_te = X_te.ply_select('*',
y1sq = X.y1**2,
y2sq = X.y2**2,
y3sq = X.y3**2,
intsc1 = X.y1*X.y2,
intsc2 = X.y1*X.y3,
intsc3 = X.y2*X.y3,
intsc4 = X.y1*X.y2*X.y3
)
lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(X_tr.values, targets)
lr.coef_
Don't complain here — sklearn really doesn't report p-values, so let it go. Roughly speaking, you can still tell that the decision tree's $\beta$ is low and weak.
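If you really do want p-values for this stacking layer, one option (not used in this post) is statsmodels, which fits an unregularized logit on the same meta-features and reports coefficient significance; note it may complain about collinearity given the squared and interaction terms:
import statsmodels.api as sm
# .summary() includes coefficients, standard errors and p-values
logit = sm.Logit(targets, sm.add_constant(X_tr)).fit()
print(logit.summary())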
Note that it is [:,1] here, not [:,0] — don't get them backwards.
y_hat = lr.predict_proba(X_te)[:,1]
output = pd.Series((y_hat > 0.5)).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032305.csv"),index=False)
Your submission scored 0.77990, which is not an improvement of your best score. Keep trying!
After adding xgboost, the score dropped.
Your submission scored 0.75119, which is not an improvement of your best score. Keep trying!
It turns out that whether or not the interaction terms are included makes little difference.
y_hat = lr.predict_proba(X_te)[:,1]
output = pd.Series((y_hat > pd.Series(y_hat).median())).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032306.csv"),index=False)
Your submission scored 0.73684, which is not an improvement of your best score. Keep trying! So the median threshold should not be used.
Your submission scored 0.72248, which is not an improvement of your best score. Keep trying! So, as I said, with logistic regression as the final layer, bringing xgboost in basically wrecks it.
y_hat = lr.predict_proba(X_te)[:,1]
output = pd.Series((y_hat > pd.Series(y_hat).mean())).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032307.csv"),index=False)
Your submission scored 0.78468, which is not an improvement of your best score. Keep trying!
Your submission scored 0.74641, which is not an improvement of your best score. Keep trying! xgboost就是不能放在前面!!!
lr.score(X_tr.values, targets)
We generally don't rely on training-set accuracy, since it is inflated by overfitting.
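A less optimistic check is the cross-validated accuracy of the same stacking model, reusing the compute_score helper defined earlier:
# 5-fold CV accuracy of the logistic-regression stacker on the meta-features
print(compute_score(lr, X_tr.values, targets, scoring='accuracy'))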
y1 = md1.predict_proba(test.values)
y2 = md2.predict_proba(test)
y_hat = (y1[:,1]+y2[:,1])/2
output = pd.Series((y_hat > 0.5)).astype(int)
df_output = pd.DataFrame()
file_path = os.path.join(os.getcwd(),"titanic")
aux = pd.read_csv(os.path.join(file_path,"test.csv"))
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv(os.path.join(file_path,"titanic_18032304.csv"),index=False)
Your submission scored 0.78468, which is not an improvement of your best score. Keep trying!
Simple averaging actually beats the logistic regression.