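{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Ensemble Learning Case Study 1 (Happiness Prediction)\n", "\n", "### Background\n", "\n", "This case study is a baseline for a data-mining competition on happiness prediction. The data come from the official Chinese General Social Survey (CGSS) results and contain 139 feature dimensions, covering individual variables (gender, age, region, occupation, health, marital status, political affiliation, and so on), family variables (parents, spouse, children, family capital, and so on), and social attitudes (fairness, trust, public services).\n", "\n", "\n", "### Data\n", "The task is to predict personal happiness from the **139** features above using a little over **8,000** samples (the target takes the values 1, 2, 3, 4, 5, where 1 is the lowest happiness and 5 the highest).\n", "Because there are many variables and some of them are related in complex ways, the data are provided in a complete version and an abridged version. One can start with the abridged version to get familiar with the task and then mine the complete version for more information. Here I use the complete version directly. The competition also provides an index file listing the questionnaire item behind each variable and the meaning of its values, and a survey file containing the original questionnaire as supplementary background.\n", "\n", "### Evaluation metric\n", "The final evaluation metric is the mean squared error (MSE):\n", "$$Score = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - y_i^*)^2$$\n", "where $y_i$ is the true label of sample $i$, $y_i^*$ is its predicted value, and $n$ is the number of samples.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import time \n", "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.svm import SVC, LinearSVC\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.linear_model import Perceptron\n", "from sklearn.linear_model import SGDClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn import metrics\n", "from datetime import datetime\n", "import matplotlib.pyplot as plt\n", "from sklearn.metrics import roc_auc_score, roc_curve, mean_squared_error,mean_absolute_error, f1_score\n", "import lightgbm as lgb\n", "import xgboost as xgb\n", "from sklearn.ensemble import RandomForestRegressor as rfr\n", "from sklearn.ensemble import ExtraTreesRegressor as etr\n", "from sklearn.linear_model import BayesianRidge as br\n", "from sklearn.ensemble import GradientBoostingRegressor as gbr\n", "from sklearn.linear_model import Ridge\n", "from sklearn.linear_model import Lasso\n", "from sklearn.linear_model import 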
LinearRegression as lr\n", "from sklearn.linear_model import ElasticNet as en\n", "from sklearn.kernel_ridge import KernelRidge as kr\n", "from sklearn.model_selection import KFold, StratifiedKFold,GroupKFold, RepeatedKFold\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn import preprocessing\n", "import logging\n", "import warnings\n", "\n", "warnings.filterwarnings('ignore') #suppress warnings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the dataset" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "train = pd.read_csv(\"train.csv\", parse_dates=['survey_time'],encoding='latin-1') \n", "test = pd.read_csv(\"test.csv\", parse_dates=['survey_time'],encoding='latin-1') #latin-1 is backward compatible with ASCII\n", "train = train[train[\"happiness\"]!=-8].reset_index(drop=True) #drop rows where \"happiness\" is -8\n", "train_data_copy = train.copy()\n", "target_col = \"happiness\" #target column\n", "target = train_data_copy[target_col]\n", "del train_data_copy[target_col] #remove the target column\n", "\n", "data = pd.concat([train_data_copy,test],axis=0,ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basic information about the data" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 7988.000000\n", "mean 3.867927\n", "std 0.818717\n", "min 1.000000\n", "25% 4.000000\n", "50% 4.000000\n", "75% 4.000000\n", "max 5.000000\n", "Name: happiness, dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.happiness.describe() #summary statistics of the target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data preprocessing\n", "\n", "First, the negative values that appear throughout the data need to be handled. The only negative values in the data are -1, -2, -3 and -8, so each of them is counted separately. The code is as follows:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "#add 5 features\n", "#the csv contains negative values (-1, -2, -3, -8); treat them as problematic entries, but do not delete them\n", "def getres1(row):\n", "    return 
len([x for x in row.values if type(x)==int and x<0])\n", "\n", "def getres2(row):\n", "    return len([x for x in row.values if type(x)==int and x==-8])\n", "\n", "def getres3(row):\n", "    return len([x for x in row.values if type(x)==int and x==-1])\n", "\n", "def getres4(row):\n", "    return len([x for x in row.values if type(x)==int and x==-2])\n", "\n", "def getres5(row):\n", "    return len([x for x in row.values if type(x)==int and x==-3])\n", "\n", "#inspect the data: count the negative values in each row\n", "data['neg1'] = data[data.columns].apply(lambda row:getres1(row),axis=1)\n", "data.loc[data['neg1']>20,'neg1'] = 20 #cap at 20 (smoothing)\n", "\n", "data['neg2'] = data[data.columns].apply(lambda row:getres2(row),axis=1)\n", "data['neg3'] = data[data.columns].apply(lambda row:getres3(row),axis=1)\n", "data['neg4'] = data[data.columns].apply(lambda row:getres4(row),axis=1)\n", "data['neg5'] = data[data.columns].apply(lambda row:getres5(row),axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, missing values are filled with fillna(value), where value is chosen case by case. For example, most missing entries are treated as zero, the number of family members is treated as 1, and household income is set to 66365, the mean income over all households. Part of the code is as follows:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "#fill missing values: 25 columns have missing values, 4 are dropped and 21 are filled\n", "#the columns below all contain missing values; fill them case by case\n", "data['work_status'] = data['work_status'].fillna(0)\n", "data['work_yr'] = data['work_yr'].fillna(0)\n", "data['work_manage'] = data['work_manage'].fillna(0)\n", "data['work_type'] = data['work_type'].fillna(0)\n", "\n", "data['edu_yr'] = data['edu_yr'].fillna(0)\n", "data['edu_status'] = data['edu_status'].fillna(0)\n", "\n", "data['s_work_type'] = data['s_work_type'].fillna(0)\n", "data['s_work_status'] = data['s_work_status'].fillna(0)\n", "data['s_political'] = data['s_political'].fillna(0)\n", "data['s_hukou'] = data['s_hukou'].fillna(0)\n", "data['s_income'] = data['s_income'].fillna(0)\n", "data['s_birth'] = data['s_birth'].fillna(0)\n", "data['s_edu'] = data['s_edu'].fillna(0)\n", "data['s_work_exper'] = data['s_work_exper'].fillna(0)\n", "\n", 
"data['minor_child'] = data['minor_child'].fillna(0)\n", "data['marital_now'] = data['marital_now'].fillna(0)\n", "data['marital_1st'] = data['marital_1st'].fillna(0)\n", "data['social_neighbor']=data['social_neighbor'].fillna(0)\n", "data['social_friend']=data['social_friend'].fillna(0)\n", "data['hukou_loc']=data['hukou_loc'].fillna(1) #minimum is 1 (a registered hukou location)\n", "data['family_income']=data['family_income'].fillna(66365) #mean after removing problematic values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition, information in special formats needs separate handling, for example time-related information. This is done in two parts. First, the actual age is computed: the spreadsheet only contains the birth year and the survey time, so each respondent's age is derived from these. Second, the continuous age is binned into age groups, here 6 intervals. The code is as follows:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [], "source": [ "#144+1 =145\n", "#continue with columns that need special handling\n", "#see happiness_index.xlsx\n", "data['survey_time'] = pd.to_datetime(data['survey_time'], format='%Y-%m-%d',errors='coerce')#errors='coerce' avoids failures on inconsistent date formats\n", "data['survey_time'] = data['survey_time'].dt.year #keep only the year, which makes the age easy to compute\n", "data['age'] = data['survey_time']-data['birth']\n", "# print(data['age'],data['survey_time'],data['birth'])\n", "#age bins 145+1=146\n", "bins = [0,17,26,34,50,63,100]\n", "data['age_bin'] = pd.cut(data['age'], bins, labels=[0,1,2,3,4,5]) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way to repair negative values, and the one I used most in this step, is to rely on everyday common sense. For example, a negative value for religion is treated as “not religious”, and the frequency of attending religious activities is set to 1, meaning never attended. These values are filled in subjectively, as if I were completing the questionnaire myself." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "#religion\n", "data.loc[data['religion']<0,'religion'] = 1 #1 means not religious\n", "data.loc[data['religion_freq']<0,'religion_freq'] = 1 #1 means never attended\n", "#education level\n", "data.loc[data['edu']<0,'edu'] = 4 #junior high school\n", "data.loc[data['edu_status']<0,'edu_status'] = 0\n", "data.loc[data['edu_yr']<0,'edu_yr'] = 0\n", "#personal income\n", "data.loc[data['income']<0,'income'] = 0 #treated as no income\n", "#political affiliation\n", 
"data.loc[data['political']<0,'political'] = 1 #treated as no party affiliation (the masses)\n", "#weight\n", "data.loc[(data['weight_jin']<=80)&(data['height_cm']>=160),'weight_jin']= data['weight_jin']*2\n", "data.loc[data['weight_jin']<=60,'weight_jin']= data['weight_jin']*2 #my own assumption: no adult weighs 60 jin (30 kg)\n", "#height\n", "data.loc[data['height_cm']<150,'height_cm'] = 150 #realistic minimum for adults\n", "#health\n", "data.loc[data['health']<0,'health'] = 4 #treated as fairly healthy\n", "data.loc[data['health_problem']<0,'health_problem'] = 4\n", "#depression\n", "data.loc[data['depression']<0,'depression'] = 4 #most people are rarely depressed\n", "#media\n", "data.loc[data['media_1']<0,'media_1'] = 1 #all set to never\n", "data.loc[data['media_2']<0,'media_2'] = 1\n", "data.loc[data['media_3']<0,'media_3'] = 1\n", "data.loc[data['media_4']<0,'media_4'] = 1\n", "data.loc[data['media_5']<0,'media_5'] = 1\n", "data.loc[data['media_6']<0,'media_6'] = 1\n", "#leisure activities\n", "data.loc[data['leisure_1']<0,'leisure_1'] = 1 #all based on my own judgment\n", "data.loc[data['leisure_2']<0,'leisure_2'] = 5\n", "data.loc[data['leisure_3']<0,'leisure_3'] = 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mode is used next (via mode() in the code) to repair invalid values. Because these features describe leisure activities, filling with the mode is a reasonable choice. The code is as follows:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "data.loc[data['leisure_4']<0,'leisure_4'] = data['leisure_4'].mode()[0] #use the mode; mode() returns a Series, so take its first entry\n", "data.loc[data['leisure_5']<0,'leisure_5'] = data['leisure_5'].mode()[0]\n", "data.loc[data['leisure_6']<0,'leisure_6'] = data['leisure_6'].mode()[0]\n", "data.loc[data['leisure_7']<0,'leisure_7'] = data['leisure_7'].mode()[0]\n", "data.loc[data['leisure_8']<0,'leisure_8'] = data['leisure_8'].mode()[0]\n", "data.loc[data['leisure_9']<0,'leisure_9'] = data['leisure_9'].mode()[0]\n", "data.loc[data['leisure_10']<0,'leisure_10'] = data['leisure_10'].mode()[0]\n", "data.loc[data['leisure_11']<0,'leisure_11'] = data['leisure_11'].mode()[0]\n", "data.loc[data['leisure_12']<0,'leisure_12'] = data['leisure_12'].mode()[0]\n", "data.loc[data['socialize']<0,'socialize'] = 2 #rarely\n", 
"data.loc[data['relax']<0,'relax'] = 4 #often\n", "data.loc[data['learn']<0,'learn'] = 1 #never\n", "#social life\n", "data.loc[data['social_neighbor']<0,'social_neighbor'] = 0\n", "data.loc[data['social_friend']<0,'social_friend'] = 0\n", "data.loc[data['socia_outing']<0,'socia_outing'] = 1\n", "data.loc[data['neighbor_familiarity']<0,'neighbor_familiarity']= 4\n", "#social fairness\n", "data.loc[data['equity']<0,'equity'] = 4\n", "#social class\n", "data.loc[data['class_10_before']<0,'class_10_before'] = 3\n", "data.loc[data['class']<0,'class'] = 5\n", "data.loc[data['class_10_after']<0,'class_10_after'] = 5\n", "data.loc[data['class_14']<0,'class_14'] = 2\n", "#work\n", "data.loc[data['work_status']<0,'work_status'] = 0\n", "data.loc[data['work_yr']<0,'work_yr'] = 0\n", "data.loc[data['work_manage']<0,'work_manage'] = 0\n", "data.loc[data['work_type']<0,'work_type'] = 0\n", "#social insurance\n", "data.loc[data['insur_1']<0,'insur_1'] = 1\n", "data.loc[data['insur_2']<0,'insur_2'] = 1\n", "data.loc[data['insur_3']<0,'insur_3'] = 1\n", "data.loc[data['insur_4']<0,'insur_4'] = 1\n", "data.loc[data['insur_1']==0,'insur_1'] = 0\n", "data.loc[data['insur_2']==0,'insur_2'] = 0\n", "data.loc[data['insur_3']==0,'insur_3'] = 0\n", "data.loc[data['insur_4']==0,'insur_4'] = 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The mean (mean() in the code) is also used to fill invalid values. Household income is a continuous variable, so the mode is no longer suitable and the mean is used directly. The code is as follows:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "#family\n", "family_income_mean = data['family_income'].mean()\n", "data.loc[data['family_income']<0,'family_income'] = family_income_mean\n", "data.loc[data['family_m']<0,'family_m'] = 2\n", "data.loc[data['family_status']<0,'family_status'] = 3\n", "data.loc[data['house']<0,'house'] = 1\n", "data.loc[data['car']<0,'car'] = 0\n", "data.loc[data['car']==2,'car'] = 0 #recode to 0 and 1\n", "data.loc[data['son']<0,'son'] = 1\n", "data.loc[data['daughter']<0,'daughter'] = 0\n", 
"data.loc[data['minor_child']<0,'minor_child'] = 0\n", "#marriage\n", "data.loc[data['marital_1st']<0,'marital_1st'] = 0\n", "data.loc[data['marital_now']<0,'marital_now'] = 0\n", "#spouse\n", "data.loc[data['s_birth']<0,'s_birth'] = 0\n", "data.loc[data['s_edu']<0,'s_edu'] = 0\n", "data.loc[data['s_political']<0,'s_political'] = 0\n", "data.loc[data['s_hukou']<0,'s_hukou'] = 0\n", "data.loc[data['s_income']<0,'s_income'] = 0\n", "data.loc[data['s_work_type']<0,'s_work_type'] = 0\n", "data.loc[data['s_work_status']<0,'s_work_status'] = 0\n", "data.loc[data['s_work_exper']<0,'s_work_exper'] = 0\n", "#parents\n", "data.loc[data['f_birth']<0,'f_birth'] = 1945\n", "data.loc[data['f_edu']<0,'f_edu'] = 1\n", "data.loc[data['f_political']<0,'f_political'] = 1\n", "data.loc[data['f_work_14']<0,'f_work_14'] = 2\n", "data.loc[data['m_birth']<0,'m_birth'] = 1940\n", "data.loc[data['m_edu']<0,'m_edu'] = 1\n", "data.loc[data['m_political']<0,'m_political'] = 1\n", "data.loc[data['m_work_14']<0,'m_work_14'] = 2\n", "#socioeconomic status compared with peers\n", "data.loc[data['status_peer']<0,'status_peer'] = 2\n", "#socioeconomic status compared with 3 years ago\n", "data.loc[data['status_3_before']<0,'status_3_before'] = 2\n", "#views\n", "data.loc[data['view']<0,'view'] = 4\n", "#expected annual income\n", "data.loc[data['inc_ability']<=0,'inc_ability']= 2\n", "inc_exp_mean = data['inc_exp'].mean()\n", "data.loc[data['inc_exp']<=0,'inc_exp']= inc_exp_mean #use the mean\n", "\n", "#for some features use the mode (after dropping missing values)\n", "for i in range(1,9+1):\n", "    data.loc[data['public_service_'+str(i)]<0,'public_service_'+str(i)] = data['public_service_'+str(i)].dropna().mode()[0]\n", "for i in range(1,13+1):\n", "    data.loc[data['trust_'+str(i)]<0,'trust_'+str(i)] = data['trust_'+str(i)].dropna().mode()[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data augmentation\n", "\n", 
"In this step we look more closely at the relationships between features and derive new features from them. After some thought, I added the following: age at first marriage, age at most recent marriage, whether remarried, spouse's age, age difference with spouse, various income ratios (income relative to the spouse's income, income relative to expected income, and so on), income-to-floor-area ratios, social class differences (for example versus the class expected 10 years from now and the class at age 14), a leisure index, a satisfaction index, a trust index, and so on. I also normalized within the same province, city, and county: for example the mean income within a province or city, and an individual's standing on each indicator relative to others in the same province, city, or county. Comparisons among people of the same age are included as well, that is, income, health, and so on relative to same-age peers. The code is as follows:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "#age at first marriage 147\n", "data['marital_1stbir'] = data['marital_1st'] - data['birth'] \n", "#age at most recent marriage 148\n", "data['marital_nowtbir'] = data['marital_now'] - data['birth'] \n", "#whether remarried 149\n", "data['mar'] = data['marital_nowtbir'] - data['marital_1stbir']\n", "#spouse's age at marriage 150\n", "data['marital_sbir'] = data['marital_now']-data['s_birth']\n", "#age difference with spouse 151\n", "data['age_'] = data['marital_nowtbir'] - data['marital_sbir'] \n", "\n", "#income ratios 151+7 =158\n", "data['income/s_income'] = data['income']/(data['s_income']+1) #spouse or cohabiting partner\n", "data['income+s_income'] = data['income']+(data['s_income']+1)\n", "data['income/family_income'] = data['income']/(data['family_income']+1)\n", "data['all_income/family_income'] = (data['income']+data['s_income'])/(data['family_income']+1)\n", "data['income/inc_exp'] = data['income']/(data['inc_exp']+1)\n", "data['family_income/m'] = data['family_income']/(data['family_m']+0.01)\n", "data['income/m'] = data['income']/(data['family_m']+0.01)\n", "\n", "#income / floor-area ratios 158+4=162\n", "data['income/floor_area'] = data['income']/(data['floor_area']+0.01)\n", "data['all_income/floor_area'] = (data['income']+data['s_income'])/(data['floor_area']+0.01)\n", "data['family_income/floor_area'] = data['family_income']/(data['floor_area']+0.01)\n", "data['floor_area/m'] = data['floor_area']/(data['family_m']+0.01)\n", "\n", "#class 162+3=165\n", "data['class_10_diff'] = (data['class_10_after'] - data['class'])\n", "data['class_diff'] = data['class'] - data['class_10_before']\n", "data['class_14_diff'] = data['class'] - data['class_14']\n", "#leisure index 166\n", "leisure_fea_lis = 
['leisure_'+str(i) for i in range(1,13)]\n", "data['leisure_sum'] = data[leisure_fea_lis].sum(axis=1) #skew\n", "#satisfaction index 167\n", "public_service_fea_lis = ['public_service_'+str(i) for i in range(1,10)]\n", "data['public_service_sum'] = data[public_service_fea_lis].sum(axis=1) #skew\n", "\n", "#trust index 168\n", "trust_fea_lis = ['trust_'+str(i) for i in range(1,14)]\n", "data['trust_sum'] = data[trust_fea_lis].sum(axis=1) #skew\n", "\n", "#province mean 168+13=181\n", "data['province_income_mean'] = data.groupby(['province'])['income'].transform('mean').values\n", "data['province_family_income_mean'] = data.groupby(['province'])['family_income'].transform('mean').values\n", "data['province_equity_mean'] = data.groupby(['province'])['equity'].transform('mean').values\n", "data['province_depression_mean'] = data.groupby(['province'])['depression'].transform('mean').values\n", "data['province_floor_area_mean'] = data.groupby(['province'])['floor_area'].transform('mean').values\n", "data['province_health_mean'] = data.groupby(['province'])['health'].transform('mean').values\n", "data['province_class_10_diff_mean'] = data.groupby(['province'])['class_10_diff'].transform('mean').values\n", "data['province_class_mean'] = data.groupby(['province'])['class'].transform('mean').values\n", "data['province_health_problem_mean'] = data.groupby(['province'])['health_problem'].transform('mean').values\n", "data['province_family_status_mean'] = data.groupby(['province'])['family_status'].transform('mean').values\n", "data['province_leisure_sum_mean'] = data.groupby(['province'])['leisure_sum'].transform('mean').values\n", "data['province_public_service_sum_mean'] = data.groupby(['province'])['public_service_sum'].transform('mean').values\n", "data['province_trust_sum_mean'] = data.groupby(['province'])['trust_sum'].transform('mean').values\n", "\n", "#city mean 181+13=194\n", "data['city_income_mean'] = data.groupby(['city'])['income'].transform('mean').values #group by city\n", 
"data['city_family_income_mean'] = data.groupby(['city'])['family_income'].transform('mean').values\n", "data['city_equity_mean'] = data.groupby(['city'])['equity'].transform('mean').values\n", "data['city_depression_mean'] = data.groupby(['city'])['depression'].transform('mean').values\n", "data['city_floor_area_mean'] = data.groupby(['city'])['floor_area'].transform('mean').values\n", "data['city_health_mean'] = data.groupby(['city'])['health'].transform('mean').values\n", "data['city_class_10_diff_mean'] = data.groupby(['city'])['class_10_diff'].transform('mean').values\n", "data['city_class_mean'] = data.groupby(['city'])['class'].transform('mean').values\n", "data['city_health_problem_mean'] = data.groupby(['city'])['health_problem'].transform('mean').values\n", "data['city_family_status_mean'] = data.groupby(['city'])['family_status'].transform('mean').values\n", "data['city_leisure_sum_mean'] = data.groupby(['city'])['leisure_sum'].transform('mean').values\n", "data['city_public_service_sum_mean'] = data.groupby(['city'])['public_service_sum'].transform('mean').values\n", "data['city_trust_sum_mean'] = data.groupby(['city'])['trust_sum'].transform('mean').values\n", "\n", "#county mean 194 + 13 = 207\n", "data['county_income_mean'] = data.groupby(['county'])['income'].transform('mean').values\n", "data['county_family_income_mean'] = data.groupby(['county'])['family_income'].transform('mean').values\n", "data['county_equity_mean'] = data.groupby(['county'])['equity'].transform('mean').values\n", "data['county_depression_mean'] = data.groupby(['county'])['depression'].transform('mean').values\n", "data['county_floor_area_mean'] = data.groupby(['county'])['floor_area'].transform('mean').values\n", "data['county_health_mean'] = data.groupby(['county'])['health'].transform('mean').values\n", "data['county_class_10_diff_mean'] = data.groupby(['county'])['class_10_diff'].transform('mean').values\n", "data['county_class_mean'] = 
data.groupby(['county'])['class'].transform('mean').values\n", "data['county_health_problem_mean'] = data.groupby(['county'])['health_problem'].transform('mean').values\n", "data['county_family_status_mean'] = data.groupby(['county'])['family_status'].transform('mean').values\n", "data['county_leisure_sum_mean'] = data.groupby(['county'])['leisure_sum'].transform('mean').values\n", "data['county_public_service_sum_mean'] = data.groupby(['county'])['public_service_sum'].transform('mean').values\n", "data['county_trust_sum_mean'] = data.groupby(['county'])['trust_sum'].transform('mean').values\n", "\n", "#ratio relative to the same province 207 + 13 =220\n", "data['income/province'] = data['income']/(data['province_income_mean']) \n", "data['family_income/province'] = data['family_income']/(data['province_family_income_mean']) \n", "data['equity/province'] = data['equity']/(data['province_equity_mean']) \n", "data['depression/province'] = data['depression']/(data['province_depression_mean']) \n", "data['floor_area/province'] = data['floor_area']/(data['province_floor_area_mean'])\n", "data['health/province'] = data['health']/(data['province_health_mean'])\n", "data['class_10_diff/province'] = data['class_10_diff']/(data['province_class_10_diff_mean'])\n", "data['class/province'] = data['class']/(data['province_class_mean'])\n", "data['health_problem/province'] = data['health_problem']/(data['province_health_problem_mean'])\n", "data['family_status/province'] = data['family_status']/(data['province_family_status_mean'])\n", "data['leisure_sum/province'] = data['leisure_sum']/(data['province_leisure_sum_mean'])\n", "data['public_service_sum/province'] = data['public_service_sum']/(data['province_public_service_sum_mean'])\n", "data['trust_sum/province'] = data['trust_sum']/(data['province_trust_sum_mean']+1)\n", "\n", "#ratio relative to the same city 220 + 13 =233\n", "data['income/city'] = data['income']/(data['city_income_mean']) \n", "data['family_income/city'] = 
data['family_income']/(data['city_family_income_mean']) \n", "data['equity/city'] = data['equity']/(data['city_equity_mean']) \n", "data['depression/city'] = data['depression']/(data['city_depression_mean']) \n", "data['floor_area/city'] = data['floor_area']/(data['city_floor_area_mean'])\n", "data['health/city'] = data['health']/(data['city_health_mean'])\n", "data['class_10_diff/city'] = data['class_10_diff']/(data['city_class_10_diff_mean'])\n", "data['class/city'] = data['class']/(data['city_class_mean'])\n", "data['health_problem/city'] = data['health_problem']/(data['city_health_problem_mean'])\n", "data['family_status/city'] = data['family_status']/(data['city_family_status_mean'])\n", "data['leisure_sum/city'] = data['leisure_sum']/(data['city_leisure_sum_mean'])\n", "data['public_service_sum/city'] = data['public_service_sum']/(data['city_public_service_sum_mean'])\n", "data['trust_sum/city'] = data['trust_sum']/(data['city_trust_sum_mean'])\n", "\n", "#ratio relative to the same county 233 + 13 =246\n", "data['income/county'] = data['income']/(data['county_income_mean']) \n", "data['family_income/county'] = data['family_income']/(data['county_family_income_mean']) \n", "data['equity/county'] = data['equity']/(data['county_equity_mean']) \n", "data['depression/county'] = data['depression']/(data['county_depression_mean']) \n", "data['floor_area/county'] = data['floor_area']/(data['county_floor_area_mean'])\n", "data['health/county'] = data['health']/(data['county_health_mean'])\n", "data['class_10_diff/county'] = data['class_10_diff']/(data['county_class_10_diff_mean'])\n", "data['class/county'] = data['class']/(data['county_class_mean'])\n", "data['health_problem/county'] = data['health_problem']/(data['county_health_problem_mean'])\n", "data['family_status/county'] = data['family_status']/(data['county_family_status_mean'])\n", "data['leisure_sum/county'] = data['leisure_sum']/(data['county_leisure_sum_mean'])\n", "data['public_service_sum/county'] = 
data['public_service_sum']/(data['county_public_service_sum_mean'])\n", "data['trust_sum/county'] = data['trust_sum']/(data['county_trust_sum_mean'])\n", "\n", "#age mean 246+ 13 =259\n", "data['age_income_mean'] = data.groupby(['age'])['income'].transform('mean').values\n", "data['age_family_income_mean'] = data.groupby(['age'])['family_income'].transform('mean').values\n", "data['age_equity_mean'] = data.groupby(['age'])['equity'].transform('mean').values\n", "data['age_depression_mean'] = data.groupby(['age'])['depression'].transform('mean').values\n", "data['age_floor_area_mean'] = data.groupby(['age'])['floor_area'].transform('mean').values\n", "data['age_health_mean'] = data.groupby(['age'])['health'].transform('mean').values\n", "data['age_class_10_diff_mean'] = data.groupby(['age'])['class_10_diff'].transform('mean').values\n", "data['age_class_mean'] = data.groupby(['age'])['class'].transform('mean').values\n", "data['age_health_problem_mean'] = data.groupby(['age'])['health_problem'].transform('mean').values\n", "data['age_family_status_mean'] = data.groupby(['age'])['family_status'].transform('mean').values\n", "data['age_leisure_sum_mean'] = data.groupby(['age'])['leisure_sum'].transform('mean').values\n", "data['age_public_service_sum_mean'] = data.groupby(['age'])['public_service_sum'].transform('mean').values\n", "data['age_trust_sum_mean'] = data.groupby(['age'])['trust_sum'].transform('mean').values\n", "\n", "# ratio relative to same-age peers 259 + 13 =272\n", "data['income/age'] = data['income']/(data['age_income_mean']) \n", "data['family_income/age'] = data['family_income']/(data['age_family_income_mean']) \n", "data['equity/age'] = data['equity']/(data['age_equity_mean']) \n", "data['depression/age'] = data['depression']/(data['age_depression_mean']) \n", "data['floor_area/age'] = data['floor_area']/(data['age_floor_area_mean'])\n", "data['health/age'] = data['health']/(data['age_health_mean'])\n", "data['class_10_diff/age'] = 
data['class_10_diff']/(data['age_class_10_diff_mean'])\n", "data['class/age'] = data['class']/(data['age_class_mean'])\n", "data['health_problem/age'] = data['health_problem']/(data['age_health_problem_mean'])\n", "data['family_status/age'] = data['family_status']/(data['age_family_status_mean'])\n", "data['leisure_sum/age'] = data['leisure_sum']/(data['age_leisure_sum_mean'])\n", "data['public_service_sum/age'] = data['public_service_sum']/(data['age_public_service_sum_mean'])\n", "data['trust_sum/age'] = data['trust_sum']/(data['age_trust_sum_mean'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the steps above, the feature set has grown from the initial 139 dimensions to 272. Next come feature engineering, model training, and model ensembling." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "shape (10956, 272)\n" ] }, { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>id</th><th>survey_type</th><th>province</th><th>city</th><th>county</th><th>survey_time</th><th>gender</th><th>birth</th><th>nationality</th><th>religion</th><th>...</th><th>depression/age</th><th>floor_area/age</th><th>health/age</th><th>class_10_diff/age</th><th>class/age</th><th>health_problem/age</th><th>family_status/age</th><th>leisure_sum/age</th><th>public_service_sum/age</th><th>trust_sum/age</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>1</td><td>1</td><td>12</td><td>32</td><td>59</td><td>2015</td><td>1</td><td>1959</td><td>1</td><td>1</td><td>...</td><td>1.285211</td><td>0.410351</td><td>0.848837</td><td>0.000000</td><td>0.683307</td><td>0.521429</td><td>0.733668</td><td>0.724620</td><td>0.666638</td><td>0.925941</td></tr>\n",
"<tr><th>1</th><td>2</td><td>2</td><td>18</td><td>52</td><td>85</td><td>2015</td><td>1</td><td>1992</td><td>1</td><td>1</td><td>...</td><td>0.733333</td><td>0.952824</td><td>1.179337</td><td>1.012552</td><td>1.344444</td><td>0.891344</td><td>1.359551</td><td>1.011792</td><td>1.130778</td><td>1.188442</td></tr>\n",
"<tr><th>2</th><td>3</td><td>2</td><td>29</td><td>83</td><td>126</td><td>2015</td><td>2</td><td>1967</td><td>1</td><td>0</td><td>...</td><td>1.343537</td><td>0.972328</td><td>1.150485</td><td>1.190955</td><td>1.195762</td><td>1.055679</td><td>1.190955</td><td>0.966470</td><td>1.193204</td><td>0.803693</td></tr>\n",
"<tr><th>3</th><td>4</td><td>2</td><td>10</td><td>28</td><td>51</td><td>2015</td><td>2</td><td>1943</td><td>1</td><td>1</td><td>...</td><td>1.111663</td><td>0.642329</td><td>1.276353</td><td>4.977778</td><td>1.199143</td><td>1.188329</td><td>1.162630</td><td>0.899346</td><td>1.153810</td><td>1.300950</td></tr>\n",
"<tr><th>4</th><td>5</td><td>1</td><td>7</td><td>18</td><td>36</td><td>2015</td><td>2</td><td>1994</td><td>1</td><td>1</td><td>...</td><td>0.750000</td><td>0.587284</td><td>1.177106</td><td>0.000000</td><td>0.236957</td><td>1.116803</td><td>1.093645</td><td>1.045313</td><td>0.728161</td><td>1.117428</td></tr>\n",
"</tbody>\n", "</table>\n", "<p>5 rows × 272 columns</p>\n", "</div>
" ], "text/plain": [ " id survey_type province city county survey_time gender birth \\\n", "0 1 1 12 32 59 2015 1 1959 \n", "1 2 2 18 52 85 2015 1 1992 \n", "2 3 2 29 83 126 2015 2 1967 \n", "3 4 2 10 28 51 2015 2 1943 \n", "4 5 1 7 18 36 2015 2 1994 \n", "\n", " nationality religion ... depression/age floor_area/age health/age \\\n", "0 1 1 ... 1.285211 0.410351 0.848837 \n", "1 1 1 ... 0.733333 0.952824 1.179337 \n", "2 1 0 ... 1.343537 0.972328 1.150485 \n", "3 1 1 ... 1.111663 0.642329 1.276353 \n", "4 1 1 ... 0.750000 0.587284 1.177106 \n", "\n", " class_10_diff/age class/age health_problem/age family_status/age \\\n", "0 0.000000 0.683307 0.521429 0.733668 \n", "1 1.012552 1.344444 0.891344 1.359551 \n", "2 1.190955 1.195762 1.055679 1.190955 \n", "3 4.977778 1.199143 1.188329 1.162630 \n", "4 0.000000 0.236957 1.116803 1.093645 \n", "\n", " leisure_sum/age public_service_sum/age trust_sum/age \n", "0 0.724620 0.666638 0.925941 \n", "1 1.011792 1.130778 1.188442 \n", "2 0.966470 1.193204 0.803693 \n", "3 0.899346 1.153810 1.300950 \n", "4 1.045313 0.728161 1.117428 \n", "\n", "[5 rows x 272 columns]" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print('shape',data.shape)\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We should also drop features with very few valid samples, such as features with too many negative values or too many missing values. Here I removed 9 features in total, including “current highest education level”, which leaves the final 263-dimensional feature set." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7988, 263)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#272-9=263\n", "#drop features with very few valid values and features already used above\n", "del_list=['id','survey_time','edu_other','invest_other','property_other','join_party','province','city','county']\n", "use_feature = [clo for clo in data.columns if clo not in del_list]\n", "data.fillna(0,inplace=True) #fill any remaining missing values with 0\n", "train_shape = train.shape[0] #number of training samples\n", "features = data[use_feature].columns #all features remaining after the drop\n", 
"X_train_263 = data[:train_shape][use_feature].values\n", "y_train = target\n", "X_test_263 = data[train_shape:][use_feature].values\n", "X_train_263.shape #263 features in total" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the 49 most important features are selected as a second feature set, in addition to the 263-dimensional set above." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7988, 49)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imp_fea_49 = ['equity','depression','health','class','family_status','health_problem','class_10_after',\n", " 'equity/province','equity/city','equity/county',\n", " 'depression/province','depression/city','depression/county',\n", " 'health/province','health/city','health/county',\n", " 'class/province','class/city','class/county',\n", " 'family_status/province','family_status/city','family_status/county',\n", " 'family_income/province','family_income/city','family_income/county',\n", " 'floor_area/province','floor_area/city','floor_area/county',\n", " 'leisure_sum/province','leisure_sum/city','leisure_sum/county',\n", " 'public_service_sum/province','public_service_sum/city','public_service_sum/county',\n", " 'trust_sum/province','trust_sum/city','trust_sum/county',\n", " 'income/m','public_service_sum','class_diff','status_3_before','age_income_mean','age_floor_area_mean',\n", " 'weight_jin','height_cm',\n", " 'health/age','depression/age','equity/age','leisure_sum/age'\n", " ]\n", "train_shape = train.shape[0]\n", "X_train_49 = data[:train_shape][imp_fea_49].values\n", "X_test_49 = data[train_shape:][imp_fea_49].values\n", "X_train_49.shape #the 49 most important features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The discrete variables that need it are one-hot encoded and then combined with the remaining features into a third feature set with 383 dimensions." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(7988, 383)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cat_fea = 
['survey_type','gender','nationality','edu_status','political','hukou','hukou_loc','work_exper','work_status','work_type',\n", " 'work_manage','marital','s_political','s_hukou','s_work_exper','s_work_status','s_work_type','f_political','f_work_14',\n", " 'm_political','m_work_14'] # features that are already 0/1 do not need one-hot encoding\n", "noc_fea = [clo for clo in use_feature if clo not in cat_fea]\n", "\n", "onehot_data = data[cat_fea].values\n", "enc = preprocessing.OneHotEncoder(categories = 'auto')\n", "oh_data=enc.fit_transform(onehot_data).toarray()\n", "oh_data.shape # converted to one-hot encoded format\n", "\n", "X_train_oh = oh_data[:train_shape,:]\n", "X_test_oh = oh_data[train_shape:,:]\n", "X_train_oh.shape # the training portion\n", "\n", "X_train_383 = np.column_stack([data[:train_shape][noc_fea].values,X_train_oh]) # non-categorical features first, then the one-hot columns\n", "X_test_383 = np.column_stack([data[train_shape:][noc_fea].values,X_test_oh])\n", "X_train_383.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With this, we have built three feature sets (training datasets). The first is the 49 most important features extracted above, including health, social class, income relative to people of the same age, and so on. The second is the expanded 263-dimensional feature set (which can be regarded as the base features). The third is the one-hot encoded feature set. One-hot encoding is used because some features are category codes rather than quantities. Gender, for example, is coded as 1 for male and 2 for female; one-hot encoding replaces that single column with one 0/1 indicator column per category, so the model does not treat the codes as ordered magnitudes. Nationality is another example: it is originally an integer code from 1 to 56, and feeding the raw code to a model would impose a meaningless ordering and hurt robustness, so one-hot encoding turns it into a small set of 0/1 indicator features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Feature modeling\n", "\n", "We start with the base 263-dimensional feature set and fit LightGBM using 5-fold cross-validation:\n", "\n", "1. LightGBM" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n", "Training until validation scores don't improve for 800 rounds\n", "[500]\ttraining's l2: 0.507058\tvalid_1's l2: 0.50963\n", "[1000]\ttraining's l2: 0.458952\tvalid_1's l2: 0.469564\n", "[1500]\ttraining's l2: 0.433197\tvalid_1's l2: 0.453234\n", "[2000]\ttraining's l2: 0.415242\tvalid_1's l2: 0.444839\n", "[2500]\ttraining's l2: 0.400993\tvalid_1's l2: 0.440086\n", "[3000]\ttraining's l2: 0.388937\tvalid_1's l2: 0.436924\n", "[3500]\ttraining's l2: 
0.378101\tvalid_1's l2: 0.434974\n", "[4000]\ttraining's l2: 0.368159\tvalid_1's l2: 0.43406\n", "[4500]\ttraining's l2: 0.358827\tvalid_1's l2: 0.433151\n", "[5000]\ttraining's l2: 0.350291\tvalid_1's l2: 0.432544\n", "[5500]\ttraining's l2: 0.342368\tvalid_1's l2: 0.431821\n", "[6000]\ttraining's l2: 0.334675\tvalid_1's l2: 0.431331\n", "[6500]\ttraining's l2: 0.327275\tvalid_1's l2: 0.431014\n", "[7000]\ttraining's l2: 0.320398\tvalid_1's l2: 0.431087\n", "[7500]\ttraining's l2: 0.31352\tvalid_1's l2: 0.430819\n", "[8000]\ttraining's l2: 0.307021\tvalid_1's l2: 0.430848\n", "[8500]\ttraining's l2: 0.300811\tvalid_1's l2: 0.430688\n", "[9000]\ttraining's l2: 0.294787\tvalid_1's l2: 0.430441\n", "[9500]\ttraining's l2: 0.288993\tvalid_1's l2: 0.430433\n", "Early stopping, best iteration is:\n", "[9119]\ttraining's l2: 0.293371\tvalid_1's l2: 0.430308\n", "fold n°2\n", "Training until validation scores don't improve for 800 rounds\n", "[500]\ttraining's l2: 0.49895\tvalid_1's l2: 0.52945\n", "[1000]\ttraining's l2: 0.450107\tvalid_1's l2: 0.496478\n", "[1500]\ttraining's l2: 0.424394\tvalid_1's l2: 0.483286\n", "[2000]\ttraining's l2: 0.40666\tvalid_1's l2: 0.476764\n", "[2500]\ttraining's l2: 0.392432\tvalid_1's l2: 0.472668\n", "[3000]\ttraining's l2: 0.380438\tvalid_1's l2: 0.470481\n", "[3500]\ttraining's l2: 0.369872\tvalid_1's l2: 0.468919\n", "[4000]\ttraining's l2: 0.36014\tvalid_1's l2: 0.467318\n", "[4500]\ttraining's l2: 0.351175\tvalid_1's l2: 0.466438\n", "[5000]\ttraining's l2: 0.342705\tvalid_1's l2: 0.466284\n", "[5500]\ttraining's l2: 0.334778\tvalid_1's l2: 0.466151\n", "[6000]\ttraining's l2: 0.3273\tvalid_1's l2: 0.466016\n", "[6500]\ttraining's l2: 0.320121\tvalid_1's l2: 0.466013\n", "Early stopping, best iteration is:\n", "[5915]\ttraining's l2: 0.328534\tvalid_1's l2: 0.465918\n", "fold n°3\n", "Training until validation scores don't improve for 800 rounds\n", "[500]\ttraining's l2: 0.499658\tvalid_1's l2: 0.528985\n", "[1000]\ttraining's 
l2: 0.450356\tvalid_1's l2: 0.497264\n", "[1500]\ttraining's l2: 0.424109\tvalid_1's l2: 0.485403\n", "[2000]\ttraining's l2: 0.405965\tvalid_1's l2: 0.479513\n", "[2500]\ttraining's l2: 0.391747\tvalid_1's l2: 0.47646\n", "[3000]\ttraining's l2: 0.379601\tvalid_1's l2: 0.474691\n", "[3500]\ttraining's l2: 0.368915\tvalid_1's l2: 0.473648\n", "[4000]\ttraining's l2: 0.359218\tvalid_1's l2: 0.47316\n", "[4500]\ttraining's l2: 0.350338\tvalid_1's l2: 0.473043\n", "[5000]\ttraining's l2: 0.341842\tvalid_1's l2: 0.472719\n", "[5500]\ttraining's l2: 0.333851\tvalid_1's l2: 0.472779\n", "Early stopping, best iteration is:\n", "[4942]\ttraining's l2: 0.342828\tvalid_1's l2: 0.472642\n", "fold n°4\n", "Training until validation scores don't improve for 800 rounds\n", "[500]\ttraining's l2: 0.505224\tvalid_1's l2: 0.508238\n", "[1000]\ttraining's l2: 0.456198\tvalid_1's l2: 0.473992\n", "[1500]\ttraining's l2: 0.430167\tvalid_1's l2: 0.461419\n", "[2000]\ttraining's l2: 0.412084\tvalid_1's l2: 0.454843\n", "[2500]\ttraining's l2: 0.397714\tvalid_1's l2: 0.450999\n", "[3000]\ttraining's l2: 0.385456\tvalid_1's l2: 0.448697\n", "[3500]\ttraining's l2: 0.374527\tvalid_1's l2: 0.446993\n", "[4000]\ttraining's l2: 0.364711\tvalid_1's l2: 0.44597\n", "[4500]\ttraining's l2: 0.355626\tvalid_1's l2: 0.445132\n", "[5000]\ttraining's l2: 0.347108\tvalid_1's l2: 0.44466\n", "[5500]\ttraining's l2: 0.339146\tvalid_1's l2: 0.444226\n", "[6000]\ttraining's l2: 0.331478\tvalid_1's l2: 0.443992\n", "[6500]\ttraining's l2: 0.324231\tvalid_1's l2: 0.444014\n", "Early stopping, best iteration is:\n", "[5874]\ttraining's l2: 0.333372\tvalid_1's l2: 0.443868\n", "fold n°5\n", "Training until validation scores don't improve for 800 rounds\n", "[500]\ttraining's l2: 0.504304\tvalid_1's l2: 0.515256\n", "[1000]\ttraining's l2: 0.456062\tvalid_1's l2: 0.478544\n", "[1500]\ttraining's l2: 0.430298\tvalid_1's l2: 0.463847\n", "[2000]\ttraining's l2: 0.412591\tvalid_1's l2: 0.456182\n", 
"[2500]\ttraining's l2: 0.398635\tvalid_1's l2: 0.451783\n", "[3000]\ttraining's l2: 0.386609\tvalid_1's l2: 0.449154\n", "[3500]\ttraining's l2: 0.375948\tvalid_1's l2: 0.447265\n", "[4000]\ttraining's l2: 0.366291\tvalid_1's l2: 0.445796\n", "[4500]\ttraining's l2: 0.357236\tvalid_1's l2: 0.445098\n", "[5000]\ttraining's l2: 0.348637\tvalid_1's l2: 0.444364\n", "[5500]\ttraining's l2: 0.340736\tvalid_1's l2: 0.443998\n", "[6000]\ttraining's l2: 0.333154\tvalid_1's l2: 0.443622\n", "[6500]\ttraining's l2: 0.325783\tvalid_1's l2: 0.443226\n", "[7000]\ttraining's l2: 0.318802\tvalid_1's l2: 0.442986\n", "[7500]\ttraining's l2: 0.312164\tvalid_1's l2: 0.442928\n", "[8000]\ttraining's l2: 0.305691\tvalid_1's l2: 0.442696\n", "[8500]\ttraining's l2: 0.29935\tvalid_1's l2: 0.442521\n", "[9000]\ttraining's l2: 0.293242\tvalid_1's l2: 0.442655\n", "Early stopping, best iteration is:\n", "[8594]\ttraining's l2: 0.298201\tvalid_1's l2: 0.44238\n", "CV score: 0.45102656\n" ] } ], "source": [ "##### lgb_263 #\n", "# LightGBM gradient-boosted decision trees\n", "lgb_263_param = {\n", "'num_leaves': 7, \n", "'min_data_in_leaf': 20, # minimum number of records a leaf may hold\n", "'objective':'regression',\n", "'max_depth': -1,\n", "'learning_rate': 0.003,\n", "\"boosting\": \"gbdt\", # use the GBDT algorithm\n", "\"feature_fraction\": 0.18, # e.g. 0.18 means each tree is built from a random 18% of the features\n", "\"bagging_freq\": 1,\n", "\"bagging_fraction\": 0.55, # fraction of the data sampled at each iteration\n", "\"bagging_seed\": 14,\n", "\"metric\": 'mse',\n", "\"lambda_l1\": 0.1,\n", "\"lambda_l2\": 0.2, \n", "\"verbosity\": -1}\n", "folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=4) # 5-fold split\n", "oof_lgb_263 = np.zeros(len(X_train_263))\n", "predictions_lgb_263 = np.zeros(len(X_test_263))\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):\n", "\n", " print(\"fold n°{}\".format(fold_+1))\n", " trn_data = lgb.Dataset(X_train_263[trn_idx], y_train[trn_idx])\n", " val_data = lgb.Dataset(X_train_263[val_idx], y_train[val_idx])#train:val=4:1\n", "\n", " num_round = 
10000\n", " lgb_263 = lgb.train(lgb_263_param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=500, early_stopping_rounds = 800)\n", " oof_lgb_263[val_idx] = lgb_263.predict(X_train_263[val_idx], num_iteration=lgb_263.best_iteration)\n", " predictions_lgb_263 += lgb_263.predict(X_test_263, num_iteration=lgb_263.best_iteration) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_lgb_263, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, I use the trained LightGBM model (the one fitted in the last fold) to assess and visualize feature importance. The results show that the top-ranked feature is health/age, i.e. health relative to people of the same age, which broadly matches intuition." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# --------------- feature importance\n", "# show all columns\n", "pd.set_option('display.max_columns', None)\n", "# show all rows\n", "pd.set_option('display.max_rows', None)\n", "# set the displayed value width to 100 (default 50)\n", "pd.set_option('display.max_colwidth',100)\n", "df = pd.DataFrame(data[use_feature].columns.tolist(), columns=['feature'])\n", "# lgb_263 is the model trained in the last fold\n", "df['importance']=list(lgb_263.feature_importance())\n", "df = df.sort_values(by='importance',ascending=False)\n", "plt.figure(figsize=(14,28))\n", "sns.barplot(x=\"importance\", y=\"feature\", data=df.head(50))\n", "plt.title('Features importance (averaged/folds)')\n", "plt.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we model the 263-dimensional features with several other common machine learning methods:\n", "\n", "2. XGBoost" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n", "[19:14:55] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "[19:14:55] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480: \n", "Parameters: { silent } might not be used.\n", "\n", " This may not be accurate due to some parameters are only used in language bindings but\n", " passed down to XGBoost core. Or some parameters are not used but slip through this\n", " verification. Please open an issue if you find above cases.\n", "\n", "\n", "[0]\ttrain-rmse:3.40426\tvalid_data-rmse:3.38329\n", "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", "\n", "Will train until valid_data-rmse hasn't improved in 600 rounds.\n", "[500]\ttrain-rmse:0.40805\tvalid_data-rmse:0.70588\n", "[1000]\ttrain-rmse:0.27046\tvalid_data-rmse:0.70760\n", "Stopping. 
Best iteration:\n", "[663]\ttrain-rmse:0.35644\tvalid_data-rmse:0.70521\n", "\n", "[19:15:46] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "fold n°2\n", "[19:15:46] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "[19:15:46] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480: \n", "Parameters: { silent } might not be used.\n", "\n", " This may not be accurate due to some parameters are only used in language bindings but\n", " passed down to XGBoost core. Or some parameters are not used but slip through this\n", " verification. Please open an issue if you find above cases.\n", "\n", "\n", "[0]\ttrain-rmse:3.39811\tvalid_data-rmse:3.40788\n", "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", "\n", "Will train until valid_data-rmse hasn't improved in 600 rounds.\n", "[500]\ttrain-rmse:0.40719\tvalid_data-rmse:0.69456\n", "[1000]\ttrain-rmse:0.27402\tvalid_data-rmse:0.69501\n", "Stopping. Best iteration:\n", "[551]\ttrain-rmse:0.39079\tvalid_data-rmse:0.69403\n", "\n", "[19:16:31] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "fold n°3\n", "[19:16:31] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "[19:16:31] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480: \n", "Parameters: { silent } might not be used.\n", "\n", " This may not be accurate due to some parameters are only used in language bindings but\n", " passed down to XGBoost core. Or some parameters are not used but slip through this\n", " verification. 
Please open an issue if you find above cases.\n", "\n", "\n", "[0]\ttrain-rmse:3.40181\tvalid_data-rmse:3.39295\n", "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", "\n", "Will train until valid_data-rmse hasn't improved in 600 rounds.\n", "[500]\ttrain-rmse:0.41334\tvalid_data-rmse:0.66250\n", "Stopping. Best iteration:\n", "[333]\ttrain-rmse:0.47284\tvalid_data-rmse:0.66178\n", "\n", "[19:17:07] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "fold n°4\n", "[19:17:08] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "[19:17:08] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480: \n", "Parameters: { silent } might not be used.\n", "\n", " This may not be accurate due to some parameters are only used in language bindings but\n", " passed down to XGBoost core. Or some parameters are not used but slip through this\n", " verification. Please open an issue if you find above cases.\n", "\n", "\n", "[0]\ttrain-rmse:3.40240\tvalid_data-rmse:3.39012\n", "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", "\n", "Will train until valid_data-rmse hasn't improved in 600 rounds.\n", "[500]\ttrain-rmse:0.41021\tvalid_data-rmse:0.66575\n", "[1000]\ttrain-rmse:0.27491\tvalid_data-rmse:0.66431\n", "Stopping. 
Best iteration:\n", "[863]\ttrain-rmse:0.30689\tvalid_data-rmse:0.66358\n", "\n", "[19:18:06] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "fold n°5\n", "[19:18:07] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "[19:18:07] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480: \n", "Parameters: { silent } might not be used.\n", "\n", " This may not be accurate due to some parameters are only used in language bindings but\n", " passed down to XGBoost core. Or some parameters are not used but slip through this\n", " verification. Please open an issue if you find above cases.\n", "\n", "\n", "[0]\ttrain-rmse:3.39347\tvalid_data-rmse:3.42628\n", "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", "\n", "Will train until valid_data-rmse hasn't improved in 600 rounds.\n", "[500]\ttrain-rmse:0.41704\tvalid_data-rmse:0.64937\n", "[1000]\ttrain-rmse:0.27907\tvalid_data-rmse:0.64914\n", "Stopping. 
Best iteration:\n", "[598]\ttrain-rmse:0.38625\tvalid_data-rmse:0.64856\n", "\n", "[19:18:55] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "CV score: 0.45559329\n" ] } ], "source": [ "##### xgb_263\n", "#xgboost\n", "xgb_263_params = {'eta': 0.02, # learning rate\n", " 'max_depth': 6, \n", " 'min_child_weight':3, # minimum sum of instance weights required in a child node\n", " 'gamma':0, # minimum loss reduction required to split a node\n", " 'subsample': 0.7, # fraction of rows sampled for each tree\n", " 'colsample_bytree': 0.3, # fraction of columns (features) sampled for each tree\n", " 'lambda':2,\n", " 'objective': 'reg:linear', # deprecated alias of reg:squarederror (see the warning in the output)\n", " 'eval_metric': 'rmse', \n", " 'silent': True, # reported as possibly unused by this XGBoost version (see the warning in the output)\n", " 'nthread': -1}\n", "\n", "\n", "folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2019)\n", "oof_xgb_263 = np.zeros(len(X_train_263))\n", "predictions_xgb_263 = np.zeros(len(X_test_263))\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " trn_data = xgb.DMatrix(X_train_263[trn_idx], y_train[trn_idx])\n", " val_data = xgb.DMatrix(X_train_263[val_idx], y_train[val_idx])\n", "\n", " watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]\n", " xgb_263 = xgb.train(dtrain=trn_data, num_boost_round=3000, evals=watchlist, early_stopping_rounds=600, verbose_eval=500, params=xgb_263_params)\n", " oof_xgb_263[val_idx] = xgb_263.predict(xgb.DMatrix(X_train_263[val_idx]), ntree_limit=xgb_263.best_ntree_limit)\n", " predictions_xgb_263 += xgb_263.predict(xgb.DMatrix(X_test_263), ntree_limit=xgb_263.best_ntree_limit) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_xgb_263, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. 
RandomForestRegressor (random forest)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 0.6s\n", "[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 2.6s\n", "[Parallel(n_jobs=-1)]: Done 434 tasks | elapsed: 6.5s\n", "[Parallel(n_jobs=-1)]: Done 784 tasks | elapsed: 11.8s\n", "[Parallel(n_jobs=-1)]: Done 1234 tasks | elapsed: 18.9s\n", "[Parallel(n_jobs=-1)]: Done 1600 out of 1600 | elapsed: 25.6s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1234 tasks | elapsed: 0.2s\n", "[Parallel(n_jobs=8)]: Done 1600 out of 1600 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1234 tasks | elapsed: 0.2s\n", "[Parallel(n_jobs=8)]: Done 1600 out of 1600 | elapsed: 0.2s finished\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "fold n°2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 0.6s\n", "[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 2.8s\n", "[Parallel(n_jobs=-1)]: Done 434 tasks | elapsed: 6.9s\n", 
"[Parallel(n_jobs=-1)]: Done 784 tasks | elapsed: 12.9s\n", "[Parallel(n_jobs=-1)]: Done 1234 tasks | elapsed: 21.0s\n", "[Parallel(n_jobs=-1)]: Done 1600 out of 1600 | elapsed: 27.5s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1234 tasks | elapsed: 0.2s\n", "[Parallel(n_jobs=8)]: Done 1600 out of 1600 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.2s\n", "[Parallel(n_jobs=8)]: Done 1234 tasks | elapsed: 0.2s\n", "[Parallel(n_jobs=8)]: Done 1600 out of 1600 | elapsed: 0.3s finished\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "fold n°3\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 0.6s\n", "[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 3.4s\n", "[Parallel(n_jobs=-1)]: Done 434 tasks | elapsed: 7.6s\n", "[Parallel(n_jobs=-1)]: Done 784 tasks | elapsed: 13.7s\n", "[Parallel(n_jobs=-1)]: Done 1234 tasks | elapsed: 21.0s\n", "[Parallel(n_jobs=-1)]: Done 1600 out of 1600 | elapsed: 26.9s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 
0.1s\n", "[Parallel(n_jobs=8)]: Done 1234 tasks | elapsed: 0.2s\n", "[Parallel(n_jobs=8)]: Done 1600 out of 1600 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1234 tasks | elapsed: 0.2s\n", "[Parallel(n_jobs=8)]: Done 1600 out of 1600 | elapsed: 0.2s finished\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "fold n°4\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 0.8s\n", "[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 3.5s\n", "[Parallel(n_jobs=-1)]: Done 434 tasks | elapsed: 7.9s\n", "[Parallel(n_jobs=-1)]: Done 784 tasks | elapsed: 13.3s\n", "[Parallel(n_jobs=-1)]: Done 1234 tasks | elapsed: 20.6s\n", "[Parallel(n_jobs=-1)]: Done 1600 out of 1600 | elapsed: 26.1s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1234 tasks | elapsed: 0.2s\n", "[Parallel(n_jobs=8)]: Done 1600 out of 1600 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1234 tasks | elapsed: 
0.2s\n", "[Parallel(n_jobs=8)]: Done 1600 out of 1600 | elapsed: 0.2s finished\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "fold n°5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 0.6s\n", "[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 2.7s\n", "[Parallel(n_jobs=-1)]: Done 434 tasks | elapsed: 6.8s\n", "[Parallel(n_jobs=-1)]: Done 784 tasks | elapsed: 12.2s\n", "[Parallel(n_jobs=-1)]: Done 1234 tasks | elapsed: 19.2s\n", "[Parallel(n_jobs=-1)]: Done 1600 out of 1600 | elapsed: 25.1s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1234 tasks | elapsed: 0.2s\n", "[Parallel(n_jobs=8)]: Done 1600 out of 1600 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CV score: 0.47804209\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=8)]: Done 1234 tasks | elapsed: 0.2s\n", "[Parallel(n_jobs=8)]: Done 1600 out of 1600 | elapsed: 0.3s finished\n" ] } ], "source": [ "#RandomForestRegressor随机森林\n", "folds = KFold(n_splits=5, shuffle=True, random_state=2019)\n", "oof_rfr_263 = np.zeros(len(X_train_263))\n", "predictions_rfr_263 = np.zeros(len(X_test_263))\n", "\n", "for fold_, (trn_idx, val_idx) in 
enumerate(folds.split(X_train_263, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " tr_x = X_train_263[trn_idx]\n", " tr_y = y_train[trn_idx]\n", " rfr_263 = rfr(n_estimators=1600,max_depth=9, min_samples_leaf=9, min_weight_fraction_leaf=0.0,\n", " max_features=0.25,verbose=1,n_jobs=-1) # n_jobs=-1 runs on all available CPU cores\n", " # verbose=0 prints no progress messages\n", " # verbose=1 prints the joblib progress lines seen in the output\n", " # larger values print progressively more detail\n", " rfr_263.fit(tr_x,tr_y)\n", " oof_rfr_263[val_idx] = rfr_263.predict(X_train_263[val_idx])\n", " \n", " predictions_rfr_263 += rfr_263.predict(X_test_263) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_rfr_263, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. GradientBoostingRegressor (gradient-boosted decision trees)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n", " Iter Train Loss OOB Improve Remaining Time \n", " 1 0.6419 0.0036 24.34s\n", " 2 0.6564 0.0031 23.18s\n", " 3 0.6693 0.0031 22.69s\n", " 4 0.6589 0.0031 22.78s\n", " 5 0.6522 0.0027 22.58s\n", " 6 0.6521 0.0031 22.40s\n", " 7 0.6370 0.0029 22.23s\n", " 8 0.6343 0.0030 22.06s\n", " 9 0.6447 0.0029 21.87s\n", " 10 0.6397 0.0028 21.75s\n", " 20 0.5955 0.0019 20.93s\n", " 30 0.5695 0.0016 20.09s\n", " 40 0.5460 0.0015 19.34s\n", " 50 0.5121 0.0011 18.65s\n", " 60 0.4994 0.0012 18.03s\n", " 70 0.4912 0.0010 17.44s\n", " 80 0.4719 0.0010 16.76s\n", " 90 0.4310 0.0007 16.28s\n", " 100 0.4437 0.0006 15.84s\n", " 200 0.3424 0.0002 10.15s\n", " 300 0.3063 -0.0000 4.94s\n", " 400 0.2759 -0.0000 0.00s\n", "fold n°2\n", " Iter Train Loss OOB Improve Remaining Time \n", " 1 0.6836 0.0034 24.61s\n", " 2 0.6613 0.0030 22.86s\n", " 3 0.6500 0.0031 24.11s\n", " 4 0.6621 0.0036 23.15s\n", " 5 0.6356 0.0031 23.49s\n", " 6 0.6460 0.0029 23.13s\n", " 7 0.6263 0.0032 22.83s\n", " 8 0.6149 0.0029 22.72s\n", " 9 0.6350 0.0030 22.83s\n", " 10 0.6325 0.0026 22.65s\n", " 20 0.6064 0.0025 
21.62s\n", " 30 0.5812 0.0018 20.59s\n", " 40 0.5460 0.0018 19.98s\n", " 50 0.5016 0.0014 19.52s\n", " 60 0.4991 0.0010 18.84s\n", " 70 0.4645 0.0009 18.24s\n", " 80 0.4621 0.0007 17.76s\n", " 90 0.4497 0.0007 17.20s\n", " 100 0.4374 0.0005 16.51s\n", " 200 0.3420 0.0001 10.35s\n", " 300 0.3032 -0.0000 4.95s\n", " 400 0.2710 -0.0000 0.00s\n", "fold n°3\n", " Iter Train Loss OOB Improve Remaining Time \n", " 1 0.6692 0.0036 24.95s\n", " 2 0.6468 0.0031 23.99s\n", " 3 0.6313 0.0034 24.05s\n", " 4 0.6499 0.0032 23.70s\n", " 5 0.6358 0.0033 23.38s\n", " 6 0.6343 0.0029 23.05s\n", " 7 0.6312 0.0036 22.71s\n", " 8 0.6180 0.0032 22.47s\n", " 9 0.6275 0.0035 22.57s\n", " 10 0.6168 0.0030 22.24s\n", " 20 0.5792 0.0021 20.73s\n", " 30 0.5583 0.0023 20.27s\n", " 40 0.5521 0.0018 19.70s\n", " 50 0.5067 0.0013 18.84s\n", " 60 0.4754 0.0010 18.42s\n", " 70 0.4811 0.0009 17.84s\n", " 80 0.4603 0.0008 17.38s\n", " 90 0.4439 0.0006 16.74s\n", " 100 0.4323 0.0007 16.25s\n", " 200 0.3401 0.0002 10.23s\n", " 300 0.2862 -0.0000 4.84s\n", " 400 0.2690 -0.0000 0.00s\n", "fold n°4\n", " Iter Train Loss OOB Improve Remaining Time \n", " 1 0.6687 0.0032 21.09s\n", " 2 0.6517 0.0031 23.29s\n", " 3 0.6583 0.0031 23.63s\n", " 4 0.6607 0.0033 24.45s\n", " 5 0.6583 0.0029 24.78s\n", " 6 0.6688 0.0028 24.80s\n", " 7 0.6320 0.0030 25.08s\n", " 8 0.6502 0.0026 24.94s\n", " 9 0.6358 0.0026 24.51s\n", " 10 0.6258 0.0027 24.24s\n", " 20 0.5910 0.0023 22.41s\n", " 30 0.5609 0.0020 21.31s\n", " 40 0.5399 0.0017 20.50s\n", " 50 0.4963 0.0013 19.67s\n", " 60 0.4844 0.0012 18.86s\n", " 70 0.4781 0.0008 18.21s\n", " 80 0.4484 0.0010 17.63s\n", " 90 0.4619 0.0006 16.95s\n", " 100 0.4430 0.0005 16.46s\n", " 200 0.3377 0.0001 10.50s\n", " 300 0.3001 0.0001 4.97s\n", " 400 0.2623 -0.0000 0.00s\n", "fold n°5\n", " Iter Train Loss OOB Improve Remaining Time \n", " 1 0.6857 0.0031 23.50s\n", " 2 0.6320 0.0035 24.26s\n", " 3 0.6573 0.0033 23.41s\n", " 4 0.6494 0.0033 24.20s\n", " 5 0.6311 0.0033 24.32s\n", " 6 
0.6362 0.0031 24.20s\n", " 7 0.6291 0.0032 24.05s\n", " 8 0.6354 0.0032 23.56s\n", " 9 0.6383 0.0030 23.54s\n", " 10 0.6250 0.0029 23.64s\n", " 20 0.5989 0.0023 21.45s\n", " 30 0.5736 0.0019 20.27s\n", " 40 0.5457 0.0016 19.60s\n", " 50 0.5045 0.0015 18.76s\n", " 60 0.4820 0.0012 18.20s\n", " 70 0.4756 0.0010 17.44s\n", " 80 0.4484 0.0009 16.91s\n", " 90 0.4410 0.0007 16.34s\n", " 100 0.4195 0.0004 15.72s\n", " 200 0.3348 0.0001 10.05s\n", " 300 0.2933 -0.0000 4.76s\n", " 400 0.2658 -0.0000 0.00s\n", "CV score: 0.45583290\n" ] } ], "source": [ "# GradientBoostingRegressor (gradient-boosted decision trees)\n", "folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2018)\n", "oof_gbr_263 = np.zeros(train_shape)\n", "predictions_gbr_263 = np.zeros(len(X_test_263))\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " tr_x = X_train_263[trn_idx]\n", " tr_y = y_train[trn_idx]\n", " gbr_263 = gbr(n_estimators=400, learning_rate=0.01,subsample=0.65,max_depth=7, min_samples_leaf=20,\n", " max_features=0.22,verbose=1)\n", " gbr_263.fit(tr_x,tr_y)\n", " oof_gbr_263[val_idx] = gbr_263.predict(X_train_263[val_idx])\n", " \n", " predictions_gbr_263 += gbr_263.predict(X_test_263) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_gbr_263, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. 
ExtraTreesRegressor (extremely randomized trees regression)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 0.4s\n", "[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 1.7s\n", "[Parallel(n_jobs=-1)]: Done 434 tasks | elapsed: 4.0s\n", "[Parallel(n_jobs=-1)]: Done 784 tasks | elapsed: 7.2s\n", "[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 9.0s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed: 0.1s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed: 0.1s finished\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "fold n°2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 0.3s\n", "[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 1.6s\n", "[Parallel(n_jobs=-1)]: Done 434 tasks | elapsed: 3.8s\n", "[Parallel(n_jobs=-1)]: Done 784 tasks | elapsed: 6.9s\n", "[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 8.9s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend 
with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed: 0.1s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed: 0.1s finished\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "fold n°3\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 0.4s\n", "[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 1.7s\n", "[Parallel(n_jobs=-1)]: Done 434 tasks | elapsed: 4.1s\n", "[Parallel(n_jobs=-1)]: Done 784 tasks | elapsed: 7.6s\n", "[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 9.6s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed: 0.1s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1000 out of 
1000 | elapsed: 0.1s finished\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "fold n°4\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 0.4s\n", "[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 1.7s\n", "[Parallel(n_jobs=-1)]: Done 434 tasks | elapsed: 4.0s\n", "[Parallel(n_jobs=-1)]: Done 784 tasks | elapsed: 7.6s\n", "[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 10.6s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed: 0.2s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed: 0.2s finished\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "fold n°5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=-1)]: Done 34 tasks | elapsed: 0.4s\n", "[Parallel(n_jobs=-1)]: Done 184 tasks | elapsed: 1.9s\n", "[Parallel(n_jobs=-1)]: Done 434 tasks | elapsed: 4.4s\n", "[Parallel(n_jobs=-1)]: Done 784 tasks | elapsed: 8.6s\n", "[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 10.7s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | 
elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed: 0.1s finished\n", "[Parallel(n_jobs=8)]: Using backend ThreadingBackend with 8 concurrent workers.\n", "[Parallel(n_jobs=8)]: Done 34 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 184 tasks | elapsed: 0.0s\n", "[Parallel(n_jobs=8)]: Done 434 tasks | elapsed: 0.1s\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "CV score: 0.48598792\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=8)]: Done 784 tasks | elapsed: 0.1s\n", "[Parallel(n_jobs=8)]: Done 1000 out of 1000 | elapsed: 0.1s finished\n" ] } ], "source": [ "# ExtraTreesRegressor: extremely randomized trees regression\n", "folds = KFold(n_splits=5, shuffle=True, random_state=13)\n", "oof_etr_263 = np.zeros(train_shape)\n", "predictions_etr_263 = np.zeros(len(X_test_263))\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_263, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " tr_x = X_train_263[trn_idx]\n", " tr_y = y_train[trn_idx]\n", " etr_263 = etr(n_estimators=1000,max_depth=8, min_samples_leaf=12, min_weight_fraction_leaf=0.0,\n", " max_features=0.4,verbose=1,n_jobs=-1)# max_features: fraction of features considered at each split\n", " etr_263.fit(tr_x,tr_y)\n", " oof_etr_263[val_idx] = etr_263.predict(X_train_263[val_idx])\n", " \n", " predictions_etr_263 += etr_263.predict(X_test_263) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_etr_263, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point we have the out-of-fold and test predictions of the five models above on the 263-feature data, together with their configurations and parameters. For each feature set, the base-model predictions are then stacked with a Kernel Ridge Regression meta-model, using 5-fold cross-validation repeated twice, to obtain the result for that feature set." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold 0\n", "fold 1\n", "fold 2\n", "fold 3\n", "fold 
4\n", "fold 5\n", "fold 6\n", "fold 7\n", "fold 8\n", "fold 9\n" ] }, { "data": { "text/plain": [ "0.44815130114230267" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_stack2 = np.vstack([oof_lgb_263,oof_xgb_263,oof_gbr_263,oof_rfr_263,oof_etr_263]).transpose()\n", "# vstack gives shape (n_models, n_samples); transpose() turns it into (n_samples, n_models), one column per base model\n", "test_stack2 = np.vstack([predictions_lgb_263, predictions_xgb_263,predictions_gbr_263,predictions_rfr_263,predictions_etr_263]).transpose()\n", "\n", "# Cross-validation: 5 folds, repeated twice (10 fits in total)\n", "folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)\n", "oof_stack2 = np.zeros(train_stack2.shape[0])\n", "predictions_lr2 = np.zeros(test_stack2.shape[0])\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack2,target)):\n", " print(\"fold {}\".format(fold_))\n", " trn_data, trn_y = train_stack2[trn_idx], target.iloc[trn_idx].values\n", " val_data, val_y = train_stack2[val_idx], target.iloc[val_idx].values\n", " #Kernel Ridge Regression\n", " lr2 = kr()\n", " lr2.fit(trn_data, trn_y)\n", " \n", " oof_stack2[val_idx] = lr2.predict(val_data)\n", " predictions_lr2 += lr2.predict(test_stack2) / 10\n", " \n", "mean_squared_error(target.values, oof_stack2) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we apply the same procedure to the 49-feature data as we did for the 263-feature data.\n", "\n", "1. lightGBM" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n", "Training until validation scores don't improve for 1000 rounds\n", "[1000]\ttraining's l2: 0.46958\tvalid_1's l2: 0.500767\n", "[2000]\ttraining's l2: 0.429395\tvalid_1's l2: 0.482214\n", "[3000]\ttraining's l2: 0.406748\tvalid_1's l2: 0.477959\n", "[4000]\ttraining's l2: 0.388735\tvalid_1's l2: 0.476283\n", "[5000]\ttraining's l2: 0.373399\tvalid_1's l2: 0.475506\n", "[6000]\ttraining's l2: 0.359798\tvalid_1's l2: 0.475435\n", "Early stopping, best iteration is:\n", "[5429]\ttraining's 
l2: 0.367348\tvalid_1's l2: 0.475325\n", "fold n°2\n", "Training until validation scores don't improve for 1000 rounds\n", "[1000]\ttraining's l2: 0.469767\tvalid_1's l2: 0.496741\n", "[2000]\ttraining's l2: 0.428546\tvalid_1's l2: 0.479198\n", "[3000]\ttraining's l2: 0.405733\tvalid_1's l2: 0.475903\n", "[4000]\ttraining's l2: 0.388021\tvalid_1's l2: 0.474891\n", "[5000]\ttraining's l2: 0.372619\tvalid_1's l2: 0.474262\n", "[6000]\ttraining's l2: 0.358826\tvalid_1's l2: 0.47449\n", "Early stopping, best iteration is:\n", "[5002]\ttraining's l2: 0.372597\tvalid_1's l2: 0.47425\n", "fold n°3\n", "Training until validation scores don't improve for 1000 rounds\n", "[1000]\ttraining's l2: 0.47361\tvalid_1's l2: 0.4839\n", "[2000]\ttraining's l2: 0.433064\tvalid_1's l2: 0.462219\n", "[3000]\ttraining's l2: 0.410658\tvalid_1's l2: 0.457989\n", "[4000]\ttraining's l2: 0.392859\tvalid_1's l2: 0.456091\n", "[5000]\ttraining's l2: 0.377706\tvalid_1's l2: 0.455416\n", "[6000]\ttraining's l2: 0.364058\tvalid_1's l2: 0.455285\n", "Early stopping, best iteration is:\n", "[5815]\ttraining's l2: 0.3665\tvalid_1's l2: 0.455119\n", "fold n°4\n", "Training until validation scores don't improve for 1000 rounds\n", "[1000]\ttraining's l2: 0.471715\tvalid_1's l2: 0.496877\n", "[2000]\ttraining's l2: 0.431956\tvalid_1's l2: 0.472828\n", "[3000]\ttraining's l2: 0.409505\tvalid_1's l2: 0.467016\n", "[4000]\ttraining's l2: 0.391659\tvalid_1's l2: 0.464929\n", "[5000]\ttraining's l2: 0.376239\tvalid_1's l2: 0.464048\n", "[6000]\ttraining's l2: 0.36213\tvalid_1's l2: 0.463628\n", "[7000]\ttraining's l2: 0.349338\tvalid_1's l2: 0.463767\n", "Early stopping, best iteration is:\n", "[6272]\ttraining's l2: 0.358584\tvalid_1's l2: 0.463542\n", "fold n°5\n", "Training until validation scores don't improve for 1000 rounds\n", "[1000]\ttraining's l2: 0.466349\tvalid_1's l2: 0.507696\n", "[2000]\ttraining's l2: 0.425606\tvalid_1's l2: 0.492745\n", "[3000]\ttraining's l2: 0.403731\tvalid_1's l2: 
0.488917\n", "[4000]\ttraining's l2: 0.386479\tvalid_1's l2: 0.487113\n", "[5000]\ttraining's l2: 0.371358\tvalid_1's l2: 0.485881\n", "[6000]\ttraining's l2: 0.357821\tvalid_1's l2: 0.485185\n", "[7000]\ttraining's l2: 0.345577\tvalid_1's l2: 0.484535\n", "[8000]\ttraining's l2: 0.33415\tvalid_1's l2: 0.484483\n", "Early stopping, best iteration is:\n", "[7649]\ttraining's l2: 0.338078\tvalid_1's l2: 0.484416\n", "CV score: 0.47052692\n" ] } ], "source": [ "##### lgb_49\n", "lgb_49_param = {\n", "'num_leaves': 9,\n", "'min_data_in_leaf': 23,\n", "'objective':'regression',\n", "'max_depth': -1,\n", "'learning_rate': 0.002,\n", "\"boosting\": \"gbdt\",\n", "\"feature_fraction\": 0.45, \n", "\"bagging_freq\": 1,\n", "\"bagging_fraction\": 0.65, \n", "\"bagging_seed\": 15,\n", "\"metric\": 'mse',\n", "\"lambda_l2\": 0.2, \n", "\"verbosity\": -1} # min_data_in_leaf: minimum number of samples in a leaf. feature_fraction: 45% of the features are randomly selected before each tree is trained, which speeds up training and helps against overfitting. bagging_fraction: a random subset of the data is selected without resampling, which also speeds up training and helps against overfitting.\n", "folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=9) \n", "oof_lgb_49 = np.zeros(len(X_train_49))\n", "predictions_lgb_49 = np.zeros(len(X_test_49))\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " trn_data = lgb.Dataset(X_train_49[trn_idx], y_train[trn_idx])\n", " val_data = lgb.Dataset(X_train_49[val_idx], y_train[val_idx])\n", "\n", " num_round = 12000\n", " lgb_49 = lgb.train(lgb_49_param, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=1000, early_stopping_rounds = 1000)\n", " oof_lgb_49[val_idx] = lgb_49.predict(X_train_49[val_idx], num_iteration=lgb_49.best_iteration)\n", " predictions_lgb_49 += lgb_49.predict(X_test_49, num_iteration=lgb_49.best_iteration) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_lgb_49, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 
xgboost" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n", "[19:25:31] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "[19:25:31] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480: \n", "Parameters: { silent } might not be used.\n", "\n", " This may not be accurate due to some parameters are only used in language bindings but\n", " passed down to XGBoost core. Or some parameters are not used but slip through this\n", " verification. Please open an issue if you find above cases.\n", "\n", "\n", "[0]\ttrain-rmse:3.40431\tvalid_data-rmse:3.38307\n", "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", "\n", "Will train until valid_data-rmse hasn't improved in 600 rounds.\n", "[500]\ttrain-rmse:0.52770\tvalid_data-rmse:0.72110\n", "[1000]\ttrain-rmse:0.43563\tvalid_data-rmse:0.72245\n", "Stopping. Best iteration:\n", "[690]\ttrain-rmse:0.49010\tvalid_data-rmse:0.72044\n", "\n", "[19:25:44] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "fold n°2\n", "[19:25:44] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "[19:25:44] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480: \n", "Parameters: { silent } might not be used.\n", "\n", " This may not be accurate due to some parameters are only used in language bindings but\n", " passed down to XGBoost core. Or some parameters are not used but slip through this\n", " verification. 
Please open an issue if you find above cases.\n", "\n", "\n", "[0]\ttrain-rmse:3.39815\tvalid_data-rmse:3.40784\n", "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", "\n", "Will train until valid_data-rmse hasn't improved in 600 rounds.\n", "[500]\ttrain-rmse:0.52871\tvalid_data-rmse:0.70336\n", "[1000]\ttrain-rmse:0.43793\tvalid_data-rmse:0.70446\n", "Stopping. Best iteration:\n", "[754]\ttrain-rmse:0.47982\tvalid_data-rmse:0.70278\n", "\n", "[19:25:57] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "fold n°3\n", "[19:25:57] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "[19:25:57] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480: \n", "Parameters: { silent } might not be used.\n", "\n", " This may not be accurate due to some parameters are only used in language bindings but\n", " passed down to XGBoost core. Or some parameters are not used but slip through this\n", " verification. Please open an issue if you find above cases.\n", "\n", "\n", "[0]\ttrain-rmse:3.40183\tvalid_data-rmse:3.39291\n", "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", "\n", "Will train until valid_data-rmse hasn't improved in 600 rounds.\n", "[500]\ttrain-rmse:0.53169\tvalid_data-rmse:0.66896\n", "[1000]\ttrain-rmse:0.44129\tvalid_data-rmse:0.67058\n", "Stopping. 
Best iteration:\n", "[452]\ttrain-rmse:0.54177\tvalid_data-rmse:0.66871\n", "\n", "[19:26:07] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "fold n°4\n", "[19:26:07] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "[19:26:07] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480: \n", "Parameters: { silent } might not be used.\n", "\n", " This may not be accurate due to some parameters are only used in language bindings but\n", " passed down to XGBoost core. Or some parameters are not used but slip through this\n", " verification. Please open an issue if you find above cases.\n", "\n", "\n", "[0]\ttrain-rmse:3.40240\tvalid_data-rmse:3.39014\n", "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", "\n", "Will train until valid_data-rmse hasn't improved in 600 rounds.\n", "[500]\ttrain-rmse:0.53218\tvalid_data-rmse:0.67783\n", "[1000]\ttrain-rmse:0.44361\tvalid_data-rmse:0.67978\n", "Stopping. Best iteration:\n", "[566]\ttrain-rmse:0.51924\tvalid_data-rmse:0.67765\n", "\n", "[19:26:18] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "fold n°5\n", "[19:26:19] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "[19:26:19] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:480: \n", "Parameters: { silent } might not be used.\n", "\n", " This may not be accurate due to some parameters are only used in language bindings but\n", " passed down to XGBoost core. Or some parameters are not used but slip through this\n", " verification. 
Please open an issue if you find above cases.\n", "\n", "\n", "[0]\ttrain-rmse:3.39345\tvalid_data-rmse:3.42619\n", "Multiple eval metrics have been passed: 'valid_data-rmse' will be used for early stopping.\n", "\n", "Will train until valid_data-rmse hasn't improved in 600 rounds.\n", "[500]\ttrain-rmse:0.53565\tvalid_data-rmse:0.66150\n", "[1000]\ttrain-rmse:0.44204\tvalid_data-rmse:0.66241\n", "Stopping. Best iteration:\n", "[747]\ttrain-rmse:0.48554\tvalid_data-rmse:0.66016\n", "\n", "[19:26:32] WARNING: /Users/travis/build/dmlc/xgboost/src/objective/regression_obj.cu:170: reg:linear is now deprecated in favor of reg:squarederror.\n", "CV score: 0.47102840\n" ] } ], "source": [ "##### xgb_49\n", "xgb_49_params = {'eta': 0.02, \n", " 'max_depth': 5, \n", " 'min_child_weight':3,\n", " 'gamma':0,\n", " 'subsample': 0.7, \n", " 'colsample_bytree': 0.35, \n", " 'lambda':2,\n", " 'objective': 'reg:linear', \n", " 'eval_metric': 'rmse', \n", " 'silent': True, \n", " 'nthread': -1}\n", "\n", "\n", "folds = KFold(n_splits=5, shuffle=True, random_state=2019)\n", "oof_xgb_49 = np.zeros(len(X_train_49))\n", "predictions_xgb_49 = np.zeros(len(X_test_49))\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " trn_data = xgb.DMatrix(X_train_49[trn_idx], y_train[trn_idx])\n", " val_data = xgb.DMatrix(X_train_49[val_idx], y_train[val_idx])\n", "\n", " watchlist = [(trn_data, 'train'), (val_data, 'valid_data')]\n", " xgb_49 = xgb.train(dtrain=trn_data, num_boost_round=3000, evals=watchlist, early_stopping_rounds=600, verbose_eval=500, params=xgb_49_params)\n", " oof_xgb_49[val_idx] = xgb_49.predict(xgb.DMatrix(X_train_49[val_idx]), ntree_limit=xgb_49.best_ntree_limit)\n", " predictions_xgb_49 += xgb_49.predict(xgb.DMatrix(X_test_49), ntree_limit=xgb_49.best_ntree_limit) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_xgb_49, target)))" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "3. GradientBoostingRegressor梯度提升决策树" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n", " Iter Train Loss OOB Improve Remaining Time \n", " 1 0.6529 0.0032 9.69s\n", " 2 0.6736 0.0029 9.55s\n", " 3 0.6522 0.0029 9.29s\n", " 4 0.6393 0.0034 9.49s\n", " 5 0.6454 0.0032 9.36s\n", " 6 0.6467 0.0031 9.22s\n", " 7 0.6650 0.0026 9.23s\n", " 8 0.6225 0.0030 9.20s\n", " 9 0.6350 0.0028 9.09s\n", " 10 0.6311 0.0028 9.25s\n", " 20 0.6074 0.0022 8.67s\n", " 30 0.5790 0.0017 8.19s\n", " 40 0.5443 0.0016 7.89s\n", " 50 0.5405 0.0013 7.63s\n", " 60 0.5141 0.0010 7.47s\n", " 70 0.4991 0.0008 7.28s\n", " 80 0.4791 0.0007 7.12s\n", " 90 0.4707 0.0006 6.92s\n", " 100 0.4632 0.0006 6.74s\n", " 200 0.4013 0.0001 5.09s\n", " 300 0.3924 -0.0001 3.62s\n", " 400 0.3526 -0.0000 2.32s\n", " 500 0.3355 -0.0000 1.12s\n", " 600 0.3201 -0.0000 0.00s\n", "fold n°2\n", " Iter Train Loss OOB Improve Remaining Time \n", " 1 0.6518 0.0034 8.83s\n", " 2 0.6618 0.0033 8.42s\n", " 3 0.6483 0.0032 8.28s\n", " 4 0.6592 0.0029 8.27s\n", " 5 0.6386 0.0030 8.18s\n", " 6 0.6438 0.0031 8.16s\n", " 7 0.6477 0.0033 8.12s\n", " 8 0.6593 0.0029 8.15s\n", " 9 0.6182 0.0029 8.19s\n", " 10 0.6358 0.0028 8.32s\n", " 20 0.5810 0.0025 7.91s\n", " 30 0.5816 0.0020 7.74s\n", " 40 0.5529 0.0013 7.53s\n", " 50 0.5402 0.0011 7.38s\n", " 60 0.5096 0.0011 7.17s\n", " 70 0.4883 0.0010 7.03s\n", " 80 0.4980 0.0007 6.84s\n", " 90 0.4706 0.0006 6.71s\n", " 100 0.4704 0.0004 6.55s\n", " 200 0.3867 0.0001 5.01s\n", " 300 0.3686 -0.0000 3.60s\n", " 400 0.3363 -0.0000 2.32s\n", " 500 0.3357 -0.0000 1.13s\n", " 600 0.3160 -0.0000 0.00s\n", "fold n°3\n", " Iter Train Loss OOB Improve Remaining Time \n", " 1 0.6457 0.0038 8.04s\n", " 2 0.6687 0.0033 8.08s\n", " 3 0.6462 0.0036 8.04s\n", " 4 0.6587 0.0035 8.02s\n", " 5 0.6430 0.0031 7.99s\n", " 6 0.6540 0.0029 7.95s\n", " 7 0.6377 
0.0030 7.93s\n", " 8 0.6414 0.0030 7.97s\n", " 9 0.6399 0.0030 8.07s\n", " 10 0.6375 0.0028 8.07s\n", " 20 0.5949 0.0025 7.67s\n", " 30 0.5854 0.0019 7.72s\n", " 40 0.5386 0.0016 7.46s\n", " 50 0.5156 0.0013 7.32s\n", " 60 0.5080 0.0011 7.17s\n", " 70 0.5021 0.0009 7.04s\n", " 80 0.4654 0.0008 6.85s\n", " 90 0.4712 0.0006 6.72s\n", " 100 0.4740 0.0006 6.53s\n", " 200 0.3924 0.0000 4.96s\n", " 300 0.3568 -0.0000 3.58s\n", " 400 0.3400 -0.0001 2.31s\n", " 500 0.3283 -0.0001 1.12s\n", " 600 0.3044 -0.0000 0.00s\n", "fold n°4\n", " Iter Train Loss OOB Improve Remaining Time \n", " 1 0.6606 0.0032 8.27s\n", " 2 0.6878 0.0030 8.37s\n", " 3 0.6490 0.0031 8.37s\n", " 4 0.6564 0.0032 8.29s\n", " 5 0.6568 0.0027 8.27s\n", " 6 0.6496 0.0030 8.27s\n", " 7 0.6451 0.0029 8.22s\n", " 8 0.6210 0.0031 8.21s\n", " 9 0.6239 0.0028 8.35s\n", " 10 0.6535 0.0025 8.35s\n", " 20 0.6038 0.0022 7.92s\n", " 30 0.6032 0.0019 7.76s\n", " 40 0.5492 0.0018 7.55s\n", " 50 0.5333 0.0011 7.37s\n", " 60 0.4973 0.0010 7.24s\n", " 70 0.4942 0.0009 7.09s\n", " 80 0.4753 0.0008 6.92s\n", " 90 0.4806 0.0005 6.76s\n", " 100 0.4659 0.0005 6.58s\n", " 200 0.4046 0.0000 4.99s\n", " 300 0.3647 -0.0000 3.59s\n", " 400 0.3561 -0.0000 2.32s\n", " 500 0.3330 -0.0000 1.12s\n", " 600 0.3152 -0.0000 0.00s\n", "fold n°5\n", " Iter Train Loss OOB Improve Remaining Time \n", " 1 0.6721 0.0036 8.28s\n", " 2 0.6822 0.0034 8.41s\n", " 3 0.6634 0.0033 8.26s\n", " 4 0.6584 0.0032 8.21s\n", " 5 0.6574 0.0030 8.40s\n", " 6 0.6544 0.0033 8.31s\n", " 7 0.6533 0.0028 8.30s\n", " 8 0.6196 0.0029 8.27s\n", " 9 0.6530 0.0028 8.43s\n", " 10 0.6108 0.0032 8.49s\n", " 20 0.6107 0.0027 7.91s\n", " 30 0.5649 0.0020 7.70s\n", " 40 0.5555 0.0016 7.55s\n", " 50 0.5156 0.0014 7.40s\n", " 60 0.5144 0.0010 7.21s\n", " 70 0.5001 0.0009 7.05s\n", " 80 0.4908 0.0007 6.88s\n", " 90 0.4820 0.0008 6.73s\n", " 100 0.4617 0.0007 6.55s\n", " 200 0.3993 -0.0000 5.01s\n", " 300 0.3678 -0.0000 3.61s\n", " 400 0.3399 -0.0000 2.31s\n", " 500 0.3182 -0.0000 
1.12s\n", " 600 0.3238 -0.0000 0.00s\n", "CV score: 0.46724198\n" ] } ], "source": [ "folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=2018)\n", "oof_gbr_49 = np.zeros(train_shape)\n", "predictions_gbr_49 = np.zeros(len(X_test_49))\n", "# GradientBoostingRegressor: gradient-boosted decision trees\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " tr_x = X_train_49[trn_idx]\n", " tr_y = y_train[trn_idx]\n", " gbr_49 = gbr(n_estimators=600, learning_rate=0.01,subsample=0.65,max_depth=6, min_samples_leaf=20,\n", " max_features=0.35,verbose=1)\n", " gbr_49.fit(tr_x,tr_y)\n", " oof_gbr_49[val_idx] = gbr_49.predict(X_train_49[val_idx])\n", " \n", " predictions_gbr_49 += gbr_49.predict(X_test_49) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_gbr_49, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point we have the out-of-fold and test predictions of the three models above on the 49-feature data, together with their configurations and parameters. For each feature set, the base-model predictions are then stacked with a Kernel Ridge Regression meta-model, using 5-fold cross-validation repeated twice, to obtain the result for that feature set." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold 0\n", "fold 1\n", "fold 2\n", "fold 3\n", "fold 4\n", "fold 5\n", "fold 6\n", "fold 7\n", "fold 8\n", "fold 9\n" ] }, { "data": { "text/plain": [ "0.4662728551415085" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_stack3 = np.vstack([oof_lgb_49,oof_xgb_49,oof_gbr_49]).transpose()\n", "test_stack3 = np.vstack([predictions_lgb_49, predictions_xgb_49,predictions_gbr_49]).transpose()\n", "# Cross-validation: 5 folds, repeated twice (10 fits in total)\n", "folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)\n", "oof_stack3 = np.zeros(train_stack3.shape[0])\n", "predictions_lr3 = np.zeros(test_stack3.shape[0])\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack3,target)):\n", " print(\"fold {}\".format(fold_))\n", " trn_data, trn_y = train_stack3[trn_idx], 
target.iloc[trn_idx].values\n", " val_data, val_y = train_stack3[val_idx], target.iloc[val_idx].values\n", " #Kernel Ridge Regression\n", " lr3 = kr()\n", " lr3.fit(trn_data, trn_y)\n", " \n", " oof_stack3[val_idx] = lr3.predict(val_data)\n", " predictions_lr3 += lr3.predict(test_stack3) / 10\n", " \n", "mean_squared_error(target.values, oof_stack3) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we apply the same procedure to the 383-feature data as we did for the 263- and 49-feature data.\n", "\n", "1. Kernel Ridge Regression" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n", "fold n°2\n", "fold n°3\n", "fold n°4\n", "fold n°5\n", "CV score: 0.51412085\n" ] } ], "source": [ "folds = KFold(n_splits=5, shuffle=True, random_state=13)\n", "oof_kr_383 = np.zeros(train_shape)\n", "predictions_kr_383 = np.zeros(len(X_test_383))\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_383, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " tr_x = X_train_383[trn_idx]\n", " tr_y = y_train[trn_idx]\n", " # Kernel Ridge Regression with default settings (linear kernel)\n", " kr_383 = kr()\n", " kr_383.fit(tr_x,tr_y)\n", " oof_kr_383[val_idx] = kr_383.predict(X_train_383[val_idx])\n", " \n", " predictions_kr_383 += kr_383.predict(X_test_383) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_kr_383, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 
Plain Ridge regression" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n", "fold n°2\n", "fold n°3\n", "fold n°4\n", "fold n°5\n", "CV score: 0.48687670\n" ] } ], "source": [ "folds = KFold(n_splits=5, shuffle=True, random_state=13)\n", "oof_ridge_383 = np.zeros(train_shape)\n", "predictions_ridge_383 = np.zeros(len(X_test_383))\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_383, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " tr_x = X_train_383[trn_idx]\n", " tr_y = y_train[trn_idx]\n", " # Ridge regression with a strong L2 penalty (alpha=1200)\n", " ridge_383 = Ridge(alpha=1200)\n", " ridge_383.fit(tr_x,tr_y)\n", " oof_ridge_383[val_idx] = ridge_383.predict(X_train_383[val_idx])\n", " \n", " predictions_ridge_383 += ridge_383.predict(X_test_383) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_ridge_383, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. 
ElasticNet" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n", "fold n°2\n", "fold n°3\n", "fold n°4\n", "fold n°5\n", "CV score: 0.53296555\n" ] } ], "source": [ "folds = KFold(n_splits=5, shuffle=True, random_state=13)\n", "oof_en_383 = np.zeros(train_shape)\n", "predictions_en_383 = np.zeros(len(X_test_383))\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_383, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " tr_x = X_train_383[trn_idx]\n", " tr_y = y_train[trn_idx]\n", " # ElasticNet: combined L1/L2 penalty (l1_ratio=0.06, so mostly L2)\n", " en_383 = en(alpha=1.0,l1_ratio=0.06)\n", " en_383.fit(tr_x,tr_y)\n", " oof_en_383[val_idx] = en_383.predict(X_train_383[val_idx])\n", " \n", " predictions_en_383 += en_383.predict(X_test_383) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_en_383, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. 
BayesianRidge (Bayesian ridge regression)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n", "fold n°2\n", "fold n°3\n", "fold n°4\n", "fold n°5\n", "CV score: 0.48717310\n" ] } ], "source": [ "folds = KFold(n_splits=5, shuffle=True, random_state=13)\n", "oof_br_383 = np.zeros(train_shape)\n", "predictions_br_383 = np.zeros(len(X_test_383))\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_383, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " tr_x = X_train_383[trn_idx]\n", " tr_y = y_train[trn_idx]\n", " # BayesianRidge: Bayesian ridge regression with default settings\n", " br_383 = br()\n", " br_383.fit(tr_x,tr_y)\n", " oof_br_383[val_idx] = br_383.predict(X_train_383[val_idx])\n", " \n", " predictions_br_383 += br_383.predict(X_test_383) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_br_383, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point we have the out-of-fold and test predictions of the four models above on the 383-feature data, together with their configurations and parameters. For each feature set, the base-model predictions are then stacked with a simple LinearRegression meta-model, using 5-fold cross-validation repeated twice, to obtain the result for that feature set." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold 0\n", "fold 1\n", "fold 2\n", "fold 3\n", "fold 4\n", "fold 5\n", "fold 6\n", "fold 7\n", "fold 8\n", "fold 9\n" ] }, { "data": { "text/plain": [ "0.4878202780283125" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_stack1 = np.vstack([oof_br_383,oof_kr_383,oof_en_383,oof_ridge_383]).transpose()\n", "test_stack1 = np.vstack([predictions_br_383, predictions_kr_383,predictions_en_383,predictions_ridge_383]).transpose()\n", "\n", "folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)\n", "oof_stack1 = np.zeros(train_stack1.shape[0])\n", "predictions_lr1 = np.zeros(test_stack1.shape[0])\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack1,target)):\n", " 
print(\"fold {}\".format(fold_))\n", " trn_data, trn_y = train_stack1[trn_idx], target.iloc[trn_idx].values\n", " val_data, val_y = train_stack1[val_idx], target.iloc[val_idx].values\n", " # LinearRegression: plain least-squares meta-model\n", " lr1 = lr()\n", " lr1.fit(trn_data, trn_y)\n", " \n", " oof_stack1[val_idx] = lr1.predict(val_data)\n", " predictions_lr1 += lr1.predict(test_stack1) / 10\n", " \n", "mean_squared_error(target.values, oof_stack1) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the 49 selected features are the most important ones, we train additional models on the 49-feature data.\n", "\n", "1. KernelRidge (kernel ridge regression)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n", "fold n°2\n", "fold n°3\n", "fold n°4\n", "fold n°5\n", "CV score: 0.50254410\n" ] } ], "source": [ "folds = KFold(n_splits=5, shuffle=True, random_state=13)\n", "oof_kr_49 = np.zeros(train_shape)\n", "predictions_kr_49 = np.zeros(len(X_test_49))\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " tr_x = X_train_49[trn_idx]\n", " tr_y = y_train[trn_idx]\n", " kr_49 = kr()\n", " kr_49.fit(tr_x,tr_y)\n", " oof_kr_49[val_idx] = kr_49.predict(X_train_49[val_idx])\n", " \n", " predictions_kr_49 += kr_49.predict(X_test_49) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_kr_49, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. 
Ridge (ridge regression)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n", "fold n°2\n", "fold n°3\n", "fold n°4\n", "fold n°5\n", "CV score: 0.49451286\n" ] } ], "source": [ "folds = KFold(n_splits=5, shuffle=True, random_state=13)\n", "oof_ridge_49 = np.zeros(train_shape)\n", "predictions_ridge_49 = np.zeros(len(X_test_49))\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " tr_x = X_train_49[trn_idx]\n", " tr_y = y_train[trn_idx]\n", " ridge_49 = Ridge(alpha=6)\n", " ridge_49.fit(tr_x,tr_y)\n", " oof_ridge_49[val_idx] = ridge_49.predict(X_train_49[val_idx])\n", " \n", " predictions_ridge_49 += ridge_49.predict(X_test_49) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_ridge_49, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. BayesianRidge (Bayesian ridge regression)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n", "fold n°2\n", "fold n°3\n", "fold n°4\n", "fold n°5\n", "CV score: 0.49534595\n" ] } ], "source": [ "folds = KFold(n_splits=5, shuffle=True, random_state=13)\n", "oof_br_49 = np.zeros(train_shape)\n", "predictions_br_49 = np.zeros(len(X_test_49))\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " tr_x = X_train_49[trn_idx]\n", " tr_y = y_train[trn_idx]\n", " br_49 = br()\n", " br_49.fit(tr_x,tr_y)\n", " oof_br_49[val_idx] = br_49.predict(X_train_49[val_idx])\n", " \n", " predictions_br_49 += br_49.predict(X_test_49) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_br_49, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. 
ElasticNet (elastic net)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold n°1\n", "fold n°2\n", "fold n°3\n", "fold n°4\n", "fold n°5\n", "CV score: 0.53841695\n" ] } ], "source": [ "folds = KFold(n_splits=5, shuffle=True, random_state=13)\n", "oof_en_49 = np.zeros(train_shape)\n", "predictions_en_49 = np.zeros(len(X_test_49))\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds.split(X_train_49, y_train)):\n", " print(\"fold n°{}\".format(fold_+1))\n", " tr_x = X_train_49[trn_idx]\n", " tr_y = y_train[trn_idx]\n", " en_49 = en(alpha=1.0,l1_ratio=0.05)\n", " en_49.fit(tr_x,tr_y)\n", " oof_en_49[val_idx] = en_49.predict(X_train_49[val_idx])\n", " \n", " predictions_en_49 += en_49.predict(X_test_49) / folds.n_splits\n", "\n", "print(\"CV score: {:<8.8f}\".format(mean_squared_error(oof_en_49, target)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have predictions from the four additional models above on the 49-feature set, together with their fitted estimators and parameters. As before, we stack these four sets of out-of-fold predictions with a simple LinearRegression meta-model, using 5-fold cross-validation repeated twice, to obtain one stacked result for this feature set." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold 0\n", "fold 1\n", "fold 2\n", "fold 3\n", "fold 4\n", "fold 5\n", "fold 6\n", "fold 7\n", "fold 8\n", "fold 9\n" ] }, { "data": { "text/plain": [ "0.49491439094008133" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_stack4 = np.vstack([oof_br_49,oof_kr_49,oof_en_49,oof_ridge_49]).transpose()\n", "test_stack4 = np.vstack([predictions_br_49, predictions_kr_49,predictions_en_49,predictions_ridge_49]).transpose()\n", "\n", "folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)\n", "oof_stack4 = np.zeros(train_stack4.shape[0])\n", "predictions_lr4 = np.zeros(test_stack4.shape[0])\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack4,target)):\n", " print(\"fold {}\".format(fold_))\n", 
" trn_data, trn_y = train_stack4[trn_idx], target.iloc[trn_idx].values\n", " val_data, val_y = train_stack4[val_idx], target.iloc[val_idx].values\n", " #LinearRegression\n", " lr4 = lr()\n", " lr4.fit(trn_data, trn_y)\n", " \n", " oof_stack4[val_idx] = lr4.predict(val_data)\n", " predictions_lr4 += lr4.predict(test_stack1) / 10\n", " \n", "mean_squared_error(target.values, oof_stack4) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 模型融合\n", "\n", "这里对于上述四种集成学习的模型的预测结果进行加权的求和,得到最终的结果,当然这种方式是很不准确的。" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.4527515432292745" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#和下面作对比\n", "mean_squared_error(target.values, 0.7*(0.6*oof_stack2 + 0.4*oof_stack3)+0.3*(0.55*oof_stack1+0.45*oof_stack4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "更好的方式是将以上的4中集成学习模型再次进行集成学习的训练,这里直接使用LinearRegression简单线性回归的进行集成。" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "fold 0\n", "fold 1\n", "fold 2\n", "fold 3\n", "fold 4\n", "fold 5\n", "fold 6\n", "fold 7\n", "fold 8\n", "fold 9\n" ] }, { "data": { "text/plain": [ "0.4480223491250565" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_stack5 = np.vstack([oof_stack1,oof_stack2,oof_stack3,oof_stack4]).transpose()\n", "test_stack5 = np.vstack([predictions_lr1, predictions_lr2,predictions_lr3,predictions_lr4]).transpose()\n", "\n", "folds_stack = RepeatedKFold(n_splits=5, n_repeats=2, random_state=7)\n", "oof_stack5 = np.zeros(train_stack5.shape[0])\n", "predictions_lr5= np.zeros(test_stack5.shape[0])\n", "\n", "for fold_, (trn_idx, val_idx) in enumerate(folds_stack.split(train_stack5,target)):\n", " print(\"fold {}\".format(fold_))\n", " trn_data, trn_y = train_stack5[trn_idx], target.iloc[trn_idx].values\n", " 
val_data, val_y = train_stack5[val_idx], target.iloc[val_idx].values\n", " # LinearRegression as the second-level meta-model\n", " lr5 = lr()\n", " lr5.fit(trn_data, trn_y)\n", " \n", " oof_stack5[val_idx] = lr5.predict(val_data)\n", " predictions_lr5 += lr5.predict(test_stack5) / 10 # average over 5 splits x 2 repeats\n", " \n", "mean_squared_error(target.values, oof_stack5) \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving the results\n", "\n", "Read the submission template, which provides the index." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 2968.000000\n", "mean 3.879322\n", "std 0.462290\n", "min 1.636433\n", "25% 3.667859\n", "50% 3.954825\n", "75% 4.185277\n", "max 5.051027\n", "Name: happiness, dtype: float64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "submit_example = pd.read_csv('submit_example.csv',sep=',',encoding='latin-1')\n", "\n", "submit_example['happiness'] = predictions_lr5\n", "\n", "submit_example.happiness.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save the results. The predictions are continuous values roughly in the range 1 to 5, whereas the ground truth takes integer values. To refine the result slightly, we snap predictions that fall very close to 1, 2, or 5 to those integers, then write everything to a CSV file." ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 2968.000000\n", "mean 3.879330\n", "std 0.462127\n", "min 1.636433\n", "25% 3.667859\n", "50% 3.954825\n", "75% 4.185277\n", "max 5.000000\n", "Name: happiness, dtype: float64" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "submit_example.loc[submit_example['happiness']>4.96,'happiness']= 5\n", "submit_example.loc[submit_example['happiness']<=1.04,'happiness']= 1\n", "submit_example.loc[(submit_example['happiness']>1.96)&(submit_example['happiness']<2.04),'happiness']= 2\n", "\n", "submit_example.to_csv(\"submision.csv\",index=False)\n", "submit_example.happiness.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can tune the model parameters further, for example with grid search. This is left as an exercise." ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "from sklearn.svm import SVC\n", "from sklearn.model_selection import train_test_split\n", " \n", "iris = load_iris()\n", "X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,random_state=0)\n", "print(\"Size of training set:{} size of testing set:{}\".format(X_train.shape[0],X_test.shape[0]))\n", " \n", "#### 1\n", "best_score = 0\n", "for gamma in [0.001,0.01,0.1,1,10,100]:\n", " for C in [0.001,0.01,0.1,1,10,100]:\n", " svm = SVC(gamma=gamma,C=C)#对于每种参数可能的组合,进行一次训练;\n", " svm.fit(X_train,y_train)\n", " score = svm.score(X_test,y_test)\n", " if score > best_score:#找到表现最好的参数\n", " best_score = score\n", " best_parameters = {'gamma':gamma,'C':C}\n", "print(\"Best score:{:.2f}\".format(best_score))\n", "\n", " \n", "\n", "#### 2\n", "from sklearn.model_selection import GridSearchCV\n", " \n", "#把要调整的参数以及其候选值 列出来;\n", "param_grid = {\"gamma\":[0.001,0.01,0.1,1,10,100],\n", " \"C\":[0.001,0.01,0.1,1,10,100]}\n", "print(\"Parameters:{}\".format(param_grid))\n", " \n", "grid_search = GridSearchCV(SVC(),param_grid,cv=5) #实例化一个GridSearchCV类,cv交叉验证参数\n", "X_train,X_test,y_train,y_test = train_test_split(iris.data,iris.target,random_state=10)\n", "grid_search.fit(X_train,y_train) #训练,找到最优的参数,同时使用最优的参数实例化一个新的SVC estimator。\n", "print(\"Test set score:{:.2f}\".format(grid_search.score(X_test,y_test)))\n", "print(\"Best parameters:{}\".format(grid_search.best_params_))\n", "\n", "#SVM模型有两个非常重要的参数C与gamma。\n", "#C是惩罚系数,即对误差的容忍度(间隔大小,分类准确度)。C越高,说明越不能容忍出现误差,容易过拟合。C越小,容易欠拟合。C过大或过小,泛化能力变差\n", "#gamma是选择RBF函数作为kernel后,该函数自带的一个参数。隐含地决定了数据映射到新的特征空间后的分布,gamma越大,支持向量越少,gamma值越小,支持向量越多。支持向量的个数影响训练与预测的速度。\n", "#两者独立\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Result:\n", " \n", "Parameters:{'gamma': [0.001, 0.01, 0.1, 1, 10, 100], 'C': [0.001, 0.01, 0.1, 1, 10, 100]}\n", "Test set 
score:0.97\n", "# Best parameters:{'C': 10, 'gamma': 0.1}\n", "# Best score on train set:0.98\n", "\n", "# The common drawback of grid search is that it is slow: the more parameters and candidate values there are, the longer it takes. In practice, start with a coarse range and then refine it." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }