{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Ensemble Learning Case Study 1 (Happiness Prediction)\n", "\n", "### Background\n", "\n", "This case study is a baseline for a data-mining competition on predicting happiness. The data come from the official Chinese General Social Survey (CGSS) and contain 139 feature dimensions, covering individual variables (gender, age, region, occupation, health, marital status, political affiliation, etc.), family variables (parents, spouse, children, family capital, etc.), and social attitudes (fairness, trust, public services).\n", "\n", "\n", "### Data\n", "The task is to predict individual happiness from these **139** features on just over **8000** records; the target takes the values 1, 2, 3, 4, 5, where 1 is the lowest happiness and 5 the highest.\n", "Because there are many variables and some of their interrelations are complex, the data come in a full version and an abridged version; one can start from the abridged version to get familiar with the problem and then mine the full version for more information. Here I use the full version directly. The organizers also provide an index file that maps each variable to its questionnaire item and the meaning of its values, and a survey file with the original questionnaire as background reference.\n", "\n", "### Evaluation metric\n", "The final evaluation metric is the mean squared error (MSE):\n", "$$Score = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - y_i^*)^2$$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import packages" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "import time \n", "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.svm import SVC, LinearSVC\n", "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.linear_model import Perceptron\n", "from sklearn.linear_model import SGDClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn import metrics\n", "from datetime import datetime\n", "import matplotlib.pyplot as plt\n", "from sklearn.metrics import roc_auc_score, roc_curve, mean_squared_error,mean_absolute_error, f1_score\n", "import lightgbm as lgb\n", "import xgboost as xgb\n", "from sklearn.ensemble import RandomForestRegressor as rfr\n", "from sklearn.ensemble import ExtraTreesRegressor as etr\n", "from sklearn.linear_model import BayesianRidge as br\n", "from sklearn.ensemble import GradientBoostingRegressor as gbr\n", "from sklearn.linear_model import Ridge\n", "from sklearn.linear_model import Lasso\n", "from sklearn.linear_model import 
LinearRegression as lr\n", "from sklearn.linear_model import ElasticNet as en\n", "from sklearn.kernel_ridge import KernelRidge as kr\n", "from sklearn.model_selection import KFold, StratifiedKFold,GroupKFold, RepeatedKFold\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn import preprocessing\n", "import logging\n", "import warnings\n", "\n", "warnings.filterwarnings('ignore') #suppress warnings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the datasets" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "train = pd.read_csv(\"train.csv\", parse_dates=['survey_time'],encoding='latin-1') \n", "test = pd.read_csv(\"test.csv\", parse_dates=['survey_time'],encoding='latin-1') #latin-1 is backward-compatible with ASCII\n", "train = train[train[\"happiness\"]!=-8].reset_index(drop=True) #drop rows whose \"happiness\" is -8\n", "train_data_copy = train.copy()\n", "target_col = \"happiness\" #target column\n", "target = train_data_copy[target_col]\n", "del train_data_copy[target_col] #remove the target column\n", "\n", "data = pd.concat([train_data_copy,test],axis=0,ignore_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect the basic statistics" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 7988.000000\n", "mean 3.867927\n", "std 0.818717\n", "min 1.000000\n", "25% 4.000000\n", "50% 4.000000\n", "75% 4.000000\n", "max 5.000000\n", "Name: happiness, dtype: float64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train.happiness.describe() #basic statistics of the target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data preprocessing\n", "\n", "First we handle the negative values scattered through the data. Since the only negative values present are -1, -2, -3 and -8, each of them is counted separately. The implementation:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "#make feature +5\n", "#the csv contains negative values -1, -2, -3, -8; treat them as problematic answers but do not drop them\n", "def getres1(row):\n", " return 
len([x for x in row.values if isinstance(x,(int,np.integer)) and x<0]) #count of all negative entries in a row; isinstance also covers numpy integer types\n", "\n", "def getres2(row):\n", " return len([x for x in row.values if isinstance(x,(int,np.integer)) and x==-8])\n", "\n", "def getres3(row):\n", " return len([x for x in row.values if isinstance(x,(int,np.integer)) and x==-1])\n", "\n", "def getres4(row):\n", " return len([x for x in row.values if isinstance(x,(int,np.integer)) and x==-2])\n", "\n", "def getres5(row):\n", " return len([x for x in row.values if isinstance(x,(int,np.integer)) and x==-3])\n", "\n", "#count the problematic answers per row\n", "data['neg1'] = data[data.columns].apply(lambda row:getres1(row),axis=1)\n", "data.loc[data['neg1']>20,'neg1'] = 20 #clip large counts for smoothing\n", "\n", "data['neg2'] = data[data.columns].apply(lambda row:getres2(row),axis=1)\n", "data['neg3'] = data[data.columns].apply(lambda row:getres3(row),axis=1)\n", "data['neg4'] = data[data.columns].apply(lambda row:getres4(row),axis=1)\n", "data['neg5'] = data[data.columns].apply(lambda row:getres5(row),axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we fill in missing values with fillna(value), where value is chosen case by case: most missing information is set to zero, the number of family members defaults to 1, and family income defaults to 66365, the mean income over all families. Part of the implementation:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "#fill missing values: 25 columns in total, 4 dropped, 21 filled\n", "#the columns below have missing entries; fill them as appropriate\n", "data['work_status'] = data['work_status'].fillna(0)\n", "data['work_yr'] = data['work_yr'].fillna(0)\n", "data['work_manage'] = data['work_manage'].fillna(0)\n", "data['work_type'] = data['work_type'].fillna(0)\n", "\n", "data['edu_yr'] = data['edu_yr'].fillna(0)\n", "data['edu_status'] = data['edu_status'].fillna(0)\n", "\n", "data['s_work_type'] = data['s_work_type'].fillna(0)\n", "data['s_work_status'] = data['s_work_status'].fillna(0)\n", "data['s_political'] = data['s_political'].fillna(0)\n", "data['s_hukou'] = data['s_hukou'].fillna(0)\n", "data['s_income'] = data['s_income'].fillna(0)\n", "data['s_birth'] = data['s_birth'].fillna(0)\n", "data['s_edu'] = data['s_edu'].fillna(0)\n", "data['s_work_exper'] = data['s_work_exper'].fillna(0)\n", "\n", 
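The long run of per-column `fillna` calls above can also be expressed as a single dict-driven pass. A minimal sketch on a toy frame; only the column names mirror the real data, the values are illustrative:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the CGSS frame (hypothetical values).
df = pd.DataFrame({
    "work_status":   [1.0, np.nan, 3.0],
    "family_m":      [4.0, np.nan, 2.0],
    "family_income": [50000.0, np.nan, 70000.0],
})

# One dict of per-column defaults replaces many individual fillna calls.
defaults = {
    "work_status": 0,                             # most missing info -> 0
    "family_m": 1,                                # family size defaults to 1
    "family_income": df["family_income"].mean(),  # mean family income
}
df = df.fillna(value=defaults)
print(df["family_income"].tolist())  # [50000.0, 60000.0, 70000.0]
```

Passing a dict to `fillna` applies each default only to its own column, which keeps the imputation rules in one place.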
"data['minor_child'] = data['minor_child'].fillna(0)\n", "data['marital_now'] = data['marital_now'].fillna(0)\n", "data['marital_1st'] = data['marital_1st'].fillna(0)\n", "data['social_neighbor']=data['social_neighbor'].fillna(0)\n", "data['social_friend']=data['social_friend'].fillna(0)\n", "data['hukou_loc']=data['hukou_loc'].fillna(1) #1 is the minimum valid hukou-location code\n", "data['family_income']=data['family_income'].fillna(66365) #mean after removing problematic values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition, information in special formats needs separate handling, most notably the time-related fields, which involve two steps. First, the \"continuous\" age is discretized into bands, six intervals here. Second, the actual age is computed: the raw table only contains the birth year and the survey time, from which each respondent's real age is derived. The implementation:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [], "source": [ "#144+1 =145\n", "#process the special columns\n", "#see happiness_index.xlsx\n", "data['survey_time'] = pd.to_datetime(data['survey_time'], format='%Y-%m-%d',errors='coerce') #errors='coerce' guards against inconsistently formatted dates\n", "data['survey_time'] = data['survey_time'].dt.year #keep only the year, to compute age\n", "data['age'] = data['survey_time']-data['birth']\n", "# print(data['age'],data['survey_time'],data['birth'])\n", "#age bands 145+1=146\n", "bins = [0,17,26,34,50,63,100]\n", "data['age_bin'] = pd.cut(data['age'], bins, labels=[0,1,2,3,4,5]) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since family income is continuous, mode imputation is not applicable there, so its missing values are filled directly with the mean. A third strategy is to impute from everyday common sense: for example, a negative \"religion\" value is taken to mean \"not religious\", and the \"frequency of religious activity\" is set to 1, i.e. never attended. This subjective imputation, filling the form as I would fill it myself, is the strategy I use most in this step." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "#religion\n", "data.loc[data['religion']<0,'religion'] = 1 #1 = not religious\n", "data.loc[data['religion_freq']<0,'religion_freq'] = 1 #1 = never attended\n", "#education\n", "data.loc[data['edu']<0,'edu'] = 4 #junior high school\n", "data.loc[data['edu_status']<0,'edu_status'] = 0\n", "data.loc[data['edu_yr']<0,'edu_yr'] = 0\n", "#personal income\n", "data.loc[data['income']<0,'income'] = 0 #treated as no income\n", "#political affiliation\n", 
"data.loc[data['political']<0,'political'] = 1 #treated as an ordinary citizen\n", "#weight\n", "data.loc[(data['weight_jin']<=80)&(data['height_cm']>=160),'weight_jin']= data['weight_jin']*2\n", "data.loc[data['weight_jin']<=60,'weight_jin']= data['weight_jin']*2 #personal judgement: no adult weighs 60 jin (30 kg)\n", "#height\n", "data.loc[data['height_cm']<150,'height_cm'] = 150 #realistic adult minimum\n", "#health\n", "data.loc[data['health']<0,'health'] = 4 #treated as fairly healthy\n", "data.loc[data['health_problem']<0,'health_problem'] = 4\n", "#depression\n", "data.loc[data['depression']<0,'depression'] = 4 #most people answer \"rarely\"\n", "#media\n", "data.loc[data['media_1']<0,'media_1'] = 1 #all set to \"never\"\n", "data.loc[data['media_2']<0,'media_2'] = 1\n", "data.loc[data['media_3']<0,'media_3'] = 1\n", "data.loc[data['media_4']<0,'media_4'] = 1\n", "data.loc[data['media_5']<0,'media_5'] = 1\n", "data.loc[data['media_6']<0,'media_6'] = 1\n", "#leisure activities\n", "data.loc[data['leisure_1']<0,'leisure_1'] = 1 #set by personal judgement\n", "data.loc[data['leisure_2']<0,'leisure_2'] = 5\n", "data.loc[data['leisure_3']<0,'leisure_3'] = 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, invalid values are corrected with the mode (mode() in the code); since these features describe leisure activities, the most frequent answer is a reasonable fill. Note that Series.mode() returns a Series, so the scalar mode()[0] is what gets assigned back. The code:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "data.loc[data['leisure_4']<0,'leisure_4'] = data['leisure_4'].mode()[0] #fill with the mode\n", "data.loc[data['leisure_5']<0,'leisure_5'] = data['leisure_5'].mode()[0]\n", "data.loc[data['leisure_6']<0,'leisure_6'] = data['leisure_6'].mode()[0]\n", "data.loc[data['leisure_7']<0,'leisure_7'] = data['leisure_7'].mode()[0]\n", "data.loc[data['leisure_8']<0,'leisure_8'] = data['leisure_8'].mode()[0]\n", "data.loc[data['leisure_9']<0,'leisure_9'] = data['leisure_9'].mode()[0]\n", "data.loc[data['leisure_10']<0,'leisure_10'] = data['leisure_10'].mode()[0]\n", "data.loc[data['leisure_11']<0,'leisure_11'] = data['leisure_11'].mode()[0]\n", "data.loc[data['leisure_12']<0,'leisure_12'] = data['leisure_12'].mode()[0]\n", "data.loc[data['socialize']<0,'socialize'] = 2 #rarely\n", 
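Because `Series.mode()` returns a Series (ties are possible), the scalar `mode()[0]` is what should be assigned back when repairing invalid codes. A minimal sketch of this mode-based fix on hypothetical data:

```python
import pandas as pd

# Hypothetical survey answers; negative codes mark invalid responses.
s = pd.Series([3, 3, 5, -8, 4])

fill = s[s > 0].mode()[0]  # most frequent valid answer -> 3
s.loc[s < 0] = fill        # overwrite only the invalid entries
print(s.tolist())          # [3, 3, 5, 3, 4]
```

Computing the mode over the valid (positive) values only keeps the invalid codes themselves from influencing the fill value.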
"data.loc[data['relax']<0,'relax'] = 4 #often\n", "data.loc[data['learn']<0,'learn'] = 1 #never\n", "#social contact\n", "data.loc[data['social_neighbor']<0,'social_neighbor'] = 0\n", "data.loc[data['social_friend']<0,'social_friend'] = 0\n", "data.loc[data['socia_outing']<0,'socia_outing'] = 1\n", "data.loc[data['neighbor_familiarity']<0,'neighbor_familiarity']= 4\n", "#perceived social fairness\n", "data.loc[data['equity']<0,'equity'] = 4\n", "#social class\n", "data.loc[data['class_10_before']<0,'class_10_before'] = 3\n", "data.loc[data['class']<0,'class'] = 5\n", "data.loc[data['class_10_after']<0,'class_10_after'] = 5\n", "data.loc[data['class_14']<0,'class_14'] = 2\n", "#work\n", "data.loc[data['work_status']<0,'work_status'] = 0\n", "data.loc[data['work_yr']<0,'work_yr'] = 0\n", "data.loc[data['work_manage']<0,'work_manage'] = 0\n", "data.loc[data['work_type']<0,'work_type'] = 0\n", "#social insurance\n", "data.loc[data['insur_1']<0,'insur_1'] = 1\n", "data.loc[data['insur_2']<0,'insur_2'] = 1\n", "data.loc[data['insur_3']<0,'insur_3'] = 1\n", "data.loc[data['insur_4']<0,'insur_4'] = 1\n", "data.loc[data['insur_1']==0,'insur_1'] = 0\n", "data.loc[data['insur_2']==0,'insur_2'] = 0\n", "data.loc[data['insur_3']==0,'insur_3'] = 0\n", "data.loc[data['insur_4']==0,'insur_4'] = 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Mean imputation (mean() in the code): since family income is continuous, the mode is no longer appropriate, so its missing values are filled directly with the mean. The code:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "#family\n", "family_income_mean = data['family_income'].mean()\n", "data.loc[data['family_income']<0,'family_income'] = family_income_mean\n", "data.loc[data['family_m']<0,'family_m'] = 2\n", "data.loc[data['family_status']<0,'family_status'] = 3\n", "data.loc[data['house']<0,'house'] = 1\n", "data.loc[data['car']<0,'car'] = 0\n", "data.loc[data['car']==2,'car'] = 0 #recode to 0/1\n", "data.loc[data['son']<0,'son'] = 1\n", "data.loc[data['daughter']<0,'daughter'] = 0\n", 
"data.loc[data['minor_child']<0,'minor_child'] = 0\n", "#marriage\n", "data.loc[data['marital_1st']<0,'marital_1st'] = 0\n", "data.loc[data['marital_now']<0,'marital_now'] = 0\n", "#spouse\n", "data.loc[data['s_birth']<0,'s_birth'] = 0\n", "data.loc[data['s_edu']<0,'s_edu'] = 0\n", "data.loc[data['s_political']<0,'s_political'] = 0\n", "data.loc[data['s_hukou']<0,'s_hukou'] = 0\n", "data.loc[data['s_income']<0,'s_income'] = 0\n", "data.loc[data['s_work_type']<0,'s_work_type'] = 0\n", "data.loc[data['s_work_status']<0,'s_work_status'] = 0\n", "data.loc[data['s_work_exper']<0,'s_work_exper'] = 0\n", "#parents\n", "data.loc[data['f_birth']<0,'f_birth'] = 1945\n", "data.loc[data['f_edu']<0,'f_edu'] = 1\n", "data.loc[data['f_political']<0,'f_political'] = 1\n", "data.loc[data['f_work_14']<0,'f_work_14'] = 2\n", "data.loc[data['m_birth']<0,'m_birth'] = 1940\n", "data.loc[data['m_edu']<0,'m_edu'] = 1\n", "data.loc[data['m_political']<0,'m_political'] = 1\n", "data.loc[data['m_work_14']<0,'m_work_14'] = 2\n", "#socio-economic status relative to peers\n", "data.loc[data['status_peer']<0,'status_peer'] = 2\n", "#socio-economic status relative to 3 years ago\n", "data.loc[data['status_3_before']<0,'status_3_before'] = 2\n", "#outlook\n", "data.loc[data['view']<0,'view'] = 4\n", "#expected annual income\n", "data.loc[data['inc_ability']<=0,'inc_ability']= 2\n", "inc_exp_mean = data['inc_exp'].mean()\n", "data.loc[data['inc_exp']<=0,'inc_exp']= inc_exp_mean #fill with the mean\n", "\n", "#remaining features: fill with the mode (after first dropping missing values)\n", "for i in range(1,9+1):\n", " data.loc[data['public_service_'+str(i)]<0,'public_service_'+str(i)] = data['public_service_'+str(i)].dropna().mode()[0]\n", "for i in range(1,13+1):\n", " data.loc[data['trust_'+str(i)]<0,'trust_'+str(i)] = data['trust_'+str(i)].dropna().mode()[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature augmentation\n", "\n", 
"In this step we analyze the relationships between the features and use them to augment the data. After some thought, I added the following features: age at first marriage, age at the most recent marriage, whether remarried, the spouse's age, the age gap with the spouse, various income ratios (income relative to the spouse, expected income in ten years relative to current income, and so on), income-to-floor-area ratios (again including the expected-income variants), social-class features (class in 10 years, class at age 14, etc.), and aggregate leisure, public-service satisfaction and trust indices. Beyond that, I normalized within each province, city and county, e.g. the mean income within the same region and each individual's indicators relative to others in the same province, city or county, and likewise compared individuals with their age peers (income, health, etc. among people of the same age). The implementation:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "#age at first marriage 147\n", "data['marital_1stbir'] = data['marital_1st'] - data['birth'] \n", "#age at most recent marriage 148\n", "data['marital_nowtbir'] = data['marital_now'] - data['birth'] \n", "#whether remarried 149\n", "data['mar'] = data['marital_nowtbir'] - data['marital_1stbir']\n", "#spouse's age 150\n", "data['marital_sbir'] = data['marital_now']-data['s_birth']\n", "#age gap with spouse 151\n", "data['age_'] = data['marital_nowtbir'] - data['marital_sbir'] \n", "\n", "#income ratios 151+7 =158\n", "data['income/s_income'] = data['income']/(data['s_income']+1) #relative to the spouse\n", "data['income+s_income'] = data['income']+(data['s_income']+1)\n", "data['income/family_income'] = data['income']/(data['family_income']+1)\n", "data['all_income/family_income'] = (data['income']+data['s_income'])/(data['family_income']+1)\n", "data['income/inc_exp'] = data['income']/(data['inc_exp']+1)\n", "data['family_income/m'] = data['family_income']/(data['family_m']+0.01)\n", "data['income/m'] = data['income']/(data['family_m']+0.01)\n", "\n", "#income/floor-area ratios 158+4=162\n", "data['income/floor_area'] = data['income']/(data['floor_area']+0.01)\n", "data['all_income/floor_area'] = (data['income']+data['s_income'])/(data['floor_area']+0.01)\n", "data['family_income/floor_area'] = data['family_income']/(data['floor_area']+0.01)\n", "data['floor_area/m'] = data['floor_area']/(data['family_m']+0.01)\n", "\n", "#class 162+3=165\n", "data['class_10_diff'] = (data['class_10_after'] - data['class'])\n", "data['class_diff'] = data['class'] - data['class_10_before']\n", "data['class_14_diff'] = data['class'] - data['class_14']\n", "#leisure index 166\n", "leisure_fea_lis = 
['leisure_'+str(i) for i in range(1,13)]\n", "data['leisure_sum'] = data[leisure_fea_lis].sum(axis=1) #skew\n", "#public-service satisfaction index 167\n", "public_service_fea_lis = ['public_service_'+str(i) for i in range(1,10)]\n", "data['public_service_sum'] = data[public_service_fea_lis].sum(axis=1) #skew\n", "\n", "#trust index 168\n", "trust_fea_lis = ['trust_'+str(i) for i in range(1,14)]\n", "data['trust_sum'] = data[trust_fea_lis].sum(axis=1) #skew\n", "\n", "#province mean 168+13=181\n", "data['province_income_mean'] = data.groupby(['province'])['income'].transform('mean').values\n", "data['province_family_income_mean'] = data.groupby(['province'])['family_income'].transform('mean').values\n", "data['province_equity_mean'] = data.groupby(['province'])['equity'].transform('mean').values\n", "data['province_depression_mean'] = data.groupby(['province'])['depression'].transform('mean').values\n", "data['province_floor_area_mean'] = data.groupby(['province'])['floor_area'].transform('mean').values\n", "data['province_health_mean'] = data.groupby(['province'])['health'].transform('mean').values\n", "data['province_class_10_diff_mean'] = data.groupby(['province'])['class_10_diff'].transform('mean').values\n", "data['province_class_mean'] = data.groupby(['province'])['class'].transform('mean').values\n", "data['province_health_problem_mean'] = data.groupby(['province'])['health_problem'].transform('mean').values\n", "data['province_family_status_mean'] = data.groupby(['province'])['family_status'].transform('mean').values\n", "data['province_leisure_sum_mean'] = data.groupby(['province'])['leisure_sum'].transform('mean').values\n", "data['province_public_service_sum_mean'] = data.groupby(['province'])['public_service_sum'].transform('mean').values\n", "data['province_trust_sum_mean'] = data.groupby(['province'])['trust_sum'].transform('mean').values\n", "\n", "#city mean 181+13=194\n", "data['city_income_mean'] = data.groupby(['city'])['income'].transform('mean').values #grouped by city\n", 
"data['city_family_income_mean'] = data.groupby(['city'])['family_income'].transform('mean').values\n", "data['city_equity_mean'] = data.groupby(['city'])['equity'].transform('mean').values\n", "data['city_depression_mean'] = data.groupby(['city'])['depression'].transform('mean').values\n", "data['city_floor_area_mean'] = data.groupby(['city'])['floor_area'].transform('mean').values\n", "data['city_health_mean'] = data.groupby(['city'])['health'].transform('mean').values\n", "data['city_class_10_diff_mean'] = data.groupby(['city'])['class_10_diff'].transform('mean').values\n", "data['city_class_mean'] = data.groupby(['city'])['class'].transform('mean').values\n", "data['city_health_problem_mean'] = data.groupby(['city'])['health_problem'].transform('mean').values\n", "data['city_family_status_mean'] = data.groupby(['city'])['family_status'].transform('mean').values\n", "data['city_leisure_sum_mean'] = data.groupby(['city'])['leisure_sum'].transform('mean').values\n", "data['city_public_service_sum_mean'] = data.groupby(['city'])['public_service_sum'].transform('mean').values\n", "data['city_trust_sum_mean'] = data.groupby(['city'])['trust_sum'].transform('mean').values\n", "\n", "#county mean 194 + 13 = 207\n", "data['county_income_mean'] = data.groupby(['county'])['income'].transform('mean').values\n", "data['county_family_income_mean'] = data.groupby(['county'])['family_income'].transform('mean').values\n", "data['county_equity_mean'] = data.groupby(['county'])['equity'].transform('mean').values\n", "data['county_depression_mean'] = data.groupby(['county'])['depression'].transform('mean').values\n", "data['county_floor_area_mean'] = data.groupby(['county'])['floor_area'].transform('mean').values\n", "data['county_health_mean'] = data.groupby(['county'])['health'].transform('mean').values\n", "data['county_class_10_diff_mean'] = data.groupby(['county'])['class_10_diff'].transform('mean').values\n", "data['county_class_mean'] = 
data.groupby(['county'])['class'].transform('mean').values\n", "data['county_health_problem_mean'] = data.groupby(['county'])['health_problem'].transform('mean').values\n", "data['county_family_status_mean'] = data.groupby(['county'])['family_status'].transform('mean').values\n", "data['county_leisure_sum_mean'] = data.groupby(['county'])['leisure_sum'].transform('mean').values\n", "data['county_public_service_sum_mean'] = data.groupby(['county'])['public_service_sum'].transform('mean').values\n", "data['county_trust_sum_mean'] = data.groupby(['county'])['trust_sum'].transform('mean').values\n", "\n", "#ratios relative to the same province 207 + 13 =220\n", "data['income/province'] = data['income']/(data['province_income_mean']) \n", "data['family_income/province'] = data['family_income']/(data['province_family_income_mean']) \n", "data['equity/province'] = data['equity']/(data['province_equity_mean']) \n", "data['depression/province'] = data['depression']/(data['province_depression_mean']) \n", "data['floor_area/province'] = data['floor_area']/(data['province_floor_area_mean'])\n", "data['health/province'] = data['health']/(data['province_health_mean'])\n", "data['class_10_diff/province'] = data['class_10_diff']/(data['province_class_10_diff_mean'])\n", "data['class/province'] = data['class']/(data['province_class_mean'])\n", "data['health_problem/province'] = data['health_problem']/(data['province_health_problem_mean'])\n", "data['family_status/province'] = data['family_status']/(data['province_family_status_mean'])\n", "data['leisure_sum/province'] = data['leisure_sum']/(data['province_leisure_sum_mean'])\n", "data['public_service_sum/province'] = data['public_service_sum']/(data['province_public_service_sum_mean'])\n", "data['trust_sum/province'] = data['trust_sum']/(data['province_trust_sum_mean']+1)\n", "\n", "#ratios relative to the same city 220 + 13 =233\n", "data['income/city'] = data['income']/(data['city_income_mean']) \n", "data['family_income/city'] = 
data['family_income']/(data['city_family_income_mean']) \n", "data['equity/city'] = data['equity']/(data['city_equity_mean']) \n", "data['depression/city'] = data['depression']/(data['city_depression_mean']) \n", "data['floor_area/city'] = data['floor_area']/(data['city_floor_area_mean'])\n", "data['health/city'] = data['health']/(data['city_health_mean'])\n", "data['class_10_diff/city'] = data['class_10_diff']/(data['city_class_10_diff_mean'])\n", "data['class/city'] = data['class']/(data['city_class_mean'])\n", "data['health_problem/city'] = data['health_problem']/(data['city_health_problem_mean'])\n", "data['family_status/city'] = data['family_status']/(data['city_family_status_mean'])\n", "data['leisure_sum/city'] = data['leisure_sum']/(data['city_leisure_sum_mean'])\n", "data['public_service_sum/city'] = data['public_service_sum']/(data['city_public_service_sum_mean'])\n", "data['trust_sum/city'] = data['trust_sum']/(data['city_trust_sum_mean'])\n", "\n", "#ratios relative to the same county 233 + 13 =246\n", "data['income/county'] = data['income']/(data['county_income_mean']) \n", "data['family_income/county'] = data['family_income']/(data['county_family_income_mean']) \n", "data['equity/county'] = data['equity']/(data['county_equity_mean']) \n", "data['depression/county'] = data['depression']/(data['county_depression_mean']) \n", "data['floor_area/county'] = data['floor_area']/(data['county_floor_area_mean'])\n", "data['health/county'] = data['health']/(data['county_health_mean'])\n", "data['class_10_diff/county'] = data['class_10_diff']/(data['county_class_10_diff_mean'])\n", "data['class/county'] = data['class']/(data['county_class_mean'])\n", "data['health_problem/county'] = data['health_problem']/(data['county_health_problem_mean'])\n", "data['family_status/county'] = data['family_status']/(data['county_family_status_mean'])\n", "data['leisure_sum/county'] = data['leisure_sum']/(data['county_leisure_sum_mean'])\n", "data['public_service_sum/county'] = 
data['public_service_sum']/(data['county_public_service_sum_mean'])\n", "data['trust_sum/county'] = data['trust_sum']/(data['county_trust_sum_mean'])\n", "\n", "#age mean 246+ 13 =259\n", "data['age_income_mean'] = data.groupby(['age'])['income'].transform('mean').values\n", "data['age_family_income_mean'] = data.groupby(['age'])['family_income'].transform('mean').values\n", "data['age_equity_mean'] = data.groupby(['age'])['equity'].transform('mean').values\n", "data['age_depression_mean'] = data.groupby(['age'])['depression'].transform('mean').values\n", "data['age_floor_area_mean'] = data.groupby(['age'])['floor_area'].transform('mean').values\n", "data['age_health_mean'] = data.groupby(['age'])['health'].transform('mean').values\n", "data['age_class_10_diff_mean'] = data.groupby(['age'])['class_10_diff'].transform('mean').values\n", "data['age_class_mean'] = data.groupby(['age'])['class'].transform('mean').values\n", "data['age_health_problem_mean'] = data.groupby(['age'])['health_problem'].transform('mean').values\n", "data['age_family_status_mean'] = data.groupby(['age'])['family_status'].transform('mean').values\n", "data['age_leisure_sum_mean'] = data.groupby(['age'])['leisure_sum'].transform('mean').values\n", "data['age_public_service_sum_mean'] = data.groupby(['age'])['public_service_sum'].transform('mean').values\n", "data['age_trust_sum_mean'] = data.groupby(['age'])['trust_sum'].transform('mean').values\n", "\n", "#relative to age peers 259 + 13 =272\n", "data['income/age'] = data['income']/(data['age_income_mean']) \n", "data['family_income/age'] = data['family_income']/(data['age_family_income_mean']) \n", "data['equity/age'] = data['equity']/(data['age_equity_mean']) \n", "data['depression/age'] = data['depression']/(data['age_depression_mean']) \n", "data['floor_area/age'] = data['floor_area']/(data['age_floor_area_mean'])\n", "data['health/age'] = data['health']/(data['age_health_mean'])\n", "data['class_10_diff/age'] = 
data['class_10_diff']/(data['age_class_10_diff_mean'])\n", "data['class/age'] = data['class']/(data['age_class_mean'])\n", "data['health_problem/age'] = data['health_problem']/(data['age_health_problem_mean'])\n", "data['family_status/age'] = data['family_status']/(data['age_family_status_mean'])\n", "data['leisure_sum/age'] = data['leisure_sum']/(data['age_leisure_sum_mean'])\n", "data['public_service_sum/age'] = data['public_service_sum']/(data['age_public_service_sum_mean'])\n", "data['trust_sum/age'] = data['trust_sum']/(data['age_trust_sum_mean'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After these operations, the features have grown from the original 131 dimensions to 272. Next we turn to feature engineering, model training and model ensembling." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "shape (10956, 272)\n" ] }, { "data": { "text/html": [ "
| | id | survey_type | province | city | county | survey_time | gender | birth | nationality | religion | ... | depression/age | floor_area/age | health/age | class_10_diff/age | class/age | health_problem/age | family_status/age | leisure_sum/age | public_service_sum/age | trust_sum/age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 12 | 32 | 59 | 2015 | 1 | 1959 | 1 | 1 | ... | 1.285211 | 0.410351 | 0.848837 | 0.000000 | 0.683307 | 0.521429 | 0.733668 | 0.724620 | 0.666638 | 0.925941 |
| 1 | 2 | 2 | 18 | 52 | 85 | 2015 | 1 | 1992 | 1 | 1 | ... | 0.733333 | 0.952824 | 1.179337 | 1.012552 | 1.344444 | 0.891344 | 1.359551 | 1.011792 | 1.130778 | 1.188442 |
| 2 | 3 | 2 | 29 | 83 | 126 | 2015 | 2 | 1967 | 1 | 0 | ... | 1.343537 | 0.972328 | 1.150485 | 1.190955 | 1.195762 | 1.055679 | 1.190955 | 0.966470 | 1.193204 | 0.803693 |
| 3 | 4 | 2 | 10 | 28 | 51 | 2015 | 2 | 1943 | 1 | 1 | ... | 1.111663 | 0.642329 | 1.276353 | 4.977778 | 1.199143 | 1.188329 | 1.162630 | 0.899346 | 1.153810 | 1.300950 |
| 4 | 5 | 1 | 7 | 18 | 36 | 2015 | 2 | 1994 | 1 | 1 | ... | 0.750000 | 0.587284 | 1.177106 | 0.000000 | 0.236957 | 1.116803 | 1.093645 | 1.045313 | 0.728161 | 1.117428 |

5 rows × 272 columns
\n", "
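The region-wise normalization used in the feature-augmentation cell (`groupby` followed by `transform('mean')`, then a ratio against that mean) can be sketched in isolation. The miniature frame below is hypothetical; only the column-naming pattern mirrors the notebook:

```python
import pandas as pd

# Toy stand-in: per-person income with a region column (illustrative values).
df = pd.DataFrame({
    "province": [1, 1, 2, 2],
    "income":   [100.0, 300.0, 50.0, 150.0],
})

# transform('mean') broadcasts each group's mean back onto every row,
# which is what the province/city/county *_mean features rely on.
df["province_income_mean"] = df.groupby("province")["income"].transform("mean")
df["income/province"] = df["income"] / df["province_income_mean"]
print(df["income/province"].tolist())  # [0.5, 1.5, 0.5, 1.5]
```

Unlike `agg`, `transform` preserves the original row index, so the group statistic can be divided element-wise against each individual's value without a merge.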