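{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Hands-On Kaggle Competition: House Price Prediction\n", "\n", "To wrap up the deep learning basics chapter, we will put what we have learned into practice. Let's work through a Kaggle competition: house price prediction. This section provides untuned data preprocessing, model design, and hyperparameter choices. We hope that, by working hands-on, observing the experiments closely, analyzing the results carefully, and iterating on the approach, readers will arrive at results they are satisfied with.\n", "\n", "## The Kaggle Competition\n", "\n", "[Kaggle](https://www.kaggle.com) is a well-known platform where machine learning enthusiasts exchange ideas. Figure 3.7 shows the Kaggle home page. To submit results, you need to register a Kaggle account.\n", "\n", "![Kaggle home page](../img/kaggle.png)\n", "\n", "On the competition page we can read about the competition and see participants' scores, download the datasets, and submit our own predictions. The competition page is at https://www.kaggle.com/c/house-prices-advanced-regression-techniques .\n", "\n", "\n", "Figure 3.8 shows the page for the house price prediction competition.\n", "\n", "![The house price prediction competition page. The competition datasets are available under the “Data” tab](../img/house_pricing.png)\n", "\n", "## Obtaining and Reading the Dataset\n", "\n", "The competition data is split into a training set and a test set. Both contain features for each house, such as street type, year of construction, roof type, and basement condition. The feature values include continuous numbers, discrete labels, and even missing values marked “na”. Only the training set includes each house's price, which is the label. We can visit the competition page, click the “Data” tab shown in Figure 3.8, and download the datasets.\n", "\n", "We will use the `pandas` library to read and process the data. Before importing the packages needed in this section, make sure `pandas` is installed; otherwise, see the comment in the code below." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# If pandas is not installed, uncomment the line below\n", "# !pip install pandas\n", "\n", "%matplotlib inline\n", "import d2ltorch as d2lt\n", "import torch\n", "from torch import autograd, nn, optim\n", "from torch.utils import data as tdata\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After extraction, the data is located in the `../data` directory, which contains two CSV files. Below we read both files with `pandas`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "attributes": { "classes": [], "id": "", "n": "14" } }, "outputs": [], "source": [ "train_data = pd.read_csv('../data/kaggle_house_pred_train.csv')\n", "test_data = pd.read_csv('../data/kaggle_house_pred_test.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The training set contains 1460 samples, 80 features, and 1 label." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "attributes": { "classes": [], "id": "", "n": "11" } }, "outputs": [ { "data": { "text/plain": [ "(1460, 81)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" 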
} ], "source": [ "train_data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The test set contains 1459 samples and 80 features. We need to predict the label for every sample in the test set." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "attributes": { "classes": [], "id": "", "n": "5" } }, "outputs": [ { "data": { "text/plain": [ "(1459, 80)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the first 4 features, the last 2 features, and the label (SalePrice) of the first 4 samples:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "attributes": { "classes": [], "id": "", "n": "28" } }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>Id</th><th>MSSubClass</th><th>MSZoning</th><th>LotFrontage</th><th>SaleType</th><th>SaleCondition</th><th>SalePrice</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>60</td><td>RL</td><td>65.0</td><td>WD</td><td>Normal</td><td>208500</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>20</td><td>RL</td><td>80.0</td><td>WD</td><td>Normal</td><td>181500</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>60</td><td>RL</td><td>68.0</td><td>WD</td><td>Normal</td><td>223500</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>70</td><td>RL</td><td>60.0</td><td>WD</td><td>Abnorml</td><td>140000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"