{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "03_Pandas",
"version": "0.3.2",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"metadata": {
"id": "bOChJSNXtC9g",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# Pandas"
]
},
{
"metadata": {
"id": "OLIxEDq6VhvZ",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"In this lesson, we will learn the basics of data analysis with the Python pandas library."
]
},
{
"metadata": {
"id": "VoMq0eFRvugb",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# Downloading the data"
]
},
{
"metadata": {
"id": "qWro5T5qTJJL",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"First, we need some data. We will download the Titanic dataset from the public link below."
]
},
{
"metadata": {
"id": "cdg5wEFcV6qA",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import urllib.request"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "6FuyDUTFVY7J",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"# Download the data from GitHub to the notebook's local disk\n",
"url = \"https://raw.githubusercontent.com/LisonEvf/practicalAI-cn/master/data/titanic.csv\"\n",
"response = urllib.request.urlopen(url)\n",
"data = response.read()  # raw bytes of the CSV file\n",
"with open('titanic.csv', 'wb') as f:\n",
"    f.write(data)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "TK3wsHCFhldU",
"colab_type": "code",
"outputId": "3c617391-f929-4956-ab76-2e53b64abdf3",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 84
}
},
"cell_type": "code",
"source": [
"# Check that the data was downloaded successfully\n",
"!ls -l"
],
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"text": [
"total 96\n",
"-rw-r--r-- 1 root root 6975 Dec 16 12:45 processed_titanic.csv\n",
"drwxr-xr-x 1 root root 4096 Dec 10 17:34 sample_data\n",
"-rw-r--r-- 1 root root 85153 Dec 16 12:46 titanic.csv\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "TL4rwLUSW9hV",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# Loading the data"
]
},
{
"metadata": {
"id": "4EOXMnGHiLxM",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Now that we have some data to work with, let's load it into a pandas DataFrame. pandas is a great Python library for data analysis."
]
},
{
"metadata": {
"id": "-Zd-zoyjaaw2",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import pandas as pd"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "ywaEF_0aQ023",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"# Read the CSV file into a pandas DataFrame\n",
"df = pd.read_csv(\"titanic.csv\", header=0)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "J79FUzZWQ-kx",
"colab_type": "code",
"outputId": "3ccab6de-901e-42d2-8032-9307c5448a04",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"cell_type": "code",
"source": [
"# First five rows\n",
"df.head()"
],
"execution_count": 6,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"   pclass                                             name     sex      age  \\\n",
"0       1                    Allen, Miss. Elisabeth Walton  female  29.0000  \n",
"1       1                   Allison, Master. Hudson Trevor    male   0.9167  \n",
"2       1                     Allison, Miss. Helen Loraine  female   2.0000  \n",
"3       1             Allison, Mr. Hudson Joshua Creighton    male  30.0000  \n",
"4       1  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  25.0000  \n",
"\n",
"   sibsp  parch  ticket      fare    cabin  embarked  survived  \n",
"0      0      0   24160  211.3375       B5         S         1  \n",
"1      1      2  113781  151.5500  C22 C26         S         1  \n",
"2      1      2  113781  151.5500  C22 C26         S         0  \n",
"3      1      2  113781  151.5500  C22 C26         S         0  \n",
"4      1      2  113781  151.5500  C22 C26         S         0  "
]
},
"metadata": {
"tags": []
},
"execution_count": 6
}
]
},
{
"metadata": {
"id": "qhYyM3iGRZ8W",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"The dataset has the following features:\n",
"* pclass: class of travel\n",
"* name: full name of the passenger\n",
"* sex: gender\n",
"* age: numerical age\n",
"* sibsp: number of siblings/spouses aboard\n",
"* parch: number of parents/children aboard\n",
"* ticket: ticket number\n",
"* fare: cost of the ticket\n",
"* cabin: location of room\n",
"* embarked: port that the passenger embarked at (C - Cherbourg, S - Southampton, Q - Queenstown)\n",
"* survived: survival indicator (0 - died, 1 - survived)"
]
},
{
"metadata": {
"id": "NBx5VP8K_y6N",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# Exploratory analysis"
]
},
{
"metadata": {
"id": "DD14WJ1G_zum",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"We will use pandas to explore and process our data."
]
},
{
"metadata": {
"id": "thR28yTmASRr",
"colab_type": "code",
"outputId": "798b1a62-eb6a-46d5-e527-449fc7f32bb2",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 282
}
},
"cell_type": "code",
"source": [
"# Descriptive statistics\n",
"df.describe()"
],
"execution_count": 7,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"            pclass          age        sibsp        parch         fare  \\\n",
"count  1309.000000  1046.000000  1309.000000  1309.000000  1308.000000  \n",
"mean      2.294882    29.881135     0.498854     0.385027    33.295479  \n",
"std       0.837836    14.413500     1.041658     0.865560    51.758668  \n",
"min       1.000000     0.166700     0.000000     0.000000     0.000000  \n",
"25%       2.000000    21.000000     0.000000     0.000000     7.895800  \n",
"50%       3.000000    28.000000     0.000000     0.000000    14.454200  \n",
"75%       3.000000    39.000000     1.000000     0.000000    31.275000  \n",
"max       3.000000    80.000000     8.000000     9.000000   512.329200  \n",
"\n",
"          survived  \n",
"count  1309.000000  \n",
"mean      0.381971  \n",
"std       0.486055  \n",
"min       0.000000  \n",
"25%       0.000000  \n",
"50%       0.000000  \n",
"75%       1.000000  \n",
"max       1.000000  "
]
},
"metadata": {
"tags": []
},
"execution_count": 7
}
]
},
{
"metadata": {
"id": "Mn5HqS3XmzJs",
"colab_type": "code",
"outputId": "bb849441-d178-4559-bb7a-e3d9ef3caefb",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 364
}
},
"cell_type": "code",
"source": [
"# Histogram\n",
"df[\"age\"].hist()"
],
"execution_count": 8,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
""
]
},
"metadata": {
"tags": []
},
"execution_count": 8
},
{
"output_type": "display_data",
"data": {
"text/plain": [
""
]
},
"metadata": {
"tags": []
}
}
]
},
{
"metadata": {
"id": "7illbHR1nLEF",
"colab_type": "code",
"outputId": "38dd59af-3901-485d-8335-8d9aef4cb204",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"cell_type": "code",
"source": [
"# Unique values\n",
"df[\"embarked\"].unique()"
],
"execution_count": 9,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array(['S', 'C', nan, 'Q'], dtype=object)"
]
},
"metadata": {
"tags": []
},
"execution_count": 9
}
]
},
{
"metadata": {
"id": "BG1IMeV_hrqV",
"colab_type": "code",
"outputId": "5499f70f-9a64-4371-b185-e1f56b62bbb9",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 118
}
},
"cell_type": "code",
"source": [
"# Select data by feature (column)\n",
"df[\"name\"].head()"
],
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0                      Allen, Miss. Elisabeth Walton\n",
"1                     Allison, Master. Hudson Trevor\n",
"2                       Allison, Miss. Helen Loraine\n",
"3               Allison, Mr. Hudson Joshua Creighton\n",
"4    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)\n",
"Name: name, dtype: object"
]
},
"metadata": {
"tags": []
},
"execution_count": 10
}
]
},
{
"metadata": {
"id": "wPrRGLDtiZSp",
"colab_type": "code",
"outputId": "7e32434d-0a96-44c0-e54d-2dc45e04c5c7",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"cell_type": "code",
"source": [
"# Filtering\n",
"df[df[\"sex\"]==\"female\"].head() # only rows for female passengers are returned"
],
"execution_count": 11,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"   pclass                                             name     sex   age  \\\n",
"0       1                    Allen, Miss. Elisabeth Walton  female  29.0  \n",
"2       1                     Allison, Miss. Helen Loraine  female   2.0  \n",
"4       1  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  25.0  \n",
"6       1                Andrews, Miss. Kornelia Theodosia  female  63.0  \n",
"8       1    Appleton, Mrs. Edward Dale (Charlotte Lamson)  female  53.0  \n",
"\n",
"   sibsp  parch  ticket      fare    cabin  embarked  survived  \n",
"0      0      0   24160  211.3375       B5         S         1  \n",
"2      1      2  113781  151.5500  C22 C26         S         0  \n",
"4      1      2  113781  151.5500  C22 C26         S         0  \n",
"6      1      0   13502   77.9583       D7         S         1  \n",
"8      2      0   11769   51.4792     C101         S         1  "
]
},
"metadata": {
"tags": []
},
"execution_count": 11
}
]
},
{
"metadata": {
"id": "FOuLeYIojMMH",
"colab_type": "code",
"outputId": "aa4fbb43-cf32-4e6c-e92f-c97ad79da9d1",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"cell_type": "code",
"source": [
"# Sorting\n",
"df.sort_values(\"age\", ascending=False).head()"
],
"execution_count": 12,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"      pclass                                               name     sex   age  \\\n",
"14         1               Barkworth, Mr. Algernon Henry Wilson    male  80.0  \n",
"61         1  Cavendish, Mrs. Tyrell William (Julia Florence...  female  76.0  \n",
"1235       3                                Svensson, Mr. Johan    male  74.0  \n",
"135        1                          Goldschmidt, Mr. George B    male  71.0  \n",
"9          1                            Artagaveytia, Mr. Ramon    male  71.0  \n",
"\n",
"      sibsp  parch    ticket     fare  cabin  embarked  survived  \n",
"14        0      0     27042  30.0000    A23         S         1  \n",
"61        1      0     19877  78.8500    C46         S         1  \n",
"1235      0      0    347060   7.7750    NaN         S         0  \n",
"135       0      0  PC 17754  34.6542     A5         C         0  \n",
"9         0      0  PC 17609  49.5042    NaN         C         0  "
]
},
"metadata": {
"tags": []
},
"execution_count": 12
}
]
},
{
"metadata": {
"id": "v0TCbtSMjMO5",
"colab_type": "code",
"outputId": "5c602a41-3250-48e1-f5c2-973340a20708",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 135
}
},
"cell_type": "code",
"source": [
"# Grouping (aggregate statistics per group)\n",
"survived_group = df.groupby(\"survived\")\n",
"survived_group.mean(numeric_only=True) # average only the numeric columns"
],
"execution_count": 13,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"            pclass        age     sibsp     parch       fare\n",
"survived                                                    \n",
"0         2.500618  30.545369  0.521632  0.328801  23.353831\n",
"1         1.962000  28.918228  0.462000  0.476000  49.361184"
]
},
"metadata": {
"tags": []
},
"execution_count": 13
}
]
},
{
"metadata": {
"id": "34LmckWDhdSA",
"colab_type": "code",
"outputId": "0befc9a0-b30b-472b-ef19-e092b8da6f28",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 220
}
},
"cell_type": "code",
"source": [
"# iloc selects by integer position\n",
"df.iloc[0, :] # first row, all columns; iloc accepts integer positions only"
],
"execution_count": 14,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"pclass                                  1\n",
"name        Allen, Miss. Elisabeth Walton\n",
"sex                                female\n",
"age                                    29\n",
"sibsp                                   0\n",
"parch                                   0\n",
"ticket                              24160\n",
"fare                              211.338\n",
"cabin                                  B5\n",
"embarked                                S\n",
"survived                                1\n",
"Name: 0, dtype: object"
]
},
"metadata": {
"tags": []
},
"execution_count": 14
}
]
},
{
"metadata": {
"id": "QrdXeuRdFkXB",
"colab_type": "code",
"outputId": "1f6b255f-7ed1-463c-adb6-55a026881d56",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"cell_type": "code",
"source": [
"# Get the value at a specific position (row 0, column 1)\n",
"df.iloc[0, 1]"
],
"execution_count": 15,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'Allen, Miss. Elisabeth Walton'"
]
},
"metadata": {
"tags": []
},
"execution_count": 15
}
]
},
{
"metadata": {
"id": "Rz35_-x2FkaL",
"colab_type": "code",
"outputId": "4d50ef2b-f1d7-4b50-84ea-44a11d1b360a",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 220
}
},
"cell_type": "code",
"source": [
"# loc selects by index label\n",
"df.loc[0] # the row whose index label is 0; loc looks up rows (or columns) by label"
],
"execution_count": 16,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"pclass                                  1\n",
"name        Allen, Miss. Elisabeth Walton\n",
"sex                                female\n",
"age                                    29\n",
"sibsp                                   0\n",
"parch                                   0\n",
"ticket                              24160\n",
"fare                              211.338\n",
"cabin                                  B5\n",
"embarked                                S\n",
"survived                                1\n",
"Name: 0, dtype: object"
]
},
"metadata": {
"tags": []
},
"execution_count": 16
}
]
},
{
"metadata": {
"id": "uSezrq4vEFYh",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# Preprocessing"
]
},
{
"metadata": {
"id": "EZ1pCKHIjMUY",
"colab_type": "code",
"outputId": "90120eb7-0764-4feb-f084-399d4f7f3081",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"cell_type": "code",
"source": [
"# Rows with at least one NaN value\n",
"df[pd.isnull(df).any(axis=1)].head()"
],
"execution_count": 17,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"    pclass                          name     sex   age  sibsp  parch  \\\n",
"9        1       Artagaveytia, Mr. Ramon    male  71.0      0      0  \n",
"13       1  Barber, Miss. Ellen \"Nellie\"  female  26.0      0      0  \n",
"15       1           Baumann, Mr. John D    male   NaN      0      0  \n",
"23       1         Bidois, Miss. Rosalie  female  42.0      0      0  \n",
"25       1           Birnbaum, Mr. Jakob    male  25.0      0      0  \n",
"\n",
"      ticket      fare  cabin  embarked  survived  \n",
"9   PC 17609   49.5042    NaN         C         0  \n",
"13     19877   78.8500    NaN         S         1  \n",
"15  PC 17318   25.9250    NaN         S         0  \n",
"23  PC 17757  227.5250    NaN         C         1  \n",
"25     13905   26.0000    NaN         C         0  "
]
},
"metadata": {
"tags": []
},
"execution_count": 17
}
]
},
{
"metadata": {
"id": "zUaiFplEkmoB",
"colab_type": "code",
"outputId": "65b701a3-6d72-4914-f920-67561a59a73e",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"cell_type": "code",
"source": [
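"# Drop rows with NaN values\n",
"df = df.dropna() # drop every row that contains at least one NaN\n",
"df = df.reset_index() # reset the row index; the old index is kept as an \"index\" column\n",
"df.head()"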
],
"execution_count": 18,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"   index  pclass                                             name     sex  \\\n",
"0      0       1                    Allen, Miss. Elisabeth Walton  female  \n",
"1      1       1                   Allison, Master. Hudson Trevor    male  \n",
"2      2       1                     Allison, Miss. Helen Loraine  female  \n",
"3      3       1             Allison, Mr. Hudson Joshua Creighton    male  \n",
"4      4       1  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  \n",
"\n",
"       age  sibsp  parch  ticket      fare    cabin  embarked  survived  \n",
"0  29.0000      0      0   24160  211.3375       B5         S         1  \n",
"1   0.9167      1      2  113781  151.5500  C22 C26         S         1  \n",
"2   2.0000      1      2  113781  151.5500  C22 C26         S         0  \n",
"3  30.0000      1      2  113781  151.5500  C22 C26         S         0  \n",
"4  25.0000      1      2  113781  151.5500  C22 C26         S         0  "
]
},
"metadata": {
"tags": []
},
"execution_count": 18
}
]
},
{
"metadata": {
"id": "ubujZv_8qG-d",
"colab_type": "code",
"outputId": "3a397700-f1ff-496b-e585-ef0ad7c2b37f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"cell_type": "code",
"source": [
"# Drop multiple columns\n",
"df = df.drop([\"name\", \"cabin\", \"ticket\"], axis=1) # we won't use text features for our initial basic models\n",
"df.head()"
],
"execution_count": 19,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"   index  pclass     sex      age  sibsp  parch      fare  embarked  survived\n",
"0      0       1  female  29.0000      0      0  211.3375         S         1\n",
"1      1       1    male   0.9167      1      2  151.5500         S         1\n",
"2      2       1  female   2.0000      1      2  151.5500         S         0\n",
"3      3       1    male  30.0000      1      2  151.5500         S         0\n",
"4      4       1  female  25.0000      1      2  151.5500         S         0"
]
},
"metadata": {
"tags": []
},
"execution_count": 19
}
]
},
{
"metadata": {
"id": "8m117GcVnon9",
"colab_type": "code",
"outputId": "ee33bb3e-c64b-4e2c-8e82-03e669b7bcba",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"cell_type": "code",
"source": [
"# Map categorical feature values to integers\n",
"df['sex'] = df['sex'].map( {'female': 0, 'male': 1} ).astype(int)\n",
"df[\"embarked\"] = df['embarked'].dropna().map( {'S':0, 'C':1, 'Q':2} ).astype(int)\n",
"df.head()"
],
"execution_count": 20,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" index | \n",
" pclass | \n",
" sex | \n",
" age | \n",
" sibsp | \n",
" parch | \n",
" fare | \n",
" embarked | \n",
" survived | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 29.0000 | \n",
" 0 | \n",
" 0 | \n",
" 211.3375 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 0.9167 | \n",
" 1 | \n",
" 2 | \n",
" 151.5500 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 1 | \n",
" 0 | \n",
" 2.0000 | \n",
" 1 | \n",
" 2 | \n",
" 151.5500 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" 30.0000 | \n",
" 1 | \n",
" 2 | \n",
" 151.5500 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 4 | \n",
" 1 | \n",
" 0 | \n",
" 25.0000 | \n",
" 1 | \n",
" 2 | \n",
" 151.5500 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" index pclass sex age sibsp parch fare embarked survived\n",
"0 0 1 0 29.0000 0 0 211.3375 0 1\n",
"1 1 1 1 0.9167 1 2 151.5500 0 1\n",
"2 2 1 0 2.0000 1 2 151.5500 0 0\n",
"3 3 1 1 30.0000 1 2 151.5500 0 0\n",
"4 4 1 0 25.0000 1 2 151.5500 0 0"
]
},
"metadata": {
"tags": []
},
"execution_count": 20
}
]
},
{
"metadata": {
"id": "ZaVqjpsCEtft",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# 特征工程"
]
},
{
"metadata": {
"id": "_FPtk5tpqrDI",
"colab_type": "code",
"outputId": "72c426bb-005d-4b47-a9d4-8cbe28e936c5",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"cell_type": "code",
"source": [
"# lambda表达式创建新特征\n",
"def get_family_size(sibsp, parch):\n",
" family_size = sibsp + parch\n",
" return family_size\n",
"\n",
"df[\"family_size\"] = df[[\"sibsp\", \"parch\"]].apply(lambda x: get_family_size(x[\"sibsp\"], x[\"parch\"]), axis=1)\n",
"df.head()"
],
"execution_count": 21,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" index | \n",
" pclass | \n",
" sex | \n",
" age | \n",
" sibsp | \n",
" parch | \n",
" fare | \n",
" embarked | \n",
" survived | \n",
" family_size | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 29.0000 | \n",
" 0 | \n",
" 0 | \n",
" 211.3375 | \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 0.9167 | \n",
" 1 | \n",
" 2 | \n",
" 151.5500 | \n",
" 0 | \n",
" 1 | \n",
" 3 | \n",
"
\n",
" \n",
" 2 | \n",
" 2 | \n",
" 1 | \n",
" 0 | \n",
" 2.0000 | \n",
" 1 | \n",
" 2 | \n",
" 151.5500 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
"
\n",
" \n",
" 3 | \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" 30.0000 | \n",
" 1 | \n",
" 2 | \n",
" 151.5500 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
"
\n",
" \n",
" 4 | \n",
" 4 | \n",
" 1 | \n",
" 0 | \n",
" 25.0000 | \n",
" 1 | \n",
" 2 | \n",
" 151.5500 | \n",
" 0 | \n",
" 0 | \n",
" 3 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" index pclass sex age sibsp parch fare embarked survived \\\n",
"0 0 1 0 29.0000 0 0 211.3375 0 1 \n",
"1 1 1 1 0.9167 1 2 151.5500 0 1 \n",
"2 2 1 0 2.0000 1 2 151.5500 0 0 \n",
"3 3 1 1 30.0000 1 2 151.5500 0 0 \n",
"4 4 1 0 25.0000 1 2 151.5500 0 0 \n",
"\n",
" family_size \n",
"0 0 \n",
"1 3 \n",
"2 3 \n",
"3 3 \n",
"4 3 "
]
},
"metadata": {
"tags": []
},
"execution_count": 21
}
]
},
{
"metadata": {
"id": "JK3FqfjnpSNi",
"colab_type": "code",
"outputId": "80aedded-c9f6-40cb-b073-cf8afab75531",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 194
}
},
"cell_type": "code",
"source": [
"# 重新组织标题\n",
"df = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'family_size', 'fare', 'embarked', 'survived']]\n",
"df.head()"
],
"execution_count": 22,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" pclass | \n",
" sex | \n",
" age | \n",
" sibsp | \n",
" parch | \n",
" family_size | \n",
" fare | \n",
" embarked | \n",
" survived | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 0 | \n",
" 29.0000 | \n",
" 0 | \n",
" 0 | \n",
" 0 | \n",
" 211.3375 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 1 | \n",
" 0.9167 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 151.5500 | \n",
" 0 | \n",
" 1 | \n",
"
\n",
" \n",
" 2 | \n",
" 1 | \n",
" 0 | \n",
" 2.0000 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 151.5500 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 3 | \n",
" 1 | \n",
" 1 | \n",
" 30.0000 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 151.5500 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
" 4 | \n",
" 1 | \n",
" 0 | \n",
" 25.0000 | \n",
" 1 | \n",
" 2 | \n",
" 3 | \n",
" 151.5500 | \n",
" 0 | \n",
" 0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" pclass sex age sibsp parch family_size fare embarked \\\n",
"0 1 0 29.0000 0 0 0 211.3375 0 \n",
"1 1 1 0.9167 1 2 3 151.5500 0 \n",
"2 1 0 2.0000 1 2 3 151.5500 0 \n",
"3 1 1 30.0000 1 2 3 151.5500 0 \n",
"4 1 0 25.0000 1 2 3 151.5500 0 \n",
"\n",
" survived \n",
"0 1 \n",
"1 1 \n",
"2 0 \n",
"3 0 \n",
"4 0 "
]
},
"metadata": {
"tags": []
},
"execution_count": 22
}
]
},
{
"metadata": {
"id": "N_rwgfrFGTne",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"# 保存数据"
]
},
{
"metadata": {
"id": "rNNxA7Vrp2fC",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"# 保存数据帧(dataframe)到 CSV\n",
"df.to_csv(\"processed_titanic.csv\", index=False)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "gfc7Epp7sgqz",
"colab_type": "code",
"outputId": "e1e21143-2e87-43cb-aa16-b0af114a1b4a",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 84
}
},
"cell_type": "code",
"source": [
"# 看你一下你保持的文件\n",
"!ls -l"
],
"execution_count": 24,
"outputs": [
{
"output_type": "stream",
"text": [
"total 96\n",
"-rw-r--r-- 1 root root 6975 Dec 16 12:46 processed_titanic.csv\n",
"drwxr-xr-x 1 root root 4096 Dec 10 17:34 sample_data\n",
"-rw-r--r-- 1 root root 85153 Dec 16 12:46 titanic.csv\n"
],
"name": "stdout"
}
]
}
]
}