{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "03_Pandas", "version": "0.3.2", "provenance": [], "collapsed_sections": [], "toc_visible": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "metadata": { "id": "bOChJSNXtC9g", "colab_type": "text" }, "cell_type": "markdown", "source": [ "# Pandas" ] }, { "metadata": { "id": "OLIxEDq6VhvZ", "colab_type": "text" }, "cell_type": "markdown", "source": [ "\n", "\n", "在本文中,我们将学习使用Python Pandas库进行数据分析的基础知识。\n", "\n", "\n", "\n", "\n" ] }, { "metadata": { "id": "VoMq0eFRvugb", "colab_type": "text" }, "cell_type": "markdown", "source": [ "# 下载数据" ] }, { "metadata": { "id": "qWro5T5qTJJL", "colab_type": "text" }, "cell_type": "markdown", "source": [ "首先,我们要获得一些数据。我们将从下面的公共链接中下载titanic数据集。" ] }, { "metadata": { "id": "cdg5wEFcV6qA", "colab_type": "code", "colab": {} }, "cell_type": "code", "source": [ "import urllib" ], "execution_count": 0, "outputs": [] }, { "metadata": { "id": "6FuyDUTFVY7J", "colab_type": "code", "colab": {} }, "cell_type": "code", "source": [ "# 将数据从GitHub下载到笔记本电脑的本地磁盘\n", "url = \"https://raw.githubusercontent.com/LisonEvf/practicalAI-cn/master/data/titanic.csv\"\n", "response = urllib.request.urlopen(url)\n", "html = response.read()\n", "with open('titanic.csv', 'wb') as f:\n", " f.write(html)" ], "execution_count": 0, "outputs": [] }, { "metadata": { "id": "TK3wsHCFhldU", "colab_type": "code", "outputId": "3c617391-f929-4956-ab76-2e53b64abdf3", "colab": { "base_uri": "https://localhost:8080/", "height": 84 } }, "cell_type": "code", "source": [ "# 检查数据是否已下载成功\n", "!ls -l " ], "execution_count": 3, "outputs": [ { "output_type": "stream", "text": [ "total 96\n", "-rw-r--r-- 1 root root 6975 Dec 16 12:45 processed_titanic.csv\n", "drwxr-xr-x 1 root root 4096 Dec 10 17:34 sample_data\n", "-rw-r--r-- 1 root root 85153 Dec 16 12:46 titanic.csv\n" ], "name": "stdout" } ] }, { "metadata": { "id": "TL4rwLUSW9hV", "colab_type": "text" }, "cell_type": 
"markdown", "source": [ "# 加载数据" ] }, { "metadata": { "id": "4EOXMnGHiLxM", "colab_type": "text" }, "cell_type": "markdown", "source": [ "现在我们有一些数据可以使用,让我们加载到Pandas数据帧(dataframe)中。Pandas是一个很棒的python数据库分析库。" ] }, { "metadata": { "id": "-Zd-zoyjaaw2", "colab_type": "code", "colab": {} }, "cell_type": "code", "source": [ "import pandas as pd" ], "execution_count": 0, "outputs": [] }, { "metadata": { "id": "ywaEF_0aQ023", "colab_type": "code", "colab": {} }, "cell_type": "code", "source": [ "# 从CSV读取到Pandas DataFrame\n", "df = pd.read_csv(\"titanic.csv\", header=0)" ], "execution_count": 0, "outputs": [] }, { "metadata": { "id": "J79FUzZWQ-kx", "colab_type": "code", "outputId": "3ccab6de-901e-42d2-8032-9307c5448a04", "colab": { "base_uri": "https://localhost:8080/", "height": 194 } }, "cell_type": "code", "source": [ "# 前五项\n", "df.head()" ], "execution_count": 6, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " pclass name sex age \\\n", "0 1 Allen, Miss. Elisabeth Walton female 29.0000 \n", "1 1 Allison, Master. Hudson Trevor male 0.9167 \n", "2 1 Allison, Miss. Helen Loraine female 2.0000 \n", "3 1 Allison, Mr. Hudson Joshua Creighton male 30.0000 \n", "4 1 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0000 \n", "\n", " sibsp parch ticket fare cabin embarked survived \n", "0 0 0 24160 211.3375 B5 S 1 \n", "1 1 2 113781 151.5500 C22 C26 S 1 \n", "2 1 2 113781 151.5500 C22 C26 S 0 \n", "3 1 2 113781 151.5500 C22 C26 S 0 \n", "4 1 2 113781 151.5500 C22 C26 S 0 " ] }, "metadata": { "tags": [] }, "execution_count": 6 } ] }, { "metadata": { "id": "qhYyM3iGRZ8W", "colab_type": "text" }, "cell_type": "markdown", "source": [ "他们有不同的特征: \n", "* pclass: class of travel\n", "* name: full name of the passenger\n", "* sex: gender\n", "* age: numerical age\n", "* sibsp: # of siblings/spouse aboard\n", "* parch: number of parents/child aboard\n", "* ticket: ticket number\n", "* fare: cost of the ticket\n", "* cabin: location of room\n", "* emarked: port that the passenger embarked at (C - Cherbourg, S - Southampton, Q = Queenstown)\n", "* survived: survial metric (0 - died, 1 - survived)" ] }, { "metadata": { "id": "NBx5VP8K_y6N", "colab_type": "text" }, "cell_type": "markdown", "source": [ "# 探索性分析" ] }, { "metadata": { "id": "DD14WJ1G_zum", "colab_type": "text" }, "cell_type": "markdown", "source": [ "我们将使用Pandas库,看看如何探索和处理我们的数据。" ] }, { "metadata": { "id": "thR28yTmASRr", "colab_type": "code", "outputId": "798b1a62-eb6a-46d5-e527-449fc7f32bb2", "colab": { "base_uri": "https://localhost:8080/", "height": 282 } }, "cell_type": "code", "source": [ "# 描述性统计\n", "df.describe()" ], "execution_count": 7, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " pclass age sibsp parch fare \\\n", "count 1309.000000 1046.000000 1309.000000 1309.000000 1308.000000 \n", "mean 2.294882 29.881135 0.498854 0.385027 33.295479 \n", "std 0.837836 14.413500 1.041658 0.865560 51.758668 \n", "min 1.000000 0.166700 0.000000 0.000000 0.000000 \n", "25% 2.000000 21.000000 0.000000 0.000000 7.895800 \n", "50% 3.000000 28.000000 0.000000 0.000000 14.454200 \n", "75% 3.000000 39.000000 1.000000 0.000000 31.275000 \n", "max 3.000000 80.000000 8.000000 9.000000 512.329200 \n", "\n", " survived \n", "count 1309.000000 \n", "mean 0.381971 \n", "std 0.486055 \n", "min 0.000000 \n", "25% 0.000000 \n", "50% 0.000000 \n", "75% 1.000000 \n", "max 1.000000 " ] }, "metadata": { "tags": [] }, "execution_count": 7 } ] }, { "metadata": { "id": "Mn5HqS3XmzJs", "colab_type": "code", "outputId": "bb849441-d178-4559-bb7a-e3d9ef3caefb", "colab": { "base_uri": "https://localhost:8080/", "height": 364 } }, "cell_type": "code", "source": [ "# 直方图\n", "df[\"age\"].hist()" ], "execution_count": 8, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "" ] }, "metadata": { "tags": [] }, "execution_count": 8 }, { "output_type": "display_data", "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAAeQAAAFKCAYAAADMuCxnAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMS4yLCBo\ndHRwOi8vbWF0cGxvdGxpYi5vcmcvNQv5yAAAFwpJREFUeJzt3WtsU4f5x/FfGidyA0a51EZlGu1W\n0RKNlItgWhiwJtwCXVWgXLoI0FToYAQGBQYpizYmpAGBol5A4hqGYJcMV1ozDSmIISQ0BU8QKUuq\nTZS+mCilwYFA0lwYpOf/oqr/ZbRxmjrxcw7fz7ucmOPnEYm+8jE+pDiO4wgAACTVQ8keAAAAEGQA\nAEwgyAAAGECQAQAwgCADAGAAQQYAwABfMp88Gm1N6PmysjLU3Nye0HMmg1f2kNjFKnaxySu7eGUP\nKfG7BIOBL/2ep14h+3ypyR4hIbyyh8QuVrGLTV7ZxSt7SP27i6eCDACAWxFkAAAMIMgAABhAkAEA\nMIAgAwBgAEEGAMAAggwAgAEEGQAAAwgyAAAGEGQAAAwgyAAAGECQAQAwIKn/2xNsemnb6WSP0K2K\n0sJkjwAACccrZAAADCDIAAAYQJABADCAIAMAYABBBgDAAIIMAIABBBkAAAMIMgAABhBkAAAMIMgA\nABhAkAEAMIAgAwBgAEEGAMAAggwAgAEEGQAAAwgyAAAGEGQAAAwgyAAAGECQAQAwgCADAGCArycP\nKi8v14ULF3T37l0tW7ZMp0+f1rvvvqvMzExJ0pIlS/TMM8+oqqpKR44c0UMPPaT58+dr3rx5fTo8\nAABeETfI586d03vvvafKyko1Nzdr9uzZ+t73vqe1a9eqoKAg9rj29nbt2bNH4XBYaWlpmjt3rqZO\nnRqLNgAA+HJxgzxu3Dg9/fTTkqRBgwapo6NDXV1d9z2urq5OeXl5CgQCkqQxY8aotrZWhYWFCR4Z\nAADvifsecmpqqjIyMiRJ4XBYkyZNUmpqqo4dO6bFixfrlVde0Y0bN9TU1KTs7OzYn8vOzlY0Gu27\nyQEA8JAevYcsSadOnVI4HFZFRYUaGhqUmZmp3Nxc7d+/X7t379bo0aPvebzjOHHPmZWVIZ8v9atP\n3Y1gMJDQ8yWLV/boCy9tO53sEeL6y2vPJ3uEuLz0M8Yu9nhlD6n/dulRkM+ePau9e/fq4MGDCgQC\nys/Pj32vsLBQmzdv1vTp09XU1BQ7fu3aNY0aNarb8zY3t/dy7C8WDAYUjbYm9JzJ4JU9HmTW//68\n9DPGLvZ4ZQ8p8bt0F/e4l6xbW1tVXl6uffv2xf6B1qpVq3T58mVJUiQS0bBhwzRy5EjV19erpaVF\nbW1tqq2t1dixYxO0AgAA3hb3FfKJEyfU3NysNWvWxI7NmTNHa9as0cMPP6yMjAxt3bpVfr9f69at\n05IlS5SSkqKSkpLYP/ACAADdixvkBQsWaMGCBfcdnz179n3HioqKVFRUlJjJAAB4gHCnLgAADCDI\nAAAYQJABADCAIAMAYABBBgDAAIIMAIABBBkAAAMIMgAABhBkAAAMIMgAABhAkAEAMIAgAwBgAEEG\nAMAAggwAgAEEGQAAAwgyAAAGEGQAAAwgyAAAGECQAQAwgCADAGAAQQYAwACCDACAAQQZAAADCDIA\nAAYQZAAADCDIAAAYQJABADCAIAMAYABBBgDAAIIMAIABBBkAAAMIMgAABhBkAAAMIMgAABhAkAEA\nMIAgAwBgAEEGAMAAggwAgAEEGQAAAwgyAAAGEGQAAAwgyAAAGECQAQAwwNeTB5WXl+vChQu6e/eu\nli1bpry8PG3YsEFdXV0KBoPasWOH0tPTVVVVpSNHjuihhx7S/PnzNW/evL6eHwAAT4gb5HPnzum9\n995TZWWlmpubNXv2bOXn56u4uFgzZszQrl27FA6HNWvWLO3Zs
0fhcFhpaWmaO3eupk6dqszMzP7Y\nAwAAV4t7yXrcuHF64403JEmDBg1SR0eHIpGIJk+eLEkqKChQTU2N6urqlJeXp0AgIL/frzFjxqi2\ntrZvpwcAwCPiBjk1NVUZGRmSpHA4rEmTJqmjo0Pp6emSpJycHEWjUTU1NSk7Ozv257KzsxWNRvto\nbAAAvKVH7yFL0qlTpxQOh1VRUaFp06bFjjuO84WP/7Ljn5eVlSGfL7WnI/RIMBhI6PmSxSt7PKjc\n8Pfnhhl7il3s8coeUv/t0qMgnz17Vnv37tXBgwcVCASUkZGhzs5O+f1+NTY2KhQKKRQKqampKfZn\nrl27plGjRnV73ubm9q83/f8IBgOKRlsTes5k8MoeDzLrf39e+hljF3u8soeU+F26i3vcS9atra0q\nLy/Xvn37Yv9Aa/z48aqurpYknTx5UhMnTtTIkSNVX1+vlpYWtbW1qba2VmPHjk3QCgAAeFvcV8gn\nTpxQc3Oz1qxZEzu2bds2lZWVqbKyUkOGDNGsWbOUlpamdevWacmSJUpJSVFJSYkCAe9csgAAoC/F\nDfKCBQu0YMGC+44fPnz4vmNFRUUqKipKzGQAADxAuFMXAAAGEGQAAAwgyAAAGECQAQAwgCADAGAA\nQQYAwACCDACAAQQZAAADCDIAAAYQZAAADCDIAAAYQJABADCAIAMAYABBBgDAAIIMAIABBBkAAAMI\nMgAABhBkAAAMIMgAABhAkAEAMIAgAwBgAEEGAMAAggwAgAEEGQAAAwgyAAAGEGQAAAwgyAAAGECQ\nAQAwgCADAGAAQQYAwACCDACAAQQZAAADCDIAAAYQZAAADCDIAAAY4Ev2AIAXvbTtdLJH6NZfXns+\n2SMA+B+8QgYAwACCDACAAQQZAAADCDIAAAYQZAAADCDIAAAYQJABADCgR0G+ePGipkyZomPHjkmS\nSktL9dxzz2nRokVatGiRzpw5I0mqqqrSCy+8oHnz5un48eN9NjQAAF4T98Yg7e3t2rJli/Lz8+85\nvnbtWhUUFNzzuD179igcDistLU1z587V1KlTlZmZmfipAQDwmLivkNPT03XgwAGFQqFuH1dXV6e8\nvDwFAgH5/X6NGTNGtbW1CRsUAAAvixtkn88nv99/3/Fjx45p8eLFeuWVV3Tjxg01NTUpOzs79v3s\n7GxFo9HETgsAgEf16l7Wzz//vDIzM5Wbm6v9+/dr9+7dGj169D2PcRwn7nmysjLk86X2ZoQvFQwG\nEnq+ZPHKHrDLSz9j7GKPV/aQ+m+XXgX58+8nFxYWavPmzZo+fbqamppix69du6ZRo0Z1e57m5vbe\nPP2XCgYDikZbE3rOZPDKHrDNKz9jXvp98couXtlDSvwu3cW9Vx97WrVqlS5fvixJikQiGjZsmEaO\nHKn6+nq1tLSora1NtbW1Gjt2bO8mBgDgARP3FXJDQ4O2b9+uK1euyOfzqbq6WgsXLtSaNWv08MMP\nKyMjQ1u3bpXf79e6deu0ZMkSpaSkqKSkRIGAdy5ZAADQl+IGecSIETp69Oh9x6dPn37fsaKiIhUV\nFSVmMgAAHiDcqQsAAAMIMgAABhBkAAAMIMgAABhAkAEAMIAgAwBgAEEGAMAAggwAgAEEGQAAAwgy\nAAAGEGQAAAwgyAAAGECQAQAwgCADAGAAQQYAwACCDACAAQQZAAADCDIAAAYQZAAADCDIAAAYQJAB\nADCAIAMAYABBBgDAAIIMAIABBBkAAAMIMgAABhBkAAAMIMgAABhAkAEAMIAgAwBgAEEGAMAAggwA\ngAEEGQAAAwgyAAAGEGQAAAwgyAAAGECQAQAwgCADAGAAQQYAwABfsgcA0P+eW/dOskeIq6K0MNkj\nAP2KV8gAABhAkAEAMIAgAwBgQI+CfPHiRU2ZMkXHjh2TJF29elWLFi1ScXGxVq9erf/+97+SpKqq\nKr3wwguaN2+ejh8/3ndTA
wDgMXGD3N7eri1btig/Pz927M0331RxcbF+//vf67HHHlM4HFZ7e7v2\n7Nmj3/72tzp69KiOHDmimzdv9unwAAB4Rdwgp6en68CBAwqFQrFjkUhEkydPliQVFBSopqZGdXV1\nysvLUyAQkN/v15gxY1RbW9t3kwMA4CFxP/bk8/nk8937sI6ODqWnp0uScnJyFI1G1dTUpOzs7Nhj\nsrOzFY1GEzwuAADe9LU/h+w4zlc6/nlZWRny+VK/7gj3CAYDCT1fsnhlD6C3vsrvgJd+X7yyi1f2\nkPpvl14FOSMjQ52dnfL7/WpsbFQoFFIoFFJTU1PsMdeuXdOoUaO6PU9zc3tvnv5LBYMBRaOtCT1n\nMnhlD+Dr6OnvgJd+X7yyi1f2kBK/S3dx79XHnsaPH6/q6mpJ0smTJzVx4kSNHDlS9fX1amlpUVtb\nm2prazV27NjeTQwAwAMm7ivkhoYGbd++XVeuXJHP51N1dbV27typ0tJSVVZWasiQIZo1a5bS0tK0\nbt06LVmyRCkpKSopKVEg4J1LFgAA9KW4QR4xYoSOHj163/HDhw/fd6yoqEhFRUWJmQwAgAcId+oC\nAMAAggwAgAEEGQAAAwgyAAAGEGQAAAwgyAAAGECQAQAwgCADAGAAQQYAwACCDACAAQQZAAADCDIA\nAAYQZAAADCDIAAAYQJABADCAIAMAYABBBgDAAIIMAIABBBkAAAMIMgAABhBkAAAMIMgAABhAkAEA\nMIAgAwBggC/ZAyTSc+veSfYIcVWUFiZ7BACAQbxCBgDAAIIMAIABBBkAAAMIMgAABhBkAAAMIMgA\nABjgqY89AfCOl7adTvYI3eIjjEg0XiEDAGAAQQYAwAAuWfcz65fhAADJwStkAAAMIMgAABhAkAEA\nMIAgAwBgAEEGAMAAggwAgAEEGQAAAwgyAAAG9OrGIJFIRKtXr9awYcMkSU8++aSWLl2qDRs2qKur\nS8FgUDt27FB6enpChwUAwKt6faeu7373u3rzzTdjX7/66qsqLi7WjBkztGvXLoXDYRUXFydkSAAA\nvC5hl6wjkYgmT54sSSooKFBNTU2iTg0AgOf1+hXypUuXtHz5ct26dUsrV65UR0dH7BJ1Tk6OotFo\nwoYEAMDrehXkxx9/XCtXrtSMGTN0+fJlLV68WF1dXbHvO47To/NkZWXI50vtzQgAkFTBYMBV5+1v\nXtlD6r9dehXkwYMHa+bMmZKkoUOH6pFHHlF9fb06Ozvl9/vV2NioUCgU9zzNze29eXoASLpotDXh\n5wwGA31y3v7mlT2kxO/SXdx79R5yVVWVDh06JEmKRqO6fv265syZo+rqaknSyZMnNXHixN6cGgCA\nB1KvXiEXFhZq/fr1+tvf/qY7d+5o8+bNys3N1caNG1VZWakhQ4Zo1qxZiZ4VAADP6lWQBw4cqL17\n9953/PDhw197IAAAHkTcqQsAAAMIMgAABhBkAAAMIMgAABjQ6zt1AcCD7KVtp5M9QlwVpYXJHgFf\nAa+QAQAwgCADAGAAQQYAwACCDACAAQQZAAADCDIAAAYQZAAADCDIAAAYQJABADCAIAMAYABBBgDA\nAIIMAIABBBkAAAMIMgAABhBkAAAMIMgAABhAkAEAMIAgAwBgAEEGAMAAggwAgAEEGQAAAwgyAAAG\nEGQAAAwgyAAAGECQAQAwgCADAGAAQQYAwACCDACAAQQZAAADCDIAAAYQZAAADPAlewAAQN94advp\nZI8QV0VpYbJHMINXyAAAGECQAQAwgCADAGAAQQYAwACCDACAAQQZAAADEv6xp9/85jeqq6tTSkqK\nNm3apKeffjrRTwEAgOckNMj/+Mc/9J///EeVlZV6//33tWnTJlVWVibyKQAAHmL9s9J/ee35fnuu\nhF6yrqmp0ZQpUyRJTzzxhG7duqWPP/44kU8BAIAnJTTITU1NysrKin2dnZ2taDSayKcAAMC
T+vTW\nmY7jdPv9YDCQ0Ofrz0sLAIAHQ6Jb9WUS+go5FAqpqakp9vW1a9cUDAYT+RQAAHhSQoP8/e9/X9XV\n1ZKkd999V6FQSAMHDkzkUwAA4EkJvWQ9ZswYfec739GLL76olJQU/epXv0rk6QEA8KwUJ94bvQAA\noM9xpy4AAAwgyAAAGNCnH3vqL26/XefFixe1YsUK/fjHP9bChQt19epVbdiwQV1dXQoGg9qxY4fS\n09OTPWaPlJeX68KFC7p7966WLVumvLw8V+7S0dGh0tJSXb9+Xbdv39aKFSs0fPhwV+4iSZ2dnfrh\nD3+oFStWKD8/35V7RCIRrV69WsOGDZMkPfnkk1q6dKkrd5GkqqoqHTx4UD6fTz/72c/01FNPuXKX\n48ePq6qqKvZ1Q0OD/vCHP2jz5s2SpKeeekq//vWvkzTdV9PW1qaNGzfq1q1bunPnjkpKShQMBvtv\nF8flIpGI85Of/MRxHMe5dOmSM3/+/CRP9NW0tbU5CxcudMrKypyjR486juM4paWlzokTJxzHcZzX\nXnvN+d3vfpfMEXuspqbGWbp0qeM4jnPjxg3nBz/4gWt3+etf/+rs37/fcRzH+eCDD5xp06a5dhfH\ncZxdu3Y5c+bMcd5++23X7nHu3Dln1apV9xxz6y43btxwpk2b5rS2tjqNjY1OWVmZa3f5vEgk4mze\nvNlZuHChU1dX5ziO46xdu9Y5c+ZMkifrmaNHjzo7d+50HMdxPvroI2f69On9uovrL1m7/Xad6enp\nOnDggEKhUOxYJBLR5MmTJUkFBQWqqalJ1nhfybhx4/TGG29IkgYNGqSOjg7X7jJz5ky9/PLLkqSr\nV69q8ODBrt3l/fff16VLl/TMM89Icu/P1xdx6y41NTXKz8/XwIEDFQqFtGXLFtfu8nl79uzRyy+/\nrCtXrsSuVLppl6ysLN28eVOS1NLSoszMzH7dxfVBdvvtOn0+n/x+/z3HOjo6YpeqcnJyXLNPamqq\nMjIyJEnhcFiTJk1y7S6fefHFF7V+/Xpt2rTJtbts375dpaWlsa/duockXbp0ScuXL9ePfvQj/f3v\nf3ftLh988IE6Ozu1fPlyFRcXq6amxrW7fOaf//ynHn30UaWmpmrQoEGx427a5dlnn9WHH36oqVOn\nauHChdqwYUO/7uKJ95A/z/HYp7jcuM+pU6cUDodVUVGhadOmxY67cZc//vGP+te//qWf//zn98zv\nll3+/Oc/a9SoUfrmN7/5hd93yx6S9Pjjj2vlypWaMWOGLl++rMWLF6urqyv2fTftIkk3b97U7t27\n9eGHH2rx4sWu/Pn6vHA4rNmzZ9933E27vPPOOxoyZIgOHTqkf//73yopKVEg8P+3zezrXVwfZC/e\nrjMjI0OdnZ3y+/1qbGy853K2dWfPntXevXt18OBBBQIB1+7S0NCgnJwcPfroo8rNzVVXV5cGDBjg\nul3OnDmjy5cv68yZM/roo4+Unp7u2r+TwYMHa+bMmZKkoUOH6pFHHlF9fb0rd8nJydHo0aPl8/k0\ndOhQDRgwQKmpqa7c5TORSERlZWVKSUmJXfaV5KpdamtrNWHCBEnS8OHDdfv2bd29ezf2/b7exfWX\nrL14u87x48fHdjp58qQmTpyY5Il6prW1VeXl5dq3b58yMzMluXeX8+fPq6KiQtKnb4u0t7e7cpfX\nX39db7/9tv70pz9p3rx5WrFihSv3kD79V8mHDh2SJEWjUV2/fl1z5sxx5S4TJkzQuXPn9Mknn6i5\nudm1P1+faWxs1IABA5Senq60tDR9+9vf1vnz5yW5a5fHHntMdXV1kqQrV65owIABeuKJJ/ptF0/c\nqWvnzp06f/587Hadw4cPT/ZIPdbQ0KDt27frypUr8vl8Gjx4sHbu3KnS0lLdvn1bQ4YM0datW5WW\nlpbsUeOqrKzUW2+9pW9961uxY9u2bVNZWZnrduns7NQ
vfvELXb16VZ2dnVq5cqVGjBihjRs3um6X\nz7z11lv6xje+oQkTJrhyj48//ljr169XS0uL7ty5o5UrVyo3N9eVu0ifvh0SDoclST/96U+Vl5fn\n2l0aGhr0+uuv6+DBg5I+fa//l7/8pT755BONHDlSr776apIn7Jm2tjZt2rRJ169f1927d7V69WoF\ng8F+28UTQQYAwO1cf8kaAAAvIMgAABhAkAEAMIAgAwBgAEEGAMAAggwAgAEEGQAAAwgyAAAG/B9C\nr3yxdV1VfwAAAABJRU5ErkJggg==\n", "text/plain": [ "" ] }, "metadata": { "tags": [] } } ] }, { "metadata": { "id": "7illbHR1nLEF", "colab_type": "code", "outputId": "38dd59af-3901-485d-8335-8d9aef4cb204", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "cell_type": "code", "source": [ "# Unique values\n", "df[\"embarked\"].unique()" ], "execution_count": 9, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array(['S', 'C', nan, 'Q'], dtype=object)" ] }, "metadata": { "tags": [] }, "execution_count": 9 } ] }, { "metadata": { "id": "BG1IMeV_hrqV", "colab_type": "code", "outputId": "5499f70f-9a64-4371-b185-e1f56b62bbb9", "colab": { "base_uri": "https://localhost:8080/", "height": 118 } }, "cell_type": "code", "source": [ "# Select data by feature (column)\n", "df[\"name\"].head()" ], "execution_count": 10, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 Allen, Miss. Elisabeth Walton\n", "1 Allison, Master. Hudson Trevor\n", "2 Allison, Miss. Helen Loraine\n", "3 Allison, Mr. Hudson Joshua Creighton\n", "4 Allison, Mrs. Hudson J C (Bessie Waldo Daniels)\n", "Name: name, dtype: object" ] }, "metadata": { "tags": [] }, "execution_count": 10 } ] }, { "metadata": { "id": "wPrRGLDtiZSp", "colab_type": "code", "outputId": "7e32434d-0a96-44c0-e54d-2dc45e04c5c7", "colab": { "base_uri": "https://localhost:8080/", "height": 194 } }, "cell_type": "code", "source": [ "# Filtering\n", "df[df[\"sex\"] == \"female\"].head()  # only rows for female passengers are shown" ], "execution_count": 11, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " pclass name sex age \\\n", "0 1 Allen, Miss. Elisabeth Walton female 29.0 \n", "2 1 Allison, Miss. Helen Loraine female 2.0 \n", "4 1 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female 25.0 \n", "6 1 Andrews, Miss. Kornelia Theodosia female 63.0 \n", "8 1 Appleton, Mrs. Edward Dale (Charlotte Lamson) female 53.0 \n", "\n", " sibsp parch ticket fare cabin embarked survived \n", "0 0 0 24160 211.3375 B5 S 1 \n", "2 1 2 113781 151.5500 C22 C26 S 0 \n", "4 1 2 113781 151.5500 C22 C26 S 0 \n", "6 1 0 13502 77.9583 D7 S 1 \n", "8 2 0 11769 51.4792 C101 S 1 " ] }, "metadata": { "tags": [] }, "execution_count": 11 } ] }, { "metadata": { "id": "FOuLeYIojMMH", "colab_type": "code", "outputId": "aa4fbb43-cf32-4e6c-e92f-c97ad79da9d1", "colab": { "base_uri": "https://localhost:8080/", "height": 194 } }, "cell_type": "code", "source": [ "# 排序\n", "df.sort_values(\"age\", ascending=False).head()" ], "execution_count": 12, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " pclass name sex age \\\n", "14 1 Barkworth, Mr. Algernon Henry Wilson male 80.0 \n", "61 1 Cavendish, Mrs. Tyrell William (Julia Florence... female 76.0 \n", "1235 3 Svensson, Mr. Johan male 74.0 \n", "135 1 Goldschmidt, Mr. George B male 71.0 \n", "9 1 Artagaveytia, Mr. Ramon male 71.0 \n", "\n", " sibsp parch ticket fare cabin embarked survived \n", "14 0 0 27042 30.0000 A23 S 1 \n", "61 1 0 19877 78.8500 C46 S 1 \n", "1235 0 0 347060 7.7750 NaN S 0 \n", "135 0 0 PC 17754 34.6542 A5 C 0 \n", "9 0 0 PC 17609 49.5042 NaN C 0 " ] }, "metadata": { "tags": [] }, "execution_count": 12 } ] }, { "metadata": { "id": "v0TCbtSMjMO5", "colab_type": "code", "outputId": "5c602a41-3250-48e1-f5c2-973340a20708", "colab": { "base_uri": "https://localhost:8080/", "height": 135 } }, "cell_type": "code", "source": [ "# Grouping(数据聚合与分组运算)\n", "sex_group = df.groupby(\"survived\")\n", "sex_group.mean()" ], "execution_count": 13, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " pclass age sibsp parch fare\n", "survived \n", "0 2.500618 30.545369 0.521632 0.328801 23.353831\n", "1 1.962000 28.918228 0.462000 0.476000 49.361184" ] }, "metadata": { "tags": [] }, "execution_count": 13 } ] }, { "metadata": { "id": "34LmckWDhdSA", "colab_type": "code", "outputId": "0befc9a0-b30b-472b-ef19-e092b8da6f28", "colab": { "base_uri": "https://localhost:8080/", "height": 220 } }, "cell_type": "code", "source": [ "# iloc根据位置的索引来访问\n", "df.iloc[0, :] # iloc在索引中的特定位置获取行(或列)(因此它只需要整数)" ], "execution_count": 14, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "pclass 1\n", "name Allen, Miss. Elisabeth Walton\n", "sex female\n", "age 29\n", "sibsp 0\n", "parch 0\n", "ticket 24160\n", "fare 211.338\n", "cabin B5\n", "embarked S\n", "survived 1\n", "Name: 0, dtype: object" ] }, "metadata": { "tags": [] }, "execution_count": 14 } ] }, { "metadata": { "id": "QrdXeuRdFkXB", "colab_type": "code", "outputId": "1f6b255f-7ed1-463c-adb6-55a026881d56", "colab": { "base_uri": "https://localhost:8080/", "height": 34 } }, "cell_type": "code", "source": [ "# 获取指定位置的数据\n", "df.iloc[0, 1]\n" ], "execution_count": 15, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "'Allen, Miss. Elisabeth Walton'" ] }, "metadata": { "tags": [] }, "execution_count": 15 } ] }, { "metadata": { "id": "Rz35_-x2FkaL", "colab_type": "code", "outputId": "4d50ef2b-f1d7-4b50-84ea-44a11d1b360a", "colab": { "base_uri": "https://localhost:8080/", "height": 220 } }, "cell_type": "code", "source": [ "# loc根据标签的索引来访问\n", "df.loc[0] # loc从索引中获取具有特定标签的行(或列)" ], "execution_count": 16, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "pclass 1\n", "name Allen, Miss. 
Elisabeth Walton\n", "sex female\n", "age 29\n", "sibsp 0\n", "parch 0\n", "ticket 24160\n", "fare 211.338\n", "cabin B5\n", "embarked S\n", "survived 1\n", "Name: 0, dtype: object" ] }, "metadata": { "tags": [] }, "execution_count": 16 } ] }, { "metadata": { "id": "uSezrq4vEFYh", "colab_type": "text" }, "cell_type": "markdown", "source": [ "# 预处理" ] }, { "metadata": { "id": "EZ1pCKHIjMUY", "colab_type": "code", "outputId": "90120eb7-0764-4feb-f084-399d4f7f3081", "colab": { "base_uri": "https://localhost:8080/", "height": 194 } }, "cell_type": "code", "source": [ "# 具有至少一个NaN值的行\n", "df[pd.isnull(df).any(axis=1)].head()" ], "execution_count": 17, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " pclass name sex age sibsp parch \\\n", "9 1 Artagaveytia, Mr. Ramon male 71.0 0 0 \n", "13 1 Barber, Miss. Ellen \"Nellie\" female 26.0 0 0 \n", "15 1 Baumann, Mr. John D male NaN 0 0 \n", "23 1 Bidois, Miss. Rosalie female 42.0 0 0 \n", "25 1 Birnbaum, Mr. Jakob male 25.0 0 0 \n", "\n", " ticket fare cabin embarked survived \n", "9 PC 17609 49.5042 NaN C 0 \n", "13 19877 78.8500 NaN S 1 \n", "15 PC 17318 25.9250 NaN S 0 \n", "23 PC 17757 227.5250 NaN C 1 \n", "25 13905 26.0000 NaN C 0 " ] }, "metadata": { "tags": [] }, "execution_count": 17 } ] }, { "metadata": { "id": "zUaiFplEkmoB", "colab_type": "code", "outputId": "65b701a3-6d72-4914-f920-67561a59a73e", "colab": { "base_uri": "https://localhost:8080/", "height": 194 } }, "cell_type": "code", "source": [ "# 删除具有Nan值的行\n", "df = df.dropna() # 删除具有NaN值的行\n", "df = df.reset_index() # 重置行索引\n", "df.head()" ], "execution_count": 18, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " index pclass name sex \\\n", "0 0 1 Allen, Miss. Elisabeth Walton female \n", "1 1 1 Allison, Master. Hudson Trevor male \n", "2 2 1 Allison, Miss. Helen Loraine female \n", "3 3 1 Allison, Mr. Hudson Joshua Creighton male \n", "4 4 1 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) female \n", "\n", " age sibsp parch ticket fare cabin embarked survived \n", "0 29.0000 0 0 24160 211.3375 B5 S 1 \n", "1 0.9167 1 2 113781 151.5500 C22 C26 S 1 \n", "2 2.0000 1 2 113781 151.5500 C22 C26 S 0 \n", "3 30.0000 1 2 113781 151.5500 C22 C26 S 0 \n", "4 25.0000 1 2 113781 151.5500 C22 C26 S 0 " ] }, "metadata": { "tags": [] }, "execution_count": 18 } ] }, { "metadata": { "id": "ubujZv_8qG-d", "colab_type": "code", "outputId": "3a397700-f1ff-496b-e585-ef0ad7c2b37f", "colab": { "base_uri": "https://localhost:8080/", "height": 194 } }, "cell_type": "code", "source": [ "# 删除多行\n", "df = df.drop([\"name\", \"cabin\", \"ticket\"], axis=1) # we won't use text features for our initial basic models\n", "df.head()" ], "execution_count": 19, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " index pclass sex age sibsp parch fare embarked survived\n", "0 0 1 female 29.0000 0 0 211.3375 S 1\n", "1 1 1 male 0.9167 1 2 151.5500 S 1\n", "2 2 1 female 2.0000 1 2 151.5500 S 0\n", "3 3 1 male 30.0000 1 2 151.5500 S 0\n", "4 4 1 female 25.0000 1 2 151.5500 S 0" ] }, "metadata": { "tags": [] }, "execution_count": 19 } ] }, { "metadata": { "id": "8m117GcVnon9", "colab_type": "code", "outputId": "ee33bb3e-c64b-4e2c-8e82-03e669b7bcba", "colab": { "base_uri": "https://localhost:8080/", "height": 194 } }, "cell_type": "code", "source": [ "# 映射特征值\n", "df['sex'] = df['sex'].map( {'female': 0, 'male': 1} ).astype(int)\n", "df[\"embarked\"] = df['embarked'].dropna().map( {'S':0, 'C':1, 'Q':2} ).astype(int)\n", "df.head()" ], "execution_count": 20, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
" ], "text/plain": [ " index pclass sex age sibsp parch fare embarked survived\n", "0 0 1 0 29.0000 0 0 211.3375 0 1\n", "1 1 1 1 0.9167 1 2 151.5500 0 1\n", "2 2 1 0 2.0000 1 2 151.5500 0 0\n", "3 3 1 1 30.0000 1 2 151.5500 0 0\n", "4 4 1 0 25.0000 1 2 151.5500 0 0" ] }, "metadata": { "tags": [] }, "execution_count": 20 } ] }, { "metadata": { "id": "ZaVqjpsCEtft", "colab_type": "text" }, "cell_type": "markdown", "source": [ "# 特征工程" ] }, { "metadata": { "id": "_FPtk5tpqrDI", "colab_type": "code", "outputId": "72c426bb-005d-4b47-a9d4-8cbe28e936c5", "colab": { "base_uri": "https://localhost:8080/", "height": 194 } }, "cell_type": "code", "source": [ "# lambda表达式创建新特征\n", "def get_family_size(sibsp, parch):\n", " family_size = sibsp + parch\n", " return family_size\n", "\n", "df[\"family_size\"] = df[[\"sibsp\", \"parch\"]].apply(lambda x: get_family_size(x[\"sibsp\"], x[\"parch\"]), axis=1)\n", "df.head()" ], "execution_count": 21, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>index</th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>fare</th><th>embarked</th><th>survived</th><th>family_size</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>1</td><td>0</td><td>29.0000</td><td>0</td><td>0</td><td>211.3375</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>1</td><td>1</td><td>0.9167</td><td>1</td><td>2</td><td>151.5500</td><td>0</td><td>1</td><td>3</td></tr>\n",
"    <tr><th>2</th><td>2</td><td>1</td><td>0</td><td>2.0000</td><td>1</td><td>2</td><td>151.5500</td><td>0</td><td>0</td><td>3</td></tr>\n",
"    <tr><th>3</th><td>3</td><td>1</td><td>1</td><td>30.0000</td><td>1</td><td>2</td><td>151.5500</td><td>0</td><td>0</td><td>3</td></tr>\n",
"    <tr><th>4</th><td>4</td><td>1</td><td>0</td><td>25.0000</td><td>1</td><td>2</td><td>151.5500</td><td>0</td><td>0</td><td>3</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " index pclass sex age sibsp parch fare embarked survived \\\n", "0 0 1 0 29.0000 0 0 211.3375 0 1 \n", "1 1 1 1 0.9167 1 2 151.5500 0 1 \n", "2 2 1 0 2.0000 1 2 151.5500 0 0 \n", "3 3 1 1 30.0000 1 2 151.5500 0 0 \n", "4 4 1 0 25.0000 1 2 151.5500 0 0 \n", "\n", " family_size \n", "0 0 \n", "1 3 \n", "2 3 \n", "3 3 \n", "4 3 " ] }, "metadata": { "tags": [] }, "execution_count": 21 } ] }, { "metadata": { "id": "JK3FqfjnpSNi", "colab_type": "code", "outputId": "80aedded-c9f6-40cb-b073-cf8afab75531", "colab": { "base_uri": "https://localhost:8080/", "height": 194 } }, "cell_type": "code", "source": [ "# 重新组织标题\n", "df = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'family_size', 'fare', 'embarked', 'survived']]\n", "df.head()" ], "execution_count": 22, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>pclass</th><th>sex</th><th>age</th><th>sibsp</th><th>parch</th><th>family_size</th><th>fare</th><th>embarked</th><th>survived</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>0</td><td>29.0000</td><td>0</td><td>0</td><td>0</td><td>211.3375</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>1</td><td>0.9167</td><td>1</td><td>2</td><td>3</td><td>151.5500</td><td>0</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>0</td><td>2.0000</td><td>1</td><td>2</td><td>3</td><td>151.5500</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>1</td><td>30.0000</td><td>1</td><td>2</td><td>3</td><td>151.5500</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>4</th><td>1</td><td>0</td><td>25.0000</td><td>1</td><td>2</td><td>3</td><td>151.5500</td><td>0</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " pclass sex age sibsp parch family_size fare embarked \\\n", "0 1 0 29.0000 0 0 0 211.3375 0 \n", "1 1 1 0.9167 1 2 3 151.5500 0 \n", "2 1 0 2.0000 1 2 3 151.5500 0 \n", "3 1 1 30.0000 1 2 3 151.5500 0 \n", "4 1 0 25.0000 1 2 3 151.5500 0 \n", "\n", " survived \n", "0 1 \n", "1 1 \n", "2 0 \n", "3 0 \n", "4 0 " ] }, "metadata": { "tags": [] }, "execution_count": 22 } ] }, { "metadata": { "id": "N_rwgfrFGTne", "colab_type": "text" }, "cell_type": "markdown", "source": [ "# 保存数据" ] }, { "metadata": { "id": "rNNxA7Vrp2fC", "colab_type": "code", "colab": {} }, "cell_type": "code", "source": [ "# 保存数据帧(dataframe)到 CSV\n", "df.to_csv(\"processed_titanic.csv\", index=False)" ], "execution_count": 0, "outputs": [] }, { "metadata": { "id": "gfc7Epp7sgqz", "colab_type": "code", "outputId": "e1e21143-2e87-43cb-aa16-b0af114a1b4a", "colab": { "base_uri": "https://localhost:8080/", "height": 84 } }, "cell_type": "code", "source": [ "# 看你一下你保持的文件\n", "!ls -l" ], "execution_count": 24, "outputs": [ { "output_type": "stream", "text": [ "total 96\n", "-rw-r--r-- 1 root root 6975 Dec 16 12:46 processed_titanic.csv\n", "drwxr-xr-x 1 root root 4096 Dec 10 17:34 sample_data\n", "-rw-r--r-- 1 root root 85153 Dec 16 12:46 titanic.csv\n" ], "name": "stdout" } ] }, { "metadata": { "id": "i1rVSjsdDaTw", "colab_type": "code", "colab": {} }, "cell_type": "code", "source": [ "" ], "execution_count": 0, "outputs": [] } ] }