{ "cells": [ { "cell_type": "markdown", "id": "ed1f74e7", "metadata": {}, "source": [ "# 二维数据结构DataFrame对象" ] }, { "cell_type": "markdown", "id": "717b29b8", "metadata": {}, "source": [ "DataFrame对象是一种二维带标记数据结构,不同列的数据类型可以不同。为了方便理解,可以将DataFrame对象看成一张Excel电子表格,或者是一个由多列Series对象构成的字典。" ] }, { "cell_type": "code", "execution_count": 1, "id": "b0182569", "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 2, "id": "79178b2b", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "attachments": {}, "cell_type": "markdown", "id": "698481f5", "metadata": {}, "source": [ "## DataFrame对象的生成\n", "\n", "与Series类似,DataFrame对象也可以由多种类型的数据生成:\n", "- 由Series对象为值构成的字典。\n", "- 由一维数组或列表构成的字典。\n", "- 由字典构成的列表或数组。" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2e7cb22e", "metadata": {}, "source": [ "### 使用Series对象构成的字典生成" ] }, { "attachments": {}, "cell_type": "markdown", "id": "821beb00", "metadata": {}, "source": [ "DataFrame对象可以从一组由Series对象为值构成的字典中生成。字典中的值除了Series对象,也可以是另一个字典,因为字典被转换为Series对象。\n", "\n", "假设有一个包含两个Series对象的字典d:" ] }, { "cell_type": "code", "execution_count": 3, "id": "98d20646", "metadata": {}, "outputs": [], "source": [ "s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])" ] }, { "cell_type": "code", "execution_count": 4, "id": "315ddecb", "metadata": {}, "outputs": [], "source": [ "s2 = pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])" ] }, { "cell_type": "code", "execution_count": 5, "id": "1e13fa34", "metadata": {}, "outputs": [], "source": [ "d = {\"one\": s1, \"two\": s2}" ] }, { "attachments": {}, "cell_type": "markdown", "id": "49f79ba0", "metadata": {}, "source": [ "可以用字典d构造一个DataFrame对象:" ] }, { "cell_type": "code", "execution_count": 6, "id": "79c735da", "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(d)" ] }, { "cell_type": "code", "execution_count": 7, "id": "d65d98a8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>one</th>\n", " <th>two</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>a</th>\n", " <td>1.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>b</th>\n", " <td>2.0</td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>c</th>\n", " <td>3.0</td>\n", " <td>3.0</td>\n", " </tr>\n", " <tr>\n", " <th>d</th>\n", " <td>NaN</td>\n", " <td>4.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " one two\n", "a 1.0 1.0\n", "b 2.0 2.0\n", "c 3.0 3.0\n", "d NaN 4.0" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e1f02d1b", "metadata": {}, "source": [ "与Series相比,DataFrame对象要区分不同的行和列,因此有行标记和列标记之分。默认情况下,df的列标记是传入字典的键,可以用属性`.columns`查看:" ] }, { "cell_type": "code", "execution_count": 8, "id": "fa4a04fb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['one', 'two'], dtype='object')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ff9acb82", "metadata": {}, "source": [ "行标记是两个Series对象标记的并集,Pandas会自动将两个Series对象的标记进行对齐:" ] }, { "cell_type": "code", "execution_count": 9, "id": "34b63bf8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['a', 'b', 'c', 'd'], dtype='object')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.index" ] }, { "cell_type": "markdown", "id": "741871ba", "metadata": {}, "source": [ "在生成DataFrame时,也可以指定index和columns参数:" ] }, { "cell_type": "code", "execution_count": 10, "id": "227afaa7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>one</th>\n", " <th>two</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>d</th>\n", " <td>NaN</td>\n", " <td>4.0</td>\n", " </tr>\n", " <tr>\n", " <th>b</th>\n", " <td>2.0</td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>a</th>\n", " <td>1.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " one two\n", "d NaN 4.0\n", "b 2.0 2.0\n", "a 1.0 1.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(d, index=[\"d\", \"b\", \"a\"])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a50f5bc2", "metadata": {}, "source": [ "Pandas会按照给定的顺序从传入的数据中寻找对应的值,如果该值不存在,则使用缺省值`np.nan`:" ] }, { "cell_type": "code", "execution_count": 11, "id": "492f3b05", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>two</th>\n", " <th>three</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>d</th>\n", " <td>4.0</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>b</th>\n", " <td>2.0</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>a</th>\n", " <td>1.0</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " two three\n", "d 4.0 NaN\n", "b 2.0 NaN\n", "a 1.0 NaN" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2cef3d5d", "metadata": {}, "source": [ "### 使用一维数组构成的字典生成" ] }, { "attachments": {}, "cell_type": "markdown", "id": "eb0bb12a", "metadata": {}, "source": [ "DataFrame对象还可以使用由一维数组或列表构成的字典生成,这些数组和列表必须是等长的:" ] }, { "cell_type": "code", "execution_count": 12, "id": "a097d435", "metadata": {}, "outputs": [], "source": [ "d = {'one' : [1., 2., 3., 4.],\n", " 'two' : [4., 3., 2., 1.]}" ] }, { "cell_type": "code", "execution_count": 13, "id": "8d58e2f8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>one</th>\n", " <th>two</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1.0</td>\n", " <td>4.0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2.0</td>\n", " <td>3.0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3.0</td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " one two\n", "0 1.0 4.0\n", "1 2.0 3.0\n", "2 3.0 2.0\n", "3 4.0 1.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(d)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "0333d58f", "metadata": {}, "source": [ "传入index参数时,该参数的长度也必须与列表长度一致:" ] }, { "cell_type": "code", "execution_count": 14, "id": "21661d1e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>one</th>\n", " <th>two</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>a</th>\n", " <td>1.0</td>\n", " <td>4.0</td>\n", " </tr>\n", " <tr>\n", " <th>b</th>\n", " <td>2.0</td>\n", " <td>3.0</td>\n", " </tr>\n", " <tr>\n", " <th>c</th>\n", " <td>3.0</td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>d</th>\n", " <td>4.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " one two\n", "a 1.0 4.0\n", "b 2.0 3.0\n", "c 3.0 2.0\n", "d 4.0 1.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(d, index=['a', 'b', 'c', 'd'])" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9b2c149f", "metadata": {}, "source": [ "### 使用字典数组生成" ] }, { "cell_type": "markdown", "id": "3f5efc2a", "metadata": {}, "source": [ "还可以使用字典构成的数组或列表进行构建:" ] }, { "cell_type": "code", "execution_count": 15, "id": "57d4e5b2", "metadata": {}, "outputs": [], "source": [ "data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]" ] }, { "cell_type": "code", "execution_count": 16, "id": "fd002d26", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>a</th>\n", " <th>b</th>\n", " <th>c</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>5</td>\n", " <td>10</td>\n", " <td>20.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " a b c\n", "0 1 2 NaN\n", "1 5 10 20.0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(data)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e6448551", "metadata": {}, "source": [ "与Series不同的是,字典的键对应的是列标记,行标记由数组或列表的大小决定。" ] }, { "cell_type": "markdown", "id": "28b0a139", "metadata": {}, "source": [ "### 使用二维数组生成" ] }, { "cell_type": "markdown", "id": "e73d4783", "metadata": {}, "source": [ "还可以使用NumPy的二维数组生成:" ] }, { "cell_type": "code", "execution_count": 17, "id": "5e70aa1e", "metadata": {}, "outputs": [], "source": [ "a = np.array([[1,2,3], [4,5,6]])" ] }, { "cell_type": "code", "execution_count": 18, "id": "a6fa5ce0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>0</th>\n", " <th>1</th>\n", " <th>2</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>2</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>4</td>\n", " <td>5</td>\n", " <td>6</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " 0 1 2\n", "0 1 2 3\n", "1 4 5 6" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(a)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6aed9e8c", "metadata": {}, "source": [ "## DataFrame对象的使用\n", "\n", "DataFrame对象不是二维NumPy数组,在使用方法上存在很大差异:" ] }, { "cell_type": "code", "execution_count": 19, "id": "9f32f6c3", "metadata": {}, "outputs": [], "source": [ "s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])" ] }, { "cell_type": "code", "execution_count": 20, "id": "239d3b03", "metadata": {}, "outputs": [], "source": [ "s2 = pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])" ] }, { "cell_type": "code", "execution_count": 21, "id": "e8776dc6", "metadata": {}, "outputs": [], "source": [ "d = {\"one\": s1, \"two\": s2}" ] }, { "cell_type": "code", "execution_count": 22, "id": "cdde7d53", "metadata": {}, "outputs": [], "source": [ "df = pd.DataFrame(d)" ] }, { "cell_type": "code", "execution_count": 23, "id": "5e17d0aa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>one</th>\n", " <th>two</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>a</th>\n", " <td>1.0</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>b</th>\n", " <td>2.0</td>\n", " <td>2.0</td>\n", " </tr>\n", " <tr>\n", " <th>c</th>\n", " <td>3.0</td>\n", " <td>3.0</td>\n", " </tr>\n", " <tr>\n", " <th>d</th>\n", " <td>NaN</td>\n", " <td>4.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " one two\n", "a 1.0 1.0\n", "b 2.0 2.0\n", "c 3.0 3.0\n", "d NaN 4.0" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "attachments": {}, "cell_type": "markdown", "id": "23e62817", "metadata": {}, "source": [ "### 列相关的操作" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3950a0b2", "metadata": {}, "source": [ "DataFrame对象可以看成是一个由Series对象构成的字典,.columns属性对应字典的键,每一列对应字典的值:" ] }, { "cell_type": "code", "execution_count": 24, "id": "a3c973ea", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 2.0\n", "c 3.0\n", "d NaN\n", "Name: one, dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['one']" ] }, { "attachments": {}, "cell_type": "markdown", "id": "55ef0692", "metadata": {}, "source": [ "可以像字典一样增加新列:" ] }, { "cell_type": "code", "execution_count": 25, "id": "0f2ef75c", "metadata": {}, "outputs": [], "source": [ "df[\"three\"] = df[\"one\"] * df[\"two\"]" ] }, { "cell_type": "code", "execution_count": 26, "id": "05ac44bb", "metadata": {}, "outputs": [], "source": [ "df[\"flag\"] = df[\"one\"] > 2" ] }, { "cell_type": "code", "execution_count": 27, "id": "7ecb4c9f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>one</th>\n", " <th>two</th>\n", " <th>three</th>\n", " <th>flag</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>a</th>\n", " <td>1.0</td>\n", " <td>1.0</td>\n", " <td>1.0</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>b</th>\n", " <td>2.0</td>\n", " <td>2.0</td>\n", " <td>4.0</td>\n", " <td>False</td>\n", " </tr>\n", " <tr>\n", " <th>c</th>\n", " <td>3.0</td>\n", " <td>3.0</td>\n", " <td>9.0</td>\n", " <td>True</td>\n", " </tr>\n", " <tr>\n", " <th>d</th>\n", " <td>NaN</td>\n", " <td>4.0</td>\n", " <td>NaN</td>\n", " <td>False</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " one two three flag\n", "a 1.0 1.0 1.0 False\n", "b 2.0 2.0 4.0 False\n", "c 3.0 3.0 9.0 True\n", "d NaN 4.0 NaN False" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1f62216a", "metadata": {}, "source": [ "增加新列时,如果新列的值是单一值,Pandas会按照行标记自动进行扩展:" ] }, { "cell_type": "code", "execution_count": 28, "id": "e156535d", "metadata": {}, "outputs": [], "source": [ "df[\"four\"] = 4" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e220dded", "metadata": {}, "source": [ "DataFrame对象支持用del关键字或者.pop()方法删除列:" ] }, { "cell_type": "code", "execution_count": 29, "id": "8e744653", "metadata": {}, "outputs": [], "source": [ "del df[\"two\"]" ] }, { "cell_type": "code", "execution_count": 30, "id": "4770ad42", "metadata": {}, "outputs": [], "source": [ "three = df.pop(\"three\")" ] }, { "cell_type": "code", "execution_count": 31, "id": "d20663a2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "a 1.0\n", "b 4.0\n", "c 9.0\n", "d NaN\n", "Name: three, dtype: float64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "three" ] }, { "cell_type": "code", "execution_count": 32, "id": "8713ea5e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>one</th>\n", " <th>flag</th>\n", " <th>four</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>a</th>\n", " <td>1.0</td>\n", " <td>False</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>b</th>\n", " <td>2.0</td>\n", " <td>False</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>c</th>\n", " <td>3.0</td>\n", " <td>True</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>d</th>\n", " <td>NaN</td>\n", " <td>False</td>\n", " <td>4</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " one flag four\n", "a 1.0 False 4\n", "b 2.0 False 4\n", "c 3.0 True 4\n", "d NaN False 4" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "attachments": {}, "cell_type": "markdown", "id": "18ec10aa", "metadata": {}, "source": [ "增加一个行标记不完全相同的新列时,Pandas只会保留该列中与原有行标记相同的部分,以保证原DataFrame对象的行标记不变化:" ] }, { "cell_type": "code", "execution_count": 33, "id": "07b61507", "metadata": {}, "outputs": [], "source": [ "df[\"foo\"] = pd.Series([1,2,3], index=[\"a\", \"d\", \"e\"])" ] }, { "cell_type": "code", "execution_count": 34, "id": "e45141a6", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>one</th>\n", " <th>flag</th>\n", " <th>four</th>\n", " <th>foo</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>a</th>\n", " <td>1.0</td>\n", " <td>False</td>\n", " <td>4</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>b</th>\n", " <td>2.0</td>\n", " <td>False</td>\n", " <td>4</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>c</th>\n", " <td>3.0</td>\n", " <td>True</td>\n", " <td>4</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>d</th>\n", " <td>NaN</td>\n", " <td>False</td>\n", " <td>4</td>\n", " <td>2.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " one flag four foo\n", "a 1.0 False 4 1.0\n", "b 2.0 False 4 NaN\n", "c 3.0 True 4 NaN\n", "d NaN False 4 2.0" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f53b8dd1", "metadata": {}, "source": [ "默认情况下,新列的插入位置都在DataFrame对象的最后。可以使用.insert()方法将其插入指定的位置:" ] }, { "cell_type": "code", "execution_count": 35, "id": "edad10b3", "metadata": {}, "outputs": [], "source": [ "df.insert(1, \"bar\", df[\"one\"])" ] }, { "cell_type": "code", "execution_count": 36, "id": "5b1967c2", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>one</th>\n", " <th>bar</th>\n", " <th>flag</th>\n", " <th>four</th>\n", " <th>foo</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>a</th>\n", " <td>1.0</td>\n", " <td>1.0</td>\n", " <td>False</td>\n", " <td>4</td>\n", " <td>1.0</td>\n", " </tr>\n", " <tr>\n", " <th>b</th>\n", " <td>2.0</td>\n", " <td>2.0</td>\n", " <td>False</td>\n", " <td>4</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>c</th>\n", " <td>3.0</td>\n", " <td>3.0</td>\n", " <td>True</td>\n", " <td>4</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>d</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>False</td>\n", " <td>4</td>\n", " <td>2.0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " one bar flag four foo\n", "a 1.0 1.0 False 4 1.0\n", "b 2.0 2.0 False 4 NaN\n", "c 3.0 3.0 True 4 NaN\n", "d NaN NaN False 4 2.0" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2441cfd5", "metadata": {}, "source": [ "### 行相关的操作\n", "\n", "DataFrame对象有两种常用的索引行的方式。可以用`.loc`属性索引行标记,返回一个Series对象:" ] }, { "cell_type": "code", "execution_count": 37, "id": "737b173b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "one 2.0\n", "bar 2.0\n", "flag False\n", "four 4\n", "foo NaN\n", "Name: b, dtype: object" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[\"b\"]" ] }, { "attachments": {}, "cell_type": "markdown", "id": "cb199e17", "metadata": {}, "source": [ "也可以用.iloc属性索引位置,得到第二行数据:" ] }, { "cell_type": "code", "execution_count": 38, "id": "80f5c2ae", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "one 2.0\n", "bar 2.0\n", "flag False\n", "four 4\n", "foo NaN\n", "Name: b, dtype: object" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[1]" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a41ed5e1", "metadata": {}, "source": [ "### 加法与减法操作\n", "\n", "DataFrame对象支持加法和减法的操作,并且按照行列标记对齐的原则进行计算:" ] }, { "cell_type": "code", "execution_count": 39, "id": "9b6e727a", "metadata": {}, "outputs": [], "source": [ "df1 = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])" ] }, { "cell_type": "code", "execution_count": 40, "id": "f2b0db7d", "metadata": {}, "outputs": [], "source": [ "df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])" ] }, { "cell_type": "code", "execution_count": 41, "id": "ed6e0274", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>-1.906552</td>\n", " <td>-2.428495</td>\n", " <td>1.131278</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-0.955872</td>\n", " <td>-1.476556</td>\n", " <td>-1.523796</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.766210</td>\n", " <td>-0.162112</td>\n", " <td>0.190370</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>-2.866838</td>\n", " <td>0.866281</td>\n", " <td>1.340097</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>-2.027247</td>\n", " <td>0.972097</td>\n", " <td>-0.807422</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>0.841079</td>\n", " <td>0.101313</td>\n", " <td>-1.701630</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>0.318099</td>\n", " <td>-0.037061</td>\n", " <td>-1.878293</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " <td>NaN</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D\n", "0 -1.906552 -2.428495 1.131278 NaN\n", "1 -0.955872 -1.476556 -1.523796 NaN\n", "2 0.766210 -0.162112 0.190370 NaN\n", "3 -2.866838 0.866281 1.340097 NaN\n", "4 -2.027247 0.972097 -0.807422 NaN\n", "5 0.841079 0.101313 -1.701630 NaN\n", "6 0.318099 -0.037061 -1.878293 NaN\n", "7 NaN NaN NaN NaN\n", "8 NaN NaN NaN NaN\n", "9 NaN NaN NaN NaN" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 + df2" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f0a10763", "metadata": {}, "source": [ "DataFrame对象还可以与Series对象进行加减操作。与NumPy中的广播机制类似,Pandas会先将Series对象的标记与DataFrame对象的列标记中对应的部分拿出来,然后使用广播机制将Series对象沿着行标记进行扩展:" ] }, { "cell_type": "code", "execution_count": 42, "id": "898bd8b9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.034677</td>\n", " <td>-1.447889</td>\n", " <td>0.239673</td>\n", " <td>0.897156</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-0.216450</td>\n", " <td>-0.052522</td>\n", " <td>0.237849</td>\n", " <td>0.806303</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.260522</td>\n", " <td>0.590821</td>\n", " <td>0.231546</td>\n", " <td>-2.164184</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>-1.264539</td>\n", " <td>0.947130</td>\n", " <td>0.601591</td>\n", " <td>-0.753204</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>-1.113126</td>\n", " <td>0.063686</td>\n", " <td>-0.379063</td>\n", " <td>-0.275933</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>0.596109</td>\n", " <td>-0.516650</td>\n", " <td>-1.177866</td>\n", " <td>0.075800</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>1.386725</td>\n", " <td>-0.328219</td>\n", " <td>-1.303265</td>\n", " <td>-0.790358</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>1.225454</td>\n", " <td>0.923503</td>\n", " <td>0.715214</td>\n", " <td>-0.144048</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>-0.982050</td>\n", " <td>-0.026315</td>\n", " <td>1.963732</td>\n", " <td>0.638793</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>0.715773</td>\n", " <td>-0.767911</td>\n", " <td>-0.379927</td>\n", " <td>-1.533615</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D\n", "0 0.034677 -1.447889 0.239673 0.897156\n", "1 -0.216450 -0.052522 0.237849 0.806303\n", "2 0.260522 0.590821 0.231546 -2.164184\n", "3 -1.264539 0.947130 0.601591 -0.753204\n", "4 -1.113126 0.063686 -0.379063 -0.275933\n", "5 0.596109 -0.516650 -1.177866 0.075800\n", "6 1.386725 -0.328219 -1.303265 -0.790358\n", "7 1.225454 0.923503 0.715214 -0.144048\n", "8 -0.982050 -0.026315 1.963732 0.638793\n", "9 0.715773 -0.767911 -0.379927 -1.533615" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1" ] }, { "cell_type": "code", "execution_count": 43, "id": "0726b1fc", "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>A</th>\n", " <th>B</th>\n", " <th>C</th>\n", " <th>D</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " <td>0.000000</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>-0.251127</td>\n", " <td>1.395367</td>\n", " <td>-0.001824</td>\n", " <td>-0.090853</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>0.225845</td>\n", " <td>2.038710</td>\n", " <td>-0.008127</td>\n", " <td>-3.061340</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>-1.299216</td>\n", " <td>2.395019</td>\n", " <td>0.361919</td>\n", " <td>-1.650360</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>-1.147802</td>\n", " <td>1.511575</td>\n", " <td>-0.618736</td>\n", " <td>-1.173089</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>0.561432</td>\n", " <td>0.931239</td>\n", " <td>-1.417538</td>\n", " <td>-0.821356</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>1.352048</td>\n", " <td>1.119670</td>\n", " <td>-1.542938</td>\n", " <td>-1.687514</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>1.190778</td>\n", " <td>2.371392</td>\n", " <td>0.475542</td>\n", " <td>-1.041204</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>-1.016727</td>\n", " <td>1.421574</td>\n", " <td>1.724059</td>\n", " <td>-0.258363</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>0.681096</td>\n", " <td>0.679978</td>\n", " <td>-0.619600</td>\n", " <td>-2.430771</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " A B C D\n", "0 0.000000 0.000000 0.000000 0.000000\n", "1 -0.251127 1.395367 -0.001824 -0.090853\n", "2 0.225845 2.038710 -0.008127 -3.061340\n", "3 -1.299216 2.395019 0.361919 -1.650360\n", "4 -1.147802 1.511575 -0.618736 -1.173089\n", "5 0.561432 0.931239 -1.417538 -0.821356\n", "6 1.352048 1.119670 -1.542938 -1.687514\n", "7 1.190778 2.371392 0.475542 -1.041204\n", "8 -1.016727 1.421574 1.724059 -0.258363\n", "9 0.681096 0.679978 -0.619600 -2.430771" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 - df1.iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "0fe7dcf0", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.10" } }, "nbformat": 4, "nbformat_minor": 5 }