{ "cells": [ { "cell_type": "markdown", "id": "human-franchise", "metadata": {}, "source": [ "# 前言\n", "\n", "数据源是来自[和鲸社区](https://www.kesci.com/mw/dataset/5ffac64f3441fd001538228b/file)的 3 份互不相关的电商数据集,所以分成 3 部分,每部分只对其中的一个数据集进行分析。\n", "\n", "\n", "## part 1\n", "\n", "tmall_order_report.csv 这个数据集是订单数据,可供挖掘的纬度有订单时间、省份(收货地址),指标则有销售量、销售额、退款金额、退货率、成交率、地区分布、下单时间趋势等。\n", "\n", "### 1、数据理解与处理" ] }, { "cell_type": "code", "execution_count": 1, "id": "hawaiian-tournament", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
订单编号总金额买家实际支付金额收货地址订单创建时间订单付款时间退款金额
01178.80.0上海2020-02-21 00:00:00NaN0.0
1221.021.0内蒙古自治区2020-02-20 23:59:542020-02-21 00:00:020.0
2337.00.0安徽省2020-02-20 23:59:35NaN0.0
34157.0157.0湖南省2020-02-20 23:58:342020-02-20 23:58:440.0
4564.80.0江苏省2020-02-20 23:57:042020-02-20 23:57:1164.8
\n", "
" ], "text/plain": [ " 订单编号 总金额 买家实际支付金额 收货地址 订单创建时间 订单付款时间 \\\n", "0 1 178.8 0.0 上海 2020-02-21 00:00:00 NaN \n", "1 2 21.0 21.0 内蒙古自治区 2020-02-20 23:59:54 2020-02-21 00:00:02 \n", "2 3 37.0 0.0 安徽省 2020-02-20 23:59:35 NaN \n", "3 4 157.0 157.0 湖南省 2020-02-20 23:58:34 2020-02-20 23:58:44 \n", "4 5 64.8 0.0 江苏省 2020-02-20 23:57:04 2020-02-20 23:57:11 \n", "\n", " 退款金额 \n", "0 0.0 \n", "1 0.0 \n", "2 0.0 \n", "3 0.0 \n", "4 64.8 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "data = pd.read_csv('tmall_order_report.csv')\n", "data.head() # 退款金额应该就是客户退货后,返还给客户的退款金额" ] }, { "cell_type": "code", "execution_count": 2, "id": "dependent-analyst", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 28010 entries, 0 to 28009\n", "Data columns (total 7 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 订单编号 28010 non-null int64 \n", " 1 总金额 28010 non-null float64\n", " 2 买家实际支付金额 28010 non-null float64\n", " 3 收货地址 28010 non-null object \n", " 4 订单创建时间 28010 non-null object \n", " 5 订单付款时间 24087 non-null object \n", " 6 退款金额 28010 non-null float64\n", "dtypes: float64(3), int64(1), object(3)\n", "memory usage: 1.5+ MB\n" ] } ], "source": [ "data.info() # 数据集情况 28010 条,6个字段" ] }, { "cell_type": "code", "execution_count": 3, "id": "central-green", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['订单编号', '总金额', '买家实际支付金额', '收货地址', '订单创建时间', '订单付款时间', '退款金额'], dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.columns = data.columns.str.strip() # 列名有空格,需要处理下\n", "data.columns" ] }, { "cell_type": "code", "execution_count": 4, "id": "decreased-sector", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "订单编号 0\n", "总金额 0\n", "买家实际支付金额 0\n", "收货地址 0\n", "订单创建时间 0\n", "订单付款时间 0\n", "退款金额 0\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data[data.duplicated()].count() # 没有完全重复的数据" ] }, { "cell_type": "code", "execution_count": 5, "id": "american-chuck", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "订单编号 0\n", "总金额 0\n", "买家实际支付金额 0\n", "收货地址 0\n", "订单创建时间 0\n", "订单付款时间 3923\n", "退款金额 0\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.isnull().sum() # 付款时间存在空值,表示订单未付款" ] }, { "cell_type": "code", "execution_count": 6, "id": "caring-flavor", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['上海', '内蒙古', '安徽', '湖南', '江苏', '浙江', '天津', '北京', '四川', '贵州', '辽宁',\n", " '河南', '广西', '广东', '福建', '海南', '江西', '甘肃', '河北', '黑龙江', '云南', '重庆',\n", " '山西', '吉林', '山东', '陕西', '湖北', '青海', '新疆', '宁夏', '西藏'], dtype=object)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data['收货地址'] = data['收货地址'].str.replace('自治区|维吾尔|回族|壮族|省', '') # 对省份做个清洗,便于可视化\n", "data['收货地址'].unique()" ] }, { "cell_type": "markdown", "id": "rocky-employee", "metadata": {}, "source": [ "### 2、数据分析可视化" ] }, { "cell_type": "markdown", "id": "continent-alpha", "metadata": {}, "source": [ "#### 2.1 整体情况" ] }, { "cell_type": "code", "execution_count": 7, "id": "spanish-aerospace", "metadata": {}, "outputs": [], "source": [ "result = {}\n", "result['总订单数'] = data['订单编号'].count() \n", "result['已完成订单数'] = data['订单编号'][data['订单付款时间'].notnull()].count() \n", "result['未付款订单数'] = data['订单编号'][data['订单付款时间'].isnull()].count() \n", "result['退款订单数'] = data['订单编号'][data['退款金额'] > 0].count() \n", "result['总订单金额'] = data['总金额'][data['订单付款时间'].notnull()].sum() \n", "result['总退款金额'] = data['退款金额'][data['订单付款时间'].notnull()].sum() \n", "result['总实际收入金额'] = data['买家实际支付金额'][data['订单付款时间'].notnull()].sum() " ] }, { "cell_type": "code", "execution_count": 8, "id": "surgical-settle", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'总订单数': 28010,\n", " '已完成订单数': 24087,\n", " '未付款订单数': 3923,\n", " '退款订单数': 5646,\n", " '总订单金额': 2474823.0700000003,\n", " '总退款金额': 572335.9199999999,\n", " '总实际收入金额': 1902487.1500000001}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result" ] }, { "cell_type": "code", "execution_count": 9, "id": "direct-diana", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", "
\n", "

整体情况

\n", "

\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
总订单数总订单金额已完成订单数总实际收入金额退款订单数总退款金额成交率退货率
28010247.48 万24087190.25 万564657.23 万85.99%23.44%
\n", "
\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pyecharts import options as opts\n", "from pyecharts.charts import Map, Bar, Line\n", "from pyecharts.components import Table\n", "from pyecharts.options import ComponentTitleOpts\n", "from pyecharts.faker import Faker\n", "\n", "table = Table()\n", "\n", "headers = ['总订单数', '总订单金额', '已完成订单数', '总实际收入金额', '退款订单数', '总退款金额', '成交率', '退货率']\n", "rows = [\n", " [\n", " result['总订单数'], f\"{result['总订单金额']/10000:.2f} 万\", result['已完成订单数'], f\"{result['总实际收入金额']/10000:.2f} 万\",\n", " result['退款订单数'], f\"{result['总退款金额']/10000:.2f} 万\", \n", " f\"{result['已完成订单数']/result['总订单数']:.2%}\",\n", " f\"{result['退款订单数']/result['已完成订单数']:.2%}\",\n", " ]\n", "]\n", "table.add(headers, rows)\n", "table.set_global_opts(\n", " title_opts=ComponentTitleOpts(title='整体情况')\n", ")\n", "table.render_notebook()" ] }, { "cell_type": "markdown", "id": "young-insert", "metadata": {}, "source": [ "#### 2.2 地区分析" ] }, { "cell_type": "code", "execution_count": 10, "id": "still-cradle", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result2 = data[data['订单付款时间'].notnull()].groupby('收货地址').agg({'订单编号':'count'})\n", "result21 = result2.to_dict()['订单编号']\n", "c = (\n", " Map()\n", " .add(\"订单量\", [*result21.items()], \"china\", is_map_symbol_show=False)\n", " .set_series_opts(label_opts=opts.LabelOpts(is_show=True))\n", " .set_global_opts(\n", " title_opts=opts.TitleOpts(title='地区分布'),\n", " visualmap_opts=opts.VisualMapOpts(max_=1000), \n", " )\n", ")\n", "c.render_notebook()" ] }, { "cell_type": "markdown", "id": "hungry-bangkok", "metadata": {}, "source": [ "#### 2.3 时间分析" ] }, { "cell_type": "code", "execution_count": 11, "id": "behavioral-mumbai", "metadata": {}, "outputs": [], "source": [ "data['订单创建时间'] = pd.to_datetime(data['订单创建时间'])\n", "data['订单付款时间'] = pd.to_datetime(data['订单付款时间'])" ] }, { "cell_type": "code", "execution_count": 12, "id": "rough-torture", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result31 = data.groupby(data['订单创建时间'].apply(lambda x: x.strftime(\"%Y-%m-%d\"))).agg({'订单编号':'count'}).to_dict()['订单编号']\n", "c = (\n", " Line()\n", " .add_xaxis(list(result31.keys()))\n", " .add_yaxis(\"订单量\", list(result31.values()))\n", " .set_series_opts(\n", " label_opts=opts.LabelOpts(is_show=False),\n", " markpoint_opts=opts.MarkPointOpts(\n", " data=[\n", " opts.MarkPointItem(type_=\"max\", name=\"最大值\"),\n", " ]\n", " ),\n", " )\n", " .set_global_opts(title_opts=opts.TitleOpts(title=\"每日订单量走势\"))\n", ")\n", "c.render_notebook()" ] }, { "cell_type": "markdown", "id": "nasty-priority", "metadata": {}, "source": [ "从上图来看,2月份上半月由于受新冠疫情影响,订单量比较少,随着复工开展,下半月的订单量增长明显。" ] }, { "cell_type": "code", "execution_count": 13, "id": "existing-hungarian", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result32 = data.groupby(data['订单创建时间'].apply(lambda x: x.strftime(\"%H\"))).agg({'订单编号':'count'}).to_dict()['订单编号']\n", "x = [*result32.keys()]\n", "y = [*result32.values()]\n", "c = (\n", " Bar()\n", " .add_xaxis(x)\n", " .add_yaxis(\"订单量\", y)\n", " .set_global_opts(title_opts=opts.TitleOpts(title=\"每小时订单量走势\"))\n", " .set_series_opts(\n", " label_opts=opts.LabelOpts(is_show=False),\n", " markpoint_opts=opts.MarkPointOpts(\n", " data=[\n", " opts.MarkPointItem(type_=\"max\", name=\"峰值\"),\n", " opts.MarkPointItem(name=\"第二峰值\", coord=[x[15], y[15]], value=y[15]),\n", " opts.MarkPointItem(name=\"第三峰值\", coord=[x[10], y[10]], value=y[10]),\n", " ]\n", " ),\n", " )\n", ")\n", "c.render_notebook()" ] }, { "cell_type": "markdown", "id": "protecting-berkeley", "metadata": {}, "source": [ "从每小时订单量走势来看,一天中有3个高峰期(10点、15点、21点),其中21点-22点之间是一天中订单量最多的时候,这个结果和之前 [1 亿条淘宝用户行为数据分析](https://github.com/TurboWay/bigdata_analyse/blob/main/UserBehaviorFromTaobao_Batch/用户行为数据分析.md) 的结果是一致的。对于卖家的指导意义就是,为了提高订单量,高峰期时应该尽量保证客服的回复速度,尤其是晚上21点-22点之间,所以很多做电商的基本都有夜班。" ] }, { "cell_type": "code", "execution_count": 14, "id": "functional-yeast", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7.7399046511949745" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = data['订单付款时间'] - data['订单创建时间']\n", "s[s.notnull()].apply(lambda x: x.seconds / 60 ).mean() # 从下单到付款的平均耗时为 7.7 分钟" ] }, { "cell_type": "markdown", "id": "running-conspiracy", "metadata": {}, "source": [ "## part2\n", "\n", "双十一淘宝美妆数据.csv 这个数据集是美妆店铺的双十一销售数据,可以挖掘的纬度有日期、店铺,指标则有销售量、销售额、评论数等。\n", "### 1、数据理解与处理" ] }, { "cell_type": "code", "execution_count": 15, "id": "fresh-episode", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
update_timeidtitlepricesale_countcomment_count店名
02016/11/14A18164178225CHANDO/自然堂 雪域精粹纯粹滋润霜50g 补水保湿 滋润水润面霜139.026719.02704.0自然堂
12016/11/14A18177105952CHANDO/自然堂凝时鲜颜肌活乳液120ML 淡化细纹补水滋润专柜正品194.08122.01492.0自然堂
22016/11/14A18177226992CHANDO/自然堂活泉保湿修护精华水(滋润型135ml 补水控油爽肤水99.012668.0589.0自然堂
32016/11/14A18178033846CHANDO/自然堂 男士劲爽控油洁面膏 100g 深层清洁 男士洗面奶38.025805.04287.0自然堂
42016/11/14A18178045259CHANDO/自然堂雪域精粹纯粹滋润霜(清爽型)50g补水保湿滋润霜139.05196.0618.0自然堂
\n", "
" ], "text/plain": [ " update_time id title price \\\n", "0 2016/11/14 A18164178225 CHANDO/自然堂 雪域精粹纯粹滋润霜50g 补水保湿 滋润水润面霜 139.0 \n", "1 2016/11/14 A18177105952 CHANDO/自然堂凝时鲜颜肌活乳液120ML 淡化细纹补水滋润专柜正品 194.0 \n", "2 2016/11/14 A18177226992 CHANDO/自然堂活泉保湿修护精华水(滋润型135ml 补水控油爽肤水 99.0 \n", "3 2016/11/14 A18178033846 CHANDO/自然堂 男士劲爽控油洁面膏 100g 深层清洁 男士洗面奶 38.0 \n", "4 2016/11/14 A18178045259 CHANDO/自然堂雪域精粹纯粹滋润霜(清爽型)50g补水保湿滋润霜 139.0 \n", "\n", " sale_count comment_count 店名 \n", "0 26719.0 2704.0 自然堂 \n", "1 8122.0 1492.0 自然堂 \n", "2 12668.0 589.0 自然堂 \n", "3 25805.0 4287.0 自然堂 \n", "4 5196.0 618.0 自然堂 " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "data2 = pd.read_csv('双十一淘宝美妆数据.csv')\n", "data2.head()" ] }, { "cell_type": "code", "execution_count": 16, "id": "annoying-guidance", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 27598 entries, 0 to 27597\n", "Data columns (total 7 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 update_time 27598 non-null object \n", " 1 id 27598 non-null object \n", " 2 title 27598 non-null object \n", " 3 price 27598 non-null float64\n", " 4 sale_count 25244 non-null float64\n", " 5 comment_count 25244 non-null float64\n", " 6 店名 27598 non-null object \n", "dtypes: float64(3), object(4)\n", "memory usage: 1.5+ MB\n" ] } ], "source": [ "data2.info() # 数据集情况 28010 条,6个字段" ] }, { "cell_type": "code", "execution_count": 17, "id": "silent-taxation", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "update_time 86\n", "id 86\n", "title 86\n", "price 86\n", "sale_count 82\n", "comment_count 82\n", "店名 86\n", "dtype: int64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data2[data2.duplicated()].count() # 有86条完全重复数据 " ] }, { "cell_type": "code", "execution_count": 18, "id": "neural-toner", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "update_time 0\n", "id 0\n", "title 0\n", "price 0\n", "sale_count 2350\n", "comment_count 2350\n", "店名 0\n", "dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data2.drop_duplicates(inplace=True) # 删除重复数据\n", "data2.reset_index(drop=True, inplace=True) # 重建索引\n", "data2.isnull().sum() # 查看空值 ,销售数量和评论数有空值" ] }, { "cell_type": "code", "execution_count": 19, "id": "binding-handle", "metadata": {}, "outputs": [], "source": [ "data2.fillna(0, inplace=True) # 空值填充\n", "data2['update_time'] = pd.to_datetime(data2['update_time']).apply(lambda x: x.strftime(\"%Y-%m-%d\")) # 日期格式化,便于统计" ] }, { "cell_type": "code", "execution_count": 20, "id": "radio-verse", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
update_timeidtitlepricesale_countcomment_count店名
270422016-11-05A541190557158Herborist/佰草集新美肌梦幻曲面贴膜3片 保湿补水1.01.00.0佰草集
14942016-11-10A538981087285【双II预售】资生堂 新透白色控霜 30ml390.01.00.0资生堂
241482016-11-09A540190519057【娇兰盛典】腮红亲密容和肌肤 裸妆感 自然持久玫瑰闰色腮红420.01.00.0娇兰
241472016-11-09A540189922026【娇兰盛典】丝柔蜜粉饼 营造细致透明妆感 柔滑细腻贴肤美颜480.01.01.0娇兰
169742016-11-05A541166044768L'OREAL欧莱雅卓韵霜时尚魅棕系列染发霜 富含炫闪因子蜜茶棕红棕79.01.00.0欧莱雅
\n", "
" ], "text/plain": [ " update_time id title price \\\n", "27042 2016-11-05 A541190557158 Herborist/佰草集新美肌梦幻曲面贴膜3片 保湿补水 1.0 \n", "1494 2016-11-10 A538981087285 【双II预售】资生堂 新透白色控霜 30ml 390.0 \n", "24148 2016-11-09 A540190519057 【娇兰盛典】腮红亲密容和肌肤 裸妆感 自然持久玫瑰闰色腮红 420.0 \n", "24147 2016-11-09 A540189922026 【娇兰盛典】丝柔蜜粉饼 营造细致透明妆感 柔滑细腻贴肤美颜 480.0 \n", "16974 2016-11-05 A541166044768 L'OREAL欧莱雅卓韵霜时尚魅棕系列染发霜 富含炫闪因子蜜茶棕红棕 79.0 \n", "\n", " sale_count comment_count 店名 \n", "27042 1.0 0.0 佰草集 \n", "1494 1.0 0.0 资生堂 \n", "24148 1.0 0.0 娇兰 \n", "24147 1.0 1.0 娇兰 \n", "16974 1.0 0.0 欧莱雅 " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data2[data2['sale_count']>0].sort_values(by=['sale_count']).head() # 从数据来看,sale_count 是销售量" ] }, { "cell_type": "code", "execution_count": 21, "id": "exterior-headquarters", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
update_timeidtitlepricesale_countcomment_count店名sale_amount
270422016-11-05A541190557158Herborist/佰草集新美肌梦幻曲面贴膜3片 保湿补水1.01.00.0佰草集1.0
14942016-11-10A538981087285【双II预售】资生堂 新透白色控霜 30ml390.01.00.0资生堂390.0
241482016-11-09A540190519057【娇兰盛典】腮红亲密容和肌肤 裸妆感 自然持久玫瑰闰色腮红420.01.00.0娇兰420.0
241472016-11-09A540189922026【娇兰盛典】丝柔蜜粉饼 营造细致透明妆感 柔滑细腻贴肤美颜480.01.01.0娇兰480.0
169742016-11-05A541166044768L'OREAL欧莱雅卓韵霜时尚魅棕系列染发霜 富含炫闪因子蜜茶棕红棕79.01.00.0欧莱雅79.0
...........................
174702016-11-10A24304992630德国妮维雅男士洗面奶控油祛痘印保湿去黑头去油清洁面乳液护肤品42.01827562.0200154.0妮维雅76757604.0
173392016-11-11A243049926302瓶更划算*妮维雅男士洗面奶控油祛痘印保湿去黑头去油洁面乳护肤35.01886100.0199532.0妮维雅66013500.0
172282016-11-12A24304992630德国妮维雅男士洗面奶控油祛痘印保湿去黑头去油清洁面乳液护肤品37.91920083.0199062.0妮维雅72771145.7
171262016-11-13A24304992630德国妮维雅男士洗面奶控油祛痘印保湿去黑头去油清洁面乳液护肤品37.91921582.0198774.0妮维雅72827957.8
170262016-11-14A24304992630德国妮维雅男士洗面奶控油祛痘印保湿去黑头去油清洁面乳液护肤品42.01923160.0197949.0妮维雅80772720.0
\n", "

24398 rows × 8 columns

\n", "
" ], "text/plain": [ " update_time id title price \\\n", "27042 2016-11-05 A541190557158 Herborist/佰草集新美肌梦幻曲面贴膜3片 保湿补水 1.0 \n", "1494 2016-11-10 A538981087285 【双II预售】资生堂 新透白色控霜 30ml 390.0 \n", "24148 2016-11-09 A540190519057 【娇兰盛典】腮红亲密容和肌肤 裸妆感 自然持久玫瑰闰色腮红 420.0 \n", "24147 2016-11-09 A540189922026 【娇兰盛典】丝柔蜜粉饼 营造细致透明妆感 柔滑细腻贴肤美颜 480.0 \n", "16974 2016-11-05 A541166044768 L'OREAL欧莱雅卓韵霜时尚魅棕系列染发霜 富含炫闪因子蜜茶棕红棕 79.0 \n", "... ... ... ... ... \n", "17470 2016-11-10 A24304992630 德国妮维雅男士洗面奶控油祛痘印保湿去黑头去油清洁面乳液护肤品 42.0 \n", "17339 2016-11-11 A24304992630 2瓶更划算*妮维雅男士洗面奶控油祛痘印保湿去黑头去油洁面乳护肤 35.0 \n", "17228 2016-11-12 A24304992630 德国妮维雅男士洗面奶控油祛痘印保湿去黑头去油清洁面乳液护肤品 37.9 \n", "17126 2016-11-13 A24304992630 德国妮维雅男士洗面奶控油祛痘印保湿去黑头去油清洁面乳液护肤品 37.9 \n", "17026 2016-11-14 A24304992630 德国妮维雅男士洗面奶控油祛痘印保湿去黑头去油清洁面乳液护肤品 42.0 \n", "\n", " sale_count comment_count 店名 sale_amount \n", "27042 1.0 0.0 佰草集 1.0 \n", "1494 1.0 0.0 资生堂 390.0 \n", "24148 1.0 0.0 娇兰 420.0 \n", "24147 1.0 1.0 娇兰 480.0 \n", "16974 1.0 0.0 欧莱雅 79.0 \n", "... ... ... ... ... \n", "17470 1827562.0 200154.0 妮维雅 76757604.0 \n", "17339 1886100.0 199532.0 妮维雅 66013500.0 \n", "17228 1920083.0 199062.0 妮维雅 72771145.7 \n", "17126 1921582.0 198774.0 妮维雅 72827957.8 \n", "17026 1923160.0 197949.0 妮维雅 80772720.0 \n", "\n", "[24398 rows x 8 columns]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data2['sale_amount'] = data2['price'] * data2['sale_count'] # 增加一列销售额\n", "data2[data2['sale_count']>0].sort_values(by=['sale_count'])" ] }, { "cell_type": "markdown", "id": "supreme-performer", "metadata": {}, "source": [ "### 2、数据分析与可视化" ] }, { "cell_type": "markdown", "id": "traditional-third", "metadata": {}, "source": [ "#### 2.1 每日整体销售量走势" ] }, { "cell_type": "code", "execution_count": 22, "id": "unnecessary-ownership", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = data2.groupby('update_time').agg({'sale_count':'sum'}).to_dict()['sale_count']\n", "c = (\n", " Line()\n", " .add_xaxis(list(result.keys()))\n", " .add_yaxis(\"销售量\", list(result.values()))\n", " .set_series_opts(\n", " areastyle_opts=opts.AreaStyleOpts(opacity=0.5),\n", " label_opts=opts.LabelOpts(is_show=False),\n", " markpoint_opts=opts.MarkPointOpts(\n", " data=[\n", " opts.MarkPointItem(type_=\"max\", name=\"最大值\"),\n", " opts.MarkPointItem(type_=\"min\", name=\"最小值\"),\n", " opts.MarkPointItem(type_=\"average\", name=\"平均值\"),\n", " ]\n", " ),\n", " )\n", " .set_global_opts(title_opts=opts.TitleOpts(title=\"每日整体销售量走势\"))\n", ")\n", "c.render_notebook()" ] }, { "cell_type": "markdown", "id": "technical-vitamin", "metadata": {}, "source": [ "#### 2.2 谁家的化妆品卖的最好" ] }, { "cell_type": "code", "execution_count": 23, "id": "blessed-wagon", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['2016-11-05',\n", " '2016-11-06',\n", " '2016-11-07',\n", " '2016-11-08',\n", " '2016-11-09',\n", " '2016-11-10',\n", " '2016-11-11',\n", " '2016-11-12',\n", " '2016-11-13',\n", " '2016-11-14']" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dts = list(data2['update_time'].unique())\n", "dts.reverse()\n", "dts" ] }, { "cell_type": "code", "execution_count": 24, "id": "olive-accident", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pyecharts import options as opts\n", "from pyecharts.charts import Map, Timeline, Bar, Line, Pie\n", "from pyecharts.components import Table\n", "from pyecharts.options import ComponentTitleOpts\n", "\n", "tl = Timeline()\n", "tl.add_schema(\n", "# is_auto_play=True,\n", " is_loop_play=False,\n", " play_interval=500,\n", " )\n", "for dt in dts:\n", " item = data2[data2['update_time'] <= dt].groupby('店名').agg({'sale_count': 'sum', 'sale_amount': 'sum'}).sort_values(by='sale_count', ascending=False)[:10].sort_values(by='sale_count').to_dict()\n", " bar = (\n", " Bar()\n", " .add_xaxis([*item['sale_count'].keys()])\n", " .add_yaxis(\"销售量\", [round(val/10000,2) for val in item['sale_count'].values()], label_opts=opts.LabelOpts(position=\"right\", formatter='{@[1]/} 万'))\n", " .add_yaxis(\"销售额\", [round(val/10000/10000,2) for val in item['sale_amount'].values()], label_opts=opts.LabelOpts(position=\"right\", formatter='{@[1]/} 亿元'))\n", " .reversal_axis()\n", " .set_global_opts(\n", " title_opts=opts.TitleOpts(\"累计销售量排行 TOP10\")\n", " )\n", " )\n", " tl.add(bar, dt)\n", "tl.render_notebook()" ] }, { "cell_type": "code", "execution_count": 25, "id": "positive-basement", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "item = data2.groupby('店名').agg({'sale_count': 'sum'}).sort_values(by='sale_count', ascending=False)[:10].to_dict()['sale_count']\n", "item = {k: round(v/10000, 2) for k, v in item.items()}\n", "c = (\n", " Pie()\n", " .add(\"销量\", [*item.items()])\n", " .set_series_opts(label_opts=opts.LabelOpts(formatter=\"{b}: {c} 万({d}%)\"))\n", ")\n", "c.render_notebook()" ] }, { "cell_type": "markdown", "id": "confident-mattress", "metadata": {}, "source": [ "#### 2.4 谁家的化妆品最贵" ] }, { "cell_type": "code", "execution_count": 26, "id": "acute-wesley", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "item = data2.groupby('店名').agg({'price': 'mean'}).sort_values(by='price', ascending=False)[:20].sort_values(by='price').to_dict()\n", "c = (\n", " Bar()\n", " .add_xaxis([*item['price'].keys()])\n", " .add_yaxis(\"销售量\", [round(v, 2) for v in item['price'].values()], label_opts=opts.LabelOpts(position=\"right\"))\n", " .reversal_axis()\n", " .set_global_opts(\n", " title_opts=opts.TitleOpts(\"平均价格排行 TOP20\")\n", " )\n", ")\n", "c.render_notebook()" ] }, { "cell_type": "markdown", "id": "virgin-purple", "metadata": {}, "source": [ "## part3\n", "\n", "日化.xlsx 这个数据集是美妆类商品的订单数据,从数量来看,应该是批发类的订单。包含两个 sheet 页(订单表和商品表),可以挖掘的纬度有日期、地区、商品,指标则有销售量、销售额、增长率等。\n", "### 1、数据理解与处理" ] }, { "cell_type": "code", "execution_count": 27, "id": "otherwise-drama", "metadata": {}, "outputs": [], "source": [ "import pandas as pd \n", "fact_order = pd.read_excel('日化.xlsx', sheet_name='销售订单表')\n", "dim_product = pd.read_excel('日化.xlsx', sheet_name='商品信息表')" ] }, { "cell_type": "markdown", "id": "underlying-registrar", "metadata": {}, "source": [ "#### 1.1 商品表数据清洗" ] }, { "cell_type": "code", "execution_count": 28, "id": "sexual-austin", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
商品编号商品名称商品小类商品大类销售单价
0X001商品1面膜护肤品121
1X002商品2面膜护肤品141
2X003商品3面膜护肤品168
3X004商品4面膜护肤品211
4X005商品5面膜护肤品185
\n", "
" ], "text/plain": [ " 商品编号 商品名称 商品小类 商品大类 销售单价\n", "0 X001 商品1 面膜 护肤品 121\n", "1 X002 商品2 面膜 护肤品 141\n", "2 X003 商品3 面膜 护肤品 168\n", "3 X004 商品4 面膜 护肤品 211\n", "4 X005 商品5 面膜 护肤品 185" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dim_product.head()" ] }, { "cell_type": "code", "execution_count": 29, "id": "assumed-ideal", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
销售单价
count122.000000
mean156.155738
std58.454619
min56.000000
25%102.250000
50%158.000000
75%210.750000
max253.000000
\n", "
" ], "text/plain": [ " 销售单价\n", "count 122.000000\n", "mean 156.155738\n", "std 58.454619\n", "min 56.000000\n", "25% 102.250000\n", "50% 158.000000\n", "75% 210.750000\n", "max 253.000000" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dim_product.describe()" ] }, { "cell_type": "code", "execution_count": 30, "id": "organizational-tract", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "商品编号 0\n", "商品名称 0\n", "商品小类 0\n", "商品大类 0\n", "销售单价 0\n", "dtype: int64" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dim_product[dim_product.duplicated()].count() # 没有完全重复的数据" ] }, { "cell_type": "code", "execution_count": 31, "id": "diagnostic-western", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "商品编号 0\n", "商品名称 0\n", "商品小类 0\n", "商品大类 0\n", "销售单价 0\n", "dtype: int64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dim_product[dim_product['商品编号'].duplicated()].count() # ID 唯一没有重复" ] }, { "cell_type": "code", "execution_count": 32, "id": "atomic-challenge", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "商品编号 0\n", "商品名称 0\n", "商品小类 0\n", "商品大类 0\n", "销售单价 0\n", "dtype: int64" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dim_product.isnull().sum() # 没有空值 " ] }, { "cell_type": "markdown", "id": "olive-massage", "metadata": {}, "source": [ "#### 1.2 订单表数据清洗" ] }, { "cell_type": "code", "execution_count": 33, "id": "indie-profit", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
订单编码订单日期客户编码所在区域所在省份所在地市商品编号订购数量订购单价金额
0D313132019-05-16 00:00:00S22796东区浙江省台州市X091892214190888.0
1D213292019-05-14 00:00:00S11460东区安徽省宿州市X00527618551060.0
2D223722019-08-26 00:00:00S11101北区山西省忻州市X0781450116168200.0
3D310782019-04-08 00:00:00S10902北区吉林省延边朝鲜族自治州X0251834102187068.0
4D324702019-04-11 00:00:00S18696北区北京市北京市X0108875851446.0
\n", "
" ], "text/plain": [ " 订单编码 订单日期 客户编码 所在区域 所在省份 所在地市 商品编号 订购数量 订购单价 \\\n", "0 D31313 2019-05-16 00:00:00 S22796 东区 浙江省 台州市 X091 892 214 \n", "1 D21329 2019-05-14 00:00:00 S11460 东区 安徽省 宿州市 X005 276 185 \n", "2 D22372 2019-08-26 00:00:00 S11101 北区 山西省 忻州市 X078 1450 116 \n", "3 D31078 2019-04-08 00:00:00 S10902 北区 吉林省 延边朝鲜族自治州 X025 1834 102 \n", "4 D32470 2019-04-11 00:00:00 S18696 北区 北京市 北京市 X010 887 58 \n", "\n", " 金额 \n", "0 190888.0 \n", "1 51060.0 \n", "2 168200.0 \n", "3 187068.0 \n", "4 51446.0 " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fact_order.head()" ] }, { "cell_type": "code", "execution_count": 34, "id": "bibliographic-island", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 31452 entries, 0 to 31451\n", "Data columns (total 10 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 订单编码 31452 non-null object \n", " 1 订单日期 31452 non-null object \n", " 2 客户编码 31452 non-null object \n", " 3 所在区域 31450 non-null object \n", " 4 所在省份 31450 non-null object \n", " 5 所在地市 31452 non-null object \n", " 6 商品编号 31451 non-null object \n", " 7 订购数量 31450 non-null object \n", " 8 订购单价 31448 non-null object \n", " 9 金额 31448 non-null float64\n", "dtypes: float64(1), object(9)\n", "memory usage: 2.4+ MB\n" ] } ], "source": [ "fact_order.info()" ] }, { "cell_type": "code", "execution_count": 35, "id": "intensive-houston", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "订单编码 6\n", "订单日期 6\n", "客户编码 6\n", "所在区域 6\n", "所在省份 6\n", "所在地市 6\n", "商品编号 6\n", "订购数量 6\n", "订购单价 6\n", "金额 6\n", "dtype: int64" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fact_order[fact_order.duplicated()].count() # 没有完全重复的数据" ] }, { "cell_type": "code", "execution_count": 36, "id": "amber-wedding", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "订单编码 0\n", "订单日期 0\n", "客户编码 0\n", "所在区域 2\n", "所在省份 2\n", "所在地市 0\n", "商品编号 1\n", "订购数量 2\n", "订购单价 4\n", "金额 4\n", "dtype: int64" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fact_order.drop_duplicates(inplace=True) # 删除重复数据\n", "fact_order.reset_index(drop=True, inplace=True) # 重建索引\n", "fact_order.isnull().sum() # 查看空值,有几条数据缺失" ] }, { "cell_type": "code", "execution_count": 37, "id": "suspected-rhythm", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "订单编码 0\n", "订单日期 0\n", "客户编码 0\n", "所在区域 0\n", "所在省份 0\n", "所在地市 0\n", "商品编号 0\n", "订购数量 0\n", "订购单价 0\n", "金额 0\n", "dtype: int64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fact_order.fillna(method='bfill', inplace=True) # 空值填充\n", "fact_order.fillna(method='ffill', inplace=True) # 空值填充\n", "fact_order.isnull().sum() # 查看空值,有几条数据缺失" ] }, { "cell_type": "code", "execution_count": 38, "id": "golden-photograph", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
订单编码订单日期客户编码所在区域所在省份所在地市商品编号订购数量订购单价金额
20797D265332050-06-09S21396北区河北省石家庄市X022759158119922.0
\n", "
" ], "text/plain": [ " 订单编码 订单日期 客户编码 所在区域 所在省份 所在地市 商品编号 订购数量 订购单价 金额\n", "20797 D26533 2050-06-09 S21396 北区 河北省 石家庄市 X022 759 158 119922.0" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fact_order['订单日期'] = fact_order['订单日期'].apply(lambda x: pd.to_datetime(x, format='%Y#%m#%d') if isinstance(x, str) else x)\n", "fact_order[fact_order['订单日期'] > '2021-01-01'] # 有一条脏数据" ] }, { "cell_type": "code", "execution_count": 39, "id": "lined-throw", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Timestamp('2019-09-30 00:00:00'), Timestamp('2019-01-01 00:00:00'))" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fact_order = fact_order[fact_order['订单日期'] < '2021-01-01'] # 过滤掉脏数据\n", "fact_order['订单日期'].max(), fact_order['订单日期'].min() # 数据区间在 2019-01-01 到 2019-09-30 之间" ] }, { "cell_type": "code", "execution_count": 40, "id": "coated-boston", "metadata": {}, "outputs": [], "source": [ "fact_order['订购数量'] = fact_order['订购数量'].apply(lambda x: x.strip('个') if isinstance(x, str) else x).astype('int')\n", "fact_order['订购单价'] = fact_order['订购单价'].apply(lambda x: x.strip('元') if isinstance(x, str) else x).astype('float')\n", "fact_order['金额'] = fact_order['金额'].astype('float')" ] }, { "cell_type": "code", "execution_count": 41, "id": "regular-colorado", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Int64Index: 31445 entries, 0 to 31445\n", "Data columns (total 10 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 订单编码 31445 non-null object \n", " 1 订单日期 31445 non-null datetime64[ns]\n", " 2 客户编码 31445 non-null object \n", " 3 所在区域 31445 non-null object \n", " 4 所在省份 31445 non-null object \n", " 5 所在地市 31445 non-null object \n", " 6 商品编号 31445 non-null object \n", " 7 订购数量 31445 non-null int32 \n", " 8 订购单价 31445 non-null float64 \n", " 9 金额 31445 non-null float64 \n", "dtypes: datetime64[ns](1), float64(2), int32(1), object(6)\n", "memory usage: 2.5+ MB\n" ] } ], "source": [ "fact_order.info()" ] }, { "cell_type": "code", "execution_count": 42, "id": "potential-recording", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['浙江', '安徽', '山西', '吉林', '北京', '云南', '广东', '广西', '内蒙古', '新疆', '湖北',\n", " '江苏', '甘肃', '四川', '河南', '福建', '陕西', '辽宁', '山东', '江西', '重庆', '河北',\n", " '湖南', '上海', '贵州', '天津', '海南', '宁夏', '黑龙江'], dtype=object)" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fact_order['所在省份'] = fact_order['所在省份'].str.replace('自治区|维吾尔|回族|壮族|省|市', '') # 对省份做个清洗,便于可视化\n", "fact_order['所在省份'].unique()" ] }, { "cell_type": "code", "execution_count": 43, "id": "usual-suspect", "metadata": {}, "outputs": [], "source": [ "fact_order['客户编码'] = fact_order['客户编码'].str.replace('编号', '')" ] }, { "cell_type": "markdown", "id": "speaking-angle", "metadata": {}, "source": [ "### 2、数据分析与可视化\n", "#### 2.1 每月订购情况" ] }, { "cell_type": "code", "execution_count": 44, "id": "french-entry", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pyecharts import options as opts\n", "from pyecharts.charts import Map, Bar, Line\n", "from pyecharts.components import Table\n", "from pyecharts.options import ComponentTitleOpts\n", "from pyecharts.faker import Faker\n", "\n", "fact_order['订单月份'] = fact_order['订单日期'].apply(lambda x: x.month) \n", "item = fact_order.groupby('订单月份').agg({'订购数量': 'sum', '金额': 'sum'}).to_dict()\n", "x = [f'{key} 月' for key in item['订购数量'].keys()]\n", "y1 = [round(val/10000, 2) for val in item['订购数量'].values()]\n", "y2 = [round(val/10000/10000, 2) for val in item['金额'].values()]\n", "c = (\n", " Bar()\n", " .add_xaxis(x)\n", " .add_yaxis(\"订购数量(万件)\", y1, is_selected=False)\n", " .add_yaxis(\"金额(亿元)\", y2)\n", " .set_global_opts(title_opts=opts.TitleOpts(title=\"每月订购情况\"))\n", " .set_series_opts(\n", " label_opts=opts.LabelOpts(is_show=True),\n", " )\n", ")\n", "c.render_notebook()" ] }, { "cell_type": "markdown", "id": "sitting-yacht", "metadata": {}, "source": [ "#### 2.2 哪里的人最爱美" ] }, { "cell_type": "code", "execution_count": 45, "id": "incomplete-maria", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "item = fact_order.groupby('所在地市').agg({'订购数量': 'sum'}).sort_values(by='订购数量', ascending=False)[:20].sort_values(by='订购数量').to_dict()['订购数量']\n", "\n", "c = (\n", " Bar()\n", " .add_xaxis([*item.keys()])\n", " .add_yaxis(\"订购量\", [round(v/10000, 2) for v in item.values()], label_opts=opts.LabelOpts(position=\"right\", formatter='{@[1]/} 万'))\n", " .reversal_axis()\n", " .set_global_opts(\n", " title_opts=opts.TitleOpts(\"订购数量排行 TOP20\")\n", " )\n", ")\n", "c.render_notebook()" ] }, { "cell_type": "markdown", "id": "broken-estimate", "metadata": {}, "source": [ "#### 2.3 什么类型的美妆需求量最大" ] }, { "cell_type": "code", "execution_count": 46, "id": "seasonal-particle", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
订单编码订单日期客户编码所在区域所在省份所在地市商品编号订购数量订购单价金额订单月份商品名称商品小类商品大类销售单价
0D313132019-05-16S22796东区浙江台州市X091892214.0190888.05商品91粉底彩妆214
1D266742019-05-01S15128东区江苏南通市X0911133214.0242462.05商品91粉底彩妆214
2D233812019-09-22S17133东区江苏宿迁市X0911136214.0243104.09商品91粉底彩妆214
3D290602019-09-10S14106东区江苏常州市X091544214.0116416.09商品91粉底彩妆214
4D212342019-07-03S17197东区湖北十堰市X091342214.073188.07商品91粉底彩妆214
................................................
31439D304822019-06-05S11033东区浙江金华市X118551238.0131138.06商品118蜜粉彩妆238
31440D295422019-05-01S12446东区江苏南通市X118165238.039270.05商品118蜜粉彩妆238
31441D247982019-06-26S16170南区福建泉州市X11862238.014756.06商品118蜜粉彩妆238
31442D318312019-08-13S22214北区黑龙江佳木斯市X118795238.0189210.08商品118蜜粉彩妆238
31443D288902019-07-21S15066西区四川南充市X118148238.035224.07商品118蜜粉彩妆238
\n", "

31444 rows × 15 columns

\n", "
" ], "text/plain": [ " 订单编码 订单日期 客户编码 所在区域 所在省份 所在地市 商品编号 订购数量 订购单价 金额 \\\n", "0 D31313 2019-05-16 S22796 东区 浙江 台州市 X091 892 214.0 190888.0 \n", "1 D26674 2019-05-01 S15128 东区 江苏 南通市 X091 1133 214.0 242462.0 \n", "2 D23381 2019-09-22 S17133 东区 江苏 宿迁市 X091 1136 214.0 243104.0 \n", "3 D29060 2019-09-10 S14106 东区 江苏 常州市 X091 544 214.0 116416.0 \n", "4 D21234 2019-07-03 S17197 东区 湖北 十堰市 X091 342 214.0 73188.0 \n", "... ... ... ... ... ... ... ... ... ... ... \n", "31439 D30482 2019-06-05 S11033 东区 浙江 金华市 X118 551 238.0 131138.0 \n", "31440 D29542 2019-05-01 S12446 东区 江苏 南通市 X118 165 238.0 39270.0 \n", "31441 D24798 2019-06-26 S16170 南区 福建 泉州市 X118 62 238.0 14756.0 \n", "31442 D31831 2019-08-13 S22214 北区 黑龙江 佳木斯市 X118 795 238.0 189210.0 \n", "31443 D28890 2019-07-21 S15066 西区 四川 南充市 X118 148 238.0 35224.0 \n", "\n", " 订单月份 商品名称 商品小类 商品大类 销售单价 \n", "0 5 商品91 粉底 彩妆 214 \n", "1 5 商品91 粉底 彩妆 214 \n", "2 9 商品91 粉底 彩妆 214 \n", "3 9 商品91 粉底 彩妆 214 \n", "4 7 商品91 粉底 彩妆 214 \n", "... ... ... ... ... ... \n", "31439 6 商品118 蜜粉 彩妆 238 \n", "31440 5 商品118 蜜粉 彩妆 238 \n", "31441 6 商品118 蜜粉 彩妆 238 \n", "31442 8 商品118 蜜粉 彩妆 238 \n", "31443 7 商品118 蜜粉 彩妆 238 \n", "\n", "[31444 rows x 15 columns]" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "order = pd.merge(fact_order, dim_product, on='商品编号',how='inner') # 表关联\n", "order" ] }, { "cell_type": "code", "execution_count": 47, "id": "amended-intensity", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
订购数量
商品大类商品小类
彩妆口红2013024
粉底1188621
睫毛膏587399
眼影296599
蜜粉45534
护肤品面膜5451914
面霜4566905
爽肤水3523687
眼霜3350743
隔离霜2488124
防晒霜2388610
洁面乳1928020
\n", "
" ], "text/plain": [ " 订购数量\n", "商品大类 商品小类 \n", "彩妆 口红 2013024\n", " 粉底 1188621\n", " 睫毛膏 587399\n", " 眼影 296599\n", " 蜜粉 45534\n", "护肤品 面膜 5451914\n", " 面霜 4566905\n", " 爽肤水 3523687\n", " 眼霜 3350743\n", " 隔离霜 2488124\n", " 防晒霜 2388610\n", " 洁面乳 1928020" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "order.groupby(['商品大类','商品小类']).agg({'订购数量': 'sum'}).sort_values(by=['商品大类', '订购数量'], ascending=[True, False])" ] }, { "cell_type": "markdown", "id": "changed-sentence", "metadata": {}, "source": [ "#### 2.4 哪些省份的美妆需求量最大" ] }, { "cell_type": "code", "execution_count": 48, "id": "opponent-buying", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "
\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "item = fact_order.groupby('所在省份').agg({'订购数量': 'sum'}).to_dict()['订购数量']\n", "c = (\n", " Map()\n", " .add(\"订购数量\", [*item.items()], \"china\", is_map_symbol_show=False)\n", " .set_series_opts(label_opts=opts.LabelOpts(is_show=True))\n", " .set_global_opts(\n", " title_opts=opts.TitleOpts(title='省份分布'),\n", " visualmap_opts=opts.VisualMapOpts(max_=1000000), \n", " )\n", ")\n", "c.render_notebook()" ] }, { "cell_type": "markdown", "id": "distinguished-mercy", "metadata": {}, "source": [ "#### 2.5 通过 RFM 模型挖掘客户价值\n", "\n", "RFM 模型是衡量客户价值和客户创利能力的重要工具和手段,其中由3个要素构成了数据分析最好的指标,分别是:\n", "* R-Recency(最近一次购买时间)\n", "* F-Frequency(消费频率)\n", "* M-Money(消费金额)\n", "\n", "设定一个计算权重,比如 R-Recency 20% F-Frequency 30% M-Money 50% ,最后通过这个权重进行打分,量化客户价值,后续还可以基于分数进一步打标签,用来指导二次营销的策略。" ] }, { "cell_type": "code", "execution_count": 49, "id": "alpine-warning", "metadata": {}, "outputs": [], "source": [ "data_rfm = fact_order.groupby('客户编码').agg({'订单日期': 'max', '订单编码': 'count', '金额': 'sum'})\n", "data_rfm.columns = ['最近一次购买时间', '消费频率', '消费金额']" ] }, { "cell_type": "code", "execution_count": 50, "id": "absent-automation", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
最近一次购买时间消费频率消费金额RFM
客户编码
S116092019-09-30427326027.00.9801480.7963990.903970
S198282019-09-30212642275.00.9801480.3568790.306556
S171662019-09-30173627037.00.9801480.2613110.478301
S229252019-09-30313449117.00.9801480.5914130.457987
S104692019-09-30304198071.00.9801480.5701750.564174
.....................
S165032019-04-07141682893.00.0046170.1980610.146814
S175472019-03-14101784531.00.0032320.0872580.163435
S208642019-03-1481118752.00.0032320.0392430.047091
S119082019-03-0991552311.00.0018470.0609420.125577
S116112019-03-03101487966.00.0009230.0872580.113573
\n", "

1083 rows × 6 columns

\n", "
" ], "text/plain": [ " 最近一次购买时间 消费频率 消费金额 R F M\n", "客户编码 \n", "S11609 2019-09-30 42 7326027.0 0.980148 0.796399 0.903970\n", "S19828 2019-09-30 21 2642275.0 0.980148 0.356879 0.306556\n", "S17166 2019-09-30 17 3627037.0 0.980148 0.261311 0.478301\n", "S22925 2019-09-30 31 3449117.0 0.980148 0.591413 0.457987\n", "S10469 2019-09-30 30 4198071.0 0.980148 0.570175 0.564174\n", "... ... ... ... ... ... ...\n", "S16503 2019-04-07 14 1682893.0 0.004617 0.198061 0.146814\n", "S17547 2019-03-14 10 1784531.0 0.003232 0.087258 0.163435\n", "S20864 2019-03-14 8 1118752.0 0.003232 0.039243 0.047091\n", "S11908 2019-03-09 9 1552311.0 0.001847 0.060942 0.125577\n", "S11611 2019-03-03 10 1487966.0 0.000923 0.087258 0.113573\n", "\n", "[1083 rows x 6 columns]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_rfm['R'] = data_rfm['最近一次购买时间'].rank(pct=True) # 转化为排名 百分比,便于后续切片\n", "data_rfm['F'] = data_rfm['消费频率'].rank(pct=True)\n", "data_rfm['M'] = data_rfm['消费金额'].rank(pct=True)\n", "data_rfm.sort_values(by='R', ascending=False) " ] }, { "cell_type": "code", "execution_count": 51, "id": "transsexual-frost", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
最近一次购买时间消费频率消费金额RFMscore
客户编码
S174762019-09-306910325832.00.9801480.9866110.98707398.6
S223262019-09-306210074609.00.9801480.9732230.98430398.0
S115812019-09-287910333668.00.9182830.9967680.98799697.7
S128482019-09-29669673572.00.9445980.9806090.98060997.3
S190952019-09-268111031632.00.8647280.9990770.99630797.1
........................
S126902019-05-077917233.00.0129270.0226220.0249312.2
S111762019-06-097614134.00.0360110.0226220.0092341.9
S183792019-07-054400195.00.0710990.0032320.0046171.7
S132592019-06-016645925.00.0258540.0115420.0110801.4
S124632019-04-117345919.00.0055400.0226220.0009230.8
\n", "

1083 rows × 7 columns

\n", "
" ], "text/plain": [ " 最近一次购买时间 消费频率 消费金额 R F M score\n", "客户编码 \n", "S17476 2019-09-30 69 10325832.0 0.980148 0.986611 0.987073 98.6\n", "S22326 2019-09-30 62 10074609.0 0.980148 0.973223 0.984303 98.0\n", "S11581 2019-09-28 79 10333668.0 0.918283 0.996768 0.987996 97.7\n", "S12848 2019-09-29 66 9673572.0 0.944598 0.980609 0.980609 97.3\n", "S19095 2019-09-26 81 11031632.0 0.864728 0.999077 0.996307 97.1\n", "... ... ... ... ... ... ... ...\n", "S12690 2019-05-07 7 917233.0 0.012927 0.022622 0.024931 2.2\n", "S11176 2019-06-09 7 614134.0 0.036011 0.022622 0.009234 1.9\n", "S18379 2019-07-05 4 400195.0 0.071099 0.003232 0.004617 1.7\n", "S13259 2019-06-01 6 645925.0 0.025854 0.011542 0.011080 1.4\n", "S12463 2019-04-11 7 345919.0 0.005540 0.022622 0.000923 0.8\n", "\n", "[1083 rows x 7 columns]" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_rfm['score'] = data_rfm['R'] * 20 + data_rfm['F'] * 30 + data_rfm['M'] * 50\n", "data_rfm['score'] = data_rfm['score'].round(1)\n", "data_rfm.sort_values(by='score', ascending=False) " ] }, { "cell_type": "markdown", "id": "fixed-turkish", "metadata": {}, "source": [ "根据这个分数结果,我们可以对客户打上一些标签,比如大于 80 分的,标志为优质客户,在资源有限的情况下,可以优先服务好优质客户。" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "384px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 5 }