{ "cells": [ { "cell_type": "markdown", "id": "ac5ef3bc-214b-416a-a1ed-bb8342a2e23e", "metadata": {}, "source": [ "# The Donor Data Debacle \n", "\n", "### Description: \n", "You're a data analyst for a non-profit organization, and you've been tasked with cleaning up a messy dataset of donations. The data is a bit of a disaster, with missing values, duplicates, and inconsistent formatting. Your mission is to use your Pandas skills to wrangle the data into shape. \n", "\n", "### Tasks: \n", "- **Clean up the Mess:** Remove duplicates, handle missing values, and ensure data types are correct. \n", "- **Standardize the data:** Normalize the 'Donation Amount' column and convert the `date` column to a standard format. \n", "- **Data quality check:** Identify and correct any inconsistent or invalid data. " ] }, { "cell_type": "code", "execution_count": 1, "id": "50e9f9d5-6c65-494f-a5c6-6818efc2979d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]\n", "Pandas version 2.2.1\n", "Numpy version 1.26.4\n" ] } ], "source": [ "# import libraries\n", "import pandas as pd\n", "import numpy as np\n", "import sys\n", "import re\n", "\n", "print('Python version ' + sys.version)\n", "print('Pandas version ' + pd.__version__)\n", "print('Numpy version ' + np.__version__)" ] }, { "cell_type": "markdown", "id": "afbc5eec-6da5-482e-a003-4982e7e63b97", "metadata": {}, "source": [ "# The Data \n", "\n", "The columns below represent information about individual donations, the date they were made, and the campaign that drove the donation. The goal is to clean, transform, and prepare this data for analysis. \n", "\n", "Here's a breakdown of what each column in the sample data represents: \n", "- **Donor ID:** A unique identifier for each donor \n", "- **Donation Amount:** The amount donated by each donor ( initially in a mix of numeric and string formats, requiring cleanup) \n", "- **Date:** The date each donation was made\n", "- **Campaign:** The marketing campaign or channel that led to the donation\n", "\n", "Important Note about the `Donation Amount` Column: \n", "\n", "The logic below will generate a mix of: \n", "- Numeric values (e.g., 10.50, 500.00) \n", "- String values with words (e.g., \"10 thousand\", \"5 dollars and 25 cents\") \n", "- String values with currency symbols (e.g., \"`$50`\", \"`$1000`\") \n", "\n", "Your task will be to clean up this column by converting all values to a standard numeric format, handling the various string formats, and dealing with any potential errors or inconsistencies. Good luck!" ] }, { "cell_type": "code", "execution_count": 2, "id": "8c807d37-576a-4f3c-8bc4-be4a3cf34886", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
donor_iddatecampaigndonation_amount
0685Invalid DateUnknownNaN
15602022-03-07T00:00:00.000000000Email6 dollars and 98 cents
26302022-11-08T00:00:00.000000000Social Media76 thousand
31932022-03-25T00:00:00.000000000EmailNaN
48362022-04-07T00:00:00.000000000Email$81
...............
9995426Invalid DateSocial Media$81
99968912022-04-18T00:00:00.000000000UnknownNaN
99977782022-08-24T00:00:00.000000000Event$81
99989742022-10-07T00:00:00.000000000Email$81
9999742022-07-09T00:00:00.000000000EventNaN
\n", "

10000 rows × 4 columns

\n", "
" ], "text/plain": [ " donor_id date campaign \\\n", "0 685 Invalid Date Unknown \n", "1 560 2022-03-07T00:00:00.000000000 Email \n", "2 630 2022-11-08T00:00:00.000000000 Social Media \n", "3 193 2022-03-25T00:00:00.000000000 Email \n", "4 836 2022-04-07T00:00:00.000000000 Email \n", "... ... ... ... \n", "9995 426 Invalid Date Social Media \n", "9996 891 2022-04-18T00:00:00.000000000 Unknown \n", "9997 778 2022-08-24T00:00:00.000000000 Event \n", "9998 974 2022-10-07T00:00:00.000000000 Email \n", "9999 74 2022-07-09T00:00:00.000000000 Event \n", "\n", " donation_amount \n", "0 NaN \n", "1 6 dollars and 98 cents \n", "2 76 thousand \n", "3 NaN \n", "4 $81 \n", "... ... \n", "9995 $81 \n", "9996 NaN \n", "9997 $81 \n", "9998 $81 \n", "9999 NaN \n", "\n", "[10000 rows x 4 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# set the seed\n", "np.random.seed(0)\n", "\n", "# synthetic data\n", "data = {\n", " 'donor_id': np.random.randint(1, 1000, 10000),\n", " 'date': np.random.choice(pd.date_range('2022-01-01', periods=365), 10000),\n", " 'campaign': np.random.choice(['Email', 'Social Media', 'Event'], 10000),\n", " 'donation_amount': np.random.choice([\n", " np.random.uniform(10, 1000), # numeric value\n", " f'{np.random.randint(1, 100)} thousand', # string value (e.g., \"10 thousand\")\n", " f'{np.random.randint(1, 10)} dollars and {np.random.randint(1, 100)} cents', # string value (e.g., \"5 dollars and 25 cents\")\n", " f'${np.random.randint(1, 100)}', # string value with currency symbol (e.g., \"$50\")\n", " ], 10000) \n", "}\n", "\n", "# create dataframe\n", "df = pd.DataFrame(data)\n", "\n", "## introduce some messiness ##\n", "\n", "# make the column the wrong datatype\n", "df['donor_id'] = data['donor_id'].astype(str)\n", "\n", "# missing values\n", "df.loc[df.index % 3 == 0, 'donation_amount'] = np.nan\n", "\n", "# messy is my middle name\n", "df['date'] = data['date'].astype(str)\n", "df.loc[df.index % 5 == 0, 'date'] = 'Invalid Date'\n", "\n", "# the marketing manager is not going to be happy :)\n", "df.loc[df.index % 7 == 0, 'campaign'] = 'Unknown'\n", "\n", "df" ] }, { "cell_type": "markdown", "id": "532d8d82-f071-423e-9dee-6a5ffdba1c33", "metadata": {}, "source": [ "Let's start by looking at the datatypes. \n", "\n", "As you can expect, Pandas is treating all of the columns as strings. Let the clean up process begin." ] }, { "cell_type": "code", "execution_count": 3, "id": "65a0bfa8-6038-4c9d-b9a6-38d881042643", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 10000 entries, 0 to 9999\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 donor_id 10000 non-null object\n", " 1 date 10000 non-null object\n", " 2 campaign 10000 non-null object\n", " 3 donation_amount 6666 non-null object\n", "dtypes: object(4)\n", "memory usage: 312.6+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "id": "890de063-eb4f-4fc9-a645-12bb85e6ea22", "metadata": {}, "source": [ "# Clean up the Mess: \n", "\n", "Remove duplicates, handle missing values, and ensure data types are correct. \n", "\n", "If we assume that we will not be able to get the correct donation amounts, we might as well remove those rows from the data." ] }, { "cell_type": "code", "execution_count": 4, "id": "7d069545-4c10-4d6e-a0c2-5ff7ac5c9ed7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 6666 entries, 1 to 9998\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 donor_id 6666 non-null object\n", " 1 date 6666 non-null object\n", " 2 campaign 6666 non-null object\n", " 3 donation_amount 6666 non-null object\n", "dtypes: object(4)\n", "memory usage: 260.4+ KB\n" ] } ], "source": [ "df = df.dropna()\n", "df.info()" ] }, { "cell_type": "markdown", "id": "80f71d38-12e8-48a9-b5d2-3b2b0bca8714", "metadata": {}, "source": [ "The marketing manager told us to replace any missing dates with '1970-01-01' so we can identify these and deal with them later." ] }, { "cell_type": "code", "execution_count": 5, "id": "da81e037-fbe7-438f-ae37-31f5ffdcb2f8", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
donor_iddatecampaigndonation_amount
5764Invalid DateSocial Media77.55500350452421
10278Invalid DateEvent76 thousand
20487Invalid DateEvent77.55500350452421
25850Invalid DateEmail76 thousand
35710Invalid DateUnknown6 dollars and 98 cents
\n", "
" ], "text/plain": [ " donor_id date campaign donation_amount\n", "5 764 Invalid Date Social Media 77.55500350452421\n", "10 278 Invalid Date Event 76 thousand\n", "20 487 Invalid Date Event 77.55500350452421\n", "25 850 Invalid Date Email 76 thousand\n", "35 710 Invalid Date Unknown 6 dollars and 98 cents" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# identify the invalid dates\n", "mask = df['date'] == 'Invalid Date'\n", "df[mask].head()" ] }, { "cell_type": "markdown", "id": "af67e05b-4cc4-42d9-bd0c-4173f923056c", "metadata": {}, "source": [ "Here is where we set the dates to `1970-01-01`. " ] }, { "cell_type": "code", "execution_count": 6, "id": "271a4ae9-05b2-4a59-a693-e8713cb1e66e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
donor_iddatecampaigndonation_amount
15602022-03-07T00:00:00.000000000Email6 dollars and 98 cents
26302022-11-08T00:00:00.000000000Social Media76 thousand
48362022-04-07T00:00:00.000000000Email$81
57641970-01-01Social Media77.55500350452421
73602022-10-20T00:00:00.000000000Unknown76 thousand
...............
99923082022-01-13T00:00:00.000000000Email77.55500350452421
99946942022-07-21T00:00:00.000000000Event6 dollars and 98 cents
99954261970-01-01Social Media$81
99977782022-08-24T00:00:00.000000000Event$81
99989742022-10-07T00:00:00.000000000Email$81
\n", "

6666 rows × 4 columns

\n", "
" ], "text/plain": [ " donor_id date campaign \\\n", "1 560 2022-03-07T00:00:00.000000000 Email \n", "2 630 2022-11-08T00:00:00.000000000 Social Media \n", "4 836 2022-04-07T00:00:00.000000000 Email \n", "5 764 1970-01-01 Social Media \n", "7 360 2022-10-20T00:00:00.000000000 Unknown \n", "... ... ... ... \n", "9992 308 2022-01-13T00:00:00.000000000 Email \n", "9994 694 2022-07-21T00:00:00.000000000 Event \n", "9995 426 1970-01-01 Social Media \n", "9997 778 2022-08-24T00:00:00.000000000 Event \n", "9998 974 2022-10-07T00:00:00.000000000 Email \n", "\n", " donation_amount \n", "1 6 dollars and 98 cents \n", "2 76 thousand \n", "4 $81 \n", "5 77.55500350452421 \n", "7 76 thousand \n", "... ... \n", "9992 77.55500350452421 \n", "9994 6 dollars and 98 cents \n", "9995 $81 \n", "9997 $81 \n", "9998 $81 \n", "\n", "[6666 rows x 4 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[mask,'date'] = '1970-01-01'\n", "df" ] }, { "cell_type": "markdown", "id": "8415a901-55e7-4215-89c4-ce53816e0a25", "metadata": {}, "source": [ "Although we successfully converted the strings into dates, the date column remains in string format." ] }, { "cell_type": "code", "execution_count": 7, "id": "f7f9c4c8-3d5e-43fb-b067-9565f5bdcd70", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 6666 entries, 1 to 9998\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 donor_id 6666 non-null object\n", " 1 date 6666 non-null object\n", " 2 campaign 6666 non-null object\n", " 3 donation_amount 6666 non-null object\n", "dtypes: object(4)\n", "memory usage: 260.4+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "id": "e3d0635c-840f-4af7-8cc4-011e15b5bdee", "metadata": {}, "source": [ "Convert string column to a datetime object." ] }, { "cell_type": "code", "execution_count": 8, "id": "292b5d9d-0391-4883-b8df-6672aff8cf54", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 2022-03-07\n", "2 2022-11-08\n", "4 2022-04-07\n", "5 1970-01-01\n", "7 2022-10-20\n", " ... \n", "9992 2022-01-13\n", "9994 2022-07-21\n", "9995 1970-01-01\n", "9997 2022-08-24\n", "9998 2022-10-07\n", "Name: date, Length: 6666, dtype: datetime64[ns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# `format='mixed'`, the format will be inferred for each element individually as the 1970 dates do not have the same format as the rest\n", "pd.to_datetime(df['date'], format='mixed')" ] }, { "cell_type": "markdown", "id": "e1c8c08e-6f47-4a07-a2bd-11d6d3dc1b22", "metadata": {}, "source": [ "This morning, for some reason I can't get these datatypes to behave...... the code below did not work." ] }, { "cell_type": "code", "execution_count": 9, "id": "1bf149d4-0d40-41c0-ba86-2c5b900354d2", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 6666 entries, 1 to 9998\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 donor_id 6666 non-null object\n", " 1 date 6666 non-null object\n", " 2 campaign 6666 non-null object\n", " 3 donation_amount 6666 non-null object\n", "dtypes: object(4)\n", "memory usage: 260.4+ KB\n" ] } ], "source": [ "# convert to date object\n", "df.loc[:,'date'] = pd.to_datetime(df['date'], format='mixed')\n", "\n", "df.info()" ] }, { "cell_type": "markdown", "id": "55f34638-9b7c-4511-a304-128cc099dfb0", "metadata": {}, "source": [ "We can also take care of the Donor ID pretty easily." ] }, { "cell_type": "markdown", "id": "56a1fb73-3468-42f6-a823-d5bb691e154f", "metadata": {}, "source": [ "This also did not work..." ] }, { "cell_type": "code", "execution_count": 10, "id": "8acfd543-d9fd-450a-aa03-8392d563528e", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 6666 entries, 1 to 9998\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 donor_id 6666 non-null object\n", " 1 date 6666 non-null object\n", " 2 campaign 6666 non-null object\n", " 3 donation_amount 6666 non-null object\n", "dtypes: object(4)\n", "memory usage: 260.4+ KB\n" ] } ], "source": [ "df.loc[:,'donor_id'] = df.loc[:,'donor_id'].astype(int)\n", "df.info()" ] }, { "cell_type": "markdown", "id": "779e0cc4-0dd0-4ca2-8ee7-a7df69384b05", "metadata": {}, "source": [ "This did the trick for me to get the date types to be represented correctly." ] }, { "cell_type": "code", "execution_count": 11, "id": "87bf4f85-62af-4a27-b203-f2d62e4a81a6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 6666 entries, 1 to 9998\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 donor_id 6666 non-null Int64 \n", " 1 date 6666 non-null datetime64[ns]\n", " 2 campaign 6666 non-null string \n", " 3 donation_amount 6666 non-null string \n", "dtypes: Int64(1), datetime64[ns](1), string(2)\n", "memory usage: 266.9 KB\n" ] } ], "source": [ "df = df.convert_dtypes()\n", "df.info()" ] }, { "cell_type": "markdown", "id": "36413b55-7486-4d4b-b075-26bc7cff1c3d", "metadata": {}, "source": [ "# Donation Amount Cleanup \n", "\n", "- Remove the dollar sign\n", "- Apply a custom function to convert the values to a numeric format" ] }, { "cell_type": "code", "execution_count": 12, "id": "a174909b-35cd-442c-8073-274f637f78cf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 6 dollars and 98 cents\n", "2 76 thousand\n", "4 81\n", "5 77.55500350452421\n", "7 76 thousand\n", "Name: donation_amount, dtype: string" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# remove dollar sign\n", "df.loc[:,'donation_amount'] = df.loc[:,'donation_amount'].apply(lambda x:x.replace(\"$\",\"\"))\n", "df.loc[:,'donation_amount'].head()" ] }, { "cell_type": "code", "execution_count": 13, "id": "e467db1f-8691-41b1-bd65-8b5ed72d5e45", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 6.98\n", "2 76000\n", "4 81\n", "5 77.55500350452421\n", "7 76000\n", "Name: donation_amount, dtype: string" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def clean_column(value):\n", " ''' identify pattern and clean up \n", "\n", " patterns: \"10 thousand\", \"5 dollars and 25 cents\"\n", " '''\n", " pattern1 = r'\\d+ thousand'\n", " pattern2 = r'\\d+ dollars and \\d+ cents'\n", " \n", " if re.search(pattern1, value):\n", " # remove all non numeric characters from the string\n", " return str(int(re.sub(r'[^\\d]', '', value)) * 1000)\n", " elif re.search(pattern2, value):\n", " # remove all non numeric characters from the strings\n", " dollars = re.sub(r'[^\\d]', '', value.split('and')[0])\n", " cents = re.sub(r'[^\\d]', '', value.split('and')[1])\n", " return dollars + \".\" + cents\n", " else: \n", " return value\n", "\n", "df.loc[:,'donation_amount'] = df['donation_amount'].apply(clean_column)\n", "df['donation_amount'].head()" ] }, { "cell_type": "markdown", "id": "cb1dc70a-0891-4154-9876-b1915601481b", "metadata": {}, "source": [ "Now let's fix the datatype for the donation amount." ] }, { "cell_type": "code", "execution_count": 14, "id": "6ea05a5c-db94-4fa0-a5bd-a133658d3c71", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "Index: 6666 entries, 1 to 9998\n", "Data columns (total 4 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 donor_id 6666 non-null Int64 \n", " 1 date 6666 non-null datetime64[ns]\n", " 2 campaign 6666 non-null string \n", " 3 donation_amount 6666 non-null float64 \n", "dtypes: Int64(1), datetime64[ns](1), float64(1), string(1)\n", "memory usage: 266.9 KB\n" ] } ], "source": [ "df['donation_amount'] = df.loc[:,'donation_amount'].astype(float)\n", "df.info()" ] }, { "cell_type": "markdown", "id": "0a911727-958d-4e1e-8b3c-d4b0b3c2ee73", "metadata": {}, "source": [ "OK, so we have taken care of a lot here. \n", "\n", "- The donor_id column is now in integer format\n", "- The date column is now in the correct format\n", "- The donation_amount column has been successfully cleaned up and converted to the correct numeric format" ] }, { "cell_type": "code", "execution_count": 15, "id": "5ec3426c-1e06-4511-982e-22bb10b3f5b4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
donor_iddatecampaigndonation_amount
15602022-03-07Email6.980000
26302022-11-08Social Media76000.000000
48362022-04-07Email81.000000
57641970-01-01Social Media77.555004
73602022-10-20Unknown76000.000000
8102022-06-18Social Media81.000000
102781970-01-01Event76000.000000
117552022-02-02Event76000.000000
136002022-09-21Social Media6.980000
14712022-05-25Unknown6.980000
166012022-12-29Event76000.000000
173972022-01-31Event81.000000
197062022-12-05Social Media76000.000000
204871970-01-01Event77.555004
22882022-07-05Social Media6.980000
231752022-07-24Email81.000000
258501970-01-01Email76000.000000
266782022-06-20Event76000.000000
288462022-05-20Unknown77.555004
29732022-08-31Email6.980000
\n", "
" ], "text/plain": [ " donor_id date campaign donation_amount\n", "1 560 2022-03-07 Email 6.980000\n", "2 630 2022-11-08 Social Media 76000.000000\n", "4 836 2022-04-07 Email 81.000000\n", "5 764 1970-01-01 Social Media 77.555004\n", "7 360 2022-10-20 Unknown 76000.000000\n", "8 10 2022-06-18 Social Media 81.000000\n", "10 278 1970-01-01 Event 76000.000000\n", "11 755 2022-02-02 Event 76000.000000\n", "13 600 2022-09-21 Social Media 6.980000\n", "14 71 2022-05-25 Unknown 6.980000\n", "16 601 2022-12-29 Event 76000.000000\n", "17 397 2022-01-31 Event 81.000000\n", "19 706 2022-12-05 Social Media 76000.000000\n", "20 487 1970-01-01 Event 77.555004\n", "22 88 2022-07-05 Social Media 6.980000\n", "23 175 2022-07-24 Email 81.000000\n", "25 850 1970-01-01 Email 76000.000000\n", "26 678 2022-06-20 Event 76000.000000\n", "28 846 2022-05-20 Unknown 77.555004\n", "29 73 2022-08-31 Email 6.980000" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# let's take a peek at the data\n", "df.head(20)" ] }, { "cell_type": "code", "execution_count": 16, "id": "d19684cc-f469-43b6-8ef0-3d2d88ca6fad", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
donor_iddatedonation_amount
count6666.066666666.000000
mean501.1923192012-01-04 18:13:04.15841587219297.706620
min1.01970-01-01 00:00:006.980000
25%255.02022-01-25 00:00:0077.555004
50%499.02022-05-24 00:00:0077.555004
75%755.02022-09-13 00:00:0076000.000000
max999.02022-12-31 00:00:0076000.000000
std288.740445NaN33034.244560
\n", "
" ], "text/plain": [ " donor_id date donation_amount\n", "count 6666.0 6666 6666.000000\n", "mean 501.192319 2012-01-04 18:13:04.158415872 19297.706620\n", "min 1.0 1970-01-01 00:00:00 6.980000\n", "25% 255.0 2022-01-25 00:00:00 77.555004\n", "50% 499.0 2022-05-24 00:00:00 77.555004\n", "75% 755.0 2022-09-13 00:00:00 76000.000000\n", "max 999.0 2022-12-31 00:00:00 76000.000000\n", "std 288.740445 NaN 33034.244560" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "id": "d13aa3dc-bbf0-4d4d-b486-89a6e0f65b96", "metadata": {}, "source": [ "# Data Gaze \n", "\n", "I am going to recommend you get this data into Microsoft Excel and do a quick glance. Excel does a much better job at letting you analyze the data on your nice and big monitor." ] }, { "cell_type": "code", "execution_count": 17, "id": "18d08951-7701-4f2e-97f6-5d36ff9d824e", "metadata": {}, "outputs": [], "source": [ "df.to_clipboard()" ] }, { "cell_type": "markdown", "id": "22feacfa-f89e-4fe0-ab32-637d785b2fbb", "metadata": {}, "source": [ "

This tutorial was created by HEDARO

" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" } }, "nbformat": 4, "nbformat_minor": 5 }