{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ac5ef3bc-214b-416a-a1ed-bb8342a2e23e",
   "metadata": {},
   "source": [
    "# The Donor Data Debacle  \n",
    "\n",
    "### Description:  \n",
    "You're a data analyst for a non-profit organization, and you've been tasked with cleaning up a messy dataset of donations. The data is a bit of a disaster, with missing values, duplicates, and inconsistent formatting. Your mission is to use your Pandas skills to wrangle the data into shape.  \n",
    "\n",
    "### Tasks:  \n",
    "- **Clean up the Mess:** Remove duplicates, handle missing values, and ensure data types are correct.  \n",
    "- **Standardize the data:** Normalize the 'Donation Amount' column and convert the `date` column to a standard format.  \n",
    "- **Data quality check:** Identify and correct any inconsistent or invalid data.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "50e9f9d5-6c65-494f-a5c6-6818efc2979d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Python version 3.11.7 | packaged by Anaconda, Inc. | (main, Dec 15 2023, 18:05:47) [MSC v.1916 64 bit (AMD64)]\n",
      "Pandas version 2.2.1\n",
      "Numpy version 1.26.4\n"
     ]
    }
   ],
   "source": [
    "# import libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import sys\n",
    "import re\n",
    "\n",
    "print('Python version ' + sys.version)\n",
    "print('Pandas version ' + pd.__version__)\n",
    "print('Numpy version ' + np.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "afbc5eec-6da5-482e-a003-4982e7e63b97",
   "metadata": {},
   "source": [
    "# The Data  \n",
    "\n",
    "The columns below represent information about individual donations, the date they were made, and the campaign that drove the donation. The goal is to clean, transform, and prepare this data for analysis.  \n",
    "\n",
    "Here's a breakdown of what each column in the sample data represents:  \n",
    "- **Donor ID:** A unique identifier for each donor \n",
    "- **Donation Amount:** The amount donated by each donor ( initially in a mix of numeric and string formats, requiring cleanup)  \n",
    "- **Date:** The date each donation was made\n",
    "- **Campaign:** The marketing campaign or channel that led to the donation\n",
    "\n",
    "Important Note about the `Donation Amount` Column:  \n",
    "\n",
    "The logic below will generate a mix of:  \n",
    "- Numeric values (e.g., 10.50, 500.00)  \n",
    "- String values with words (e.g., \"10 thousand\", \"5 dollars and 25 cents\")  \n",
    "- String values with currency symbols (e.g., \"`$50`\", \"`$1000`\")  \n",
    "\n",
    "Your task will be to clean up this column by converting all values to a standard numeric format, handling the various string formats, and dealing with any potential errors or inconsistencies. Good luck!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "8c807d37-576a-4f3c-8bc4-be4a3cf34886",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>donor_id</th>\n",
       "      <th>date</th>\n",
       "      <th>campaign</th>\n",
       "      <th>donation_amount</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>685</td>\n",
       "      <td>Invalid Date</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>560</td>\n",
       "      <td>2022-03-07T00:00:00.000000000</td>\n",
       "      <td>Email</td>\n",
       "      <td>6 dollars and 98 cents</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>630</td>\n",
       "      <td>2022-11-08T00:00:00.000000000</td>\n",
       "      <td>Social Media</td>\n",
       "      <td>76 thousand</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>193</td>\n",
       "      <td>2022-03-25T00:00:00.000000000</td>\n",
       "      <td>Email</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>836</td>\n",
       "      <td>2022-04-07T00:00:00.000000000</td>\n",
       "      <td>Email</td>\n",
       "      <td>$81</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9995</th>\n",
       "      <td>426</td>\n",
       "      <td>Invalid Date</td>\n",
       "      <td>Social Media</td>\n",
       "      <td>$81</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9996</th>\n",
       "      <td>891</td>\n",
       "      <td>2022-04-18T00:00:00.000000000</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9997</th>\n",
       "      <td>778</td>\n",
       "      <td>2022-08-24T00:00:00.000000000</td>\n",
       "      <td>Event</td>\n",
       "      <td>$81</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9998</th>\n",
       "      <td>974</td>\n",
       "      <td>2022-10-07T00:00:00.000000000</td>\n",
       "      <td>Email</td>\n",
       "      <td>$81</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9999</th>\n",
       "      <td>74</td>\n",
       "      <td>2022-07-09T00:00:00.000000000</td>\n",
       "      <td>Event</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>10000 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     donor_id                           date      campaign  \\\n",
       "0         685                   Invalid Date       Unknown   \n",
       "1         560  2022-03-07T00:00:00.000000000         Email   \n",
       "2         630  2022-11-08T00:00:00.000000000  Social Media   \n",
       "3         193  2022-03-25T00:00:00.000000000         Email   \n",
       "4         836  2022-04-07T00:00:00.000000000         Email   \n",
       "...       ...                            ...           ...   \n",
       "9995      426                   Invalid Date  Social Media   \n",
       "9996      891  2022-04-18T00:00:00.000000000       Unknown   \n",
       "9997      778  2022-08-24T00:00:00.000000000         Event   \n",
       "9998      974  2022-10-07T00:00:00.000000000         Email   \n",
       "9999       74  2022-07-09T00:00:00.000000000         Event   \n",
       "\n",
       "             donation_amount  \n",
       "0                        NaN  \n",
       "1     6 dollars and 98 cents  \n",
       "2                76 thousand  \n",
       "3                        NaN  \n",
       "4                        $81  \n",
       "...                      ...  \n",
       "9995                     $81  \n",
       "9996                     NaN  \n",
       "9997                     $81  \n",
       "9998                     $81  \n",
       "9999                     NaN  \n",
       "\n",
       "[10000 rows x 4 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# set the seed\n",
    "np.random.seed(0)\n",
    "\n",
    "# synthetic data\n",
    "data = {\n",
    "    'donor_id': np.random.randint(1, 1000, 10000),\n",
    "    'date': np.random.choice(pd.date_range('2022-01-01', periods=365), 10000),\n",
    "    'campaign': np.random.choice(['Email', 'Social Media', 'Event'], 10000),\n",
    "    'donation_amount': np.random.choice([\n",
    "        np.random.uniform(10, 1000),  # numeric value\n",
    "        f'{np.random.randint(1, 100)} thousand',  # string value (e.g., \"10 thousand\")\n",
    "        f'{np.random.randint(1, 10)} dollars and {np.random.randint(1, 100)} cents',  # string value (e.g., \"5 dollars and 25 cents\")\n",
    "        f'${np.random.randint(1, 100)}',  # string value with currency symbol (e.g., \"$50\")\n",
    "    ], 10000)    \n",
    "}\n",
    "\n",
    "# create dataframe\n",
    "df = pd.DataFrame(data)\n",
    "\n",
    "## introduce some messiness ##\n",
    "\n",
    "# make the column the wrong datatype\n",
    "df['donor_id'] = data['donor_id'].astype(str)\n",
    "\n",
    "# missing values\n",
    "df.loc[df.index % 3 == 0, 'donation_amount'] = np.nan\n",
    "\n",
    "# messy is my middle name\n",
    "df['date'] = data['date'].astype(str)\n",
    "df.loc[df.index % 5 == 0, 'date'] = 'Invalid Date'\n",
    "\n",
    "# the marketing manager is not going to be happy :)\n",
    "df.loc[df.index % 7 == 0, 'campaign'] = 'Unknown'\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "532d8d82-f071-423e-9dee-6a5ffdba1c33",
   "metadata": {},
   "source": [
    "Let's start by looking at the datatypes.  \n",
    "\n",
    "As you can expect, Pandas is treating all of the columns as strings. Let the clean up process begin."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "65a0bfa8-6038-4c9d-b9a6-38d881042643",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 10000 entries, 0 to 9999\n",
      "Data columns (total 4 columns):\n",
      " #   Column           Non-Null Count  Dtype \n",
      "---  ------           --------------  ----- \n",
      " 0   donor_id         10000 non-null  object\n",
      " 1   date             10000 non-null  object\n",
      " 2   campaign         10000 non-null  object\n",
      " 3   donation_amount  6666 non-null   object\n",
      "dtypes: object(4)\n",
      "memory usage: 312.6+ KB\n"
     ]
    }
   ],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "890de063-eb4f-4fc9-a645-12bb85e6ea22",
   "metadata": {},
   "source": [
    "# Clean up the Mess:  \n",
    "\n",
    "Remove duplicates, handle missing values, and ensure data types are correct.  \n",
    "\n",
    "If we assume that we will not be able to get the correct donation amounts, we might as well remove those rows from the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "7d069545-4c10-4d6e-a0c2-5ff7ac5c9ed7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Index: 6666 entries, 1 to 9998\n",
      "Data columns (total 4 columns):\n",
      " #   Column           Non-Null Count  Dtype \n",
      "---  ------           --------------  ----- \n",
      " 0   donor_id         6666 non-null   object\n",
      " 1   date             6666 non-null   object\n",
      " 2   campaign         6666 non-null   object\n",
      " 3   donation_amount  6666 non-null   object\n",
      "dtypes: object(4)\n",
      "memory usage: 260.4+ KB\n"
     ]
    }
   ],
   "source": [
    "df = df.dropna()\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80f71d38-12e8-48a9-b5d2-3b2b0bca8714",
   "metadata": {},
   "source": [
    "The marketing manager told us to replace any missing dates with '1970-01-01' so we can identify these and deal with them later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "da81e037-fbe7-438f-ae37-31f5ffdcb2f8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>donor_id</th>\n",
       "      <th>date</th>\n",
       "      <th>campaign</th>\n",
       "      <th>donation_amount</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>764</td>\n",
       "      <td>Invalid Date</td>\n",
       "      <td>Social Media</td>\n",
       "      <td>77.55500350452421</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>278</td>\n",
       "      <td>Invalid Date</td>\n",
       "      <td>Event</td>\n",
       "      <td>76 thousand</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>487</td>\n",
       "      <td>Invalid Date</td>\n",
       "      <td>Event</td>\n",
       "      <td>77.55500350452421</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>850</td>\n",
       "      <td>Invalid Date</td>\n",
       "      <td>Email</td>\n",
       "      <td>76 thousand</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>710</td>\n",
       "      <td>Invalid Date</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>6 dollars and 98 cents</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   donor_id          date      campaign         donation_amount\n",
       "5       764  Invalid Date  Social Media       77.55500350452421\n",
       "10      278  Invalid Date         Event             76 thousand\n",
       "20      487  Invalid Date         Event       77.55500350452421\n",
       "25      850  Invalid Date         Email             76 thousand\n",
       "35      710  Invalid Date       Unknown  6 dollars and 98 cents"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# identify the invalid dates\n",
    "mask = df['date'] == 'Invalid Date'\n",
    "df[mask].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af67e05b-4cc4-42d9-bd0c-4173f923056c",
   "metadata": {},
   "source": [
    "Here is where we set the dates to `1970-01-01`.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "271a4ae9-05b2-4a59-a693-e8713cb1e66e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>donor_id</th>\n",
       "      <th>date</th>\n",
       "      <th>campaign</th>\n",
       "      <th>donation_amount</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>560</td>\n",
       "      <td>2022-03-07T00:00:00.000000000</td>\n",
       "      <td>Email</td>\n",
       "      <td>6 dollars and 98 cents</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>630</td>\n",
       "      <td>2022-11-08T00:00:00.000000000</td>\n",
       "      <td>Social Media</td>\n",
       "      <td>76 thousand</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>836</td>\n",
       "      <td>2022-04-07T00:00:00.000000000</td>\n",
       "      <td>Email</td>\n",
       "      <td>$81</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>764</td>\n",
       "      <td>1970-01-01</td>\n",
       "      <td>Social Media</td>\n",
       "      <td>77.55500350452421</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>360</td>\n",
       "      <td>2022-10-20T00:00:00.000000000</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>76 thousand</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9992</th>\n",
       "      <td>308</td>\n",
       "      <td>2022-01-13T00:00:00.000000000</td>\n",
       "      <td>Email</td>\n",
       "      <td>77.55500350452421</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9994</th>\n",
       "      <td>694</td>\n",
       "      <td>2022-07-21T00:00:00.000000000</td>\n",
       "      <td>Event</td>\n",
       "      <td>6 dollars and 98 cents</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9995</th>\n",
       "      <td>426</td>\n",
       "      <td>1970-01-01</td>\n",
       "      <td>Social Media</td>\n",
       "      <td>$81</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9997</th>\n",
       "      <td>778</td>\n",
       "      <td>2022-08-24T00:00:00.000000000</td>\n",
       "      <td>Event</td>\n",
       "      <td>$81</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9998</th>\n",
       "      <td>974</td>\n",
       "      <td>2022-10-07T00:00:00.000000000</td>\n",
       "      <td>Email</td>\n",
       "      <td>$81</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6666 rows × 4 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     donor_id                           date      campaign  \\\n",
       "1         560  2022-03-07T00:00:00.000000000         Email   \n",
       "2         630  2022-11-08T00:00:00.000000000  Social Media   \n",
       "4         836  2022-04-07T00:00:00.000000000         Email   \n",
       "5         764                     1970-01-01  Social Media   \n",
       "7         360  2022-10-20T00:00:00.000000000       Unknown   \n",
       "...       ...                            ...           ...   \n",
       "9992      308  2022-01-13T00:00:00.000000000         Email   \n",
       "9994      694  2022-07-21T00:00:00.000000000         Event   \n",
       "9995      426                     1970-01-01  Social Media   \n",
       "9997      778  2022-08-24T00:00:00.000000000         Event   \n",
       "9998      974  2022-10-07T00:00:00.000000000         Email   \n",
       "\n",
       "             donation_amount  \n",
       "1     6 dollars and 98 cents  \n",
       "2                76 thousand  \n",
       "4                        $81  \n",
       "5          77.55500350452421  \n",
       "7                76 thousand  \n",
       "...                      ...  \n",
       "9992       77.55500350452421  \n",
       "9994  6 dollars and 98 cents  \n",
       "9995                     $81  \n",
       "9997                     $81  \n",
       "9998                     $81  \n",
       "\n",
       "[6666 rows x 4 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[mask,'date'] = '1970-01-01'\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8415a901-55e7-4215-89c4-ce53816e0a25",
   "metadata": {},
   "source": [
    "Although we successfully converted the strings into dates, the date column remains in string format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "f7f9c4c8-3d5e-43fb-b067-9565f5bdcd70",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Index: 6666 entries, 1 to 9998\n",
      "Data columns (total 4 columns):\n",
      " #   Column           Non-Null Count  Dtype \n",
      "---  ------           --------------  ----- \n",
      " 0   donor_id         6666 non-null   object\n",
      " 1   date             6666 non-null   object\n",
      " 2   campaign         6666 non-null   object\n",
      " 3   donation_amount  6666 non-null   object\n",
      "dtypes: object(4)\n",
      "memory usage: 260.4+ KB\n"
     ]
    }
   ],
   "source": [
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3d0635c-840f-4af7-8cc4-011e15b5bdee",
   "metadata": {},
   "source": [
    "Convert string column to a datetime object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "292b5d9d-0391-4883-b8df-6672aff8cf54",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1      2022-03-07\n",
       "2      2022-11-08\n",
       "4      2022-04-07\n",
       "5      1970-01-01\n",
       "7      2022-10-20\n",
       "          ...    \n",
       "9992   2022-01-13\n",
       "9994   2022-07-21\n",
       "9995   1970-01-01\n",
       "9997   2022-08-24\n",
       "9998   2022-10-07\n",
       "Name: date, Length: 6666, dtype: datetime64[ns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# `format='mixed'`, the format will be inferred for each element individually as the 1970 dates do not have the same format as the rest\n",
    "pd.to_datetime(df['date'], format='mixed')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e1c8c08e-6f47-4a07-a2bd-11d6d3dc1b22",
   "metadata": {},
   "source": [
    "This morning, for some reason I can't get these datatypes to behave...... the code below did not work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "1bf149d4-0d40-41c0-ba86-2c5b900354d2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Index: 6666 entries, 1 to 9998\n",
      "Data columns (total 4 columns):\n",
      " #   Column           Non-Null Count  Dtype \n",
      "---  ------           --------------  ----- \n",
      " 0   donor_id         6666 non-null   object\n",
      " 1   date             6666 non-null   object\n",
      " 2   campaign         6666 non-null   object\n",
      " 3   donation_amount  6666 non-null   object\n",
      "dtypes: object(4)\n",
      "memory usage: 260.4+ KB\n"
     ]
    }
   ],
   "source": [
    "# convert to date object\n",
    "df.loc[:,'date'] = pd.to_datetime(df['date'], format='mixed')\n",
    "\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "55f34638-9b7c-4511-a304-128cc099dfb0",
   "metadata": {},
   "source": [
    "We can also take care of the Donor ID pretty easily."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "56a1fb73-3468-42f6-a823-d5bb691e154f",
   "metadata": {},
   "source": [
    "This also did not work..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "8acfd543-d9fd-450a-aa03-8392d563528e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Index: 6666 entries, 1 to 9998\n",
      "Data columns (total 4 columns):\n",
      " #   Column           Non-Null Count  Dtype \n",
      "---  ------           --------------  ----- \n",
      " 0   donor_id         6666 non-null   object\n",
      " 1   date             6666 non-null   object\n",
      " 2   campaign         6666 non-null   object\n",
      " 3   donation_amount  6666 non-null   object\n",
      "dtypes: object(4)\n",
      "memory usage: 260.4+ KB\n"
     ]
    }
   ],
   "source": [
    "df.loc[:,'donor_id'] = df.loc[:,'donor_id'].astype(int)\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "779e0cc4-0dd0-4ca2-8ee7-a7df69384b05",
   "metadata": {},
   "source": [
    "This did the trick for me to get the date types to be represented correctly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "87bf4f85-62af-4a27-b203-f2d62e4a81a6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Index: 6666 entries, 1 to 9998\n",
      "Data columns (total 4 columns):\n",
      " #   Column           Non-Null Count  Dtype         \n",
      "---  ------           --------------  -----         \n",
      " 0   donor_id         6666 non-null   Int64         \n",
      " 1   date             6666 non-null   datetime64[ns]\n",
      " 2   campaign         6666 non-null   string        \n",
      " 3   donation_amount  6666 non-null   string        \n",
      "dtypes: Int64(1), datetime64[ns](1), string(2)\n",
      "memory usage: 266.9 KB\n"
     ]
    }
   ],
   "source": [
    "df = df.convert_dtypes()\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "36413b55-7486-4d4b-b075-26bc7cff1c3d",
   "metadata": {},
   "source": [
    "# Donation Amount Cleanup  \n",
    "\n",
    "- Remove the dollar sign\n",
    "- Apply a custom function to convert the values to a numeric format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "a174909b-35cd-442c-8073-274f637f78cf",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1    6 dollars and 98 cents\n",
       "2               76 thousand\n",
       "4                        81\n",
       "5         77.55500350452421\n",
       "7               76 thousand\n",
       "Name: donation_amount, dtype: string"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# remove dollar sign\n",
    "df.loc[:,'donation_amount'] = df.loc[:,'donation_amount'].apply(lambda x:x.replace(\"$\",\"\"))\n",
    "df.loc[:,'donation_amount'].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "e467db1f-8691-41b1-bd65-8b5ed72d5e45",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1                 6.98\n",
       "2                76000\n",
       "4                   81\n",
       "5    77.55500350452421\n",
       "7                76000\n",
       "Name: donation_amount, dtype: string"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "def clean_column(value):\n",
    "    ''' identify pattern and clean up  \n",
    "\n",
    "        patterns: \"10 thousand\", \"5 dollars and 25 cents\"\n",
    "    '''\n",
    "    pattern1 = r'\\d+ thousand'\n",
    "    pattern2 = r'\\d+ dollars and \\d+ cents'\n",
    "    \n",
    "    if re.search(pattern1, value):\n",
    "        # remove all non numeric characters from the string\n",
    "        return str(int(re.sub(r'[^\\d]', '', value)) * 1000)\n",
    "    elif re.search(pattern2, value):\n",
    "        # remove all non numeric characters from the strings\n",
    "        dollars = re.sub(r'[^\\d]', '', value.split('and')[0])\n",
    "        cents = re.sub(r'[^\\d]', '', value.split('and')[1])\n",
    "        return dollars + \".\" + cents\n",
    "    else: \n",
    "        return value\n",
    "\n",
    "df.loc[:,'donation_amount'] = df['donation_amount'].apply(clean_column)\n",
    "df['donation_amount'].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cb1dc70a-0891-4154-9876-b1915601481b",
   "metadata": {},
   "source": [
    "Now let's fix the datatype for the donation amount."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "6ea05a5c-db94-4fa0-a5bd-a133658d3c71",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Index: 6666 entries, 1 to 9998\n",
      "Data columns (total 4 columns):\n",
      " #   Column           Non-Null Count  Dtype         \n",
      "---  ------           --------------  -----         \n",
      " 0   donor_id         6666 non-null   Int64         \n",
      " 1   date             6666 non-null   datetime64[ns]\n",
      " 2   campaign         6666 non-null   string        \n",
      " 3   donation_amount  6666 non-null   float64       \n",
      "dtypes: Int64(1), datetime64[ns](1), float64(1), string(1)\n",
      "memory usage: 266.9 KB\n"
     ]
    }
   ],
   "source": [
    "df['donation_amount'] = df.loc[:,'donation_amount'].astype(float)\n",
    "df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a911727-958d-4e1e-8b3c-d4b0b3c2ee73",
   "metadata": {},
   "source": [
    "OK, so we have taken care of a lot here.  \n",
    "\n",
    "- The donor_id column is now in integer format\n",
    "- The date column is now in the correct format\n",
    "- The donation_amount column has been successfully cleaned up and converted to the correct numeric format"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "5ec3426c-1e06-4511-982e-22bb10b3f5b4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>donor_id</th>\n",
       "      <th>date</th>\n",
       "      <th>campaign</th>\n",
       "      <th>donation_amount</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>560</td>\n",
       "      <td>2022-03-07</td>\n",
       "      <td>Email</td>\n",
       "      <td>6.980000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>630</td>\n",
       "      <td>2022-11-08</td>\n",
       "      <td>Social Media</td>\n",
       "      <td>76000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>836</td>\n",
       "      <td>2022-04-07</td>\n",
       "      <td>Email</td>\n",
       "      <td>81.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>764</td>\n",
       "      <td>1970-01-01</td>\n",
       "      <td>Social Media</td>\n",
       "      <td>77.555004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>360</td>\n",
       "      <td>2022-10-20</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>76000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>10</td>\n",
       "      <td>2022-06-18</td>\n",
       "      <td>Social Media</td>\n",
       "      <td>81.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>278</td>\n",
       "      <td>1970-01-01</td>\n",
       "      <td>Event</td>\n",
       "      <td>76000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>755</td>\n",
       "      <td>2022-02-02</td>\n",
       "      <td>Event</td>\n",
       "      <td>76000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>600</td>\n",
       "      <td>2022-09-21</td>\n",
       "      <td>Social Media</td>\n",
       "      <td>6.980000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>71</td>\n",
       "      <td>2022-05-25</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>6.980000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>601</td>\n",
       "      <td>2022-12-29</td>\n",
       "      <td>Event</td>\n",
       "      <td>76000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>397</td>\n",
       "      <td>2022-01-31</td>\n",
       "      <td>Event</td>\n",
       "      <td>81.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>706</td>\n",
       "      <td>2022-12-05</td>\n",
       "      <td>Social Media</td>\n",
       "      <td>76000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>487</td>\n",
       "      <td>1970-01-01</td>\n",
       "      <td>Event</td>\n",
       "      <td>77.555004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>88</td>\n",
       "      <td>2022-07-05</td>\n",
       "      <td>Social Media</td>\n",
       "      <td>6.980000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>175</td>\n",
       "      <td>2022-07-24</td>\n",
       "      <td>Email</td>\n",
       "      <td>81.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>850</td>\n",
       "      <td>1970-01-01</td>\n",
       "      <td>Email</td>\n",
       "      <td>76000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>678</td>\n",
       "      <td>2022-06-20</td>\n",
       "      <td>Event</td>\n",
       "      <td>76000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>846</td>\n",
       "      <td>2022-05-20</td>\n",
       "      <td>Unknown</td>\n",
       "      <td>77.555004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>73</td>\n",
       "      <td>2022-08-31</td>\n",
       "      <td>Email</td>\n",
       "      <td>6.980000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    donor_id       date      campaign  donation_amount\n",
       "1        560 2022-03-07         Email         6.980000\n",
       "2        630 2022-11-08  Social Media     76000.000000\n",
       "4        836 2022-04-07         Email        81.000000\n",
       "5        764 1970-01-01  Social Media        77.555004\n",
       "7        360 2022-10-20       Unknown     76000.000000\n",
       "8         10 2022-06-18  Social Media        81.000000\n",
       "10       278 1970-01-01         Event     76000.000000\n",
       "11       755 2022-02-02         Event     76000.000000\n",
       "13       600 2022-09-21  Social Media         6.980000\n",
       "14        71 2022-05-25       Unknown         6.980000\n",
       "16       601 2022-12-29         Event     76000.000000\n",
       "17       397 2022-01-31         Event        81.000000\n",
       "19       706 2022-12-05  Social Media     76000.000000\n",
       "20       487 1970-01-01         Event        77.555004\n",
       "22        88 2022-07-05  Social Media         6.980000\n",
       "23       175 2022-07-24         Email        81.000000\n",
       "25       850 1970-01-01         Email     76000.000000\n",
       "26       678 2022-06-20         Event     76000.000000\n",
       "28       846 2022-05-20       Unknown        77.555004\n",
       "29        73 2022-08-31         Email         6.980000"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# let's take a peek at the data\n",
    "df.head(20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "d19684cc-f469-43b6-8ef0-3d2d88ca6fad",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>donor_id</th>\n",
       "      <th>date</th>\n",
       "      <th>donation_amount</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>6666.0</td>\n",
       "      <td>6666</td>\n",
       "      <td>6666.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>501.192319</td>\n",
       "      <td>2012-01-04 18:13:04.158415872</td>\n",
       "      <td>19297.706620</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1.0</td>\n",
       "      <td>1970-01-01 00:00:00</td>\n",
       "      <td>6.980000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>255.0</td>\n",
       "      <td>2022-01-25 00:00:00</td>\n",
       "      <td>77.555004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>499.0</td>\n",
       "      <td>2022-05-24 00:00:00</td>\n",
       "      <td>77.555004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>755.0</td>\n",
       "      <td>2022-09-13 00:00:00</td>\n",
       "      <td>76000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>999.0</td>\n",
       "      <td>2022-12-31 00:00:00</td>\n",
       "      <td>76000.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>288.740445</td>\n",
       "      <td>NaN</td>\n",
       "      <td>33034.244560</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         donor_id                           date  donation_amount\n",
       "count      6666.0                           6666      6666.000000\n",
       "mean   501.192319  2012-01-04 18:13:04.158415872     19297.706620\n",
       "min           1.0            1970-01-01 00:00:00         6.980000\n",
       "25%         255.0            2022-01-25 00:00:00        77.555004\n",
       "50%         499.0            2022-05-24 00:00:00        77.555004\n",
       "75%         755.0            2022-09-13 00:00:00     76000.000000\n",
       "max         999.0            2022-12-31 00:00:00     76000.000000\n",
       "std    288.740445                            NaN     33034.244560"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d13aa3dc-bbf0-4d4d-b486-89a6e0f65b96",
   "metadata": {},
   "source": [
    "# Data Gaze  \n",
    "\n",
    "I am going to recommend you get this data into Microsoft Excel and do a quick glance. Excel does a much better job at letting you analyze the data on your nice and big monitor."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "18d08951-7701-4f2e-97f6-5d36ff9d824e",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.to_clipboard()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22feacfa-f89e-4fe0-ab32-637d785b2fbb",
   "metadata": {},
   "source": [
    "<p class=\"text-muted\">This tutorial was created by <a href=\"https://www.hedaro.com\" target=\"_blank\"><strong>HEDARO</strong></a></p>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}