{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Standardize NTSB\n",
    "\n",
    "The NTSB has inconsistently logged the make and model of helicopters involved in accidents. This notebook cleans up variations based on Times research. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import data directories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store -r"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read in the NTSB's list of helicopters involved in fatal accidents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "helicopters = pd.read_csv(os.path.join(output_dir, \"helicopters-by-accident.csv\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read in The Times' crosswalk of standardizations and corrections."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "crosswalk = pd.read_csv(os.path.join(input_dir, \"crosswalk.csv\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Join The Times crosswalk to the NTSB database. The standardized make and models values are stored in fields with an \"latimes\" prefix, which will appended to the raw data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "merged = helicopters.merge(\n",
    "    crosswalk,\n",
    "    on=[\"ntsb_make\", \"ntsb_model\"],\n",
    "    how=\"left\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After they are joined, fill in any records that don't yet have a match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "merged.latimes_make.fillna(\"\", inplace=True)\n",
    "merged.latimes_model.fillna(\"\", inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recreate the crosswalk with any new makes and models from the raw data now included."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_crosswalk = merged.groupby([\n",
    "    \"ntsb_make\",\n",
    "    \"ntsb_model\",\n",
    "    \"latimes_make\",\n",
    "    \"latimes_model\"\n",
    "]).size().rename(\"count\").reset_index()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Audit how many makes and models have been standardized."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "uniques = pd.DataFrame([\n",
    "    ('all unique makes', len(new_crosswalk)),\n",
    "    ('standardized uniques', len(new_crosswalk[new_crosswalk.latimes_make != ''])),\n",
    "    ('unstandardized uniques', len(new_crosswalk[new_crosswalk.latimes_make == ''])),\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "counts = pd.DataFrame([\n",
    "    ('all helicopters', new_crosswalk['count'].sum()),\n",
    "    ('standardize count', new_crosswalk[new_crosswalk.latimes_make != '']['count'].sum()),\n",
    "    ('unstandardize count', new_crosswalk[new_crosswalk.latimes_make == '']['count'].sum()),\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "uniques['percent'] = uniques[1] / len(new_crosswalk)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "counts['percent'] = counts[1] / new_crosswalk['count'].sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>percent</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>all unique makes</td>\n",
       "      <td>543</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>standardized uniques</td>\n",
       "      <td>539</td>\n",
       "      <td>0.992634</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>unstandardized uniques</td>\n",
       "      <td>4</td>\n",
       "      <td>0.007366</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        0    1   percent\n",
       "0        all unique makes  543  1.000000\n",
       "1    standardized uniques  539  0.992634\n",
       "2  unstandardized uniques    4  0.007366"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "uniques"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>percent</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>all helicopters</td>\n",
       "      <td>1545</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>standardize count</td>\n",
       "      <td>1541</td>\n",
       "      <td>0.997411</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>unstandardize count</td>\n",
       "      <td>4</td>\n",
       "      <td>0.002589</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     0     1   percent\n",
       "0      all helicopters  1545  1.000000\n",
       "1    standardize count  1541  0.997411\n",
       "2  unstandardize count     4  0.002589"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write the crosswalk out for further standardization work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_crosswalk.sort_values([\"ntsb_make\", \"ntsb_model\"]).to_csv(\n",
    "    os.path.join(input_dir, \"crosswalk.csv\"),\n",
    "    index=False\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On the NTSB data, create a combined field with the two standardized columns combined into one."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "merged['latimes_make_and_model'] = merged.latimes_make + \" \" + merged.latimes_model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write out the standardized file for analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "merged.to_csv(os.path.join(output_dir, \"standardized-helicopters-by-accident.csv\"), index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}