{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tidy NTSB accidentis database\n",
    "\n",
    "Only a subset of the accidents data is necessary for this analysis. The source tables need to be tidied up and filtered down to only records involving helicopters. That's what this notebook does.\n",
    "\n",
    "Each step below was vetted by database experts at the NTSB."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Open\n",
    "\n",
    "Configure the data directories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "%store -r"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Read in the files we want to keep."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "read_df = lambda name, types: pd.read_csv(os.path.join(input_dir, name), dtype=types, low_memory=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A list of all the aircraft involved with accidents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_aircraft = read_df(\"aircraft.csv\", dict(ev_id=str, Aircraft_Key=str, ntsb_no=str))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A list of all the accidents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_events = read_df(\"events.csv\", dict(ev_id=str, ntsb_no=str))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A list of all the flight crew on the aircraft involved in accidents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_flight_crew = read_df(\"flight_crew.csv\", dict(ev_id=str, Aircraft_Key=str, crew_no=str))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data about the previous experience levels of flight crew involved in accidents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_flight_time = read_df(\"flight_time.csv\", dict(ev_id=str, Aircraft_Key=str, crew_no=str))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Narratives from investigators"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_narratives = read_df(\"narratives.csv\", str)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Strip\n",
    "\n",
    "The NTSB frequently leaves dangling spaces at the end of its string columns. Let's strip them all."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "strip_df = lambda df: df.applymap(lambda x: x.strip().upper() if type(x) is str else x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "stripped_aircraft = strip_df(raw_aircraft)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "stripped_events = strip_df(raw_events)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "stripped_flight_crew = strip_df(raw_flight_crew)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "stripped_flight_time = strip_df(raw_flight_time)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "stripped_narratives = strip_df(raw_narratives)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Trim\n",
    "\n",
    "Limit each table to the columns we want to keep. Rename the columns to make them more readable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "trim_df = lambda df, cols: df[list(cols)].rename(columns=cols) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "aircraft_columns = dict(\n",
    "    ev_id=\"event_id\",\n",
    "    Aircraft_Key=\"aircraft_id\",\n",
    "    acft_make=\"ntsb_make\",\n",
    "    acft_model=\"ntsb_model\",\n",
    "    acft_category=\"category\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "trimmed_aircraft = trim_df(stripped_aircraft, aircraft_columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "events_columns = dict(\n",
    "    ev_id=\"event_id\",\n",
    "    ntsb_no=\"ntsb_number\",\n",
    "    ev_year=\"year\",\n",
    "    ev_date=\"date\",\n",
    "    ev_city=\"city\",\n",
    "    ev_state=\"state\",\n",
    "    ev_country=\"country\",\n",
    "    inj_tot_f=\"total_fatalities\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "trimmed_events = trim_df(stripped_events, events_columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "flight_crew_columns = dict(\n",
    "    ev_id= \"event_id\",\n",
    "    Aircraft_Key=\"aircraft_id\",\n",
    "    crew_no=\"crew_id\",\n",
    "    crew_age=\"age\",\n",
    "    crew_category=\"category\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "trimmed_flight_crew = trim_df(stripped_flight_crew, flight_crew_columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "flight_time_columns = dict(\n",
    "    ev_id=\"event_id\",\n",
    "    Aircraft_Key=\"aircraft_id\",\n",
    "    crew_no=\"crew_id\",\n",
    "    flight_hours=\"flight_hours\",\n",
    "    flight_type=\"flight_type\",\n",
    "    flight_craft=\"flight_craft\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "trimmed_flight_time = trim_df(stripped_flight_time, flight_time_columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "narrative_columns = dict(\n",
    "    ev_id=\"event_id\",\n",
    "    Aircraft_Key=\"aircraft_id\",\n",
    "    narr_accp=\"factual_narrative\",\n",
    "    narr_accf=\"final_narrative\",\n",
    "    narr_cause=\"probable_cause_narrative\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "trimmed_narratives = trim_df(stripped_narratives, narrative_columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Filter\n",
    "\n",
    "Each table needs to be filtered so that they only contain data related to fatal helicopter accidents in the United States."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we trim down to the helicopters linked to accident events."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "helicopters = trimmed_aircraft[trimmed_aircraft.category == 'HELI'].drop(\"category\", axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we trim down the accidents to those involving helicopters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "heli_events = trimmed_events[trimmed_events.event_id.isin(helicopters.event_id)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Those also need to be limited to accidents that involved fatalities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "fatal_heli_events = heli_events[heli_events.total_fatalities > 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The flight crew data should be limited to pilots. We know the codes to look for thanks to the NTSB's database documentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "pilots = trimmed_flight_crew[trimmed_flight_crew.category.isin([\"PLT\", \"DSTU\", \"FLTI\", \"CPLT\", \"KPLT\"])]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Those pilots then also need to be limited to events involving helicopters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "heli_pilots = pilots[pilots.event_id.isin(fatal_heli_events.event_id)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Only total flight time in all flights."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_flight_hours = trimmed_flight_time[\n",
    "    (trimmed_flight_time.flight_type == 'TOTL') &\n",
    "    (trimmed_flight_time.flight_craft == 'ALL')\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Merge\n",
    "\n",
    "The datasets needs to be merged with each other for analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Helicopters should be joined to our fatal accident list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "helicopters_by_accident = pd.merge(\n",
    "    helicopters,\n",
    "    fatal_heli_events,\n",
    "    on=\"event_id\",\n",
    "    how=\"inner\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pilots should be joined to their flight times."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "merged_pilots = pd.merge(\n",
    "    heli_pilots,\n",
    "    total_flight_hours,\n",
    "    on=[\"event_id\", \"aircraft_id\", \"crew_id\"],\n",
    "    how=\"left\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Annotate"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mark the accidents that happened in the United States."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "helicopters_by_accident['in_usa'] = helicopters_by_accident.country == 'USA'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will add a pilot count to each accident for use in the analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "pilot_counts = merged_pilots.groupby([\"event_id\", \"aircraft_id\"]).size().reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "pilot_counts.columns = [\"event_id\", \"aircraft_id\", \"pilot_total\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "helicopters_by_accident_annotated = helicopters_by_accident.merge(\n",
    "    pilot_counts,\n",
    "    on=[\"event_id\", \"aircraft_id\"],\n",
    "    how=\"left\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write out the files we'll use in the analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_df = lambda df, name: df.to_csv(os.path.join(output_dir, name), index=False, encoding=\"utf-8\") "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_df(helicopters_by_accident_annotated, 'helicopters-by-accident.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_df(fatal_heli_events, 'fatal-accidents.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_df(merged_pilots, 'pilots.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_df(trimmed_narratives, 'narratives.csv')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}