{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Standardize NTSB\n", "\n", "The NTSB has inconsistently logged the make and model of helicopters involved in accidents. This notebook cleans up variations based on Times research. " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import data directories" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "%store -r" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read in the NTSB's list of helicopters involved in fatal accidents." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "helicopters = pd.read_csv(os.path.join(output_dir, \"helicopters-by-accident.csv\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Read in The Times' crosswalk of standardizations and corrections." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "crosswalk = pd.read_csv(os.path.join(input_dir, \"crosswalk.csv\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Join The Times crosswalk to the NTSB database. The standardized make and models values are stored in fields with an \"latimes\" prefix, which will appended to the raw data." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "merged = helicopters.merge(\n", " crosswalk,\n", " on=[\"ntsb_make\", \"ntsb_model\"],\n", " how=\"left\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After they are joined, fill in any records that don't yet have a match." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "merged.latimes_make.fillna(\"\", inplace=True)\n", "merged.latimes_model.fillna(\"\", inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recreate the crosswalk with any new makes and models from the raw data now included." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "new_crosswalk = merged.groupby([\n", " \"ntsb_make\",\n", " \"ntsb_model\",\n", " \"latimes_make\",\n", " \"latimes_model\"\n", "]).size().rename(\"count\").reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Audit how many makes and models have been standardized." ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "uniques = pd.DataFrame([\n", " ('all unique makes', len(new_crosswalk)),\n", " ('standardized uniques', len(new_crosswalk[new_crosswalk.latimes_make != ''])),\n", " ('unstandardized uniques', len(new_crosswalk[new_crosswalk.latimes_make == ''])),\n", "])" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "counts = pd.DataFrame([\n", " ('all helicopters', new_crosswalk['count'].sum()),\n", " ('standardize count', new_crosswalk[new_crosswalk.latimes_make != '']['count'].sum()),\n", " ('unstandardize count', new_crosswalk[new_crosswalk.latimes_make == '']['count'].sum()),\n", "])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "uniques['percent'] = uniques[1] / len(new_crosswalk)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "counts['percent'] = counts[1] / new_crosswalk['count'].sum()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01percent
0all unique makes5431.000000
1standardized uniques5390.992634
2unstandardized uniques40.007366
\n", "
" ], "text/plain": [ " 0 1 percent\n", "0 all unique makes 543 1.000000\n", "1 standardized uniques 539 0.992634\n", "2 unstandardized uniques 4 0.007366" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "uniques" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01percent
0all helicopters15451.000000
1standardize count15410.997411
2unstandardize count40.002589
\n", "
" ], "text/plain": [ " 0 1 percent\n", "0 all helicopters 1545 1.000000\n", "1 standardize count 1541 0.997411\n", "2 unstandardize count 4 0.002589" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write the crosswalk out for further standardization work." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "new_crosswalk.sort_values([\"ntsb_make\", \"ntsb_model\"]).to_csv(\n", " os.path.join(input_dir, \"crosswalk.csv\"),\n", " index=False\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the NTSB data, create a combined field with the two standardized columns combined into one." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "merged['latimes_make_and_model'] = merged.latimes_make + \" \" + merged.latimes_model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Write out the standardized file for analysis." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "merged.to_csv(os.path.join(output_dir, \"standardized-helicopters-by-accident.csv\"), index=False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }