{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Standardize NTSB\n",
"\n",
"The NTSB has inconsistently logged the make and model of helicopters involved in accidents. This notebook cleans up variations based on Times research. "
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import data directories"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"%store -r"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read in the NTSB's list of helicopters involved in fatal accidents."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"helicopters = pd.read_csv(os.path.join(output_dir, \"helicopters-by-accident.csv\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read in The Times' crosswalk of standardizations and corrections."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"crosswalk = pd.read_csv(os.path.join(input_dir, \"crosswalk.csv\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Join The Times crosswalk to the NTSB database. The standardized make and models values are stored in fields with an \"latimes\" prefix, which will appended to the raw data."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"merged = helicopters.merge(\n",
" crosswalk,\n",
" on=[\"ntsb_make\", \"ntsb_model\"],\n",
" how=\"left\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After they are joined, fill in any records that don't yet have a match."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"merged.latimes_make.fillna(\"\", inplace=True)\n",
"merged.latimes_model.fillna(\"\", inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recreate the crosswalk with any new makes and models from the raw data now included."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"new_crosswalk = merged.groupby([\n",
" \"ntsb_make\",\n",
" \"ntsb_model\",\n",
" \"latimes_make\",\n",
" \"latimes_model\"\n",
"]).size().rename(\"count\").reset_index()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Audit how many makes and models have been standardized."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"uniques = pd.DataFrame([\n",
" ('all unique makes', len(new_crosswalk)),\n",
" ('standardized uniques', len(new_crosswalk[new_crosswalk.latimes_make != ''])),\n",
" ('unstandardized uniques', len(new_crosswalk[new_crosswalk.latimes_make == ''])),\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"counts = pd.DataFrame([\n",
" ('all helicopters', new_crosswalk['count'].sum()),\n",
" ('standardize count', new_crosswalk[new_crosswalk.latimes_make != '']['count'].sum()),\n",
" ('unstandardize count', new_crosswalk[new_crosswalk.latimes_make == '']['count'].sum()),\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"uniques['percent'] = uniques[1] / len(new_crosswalk)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"counts['percent'] = counts[1] / new_crosswalk['count'].sum()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" percent | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" all unique makes | \n",
" 543 | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 1 | \n",
" standardized uniques | \n",
" 539 | \n",
" 0.992634 | \n",
"
\n",
" \n",
" 2 | \n",
" unstandardized uniques | \n",
" 4 | \n",
" 0.007366 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 percent\n",
"0 all unique makes 543 1.000000\n",
"1 standardized uniques 539 0.992634\n",
"2 unstandardized uniques 4 0.007366"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"uniques"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" 0 | \n",
" 1 | \n",
" percent | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" all helicopters | \n",
" 1545 | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 1 | \n",
" standardize count | \n",
" 1541 | \n",
" 0.997411 | \n",
"
\n",
" \n",
" 2 | \n",
" unstandardize count | \n",
" 4 | \n",
" 0.002589 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" 0 1 percent\n",
"0 all helicopters 1545 1.000000\n",
"1 standardize count 1541 0.997411\n",
"2 unstandardize count 4 0.002589"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Write the crosswalk out for further standardization work."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"new_crosswalk.sort_values([\"ntsb_make\", \"ntsb_model\"]).to_csv(\n",
" os.path.join(input_dir, \"crosswalk.csv\"),\n",
" index=False\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"On the NTSB data, create a combined field with the two standardized columns combined into one."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"merged['latimes_make_and_model'] = merged.latimes_make + \" \" + merged.latimes_model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Write out the standardized file for analysis."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"merged.to_csv(os.path.join(output_dir, \"standardized-helicopters-by-accident.csv\"), index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}