{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Assign NERC labels to plants using 860 data and k-nearest neighbors"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T20:00:19.547289Z",
"start_time": "2017-11-06T20:00:19.534260Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import os\n",
"from os.path import join\n",
"import pandas as pd\n",
"from sklearn import neighbors, metrics\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from sklearn.model_selection import train_test_split, GridSearchCV\n",
"from collections import Counter\n",
"\n",
"\n",
"cwd = os.getcwd()\n",
"data_path = join(cwd, '..', 'Data storage')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Date string for filenames\n",
"This will be inserted into all filenames (reading and writing)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"file_date = '2018-03-06'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data\n",
"This loads facility data that has been assembled from the EIA bulk data file, and EIA-860 excel files. The EIA-860 excel files need to be downloaded manually."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load EIA facility data\n",
"Only need to keep the plant id, year (as a check that plants don't move between years), and lat/lon"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T14:34:57.939401Z",
"start_time": "2017-11-06T14:34:48.570853Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"path = os.path.join(data_path, 'Facility gen fuels and CO2 {}.csv'.format(file_date))\n",
"facility_df = pd.read_csv(path)\n",
"facility_df['state'] = facility_df['geography'].str[-2:]"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T14:42:35.675435Z",
"start_time": "2017-11-06T14:42:34.964483Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"plants = facility_df.loc[:, ['plant id', 'year', 'lat', 'lon', 'state']]\n",
"plants.drop_duplicates(inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load known NERC labels from EIA-860\n",
"Current NERCS go back to 2012. Use that, 2015, and the 2016 early release."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T14:27:35.633747Z",
"start_time": "2017-11-06T14:27:35.621006Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"eia_base_path = join(data_path, 'EIA downloads')\n",
"file_860_info = {\n",
"# 2011: {'io': join(eia_base_path, 'eia8602011', 'Plant.xlsx'),\n",
"# 'skiprows': 0,\n",
"# 'parse_cols': 'B,J'},\n",
" 2012: {'io': join(eia_base_path, 'eia8602012', 'PlantY2012.xlsx'),\n",
" 'skiprows': 0,\n",
" 'usecols': 'B,J'},\n",
" 2013: {'io': join(eia_base_path, 'eia8602013', '2___Plant_Y2013.xlsx'),\n",
" 'skiprows': 0,\n",
" 'usecols': 'C,L'},\n",
" 2014: {'io': join(eia_base_path, 'eia8602014', '2___Plant_Y2014.xlsx'),\n",
" 'skiprows': 0,\n",
" 'usecols': 'C,L'},\n",
" 2015: {'io': join(eia_base_path, 'eia8602015', '2___Plant_Y2015.xlsx'),\n",
" 'skiprows': 0,\n",
" 'usecols': 'C,L'},\n",
" 2016: {'io': join(eia_base_path, 'eia8602016', '2___Plant_Y2016.xlsx'),\n",
" 'skiprows': 0,\n",
" 'usecols': 'C,L'}\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T14:30:20.380645Z",
"start_time": "2017-11-06T14:29:57.233977Z"
}
},
"outputs": [],
"source": [
"eia_nercs = {}\n",
"for key, args in file_860_info.items():\n",
" eia_nercs[key] = pd.read_excel(**args)\n",
" eia_nercs[key].columns = ['plant id', 'nerc']\n",
" eia_nercs[key]['year'] = key"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T14:30:23.906158Z",
"start_time": "2017-11-06T14:30:23.877433Z"
}
},
"outputs": [],
"source": [
"nercs = pd.concat(eia_nercs.values()).drop_duplicates(subset=['plant id', 'nerc'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Look for plants listed with different NERC labels\n",
"There are 30 plants duplicated. Five of them don't have a NERC label in one of the years. The largest move is from MRO to other regions (12), with most of those to SPP (7). After that, moves from RFC (5) to MRO (3) and SERC (2). There are also some moves from WECC and FRCC to HICC/ASCC - these might be diesel generators that get moved.\n",
"\n",
"The plants that have duplicate NERC region labels represent a small fraction of national generation, but one that is growing over time. By 2016 they consist of 0.15% of national generation."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T14:31:12.401717Z",
"start_time": "2017-11-06T14:31:12.388812Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"7289 total records\n",
"7289 unique plants\n",
"8060 total records\n",
"8060 unique plants\n",
"8520 total records\n",
"8520 unique plants\n",
"8928 total records\n",
"8928 unique plants\n",
"9711 total records\n",
"9711 unique plants\n",
"10104 total records\n",
"10068 unique plants\n"
]
}
],
"source": [
"for df_ in list(eia_nercs.values()) + [nercs]:\n",
" print('{} total records'.format(len(df_)))\n",
" print('{} unique plants'.format(len(df_['plant id'].unique())))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T14:43:34.109577Z",
"start_time": "2017-11-06T14:43:34.102600Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 66, 1120, 1121, 7757, 7848, 7847, 6280, 57251, 57252,\n",
" 70, 899, 1168, 57449, 58469, 55836, 56266, 56106, 56856,\n",
" 56985, 57622, 57623, 57651, 57650, 57983, 58117, 58278, 58511,\n",
" 59027, 59037, 58690, 58655, 58676], dtype=int64)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dup_plants = nercs.loc[nercs['plant id'].duplicated(keep=False), 'plant id'].unique()\n",
"dup_plants"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T14:43:41.526762Z",
"start_time": "2017-11-06T14:43:41.503604Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Counter({('ASCC', nan): 2,\n",
" ('FRCC', 'HICC'): 1,\n",
" ('MRO', 'RFC'): 2,\n",
" ('MRO', 'SERC'): 1,\n",
" ('MRO', 'SPP'): 7,\n",
" ('MRO', 'WECC'): 3,\n",
" ('RFC', 'MRO'): 3,\n",
" ('RFC', 'SERC'): 2,\n",
" ('SERC', 'SPP'): 1,\n",
" ('SPP', 'SERC'): 2,\n",
" ('SPP', 'TRE'): 1,\n",
" ('WECC', 'ASCC'): 2,\n",
" ('WECC', 'HICC'): 1,\n",
" (nan, 'WECC', 'ASCC'): 3,\n",
" (nan, 'WECC', 'HICC'): 1})"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"region_list = []\n",
"for plant in dup_plants:\n",
" regions = nercs.loc[nercs['plant id'] == plant, 'nerc'].unique()\n",
"# regions = regions.tolist()\n",
" region_list.append(regions)\n",
"Counter(tuple(x) for x in region_list)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T14:43:46.932593Z",
"start_time": "2017-11-06T14:43:46.804826Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"year\n",
"2001 0.000345\n",
"2002 0.000269\n",
"2003 0.000262\n",
"2004 0.000313\n",
"2005 0.000426\n",
"2006 0.000514\n",
"2007 0.000509\n",
"2008 0.000527\n",
"2009 0.000631\n",
"2010 0.000683\n",
"2011 0.000763\n",
"2012 0.001286\n",
"2013 0.001138\n",
"2014 0.001079\n",
"2015 0.001701\n",
"2016 0.001962\n",
"2017 0.001460\n",
"Name: generation (MWh), dtype: float64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(facility_df.loc[facility_df['plant id'].isin(dup_plants), :]\n",
" .groupby('year')['generation (MWh)'].sum()\n",
" / facility_df.loc[:, :]\n",
" .groupby('year')['generation (MWh)'].sum())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Some plants in EIA-860 don't have NERC labels. Drop them now.\n",
"This is my training data. All of these plants should still be in my `plants` dataframe."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:24:20.138841Z",
"start_time": "2017-11-06T16:24:20.056677Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"41"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nan_plants = nercs.loc[nercs.isnull().any(axis=1)]\n",
"len(nan_plants)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:24:53.142457Z",
"start_time": "2017-11-06T16:24:53.113360Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" plant id | \n",
" nerc | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 65 | \n",
" 66 | \n",
" ASCC | \n",
" 2012 | \n",
"
\n",
" \n",
" 1637 | \n",
" 58277 | \n",
" NaN | \n",
" 2012 | \n",
"
\n",
" \n",
" 1928 | \n",
" 70 | \n",
" ASCC | \n",
" 2012 | \n",
"
\n",
" \n",
" 1961 | \n",
" 58405 | \n",
" NaN | \n",
" 2012 | \n",
"
\n",
" \n",
" 4145 | \n",
" 58469 | \n",
" NaN | \n",
" 2012 | \n",
"
\n",
" \n",
" 6945 | \n",
" 58117 | \n",
" NaN | \n",
" 2012 | \n",
"
\n",
" \n",
" 6946 | \n",
" 58278 | \n",
" NaN | \n",
" 2012 | \n",
"
\n",
" \n",
" 7164 | \n",
" 58380 | \n",
" NaN | \n",
" 2012 | \n",
"
\n",
" \n",
" 7194 | \n",
" 58425 | \n",
" NaN | \n",
" 2012 | \n",
"
\n",
" \n",
" 7262 | \n",
" 58511 | \n",
" NaN | \n",
" 2012 | \n",
"
\n",
" \n",
" 64 | \n",
" 66 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 828 | \n",
" 58982 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 1917 | \n",
" 58639 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 1918 | \n",
" 58640 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 1938 | \n",
" 58977 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 2205 | \n",
" 58989 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 2213 | \n",
" 58549 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 4127 | \n",
" 58469 | \n",
" WECC | \n",
" 2013 | \n",
"
\n",
" \n",
" 6954 | \n",
" 58117 | \n",
" WECC | \n",
" 2013 | \n",
"
\n",
" \n",
" 6955 | \n",
" 58278 | \n",
" WECC | \n",
" 2013 | \n",
"
\n",
" \n",
" 7301 | \n",
" 58511 | \n",
" WECC | \n",
" 2013 | \n",
"
\n",
" \n",
" 7414 | \n",
" 58651 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 7422 | \n",
" 58656 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 7424 | \n",
" 58659 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 7427 | \n",
" 58662 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 7432 | \n",
" 58684 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 7457 | \n",
" 58705 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 7603 | \n",
" 58837 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 7750 | \n",
" 59035 | \n",
" NaN | \n",
" 2013 | \n",
"
\n",
" \n",
" 4129 | \n",
" 58469 | \n",
" HICC | \n",
" 2014 | \n",
"
\n",
" \n",
" 6929 | \n",
" 58117 | \n",
" ASCC | \n",
" 2014 | \n",
"
\n",
" \n",
" 6930 | \n",
" 58278 | \n",
" ASCC | \n",
" 2014 | \n",
"
\n",
" \n",
" 7272 | \n",
" 58511 | \n",
" ASCC | \n",
" 2014 | \n",
"
\n",
" \n",
" 8453 | \n",
" 60024 | \n",
" NaN | \n",
" 2015 | \n",
"
\n",
" \n",
" 8550 | \n",
" 60125 | \n",
" NaN | \n",
" 2015 | \n",
"
\n",
" \n",
" 8667 | \n",
" 60243 | \n",
" NaN | \n",
" 2015 | \n",
"
\n",
" \n",
" 8668 | \n",
" 60244 | \n",
" NaN | \n",
" 2015 | \n",
"
\n",
" \n",
" 8669 | \n",
" 60245 | \n",
" NaN | \n",
" 2015 | \n",
"
\n",
" \n",
" 8674 | \n",
" 60250 | \n",
" NaN | \n",
" 2015 | \n",
"
\n",
" \n",
" 8684 | \n",
" 60260 | \n",
" NaN | \n",
" 2015 | \n",
"
\n",
" \n",
" 8749 | \n",
" 60328 | \n",
" NaN | \n",
" 2015 | \n",
"
\n",
" \n",
" 40 | \n",
" 70 | \n",
" NaN | \n",
" 2016 | \n",
"
\n",
" \n",
" 8888 | \n",
" 60563 | \n",
" NaN | \n",
" 2016 | \n",
"
\n",
" \n",
" 8913 | \n",
" 60588 | \n",
" NaN | \n",
" 2016 | \n",
"
\n",
" \n",
" 9128 | \n",
" 60814 | \n",
" NaN | \n",
" 2016 | \n",
"
\n",
" \n",
" 9371 | \n",
" 61068 | \n",
" NaN | \n",
" 2016 | \n",
"
\n",
" \n",
" 9400 | \n",
" 61099 | \n",
" NaN | \n",
" 2016 | \n",
"
\n",
" \n",
" 9402 | \n",
" 61101 | \n",
" NaN | \n",
" 2016 | \n",
"
\n",
" \n",
" 9448 | \n",
" 61166 | \n",
" NaN | \n",
" 2016 | \n",
"
\n",
" \n",
" 9453 | \n",
" 61172 | \n",
" NaN | \n",
" 2016 | \n",
"
\n",
" \n",
" 9626 | \n",
" 61364 | \n",
" NaN | \n",
" 2016 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" plant id nerc year\n",
"65 66 ASCC 2012\n",
"1637 58277 NaN 2012\n",
"1928 70 ASCC 2012\n",
"1961 58405 NaN 2012\n",
"4145 58469 NaN 2012\n",
"6945 58117 NaN 2012\n",
"6946 58278 NaN 2012\n",
"7164 58380 NaN 2012\n",
"7194 58425 NaN 2012\n",
"7262 58511 NaN 2012\n",
"64 66 NaN 2013\n",
"828 58982 NaN 2013\n",
"1917 58639 NaN 2013\n",
"1918 58640 NaN 2013\n",
"1938 58977 NaN 2013\n",
"2205 58989 NaN 2013\n",
"2213 58549 NaN 2013\n",
"4127 58469 WECC 2013\n",
"6954 58117 WECC 2013\n",
"6955 58278 WECC 2013\n",
"7301 58511 WECC 2013\n",
"7414 58651 NaN 2013\n",
"7422 58656 NaN 2013\n",
"7424 58659 NaN 2013\n",
"7427 58662 NaN 2013\n",
"7432 58684 NaN 2013\n",
"7457 58705 NaN 2013\n",
"7603 58837 NaN 2013\n",
"7750 59035 NaN 2013\n",
"4129 58469 HICC 2014\n",
"6929 58117 ASCC 2014\n",
"6930 58278 ASCC 2014\n",
"7272 58511 ASCC 2014\n",
"8453 60024 NaN 2015\n",
"8550 60125 NaN 2015\n",
"8667 60243 NaN 2015\n",
"8668 60244 NaN 2015\n",
"8669 60245 NaN 2015\n",
"8674 60250 NaN 2015\n",
"8684 60260 NaN 2015\n",
"8749 60328 NaN 2015\n",
"40 70 NaN 2016\n",
"8888 60563 NaN 2016\n",
"8913 60588 NaN 2016\n",
"9128 60814 NaN 2016\n",
"9371 61068 NaN 2016\n",
"9400 61099 NaN 2016\n",
"9402 61101 NaN 2016\n",
"9448 61166 NaN 2016\n",
"9453 61172 NaN 2016\n",
"9626 61364 NaN 2016"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nercs.loc[nercs['plant id'].isin(nan_plants['plant id'])]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"nercs.dropna(inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load EIA-860m for some info on recent facilities\n",
"SPP and TRE have the lowest accuracy. I'm going to assume that anything in TX or OK and SWPP balancing authority is in SPP. On the flip side, if it's in TX and ERCOT I'll assign it to TRE.\n",
"\n",
"Only do this for plants that come online since the most recent 860 annual data."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"path = join(data_path, 'EIA downloads', 'december_generator2017.xlsx')\n",
"m860 = pd.read_excel(path, sheet_name='Operating',skip_footer=1,\n",
" usecols='C,F,P,AE', skiprows=0)\n",
"m860.columns = m860.columns.str.lower()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" plant id | \n",
" plant state | \n",
" operating year | \n",
" balancing authority code | \n",
"
\n",
" \n",
" \n",
" \n",
" 21373 | \n",
" 61632 | \n",
" IA | \n",
" 2017 | \n",
" MISO | \n",
"
\n",
" \n",
" 21374 | \n",
" 61633 | \n",
" MA | \n",
" 2017 | \n",
" ISNE | \n",
"
\n",
" \n",
" 21375 | \n",
" 61634 | \n",
" MA | \n",
" 2017 | \n",
" ISNE | \n",
"
\n",
" \n",
" 21376 | \n",
" 61635 | \n",
" MA | \n",
" 2017 | \n",
" ISNE | \n",
"
\n",
" \n",
" 21377 | \n",
" 61636 | \n",
" MA | \n",
" 2017 | \n",
" ISNE | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" plant id plant state operating year balancing authority code\n",
"21373 61632 IA 2017 MISO\n",
"21374 61633 MA 2017 ISNE\n",
"21375 61634 MA 2017 ISNE\n",
"21376 61635 MA 2017 ISNE\n",
"21377 61636 MA 2017 ISNE"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m860 = m860.loc[m860['operating year'] == 2017]\n",
"m860.tail()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"m860.loc[(m860['plant state'].isin(['TX', 'OK'])) &\n",
" (m860['balancing authority code'] == 'SWPP'), 'nerc'] = 'SPP'\n",
"\n",
"m860.loc[(m860['plant state'].isin(['TX'])) &\n",
" (m860['balancing authority code'] == 'ERCO'), 'nerc'] = 'TRE'"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" plant id | \n",
" plant state | \n",
" operating year | \n",
" balancing authority code | \n",
" nerc | \n",
"
\n",
" \n",
" \n",
" \n",
" 343 | \n",
" 165 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 344 | \n",
" 165 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 4836 | \n",
" 2953 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 4837 | \n",
" 2953 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 4838 | \n",
" 2953 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 4839 | \n",
" 2953 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 4840 | \n",
" 2953 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 4841 | \n",
" 2953 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 4842 | \n",
" 2953 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 15993 | \n",
" 56984 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 18994 | \n",
" 59066 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 19143 | \n",
" 59193 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 19157 | \n",
" 59206 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 19187 | \n",
" 59245 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 19665 | \n",
" 59712 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 19791 | \n",
" 59812 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 19792 | \n",
" 59812 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 19793 | \n",
" 59812 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20091 | \n",
" 60122 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20092 | \n",
" 60122 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20093 | \n",
" 60122 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20147 | \n",
" 60210 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20149 | \n",
" 60217 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20296 | \n",
" 60366 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20301 | \n",
" 60372 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20334 | \n",
" 60414 | \n",
" TX | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 20350 | \n",
" 60436 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20371 | \n",
" 60459 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20372 | \n",
" 60459 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20373 | \n",
" 60460 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20374 | \n",
" 60460 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20476 | \n",
" 60581 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20556 | \n",
" 60682 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20566 | \n",
" 60690 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20681 | \n",
" 60774 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20785 | \n",
" 60901 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20786 | \n",
" 60902 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20865 | \n",
" 60983 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 20870 | \n",
" 60989 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 21010 | \n",
" 61205 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 21023 | \n",
" 61221 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 21050 | \n",
" 61261 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 21077 | \n",
" 61309 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 21114 | \n",
" 61362 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 21171 | \n",
" 61409 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 21172 | \n",
" 61410 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 21173 | \n",
" 61411 | \n",
" TX | \n",
" 2017 | \n",
" ERCO | \n",
" TRE | \n",
"
\n",
" \n",
" 21362 | \n",
" 61614 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 21363 | \n",
" 61615 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 21364 | \n",
" 61616 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 21365 | \n",
" 61617 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
" 21366 | \n",
" 61618 | \n",
" OK | \n",
" 2017 | \n",
" SWPP | \n",
" SPP | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" plant id plant state operating year balancing authority code nerc\n",
"343 165 OK 2017 SWPP SPP\n",
"344 165 OK 2017 SWPP SPP\n",
"4836 2953 OK 2017 SWPP SPP\n",
"4837 2953 OK 2017 SWPP SPP\n",
"4838 2953 OK 2017 SWPP SPP\n",
"4839 2953 OK 2017 SWPP SPP\n",
"4840 2953 OK 2017 SWPP SPP\n",
"4841 2953 OK 2017 SWPP SPP\n",
"4842 2953 OK 2017 SWPP SPP\n",
"15993 56984 TX 2017 ERCO TRE\n",
"18994 59066 TX 2017 ERCO TRE\n",
"19143 59193 TX 2017 ERCO TRE\n",
"19157 59206 TX 2017 ERCO TRE\n",
"19187 59245 TX 2017 ERCO TRE\n",
"19665 59712 TX 2017 ERCO TRE\n",
"19791 59812 TX 2017 ERCO TRE\n",
"19792 59812 TX 2017 ERCO TRE\n",
"19793 59812 TX 2017 ERCO TRE\n",
"20091 60122 TX 2017 ERCO TRE\n",
"20092 60122 TX 2017 ERCO TRE\n",
"20093 60122 TX 2017 ERCO TRE\n",
"20147 60210 TX 2017 ERCO TRE\n",
"20149 60217 TX 2017 ERCO TRE\n",
"20296 60366 TX 2017 ERCO TRE\n",
"20301 60372 TX 2017 ERCO TRE\n",
"20334 60414 TX 2017 SWPP SPP\n",
"20350 60436 TX 2017 ERCO TRE\n",
"20371 60459 TX 2017 ERCO TRE\n",
"20372 60459 TX 2017 ERCO TRE\n",
"20373 60460 TX 2017 ERCO TRE\n",
"20374 60460 TX 2017 ERCO TRE\n",
"20476 60581 TX 2017 ERCO TRE\n",
"20556 60682 TX 2017 ERCO TRE\n",
"20566 60690 TX 2017 ERCO TRE\n",
"20681 60774 TX 2017 ERCO TRE\n",
"20785 60901 TX 2017 ERCO TRE\n",
"20786 60902 TX 2017 ERCO TRE\n",
"20865 60983 TX 2017 ERCO TRE\n",
"20870 60989 TX 2017 ERCO TRE\n",
"21010 61205 TX 2017 ERCO TRE\n",
"21023 61221 OK 2017 SWPP SPP\n",
"21050 61261 OK 2017 SWPP SPP\n",
"21077 61309 TX 2017 ERCO TRE\n",
"21114 61362 TX 2017 ERCO TRE\n",
"21171 61409 TX 2017 ERCO TRE\n",
"21172 61410 TX 2017 ERCO TRE\n",
"21173 61411 TX 2017 ERCO TRE\n",
"21362 61614 OK 2017 SWPP SPP\n",
"21363 61615 OK 2017 SWPP SPP\n",
"21364 61616 OK 2017 SWPP SPP\n",
"21365 61617 OK 2017 SWPP SPP\n",
"21366 61618 OK 2017 SWPP SPP"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m860.dropna(inplace=True)\n",
"m860"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Make lists of plant codes for SPP and TRE facilities"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"m860_spp_plants = (m860.loc[m860['nerc'] == 'SPP', 'plant id']\n",
" .drop_duplicates()\n",
" .tolist())\n",
"m860_tre_plants = (m860.loc[m860['nerc'] == 'TRE', 'plant id']\n",
" .drop_duplicates()\n",
" .tolist())"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[165, 2953, 60414, 61221, 61261, 61614, 61615, 61616, 61617, 61618]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m860_spp_plants"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[56984,\n",
" 59066,\n",
" 59193,\n",
" 59206,\n",
" 59245,\n",
" 59712,\n",
" 59812,\n",
" 60122,\n",
" 60210,\n",
" 60217,\n",
" 60366,\n",
" 60372,\n",
" 60436,\n",
" 60459,\n",
" 60460,\n",
" 60581,\n",
" 60682,\n",
" 60690,\n",
" 60774,\n",
" 60901,\n",
" 60902,\n",
" 60983,\n",
" 60989,\n",
" 61205,\n",
" 61309,\n",
" 61362,\n",
" 61409,\n",
" 61410,\n",
" 61411]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m860_tre_plants"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" plant id | \n",
" nerc | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [plant id, nerc, year]\n",
"Index: []"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"nan_plants.loc[nan_plants['plant id'].isin(m860_spp_plants)]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Clean and prep data for KNN"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:39:54.416966Z",
"start_time": "2017-11-06T16:39:54.383812Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"df = pd.merge(plants, nercs.drop('year', axis=1), on=['plant id'], how='left')"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:39:55.684713Z",
"start_time": "2017-11-06T16:39:55.678207Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Index(['plant id', 'year', 'lat', 'lon', 'state', 'nerc'], dtype='object')"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Drop plants that don't have lat/lon data (using just lon to check), and then drop duplicates. If any plants have kept the same plant id but moved over time (maybe a diesel generator?) or switched NERC they will show up twice."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:40:03.957239Z",
"start_time": "2017-11-06T16:40:03.928838Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" plant id | \n",
" year | \n",
" lat | \n",
" lon | \n",
" state | \n",
" nerc | \n",
"
\n",
" \n",
" \n",
" \n",
" 86900 | \n",
" 10851 | \n",
" 2006 | \n",
" NaN | \n",
" NaN | \n",
" NJ | \n",
" NaN | \n",
"
\n",
" \n",
" 87088 | \n",
" 50291 | \n",
" 2005 | \n",
" NaN | \n",
" NaN | \n",
" FL | \n",
" NaN | \n",
"
\n",
" \n",
" 87607 | \n",
" 56672 | \n",
" 2010 | \n",
" NaN | \n",
" NaN | \n",
" MN | \n",
" NaN | \n",
"
\n",
" \n",
" 87808 | \n",
" 55982 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" AK | \n",
" NaN | \n",
"
\n",
" \n",
" 87857 | \n",
" 55150 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" LA | \n",
" NaN | \n",
"
\n",
" \n",
" 87890 | \n",
" 55082 | \n",
" 2006 | \n",
" NaN | \n",
" NaN | \n",
" MS | \n",
" NaN | \n",
"
\n",
" \n",
" 87955 | \n",
" 55521 | \n",
" 2005 | \n",
" NaN | \n",
" NaN | \n",
" CA | \n",
" NaN | \n",
"
\n",
" \n",
" 88021 | \n",
" 55314 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 88318 | \n",
" 54243 | \n",
" 2005 | \n",
" NaN | \n",
" NaN | \n",
" GA | \n",
" NaN | \n",
"
\n",
" \n",
" 88441 | \n",
" 50726 | \n",
" 2007 | \n",
" NaN | \n",
" NaN | \n",
" LA | \n",
" NaN | \n",
"
\n",
" \n",
" 88690 | \n",
" 50249 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" TX | \n",
" NaN | \n",
"
\n",
" \n",
" 88917 | \n",
" 2666 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" NY | \n",
" NaN | \n",
"
\n",
" \n",
" 91857 | \n",
" 55952 | \n",
" 2014 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" WECC | \n",
"
\n",
" \n",
" 92742 | \n",
" 7704 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 92808 | \n",
" 6339 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 92930 | \n",
" 55840 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" AZ | \n",
" NaN | \n",
"
\n",
" \n",
" 92969 | \n",
" 55883 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" KY | \n",
" NaN | \n",
"
\n",
" \n",
" 92980 | \n",
" 55955 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" AZ | \n",
" NaN | \n",
"
\n",
" \n",
" 92981 | \n",
" 55958 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" LA | \n",
" NaN | \n",
"
\n",
" \n",
" 93017 | \n",
" 56043 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" CA | \n",
" NaN | \n",
"
\n",
" \n",
" 93048 | \n",
" 56175 | \n",
" 2007 | \n",
" NaN | \n",
" NaN | \n",
" CA | \n",
" NaN | \n",
"
\n",
" \n",
" 93061 | \n",
" 55975 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 93066 | \n",
" 56042 | \n",
" 2005 | \n",
" NaN | \n",
" NaN | \n",
" NY | \n",
" NaN | \n",
"
\n",
" \n",
" 93101 | \n",
" 56168 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" MN | \n",
" NaN | \n",
"
\n",
" \n",
" 93129 | \n",
" 55073 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 93152 | \n",
" 54516 | \n",
" 2006 | \n",
" 41.8875 | \n",
" NaN | \n",
" IL | \n",
" NaN | \n",
"
\n",
" \n",
" 93165 | \n",
" 55303 | \n",
" 2008 | \n",
" NaN | \n",
" NaN | \n",
" MS | \n",
" NaN | \n",
"
\n",
" \n",
" 93251 | \n",
" 55301 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" AZ | \n",
" NaN | \n",
"
\n",
" \n",
" 93431 | \n",
" 52164 | \n",
" 2006 | \n",
" NaN | \n",
" NaN | \n",
" CA | \n",
" NaN | \n",
"
\n",
" \n",
" 93552 | \n",
" 54222 | \n",
" 2007 | \n",
" NaN | \n",
" NaN | \n",
" AK | \n",
" NaN | \n",
"
\n",
" \n",
" 93734 | \n",
" 50728 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" LA | \n",
" SERC | \n",
"
\n",
" \n",
" 93786 | \n",
" 50040 | \n",
" 2005 | \n",
" NaN | \n",
" NaN | \n",
" IL | \n",
" NaN | \n",
"
\n",
" \n",
" 93858 | \n",
" 50030 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" LA | \n",
" NaN | \n",
"
\n",
" \n",
" 93906 | \n",
" 50168 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" LA | \n",
" NaN | \n",
"
\n",
" \n",
" 94017 | \n",
" 50313 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" NJ | \n",
" NaN | \n",
"
\n",
" \n",
" 94423 | \n",
" 10683 | \n",
" 2007 | \n",
" NaN | \n",
" NaN | \n",
" CO | \n",
" NaN | \n",
"
\n",
" \n",
" 94647 | \n",
" 10257 | \n",
" 2005 | \n",
" NaN | \n",
" NaN | \n",
" CA | \n",
" NaN | \n",
"
\n",
" \n",
" 94815 | \n",
" 56508 | \n",
" 2007 | \n",
" NaN | \n",
" NaN | \n",
" CA | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" plant id year lat lon state nerc\n",
"86900 10851 2006 NaN NaN NJ NaN\n",
"87088 50291 2005 NaN NaN FL NaN\n",
"87607 56672 2010 NaN NaN MN NaN\n",
"87808 55982 2004 NaN NaN AK NaN\n",
"87857 55150 2004 NaN NaN LA NaN\n",
"87890 55082 2006 NaN NaN MS NaN\n",
"87955 55521 2005 NaN NaN CA NaN\n",
"88021 55314 2003 NaN NaN NaN NaN\n",
"88318 54243 2005 NaN NaN GA NaN\n",
"88441 50726 2007 NaN NaN LA NaN\n",
"88690 50249 2004 NaN NaN TX NaN\n",
"88917 2666 2004 NaN NaN NY NaN\n",
"91857 55952 2014 NaN NaN NaN WECC\n",
"92742 7704 2003 NaN NaN NaN NaN\n",
"92808 6339 2004 NaN NaN NaN NaN\n",
"92930 55840 2003 NaN NaN AZ NaN\n",
"92969 55883 2004 NaN NaN KY NaN\n",
"92980 55955 2003 NaN NaN AZ NaN\n",
"92981 55958 2004 NaN NaN LA NaN\n",
"93017 56043 2004 NaN NaN CA NaN\n",
"93048 56175 2007 NaN NaN CA NaN\n",
"93061 55975 2004 NaN NaN NaN NaN\n",
"93066 56042 2005 NaN NaN NY NaN\n",
"93101 56168 2003 NaN NaN MN NaN\n",
"93129 55073 2004 NaN NaN NaN NaN\n",
"93152 54516 2006 41.8875 NaN IL NaN\n",
"93165 55303 2008 NaN NaN MS NaN\n",
"93251 55301 2003 NaN NaN AZ NaN\n",
"93431 52164 2006 NaN NaN CA NaN\n",
"93552 54222 2007 NaN NaN AK NaN\n",
"93734 50728 2003 NaN NaN LA SERC\n",
"93786 50040 2005 NaN NaN IL NaN\n",
"93858 50030 2003 NaN NaN LA NaN\n",
"93906 50168 2003 NaN NaN LA NaN\n",
"94017 50313 2004 NaN NaN NJ NaN\n",
"94423 10683 2007 NaN NaN CO NaN\n",
"94647 10257 2005 NaN NaN CA NaN\n",
"94815 56508 2007 NaN NaN CA NaN"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[df.lon.isnull()].drop_duplicates(subset='plant id')"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:40:06.598416Z",
"start_time": "2017-11-06T16:40:06.576972Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" plant id | \n",
" year | \n",
" lat | \n",
" lon | \n",
" state | \n",
" nerc | \n",
"
\n",
" \n",
" \n",
" \n",
" 86900 | \n",
" 10851 | \n",
" 2006 | \n",
" NaN | \n",
" NaN | \n",
" NJ | \n",
" NaN | \n",
"
\n",
" \n",
" 87088 | \n",
" 50291 | \n",
" 2005 | \n",
" NaN | \n",
" NaN | \n",
" FL | \n",
" NaN | \n",
"
\n",
" \n",
" 87607 | \n",
" 56672 | \n",
" 2010 | \n",
" NaN | \n",
" NaN | \n",
" MN | \n",
" NaN | \n",
"
\n",
" \n",
" 87808 | \n",
" 55982 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" AK | \n",
" NaN | \n",
"
\n",
" \n",
" 87857 | \n",
" 55150 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" LA | \n",
" NaN | \n",
"
\n",
" \n",
" 87890 | \n",
" 55082 | \n",
" 2006 | \n",
" NaN | \n",
" NaN | \n",
" MS | \n",
" NaN | \n",
"
\n",
" \n",
" 87955 | \n",
" 55521 | \n",
" 2005 | \n",
" NaN | \n",
" NaN | \n",
" CA | \n",
" NaN | \n",
"
\n",
" \n",
" 88021 | \n",
" 55314 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 88318 | \n",
" 54243 | \n",
" 2005 | \n",
" NaN | \n",
" NaN | \n",
" GA | \n",
" NaN | \n",
"
\n",
" \n",
" 88441 | \n",
" 50726 | \n",
" 2007 | \n",
" NaN | \n",
" NaN | \n",
" LA | \n",
" NaN | \n",
"
\n",
" \n",
" 88690 | \n",
" 50249 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" TX | \n",
" NaN | \n",
"
\n",
" \n",
" 88917 | \n",
" 2666 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" NY | \n",
" NaN | \n",
"
\n",
" \n",
" 91857 | \n",
" 55952 | \n",
" 2014 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" WECC | \n",
"
\n",
" \n",
" 92742 | \n",
" 7704 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 92808 | \n",
" 6339 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 92930 | \n",
" 55840 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" AZ | \n",
" NaN | \n",
"
\n",
" \n",
" 92969 | \n",
" 55883 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" KY | \n",
" NaN | \n",
"
\n",
" \n",
" 92980 | \n",
" 55955 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" AZ | \n",
" NaN | \n",
"
\n",
" \n",
" 92981 | \n",
" 55958 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" LA | \n",
" NaN | \n",
"
\n",
" \n",
" 93017 | \n",
" 56043 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" CA | \n",
" NaN | \n",
"
\n",
" \n",
" 93048 | \n",
" 56175 | \n",
" 2007 | \n",
" NaN | \n",
" NaN | \n",
" CA | \n",
" NaN | \n",
"
\n",
" \n",
" 93061 | \n",
" 55975 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 93066 | \n",
" 56042 | \n",
" 2005 | \n",
" NaN | \n",
" NaN | \n",
" NY | \n",
" NaN | \n",
"
\n",
" \n",
" 93101 | \n",
" 56168 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" MN | \n",
" NaN | \n",
"
\n",
" \n",
" 93129 | \n",
" 55073 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" 93165 | \n",
" 55303 | \n",
" 2008 | \n",
" NaN | \n",
" NaN | \n",
" MS | \n",
" NaN | \n",
"
\n",
" \n",
" 93251 | \n",
" 55301 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" AZ | \n",
" NaN | \n",
"
\n",
" \n",
" 93431 | \n",
" 52164 | \n",
" 2006 | \n",
" NaN | \n",
" NaN | \n",
" CA | \n",
" NaN | \n",
"
\n",
" \n",
" 93552 | \n",
" 54222 | \n",
" 2007 | \n",
" NaN | \n",
" NaN | \n",
" AK | \n",
" NaN | \n",
"
\n",
" \n",
" 93734 | \n",
" 50728 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" LA | \n",
" SERC | \n",
"
\n",
" \n",
" 93786 | \n",
" 50040 | \n",
" 2005 | \n",
" NaN | \n",
" NaN | \n",
" IL | \n",
" NaN | \n",
"
\n",
" \n",
" 93858 | \n",
" 50030 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" LA | \n",
" NaN | \n",
"
\n",
" \n",
" 93906 | \n",
" 50168 | \n",
" 2003 | \n",
" NaN | \n",
" NaN | \n",
" LA | \n",
" NaN | \n",
"
\n",
" \n",
" 94017 | \n",
" 50313 | \n",
" 2004 | \n",
" NaN | \n",
" NaN | \n",
" NJ | \n",
" NaN | \n",
"
\n",
" \n",
" 94423 | \n",
" 10683 | \n",
" 2007 | \n",
" NaN | \n",
" NaN | \n",
" CO | \n",
" NaN | \n",
"
\n",
" \n",
" 94647 | \n",
" 10257 | \n",
" 2005 | \n",
" NaN | \n",
" NaN | \n",
" CA | \n",
" NaN | \n",
"
\n",
" \n",
" 94815 | \n",
" 56508 | \n",
" 2007 | \n",
" NaN | \n",
" NaN | \n",
" CA | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" plant id year lat lon state nerc\n",
"86900 10851 2006 NaN NaN NJ NaN\n",
"87088 50291 2005 NaN NaN FL NaN\n",
"87607 56672 2010 NaN NaN MN NaN\n",
"87808 55982 2004 NaN NaN AK NaN\n",
"87857 55150 2004 NaN NaN LA NaN\n",
"87890 55082 2006 NaN NaN MS NaN\n",
"87955 55521 2005 NaN NaN CA NaN\n",
"88021 55314 2003 NaN NaN NaN NaN\n",
"88318 54243 2005 NaN NaN GA NaN\n",
"88441 50726 2007 NaN NaN LA NaN\n",
"88690 50249 2004 NaN NaN TX NaN\n",
"88917 2666 2004 NaN NaN NY NaN\n",
"91857 55952 2014 NaN NaN NaN WECC\n",
"92742 7704 2003 NaN NaN NaN NaN\n",
"92808 6339 2004 NaN NaN NaN NaN\n",
"92930 55840 2003 NaN NaN AZ NaN\n",
"92969 55883 2004 NaN NaN KY NaN\n",
"92980 55955 2003 NaN NaN AZ NaN\n",
"92981 55958 2004 NaN NaN LA NaN\n",
"93017 56043 2004 NaN NaN CA NaN\n",
"93048 56175 2007 NaN NaN CA NaN\n",
"93061 55975 2004 NaN NaN NaN NaN\n",
"93066 56042 2005 NaN NaN NY NaN\n",
"93101 56168 2003 NaN NaN MN NaN\n",
"93129 55073 2004 NaN NaN NaN NaN\n",
"93165 55303 2008 NaN NaN MS NaN\n",
"93251 55301 2003 NaN NaN AZ NaN\n",
"93431 52164 2006 NaN NaN CA NaN\n",
"93552 54222 2007 NaN NaN AK NaN\n",
"93734 50728 2003 NaN NaN LA SERC\n",
"93786 50040 2005 NaN NaN IL NaN\n",
"93858 50030 2003 NaN NaN LA NaN\n",
"93906 50168 2003 NaN NaN LA NaN\n",
"94017 50313 2004 NaN NaN NJ NaN\n",
"94423 10683 2007 NaN NaN CO NaN\n",
"94647 10257 2005 NaN NaN CA NaN\n",
"94815 56508 2007 NaN NaN CA NaN"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[df.lat.isnull()].drop_duplicates(subset='plant id')"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:40:08.841245Z",
"start_time": "2017-11-06T16:40:08.805376Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"cols = ['plant id', 'lat', 'lon', 'nerc', 'state']\n",
"df_slim = (df.loc[:, cols].dropna(subset=['lon'])\n",
" .drop_duplicates(subset=['plant id', 'nerc']))"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:40:09.317698Z",
"start_time": "2017-11-06T16:40:09.311670Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"8982"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df_slim)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:40:29.960293Z",
"start_time": "2017-11-06T16:40:29.944141Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" plant id | \n",
" lat | \n",
" lon | \n",
" nerc | \n",
" state | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1001 | \n",
" 39.924200 | \n",
" -87.424400 | \n",
" RFC | \n",
" IN | \n",
"
\n",
" \n",
" 17 | \n",
" 10003 | \n",
" 39.760600 | \n",
" -105.215000 | \n",
" WECC | \n",
" CO | \n",
"
\n",
" \n",
" 34 | \n",
" 10005 | \n",
" 37.047800 | \n",
" -121.170800 | \n",
" WECC | \n",
" CA | \n",
"
\n",
" \n",
" 49 | \n",
" 10008 | \n",
" 30.314467 | \n",
" -81.662705 | \n",
" FRCC | \n",
" FL | \n",
"
\n",
" \n",
" 62 | \n",
" 10091 | \n",
" 33.768300 | \n",
" -118.283600 | \n",
" WECC | \n",
" CA | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" plant id lat lon nerc state\n",
"0 1001 39.924200 -87.424400 RFC IN\n",
"17 10003 39.760600 -105.215000 WECC CO\n",
"34 10005 37.047800 -121.170800 WECC CA\n",
"49 10008 30.314467 -81.662705 FRCC FL\n",
"62 10091 33.768300 -118.283600 WECC CA"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_slim.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Separate out the list of plants where we don't have NERC labels from EIA-860."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:40:35.247883Z",
"start_time": "2017-11-06T16:40:35.239582Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"unknown = df_slim.loc[df_slim.nerc.isnull()]"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:40:37.583325Z",
"start_time": "2017-11-06T16:40:37.568223Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"279 plants don't have NERC labels\n",
"\n",
" plant id lat lon nerc state\n",
"38520 52080 37.9725 122.058333 NaN CA\n",
"48717 55647 36.1844 -86.854200 NaN TN\n",
"56706 58277 20.8867 -156.337800 NaN HI\n",
"57035 58380 61.2860 -149.610000 NaN AK\n",
"57071 58425 61.1300 -150.243611 NaN AK\n"
]
}
],
"source": [
"print(\"{} plants don't have NERC labels\\n\".format(len(unknown)))\n",
"print(unknown.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create X and y matricies\n",
"X is lat/lon\n",
"\n",
"y is the NERC label\n",
"\n",
"For both, I'm only using plants where we have all data (no `NaN`s). Not doing any transformation of the lat/lon at this time. There is certainly some error here, as KNN will use the Euclidian distance to calculate nearest neighbors. Not sure how I plan on dealing with this, or if it is even necessary."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T20:06:21.393849Z",
"start_time": "2017-11-06T20:06:21.367798Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"X = df_slim.loc[df_slim.notnull().all(axis=1), ['lat', 'lon']]\n",
"y = df_slim.loc[df_slim.notnull().all(axis=1), 'nerc']\n",
"\n",
"# le = LabelEncoder()\n",
"# le.fit(y)\n",
"\n",
"# y = le.transform(y)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T20:06:27.239811Z",
"start_time": "2017-11-06T20:06:27.233495Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"8703"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(X)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T20:06:27.249590Z",
"start_time": "2017-11-06T20:06:27.242414Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, y, test_size=0.33, random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## GridSearch to find the best parameters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regular KNN classifier\n",
"Run gridsearch testing parameter values for weights, n_neighbors, and p (use Euclidean or Manhattan distance).\n",
"\n",
"With 15 neighbors, weights by distance, and Euclidean distance, the model is able to accurately predict the test sample NERC region with 96% accuracy. This varies by region, with the lowest accuracy scores for TRE and SPP (89% and 87%), and the highest accuracy scores for WECC and NPCC (each 99%). F1 scores tend to be similar to the accuracy, although TRE has slightly higher F1 (0.94 vs 0.89)."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T20:06:28.581180Z",
"start_time": "2017-11-06T20:06:27.257246Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fitting 3 folds for each of 12 candidates, totalling 36 fits\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Parallel(n_jobs=-1)]: Done 20 out of 36 | elapsed: 12.5s remaining: 10.0s\n",
"[Parallel(n_jobs=-1)]: Done 36 out of 36 | elapsed: 21.1s finished\n"
]
},
{
"data": {
"text/plain": [
"GridSearchCV(cv=None, error_score='raise',\n",
" estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
" metric_params=None, n_jobs=1, n_neighbors=5, p=2,\n",
" weights='uniform'),\n",
" fit_params=None, iid=False, n_jobs=-1,\n",
" param_grid={'weights': ['uniform', 'distance'], 'n_neighbors': [10, 15, 20], 'p': [1, 2]},\n",
" pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n",
" scoring=None, verbose=1)"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"knn = neighbors.KNeighborsClassifier()\n",
"\n",
"params = {'weights': ['uniform', 'distance'],\n",
" 'n_neighbors': [10, 15, 20],\n",
" 'p': [1, 2]\n",
" }\n",
"\n",
"clf_knn = GridSearchCV(knn, params, n_jobs=-1, iid=False, verbose=1)\n",
"\n",
"clf_knn.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T20:06:28.595105Z",
"start_time": "2017-11-06T20:06:28.586540Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
" metric_params=None, n_jobs=1, n_neighbors=10, p=1,\n",
" weights='distance'), 0.9590098375679993)"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clf_knn.best_estimator_, clf_knn.best_score_"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T20:06:28.646251Z",
"start_time": "2017-11-06T20:06:28.597679Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.9589136490250696"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"clf_knn.score(X_test, y_test)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T20:10:03.521287Z",
"start_time": "2017-11-06T20:10:03.510982Z"
}
},
"outputs": [],
"source": [
"nerc_labels = nercs.nerc.dropna().unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Accuracy score by region"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T20:15:48.830308Z",
"start_time": "2017-11-06T20:15:48.737017Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SERC : 0.9553752535496958\n",
"RFC : 0.9523809523809523\n",
"SPP : 0.8098591549295775\n",
"NPCC : 0.9768115942028985\n",
"WECC : 0.9904191616766467\n",
"MRO : 0.9469964664310954\n",
"TRE : 0.927007299270073\n",
"HICC : 1.0\n",
"ASCC : 0.975609756097561\n",
"FRCC : 0.9629629629629629\n"
]
}
],
"source": [
"for region in nerc_labels:\n",
" mask = y_test == region\n",
" \n",
" X_masked = X_test[mask]\n",
" y_hat_masked = clf_knn.predict(X_masked)\n",
" y_test_masked = y_test[mask]\n",
" \n",
" accuracy = metrics.accuracy_score(y_test_masked, y_hat_masked)\n",
" print('{} : {}'.format(region, accuracy))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"F1 score by region"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T20:18:57.242329Z",
"start_time": "2017-11-06T20:18:57.079002Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SERC : 0.9534412955465588\n",
"RFC : 0.949667616334283\n",
"SPP : 0.8273381294964028\n",
"NPCC : 0.9810771470160117\n",
"WECC : 0.9904191616766467\n",
"MRO : 0.9337979094076655\n",
"TRE : 0.9407407407407407\n",
"HICC : 0.9444444444444444\n",
"ASCC : 0.975609756097561\n",
"FRCC : 0.9811320754716981\n"
]
}
],
"source": [
"y_hat = clf_knn.predict(X_test)\n",
"\n",
"for region in nerc_labels:\n",
" f1 = metrics.f1_score(y_test, y_hat, labels=[region], average='macro')\n",
" print('{} : {}'.format(region, f1))"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T20:21:11.256728Z",
"start_time": "2017-11-06T20:21:11.237272Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.9589136490250696"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"metrics.f1_score(y_test, y_hat, average='micro')"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T20:21:24.325435Z",
"start_time": "2017-11-06T20:21:24.308078Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.9477668276232013"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"metrics.f1_score(y_test, y_hat, average='macro')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use best KNN parameters to predict NERC for unknown plants"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:53:30.016363Z",
"start_time": "2017-11-06T16:53:29.583624Z"
}
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\gschivley\\Anaconda2\\envs\\psci\\lib\\site-packages\\pandas\\core\\indexing.py:537: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame.\n",
"Try using .loc[row_indexer,col_indexer] = value instead\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
" self.obj[item] = s\n"
]
}
],
"source": [
"unknown.loc[:, 'nerc'] = clf_knn.predict(unknown.loc[:, ['lat', 'lon']])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ensuring that no plants in Alaska or Hawaii are assigned to continental NERCs, or the other way around."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:53:34.746917Z",
"start_time": "2017-11-06T16:53:34.734080Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['HICC' 'ASCC' 'WECC']\n",
"['HI' 'AK']\n"
]
}
],
"source": [
"print(unknown.loc[unknown.state.isin(['AK', 'HI']), 'nerc'].unique())\n",
"print(unknown.loc[unknown.nerc.isin(['HICC', 'ASCC']), 'state'].unique())"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:53:38.090243Z",
"start_time": "2017-11-06T16:53:38.084127Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'ASCC': 11,\n",
" 'FRCC': 2,\n",
" 'HICC': 11,\n",
" 'MRO': 8,\n",
" 'NPCC': 57,\n",
" 'RFC': 33,\n",
" 'SERC': 41,\n",
" 'SPP': 22,\n",
" 'TRE': 29,\n",
" 'WECC': 65})"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Counter(unknown['nerc'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Export plants with lat/lon, state, and nerc"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:54:00.912370Z",
"start_time": "2017-11-06T16:54:00.897855Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" plant id | \n",
" lat | \n",
" lon | \n",
" nerc | \n",
" state | \n",
"
\n",
" \n",
" \n",
" \n",
" 38520 | \n",
" 52080 | \n",
" 37.9725 | \n",
" 122.058333 | \n",
" WECC | \n",
" CA | \n",
"
\n",
" \n",
" 48717 | \n",
" 55647 | \n",
" 36.1844 | \n",
" -86.854200 | \n",
" SERC | \n",
" TN | \n",
"
\n",
" \n",
" 56706 | \n",
" 58277 | \n",
" 20.8867 | \n",
" -156.337800 | \n",
" HICC | \n",
" HI | \n",
"
\n",
" \n",
" 57035 | \n",
" 58380 | \n",
" 61.2860 | \n",
" -149.610000 | \n",
" ASCC | \n",
" AK | \n",
"
\n",
" \n",
" 57071 | \n",
" 58425 | \n",
" 61.1300 | \n",
" -150.243611 | \n",
" ASCC | \n",
" AK | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" plant id lat lon nerc state\n",
"38520 52080 37.9725 122.058333 WECC CA\n",
"48717 55647 36.1844 -86.854200 SERC TN\n",
"56706 58277 20.8867 -156.337800 HICC HI\n",
"57035 58380 61.2860 -149.610000 ASCC AK\n",
"57071 58425 61.1300 -150.243611 ASCC AK"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"unknown.head()"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:54:02.144417Z",
"start_time": "2017-11-06T16:54:02.126662Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" plant id | \n",
" lat | \n",
" lon | \n",
" nerc | \n",
" state | \n",
"
\n",
" \n",
" \n",
" \n",
" 94762 | \n",
" 499 | \n",
" 37.643611 | \n",
" 120.757500 | \n",
" WECC | \n",
" CA | \n",
"
\n",
" \n",
" 94774 | \n",
" 7478 | \n",
" 32.738889 | \n",
" 114.700278 | \n",
" WECC | \n",
" AZ | \n",
"
\n",
" \n",
" 94792 | \n",
" 56197 | \n",
" 35.301389 | \n",
" 77.631111 | \n",
" SERC | \n",
" NC | \n",
"
\n",
" \n",
" 94796 | \n",
" 52205 | \n",
" 36.796389 | \n",
" 121.448889 | \n",
" WECC | \n",
" CA | \n",
"
\n",
" \n",
" 94894 | \n",
" 596 | \n",
" 39.733889 | \n",
" 75.564444 | \n",
" NPCC | \n",
" DE | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" plant id lat lon nerc state\n",
"94762 499 37.643611 120.757500 WECC CA\n",
"94774 7478 32.738889 114.700278 WECC AZ\n",
"94792 56197 35.301389 77.631111 SERC NC\n",
"94796 52205 36.796389 121.448889 WECC CA\n",
"94894 596 39.733889 75.564444 NPCC DE"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"unknown.tail()"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:54:06.371732Z",
"start_time": "2017-11-06T16:54:06.354901Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" plant id | \n",
" lat | \n",
" lon | \n",
" nerc | \n",
" state | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1001 | \n",
" 39.924200 | \n",
" -87.424400 | \n",
" RFC | \n",
" IN | \n",
"
\n",
" \n",
" 17 | \n",
" 10003 | \n",
" 39.760600 | \n",
" -105.215000 | \n",
" WECC | \n",
" CO | \n",
"
\n",
" \n",
" 34 | \n",
" 10005 | \n",
" 37.047800 | \n",
" -121.170800 | \n",
" WECC | \n",
" CA | \n",
"
\n",
" \n",
" 49 | \n",
" 10008 | \n",
" 30.314467 | \n",
" -81.662705 | \n",
" FRCC | \n",
" FL | \n",
"
\n",
" \n",
" 62 | \n",
" 10091 | \n",
" 33.768300 | \n",
" -118.283600 | \n",
" WECC | \n",
" CA | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" plant id lat lon nerc state\n",
"0 1001 39.924200 -87.424400 RFC IN\n",
"17 10003 39.760600 -105.215000 WECC CO\n",
"34 10005 37.047800 -121.170800 WECC CA\n",
"49 10008 30.314467 -81.662705 FRCC FL\n",
"62 10091 33.768300 -118.283600 WECC CA"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_slim.head()"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:54:07.622073Z",
"start_time": "2017-11-06T16:54:07.597436Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"labeled = pd.concat([df_slim.loc[df_slim.notnull().all(axis=1)], unknown])"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:54:09.653819Z",
"start_time": "2017-11-06T16:54:09.640776Z"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" plant id | \n",
" lat | \n",
" lon | \n",
" nerc | \n",
" state | \n",
"
\n",
" \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [plant id, lat, lon, nerc, state]\n",
"Index: []"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"labeled.loc[labeled.nerc.isnull()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are 11 facilities that don't show up in my labeled data - they didn't have lat/lon info."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:55:19.867023Z",
"start_time": "2017-11-06T16:55:19.768985Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([10851, 50291, 56672, 55982, 55150, 55082, 55521, 55314, 54243,\n",
" 50726, 50249, 2666, 55952, 7704, 6339, 55840, 55883, 55955,\n",
" 55958, 56043, 56175, 55975, 56042, 56168, 55073, 54516, 55303,\n",
" 55301, 52164, 54222, 50728, 50040, 50030, 50168, 50313, 10683,\n",
" 10257, 56508], dtype=int64)"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"facility_df.loc[~facility_df['plant id'].isin(labeled['plant id']),\n",
" 'plant id'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"ExecuteTime": {
"end_time": "2017-11-06T16:59:09.744166Z",
"start_time": "2017-11-06T16:59:09.667898Z"
},
"collapsed": true
},
"outputs": [],
"source": [
"path = join(data_path, 'Facility labels', 'Facility locations_knn.csv')\n",
"labeled.to_csv(path, index=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "py36",
"language": "python",
"name": "py36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}