{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Assign NERC labels to plants using 860 data and RandomForest\n",
    "\n",
    "## Instructions\n",
    "Make sure the `file_date` parameter below is set to whatever value you would like appended to file names.\n",
    "\n",
    "Change the `most_recent_860_year` parameter below to match the most up-to-date EIA-860 annual data file. As of March 2018 this is 2016.\n",
    "\n",
    "EIA-860 (annual) excel files will need to be [downloaded](https://www.eia.gov/electricity/data/eia860/) and unzipped to the `EIA downloads` folder. Make sure that all years from 2012 through the most recent data year are available. Also download the most recent [EIA-860m](https://www.eia.gov/electricity/data/eia860m/) to `EIA downloads`.\n",
    "\n",
    "The most recent annual 860 file available (as of March 2018) represents 2016 data. When newer EIA-860 annual files are added the dictionary with pandas `read_excel` parameters will need to be updated. Note that EIA puts out an Early Release version of 860 with extra header rows and columns, so be sure to appropriately adjust the `skiprows` and `usecols` parameters if using an Early Release file.\n",
    "\n",
    "The entire notebook can be run at once using *Run All Cells*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T20:00:19.547289Z",
     "start_time": "2017-11-06T20:00:19.534260Z"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "import os\n",
    "from os.path import join\n",
    "import pandas as pd\n",
    "from sklearn import neighbors, metrics\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from sklearn.model_selection import train_test_split, GridSearchCV\n",
    "from collections import Counter\n",
    "from copy import deepcopy\n",
    "\n",
    "\n",
    "cwd = os.getcwd()\n",
    "data_path = join(cwd, '..', 'Data storage')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Date string for filenames\n",
    "This will be inserted into all filenames (reading and writing)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "file_date = '2018-03-06'\n",
    "most_recent_860_year = 2016"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load data\n",
    "This loads facility data that has been assembled from the EIA bulk data file, and EIA-860 excel files. The EIA-860 excel files need to be downloaded manually."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load EIA facility data\n",
    "Only need to keep the plant id, year (as a check that plants don't move between years), and lat/lon"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 204,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T14:34:57.939401Z",
     "start_time": "2017-11-06T14:34:48.570853Z"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "path = os.path.join(data_path, 'Derived data',\n",
    "                    'Facility gen fuels and CO2 {}.csv'.format(file_date))\n",
    "facility_df = pd.read_csv(path)\n",
    "facility_df['state'] = facility_df['geography'].str[-2:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 205,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T14:42:35.675435Z",
     "start_time": "2017-11-06T14:42:34.964483Z"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "plants = facility_df.loc[:, ['plant id', 'year', 'lat', 'lon', 'state']]\n",
    "plants.drop_duplicates(inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the 2017 facility dataframe only includes annually reporting facilities I'm going to duplicate the plant id, lat/lon, and state information from 2016."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 206,
   "metadata": {},
   "outputs": [],
   "source": [
    "# make a copy of 2016 (or most recent annual data year) and change the year to \n",
    "plants_2017 = plants.loc[plants['year'] == most_recent_860_year, :].copy()\n",
    "plants_2017.loc[:, 'year'] += 1\n",
    "\n",
    "plants = pd.concat([plants.loc[plants['year']<=most_recent_860_year, :], plants_2017])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 207,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "set()"
      ]
     },
     "execution_count": 207,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(set(plants.loc[plants.year==2016, 'plant id']) - set(plants.loc[plants.year==2017, 'plant id']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load known NERC labels from EIA-860\n",
    "Current NERCS go back to 2012. Use all annual 860 files from 2012 through the most recent available. Extend the dictionary of dictionaries below with any files available after 2016. `io`, `skiprows`, and `usecols` are all input parameters for the Pandas `read_excel` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T14:27:35.633747Z",
     "start_time": "2017-11-06T14:27:35.621006Z"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "eia_base_path = join(data_path, 'EIA downloads')\n",
    "file_860_info = {\n",
    "#     2011: {'io': join(eia_base_path, 'eia8602011', 'Plant.xlsx'),\n",
    "#            'skiprows': 0,\n",
    "#            'parse_cols': 'B,J'},\n",
    "    2012: {'io': join(eia_base_path, 'eia8602012', 'PlantY2012.xlsx'),\n",
    "           'skiprows': 0,\n",
    "           'usecols': 'B,J'},\n",
    "    2013: {'io': join(eia_base_path, 'eia8602013', '2___Plant_Y2013.xlsx'),\n",
    "           'skiprows': 0,\n",
    "           'usecols': 'C,L'},\n",
    "    2014: {'io': join(eia_base_path, 'eia8602014', '2___Plant_Y2014.xlsx'),\n",
    "           'skiprows': 0,\n",
    "           'usecols': 'C,L'},\n",
    "    2015: {'io': join(eia_base_path, 'eia8602015', '2___Plant_Y2015.xlsx'),\n",
    "           'skiprows': 0,\n",
    "           'usecols': 'C,L'},\n",
    "    2016: {'io': join(eia_base_path, 'eia8602016', '2___Plant_Y2016.xlsx'),\n",
    "           'skiprows': 0,\n",
    "           'usecols': 'C,L'}\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T14:30:20.380645Z",
     "start_time": "2017-11-06T14:29:57.233977Z"
    }
   },
   "outputs": [],
   "source": [
    "eia_nercs = {}\n",
    "for key, args in file_860_info.items():\n",
    "    eia_nercs[key] = pd.read_excel(**args)\n",
    "    eia_nercs[key].columns = ['plant id', 'nerc']\n",
    "    eia_nercs[key]['year'] = key"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I want to assign NERC regions for every year. We have data for 2012 onward from the EIA-860 files. For the purpose of this analysis I'll assume that all years from 2001-2011 are the same NERC as 2012.\n",
    "\n",
    "Also assume that values in 2017 are the same as in 2016. I'll fill in nerc values for plants that were built in 2017 below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "for year in range(2001, 2012):\n",
    "    # the pandas .copy() method is deep by default but I'm not sure in this case\n",
    "    df = deepcopy(eia_nercs[2012])\n",
    "    df['year'] = year\n",
    "    \n",
    "    eia_nercs[year] = df\n",
    "    \n",
    "df = deepcopy(eia_nercs[2016])\n",
    "df['year'] = 2017\n",
    "eia_nercs[2017] = df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict_keys([2012, 2013, 2014, 2015, 2016, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2017])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eia_nercs.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>nerc</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>10867</td>\n",
       "      <td>SERC</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>50903</td>\n",
       "      <td>RFC</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>10671</td>\n",
       "      <td>SPP</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2527</td>\n",
       "      <td>NPCC</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3305</td>\n",
       "      <td>SERC</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   plant id  nerc  year\n",
       "0     10867  SERC  2001\n",
       "1     50903   RFC  2001\n",
       "2     10671   SPP  2001\n",
       "3      2527  NPCC  2001\n",
       "4      3305  SERC  2001"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "eia_nercs[2001].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T14:30:23.906158Z",
     "start_time": "2017-11-06T14:30:23.877433Z"
    }
   },
   "outputs": [],
   "source": [
    "nercs = pd.concat(eia_nercs.values())\n",
    "nercs.sort_values('year', inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>nerc</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5465</th>\n",
       "      <td>56512</td>\n",
       "      <td>RFC</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4866</th>\n",
       "      <td>1481</td>\n",
       "      <td>NPCC</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4865</th>\n",
       "      <td>1480</td>\n",
       "      <td>NPCC</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4864</th>\n",
       "      <td>805</td>\n",
       "      <td>NPCC</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4863</th>\n",
       "      <td>54451</td>\n",
       "      <td>WECC</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      plant id  nerc  year\n",
       "5465     56512   RFC  2001\n",
       "4866      1481  NPCC  2001\n",
       "4865      1480  NPCC  2001\n",
       "4864       805  NPCC  2001\n",
       "4863     54451  WECC  2001"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nercs.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "set()"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(set(nercs.loc[(nercs.nerc == 'MRO') &\n",
    "              (nercs.year == 2016), 'plant id'])\n",
    " - set(nercs.loc[(nercs.nerc == 'MRO') &\n",
    "              (nercs.year == 2017), 'plant id']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,\n",
       "       2012, 2013, 2014, 2015, 2016, 2017])"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nercs.year.unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Look for plants listed with different NERC labels\n",
    "\n",
    "** This may not matter anymore if I use NERC labels for each year **\n",
    "\n",
    "There are 30 plants duplicated. Five of them don't have a NERC label in one of the years. The largest move is from MRO to other regions (12), with most of those to SPP (7). After that, moves from RFC (5) to MRO (3) and SERC (2). There are also some moves from WECC and FRCC to HICC/ASCC - these might be diesel generators that get moved.\n",
    "\n",
    "The plants that have duplicate NERC region labels represent a small fraction of national generation, but one that is growing over time. By 2016 they consist of 0.15% of national generation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T14:31:12.401717Z",
     "start_time": "2017-11-06T14:31:12.388812Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7289 total records\n",
      "7289 unique plants\n",
      "8060 total records\n",
      "8060 unique plants\n",
      "8520 total records\n",
      "8520 unique plants\n",
      "8928 total records\n",
      "8928 unique plants\n",
      "9711 total records\n",
      "9711 unique plants\n",
      "7289 total records\n",
      "7289 unique plants\n",
      "7289 total records\n",
      "7289 unique plants\n",
      "7289 total records\n",
      "7289 unique plants\n",
      "7289 total records\n",
      "7289 unique plants\n",
      "7289 total records\n",
      "7289 unique plants\n",
      "7289 total records\n",
      "7289 unique plants\n",
      "7289 total records\n",
      "7289 unique plants\n",
      "7289 total records\n",
      "7289 unique plants\n",
      "7289 total records\n",
      "7289 unique plants\n",
      "7289 total records\n",
      "7289 unique plants\n",
      "7289 total records\n",
      "7289 unique plants\n",
      "9711 total records\n",
      "9711 unique plants\n",
      "132398 total records\n",
      "10068 unique plants\n"
     ]
    }
   ],
   "source": [
    "for df_ in list(eia_nercs.values()) + [nercs]:\n",
    "    print('{} total records'.format(len(df_)))\n",
    "    print('{} unique plants'.format(len(df_['plant id'].unique())))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T14:43:34.109577Z",
     "start_time": "2017-11-06T14:43:34.102600Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([56512,  1481,  1480, ..., 61001, 61003, 55160])"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dup_plants = nercs.loc[nercs['plant id'].duplicated(keep=False), 'plant id'].unique()\n",
    "dup_plants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T14:43:41.526762Z",
     "start_time": "2017-11-06T14:43:41.503604Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({('ASCC', nan): 2,\n",
       "         ('ASCC',): 130,\n",
       "         ('FRCC', 'HICC'): 1,\n",
       "         ('FRCC',): 205,\n",
       "         ('HICC',): 52,\n",
       "         ('MRO', 'RFC'): 2,\n",
       "         ('MRO', 'SERC'): 1,\n",
       "         ('MRO', 'SPP'): 7,\n",
       "         ('MRO', 'WECC'): 3,\n",
       "         ('MRO',): 1051,\n",
       "         ('NPCC',): 1257,\n",
       "         ('RFC', 'MRO'): 3,\n",
       "         ('RFC', 'SERC'): 2,\n",
       "         ('RFC',): 1626,\n",
       "         ('SERC', 'SPP'): 1,\n",
       "         ('SERC',): 1832,\n",
       "         ('SPP', 'SERC'): 2,\n",
       "         ('SPP', 'TRE'): 1,\n",
       "         ('SPP',): 470,\n",
       "         ('TRE',): 461,\n",
       "         ('WECC', 'ASCC'): 2,\n",
       "         ('WECC', 'HICC'): 1,\n",
       "         ('WECC',): 2892,\n",
       "         (nan, 'WECC', 'ASCC'): 3,\n",
       "         (nan, 'WECC', 'HICC'): 1,\n",
       "         (nan,): 35})"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "region_list = []\n",
    "for plant in dup_plants:\n",
    "    regions = nercs.loc[nercs['plant id'] == plant, 'nerc'].unique()\n",
    "#     regions = regions.tolist()\n",
    "    region_list.append(regions)\n",
    "Counter(tuple(x) for x in region_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T14:43:46.932593Z",
     "start_time": "2017-11-06T14:43:46.804826Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "year\n",
       "2001    0.993784\n",
       "2002    0.994703\n",
       "2003    0.993925\n",
       "2004    0.996498\n",
       "2005    0.997682\n",
       "2006    0.998378\n",
       "2007    0.998707\n",
       "2008    0.998873\n",
       "2009    0.999267\n",
       "2010    0.999767\n",
       "2011    0.999990\n",
       "2012    1.000000\n",
       "2013    1.000000\n",
       "2014    1.000000\n",
       "2015    1.000000\n",
       "2016    1.000000\n",
       "2017    0.999998\n",
       "Name: generation (MWh), dtype: float64"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(facility_df.loc[facility_df['plant id'].isin(dup_plants), :]\n",
    "            .groupby('year')['generation (MWh)'].sum()\n",
    " / facility_df.loc[:, :]\n",
    "              .groupby('year')['generation (MWh)'].sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Some plants in EIA-860 don't have NERC labels. Drop them now.\n",
    "This is my training data. All of these plants should still be in my `plants` dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "232"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nan_plants = {}\n",
    "all_nan = []\n",
    "years = nercs.year.unique()\n",
    "for year in years:\n",
    "    nan_plants[year] = nercs.loc[(nercs.year == year) &\n",
    "                                 (nercs.isnull().any(axis=1)), 'plant id'].tolist()\n",
    "    all_nan.extend(nan_plants[year])\n",
    "\n",
    "# number of plants that don't have a nerc in at least one year\n",
    "len(all_nan)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# drop all the rows without a nerc value\n",
    "nercs.dropna(inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[58651,\n",
       " 58656,\n",
       " 58659,\n",
       " 58662,\n",
       " 58639,\n",
       " 58640,\n",
       " 58549,\n",
       " 58684,\n",
       " 58277,\n",
       " 58380,\n",
       " 58425,\n",
       " 58405,\n",
       " 60563,\n",
       " 60588,\n",
       " 60260,\n",
       " 60250,\n",
       " 60243,\n",
       " 60244,\n",
       " 60245,\n",
       " 60328,\n",
       " 61166,\n",
       " 61172,\n",
       " 61364,\n",
       " 60814,\n",
       " 61099,\n",
       " 61101,\n",
       " 61068,\n",
       " 58989,\n",
       " 58982,\n",
       " 58977,\n",
       " 58837,\n",
       " 59035,\n",
       " 60125,\n",
       " 60024,\n",
       " 66,\n",
       " 70]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nan_plants[2017]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load EIA-860m for some info on recent facilities\n",
    "The EIA-860m (monthly) data file has an up-to-date list of all operating power plants and their associated balancing authority. It does not list the NERC region, so it can't be used to assign NERC labels for all plants. But in some cases knowing the state and balancing authority is enough to make a good guess about which NERC a plant is in.\n",
    "\n",
    "Assigning NERC region labels has the lowest accuracy for plants in SPP and TRE. To compensate, I'm going to assume that anything in TX or OK and SWPP balancing authority is in SPP. On the flip side, if it's in TX and ERCOT I'll assign it to TRE.\n",
    "\n",
    "Only do this for plants that come online since the most recent 860 annual data.\n",
    "\n",
    "**NOTE**\n",
    "Because I'm looking at units that came online in 2017 some of the plant ids will already exist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {},
   "outputs": [],
   "source": [
    "path = join(data_path, 'EIA downloads', 'december_generator2017.xlsx')\n",
    "\n",
    "# Check the excel file columns if there is a read error. They should match\n",
    "# the plant id, plant state, operating year, and balancing authority code.\n",
    "_m860 = pd.read_excel(path, sheet_name='Operating',skip_footer=1,\n",
    "                     usecols='C,F,P,AE', skiprows=0)\n",
    "_m860.columns = _m860.columns.str.lower()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>plant state</th>\n",
       "      <th>operating year</th>\n",
       "      <th>balancing authority code</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>21373</th>\n",
       "      <td>61632</td>\n",
       "      <td>IA</td>\n",
       "      <td>2017</td>\n",
       "      <td>MISO</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21374</th>\n",
       "      <td>61633</td>\n",
       "      <td>MA</td>\n",
       "      <td>2017</td>\n",
       "      <td>ISNE</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21375</th>\n",
       "      <td>61634</td>\n",
       "      <td>MA</td>\n",
       "      <td>2017</td>\n",
       "      <td>ISNE</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21376</th>\n",
       "      <td>61635</td>\n",
       "      <td>MA</td>\n",
       "      <td>2017</td>\n",
       "      <td>ISNE</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21377</th>\n",
       "      <td>61636</td>\n",
       "      <td>MA</td>\n",
       "      <td>2017</td>\n",
       "      <td>ISNE</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       plant id plant state  operating year balancing authority code\n",
       "21373     61632          IA            2017                     MISO\n",
       "21374     61633          MA            2017                     ISNE\n",
       "21375     61634          MA            2017                     ISNE\n",
       "21376     61635          MA            2017                     ISNE\n",
       "21377     61636          MA            2017                     ISNE"
      ]
     },
     "execution_count": 128,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# most_recent_860_year is defined at the top of this notebook\n",
    "# The goal here is to only look at plants that started operating after\n",
    "# the most recent annual data. So only include units starting after\n",
    "# the last annual data and that don't have plant ids in the nercs\n",
    "# dataframe\n",
    "\n",
    "m860 = _m860.loc[(_m860['operating year'] > most_recent_860_year)].copy() #&\n",
    "#                 (~_m860['plant id'].isin(nercs['plant id'].unique()))].copy()\n",
    "m860.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {},
   "outputs": [],
   "source": [
    "m860.loc[(m860['plant state'].isin(['TX', 'OK'])) &\n",
    "         (m860['balancing authority code'] == 'SWPP'), 'nerc'] = 'SPP'\n",
    "\n",
    "m860.loc[(m860['plant state'].isin(['TX'])) &\n",
    "         (m860['balancing authority code'] == 'ERCO'), 'nerc'] = 'TRE'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>plant state</th>\n",
       "      <th>operating year</th>\n",
       "      <th>balancing authority code</th>\n",
       "      <th>nerc</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>343</th>\n",
       "      <td>165</td>\n",
       "      <td>OK</td>\n",
       "      <td>2017</td>\n",
       "      <td>SWPP</td>\n",
       "      <td>SPP</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>344</th>\n",
       "      <td>165</td>\n",
       "      <td>OK</td>\n",
       "      <td>2017</td>\n",
       "      <td>SWPP</td>\n",
       "      <td>SPP</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4836</th>\n",
       "      <td>2953</td>\n",
       "      <td>OK</td>\n",
       "      <td>2017</td>\n",
       "      <td>SWPP</td>\n",
       "      <td>SPP</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4837</th>\n",
       "      <td>2953</td>\n",
       "      <td>OK</td>\n",
       "      <td>2017</td>\n",
       "      <td>SWPP</td>\n",
       "      <td>SPP</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4838</th>\n",
       "      <td>2953</td>\n",
       "      <td>OK</td>\n",
       "      <td>2017</td>\n",
       "      <td>SWPP</td>\n",
       "      <td>SPP</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      plant id plant state  operating year balancing authority code nerc\n",
       "343        165          OK            2017                     SWPP  SPP\n",
       "344        165          OK            2017                     SWPP  SPP\n",
       "4836      2953          OK            2017                     SWPP  SPP\n",
       "4837      2953          OK            2017                     SWPP  SPP\n",
       "4838      2953          OK            2017                     SWPP  SPP"
      ]
     },
     "execution_count": 130,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Drop all rows except the ones I've labeled as TRE or SPP\n",
    "m860.dropna(inplace=True)\n",
    "m860.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make lists of plant codes for SPP and TRE facilities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 131,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>nerc</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>5465</th>\n",
       "      <td>56512</td>\n",
       "      <td>RFC</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4866</th>\n",
       "      <td>1481</td>\n",
       "      <td>NPCC</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4865</th>\n",
       "      <td>1480</td>\n",
       "      <td>NPCC</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4864</th>\n",
       "      <td>805</td>\n",
       "      <td>NPCC</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4863</th>\n",
       "      <td>54451</td>\n",
       "      <td>WECC</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      plant id  nerc  year\n",
       "5465     56512   RFC  2001\n",
       "4866      1481  NPCC  2001\n",
       "4865      1480  NPCC  2001\n",
       "4864       805  NPCC  2001\n",
       "4863     54451  WECC  2001"
      ]
     },
     "execution_count": 131,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nercs.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 132,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create additional dataframes with 2017 SPP and TRE plants.\n",
    "# Use these to fill in values for 2017 plants\n",
    "\n",
    "m860_spp_plants = (m860.loc[m860['nerc'] == 'SPP', 'plant id']\n",
    "                       .drop_duplicates()\n",
    "                       .reset_index(drop=True))\n",
    "\n",
    "additional_spp = pd.DataFrame(m860_spp_plants.copy())\n",
    "# additional_spp['plant id'] = m860_spp_plants\n",
    "additional_spp['nerc'] = 'SPP'\n",
    "additional_spp['year'] = 2017\n",
    "\n",
    "m860_tre_plants = (m860.loc[m860['nerc'] == 'TRE', 'plant id']\n",
    "                       .drop_duplicates()\n",
    "                       .reset_index(drop=True))\n",
    "\n",
    "additional_tre = pd.DataFrame(m860_tre_plants)\n",
    "# additional_tre['plant id'] = m860_tre_plants\n",
    "additional_tre['nerc'] = 'TRE'\n",
    "additional_tre['year'] = 2017"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 133,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>nerc</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>165</td>\n",
       "      <td>SPP</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2953</td>\n",
       "      <td>SPP</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>60414</td>\n",
       "      <td>SPP</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>61221</td>\n",
       "      <td>SPP</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>61261</td>\n",
       "      <td>SPP</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>61614</td>\n",
       "      <td>SPP</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>61615</td>\n",
       "      <td>SPP</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>61616</td>\n",
       "      <td>SPP</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>61617</td>\n",
       "      <td>SPP</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>61618</td>\n",
       "      <td>SPP</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   plant id nerc  year\n",
       "0       165  SPP  2017\n",
       "1      2953  SPP  2017\n",
       "2     60414  SPP  2017\n",
       "3     61221  SPP  2017\n",
       "4     61261  SPP  2017\n",
       "5     61614  SPP  2017\n",
       "6     61615  SPP  2017\n",
       "7     61616  SPP  2017\n",
       "8     61617  SPP  2017\n",
       "9     61618  SPP  2017"
      ]
     },
     "execution_count": 133,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "additional_spp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>nerc</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>56984</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>59066</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>59193</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>59206</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>59245</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>59712</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>59812</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>60122</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>60210</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>60217</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>60366</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>60372</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>60436</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>60459</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>60460</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>60581</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>60682</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>60690</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>60774</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>60901</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>60902</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>60983</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>60989</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>61205</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>61309</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>61362</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>61409</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>61410</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>61411</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    plant id nerc  year\n",
       "0      56984  TRE  2017\n",
       "1      59066  TRE  2017\n",
       "2      59193  TRE  2017\n",
       "3      59206  TRE  2017\n",
       "4      59245  TRE  2017\n",
       "5      59712  TRE  2017\n",
       "6      59812  TRE  2017\n",
       "7      60122  TRE  2017\n",
       "8      60210  TRE  2017\n",
       "9      60217  TRE  2017\n",
       "10     60366  TRE  2017\n",
       "11     60372  TRE  2017\n",
       "12     60436  TRE  2017\n",
       "13     60459  TRE  2017\n",
       "14     60460  TRE  2017\n",
       "15     60581  TRE  2017\n",
       "16     60682  TRE  2017\n",
       "17     60690  TRE  2017\n",
       "18     60774  TRE  2017\n",
       "19     60901  TRE  2017\n",
       "20     60902  TRE  2017\n",
       "21     60983  TRE  2017\n",
       "22     60989  TRE  2017\n",
       "23     61205  TRE  2017\n",
       "24     61309  TRE  2017\n",
       "25     61362  TRE  2017\n",
       "26     61409  TRE  2017\n",
       "27     61410  TRE  2017\n",
       "28     61411  TRE  2017"
      ]
     },
     "execution_count": 134,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "additional_tre"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Append my 2017 SPP and TRE guesses to the full nerc dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {},
   "outputs": [],
   "source": [
    "nercs = pd.concat([nercs, additional_spp, additional_tre])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clean and prep data for KNN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>year</th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "      <th>state</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1001</td>\n",
       "      <td>2017</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>IN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>1001</td>\n",
       "      <td>2016</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>IN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>1001</td>\n",
       "      <td>2015</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>IN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>1001</td>\n",
       "      <td>2014</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>IN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48</th>\n",
       "      <td>1001</td>\n",
       "      <td>2013</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>IN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    plant id  year      lat      lon state\n",
       "0       1001  2017  39.9242 -87.4244    IN\n",
       "12      1001  2016  39.9242 -87.4244    IN\n",
       "24      1001  2015  39.9242 -87.4244    IN\n",
       "36      1001  2014  39.9242 -87.4244    IN\n",
       "48      1001  2013  39.9242 -87.4244    IN"
      ]
     },
     "execution_count": 116,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "plants.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>nerc</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>61309</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>61362</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>61409</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>61410</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>61411</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    plant id nerc  year\n",
       "24     61309  TRE  2017\n",
       "25     61362  TRE  2017\n",
       "26     61409  TRE  2017\n",
       "27     61410  TRE  2017\n",
       "28     61411  TRE  2017"
      ]
     },
     "execution_count": 117,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nercs.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Checked to make sure the type of merge doesn't matter once rows without nerc values are dropped"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 209,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:39:54.416966Z",
     "start_time": "2017-11-06T16:39:54.383812Z"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "df = pd.merge(plants, nercs, on=['plant id', 'year'], how='left')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 210,
   "metadata": {},
   "outputs": [],
   "source": [
    "omitted = set(df['plant id'].unique()) - set(nercs['plant id'].unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 213,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>year</th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "      <th>state</th>\n",
       "      <th>nerc</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1001</td>\n",
       "      <td>2016</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>IN</td>\n",
       "      <td>RFC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1001</td>\n",
       "      <td>2015</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>IN</td>\n",
       "      <td>RFC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1001</td>\n",
       "      <td>2014</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>IN</td>\n",
       "      <td>RFC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1001</td>\n",
       "      <td>2013</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>IN</td>\n",
       "      <td>RFC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1001</td>\n",
       "      <td>2012</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>IN</td>\n",
       "      <td>RFC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   plant id  year      lat      lon state nerc\n",
       "0      1001  2016  39.9242 -87.4244    IN  RFC\n",
       "1      1001  2015  39.9242 -87.4244    IN  RFC\n",
       "2      1001  2014  39.9242 -87.4244    IN  RFC\n",
       "3      1001  2013  39.9242 -87.4244    IN  RFC\n",
       "4      1001  2012  39.9242 -87.4244    IN  RFC"
      ]
     },
     "execution_count": 213,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 214,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>year</th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "      <th>state</th>\n",
       "      <th>nerc</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>100851</th>\n",
       "      <td>59957</td>\n",
       "      <td>2017</td>\n",
       "      <td>41.226900</td>\n",
       "      <td>-88.683600</td>\n",
       "      <td>IL</td>\n",
       "      <td>RFC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100852</th>\n",
       "      <td>59257</td>\n",
       "      <td>2017</td>\n",
       "      <td>37.305833</td>\n",
       "      <td>-121.755000</td>\n",
       "      <td>CA</td>\n",
       "      <td>WECC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100853</th>\n",
       "      <td>59256</td>\n",
       "      <td>2017</td>\n",
       "      <td>38.393333</td>\n",
       "      <td>-121.927778</td>\n",
       "      <td>CA</td>\n",
       "      <td>WECC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100854</th>\n",
       "      <td>10475</td>\n",
       "      <td>2017</td>\n",
       "      <td>41.683600</td>\n",
       "      <td>-87.423300</td>\n",
       "      <td>IN</td>\n",
       "      <td>RFC</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100855</th>\n",
       "      <td>55370</td>\n",
       "      <td>2017</td>\n",
       "      <td>39.851389</td>\n",
       "      <td>-79.070556</td>\n",
       "      <td>PA</td>\n",
       "      <td>RFC</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        plant id  year        lat         lon state  nerc\n",
       "100851     59957  2017  41.226900  -88.683600    IL   RFC\n",
       "100852     59257  2017  37.305833 -121.755000    CA  WECC\n",
       "100853     59256  2017  38.393333 -121.927778    CA  WECC\n",
       "100854     10475  2017  41.683600  -87.423300    IN   RFC\n",
       "100855     55370  2017  39.851389  -79.070556    PA   RFC"
      ]
     },
     "execution_count": 214,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 215,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:39:55.684713Z",
     "start_time": "2017-11-06T16:39:55.678207Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['plant id', 'year', 'lat', 'lon', 'state', 'nerc'], dtype='object')"
      ]
     },
     "execution_count": 215,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Drop plants that don't have lat/lon data (using just lon to check), and then drop duplicates. If any plants have kept the same plant id but moved over time (maybe a diesel generator?) or switched NERC they will show up twice."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 216,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:40:08.841245Z",
     "start_time": "2017-11-06T16:40:08.805376Z"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "cols = ['plant id', 'lat', 'lon', 'nerc', 'state', 'year']\n",
    "df_slim = (df.loc[:, cols].dropna(subset=['lon'])\n",
    "             .drop_duplicates(subset=['plant id', 'year', 'nerc']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 217,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:40:29.960293Z",
     "start_time": "2017-11-06T16:40:29.944141Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "      <th>nerc</th>\n",
       "      <th>state</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>100851</th>\n",
       "      <td>59957</td>\n",
       "      <td>41.226900</td>\n",
       "      <td>-88.683600</td>\n",
       "      <td>RFC</td>\n",
       "      <td>IL</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100852</th>\n",
       "      <td>59257</td>\n",
       "      <td>37.305833</td>\n",
       "      <td>-121.755000</td>\n",
       "      <td>WECC</td>\n",
       "      <td>CA</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100853</th>\n",
       "      <td>59256</td>\n",
       "      <td>38.393333</td>\n",
       "      <td>-121.927778</td>\n",
       "      <td>WECC</td>\n",
       "      <td>CA</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100854</th>\n",
       "      <td>10475</td>\n",
       "      <td>41.683600</td>\n",
       "      <td>-87.423300</td>\n",
       "      <td>RFC</td>\n",
       "      <td>IN</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100855</th>\n",
       "      <td>55370</td>\n",
       "      <td>39.851389</td>\n",
       "      <td>-79.070556</td>\n",
       "      <td>RFC</td>\n",
       "      <td>PA</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        plant id        lat         lon  nerc state  year\n",
       "100851     59957  41.226900  -88.683600   RFC    IL  2017\n",
       "100852     59257  37.305833 -121.755000  WECC    CA  2017\n",
       "100853     59256  38.393333 -121.927778  WECC    CA  2017\n",
       "100854     10475  41.683600  -87.423300   RFC    IN  2017\n",
       "100855     55370  39.851389  -79.070556   RFC    PA  2017"
      ]
     },
     "execution_count": 217,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_slim.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Separate out the list of plants where we don't have NERC labels from EIA-860."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 218,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:40:35.247883Z",
     "start_time": "2017-11-06T16:40:35.239582Z"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "unknown = df_slim.loc[df_slim.nerc.isnull()].copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 219,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:40:37.583325Z",
     "start_time": "2017-11-06T16:40:37.568223Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1579 plants/years don't have NERC labels\n",
      "\n",
      "       plant id    lat     lon nerc state  year\n",
      "28205      3823  38.27 -78.035  NaN    VA  2009\n",
      "28206      3823  38.27 -78.035  NaN    VA  2005\n",
      "28207      3823  38.27 -78.035  NaN    VA  2004\n",
      "28208      3823  38.27 -78.035  NaN    VA  2003\n",
      "28209      3823  38.27 -78.035  NaN    VA  2002\n"
     ]
    }
   ],
   "source": [
    "print(\"{} plants/years don't have NERC labels\\n\".format(len(unknown)))\n",
    "print(unknown.head())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 220,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "      <th>nerc</th>\n",
       "      <th>state</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>100814</th>\n",
       "      <td>50712</td>\n",
       "      <td>37.734100</td>\n",
       "      <td>-121.6516</td>\n",
       "      <td>NaN</td>\n",
       "      <td>CA</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100817</th>\n",
       "      <td>55162</td>\n",
       "      <td>29.096900</td>\n",
       "      <td>-81.0689</td>\n",
       "      <td>NaN</td>\n",
       "      <td>FL</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100818</th>\n",
       "      <td>56171</td>\n",
       "      <td>34.037200</td>\n",
       "      <td>-117.5569</td>\n",
       "      <td>NaN</td>\n",
       "      <td>CA</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100822</th>\n",
       "      <td>7939</td>\n",
       "      <td>37.771400</td>\n",
       "      <td>-75.6342</td>\n",
       "      <td>NaN</td>\n",
       "      <td>VA</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>100825</th>\n",
       "      <td>58916</td>\n",
       "      <td>34.248056</td>\n",
       "      <td>-116.0600</td>\n",
       "      <td>NaN</td>\n",
       "      <td>CA</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        plant id        lat       lon nerc state  year\n",
       "100814     50712  37.734100 -121.6516  NaN    CA  2017\n",
       "100817     55162  29.096900  -81.0689  NaN    FL  2017\n",
       "100818     56171  34.037200 -117.5569  NaN    CA  2017\n",
       "100822      7939  37.771400  -75.6342  NaN    VA  2017\n",
       "100825     58916  34.248056 -116.0600  NaN    CA  2017"
      ]
     },
     "execution_count": 220,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unknown.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create X and y matricies\n",
    "X is lat/lon and year\n",
    "\n",
    "y is the NERC label\n",
    "\n",
    "For both, I'm only using plants where we have all data (no `NaN`s). Not doing any transformation of the lat/lon at this time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 221,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T20:06:21.393849Z",
     "start_time": "2017-11-06T20:06:21.367798Z"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X = df_slim.loc[df_slim.notnull().all(axis=1), ['lat', 'lon', 'year']]\n",
    "y = df_slim.loc[df_slim.notnull().all(axis=1), 'nerc']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 222,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T20:06:27.239811Z",
     "start_time": "2017-11-06T20:06:27.233495Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "99127"
      ]
     },
     "execution_count": 222,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 223,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 223,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Make sure that unknown and X include all records from df_slim\n",
    "len(X) + len(unknown) - len(df_slim)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 224,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T20:06:27.249590Z",
     "start_time": "2017-11-06T20:06:27.242414Z"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.33, random_state=42)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GridSearch to find the best parameters in a RandomForest Classifier\n",
    "\n",
    "I previously used k-nearest neighbors with just lat/lon as input features. The problem is that some facilities don't have lat/lon data. They do usually have a state geography label though. Categorical labels don't work well in KNN, but the combination of lat/lon and a state label will work well in a tree model. RandomForest is usually a quite effective tree model and my results are more accurate with this than they were with KNN."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 225,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 226,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 36 candidates, totalling 108 fits\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   32.3s\n",
      "[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:  1.3min finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=None, error_score='raise',\n",
       "       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
       "            max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
       "            min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "            min_samples_leaf=1, min_samples_split=2,\n",
       "            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n",
       "            oob_score=False, random_state=None, verbose=0,\n",
       "            warm_start=False),\n",
       "       fit_params=None, iid=False, n_jobs=-1,\n",
       "       param_grid={'n_estimators': [5, 10, 25, 50], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 3, 5]},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n",
       "       scoring=None, verbose=1)"
      ]
     },
     "execution_count": 226,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rf = RandomForestClassifier()\n",
    "params = dict(\n",
    "    n_estimators = [5, 10, 25, 50],\n",
    "    min_samples_split = [2, 5, 10],\n",
    "    min_samples_leaf = [1, 3, 5],\n",
    ")\n",
    "\n",
    "clf_rf = GridSearchCV(rf, params, n_jobs=-1, iid=False, verbose=1)\n",
    "clf_rf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 227,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
       "             max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
       "             min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "             min_samples_leaf=1, min_samples_split=2,\n",
       "             min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,\n",
       "             oob_score=False, random_state=None, verbose=0,\n",
       "             warm_start=False), 0.9797785431782634)"
      ]
     },
     "execution_count": 227,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf_rf.best_estimator_, clf_rf.best_score_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 228,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9822695035460993"
      ]
     },
     "execution_count": 228,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf_rf.score(X_test, y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 229,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T20:10:03.521287Z",
     "start_time": "2017-11-06T20:10:03.510982Z"
    }
   },
   "outputs": [],
   "source": [
    "nerc_labels = nercs.nerc.dropna().unique()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accuracy score by region"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 230,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T20:15:48.830308Z",
     "start_time": "2017-11-06T20:15:48.737017Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RFC : 0.9707198806415517\n",
      "NPCC : 0.9907823209643111\n",
      "WECC : 0.9987963672174198\n",
      "SERC : 0.9690892364305428\n",
      "SPP : 0.9328663164039697\n",
      "TRE : 0.9903057419835943\n",
      "MRO : 0.9853683148335015\n",
      "FRCC : 0.9768707482993197\n",
      "ASCC : 0.9965986394557823\n",
      "HICC : 1.0\n"
     ]
    }
   ],
   "source": [
    "for region in nerc_labels:\n",
    "    mask = y_test == region\n",
    "    \n",
    "    X_masked = X_test[mask]\n",
    "    y_hat_masked = clf_rf.predict(X_masked)\n",
    "    y_test_masked = y_test[mask]\n",
    "    \n",
    "    accuracy = metrics.accuracy_score(y_test_masked, y_hat_masked)\n",
    "    print('{} : {}'.format(region, accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "F1 score by region"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 231,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T20:18:57.242329Z",
     "start_time": "2017-11-06T20:18:57.079002Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RFC : 0.9682820202771835\n",
      "NPCC : 0.9916026020106445\n",
      "WECC : 0.9985778361229624\n",
      "SERC : 0.9707861026633492\n",
      "SPP : 0.9461219656601539\n",
      "TRE : 0.986627043090639\n",
      "MRO : 0.9798068481123793\n",
      "FRCC : 0.9835616438356164\n",
      "ASCC : 0.9982964224872232\n",
      "HICC : 1.0\n"
     ]
    }
   ],
   "source": [
    "y_hat = clf_rf.predict(X_test)\n",
    "\n",
    "for region in nerc_labels:\n",
    "    f1 = metrics.f1_score(y_test, y_hat, labels=[region], average='macro')\n",
    "    print('{} : {}'.format(region, f1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Plants without lat/lon\n",
    "Use just the state for plants that don't have lat/lon info. Less accurate, especially where NERC regions cross state lines, but better than nothing.\n",
    "\n",
    "Need to start with the `lon` column so I can filter to only unknown facilities that also don't have lon"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 232,
   "metadata": {},
   "outputs": [],
   "source": [
    "cols = ['plant id', 'nerc', 'state', 'year', 'lon']\n",
    "df_state_slim = (df.loc[:, cols].dropna(subset=['state']).copy())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 233,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>nerc</th>\n",
       "      <th>state</th>\n",
       "      <th>year</th>\n",
       "      <th>lon</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1001</td>\n",
       "      <td>RFC</td>\n",
       "      <td>IN</td>\n",
       "      <td>2016</td>\n",
       "      <td>-87.4244</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1001</td>\n",
       "      <td>RFC</td>\n",
       "      <td>IN</td>\n",
       "      <td>2015</td>\n",
       "      <td>-87.4244</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1001</td>\n",
       "      <td>RFC</td>\n",
       "      <td>IN</td>\n",
       "      <td>2014</td>\n",
       "      <td>-87.4244</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1001</td>\n",
       "      <td>RFC</td>\n",
       "      <td>IN</td>\n",
       "      <td>2013</td>\n",
       "      <td>-87.4244</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1001</td>\n",
       "      <td>RFC</td>\n",
       "      <td>IN</td>\n",
       "      <td>2012</td>\n",
       "      <td>-87.4244</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   plant id nerc state  year      lon\n",
       "0      1001  RFC    IN  2016 -87.4244\n",
       "1      1001  RFC    IN  2015 -87.4244\n",
       "2      1001  RFC    IN  2014 -87.4244\n",
       "3      1001  RFC    IN  2013 -87.4244\n",
       "4      1001  RFC    IN  2012 -87.4244"
      ]
     },
     "execution_count": 233,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_state_slim.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 234,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "100831"
      ]
     },
     "execution_count": 234,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df_state_slim)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Encode state names as numbers for use in sklearn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 235,
   "metadata": {},
   "outputs": [],
   "source": [
    "le = LabelEncoder()\n",
    "df_state_slim.loc[:, 'enc state'] = le.fit_transform(df_state_slim.loc[:, 'state'].tolist())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 236,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "100831"
      ]
     },
     "execution_count": 236,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df_state_slim)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 237,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:40:35.247883Z",
     "start_time": "2017-11-06T16:40:35.239582Z"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "unknown_state = df_state_slim.loc[(df_state_slim.nerc.isnull()) &\n",
    "                                  (df_state_slim.lon.isnull())].copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 238,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(116, 1579)"
      ]
     },
     "execution_count": 238,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(unknown_state), len(unknown)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 239,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T20:06:21.393849Z",
     "start_time": "2017-11-06T20:06:21.367798Z"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_state = df_state_slim.loc[df_state_slim.notnull().all(axis=1), ['enc state', 'year']].copy()\n",
    "y_state = df_state_slim.loc[df_state_slim.notnull().all(axis=1), 'nerc'].copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 240,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T20:06:27.249590Z",
     "start_time": "2017-11-06T20:06:27.242414Z"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X_state_train, X_state_test, y_state_train, y_state_test = train_test_split(\n",
    "    X_state, y_state, test_size=0.33, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 241,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 3 folds for each of 36 candidates, totalling 108 fits\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   17.1s\n",
      "[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:   45.4s finished\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=None, error_score='raise',\n",
       "       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
       "            max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
       "            min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "            min_samples_leaf=1, min_samples_split=2,\n",
       "            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n",
       "            oob_score=False, random_state=None, verbose=0,\n",
       "            warm_start=False),\n",
       "       fit_params=None, iid=False, n_jobs=-1,\n",
       "       param_grid={'n_estimators': [5, 10, 25, 50], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 3, 5]},\n",
       "       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n",
       "       scoring=None, verbose=1)"
      ]
     },
     "execution_count": 241,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rf = RandomForestClassifier()\n",
    "params = dict(\n",
    "    n_estimators = [5, 10, 25, 50],\n",
    "    min_samples_split = [2, 5, 10],\n",
    "    min_samples_leaf = [1, 3, 5],\n",
    ")\n",
    "\n",
    "clf_rf_state = GridSearchCV(rf, params, n_jobs=-1, iid=False, verbose=1)\n",
    "clf_rf_state.fit(X_state_train, y_state_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 242,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n",
       "             max_depth=None, max_features='auto', max_leaf_nodes=None,\n",
       "             min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "             min_samples_leaf=1, min_samples_split=10,\n",
       "             min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,\n",
       "             oob_score=False, random_state=None, verbose=0,\n",
       "             warm_start=False), 0.9402882327958322)"
      ]
     },
     "execution_count": 242,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf_rf_state.best_estimator_, clf_rf_state.best_score_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 243,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.9387418230726906"
      ]
     },
     "execution_count": 243,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf_rf_state.score(X_state_test, y_state_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Accuracy score by region"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 244,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T20:15:48.830308Z",
     "start_time": "2017-11-06T20:15:48.737017Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RFC : 0.9243697478991597\n",
      "NPCC : 0.9938519744620478\n",
      "WECC : 0.9943912900032993\n",
      "SERC : 0.8831431726168568\n",
      "SPP : 0.5765453495089543\n",
      "TRE : 1.0\n",
      "MRO : 0.9614311088556204\n",
      "FRCC : 0.9973992197659298\n",
      "ASCC : 1.0\n",
      "HICC : 1.0\n"
     ]
    }
   ],
   "source": [
    "nerc_labels = nercs.nerc.dropna().unique()\n",
    "\n",
    "for region in nerc_labels:\n",
    "    mask = y_state_test == region\n",
    "    \n",
    "    X_state_masked = X_state_test[mask]\n",
    "    y_state_hat_masked = clf_rf_state.predict(X_state_masked)\n",
    "    y_state_test_masked = y_state_test[mask]\n",
    "    \n",
    "    accuracy = metrics.accuracy_score(y_state_test_masked, y_state_hat_masked)\n",
    "    print('{} : {}'.format(region, accuracy))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "F1 score by region"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 245,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T20:18:57.242329Z",
     "start_time": "2017-11-06T20:18:57.079002Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RFC : 0.9088405397961994\n",
      "NPCC : 0.9958535718516763\n",
      "WECC : 0.9933534743202416\n",
      "SERC : 0.8819259395387301\n",
      "SPP : 0.7300658376005852\n",
      "TRE : 0.8763183125599233\n",
      "MRO : 0.9512929952297263\n",
      "FRCC : 0.962962962962963\n",
      "ASCC : 0.9991474850809888\n",
      "HICC : 0.9975669099756691\n"
     ]
    }
   ],
   "source": [
    "y_state_hat = clf_rf_state.predict(X_state_test)\n",
    "\n",
    "for region in nerc_labels:\n",
    "    f1 = metrics.f1_score(y_state_test, y_state_hat, labels=[region], average='macro')\n",
    "    print('{} : {}'.format(region, f1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use best RandomForest parameters to predict NERC for unknown plants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 246,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:53:30.016363Z",
     "start_time": "2017-11-06T16:53:29.583624Z"
    }
   },
   "outputs": [],
   "source": [
    "unknown.loc[:, 'nerc'] = clf_rf.predict(unknown.loc[:, ['lat', 'lon', 'year']])\n",
    "unknown_state.loc[:, 'nerc'] = clf_rf_state.predict(unknown_state.loc[:, ['enc state', 'year']])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ensuring that no plants in Alaska or Hawaii are assigned to continental NERCs, or the other way around."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 247,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:53:34.746917Z",
     "start_time": "2017-11-06T16:53:34.734080Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['ASCC' 'HICC' 'WECC']\n",
      "['AK' 'HI']\n"
     ]
    }
   ],
   "source": [
    "print(unknown.loc[unknown.state.isin(['AK', 'HI']), 'nerc'].unique())\n",
    "print(unknown.loc[unknown.nerc.isin(['HICC', 'ASCC']), 'state'].unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 248,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:53:38.090243Z",
     "start_time": "2017-11-06T16:53:38.084127Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Counter({'ASCC': 44,\n",
       "         'FRCC': 6,\n",
       "         'HICC': 39,\n",
       "         'MRO': 67,\n",
       "         'NPCC': 264,\n",
       "         'RFC': 202,\n",
       "         'SERC': 349,\n",
       "         'SPP': 99,\n",
       "         'TRE': 145,\n",
       "         'WECC': 364})"
      ]
     },
     "execution_count": 248,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Counter(unknown['nerc'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 249,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "      <th>nerc</th>\n",
       "      <th>state</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>28205</th>\n",
       "      <td>3823</td>\n",
       "      <td>38.27</td>\n",
       "      <td>-78.035</td>\n",
       "      <td>SERC</td>\n",
       "      <td>VA</td>\n",
       "      <td>2009</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28206</th>\n",
       "      <td>3823</td>\n",
       "      <td>38.27</td>\n",
       "      <td>-78.035</td>\n",
       "      <td>SERC</td>\n",
       "      <td>VA</td>\n",
       "      <td>2005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28207</th>\n",
       "      <td>3823</td>\n",
       "      <td>38.27</td>\n",
       "      <td>-78.035</td>\n",
       "      <td>SERC</td>\n",
       "      <td>VA</td>\n",
       "      <td>2004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28208</th>\n",
       "      <td>3823</td>\n",
       "      <td>38.27</td>\n",
       "      <td>-78.035</td>\n",
       "      <td>SERC</td>\n",
       "      <td>VA</td>\n",
       "      <td>2003</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28209</th>\n",
       "      <td>3823</td>\n",
       "      <td>38.27</td>\n",
       "      <td>-78.035</td>\n",
       "      <td>SERC</td>\n",
       "      <td>VA</td>\n",
       "      <td>2002</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       plant id    lat     lon  nerc state  year\n",
       "28205      3823  38.27 -78.035  SERC    VA  2009\n",
       "28206      3823  38.27 -78.035  SERC    VA  2005\n",
       "28207      3823  38.27 -78.035  SERC    VA  2004\n",
       "28208      3823  38.27 -78.035  SERC    VA  2003\n",
       "28209      3823  38.27 -78.035  SERC    VA  2002"
      ]
     },
     "execution_count": 249,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unknown.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 250,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>nerc</th>\n",
       "      <th>state</th>\n",
       "      <th>year</th>\n",
       "      <th>lon</th>\n",
       "      <th>enc state</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>84867</th>\n",
       "      <td>10851</td>\n",
       "      <td>RFC</td>\n",
       "      <td>NJ</td>\n",
       "      <td>2006</td>\n",
       "      <td>NaN</td>\n",
       "      <td>31</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>84868</th>\n",
       "      <td>10851</td>\n",
       "      <td>RFC</td>\n",
       "      <td>NJ</td>\n",
       "      <td>2005</td>\n",
       "      <td>NaN</td>\n",
       "      <td>31</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>84869</th>\n",
       "      <td>10851</td>\n",
       "      <td>RFC</td>\n",
       "      <td>NJ</td>\n",
       "      <td>2004</td>\n",
       "      <td>NaN</td>\n",
       "      <td>31</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>84870</th>\n",
       "      <td>10851</td>\n",
       "      <td>RFC</td>\n",
       "      <td>NJ</td>\n",
       "      <td>2002</td>\n",
       "      <td>NaN</td>\n",
       "      <td>31</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>84871</th>\n",
       "      <td>10851</td>\n",
       "      <td>RFC</td>\n",
       "      <td>NJ</td>\n",
       "      <td>2001</td>\n",
       "      <td>NaN</td>\n",
       "      <td>31</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       plant id nerc state  year  lon  enc state\n",
       "84867     10851  RFC    NJ  2006  NaN         31\n",
       "84868     10851  RFC    NJ  2005  NaN         31\n",
       "84869     10851  RFC    NJ  2004  NaN         31\n",
       "84870     10851  RFC    NJ  2002  NaN         31\n",
       "84871     10851  RFC    NJ  2001  NaN         31"
      ]
     },
     "execution_count": 250,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unknown_state.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export plants with lat/lon, state, and nerc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 251,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>nerc</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>61309</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>61362</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>61409</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>61410</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>61411</td>\n",
       "      <td>TRE</td>\n",
       "      <td>2017</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    plant id nerc  year\n",
       "24     61309  TRE  2017\n",
       "25     61362  TRE  2017\n",
       "26     61409  TRE  2017\n",
       "27     61410  TRE  2017\n",
       "28     61411  TRE  2017"
      ]
     },
     "execution_count": 251,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nercs.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 252,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:54:00.912370Z",
     "start_time": "2017-11-06T16:54:00.897855Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "      <th>nerc</th>\n",
       "      <th>state</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>28205</th>\n",
       "      <td>3823</td>\n",
       "      <td>38.27</td>\n",
       "      <td>-78.035</td>\n",
       "      <td>SERC</td>\n",
       "      <td>VA</td>\n",
       "      <td>2009</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28206</th>\n",
       "      <td>3823</td>\n",
       "      <td>38.27</td>\n",
       "      <td>-78.035</td>\n",
       "      <td>SERC</td>\n",
       "      <td>VA</td>\n",
       "      <td>2005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28207</th>\n",
       "      <td>3823</td>\n",
       "      <td>38.27</td>\n",
       "      <td>-78.035</td>\n",
       "      <td>SERC</td>\n",
       "      <td>VA</td>\n",
       "      <td>2004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28208</th>\n",
       "      <td>3823</td>\n",
       "      <td>38.27</td>\n",
       "      <td>-78.035</td>\n",
       "      <td>SERC</td>\n",
       "      <td>VA</td>\n",
       "      <td>2003</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28209</th>\n",
       "      <td>3823</td>\n",
       "      <td>38.27</td>\n",
       "      <td>-78.035</td>\n",
       "      <td>SERC</td>\n",
       "      <td>VA</td>\n",
       "      <td>2002</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       plant id    lat     lon  nerc state  year\n",
       "28205      3823  38.27 -78.035  SERC    VA  2009\n",
       "28206      3823  38.27 -78.035  SERC    VA  2005\n",
       "28207      3823  38.27 -78.035  SERC    VA  2004\n",
       "28208      3823  38.27 -78.035  SERC    VA  2003\n",
       "28209      3823  38.27 -78.035  SERC    VA  2002"
      ]
     },
     "execution_count": 252,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unknown.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 253,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:54:02.144417Z",
     "start_time": "2017-11-06T16:54:02.126662Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>nerc</th>\n",
       "      <th>state</th>\n",
       "      <th>year</th>\n",
       "      <th>lon</th>\n",
       "      <th>enc state</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>92615</th>\n",
       "      <td>10257</td>\n",
       "      <td>WECC</td>\n",
       "      <td>CA</td>\n",
       "      <td>2004</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92616</th>\n",
       "      <td>10257</td>\n",
       "      <td>WECC</td>\n",
       "      <td>CA</td>\n",
       "      <td>2003</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92617</th>\n",
       "      <td>10257</td>\n",
       "      <td>WECC</td>\n",
       "      <td>CA</td>\n",
       "      <td>2002</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92618</th>\n",
       "      <td>10257</td>\n",
       "      <td>WECC</td>\n",
       "      <td>CA</td>\n",
       "      <td>2001</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92782</th>\n",
       "      <td>56508</td>\n",
       "      <td>WECC</td>\n",
       "      <td>CA</td>\n",
       "      <td>2007</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       plant id  nerc state  year  lon  enc state\n",
       "92615     10257  WECC    CA  2004  NaN          4\n",
       "92616     10257  WECC    CA  2003  NaN          4\n",
       "92617     10257  WECC    CA  2002  NaN          4\n",
       "92618     10257  WECC    CA  2001  NaN          4\n",
       "92782     56508  WECC    CA  2007  NaN          4"
      ]
     },
     "execution_count": 253,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "unknown_state.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 267,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "31"
      ]
     },
     "execution_count": 267,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(unknown_state['plant id'].unique())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 254,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:54:06.371732Z",
     "start_time": "2017-11-06T16:54:06.354901Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>plant id</th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "      <th>nerc</th>\n",
       "      <th>state</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1001</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>RFC</td>\n",
       "      <td>IN</td>\n",
       "      <td>2016</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1001</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>RFC</td>\n",
       "      <td>IN</td>\n",
       "      <td>2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1001</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>RFC</td>\n",
       "      <td>IN</td>\n",
       "      <td>2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1001</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>RFC</td>\n",
       "      <td>IN</td>\n",
       "      <td>2013</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1001</td>\n",
       "      <td>39.9242</td>\n",
       "      <td>-87.4244</td>\n",
       "      <td>RFC</td>\n",
       "      <td>IN</td>\n",
       "      <td>2012</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   plant id      lat      lon nerc state  year\n",
       "0      1001  39.9242 -87.4244  RFC    IN  2016\n",
       "1      1001  39.9242 -87.4244  RFC    IN  2015\n",
       "2      1001  39.9242 -87.4244  RFC    IN  2014\n",
       "3      1001  39.9242 -87.4244  RFC    IN  2013\n",
       "4      1001  39.9242 -87.4244  RFC    IN  2012"
      ]
     },
     "execution_count": 254,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_slim.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 255,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:54:07.622073Z",
     "start_time": "2017-11-06T16:54:07.597436Z"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "labeled = pd.concat([df_slim.loc[df_slim.notnull().all(axis=1)],\n",
    "                     unknown,\n",
    "                     unknown_state.loc[:, ['plant id', 'nerc', 'state', 'year']]])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 256,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "      <th>nerc</th>\n",
       "      <th>plant id</th>\n",
       "      <th>state</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>92615</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>WECC</td>\n",
       "      <td>10257</td>\n",
       "      <td>CA</td>\n",
       "      <td>2004</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92616</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>WECC</td>\n",
       "      <td>10257</td>\n",
       "      <td>CA</td>\n",
       "      <td>2003</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92617</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>WECC</td>\n",
       "      <td>10257</td>\n",
       "      <td>CA</td>\n",
       "      <td>2002</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92618</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>WECC</td>\n",
       "      <td>10257</td>\n",
       "      <td>CA</td>\n",
       "      <td>2001</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92782</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>WECC</td>\n",
       "      <td>56508</td>\n",
       "      <td>CA</td>\n",
       "      <td>2007</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       lat  lon  nerc  plant id state  year\n",
       "92615  NaN  NaN  WECC     10257    CA  2004\n",
       "92616  NaN  NaN  WECC     10257    CA  2003\n",
       "92617  NaN  NaN  WECC     10257    CA  2002\n",
       "92618  NaN  NaN  WECC     10257    CA  2001\n",
       "92782  NaN  NaN  WECC     56508    CA  2007"
      ]
     },
     "execution_count": 256,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "labeled.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 257,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:54:09.653819Z",
     "start_time": "2017-11-06T16:54:09.640776Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "      <th>nerc</th>\n",
       "      <th>plant id</th>\n",
       "      <th>state</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [lat, lon, nerc, plant id, state, year]\n",
       "Index: []"
      ]
     },
     "execution_count": 257,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "labeled.loc[labeled.nerc.isnull()]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are 7 facilities that don't show up in my labeled data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 258,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([57116, 57794, 58690, 58236, 58098, 57913, 57400, 57628, 61084,\n",
       "       60539, 60540, 60991, 61079, 60383, 61172, 61020, 61221, 61021,\n",
       "       61022, 60682, 60688, 61407, 60689, 61330, 61357, 60414, 60366,\n",
       "       60983, 60989, 60372, 60658, 60581, 60583, 60901, 60902, 60905,\n",
       "       60883, 60885, 60856, 60552, 60467, 60145, 60152, 59308, 59220,\n",
       "       59309, 59315, 60237, 60306, 60303, 60340, 59665, 59666, 59684,\n",
       "       59827, 59061, 60033, 59066, 60210, 60261, 61039, 61040, 61048,\n",
       "       60258, 61050, 60346, 59689, 59690, 59691, 59764, 61261, 61268,\n",
       "       61303, 59812, 59875, 60043, 59888, 61561, 60122, 59245, 59193,\n",
       "       59004, 60655, 60987, 61512, 59206, 60436, 60217, 59712, 59940,\n",
       "       60785, 61222, 61422, 55314, 55952,  7704,  6339, 55975, 55073,\n",
       "       50728, 60569, 60570, 61197, 60690, 60506])"
      ]
     },
     "execution_count": 258,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "facility_df.loc[~facility_df['plant id'].isin(labeled['plant id']), 'plant id'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 259,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(100822, 132244)"
      ]
     },
     "execution_count": 259,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(labeled), len(nercs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 260,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['RFC', 'NPCC', 'WECC', 'SERC', 'SPP', 'TRE', 'MRO', 'FRCC', 'ASCC',\n",
       "       'HICC'], dtype=object)"
      ]
     },
     "execution_count": 260,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nerc_labels"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 261,
   "metadata": {},
   "outputs": [],
   "source": [
    "mro_2016 = set(labeled.loc[(labeled.nerc == 'MRO') &\n",
    "                                  (labeled.year == 2016), 'plant id'])\n",
    "mro_2017 = set(labeled.loc[(labeled.nerc == 'MRO') &\n",
    "                                  (labeled.year == 2017), 'plant id'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 262,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{1052,\n",
       " 1058,\n",
       " 1175,\n",
       " 1189,\n",
       " 1218,\n",
       " 1771,\n",
       " 1889,\n",
       " 1914,\n",
       " 1918,\n",
       " 1932,\n",
       " 1960,\n",
       " 1995,\n",
       " 2008,\n",
       " 2217,\n",
       " 2791,\n",
       " 2821,\n",
       " 2822,\n",
       " 3334,\n",
       " 3343,\n",
       " 3347,\n",
       " 3348,\n",
       " 3996,\n",
       " 4048,\n",
       " 4054,\n",
       " 4062,\n",
       " 4121,\n",
       " 4140,\n",
       " 7376,\n",
       " 7377,\n",
       " 7602,\n",
       " 7706,\n",
       " 7882,\n",
       " 7956,\n",
       " 8014,\n",
       " 8025,\n",
       " 8057,\n",
       " 10476,\n",
       " 54713,\n",
       " 54930,\n",
       " 55315,\n",
       " 55638,\n",
       " 55764,\n",
       " 55834,\n",
       " 56072,\n",
       " 56183,\n",
       " 56366,\n",
       " 57048,\n",
       " 57116,\n",
       " 57255,\n",
       " 57659,\n",
       " 58236,\n",
       " 58434,\n",
       " 58903,\n",
       " 59053,\n",
       " 59197,\n",
       " 59223,\n",
       " 59224,\n",
       " 59225,\n",
       " 59226,\n",
       " 59227,\n",
       " 59228,\n",
       " 59230,\n",
       " 59231,\n",
       " 59307,\n",
       " 59684,\n",
       " 59875,\n",
       " 59902,\n",
       " 59903,\n",
       " 60066,\n",
       " 60203,\n",
       " 60254,\n",
       " 60503,\n",
       " 60504,\n",
       " 60505,\n",
       " 60519,\n",
       " 60520,\n",
       " 60521,\n",
       " 60522,\n",
       " 60523,\n",
       " 60524,\n",
       " 60525,\n",
       " 60526,\n",
       " 60527,\n",
       " 60528,\n",
       " 60529,\n",
       " 60530,\n",
       " 60531,\n",
       " 60532,\n",
       " 60533,\n",
       " 60534,\n",
       " 60564,\n",
       " 60595,\n",
       " 60631,\n",
       " 60632,\n",
       " 60647,\n",
       " 60674,\n",
       " 60694,\n",
       " 60695,\n",
       " 60711,\n",
       " 60712,\n",
       " 60713,\n",
       " 60714,\n",
       " 60715,\n",
       " 60716,\n",
       " 60717,\n",
       " 60795,\n",
       " 60823,\n",
       " 60830,\n",
       " 60832,\n",
       " 60833,\n",
       " 60834,\n",
       " 60835,\n",
       " 60836,\n",
       " 60837,\n",
       " 60838,\n",
       " 60873,\n",
       " 60887,\n",
       " 60888,\n",
       " 60889,\n",
       " 60890,\n",
       " 60891,\n",
       " 60892,\n",
       " 60893,\n",
       " 60894,\n",
       " 60895,\n",
       " 60905,\n",
       " 60934,\n",
       " 60935,\n",
       " 60936,\n",
       " 60937,\n",
       " 60938,\n",
       " 60939,\n",
       " 60940,\n",
       " 60941,\n",
       " 60942,\n",
       " 60943,\n",
       " 60944,\n",
       " 60951,\n",
       " 60955,\n",
       " 60957,\n",
       " 60958,\n",
       " 60962,\n",
       " 60966,\n",
       " 60971,\n",
       " 60977,\n",
       " 61046,\n",
       " 61047,\n",
       " 61056,\n",
       " 61057,\n",
       " 61058,\n",
       " 61059,\n",
       " 61060,\n",
       " 61070,\n",
       " 61071,\n",
       " 61072,\n",
       " 61077,\n",
       " 61078,\n",
       " 61079,\n",
       " 61138,\n",
       " 61139,\n",
       " 61140,\n",
       " 61141,\n",
       " 61142,\n",
       " 61144,\n",
       " 61174,\n",
       " 61175,\n",
       " 61176,\n",
       " 61177,\n",
       " 61178,\n",
       " 61179,\n",
       " 61180,\n",
       " 61181,\n",
       " 61182,\n",
       " 61183,\n",
       " 61203,\n",
       " 61328,\n",
       " 61329,\n",
       " 61357,\n",
       " 61363,\n",
       " 61379,\n",
       " 61380,\n",
       " 61381,\n",
       " 61382,\n",
       " 61383,\n",
       " 61384,\n",
       " 61426,\n",
       " 61427,\n",
       " 61428}"
      ]
     },
     "execution_count": 262,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(set(nercs.loc[(nercs.nerc=='MRO') &\n",
    "              (nercs.year==2017),'plant id'])\n",
    " - mro_2017)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 263,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 plants dropped in RFC\n",
      "0 plants dropped in NPCC\n",
      "0 plants dropped in WECC\n",
      "0 plants dropped in SERC\n",
      "0 plants dropped in SPP\n",
      "0 plants dropped in TRE\n",
      "0 plants dropped in MRO\n",
      "0 plants dropped in FRCC\n",
      "0 plants dropped in ASCC\n",
      "0 plants dropped in HICC\n"
     ]
    }
   ],
   "source": [
    "for nerc in nerc_labels:\n",
    "    l = len((set(labeled.loc[(labeled.nerc == nerc) &\n",
    "              (labeled.year == 2016), 'plant id'])\n",
    "     - set(labeled.loc[(labeled.nerc == nerc) &\n",
    "              (labeled.year == 2017), 'plant id'])))\n",
    "    \n",
    "    print('{} plants dropped in {}'.format(l, nerc))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 264,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "set()"
      ]
     },
     "execution_count": 264,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "(set(labeled.loc[(labeled.nerc == 'MRO') &\n",
    "              (labeled.year == 2016), 'plant id'])\n",
    " - set(labeled.loc[(labeled.nerc == 'MRO') &\n",
    "              (labeled.year == 2017), 'plant id']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 265,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2017-11-06T16:59:09.744166Z",
     "start_time": "2017-11-06T16:59:09.667898Z"
    },
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "path = join(data_path, 'Facility labels', 'Facility locations_RF.csv')\n",
    "labeled.to_csv(path, index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "psci",
   "language": "python",
   "name": "psci"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}