{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Assign NERC labels to plants using 860 data and RandomForest\n", "\n", "## Instructions\n", "Make sure the `file_date` parameter below is set to whatever value you would like appended to file names.\n", "\n", "Change the `most_recent_860_year` parameter below to match the most up-to-date EIA-860 annual data file. As of March 2018 this is 2016.\n", "\n", "EIA-860 (annual) excel files will need to be [downloaded](https://www.eia.gov/electricity/data/eia860/) and unzipped to the `EIA downloads` folder. Make sure that all years from 2012 through the most recent data year are available. Also download the most recent [EIA-860m](https://www.eia.gov/electricity/data/eia860m/) to `EIA downloads`.\n", "\n", "The most recent annual 860 file available (as of March 2018) represents 2016 data. When newer EIA-860 annual files are added the dictionary with pandas `read_excel` parameters will need to be updated. Note that EIA puts out an Early Release version of 860 with extra header rows and columns, so be sure to appropriately adjust the `skiprows` and `usecols` parameters if using an Early Release file.\n", "\n", "The entire notebook can be run at once using *Run All Cells*" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T20:00:19.547289Z", "start_time": "2017-11-06T20:00:19.534260Z" }, "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import os\n", "from os.path import join\n", "import pandas as pd\n", "from sklearn import neighbors, metrics\n", "from sklearn.preprocessing import LabelEncoder\n", "from sklearn.model_selection import train_test_split, GridSearchCV\n", "from collections import Counter\n", "from copy import deepcopy\n", "\n", "\n", "cwd = os.getcwd()\n", "data_path = join(cwd, '..', 'Data storage')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Date string for filenames\n", "This will be inserted into all filenames (reading and writing)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "file_date = '2018-03-06'\n", "most_recent_860_year = 2016" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load data\n", "This loads facility data that has been assembled from the EIA bulk data file, and EIA-860 excel files. The EIA-860 excel files need to be downloaded manually." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load EIA facility data\n", "Only need to keep the plant id, year (as a check that plants don't move between years), and lat/lon" ] }, { "cell_type": "code", "execution_count": 204, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T14:34:57.939401Z", "start_time": "2017-11-06T14:34:48.570853Z" }, "collapsed": true }, "outputs": [], "source": [ "path = os.path.join(data_path, 'Derived data',\n", " 'Facility gen fuels and CO2 {}.csv'.format(file_date))\n", "facility_df = pd.read_csv(path)\n", "facility_df['state'] = facility_df['geography'].str[-2:]" ] }, { "cell_type": "code", "execution_count": 205, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T14:42:35.675435Z", "start_time": "2017-11-06T14:42:34.964483Z" }, "collapsed": true }, "outputs": [], "source": [ "plants = facility_df.loc[:, ['plant id', 'year', 'lat', 'lon', 'state']]\n", "plants.drop_duplicates(inplace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the 2017 facility dataframe only includes annually reporting facilities I'm going to duplicate the plant id, lat/lon, and state information from 2016." ] }, { "cell_type": "code", "execution_count": 206, "metadata": {}, "outputs": [], "source": [ "# make a copy of 2016 (or most recent annual data year) and change the year to \n", "plants_2017 = plants.loc[plants['year'] == most_recent_860_year, :].copy()\n", "plants_2017.loc[:, 'year'] += 1\n", "\n", "plants = pd.concat([plants.loc[plants['year']<=most_recent_860_year, :], plants_2017])" ] }, { "cell_type": "code", "execution_count": 207, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "set()" ] }, "execution_count": 207, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(set(plants.loc[plants.year==2016, 'plant id']) - set(plants.loc[plants.year==2017, 'plant id']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load known NERC labels from EIA-860\n", "Current NERCS go back to 2012. Use all annual 860 files from 2012 through the most recent available. Extend the dictionary of dictionaries below with any files available after 2016. `io`, `skiprows`, and `usecols` are all input parameters for the Pandas `read_excel` function." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T14:27:35.633747Z", "start_time": "2017-11-06T14:27:35.621006Z" }, "collapsed": true }, "outputs": [], "source": [ "eia_base_path = join(data_path, 'EIA downloads')\n", "file_860_info = {\n", "# 2011: {'io': join(eia_base_path, 'eia8602011', 'Plant.xlsx'),\n", "# 'skiprows': 0,\n", "# 'parse_cols': 'B,J'},\n", " 2012: {'io': join(eia_base_path, 'eia8602012', 'PlantY2012.xlsx'),\n", " 'skiprows': 0,\n", " 'usecols': 'B,J'},\n", " 2013: {'io': join(eia_base_path, 'eia8602013', '2___Plant_Y2013.xlsx'),\n", " 'skiprows': 0,\n", " 'usecols': 'C,L'},\n", " 2014: {'io': join(eia_base_path, 'eia8602014', '2___Plant_Y2014.xlsx'),\n", " 'skiprows': 0,\n", " 'usecols': 'C,L'},\n", " 2015: {'io': join(eia_base_path, 'eia8602015', '2___Plant_Y2015.xlsx'),\n", " 'skiprows': 0,\n", " 'usecols': 'C,L'},\n", " 2016: {'io': join(eia_base_path, 'eia8602016', '2___Plant_Y2016.xlsx'),\n", " 'skiprows': 0,\n", " 'usecols': 'C,L'}\n", "}" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T14:30:20.380645Z", "start_time": "2017-11-06T14:29:57.233977Z" } }, "outputs": [], "source": [ "eia_nercs = {}\n", "for key, args in file_860_info.items():\n", " eia_nercs[key] = pd.read_excel(**args)\n", " eia_nercs[key].columns = ['plant id', 'nerc']\n", " eia_nercs[key]['year'] = key" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I want to assign NERC regions for every year. We have data for 2012 onward from the EIA-860 files. For the purpose of this analysis I'll assume that all years from 2001-2011 are the same NERC as 2012.\n", "\n", "Also assume that values in 2017 are the same as in 2016. I'll fill in nerc values for plants that were built in 2017 below." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "for year in range(2001, 2012):\n", " # the pandas .copy() method is deep by default but I'm not sure in this case\n", " df = deepcopy(eia_nercs[2012])\n", " df['year'] = year\n", " \n", " eia_nercs[year] = df\n", " \n", "df = deepcopy(eia_nercs[2016])\n", "df['year'] = 2017\n", "eia_nercs[2017] = df" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys([2012, 2013, 2014, 2015, 2016, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2017])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eia_nercs.keys()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idnercyear
010867SERC2001
150903RFC2001
210671SPP2001
32527NPCC2001
43305SERC2001
\n", "
" ], "text/plain": [ " plant id nerc year\n", "0 10867 SERC 2001\n", "1 50903 RFC 2001\n", "2 10671 SPP 2001\n", "3 2527 NPCC 2001\n", "4 3305 SERC 2001" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "eia_nercs[2001].head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T14:30:23.906158Z", "start_time": "2017-11-06T14:30:23.877433Z" } }, "outputs": [], "source": [ "nercs = pd.concat(eia_nercs.values())\n", "nercs.sort_values('year', inplace=True)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idnercyear
546556512RFC2001
48661481NPCC2001
48651480NPCC2001
4864805NPCC2001
486354451WECC2001
\n", "
" ], "text/plain": [ " plant id nerc year\n", "5465 56512 RFC 2001\n", "4866 1481 NPCC 2001\n", "4865 1480 NPCC 2001\n", "4864 805 NPCC 2001\n", "4863 54451 WECC 2001" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nercs.head()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "set()" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(set(nercs.loc[(nercs.nerc == 'MRO') &\n", " (nercs.year == 2016), 'plant id'])\n", " - set(nercs.loc[(nercs.nerc == 'MRO') &\n", " (nercs.year == 2017), 'plant id']))" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,\n", " 2012, 2013, 2014, 2015, 2016, 2017])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nercs.year.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Look for plants listed with different NERC labels\n", "\n", "** This may not matter anymore if I use NERC labels for each year **\n", "\n", "There are 30 plants duplicated. Five of them don't have a NERC label in one of the years. The largest move is from MRO to other regions (12), with most of those to SPP (7). After that, moves from RFC (5) to MRO (3) and SERC (2). There are also some moves from WECC and FRCC to HICC/ASCC - these might be diesel generators that get moved.\n", "\n", "The plants that have duplicate NERC region labels represent a small fraction of national generation, but one that is growing over time. By 2016 they consist of 0.15% of national generation." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T14:31:12.401717Z", "start_time": "2017-11-06T14:31:12.388812Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "7289 total records\n", "7289 unique plants\n", "8060 total records\n", "8060 unique plants\n", "8520 total records\n", "8520 unique plants\n", "8928 total records\n", "8928 unique plants\n", "9711 total records\n", "9711 unique plants\n", "7289 total records\n", "7289 unique plants\n", "7289 total records\n", "7289 unique plants\n", "7289 total records\n", "7289 unique plants\n", "7289 total records\n", "7289 unique plants\n", "7289 total records\n", "7289 unique plants\n", "7289 total records\n", "7289 unique plants\n", "7289 total records\n", "7289 unique plants\n", "7289 total records\n", "7289 unique plants\n", "7289 total records\n", "7289 unique plants\n", "7289 total records\n", "7289 unique plants\n", "7289 total records\n", "7289 unique plants\n", "9711 total records\n", "9711 unique plants\n", "132398 total records\n", "10068 unique plants\n" ] } ], "source": [ "for df_ in list(eia_nercs.values()) + [nercs]:\n", " print('{} total records'.format(len(df_)))\n", " print('{} unique plants'.format(len(df_['plant id'].unique())))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T14:43:34.109577Z", "start_time": "2017-11-06T14:43:34.102600Z" } }, "outputs": [ { "data": { "text/plain": [ "array([56512, 1481, 1480, ..., 61001, 61003, 55160])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dup_plants = nercs.loc[nercs['plant id'].duplicated(keep=False), 'plant id'].unique()\n", "dup_plants" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T14:43:41.526762Z", "start_time": "2017-11-06T14:43:41.503604Z" } }, "outputs": [ { "data": { "text/plain": [ "Counter({('ASCC', nan): 2,\n", " ('ASCC',): 130,\n", " ('FRCC', 'HICC'): 1,\n", " ('FRCC',): 205,\n", " ('HICC',): 52,\n", " ('MRO', 'RFC'): 2,\n", " ('MRO', 'SERC'): 1,\n", " ('MRO', 'SPP'): 7,\n", " ('MRO', 'WECC'): 3,\n", " ('MRO',): 1051,\n", " ('NPCC',): 1257,\n", " ('RFC', 'MRO'): 3,\n", " ('RFC', 'SERC'): 2,\n", " ('RFC',): 1626,\n", " ('SERC', 'SPP'): 1,\n", " ('SERC',): 1832,\n", " ('SPP', 'SERC'): 2,\n", " ('SPP', 'TRE'): 1,\n", " ('SPP',): 470,\n", " ('TRE',): 461,\n", " ('WECC', 'ASCC'): 2,\n", " ('WECC', 'HICC'): 1,\n", " ('WECC',): 2892,\n", " (nan, 'WECC', 'ASCC'): 3,\n", " (nan, 'WECC', 'HICC'): 1,\n", " (nan,): 35})" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "region_list = []\n", "for plant in dup_plants:\n", " regions = nercs.loc[nercs['plant id'] == plant, 'nerc'].unique()\n", "# regions = regions.tolist()\n", " region_list.append(regions)\n", "Counter(tuple(x) for x in region_list)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T14:43:46.932593Z", "start_time": "2017-11-06T14:43:46.804826Z" } }, "outputs": [ { "data": { "text/plain": [ "year\n", "2001 0.993784\n", "2002 0.994703\n", "2003 0.993925\n", "2004 0.996498\n", "2005 0.997682\n", "2006 0.998378\n", "2007 0.998707\n", "2008 0.998873\n", "2009 0.999267\n", "2010 0.999767\n", "2011 0.999990\n", "2012 1.000000\n", "2013 1.000000\n", "2014 1.000000\n", "2015 1.000000\n", "2016 1.000000\n", "2017 0.999998\n", "Name: generation (MWh), dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(facility_df.loc[facility_df['plant id'].isin(dup_plants), :]\n", " .groupby('year')['generation (MWh)'].sum()\n", " / facility_df.loc[:, :]\n", " .groupby('year')['generation (MWh)'].sum())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Some plants in EIA-860 don't have NERC labels. Drop them now.\n", "This is my training data. All of these plants should still be in my `plants` dataframe." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "232" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nan_plants = {}\n", "all_nan = []\n", "years = nercs.year.unique()\n", "for year in years:\n", " nan_plants[year] = nercs.loc[(nercs.year == year) &\n", " (nercs.isnull().any(axis=1)), 'plant id'].tolist()\n", " all_nan.extend(nan_plants[year])\n", "\n", "# number of plants that don't have a nerc in at least one year\n", "len(all_nan)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# drop all the rows without a nerc value\n", "nercs.dropna(inplace=True)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[58651,\n", " 58656,\n", " 58659,\n", " 58662,\n", " 58639,\n", " 58640,\n", " 58549,\n", " 58684,\n", " 58277,\n", " 58380,\n", " 58425,\n", " 58405,\n", " 60563,\n", " 60588,\n", " 60260,\n", " 60250,\n", " 60243,\n", " 60244,\n", " 60245,\n", " 60328,\n", " 61166,\n", " 61172,\n", " 61364,\n", " 60814,\n", " 61099,\n", " 61101,\n", " 61068,\n", " 58989,\n", " 58982,\n", " 58977,\n", " 58837,\n", " 59035,\n", " 60125,\n", " 60024,\n", " 66,\n", " 70]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nan_plants[2017]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load EIA-860m for some info on recent facilities\n", "The EIA-860m (monthly) data file has an up-to-date list of all operating power plants and their associated balancing authority. It does not list the NERC region, so it can't be used to assign NERC labels for all plants. But in some cases knowing the state and balancing authority is enough to make a good guess about which NERC a plant is in.\n", "\n", "Assigning NERC region labels has the lowest accuracy for plants in SPP and TRE. To compensate, I'm going to assume that anything in TX or OK and SWPP balancing authority is in SPP. On the flip side, if it's in TX and ERCOT I'll assign it to TRE.\n", "\n", "Only do this for plants that come online since the most recent 860 annual data.\n", "\n", "**NOTE**\n", "Because I'm looking at units that came online in 2017 some of the plant ids will already exist" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [], "source": [ "path = join(data_path, 'EIA downloads', 'december_generator2017.xlsx')\n", "\n", "# Check the excel file columns if there is a read error. They should match\n", "# the plant id, plant state, operating year, and balancing authority code.\n", "_m860 = pd.read_excel(path, sheet_name='Operating',skip_footer=1,\n", " usecols='C,F,P,AE', skiprows=0)\n", "_m860.columns = _m860.columns.str.lower()" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idplant stateoperating yearbalancing authority code
2137361632IA2017MISO
2137461633MA2017ISNE
2137561634MA2017ISNE
2137661635MA2017ISNE
2137761636MA2017ISNE
\n", "
" ], "text/plain": [ " plant id plant state operating year balancing authority code\n", "21373 61632 IA 2017 MISO\n", "21374 61633 MA 2017 ISNE\n", "21375 61634 MA 2017 ISNE\n", "21376 61635 MA 2017 ISNE\n", "21377 61636 MA 2017 ISNE" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# most_recent_860_year is defined at the top of this notebook\n", "# The goal here is to only look at plants that started operating after\n", "# the most recent annual data. So only include units starting after\n", "# the last annual data and that don't have plant ids in the nercs\n", "# dataframe\n", "\n", "m860 = _m860.loc[(_m860['operating year'] > most_recent_860_year)].copy() #&\n", "# (~_m860['plant id'].isin(nercs['plant id'].unique()))].copy()\n", "m860.tail()" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [], "source": [ "m860.loc[(m860['plant state'].isin(['TX', 'OK'])) &\n", " (m860['balancing authority code'] == 'SWPP'), 'nerc'] = 'SPP'\n", "\n", "m860.loc[(m860['plant state'].isin(['TX'])) &\n", " (m860['balancing authority code'] == 'ERCO'), 'nerc'] = 'TRE'" ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idplant stateoperating yearbalancing authority codenerc
343165OK2017SWPPSPP
344165OK2017SWPPSPP
48362953OK2017SWPPSPP
48372953OK2017SWPPSPP
48382953OK2017SWPPSPP
\n", "
" ], "text/plain": [ " plant id plant state operating year balancing authority code nerc\n", "343 165 OK 2017 SWPP SPP\n", "344 165 OK 2017 SWPP SPP\n", "4836 2953 OK 2017 SWPP SPP\n", "4837 2953 OK 2017 SWPP SPP\n", "4838 2953 OK 2017 SWPP SPP" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Drop all rows except the ones I've labeled as TRE or SPP\n", "m860.dropna(inplace=True)\n", "m860.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make lists of plant codes for SPP and TRE facilities" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idnercyear
546556512RFC2001
48661481NPCC2001
48651480NPCC2001
4864805NPCC2001
486354451WECC2001
\n", "
" ], "text/plain": [ " plant id nerc year\n", "5465 56512 RFC 2001\n", "4866 1481 NPCC 2001\n", "4865 1480 NPCC 2001\n", "4864 805 NPCC 2001\n", "4863 54451 WECC 2001" ] }, "execution_count": 131, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nercs.head()" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [], "source": [ "# Create additional dataframes with 2017 SPP and TRE plants.\n", "# Use these to fill in values for 2017 plants\n", "\n", "m860_spp_plants = (m860.loc[m860['nerc'] == 'SPP', 'plant id']\n", " .drop_duplicates()\n", " .reset_index(drop=True))\n", "\n", "additional_spp = pd.DataFrame(m860_spp_plants.copy())\n", "# additional_spp['plant id'] = m860_spp_plants\n", "additional_spp['nerc'] = 'SPP'\n", "additional_spp['year'] = 2017\n", "\n", "m860_tre_plants = (m860.loc[m860['nerc'] == 'TRE', 'plant id']\n", " .drop_duplicates()\n", " .reset_index(drop=True))\n", "\n", "additional_tre = pd.DataFrame(m860_tre_plants)\n", "# additional_tre['plant id'] = m860_tre_plants\n", "additional_tre['nerc'] = 'TRE'\n", "additional_tre['year'] = 2017" ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idnercyear
0165SPP2017
12953SPP2017
260414SPP2017
361221SPP2017
461261SPP2017
561614SPP2017
661615SPP2017
761616SPP2017
861617SPP2017
961618SPP2017
\n", "
" ], "text/plain": [ " plant id nerc year\n", "0 165 SPP 2017\n", "1 2953 SPP 2017\n", "2 60414 SPP 2017\n", "3 61221 SPP 2017\n", "4 61261 SPP 2017\n", "5 61614 SPP 2017\n", "6 61615 SPP 2017\n", "7 61616 SPP 2017\n", "8 61617 SPP 2017\n", "9 61618 SPP 2017" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "additional_spp" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idnercyear
056984TRE2017
159066TRE2017
259193TRE2017
359206TRE2017
459245TRE2017
559712TRE2017
659812TRE2017
760122TRE2017
860210TRE2017
960217TRE2017
1060366TRE2017
1160372TRE2017
1260436TRE2017
1360459TRE2017
1460460TRE2017
1560581TRE2017
1660682TRE2017
1760690TRE2017
1860774TRE2017
1960901TRE2017
2060902TRE2017
2160983TRE2017
2260989TRE2017
2361205TRE2017
2461309TRE2017
2561362TRE2017
2661409TRE2017
2761410TRE2017
2861411TRE2017
\n", "
" ], "text/plain": [ " plant id nerc year\n", "0 56984 TRE 2017\n", "1 59066 TRE 2017\n", "2 59193 TRE 2017\n", "3 59206 TRE 2017\n", "4 59245 TRE 2017\n", "5 59712 TRE 2017\n", "6 59812 TRE 2017\n", "7 60122 TRE 2017\n", "8 60210 TRE 2017\n", "9 60217 TRE 2017\n", "10 60366 TRE 2017\n", "11 60372 TRE 2017\n", "12 60436 TRE 2017\n", "13 60459 TRE 2017\n", "14 60460 TRE 2017\n", "15 60581 TRE 2017\n", "16 60682 TRE 2017\n", "17 60690 TRE 2017\n", "18 60774 TRE 2017\n", "19 60901 TRE 2017\n", "20 60902 TRE 2017\n", "21 60983 TRE 2017\n", "22 60989 TRE 2017\n", "23 61205 TRE 2017\n", "24 61309 TRE 2017\n", "25 61362 TRE 2017\n", "26 61409 TRE 2017\n", "27 61410 TRE 2017\n", "28 61411 TRE 2017" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "additional_tre" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Append my 2017 SPP and TRE guesses to the full nerc dataframe" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [], "source": [ "nercs = pd.concat([nercs, additional_spp, additional_tre])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Clean and prep data for KNN" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idyearlatlonstate
01001201739.9242-87.4244IN
121001201639.9242-87.4244IN
241001201539.9242-87.4244IN
361001201439.9242-87.4244IN
481001201339.9242-87.4244IN
\n", "
" ], "text/plain": [ " plant id year lat lon state\n", "0 1001 2017 39.9242 -87.4244 IN\n", "12 1001 2016 39.9242 -87.4244 IN\n", "24 1001 2015 39.9242 -87.4244 IN\n", "36 1001 2014 39.9242 -87.4244 IN\n", "48 1001 2013 39.9242 -87.4244 IN" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "plants.head()" ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idnercyear
2461309TRE2017
2561362TRE2017
2661409TRE2017
2761410TRE2017
2861411TRE2017
\n", "
" ], "text/plain": [ " plant id nerc year\n", "24 61309 TRE 2017\n", "25 61362 TRE 2017\n", "26 61409 TRE 2017\n", "27 61410 TRE 2017\n", "28 61411 TRE 2017" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nercs.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Checked to make sure the type of merge doesn't matter once rows without nerc values are dropped" ] }, { "cell_type": "code", "execution_count": 209, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:39:54.416966Z", "start_time": "2017-11-06T16:39:54.383812Z" }, "collapsed": true }, "outputs": [], "source": [ "df = pd.merge(plants, nercs, on=['plant id', 'year'], how='left')" ] }, { "cell_type": "code", "execution_count": 210, "metadata": {}, "outputs": [], "source": [ "omitted = set(df['plant id'].unique()) - set(nercs['plant id'].unique())" ] }, { "cell_type": "code", "execution_count": 213, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idyearlatlonstatenerc
01001201639.9242-87.4244INRFC
11001201539.9242-87.4244INRFC
21001201439.9242-87.4244INRFC
31001201339.9242-87.4244INRFC
41001201239.9242-87.4244INRFC
\n", "
" ], "text/plain": [ " plant id year lat lon state nerc\n", "0 1001 2016 39.9242 -87.4244 IN RFC\n", "1 1001 2015 39.9242 -87.4244 IN RFC\n", "2 1001 2014 39.9242 -87.4244 IN RFC\n", "3 1001 2013 39.9242 -87.4244 IN RFC\n", "4 1001 2012 39.9242 -87.4244 IN RFC" ] }, "execution_count": 213, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 214, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idyearlatlonstatenerc
10085159957201741.226900-88.683600ILRFC
10085259257201737.305833-121.755000CAWECC
10085359256201738.393333-121.927778CAWECC
10085410475201741.683600-87.423300INRFC
10085555370201739.851389-79.070556PARFC
\n", "
" ], "text/plain": [ " plant id year lat lon state nerc\n", "100851 59957 2017 41.226900 -88.683600 IL RFC\n", "100852 59257 2017 37.305833 -121.755000 CA WECC\n", "100853 59256 2017 38.393333 -121.927778 CA WECC\n", "100854 10475 2017 41.683600 -87.423300 IN RFC\n", "100855 55370 2017 39.851389 -79.070556 PA RFC" ] }, "execution_count": 214, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "code", "execution_count": 215, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:39:55.684713Z", "start_time": "2017-11-06T16:39:55.678207Z" } }, "outputs": [ { "data": { "text/plain": [ "Index(['plant id', 'year', 'lat', 'lon', 'state', 'nerc'], dtype='object')" ] }, "execution_count": 215, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Drop plants that don't have lat/lon data (using just lon to check), and then drop duplicates. If any plants have kept the same plant id but moved over time (maybe a diesel generator?) or switched NERC they will show up twice." ] }, { "cell_type": "code", "execution_count": 216, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:40:08.841245Z", "start_time": "2017-11-06T16:40:08.805376Z" }, "collapsed": true }, "outputs": [], "source": [ "cols = ['plant id', 'lat', 'lon', 'nerc', 'state', 'year']\n", "df_slim = (df.loc[:, cols].dropna(subset=['lon'])\n", " .drop_duplicates(subset=['plant id', 'year', 'nerc']))" ] }, { "cell_type": "code", "execution_count": 217, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:40:29.960293Z", "start_time": "2017-11-06T16:40:29.944141Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idlatlonnercstateyear
1008515995741.226900-88.683600RFCIL2017
1008525925737.305833-121.755000WECCCA2017
1008535925638.393333-121.927778WECCCA2017
1008541047541.683600-87.423300RFCIN2017
1008555537039.851389-79.070556RFCPA2017
\n", "
" ], "text/plain": [ " plant id lat lon nerc state year\n", "100851 59957 41.226900 -88.683600 RFC IL 2017\n", "100852 59257 37.305833 -121.755000 WECC CA 2017\n", "100853 59256 38.393333 -121.927778 WECC CA 2017\n", "100854 10475 41.683600 -87.423300 RFC IN 2017\n", "100855 55370 39.851389 -79.070556 RFC PA 2017" ] }, "execution_count": 217, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_slim.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Separate out the list of plants where we don't have NERC labels from EIA-860." ] }, { "cell_type": "code", "execution_count": 218, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:40:35.247883Z", "start_time": "2017-11-06T16:40:35.239582Z" }, "collapsed": true }, "outputs": [], "source": [ "unknown = df_slim.loc[df_slim.nerc.isnull()].copy()" ] }, { "cell_type": "code", "execution_count": 219, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:40:37.583325Z", "start_time": "2017-11-06T16:40:37.568223Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1579 plants/years don't have NERC labels\n", "\n", " plant id lat lon nerc state year\n", "28205 3823 38.27 -78.035 NaN VA 2009\n", "28206 3823 38.27 -78.035 NaN VA 2005\n", "28207 3823 38.27 -78.035 NaN VA 2004\n", "28208 3823 38.27 -78.035 NaN VA 2003\n", "28209 3823 38.27 -78.035 NaN VA 2002\n" ] } ], "source": [ "print(\"{} plants/years don't have NERC labels\\n\".format(len(unknown)))\n", "print(unknown.head())" ] }, { "cell_type": "code", "execution_count": 220, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idlatlonnercstateyear
1008145071237.734100-121.6516NaNCA2017
1008175516229.096900-81.0689NaNFL2017
1008185617134.037200-117.5569NaNCA2017
100822793937.771400-75.6342NaNVA2017
1008255891634.248056-116.0600NaNCA2017
\n", "
" ], "text/plain": [ " plant id lat lon nerc state year\n", "100814 50712 37.734100 -121.6516 NaN CA 2017\n", "100817 55162 29.096900 -81.0689 NaN FL 2017\n", "100818 56171 34.037200 -117.5569 NaN CA 2017\n", "100822 7939 37.771400 -75.6342 NaN VA 2017\n", "100825 58916 34.248056 -116.0600 NaN CA 2017" ] }, "execution_count": 220, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unknown.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create X and y matricies\n", "X is lat/lon and year\n", "\n", "y is the NERC label\n", "\n", "For both, I'm only using plants where we have all data (no `NaN`s). Not doing any transformation of the lat/lon at this time." ] }, { "cell_type": "code", "execution_count": 221, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T20:06:21.393849Z", "start_time": "2017-11-06T20:06:21.367798Z" }, "collapsed": true }, "outputs": [], "source": [ "X = df_slim.loc[df_slim.notnull().all(axis=1), ['lat', 'lon', 'year']]\n", "y = df_slim.loc[df_slim.notnull().all(axis=1), 'nerc']" ] }, { "cell_type": "code", "execution_count": 222, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T20:06:27.239811Z", "start_time": "2017-11-06T20:06:27.233495Z" } }, "outputs": [ { "data": { "text/plain": [ "99127" ] }, "execution_count": 222, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(X)" ] }, { "cell_type": "code", "execution_count": 223, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 223, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Make sure that unknown and X include all records from df_slim\n", "len(X) + len(unknown) - len(df_slim)" ] }, { "cell_type": "code", "execution_count": 224, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T20:06:27.249590Z", "start_time": "2017-11-06T20:06:27.242414Z" }, "collapsed": true }, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(\n", " X, y, test_size=0.33, random_state=42)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GridSearch to find the best parameters in a RandomForest Classifier\n", "\n", "I previously used k-nearest neighbors with just lat/lon as input features. The problem is that some facilities don't have lat/lon data. They do usually have a state geography label though. Categorical labels don't work well in KNN, but the combination of lat/lon and a state label will work well in a tree model. RandomForest is usually a quite effective tree model and my results are more accurate with this than they were with KNN." ] }, { "cell_type": "code", "execution_count": 225, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier" ] }, { "cell_type": "code", "execution_count": 226, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 3 folds for each of 36 candidates, totalling 108 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 32.3s\n", "[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed: 1.3min finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=None, error_score='raise',\n", " estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n", " oob_score=False, random_state=None, verbose=0,\n", " warm_start=False),\n", " fit_params=None, iid=False, n_jobs=-1,\n", " param_grid={'n_estimators': [5, 10, 25, 50], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 3, 5]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n", " scoring=None, verbose=1)" ] }, "execution_count": 226, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf = RandomForestClassifier()\n", "params = dict(\n", " n_estimators = [5, 10, 25, 50],\n", " min_samples_split = [2, 5, 10],\n", " min_samples_leaf = [1, 3, 5],\n", ")\n", "\n", "clf_rf = GridSearchCV(rf, params, n_jobs=-1, iid=False, verbose=1)\n", "clf_rf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 227, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,\n", " oob_score=False, random_state=None, verbose=0,\n", " warm_start=False), 0.9797785431782634)" ] }, "execution_count": 227, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf_rf.best_estimator_, clf_rf.best_score_" ] }, { "cell_type": "code", "execution_count": 228, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9822695035460993" ] }, "execution_count": 228, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf_rf.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 229, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T20:10:03.521287Z", "start_time": "2017-11-06T20:10:03.510982Z" } }, "outputs": [], "source": [ "nerc_labels = nercs.nerc.dropna().unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accuracy score by region" ] }, { "cell_type": "code", "execution_count": 230, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T20:15:48.830308Z", "start_time": "2017-11-06T20:15:48.737017Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RFC : 0.9707198806415517\n", "NPCC : 0.9907823209643111\n", "WECC : 0.9987963672174198\n", "SERC : 0.9690892364305428\n", "SPP : 0.9328663164039697\n", "TRE : 0.9903057419835943\n", "MRO : 0.9853683148335015\n", "FRCC : 0.9768707482993197\n", "ASCC : 0.9965986394557823\n", "HICC : 1.0\n" ] } ], "source": [ "for region in nerc_labels:\n", " mask = y_test == region\n", " \n", " X_masked = X_test[mask]\n", " y_hat_masked = clf_rf.predict(X_masked)\n", " y_test_masked = y_test[mask]\n", " \n", " accuracy = metrics.accuracy_score(y_test_masked, y_hat_masked)\n", " print('{} : {}'.format(region, accuracy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "F1 score by region" ] }, { "cell_type": "code", "execution_count": 231, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T20:18:57.242329Z", "start_time": "2017-11-06T20:18:57.079002Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RFC : 0.9682820202771835\n", "NPCC : 0.9916026020106445\n", "WECC : 0.9985778361229624\n", "SERC : 0.9707861026633492\n", "SPP : 0.9461219656601539\n", "TRE : 0.986627043090639\n", "MRO : 0.9798068481123793\n", "FRCC : 0.9835616438356164\n", "ASCC : 0.9982964224872232\n", "HICC : 1.0\n" ] } ], "source": [ "y_hat = clf_rf.predict(X_test)\n", "\n", "for region in nerc_labels:\n", " f1 = metrics.f1_score(y_test, y_hat, labels=[region], average='macro')\n", " print('{} : {}'.format(region, f1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plants without lat/lon\n", "Use just the state for plants that don't have lat/lon info. Less accurate, especially where NERC regions cross state lines, but better than nothing.\n", "\n", "Need to start with the `lon` column so I can filter to only unknown facilities that also don't have lon" ] }, { "cell_type": "code", "execution_count": 232, "metadata": {}, "outputs": [], "source": [ "cols = ['plant id', 'nerc', 'state', 'year', 'lon']\n", "df_state_slim = (df.loc[:, cols].dropna(subset=['state']).copy())" ] }, { "cell_type": "code", "execution_count": 233, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idnercstateyearlon
01001RFCIN2016-87.4244
11001RFCIN2015-87.4244
21001RFCIN2014-87.4244
31001RFCIN2013-87.4244
41001RFCIN2012-87.4244
\n", "
" ], "text/plain": [ " plant id nerc state year lon\n", "0 1001 RFC IN 2016 -87.4244\n", "1 1001 RFC IN 2015 -87.4244\n", "2 1001 RFC IN 2014 -87.4244\n", "3 1001 RFC IN 2013 -87.4244\n", "4 1001 RFC IN 2012 -87.4244" ] }, "execution_count": 233, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_state_slim.head()" ] }, { "cell_type": "code", "execution_count": 234, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "100831" ] }, "execution_count": 234, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df_state_slim)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Encode state names as numbers for use in sklearn" ] }, { "cell_type": "code", "execution_count": 235, "metadata": {}, "outputs": [], "source": [ "le = LabelEncoder()\n", "df_state_slim.loc[:, 'enc state'] = le.fit_transform(df_state_slim.loc[:, 'state'].tolist())" ] }, { "cell_type": "code", "execution_count": 236, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "100831" ] }, "execution_count": 236, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df_state_slim)" ] }, { "cell_type": "code", "execution_count": 237, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:40:35.247883Z", "start_time": "2017-11-06T16:40:35.239582Z" }, "collapsed": true }, "outputs": [], "source": [ "unknown_state = df_state_slim.loc[(df_state_slim.nerc.isnull()) &\n", " (df_state_slim.lon.isnull())].copy()" ] }, { "cell_type": "code", "execution_count": 238, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(116, 1579)" ] }, "execution_count": 238, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(unknown_state), len(unknown)" ] }, { "cell_type": "code", "execution_count": 239, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T20:06:21.393849Z", "start_time": "2017-11-06T20:06:21.367798Z" }, "collapsed": true }, "outputs": [], "source": [ "X_state = df_state_slim.loc[df_state_slim.notnull().all(axis=1), ['enc state', 'year']].copy()\n", "y_state = df_state_slim.loc[df_state_slim.notnull().all(axis=1), 'nerc'].copy()" ] }, { "cell_type": "code", "execution_count": 240, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T20:06:27.249590Z", "start_time": "2017-11-06T20:06:27.242414Z" }, "collapsed": true }, "outputs": [], "source": [ "X_state_train, X_state_test, y_state_train, y_state_test = train_test_split(\n", " X_state, y_state, test_size=0.33, random_state=42)" ] }, { "cell_type": "code", "execution_count": 241, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting 3 folds for each of 36 candidates, totalling 108 fits\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "[Parallel(n_jobs=-1)]: Done 42 tasks | elapsed: 17.1s\n", "[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed: 45.4s finished\n" ] }, { "data": { "text/plain": [ "GridSearchCV(cv=None, error_score='raise',\n", " estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,\n", " oob_score=False, random_state=None, verbose=0,\n", " warm_start=False),\n", " fit_params=None, iid=False, n_jobs=-1,\n", " param_grid={'n_estimators': [5, 10, 25, 50], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 3, 5]},\n", " pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',\n", " scoring=None, verbose=1)" ] }, "execution_count": 241, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rf = RandomForestClassifier()\n", "params = dict(\n", " n_estimators = [5, 10, 25, 50],\n", " min_samples_split = [2, 5, 10],\n", " min_samples_leaf = [1, 3, 5],\n", ")\n", "\n", "clf_rf_state = GridSearchCV(rf, params, n_jobs=-1, iid=False, verbose=1)\n", "clf_rf_state.fit(X_state_train, y_state_train)" ] }, { "cell_type": "code", "execution_count": 242, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',\n", " max_depth=None, max_features='auto', max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=10,\n", " min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,\n", " oob_score=False, random_state=None, verbose=0,\n", " warm_start=False), 0.9402882327958322)" ] }, "execution_count": 242, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf_rf_state.best_estimator_, clf_rf_state.best_score_" ] }, { "cell_type": "code", "execution_count": 243, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9387418230726906" ] }, "execution_count": 243, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf_rf_state.score(X_state_test, y_state_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Accuracy score by region" ] }, { "cell_type": "code", "execution_count": 244, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T20:15:48.830308Z", "start_time": "2017-11-06T20:15:48.737017Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RFC : 0.9243697478991597\n", "NPCC : 0.9938519744620478\n", "WECC : 0.9943912900032993\n", "SERC : 0.8831431726168568\n", "SPP : 0.5765453495089543\n", "TRE : 1.0\n", "MRO : 0.9614311088556204\n", "FRCC : 0.9973992197659298\n", "ASCC : 1.0\n", "HICC : 1.0\n" ] } ], "source": [ "nerc_labels = nercs.nerc.dropna().unique()\n", "\n", "for region in nerc_labels:\n", " mask = y_state_test == region\n", " \n", " X_state_masked = X_state_test[mask]\n", " y_state_hat_masked = clf_rf_state.predict(X_state_masked)\n", " y_state_test_masked = y_state_test[mask]\n", " \n", " accuracy = metrics.accuracy_score(y_state_test_masked, y_state_hat_masked)\n", " print('{} : {}'.format(region, accuracy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "F1 score by region" ] }, { "cell_type": "code", "execution_count": 245, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T20:18:57.242329Z", "start_time": "2017-11-06T20:18:57.079002Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RFC : 0.9088405397961994\n", "NPCC : 0.9958535718516763\n", "WECC : 0.9933534743202416\n", "SERC : 0.8819259395387301\n", "SPP : 0.7300658376005852\n", "TRE : 0.8763183125599233\n", "MRO : 0.9512929952297263\n", "FRCC : 0.962962962962963\n", "ASCC : 0.9991474850809888\n", "HICC : 0.9975669099756691\n" ] } ], "source": [ "y_state_hat = clf_rf_state.predict(X_state_test)\n", "\n", "for region in nerc_labels:\n", " f1 = metrics.f1_score(y_state_test, y_state_hat, labels=[region], average='macro')\n", " print('{} : {}'.format(region, f1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use best RandomForest parameters to predict NERC for unknown plants" ] }, { "cell_type": "code", "execution_count": 246, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:53:30.016363Z", "start_time": "2017-11-06T16:53:29.583624Z" } }, "outputs": [], "source": [ "unknown.loc[:, 'nerc'] = clf_rf.predict(unknown.loc[:, ['lat', 'lon', 'year']])\n", "unknown_state.loc[:, 'nerc'] = clf_rf_state.predict(unknown_state.loc[:, ['enc state', 'year']])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ensuring that no plants in Alaska or Hawaii are assigned to continental NERCs, or the other way around." ] }, { "cell_type": "code", "execution_count": 247, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:53:34.746917Z", "start_time": "2017-11-06T16:53:34.734080Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['ASCC' 'HICC' 'WECC']\n", "['AK' 'HI']\n" ] } ], "source": [ "print(unknown.loc[unknown.state.isin(['AK', 'HI']), 'nerc'].unique())\n", "print(unknown.loc[unknown.nerc.isin(['HICC', 'ASCC']), 'state'].unique())" ] }, { "cell_type": "code", "execution_count": 248, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:53:38.090243Z", "start_time": "2017-11-06T16:53:38.084127Z" } }, "outputs": [ { "data": { "text/plain": [ "Counter({'ASCC': 44,\n", " 'FRCC': 6,\n", " 'HICC': 39,\n", " 'MRO': 67,\n", " 'NPCC': 264,\n", " 'RFC': 202,\n", " 'SERC': 349,\n", " 'SPP': 99,\n", " 'TRE': 145,\n", " 'WECC': 364})" ] }, "execution_count": 248, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(unknown['nerc'])" ] }, { "cell_type": "code", "execution_count": 249, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idlatlonnercstateyear
28205382338.27-78.035SERCVA2009
28206382338.27-78.035SERCVA2005
28207382338.27-78.035SERCVA2004
28208382338.27-78.035SERCVA2003
28209382338.27-78.035SERCVA2002
\n", "
" ], "text/plain": [ " plant id lat lon nerc state year\n", "28205 3823 38.27 -78.035 SERC VA 2009\n", "28206 3823 38.27 -78.035 SERC VA 2005\n", "28207 3823 38.27 -78.035 SERC VA 2004\n", "28208 3823 38.27 -78.035 SERC VA 2003\n", "28209 3823 38.27 -78.035 SERC VA 2002" ] }, "execution_count": 249, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unknown.head()" ] }, { "cell_type": "code", "execution_count": 250, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idnercstateyearlonenc state
8486710851RFCNJ2006NaN31
8486810851RFCNJ2005NaN31
8486910851RFCNJ2004NaN31
8487010851RFCNJ2002NaN31
8487110851RFCNJ2001NaN31
\n", "
" ], "text/plain": [ " plant id nerc state year lon enc state\n", "84867 10851 RFC NJ 2006 NaN 31\n", "84868 10851 RFC NJ 2005 NaN 31\n", "84869 10851 RFC NJ 2004 NaN 31\n", "84870 10851 RFC NJ 2002 NaN 31\n", "84871 10851 RFC NJ 2001 NaN 31" ] }, "execution_count": 250, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unknown_state.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Export plants with lat/lon, state, and nerc" ] }, { "cell_type": "code", "execution_count": 251, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idnercyear
2461309TRE2017
2561362TRE2017
2661409TRE2017
2761410TRE2017
2861411TRE2017
\n", "
" ], "text/plain": [ " plant id nerc year\n", "24 61309 TRE 2017\n", "25 61362 TRE 2017\n", "26 61409 TRE 2017\n", "27 61410 TRE 2017\n", "28 61411 TRE 2017" ] }, "execution_count": 251, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nercs.tail()" ] }, { "cell_type": "code", "execution_count": 252, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:54:00.912370Z", "start_time": "2017-11-06T16:54:00.897855Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idlatlonnercstateyear
28205382338.27-78.035SERCVA2009
28206382338.27-78.035SERCVA2005
28207382338.27-78.035SERCVA2004
28208382338.27-78.035SERCVA2003
28209382338.27-78.035SERCVA2002
\n", "
" ], "text/plain": [ " plant id lat lon nerc state year\n", "28205 3823 38.27 -78.035 SERC VA 2009\n", "28206 3823 38.27 -78.035 SERC VA 2005\n", "28207 3823 38.27 -78.035 SERC VA 2004\n", "28208 3823 38.27 -78.035 SERC VA 2003\n", "28209 3823 38.27 -78.035 SERC VA 2002" ] }, "execution_count": 252, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unknown.head()" ] }, { "cell_type": "code", "execution_count": 253, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:54:02.144417Z", "start_time": "2017-11-06T16:54:02.126662Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idnercstateyearlonenc state
9261510257WECCCA2004NaN4
9261610257WECCCA2003NaN4
9261710257WECCCA2002NaN4
9261810257WECCCA2001NaN4
9278256508WECCCA2007NaN4
\n", "
" ], "text/plain": [ " plant id nerc state year lon enc state\n", "92615 10257 WECC CA 2004 NaN 4\n", "92616 10257 WECC CA 2003 NaN 4\n", "92617 10257 WECC CA 2002 NaN 4\n", "92618 10257 WECC CA 2001 NaN 4\n", "92782 56508 WECC CA 2007 NaN 4" ] }, "execution_count": 253, "metadata": {}, "output_type": "execute_result" } ], "source": [ "unknown_state.tail()" ] }, { "cell_type": "code", "execution_count": 267, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "31" ] }, "execution_count": 267, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(unknown_state['plant id'].unique())" ] }, { "cell_type": "code", "execution_count": 254, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:54:06.371732Z", "start_time": "2017-11-06T16:54:06.354901Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
plant idlatlonnercstateyear
0100139.9242-87.4244RFCIN2016
1100139.9242-87.4244RFCIN2015
2100139.9242-87.4244RFCIN2014
3100139.9242-87.4244RFCIN2013
4100139.9242-87.4244RFCIN2012
\n", "
" ], "text/plain": [ " plant id lat lon nerc state year\n", "0 1001 39.9242 -87.4244 RFC IN 2016\n", "1 1001 39.9242 -87.4244 RFC IN 2015\n", "2 1001 39.9242 -87.4244 RFC IN 2014\n", "3 1001 39.9242 -87.4244 RFC IN 2013\n", "4 1001 39.9242 -87.4244 RFC IN 2012" ] }, "execution_count": 254, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_slim.head()" ] }, { "cell_type": "code", "execution_count": 255, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:54:07.622073Z", "start_time": "2017-11-06T16:54:07.597436Z" }, "collapsed": true }, "outputs": [], "source": [ "labeled = pd.concat([df_slim.loc[df_slim.notnull().all(axis=1)],\n", " unknown,\n", " unknown_state.loc[:, ['plant id', 'nerc', 'state', 'year']]])" ] }, { "cell_type": "code", "execution_count": 256, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
latlonnercplant idstateyear
92615NaNNaNWECC10257CA2004
92616NaNNaNWECC10257CA2003
92617NaNNaNWECC10257CA2002
92618NaNNaNWECC10257CA2001
92782NaNNaNWECC56508CA2007
\n", "
" ], "text/plain": [ " lat lon nerc plant id state year\n", "92615 NaN NaN WECC 10257 CA 2004\n", "92616 NaN NaN WECC 10257 CA 2003\n", "92617 NaN NaN WECC 10257 CA 2002\n", "92618 NaN NaN WECC 10257 CA 2001\n", "92782 NaN NaN WECC 56508 CA 2007" ] }, "execution_count": 256, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labeled.tail()" ] }, { "cell_type": "code", "execution_count": 257, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:54:09.653819Z", "start_time": "2017-11-06T16:54:09.640776Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
latlonnercplant idstateyear
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [lat, lon, nerc, plant id, state, year]\n", "Index: []" ] }, "execution_count": 257, "metadata": {}, "output_type": "execute_result" } ], "source": [ "labeled.loc[labeled.nerc.isnull()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 7 facilities that don't show up in my labeled data." ] }, { "cell_type": "code", "execution_count": 258, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([57116, 57794, 58690, 58236, 58098, 57913, 57400, 57628, 61084,\n", " 60539, 60540, 60991, 61079, 60383, 61172, 61020, 61221, 61021,\n", " 61022, 60682, 60688, 61407, 60689, 61330, 61357, 60414, 60366,\n", " 60983, 60989, 60372, 60658, 60581, 60583, 60901, 60902, 60905,\n", " 60883, 60885, 60856, 60552, 60467, 60145, 60152, 59308, 59220,\n", " 59309, 59315, 60237, 60306, 60303, 60340, 59665, 59666, 59684,\n", " 59827, 59061, 60033, 59066, 60210, 60261, 61039, 61040, 61048,\n", " 60258, 61050, 60346, 59689, 59690, 59691, 59764, 61261, 61268,\n", " 61303, 59812, 59875, 60043, 59888, 61561, 60122, 59245, 59193,\n", " 59004, 60655, 60987, 61512, 59206, 60436, 60217, 59712, 59940,\n", " 60785, 61222, 61422, 55314, 55952, 7704, 6339, 55975, 55073,\n", " 50728, 60569, 60570, 61197, 60690, 60506])" ] }, "execution_count": 258, "metadata": {}, "output_type": "execute_result" } ], "source": [ "facility_df.loc[~facility_df['plant id'].isin(labeled['plant id']), 'plant id'].unique()" ] }, { "cell_type": "code", "execution_count": 259, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(100822, 132244)" ] }, "execution_count": 259, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(labeled), len(nercs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 260, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['RFC', 'NPCC', 'WECC', 'SERC', 'SPP', 'TRE', 'MRO', 'FRCC', 'ASCC',\n", " 'HICC'], dtype=object)" ] }, "execution_count": 260, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nerc_labels" ] }, { "cell_type": "code", "execution_count": 261, "metadata": {}, "outputs": [], "source": [ "mro_2016 = set(labeled.loc[(labeled.nerc == 'MRO') &\n", " (labeled.year == 2016), 'plant id'])\n", "mro_2017 = set(labeled.loc[(labeled.nerc == 'MRO') &\n", " (labeled.year == 2017), 'plant id'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 262, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{1052,\n", " 1058,\n", " 1175,\n", " 1189,\n", " 1218,\n", " 1771,\n", " 1889,\n", " 1914,\n", " 1918,\n", " 1932,\n", " 1960,\n", " 1995,\n", " 2008,\n", " 2217,\n", " 2791,\n", " 2821,\n", " 2822,\n", " 3334,\n", " 3343,\n", " 3347,\n", " 3348,\n", " 3996,\n", " 4048,\n", " 4054,\n", " 4062,\n", " 4121,\n", " 4140,\n", " 7376,\n", " 7377,\n", " 7602,\n", " 7706,\n", " 7882,\n", " 7956,\n", " 8014,\n", " 8025,\n", " 8057,\n", " 10476,\n", " 54713,\n", " 54930,\n", " 55315,\n", " 55638,\n", " 55764,\n", " 55834,\n", " 56072,\n", " 56183,\n", " 56366,\n", " 57048,\n", " 57116,\n", " 57255,\n", " 57659,\n", " 58236,\n", " 58434,\n", " 58903,\n", " 59053,\n", " 59197,\n", " 59223,\n", " 59224,\n", " 59225,\n", " 59226,\n", " 59227,\n", " 59228,\n", " 59230,\n", " 59231,\n", " 59307,\n", " 59684,\n", " 59875,\n", " 59902,\n", " 59903,\n", " 60066,\n", " 60203,\n", " 60254,\n", " 60503,\n", " 60504,\n", " 60505,\n", " 60519,\n", " 60520,\n", " 60521,\n", " 60522,\n", " 60523,\n", " 60524,\n", " 60525,\n", " 60526,\n", " 60527,\n", " 60528,\n", " 60529,\n", " 60530,\n", " 60531,\n", " 60532,\n", " 60533,\n", " 60534,\n", " 60564,\n", " 60595,\n", " 60631,\n", " 60632,\n", " 60647,\n", " 60674,\n", " 60694,\n", " 60695,\n", " 60711,\n", " 60712,\n", " 60713,\n", " 60714,\n", " 60715,\n", " 60716,\n", " 60717,\n", " 60795,\n", " 60823,\n", " 60830,\n", " 60832,\n", " 60833,\n", " 60834,\n", " 60835,\n", " 60836,\n", " 60837,\n", " 60838,\n", " 60873,\n", " 60887,\n", " 60888,\n", " 60889,\n", " 60890,\n", " 60891,\n", " 60892,\n", " 60893,\n", " 60894,\n", " 60895,\n", " 60905,\n", " 60934,\n", " 60935,\n", " 60936,\n", " 60937,\n", " 60938,\n", " 60939,\n", " 60940,\n", " 60941,\n", " 60942,\n", " 60943,\n", " 60944,\n", " 60951,\n", " 60955,\n", " 60957,\n", " 60958,\n", " 60962,\n", " 60966,\n", " 60971,\n", " 60977,\n", " 61046,\n", " 61047,\n", " 61056,\n", " 61057,\n", " 61058,\n", " 61059,\n", " 61060,\n", " 61070,\n", " 61071,\n", " 61072,\n", " 61077,\n", " 61078,\n", " 61079,\n", " 61138,\n", " 61139,\n", " 61140,\n", " 61141,\n", " 61142,\n", " 61144,\n", " 61174,\n", " 61175,\n", " 61176,\n", " 61177,\n", " 61178,\n", " 61179,\n", " 61180,\n", " 61181,\n", " 61182,\n", " 61183,\n", " 61203,\n", " 61328,\n", " 61329,\n", " 61357,\n", " 61363,\n", " 61379,\n", " 61380,\n", " 61381,\n", " 61382,\n", " 61383,\n", " 61384,\n", " 61426,\n", " 61427,\n", " 61428}" ] }, "execution_count": 262, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(set(nercs.loc[(nercs.nerc=='MRO') &\n", " (nercs.year==2017),'plant id'])\n", " - mro_2017)" ] }, { "cell_type": "code", "execution_count": 263, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 plants dropped in RFC\n", "0 plants dropped in NPCC\n", "0 plants dropped in WECC\n", "0 plants dropped in SERC\n", "0 plants dropped in SPP\n", "0 plants dropped in TRE\n", "0 plants dropped in MRO\n", "0 plants dropped in FRCC\n", "0 plants dropped in ASCC\n", "0 plants dropped in HICC\n" ] } ], "source": [ "for nerc in nerc_labels:\n", " l = len((set(labeled.loc[(labeled.nerc == nerc) &\n", " (labeled.year == 2016), 'plant id'])\n", " - set(labeled.loc[(labeled.nerc == nerc) &\n", " (labeled.year == 2017), 'plant id'])))\n", " \n", " print('{} plants dropped in {}'.format(l, nerc))" ] }, { "cell_type": "code", "execution_count": 264, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "set()" ] }, "execution_count": 264, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(set(labeled.loc[(labeled.nerc == 'MRO') &\n", " (labeled.year == 2016), 'plant id'])\n", " - set(labeled.loc[(labeled.nerc == 'MRO') &\n", " (labeled.year == 2017), 'plant id']))" ] }, { "cell_type": "code", "execution_count": 265, "metadata": { "ExecuteTime": { "end_time": "2017-11-06T16:59:09.744166Z", "start_time": "2017-11-06T16:59:09.667898Z" }, "collapsed": true }, "outputs": [], "source": [ "path = join(data_path, 'Facility labels', 'Facility locations_RF.csv')\n", "labeled.to_csv(path, index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "psci", "language": "python", "name": "psci" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }