# Weather and Climate
This notebook is intended to reinforce the skills you learned in the Pandas course.

## Fredericksburg, Virginia and Las Cruces, New Mexico
<img src="http://s3.amazonaws.com/placester-wordpress/blogs.dir/84/files/2017/01/Las-Cruces1-242630.jpg" width="400"/>
I downloaded a dataset containing weather information about these two cities from [the National Oceanic and Atmospheric Administration (NOAA) website](https://www.ncdc.noaa.gov/cdo-web/datasets). If you would rather explore the data from other cities, feel free to do so.


The dataset is [nmva2018.csv](http://zacharski.org/files/courses/data101/nmva2018.csv)

It contains weather information for Fredericksburg, Virginia and Las Cruces, New Mexico from January 1, 2000 to the present. The dataset columns are:

Column | Description
:---: | :--- 
STATION | The NOAA weather station identifier
NAME | The name of the station. I changed these to Las Cruces and Fredericksburg. (They were originally 'STATE UNIVERSITY' and 'FREDERICKSBURG SEWAGE'.)
DATE | The date
DAPR | Number of days included in the multiday precipitation total (MDPR)
MDPR | Multiday precipitation total
MDWM | Multiday wind movement (miles or km as per user preference)
PRCP | Precipitation total (in tenths of mm)
SNOW | Snowfall (mm)
SNWD | Snow depth (mm)
TMAX | Maximum Temperature
TMIN | Minimum Temperature
TOBS | Temperature at time of observation
WDMV | 24-hour wind movement (miles)
WT01 | Fog, ice fog, or freezing fog (may include heavy fog)
WT03 | Thunder
WT06 | Glaze or rime
WT11 | Blowing Spray

Let's load in the dataset:

In [2]:
# TBD
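
As a rough sketch of the loading step, the cell below reads a tiny made-up CSV with a few of the column names from the table above. The station IDs and values are hypothetical stand-ins; for the real data you would pass the path or URL of `nmva2018.csv` to `pd.read_csv` instead of the in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for nmva2018.csv; IDs and values are made up.
sample_csv = io.StringIO(
    "STATION,NAME,DATE,PRCP,TMAX,TMIN\n"
    "US1,Fredericksburg,2016-07-01,0.0,91,70\n"
    "US2,Las Cruces,2016-07-01,0.0,99,68\n"
    "US1,Fredericksburg,2016-07-02,0.3,88,69\n"
)

# parse_dates converts the DATE strings into datetime64 values,
# which makes filtering and grouping by year much easier later on.
weather = pd.read_csv(sample_csv, parse_dates=["DATE"])
```

Parsing the dates at load time is a design choice, not a requirement; converting later with `pd.to_datetime` works as well.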

First, let's examine only the data from 2016:

In [None]:
#TBD
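
One possible way to keep a single year, assuming the `DATE` column already holds datetime values, is to compare the `.dt.year` accessor against the year you want. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the weather data.
weather = pd.DataFrame({
    "NAME": ["Fredericksburg", "Las Cruces", "Fredericksburg"],
    "DATE": pd.to_datetime(["2015-12-31", "2016-01-01", "2016-06-15"]),
    "TMAX": [45, 60, 88],
})

# Keep only the rows whose DATE falls in 2016.
weather2016 = weather[weather["DATE"].dt.year == 2016]
```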

We would like to display a series of plots comparing Fredericksburg to Las Cruces. First, let's plot the number of days that reached a temperature of 90 or above:

In [3]:
#TBD
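
A minimal sketch of the counting step, using made-up temperatures and the hypothetical name `weather2016` for the 2016 subset: filter with a boolean condition, then count the remaining rows per city.

```python
import pandas as pd

# Made-up daily highs; weather2016 is a hypothetical name for the 2016 subset.
weather2016 = pd.DataFrame({
    "NAME": ["Fredericksburg", "Fredericksburg", "Las Cruces", "Las Cruces", "Las Cruces"],
    "TMAX": [91, 85, 95, 101, 89],
})

# Keep the days at 90 or above, then count the rows for each city.
hot = weather2016[weather2016["TMAX"] >= 90]
hot_days = hot.groupby("NAME").size()
```

Calling `hot_days.plot(kind="bar")` on the result would draw the comparison, assuming matplotlib is installed.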

And we would like to see a similar plot for the number of days that reached a temperature of 32 or below (meaning that at some point of the day the temperature was 32 or lower):


In [4]:
#TBD

What do you consider the ideal outdoor temperature range? Whatever you decide, we would like to plot the number of days that were within that range. 

In [5]:
#TBD
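
For a range rather than a single threshold, `Series.between` is one convenient option. The 65 to 80 range below is only an assumed example, and the data is made up.

```python
import pandas as pd

# Made-up daily highs for illustration.
weather2016 = pd.DataFrame({
    "NAME": ["Fredericksburg", "Fredericksburg", "Las Cruces", "Las Cruces"],
    "TMAX": [72, 95, 68, 80],
})

# Assumed "ideal" range of 65 to 80; between() includes both endpoints by default.
ideal = weather2016[weather2016["TMAX"].between(65, 80)]
ideal_days = ideal.groupby("NAME").size()
```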

<h2 style="color:blue">Hacker Challenges</h2>

The following requires some mental calisthenics.  

### Part 1. 
We would like to see a yearly plot of the number of days 90 or over for Fredericksburg. So the x axis would be the years 2010 to 2017. Which year had the most days at 90 or over?

### Part 2. Even more challenging.
We would like to see a plot similar to that in Part 1, but showing data for both Fredericksburg and Las Cruces. That is, for each year we see the days 90 or over for Las Cruces, and the days  90 or over for Fredericksburg.


Here is a bit of a hint. (This was my approach; you might have a different, better one.) I had two Pandas Series. One, `cc`, was the number of days 90 or higher for Las Cruces. It looked like this:

    cc.head()
    
    DATE
    2000-12-31    142
    2001-12-31    121
    2002-12-31    117
    2003-12-31    117
    2004-12-31    106

I had a similar one, `ff`, for Fredericksburg. Then I combined them into one DataFrame by:

    combined = pd.DataFrame({'Fredericksburg': ff, 'Las Cruces': cc})
    
    combined.head()
    
                Fredericksburg  Las Cruces
    DATE
    2000-12-31              24         142
    2001-12-31              25         121
    2002-12-31              57         117
    2003-12-31              26         117
    2004-12-31              24         106
    
    
After that, the plotting was easy.



In [None]:
#TBD
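
The hint above can be sketched on made-up data as follows. Note one deliberate difference: the hint's Series are indexed by year-end dates, while this sketch groups by the integer year, which is a swapped-in technique and not necessarily how the original `cc` and `ff` were built. The helper `hot_days_per_year` is a hypothetical name.

```python
import pandas as pd

# Made-up rows: three days per city spread over two years.
weather = pd.DataFrame({
    "NAME": ["Fredericksburg"] * 3 + ["Las Cruces"] * 3,
    "DATE": pd.to_datetime(["2010-07-01", "2010-07-02", "2011-07-01"] * 2),
    "TMAX": [91, 85, 93, 99, 97, 101],
})

def hot_days_per_year(df, city):
    """Count the days with TMAX >= 90 in each calendar year for one city."""
    rows = df[(df["NAME"] == city) & (df["TMAX"] >= 90)]
    return rows.groupby(rows["DATE"].dt.year).size()

ff = hot_days_per_year(weather, "Fredericksburg")
cc = hot_days_per_year(weather, "Las Cruces")

# Aligning on the shared year index gives one column per city.
combined = pd.DataFrame({"Fredericksburg": ff, "Las Cruces": cc})
```

A year in which a city has no qualifying days would be absent from that city's Series and show up as `NaN` after combining, so `combined.fillna(0)` may be needed on real data. `combined.plot(kind="bar")` would then draw both cities side by side, assuming matplotlib is installed.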

# The average max weekly temperatures of Fredericksburg in 2016

What we mean:

 | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday
 ---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:  
 Max Temp | 82 | 84 | 82 | 75 | 77 | 87 | 89
 
 The average max weekly temperature for that week is:
 
 $$avgMaxWeekly = \frac{82 + 84 + 82 + 75 + 77 + 87 + 89}{7} = \frac{576}{7} \approx 82.2857$$
 
 We would like to see a plot for the whole year:
 

In [6]:
#TBD
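
One way to compute such weekly averages, assuming a Series of daily highs indexed by date, is `resample("W").mean()`. The sketch below reuses the seven temperatures from the example week and adds a second, made-up week; the choice of dates is an assumption for illustration.

```python
import pandas as pd

# Two weeks of daily highs; 2016-01-04 is a Monday.
# The first week reuses the example values, the second week is made up.
dates = pd.date_range("2016-01-04", periods=14, freq="D")
tmax = pd.Series([82, 84, 82, 75, 77, 87, 89] + [70] * 7, index=dates)

# "W" bins are weeks ending on Sunday, so each bin holds Monday through Sunday.
weekly_avg = tmax.resample("W").mean()
```

`weekly_avg.plot()` would draw the yearly curve once the Series covers all of 2016, assuming matplotlib is installed.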


<h2 style="color:blue">Hacker Challenge</h2>

Can you do the same (the average max weekly temperature plot) for both Fredericksburg and Las Cruces in one plot?

In [7]:
#TBD


## The total annual precipitation amounts for both Fredericksburg and Las Cruces
We would like a plot showing the total amount for each year from 2010 through 2017 (that is, one value each for 2010, 2011, 2012, and so on).

### A non-plot question.
What are the average yearly precipitation amounts for Fredericksburg and Las Cruces?

In [10]:
# TBD
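
A possible sketch for the annual totals and their averages, on made-up precipitation values: group by year and city, sum, and pivot the cities into columns.

```python
import pandas as pd

# Made-up precipitation readings: two per city per year.
weather = pd.DataFrame({
    "NAME": ["Fredericksburg"] * 4 + ["Las Cruces"] * 4,
    "DATE": pd.to_datetime(["2010-03-01", "2010-09-01", "2011-03-01", "2011-09-01"] * 2),
    "PRCP": [1.0, 2.0, 3.0, 4.0, 0.5, 0.5, 1.0, 0.0],
})

# Sum precipitation per (year, city), then move the cities into columns.
annual = (
    weather.groupby([weather["DATE"].dt.year, "NAME"])["PRCP"]
    .sum()
    .unstack("NAME")
)

# The mean of each column is that city's average yearly total.
average_annual = annual.mean()
```

On the real data you would probably first restrict the rows to 2010 through 2017, and missing `PRCP` readings would be skipped by `sum()`, which can understate a year's total.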

# Climate Change: Atmospheric Carbon Dioxide
Before the Industrial Revolution, atmospheric carbon dioxide was about 280 ppm (parts per million). When we first started measuring its concentration at Mauna Loa, Hawaii in 1958, the concentration was 315 ppm.

The data from this location is in the CSV file:

[co2_mm_mlo.csv](https://raw.githubusercontent.com/zacharski/data101/master/co2_mm_mlo.csv)
    

The following information from the original dataset is important:

> Data from March 1958 through April 1974 have been obtained by C. David Keeling
> of the Scripps Institution of Oceanography (SIO) and were obtained from the
> Scripps website (scrippsco2.ucsd.edu).
>
> The "average" column contains the monthly mean CO2 mole fraction determined
> from daily averages.  The mole fraction of CO2, expressed as parts per million
> (ppm) is the number of molecules of CO2 in every one million molecules of dried
> air (water vapor removed).  If there are missing days concentrated either early
> or late in the month, the monthly mean is corrected to the middle of the month
> using the average seasonal cycle.  Missing months are denoted by -99.99.
> The "interpolated" column includes average values from the preceding column
> and interpolated values where data are missing.  Interpolated values are
> computed in two steps.  First, we compute for each month the average seasonal
> cycle in a 7-year window around each monthly value.  In this way the seasonal
> cycle is allowed to change slowly over time.  We then determine the "trend"
> value for each month by removing the seasonal cycle; this result is shown in
> the "trend" column.  Trend values are linearly interpolated for missing months.
> The interpolated monthly mean is then the sum of the average seasonal cycle
> value and the trend value for the missing month.
>
> NOTE: In general, the data presented for the last year are subject to change, 
> depending on recalibration of the reference gas mixtures used, and other quality
> control procedures. Occasionally, earlier years may also be changed for the same
> reasons.  Usually these changes are minor.
>
> CO2 expressed as a mole fraction in dry air, micromol/mol, abbreviated as ppm
>
>  (-99.99 missing data;  -1 no data for #daily means in month)

**Please give a monthly plot of the atmospheric carbon (extra xp for making a pretty plot<sup>TM</sup>).**

### Hint:

The data has a year and a month column:


year | month | decimal_date | average | interpolated | trend | days
:---: | :---: | :---: | :---: | :---: | :---: | :---:
1958 | 3 | 1958.208 | 315.71 | 315.71 | 314.62 | -1
1958 | 4 | 1958.292 | 317.45 | 317.45 | 315.29 | -1
1958 | 5 | 1958.375 | 317.50 | 317.50 | 314.71 | -1

Let's say you wanted to combine the year and month to create a Pandas Series with entries like '1958-03' and so on.  

If our original Pandas DataFrame is called `carbon` we can create a series called `date_string` by executing:


    date_string = carbon['year'].astype(str)  + '-' + carbon['month'].apply(lambda x:"%02i" % x)
    
For more of a hint see the DataCamp page *Cleaning and tidying datetime data*
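
As a sketch of where that string Series can lead, the cell below rebuilds the three rows from the table above, converts `date_string` to real datetimes, and uses them as the index. Setting the index this way is one option among several, not necessarily the intended solution.

```python
import pandas as pd

# The first three rows of the table shown in the hint.
carbon = pd.DataFrame({
    "year": [1958, 1958, 1958],
    "month": [3, 4, 5],
    "average": [315.71, 317.45, 317.50],
})

# Build strings like '1958-03', then parse them into timestamps.
date_string = carbon["year"].astype(str) + "-" + carbon["month"].apply(lambda x: "%02i" % x)
carbon.index = pd.to_datetime(date_string, format="%Y-%m")

# A Series indexed by month start, ready for plotting.
monthly = carbon["average"]
```

With a datetime index, `monthly.plot()` labels the x axis with dates automatically, assuming matplotlib is installed. On the full file, remember that missing months in the `average` column are coded as -99.99.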


**Now we would like to see a plot of the average daily atmospheric carbon for every 5 years.**

One of those plots looked saw-toothed leading us to wonder if some months of the year had lower atmospheric carbon than others.  For example, maybe it is low during winter months. Can you come up with a plot that will help us answer this question?
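
One possible approach to the seasonal question, sketched on made-up numbers: subtract the `trend` column from the `interpolated` column so the long-term rise drops out, then average what is left by calendar month. The `seasonal` column name is introduced here for illustration.

```python
import pandas as pd

# Made-up values: two years, with May higher than September in both.
carbon = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "month": [5, 9, 5, 9],
    "interpolated": [371.0, 367.0, 373.0, 369.0],
    "trend": [369.0, 369.0, 371.0, 371.0],
})

# Remove the long-term trend, then average the remainder for each month.
carbon["seasonal"] = carbon["interpolated"] - carbon["trend"]
by_month = carbon.groupby("month")["seasonal"].mean()
```

A bar plot of `by_month` over all twelve months would show which months sit above or below the trend.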

### Before We Start

Suppose we have the small DataFrame:

In [18]:
import pandas as pd
names = ['Ann', 'Ben', 'Clara', "Dora", 'Enric', 'Fred', 'Ginny', 'Hannah']
midtermGrades =   [87, 75, 97, 81, 65, 91, 85, 96]
finalGrades   =   [89, 81, 99, 95, 60, 93, 87, 99]
grades = pd.DataFrame({'Name': names, 'midterm': midtermGrades, 'final': finalGrades})
grades


 | Name | final | midterm
:---: | :---: | :---: | :---:
0 | Ann | 89 | 87
1 | Ben | 81 | 75
2 | Clara | 99 | 97
3 | Dora | 95 | 81
4 | Enric | 60 | 65
5 | Fred | 93 | 91
6 | Ginny | 87 | 85
7 | Hannah | 99 | 96


We can sort the data by the values in the final column by:


In [19]:
gradesSorted = grades.sort_values('final', ascending=False)
gradesSorted

 | Name | final | midterm
:---: | :---: | :---: | :---:
2 | Clara | 99 | 97
7 | Hannah | 99 | 96
3 | Dora | 95 | 81
5 | Fred | 93 | 91
0 | Ann | 89 | 87
6 | Ginny | 87 | 85
1 | Ben | 81 | 75
4 | Enric | 60 | 65


And, if we want, we can make a new dataframe of the top 4 students:

In [20]:
topStudents = gradesSorted[:4]
topStudents

 | Name | final | midterm
:---: | :---: | :---: | :---:
2 | Clara | 99 | 97
7 | Hannah | 99 | 96
3 | Dora | 95 | 81
5 | Fred | 93 | 91
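
As an aside, pandas also offers `nlargest`, which does the sort and the slice in one call. This is an alternative to the approach shown, sketched with the same grades:

```python
import pandas as pd

names = ['Ann', 'Ben', 'Clara', 'Dora', 'Enric', 'Fred', 'Ginny', 'Hannah']
midtermGrades = [87, 75, 97, 81, 65, 91, 85, 96]
finalGrades = [89, 81, 99, 95, 60, 93, 87, 99]
grades = pd.DataFrame({'Name': names, 'midterm': midtermGrades, 'final': finalGrades})

# The four rows with the highest final grade, highest first.
# Ties are broken by original row order (keep='first' is the default).
topStudents = grades.nlargest(4, 'final')
```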





# Salaries, Colleges, and Degrees

We have two data files that were created by the Wall Street Journal. 

One is called salariesByCollege and looks like:

School Name | Unnamed: 0 | School Type | Starting Median Salary | Mid-Career Median Salary | Mid-Career 10th Percentile Salary | Mid-Career 25th Percentile Salary | Mid-Career 75th Percentile Salary | Mid-Career 90th Percentile Salary | region
:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---:
Massachusetts Institute of Technology (MIT) | 0 | Engineering | 72200.0 | 126000.0 | 76800.0 | 99200.0 | 168000.0 | 220000.0 | Northeastern
California Institute of Technology (CIT) | 1 | Engineering | 75500.0 | 123000.0 |  | 104000.0 | 161000.0 |  | California
Harvey Mudd College | 2 | Engineering | 71800.0 | 122000.0 |  | 96000.0 | 180000.0 |  | California
Polytechnic University of New York Brooklyn | 3 | Engineering | 62400.0 | 114000.0 | 66800.0 | 94300.0 | 143000.0 | 190000.0 | Northeastern

The other is called degreesThatPayBack:

Unnamed: 0 | Undergraduate Major | Starting Median Salary | Mid-Career Median Salary | Percent change from Starting to Mid-Career Salary | Mid-Career 10th Percentile Salary | Mid-Career 25th Percentile Salary | Mid-Career 75th Percentile Salary | Mid-Career 90th Percentile Salary
:---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: 
0 | Accounting | 46000.00 | 77100.00 | 67.6 | 42200.00 | 56100.00 | 108000.00 | 152000.00
1 | Aerospace Engineering | 57700.00 | 101000.00 | 75.0 | 64300.00 | 82100.00 | 127000.00 | 161000.00
2 | Agriculture | 42600.00 | 71900.00 | 68.8 | 36300.00 | 52100.00 | 96300.00 | 150000.00
3 | Anthropology | 36800.00 | 61500.00 | 67.1 | 33800.00 | 45500.00 | 89300.00 | 138000.00
4 | Architecture | 41600.00 | 76800.00 | 84.6 | 50600.00 | 62200.00 | 97000.00 | 136000.00
5 | Art History | 35800.00 | 64900.00 | 81.3 | 28800.00 | 42200.00 | 87400.00 | 125000.00

The files are in a zip archive at [collegeSalaries.zip](http://zacharski.org/files/courses/data101/collegeSalaries.zip). You will need to download the archive to your laptop, unzip it, and then load the files into Pandas. This is good practice for when someone emails you a file, or you create your own data file.

I will give you a moment to load this data
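
If you would rather not unzip by hand, Python's standard `zipfile` module can open archive members directly. The sketch below builds a tiny in-memory archive so it runs without a download; the member file names and the rows inside are assumptions, so check `namelist()` against the real archive.

```python
import io
import zipfile
import pandas as pd

# Build a tiny in-memory archive as a stand-in for collegeSalaries.zip.
# The member names and values here are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "salariesByCollege.csv",
        "School Name,Starting Median Salary\nSchool A,50000\n",
    )
    archive.writestr(
        "degreesThatPayBack.csv",
        "Undergraduate Major,Starting Median Salary\nMajor A,40000\n",
    )
buffer.seek(0)

# List the members, then open each one and hand it to read_csv.
with zipfile.ZipFile(buffer) as archive:
    member_names = archive.namelist()
    with archive.open("salariesByCollege.csv") as member:
        colleges = pd.read_csv(member)
    with archive.open("degreesThatPayBack.csv") as member:
        majors = pd.read_csv(member)
```

For the downloaded file, replace `buffer` with the path to the archive on your laptop.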

## A few basic questions

### We would like to see a list of universities sorted by those with the highest starting salary first.

### Can you do the same sort but this time with majors? (A list of the highest paying majors)

### Can you plot the salaries of the top 5 majors?

### Can you plot the salaries of the top 5 schools?
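
A minimal sketch of the sort-then-take-the-top pattern, using only the four schools from the sample table above, so it keeps the top 3 instead of 5. The variable names are assumptions.

```python
import pandas as pd

# Four rows copied from the sample table; the real file has many more.
colleges = pd.DataFrame({
    "School Name": [
        "Massachusetts Institute of Technology (MIT)",
        "California Institute of Technology (CIT)",
        "Harvey Mudd College",
        "Polytechnic University of New York Brooklyn",
    ],
    "Starting Median Salary": [72200.0, 75500.0, 71800.0, 62400.0],
    "Mid-Career Median Salary": [126000.0, 123000.0, 122000.0, 114000.0],
})

# Highest starting salary first, then keep the first few rows.
ranked = colleges.sort_values("Starting Median Salary", ascending=False)
top = ranked.head(3).set_index("School Name")
```

With the school name as the index, `top.plot(kind="barh")` would label each bar with its school, assuming matplotlib is installed. If the salary columns in your copy of the files load as text instead of numbers, they would need converting before sorting.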

# Creative
Now is your chance to do something creative with the data. What is an interesting question you have that can be answered with a few plots and/or summary statistics?
