# Using APIs to get data from OpenAQ: A Walkthrough

[OpenAQ](https://openaq.org/#/?_k=mls3yy) is an open-source platform that continuously aggregates air quality data from regulatory sources all over the world. We'll illustrate how to pull data using their front-facing API, including the py-openaq Python wrapper.

## Query an API with Python

### What is an API?

'API' stands for <b>'Application Programming Interface'</b>. In the simplest terms, an API is "an interface that lets the program you write control or access a program somebody else wrote" ([1](https://community.webroot.com/unity-api-forum-49/people-usually-ask-can-some-eli5-what-an-api-is-341484)).

You can also think of it this way: writing a program on your home computer is like cooking in your own kitchen. You have <em>ingredients</em> that you own, and with your <em>culinary skills</em> you're able to cook yourself a fine meal. But let's say you lack the ingredients, kitchenware, knowledge, or even patience to make, say, a pizza. In that case, you might want to head down to the local pizzeria, where you can request the toppings and ingredients of your choice. As with an API, you don't cook the pizza yourself, but, depending on the pizzeria, you can get the type of pizza you desire ([2](https://www.reddit.com/r/explainlikeimfive/comments/628y0c/eli5_what_is_an_api/)). Think of using an API as a transaction, often free of charge, but with certain conditions and limitations.

![Alt Text](https://media.giphy.com/media/lgYKV5BjA4FR6/giphy.gif)

## DO THIS ALWAYS: Read the API Documentation

A good practice when working with APIs is to read the documentation: https://docs.openaq.org/

This helps you understand which parameters the API accepts (whether you can query by location, time interval, attribute type, etc.) and, importantly, what limits and restrictions apply. For example, OpenAQ has a limit of 2,000 requests over a 5-minute period. In other words, you can't make more than 2,000 API calls in any 5-minute window. Note that the limit counts requests, not records: a single request can return many rows. APIs have rate limits to avoid being overwhelmed by requests, and as a protection against 'bad actors' seeking to overwhelm the system.
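
If a script loops over many requests, one simple way to respect a limit like the one above is to pause between calls. The sketch below is only an illustration: `min_interval` and `throttled` are hypothetical helper names (not part of OpenAQ or any library), and the numbers are the 2,000 requests per 5 minutes quoted above.

```python
import time

MAX_REQUESTS = 2000      # request limit quoted above
WINDOW_SECONDS = 5 * 60  # 5-minute window, in seconds


def min_interval(max_requests=MAX_REQUESTS, window_seconds=WINDOW_SECONDS):
    """Smallest average gap (in seconds) between requests that respects the limit."""
    return window_seconds / max_requests


def throttled(items, interval, sleep=time.sleep):
    """Yield items one at a time, pausing `interval` seconds between them."""
    for i, item in enumerate(items):
        if i > 0:
            sleep(interval)  # wait before every item except the first
        yield item


print(min_interval())  # 0.15
```

Wrapping a list of query URLs in `throttled(urls, min_interval())` would space the calls roughly 0.15 seconds apart, which keeps a long loop under the stated limit.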

Furthermore, APIs can also limit how much of the source's data you can access. For example, OpenAQ's API is limited to a rolling archive of the last 90 days' worth of data. So if you wanted data from a year ago, that would not be possible with this API.

[Licensing terms](https://github.com/openaq/openaq-api/blob/develop/LICENSE.md) are also something you might want to be aware of, depending on the intended use of the data. OpenAQ API's data is licensed under Creative Commons "Attribution 4.0 International" (CC BY 4.0), which allows the data to be transformed and used for commercial purposes, provided that appropriate credit is given.

## Getting started with OpenAQ

### Import the package dependencies  

These packages are needed to call the API (requests), handle the JSON response (json), and store the results in a dataframe (pandas).

In [1]:
import pandas as pd
import json
import requests

### Define the scope of parameters

Since my requests will be limited to the archive of the last 90 days, my objective will be to request measurements of fine particulates (PM2.5) for the city of San Francisco, California, for the month of May 2020 (as of June 13th, 2020).
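
A quick sanity check that the month I want still falls inside the 90-day archive can be sketched as follows. `earliest_available` is a hypothetical helper name, and the "as of" date is the June 13th, 2020 date mentioned above.

```python
from datetime import date, timedelta

ARCHIVE_DAYS = 90  # rolling archive length described in this walkthrough


def earliest_available(as_of, archive_days=ARCHIVE_DAYS):
    """Oldest date still covered by a rolling archive of `archive_days` days."""
    return as_of - timedelta(days=archive_days)


as_of = date(2020, 6, 13)
window_start = earliest_available(as_of)

print(window_start)                      # 2020-03-15
print(date(2020, 5, 1) >= window_start)  # True: May 2020 is within reach
```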

#### Set the Base URL

The API we're working with is a <b>'REpresentational State Transfer'</b> (REST) API ([3](https://restfulapi.net/)). REST is the set of conventions by which you pass your request, i.e. via HTTP commands sent to a URL. To borrow the previous pizza metaphor: the chef doesn't want to hear from you directly, she's too busy. Instead, she will direct you to a structured menu with a list of options, asking how many toppings you desire, the kind of crust you prefer, the preferred type of cheese, etc.

Think of the base URL below as the bare dough which will serve as the base for your desired sauce and toppings.

You will find the base URL in the documentation.

![Alt Text](https://media.giphy.com/media/e4AUQS59Gjra0/giphy.gif)

In [2]:
base_url = "https://api.openaq.org/v1/measurements?"

#### Build the Query URL

Add your toppings!

I have three variables of interest:

    a. A location;
    b. A temporal interval;
    c. A specific attribute.
    
The documentation will allow me to identify which arguments (ingredients!) I need to pass into the base url to define my query. From the documentation, they are identified as:

    a. "location" (be careful: many cities around the world share the same name, so think about how that could affect your results.)
    b. "date_from" & "date_to" (the documentation informs me that timestamps in the database are stored in UTC, so I will need to convert my interval from Pacific time to UTC. Several online tools can do the conversion, such as https://savvytime.com/converter/pst-to-utc)
    c. "parameter" (I want measurements for PM2.5, so the argument I need to pass is pm25)
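
The time conversion can also be done in Python rather than with an online tool. This is a minimal sketch using fixed UTC offsets: `to_utc_string` is a hypothetical helper name, and the offsets are assumptions you would choose yourself (-8 hours for Pacific Standard Time, -7 hours for Pacific Daylight Time).

```python
from datetime import datetime, timedelta, timezone


def to_utc_string(local_naive, utc_offset_hours):
    """Convert a naive local datetime at a fixed UTC offset to a UTC timestamp string."""
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    aware = local_naive.replace(tzinfo=local_tz)  # attach the local offset
    return aware.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S")


midnight_may_1 = datetime(2020, 5, 1, 0, 0, 0)

print(to_utc_string(midnight_may_1, -8))  # 2020-05-01T08:00:00
print(to_utc_string(midnight_may_1, -7))  # 2020-05-01T07:00:00
```

The one-hour difference between the two results is why it matters whether the interval is expressed in standard or daylight time.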

In [7]:
city_name = "San Francisco"

# Note that UTC timestamps are written in 24-hour (military) time
date_start = "2020-05-01T08:00:00" # 12:00 am (midnight) at UTC-8 is 8:00 am in UTC
date_end = "2020-05-31T08:00:00"

parameter = "pm25"

query_url = base_url + "location=" + city_name + "&date_from=" + date_start + "&date_to=" + date_end + "&parameter=" + parameter + "&limit=10000"

print("this is our query url: " + query_url)

this is our query url: https://api.openaq.org/v1/measurements?location=San Francisco&date_from=2020-05-01T08:00:00&date_to=2020-05-31T08:00:00&parameter=pm25&limit=10000


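Mamma mia! The URL has a blank space! Will it cause my query to break?

Thankfully, the requests library we'll use to send the query recognizes this and replaces blank spaces with '%20' (a process known as URL encoding, or percent-encoding) before the request goes out.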
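
If you would rather make the encoding explicit than rely on it happening automatically, Python's standard library can build the query string for you. This is a sketch of one possible approach, reusing the same base URL and values as above; the `params` dictionary and the `encoded_url` name are just for illustration.

```python
from urllib.parse import quote, urlencode

base_url = "https://api.openaq.org/v1/measurements?"

# Same query arguments as above, collected in a dictionary
params = {
    "location": "San Francisco",
    "date_from": "2020-05-01T08:00:00",
    "date_to": "2020-05-31T08:00:00",
    "parameter": "pm25",
    "limit": 10000,
}

# quote_via=quote encodes spaces as %20 (the default would use '+')
encoded_url = base_url + urlencode(params, quote_via=quote)
print(encoded_url)
```

The requests library also accepts such a dictionary directly through the `params` argument of `requests.get()`, which saves you from concatenating strings by hand.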

#### Call the API using requests.get()

We use the <em>GET</em> HTTP command to retrieve data. APIs are not just for getting data; a database administrator can use HTTP commands such as POST or PUT to create or update data, respectively.

Think of GET as being a customer in the pizzeria, as opposed to a food inspector.

![Alt Text](https://media.giphy.com/media/mtCQJHkFLpG3m/giphy.gif)
<center><em>We do not endorse Pizza Hut</em></center>

In [4]:
results_jsons = requests.get(query_url).json()

We chain the json() method onto requests.get(), a pattern called method chaining ([4](https://stackoverflow.com/questions/41817578/basic-method-chaining)), to parse the JSON response body into a Python dictionary.

#### From JSON to a Python DataFrame

Why do we want to receive data as <b>JavaScript Object Notation</b> (JSON)? Well, if we're calling a large amount of data, JSON is relatively compact, is quick for a program to parse through, and, contrary to how things may look below, is actually quite readable (once you pay attention to the structure of the file).

In [5]:
print(results_jsons)

{'meta': {'name': 'openaq-api', 'license': 'CC BY 4.0', 'website': 'https://docs.openaq.org/', 'page': 1, 'limit': 10000, 'found': 0}, 'results': []}


But OK, I hear you. This is giving you a headache, like anchovies on a pizza. Plus there's some info under the 'meta' tag that we don't really care about in the context of storing data inside a dataframe.
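
That said, the 'meta' block can be handy for one thing: checking whether a single request could have returned everything. The sketch below assumes, based only on the field names, that 'found' is the total number of matching records and 'limit' is the maximum returned per page; check the documentation before relying on this. `pages_needed` is a hypothetical helper name and the example values are illustrative.

```python
import math


def pages_needed(meta):
    """Number of pages required to retrieve every matching record."""
    return math.ceil(meta["found"] / meta["limit"])


# Illustrative 'meta' values, shaped like the response printed above
print(pages_needed({"page": 1, "limit": 10000, "found": 563}))    # 1
print(pages_needed({"page": 1, "limit": 10000, "found": 25000}))  # 3
```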

To remedy this, we use something called a list comprehension: a compact <b>'for' loop</b> written inside square brackets that builds a list by iterating over each element of interest. Here, it collects each JSON record into a list.

Remember that bit about the structure of a JSON file? Well, if we look at the JSON carefully, we'll notice that there are two top-level keys: 'meta', which we want to exclude, and 'results', which contains all of the ingredients we seek. Each individual record is stored inside a pair of curly braces {}, and we want to store those as individual rows.
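
Before applying this to the real response, here is the same idea on a tiny stand-in. The JSON string below is a made-up, shortened sample with the same two top-level keys; it shows that the list comprehension is equivalent to an explicit 'for' loop.

```python
import json

# A made-up, shortened sample with the same 'meta' / 'results' structure
raw = """
{
  "meta": {"name": "openaq-api", "page": 1, "limit": 10000, "found": 2},
  "results": [
    {"location": "San Francisco", "parameter": "pm25", "value": 5},
    {"location": "San Francisco", "parameter": "pm25", "value": 6}
  ]
}
"""
response = json.loads(raw)  # JSON text -> Python dictionary

# Explicit 'for' loop
records_loop = []
for record in response["results"]:
    records_loop.append(record)

# Equivalent list comprehension
records_comp = [record for record in response["results"]]

print(records_loop == records_comp)        # True
print([r["value"] for r in records_comp])  # [5, 6]
```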

In [145]:
results_list = [results_json for results_json in results_jsons['results']]
print(results_list)

[{'location': 'San Francisco', 'parameter': 'pm25', 'date': {'utc': '2020-05-31T08:00:00.000Z', 'local': '2020-05-31T00:00:00-08:00'}, 'value': 5, 'unit': 'µg/m³', 'coordinates': {'latitude': 37.7658, 'longitude': -122.3978}, 'country': 'US', 'city': 'San Francisco-Oakland-Fremont'}, {'location': 'San Francisco', 'parameter': 'pm25', 'date': {'utc': '2020-05-31T07:00:00.000Z', 'local': '2020-05-30T23:00:00-08:00'}, 'value': 6, 'unit': 'µg/m³', 'coordinates': {'latitude': 37.7658, 'longitude': -122.3978}, 'country': 'US', 'city': 'San Francisco-Oakland-Fremont'}, {'location': 'San Francisco', 'parameter': 'pm25', 'date': {'utc': '2020-05-31T06:00:00.000Z', 'local': '2020-05-30T22:00:00-08:00'}, 'value': 0, 'unit': 'µg/m³', 'coordinates': {'latitude': 37.7658, 'longitude': -122.3978}, 'country': 'US', 'city': 'San Francisco-Oakland-Fremont'}, {'location': 'San Francisco', 'parameter': 'pm25', 'date': {'utc': '2020-05-31T05:00:00.000Z', 'local': '2020-05-30T21:00:00-08:00'}, 'value': 3, '

Still a bit of a mess, but now we've truly isolated all of the data that we're interested in. Using pandas' DataFrame() constructor, we can easily convert the list to a dataframe, and even filter attributes further, if we so wish, with the columns argument.

In [146]:
results_df = pd.DataFrame(results_list,columns=['location','parameter','date','value','unit','coordinates','country','city'])

In [147]:
print("Our dataframe has this many rows : " + str(len(results_df)))
results_df.head()

Our dataframe has this many rows : 563


| | location | parameter | date | value | unit | coordinates | country | city |
|---|---|---|---|---|---|---|---|---|
| 0 | San Francisco | pm25 | {'utc': '2020-05-31T08:00:00.000Z', 'local': '... | 5 | µg/m³ | {'latitude': 37.7658, 'longitude': -122.3978} | US | San Francisco-Oakland-Fremont |
| 1 | San Francisco | pm25 | {'utc': '2020-05-31T07:00:00.000Z', 'local': '... | 6 | µg/m³ | {'latitude': 37.7658, 'longitude': -122.3978} | US | San Francisco-Oakland-Fremont |
| 2 | San Francisco | pm25 | {'utc': '2020-05-31T06:00:00.000Z', 'local': '... | 0 | µg/m³ | {'latitude': 37.7658, 'longitude': -122.3978} | US | San Francisco-Oakland-Fremont |
| 3 | San Francisco | pm25 | {'utc': '2020-05-31T05:00:00.000Z', 'local': '... | 3 | µg/m³ | {'latitude': 37.7658, 'longitude': -122.3978} | US | San Francisco-Oakland-Fremont |
| 4 | San Francisco | pm25 | {'utc': '2020-05-31T04:00:00.000Z', 'local': '... | 6 | µg/m³ | {'latitude': 37.7658, 'longitude': -122.3978} | US | San Francisco-Oakland-Fremont |


Well, obviously, some work needs to be done (looking at the date and coordinates fields, which still hold nested records); but that's a story for another time.

Don't forget to save your work, preferably into a csv file. I.e., put that slice in the fridge!
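
As a small preview of that cleanup, nested fields can be flattened into their own keys before building the dataframe. The sketch below is plain Python; `flatten` is a hypothetical helper name, and the record is copied from the output above (minus a few fields for brevity).

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dictionaries into a single level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries such as 'date' and 'coordinates'
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


record = {
    "location": "San Francisco",
    "parameter": "pm25",
    "date": {"utc": "2020-05-31T08:00:00.000Z", "local": "2020-05-31T00:00:00-08:00"},
    "value": 5,
    "coordinates": {"latitude": 37.7658, "longitude": -122.3978},
}

flat = flatten(record)
print(sorted(flat))
# ['coordinates.latitude', 'coordinates.longitude', 'date.local', 'date.utc', 'location', 'parameter', 'value']
```

pandas also ships a `pd.json_normalize()` function that performs this kind of flattening on a list of records, which may be the more convenient route if you are already working with dataframes.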

In [148]:
results_df.to_csv('ouputs/openAQ_sf_may2020_method1.csv')

<center>It's pizza time!</center>

![Alt Text](https://media.giphy.com/media/12G5TOxGH7WUEM/giphy.gif)

# Using the Python wrapper for the OpenAQ API

## What's a Python wrapper?

It's essentially a Python package that wraps more complicated calls (here, the HTTP requests to the API) in simpler functions. It's as if you ordered your pizza online instead of walking all the way to the pizzeria.

## Install the wrapper in your programming environment:

pip install py-openaq

## Read the Documentation: http://dhhagan.github.io/py-openaq/tutorial/api.html#openaq-api

### Import Dependencies

In [2]:
import openaq
import warnings

In [3]:
warnings.simplefilter('ignore')
print ("openaq v{}".format(openaq.__version__))

openaq v1.1.0


### Set variables

In [139]:
location = "San Francisco"
date_from = "2020-05-01T08:00:00" 
date_to = "2020-05-31T08:00:00" 
parameter = "pm25"

### Initiate an instance of the openaq.OpenAQ class

In [130]:
api = openaq.OpenAQ()

### Run the wrapper

In [140]:
results = api.measurements(location=location, parameter=parameter, date_from=date_from, date_to=date_to, limit=10000,df=True, index='local')

In [141]:
results.head()

| date.local | location | parameter | value | unit | country | city | date.utc | coordinates.latitude | coordinates.longitude |
|---|---|---|---|---|---|---|---|---|---|
| 2020-05-31 00:00:00 | San Francisco | pm25 | 5 | b'\xc2\xb5g/m\xc2\xb3' | US | San Francisco-Oakland-Fremont | 2020-05-31 08:00:00+00:00 | 37.7658 | -122.3978 |
| 2020-05-30 23:00:00 | San Francisco | pm25 | 6 | b'\xc2\xb5g/m\xc2\xb3' | US | San Francisco-Oakland-Fremont | 2020-05-31 07:00:00+00:00 | 37.7658 | -122.3978 |
| 2020-05-30 22:00:00 | San Francisco | pm25 | 0 | b'\xc2\xb5g/m\xc2\xb3' | US | San Francisco-Oakland-Fremont | 2020-05-31 06:00:00+00:00 | 37.7658 | -122.3978 |
| 2020-05-30 21:00:00 | San Francisco | pm25 | 3 | b'\xc2\xb5g/m\xc2\xb3' | US | San Francisco-Oakland-Fremont | 2020-05-31 05:00:00+00:00 | 37.7658 | -122.3978 |
| 2020-05-30 20:00:00 | San Francisco | pm25 | 6 | b'\xc2\xb5g/m\xc2\xb3' | US | San Francisco-Oakland-Fremont | 2020-05-31 04:00:00+00:00 | 37.7658 | -122.3978 |


The outputs via the wrapper are also much cleaner than those from our hand-built GET query: the nested date and coordinates fields are already split into their own columns. Hence, it always pays to read the documentation!

In [149]:
results.to_csv('ouputs/openAQ_sf_may2020_method2.csv')