# HOW TO WORK WITH APPLICATION PROGRAMMING INTERFACES (APIS)
# Summer Institute in Computational Social Science
# Author: Chris Bail, Duke University (www.chrisbail.net)

# Many sites that prevent you from screen-scraping (which are unfortunately
# most of the sites you might want to scrape) usually offer a powerful
# alternative: an Application Programming Interface (API).

# An API is a type of web infrastructure that enables a developer (you)
# to request large amounts of specific information from a website. The
# website then creates a new URL that contains the data you request,
# and you scrape it. This has become such an important part of the web
# that most large websites now have APIs (e.g. Google, Twitter, Facebook,
# even the New York Times).

# APIs are called Application Programming Interfaces because many of the
# people who use them are building apps. For example, a music-sharing
# website might want to build an app that helps people expose their friends
# to new types of music. But to do this it needs to request permission to
# extract certain types of information about the person from a site
# such as Facebook.

# But Facebook obviously can't give the app all the info. Facebook needs to
# make sure that the person wants the app to access their data. It also
# needs to make sure the app developer can only access certain types of data,
# and not all the data that Facebook has.

# To do this, Facebook (and other sites that have APIs) uses "authentication
# tokens," or "access keys." These are simply codes that you need to
# provide when you request data from an API.

# Let's take a look at how the Facebook API works using the "Facebook
# Graph API Explorer." This is a website that lets you see how an
# API works, also known as a "sandbox":

# https://developers.facebook.com/tools/explorer

# Try typing "me/friends" into the search bar below the text "FQL Query."
# This is a tool that shows you what the results would look like if you
# made this API request. What it is actually doing is forming the URL
# request and then showing you the JSON-format data that would load if
# you pasted the URL in your browser.

# Most sites that have APIs do not have this type of "sandbox," but
# learning how to work with APIs is a really nice skill to master because
# there are so many of them out there. At present, there are more than
# 13,000 APIs. You can see a list of them here:

# http://www.programmableweb.com/category/all/apis?order=field_popularity

# Academics may be interested to know that many data-archiving sites now
# offer APIs (such as ProQuest). Many are free to use, but others cost
# significant amounts of money.

# Most APIs have "rate limits" that determine how many requests for
# information a developer (you) can make within a certain time frame.

# In R, you can either interact with an API by forming requests for data
# within a loop and "scraping" the resultant data from URLs "by hand," or
# you can use a variety of user-generated packages to collect data.
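# To make the "by hand" approach concrete, here is a minimal sketch of a
# single raw API call using the httr and jsonlite packages. The URL below
# is a hypothetical placeholder rather than a real endpoint; the point is
# just the general pattern: build a URL, request it, and parse the JSON
# that comes back.

# install.packages("httr"); install.packages("jsonlite")
library(httr)
library(jsonlite)

# build the request URL (hypothetical endpoint and parameters)
url <- "https://api.example.com/v1/items?query=durham&format=json"

# send the GET request and check that it succeeded (200 means OK)
response <- GET(url)
status_code(response)

# parse the JSON body of the response into R objects
parsed <- fromJSON(content(response, as = "text", encoding = "UTF-8"))
str(parsed)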
# Because we already covered screen scraping, let's look at one of these
# packages. Let's start with the twitteR package:

install.packages("twitteR")
library(twitteR)

# The instruction manual for this package is here:
# http://cran.r-project.org/web/packages/twitteR/twitteR.pdf

# The first thing you need to do is register as a developer with Twitter.
# In order to do this, you need to visit this page:

# apps.twitter.com

# Unfortunately, if you don't have a Twitter account, you'll have to make
# one, or follow along on your neighbor's laptop if they don't mind.

# The next step is to click on "Create New App." You need to name your app
# and provide some other credentials. It really does not matter much what
# you put in here, because we are not building an app that other people are
# going to use. I just put in my own website. You can leave the "Callback
# URL" field blank.

# Our goal in doing this is to obtain a Twitter API key, which we need in
# order to extract any data from Twitter. To do this we need to scroll down
# to the "Application Settings" section and then click the blue "manage keys
# and access tokens" link that is to the right of our Consumer Key.

# The next thing we need to do is tell the twitteR package what our secret
# login details are. I can't write mine in here because if this information
# got out a hacker could use it to pose as me, or get data collected by my
# app which I might not want her or him to have.

setup_twitter_oauth(consumer_key = "TEXTOFYOURKEYHERE",
                    consumer_secret = "TEXTOFYOURSECRETHERE",
                    access_token = "TEXTOFACCESSTOKENHERE",
                    access_secret = "TEXTOFACCESSSECRETHERE")

# When we run this last line, it will ask us if we want to use a
# local file to store these "credentials." I am going to say "no"
# and load these into R each time I need them.

# What this twitteR package is doing for us is simplifying some of
# the complex URL requests we would need to put in each URL call
# we make to the Twitter API. Once all of our authentication
# information is in the system, we have a range of useful commands
# available to us.

# First, we can define a Twitter user whose information we want to scrape.
# You can use my name, or feel free to put in your own name instead of mine:

user <- getUser("chris_bail")

# Let's get a list of my "friends." By "friends," the author of this package
# means the people that I follow on Twitter:

friends <- user$getFriends()

# Now let's get a list of the people who follow me on Twitter:

followers <- user$getFollowers()

# We can also get a list of all my favorite tweets:

favorites <- favorites(user)

# This package also has some nice commands for formatting these
# data as data.frames:

friendsdata <- twListToDF(friends)
followersdata <- twListToDF(followers)

# This is the command I would use to get a user's tweets:

tweets <- userTimeline(user)
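# As with the friends and followers, we can format these tweets as a data
# frame. This is a quick sketch that assumes twListToDF() also works on a
# list of status objects (it is designed to convert lists of twitteR
# objects) and that the resulting data frame has a "text" column holding
# the tweet text:

tweetsdata <- twListToDF(tweets)
head(tweetsdata$text)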
# I mentioned earlier that Twitter will shut us down if we ask for too much
# data. This command lets me see the limits on what I can do within a given
# time frame:

getCurRateLimitInfo()

# Remember that list of top Twitter accounts we got? Let's see if we can
# scrape network data from these folks. First, let's remind ourselves what
# these data look like:

head(twitter_data)

# So what I am going to want to do is create a for loop where I make each
# person the "user" in each iteration and scrape the names of the people
# they follow.

# Create a blank data frame to store the data we scrape:
twitter_network_data <- as.data.frame(NULL)

# Figure out how many rows we have to scrape:
z <- nrow(twitter_data)

# Start a for loop that gets the names of the people each user follows and
# appends them to the dataset we just created. Finally, take a break between
# pulling each user's Twitter data in order to prevent Twitter's rate
# limiting from kicking in:

for(i in 1:z){
  user <- getUser(twitter_data$twitter_handles[i])
  people_user_follows <- user$getFriends()
  people_user_follows <- twListToDF(people_user_follows)
  people_user_follows$name_of_user <- twitter_data$twitter_handles[i]
  twitter_network_data <- rbind(twitter_network_data, people_user_follows)
  Sys.sleep(60)
  print(i)
}

# We don't have time to run this loop together; it will take quite a bit
# of time to run.

## There are many, many more R packages for working with APIs.
## Here are a few: `RgoogleMaps`, `Rfacebook`, `rOpenSci`
## (this one combines many different APIs, e.g. the Internet Archive),
## `WDI`, `rOpenGov`, `rtimes`.
## Many more are available but not yet on CRAN (install them from
## GitHub or using devtools).

## There are also APIs that you can use to do analyses, like plotly
## for visualization.

# But there are still APIs that don't have R packages (many of them).
# Let's pretend there was no R package for Google Maps. What would we do?

# First: look for patterns in the request URL:
# https://maps.googleapis.com/maps/api/geocode/json?address=Durham,NorthCarolina&sensor=false

# In this case, the address goes between the first `=` and the `&`.
# We can write a small function that builds this URL for any address:

findGPS <- function(address, sensor = "false") {
  beginning <- "https://maps.googleapis.com/maps/api/geocode/json?"
  paster <- paste0(beginning, "address=", address, "&sensor=", sensor)
  return(URLencode(paster))
}

findGPS("Durham, North Carolina")

# Let's put it all together: build the URL, read the JSON it returns, and
# pull out the latitude and longitude. fromJSON() below comes from a
# JSON-parsing package; here I use jsonlite, with simplifyVector = FALSE so
# the result stays a nested list:

library(jsonlite)

page <- findGPS("Durham, North Carolina")
gpscoordinates <- fromJSON(page, simplifyVector = FALSE)
latitude <- gpscoordinates$results[[1]]$geometry$location$lat
longitude <- gpscoordinates$results[[1]]$geometry$location$lng
gps <- c(latitude, longitude)
gps

# We could then wrap these steps in a loop.
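# For example (a minimal sketch; the vector of addresses below is made up
# for illustration), we could geocode several places by looping over them,
# pausing briefly between requests to respect the API's rate limit, and
# collecting the results in a data frame:

addresses <- c("Durham, North Carolina",
               "Chapel Hill, North Carolina",
               "Raleigh, North Carolina")

# blank data frame to store the coordinates we collect
gps_data <- as.data.frame(NULL)

for(i in 1:length(addresses)){
  page <- findGPS(addresses[i])
  gpscoordinates <- fromJSON(page, simplifyVector = FALSE)
  latitude <- gpscoordinates$results[[1]]$geometry$location$lat
  longitude <- gpscoordinates$results[[1]]$geometry$location$lng
  gps_data <- rbind(gps_data,
                    data.frame(address = addresses[i],
                               lat = latitude,
                               lng = longitude))
  Sys.sleep(1)   # pause between requests
  print(i)       # track progress
}

gps_data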