{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# `rorcid`\n",
    "\n",
    "`rorcid` is a package developed by [Scott Chamberlain](https://scottchamberlain.info/), co-founder of [rOpenSci](https://ropensci.org/), to serve as an interface to the ORCID API. \n",
    "\n",
    "You can find more information about the API [on the ORCID site](http://members.orcid.org/api/about-public-api).\n",
    "\n",
    "Credit to Paul Oldham at https://www.pauloldham.net/introduction-to-orcid-with-rorcid/ for inspiring some of the structure and ideas throughout this document. I highly recommend reading it.\n",
    "\n",
    "# Licensing\n",
    "\n",
    "This walkthrough is distributed under a [Creative Commons Attribution 4.0 International (CC BY 4.0) License](https://creativecommons.org/licenses/by/4.0/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load packages\n",
    "When you download R it already has a number of functions built in: these encompass what is called **Base R**. However, many R users write their own libraries of functions, package them together in R **packages**, and provide them to the R community at no charge. This extends the capacity of R and allows us to do much more. In many cases, they improve on the Base R functions by making them easier and more straight-forward to use. In addition to `rorcid`, we will also be using the `dplyr`, `purrr`, `tidyr`, `anytime`, `lubridate`, and `janitor` packages. Some of these packages are part of the [tidyverse](https://www.tidyverse.org/), a collection of R packages designed for data science.\n",
    "\n",
    "If you are using R and R Studio, you will need to use `install.packages()` function to install the packages first. We have already installed the packages here in our Binder repository, so we will simply load them by calling `library()`. Let's also set an option to see a max number of 100 columns and max 20 rows in our Jupyter Notebooks environment, to make printed tables easier to look at."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#load packages\n",
    "library(tidyr)\n",
    "library(purrr)\n",
    "library(lubridate)\n",
    "library(rorcid)\n",
    "library(anytime)\n",
    "library(httr)\n",
    "library(janitor)\n",
    "library(readr)\n",
    "library(glue)\n",
    "library(stringr)\n",
    "library(dplyr)\n",
    "\n",
    "# increase number of columns and rows displayed when we print a table\n",
    "options(repr.matrix.max.cols=100, repr.matrix.max.rows=20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting up rorcid\n",
    "\n",
    "If you haven't done so already, create an ORCID account at https://orcid.org/signin. If you have an ORCID but can't remember it, search for your name at https://orcid.org. If you try to sign in with an email address already associated with an ORCID, you'll be prompted to sign into the existing record. If you try to register with a different address, when you enter your name you'll be asked to review existing records with that name and verify that none of them belong to you--[see more on duplicate ORCID records](https://support.orcid.org/hc/en-us/articles/360006972593-How-do-you-check-for-duplicate-ORCID-records-). Make sure you have verified your email address."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, you need to set up an ORCID API client in order to get a token you can use in your API queries. According to the [ORCID API tutorial](https://members.orcid.org/api/tutorial/read-orcid-records), anyone can receive a key to access the public API--you do not have to belong to a member institution.\n",
    "\n",
    "If you were using R Studio, this would be simple: call `orcid_auth()` and a browser window will open, giving you your API token and allowing you to store it in your R session. See [this step on the rorcid page of the course website](https://ciakovx.github.io/NEED TO FILL THE REST OF THIS URL) for instructions on how to do that. Here in our Jupyter Notebook, we will be following the steps at <https://support.orcid.org/hc/en-us/articles/360006897174>:\n",
    "\n",
    "1. Sign in to your ORCID account\n",
    "2. In the upper right corner, click your name, then in the drop-down menu, click **Developer Tools**. Note: In order to access Developer Tools, you must verify your email address. If you have not already verified your email address, you will be prompted to do so at this point.\n",
    "4. Click the **\"Register for the free ORCID public API\"** button\n",
    "5. Review and agree to the terms of service when prompted.\n",
    "6. Add your name in the Name field, https://www.orcid.org in the Your Website URL field, \"Getting public API key\" in Description field, and https://www.orcid.org in the redirect URI field. Click the diskette button to save.\n",
    "7. A gray box will appear including your Client ID and Client Secret. In the below code chunk, copy and paste the client ID and the client secret respectively. Make sure to leave the quotation marks (e.g. `orcid_client_id <- \"APP-FDFJKDSLF320SDFF\"` and `orcid_client_secret <- \"c8e987sa-0b9c-82ed-91as-1112b24234e\"`). Then execute the code chunk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# copy/paste your client ID from https://orcid.org/developer-tools\n",
    "orcid_client_id <- \"APP-ZC8HBR68EX0V6N54\"\n",
    "\n",
    "# copy/paste your client secret from https://orcid.org/developer-tools\n",
    "orcid_client_secret <- \"a65173e5-e939-462d-b73e-4f84d34db3d2\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "8. Now execute the below code, which will send a `POST` request (from the `httr` package) to ORCID and return to you an access token."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "orcid_request <- POST(url  = \"https://orcid.org/oauth/token\",\n",
    "          config = add_headers(`Accept` = \"application/json\",\n",
    "                               `Content-Type` = \"application/x-www-form-urlencoded\"),\n",
    "          body = list(grant_type = \"client_credentials\",\n",
    "                      scope = \"/read-public\",\n",
    "                      client_id = orcid_client_id,\n",
    "                      client_secret = orcid_client_secret),\n",
    "          encode = \"form\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have the response, we can use `content` from `httr` to get the information we want. We will then `print()` that to the console to see the contents."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "orcid_response <- content(orcid_request)\n",
    "print(orcid_response)"
   ]
  },
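  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an optional sanity check before going further, the sketch below confirms that the request succeeded. It assumes the `orcid_request` and `orcid_response` objects created above; `stop_for_status()` is the `httr` helper that raises an R error if the server returned an HTTP error code. What ORCID sends back for a bad client ID or secret is not shown here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# raise an error if ORCID answered with an HTTP error status (4xx or 5xx)\n",
    "httr::stop_for_status(orcid_request)\n",
    "\n",
    "# should print TRUE if the parsed response contains a non-empty access token\n",
    "is.character(orcid_response$access_token) && nzchar(orcid_response$access_token)"
   ]
  },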
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now use `Sys.setenv` to set our `ORCID_TOKEN` to the `access_token` value. This will give us the necessary authorization to make API calls to gather public ORCID records information."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Sys.setenv(ORCID_TOKEN = orcid_response$access_token)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we use `orcid_auth()` to confirm that we have set our token properly. If so, it will print `Bearer` and the access token we just assigned. We are ready to use `rorcid`!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rorcid::orcid_auth(scope = \"/authenticate\",\n",
    "                   reauth = TRUE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After you have your access token, you only ever need to complete the `Sys.setenv()` step and the `rorcid::orcid_auth()` step. Your access token is good for several years. Just save it somewhere and reuse as needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Finding ORCID iDs with `rorcid::orcid_search()`\n",
    "\n",
    "The `rorcid::orcid_search()` function takes a query and returns a data frame with three columns: first name, last name, and ORCID iD. We use this when we have some data about a person or people, and we want to get their ORCID iDs. \n",
    "\n",
    "Call `?orcid_search` to view the available parameters.\n",
    "\n",
    "For this example, we will use the fictitious professor [Josiah S(tinkney) Carberry](https://library.brown.edu/hay/carberry.php), \"legendary professor of psychoceramics (the study of cracked pots) since 1929,\" at Brown University. Despite being make-believe, Carberry has a profile in ORCID that can be used for test cases such as these.\n",
    "\n",
    "## Names\n",
    "\n",
    "We start with a simple search by family name with the `family_name` argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE",
     "classes": [],
     "eval": "TRUE,",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry <- rorcid::orcid_search(family_name = 'carberry')\n",
    "carberry"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the data frame in your R environment, you will see it returns 10 observations of three variables. By default, `orcid_search` returns a limit of 10. We can increase the number of results returned by using the `rows` argument:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE",
     "classes": [],
     "eval": "TRUE,",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry <- rorcid::orcid_search(family_name = 'carberry',\n",
    "                                 rows = 50)\n",
    "carberry"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now get 28 results. However, we need not look through that long list to find Josiah, but can add another argument: `given_name`. Multiple arguments are combined with AND, such that the above example gets passed to ORCID as **given-names:josiah** AND **family-name:carberry**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE",
     "classes": [],
     "eval": "TRUE,",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry <- rorcid::orcid_search(given_name = 'josiah',\n",
    "                                 family_name = 'carberry')\n",
    "carberry"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like there are two people in the public ORCID registry with the first name Josiah and the last name Carberry. \n",
    "In R Studio, we can launch the actual ORCID profiles to our browser by using `rorcid::browse()` function, where we use brackets to look up the first and second items in the `orcid` variable--i.e. `rorcid::browse(carberry$orcid[1])` however, we can't do that in Jupyter Notebooks, so we will simply look them up manually: <http://orcid.org/0000-0002-1028-6941> and <http://orcid.org/0000-0002-1825-0097>."
   ]
  },
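  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since `browse()` is not available here, one workaround is to build the profile URLs ourselves. This is a small sketch that assumes the `carberry` data frame from the search above, with its `orcid` column; `paste0()` is a Base R function that joins strings, and `carberry_urls` is a name made up for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# prepend the ORCID base URL to each iD returned by the search\n",
    "carberry_urls <- paste0(\"https://orcid.org/\", carberry$orcid)\n",
    "carberry_urls"
   ]
  },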
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Call `?orcid_search` to see a list of fields you can pass to the function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "?orcid_search"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "**TRY IT YOURSELF**\n",
    "\n",
    "1. Do a name search for yourself or a colleague. Try it with and without `given_names` and `family_names`. Increase or decrease the `rows` returned."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# run an orcid name search\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Affiliation name\n",
    "\n",
    "`orcid_search` includes the argument `affiliation_org`, which searches across all of one's affiliation data (employment, education, invited positions, membership & service). Because this is such a broad search, it has the potential to return false positives if you are using it on it's own."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry <- rorcid::orcid_search(family_name = 'carberry', \n",
    "                                 affiliation_org = 'Wesleyan')\n",
    "carberry"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are also arguments for past institution (`past_inst`) and current institution (`current_inst`), as well as institutional identifiers (see below).\n",
    "\n",
    "---\n",
    "**TRY IT YOURSELF**\n",
    "\n",
    "1. If your ORCID profile has affiliation data, run a name and affiliation search. If you have designated current and past institutions, try both:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# run a name and affiliation search\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Email address\n",
    "\n",
    "We can search by email address, however, [according to ORCID](https://members.orcid.org/api/tutorial/search-orcid-registry), as of February 2017, fewer than 2% of the 3+ million email addresses on ORCID records are public, so this one may not be incredibly helpful.  Since Josiah doesn't have email, we'll use mine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "clarke <- rorcid::orcid_search(email = 'clarke.iakovakis@okstate.edu')\n",
    "clarke"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the person has chosen to keep their email address private, the function will return an empty dataframe.`\n",
    "\n",
    "\n",
    "## Keywords\n",
    "\n",
    "If the individual has added keywords to their ORCID profile, we can search those. Dr. Carberry's profile includes the keyword \"psychoceramics\" (the study of cracked pots) (*note: this function is currently under development - :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry <- rorcid::orcid_search(family_name = 'carberry',\n",
    "                                 keywords = 'psychoceramics')\n",
    "carberry"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Work title and DOI\n",
    "\n",
    "If you know the name of a work (i.e. article, book chapter, etc.) or its DOI, you can obtain the associated ORCID iD by using the `work_title` or `digital_object_ids` **if and only if** the authos have added it to their ORCID profile.\n",
    "\n",
    "We will search for the article [\"Building Software Building Community: Lessons from the rOpenSci Project\"](https://openresearchsoftware.metajnl.com/articles/10.5334/jors.bu/print/). Notice how the title is in double quotes, inside of single quotes, and the colon is removed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "ropensci1 <- rorcid::orcid_search(work_title = '\"Building Software Building Community\"')\n",
    "ropensci1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you get an error message here, try to simply cut and paste the `'\"Building Software Building Community\"'` part of the argument out and back in. I'm not sure why that is happening or why that works.\n",
    "\n",
    "This gives us two authors: Edmund Hart and Scott Chamberlain. Notice we get a different result when we look the same article up by DOI:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "FALSE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "ropensci2 <- rorcid::orcid_search(digital_object_ids = '\"10.5334/jors.bu\"')\n",
    "ropensci2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we navigate to the author profiles on the web, we can see this is because Edmund Hart added the article to his ORCID profile, whereas Scott Chamberlain added the dataset (which has a different DOI)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thus a word of caution when searching by article title or DOI: a significant amount of data in ORCID is manually added or added with incorrect, inconsistent, or incomplete metadata. In other words, the fact that you didn't get results or got erroneous results may not be due to errors in your queries, but rather errors in the data itself.\n",
    "\n",
    "## Institutional ID (Ringgold, ISNI) - `ringgold-org-id:`\n",
    "\n",
    "When filling out an ORCID profile, users are encouraged to select their institutions from the drop-down menu, which will ensure it includes the Ringgold ID and any other unique identifiers, such as ISNI, that ORCID has for that institution. Read the ORCID report, [\"Organization identifiers: current provider survey\"](https://orcid.org/sites/default/files/ckfinder/userfiles/files/20161031%20OrgIDProviderSurvey.pdf) to learn more."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "FALSE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry <- rorcid::orcid_search(family_name = 'carberry',\n",
    "                               ringgold_org_id = '5468')\n",
    "carberry"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You have to [register with Ringgold](https://www.ringgold.com/identify-online-guests/) to search in their registry. \n",
    "\n",
    "Sometimes different entities on campus will have separate Ringgold IDs; you may consider contacting Ringgold to get the full list of your institution's identifiers.\n",
    "\n",
    "---\n",
    "**TRY IT YOURSELF**\n",
    "1. Go to your ORCID profile and add an institution. As you start typing the institution, it will autocomplete. Click on the entry to add it to your profile and save. Click the drop down arrow on that institution to see the Ringgold ID. \n",
    "2. Run an `orcid_search` and set `ringgold_org_id` to that Ringgold ID. It will return 10 results. \n",
    "3. Try adding your name to `family_name` to the query to find yourself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Run an `orcid_search` with `ringgold_org_id` set to your institution's Ringgold ID.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Name and email address domain\n",
    "\n",
    "We can search by name and email address domain by using an asterisk followed by the domain name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "FALSE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "clarke <- rorcid::orcid_search(family_name = 'iakovakis',\n",
    "                               email = '*@okstate.edu')\n",
    "clarke"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced searching\n",
    "\n",
    "`orcid_search` is a wrapper for another `rorcid` function--`orcid()`, which allows for a more advanced range of searching, including Boolean OR operators.\n",
    "\n",
    "According to `help(orcid)`:\n",
    "\n",
    "> You can use any of the following within the query statement: given-names, family-name, credit-name, other-names, email, grant-number, patent-number, keyword, worktitle, digital-objectids, current-institution, affiliation-name, current-primary-institution, text, past-institution, peer-review-type, peer-review-role, peer-review-group-id, biography, external-id-type-and-value\n",
    "\n",
    "Note that `current_prim_inst` and `patent_number` parameters have been removed as ORCID has removed them.\n",
    "\n",
    "### Searching with Boolean OR\n",
    "\n",
    "We can combine affiliation names, Ringgold IDs, and email addresses using the `OR` operator to cover all our bases, in case the person or people we are looking for did not hit on of those values. This will return all records that either have a Ringgold of 7618, have an affiliation name of \"Oklahoma State,\" or have an email domain ending in \"\\@okstate.edu.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE",
     "classes": [],
     "eval": "TRUE,",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "clarke <- rorcid::orcid(query = 'family-name:iakovakis AND(ringgold-org-id:7618 OR email:*@okstate.edu OR \n",
    "                           affiliation-org-name:\"Oklahoma State\")')\n",
    "clarke"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get ORCID iDs for everyone at an institution\n",
    "\n",
    "This can also be helpful if you want to cast a very wide net and capture everyone affiliated with your institution who has an ORICID iD. Keep in mind that this searches across **all** of an individuals listed affiliations (employment, education, invited positions, membership & service) past and present. So it has **will** return false positives--in other words, one should not use it to get ORCID iDs of all individuals currently at an institution, because it will include those who previously worked there or got their degree from there.\n",
    "\n",
    "We start by adding all identifying information into its own object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ringgold_id <- \"7618\"\n",
    "email_domain <- \"@okstate.edu\"\n",
    "organization_name <- \"Oklahoma State\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We then use the `glue()` function from the `glue` package to construct a single character vector with each of the parameters in order to use in our API call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_query <- glue('ringgold-org-id:',\n",
    "                 ringgold_id,\n",
    "                 ' OR email:*',\n",
    "                 email_domain,\n",
    "                 ' OR affiliation-org-name:\"',\n",
    "                 organization_name,\n",
    "                 '\"')\n",
    "my_query"
   ]
  },
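  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an alternative sketch, `glue()` can also interpolate objects directly inside a single string when their names are wrapped in curly braces. Assuming the three objects and the `my_query` string defined above, this should build the same query; `my_query_alt` is a name made up for this illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# build the query in one string; {name} is replaced by the value of that object\n",
    "my_query_alt <- glue('ringgold-org-id:{ringgold_id} OR email:*{email_domain} OR affiliation-org-name:\"{organization_name}\"')\n",
    "\n",
    "# should print TRUE if both approaches produce the same query string\n",
    "as.character(my_query_alt) == as.character(my_query)"
   ]
  },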
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we use the `my_query` string in our API call. The maximum number of returned results is 200. Below, since this is being done for instructional purposes, we leave it at 25."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE",
     "classes": [],
     "eval": "TRUE,",
     "id": ""
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "my_osu_orcids <- rorcid::orcid(query = my_query,\n",
    "                               rows = 25)\n",
    "my_osu_orcids"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to retrieve a complete set of all results above 200, we have to write a small function. First, we will wrap our API call in `base::attr` and include a `\"found\"` argument to see how many results are found with that call:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "my_osu_orcid_count <- base::attr(rorcid::orcid(query = my_query),\n",
    "                      \"found\")\n",
    "my_osu_orcid_count"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are 2,336 records at the time of writing. Next we will first create a numeric vector using `seq` that starts with 0 and ends with 1903 (which has been assigned to `my_osu_orcid_count`, adding 200 to each value incrementally."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "my_pages <- seq(from = 0, to = my_osu_orcid_count, by = 200)\n",
    "my_pages"
   ]
  },
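  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what `seq()` produces without depending on the live count, here is a small illustration with a made-up total of 450 records (the number is hypothetical and chosen only to keep the output short, and `example_pages` is a name invented for this sketch). Three requests starting at offsets 0, 200, and 400 would cover all 450 records."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# offsets for a hypothetical result set of 450 records, 200 per request\n",
    "example_pages <- seq(from = 0, to = 450, by = 200)\n",
    "example_pages  # 0 200 400"
   ]
  },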
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we will write a small function using `map` from the `purrr` package. In essence, this takes each value from our `my_pages` vector, and passes it into the `page` argument of the `orcid()` query. In other words, the first loop will get results 0-200, the next will get 200-400, and so on.\n",
    "\n",
    "I am going to comment out (add a hash `#`) to the first line of the below code because I don't want you to run this large query unnecessarily. If you do want to test it out, just remove the hash."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "FALSE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "# my_osu_orcids <- purrr::map(\n",
    "  my_pages,\n",
    "  function(page) {\n",
    "  print(page)\n",
    "  my_orcids <- rorcid::orcid(query = 'ringgold-org-id:7618 OR \n",
    "                               email:*@okstate.edu OR affiliation-org-name:\"Oklahoma State\"',\n",
    "                               rows = 200,\n",
    "                               start = page)\n",
    "  return(my_orcids)\n",
    "  })"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can then use the `bind_rows()` function from `dplyr` to pull the data together into a single dataframe, and the `clean_names()` function from `janitor`, described below, to make the column names easier to handle. We also introduce here the Pipe Operator, which is also described below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "my_osu_orcids_data <- my_osu_orcids %>%\n",
    "  map_dfr(., dplyr::as_tibble) %>%\n",
    "  janitor::clean_names()\n",
    "my_osu_orcids_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course, these are only the ORCID iDs, with no other data. The next sections will describe other functions in `rorcid` to get biographical, employment, and works data from the profiles.\n",
    "\n",
    "# `clean_names()` and `%>%`\n",
    "\n",
    "Often times the columns returned from the ORCID API have a complicated combination of punctuation that can make them hard to use. The `clean_names()` function from the `janitor` package is optional and used only to simplify the column names of the data. It converts all punctuation to underscores, so the field `orcid-identifier.uri` becomes `orcid_identifier_uri`.\n",
    "\n",
    "A [Pipe Operator](https://www.datacamp.com/community/tutorials/pipe-r-tutorial) `%>%`. A pipe takes the output of one statement and makes it the input of the next statement. You can think of it as \"then\" in natural language. So the above script first runs the `orcid()` API call, then it clean the column names of the data that was pulled into R as a result of that call. So for example, in the expression above, we first call up the `my_osu_orcids` data. We then apply the `bind_rows()` function to pull all the data together into a single data frame, and then clean the names with `clean_names()`.\n",
    "\n"
   ]
  },
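  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both ideas can be tried on a toy example. The data frame below is made up for illustration (it is not real API output); only its column name mimics the `orcid-identifier.uri` field mentioned above, and `toy` is a name invented for this sketch. The piped and nested versions should print the same cleaned column name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# a made-up one-row data frame with a punctuation-heavy column name\n",
    "toy <- data.frame(`orcid-identifier.uri` = \"https://orcid.org/0000-0002-1825-0097\",\n",
    "                  check.names = FALSE)\n",
    "\n",
    "# with the pipe: take toy, then clean the names, then list them\n",
    "toy %>% janitor::clean_names() %>% names()\n",
    "\n",
    "# the same thing written without the pipe, as nested calls\n",
    "names(janitor::clean_names(toy))"
   ]
  },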
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "**TRY IT YOURSELF**\n",
    "\n",
    "Do this exercise when you have some time and are prepared. We are going to get a set of ORCID iDs for your institution.\n",
    "\n",
    "First, construct a query, replacing the below `REPLACEME` values. If you do not know your institution's Ringgold ID, you can look it up at <https://ido.ringgold.com>. Unfortunately, you do have to register for that. If you don't want to, just remove the argument below. However, you may be excluding a number of potential matches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ringgold_id <- \"REPLACEME\"\n",
    "email_domain <- \"REPLACEME\"\n",
    "organization_name <- \"REPLACEME\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_inst_query <- glue('ringgold-org-id:',\n",
    "                 ringgold_id,\n",
    "                 ' OR email:*',\n",
    "                 email_domain,\n",
    "                 ' OR affiliation-org-name:\"',\n",
    "                 organization_name,\n",
    "                 '\"')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. Now look up the total number of ORCID IDs in the system by using the code above wrapped in `base::attr()`. Set `query = my_inst_query`. Assign it to a variable `my_inst_count`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3. Construct a vector from zero to that number, counting by 200 by using `seq()`, as we did above. Look at the help page for `seq()` if you need a reminder of how it works. Assign this to `my_inst_pages`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4. Run the loop below to obtain your set of ORCID iDs. This may take a few minutes. Watch for the asterisk to disappear on the left side."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_inst_orcids <- purrr::map(\n",
    "  my_inst_pages,\n",
    "  function(page) {\n",
    "  my_orcids <- rorcid::orcid(query = my_inst_query,\n",
    "                               rows = 200,\n",
    "                               start = page)\n",
    "  return(my_orcids)\n",
    "  })"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5. Use the `bind_rows()` function from `dplyr` to pull the data together into a single dataframe, and the `clean_names()` function from `janitor`, to make the column names easier to handle. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_orcids_data <- my_inst_orcids %>%\n",
    "  map_dfr(., dplyr::as_tibble) %>%\n",
    "  janitor::clean_names()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember this is just ORCID iDs with no data, and likely containing a number of false positives. For now, save this to a CSV in your Binder session. Read the section below to see how to download that CSV to your own personal computer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_csv(my_orcids_data, \"./data/my_orcids_data.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Saving and downloading files in Binder\n",
    "\n",
    "You can save files while in a Binder session, but you will need to download them before you close the session down. The CSV file we just saved is now available if you click **File > Open** here in your Jupyter Notebook and navigate to the **data** folder. There, you can check the box next to the file and click the **Download** button at the top of the page. Note that this file will disappear when you close down your Binder session.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Finding biographical information\n",
    "\n",
    "The `orcid()` function gets the IDs, but no information about the person. For that, you will need to use `orcid_person()`.\n",
    "\n",
    "Unlike `orcid()`, `orcid_person()` does not take a query; it accepts only ORICID iDs in the form XXXX-XXXX-XXXX-XXXX. So we can get the ORICID iD itself into it's own vector. We can then pass that argument on to `orcid_person()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE",
     "classes": [],
     "eval": "TRUE,",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry_orcid <- \"0000-0002-1825-0097\"\n",
    "carberry_person <- rorcid::orcid_person(carberry_orcid)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This data comes back as JSON, which is essentially a bunch of nested data. Have a look at <https://ciakovx.github.io/rorcid.html#Finding_biographical_information> to interact with the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see the names of the top-level elements of the list by running `names(carberry_person[[1]])`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE,",
     "classes": [],
     "echo": "FALSE",
     "eval": "TRUE,",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "names(carberry_person[[1]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you click the drop-down arrow next to name. You can run `names(carberry_person[[1]]$name)` to see the names of those elements. While this is great data, if we want to run some analysis on it, we need to get it into a nice, tidy data frame. \n",
    "\n",
    "## Getting the data into a data frame\n",
    "\n",
    "This is not an easy or straightforward process. I provide below one strategy to get some of the relevant data, using `map` functions from the `purrr` package and building a `tibble` (the `tidyverse`'s more efficient data frame) piece by piece."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE",
     "classes": [],
     "eval": "TRUE,",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry_data <- carberry_person %>% {\n",
    "    dplyr::tibble(\n",
    "      created_date = purrr::map_dbl(., purrr::pluck, \"name\", \"created-date\", \"value\", .default=NA_character_),\n",
    "      given_name = purrr::map_chr(., purrr::pluck, \"name\", \"given-names\", \"value\", .default=NA_character_),\n",
    "      family_name = purrr::map_chr(., purrr::pluck, \"name\", \"family-name\", \"value\", .default=NA_character_),\n",
    "      credit_name = purrr::map_chr(., purrr::pluck, \"name\", \"credit-name\", \"value\", .default=NA_character_),\n",
    "      other_names = purrr::map(., purrr::pluck, \"other-names\", \"other-name\", \"content\", .default=NA_character_),\n",
    "      orcid_identifier_path = purrr::map_chr(., purrr::pluck, \"name\", \"path\", .default = NA_character_),\n",
    "      biography = purrr::map_chr(., purrr::pluck, \"biography\", \"content\", .default=NA_character_),\n",
    "      researcher_urls = purrr::map(., purrr::pluck, \"researcher-urls\", \"researcher-url\", .default=NA_character_),\n",
    "      emails = purrr::map(., purrr::pluck, \"emails\", \"email\", \"email\", .default=NA_character_),\n",
    "      keywords = purrr::map(., purrr::pluck, \"keywords\", \"keyword\", \"content\", .default=NA_character_),\n",
    "      external_ids = purrr::map(., purrr::pluck, \"external-identifiers\", \"external-identifier\", .default=NA_character_)\n",
    "    )\n",
    "  } \n",
    "carberry_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* The **created_date** comes from within the name list, so technically it is the date the name was created, not the date the ORCID account was created, which is not available in this data. This is plucked with `map_dbl()` because it is in `double` format (a numeric data type in R) . \n",
    "* The **given_name**, **family_name**, **credit_name**, **orcid_identifier_path**, and **biography** are plucked with `map_chr()` because they are both `character` types.\n",
    "* The **other_names**, **keywords**, **researcher_urls**, and **external_ids** are plucked with `map()` because there may be multiple values (unlike the other names, in which you can only have one). For example, someone may have multiple other names, or multiple keywords. So this will return a nested list to the tibble; we will discuss below how to unnest it.\n",
    "\n",
    "Each of these functions includes a `.default = NA_character_` argument because if the value is NULL (if the ORCID author didn't input the information) then it will convert that NULL to NA.\n",
    "\n",
    "## Fixing dates\n",
    "\n",
    "View the created date by running `carberry_data$created_date` and you will see this is a number, not a date:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE,echo=FALSE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry_data$created_date"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dates are in [Unix time](https://en.wikipedia.org/wiki/Unix_time), which is the number of seconds that have elapsed since January 1, 1970. In ORCID, this is in milliseconds. We can use the `anytime()` function from the `anytime` package created by Dirk Eddelbuettel to convert it and return a POSIXct object. You have to divide it by 1000 because it's in milliseconds. Below we use the `mutate()` function from `dplyr` to overwrite the `created_date` and `last_modified_date` UNIX time with the human readable POSIXct dates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry_datesAndTimes <- carberry_data %>%\n",
    "  dplyr::mutate(created_date = anytime::anytime(created_date/1000))\n",
    "carberry_datesAndTimes$created_date"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That looks much better: April 15th, 2016 at 5:00 PM and 17 seconds Central Daylight Time.\n",
    "\n",
    "If you'd prefer to do away with the time altogether (and keep only the month/day/year), you can use `anydate()` instead of `anytime()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry_datesOnly <- carberry_data %>%\n",
    "  dplyr::mutate(created_date = anytime::anydate(created_date/1000))\n",
    "carberry_datesOnly$created_date"
   ]
  },
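  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The arithmetic can also be checked without the `anytime` package. The sketch below uses only base R and a made-up millisecond value (not taken from any ORCID record). Setting `tz = \"UTC\"` is an assumption made here so that the result does not depend on your computer's time zone.\n",
    "\n",
    "```r\n",
    "# Hypothetical timestamp in milliseconds since 1970-01-01 00:00:00 UTC\n",
    "ms <- 1e12\n",
    "\n",
    "# Divide by 1000 to get seconds, then convert to a date-time in UTC\n",
    "converted <- as.POSIXct(ms / 1000, origin = \"1970-01-01\", tz = \"UTC\")\n",
    "converted_text <- format(converted, \"%Y-%m-%d %H:%M:%S\")\n",
    "converted_text\n",
    "```\n",
    "\n",
    "Forgetting to divide by 1000 would treat the milliseconds as seconds and produce a date tens of thousands of years in the future."
   ]
  },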
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check out the `lubridate` package for more you can do with dates. It is installed with `tidyverse`, but not loaded, so you have to load it with its own call to `library()` (we did this at the beginning of the session). For example, you may be more interested in year of creation than month. So after you run the conversion with `anytime`, you can create year variables with `mutate()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry_years <- carberry_datesOnly %>%\n",
    "  dplyr::mutate(created_year = lubridate::year(created_date))\n",
    "carberry_years$created_year"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Unnesting nested lists\n",
    "\n",
    "There are nested lists in this data frame that can be unnested. The **other_names** and **keywords** values are character vectors, while the **researcher_urls** and **external_ids** values are data frames themselves. We can use the `unnest()` function from the `tidyr` package to unnest both types. In other words, this will make each element of the list its own row. For instance, since there are two keywords for carberry (\"psychoceramics\" and \"ionian philology\"), there will now be two rows that are otherwise identical except for the keywords column:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry_keywords <- carberry_data %>%\n",
    "  tidyr::unnest(keywords)\n",
    "carberry_keywords$keywords"
   ]
  },
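  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the row-per-element behavior in isolation, here is a small sketch with a made-up tibble (the `toy` data and its values are invented for illustration and are not ORCID data):\n",
    "\n",
    "```r\n",
    "# Made-up data: one person with a list column holding two keywords\n",
    "toy <- dplyr::tibble(person = \"A\", keywords = list(c(\"alpha\", \"beta\")))\n",
    "\n",
    "# Unnesting gives each keyword its own row and repeats the other columns\n",
    "toy_long <- tidyr::unnest(toy, keywords)\n",
    "toy_long\n",
    "nrow(toy_long)\n",
    "```\n",
    "\n",
    "The single input row becomes two rows, one per keyword, with `person` repeated in each."
   ]
  },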
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see which columns are lists by calling `is_list()` in the `map_lgl()` function (this will return a TRUE/FALSE for each column that is a list), and subsetting the `names()` of `carberry_data` by those values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry_list_columns <- map_lgl(carberry_data, is_list)\n",
    "names(carberry_data)[carberry_list_columns]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now `unnest()` keywords. This will give us two observations (i.e., rows) because there are 2 keywords. And if we look in the **keywords** column all the way to the right, we see that each keyword is by itself. The column has been unnested. All other columns are intact.\n",
    "\n",
    "There is an argument to `unnest()` called `.drop`, which, if set to `TRUE`, will remove all additional list columns. If you want to keep them, just set it to `FALSE` Note, however, that it will *not* unnest them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry_keywords <- carberry_data %>%\n",
    "  tidyr::unnest(keywords, .drop = FALSE)\n",
    "carberry_keywords"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can unnest multiple nested columns, but keep in mind that this will multiply the duplicated columns in your data frame, because there will be it is spreading the key-value pairs across multiple columns. For more on wide and long data, read Hadley Wickham's paper [\"Tidy data,\"](https://vita.had.co.nz/papers/tidy-data.html) published in *The Journal of Statistical Software.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry_keywords_otherNames <- carberry_data %>%\n",
    "  tidyr::unnest(keywords, .drop = FALSE) %>%\n",
    "  tidyr::unnest(other_names, .drop = FALSE)\n",
    "carberry_keywords_otherNames"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we unnest *researcher_urls* or *external_ids*, we will see many more columns added. That is because each of these nested lists contains multiple variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry_researcherURLs <- carberry_data %>%\n",
    "  tidyr::unnest(researcher_urls, .drop = FALSE)\n",
    "carberry_researcherURLs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Carberry has two URLs: his Wikipedia page and a page about him on the Brown University Library. So a row is created for each of these URLs, and multiple columns are added such as the last modified date, the url value, and so on. You can keep or remove columns you don't want using `select()` from the `dplyr` package."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Writing to CSV\n",
    "\n",
    "We will use the `write_csv()` function from the `readr` package to write our data to disk. This package was loaded when you called `library(tidyverse)` at the beginning of the session.\n",
    "\n",
    "With a typical data frame, you can simply write the `carberry_data` data frame to a CSV with the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "FALSE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "write_csv(carberry_data, \"data/carberry_data.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The problem is, due to the nested lists we described above, R will throw an error: `\"Error in stream_delim_(df, path, ...) : Don't know how to handle vector of type list.\"`\n",
    "\n",
    "You have a few choices:\n",
    "\n",
    "1. You can unnest one of the columns and leave `.drop` set to `TRUE`. This will add rows for all the values in the nested lists, and drop the additional nested lists. \n",
    "\n",
    "Please note that this will write the data to your local Binder repo, but it will not be saved in Binder. If you want to view it on your own computer, download it from Binder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "FALSE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry_data_keywords <- carberry_data %>%\n",
    "  tidyr::unnest(keywords, .drop = TRUE)\n",
    "write_csv(carberry_data_keywords, \"data/carberry_data_keywords.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. You can drop the nested lists altogether using a combination of `select_if()` from `dplyr` and `negate()` from `purrr` to drop all lists in the data frame. This is essentially saying, only keep the columns that are not lists. In this example, the number of variables falls to 6, since we have 5 list columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "FALSE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry_data_short <- carberry_data %>%\n",
    "  dplyr::select_if(purrr::negate(is_list))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "**TRY IT YOURSELF**\n",
    "\n",
    "1. Use `orcid_person` to get your ORCID biographical data. If you do not have an ORCID profile, use [Scott Chamberlain's](https://orcid.org/0000-0003-1444-9135). Assign it to `my_orcid_person`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use the code below to create a flat data frame of your biographical data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_orcid_person_data <- my_orcid_person %>% {\n",
    "    dplyr::tibble(\n",
    "      created_date = purrr::map_dbl(., purrr::pluck, \"name\", \"created-date\", \"value\", .default=NA_character_),\n",
    "      given_name = purrr::map_chr(., purrr::pluck, \"name\", \"given-names\", \"value\", .default=NA_character_),\n",
    "      family_name = purrr::map_chr(., purrr::pluck, \"name\", \"family-name\", \"value\", .default=NA_character_),\n",
    "      credit_name = purrr::map_chr(., purrr::pluck, \"name\", \"credit-name\", \"value\", .default=NA_character_),\n",
    "      other_names = purrr::map(., purrr::pluck, \"other-names\", \"other-name\", \"content\", .default=NA_character_),\n",
    "      orcid_identifier_path = purrr::map_chr(., purrr::pluck, \"name\", \"path\", .default = NA_character_),\n",
    "      biography = purrr::map_chr(., purrr::pluck, \"biography\", \"content\", .default=NA_character_),\n",
    "      researcher_urls = purrr::map(., purrr::pluck, \"researcher-urls\", \"researcher-url\", .default=NA_character_),\n",
    "      emails = purrr::map(., purrr::pluck, \"emails\", \"email\", \"email\", .default=NA_character_),\n",
    "      keywords = purrr::map(., purrr::pluck, \"keywords\", \"keyword\", \"content\", .default=NA_character_),\n",
    "      external_ids = purrr::map(., purrr::pluck, \"external-identifiers\", \"external-identifier\", .default=NA_character_)\n",
    "    )\n",
    "  } %>%\n",
    "  dplyr::mutate(created_date = anytime::anydate(created_date/1000))\n",
    "my_orcid_person_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optionally, save `my_orcid_person_data` as a CSV to your Binder directory using `write_csv()`, and download it to your computer following the steps described above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting data on multiple people with `orcid_person()`\n",
    "\n",
    "## Searching by ORICID iDs\n",
    "\n",
    "`orcid_person()` is vectorized, so you can pass in multiple ORICID iDs and it will return a list of results for each ID, with each element named by the ORICID iD. See <https://ciakovx.github.io/rorcid.html#Getting_data_on_multiple_people_with_orcid_person()> to see this in list view."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "my_orcids <- c(\"0000-0002-1825-0097\", \"0000-0002-9260-8456\")\n",
    "my_orcid_person <- rorcid::orcid_person(my_orcids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that we are given a list of 2, each containing the person data. We can put this into a data frame using the same code as above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "my_orcid_person_data <- my_orcid_person %>% {\n",
    "    dplyr::tibble(\n",
    "      created_date = purrr::map_dbl(., purrr::pluck, \"name\", \"created-date\", \"value\", .default=NA_character_),\n",
    "      given_name = purrr::map_chr(., purrr::pluck, \"name\", \"given-names\", \"value\", .default=NA_character_),\n",
    "      family_name = purrr::map_chr(., purrr::pluck, \"name\", \"family-name\", \"value\", .default=NA_character_),\n",
    "      credit_name = purrr::map_chr(., purrr::pluck, \"name\", \"credit-name\", \"value\", .default=NA_character_),\n",
    "      other_names = purrr::map(., purrr::pluck, \"other-names\", \"other-name\", \"content\", .default=NA_character_),\n",
    "      orcid_identifier_path = purrr::map_chr(., purrr::pluck, \"name\", \"path\", .default = NA_character_),\n",
    "      biography = purrr::map_chr(., purrr::pluck, \"biography\", \"content\", .default=NA_character_),\n",
    "      researcher_urls = purrr::map(., purrr::pluck, \"researcher-urls\", \"researcher-url\", .default=NA_character_),\n",
    "      emails = purrr::map(., purrr::pluck, \"emails\", \"email\", \"email\", .default=NA_character_),\n",
    "      keywords = purrr::map(., purrr::pluck, \"keywords\", \"keyword\", \"content\", .default=NA_character_),\n",
    "      external_ids = purrr::map(., purrr::pluck, \"external-identifiers\", \"external-identifier\", .default=NA_character_)\n",
    "    )\n",
    "  } %>%\n",
    "  dplyr::mutate(created_date = anytime::anydate(created_date/1000))\n",
    "my_orcid_person_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have a nice, neat dataframe of both people's ORCID name data.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "**TRY IT YOURSELF**\n",
    "\n",
    "We will now look up person data for the people in your institution using the `my_orcids_data` file we created above. Please only do this if you want to retrieve biographical data for everyone in your institution.\n",
    "\n",
    "If you have not closed Binder since you created the `my_orcids_data` file, proceed as below. \n",
    "\n",
    "If you have already gathered your institution's ORCID iDs and saved them to your computer, then please upload the saved file rather than run the code again. To do that, click **File > Open** here in Jupyter Notebooks. It will open the Binder directory in a new tab. Click into the **data** folder. Click **Upload** and upload the file. Remember, Binder sessions are *temporary*. As soon as you close down Binder, the file will be deleted. After you have uploaded the file, return to this notebook and enter the following code below: \n",
    "\n",
    "`my_orcids_data <- read_csv(\"data/my_orcids_data.csv\")`\n",
    "\n",
    "We will use the **orcid_identifier_path** variable to look up biographical information for the set of people in this data. In other words, pass `my_orcids_data$orcid_identifier_path` to `orcid_person()`. Assign it to `my_inst_person`. Depending on how large your file is, this may take several minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use our flattening code to create a nice data frame of this data. Assign it to `my_orcids_person_data`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "my_orcids_person_data <- my_inst_person %>% {\n",
    "    dplyr::tibble(\n",
    "      created_date = purrr::map_dbl(., purrr::pluck, \"name\", \"created-date\", \"value\", .default=NA_character_),\n",
    "      given_name = purrr::map_chr(., purrr::pluck, \"name\", \"given-names\", \"value\", .default=NA_character_),\n",
    "      family_name = purrr::map_chr(., purrr::pluck, \"name\", \"family-name\", \"value\", .default=NA_character_),\n",
    "      credit_name = purrr::map_chr(., purrr::pluck, \"name\", \"credit-name\", \"value\", .default=NA_character_),\n",
    "      other_names = purrr::map(., purrr::pluck, \"other-names\", \"other-name\", \"content\", .default=NA_character_),\n",
    "      orcid_identifier_path = purrr::map_chr(., purrr::pluck, \"name\", \"path\", .default = NA_character_),\n",
    "      biography = purrr::map_chr(., purrr::pluck, \"biography\", \"content\", .default=NA_character_),\n",
    "      researcher_urls = purrr::map(., purrr::pluck, \"researcher-urls\", \"researcher-url\", .default=NA_character_),\n",
    "      emails = purrr::map(., purrr::pluck, \"emails\", \"email\", \"email\", .default=NA_character_),\n",
    "      keywords = purrr::map(., purrr::pluck, \"keywords\", \"keyword\", \"content\", .default=NA_character_),\n",
    "      external_ids = purrr::map(., purrr::pluck, \"external-identifiers\", \"external-identifier\", .default=NA_character_)\n",
    "    )\n",
    "  } %>%\n",
    "  dplyr::mutate(created_date = anytime::anydate(created_date/1000))\n",
    "my_orcids_person_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write this to a CSV, and make sure you download and retrieve that CSV before you close out of your Binder session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_csv(my_orcids_person_data, \"data/my_orcids_person_data.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Searching by names\n",
    "\n",
    "When we want data on multiple people and have only their names, we can build a query. \n",
    "\n",
    "Now we can build a query that will work with the `given-names:` and `family-name:` arguments to `query` in `orcid` in order to get the ORICID iDs:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "profs <- dplyr::tibble(\"FirstName\" = c(\"Josiah\", \"Clarke\"),\n",
    "                     \"LastName\" = c(\"Carberry\", \"Iakovakis\"))\n",
    "profs\n",
    "orcid_query <- paste0(\"given-names:\",\n",
    "                      profs$FirstName,\n",
    "                      \" AND family-name:\",\n",
    "                      profs$LastName)\n",
    "orcid_query"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This returns a vector with two queries formatted for nice insertion into `rorcid::orcid()`. We can use `purr::map()` to create a loop. What this is saying is, take each element of `orcid_query` and run a function with it that prints it to the console and runs `rorcid::orcid()` on it, then return each result to `my_orcids_list().` This returns a list of two items. We can then wrap `as_tibble()` in `map_dfr` to create a data frame from those list elements.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_orcids_df <- purrr::map(\n",
    "  orcid_query,\n",
    "  function(x) {\n",
    "    print(x)\n",
    "    orc <- rorcid::orcid(x)\n",
    "  }\n",
    "  ) %>%\n",
    "    map_dfr(., dplyr::as_tibble) %>%\n",
    "  janitor::clean_names()\n",
    "my_orcids_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First we want to remove the Carberry row that we don't want (remember that there are two Carberry ORCID accounts, and one doesn't have much data in it). We can do this using the `filter()` function from `dplyr` and the `!=` symbol, which is equivalent to \"is not equal to.\"\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_orcids_df <- my_orcids_df %>%\n",
    "  dplyr::filter(orcid_identifier_path != \"0000-0002-1028-6941\")\n",
    "my_orcids_df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a data frame of two items. , grab the ORICID iDs, and run the same function we ran above in order to get the name data and the IDs into a single data frame. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_orcids <- my_orcids_df$orcid_identifier_path\n",
    "my_orcid_person <- rorcid::orcid_person(my_orcids)\n",
    "my_orcid_person_data <- my_orcid_person %>% {\n",
    "    dplyr::tibble(\n",
    "      created_date = purrr::map_dbl(., purrr::pluck, \"name\", \"created-date\", \"value\", .default=NA_character_),\n",
    "      given_name = purrr::map_chr(., purrr::pluck, \"name\", \"given-names\", \"value\", .default=NA_character_),\n",
    "      family_name = purrr::map_chr(., purrr::pluck, \"name\", \"family-name\", \"value\", .default=NA_character_),\n",
    "      credit_name = purrr::map_chr(., purrr::pluck, \"name\", \"credit-name\", \"value\", .default=NA_character_),\n",
    "      other_names = purrr::map(., purrr::pluck, \"other-names\", \"other-name\", \"content\", .default=NA_character_),\n",
    "      orcid_identifier_path = purrr::map_chr(., purrr::pluck, \"name\", \"path\", .default = NA_character_),\n",
    "      biography = purrr::map_chr(., purrr::pluck, \"biography\", \"content\", .default=NA_character_),\n",
    "      researcher_urls = purrr::map(., purrr::pluck, \"researcher-urls\", \"researcher-url\", .default=NA_character_),\n",
    "      emails = purrr::map(., purrr::pluck, \"emails\", \"email\", \"email\", .default=NA_character_),\n",
    "      keywords = purrr::map(., purrr::pluck, \"keywords\", \"keyword\", \"content\", .default=NA_character_),\n",
    "      external_ids = purrr::map(., purrr::pluck, \"external-identifiers\", \"external-identifier\", .default=NA_character_)\n",
    "    )\n",
    "} %>%\n",
    "  dplyr::mutate(created_date = anytime::anydate(created_date/1000))\n",
    "my_orcid_person_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will be exactly the same thing as we saw above, however we got it from a simple vector of names.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Splitting Names\n",
    "\n",
    "If the names you have are not already separated into first and last name variables, here is a trick to do that:\n",
    "\n",
    "Create a tibble using the `tibble()` function from the `dplyr` package. Then, use the `extract()` function from `tidyr`, along with some [regular expressions](https://www.rexegg.com/regex-quickstart.html), to create a first and last name variable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_names <- dplyr::tibble(\"name\" = c(\"Josiah Carberry\", \"Clarke Iakovakis\"))\n",
    "my_names\n",
    "my_clean_names <- my_names %>%\n",
    "  tidyr::extract(name, c(\"FirstName\", \"LastName\"), \"([^ ]+) (.*)\")\n",
    "my_clean_names\n"
   ]
  },
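  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the pattern `([^ ]+) (.*)` captures, here is a sketch in base R with a made-up name. The first group takes everything up to the first space, and the second group takes the remainder, so multi-word surnames stay together. Base R's `sub()` stands in for `extract()` here purely for illustration.\n",
    "\n",
    "```r\n",
    "# Made-up name with a multi-word surname\n",
    "full_name <- \"Ludwig van Beethoven\"\n",
    "pattern <- \"([^ ]+) (.*)\"\n",
    "\n",
    "# Group 1 is the text before the first space; group 2 is everything after it\n",
    "first <- sub(pattern, \"\\\\1\", full_name)\n",
    "last <- sub(pattern, \"\\\\2\", full_name)\n",
    "c(first, last)\n",
    "```\n",
    "\n",
    "Names with a multi-word given name (for example \"Mary Ann Smith\") would be split after the first word, so check the results before querying."
   ]
  },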
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Unnesting\n",
    "\n",
    "Again, we can unnest if we wish, knowing we'll multiply the number of rows even more now, because we have more values. For instance, if we unnest keywords, we'll now have 5 columns (2 keywords for carberry, and 3 keywords for iakovakis):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "my_orcid_person_keywords <- my_orcid_person_data %>%\n",
    "  tidyr::unnest(keywords)\n",
    "my_orcid_person_keywords"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can write this data to CSV using one of the three strategies outlined above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting employment data\n",
    "\n",
    "In addition to biographical data, we can also get employment data with `orcid_employments()`. View the returned data at <https://ciakovx.github.io/rorcid.html#Getting_employment_data_for_an_individual>\n",
    "\n",
    "## Getting employment data for an individual"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE",
     "classes": [],
     "eval": "TRUE,",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "clarke_employment <- rorcid::orcid_employments(orcid = \"0000-0002-9260-8456\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again it comes in a series of nested lists, but we'll just `pluck()` what we need and use `flatten_dfr()` to flatten the lists into a data frame. We will also use the `anydate()` function to go ahead and convert the dates while we're at it."
   ]
  },
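  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The division by 1000 used throughout this walkthrough implies that these timestamps are stored as milliseconds since the Unix epoch (1970-01-01). As a rough sketch under that assumption, here is the same conversion in base R with the unit and time zone stated explicitly. The value `1500000000000` is made up for illustration and is not taken from any ORCID record:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical timestamp, assumed to be milliseconds since 1970-01-01 UTC\n",
    "toy_created_ms <- 1500000000000\n",
    "\n",
    "# Convert milliseconds to seconds, build a date-time, then keep only the date\n",
    "toy_created_time <- as.POSIXct(toy_created_ms / 1000, origin = \"1970-01-01\", tz = \"UTC\")\n",
    "as.Date(toy_created_time)"
   ]
  },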
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE",
     "classes": [],
     "eval": "TRUE,",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "clarke_employment_data <- clarke_employment %>%\n",
    "  purrr::map(., purrr::pluck, \"affiliation-group\", \"summaries\") %>% \n",
    "  purrr::flatten_dfr() %>%\n",
    "  janitor::clean_names() %>%\n",
    "  dplyr::mutate(employment_summary_end_date = anytime::anydate(employment_summary_end_date/1000),\n",
    "                employment_summary_created_date_value = anytime::anydate(employment_summary_created_date_value/1000),\n",
    "                employment_summary_last_modified_date_value = anytime::anydate(employment_summary_last_modified_date_value/1000))\n",
    "clarke_employment_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The column names are pretty messy here. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "names(clarke_employment_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll clean them up a bit by using the `str_replace()` function from `stringr`. You can think of this as analogous to Find + Replace in word processing. We take the `names()` of the data, and replace each of the phrases with nothing (i.e. the set of empty quotes)."
   ]
  },
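  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before applying this to the real data, it may help to see `str_replace()` on a small made-up vector. The names in `toy_names` are hypothetical and only imitate the style of the cleaned column names. Note that `str_replace()` replaces only the first match in each string; `stringr::str_replace_all()` replaces every match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Made-up column names in the style produced by janitor::clean_names()\n",
    "toy_names <- c(\"employment_summary_start_date_year_value\",\n",
    "               \"employment_summary_source_source_name_value\")\n",
    "\n",
    "# Each call removes the first match of the pattern in every element\n",
    "step_one <- stringr::str_replace(toy_names, \"employment_summary_\", \"\")\n",
    "step_two <- stringr::str_replace(step_one, \"source_source_\", \"\")\n",
    "step_two"
   ]
  },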
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "names(clarke_employment_data) <- names(clarke_employment_data) %>%\n",
    "  stringr::str_replace(., \"employment_summary_\", \"\") %>%\n",
    "  stringr::str_replace(., \"source_source_\", \"\") %>%\n",
    "  stringr::str_replace(., \"organization_disambiguated_\", \"\")\n",
    "names(clarke_employment_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Much better.\n",
    "\n",
    "If you take a look at the data, you will see that there is no variable indicating whether these are current or past institutions of employment. In fact, the only way to check if the institution is a place of current employment is if the `employment_summary_end_date_year_value` is `NA`. Keep in mind that start and end dates are not required fields, and we can't be certain that people are updating their profiles. However, we can get a data frame of only those items meeting this criteria by using the `filter()` function from `dplyr`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE",
     "classes": [],
     "eval": "TRUE,",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "clarke_employment_data_current <- clarke_employment_data %>%\n",
    "  dplyr::filter(is.na(end_date_year_value))\n",
    "clarke_employment_data_current"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This will remove my previous two institutions, and keep only my current one: Oklahoma State University."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting employment data for multiple people\n",
    "\n",
    "Because `orcid_employments()` is vectorized, we can feed it multiple ORCID iDs and it will return data for the entire lot, if the individuals have added it. \n",
    "\n",
    "I'll grab a random assortment of OSU ORCID iDs from the previous section. Recall that these were pulled based on the detection of OSU data in either Ringgold, email, or affiliation names across all fields. We will issue the initial API call with `orcid_employments()`, then put it into a data frame with the same function we used above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE",
     "classes": [],
     "eval": "TRUE,",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "my_osu_orcid_ids <- c(\"0000-0002-6160-9587\", \"0000-0001-8330-8251\", \"0000-0003-2863-6724\", \"0000-0001-6810-5560\", \"0000-0003-1935-9729\", \"0000-0002-9088-2312\", \"0000-0001-9792-7870\", \"0000-0003-3959-6916\", \"0000-0002-2621-5320\", \"0000-0001-9103-3040\")\n",
    "my_osu_employment <- rorcid::orcid_employments(my_osu_orcid_ids)\n",
    "my_osu_employment_data <- my_osu_employment %>%\n",
    "  purrr::map(., purrr::pluck, \"affiliation-group\", \"summaries\") %>% \n",
    "  purrr::flatten_dfr() %>%\n",
    "  janitor::clean_names() %>%\n",
    "  dplyr::mutate(employment_summary_end_date = anytime::anydate(employment_summary_end_date/1000),\n",
    "                employment_summary_created_date_value = anytime::anydate(employment_summary_created_date_value/1000),\n",
    "                employment_summary_last_modified_date_value = anytime::anydate(employment_summary_last_modified_date_value/1000))\n",
    "my_osu_employment_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clean up the names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE",
     "classes": [],
     "eval": "TRUE,",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "names(my_osu_employment_data) <- names(my_osu_employment_data) %>%\n",
    "  stringr::str_replace(., \"employment_summary_\", \"\") %>%\n",
    "  stringr::str_replace(., \"source_source_\", \"\") %>%\n",
    "  stringr::str_replace(., \"organization_disambiguated_\", \"\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that this may have multiple entries per person because it gathered their entire employment history. Now let's take a look at the unique organizations in this dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE",
     "classes": [],
     "eval": "TRUE,",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "my_osu_organizations <- my_osu_employment_data %>%\n",
    "  group_by(organization_name) %>%\n",
    "  count() %>%\n",
    "  arrange(desc(n))\n",
    "my_osu_organizations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Out of this set of 10 iDs, only 3 have Oklahoma State University Stillwater listed in their employment, and one has Oklahoma State University, and another has Oklahoma State University - Tulsa. The others may have achieved their degree from OSU, or done some service with OSU. \n",
    "\n",
    "We can make this a bit more manageable by filtering to include only those institutions that include the word \"Oklahoma.\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_osu_organizations_filtered <- my_osu_organizations %>%\n",
    "  filter(str_detect(organization_name, \"Oklahoma\"))\n",
    "my_osu_organizations_filtered"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, out of those, we can decide which ones we want to keep. For instance, we may not want Oklahoma State University - Tulsa."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_osu_employment_data_filtered <- my_osu_employment_data %>%\n",
    "  dplyr::filter(organization_name == \"Oklahoma State University Stillwater\"\n",
    "                | organization_name == \"Oklahoma State University\")\n",
    "my_osu_employment_data_filtered"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, out of those, let's see how many are listed as current employees, as indicated with an `NA` in their `end_date_year_value`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "cache": "TRUE",
     "classes": [],
     "eval": "TRUE,",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "my_osu_employment_data_filtered_current <- my_osu_employment_data_filtered %>%\n",
    "  dplyr::filter(is.na(end_date_year_value))\n",
    "my_osu_employment_data_filtered_current"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like the third row was removed because they have 1997 in their employment end date, and therefore are no longer employed by OSU.\n",
    "\n",
    "Note that this will give you employment records ONLY. In other words, each row represents a single employment record for an individual. The `name_value` variable refers specifically to the name of the person or system that wrote the record, **NOT the name of the individual**. To get that, you must first get all the unique ORCID iDs from the dataset.\n",
    "\n",
    "Problem is, there is actually no distinct value identifying the orcid ID of the person. The `orcid_path` value corresponds to the path of the person who added the employment record (which is usually, but not always the same). Therefore you have to strip out the ORCID iD from the 'path' variable first and put it in it's own value and use it. We do this using str_sub from the stringr package. While we are at it, we can select and reorder the columns we want to keep."
   ]
  },
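  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The extraction can be checked on a single made-up path first. This is only a sketch: the string in `toy_path` is hypothetical (it reuses Josiah Carberry's iD with an invented record number), and it assumes the path always starts with a slash followed by the 19-character iD. The pattern-based alternative does not depend on the position of the iD in the string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical path in the assumed form /<ORCID iD>/employment/<record number>\n",
    "toy_path <- \"/0000-0002-1825-0097/employment/123456\"\n",
    "\n",
    "# Characters 2 through 20 skip the leading slash and cover the 19-character iD\n",
    "stringr::str_sub(toy_path, 2, 20)\n",
    "\n",
    "# Alternative: match the iD by its pattern (the final character may be X)\n",
    "stringr::str_extract(toy_path, \"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]\")"
   ]
  },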
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "osu_current_employment_all <- my_osu_employment_data_filtered_current %>%\n",
    "  mutate(orcid_identifier = str_sub(path, 2, 20)) %>%\n",
    "  select(orcid_identifier, organization_name, organization_address_city,\n",
    "         organization_address_region, organization_address_country,\n",
    "         organization_disambiguated_organization_identifier, organization_disambiguation_source, department_name, role_title, url,\n",
    "         display_index, visibility, created_date_value,\n",
    "         start_date_year_value, start_date_month_value, start_date_day_value,\n",
    "         end_date_year_value, end_date_month_value, end_date_day_value)\n",
    "osu_current_employment_all"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we want to take the next step to join this with biographical information, we create a new vector unique_orcids that includes only `unique()` ORCID iDs from our filtered dataset and remove NA values with `na.omit()`.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "osu_unique_orcids <- unique(osu_current_employment_all$orcid_identifier) %>%\n",
    "  na.omit(.)    \n",
    "osu_unique_orcids"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use `orcid_person()` as above, and construct our `tibble()`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "my_osu_orcid_person <- rorcid::orcid_person(osu_unique_orcids)\n",
    "my_osu_orcid_person_data <- my_osu_orcid_person %>% {\n",
    "  dplyr::tibble(\n",
    "    created_date = purrr::map_chr(., purrr::pluck, \"name\", \"created-date\", \"value\", .default=NA_character_),\n",
    "    given_name = purrr::map_chr(., purrr::pluck, \"name\", \"given-names\", \"value\", .default=NA_character_),\n",
    "    family_name = purrr::map_chr(., purrr::pluck, \"name\", \"family-name\", \"value\", .default=NA_character_),\n",
    "    orcid_identifier_path = purrr::map_chr(., purrr::pluck, \"name\", \"path\", .default = NA_character_))\n",
    "} %>%\n",
    "  dplyr::mutate(created_date = anytime::anydate(as.double(created_date)/1000))\n",
    "my_osu_orcid_person_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can use `left_join()` from `dplyr` to join this biographical data back to the employment data. It will include a single line of data for individual's employment records"
   ]
  },
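  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration of how the join behaves, here is a sketch with two made-up tibbles. All iDs, names, and organizations in `toy_people` and `toy_jobs` are invented. A person with two employment records appears on two rows, and a person with no matching record is kept with `NA` in the employment columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Made-up biographical data: one row per person\n",
    "toy_people <- dplyr::tibble(\n",
    "  orcid_identifier_path = c(\"0000-0000-0000-0001\", \"0000-0000-0000-0002\"),\n",
    "  family_name = c(\"Alpha\", \"Beta\"))\n",
    "\n",
    "# Made-up employment data: two records, both for the first person\n",
    "toy_jobs <- dplyr::tibble(\n",
    "  orcid_identifier = c(\"0000-0000-0000-0001\", \"0000-0000-0000-0001\"),\n",
    "  organization_name = c(\"Org A\", \"Org B\"))\n",
    "\n",
    "# The key columns have different names, so map one onto the other in `by`\n",
    "dplyr::left_join(toy_people, toy_jobs,\n",
    "                 by = c(\"orcid_identifier_path\" = \"orcid_identifier\"))"
   ]
  },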
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "osu_orcid_person_employment_join <- my_osu_orcid_person_data %>%\n",
    "  left_join(osu_current_employment_all, by = c(\"orcid_identifier_path\" = \"orcid_identifier\"))\n",
    "osu_orcid_person_employment_join"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Write this to a CSV, and make sure you download and retrieve the CSV before closing your Binder session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "write_csv(osu_orcid_person_employment_join, \"data/osu_orcid_person_employment_join.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Getting Works with `rorcid::works()` and `rorcid::orcid_works()`\n",
    "\n",
    "## Getting works for an individual\n",
    "\n",
    "There are two functions in `rorcid` to get all of the works associated with an ORICID iD: `orcid_works()` and `works()`. The main difference between these is `orcid_works()` returns a list, with each work as a list item, and each external identifier (e.g. ISSN, DOI) also as a list item. On the other hand, `works()` returns a nice, neat data frame that can be easily exported to a CSV. \n",
    "\n",
    "Like `orcid_person()`, these functions require an ORICID iD, and do not use the query fields we saw with the `orcid()` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "carberry_orcid <- c(\"0000-0002-1825-0097\")\n",
    "carberry_works <- rorcid::works(carberry_orcid) %>%\n",
    "  dplyr::as_tibble() %>%\n",
    "  janitor::clean_names() %>%\n",
    "  dplyr::mutate(created_date_value = anytime::anydate(created_date_value/1000))\n",
    "carberry_works"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dr. Carberry has seven works. Because ORCID data can be manually entered, the integrity, completeness, and consistency of this data will sometimes vary. \n",
    "\n",
    "You can see the **external_ids_external_id** column is actually a nested list, a concept we discussed above. This can be unnested with the `tidyr::unnest()` function. Just as a single researcher can have multiple identifiers, a single work may also have multiple identifiers (e.g., DOI, ISSN, EID). If that is the case, when this column is unnested, there will be repeating rows for those items."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "carberry_works_ids <- carberry_works %>%\n",
    "  tidyr::unnest(external_ids_external_id) %>%\n",
    "  janitor::clean_names()\n",
    "carberry_works_ids"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, we now have 13 observations of 27 variables rather than 7 observations of 24 variables. The extra rows are there because all but one of the works has two external identifiers. The extra columns are there because four new columns were added with the unnest (that's why we had to clean the names again): \n",
    "\n",
    "* **external_id_type** identifies the type of external identifier--see a list of [supported identifiers in ORCID](https://pub.orcid.org/v2.0/identifiers). (required)\n",
    "* **external_id_value**: contains the identifier itself (required)\n",
    "* **external_id_url:**  contains a link the identifier will resolve to (optional)\n",
    "* **external_id_relationship**: indicates if the identifier refers to the item itself (`SELF`), such as a DOI or a person identifier, or a whole that the item is part of (`PART_OF`), such as an ISSN for a journal article.\n",
    "\n",
    "So we can follow one of the three strategies outlined above if we want to write this to a CSV file: 1) unnest the column (as above), 2) drop the nested lists, or 3) mutate them into character vectors. \n",
    "\n",
    "## Getting works for multiple people\n",
    "\n",
    "`orcid::works()` is not vectorized, meaning, if you have multiple ORICID iDs, you can't use it. Instead, you have to pass them to the `orcid::orcid_works()` function. You can view this as a list at <https://ciakovx.github.io/rorcid.html#Getting_works_for_multiple_people>."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "my_orcids <- c(\"0000-0002-1825-0097\", \"0000-0002-9260-8456\", \"0000-0002-2771-9344\")\n",
    "my_works <- rorcid::orcid_works(my_orcids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This returns a list of 3 elements, with the works nested in **group > work-summary**. They can be plucked and flattened into a data frame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "my_works_data <- my_works %>%\n",
    "  purrr::map_dfr(pluck, \"works\") %>%\n",
    "  janitor::clean_names() %>%\n",
    "  dplyr::mutate(created_date_value = anytime::anydate(created_date_value/1000))\n",
    "my_works_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Unnesting external IDs\n",
    "\n",
    "Now you may want to run some analysis using the external identifiers; for instance, you can use the `roadoi` package to look at which DOIs are open access.\n",
    "\n",
    "We run into a problem here when we try to unnest the external IDs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "FALSE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "my_works_externalIDs <- my_works_data %>%\n",
    "  tidyr::unnest(external_ids_external_id)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The error message reads: `\"Error: Each column must either be a list of vectors or a list of data frames [external_ids_external_id]\".` This is because some of the list columns are empty. We can just filter them out before unnesting:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "my_works_externalIDs <- my_works_data %>%\n",
    "  dplyr::filter(!purrr::map_lgl(external_ids_external_id, purrr::is_empty)) %>%\n",
    "  tidyr::unnest(external_ids_external_id)\n",
    "my_works_externalIDs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we want to keep them, there's a workaround: use `map_lgl` to first remove (`filter()` out) the `NULL` `external_id` columns, then `unnest` the ids, then bind back the `NULL` author columns, and finally deselecting the extra `author` and `link` columns as these are no longer in the transformed, unnested data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "attributes": {
     "classes": [],
     "eval": "TRUE,cache=TRUE",
     "id": ""
    }
   },
   "outputs": [],
   "source": [
    "my_works_externalIDs_keep <- my_works_data %>% \n",
    "  dplyr::filter(!purrr::map_lgl(external_ids_external_id, purrr::is_empty)) %>% \n",
    "  tidyr::unnest(external_ids_external_id, .drop = TRUE) %>% \n",
    "  dplyr::bind_rows(my_works_data %>% \n",
    "                     dplyr::filter(map_lgl(external_ids_external_id, is.null)) %>%\n",
    "                     dplyr::select(-external_ids_external_id))\n",
    "my_works_externalIDs_keep"
   ]
  },
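  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same filter, unnest, and bind-back pattern can be tried on a tiny made-up tibble. This is a sketch only: the titles and identifiers in `toy_works` are invented, and the `ids` column stands in for `external_ids_external_id`. The work without identifiers comes back as a single row with `NA` in the identifier columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Made-up works: the first has two identifiers, the second has none\n",
    "toy_works <- dplyr::tibble(\n",
    "  title = c(\"Work A\", \"Work B\"),\n",
    "  ids = list(\n",
    "    dplyr::tibble(type = c(\"doi\", \"issn\"), value = c(\"10.1234/a\", \"1234-5678\")),\n",
    "    list()))\n",
    "\n",
    "# TRUE for rows whose list-column element is not empty\n",
    "has_ids <- !purrr::map_lgl(toy_works$ids, purrr::is_empty)\n",
    "\n",
    "# Unnest only the rows that have identifiers\n",
    "toy_with_ids <- tidyr::unnest(toy_works[has_ids, ], ids)\n",
    "\n",
    "# Set aside the rows without identifiers and drop the empty list-column\n",
    "toy_without_ids <- dplyr::select(toy_works[!has_ids, ], -ids)\n",
    "\n",
    "# Bind the two pieces back together; missing columns are filled with NA\n",
    "dplyr::bind_rows(toy_with_ids, toy_without_ids)"
   ]
  },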
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "**TRY IT YOURSELF**\n",
    "\n",
    "1. Get your own works using `orcid_works()`. If you don't have any, use [Scott Chamberlain's ORCID](https://orcid.org/0000-0003-1444-9135). Use some of the techniques we have learned to unnest columns and write this to disk.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Conclusion\n",
    "\n",
    "The ORCID API is an excellent tool for analyzing research activity on multiple levels. `rorcid` makes gathering and cleaning the data easier. Thanks to both ORCID and Scott Chamberlain for their contributions to the community. Again, read Paul Oldham's excellent post at https://www.pauloldham.net/introduction-to-orcid-with-rorcid/ for more you can do. I hope this walkthrough  helps. If you need to get in touch with me, find my contact info at https://info.library.okstate.edu/clarke-iakovakis."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "4.1.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}