# Binder link to this notebook:

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ciakovx/ciakovx.github.io/master?filepath=rcrossref.ipynb)


# `rcrossref` and `roadoi`

# Licensing

This walkthrough is distributed under a [Creative Commons Attribution 4.0 International (CC BY 4.0) License](https://creativecommons.org/licenses/by/4.0/).

# Load packages
When you download R, it already has a number of functions built in: these make up what is called **Base R**. However, many R users write their own libraries of functions, bundle them together in R **packages**, and provide them to the R community at no charge. This extends the capacity of R and allows us to do much more. In many cases, they improve on the Base R functions by making them easier and more straightforward to use. In addition to `rcrossref` and `roadoi`, we will also be using the `dplyr`, `purrr`, `stringr` and `tidyr` packages. These four packages are part of the [tidyverse](https://www.tidyverse.org/), a collection of R packages designed for data science.

If you are using R and RStudio, you will need to use the `install.packages()` function to install the packages first. I have already installed the packages for you in this Binder repository, so we will simply load them by calling `library()`. Let's also set an option to display a maximum of 100 columns and 20 rows in our Jupyter Notebooks environment, to make printed tables easier to look at.

In [None]:
# load packages
library(rcrossref)
library(roadoi)
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)

# increase number of columns and rows displayed when we print a table
options(repr.matrix.max.cols=100, repr.matrix.max.rows=20)

You can ignore the warning message in pink stating `The following objects are masked from ‘package:stats’: filter, lag`. This simply means that the `stats` package also has functions named `filter()` and `lag()`, and that the `dplyr` functions will *mask* them (i.e., take precedence when you call those names).
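
To see masking in action, here is a small sketch (an illustration added for this walkthrough, not one of the original cells). It checks which `filter()` R finds once `dplyr` is loaded, and shows that the `stats` version is still reachable if you name its package with `::`.

```r
library(dplyr)

# After loading dplyr, the bare name `filter` refers to dplyr's version
filter_is_dplyr <- identical(filter, dplyr::filter)
filter_is_dplyr

# The masked function is still available by naming its package explicitly
moving_avg <- stats::filter(1:5, rep(1/3, 3))
moving_avg
```

This is also why many cells in this tutorial write `dplyr::filter()` or `purrr::pluck()` in full: the prefix makes it unambiguous which package a function comes from.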


# Crossref & `rcrossref`

## Crossref

<div>
    <br>
    <a>
        <img src="images/crossref-logo.png" style="width: 200px;">
    </a>
    <br>
</div>


Crossref is a not-for-profit membership organization dedicated to interlinking scholarly metadata, including journals, books, conference proceedings, working papers, technical reports, data sets, authors, funders, and more. The [Crossref REST API](https://github.com/CrossRef/rest-api-doc) allows anybody to search and reuse members' metadata in a variety of ways. Read [examples of user stories](https://www.crossref.org/services/metadata-delivery/user-stories/).

## `rcrossref`

`rcrossref` is a package developed by [Scott Chamberlain](https://scottchamberlain.info/), Hao Zhu, Najko Jahn, Carl Boettiger, and Karthik Ram, part of the [rOpenSci](https://ropensci.org/) set of packages. rOpenSci is an incredible organization dedicated to open and reproducible research using shared data and reusable software. I strongly recommend you browse their set of packages at https://ropensci.org/packages/.

<div>
    <br>
    <a>
        <img src="images/ropensci.png" style="width: 200px;">
    </a>
    <br>
</div>

`rcrossref` serves as an interface to the Crossref API.

**Key links**

* [rcrossref documentation](https://cran.r-project.org/web/packages/rcrossref/rcrossref.pdf)
* [Crossref REST API documentation](https://github.com/CrossRef/rest-api-doc)
* [Crossref Metadata API JSON Format](https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md)
* Crossref and rcrossref tutorials
    - [my `rcrossref` tutorial](https://ciakovx.github.io/rcrossref.html)
    - rOpenSci [rcrossref tutorial](https://ropensci.org/tutorials/rcrossref_tutorial/)
    - Paul Oldham: ["Accessing the Scientific Literature with Crossref"](https://poldham.github.io/abs/crossref.html)
    - [rcrossref vignette](https://cran.r-project.org/web/packages/rcrossref/vignettes/crossref_vignette.html)

### Setting up `rcrossref`

As described in the documentation, the Crossref team encourages users to send requests with their email address, and will forward you to a dedicated API cluster for improved performance when you share your email with them. Learn more at <https://github.com/CrossRef/rest-api-doc#good-manners--more-reliable-service>. 

To do this in R, replace `name@example.com` below with your own email address. It will now be shared with Crossref whenever you send them an API request. 

In [None]:
Sys.setenv(crossref_email='name@example.com')

# Getting publications from journals with `cr_journals`

`cr_journals()` takes either an ISSN or a general keyword query, and returns metadata for articles published in the journal, including DOI, title, volume, issue, pages, publisher, authors, etc. A full list of publications in Crossref is [available on their website](https://support.crossref.org/hc/en-us/articles/213197226-Browsable-title-list).

## Getting journal details

Crossref is entirely dependent on publishers to supply the metadata. Some fields are required, while others are optional. You may therefore first be interested in what metadata publishers have submitted to Crossref for a given journal. By using `cr_journals` with `works = FALSE`, you can determine who publishes the journal, the total number of articles for the journal in Crossref, whether abstracts are included, whether the full text of articles is deposited, and whether the publisher supplies author affiliations, author ORCID iDs, article licensing data, funders for the article, article references, and a few other items.

Crossref displays some of this data for each publisher openly on the web at <https://www.crossref.org/members/prep/>.

First we will create a new vector `plosone_issn` with the ISSN for the journal *PLoS ONE*. 

In [None]:
# assign the PLoS ISSN
plosone_issn <- '1932-6203'

We will then run `rcrossref::cr_journals()`, setting the ISSN equal to the `plosone_issn` we just created, and print the results.

In [None]:
# get information about the journal
plosone_details <- cr_journals(issn = plosone_issn, works = FALSE)
plosone_details

This actually comes back as a list of three items: `meta`, `data`, and `facets`. The good stuff is in `data`. 

We use the `pluck()` function from the `purrr` package to pull that data only. We will be using `pluck` throughout this tutorial; it's an easy way of indexing deeply and flexibly into lists to extract information.
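
As a quick illustration of what `pluck()` does, the sketch below builds a toy list with the same three element names and indexes into it. The values are made up for the example, and the toy `data` element is a plain list rather than the data frame that `cr_journals()` returns.

```r
library(purrr)

# A toy list shaped like the three-element result described above
# (all values here are invented for illustration)
toy_result <- list(
  meta   = list(total_results = 1),
  data   = list(title = "Example Journal", publisher = "Example Press"),
  facets = NULL
)

# Pull one top-level element by name
toy_data <- pluck(toy_result, "data")

# Index more deeply by supplying further names
toy_publisher <- pluck(toy_result, "data", "publisher")

# The base R equivalent of the deep index
toy_publisher_base <- toy_result[["data"]][["publisher"]]

# A missing element returns NULL instead of raising an error
toy_missing <- pluck(toy_result, "data", "abstract")
```

The last line is the "flexible" part: when an element is absent, `pluck()` quietly gives back `NULL`, which is convenient when metadata fields are optional.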

We don't have time in this tutorial to discuss list items and `purrr`. For an excellent in-depth tutorial, see Jenny Bryan's [Introduction to map(): extract elements](https://ciakovx.github.io/jennybc_lists_lesson.html), reproduced on the course website under the terms of a Creative Commons license.

In [None]:
# get information about the journal and pluck the data at the same time
plosone_details <- rcrossref::cr_journals(issn = plosone_issn, works = FALSE) %>%
  purrr::pluck("data")

This is precisely the same thing as passing the ISSN directly to `cr_journals()`:

In [None]:
cr_journals("1932-6203", works = FALSE) %>% 
    pluck("data")

The `pluck()` call is connected to the `cr_journals()` call with something called a [Pipe Operator](https://www.datacamp.com/community/tutorials/pipe-r-tutorial) `%>%`, which we will be using throughout the tutorial. A pipe takes the output of one statement and immediately makes it the input of the next statement. This saves you from having to store every intermediate result in your R environment. You can think of it as "then" in natural language. So the script that creates `plosone_details` first makes the API call with `cr_journals()`, then applies `pluck()` to extract only the list element called `"data"`, and assigns the result to the `plosone_details` value.
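
If the pipe is new to you, this small sketch may help. It runs the same three steps twice, once as nested function calls and once as a pipeline, using only base R functions on a character vector of ISSNs. The variable names are just for this example.

```r
library(dplyr)  # loading dplyr makes the %>% pipe available

issns <- c("1932-6203", "1538-3598", "0021-8723")

# Nested calls: you have to read from the inside out
nested <- paste(head(sort(issns), 2), collapse = ", ")

# Piped calls: read top to bottom as "sort, then take two, then paste"
piped <- issns %>%
  sort() %>%
  head(2) %>%
  paste(collapse = ", ")

# Both approaches give exactly the same result
identical(nested, piped)
```

Neither version stores the sorted or shortened vectors along the way; the pipe just makes the order of the steps easier to follow.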

We now have a **data frame** including the details Crossref has on file about PLoS ONE. Scroll to the right to see all the columns.

In [None]:
plosone_details

There are a number of ways to explore this data frame:

In [None]:
#display information about the data frame
str(plosone_details)

Type `?str` into the console to read the description of the `str` function. You can call `str()` on an R object to compactly display information about it, including the data type, the number of elements, and a printout of the first few elements.

In [None]:
# dimensions: 1 row, 53 columns
dim(plosone_details)

In [None]:
# number of rows
nrow(plosone_details)

In [None]:
# number of columns
ncol(plosone_details)

In [None]:
# column names
names(plosone_details)

We see this data frame includes one observation of 53 different variables. These include the total number of DOIs; whether deposits of abstracts, ORCID iDs, and article references are current; and other information.

You can use the `$` symbol to work with particular variables. For example, the `publisher` column:

In [None]:
# print the publisher variable
plosone_details$publisher

# return the publisher variable to a new value `myPLOSONEpublisher`
myPLOSONEpublisher <- plosone_details$publisher

The total number of DOIs on file:

In [None]:
# print the total number of DOIs
plosone_details$total_dois

Whether the publisher provides data on funders for articles in the current file (as opposed to the backfile) in Crossref. This is a TRUE/FALSE value, called “logical” in R:

In [None]:
# is funder data current on deposits?
plosone_details$deposits_funders_current

What percentage of articles in Crossref's current file contain at least one funding award number, i.e., a number assigned by the funding organization to identify the specific award or grant?

In [None]:
plosone_details$award_numbers_current * 100

---
**TRY IT YOURSELF**

1. Assign an ISSN for a well-known journal to a new variable in R. Name it whatever you like. You can use the [Scimago Journal Rank](https://www.scimagojr.com/journalrank.php) to look up the ISSN. If you need a couple of examples, try [RUSA](https://www.scimagojr.com/journalsearch.php?q=16004&tip=sid&clean=0) or [Library Hi Tech](https://www.scimagojr.com/journalsearch.php?q=144908&tip=sid&clean=0). Make sure to put the ISSN in quotes to create a character vector.
2. Look up the journal details using `cr_journals`. Make sure to pass the argument `works = FALSE`.
3. Print the data to your console by typing in the value name.


**Does it matter if the ISSN has a hyphen or not? Try both methods.**

In [None]:
# assign an ISSN to a value. Call the value what you want (e.g. plosone_issn)


In [None]:
# look up journal details using the cr_journals function and assign it to a new value (e.g. plosone_details). 
# Remember to include a %>% pipe and call purrr::pluck("data")





In [None]:
# print info about the journal details to the console by typing in the value name inside str()


In [None]:
# how many total DOIs does it have on file?


# what percent of articles in the current file have orcid iDs?


# does this journal provide open references in its current file?




## Getting journal publications by ISSN

To get metadata for the publications themselves rather than data about the journal, we will again use the `plosone_issn` value in the `issn =` argument to `cr_journals`, but we now set `works = TRUE`.  

In [None]:
# get metadata on articles by setting works = TRUE
plosone_publications <- cr_journals(issn = plosone_issn, works = TRUE, limit = 25) %>%
  pluck("data")

Let's walk through this step by step:

* First, we are creating a new value called `plosone_publications`
* We are using the assignment operator `<-` to assign the results of an operation to this new value
* We are running the function `cr_journals()`. It is not necessary to add `rcrossref::` to the beginning of the function.
* We **pass** three arguments to the function:
    * `issn = plosone_issn` : We defined `plosone_issn` earlier in the session as '1932-6203'. We are reusing that value here to tell the `cr_journals()` function what journal we want information on
    * `works = TRUE` : When we earlier specified `works = FALSE`, we got back information on the journal. When `works = TRUE`, we will get back article-level metadata
    * `limit = 25` : We will get back 25 articles. The default number of articles returned is 20, but you can increase or decrease that with the `limit` argument. The max limit is 1000, but you can get more using the `cursor` argument (see below).
* `%>%` : The pipe operator tells R to take the results of this function and use them as the input for what follows
* `pluck("data")` : This will grab only the contents of the list item "data" and return it to `plosone_publications`.

Let's explore the data frame:

In [None]:
# print dimensions of this data frame
dim(plosone_publications)

When we run `dim()` (dimensions) on this result, we now see a different number of rows and columns: 25 rows and 28 columns. This is therefore a different dataset than `plosone_details`. Let's call `names()` to see what the column names are:

In [None]:
# print column names
names(plosone_publications)

We view the data frame below. Because there are some nested lists within the data, we will use the `select()` function from the `dplyr` package to select only a few columns. This will make it easier for us to view here in the notebook environment. You can also use the `select()` function to rearrange the columns.

In [None]:
# print select columns from the data frame
plosone_publications %>%
  dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count, type, issn)

Here we are just getting back the last 25 articles that have been indexed in Crossref by PLoS ONE. However, this gives you a taste of how rich the metadata is. We have the dates the article was deposited and published online, the title, DOI, the ISSN, the volume, issue, and page numbers, the number of references, the URL, and for some items, the subjects. The omitted columns include information on licensing, authors, and more. We will deal with those columns further down.

## Getting multiple publications by ISSN

You can also pass multiple ISSNs to `cr_journals`. Here we create two new values, `jama_issn` and `jah_issn`. These are ISSNs for *JAMA: The Journal of the American Medical Association* and the *Journal of American History*. We then pass them to `cr_journals` inside the `c()` function, which combines them (it's like CONCATENATE in Excel). We set `works` to `TRUE` so we'll get the publications metadata, and we set the `limit` to 10, so we'll get 10 publications per journal.

In [None]:
# assign the JAMA and JAH ISSNs
jama_issn <- '1538-3598'
jah_issn <- '0021-8723'

# get the last 10 publications on deposit from each journal. For multiple ISSNs, use c() to combine them
jah_jama_publications <- rcrossref::cr_journals(issn = c(jama_issn, jah_issn), 
                                                works = TRUE, 
                                                limit = 10) %>%
  purrr::pluck("data")

In [None]:
c(jama_issn, jah_issn)

Here we used `c()` to combine `jama_issn` and `jah_issn`. `c()` is used to create a **vector** in R. A vector is a sequence of elements of the same type. In this case, even though the ISSNs are numbers, we created them as `character` vectors by surrounding them in quotation marks. You can use single or double quotes. By contrast, assigning an unquoted number, as in `y <- 5`, creates a `numeric` vector.

Vectors can only contain “homogeneous” data; in other words, all data must be of the same type. The type of a vector determines what kind of analysis you can do on it. For example, you can perform mathematical operations on `numeric` objects, but not on `character` objects. You can think of vectors as columns in an Excel spreadsheet: for example, in a name column, you want every value to be a character; in a date column, you want every value to be a date; etc.
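
The short sketch below illustrates these points with throwaway values (the names `issns`, `y`, and `mixed` are just for this example): it checks the type of a character and a numeric vector, shows what happens when types are mixed inside `c()`, and confirms that arithmetic works on one type but not the other.

```r
# Quoted values make a character vector
issns <- c("1538-3598", "0021-8723")
issns_class <- class(issns)

# An unquoted number makes a numeric vector
y <- 5
y_class <- class(y)

# Mixing types inside c() does not fail: R converts everything to one type
mixed <- c("1538-3598", 5)
mixed_class <- class(mixed)

# Arithmetic works on numeric vectors...
doubled <- y * 2

# ...but raises an error on character vectors, even if they look like numbers
char_math <- tryCatch("5" * 2, error = function(e) "error")
```

Printing `issns_class`, `y_class`, and `mixed_class` shows that the mixed vector ends up as `character`: R silently converted the 5 to text so that every element has the same type.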

Going back to our `jah_jama_publications` object, we have a data frame composed of 20 observations of 24 variables. This is a rich set of metadata for the articles in the given publications. The fields are detailed in the [Crossref documentation](https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md#work), including the field name, type, description, and whether or not it's required. Some of these fields are title, DOI, DOI prefix identifier, ISSN, volume, issue, publisher, abstract (if provided), reference count (if provided--i.e., the number of references *in* the given article), link (if provided), subject (if provided), and other information. The number of citations *to* the article is not pulled, but it can be gathered separately with `cr_citation_count()` (see below).

In [None]:
# print column names
names(jah_jama_publications)

In [None]:
# print data frame with select columns
jah_jama_publications %>%
  dplyr::select(title, container.title, doi, volume, issue, page, issued, url, publisher, reference.count, type, issn)

## Filtering the `cr_journals` query with the `filter` argument

You can use the `filter` argument within `cr_journals` to specify some parameters as the query is executing. This filter is built into the Crossref API query. See the available filters by calling `rcrossref::filter_names()`, and their details by calling `rcrossref::filter_details()`. They are also listed in [the API documentation](https://github.com/CrossRef/rest-api-doc#filter-names).

| filter     | possible values | description|
|:-----------|:----------------|:-----------|
| `has-funder` | | metadata which includes one or more funder entry |
| `funder` | `{funder_id}` | metadata which include the `{funder_id}` in FundRef data |
| `location` | `{country_name}` | funder records where location = `{country_name}`. Only works on `/funders` route |
| `prefix` | `{owner_prefix}` | metadata belonging to a DOI owner prefix `{owner_prefix}` (e.g. `10.1016` ) |
| `member` | `{member_id}` | metadata belonging to a Crossref member |
| `from-index-date` | `{date}` | metadata indexed since (inclusive) `{date}` |
| `until-index-date` | `{date}` | metadata indexed before (inclusive) `{date}` |
| `from-deposit-date` | `{date}` | metadata last (re)deposited since (inclusive) `{date}` |
| `until-deposit-date` | `{date}` | metadata last (re)deposited before (inclusive) `{date}` |
| `from-update-date` | `{date}` | Metadata updated since (inclusive) `{date}`. Currently the same as `from-deposit-date`. |
| `until-update-date` | `{date}` | Metadata updated before (inclusive) `{date}`. Currently the same as `until-deposit-date`. |
| `from-created-date` | `{date}` | metadata first deposited since (inclusive) `{date}` |
| `until-created-date` | `{date}` | metadata first deposited before (inclusive) `{date}` |
| `from-pub-date` | `{date}` | metadata where published date is since (inclusive) `{date}` |
| `until-pub-date` | `{date}` | metadata where published date is before (inclusive)  `{date}` |
| `from-online-pub-date` | `{date}` | metadata where online published date is since (inclusive) `{date}` |
| `until-online-pub-date` | `{date}` | metadata where online published date is before (inclusive)  `{date}` |
| `from-print-pub-date` | `{date}` | metadata where print published date is since (inclusive) `{date}` |
| `until-print-pub-date` | `{date}` | metadata where print published date is before (inclusive)  `{date}` |
| `from-posted-date` | `{date}` | metadata where posted date is since (inclusive) `{date}` |
| `until-posted-date` | `{date}` | metadata where posted date is before (inclusive)  `{date}` |
| `from-accepted-date` | `{date}` | metadata where accepted date is since (inclusive) `{date}` |
| `until-accepted-date` | `{date}` | metadata where accepted date is before (inclusive)  `{date}` |
| `has-license` | | metadata that includes any `<license_ref>` elements. |
| `license.url` | `{url}` | metadata where `<license_ref>` value equals `{url}` |
| `license.version` | `{string}` | metadata where the `<license_ref>`'s `applies_to` attribute  is `{string}`|
| `license.delay` | `{integer}` | metadata where difference between publication date and the `<license_ref>`'s `start_date` attribute is <= `{integer}` (in days)|
| `has-full-text` |  | metadata that includes any full text `<resource>` elements. |
| `full-text.version` | `{string}`  | metadata where `<resource>` element's `content_version` attribute is `{string}`. |
| `full-text.type` | `{mime_type}`  | metadata where `<resource>` element's `content_type` attribute is `{mime_type}` (e.g. `application/pdf`). |
| `full-text.application` | `{string}` | metadata where `<resource>` link has one of the following intended applications: `text-mining`, `similarity-checking` or `unspecified` |
| `has-references` | | metadata for works that have a list of references |
| `reference-visibility` | `[open, limited, closed]` | metadata for works where references are either `open`, `limited` (to [Metadata Plus subscribers](https://www.crossref.org/services/metadata-delivery/plus-service/)) or `closed` |
| `has-archive` | | metadata which include name of archive partner |
| `archive` | `{string}` | metadata where value of archive partner is `{string}` |
| `has-orcid` | | metadata which includes one or more ORCIDs |
| `has-authenticated-orcid` | | metadata which includes one or more ORCIDs where the depositing publisher claims to have witnessed the ORCID owner authenticate with ORCID |
| `orcid` | `{orcid}` | metadata where `<orcid>` element's value = `{orcid}` |
| `issn` | `{issn}` | metadata where record has an ISSN = `{issn}`. Format is `xxxx-xxxx`. |
| `isbn` | `{isbn}` | metadata where record has an ISBN = `{isbn}`. |
| `type` | `{type}` | metadata records whose type = `{type}`. Type must be an ID value from the list of types returned by the `/types` resource |
| `directory` | `{directory}` | metadata records whose article or serial are mentioned in the given `{directory}`. Currently the only supported value is `doaj`. |
| `doi` | `{doi}` | metadata describing the DOI `{doi}` |
| `updates` | `{doi}` | metadata for records that represent editorial updates to the DOI `{doi}` |
| `is-update` | | metadata for records that represent editorial updates |
| `has-update-policy` | | metadata for records that include a link to an editorial update policy |
| `container-title` | | metadata for records with an exactly matching publication title |
| `category-name` | | metadata for records with an exact matching category label. Category labels come from [this list](https://www.elsevier.com/solutions/scopus/content) published by Scopus |
| `type` | | metadata for records with type matching a type identifier (e.g. `journal-article`) |
| `type-name` | | metadata for records with an exactly matching type label |
| `award.number` | `{award_number}` | metadata for records with a matching award number. Optionally combine with `award.funder` |
| `award.funder` | `{funder doi or id}` | metadata for records with an award with matching funder. Optionally combine with `award.number` |
| `has-assertion` | | metadata for records with any assertions |
| `assertion-group` | | metadata for records with an assertion in a particular group |
| `assertion` | | metadata for records with a particular named assertion |
| `has-affiliation` | | metadata for records that have any affiliation information |
| `alternative-id` | | metadata for records with the given alternative ID, which may be a publisher-specific ID, or any other identifier a publisher may have provided |
| `article-number` | | metadata for records with a given article number |
| `has-abstract` | | metadata for records which include an abstract |
| `has-clinical-trial-number` | | metadata for records which include a clinical trial number |
| `content-domain` | | metadata where the publisher records a particular domain name as the location Crossmark content will appear |
| `has-content-domain` | | metadata where the publisher records a domain name location for Crossmark content |
| `has-domain-restriction` | | metadata where the publisher restricts Crossmark usage to content domains |
| `has-relation` | | metadata for records that either assert or are the object of a relation |
| `relation.type` | | One of the relation types from the Crossref relations schema (e.g. `is-referenced-by`, `is-parent-of`, `is-preprint-of`) |
| `relation.object` | | Relations where the object identifier matches the identifier provided |
| `relation.object-type` | | One of the identifier types from the Crossref relations schema (e.g. `doi`, `issn`) |

### Filtering by publication date with `from_pub_date` and `until_pub_date`
For example, you may only want to pull publications from a given year, or within a date range. Remember to increase the limit or use `cursor` if you need to. Also notice three things about the `filter` argument:

* The filter name (e.g. `from_pub_date`) goes on the left of the `=`. You can wrap it in backticks (the key next to the 1 on the keyboard), as we do with `has_license` further down; R only requires them when a name is not a valid R name.
* The filter value is in quotes
* The whole thing is wrapped in `c()`
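
To make the shape of the `filter` argument concrete, the sketch below builds a few named vectors without sending anything to Crossref. It only demonstrates base R behavior; the variable names are made up for the example. Note that the table above lists the hyphenated API names, while the code in this tutorial uses the underscore style.

```r
# Underscore-style names are valid R names, so backticks are optional
plain_filter    <- c(from_pub_date = "2019-01-01")
backtick_filter <- c(`from_pub_date` = "2019-01-01")
same_filter <- identical(plain_filter, backtick_filter)

# Several filters travel together as one named vector
date_range <- c(from_pub_date = "2019-01-01", until_pub_date = "2019-12-31")
date_range_names <- names(date_range)

# A hyphenated name is not a valid R name, so here the backticks are required
hyphen_filter <- c(`from-pub-date` = "2019-01-01")
hyphen_name <- names(hyphen_filter)
```

Each element's name is the filter and its value is the setting, which is why combining filters is just a matter of adding more `name = value` pairs inside `c()`.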

Here, we will get articles from the *Journal of Librarianship and Scholarly Communication* published on or after January 1, 2019:

In [None]:
# assign the JLSC ISSN
jlsc_issn <- "2162-3309"

# get articles published since January 1, 2019
jlsc_publications_2019 <- rcrossref::cr_journals(issn = jlsc_issn, works = TRUE, 
                                                 filter = c(from_pub_date='2019-01-01')) %>%
  purrr::pluck("data")

# print the dataframe with select column
jlsc_publications_2019 %>%
  dplyr::select(title, container.title, doi, volume, issue, issued, url, publisher, reference.count, type, issn)

### Filtering by funder with `award.funder`

You can also return all articles funded by a specific funder. See the [Crossref Funder Registry](https://gitlab.com/crossref/open_funder_registry) for a list of funders and their DOIs (download the CSV at the bottom of the page). 

Here, we will combine two filters, `award.funder` and `from_pub_date`, to return articles published in PLoS ONE where a) at least one funder is the National Institutes of Health, and b) the article was published on or after July 1, 2020. Note that we set a `limit` here of 25 because we are doing a teaching activity and we don't want to send heavy queries. If you were doing this on your own, you would likely want to raise the limit.

In [None]:
# assign the PLoS ONE ISSN and the NIH Funder DOI
plosone_issn <- '1932-6203'
nih_funder_doi <- '10.13039/100000002'

# get articles published in PLoS ONE since July 1, 2020 funded by NIH
plosone_publications_nih <- rcrossref::cr_journals(issn = plosone_issn, works = TRUE, limit = 25,
                                                 filter = c(award.funder = nih_funder_doi,
                                                           from_pub_date = '2020-07-01')) %>%
  purrr::pluck("data")

We will use `unnest()` from the `tidyr` package to view the data frame here. This is described below in [Unnesting List Columns](#Unnesting-list-columns).

In [None]:
# print the dataframe, first unnesting the funder column
plosone_publications_nih %>%
    tidyr::unnest(funder)

If you scroll all the way to the right, you can see the funder information. Look at the `title` column and you will notice that some article titles are now duplicated; however, you will see different funders in the `name` column. This is because a single article may have multiple funders, and a new row is created for each funder, with data including the `award` number.
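
The row duplication is easier to see on a tiny example. The sketch below uses an invented two-article data frame with a nested `funder` column; the titles, funder names, and award numbers are all hypothetical, and the real Crossref columns are richer than this.

```r
library(dplyr)
library(tidyr)

# A toy data frame: each article carries its own small table of funders
toy_articles <- tibble(
  title  = c("Article A", "Article B"),
  funder = list(
    tibble(name = c("Funder X", "Funder Y"), award = c("X-1", "Y-9")),
    tibble(name = "Funder X", award = "X-2")
  )
)

# Unnesting creates one row per article-funder pair
toy_unnested <- toy_articles %>%
  tidyr::unnest(funder)

toy_unnested
```

"Article A" has two funders, so it appears on two rows after unnesting, once for each funder, while "Article B" stays on a single row.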

---
**TRY IT YOURSELF**

1. Run a `cr_journals` query with `works = FALSE` to get information on the journal *Scientometrics* (ISSN 1588-2861). Remember to set `works` to `FALSE` and include `%>% pluck("data")`. Assign it to a symbol of your choosing.
2. Call `str()` on it to view information on it. Does the current Crossref file contain funder information? What percent of articles in the current file have at least one funder listed? What percent of articles in the current file have at least one ORCID iD?

In [None]:
# get data on Scientometrics



In [None]:
# call str() to view information about it



In [None]:
# does Crossref contain information about funders?



In [None]:
# What percent of articles in the current file have at least one funder listed?



In [None]:
# What percent of articles in the current file have at least one ORCID iD?



3. Find out if the NIH has funded any publications in the journal *Scientometrics* since 2015.

Remember if you are printing the data frame out, include `%>% dplyr::select(title, container.title, doi, volume, issue, issued, url, publisher, reference.count, type, issn)`

In [None]:
# how many publications in Scientometrics funded by the NIH since 2015?




4. Look at the above table of filters to see what the filter argument for ORCID is. Check to see if *Scientometrics* has any articles authored by Dr. Anne-Wil Harzing, whose ORCID iD is `0000-0003-1509-3003`. How many articles does she have?

In [None]:
# does Scientometrics have any articles authored by the professor with ORCID iD 0000-0003-1509-3003?



### Filtering by license with `has_license`

You may be interested in licensing information for articles; for instance, gathering publications in a given journal that are licensed under Creative Commons. First run `cr_journals` with `works` set to `FALSE` in order to return journal details so you can check if the publisher even sends article licensing information to Crossref--it's not required. We will use PLOS ONE again as an example.


In [None]:
# assign the PLoS ONE ISSN and get journal details by setting works = FALSE
plosone_issn <- '1932-6203'
plosone_details <- rcrossref::cr_journals(issn = plosone_issn, works = FALSE) %>%
  purrr::pluck("data")

We can check the `deposits_licenses_current` variable to see if license data on file is current. If it is `TRUE`, PLoS ONE does send licensing information and it is current.

In [None]:
# is article licensing data on file current?
plosone_details$deposits_licenses_current

We can now rerun the query but set `works = TRUE`, and set the `has_license` filter to `TRUE`. This will return only articles that have license information. We will set our `limit` to 25.

In [None]:
# get last 25 articles on file where has_license is TRUE
plosone_license <- rcrossref::cr_journals(issn = plosone_issn, works = TRUE, limit = 25, 
                                          filter = c(`has_license` = TRUE)) %>% 
  pluck("data")

In [None]:
# print the data with select columns
plosone_license %>%
  dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count, type, issn, license)

The license data comes in as a nested column. We can unnest it using `tidyr::unnest`, which we used above with funders and will be discussed more below. 

In [None]:
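# print the data frame with license unnested.
# The .drop argument of unnest() is deprecated as of tidyr 1.0.0, so we first
# select the columns we want to keep alongside license and then unnest it.
plosone_license %>%
  dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count, type, issn, license) %>%
  tidyr::unnest(license)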

This adds four columns all the way to the right: 
* **date** (Date on which this license begins to take effect) 
* **URL** (Link to a web page describing this license--in this case, Creative Commons)
* **delay in days** (Number of days between the publication date of the work and the start date of this license), and 
* **content.version**, which specifies the version of the article the licensing data pertains to (VOR = Version of Record, AM = Accepted Manuscript, TDM = Text and Data Mining). 

Browsing the rows, we see all are CC BY 4.0, which stands to reason given that *PLOS ONE* is an open access journal and [applies the CC BY license](https://journals.plos.org/plosone/s/licenses-and-copyright) to the articles it publishes.

## Filtering rows and selecting columns with `dplyr`

You can use the `filter()` and `select()` functions from the `dplyr` package if you want to get subsets of this data after you have made the query. **Note that this is a completely different `filter` function** than the one used above inside the `cr_journals()` function. That one was an argument we sent with the API call that filtered the results before they were returned. This is a separate function that is part of `dplyr` to help you filter a data frame in R. 

To learn more about the `dplyr` package, read the ["Data Transformation" chapter in *R For Data Science*](https://r4ds.had.co.nz/transform.html).

Above, we retrieved all articles from the *Journal of Librarianship & Scholarly Communication* published after January 1, 2019. Let's say you want only volume 8, issue 1:

In [None]:
# assign the JLSC ISSN and get all publications after January 1, 2019
jlsc_issn <- "2162-3309"
jlsc_publications_2019 <- rcrossref::cr_journals(issn = jlsc_issn, works = T, limit = 25,
                                                 filter = c(from_pub_date='2019-01-01')) %>%
  purrr::pluck("data")

In [None]:
# print the data frame with select columns
jlsc_publications_2019 %>%
  dplyr::select(title, doi, volume, issue, issued, url, publisher, reference.count, type, issn)

In [None]:
# use filter from dplyr to get only volume 8, issue 1
jlsc_8_1 <- jlsc_publications_2019 %>%
  dplyr::filter(volume == "8",
         issue == "1") 

# print the data frame with select columns
jlsc_8_1 %>%
  dplyr::select(title, doi, volume, issue, issued, url, publisher, reference.count, type, issn)

`filter()` will go through each row of your existing `jlsc_publications_2019` data frame, and keep only those rows with values matching the filters you input. **Note:** be careful when filtering by ISSN. If a journal has multiple ISSNs, they'll be combined in a single cell separated by a comma, and an exact-match `filter()` will return no rows, as with JAMA above. In this case it may be wiser to use `str_detect()`, as described a couple of code chunks down.

In [None]:
jah_jama_publications$issn[1]
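
The ISSN caveat above can be sketched with a small made-up data frame. The titles and ISSNs below are hypothetical placeholders, not real Crossref data; the point is only to contrast an exact match with `str_detect()` when two ISSNs share one cell.

```r
library(dplyr)
library(stringr)

# hypothetical data frame: the titles and ISSNs are made up for illustration
toy_pubs <- tibble(
  title = c("Article A", "Article B", "Article C"),
  issn  = c("1111-1111,2222-2222", "1111-1111,2222-2222", "3333-3333")
)

# an exact match returns no rows, because the cell holds both ISSNs
exact_match <- toy_pubs %>%
  filter(issn == "1111-1111")

# str_detect() matches the ISSN anywhere in the cell
partial_match <- toy_pubs %>%
  filter(str_detect(issn, fixed("1111-1111")))

nrow(exact_match)
nrow(partial_match)
```

Wrapping the pattern in `fixed()` asks `stringr` to treat it as literal text rather than a regular expression.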

We can use `filter()` to get a single article from within this data frame if we need, either by DOI:

In [None]:
# filter to get "The Five Laws of OER" article by DOI
jlsc_article <- jlsc_publications_2019 %>%
  dplyr::filter(doi == "10.7710/2162-3309.2299") 

# print data frame with select columns
jlsc_article %>%
  dplyr::select(title, doi, volume, issue, issued, url, publisher, reference.count, type, issn)

Or by title:

In [None]:
# use str_detect to search the title column for articles that include the term OER
jlsc_article <- jlsc_publications_2019 %>%
  dplyr::filter(stringr::str_detect(title, "OER"))

# print the data frame with select column
jlsc_article %>%
  dplyr::select(title, doi, volume, issue, issued, url, publisher, reference.count, type, issn)

Here, we use the `str_detect()` function from the `stringr` package, which is part of the `tidyverse` and which we loaded at the beginning of this notebook, in order to find a single term (OER) in the title.

Remember that these `dplyr` and `stringr` functions are searching through our existing data frame `jlsc_publications_2019`, not issuing new API calls.

## Field queries

There is yet another way of making your query more precise, and that is to use a field query (`flq`) argument to `cr_journals()`. This allows you to search in specific bibliographic fields such as author, editor, titles, ISSNs, and author affiliation (not widely available). These are listed in the [Crossref documentation](https://github.com/CrossRef/rest-api-doc#field-queries) and reproduced below. You *must* provide an ISSN--in other words, you can't run a field query for authors across all journals. 

| Field query parameter | Description |
|-----------------------|-------------|
| `query.container-title` | Query `container-title` aka. publication name |
| `query.author` | Query author given and family names |
| `query.editor` | Query editor given and family names |
| `query.chair` | Query chair given and family names |
| `query.translator` | Query translator given and family names |
| `query.contributor` | Query author, editor, chair and translator given and family names |
| `query.bibliographic` | Query bibliographic information, useful for citation look up. Includes titles, authors, ISSNs and publication years |
| `query.affiliation` | Query contributor affiliations |

### Field query by title

Here, we get all publications from the *Journal of Librarianship and Scholarly Communication* with the term "open access" in their bibliographic metadata, which includes the title. 


In [None]:
# assign JLSC ISSN and query the bibliographic field for terms mentioning open access. 
jlsc_issn <- "2162-3309"
jlsc_publications_oa <- rcrossref::cr_journals(issn = jlsc_issn, works = T, limit = 25,
                                            flq = c(`query.bibliographic` = 'open access')) %>%
  purrr::pluck("data")

# print the data frame with select columns
jlsc_publications_oa %>%
  dplyr::select(title, doi, volume, issue, issued, issn, author)

### Field query by author, contributor, or editor

The `flq` argument can also be used for authors, contributors, or editors. Here we search the same journal for authors with the name Salo (looking for all articles written by Dorothea Salo).



In [None]:
# Use the query.author field query to find JLSC articles with author name Salo
jlsc_publications_auth <- rcrossref::cr_journals(issn = jlsc_issn, works = T, limit = 25,
                                            flq = c(`query.author` = 'salo')) %>%
  purrr::pluck("data")

# print the data frame with select columns
jlsc_publications_auth %>%
  dplyr::select(title, doi, volume, issue, issued, url, publisher, reference.count, type, issn)

---
**TRY IT YOURSELF**

1. Assign the ISSN for *College & Research Libraries* to a value - 2150-6701
2. Use the `query.author` field query (`flq`) to find all articles written by Lisa Janicke Hinchliffe 
3. Print the tibble using `select` and the specified columns

In [None]:
# assign the C&RL ISSN 2150-6701


In [None]:
# use the query.author field query to search for articles written by Lisa Hinchliffe.
# separate her name with a plus: Lisa+Janicke+Hinchliffe
# make sure to use backticks around query.author (next to the 1 key) `




In [None]:
# print the data using select and the specified columns: title, doi, volume, issue, issued, url, publisher, reference.count, type, issn



# Viewing the JSON file

You can view these files in a JSON viewer using the `toJSON()` function from the `jsonlite` package.

In [None]:
# assign the PLOS ISSN and get the last 5 articles on deposit
plosone_issn <- '1932-6203'
plosone_publications <- cr_journals(issn = plosone_issn, works = TRUE, limit = 5) %>%
  pluck("data")

# use the toJSON function to convert the output to JSON
plosone_publications_json <- jsonlite::toJSON(plosone_publications)

Print the JSON, triple click inside the box to highlight the text, and copy it to the clipboard. Watch out! This will look like a jumbled mess of text!

In [None]:
# print the JSON
plosone_publications_json

Go to [Code Beautify](https://codebeautify.org/jsonviewer) and paste the JSON on the left side. Click **Tree Viewer** to view the data. Open the first item to view the metadata. Note especially the last few variables. These are nested lists, as a single article can have multiple authors, and each author has a given name, family name, and sequence of authorship.

To write to JSON, use `jsonlite::write_json()`. 

In [None]:
# write a JSON file
jsonlite::write_json(plosone_publications, "data/plosone_publications.json")
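
To check what actually ends up in a JSON file, a round trip can be sketched with a small made-up data frame. This is only an illustration: the DOIs and names are hypothetical placeholders, and the file goes to a temporary path rather than the **data** folder.

```r
library(jsonlite)

# hypothetical data frame with a nested author column, standing in for Crossref output
toy_works <- data.frame(doi = c("10.0000/example.1", "10.0000/example.2"),
                        stringsAsFactors = FALSE)
toy_works$author <- list(
  data.frame(given = c("Ada", "Ben"), family = c("Example", "Sample"),
             stringsAsFactors = FALSE),
  data.frame(given = "Cy", family = "Placeholder", stringsAsFactors = FALSE)
)

# write the data frame to a temporary JSON file
json_path <- tempfile(fileext = ".json")
write_json(toy_works, json_path, pretty = TRUE)

# read it back; the nested authors return as a list of data frames
toy_roundtrip <- read_json(json_path, simplifyVector = TRUE)
toy_roundtrip$author[[1]]
```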

# Saving files in Binder

You can save files while in a Binder session, but you will need to download them before you close the session down. The JSON file we just saved is now available if you click **File > Open** and navigate to the **data** folder. There, you can check the box and click the **Download** button at the top of the page. Note that this file will disappear when you close down your Binder session.

# Using `cr_works()` to get data on articles

`cr_works()` allows you to search by DOI or a general query in order to return the Crossref metadata.

It is important to note, as Crossref does [in the documentation](https://github.com/CrossRef/rest-api-doc/blob/master/demos/crossref-api-demo.ipynb):

> Crossref does not use "works" in the FRBR sense of the word. In Crossref parlance, a "work" is just a thing identified by a DOI. In practice, Crossref DOIs are used as citation identifiers. So, in FRBR terms, this means, that a Crossref DOI tends to refer to one *expression* which might include multiple *manifestations*. So, for example, the ePub, HTML and PDF version of an article will share a Crossref DOI because the differences between them should not effect the interpretation or crediting of the content. In short, they can be cited interchangeably. The same is true of the "accepted manuscript" and the "version-of-record" of that accepted manuscript.


## Searching by DOI

You can pass a DOI directly to `cr_works()` using the `dois` argument:

In [None]:
# Get metadata for a single article by DOI
jlsc_ku_oa <- cr_works(dois = '10.7710/2162-3309.1252') %>%
  purrr::pluck("data")

# print the data frame with select columns
jlsc_ku_oa %>%
  dplyr::select(title, doi, volume, issue, issued, url, publisher, reference.count, type, issn)

You can also pass more than one DOI. Here we start by assigning our DOIs to a variable `my_dois`, then pass it to `cr_works()` in the `dois` argument:

In [None]:
# Use c() to create a vector of DOIs
my_dois <- c("10.2139/ssrn.2697412", 
                        "10.1016/j.joi.2016.08.002", 
                        "10.1371/journal.pone.0020961", 
                        "10.3389/fpsyg.2018.01487", 
                        "10.1038/d41586-018-00104-7", 
                        "10.12688/f1000research.8460.2", 
                        "10.7551/mitpress/9286.001.0001")

# pass the my_dois vector to cr_works()
my_dois_works <- rcrossref::cr_works(dois = my_dois) %>%
  pluck("data")

# print the data frame with select columns
my_dois_works %>%
  dplyr::select(title, doi, volume, issue, issued, url, publisher, reference.count, type, issn)

## Unnesting list columns

Authors, links, licenses, funders, and some other values can appear in nested lists when you call `cr_journals` because there can be, and often are, multiple of each of these items per article. You can check the data classes on all variables by running `typeof()` across all columns using the `map_chr()` function from `purrr`:


In [None]:
# query to get data on a specific PLOS article
plos_article <- cr_works(dois = '10.1371/journal.pone.0228782') %>%
  purrr::pluck("data")

# print the type of each column (e.g. character, numeric, logical, list)
purrr::map_chr(plos_article, typeof)
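
If you only want the names of the list columns rather than the type of every column, the same `map_chr()` result can be subset. A minimal sketch, using a made-up one-row data frame in place of a real Crossref result:

```r
library(purrr)

# hypothetical stand-in for a Crossref result; real results have many more columns
toy_article <- dplyr::tibble(
  doi = "10.0000/example.1",
  reference.count = 12L,
  author = list(dplyr::tibble(given = "Ada", family = "Example"))
)

# get the type of each column, then keep the names of those that are lists
col_types <- map_chr(toy_article, typeof)
list_cols <- names(col_types)[col_types == "list"]
list_cols
```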

Our `plos_article` data frame has a nested list for **author**. We can unnest this column using `unnest()` from the `tidyr` package. The `.drop = TRUE` argument will drop any other list columns.

In [None]:
# unnest author column
plos_article %>%
    tidyr::unnest(author, .drop = TRUE) 

We can see this has added 5 rows and 5 new columns: **ORCID** (the URL to the author's ORCID iD), **authenticated.orcid** (a TRUE/FALSE value indicating whether that ORCID has been authenticated), **given** (first name), **family** (last name), and **sequence** (order in which they appeared). It has dropped the other 4 list columns: funder, link, license, and reference.

See https://ciakovx.github.io/rcrossref.html#unnesting_list_columns for more detailed strategies in unnesting nested lists in Crossref. For more details, call `?unnest` and read the [R for Data Science section on Unnesting](https://r4ds.had.co.nz/many-models.html#unnesting).
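
One caveat: in `tidyr` 1.0.0 and later, the `.drop` argument to `unnest()` is deprecated and other list columns are kept by default, so you may see a warning. A minimal sketch of an alternative is to drop the other list columns yourself before unnesting. The data frame below is a made-up placeholder, not real Crossref output.

```r
library(dplyr)
library(tidyr)

# hypothetical one-row result with two list columns
toy_article <- tibble(
  doi = "10.0000/example.1",
  author = list(tibble(given = c("Ada", "Ben"), family = c("Example", "Sample"))),
  link = list(tibble(URL = "https://example.org/fulltext"))
)

# find the list columns, then drop all of them except the one we want to unnest
list_cols <- names(toy_article)[purrr::map_lgl(toy_article, is.list)]
toy_authors <- toy_article %>%
  select(-all_of(setdiff(list_cols, "author"))) %>%
  unnest(author)

toy_authors
```

The row is repeated once per author, and only the non-list columns plus the unnested author columns remain.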

---
**TRY IT YOURSELF**

1. Do a quick search in [Google Scholar](https://scholar.google.com/) for an article you are interested in, and create an object below with the DOI. Remember to use quotation marks. 

2. Do a search with `cr_works` to get the article metadata. Assign it to a new symbol. Remember to `pluck()` the data. Print it with `dplyr::select(title, doi, url, publisher, reference.count, type)`

3. Print the column types using `purrr::map_chr(my_article, typeof)`. Replace `my_article` with whatever you called the symbol containing your article data.

4. Try unnesting one of the list columns using `unnest()`. Set `.drop = TRUE`. Scroll all the way to the right. What new columns have appeared? Have any rows been duplicated?

## Getting more than 1000 results with the `cursor` argument to `cr_journals`

If a query will return more than 1000 results, we have to use the `cursor` argument. We will not be covering this in this class, but see <https://ciakovx.github.io/rcrossref.html#Getting_more_than_1000_results_with_the_cursor_argument_to_cr_journals> for instructions on how to do it.

## Running general queries on `cr_works()`

You can also use `cr_works()` to run a query based on very simple text keywords. For example, you can run `oa_works <- rcrossref::cr_works(query = "open+access")`. Paul Oldham [gives a great example of this](https://poldham.github.io/abs/crossref.html#searching_crossref), but does make the comment:

> CrossRef is not a text based search engine and the ability to conduct text based searches is presently crude. Furthermore, we can only search a very limited number of fields and this will inevitably result in lower returns than commercial databases such as Web of Science (where abstracts and author keywords are available).

Unfortunately there is no boolean AND for Crossref queries (see https://github.com/CrossRef/rest-api-doc/issues/135 and  https://twitter.com/CrossrefSupport/status/1073601263659610113). However, as discussed above, the Crossref API assigns a score to each item returned giving a measure of the API's confidence in the match, and if you connect words using `+` the Crossref API will give items with those terms a higher score.
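
Because the API has no boolean AND, one workaround is to approximate it after the fact by filtering the returned data frame in R. The sketch below uses made-up titles standing in for the result of a keyword query. It only checks the **title** column, so it is a rough approximation rather than a true AND search.

```r
library(dplyr)
library(stringr)

# hypothetical titles standing in for the "data" element of a cr_works() keyword query
toy_results <- tibble(
  title = c("Open access and the library",
            "Access to open data",
            "Open science policy",
            "Market access")
)

# keep only rows whose title contains both terms, ignoring case
both_terms <- toy_results %>%
  filter(str_detect(str_to_lower(title), "open"),
         str_detect(str_to_lower(title), "access"))

both_terms
```

Conditions separated by commas inside `filter()` must all be true, which is what gives us the AND behavior here.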

## Specifying field queries to `cr_works()` with `flq`

As with `cr_journals`, you can use `flq` to pass field queries on to `cr_works()`, such as author.

Here we search for the book *Open Access* by Peter Suber by doing a general keyword search for "open access" and an author search for "suber":


In [None]:
# do a general query for the term open access and a field query to return results where the author name includes Suber
suber_oa <- cr_works(query = 'open+access', flq = c(`query.author` = 'suber')) %>%
  pluck("data")

# print the data frame with select columns
suber_oa %>%
  dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count, type, issn)

Dr. Suber has written many materials that include the term "open access." We can use the `filter()` function from `dplyr` to look only at books, from the **type** column:

In [None]:
# use filter() from dplyr to filter that result to include only books
suber_oa_books <- suber_oa %>%
  filter(type == "book")

# print the data frame with select columns
suber_oa_books %>%
  dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count, type, issn)

One is the book from MIT Press that we're looking for; the other is *Knowledge Unbound*, which is a collection of his writings.

We could be more specific from the outset by adding bibliographic information in `query.bibliographic`, such as ISBN (or ISSN, if it's a journal):

In [None]:
# run a different cr_works() query with author set to Suber and his book's ISBN passed to query.bibliographic
suber_isbn <- cr_works(flq = c(`query.author` = 'suber',
                           `query.bibliographic` = '9780262301732')) %>%
  pluck("data")

# print the data frame with select columns
suber_isbn %>%
  dplyr::select(title, doi, issued, url, publisher, type, author)

You can combine the `filter` argument with `flq` to return only items of **type** `book` published in 2012.
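
A sketch of what that combination might look like is below. The filter names `type`, `from_pub_date`, and `until_pub_date` follow the Crossref filter naming used earlier in this notebook, while the object and function names are my own placeholders. The query is wrapped in a function so nothing is sent to the API until you call it.

```r
# filters sent with the API call: books published during 2012
book_filter_2012 <- c(type = "book",
                      from_pub_date = "2012-01-01",
                      until_pub_date = "2012-12-31")

# hypothetical helper: author field query plus the filters above
get_suber_books_2012 <- function() {
  res <- rcrossref::cr_works(flq = c(`query.author` = "suber"),
                             filter = book_filter_2012)
  purrr::pluck(res, "data")
}

# uncomment to run the query against the Crossref API
# suber_books_2012 <- get_suber_books_2012()
```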

## Getting formatted references in a text file

We can use the `cr_cn()` function from the `rcrossref` package to get the citations to those articles in text form in the style you specify. We'll put it into Chicago. The `cr_cn()` function returns each citation into a list element. We can use the `map_chr` and the `pluck` functions from `purrr` to instead assign them to a character vector. 

In [None]:
# Use c() to create a vector of DOIs
my_dois <- c("10.2139/ssrn.2697412", 
                        "10.1016/j.joi.2016.08.002", 
                        "10.1371/journal.pone.0020961", 
                        "10.3389/fpsyg.2018.01487", 
                        "10.1038/d41586-018-00104-7", 
                        "10.12688/f1000research.8460.2", 
                        "10.7551/mitpress/9286.001.0001")

# Use cr_cn to get back citations formatted in Chicago for those DOIs
my_citations <- rcrossref::cr_cn(my_dois,
                                 format = "text",
                                 style = "chicago-note-bibliography") %>% 
  purrr::map_chr(., purrr::pluck, 1)

# print the formatted citations
my_citations

In [None]:
?cr_cn

Beautiful formatted citations from simply a list of DOIs! You can then write this to a text file using `writeLines`. 

In [None]:
# write the formatted citations to a text file
writeLines(my_citations, "data/my_citations_text.txt")

The above is helpful if you need to paste the references somewhere, and there are loads of other citation styles included in `rcrossref`--view them by calling `rcrossref::get_styles()` and it will print a vector of these styles to your console. I'll just print the first 15 below:

In [None]:
# look at the first 15 styles Crossref offers
head(rcrossref::get_styles(), 15)

## Getting formatted references in a BibTeX or RIS file

In addition to a text file, you can also write it to BibTeX or RIS:

In [None]:
# Use cr_cn() to get BibTeX files for my DOIs
my_citations_bibtex <- rcrossref::cr_cn(my_dois, format = "bibtex") %>%
  purrr::map_chr(., purrr::pluck, 1)

Write it to a .bib file using `writeLines()`:

In [None]:
# write to bibtex file
writeLines(my_citations_bibtex, "data/my_citations_bibtex.bib")

EndNote has a hard time reading BibTeX, so if you use it as your reference management software, set the format to RIS instead:

In [None]:
# Use cr_cn() to get RIS records for my DOIs
my_citations_ris <- rcrossref::cr_cn(my_dois, format = "ris") %>%
  purrr::map_chr(., purrr::pluck, 1)

Write it to a .ris file using `writeLines()`, just as with the BibTeX file:


In [None]:
# write to RIS file
writeLines(my_citations_ris, "data/my_citations_ris.ris")

## Getting works from a typed citation in a Word document/text file

This can be helpful if you have a bibliography in a Word document or text file that you want to get into a reference management tool like Zotero. For instance, you may have written the citations in APA style and need to change to Chicago, but don't want to rekey it all out. Or perhaps you jotted down your citations hastily and left out volume, issue, or page numbers, and you need a nice, fully-formatted citation.

If each citation is on its own line in your document's bibliography, then you can probably paste the whole bibliography into an Excel spreadsheet. If it goes as planned, each citation will be in its own cell. You can then save it to a CSV file, which can then be read into R. 


In [None]:
# read in a CSV file of citations
my_references <- readr::read_csv("data/references.txt", locale = readr::locale(encoding = "iso-8859-1"))

# print the file
my_references

As you can see, these are just raw citations, not divided into variables by their metadata elements (that is, with title in one column, author in another, etc.). But, we can now run a query to get precisely that from Crossref using `cr_works`. Because `cr_works` is not vectorized, we will need to build a loop using `map()` from the `purrr` package. 

Don't mind the technical details--it is basically saying to take each row and look it up in the Crossref search engine, the equivalent of copy/pasting the whole reference into the Crossref search box. The loop will `print()` the citation before searching for it so we can keep track of where it is. We set the `limit` to 5. If it didn't find it in the first 5 results, it's not likely to be there at all.

In [None]:
# loop through the references column, using cr_works() to look the item up and return the top 5 hits
my_references_works_list <- purrr::map(
  my_references$reference,
  function(x) {
    print(x)
    my_works <- rcrossref::cr_works(query = x, limit = 5) %>%
      purrr::pluck("data")
  })

The Crossref API assigns a score to each item returned within each query, giving a measure of the API's confidence in the match. The item with the highest score is returned first in the datasets. We can return the first result in each item in the `my_references_works_list` by using `map_dfr()`, which is like `map()` except it returns the results into a data frame rather than a list. This will take a minute to run.

In [None]:
# for each reference looked up, get back the first result
my_references_works_df <- my_references_works_list %>%
  purrr::map_dfr(., function(x) {
    x[1, ]
  })

# print the data frame with select columns
my_references_works_df %>%
  dplyr::select(title, doi, volume, issue, page, issued, url, publisher, reference.count, type, issn)
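
When checking matches, it can help to keep track of which reference each row came from. One way to do that, sketched here with a made-up list in place of `my_references_works_list`, is the `.id` argument to `map_dfr()`, which adds a column holding each list element's position (or its name, if the list is named).

```r
library(dplyr)
library(purrr)

# hypothetical list of lookups: each element stands in for one cr_works() result
toy_works_list <- list(
  tibble(title = c("First hit for reference 1", "Second hit for reference 1")),
  tibble(title = c("First hit for reference 2", "Second hit for reference 2"))
)

# keep the top hit from each element and record which reference it came from
toy_top_hits <- toy_works_list %>%
  map_dfr(~ slice(.x, 1), .id = "reference_number")

toy_top_hits
```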

We can print just the titles to quickly see how well they match with the titles of the works we requested:

In [None]:
# print the title column
my_references_works_df$title

Not bad! Looks like we got 6 out of 8, with problems on numbers 5 and 7. Let's deal with 5 first. The reference was "The Ascent of Open Access", a report by Digital Science posted to figshare, and it didn't come back. Even though this report does have a DOI (https://doi.org/10.6084/m9.figshare.7618751.v2) assigned via figshare, the `cr_works()` function searches only for Crossref DOIs. We should check to see if it came back in any of the 5 items we pulled. We do this by calling `pluck()` on the titles of the fifth item in the list:


In [None]:
my_references_works_list %>%
  purrr::pluck(5, "title")

Nope, unfortunately none of these are "The Ascent of Open Access", so we're out of luck. We can just throw this row out entirely using `slice()` from `dplyr`. We'll overwrite our existing `my_references_works_df` because we have no future use for it in this R session.


In [None]:
my_references_works_df <- my_references_works_df %>%
  dplyr::slice(-5)

For row 7, it's giving us the full citation for Peter Suber's book when we asked for the title only, so something is fishy. 

When we look at it more closely (we can call `View(my_references_works_df)`), we see the author of this item is not Peter Suber, but Rob Harle, and, checking the **type** column, it's a journal article, not a book. This is a book review published in the journal *Leonardo*, not Peter Suber's book. So let's go back to `my_references_works_list` and pull data from all 5 items that came back with the API call and see if Suber's book is in there somewhere:


In [None]:
suber <- my_references_works_list %>%
  purrr::pluck(7)
suber

It looks like it is the second item: the **author** is Peter Suber, the **publisher** is MIT Press, the **type** is book, and the **ISBN** is "9780262301732". 

We do the following to correct it:

* use `filter()` with the isbn to assign the correct row from `suber` to a variable called `suber_correct` 
* remove the incorrect row with `slice()` (double checking that it is now the 6th row, since we already removed row 5 above)
* use `bind_rows()` to add the correct one to our `my_references_works_df` data frame. We can just overwrite the existing `my_references_works_df` again


In [None]:
suber_correct <- suber %>%
  dplyr::filter(isbn == "9780262301732")
my_references_works_df <- my_references_works_df %>%
  dplyr::slice(-6) %>%
  bind_rows(suber_correct)

## Writing publications to CSV

We will use the `write.csv()` function to write our data to disk as a CSV file. 

Unfortunately, you cannot simply write the `plosone_publications` data frame to a CSV, due to the nested lists. It will throw an error: `"Error in stream_delim_(df, path, ...) : Don't know how to handle vector of type list."`

I run through three solutions at https://ciakovx.github.io/rcrossref.html#writing_publications_to_disk

Here, we will use solution 3: You can use `mutate()` from `dplyr` to coerce the list columns into character vectors with `as.character()`.

First, identify the list vectors:

In [None]:
my_dois <- c("10.2139/ssrn.2697412", 
                        "10.1016/j.joi.2016.08.002", 
                        "10.1371/journal.pone.0020961", 
                        "10.3389/fpsyg.2018.01487", 
                        "10.1038/d41586-018-00104-7", 
                        "10.12688/f1000research.8460.2", 
                        "10.7551/mitpress/9286.001.0001")

# pass the my_dois vector to cr_works()
my_dois_works <- rcrossref::cr_works(dois = my_dois) %>%
  pluck("data")

In [None]:
# use map_chr to print the column types
purrr::map_chr(my_dois_works, typeof)

For any variables that are type `list`, coerce those columns to character:

In [None]:
# use mutate() to coerce list columns to character vectors
my_dois_mutated <- my_dois_works %>%
  dplyr::mutate(author = as.character(author)) %>%
  dplyr::mutate(assertion = as.character(assertion)) %>%
  dplyr::mutate(link = as.character(link)) %>%
  dplyr::mutate(license = as.character(license)) %>%
  dplyr::mutate(reference = as.character(reference))
write.csv(my_dois_mutated, "data/my_dois_mutated.csv")
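
If you don't want to name each list column by hand, a more general version can be sketched with `across()` from `dplyr`. The data frame and the `flatten_list_col()` helper below are hypothetical, made up for illustration. Note that real Crossref list columns usually hold data frames, so the flattened text will be a run of all their values joined together.

```r
library(dplyr)
library(purrr)

# hypothetical data frame with two list columns
toy_works <- tibble(
  doi = c("10.0000/example.1", "10.0000/example.2"),
  author = list(c("Ada Example", "Ben Sample"), "Cy Placeholder"),
  link = list("https://example.org/a",
              c("https://example.org/b", "https://example.org/c"))
)

# hypothetical helper: collapse each list element into one semicolon-separated string
flatten_list_col <- function(x) {
  map_chr(x, ~ paste(unlist(.x), collapse = "; "))
}

# apply the helper to every list column at once
toy_flat <- toy_works %>%
  mutate(across(where(is.list), flatten_list_col))

toy_flat
```

The result has no list columns left, so it can be passed to `write.csv()` as above.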

Again, this is not an ideal solution, but if you need to move the data into CSV to view in Excel, it can do the trick.

# Using `roadoi` to check for open access

`roadoi` was developed by Najko Jahn, with reviews from Tuija Sonkkila and Ross Mounce. It interfaces with [Unpaywall](https://unpaywall.org) (which used to be called oaDOI), an important tool developed by [ImpactStory](http://unpaywall.org/team) (Heather Piwowar and Jason Priem) for locating open access versions of scholarship--read more in this [*Nature* article](https://www.nature.com/articles/d41586-018-05968-3). See here for [the `roadoi` documentation](https://cran.r-project.org/web/packages/roadoi/roadoi.pdf).

This incredible [Introduction to `roadoi`](https://cran.r-project.org/web/packages/roadoi/vignettes/intro.html) by Najko Jahn provides much of what you need to know to use the tool, as well as an interesting use case. Also see his recently published article [Open Access Evidence in Unpaywall](https://subugoe.github.io/scholcomm_analytics/posts/unpaywall_evidence/), running deep analysis on Unpaywall data.

We loaded the package at the beginning of this notebook with `library(roadoi)`.

## Setting up `roadoi`

Your API calls to Unpaywall must include a valid email address where you can be reached in order to keep the service open and free for everyone. 

## Checking OA status with `oadoi_fetch`

We then create a DOI vector and use the `oadoi_fetch()` function from `roadoi`. 

**Be sure to replace the email below with your own**

In [None]:
# assign your email address to a vector
my_email <- "name@example.com"

In [None]:
# Use c() to create a vector of DOIs
my_dois <- c("10.2139/ssrn.2697412", 
                        "10.1016/j.joi.2016.08.002", 
                        "10.1371/journal.pone.0020961", 
                        "10.3389/fpsyg.2018.01487", 
                        "10.1038/d41586-018-00104-7", 
                        "10.12688/f1000research.8460.2", 
                        "10.7551/mitpress/9286.001.0001")

# use oadoi_fetch() to get Unpaywall data on those DOIs
my_dois_oa <- roadoi::oadoi_fetch(dois = my_dois,
                                 email = my_email)

Look at the column names.

In [None]:
# print column names
names(my_dois_oa)

In [None]:
my_dois_oa

The returned variables are described on the [Unpaywall Data Format](http://unpaywall.org/data-format) page.

We can see that Unpaywall could not find an OA version for one of these seven:

In [None]:
my_dois_oa$is_oa

So we will filter it out with `filter()` from the `dplyr` package:

In [None]:
# use filter() to overwrite the data frame and keep only items that are available OA
my_dois_oa <- my_dois_oa %>%
  dplyr::filter(is_oa == TRUE)

As above, it is easier to use `unnest()` to more closely view one of the variables:

In [None]:
# print the data frame with best open access location unnested
my_dois_oa %>%
    tidyr::unnest(best_oa_location, names_repair = "unique")
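
After unnesting, you will often want just a few columns. The sketch below uses a made-up stand-in for the `oadoi_fetch()` output. I am assuming the `url` and `host_type` fields described in the Unpaywall data format, and the DOIs and URLs are placeholders.

```r
library(dplyr)
library(tidyr)

# hypothetical stand-in for oadoi_fetch() output; DOIs and URLs are made up
toy_oa <- tibble(
  doi = c("10.0000/example.1", "10.0000/example.2"),
  is_oa = c(TRUE, TRUE),
  best_oa_location = list(
    tibble(url = "https://example.org/a.pdf", host_type = "publisher"),
    tibble(url = "https://repository.example.org/b", host_type = "repository")
  )
)

# unnest the best OA location and keep only the DOI, host type, and URL
toy_oa_urls <- toy_oa %>%
  unnest(best_oa_location) %>%
  select(doi, host_type, url)

toy_oa_urls
```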

---
**TRY IT YOURSELF**

Use the same article you found in the above `cr_works` exercise, or something different. Go through the above steps to check if it is open access. If not, find an article that is OA (you can search on [DOAJ](https://doaj.org/), just click the "Articles" button).

In [None]:
# assign your article DOI to a new object my_doi2 or another name of your choosing


In [None]:
# use roadoi::oadoi_fetch() to get OA information about the article


In [None]:
# use $is_oa on your result to find out if the article has an open access version. If not, find an open access article and try again.


Use `unnest()` to find out the OA locations. What is the URL to the best OA location? Is the journal in DOAJ? Explore the data.

In [None]:
# use unnest(best_oa_location) to find the OA locations. What is the URL to the best OA location?


# Next steps

There are several other excellent R packages that interface with publication metadata and can be used in conjunction with these packages. Examples:

* `rorcid` is a wrapper for the ORCID API. Functions included for searching for people, searching by 'DOI', and searching by ORCID iD. https://cran.r-project.org/web/packages/rorcid/rorcid.pdf. See my walkthrough at https://ciakovx.github.io/rorcid.html.
* `bibliometrix` "is an open-source tool for quantitative research in scientometrics and bibliometrics that includes all the main bibliometric methods of analysis." See more information at https://bibliometrix.org/.
* `rromeo` is a wrapper for the SHERPA-RoMEO API. You can retrieve a set of publications metadata from `rcrossref`, then use the ISSN to look up the policies of the journal regarding the archival of preprints, postprints, and publisher versions. https://cran.r-project.org/web/packages/rromeo/rromeo.pdf
* `crminer` "includes functions for getting getting links to full text of articles, fetching full text articles from those links or Digital Object Identifiers ('DOIs'), and text extraction from 'PDFs'." https://cran.r-project.org/web/packages/crminer/crminer.pdf