---
title: An R Markdown document converted from "./intro_to_bibliometrics/Bibliometrics Lesson 1.ipynb"
output: html_document
---

# Introduction

This notebook will introduce you to `rcrossref` and `citecorp` for doing a few bibliometric analysis tasks. If you plan on using these tools more extensively, I would recommend downloading R and R Studio to your own computer and using them there. Also, the Crossref team encourages requests with appropriate contact information, and will forward you to a dedicated API cluster for improved performance when you share your email with them. We cannot do that in this notebook, but you can do that in R.

For a more detailed walkthrough of this entire lesson, including instructions for downloading R and R Studio, sharing your email with Crossref, and much more, please visit the set of tutorials at https://ciakovx.github.io/fsci_syllabus.html.

# Licensing

This walkthrough is distributed under a [Creative Commons Attribution 4.0 International (CC BY 4.0) License](https://creativecommons.org/licenses/by/4.0/).

# Install and load packages

When you download R it already has a number of functions built in: these encompass what is called **Base R**. However, many R users write their own libraries of functions, package them together in R **packages**, and provide them to the R community at no charge. This extends the capacity of R and allows us to do much more. In many cases, packages improve on Base R functions by making them easier and more straightforward to use.

First install `rcrossref` using the `install.packages()` function. Notice the package name is in quotation marks. Click on the left margin of the cell below so it turns blue. Then click the **Run** button on the top toolbar, or press **Ctrl + Enter**. This will **execute** the code, installing the necessary packages into your Jupyter Notebook environment. This may take a couple of minutes. You will see a `[*]` in the left margin while the package installs. When the asterisk becomes a number, we will be ready.

```{r}
install.packages('rcrossref')
```

Next, install `citecorp`, which is the interface in R to the Open Citations Corpus API, and `tidyverse`.

```{r}
install.packages('citecorp')
install.packages('tidyverse')
```

We will also be using the `dplyr`, `purrr`, `stringr`, `tidyr` and `readr` packages. These packages are part of the [tidyverse](https://www.tidyverse.org/), a collection of R packages designed for data science.

We load packages with the `library()` function. Let's load all three packages by clicking the left margin of the cell below (so it turns blue) and clicking the **Run** button on the top toolbar. Let's also set an option to display a maximum of 100 columns and 20 rows in our Jupyter Notebooks environment. This will just make some of the tables easier to see.

```{r}
# load packages
library(rcrossref)
library(citecorp)
library(tidyverse)

# increase the number of columns and rows displayed when we print a table
options(repr.matrix.max.cols = 100, repr.matrix.max.rows = 20)
```

# `rcrossref` and `citecorp`

## Crossref

Crossref is a not-for-profit membership organization dedicated to interlinking scholarly metadata, including journals, books, conference proceedings, working papers, technical reports, data sets, authors, funders, and more. The [Crossref REST API](https://github.com/CrossRef/rest-api-doc) allows anybody to search and reuse members' metadata in a variety of ways. Read [examples of user stories](https://www.crossref.org/services/metadata-delivery/user-stories/).

## `rcrossref`

`rcrossref` is a package developed by [Scott Chamberlain](https://scottchamberlain.info/), Hao Zhu, Najko Jahn, Carl Boettiger, and Karthik Ram, part of the [rOpenSci](https://ropensci.org/) set of packages. rOpenSci is an incredible organization dedicated to open and reproducible research using shared data and reusable software. I strongly recommend you browse their set of packages at https://ropensci.org/packages/.
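To get a quick feel for what querying Crossref from R looks like before we dig into the details below, here is a minimal sketch using `cr_works()`, which searches the Crossref works endpoint. The query string is just an illustrative choice.

```{r}
# A minimal sketch: search Crossref for works matching a query.
# cr_works() returns a list; its $data element is a tibble with
# one row per work.
res <- cr_works(query = "bibliometrics", limit = 5)
res$data
```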


`rcrossref` serves as an interface to the Crossref API.

**Key links**

* [rcrossref documentation](https://cran.r-project.org/web/packages/rcrossref/rcrossref.pdf)
* [Crossref REST API documentation](https://github.com/CrossRef/rest-api-doc)
* [Crossref Metadata API JSON Format](https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md)
* Crossref and rcrossref tutorials
    - [my `rcrossref` tutorial](https://ciakovx.github.io/rcrossref.html)
    - rOpenSci [rcrossref tutorial](https://ropensci.org/tutorials/rcrossref_tutorial/)
    - Paul Oldham: ["Accessing the Scientific Literature with Crossref"](https://poldham.github.io/abs/crossref.html)
    - [rcrossref vignette](https://cran.r-project.org/web/packages/rcrossref/vignettes/crossref_vignette.html)

# Exercise 1: Getting article data with Dimensions

Your first task is to get a set of articles with citation counts. We will use Dimensions for this, as it is among the largest and most robust abstract and citation databases with free functionality. Searching is free, but exporting requires creating an account first.

1. Go to https://app.dimensions.ai/
2. Click **Register** in the upper right corner and create an account. Log in to your account.
3. Click in the search bar at the top and change the search option to **Title and Abstract**.
4. Type exactly as written: bibliometrics AND "open access"
5. Press enter and click **Save/export** and **Export results**.
6. Under **Export full record**, click the drop-down next to **File Format** and select **CSV - Comma separated**. Click **Export**.
7. You will get an email when the export is ready for download. At that point, click your name in the right hand corner and click **Export Center**. Click **Download** to download the file to your computer.
8. Navigate to the file on your computer and unzip it. Rename the CSV file to "dimensions.csv".
9. Here in your Azure Notebook, save any changes you have made. Then click the "intro_to_bibliometrics" button in the upper right corner to navigate to your project home. Click on the "data" folder and save the file there.

# Exercise 2: Bibliometric analysis of articles on the topic of bibliometrics and open access in Dimensions

Start by importing the dataset we saved under Exercise 1 above. We use the `read_csv()` function, which comes from the `readr` package we loaded above. We also pass `skip = 1` in order to skip line 1, as that is a disclaimer from Digital Science.

```{r}
dimensions <- read_csv("data/dimensions.csv", skip = 1)
```

We can preview the first ten rows with `head()`:

```{r}
head(dimensions, n = 10)
```

Examining this data, we can see the `Times cited` variable provides a count of the number of times each work was cited, according to the Dimensions dataset. As we pointed out in the lecture, it is important to note that this number may vary depending on the source of data.

## Times cited

Let's use the `%>%` pipe to arrange the data by `Times cited` in descending order with `desc()`, and again examine the first ten lines. Note: because this column name has a space in it, make sure that you surround the variable name with backticks, which is on the same keyboard key as the tilde, next to the 1 key.

```{r}
dimensions %>%
  arrange(desc(`Times cited`)) %>%
  head(n = 10)
```

We can see the classic 2006 study from *PLOS Biology*, "Citation Advantage of Open Access Articles", is ranked first with 361 citations.
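If the full table is unwieldy, you can pull out just the columns you care about with `select()`. This is a quick sketch; the `Title` column name is an assumption about the Dimensions export, so check `names(dimensions)` if yours differs.

```{r}
# A sketch: titles and citation counts of the ten most-cited articles.
# `Title` is assumed to be the title column in the Dimensions export.
dimensions %>%
  arrange(desc(`Times cited`)) %>%
  select(Title, `Times cited`) %>%
  head(n = 10)
```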
## Recent citations

How does this compare to `Recent citations`, [which is](https://support.dimensions.ai/support/solutions/articles/13000051232-what-are-recent-citations-) the number of citations that were received in the last two calendar years?

```{r}
dimensions %>%
  arrange(desc(`Recent citations`)) %>%
  head(n = 10)
```

Some of the articles are the same, but we now see more recent articles rising to the top.

## RCR: Relative Citation Ratio

The Dimensions dataset comes with two other citation metrics. The `RCR` is calculated by the NIH for publications archived in PubMed. It represents the "relative citation ratio," which is an attempt to normalize citation count by area of research. It is described in detail in the 2016 article by Hutchins et al., published in *PLOS Biology*: ["Relative Citation Ratio (RCR): A New Metric That Uses Citation Rates to Measure Influence at the Article Level"](https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002541). Area of research is not defined by keywords, but rather by co-citation network; that is, "the other papers that appear alongside it in reference lists". For more on RCR in PubMed, see ["What is the RCR? How is the RCR score calculated?"](https://support.dimensions.ai/support/solutions/articles/13000038246-what-is-the-rcr-how-is-the-rcr-score-calculated-) and ["Measuring Impact of NIH-supported Publications with a New Metric: the Relative Citation Ratio"](https://nexus.od.nih.gov/all/2016/09/08/nih-rcr/). Briefly, an article with an RCR of 1.0 has as many citations as would be expected for articles in its co-citation network, whereas an article with an RCR of 2.0 has approximately twice as many citations as would be expected.

***

**TRY IT YOURSELF**

Using the above code as an example, write code that sorts the dataset by `RCR` in descending order, and displays the top 10 items.

```{r}
# write code here
dimensions %>%
  arrange(desc(RCR)) %>%
  head(n = 10)
```

## Publication Year

What does the distribution of publication years look like? The date range:

```{r}
range(dimensions$PubYear)
```

Distribution:

```{r}
dimensions %>%
  group_by(PubYear) %>%
  tally() %>%
  arrange(desc(n))
```

We may also want to run a visualization using `ggplot()` from the `ggplot2` package:

```{r}
dimensions %>%
  ggplot() +
  geom_bar(mapping = aes(x = PubYear))
```

We can see a trend that publications on the topic have largely been increasing since 2015, which is not particularly surprising given the [increasing number of open access articles](https://peerj.com/articles/4375/).

## Journals

We may also be interested in what journals appear in the result set. In the below code, we remove journals with no data (`NA`) by using the `filter()` function in combination with `complete.cases()`. We then `group_by()` (cluster) the journal (`Source title`), `tally()` the number of articles per journal, and `arrange()` that count `n` in descending order.

```{r}
dimensions %>%
  filter(complete.cases(`Source title`)) %>%
  group_by(`Source title`) %>%
  tally() %>%
  arrange(desc(n))
```

# Exercise 3: Getting references with `citecorp`

The Dimensions data includes the citation count, but it does not include a list of the articles that have cited the document, nor does it include a list of references in the document. For this data, we need to use `citecorp`, the R wrapper to the Open Citations Corpus API. The `oc_coci_cites()` function will allow you to get citations to a document.
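For the other direction, `citecorp` also provides `oc_coci_refs()`, which returns the works a given document cites (its reference list). Here is a quick sketch using the same DOI we will work with below:

```{r}
# A sketch: oc_coci_refs() is the complement of oc_coci_cites(),
# returning the document's reference list rather than its citers.
my_refs <- oc_coci_refs(doi = "10.1371/journal.pbio.0040157")
head(my_refs)
```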
For our example, we will use the article "Citation Advantage of Open Access Articles," published by *PLOS Biology* in 2006. According to Dimensions, this article has 361 citations; let's use the Open Citations Corpus to see if there is a difference and to get some information about the citing articles. We pass the DOI on to the function:

```{r}
my_cites <- oc_coci_cites(doi = "10.1371/journal.pbio.0040157")
```

```{r}
my_cites
```

As described in the documentation for this function at http://opencitations.net/index/coci/api/v1, the variables returned are:

* oci: the Open Citation Identifier (OCI) of the citation in consideration;
* citing: the DOI of the citing entity;
* cited: the DOI of the cited entity;
* creation: the creation date of the citation according to the ISO date format YYYY-MM-DD, which corresponds to the publication date of the citing entity;
* timespan: the interval between the publication date of the cited entity and the publication date of the citing entity, expressed using the XSD duration format PnYnMnD;
* journal_sc: whether the citation is a journal self-citation (i.e. the citing and the cited entities are published in the same journal);
* author_sc: whether the citation is an author self-citation (i.e. the citing and the cited entities have at least one author in common).

This returned 238 citing articles, quite a bit lower than the Dimensions count. As we have discussed (and there is a good amount of research to substantiate), these numbers can range widely. For instance, Google Scholar returns 936 articles that have cited this one! From here, you can do some filtering right away if you would like to remove author or journal self-citations, or get citations in a particular date range. We will not be doing that here.

To get article metadata such as authors, journal titles, open access availability, and even references within the citing documents, we can pass the DOIs in the `citing` column to the function `oc_coci_meta()`. To avoid a potential timeout issue from running this in an Azure Notebook, we will subset to include only the first 50 rows with `[1:50]`. If you were doing this in R Studio, the timeout likely would not be an issue. So, keep in mind that this dataset is for learning purposes only, not analysis: it includes just a subset (50) of the citing articles.

```{r}
my_cites_meta <- oc_coci_meta(my_cites$citing[1:50])
```

```{r}
head(my_cites_meta, n = 10)
```

From here we can do a range of analyses, including authorship, affiliation, journal, open access availability (of particular interest when the articles themselves are on the topic of open access), keyword co-occurrences, network analysis, etc. For our purposes here, we will look at the average number of articles per year that have cited our target article. To do this, we first use `mutate()` and `as.integer()` to coerce the `year` variable to a number type (it came in as a character). Then we run a `ggplot()` with a frequency polygon, `geom_freqpoly()`, which is essentially a histogram drawn with a line. We set the `binwidth` to 1.

```{r}
my_cites_meta %>%
  mutate(year = as.integer(year)) %>%
  ggplot() +
  geom_freqpoly(mapping = aes(x = year), binwidth = 1)
```
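The frequency polygon shows the yearly counts; to compute the average number of citing articles per year as a single number, here is a quick sketch using `count()` from `dplyr`:

```{r}
# A sketch: tally citing articles per publication year, then average
# those yearly counts.
cites_per_year <- my_cites_meta %>%
  mutate(year = as.integer(year)) %>%
  count(year)

cites_per_year
mean(cites_per_year$n)
```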
We can also see if the articles that have an open access version available have more citations themselves. Here, we do three mutations:

* coerce `year` to integer, as above
* in the `oa_link` column, use `na_if()` to replace all blank cells, represented by empty quotes `""`, with `NA`
* create a new column `oa_available`, using `complete.cases()` to return `TRUE` if there is a value in the `oa_link` column, and `FALSE` if there is an `NA`. We can then use this to indicate whether the article has an open access version.

We then plot a `geom_freqpoly()` as above, only we add the argument `color = oa_available`, which will create two lines: one representing articles that have open access availability (in teal), the other not (in red).

```{r}
my_cites_meta %>%
  mutate(year = as.integer(year),
         oa_link = na_if(oa_link, ""),
         oa_available = complete.cases(oa_link)) %>%
  ggplot() +
  geom_freqpoly(mapping = aes(x = year, color = oa_available), binwidth = 1)
```

## Exercise 4: Getting formatted references in a BibTeX or RIS file

You can write these to BibTeX or RIS files using the `cr_cn()` function from `rcrossref`: pass `format = "bibtex"` for BibTeX or `format = "ris"` for RIS. We will just write the first ten using `[1:10]`:

```{r}
# Use cr_cn() to get BibTeX entries for my DOIs
my_citations_bibtex <- cr_cn(my_cites_meta$doi[1:10], format = "bibtex") %>%
  purrr::map_chr(., purrr::pluck, 1)
```

Write it to a .bib file using `writeLines()` and put it in the `data` folder:

```{r}
# write to a BibTeX file
writeLines(my_citations_bibtex, "data/my_citations_bibtex.bib")
```

## Writing publications to CSV

We will use the `write_csv()` function to write our data to disk as a CSV file.

```{r}
write_csv(my_cites_meta, "data/my_cites_meta.csv")
```

# Next steps

There are several other excellent R packages that interface with publication metadata and can be used in conjunction with these tools. Examples:

* `rorcid` is a wrapper for the ORCID API, with functions for searching for people, searching by DOI, and searching by ORCID iD. https://cran.r-project.org/web/packages/rorcid/rorcid.pdf. See my walkthrough at https://ciakovx.github.io/rorcid.html.
* `bibliometrix` "is an open-source tool for quantitative research in scientometrics and bibliometrics that includes all the main bibliometric methods of analysis." See more information at https://bibliometrix.org/.
* `rromeo` is a wrapper for the SHERPA/RoMEO API. You can retrieve a set of publication metadata from `rcrossref`, then use the ISSN to look up the journal's policies regarding the archival of preprints, postprints, and publisher versions. https://cran.r-project.org/web/packages/rromeo/rromeo.pdf
* `crminer` "includes functions for getting links to full text of articles, fetching full text articles from those links or Digital Object Identifiers ('DOIs'), and text extraction from 'PDFs'." https://cran.r-project.org/web/packages/crminer/crminer.pdf