--- title: 'covid19.analytics: An R Package to Obtain, Analyze and Visualize Data from the 2019 Coronavirus Disease Pandemic' tags: - R - CRAN - coronavirus - data access authors: - name: Marcelo Ponce^[corresponding author] orcid: 0000-0001-5850-7240 affiliation: "1" # (Multiple affiliations must be quoted) - name: Amit Sandhel affiliation: 2 affiliations: - name: SciNet HPC Consortium, University of Toronto, Canada index: 1 - name: Independent Researcher index: 2 date: 21 January 2021 bibliography: references.bib --- # Summary With the emergence of a new pandemic worldwide, novel strategies attempting to understand its evolution and dynamics, empowered by computational techniques such as data analysis and computational modelling, have also emerged. Several initiatives under the umbrella of *open science* are contributing to tackle this unprecedented situation. In particular, the *R Language* [@citeR; @ihaka:1996] could be an excellent tool and ecosystem for approaches focusing on open science and reproducible research. Hence it is not surprising that with the onset of the pandemic, a large number of R packages and resources were made available for researchers working in the pandemic. In this paper, we present the `covid19.analytics` R package, which allows users to access and analyze worldwide data from publicly available resources. The `covid19.analytics` R package allows users to obtain live^[The data usually is accessible from the repositories with a 24 hours delay.] worldwide data of reported cases from the novel *Coronavirus Disease* (COVID-19), as well as related datasets, such as genomics data, testing and vaccinations records, and historical pandemics records. It does this by accessing and retrieving data publicly available and published by several sources, such as: - The "COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University" [@JHUCSSErepo] for the worldwide and US data - Health Canada [@HealthCanada], for Canada specific data - The city of Toronto for Toronto data [@TorontoData] - Open Data Toronto for Toronto data [@OpenDataToronto] - Our World In Data for testing and vaccination data [@OWIDvaccination] - NCBI servers for genomics datasets [@NCBIdatabases] - Visual Capitalist infographics for historical pandemic records [@VCpandemics] The package also provides basic analysis and visualization tools and functions to investigate these datasets and other structured in a similar fashion. # Statement of need One important element to consider in the case of an ongoing and unfolding pandemic is the rapid data accessibility and its prompt availability. One of the main goals of the `covid19.analytics` package is to make the latest data related to the COVID19 pandemic promptly available to researchers and the scientific community. As of today, there are a few dozen other packages also in the CRAN repository that allow users to gain access to different datasets about the COVID-19 pandemic. Some packages just provide access to data from specific geographical locations or the approach to the data structure in which the data is presented is different from the one presented here. Nevertheless, having a variety of packages that users can try and probably combine is an important and crucial element in data analysis. Moreover, it has been reported that there are cases of data misuse/misinterpretation due to different issues, such as erroneous metadata or data formats [@Ledford_2020], in some cases resulting in articles being retracted [@Schriml:2020aa]. Therefore providing additional functionalities to check the integrity and consistency of the data, as our `covid19.analytics` package does, is paramount. This is especially true in a situation where the unfolding of events and data availability is flowing so fast that it sometimes is even hard to keep track of all the changes. As of today and to the best of our knowledge, there isn't any other R package that provides direct access to genomics data of the COVID-19 virus. The `covid19.analytics` package is indeed capable of retrieving genomics data, and it does this by incorporating a novel, more reliable and robust way of accessing and designing different pathways to the data sources. Moreover, the `covid19.analytics` package offers a modular and versatile approach to the data by allowing users to input their own data for which most of the functions in the package can be applied when the data is structured using a *time series* format as described in the package documentation. # Functionalities and Main Features The `covid19.analytics` package is an open source tool; its main goal is to offer users prompt and quick access to different data sources related to the unfolding pandemic. Its main implementation and API is the R package [@covid19analytics] which is available on CRAN [@cran]. In addition to this, the package has a few more add-ons: - A central GitHub repository, , where the latest development version and source code of the package are available. Users can also submit tickets for bugs, suggestions, or comments using the "issues" tab in this repository. - A rendered version with live examples and documentation also hosted at GitHub pages, - A dashboard for interactive usage of the package with extended capabilities for users without any coding expertise, - A backup data repository hosted at GitHub, , where replicas of the live datasets are stored for redundancy and robust accessibility (see \autoref{fig:dataSrcs}). ![Schematic of the data acquisition flows between the `covid19.analytics` package and the different sources of data. Dark and solid/dashed lines represent API functions provided by the package accessible to the users. Dotted lines are "internal" mechanisms employed by the package to synchronize and update replicas of the data. \label{fig:dataSrcs}](./covid19-data-sources.pdf) The package also incorporates a dashboard to facilitate access to its functionalities to less experienced users (see \autoref{fig:dashboard}). This can be accessed interactively in a local session or through the public deployed dashboard at the URL described above. ![Screenshot of a `covid19.analytics` dashboard implementation -- the dashboard can be used through a web-interface or deployed locally on the users' machine.\label{fig:dashboard}](./dashboard-mosaic.png) Another unique feature of this package is the ability to incorporate models to estimate the disease spread by using the actual data. The `covid19.analytics` package allows users to model the dispersion of the disease by implementing a simple *Susceptible-Infected-Recovered* (SIR) model [@kermack1927contribution; @smith2004sir]. The package can also generate multiple SIR models by varying some of the parameters used to select the actual data to be used in the model and some additional visualizations tools. There are several plotting and visualization functions within the `covid19.analytics` package, generating both static and interactive visualizations. Further details, documentation, examples, and tutorials of the `covid19.analytics` package can be found on the [package repository](https://github.com/mponce0/covid19.analytics) and [@ponce2020covid19analytics]. Users can submit bug reports and improvements requests or seek for support using the ["issues tracker"](https://github.com/mponce0/covid19.analytics/issues) tab in the GitHub repository. Contributions from users and the community are welcome and can be made using ["pull requests"](https://github.com/mponce0/covid19.analytics/pulls) in the repository as well. We will continue working on adding and developing new features to the package; a glimpse and brief description about these can be found on the repository's [Projects area](https://github.com/mponce0/covid19.analytics/projects). # Acknowledgments MP wants to thank all his colleagues at SciNet, especially Daniel Gruner for his continuous and unconditional support, and Marco Saldarriaga who helped us setting up the VM for installing the shiny server hosting the covid19.analytics dashboard web interface. # References