---
title: 'Retriever: Data Retrieval Tool'
tags:
  - data retrieval, data processing, python, data, data science, datasets
authors:
 - name: Henry Senyondo
   orcid: 0000-0001-7105-5808
   affiliation: 1
 - name: Benjamin D. Morris
   orcid: 0000-0003-4418-1360
 - name: Akash Goel
   orcid: 0000-0001-9878-0401
   affiliation: 3
 - name: Andrew Zhang
   orcid: 0000-0003-1148-7734
   affiliation: 4
 - name: Akshay Narasimha
   orcid: 0000-0002-3901-2610
   affiliation: 5
 - name: Shivam Negi
   orcid: 0000-0002-5637-0479
   affiliation: 6
 - name: David J. Harris
   orcid: 0000-0003-3332-9307
   affiliation: 4
 - name: Deborah Gertrude Digges
   orcid: 0000-0002-7840-5054
   affiliation: 10
 - name: Kapil Kumar
   orcid: 0000-0002-2292-1033
   affiliation: 7
 - name: Amritanshu Jain
   orcid: 0000-0003-1187-7900
   affiliation: 5
 - name: Kunal Pal
   orcid: 0000-0002-9657-0053
   affiliation: 8
 - name: Kevinkumar Amipara
   orcid: 0000-0001-5021-2018
   affiliation: 9
 - name: Ethan P. White
   orcid: 0000-0001-6728-7745
   affiliation: 1, 2
affiliations:
 - name: Department of Wildlife Ecology and Conservation, University of Florida
   index: 1
 - name: Informatics Institute, University of Florida
   index: 2
 - name: Delhi Technological University, Delhi
   index: 3
 - name: The University of Florida
   index: 4
 - name: Birla Institute of Technology and Science, Pilani
   index: 5
 - name: Manipal Institute of Technology, Manipal
   index: 6
 - name: National Institute of Technology, Delhi
   index: 7
 - name: RWTH Aachen University, Aachen, Germany
   index: 8
 - name: Sardar Vallabhbhai National Institute of Technology, Surat
   index: 9
 - name: PES Institute of Technology, Bengaluru
   index: 10


date: 16 September 2017 
bibliography: paper.bib
---

# Summary

The Data Retriever automates the first steps in the data analysis workflow by downloading, cleaning, and standardizing tabular datasets, and importing them into relational databases, flat files, or programming languages [@Morris2013]. The automation of this process reduces the time for a user to get most large datasets up and running by hours to days. The retriever uses a plugin infrastructure for both datasets and storage backends. New datasets that are relatively well structured can be added adding a JSON file following the Frictionless Data tabular data metadata package standard [@frictionlessdata_specs]. More complex datasets can be added using a Python script to handle complex data cleaning and merging tasks and then defining the metadata associated with the cleaned tables. New storage backends can be added by overloading a general class with details for storing the data in new file formats or database management systems. The retriever has both a Python API and a command line interface. An R package that wraps the command line interface and a Julia package that wraps the Python API are also available.

The 2.0 and 2.1 releases add extensive new functionality. This includes the Python API, the use of the Frictionless Data metadata standard, Python 3 support, JSON and XML backends, and autocompletion for the command line interface.

# References