This section covers an important strategy for cleaning data: aiming to create tidy data, a defined standard for how data should be laid out before you work with it. The concept is championed by Hadley Wickham, one of the most prominent developers in the R community, and is supported by his collection of packages, the so-called tidyverse.

Tidy data

There are a great many resources on tidy data that explain it better than we can here, so we give only a brief summary and leave some links for further reading. The main point is that we embrace its philosophy: as a data analyst, it is wise to get your data into a tidy form once you have imported it.

Tidy summary

“Tidy datasets are all alike, but every messy dataset is messy in its own way” - Hadley Wickham

The aim of a tidy dataset is to lay the data out in a consistent way, so that further processing can be done in an organised fashion:

  1. Each variable in the data set is placed in its own column
  2. Each observation is placed in its own row
  3. Each value is placed in its own cell

Tidy data isn’t necessarily the easiest layout for a human to read, but it is well suited for passing into further R functions, especially those within the tidyverse.
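As a minimal sketch of the three rules above, the example below reshapes a small "wide" table into tidy form. It assumes the tidyr package is installed; the table of yearly page views and its column names are made up purely for illustration.

```r
library(tidyr)

# Hypothetical "wide" table: easy to read, but the year variable is
# stored in the column names instead of in its own column.
messy <- data.frame(
  page = c("home", "about"),
  `2015` = c(120, 30),
  `2016` = c(150, 45),
  check.names = FALSE
)

# pivot_longer() gathers every column except `page` into a "year" column
# and a "views" column, so each row becomes one page-year observation.
tidy <- pivot_longer(
  messy,
  cols = -page,
  names_to = "year",
  values_to = "views"
)

print(tidy)
```

The wide version is arguably nicer to look at, but the long version has one column per variable (page, year, views), which is the shape most tidyverse functions expect.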

Tidyverse

The tidyverse is the newer name for what used to be called the Hadleyverse: a collection of R packages created or supported by Hadley Wickham.

These packages aim to cover the whole spectrum of data analysis within R, each supporting the other in concepts and output. The main packages within are:

Whilst you can achieve what the above packages do in base R or with other packages (data.table being a notable alternative to dplyr), if you embrace the tidyverse then you will need to become familiar with some, if not all, of the packages above.
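To illustrate the point about base R, here is a rough sketch of the same grouped summary written both ways. It assumes the dplyr package is installed; the data frame and its columns are hypothetical.

```r
library(dplyr)

# Hypothetical tidy data: one row per page-year observation.
page_views <- data.frame(
  page  = c("home", "home", "about", "about"),
  year  = c(2015, 2016, 2015, 2016),
  views = c(120, 150, 30, 45)
)

# Base R: total views per page using aggregate().
base_totals <- aggregate(views ~ page, data = page_views, FUN = sum)

# dplyr: the same summary written as a pipeline of verbs.
dplyr_totals <- page_views %>%
  group_by(page) %>%
  summarise(views = sum(views))

print(base_totals)
print(dplyr_totals)
```

Both approaches give the same totals. Which one reads better is largely a matter of taste, but the dplyr version follows the verb-by-verb style shared across the tidyverse.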