This section covers an important strategy for cleaning data: aiming to create tidy data, a defined standard for how data should look before you work with it. The concept is championed by Hadley Wickham, one of the most influential figures in the R community, and is supported by his collection of packages, the so-called tidyverse.
There are a great many resources on tidy data that explain it better than we can here, so we give only a brief summary and leave some links for further reading. The main point is that we embrace its philosophy: as a data analyst, it is wise to aim for tidy data as soon as you have imported your data.
“Tidy datasets are all alike, but every messy dataset is messy in its own way” - Hadley Wickham
The aim of a tidy dataset is to arrange the data so that further processing can be done in an organised fashion.
Tidy data isn’t necessarily the data that is easiest for a human to read, but it is well suited for passing into further R functions, especially those within the tidyverse.
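As a rough illustration of what "tidy" means in practice: in Wickham's formulation, each variable forms a column, each observation forms a row, and each type of observational unit forms a table. The sketch below reshapes a small wide table into that layout. The data, the column names (`channel`, `jan`, `feb`, `sessions`) and the choice of `pivot_longer()` (which assumes tidyr 1.0.0 or later) are illustrative assumptions, not something taken from the original text.

```r
library(tidyr)

# Hypothetical "messy" data: the month variable is spread across column names
messy <- data.frame(
  channel = c("organic", "email"),
  jan     = c(120, 45),
  feb     = c(150, 60)
)

# Reshape so that each row is one observation (one channel in one month)
# and each variable (channel, month, sessions) has its own column
tidy <- pivot_longer(
  messy,
  cols      = c(jan, feb),
  names_to  = "month",
  values_to = "sessions"
)

print(tidy)
```

The wide version is arguably easier to scan by eye, but the long version is the shape most tidyverse functions expect as input.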
The tidyverse is the new name for what used to be called the Hadleyverse: a collection of R packages created or supported by Hadley Wickham.
These packages aim to cover the whole spectrum of data analysis within R, each complementing the others in concepts and output. The main packages within are:
tibble - creating more user-friendly data.frames
dplyr - data manipulation of data.frames
ggplot2 - plotting tidy data
tidyr - tools to tidy data.frames
purrr - functional programming tools for more general data manipulation
magrittr - the origin of %>%, extensively used in the tidyverse
broom - turn statistical models into tidy data.frames/tibbles
Whilst you can achieve what the above packages do in base R or with other packages (data.table being a notable alternative to dplyr), if you embrace the tidyverse then you will need to become familiar with some, if not all, of the packages above.
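To give a feel for how these packages fit together, here is a minimal sketch that builds a tibble and passes it through a short dplyr pipeline using %>%. The `visits` data and its column names are hypothetical and chosen only for illustration.

```r
library(tibble)
library(dplyr)   # dplyr re-exports %>% from magrittr

# Hypothetical tidy data: one row per channel per month
visits <- tibble(
  channel  = c("organic", "organic", "email", "email"),
  month    = c("jan", "feb", "jan", "feb"),
  sessions = c(120, 150, 45, 60)
)

# Each step takes a data.frame and returns a data.frame,
# so the steps chain naturally with the pipe
summary_tbl <- visits %>%
  filter(sessions > 50) %>%                      # keep rows with more than 50 sessions
  group_by(channel) %>%                          # one group per channel
  summarise(total_sessions = sum(sessions)) %>%  # total the sessions in each group
  arrange(desc(total_sessions))                  # largest total first

print(summary_tbl)
```

Because the input is already tidy, no reshaping is needed before filtering, grouping and summarising, and the result could be handed straight to ggplot2 for plotting.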