# An R Markdown Demo

*Author: Charles J. Geyer*

## License

This work is licensed under a Creative Commons
Attribution-ShareAlike 4.0 International License
(<http://creativecommons.org/licenses/by-sa/4.0/>).


## R

 * The version of R used to make this document is `r getRversion()`.

 * The version of the `rmarkdown` package used to make this document is `r packageVersion("rmarkdown")`.

 * The version of the `knitr` package used to make this document is `r packageVersion("knitr")`.

 * The version of the `ggplot2` package used to make this document is `r packageVersion("ggplot2")`.

```{r packages}
library(rmarkdown)
library(ggplot2)
```

## Rendering

The source for this file is
https://raw.githubusercontent.com/IRSAAtUMn/RWorkshop18/master/07-rmark.Rmd.

This is a demo for using the R package `rmarkdown`.  To turn this file
into HTML, after you have downloaded it, use the R command
```{r eval=FALSE}
render("07-rmark.Rmd")
```

The same rendering can be accomplished in RStudio by loading the document
into Rstudio and clicking the "Knit" button.

R markdown can also be converted to output formats other than HTML.
Among these are PDF, Microsoft Word, and e-book formats.
Other output formats are
[explained in the Rmarkdown documentation](http://rmarkdown.rstudio.com/lesson-9.html).

## Markdown

Markdown ([Wikipedia page](https://en.wikipedia.org/wiki/Markdown)) is
"markup language" like HTML and LaTeX but much simpler.  Variants of the
original markdown language are used by GitHub (https://github.com) for
formatting README files and other documentation,
by reddit (https://www.reddit.com/) for formatting posts and comments,
and by R Markdown (http://rmarkdown.rstudio.com/) for making documents
that contain R computations and graphics.

Markdown is fast becoming an internet standard for "easy" markup.

![xkcd:927 Standards](images/standards.png){title="Fortunately, the charging one has been solved now that we&#39;ve all standardized on mini-USB. Or is it micro-USB? Shit."}

Markdown is way too simple to replace HTML or LaTeX, but it usually gets the job
done, even if the result isn't as pretty as one might like.

## R Markdown

### Look

If you really need your documents to look exactly the way you want them
to look, then you will have to learn LaTeX and `knitr`.  But if you are
willing to accept the look that R Markdown gives you, then it is satisfactory
for most purposes.

### What Does It Do? {#rdo}

R Markdown allows you to include R computations and graphics in documents
*reproducibly*.  That means that anyone anywhere who gets your R markdown
file can satisfy themselves that R did what you say it did.

This is very different from the old fashioned way of putting results
in documents, where one just cut-and-pastes
([snarf-and-barfs](http://www.catb.org/jargon/html/S/snarf-ampersand-barf.html))
the results
into the document.  Then there is zero evidence that these numbers or figures
were produced the way you claim.

For concreteness, here is a simple example.
```{r}
2 * pi
```
The result here was not cut-and-pasted into the document.  Instead the
R expression `2 * pi` was executed in R and the result was put in the document
*by R Markdown, automatically*.

This happens *every time the document is generated* so the result always
matches the R code that generated it.  There is no way they can fail to match
(which can happen and often does happen with snarf-and-barf).

Here is another concrete example.
```{r fig.align='center', label='histogram', fig.cap='Histogram with probability density function.'}
# set.seed(42) # uncomment to always get the same plot
mydata <- data.frame(x = rnorm(1000))
ggplot(mydata, aes(x)) +
    geom_histogram(aes(y = ..density..), binwidth = 0.5,
        fill = "cornsilk", color = "black") +
    stat_function(fun = dnorm, color = "maroon")
```
Every time the document is generated, this figure is different because the
random sample produced by `rnorm(1000)` is different.  (If I wanted it to
be the same, I could uncomment the `set.seed(42)` statement.)

Again, there is no way the figure can fail to match the R code that generates
it.  The source code for the document (the `Rmd` file) is *proof* that
everything is as the document claims.  Anyone can verify the results by
generating the document again.

### Why Do We Want It?

#### The "Replication Crisis"

Many are now saying there is a "replication crisis" in science
([Wikipedia page](https://en.wikipedia.org/wiki/Replication_crisis)).

There are many issues affecting this "crisis."

 * There are scientific issues, such as what experiments are done
   and how they are interpreted.

 * There are statistical issues, such as too small sample sizes,
   multiple hypothesis testing without correction,
   and publication bias.

 * And there are computational issues: what data analysis was done and was
   it correct?

R Markdown only directly helps with the computational issues.
It also helps to document your statistical analyses (*Duh!*  Do you think
generating
a document that includes a statistical analysis may help to document it?)

Indirectly, R Markdown also helps with the other issues.  Having the whole
analysis from raw data to scientific findings publicly available --- as many
scientific journals now require --- tends to
make you a lot more careful and a lot more thoughtful in doing the analysis.

#### Business and Consulting

Outside of science there is no buzz about "replication crisis,"
at least not yet.  The hype about "big data" is so strong that hardly
anyone is questioning whether results are correct or actually support
conclusions people draw from them.

But even if the results are never made public, R Markdown can still help a lot.
Imagine you have spent weeks doing a very complicated analysis.  The day before
it is finished, a co-worker tells you there is a mistake in the data that has
to be fixed.  If you are generating your report the old-fashioned way, it will
take almost as much time to redo the report as to do it in the first place.
If you have done the report with R Markdown, correct the data, rerun R Markdown,
and you're done.

R Markdown takes some time to learn.  And it always takes more time to do
a data analysis while simultaneously documenting it than to do it while
skipping the documentation.  But it takes a lot less time than doing the
analysis without documentation and adding documentation later.
Also

 * R Markdown documentation is far more likely to be correct, and

 * the analysis itself is far more likely to be correct.

#### Newbie Data Analysis

The way most newbies use R or any other statistical package is to dive right in

 * typing commands into R,

 * typing commands into a file and cut-and-pasting them into R, or

 * using RStudio.

None of these actually document what was done because commands get edited
a lot.

If you are in the habit of saving your workspace when you leave R or RStudio,
can you explain *exactly* how every R object in there was created?  *Starting
from raw data?*  Probably not.

#### Expert Data Analysis

The way experts use R is

 * type commands into a file, say `foo.R`.  Use
```{r eval=FALSE}
R CMD BATCH --vanilla foo.R
```
   to run R to do the analysis.

 * type commands with explanations into an R Markdown file, and render it
   in a clean R environment (empty global environment).  Either start R
   with a clean global environment (with `R --vanilla`) and do
```{r eval=FALSE}
library(rmarkdown)
render("foo.Rmd")
```
   or start RStudio with a clean global environment (on the "Tools" menu,
   select "Global Options" and uncheck "Restore .RData into workspace at
   startup", then close and restart) load the R Markdown file and click "Knit".

The important thing is using a clean R environment so all calculations are
fully reproducible.  Same results every time the analysis is rerun by you
or by anybody, anywhere, on any computer that has R.

## Examples Not Involving R

One of the purposes of this document is to serve as an example of how to
use R Markdown.

We have already exemplified a lot, but there's more.

### Title, Author, Date

Because this document is a chapter in a "book" done by the
`bookdown` R package, it does not serve as a complete example
of a standalone R markdown document.

So we have made a toy example
https://raw.githubusercontent.com/IRSAAtUMn/RWorkshop18/master/foo/foo.Rmd,
which renders as
http://htmlpreview.github.io/?https://github.com/IRSAAtUMn/RWorkshop18/blob/master/foo/foo.html

That document shows how to put in a title, author, and date using
the YAML comment block (set off by lines of hyphens at the top of
the document).  There does not seem to be any reference where all of
the stuff that can be put in a YAML comment is documented.  What
can be done with YAML comments is scattered in many places in the
R Markdown documentation.

### Headers

Headers are indicated by hash marks (one for each level), as with all
the headers in this document.  There is also
[an alternative format](http://rmarkdown.rstudio.com/authoring_pandoc_markdown.html#headers)
that you can use if you want.

### Paragraphs

Paragraph breaks are indicated by blank lines.

### Fonts

Font changes are indicated by *stars for italic,*
**double stars for boldface,** and `backticks for typewriter font (for code)`
as here.

### Lists

Bulleted lists examples are shown above.  Here is a numbered list

 1. item one,

 1. item two, and

 1. item three.

Note that one does not have to do the numbers oneself.  R Markdown figures
them out (so when you insert a new item in the list it gets the numbers
right without you doing anything to make that happen).

[The reference material for lists](http://rmarkdown.rstudio.com/authoring_pandoc_markdown.html#lists)
shows how to make sublists and more.

### HTML Links

Links have also already been exemplified.

If you want the link text to be
the same as the link URL, you can just put the URL in plain text.  R Markdown
will make it a link.  It can also go in angle brackets.

If you want the link text to be
different from the link URL, you can just put the link text in square brackets
followed by the URL in round brackets (with no space in between the two).
For example, `[CRAN](https://cran.r-project.org)` makes the link
[CRAN](https://cran.r-project.org).

For more about links, [here is the documentation](http://rmarkdown.rstudio.com/authoring_pandoc_markdown.html#links).

### Tables

There is Markdown syntax for tables, but we won't illustrate it here
([documentation](http://rmarkdown.rstudio.com/authoring_pandoc_markdown.html#tables)).

We will be more interested in [tables created by R](#rtables).

## Examples Involving R

### Code Chunks

Code chunks begin with ` ```{r} ` and end with ` ``` `
([documentation](http://rmarkdown.rstudio.com/lesson-3.html)).
These delimiters have to begin in column one (I think).

Here is an example.
```{r}
2 + 2
```
This is a "code chunk" processed by `rmarkdown`.
When `rmarkdown` hits such a thing, it processes it, runs R to get the
results, and stuffs the results (by default) in the output file it is
creating.  The text between code chunks is Markdown.

### Code Chunks With Options

Lots of things can be made to happen by including options in the code
chunk beginning delimiter.  AFAIK all
[`knitr` chunk options](https://yihui.name/knitr/options/) can be used.

Here are some simple ones.

The option `eval=FALSE` says to show the chunk
but do not evaluate it, as here
```{r eval=FALSE}
2 + 2
```

The option `echo=FALSE` says to to **not** show the chunk
but **do** evaluate it (just the opposite of `eval=FALSE`), as here
(nothing appears in the output document because of `echo=FALSE` but
the code chunk is executed).
```{r echo=FALSE}
hide <- 3
```

If you look at the document source you will see a hidden code chunk
that assigns a value to the variable `hide` which we can see in a code
chunk with no options
```{r}
hide
```

This example also shows that all code chunks are executed in the same
R session, so R objects assigned names in earlier chunks can be used
in later chunks.

Many more examples of options for code chunks are exemplified below.

### Plots

We showed a simple plot above, here is a more complicated one.

#### Make Up Data

Simulate regression data, and do the regression.
```{r}
n <- 50
x <- seq(1, n)
a.true <- 3
b.true <- 1.5
y.true <- a.true + b.true * x
s.true <- 17.3
y <- y.true + s.true * rnorm(n)
out1 <- lm(y ~ x)
summary(out1)
```

#### Figure with Code to Make It Shown

The following figure is produced by the following code
```{r fig.align='center', fig.cap='Simple Linear Regression', label='regression'}
mydata <- data.frame(x, y)
ggplot(mydata, aes(x = x, y = y)) + geom_point() +
    geom_smooth(method = "lm")
```
Here we use the chunk options
`fig.align='center', fig.cap='Simple Linear Regression'`
to center the figure and to get the figure legend.

Note that options are comma separated.

#### Figure with Code to Make It Not Shown

For this example we do a cubic regression on the same data.
```{r}
out3 <- lm(y ~ x + I(x^2) + I(x^3))
summary(out3)
```
Then we plot this figure with a hidden code chunk (so the R commands
to make it do not appear in the document).
```{r fig.cap="Scatter Plot with Cubic Regression Curve", echo=FALSE, fig.align='center', label='cubic'}
ggplot(mydata, aes(x = x, y = y)) + geom_point() +
    geom_smooth(method = "lm", se = FALSE, formula = y ~ poly(x, 3))
```

This plot is made by a hidden code chunk that uses the option `echo=FALSE`
in addition to `fig.align` and `fig.caption` that were also used in the
preceding section.

Also note that, as with the figure in [the section titled **What Does It Do?**](#rdo) above, every time we rerun `rmarkdown` these two figures
change because the simulated
data are random.  (We could use `set.seed` to make the simulated data always
the same, if we wanted to.)

### R in Text

The *no snarf and barf* rule must be adhered to strictly.  None at all!

When you want to refer to some number in R printout, either make a code chunk
that contains the printout you want, or, much nicer looking, you can "inline"
R printout.

Here we show how to do that.
The quadratic and cubic regression coefficients
in the preceding regression were
$`r out3$coef[3]`$
and
$`r out3$coef[4]`$.
Magic!

See the source for this document to see how the magic works.
The dollar signs around the in-line R chunks are required by the
way R markdown handles scientific notation.  It outputs LaTeX,
which is explained in [Section 7.8](#latex-math) below.
It thus forces you to understand at least that much LaTeX.
(Sorry about that!)

If you never snarf and barf, and everything in your document is computed
by R, then everything is always as claimed.

### Tables {#rtables}

The same goes for tables.
Here is a "table" of sorts in some R printout.
```{r}
out2 <- lm(y ~ x + I(x^2))
anova(out1, out2, out3)
```

We want to turn that into a table in output format we are creating.
First we have to figure out what the output of the R function
`anova` is and capture it so we can use it.
```{r}
foo <- anova(out1, out2, out3)
class(foo)
```
So now we are ready to turn the data frame `foo` into a table
and the simplest way to do that seems to be the `kable` option on
our R chunk
```{r kable, echo = FALSE}
options(knitr.kable.NA = '')
knitr::kable(foo, caption = "ANOVA Table", digits = c(0, 1, 0, 2, 3, 3))
```

## LaTeX Math

You can put real math in R Markdown documents.  The way you do it mimics
LaTeX ([Wikipedia page](https://en.wikipedia.org/wiki/LaTeX)), which is
far and away the best document preparation system for ink on paper documents
(it doesn't work so well for e-readers).

To actually learn LaTeX math, you have to read Section 3.3 of the [LaTeX book](https://www.amazon.com/LaTeX-Document-Preparation-System-2nd/dp/0201529831/) and should also read the [User’s Guide for the `amsmath` Package](http://www.tug.org/teTeX/tetex-texmfdist/doc/latex/amsmath/amsldoc.pdf).

Here are some examples to illustrate how good it is and how it works with
R Markdown.
$$
   f(x) = \frac{1}{\sqrt{2 \pi} \sigma} e^{- (x - \mu)^2 / (2 \sigma^2)},
   \qquad - \infty < x < \infty.
$$

If
$$
   f(x, y) = \frac{1}{2}, \qquad 0 < x < y < 1,
$$
then
\begin{align}
   E(X)
   & =
   \int_0^1 d y \int_0^y x f(x, y) \, d x
   \\
   & =
   \frac{1}{2}
   \int_0^1 d y \int_0^y x \, d x
   \\
   & =
   \frac{1}{2}
   \int_0^1 d y \left[ \frac{x^2}{2} \right]_0^y
   \\
   & =
   \frac{1}{2}
   \int_0^1 \frac{y^2}{2} \, d y
   \\
   & =
   \frac{1}{2}
   \cdot
   \left[ \frac{y^3}{6} \right]_0^1
   \\
   & =
   \frac{1}{2}
   \cdot
   \frac{1}{6}
   \\
   & =
   \frac{1}{12}
\end{align}

## Caching Computation

If computations in an R Markdown take so much time that editing the document
becomes annoying, you can "cache" the computations by adding the option
`cache=TRUE` to time consuming code chunks.

This feature is rather smart.  If anything changes in code for
the cached computations, then the computations will be redone.
But if nothing has changed, the
computations will not be redone (the cached results will be used again)
and no time is lost.

My Stat 3701 course notes have an example that does caching:

 * [R Markdown source](http://www.stat.umn.edu/geyer/3701/notes/mcmc-bayes.Rmd)
   and

 * [HTML output](http://www.stat.umn.edu/geyer/3701/notes/mcmc-bayes.html).

There are also other examples of lots of other things at
<http://www.stat.umn.edu/geyer/3701/notes/>.

## Summary

Rmarkdown is terrific, so important that we cannot get along without
it or its older competitors `Sweave` and `knitr`.

Its virtues are

 * The numbers and graphics you report are actually what they
are claimed to be.

 * Your analysis is reproducible.  Even years later, when you've
completely forgotten what you did, the whole write-up, every single
number or pixel in a plot is reproducible.

 * Your analysis actually works---at least in this particular instance.
The code you show actually executes without error.

 * Toward the end of your work, with the write-up almost done you
discover an error.  Months of rework to do?  No!  Just fix the error
and rerun Rmarkdown.  One single problem like
this and you will have all the time invested in Rmarkdown repaid.

 * This methodology provides discipline.
There's nothing that will make you clean up your code like
the prospect of actually revealing it to the world.

Whether we're talking about homework, a consulting report, a textbook,
or a research paper.  If they involve computing and statistics,
this is the way to do it.