---
title: "Describing data"
author: "Your name here"
format: html
---
In this template, we have put the questions from the exercises in `**` to make them bold (to distinguish the questions from your answers).
```{r}
#| message: false
#| include: false
library(dplyr)
library(ggplot2)
library(farr)
library(forcats) # fct_inorder()
library(tidyr) # pivot_wider()
library(lubridate) # ceiling_date()
```
```{r}
#| include: false
rets_nab_cba <-
aus_banks |>
inner_join(aus_bank_rets, by = "gvkey") |>
filter(ticker %in% c("CBA", "NAB")) |>
select(ticker, datadate, ret)
```
```{r}
#| include: false
rets_nab_cba_wide <-
rets_nab_cba |>
pivot_wider(id_cols = datadate,
names_from = ticker,
values_from = ret) |>
drop_na()
```
```{r}
#| include: false
var_alt <- function(x) {
sum((x - mean(x))^2)/(length(x) - 1)
}
```
```{r}
#| include: false
cov_alt <- function(x, y) {
sum((x - mean(x)) * (y - mean(y)))/(length(x) - 1)
}
```
```{r}
#| include: false
latest_mkt_cap <-
aus_banks |>
inner_join(aus_bank_rets, by = "gvkey") |>
filter(datadate == max(datadate)) |>
select(ticker, co_name, mkt_cap) |>
arrange(desc(mkt_cap)) |>
mutate(ticker = fct_inorder(ticker))
```
# Exercises
```{r}
cor_alt <- function(x, y) {
# This function is not complete!
cov_alt(x, y)
}
```
1. Create a function `cor_alt(x, y)` that uses `cov_alt()` and `var_alt()` to calculate the correlation between `x` and `y`.
Check that it gives the same value as the built-in function `cor()` for the correlation between `ret_nab` and `ret_cba` from `rets_nab_cba_wide`.
2. If we remove the `drop_na()` line used in creating `rets_nab_cba_wide`, we see missing values for `CBA`.
There are two reasons for these missing values.
One reason is explained [here](https://en.wikipedia.org/wiki/Commonwealth_Bank), but the other reason is more subtle and relates to how values are presented in `datadate`.
What is the first reason?
(*Hint:* What happened to CBA in 1991?)
What is the second reason?
How might we use `lubridate::ceiling_date(x, unit = "month")` to address the second reason?
Does this second issue have implications for other plots?
3. Adapt the code used above to calculate the *correlation* matrix for the returns of Australian banks to instead calculate the *covariance* matrix.
What is the calculated value of the variance of the returns for NAB?
4. From the output above, what is the value for the variance of NAB's returns given by the `cov()` function applied to `rets_nab_cba_wide`?
Why does this value differ from that you calculated in the previous question?
5. What do the two-table verbs `semi_join()` and `anti_join()` do?
In what way do they differ from the two-table verbs listed above?
How could we replace `filter(ticker %in% latest_mkt_cap$ticker)` (see above) with one of these two verbs?
6. In calculating ROE above, we used `ib` rather than a measure of "net income".
According to [WRDS](https://wrds-www.wharton.upenn.edu/pages/support/support-articles/compustat/global/ni-net-income-variable/), "`ni` [net income] only applies to Compustat North America. Instead use: `ni = ib + xi + do`."
Looking at the data in `aus_bank_funds`, does this advice seem correct?
How would you check this claim?
(*Hint:* You should probably focus on cases where both `xi` and `do` are non-missing and checking more recent years may be easier if you need to look at banks' financial statements.)
7. Above we created a plot of market-to-book ratios.
Another measure linking stock prices to fundamentals is the **price-to-earnings ratio** (also known as the **PE ratio**).
Typically, PE ratios are calculated as
$$ \textrm{PE} = \frac{\textrm{Stock price}}{\textrm{Earnings per share}} $$
where
$$ \textrm{Earnings per share} = \frac{\textrm{Net income}}{\textrm{Shares outstanding}} $$
So we might write
$$ \textrm{PE} = \frac{\textrm{Stock price} \times \textrm{Shares outstanding}}{\textrm{Net income}} $$
What critical assumption have we made in deriving the last equation?
Is this likely to hold in practice?

8. Calculating the PE ratio using `pe = mkt_cap / ib`, create a plot of PE ratio for Australian banks over time like that created for market-to-book ratios above.
9. Suppose you wanted to produce the plots in the test (market capitalization; market-to-book ratios; histogram of ROE) using Excel starting from spreadsheet versions of the three data sets provided above?
Which aspects of the task would be easier?
Which would be more difficult?
What benefits do you see in using R code as we did above?
10. Using the documentation from the `farr` package, describe the contents of the `by_tag_year` data frame (type `help(by_tag_year)` or `? by_tag_year` after loading the `farr` package).
11. Using `by_tag_year`, create a plot that displays the total number of questions asked across all languages over time.
12. Produce a plot like the one above, but focused on questions related to R.
13. If we want to know the popularity of R relative to other languages, we're probably more interested in a percentage, instead of just the counts.
Add a new variable that is the fraction of all questions asked in each year with a specific tag out to the dataset and plot this variable focused on questions related to R.
14. Two popular R packages we have used in this chapter---`dplyr` and `ggplot2`---also have Stack Overflow tags.
Perform the same steps that you did for R above for these two tags to see whether they are growing as well.
15. Produce a plot that depicts the relative popularity of R, Python, SAS, Stata, and Matlab over time.
16. Which language among R, SAS, Stata, and MATLAB, has triggered the most questions in the history of Stack Overflow?
(*Hint*: Use the `dplyr` verbs `summarize()` and `group_by()`.)