--- title: "Solutions to 'Describing data' exercises" author: "Your name here" format: html --- # Discussion of source file Below (in the source .qmd file) is code that is extracted from the materials in the book that is relevant to the exercises below. We use #| include: false to hide this code from the compiled output. (Your *answers* should not have this in general.) In this template, we have put the questions from the exercises in ** to make them bold (to distinguish the questions from your answers). {r} #| message: false #| include: false library(dplyr) library(ggplot2) library(farr) library(forcats) # For fct_indorder() library(tidyr) # For pivot_wider() library(lubridate) # For ceiling_date()  {r} #| include: false rets_nab_cba <- aus_banks %>% inner_join(aus_bank_rets, by = "gvkey") %>% filter(ticker %in% c("CBA", "NAB")) %>% select(ticker, datadate, ret)  {r} #| include: false rets_nab_cba_wide <- rets_nab_cba %>% pivot_wider(id_cols = datadate, names_from = ticker, values_from = ret) %>% drop_na()  {r} #| include: false var_alt <- function(x) { sum((x - mean(x))^2)/(length(x) - 1) }  {r} #| include: false cov_alt <- function(x, y) { sum((x - mean(x)) * (y - mean(y)))/(length(x) - 1) }  {r} #| include: false latest_mkt_cap <- aus_banks %>% inner_join(aus_bank_rets, by = "gvkey") %>% filter(datadate == max(datadate)) %>% select(ticker, co_name, mkt_cap) %>% arrange(desc(mkt_cap)) %>% mutate(ticker = fct_inorder(ticker))  # Exercises 1. **Create a function cor_alt(x, y) that uses the cov_alt and var_alt to calculate the correlation between x and y. Check that it gives the same value as the built-in function cor for the correlation between ret_nab and ret_cba from rets_nab_cba_wide.** {r} cor_alt <- function(x, y) { # This function is not complete! cov_alt(x, y) }  2. **If we remove the drop_na() line used in creating rets_nab_cba_wide, we see missing values for CBA. There are two reasons for these missing values. One reason is explained [here](https://en.wikipedia.org/wiki/Commonwealth_Bank), but the other reason is more subtle and relates to how values are presented in datadate. What is the first reason? (*Hint:* What happened to CBA in 1991?) What is the second reason? How might we use lubridate::ceiling_date(x, unit = "month") to address the second reason? Does this second issue have implications for other plots?** 3. **Adapt the code used above to calculate the *correlation* matrix for the returns of Australian banks to instead calculate the *covariance* matrix. What is the calculated value of the variance of the returns for NAB?** 4. **From the output above, what is the value for the variance of NAB's returns given by the cov function applied to rets_nab_cba_wide? Why does this value differ from that you calculated in the previous question?** 5. **What do the two-table verbs semi_join() and anti_join() do? In what way do they differ from the two-table verbs listed above? How could we replace filter(ticker %in% latest_mkt_cap\$ticker) (see above) with one of these two verbs?** 6. **In calculating ROE above, we used ib rather than a measure of "net income". According to [WRDS](https://wrds-www.wharton.upenn.edu/pages/support/support-articles/compustat/global/ni-net-income-variable/), "ni [net income] only applies to Compustat North America. Instead use: ni = ib + xi + do." Looking at the data in aus_bank_funds, does this advice seem correct? How would you check this claim? (*Hint:* You should probably focus on cases where both xi and do are non-missing and checking more recent years may be easier.)** 7. **Above we created a plot of market-to-book ratios. Another measure linking stock prices to fundamentals is the *price-to-earnings ratio* (also known as the *PE ratio*). Typically, PE ratios are calculated as** $$\textrm{PE} = \frac{\textrm{Stock price}}{\textrm{Earnings per share}}$$
**where** $$\textrm{Earnings per share} = \frac{\textrm{Net income}}{\textrm{Shares outstanding}}$$ **So we might write** $$\textrm{PE} = \frac{\textrm{Stock price} \times \textrm{Shares outstanding}}{\textrm{Net income}}$$ **But the numerator here is simply market capitalization, suggesting we could use the following:** $$\textrm{PE} = \frac{\textrm{Market capitalization}}{\textrm{Net income}}$$ **What critical assumption have we made in deriving the last equation? Is this likely to hold in practice?**
8. **Calculating the PE ratio using pe = mkt_cap/ib, create a plot of PE ratio for Australian banks over time like that created for market-to-book ratios above.** 9. **Suppose you wanted to produce the plots in the test (market capitalization; market-to-book ratios; histogram of ROE) using Excel starting from spreadsheet versions of the three data sets provided above? Which aspects of the task would be easier? Which would be more difficult? What benefits do you see in using R code as we did above?** 10. **Using the documentation from the farr package, describe the contents of the by_tag_year data frame (type help(by_tag_year) or ? by_tag_year after loading the farr package).** 11. **Using by_tag_year, create a plot that displays the total number of questions asked across all languages over time.** 12. **Produce a plot like the one above, but focused on questions related to R.** 13. **If we want to know the popularity of R relative to other languages, we're probably more interested in a percentage, instead of just the counts. Add a new variable that is the fraction of all questions asked in each year with a specific tag out to the dataset and plot this variable focused on questions related to R.** 14. **Two popular R packages we have used in this chapter---dplyr and ggplot2---also have Stack Overflow tags. Perform the same steps that you did for R above for these two tags to see whether they are growing as well.** 15. **Produce a plot that depicts the relative popularity of R, Python, SAS, Stata, and Matlab over time.** 16. **Which language among R, SAS, Stata, and MATLAB, has triggered the most questions in the history of Stack Overflow? (*Hint*: Use the dplyr verbs summarize() and group_by().)**