---
title: "Discrete Data Analysis: A Friendly Guide to Visualising Categorical Data for Machine Learning Practitioners, with Examples in R"
author: "Julian Hatwell"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
  html_document:
    toc: true
    number_sections: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(warning = FALSE
                      , message = FALSE
                      )

knitr::opts_template$set(
  fig.wide = list(fig.height = 4.5, fig.width = 8, fig.align='center')
  , fig.wideX = list(fig.height = 3, fig.width = 9, fig.align='center')
  , fig.relaxed = list(fig.height = 6, fig.width = 8, fig.align='center')
  , fig.tile = list(fig.height = 3, fig.width = 3, fig.align='center')
)
```

# Introduction

Welcome. If you're reading this, you probably attended a workshop or talk I've given on this topic. If you're here by accident, let me fill you in.

These notes are for you if:

* You work with data sets where many, most or all of the features are discrete (nominal, ordinal, frequency, counts, events).
* You want a quick start guide to develop some intuition in this area of statistics, without getting bogged down in notation, theory and proofs.
* You need a strategy for the Exploratory Data Analysis (EDA) phase of a machine learning project:
    * Assess a new dataset that contains several discrete features
    * Develop an intuition about the relationships between those features
* You want to demonstrate rigorous and robust, yet visually intuitive reporting of machine learning results in:
    * Multi-class problems involving ordinal data
    * Impact of important categorical features on classification results
    * Frequency data: events, counts, occurrences, wait-times
    
*Note, I hope this can be a living document which I can expand as time allows. Check back in a few months. The last date I updated should always be visible in the header.*

If you want to know more you can follow the links to the [vignette](https://cran.r-project.org/web/packages/vcdExtra/vignettes/vcd-tutorial.pdf), [video](https://www.youtube.com/watch?v=qfNsoc7Tf60) and [book](https://www.crcpress.com/Discrete-Data-Analysis-with-R-Visualization-and-Modeling-Techniques-for/Friendly-Meyer/p/book/9781498725835) that have been the most useful sources of information for me.

## R

The examples are all in R because it's simply the right tool for the job. I've kept things about as simple as I can make them. Even if you don't know any R you should get along provided you have enough tech savvy to download and install new software. Maybe you'll even learn some R by osmosis.  That's all I really want to say about R.
    
These are the libraries we'll be using.

```{r load_libs}
# I'm going to load all the R libraries here
# If you stop, quit R and come back later...
# ...you need to re-run this chunk.

# install these packages if you don't have them already
# install.packages("packageNameHere")

library(knitr)
library(lattice)
library(ggplot2)
library(vcd)
library(vcdExtra)
library(ca)
library(randomForest)
library(gmodels)
library(mvtnorm)
library(ROSE)
```

## Credits

The code examples are mostly based on material contained in the book [Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data](https://www.crcpress.com/Discrete-Data-Analysis-with-R-Visualization-and-Modeling-Techniques-for/Friendly-Meyer/p/book/9781498725835) by Michael Friendly and David Meyer. Michael Friendly is a pioneer in this field and has contributed to the development of modules and libraries for SAS and R.

You can also find a much shorter tutorial on these topics in the long form help for the "vcd" package for R: [vcd vignette](https://cran.r-project.org/web/packages/vcdExtra/vignettes/vcd-tutorial.pdf)

This [video](https://www.youtube.com/watch?v=qfNsoc7Tf60) is a much more in depth lecture, by Michael Friendly, on topics covered in the book.

## Intended Audience

Any researchers, analysts and knowledge professionals with some practical experience of Machine Learning (ML), possibly crossing over from another discipline. I assume you have an understanding of the ML project life-cycle, data preparation, exploratory analysis, model building and evaluation, and reporting results. I assume some basic statistical knowledge, such as most people have experienced in high school.

You'll get the most benefit if you already know some R. If you don't, maybe this will help you to learn some. The code examples are mostly just a few lines each.

## What We Won't Cover

Bear in mind, this is just an overview! We will only introduce formal statistics and tests where necessary, and will not develop any mathematical notation, theory or proofs. There is lots of material freely available prepared by real experts, rather than people like me. Head over to www.google.com now, if that's what you're looking for.

For now, I am not covering explanatory and predictive modeling. You will probably be already familiar with logistic regression for binary classification. There are related or equivalent techniques for other kinds of classification (ordinal, polytomous/multi-class and counts). Even though these topics fall squarely under both DDA and ML, it's a massive topic that requires a whole course/module to explain. I'll get around to it, but I don't have time right now.

## Who am I and Why am I Presenting This?

Working backwards:

* Research to PhD (in progress) Data Mining and Machine Learning, Birmingham City University
* MSc. (Distinction), Business Intelligence, Birmingham City University. Master's Dissertation: An Association Rules Based Method for Imputing Missing Likert Scale Data - 97%! (yes, I also wonder what lost me those last 3%)
* Over 15 years in enterprise systems, management information systems and BI, business process improvement and digital transformation in the for-profit education sector, UK and Singapore. This industry vertical includes:
    * Trans-national education – delivery of fully accredited degrees from prestige US, UK and Aus universities in South East Asia
    * Outsourcing of international foundation college for Russell group and 2nd tier universities
    * TEFL colleges and services
* BSc. (Hons) Microbiology
    * Yes, this last one seems a bit disconnected. If you know the British education system, you'll probably understand. It's still relevant here because there's a bit of stats involved. For example, [Most Probable Number Theory](https://en.wikipedia.org/wiki/Most_probable_number) to estimate the number of bacterial cells in a test tube. You certainly can't count them all!

In my career as a data professional the kinds of questions I used to work on were:

* Does the number of students who progress from undergrad to post-grad differ, depending on whether they receive an incentive plan?
* Do different interventions improve the retention of at-risk students?
* What combination of factors most affect student satisfaction (measured on a likert scale)
* How do student outcomes (Distinction, Merit, Pass, Fail) compare across various demographics?
* Can we identify groups in need of extra support?
* How can we help an applicant choose the right course?
    
Here are just some of the factors/features to be utilised in the research questions above:

* subject
* full-time / part-time
* age group
* gender
* nationality (well over 50 unique categories at a single institution)
* high school attended prior to entry
* subjects and grades prior to entry
* student satisfaction surverys (likert scales)
* marital status and dependents (a lot of mature and part-time students)
* modules passed or failed
* exams passed, resat, or re-moduled
    
The types of questions we did *not* need to answer were "what is the average height/weight of students across different groups?" So it was obvious that regular descriptive and quantitative stats (i.e. the normal distribution) were not the right tools for this work. 

With so many dimensions, tables, cross tabs and pivot tables were not that useful. Quantitative stats (t-tests, ANOVA, scatterplots and boxplots) just weren't the right tools. Most importantly, I needed visual tools to communicate findings with the senior management team. I couldn't expect those guys to squint over a page of tables and numbers with me, looking for clues.

After a certain amount of research, I came upon Michael Friendly's work on visualising categorical data and realised it was exactly the tool kit I needed to start making a big difference. I probaby owe Prof. Friendly a drink or two.

I face completely new tasks today in my Machine Learning research. However, I keep noticing that the old tools and techniques are still applicable. I wanted to share these ideas with you and I hope you also find them useful.

# Preliminaries

If you want to run the code chunks and see the results in the R Studio UI you need access to a machine with R + R Studio and the source code for [this file](https://raw.githubusercontent.com/julianhatwell/DDAR/master/DDAR_Display_Page.Rmd) (if you don't want to do a lot of copy/pasting). After you've downloaded it to your working directory, just ditch all my text move all the code chunks to a new .R file with:

```{r purlit, eval=FALSE}
# Don't run this if you've already got the code in a separate file
# knitr::purl("DDAR_Display_Page.Rmd")
```

## What is Discrete Data Analysis (DDA)?

Predictably, the term refers to analysis of discrete data! DDA is a set of tools and techniques that you would apply to achieve the same objectives as any quantitative statistical analysis but with these data types:

* Nominal
* Ordinal
* Counts / Frequencies

## The Basic Tools of DDA

### Plots and Charts

The most powerful set of tools we have is our eyes and brains. Humans are pretty damn impressive when it comes to pattern matching. The trick is to choose the right chart that reveals the right information for the task in hand. A chart is often acceptable as statistical evidence, provided it is enhanced with robust statistical information (confidence intervals, significance test results, etc). More importantly, a good chart really doesn't need much explaining.

The next three charts are examples of what **not** to do. We tend to reach for these types of charts when we see a mix of numeric and discrete data. However, you won't see again these again beyond this introduction because they are intended for quantitative statistics and are the wrong tools for DDA.

There are plenty of examples of the right tools in the case studies.

```{r not_DDA}
# set the random number generator same every time
set.seed(123)

# a discrete feature to label the groups
n <- 100
grp <- rep(c(1, 2), each = n)

# a random discrete feature
rnd <- sample(c(1, 2)
              , replace = TRUE
              , size = 200)

# some random multivariate normal data for grp1
numvars1 <- rmvnorm(n, c(0, 0)
                    , sigma = matrix(c(3,2,2,3)
                                     , ncol=2))

# some random multivariate normal data
# with different parameters for grp2
numvars2 <- rmvnorm(n, c(1, 1)
                    , sigma = matrix(c(1,0.5,0.5,4)
                                     , ncol=2))

# combine together as one data frame
numvars <- rbind(numvars1, numvars2)

# add a discrete feature based on a linear combination
# + noise to make a soft boundary
noise <- rnorm(2*n, mean = 0, sd = 0.5)
fam <- ifelse(rowSums(numvars) + noise > 3, 1, 2)

# take row sums then table-apply the sum by each group
grp_total <- tapply(rowSums(numvars), grp, sum)
rnd_total <- tapply(rowSums(numvars), rnd, sum)
fam_total <- tapply(rowSums(numvars), fam, sum)

par(mfrow=c(1,3))
# plot, using graphic options to separate the discrete classes
plot(numvars, col = fam, pch = rnd, cex = grp
     , main = "scatterplot with size, shape\nand colour options"
     , ylab = "y"
     , xlab = "x")

# take row sums then table-apply the sum by each group
grp_total <- tapply(rowSums(numvars), grp, sum)
rnd_total <- tapply(rowSums(numvars), rnd, sum)
fam_total <- tapply(rowSums(numvars), fam, sum)

# totals barplot
barplot(c(grp_total, rnd_total, fam_total)
        , xlab="grp rnd fam"
        , col="lightgray", border = 1:2
        , main="barplot of quantitative\ntotals by groups"
        )

# classic box plot
boxplot(rowSums(numvars)~rnd*grp, border=1:2
     , main = "boxplot distribution\nanalysis by groups"
     , xlab = "rnd1.grp1 rnd2.grp1 rnd1.grp2 rnd2.grp2"
     , ylab = "row sum value")

par(mfrow=c(1,1))
```

The plots above place all the focus on the numeric features, slicing it into groups by the various discrete features. We would like charts where the focus is on the discrete features; something that provides more information about the discrete features themselves.

```{r count_data}
# length will count the members
grp_count <- tapply(rowSums(numvars), grp, length)
rnd_count <- tapply(rowSums(numvars), rnd, length)
fam_count <- tapply(rowSums(numvars), fam, length)

# counts barplot
barplot(c(grp_count, rnd_count, fam_count)
        , col="lightgray", border = 1:2
        , main="barplot with border options"
        , sub="counts by groups")
```

By counting the group members, the numeric variable is out of the picture entirely. The only trace of it remains in the *fam* group, which was based on a linear combination of the simulated data. This plot reveals that there is an effect controling the ratio of the two categories of *fam*. In DDA, we want to find such effects, particularly in exploratory analysis. 

The *grp* group was fixed at exactly the number of group members, while *rnd* was drawn at random with even odds. It's very easy, using this visual reference, to link the count data to the information we already know.

I will provide much more interesting examples in the case studies.

### Tables

Specifically, contigency tables or crosstabs. The categorical features form the axes, and each cell contains the total number of individuals at the intersection of each category. DDA requires us to be able to manipulate tables and there are several functions for handling them. Here are some examples:

```{r tables}
# 3D table - Admit on rows (D1), Gender on columns (D2)
# and Department on levels (D3)

UCBAdmissions # Department has 6 levels

# In 2D - collapse over department
margin.table(UCBAdmissions, 2:1)

# pivot the department onto columns
aperm(UCBAdmissions, c(2, 3, 1))

# flatten to a crosstab in default order
structable(UCBAdmissions)

# flatten to a crosstab with pivot
structable(aperm(UCBAdmissions, c(2, 3, 1)))

# More D
structable(Survived~., data=Titanic)

# More D - another pivot
structable(Survived+Sex~., data=Titanic)
```

Tables like this have several disadvantages. Staring at numbers is not very informative, unless there are obvious large or small values, and things start to get tricky in more dimensions.

Adding row, column and table totals can be informative but only works well in 2D.

```{r crosstabs}
# An old fashioned crosstab - 2D
CrossTable(margin.table(UCBAdmissions, 2:1)
           , prop.chisq = FALSE
           , format = "SPSS")
```

### $\chi^2$ Test

A huge range of DDA research questions ultimately are answered with a $\chi^2$ test. It'll happen at some point under the hood of most of the functions used in this code. That's simply because tables are the main data structures for DDA and most tasks are driving towards comparing between expected values and actual, observed values.

This test instantly tells you if numbers in table cells are what you expect them to be. You might expect all the numbers to be proportional to the row and column totals, or you might have some very specific numbers you expect. Either way, a $\chi^2$ test reveals whether your observed numbers are near enough to your expected numbers that you can put any small differences down to some random sampling error.

This is good news, right? Nearly everyone who's done any stats has some experience done a $\chi^2$ test and can make sense of the results. A significant result in a $\chi^2$ test means it's very improbable that the data matches what we expected. For reasons that I'm not going into now, you can never really prove anything with absolute, unequivocal certainty with statistics. So, we always phrase such tests in terms of the the reverse of what we're looking for. Essentially, it's more robust if we can demonstrate that some opposing assumption is very likely to be wrong, because it makes it more probable that we're right. This opposite assumption that we state at the beginning of a test is called the null hypothesis, or simply the null. We like to give it the symbol $H_0$

Here is a null for when we want to show that a category of feature A is much more popular:

* $H_0 =$ There is an equal number of instances belonging to each category of feature A. Any observed differences are small enough to be considered a sampling error.

And here's one we would use to show that there is an interaction between features A and B. Certain categories in A and B tend to go together, other combinations don't get chosen very frequently.

* $H_0 =$ Membership of any category in feature A is independent of membership of any category in feature B. Frequency in each combination is proportional to the joint distribution $P(A) \times P(B)$ for every combination of categories in A and B. Any observed differences are small enough to be considered a sampling error.

*Note, we will touch on statistical distributions later, but not the $\chi^2$ distribution. This distribution comes up all the time in statitistical testing, but doesn't really model anything found in the real world that we actually want to measure.*

### Log Odds Ratios

This is another commonly used statistic in DDA. Most people with some Machine Learning experience are aware of the log odds from logistic regression. The odds is itself is the ratio $o = \frac{p}{1-p}$ for $p$, the proportion of yes and $1-p$, the proportion of no in any binary situation, such as a coin flip. The log odds is simply  $\log(\frac{p}{1-p})$.

The odds is converted back to a probability by $p = \frac{o}{1+o}$ which means we can connect the number of times something has happened back to the probability that it happens, and compare these probabilities across categories to see if they show any patterns.

An odds ratio is simply a ratio of odds from two binary systems $\frac{o_1}{o_2}$ and the log odds ratio is $\log(\frac{o_1}{o_2})$. There's a pattern here:

$$\log(\frac{a}{b})$$

Let's think about why we like to take the log of a fraction. First of all, I'm going to set two conditions: $a$ and $b$ are always positive, and $b$ can never be $0$. Now, think about what happens as 1. $a$ and $b$ are equal, 2. $a$ gets much bigger than $b$ and 3. $b$ gets much bigger than $a$.

```{r log ratios}
p <- c(0.05, .1, .25, .50, .75, .9, .95)
odds <- p / (1 - p)
logodds <- log(odds)
logits <- data.frame(p, odds, logodds)
logits

a <- c(1000, 1000, 1)
b <- c(1000, 1, 1000)

a/b
log(a/b)
```

Hopefully you can see that while fractions and ratios are centred on $1$ when $a = b$, they have this nasty habit of flying off towards $\infty$ as $a$ gets much bigger, while squashing closer and closer to $0$ as $b$ gets much bigger. Comparing counts or ratios of $a:b$ over the levels of third variable $c$, becomes very problematic because we have to handle two of the most difficult situations in computing - potentially infinite and infinitesimal numbers - in the same operation. That's really unhelpful.

Here are the top three reasons the log odds ratio is so useful in DDA:

* The $\log(\frac{a}{b})$ is centred on zero and completely linear and symmetrical. Swapping the positions of $a$ and $b$ in the fraction simply switches the sign of the result. What a beautiful thing that is! 
* Massive and tiny numbers are both transformed into a very limited domain: Anything outside the range $-3 < x < 3,\ x = \log(\frac{a}{b})$ reveals enormous relative differences between $a$ and $b$. Again, that's very intuitive.
* A log odds ratio $\phi = \log(\frac{o_a}{o_b})$ is an approximately normally distributed statistic with a mean of $\phi$ and a standard error  $\mathrm{SE}_{\phi} = \sqrt{\frac{1}{p_a} + \frac{1}{1 - p_a} + \frac{1}{p_b} + \frac{1}{1 - p_b}}$. This means that confidence intervals for of log odds ratios are also symmetrical, facilitating visually intuitive and statistically robust comparisons of the odds ratio of two binary features $a$ and $b$ between levels of a third feature $c$.

The log function is very special for lots of other reasons but, in DDA especially, it is very helpful to have this neat way to manage ratios.

That's all I want to say about the log ratio and the log odds ratio right now. We'll come back to it in later sections.

### Discrete Distributions

There are several distributions to help with the most common DDA research questions. You can recognise them because a random variable from any of these can only be a natural number $\{ 0,1,2,\dots \}$ (no negative numbers):

* Binomial - how many heads in $n$ coin tosses?
* Poisson - how many buses come every $n$ minutes?
* Geometric - how many minutes do I wait before a bus comes?
* Negative Binomial - how many minutes is the gap between each bus?

You probably have data matching one of these distributions if you have data that looks like this:

```{r dists_firstlook}
seed <- 125
set.seed(seed)
# the first number is the number of samples
# the other numbers are the parameters of the distribution
rbinom(10, size = 10, prob = 0.2) 
rpois(10, lambda = 5)
rgeom(10, prob = 0.5)
size <- 2
prob <- 0.2
count <- table(rnbinom(1000, 2, 0.4))
freq <- count/sum(count)
waittime <- as.integer(names(count))
xyplot(freq~waittime
       , xlab = "Wait Time (minutes)"
       , ylab = "% of total"
       , type = c("h", "p") # this will draw lollipops
       , pch = 16, lwd = 2
       , main = paste("Negative Binomial. Rate ="
                    , size
                    , ", Dispersion ="
                    , 1/prob)
       , sub = paste("Data from R's pseudo-random\n number generator with seed = ", seed))
```

The main reason we like to use distributions in data analysis *where possible*, is that so many clever people have spent the last two or three hundred years researching how they work! This means that all the most accurate, provable and supported tooling for statistics is based on distributions. What we try to do is match the data we have to a known distribution. If we find a good match, we can assume the mathematical properties of the distribution will apply to the real world data. This is what people mean by parametric tests; using all the tried and tested distributional properties to make assumptions about our real-world data.

However, distributions are not real world data. They are just mathematical constructions. They model many things in the real world but you need to be pretty strict about how and when to use them. You need to provide evidence that your data is a really good fit.

On a project using real-world, not simulated data, there simply might not be a distribution that fits and there are hundreds of non-parametric tools that make a model of the data directly from the data itself. That's what a lot of people think of as Machine Learning (though such definitions moot).

Anyway, the moral of the story: Use a distribution when it makes sense. That's often a good place to start with 1 dimensional data (just a bag of integers as above). If it doesn't, then move on and find another method to analyse your data.

## Common DDA Tasks

This is a non-exhaustive list of common statistical tasks and questions that require specific DDA techniques.

1. 1-way analysis and goodness of fit
    * $H_0 =$ My data fits a common, known distribution $\delta$, so I can use mathematical/analytical properties of $\delta$ for parametric tests.
    * Does my data fit a known distribution and if so, what are the parameters?
    * Does one sample come from the same distribution as another?
2. 2-way analysis
    * 2 $\times$ 2 tables are a special case where a standard $\chi^2$ test is not recommended (gold star if you know why). What tools should I use instead?
    * $\chi^2$ tests on $n \times m$ tables can tell me that an association between features is present/absent, but if $n$ or $m$ are larger than 3-4, how can I find and test for complex relationships between so many combinations of categories?
    * Some of my categories are ordinal - what is the right test?
    * Classification problems with ordinal data - how can I quantify the agreement/disagreement between True and Predicted classes when there is a relationship between classes?
3. n-way analysis:
    * Visualising multidimensional tables for exploratory analysis
    * Statistical hypothesis testing in n-way tables
    * Correspondence Analysis, a dimension reduction technique, like PCA and clustering for categorical features!
4. Explanatory and Predictive modeling:
    * Any of the above, measuring the magnitude of interactions
    * Any of the above, measuring the effect of covariates and co-factors

## Using R for DDA

We don't have time develop a fluency for these tasks but if this is something useful for you there are some "building block" skills that are worth spending time on:

1. The most useful data structures for DDA in R are n-dimensional arrays, n-way tables, and matrices. Only  sometimes data frames are used with a frequency column (for the count data).

1. It's good to become fluent with functions for manipulating table data - pivoting and collapsing dimensions, and converting between data structures to suit your analysis. Useful functions to get to know are table(), addmargins(), prop.table(), ftable(), xtabs(), xtable(), structable(), apply(), margin.table(), collapse.table(), aggregate(), aperm(), expand.dft(), subset(), using integer indexing to re-order categories and pivot dimensions. Most of the specialised objects have as.table() and as.data.frame() methods that generate the most intuitive shape for the output data structure. We had a little practice in the examples above and there will be more in the forthcoming case studies.

1. Library "vcd" comes with its own plotting system. It is based on the grids package but is different from R base graphics, ggplot2 and lattice! It's easy to get started, but a bit of effort is required to learn all the tweaks for making publication ready graphical output. However, interpreting these area-based plots is really intuitive and it's worth the effort to learn so that results can be communicated succinctly (a picture tells a thousand words).

# Workshop

## Task: Analysing n-way Tables with Area Based Plots

### What is this for?

Exploring relationships between categorical features. Mosaic plots are very powerful for visualising categorical data. Area and shading can be used in many ways to highlight important the important interactions. These visual techniques are intuitive up to four or five categorical dimensions. This allows you to identify very complex interactions.

*Note, the 'strucplot framework' that is the heart of the vcd package requires learning a new approach to structuring data (handling tables) and writing plotting routines (handling graphical parameters), compared to your previous experience with R. However, it's worth the effort for the intuitive views of complex relationships in the data.*

### When would I use this?

The exploratory data analysis phase of any project. Do this early on in the project once you've gathered and cleaned up your data enough to start sense-making. It doesn't make any sense to reach for more complex tools before you know what you've got.

Spending some time "getting to know" a new dataset and developing some intuition about important features and interactions will pay off by revealing a better strategy when it comes to modelling.

### Case Study: Exploring the relationships between hair colour, eye colour and gender.

The HairEyeColor dataset ships with base R. It is a survey of 592 statistics students at the University of Delaware in 1974. It is a 3-D array cross-tabulating the observations. The features and their levels are as follows:

```{r describe_haireye, echo=FALSE}
hr_desc <- data.frame(No=1:3
                      , Name=c("Hair", "Eye", "Sex")
                      , Levels=c("Black, Brown, Red, Blond", "Brown, Blue, Hazel, Green", "Male, Female"))
kable(hr_desc)
```

For now, we will ignore the sex of each individual by collapsing the 3-D table down to 2-D, and re-order the table from dark to light eye colour (hair is already in that order).

```{r haireye_prep}
# ignore gender for now
haireye <- margin.table(HairEyeColor, 1:2)

# re-arrange order of eye colour from dark to light
# it's already correct for hair
haireye <- as.table(haireye[, c("Brown", "Hazel", "Green", "Blue")])
haireye
```

Big numbers and small numbers stand out but the table still has 16 cells which is a lot to take in at once. It's not immediately obvious what the relationships are.

A naive approach might be to produce a bar plot of the counts by groups.

```{r not_intuitive}
# non-intuitive views
# this is what the Frequency form looks like
as.data.frame(haireye)

# With a grouped bar plot
# you're forced to favour one variable over the other.

# Does does hair colour depend on eye colour?
barplot(haireye, beside = TRUE, legend = TRUE)

# Or eye colour depend on hair colour?
# pivot the other way
barplot(aperm(haireye, 2:1), beside = TRUE, legend = TRUE)
```

It's hard to compare bars across groups, because your eye has to jump to find each member. A more intuitive view would preserve the table structure and not condition one feature on the other:

```{r tiles}
tile(haireye) # vcd package
```

Hopefully, you can see this simple concept of putting categorical features on the axes and plotting a shape whose area is proportional to the count in each cell. A pattern of some kind is self-evident. Let's formally report it with a $\chi^2$ test. The null in this case is just like our second example above.

```{r hr_chisq}
chisq.test(haireye)
```

With such a small p-value, we have very strong evidence to reject the null hypothesis that there is no relationship between hair colour and eye colour. We can now assume they are interacting, but what exactly is the pattern? The only thing the test tells us is there is a very significant difference between actual and expected counts.

Let's use a mosaic plot to visualise expected counts as if there were no interaction between the features. Here's what no interaction looks like:

```{r hr_expected}
# visualising expected counts for independent features
expected = independence_table(haireye)
round(expected, 1)

mosaic(expected
      , shade = TRUE
      , main="Expected frequencies"
      , labeling = labeling_values
      , value_type = "expected"
      , gp_text = gpar(fontface = 1))
```

Note that the areas are proportional to the counts. The grid lines between the tiles are dead straight because each cell is proportional to the marginal totals.

Here's what actuals looks like:

```{r hr_actuals}
# visualising actual counts for independent features
# A publication ready colour scheme
mosaic(haireye
      , gp = shading_hcl # shade = TRUE
      , main="Actual frequencies"
      , labeling = labeling_values
      , value_type = "observed"
      , gp_text = gpar(fontface = 1)
      , rot_labels = c(top = -20))

# The Friendly colour scheme suits visual analysis
# Bolder colours and carries more information in the outlines
mosaic(haireye
      , gp = shading_Friendly # shade = TRUE
      , main="Actual frequencies"
      , labeling = labeling_values
      , value_type = "observed"
      , gp_text = gpar(fontface = 1)
      , rot_labels = c(top = -20))
```

Here the grid lines between cells are no longer continuous because the cells are out of proportion with the marginal totals. Unexpectedly large cells show that the combination is over-represented and small cells the reverse.

The beauty of the mosaic chart is the use of shading to highlight the the cells that differed the most from expected counts. We refer to these differences as the residuals. This shading scheme uses bold fills for statistically significant residuals but also adds an outline colour to highlight the sign of even the non-significant residuals. This is visually very powerful. There is a clear "opposite corner" pattern.

Let's take this up to all three dimensions now.

```{r hec}
# showing all three dimensions
HEC <- HairEyeColor[, c("Brown", "Hazel", "Green", "Blue"), ]
mosaic(HEC
      , gp = shading_Friendly # shade = TRUE
      , main="Actual frequencies"
      , labeling = labeling_values
      , value_type = "observed"
      , gp_text = gpar(fontface = 1), rot_labels = c(right = -45)) 

# with a different pivot
mosaic(aperm(HEC, 3:1)
      , gp = shading_Friendly # shade = TRUE
      , main="Actual frequencies"
      , labeling = labeling_values
      , value_type = "observed"
      , gp_text = gpar(fontface = 1), rot_labels = c(right = -45)) 
```

We can still see the general opposite corner pattern, and now some additional detail regarding gender. If you decide to learn more about these techniques, you can start to make adjustments, such as telling the mosaic plot that you *expect* Hair and Eye to be linked by creating an explanatory model. That will "clean up" the mosaic and show you more clearly the patterns that remain.

```{r hec_joint_model}
# convert to frequency form
HEC_df <- as.data.frame(HEC)

# explanatory model, hair eye interaction
hec.glm <- glm(Freq~Hair*Eye+Sex, data=HEC_df, family = poisson)

mosaic(hec.glm
       , gp = shading_Friendly
       , formula = ~ Sex + Eye + Hair
       , residuals_type = "rstandard")
```

There is a whole methodology around loglinear modeling which allows you to make statistical hypothesis tests about complex n-way relationships between features. We'll leave that for another time.

This type of plot is still fairly intuitive at 4 dimensions. Take a look at the Titanic dataset. Who died and who survived? Did it depend on some combination of passenger class, gender and age?

```{r titanic_mosaic}
mosaic(aperm(Titanic, c(4, 2, 1, 3))
      , gp = shading_Friendly # shade = TRUE
      , main="Who Died and Who Survived the Titanic?"
      , labeling = labeling_values
      , value_type = "observed"
      , gp_text = gpar(fontface = 1), rot_labels = c(right = -45))
```

More dimensions can be added by other means. A cotabplot allows you to stratify over a further dimension. It looks just like facet_wrap() and facet_grid() in ggplot2, or a lattice/trellis plot.

In this example, with just default settings, two categories have been promoted up to the trellis. You could add further dimensions on the mosaic if you have more, and still make sense of the plots!

```{r titanic_cotab}
# This doesn't appear to render well on the HTML output
# note, this is all the defaults except the shading!
cotabplot(Titanic, gp = shading_Friendly)
```

### Talking points

* Recall that we've made sure the categories are ordered light to dark. There is a very pronounced "opposite corner" relationship. What do you think that reveals (in plain English) and does this fit with intuition?
* Can you see a difference between males and females? If so, can you explain it? *Note, the original study was a self-reporting survey. The participants categorised themselves.*
* Why do the grid lines remain aligned horizontally between hair categories on both actual frequencies mosaics?
* If you discovered important relationships in categorical features during your exploratory phase, how might this affect the next steps of your project?

## Task: Dimension Reduction and Clustering

### What is this for?

Most Machine Learning practitioners are familiar with dimension reduction. It can radically simplify and compress a dataset, making it easier to handle and train models on. Many dimension reduction techniques also double up as unsupervised learning methods i.e. clustering. This is useful as an end in itself, as well as being an additional tool for exploratory data analysis: multidimensional data is projected into 2-D which makes it intuitive to visualise.

If you already know something about Principle Components Analysis, this is similar, but not quite the same. Remember, in DDA, the "dimensions" we want to reduce or cluster are not only the features, but also the categories within each feature. PCA identifies new dimensions that capture the most variance. The equivalent concept in Correspondence Analysis is the "inertia" of cells in the table with the largest and smallest numbers. The only way to really describe what's happening is to see some examples.

Correspondence Analysis is an advanced topic. In particular, the way you pivot the multidimensional array controls the analysis. This requires some understanding of the data manipulation and modeling techniques. For today, we will just focus on interpreting the plots.

### When would I use this?

You can use Correspondence Analysis during the exploratory data analysis phase of any project. Do this early on in the project once you've gathered and cleaned up your data enough to start sense-making.

This is a good technique for of feature engineering. If several features are well clustered you can replace them with the real valued dimensions that this method returns as output. This may convert several categorical dimensions into just a couple of numeric ones that might better suit the next stage of your project. The following two situations are immediately applicable:

* You want to reduce the number of features in your dataset overall
* You want to implement a model that requires only numeric, or real-valued functions. For example, k-NN requires a distance metric that works on all the features, while your dataset has some categorical features. Of course you could convert all your categories to factors, integers or binary encode them (aka create dummy variables). However, Correspondence Analysis actually reduces the number of features while retaining information about the way the original features interact.

You could use it as a stand alone model. This is an unsupervised clustering method for categorical features. Use it where you don't have a specific class label, but just want to find out the associations between all the features.

### Case Study: Analysis of Audience Viewing Data

The audience viewing data from Neilsen Media Research for the week starting November 6, 1995

It is a 3-D array cross-tabulating the viewing figures for three networks, between 8-11pm, Monday to Friday. The features and their levels are as follows:

```{r describe_tv, echo=FALSE}
tv_desc <- data.frame(No=1:3
                      , Name=c("Day", "Time", "Network")
                      , Levels=c("Monday, Tuesday, Wednesday, Thursday, Friday", "8, 9, 10", "ABC, CBS, NBC"))
kable(tv_desc)
```

```{r tv_dataprep}
data("TV", package = "vcdExtra")
# The original data is collected in 15 minutes slices.

# Convert 3-D array to a frequency data frame
# This has a row for each cell of the array
# and a new column for the cell value
TV.df <- as.data.frame.table(TV)

# Convert it into hourly slices
levels(TV.df$Time) <- rep(c("8", "9", "10"), c(4,4,3))

# Convert frequency data back to 3-D array, now with just 3 time levels
TV3 <- xtabs(Freq~Day+Time+Network, TV.df)
```

```{r tv_3wayca}
# create a multiple correspondence analysis
TV3.mca <- mjca(TV3)

# the plot function uses all base R plot stuff
# but needs a bit of manipulation
cols <- c("blue", "black", "red")

# "blank plot"
res <- plot(TV3.mca, labels=0, pch='.', cex.lab=1.2)

# combine Dims, factor names and levels
coords <- data.frame(res$cols, TV3.mca$factors)

# save this for later
m_coords <- coords[ order(coords[,"factor"], -coords[,"Dim1"]), c("Dim1", "Dim2")]

# hard-coded from known number of levels
# day, time, network
nlev <- c(5,3,3)

# everything needs to be in semantic order
coords <- coords[ order(coords[,"factor"], coords[,"level"]), ]
# quick fix for ordering e.g. day of week, not alphabetical
coords$order <- c(5, 1, 4, 2, 3, 6, 7, 8, 11, 9, 10)
coords <- coords[order(coords[, "order"]), ]

# place the points with separate plot chars and colours
points(coords[,1:2], pch=rep(16:18, nlev), col=rep(cols, nlev), cex=1.2)

# place the text
pos <- c(1,4,3)
text(coords[,1:2], labels=coords$level, col=rep(cols, nlev), pos=rep(pos,nlev), cex=1.1, xpd=TRUE)

# join things in sequence
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Day", lty=1, lwd=1, col="blue")
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Time",  lty=1, lwd=1, col="red")

# add segement from the origin to channels
nw <- subset(coords, factor=="Network")
segments(0, 0, nw[,"Dim1"], nw[, "Dim2"], col = "black", lwd = 0.5, lty = 3)

# add a legend
legend("topright", legend=c("Day", "Network", "Time"),
       title="Factor", title.col="black",
       col=cols, text.col=cols, pch=16:18,
       bg="gray95")
```

This plot reveals an incredibly strong effect relating NBC and Thursday nights. This is apparently most of the inertia captured by Dimension 1. What's going on there?

```{r TV_day_network}
margin.table(TV3, 1)
margin.table(TV3, c(1, 3))
```

Viewing figures are at their highest on Thursdays compared to any night of the week and for NBC they dramatically outnumber the other two channels, but only on this night. NBC has viewing figures that are comparable to the others most other nights. Everyone appears to be glued to NBC all evening. A little digging and domain knowledge (showing my age, here) means we can give this effect a name: This is the "Friends, Seinfeld and ER" effect! These were three of the most popular shows of that [era](https://en.wikipedia.org/wiki/1995%E2%80%9396_United_States_network_television_schedule). 

The 9pm slot clusters closer to the other four days than 8pm and 10pm. What's happening there?

```{r TV_time_network}
margin.table(TV3, 2)
t(TV3[4,,]) # Thursday
```

It turns out that 9pm is the most viewed time slot throughout the week, independent of day or network. Yet, notice how the NBC audience numbers dip just a little for that time slot on Thursday, even though they still have far greater numbers than either of the other two channels. This dip in numbers runs counter to the "Friends, Seinfeld and ER" effect which has completely dominated the analysis, and so it has also been captured on Dimension 1.

Dimension 2 is forced to capture differences between ABC and CBS schedules on the other nights as well as possible, but there is undoubtedly more detail to untangle.

Now we have identified this difficulty, we can tackle the problem. We're going to use the simple (2-Way) Correspondence Analysis to do this. Using the pivoting function *structable()*, we effectively create a new feature that combines time and day. Counter-intuitively, this allows the Correspondence Analysis *more* flexibility along the combined dimension. The resulting plot has a point for each time slot over the week and teases apart the effect of people channel hopping from the "Friends, Seinfeld and ER" effect. It turns out to be very easy to do (see the code for yourself).

```{r tv_2wayca}
# Flatten to 2-D by stacking Time onto Day
# Note: The data shaping choice here controls the specifics of the analysis.
TV3s <- as.matrix(structable(Network~Time+Day, TV3))

# Create the Correspondence Analysis objects
TV3s.ca <- ca(TV3s)

# Generate the plot
res.ca <- plot(TV3s.ca)
# add some segments from the origin to make things clearer
segments(0, 0, res.ca$cols[,1], res.ca$cols[,2], col = "red", lwd = 1)
segments(0, 0, res.ca$rows[,1], res.ca$rows[,2], col = "blue", lwd = 0.5, lty = 3)
```

As you can see, the small channel hop from NBC on Thursday has moved onto Dimension 2, along with the other variations in within-evening viewing patterns. On the previous plot, there is a very slight affinity 8pm-CBS, and 9pm-ABC. Did you notice? Look again. We can see clearly on this new plot, something of great interest at 8pm, Monday, but not later on the same evening and not again at the same time later in the week. We also can see several 9pm slots clustered around ABC.

An interesting feature of this new analysis, is that distance from the origin is proportional to the size of positive residual. What does this mean? Well for example, from 10pm the viewing figures dwindle, independent of day and channel as people go to bed or go out for a late drink. Thursday night, 10pm, NBC has massive audience figures relative to what's normally expected for that time slot the rest of the week. This inertia moves the point furthest from the origin out of all the nightly time slots. Monday at 8pm on CBS is notable for the same reason. Why not Monday 9pm and ABC? There are other nights when ABC is popular at 9pm, making this combination less unusual.

```{r TV_monday}
t(TV3[1,,]) # Monday
TV3[,2,1] # Every day, 9pm, ABC
```

If the next stage of your project requires real valued features for modeling (such as a distance or similarity based model), you can access the coordinates of each point via the plot object and replace them in your data set. This is more powerful than converting to an integer series or dummy variables, because so much information is retained. For example, it's very tempting to convert [*Monday, Tuesday, Wednesday, Thursday, Friday*] to [*1,2,3,4,5*]. However, we have seen in this case that this retains none of the information about the feature interactions. Likewise, [*8,9,10*] seems obvious to leave as numbers, but you lose information about the audience figures being higher at 9pm than any other time. Also, what's the natural order for [*ABC,CBS,NBC*]? Alphabetical? Think again!

```{r tv_ca_dims}
m_coords
res.ca$rows[order(res.ca$rows[, 1]), ]
```

We'll take a look at one more of these, from the 4-D Titanic data set again.

```{r Titanic_mca}
# one more mca from a 4-D dataset
# this is simpler than it looks
# all the magic happens here
titanic.mca <- mjca(Titanic)

# saving the plot object supplies the coordinate positions
res <- plot(titanic.mca, labels=0, pch='.', cex.lab=1.2)

# extract factor names and levels
coords <- data.frame(res$cols, titanic.mca$factors)

# everything else is handling base R plotting stuff
cols <- c("blue", "red", "brown", "black")
nlev <- c(4,2,2,2)
points(coords[,1:2], pch=rep(16:19, nlev), col=rep(cols, nlev), cex=1.2)
pos <- c(3,1,1,3)
text(coords[,1:2], labels=coords$level, col=rep(cols, nlev), pos=rep(pos,nlev), cex=1.1, xpd=TRUE)
coords <- coords[ order(coords[,"factor"], coords[,"Dim1"]), ]
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Class", lty=1, lwd=2, col="blue")
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Sex",  lty=1, lwd=2, col="red")
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Age",  lty=1, lwd=2, col="brown")
lines(Dim2 ~ Dim1, data=coords, subset=factor=="Survived",  lty=1, lwd=2, col="black")

legend("topleft", legend=c("Class", "Sex", "Age", "Survived"),
       title="Factor", title.col="black",
       col=cols, text.col=cols, pch=16:19,
       bg="gray95", cex=1.2)
```

Note that the two category (binary) features all pass through the origin.

```{r Titanic.ca}
# Simple ca works on two dimensions
# pivot table of Titanic
Titanic_piv <- structable(~Class+Sex+Survived+Age, Titanic)
Titanic_piv
# all the work happens here
Titanic.ca <- ca(as.matrix(Titanic_piv))

# plot and save the object to get the coords for line segments
res <- plot(Titanic.ca)
segments(0, 0, res$cols[,1], res$cols[,2], col = "red", lwd = 1)
segments(0, 0, res$rows[,1], res$rows[,2], col = "blue", lwd = 0.5, lty = 3)
```

As you can see Correspondence Analysis is an extremely powerful dimension reduction and clustering tool for categorical data.

### Talking Points

* What can you infer from the Titanic Correspondence Analysis Plots?
* Compare the two types of Correspondence Analysis (simple and multiple/joint)
* How might correspondence analysis change your modelling strategy?

## Task: Unpacking a Confusion Matrix by Important Features

### What is this for?

Use this method to deepen your understanding of how well your model is performing.

This task can be guided by two things. Firstly, if you've done a good exploratory analysis, you will have developed an understanding of which features could be important. Secondly, several well known ML models can output feature importance information in their own right, saving you all the hard work.

However, these feature importance measures tend to be crude. The feature is important, but why? All feature importance really tells you is that the model would be less accurate if you excluded that feature. It doesn't tell anything about how that feature is helping or hindering.

Before moving on, let's develop some intuition about this and show how the vcd package can help.

We will do some ML and train a random forest model to predict whether families in Slovenia should receive a state sponsored nursery place. The target variable is *decision*

```{r nursery_data}
set.seed(54321)
nursery <- read.csv(gzfile("nursery.csv.gz"))
summary(nursery)
```
 
All the features are categorical. Random forest ought to perform very well on this without any effort.

```{r nurs_randfor}
# train test split
idx <- sample(c(TRUE, FALSE)
              , size = nrow(nursery)
              , prob = c(0.7, 0.3)
              , replace = TRUE)

nurs_train <- nursery[idx, ]
nurs_test <- nursery[!idx, ]

# train rf accepting all the defaults
nurs_rf <- randomForest(decision~., data=nurs_train)

# get the predictions on new data
nurs_preds <- predict(nurs_rf, newdata = nurs_test)

# confusion matrix
confmat <- with(nurs_test, table(nurs_preds, decision))
confmat
```

Indeed, the model performs well and needed no effort to tune. Ideally, I would like visualise the confusion matrix directly, but it turns out to be a bit of a fussy task in R. I've looked online for any other way to do it, and I can only find some customised ggplot examples. 
Let's compare the different plotting options from R:

```{r plot_confmat}
# base
# there are no words to describe how horrible this is
image(confmat)

# lattice
# wrong diagonal aspect, horrible colours
levelplot(confmat)
# same
levelplot(t(confmat))

# ggplot
# too much fuss
# default colours not subtle enough to show error
# also the wrong diagonal
# First, convert to frequency form
confmat_freq <- as.data.frame(confmat)
# Next, tell ggplot what to do
ggplot(aes(x=nurs_preds
                 , y=decision
                 , fill=Freq)
             , data=confmat_freq) +
  geom_tile()
```

I'm not sure if Michael Friendly and the programmers of vcd meant for it to be used this way, but strucplot options are ideal for confusion matrices.

```{r strucplot_confmat}
# strucplot sieve
# the shading is a bit too dense on the diagonal
# there are lots of options
# it might be possible to customise
sieve(confmat, labeling = labeling_values)

# strucplot tile - I like this one best
tile(confmat, labeling = labeling_values)
```

There is a little bit of confusion among the classes *priority* and *spec_prior*. The model also struggles with the minority *very_recom* class, preferring to label these as *priority*.

One of the most useful features of random forest models is the ability to report variable importance. This measures, for each feature, how much worse the model would do if that feature wasn't available at training time. There is a custom plot function for this.

```{r nurs_varimp}
varImpPlot(nurs_rf)
```

We can see from this plot that the health variable seems to hold a majority of the predictive power for this model. Almost all the information is in this and the next three most important features.

What do we do now we know that? We investigate further! There is always a next question. Now we've become expert in handling 3-D tables, let's stratify (slice) the confusion matrix by this health variable and see what's going on.

```{r nurs_firstvar}
# unfortunately, there is no cotabplot option for tile
# this has to be done manually to make any sense

# from the summary, health can take one of three values
# not_recom, priority, recommended

cm_3d <- with(nurs_test
              , table(nurs_preds, decision, health))

re <- cm_3d[,,"recommended"]
pr <- cm_3d[,,"priority"]
nr <- cm_3d[,,"not_recom"]

tile(re, labeling = labeling_values)
tile(pr, labeling = labeling_values)
tile(nr, labeling = labeling_values)
```

This last plot is very striking. One level of the health feature has all the information required to completely classify one of the class labels. It's not surprising that health is by far the most important variable! However, it turns out to be very uninteresting from a ML point of view. A very simple program could do almost as well as this random forest.

A situation like this is quite unusual in the real world, and truthfully, should be identified during exploratory analysis. The sensible thing to do is probably to filter out all the instances in this health category and build a new model. However, it serves as a good example to show that confusion matrix results are not as informative as they first appear.

In binary classification, this could never happen. There would be no ML to do if one feature was a perfect representation of the class labels. Furthermore, problems are often organised so that the two classes are balanced and this should remain the case in the confusion matrix of a well performing model. How could this slicing technique help?

### When would I use this?

After model training, when you are evaluating your classification model performance with a confusion matrix. If you have a reason to believe that one or more categorical features contributed to good/bad perforemance, you need to investigate further to improve model performance or correct for any bias.

You want to use a tree based method but your situation is sensitive to unbalanced loss. For example, a medical or financial model where either false positives or false negatives are especially costly. You need a detailed understanding of features and categories where the model is more or less confident so you can balance the risk post hoc.

### Case Study: Investigating High False Negative Rate in a Credit Rating data Using Random Forest Feature Importance

We will now do a random forest binary classification. The german data set is a credit rating data set, where individuals are given either a *good* or *bad* rating. Incorrect *good* predictions cost the business much more, because of the risk of loan defaults. Incorrect *bad* results only result in a small reduction of revenue from loan interest payments in the long term.

Let's refer to correct identification of a *bad* result as a true positive outcome. When there is a high proportion of *bad* instances that are incorrectly classified as good, this gives rise to a high false negative rate and is very undesirable.

We could crudely force the model to be more cautious over all, whereby it will make more false positive results (incorrect *bad* ratings). However, it would be better if we could develop a more targetted strategy to improve the false negative rate.

```{r german_randfor}
german <- read.csv(gzfile("german.csv.gz"))
summary(german)

set.seed(54321)
# train test split
idx <- sample(c(TRUE, FALSE)
              , size = nrow(german)
              , prob = c(0.7, 0.3)
              , replace = TRUE)

german_train <- german[idx, ]
german_test <- german[!idx, ]

# make the training data balanced in case over fits good
# originally 300 bad, 700 good
german_train <- ovun.sample(rating~.
                      , data=german_train
                      , method = "over"
                      , p = 0.5)$data

# restore the factor levels to correct order
german_train$rating <- relevel(german_train$rating, "bad") 

# train rf accepting all the defaults
german_rf <- randomForest(rating~., data=german_train)

# get the predictions on new data
german_preds <- predict(german_rf, newdata = german_test)

# confusion matrix
confmat <- with(german_test, table(german_preds, rating))
confmat
```

The confusion matrix looks ok, but how does it really do, in light of our high cost of false negatives?

```{r main_cm_stats}
# accuracy
sum(diag(confmat))/sum(confmat)

# False Negative Rate FN / (TP + FN)
confmat[2, 1]/sum(confmat[, 1])

# False Ommision Rate FN / (TN + FN)
confmat[2, 1]/sum(confmat[2, ])
```

So it's not a great model, but let's assume it's better than what the company was doing before and move on. These numbers are useful but we want an intuitive visual representation for this, so what works best?

A rose plot is a surprisingly intuitive choice for viewing a binary confusion matrix, which is a $2 \times 2$ table. Rose plots are a twist on the pie chart. Pie charts divide a circle into segments whose internal angle $\theta$ is proportional to the category count. A rose plot does the reverse. The segments all have angle $\theta = \frac{2 \pi}{N}$, for $N$ segments. For a binary matrix with $4$ options $\theta = \frac{\pi}{2} = 90^o$. Then, the radius length  varies in proportion with the count of each cell in the table. This has a useful property that $\mathit{count} \propto \sqrt{\mathit{area}}$, which means the rose plot will gracefully handle numbers that differ by one or two orders of magnitude.

Most rose plot options available in R are basically bar charts that are transformed into polar coordinates. Even for a $2 \times 2$ matrix, this would be fiddly to code because it requires converting the matrix to a frequency data frame and then forcing the plot to arrange the four cell values in a way that preserves the confusion matrix shape.

The vcd package contains a specialised fourfold plot whose purpose is to compare several $2 \times 2$ tables over the levels of a third feature. It is intended for a different purpose than what we want to do, but the behaviour can be easily changed with a parameter to the plot function call that  makes it perfectly suited for our needs.

```{r german_fourf}
# fourfold plot preserves the table structure it was given
# however, default balances the areas to compare odds ratios no matter the individual count values
# that's not what we want
fourfold(confmat)

# std = "all.max" suppresses balancing the areas
# behaves like a rose plot
fourfold(confmat, std = "all.max")
```

The rose plot makes it visually clear that the model is great a classifying true negatives, but is predicting too many negatives over all and misclassifying more true prositives than it correctly identifies. Such a high FNR is not acceptable. Why is the model not doing well?

If a categorical feature $c$ has a high importance, it would suggest that the model has encoded how the ratio of *good* and *bad* ratings is different over the different levels of $c$ and can make use of that information in predicting.

Let's use feature importance to break down the results. This data set had a mix of numeric and categorical features.

```{r german_varimp}
varImpPlot(german_rf)
```

First thing to note is that there is no elbow point where we could safely make the model simpler by just picking the top performing features.

Secondly, and very fortunately for us, several categorical features are considered important, including the top one. Three of them are in the top six. Of course, it doesn't matter if categorical variables don't have high importance. Any numeric variable could be binned/discretised, and then have this technique applied.

As we want to customise the fourfold plot a little, we can make a new function to save repeating the code each time.

```{r my_rose}
# customise the fourfold plot
my_rose <- function(cm) {
  fourfold(cm
         , std = "all.max" # mimic a 4 way rose plot
         , conf_level = 0 # suppress confint tests, not useful here
         , color = c("#FF0000", "#000080") # need to force colours due to suppressing confint test
         )
}

# while we're at it, let's make a function to get the numeric statistics that we want from the conf mat
my_confmat_stats <- function(cm) {
  cf_stats <- list()
  for (i in 1:(dim(cm)[3])) {
    
    category <- dimnames(cm)[[3]][i]
    acc <- sum(diag(cm[ , , i]))/sum(cm[ , , i])
    
    # False Negative Rate FN / (TP + FN)
    fnr <- cm[2, 1, i]/sum(cm[ , 1, i])
    
    # False Ommision Rate FN / (TN + FN)
    fomr <- cm[2, 1, i]/sum(cm[2, , i])
    
    # collect
    cf_stats[[category]] <-
      c(acc=acc, fnr=fnr, fomr=fomr)
  }
  
  return(t(as.data.frame(cf_stats)))
}
```

We'll now make rose plots of the confusion matrix, stratified by the top categorical features. Try to relate the numeric stats to the patterns in the plots.

```{r rose_chk}
# stratify the conf mat by chk
confmat_c <- with(german_test
                  , table(german_preds
                          , rating
                          , chk))

# view the stats then plot
my_confmat_stats(confmat_c)
my_rose(confmat_c)
```

Here we can see a pattern. When an instance has *chk* $\in \{ \mathrm{A13}, \mathrm{A14} \}$, the model almost always classifies as *good*. One instance in each case was incorrectly classified as *bad*. Instances having *chk* $\in \{ \mathrm{A11}, \mathrm{A12} \}$ have a rather high FNR at $\approx 0.5$

We now proceed with the next two most important categorical features, but leave any further diagnostics as talking points.

```{r rose_others}
confmat_p <- with(german_test
                  , table(german_preds
                          , rating
                          , pps))

my_confmat_stats(confmat_p)
my_rose(confmat_p)

confmat_h <- with(german_test
                  , table(german_preds
                          , rating
                          , crhis))

my_confmat_stats(confmat_h)
my_rose(confmat_h)

confmat_e <- with(german_test
                  , table(german_preds
                          , rating
                          , emp))

my_confmat_stats(confmat_e)
my_rose(confmat_e)
```

The *dur* feature is integer valued, with 22 unique levels. This technique scales usefully even up to this number of strata! While the exact details would take some time to analyse fully, there are obvious visible patterns.

```{r my_rose_dur}
confmat_d <- with(german_test
                  , table(german_preds
                          , rating
                          , dur))
my_rose(confmat_d)

```

This technique can be scaled to more dimensions, with some sacrifices, using log odds ratios. It is a little less intuitive at first, but can be very useful once you have a clear understanding of that the log odds ratio is telling you.

Recall from the introduction how this statistic was defined. The formula can be rearranged such that

$$\phi = \frac{\mathrm{TP} \times \mathrm{TN}}{\mathrm{FP} \times \mathrm{FN}}$$

The log odds ratio, therefore is larger as the model becomes more accurate. A model that does no better than a random guess has a log odds ratio that is not significantly different from zero. *Note, this approach is not robust to very unbalanced classes.*

A the log odds ratio of a stratified confidence matrix can easily be plotted to see which categories of the stratifying factor provide better accuracy. This will
scale up to two additional dimensions, using position, colour and plot character. Such plots are be challenging to interpret and only works well if the features have two or three levels each at the most. However, under those circumstances, interesting interactions may be revealed.

*Note, all the following plots are suppressing the confidence interval. The reason for this is that the data set is rather small and slicing along multiple dimensions results in low counts. Large standard errors can dominate the plot.*

```{r lodds_ratio_plots}
plot(loddsratio(confmat_c), confidence = FALSE)
plot(loddsratio(confmat_e), confidence = FALSE)

# higher dimensional
confmat_ce <- with(german_test
                   , table(german_preds
                           , rating
                           , chk
                           , emp))

confmat_ce <- with(german_test
                   , table(german_preds
                           , rating
                           , chk
                           , emp))
plot(loddsratio(confmat_ce), confidence = FALSE)

# back to one stratum, but many levels
plot(loddsratio(confmat_d), confidence = FALSE)
```

### Talking Points

* How would you diagnose the patterns in the further rose plots above?
* Can you determine any interesting interactions between *chk* and *emp* from the log odds ratio plots?
* How could you use information from the stratified confusion matrices to improve the model?
* How could you use this information to improve the model?
* Can you think of other ways to use this information?

## Task: Analysing a Confusion Matrix for Ordinal Classification

### What is this for?

In Machine Learning, the standard way to assess the quality of a supervised classification model is with a confusion matrix. This is a two way table comparing True and Predicted class labels. If you are trying to classify ordered labels, the relationship between the labels matters. Standard accuracy measures penalise near disagreements as if they were full errors. This leads to accuracy estimates that are far too conservative; You might throw out a fairly good model, when there is a good chance you can improve it.

Let's take a step back and recap confusion matrices; when classes are well balanced, it's easy to measure accuracy over a confidence matrix as $\frac{\sum(\textit{diag})}{\sum(\textit{total})}$. When they're out of balance you can have a situation such as this:

The ratio of class 1 : class 2 is 9:1. Predict every instance as class 1 and you will get 90% accuracy. This not a useful result!

So we use Cohen's $\kappa$ statistic to weight according to the occurrence of each class in the data. $\kappa = 1$ means perfect agreement while $\kappa = 0$ means no better than chance agreement. The above example would score $\kappa = 0$. 

That's all well and good, but what class labels are related to each other by a natural order such as [*least, middle, most*] then you have an additional complication. You want your model to predict as well as possible, but to make an off-by-one error isn't as bad as a full opposite error. Also, the classification distribution might be biased, giving different skew and kurtosis (flattened or polarised) by the predictive model.

If you can detect this kind of bias, you can correct it! This is where agreement plots can help. They are based on statistical techniques for comparing two raters, such as two doctors, two film critics, two wine critics etc, to validate ratings and analyse these kinds of bias.

### When would I use this?

After training an ordinal classification model, when you want to evaluate its performance.

### Case Study: Comparing Inter-Rater Agreement for Severity of Multiple Sclerosis.

In the following example we compare two sets of diagnoses for the same patients suspected of suffering from Multiple Sclerosis. The diagnoses were made by two neurologists, one in Winnipeg and the  other in New Orleans:

1. Certain
1. Probable
1. Possible
1. Doubtful

```{r MSP_2D}
# combine results from two cities (same raters are involved in all cases)
MSP <- margin.table(MSPatients, 1:2)
```

On this dataset, we can't say which is the true and which is predicted because they're both independent raters, but for now let's assume that Winnipeg Neurologist is the true class and New Orleans Neurologist is the predicted class. The highest accuracy we could expect to get with constant model that always makes the same rating is the proportion of the majority true class. This would give a score of  $\kappa = 0$:

```{r MSP_winntrue}
margin.table(MSP, 2)
props <- prop.table(margin.table(MSP, 2))
props

# The baseline accuracy is
props[which(props == max(props))]
```

Without looking at the confusion matrix, let's take a naive look at the accuracy:

```{r MSP_accuracy}
# basic accuracy
sum(diag(MSP))/sum(MSP)
```

On the face of it, this looks like a terrible result! Is New Orleans Neurologist just predicting the most frequent Winnipeg class?

```{r MSP_confmat}
MSP
```

There are definitely some problems with the inter-rater agreement but it doesn't look like random guessing. Why is accuracy so low? Does class imbalance play a part? What is the $\kappa$ score?

```{r MSP_kappa}
# cohen's kappa - the function in the vcd package provides weights and confidence intervals for this statistic
confint(Kappa(MSP, weights = "Fleiss-Cohen"))
```

Unweighted $\kappa$, the standard measure, also shows that this is not a very good result. 

However, Fleiss-Cohen Weighted $\kappa$ favours near disagreements and taking this into account gives a much better result. Off-by-n errors appear to be common, so these raters are closer than it first appears.

An agreement plot is the best way to visualise a situation like this because it shows how off-by-n errors and other bias could be affecting the outcome.

```{r MSP_agreement, results='hold'}
op <- par(mar = c(4, 3, 4, 1) + .1)
B <- agreementplot(MSP
                   , main = "MS Patient Ratings"
                   , xlab_rot = -20
)
```

```{r, echo=FALSE}
# ignore this
par(op)
```

This plot needs some explaining. The rectangle outlines show the maximum possible agreement, given the marginal totals. The black rectangles indicate where there is full agreement between raters. The shaded area indicates the off by $n$ disagreement. *Note, if there were more categories, there would be more grades of shading*.

Ideally, we want the black area to fill the outline and it's not very close in this chart. However, the shaded area is coming much closer to filling the available space. 

```{r MSP_bangdi}
unlist(B)[1 : 2]
```

The Bangdiwala statistic is calculated based on the ratio of these shaded rectangles to the maximal outline. The large interval between the weighted and unweighted statistics shows that near disagreements are very common and if taken into account, they indicate a much better alignment than was indicated by a naive accuracy measure and even $\kappa$. 

There is another feature of this plot, which is the relationship of the rectangles to the diagonal line. This indicates how well aligned the raters are in using each category. In this chart, we can see how the Winnipeg Neurologist rates cases as "Certain" or "Probable" much more frequently than New Orleans Neurologist. Neither $\kappa$ nor Bangdiwala statistics can detect this. There are statistical tests available but this chart shows the problem directly.

With a little intervention, guided by this information, these two raters could become much better aligned.

### Talking Points

* What other statistics, tools and techniques can you use to analyse a confusion matrix?
* Have you worked with, or can you think of an ordinal classification problem?
* How could you use the Bangdiwala statistic *during* training to improve alignment?

## Task: 1-way analysis (goodness of fit) for monitoring the frequency of events.

Count, or frequency data has many varied applications from event monitoring to text analysis. 1-way analysis uses distributions to compare different samples, and monitor for changes in event frequency.

For the first task we will look at two common distributions for count data and compare two samples.

The Poisson distribution has just one parameter $\lambda$ for the rate, and both the mean and variance are equal to this parameter.

```{r dpois}
set.seed(12345)

KL <- expand.grid(k = 0 : 20, lambda = c(1, 5, 10, 20))
pois_df <- data.frame(KL, prob = dpois(KL$k, KL$lambda))
pois_df$lambda = factor(pois_df$lambda)

xyplot(prob ~ k | lambda, data = pois_df,
       type = c("h", "p"), pch = 16, lwd = 2
       , cex = 1.25, layout = c(4, 1)
       , main = expression(paste("Poisson Distribution by ", lambda))
       , xlab = list("Number of events (k)", cex = 1.25)
       , ylab = list("Probability", cex = 1.25))
```

The negative binomial distribution has two parameters, rate (size) and dispersion ($\frac{1}{\mathrm{prob}}$). The mean is equal to the rate, but the variance is always greater than the rate. It can appear as a right skewed Poisson distribution and behaves as if a Poisson distribution is being sampled over a period of time while the rate is non-constant.

```{r dnbin}
XN <- expand.grid(k = 0 : 20, n = c(1, 5, 10, 20), p = c(0.4, 0.2))
nbin_df <- data.frame(XN, prob = dnbinom(XN$k, XN$n, XN$p))
nbin_df$n <- factor(nbin_df$n)
nbin_df$p <- factor(nbin_df$p)

xyplot(prob ~ k | n + p, data = subset(nbin_df, p == 0.4)
       , main = "Neg. binom by size and prob"
       , xlab = list("Number of failures (k)", cex = 1.25)
       , ylab = list("Probability",  cex = 1.25)
       , type = c("h", "p"), pch = 16, lwd = 2
       , strip = strip.custom(strip.names = TRUE)
)
xyplot(prob ~ k | n + p, data = subset(nbin_df, p == 0.2)
       , main = "Neg. binom by size and prob"
       , xlab = list("Number of failures (k)", cex = 1.25)
       , ylab = list("Probability",  cex = 1.25)
       , type = c("h", "p"), pch = 16, lwd = 2
       , strip = strip.custom(strip.names = TRUE)
)

```

However, Poisson and negative binomial data can look very similar. Identifying the best match and estimating the parameters is essential for accurate event monitoring and reporting results.

### Case Study in Text Analysis: Marker word frequency

The following technique has been used to determine if documents are written by same or different authors. Distributions of marker words are fit to text that is known to come from one author. 

Any previously unseen text can then be tested against these known distributions to determine if they are written by the same author.

In this simple case study, we will use the Federalist dataset to analyse how many times the word 'may' appear per 200 word block. This is a marker word, that could help identify an author by how often they use it. Most studies of this kind would compare results across many such tests with different words.

Notes:

* The rate is occurrences/200 words. 
* The expected count is the rate parameter $\times$ 200
* Plotting the square root transformation of the frequency provides more detail on the small numbers in the right tail of these charts. Log transformation would also work but the rootogram visualisations use square root transformation by default. So, this is also applied manually where necessary for comparison.

```{r federalist}
data("Federalist", package = "vcd")
Federalist
sum(Federalist) # number of blocks
sum(expand.dft(as.data.frame(Federalist))$nMay) # number of occurences
mean(expand.dft(as.data.frame(Federalist))$nMay)
var(expand.dft(as.data.frame(Federalist))$nMay)
```

The descriptive stats showed a slightly higher variance but we don't know if this is significant. What's the right test to use here?

A very naive approach is to take the mean and visually compare a Poisson distribution with this value for $\lambda$ against the actual data.

```{r pois_naive_expected}
barplot(sqrt(200 * dpois(0:6, mean(expand.dft(as.data.frame(Federalist))$nMay)
))
        , main = "Number of text blocks by Number of Occurrences\nnaive expected values"
        , ylab = "sqrt(Naive Expected)"
        , xlab = "Occurences of 'may'"
        , col = "lightblue")
```

The actual data looks a bit like this distribution. Could this be a good fit? How can we easily compare them?

```{r barplot_actual}
barplot(sqrt(Federalist)
        , main = "Number of text blocks by Number of Occurrences\nactual values"
        , xlab = "Occurences of 'may'"
        , ylab = "Sqrt(Actual)"
        , col = "lightgreen"
)
```

Use the goodfit function in the vcd package to estimate the parameters and compare the fit of different distributions. It provides hanging rootograms that are intuitive views of discrete goodness of fit testing.

First, a Poisson fit using maximum likelihood - exactly what we've just done, but with bells on it:

```{r fed_goodfit_pois_chisq}
Fed_fit0 <- goodfit(Federalist, type = "poisson")
# This will show the rate parameter, estimated by Maximum Likelihood
# This estimate is the same as the naive mean. This is to be expected when fitting a poisson.
unlist(Fed_fit0$par)

# This will show a Chi-Square Goodness of fit test against the expected values of a poisson distribution with this parameter
summary(Fed_fit0)
```

The $\chi^2$ test is very strong evidence to reject the null hypothesis. This sample is not from a Poisson distribution with rate = sample mean. However, we don't know much about what caused this result. However, the rootogram plot makes it very clear what the problem is by showing how the variable diverges from the expectations. 

```{r fed_goodfit_pois_plot}
plot(Fed_fit0, type = "hanging", shade = TRUE)
```

By hanging each bar top down from the expected distribution, we just need to look at how the bottoms of the bars miss the $y = 0$ reference line. This is visually much more intuitive than comparing the tops of the bars to a curve. The contribution of each bar to the $\chi^2$ statistic is also made clear by the shading.

We can try fitting a Poisson using a different estimation method, minimising the $\chi^2$ errors instead:

```{r fed_goodfit_pois_mchisq}
Fed_fit0 <- goodfit(Federalist, type = "poisson", method = "MinChisq")
# This has found a different value for the rate parameter.
unlist(Fed_fit0$par)

# This will show a Chi-Square Goodness of fit test against the expected values of a poisson distribution with this parameter
summary(Fed_fit0)
```

The $\chi^2$ test is also very strong evidence against the null.

```{r fed_goodfit_pois_mplot}
plot(Fed_fit0, type = "hanging", shade = TRUE)
```

The rootogram shows that the residuals have just been shuffled around so they balance better by sign. Clearly Poisson is not the right choice.

Next, fit a negative binomial fit:

```{r fed_goodfit_nbinom}
Fed_fit1 <- goodfit(Federalist, type = "nbinomial")
# This will show the rate and dispersion parameters, estimated by Maximum Likelihood
# prob is the rate. Here it's very close to the poisson mean
# size is the dispersion
unlist(Fed_fit1$par)
summary(Fed_fit1)

plot(Fed_fit1, type = "hanging", shade = TRUE)
```

We have estimated some parameters and the $\chi^2$ goodness of fit test result is not significant. We accept that null hypothesis that this sample is from a negative binomial distribution having the estimated parameters. The plot speaks for itself now that the bars drop close to the $y = 0$ reference line, with only insignificant residuals.

We will now assess a different body of text.

```{r nonFederalist}
# Some dummy data
nonFederalist <- structure(c(135L, 71L, 36L, 17L, 8L, 4L, 1L), .Dim = 7L, .Dimnames = structure(list(
    c("0", "1", "2", "3", "4", "5", "6")), .Names = "nMay"), class = "table")
nonFederalist
sum(nonFederalist) # number of blocks
sum(expand.dft(as.data.frame(nonFederalist))$nMay) # number of occurences
mean(expand.dft(as.data.frame(nonFederalist))$nMay)
var(expand.dft(as.data.frame(nonFederalist))$nMay)
barplot(sqrt(nonFederalist)
        , main = "Number of text blocks by Number of Occurrences\nunseen text"
        , xlab = "Occurences of 'may'"
        , ylab = "Sqrt(Actual)"
        , col = "lightgreen"
)
```

The descriptive statistics are a bit larger than Federalist, but is this significant? We can test it properly using the parameters from the goodfit object we just created:

```{r fit_nonfed_chisq}
# we pass in our fitted parameter list this time
Fed_fit3 <- goodfit(nonFederalist, type = "nbinomial", par = Fed_fit1$par)
summary(Fed_fit3)
```

The $\chi^2$ result is significant, and strong evidence that this new text is not from the same distribution. However, on it's own, the test can't provide detailed information.

```{r fit_nonfed_plot}
plot(Fed_fit3, type = "hanging", shade = TRUE)
```

The rootogram hangs the frequency bars off the expected distribution, which this time is fixed with parameters we estimated on the original text. We can now see from the pattern of residuals that the word 'may' seems to appear more frequently in this text. Specifically, the tail, with 3-5 occurrences per 200 words is fatter, while blocks where this word is absent are fewer.

### Talking points

* Does this test imply that the second text was not written by the same author?
* Text analysis has come a long way since these statistical techniques were developed. What are some newer, more sophisticated methods?
* Can you describe, in plain English, what is the implication of finding a Poisson distributed word frequency? Why is the negative binomial a more natural distribution for word frequency? Hints:
    * Frequency is the number of times the word appeared per block of 200 words
    * Review the notes about these two distributions at the top of this task

# Summary

* We have demonstrated some classical statistical techniques that are not usually included in a basic stats curriculum.
* We have discussed how these techniques can be applied to tasks that are typical in many practical Machine Learning projects.
    * Text/Event Analysis
    * Comparing True and Predicted Classes for Ordinal data
    * Exploratory Data Analysis
    * Dimension Reduction and Clustering
* We have shown that simple R functions and visual analyses are extremely powerful and you don't have to dive deep into stats to get the benefits.
* Nearly all the resources you need to take the next steps yourself are available in the "vcd" and "vcdExtra" package for R, R~/vignette("vcd")
* If you want to go much, much deeper with these topics, [Michael Friendly's book](https://www.crcpress.com/Discrete-Data-Analysis-with-R-Visualization-and-Modeling-Techniques-for/Friendly-Meyer/p/book/9781498725835) pulls together decades of research in this field and is a one-stop reference for this branch of statistics.

In our rush to use the latest developments and technologies in Machine Learning and Deep Learning, it's easy forget that none of it would be possible without pioneering statistical research of previous decades. Most of this work is still relevant to everyday problems in Machine Learning practice, and it's always worth a review of the foundations before reaching for the newest tools.

Above all, never under-estimate the power of visual analytics. A well thought out plot can save you paragraphs of explanation.