# Introduction to statistics for Geoscientists (with Python)
### Lecturer: Gerard Gorman
### Lecture 8: Chi-squared tests and some miscellanea
### URL: [http://ggorman.github.io/Introduction-to-stats-for-geoscientists/](http://ggorman.github.io/Introduction-to-stats-for-geoscientists/)

## The Chi-squared test (or $\chi^2$ test)

The chi-squared test is used for discrete (categorised) data rather than continuous data. For example:

* Foot length – continuous.
* Shoe size – discrete.

Discrete geological data might be:

* Fossil type (species A, species B, etc).
* Rock classification (sandstone, limestone, mudstone, etc).
* Fault type (normal, thrust, strike-slip, etc).

The Chi-squared test provides a way of assessing how likely it is that counts of discrete data fit some expected pattern.

## Chi-squared example

We have many trilobite fossils from one deposit:

* The fossils are moults.
* Each moult comprises cranidia, librigenae, and pygidia.
* These parts should occur in a ratio of 1:2:1.

Do our data depart from this ratio? If they do, we can infer a taphonomic bias, probably current sorting.

![trilobite](http://www.trilobites.info/cepthopyg.gif)

![cranidium](http://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Trilobite_cranidium-en.svg/270px-Trilobite_cranidium-en.svg.png)








## Chi-squared example

The chi-squared test requires an observed/expected table of this form:

| | Observation count | Expected (based on 1:2:1 ratio) |
|:---------|:------------------|:--------------------------------|
|Cranidia | 20 |17.25 |
|Librigena | 32 |34.5 |
|Pygidia | 17 |17.25 |
|Total | 69 |69 |
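
The expected column follows from sharing the total count $N = 69$ across the categories in proportion to the 1:2:1 ratio. As a worked check (the symbols $E_i$, $r_i$ and $N$ are notation introduced here for illustration):

$$E_i = N \, \frac{r_i}{\sum_j r_j}, \qquad E_{\mathrm{cranidia}} = 69 \times \frac{1}{4} = 17.25, \quad E_{\mathrm{librigena}} = 69 \times \frac{2}{4} = 34.5, \quad E_{\mathrm{pygidia}} = 69 \times \frac{1}{4} = 17.25$$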

SciPy provides a chi-squared test via the function [scipy.stats.chisquare](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html), which tests the null hypothesis that the categorical data has the given frequencies.

In [1]:
import numpy as np
from scipy import stats

Obs = np.array([20, 32, 17])
Exp = np.array([17.25, 34.5, 17.25])
s, p = stats.chisquare(Obs, Exp)
print("p-value = {}".format(p))

p-value = 0.732278624487


The p-value is large, so we cannot reject the null hypothesis: the sample is consistent with the expected frequencies.
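
To see where the p-value comes from, the statistic can be reproduced by hand as $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$ and compared against a chi-squared distribution with (number of categories − 1) degrees of freedom. This is a minimal sketch using the trilobite counts; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

observed = np.array([20, 32, 17])
expected = np.array([17.25, 34.5, 17.25])

# Chi-squared statistic: sum over categories of (O - E)^2 / E
chi2_manual = np.sum((observed - expected) ** 2 / expected)

# Degrees of freedom: number of categories minus one
dof = len(observed) - 1

# p-value: upper-tail probability of the chi-squared distribution
p_manual = stats.chi2.sf(chi2_manual, dof)

# Compare with the library routine
chi2_scipy, p_scipy = stats.chisquare(observed, expected)
print("manual: chi2 = {:.4f}, p = {:.4f}".format(chi2_manual, p_manual))
print("scipy : chi2 = {:.4f}, p = {:.4f}".format(chi2_scipy, p_scipy))
```

Both lines should report the same statistic and p-value.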

## Chi-squared assumptions

The chi-squared test is quite broadly applicable. There is no requirement for anything to be normally distributed, but:

* No expected count should be less than 1 (it does not matter what the observed values are).
* No more than one-fifth of the expected counts should be less than 5.
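
These two rules of thumb are easy to check in code. The helper below is a sketch only: `expected_counts_ok` is a hypothetical name introduced here, not part of SciPy.

```python
import numpy as np

def expected_counts_ok(expected):
    """Return True if the expected counts satisfy both rules of thumb.

    Hypothetical helper for illustration; not a SciPy function.
    """
    expected = np.asarray(expected, dtype=float)
    # Rule 1: no expected count below 1
    none_below_one = np.all(expected >= 1)
    # Rule 2: at most one-fifth of the expected counts below 5
    frac_below_five = np.mean(expected < 5)
    return bool(none_below_one and frac_below_five <= 0.2)

# Trilobite example: all expected counts are comfortably large
print(expected_counts_ok([17.25, 34.5, 17.25]))
# An expected count below 1 breaks the first rule
print(expected_counts_ok([0.5, 30.0, 38.5]))
```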

# Exercise 8.1: Chi-squared Test

## Determine whether marks classifications for a course are atypical

Analysis of 2000 overall course marks from ESESIS shows that the typical marks breakdown is as follows:

* Fail: 4.3%
* 3rd: 9.5%
* 2ii: 18.4%
* 2i: 38.4%
* 1st: 29.4%

Now consider the following distribution of results from two different groups of students:

|Grade | Students - group 1| Students - group 2|
|:------|:------------------|:------------------|
|Failed | 3 | 0 |
|3rd | 10 | 8 |
|2ii | 23 | 7 |
|2i | 30 |25 |
|1st | 20 |39 |

**Consider each group in turn - are their results atypical?**

## Tip 1: The chi-squared test is used to determine whether counts of discrete observations fit a predetermined pattern of expectations (see lecture notes).

[scipy.stats.chisquare(Obs,Exp)](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html) – carry out a Chi-squared test on arrays Obs (Observed values) and Exp (Expected values)

Returns a tuple of (s_statistic, p_value), where s_statistic is the value of chi-squared (you can usually ignore this) and p_value is the probability of seeing differences at least this large purely by chance if the observations really do follow the expected pattern. Chi-squared tests are non-directional: they only ever test the size of the differences between observed and expected, so there is no concept of 'direction' of difference.

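IMPORTANT: the function takes array-like inputs. Plain Python lists are accepted, but converting to NumPy arrays with `np.array` makes the arithmetic for the expected values much easier. Usage example (assuming `import numpy as np` and `from scipy import stats`, as in the lecture):

Obs = np.array([20, 32, 17])

Exp = np.array([17.25, 34.5, 17.25])

s, p = stats.chisquare(Obs, Exp)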

## Tip 2: State your hypothesis.

H0: Course has expected classification breakdown

H1: Course classification breakdown does not follow expected pattern

## Tip 3: To calculate Chi-squared you need a list of expected values.

Find the total of the observed values (you can use the built-in `sum` function to do this), multiply this total by each percentage value given above, and divide by 100.
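
The recipe above can be sketched as follows, using the trilobite data from the lecture (with the 1:2:1 ratio written as percentages) rather than the exercise data. The function name `expected_from_percentages` is a hypothetical helper introduced here for illustration.

```python
import numpy as np

def expected_from_percentages(observed, percentages):
    """Scale percentages so they add up to the total observed count.

    Hypothetical helper for illustration; not a SciPy function.
    """
    observed = np.asarray(observed, dtype=float)
    percentages = np.asarray(percentages, dtype=float)
    return observed.sum() * percentages / 100.0

obs = np.array([20, 32, 17])
pct = np.array([25.0, 50.0, 25.0])  # the 1:2:1 ratio expressed as percentages

exp_counts = expected_from_percentages(obs, pct)
print(exp_counts)
```

The expected counts always add up to the observed total, provided the percentages add up to 100.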

## Tip 4: Check the test is valid.

You need to check two things:

1. None of your expected values should be less than 1.
2. No more than one of them (i.e. no more than 1/5th of the five categories) should be less than 5.

In [3]:
# solution here

# Exercise 8.2

Every day, you visit the JCR, Library Cafe, College Cafe and all the other Taste Imperial outlets, and count how many Chicken and Bacon baguettes, Ham and Cheese baguettes, and Carrot and Hommous baguettes they have on sale. You record the numbers in a nice table:

|Day\Baguette | C&B | H&C | C&H |
|:-------------|:-----|:----|:----|
|Monday | 32 | 35 | 38 |
|Tuesday | 20 | 18 | 30 |
|Wednesday | 27 | 29 | 8 |
|Thursday | 16 | 19 | 10 |
|Friday | 22 | 27 | 20 |

You have procured all this information because you read somewhere that, supposedly, 20 of each type are being added by Taste Imperial each day and that approximately 20 of each are eaten each day. You realise the ideal distribution should be:

|Day\Baguette | C&B | H&C | C&H |
|:-------------|:-----|:----|:----|
|Monday | 20 | 20 | 20 |
|Tuesday | 20 | 20 | 20 |
|Wednesday | 20 | 20 | 20 |
|Thursday | 20 | 20 | 20 |
|Friday | 20 | 20 | 20 |

Perform a chi-squared test and see if reality matches the statistic that you read about.

(Note: All the above numbers have been invented and may not be anywhere close to the actual values)
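
One practical point when counts are laid out as a table: `scipy.stats.chisquare` has an `axis` argument that defaults to 0, so a 2-D array is tested column by column. Passing `axis=None` treats every cell as one category of a single data set. The sketch below uses a small made-up 2×2 table (the numbers are purely illustrative and are not the exercise data). Recent SciPy releases also appear to raise an error when the observed and expected totals differ, so it is worth checking the totals against your installed version.

```python
import numpy as np
from scipy import stats

# Purely illustrative 2x2 table of counts (not the exercise data)
observed = np.array([[18, 22],
                     [25, 15]])
# Expected counts with the same total (80) as the observed counts
expected = np.full(observed.shape, 20.0)

# axis=None pools every cell into one data set with 4 categories
stat_all, p_all = stats.chisquare(observed, expected, axis=None)

# Equivalent: flatten both tables to 1-D first
stat_flat, p_flat = stats.chisquare(observed.ravel(), expected.ravel())

print("chi2 = {:.2f}, p = {:.4f}".format(stat_all, p_all))
```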


In [None]:
# solution here