# Definition of Statistical Measures – Central Tendency and Spread
### Dr. Tirthajyoti Sarkar, Fremont, CA 94536
---
This notebook discusses fundamental concepts of descriptive statistics: measures of central tendency (mean, median, mode) and measures of dispersion, or spread (variance and standard deviation).

We show how one can compute such descriptive statistics using basic Python code (without using any library) as well as using `NumPy` functions.

### Central tendency
A measure of central tendency is a single value that describes a set of data by identifying the central position within it. Such measures are also classed as summary statistics:

* **Mean**: Mean is the sum of all values divided by the total number of values.

$$ \mu = \frac{\sum{n_i}}{N} \\ \text{where } n_i \text{ is the } i\text{-th observation and } N \text{ is the total number of observations}$$

* **Median**: The median is the middle value, the one that splits the dataset in half. To find it, order your data from smallest to largest and pick the data point that has an equal number of values above and below it. With an even number of values, the median is the average of the two middle values.
* **Mode**: The mode is the value that occurs the most frequently in your dataset. On a bar chart, the mode is the highest bar.

Generally, the mean is a better measure to use for symmetric data and median is a better measure for data with a skewed (left or right heavy) distribution. For categorical data, you have to use the mode.
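
As a quick illustration of this guidance, the sketch below uses Python's standard-library `statistics` module on two small made-up datasets (the values are assumptions chosen for illustration, not data from this notebook). A single large value pulls the mean away, while the median and mode stay where they were.

```python
import statistics

# Made-up illustrative data: identical except for the last value
symmetric = [2, 3, 4, 4, 5, 6]
skewed = [2, 3, 4, 4, 5, 60]   # one large outlier

for name, data in (("Symmetric", symmetric), ("Skewed", skewed)):
    print(name,
          "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "mode:", statistics.mode(data))

# For categorical data, only the mode is meaningful
colors = ["red", "blue", "red", "green"]
print("Most common color:", statistics.mode(colors))
```

The two datasets share the same median and mode, but the mean of the skewed one is more than three times larger.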

### Spread
The spread of the data measures how much the values in the dataset are likely to differ from their mean. If all the values are close together, the spread is low; if some or all of the values differ by a large amount from the mean (and from each other), the spread is large.

* **Variance**: This is the most common measure of spread. Variance is the average of the squares of the deviations from the mean. Squaring the deviations ensures that negative and positive deviations do not cancel each other out.

$$V = \frac{\sum{(n_i-\mu)^2}}{N}$$

* **Standard Deviation**: Because variance is produced by squaring the distance from the mean, its unit does not match that of the original data. Taking the positive square root of the variance restores the original unit; the result is the standard deviation.

$$\sigma = \sqrt{\frac{\sum{(n_i-\mu)^2}}{N}}$$

> **NOTE**: When we later build regression models, we will revisit these definitions in the context of statistical estimation. There, the sample variance is given by a slightly different formula: the denominator changes from $N$ to $N-1$ (known as Bessel's correction).

$$V = \frac{\sum{(n_i-\mu)^2}}{N-1}$$
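
To make the role of the denominator concrete, here is a minimal sketch on a small made-up dataset (the values are an assumption for illustration). NumPy's `var()` and `std()` accept a `ddof` ("delta degrees of freedom") argument and divide by `N - ddof`, so the default `ddof=0` reproduces the formulas for $V$ and $\sigma$ above.

```python
import numpy as np

# Made-up illustrative data with mean 5
data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

pop_var = data.var()            # ddof=0 by default: divide by N
pop_std = data.std()            # positive square root of pop_var
sample_var = data.var(ddof=1)   # divide by N - 1

print("Population variance:", pop_var)
print("Population std. dev:", pop_std)
print("Variance with ddof=1:", sample_var)
```

Here the squared deviations sum to 32, so the population variance is 32/8 and the `ddof=1` variance is 32/7.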

## Let's measure central tendency of an array of numbers

### Somewhat naive way to do it
We can simply write a `for` loop, add up the numbers, and divide by the length of the array.

In [1]:
array = [3,4,4,7,5,6,5.5,8,5,6.5,9,7.5,6]

In [2]:
total = 0  # named 'total' so that the built-in sum() is not shadowed
for num in array:
    total += num
mean = total/len(array)
print("Arithmetic Mean: ",mean)

Arithmetic Mean: 5.884615384615385


In [3]:
from time import time

In [4]:
t1 = time()
for _ in range(100000):
    total = 0
    for num in array:
        total += num
    mean = total/len(array)
t2 = time()

print("Mean: {}\nAverage time taken for computing the mean using for loop: {} seconds ".format(mean,(t2-t1)/100000))

Mean: 5.884615384615385
Average time taken for computing the mean using for loop: 1.1469221115112304e-06 seconds 


### Using NumPy with `ndarray.mean()` method

**What is NumPy?** NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation and much more.

https://docs.scipy.org/doc/numpy-1.13.0/user/whatisnumpy.html

In [5]:
import numpy as np
np_array = np.array(array)
print("Mean: ",np_array.mean())

Mean: 5.884615384615385


In [6]:
t1 = time()
np_array = np.array(array)
for _ in range(100000):
    mean = np_array.mean()
t2 = time()

print("Mean: {}\nAverage time taken for computing the mean using NumPy: {} seconds ".format(mean,(t2-t1)/100000))

Mean: 5.884615384615385
Average time taken for computing the mean using NumPy: 3.880062103271484e-06 seconds 
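
The hand-written `time()` loops above can also be expressed with the standard-library `timeit` module, which handles the repetition for us. The sketch below is one possible way to do that; the helper name `loop_mean` is hypothetical, and the measured times will differ from machine to machine.

```python
from timeit import timeit
import numpy as np

array = [3,4,4,7,5,6,5.5,8,5,6.5,9,7.5,6]
np_array = np.array(array)

def loop_mean(values):
    """Mean computed with an explicit for loop (hypothetical helper)."""
    total = 0
    for num in values:
        total += num
    return total/len(values)

n_runs = 10000
# timeit returns the total time for n_runs calls, so divide to get the average
t_loop = timeit(lambda: loop_mean(array), number=n_runs)/n_runs
t_numpy = timeit(lambda: np_array.mean(), number=n_runs)/n_runs

print("Average time using for loop: {} seconds".format(t_loop))
print("Average time using NumPy: {} seconds".format(t_numpy))
```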


### So, for a small array, the `NumPy` method does not offer a performance boost (here it is even slower than the loop). But what happens when the array is large?

In [7]:
from random import randint
lst = []
for _ in range(1000000):
    lst.append(randint(1,100))

In [8]:
len(lst)

1000000

In [9]:
t1 = time()
for _ in range(100):
    total = 0
    for num in lst:
        total += num
    mean = total/len(lst)
t2 = time()

print("Mean: {}\nAverage time taken for computing the mean using for loop: {} seconds ".format(mean,(t2-t1)/100))

Mean: 50.539717
Average time taken for computing the mean using for loop: 0.06872276782989502 seconds 


In [10]:
t1 = time()
np_lst = np.array(lst)  # note: this list-to-array conversion is included in the timed region
for _ in range(100):
    mean = np_lst.mean()
t2 = time()

print("Mean: {}\nAverage time taken for computing the mean using NumPy: {} seconds ".format(mean,(t2-t1)/100))

Mean: 50.539717
Average time taken for computing the mean using NumPy: 0.001326603889465332 seconds 
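
The NumPy timing above includes the one-time conversion of the Python list into an array. The following sketch separates the two costs; it assumes a smaller list than the one used above so that it runs quickly, and the timings it prints depend on the machine.

```python
from time import perf_counter
from random import randint
import numpy as np

# Smaller than the list above, to keep the sketch quick
lst = [randint(1, 100) for _ in range(100000)]

t0 = perf_counter()
np_lst = np.array(lst)    # one-time conversion from list to ndarray
t1 = perf_counter()
mean = np_lst.mean()      # the mean computation itself
t2 = perf_counter()

print("Conversion: {} seconds".format(t1-t0))
print("Mean only:  {} seconds".format(t2-t1))
```

If the data can be created as a NumPy array in the first place, the conversion cost disappears entirely.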


### A utility function to generate random arrays

In [11]:
def random_array(num_elements,lower=1,upper=100):
    """
    Returns a list of num_elements random integers between lower and upper (both inclusive)
    """
    from random import randint
    lst = []
    for _ in range(num_elements):
        lst.append(randint(lower,upper))
    return lst

In [12]:
random_array(5)

[99, 68, 65, 99, 87]

In [13]:
random_array(10,-20,20)

[-6, -19, -17, 9, 15, -9, -4, 15, 14, -5]
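
The same kind of data can be generated directly as a NumPy array, which avoids building a Python list first. This is a sketch only: the function name `random_np_array` and its `seed` parameter are hypothetical additions, not part of the notebook.

```python
import numpy as np

def random_np_array(num_elements, lower=1, upper=100, seed=None):
    """Hypothetical NumPy counterpart of random_array (both bounds inclusive)."""
    rng = np.random.default_rng(seed)
    # endpoint=True makes 'upper' inclusive, matching random.randint
    return rng.integers(lower, upper, size=num_elements, endpoint=True)

arr = random_np_array(10, -20, 20, seed=0)
print(arr)
```

Passing a fixed `seed` makes the output reproducible, which is convenient when comparing results across runs.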

### Computing the median using both a naive method and the NumPy function `np.median()`

In [14]:
array_2 = random_array(15,10,30)

In [15]:
array_2

[25, 13, 11, 19, 13, 24, 27, 18, 30, 14, 30, 20, 18, 19, 23]

In [16]:
# Using the built-in Python 'sorted' function
array_sorted = sorted(array_2)

In [17]:
array_sorted

[11, 13, 13, 14, 18, 18, 19, 19, 20, 23, 24, 25, 27, 30, 30]

In [18]:
def median(array):
    """
    Computes median of a given numeric array
    """
    num_elements = len(array)
    array_sorted = sorted(array)
    mid = num_elements//2
    if num_elements%2==1:
        # Odd length: the single middle element
        return array_sorted[mid]
    else:
        # Even length: average of the two middle elements
        return (array_sorted[mid-1]+array_sorted[mid])/2.0

In [19]:
median(array_2)

19

In [20]:
np.median(np.array(array_2))

19.0

In [21]:
array_3 = random_array(16,100,200)

In [22]:
print(array_3)

[183, 199, 136, 101, 196, 107, 153, 173, 122, 157, 117, 118, 125, 161, 171, 169]


In [23]:
median(array_3)

155.0

In [24]:
np.median(np.array(array_3))

155.0

**NOTE**: Unlike `mean()`, a NumPy array does not have a `median()` method. We have to use `np.median()` and pass the array as the argument.
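
The mode, the third measure of central tendency defined at the start, can be computed in a similar spirit. The sketch below shows one possible approach with `collections.Counter` and one with `np.unique`, using the values of `array_2` printed above; because several values tie for the highest count, both versions return all of them.

```python
from collections import Counter
import numpy as np

array_2 = [25, 13, 11, 19, 13, 24, 27, 18, 30, 14, 30, 20, 18, 19, 23]

# Pure Python: count occurrences, then keep every value with the top count
counts = Counter(array_2)
top = max(counts.values())
modes = sorted(value for value, count in counts.items() if count == top)
print("Modes (pure Python):", modes)

# NumPy: np.unique returns the sorted distinct values and their counts
values, freq = np.unique(np.array(array_2), return_counts=True)
np_modes = values[freq == freq.max()]
print("Modes (NumPy):", np_modes)
```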

## Variance and standard deviation
* `ndarray.var()`
* `ndarray.std()`

### We will still practice one naive way to compute variance

In [25]:
def mean(array):
    """
    Computes mean
    """
    length = len(array)
    total = 0
    for i in range(length):
        total += array[i]
    return total/length

In [26]:
def variance(array):
    """
    Computes variance (population formula: divides by the number of elements)
    """
    length = len(array)
    avg = mean(array)
    sumsq = 0
    for i in range(length):
        sumsq += (array[i]-avg)**2
    return sumsq/length

In [27]:
def std_dev(array):
    """
    Computes std. deviation
    """
    from math import sqrt
    return sqrt(variance(array))

In [28]:
array_4 = random_array(100,1,100)

In [29]:
print(array_4)

[12, 14, 9, 89, 31, 38, 60, 45, 18, 48, 61, 21, 80, 47, 91, 83, 57, 92, 85, 60, 43, 61, 76, 71, 100, 18, 35, 77, 27, 18, 95, 15, 71, 50, 92, 78, 64, 58, 49, 5, 9, 55, 19, 20, 36, 27, 62, 81, 42, 64, 95, 89, 40, 66, 75, 44, 54, 57, 41, 39, 34, 87, 33, 64, 61, 84, 51, 6, 1, 69, 5, 14, 54, 42, 94, 24, 34, 78, 56, 98, 35, 40, 11, 90, 7, 16, 7, 60, 32, 16, 64, 68, 85, 48, 91, 38, 34, 9, 95, 1]


In [30]:
variance(array_4)

790.0474999999996

In [31]:
std_dev(array_4)

28.10778361948874

In [32]:
np.var(np.array(array_4))

790.0474999999999

In [33]:
np.std(np.array(array_4))

28.107783619488746
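
The naive and NumPy results above agree except in the last few digits (790.0474999999996 versus 790.0474999999999). Differences of this size typically come from floating-point rounding and the order in which the terms are summed, so it is safer to compare such results with a tolerance than with `==`. A minimal sketch, using the small array from the start of the notebook:

```python
from math import isclose, sqrt
import numpy as np

data = [3,4,4,7,5,6,5.5,8,5,6.5,9,7.5,6]
n = len(data)

# Naive mean and population variance, following the formula for V
total = 0
for x in data:
    total += x
avg = total/n

sumsq = 0
for x in data:
    sumsq += (x-avg)**2
naive_var = sumsq/n

np_var = np.var(np.array(data))
np_std = np.std(np.array(data))

print("Naive:", naive_var, " NumPy:", np_var)
print("Variance agrees within tolerance:", isclose(naive_var, np_var, rel_tol=1e-9))
print("Std. dev agrees within tolerance:", isclose(sqrt(naive_var), np_std, rel_tol=1e-9))
```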

## What if there are `NaN` values in the array?
* `nanmean()`
* `nanmedian()`
* `nanstd()`
* `nanvar()`

In [34]:
array = random_array(20,1,50)

In [35]:
print(array)

[33, 1, 7, 34, 47, 34, 2, 24, 27, 9, 23, 45, 26, 46, 1, 7, 2, 47, 49, 19]


In [36]:
array[2]=np.nan
array[6]=np.nan

In [37]:
print(array)

[33, 1, nan, 34, 47, 34, nan, 24, 27, 9, 23, 45, 26, 46, 1, 7, 2, 47, 49, 19]


In [38]:
array = np.array(array)

In [39]:
print("Mean:",array.mean())
print("Var:",array.var())

Mean: nan
Var: nan


### Using special functions which ignore `NaN`
Notice that they are functions of the NumPy module (`np`), not methods of an individual array.

In [40]:
print("Mean ignoring NaN:",np.nanmean(array))
print("Var ignoring NaN:",np.nanvar(array))
print("Std. dev ignoring NaN:",np.nanstd(array))
print("Median ignoring NaN:",np.nanmedian(array))

Mean ignoring NaN: 26.333333333333332
Var ignoring NaN: 271.44444444444446
Std. dev ignoring NaN: 16.47557114167653
Median ignoring NaN: 26.5
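
An equivalent way to see what these functions do is to drop the `NaN` entries with a boolean mask and then apply the ordinary methods. The sketch below rebuilds the array printed above and checks that both routes give the same numbers.

```python
import numpy as np

array = np.array([33, 1, np.nan, 34, 47, 34, np.nan, 24, 27, 9,
                  23, 45, 26, 46, 1, 7, 2, 47, 49, 19])

# np.isnan marks the NaN entries; ~ inverts the mask to keep the valid ones
clean = array[~np.isnan(array)]

print("Number of valid values:", clean.size)
print("Mean of valid values:", clean.mean())
print("Matches np.nanmean:", np.isclose(clean.mean(), np.nanmean(array)))
print("Matches np.nanvar:", np.isclose(clean.var(), np.nanvar(array)))
```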


## Other descriptive statistics measures
* Min and max
* Range
* Quantile
* Percentile





In [41]:
array = random_array(20,1,100)
array = np.array(array)
sorted_array = sorted(array)
print(array)

[29 59 16 95 64 3 40 7 61 4 99 32 6 38 59 26 84 53 51 69]


In [42]:
# Using np.amax()
print("Max of the array:",np.amax(array))
# Using array.max()
print("Max of the array:",array.max())

Max of the array: 99
Max of the array: 99


In [43]:
# Using np.amin()
print("Min of the array:",np.amin(array))
# Using array.min()
print("Min of the array:",array.min())

Min of the array: 3
Min of the array: 3


In [44]:
# Compute range by using max() and min() functions
print("Range of the array: ", array.max()-array.min())
# Compute range by using ptp() function
print("Range of the array: ", np.ptp(array))

Range of the array: 96
Range of the array: 96


In [45]:
# Percentile
print("20th percentile of the array: ", np.percentile(array,20))

20th percentile of the array: 14.200000000000003


In [46]:
# Quantile
print("0.25-th quantile of the array: ", np.quantile(array,0.25))
print("0.5-th quantile of the array: ", np.quantile(array,0.5))
print("0.75-th quantile of the array: ", np.quantile(array,0.75))

0.25-th quantile of the array: 23.5
0.5-th quantile of the array: 45.5
0.75-th quantile of the array: 61.75


In [47]:
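# Mark roughly where the three quartiles fall in the sorted data
# (each marker overwrites the value at that index)
sorted_array[5]='HERE'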
sorted_array[10]='HERE'
sorted_array[15]='HERE'
print(sorted_array)

[3, 4, 6, 7, 16, 'HERE', 29, 32, 38, 40, 'HERE', 53, 59, 59, 61, 'HERE', 69, 84, 95, 99]
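
The quantile values reported above fall between two neighbouring elements of the sorted data. The sketch below reproduces them by hand, assuming NumPy's default linear-interpolation method; the helper name `quantile_linear` is hypothetical and exists only for this illustration.

```python
import numpy as np

array = np.array([29, 59, 16, 95, 64, 3, 40, 7, 61, 4,
                  99, 32, 6, 38, 59, 26, 84, 53, 51, 69])

def quantile_linear(values, q):
    """Hypothetical helper: q-th quantile by linear interpolation."""
    ordered = sorted(values)
    pos = q*(len(ordered)-1)                # fractional index into the sorted data
    lower = int(pos)                        # index of the element just below
    upper = min(lower+1, len(ordered)-1)    # index of the element just above
    frac = pos-lower
    return ordered[lower] + frac*(ordered[upper]-ordered[lower])

for q in (0.25, 0.5, 0.75):
    print(q, "by hand:", quantile_linear(array, q), " NumPy:", np.quantile(array, q))

# A percentile is the same quantity expressed on a 0-100 scale
print(np.isclose(np.percentile(array, 25), np.quantile(array, 0.25)))
```

For example, the 0.25 quantile sits at fractional index 4.75, three quarters of the way from 16 to 26, which gives 23.5.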
