GEOG 172: INTERMEDIATE GEOGRAPHICAL ANALYSIS

Evgeny Noi

Lecture 06: Point Pattern Analysis

Point Pattern Locations

  • Fixed objects in space (e.g., an intersection where traffic volume is recorded): the location is secondary to the observed value
  • A sample of a continuous geographic process (both the location and the value matter!)

A point pattern is a manifestation of a geographic process.

Consider a series of point locations $(s_1, s_2, ..., s_n)$ in a study region $\mathcal{R}$. In two-dimensional space each point $s_j$ is defined by a pair of coordinates, such that $s_j = (x_j, y_j)$ for each location $j$.

A point process is a stochastic mechanism that generates a countable set of events.

Examples: trees on a plot, meteor strikes, crimes, animals in a field

Point Pattern Analysis (PPA)

  • Point pattern analysis (PPA) focuses on the analysis, modeling, visualization, and interpretation of the spatial distribution of point data.
  • Applications: crime analysis, epidemiology, and facility location planning and management.

PPA

  • The pattern is judged against the hypothesis of complete spatial randomness (CSR)
  • Departures from CSR can be due to two types of effects (both illustrated in the sketch below):
    • First-order effects: the intensity of the process varies across space. Think about the intensity of a disease; the assumption of a constant probability of an event is violated (e.g., due to an underlying covariate).
    • Second-order effects: dependence and interaction between events. Think about how the presence of an infected individual raises the probability of infection in the vicinity of that individual.
  • Once we have established a departure from CSR, further investigation may be required to pinpoint the location of such departure
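A minimal simulation sketch of these two departures, using only numpy (the intensity gradient and the cluster parameters below are arbitrary choices for illustration):

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 200

# CSR: uniform and independent locations on the unit square
csr = rng.uniform(0, 1, size=(n, 2))

# First-order departure: intensity increases from west to east,
# so the probability of an event is no longer constant in space
cand = rng.uniform(0, 1, size=(4 * n, 2))
first = cand[rng.uniform(0, 1, 4 * n) < cand[:, 0]][:n]

# Second-order departure: events cluster around 'parent' events
parents = rng.uniform(0, 1, size=(10, 2))
second = parents[rng.integers(0, 10, n)] + rng.normal(0, 0.03, size=(n, 2))

fig, axs = plt.subplots(1, 3, figsize=(9, 3))
for ax, pts, title in zip(axs, (csr, first, second),
                          ('CSR', 'first-order', 'second-order')):
    ax.scatter(pts[:, 0], pts[:, 1], s=5)
    ax.set_title(title)
    ax.set_axis_off()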

PPA

  • Density-based methods deal with first-order properties of the data (variation in intensity across the area of interest)
  • Distance-based methods deal with second-order properties of the data (events' influence on one another)

Source: Intro to GIS Spatial Analysis

Methods of PPA

  • Kernel Density Estimation
  • Quadrat Analysis
  • Nearest Neighbor Analysis
  • Ripley's K

The first case of PPA

from THIS to THIS

John Snow and Epidemiology

  • Cholera was one of the deadliest diseases to affect Britain in the nineteenth century
  • It was believed that the disease was transmitted and spread by ‘bad air’ or ‘bad smells’ from rotting organic matter
  • In 1854 the physician John Snow (1813-1858) made a major contribution to fighting cholera when he demonstrated a link between cholera and contaminated drinking water through his pioneering studies
  • Case Study: Soho District and Broad Street Pump

Read full story here

Cholera and John Snow

In the mid-1800s, people didn’t have running water or modern toilets in their homes. They used town wells and communal pumps to get the water they used for drinking, cooking, and washing. Septic systems were primitive, and most homes and businesses dumped untreated sewage and animal waste directly into the Thames River or into open pits called “cesspools”. Water companies often bottled water from the Thames and delivered it to pubs, breweries, and other businesses.

500 fatal attacks in 10 days

John Snow found several important examples. A workhouse, or prison, near Soho had 535 inmates but almost no cases of cholera. Snow discovered the workhouse had its own well and bought water from the Grand Junction Water Works.

The proprietor of the brewery, Mr. Huggins, told Snow that the men drank the liquor they made or water from the brewery’s own well and not water from the Broad Street pump.

Geographic Analysis of Cholera Deaths

  • Does the data look random / uniform / dispersed? Or can we pinpoint pockets of high concentration?
  • Can we relate high-concentration areas to probable causes (e.g., water pumps, cesspools)?
  • Two types of health geography questions:
    • How does the disease spread? (Pump)
    • How does the disease get to the spreading environment? (Cesspools)
In [1]:
import pandas as pd
import geopandas as gpd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import geoplot as gplt
In [2]:
!wget https://github.com/barguzin/geo_data/raw/main/snow_data.zip -O snow_data.zip -nc 
File 'snow_data.zip' already there; not retrieving.
In [3]:
! mkdir snow_data | unzip -o snow_data.zip -d snow_data/
Archive:  snow_data.zip
  inflating: snow_data/deaths_by_bldg.geojson  
  inflating: snow_data/deaths_by_block.geojson  
  inflating: snow_data/deaths_by_bsrings.geojson  
  inflating: snow_data/deaths_by_otherrings.geojson  
  inflating: snow_data/deaths_nd_by_house.geojson  
  inflating: snow_data/pumps.geojson  
  inflating: snow_data/sewergrates_ventilators.geojson  
  inflating: snow_data/subdistricts.geojson  
A subdirectory or file snow_data already exists.
In [4]:
# data documentation here 
# https://geodacenter.github.io/data-and-lab/data/snow_documentation.pdf
death_by_build = gpd.read_file('snow_data/deaths_by_bldg.geojson')
death_by_block = gpd.read_file('snow_data/deaths_by_block.geojson')
pumps = gpd.read_file('snow_data/pumps.geojson')
sewer = gpd.read_file('snow_data/sewergrates_ventilators.geojson')
In [6]:
fig, ax = plt.subplots()

# calculate marker size for plotting
death_by_build['plotted_size'] = death_by_build.deaths * 10

death_by_block.boundary.plot(color='k', alpha=.2, linestyle='dashed',ax=ax); # plot city blocks
#death_by_build.plot(markersize = 'plotted_size', ax=ax, color='r', label='deaths', marker='o'); # plot deaths by buildings 
death_by_build.plot(ax=ax, color='r', label='deaths', marker='o', markersize=20, ec='k')
pumps.plot(marker='s', color='g', ec='k', markersize=120, ax=ax, label='pumps'); # plot pumps

ax.set_title('Cholera Deaths and location of Pumps\n in Soho District, London.', fontsize=16);
_ = ax.axis('off')
ax.legend();
In [8]:
death_by_build
Out[8]:
ID x y deaths pumpID distpump distBSpump BSpump geometry plotted_size
0 1 529308.741420 181031.351546 3 1 88.048464 88.048 1 POINT (-0.13793 51.51342) 30
1 2 529312.163571 181025.172401 2 1 84.400413 84.400 1 POINT (-0.13788 51.51336) 20
2 3 529314.382429 181020.293688 1 1 82.319197 82.319 1 POINT (-0.13785 51.51332) 10
3 4 529317.379645 181014.258771 1 1 79.916876 79.917 1 POINT (-0.13781 51.51326) 10
4 5 529320.675449 181007.871628 4 1 77.809998 77.810 1 POINT (-0.13777 51.51320) 40
... ... ... ... ... ... ... ... ... ... ...
245 246 529362.664552 181156.058382 3 1 135.342515 135.343 1 POINT (-0.13711 51.51453) 30
246 247 529365.151891 181176.128560 2 1 154.335243 154.336 1 POINT (-0.13706 51.51471) 20
247 248 529274.165044 180907.312598 1 8 114.615825 169.873 0 POINT (-0.13847 51.51231) 10
248 249 529299.361335 180873.184663 1 8 78.425370 180.358 0 POINT (-0.13812 51.51200) 10
249 250 529324.815196 180857.948736 1 8 69.344090 181.907 0 POINT (-0.13776 51.51186) 10

250 rows × 10 columns

In [9]:
# Generate scatter plot
sns.jointplot(x="x", y="y", data=death_by_build, s=20, ec='k', fc='r');
In [10]:
# Generate scatter plot
f, ax = plt.subplots()

sns.histplot(x="x", y="y", data=death_by_build, bins=9, cbar=True, cmap="Reds", ax=ax);
ax.scatter(death_by_build.x, death_by_build.y, color='r', s=20, ec='k');
In [11]:
f, ax = plt.subplots()

hb = ax.hexbin(
    death_by_build["x"],
    death_by_build["y"],
    gridsize=7,
    linewidths=0,
    #alpha=0.5,
    cmap="Reds",
)

ax.scatter(death_by_build.x, death_by_build.y, color='r', s=20, ec='k')

# Add colorbar
plt.colorbar(hb)
# Remove axes
ax.set_axis_off()
In [12]:
# Generate scatter plot
sns.jointplot(x="x", y="y", data=death_by_build, kind='hex', color='red');
In [13]:
f, ax = plt.subplots()

sns.kdeplot(
    data=death_by_build,
    gridsize=50,
    levels=10,
    x = "x",
    y = "y",
    #bw_adjust=1.5,
    fill=True,
    alpha=0.75,
    cmap="Reds",
    cbar=True, 
    cbar_kws=dict(shrink=.5)
)

ax.scatter(death_by_build.x, death_by_build.y, color='r', s=20, ec='k')

# Remove axes
ax.set_axis_off()
In [14]:
g = sns.jointplot(data=death_by_build, x="x", y="y", color='red')
g.plot_joint(sns.kdeplot, zorder=0, levels=6, fill=True, alpha=.75, cmap='Reds')
g.plot_marginals(sns.rugplot, color="r", height=-.15, clip_on=False)
g.ax_joint.set_axis_off() # turn axis off 

Measures of Centrography

  • Central tendency
    • Mean Center
    • Manhattan Median
    • Euclidean Median
  • Dispersion
    • Ellipse
  • Shape (Extent)
    • Convex Hull
    • Bounding Box
    • etc.

Central Tendency

Mean Center

$$ x_{mc} = \frac{\sum_{i=1}^n x_i}{n}, \qquad y_{mc} = \frac{\sum_{i=1}^n y_i}{n} $$

Manhattan Median

The Manhattan median is the location which minimizes the absolute distance to all the event points. It extends the median in one-dimensional space to two-dimensional space. Since in one-dimensional space a median is the number separating the higher half of a dataset from the lower half, we define the Manhattan median as a tuple whose first element is the median of the $x$ coordinates and whose second element is the median of the $y$ coordinates.

$$ \min f(x_{mm},y_{mm}) = \sum^n_{i=1} \left( |x_i-x_{mm}| + |y_i-y_{mm}| \right) $$
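Because the objective separates by coordinate, the Manhattan median is simply the pair of coordinate-wise medians, just as the mean center is the pair of coordinate-wise means. A quick numpy check on toy coordinates (note that for an even number of points the coordinate-wise median is not unique, which is why pointpats raises a warning further below):

In [ ]:
import numpy as np

pts = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.]])  # toy coordinates

print(pts.mean(axis=0))        # mean center: [1.5 1.5]
print(np.median(pts, axis=0))  # Manhattan median: [0.5 0.5]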

Euclidean Median

The Euclidean median is the location from which the sum of the Euclidean distances to all points in a distribution is a minimum. Finding it is an optimization problem that is very important for more general location-allocation problems. There is no closed-form solution; an iterative algorithm (Kuhn and Kuenne, 1962) can be used to approximate the Euclidean median.

$$ \min f(x_{em},y_{em}) = \sum^n_{i=1} \sqrt{(x_i-x_{em})^2+(y_i-y_{em})^2} $$
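A minimal sketch of such an iterative scheme (a Weiszfeld-style fixed-point update; the starting point and tolerance are arbitrary choices, not the exact Kuhn and Kuenne implementation):

In [ ]:
import numpy as np

def approx_euclidean_median(pts, tol=1e-8, max_iter=1000):
    """Iteratively re-weighted mean that converges to the Euclidean median."""
    guess = pts.mean(axis=0)  # start from the mean center
    for _ in range(max_iter):
        d = np.linalg.norm(pts - guess, axis=1)
        d = np.where(d == 0, 1e-12, d)  # guard against division by zero
        w = 1.0 / d
        new = (pts * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(new - guess) < tol:
            break
        guess = new
    return guess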
In [15]:
# import centrography analysis functions 
from pointpats.centrography import hull, mbr, mean_center, weighted_mean_center, manhattan_median 
from pointpats.centrography import std_distance, euclidean_median, ellipse
from pointpats import PointPattern
In [16]:
# extract coordinates from the dataframe and stack them into an (n, 2) array for the pointpats package
points = np.stack((death_by_build.x.to_numpy(), death_by_build.y.to_numpy()), axis=1)

pp = PointPattern(points) #create a point pattern "pp" from list 
In [17]:
weights = death_by_build.deaths # extract weights for weighted mean 

mc = mean_center(pp.points) # mean center
wmc = weighted_mean_center(pp.points, weights) # weighted mean
em = euclidean_median(pp.points) # euclidean median 
mm = manhattan_median(pp.points)

print('Mean Center coordinates \t \t:', mc)
print('Weighted MC coordinates \t \t:', wmc)
print('Euclidean Median coordinates \t \t:', em)
print('Manhattan Median coordinates \t \t:', mm)
Mean Center coordinates 	 	: [529413.85369427 181033.78696473]
Weighted MC coordinates 	 	: [529414.75675154 181031.89339795]
Euclidean Median coordinates 	 	: [529410.20734233 181027.07075087]
Manhattan Median coordinates 	 	: [529422.525932 181027.12929 ]
C:\Users\barguzin\Anaconda3\envs\geo_env\lib\site-packages\pointpats\centrography.py:208: UserWarning: Manhattan Median is not unique for even point patterns.
  warnings.warn(s)
In [18]:
pp.plot()
plt.plot(mc[0], mc[1], 'b^', label='Mean Center')
plt.plot(wmc[0], wmc[1], 'gd', label='Weighted Mean Center')
plt.plot(mm[0], mm[1], 'rv', label='Manhattan Median')
plt.plot(em[0], em[1], 'm+', label='Euclidean Median')
plt.legend(numpoints=1)
Out[18]:
<matplotlib.legend.Legend at 0x1f5a7aa3f10>

Dispersion

Standard Distance & Standard Distance Circle

The standard distance is closely related to the usual definition of the standard deviation of a dataset; it provides a measure of how dispersed the events are around their mean center.

$$ SD = \displaystyle \sqrt{\frac{\sum^n_{i=1}(x_i-x_{m})^2}{n} + \frac{\sum^n_{i=1}(y_i-y_{m})^2}{n}} $$
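In numpy, the formula amounts to the square root of the summed coordinate variances (population variances, matching the $n$ denominator above); a quick check assuming the (n, 2) array points built earlier:

In [ ]:
# equivalent to the SD formula above: sqrt(var(x) + var(y)); ddof=0 matches the /n
sd_manual = np.sqrt(points[:, 0].var() + points[:, 1].var())
print(sd_manual)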
In [19]:
stdd = std_distance(pp.points)

circle1=plt.Circle((mc[0], mc[1]),stdd,color='blue', alpha=.3)
ax = pp.plot(get_ax=True, title='Standard Distance Circle')
ax.add_artist(circle1)
plt.plot(mc[0], mc[1], 'b^', label='Mean Center')
ax.set_aspect('equal')
plt.legend(numpoints=1)
Out[19]:
<matplotlib.legend.Legend at 0x1f5a7b70d30>

Standard Deviational Ellipse

Compared with the standard distance circle, which measures dispersion using the single parameter SD, the standard deviational ellipse measures dispersion and trend in two dimensions: the angle of rotation $\theta$, dispersion along the major axis $s_x$, and dispersion along the minor axis $s_y$.

$$ s_x = \sqrt{\frac{2\sum_{i=1}^n \left[ (x_i-\bar{x})\cos\theta - (y_i-\bar{y})\sin\theta \right]^2}{n-2}} $$

$$ s_y = \sqrt{\frac{2\sum_{i=1}^n \left[ (x_i-\bar{x})\sin\theta - (y_i-\bar{y})\cos\theta \right]^2}{n-2}} $$

$$ \theta = \arctan\left\{ \frac{\left( \sum_i(x_i-\bar{x})^2 - \sum_i(y_i-\bar{y})^2 \right) + \left[ \left( \sum_i(x_i-\bar{x})^2 - \sum_i(y_i-\bar{y})^2 \right)^2 + 4\left( \sum_i(x_i-\bar{x})(y_i-\bar{y}) \right)^2 \right]^{1/2}}{2\sum_i(x_i-\bar{x})(y_i-\bar{y})} \right\} $$
In [20]:
sx, sy, theta = ellipse(pp.points)
theta_degree = np.degrees(theta) #need degree of rotation to plot the ellipse

from matplotlib.patches import Ellipse
from pylab import figure,show,rand
fig = figure()
#ax = fig.add_subplot(111, aspect='equal')
e = Ellipse(xy=mean_center(pp.points), width=sx*2, height=sy*2, angle=-theta_degree) #angle is rotation in degrees (anti-clockwise)
ax = pp.plot(get_ax=True, title='Standard Deviational Ellipse')
ax.add_artist(e)
e.set_clip_box(ax.bbox)
e.set_facecolor([0.8,0,0, .5])
e.set_edgecolor([1,0,0])
ax.set_aspect('equal')
plt.plot(mc[0], mc[1], 'b^', label='Mean Center')
plt.legend(numpoints=1)
show()
<Figure size 640x480 with 0 Axes>

Shape Analysis

Convex Hull

The convex hull of a point pattern pp is the smallest convex set that contains pp.

In [21]:
pp.plot(title='Centers', hull=True ) #plot point pattern "pp" as well as its convex hull
plt.plot(mc[0], mc[1], 'b^', label='Mean Center')
plt.plot(wmc[0], wmc[1], 'gd', label='Weighted Mean Center')
plt.plot(mm[0], mm[1], 'rv', label='Manhattan Median')
plt.plot(em[0], em[1], 'm+', label='Euclidean Median')
plt.legend(numpoints=1)
Out[21]:
<matplotlib.legend.Legend at 0x1f5aa637220>

Minimum Bounding Rectangle

The minimum bounding rectangle (box) of a point pattern is the same as the minimum bounding rectangle of its convex hull. Thus, it is almost always bigger than the convex hull, as the quick check below illustrates.
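A numerical check of that claim using the functions already imported from pointpats.centrography (assuming, as in recent pointpats releases, that hull returns the ordered hull vertices and mbr returns the (min_x, min_y, max_x, max_y) corners); the shoelace helper is our own:

In [ ]:
def shoelace_area(verts):
    """Polygon area from ordered vertices (shoelace formula)."""
    x, y = verts[:, 0], verts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))

hull_verts = hull(pp.points)                 # ordered convex hull vertices
min_x, min_y, max_x, max_y = mbr(pp.points)  # bounding box corners

print(shoelace_area(hull_verts))             # hull area
print((max_x - min_x) * (max_y - min_y))     # box area (never smaller)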

In [22]:
pp.plot(title='Centers',  hull=True , window=True )#plot point pattern "pp", convex hull, and Minimum Bounding Rectangle
plt.plot(mc[0], mc[1], 'b^', label='Mean Center')
plt.plot(wmc[0], wmc[1], 'gd', label='Weighted Mean Center')
plt.plot(mm[0], mm[1], 'rv', label='Manhattan Median')
plt.plot(em[0], em[1], 'm+', label='Euclidean Median')
plt.legend(numpoints=1)
Out[22]:
<matplotlib.legend.Legend at 0x1f5aa73eef0>

Quadrat Statistic

  • CSR (uniform and independent)
  • Partition space into regularly spaced grid cells.
  • Since the process is uniform, we expect the same intensity everywhere

Quadrat Statistic

$$ s^2 = \frac{\sum_{i=1}^m (x_i - \bar{x})^2}{m-1} $$

$$ VMR = \frac{s^2}{\bar{x}}, \quad z = \sqrt{\frac{m-1}{2}}\,(VMR - 1) $$

where $m$ is the number of quadrats, $\bar{x}$ is the mean number of points per quadrat, $s^2$ is the variance of the number of points per quadrat, $(x_i-\bar{x})^2$ is the squared deviation of quadrat $i$ from the mean, and $VMR$ is the variance-mean ratio.

See more details here

Quadrat Statistic

  • $H_0$: The distribution of cholera deaths is not significantly different from random (null hypothesis)
  • $H_a$: The distribution of cholera deaths is significantly different from random (alternative hypothesis)

$\alpha=0.05$

In [23]:
from pointpats import (
    distance_statistics, QStatistic,
    random, PointPattern, centrography
)

qstat = QStatistic(points)
print(f'p-value: {qstat.chi2_pvalue}')
qstat.plot()
p-value: 5.956733903969489e-23
In [24]:
# worked example: 250 points observed across 9 quadrats
list_counts = np.array([2, 12, 10, 35, 56, 38, 23, 60, 14])  # points per quadrat
n_points = 250  # total number of points
m = 9           # number of quadrats

xbar = n_points / m                              # mean points per quadrat
var = np.sum((list_counts - xbar)**2) / (m - 1)  # sample variance s^2
vmr = var / xbar                                 # variance-mean ratio
z = np.sqrt((m - 1) / 2) * (vmr - 1)             # z-score

print(f's2  = {var:.2f}')
print(f'VMR = {vmr:.2f}')
print(f'z   = {z:.2f}')
s2  = 429.19
VMR = 15.45
z   = 28.90
  • total points: 250
  • number of quadrats: 9
  • degrees of freedom: (9-1) = 8
  • $\bar{x} = 250/9 \approx 27.78$
  • $s^2 = 3433.56/8 \approx 429.19$
  • $VMR = 429.19/27.78 \approx 15.45$
  • $z = 28.90$

Because $z$ is not within $z_l \leq z \leq z_h$, where $z_l=-1.96$ and $z_h=1.96$ are the critical values from the $z$-table, we reject the null hypothesis: the deaths are significantly more clustered than a random pattern.

In [25]:
round(429.19 / 27.78, 2)
Out[25]:
15.45
In [26]:
hull = centrography.ConvexHull(pp.points)
random_poisson = random.poisson(hull, size=len(pp.points))

ran_qstat = QStatistic(random_poisson)
print(f'p-value: {ran_qstat.chi2_pvalue}')
ran_qstat.plot()
p-value: 0.000671888215331921

Distance-based Measures of PPA

  • Nearest Neighbor Distance
  • Distance functions ($G$, $F$, $K$)

Mean Nearest Neighbor Distance Statistics

The nearest neighbor of a point $u$ must meet the following condition:

$$ d_{u,N(u)} \leq d_{u,j} \quad \forall j \in S - u $$

Clark and Evans (1954) demonstrated that, under the null hypothesis that the underlying spatial process is CSR, the mean nearest neighbor distance statistic follows a normal distribution. We can use this test statistic to determine whether the point pattern is the outcome of CSR and, if not, whether it is the outcome of a clustered or a regular (dispersed) spatial process.

$$ \bar{d}_{min}=\frac{1}{n} \sum_{i=1}^n d_{min}(s_i) $$
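Although only the nearest neighbor distance histograms are plotted below, a sketch of the Clark and Evans z-score itself is straightforward with scipy (estimating the intensity from the convex hull area is one common choice; edge effects are ignored here):

In [ ]:
from scipy.spatial import ConvexHull, cKDTree

def clark_evans_z(pts):
    """Clark & Evans (1954): observed vs. expected mean NN distance under CSR."""
    n = len(pts)
    area = ConvexHull(pts).volume              # in 2D, .volume is the hull's area
    lam = n / area                             # estimated intensity
    d = cKDTree(pts).query(pts, k=2)[0][:, 1]  # distance to each point's nearest neighbor
    d_obs = d.mean()
    d_exp = 1 / (2 * np.sqrt(lam))             # expected mean NND under CSR
    se = 0.26136 / np.sqrt(n * lam)            # standard error under CSR
    return (d_obs - d_exp) / se                # z < 0: clustered, z > 0: dispersed

print(clark_evans_z(points))  # expect z << 0 for the cholera deaths (clustered)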
In [27]:
tt = pp.knn()
# plot distances to nearest neighbor
plt.hist(tt[1])
Out[27]:
(array([188.,  46.,   8.,   6.,   1.,   0.,   0.,   0.,   0.,   1.]),
 array([ 2.46704738, 11.18215934, 19.89727131, 28.61238327, 37.32749524,
        46.0426072 , 54.75771917, 63.47283113, 72.1879431 , 80.90305506,
        89.61816702]),
 <BarContainer object of 10 artists>)
In [29]:
random_point_pattern = PointPattern(random_poisson)

tt = random_point_pattern.knn()
# plot distances to nearest neighbor
plt.hist(tt[1])
Out[29]:
(array([30., 67., 53., 55., 22., 15.,  6.,  0.,  1.,  1.]),
 array([ 1.41033467,  4.91369257,  8.41705047, 11.92040837, 15.42376627,
        18.92712417, 22.43048207, 25.93383998, 29.43719788, 32.94055578,
        36.44391368]),
 <BarContainer object of 10 artists>)

Questions?

Data Report instructions are on GauchoSpace and on the course website.