Authors: Serge Rey sjsrey@gmail.com and Wei Kang weikang9009@gmail.com
Distance based methods for point patterns are of three types:
In addition, we are going to introduce a computational technique Simulation Envelopes to aid in making inferences about the data generating process. An example is used to demonstrate how to use and interprete simulation envelopes.
from scipy import spatial
import libpysal as ps
import numpy as np
from pointpats import (PointPattern, PoissonPointProcess, as_window,
G, F, J, K, L, Genv, Fenv, Jenv, Kenv, Lenv)
from pointpats import ripley
%matplotlib inline
import matplotlib.pyplot as plt
The nearest neighbor(s) for a point $u$ is the point(s) $N(u)$ which meet the condition $$d_{u,N(u)} \leq d_{u,j} \forall j \in S - u$$
The distance between the nearest neighbor(s) $N(u)$ and the point $u$ is nearest neighbor distance for $u$. After searching for nearest neighbor(s) for all the points and calculating the corresponding distances, we are able to calculate mean nearest neighbor distance by averaging these distances.
It was demonstrated by Clark and Evans(1954) that mean nearest neighbor distance statistics distribution is a normal distribution under null hypothesis (underlying spatial process is CSR). We can utilize the test statistics to determine whether the point pattern is the outcome of CSR. If not, is it the outcome of cluster or regular spatial process?
Mean nearest neighbor distance statistic
$$\bar{d}_{min}=\frac{1}{n} \sum_{i=1}^n d_{min}(s_i)$$points = np.array([[66.22, 32.54], [22.52, 22.39], [31.01, 81.21],
[9.47, 31.02], [30.78, 60.10], [75.21, 58.93],
[79.26, 7.68], [8.23, 39.93], [98.73, 77.17],
[89.78, 42.53], [65.19, 92.08], [54.46, 8.48]])
pp = PointPattern(points)
kdt = spatial.KDTree(points)
Nearest neighbour distance distribution functions (including the nearest “event-to-event” and “point-event” distance distribution functions) of a point process are cumulative distribution functions of several kinds -- $G, F, J$. By comparing the distance function of the observed point pattern with that of the point pattern from a CSR process, we are able to infer whether the underlying spatial process of the observed point pattern is CSR or not for a given confidence level.
The $G$ function is defined as follows: for a given distance $d$, $G(d)$ is the proportion of nearest neighbor distances that are less than $d$. $$G(d) = \sum_{i=1}^n \frac{ \phi_i^d}{n}$$
$$ \phi_i^d = \begin{cases} 1 & \quad \text{if } d_{min}(s_i)<d \\ 0 & \quad \text{otherwise } \\ \end{cases} $$If the underlying point process is a CSR process, $G$ function has an expectation of: $$ G(d) = 1-e(-\lambda \pi d^2) $$ However, if the $G$ function plot is above the expectation this reflects clustering, while departures below expectation reflect dispersion.
gp1 = G(pp, intervals=20)
gp1.plot()
support, gfunction = ripley.g_function(points, support=20)
plt.plot(*ripley.g_function(points, support=20), marker='x')
plt.plot(gp1.d, gp1.ev)
plt.plot(gp1.d, gp1.G, marker='+')
in the q-q plot the csr function is now a diagonal line which serves to make accessment of departures from csr visually easier.
It is obvious that the above $G$ increases very slowly at small distances and the line is below the expected value for a CSR process (green line). We might think that the underlying spatial process is regular point process. However, this visual inspection is not enough for a final conclusion. In Simulation Envelopes, we are going to demonstrate how to simulate data under CSR many times and construct the $95\%$ simulation envelope for $G$.
When the number of events in a point pattern is small, $G$ function is rough (see the $G$ function plot for the 12 size point pattern above). One way to get around this is to turn to $F$ funtion where a given number of randomly distributed points are generated in the domain and the nearest event neighbor distance is calculated for each point. The cumulative distribution of all nearest event neighbor distances is called $F$ function.
fp1 = F(pp, intervals=20) # The default is to randomly generate 100 points.
plt.plot(*ripley.f_function(points, support=20), marker='x')
plt.plot(fp1.d, fp1.ev)
plt.plot(fp1.d, fp1.F, marker='+')
$J$ function is defined as follows:
$$J(d) = \frac{1-G(d)}{1-F(d)}$$If $J(d)<1$, the underlying point process is a cluster point process; if $J(d)=1$, the underlying point process is a random point process; otherwise, it is a regular point process.
jp1 = J(pp, intervals=20)
plt.plot(*ripley.j_function(points, support=20), marker='x')
plt.plot(jp1.d, jp1.ev)
plt.plot(jp1.d, jp1.j, marker='+')
From the above figure, we can observe that $J$ function is obviously above the $J(d)=1$ horizontal line. It is approaching infinity with nearest neighbor distance increasing. We might tend to conclude that the underlying point process is a regular one.
Nearest neighbor distance functions consider only the nearest neighbor distances, "event-event", "point-event" or the combination. Thus, distances to higer order neighbors are ignored, which might reveal important information regarding the point process. Interevent distance functions, including $K$ and $L$ functions, are proposed to consider distances between all pairs of event points. Similar to $G$, $F$ and $J$ functions, $K$ and $L$ functions are also cumulative distribution function.
Given distance $d$, $K(d)$ is defined as: $$K(d) = \frac{\sum_{i=1}^n \sum_{j=1}^n \psi_{ij}(d)}{n \hat{\lambda}}$$
where $$ \psi_{ij}(d) = \begin{cases} 1 & \quad \text{if } d_{ij}<d \\ 0 & \quad \text{otherwise } \\ \end{cases} $$
$\sum_{j=1}^n \psi_{ij}(d)$ is the number of events within a circle of radius $d$ centered on event $s_i$ .
Still, we use CSR as the benchmark (null hypothesis) and see how the $K$ funtion estimated from the observed point pattern deviate from that under CSR, which is $K(d)=\pi d^2$. $K(d)<\pi d^2$ indicates that the underlying point process is a regular point process. $K(d)>\pi d^2$ indicates that the underlying point process is a cluster point process.
kp1 = K(pp, intervals=20)
kp1.plot()
plt.plot(*ripley.k_function(points, support=20))
plt.plot(kp1.d, kp1.ev)
plt.plot(kp1.d, kp1.k)
$L$ function is a scaled version of $K$ function, defined as: $$L(d) = \sqrt{\frac{K(d)}{\pi}}-d$$
lp1 = L(pp)
lp1.plot()
plt.plot(*ripley.l_function(points, support=20, linearized=True))
plt.plot(lp1.d, lp1.l)
A Simulation envelope is a computer intensive technique for inferring whether an observed pattern significantly deviates from what would be expected under a specific process. Here, we always use CSR as the benchmark. In order to construct a simulation envelope for a given function, we need to simulate CSR a lot of times, say $1000$ times. Then, we can calculate the function for each simulated point pattern. For every distance $d$, we sort the function values of the $1000$ simulated point patterns. Given a confidence level, say $95\%$, we can acquire the $25$th and $975$th value for every distance $d$. Thus, a simulation envelope is constructed.
Genv class in pysal.
realizations = PoissonPointProcess(pp.window, pp.n, 100, asPP=True) # simulate CSR 100 times
genv = Genv(pp, intervals=20, realizations=realizations) # call Genv to generate simulation envelope
genv
genv.observed
genv.plot()
In the above figure, LB and UB comprise the simulation envelope. CSR is the mean function calculated from the simulated data. G is the function estimated from the observed point pattern. It is well below the simulation envelope. We can infer that the underlying point process is a regular one.
g_test = ripley.g_test(points, keep_replications=True)
g_test
plt.fill_between(g_test.support,
*np.percentile(g_test.replications,
q=(2.5, 97.5), axis=1),
color='grey', alpha=.2, label='95% CI')
plt.plot(g_test.support, g_test.statistic, color='orangered', label='Observed')
plt.legend()
plt.show()
Fenv class in pysal.
fenv = Fenv(pp, intervals=20, realizations=realizations)
fenv.plot()
f_test = ripley.f_test(points, keep_replications=True)
f_test
plt.fill_between(f_test.support,
*np.percentile(f_test.replications,
q=(2.5, 97.5), axis=1),
color='grey', alpha=.2, label='95% CI')
plt.plot(f_test.support, f_test.statistic, color='orangered',
label='Observed')
plt.legend()
plt.show()
Jenv class in pysal.
jenv = Jenv(pp, intervals=20, realizations=realizations)
jenv.plot()
j_test = ripley.j_test(points, keep_replications=True)
j_test
plt.fill_between(j_test.support,
*np.percentile(j_test.replications,
q=(2.5, 97.5), axis=1),
color='grey', alpha=.2, label='95% CI')
plt.plot(j_test.support, j_test.statistic, color='orangered',
label='Observed')
plt.legend()
plt.show()
Kenv class in pysal.
kenv = Kenv(pp, intervals=20, realizations=realizations)
kenv.plot()
k_test = ripley.k_test(points, keep_replications=True)
k_test
plt.plot(k_test.support,
k_test.replications,
color='k', alpha=.01)
plt.plot(k_test.support,
k_test.replications[:,0],
color='k', label='Simulations')
plt.plot(k_test.support, k_test.statistic, color='orangered',
label='Observed')
plt.title('Alternative Plot Style')
plt.legend()
plt.show()
Lenv class in pysal.
lenv = Lenv(pp, intervals=20, realizations=realizations)
lenv.plot()
l_test = ripley.l_test(points, linearized=True, keep_replications=True)
l_test
plt.plot(l_test.support,
l_test.replications,
color='k', alpha=.01)
plt.plot(l_test.support,
l_test.replications[:,0],
color='k', label='Simulations')
plt.plot(l_test.support, l_test.statistic, color='orangered',
label='Observed')
plt.title('Alternative Plot Style')
plt.legend()
plt.show()
In this example, we are going to generate a point pattern as the "observed" point pattern. The data generating process is CSR. Then, we will simulate CSR in the same domain for 100 times and construct a simulation envelope for each function.
from libpysal.cg import shapely_ext, asShape
from pointpats import Window
import geopandas
df = geopandas.read_file(ps.examples.get_path("vautm17n.shp"))
state = df.geometry.cascaded_union
Generate the point pattern pp (size 100) from CSR as the "observed" point pattern.
a = [[1],[1,2]]
np.asarray(a)
n = 100
samples = 1
pp = PoissonPointProcess(Window(asShape(state).parts),
n, samples, asPP=True)
pp.realizations[0]
pp.n
Simulate CSR in the same domian for 100 times which would be used for constructing simulation envelope under the null hypothesis of CSR.
csrs = PoissonPointProcess(pp.window, 100, 100, asPP=True)
csrs
import importlib
ripley = importlib.reload(ripley)
simulations = ripley.simulate(state, size=(100,100))
plt.scatter(*simulations[23].T)
geopandas.GeoDataFrame(geometry=[state]).boundary.plot(ax=plt.gca())
Construct the simulation envelope for $G$ function.
genv = Genv(pp.realizations[0], realizations=csrs)
genv.plot()
ripley.g_test(simulations[10], hull=state).pvalues
Since the "observed" $G$ is well contained by the simulation envelope, we infer that the underlying point process is a random process.
genv.low # lower bound of the simulation envelope for G
genv.high # higher bound of the simulation envelope for G
Construct the simulation envelope for $F$ function.
fenv = Fenv(pp.realizations[0], realizations=csrs)
fenv.plot()
Construct the simulation envelope for $J$ function.
jenv = Jenv(pp.realizations[0], realizations=csrs)
jenv.plot()
Construct the simulation envelope for $K$ function.
kenv = Kenv(pp.realizations[0], realizations=csrs)
kenv.plot()
Construct the simulation envelope for $L$ function.
lenv = Lenv(pp.realizations[0], realizations=csrs)
lenv.plot()