Distance Based Statistical Method for Planar Point Patterns¶

Authors: Serge Rey sjsrey@gmail.com and Wei Kang weikang9009@gmail.com

Introduction¶

Distance based methods for point patterns are of three types:

Mean Nearest Neighbor Distance Statistics
Nearest Neighbor Distance Functions
Interevent Distance Functions

In addition, we are going to introduce a computational technique Simulation Envelopes to aid in making inferences about the data generating process. An example is used to demonstrate how to use and interprete simulation envelopes.

In [1]:

from scipy import spatial
import libpysal as ps
import numpy as np
from pointpats import (PointPattern, PoissonPointProcess, as_window, 
                       G, F, J, K, L, Genv, Fenv, Jenv, Kenv, Lenv)
from pointpats import ripley
%matplotlib inline
import matplotlib.pyplot as plt

Mean Nearest Neighbor Distance Statistics¶

The nearest neighbor(s) for a point $u$ is the point(s) $N(u)$ which meet the condition $$d_{u,N(u)} \leq d_{u,j} \forall j \in S - u$$

The distance between the nearest neighbor(s) $N(u)$ and the point $u$ is nearest neighbor distance for $u$. After searching for nearest neighbor(s) for all the points and calculating the corresponding distances, we are able to calculate mean nearest neighbor distance by averaging these distances.

It was demonstrated by Clark and Evans(1954) that mean nearest neighbor distance statistics distribution is a normal distribution under null hypothesis (underlying spatial process is CSR). We can utilize the test statistics to determine whether the point pattern is the outcome of CSR. If not, is it the outcome of cluster or regular spatial process?

Mean nearest neighbor distance statistic

$$\bar{d}_{min}=\frac{1}{n} \sum_{i=1}^n d_{min}(s_i)$$

In [2]:

points = np.array([[66.22, 32.54], [22.52, 22.39], [31.01, 81.21],
                   [9.47, 31.02],  [30.78, 60.10], [75.21, 58.93],
                   [79.26,  7.68], [8.23, 39.93],  [98.73, 77.17],
                   [89.78, 42.53], [65.19, 92.08], [54.46, 8.48]])
pp = PointPattern(points)
kdt = spatial.KDTree(points)

Nearest Neighbor Distance Functions¶

Nearest neighbour distance distribution functions (including the nearest “event-to-event” and “point-event” distance distribution functions) of a point process are cumulative distribution functions of several kinds -- $G, F, J$. By comparing the distance function of the observed point pattern with that of the point pattern from a CSR process, we are able to infer whether the underlying spatial process of the observed point pattern is CSR or not for a given confidence level.

$G$ function - event-to-event¶

The $G$ function is defined as follows: for a given distance $d$, $G(d)$ is the proportion of nearest neighbor distances that are less than $d$. $$G(d) = \sum_{i=1}^n \frac{ \phi_i^d}{n}$$

$$ \phi_i^d = \begin{cases} 1 & \quad \text{if } d_{min}(s_i)<d \\ 0 & \quad \text{otherwise } \\ \end{cases} $$

If the underlying point process is a CSR process, $G$ function has an expectation of: $$ G(d) = 1-e(-\lambda \pi d^2) $$ However, if the $G$ function plot is above the expectation this reflects clustering, while departures below expectation reflect dispersion.

In [3]:

gp1 = G(pp, intervals=20)
gp1.plot()

In [4]:

support, gfunction = ripley.g_function(points, support=20)

In [5]:

plt.plot(*ripley.g_function(points, support=20), marker='x')
plt.plot(gp1.d, gp1.ev)
plt.plot(gp1.d, gp1.G, marker='+')

Out[5]:

[<matplotlib.lines.Line2D at 0x128fea460>]

in the q-q plot the csr function is now a diagonal line which serves to make accessment of departures from csr visually easier.

It is obvious that the above $G$ increases very slowly at small distances and the line is below the expected value for a CSR process (green line). We might think that the underlying spatial process is regular point process. However, this visual inspection is not enough for a final conclusion. In Simulation Envelopes, we are going to demonstrate how to simulate data under CSR many times and construct the $95\%$ simulation envelope for $G$.

$F$ function - "point-event"¶

When the number of events in a point pattern is small, $G$ function is rough (see the $G$ function plot for the 12 size point pattern above). One way to get around this is to turn to $F$ funtion where a given number of randomly distributed points are generated in the domain and the nearest event neighbor distance is calculated for each point. The cumulative distribution of all nearest event neighbor distances is called $F$ function.

In [6]:

fp1 = F(pp, intervals=20) # The default is to randomly generate 100 points.

In [7]:

plt.plot(*ripley.f_function(points, support=20), marker='x')
plt.plot(fp1.d, fp1.ev)
plt.plot(fp1.d, fp1.F, marker='+')

Out[7]:

[<matplotlib.lines.Line2D at 0x1290be8b0>]

$J$ function - a combination of "event-event" and "point-event"¶

$J$ function is defined as follows:

$$J(d) = \frac{1-G(d)}{1-F(d)}$$

If $J(d)<1$, the underlying point process is a cluster point process; if $J(d)=1$, the underlying point process is a random point process; otherwise, it is a regular point process.

In [8]:

jp1 = J(pp, intervals=20)

In [9]:

plt.plot(*ripley.j_function(points, support=20), marker='x')
plt.plot(jp1.d, jp1.ev)
plt.plot(jp1.d, jp1.j, marker='+')

/Users/lw17329/Dropbox/dev/pointpats/pointpats/ripley.py:656: UserWarning: requested 20 bins to evaluate the J function, but it reaches infinity at d=25.5178, meaning only 14 bins will be used to characterize the J function.
  warnings.warn(

Out[9]:

[<matplotlib.lines.Line2D at 0x1291aa340>]

From the above figure, we can observe that $J$ function is obviously above the $J(d)=1$ horizontal line. It is approaching infinity with nearest neighbor distance increasing. We might tend to conclude that the underlying point process is a regular one.

Interevent Distance Functions¶

Nearest neighbor distance functions consider only the nearest neighbor distances, "event-event", "point-event" or the combination. Thus, distances to higer order neighbors are ignored, which might reveal important information regarding the point process. Interevent distance functions, including $K$ and $L$ functions, are proposed to consider distances between all pairs of event points. Similar to $G$, $F$ and $J$ functions, $K$ and $L$ functions are also cumulative distribution function.

$K$ function - "interevent"¶

Given distance $d$, $K(d)$ is defined as: $$K(d) = \frac{\sum_{i=1}^n \sum_{j=1}^n \psi_{ij}(d)}{n \hat{\lambda}}$$

where $$ \psi_{ij}(d) = \begin{cases} 1 & \quad \text{if } d_{ij}<d \\ 0 & \quad \text{otherwise } \\ \end{cases} $$

$\sum_{j=1}^n \psi_{ij}(d)$ is the number of events within a circle of radius $d$ centered on event $s_i$ .

Still, we use CSR as the benchmark (null hypothesis) and see how the $K$ funtion estimated from the observed point pattern deviate from that under CSR, which is $K(d)=\pi d^2$. $K(d)<\pi d^2$ indicates that the underlying point process is a regular point process. $K(d)>\pi d^2$ indicates that the underlying point process is a cluster point process.

In [10]:

kp1 = K(pp, intervals=20)
kp1.plot()

In [11]:

plt.plot(*ripley.k_function(points, support=20))
plt.plot(kp1.d, kp1.ev)
plt.plot(kp1.d, kp1.k)

Out[11]:

[<matplotlib.lines.Line2D at 0x129342730>]

$L$ function - "interevent"¶

$L$ function is a scaled version of $K$ function, defined as: $$L(d) = \sqrt{\frac{K(d)}{\pi}}-d$$

In [12]:

lp1 = L(pp)
lp1.plot()

In [13]:

plt.plot(*ripley.l_function(points, support=20, linearized=True))
plt.plot(lp1.d, lp1.l)

Out[13]:

[<matplotlib.lines.Line2D at 0x129467760>]

Simulation Envelopes¶

A Simulation envelope is a computer intensive technique for inferring whether an observed pattern significantly deviates from what would be expected under a specific process. Here, we always use CSR as the benchmark. In order to construct a simulation envelope for a given function, we need to simulate CSR a lot of times, say $1000$ times. Then, we can calculate the function for each simulated point pattern. For every distance $d$, we sort the function values of the $1000$ simulated point patterns. Given a confidence level, say $95\%$, we can acquire the $25$th and $975$th value for every distance $d$. Thus, a simulation envelope is constructed.

Simulation Envelope for G function¶

Genv class in pysal.

In [14]:

realizations = PoissonPointProcess(pp.window, pp.n, 100, asPP=True) # simulate CSR 100 times
genv = Genv(pp, intervals=20, realizations=realizations) # call Genv to generate simulation envelope
genv

Out[14]:

<pointpats.distance_statistics.Genv at 0x12942bb50>

In [15]:

genv.observed

Out[15]:

array([[ 0.        ,  0.        ],
       [ 1.73156208,  0.        ],
       [ 3.46312417,  0.        ],
       [ 5.19468625,  0.        ],
       [ 6.92624834,  0.        ],
       [ 8.65781042,  0.        ],
       [10.3893725 ,  0.16666667],
       [12.12093459,  0.16666667],
       [13.85249667,  0.16666667],
       [15.58405875,  0.16666667],
       [17.31562084,  0.25      ],
       [19.04718292,  0.25      ],
       [20.77874501,  0.25      ],
       [22.51030709,  0.58333333],
       [24.24186917,  0.58333333],
       [25.97343126,  0.83333333],
       [27.70499334,  0.83333333],
       [29.43655542,  0.83333333],
       [31.16811751,  0.91666667],
       [32.89967959,  0.91666667],
       [34.63124168,  1.        ],
       [36.36280376,  1.        ]])

In [16]:

genv.plot()

In the above figure, LB and UB comprise the simulation envelope. CSR is the mean function calculated from the simulated data. G is the function estimated from the observed point pattern. It is well below the simulation envelope. We can infer that the underlying point process is a regular one.

In [17]:

g_test = ripley.g_test(points, keep_replications=True)
g_test

Out[17]:

GtestResult(support=array([ 0.        ,  1.82269693,  3.64539386,  5.46809079,  7.29078772,
        9.11348465, 10.93618158, 12.75887851, 14.58157544, 16.40427237,
       18.2269693 , 20.04966623, 21.87236316, 23.69506009, 25.51775702,
       27.34045395, 29.16315088, 30.98584782, 32.80854475, 34.63124168]), statistic=array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.0952381 , 0.0952381 , 0.0952381 , 0.0952381 , 0.19047619,
       0.19047619, 0.19047619, 0.28571429, 0.47619048, 0.57142857,
       0.76190476, 0.85714286, 0.95238095, 0.95238095, 1.        ]), pvalue=array([0.    , 0.    , 0.    , 0.    , 0.    , 0.3179, 0.1553, 0.0608,
       0.0189, 0.0252, 0.0058, 0.002 , 0.0014, 0.0084, 0.0102, 0.075 ,
       0.1346, 0.3083, 0.1204, 0.    ]), replications=array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.2       , 0.        , ..., 0.        , 0.10526316,
        0.        ],
       ...,
       [0.95833333, 0.95      , 1.        , ..., 0.85714286, 1.        ,
        0.95238095],
       [1.        , 1.        , 1.        , ..., 0.95238095, 1.        ,
        0.95238095],
       [1.        , 1.        , 1.        , ..., 1.        , 1.        ,
        1.        ]]))

In [18]:

plt.fill_between(g_test.support, 
                 *np.percentile(g_test.replications, 
                                q=(2.5, 97.5), axis=1),
                 color='grey', alpha=.2, label='95% CI')
plt.plot(g_test.support, g_test.statistic, color='orangered', label='Observed')
plt.legend()
plt.show()

Simulation Envelope for F function¶

Fenv class in pysal.

In [19]:

fenv = Fenv(pp, intervals=20, realizations=realizations)
fenv.plot()

In [20]:

f_test = ripley.f_test(points, keep_replications=True)
f_test

Out[20]:

FtestResult(support=array([ 0.        ,  1.82269693,  3.64539386,  5.46809079,  7.29078772,
        9.11348465, 10.93618158, 12.75887851, 14.58157544, 16.40427237,
       18.2269693 , 20.04966623, 21.87236316, 23.69506009, 25.51775702,
       27.34045395, 29.16315088, 30.98584782, 32.80854475, 34.63124168]), statistic=array([0.   , 0.01 , 0.045, 0.103, 0.193, 0.297, 0.41 , 0.521, 0.646,
       0.775, 0.869, 0.931, 0.971, 0.997, 1.   , 1.   , 1.   , 1.   ,
       1.   , 1.   ]), pvalue=array([0.    , 0.0409, 0.013 , 0.0082, 0.0718, 0.1941, 0.3569, 0.4831,
       0.2651, 0.0693, 0.0267, 0.0136, 0.008 , 0.002 , 0.0042, 0.0211,
       0.0604, 0.1262, 0.2283, 0.    ]), replications=array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.01003009, 0.01484624, 0.015     , ..., 0.01778243, 0.0167364 ,
        0.0079096 ],
       [0.05616851, 0.05408271, 0.068     , ..., 0.07217573, 0.06276151,
        0.0519774 ],
       ...,
       [0.98996991, 0.95015907, 0.998     , ..., 0.95711297, 0.96025105,
        0.96497175],
       [0.99498495, 0.97348887, 1.        , ..., 0.9832636 , 0.98221757,
        0.98418079],
       [1.        , 1.        , 1.        , ..., 1.        , 1.        ,
        1.        ]]))

In [21]:

plt.fill_between(f_test.support, 
                 *np.percentile(f_test.replications, 
                                q=(2.5, 97.5), axis=1),
                 color='grey', alpha=.2, label='95% CI')
plt.plot(f_test.support, f_test.statistic, color='orangered',
         label='Observed')
plt.legend()
plt.show()

Simulation Envelope for J function¶

Jenv class in pysal.

In [22]:

jenv = Jenv(pp, intervals=20, realizations=realizations)
jenv.plot()

In [23]:

j_test = ripley.j_test(points, keep_replications=True)
j_test

Out[23]:

JtestResult(support=array([ 0.        ,  1.82269693,  3.64539386,  5.46809079,  7.29078772,
        9.11348465, 10.93618158, 12.75887851, 14.58157544, 16.40427237,
       18.2269693 , 20.04966623, 21.87236316, 23.69506009]), statistic=array([  1.        ,   1.00908174,   1.05374078,   1.14810563,
         1.27226463,   1.27420999,   1.53186275,   1.97472354,
         2.85388128,   3.86597938,   6.30252101,  12.29508197,
        26.51515152, 208.33333333]), pvalue=array([0.000e+00, 9.720e-02, 3.000e-01, 4.309e-01, 2.423e-01, 2.479e-01,
       1.459e-01, 6.450e-02, 1.790e-02, 1.160e-02, 4.300e-03, 1.500e-03,
       1.000e-03, 1.000e-04]), replications=array([[1.        , 1.        , 1.        , ..., 1.        , 1.        ,
        1.        ],
       [1.01425439, 1.01754386, 1.01853871, ..., 1.02073171, 1.01914414,
        1.01380898],
       [0.74425287, 1.06879607, 1.06864989, ..., 1.08279431, 1.09167672,
        1.05635492],
       ...,
       [1.15625   , 1.02112676, 2.55890411, ..., 0.        , 1.86213992,
        0.        ],
       [2.5       , 2.68518519, 5.49411765, ..., 0.        , 4.18981481,
        0.        ],
       [1.        , 1.        , 1.        , ..., 1.        , 1.        ,
        1.        ]]))

In [24]:

plt.fill_between(j_test.support, 
                 *np.percentile(j_test.replications, 
                                q=(2.5, 97.5), axis=1),
                 color='grey', alpha=.2, label='95% CI')
plt.plot(j_test.support, j_test.statistic, color='orangered',
         label='Observed')
plt.legend()
plt.show()

Simulation Envelope for K function¶

Kenv class in pysal.

In [25]:

kenv = Kenv(pp, intervals=20, realizations=realizations)
kenv.plot()

In [26]:

k_test = ripley.k_test(points, keep_replications=True)
k_test

Out[26]:

KtestResult(support=array([ 0.        ,  1.82269693,  3.64539386,  5.46809079,  7.29078772,
        9.11348465, 10.93618158, 12.75887851, 14.58157544, 16.40427237,
       18.2269693 , 20.04966623, 21.87236316, 23.69506009, 25.51775702,
       27.34045395, 29.16315088, 30.98584782, 32.80854475, 34.63124168]), statistic=array([   0.        ,    0.        ,    0.        ,    0.        ,
          0.        ,  106.08611111,  106.08611111,  106.08611111,
        106.08611111,  212.17222222,  212.17222222,  212.17222222,
        318.25833333,  530.43055556,  636.51666667,  848.68888889,
       1060.86111111, 1273.03333333, 1273.03333333, 1273.03333333]), pvalue=array([0.    , 0.    , 0.    , 0.    , 0.    , 0.3985, 0.2023, 0.0788,
       0.0232, 0.0451, 0.0097, 0.001 , 0.0029, 0.031 , 0.0352, 0.1081,
       0.2332, 0.3624, 0.1898, 0.0815]), replications=array([[   0.        ,    0.        ,    0.        , ...,    0.        ,
           0.        ,    0.        ],
       [   0.        ,    0.        ,    0.        , ...,    0.        ,
           0.        ,    0.        ],
       [   0.        ,    0.        ,    0.        , ...,   77.81536443,
           0.        ,    0.        ],
       ...,
       [1197.29622961, 1501.02714205, 1245.41812316, ..., 1634.12265299,
        1482.57903984, 1393.33205578],
       [1473.59535951, 1626.11273722, 1512.29343527, ..., 1711.93801742,
        1575.24022983, 1486.2208595 ],
       [1841.99419939, 1751.19833239, 1690.21031   , ..., 1867.56874628,
        1667.90141982, 1486.2208595 ]]))

In [27]:

plt.plot(k_test.support, 
         k_test.replications, 
         color='k', alpha=.01)
plt.plot(k_test.support, 
         k_test.replications[:,0], 
         color='k', label='Simulations')
plt.plot(k_test.support, k_test.statistic, color='orangered',
         label='Observed')
plt.title('Alternative Plot Style')
plt.legend()
plt.show()

Simulation Envelope for L function¶

Lenv class in pysal.

In [28]:

lenv = Lenv(pp, intervals=20, realizations=realizations)
lenv.plot()

In [29]:

l_test = ripley.l_test(points, linearized=True, keep_replications=True)
l_test

Out[29]:

LtestResult(support=array([ 0.        ,  1.82269693,  3.64539386,  5.46809079,  7.29078772,
        9.11348465, 10.93618158, 12.75887851, 14.58157544, 16.40427237,
       18.2269693 , 20.04966623, 21.87236316, 23.69506009, 25.51775702,
       27.34045395, 29.16315088, 30.98584782, 32.80854475, 34.63124168]), statistic=array([  0.        ,  -1.82269693,  -3.64539386,  -5.46809079,
        -7.29078772,  -3.30243845,  -5.12513538,  -6.94783231,
        -8.77052924,  -8.18621202, -10.00890895, -11.83160588,
       -11.8073359 , -10.70116577, -11.28365896, -10.90433326,
       -10.7870093 , -10.85579328, -12.67849021, -14.50118714]), pvalue=array([0.    , 0.    , 0.    , 0.    , 0.    , 0.4021, 0.2052, 0.0869,
       0.0273, 0.0449, 0.0112, 0.0021, 0.0027, 0.0314, 0.0342, 0.1121,
       0.2296, 0.364 , 0.1906, 0.0819]), replications=array([[  0.        ,   0.        ,   0.        , ...,   0.        ,
          0.        ,   0.        ],
       [ -1.82269693,  -1.82269693,  -1.82269693, ...,  -1.82269693,
         -1.82269693,   2.58715172],
       [ -3.64539386,  -3.64539386,   1.38442619, ...,  -3.64539386,
          1.34537542,   0.76445479],
       ...,
       [-12.6454847 , -10.97749319,  -9.06137052, ...,  -9.76660799,
        -11.65668151, -12.27644449],
       [-12.99868847, -12.18440487, -10.31450566, ...,  -9.88914623,
        -12.23107585, -13.08710202],
       [-14.8213854 , -13.40917683, -12.13720259, ..., -11.1724894 ,
        -13.45720087, -14.42277641]]))

In [30]:

plt.plot(l_test.support, 
         l_test.replications, 
         color='k', alpha=.01)
plt.plot(l_test.support, 
         l_test.replications[:,0], 
         color='k', label='Simulations')
plt.plot(l_test.support, l_test.statistic, color='orangered',
         label='Observed')
plt.title('Alternative Plot Style')
plt.legend()
plt.show()

CSR Example¶

In this example, we are going to generate a point pattern as the "observed" point pattern. The data generating process is CSR. Then, we will simulate CSR in the same domain for 100 times and construct a simulation envelope for each function.

In [31]:

from libpysal.cg import shapely_ext, asShape
from pointpats import Window
import geopandas
df = geopandas.read_file(ps.examples.get_path("vautm17n.shp"))
state = df.geometry.cascaded_union

Generate the point pattern pp (size 100) from CSR as the "observed" point pattern.

In [32]:

a = [[1],[1,2]]
np.asarray(a)

Out[32]:

array([list([1]), list([1, 2])], dtype=object)

In [33]:

n = 100
samples = 1
pp = PoissonPointProcess(Window(asShape(state).parts), 
                         n, samples, asPP=True)
pp.realizations[0]

Out[33]:

<pointpats.pointpattern.PointPattern at 0x1303d2070>

In [34]:

pp.n

Out[34]:

Simulate CSR in the same domian for 100 times which would be used for constructing simulation envelope under the null hypothesis of CSR.

In [35]:

csrs = PoissonPointProcess(pp.window, 100, 100, asPP=True)
csrs

Out[35]:

<pointpats.process.PoissonPointProcess at 0x1303d2730>

In [36]:

import importlib
ripley = importlib.reload(ripley)

In [46]:

simulations = ripley.simulate(state, size=(100,100))

In [59]:

plt.scatter(*simulations[23].T)
geopandas.GeoDataFrame(geometry=[state]).boundary.plot(ax=plt.gca())

Out[59]:

<matplotlib.axes._subplots.AxesSubplot at 0x13a8cdc40>

Construct the simulation envelope for $G$ function.

In [60]:

genv = Genv(pp.realizations[0], realizations=csrs)
genv.plot()

In [63]:

ripley.g_test(simulations[10], hull=state).pvalues

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-63-b71a997a58e0> in <module>
----> 1 ripley.g_test(simulations[10], hull=state).pvalues

AttributeError: 'GtestResult' object has no attribute 'pvalues'

Since the "observed" $G$ is well contained by the simulation envelope, we infer that the underlying point process is a random process.

In [39]:

genv.low # lower bound of the simulation envelope for G

Out[39]:

array([0.  , 0.04, 0.25, 0.55, 0.76, 0.88, 0.93, 0.96, 0.97, 0.98, 0.98,
       0.98])

In [40]:

genv.high # higher bound of the simulation envelope for G

Out[40]:

array([0.  , 0.2 , 0.51, 0.73, 0.9 , 0.96, 0.99, 1.  , 1.  , 1.  , 1.  ,
       1.  ])

Construct the simulation envelope for $F$ function.

In [41]:

fenv = Fenv(pp.realizations[0], realizations=csrs)
fenv.plot()

Construct the simulation envelope for $J$ function.

In [42]:

jenv = Jenv(pp.realizations[0], realizations=csrs)
jenv.plot()

Construct the simulation envelope for $K$ function.

In [43]:

kenv = Kenv(pp.realizations[0], realizations=csrs)
kenv.plot()

Construct the simulation envelope for $L$ function.

In [44]:

lenv = Lenv(pp.realizations[0], realizations=csrs)
lenv.plot()

In [ ]: