{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Distance Based Statistical Method for Planar Point Patterns\n", "\n", "**Authors: Serge Rey and Wei Kang**\n", "\n", "## Introduction\n", "\n", "Distance-based methods for point patterns are of three types:\n", "\n", "* [Mean Nearest Neighbor Distance Statistics](#Mean-Nearest-Neighbor-Distance-Statistics)\n", "* [Nearest Neighbor Distance Functions](#Nearest-Neighbor-Distance-Functions)\n", "* [Interevent Distance Functions](#Interevent-Distance-Functions)\n", "\n", "In addition, we introduce a computational technique, [Simulation Envelopes](#Simulation-Envelopes), to aid in making inferences about the data generating process. An [example](#CSR-Example) demonstrates how to use and interpret simulation envelopes." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from scipy import spatial\n", "import libpysal as ps\n", "import numpy as np\n", "from pointpats import ripley\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mean Nearest Neighbor Distance Statistics\n", "\n", "The nearest neighbor(s) of a point $u$ are the point(s) $N(u)$ that satisfy the condition\n", "$$d_{u,N(u)} \\leq d_{u,j} \\forall j \\in S - u$$\n", "\n", "The distance between the nearest neighbor(s) $N(u)$ and the point $u$ is the nearest neighbor distance for $u$. After finding the nearest neighbor(s) of every point and calculating the corresponding distances, we obtain the mean nearest neighbor distance by averaging these distances.\n", "\n", "Clark and Evans (1954) demonstrated that the mean nearest neighbor distance statistic follows a normal distribution under the null hypothesis that the underlying spatial process is completely spatially random (CSR). We can use this test statistic to determine whether the point pattern is the outcome of CSR. 
If not, is it the outcome of a clustered or regular\n", "spatial process?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "points = np.array([[66.22, 32.54], [22.52, 22.39], [31.01, 81.21],\n", " [9.47, 31.02], [30.78, 60.10], [75.21, 58.93],\n", " [79.26, 7.68], [8.23, 39.93], [98.73, 77.17],\n", " [89.78, 42.53], [65.19, 92.08], [54.46, 8.48]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Nearest Neighbor Distance Functions\n", "\n", "Nearest neighbor distance distribution functions (including the nearest “event-to-event” and “point-to-event” distance distribution functions) of a point process are cumulative distribution functions of several kinds: $G$, $F$, and $J$. By comparing the distance function of the observed point pattern with that of a point pattern from a CSR process, we can infer whether the underlying spatial process of the observed point pattern is CSR or not for a given confidence level." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### $G$ function - event-to-event\n", "\n", "The $G$ function is a kind of \"cumulative\" density describing the distribution of distances within a point pattern. For a given distance $d$, $G(d)$ is the proportion of nearest neighbor distances that are less than $d$. To express this, we first need to define the nearest neighbor distance, which is the smallest distance from each observation $i$ to some other observation $j$, where $j \\neq i$:\n", "$$\\min_{j\\neq i}\\{d_{ij}\\} = d^*_i$$\n", "\n", "With this, we can define the $G$ function as a cumulative distribution function:\n", "$$G(d) = \\frac{1}{N}\\sum_{i=1}^N \\mathcal{I}(d^*_i < d)$$\n", "where $\\mathcal{I}(.)$ is an *indicator function* that is $1$ when the argument is true and is zero otherwise. In simple terms, $G(d)$ gives the proportion of nearest neighbor distances ($d^*_i$) that are smaller than $d$; when $d$ is very small, $G(d)$ is close to zero. 
When $d$ is large, $G(d)$ approaches one. \n", "\n", "Analytical results about $G$ are available assuming that the \"null\" process of locating points in the study area is completely spatially random. In a completely spatially random process, the $G(d)$ value should be:\n", "$$\n", "G(d) = 1-e^{-\\lambda \\pi d^2}\n", "$$\n", "Here $\\lambda$ is the intensity of the process (the expected number of points per unit area). Practically, we assess statistical significance for the $G(d)$ function using simulations, where a known spatially-random process is generated and then analyzed. This partially accounts for issues with irregularly-shaped study areas, where locations of points are constrained. \n", "\n", "In practice, we use the ripley.g_test function to conduct a test on $G(d)$. It estimates the value of $G(d)$ at a set of distance values (called the support). The following computes the $G$ function for ten values of $d$, ranging from the smallest possible to the largest value in the data:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "g_test = ripley.g_test(points, support=10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All statistical tests in pointpats.distance_statistics return a collections.namedtuple object with the following fields:\n", "- support, which contains the distance values ($d$) used to compute the distance statistic. \n", "- statistic, which expresses the value of the requested function at each value of $d$ in the support. \n", "- pvalue, which expresses the fraction of simulations (under a completely spatially random process) that are more extreme than the observed statistic. \n", "- simulations, which stores the simulated values of the statistic under a spatially random process. By default, this is *not* saved (for efficiency reasons), but it can be requested using keep_simulations. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0. 
, 3.84791574, 7.69583148, 11.54374723, 15.39166297,\n", " 19.23957871, 23.08749445, 26.93541019, 30.78332593, 34.63124168])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g_test.support" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0. , 0. , 0. , 0.16666667, 0.16666667,\n", " 0.25 , 0.58333333, 0.83333333, 0.91666667, 1. ])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g_test.statistic" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.00e+00, 0.00e+00, 0.00e+00, 2.89e-02, 1.10e-03, 1.00e-04,\n", " 4.30e-03, 6.10e-02, 7.33e-02, 0.00e+00])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "g_test.pvalue" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "g_test.simulations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make a plot of the statistic, the statistic is generally plotted on the vertical axis and the support on the horizontal axis. Here, we will show the median simulated value of $G(d)$ as well." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "g_test = ripley.g_test(points, support=10, keep_simulations=True)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "