{ "cells": [ { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "view-in-github" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "DNWKwG4TAcD2" }, "source": [ "\n", "
\"Logo
\n", "# Apache Sedona with PySpark\n", "\n", "Apache Sedona™ is a prime example of a distributed engine built on top of Spark, specifically designed for geographic data processing.\n", "\n", "The home page describes Apache Sedona ([https://sedona.apache.org/](https://sedona.apache.org/)) as:\n", "\n", "> *a cluster computing system for processing large-scale spatial data. Sedona extends existing cluster computing systems, such as Apache Spark, Apache Flink, and Snowflake, with a set of out-of-the-box distributed Spatial Datasets and Spatial SQL that efficiently load, process, and analyze large-scale spatial data across machines.*\n", "\n", "In this notebook we are going to execute a basic Sedona demonstration using PySpark. The Sedona notebook starts below at [Apache Sedona Core demo](#scrollTo=Apache_Sedona_Core_demo).\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "3AQNoWmX_B78" }, "source": [ "## Install Apache Sedona, PySpark, and required libraries\n", "\n", "To start with, we are going to install `apache-sedona` and PySpark making sure that we have the desired Spark version.\n", "\n", "\n", "The required packages are specified in this [Pipfile](https://github.com/apache/sedona/blob/master/python/Pipfile) under `[packages]`:\n", "\n", "```\n", "[packages]\n", "pandas=\"<=1.5.3\"\n", "geopandas=\"*\"\n", "numpy=\"<2\"\n", "shapely=\">=1.7.0\"\n", "pyspark=\">=2.3.0\"\n", "attrs=\"*\"\n", "pyarrow=\"*\"\n", "keplergl = \"==0.3.2\"\n", "pydeck = \"===0.8.0\"\n", "rasterio = \">=1.2.10\"\n", "```" ] }, { "cell_type": "markdown", "metadata": { "id": "rq9dEksqznlQ" }, "source": [ "Install Apache Sedona without Spark. To install Spark as well you can use `pip install apache-sedona[spark]` but we chose to use the Spark engine that comes with PySpark." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:41:37.588324Z", "iopub.status.busy": "2024-10-14T20:41:37.587814Z", "iopub.status.idle": "2024-10-14T20:41:40.227352Z", "shell.execute_reply": "2024-10-14T20:41:40.226623Z" }, "id": "SOB7DNZ6AOio", "outputId": "64cd7ac7-5b5c-442c-ce57-bff88ab495ae" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting apache-sedona\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Downloading apache_sedona-1.6.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.0 kB)\r\n", "Requirement already satisfied: attrs in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from apache-sedona) (24.2.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Collecting shapely>=1.7.0 (from apache-sedona)\r\n", " Downloading shapely-2.0.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.0 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Collecting rasterio>=1.2.10 (from apache-sedona)\r\n", " Downloading rasterio-1.3.11-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (14 kB)\r\n", "Collecting affine (from rasterio>=1.2.10->apache-sedona)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Downloading affine-2.4.0-py3-none-any.whl.metadata (4.0 kB)\r\n", "Requirement already satisfied: certifi in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from rasterio>=1.2.10->apache-sedona) (2024.8.30)\r\n", "Collecting click>=4.0 (from rasterio>=1.2.10->apache-sedona)\r\n", " Downloading click-8.1.7-py3-none-any.whl.metadata (3.0 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Collecting cligj>=0.5 (from rasterio>=1.2.10->apache-sedona)\r\n", " Downloading cligj-0.7.2-py3-none-any.whl.metadata (5.0 kB)\r\n", "Requirement already satisfied: numpy in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from rasterio>=1.2.10->apache-sedona) (1.24.4)\r\n", "Collecting snuggs>=1.4.1 (from rasterio>=1.2.10->apache-sedona)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Downloading snuggs-1.4.7-py3-none-any.whl.metadata (3.4 kB)\r\n", "Collecting click-plugins (from rasterio>=1.2.10->apache-sedona)\r\n", " Downloading click_plugins-1.1.1-py2.py3-none-any.whl.metadata (6.4 kB)\r\n", "Requirement already satisfied: setuptools in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from rasterio>=1.2.10->apache-sedona) (56.0.0)\r\n", "Requirement already satisfied: importlib-metadata in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from rasterio>=1.2.10->apache-sedona) (8.5.0)\r\n", "Requirement already satisfied: pyparsing>=2.1.6 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from snuggs>=1.4.1->rasterio>=1.2.10->apache-sedona) (3.1.4)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: zipp>=3.20 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from importlib-metadata->rasterio>=1.2.10->apache-sedona) (3.20.2)\r\n", "Downloading apache_sedona-1.6.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (177 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading rasterio-1.3.11-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (21.8 MB)\r\n", "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/21.8 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m21.8/21.8 MB\u001b[0m \u001b[31m136.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", "\u001b[?25h" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading shapely-2.0.6-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)\r\n", "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/2.5 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m\r", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.5/2.5 MB\u001b[0m \u001b[31m126.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", "\u001b[?25hDownloading click-8.1.7-py3-none-any.whl (97 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading cligj-0.7.2-py3-none-any.whl (7.1 kB)\r\n", "Downloading snuggs-1.4.7-py3-none-any.whl (5.4 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading affine-2.4.0-py3-none-any.whl (15 kB)\r\n", "Downloading click_plugins-1.1.1-py2.py3-none-any.whl (7.5 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Installing collected packages: snuggs, shapely, click, affine, cligj, click-plugins, rasterio, apache-sedona\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Successfully installed affine-2.4.0 apache-sedona-1.6.1 click-8.1.7 click-plugins-1.1.1 cligj-0.7.2 rasterio-1.3.11 shapely-2.0.6 snuggs-1.4.7\r\n" ] } ], "source": [ "!pip install apache-sedona" ] }, { "cell_type": "markdown", "metadata": { "id": "QOQHjm-fy8Uj" }, "source": [ "For the sake of this tutorial we are going to use the Spark engine that is included in the Pyspark distribution. Since Sedona needs Spark $3.4.0$ we need to make sure that we choose the correct PySpark version." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:41:40.230064Z", "iopub.status.busy": "2024-10-14T20:41:40.229844Z", "iopub.status.idle": "2024-10-14T20:42:04.624932Z", "shell.execute_reply": "2024-10-14T20:42:04.624105Z" }, "id": "a6jgrBFZs_Qx", "outputId": "0f02e1f6-7cc4-472e-b3b2-7238a98a792c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting pyspark==3.4.0\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Downloading pyspark-3.4.0.tar.gz (310.8 MB)\r\n", "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/310.8 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[91m━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m34.3/310.8 MB\u001b[0m \u001b[31m207.4 MB/s\u001b[0m eta \u001b[36m0:00:02\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[91m━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m84.1/310.8 MB\u001b[0m \u001b[31m228.7 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━\u001b[0m\u001b[90m╺\u001b[0m\u001b[90m━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m134.5/310.8 MB\u001b[0m \u001b[31m236.0 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━━━━━━━\u001b[0m \u001b[32m183.2/310.8 MB\u001b[0m \u001b[31m237.3 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━━━━━━━\u001b[0m \u001b[32m232.0/310.8 MB\u001b[0m \u001b[31m238.0 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m\u001b[90m━━━\u001b[0m \u001b[32m282.3/310.8 MB\u001b[0m \u001b[31m245.3 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m310.6/310.8 MB\u001b[0m \u001b[31m245.1 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m310.6/310.8 MB\u001b[0m \u001b[31m245.1 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m310.6/310.8 MB\u001b[0m \u001b[31m245.1 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[91m╸\u001b[0m \u001b[32m310.6/310.8 MB\u001b[0m \u001b[31m245.1 MB/s\u001b[0m eta \u001b[36m0:00:01\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m310.8/310.8 MB\u001b[0m \u001b[31m135.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", "\u001b[?25h" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Installing build dependencies ... \u001b[?25l-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b|" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b/" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \bdone\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[?25h Getting requirements to build wheel ... \u001b[?25l-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \bdone\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[?25h Preparing metadata (pyproject.toml) ... \u001b[?25l-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \bdone\r\n", "\u001b[?25hCollecting py4j==0.10.9.7 (from pyspark==3.4.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Downloading py4j-0.10.9.7-py2.py3-none-any.whl.metadata (1.5 kB)\r\n", "Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)\r\n", "Building wheels for collected packages: pyspark\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Building wheel for pyspark (pyproject.toml) ... \u001b[?25l-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b|" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b/" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b|" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b/" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b|" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b/" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b|" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b/" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b|" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b/" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b|" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b/" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b|" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b/" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b|" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b/" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b|" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b/" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b|" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b/" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b|" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b/" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b-" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b\\" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b|" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b/" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\b \b-\b \bdone\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[?25h Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317121 sha256=3fa98b3b6e594f997111ac6bda1a8c0447aad8149992847de7d91b2071ecbd91\r\n", " Stored in directory: /home/runner/.cache/pip/wheels/27/3e/a7/888155c6a7f230b13a394f4999b90fdfaed00596c68d3de307\r\n", "Successfully built pyspark\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Installing collected packages: py4j, pyspark\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Successfully installed py4j-0.10.9.7 pyspark-3.4.0\r\n" ] } ], "source": [ "!pip install pyspark==3.4.0" ] }, { "cell_type": "markdown", "metadata": { "id": "NX8-hGa90tAu" }, "source": [ "Verify that PySpark is using Spark version $3.4.0$." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:04.627901Z", "iopub.status.busy": "2024-10-14T20:42:04.627397Z", "iopub.status.idle": "2024-10-14T20:42:06.409429Z", "shell.execute_reply": "2024-10-14T20:42:06.408544Z" }, "id": "WNtkrTykyYgl", "outputId": "7a48683f-8306-4b54-c8b6-9b01592435b9" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Welcome to\r\n", " ____ __\r\n", " / __/__ ___ _____/ /__\r\n", " _\\ \\/ _ \\/ _ `/ __/ '_/\r\n", " /___/ .__/\\_,_/_/ /_/\\_\\ version 3.4.0\r\n", " /_/\r\n", " \r\n", "Using Scala version 2.12.17, OpenJDK 64-Bit Server VM, 11.0.24\r\n", "Branch HEAD\r\n", "Compiled by user xinrong.meng on 2023-04-07T02:18:01Z\r\n", "Revision 87a5442f7ed96b11051d8a9333476d080054e5a0\r\n", "Url https://github.com/apache/spark\r\n", "Type --help for more information.\r\n" ] } ], "source": [ "!pyspark --version" ] }, { "cell_type": "markdown", "metadata": { "id": "76yI9hGZz_Ss" }, "source": [ "### Install Geopandas\n", "\n", "The libraries `numpy`, `pandas`, `geopandas`, and `shapely` are available by default on Google Colab." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:06.412452Z", "iopub.status.busy": "2024-10-14T20:42:06.412011Z", "iopub.status.idle": "2024-10-14T20:42:08.676601Z", "shell.execute_reply": "2024-10-14T20:42:08.675765Z" }, "id": "KliMM6ka0Qee", "outputId": "af1179f9-3c2b-466d-c511-6804648c1029" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting geopandas\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ " Downloading geopandas-0.13.2-py3-none-any.whl.metadata (1.5 kB)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Collecting fiona>=1.8.19 (from geopandas)\r\n", " Downloading fiona-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (56 kB)\r\n", "Requirement already satisfied: packaging in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from geopandas) (24.1)\r\n", "Requirement already satisfied: pandas>=1.1.0 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from geopandas) (2.0.3)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Collecting pyproj>=3.0.1 (from geopandas)\r\n", " Downloading pyproj-3.5.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (28 kB)\r\n", "Requirement already satisfied: shapely>=1.7.1 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from geopandas) (2.0.6)\r\n", "Requirement already satisfied: attrs>=19.2.0 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from fiona>=1.8.19->geopandas) (24.2.0)\r\n", "Requirement already satisfied: certifi in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from fiona>=1.8.19->geopandas) (2024.8.30)\r\n", "Requirement already satisfied: click~=8.0 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from fiona>=1.8.19->geopandas) (8.1.7)\r\n", "Requirement already satisfied: click-plugins>=1.0 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from fiona>=1.8.19->geopandas) (1.1.1)\r\n", "Requirement already satisfied: cligj>=0.5 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from fiona>=1.8.19->geopandas) (0.7.2)\r\n", "Requirement already satisfied: importlib-metadata in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from fiona>=1.8.19->geopandas) (8.5.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: python-dateutil>=2.8.2 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas>=1.1.0->geopandas) (2.9.0.post0)\r\n", "Requirement already satisfied: pytz>=2020.1 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas>=1.1.0->geopandas) (2024.2)\r\n", "Requirement already satisfied: tzdata>=2022.1 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas>=1.1.0->geopandas) (2024.2)\r\n", "Requirement already satisfied: numpy>=1.20.3 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from pandas>=1.1.0->geopandas) (1.24.4)\r\n", "Requirement already satisfied: six>=1.5 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from python-dateutil>=2.8.2->pandas>=1.1.0->geopandas) (1.16.0)\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: zipp>=3.20 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from importlib-metadata->fiona>=1.8.19->geopandas) (3.20.2)\r\n", "Downloading geopandas-0.13.2-py3-none-any.whl (1.1 MB)\r\n", "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/1.1 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m31.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", "\u001b[?25hDownloading fiona-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)\r\n", "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/17.3 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m17.3/17.3 MB\u001b[0m \u001b[31m143.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", "\u001b[?25hDownloading pyproj-3.5.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)\r\n", "\u001b[?25l \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m0.0/7.8 MB\u001b[0m \u001b[31m?\u001b[0m eta \u001b[36m-:--:--\u001b[0m" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\r", "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.8/7.8 MB\u001b[0m \u001b[31m137.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\r\n", "\u001b[?25h" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Installing collected packages: pyproj, fiona, geopandas\r\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Successfully installed fiona-1.10.1 geopandas-0.13.2 pyproj-3.5.0\r\n" ] } ], "source": [ "!pip install geopandas" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:08.679333Z", "iopub.status.busy": "2024-10-14T20:42:08.679056Z", "iopub.status.idle": "2024-10-14T20:42:09.569566Z", "shell.execute_reply": "2024-10-14T20:42:09.568798Z" }, "id": "1ua2XEC51QFT", "outputId": "a03ccb00-ffea-4038-bb07-0535bf06de61" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: shapely in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (2.0.6)\r\n", "Requirement already satisfied: numpy<3,>=1.14 in /opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages (from shapely) (1.24.4)\r\n" ] } ], "source": [ "!pip install shapely" ] }, { "cell_type": "markdown", "metadata": { "id": "rhVHMoa31bQ6" }, "source": [ "## Download the data\n", "\n", "We are going to download the data from Sedona's GitHub repository." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:09.572624Z", "iopub.status.busy": "2024-10-14T20:42:09.572084Z", "iopub.status.idle": "2024-10-14T20:42:09.581661Z", "shell.execute_reply": "2024-10-14T20:42:09.581133Z" }, "id": "7nhI9NX9impw" }, "outputs": [], "source": [ "import json\n", "import os\n", "import urllib\n", "import base64\n", "\n", "def download_blob(url, path, localfile):\n", " print(f\"Downloading blob to localfile: {localfile} from {url}\")\n", "\n", " # Fetch the JSON data from the URL\n", " with urllib.request.urlopen(url) as response:\n", " json_data = response.read().decode('utf-8')\n", "\n", " # Load the JSON data into a dictionary\n", " data = json.loads(json_data)\n", "\n", " # Extract the Base64 content\n", " base64_content = data['content']\n", "\n", " # Decode the Base64 content\n", " decoded_content = base64.b64decode(base64_content)\n", "\n", " try:\n", " # Attempt to decode as UTF-8 text\n", " decoded_text = decoded_content.decode('utf-8')\n", " with open(localfile, 'w') as f:\n", " f.write(decoded_text)\n", " except UnicodeDecodeError:\n", " # If text decoding fails, save as binary\n", " with open(localfile, 'wb') as f:\n", " f.write(decoded_content)\n", "\n", "\n", "def download_gitpath(url, path, localpath):\n", " \"\"\"\n", " Recursively downloads a specific path (directory or file) from a GitHub repository using the GitHub API\n", " and saves it to a specified local directory, preserving the repository's structure.\n", "\n", " Args:\n", " url (str):\n", " The GitHub API URL that points to the tree structure of the repository.\n", " This URL should provide a JSON response containing the file and directory information for the tree.\n", "\n", " path (str):\n", " The path within the repository that you want to download.\n", " This path is used to filter the relevant files and directories within the tree structure.\n", " Example: \"src/utils\" would download everything under the `src/utils` directory in the repository.\n", "\n", " localpath (str):\n", " The local directory path where the downloaded files and directories will be saved.\n", " If the directory does not exist, it will be created. The repository's structure will be mirrored in this location.\n", "\n", " Returns:\n", " None\n", "\n", " Example:\n", " tree_url = \"https://api.github.com/repos/{owner}/{repo_name}/git/trees/master?recursive=true\"\n", " tree_url = \"https://api.github.com/repos/apache/spark/git/trees/master?recursive=true\"\n", " repo_path = \"data/mllib/images\"\n", " local_dir = \"./downloaded_images\"\n", "\n", " download_gitpath(tree_url, repo_path, local_dir)\n", "\n", " How it works:\n", " 1. The function fetches the tree of files and directories from the GitHub API using the provided URL.\n", " 2. It filters the tree data to only include items that fall under the specified `path`.\n", " 3. Files (blobs) are downloaded and saved locally.\n", " 4. Directories (trees) are handled recursively by creating local directories and downloading their contents.\n", "\n", " Notes:\n", " - Ensure that you have access to the repository's GitHub API, especially if it's private (you may need a token).\n", " - This function will handle both text and binary files appropriately.\n", " - Error handling for network and API issues is minimal and could be enhanced.\n", " \"\"\"\n", " #print(f\"Processing path: {path} into local directory: {localpath}\")\n", "\n", " # Create the local directory if it doesn't exist\n", " os.makedirs(localpath, exist_ok=True)\n", "\n", " with urllib.request.urlopen(url) as response:\n", " json_data = response.read().decode('utf-8')\n", "\n", " # Load the JSON data into a dictionary\n", " data = json.loads(json_data)['tree']\n", "\n", " # Filter and map the paths to their URLs\n", " items = {x['path']: (x['url'], x['type']) for x in data if x['path'].startswith(path)}\n", "\n", " for item_path, (item_url, item_type) in items.items():\n", "\n", " # Handle blobs (files)\n", " if item_type == 'blob':\n", " local_file_path = os.path.join(localpath, os.path.relpath(item_path, path))\n", " download_blob(item_url, item_path, local_file_path)\n", "\n", " # Handle trees (directories)\n", " elif item_type == 'tree':\n", " new_local_dir = os.path.join(localpath, os.path.relpath(item_path, path))\n", " os.makedirs(new_local_dir, exist_ok=True)\n", " download_gitpath(item_url, item_path, new_local_dir)\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:09.583597Z", "iopub.status.busy": "2024-10-14T20:42:09.583394Z", "iopub.status.idle": "2024-10-14T20:42:17.922471Z", "shell.execute_reply": "2024-10-14T20:42:17.921877Z" }, "id": "-3fmC-hcuiH-", "outputId": "b6c340f1-b467-4902-afa5-f72f8d4397b1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/arealm-small.csv from https://api.github.com/repos/jiayuasu/sedona/git/blobs/1f090b8189a075988ddc1816aa20de77bd9c2905\n", "Downloading blob to localfile: ./data/county_small.tsv from https://api.github.com/repos/jiayuasu/sedona/git/blobs/6b90b565b6b30f5a88c203159b9b080f0f6520f6\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/county_small_wkb.tsv from https://api.github.com/repos/jiayuasu/sedona/git/blobs/6b8760e24b51e89802c724c3d8616e8d99d5c0d0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/gis_osm_pois_free_1.cpg from https://api.github.com/repos/jiayuasu/sedona/git/blobs/7edc66b06a9d03549d75d2c9cbb89f83c611ddd3\n", "Downloading blob to localfile: ./data/gis_osm_pois_free_1.dbf from https://api.github.com/repos/jiayuasu/sedona/git/blobs/61b1f3da529a27d9ba3560e8981f200e4f0977cb\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/gis_osm_pois_free_1.prj from https://api.github.com/repos/jiayuasu/sedona/git/blobs/8f73f480ffe4929be342f332ea5ee69fb0ef9fb4\n", "Downloading blob to localfile: ./data/gis_osm_pois_free_1.shp from https://api.github.com/repos/jiayuasu/sedona/git/blobs/fede38f6c34d2ec344658543dac87abb7ba84ee0\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/gis_osm_pois_free_1.shx from https://api.github.com/repos/jiayuasu/sedona/git/blobs/577b40830e9845c58f36e524792c598ac23cc835\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/ne_50m_admin_0_countries_lakes/ne_50m_admin_0_countries_lakes.cpg from https://api.github.com/repos/jiayuasu/sedona/git/blobs/7edc66b06a9d03549d75d2c9cbb89f83c611ddd3\n", "Downloading blob to localfile: ./data/ne_50m_admin_0_countries_lakes/ne_50m_admin_0_countries_lakes.dbf from https://api.github.com/repos/jiayuasu/sedona/git/blobs/03caa768d026ba1f61bf123340dc5a0600d43c09\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/ne_50m_admin_0_countries_lakes/ne_50m_admin_0_countries_lakes.prj from https://api.github.com/repos/jiayuasu/sedona/git/blobs/8f73f480ffe4929be342f332ea5ee69fb0ef9fb4\n", "Downloading blob to localfile: ./data/ne_50m_admin_0_countries_lakes/ne_50m_admin_0_countries_lakes.shp from https://api.github.com/repos/jiayuasu/sedona/git/blobs/e630c6f44eb605f71764c15130517a7767ccf520\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/ne_50m_admin_0_countries_lakes/ne_50m_admin_0_countries_lakes.shx from https://api.github.com/repos/jiayuasu/sedona/git/blobs/76a04349bbf11e581e1aa8c41b7ccac0eb548959\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/ne_50m_airports/ne_50m_airports.dbf from https://api.github.com/repos/jiayuasu/sedona/git/blobs/bdf30050021a27e1f5bc504b6454e2540fbe253c\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/ne_50m_airports/ne_50m_airports.prj from https://api.github.com/repos/jiayuasu/sedona/git/blobs/40dd8c6cdcbabeb4297cb137424805c3f3472ee2\n", "Downloading blob to localfile: ./data/ne_50m_airports/ne_50m_airports.shp from https://api.github.com/repos/jiayuasu/sedona/git/blobs/d1417b2ed8cf7ed53b816c07acac1fcf67300a6f\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/ne_50m_airports/ne_50m_airports.shx from https://api.github.com/repos/jiayuasu/sedona/git/blobs/b99c0fc7d5578b015ee1ccfd60d66cc0914b9d14\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/polygon/map.dbf from https://api.github.com/repos/jiayuasu/sedona/git/blobs/a71205ad62a1485d700625f73464bffc226b96be\n", "Downloading blob to localfile: ./data/polygon/map.shp from https://api.github.com/repos/jiayuasu/sedona/git/blobs/ab10fd7037950e9d41a4dcd7e10fcba72ad00a8d\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/polygon/map.shx from https://api.github.com/repos/jiayuasu/sedona/git/blobs/53a25eb312a5ae471370d248aa46698be33629d9\n", "Downloading blob to localfile: ./data/primaryroads-linestring.csv from https://api.github.com/repos/jiayuasu/sedona/git/blobs/6e57b87527b6c9d23327905d0e8c565bce6b0a0a\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/primaryroads-polygon.csv from https://api.github.com/repos/jiayuasu/sedona/git/blobs/eeb2f10678eb0c13d9ace21e5de9b597cc8355df\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/raster/T21HUB_4704_4736_8224_8256.tif from https://api.github.com/repos/jiayuasu/sedona/git/blobs/776d75342642fd994a7429c8145b651e9e45ed82\n", "Downloading blob to localfile: ./data/raster/test1.tiff from https://api.github.com/repos/jiayuasu/sedona/git/blobs/bebd68232e85a348c1bc1d067ddc95d38ad66c0f\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/raster/test5.tiff from https://api.github.com/repos/jiayuasu/sedona/git/blobs/6caabeadae31129afc2c4309348c5ff6a10d0d24\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/raster/vya_T21HUB_992_1024_4352_4384.tif from https://api.github.com/repos/jiayuasu/sedona/git/blobs/6df2b0a467b56e0fb49bd7844a7b8edcb0dcb355\n", "Downloading blob to localfile: ./data/testPolygon.json from https://api.github.com/repos/jiayuasu/sedona/git/blobs/b5802075d13b1f1bcca169f8fdb9eca94b84e361\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Downloading blob to localfile: ./data/testpoint.csv from https://api.github.com/repos/jiayuasu/sedona/git/blobs/05ff16a1be4e774d5954e83ddeef23f931bf5f75\n", "Downloading blob to localfile: ./data/zcta510-small.csv from https://api.github.com/repos/jiayuasu/sedona/git/blobs/ced508092e7bebc389af9438b1ae2d5b5e1e2808\n" ] } ], "source": [ "url = 'https://api.github.com/repos/jiayuasu/sedona/git/trees/master?recursive=true'\n", "path = 'docs/usecases/data/'\n", "# download only if data folder does not exist\n", "if not os.path.exists('./data'):\n", " os.makedirs('./data')\n", " download_gitpath(url, path, './data')" ] }, { "cell_type": "markdown", "metadata": { "id": "qQVz6ckaScn8" }, "source": [ "Verify the presence of data in the designated `data` folder." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:17.924926Z", "iopub.status.busy": "2024-10-14T20:42:17.924679Z", "iopub.status.idle": "2024-10-14T20:42:18.053141Z", "shell.execute_reply": "2024-10-14T20:42:18.052375Z" }, "id": "Rxy-t1JnRnf6", "outputId": "e55ba864-6d27-45b5-feba-5f0ede5c2e12" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "data:\r\n", "total 16716\r\n", "-rw-r--r-- 1 runner docker 200449 Oct 14 20:42 arealm-small.csv\r\n", "-rw-r--r-- 1 runner docker 4182112 Oct 14 20:42 county_small.tsv\r\n", "-rw-r--r-- 1 runner docker 6379960 Oct 14 20:42 county_small_wkb.tsv\r\n", "-rw-r--r-- 1 runner docker 6 Oct 14 20:42 gis_osm_pois_free_1.cpg\r\n", "-rw-r--r-- 1 runner docker 1841000 Oct 14 20:42 gis_osm_pois_free_1.dbf\r\n", "-rw-r--r-- 1 runner docker 144 Oct 14 20:42 gis_osm_pois_free_1.prj\r\n", "-rw-r--r-- 1 runner docker 360544 Oct 14 20:42 gis_osm_pois_free_1.shp\r\n", "-rw-r--r-- 1 runner docker 103084 Oct 14 20:42 gis_osm_pois_free_1.shx\r\n", "drwxr-xr-x 2 runner docker 4096 Oct 14 20:42 ne_50m_admin_0_countries_lakes\r\n", "drwxr-xr-x 2 runner docker 4096 Oct 14 20:42 ne_50m_airports\r\n", "drwxr-xr-x 2 runner docker 4096 Oct 14 20:42 polygon\r\n", "-rw-r--r-- 1 runner docker 1132600 Oct 14 20:42 primaryroads-linestring.csv\r\n", "-rw-r--r-- 1 runner docker 1399092 Oct 14 20:42 primaryroads-polygon.csv\r\n", "drwxr-xr-x 2 runner docker 4096 Oct 14 20:42 raster\r\n", "-rw-r--r-- 1 runner docker 1324354 Oct 14 20:42 testPolygon.json\r\n", "-rw-r--r-- 1 runner docker 12993 Oct 14 20:42 testpoint.csv\r\n", "-rw-r--r-- 1 runner docker 129482 Oct 14 20:42 zcta510-small.csv\r\n", "\r\n", "data/ne_50m_admin_0_countries_lakes:\r\n", "total 2164\r\n", "-rw-r--r-- 1 runner docker 6 Oct 14 20:42 ne_50m_admin_0_countries_lakes.cpg\r\n", "-rw-r--r-- 1 runner docker 546979 Oct 14 20:42 ne_50m_admin_0_countries_lakes.dbf\r\n", "-rw-r--r-- 1 runner docker 144 Oct 14 20:42 ne_50m_admin_0_countries_lakes.prj\r\n", "-rw-r--r-- 1 runner docker 1652184 Oct 14 20:42 ne_50m_admin_0_countries_lakes.shp\r\n", "-rw-r--r-- 1 runner docker 2028 Oct 14 20:42 ne_50m_admin_0_countries_lakes.shx\r\n", "\r\n", "data/ne_50m_airports:\r\n", "total 336\r\n", "-rw-r--r-- 1 runner docker 326032 Oct 14 20:42 ne_50m_airports.dbf\r\n", "-rw-r--r-- 1 runner docker 148 Oct 14 20:42 ne_50m_airports.prj\r\n", "-rw-r--r-- 1 runner docker 7968 Oct 14 20:42 ne_50m_airports.shp\r\n", "-rw-r--r-- 1 runner docker 2348 Oct 14 20:42 ne_50m_airports.shx\r\n", "\r\n", "data/polygon:\r\n", "total 7344\r\n", "-rw-r--r-- 1 runner docker 10033 Oct 14 20:42 map.dbf\r\n", "-rw-r--r-- 1 runner docker 7424324 Oct 14 20:42 map.shp\r\n", "-rw-r--r-- 1 runner docker 80100 Oct 14 20:42 map.shx\r\n", "\r\n", "data/raster:\r\n", "total 396\r\n", "-rw-r--r-- 1 runner docker 6619 Oct 14 20:42 T21HUB_4704_4736_8224_8256.tif\r\n", "-rw-r--r-- 1 runner docker 174803 Oct 14 20:42 test1.tiff\r\n", "-rw-r--r-- 1 runner docker 209199 Oct 14 20:42 test5.tiff\r\n", "-rw-r--r-- 1 runner docker 7689 Oct 14 20:42 vya_T21HUB_992_1024_4352_4384.tif\r\n" ] } ], "source": [ "!ls -lR data" ] }, { "cell_type": "markdown", "metadata": { "id": "Mk20FV3ZHj_U" }, "source": [ "# Apache Sedona Core demo\n", "\n", "The notebook is available at the following link: https://github.com/apache/sedona/blob/master/docs/usecases/ApacheSedonaCore.ipynb.\n", "\n", "Refer to https://mvnrepository.com/artifact/org.apache.sedona/sedona-spark-3.4 for making sense of packages and versions.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "AcjGVcGzOjcK" }, "source": [ "```\n", "Licensed to the Apache Software Foundation (ASF) under one\n", "or more contributor license agreements. See the NOTICE file\n", "distributed with this work for additional information\n", "regarding copyright ownership. The ASF licenses this file\n", "to you under the Apache License, Version 2.0 (the\n", "\"License\"); you may not use this file except in compliance\n", "with the License. You may obtain a copy of the License at\n", " http://www.apache.org/licenses/LICENSE-2.0\n", "Unless required by applicable law or agreed to in writing,\n", "software distributed under the License is distributed on an\n", "\"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n", "KIND, either express or implied. See the License for the\n", "specific language governing permissions and limitations\n", "under the License.\n", "```" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:18.055647Z", "iopub.status.busy": "2024-10-14T20:42:18.055439Z", "iopub.status.idle": "2024-10-14T20:42:18.523899Z", "shell.execute_reply": "2024-10-14T20:42:18.523239Z" }, "id": "sK9oIz0FOjcN" }, "outputs": [], "source": [ "from pyspark.sql import SparkSession\n", "from pyspark import StorageLevel\n", "import geopandas as gpd\n", "import pandas as pd\n", "from pyspark.sql.types import StructType\n", "from pyspark.sql.types import StructField\n", "from pyspark.sql.types import StringType\n", "from pyspark.sql.types import LongType\n", "from shapely.geometry import Point\n", "from shapely.geometry import Polygon\n", "\n", "from sedona.spark import *\n", "from sedona.core.geom.envelope import Envelope" ] }, { "cell_type": "markdown", "metadata": { "id": "_hScR-pd2rVi" }, "source": [ "Note: the next cell might take a while to execute. Stretch your legs and contemplate the mysteries of the universe in the meantime. Hang tight!" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:18.526496Z", "iopub.status.busy": "2024-10-14T20:42:18.526022Z", "iopub.status.idle": "2024-10-14T20:42:33.032662Z", "shell.execute_reply": "2024-10-14T20:42:33.031657Z" }, "id": "HzVqFLBtOjcO" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ":: loading settings :: url = jar:file:/opt/hostedtoolcache/Python/3.8.18/x64/lib/python3.8/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Ivy Default Cache set to: /home/runner/.ivy2/cache\n", "The jars for the packages stored in: /home/runner/.ivy2/jars\n", "org.apache.sedona#sedona-spark-3.4_2.12 added as a dependency\n", "org.datasyslab#geotools-wrapper added as a dependency\n", "uk.co.gresearch.spark#spark-extension_2.12 added as a dependency\n", ":: resolving dependencies :: org.apache.spark#spark-submit-parent-c2fcb9ab-9ede-4abc-bad7-73533a42fa83;1.0\n", "\tconfs: [default]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tfound org.apache.sedona#sedona-spark-3.4_2.12;1.6.0 in central\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tfound org.apache.sedona#sedona-common;1.6.0 in central\n", "\tfound org.apache.commons#commons-math3;3.6.1 in central\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tfound org.locationtech.jts#jts-core;1.19.0 in central\n", "\tfound org.wololo#jts2geojson;0.16.1 in central\n", "\tfound org.locationtech.spatial4j#spatial4j;0.8 in central\n", "\tfound com.google.geometry#s2-geometry;2.0.0 in central\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tfound com.google.guava#guava;25.1-jre in central\n", "\tfound com.google.code.findbugs#jsr305;3.0.2 in central\n", "\tfound org.checkerframework#checker-qual;2.0.0 in central\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tfound com.google.errorprone#error_prone_annotations;2.1.3 in central\n", "\tfound com.google.j2objc#j2objc-annotations;1.1 in central\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tfound org.codehaus.mojo#animal-sniffer-annotations;1.14 in central\n", "\tfound com.uber#h3;4.1.1 in central\n", "\tfound net.sf.geographiclib#GeographicLib-Java;1.52 in central\n", "\tfound com.github.ben-manes.caffeine#caffeine;2.9.2 in central\n", "\tfound org.checkerframework#checker-qual;3.10.0 in central\n", "\tfound com.google.errorprone#error_prone_annotations;2.5.1 in central\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tfound org.apache.sedona#sedona-spark-common-3.4_2.12;1.6.0 in central\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\tfound commons-lang#commons-lang;2.6 in central\n", "\tfound org.scala-lang.modules#scala-collection-compat_2.12;2.5.0 in central\n", "\tfound org.beryx#awt-color-factory;1.0.0 in central\n", "\tfound org.datasyslab#geotools-wrapper;1.6.0-28.2 in central\n", "\tfound uk.co.gresearch.spark#spark-extension_2.12;2.11.0-3.4 in central\n", "\tfound com.github.scopt#scopt_2.12;4.1.0 in central\n", "downloading https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-3.4_2.12/1.6.0/sedona-spark-3.4_2.12-1.6.0.jar ...\n", "\t[SUCCESSFUL ] org.apache.sedona#sedona-spark-3.4_2.12;1.6.0!sedona-spark-3.4_2.12.jar (11ms)\n", "downloading https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.6.0-28.2/geotools-wrapper-1.6.0-28.2.jar ...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\t[SUCCESSFUL ] org.datasyslab#geotools-wrapper;1.6.0-28.2!geotools-wrapper.jar (272ms)\n", "downloading https://repo1.maven.org/maven2/uk/co/gresearch/spark/spark-extension_2.12/2.11.0-3.4/spark-extension_2.12-2.11.0-3.4.jar ...\n", "\t[SUCCESSFUL ] uk.co.gresearch.spark#spark-extension_2.12;2.11.0-3.4!spark-extension_2.12.jar (11ms)\n", "downloading https://repo1.maven.org/maven2/org/apache/sedona/sedona-common/1.6.0/sedona-common-1.6.0.jar ...\n", "\t[SUCCESSFUL ] org.apache.sedona#sedona-common;1.6.0!sedona-common.jar (13ms)\n", "downloading https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-common-3.4_2.12/1.6.0/sedona-spark-common-3.4_2.12-1.6.0.jar ...\n", "\t[SUCCESSFUL ] org.apache.sedona#sedona-spark-common-3.4_2.12;1.6.0!sedona-spark-common-3.4_2.12.jar (37ms)\n", "downloading https://repo1.maven.org/maven2/org/locationtech/jts/jts-core/1.19.0/jts-core-1.19.0.jar ...\n", "\t[SUCCESSFUL ] org.locationtech.jts#jts-core;1.19.0!jts-core.jar(bundle) (22ms)\n", "downloading https://repo1.maven.org/maven2/org/wololo/jts2geojson/0.16.1/jts2geojson-0.16.1.jar ...\n", "\t[SUCCESSFUL ] org.wololo#jts2geojson;0.16.1!jts2geojson.jar (7ms)\n", "downloading https://repo1.maven.org/maven2/org/scala-lang/modules/scala-collection-compat_2.12/2.5.0/scala-collection-compat_2.12-2.5.0.jar ...\n", "\t[SUCCESSFUL ] org.scala-lang.modules#scala-collection-compat_2.12;2.5.0!scala-collection-compat_2.12.jar (14ms)\n", "downloading https://repo1.maven.org/maven2/org/apache/commons/commons-math3/3.6.1/commons-math3-3.6.1.jar ...\n", "\t[SUCCESSFUL ] org.apache.commons#commons-math3;3.6.1!commons-math3.jar (29ms)\n", "downloading https://repo1.maven.org/maven2/org/locationtech/spatial4j/spatial4j/0.8/spatial4j-0.8.jar ...\n", "\t[SUCCESSFUL ] org.locationtech.spatial4j#spatial4j;0.8!spatial4j.jar(bundle) (9ms)\n", "downloading https://repo1.maven.org/maven2/com/google/geometry/s2-geometry/2.0.0/s2-geometry-2.0.0.jar ...\n", "\t[SUCCESSFUL ] com.google.geometry#s2-geometry;2.0.0!s2-geometry.jar (15ms)\n", "downloading https://repo1.maven.org/maven2/com/uber/h3/4.1.1/h3-4.1.1.jar ...\n", "\t[SUCCESSFUL ] com.uber#h3;4.1.1!h3.jar (21ms)\n", "downloading https://repo1.maven.org/maven2/net/sf/geographiclib/GeographicLib-Java/1.52/GeographicLib-Java-1.52.jar ...\n", "\t[SUCCESSFUL ] net.sf.geographiclib#GeographicLib-Java;1.52!GeographicLib-Java.jar (15ms)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "downloading https://repo1.maven.org/maven2/com/github/ben-manes/caffeine/caffeine/2.9.2/caffeine-2.9.2.jar ...\n", "\t[SUCCESSFUL ] com.github.ben-manes.caffeine#caffeine;2.9.2!caffeine.jar (17ms)\n", "downloading https://repo1.maven.org/maven2/com/google/guava/guava/25.1-jre/guava-25.1-jre.jar ...\n", "\t[SUCCESSFUL ] com.google.guava#guava;25.1-jre!guava.jar(bundle) (31ms)\n", "downloading https://repo1.maven.org/maven2/com/google/code/findbugs/jsr305/3.0.2/jsr305-3.0.2.jar ...\n", "\t[SUCCESSFUL ] com.google.code.findbugs#jsr305;3.0.2!jsr305.jar (7ms)\n", "downloading https://repo1.maven.org/maven2/com/google/j2objc/j2objc-annotations/1.1/j2objc-annotations-1.1.jar ...\n", "\t[SUCCESSFUL ] com.google.j2objc#j2objc-annotations;1.1!j2objc-annotations.jar (6ms)\n", "downloading https://repo1.maven.org/maven2/org/codehaus/mojo/animal-sniffer-annotations/1.14/animal-sniffer-annotations-1.14.jar ...\n", "\t[SUCCESSFUL ] org.codehaus.mojo#animal-sniffer-annotations;1.14!animal-sniffer-annotations.jar (5ms)\n", "downloading https://repo1.maven.org/maven2/org/checkerframework/checker-qual/3.10.0/checker-qual-3.10.0.jar ...\n", "\t[SUCCESSFUL ] org.checkerframework#checker-qual;3.10.0!checker-qual.jar (12ms)\n", "downloading https://repo1.maven.org/maven2/com/google/errorprone/error_prone_annotations/2.5.1/error_prone_annotations-2.5.1.jar ...\n", "\t[SUCCESSFUL ] com.google.errorprone#error_prone_annotations;2.5.1!error_prone_annotations.jar (7ms)\n", "downloading https://repo1.maven.org/maven2/commons-lang/commons-lang/2.6/commons-lang-2.6.jar ...\n", "\t[SUCCESSFUL ] commons-lang#commons-lang;2.6!commons-lang.jar (8ms)\n", "downloading https://repo1.maven.org/maven2/org/beryx/awt-color-factory/1.0.0/awt-color-factory-1.0.0.jar ...\n", "\t[SUCCESSFUL ] org.beryx#awt-color-factory;1.0.0!awt-color-factory.jar (7ms)\n", "downloading https://repo1.maven.org/maven2/com/github/scopt/scopt_2.12/4.1.0/scopt_2.12-4.1.0.jar ...\n", "\t[SUCCESSFUL ] com.github.scopt#scopt_2.12;4.1.0!scopt_2.12.jar (10ms)\n", ":: resolution report :: resolve 5088ms :: artifacts dl 598ms\n", "\t:: modules in use:\n", "\tcom.github.ben-manes.caffeine#caffeine;2.9.2 from central in [default]\n", "\tcom.github.scopt#scopt_2.12;4.1.0 from central in [default]\n", "\tcom.google.code.findbugs#jsr305;3.0.2 from central in [default]\n", "\tcom.google.errorprone#error_prone_annotations;2.5.1 from central in [default]\n", "\tcom.google.geometry#s2-geometry;2.0.0 from central in [default]\n", "\tcom.google.guava#guava;25.1-jre from central in [default]\n", "\tcom.google.j2objc#j2objc-annotations;1.1 from central in [default]\n", "\tcom.uber#h3;4.1.1 from central in [default]\n", "\tcommons-lang#commons-lang;2.6 from central in [default]\n", "\tnet.sf.geographiclib#GeographicLib-Java;1.52 from central in [default]\n", "\torg.apache.commons#commons-math3;3.6.1 from central in [default]\n", "\torg.apache.sedona#sedona-common;1.6.0 from central in [default]\n", "\torg.apache.sedona#sedona-spark-3.4_2.12;1.6.0 from central in [default]\n", "\torg.apache.sedona#sedona-spark-common-3.4_2.12;1.6.0 from central in [default]\n", "\torg.beryx#awt-color-factory;1.0.0 from central in [default]\n", "\torg.checkerframework#checker-qual;3.10.0 from central in [default]\n", "\torg.codehaus.mojo#animal-sniffer-annotations;1.14 from central in [default]\n", "\torg.datasyslab#geotools-wrapper;1.6.0-28.2 from central in [default]\n", "\torg.locationtech.jts#jts-core;1.19.0 from central in [default]\n", "\torg.locationtech.spatial4j#spatial4j;0.8 from central in [default]\n", "\torg.scala-lang.modules#scala-collection-compat_2.12;2.5.0 from central in [default]\n", "\torg.wololo#jts2geojson;0.16.1 from central in [default]\n", "\tuk.co.gresearch.spark#spark-extension_2.12;2.11.0-3.4 from central in [default]\n", "\t:: evicted modules:\n", "\torg.checkerframework#checker-qual;2.0.0 by [org.checkerframework#checker-qual;3.10.0] in [default]\n", "\tcom.google.errorprone#error_prone_annotations;2.1.3 by [com.google.errorprone#error_prone_annotations;2.5.1] in [default]\n", "\t---------------------------------------------------------------------\n", "\t| | modules || artifacts |\n", "\t| conf | number| search|dwnlded|evicted|| number|dwnlded|\n", "\t---------------------------------------------------------------------\n", "\t| default | 25 | 25 | 25 | 2 || 23 | 23 |\n", "\t---------------------------------------------------------------------\n", ":: retrieving :: org.apache.spark#spark-submit-parent-c2fcb9ab-9ede-4abc-bad7-73533a42fa83\n", "\tconfs: [default]\n", "\t23 artifacts copied, 0 already retrieved (42692kB/72ms)\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "24/10/14 20:42:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n", "Setting default log level to \"WARN\".\n", "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n" ] } ], "source": [ "config = SedonaContext.builder() .\\\n", " config('spark.jars.packages',\n", " 'org.apache.sedona:sedona-spark-3.4_2.12:1.6.0,'\n", " 'org.datasyslab:geotools-wrapper:1.6.0-28.2,'\n", " 'uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.4'). \\\n", " getOrCreate()\n", "\n", "\n", "sedona = SedonaContext.create(config)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 196 }, "execution": { "iopub.execute_input": "2024-10-14T20:42:33.035480Z", "iopub.status.busy": "2024-10-14T20:42:33.035248Z", "iopub.status.idle": "2024-10-14T20:42:33.045338Z", "shell.execute_reply": "2024-10-14T20:42:33.044570Z" }, "id": "JI32GOnEOjcO", "outputId": "dd06a54a-4872-4ec3-9924-cbc87443507a" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "

SparkContext

\n", "\n", "

Spark UI

\n", "\n", "
\n", "
Version
\n", "
v3.4.0
\n", "
Master
\n", "
local[*]
\n", "
AppName
\n", "
pyspark-shell
\n", "
\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sc = sedona.sparkContext\n", "sc" ] }, { "cell_type": "markdown", "metadata": { "id": "3ZV55Eom3oRH" }, "source": [ "`config` is the Spark session" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 203 }, "execution": { "iopub.execute_input": "2024-10-14T20:42:33.047914Z", "iopub.status.busy": "2024-10-14T20:42:33.047342Z", "iopub.status.idle": "2024-10-14T20:42:33.052073Z", "shell.execute_reply": "2024-10-14T20:42:33.051384Z" }, "id": "dK2TLoxmyD5m", "outputId": "99ae1be4-4181-4633-eec0-c2ed4e237aa9" }, "outputs": [ { "data": { "text/plain": [ "pyspark.sql.session.SparkSession" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(config)" ] }, { "cell_type": "markdown", "metadata": { "id": "OdQM3oaYOjcO" }, "source": [ "# Create SpatialRDD" ] }, { "cell_type": "markdown", "metadata": { "id": "X2umXxYfOjcO" }, "source": [ "## Reading to PointRDD from CSV file" ] }, { "cell_type": "markdown", "metadata": { "id": "j8Rgv--0OjcP" }, "source": [ "We now want load the CSV file into Apache Sedona PointRDD\n", "```\n", "testattribute0,-88.331492,32.324142,testattribute1,testattribute2\n", "testattribute0,-88.175933,32.360763,testattribute1,testattribute2\n", "testattribute0,-88.388954,32.357073,testattribute1,testattribute2\n", "testattribute0,-88.221102,32.35078,testattribute1,testattribute2\n", "testattribute0,-88.323995,32.950671,testattribute1,testattribute2\n", "testattribute0,-88.231077,32.700812,testattribute1,testattribute2\n", "```" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:33.054534Z", "iopub.status.busy": "2024-10-14T20:42:33.054121Z", "iopub.status.idle": "2024-10-14T20:42:33.184828Z", "shell.execute_reply": "2024-10-14T20:42:33.184139Z" }, "id": "a6pMS1VPxYx6", "outputId": "8071a6be-01c2-45ac-ccb3-67cb0591beba" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "testattribute0,-88.331492,32.324142,testattribute1,testattribute2\r", "\r\n", "testattribute0,-88.175933,32.360763,testattribute1,testattribute2\r", "\r\n", "testattribute0,-88.388954,32.357073,testattribute1,testattribute2\r", "\r\n", "testattribute0,-88.221102,32.35078,testattribute1,testattribute2\r", "\r\n", "testattribute0,-88.323995,32.950671,testattribute1,testattribute2\r", "\r\n", "testattribute0,-88.231077,32.700812,testattribute1,testattribute2\r", "\r\n", "testattribute0,-88.349276,32.548266,testattribute1,testattribute2\r", "\r\n", "testattribute0,-88.304259,32.488903,testattribute1,testattribute2\r", "\r\n", "testattribute0,-88.182481,32.59966,testattribute1,testattribute2\r", "\r\n", "testattribute0,-86.955186,32.617088,testattribute1,testattribute2\r", "\r\n" ] } ], "source": [ "!head data/arealm-small.csv" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:33.187328Z", "iopub.status.busy": "2024-10-14T20:42:33.186876Z", "iopub.status.idle": "2024-10-14T20:42:33.593899Z", "shell.execute_reply": "2024-10-14T20:42:33.593055Z" }, "id": "gusbYu1fOjcP" }, "outputs": [], "source": [ "point_rdd = PointRDD(sc, \"data/arealm-small.csv\", 1, FileDataSplitter.CSV, True, 10)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:33.597378Z", "iopub.status.busy": "2024-10-14T20:42:33.596670Z", "iopub.status.idle": "2024-10-14T20:42:33.602433Z", "shell.execute_reply": "2024-10-14T20:42:33.601769Z" }, "id": "Ykc5FwouOjcP", "outputId": "d7af7db0-c274-41a5-9f3e-d238a00a028e" }, "outputs": [ { "data": { "text/plain": [ "3000" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Getting approximate total count\n", "point_rdd.approximateTotalCount" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 121 }, "execution": { "iopub.execute_input": "2024-10-14T20:42:33.605246Z", "iopub.status.busy": "2024-10-14T20:42:33.604784Z", "iopub.status.idle": "2024-10-14T20:42:33.721571Z", "shell.execute_reply": "2024-10-14T20:42:33.720797Z" }, "id": "PdEjrrRHOjcP", "outputId": "102de206-4006-4d41-f62c-638a199e5d46" }, "outputs": [ { "data": { "image/svg+xml": [ "" ], "text/plain": [ "Envelope(-173.120769, -84.965961, 30.244859, 71.355134)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# getting boundary for PointRDD or any other SpatialRDD, it returns Enelope object which inherits from\n", "# shapely.geometry.Polygon\n", "point_rdd.boundary()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:33.724509Z", "iopub.status.busy": "2024-10-14T20:42:33.724189Z", "iopub.status.idle": "2024-10-14T20:42:33.867802Z", "shell.execute_reply": "2024-10-14T20:42:33.867043Z" }, "id": "MDV2-f2tOjcP", "outputId": "29e2e81b-1989-4da0-d6a6-d8e30e2298fb" }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# To run analyze please use function analyze\n", "point_rdd.analyze()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 121 }, "execution": { "iopub.execute_input": "2024-10-14T20:42:33.872940Z", "iopub.status.busy": "2024-10-14T20:42:33.872315Z", "iopub.status.idle": "2024-10-14T20:42:33.894653Z", "shell.execute_reply": "2024-10-14T20:42:33.893878Z" }, "id": "puMA8o6POjcP", "outputId": "2bb39f30-9f97-414f-e06c-15da6e63465b" }, "outputs": [ { "data": { "image/svg+xml": [ "" ], "text/plain": [ "Envelope(-173.120769, -84.965961, 30.244859, 71.355134)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Finding boundary envelope for PointRDD or any other SpatialRDD, it returns Enelope object which inherits from\n", "# shapely.geometry.Polygon\n", "point_rdd.boundaryEnvelope" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:33.897392Z", "iopub.status.busy": "2024-10-14T20:42:33.897143Z", "iopub.status.idle": "2024-10-14T20:42:34.046953Z", "shell.execute_reply": "2024-10-14T20:42:34.045993Z" }, "id": "hgOCi8lJOjcQ", "outputId": "ef453113-1e1c-4263-9635-bb9eb0e1ae61" }, "outputs": [ { "data": { "text/plain": [ "2996" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Calculate number of records without duplicates\n", "point_rdd.countWithoutDuplicates()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "execution": { "iopub.execute_input": "2024-10-14T20:42:34.051076Z", "iopub.status.busy": "2024-10-14T20:42:34.050712Z", "iopub.status.idle": "2024-10-14T20:42:34.058113Z", "shell.execute_reply": "2024-10-14T20:42:34.057323Z" }, "id": "-l1BvhakOjcQ", "outputId": "5cea5725-6cf1-4a05-f069-dbb5065ec4a1" }, "outputs": [ { "data": { "text/plain": [ "''" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Getting source epsg code\n", "point_rdd.getSourceEpsgCode()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "execution": { "iopub.execute_input": "2024-10-14T20:42:34.062403Z", "iopub.status.busy": "2024-10-14T20:42:34.061113Z", "iopub.status.idle": "2024-10-14T20:42:34.069006Z", "shell.execute_reply": "2024-10-14T20:42:34.068225Z" }, "id": "tiFEDeNdOjcQ", "outputId": "a9038b22-2f07-49b8-d0d3-bc8d1f6503f9" }, "outputs": [ { "data": { "text/plain": [ "''" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Getting target epsg code\n", "point_rdd.getTargetEpsgCode()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:34.071864Z", "iopub.status.busy": "2024-10-14T20:42:34.071590Z", "iopub.status.idle": "2024-10-14T20:42:34.206230Z", "shell.execute_reply": "2024-10-14T20:42:34.205532Z" }, "id": "oip1bNogOjcQ", "outputId": "ace94a95-c0b1-4632-8fd5-70757d4b50a7" }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Spatial partitioning data\n", "point_rdd.spatialPartitioning(GridType.KDBTREE)" ] }, { "cell_type": "markdown", "metadata": { "id": "setaW5adOjcQ" }, "source": [ "## Operations on RawSpatialRDD" ] }, { "cell_type": "markdown", "metadata": { "id": "XeAQe3qqOjcQ" }, "source": [ "rawSpatialRDD method returns RDD which consists of GeoData objects which has 2 attributes\n", "
  • geom: shapely.geometry.BaseGeometry
  • \n", "
  • userData: str
  • \n", "\n", "You can use any operations on those objects and spread across machines" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:34.209470Z", "iopub.status.busy": "2024-10-14T20:42:34.209195Z", "iopub.status.idle": "2024-10-14T20:42:35.079053Z", "shell.execute_reply": "2024-10-14T20:42:35.078297Z" }, "id": "pNlkO6kYOjcR", "outputId": "9c00cc48-d0e2-4bb1-9923-6a8b976d2342" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 8:> (0 + 1) / 1]\r", "\r", " \r" ] }, { "data": { "text/plain": [ "[Geometry: Point userData: testattribute0\ttestattribute1\ttestattribute2]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# take firs element\n", "point_rdd.rawSpatialRDD.take(1)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:35.081380Z", "iopub.status.busy": "2024-10-14T20:42:35.080974Z", "iopub.status.idle": "2024-10-14T20:42:35.369886Z", "shell.execute_reply": "2024-10-14T20:42:35.369052Z" }, "id": "22UNTZiIOjcR", "outputId": "2d54f73e-eeec-4da6-d065-9e0ea6930756" }, "outputs": [ { "data": { "text/plain": [ "[Geometry: Point userData: testattribute0\ttestattribute1\ttestattribute2,\n", " Geometry: Point userData: testattribute0\ttestattribute1\ttestattribute2,\n", " Geometry: Point userData: testattribute0\ttestattribute1\ttestattribute2,\n", " Geometry: Point userData: testattribute0\ttestattribute1\ttestattribute2,\n", " Geometry: Point userData: testattribute0\ttestattribute1\ttestattribute2]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# collect to Python list\n", "point_rdd.rawSpatialRDD.collect()[:5]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:35.372898Z", "iopub.status.busy": "2024-10-14T20:42:35.372591Z", "iopub.status.idle": "2024-10-14T20:42:35.646956Z", "shell.execute_reply": "2024-10-14T20:42:35.646110Z" }, "id": "_vz8jzj6OjcR", "outputId": "2bbb1a27-ef3b-4ac5-b93f-f651d79b2297" }, "outputs": [ { "data": { "text/plain": [ "[111.08786851399313,\n", " 110.92828303170774,\n", " 111.1385974283527,\n", " 110.97450594034112,\n", " 110.97122518072091]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# apply map functions, for example distance to Point(52 21)\n", "point_rdd.rawSpatialRDD.map(lambda x: x.geom.distance(Point(21, 52))).take(5)" ] }, { "cell_type": "markdown", "metadata": { "id": "VJUwfnWbOjcR" }, "source": [ "## Transforming to GeoPandas" ] }, { "cell_type": "markdown", "metadata": { "id": "HJT4cqunOjcR" }, "source": [ "## Loaded data can be transformed to GeoPandas DataFrame in a few ways" ] }, { "cell_type": "markdown", "metadata": { "id": "ENst5TKKOjcR" }, "source": [ "### Directly from RDD" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:35.649528Z", "iopub.status.busy": "2024-10-14T20:42:35.649319Z", "iopub.status.idle": "2024-10-14T20:42:35.655517Z", "shell.execute_reply": "2024-10-14T20:42:35.654742Z" }, "id": "khvSdVtROjcR" }, "outputs": [], "source": [ "point_rdd_to_geo = point_rdd.rawSpatialRDD.map(lambda x: [x.geom, *x.getUserData().split(\"\\t\")])" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:35.657900Z", "iopub.status.busy": "2024-10-14T20:42:35.657447Z", "iopub.status.idle": "2024-10-14T20:42:36.426929Z", "shell.execute_reply": "2024-10-14T20:42:36.426034Z" }, "id": "h_aQLibwOjcR" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", "[Stage 11:===================================================> (9 + 1) / 10]\r", "\r", " \r" ] } ], "source": [ "point_gdf = gpd.GeoDataFrame(\n", " point_rdd_to_geo.collect(), columns=[\"geom\", \"attr1\", \"attr2\", \"attr3\"], geometry=\"geom\"\n", ")" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "execution": { "iopub.execute_input": "2024-10-14T20:42:36.429746Z", "iopub.status.busy": "2024-10-14T20:42:36.429124Z", "iopub.status.idle": "2024-10-14T20:42:36.440600Z", "shell.execute_reply": "2024-10-14T20:42:36.440022Z" }, "id": "Ht7qYNuwOjcS", "outputId": "5fb53799-4dc4-4b21-9a31-3696e733ed26" }, "outputs": [ { "data": { "text/html": [ "
    \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    geomattr1attr2attr3
    0POINT (-88.33149 32.32414)testattribute0testattribute1testattribute2
    1POINT (-88.17593 32.36076)testattribute0testattribute1testattribute2
    2POINT (-88.38895 32.35707)testattribute0testattribute1testattribute2
    3POINT (-88.22110 32.35078)testattribute0testattribute1testattribute2
    4POINT (-88.32399 32.95067)testattribute0testattribute1testattribute2
    \n", "
    " ], "text/plain": [ " geom attr1 attr2 attr3\n", "0 POINT (-88.33149 32.32414) testattribute0 testattribute1 testattribute2\n", "1 POINT (-88.17593 32.36076) testattribute0 testattribute1 testattribute2\n", "2 POINT (-88.38895 32.35707) testattribute0 testattribute1 testattribute2\n", "3 POINT (-88.22110 32.35078) testattribute0 testattribute1 testattribute2\n", "4 POINT (-88.32399 32.95067) testattribute0 testattribute1 testattribute2" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "point_gdf[:5]" ] }, { "cell_type": "markdown", "metadata": { "id": "2vW97p_iOjcS" }, "source": [ "### Using Adapter" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:36.443012Z", "iopub.status.busy": "2024-10-14T20:42:36.442781Z", "iopub.status.idle": "2024-10-14T20:42:36.445523Z", "shell.execute_reply": "2024-10-14T20:42:36.444991Z" }, "id": "HN0IIhZ8OjcS" }, "outputs": [], "source": [ "# Adapter allows you to convert geospatial data types introduced with sedona to other ones" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:36.447674Z", "iopub.status.busy": "2024-10-14T20:42:36.447476Z", "iopub.status.idle": "2024-10-14T20:42:36.536965Z", "shell.execute_reply": "2024-10-14T20:42:36.536328Z" }, "id": "69dR65hgOjcS" }, "outputs": [], "source": [ "spatial_df = Adapter.\\\n", " toDf(point_rdd, [\"attr1\", \"attr2\", \"attr3\"], sedona).\\\n", " createOrReplaceTempView(\"spatial_df\")\n", "\n", "spatial_gdf = sedona.sql(\"Select attr1, attr2, attr3, geometry as geom from spatial_df\")" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:36.539485Z", "iopub.status.busy": "2024-10-14T20:42:36.539279Z", "iopub.status.idle": "2024-10-14T20:42:36.876037Z", "shell.execute_reply": "2024-10-14T20:42:36.874985Z" }, "id": "TnGQMPUIOjcS", "outputId": "1bd148c5-6680-4956-a7cb-559efda8a225" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------+--------------+--------------+----------------------------+\n", "|attr1 |attr2 |attr3 |geom |\n", "+--------------+--------------+--------------+----------------------------+\n", "|testattribute0|testattribute1|testattribute2|POINT (-88.331492 32.324142)|\n", "|testattribute0|testattribute1|testattribute2|POINT (-88.175933 32.360763)|\n", "|testattribute0|testattribute1|testattribute2|POINT (-88.388954 32.357073)|\n", "|testattribute0|testattribute1|testattribute2|POINT (-88.221102 32.35078) |\n", "|testattribute0|testattribute1|testattribute2|POINT (-88.323995 32.950671)|\n", "+--------------+--------------+--------------+----------------------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "spatial_gdf.show(5, False)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "execution": { "iopub.execute_input": "2024-10-14T20:42:36.879365Z", "iopub.status.busy": "2024-10-14T20:42:36.879038Z", "iopub.status.idle": "2024-10-14T20:42:37.532597Z", "shell.execute_reply": "2024-10-14T20:42:37.531868Z" }, "id": "IdL0awafOjcS", "outputId": "179bc434-d408-4446-a31c-c245143ddbb0" }, "outputs": [ { "data": { "text/html": [ "
    \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    attr1attr2attr3geom
    0testattribute0testattribute1testattribute2POINT (-88.33149 32.32414)
    1testattribute0testattribute1testattribute2POINT (-88.17593 32.36076)
    2testattribute0testattribute1testattribute2POINT (-88.38895 32.35707)
    3testattribute0testattribute1testattribute2POINT (-88.22110 32.35078)
    4testattribute0testattribute1testattribute2POINT (-88.32399 32.95067)
    \n", "
    " ], "text/plain": [ " attr1 attr2 attr3 geom\n", "0 testattribute0 testattribute1 testattribute2 POINT (-88.33149 32.32414)\n", "1 testattribute0 testattribute1 testattribute2 POINT (-88.17593 32.36076)\n", "2 testattribute0 testattribute1 testattribute2 POINT (-88.38895 32.35707)\n", "3 testattribute0 testattribute1 testattribute2 POINT (-88.22110 32.35078)\n", "4 testattribute0 testattribute1 testattribute2 POINT (-88.32399 32.95067)" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpd.GeoDataFrame(spatial_gdf.toPandas(), geometry=\"geom\")[:5]" ] }, { "cell_type": "markdown", "metadata": { "id": "ilmR8RjIOjcS" }, "source": [ "### With DataFrame creation" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:37.535801Z", "iopub.status.busy": "2024-10-14T20:42:37.535343Z", "iopub.status.idle": "2024-10-14T20:42:37.539701Z", "shell.execute_reply": "2024-10-14T20:42:37.538992Z" }, "id": "LqgFLBZEOjcT" }, "outputs": [], "source": [ "schema = StructType(\n", " [\n", " StructField(\"geometry\", GeometryType(), False),\n", " StructField(\"attr1\", StringType(), False),\n", " StructField(\"attr2\", StringType(), False),\n", " StructField(\"attr3\", StringType(), False),\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:37.542426Z", "iopub.status.busy": "2024-10-14T20:42:37.542013Z", "iopub.status.idle": "2024-10-14T20:42:37.581112Z", "shell.execute_reply": "2024-10-14T20:42:37.580207Z" }, "id": "tEO42a4DOjcT" }, "outputs": [], "source": [ "geo_df = sedona.createDataFrame(point_rdd_to_geo, schema, verifySchema=False)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "execution": { "iopub.execute_input": "2024-10-14T20:42:37.585157Z", "iopub.status.busy": "2024-10-14T20:42:37.584686Z", "iopub.status.idle": "2024-10-14T20:42:38.144246Z", "shell.execute_reply": "2024-10-14T20:42:38.143423Z" }, "id": "PaIAqir1OjcT", "outputId": "0b307995-0f98-47a6-b479-d991d5bee530" }, "outputs": [ { "data": { "text/html": [ "
    \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
    geometryattr1attr2attr3
    0POINT (-88.33149 32.32414)testattribute0testattribute1testattribute2
    1POINT (-88.17593 32.36076)testattribute0testattribute1testattribute2
    2POINT (-88.38895 32.35707)testattribute0testattribute1testattribute2
    3POINT (-88.22110 32.35078)testattribute0testattribute1testattribute2
    4POINT (-88.32399 32.95067)testattribute0testattribute1testattribute2
    \n", "
    " ], "text/plain": [ " geometry attr1 attr2 attr3\n", "0 POINT (-88.33149 32.32414) testattribute0 testattribute1 testattribute2\n", "1 POINT (-88.17593 32.36076) testattribute0 testattribute1 testattribute2\n", "2 POINT (-88.38895 32.35707) testattribute0 testattribute1 testattribute2\n", "3 POINT (-88.22110 32.35078) testattribute0 testattribute1 testattribute2\n", "4 POINT (-88.32399 32.95067) testattribute0 testattribute1 testattribute2" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gpd.GeoDataFrame(geo_df.toPandas(), geometry=\"geometry\")[:5]" ] }, { "cell_type": "markdown", "metadata": { "id": "kO1EyFACOjcT" }, "source": [ "# Load Typed SpatialRDDs" ] }, { "cell_type": "markdown", "metadata": { "id": "w_0_hHNQOjcT" }, "source": [ "Currently The library supports 5 typed SpatialRDDs:\n", "
  • RectangleRDD
  • \n", "
  • PointRDD
  • \n", "
  • PolygonRDD
  • \n", "
  • LineStringRDD
  • \n", "
  • CircleRDD
  • " ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:38.147602Z", "iopub.status.busy": "2024-10-14T20:42:38.146932Z", "iopub.status.idle": "2024-10-14T20:42:38.667375Z", "shell.execute_reply": "2024-10-14T20:42:38.666445Z" }, "id": "2m2k-n-LOjcT" }, "outputs": [], "source": [ "rectangle_rdd = RectangleRDD(sc, \"data/zcta510-small.csv\", FileDataSplitter.CSV, True, 11)\n", "point_rdd = PointRDD(sc, \"data/arealm-small.csv\", 1, FileDataSplitter.CSV, False, 11)\n", "polygon_rdd = PolygonRDD(sc, \"data/primaryroads-polygon.csv\", FileDataSplitter.CSV, True, 11)\n", "linestring_rdd = LineStringRDD(sc, \"data/primaryroads-linestring.csv\", FileDataSplitter.CSV, True)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:38.671636Z", "iopub.status.busy": "2024-10-14T20:42:38.670574Z", "iopub.status.idle": "2024-10-14T20:42:38.960906Z", "shell.execute_reply": "2024-10-14T20:42:38.960177Z" }, "id": "qRHPdLYdOjcb", "outputId": "618f7c86-860a-48d4-cb91-447e5309261d" }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rectangle_rdd.analyze()\n", "point_rdd.analyze()\n", "polygon_rdd.analyze()\n", "linestring_rdd.analyze()" ] }, { "cell_type": "markdown", "metadata": { "id": "J2uTMMk1Ojcb" }, "source": [ "# Spatial Partitioning" ] }, { "cell_type": "markdown", "metadata": { "id": "h3AC5CpdOjcc" }, "source": [ "Apache Sedona spatial partitioning method can significantly speed up the join query. Three spatial partitioning methods are available: KDB-Tree, Quad-Tree and R-Tree. Two SpatialRDD must be partitioned by the same way." ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:38.964082Z", "iopub.status.busy": "2024-10-14T20:42:38.963800Z", "iopub.status.idle": "2024-10-14T20:42:39.080542Z", "shell.execute_reply": "2024-10-14T20:42:39.079837Z" }, "id": "2icbQf9NOjcc", "outputId": "aa24268e-220c-4ced-e775-ba679c6234f5" }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "point_rdd.spatialPartitioning(GridType.KDBTREE)" ] }, { "cell_type": "markdown", "metadata": { "id": "_ipdd7umOjcc" }, "source": [ "# Create Index" ] }, { "cell_type": "markdown", "metadata": { "id": "hEPaVPmJOjcc" }, "source": [ "Apache Sedona provides two types of spatial indexes, Quad-Tree and R-Tree. Once you specify an index type, Apache Sedona will build a local tree index on each of the SpatialRDD partition." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:39.083378Z", "iopub.status.busy": "2024-10-14T20:42:39.082851Z", "iopub.status.idle": "2024-10-14T20:42:39.089300Z", "shell.execute_reply": "2024-10-14T20:42:39.088783Z" }, "id": "3YIsx_55Ojcc" }, "outputs": [], "source": [ "point_rdd.buildIndex(IndexType.RTREE, True)" ] }, { "cell_type": "markdown", "metadata": { "id": "LRHcULQDOjcc" }, "source": [ "# SpatialJoin" ] }, { "cell_type": "markdown", "metadata": { "id": "ZnnDo119Ojcc" }, "source": [ "Spatial join is operation which combines data based on spatial relations like:\n", "
  • intersects
  • \n", "
  • touches
  • \n", "
  • within
  • \n", "
  • etc
  • \n", "\n", "To Use Spatial Join in GeoPyspark library please use JoinQuery object, which has implemented below methods:\n", "```python\n", "SpatialJoinQuery(spatialRDD: SpatialRDD, queryRDD: SpatialRDD, useIndex: bool, considerBoundaryIntersection: bool) -> RDD\n", "\n", "DistanceJoinQuery(spatialRDD: SpatialRDD, queryRDD: SpatialRDD, useIndex: bool, considerBoundaryIntersection: bool) -> RDD\n", "\n", "spatialJoin(queryWindowRDD: SpatialRDD, objectRDD: SpatialRDD, joinParams: JoinParams) -> RDD\n", "\n", "DistanceJoinQueryFlat(spatialRDD: SpatialRDD, queryRDD: SpatialRDD, useIndex: bool, considerBoundaryIntersection: bool) -> RDD\n", "\n", "SpatialJoinQueryFlat(spatialRDD: SpatialRDD, queryRDD: SpatialRDD, useIndex: bool, considerBoundaryIntersection: bool) -> RDD\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "id": "icOL3ukTOjcc" }, "source": [ "## Example SpatialJoinQueryFlat PointRDD with RectangleRDD" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:39.091737Z", "iopub.status.busy": "2024-10-14T20:42:39.091360Z", "iopub.status.idle": "2024-10-14T20:42:39.204118Z", "shell.execute_reply": "2024-10-14T20:42:39.203276Z" }, "id": "unG1rYNZOjcc" }, "outputs": [], "source": [ "# partitioning the data\n", "point_rdd.spatialPartitioning(GridType.KDBTREE)\n", "rectangle_rdd.spatialPartitioning(point_rdd.getPartitioner())\n", "# building an index\n", "point_rdd.buildIndex(IndexType.RTREE, True)\n", "# Perform Spatial Join Query\n", "result = JoinQuery.SpatialJoinQueryFlat(point_rdd, rectangle_rdd, False, True)" ] }, { "cell_type": "markdown", "metadata": { "id": "5A0s6CE5Ojcd" }, "source": [ "As result we will get RDD[GeoData, GeoData]\n", "It can be used like any other Python RDD. You can use map, take, collect and other functions " ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:39.207991Z", "iopub.status.busy": "2024-10-14T20:42:39.207319Z", "iopub.status.idle": "2024-10-14T20:42:39.212590Z", "shell.execute_reply": "2024-10-14T20:42:39.212054Z" }, "id": "8dw_JRafOjcd", "outputId": "e7bc8e6a-70b5-4f95-ba0b-31bd4aa51296" }, "outputs": [ { "data": { "text/plain": [ "MapPartitionsRDD[63] at map at FlatPairRddConverter.scala:30" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:39.214784Z", "iopub.status.busy": "2024-10-14T20:42:39.214561Z", "iopub.status.idle": "2024-10-14T20:42:40.365727Z", "shell.execute_reply": "2024-10-14T20:42:40.365088Z" }, "id": "7QKdzz9ZOjcd", "outputId": "d1debc1f-e803-489d-c05f-975064807c2f" }, "outputs": [ { "data": { "text/plain": [ "[[Geometry: Polygon userData: , Geometry: Point userData: ],\n", " [Geometry: Polygon userData: , Geometry: Point userData: ]]" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result.take(2)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:40.368030Z", "iopub.status.busy": "2024-10-14T20:42:40.367628Z", "iopub.status.idle": "2024-10-14T20:42:40.575631Z", "shell.execute_reply": "2024-10-14T20:42:40.574797Z" }, "id": "cBMsFNe2Ojcd", "outputId": "4c1b8acd-a731-466c-f78e-2c36a90f56d8" }, "outputs": [ { "data": { "text/plain": [ "[[Geometry: Polygon userData: , Geometry: Point userData: ],\n", " [Geometry: Polygon userData: , Geometry: Point userData: ],\n", " [Geometry: Polygon userData: , Geometry: Point userData: ]]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result.collect()[:3]" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:40.578778Z", "iopub.status.busy": "2024-10-14T20:42:40.578026Z", "iopub.status.idle": "2024-10-14T20:42:41.375274Z", "shell.execute_reply": "2024-10-14T20:42:41.374326Z" }, "id": "ZUPAfZ0wOjcd", "outputId": "29fafd15-40af-4a11-e3ae-38f34a5d392f" }, "outputs": [ { "data": { "text/plain": [ "[0.0, 0.0, 0.0, 0.0, 0.0]" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# getting distance using SpatialObjects\n", "result.map(lambda x: x[0].geom.distance(x[1].geom)).take(5)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:41.378255Z", "iopub.status.busy": "2024-10-14T20:42:41.378035Z", "iopub.status.idle": "2024-10-14T20:42:42.024185Z", "shell.execute_reply": "2024-10-14T20:42:42.023383Z" }, "id": "2ztnCLMnOjcd", "outputId": "22d5a50d-c4ef-4255-b169-6f2f330a08da" }, "outputs": [ { "data": { "text/plain": [ "[0.026651558685001447,\n", " 0.026651558685001447,\n", " 0.026651558685001447,\n", " 0.026651558685001447,\n", " 0.057069904940998895]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# getting area of polygon data\n", "result.map(lambda x: x[0].geom.area).take(5)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:42.026495Z", "iopub.status.busy": "2024-10-14T20:42:42.026089Z", "iopub.status.idle": "2024-10-14T20:42:42.029403Z", "shell.execute_reply": "2024-10-14T20:42:42.028876Z" }, "id": "nALxyjQiOjce" }, "outputs": [], "source": [ "# Base on result you can create DataFrame object, using map function and build DataFrame from RDD\n", "schema = StructType(\n", " [\n", " StructField(\"geom_left\", GeometryType(), False),\n", " StructField(\"geom_right\", GeometryType(), False)\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:42.031434Z", "iopub.status.busy": "2024-10-14T20:42:42.031056Z", "iopub.status.idle": "2024-10-14T20:42:43.080725Z", "shell.execute_reply": "2024-10-14T20:42:43.079808Z" }, "id": "UWqys-lqOjce", "outputId": "8ae18f86-4064-4b2b-ed5c-704dc80a3f24" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+--------------------+\n", "| geom_left| geom_right|\n", "+--------------------+--------------------+\n", "|POLYGON ((-87.229...|POINT (-87.204033...|\n", "|POLYGON ((-87.229...|POINT (-87.204299...|\n", "|POLYGON ((-87.229...|POINT (-87.19351 ...|\n", "|POLYGON ((-87.229...|POINT (-87.18222 ...|\n", "|POLYGON ((-87.285...|POINT (-87.28468 ...|\n", "+--------------------+--------------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "# Set verifySchema to False\n", "spatial_join_result = result.map(lambda x: [x[0].geom, x[1].geom])\n", "sedona.createDataFrame(spatial_join_result, schema, verifySchema=False).show(5, True)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:43.083836Z", "iopub.status.busy": "2024-10-14T20:42:43.083556Z", "iopub.status.idle": "2024-10-14T20:42:43.107209Z", "shell.execute_reply": "2024-10-14T20:42:43.106344Z" }, "id": "_BQVrY-IOjce", "outputId": "bd0d4e7e-e18a-40a7-a30f-80ffc8d9acf9" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- geom_left: geometry (nullable = false)\n", " |-- geom_right: geometry (nullable = false)\n", "\n" ] } ], "source": [ "# Above code produces DataFrame with geometry Data type\n", "sedona.createDataFrame(spatial_join_result, schema, verifySchema=False).printSchema()" ] }, { "cell_type": "markdown", "metadata": { "id": "5oID6P5kOjce" }, "source": [ "We can create DataFrame object from Spatial Pair RDD using Adapter object as follows" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:43.110392Z", "iopub.status.busy": "2024-10-14T20:42:43.109927Z", "iopub.status.idle": "2024-10-14T20:42:44.879189Z", "shell.execute_reply": "2024-10-14T20:42:44.878344Z" }, "id": "OyD3_m_zOjce", "outputId": "9c7aa351-fb83-4b80-b764-4948bfaf98fb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+-----+--------------------+-----+\n", "| geom_1|attr1| geom_2|attr2|\n", "+--------------------+-----+--------------------+-----+\n", "|POLYGON ((-87.229...| |POINT (-87.204033...| |\n", "|POLYGON ((-87.229...| |POINT (-87.204299...| |\n", "|POLYGON ((-87.229...| |POINT (-87.19351 ...| |\n", "|POLYGON ((-87.229...| |POINT (-87.18222 ...| |\n", "|POLYGON ((-87.285...| |POINT (-87.28468 ...| |\n", "+--------------------+-----+--------------------+-----+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "Adapter.toDf(result, [\"attr1\"], [\"attr2\"], sedona).show(5, True)" ] }, { "cell_type": "markdown", "metadata": { "id": "qAmXD2SKOjce" }, "source": [ "This also produce DataFrame with geometry DataType" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:44.881630Z", "iopub.status.busy": "2024-10-14T20:42:44.881410Z", "iopub.status.idle": "2024-10-14T20:42:45.526703Z", "shell.execute_reply": "2024-10-14T20:42:45.525839Z" }, "id": "U7evFgqkOjce", "outputId": "ce73803b-0b8a-4b46-9c2a-9be0c858e902" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- geom_1: geometry (nullable = true)\n", " |-- attr1: string (nullable = true)\n", " |-- geom_2: geometry (nullable = true)\n", " |-- attr2: string (nullable = true)\n", "\n" ] } ], "source": [ "Adapter.toDf(result, [\"attr1\"], [\"attr2\"], sedona).printSchema()" ] }, { "cell_type": "markdown", "metadata": { "id": "TldNHgcQOjcf" }, "source": [ "We can create RDD which will be of type RDD[GeoData, List[GeoData]]\n", "We can for example calculate number of Points within some polygon data" ] }, { "cell_type": "markdown", "metadata": { "id": "GlhD3mSvOjcf" }, "source": [ "To do that we can use code specified below" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:45.529305Z", "iopub.status.busy": "2024-10-14T20:42:45.528871Z", "iopub.status.idle": "2024-10-14T20:42:45.577097Z", "shell.execute_reply": "2024-10-14T20:42:45.576372Z" }, "id": "bl5mg9qdOjcf" }, "outputs": [], "source": [ "point_rdd.spatialPartitioning(GridType.KDBTREE)\n", "rectangle_rdd.spatialPartitioning(point_rdd.getPartitioner())" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:45.579742Z", "iopub.status.busy": "2024-10-14T20:42:45.579333Z", "iopub.status.idle": "2024-10-14T20:42:45.619194Z", "shell.execute_reply": "2024-10-14T20:42:45.618262Z" }, "id": "6S1gs7nDOjcf" }, "outputs": [], "source": [ "spatial_join_result_non_flat = JoinQuery.SpatialJoinQuery(point_rdd, rectangle_rdd, False, True)" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:45.622607Z", "iopub.status.busy": "2024-10-14T20:42:45.621947Z", "iopub.status.idle": "2024-10-14T20:42:45.626771Z", "shell.execute_reply": "2024-10-14T20:42:45.625826Z" }, "id": "ZJiTacNkOjcf" }, "outputs": [], "source": [ "# number of point for each polygon\n", "number_of_points = spatial_join_result_non_flat.map(lambda x: [x[0].geom, x[1].__len__()])" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:45.629586Z", "iopub.status.busy": "2024-10-14T20:42:45.629098Z", "iopub.status.idle": "2024-10-14T20:42:45.633466Z", "shell.execute_reply": "2024-10-14T20:42:45.632361Z" }, "id": "-EtrJVuROjcf" }, "outputs": [], "source": [ "schema = StructType([\n", " StructField(\"geometry\", GeometryType(), False),\n", " StructField(\"number_of_points\", LongType(), False)\n", "])" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:45.637096Z", "iopub.status.busy": "2024-10-14T20:42:45.636621Z", "iopub.status.idle": "2024-10-14T20:42:47.063040Z", "shell.execute_reply": "2024-10-14T20:42:47.062280Z" }, "id": "mktoa5-HOjcf", "outputId": "905c6634-bbf6-4c72-fad3-a4a4ce07fec8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+----------------+\n", "| geometry|number_of_points|\n", "+--------------------+----------------+\n", "|POLYGON ((-87.082...| 12|\n", "|POLYGON ((-87.229...| 7|\n", "|POLYGON ((-86.697...| 1|\n", "|POLYGON ((-87.105...| 15|\n", "|POLYGON ((-87.092...| 5|\n", "|POLYGON ((-86.860...| 12|\n", "|POLYGON ((-86.816...| 6|\n", "|POLYGON ((-87.114...| 15|\n", "|POLYGON ((-87.285...| 26|\n", "|POLYGON ((-86.749...| 4|\n", "+--------------------+----------------+\n", "\n" ] } ], "source": [ "sedona.createDataFrame(number_of_points, schema, verifySchema=False).show()" ] }, { "cell_type": "markdown", "metadata": { "id": "ZosmctfPOjcf" }, "source": [ "# KNNQuery" ] }, { "cell_type": "markdown", "metadata": { "id": "utC2fQXwOjcg" }, "source": [ "Spatial KNNQuery is operation which help us find answer which k number of geometries lays closest to other geometry.\n", "\n", "For Example:\n", " 5 closest Shops to your home. To use Spatial KNNQuery please use object\n", " KNNQuery which has one method:\n", "```python\n", "SpatialKnnQuery(spatialRDD: SpatialRDD, originalQueryPoint: BaseGeometry, k: int, useIndex: bool)-> List[GeoData]\n", "```" ] }, { "cell_type": "markdown", "metadata": { "id": "sTcvQarvOjcg" }, "source": [ "### Finds 5 closest points from PointRDD to given Point" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:47.066137Z", "iopub.status.busy": "2024-10-14T20:42:47.065753Z", "iopub.status.idle": "2024-10-14T20:42:47.157543Z", "shell.execute_reply": "2024-10-14T20:42:47.156659Z" }, "id": "n6BIfKk9Ojcg" }, "outputs": [], "source": [ "result = KNNQuery.SpatialKnnQuery(point_rdd, Point(-84.01, 34.01), 5, False)" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:47.160503Z", "iopub.status.busy": "2024-10-14T20:42:47.160221Z", "iopub.status.idle": "2024-10-14T20:42:47.165654Z", "shell.execute_reply": "2024-10-14T20:42:47.164908Z" }, "id": "i-SgOWD2Ojcg", "outputId": "f82b836e-07d5-40ad-dd82-6782685b4b75" }, "outputs": [ { "data": { "text/plain": [ "[Geometry: Point userData: ,\n", " Geometry: Point userData: ,\n", " Geometry: Point userData: ,\n", " Geometry: Point userData: ,\n", " Geometry: Point userData: ]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result" ] }, { "cell_type": "markdown", "metadata": { "id": "Dft5_7AMOjcg" }, "source": [ "As Reference geometry you can also use Polygon or LineString object" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:47.168516Z", "iopub.status.busy": "2024-10-14T20:42:47.168092Z", "iopub.status.idle": "2024-10-14T20:42:47.410353Z", "shell.execute_reply": "2024-10-14T20:42:47.409454Z" }, "id": "C-G_a1iXOjcg" }, "outputs": [], "source": [ "polygon = Polygon(\n", " [(-84.237756, 33.904859), (-84.237756, 34.090426),\n", " (-83.833011, 34.090426), (-83.833011, 33.904859),\n", " (-84.237756, 33.904859)\n", " ])\n", "polygons_nearby = KNNQuery.SpatialKnnQuery(polygon_rdd, polygon, 5, False)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:47.413866Z", "iopub.status.busy": "2024-10-14T20:42:47.413183Z", "iopub.status.idle": "2024-10-14T20:42:47.418512Z", "shell.execute_reply": "2024-10-14T20:42:47.417710Z" }, "id": "pYtK1T1oOjcg", "outputId": "1e27ad5f-e82d-421e-a081-2c371cf48f42" }, "outputs": [ { "data": { "text/plain": [ "[Geometry: Polygon userData: ,\n", " Geometry: Polygon userData: ,\n", " Geometry: Polygon userData: ,\n", " Geometry: Polygon userData: ,\n", " Geometry: Polygon userData: ]" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "polygons_nearby" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 0 }, "execution": { "iopub.execute_input": "2024-10-14T20:42:47.421097Z", "iopub.status.busy": "2024-10-14T20:42:47.420831Z", "iopub.status.idle": "2024-10-14T20:42:47.427928Z", "shell.execute_reply": "2024-10-14T20:42:47.427179Z" }, "id": "-4Mf5TP-Ojcg", "outputId": "892e50d3-ea2f-40fc-e2ae-669dfa7453d2" }, "outputs": [ { "data": { "text/plain": [ "'POLYGON ((-83.993559 34.087259, -83.993559 34.131247, -83.959903 34.131247, -83.959903 34.087259, -83.993559 34.087259))'" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "polygons_nearby[0].geom.wkt" ] }, { "cell_type": "markdown", "metadata": { "id": "SxP3ufPGOjch" }, "source": [ "# RangeQuery" ] }, { "cell_type": "markdown", "metadata": { "id": "Air8I_XIOjch" }, "source": [ "A spatial range query takes as input a range query window and an SpatialRDD and returns all geometries that intersect / are fully covered by the query window.\n", "RangeQuery has one method:\n", "\n", "```python\n", "SpatialRangeQuery(self, spatialRDD: SpatialRDD, rangeQueryWindow: BaseGeometry, considerBoundaryIntersection: bool, usingIndex: bool) -> RDD\n", "```" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:47.430731Z", "iopub.status.busy": "2024-10-14T20:42:47.430483Z", "iopub.status.idle": "2024-10-14T20:42:47.434698Z", "shell.execute_reply": "2024-10-14T20:42:47.433903Z" }, "id": "7YQd4wvuOjch" }, "outputs": [], "source": [ "from sedona.core.geom.envelope import Envelope" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:47.437543Z", "iopub.status.busy": "2024-10-14T20:42:47.437280Z", "iopub.status.idle": "2024-10-14T20:42:47.450824Z", "shell.execute_reply": "2024-10-14T20:42:47.449922Z" }, "id": "8WLf9PLbOjch" }, "outputs": [], "source": [ "query_envelope = Envelope(-85.01, -60.01, 34.01, 50.01)\n", "\n", "result_range_query = RangeQuery.SpatialRangeQuery(linestring_rdd, query_envelope, False, False)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:47.454197Z", "iopub.status.busy": "2024-10-14T20:42:47.453484Z", "iopub.status.idle": "2024-10-14T20:42:47.460623Z", "shell.execute_reply": "2024-10-14T20:42:47.459849Z" }, "id": "9Wp7sOciOjch", "outputId": "f957dce6-1b30-4c1c-8d94-21d88b16f90e" }, "outputs": [ { "data": { "text/plain": [ "MapPartitionsRDD[127] at map at GeometryRddConverter.scala:30" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result_range_query" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:47.463075Z", "iopub.status.busy": "2024-10-14T20:42:47.462780Z", "iopub.status.idle": "2024-10-14T20:42:47.530348Z", "shell.execute_reply": "2024-10-14T20:42:47.529622Z" }, "id": "C1mittdHOjch", "outputId": "56c14bb2-2aca-4174-92fd-8a9da1a4ecca" }, "outputs": [ { "data": { "text/plain": [ "[Geometry: LineString userData: ,\n", " Geometry: LineString userData: ,\n", " Geometry: LineString userData: ,\n", " Geometry: LineString userData: ,\n", " Geometry: LineString userData: ,\n", " Geometry: LineString userData: ]" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result_range_query.take(6)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:47.533389Z", "iopub.status.busy": "2024-10-14T20:42:47.532709Z", "iopub.status.idle": "2024-10-14T20:42:47.536590Z", "shell.execute_reply": "2024-10-14T20:42:47.535896Z" }, "id": "Jv3ik2obOjch" }, "outputs": [], "source": [ "# Creating DataFrame from result" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:47.539307Z", "iopub.status.busy": "2024-10-14T20:42:47.539062Z", "iopub.status.idle": "2024-10-14T20:42:47.542523Z", "shell.execute_reply": "2024-10-14T20:42:47.541801Z" }, "id": "OF2xJfLCOjch" }, "outputs": [], "source": [ "schema = StructType([StructField(\"geometry\", GeometryType(), False)])" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:47.545222Z", "iopub.status.busy": "2024-10-14T20:42:47.544770Z", "iopub.status.idle": "2024-10-14T20:42:47.731848Z", "shell.execute_reply": "2024-10-14T20:42:47.731059Z" }, "id": "7Y4ruCQ-Ojci", "outputId": "629ce863-1b19-4850-f0a4-5eeb039b25d4" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+\n", "| geometry|\n", "+--------------------+\n", "|LINESTRING (-72.1...|\n", "|LINESTRING (-72.4...|\n", "|LINESTRING (-72.4...|\n", "|LINESTRING (-73.4...|\n", "|LINESTRING (-73.6...|\n", "+--------------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "sedona.createDataFrame(\n", " result_range_query.map(lambda x: [x.geom]),\n", " schema,\n", " verifySchema=False\n", ").show(5, True)" ] }, { "cell_type": "markdown", "metadata": { "id": "jwiSJcuLOjci" }, "source": [ "# Load From other Formats" ] }, { "cell_type": "markdown", "metadata": { "id": "aLuQLIfJOjci" }, "source": [ "GeoPyspark allows to load the data from other Data formats like:\n", "
  • GeoJSON
  • \n", "
  • Shapefile
  • \n", "
  • WKB
  • \n", "
  • WKT
  • " ] }, { "cell_type": "markdown", "metadata": { "id": "Udl4O-yE44Cb" }, "source": [ "## ShapeFile - load to SpatialRDD" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:47.735381Z", "iopub.status.busy": "2024-10-14T20:42:47.735080Z", "iopub.status.idle": "2024-10-14T20:42:47.887884Z", "shell.execute_reply": "2024-10-14T20:42:47.887113Z" }, "id": "3bcx_gxAOjci" }, "outputs": [], "source": [ "shape_rdd = ShapefileReader.readToGeometryRDD(sc, \"data/polygon\")" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:47.891057Z", "iopub.status.busy": "2024-10-14T20:42:47.890573Z", "iopub.status.idle": "2024-10-14T20:42:47.895647Z", "shell.execute_reply": "2024-10-14T20:42:47.894987Z" }, "id": "nIjo83o8Ojci", "outputId": "103e2a26-632d-48dc-8d8b-a609c1577952" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "shape_rdd" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:47.898003Z", "iopub.status.busy": "2024-10-14T20:42:47.897560Z", "iopub.status.idle": "2024-10-14T20:42:48.041402Z", "shell.execute_reply": "2024-10-14T20:42:48.040593Z" }, "id": "4y9EXCKfOjci", "outputId": "ad7e4283-68c7-423b-9f94-f1d3e5af10b1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+\n", "| geometry|\n", "+--------------------+\n", "|MULTIPOLYGON (((1...|\n", "|MULTIPOLYGON (((-...|\n", "|MULTIPOLYGON (((1...|\n", "|POLYGON ((118.362...|\n", "|MULTIPOLYGON (((-...|\n", "+--------------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "Adapter.toDf(shape_rdd, sedona).show(5, True)" ] }, { "cell_type": "markdown", "metadata": { "id": "gjRBpmLQ48Sx" }, "source": [ "## GeoJSON - load to SpatialRDD" ] }, { "cell_type": "markdown", "metadata": { "id": "Mh1K7181Ojcj" }, "source": [ "```\n", "{ \"type\": \"Feature\", \"properties\": { \"STATEFP\": \"01\", \"COUNTYFP\": \"077\", \"TRACTCE\": \"011501\", \"BLKGRPCE\": \"5\", \"AFFGEOID\": \"1500000US010770115015\", \"GEOID\": \"010770115015\", \"NAME\": \"5\", \"LSAD\": \"BG\", \"ALAND\": 6844991, \"AWATER\": 32636 }, \"geometry\": { \"type\": \"Polygon\", \"coordinates\": [ [ [ -87.621765, 34.873444 ], [ -87.617535, 34.873369 ], [ -87.6123, 34.873337 ], [ -87.604049, 34.873303 ], [ -87.604033, 34.872316 ], [ -87.60415, 34.867502 ], [ -87.604218, 34.865687 ], [ -87.604409, 34.858537 ], [ -87.604018, 34.851336 ], [ -87.603716, 34.844829 ], [ -87.603696, 34.844307 ], [ -87.603673, 34.841884 ], [ -87.60372, 34.841003 ], [ -87.603879, 34.838423 ], [ -87.603888, 34.837682 ], [ -87.603889, 34.83763 ], [ -87.613127, 34.833938 ], [ -87.616451, 34.832699 ], [ -87.621041, 34.831431 ], [ -87.621056, 34.831526 ], [ -87.62112, 34.831925 ], [ -87.621603, 34.8352 ], [ -87.62158, 34.836087 ], [ -87.621383, 34.84329 ], [ -87.621359, 34.844438 ], [ -87.62129, 34.846387 ], [ -87.62119, 34.85053 ], [ -87.62144, 34.865379 ], [ -87.621765, 34.873444 ] ] ] } },\n", "```" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:48.044471Z", "iopub.status.busy": "2024-10-14T20:42:48.044208Z", "iopub.status.idle": "2024-10-14T20:42:48.128575Z", "shell.execute_reply": "2024-10-14T20:42:48.127802Z" }, "id": "g_a3q71ZOjcj" }, "outputs": [], "source": [ "geo_json_rdd = GeoJsonReader.readToGeometryRDD(sc, \"data/testPolygon.json\")" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:48.131377Z", "iopub.status.busy": "2024-10-14T20:42:48.131158Z", "iopub.status.idle": "2024-10-14T20:42:48.135644Z", "shell.execute_reply": "2024-10-14T20:42:48.135003Z" }, "id": "LGrMegQ6Ojcj", "outputId": "12841adc-47b7-4fc2-8edd-94ae174aff32" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "geo_json_rdd" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:48.138214Z", "iopub.status.busy": "2024-10-14T20:42:48.137701Z", "iopub.status.idle": "2024-10-14T20:42:48.410662Z", "shell.execute_reply": "2024-10-14T20:42:48.409805Z" }, "id": "ZfH32XssOjcj", "outputId": "121f8492-dc10-40a6-a60e-4b312810f967" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+-------+--------+-------+--------+--------------------+------------+----+----+--------+\n", "| geometry|STATEFP|COUNTYFP|TRACTCE|BLKGRPCE| AFFGEOID| GEOID|NAME|LSAD| ALAND|\n", "+--------------------+-------+--------+-------+--------+--------------------+------------+----+----+--------+\n", "|POLYGON ((-87.621...| 01| 077| 011501| 5|1500000US01077011...|010770115015| 5| BG| 6844991|\n", "|POLYGON ((-85.719...| 01| 045| 021102| 4|1500000US01045021...|010450211024| 4| BG|11360854|\n", "|POLYGON ((-86.000...| 01| 055| 001300| 3|1500000US01055001...|010550013003| 3| BG| 1378742|\n", "|POLYGON ((-86.574...| 01| 089| 001700| 2|1500000US01089001...|010890017002| 2| BG| 1040641|\n", "|POLYGON ((-85.382...| 01| 069| 041400| 1|1500000US01069041...|010690414001| 1| BG| 8243574|\n", "+--------------------+-------+--------+-------+--------+--------------------+------------+----+----+--------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "Adapter.toDf(geo_json_rdd, sedona).drop(\"AWATER\").show(5, True)" ] }, { "cell_type": "markdown", "metadata": { "id": "WK8g2bYB5AlA" }, "source": [ "## WKT - loading to SpatialRDD" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:48.413641Z", "iopub.status.busy": "2024-10-14T20:42:48.413103Z", "iopub.status.idle": "2024-10-14T20:42:48.467811Z", "shell.execute_reply": "2024-10-14T20:42:48.466981Z" }, "id": "0d9owuZROjcj" }, "outputs": [], "source": [ "wkt_rdd = WktReader.readToGeometryRDD(sc, \"data/county_small.tsv\", 0, True, False)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:48.471527Z", "iopub.status.busy": "2024-10-14T20:42:48.471214Z", "iopub.status.idle": "2024-10-14T20:42:48.478565Z", "shell.execute_reply": "2024-10-14T20:42:48.477875Z" }, "id": "XKALKBbeOjcj", "outputId": "ed7f8e62-adcc-472d-8a68-1f8d89bc9e59" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wkt_rdd" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:48.481770Z", "iopub.status.busy": "2024-10-14T20:42:48.481273Z", "iopub.status.idle": "2024-10-14T20:42:48.494180Z", "shell.execute_reply": "2024-10-14T20:42:48.493430Z" }, "id": "UUJfqy5EOjcj", "outputId": "70f34978-5f9d-4bc0-8101-b1cb26c9de93" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- geometry: geometry (nullable = true)\n", "\n" ] } ], "source": [ "Adapter.toDf(wkt_rdd, sedona).printSchema()" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:48.497580Z", "iopub.status.busy": "2024-10-14T20:42:48.497068Z", "iopub.status.idle": "2024-10-14T20:42:48.682134Z", "shell.execute_reply": "2024-10-14T20:42:48.681442Z" }, "id": "qjIgGZB2Ojck", "outputId": "98bda941-5e93-4386-c9dd-33b638671789" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+\n", "| geometry|\n", "+--------------------+\n", "|POLYGON ((-97.019...|\n", "|POLYGON ((-123.43...|\n", "|POLYGON ((-104.56...|\n", "|POLYGON ((-96.910...|\n", "|POLYGON ((-98.273...|\n", "+--------------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "Adapter.toDf(wkt_rdd, sedona).show(5, True)" ] }, { "cell_type": "markdown", "metadata": { "id": "ZlZpg34F5Es2" }, "source": [ "## WKB - load to SpatialRDD" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:48.685184Z", "iopub.status.busy": "2024-10-14T20:42:48.684894Z", "iopub.status.idle": "2024-10-14T20:42:48.731203Z", "shell.execute_reply": "2024-10-14T20:42:48.730317Z" }, "id": "1-saFLNoOjck" }, "outputs": [], "source": [ "wkb_rdd = WkbReader.readToGeometryRDD(sc, \"data/county_small_wkb.tsv\", 0, True, False)" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:48.734373Z", "iopub.status.busy": "2024-10-14T20:42:48.734079Z", "iopub.status.idle": "2024-10-14T20:42:48.868150Z", "shell.execute_reply": "2024-10-14T20:42:48.867379Z" }, "id": "WFUhy3AtOjck", "outputId": "f166e515-bd68-4432-da2d-8196d213b491" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+\n", "| geometry|\n", "+--------------------+\n", "|POLYGON ((-97.019...|\n", "|POLYGON ((-123.43...|\n", "|POLYGON ((-104.56...|\n", "|POLYGON ((-96.910...|\n", "|POLYGON ((-98.273...|\n", "+--------------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "Adapter.toDf(wkb_rdd, sedona).show(5, True)" ] }, { "cell_type": "markdown", "metadata": { "id": "xj-mt-60Ojck" }, "source": [ "## Converting RDD Spatial join result to DF directly, avoiding jvm python serde" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:48.870793Z", "iopub.status.busy": "2024-10-14T20:42:48.870257Z", "iopub.status.idle": "2024-10-14T20:42:48.923655Z", "shell.execute_reply": "2024-10-14T20:42:48.922683Z" }, "id": "5DTG6EvFOjck" }, "outputs": [], "source": [ "point_rdd.spatialPartitioning(GridType.KDBTREE)\n", "rectangle_rdd.spatialPartitioning(point_rdd.getPartitioner())\n", "# building an index\n", "point_rdd.buildIndex(IndexType.RTREE, True)\n", "# Perform Spatial Join Query\n", "result = JoinQueryRaw.SpatialJoinQueryFlat(point_rdd, rectangle_rdd, False, True)" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:48.927069Z", "iopub.status.busy": "2024-10-14T20:42:48.926777Z", "iopub.status.idle": "2024-10-14T20:42:48.936857Z", "shell.execute_reply": "2024-10-14T20:42:48.935954Z" }, "id": "Fs8XaQ6kOjck" }, "outputs": [], "source": [ "# without passing column names, the result will contain only two geometries columns\n", "geometry_df = Adapter.toDf(result, sedona)" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:48.939779Z", "iopub.status.busy": "2024-10-14T20:42:48.939509Z", "iopub.status.idle": "2024-10-14T20:42:48.944870Z", "shell.execute_reply": "2024-10-14T20:42:48.944082Z" }, "id": "qiWJjHz7Ojck", "outputId": "4baca1b2-680a-426e-91e5-392892802269" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- leftgeometry: geometry (nullable = true)\n", " |-- rightgeometry: geometry (nullable = true)\n", "\n" ] } ], "source": [ "geometry_df.printSchema()" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:48.947502Z", "iopub.status.busy": "2024-10-14T20:42:48.947223Z", "iopub.status.idle": "2024-10-14T20:42:49.449145Z", "shell.execute_reply": "2024-10-14T20:42:49.448234Z" }, "id": "CUuhG1HHOjcl", "outputId": "e69b3b5b-f67a-4ec0-d0c0-659bff62fa1b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+--------------------+\n", "| leftgeometry| rightgeometry|\n", "+--------------------+--------------------+\n", "|POLYGON ((-87.092...|POINT (-86.94719 ...|\n", "|POLYGON ((-86.816...|POINT (-86.709657...|\n", "|POLYGON ((-86.816...|POINT (-86.717655...|\n", "|POLYGON ((-86.749...|POINT (-86.736302...|\n", "|POLYGON ((-86.749...|POINT (-86.735506...|\n", "+--------------------+--------------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "geometry_df.show(5)" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:49.452639Z", "iopub.status.busy": "2024-10-14T20:42:49.452419Z", "iopub.status.idle": "2024-10-14T20:42:49.616600Z", "shell.execute_reply": "2024-10-14T20:42:49.615797Z" }, "id": "J_113xc7Ojcl", "outputId": "22d1b8d1-063d-4470-cc0b-975fc7eeea3a" }, "outputs": [ { "data": { "text/plain": [ "Row(leftgeometry=, rightgeometry=)" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "geometry_df.collect()[0]" ] }, { "cell_type": "markdown", "metadata": { "id": "S-eRk6AROjcl" }, "source": [ "## Passing column names" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:49.619915Z", "iopub.status.busy": "2024-10-14T20:42:49.619212Z", "iopub.status.idle": "2024-10-14T20:42:49.633664Z", "shell.execute_reply": "2024-10-14T20:42:49.633136Z" }, "id": "5CnT4RzeOjcl" }, "outputs": [], "source": [ "geometry_df = Adapter.toDf(result, [\"left_user_data\"], [\"right_user_data\"], sedona)" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:49.636089Z", "iopub.status.busy": "2024-10-14T20:42:49.635583Z", "iopub.status.idle": "2024-10-14T20:42:49.870375Z", "shell.execute_reply": "2024-10-14T20:42:49.869542Z" }, "id": "qfaNUcXQOjcl", "outputId": "7f7a6112-dc45-4797-82f5-a3b839dd0c3d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+--------------+--------------------+---------------+\n", "| leftgeometry|left_user_data| rightgeometry|right_user_data|\n", "+--------------------+--------------+--------------------+---------------+\n", "|POLYGON ((-87.092...| |POINT (-86.94719 ...| null|\n", "|POLYGON ((-86.816...| |POINT (-86.709657...| null|\n", "|POLYGON ((-86.816...| |POINT (-86.717655...| null|\n", "|POLYGON ((-86.749...| |POINT (-86.736302...| null|\n", "|POLYGON ((-86.749...| |POINT (-86.735506...| null|\n", "+--------------------+--------------+--------------------+---------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "geometry_df.show(5)" ] }, { "cell_type": "markdown", "metadata": { "id": "TzWjjT0UOjcl" }, "source": [ "# Converting RDD Spatial join result to DF directly, avoiding jvm python serde" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:49.873266Z", "iopub.status.busy": "2024-10-14T20:42:49.872983Z", "iopub.status.idle": "2024-10-14T20:42:49.879900Z", "shell.execute_reply": "2024-10-14T20:42:49.879233Z" }, "id": "OB_t5UOcOjcl" }, "outputs": [], "source": [ "query_envelope = Envelope(-85.01, -60.01, 34.01, 50.01)\n", "\n", "result_range_query = RangeQueryRaw.SpatialRangeQuery(linestring_rdd, query_envelope, False, False)" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:49.882137Z", "iopub.status.busy": "2024-10-14T20:42:49.881926Z", "iopub.status.idle": "2024-10-14T20:42:49.890439Z", "shell.execute_reply": "2024-10-14T20:42:49.889882Z" }, "id": "58B9hoX_Ojcl" }, "outputs": [], "source": [ "# converting to df\n", "gdf = Adapter.toDf(result_range_query, sedona)" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:49.892569Z", "iopub.status.busy": "2024-10-14T20:42:49.892360Z", "iopub.status.idle": "2024-10-14T20:42:49.944872Z", "shell.execute_reply": "2024-10-14T20:42:49.944113Z" }, "id": "Rgna3MsmOjcl", "outputId": "d99b4253-854f-4206-c5de-ab593887751d" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+\n", "| geometry|\n", "+--------------------+\n", "|LINESTRING (-72.1...|\n", "|LINESTRING (-72.4...|\n", "|LINESTRING (-72.4...|\n", "|LINESTRING (-73.4...|\n", "|LINESTRING (-73.6...|\n", "+--------------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "gdf.show(5)" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:49.947689Z", "iopub.status.busy": "2024-10-14T20:42:49.947174Z", "iopub.status.idle": "2024-10-14T20:42:49.951669Z", "shell.execute_reply": "2024-10-14T20:42:49.950942Z" }, "id": "Gzw-LrkgOjcm", "outputId": "600213af-3162-4ef4-b93e-aa5cbdfcf081" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- geometry: geometry (nullable = true)\n", "\n" ] } ], "source": [ "gdf.printSchema()" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "execution": { "iopub.execute_input": "2024-10-14T20:42:49.953914Z", "iopub.status.busy": "2024-10-14T20:42:49.953727Z", "iopub.status.idle": "2024-10-14T20:42:49.961982Z", "shell.execute_reply": "2024-10-14T20:42:49.961371Z" }, "id": "C5-BtLhSOjcm" }, "outputs": [], "source": [ "# Passing column names\n", "# converting to df\n", "gdf_with_columns = Adapter.toDf(result_range_query, sedona, [\"_c1\"])" ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:49.964454Z", "iopub.status.busy": "2024-10-14T20:42:49.963855Z", "iopub.status.idle": "2024-10-14T20:42:50.073150Z", "shell.execute_reply": "2024-10-14T20:42:50.072252Z" }, "id": "r9ctqnn1Ojcm", "outputId": "713dd79f-f9c9-4b72-dbd0-c091654a6325" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+---+\n", "| geometry|_c1|\n", "+--------------------+---+\n", "|LINESTRING (-72.1...| |\n", "|LINESTRING (-72.4...| |\n", "|LINESTRING (-72.4...| |\n", "|LINESTRING (-73.4...| |\n", "|LINESTRING (-73.6...| |\n", "+--------------------+---+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "gdf_with_columns.show(5)" ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { "iopub.execute_input": "2024-10-14T20:42:50.075573Z", "iopub.status.busy": "2024-10-14T20:42:50.075343Z", "iopub.status.idle": "2024-10-14T20:42:50.079469Z", "shell.execute_reply": "2024-10-14T20:42:50.078942Z" }, "id": "h-xj5agIOjcm", "outputId": "1cd994f2-1f13-43b6-8b96-558be2ce2b02" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "root\n", " |-- geometry: geometry (nullable = true)\n", " |-- _c1: string (nullable = true)\n", "\n" ] } ], "source": [ "gdf_with_columns.printSchema()" ] }, { "cell_type": "markdown", "metadata": { "id": "VIQE0WpbUQjD" }, "source": [ "# Summary\n", "\n", "We have shown how to install Sedona with Pyspark and run a basic example (source: https://github.com/apache/sedona/blob/master/docs/usecases/ApacheSedonaCore.ipynb) on Google Colab. This demo uses the Spark engine provided by PySpark." ] } ], "metadata": { "colab": { "authorship_tag": "ABX9TyNd7fWhsusaKBDoXDhKLy4R", "include_colab_link": true, "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" } }, "nbformat": 4, "nbformat_minor": 0 }