{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": [],
"toc_visible": true,
"authorship_tag": "ABX9TyM9JVOiZXxIFObMpyuhPLLD",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
""
]
},
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"# Apache Sedona with PySpark\n",
"\n",
"Apache Sedona™ is\n",
"\n",
"> *a cluster computing system for processing large-scale spatial data. Sedona extends existing cluster computing systems, such as Apache Spark, Apache Flink, and Snowflake, with a set of out-of-the-box distributed Spatial Datasets and Spatial SQL that efficiently load, process, and analyze large-scale spatial data across machines.* ([https://sedona.apache.org/](https://sedona.apache.org/))\n",
"\n",
"To execute a basic Sedona demonstration using PySpark on Google Colab, we made a few minor adjustments. The Sedona notebook starts below at [Apache Sedona Core demo](#scrollTo=Apache_Sedona_Core_demo).\n",
"\n"
],
"metadata": {
"id": "DNWKwG4TAcD2"
}
},
{
"cell_type": "markdown",
"source": [
"## Install Apache Sedona and PySpark\n",
"\n",
"To start with, we are going to install PySpark with Sedona following the instructions at: https://sedona.apache.org/latest-snapshot/setup/install-python/ but first we need to downgrade `shapely` because the version 2.0.2 that comes with Google Colab does not play well with the current version of Apache Sedona (see https://shapely.readthedocs.io/en/stable/migration.html)."
],
"metadata": {
"id": "3AQNoWmX_B78"
}
},
{
"cell_type": "markdown",
"source": [
"### Downgrade Shapely to version 1.7.1\n",
"\n",
"We need to install install any version of `shapely>=1.7.0` but smaller than `2.0`. We picked `1.7.1` because with 1.7.0 we got the error\n",
"\n",
" geopandas 0.13.2 requires shapely>=1.7.1, but you have shapely 1.7.0 which is incompatible.\n",
"\n",
"Explanation for `pip -I`:\n",
"\n",
"- [`-I, --ignore-installed`](https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-I)\n",
"> Ignore the installed packages, overwriting them. This can break your system if the existing package is of a different version or was installed with a different package manager!\n",
"\n",
"\n",
"\n",
"\n",
"\n"
],
"metadata": {
"id": "zDoBDC3yAVvI"
}
},
{
"cell_type": "code",
"source": [
"!pip install -I shapely==1.7.1"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "_4nnDMo0Blfu",
"outputId": "cf77f2ad-6e17-43da-8578-3ff535497efa"
},
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Collecting shapely==1.7.1\n",
" Using cached Shapely-1.7.1-cp310-cp310-linux_x86_64.whl\n",
"Installing collected packages: shapely\n",
"\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
"lida 0.0.10 requires fastapi, which is not installed.\n",
"lida 0.0.10 requires kaleido, which is not installed.\n",
"lida 0.0.10 requires python-multipart, which is not installed.\n",
"lida 0.0.10 requires uvicorn, which is not installed.\u001b[0m\u001b[31m\n",
"\u001b[0mSuccessfully installed shapely-1.7.1\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### Install Geopandas\n",
"\n",
"This step is only needed outside of Colab because on Google Colab `geopandas` is available by default."
],
"metadata": {
"id": "76yI9hGZz_Ss"
}
},
{
"cell_type": "code",
"source": [
"!pip install geopandas==0.13.2"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "KliMM6ka0Qee",
"outputId": "170bc61d-9911-4929-8a51-7d82fd24d6aa"
},
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: geopandas==0.13.2 in /usr/local/lib/python3.10/dist-packages (0.13.2)\n",
"Requirement already satisfied: fiona>=1.8.19 in /usr/local/lib/python3.10/dist-packages (from geopandas==0.13.2) (1.9.5)\n",
"Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from geopandas==0.13.2) (23.2)\n",
"Requirement already satisfied: pandas>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from geopandas==0.13.2) (1.5.3)\n",
"Requirement already satisfied: pyproj>=3.0.1 in /usr/local/lib/python3.10/dist-packages (from geopandas==0.13.2) (3.6.1)\n",
"Requirement already satisfied: shapely>=1.7.1 in /usr/local/lib/python3.10/dist-packages (from geopandas==0.13.2) (1.7.1)\n",
"Requirement already satisfied: attrs>=19.2.0 in /usr/local/lib/python3.10/dist-packages (from fiona>=1.8.19->geopandas==0.13.2) (23.2.0)\n",
"Requirement already satisfied: certifi in /usr/local/lib/python3.10/dist-packages (from fiona>=1.8.19->geopandas==0.13.2) (2024.2.2)\n",
"Requirement already satisfied: click~=8.0 in /usr/local/lib/python3.10/dist-packages (from fiona>=1.8.19->geopandas==0.13.2) (8.1.7)\n",
"Requirement already satisfied: click-plugins>=1.0 in /usr/local/lib/python3.10/dist-packages (from fiona>=1.8.19->geopandas==0.13.2) (1.1.1)\n",
"Requirement already satisfied: cligj>=0.5 in /usr/local/lib/python3.10/dist-packages (from fiona>=1.8.19->geopandas==0.13.2) (0.7.2)\n",
"Requirement already satisfied: six in /usr/local/lib/python3.10/dist-packages (from fiona>=1.8.19->geopandas==0.13.2) (1.16.0)\n",
"Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from fiona>=1.8.19->geopandas==0.13.2) (67.7.2)\n",
"Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.1.0->geopandas==0.13.2) (2.8.2)\n",
"Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.1.0->geopandas==0.13.2) (2023.4)\n",
"Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.1.0->geopandas==0.13.2) (1.23.5)\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"### Install Apache Sedona and PySpark\n",
"\n",
"We can now install Apache Sedona together with PySpark (and Spark)."
],
"metadata": {
"id": "Vv5bhPBIDQ5G"
}
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "SOB7DNZ6AOio",
"outputId": "5aa22c6b-d2eb-41ce-dd2e-f46a2dd2be69"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: apache-sedona[spark] in /usr/local/lib/python3.10/dist-packages (1.5.1)\n",
"Requirement already satisfied: attrs in /usr/local/lib/python3.10/dist-packages (from apache-sedona[spark]) (23.2.0)\n",
"Requirement already satisfied: shapely>=1.7.0 in /usr/local/lib/python3.10/dist-packages (from apache-sedona[spark]) (1.7.1)\n",
"Requirement already satisfied: pyspark>=2.3.0 in /usr/local/lib/python3.10/dist-packages (from apache-sedona[spark]) (3.5.0)\n",
"Requirement already satisfied: py4j==0.10.9.7 in /usr/local/lib/python3.10/dist-packages (from pyspark>=2.3.0->apache-sedona[spark]) (0.10.9.7)\n"
]
}
],
"source": [
"!pip install apache-sedona[spark]"
]
},
{
"cell_type": "code",
"source": [
"%env SPARK_HOME = \"/usr/local/lib/python3.10/dist-packages/pyspark\""
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ZoNNqcGaGrEh",
"outputId": "aa79c622-24dd-413d-e86c-4e8f379e0b0b"
},
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"env: SPARK_HOME=\"/usr/local/lib/python3.10/dist-packages/pyspark\"\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"%env PYTHONPATH = /usr/local/lib/python3.10/dist-packages/pyspark/python"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "z9fpbLrjHBeY",
"outputId": "5dade95c-b20b-496f-a8d1-504ef69d100d"
},
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"env: PYTHONPATH=/usr/local/lib/python3.10/dist-packages/pyspark/python\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!pip info pyspark"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "d_YNI8fKiU3k",
"outputId": "1cb25835-1990-44ed-8a65-2afeafc4dc32"
},
"execution_count": 6,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"ERROR: unknown command \"info\"\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"## Setup environment variables\n",
"\n",
"We need to set two environment variables:\n",
"\n",
"- `SPARK_HOME`\n",
"- `PYTHONPATH`\n",
"\n",
"Once we have set `SPARK_HOME`, the variable `PYTHONPATH` is `$SPARK_HOME/python`.\n",
"\n",
"### Find Spark home\n",
"\n",
"There's an utility to find Spark home and I always forget how it's called exactly, what I remember is that it contains `\"find\"` and `\"spark\"`. Let us search for it:"
],
"metadata": {
"id": "Sjlr4JO-Erv3"
}
},
{
"cell_type": "code",
"source": [
"!find / -name \"*find*spark*\""
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "oK5vO9MUGQA6",
"outputId": "c018213e-e7cc-4494-a45f-587f175d9eb7"
},
"execution_count": 7,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"/usr/local/bin/find_spark_home.py\n",
"/usr/local/bin/__pycache__/find_spark_home.cpython-310.pyc\n",
"/usr/local/bin/find-spark-home.cmd\n",
"/usr/local/bin/find-spark-home\n",
"/usr/local/lib/python3.10/dist-packages/pyspark/bin/find-spark-home.cmd\n",
"/usr/local/lib/python3.10/dist-packages/pyspark/bin/find-spark-home\n",
"/usr/local/lib/python3.10/dist-packages/pyspark/find_spark_home.py\n",
"/usr/local/lib/python3.10/dist-packages/pyspark/__pycache__/find_spark_home.cpython-310.pyc\n",
"find: ‘/proc/61/task/61/net’: Invalid argument\n",
"find: ‘/proc/61/net’: Invalid argument\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"The script `/usr/local/bin/find_spark_home.py` is successful at finding Spark's home."
],
"metadata": {
"id": "Nr_2rb6pFzF8"
}
},
{
"cell_type": "markdown",
"source": [
"#### Set `SPARK_HOME`"
],
"metadata": {
"id": "qxlEqcY9LRI4"
}
},
{
"cell_type": "code",
"source": [
"import sys\n",
"import os\n",
"IN_COLAB = 'google.colab' in sys.modules\n",
"if IN_COLAB:\n",
" output = !python /usr/local/bin/find_spark_home.py\n",
"else:\n",
" output = !find / -name \"pyspark\" -type d 2>/dev/null|head -1\n",
"# Store the output using %store\n",
"%store output\n",
"# get rid of extra quotation marks\n",
"os.environ['SPARK_HOME'] = output[0].replace('\"', '')"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "GCgU9MnaIZ_k",
"outputId": "58d228a7-3a12-43f6-a7c7-eaedd18efe61"
},
"execution_count": 8,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Stored 'output' (SList)\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!pip show pyspark"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "olIKq1bOkMDy",
"outputId": "6140ea63-9eca-4456-9c3c-a1a8db16f283"
},
"execution_count": 9,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Name: pyspark\n",
"Version: 3.5.0\n",
"Summary: Apache Spark Python API\n",
"Home-page: https://github.com/apache/spark/tree/master/python\n",
"Author: Spark Developers\n",
"Author-email: dev@spark.apache.org\n",
"License: http://www.apache.org/licenses/LICENSE-2.0\n",
"Location: /usr/local/lib/python3.10/dist-packages\n",
"Requires: py4j\n",
"Required-by: \n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"Verify that the correct `SPARK_HOME` has been set."
],
"metadata": {
"id": "Ic1uHIKAK1ss"
}
},
{
"cell_type": "code",
"source": [
"os.environ['SPARK_HOME']"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0
},
"id": "BraJPa5KZY1y",
"outputId": "8a434517-6d46-4bfd-edb3-f6a21ae93141"
},
"execution_count": 10,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'/usr/local/lib/python3.10/dist-packages/pyspark'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 10
}
]
},
{
"cell_type": "code",
"source": [
"%env SPARK_HOME"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0
},
"id": "sONAYA_DKOry",
"outputId": "65c0353e-a303-4574-e044-7fe05e405920"
},
"execution_count": 11,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'/usr/local/lib/python3.10/dist-packages/pyspark'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 11
}
]
},
{
"cell_type": "markdown",
"source": [
"#### Set `PYTHONPATH`"
],
"metadata": {
"id": "gQ3zuxd9LKxA"
}
},
{
"cell_type": "code",
"source": [
"os.environ['PYTHONPATH'] = os.environ['SPARK_HOME'] + '/python'"
],
"metadata": {
"id": "SOXsRzqNLAqx"
},
"execution_count": 12,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Check"
],
"metadata": {
"id": "HTzSh-RBLbsg"
}
},
{
"cell_type": "code",
"source": [
"%env PYTHONPATH"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0
},
"id": "Log-_SFWJScq",
"outputId": "636e1c0a-282c-402b-90bc-8eac7f15db93"
},
"execution_count": 13,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'/usr/local/lib/python3.10/dist-packages/pyspark/python'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 13
}
]
},
{
"cell_type": "markdown",
"source": [
"## Download data\n",
"\n",
"In order to run, the Sedona notebook expects to find some specific files in the local folder `data`. Let us populate `data` with the files from the Sedona Github repository."
],
"metadata": {
"id": "qi3MxEWpV9CL"
}
},
{
"cell_type": "code",
"source": [
"%%bash\n",
"# it would be more efficient to just download the \"data\" folder and not the whole repo\n",
"[ -d sedona ] || git clone https://github.com/apache/sedona.git\n",
"\n",
"cp -r sedona/binder/data ./"
],
"metadata": {
"id": "wnzi7JCOIA1v"
},
"execution_count": 14,
"outputs": []
},
{
"cell_type": "markdown",
"source": [
"Verify the presence of data in the designated `data` folder."
],
"metadata": {
"id": "qQVz6ckaScn8"
}
},
{
"cell_type": "code",
"source": [
"!ls data"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Rxy-t1JnRnf6",
"outputId": "0cfc7552-1a33-407e-9a9d-bfc7bcd901b8"
},
"execution_count": 15,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"arealm-small.csv\t gis_osm_pois_free_1.shp\t primaryroads-polygon.csv\n",
"county_small.tsv\t gis_osm_pois_free_1.shx\t raster\n",
"county_small_wkb.tsv\t ne_50m_admin_0_countries_lakes testpoint.csv\n",
"gis_osm_pois_free_1.cpg ne_50m_airports\t\t testPolygon.json\n",
"gis_osm_pois_free_1.dbf polygon\t\t\t zcta510-small.csv\n",
"gis_osm_pois_free_1.prj primaryroads-linestring.csv\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"# Apache Sedona Core demo\n",
"\n",
"The notebook is available at the following link:\n",
"https://github.com/apache/sedona/blob/master/binder/ApacheSedonaCore.ipynb\n"
],
"metadata": {
"id": "Mk20FV3ZHj_U"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "AcjGVcGzOjcK"
},
"source": [
"```\n",
"Licensed to the Apache Software Foundation (ASF) under one\n",
"or more contributor license agreements. See the NOTICE file\n",
"distributed with this work for additional information\n",
"regarding copyright ownership. The ASF licenses this file\n",
"to you under the Apache License, Version 2.0 (the\n",
"\"License\"); you may not use this file except in compliance\n",
"with the License. You may obtain a copy of the License at\n",
" http://www.apache.org/licenses/LICENSE-2.0\n",
"Unless required by applicable law or agreed to in writing,\n",
"software distributed under the License is distributed on an\n",
"\"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
"KIND, either express or implied. See the License for the\n",
"specific language governing permissions and limitations\n",
"under the License.\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"id": "sK9oIz0FOjcN",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "a4ce9ce6-a014-4279-8d23-a9555a13574a"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Skipping SedonaKepler import, verify if keplergl is installed\n",
"Skipping SedonaPyDeck import, verify if pydeck is installed\n"
]
}
],
"source": [
"from pyspark.sql import SparkSession\n",
"from pyspark import StorageLevel\n",
"import geopandas as gpd\n",
"import pandas as pd\n",
"from pyspark.sql.types import StructType\n",
"from pyspark.sql.types import StructField\n",
"from pyspark.sql.types import StringType\n",
"from pyspark.sql.types import LongType\n",
"from shapely.geometry import Point\n",
"from shapely.geometry import Polygon\n",
"\n",
"from sedona.spark import *\n",
"from sedona.core.geom.envelope import Envelope\n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"id": "HzVqFLBtOjcO"
},
"outputs": [],
"source": [
"config = SedonaContext.builder() .\\\n",
" config('spark.jars.packages',\n",
" 'org.apache.sedona:sedona-spark-3.4_2.12:1.5.1,'\n",
" 'org.datasyslab:geotools-wrapper:1.5.1-28.2,'\n",
" 'uk.co.gresearch.spark:spark-extension_2.12:2.11.0-3.4'). \\\n",
" config('spark.jars.repositories', 'https://artifacts.unidata.ucar.edu/repository/unidata-all'). \\\n",
" getOrCreate()\n",
"\n",
"sedona = SedonaContext.create(config)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"id": "JI32GOnEOjcO"
},
"outputs": [],
"source": [
"sc = sedona.sparkContext"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "OdQM3oaYOjcO"
},
"source": [
"# Create SpatialRDD"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "X2umXxYfOjcO"
},
"source": [
"## Reading to PointRDD from CSV file"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "j8Rgv--0OjcP"
},
"source": [
"Suppose we want load the CSV file into Apache Sedona PointRDD\n",
"```\n",
"testattribute0,-88.331492,32.324142,testattribute1,testattribute2\n",
"testattribute0,-88.175933,32.360763,testattribute1,testattribute2\n",
"testattribute0,-88.388954,32.357073,testattribute1,testattribute2\n",
"testattribute0,-88.221102,32.35078,testattribute1,testattribute2\n",
"testattribute0,-88.323995,32.950671,testattribute1,testattribute2\n",
"testattribute0,-88.231077,32.700812,testattribute1,testattribute2\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"id": "gusbYu1fOjcP"
},
"outputs": [],
"source": [
"point_rdd = PointRDD(sc, \"data/arealm-small.csv\", 1, FileDataSplitter.CSV, True, 10)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"id": "Ykc5FwouOjcP",
"outputId": "4a981655-2238-4bb4-fd67-7c56584585a0",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"3000"
]
},
"metadata": {},
"execution_count": 20
}
],
"source": [
"## Getting approximate total count\n",
"point_rdd.approximateTotalCount"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"id": "PdEjrrRHOjcP",
"outputId": "fc0d5620-a4eb-4e37-e819-3e04e16ddcd4",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 121
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Envelope(-173.120769, -84.965961, 30.244859, 71.355134)"
],
"image/svg+xml": ""
},
"metadata": {},
"execution_count": 21
}
],
"source": [
"# getting boundary for PointRDD or any other SpatialRDD, it returns Enelope object which inherits from\n",
"# shapely.geometry.Polygon\n",
"point_rdd.boundary()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"id": "MDV2-f2tOjcP",
"outputId": "626448bb-0e07-4463-9830-7ec607546fcd",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"True"
]
},
"metadata": {},
"execution_count": 22
}
],
"source": [
"# To run analyze please use function analyze\n",
"point_rdd.analyze()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"id": "puMA8o6POjcP",
"outputId": "3b701a97-f5ca-4316-e345-35416125e9f8",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 121
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Envelope(-173.120769, -84.965961, 30.244859, 71.355134)"
],
"image/svg+xml": ""
},
"metadata": {},
"execution_count": 23
}
],
"source": [
"# Finding boundary envelope for PointRDD or any other SpatialRDD, it returns Enelope object which inherits from\n",
"# shapely.geometry.Polygon\n",
"point_rdd.boundaryEnvelope"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"id": "hgOCi8lJOjcQ",
"outputId": "fbf49d82-845e-4ddb-a178-c6d5817de5d6",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"2996"
]
},
"metadata": {},
"execution_count": 24
}
],
"source": [
"# Calculate number of records without duplicates\n",
"point_rdd.countWithoutDuplicates()"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"id": "-l1BvhakOjcQ",
"outputId": "f67bcf6c-e663-4fa3-ee71-bccd904b6806",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"''"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 25
}
],
"source": [
"# Getting source epsg code\n",
"point_rdd.getSourceEpsgCode()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"id": "tiFEDeNdOjcQ",
"outputId": "cc91916a-dcbf-442f-ff95-662975ddc04d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 35
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"''"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 26
}
],
"source": [
"# Getting target epsg code\n",
"point_rdd.getTargetEpsgCode()"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"id": "oip1bNogOjcQ",
"outputId": "dfb01317-0542-4763-ee44-96b6e5eff05a",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"True"
]
},
"metadata": {},
"execution_count": 27
}
],
"source": [
"# Spatial partitioning data\n",
"point_rdd.spatialPartitioning(GridType.KDBTREE)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "setaW5adOjcQ"
},
"source": [
"## Operations on RawSpatialRDD"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "XeAQe3qqOjcQ"
},
"source": [
"rawSpatialRDD method returns RDD which consists of GeoData objects which has 2 attributes\n",
"
geom: shapely.geometry.BaseGeometry \n",
" userData: str \n",
"\n",
"You can use any operations on those objects and spread across machines"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"id": "pNlkO6kYOjcR",
"outputId": "d2a40cc1-dca5-413a-e4f0-d22c0c2be1a1",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[Geometry: Point userData: testattribute0\ttestattribute1\ttestattribute2]"
]
},
"metadata": {},
"execution_count": 28
}
],
"source": [
"# take firs element\n",
"point_rdd.rawSpatialRDD.take(1)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"id": "22UNTZiIOjcR",
"outputId": "ec6cad29-26a3-44c6-8d0e-477397864945",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[Geometry: Point userData: testattribute0\ttestattribute1\ttestattribute2,\n",
" Geometry: Point userData: testattribute0\ttestattribute1\ttestattribute2,\n",
" Geometry: Point userData: testattribute0\ttestattribute1\ttestattribute2,\n",
" Geometry: Point userData: testattribute0\ttestattribute1\ttestattribute2,\n",
" Geometry: Point userData: testattribute0\ttestattribute1\ttestattribute2]"
]
},
"metadata": {},
"execution_count": 29
}
],
"source": [
"# collect to Python list\n",
"point_rdd.rawSpatialRDD.collect()[:5]"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"id": "_vz8jzj6OjcR",
"outputId": "305812a3-9435-412f-d331-0d1f09544b41",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[111.08786851399313,\n",
" 110.92828303170774,\n",
" 111.1385974283527,\n",
" 110.97450594034112,\n",
" 110.97122518072091]"
]
},
"metadata": {},
"execution_count": 30
}
],
"source": [
"# apply map functions, for example distance to Point(52 21)\n",
"point_rdd.rawSpatialRDD.map(lambda x: x.geom.distance(Point(21, 52))).take(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "VJUwfnWbOjcR"
},
"source": [
"## Transforming to GeoPandas"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HJT4cqunOjcR"
},
"source": [
"## Loaded data can be transformed to GeoPandas DataFrame in a few ways"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ENst5TKKOjcR"
},
"source": [
"### Directly from RDD"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"id": "khvSdVtROjcR"
},
"outputs": [],
"source": [
"point_rdd_to_geo = point_rdd.rawSpatialRDD.map(lambda x: [x.geom, *x.getUserData().split(\"\\t\")])"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"id": "h_aQLibwOjcR"
},
"outputs": [],
"source": [
"point_gdf = gpd.GeoDataFrame(\n",
" point_rdd_to_geo.collect(), columns=[\"geom\", \"attr1\", \"attr2\", \"attr3\"], geometry=\"geom\"\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"id": "Ht7qYNuwOjcS",
"outputId": "a5f39f31-a0f7-414d-d3fe-17d7625cf0e4",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" geom attr1 attr2 attr3\n",
"0 POINT (-88.33149 32.32414) testattribute0 testattribute1 testattribute2\n",
"1 POINT (-88.17593 32.36076) testattribute0 testattribute1 testattribute2\n",
"2 POINT (-88.38895 32.35707) testattribute0 testattribute1 testattribute2\n",
"3 POINT (-88.22110 32.35078) testattribute0 testattribute1 testattribute2\n",
"4 POINT (-88.32399 32.95067) testattribute0 testattribute1 testattribute2"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" geom | \n",
" attr1 | \n",
" attr2 | \n",
" attr3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" POINT (-88.33149 32.32414) | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
"
\n",
" \n",
" 1 | \n",
" POINT (-88.17593 32.36076) | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
"
\n",
" \n",
" 2 | \n",
" POINT (-88.38895 32.35707) | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
"
\n",
" \n",
" 3 | \n",
" POINT (-88.22110 32.35078) | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
"
\n",
" \n",
" 4 | \n",
" POINT (-88.32399 32.95067) | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
]
},
"metadata": {},
"execution_count": 33
}
],
"source": [
"point_gdf[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "2vW97p_iOjcS"
},
"source": [
"### Using Adapter"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"tags": [],
"id": "HN0IIhZ8OjcS"
},
"outputs": [],
"source": [
"# Adapter allows you to convert geospatial data types introduced with sedona to other ones"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"id": "69dR65hgOjcS"
},
"outputs": [],
"source": [
"spatial_df = Adapter.\\\n",
" toDf(point_rdd, [\"attr1\", \"attr2\", \"attr3\"], sedona).\\\n",
" createOrReplaceTempView(\"spatial_df\")\n",
"\n",
"spatial_gdf = sedona.sql(\"Select attr1, attr2, attr3, geometry as geom from spatial_df\")"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"id": "TnGQMPUIOjcS",
"outputId": "d3923b4f-12a4-43df-f874-8cba67e04bef",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"+--------------+--------------+--------------+----------------------------+\n",
"|attr1 |attr2 |attr3 |geom |\n",
"+--------------+--------------+--------------+----------------------------+\n",
"|testattribute0|testattribute1|testattribute2|POINT (-88.331492 32.324142)|\n",
"|testattribute0|testattribute1|testattribute2|POINT (-88.175933 32.360763)|\n",
"|testattribute0|testattribute1|testattribute2|POINT (-88.388954 32.357073)|\n",
"|testattribute0|testattribute1|testattribute2|POINT (-88.221102 32.35078) |\n",
"|testattribute0|testattribute1|testattribute2|POINT (-88.323995 32.950671)|\n",
"+--------------+--------------+--------------+----------------------------+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"spatial_gdf.show(5, False)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"id": "IdL0awafOjcS",
"outputId": "cafcd06d-0376-4b32-ff79-dd96936865e9",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" attr1 attr2 attr3 geom\n",
"0 testattribute0 testattribute1 testattribute2 POINT (-88.33149 32.32414)\n",
"1 testattribute0 testattribute1 testattribute2 POINT (-88.17593 32.36076)\n",
"2 testattribute0 testattribute1 testattribute2 POINT (-88.38895 32.35707)\n",
"3 testattribute0 testattribute1 testattribute2 POINT (-88.22110 32.35078)\n",
"4 testattribute0 testattribute1 testattribute2 POINT (-88.32399 32.95067)"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" attr1 | \n",
" attr2 | \n",
" attr3 | \n",
" geom | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
" POINT (-88.33149 32.32414) | \n",
"
\n",
" \n",
" 1 | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
" POINT (-88.17593 32.36076) | \n",
"
\n",
" \n",
" 2 | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
" POINT (-88.38895 32.35707) | \n",
"
\n",
" \n",
" 3 | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
" POINT (-88.22110 32.35078) | \n",
"
\n",
" \n",
" 4 | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
" POINT (-88.32399 32.95067) | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
]
},
"metadata": {},
"execution_count": 37
}
],
"source": [
"gpd.GeoDataFrame(spatial_gdf.toPandas(), geometry=\"geom\")[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ilmR8RjIOjcS"
},
"source": [
"### With DataFrame creation"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"id": "LqgFLBZEOjcT"
},
"outputs": [],
"source": [
"schema = StructType(\n",
" [\n",
" StructField(\"geometry\", GeometryType(), False),\n",
" StructField(\"attr1\", StringType(), False),\n",
" StructField(\"attr2\", StringType(), False),\n",
" StructField(\"attr3\", StringType(), False),\n",
" ]\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"id": "tEO42a4DOjcT"
},
"outputs": [],
"source": [
"geo_df = sedona.createDataFrame(point_rdd_to_geo, schema, verifySchema=False)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"id": "PaIAqir1OjcT",
"outputId": "89b2c8b0-f645-4ae0-c1be-660c5d7e60ea",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" geometry attr1 attr2 attr3\n",
"0 POINT (-88.33149 32.32414) testattribute0 testattribute1 testattribute2\n",
"1 POINT (-88.17593 32.36076) testattribute0 testattribute1 testattribute2\n",
"2 POINT (-88.38895 32.35707) testattribute0 testattribute1 testattribute2\n",
"3 POINT (-88.22110 32.35078) testattribute0 testattribute1 testattribute2\n",
"4 POINT (-88.32399 32.95067) testattribute0 testattribute1 testattribute2"
],
"text/html": [
"\n",
" \n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" geometry | \n",
" attr1 | \n",
" attr2 | \n",
" attr3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" POINT (-88.33149 32.32414) | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
"
\n",
" \n",
" 1 | \n",
" POINT (-88.17593 32.36076) | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
"
\n",
" \n",
" 2 | \n",
" POINT (-88.38895 32.35707) | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
"
\n",
" \n",
" 3 | \n",
" POINT (-88.22110 32.35078) | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
"
\n",
" \n",
" 4 | \n",
" POINT (-88.32399 32.95067) | \n",
" testattribute0 | \n",
" testattribute1 | \n",
" testattribute2 | \n",
"
\n",
" \n",
"
\n",
"
\n",
"
\n",
"
\n"
]
},
"metadata": {},
"execution_count": 40
}
],
"source": [
"gpd.GeoDataFrame(geo_df.toPandas(), geometry=\"geometry\")[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kO1EyFACOjcT"
},
"source": [
"# Load Typed SpatialRDDs"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "w_0_hHNQOjcT"
},
"source": [
"Currently The library supports 5 typed SpatialRDDs:\n",
" RectangleRDD \n",
" PointRDD \n",
" PolygonRDD \n",
" LineStringRDD \n",
" CircleRDD "
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"id": "2m2k-n-LOjcT"
},
"outputs": [],
"source": [
"rectangle_rdd = RectangleRDD(sc, \"data/zcta510-small.csv\", FileDataSplitter.CSV, True, 11)\n",
"point_rdd = PointRDD(sc, \"data/arealm-small.csv\", 1, FileDataSplitter.CSV, False, 11)\n",
"polygon_rdd = PolygonRDD(sc, \"data/primaryroads-polygon.csv\", FileDataSplitter.CSV, True, 11)\n",
"linestring_rdd = LineStringRDD(sc, \"data/primaryroads-linestring.csv\", FileDataSplitter.CSV, True)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"id": "qRHPdLYdOjcb",
"outputId": "ef9149b6-ad2c-46f0-fba0-ae478f802ba9",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"True"
]
},
"metadata": {},
"execution_count": 42
}
],
"source": [
"rectangle_rdd.analyze()\n",
"point_rdd.analyze()\n",
"polygon_rdd.analyze()\n",
"linestring_rdd.analyze()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "J2uTMMk1Ojcb"
},
"source": [
"# Spatial Partitioning"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "h3AC5CpdOjcc"
},
"source": [
"Apache Sedona spatial partitioning method can significantly speed up the join query. Three spatial partitioning methods are available: KDB-Tree, Quad-Tree and R-Tree. Two SpatialRDD must be partitioned by the same way."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"id": "2icbQf9NOjcc",
"outputId": "a8f20089-8b5d-42bf-b6ad-ad23f57ee496",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"True"
]
},
"metadata": {},
"execution_count": 43
}
],
"source": [
"point_rdd.spatialPartitioning(GridType.KDBTREE)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "_ipdd7umOjcc"
},
"source": [
"# Create Index"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hEPaVPmJOjcc"
},
"source": [
"Apache Sedona provides two types of spatial indexes, Quad-Tree and R-Tree. Once you specify an index type, Apache Sedona will build a local tree index on each of the SpatialRDD partition."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"id": "3YIsx_55Ojcc"
},
"outputs": [],
"source": [
"point_rdd.buildIndex(IndexType.RTREE, True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LRHcULQDOjcc"
},
"source": [
"# SpatialJoin"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZnnDo119Ojcc"
},
"source": [
"Spatial join is operation which combines data based on spatial relations like:\n",
" intersects \n",
" touches \n",
" within \n",
" etc \n",
"\n",
"To Use Spatial Join in GeoPyspark library please use JoinQuery object, which has implemented below methods:\n",
"```python\n",
"SpatialJoinQuery(spatialRDD: SpatialRDD, queryRDD: SpatialRDD, useIndex: bool, considerBoundaryIntersection: bool) -> RDD\n",
"\n",
"DistanceJoinQuery(spatialRDD: SpatialRDD, queryRDD: SpatialRDD, useIndex: bool, considerBoundaryIntersection: bool) -> RDD\n",
"\n",
"spatialJoin(queryWindowRDD: SpatialRDD, objectRDD: SpatialRDD, joinParams: JoinParams) -> RDD\n",
"\n",
"DistanceJoinQueryFlat(spatialRDD: SpatialRDD, queryRDD: SpatialRDD, useIndex: bool, considerBoundaryIntersection: bool) -> RDD\n",
"\n",
"SpatialJoinQueryFlat(spatialRDD: SpatialRDD, queryRDD: SpatialRDD, useIndex: bool, considerBoundaryIntersection: bool) -> RDD\n",
"\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "icOL3ukTOjcc"
},
"source": [
"## Example SpatialJoinQueryFlat PointRDD with RectangleRDD"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"id": "unG1rYNZOjcc"
},
"outputs": [],
"source": [
"# partitioning the data\n",
"point_rdd.spatialPartitioning(GridType.KDBTREE)\n",
"rectangle_rdd.spatialPartitioning(point_rdd.getPartitioner())\n",
"# building an index\n",
"point_rdd.buildIndex(IndexType.RTREE, True)\n",
"# Perform Spatial Join Query\n",
"result = JoinQuery.SpatialJoinQueryFlat(point_rdd, rectangle_rdd, False, True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5A0s6CE5Ojcd"
},
"source": [
"As result we will get RDD[GeoData, GeoData]\n",
"It can be used like any other Python RDD. You can use map, take, collect and other functions "
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"id": "8dw_JRafOjcd",
"outputId": "d688a5ec-27e0-4fe6-846d-bdaf98a824a3",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"MapPartitionsRDD[63] at map at FlatPairRddConverter.scala:30"
]
},
"metadata": {},
"execution_count": 46
}
],
"source": [
"result"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"id": "7QKdzz9ZOjcd",
"outputId": "565047a0-8f8c-4abd-a370-47731886c95c",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[[Geometry: Polygon userData: , Geometry: Point userData: ],\n",
" [Geometry: Polygon userData: , Geometry: Point userData: ]]"
]
},
"metadata": {},
"execution_count": 47
}
],
"source": [
"result.take(2)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"id": "cBMsFNe2Ojcd",
"outputId": "166c9992-13e8-4603-c306-2fbd8cb0b815",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[[Geometry: Polygon userData: , Geometry: Point userData: ],\n",
" [Geometry: Polygon userData: , Geometry: Point userData: ],\n",
" [Geometry: Polygon userData: , Geometry: Point userData: ]]"
]
},
"metadata": {},
"execution_count": 48
}
],
"source": [
"result.collect()[:3]"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"id": "ZUPAfZ0wOjcd",
"outputId": "7139d469-1c67-452f-dfc6-5ce6062bf81c",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[0.0, 0.0, 0.0, 0.0, 0.0]"
]
},
"metadata": {},
"execution_count": 49
}
],
"source": [
"# getting distance using SpatialObjects\n",
"result.map(lambda x: x[0].geom.distance(x[1].geom)).take(5)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"id": "2ztnCLMnOjcd",
"outputId": "7ad1f07a-7103-467a-b97d-2adcb496694b",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[0.026651558685001447,\n",
" 0.026651558685001447,\n",
" 0.026651558685001447,\n",
" 0.026651558685001447,\n",
" 0.026651558685001447]"
]
},
"metadata": {},
"execution_count": 50
}
],
"source": [
"# getting area of polygon data\n",
"result.map(lambda x: x[0].geom.area).take(5)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"id": "nt-r8k9VOjcd"
},
"outputs": [],
"source": [
"# Base on result you can create DataFrame object, using map function and build DataFrame from RDD"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"id": "nALxyjQiOjce"
},
"outputs": [],
"source": [
"schema = StructType(\n",
" [\n",
" StructField(\"geom_left\", GeometryType(), False),\n",
" StructField(\"geom_right\", GeometryType(), False)\n",
" ]\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"id": "UWqys-lqOjce",
"outputId": "8b85bdf9-ea59-4306-fe79-cb47af321128",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"+--------------------+--------------------+\n",
"| geom_left| geom_right|\n",
"+--------------------+--------------------+\n",
"|POLYGON ((-87.229...|POINT (-87.105455...|\n",
"|POLYGON ((-87.229...|POINT (-87.10534 ...|\n",
"|POLYGON ((-87.229...|POINT (-87.160372...|\n",
"|POLYGON ((-87.229...|POINT (-87.204033...|\n",
"|POLYGON ((-87.229...|POINT (-87.204299...|\n",
"+--------------------+--------------------+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"# Set verifySchema to False\n",
"spatial_join_result = result.map(lambda x: [x[0].geom, x[1].geom])\n",
"sedona.createDataFrame(spatial_join_result, schema, verifySchema=False).show(5, True)"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"id": "pkY70z0lOjce"
},
"outputs": [],
"source": [
"# Above code produces DataFrame with geometry Data type"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"id": "_BQVrY-IOjce",
"outputId": "87a00dfd-cd77-47a2-f73a-48c63be79c8f",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"root\n",
" |-- geom_left: geometry (nullable = false)\n",
" |-- geom_right: geometry (nullable = false)\n",
"\n"
]
}
],
"source": [
"sedona.createDataFrame(spatial_join_result, schema, verifySchema=False).printSchema()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5oID6P5kOjce"
},
"source": [
"We can create DataFrame object from Spatial Pair RDD using Adapter object as follows"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"id": "OyD3_m_zOjce",
"outputId": "27fb6d7d-013e-480a-de6d-80e4a72c1fc7",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"+--------------------+-----+--------------------+-----+\n",
"| geom_1|attr1| geom_2|attr2|\n",
"+--------------------+-----+--------------------+-----+\n",
"|POLYGON ((-87.229...| |POINT (-87.105455...| |\n",
"|POLYGON ((-87.229...| |POINT (-87.10534 ...| |\n",
"|POLYGON ((-87.229...| |POINT (-87.160372...| |\n",
"|POLYGON ((-87.229...| |POINT (-87.204033...| |\n",
"|POLYGON ((-87.229...| |POINT (-87.204299...| |\n",
"+--------------------+-----+--------------------+-----+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"Adapter.toDf(result, [\"attr1\"], [\"attr2\"], sedona).show(5, True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qAmXD2SKOjce"
},
"source": [
"This also produce DataFrame with geometry DataType"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"id": "U7evFgqkOjce",
"outputId": "530f8329-c296-47e6-8ccf-642bf0596817",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"root\n",
" |-- geom_1: geometry (nullable = true)\n",
" |-- attr1: string (nullable = true)\n",
" |-- geom_2: geometry (nullable = true)\n",
" |-- attr2: string (nullable = true)\n",
"\n"
]
}
],
"source": [
"Adapter.toDf(result, [\"attr1\"], [\"attr2\"], sedona).printSchema()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TldNHgcQOjcf"
},
"source": [
"We can create RDD which will be of type RDD[GeoData, List[GeoData]]\n",
"We can for example calculate number of Points within some polygon data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GlhD3mSvOjcf"
},
"source": [
"To do that we can use code specified below"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"id": "bl5mg9qdOjcf"
},
"outputs": [],
"source": [
"point_rdd.spatialPartitioning(GridType.KDBTREE)\n",
"rectangle_rdd.spatialPartitioning(point_rdd.getPartitioner())"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"id": "6S1gs7nDOjcf"
},
"outputs": [],
"source": [
"spatial_join_result_non_flat = JoinQuery.SpatialJoinQuery(point_rdd, rectangle_rdd, False, True)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"id": "ZJiTacNkOjcf"
},
"outputs": [],
"source": [
"# number of point for each polygon\n",
"number_of_points = spatial_join_result_non_flat.map(lambda x: [x[0].geom, x[1].__len__()])"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"id": "-EtrJVuROjcf"
},
"outputs": [],
"source": [
"schema = StructType([\n",
" StructField(\"geometry\", GeometryType(), False),\n",
" StructField(\"number_of_points\", LongType(), False)\n",
"])"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"id": "mktoa5-HOjcf",
"outputId": "170b19d6-9498-4dbb-86d9-831fe2337708",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"+--------------------+----------------+\n",
"| geometry|number_of_points|\n",
"+--------------------+----------------+\n",
"|POLYGON ((-87.114...| 15|\n",
"|POLYGON ((-87.082...| 12|\n",
"|POLYGON ((-86.697...| 1|\n",
"|POLYGON ((-87.285...| 26|\n",
"|POLYGON ((-87.105...| 15|\n",
"|POLYGON ((-86.816...| 6|\n",
"|POLYGON ((-87.229...| 7|\n",
"|POLYGON ((-87.092...| 5|\n",
"|POLYGON ((-86.749...| 4|\n",
"|POLYGON ((-86.860...| 12|\n",
"+--------------------+----------------+\n",
"\n"
]
}
],
"source": [
"sedona.createDataFrame(number_of_points, schema, verifySchema=False).show()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZosmctfPOjcf"
},
"source": [
"# KNNQuery"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "utC2fQXwOjcg"
},
"source": [
"Spatial KNNQuery is operation which help us find answer which k number of geometries lays closest to other geometry.\n",
"\n",
"For Example:\n",
" 5 closest Shops to your home. To use Spatial KNNQuery please use object\n",
" KNNQuery which has one method:\n",
"```python\n",
"SpatialKnnQuery(spatialRDD: SpatialRDD, originalQueryPoint: BaseGeometry, k: int, useIndex: bool)-> List[GeoData]\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "sTcvQarvOjcg"
},
"source": [
"### Finds 5 closest points from PointRDD to given Point"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"id": "n6BIfKk9Ojcg"
},
"outputs": [],
"source": [
"result = KNNQuery.SpatialKnnQuery(point_rdd, Point(-84.01, 34.01), 5, False)"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"id": "i-SgOWD2Ojcg",
"outputId": "a69d9f13-1aeb-4aed-e726-ab1767ee5143",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[Geometry: Point userData: ,\n",
" Geometry: Point userData: ,\n",
" Geometry: Point userData: ,\n",
" Geometry: Point userData: ,\n",
" Geometry: Point userData: ]"
]
},
"metadata": {},
"execution_count": 64
}
],
"source": [
"result"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Dft5_7AMOjcg"
},
"source": [
"As Reference geometry you can also use Polygon or LineString object"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"id": "C-G_a1iXOjcg"
},
"outputs": [],
"source": [
"polygon = Polygon(\n",
" [(-84.237756, 33.904859), (-84.237756, 34.090426),\n",
" (-83.833011, 34.090426), (-83.833011, 33.904859),\n",
" (-84.237756, 33.904859)\n",
" ])\n",
"polygons_nearby = KNNQuery.SpatialKnnQuery(polygon_rdd, polygon, 5, False)"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"id": "pYtK1T1oOjcg",
"outputId": "d6b951cc-bca0-4b83-d42f-d8dde0ab7a8f",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[Geometry: Polygon userData: ,\n",
" Geometry: Polygon userData: ,\n",
" Geometry: Polygon userData: ,\n",
" Geometry: Polygon userData: ,\n",
" Geometry: Polygon userData: ]"
]
},
"metadata": {},
"execution_count": 66
}
],
"source": [
"polygons_nearby"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"id": "-4Mf5TP-Ojcg",
"outputId": "2edfc5ae-f8d9-4225-c88e-12bb74d58a87",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 0
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'POLYGON ((-83.993559 34.087259, -83.993559 34.131247, -83.959903 34.131247, -83.959903 34.087259, -83.993559 34.087259))'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 67
}
],
"source": [
"polygons_nearby[0].geom.wkt"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SxP3ufPGOjch"
},
"source": [
"# RangeQuery"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Air8I_XIOjch"
},
"source": [
"A spatial range query takes as input a range query window and an SpatialRDD and returns all geometries that intersect / are fully covered by the query window.\n",
"RangeQuery has one method:\n",
"\n",
"```python\n",
"SpatialRangeQuery(self, spatialRDD: SpatialRDD, rangeQueryWindow: BaseGeometry, considerBoundaryIntersection: bool, usingIndex: bool) -> RDD\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"id": "7YQd4wvuOjch"
},
"outputs": [],
"source": [
"from sedona.core.geom.envelope import Envelope"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"id": "8WLf9PLbOjch"
},
"outputs": [],
"source": [
"query_envelope = Envelope(-85.01, -60.01, 34.01, 50.01)\n",
"\n",
"result_range_query = RangeQuery.SpatialRangeQuery(linestring_rdd, query_envelope, False, False)"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"id": "9Wp7sOciOjch",
"outputId": "ae9362ae-4ee1-4fa5-84aa-7b1053c17229",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"MapPartitionsRDD[127] at map at GeometryRddConverter.scala:30"
]
},
"metadata": {},
"execution_count": 70
}
],
"source": [
"result_range_query"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"id": "C1mittdHOjch",
"outputId": "c93a70cd-cabb-4e56-b399-99f23a9b7778",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[Geometry: LineString userData: ,\n",
" Geometry: LineString userData: ,\n",
" Geometry: LineString userData: ,\n",
" Geometry: LineString userData: ,\n",
" Geometry: LineString userData: ,\n",
" Geometry: LineString userData: ]"
]
},
"metadata": {},
"execution_count": 71
}
],
"source": [
"result_range_query.take(6)"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"id": "Jv3ik2obOjch"
},
"outputs": [],
"source": [
"# Creating DataFrame from result"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"id": "OF2xJfLCOjch"
},
"outputs": [],
"source": [
"schema = StructType([StructField(\"geometry\", GeometryType(), False)])"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"id": "7Y4ruCQ-Ojci",
"outputId": "23bc8a1d-d16a-4ba7-a490-3b39b9d782cd",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"+--------------------+\n",
"| geometry|\n",
"+--------------------+\n",
"|LINESTRING (-72.1...|\n",
"|LINESTRING (-72.4...|\n",
"|LINESTRING (-72.4...|\n",
"|LINESTRING (-73.4...|\n",
"|LINESTRING (-73.6...|\n",
"+--------------------+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"sedona.createDataFrame(\n",
" result_range_query.map(lambda x: [x.geom]),\n",
" schema,\n",
" verifySchema=False\n",
").show(5, True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jwiSJcuLOjci"
},
"source": [
"# Load From other Formats"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aLuQLIfJOjci"
},
"source": [
"GeoPyspark allows to load the data from other Data formats like:\n",
" GeoJSON \n",
" Shapefile \n",
" WKB \n",
" WKT "
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"id": "vuiEDP9uOjci"
},
"outputs": [],
"source": [
"## ShapeFile - load to SpatialRDD"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"id": "3bcx_gxAOjci"
},
"outputs": [],
"source": [
"shape_rdd = ShapefileReader.readToGeometryRDD(sc, \"data/polygon\")"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"id": "nIjo83o8Ojci",
"outputId": "69c02c33-e22f-44a5-d6e0-fb509f3b4e7f",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
""
]
},
"metadata": {},
"execution_count": 77
}
],
"source": [
"shape_rdd"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"id": "4y9EXCKfOjci",
"outputId": "62c206c8-e84f-4532-c8f7-8075d519164d",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"+--------------------+\n",
"| geometry|\n",
"+--------------------+\n",
"|MULTIPOLYGON (((1...|\n",
"|MULTIPOLYGON (((-...|\n",
"|MULTIPOLYGON (((1...|\n",
"|POLYGON ((118.362...|\n",
"|MULTIPOLYGON (((-...|\n",
"+--------------------+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"Adapter.toDf(shape_rdd, sedona).show(5, True)"
]
},
{
"cell_type": "code",
"execution_count": 79,
"metadata": {
"id": "LkVpdZgGOjci"
},
"outputs": [],
"source": [
"## GeoJSON - load to SpatialRDD"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Mh1K7181Ojcj"
},
"source": [
"```\n",
"{ \"type\": \"Feature\", \"properties\": { \"STATEFP\": \"01\", \"COUNTYFP\": \"077\", \"TRACTCE\": \"011501\", \"BLKGRPCE\": \"5\", \"AFFGEOID\": \"1500000US010770115015\", \"GEOID\": \"010770115015\", \"NAME\": \"5\", \"LSAD\": \"BG\", \"ALAND\": 6844991, \"AWATER\": 32636 }, \"geometry\": { \"type\": \"Polygon\", \"coordinates\": [ [ [ -87.621765, 34.873444 ], [ -87.617535, 34.873369 ], [ -87.6123, 34.873337 ], [ -87.604049, 34.873303 ], [ -87.604033, 34.872316 ], [ -87.60415, 34.867502 ], [ -87.604218, 34.865687 ], [ -87.604409, 34.858537 ], [ -87.604018, 34.851336 ], [ -87.603716, 34.844829 ], [ -87.603696, 34.844307 ], [ -87.603673, 34.841884 ], [ -87.60372, 34.841003 ], [ -87.603879, 34.838423 ], [ -87.603888, 34.837682 ], [ -87.603889, 34.83763 ], [ -87.613127, 34.833938 ], [ -87.616451, 34.832699 ], [ -87.621041, 34.831431 ], [ -87.621056, 34.831526 ], [ -87.62112, 34.831925 ], [ -87.621603, 34.8352 ], [ -87.62158, 34.836087 ], [ -87.621383, 34.84329 ], [ -87.621359, 34.844438 ], [ -87.62129, 34.846387 ], [ -87.62119, 34.85053 ], [ -87.62144, 34.865379 ], [ -87.621765, 34.873444 ] ] ] } },\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 80,
"metadata": {
"id": "g_a3q71ZOjcj"
},
"outputs": [],
"source": [
"geo_json_rdd = GeoJsonReader.readToGeometryRDD(sc, \"data/testPolygon.json\")"
]
},
{
"cell_type": "code",
"execution_count": 81,
"metadata": {
"id": "LGrMegQ6Ojcj",
"outputId": "faf4d1b1-a49a-4139-c881-dffce997ea0d",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
""
]
},
"metadata": {},
"execution_count": 81
}
],
"source": [
"geo_json_rdd"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {
"id": "ZfH32XssOjcj",
"outputId": "a773dbe0-962f-4c45-c1a4-4b44914b69f5",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"+--------------------+-------+--------+-------+--------+--------------------+------------+----+----+--------+\n",
"| geometry|STATEFP|COUNTYFP|TRACTCE|BLKGRPCE| AFFGEOID| GEOID|NAME|LSAD| ALAND|\n",
"+--------------------+-------+--------+-------+--------+--------------------+------------+----+----+--------+\n",
"|POLYGON ((-87.621...| 01| 077| 011501| 5|1500000US01077011...|010770115015| 5| BG| 6844991|\n",
"|POLYGON ((-85.719...| 01| 045| 021102| 4|1500000US01045021...|010450211024| 4| BG|11360854|\n",
"|POLYGON ((-86.000...| 01| 055| 001300| 3|1500000US01055001...|010550013003| 3| BG| 1378742|\n",
"|POLYGON ((-86.574...| 01| 089| 001700| 2|1500000US01089001...|010890017002| 2| BG| 1040641|\n",
"|POLYGON ((-85.382...| 01| 069| 041400| 1|1500000US01069041...|010690414001| 1| BG| 8243574|\n",
"+--------------------+-------+--------+-------+--------+--------------------+------------+----+----+--------+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"Adapter.toDf(geo_json_rdd, sedona).drop(\"AWATER\").show(5, True)"
]
},
{
"cell_type": "code",
"execution_count": 83,
"metadata": {
"id": "wuJkRwbROjcj"
},
"outputs": [],
"source": [
"## WKT - loading to SpatialRDD"
]
},
{
"cell_type": "code",
"execution_count": 84,
"metadata": {
"id": "0d9owuZROjcj"
},
"outputs": [],
"source": [
"wkt_rdd = WktReader.readToGeometryRDD(sc, \"data/county_small.tsv\", 0, True, False)"
]
},
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"id": "XKALKBbeOjcj",
"outputId": "e0b7e2d7-7626-4154-8028-518c9b8d6804",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
""
]
},
"metadata": {},
"execution_count": 85
}
],
"source": [
"wkt_rdd"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {
"id": "UUJfqy5EOjcj",
"outputId": "7806b0df-cd77-4990-ca9e-b4ce8a27a646",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"root\n",
" |-- geometry: geometry (nullable = true)\n",
"\n"
]
}
],
"source": [
"Adapter.toDf(wkt_rdd, sedona).printSchema()"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {
"id": "qjIgGZB2Ojck",
"outputId": "8785e560-ad7c-419c-98e0-cdfc8cfe7d5e",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"+--------------------+\n",
"| geometry|\n",
"+--------------------+\n",
"|POLYGON ((-97.019...|\n",
"|POLYGON ((-123.43...|\n",
"|POLYGON ((-104.56...|\n",
"|POLYGON ((-96.910...|\n",
"|POLYGON ((-98.273...|\n",
"+--------------------+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"Adapter.toDf(wkt_rdd, sedona).show(5, True)"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {
"id": "9CzKWF_ZOjck"
},
"outputs": [],
"source": [
"## WKB - load to SpatialRDD"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"id": "1-saFLNoOjck"
},
"outputs": [],
"source": [
"wkb_rdd = WkbReader.readToGeometryRDD(sc, \"data/county_small_wkb.tsv\", 0, True, False)"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"id": "WFUhy3AtOjck",
"outputId": "46d24d80-46b7-4165-cde1-eb52817ed7a9",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"+--------------------+\n",
"| geometry|\n",
"+--------------------+\n",
"|POLYGON ((-97.019...|\n",
"|POLYGON ((-123.43...|\n",
"|POLYGON ((-104.56...|\n",
"|POLYGON ((-96.910...|\n",
"|POLYGON ((-98.273...|\n",
"+--------------------+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"Adapter.toDf(wkb_rdd, sedona).show(5, True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "xj-mt-60Ojck"
},
"source": [
"## Converting RDD Spatial join result to DF directly, avoiding jvm python serde"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {
"id": "5DTG6EvFOjck"
},
"outputs": [],
"source": [
"point_rdd.spatialPartitioning(GridType.KDBTREE)\n",
"rectangle_rdd.spatialPartitioning(point_rdd.getPartitioner())\n",
"# building an index\n",
"point_rdd.buildIndex(IndexType.RTREE, True)\n",
"# Perform Spatial Join Query\n",
"result = JoinQueryRaw.SpatialJoinQueryFlat(point_rdd, rectangle_rdd, False, True)"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {
"id": "Fs8XaQ6kOjck"
},
"outputs": [],
"source": [
"# without passing column names, the result will contain only two geometries columns\n",
"geometry_df = Adapter.toDf(result, sedona)"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {
"id": "qiWJjHz7Ojck",
"outputId": "f5c12650-bce0-4d87-cacc-dcc82f943b41",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"root\n",
" |-- leftgeometry: geometry (nullable = true)\n",
" |-- rightgeometry: geometry (nullable = true)\n",
"\n"
]
}
],
"source": [
"geometry_df.printSchema()"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {
"id": "CUuhG1HHOjcl",
"outputId": "963d7d66-c39f-44ef-8139-772dd2bdc563",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"+--------------------+--------------------+\n",
"| leftgeometry| rightgeometry|\n",
"+--------------------+--------------------+\n",
"|POLYGON ((-87.229...|POINT (-87.105455...|\n",
"|POLYGON ((-87.229...|POINT (-87.10534 ...|\n",
"|POLYGON ((-87.229...|POINT (-87.160372...|\n",
"|POLYGON ((-87.229...|POINT (-87.204033...|\n",
"|POLYGON ((-87.229...|POINT (-87.204299...|\n",
"+--------------------+--------------------+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"geometry_df.show(5)"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {
"id": "J_113xc7Ojcl",
"outputId": "13894e68-ca5a-4fc4-ee26-f8bc7216c8d3",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"Row(leftgeometry=, rightgeometry=)"
]
},
"metadata": {},
"execution_count": 95
}
],
"source": [
"geometry_df.collect()[0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "S-eRk6AROjcl"
},
"source": [
"## Passing column names"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {
"id": "5CnT4RzeOjcl"
},
"outputs": [],
"source": [
"geometry_df = Adapter.toDf(result, [\"left_user_data\"], [\"right_user_data\"], sedona)"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {
"id": "qfaNUcXQOjcl",
"outputId": "fa42ffa3-5d2c-4aaa-f2f0-de165a56554d",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"+--------------------+--------------+--------------------+---------------+\n",
"| leftgeometry|left_user_data| rightgeometry|right_user_data|\n",
"+--------------------+--------------+--------------------+---------------+\n",
"|POLYGON ((-87.229...| |POINT (-87.105455...| null|\n",
"|POLYGON ((-87.229...| |POINT (-87.10534 ...| null|\n",
"|POLYGON ((-87.229...| |POINT (-87.160372...| null|\n",
"|POLYGON ((-87.229...| |POINT (-87.204033...| null|\n",
"|POLYGON ((-87.229...| |POINT (-87.204299...| null|\n",
"+--------------------+--------------+--------------------+---------------+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"geometry_df.show(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TzWjjT0UOjcl"
},
"source": [
"# Converting RDD Spatial join result to DF directly, avoiding jvm python serde"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {
"id": "OB_t5UOcOjcl"
},
"outputs": [],
"source": [
"query_envelope = Envelope(-85.01, -60.01, 34.01, 50.01)\n",
"\n",
"result_range_query = RangeQueryRaw.SpatialRangeQuery(linestring_rdd, query_envelope, False, False)"
]
},
{
"cell_type": "code",
"execution_count": 99,
"metadata": {
"id": "58B9hoX_Ojcl"
},
"outputs": [],
"source": [
"# converting to df\n",
"gdf = Adapter.toDf(result_range_query, sedona)"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {
"id": "Rgna3MsmOjcl",
"outputId": "f7412685-61ac-4712-cf53-a938170315f4",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"+--------------------+\n",
"| geometry|\n",
"+--------------------+\n",
"|LINESTRING (-72.1...|\n",
"|LINESTRING (-72.4...|\n",
"|LINESTRING (-72.4...|\n",
"|LINESTRING (-73.4...|\n",
"|LINESTRING (-73.6...|\n",
"+--------------------+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"gdf.show(5)"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {
"id": "Gzw-LrkgOjcm",
"outputId": "924b5905-530a-4058-d6d0-3f36d9e5c230",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"root\n",
" |-- geometry: geometry (nullable = true)\n",
"\n"
]
}
],
"source": [
"gdf.printSchema()"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {
"id": "C5-BtLhSOjcm"
},
"outputs": [],
"source": [
"# Passing column names\n",
"# converting to df\n",
"gdf_with_columns = Adapter.toDf(result_range_query, sedona, [\"_c1\"])"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {
"id": "r9ctqnn1Ojcm",
"outputId": "a874be8b-00fd-4b9f-c4e5-a5f8006b4202",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"+--------------------+---+\n",
"| geometry|_c1|\n",
"+--------------------+---+\n",
"|LINESTRING (-72.1...| |\n",
"|LINESTRING (-72.4...| |\n",
"|LINESTRING (-72.4...| |\n",
"|LINESTRING (-73.4...| |\n",
"|LINESTRING (-73.6...| |\n",
"+--------------------+---+\n",
"only showing top 5 rows\n",
"\n"
]
}
],
"source": [
"gdf_with_columns.show(5)"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {
"id": "h-xj5agIOjcm",
"outputId": "30469257-a9ee-4711-8dbe-0527ff9e22f6",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"root\n",
" |-- geometry: geometry (nullable = true)\n",
" |-- _c1: string (nullable = true)\n",
"\n"
]
}
],
"source": [
"gdf_with_columns.printSchema()"
]
},
{
"cell_type": "markdown",
"source": [
"# Summary\n",
"\n",
"We have shown how to install Sedona with Pyspark and run a basic example (source: https://github.com/apache/sedona/blob/master/binder/ApacheSedonaCore.ipynb) on Google Colab."
],
"metadata": {
"id": "VIQE0WpbUQjD"
}
}
]
}