{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# Example: Reading vector data"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "This example illustrates the how to read raster data using the HydroMT [DataCatalog](https://deltares.github.io/hydromt/latest/_generated/hydromt.data_catalog.DataCatalog.html) with the `vector` or `vector_table`  drivers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "import hydromt\n",
    "\n",
    "# Download artifacts for the Piave basin to `~/.hydromt_data/`.\n",
    "data_catalog = hydromt.DataCatalog(data_libs=[\"artifact_data=v1.0.0\"])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "## Pyogrio driver"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "4",
   "metadata": {},
   "source": [
    "To read vector data and parse it into a [geopandas.GeoDataFrame](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.html) object we use the [geopandas.read_file](https://geopandas.org/en/stable/docs/reference/api/geopandas.read_file.html) method, see the [geopandas documentation](https://geopandas.org/en/stable/docs/user_guide/io.html#reading-spatial-data) for details. Geopandas supports many file formats, see below. For large datasets we recommend using data formats which contain a spatial index, such as 'GeoPackage (GPKG)' or 'FlatGeoBuf' to speed up reading spatial subsets of the data. Here we use a spatial subset of the [Database of Global Administrative Areas (GADM)](https://gadm.org/download_world.html) level 3 units."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# inspect data source entry in data catalog yaml file\n",
    "data_catalog.get_source(\"gadm_level3\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "6",
   "metadata": {},
   "source": [
    "We can load any GeoDataFrame using the [get_geodataframe()](https://deltares.github.io/hydromt/latest/_generated/hydromt.data_catalog.DataCatalog.get_geodataframe.html) method of the DataCatalog. Note that if we don't provide any arguments it returns the full dataset with nine data variables and for the full spatial domain. Only the data coordinates are actually read, the data variables are still loaded lazy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7",
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf = data_catalog.get_geodataframe(\"gadm_level3\")\n",
    "print(f\"number of rows: {gdf.index.size}\")\n",
    "gdf.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "8",
   "metadata": {},
   "source": [
    "We can request a (spatial) subset data by providing additional `variables` and `bbox` / `geom` arguments. Note that this returns less polygons (rows) and only two columns with attribute data,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9",
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf_subset = data_catalog.get_geodataframe(\n",
    "    \"gadm_level3\", bbox=gdf[:5].total_bounds, variables=[\"GID_0\", \"NAME_3\"]\n",
    ")\n",
    "print(f\"number of rows: {gdf_subset.index.size}\")\n",
    "gdf_subset.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "10",
   "metadata": {},
   "source": [
    "## Vector_table driver"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create example point CSV data with funny `x` coordinate name and additional column\n",
    "import numpy as np\n",
    "\n",
    "path = \"tmpdir/xy.csv\"\n",
    "df = pd.DataFrame(\n",
    "    columns=[\"x_centroid\", \"y\"],\n",
    "    data=np.vstack([gdf_subset.centroid.x, gdf_subset.centroid.y]).T,\n",
    ")\n",
    "df[\"name\"] = gdf_subset[\"NAME_3\"]\n",
    "df.to_csv(path)  # write to file\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Because the data we wrote does not live in the root of the data catalog we'll have to\n",
    "# start with a new one\n",
    "data_catalog = hydromt.DataCatalog(data_libs=None)\n",
    "\n",
    "# Create data source entry for the data catalog for the new csv data\n",
    "# NOTE that we add specify the name of the x coordinate with the `x_dim` argument, while\n",
    "# the y coordinate is understood by HydroMT.\n",
    "data_source = {\n",
    "    \"GADM_level3_centroids\": {\n",
    "        \"uri\": path,\n",
    "        \"data_type\": \"GeoDataFrame\",\n",
    "        \"driver\": {\"name\": \"geodataframe_table\", \"options\": {\"x_dim\": \"x_centroid\"}},\n",
    "        \"metadata\": {\"crs\": 4326},\n",
    "    }\n",
    "}\n",
    "data_catalog.from_dict(data_source)\n",
    "data_catalog.get_source(\"GADM_level3_centroids\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_catalog.get_source(\"GADM_level3_centroids\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14",
   "metadata": {},
   "outputs": [],
   "source": [
    "# we can then read the data back as a GeoDataFrame\n",
    "gdf_centroid = data_catalog.get_geodataframe(\"GADM_level3_centroids\")\n",
    "print(f\"CRS: {gdf_centroid.crs}\")\n",
    "gdf_centroid.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "15",
   "metadata": {},
   "source": [
    "## Visualize vector data"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "16",
   "metadata": {},
   "source": [
    "The data can be visualized with the [.plot()](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.plot.html) geopandas method. In an interactive environment you can also try the [.explore()](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.explore.html) method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17",
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = gdf.plot()\n",
    "gdf_subset.plot(ax=ax, color=\"red\")\n",
    "gdf_centroid.plot(ax=ax, color=\"k\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  },
  "vscode": {
   "interpreter": {
    "hash": "3808d5b5b54949c7a0a707a38b0a689040fa9c90ab139a050e41373880719ab1"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}