---
name: geopandas-geospatial
description: >-
  Geospatial vector analysis extending pandas. Read/write spatial formats
  (Shapefile, GeoJSON, GeoPackage, Parquet, PostGIS), CRS handling, geometric
  ops (buffer, simplify, centroid, affine), spatial analysis (joins, overlays,
  dissolve, clipping, distance), visualization (choropleth, interactive maps,
  basemaps). Use for spatial joins, overlays, CRS transforms, area/distance, maps.
license: BSD-3-Clause
---

# GeoPandas Geospatial Analysis

## Overview

GeoPandas extends pandas with spatial operations on geometric types, combining pandas DataFrames with Shapely geometries and Fiona for file I/O. It enables reading, writing, manipulating, and visualizing geospatial vector data (points, lines, polygons) with a familiar pandas-like API.

## When to Use

- Reading and writing spatial file formats (Shapefile, GeoJSON, GeoPackage, Parquet)
- Performing spatial joins between geographic datasets (points in polygons, nearest neighbors)
- Running overlay operations (intersection, union, difference, clipping)
- Computing geometric properties (area, distance, buffer, centroid)
- Creating choropleth maps and interactive web maps
- Reprojecting data between coordinate reference systems
- Aggregating spatial features by attribute (dissolve)
- For raster data analysis, use rasterio/xarray instead
- For large-scale distributed geospatial, consider Dask-GeoPandas or Apache Sedona

## Prerequisites

```bash
pip install geopandas matplotlib
# Optional:
# pip install folium       — interactive maps
# pip install mapclassify  — classification schemes for choropleth
# pip install contextily   — basemaps
# pip install pyarrow      — faster I/O (2-4x speedup)
# pip install psycopg2 geoalchemy2  — PostGIS support
```

## Quick Start

```python
import geopandas as gpd

# Read spatial data
gdf = gpd.read_file("data.geojson")
print(f"Shape: {gdf.shape}, CRS: {gdf.crs}")
print(f"Geometry types: {gdf.geometry.geom_type.unique()}")

# Reproject, compute area, save
gdf_proj = gdf.to_crs("EPSG:3857")
gdf_proj['area_m2'] = gdf_proj.geometry.area
gdf_proj.to_file("output.gpkg")

# Quick map
gdf.plot(column='population', legend=True, figsize=(10, 8))
```

## Core API

### 1. Data I/O

```python
import geopandas as gpd

# Read various formats
gdf = gpd.read_file("data.shp")           # Shapefile
gdf = gpd.read_file("data.geojson")       # GeoJSON
gdf = gpd.read_file("data.gpkg")          # GeoPackage
gdf = gpd.read_file("data.gpkg", layer="roads")  # Specific layer

# Filtered reading (load only needed data)
gdf = gpd.read_file("data.gpkg", bbox=(xmin, ymin, xmax, ymax))
gdf = gpd.read_file("data.gpkg", columns=["name", "geometry"])
gdf = gpd.read_file("data.gpkg", where="population > 10000")

# Arrow acceleration (2-4x faster)
gdf = gpd.read_file("data.gpkg", use_arrow=True)

# Parquet/Feather (columnar, fast, preserves CRS)
gdf = gpd.read_parquet("data.parquet")
gdf.to_parquet("output.parquet")

# PostGIS database
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:pass@host/db")
gdf = gpd.read_postgis("SELECT * FROM parcels", con=engine, geom_col='geom')
gdf.to_postgis("output_table", con=engine)

# Write
gdf.to_file("output.gpkg")             # GeoPackage (recommended)
gdf.to_file("output.shp")              # Shapefile
gdf.to_file("output.geojson", driver="GeoJSON")
```

### 2. CRS Management

```python
# Check current CRS
print(gdf.crs)                  # e.g., EPSG:4326
print(gdf.crs.is_geographic)    # True for lat/lon
print(gdf.crs.is_projected)     # True for meters

# Reproject (transforms coordinates)
gdf_proj = gdf.to_crs("EPSG:3857")       # Web Mercator
gdf_proj = gdf.to_crs(epsg=32633)        # UTM zone 33N

# Set CRS (only when metadata missing, does NOT transform coordinates)
gdf = gdf.set_crs("EPSG:4326")

# Estimate appropriate UTM zone
utm_crs = gdf.estimate_utm_crs()
gdf_utm = gdf.to_crs(utm_crs)
```

**Common EPSG codes**:

| Code | Name | Use |
|------|------|-----|
| 4326 | WGS 84 | GPS coordinates, web data |
| 3857 | Web Mercator | Web mapping (Google/OSM tiles) |
| 326xx | UTM zones (N) | Area/distance calculations |
| 5070 | Albers Equal Area (US) | Area-preserving US maps |

### 3. Geometric Operations

```python
# Buffer (expand/erode geometry by distance)
buffered = gdf.geometry.buffer(100)      # 100 units (meters if projected)
eroded = gdf.geometry.buffer(-50)        # Negative = erosion

# Simplify (reduce complexity)
simplified = gdf.geometry.simplify(tolerance=10, preserve_topology=True)

# Centroid, convex hull, envelope
centroids = gdf.geometry.centroid
hulls = gdf.geometry.convex_hull
bounds = gdf.geometry.envelope

# Union all geometries
unified = gdf.geometry.union_all()

# Affine transformations
rotated = gdf.geometry.rotate(angle=45, origin='center')
scaled = gdf.geometry.scale(xfact=2.0, yfact=2.0)
translated = gdf.geometry.translate(xoff=100, yoff=50)

# Geometric properties
areas = gdf.geometry.area           # Use projected CRS for accuracy
lengths = gdf.geometry.length       # Perimeter for polygons
is_valid = gdf.geometry.is_valid    # Validate geometry
total = gdf.geometry.total_bounds   # [minx, miny, maxx, maxy]
```

### 4. Spatial Analysis

```python
# Spatial join (combine datasets by spatial relationship)
joined = gpd.sjoin(points_gdf, polygons_gdf, predicate='intersects')
joined = gpd.sjoin(gdf1, gdf2, predicate='within')
joined = gpd.sjoin(gdf1, gdf2, predicate='contains', how='left')

# Nearest neighbor join
nearest = gpd.sjoin_nearest(gdf1, gdf2, max_distance=1000, distance_col='dist')

# Overlay operations (set-theoretic)
intersection = gpd.overlay(gdf1, gdf2, how='intersection')
union = gpd.overlay(gdf1, gdf2, how='union')
difference = gpd.overlay(gdf1, gdf2, how='difference')
sym_diff = gpd.overlay(gdf1, gdf2, how='symmetric_difference')

# Dissolve (aggregate by attribute)
dissolved = gdf.dissolve(by='region', aggfunc='sum')
dissolved = gdf.dissolve(by='region', aggfunc={'population': 'sum', 'area': 'mean'})

# Clip to boundary
clipped = gpd.clip(gdf, boundary_gdf)

# Distance calculations (use projected CRS)
distances = gdf.geometry.distance(single_point)

# Spatial predicates
within_mask = gdf1.geometry.within(gdf2.geometry)
intersects_mask = gdf1.geometry.intersects(gdf2.geometry)
```

### 5. Visualization

```python
import matplotlib.pyplot as plt

# Basic plot
gdf.plot(figsize=(10, 8))

# Choropleth map
gdf.plot(column='population', cmap='YlOrRd', legend=True, figsize=(12, 8))

# Classification schemes (requires mapclassify)
gdf.plot(column='population', scheme='quantiles', k=5, legend=True)
gdf.plot(column='population', scheme='fisher_jenks', k=5, legend=True)

# Multi-layer map
fig, ax = plt.subplots(figsize=(12, 10))
polygons_gdf.plot(ax=ax, color='lightblue', edgecolor='black')
points_gdf.plot(ax=ax, color='red', markersize=10)
roads_gdf.plot(ax=ax, color='gray', linewidth=0.5)
ax.set_title('Multi-layer Map')
ax.set_axis_off()

# Interactive map (requires folium)
m = gdf.explore(column='population', cmap='YlOrRd', legend=True,
                tooltip=['name', 'population'])
m.save('map.html')

# Multi-layer interactive
m = gdf1.explore(color='blue', name='Layer 1')
gdf2.explore(m=m, color='red', name='Layer 2')
import folium
folium.LayerControl().add_to(m)

# Basemap (requires contextily)
import contextily as ctx
gdf_wm = gdf.to_crs(epsg=3857)
ax = gdf_wm.plot(alpha=0.5, figsize=(10, 10))
ctx.add_basemap(ax)
```

## Key Concepts

### Data Structures

- **GeoSeries**: Pandas Series of Shapely geometries with spatial methods (area, distance, buffer, etc.)
- **GeoDataFrame**: Pandas DataFrame with one or more geometry columns. One column is the "active geometry" used by spatial methods

```python
from shapely.geometry import Point

# Create from coordinates
gdf = gpd.GeoDataFrame(
    {'name': ['A', 'B'], 'value': [10, 20]},
    geometry=[Point(0, 0), Point(1, 1)],
    crs="EPSG:4326"
)

# Multiple geometry columns
gdf['centroid'] = gdf.geometry.centroid
gdf = gdf.set_geometry('centroid')  # Switch active geometry
```

### CRS Rules for Spatial Operations

- **Always check CRS** before any spatial operation: `print(gdf.crs)`
- **Match CRS** before spatial joins, overlays, or distance calculations
- **Use projected CRS** (meters) for area/distance — geographic CRS (degrees) gives wrong results
- **set_crs()** only adds metadata; **to_crs()** transforms coordinates

### Spatial Indexing

GeoPandas automatically creates spatial indexes (R-tree) for sjoin, overlay, and other spatial operations. For manual queries:

```python
sindex = gdf.sindex
possible_idx = list(sindex.intersection((xmin, ymin, xmax, ymax)))
```

## Common Workflows

### Workflow 1: Load → Transform → Analyze → Export

```python
import geopandas as gpd

# Load
gdf = gpd.read_file("parcels.shp")
print(f"CRS: {gdf.crs}, Rows: {len(gdf)}")

# Transform to projected CRS for measurements
gdf = gdf.to_crs(gdf.estimate_utm_crs())

# Analyze
gdf['area_ha'] = gdf.geometry.area / 10000  # hectares
gdf['perimeter_m'] = gdf.geometry.length
print(f"Total area: {gdf['area_ha'].sum():.1f} ha")

# Export
gdf.to_file("parcels_analyzed.gpkg")
```

### Workflow 2: Spatial Join and Aggregate

```python
# Count points per polygon
points_in_poly = gpd.sjoin(points_gdf, polygons_gdf, predicate='within')
counts = points_in_poly.groupby('index_right').agg(
    point_count=('geometry', 'size'),
    total_value=('value', 'sum')
)
result = polygons_gdf.merge(counts, left_index=True, right_index=True, how='left')
result['point_count'] = result['point_count'].fillna(0)
print(f"Polygons with points: {(result['point_count'] > 0).sum()}/{len(result)}")
```

### Workflow 3: Multi-Source Integration

```python
# Read from different sources, ensure matching CRS
roads = gpd.read_file("roads.shp")
buildings = gpd.read_file("buildings.geojson")
parcels = gpd.read_postgis("SELECT * FROM parcels", con=engine, geom_col='geom')

target_crs = roads.crs
buildings = buildings.to_crs(target_crs)
parcels = parcels.to_crs(target_crs)

# Find buildings within 50m of roads
buildings_near_roads = gpd.sjoin_nearest(
    buildings, roads, max_distance=50, distance_col='road_dist'
)
print(f"Buildings near roads: {len(buildings_near_roads)}/{len(buildings)}")
```

## Key Parameters

| Parameter | Function | Default | Effect |
|-----------|----------|---------|--------|
| `predicate` | sjoin | `'intersects'` | Spatial relationship: intersects, within, contains, touches, crosses |
| `how` | sjoin, overlay | `'inner'` | Join type: inner, left, right |
| `max_distance` | sjoin_nearest | None | Search radius limit (improves performance) |
| `k` | sjoin_nearest | 1 | Number of nearest neighbors to find |
| `tolerance` | simplify | Required | Douglas-Peucker tolerance (in CRS units) |
| `preserve_topology` | simplify | True | Prevents self-intersections |
| `resolution` | buffer | 16 | Number of segments for buffer curves |
| `aggfunc` | dissolve | 'first' | Aggregation function for non-geometry columns |
| `use_arrow` | read_file | False | Enable Arrow acceleration (2-4x faster) |
| `scheme` | plot | None | Classification: quantiles, equal_interval, fisher_jenks |
| `k` | plot (with scheme) | 5 | Number of classification bins |

## Best Practices

1. **Always check CRS** before spatial operations — mismatched CRS gives wrong or empty results
2. **Use projected CRS** for area and distance calculations — geographic CRS (degrees) is meaningless for measurements
3. **Match CRS before joins** — `gdf2 = gdf2.to_crs(gdf1.crs)` before `gpd.sjoin(gdf1, gdf2)`
4. **Validate geometries** with `.is_valid` before complex operations — invalid geometries cause silent errors
5. **Use GeoPackage** over Shapefile — no 10-char column name limit, supports multiple layers, better performance
6. **Filter during read** — use `bbox`, `columns`, `where` to load only needed data for large files
7. **Set max_distance in sjoin_nearest** — unbounded nearest-neighbor search is slow on large datasets
8. **Use `.copy()` when modifying geometry** — avoid unintended side effects on original GeoDataFrame

## Common Recipes

### Recipe: Count Points in Polygons

When to use: Aggregate point observations (sample sites, observations) by region (county, watershed).

```python
import geopandas as gpd

regions = gpd.read_file("regions.geojson")
points = gpd.read_file("observations.geojson").to_crs(regions.crs)

joined = gpd.sjoin(points, regions, how="inner", predicate="within")
counts = joined.groupby("index_right").size().rename("point_count")
regions_with_counts = regions.join(counts).fillna(0)
regions_with_counts["point_count"] = regions_with_counts["point_count"].astype(int)
print(regions_with_counts[["name", "point_count"]].sort_values("point_count", ascending=False).head())
```

### Recipe: Buffer Around Points and Dissolve Overlaps

When to use: Create service areas or catchment zones from point locations.

```python
import geopandas as gpd

facilities = gpd.read_file("facilities.geojson")
utm_crs = facilities.estimate_utm_crs()   # Project to meters
facilities_m = facilities.to_crs(utm_crs)

# 1 km buffer
buffers = facilities_m.copy()
buffers["geometry"] = facilities_m.geometry.buffer(1000)

# Dissolve overlapping buffers into single polygon
service_area = buffers.dissolve()
service_area_wgs84 = service_area.to_crs("EPSG:4326")
service_area_wgs84.to_file("service_area.geojson", driver="GeoJSON")
print(f"Service area: {service_area_wgs84.geometry.area.sum():.0f} sq degrees")
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| Empty spatial join result | CRS mismatch between GeoDataFrames | Ensure matching CRS: `gdf2 = gdf2.to_crs(gdf1.crs)` |
| Wrong area/distance values | Using geographic CRS (degrees) | Reproject to projected CRS: `gdf.to_crs(gdf.estimate_utm_crs())` |
| `DriverError` on read | Missing driver or corrupt file | Check file exists; try `driver="GeoJSON"` explicitly |
| Geometry column lost after merge | Called `df.merge(gdf)` instead of `gdf.merge(df)` | Always call merge ON the GeoDataFrame |
| `TopologicalError` on overlay | Invalid geometries | Fix with `gdf.geometry = gdf.geometry.buffer(0)` |
| Slow sjoin_nearest | No distance limit on large dataset | Set `max_distance` parameter |
| Folium map blank | Geometries not in EPSG:4326 | Reproject to WGS84: `gdf.to_crs(epsg=4326).explore()` |
| Shapefile column names truncated | Shapefile 10-char limit | Use GeoPackage instead: `gdf.to_file("out.gpkg")` |

## References

- GeoPandas documentation: https://geopandas.org/
- GeoPandas GitHub: https://github.com/geopandas/geopandas
- Shapely documentation: https://shapely.readthedocs.io/
- EPSG registry: https://epsg.io/

## Related Skills

- **matplotlib-scientific-plotting** — advanced map styling and figure export
- **polars-dataframes** — high-performance tabular analysis before/after spatial operations
- **folium** — advanced interactive web mapping beyond geopandas.explore()