--- name: xarray description: N-dimensional labeled arrays and datasets in Python. Built on top of NumPy and Dask. It introduces labels in the form of dimensions, coordinates, and attributes on top of raw NumPy-like arrays, making data analysis in physical sciences more intuitive and less error-prone. Use for working with multi-dimensional scientific data, NetCDF/GRIB/Zarr files, climate/weather/oceanographic datasets, remote sensing, geospatial imaging, large out-of-memory datasets with Dask, and labeled array operations. version: 2024.01 license: Apache-2.0 --- # Xarray - N-Dimensional Labeled Arrays Xarray provides a pandas-like experience for multidimensional data. It is the core of the Pangeo ecosystem and is essential for working with NetCDF, GRIB, and Zarr formats. ## When to Use - Working with multi-dimensional scientific data (Time, Lat, Lon, Level, Ensemble). - Analyzing climate, weather, or oceanographic datasets (NetCDF files). - Handling large datasets that don't fit in memory (via Dask integration). - Performing complex broadcasting and alignment based on dimension names instead of axis indices. - Storing metadata (units, descriptions) directly inside the data object. - Remote sensing and geospatial imaging analysis. ## Reference Documentation **Official docs**: https://docs.xarray.dev/ **Tutorials**: https://tutorial.xarray.dev/ **Search patterns**: `xr.DataArray`, `xr.Dataset`, `ds.sel`, `ds.groupby`, `ds.resample`, `xr.open_dataset` ## Core Principles ### DataArray vs Dataset | Structure | Description | Analogy | |-----------|-------------|---------| | DataArray | A single labeled N-dimensional array. | Like a pandas.Series but N-D. | | Dataset | A dict-like container of multiple DataArrays. | Like a pandas.DataFrame but N-D. | ### Key Concepts - **Dimensions**: Names of the axes (e.g., x, y, time). - **Coordinates**: Values associated with dimensions (e.g., actual timestamps or latitude values). - **Attributes**: Arbitrary metadata (e.g., units='Kelvin', standard_name='air_temperature'). ## Quick Reference ### Installation ```bash pip install xarray netCDF4 dask zarr ``` ### Standard Imports ```python import xarray as xr import numpy as np import pandas as pd import matplotlib.pyplot as plt ``` ### Basic Pattern - Creation ```python import xarray as xr import numpy as np # Create a DataArray data = np.random.rand(4, 3) times = pd.date_range("2023-01-01", periods=4) lons = [-120, -110, -100] da = xr.DataArray( data, coords={"time": times, "lon": lons}, dims=("time", "lon"), name="temp", attrs={"units": "degC"} ) # Convert to Dataset ds = da.to_dataset() print(ds) ``` ## Critical Rules ### ✅ DO - **Use Named Dimensions** - Always use `dim=('time', 'lat', 'lon')` instead of integer axes. - **Select by Labels** - Use `.sel()` for coordinate values and `.isel()` for index integers. - **Lazy Loading** - Use `chunks={}` in `open_dataset` to handle large files with Dask. - **Keep Metadata** - Populate `.attrs` to ensure your data is self-describing. - **Alignment** - Let Xarray handle broadcasting; it will automatically align data based on coordinate values. - **Accessor power** - Use `.dt` for datetime and `.str` for string operations. ### ❌ DON'T - **Use Integer Indexing** - Avoid `data[0, :, 5]` (unreadable and fragile). Use `.isel(time=0, lon=5)`. - **Ignore the Encoding** - When saving to NetCDF, check `ds.encoding` for compression/scaling. - **Manual Loops** - Don't loop over time steps; use `.groupby()` or `.resample()`. - **Forget Dask** - For datasets larger than RAM, ensure Dask is installed and chunks are defined. ## Anti-Patterns (NEVER) ```python # ❌ BAD: Positional indexing (What is axis 1? Lat or Lon?) mean_val = ds.variable.mean(axis=1) # ✅ GOOD: Named dimension reduction (Clear and robust) mean_val = ds.variable.mean(dim='lat') # ❌ BAD: Manual time slicing with list comprehensions # subset = [ds.sel(time=t) for t in my_times if t > '2020-01-01'] # ✅ GOOD: Built-in slicing subset = ds.sel(time=slice('2020-01-01', '2021-12-31')) # ❌ BAD: Losing metadata during numpy conversion raw_data = ds.temp.values # Now it's just a numpy array, units are gone! # ✅ GOOD: Keep in Xarray as long as possible processed = ds.temp * 10 # Units and coords are preserved ``` ## Selection and Indexing ### sel vs isel ```python # Select by coordinate values subset = ds.sel(lat=45.0, lon=slice(-100, -80)) # Select by index (integer) first_step = ds.isel(time=0) # Nearest neighbor lookup point = ds.sel(lat=42.1, lon=-71.2, method="nearest") # Multi-dimensional selection high_temp_days = ds.where(ds.temp > 30, drop=True) ``` ## Computation and Math ### Broadcasting and Alignment ```python # Xarray aligns automatically by coordinate names da1 = xr.DataArray([1, 2], coords=[[1, 2]], dims=['x']) da2 = xr.DataArray([1, 2, 3], coords=[[1, 2, 3]], dims=['y']) # result is a 2x3 matrix result = da1 + da2 # Mathematical operations preserve coordinates log_temp = np.log(ds.temp) anomalies = ds.temp - ds.temp.mean(dim='time') ``` ## GroupBy and Resampling ### Time Series and Spatial Aggregation ```python # Monthly means monthly = ds.resample(time="1MS").mean() # Climatology (group by month regardless of year) climatology = ds.groupby("time.month").mean() # Calculate anomalies relative to climatology anomalies = ds.groupby("time.month") - climatology # Rolling window (Moving average) rolling_mean = ds.rolling(time=7, center=True).mean() ``` ## File I/O (NetCDF, Zarr) ### Reading and Writing ```python # Open a single file ds = xr.open_dataset("weather_data.nc") # Open multiple files (MFDataset) ds_all = xr.open_mfdataset("data/*.nc", combine="by_coords", chunks={'time': 100}) # Write to NetCDF ds.to_netcdf("output.nc") # Write to Zarr (Cloud optimized) ds.to_zarr("data.zarr") ``` ## Plotting ### High-level wrapping of Matplotlib ```python # 1D plot ds.temp.sel(lat=0, lon=0, method='nearest').plot() # 2D map ds.temp.isel(time=0).plot(cmap='RdBu_r', robust=True) # Faceting (Subplots) ds.temp.isel(time=slice(0, 4)).plot(col="time", col_wrap=2) ``` ## Integration with pandas and NumPy ```python # To Pandas df = ds.to_dataframe() # From Pandas new_ds = xr.Dataset.from_dataframe(df) # To NumPy (Lose coordinates) arr = ds.temp.values # Interoperability # Xarray objects work in many SciPy/NumPy functions from scipy.signal import detrend detrended = xr.apply_ufunc(detrend, ds.temp, input_core_dims=[['time']], output_core_dims=[['time']]) ``` ## Advanced: Dask for Big Data ### Out-of-memory computation ```python # Opening with chunks creates Dask arrays ds = xr.open_dataset("huge_file.nc", chunks={'time': 500, 'lat': 100, 'lon': 100}) # Computation is now lazy result = ds.temp.mean(dim='time') # Returns immediately # Trigger computation final_val = result.compute() ``` ## Practical Workflows ### 1. Global Temperature Anomaly Workflow ```python def calculate_temp_anomaly(filepath): """Calculate monthly anomalies from NetCDF data.""" ds = xr.open_dataset(filepath) # 1. Compute climatology (mean for each month of the year) climatology = ds.temp.groupby("time.month").mean("time") # 2. Subtract climatology from original data anomalies = ds.temp.groupby("time.month") - climatology # 3. Global mean anomaly # Weighted by cos(lat) because grid cells get smaller at poles weights = np.cos(np.deg2rad(ds.lat)) weights.name = "weights" anom_weighted = anomalies.weighted(weights) return anom_weighted.mean(("lat", "lon")) # ts_anomaly = calculate_temp_anomaly("global_temps.nc") ``` ### 2. Multi-Model Ensemble Analysis ```python def analyze_ensemble(file_list): """Combine multiple model runs into a single dataset with a 'model' dimension.""" datasets = [xr.open_dataset(f) for f in file_list] model_names = ["Model_A", "Model_B", "Model_C"] # Concatenate along a new dimension combined = xr.concat(datasets, dim=pd.Index(model_names, name="model")) # Calculate ensemble mean and spread ens_mean = combined.mean(dim="model") ens_std = combined.std(dim="model") return ens_mean, ens_std ``` ### 3. Satellite Image Processing (NDVI) ```python def calculate_ndvi(ds): """Calculate NDVI from Red and NIR bands in an Xarray Dataset.""" # NDVI = (NIR - Red) / (NIR + Red) red = ds.sel(band='red') nir = ds.sel(band='nir') ndvi = (nir - red) / (nir + red) ndvi.attrs['long_name'] = "Normalized Difference Vegetation Index" return ndvi ``` ## Performance Optimization ### Chunking Strategies ```python # ❌ Problem: Small chunks lead to massive overhead # ds = ds.chunk({'time': 1, 'lat': 1, 'lon': 1}) # ✅ Solution: Aim for 10MB - 100MB per chunk ds = ds.chunk({'time': -1, 'lat': 100, 'lon': 100}) ``` ### Vectorization with apply_ufunc ```python # Wrap a custom numpy function to work on Xarray objects efficiently def my_complex_stat(x): return np.median(x) * np.std(x) result = xr.apply_ufunc( my_complex_stat, ds.temp, input_core_dims=[['time']], # The dimension to map the function over vectorize=True, dask="parallelized" ) ``` ## Common Pitfalls and Solutions ### Coordinate Mismatch ```python # ❌ Problem: DataArrays don't align due to floating point jitter in lat/lon # ✅ Solution: Use .interp_like() or .reindex_like() ds2_aligned = ds2.interp_like(ds1) ``` ### Memory Leak with values ```python # ❌ Problem: Calling .values on a huge Dask-backed array crashes the machine # ✅ Solution: Use .compute() or subset first subset_val = ds.temp.isel(time=0).values # This is safer ``` ### Slicing issues (Start/End) ```python # ❌ Problem: slice(10, 0) returns empty because order is wrong # ✅ Solution: Check if your index is ascending or descending # ds.sortby('lat').sel(lat=slice(-90, 90)) ``` Xarray is the bridge between raw N-dimensional math and high-level data analysis. Its ability to handle labels and metadata makes scientific code self-documenting and significantly more reliable.