{ "cells": [ { "cell_type": "markdown", "id": "28c5e04d", "metadata": {}, "source": [ "# 2.5 Data Arrays\n", "\n", "This lecture introduces arrays.\n", "\n", "## 1. Why Manipulate Arrays in Geoscientific Data?\n", "* **Data Complexity**: Geoscientific data is often multidimensional (e.g., 3D grids for seismic data, 4D climate models with time). Arrays offer efficient ways to store and manipulate such data.\n", "* **Data Transformation**: Preprocessing geoscientific data often involves transformations such as normalization, aggregation, resampling, and feature extraction, all of which require array manipulation.\n", "* **Performance**: Arrays provide optimized data structures that make operations faster and more memory efficient. This is crucial when working with large datasets (e.g., satellite imagery, climate models).\n", "* **Model Input**: Machine learning models, especially deep learning, often require structured data inputs in the form of arrays or tensors.\n", "\n", "\n", "**Overview of Array Libraries**\n", "* **NumPy Arrays**: The foundation for numerical computing in Python. Offers basic array operations, efficient mathematical functions, and indexing/slicing tools.\n", "* **Xarray**: Designed for labeled, multi-dimensional data. Xarray is ideal for geoscientific data that has labels for each dimension (e.g., time, latitude, longitude).\n", "* **PyTorch Tensors**: A library primarily for deep learning. It allows for GPU-accelerated tensor operations and is useful when geoscientific problems intersect with machine learning tasks.\n", "\n", "\n", "At all levels, we will practice\n", "- **Matplotlib** python package that provides functionalities for basic plotting." ] }, { "cell_type": "markdown", "id": "47124ce9", "metadata": {}, "source": [ "## 2. Numpy arrays\n", "\n", "Sequence of data can be stored in python *lists*. Lists are very flexible; data of identical type can be appended to list on the fly.\n", "\n", "Numpy arrays area multi-dimensional objects of specific data types (floats, strings, integers, ...). ! Numpy arrays should be declared first ! Allocating memory of the data ahead of time can save computational time. Numpy arrays support arithmetic operations.\n", "There are numerous tutorials to get help, such as on [Earth Data Science](https://www.earthdatascience.org/courses/intro-to-earth-data-science/scientific-data-structures-python/numpy-arrays/)." ] }, { "cell_type": "code", "execution_count": null, "id": "86fc68b4", "metadata": {}, "outputs": [], "source": [ "# import module\n", "import numpy as np\n", "\n", "# define an array of dimension one from.\n", "#this is a list of floats:\n", "a=[0.7 , 0.75, 1.85]\n", "# convert a list to a numpy array\n", "a_nparray = np.array(a)" ] }, { "cell_type": "markdown", "id": "30fba6ef", "metadata": {}, "source": [ "### 2.1 1D arrays in Numpy\n", "1-D arrays are also called ``vectors``. The Numpy function ``arange`` will create regularly spaced values within a specific range. It is similar to the python function ``range``. \n" ] }, { "cell_type": "code", "execution_count": null, "id": "f8f04ca2", "metadata": {}, "outputs": [], "source": [ "# create 1D arrays\n", "N=100\n", "x_int = np.arange(N)# this will make a vector of int from 0 to N-1\n", "print(x_int)" ] }, { "cell_type": "code", "execution_count": null, "id": "27e7a216", "metadata": {}, "outputs": [], "source": [ "N = 100 # number of points\n", "\n", "# linearly spaced vectors\n", "x_lat = np.linspace(39,50,N)# this will make a vector of floats from min to max values evenly spaced\n", "x_lon = np.linspace(-128,-125,N)# same for longitude\n", "print(x_lat)\n", "print(x_lon)\n", "\n", "\n", "\n", "# logspaced vectors\n", "# 10^(-1) -> 10^(1) => 0.1 - 1\n", "x_tl = np.logspace(-1,1,N)\n", "\n", "print(x_tl)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "c2671e24", "metadata": {}, "outputs": [], "source": [ "\n", "# time vectors\n", "x_t = np.linspace(0,100,N+1)\n", "dt=x_t[1]\n", "print(dt)" ] }, { "cell_type": "markdown", "id": "23aa62bc", "metadata": {}, "source": [ "**Indexing Numpy arrays**\n", "\n", "Accessing individual elements is done using squared brackets ``[]``. First element is indexed at ``0``, last element is indexed at ``-1``.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "a0dd1c01", "metadata": {}, "outputs": [], "source": [ "arr = np.arange(5)\n", "\n", "# Access the first element (index 0)\n", "element = arr[0]\n", "print(element) # Output: 1\n", "\n", "# Access the fourth element (index 3)\n", "element = arr[3]\n", "print(element) # Output: 4\n", "\n", "\n", "# Access the last element (index 5)\n", "element = arr[4]\n", "print(element) # Output: 5\n", "\n", "# Access the last element (index 5)\n", "element = arr[-1]\n", "print(element) # Output: 5\n" ] }, { "cell_type": "markdown", "id": "a7386d91", "metadata": {}, "source": [ "**Slicing**\n", "\n", "We can slice arrays using the basic syntax ``start:stop:step``:" ] }, { "cell_type": "code", "execution_count": null, "id": "778e2928", "metadata": {}, "outputs": [], "source": [ "x=np.arange(10)\n", "print(x)" ] }, { "cell_type": "code", "execution_count": null, "id": "5147550b", "metadata": {}, "outputs": [], "source": [ "# select subarrays between indexes 3 and 6\n", "x1 = x[2:10:2]\n", "print(x1)" ] }, { "cell_type": "markdown", "id": "aca4b17d", "metadata": {}, "source": [ "Create an array that select every other elements" ] }, { "cell_type": "code", "execution_count": null, "id": "84f6785c", "metadata": {}, "outputs": [], "source": [ "# answer here" ] }, { "cell_type": "markdown", "id": "8f5a75e5", "metadata": {}, "source": [ "Create an array with a reversed ordered indexes:" ] }, { "cell_type": "code", "execution_count": null, "id": "e264262f", "metadata": {}, "outputs": [], "source": [ "# answer here\n", "x[10:0:-1]" ] }, { "cell_type": "markdown", "id": "178b20ce", "metadata": {}, "source": [ "### 2.2 Plotting with Matplotlib\n", "\n", "**Some tips from Sofware-Carpentry**\n", "\n", "- Make sure your text is **large** enough to read. Use the fontsize parameter in xlabel, ylabel, title, and legend, and tick_params with labelsize to increase the text size of the numbers on your axes.\n", "\n", "- Similarly, you should make your graph elements easy to see. Use ``s`` to increase the size of your scatterplot markers and linewidth to increase the sizes of your plot lines.\n", "\n", "- Using color (and nothing else) to distinguish between different plot elements will make your plots unreadable to anyone who is colorblind, or who happens to have a black-and-white office printer. For lines, the linestyle parameter lets you use different types of lines. For scatterplots, marker lets you change the shape of your points. If you’re unsure about your colors, you can use Coblis or Color Oracle to simulate what your plots would look like to those with colorblindness." ] }, { "cell_type": "code", "execution_count": null, "id": "323f29fc", "metadata": {}, "outputs": [], "source": [ "# make the plots appear in the notebook\n", "%matplotlib inline \n", "import matplotlib.pyplot as plt\n", "# import matplotlib.pylab as pylab\n", "\n", "# set default parameters\n", "params = {'legend.fontsize': 14, \\\n", " 'xtick.labelsize':14, \\\n", " 'ytick.labelsize':14, \\\n", " 'font.size':14}\n", "plt.rcParams.update(params)" ] }, { "cell_type": "code", "execution_count": null, "id": "d9451b3e", "metadata": {}, "outputs": [], "source": [ "fig, (ax1,ax2) = plt.subplots(1,2, figsize=(10,5)) # \n", "ax1.plot(x_t,'.') # plot the vector x_t\n", "ax2.plot(x_tl,'r.');ax2.set_yscale('log') # plot the vector x_tl in red and log scale\n", "ax2.set_xlabel('Index of vector') # set the x-axis label\n", "ax1.set_xlabel('Index of vector') # set the x-axis label\n", "ax2.set_ylabel('Time (s)') # set the y-axis label\n", "ax1.set_ylabel('Time (s)') # set the y-axis label\n", "ax1.set_title('My first plot!')\n", "ax2.set_title('My second plot!') ; \n", "plt.tight_layout()\n", "ax1.grid(True)\n", "ax2.grid(True)\n", "plt.savefig('plot_test.png')" ] }, { "cell_type": "markdown", "id": "e156ec89", "metadata": {}, "source": [ "### 2.3 Random Arrays\n", "\n", "Creating synthetic arrays from statistical distributions of data.\n", "\n", "Find some basic function from the statistics functions in Numpy:\n", "https://numpy.org/doc/stable/reference/routines.statistics.html" ] }, { "cell_type": "code", "execution_count": null, "id": "e3938d8a", "metadata": {}, "outputs": [], "source": [ "# generate random fields\n", "from numpy import random\n", "\n", "N=10000 # number of samples\n", "# Uniform distribution\n", "x1=2*random.rand(N)-1 # uniform distribution centered at 0, varying between -1 and 2.\n", "# Gaussian distribution\n", "x2=random.randn(N) # Gaussian distribution with a standard deviation of 1\n", "# Power law\n", "a=2 # power of the distribution\n", "x3=random.power(a,N)\n", "# poisson\n", "aa=4\n", "x4=random.poisson(aa,N)" ] }, { "cell_type": "code", "execution_count": null, "id": "449dfc8b", "metadata": {}, "outputs": [], "source": [ "# now compare this distribution to the Gaussian distribution\n", "fig1, (ax1,ax2,ax3,ax4) = plt.subplots(1,4, figsize=(15,5)) # 1 row, 2 co\n", "ax1.plot(x1);ax1.set_title('Uniform')\n", "ax2.plot(x2);ax2.set_title('Gaussian')\n", "ax3.plot(x3);ax3.set_title('Power')\n", "ax4.plot(x4);ax4.set_title('Poisson')\n", "\n", "\n", "# plot the 2 histograms\n", "fig2, (ax11,ax12,ax13,ax14) = plt.subplots(1,4, figsize=(15,5)) # 1 row, 2 co\n", "ax11.hist(x1,bins=100);ax11.set_title('Uniform')\n", "ax12.hist(x2,bins=100);ax12.set_title('Gaussian')\n", "ax13.hist(x3,bins=100);ax13.set_title('Power')\n", "ax14.hist(x4,bins=100);ax14.set_title('Poisson')" ] }, { "cell_type": "code", "execution_count": null, "id": "c74db68b", "metadata": {}, "outputs": [], "source": [ "# numpy has built-in functions to calculate basic statistical properties of numpy arrays\n", "print(\"mean of x1\", \"standard deviation of x1\",\"mean of x2\", \"standard deviation of x2\",)\n", "print(np.mean(x1),np.std(x1),np.mean(x2),np.std(x2))" ] }, { "cell_type": "markdown", "id": "d5d9add8", "metadata": {}, "source": [ "### 2.4 2D arrays in Numpy\n", "\n", "We will now create random 2D arrays" ] }, { "cell_type": "code", "execution_count": null, "id": "c97eb3cd", "metadata": { "scrolled": true }, "outputs": [], "source": [ "N=1000 # number of samples\n", "# create\n", "# Uniform distribution\n", "x1=2*random.rand(N,N)-1 # uniform distribution centered at 0, varying between -1 and 2.\n", "# Gaussian distribution\n", "x2=random.randn(N,N) # Gaussian distribution with a standard deviation of 1\n", "# Power law\n", "a=2 # power of the distribution\n", "x3=random.power(a,[N,N])\n", "# poisson\n", "aa=4\n", "x4=random.poisson(aa,[N,N])\n", "\n", "# now compare this distribution to the Gaussian distribution\n", "fig, (ax1,ax2,ax3,ax4) = plt.subplots(1,4, figsize=(26,4)) # 1 row, 2 co\n", "ax1.pcolor(x1,vmin=-1, vmax=1)\n", "ax2.pcolor(x2,vmin=-1, vmax=1)\n", "ax3.pcolor(x3,vmin=-1, vmax=1)\n", "ax4.pcolor(x4,vmin=0, vmax=10)" ] }, { "cell_type": "markdown", "id": "209cdfa9", "metadata": {}, "source": [ "#### Bonus\n", "\n", "Learn how to smooth/blurr these 2D arrays" ] }, { "cell_type": "code", "execution_count": null, "id": "ebdc31dc", "metadata": {}, "outputs": [], "source": [ "# use scipy gaussian filtering to smooth the x2 array\n", "from scipy.ndimage import gaussian_filter\n", "x2_smooth = gaussian_filter(x2, sigma=5)\n", "fig, (ax1,ax2) = plt.subplots(1,2, figsize=(10,5)) # 1 row, 2 co\n", "ax1.pcolor(x2,vmin=-1, vmax=1)\n", "ax2.pcolor(x2_smooth,vmin=-1, vmax=1)" ] }, { "cell_type": "code", "execution_count": null, "id": "d9b35b48", "metadata": {}, "outputs": [], "source": [ "# check the distributions of both arrays before and after smoothing using histograms to compare\n", "fig, (ax1,ax2) = plt.subplots(1,2, figsize=(10,5)) # 1 row, 2 co\n", "ax1.hist(x2.flatten(),bins=100) # flatten the 2D array to 1D and plot the histogram\n", "ax2.hist(x2_smooth.flatten(),bins=100);" ] }, { "cell_type": "markdown", "id": "2639b0e1", "metadata": {}, "source": [ "### 2.5 Array Norms\n", "\n", "$L_2$ norm of an array $X$:\n", "\n", "$||X||_2 = \\sqrt{\\sum_i^N \\left(X_i^2\\right) / N} $\n", "In numpy that is the default norm of the linear algebra module linalg: np.linalg.norm(X) \n", "\n", "We can normalize an array and will focus on wich dimension the array is normalized." ] }, { "cell_type": "code", "execution_count": null, "id": "c538d34f", "metadata": {}, "outputs": [], "source": [ "# define random array\n", "x1=2*random.rand(N)-1 # uniform distribution centered at 0, varying between -1 and 2.\n", "# calculate its norm\n", "norm_x1 = np.linalg.norm(x1);print(norm_x1)\n", "# normalize it\n", "x1_norm = x1/norm_x1\n", "# calculate its norm\n", "norm_x1_norm = np.linalg.norm(x1_norm);print(norm_x1_norm)\n" ] }, { "cell_type": "markdown", "id": "5afc4417", "metadata": {}, "source": [ "### 2.6 Measures of distance\n", "Comparing data often means calculating a distance or dissimilarity between the two data. Similarity is equialent to proximity of two data.\n", "\n", "**Euclidian distance**\n", "\n", "$L_2$ norm of the residual between 2 vectors:\n", "\n", "$d = ||X -Y||_2 = \\sqrt{\\sum_i^N \\left(X_i^2 - Y_i^2 \\right) / N} $\n", "In numpy that is the default norm of the linear algebra module linalg: d=np.linalg.norm(X-Y) \n", "\n", "**Total variation distance**\n", "\n", "Is the $L_1$-norm equivalent of the Euclidian distance: d=np.linalg.norm(X-Y,ord=1) \n", "\n", "\n", "**Pearson coefficient (aka the correlation coefficient)**\n", "\n", "$ P(X,Y) = \\frac{cov(X,Y)}{std(X)std(Y)} $\n", "\n", "$ P(X,Y) = \\frac{ \\sum{ (X_i-mean(X)) (Y_i-mean(Y)}}{\\sqrt{\\sum{ (X_i-mean(X))^2 } \\sum{ (Y_i-mean(Y))^2 } }} $" ] }, { "cell_type": "code", "execution_count": null, "id": "f7fa3862", "metadata": {}, "outputs": [], "source": [ "# cross correlate 2 vectors:\n", "\n", "x2=random.randn(N) # Gaussian distribution \n", "x3=random.randn(N) # Gaussian distribution \n", "\n", "# Euclidian distance and make sure it is normalized\n", "d2=np.linalg.norm(x3-x2) # Euclidian distance\n", "# Total variation distance:\n", "d1=np.linalg.norm(x3-x2,ord=1)\n", "# Pearson coefficient\n", "r = np.corrcoef(x2,x3)[0,1]\n", "\n", "# what is the mean of the distances between each point of the two vectors?\n", "dist = np.sqrt(np.sum(np.abs(x3-x2)**2))\n", "\n", "print(\"the three distances are:\")\n", "print(d2,d1,r,dist)\n", "\n", "plt.plot(x2,x3,'o');plt.grid(True)\n", "plt.xlabel('X2');plt.ylabel('X3')\n" ] }, { "cell_type": "markdown", "id": "8de40708", "metadata": {}, "source": [ "# 3. Xarrays\n", "\n", "## 3.1 Xarray basics\n", "\n", "This tutorial has been copied and modified from the Xarray package tutorials.\n", "\n", "Geoscientific data comes with complex metadata that describe the meaning and context for data arrays. Xarrays allows to attach metadat to the data arrays, making it easier to keep track of variables, their units, coordinate systems, and other important data attributes. Because geoscientific data are typically multi-dimensional, integrating dimensions like time, spatial coordinate, allows users to manipulate, transform, perform complex operations on the data arras, In particular, it simplifies tasks to slice in time and space, allows for regridding and subsetting.\n", "\n", "Xarrays are well integrated with high performance computing, such as Dask, and storage such as HPC storage using ``NetCDF`` and cloud storage with ``Zarr``.\n", "\n", "Xarrays were designed to facilitate the manipulation of geoscientific data. Ge\n", "\n", "\n", "Xarrays are multi-dimensional arrays (\"tensors\") that can have several attributes and dimensions. The core structure is ``DataArray``, the N dimensional array that is similar to a ``pandas.Series``. The second is the ``Dataset`` that is a multi-dimensional, in-memory array database. It is a dictionary like container of ``DataArray``, the equivalent to ``pandas.DataFrame``.\n", "\n", "Xarrrays can be read from netCDF and from Zarr.\n", "\n", "You will find plenty of useful tutorials from the Xarray project, such as [this](https://docs.xarray.dev/en/stable/getting-started-guide/quick-overview.html). \n", "\n", "[Here](https://tutorial.xarray.dev/fundamentals/01_datastructures.html) is the Xarray tutorial book introducing Xarray from data structure to visualization. Generally, Xarray wraps Numpy and Pandas and behaves in a similar way as in Pandas. The transition to adopt Xarray should be smooth if you are familiar with Numpy and Pandas already. \n" ] }, { "cell_type": "code", "execution_count": null, "id": "f618954c", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import xarray as xr\n", "%matplotlib inline\n", "%config InlineBackend.figure_format='retina'" ] }, { "cell_type": "code", "execution_count": null, "id": "9495b831", "metadata": {}, "outputs": [], "source": [ "ds = xr.tutorial.load_dataset(\"air_temperature\")\n", "ds" ] }, { "cell_type": "markdown", "id": "2b1d2986", "metadata": {}, "source": [ "What are the keys/attributes of the data set?" ] }, { "cell_type": "code", "execution_count": null, "id": "6b6f4da0", "metadata": {}, "outputs": [], "source": [ "# what are the key attributes of ds\n", "print(ds.attrs)" ] }, { "cell_type": "markdown", "id": "617e4b69", "metadata": {}, "source": [ "Find two ways to print the values from attribute ``air``" ] }, { "cell_type": "code", "execution_count": null, "id": "6ab28ee2", "metadata": {}, "outputs": [], "source": [ "ds[\"air\"]" ] }, { "cell_type": "code", "execution_count": null, "id": "4f06469e", "metadata": {}, "outputs": [], "source": [ "ds.air" ] }, { "cell_type": "code", "execution_count": null, "id": "9e1d9642", "metadata": {}, "outputs": [], "source": [ "with xr.set_options(display_style=\"html\"):\n", " display(ds)" ] }, { "cell_type": "markdown", "id": "049748ad", "metadata": {}, "source": [ "The DataArray has named dimension:" ] }, { "cell_type": "code", "execution_count": null, "id": "fb1478fa", "metadata": {}, "outputs": [], "source": [ "ds.air.dims" ] }, { "cell_type": "markdown", "id": "9b3e67ae", "metadata": {}, "source": [ "and the coordinates are saved in ``.coord``:" ] }, { "cell_type": "code", "execution_count": null, "id": "6bdd64e2", "metadata": {}, "outputs": [], "source": [ "ds.air.coords" ] }, { "cell_type": "code", "execution_count": null, "id": "1f0cf574", "metadata": {}, "outputs": [], "source": [ "ds.air.attrs[\"long_name\"]" ] }, { "cell_type": "markdown", "id": "cd59ba8a", "metadata": {}, "source": [ "Add new attributes" ] }, { "cell_type": "code", "execution_count": null, "id": "8fd6ee7a", "metadata": {}, "outputs": [], "source": [ "# assign your own attributes!\n", "ds.air.attrs[\"who_is_awesome\"] = \"xarray\"\n", "ds.air.attrs" ] }, { "cell_type": "markdown", "id": "0f5cd594", "metadata": {}, "source": [ "The underlying data is a numpy array" ] }, { "cell_type": "code", "execution_count": null, "id": "03115c04", "metadata": {}, "outputs": [], "source": [ "print(type(ds.air.data))\n", "print(ds.air.data)" ] }, { "cell_type": "markdown", "id": "da30af75", "metadata": {}, "source": [ "### 3.2 Extracting data\n", "\n", "How to extract data:\n", "\n", "* label-based indexing using ``.sel``\n", "\n", "* position-based indexing using ``.isel``" ] }, { "cell_type": "code", "execution_count": null, "id": "3c974037", "metadata": {}, "outputs": [], "source": [ "ds.air.isel(time=1).plot(x=\"lon\")" ] }, { "cell_type": "markdown", "id": "90dc8663", "metadata": {}, "source": [ "You would notice that the air temperature is in Kelvin. We can convert it to Celsius by removing 273.15 and changing the attributes ``units``." ] }, { "cell_type": "code", "execution_count": null, "id": "d4b308b4", "metadata": {}, "outputs": [], "source": [ "ds2=ds\n", "ds2['air']=ds['air']-273.15\n", "ds2['air']['units']='degC'" ] }, { "cell_type": "code", "execution_count": null, "id": "cb95b07f", "metadata": {}, "outputs": [], "source": [ "ds.air.isel(time=1).plot(x=\"lon\")" ] }, { "cell_type": "markdown", "id": "ac3a193e", "metadata": {}, "source": [ "We also want to show the longitudes in the west direction by removing 360$^\\circ$." ] }, { "cell_type": "code", "execution_count": null, "id": "b54ff7de", "metadata": {}, "outputs": [], "source": [ "ds2.coords[\"lon\"]=ds2.coords[\"lon\"]-360" ] }, { "cell_type": "code", "execution_count": null, "id": "f2b9cbf3", "metadata": {}, "outputs": [], "source": [ "ds.air.isel(time=1).plot(x=\"lon\")" ] }, { "cell_type": "markdown", "id": "02f024be", "metadata": {}, "source": [ "Show the mean temperature" ] }, { "cell_type": "code", "execution_count": null, "id": "08f78964", "metadata": {}, "outputs": [], "source": [ "ds2.air.mean(\"time\").plot()" ] }, { "cell_type": "code", "execution_count": null, "id": "d91f82f1", "metadata": {}, "outputs": [], "source": [ "ds2.sel(time=\"2013-05\")" ] }, { "cell_type": "code", "execution_count": null, "id": "359b47cb", "metadata": {}, "outputs": [], "source": [ "# demonstrate slicing\n", "ds.sel(time=slice(\"2013-05\", \"2013-07\"))" ] }, { "cell_type": "code", "execution_count": null, "id": "f740cd3a", "metadata": {}, "outputs": [], "source": [ "# \"nearest indexing at multiple points\"\n", "ds.sel(lon=[240.125-360, 234-360], lat=[40.3, 50.3], method=\"nearest\")" ] }, { "cell_type": "markdown", "id": "0265dc81", "metadata": {}, "source": [ "### 3.3 High level computation\n", "\n", "* groupby : Bin data in to groups and reduce\n", "\n", "* resample : Groupby specialized for time axes. Either downsample or upsample your data.\n", "\n", "* rolling : Operate on rolling windows of your data e.g. running mean\n", "\n", "* coarsen : Downsample your data\n", "\n", "* weighted : Weight your data before reducing" ] }, { "cell_type": "code", "execution_count": null, "id": "53f54258", "metadata": {}, "outputs": [], "source": [ "# seasonal groups\n", "ds.groupby(\"time.season\")" ] }, { "cell_type": "code", "execution_count": null, "id": "fc9ee06b", "metadata": {}, "outputs": [], "source": [ "# make a seasonal mean\n", "seasonal_mean = ds.groupby(\"time.season\").mean()\n", "seasonal_mean = seasonal_mean.sel(season=[\"DJF\", \"MAM\", \"JJA\", \"SON\"])\n", "seasonal_mean" ] }, { "cell_type": "code", "execution_count": null, "id": "725fb273", "metadata": {}, "outputs": [], "source": [ "# resample to monthly frequency\n", "ds.resample(time=\"M\").mean()" ] }, { "cell_type": "code", "execution_count": null, "id": "fe77f30d", "metadata": {}, "outputs": [], "source": [ "# facet the seasonal_mean\n", "seasonal_mean.air.plot(col=\"season\")" ] }, { "cell_type": "markdown", "id": "afbc355e", "metadata": {}, "source": [ "We can save Xarrays in to NetCDF and Zarr files" ] }, { "cell_type": "code", "execution_count": null, "id": "60da43a6", "metadata": {}, "outputs": [], "source": [ "# write to netCDF\n", "%timeit ds.to_netcdf(\"my-example-dataset.nc\")\n", "!ls -lh my-example-dataset.nc" ] }, { "cell_type": "code", "execution_count": null, "id": "4cbef219", "metadata": {}, "outputs": [], "source": [ "\n", "%timeit ds.to_zarr(store=\"./my-example-dataset.zarr\",mode=\"w\")\n", "!du -sh ./my-example-dataset.zarr" ] }, { "cell_type": "markdown", "id": "3dde4d78", "metadata": {}, "source": [ "## 3.4 Student exercise\n", "\n", "1. **Retrieve Air Temperature Data:**\n", "\n", "Use the air temperature data set and extract the time series at the latitude and longitude of Seattle.\n", "\n", "2. **Plot the Time Series:**\n", "\n", "Plot the time series of the air temperature data using ``matplotlib``.\n", "\n", "3. **Fit a Sine Function:**\n", "\n", "Fit a sine function to the time series data using ``scipy.optimize.curve_fit``." ] }, { "cell_type": "code", "execution_count": null, "id": "42957cc5", "metadata": {}, "outputs": [], "source": [ "# Latitude\n", "l\n", "\n", "# Longitude" ] }, { "cell_type": "code", "execution_count": null, "id": "94e2670b", "metadata": {}, "outputs": [], "source": [ "# extract the air temperature data at these latitude and longitude\n" ] }, { "cell_type": "code", "execution_count": null, "id": "7d8692c4", "metadata": {}, "outputs": [], "source": [ "# extract the time series, print its data type, and convert it to numerical values" ] }, { "cell_type": "code", "execution_count": null, "id": "c6cbb6f0", "metadata": {}, "outputs": [], "source": [ "# plot the data with appropriate labels and legends" ] }, { "cell_type": "code", "execution_count": null, "id": "4509ee5e", "metadata": {}, "outputs": [], "source": [ "# import the curve_fit function from scipy.optimize" ] }, { "cell_type": "code", "execution_count": null, "id": "acc75dcc", "metadata": {}, "outputs": [], "source": [ "# define a function to take the time series data and fit it to a sinusoidal function" ] }, { "cell_type": "code", "execution_count": null, "id": "2b2dd14a", "metadata": {}, "outputs": [], "source": [ "#" ] }, { "cell_type": "markdown", "id": "1d38aac1", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "id": "ac06db34", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "c2f06647", "metadata": {}, "source": [ "# 4. Pytorch Tensors\n", "\n", "This material is extracted from the Pytorch Package materials and from Dive into Deep Learning\n", "\n", "The ``tensor`` class in Pytorch is similar to an ``ndarray`` in Numpy, with the added features that: 1) it has automatic differentiation and 2) it runs on CPUs and GPUs." ] }, { "cell_type": "code", "execution_count": null, "id": "e472e4bc", "metadata": {}, "outputs": [], "source": [ "import torch" ] }, { "cell_type": "markdown", "id": "d5cb6a88", "metadata": {}, "source": [ "We can define a basic tensor that will be by default a 1-D array (vector) and run on the CPU" ] }, { "cell_type": "code", "execution_count": null, "id": "9a9e3489", "metadata": {}, "outputs": [], "source": [ "x = torch.arange(12, dtype=torch.float32)\n", "x" ] }, { "cell_type": "markdown", "id": "f6ebd0d7", "metadata": {}, "source": [ "Each value is refered to as an ``element``. We can extract the length of the vector using the **method** ``numel()`` (a method is applied with ``()``) and its **attribute** shape is found by applying``shape``" ] }, { "cell_type": "code", "execution_count": null, "id": "4b6698bd", "metadata": {}, "outputs": [], "source": [ "x.numel()" ] }, { "cell_type": "code", "execution_count": null, "id": "8bc6c2e4", "metadata": {}, "outputs": [], "source": [ "x.shape" ] }, { "cell_type": "markdown", "id": "27c7b883", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "id": "252e6246", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "873908fd", "metadata": {}, "source": [ "We can reshape a tensor and flip dimensions" ] }, { "cell_type": "code", "execution_count": null, "id": "7f365f57", "metadata": {}, "outputs": [], "source": [ "x.reshape(12,1).shape" ] }, { "cell_type": "code", "execution_count": null, "id": "58e11e8e", "metadata": {}, "outputs": [], "source": [ "x2=x.reshape(3,4)\n", "x2.shape" ] }, { "cell_type": "markdown", "id": "f0fa8074", "metadata": {}, "source": [ "We can create a random tensor" ] }, { "cell_type": "code", "execution_count": null, "id": "6dbdc945", "metadata": {}, "outputs": [], "source": [ "x3=torch.randn(3, 4)" ] }, { "cell_type": "markdown", "id": "3236b54d", "metadata": {}, "source": [ "We can apply **element-wise** operations on the tensor simply as methods. Here is an example by applying the function ``exp`` to all elements of the random tensor x3." ] }, { "cell_type": "code", "execution_count": null, "id": "303e9be8", "metadata": {}, "outputs": [], "source": [ "x3.exp()" ] }, { "cell_type": "markdown", "id": "0a080ef1", "metadata": {}, "source": [ "We can manipulate 2 arrays of the **same** shape" ] }, { "cell_type": "code", "execution_count": null, "id": "b2a869da", "metadata": {}, "outputs": [], "source": [ "x = torch.tensor([1.0, 2, 4, 8])\n", "y = torch.tensor([2, 2, 2, 2])\n", "x + y, x - y, x * y, x / y, x ** y" ] }, { "cell_type": "markdown", "id": "18490b36", "metadata": {}, "source": [ "We can normalize the tensor with its L2 norm. Try below" ] }, { "cell_type": "code", "execution_count": null, "id": "8a8ffe0d", "metadata": {}, "outputs": [], "source": [ "x /=np.linalg.norm(x)" ] }, { "cell_type": "code", "execution_count": null, "id": "83248d95", "metadata": {}, "outputs": [], "source": [ "x" ] }, { "cell_type": "markdown", "id": "16a836da", "metadata": {}, "source": [ "We can convert a numpy array to a Pytorch tensor" ] }, { "cell_type": "code", "execution_count": null, "id": "3733a51f", "metadata": {}, "outputs": [], "source": [ "a=x.numpy()\n", "print(type(a))\n", "print(type(x))" ] }, { "cell_type": "markdown", "id": "84d9bbe5", "metadata": {}, "source": [ "Move an array from CPU to GPU by changing the ``device`` attribute" ] }, { "cell_type": "code", "execution_count": null, "id": "abb523b3", "metadata": {}, "outputs": [], "source": [ "device = torch.device(\"cpu\")\n", "print(device)" ] }, { "cell_type": "markdown", "id": "2036d202", "metadata": {}, "source": [ "Below, we are going to perform a regression to write a sine function as a function" ] }, { "cell_type": "code", "execution_count": null, "id": "fb5dc0d3", "metadata": {}, "outputs": [], "source": [ "import torch\n", "import math\n", "\n", "\n", "dtype = torch.float\n", "device = torch.device(\"cpu\")\n", "# device = torch.device(\"cuda:0\") # Uncomment this to run on GPU\n", "\n", "# Create random input and output data\n", "x = torch.linspace(-math.pi, math.pi, 2000, device=device, dtype=dtype)\n", "y = torch.sin(x)\n", "\n", "# Randomly initialize weights\n", "a = torch.randn((), device=device, dtype=dtype)\n", "b = torch.randn((), device=device, dtype=dtype)\n", "c = torch.randn((), device=device, dtype=dtype)\n", "d = torch.randn((), device=device, dtype=dtype)\n", "\n", "learning_rate = 1e-6\n", "for t in range(2000):\n", " # Forward pass: compute predicted y\n", " y_pred = a + b * x + c * x ** 2 + d * x ** 3\n", "\n", " # Compute and print loss\n", " loss = (y_pred - y).pow(2).sum().item()\n", " if t % 100 == 99:\n", " print(t, loss)\n", "\n", " # Backprop to compute gradients of a, b, c, d with respect to loss\n", " grad_y_pred = 2.0 * (y_pred - y)\n", " grad_a = grad_y_pred.sum()\n", " grad_b = (grad_y_pred * x).sum()\n", " grad_c = (grad_y_pred * x ** 2).sum()\n", " grad_d = (grad_y_pred * x ** 3).sum()\n", "\n", " # Update weights using gradient descent\n", " a -= learning_rate * grad_a\n", " b -= learning_rate * grad_b\n", " c -= learning_rate * grad_c\n", " d -= learning_rate * grad_d\n", "\n", "\n", "print(f'Result: y = {a.item()} + {b.item()} x + {c.item()} x^2 + {d.item()} x^3')" ] }, { "cell_type": "markdown", "id": "9688d90c", "metadata": {}, "source": [ "\n", "## 5. Comparison of NumPy, Xarray, and PyTorch\n", "\n", "| Feature/Use Case | NumPy | Xarray | PyTorch |\n", "|-----------------------------------|---------------------------------|--------------------------------|--------------------------------|\n", "| Basic Array Manipulation | Excellent | Good | Good |\n", "| Labeled Dimensions | Limited | Excellent | Limited |\n", "| Multi-dimensional Data Support | Excellent | Excellent | Excellent |\n", "| Deep Learning Integration | Limited | Limited | Excellent |\n", "| GPU Acceleration | None | None | Excellent |\n", "| Geoscientific Data (climate, etc.)| Possible but requires effort | Excellent | Possible |\n", "\n", "\n", "\n", "### **Choosing the Right Tool for the Task**\n", "\n", "- **NumPy**: Best for basic numerical operations and standard array manipulation.\n", "- **Xarray**: Ideal for labeled, multi-dimensional geoscientific datasets, particularly in climate modeling, remote sensing, and spatial data.\n", "- **PyTorch**: Best when applying machine learning or deep learning models to geoscientific data, especially when dealing with large datasets and requiring GPU acceleration.\n", "\n" ] }, { "cell_type": "markdown", "id": "e9de5bbc", "metadata": {}, "source": [ "\n", "### **Array Conversions**\n", "\n", "**Task:**\n", "- Convert the xarray of air temperature into a 3D numpy and a 3D pytorch array\n", "- systematically print the dimensions\n" ] } ], "metadata": { "kernelspec": { "display_name": "mlgeo", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" } }, "nbformat": 4, "nbformat_minor": 5 }