--- syncID: c1cd91f1343b430c9c37497c52cf98ac title: "Intro to Working with Hyperspectral Remote Sensing Data in HDF5 Format in R" code1: https://raw.githubusercontent.com/NEONScience/NEON-Data-Skills/main/tutorials/R/AOP/Hyperspectral/Work-With-Hyperspectral-Data-In-R/Work-With-Hyperspectral-Data-In-R.R authors: Leah A. Wasser, Edmund Hart, Donal O'Leary contributors: Felipe Sanchez, Bridget Hass dataProduct: DP3.30006.001 dateCreated: 2014-11-26 20:49:52 description: Open up and explore a hyperspectral dataset stored in HDF5 format in R. Learn about the power of data slicing in HDF5. Slice our band subsets of the data to create and visualize one band. estimatedTime: 0.5 - 1 Hours languagesTool: R packagesLibraries: rhdf5, terra, neonUtilities topics: hyperspectral, HDF5, remote-sensing tutorialSeries: intro-hsi-r-series urlTitle: hsi-hdf5-r --- In this tutorial, we will show how to read and extract NEON reflectance data stored within an HDF5 file using R.

## Learning Objectives After completing this tutorial, you will be able to: * Explain how HDF5 data can be used to store spatial data and the associated benefits of this format when working with large spatial data cubes. * Extract metadata from HDF5 files. * Slice or subset HDF5 data. You will extract one band of pixels. * Plot a matrix as an image and a raster. * Export a final GeoTIFF (spatially projected) that can be used both in further analysis and in common GIS tools like QGIS. ## Things You’ll Need To Complete This Tutorial To complete this tutorial you will need the most current version of R and, preferably, RStudio installed on your computer. ### R Libraries to Install: * **rhdf5**: `install.packages("BiocManager")`, `BiocManager::install("rhdf5")` * **terra**: `install.packages("terra")` * **neonUtilities**: `install.packages("neonUtilities")` More on Packages in R - Adapted from Software Carpentry. ### Data to Download Data will be downloaded in the tutorial using the `neonUtilities::byTileAOP` function. These hyperspectral remote sensing data provide information on the National Ecological Observatory Network's San Joaquin Experimental Range field site in March of 2021. The data were collected over the San Joaquin field site located in California (Domain 17).The entire dataset can be also be downloaded from the NEON Data Portal. *** **Set Working Directory:** This lesson assumes that you have set your working directory to the location of the downloaded data. An overview of setting the working directory in R can be found here. **R Script & Challenge Code:** NEON data lessons often contain challenges to reinforce skills. If available, the code for challenge solutions is found in the downloadable R script of the entire lesson, available in the footer of each lesson page.

## About Hyperspectral Remote Sensing Data The electromagnetic spectrum is composed of thousands of bands representing different types of light energy. Imaging spectrometers (instruments that collect hyperspectral data) break the electromagnetic spectrum into groups of bands that support classification of objects by their spectral properties on the Earth's surface. Hyperspectral data consists of many bands - up to hundreds of bands - that span a portion of the electromagnetic spectrum, from the visible to the Short Wave Infrared (SWIR) regions. The NEON imaging spectrometer (NIS) collects data within the 380 nm to 2510 nm portions of the electromagnetic spectrum within bands that are approximately 5 nm in width. This results in a hyperspectral data cube that contains approximately 426 bands - which means BIG DATA.

Data cube graphic of NEON hyperspectral data. Each layer in the cube represents a band. — A data cube of NEON hyperspectral data. Each layer in the cube represents a band.

The HDF5 data model natively compresses data stored within it (makes it smaller) and supports data slicing (extracting only the portions of the data that you need to work with rather than reading the entire dataset into memory). These features make it ideal for working with large data cubes such as those generated by imaging spectrometers, in addition to supporting spatial data and associated metadata. In this tutorial we will demonstrate how to read and extract spatial raster data stored within an HDF5 file using R. ## Read HDF5 data into R We will use the `terra` and `rhdf5` packages to read in the HDF5 file that contains hyperspectral data for the NEON San Joaquin (SJER) field site. Let's start by calling the needed packages and reading in our NEON HDF5 file. Please be sure that you have *at least* version 2.10 of `rhdf5` installed. Use: `packageVersion("rhdf5")` to check the package version.

**Data Tip:** To update all packages installed in R, use `update.packages()`.

```{r install-load-library, results="hide" } # Load `terra` and `rhdf5` packages to read NIS data into R library(terra) library(rhdf5) library(neonUtilities) ``` Set the working directory to ensure R can find the file we are importing, and we know where the file is being saved. You can move the file that is downloaded afterward, but be sure to re-set the path to the file. ```{r set-wd, results="hide" } wd <- "~/data/" #This will depend on your local environment setwd(wd) ``` We can use the `neonUtilities` function `byTileAOP` to download a single reflectance tile. You can run `help(byTileAOP)` to see more details on what the various inputs are. For this exercise, we'll specify the UTM Easting and Northing to be (257500, 4112500), which will download the tile with the lower left corner (257000,4112000). By default, the function will check the size total size of the download and ask you whether you wish to proceed (y/n). This file is ~672.7 MB, so make sure you have enough space on your local drive. You can set `check.size=FALSE` if you want to download without a prompt. ```{r download-refl, eval=FALSE} byTileAOP(dpID='DP3.30006.001', site='SJER', year='2021', easting=257500, northing=4112500, check.size=TRUE, # set to FALSE if you don't want to enter y/n savepath = wd) ``` This file will be downloaded into a nested subdirectory under the `~/data` folder, inside a folder named `DP3.30006.001` (the Data Product ID). The file should show up in this location: `~/data/DP3.30006.001/neon-aop-products/2021/FullSite/D17/2021_SJER_5/L3/Spectrometer/Reflectance/NEON_D17_SJER_DP3_257000_4112000_reflectance.h5`.

**Data Tip:** To make sure you are pointing to the correct path, look in the `~/data` folder and navigate to where the .h5 file is saved, or use the R command `list.files(path=wd,pattern="\\.h5$",recursive=TRUE,full.names=TRUE)` to display the full path of the .h5 file. Note, if you have any other .h5 files downloaded in this folder, it will display all of the hdf5 files.

```{r define-h5, results="hide" } # Define the h5 file name to be opened h5_file <- paste0(wd,"DP3.30006.001/neon-aop-products/2021/FullSite/D17/2021_SJER_5/L3/Spectrometer/Reflectance/NEON_D17_SJER_DP3_257000_4112000_reflectance.h5") ``` You can use `h5ls` and/or `View(h5ls(...))` to look at the contents of the hdf5 file, as follows: ```{r view-file-strux, eval=FALSE, comment=NA} # look at the HDF5 file structure View(h5ls(h5_file,all=T)) ``` When you look at the structure of the data, take note of the "map info" dataset, the `Coordinate_System` group, and the `wavelength` and `Reflectance` datasets. The `Coordinate_System` folder contains the spatial attributes of the data including its EPSG Code, which is easily converted to a Coordinate Reference System (CRS). The CRS documents how the data are physically located on the Earth. The `wavelength` dataset contains the wavelength values for each band in the data. The `Reflectance` dataset contains the image data that we will use for both data processing and visualization. More Information on raster metadata: * Raster Data in R - The Basics - this tutorial explains more about how rasters work in R and their associated metadata. * About Hyperspectral Remote Sensing Data -this tutorial explains more about metadata and important concepts associated with multi-band (multi and hyperspectral) rasters.

**Data Tip - HDF5 Structure:** Note that the structure of individual HDF5 files may vary depending on who produced the data. In this case, the Wavelength and reflectance data within the file are both h5 datasets. However, the spatial information is contained within a group. Data downloaded from another organization (like NASA) may look different. This is why it's important to explore the data as a first step!

We can use the `h5readAttributes()` function to read and extract metadata from the HDF5 file. Let's start by learning about the wavelengths described within this file. ```{r read-band-wavelength-attributes } # get information about the wavelengths of this dataset wavelengthInfo <- h5readAttributes(h5_file,"/SJER/Reflectance/Metadata/Spectral_Data/Wavelength") wavelengthInfo ``` Next, we can use the `h5read` function to read the data contained within the HDF5 file. Let's read in the wavelengths of the band centers: ```{r read-band-wavelengths } # read in the wavelength information from the HDF5 file wavelengths <- h5read(h5_file,"/SJER/Reflectance/Metadata/Spectral_Data/Wavelength") head(wavelengths) tail(wavelengths) ``` Which wavelength is band 21 associated with? (Hint: look at the wavelengths vector that we just imported and check out the data located at index 21 - `wavelengths[21]`).

Graphical representation showing where the 482 nm wavelength falls within the blue portion of the visible light region of the electromagnetic spectrum. — 482 nanometers falls within the blue portion of the electromagnetic spectrum. Source: National Ecological Observatory Network

Band 21 has a associated wavelength center of 481.7982 nanometers (nm) which is in the blue portion (~380-500 nm) of the visible electromagnetic spectrum (~380-700 nm). ### Bands and Wavelengths A *band* represents a group of wavelengths. For example, the wavelength values between 695 nm and 700 nm might be one band captured by an imaging spectrometer. The imaging spectrometer collects reflected light energy in a pixel for light in that band. Often when you work with a multi- or hyperspectral dataset, the band information is reported as the center wavelength value. This value represents the mean value of the wavelengths represented in that band. Thus in a band spanning 695-700 nm, the center would be 697.5 nm). The full width half max (FWHM) will also be reported. This value can be thought of as the spread of the band around that center point. So, a band that covers 800-805 nm might have a FWHM of 5 nm and a wavelength value of 802.5 nm.

Graphical representation showing how bands represent a range of values within the electromagnetic spectrum. The graphic shows wavelengths 675 nm through 700 nm split into five different bands, labeled bands A through E. Values for each band are often represented as the center point value of each band. — Bands represent a range of values (types of light) within the electromagnetic spectrum. Values for each band are often represented as the center point value of each band. Source: National Ecological Observatory Network (NEON)

The HDF5 dataset that we are working with in this activity may contain more information than we need to work with. For example, we don't necessarily need to process all 426 bands available in a full NEON hyperspectral reflectance file - if we are interested in creating a product like NDVI which only uses bands in the Near InfraRed (NIR) and Red portions of the spectrum. Or we might only be interested in a spatial subset of the data - perhaps an area where we have collected corresponding ground data in the field. The HDF5 format allows us to slice (or subset) the data - quickly extracting the subset that we need to process. Let's extract one of the green bands - band 34. By the way - what is the center wavelength value associated with band 34? Hint: `wavelengths[34]`. How do we know this band is a green band in the visible portion of the spectrum? In order to effectively subset our data, let's first read the reflectance metadata stored as *attributes* in the "Reflectance_Data" dataset. ```{r get-reflectance-shape} # First, we need to extract the reflectance metadata: reflInfo <- h5readAttributes(h5_file, "/SJER/Reflectance/Reflectance_Data") reflInfo # Next, we read the different dimensions nRows <- reflInfo$Dimensions[1] nCols <- reflInfo$Dimensions[2] nBands <- reflInfo$Dimensions[3] nRows nCols nBands ``` The HDF5 read function reads data in the order: Bands, Cols, Rows. This is different from how R reads data. We'll adjust for this later. ```{r get-reflectance-shape-2} # Extract or "slice" data for band 34 from the HDF5 file b34 <- h5read(h5_file,"/SJER/Reflectance/Reflectance_Data",index=list(34,1:nCols,1:nRows)) # what type of object is b34? class(b34) ``` ### A Note About Data Slicing in HDF5 Data slicing allows us to extract and work with subsets of the data rather than reading in the entire dataset into memory. In this example, we will extract and plot the green band without reading in all 426 bands. The ability to slice large datasets makes HDF5 ideal for working with big data. Next, let's convert our data from an array (more than 2 dimensions) to a matrix (just 2 dimensions). We need to have our data in a matrix format to plot it. ```{r convert-to-matrix} # convert from array to matrix by selecting only the first band b34 <- b34[1,,] # display the class of this re-defined variable class(b34) ``` ### Arrays vs. Matrices Arrays are matrices with more than 2 dimensions. When we say dimension, we are talking about the "z" associated with the data (imagine a series of tabs in a spreadsheet). Put the other way: matrices are arrays with only 2 dimensions. Arrays can have any number of dimensions one, two, ten or more. Here is a matrix that is 4 x 3 in size (4 rows and 3 columns): | Metric | species 1 | species 2 | |----------------|-----------|-----------| | total number | 23 | 45 | | average weight | 14 | 5 | | average length | 2.4 | 3.5 | | average height | 32 | 12 | ### Dimensions in Arrays An array contains 1 or more dimensions in the "z" direction. For example, let's say that we collected the same set of species data for every day in a 30 day month. We might then have a matrix like the one above for each day for a total of 30 days making a 4 x 3 x 30 array (this dataset has more than 2 dimensions). More on R object types here (links to external site, DataCamp).

Left: a matrix has only 2 dimensions. Right: an array has more than 2 dimensions.

Graphic showing an array, which in contrast to a matrix, has more than two dimensions. In this graphic, additional dimensions are represented in the z direction, and labeled a through d. — Left: a matrix has only 2 dimensions. Right: an array has more than 2 dimensions.

Next, let's look at the metadata for the reflectance data. When we do this, take note of 1) the scale factor and 2) the data ignore value. Then we can plot the band 34 data. Plotting spatial data as a visual "data check" is a good idea to make sure processing is being performed correctly and all is well with the image. ```{r read-attributes-plot, fig.cap=c("Plot of reflectance values for band 34 data. This plot shows a very washed out image lacking any detail.","Plot of log transformed reflectance values for band 34 data. Log transformation improved the visibility of details in the image, but it is still not great.")} # look at the metadata for the reflectance dataset h5readAttributes(h5_file,"/SJER/Reflectance/Reflectance_Data") # plot the image image(b34) ``` What do you notice about the first image? It's washed out and lacking any detail. What could be causing this? It got better when plotting the log of the values, but still not great. ``` {r plot-log-b34} # this is a little hard to visually interpret - what happens if we plot a log of the data? image(log(b34)) ``` Let's look at the distribution of reflectance values in our data to figure out what is going on. ```{r hist-data, fig.cap=c("Histogram of reflectance values for band 34. The x-axis represents the reflectance values and ranges from 0 to 8000. The frequency of these values is on the y-axis. The histogram shows reflectance values are skewed to the right, where the majority of the values lie between 0 and 1000. We can conclude that reflectance values are not equally distributed across the range of reflectance values, resulting in a washed out image.","Histogram of reflectance values between 0 and 5000 for band 34. Reflectance values are on the x-axis, and the frequency is on the y-axis. The x-axis limit has been set 5000 in order to better visualize the distribution of reflectance values. We can confirm that the majority of the values are indeed within the 0 to 4000 range.","Histogram of reflectance values between 5000 and 15000 for band 34. Reflectance values are on the x-axis, and the frequency is on the y-axis. Plot shows that a very few number of pixels have reflectance values larger than 5,000. These values are skewing how the image is being rendered and heavily impacting the way the image is drawn on our monitor.")} # Plot range of reflectance values as a histogram to view range # and distribution of values. hist(b34,breaks=50,col="darkmagenta") # View values between 0 and 5000 hist(b34,breaks=100,col="darkmagenta",xlim = c(0, 5000)) # View higher values hist(b34, breaks=100,col="darkmagenta",xlim = c(5000, 15000),ylim = c(0, 750)) ``` As you're examining the histograms above, keep in mind that reflectance values range between 0-1. The **data scale factor** in the metadata tells us to divide all reflectance values by 10,000. Thus, a value of 5,000 equates to a reflectance value of 0.50. Storing data as integers (without decimal places) compared to floating points (with decimal places) creates a smaller file. This type of scaling is commin in remote sensing datasets. Notice in the data that there are some larger reflectance values (>5,000) that represent a smaller number of pixels. These pixels are skewing how the image renders. ### Data Ignore Value Image data in raster format will often contain a data ignore value and a scale factor. The data ignore value represents pixels where there are no data. Among other causes, no data values may be attributed to the sensor not collecting data in that area of the image or to processing results which yield null values. Remember that the metadata for the `Reflectance` dataset designated -9999 as `data ignore value`. Thus, let's set all pixels with a value == -9999 to `NA` (no value). If we do this, R won't render these pixels. ```{r set-values-NA, fig.cap="Plot of reflectance values for band 34 data with values equal to -9999 set to NA. Image data in raster format will often contain no data values, which may be attributed to the sensor not collecting data in that area of the image or to processing results which yield null values. Reflectance datasets designate -9999 as data ignore values. As such, we will reassign -9999 values to NA so R won't try to render these pixels."} # there is a no data value in our raster - let's define it noDataValue <- as.numeric(reflInfo$Data_Ignore_Value) noDataValue # set all values equal to the no data value (-9999) to NA b34[b34 == noDataValue] <- NA # plot the image now image(b34) ``` ### Reflectance Values and Image Stretch Our image still looks dark because R is trying to render all reflectance values between 0 and 14999 as if they were distributed equally in the histogram. However we know they are not distributed equally. There are many more values between 0-5000 than there are values >5000. Images contain a distribution of reflectance values. A typical image viewing program will render the values by distributing the entire range of reflectance values across a range of "shades" that the monitor can render - between 0 and 255. However, often the distribution of reflectance values is not linear. For example, in the case of our data, most of the reflectance values fall between 0 and 0.5. Yet there are a few values >0.8 that are heavily impacting the way the image is drawn on our monitor. Imaging processing programs like ENVI, QGIS and ArcGIS (and even Adobe Photoshop) allow you to adjust the stretch of the image. This is similar to adjusting the contrast and brightness in Photoshop. The proper way to adjust our data would be to apply what's called an `image stretch`. We will learn how to stretch our image data later. For now, let's plot the values as the log function on the pixel reflectance values to factor out those larger values. ```{r plot-log, fig.cap="Plot of log transformed reflectance values for the previous b34 image. Applying the log to the image increases the contrast making it look more like an image by factoring out those larger values. While an improvement, the image is still far from great. The proper way to adjust an image is by doing whats called an image stretch." } image(log(b34)) ``` The log applied to our image increases the contrast making it look more like an image. However, look at the images below. The top one is an RGB image as the image should look. The bottom one is our log-adjusted image. Notice a difference?

RGB image of the SJER field site. At the top right of the image, there is dark, brackish water. Scattered throughout the image, there are several trees. At the center of the image, there is a baseball field, with low grass. At the bottom left of the image, there is a parking lot and some buildings with highly reflective surfaces, and adjacent to it is a section of a gravel lot. — Top: The image as it should look. Bottom: the image that we outputted from the code above. Notice a difference?

Plot of log transformed reflectance values for the b34 image previously plotted. Applying the log to the image increases the contrast making it look more like an image by factoring out those larger values. While an improvement, the image is still far from great. The proper way to adjust an image is by applying an image stretch. The log transformed image appears flipped because when R reads in the dataset, it reads them as: Columns x Bands x Rows, as opposed to the RGB image on the left which has dimensions as Bands x Rows x Columns. — Top: The image as it should look. Bottom: the image that we outputted from the code above. Notice a difference?

### Transpose Image Notice that there are three data dimensions for this file: Bands x Rows x Columns. However, when R reads in the dataset, it reads them as: Columns x Bands x Rows. The data are flipped. We can quickly transpose the data to correct for this using the `t` or `transpose` command in R. The orientation is rotated in our log adjusted image. This is because R reads in matrices starting from the upper left hand corner. While most rasters read pixels starting from the lower left hand corner. In the next section, we will deal with this issue by creating a proper georeferenced (spatially located) raster in R. The raster format will read in pixels following the same methods as other GIS and imaging processing software like QGIS and ENVI do. ```{r transpose-data, fig.cap="Plot showing the transposed image of the log transformed reflectance values of b34. The orientation of the image is rotated in our log transformed image, because R reads in the matrices starting from the upper left hand corner."} # We need to transpose x and y values in order for our # final image to plot properly b34 <- t(b34) image(log(b34), main="Transposed Image") ``` ## Create a Georeferenced Raster Next, we will create a proper raster using the `b34` matrix. The raster format will allow us to define and manage: * Image stretch * Coordinate reference system & spatial reference * Resolution * and other raster attributes... It will also account for the orientation issue discussed above. To create a raster in R, we need a few pieces of information, including: * The coordinate reference system (CRS) * The spatial extent of the image ### Define Raster CRS First, we need to define the Coordinate reference system (CRS) of the raster. To do that, we can first grab the EPSG code from the HDF5 attributes, and covert the EPSG to a CRS string. Then we can assign that CRS to the raster object. ```{r define-CRS, fig.cap="Plot of the properly oriented raster image of the band 34 data. In order to orient the image correctly, the coordinate reference system was defined and assigned to the raster object. X-axis represents the UTM Easting values, and the Y-axis represents the Northing values."} # Extract the EPSG from the h5 dataset h5EPSG <- h5read(h5_file, "/SJER/Reflectance/Metadata/Coordinate_System/EPSG Code") # convert the EPSG code to a CRS string h5CRS <- crs(paste0("+init=epsg:",h5EPSG)) # define final raster with projection info # note that capitalization will throw errors on a MAC. # if UTM is all caps it might cause an error! b34r <- rast(b34, crs=h5CRS) # view the raster attributes b34r # let's have a look at our properly oriented raster. Take note of the # coordinates on the x and y axis. image(log(b34r), xlab = "UTM Easting", ylab = "UTM Northing", main = "Properly Oriented Raster") ``` Next we define the extents of our raster. The extents will be used to calculate the raster's resolution. Fortunately, the spatial extent is provided in the HDF5 file "Reflectance_Data" group attributes that we saved before as `reflInfo`. ```{r define-extent} # Grab the UTM coordinates of the spatial extent xMin <- reflInfo$Spatial_Extent_meters[1] xMax <- reflInfo$Spatial_Extent_meters[2] yMin <- reflInfo$Spatial_Extent_meters[3] yMax <- reflInfo$Spatial_Extent_meters[4] # define the extent (left, right, top, bottom) rasExt <- ext(xMin,xMax,yMin,yMax) rasExt # assign the spatial extent to the raster ext(b34r) <- rasExt # look at raster attributes b34r ```

Image showing how the extent of a raster image represents the spatial location of each corner. The coordinate units are determined by the spatial projection/coordinate reference system that are assigned to the data. — The extent of a raster represents the spatial location of each corner. The coordinate units will be determined by the spatial projection coordinate reference system that the data are in. Source: National Ecological Observatory Network (NEON)

Learn more about working with Raster data in R in the Data Carpentry workshop: Introduction to Geospatial Raster and Vector Data with R. We can adjust the colors of our raster as well, if desired. ```{r plot-colors-raster, fig.cap="Plot of the properly oriented raster image of B34 with custom colors. We can adjust the colors of the image by adjusting the z limits, which in this case makes the highly reflective surfaces more vibrant. This color adjustment is more apparent in the bottom left of the image, where the parking lot, buildings and bare surfaces are located. The X-axis represents the UTM Easting values, and the Y-axis represents the Northing values."} # let's change the colors of our raster and adjust the zlim col <- terrain.colors(25) image(b34r, xlab = "UTM Easting", ylab = "UTM Northing", main= "Spatially Referenced Raster", col=col, zlim=c(0,3000)) ``` We've now created a raster from band 34 reflectance data. We can export the data as a raster, using the `writeRaster` command. Note that it's good practice to close the H5 connection before moving on! ```{r write-raster, eval=FALSE, comment=NA} # write out the raster as a geotiff writeRaster(b34r, file=paste0(wd,"band34.tif"), overwrite=TRUE) # close the H5 file H5close() ```

### Challenge: Work with Rasters Try these three extensions on your own: 1. Create rasters using other bands in the dataset. 2. Vary the distribution of values in the image to mimic an image stretch. e.g. `b34[b34 > 6000 ] <- 6000` 3. Use what you know to extract ALL of the reflectance values for ONE pixel rather than for an entire band. HINT: this will require you to pick an x and y value and then all values in the z dimension: `aPixel<- h5read(h5_file,"Reflectance",index=list(NULL,100,35))`. Plot the spectra output.