---
title: "How to load and save vector data in R"
author: "Krzysztof Dyba"
date:  "08 December, 2023"
comments: false
layout: post
categories: r
---

```{r include=FALSE}
knitr::opts_chunk$set(fig.path = "images/")
```

TOC

[DOWNLOADHERE]

**Summary**:

If you have spatial vector data and are wondering how to load / save it in R,
this tutorial is the answer to your questions. It presents practical examples
for the most popular formats using the [**sf**](https://r-spatial.github.io/sf/)
package. We will use free vector layers from
[Natural Earth](https://www.naturalearthdata.com/) as a data source.

# Introduction

For convenience, all necessary files are located in the
[GitHub repository](https://github.com/kadyb/sf_load_save):

- countries.shp (and related files)
- rivers.gpkg
- cities.geojson

We can download the mentioned data and interactive notebook (.Rmd) manually
from the repository ("Code" button > "Download ZIP") or use the following script.

```{r eval=FALSE}
url = "https://github.com/kadyb/sf_load_save/archive/refs/heads/main.zip"
download.file(url, "sf_load_save.zip")
unzip("sf_load_save.zip")
```

In the first step, we need to download the **sf** package using the
`install.packages()` function, and then use the `library()` function to load it
into the session.

```{r eval=FALSE}
install.packages("sf")
```

```{r message=FALSE}
library("sf")
```

# Vector loading

## Shapefile (.shp)

Let's start by loading the shapefile format, which actually consists of several
files (e.g., .shp, .shx, .dbf, .prj). More information can be found on
[Wikipedia](https://en.wikipedia.org/wiki/Shapefile), but currently it is not
recommended to use this format due to its
[many limitations](http://switchfromshapefile.org/).

Generally, we can use the `read_sf()` function to load data. It requires
providing a path to the file. The file path can be defined in two ways in R
and this is the most common source of problems (errors like: `Error: Cannot
open "file.shp"; The file doesn't seem to exist.`).

The first way (easier) is to provide an **absolute path**, i.e. we
must provide the exact location where the file is located. For instance:

```
path = "C:/Users/Krzysztof/Documents/file.shp"
```

However, this [is not](https://r4ds.hadley.nz/workflow-scripts#relative-and-absolute-paths)
the recommended method, as it makes it impossible to locate files on different
operating systems. The second way is to specify a **relative path**. In this
case, we specify the location of the file relative to the current working
directory (or project). To find out where the working directory is, we can
use the `getwd()` function, and to change it the `setwd()` function. For instance:

```
getwd()
#> "C:/Users/Krzysztof/Documents"
path = "file.shp"
```

Let's load the shapefile using a relative path (all data can be found in the
`data` folder).

```{r}
countries = read_sf("data/countries/countries.shp")
```

We can then print the metadata about this vector layer by referring to the
`countries` object.

```{r}
countries
```

We can see that this layer consists of 52 features (rows) and 168 fields (columns).
The next information is about geometry type, dimension, spatial extent (bounding box)
and coordinate reference system (CRS). In addition, the first 10 rows were
printed.

After loading the data, it is a good idea to present it on a map. A simple
`plot()` function can be used for this purpose. The `countries` object has many
fields (attributes), but to start with we only need geometry. It can be obtained
by using the `st_geometry()` function.

```{r fig.path = "images/", label="sf-load-save-1"}
plot(st_geometry(countries))
```

## GeoPackage (.gpkg)

The next dataset is rivers (linear geometry) saved in
[GeoPackage format](https://www.geopackage.org/). It is loaded in exactly the
same way as the shapefile before. Note that this format can consist of multiple
layers of different types. In this case, we must define which layer exactly we
want to load. To check what layers are in the geopackage, use the `st_layers()`
function, and then specify it using the `layer` argument in `read_sf()`.
If the file only contains one layer, we don't need to do this.

```{r}
st_layers("data/rivers.gpkg")
rivers = read_sf("data/rivers.gpkg", layer = "rivers")
```

We can also display metadata as in the previous example.

```{r}
rivers
```

And make a visualization, but this time we will plot rivers against the
background of country borders. Adding more layers to the visualization is
done with the `add = TRUE` argument in `plot()` function. Note that the order
in which objects are added is important -- the objects added last are displayed
at the top. The `col` argument is used to set the color of the object.

```{r fig.path = "images/", label="sf-load-save-2"}
plot(st_geometry(countries))
plot(st_geometry(rivers), add = TRUE, col = "blue")
```

## GeoJSON (.geojson)

The last GeoJSON file contains cities in the world. In this case, we also use
the `read_sf()` function to load this file.

```{r}
cities = read_sf("data/cities.geojson")
cities
```

In this dataset, there is the `featurecla` column that indicates the type of city.
So let's try to print them and then select only state capitals.

We can print a column (attribute) in two ways, i.e. by specifying the column
name in:

1. Single square brackets -- a spatial object will be printed
2. Double square brackets (alternatively a dollar sign) -- only the text will
be printed

```{r}
cities["featurecla"]
```

```{r}
# the `head()` function prints only the first 6 elements
head(cities[["featurecla"]])

# or alternatively
# head(cities$featurecla)
```
This layer contains 1287 different cities. To find out what types of cities
these are, we can use the `table()` function, which will summarize them.

```{r}
table(cities[["featurecla"]])
```

We are interested in `Admin-0 capital` and `Admin-0 capital alt` types because
some countries have two capitals. We make selection as follows using the `|` (OR)
operator:

```{r}
sel = cities$featurecla == "Admin-0 capital" | cities$featurecla == "Admin-0 capital alt"
head(sel)
```

As a result of this operation, we got a logical vector with TRUE and FALSE
values (if the city is / is not the capital). Now let's create a new object
named `capitals`, which will contain only capitals.

```{r}
# select only those cities that meet the above conditions
capitals = cities[sel, ]
capitals["name"]
```

In the last step, we prepare the final visualization. We can add a title
(`main` argument), axes (`axes` argument) and change the background color
(`bgc` argument) of the figure. We can also change the point symbol (`pch`
argument), set its size (`cex` argument) and fill color (`bg` argument).

```{r fig.path = "images/", label="sf-load-save-3"}
plot(st_geometry(countries), main = "Africa", axes = TRUE, bgc = "deepskyblue",
     col = "burlywood")
plot(st_geometry(rivers), add = TRUE, col = "blue")
plot(st_geometry(capitals), add = TRUE, pch = 24, bg = "red", cex = 0.8)
```

# Vector saving

Saving vector data is as easy as loading. There is a dedicated `write_sf()`
function for this purpose and it requires two arguments:

1. The object we want to save
2. The path to save with file extension

For example, let's save our `capital` object as a GeoPackage (.gpkg), but as an
exercise you can save it in other formats as well (you just need to change the
extension).

```{r}
write_sf(capitals, "data/capitals.gpkg")
```

# Synopsis

The **sf** package allows loading vector data with the `read_sf()` function
and saving it with the `write_sf()` function in R. A list of all supported vector
formats can be found on the [GDAL website](https://gdal.org/drivers/vector/index.html).

For more information, see:

1. **sf** introductory vignette: [Reading, Writing and Converting Simple Features](https://r-spatial.github.io/sf/articles/sf2.html)
2. Introduction to **sf** and **stars** in [Spatial Data Science: With Applications in R](https://r-spatial.org/book/07-Introsf.html)
(Pebesma E. & Bivand R., 2023)

# Supplement

In the previous part of the tutorial, we looked at simple examples of loading
vector data, while in this section we will check out more advanced ways.

## Zipped shapefile (.shz)

As we noted earlier, a shapefile consists of several files, which can be
cumbersome. Some solution is to use zipped shapefiles, which is de facto an
archive. To create such a file, the extension .shz (or .shp.zip) and the
`ESRI Shapefile` driver are required. Loading is done in a standard way by
specifying the path to the ".shz" file.

```{r}
write_sf(capitals, "data/capitals.shz", driver = "ESRI Shapefile")
```

Hooray, only one file on the disk!

## Virtual File Systems

GDAL provides some facilities for loading files using some abstraction by 
[Virtual File Systems](https://gdal.org/user/virtual_file_systems.html).
In practice, this means that we can refer directly to the files without
first unpacking or downloading them in R. For example, we can directly open
the shapefile that is in the archive on the website. To do this, we must
use two prefixes:

1. `/vsicurl/` to download the file
2. `/vsizip/` to unpack the archive

```{r}
# URL is file path
url = "https://raw.githubusercontent.com/OSGeo/gdal/master/autotest/ogr/data/shp/poly.zip"
# note that the order of the prefixes is reverse
f = paste0("/vsizip/", "/vsicurl/", url)
read_sf(f)
```

## SQL preselection

We can use [SQL queries](https://en.wikipedia.org/wiki/SQL) to pre-filter
features, so only selected objects / attributes will be loaded. This allows us
to limit the size of the object in memory and speed up the operation time.
Moreover, we can also make spatial selection, i.e. limit the loading of data
only to a selected area.

### Columns filtering

The `query` argument in the `read_sf()` function is used to pass SQL queries.
Let's go back to the `countries` dataset and load only the column with the names
of countries (`NAME_LONG`).

```{r}
sql = "SELECT NAME_LONG FROM countries"
f = "data/countries/countries.shp"
read_sf(f, query = sql)
```

### Rows filtering

We can also select rows using a condition, e.g. population (`POP_EST`) greater
than 25 million.

```{r}
sql = "SELECT * FROM countries WHERE POP_EST > 25000000"
f = "data/countries/countries.shp"
read_sf(f, query = sql) # 17 countries
```

### Spatial filtering

Finally, to perform spatial filtering, we must first define the spatial extent
/ bounding box (`st_bbox()` function) and specify its coordinate reference
system (CRS). Then the bounding box needs to be converted into a polygon using
the `st_as_sfc()` function and finally converted to a 
[Well-Know Text](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry)
representation using `st_as_text()` function. Therefore, prepared text is passed
to the `wkt_filter` argument. Follow the example below of loading rivers only
in southern Africa:

```{r}
bbox = st_bbox(c(xmin = 10, xmax = 40, ymax = -35, ymin = -20),
               crs = st_crs(4326))
bbox = st_as_text(st_as_sfc(bbox))
bbox
```

```{r fig.path = "images/", label="sf-load-save-4"}
f = "data/rivers.gpkg"
rivers_south = read_sf(f, wkt_filter = bbox)
plot(st_geometry(rivers_south), axes = TRUE)
```