---
title: "Accessing WoSIS from R -- 'Latest' Version"
author: "D G Rossiter"
date: "`r Sys.Date()`"
output:
  word_document:
    toc: yes
  html_document:
    fig_height: 4
    fig_width: 6
    number_section: yes
    theme: spacelab
    toc: yes
    toc_float: yes
bibliography: wosis.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

This document shows how to access WoSIS "Latest" data from  R. For access to WoSIS "Snapshot" data from R, see https://www.isric.org/accessing-wosis-snapshot-r

The "Latest" dynamic dataset contains the most recent version of standardised soil data served from WoSIS. Being dynamic, the dataset will grow once new point data are standardised, additional soil properties are considered, and/or when possible ammendments are required.

For an overview of WoSIS, see https://www.isric.org/explore/wosis. This links to https://www.isric.org/explore/wosis/accessing-wosis-derived-datasets which explains the difference between snapshot and dynamic datasets, and how to access them.

The Procedures Manual ^[https://www.isric.org/sites/default/files/isric_report_2018_01.pdf] describes how the WoSIS database is built.


This tutorial uses many procedures from the "tidyverse" [@Wickham_Grolemund_2016], including pipes（the `%>%` operator).

# Packages

If you do not have these on your system, install with `install.packages(..., dependencies=TRUE)` or via the R Studio package manager.

```{r load.package}
library(sp)             # spatial data types
library(rgdal)          # GDAL access from R
library(gdalUtils)      # wrappers for GDAL utility programs that could be
                        #  called from the command line
library(aqp)            # Algorithms for Quantitative Pedology
```

GDAL is used for spatial data import/export, coordinate systems etc. Check for a valid GDAL installation:

```{r set.gdal,eval=FALSE}
gdal_setInstallation()
valid_install <- !is.null(getOption("gdalUtils_gdalPath"))
if (valid_install)
   print("Valid GDAL found") else
   {print("No valid GDAL"); quit()}
```

# WoSIS Web Feature Service

The "latest" (dynamic) version of WoSIS is provided via WFS (Web Feature Services). This standard is described at http://www.opengeospatial.org/standards/wfs and described in less techncial terms in Wikipedia: https://en.wikipedia.org/wiki/Web_Feature_Service.

WFS allows you to incorporate geographic data into your own GIS projects, unlike WMS (Web Map Service), which only displays geographic data within a GIS. Here we use R as the GIS, i.e., it can handle both geographic and feature-space information.

## Specify WoSIS WFS location and list available layers

Specify the web address of the "latest" version of WoSIS:

```{r specify.wfs.address}
wfs <- "WFS:http://data.isric.org/geoserver/wosis_latest/wfs/"
```

List the layers in the WFS source with `rgdal::ogrinfo`. This can be somewhat slow, because it depends on the network and the remote server. The `ogrinfo` function lists information about an OGR-supported data source, including this WFS source. ^[OGR stands for "OGR Simple Features Library", which is now part of GDAL, which stands for "Geospatial Data Abstraction Library", see https://gdal.org]

```{r ogrinfo.wfs}
layers.info <- ogrinfo(wfs, ro=TRUE, so=TRUE)
# `ro` = read-only; `so` = summary only
cat(layers.info, sep = "\n")
```

There are `r length(layers.info)` layers. Most of the names refer to soil properties in a fairly obvious way: the database name `wosis_latest:wosis_latest_` and then a property name, e.g., `bdfi33`. These names are explained at https://www.isric.org/explore/wosis/accessing-wosis-derived-datasets; for example `bdfi33` is "Bulk density of the fine earth fraction, equilibrated at 33 kPa". units are $\textrm{mg} \; \textrm{kg}^{-1}$. 

The final layer `"wosis_latest:wosis_latest_profiles (Point)"` is site information.

## Display layer properties

We need to know the layer's meaning, format, extent etc. before we read it. Select the first property in the information list above. Again call `ogrinfo` but this time also specifying a layer within the WFS. Show only the summary, not the features, with the `so=TRUE` optional argument.

```{r ogrinfo.properties}
property.info <- ogrinfo(wfs, layer="wosis_latest_bdfi33",
                         ro=TRUE, so=TRUE,  verbose=TRUE)
cat(property.info, sep="\n")
property.info[6]
property.info[11]
```

We see this is "Bulk density of the fine earth fraction < 2 mm, equilibrated at 33 kPa". Refer to the procedures manual for how each property (here, bulk density) was determined. We see the feature count `r property.info[11]`, its geometry (points, as is everything in WoSIS), its Coordinate Reference System (CRS, listed here as `GEOCGS` under `SRS` = "Spatial Reference System", `WKT` = "Well-Known Type"), which is `EPSG:4326`, i.e., geographic coördinates on the WGS84 datum. ^[For an exmplanation of the EPSG system, see https://www.epsg-registry.org]

The `rgdal::ogrinfo` function allows SQL (Structured Query Language) ^[https://en.wikipedia.org/wiki/SQL] queries ^[https://en.wikipedia.org/wiki/SQL_syntax] to limit the extent of the search, by using the (optional) `where` argument. To use this we need to know the field names, which we saw in the previous output.

For example, all the subsoil bulk densities from India:

```{r ogrinfo.bd.india}
bd.india.info <- ogrinfo(wfs, ro=TRUE, so=TRUE, layer="wosis_latest_bdfi33",
        where="country_name='India' AND upper_depth > 100",
        verbose=TRUE)
bd.india.info[11]
```

Here there are only `r bd.india.info[11]` records.

The `rgdal::ogrinfo` function also allows an (optional) limitation to a bounding box with the `spat` argument, as a four-element vector `(xmin, ymin, xmax, ymax)`.

For example, all the profile (site) information from a $2^\circ \times 2^\circ$ tile in central Europe.

```{r ogrinfo.eu.profiles}
central.eu.profiles.info <-
        ogrinfo(wfs, ro=TRUE, so=TRUE, layer="wosis_latest_profiles",
        spat=c(6, 48, 8, 50))
central.eu.profiles.info[6]
central.eu.profiles.info[11]
```


## Import WoSIS layers to the client system

There seems to be no way to directly import from the WFS to an R workspace object, so there must first be an intermediate step: download the WFS layer in an appropriate format to a local directory, and then import as usual for GIS layers.

The `rgdal::ogr2ogr` function reads from one format on the server and writes to another in our client. The default output file format is an ESRI Shapefile ^[https://en.wikipedia.org/wiki/Shapefile]; other formats can be specified with the (optional) `f` argument.

The possible formats are listed at https://docs.geoserver.org/latest/en/user/services/wfs/outputformats.html. Here we will
use the default format, i.e. the ESRI shapefile. (See the next section for another format, CSV).

The `where` and `spat` arguments can also be used here. In general you will want  to limit the size of the object with one or both of these.

The `ogr2ogr` function also allows an (optional) transformation to any EPSG-defined CRS with the `t_srs` argument.

See `?ogr2ogr` for  more options.

We download the first property, i.e., bulk density, and save it a subdirectory of the current path, and with the default format. The directory must be first created if does not already exist.

```{r download.bd}
wosis.dir.name <- "./wosis"
if (!file.exists(wosis.dir.name)) dir.create(wosis.dir.name)
layer.name <- "wosis_latest_bdfi33"
ogr2ogr(src=wfs,
        dst=wosis.dir.name,
        layer=layer.name,
        f = "ESRI Shapefile",
        overwrite=TRUE,
        skipfailures=TRUE)
```

Because the  destination format is `"ESRI Shapefile"` many of the field names have to be shortened, as shown by warnings such as "Warning 6: Normalized/laundered field name: 'profile_layer_id' to 'profile_la'".

Note that this query can can be combined with an SQL query to limit the download, with the `sql` optional argument.


## Reading imported layer into R

Now read the downloaded shapefile into an R `sp` object.
For shapefiles, the directory and layer names must be specified separately as two arguments, `dsn` ("data set name") and `layer`.

```{r import.bd}
(shapefile.name <- paste0('wosis_latest:',layer.name))
# here strings are just that, not to be interpreted as R factors
bd33 <- readOGR(dsn=wosis.dir.name, layer=shapefile.name,
                stringsAsFactors = FALSE)
class(bd33)
proj4string(bd33)
names(bd33@data)
head(bd33@data)
```

Each attribute has several names, with the following extensions:

* `value`  -- one or more values, in the format {1:value; 2:value...}, which are duplicate measurements 
* `value_avg`  -- the average of the values
* `method` -- text description of the analytical method
* `date` -- one or more values, in the format {1:yyyy-mm-dd; 2:yyyy-mm-dd...}, which are the dates each of th eduplicate measurements was added to the database (not the original measurement date, nor the field sampling date)
* `dataset_id` -- text code of original database
* `profile_code` -- text code of profile from original database
* `licence` -- text string of the Creative Commons ^[https://creativecommons.org/licenses/] license for this value, e.g. `CC-BY-NC`

So for example the  attribute `bdfi33` has the following fields, shortened when input to a shapefile to the first ten characters; if these would be duplicated the name is further manipulated:

* `bdfi33_value` $\to$ `bdfi33_val`
* `bdfi33_value_avg`  $\to$ `bdfi33_v_1`
* `bdfi33_method`  $\to$ `bdfi33_met`
* `bdfi33_date`  $\to$ `bdfi33_dat`
* `bdfi33_dataset_id`  $\to$ `bdfi33_d_1`
* `bdfi33_profile_code`  $\to$ `bdfi33_pro`
* `bdfi33_licence`  $\to$ `bdfi33_lic`

The shapefile has been imported as a `SpatialPointsDataFrame` with the correct CRS, as we saw from the `ogrinfo`.

Examine the format of the attribute, this is in field `*_val`:

```{r head.bd}
head(bd33@data$bdfi33_val)
```

The format is `{seq:val[,seq:val]}` where the `seq` is an integer on `[1...]` indicating which measurement number -- note that there can be more than one measurement per property, e.g., repeated lab. measurements, and `val` is the numeric value.

But the average value has its own field, so if we only want the average, it is prepared for us. We see an example here, from six rows chosen to show several profiles with their layers:

```{r example.bd}
bd33@data[75:80, c("profile_id","upper_dept","lower_dept","bdfi33_val","bdfi33_v_1")]
```


# WoSIS layers as CSV files

Another output format for `ogr2ogr` is the CSV 'comma-separated values' plain-text file.
These typically have one header row giving the name of each column (field), and then one row (case, tuple) per observation

## Read a layer into the client system as a CSV file

For example, read the profile information for the $2^\circ \times 2^\circ$ tile in central Europe:

(Note: `overwrite=TRUE` does not seem to work for `.csv` files, so any previous must be manually deleted).

```{r download.profiles.csv}
wosis.dir.name <- "./wosis"
layer.name <- "wosis_latest_profiles"
target.name <- paste0(wosis.dir.name,"/",layer.name,".csv")
if (file.exists(target.name)) { file.remove(target.name) }
ogr2ogr(src=wfs, 
        spat=c(6, 48, 8, 50),
        dst=target.name,
        layer=layer.name,
        f="CSV",
        overwrite=TRUE)
```

Also get the the bulk density:


```{r download.bd.csv}
layer.name <- "wosis_latest_bdfi33"
target.name <- paste0(wosis.dir.name,"/",layer.name,".csv")
if (file.exists(target.name)) { file.remove(target.name) }
ogr2ogr(src=wfs,
        spat=c(6, 48, 8, 50),
        dst=target.name,
        layer=layer.name,
        f="CSV",
        overwrite=TRUE)
```

## Read from a local text file into R objects

The `read.csv` function reads from a CSV file into an R `data.frame`.

First, the bulk density points:

```{r import.bd.csv}
layer.name <- "wosis_latest_bdfi33"
bd33.pts <- read.csv(file=paste0(wosis.dir.name,"/",layer.name,".csv"),
                stringsAsFactors = FALSE)
dim(bd33.pts)
names(bd33.pts)
```

Notice that here we have the complete field names, not truncated as they were in the shapefiles.
These can now be processed as above (the shapefiles).
 
Second, the profiles (sites):

```{r import.profiles.csv}
layer.name <- "wosis_latest_profiles"
profiles <- read.csv(paste0(wosis.dir.name, "/",layer.name,".csv"),
                stringsAsFactors = FALSE)
names(profiles)
```


Convert the profiles table to a spatial object in the `sp` package, by specifying the coördinates and the Coordinate Reference System (CRS):
 
```{r coordinates.profiles}
coordinates(profiles) <- c("longitude", "latitude")
proj4string(profiles) <- CRS("+init=epsg:4326")
```

Review some site information, e.g., the WRB Reference Soil Groups:

```{r profiles.wrb}
table(profiles@data$cwrb_reference_soil_group)
```

Note that most of these profiles do not have a WRB classification.

And a map:

```{r map.profiles.wrb}
spplot(profiles, zcol="cwrb_reference_soil_group", key.space="right")
```

# Working with the WoSIS layer as a SoilProfileCollection

The `aqp` "Algorithms for Quantitive Pedology" package [@Beaudette_2013] defines data structures and functions specific to soil profile data, i.e., with site and linked layer information.

Convert the bulk density structure to a `SoilProfileCollection`, a data type defined in `aqp`. This data type has separate structures for the site (profile) and horizons.

Initialize with the horizons from the data frame created from the WoSIS layer. 
The `aqp::depth` function initializes the SoilProfileCollection object. The formula has the field name of the profile on the left, and the the field names of the horizon boundaries on the right. These fields are in the WoSIS layer.

```{r bd.aqp}
ds.aqp <- bd33@data
depths(ds.aqp) <- profile_id ~ upper_dept + lower_dept
is(ds.aqp)
slotNames(ds.aqp)
str(ds.aqp@site)
str(ds.aqp@horizons)
head(ds.aqp@site)
head(ds.aqp@horizons[c(2,5,6,7,9)],12)
```

Note how the horizons have been grouped into sites, in the `@site` slot, and the per-horizon (by depth) values are in the `@horizons` slot. Here we have `r dim(ds.aqp@horizons)[1]` horizons in `r dim(ds.aqp@site)[1]` profiles.

Now this `SoilProfileCollection` can be used for many `aqp` functions. For example, here is the depth distribution of average bulk density of the components for the first 24 listed profiles, labelled by genetic horizon:

```{r plot.bd.spc,fig.width=12, fig.height=8}
plotSPC(ds.aqp[1:24,], name="layer_name", color='bdfi33_v_1')
```

A few layers in this set of profiles are missing bulk density.


# References