18 Tutorial 2: Adding a more complex dataset

18.1 Overview

This is the second of five tutorials on adding datasets to your traits.build database.

Before you begin this tutorial, ensure you have installed traits.build, cloned the traits.build-template repository, and successfully built a database from the datasets in traits.build-template. Instructions are available at Tutorial: Example compilation.

Goals

Learn how to merge in location data from a standalone spreadsheet.
Learn how to add substitutions for categorical trait values.
Learn how to add custom R code to the metadata file.
Understand the importance of attributing traits to the correct entity_type.
Understand the importance of having the dataset pivot.

New functions introduced

metadata_add_substitution
metadata_add_substitutions_list
metadata_check_custom_R_code

18.2 Adding tutorial_dataset_2

Ensure the dataset folder contains the correct data files

In the traits.build-template repository, there is a folder titled tutorial_dataset_2 within the data folder.

Ensure that this folder exists on your computer.
The file data.csv exists within the tutorial_dataset_2 folder.
There is a folder raw nested within the tutorial_dataset_2 folder, that contains two files, locations.csv and notes.txt.

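Source necessary functions

You’ll need to load the traits.build functions and source some ancillary functions that are in a file in the R folder. The code in this tutorial also uses dplyr and readr:

library(traits.build)
library(dplyr)   # %>%, rename(), filter(), distinct(), mutate()
library(readr)   # read_csv()
source("R/custom_R_code.R")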

Use functions to create a metadata.yml file

Create a metadata template

To create the metadata template, run:

metadata_create_template("tutorial_dataset_2")

As with tutorial_dataset_1, this function leads you through a series of menus requiring user input. Ensure you select:

data format: wide
taxon_name column: name_original
location_name column: site_TEXT
individual_id column: 1: NA
collection_date column: 1: NA
Enter collection_date range in format ‘2007/2009’: 1996/1997
Do all traits need repeat_measurements_id’s? 2: No

Navigate to the dataset’s folder and open the metadata.yml file in Visual Studio Code, to ensure information is added to the expected sections as you work through the tutorial.

Propagate source information into the metadata.yml file

This dataset is from a published source and therefore the source information can be added with the function metadata_add_source_doi:

metadata_add_source_doi(dataset_id = "tutorial_dataset_2", 
                        doi = "10.1046/j.1365-2745.2000.00506.x")

Confirm:

the authors’ names are formatted as first name last name or first initial last name
the article title is in sentence case
the page numbers are filled in as a range, separated by a double dash

Add location details

For this dataset, location data is provided as a standalone spreadsheet, located in the raw data folder: data/tutorial_dataset_2/raw/locations.csv

First read the location data provided into R:

locations <-
  read_csv("data/tutorial_dataset_2/raw/locations.csv")

traits.build requires three fields to use a specific syntax:

latitude must be in decimal degrees and the field name (column header) must be latitude (deg)
longitude must be in decimal degrees and the field name (column header) must be longitude (deg)
A general site description is documented in the field description

traits.build does not require that the labels for other location properties align across datasets, but it is best practice to use a controlled vocabulary, so database users can easily search across all datasets for information on a specific climate variable or soil nutrient content. For new databases or new location properties, any naming/labeling convention can be established.

To confirm you are using the correct syntax, check the terms already in use:

locations_properties <-
  traits.build_database$locations %>%
  distinct(location_property)

# View() returns NULL, so call it separately rather than assigning its result
View(locations_properties)

Then rename your columns to match those in use:

locations <-
  locations %>%
    rename(
      `longitude (deg)` = long,
      `latitude (deg)` = lat,
      `description` = vegetation,
      `elevation (m)` = elevation,
      `precipitation, MAP (mm)` = MAP,
      `soil P, total (mg/kg)` = `soil P`,
      `soil N, total (ppm)` = `soil N`,
      `geology (parent material)` = `parent material`
    )
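Before passing the renamed table to metadata_add_locations, it can help to confirm that the three required fields are present and that the coordinates look like decimal degrees. The following is a minimal sketch in base R; check_locations() is a hypothetical helper written for this illustration, not a traits.build function, and the toy table contains invented values.

```r
# Hypothetical helper (not part of traits.build): returns a character vector
# describing any problems found; an empty vector means none were found.
check_locations <- function(locations) {
  problems <- character(0)
  required <- c("latitude (deg)", "longitude (deg)", "description")
  missing_cols <- setdiff(required, names(locations))
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("missing column:", missing_cols))
  }
  lat <- locations[["latitude (deg)"]]
  lon <- locations[["longitude (deg)"]]
  # Decimal degrees must be numeric and fall within +/-90 (lat) or +/-180 (lon)
  if (!is.null(lat) && (!is.numeric(lat) || any(abs(lat) > 90, na.rm = TRUE))) {
    problems <- c(problems, "latitude is not in decimal degrees")
  }
  if (!is.null(lon) && (!is.numeric(lon) || any(abs(lon) > 180, na.rm = TRUE))) {
    problems <- c(problems, "longitude is not in decimal degrees")
  }
  problems
}

# Toy table with invented values, for illustration only
toy_locations <- data.frame(
  location = c("site_A", "site_B"),
  `latitude (deg)` = c(-33.5, -34.1),
  `longitude (deg)` = c(151.2, 150.9),
  description = c("woodland", "heath"),
  check.names = FALSE
)

check_locations(toy_locations)
```

An empty result suggests the table is ready to be added; any message points to a column that still needs renaming or converting.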

Now add the location information into the metadata file:

metadata_add_locations(dataset_id = "tutorial_dataset_2", location_data = locations)

Ensure you select:

location_name: location
location_property columns: 1 2 3 4 5 6 7 8

Check the metadata.yml file to ensure the location information has been added as expected. If there is a problem, rerun the necessary code; this will overwrite what is present. You can also manually add additional properties if something is forgotten.

Add traits

To select columns in the data.csv file that include trait data, run:

metadata_add_traits(dataset_id = "tutorial_dataset_2")

Select columns 3 4 5 6, as these contain trait data.

Manual filling in of metadata

After confirming that the skeletal traits section has been added to the metadata.yml file, you must fill in all the unknown fields.

For this dataset, you will later use functions to add substitutions and exclude unwanted observations, but it is best to first fill in the information for contributors, the dataset, and the traits.

These are all fields that contain the word unknown and must be filled in manually:

the contributors section
description, basis_of_record, life_stage, sampling_strategy, original_file, and notes under the dataset section
details for each trait, including unit_in, trait_name, entity_type, value_type, basis_of_value, replicates and methods

Adding contributors

The file data/tutorial_dataset_2/raw/tutorial_dataset_2_notes.txt indicates the main data_contributor for this study.

Fill in the remaining contributor information as described in the tutorial_dataset_1 tutorial.

Dataset fields

The file data/tutorial_dataset_2/raw/tutorial_dataset_2_notes.txt indicates how to fill in the unknown dataset fields for this study.

Trait details

The file data/tutorial_dataset_2/raw/tutorial_dataset_2_notes.txt indicates how to fill in the unknown trait fields for this study, but see below as well.

| column in dataset | trait concept | unit_in | entity_type | value_type | basis_of_value | replicates |
| --- | --- | --- | --- | --- | --- | --- |
| TRAIT Growth Form CATEGORICAL EP epiphyte (mistletoe) F fern G grass H herb S shrub T tree V vine | plant_growth_form | .na | species | mode | expert_score | .na |
| TRAIT SLA UNITS mm2/g | leaf_mass_per_area | mm2/g | population | mean | measurement | 5 |
| TRAIT Leaf Size UNITS mm2 | leaf_area | mm2 | population | mean | measurement | 5 |
| TRAIT Leaf Dry Mass UNITS g | leaf_dry_mass | g | population | mean | measurement | 5 |

Some notes:

The trait_name must match a trait concept within the traits dictionary.
The second trait in this dataset is documented as specific leaf area, the inverse of the trait concept leaf mass per area. The unit conversions algorithm inverts data read in as specific leaf area, converting it to leaf mass per area.
Categorical traits do not have units or replicates, so these fields become .na.
The traits.build convention for a categorical trait is value_type: mode, indicating the recorded value is the most commonly observed trait value. In some datasets there may be multiple space-delimited values within a single cell in the data.csv file, indicating there are multiple commonly observed categorical trait values.
For most observations of categorical traits, the traits.build convention is that the basis_of_value is determined by an expert examining an individual, population or species, and is therefore an expert_score.
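The inversion of specific leaf area into leaf mass per area, mentioned in the notes above, is simple arithmetic. The sketch below shows only that arithmetic and is not the traits.build unit conversion code; sla_to_lma() is a hypothetical name and the input values are illustrative.

```r
# Convert specific leaf area (mm2/g) to leaf mass per area (g/m2).
# 1 m2 = 1e6 mm2, so LMA (g/m2) = 1e6 / SLA (mm2/g).
sla_to_lma <- function(sla_mm2_per_g) {
  1e6 / sla_mm2_per_g
}

# Illustrative SLA values in mm2/g
round(sla_to_lma(c(2900, 11200)), 2)
# 344.83 and 89.29 g/m2
```

Note that a higher specific leaf area gives a lower leaf mass per area, so the ranking of species reverses after conversion.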

Additional steps

Once you are well-versed in adding datasets to a traits.build database you will know that there is additional information required in metadata.yml.

However, for this tutorial, let’s begin by assuming we’re finished adding dataset metadata and check for errors:

dataset_test("tutorial_dataset_2")

*Users, please note: not all of these test failures or messages are currently in place. Adding tests to document all failures is still a work in progress.

Three items will fail:

There are unknown trait values for plant_growth_form (error doesn’t yet exist)
There are values out of range for leaf_dry_mass (error doesn’t yet exist)
The dataset cannot pivot between long and wide formats. (error doesn’t yet exist)

As indicated in the output messages, there is a [troubleshooting vignette](https://github…/vignettes/) to help solve these errors.

For this tutorial however, keep reading…

There are several ways to proceed, but for these errors, it is useful to next build the dataset:

build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")

If you look at the excluded_data table, you’ll find any data for this dataset that could not be mapped to known traits or known trait values, or that fell outside allowable ranges.

traits.build_database$excluded_data %>%
  filter(dataset_id == "tutorial_dataset_2") %>%
  View()

This code displays a table with 190 rows of excluded data.

187 instances of Unsupported trait value for the trait plant_growth_form
3 instances of Value out of allowable range for the trait leaf_dry_mass

Looking through the output you’ll notice that the Unsupported trait value error exists because the data.csv file used plant growth form values that are different to those in the trait dictionary.

The values that triggered the Value out of allowable range error are all 0’s, a disallowed leaf_dry_mass value; the trait dictionary specifies that leaf_dry_mass can range from 0.01 - 15000.0 mg.
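To make the range check concrete, here is a minimal base R sketch. It assumes the values are converted from the input unit (g) to the dictionary unit (mg) before being compared with the allowed range; that ordering is an assumption made for this illustration, and the leaf dry mass values are invented.

```r
# Allowed range for leaf_dry_mass quoted above, in mg
allowed_min_mg <- 0.01
allowed_max_mg <- 15000

# Invented leaf dry mass values in grams; 0 mimics a placeholder for missing data
leaf_dry_mass_g <- c(0.007, 0, 0.012)

# Convert to mg, then flag values that fall inside the allowed range
leaf_dry_mass_mg <- leaf_dry_mass_g * 1000
in_range <- leaf_dry_mass_mg >= allowed_min_mg & leaf_dry_mass_mg <= allowed_max_mg
in_range
# TRUE FALSE TRUE
```

Only the 0 falls below the minimum, which is the pattern seen in the excluded_data table for this dataset.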

Adding trait value substitutions

For categorical traits, only trait values that are indicated in the trait dictionary are recognised. This is an important harmonisation step, as it ensures the same trait concept value is mapped to the same trait value throughout the database.

However, researchers use countless synonyms, abbreviations and syntax to express an identical trait value. traits.build converts all input to lowercase, but all other substitutions must be specified in the dataset’s metadata.yml file.

For this example, one- or two-letter codes were used to express 7 plant growth forms: EP, F, G, H, S, T, V

Looking at the definition for plant_growth_form in the trait dictionary and the helpful column header provided by the contributor, you can deduce that t is for tree; s is for shrub, etc.

There are two ways to add substitutions into the metadata.yml file.

Map in individual trait value substitutions using metadata_add_substitution:

metadata_add_substitution(dataset_id = "tutorial_dataset_2", 
        trait_name = "plant_growth_form", find = "t", replace = "tree")

Look at the metadata.yml file and you’ll note that a substitution has been added, indicating that each t is replaced with tree.

You would repeat this step for the remaining unknown trait values.

Map in a table of substitutions using metadata_add_substitutions_list:

If there are quite a few trait values that require replacements, it is easier to first create a table of the values that need substituting, then add a column of replacement values in either R or Excel.

table <-
  traits.build_database$excluded_data %>%
  filter(
    dataset_id == "tutorial_dataset_2" &
      error == "Unsupported trait value"
  ) %>%
  distinct(trait_name, value) %>%
  rename(find = value)

Next, view your table to check the order of trait values, and check the allowed values and definitions in the trait dictionary to ensure you replace each abbreviation with an accepted value. Note that epiphyte is not an allowed value for plant_growth_form in the trait dictionary, because AusTraits uses a narrow definition of plant_growth_form and has a separate trait, plant_growth_substrate, which includes the trait value epiphyte. For now, we’ll simply ignore this data.

Add a column with substitutions, then add the substitutions to the metadata file:

table <- table %>%
  mutate(replace = c("shrub", "tree", "herb", NA, 
                     "graminoid", "fern", "climber_herbaceous"))

## an alternative is 
## `mutate(replace = c("shrub", "tree", "herb", "epiphyte", "graminoid",  
##          "fern", "climber_herbaceous"))` 
## which will result in the epiphyte observations remaining in the excluded data table

metadata_add_substitutions_list("tutorial_dataset_2", table)

All required substitutions have been added to the metadata.yml file.
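Because the replace column above is filled by position, it only works if the rows of the table happen to be in the expected order. One way to avoid depending on row order is to build the replacements from a named lookup. This is a sketch in base R, assuming the find values are the lowercase codes (as in the find = "t" example); the toy substitutions table stands in for the one built from excluded_data.

```r
# Named lookup from code to accepted trait value; NA for epiphyte, which is
# not an allowed plant_growth_form value and is ignored for now
lookup <- c(
  ep = NA, f = "fern", g = "graminoid", h = "herb",
  s = "shrub", t = "tree", v = "climber_herbaceous"
)

# Toy stand-in for the table of distinct unsupported values
substitutions <- data.frame(
  trait_name = "plant_growth_form",
  find = c("s", "t", "h", "ep", "g", "f", "v"),
  stringsAsFactors = FALSE
)

# Match by name, so the row order of `substitutions` no longer matters
substitutions$replace <- unname(lookup[tolower(substitutions$find)])
substitutions
```

Any code that is missing from the lookup also comes back as NA, so it is worth checking the result for unexpected NA values before adding the table to the metadata file.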

If you were to rerun dataset_test("tutorial_dataset_2") the error referring to Unsupported trait values would now have vanished.

Replacing “placeholder characters” with NA’s

The Value out of allowable range error is triggered for numeric traits when the traits.build pipeline detects values that fall outside the range specified for the trait in the trait dictionary.

There are three common situations that lead to this error:

1. Values are truly out of range, possibly due to human error or because a plant really was not performing as expected (e.g. not photosynthesising).
2. Values appear to be out of range because of a “unit conversion issue” - that is, the dataset curator or the dataset contributor got the units wrong. This is fixed by working out the correct units and adjusting this in the traits section of the metadata.yml file.
3. A dataset contributor has used a “dummy symbol” to indicate missing data, such as 0, x, missing, etc. Truly missing data should be a blank cell - i.e. NA.

This study is likely an example of (3), where 0 is a placeholder symbol. While the 0’s can be left in the excluded_data table, cluttering the excluded_data table with extraneous measurements makes it difficult to scan for true examples of Value out of allowable range errors in the future. (Values that are NA are, by default, omitted from the excluded_data table.)

Adding custom R code

Instead, you can add R code within the metadata file to replace the placeholder symbol with NA.

Look at metadata.yml in Visual Studio Code.
The second field under the dataset section is custom_R_code: .na.
You can write any code you’d like within this section.
The file R/custom_R_code.R that you sourced at the beginning of this tutorial contains customised functions commonly used in custom_R_code.

For this example, replace:

custom_R_code: .na

with:

custom_R_code: '
  data %>%
    mutate(
      across(c("TRAIT Leaf Dry Mass UNITS g"), ~na_if(.x,0))
    )
'

This code replaces all 0’s in the column with NA’s.
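To see the effect of this step in isolation, the same mutate() call can be run on a small data frame. The sketch below uses dplyr and a toy stand-in for data.csv with invented values; only the column name is taken from the dataset.

```r
library(dplyr)

# Toy stand-in for data.csv with invented values; 0 plays the placeholder role
data <- data.frame(
  name_original = c("Species a", "Species b", "Species c"),
  `TRAIT Leaf Dry Mass UNITS g` = c(0.007, 0, 0.012),
  check.names = FALSE
)

# Same transformation as in custom_R_code: turn 0 into NA in one column
cleaned <- data %>%
  mutate(
    across(c("TRAIT Leaf Dry Mass UNITS g"), ~na_if(.x, 0))
  )

cleaned[["TRAIT Leaf Dry Mass UNITS g"]]
# 0.007, NA, 0.012
```

Other columns are left untouched, so a genuine 0 in a different trait column would not be affected.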

You can confirm that the custom R code has made the anticipated change with the function metadata_check_custom_R_code. This function reads in the data.csv file, then applies any manipulations from the custom_R_code:

metadata_check_custom_R_code("tutorial_dataset_2") %>% View()

Note:

use the format with the single quotes; this allows you to manually add line breaks, not otherwise permitted in the metadata.yml format.
you pipe in data to begin with, but do not need to assign your code back to data; that occurs automatically.

Build the database again, then check the excluded_data table to confirm there are no longer any excluded measurements:

source("build.R")

traits.build_database$excluded_data %>%
  filter(dataset_id == "tutorial_dataset_2") %>%
  View()

Replacing duplicate values with NA’s

Run the tests again to confirm the errors related to disallowed trait values and values out of range have vanished:

dataset_test("tutorial_dataset_2")

However, there should still be an error indicating that the dataset cannot pivot between long and wide formats.

The ability to pivot is important for two reasons:

1. Database users may prefer to display data in wide format to readily compare the values of multiple traits collected on the same individual (or population or species).
2. The pivot test groups together 15 variables that are meant to uniquely identify each row of data (dataset_id, trait_name, observation_id, source_id, taxon_name, entity_type, life_stage, basis_of_record, value_type, population_id, individual_id, temporal_id, method_id, entity_context_id, original_name). An inability to pivot indicates either:
   - A variable present in the data.csv file to distinguish between unique observations has not been mapped into the metadata file. (Most likely a context property, column with locations, individual_id or source_id.)
   - Duplicate values exist within the data.csv file and have been read in multiple times.

The error in this dataset is a common one:

The three numeric traits (leaf_area, leaf_mass_per_area and leaf_dry_mass) are all population-level measurements, while plant_growth_form is mapped in as having entity_type: species, meaning it is considered a species-level measurement. This means that if the same species occurs at multiple sites, its growth form value is read in once per site. However, because it is designated as entity_type: species, the traits.build workflow does not connect the value to a location, since the species has the same growth form regardless of location. The repeated rows are therefore indistinguishable from one another.
Two options are to:
1. Recategorise plant_growth_form as having entity_type: population.
2. Only read in a single instance of plant_growth_form per species.

Either allows the dataset to pivot, but if the taxon truly displays only a single growth form across all populations, it is much better to read in plant growth form once per species. Otherwise the database becomes longer without capturing additional information. This dataset has only a few instances of duplication, but imagine a dataset that has 500 rows of data for the same tree species - suddenly 500 instances of that species being a tree are read into the database.
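The duplication problem can be reproduced on a toy long-format table. The sketch below is base R with invented values, and it uses only four identifying columns rather than the full set checked by the pivot test, so it illustrates the idea rather than the test itself.

```r
# Toy long-format rows with invented values. The species-level growth form
# has no location attached, so its two rows share identical identifiers.
toy_traits <- data.frame(
  taxon_name  = c("Species a", "Species a", "Species a", "Species a"),
  trait_name  = c("leaf_area", "leaf_area", "plant_growth_form", "plant_growth_form"),
  entity_type = c("population", "population", "species", "species"),
  location_id = c("01", "02", NA, NA),
  value       = c("18.8", "75.9", "herb", "herb"),
  stringsAsFactors = FALSE
)

# Simplified key: a row is a duplicate if all identifying columns repeat
key_columns <- c("taxon_name", "trait_name", "entity_type", "location_id")
is_duplicate <- duplicated(toy_traits[, key_columns])
toy_traits[is_duplicate, ]
```

The two leaf_area rows stay distinct because they carry different location_id values, while the second plant_growth_form row is flagged; a wide table cannot hold two values in the same cell, which is why the pivot fails.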

The solution is to modify the existing custom_R_code, adding one of the customised functions from the file R/custom_R_code.R, replace_duplicates_with_NA:

custom_R_code: '
  data %>%
    mutate(
      across(c("TRAIT Leaf Dry Mass UNITS g"), ~na_if(.x,0))
    ) %>%
    group_by(name_original) %>%
    mutate(
      across(c("TRAIT Growth Form CATEGORICAL EP epiphyte (mistletoe) F fern G grass H herb S shrub T tree V vine"), 
              replace_duplicates_with_NA)
    ) %>%
    ungroup()
'
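To illustrate what the grouped step is meant to achieve, here is a sketch on toy data using dplyr. The helper keep_first_only() is a hypothetical stand-in written for this example; it is not the replace_duplicates_with_NA function from R/custom_R_code.R, whose implementation may differ. The column name growth_form and all values are invented.

```r
library(dplyr)

# Hypothetical stand-in, NOT the function shipped in R/custom_R_code.R:
# keeps the first occurrence of each value and sets later repeats to NA.
keep_first_only <- function(x) replace(x, duplicated(x), NA)

# Toy data: "Species a" was recorded at two sites with the same growth form
toy_data <- data.frame(
  name_original = c("Species a", "Species a", "Species b"),
  growth_form = c("H", "H", "T"),
  stringsAsFactors = FALSE
)

# Within each species, keep a single instance of the growth form
deduplicated <- toy_data %>%
  group_by(name_original) %>%
  mutate(across(c("growth_form"), keep_first_only)) %>%
  ungroup()

deduplicated$growth_form
# "H", NA, "T"
```

Grouping by name_original matters here: without it, a growth form shared by two different species would also be blanked for the second species.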

Rerun the tests and everything should now pass:

dataset_test("tutorial_dataset_2")

Then rebuild the database and look at the output in the traits table for one of the taxa that previously had duplicate plant_growth_form entries:

source("build.R")

traits.build_database$traits %>%
  filter(dataset_id == "tutorial_dataset_2") %>%
  filter(taxon_name == "Actinotus minor") %>% View()

  dataset_id     taxon_name      observation_id trait_name         value            unit  entity_type location_id
  <chr>          <chr>           <chr>          <chr>              <chr>            <chr> <chr>       <chr>
1 tutorial_dataset_2 Actinotus minor 010            leaf_area          18.8             mm2   population  02
2 tutorial_dataset_2 Actinotus minor 010            leaf_dry_mass      7                mg    population  02
3 tutorial_dataset_2 Actinotus minor 010            leaf_mass_per_area 344.827586206897 g/m2  population  02
4 tutorial_dataset_2 Actinotus minor 011            leaf_area          75.9             mm2   population  03
5 tutorial_dataset_2 Actinotus minor 011            leaf_dry_mass      7                mg    population  03
6 tutorial_dataset_2 Actinotus minor 011            leaf_mass_per_area 89.2857142857143 g/m2  population  03
7 tutorial_dataset_2 Actinotus minor 012            plant_growth_form  herb             NA    species     NA

The measurements for the three numeric traits from a single location share a common observation_id, as they are all part of an observation of a common entity (a specific population of Actinotus minor), at a single location, at a single point in time. However, the row with the plant growth form measurement has a separate observation_id, reflecting that this is an observation of a different entity (the taxon Actinotus minor).

Build dataset report

As a final step, build a report for the study:

traits.build_database$build_info$version <- "5.0.0"  # a fix because the function was built around specific AusTraits versions
dataset_report("tutorial_dataset_2", traits.build_database, overwrite = TRUE)