19 Tutorial 3: Adding contexts and complex units

19.1 Overview

This is the third of five tutorials on adding datasets to your traits.build database.

Before you begin this tutorial, ensure you have installed traits.build, cloned the traits.build-template repository, and successfully built a database from the datasets in traits.build-template. Instructions are available at Tutorial: Example compilation.

Goals

Learn how to add contexts.
Learn some complexities with respect to units.
Learn additional custom_R_code tricks.

New functions introduced

metadata_add_contexts

19.2 Adding tutorial_dataset_3

Ensure the dataset folder contains the correct data files

In the traits.build-template repository, there is a folder titled tutorial_dataset_3 within the data folder.

Ensure that this folder exists on your computer.
The file data.csv exists within the tutorial_dataset_3 folder.
There is a folder raw nested within the tutorial_dataset_3 folder that contains one file, notes.txt.
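These checks can also be scripted. The following is a minimal sketch in base R; the helper name check_dataset_files is hypothetical and not part of traits.build, and it only confirms that the raw folder is non-empty rather than checking for a specific notes file name.

```r
# Hypothetical helper (not part of traits.build): checks the expected layout
# of a dataset folder and returns a named logical vector.
check_dataset_files <- function(dataset_dir) {
  raw_dir <- file.path(dataset_dir, "raw")
  c(
    folder_exists = dir.exists(dataset_dir),
    data_csv_exists = file.exists(file.path(dataset_dir, "data.csv")),
    raw_has_files = dir.exists(raw_dir) && length(list.files(raw_dir)) > 0
  )
}

# Run from the root of your clone of traits.build-template
check_dataset_files(file.path("data", "tutorial_dataset_3"))
```

Any FALSE entry in the result points to the part of the folder layout that needs attention.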

Source necessary functions

If you have restarted RStudio since last adding a dataset, ensure all functions are loaded from both the traits.build package and the custom functions file:

library(traits.build)
source("R/custom_R_code.R")

Use functions to create a metadata.yml file

Create a metadata template

To create the metadata template, run:

metadata_create_template("tutorial_dataset_3")

As in the previous tutorials, this function leads you through a series of menus requiring user input. Ensure you select:

data format: wide
taxon_name column: 1: Species
location_name column: 5: site
individual_id column: 1: NA
collection_date column: 1: NA
Enter collection_date range in format ‘2007/2009’: 2011-02/2011-03
Do all traits need repeat_measurements_id’s? 2: No

In this dataset, unlike the first two, the data being input are at the individual level. Since there is only a single data row for each individual, it is not required to map in an individual_id. A column with an individual_id is required only if you want to keep track of multiple rows of data for the same individual.
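One way to confirm that each row represents a distinct individual is to check for duplicated combinations of the identifying columns. The sketch below uses a small made-up data frame with column names from data.csv; the species names and values are invented, and treating Species, Treatment and Replicate as jointly identifying an individual is an assumption.

```r
# Toy data; values are invented for illustration, not taken from data.csv
toy <- data.frame(
  Species = c("Species A", "Species A", "Species B", "Species B"),
  Treatment = c("Drought", "Watered", "Drought", "Watered"),
  Replicate = c(1, 1, 1, 1)
)

# Assumption: these three columns together identify one individual
id_cols <- c("Species", "Treatment", "Replicate")

# anyDuplicated() returns 0 when no combination occurs more than once
n_dup <- anyDuplicated(toy[, id_cols])
if (n_dup == 0) {
  message("Each row is a distinct individual; no individual_id is needed.")
} else {
  message("Some individuals span several rows; map in an individual_id.")
}
```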

Navigate to the dataset’s folder and open the metadata.yml file in Visual Studio Code, to ensure information is added to the expected sections as you work through the tutorial.

Propagate source information into the metadata.yml file

This dataset is from a published source and therefore the source information can be added with the function metadata_add_source_doi:

metadata_add_source_doi(dataset_id = "tutorial_dataset_3", 
                        doi = "10.1007/s11104-013-1725-x")

Confirm:

the authors’ names are formatted as first name last name or first initial last name
the article title is in sentence case
the page numbers are filled in as a range, separated by a double dash

You have now added three DOIs, and all three returned complete reference information. Most references are added correctly, but some journals, and the DOIs of many older references, return titles in ALL CAPS or omit page numbers, so it is worth checking.
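Two of these checks are easy to automate. The sketch below is a hypothetical base R helper, not part of traits.build, and the titles and page values passed to it are invented for illustration.

```r
# Hypothetical helper (not part of traits.build): flags two common problems
# in the reference fields returned for a DOI.
check_reference <- function(title, pages) {
  c(
    # TRUE when the title contains letters and none of them are lower case
    title_all_caps = grepl("[A-Za-z]", title) && identical(title, toupper(title)),
    # TRUE when pages are missing or not of the form "123--130"
    pages_not_a_range = is.na(pages) || !grepl("^[0-9]+--[0-9]+$", pages)
  )
}

# Invented examples: the first is flagged on both checks, the second on neither
check_reference("AN INVENTED TITLE IN CAPITALS", NA)
check_reference("An invented title in sentence case", "101--110")
```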

Add location details

All data for this dataset was collected at a single location, specified in the data.csv file as The University of Melbourne Burnley campus. No additional details are provided. For such studies, it is best to look up the campus location and input approximate latitude/longitude coordinates.

As well as adding locations and location properties from a table, the function metadata_add_locations lets you add a basic location data scaffold in metadata.yml.

For instance, for this study:

you add the location names from the data.csv file
the function automatically adds blank fields for latitude, longitude, and description
values for these fields must then be filled in manually

data <- readr::read_csv("data/tutorial_dataset_3/data.csv")

metadata_add_locations("tutorial_dataset_3", data)

You select the location name, but not any location properties, as none are provided in the data.csv file or another tabular format.

location_name: 4: site
location_property columns: just press enter

This creates the following scaffold in metadata.yml:

  The University of Melbourne Burnley campus:
    latitude (deg): na_character
    longitude (deg): na_character
    description: na_character

metadata_add_locations automatically selects the unique values in the location name column.
If no columns with location properties are specified, the function just adds the three core location properties.
The values for these location properties are available in the notes file.
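The first two points can be illustrated in base R. This is a sketch of the idea only, not the internals of metadata_add_locations; the toy data frame stands in for data.csv and its species names are invented.

```r
# Toy stand-in for data.csv: every row shares the same site value
toy <- data.frame(
  Species = c("Species A", "Species B", "Species C"),
  site = rep("The University of Melbourne Burnley campus", 3)
)

# One scaffold entry is needed per unique location name
location_names <- unique(toy$site)

# Each location starts with the three core properties, all unfilled
scaffold <- lapply(location_names, function(x) {
  list(
    `latitude (deg)` = NA_character_,
    `longitude (deg)` = NA_character_,
    description = NA_character_
  )
})
names(scaffold) <- location_names
str(scaffold)
```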

Add traits

To select columns in the data.csv file that include trait data, run:

metadata_add_traits(dataset_id = "tutorial_dataset_3")

Select columns 5 6 7 8 9 10, as these contain trait data.

Add contexts

A context is any piece of ancillary information that helps explain why a certain trait value was measured.

In traits.build, some contexts are mapped in as part of the default metadata structure, including the location (and location properties), a general sense of organism age (life_stage), basis_of_record, and the general methods for each trait.

However, most contexts are pieces of information that are essential to record for a specific dataset, but not recorded for most other datasets. The context field therefore allows any context property to be added manually.

Context properties are divided into 5 categories:

method contexts: Context properties that capture differences in method between measurements of the same trait. For plants, canopy position and leaf age are two common method contexts.
temporal contexts: Context properties that capture explicit time-related differences between groups of measurements. This is separate from collection_date, as an explicit meaning should accompany each temporal context property and the distinct values may span a range of collection dates. For plants, sampling season (dry versus wet) is a commonly mapped temporal context.
entity contexts: This context property category pertains to individual-level measurements, and documents features of the individual that explicitly distinguish it from other individuals that are measured. In addition to features like the sex of an individual, this is the place to document individual-level covariates that are not themselves traits, but are required to interpret other trait values.
treatment contexts: Any experimental treatment that has been applied to groups of individuals.
plot contexts: Any variation within a documented location, where different individuals experience known differences in growing/living conditions or growing/living history. For plants, this context category is frequently used to map in slope position or fire history.

Context properties are most frequently included in the data.csv file as columns of values. Occasionally, separate columns of trait values might represent measurements with different context property values, a topic for a later tutorial.

Context properties that are columns in the data file can be added with the function metadata_add_contexts:

metadata_add_contexts("tutorial_dataset_3")

This leads to a user prompt to select the relevant columns:

Indicate all columns that contain additional contextual data for tutorial_dataset_3 (by number separated by space; e.g. ‘1 2 4’):

1: Species
2: Treatment
3: Replicate
4: site
5: life_form
6: WP leaf (Mpa) predawn
7: WP leaf (Mpa) midday
8: LMA kg/m2
9: Stomatal density Upper surface
10: Stomatal density Lower surface

Select column 2 which is the only column with a context property:

Selection: 2

Additional user prompts ask for details about the context property category and values:

What category does context Treatment fit in? (by number separated by space; e.g. ‘1 2 4’):

1: treatment_context
2: plot_context
3: temporal_context
4: method_context
5: entity_context

This is a treatment context, so select 1:

Selection: 1

The following values exist for this context: Drought, Watered.

Are replacement values required? (y/n) y

Although the context values Drought and Watered are probably sufficiently descriptive, other drought-treatment studies use drought and well-watered, so we prefer to align the context property values with these.

Are descriptions required? (y/n) y

The free-form description field lets you add details about the exact meaning of drought vs well-watered for this study.

In the metadata.yml file, there will now be a scaffold for the contexts:

contexts:
- context_property: unknown
  category: treatment
  var_in: Treatment
  values:
  - find: Drought
    value: unknown
    description: unknown
  - find: Watered
    value: unknown
    description: unknown

In addition to filling in the preferred context property values and descriptions, you must also assign a name to the context_property. This is a free-form field, but as with location_property it is best to align context_property names throughout the database. In the AusTraits plant trait database, this context_property is always called drought treatment.

The finished context section will be:

contexts:
- context_property: drought treatment
  category: treatment
  var_in: Treatment
  values:
  - find: Drought
    value: drought
    description: The plants were watered with 20% of the water used by well-watered
      plants (determined gravimetrically) in the 3-4 days preceding each watering
      event.
  - find: Watered
    value: well-watered
    description: The plants were watered to pot capacity (2 L per pot).
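To see what the find and value pairs do, the sketch below applies the same substitution to a toy Treatment column in base R. It illustrates the mapping only and is not how traits.build implements it internally; the order of values in the toy column is invented.

```r
# Toy Treatment column using the two values present in data.csv
treatment <- c("Drought", "Watered", "Watered", "Drought")

# find -> value pairs, as in the contexts section of metadata.yml
replacements <- c(Drought = "drought", Watered = "well-watered")

# Look up each original value by name and return its replacement
drought_treatment <- unname(replacements[treatment])
drought_treatment
```

A named-vector lookup keeps the mapping in one place, which mirrors the way each find entry is paired with a single value in the YAML.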

Manual filling in of metadata

The components of this dataset that can be propagated with functions are now complete, and the remaining unknown fields must be filled in manually:

the contributors section
description, basis_of_record, life_stage, sampling_strategy, original_file, and notes under the dataset section
details for each trait, including unit_in, trait_name, entity_type, value_type, basis_of_value, replicates and methods

Adding contributors

The file data/tutorial_dataset_3/raw/tutorial_dataset_3_notes.txt indicates the main data_contributor for this study.

Dataset fields

The file data/tutorial_dataset_3/raw/tutorial_dataset_3_notes.txt indicates how to fill in the unknown dataset fields for this study.

Trait details

The file data/tutorial_dataset_3/raw/tutorial_dataset_3_notes.txt indicates how to fill in the unknown trait fields for this study, but see below as well.

Remember, the trait_name must match a trait concept within the traits dictionary. For this example:

column in dataset	trait concept	unit_in	entity_type	value_type	basis_of_value	replicates
life_form	life_form	.na	species	mode	expert_score	.na
WP leaf (Mpa) predawn	water_potential_predawn	neg_MPa	individual	raw	measurement	1
WP leaf (Mpa) midday	water_potential_midday	neg_MPa	individual	raw	measurement	1
LMA kg/m2	leaf_mass_per_area	kg/m2	individual	raw	measurement	1
Stomatal density Upper surface	leaf_stomatal_density_adaxial	'{count}/mm2'	individual	raw	measurement	1
Stomatal density Lower surface	leaf_stomatal_density_abaxial	'{count}/mm2'	individual	raw	measurement	1

With the units, note:

In the data.csv file, all water potential values are positive, indicating the data contributor recorded the negative of the true water potential values (which are always below zero). A negative sign at the beginning of the units field is not recognised, so the convention is to use the prefix neg_ to indicate that the values input are the negative of the true values.
Stomatal density is a “count density”, a number of stomata per unit area. The actual UCUM standard for this is simply 1/mm2, but for clarity we use {count}/mm2. The word count is in curly brackets because it is a “note” rather than a true unit.
If a unit begins with a curly bracket, it must be placed in single quotes.
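As a small worked example of the neg_ convention, the sketch below shows how values recorded as positive numbers relate to the true water potentials. The numbers are invented for illustration, and the conversion is written out by hand rather than taken from traits.build.

```r
# Invented example values, recorded as positive numbers (unit_in: neg_MPa)
recorded <- c(0.45, 1.20, 2.75)

# The true water potentials are the negatives of the recorded values (MPa)
true_mpa <- -1 * recorded
true_mpa

# Sanity check: true water potentials should never be above zero
all(true_mpa <= 0)
```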

Testing, error fixes, and report building

At this point, run the dataset tests and rebuild the dataset:

dataset_test("tutorial_dataset_3")

build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")

The dataset test should yield an error that one water_potential_predawn value does not convert to numeric, indicating that a placeholder character is being used in place of NA. (Note: this error was not triggering when this vignette was written.)
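If you want to locate such placeholder characters yourself, one base R approach is to list the values that are present but fail to convert to numeric. The column values below are invented for illustration.

```r
# Invented stand-in for a trait column that was read in as character
wp_predawn <- c("0.45", "1.20", "*", NA, "0.80")

# as.numeric() returns NA (with a warning) for anything that is not a number
converted <- suppressWarnings(as.numeric(wp_predawn))

# Values that were present but could not be converted are the placeholders
placeholders <- unique(wp_predawn[!is.na(wp_predawn) & is.na(converted)])
placeholders
```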

The excluded_data table shows that there is a “*” in one column, so add the following to metadata.yml:

  custom_R_code: '
    data %>%
      mutate(
        across(c("WP leaf (Mpa) predawn"), ~na_if(.x,"*"))
      )
  '

However, you will now get the error: Caused by error in na_if(): ! Can’t convert y to match type of x.

This indicates a mismatch between the type of the column and the type of the value “*”, so you must first convert the column to character:

  custom_R_code: '
    data %>%
      mutate(
        across(c("WP leaf (Mpa) predawn"), ~as.character(.x)),
        across(c("WP leaf (Mpa) predawn"), ~na_if(.x,"*"))
      )
  '
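The type mismatch can be reproduced outside the build pipeline. The sketch below assumes dplyr 1.1.0 or later, where na_if() requires y to be convertible to the type of x; the vectors are invented for illustration.

```r
library(dplyr)

# Invented numeric vector standing in for a column read in as numeric
x_num <- c(0.45, 1.20, 0.80)

# "*" is character and x_num is numeric, so (assuming dplyr >= 1.1.0)
# na_if() raises an error; tryCatch() captures it so the script continues.
result <- tryCatch(na_if(x_num, "*"), error = function(e) "type error")
result

# With a character vector the types match and "*" becomes NA
x_chr <- c("0.45", "*", "0.80")
cleaned <- na_if(as.character(x_chr), "*")
cleaned
```

Calling as.character() on a column that is already character is harmless, which is why the conversion can be applied unconditionally in custom_R_code.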

At this point, rerunning the tests and rebuilding the database should not generate any errors or excluded values, so you can build and review the report.

As a final step, build a report for the study:

traits.build_database$build_info$version <- "5.0.0"  
    # a fix because the function was built around specific AusTraits versions
dataset_report("tutorial_dataset_3", traits.build_database, overwrite = TRUE)