20  Tutorial 4: Additional complexities

20.1 Overview

This is the fourth tutorial on adding datasets to your traits.build database.

Before you begin this tutorial, ensure you have installed traits.build, cloned the traits.build-template repository, and successfully built a database from the example datasets in traits.build-template. Instructions are available at Tutorial: Example compilation.

Goals

New functions introduced

  • none.

20.2 Adding tutorial_dataset_4

This dataset is a subset of data from Togashi_2015 in AusTraits. The data are a compilation of many datasets, each with its own reference information.

For such datasets, there are two options: 1) sourcing the data from the original publication and adding it to the database as its own dataset; 2) entering the dataset as part of a broader compilation. For this study, in AusTraits, a combination of both approaches was used. For original sources that included a broader range of traits and similar (or better) resolution, the original source was used. However, for several studies in this compilation, Togashi had already individually contacted the authors of the source publications, so the data appendix for this publication often had better resolution than the original papers.

This tutorial focuses on adding a single dataset derived from many original studies.

Before you begin creating the metadata file, take a look at the data.csv file; you’ll notice that the columns present are quite different from the datasets added so far. There are columns for latitude and longitude, but no location name column. There are columns for the original source. There are columns for entity_type, basis_of_record, and value_type, three metadata fields that, for the previous tutorial datasets, were entered in the metadata file as fixed values. And the only trait column, log_LA.SA, is in a non-standard format.

With a few small tricks this dataset can also be added seamlessly.

Ensure the dataset folder contains the correct data files

In the traits.build-template repository, there is a folder titled tutorial_dataset_4 within the data folder. 

  • Ensure that this folder exists on your computer. 

  • The file data.csv exists within the tutorial_dataset_4 folder. 

  • There is a folder raw nested within the tutorial_dataset_4 folder that contains one file, tutorial_dataset_4_notes.txt.

Source necessary functions

  • If you have restarted RStudio since last adding a dataset, ensure all functions are loaded from both the traits.build package and the custom functions file:
library(traits.build)
source("R/custom_R_code.R")

Create the metadata.yml file

  • Note that, for this dataset, a number of variables cannot simply be added using the metadata_add_... functions. Unlike in previous tutorials, you’ll now mix the three steps that were previously separated:
  1. Use metadata_add... functions where possible.
  2. Mutate new columns using custom_R_code.
  3. Manually map newly created column names in the appropriate part of the metadata.yml file.

Create a metadata template

To create the metadata template, run:

metadata_create_template("tutorial_dataset_4")

As in the previous tutorials, this function leads you through a series of menus requiring user input. Ensure you select:

data format: wide
taxon_name column: 1: Species
location_name column: 1: NA
individual_id column: 1: NA
collection_date column: 1: NA
Enter collection_date range in format ‘2007/2009’: 1996/2015
Do all traits need repeat_measurements_id’s? 2: No

Notes:

  • As you noted before, there is no location column to map in automatically, so that must be added later. You enter NA for now.

  • Since this is a compilation, there is nowhere in the manuscript that indicates the collection dates for each study. The best estimate is to list the range of publication dates of the sources: 1996–2015.

Navigate to the dataset’s folder and open the metadata.yml file in Visual Studio Code, to ensure information is added to the expected sections as you work through the tutorial.


Propagate source information into the metadata.yml file

Entering the reference information for the manuscript from which the data appendix was sourced is straightforward, as this is a single publication:

metadata_add_source_doi(dataset_id = "tutorial_dataset_4", doi = "10.1002/ece3.1344")

However, you also want to acknowledge the original data sources, those documented in the column References.

read_csv("data/tutorial_dataset_4/data.csv") %>% distinct(References)

# A tibble: 11 × 1
References
   <chr>
 1 Barrett et al. 1996
 2 Benyon et al. 1999
 3 Bleby et al.2009
 4 Brodribb and Felid 2000
 5 Brooksbank et al. 2011
 6 Canham et al. 2009
 7 Carter and White 2009
 8 Cernusak et al. 2006
 9 Choat et al. 2005
10 Drake and Franks 2003
11 Drake et al. 2011

You would now need to find each of these references in the reference section of the manuscript and use Google Scholar (or another reference resource) to look up its doi. To add additional references, you need to add the type argument to the metadata_add_source_doi function:

metadata_add_source_doi(dataset_id = "tutorial_dataset_4", doi = "10.1071/bt9960249", 
                        type = "original_01")

Per the traits.build schema, a secondary reference is when there have been two related publications out of a single dataset, while the term original is used if the data were collected as part of a previous dataset (generally by different authors) and added to traits.build as part of a compilation.

When you add Benyon_1999 (doi - “10.1016/s0378-3774(98)00080-8”), you’ll need to change the type field to original_02, etc.
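Because the type values follow a simple numbered pattern, the repeated calls can be generated rather than typed out. The following is a minimal sketch, not part of the traits.build workflow itself: the vector original_dois and the helper add_original_sources are hypothetical names, and only the two dois quoted above are included. It assumes metadata_add_source_doi accepts the dataset_id, doi, and type arguments exactly as they are used above.

```r
# DOIs of the original sources, in the order they should be numbered.
# Only the two DOIs quoted in this tutorial are listed; extend as needed.
original_dois <- c(
  "10.1071/bt9960249",              # first original source (original_01)
  "10.1016/s0378-3774(98)00080-8"   # Benyon_1999 (original_02)
)

# Build the matching type labels: original_01, original_02, ...
source_types <- sprintf("original_%02d", seq_along(original_dois))
print(source_types)

# Hypothetical helper: adds each original source in turn.
# Calling it requires traits.build and an internet connection.
add_original_sources <- function(dataset_id, dois, types) {
  for (i in seq_along(dois)) {
    traits.build::metadata_add_source_doi(
      dataset_id = dataset_id, doi = dois[i], type = types[i]
    )
  }
}

# Usage (not run here):
# add_original_sources("tutorial_dataset_4", original_dois, source_types)
```

The zero-padded labels keep the sources in the intended order once there are ten or more of them.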

Note that AusTraits will build fine without adding all 11 original sources; it is up to you how many you want to add as a test.


Map source_id into metadata file

  • source_id is a metadata field that is used relatively infrequently. In the AusTraits trait database only 1 in 20 studies require the mapping of a source_id, and therefore, by default, the field is not added to the metadata template.

  • Instead, you have to manually add it to the file’s dataset section, generally directly below location_name.

  • Simply add a line source_id: source_id. Make sure the indents line up with the fields above/below.

  • When the AusTraits team initially added this dataset, the curator manually added the source_id column. Otherwise such a column could be mutated from the reference column using custom_R_code or added manually in Excel.
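If the source_id column did not already exist, one way to derive it from References is sketched below in base R, using a few of the reference strings listed earlier. The helper name make_source_id and the Author_year format are assumptions made for illustration; the source_id values actually present in data.csv may follow a different convention.

```r
# A few reference strings as they appear in the References column
references <- c(
  "Barrett et al. 1996",
  "Bleby et al.2009",          # note the missing space before the year
  "Brodribb and Felid 2000"
)

# Hypothetical helper: first author surname + "_" + four-digit year
make_source_id <- function(reference) {
  first_author <- sub("^([A-Za-z]+).*$", "\\1", reference)
  year <- sub("^.*([0-9]{4})[[:space:]]*$", "\\1", reference)
  paste(first_author, year, sep = "_")
}

source_id <- make_source_id(references)
print(source_id)
```

The same two sub() expressions could be placed inside a mutate() call in custom_R_code. Surnames containing hyphens, apostrophes, or spaces would need a more careful pattern.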


Add location details

Your protocol for adding locations diverges from the past 3 tutorials, because in this dataset you don’t yet have location names. The data file instead includes the latitude and longitude of each site, information you can use to “create” location names; the actual name doesn’t matter, just that each latitude/longitude combination has a unique location name.

Although you could edit the data.csv file directly (and sometimes we do), you could alternatively create a location_name column through custom_R_code:

  custom_R_code: '
    data %>%
      mutate(
        location_name = paste0("lat_",Lat,"_long_",Long)
      )
'

In this case you’re using custom_R_code to generate many unique location names as a column in the data table as it is first read into the R workflow. 

Note: While in this example you are generating many unique location names, there are many datasets where all data were collected at a single location, and the submitted dataset therefore doesn’t include a location_name column. For those, you simply add code like mutate(location_name = "Broken Hill") into custom_R_code.
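To see what the paste0() expression produces, here is a small sketch on made-up coordinates (a toy stand-in for data.csv, not the real values), written in base R so it runs without the dataset:

```r
# Toy stand-in for data.csv: three rows, two distinct coordinate pairs
data <- data.frame(
  Lat  = c(-33.5, -33.5, -12.4),
  Long = c(150.1, 150.1, 131.0)
)

# Same expression as in custom_R_code, written in base R
data$location_name <- paste0("lat_", data$Lat, "_long_", data$Long)
print(data)

# Rows sharing coordinates share a name; distinct coordinates get distinct names
n_coordinate_pairs <- nrow(unique(data[, c("Lat", "Long")]))
n_location_names <- length(unique(data$location_name))
```

If the two counts ever differ, two different coordinate pairs have collapsed to the same name (or the reverse), and the naming expression needs adjusting.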

You next want to create a table of location names and location properties (i.e., latitude and longitude):

location_table <-
  metadata_check_custom_R_code("tutorial_dataset_4") %>%
  select(location_name, Lat, Long) %>%
  rename(`latitude (deg)` = Lat, `longitude (deg)` = Long) %>%
  distinct()

metadata_add_locations("tutorial_dataset_4", location_table)

Notes:

  • Remember the function metadata_check_custom_R_code reads in the data.csv file, applies custom_R_code manipulations, and outputs the updated data table. This is very useful if you want to check your custom_R_code is performing as expected or if you want to perform further manipulations to the output.

  • There are many other ways to create a table of location names and properties. You could create a standalone table using R (or Excel), but this solution generates no additional files to store.
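As an illustration of the table that gets passed to metadata_add_locations, the sketch below rebuilds it in base R from made-up rows standing in for the output of metadata_check_custom_R_code. The species labels and coordinates are invented; only the column names follow the code above.

```r
# Toy stand-in for the output of metadata_check_custom_R_code()
data <- data.frame(
  Species = c("taxon A", "taxon B", "taxon C"),
  Lat  = c(-33.5, -33.5, -12.4),
  Long = c(150.1, 150.1, 131.0)
)
data$location_name <- paste0("lat_", data$Lat, "_long_", data$Long)

# One row per location, with the property columns renamed as above
location_table <- unique(data[, c("location_name", "Lat", "Long")])
names(location_table) <- c("location_name", "latitude (deg)", "longitude (deg)")
rownames(location_table) <- NULL
print(location_table)

# Each location name should appear exactly once
stopifnot(!anyDuplicated(location_table$location_name))
```

The final check is a cheap safeguard: a repeated location_name would mean one name is attached to more than one set of coordinates.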


Map location name into metadata file

  • Because the column location_name did not exist when you created the metadata template, you selected NA for location_name during the initial user prompts.

  • You now need to manually fill the newly mutated location name column into the metadata file. Under the dataset section you’ll find the field location_name: unknown. This needs to be replaced with location_name: location_name.


Fill in missing dataset-level metadata

  • Manually fill in the information for the fields description, basis_of_record, life_stage and sampling_strategy, copying information from the supplied text file into metadata.yml.

Add traits

There is a single trait for this study, Huber value, which is the sapwood area to leaf area ratio. However, in the data.csv file it is documented as log_LA.SA, so another line of custom R code must be added to back-transform the values:

  custom_R_code: '
    data %>%
      mutate(
        location_name = paste0("lat_",Lat,"_long_",Long),
        LA.SA = 10^(log_LA.SA)
      )
'

You could also take the inverse of LA.SA here to obtain the Huber value, but this can instead be accomplished through unit conversions.
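The arithmetic is easy to check on a few made-up numbers. In this sketch the log_LA.SA values are invented; the back-transformation is the one used in custom_R_code, and the inverse is shown only to illustrate how the leaf area/sapwood area ratio relates to the Huber value:

```r
# Made-up log10(leaf area / sapwood area) values
log_LA.SA <- c(3, 3.5, 4)

# Back-transform, as in custom_R_code: leaf area / sapwood area
LA.SA <- 10^log_LA.SA

# Inverse ratio: sapwood area / leaf area (the Huber value)
huber_value <- 1 / LA.SA

print(data.frame(log_LA.SA, LA.SA, huber_value))
```

A log10 value of 3 therefore corresponds to 1000 units of leaf area per unit of sapwood area, or a Huber value of 0.001.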

You can then run:

metadata_add_traits("tutorial_dataset_4")

Select column 13, your newly created column of trait data.

As with previous datasets, the following section has been added to metadata.yml:

- var_in: LA.SA
  unit_in: unknown
  trait_name: unknown
  entity_type: unknown
  value_type: unknown
  basis_of_value: unknown
  replicates: unknown
  methods: unknown

This trait has several “non-standard” values:

  • unit_in: The units for the input column are leaf area/sapwood area, a dimensionless “area ratio”. Meanwhile, Huber value is reported as sapwood area/leaf area, the inverse dimensionless “area ratio”.

The UCUM standard to which traits.build conforms specifies that “dimensionless” is only accepted for the very few traits that are truly dimensionless, not traits where units top and bottom simply cancel out. You need to specify that it is a ratio of area/area (or mass/mass, count/count, etc.)

Looking in the trait dictionary, you’ll see that the units are specified as: mm2{sapwood}/mm2{leaf}, specifically to be explicit about which area is the denominator vs numerator. You therefore specify that unit_in is mm2{leaf}/mm2{sapwood}.

Since it is dimensionless, you could, of course, specify any area units on top and bottom as long as they are identical, but I know mm2{leaf}/mm2{sapwood} is already in the unit conversions file.

  • entity_type, value_type: Because this study is a compilation of many sources, the entity_type and value_type are not consistent across all measurements. Instead the data curator had to go back to many of the original sources and document which were population-level versus individual-level measurements, and correspondingly which were means vs raw values. For such circumstances (which also can occur within a single study), you can map a column name as the value.

  • replicates: For most of the studies the number of replicates comprising the trait values is unknown.
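Before mapping columns as values, it can help to tabulate what they contain. This is a sketch on made-up rows, assuming the columns are named entity and value (the names mapped in the metadata that follows) and contain only the four values mentioned above; the real data.csv may hold others.

```r
# Made-up rows standing in for the entity and value columns of data.csv
data <- data.frame(
  entity = c("individual", "population", "individual", "population"),
  value  = c("raw", "mean", "raw", "mean")
)

# Cross-tabulate the two columns to see which combinations occur
combinations <- table(data$entity, data$value)
print(combinations)

# Rows that break the pairing described above
# (individual with raw values, population with means)
unexpected <- data[
  (data$entity == "individual" & data$value != "raw") |
    (data$entity == "population" & data$value != "mean"),
]
print(nrow(unexpected))
```

An unexpected combination is not necessarily wrong, but it is worth checking against the original source before building the dataset.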

Taking all this into account, you’re left with:

- var_in: LA.SA
  unit_in: mm2{leaf}/mm2{sapwood}
  trait_name: huber_value
  entity_type: entity
  value_type: value
  basis_of_value: measurement
  replicates: unknown
  methods: Leaf area was measured using a desktop scanner for all leaves on a sampled
    branch. After bark removal, the branch cross-sectional area was measured with
    a digital caliper at two points near the cut (methodology that includes the non-conductive
    part of the sapwood). The pith cross-sectional area was measured and subtracted
    from the branch cross-sectional area. For compiled and contributed data, measurements
    on branches were made as described above, although with a variable number of branches
    sampled per tree. In some studies where the branch diameter was too small (<10
    mm), the pith was considered part of the sapwood. For whole trees, cross-sectional
    sapwood area was measured at 1.3 m above the ground from bored cores or harvesting
    the tree. Leaf area in whole trees was in most cases measured from harvested trees.

Add contexts

If you consider the columns within the data.csv file you’ll note that Sample documents a method context and needs to be added as a context property. In addition, within this dataset, Height (plant height) is not a trait, but a covariate; some datasets documented this, because they thought it might influence the Huber value of the plant. It is therefore a context.

metadata_add_contexts("tutorial_dataset_4")

Following the user prompts select:

5: Sample
6: Height

You are then led sequentially through the user prompts for each of the context properties:

Sample

4: method_context

This context is a method context, because it specifies a difference in methodology that might influence the trait value.

The following values exist for this context: trunk sample branch sample
Are replacement values required? (y/n) n
Are descriptions required? (y/n) y

The answers to the next two questions are up to the dataset curator. At AusTraits we decided that trunk sample and branch sample were sufficiently explicit context property values, but that it would be helpful to add a description.

Therefore, we filled in context_property: wood sample type and added descriptions for the context property values:

- context_property: wood sample type
  category: method_context
  var_in: Sample
  values:
  - value: trunk sample
    description: Measurements taken on the trunk of the tree.
  - value: branch sample
    description: Measurements taken on a branch from the tree.

Height

Plant height is an entity context. It is a feature of the entity (an individual) that might influence the trait value.

5: entity_context

Again, the dataset curator may choose what information to document within the metadata file. For continuous traits, like plant height, the general consensus is that the values are self-explanatory, so you’d select:

The following values exist for this context: 6.7, 4.8, 6.9
Are replacement values required? (y/n) n
Are descriptions required? (y/n) n

Filling in that the context property is tree height (m), the metadata file would simply be:

- context_property: tree height (m)
  category: entity_context
  var_in: Height
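If you want to preview the context columns before answering the prompts, a quick summary is enough. The sketch below uses the example values shown in the prompts above as a toy stand-in for data.csv:

```r
# Toy stand-in for the two context columns, using values shown in the prompts
data <- data.frame(
  Sample = c("trunk sample", "branch sample", "branch sample"),
  Height = c(6.7, 4.8, 6.9)
)

# Categorical context: list the distinct values that may need descriptions
sample_values <- sort(unique(data$Sample))
print(sample_values)

# Continuous context: the range is usually all that needs checking
height_range <- range(data$Height, na.rm = TRUE)
print(height_range)
```

A short list of categorical values suggests descriptions are practical to write, while a continuous column like Height is better left without replacement values or descriptions.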

Testing, error fixes, and report building

At this point, run the dataset tests, rebuild the dataset, and check for excluded data:

dataset_test("tutorial_dataset_4")

build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")

traits.build_database$excluded_data %>% 
  filter(dataset_id == "tutorial_dataset_4") %>%  View()

There should be no errors and no excluded data, so go ahead and build a report for the study:

traits.build_database$build_info$version <- "5.0.0"  
    # a fix because the function was built around specific AusTraits versions
dataset_report("tutorial_dataset_4", traits.build_database, overwrite = TRUE)

Overall, this report isn’t very informative since it is the first Huber value dataset in the new database.

But let me draw your attention to the list of taxa at the bottom. Because the tutorials are, for now, ignoring taxon alignments (traits.build is in a state of flux with regard to this), the tutorials have ignored this section.

However, note the unknown taxon names unk sp. 1 and unk sp. 2. Although the AusTraits database accepts names resolved to genus and family, data collected on a truly unknown taxon is useless and should be excluded. As AusTraits is a terrestrial vascular plant database, we also exclude mosses and lichens that are sometimes included in datasets. The curators for other databases aligned to the traits.build workflow will have their own standards for values to explicitly disallow, based on taxonomy (or some other variable).

Near the bottom of the metadata file is a section for excluding data.

It currently reads:

exclude_observations: .na

Change this to:

exclude_observations:
- variable: taxon_name
  find: unk sp. 1, unk sp. 2
  reason: omitting completely unknown taxa (E Wenk, 2023.09.20)

Notes:

  • Other variable names can also be used here. Perhaps there is a particular context value or location that is known to have problematic data. In AusTraits this field is almost exclusively used to exclude specific taxa, but the metadata section is designed to have broader applications.

If you now rebuild the database, you’ll see that the measurements associated with these two taxa are now in the excluded_data table.
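Conceptually, the exclusion splits the data into rows whose taxon_name matches one of the listed values and rows that do not. The sketch below imitates that split in base R on made-up rows. It is an illustration only, not the traits.build implementation, and it assumes the find field is a comma-separated list whose entries are matched exactly.

```r
# Made-up trait rows; the taxon labels and values are invented
traits <- data.frame(
  taxon_name = c("taxon A", "unk sp. 1", "taxon B", "unk sp. 2"),
  value = c(0.0003, 0.0002, 0.0004, 0.0005)
)

# The find field from metadata.yml, split into individual values
find <- "unk sp. 1, unk sp. 2"
to_exclude <- strsplit(find, ", ", fixed = TRUE)[[1]]

# Partition the rows: matches go to excluded, everything else is kept
is_excluded <- traits$taxon_name %in% to_exclude
kept <- traits[!is_excluded, ]
excluded <- traits[is_excluded, ]

print(kept)
print(excluded)
```

Because the match is exact in this sketch, a value such as unk sp. 3 would stay in the kept rows until it is added to the find list.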