library(traits.build)
source("R/custom_R_code.R")
18 Tutorial 2: Adding a more complex dataset
18.1 Overview
This is the second of five tutorials on adding datasets to your traits.build
database.
Before you begin this tutorial, ensure you have installed traits.build, cloned the traits.build-template repository, and successfully built a database from the datasets in traits.build-template. Instructions are available at Tutorial: Example compilation.
Goals
Learn how to merge in location data from a standalone spreadsheet.
Learn how to add substitutions for categorical trait values.
Learn how to add custom R code to the metadata file.
Understand the importance of attributing traits to the correct entity_type.
Understand the importance of having the dataset pivot.
New functions introduced
metadata_add_substitution
metadata_add_substitutions_list
metadata_check_custom_R_code
18.2 Adding tutorial_dataset_2
Ensure the dataset folder contains the correct data files
In the traits.build-template repository, there is a folder titled tutorial_dataset_2 within the data folder.
Ensure that this folder exists on your computer.
- The file data.csv exists within the tutorial_dataset_2 folder.
- There is a folder raw nested within the tutorial_dataset_2 folder, which contains two files, locations.csv and notes.txt.
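If you would rather confirm the folder contents from R than by eye, a minimal sketch along the following lines may help. The helper name missing_dataset_files is hypothetical (it is not a traits.build function), and the file names are simply those listed above; adjust them if your copy of the template differs.

```r
# Hypothetical helper (not part of traits.build): report which expected
# files are missing from a dataset folder.
missing_dataset_files <- function(dataset_dir, expected) {
  paths <- file.path(dataset_dir, expected)
  expected[!file.exists(paths)]
}

# File names as listed above; adjust if your copy of the template differs.
expected_files <- c("data.csv", "raw/locations.csv", "raw/notes.txt")

# Returns character(0) when every expected file is in place.
missing_dataset_files("data/tutorial_dataset_2", expected_files)
```

Run this from the root of your traits.build-template clone, since the path is relative to the working directory.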
Source necessary functions
You’ll need to load the traits.build package and source some ancillary functions from the file R/custom_R_code.R; these are the two commands shown at the top of this page.
Use functions to create a metadata.yml file
Create a metadata template
To create the metadata template, run:
metadata_create_template("tutorial_dataset_2")
As with tutorial_dataset_1, this function leads you through a series of menus requiring user input. Ensure you select:
data format: wide
taxon_name column: name_original
location_name column: site_TEXT
individual_id column: 1: NA
collection_date column: 1: NA
Enter collection_date range in format ‘2007/2009’: 1996/1997
Do all traits need repeat_measurements_id’s? 2: No
Navigate to the dataset’s folder and open the metadata.yml file in Visual Studio Code, to ensure information is added to the expected sections as you work through the tutorial.
Propagate source information into the metadata.yml file
This dataset is from a published source and therefore the source information can be added with the function metadata_add_source_doi:
metadata_add_source_doi(dataset_id = "tutorial_dataset_2",
doi = "10.1046/j.1365-2745.2000.00506.x")
Confirm that:
- the authors’ names are formatted as first name last name or first initial last name
- the article title is in sentence case
- the page numbers are filled in as a range, separated by a double dash
Add location details
For this dataset, location data is provided as a standalone spreadsheet, located in the raw data folder: tutorial_dataset_2\raw\locations.csv
First read the location data provided into R:
# read_csv() comes from the readr package
locations <- read_csv("data/tutorial_dataset_2/raw/locations.csv")
traits.build requires three fields to use a specific syntax:
- latitude must be in decimal degrees and the field name (column header) must be latitude (deg)
- longitude must be in decimal degrees and the field name (column header) must be longitude (deg)
- A general site description is documented in the field description
traits.build does not require that the labels for other location properties align across datasets, but it is best practice to use a controlled vocabulary, so database users can easily search across all datasets for information on a specific climate variable or soil nutrient content. For new databases or new location properties, any naming/labeling convention can be established.
To confirm you are using the correct syntax, check the terms already in use:
# list the location properties already used in the database
locations_properties <-
  traits.build_database$locations %>%
  distinct(location_property)
View(locations_properties)
Then rename your columns to match those in use:
locations <-
  locations %>%
  rename(
    `longitude (deg)` = long,
    `latitude (deg)` = lat,
    `description` = vegetation,
    `elevation (m)` = elevation,
    `precipitation, MAP (mm)` = MAP,
    `soil P, total (mg/kg)` = `soil P`,
    `soil N, total (ppm)` = `soil N`,
    `geology (parent material)` = `parent material`
  )
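Before adding the locations to the metadata, it can be worth confirming that the three required column headers are present and that the coordinates look like decimal degrees. The sketch below is one possible check; check_location_columns is a hypothetical helper rather than a traits.build function, and the toy table uses invented values.

```r
# Hypothetical check (not a traits.build function): confirm that a locations
# table uses the required column headers and plausible decimal degrees.
check_location_columns <- function(locations) {
  required <- c("latitude (deg)", "longitude (deg)", "description")
  problems <- character(0)
  absent <- setdiff(required, names(locations))
  if (length(absent) > 0) {
    problems <- c(problems, paste("missing column:", absent))
  }
  lat <- locations[["latitude (deg)"]]
  lon <- locations[["longitude (deg)"]]
  if (!is.null(lat) && any(abs(lat) > 90, na.rm = TRUE)) {
    problems <- c(problems, "latitude outside -90 to 90")
  }
  if (!is.null(lon) && any(abs(lon) > 180, na.rm = TRUE)) {
    problems <- c(problems, "longitude outside -180 to 180")
  }
  problems
}

# Toy table with invented values, for illustration only.
toy_locations <- data.frame(
  location = c("site_a", "site_b"),
  `latitude (deg)` = c(-33.6, -33.7),
  `longitude (deg)` = c(151.2, 151.3),
  description = c("woodland", "heath"),
  check.names = FALSE
)

# Returns character(0) when nothing is wrong.
check_location_columns(toy_locations)
```

Calling the same function on your renamed locations table should likewise return an empty character vector.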
Now add the location information into the metadata file:
metadata_add_locations(dataset_id = "tutorial_dataset_2", location_data = locations)
Ensure you select:
location_name: location
location_property columns: 1 2 3 4 5 6 7 8
Check the metadata.yml file to ensure the location information has been added as expected. If there is a problem, rerun the necessary code; this will overwrite what is present. You can also manually add additional properties if something is forgotten.
Add traits
To select columns in the data.csv file that include trait data, run:
metadata_add_traits(dataset_id = "tutorial_dataset_2")
Select columns 3 4 5 6, as these contain trait data.
Manual filling in of metadata
After confirming that the skeletal traits section has been added to the metadata.yml file, you must fill in all the unknown fields.
For this dataset, you will later use functions to add substitutions and exclude unwanted observations, but it is best to first fill in the information for contributors, the dataset, and the traits.
These are all fields that contain the word unknown and must be filled in manually:
- the contributors section
- description, basis_of_record, life_stage, sampling_strategy, original_file, and notes under the dataset section
- details for each trait, including unit_in, trait_name, entity_type, value_type, basis_of_value, replicates and methods
Adding contributors
The file data/tutorial_dataset_2/raw/tutorial_dataset_2_notes.txt indicates the main data_contributor for this study.
Fill in the remaining contributor information as described in the tutorial_dataset_1 tutorial.
Dataset fields
The file data/tutorial_dataset_2/raw/tutorial_dataset_2_notes.txt indicates how to fill in the unknown dataset fields for this study.
Trait details
The file data/tutorial_dataset_2/raw/tutorial_dataset_2_notes.txt indicates how to fill in the unknown trait fields for this study, but see below as well.
column in dataset | trait concept | unit_in | entity_type | value_type | basis_of_value | replicates |
---|---|---|---|---|---|---|
TRAIT Growth Form CATEGORICAL EP epiphyte (mistletoe) F fern G grass H herb S shrub T tree V vine | plant_growth_form | .na | species | mode | expert_score | .na |
TRAIT SLA UNITS mm2/g | leaf_mass_per_area | mm2/g | population | mean | measurement | 5 |
TRAIT Leaf Size UNITS mm2 | leaf_area | mm2 | population | mean | measurement | 5 |
TRAIT Leaf Dry Mass UNITS g | leaf_dry_mass | g | population | mean | measurement | 5 |
Some notes:
- The trait_name must match a trait concept within the traits dictionary.
- The second trait in this dataset is documented as specific leaf area, the inverse of the trait concept leaf mass per area. The unit conversions algorithm inverts data read in as specific leaf area, converting it to leaf mass per area.
- Categorical traits do not have units or replicates, so these fields become .na.
- The traits.build convention for a categorical trait is value_type: mode, indicating the recorded value is the most commonly observed trait value. In some datasets there may be multiple space-delimited values within a single cell in the data.csv file, indicating there are multiple commonly observed categorical trait values.
- For most observations of categorical traits, the traits.build convention is that the basis_of_value is determined by an expert examining an individual, population or species, and is therefore an expert_score.
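To make the inversion mentioned in the notes concrete, here is a sketch of the arithmetic only; it is not the package's unit-conversion code. Because 1 mm2/g equals 1e-6 m2/g, a specific leaf area in mm2/g corresponds to a leaf mass per area of 1e6 / SLA in g/m2. The function name sla_to_lma and the input value are hypothetical.

```r
# Sketch of the arithmetic: SLA (mm2/g) -> LMA (g/m2).
# 1 mm2/g = 1e-6 m2/g, so LMA = 1 / (SLA * 1e-6) = 1e6 / SLA.
sla_to_lma <- function(sla_mm2_per_g) {
  1e6 / sla_mm2_per_g
}

# An invented SLA of 2900 mm2/g gives roughly 344.83 g/m2.
sla_to_lma(2900)
```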
Additional steps
Once you are well-versed in adding datasets to a traits.build database you will know that there is additional information required in metadata.yml.
However, for this tutorial, let’s begin by assuming we’re finished adding dataset metadata and check for errors:
dataset_test("tutorial_dataset_2")
*Users, please note: not all of these test failures or messages are currently in place. Adding tests to document all failures is still a work in progress.
Three items will fail:
- There are unknown trait values for plant_growth_form (error doesn’t yet exist)
- There are values out of range for leaf_dry_mass (error doesn’t yet exist)
- The dataset cannot pivot between long and wide formats (error doesn’t yet exist)
As indicated in the output messages, there is a [troubleshooting vignette](https://github…/vignettes/) to help solve these errors.
For this tutorial however, keep reading…
There are several ways to proceed, but for these errors, it is useful to next build the dataset:
build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")
If you look at the excluded_data table, you’ll find any data for this dataset that could not be mapped to known traits or known trait values, or that fell outside allowable ranges.
traits.build_database$excluded_data %>%
  filter(dataset_id == "tutorial_dataset_2") %>%
  View()
This code displays a table with 190 rows of excluded data:
- 187 instances of Unsupported trait value for the trait plant_growth_form
- 3 instances of Value out of allowable range for the trait leaf_dry_mass
Looking through the output you’ll notice that the Unsupported trait value error exists because the data.csv file used plant growth form values that are different to those in the trait dictionary.
The values that triggered the Value out of allowable range error are all 0’s, a disallowed leaf_dry_mass value; the trait dictionary specifies that leaf_dry_mass can range from 0.01 - 15000.0 mg.
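The range check described above can be mimicked on a handful of values. This is only an illustrative sketch: out_of_range is a hypothetical helper, the masses are invented, and the real pipeline performs its own unit conversion (here from the g used in data.csv to the mg used by the dictionary range) before testing.

```r
# Hypothetical helper: flag values outside an allowable range.
# Missing values are not flagged.
out_of_range <- function(values, lower, upper) {
  !is.na(values) & (values < lower | values > upper)
}

# Invented leaf dry masses in grams (the unit used in data.csv).
leaf_dry_mass_g <- c(0.007, 0, 0.12, NA, 0)

# Convert to mg before comparing with the dictionary range of 0.01 - 15000 mg.
leaf_dry_mass_mg <- leaf_dry_mass_g * 1000

# Only the two zeros are flagged.
out_of_range(leaf_dry_mass_mg, lower = 0.01, upper = 15000)
```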
Adding trait value substitutions
For categorical traits, only trait values that are indicated in the trait dictionary are recognised. This is an important harmonisation step, as it ensures the same trait concept value is mapped to the same trait value throughout the database.
However, researchers use countless synonyms, abbreviations and syntax to express an identical trait value. traits.build converts all input to lowercase, but all other substitutions must be specified in the dataset’s metadata.yml file.
For this example individual letters were used to express 7 plant growth forms: EP, F, G, H, S, T, V.
Looking at the definition for plant_growth_form in the trait dictionary and the helpful column header provided by the contributor, you can deduce that t is for tree; s is for shrub, etc.
There are two ways to add substitutions into the metadata.yml file.
- Map in individual trait value substitutions using metadata_add_substitution:
metadata_add_substitution(dataset_id = "tutorial_dataset_2",
trait_name = "plant_growth_form", find = "t", replace = "tree")
Look at the metadata.yml file and you’ll note that a substitution has been added, indicating that each t is replaced with tree.
You would repeat this step for the remaining unknown trait values.
- Map in a table of substitutions using metadata_add_substitutions_list:
If there are quite a few trait values that require replacements, it is easier to first create a table of the values that need substituting, then add a column of replacements in either R or Excel.
table <-
  traits.build_database$excluded_data %>%
  filter(
    dataset_id == "tutorial_dataset_2" &
      error == "Unsupported trait value"
  ) %>%
  distinct(trait_name, value) %>%
  rename(find = value)
Next view your table to check the order of trait values and check the allowed values and definitions in the trait dictionary to ensure you replace each abbreviation with an accepted value. Note that epiphyte is not an allowed value for plant_growth_form in the trait dictionary, because AusTraits uses a narrow definition of plant_growth_form and separately has a trait plant_growth_substrate, which includes the trait value epiphyte. For now, we’ll simply ignore this data.
Add a column with substitutions, then add the substitutions to the metadata file:
table <- table %>%
  mutate(replace = c("shrub", "tree", "herb", NA,
                     "graminoid", "fern", "climber_herbaceous"))
## an alternative is
## `mutate(replace = c("shrub", "tree", "herb", "epiphyte", "graminoid",
## "fern", "climber_herbaceous"))`
## which will result in the epiphyte observations remaining in the excluded data table
metadata_add_substitutions_list("tutorial_dataset_2", table)
All required substitutions have been added to the metadata.yml file.
If you were to rerun dataset_test("tutorial_dataset_2") the error referring to Unsupported trait value would now have vanished.
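To see what a substitution amounts to, the sketch below lowercases the raw values and then swaps each abbreviation for its dictionary term. It is an illustration only: apply_substitutions is a hypothetical helper, not how traits.build applies substitutions internally, and the lookup is inferred from the contributor's column header, with ep (epiphyte) deliberately left unmapped.

```r
# Hypothetical lookup inferred from the contributor's column header;
# "ep" (epiphyte) is deliberately left without a replacement.
substitutions <- c(
  s = "shrub", t = "tree", h = "herb",
  g = "graminoid", f = "fern", v = "climber_herbaceous"
)

# Illustrative helper, not a traits.build function: lowercase the input,
# then replace any value that has an entry in the lookup.
apply_substitutions <- function(values, lookup) {
  values <- tolower(values)
  matched <- values %in% names(lookup)
  values[matched] <- unname(lookup[values[matched]])
  values
}

# Gives "tree" "shrub" "ep" "climber_herbaceous"
apply_substitutions(c("T", "S", "EP", "V"), substitutions)
```

Values with no entry in the lookup pass through unchanged, which is why unmapped abbreviations would still fail to match the trait dictionary.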
Replacing “placeholder characters” with NA’s
The Value out of allowable range error is triggered for numeric traits when the traits.build pipeline detects values that fall outside the range specified for the trait in the traits dictionary.
There are three common situations that lead to this error:
1. Values are truly out of range, possibly due to human error or because a plant really was not performing as expected (i.e. not photosynthesising).
2. Values appear to be out of range because of a “unit conversion issue” - that is, the dataset curator or the dataset contributor got the units wrong. This is fixed by working out the correct units and adjusting this in the traits section of the metadata.yml file.
3. A dataset contributor has used a “dummy symbol” to indicate missing data, such as 0, x, missing, etc. Truly missing data should be a blank cell - i.e. NA.
This study is likely an example of (3), where 0 is a placeholder symbol. While the 0’s can be left in the excluded_data table, cluttering the excluded_data table with extraneous measurements makes it difficult to scan for true examples of Value out of allowable range errors in the future. (Values that are NA are, by default, omitted from the excluded_data table.)
Adding custom R code
Instead, you can add R code within the metadata file to replace the placeholder symbol with NA.
- Look at metadata.yml in Visual Studio Code.
- The second field under the dataset section is custom_R_code: .na.
- You can write any code you’d like within this section.
- The file R/custom_R_code.R that you sourced at the beginning of this tutorial contains customised functions commonly used in custom_R_code.
For this example, replace:
custom_R_code: .na
with:
custom_R_code: '
  data %>%
    mutate(
      across(c("TRAIT Leaf Dry Mass UNITS g"), ~na_if(.x,0))
    )
'
This code replaces all 0’s in the column with NA’s.
You can confirm that the custom R code has made the anticipated change with the function metadata_check_custom_R_code. This function reads in the data.csv file, then applies any manipulations from the custom_R_code:
metadata_check_custom_R_code("tutorial_dataset_2") %>% View()
Note:
- use the format with the single quotes; this allows you to manually add line breaks, not otherwise permitted in the metadata.yml format.
- you pipe in data to begin with, but do not need to assign your code back to data; that occurs automatically.
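The zero-to-NA replacement can be tried on a toy data frame before touching the metadata file. The sketch below uses a base R equivalent of the na_if(.x, 0) call so that it runs without the tidyverse; zero_to_na is a hypothetical helper and the values are invented.

```r
# Toy stand-in for data.csv with invented values.
data <- data.frame(
  name_original = c("sp_a", "sp_b", "sp_c"),
  `TRAIT Leaf Dry Mass UNITS g` = c(0.007, 0, 0.12),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Base R equivalent of na_if(.x, 0): zeros become NA, other values are kept.
# which() keeps any existing NA values out of the index.
zero_to_na <- function(x) {
  x[which(x == 0)] <- NA
  x
}

mass_col <- "TRAIT Leaf Dry Mass UNITS g"
data[[mass_col]] <- zero_to_na(data[[mass_col]])
data
```

Only the second row's mass becomes NA; the other columns and values are untouched.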
Build the database again, then check the excluded_data table to confirm there are no longer any excluded measurements:
source("build.R")
traits.build_database$excluded_data %>%
  filter(dataset_id == "tutorial_dataset_2") %>%
  View()
Replacing duplicate values with NA’s
Run the tests again to confirm the errors related to disallowed trait values and values out of range have vanished:
dataset_test("tutorial_dataset_2")
However, there should still be an error indicating that the dataset cannot pivot between long and wide formats.
The ability to pivot is important for two reasons:
- Database users may prefer to display data in wide format to readily compare the values of multiple traits collected on the same individual (or population or species).
- The pivot test groups together 15 variables that are meant to uniquely identify each row of data (dataset_id, trait_name, observation_id, source_id, taxon_name, entity_type, life_stage, basis_of_record, value_type, population_id, individual_id, temporal_id, method_id, entity_context_id, original_name). An inability to pivot indicates either:
  - A variable present in the data.csv file to distinguish between unique observations has not been mapped into the metadata file. (Most likely a context property, column with locations, individual_id or source_id.)
  - Duplicate values exist within the data.csv file and have been read in multiple times.
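The idea behind the pivot test can be sketched with a simplified set of identifying columns: if two rows share every key, they cannot be spread into separate wide-format cells. The helper has_unique_keys, the reduced key set and the rows below are all hypothetical, chosen only to illustrate the duplicate-row case.

```r
# Hypothetical check: can each row be identified by the key columns alone?
has_unique_keys <- function(df, keys) {
  !any(duplicated(df[, keys, drop = FALSE]))
}

# Invented long-format rows: one species recorded in two populations.
long <- data.frame(
  taxon_name    = c("sp_a", "sp_a", "sp_a", "sp_a"),
  trait_name    = c("leaf_area", "leaf_area",
                    "plant_growth_form", "plant_growth_form"),
  entity_type   = c("population", "population", "species", "species"),
  population_id = c("01", "02", NA, NA),
  value         = c("18.8", "75.9", "herb", "herb"),
  stringsAsFactors = FALSE
)

# A simplified subset of the identifying variables listed above.
keys <- c("taxon_name", "trait_name", "entity_type", "population_id")

has_unique_keys(long, keys)        # FALSE: the growth form row is repeated
has_unique_keys(long[-4, ], keys)  # TRUE once the repeat is dropped
```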
The error in this dataset is a common one:
The three numeric traits (leaf_area, leaf_mass_per_area and leaf_dry_mass) are all population-level measurements, while plant_growth_form is mapped in as having entity_type: species, meaning it is considered a species-level measurement. This means if the same species occurs at multiple sites, its growth form value is read in once per site. However, because it is designated as entity_type: species, the traits.build workflow does not connect the value to a location, since the species has the same growth form regardless of location.
Two options are to:
- Recategorise plant_growth_form as having entity_type: population.
- Only read in a single instance of plant_growth_form per species.
Either allows the dataset to pivot, but if the taxon truly displays only a single growth form across all populations it is much better to read in plant growth form once per species. Otherwise the database becomes longer without capturing additional information. This dataset has only a few instances of duplication, but imagine a dataset that has 500 rows of data for the same tree species: suddenly 500 instances of that species being a tree are read into the database.
The solution is to modify the existing custom_R_code, adding one of the customised functions from the file R/custom_R_code.R, replace_duplicates_with_NA:
custom_R_code: '
  data %>%
    mutate(
      across(c("TRAIT Leaf Dry Mass UNITS g"), ~na_if(.x,0))
    ) %>%
    group_by(name_original) %>%
    mutate(
      across(c("TRAIT Growth Form CATEGORICAL EP epiphyte (mistletoe) F fern G grass H herb S shrub T tree V vine"),
      replace_duplicates_with_NA)
    ) %>%
    ungroup()
'
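To get a feel for what blanking repeated values within each species does, the sketch below applies a stand-in function to a toy table. Note the assumptions: blank_repeats is a hypothetical substitute for the helper in R/custom_R_code.R (whose actual implementation may differ), the column name growth_form is shortened for readability, and base R's ave() plays the role of group_by() plus mutate().

```r
# Hypothetical stand-in for the helper in R/custom_R_code.R; the real
# function may be implemented differently.
blank_repeats <- function(x) {
  replace(x, duplicated(x), NA)
}

# Invented rows: sp_a was recorded at two sites with the same growth form.
data <- data.frame(
  name_original = c("sp_a", "sp_a", "sp_b"),
  growth_form   = c("H", "H", "T"),
  stringsAsFactors = FALSE
)

# ave() applies the function within each species, much like
# group_by(name_original) followed by mutate().
data$growth_form <- ave(data$growth_form, data$name_original, FUN = blank_repeats)

# The second sp_a value is now NA, so it is read in once per species.
data$growth_form
```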
Rerun the tests and everything should now pass:
dataset_test("tutorial_dataset_2")
Then rebuild the database and look at the output in the traits table for one of the taxa that previously had duplicate plant_growth_form entries:
source("build.R")
traits.build_database$traits %>%
  filter(dataset_id == "tutorial_dataset_2") %>%
  filter(taxon_name == "Actinotus minor") %>% View()
  dataset_id         taxon_name      observation_id trait_name         value            unit  entity_type location_id
  <chr>              <chr>           <chr>          <chr>              <chr>            <chr> <chr>       <chr>
1 tutorial_dataset_2 Actinotus minor 010            leaf_area          18.8             mm2   population  02
2 tutorial_dataset_2 Actinotus minor 010            leaf_dry_mass      7                mg    population  02
3 tutorial_dataset_2 Actinotus minor 010            leaf_mass_per_area 344.827586206897 g/m2  population  02
4 tutorial_dataset_2 Actinotus minor 011            leaf_area          75.9             mm2   population  03
5 tutorial_dataset_2 Actinotus minor 011            leaf_dry_mass      7                mg    population  03
6 tutorial_dataset_2 Actinotus minor 011            leaf_mass_per_area 89.2857142857143 g/m2  population  03
7 tutorial_dataset_2 Actinotus minor 012            plant_growth_form  herb             NA    species     NA
The measurements for the three numeric traits from a single location share a common observation_id, as they are all part of an observation of a common entity (a specific population of Actinotus minor), at a single location, at a single point in time. However, the row with the plant growth form measurement has a separate observation_id, reflecting that this is an observation of a different entity (the taxon Actinotus minor).
Build dataset report
As a final step, build a report for the study:
traits.build_database$build_info$version <- "5.0.0" # a fix because the function was built around specific AusTraits versions
dataset_report("tutorial_dataset_2", traits.build_database, overwrite = TRUE)