library(traits.build)
source("R/custom_R_code.R")
20 Tutorial 4: Additional complexities
20.1 Overview
This is the fourth tutorial on adding datasets to your traits.build
database.
Before you begin this tutorial, ensure you have installed traits.build, cloned the traits.build-template repository, and have successfully build a database from the example datasets in traits.build-template
. Instructions are available at Tutorial: Example compilation.
Goals
Learn how to generate location names using custom_R_code
Learn how to add multiple sources to the reference section.
Learn how to add map source_ID into the metadata file.
Learn how to map metadata from columns.
Learn how to exclude data.
New functions introduced
- none.
20.2 Adding tutorial_dataset_4
This dataset is a subset of data from Togashi_2015 in AusTraits. The data are a compilation of many datasets, each with their own references information.
For such datasets, there are two options: 1) sourcing the data from the original publication, and adding it to the database as its own datset; 2) entering the dataset as part of a broader compilation. For this study, in AusTraits, a combination of both approaches was used. For original sources that included a broader range of traits and similar (or better) resolution, the original source was used. However, in this compilation there were several studies where Togashi had already individually contacted the authors of the source publications, as the data appendix for this publication often had better resolution than the original papers.
This tutorial focuses on adding a single dataset derived from many original studies.
Before you begin creating the metadata file, take a look at the data.csv file - you’ll notice that the columns present are quite different from the datasets added so far. There are columns for latitude & longitude, but not location name column. There are columns for the original source. These are columns for entity_type, basis_of_record, and value_type, three metadata fields that, for the previous tutorial datasets, were entered in the metadata file as a fixed value. And the only trait column log_LA.SA
is in a non-standard format.
With a few small tricks this dataset can also be added seamlessly.
Ensure the dataset folder contains the correct data files
In the traits.build-template repository, there is a folder titled tutorial_dataset_4
within the data folder.
Ensure that this folder exists on your computer.
The file
data.csv
exists within thetutorial_dataset_4
folder.There is a folder
raw
nested within thetutorial_dataset_4
folder, that contains one file,tutorial_dataset_4_notes.txt
.
source necessary functions
- If you have restarted R Studio since last adding a dataset, ensure all functions are loaded from both the
traits.build
package and the custom functions file:
Create the metadata.yml file
- Note, that because, for this dataset, there are a number of variables that cannot simply be adding using the
metadata_add_...
functions, unlike in previous tutorials, you’ll now mix the three steps that were previously separated:
- Use
metadata_add...
functions where possible.
- Mutate new columns using
custom_R_code
.
- Manually map newly creted column names in the appropriate part of the
metadata.yml
file.
Create a metadata template
To create the metadata template, run:
metadata_create_template("tutorial_dataset_4")
As with in the previous tutorials, this function leads you through a series of menus requiring user input. Ensure you select:
data format: wide
taxon_name column: 1: Species
location_name column: 1: NA
individual_id column: 1: NA
collection_date column: 1: NA
Enter collection_date range in format ‘2007/2009’: 1996/2015
Do all traits need repeat_measurements_id
’s? 2: No
Notes:
As you noted before, there is no location column to map in automatically, so that must be added later. You enter
NA
for now.
Since this is a compilation, there is nowhere in the manuscript that indicates the collection dates for each study. The best estimate is to list the range of publication dates of the sources: 1996–2015.
Navigate to the dataset’s folder and open the metadata.yml file in Visual Studio Code, to ensure information is added to the expected sections as you work through the tutorial.
Propagate source information into the metadata.yml file
Entering the reference information for the manuscript from which the data appendix was sourced is straightforward, as this is a single publication:
metadata_add_source_doi(dataset_id = "tutorial_dataset_4", doi = "10.1002/ece3.1344")
However, you also want to acknowledge the original data sources, those documented in the column References
.
read_csv("data/tutorial_dataset_4/data.csv") %>% distinct(References)
# A tibble: 11 × 1
References<chr>
1 Barrett et al. 1996
2 Benyon et al. 1999
3 Bleby et al.2009
4 Brodribb and Felid 2000
5 Brooksbank et al. 2011
6 Canham et al. 2009
7 Carter and White 2009
8 Cernusak et al. 2006
9 Choat et al. 2005
10 Drake and Franks 2003
11 Drake et al. 2011
You would now need to look up each of these references in the reference section of the manuscript and use Google Scholar (or another reference resource) to look up the doi for each reference. To add additional references, you need to add an argument to the metadata_add_source_doi
function:
metadata_add_source_doi(dataset_id = "tutorial_dataset_4", doi = "10.1071/bt9960249",
type = "original_01")
Per the traits.build
schema, a secondary reference is when there have been two related publications out of a single dataset, while the term original
is used if the data were collected as part of a previous dataset (generally by different authors) and added to traits.build
as part of a compilation.
When you add Benyon_1999
(doi - “10.1016/s0378-3774(98)00080-8”), you’ll need to change the type
field to original_02
, etc.
Note that AusTraits will build fine without adding all 11 original sources; it is up to you how many you want to add as a test.
Map source_id into metadata file
source_id
is a metadata field that is used relatively infrequently. In the AusTraits trait database only 1/20 studies require the mapping of asource_id
and therefore as a default the field does not get added to the metadata template.
Instead, you have to manually add it to the files’
dataset
section, generally directly belowlocation_name
.
Simply add a line
source_id: source_id
. Make sure the indents line up with the fields above/below.
When the AusTraits team initially added this dataset, the curator manually added the
source_id
column. Otherwise such a column could be mutated from the reference column usingcustom_R_code
or added manually in Excel.
Add location details
Your protocol for adding locations diverges from the past 3 tutorials, because in this dataset you don’t yet have location names. The data file instead includes the latitude and longitude of each site, information you can use to “create” location names; the actual name doesn’t matter, just that each latitude/longitude combination has a unique location name.
Although you could edit the data.csv file directly (and sometimes we do), you could alternatively create a location_name
column through custom_R_code:
: '
custom_R_code data %>%
mutate(
location_name = paste0("lat_",Lat,"_long_",Long)
)
'
In this case you’re using custom_R_code
to generate many unique location names as a column in the data table as it is first read into the R workflow.
Note: While, for this example you are generating many unique location names, there are many datasets where all data have been collected at a single location, and therefore the submitted dataset doesn’t include a location_name
column. For all of those you simply add code like mutate(location_name = "Broken Hill")
into custom_R_code
.
You next want to create a table of location names and location properties (i.e latitude & longitude):
<-
location_table metadata_check_custom_R_code("tutorial_dataset_4") %>%
select(location_name, Lat, Long) %>%
rename(`latitude (deg)` = Lat, `longitude (deg)` = Long) %>%
distinct()
metadata_add_locations("tutorial_dataset_4", location_table)
Notes:
Remember the function
metadata_check_custom_R_code
reads in the data.csv file, applies custom_R_code manipulations, and outputs the updated data table. This is very useful if you want to check yourcustom_R_code
is performing as expected or if you want to perform further manipulations to the output.
There are many other ways to create a table of location names and properties. You could create a standalone table using R (or Excel), but this solution generates no additional files to store.
Map location name into metadata file
Because the column location_name did not exist when you created the metadata table, you filled in
location_name = NA
during the initial user prompts.You now need to manually fill the newly mutated location name column into the metadata file. Under the
dataset
section you’ll find the fieldlocation_name: unknown
. This needs to be replaced withlocation_name: location_name
.
Add traits
There is a single trait for this study, Huber value, which is the sapwood area to leaf area ratio. However, in the data.csv file it is documented as log_LA.SA
and another line of custom R code must be added:
: '
custom_R_code data %>%
mutate(
location_name = paste0("lat_",Lat,"_long_",Long),
LA.SA = 10^(log_LA.SA)
)
'
You could take the inverse of LA.SA, but this can also be accomplished through unit conversions.
You can then run:
metadata_add_traits("tutorial_dataset_4")
Select column 13, your newly created column of trait data.
As with previous datasets, the following section has been added to metadata.yml
:
- var_in: LA.SA
: unknown
unit_in: unknown
trait_name: unknown
entity_type: unknown
value_type: unknown
basis_of_value: unknown
replicates: unknown methods
This trait has several “non-standard” values:
- unit_in: The units for the input column are leaf area/sapwood area, a dimensionless “area ratio”. Meanwhile, Huber value is reported as sapwood area/leaf area, the inverse dimensionless “area ratio”.
The UCUM standard to which traits.build
conforms specifies that “dimensionless” is only accepted for the very few traits that are truly dimensionless, not traits where units top and bottom simply cancel out. You need to specify that it is a ratio of area/area (or mass/mass, count/count, etc.)
Looking in the trait dictionary, you’ll see that the units are specified as: mm2{sapwood}/mm2{leaf}
, specifically to be explicit about which area is the denominator vs numerator. You therefore specify that the units_in
are mm2{leaf}/mm2{sapwood}
.
Since it is dimensionless, you could, of course specify any area units on top and bottom as long as they are identical, but I know mm2{leaf}/mm2{sapwood}
is already in the unit conversions file.
entity_type, value_type: Because this study is a compilation of many sources, the entity_type and value_type are not consistent across all measurements. Instead the data curator had to go back to many of the original sources and document which were population-level versus individual-level measurements, and correspondingly which were means vs raw values. For such circumstances (which also can occur within a single study), you can map a column name as the value.
replicates: For most of the studies the number of replicates comprising the trait values is unknown.
Taking all this into account, you’re left with:
- var_in: LA.SA
: mm2{leaf}/mm2{sapwood}
unit_in: huber_value
trait_name: entity
entity_type: value
value_type: measurement
basis_of_value: unknown
replicates: Leaf area was measured using a desktop scanner for all leaves on a sampled
methods-sectional area was measured with
branch. After bark removal, the branch crosscut (methodology that includes the non-conductive
a digital caliper at two points near the -sectional area was measured and subtracted
part of the sapwood). The pith cross-sectional area. For compiled and contributed data, measurements
from the branch cross
on branches were made as described above, although with a variable number of branchessmall (<10
sampled per tree. In some studies where the branch diameter was too -sectional
mm), the pith was considered part of the sapwood. For whole trees, cross1.3 m above the ground from bored cores or harvesting
sapwood area was measured at in whole trees was in most cases measured from harvested trees. the tree. Leaf area
Add contexts
If you consider the columns within the data.csv
file you’ll note that Sample
documents a method context and needs to be added as a context property. In addition, within this dataset, Height
(plant height) is not a trait, but a covariate; some datasets documented this, because they thought it might influence the Huber value of the plant. It is therefore a context.
metadata_add_contexts("tutorial_dataset_4")
Following the user prompts select:
5: Sample
6: Height
You are then led sequentially through the user prompts for each of the context properties:
Sample
4: method_context
This context is a method context, because it specifies a difference in methodology that might influence that trait value.
The following values exist for this context: trunk sample branch sample
Are replacement values required? (y/n) n
Are descriptions required? (y/n) y
The answers to the next two questions are up to the dataset curator, but at AusTraits we decided that trunk sample
and branch sample
were sufficiently explicit context property values, but that it would be helpful to add a description.
Therefore, we filled in context_property: wood sample type
and added descriptions for the context property values:
- context_property: wood sample type
: method_context
category: Sample
var_in:
values- value: trunk sample
: Measurements taken on the trunk of the tree.
description- value: branch sample
: Measeurements taken on a branch from the tree. description
Height
Plant height is an entity context. It is a feature of the entity (an individual) that might influence the trait value.
5: entity_context
Again, the dataset curator may choose what information to document within the metadata file. For continuous traits, like plant height, the general consensus is that the values are self-explanatory, so you’d select:
The following values exist for this context: 6.7, 4.8, 6.9
Are replacement values required? (y/n) n
Are descriptions required? (y/n) n
Filling in that the context property
is tree height (m)
, the metadata file would simply be:
- context_property: tree height (m)
: entity_context
category: Height var_in
Testing, error fixes, and report building
At this point, run the dataset tests, rebuild the dataset, and check for excluded data:
dataset_test("tutorial_dataset_4")
build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")
$excluded_data %>%
traits.build_databasefilter(dataset_id == "tutorial_dataset_4") %>% View()
There should be no errors and no excluded data, so go ahead and build a report for the study:
$build_info$version <- "5.0.0"
traits.build_database# a fix because the function was built around specific AusTraits versions
dataset_report("tutorial_dataset_4", traits.build_database, overwrite = TRUE)
Overall, this report isn’t very informative since it is the first Huber value dataset in the new database.
But let me draw your attention to the list of taxa at the bottom. Because the tutorials are, for now, ignoring taxon alignments (traits.build
in a state of flux with regard to this), the tutorials have ignored this section.
But let me draw your attention to the list of taxa at the bottom. Because the tutorials are, for now, ignoring taxon alignments (traits.build
in a state of flux with regard to this), the tutorials have ignored this section.
However, note the unknown taxa names unk sp. 1
and unk sp. 2
. Although the AusTraits database accepts names resolved to genus and family, data collected on a truly unknown taxon is useless and should be excluded. Being a terrestrial vascular plant database, we also exclude mosses and lichens that are sometimes in datasets. The curators for other databases aligned to the traits.build
workflow will have their own standards for values to explicitly disallow, based on taxonomy (or some other variable).
Near the bottom of the metadata file is a section for excluding data.
It currently reads:
: .na exclude_observations
Change this to:
:
exclude_observations- variable: taxon_name
: unk sp. 1, unk sp. 2
find: omitting completely unknown taxa (E Wenk, 2023.09.20) reason
Notes:
- Other variable names can also be used here. Perhaps there is a particular context value or location that is known to have problematic data. In AusTraits this field is almost exclusively used to exclude specific taxa, but the metadata section is designed to have broader applications.
If you now rebuild the database, you’ll see that the measurements associated with these two data are now in the excluded_data
table.