library(traits.build)
source("R/custom_R_code.R")
21 Tutorial 5: Multiple columns for a trait
21.1 Overview
This is the fifth tutorial on adding datasets to your traits.build database.
Before you begin this tutorial, ensure you have installed traits.build, cloned the traits.build-template repository, and successfully built a database from the example datasets in traits.build-template. Instructions are available at Tutorial: Example compilation.
It is also recommended that you first work through some of the earlier tutorials, as many steps for adding datasets to a traits.build database are only thoroughly described in the early tutorials.
Goals
- Learn how to map context properties into the metadata traits section
- Learn how to add measurement remarks
New functions introduced
- none.
21.2 Adding tutorial_dataset_5
This dataset is a subset of data from Geange_2017 in AusTraits.
This tutorial focuses on how to input a dataset where there are multiple columns for the same trait, with each column indicating measurements made under different context conditions.
Before you begin creating the metadata file, take a look at the data.csv file. Note that there are two columns each for photosynthesis and conductance. For this dataset these represent repeat measurements made on the same individuals under different experimental treatments. In other studies, there may be multiple columns because the same trait was measured using different methods.
Ensure the dataset folder contains the correct data files
In the traits.build-template repository, there is a folder titled tutorial_dataset_5 within the data folder.
Ensure that this folder exists on your computer.
- The file data.csv exists within the tutorial_dataset_5 folder.
- There is a folder raw nested within the tutorial_dataset_5 folder, which contains one file, tutorial_dataset_5_notes.txt.
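The folder checks above can also be scripted. The following is a minimal sketch in base R; the helper name check_dataset_files is hypothetical (it is not part of traits.build), and the usage line assumes your working directory is the root of your cloned traits.build-template repository.

```r
# Hypothetical helper (not a traits.build function): report which of the
# expected files exist for a dataset folder under `data_dir`.
check_dataset_files <- function(dataset_id, data_dir = "data") {
  dataset_dir <- file.path(data_dir, dataset_id)
  expected <- c(
    data  = file.path(dataset_dir, "data.csv"),
    notes = file.path(dataset_dir, "raw", paste0(dataset_id, "_notes.txt"))
  )
  # file.exists() is vectorised, so this yields one TRUE/FALSE per path
  stats::setNames(file.exists(expected), names(expected))
}

# Usage, from the root of the traits.build-template repository:
print(check_dataset_files("tutorial_dataset_5"))
```

Any FALSE entry points to a file that is missing or misplaced and should be fixed before creating the metadata file.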
Source necessary functions
- If you have restarted RStudio since last adding a dataset, ensure all functions are loaded from both the traits.build package and the custom functions file:
library(traits.build)
source("R/custom_R_code.R")
Create a metadata.yml file
Create a metadata template
To create the metadata template, run:
metadata_create_template("tutorial_dataset_5")
As in the previous tutorials, this function leads you through a series of menus requiring user input. Ensure you select:
data format: wide
taxon_name column: 1: species_name
location_name column: 1: NA
individual_id column: 1: NA
collection_date column: 21: date
Do all traits need repeat_measurements_id’s? 2: No
Notes:
- There is no location column to map in automatically, so that must be added later. Enter NA for now.
- The column Abbrev! appears to be a unique identifier for each individual that could be mapped in to identify individual plants. Mapping in an individual_id column is essential if multiple rows include measurements for the same individual, as might happen if individuals are measured repeatedly across time. For this dataset, each row includes measurements on a separate individual, so mapping individual_id isn’t required. Moreover, if you look closely at the values in the column Abbrev!, you will notice 3 instances of duplication. Were you to map in individual_id: Abbrev!, you would end up with an error, as occurred initially when this dataset was added to AusTraits.
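Before mapping any column to individual_id, it can help to check it for repeated values. The following is a minimal sketch in base R using a small made-up data frame; the column name id and all values are invented for illustration and are not taken from data.csv.

```r
# Toy data standing in for a dataset; `id` plays the role of a candidate
# individual_id column. All values are invented for illustration.
toy <- data.frame(
  id = c("A1", "A2", "A2", "B1", "B1", "B2"),
  Photo = c(10.2, 11.5, 9.8, 12.1, 8.7, 10.9),
  stringsAsFactors = FALSE
)

# Values that occur more than once cannot uniquely identify a row
dup_ids <- unique(toy$id[duplicated(toy$id)])
print(dup_ids)

# Rows affected by the duplication
print(toy[toy$id %in% dup_ids, ])
```

If dup_ids is non-empty and each row is meant to be a separate individual, the column is not a safe identifier.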
Navigate to the dataset’s folder and open the metadata.yml file in Visual Studio Code, to ensure information is added to the expected sections as you work through the tutorial.
Propagate source information into the metadata.yml file
Use the function metadata_add_source_doi to add the source.
The reference doi is 10.1186/s40665-017-0033-8.
Add measurement remarks
There is a free-form comments column called measurement_remarks that can be mapped in at the dataset level (i.e. for all measurements) or under specific traits.
This column is not used to generate any of the identifiers and therefore cannot be used to document information that is a context property, source, location, method, etc. However, minor notes about specific observations or trait measurements may be worth retaining in the traits.build output; if this information is available in a column, it can be mapped into measurement remarks.
For instance, in this dataset, the column Mother documents the maternal lineage of each individual. This could be recorded as an official context property or, alternatively, could simply be added as a measurement remark, first mutating a column:
custom_R_code: '
  data %>%
    mutate(
      measurement_remarks = paste0("maternal lineage ", Mother)
    )
'
and then adding measurement_remarks: measurement_remarks to the dataset section of the metadata file, below life_stage.
Add location details
There isn’t a location name specified in the data.csv file, so use custom_R_code to mutate a new column, location, that holds the location name.
custom_R_code: '
  data %>%
    mutate(
      measurement_remarks = paste0("maternal lineage ", Mother),
      location = "Australian National University glasshouse"
    )
'
Then specify this column (location) as the source of location_name in the dataset section of the metadata file.
Then manually add the location details to the location section of the metadata file:
Australian National University glasshouse:
  latitude (deg): -35.283
  longitude (deg): 149.1167
  precipitation, MAP (mm): 622
  description: Australian National University glasshouses
Add traits
To select columns in the data.csv file that include trait data, run:
metadata_add_traits(dataset_id = "tutorial_dataset_5")
Select columns 13 14 15 16 17 18 19, as these contain trait data.
Then fill in the details for each trait column in the traits section of the metadata file.
Remember, the trait_name must match a trait concept within the traits dictionary. For this example:
column in dataset | trait concept | units_in | entity_type | value_type | basis_of_value | replicates |
---|---|---|---|---|---|---|
Photo | leaf_photosynthetic_rate_per_area_saturated | umol{CO2}/m2/s | individual | raw | measurement | 1 |
Cond | leaf_stomatal_conductance_per_area_at_Asat | mol{H2O}/m2/s | individual | raw | measurement | 1 |
Photo_D | leaf_photosynthetic_rate_per_area_saturated | umol{CO2}/m2/s | individual | raw | measurement | 1 |
Cond_D | leaf_stomatal_conductance_per_area_at_Asat | mol{H2O}/m2/s | individual | raw | measurement | 1 |
area_mm2 | leaf_area | mm2 | individual | raw | measurement | 1 |
SLA_cm_g:4 | leaf_mass_per_area | cm2/g | individual | raw | measurement | 1 |
%N:1 | leaf_N_per_dry_mass | ‘%’ | individual | raw | measurement | 1 |
Add contexts
Contexts from columns
There are two columns in the data.csv file that specify contexts, Elevation (seed provenance) and Treatment (drought treatment).
To add these contexts to the metadata file, run:
metadata_add_contexts(dataset_id = "tutorial_dataset_5")
Select columns 6 7, as these contain context properties.
The category for both of these is treatment_context.
As the values for both are abbreviations, it is recommended to replace them with proper terms and descriptions.
Therefore, the metadata template will now have the following section:
contexts:
- context_property: unknown
  category: treatment_context
  var_in: Elevation
  values:
  - find: LoElev
    value: unknown
    description: unknown
  - find: HiElev
    value: unknown
    description: unknown
- context_property: unknown
  category: treatment_context
  var_in: Treatment
  values:
  - find: LoWat
    value: unknown
    description: unknown
  - find: HiWat
    value: unknown
    description: unknown
Which will be filled in as:
- context_property: seed provenance
  category: treatment_context
  var_in: Elevation
  values:
  - find: LoElev
    value: low elevation
    description: Seeds sourced from low elevation populations.
  - find: HiElev
    value: high elevation
    description: Seeds sourced from high elevation populations.
- context_property: drought treatment
  category: treatment_context
  var_in: Treatment
  values:
  - find: LoWat
    value: low water
    description: Plants assigned to low water treatment.
  - find: HiWat
    value: high water
    description: Plants assigned to high water treatment.
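The effect of the find and value pairs can be pictured as a lookup applied to each context column. The following is a minimal sketch in base R on invented rows; it illustrates the intended outcome only and makes no claim about how traits.build performs the substitution internally. The output column names seed_provenance and drought_treatment are chosen here for illustration.

```r
# Lookup tables mirroring the find -> value pairs in the contexts section
elevation_lookup <- c(LoElev = "low elevation", HiElev = "high elevation")
treatment_lookup <- c(LoWat = "low water", HiWat = "high water")

# Invented rows using the abbreviations found in the Elevation and
# Treatment columns
toy <- data.frame(
  Elevation = c("LoElev", "HiElev", "HiElev"),
  Treatment = c("HiWat", "LoWat", "HiWat"),
  stringsAsFactors = FALSE
)

# Replace each abbreviation with its descriptive value
toy$seed_provenance <- unname(elevation_lookup[toy$Elevation])
toy$drought_treatment <- unname(treatment_lookup[toy$Treatment])
print(toy)
```

An abbreviation with no matching find entry would come back as NA in this sketch, which is a quick way to spot values that still need a mapping.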
Contexts manually added
As background, at the point in the traits.build workflow where the trait metadata is read in, the trait data has been converted to long format, with each trait measurement in its own row. This allows columns such as methods, entity_type, and units to be added, which are inherently unique to a specific trait. It also means that context properties can now be added, with different values assigned to different traits.
For this study, in addition to the two contexts that are documented as columns, there is a context property that is documented across columns, the time from last watering to gas exchange measurements. For photosynthesis and conductance, the columns Photo and Cond document measurements made just after a watering cycle, while the columns Photo_D and Cond_D document measurements made at the very end of a watering cycle.
For such situations, you add a line to the traits section of the metadata for each of these traits.
For instance, for the column Photo, you would add the time_since_watering line between the existing replicates and methods fields:
replicates: 1
time_since_watering: start of watering cycle
methods: Gas exchange was measured using...
This creates a new column, time_since_watering, and for the trait column Photo, the value is start of watering cycle.
You add an identical line to the trait column Cond, while for the trait columns Photo_D and Cond_D you instead insert the line time_since_watering: end of watering cycle.
For the other traits, no lines are added, as the context property time_since_watering doesn’t apply to them.
Since the trait metadata is read in long after the custom_R_code code is executed, this context property cannot be read in using a function. Instead it must be manually added as a context_property:
- context_property: time since watering
  category: temporal
  var_in: time_since_watering
  values:
  - value: start of watering cycle
    description: Measurements made on the morning following a watering event when the plants were at their least water-limited.
  - value: end of watering cycle
    description: Measurements made on the final day of a watering cycle when the plants were at the driest point in the cycle.
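To make the long-format reasoning concrete, the sketch below reshapes two invented wide rows into one row per measurement and then attaches time_since_watering according to the column each value came from. It uses base R only, the plant labels and numbers are made up, and it mimics the outcome rather than the traits.build internals.

```r
# Two invented individuals with the four gas exchange columns
wide <- data.frame(
  plant = c("p1", "p2"),
  Photo = c(12.0, 10.5),
  Cond = c(0.30, 0.25),
  Photo_D = c(4.1, 3.8),
  Cond_D = c(0.08, 0.06),
  stringsAsFactors = FALSE
)

trait_cols <- c("Photo", "Cond", "Photo_D", "Cond_D")

# Wide -> long: one row per (plant, column) measurement.
# unlist() walks the columns in order, so `column` repeats each name
# once per plant and `plant` cycles through the individuals.
long <- data.frame(
  plant = rep(wide$plant, times = length(trait_cols)),
  column = rep(trait_cols, each = nrow(wide)),
  value = unlist(wide[trait_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Context value assigned per trait column, as in the traits section
watering <- c(
  Photo = "start of watering cycle",
  Cond = "start of watering cycle",
  Photo_D = "end of watering cycle",
  Cond_D = "end of watering cycle"
)
long$time_since_watering <- unname(watering[long$column])
print(long)
```

Because each measurement now sits in its own row, the same plant can carry different time_since_watering values for Photo and Photo_D, which is not possible while the data are still in wide format.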
Adding contributors
The file data/tutorial_dataset_5/raw/tutorial_dataset_5_notes.txt indicates the main data_contributor for this study.
Dataset fields
The file data/tutorial_dataset_5/raw/tutorial_dataset_5_notes.txt indicates how to fill in the unknown dataset fields for this study.
Testing, error fixes, and report building
At this point, run the dataset tests, rebuild the dataset, and check for excluded data:
dataset_test("tutorial_dataset_5")
build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")
traits.build_database$excluded_data %>%
  filter(dataset_id == "tutorial_dataset_5") %>% View()
There should be no errors.
There are a handful of excluded values, including negative photosynthetic rates, negative conductance rates, and two instances where leaf_area = 0. The leaf_area = 0 values need to be removed by adding the following to custom_R_code:
mutate(across(c("area_mm2"), ~na_if(.x,0)))
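The effect of this line can be checked on a small example before editing the metadata file. The sketch below assumes the dplyr package is installed (mutate, across, and na_if are dplyr functions) and uses invented values rather than rows from data.csv.

```r
library(dplyr)

# Invented leaf areas, including a zero that should become NA
toy <- data.frame(
  species_name = c("sp1", "sp2", "sp3"),
  area_mm2 = c(152.4, 0, 98.1),
  stringsAsFactors = FALSE
)

# Same expression as in the tutorial: turn zeros in area_mm2 into NA
cleaned <- toy %>%
  mutate(across(c("area_mm2"), ~ na_if(.x, 0)))

print(cleaned)
```

Only the zero is converted; the other values in the column are left unchanged.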
Then remake the database and again check the excluded data table.
If the only excluded values remaining are the negative gas exchange rates, build a report for the study:
traits.build_database$build_info$version <- "5.0.0"
# a fix because the function was built around specific AusTraits versions
dataset_report("tutorial_dataset_5", traits.build_database, overwrite = TRUE)