24 Adding datasets, a lengthy guide
This vignette is an exhaustive reference for adding datasets to a traits.build database.
If you are embarking on building a new database using the traits.build standard, a better place to get started is the 7 tutorials.
Then come back to this document for details and for unusual dataset circumstances that are not covered in the tutorials.
Other chapters you may want to read include:
- an overview of traits.build,
- the instructions provided to data contributors,
- the structure of a compiled traits.build database,
- the structure of the raw data files,
- the overview for adding data.
24.1 Getting started
The traits.build
package offers a workflow to build a harmonised trait database from disparate sources, with different data formats and containing varying metadata.
There are two key components required to merge datasets into a database with a common output structure:
- A workflow to wrangle datasets into a standardised input format, using a combination of {traits.build} functions and manual steps.
- A process to harmonise information across datasets and build them into a single database.
This document details all the steps to format datasets into a pair of standardised files for input, a tabular data file and a structured metadata file. It includes examples of code you might use.
To begin, install the traits.build package.
# remotes::install_github("traitecoevo/traits.build", quick = TRUE)
library(traits.build)
24.2 Standardised input files required
24.3 Create a dataset folder
Add a new folder within the data folder. Its name should be the study's unique dataset_id.
The preferred format for dataset_id is the surname of the first author of any corresponding publication, followed by the year, as surname_year. E.g. Falster_2005. Wherever there are multiple studies with the same id, we add a suffix _2, _3, etc. E.g. Falster_2005, Falster_2005_2.
dataset_id
is one of the core identifiers within a traits.build
database.
24.4 Constructing the data.csv file
The trait data for each study (dataset_id) must be in a single table, data.csv. The data.csv file can either be in a wide format (1 column for each trait, with the various trait names as the column headers) or long format (a single column for all trait values and an additional column for the trait name).
Required columns
- taxon_name
- trait_name (many columns for wide format; 1 column for long format)
- value (trait value; for long format only)
- location_name (if required)
- contexts (if required)
- collection_date (if required)
- individual_id (if required)
- For all field studies, ensure there is a column for location_name. If all measurements were made at a single location, a location_name column can easily be mutated using custom_R_code within the metadata.yml file. See the sections adding locations and adding contexts below for more information on compiling location and context data.
- If available, be sure to include a column with collection date. If possible, provide dates in yyyy-mm-dd (e.g. 2020-03-05) format or, if the day of the month isn't known, as yyyy-mm (e.g. 2020-03). However, any format is allowed and the column can be parsed to the proper yyyy-mm-dd format using custom_R_code. If the same collection date applies to the entire study it can be added directly into the metadata.yml file.
- If applicable, ensure there are columns for all context properties, including experimental treatments, specific differences in method, a stratified sampling scheme within a plot, or sampling season. Additional context columns could be added through custom_R_code or keyed in where traits are added, but it is best to include a column in the data.csv file whenever possible. The protocol for adding context properties to the metadata file is under adding contexts. A minimal example of a wide-format data.csv is sketched below.
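For orientation, a minimal hypothetical wide-format data.csv might look like this (the taxon names, locations, and values are invented for illustration only):

taxon_name,location_name,collection_date,individual_id,leaf area (mm2),plant growth form
Banksia serrata,Mt Field_wet,2020-03,ind_01,1250,tree
Banksia serrata,Mt Field_dry,2020-03,ind_02,980,tree
Acacia dealbata,Mt Field_dry,2020-03,ind_03,410,shrub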
Data may need to be summarised
Data submitted by a contributor should be in the rawest form possible; always request data with individual measurements over location/species means.
Some datasets include replicate measurements on an individual at a single point in time, such as the leaf area of 5 individual leaves. In AusTraits (the Australian plant trait database) we generally merge such measurements into an individual mean in the data.csv file, but the raw values are preserved in the contributor's raw data files. Be sure to calculate the number of replicates that contributed to each mean value.
When there is just a single column of trait values to summarise, use:
::read_csv("data/dataset_id/raw/raw_data.csv") %>%
readr::group_by(individual, `species name`, location, context, etc) %>%
dplyr::summarise(
dplyrleaf_area_mean = mean(leaf_area),
leaf_area_replicates = n()
%>%
) ::ungroup() dplyr
Make sure you group_by all categorical variables you want to retain, since only columns that are grouping variables will be kept.
When you want to take the mean of multiple data columns simultaneously, use:
::read_csv("data/dataset_id/raw/raw_data.csv") %>%
readr::group_by(individual, `species name`, location, context, etc) %>%
dplyr::summarise(
dplyracross(c(leaf_area, `leaf N`), ~ mean(.x, na.rm = TRUE)),
across(c(growth_form, `photosynthetic pathway`), ~ first(.x)),
replicates = n()
%>%
) ::ungroup() dplyr
{dplyr} hints:

- Categorical variables not included as grouping variables will return NA.
- Generally use the function first for categorical variables - it simply retains the first value in the column.
- You can identify runs of columns by column number/position. For instance, across(c(5:25), ~ mean(.x, na.rm = TRUE)) or across(c(leaf_area:leaf_N), ~ mean(.x, na.rm = TRUE)).
- Be sure to ungroup at the end.
- Before summarising, ensure variables you expect to be numeric are indeed numeric: utils::str(data).
Merging multiple spreadsheets
If multiple spreadsheets of data are submitted these must be merged together.
- If the spreadsheets include different trait measurements made on the same individual (or location means for the same species), they are best merged using dplyr::left_join, specifying all conditions that need to be matched across spreadsheets (e.g. individual, species, location, context). Ensure the column names are identical between spreadsheets or specify the columns that need to be matched.
::read_csv("data/dataset_id/raw/data_file_1.csv") -> data_1
readr::read_csv("data/dataset_id/raw/data_file_2.csv") -> data_2
readr
%>%
data_1 ::left_join(
dplyr
data_2,by = c("Individual", "Taxon" = "taxon", "Location", "Context")
)
- If the spreadsheets include trait measurements for different individuals (or possibly data at different scales - such as individual-level data for some traits and species means for other traits), they are best merged using dplyr::bind_rows. Ensure the column names for taxon name, location name, context, individual, and collection date are identical between spreadsheets. If there are data for the same traits in both spreadsheets, make sure those column headers are identical as well.
::read_csv("data/dataset_id/raw/data_file_1.csv") -> data_1
readr::read_csv("data/dataset_id/raw/data_file_2.csv") -> data_2
readr
%>%
data_1 ::bind_rows(data_2) dplyr
Taxon names
Taxon names need to be complete names. If the main data file includes code names, with a key as a separate file, they are best merged now to avoid many individual replacements later.
::read_csv("data/dataset_id/raw/species_key.csv") -> species_key
readr::read_csv("data/dataset_id/raw/data_file.csv") -> data
readr
%>%
data ::left_join(species_key, by = "code") dplyr
Unexpected hangups
- When Excel saves an .xls file as a .csv file it only preserves the number of significant figures that are displayed on the screen. This means that if, for some reason, a column has been set to display a very low number of significant figures or a column is very narrow, data quality is lost.
- If you're reading a file into R where there are lots of blanks at the beginning of a column of numeric data, the defaults for readr::read_csv fail to register the column as numeric. This is fixed by adding the argument guess_max:

read_csv("data/dataset_id/raw/raw_data.csv", guess_max = 10000)

This checks 10,000 rows of data before guessing the column type.
(When data.csv files are read in through the {traits.build} workflow, guess_max = 100000.)
24.5 Constructing the metadata.yml file
As described in detail here, the metadata.yml file maps the meanings of the individual columns within the data.csv file and documents all additional dataset metadata.
Before beginning, it is a good idea to look at the two example dataset metadata files in the traits.build-template
repository, to become familiar with the general structure.
The sections of the metadata.yml
file are:
- source
- contributors
- dataset (includes adding custom R code)
- locations
- contexts
- traits
- substitutions
- taxonomic_updates
- exclude_observations
- questions
This document covers these metadata sections in sequence.
Use a proper text editor
- Install a proper text editor, such as Visual Studio Code (our favourite), RStudio, TextMate, or Sublime Text. Using Microsoft Word will make a mess of the formatting.
Source the {traits.build} functions
To assist you in constructing the metadata.yml
file, we have developed functions to help propagate and fill in the different sections of the file.
If you haven’t already, run:
library(traits.build)
The functions for populating the metadata file all begin with metadata_. A full list is available here.
Creating a template
The first step is to create a blank metadata.yml
file.
::metadata_create_template("Yang_2028") traits.build
As each function prompts you to enter the dataset_id, it can be useful to assign the dataset's id to a variable you can reuse:
<- "Yang_2028"
current_study
::metadata_create_template(current_study) traits.build
This function cycles through a series of user-input menus, querying about both the data format (long versus wide) and which columns contain which variables (taxon name, location name, individual identifiers, collection date). It then creates a relatively empty metadata file, data/dataset_id/metadata.yml.
The questions are:

- Is the data long or wide format? A wide dataset has each variable (i.e. trait) as a column. A long dataset has a single column containing all trait values and an additional column specifying the trait name.
- Select column for taxon_name
- Select column for trait_name (long datasets only)
- Select column for trait values (long datasets only)
- Select column for location_name. If your data.csv file does not yet have a location_name column, this information can later be added manually.
- Select column for individual_id (a column that links measurements on the same individual)
- Select column for collection_date. If your data.csv file does not have a collection_date column, you will be prompted to enter a collection_date range in the format '2007/2009'. A fixed value in yyyy, yyyy-mm or yyyy-mm-dd format is accepted, either as a single value or a range of values. This information can be edited later.
- Indicate whether all traits need repeat_measurements_id's.

repeat_measurements_id's are only required if the dataset documents response curve data (e.g. an A-ci or light response curve for plants; or a temperature response curve for animal or plant behaviour). They can also be added to individual traits (later). They are intended to capture multiple "sub-measurements" that together comprise a single "trait measurement". A sketch of how this appears under a trait is shown below.
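For instance, a hypothetical trait entry with repeat measurements enabled might look like this (the column name Asat and all metadata values are invented; the trait scaffold fields are described under Adding traits below):

- var_in: Asat
  unit_in: umol{CO2}/m2/s
  trait_name: .na
  entity_type: individual
  value_type: raw
  basis_of_value: measurement
  replicates: 1
  methods: Light response curves measured with a gas exchange system.
  repeat_measurements_id: TRUE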
Adding a source
The skeletal metadata.yml
file created by metadata_create_template
included a section for the primary source with default fields for a journal article.
You can manually enter citation details, but whenever possible, use one of the three functions developed to automatically propagate citation details.
Adding source from a doi
If you have a doi for your study, use the function:

traits.build::metadata_add_source_doi(dataset_id = current_study, doi = "doi")
The different elements within the source will automatically be generated.
Double check the information added to ensure:
- The title is in sentence case.
- The information isn't in all caps (sources from a few journals get read in as all caps).
- Page numbers are present and include -- between page numbers (for example, 123 -- 134).
- If there is a colon (:) or apostrophe (') in a reference, the text for that line must be in quotes (").

A hypothetical example of a complete article source is sketched below.
By default, details are added as the primary source. If multiple sources are linked to a single dataset_id, you can specify a source as secondary.
::metadata_add_source_doi(dataset_id = current_study, doi = "doi",
traits.buildtype = "secondary")
- Attempting to add a second primary source will overwrite the information already input. Instead, if there is a third source to add, use type = "secondary_2".
- Always check the key field, as it can be incorrect for hyphenated last names.
- If the dataset being entered is a compilation of many original sources, you should add all the original sources, specifying type = "original_01", type = "original_02", etc. See Richards_2008 for an example of a complex source list.
Adding source from a bibtex file
::metadata_add_source_doi(dataset_id, file = "myref.bib") traits.build
(These options require the packages rcrossref and RefManageR to be installed.)
Proper formatting of different source types
Different source types require different fields and formatting:
Book:
source:
primary:
key: Cooper_2013
bibtype: Book
year: 2013
author: Wendy Cooper and William T. Cooper
title: Australian rainforest fruits
publisher: CSIRO Publishing
pages: 272
Online resource:
source:
primary:
key: TMAG_2009
bibtype: Online
author: '{Tasmanian Herbarium}'
year: 2009
title: Flora of Tasmania Online
publisher: Tasmanian Museum & Art Gallery (Hobart)
url: http://www.tmag.tas.gov.au/floratasmania
Thesis:
source:
primary:
key: Kanowski_2000
bibtype: Thesis
year: 1999
author: John Kanowski
title: Ecological determinants of the distribution and abundance of the folivorous
marsupials endemic to the rainforests of the Atherton uplands, north Queensland.
type: PhD
institution: James Cook University, Townsville
Unpublished dataset:
source:
primary:
key: Ooi_2018
bibtype: Unpublished
year: 2018
author: Mark K. J. Ooi
title: "Unpublished data: Herbivory survey within Royal National Park, University
of New South Wales"
- Note, the title of an unpublished dataset must begin with the words "Unpublished data" and include the data collector's affiliation.
Adding contributors
The skeletal metadata.yml file created by the function metadata_create_template includes a template for entering details about data contributors. Edit this manually, duplicating the template if details for multiple people are required.
- data_collectors are people who played a key intellectual role in the study's experimental design and data collection. Most studies have 1-3 data_collectors listed. Four fields of information are required for each data collector: last_name, given_name, affiliation and ORCID (if available). Nominate a single data collector to be the dataset's point of contact.
- Additional field assistants can be listed under assistants.
- The data entry person is listed under dataset_curators.
- Email addresses for the data_collectors are not included in the metadata.yml file, but it is recommended that a database curator maintain a list of email addresses of all data collectors to whom authorship may be extended on a future database data paper. Authorship "rules" will vary across databases, but for AusTraits we extend authorship to all data_collectors who we successfully contact.
For example, in Roderick_2002:
contributors:
data_collectors:
- last_name: Roderick
given_name: Michael
ORCID: 0000-0002-3630-7739
affiliation: The Australian National University, Australia
additional_role: contact
assistants: Michelle Cochrane
dataset_curators: Elizabeth Wenk
Custom R code
The goal is always to maintain data.csv files that are as similar as possible to the contributed dataset. However, for many studies there are minor changes we want to make to a dataset before the data.csv file is processed by the {traits.build} workflow. These may include applying a function to transform a particular column of data, a function to filter data, or a function to replace a contributor's "measurement missing" placeholder symbol with NA. In each case it is appropriate to leave the rawer data in data.csv and edit the data table as it is read into the {traits.build} workflow.
Background
To allow custom modifications to a particular dataset before the common pipeline of operations gets applied, the workflow permits some customised R code to be run as a first step in the processing pipeline. That pipeline (the function process_custom_code called within dataset_process) looks like this:
data <-
  readr::read_csv(filename_data_raw, col_types = cols(), guess_max = 100000,
                  progress = FALSE) %>%
  process_custom_code(metadata[["dataset"]][["custom_R_code"]])()
The final line shows that the custom code gets applied right after the file is loaded.
Overview of options and syntax
- A copy of the file containing functions the AusTraits team have explicitly developed to use within the custom_R_code field is available at custom_R_code.R and should be placed within the R folder of your database repository, then sourced (source("R/custom_R_code.R")).
- Place a single apostrophe (') at the start and end of your custom R code; this allows you to add line breaks between pipes.
- Begin your custom R code with data %>%, then apply whatever fixes are needed.
- Use functions from the packages dplyr, tidyr, stringr (e.g. mutate, rename, summarise, str_detect), but avoid other packages.
- Alternatively, use the functions we've created explicitly for pre-processing data that are sourced through the file custom.R. You may choose to expand this file within your own database repository.
- Custom R code is not intended for reading in files. Any reading in and merging of multiple files should be done before creating the dataset's data.csv file.
- Use pipes to weave together a single statement, where possible. If you need to manipulate/subset the data.csv file into multiple data frames and then bind them back together, you'll need to use semicolons (;) at the end of each statement. A sketch of how the finished field looks inside metadata.yml follows.
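As an illustration, a finished custom_R_code field inside metadata.yml might look like this minimal sketch (the column manipulation here is hypothetical):

dataset:
  custom_R_code: '
    data %>%
      mutate(
        location_name = "Royal National Park"
      )
  '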
Examples of appropriate use of custom R code
- Converting times to NY strings

Most sources from herbaria record flowering_time and fruiting_time as a span of months, while AusTraits codes these variables as a sequence of 12 N's and Y's for the 12 months. A series of functions in custom_R_code make this conversion. These include the following (a standalone sketch of the core logic follows the list):
- format_flowering_months (creates flowering times from a start-end pair)
- convert_month_range_string_to_binary (converts flowering and fruiting month ranges to 12-element character strings of binary data)
- convert_month_range_vec_to_binary (converts vectors of month ranges to 12-element character strings of binary data)
- collapse_multirow_phenology_data_to_binary_vec (converts multi-row phenology data to a 12-digit binary string)
- Splitting ranges into min, max pairs
Many datasets from herbaria record traits like leaf_length, leaf_width, seed_length, etc. as a range (e.g. 2-8). The function separate_range separates these data into a pair of columns with minimum and maximum values, which is the preferable way to merge the data into a trait database. A generic alternative is sketched below.
- Removing duplicate values within a dataset
Duplicate values within a study need to be filtered out using the custom function replace_duplicates_with_NA.

If a species-level trait value has been entered repeatedly on rows containing individual-level trait measurements, you need to filter out the duplicates. For instance, plant growth form is generally a species-level observation, with the same value on every row with individual-level trait measurements. There are also instances where a population-level numeric trait appears repeatedly, such as when nutrient analyses were performed on a bulked sample at each site.

Before applying the function, you must group by the variable(s) that contain the unique values. This might be at the species or population level. For instance, use group_by(Species, Location) if there are unique values at the species x location level.
data %>%
  dplyr::group_by(Species) %>%
  dplyr::mutate(
    across(c(`leaf_percentN`, `plant growth form`), replace_duplicates_with_NA)
  ) %>%
  dplyr::ungroup()
- Removing duplicate values across datasets
Values that were sourced from a different study need to be filtered out. See Duplicates between studies below; functions to automate this process are in progress.
- Replacing “missing values” with NA’s
If missing data values in a dataset are represented by a symbol, such as 0 or *, these need to be converted to NA's:
data %>%
  dplyr::mutate(
    across(c(`height (cm)`, `leaf area (mm2)`), ~ na_if(., 0))
  )
- Mapping data from one trait to a second trait, part 1
If a subset of data in a column are also values for a second trait in AusTraits, some data values can be duplicated into a second temporary column. In the example below, some data in the contributor's fruit_type column also apply to the trait fruit_fleshiness in AusTraits:
data %>%
  dplyr::mutate(
    fruit_fleshiness = ifelse(`fruit type` == "pome", "fleshy", NA)
  )
The function move_values_to_new_trait is being developed to automate this and currently resides in the custom_R_code.R file within the austraits.build repository.
- Mapping data from one trait to a second trait, part 2
If a subset of data in a column are instead values for a second trait in AusTraits, some data values can be moved to a second column (second trait), also using the function move_values_to_new_trait. In the example below, some data in the contributor's growth_form column only apply to the trait parasitic in AusTraits. Note you need to create a blank variable before moving the trait values.
data %>%
  dplyr::mutate(new_trait = NA_character_) %>%
  move_values_to_new_trait(
    original_trait = "growth form",
    new_trait = "parasitic",
    original_values = "parasitic",
    values_for_new_trait = "parasitic",
    values_to_keep = "xx"
  ) %>%
  dplyr::mutate(across(c(original_trait), ~ na_if(., "xx")))
or
data %>%
  dplyr::mutate(dispersal_appendage = NA_character_) %>%
  move_values_to_new_trait(
    "fruits", "dispersal_appendage",
    c("dry & winged", "enclosed in aril"),
    c("wings", "aril"),
    c("xx", "enclosed")
  ) %>%
  dplyr::mutate(across(c(original_trait), ~ na_if(., "xx")))
- Note, the parameter values_to_keep doesn't accept NA, leading to the clunky coding. This bug is known, but we haven't managed to fix it.
- Mutating a new trait from other traits
If the data.csv file includes raw data that you want to manipulate into a trait, or the contributor presents the data in a different formulation than AusTraits uses, you may choose to mutate a new column containing the new trait.
data %>%
  dplyr::mutate(
    root_mass_fraction = `root mass` / (`root mass` + `shoot mass`)
  )
- Mutating a location name column
If the dataset has location information, but lacks unique location names (or any location name), you might mutate a location_name column to map in. (See also Adding location details.)
data %>%
  dplyr::mutate(
    location_name = ifelse(location_name == "Mt Field" & habitat == "Montane rainforest",
                           "Mt Field_wet", location_name),
    location_name = ifelse(location_name == "Mt Field" & habitat == "Dry sclerophyll",
                           "Mt Field_dry", location_name)
  )
or
data %>%
  dplyr::mutate(
    location_name = dplyr::case_when(
      longitude == 151.233056 ~ "heath",
      longitude == 151.245833 ~ "terrace",
      longitude == 151.2917 ~ "diatreme"
    )
  )
# Note with `dplyr::case_when`,
# any rows that do not match any of the conditions become `NA`'s.
or
data %>%
  dplyr::mutate(
    location_name = paste0("lat_", round(latitude, 3), "_long_", round(longitude, 3))
  )
- Generating measurement_remarks

Sometimes there is a note column with abbreviated information about individual rows of data that is not appropriate to map as a context. This could be included in the field measurement_remarks:
data %>%
  dplyr::mutate(
    measurement_remarks = paste0("maternal lineage ", Mother)
  )
- Reformatting dates
You can reformat collection_date values to conform to the yyyy-mm-dd format, or add a date column.
Converting from any mdy format to yyyy-mm-dd (e.g. Dec 3 2015 to 2015-12-03):
data %>%
  dplyr::mutate(
    Date = Date %>% lubridate::mdy()
  )
Converting from any dmy format to yyyy-mm-dd (e.g. 3-12-2015 to 2015-12-03):
data %>%
  dplyr::mutate(
    Date = Date %>% lubridate::dmy()
  )
Converting from a mmm-yyyy (string) format to yyyy-mm (e.g. Dec 2015 to 2015-12):
data %>%
  dplyr::mutate(
    Date =
      lubridate::parse_date_time(Date, orders = "my") %>%
      base::format.Date("%Y-%m")
  )
Converting from a mdy format to yyyy-mm (e.g. Excel has reinterpreted the data as full dates, 12-01-2015, but the resolution should be "month", 2015-12):
data %>%
  dplyr::mutate(
    Date =
      lubridate::parse_date_time(Date, orders = "mdy") %>%
      base::format.Date("%Y-%m")
  )
A particularly complicated example, where some dates are presented as yyyy-mm and others as yyyy-mm-dd:
data %>%
  dplyr::mutate(
    weird_date = ifelse(stringr::str_detect(gathering_date, "^[0-9]{4}"),
                        gathering_date, NA),
    gathering_date = gathering_date %>%
      lubridate::mdy(quiet = T) %>% as.character(),
    gathering_date = coalesce(gathering_date, weird_date)
  ) %>%
  select(-weird_date)
Testing your custom R code
After you've added the custom R code to the metadata file, check that the output is indeed as intended:
metadata_check_custom_R_code("Blackman_2010")
Fill in metadata[["dataset"]]
The dataset
section includes fields that are:
- filled in automatically by the function metadata_create_template()
- mandatory fields that need to be filled in manually for all datasets
- optional fields that are included and filled in only for a subset of datasets
fields automatically filled in

- data_is_long_format: yes/no
- taxon_name
- location_name
- collection_date: If this is not read in as a specified column, it needs to be filled in manually as start date/end date in yyyy-mm-dd, yyyy-mm, or yyyy format, depending on the relevant resolution. If the collection dates are unknown, write unknown/publication year, as in unknown/2022.
- individual_id: individual_id is one of the fields that can be read in during metadata_create_template. However, you may instead mutate your own individual_id using custom_R_code and add it in manually. For a wide dataset, individual_id is required anytime there are multiple rows of data for the same individual and you want to keep these linked. This field should only be included if it is required.
  WARNING: If you have an entry individual_id: unknown, this assigns all rows of data to an individual named "unknown" and the entire dataset will be assumed to be from a single individual. This is why it is essential to omit this field if there isn't an actual column of data being read in.
  NOTE: For individual-level measurements, each row of data is presumed to be a different individual during dataset processing. individual_id is only required if there are multiple rows of data (long or wide format) with information for the same individual.
- repeat_measurements_id: repeat_measurements_id's are sequential integer identifiers assigned to a sequence of measurements on a single trait that together represent a single observation (and are assigned a single observation_id by the traits.build pipeline). The assumption is that these are measurements that document points on a response curve. The function metadata_create_template offers an option to add it to metadata[["dataset"]], but it can alternately be specified under specific traits, as repeat_measurements_id: TRUE.
required fields manually filled in

- description: 1-2 sentence description of the study's goals. The abstract of a manuscript usually includes some good sentences/phrases to borrow.
- basis_of_record: Basis of record can be coded in as a fixed value for an entire dataset, by trait, by location, or read in from a column in the data.csv file. If it is being read in from a column, list the column name in the field; otherwise input the fixed value. Allowable values are: field, field_experiment, captive_cultivated, lab, preserved_specimen, and literature. See the database structure vignette for definitions of these accepted basis_of_record values. If fixed values are specified for both the entire dataset under metadata[["dataset"]] and for specific locations/traits under metadata[["locations"]] or metadata[["traits"]], the location/trait value overrides that entered under metadata[["dataset"]].
- life_stage: Life stage can be coded in as a fixed value for an entire dataset, by trait, by location, or read in from a column in the data.csv file. If it is being read in from a column, list the column name in the field; otherwise input the fixed value. Allowable values are: adult, sapling, seedling, juvenile. See the database structure vignette for definitions of these accepted life_stage values. If fixed values are specified for both the entire dataset under metadata[["dataset"]] and for specific locations/traits under metadata[["locations"]] or metadata[["traits"]], the location/trait value overrides that entered under metadata[["dataset"]].
- sampling_strategy: Often a quite long description of the sampling strategy, extracted verbatim from a manuscript whenever possible.
- original_file: The name of the file initially submitted to the database curators. It is generally archived in the dataset folder, in a subfolder named raw. For AusTraits, datasets are also usually archived in the project's GoogleDrive folder.
- notes: Notes about the study and processing of data, especially if there were complications or if some data were suspected duplicates of another study and were filtered out.
optional fields manually filled in

- measurement_remarks: Measurement remarks is a field to capture a miscellaneous notes column. This should be information that is not captured by trait methods (which are fixed to a single value for a trait) or as a context. measurement_remarks can be coded in as a fixed value for an entire dataset, by trait, by location, or read in from a column in the data.csv file.
- entity_type: entity_type is standardly added to each trait, and is described below under traits, but a fixed value or column could be read in under metadata[["dataset"]]. A hypothetical filled-in example follows.
Adding location details
Location data includes location names, latitude/longitude coordinates, verbal location descriptions, and any additional abiotic/biotic location variables provided by the contributor (or in the accompanying manuscript). For studies with more than a few locations, it is most efficient to create a table of this data that is automatically read into the metadata.yml
file.
The function metadata_add_locations automatically propagates location information from a stand-alone location properties table into metadata[["locations"]]:
<- read_csv("data/dataset_id/raw/locations.csv")
locations ::metadata_add_locations(current_study, locations) traits.build
The function metadata_add_locations first prompts the user to identify the column with the location name, and then to list all columns that contain location data. This automatically fills in the location component of the metadata file.
Rules for formatting a locations table to read in:

- Location names must be identical (including syntax, case) to those in data.csv.
- Column headers for latitude and longitude data must read latitude (deg) and longitude (deg).
- Latitude and longitude must be in decimal degrees (e.g. -46.5832). There are many online converters to convert from degrees-minutes-seconds or UTM formats. Or use the following formula: decimal_degrees = degrees + (minutes/60) + (seconds/3600).
- If there is a column with a general vegetation description (e.g. rainforest, coastal heath) it should be titled description.
- Although location properties are not restricted to a controlled vocabulary, newly added studies should use the same location property syntax as others whenever possible, to allow future discoverability. To generate a list of terms already used under location_property:
database$locations %>% dplyr::distinct(location_property)
Some examples of syntax to add locations data that exist in different formats:
- When the main data.csv file has columns for a few location properties:
locations <-
  metadata_check_custom_R_code(current_study) %>%
  dplyr::distinct(location_name, latitude, longitude, `veg type`) %>%
  dplyr::rename(dplyr::all_of(c("latitude (deg)" = "latitude",
                                "longitude (deg)" = "longitude",
                                "description" = "veg type")))

traits.build::metadata_add_locations(current_study, locations)
- If you want to add or edit the data, it is probably easiest to save the locations table, then edit it in Excel, before reading it back into R.
- It is possible that you will want to specify life_stage or basis_of_record at the location level. When required, it is usually easiest to manually add these fields to some or all locations, as sketched below.
Adding contexts
The dictionary definition of a context is the situation within which something exists or happens, and that can help explain it. This is exactly what context_properties are in AusTraits: ancillary information that is important for explaining and understanding a trait value.
AusTraits recognises 5 categories of contexts:
- treatment contexts Context property that is a feature of a plot (subset of a location) that might affect the trait values measured on an individual, population or species-level entity. Examples include soil nutrient manipulations, growing temperatures, or CO2 enhancement.
- plot contexts Context property that is a feature of a plot (subset of a location) that might affect the trait values measured on an individual, population or species-level entity. Examples include a property that is stratified within a "geographic location", such as topographic position. Plots are of course locations themselves; what is a location vs a plot_context depends on the geographic resolution a dataset collector has applied to their locations.
- temporal contexts Context property that is a feature of a “point in time” that might affect the trait values measured on an individual, population or species-level entity. They generally represent repeat measurements on the same entity across time and may simply be numbered observations or might be explicitly linked to growing season or time of day.
- method contexts Context property that records specific information about a measurement method that is modified between measurements. These might be samples from different canopy light environments, different leaf ages, or sapwood samples from different branch diameters.
Context properties are not restricted to a controlled vocabulary. However, newly added studies should use the same context property syntax as others whenever possible, to allow future discoverability. To generate a list of terms already used under context_property
, use:
database$contexts %>%
  dplyr::distinct(context_property, category)
Context properties are most easily read into the metadata.yml file with the dedicated function:

traits.build::metadata_add_contexts(dataset_id)
The function first displays a list of all data columns (from the data.csv file) and prompts you to select those that are context properties.
- For each column you are asked to indicate its category (those described above).
- You are shown a list of the unique values present in the data column and asked if these require any substitutions (y/n).
- You are asked if descriptions are required for the context property values (y/n).
This function then adds the contexts to the metadata[["contexts"]]
section.
If you selected both substitutions and descriptions required:
- context_property: unknown
category: temporal_context
var_in: month
values:
- find: AUG
value: unknown
description: unknown
- find: DEC
value: unknown
description: unknown
- find: FEB
value: unknown
description: unknown
- context_property: unknown
category: treatment_context
var_in: CO2_Treat
values:
- find: ambient CO2
value: unknown
description: unknown
- find: added CO2
value: unknown
description: unknown
If you selected just substitutions required:
- context_property: unknown
category: temporal_context
var_in: month
values:
- find: AUG
value: unknown
- find: DEC
value: unknown
- find: FEB
value: unknown
- context_property: unknown
category: treatment_context
var_in: CO2_Treat
values:
- find: ambient CO2
value: unknown
- find: added CO2
value: unknown
If you selected neither substitutions nor descriptions required:
- context_property: unknown
category: temporal_context
var_in: month
- context_property: unknown
category: treatment_context
var_in: CO2_Treat
- You must then manually fill in the fields designated as unknown.
- If there is a value in a column that is not a context property, set its value to value: .na.
- If there are additional context properties that were designated in the traits section, these will have to be added manually, as this information is not captured in a column that is read in. A final output might be:
- context_property: sampling season
category: temporal_context
var_in: month
values:
- find: AUG
value: August
description: August (late winter)
- find: DEC
value: December
description: December (early summer)
- find: FEB
value: February
description: February (late summer)
- context_property: CO2 treatment
category: treatment_context
var_in: CO2_Treat
values:
- find: ambient CO2
value: 400 ppm
description: Plants grown at ambient CO2 (400 ppm).
- find: added CO2
value: 640 ppm
description: Plants grown at elevated CO2 (640 ppm); 240 ppm above ambient.
- context_property: measurement temperature
category: method_context
var_in: method_context # this field would be included in the relevant traits
values:
- value: 20°C # this value would be keyed in through the relevant traits
description: Measurement made at 20°C
- value: 25°C
description: Measurement made at 25°C
Adding traits
The function metadata_add_traits() adds a scaffold for trait metadata to the skeletal metadata.yml file.
metadata_add_traits(current_study)
You will be asked to indicate which columns include trait data.
This automatically propagates the following metadata fields for each trait selected into metadata[["traits"]]. var_in is the name of a column in the data.csv file (for wide datasets) or a unique trait name in the trait_name column (for a long dataset):
- var_in: leaf area (mm2)
unit_in: .na
trait_name: .na
entity_type: .na
value_type: .na
basis_of_value: .na
replicates: .na
methods: .na
The trait details then need to be filled in manually.
- unit_in: fill in the units associated with the trait values in the submitted dataset - such as mm2 in the example above. If you're uncertain about the syntax/format used for some more complex units, look through the traits definition file (config/traits.yml) or the file showing unit conversions (config/unit_conversions.csv). For categorical variables, leave this as .na.
AusTraits uses the Unified Code for Units of Measure (UCUM) standard for units (https://ucum.org/ucum), but each database using the traits.build workflow can select its own choices for unit abbreviations. The UCUM standard follows clear, simple rules, but also has a flexible syntax for documenting notes in curly brackets that are recorded as part of the 'unit' for specific traits yet are not formally units. For instance, {count}/mm2 or umol{CO2}/m2/s, where the actual units are 1/mm2 and umol/m2/s. There are a few not-very-intuitive units in UCUM: a is year (annum).
Notes:
- If the units start with a punctuation symbol, the units must be in single, straight quotes, such as: unit_in: '{count}/mm2'
- It is best not to start units with a - (negative sign). In AusTraits we've adopted the convention of using, for instance, neg_MPa instead of -MPa.
- trait_name: This is the trait name of the appropriate trait concept for the dataset, as listed in config/traits.yml. For currently unsupported traits, leave this as .na but fill in the rest of the metadata and flag this study as having a potential new trait concept. Then in the future, if an appropriate trait concept is added to the traits.yml file, the data can be read into the database by simply replacing the .na with a trait name. Each database will have its own criteria/rules for adding traits to the trait dictionary, and likely rules that evolve as a trait database grows. In AusTraits, if no appropriate trait concept exists in the trait dictionary, a new trait must be defined within the accompanying AusTraits Plant Dictionary and should only be added if it is clearly a distinct trait concept, can be explicitly defined, and there exists sufficient trait data that the measurements have comparative value.
- entity_type: Entity type indicates "what" is being observed for the trait measurements - as in the organismal level to which the trait measurements apply. As such, entity_type can be individual, population, species, genus, family or order. Metapopulation-level measurements are coded as population and infraspecific taxon-level measurements are coded as species. See the database structure vignette for definitions of these accepted entity_type values.
Note:
- entity_type is about the "organismal level" to which the trait measurement refers; this is separate from the taxonomic resolution of the entity's name.
- value_type: Value type indicates the statistical nature of the trait value recorded. Allowable value types are mean, minimum, maximum, mode, range, raw, and bin. See the database structure vignette for definitions of these accepted value types. All categorical traits are generally scored as being a mode, the most commonly observed value. Note that for values that are bins, the two numbers are separated by a double hyphen, 1 -- 10.
- basis_of_value: Basis of value indicates how a value was determined. Allowable terms are measurement, expert_score, model_derived, and literature. See the database structure vignette for definitions of these accepted basis_of_value values; most categorical trait measurements are values that have been scored by an expert (expert_score) and most numeric trait values are measurements.
- replicates: Fill in with the appropriate number of measurements that comprise each value. If the values are raw values (i.e. a measurement of an individual), replicates: 1. If the values are, for instance, means of 5 leaves from an individual, replicates: 5. If there is just a single population-level value for a trait that comprises measurements on 5 individuals, replicates: 5. For categorical variables, leave this as .na. If there is a column that specifies replicate number, you can list the column name in the field.
- methods: This information can usually be copied verbatim from a manuscript and is a textual description of all components of the method used to measure the trait.
In general, methods sections extracted from pdfs include “special characters” (non-UTF-8 characters). Non-English alphabet characters are recognised (e.g. é, ö) and should remain unchanged. Other characters will be re-formatted during the study input process, so double check that degree symbols (º), en-dashes (–), em-dashes (—), and curly quotes (‘,’,“,”) have been maintained or reformatted with a suitable alternative. Greek letters and some other characters are replaced with their Unicode equivalent (e.g. <U+03A8> replaces Psi (Ψ)); for these it is best to replace the symbol with an interpretable English-character equivalent.
If there are two columns of data with measurements for the same trait using completely different methods, simply add the respective methods to the metadata for the respective columns. A method_id counter will be added to these during processing to ensure the correct trait values are linked to the correct methods. This is separate from method_contexts, which are minor tweaks to the methods between measurements that are expected to have concurrent effects on trait values (see below).
NOTE:
- If identical methods apply to a string of traits, for the first trait use the following syntax, where the &leaf_length_method notation assigns the remaining text in the field to leaf_length_method:
methods: &leaf_length_method All measurements were from dry herbarium
collections, with leaf and bracteole measurements taken from the largest
of these structures on each specimen.
Then for the next trait that uses this method you can just include the line below. At the end of processing you can read/write the yml file and this will fill in the assigned text throughout.
methods: *leaf_length_method
In addition to the automatically propagated fields, there are a number of optional fields you can add if appropriate.
- life_stage: If all measurements in a dataset were made on plants of the same life stage, a global value should be entered under metadata[["dataset"]]. However, if different traits were measured at different life stages you can specify a unique life stage for each trait or indicate a column where this information is stored.
- basis_of_record: If all measurements in a dataset represent the same basis_of_record, a global value should be entered under metadata[["dataset"]]. However, if different traits have different basis_of_record values you can specify a unique basis_of_record value for each trait or indicate a column where this information is stored.
- measurement_remarks: Measurement remarks is a field to indicate miscellaneous comments. If these comments only apply to specific trait(s), this field should be specified with those traits' metadata sections. This is meant to be information that is not captured by "methods" (which is fixed to a single value for a trait).
- method_context: If different columns in a wide data.csv file indicate measurements on the same trait using different methods, this needs to be designated. At the bottom of the trait's metadata, add a method_context_name field (e.g. method_context or leaf_age_type are good options). Write a word or short phrase that indicates the method context property value that applies to that trait (data column). For instance, one trait might have method_context: fully expanded leaves and a second entry with the same trait name and methods might have method_context: leaves still expanding. The method context details must also be added to the contexts section.
- temporal_context: If different columns in a wide data.csv file indicate measurements on the same trait, on the same individuals, at different points in time, this needs to be designated. At the bottom of the trait's metadata, add a temporal_context_name field (e.g. temporal_context or measurement_time_of_day work well). Write a word or short phrase that indicates which temporal context applies to that trait (data column). For instance, one trait might have temporal_context: dry season and a second entry with the same trait name and method might have temporal_context: after rain. The temporal context details must also be added to the contexts section. A sketch of paired trait entries follows.
Adding substitutions
It is very unlikely that a contributor will use categorical trait values that are entirely identical to those listed as allowed trait values for the corresponding trait concept in the traits.yml file. You need to add substitutions for those that do not exactly align, to match the wording and syntax of the trait values in the trait dictionary.
metadata[["substitutions"]]
entries are formatted as:
substitutions:
- trait_name: dispersal_appendage
find: attached carpels
replace: floral_parts
- trait_name: dispersal_appendage
find: awn
replace: bristles
- trait_name: dispersal_appendage
find: awn bristles
replace: bristles
The three elements it includes are:
- trait_name is the AusTraits defined trait name.
- find is the trait value used in the data.csv file.
- replace is the trait value supported by AusTraits.
You can manually type substitutions into the metadata.yml
file, ensuring you have the syntax and spacing accurate.
Alternately, the function metadata_add_substitution adds single substitutions directly into metadata[["substitutions"]]:

traits.build::metadata_add_substitution(current_study, "trait_name", "find", "replace")
Notes:
- Combinations of multiple trait values are allowed - simply list them, space delimited (e.g. shrub tree for a species whose growth form includes both).
- Combinations of multiple trait values are reorganised into alphabetic order to collapse them into fewer combinations (e.g. "fire_killed resprouts" and "resprouts fire_killed" are alphabetised and hence collapsed into one combination, "fire_killed resprouts").
- If a trait value is N or Y it needs to be in single, straight quotes (usually edited later, directly in the metadata.yml file).
If you have many substitutions to add, it is more efficient to create a spreadsheet with a list of all trait_name by trait_value combinations requiring substitutions. The spreadsheet should have four columns with headers dataset_id, trait_name, find and replace. This table can be read directly into the metadata.yml file using the function metadata_add_substitutions_list:
substitutions_to_add <-
  readr::read_csv("data/dataset_id/raw/substitutions_required.csv")

traits.build::metadata_add_substitutions_list(current_study, substitutions_to_add)
Once you've built the new dataset (see below), you can quickly create a table of all values that require substitutions:
austraits$excluded_data %>%
  filter(
    dataset_id == current_study,
    error == "Unsupported trait value"
  ) %>%
  distinct(dataset_id, trait_name, value) %>%
  rename("find" = "value") %>%
  select(-dataset_id) %>%
  write_csv("data/dataset_id/raw/substitutions_required.csv")
Manually add the aligned values in Excel, then:
substitutions_to_add <-
  readr::read_csv("data/dataset_id/raw/substitutions_required_after_editing.csv")

metadata_add_substitutions_list(dataset_id, substitutions_to_add)
Adding taxonomic updates
metadata[["taxonomic_updates"]]
is a metadata section to document edits to taxonomic names to align the names submitted by the dataset contributor with a taxon name in the databases taxonomic resources master list, config/taxon_list.csv
. This includes correcting typos, standardising syntax (punctuation, abbreviations used for words like subspecies
), and reformatting names to adhere to taxonomic standards for a specific taxon group and a specific databases’ rules.
metadata[["taxonomic_updates"]]
entries are formatted as:
taxonomic_updates:
- find: Acacia ancistrophylla/sclerophylla
replace: Acacia sp. [Acacia ancistrophylla/sclerophylla; White_2020]
reason: Rewording taxon where `/` indicates uncertain species identification
to align with `APC accepted` genus (2022-11-10)
taxonomic_resolution: genus
- find: Pimelea neo-anglica
replace: Pimelea neoanglica
reason: Fuzzy alignment with accepted canonical name in APC (2022-11-22)
taxonomic_resolution: species
- find: Plantago gaudichaudiana
replace: Plantago gaudichaudii
reason: Fuzzy alignment with accepted canonical name in APC (2022-11-10)
taxonomic_resolution: species
- find: Poa sp.
replace: Poa sp. [Angevin_2011]
reason: Adding dataset_id to genus-level taxon names. (2023-06-16)
taxonomic_resolution: genus
- find: Polyalthia (Wyvur)
replace: Polyalthia sp. (Wyvuri B.P.Hyland RFK2632)
reason: Fuzzy match alignment with species-level canonical name in `APC known`
when everything except first 2 words ignored (2022-11-10)
  taxonomic_resolution: species
Notes:
- Each trait database will have their own conventions for how to align names that cannot be perfectly matched to an accepted/valid taxon concept. The examples and notes provided here indicate the conventions used by AusTraits.
- Poa sp. and Acacia ancistrophylla/sclerophylla are examples of taxon names that can only be aligned to genus. The taxonomic_resolution is therefore specified as genus. The portion of the name that can be aligned to the taxonomic resource must be before the square brackets. Any information within the square brackets is important for uniquely identifying this entry within the trait database, but does not provide additional taxonomic information.
- Polyalthia (Wyvur)
is a poorly formatted phrase name that has been matched to its appropriate syntax in the APC.
The four elements it includes are:
- find: The original name given to the taxon in the original data supplied by the authors.
- replace: The updated taxon name, which should now be aligned to a taxon name within the chosen taxonomic reference.
- reason: Records why the change was implemented, e.g. typos, taxonomic synonyms, and standardising spellings.
- taxonomic_resolution: The rank of the most specific taxon name (or scientific name) to which a submitted original name resolves.
The function metadata_add_taxonomic_change adds single taxonomic updates directly into metadata[["taxonomic_updates"]]:

traits.build::metadata_add_taxonomic_change(current_study,
                                            "find", "replace", "reason",
                                            "taxonomic_resolution")
The function metadata_add_taxonomic_changes_list adds a table of taxonomic updates directly into metadata[["taxonomic_updates"]]. The column headers must be find, replace, reason, and taxonomic_resolution.
traits.build::metadata_add_taxonomic_changes_list(current_study, table_of_substitutions)
Working manually through taxonomic alignments for all datasets in a database can be a huge time sink. The AusTraits team developed the R-package {APCalign}
to automate the process of aligning names of Australian plants to names within the National Species Lists, the APC and APNI. This is supplemented by an AusTraits-specific function build_align_taxon_names
that uses {APCalign}
to automatically add taxonomic_updates
to metadata.yml
files. While these packages/functions are Australian-plant specific, they include code that can be re-purposed for other global regions or taxonomic groups.
For instance,
<- readr::read_csv("config/taxon_list.csv")
taxon_list
<-
names_to_align $taxonomic_updates %>%
database::filter(dataset_id == current_study) %>%
dplyr# next row down might be modified to also filter names in an external taxonomic resource
::filter(!aligned_name %in% taxon_list$aligned_name) %>%
dplyr::filter(is.na(taxonomic_resolution)) %>%
dplyr::distinct(original_name) dplyr
Some of these names will require alignments and others might be truly garbage (unknown species 1
) and should instead be excluded.
Excluded observations
metadata[["exclude_observations"]]
is a metadata section for excluding specific variable (column) values. It is most often used to exclude specific taxon names, but could be used for locations
, trait_name
, etc. These are values that are in the data.csv
file but should be excluded from AusTraits.
metadata[["exclude_observations"]]
entries are formatted as:
exclude_observations:
- variable: taxon_name
  find: Campylopus introflexus, Dicranoloma menziesii, Philonotis tenuis, Polytrichastrum
    alpinum, Polytrichum juniperinum, Sphagnum cristatum
  reason: moss (E Wenk, 2020.06.18)
- variable: taxon_name
  find: Xanthoparmelia semiviridis
  reason: lichen (E Wenk, 2020.06.18)
The three elements it includes are:
- variable: A variable from the traits table, typically taxon_name, location_name or context_name.
- find: Value of variable to remove.
- reason: Records why the data were removed, e.g. exotic.

NOTE: Multiple, comma-delimited values can be added under find.
The function metadata_exclude_observations adds single exclusions directly into metadata[["exclude_observations"]]:

traits.build::metadata_exclude_observations(current_study, "variable", "find", "reason")
Questions
The final section of the metadata.yml file is titled questions. This is a location to:

- Ask the data contributor targeted questions about their study. When you generate a report (described below) these questions will appear at the top of the report. Preface the first question with contributor: (indented once), and additional questions with question2:, etc. Use this to:
  - Ask contributors about missing metadata.
  - Point contributors' attention to odd data distributions, to make sure they look at those traits extra carefully.
  - Let contributors know if you're uncertain about their units or if you transformed the data in a fairly major way.
  - Ask the contributors if you're uncertain you aligned their trait names correctly.
- List any trait data that are not yet traits supported by AusTraits. Use the following syntax, indented once: additional_traits:, followed by a list of traits.

A hypothetical example is sketched below.
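All questions and trait names here are invented, purely to illustrate the layout:

questions:
  contributor: Could you confirm the units for leaf area? The values are 10-fold
    smaller than in comparable studies.
  question2: Are the seed mass values means across all collected seeds?
  additional_traits: leaf toughness, bark thickness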
Hooray! You now have a fully propagated metadata.yml file!
Next is making sure it has captured all the data exactly as you’ve intended.
24.6 Quality checks
If you haven't done so already, assign the dataset_id to a variable, current_study:
<- "Wright_2001" current_study
This lets you keep a standard list of checks to run for each study; you then only need to reassign a new dataset_id to current_study.
Clear formatting
The clear formatting code below reads and re-writes the yaml file. The same process is repeated when running functions that automatically add substitutions or check taxonomy. Running it first ensures that any formatting issues introduced (or fixed) during the read/write process are identified and resolved first.
For instance, the write_metadata
function inserts line breaks every 80 characters and reworks other line breaks (except in custom_R_code). It also reformats special characters in the text, substituting in its accepted format for degree symbols, en-dashes, em-dashes and quotes, and substituting in Unicode codes for more obscure symbols.
f <- file.path("data", current_study, "metadata.yml")
traits.build::read_metadata(f) %>% traits.build::write_metadata(f)
Running tests
An extensive dataset test protocol ensures:

- the metadata.yml file is complete and properly formatted
- details entered into the metadata.yml file match those in the accompanying data.csv file (column names, values for locations, contexts)
- details for each trait (trait name, categorical trait values) match those in the trait dictionary

Certain special characters may show up as errors and need to be manually adjusted in the metadata.yml file.
To run the dataset tests:

# Run tests on one study
traits.build::dataset_test(current_study)

# Run tests on all studies
traits.build::dataset_test(dir("data"))
Messages identify errors in the dataset, hopefully pointing you quickly to the changes that are required.
Do not be disheartened by errors when you first run tests on a newly entered dataset - even after adding hundreds of datasets to AusTraits it is very rare to have zero errors on a first run of dataset_test().
Fix as many errors as you can and then rerun dataset_test() repeatedly until no errors remain.
You may want to fix errors in tandem with building the new dataset, for instance to quickly compile a list of trait values requiring substitutions or taxon names requiring taxonomic updates.
See the common issues chapter for solutions to common issues, such as:
- dataset not pivoting
- unsupported trait values
Rebuild AusTraits
To continue your checks, it is necessary to rebuild your database. Until tests come back clean, you can simply build the new dataset:
traits.build::build_setup_pipeline(method = "remake", database_name = "database")
database <- remake::make(current_study)
To continue on to building the dataset report, you need to rebuild the entire database:
traits.build::build_setup_pipeline(method = "remake", database_name = "database")
database <- remake::make("database")
Check excluded data
AusTraits automatically excludes measurements for a number of reasons. Data might be excluded for legitimate reasons (a value far out of range for a numeric trait) or because of a problem that needs fixing (such as a missing unit conversion or an unsupported trait value).
The excluded rows are available in the data frame database$excluded_data.
Possible reasons for excluding measurements include:

- Missing species name: Species name is missing from the data.csv file for a given row of data. This usually occurs when there are stray characters in the data.csv file below the data – delete these rows.
- Missing unit conversion: A value was present but the appropriate unit conversion was missing. This requires that you add a new unit conversion to the file config/unit_conversions.csv. Add additional conversions near similar unit conversions already in the file for easier searching in the future. (See the sketch after this list.)
- Observation excluded in metadata: Specific values, usually certain taxon names, can be excluded in the metadata. This is generally used when a study includes a number of non-native and non-naturalised species that need to be excluded. These should be intentional exclusions, as they have been added by you.
- Trait name not in trait dictionary: trait_name not listed in config/traits.yml as a trait concept. Double-check you have used the correct spelling/exact syntax for the trait_name, adding a new trait concept to the traits.yml file if appropriate. If there is a trait that is currently unsupported by AusTraits, leave trait_name: .na. Do not fill in an arbitrary name.
- Unsupported trait value: This error, referencing categorical traits, means that the value for a trait is not included in the list of supported trait values for that trait in config/traits.yml. See adding many substitutions if there are many trait values requiring substitutions. If appropriate, add another trait value to the traits.yml file, but confer with other curators, as the lists of trait values have been carefully agreed upon through workshop sessions.
- Value does not convert to numeric: Is there a strange character in the file preventing easy conversion? This error is rare and generally justified.
- Value out of allowable range: This error, referencing numeric traits, means that the trait value, after unit conversions, falls outside of the allowable range specified for that trait in config/traits.yml. Sometimes the AusTraits range is too narrow and other times the author's value is truly an outlier that should be excluded. Look closely at these and adjust the range in config/traits.yml if justified. Generally, don't change the range until you've created a report for the study and confirmed that the general cloud of data aligns with other studies as expected. Most frequently it is the units or unit conversion that is incorrect.
- Value contains unsupported characters: This error appears if there are unusual punctuation characters in the trait name. As such characters do not appear as allowed values within the trait dictionary, these represent transient errors that need to be corrected through either metadata[["substitutions"]] or, occasionally, in the data.csv file.
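For example, to see which trait and unit combinations triggered the Missing unit conversion error (a sketch; it assumes the excluded data table retains the unit column alongside trait_name):

# List the distinct trait/unit pairs lacking a conversion for this study
database$excluded_data %>%
  dplyr::filter(
    dataset_id == current_study,
    error == "Missing unit conversion"
  ) %>%
  dplyr::distinct(trait_name, unit)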
When you are finished running quality checks, no data should be excluded due to Missing unit conversion or Trait name not in trait dictionary, and it should be very rare that any data are excluded due to Missing species name or Value contains unsupported characters.
The dataset curator should be confident that every value that lands in the excluded data table is there legitimately.
For instance, out of the 1.8 million+ records in AusTraits v5.0.0, the excluded data table contains:
| Error                             | Count |
|-----------------------------------|-------|
| Observation excluded in metadata  | 4061  |
| Unsupported trait value           | 354   |
| Value does not convert to numeric | 12    |
| Value out of allowable range      | 427   |
The best way to view excluded data for a study is:
database$excluded_data %>%
  dplyr::filter(
    dataset_id == current_study,
    error != "Observation excluded in metadata"
  ) %>%
  View()
Missing values (blank cells, cells with NA) are not included in the excluded_data
table, because they are assumed to be legitimate blanks. If you want to confirm this, you need to temporarily change the default arguments for the internal function dataset_process
where it is called within the remake.yml
or build.R
file that compiles the database. For instance, the default,
dataset_process("data/Ahrens_2019/data.csv",
  Ahrens_2019_config,
  schema
)
needs to be changed to:
dataset_process("data/Ahrens_2019/data.csv",
  Ahrens_2019_config,
  schema,
  filter_missing_values = FALSE
)
To check how many of each error type are present for a study:
database$excluded_data %>%
  dplyr::filter(dataset_id == current_study) %>%
  dplyr::pull(error) %>%
  table()
Or produce a table of error type by trait:
database$excluded_data %>%
  dplyr::filter(dataset_id == current_study) %>%
  dplyr::select(trait_name, error) %>%
  table()
Build study report
Another important check for each study is building a study report that summarises all metadata and trait data.
Make sure you’ve rebuilt the entire database (not just the new study) before building the report.
database <- remake::make("database")
traits.build::dataset_report(database, current_study, overwrite = TRUE)
NOTES:
- The report will appear in the folder export/reports
- The argument overwrite = TRUE overwrites pre-existing copies of the report in this folder.
Check the study report to ensure:
- All possible metadata fields were filled in
- The locations plot sensibly on the map
- For numeric traits, the trait values plot sensibly relative to other studies
- The list of unknown/unmatched species doesn’t include names you think should be recognised/aligned
If necessary, cycle back through earlier steps to fix any errors, rebuilding the study report as you go.
At the very end, re-clear formatting, re-run the tests, rebuild AusTraits, and rebuild the report.
To generate a report for a collection of studies:
traits.build::dataset_reports(database, c("Falster_2005_1", "Wright_2002"),
  overwrite = TRUE)
Or for all studies:
traits.build::dataset_reports(database, overwrite = TRUE)
(Reports are written in R Markdown and generated via the knitr package. The template is here.)