24  Adding datasets, a lengthy guide

This vignette is an exhaustive reference for adding datasets to a traits.build database.

If you are embarking on building a new database using the traits.build standard, a better place to get started is the set of 7 tutorials.

Then come back to this document for details and unusual dataset circumstances that are not covered in the tutorials.

Other chapters you may want to read include:

24.1 Getting started

The traits.build package offers a workflow to build a harmonised trait database from disparate sources, with different data formats and containing varying metadata.

There are two key components required to merge datasets into a database with a common output structure:

  1. A workflow to wrangle datasets into a standardised input format, using a combination of {traits.build} functions and manual steps.

  2. A process to harmonise information across datasets and build them into a single database.

This document details all the steps to format datasets into a pair of standardised files for input: a tabular data file and a structured metadata file. It includes examples of code you might use.

To begin, install the traits.build package.

# remotes::install_github("traitecoevo/traits.build", quick = TRUE)

library(traits.build) 

24.2 Standardised input files required

Each dataset requires two standardised input files: a tabular data file, data.csv, and a structured metadata file, metadata.yml. The sections below describe how to construct each.

24.3 Create a dataset folder

Add a new folder within the data folder. Its name should be the study’s unique dataset_id.

The preferred format for dataset_id is the surname of the first author of any corresponding publication, followed by the year, as surname_year. E.g. Falster_2005. Wherever there are multiple studies with the same id, we add a suffix _2, _3, etc. E.g. Falster_2005, Falster_2005_2.
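
For instance, a minimal way to create the folder from R (it can equally be made by hand; the dataset_id here is illustrative):

dir.create(file.path("data", "Falster_2005"))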

dataset_id is one of the core identifiers within a traits.build database.

24.4 Constructing the data.csv file

The trait data for each study (dataset_id) must be in a single table, data.csv. The data.csv file can either be in a wide format (1 column for each trait, with the various trait names as the column headers) or long format (a single column for all trait values and an additional column for trait names).

Required columns

  • taxon_name
  • trait_name (many columns for wide format; 1 column for long format)
  • value (trait value; for long format only)
  • location_name (if required)
  • contexts (if required)
  • collection_date (if required)
  • individual_id (if required)
  1. For all field studies, ensure there is a column for location_name. If all measurements were made at a single location, a location_name column can easily be mutated using custom_R_code within the metadata.yml file. See sections adding locations and adding contexts below for more information on compiling location and context data.

  2. If available, be sure to include a column with collection date. If possible, provide dates in yyyy-mm-dd (e.g. 2020-03-05) format or, if the day of the month isn’t known, as yyyy-mm (e.g. 2020-03). However, any format is allowed and the column can be parsed to the proper yyyy-mm-dd format using custom_R_code. If the same collection date applies to the entire study it can be added directly into the metadata.yml file.

  3. If applicable, ensure there are columns for all context properties, including experimental treatments, specific differences in method, a stratified sampling scheme within a plot, or sampling season. Additional context columns could be added through custom_R_code or keyed in where traits are added, but it is best to include a column in the data.csv file whenever possible. The protocol for adding context properties to the metadata file is under adding contexts.

Data may need to be summarised

Data submitted by a contributor should be in the rawest form possible; always request data with individual measurements over location/species means.

Some datasets include replicate measurements on an individual at a single point in time, such as the leaf area of 5 individual leaves. In AusTraits (the Australian plant trait database) we generally merge such measurements into an individual mean in the data.csv file, but the raw values are preserved in the contributor’s raw data files. Be sure to calculate the number of replicates that contributed to each mean value.

When there is just a single column of trait values to summarise, use:

readr::read_csv("data/dataset_id/raw/raw_data.csv") %>%
  dplyr::group_by(individual, `species name`, location, context) %>%  # list all columns to retain
  dplyr::summarise(
    leaf_area_mean = mean(leaf_area),
    leaf_area_replicates = n()
    ) %>%
  dplyr::ungroup()

Make sure you group_by all categorical variables you want to retain, as only columns that are grouping variables will be kept.

When you want to take the mean of multiple data columns simultaneously, use:

readr::read_csv("data/dataset_id/raw/raw_data.csv") %>%
  dplyr::group_by(individual, `species name`, location, context) %>%  # list all columns to retain
  dplyr::summarise(
    across(c(leaf_area, `leaf N`), ~ mean(.x, na.rm = TRUE)),
    across(c(growth_form, `photosynthetic pathway`), ~ first(.x)),
    replicates = n()
  ) %>%
  dplyr::ungroup()

{dplyr} hints:

  • Categorical variables not included as grouping variables will return NA.
  • Generally use the function first for categorical variables - it simply retains the first value in the column.
  • You can identify runs of columns by column number/position. For instance, across(c(5:25), ~ mean(.x, na.rm = TRUE)) or across(c(leaf_area:leaf_N), ~ mean(.x, na.rm = TRUE)).
  • Be sure to ungroup at the end.
  • Before summarising, ensure variables you expect to be numeric are indeed numeric: utils::str(data). A sketch combining these checks follows.
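
For instance, a minimal sketch of these pre-summarising checks, assuming leaf_area and `leaf N` are columns that should be numeric:

data <- readr::read_csv("data/dataset_id/raw/raw_data.csv")

# Inspect column types before summarising
utils::str(data)

# Coerce columns that were read in as character to numeric
data <- data %>%
  dplyr::mutate(dplyr::across(c(leaf_area, `leaf N`), as.numeric))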

Merging multiple spreadsheets

If multiple spreadsheets of data are submitted these must be merged together.

  • If the spreadsheets include different trait measurements made on the same individual (or location means for the same species), they are best merged using dplyr::left_join, specifying all conditions that need to be matched across spreadsheets (e.g. individual, species, location, context). Ensure the column names are identical between spreadsheets or specify columns that need to be matched.
readr::read_csv("data/dataset_id/raw/data_file_1.csv") -> data_1
readr::read_csv("data/dataset_id/raw/data_file_2.csv") -> data_2

data_1 %>% 
  dplyr::left_join(
    data_2,
    by = c("Individual", "Taxon" = "taxon", "Location", "Context")
  )
  • If the spreadsheets include trait measurements for different individuals (or possibly data at different scales - such as individual level data for some traits and species means for other traits), they are best merged using dplyr::bind_rows. Ensure the column names for taxon name, location name, context, individual, and collection date are identical between spreadsheets. If there are data for the same traits in both spreadsheets, make sure those column headers are identical as well.
readr::read_csv("data/dataset_id/raw/data_file_1.csv") -> data_1
readr::read_csv("data/dataset_id/raw/data_file_2.csv") -> data_2

data_1 %>% 
  dplyr::bind_rows(data_2)

Taxon names

Taxon names need to be complete names. If the main data file includes code names, with a key as a separate file, they are best merged now to avoid many individual replacements later.

readr::read_csv("data/dataset_id/raw/species_key.csv") -> species_key
readr::read_csv("data/dataset_id/raw/data_file.csv")  -> data

data %>%
  dplyr::left_join(species_key, by = "code")

Unexpected hangups

  • When Excel saves an .xls file as a .csv file, it only preserves the number of significant figures displayed on the screen. This means that if, for some reason, a column has been set to display very few significant figures, or a column is very narrow, precision is lost.

  • If you're reading a file into R where there are lots of blanks at the beginning of a column of numeric data, the defaults for readr::read_csv fail to register the column as numeric. This is fixed by adding the argument guess_max:

read_csv("data/dataset_id/raw/raw_data.csv", guess_max = 10000)

This makes read_csv check 10,000 rows of data (instead of the default 1,000) before guessing the column type.

(When data.csv files are read in through the {traits.build} workflow, guess_max = 100000.)

24.5 Constructing the metadata.yml file

As described in detail here, the metadata.yml file maps the meanings of the individual columns within the data.csv file and documents all additional dataset metadata.

Before beginning, it is a good idea to look at the two example dataset metadata files in the traits.build-template repository, to become familiar with the general structure.

The sections of the metadata.yml file are: source, contributors, dataset, locations, contexts, traits, substitutions, taxonomic_updates, exclude_observations, and questions.

This document covers these metadata sections in sequence.

Use a proper text editor

  • Install a proper text editor, such as Visual Studio Code (our favorite), RStudio, TextMate, or Sublime Text. Using Microsoft Word will make a mess of the formatting.

Source the {traits.build} functions

To assist you in constructing the metadata.yml file, we have developed functions to help propagate and fill in the different sections of the file.

If you haven’t already, run:

library(traits.build)

The functions for populating the metadata file all begin with metadata_.

A full list is available here.

Creating a template

The first step is to create a blank metadata.yml file.

traits.build::metadata_create_template("Yang_2028")

As each function prompts you to enter the dataset_id, it can be useful to assign the dataset’s id to a variable you can use repeatedly:

current_study <- "Yang_2028"

traits.build::metadata_create_template(current_study)

This function cycles through a series of user-input menus, querying about both the data format (long versus wide) and which columns contain which variables (taxon name, location name, individual identifiers, collection date). It then creates a relatively empty metadata file data/dataset_id/metadata.yml.

The questions are:

  • Is the data long or wide format?

A wide dataset has each variable (i.e. trait) as a separate column. A long dataset has a single column containing all trait values, with an additional column specifying the trait name.

  • Select column for taxon_name
  • Select column for trait_name (long datasets only)
  • Select column for trait values (long datasets only)
  • Select column for location_name

If your data.csv file does not yet have a location_name column, this information can later be added manually.

  • Select column for individual_id (a column that links measurements on the same individual)
  • Select column for collection_date

If your data.csv file does not have a collection_date column, you will be prompted to Enter collection_date range in format ‘2007/2009’. A fixed value in a yyyy, yyyy-mm or yyyy-mm-dd format is accepted, either as a single value or range of values. This information can be edited later.

  • Indicate whether all traits need repeat_measurements_ids

repeat_measurements_ids are only required if the dataset documents response-curve data (e.g. an A-ci or light-response curve for plants, or a temperature response curve for animal or plant behaviour). They can also be added to individual traits (later). They are intended to capture multiple "sub-measurements" that together comprise a single "trait measurement".

Adding a source

The skeletal metadata.yml file created by metadata_create_template includes a section for the primary source with default fields for a journal article.

You can manually enter citation details, but whenever possible, use one of the three functions developed to automatically propagate citation details.

Adding source from a doi

If you have a doi for your study, use the function metadata_add_source_doi:

traits.build::metadata_add_source_doi(dataset_id = current_study, doi = "doi")

The different elements within the source will automatically be generated.

Double check the information added to ensure:

  1. The title is in sentence case.
  2. The information isn't in all caps (sources from a few journals get read in as all caps).
  3. Page numbers are present and include -- between page numbers (for example, 123 -- 134).
  4. If there is a colon (:) or apostrophe (') in a reference, the text for that line must be wrapped in quotes.

By default, details are added as the primary source. If multiple sources are linked to a single dataset_id, you can specify a source as secondary.

traits.build::metadata_add_source_doi(dataset_id = current_study, doi = "doi", 
                                      type = "secondary")
  • Attempting to add a second primary source will overwrite the information already input. Instead, if there is a third resource to add, use type = "secondary_2"
  • Always check the key field, as it can be incorrect for hyphenated last names.
  • If the dataset being entered is a compilation of many original sources, you should add all the original sources, specifying type = "original_01", type = "original_02", etc., as sketched below. See Richards_2008 for an example of a complex source list.
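
For a compilation, the original sources might be added in turn (the dois here are placeholders):

traits.build::metadata_add_source_doi(dataset_id = current_study, doi = "doi_1",
                                      type = "original_01")
traits.build::metadata_add_source_doi(dataset_id = current_study, doi = "doi_2",
                                      type = "original_02")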

Adding source from a bibtex file

traits.build::metadata_add_source_bibtex(dataset_id, file = "myref.bib")

(These options require the packages rcrossref and RefManageR to be installed.)

Proper formatting of different source types

Different source types require different fields and formatting:

Book:

source:
  primary:
      key: Cooper_2013
      bibtype: Book
      year: 2013
      author: Wendy Cooper and William T. Cooper
      title: Australian rainforest fruits
      publisher: CSIRO Publishing
      pages: 272

Online resource:

source:
  primary:
    key: TMAG_2009
    bibtype: Online
    author: '{Tasmanian Herbarium}'
    year: 2009
    title: Flora of Tasmania Online
    publisher: Tasmanian Museum & Art Gallery (Hobart)
    url: http://www.tmag.tas.gov.au/floratasmania

Thesis:

source:
  primary:
      key: Kanowski_2000
      bibtype: Thesis
      year: 1999
      author: John Kanowski
      title: Ecological determinants of the distribution and abundance of the folivorous
        marsupials endemic to the rainforests of the Atherton uplands, north Queensland.
      type: PhD
      institution: James Cook University, Townsville

Unpublished dataset:

source:
  primary:
    key: Ooi_2018
    bibtype: Unpublished
    year: 2018
    author: Mark K. J. Ooi
    title: "Unpublished data: Herbivory survey within Royal National Park, University
      of New South Wales"
  • Note the title of an unpublished dataset must begin with the words "Unpublished data" and include the data collector's affiliation.

Adding contributors

The skeletal metadata.yml file created by the function metadata_create_template includes a template for entering details about data contributors. Edit this manually, duplicating if details for multiple people are required.

  • data_collectors are people who played a key intellectual role in the study’s experimental design and data collection. Most studies have 1-3 data_collectors listed. Four fields of information are required for each data collector: last_name, given_name, affiliation and ORCID (if available). Nominate a single data collector to be the dataset’s point of contact.
  • Additional field assistants can be listed under assistants.
  • The data entry person is listed under dataset_curators.
  • email addresses for the data_collectors are not included in the metadata.yml file, but it is recommended that a database curator maintain a list of email addresses of all data collectors to whom authorship may be extended on a future database data paper. Authorship “rules” will vary across databases, but for AusTraits we extend authorship to all data_collectors who we successfully contact.

For example, in Roderick_2002:

contributors:
  data_collectors:
  - last_name: Roderick
    given_name: Michael
    ORCID: 0000-0002-3630-7739
    affiliation: The Australian National University, Australia
    additional_role: contact
  assistants: Michelle Cochrane
  dataset_curators: Elizabeth Wenk

Custom R code

The goal is always to maintain data.csv files that are as similar as possible to the contributed dataset. However, for many studies there are minor changes we want to make to a dataset before the data.csv file is processed by the {traits.build} workflow. These may include applying a function to transform a particular column of data, a function to filter data, or a function to replace a contributor’s “measurement missing” placeholder symbol with NA. In each case it is appropriate to leave the rawer data in data.csv and edit the data table as it is read into the {traits.build} workflow.

Background

To allow custom modifications to a particular dataset before the common pipeline of operations gets applied, the workflow permits some customised R code to be run as a first step in the processing pipeline. That pipeline (the function process_custom_code called within dataset_process) looks like this:

data <-
  readr::read_csv(filename_data_raw, col_types = cols(), guess_max = 100000, 
                  progress = FALSE) %>%
  process_custom_code(metadata[["dataset"]][["custom_R_code"]])()

The final line shows that the custom code gets applied right after the file is loaded.

Overview of options and syntax

  • A copy of the file containing functions the AusTraits team have explicitly developed for use within the custom_R_code field is available at custom_R_code.R and should be placed within the R folder of your database repository, then sourced (source("R/custom_R_code.R")).
  • Place a single apostrophe (’) at the start and end of your custom R code; this allows you to add line breaks between pipes.
  • Begin your custom R code with data %>%, then apply whatever fixes are needed.
  • Use functions from the packages dplyr, tidyr, stringr (e.g. mutate, rename, summarise, str_detect), but avoid other packages.
  • Alternatively, use the functions we've created explicitly for pre-processing data, sourced through the file custom_R_code.R. You may choose to expand this file within your own database repository.
  • Custom R code is not intended for reading in files. Any reading in and merging of multiple files should be done before creating the dataset’s data.csv file.
  • Use pipes to weave together a single statement where possible. If you need to manipulate/subset the data.csv file into multiple data frames and then bind them back together, you'll need to use semicolons (;) at the end of each statement. A sketch of how a custom_R_code entry sits within metadata.yml follows.
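
A minimal sketch of a custom_R_code entry (the field lives within the dataset section of metadata.yml; the mutate step here is hypothetical):

dataset:
  custom_R_code: '
    data %>%
      dplyr::mutate(
        location_name = "Daintree rainforest"
      )
  '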
Examples of appropriate use of custom R code
  1. Converting times to N/Y strings

Most sources from herbaria record flowering_time and fruiting_time as a span of months, while AusTraits codes these variables as a sequence of 12 N’s and Y’s for the 12 months. A series of functions make this conversion in custom_R_code. These include:

  • format_flowering_months (creates flowering times from start-end pairs)
  • convert_month_range_string_to_binary (converts flowering and fruiting month ranges to 12-element character strings of binary data)
  • convert_month_range_vec_to_binary (converts vectors of month ranges to 12-element character strings of binary data)
  • collapse_multirow_phenology_data_to_binary_vec (converts multi-row phenology data to a 12-digit binary string)
  2. Splitting ranges into min, max pairs

Many datasets from herbaria record traits like leaf_length, leaf_width, seed_length, etc. as a range (e.g. 2-8). The function separate_range separates this data into a pair of columns with minimum and maximum values, which is the preferable way to merge the data into a trait database.

  3. Removing duplicate values within a dataset

Duplicate values within a study need to be filtered out using the custom function replace_duplicates_with_NA.

If a species-level trait value has been entered repeatedly on rows containing individual-level trait measurements, you need to filter out the duplicates. For instance, plant growth form is generally a species-level observation, with the same value on every row with individual-level trait measurements. There are also instances where a population-level numeric trait appears repeatedly, such as when nutrient analyses were performed on a bulked sample at each site.

Before applying the function, you must group by the variable(s) that contain the unique values. This might be at the species or population level. For instance, use group_by(Species, Location) if there are unique values at the species x location level.

data %>%
  dplyr::group_by(Species) %>%
  dplyr::mutate(
    across(c(`leaf_percentN`, `plant growth form`), replace_duplicates_with_NA)
  ) %>%
  dplyr::ungroup()
  4. Removing duplicate values across datasets

Values that were sourced from a different study need to be filtered out. See Duplicates between studies below; functions to automate this process are in progress.

  5. Replacing "missing values" with NAs

If missing data values in a dataset are represented by a symbol, such as 0 or *, these need to be converted to NA’s:

data %>% 
  dplyr::mutate(
    across(c(`height (cm)`, `leaf area (mm2)`), ~ na_if(., 0))
  )
  6. Mapping data from one trait to a second trait, part 1

If a subset of data in a column are also values for a second trait in AusTraits, some data values can be duplicated into a second temporary column. In the example below, some data in the contributor’s fruit_type column also apply to the trait fruit_fleshiness in AusTraits:

data %>% 
  dplyr::mutate(
    fruit_fleshiness = ifelse(`fruit type` == "pome", "fleshy", NA)
  )

The function move_values_to_new_trait is being developed to automate this and currently resides in the custom_R_code.R file within the austraits.build repository.

  7. Mapping data from one trait to a second trait, part 2

If a subset of data in a column are instead values for a second trait in AusTraits, some data values can be moved to a second column (second trait), also using the function move_values_to_new_trait. In the example below, some data in the contributor's growth_form column only apply to the trait parasitic in AusTraits. Note you need to create a blank variable before moving the trait values.

data %>%
  dplyr::mutate(parasitic = NA_character_) %>%
  move_values_to_new_trait(
    original_trait = "growth form",
    new_trait = "parasitic",
    original_values = "parasitic",
    values_for_new_trait = "parasitic",
    values_to_keep = "xx") %>%
  dplyr::mutate(across(c(`growth form`), ~ na_if(.x, "xx")))

or

data %>%
  dplyr::mutate(dispersal_appendage = NA_character_) %>%
  move_values_to_new_trait(
    "fruits", "dispersal_appendage",
    c("dry & winged", "enclosed in aril"),
    c("wings", "aril"),
    c("xx", "enclosed")
  ) %>%
  dplyr::mutate(across(c(fruits), ~ na_if(.x, "xx")))
  • Note, the parameter values_to_keep doesn’t accept NA, leading to the clunky coding. This bug is known, but we haven’t managed to fix it.
  8. Mutating a new trait from other traits

If the data.csv file includes raw data that you want to manipulate into a trait, or the contributor presents the data in a different formulation than AusTraits, you may choose to mutate a new column, containing a new trait.

data %>% 
  dplyr::mutate(
    root_mass_fraction = `root mass` / (`root mass` + `shoot mass`)
  )
  9. Mutating a location name column

If the dataset has location information, but lacks unique location names (or any location name), you might mutate a location name column to map in. (See also Adding location details).

data %>%
  dplyr::mutate(
    location_name = ifelse(location_name == "Mt Field" & habitat == "Montane rainforest", 
                           "Mt Field_wet", location_name),
    location_name = ifelse(location_name == "Mt Field" & habitat == "Dry sclerophyll", 
                           "Mt Field_dry", location_name)
  )

or

data %>% 
  dplyr::mutate(
    location_name = dplyr::case_when(
      longitude == 151.233056 ~ "heath",
      longitude == 151.245833 ~ "terrace",
      longitude == 151.2917 ~ "diatreme"
    )
  )
# Note with `dplyr::case_when`, 
# any rows that do not match any of the conditions become `NA`'s.

or

data %>% 
  dplyr::mutate(
    location_name = paste0("lat_", round(latitude, 3), "_long_", round(longitude, 3))
  )
  10. Generating measurement_remarks

Sometimes there is a notes column with abbreviated information about individual rows of data that is not appropriate to map as a context. This information can instead be included in the field measurement_remarks:

data %>%
  dplyr::mutate(
    measurement_remarks = paste0("maternal lineage ", Mother)
  )
  11. Reformatting dates

You can reformat collection_date values to conform to the yyyy-mm-dd format, or add a date column.

Converting from any mdy format to yyyy-mm-dd (e.g. Dec 3 2015 to 2015-12-03)

data %>% 
  dplyr::mutate(
    Date = Date %>% lubridate::mdy()
    )

Converting from any dmy format to yyyy-mm-dd (e.g. 3-12-2015 to 2015-12-03)

data %>% 
  dplyr::mutate(
    Date = Date %>% lubridate::dmy()
    )

Converting from a mmm-yyyy (string) format to yyyy-mm (e.g. Dec 2015 to 2015-12)

data %>% 
  dplyr::mutate(
    Date = 
      lubridate::parse_date_time(Date, orders = "my") %>% 
      base::format.Date("%Y-%m")
    )

Converting from a mdy format to yyyy-mm (e.g. Excel has reinterpreted the data as full dates 12-01-2015 but the resolution should be “month” 2015-12)

data %>% 
  dplyr::mutate(
    Date = 
      lubridate::parse_date_time(Date, orders = "mdy") %>% 
      base::format.Date("%Y-%m")
    )

A particularly complicated example, where some dates are presented as yyyy-mm and others as yyyy-mm-dd:

data %>%
    dplyr::mutate(
      weird_date = ifelse(stringr::str_detect(gathering_date, "^[0-9]{4}"), 
                          gathering_date, NA),
      gathering_date = gathering_date %>% 
          lubridate::mdy(quiet = TRUE) %>% as.character(),
      gathering_date = dplyr::coalesce(gathering_date, weird_date)
    ) %>%
    dplyr::select(-weird_date)

Testing your custom R code

After you've added the custom R code to the metadata.yml file, check that the output is indeed as intended:

metadata_check_custom_R_code("Blackman_2010")

Fill in metadata[["dataset"]]

The dataset section includes fields that are:

  1. filled in automatically by the function metadata_create_template()
  2. mandatory fields that need to be filled in manually for all datasets
  3. optional fields that are included and filled in only for a subset of datasets

fields automatically filled in

  • data_is_long_format yes/no

  • taxon_name

  • location_name

  • collection_date If this is not read in as a specified column, it needs to be filled in manually as start date/end date in yyyy-mm-dd, yyyy-mm, or yyyy format, depending on the relevant resolution. If the collection dates are unknown, write unknown/publication year, as in unknown/2022

  • individual_id individual_id is one of the fields that can be read in during metadata_create_template. However, you may instead mutate your own individual_id using custom_R_code and add it in manually (see the sketch after this list). For a wide dataset, individual_id is required any time there are multiple rows of data for the same individual and you want to keep these linked. This field should only be included if it is required.

    WARNING If you have an entry individual_id: unknown this assigns all rows of data to an individual named “unknown” and the entire dataset will be assumed to be from a single individual. This is why it is essential to omit this field if there isn’t an actual row of data being read in.
    NOTE For individual-level measurements, each row of data is presumed to be a different individual during dataset processing. Individual_id is only required if there are multiple rows of data (long or wide format) with information for the same individual.

  • repeat_measurements_id repeat_measurements_ids are sequential integer identifiers assigned to a sequence of measurements on a single trait that together represent a single observation (and are assigned a single observation_id by the traits.build pipeline). The assumption is that these are measurements that document points on a response curve. The function metadata_create_template offers an option to add it to metadata[["dataset"]], but it can alternatively be specified under specific traits, as repeat_measurements_id: TRUE
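
As noted for individual_id above, a minimal custom_R_code sketch that builds an identifier from existing columns (the column names here are hypothetical):

data %>%
  dplyr::mutate(
    individual_id = paste(site, plot, tree_number, sep = "_")
  )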

required fields manually filled in

  • description: 1-2 sentence description of the study’s goals. The abstract of a manuscript usually includes some good sentences/phrases to borrow.

  • basis_of_record: Basis of record can be coded in as a fixed value for an entire dataset, by trait, by location, or read in from a column in the data.csv file. If it is being read in from a column, list the column name in the field; otherwise input the fixed value. Allowable values are: field, field_experiment, captive_cultivated, lab, preserved_specimen, and literature. See the database structure vignette for definitions of these accepted basis_of_record values. If fixed values are specified for both the entire dataset under metadata[["dataset"]] and for specific locations/traits under metadata[["locations"]] or metadata[["traits"]], the location/trait value overrides that entered under metadata[["dataset"]].

  • life_stage: Life stage can be coded in as a fixed value for an entire dataset, by trait, by location, or read in from a column in the data.csv file. If it is being read in from a column, list the column name in the field; otherwise input the fixed value. Allowable values are: adult, sapling, seedling, juvenile. See the database structure vignette for definitions of these accepted life_stage values. If fixed values are specified for both the entire dataset under metadata[["dataset"]] and for specific locations/traits under metadata[["locations"]] or metadata[["traits"]], the location/trait value overrides that entered under metadata[["dataset"]].

  • sampling_strategy: Often a quite long description of the sampling strategy, extracted verbatim from a manuscript whenever possible.

  • original_file: The name of the file initially submitted to the database curators. It is generally archived in the dataset folder, in a subfolder named raw. For AusTraits, datasets are also usually archived in the project's GoogleDrive folder.

  • notes: Notes about the study and processing of data, especially if there were complications or if some data were suspected duplicates of another study and were filtered out.

optional fields manually filled in

  • measurement_remarks: Measurement remarks is a field to capture a miscellaneous notes column. This should be information that is not captured by trait methods (which is fixed to a single value for a trait) or as a context. Measurement_remarks can be coded in as a fixed value for an entire dataset, by trait, by location or read in from a column in the data.csv file.

  • entity_type is standardly added to each trait and is described below under traits, but a fixed value or column can alternatively be read in under metadata[["dataset"]].

Adding location details

Location data includes location names, latitude/longitude coordinates, verbal location descriptions, and any additional abiotic/biotic location variables provided by the contributor (or in the accompanying manuscript). For studies with more than a few locations, it is most efficient to create a table of this data that is automatically read into the metadata.yml file.

The function metadata_add_locations automatically propagates location information from a stand-alone location properties table into metadata[["locations"]]:

locations <- read_csv("data/dataset_id/raw/locations.csv")
traits.build::metadata_add_locations(current_study, locations)

The function metadata_add_locations first prompts the user to identify the column with the location name and then to list all columns that contain location data. This automatically fills in the location component on the metadata file.

Rules for formatting a locations table to read in:

  1. Location names must be identical (including syntax, case) to those in data.csv

  2. Column headers for latitude and longitude data must read latitude (deg) and longitude (deg)

  3. Latitude and longitude must be in decimal degrees (i.e. -46.5832). There are many online converters to convert from degrees, minutes, seconds format or from UTM. Or use the following formula, sketched as a helper function after this list: decimal_degrees = degrees + (minutes/60) + (seconds/3600)

  4. If there is a column with a general vegetation description (i.e. rainforest, coastal heath), it should be titled description

  5. Although location properties are not restricted to a controlled vocabulary, newly added studies should use the same location property syntax as others whenever possible, to allow future discoverability. To generate a list of location properties already in use:

database$locations %>% dplyr::distinct(location_property)
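
The conversion formula in rule 3 can be wrapped in a small helper (a sketch; negate the result for southern or western hemisphere coordinates):

convert_to_decimal_degrees <- function(degrees, minutes = 0, seconds = 0) {
  degrees + (minutes / 60) + (seconds / 3600)
}

convert_to_decimal_degrees(46, 34, 59.5)  # 46.5832; negate for a southern latitude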

Some examples of syntax to add locations data that exists in different formats.

  • When the main data.csv file has columns for a few location properties:
locations <-
  traits.build::metadata_check_custom_R_code(current_study) %>%
    dplyr::distinct(location_name, latitude, longitude, `veg type`) %>%
    dplyr::rename(dplyr::all_of(c("latitude (deg)" = "latitude",
                                  "longitude (deg)" = "longitude", 
                                  "description" = "veg type")))

traits.build::metadata_add_locations(current_study, locations)
  • If you want to add or edit the data, it is probably easiest to save the locations table, edit it in Excel, then read it back into R.

  • It is possible that you will want to specify life_stage or basis_of_record at the location level. When required, it is usually easiest to manually add these fields to some or all locations.

Adding contexts

The dictionary definition of a context is the situation within which something exists or happens, and that can help explain it. This is exactly what context properties are in AusTraits: ancillary information that is important for explaining and understanding a trait value.

AusTraits recognises 5 categories of contexts:

  • treatment contexts Context property that is a feature of a plot (subset of a location) that might affect the trait values measured on an individual, population or species-level entity. Examples include soil nutrient manipulations, growing temperatures, or CO2 enhancement.
  • plot contexts Context property that is a feature of a plot (subset of a location) that might affect the trait values measured on an individual, population or species-level entity. Examples are a property that is stratified within a "geographic location", such as topographic position. Plots are of course locations themselves; whether something is a location or a plot context depends on the geographic resolution a dataset collector has applied to their locations.
  • entity contexts Context property that is information about an organismal entity (individual, population or taxon) that does not comprise a trait-centered observation but might affect the trait values measured on the entity. This might be the entity’s sex, caste (for social insects), or host plant (for insects).
  • temporal contexts Context property that is a feature of a “point in time” that might affect the trait values measured on an individual, population or species-level entity. They generally represent repeat measurements on the same entity across time and may simply be numbered observations or might be explicitly linked to growing season or time of day.
  • method contexts Context property that records specific information about a measurement method that is modified between measurements. These might be samples from different canopy light environments, different leaf ages, or sapwood samples from different branch diameters.

Context properties are not restricted to a controlled vocabulary. However, newly added studies should use the same context property syntax as others whenever possible, to allow future discoverability. To generate a list of terms already used under context_property, use:

database$contexts %>% 
  dplyr::distinct(context_property, category)

Context properties are most easily read into the metadata.yml file with the dedicated function:

traits.build::metadata_add_contexts(dataset_id)

The function first displays a list of all data columns (from the data.csv file) and prompts you to select those that are context properties.

  1. For each column you are asked to indicate its category (those described above).

  2. You are shown a list of the unique values present in the data column and asked if these require any substitutions. (y/n)

  3. You are asked if descriptions are required for the context property values (y/n)

This function then adds the contexts to the metadata[["contexts"]] section.

If you selected both substitutions and descriptions required:

- context_property: unknown
  category: temporal_context
  var_in: month
  values:
  - find: AUG
    value: unknown
    description: unknown
  - find: DEC
    value: unknown
    description: unknown
  - find: FEB
    value: unknown
    description: unknown
- context_property: unknown
  category: treatment_context
  var_in: CO2_Treat
  values:
  - find: ambient CO2
    value: unknown
    description: unknown
  - find: added CO2
    value: unknown
    description: unknown

If you selected just substitutions required:

- context_property: unknown
  category: temporal_context
  var_in: month
  values:
  - find: AUG
    value: unknown
  - find: DEC
    value: unknown
  - find: FEB
    value: unknown
- context_property: unknown
  category: treatment_context
  var_in: CO2_Treat
  values:
  - find: ambient CO2
    value: unknown
  - find: added CO2
    value: unknown

If you selected neither substitutions nor descriptions required:

- context_property: unknown
  category: temporal_context
  var_in: month
- context_property: unknown
  category: treatment_context
  var_in: CO2_Treat
  • You must then manually fill in the fields designated as unknown.
  • If there is a value in a column that is not a context property, set its value to value: .na

If there are additional context properties that were designated in the traits section, these will have to be added manually, as this information is not captured in a column that is read in. A final output might be:

- context_property: sampling season
  category: temporal_context
  var_in: month
  values:
  - find: AUG
    value: August
    description: August (late winter)
  - find: DEC
    value: December
    description: December (early summer)
  - find: FEB
    value: February
    description: February (late summer)
- context_property: CO2 treatment
  category: treatment_context
  var_in: CO2_Treat
  values:
  - find: ambient CO2
    value: 400 ppm
    description: Plants grown at ambient CO2 (400 ppm).
  - find: added CO2
    value: 640 ppm
    description: Plants grown at elevated CO2 (640 ppm); 240 ppm above ambient.
- context_property: measurement temperature
  category: method_context
  var_in: method_context          # this field would be included in the relevant traits
  values:
  - value: 20°C                   # this value would be keyed in through the relevant traits
    description: Measurement made at 20°C
  - value: 25°C
    description: Measurement made at 25°C

Adding traits

The function metadata_add_traits() adds a scaffold for trait metadata to the skeletal metadata.yml file.

metadata_add_traits(current_study)

You will be asked to indicate which columns include trait data.

This automatically propagates the following metadata fields for each trait selected into metadata[["traits"]]. var_in is the name of a column in the data.csv file (for wide datasets) or a unique trait name in the trait_name column (for a long dataset):

- var_in: leaf area (mm2)
  unit_in: .na
  trait_name: .na
  entity_type: .na
  value_type: .na
  basis_of_value: .na
  replicates: .na
  methods: .na

The trait details then need to be filled in manually.

  • unit_in: fill in the units associated with the trait values in the submitted dataset - such as mm2 in the example above. If you're uncertain about the syntax/format used for some more complex units, look through the traits definition file (config/traits.yml) or the file showing unit conversions (config/unit_conversions.csv). For categorical variables, leave this as .na.

AusTraits uses the Unified Code for Units of Measure (UCUM) standard for units (https://ucum.org/ucum), but each database using the traits.build workflow can select their own choices for unit abbreviations. The UCUM standard follows clear, simple rules, but also has a flexible syntax for documenting notes that are recorded as part of the ‘unit’ for specific traits, yet are not formally units, in curly brackets. For instance, {count}/mm2 or umol{CO2}/m2/s, where the actual units are 1/mm2 and umol/m2/s. There are a few not-very-intuitive units in UCUM. a is year (annum).

Notes:
- If the units start with a punctuation symbol, the units must be in single, straight quotes, such as: unit_in: '{count}/mm2'
- It is best not to start units with a - (negative sign). In AusTraits we’ve adopted the convention of using, for instance, neg_MPa instead of -MPa
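
Pulling these notes together, some illustrative unit_in entries (each would sit within its own trait's metadata; the trait associations here are hypothetical):

unit_in: mm2              # e.g. leaf area
unit_in: '{count}/mm2'    # e.g. stomatal density; quoted because it begins with punctuation
unit_in: umol{CO2}/m2/s   # e.g. photosynthetic rate
unit_in: neg_MPa          # e.g. water potential, following the AusTraits convention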

  • trait_name: This is the trait name of the appropriate trait concept in the database's config/traits.yml file. For currently unsupported traits, leave this as .na, but fill in the rest of the metadata and flag this study as having a potential new trait concept. Then, in the future, if an appropriate trait concept is added to the traits.yml file, the data can be read into the database by simply replacing the .na with a trait name. Each database will have its own criteria/rules for adding traits to the trait dictionary, and likely rules that evolve as the trait database grows. In AusTraits, if no appropriate trait concept exists in the trait dictionary, a new trait must be defined within the accompanying AusTraits Plant Dictionary and should only be added if it is clearly a distinct trait concept, can be explicitly defined, and there exists sufficient trait data that the measurements have comparative value.

  • entity_type: Entity type indicates “what” is being observed for the trait measurements - as in the organismal-level to which the trait measurements apply. As such, entity_type can be individual, population, species, genus, family or order. Metapopulation-level measurements are coded as population and infraspecific taxon-level measurements are coded as species. See the database structure vignette for definitions of these accepted entity_type values.

Note:
- entity_type is about the “organismal-level” to which the trait measurement refers; this is separate from the taxonomic resolution of the entity’s name.

  • value_type: Value type indicates the statistical nature of the trait value recorded. Allowable value types are mean, minimum, maximum, mode, range, raw, and bin. See the database structure vignette for definitions of these accepted value types. All categorical traits are generally scored as being a mode, the most commonly observed value. Note that for values that are bins, the two numbers are separated by a double-hyphen, 1 -- 10.

  • basis_of_value: Basis of value indicates how a value was determined. Allowable terms are measurement, expert_score, model_derived, and literature. See the database structure vignette for definitions of these accepted basis_of_value values, but most categorical traits measurements are values that have been scored by an expert (expert_score) and most numeric trait values are measurements.

  • replicates: Fill in with the appropriate number of measurements that comprise each value.

- If the values are raw values (i.e. a measurement of an individual), replicates: 1.
- If the values are, for instance, means of 5 leaves from an individual, replicates: 5.
- If there is just a single population-level value for a trait that comprises measurements on 5 individuals, replicates: 5.
- For categorical variables, leave this as .na.
- If there is a column that specifies replicate number, you can list the column name in the field.

  • methods: This information can usually be copied verbatim from a manuscript and is a textual description of all components of the method used to measure the trait.

In general, methods sections extracted from pdfs include “special characters” (non-UTF-8 characters). Non-English alphabet characters are recognised (e.g. é, ö) and should remain unchanged. Other characters will be re-formatted during the study input process, so double check that degree symbols (º), en-dashes (–), em-dashes (—), and curly quotes (‘,’,“,”) have been maintained or reformatted with a suitable alternative. Greek letters and some other characters are replaced with their Unicode equivalent (e.g. <U+03A8> replaces Psi (Ψ)); for these it is best to replace the symbol with an interpretable English-character equivalent.

If there are two columns of data with measurements for the same trait using completely different methods, simply add the respective methods to the metadata for the respective columns. A method_id counter will be added to these during processing to ensure the correct trait values are linked to the correct methods. This is separate from method_contexts, which are minor tweaks to the methods between measurements that are expected to have corresponding effects on trait values (see below).

NOTE:
- If identical methods apply to a string of traits, for the first trait use the following syntax, where the &leaf_length_method notation assigns the remaining text in the field to the label leaf_length_method.

  methods: &leaf_length_method All measurements were from dry herbarium 
    collections, with leaf and bracteole measurements taken from the largest 
    of these structures on each specimen.

Then, for the next trait that uses this method, you can just include the reference below. At the end of processing you can read/write the yml file (as sketched below the code) and this will fill in the assigned text throughout.

  methods: *leaf_length_method
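
As referenced above, re-reading and re-writing the metadata file expands the anchor into the full methods text throughout:

f <- file.path("data", current_study, "metadata.yml")
traits.build::read_metadata(f) %>% traits.build::write_metadata(f)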

In addition to the automatically propagated fields, there are a number of optional fields you can add if appropriate.

  • life_stage If all measurements in a dataset were made on plants of the same life stage a global value should be entered under metadata[["dataset"]]. However if different traits were measured at different life stages you can specify a unique life stage for each trait or indicate a column where this information is stored.

  • basis_of_record If all measurements in a dataset represent the same basis_of_record a global value should be entered under metadata[["dataset"]]. However if different traits have different basis_of_record values you can specify a unique basis_of_record value for each trait or indicate a column where this information is stored.

  • measurement_remarks: Measurement remarks is a field to indicate miscellaneous comments. If these comments only apply to specific trait(s), this field should be specified within those traits' metadata sections. This is meant to be information that is not captured by "methods" (which is fixed to a single value for a trait).

  • method_context If different columns in a wide data.csv file indicate measurements on the same trait using different methods, this needs to be designated. At the bottom of the trait's metadata, add a method_context_name field (e.g. method_context or leaf_age_type are good options). Write a word or short phrase that indicates the method context property value that applies to that trait (data column). For instance, one trait might have method_context: fully expanded leaves and a second trait's entry might have the same trait name and methods, but method_context: leaves still expanding. The method context details must also be added to the contexts section. A sketch follows after this list.

  • temporal_context If different columns in a wide data.csv file indicate measurements on the same trait, on the same individuals at different points in time, this needs to be designated. At the bottom of the trait’s metadata, add a temporal_context_name field (e.g. temporal_context or measurement_time_of_day work well). Write a word or short phrase that indicates which temporal context applies to that trait (data column). For instance, one trait might have temporal_context: dry season and a second entry with the same trait name and method might have temporal_context: after rain. The temporal context details must also be added to the contexts section.
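
A sketch of two trait entries distinguished by a method context, as referenced above (all trait details here are hypothetical):

- var_in: leaf area expanded (mm2)
  unit_in: mm2
  trait_name: leaf_area
  entity_type: individual
  value_type: raw
  basis_of_value: measurement
  replicates: 1
  methods: &leaf_area_method Leaf area was measured with a flatbed scanner.
  method_context: fully expanded leaves
- var_in: leaf area expanding (mm2)
  unit_in: mm2
  trait_name: leaf_area
  entity_type: individual
  value_type: raw
  basis_of_value: measurement
  replicates: 1
  methods: *leaf_area_method
  method_context: leaves still expanding

The corresponding method context property and its values must then be documented in the contexts section, as in the contexts example earlier.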

Adding substitutions

It is very unlikely that a contributor will use categorical trait values that are entirely identical to those listed as allowed trait values for the corresponding trait concept in the traits.yml file. You need to add substitutions for those that do not exactly align to match the wording and syntax of the trait values in the trait dictionary.

metadata[["substitutions"]] entries are formatted as:

substitutions:
- trait_name: dispersal_appendage
  find: attached carpels
  replace: floral_parts
- trait_name: dispersal_appendage
  find: awn
  replace: bristles
- trait_name: dispersal_appendage
  find: awn bristles
  replace: bristles

The three elements it includes are:
- trait_name is the AusTraits defined trait name.
- find is the trait value used in the data.csv file.
- replace is the trait value supported by AusTraits.

You can manually type substitutions into the metadata.yml file, ensuring the syntax and spacing are accurate.

Alternately, the function metadata_add_substitution adds single substitutions directly into metadata[["substitutions"]]:

traits.build::metadata_add_substitution(current_study, "trait_name", "find", "replace")

Notes:
- Combinations of multiple trait values are allowed - simply list them, space delimited (e.g. shrub tree for a species whose growth form includes both).
- Combinations of multiple trait values are reorganised into alphabetic order in order to collapse into fewer combinations (e.g. “fire_killed resprouts” and “resprouts fire_killed” are alphabetised and hence collapsed into one combination, “fire_killed resprouts”).
- If a trait value is N or Y, it needs to be in single, straight quotes (usually edited later, directly in the metadata.yml file).

If you have many substitutions to add, it is more efficient to create a spreadsheet with a list of all trait_name by trait_value combinations requiring substitutions. The spreadsheet should have four columns with headers dataset_id, trait_name, find and replace. This table can be read directly into the metadata.yml file using the function metadata_add_substitutions_list:

substitutions_to_add <- 
  readr::read_csv("data/dataset_id/raw/substitutions_required.csv")

traits.build::metadata_add_substitutions_list(current_study, substitutions_to_add)

Once you've built the new dataset (see below), you can quickly create a table of all values that require substitutions:

austraits$excluded_data %>%
  filter(
    dataset_id == current_study,
    error == "Unsupported trait value"
  ) %>%
  distinct(dataset_id, trait_name, value) %>%
  rename("find" = "value") %>%
  select(-dataset_id) %>%
  write_csv("data/dataset_id/raw/substitutions_required.csv")

Manually add the aligned values in Excel, then:

substitutions_to_add <-
  readr::read_csv("data/dataset_id/raw/substitutions_required_after_editing.csv")

metadata_add_substitutions_list(dataset_id, substitutions_to_add)

Adding taxonomic updates

metadata[["taxonomic_updates"]] is a metadata section to document edits to taxonomic names, aligning the names submitted by the dataset contributor with taxon names in the database's master list of taxonomic resources, config/taxon_list.csv. This includes correcting typos, standardising syntax (punctuation, abbreviations used for words like subspecies), and reformatting names to adhere to the taxonomic standards for a specific taxon group and a specific database's rules.

metadata[["taxonomic_updates"]] entries are formatted as:

taxonomic_updates:
- find: Acacia ancistrophylla/sclerophylla
  replace: Acacia sp. [Acacia ancistrophylla/sclerophylla; White_2020]
  reason: Rewording taxon where `/` indicates uncertain species identification 
    to align with `APC accepted` genus (2022-11-10)
  taxonomic_resolution: genus
- find: Pimelea neo-anglica
  replace: Pimelea neoanglica
  reason: Fuzzy alignment with accepted canonical name in APC (2022-11-22)
  taxonomic_resolution: species
- find: Plantago gaudichaudiana
  replace: Plantago gaudichaudii
  reason: Fuzzy alignment with accepted canonical name in APC (2022-11-10)
  taxonomic_resolution: species
- find: Poa sp.
  replace: Poa sp. [Angevin_2011]
  reason: Adding dataset_id to genus-level taxon names. (2023-06-16)
  taxonomic_resolution: genus
- find: Polyalthia (Wyvur)
  replace: Polyalthia sp. (Wyvuri B.P.Hyland RFK2632)
  reason: Fuzzy match alignment with species-level canonical name in `APC known` 
    when everything except first 2 words ignored (2022-11-10)
  taxonomic_resolution: species

Notes:
- Each trait database will have their own conventions for how to align names that cannot be perfectly matched to an accepted/valid taxon concept. The examples and notes provided here indicate the conventions used by AusTraits.
- Poa sp. and Acacia ancistrophylla/sclerophylla are examples of taxon names that can only be aligned to genus. The taxonomic_resolution is therefore specified as genus. The portion of the name that can be aligned to the taxonomic resource must be before the square brackets. Any information within the square brackets is important for uniquely identifying this entry within the trait database, but does not provide additional taxonomic information.
- Polyalthia (Wyvur) is a poorly formatted phrase name that has been matched to its appropriate syntax in the APC.

The four elements it includes are:
- find: The original name given to taxon in the original data supplied by the authors.
- replace: The updated taxon name that should now be aligned to a taxon name within the chosen taxonomic reference.
- reason: Records why the change was implemented, e.g. typos, taxonomic synonyms, and standardising spellings.
- taxonomic_resolution: The rank of the most specific taxon name (or scientific name) to which a submitted original name resolves.

The function metadata_add_taxonomic_change adds single taxonomic updates directly into metadata[["taxonomic_updates"]]:

traits.build::metadata_add_taxonomic_change(current_study, 
                                            "find", "replace", "reason", 
                                            "taxonomic_resolution")

The function metadata_add_taxonomic_changes_list adds a table of taxonomic updates directly into metadata[["taxonomic_updates"]]. The column headers must be find, replace, reason, and taxonomic_resolution.

traits.build::metadata_add_taxonomic_changes_list(current_study, table_of_substitutions)

Working manually through taxonomic alignments for all datasets in a database can be a huge time sink. The AusTraits team developed the R-package {APCalign} to automate the process of aligning names of Australian plants to names within the National Species Lists, the APC and APNI. This is supplemented by an AusTraits-specific function build_align_taxon_names that uses {APCalign} to automatically add taxonomic_updates to metadata.yml files. While these packages/functions are Australian-plant specific, they include code that can be re-purposed for other global regions or taxonomic groups.

For instance,

taxon_list <- readr::read_csv("config/taxon_list.csv")

names_to_align <- 
  database$taxonomic_updates %>%
    dplyr::filter(dataset_id == current_study) %>%
    # next row down might be modified to also filter names in an external taxonomic resource
    dplyr::filter(!aligned_name %in% taxon_list$aligned_name) %>%
    dplyr::filter(is.na(taxonomic_resolution)) %>%
    dplyr::distinct(original_name)

Some of these names will require alignments and others might be truly garbage (unknown species 1) and should instead be excluded.

Excluded observations

metadata[["exclude_observations"]] is a metadata section for excluding specific variable (column) values. It is most often used to exclude specific taxon names, but could be used for locations, trait_name, etc. These are values that are in the data.csv file but should be excluded from AusTraits.

metadata[["exclude_observations"]] entries are formatted as:

exclude_observations:
- variable: taxon_name
  find: Campylopus introflexus, Dicranoloma menziesii, Philonotis tenuis, Polytrichastrum
    alpinum, Polytrichum juniperinum, Sphagnum cristatum
  reason: moss (E Wenk, 2020.06.18)
- variable: taxon_name
  find: Xanthoparmelia semiviridis
  reason: lichen (E Wenk, 2020.06.18)

The three elements it includes are:
- variable: A variable from the traits table, typically taxon_name, location_name or context_name
- find: Value of variable to remove
- reason: Records why the data were removed, e.g. exotic

NOTE: Multiple, comma-delimited values can be added under find.

The function metadata_exclude_observations adds single exclusions directly into metadata[["exclude_observations"]]:

traits.build::metadata_exclude_observations(current_study, "variable", "find", "reason")
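For example, excluding a placeholder taxon name (the values here are illustrative):

traits.build::metadata_exclude_observations(current_study, "taxon_name",
                                            "unknown species 1",
                                            "unidentifiable taxon (E Wenk, 2020.06.18)")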

Questions

The final section of the metadata.yml file is titled questions. This is a location to:

  1. Ask the data contributor targeted questions about their study. When you generate a report (described below) these questions will appear at the top of the report.
    • Preface the first question you have with contributor: (indented once), and additional questions with question2:, etc.
    • Ask contributors about missing metadata.
    • Point contributors' attention to odd data distributions, to make sure they look at those traits extra carefully.
    • Let contributors know if you’re uncertain about their units or if you transformed the data in a fairly major way.
    • Ask the contributors if you’re uncertain you aligned their trait names correctly.
  2. This is also the place to list any trait data that are not yet traits supported by AusTraits. Use the following syntax, indented once: additional_traits:, followed by a list of traits (see the sketch below).
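As a sketch, a completed questions section might look like this; the question text and trait names are invented for illustration:

questions:
  contributor: Could you confirm the units for leaf area? The values are 10-fold
    smaller than those in comparable studies.
  question2: Were the plants in the glasshouse experiment ever water-stressed?
  additional_traits: bark thickness, leaf lifespan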

Hooray! You now have a fully populated metadata.yml file!

Next is making sure it has captured all the data exactly as you’ve intended.

24.6 Quality checks

If you haven’t done so already, assign the dataset_id to a variable, current_study:

current_study <- "Wright_2001"

This lets you keep a standard list of tests to run for each study; you then only need to reassign current_study to each new dataset_id.

Clear formatting

The clear formatting code below reads and re-writes the yaml file. This is the same process that is repeated when running functions that automatically add substitutions or check taxonomy. Running it first ensures that any formatting issues introduced (or fixed) during the read/write process are identified and solved first.

For instance, the write_metadata function inserts line breaks every 80 characters and reworks other line breaks (except in custom_R_code). It also reformats special characters in the text, substituting in its accepted format for degree symbols, en-dashes, em-dashes and quotes, and substituting in Unicode codes for more obscure symbols.

f <- file.path("data", current_study, "metadata.yml")
traits.build::read_metadata(f) %>% traits.build::write_metadata(f)

Running tests

An extensive dataset test protocol ensures:
- the metadata.yml file is complete and properly formatted
- details entered into the metadata.yml file match those in the accompanying data.csv file (column names, values for locations, contexts)
- details for each trait (trait name, categorical trait values) match those in the trait dictionary

Certain special characters may show up as errors and need to be manually adjusted in the metadata.yml file.

To run the dataset tests,

# Run tests on a single study
traits.build::dataset_test(current_study)

# Run tests on all studies
traits.build::dataset_test(dir("data"))

Messages identify errors in the dataset, hopefully pointing you quickly to the changes that are required.

Do not be disheartened by errors when you first run tests on a newly entered dataset - even after adding hundreds of datasets to AusTraits it is very rare to have zero errors on a first run of dataset_test().

Fix as many errors as you can and then rerun dataset_test() repeatedly until no errors remain.

You may want to fix errors in tandem with building the new dataset, for instance to quickly compile a list of trait values requiring substitutions or taxon names requiring taxonomic updates.

See the common issues chapter for solutions to common issues, such as:
- dataset not pivoting (see the sketch below)
- unsupported trait values
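For instance, a dataset usually fails to pivot because multiple measurements map to the same observation. Once the dataset has been built (see the next section), a minimal sketch to locate the offending rows, assuming duplicates across observation_id and trait_name are the cause:

database$traits %>%
  dplyr::filter(dataset_id == current_study) %>%
  dplyr::count(observation_id, trait_name) %>%
  dplyr::filter(n > 1)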

Rebuild AusTraits

To continue your checks it is necessary to rebuild your database.

Until tests come back clean you can simply build the new dataset:

traits.build::build_setup_pipeline(method = "remake", database_name = "database")
database <- remake::make(current_study)

To continue on to building the dataset report, you need to rebuild the entire database:

traits.build::build_setup_pipeline(method = "remake", database_name = "database")
database <- remake::make("database")

Check excluded data

AusTraits automatically excludes measurements for a number of reasons. Data might be excluded for legitimate reasons (e.g. a value far out of range for a numeric trait) or because of a problem that still needs fixing (e.g. a missing unit conversion).

These are available in the data frame database$excluded_data.

Possible reasons for excluding measurements include:

  • Missing species name: Species name is missing from data.csv file for a given row of data. This usually occurs when there are stray characters in the data.csv file below the data – delete these rows.

  • Missing unit conversion: Value was present but appropriate unit conversion was missing. This requires that you add a new unit conversion to the file config/unit_conversions.csv. Add additional conversions near similar unit conversions already in the file for easier searching in the future.

  • Observation excluded in metadata: Specific values, usually certain taxon names, can be excluded in the metadata. This is generally used when a study includes a number of non-native and non-naturalised species that need to be excluded. These should be intentional exclusions, as they have been added by you.

  • Trait name not in trait dictionary: trait_name not listed in config/traits.yml as a trait concept. Double check you have used the correct spelling/exact syntax for the trait_name, adding a new trait concept to the traits.yml file if appropriate. If there is a trait that is currently unsupported by AusTraits, leave trait_name: .na. Do not fill in an arbitrary name.

  • Unsupported trait value: This error, referencing categorical traits, means that the value for a trait is not included in the list of supported trait values for that trait in config/traits.yml. See adding many substitutions if there are many trait values requiring substitutions; a sketch follows this list. If appropriate, add another trait value to the traits.yml file, but confer with other curators, as the lists of trait values have been carefully agreed upon through workshop sessions.

  • Value does not convert to numeric: Is there a strange character in the file preventing easy conversion? This error is rare and generally justified.

  • Value out of allowable range: This error, referencing numeric traits, means that the trait value, after unit conversions, falls outside of the allowable range specified for that trait in config/traits.yml. Sometimes the AusTraits range is too narrow and other times the author's value is truly an outlier that should be excluded. Look closely at these and adjust the range in config/traits.yml if justified. Generally, don't change the range until you've created a report for the study and confirmed that the general cloud of data aligns with other studies as expected. Most frequently it is the units or a unit conversion that is incorrect.

  • Value contains unsupported characters: This error appears if there are unusual punctuation characters in the trait name. As such characters do not appear as allowed values within the trait dictionary, these represent transient errors that need to be corrected through either metadata[["substitutions"]] or, occasionally, in the data.csv file.
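As a sketch of clearing unsupported trait values, traits.build includes a helper for adding entries to metadata[["substitutions"]]; the trait name and values below are illustrative, and the signature is assumed to parallel the taxonomic helpers above:

# Add a single categorical value substitution: dataset_id, trait_name, find, replace
traits.build::metadata_add_substitution(current_study, "leaf_compoundness",
                                        "compund", "compound")

The adding many substitutions chapter describes a table-based equivalent for datasets with many values to fix.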

When you are finished running quality checks, no data should be excluded due to Missing unit conversion or Trait name not in trait dictionary, and it should be very rare for any data to be excluded due to Missing species name or Value contains unsupported characters.

The dataset curator should be confident that every value that lands in the excluded data table is there legitimately.

For instance, out of the 1.8 million+ records in AusTraits v5.0.0, the excluded data table contains:

Error                               Count
Observation excluded in metadata     4061
Unsupported trait value               354
Value does not convert to numeric      12
Value out of allowable range          427

The best way to view excluded data for a study is:

database$excluded_data %>%
  dplyr::filter(
    dataset_id == current_study,
    error != "Observation excluded in metadata"
  ) %>%
  View()

Missing values (blank cells, cells with NA) are not included in the excluded_data table, because they are assumed to be legitimate blanks. If you want to confirm this, you need to temporarily change the default arguments for the internal function dataset_process where it is called within the remake.yml or build.R file that compiles the database. For instance, the default,

      dataset_process("data/Ahrens_2019/data.csv",
                      Ahrens_2019_config,
                      schema)

needs to be changed to:

      dataset_process("data/Ahrens_2019/data.csv",
                      Ahrens_2019_config,
                      schema,
                      filter_missing_values = FALSE)

To check how many of each error type are present for a study:

database$excluded_data %>%
  dplyr::filter(dataset_id == current_study) %>%
  dplyr::pull(error) %>%
  table()

Or produce a table of error type by trait:

database$excluded_data %>%
  dplyr::filter(dataset_id == current_study) %>%
  dplyr::select(trait_name, error) %>%
  table()

Build study report

Another important check for each study is building a study report that summarises all metadata and trait data.

Make sure you’ve rebuilt the entire database (not just the new study) before building the report.

database <- remake::make("database")
traits.build::dataset_report(database, current_study, overwrite = TRUE)

NOTES:
- The report will appear in the folder export/reports
- The argument overwrite = TRUE overwrites pre-existing copies of the report in this folder.

Check the study report to ensure:

  • All possible metadata fields were filled in
  • The locations plot sensibly on the map
  • For numeric traits, the trait values plot sensibly relative to other studies
  • The list of unknown/unmatched species doesn’t include names you think should be recognised/aligned

If necessary, cycle back through earlier steps to fix any errors, rebuilding the study report as needed.

At the very end, re-clear the formatting, re-run the tests, rebuild AusTraits, and rebuild the report.

To generate a report for a collection of studies:

traits.build::dataset_reports(database, c("Falster_2005_1", "Wright_2002"), 
                              overwrite = TRUE)

Or for all studies:

traits.build::dataset_reports(database, overwrite = TRUE)

(Reports are written in Rmarkdown and generated via the knitr package. The template is here).