File structure

The main directory for the austraits.build repository contains the following files and folders, with purpose as indicated.

R project file

├── austraits.build.Rproj     # RStudio project

Files for maintaining a repo on GitHub

├── README.md         # landing page
├── .github           # folder containing GitHub Actions, issue templates, code of conduct
├── LICENCE
├── NEWS.md
├── _pkgdown.yml      # used to create pkgdown website
├── docs              # contains website
├── Dockerfile        # creates an image of R environment used in build

Files used for creation of R package austraits.build

├── NAMESPACE             # functions being exported
├── DESCRIPTION           # R package description
├── man                   # help files (R documentation) for package functions
├── R                     # folder with functions and scripts to build AusTraits
├── tests                 # defines tests applied to datasets
├── vignettes             # documentation of repo file structure, AusTraits database structure, definitions, data input processes

Files used for data compilation

├── remake.yml            # instructions for build
├── config                # configuration and definition files
├── data                  # raw data files
├── export                # folder for output
└── scripts               # scripts for processing files before/after building austraits, including compiling reports

Details on files used for data compilation:

Configuration

The folder config contains three files which govern the building of the dataset.

config
├── definitions.yml
├── taxon_list.csv
└── unit_conversions.csv

The file definitions.yml defines the structure of the database and all terms. It has three main sections: the trait definitions used to compile AusTraits, the allowable trait values, and definitions of all elements included in the tables that comprise AusTraits. The trait definitions and the definitions of all elements in AusTraits (and value types) are fully described in additional vignettes. A .yml file is a structured data file where information is presented in a hierarchical format (see the Appendices for details).

The file taxon_list.csv is our master list of known taxa. Each taxon is listed once, with links to identifiers provided by the Australian Plant Name Index (APNI). New rows are added to taxon_list.csv whenever a study includes taxa not previously represented in AusTraits. These can be names included in the Australian Plant Census (APC) or APNI, which are compilations of taxonomic concepts (APC) and names (APNI) for plants that are either native to or naturalised in Australia, or they can be taxa without recognised names.

taxon_name             family            scientificNameAuthorship   source  taxonomicStatusClean  taxonIDClean
Abelia                 Caprifoliaceae    R.Br.                      APC     accepted              https://id.biodiversity.org.au/node/apni/2892114
Abelia x grandiflora   Caprifoliaceae    (Rovelli ex André) Rehder  APC     accepted              https://id.biodiversity.org.au/node/apni/2914209
Abelmoschus            Malvaceae         Medik.                     APC     accepted              https://id.biodiversity.org.au/node/apni/2898872
Abelmoschus ficulneus  Malvaceae         (L.) Wight                 APC     accepted              https://id.biodiversity.org.au/node/apni/2897916
Abelmoschus manihot    Malvaceae         (L.) Medik.                APC     accepted              https://id.biodiversity.org.au/node/apni/2901085
Abelmoschus moschatus  Malvaceae         Medik.                     APC     accepted              https://id.biodiversity.org.au/node/apni/2900572
Abildgaardia           Cyperaceae        Vahl                       APC     accepted              https://id.biodiversity.org.au/node/apni/2905759
Abildgaardia ovata     Cyperaceae        (Burm.f.) Kral             APC     accepted              https://id.biodiversity.org.au/node/apni/2919627
Abildgaardia vaginata  Cyperaceae        R.Br.                      APC     accepted              https://id.biodiversity.org.au/node/apni/2899106
Abrodictyum            Hymenophyllaceae  C.Presl                    APC     accepted              https://id.biodiversity.org.au/node/apni/7402562
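
As a rough sketch of how taxa that are new to AusTraits might be spotted before taxon_list.csv is extended, the base R snippet below compares the names in a study against the master list. The small data frame stands in for the real file, the study names are made up for illustration, and the object names are not part of austraits.build.

```r
# Stand-in for the taxon_name column of config/taxon_list.csv
taxon_list <- data.frame(
  taxon_name = c("Abelia", "Abelmoschus ficulneus", "Abildgaardia ovata"),
  stringsAsFactors = FALSE
)

# Names appearing in a hypothetical contributed study
study_taxa <- c("Abelmoschus ficulneus", "Abildgaardia vaginata")

# Names missing from the master list are candidates to be added
new_taxa <- setdiff(study_taxa, taxon_list$taxon_name)
new_taxa
```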

The file unit_conversions.csv defines the unit conversions that are used when converting contributed trait data to common units, e.g.

unit_from  unit_to        function
%          mg/g           x*10
%          g/g            x*0.01
%          mg/mg          x*0.01
%          mg/kg          x*10000
%          n/n            x*0.01
%          dimensionless  x*.01
years      month          x*12
count/m2   count/mm2      x*1/1000000
cm         m              x*0.01
cm         mm             x*10
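
To illustrate how a conversion stored as text could be applied, here is a minimal base R sketch. It assumes the function column holds an R expression in x; the helper convert_units and the column name fun are hypothetical and are not taken from austraits.build.

```r
# A few rows copied from the table above; the "function" column is
# renamed to "fun" here to avoid clashing with the R keyword
conversions <- data.frame(
  unit_from = c("%", "cm", "cm"),
  unit_to   = c("mg/g", "m", "mm"),
  fun       = c("x*10", "x*0.01", "x*10"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the conversion and evaluate it with
# `x` bound to the supplied values
convert_units <- function(x, from, to, table = conversions) {
  row <- table[table$unit_from == from & table$unit_to == to, ]
  if (nrow(row) != 1) stop("No unique conversion from ", from, " to ", to)
  eval(parse(text = row$fun), envir = list(x = x))
}

pct_to_mg_g <- convert_units(c(1.2, 2.5), from = "%", to = "mg/g")
cm_to_m     <- convert_units(150, from = "cm", to = "m")
```

Storing the conversion as an expression keeps the table readable, at the cost of evaluating text from a file, so the file should only ever contain trusted entries.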

Data

The folder data contains the raw data from individual studies included in AusTraits.

Records within the data folder are organised as coming from a particular study, defined by the dataset_id. Data from each study is organised into a separate folder, with two files:

  • data.csv: a table containing the actual trait data.
  • metadata.yml: a file that contains study metadata (source, methods, sites, and context), maps trait names and units onto standard types, and lists any substitutions applied to the data in processing.

The folder data thus contains a long list of folders, one for each study and each containing two files:

data
├── Angevin_2010
│   ├── data.csv
│   └── metadata.yml
├── Barlow_1981
│   ├── data.csv
│   └── metadata.yml
├── Bean_1997
│   ├── data.csv
│   └── metadata.yml
├── ....

where Angevin_2010, Barlow_1981, and Bean_1997 are each a unique dataset_id in the final dataset.
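
The layout above can be checked with a few lines of base R. The sketch below builds a throwaway copy of the structure in a temporary directory (so it runs anywhere) and reports which study folders contain both required files; the folder austraits_demo_data is invented for the example.

```r
# Build a throwaway folder structure to run the check against
root <- file.path(tempdir(), "austraits_demo_data")
for (id in c("Angevin_2010", "Barlow_1981")) {
  dir.create(file.path(root, id), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(root, id, "data.csv"))
}
# Only one of the two studies gets a metadata file
file.create(file.path(root, "Angevin_2010", "metadata.yml"))

# Each sub-folder name is treated as a dataset_id
dataset_ids <- sort(list.dirs(root, full.names = FALSE, recursive = FALSE))

# TRUE where both required files exist for a study
required <- c("data.csv", "metadata.yml")
has_files <- sapply(dataset_ids, function(id) {
  all(file.exists(file.path(root, id, required)))
})
has_files
```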

Data.csv

The file data.csv contains raw measurements and can be in either long or wide format.

Required columns include the taxon name, the trait name (column in long format, header in wide format), units (column in long format, part of header in wide format), site (if applicable), context (if applicable), date (if available), and trait values.

It is important that all trait measurements made on the same individual, or that are means of a species’ measurements from the same site, are kept linked.

  • If the data are in wide format, each row should include measurements made on a single individual or a single species-by-site mean, with different traits in consecutive columns.

  • If the data are in long format, an additional column is required to link multiple trait measurements that were made on the same individual or that are means for a species at the same site: an observation_id (or other identifier) must be assigned to identify which rows of measurements (for different traits) belong to the same individual or species-by-site mean.
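
To show why the linking column matters, the base R sketch below reshapes a small long-format table into wide format. The taxa, trait names, and values are invented for illustration; only the role of observation_id follows the description above.

```r
# Made-up long-format data: two traits measured on each of two individuals
long <- data.frame(
  observation_id = c("obs1", "obs1", "obs2", "obs2"),
  taxon_name     = c("Species A", "Species A", "Species B", "Species B"),
  trait_name     = c("leaf_area", "wood_density", "leaf_area", "wood_density"),
  value          = c(520, 0.61, 1800, 0.55),
  stringsAsFactors = FALSE
)

# Rows sharing an observation_id collapse into a single wide row
wide <- reshape(
  long,
  idvar     = c("observation_id", "taxon_name"),
  timevar   = "trait_name",
  direction = "wide"
)
wide
```

Without observation_id there would be no way to tell which leaf_area value belongs with which wood_density value.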

We aim to keep the data file in the rawest form possible (i.e. with as few changes as possible), but it must be a single csv file. Additional custom R code may be required to make the file exactly compatible with the AusTraits format, but these changes should be executed as AusTraits is compiled and should be recorded in the metadata.yml file under config/custom_R_code (see below). Any files used to create the submitted data.csv file (e.g. Excel …) should be archived in a sub-folder named raw within the study folder.

Metadata.yml

The metadata is compiled in a .yml file, a structured data file where information is presented in a hierarchical format (see Appendix for details). There are 11 values at the top hierarchical level: source, people, dataset, sites, contexts, config, traits, substitutions, taxonomic_updates, exclude_observations, questions. These are each described below.

As a start, you may want to check out some examples from existing studies in AusTraits, e.g. Angevin_2010 or Wright_2009.

Source

This section provides citation details for the original source(s) of the data, whether a published journal article, book, website, or thesis. In general we aim to reference the primary source. References are written in structured yml format, under the category source and then the sub-groupings primary and secondary. General guidelines for describing a source:

  • a maximum of one primary source is allowed.
  • elements are named as in BibTeX format.
  • keys should be named in the format Surname_year and should be identical to the name given to the dataset folder. A second instance of the identical Surname_year should have the key Surname_year_2.
  • one or more secondary sources may be included if traits from a single dataset were presented in two different manuscripts. Multiple sources are also appropriate if an author has compiled data from a number of sources that are not individually in AusTraits for a published or unpublished compilation.
  • if your data are from an unpublished study, only include the elements that are applicable.
  • if someone has transcribed a published source, the primary source will be the published work and the person who completed the transcription will be acknowledged as the contributor of the dataset.
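
A quick way to screen keys against the Surname_year convention is a regular expression. The pattern below is only an assumption about what a well-formed key looks like (letters, an underscore, a four-digit year, and an optional numeric suffix); it is not a rule enforced by austraits.build, and surnames containing hyphens or other characters would need a looser pattern.

```r
# Hypothetical pattern for keys such as Surname_year or Surname_year_2
key_pattern <- "^[A-Za-z]+_[0-9]{4}(_[0-9]+)?$"

keys <- c("Falster_2005", "Choat_2006", "Falster_2005_2", "falster2005")
key_ok <- grepl(key_pattern, keys)

# Keys that do not follow the convention
keys[!key_ok]
```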

An example of a primary source that is a journal article is:

source:
  primary:
    key: Falster_2005_1
    bibtype: Article
    author: Daniel S. Falster, Mark Westoby
    year: 2005
    title: Alternative height strategies among 45 dicot rain forest species from tropical Queensland, Australia
    journal: Journal of Ecology
    volume: 93
    pages: 521--535
    publisher: Wiley-Blackwell
    doi: 10.1111/j.0022-0477.2005.00992.x

If a secondary source is included it may look like:

source:
  primary:
    key: Choat_2006
    bibtype: Article
    year: '2006'
    author: B. Choat and M. C. Ball and J. G. Luly and C. F. Donnelly and J. A. M.
      Holtum
    journal: Tree Physiology
    title: Seasonal patterns of leaf gas exchange and water relations in dry rain
      forest trees of contrasting leaf phenology
    volume: '26'
    number: '5'
    pages: 657--664
    doi: 10.1093/treephys/26.5.657
  secondary:
    key: Choat_2005
    bibtype: Article
    year: '2005'
    author: Brendan Choat and Marilyn C. Ball and Jon G. Luly and Joseph A. M. Holtum
    journal: Trees
    title: Hydraulic architecture of deciduous and evergreen dry rainforest tree species
      from north-eastern Australia
    volume: '19'
    number: '3'
    pages: 305--311
    doi: 10.1007/s00468-004-0392-1

People

This section provides a list of the key contributors to the study, their respective institutions and roles in the study. Roles are defined as follows:

key          value
collector    The person (people) leading data collection (generally 1-2 people)
contributor  Person responsible for entering data into AusTraits
lab_leader   Leader of lab group at time of collection
assistant    Anyone else who assisted in collection of the data
contact      The person to contact with questions about the data set

An example is as follows:

people:
- name: Daniel Falster
  institution: Macquarie University
  role: collector, contact, contributor
- name: Mark Westoby
  institution: Macquarie University
  role: lab_leader

Note that only the AusTraits custodians have the contributors’ e-mail addresses on file. This information will not be directly available to AusTraits users or new contributors via GitHub.

Dataset

This section includes study details, including study description, sampling strategy, sampling time frame, and sample age class.

The following elements are included under the element dataset:

  • year_collected_start: The year data collection commenced.
  • year_collected_end: The year data collection was completed.
  • description: A 1-2 sentence description of the purpose of the study.
  • collection_type: A field to indicate where the majority of plants on which traits were measured were collected - in the field, lab, glasshouse, botanical collection, or literature. The latter should only be used when the data were sourced from the literature and the collection type is unknown.
  • sample_age_class: A field to indicate if the study was completed on adult or juvenile plants.
  • sampling_strategy: A written description of how study sites were selected and how study individuals were selected. When available, this information is lifted verbatim from a published manuscript. For botanical collections, this field ideally indicates which records were ‘sampled’ to measure a specific trait.
  • original_file: The name of the file initially submitted to AusTraits
  • notes: Generic notes about the study and processing of data

An example is as follows:

dataset:
  year_collected_start: 2004
  year_collected_end: 2004
  description: Trait values for species with faster versus slower height growth following disturbance for Myall Lakes species.
  collection_type: field
  sample_age_class: adult
  sampling_strategy: Fire is a recurrent disturbance in the park (interval – 0–30 years; Fox and Fox 1986). A mosaic of fire histories has facilitated previous use of space-for-time substitutions in studies of small mammal succession (Fox and McKay 1981). Here we employ the same methodology to reconstruct species height-growth trajectories (Enright and Goldblum 1999). Sites were identified at a range of times since fire with the use of NSW national parks GIS fire history records and personal observations of Karen Ross (Ross et al. 2002). Patches of vegetation 1, 2, 4, 8, 10, 12, 15, 27 and 28 years since fire were identified. Where possible several patches within a given age class were surveyed to determine species presence or absence. Nineteen species recorded in a majority of patches were selected for further study. This included eight resprouting species and 11 obligate seeders (full list in Appendix 1).
  original_file: Falster & Westoby 2005 Oikos appendix.doc
  notes: none

Config

This section includes information on the format of the submitted data file.

Values are as follows:

  • data_is_long_format: Indicates whether the data spreadsheet has a vertical (long) or horizontal (wide) configuration; the value is yes or no.
  • variable_match: Identifies which information is in each column in the file data.csv, excluding the actual trait data. One element within variable_match must be taxon_name. Datasets with data_is_long_format set to yes must also identify which columns hold value and trait_name. Other allowed values include date, site_name, observation_id, and context_name, if these columns are present in the data.csv file or are created with custom R code.
  • custom_R_code: A field where additional R code can be included. This allows for custom manipulation of the data in the submitted spreadsheet into a different format for easy integration with AusTraits. A value of .na indicates that no custom R code was used.
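
The rules for variable_match can be expressed as a small check. The function below is a hypothetical sketch written for this vignette, not a function from austraits.build; it takes the config section as an R list and returns any required fields that are missing.

```r
# Hypothetical checker for the rules described above
check_variable_match <- function(config) {
  fields <- names(config$variable_match)
  required <- "taxon_name"
  if (isTRUE(config$data_is_long_format)) {
    # Long-format data must also say where values and trait names live
    required <- c(required, "value", "trait_name")
  }
  setdiff(required, fields)  # the missing fields, if any
}

# A long-format config that forgot to map the value column
config_long <- list(
  data_is_long_format = TRUE,
  variable_match = list(taxon_name = "Taxon", trait_name = "trait")
)
missing_long <- check_variable_match(config_long)

# A wide-format config only needs taxon_name
config_wide <- list(
  data_is_long_format = FALSE,
  variable_match = list(taxon_name = "Taxon", site_name = "Site")
)
missing_wide <- check_variable_match(config_wide)
```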

An example is as follows:

config:
  data_is_long_format: yes
  variable_match:
    taxon_name: Taxon
    value: trait value
    trait_name: trait
  custom_R_code: '
    data %>%
    mutate(`trait value` = ifelse(trait == "flowering time",
      convert_month_range_vec_to_binary(`trait value`), `trait value`)
      )
    '

A common use of the custom_R_code is to automate the conversion of a verbal description of flowering or fruiting periods into the supported trait values, as occurs in this example. It might also be used if values for a single trait are expressed across multiple columns and need to be merged. See Catford_2014 as an example of this. The adding data vignette provides additional examples of code regularly implemented in custom_R_code, including functions specifically developed for AusTraits data manipulations.

Traits

This section provides a translation table mapping traits and units from a contributed study onto corresponding variables in AusTraits. Also specified here are methods used to collect the data.

For each trait submitted to AusTraits, there is the following information:

  • var_in: Name of trait in the original data submitted
  • unit_in: Units of trait in the original data submitted
  • trait_name: Name of trait sampled. Allowable values are listed in the definitions file.
  • value_type: A categorical variable describing the type of trait value recorded.
  • replicates: Number of replicate measurements that comprise the data points for the trait for each measurement. A numeric value (or range) is ideal and appropriate if the value type is a mean, median, min or max. For these value types, if replication is unknown the entry should be unknown. If the value type is raw_value the replicate value should be 1. If the value type is expert_mean, expert_min, or expert_max the replicate value should be .na.
  • methods: A textual description of the methods used to collect the trait data. Whenever available, methods are taken near-verbatim from referenced source. Methods can include descriptions such as ‘measured on botanical collections’, ‘data from the literature’, or a detailed description of the field or lab methods used to collect the data.

Values under trait_name must be allowable values, as listed in the definitions file or vignette. Similarly, values under value_type must be allowable values, also listed in the definitions file.

An example is as follows:

traits:
- var_in: LMA (mg mm-2)
  unit_in: mg/mm2
  trait_name: specific_leaf_area
  value_type: site_mean
  replicates: 3
  methods: LMA was calculated as the leaf dry mass (oven-dried for 48 hours at 65 °C) divided by leaf size. It was measured on the first five fully expanded leaves at the tip of each individual.
- var_in: leaf size (mm2)
  unit_in: mm2
  trait_name: leaf_area
  value_type: site_mean
  replicates: 3
  methods: Leaf size was calculated as the one-sided leaf area (flat bed scanner). It was measured on the first five fully expanded leaves at the tip of each individual.
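
As a sketch of what the translation table does, the base R snippet below uses the var_in and trait_name pairs from the example to rename the columns of a small wide-format table. The data values and the Taxon column are made up, and only the renaming step is shown: the values themselves are not converted, which the real build handles separately via unit_in and the unit conversions.

```r
# var_in -> trait_name pairs taken from the example above
traits <- list(
  list(var_in = "LMA (mg mm-2)", trait_name = "specific_leaf_area"),
  list(var_in = "leaf size (mm2)", trait_name = "leaf_area")
)

# Made-up wide-format data using the contributor's column names
data <- data.frame(
  Taxon = c("Species A", "Species B"),
  `LMA (mg mm-2)` = c(0.12, 0.20),
  `leaf size (mm2)` = c(350, 1200),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Named lookup vector: names are var_in, values are trait_name
lookup <- setNames(
  vapply(traits, function(t) t$trait_name, character(1)),
  vapply(traits, function(t) t$var_in, character(1))
)

# Rename only the columns that appear in the translation table
is_trait <- names(data) %in% names(lookup)
names(data)[is_trait] <- unname(lookup[names(data)[is_trait]])
names(data)
```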

Substitutions

This section provides a list of any “find and replace” substitutions needed to get the data into the right format.

Substitutions are required whenever the exact word(s) used to describe a categorical trait value in AusTraits is different from the vocabulary used by the author in the data.csv file. It is preferable to align vocabulary using substitutions rather than changing the data.csv file. The trait definitions file provides a list of supported values for each trait.

Each substitution is documented using the following elements:

  • trait_name: Trait where substitutions are required
  • find: Contributor’s trait value that needs to be changed
  • replace: AusTraits supported replacement value

An example is as follows:

substitutions:
- trait_name: life_history
  find: p
  replace: perennial
- trait_name: plant_growth_form
  find: s
  replace: shrub
- ...
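
The effect of these entries can be sketched in base R as follows. The substitutions list mirrors the example above, while the data frame is invented; a value is replaced only when both the trait name and the exact original value match.

```r
# Substitutions from the example, as an R list
substitutions <- list(
  list(trait_name = "life_history", find = "p", replace = "perennial"),
  list(trait_name = "plant_growth_form", find = "s", replace = "shrub")
)

# Made-up long-format data; the last row has no matching substitution
data <- data.frame(
  trait_name = c("life_history", "plant_growth_form", "plant_growth_form"),
  value      = c("p", "s", "p"),
  stringsAsFactors = FALSE
)

# Apply each find-and-replace within its own trait only
for (s in substitutions) {
  hit <- data$trait_name == s$trait_name & data$value == s$find
  data$value[hit] <- s$replace
}
data$value
```

Note that the "p" recorded under plant_growth_form is left untouched, because the substitution for "p" applies to life_history only.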

Taxonomic updates

This section provides a table of taxonomic name changes needed to align original names in the dataset with taxa in APC and APNI.

Each update is documented using the following elements:

  • find: Name given to taxon in the original data supplied by the authors
  • replace: Whenever possible, this field gives the currently accepted scientific name of a taxon, per the Australian Plant Census (APC). Alternatively, it can indicate an alignment with a name included in the comprehensive Australian Plant Name Index (APNI) but not currently marked as accepted in APC, or a name that cannot be aligned to available lists of Australian plant names.
  • reason: Records why the change was implemented, e.g. typos, taxonomic synonyms, and standardising spellings

Algorithms within AusTraits automatically align outdated taxonomy and taxonomic synonyms to their currently accepted scientific name, so such adjustments are not documented as substitutions.

An example is as follows:

taxonomic_updates:
- find: Eucalyptus albens X crebra
  replace: Eucalyptus albens x Eucalyptus crebra
  reason: Change wording for hybrid species (Elizabeth Wenk, 2020-06-30)
- find: Eucalyptus albens x moluccana
  replace: Eucalyptus albens x Eucalyptus moluccana
  reason: Change wording for hybrid species (Elizabeth Wenk, 2020-06-30)
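
One simple way to picture how such updates are applied is a named lookup vector, as in the base R sketch below. The find and replace names are based on the example above, the second original name is added for illustration, and the object names are not part of austraits.build.

```r
# Names are the "find" values, elements are the "replace" values
updates <- c(
  "Eucalyptus albens X crebra"    = "Eucalyptus albens x Eucalyptus crebra",
  "Eucalyptus albens x moluccana" = "Eucalyptus albens x Eucalyptus moluccana"
)

# Names as supplied by the contributor; the second needs no update
original_name <- c("Eucalyptus albens X crebra", "Eucalyptus crebra")

# Replace matched names and leave all others unchanged
matched <- original_name %in% names(updates)
taxon_name <- original_name
taxon_name[matched] <- unname(updates[original_name[matched]])
taxon_name
```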

Questions

This section provides a place to record any queries we have about the dataset (recorded as a named array), including notes on any additional traits that may have been collected in the study but have not been incorporated into AusTraits.

An example is as follows:

questions:
  questions for author: Triglochin procera has very different seed masses in the main traits spreadsheet and the field seeds worksheet. Which is correct? There are a number of species with values in the field leaves worksheet that are absent in the main traits worksheet - we have included this data into Austraits; please advise if this was inappropriate.
  austraits: need to map aquatic_terrestrial onto an actual trait once one is created.

Appendices

File types

CSV

A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record, and each record consists of one or more fields separated by commas. This is a common format for storing tables of data in a simple text file. You can edit it in Excel or in a text editor.

YAML files

The yml file extension (pronounced “YAML”) denotes a type of structured data file that is both human and machine readable. You can edit it in any text editor, or in RStudio. Generally, yml is used in situations where a table does not suit because of variable lengths and/or nested structures. It has the advantage over a spreadsheet that the nested “headers” can have variable numbers of categories. The data under each of the hierarchical headings are easily extracted by R.
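
As an illustration of extracting data from the hierarchy, the sketch below parses a fragment of metadata with the yaml package (assumed to be installed). The fragment is assembled from the people and dataset examples shown earlier; it is parsed from a string here so the example does not depend on any file.

```r
library(yaml)

# A fragment of metadata, written inline instead of read from metadata.yml
yml_text <- "
people:
- name: Daniel Falster
  institution: Macquarie University
  role: collector, contact, contributor
- name: Mark Westoby
  institution: Macquarie University
  role: lab_leader
dataset:
  year_collected_start: 2004
  collection_type: field
"

# The result is a nested list that mirrors the yml hierarchy
metadata <- yaml.load(yml_text)

top_level       <- names(metadata)
collection_type <- metadata$dataset$collection_type
people_names    <- sapply(metadata$people, function(p) p$name)
```

For a file on disk, yaml::read_yaml() or yaml::yaml.load_file() return the same kind of nested list.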

Adding custom R code into metadata.yml

Occasionally, the changes we want to make to a dataset may not fit into the prescribed workflow used in AusTraits. For example, we assume each trait has a single unit, but there are a few datasets where data on different rows have different units. We therefore want to make some custom modifications to such a dataset before the common pipeline of operations is applied. To make this possible, the workflow allows some custom R code to be run as a first step in the processing pipeline. That pipeline (in the function read_data_study) looks like:

data <-
  read_csv(filename_data_raw, col_types = cols()) %>%
  custom_manipulation(metadata[["config"]][["custom_R_code"]])() %>%
  parse_data(dataset_id, metadata) %>%
  ...()

Note the custom_manipulation step, which runs straight after the raw data are read.
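
To give a feel for how code stored as text can be applied to a data frame, here is a minimal sketch. The function make_custom_manipulation is a hypothetical stand-in written for this vignette; it is not the package's custom_manipulation, and its treatment of missing code (NULL, NA, or an empty string means "do nothing") is an assumption.

```r
# Hypothetical stand-in: given code as text, return a function that
# evaluates it with the data frame bound to the name `data`
make_custom_manipulation <- function(txt) {
  if (is.null(txt) || is.na(txt) || !nzchar(txt)) {
    return(identity)  # no custom code: pass the data through unchanged
  }
  function(data) {
    eval(parse(text = txt), envir = list(data = data))
  }
}

# Made-up data and custom code (base R here, to keep the sketch self-contained)
raw <- data.frame(site = c("a", "b"), stringsAsFactors = FALSE)
code <- "transform(data, site = toupper(site))"

changed   <- make_custom_manipulation(code)(raw)
unchanged <- make_custom_manipulation(NA)(raw)
```

Returning a function is what allows the trailing () in the pipeline above: the first call builds the manipulation and the second applies it to the piped data.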

Example problem

As an example, for Blackman_2010 we want to combine two columns to create an appropriate site variable. Here is the code that was included in data/Blackman_2010/metadata.yml under custom_R_code.

data %>% mutate(
  site = ifelse(site == "Mt Field" & habitat == "Montane rainforest", "Mt Field_wet", site),
  site = ifelse(site == "Mt Field" & habitat == "Dry sclerophyll", "Mt Field_dry", site)
)

This is the finished solution; the steps used to arrive at it are described below.

Generally, this code should

  • assume a single object called data, and apply whatever fixes are needed
  • use dplyr functions like mutate, rename, etc.
  • use pipes to weave together a single statement, where possible. (Otherwise you’ll need a semicolon ; at the end of each statement.)
  • be fully self-contained (we’re not going to use any of the other remake machinery here)

First, load an object called data:

library(readr)
library(yaml)

data <- read_csv(file.path("data", "Blackman_2010", "data.csv"), col_types = cols(.default = "c"))
data

Second, write your code to manipulate data, like the example above.
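
While developing the code, it can help to try it on a small made-up data frame first. The sketch below runs the Blackman_2010 code on three invented rows (the site "Elsewhere" is hypothetical) and assumes dplyr is installed.

```r
library(dplyr)

# Made-up rows covering the two cases to be recoded, plus one to leave alone
data <- data.frame(
  site    = c("Mt Field", "Mt Field", "Elsewhere"),
  habitat = c("Montane rainforest", "Dry sclerophyll", "Dry sclerophyll"),
  stringsAsFactors = FALSE
)

# The custom code from the example, applied to the toy data
result <- data %>% mutate(
  site = ifelse(site == "Mt Field" & habitat == "Montane rainforest", "Mt Field_wet", site),
  site = ifelse(site == "Mt Field" & habitat == "Dry sclerophyll", "Mt Field_dry", site)
)

result$site
```

Only the two Mt Field rows are renamed; the third row keeps its original site name because neither condition matches.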

Third, once you have some working code, add it to your yml file under config -> custom_R_code.

Finally, check it works. Let’s assume you added it in. The function metadata_check_custom_R_code loads the data and applies the custom R code: