7  File structure

This chapter describes the typical files you may encounter in a traits.build compilation. The description is based on the austraits.build compilation.

We strongly suggest you create a standalone folder for your repository, e.g. austraits.build. This folder should contain all files needed to build your compilation. We’re big fans of GitHub as a platform for collaboration. If you’re not familiar with Git or GitHub, we suggest you check out the Happy Git and GitHub for the useR book.

7.1 Repository structure

The main directory for the austraits.build repository contains the following files and folders, with purpose as indicated. Not all of these files are required for a compilation; some support extra features such as a website. They are included here for completeness.

Files used for data compilation

├── remake.yml/build.R    # instructions for build
├── config                # configuration files
├── data                  # raw data files
├── R                     # folder with custom functions
├── export                # folder for output
└── scripts               # scripts for processing files before/after build

R project file

├── traits.build.Rproj     # RStudio project

Files for maintaining a repo on GitHub

├── README.md         # landing page
├── .github           # folder containing GitHub Actions, issue templates, code of conduct
├── LICENCE
├── NEWS.md
├── _pkgdown.yml      # used to create pkgdown website
├── docs              # contains website
├── Dockerfile        # creates an image of R environment used in build

Files used for creation of R package for this compilation

XXX Explain this more

├── NAMESPACE             # functions being exported
├── DESCRIPTION           # R package description
├── tests                 # defines tests applied to datasets
├── vignettes             # documentation of repo file structure, AusTraits database structure, definitions, data input processes

7.2 /config folder

The folder config contains four files which govern the building of the dataset.

config
├── metadata.yml
├── traits.yml
├── taxon_list.csv
└── unit_conversions.csv

metadata.yml

XXX

traits.yml

The file traits.yml provides the trait definitions used to compile AusTraits, including allowable trait values. The trait definitions are fully described in an additional vignette. A .yml file is a structured data file where information is presented in a hierarchical format (see appendix for details).

taxon_list.csv

The file taxon_list.csv is our master list of known taxa.

XXX Explain this more. Only show essential variables in table below.

| taxon_name | family | scientific_name_authorship | taxonomic_reference | cleaned_name_taxonomic_status | cleaned_scientific_name_id |
| --- | --- | --- | --- | --- | --- |
| Abelia x grandiflora | Caprifoliaceae | (Rovelli ex André) Rehder | APC | accepted | https://id.biodiversity.org.au/name/apni/190758 |
| Abelmoschus ficulneus | Malvaceae | (L.) Wight | APC | accepted | https://id.biodiversity.org.au/name/apni/55929 |
| Abelmoschus manihot | Malvaceae | (L.) Medik. | APC | accepted | https://id.biodiversity.org.au/name/apni/55937 |
| Abelmoschus manihot subsp. manihot | Malvaceae | NA | APC | accepted | https://id.biodiversity.org.au/name/apni/116920 |
| Abelmoschus manihot subsp. tetraphyllus | Malvaceae | (Roxb. ex Hornem.) Borss.Waalk. | APC | accepted | https://id.biodiversity.org.au/name/apni/55945 |
| Abelmoschus moschatus | Malvaceae | Medik. | APC | accepted | https://id.biodiversity.org.au/name/apni/55953 |
| Abelmoschus moschatus subsp. biakensis | Malvaceae | (Hochr.) Borss.Waalk. | APC | accepted | https://id.biodiversity.org.au/name/apni/116595 |
| Abelmoschus moschatus subsp. moschatus | Malvaceae | NA | APC | accepted | https://id.biodiversity.org.au/name/apni/243806 |
| Abelmoschus moschatus subsp. tuberosus | Malvaceae | (Span.) Borss.Waalk. | APC | accepted | https://id.biodiversity.org.au/name/apni/55961 |
| Abildgaardia ovata | Cyperaceae | (Burm.f.) Kral | APC | accepted | https://id.biodiversity.org.au/name/apni/150737 |

unit_conversions.csv

The file unit_conversions.csv defines the unit conversions that are used when converting contributed trait data to common units, e.g.

| unit_from | unit_to | function |
| --- | --- | --- |
| % | mg/g | x*10 |
| % | g/g | x*0.01 |
| % | mg/mg | x*0.01 |
| % | mg/kg | x*10000 |
| % | {dimensionless} | x*.01 |
| % | {count}/{count} | x*.01 |
| {dimensionless} | {count}/{count} | x*1 |
| a | mo | x*12 |
| {count}/m2 | {count}/mm2 | x*1/1000000 |
| cm | m | x*0.01 |
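
As a rough illustration of how a table like this can be applied, the sketch below looks up a row by its unit_from and unit_to values and evaluates the text in the function column as an R expression in x. It uses base R only. The helper convert_units and the inline three-row table are hypothetical stand-ins for illustration, not the traits.build implementation.

```r
# Hypothetical three-row excerpt of unit_conversions.csv, typed inline
conversions <- data.frame(
  unit_from = c("%", "%", "cm"),
  unit_to = c("mg/g", "g/g", "m"),
  `function` = c("x*10", "x*0.01", "x*0.01"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: find the matching row, turn its text into a
# function of x, and apply that function to the supplied values
convert_units <- function(x, from, to, conv = conversions) {
  row <- conv[conv$unit_from == from & conv$unit_to == to, ]
  if (nrow(row) != 1) {
    stop("Expected exactly one conversion from ", from, " to ", to)
  }
  f <- eval(parse(text = paste0("function(x) ", row[["function"]])))
  f(x)
}

# Convert two percentages to mg/g
print(convert_units(c(1.5, 2), from = "%", to = "mg/g"))
```

Storing the conversion as text keeps the table readable and editable in a spreadsheet, at the cost of evaluating strings as code, so a table like this should only come from a trusted source.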

7.3 /data folder

The folder data contains the raw data from individual studies included in AusTraits.

Records within the data folder are organised as coming from a particular study, defined by the dataset_id. Data from each study are organised into a separate folder, with two files:

  • data.csv: a table containing the actual trait data.
  • metadata.yml: a file that contains study metadata (source, methods, locations, and context), maps trait names and units onto standard types, and lists any substitutions applied to the data in processing.

The folder data thus contains a long list of folders, one for each study and each containing two files:

data
├── Angevin_2010
│   ├── data.csv
│   └── metadata.yml
├── Barlow_1981
│   ├── data.csv
│   └── metadata.yml
├── Bean_1997
│   ├── data.csv
│   └── metadata.yml
├── ....

where Angevin_2010, Barlow_1981, and Bean_1997 are each a unique dataset_id in the final dataset.
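
The layout above lends itself to a quick completeness check. The following base R sketch lists the study folders and reports whether each contains both expected files. The helper name check_data_folder and the throwaway demonstration folder are assumptions made for illustration; they are not part of traits.build.

```r
# Hypothetical helper: for each study folder, report whether both files exist
check_data_folder <- function(path = "data") {
  dataset_ids <- list.dirs(path, full.names = FALSE, recursive = FALSE)
  data.frame(
    dataset_id = dataset_ids,
    has_data = file.exists(file.path(path, dataset_ids, "data.csv")),
    has_metadata = file.exists(file.path(path, dataset_ids, "metadata.yml")),
    stringsAsFactors = FALSE
  )
}

# Demonstration on a throwaway folder, so no real compilation is needed
demo <- file.path(tempdir(), "demo_data")
dir.create(file.path(demo, "Angevin_2010"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo, "Barlow_1981"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demo, "Angevin_2010", c("data.csv", "metadata.yml"))))
# metadata.yml is deliberately left out for the second study
invisible(file.create(file.path(demo, "Barlow_1981", "data.csv")))

report <- check_data_folder(demo)
print(report)
```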

7.4 dataset_id/data.csv

The file data.csv contains raw measurements and can be in either long or wide format.

Required columns include the taxon name, the trait name (column in long format, header in wide format), units (column in long format, part of header in wide format), location (if applicable), context (if applicable), date (if available), and trait values.

It is important that all trait measurements made on the same individual or that are the mean of a species’ measurements from the same location are kept linked.

  • If the data is in wide format, each row should include measurements made on a single individual at a single point in time or a single species-by-location mean, with different trait values as consecutive columns.

  • If the data is in long format, an additional column, individual_id, is required to ensure multiple trait measurements made on the same individual, or the mean of a species’ measurements from the same location, are linked. If the data is in wide format and there are multiple rows of data for the same individual, an individual_id column should be included. These individual_id columns ensure that related data values remain linked.
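
To make the role of individual_id concrete, the sketch below reshapes a small long-format table into wide format with base R's reshape(). The taxa, traits, and values are invented for illustration, and this is not how traits.build processes data internally; it only shows that individual_id is what allows rows for the same plant to be brought back together.

```r
# Illustrative long-format rows; the values are made up
long <- data.frame(
  individual_id = c(1, 1, 2, 2),
  taxon_name = c("Acacia aneura", "Acacia aneura", "Banksia serrata", "Banksia serrata"),
  trait_name = c("leaf_area", "plant_height", "leaf_area", "plant_height"),
  value = c(120, 3.2, 450, 8),
  stringsAsFactors = FALSE
)

# individual_id tells reshape() which rows belong to the same individual;
# each trait becomes its own column named value.<trait_name>
wide <- reshape(
  long,
  idvar = "individual_id",
  timevar = "trait_name",
  v.names = "value",
  direction = "wide"
)
print(wide)
```

Without the individual_id column there would be no way to tell which leaf_area value belongs with which plant_height value.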

We aim to keep the data file in the rawest form possible (i.e. with as few changes as possible) but it must be a single csv file. Additional custom R code may be required to make the file exactly compatible with the AusTraits format, but these changes should be executed as AusTraits is compiled and should be in the metadata.yml file under dataset/custom_R_code (see below). Any files used to create the submitted data.csv file (e.g. Excel …) should be archived in a sub-folder within the study folder named raw.

7.5 dataset_id/metadata.yml

The metadata is compiled in a .yml file, a structured data file where information is presented in a hierarchical format (see Appendix for details). There are 10 values at the top hierarchical level: source, contributors, dataset, locations, contexts, traits, substitutions, taxonomic_updates, exclude_observations, questions. These are each described below.

As a start, you may want to check out some examples from existing studies in AusTraits, e.g. Angevin_2010 or Wright_2009.

source

This section provides citation details for the original source(s) for the data, whether it is a published journal article, book, website, or thesis. In general we aim to reference the primary source. References are written in structured yml format, under the category source and then under sub-groupings primary, secondary, and original. A reference is designated as secondary if it is a second publication by the data collector that analyses the data. When the primary reference is a compilation of multiple sources for a meta-analysis, the original references are designated as original.

General guidelines for describing a source include:

  • A maximum of one primary source is allowed.
  • Elements are named as in BibTeX format.
  • Keys should be named in the format Surname_year and the primary source is almost always identical to the name given to the dataset folder. A second instance of the identical Surname_year should have the key Surname_year_2.
  • One or more secondary sources may be included if traits from a single dataset were presented in two different manuscripts. Multiple sources are also appropriate if an author has compiled data from a number of sources, which are not individually in AusTraits, for a published or unpublished compilation.
  • If your data is from an unpublished study, only include the elements that are applicable.
  • If someone has transcribed a published source, the primary source will be the published work and the person who has completed the transcription will be acknowledged as the contributor of the dataset.

An example of a primary source that is a journal article is:

source:
  primary:
    key: Falster_2005_1
    bibtype: Article
    author: Daniel S. Falster, Mark Westoby
    year: 2005
    title: Alternative height strategies among 45 dicot rain forest species from tropical Queensland, Australia
    journal: Journal of Ecology
    volume: 93
    pages: 521--535
    publisher: Wiley-Blackwell
    doi: 10.1111/j.0022-0477.2005.00992.x
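
Because the elements follow BibTeX naming, an entry like the one above maps directly onto a BibTeX record. As a small sketch, assuming nothing beyond the utils package that ships with R, the same fields can be passed to bibentry() and printed as BibTeX. This is an illustration of the naming convention only, not a description of how traits.build handles references.

```r
# Rebuild the Falster_2005_1 reference shown above as a bibentry object
entry <- utils::bibentry(
  bibtype = "Article",
  key = "Falster_2005_1",
  author = c(
    utils::person("Daniel S.", "Falster"),
    utils::person("Mark", "Westoby")
  ),
  year = "2005",
  title = "Alternative height strategies among 45 dicot rain forest species from tropical Queensland, Australia",
  journal = "Journal of Ecology",
  volume = "93",
  pages = "521--535",
  publisher = "Wiley-Blackwell",
  doi = "10.1111/j.0022-0477.2005.00992.x"
)

# Render the entry as BibTeX text
bibtex_lines <- utils::toBibtex(entry)
print(bibtex_lines)
```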

If a secondary source is included it may look like:

  primary:
    key: Choat_2006
    bibtype: Article
    year: '2006'
    author: B. Choat and M. C. Ball and J. G. Luly and C. F. Donnelly and J. A. M.
      Holtum
    journal: Tree Physiology
    title: Seasonal patterns of leaf gas exchange and water relations in dry rain
      forest trees of contrasting leaf phenology
    volume: '26'
    number: '5'
    pages: 657--664
    doi: 10.1093/treephys/26.5.657
  secondary:
    key: Choat_2005
    bibtype: Article
    year: '2005'
    author: Brendan Choat and Marilyn C. Ball and Jon G. Luly and Joseph A. M. Holtum
    journal: Trees
    title: Hydraulic architecture of deciduous and evergreen dry rainforest tree species
      from north-eastern Australia
    volume: '19'
    number: '3'
    pages: 305--311
    doi: 10.1007/s00468-004-0392-1

contributors

This section provides a list of contributors to the study, their respective affiliations, roles in the study, and ORCIDs. The following information is recorded for each data contributor:

| key | value |
| --- | --- |
| last_name | Last name of data collector. |
| given_name | Given name of data collector. |
| affiliation | Affiliation of data collector. |
| ORCID | ORCID ID (Open Researcher and Contributor ID) for the data collector, if available. |
| notes | Optional notes for the data collector. |
| additional_role | Any additional roles the data collector had in the study, a field most frequently used to identify which data contributor is the contact person for the dataset. |

An example is as follows:

 data_collectors:
  - last_name: Falster
    given_name: Daniel
    ORCID: 0000-0002-9814-092X
    affiliation: Evolution & Ecology Research Centre, School of Biological, Earth,
      and Environmental Sciences, UNSW Sydney, Australia
    additional_role: contact
  - last_name: Westoby
    given_name: Mark
    ORCID: 0000-0001-7690-4530
    affiliation: Department of Biological Sciences, Macquarie University, Australia

Note that only the AusTraits custodians have the contributors’ e-mail addresses on file. This information will not be directly available to AusTraits users or new contributors via GitHub.

Additional fields within contributors are:

  • assistants, names of additional people who played a more minor role in data collection for the study.
  • dataset_curators, names of AusTraits team member(s) who contacted the data collectors and added the study to the AusTraits repository.

dataset

This section includes study details, including the format of the data, custom R code applied to the data, and various descriptors. The value entered for each element can be either a header for a column within the data.csv file or the actual value to be used.

The following elements are included under the element dataset:

  • data_is_long_format: Indicates if the data spreadsheet has a vertical (long) or horizontal (wide) configuration with yes or no terminology.
  • custom_R_code: A field where additional R code can be included. This allows for custom manipulation of the data in the submitted spreadsheet into a different format for easy integration with AusTraits. .na indicates no custom R code was used.
  • collection_date: Date sample was taken, in the format yyyy-mm-dd, yyyy-mm or yyyy, depending on the resolution specified. Alternatively, an overall range for the study can be indicated, with the starting and ending sample dates separated by a /, as in 2010-10/2011-03.
  • taxon_name: Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level.
  • location_name: location name
  • source_id: For datasets that are compilations, an identifier for the original data source.
  • individual_id: A unique integer identifier for an individual, with individuals numbered sequentially within each dataset by taxon by population grouping. Most often each row of data represents an individual, but in some datasets trait data collected on a single individual is presented across multiple rows of data, such as if the same trait is measured using different methods or the same individual is measured repeatedly across time.
  • repeat_measurements_id: A unique integer identifier for repeat measurements of a trait that comprise a single observation, such as a response curve.
  • trait_name: Element required for long datasets to specify the column indicating the trait name associated with each row of data.
  • value: The measured value of a trait.
  • description: A 1-2 sentence description of the purpose of the study.
  • basis_of_record: A categorical variable specifying from which kind of specimen traits were recorded.
  • life_stage: A field to indicate the life stage or age class of the entity measured. Standard values are adult, sapling, seedling and juvenile.
  • sampling_strategy: A written description of how study locations were selected and how study individuals were selected. When available, this information is lifted verbatim from a published manuscript. For preserved specimens, this field ideally indicates which records were ‘sampled’ to measure a specific trait.
  • measurement_remarks: Brief comments or notes accompanying the trait measurement.
  • original_file: The name of the file initially submitted to AusTraits.
  • notes: Generic notes about the study and processing of data.
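
The collection_date formats listed above can be checked with a regular expression. The sketch below is a minimal base R example; the function name is_valid_collection_date is hypothetical, and the check covers only the shape of the text (it would accept a month of 13, for instance), so it is a first-pass filter rather than full date validation.

```r
# Hypothetical helper: TRUE where the text looks like yyyy, yyyy-mm or
# yyyy-mm-dd, or like two such dates separated by a single "/"
is_valid_collection_date <- function(x) {
  single <- "[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?"
  pattern <- paste0("^", single, "(/", single, ")?$")
  grepl(pattern, x)
}

examples <- c("2010-10-03", "2010-10", "2010", "2010-10/2011-03", "10/2010")
print(data.frame(collection_date = examples, valid = is_valid_collection_date(examples)))
```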

Of these, the fields collection_date, life_stage, basis_of_record, and measurement_remarks can all be specified at the dataset level or the traits level (which overrides a dataset-level entry) or location level (which also overrides a dataset-level entry). In each case, they can be a fixed text value or indicate a column within the data.csv file (or generated through custom_R_code) that includes the relevant information.

  • life_stage, basis_of_record, and collection_date are usually included under metadata$dataset unless they vary by trait.

  • entity_type, replicates, basis_of_value, and value_type are usually different across traits and are usually mapped under the metadata$traits section (see below), but are allowed to be specified for the entire dataset in this section.

  • trait_name and value are only specified in metadata$dataset for long-format datasets.

  • measurement_remarks and individual_id are only included if required. They are absent from the majority of datasets.
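
The override rule described above (a traits-level entry takes precedence over a dataset-level entry) can be sketched in a few lines of base R. The function resolve_field and the example values are hypothetical, and the sketch covers only the dataset and traits levels; it makes no claim about how traits.build orders traits-level and location-level entries relative to each other.

```r
# Hypothetical helper: use the traits-level entry when present,
# otherwise fall back to the dataset-level entry
resolve_field <- function(field, dataset, trait) {
  if (!is.null(trait[[field]])) {
    return(trait[[field]])
  }
  if (!is.null(dataset[[field]])) {
    return(dataset[[field]])
  }
  NA_character_
}

# Illustrative metadata fragments, written as R lists
dataset_level <- list(life_stage = "adult", basis_of_record = "field")
trait_level <- list(basis_of_record = "lab")

print(resolve_field("basis_of_record", dataset_level, trait_level))  # traits level wins
print(resolve_field("life_stage", dataset_level, trait_level))       # dataset level used
```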

An example is as follows:

  data_is_long_format: no
  custom_R_code: '
    data %>%
      mutate(
        location_name = "Howard River catchment",
        date = date %>% mdy()
      ) %>%
      arrange(date) %>%
      group_by(Tree) %>%
        mutate(observation_number = dplyr::row_number()) %>%
      ungroup() %>%
      group_by(species) %>%
        mutate(across(c("specific leaf area (m2 kg-1)"), replace_duplicates_with_NA)) %>%
      ungroup()
  '
  collection_date: date
  taxon_name: species
  context_name: context
  location_name: location_name
  individual_id: Tree
  description: Measurements of stem CO2 efflux and leaf gas exchange in a tropical
    savanna ecosystem in northern Australia, and assessed the impact of fire on these
    processes.
  basis_of_record: field
  life_stage: adult
  sampling_strategy: The stem CO2 efflux was initially measured at two locations,
    each of which was nested within a 3 km 2 plot...
  original_file: leaf_summary.xls, Rbranch summary2.xls, and Rstem summary6.xls submitted
    by Lucas Cernusak and archived in the raw data folder and GoogleDrive folder.
  notes: none

A common use of custom_R_code is to automate the conversion of a verbal description of flowering or fruiting periods into the supported trait values. It might also be used if values for a single trait are spread across multiple columns and need to be merged; see Catford_2014 for an example. The adding data vignette provides additional examples of code regularly implemented in custom_R_code, including functions that were developed specifically for AusTraits data manipulations and are in the file scripts\custom.R.
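
Conceptually, custom_R_code is a piece of R text that is run against the contents of data.csv at build time. The sketch below shows one way such a step could work, assuming the code refers to the table as data and returns the modified table. The helper apply_custom_code is hypothetical, and the example uses base R's transform() rather than the pipe-based style in the metadata example above, so that it runs without extra packages. It is not the traits.build implementation.

```r
# Hypothetical helper: evaluate a string of R code with the raw table
# available under the name `data`; NA (from .na in the YAML) means no change
apply_custom_code <- function(data, code) {
  if (is.null(code) || is.na(code)) {
    return(data)
  }
  eval(parse(text = code), envir = list(data = data), enclos = parent.frame())
}

raw <- data.frame(
  species = c("Acacia aneura", "Banksia serrata"),
  stringsAsFactors = FALSE
)

# Add a constant location_name column, as in the metadata example
code <- 'transform(data, location_name = "Howard River catchment")'
processed <- apply_custom_code(raw, code)
print(processed)
```

Keeping such changes as code in metadata.yml, rather than editing data.csv by hand, leaves the submitted file untouched and makes every manipulation reproducible.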

locations

This section provides a list of study locations (sites) and information about each of the study locations where data were collected. Each should include at least three variables - latitude (deg), longitude (deg) and description. Additional variables can be included where available. Set to .na for botanical collections and field studies where data values are a mean across many locations.

Although the properties listed under each location are not part of a controlled vocabulary, it is best practice to align with in-use properties whenever possible. These can be identified by running austraits$locations %>% distinct(location_property).

An example of how a location, its properties, and the value of each property are listed (modified from Vesk_2019) is:

  Round Hill-Nombinnie Nature Reserve:
    latitude (deg): -32.965
    longitude (deg): 146.161
    precipitation, MAP (mm): 370
    temperature, summer mean (C): 32.5
    temperature, winter mean (C): 14.2
    soil type: loamy red sands light red clays and light red browns earths
    description: predominantly open Callitris glaucophylla - Eucalyptus populnea woodland
      and Eucalyptus dumosa - E. socialis shrub mallee woodland
    fire frequency (years): 5-20 years
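
A simple check for the three expected variables might look like the following base R sketch. The locations are written directly as nested R lists to avoid depending on a YAML parser; the second location, "Hypothetical site", is invented to show what a failed check looks like.

```r
# Locations as nested lists; the second entry is deliberately incomplete
locations <- list(
  `Round Hill-Nombinnie Nature Reserve` = list(
    `latitude (deg)` = -32.965,
    `longitude (deg)` = 146.161,
    description = "predominantly open Callitris glaucophylla - Eucalyptus populnea woodland"
  ),
  `Hypothetical site` = list(
    `latitude (deg)` = -33.5
  )
)

required <- c("latitude (deg)", "longitude (deg)", "description")

# For each location, list the required variables that are absent
missing_fields <- lapply(locations, function(loc) setdiff(required, names(loc)))
print(missing_fields)
```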

contexts

This section provides contextual characteristics associated with information in traits.

Within the context section is a list of contextual properties, each encapsulating information read in through a different column or created through custom_R_code or as elements within specific traits (see below).

  • context_property: The context property represented by the data in the column specified by var_in.
  • category: The category of contextual data. Options are plot_context (a distinct collection of organisms within a single geographic location, such as plants growing on different aspects or blocks in an experiment), treatment_context (an experimental treatment), entity_context (contextual information to record about the entity that isn’t documented elsewhere, including the entity’s sex or caste), temporal_context (indicating when repeat observations are made on the same individual, population, or taxon across time) and method_context (indicating the same trait was measured on the same individual, population, or taxon using multiple methods).
  • var_in: Name of column with contextual data in the original data submitted.
  • find: The contextual values in the original data submitted (optional)
  • value: The standardised contextual values, aligning syntax and wording with other studies.
  • description: A description of the contextual values.

If the contextual values read in are appropriate and no substitutions are required, the field find can be omitted, with the values from the data.csv column entered under the field value. The field description can likewise be omitted if it is redundant; for instance, if the values are simply sequential observation numbers, times of day, or taxon names (e.g. insect host plants).

As with location, the context properties are not part of a controlled vocabulary, but it is best practice to align syntax with in-use properties whenever possible. These can be identified by running austraits$contexts %>% distinct(context_property).

An example of how the contexts for a study are formatted (modified from Crous_2013) is:

contexts:
- context_property: sampling season
  category: temporal_context
  var_in: month
  values:
  - find: AUG
    value: August
    description: August (late winter)
  - find: DEC
    value: December
    description: December (early summer)
  - find: FEB
    value: February
    description: February (late summer)
- context_property: temperature treatment
  category: treatment_context
  var_in: Temp-trt
  values:
  - value: ambient
    description: Plants grown at ambient temperatures; Jan average max = 29.4 deg
      C / July average min = 3.2 deg C.
  - value: elevated
    description: Plants grown 3 deg C above ambient temperatures.
- context_property: CO2 treatment
  category: treatment_context
  var_in: CO2_Treat
  values:
  - find: ambient CO2
    value: 400 ppm
    description: Plants grown at ambient CO2 (400 ppm).
  - find: added CO2
    value: 640 ppm
    description: Plants grown at elevated CO2 (640 ppm); 240 ppm above ambient.
- context_property: measurement temperature
  category: method_context
  var_in: method_context
  values:
  - find: Measurement made at 20°C
    value: 20°C
    description: Measurement made at 20°C
  - find: Measurement made at 25°C
    value: 25°C
    description: Measurement made at 25°C
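
The find and value pairs in the example above act as a lookup table. As a minimal sketch, the base R function below maps submitted values onto standardised ones, and falls back to value itself when find is omitted. The name standardise_context is hypothetical and the lists are a hand-typed subset of the example; this is not the traits.build implementation.

```r
# Hand-typed subset of the contexts example, as nested lists
month_values <- list(
  list(find = "AUG", value = "August"),
  list(find = "DEC", value = "December"),
  list(find = "FEB", value = "February")
)
temperature_values <- list(
  list(value = "ambient"),
  list(value = "elevated")
)

# Hypothetical helper: replace each submitted value by its standardised
# value; when `find` is absent, the submitted value must equal `value`
standardise_context <- function(x, values) {
  value <- vapply(values, function(v) v$value, character(1))
  find <- vapply(
    values,
    function(v) if (is.null(v$find)) v$value else v$find,
    character(1)
  )
  value[match(x, find)]
}

print(standardise_context(c("AUG", "FEB", "AUG"), month_values))
print(standardise_context(c("elevated", "ambient"), temperature_values))
```

Submitted values that appear in neither find nor value come back as NA, which is a convenient way to spot entries missing from the metadata.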

traits

This section provides a translation table, mapping traits and units from a contributed study onto corresponding variables in AusTraits. The methods used to collect the data are also specified here.

For each trait submitted to AusTraits, there is the following information:

  • var_in: Name of trait in the original data submitted.
  • unit_in: Units of trait in the original data submitted.
  • trait_name: Name of the trait sampled. Allowable values specified in the table definitions.
  • entity_type: A categorical variable specifying the entity corresponding to the trait values recorded.
  • value_type: A categorical variable describing the statistical nature of the trait value recorded.
  • basis_of_record: A categorical variable specifying from which kind of specimen traits were recorded.
  • basis_of_value: A categorical variable describing how the trait value was obtained.
  • replicates: Number of replicate measurements that comprise a recorded trait measurement. A numeric value (or range) is ideal and appropriate if the value type is a mean, median, min or max. For these value types, if replication is unknown the entry should be unknown. If the value type is raw_value the replicate value should be 1. If the trait is categorical or the value indicates a measurement for an entire species (or other taxon) replicate value should be .na.
  • collection_date: Date sample was taken, in the format yyyy-mm-dd, yyyy-mm or yyyy, depending on the resolution specified. Alternatively, an overall range for the study can be indicated, with the starting and ending sample dates separated by a /, as in 2010-10/2011-03.
  • measurement_remarks: Brief comments or notes accompanying the trait measurement.
  • methods: A textual description of the methods used to collect the trait data. Whenever available, methods are taken near-verbatim from the referenced source. Methods can include descriptions such as ‘measured on botanical collections’, ‘data from the literature’, or a detailed description of the field or lab methods used to collect the data.
  • life_stage: A field to indicate the life stage or age class of the entity measured. Standard values are adult, sapling, seedling and juvenile.
  • repeat_measurements_id: A unique integer identifier for repeat measurements of a trait that comprise a single observation, such as a response curve.

The elements trait_name, entity_type, value_type, basis_of_record, and basis_of_value are controlled vocabularies; the values for these elements must be from the list of allowable values. Those for traits are listed in the traits.yml file or vignette. For the other elements, see the database structure vignette.

The fields replicates, basis_of_value, value_type, life_stage, basis_of_record, and measurement_remarks can all be specified at the dataset level or the traits level (which overrides a dataset-level entry). In each case, they can be a fixed text value or indicate a column (within the data.csv file or generated through custom_R_code) that includes the relevant information. In addition, fields can be added to specify a specific context (most commonly a method context, but occasionally a temporal context). If such a field is added, the same name must appear in both the contexts section and for some (or all) of the traits.

Two examples are as follows:

- var_in: LeafP.m
  unit_in: mg/g
  trait_name: leaf_P_per_dry_mass
  entity_type: individual                   # fixed value
  value_type: value_type_column             # referencing a column
  basis_of_value: measurement               # fixed value
  replicates: count                         # referencing a column
  methods: Oven-dried leaf material was used for determination of total leaf nitrogen
    and phosphorus. Dried ground leaf material was hot-digested in acid-peroxide before
    colorimetric analysis using a flow injection system (QuikChem 8500, Lachat Instruments,
    Loveland, Colorado, USA).

and

- var_in: Jmax25
  unit_in: umol/m2/s
  trait_name: Jmax_per_area
  entity_type: individual                    # fixed value
  value_type: raw                            # fixed value
  basis_of_value: measurement                # fixed value
  replicates: 1                              # fixed value
  method_context: 25C                        # optional field
  methods: Controlled photosynthetic CO2 response curve measurements were made using
    Li-Cor 6400 portable infrared gas analysers (LiCor Inc., Lincoln, NE, USA). CO2
    response curves of net CO2 assimilation (Anet) were developed at a constant temperature
    (termed 'Anet-Ci curves') for intact leaves within each tree chamber. These Anet-Ci
    curve measurements progressed at four to five specified leaf temperatures for
    the same leaf (i.e. one leaf per chamber) in each of three seasons (early summer,
    December 2010; late summer, February 2011...
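
The comments in the two examples above distinguish fixed values from references to a column. One plausible way to resolve this, sketched below in base R, is to treat an entry as a column reference whenever it matches a column name in the data and as a fixed value otherwise. The helper expand_field and the small table are hypothetical; the actual rule used by traits.build may differ.

```r
# Illustrative data.csv contents with per-row replicate counts
trait_data <- data.frame(
  LeafP.m = c(1.2, 0.9),
  count = c(5, 3),
  value_type_column = c("mean", "mean"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the named column if the entry matches a
# column name, otherwise repeat the fixed value once per row
expand_field <- function(entry, data) {
  if (is.character(entry) && length(entry) == 1 && entry %in% names(data)) {
    data[[entry]]
  } else {
    rep(entry, nrow(data))
  }
}

print(expand_field("count", trait_data))       # column reference
print(expand_field("individual", trait_data))  # fixed value
```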

substitutions

This section provides a list of any “find and replace” substitutions needed to get the data into the right format.

Substitutions are required whenever the exact word(s) used to describe a categorical trait value in AusTraits is different from the vocabulary used by the author in the data.csv file. It is preferable to align vocabulary using substitutions rather than changing the data.csv file. The trait definitions file provides a list of supported values for each trait.

Each substitution is documented using the following elements:

  • trait_name: Trait where substitutions are required.
  • find: Contributor’s trait value that needs to be changed.
  • replace: AusTraits supported replacement value.

An example is as follows:

substitutions:
- trait_name: life_history
  find: p
  replace: perennial
- trait_name: plant_growth_form
  find: s
  replace: shrub
- ...
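
The effect of these entries can be sketched as a trait-specific find and replace. The base R example below applies the two substitutions shown above to a small long-format table; the helper apply_substitutions and the table are hypothetical and are meant only to show that a substitution affects values of its own trait and no other.

```r
# The two substitutions from the example, as nested lists
substitutions <- list(
  list(trait_name = "life_history", find = "p", replace = "perennial"),
  list(trait_name = "plant_growth_form", find = "s", replace = "shrub")
)

# Illustrative long-format data; the last row has "p" under a different trait
data_long <- data.frame(
  trait_name = c("life_history", "plant_growth_form", "plant_growth_form"),
  value = c("p", "s", "p"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace a value only where both the trait name
# and the contributor's value match the substitution
apply_substitutions <- function(data, substitutions) {
  for (s in substitutions) {
    hit <- data$trait_name == s$trait_name & data$value == s$find
    data$value[hit] <- s$replace
  }
  data
}

cleaned <- apply_substitutions(data_long, substitutions)
print(cleaned)
```

The third row keeps its value "p" because no substitution is defined for that value under plant_growth_form.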

taxonomic_updates

This section provides a table of taxonomic name changes needed to align original names in the dataset with taxon names in the chosen taxonomic reference(s).

Each taxonomic update is documented using the following elements:

  • find: Name given to taxon in the original data supplied by the authors.
  • replace: Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level.
  • reason: Records why the change was implemented, e.g. typos, taxonomic synonyms, and standardising spellings

Algorithms within AusTraits automatically align outdated taxonomy and taxonomic synonyms to their currently accepted scientific name, so such adjustments are not documented as substitutions.

Some examples of taxonomic updates are as follows:

taxonomic_updates:
- find: Drummondita rubroviridis
  replace: Drummondita rubriviridis
  reason: match_07_fuzzy. Fuzzy alignment with accepted canonical name in APC (2022-11-21)
  taxonomic_resolution: Species
- find: Acacia ancistrophylla/sclerophylla
  replace: Acacia sp. [Acacia ancistrophylla/sclerophylla; White_2020]
  reason: match_04. Rewording taxon where `/` indicates uncertain species identification
    to align with `APC accepted` genus (2022-11-10)
  taxonomic_resolution: genus
- find: Polyalthia (Wyvur)
  replace: Polyalthia sp. (Wyvuri B.P.Hyland RFK2632)
  reason: match_15_fuzzy. Fuzzy match alignment with species-level canonical name
    in `APC known` when everything except first 2 words ignored (2022-11-10)
  taxonomic_resolution: Species

questions

This section provides a place to record any queries we have about the dataset (recorded as a named array), including notes on any additional traits that may have been collected in the study but have not been incorporated into AusTraits.

An example is as follows:

questions:
  questions for author: Triglochin procera has very different seed masses in the main traits spreadsheet and the field seeds worksheet. Which is correct? There are a number of species with values in the field leaves worksheet that are absent in the main traits worksheet - we have included this data into Austraits; please advise if this was inappropriate.
  austraits: need to map aquatic_terrestrial onto an actual trait once one is created.

7.6 R/custom_R_code.R

The austraits.build compilation contains an extra folder, R, containing a file custom_R_code.R. This file documents any custom functions used in the compilation, called as part of the custom_R_code section of metadata files.