taxon_name | family | scientific_name_authorship | taxonomic_reference | cleaned_name_taxonomic_status | cleaned_scientific_name_id |
---|---|---|---|---|---|
Abelia x grandiflora | Caprifoliaceae | (Rovelli ex André) Rehder | APC | accepted | https://id.biodiversity.org.au/name/apni/190758 |
Abelmoschus ficulneus | Malvaceae | (L.) Wight | APC | accepted | https://id.biodiversity.org.au/name/apni/55929 |
Abelmoschus manihot | Malvaceae | (L.) Medik. | APC | accepted | https://id.biodiversity.org.au/name/apni/55937 |
Abelmoschus manihot subsp. manihot | Malvaceae | NA | APC | accepted | https://id.biodiversity.org.au/name/apni/116920 |
Abelmoschus manihot subsp. tetraphyllus | Malvaceae | (Roxb. ex Hornem.) Borss.Waalk. | APC | accepted | https://id.biodiversity.org.au/name/apni/55945 |
Abelmoschus moschatus | Malvaceae | Medik. | APC | accepted | https://id.biodiversity.org.au/name/apni/55953 |
Abelmoschus moschatus subsp. biakensis | Malvaceae | (Hochr.) Borss.Waalk. | APC | accepted | https://id.biodiversity.org.au/name/apni/116595 |
Abelmoschus moschatus subsp. moschatus | Malvaceae | NA | APC | accepted | https://id.biodiversity.org.au/name/apni/243806 |
Abelmoschus moschatus subsp. tuberosus | Malvaceae | (Span.) Borss.Waalk. | APC | accepted | https://id.biodiversity.org.au/name/apni/55961 |
Abildgaardia ovata | Cyperaceae | (Burm.f.) Kral | APC | accepted | https://id.biodiversity.org.au/name/apni/150737 |
7 File structure
This chapter describes the typical files you may encounter in a traits.build compilation. The description is based on the austraits.build compilation.
We strongly suggest you create a standalone folder for your repository, e.g. austraits.build. This folder should contain all files needed to build your compilation. We’re big fans of GitHub as a platform for collaboration. If you’re not familiar with git or GitHub, we suggest you check out the Happy Git with R book.
7.1 Repository structure
The main directory for the austraits.build repository contains the following files and folders, with purpose as indicated. Not all of these files are required for a compilation; some are used for extra features such as a website. They are included here for completeness.
Files used for data compilation
├── remake.yml/build.R # instructions for build
├── config # configuration files
├── data # raw data files
├── R # folder with custom functions
├── export # folder for output
└── scripts # scripts for processing files before/after build
R project file
├── traits.build.Rproj # Rstudio project
Files for maintaining a repo on github
├── README.md # landing page
├── .github # folder containing github actions, issue templates, code of conduct
├── LICENCE
├── NEWS.md
├── _pkgdown.yml # used to create pkgdown website
├── docs # contains website
├── Dockerfile # creates an image of R environment used in build
Files used for creation of R package for this compilation
XXX Explain this more
├── NAMESPACE # functions being exported
├── DESCRIPTION # R package description
├── tests # defines tests applied to datasets
├── vignettes # documentation of repo file structure, AusTraits database structure, definitions, data input processes
7.2 /config folder
The folder config contains four files which govern the building of the dataset.
config
├── metadata.yml
├── traits.yml
├── taxon_list.csv
└── unit_conversions.csv
metadata.yml
XXX
traits.yml
The file traits.yml provides the trait definitions used to compile AusTraits, including allowable trait values. The trait definitions are fully described in an additional vignette. A .yml file is a structured data file where information is presented in a hierarchical format (see appendix for details).
taxon_list.csv
The file taxon_list.csv is our master list of known taxa.
XXX Explain this more. Only show essential variables in table below.
unit_conversions.csv
The file unit_conversions.csv defines the unit conversions that are used when converting contributed trait data to common units, e.g.
unit_from | unit_to | function |
---|---|---|
% | mg/g | x*10 |
% | g/g | x*0.01 |
% | mg/mg | x*0.01 |
% | mg/kg | x*10000 |
% | {dimensionless} | x*.01 |
% | {count}/{count} | x*.01 |
{dimensionless} | {count}/{count} | x*1 |
a | mo | x*12 |
{count}/m2 | {count}/mm2 | x*1/1000000 |
cm | m | x*0.01 |
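Each entry in the function column is an R expression written in terms of x. The sketch below shows one way such a row could be applied, assuming the expression is simply evaluated with x bound to the contributed values. The helper convert_units and the small in-memory table are hypothetical and for illustration only; they are not functions or objects provided by traits.build.

```r
# Hypothetical subset of unit_conversions.csv, held in memory for illustration.
unit_conversions <- data.frame(
  unit_from = c("%", "%", "cm"),
  unit_to = c("mg/g", "g/g", "m"),
  `function` = c("x*10", "x*0.01", "x*0.01"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the conversion expression for a pair of units
# and evaluate it with `x` bound to the supplied values.
convert_units <- function(x, from, to, table = unit_conversions) {
  row <- table[table$unit_from == from & table$unit_to == to, , drop = FALSE]
  if (nrow(row) != 1) {
    stop("No unique conversion from '", from, "' to '", to, "'")
  }
  eval(parse(text = row[["function"]]), envir = list(x = x))
}

# Convert two leaf nutrient concentrations from % to mg/g.
convert_units(c(1.2, 3.5), from = "%", to = "mg/g")
```

Storing the conversion as text keeps the table readable, at the cost of evaluating strings as code, so a table like this should only come from a trusted source.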
7.3 /data folder
The folder data contains the raw data from individual studies included in AusTraits.
Records within the data folder are organised as coming from a particular study, defined by the dataset_id. Data from each study are organised into a separate folder, with two files:
- data.csv: a table containing the actual trait data.
- metadata.yml: a file that contains study metadata (source, methods, locations, and context), maps trait names and units onto standard types, and lists any substitutions applied to the data in processing.
The folder data thus contains a long list of folders, one for each study and each containing two files:
data
├── Angevin_2010
│ ├── data.csv
│ └── metadata.yml
├── Barlow_1981
│ ├── data.csv
│ └── metadata.yml
├── Bean_1997
│ ├── data.csv
│ └── metadata.yml
├── ....
where Angevin_2010, Barlow_1981, and Bean_1997 are each a unique dataset_id in the final dataset.
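Because every dataset_id is a folder name and every folder is expected to hold the same two files, the layout is easy to check programmatically. The following is a minimal sketch using base R only. It builds a throwaway copy of the layout in a temporary directory, and the check itself is a hypothetical illustration rather than a traits.build function.

```r
# Build a throwaway example layout in a temporary directory (illustration only).
root <- file.path(tempdir(), "example_compilation", "data")
for (id in c("Angevin_2010", "Barlow_1981")) {
  dir.create(file.path(root, id), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(root, id, c("data.csv", "metadata.yml")))
}

# Each sub-folder name is a dataset_id.
dataset_ids <- list.dirs(root, full.names = FALSE, recursive = FALSE)

# Hypothetical check: does every dataset folder contain both required files?
required <- c("data.csv", "metadata.yml")
complete <- vapply(
  dataset_ids,
  function(id) all(file.exists(file.path(root, id, required))),
  logical(1)
)
complete
```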
7.4 dataset_id/data.csv
The file data.csv contains raw measurements and can be in either long or wide format.
Required columns include the taxon name, the trait name (column in long format, header in wide format), units (column in long format, part of header in wide format), location (if applicable), context (if applicable), date (if available), and trait values.
It is important that all trait measurements made on the same individual or that are the mean of a species’ measurements from the same location are kept linked.
If the data is in wide format, each row should include measurements made on a single individual at a single point in time or a single species-by-location mean, with different trait values as consecutive columns.
If the data is in long format, an additional column, individual_id, is required to ensure multiple trait measurements made on the same individual, or the mean of a species’ measurements from the same location, are linked. If the data is in wide format and there are multiple rows of data for the same individual, an individual_id column should be included. These individual_id columns ensure that related data values remain linked.
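To illustrate why individual_id matters, the sketch below reshapes a small wide table into long format with base R. The data values and column headers are invented for illustration. Numbering the rows before reshaping is what keeps the two traits measured on the same individual linked once they sit on separate rows.

```r
# Hypothetical wide-format data: one row per individual, one column per trait.
wide <- data.frame(
  species = c("Acacia longifolia", "Acacia longifolia", "Banksia serrata"),
  `leaf area (mm2)` = c(510, 480, 2300),
  `wood density (g/cm3)` = c(0.61, 0.64, 0.58),
  check.names = FALSE
)

# Number the rows before reshaping so the link between traits is not lost.
wide$individual_id <- seq_len(nrow(wide))

trait_cols <- c("leaf area (mm2)", "wood density (g/cm3)")

# Stack the trait columns: one row per individual-by-trait combination.
long <- do.call(rbind, lapply(trait_cols, function(trait) {
  data.frame(
    species = wide$species,
    individual_id = wide$individual_id,
    trait_name = trait,
    value = wide[[trait]]
  )
}))

# Rows sharing an individual_id come from the same plant.
long[order(long$individual_id), ]
```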
We aim to keep the data file in the rawest form possible (i.e. with as few changes as possible) but it must be a single csv file. Additional custom R code may be required to make the file exactly compatible with the AusTraits format, but these changes should be executed as AusTraits is compiled and should be in the metadata.yml file under dataset/custom_R_code (see below). Any files used to create the submitted data.csv file (e.g. Excel …) should be archived in a sub-folder within the study folder named raw.
7.5 dataset_id/metadata.yml
The metadata is compiled in a .yml file, a structured data file where information is presented in a hierarchical format (see Appendix for details). There are 10 values at the top hierarchical level: source, contributors, dataset, locations, contexts, traits, substitutions, taxonomic_updates, exclude_observations, questions. These are each described below.
As a start, you may want to check out some examples from existing studies in AusTraits, e.g. Angevin_2010 or Wright_2009.
source
This section provides citation details for the original source(s) for the data, whether it is a published journal article, book, website, or thesis. In general we aim to reference the primary source. References are written in structured yml format, under the category source and then under sub-groupings primary, secondary, and original. A reference is designated as secondary if it is a second publication by the data collector that analyses the data. When the primary reference is a compilation of multiple sources for a meta-analysis, the original references are designated as original.
General guidelines for describing a source include:
- A maximum of one primary source is allowed.
- Elements are named as in BibTeX format.
- Keys should be named in the format Surname_year, and the key of the primary source is almost always identical to the name given to the dataset folder. A second instance of the identical Surname_year should have the key Surname_year_2.
- One or more secondary sources may be included if traits from a single dataset were presented in two different manuscripts. Multiple sources are also appropriate if an author has compiled data from a number of sources, which are not individually in AusTraits, for a published or unpublished compilation.
- If your data is from an unpublished study, only include the elements that are applicable.
- If someone has transcribed a published source, the primary source will be the published work and the person who has completed the transcription will be acknowledged as the contributor of the dataset.
An example of a primary source that is a journal article is:
source:
  primary:
    key: Falster_2005_1
    bibtype: Article
    author: Daniel S. Falster, Mark Westoby
    year: 2005
    title: Alternative height strategies among 45 dicot rain forest species from tropical Queensland, Australia
    journal: Journal of Ecology
    volume: 93
    pages: 521--535
    publisher: Wiley-Blackwell
    doi: 10.1111/j.0022-0477.2005.00992.x
If a secondary source is included it may look like:
primary:
  key: Choat_2006
  bibtype: Article
  year: '2006'
  author: B. Choat and M. C. Ball and J. G. Luly and C. F. Donnelly and J. A. M.
    Holtum
  journal: Tree Physiology
  title: Seasonal patterns of leaf gas exchange and water relations in dry rain
    forest trees of contrasting leaf phenology
  volume: '26'
  number: '5'
  pages: 657--664
  doi: 10.1093/treephys/26.5.657
secondary:
  key: Choat_2005
  bibtype: Article
  year: '2005'
  author: Brendan Choat and Marilyn C. Ball and Jon G. Luly and Joseph A. M. Holtum
  journal: Trees
  title: Hydraulic architecture of deciduous and evergreen dry rainforest tree species
    from north-eastern Australia
  volume: '19'
  number: '3'
  pages: 305--311
  doi: 10.1007/s00468-004-0392-1
contributors
This section provides a list of contributors to the study, their respective affiliations, roles in the study, and ORCIDs. The following information is recorded for each data contributor:
key | value |
---|---|
last_name | Last name of data collector. |
given_name | Given name of data collector. |
affiliation | Affiliation of data collector. |
ORCID | ORCID ID (Open Researcher and Contributor ID) for the data collector, if available. |
notes | optional notes for the data collector. |
additional_role | Any additional roles the data collector had in the study, a field most frequently used to identify which data contributor is the contact person for the dataset. |
An example is as follows:
data_collectors:
- last_name: Falster
  given_name: Daniel
  ORCID: 0000-0002-9814-092X
  affiliation: Evolution & Ecology Research Centre, School of Biological, Earth,
    and Environmental Sciences, UNSW Sydney, Australia
  additional_role: contact
- last_name: Westoby
  given_name: Mark
  ORCID: 0000-0001-7690-4530
  affiliation: Department of Biological Sciences, Macquarie University, Australia
Note that only the AusTraits custodians have the contributors’ e-mail addresses on file. This information will not be directly available to AusTraits users or new contributors via Github.
Additional fields within contributors are:
- assistants, names of additional people who played a more minor role in data collection for the study.
- dataset_curators, names of AusTraits team member(s) who contacted the data collectors and added the study to the AusTraits repository.
dataset
This section includes study details, including the format of the data, custom R code applied to the data, and various descriptors. The value entered for each element can be either a header for a column within the data.csv file or the actual value to be used.
The following elements are included under the element dataset:
- data_is_long_format: Indicates if the data spreadsheet has a vertical (long) or horizontal (wide) configuration with yes or no terminology.
- custom_R_code: A field where additional R code can be included. This allows for custom manipulation of the data in the submitted spreadsheet into a different format for easy integration with AusTraits. .na indicates no custom R code was used.
- collection_date: Date sample was taken, in the format yyyy-mm-dd, yyyy-mm or yyyy, depending on the resolution specified. Alternatively an overall range for the study can be indicated, with the starting and ending sample dates separated by a /, as in 2010-10/2011-03.
- taxon_name: Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level.
- location_name: location name
- source_id: For datasets that are compilations, an identifier for the original data source.
- individual_id: A unique integer identifier for an individual, with individuals numbered sequentially within each dataset by taxon by population grouping. Most often each row of data represents an individual, but in some datasets trait data collected on a single individual is presented across multiple rows of data, such as if the same trait is measured using different methods or the same individual is measured repeatedly across time.
- repeat_measurements_id: A unique integer identifier for repeat measurements of a trait that comprise a single observation, such as a response curve.
- trait_name: Element required for long datasets to specify the column indicating the trait name associated with each row of data.
- value: The measured value of a trait.
- description: A 1-2 sentence description of the purpose of the study.
- basis_of_record: A categorical variable specifying from which kind of specimen traits were recorded.
- life_stage: A field to indicate the life stage or age class of the entity measured. Standard values are adult, sapling, seedling and juvenile.
- sampling_strategy: A written description of how study locations were selected and how study individuals were selected. When available, this information is lifted verbatim from a published manuscript. For preserved specimens, this field ideally indicates which records were ‘sampled’ to measure a specific trait.
- measurement_remarks: Brief comments or notes accompanying the trait measurement.
- original_file: The name of the file initially submitted to AusTraits.
- notes: Generic notes about the study and processing of data.
Of these, the fields collection_date, life_stage, basis_of_record, and measurement_remarks can all be specified at the dataset level or the traits level (which overrides a dataset-level entry) or location level (which also overrides a dataset-level entry). In each case, they can be a fixed text value or indicate a column within the data.csv file (or generated through custom_R_code) that includes the relevant information.
- life_stage, basis_of_record, and collection_date are usually included under metadata$dataset unless they vary by trait.
- entity_type, replicates, basis_of_value, and value_type are usually different across traits and are usually mapped under the metadata$traits section (see below), but are allowed to be specified for the entire dataset in this section.
- trait_name and value are only specified in metadata$dataset for long-format datasets.
- measurement_remarks and individual_id are only included if required. They are absent from the majority of datasets.
An example is as follows:
data_is_long_format: no
custom_R_code: '
  data %>%
    mutate(
      location_name = "Howard River catchment",
      date = date %>% mdy()
    ) %>%
    arrange(date) %>%
    group_by(Tree) %>%
    mutate(observation_number = dplyr::row_number()) %>%
    ungroup() %>%
    group_by(species) %>%
    mutate(across(c("specific leaf area (m2 kg-1)"), replace_duplicates_with_NA)) %>%
    ungroup()
  '
collection_date: date
taxon_name: species
context_name: context
location_name: location_name
individual_id: Tree
description: Measurements of stem CO2 efflux and leaf gas exchange in a tropical
  savanna ecosystem in northern Australia, and assessed the impact of fire on these
  processes.
basis_of_record: field
life_stage: adult
sampling_strategy: The stem CO2 efflux was initially measured at two locations,
  each of which was nested within a 3 km 2 plot...
original_file: leaf_summary.xls, Rbranch summary2.xls, and Rstem summary6.xls submitted
  by Lucas Cernusak and archived in the raw data folder and GoogleDrive folder.
notes: none
A common use of custom_R_code is to automate the conversion of a verbal description of flowering or fruiting periods into the supported trait values. It might also be used if values for a single trait are expressed across multiple columns and need to be merged. See Catford_2014 as an example of this. The adding data vignette provides additional examples of code regularly implemented in custom_R_code, including functions that were developed specifically for AusTraits data manipulations and are in the file scripts\custom.R.
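The collection_date formats accepted in this section (yyyy-mm-dd, yyyy-mm, yyyy, or two such dates separated by a /) can be checked with a regular expression before a dataset is built. The function below is a hypothetical helper, not part of traits.build, and it only checks the shape of the string: it does not confirm that the month or day is a real calendar value.

```r
# Hypothetical validator for the collection_date formats described in this section.
is_valid_collection_date <- function(x) {
  # A single date: yyyy, optionally followed by -mm, optionally followed by -dd.
  single <- "[0-9]{4}(-[0-9]{2}(-[0-9]{2})?)?"
  # Either one date, or a start and end date separated by "/".
  pattern <- paste0("^", single, "(/", single, ")?$")
  grepl(pattern, x)
}

is_valid_collection_date(
  c("2010-10-05", "2010-10", "2010", "2010-10/2011-03", "10/2010")
)
```

Only the last entry is rejected, because it does not begin with a four-digit year.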
locations
This section provides a list of study locations (sites) and information about each of the study locations where data were collected. Each should include at least three variables: latitude (deg), longitude (deg) and description. Additional variables can be included where available. Set to .na for botanical collections and field studies where data values are a mean across many locations.
Although the properties listed under each location are not part of a controlled vocabulary, it is best practice to align with in-use properties whenever possible. These can be identified by running austraits$locations %>% distinct(location_property).
An example of how a location and its properties, and the value of each property are listed (modified from Vesk_2019), is:
Round Hill-Nombinnie Nature Reserve:
  latitude (deg): -32.965
  longitude (deg): 146.161
  precipitation, MAP (mm): 370
  temperature, summer mean (C): 32.5
  temperature, winter mean (C): 14.2
  soil type: loamy red sands light red clays and light red browns earths
  description: predominantly open Callitris glaucophylla - Eucalyptus populnea woodland
    and Eucalyptus dumosa - E. socialis shrub mallee woodland
  fire frequency (years): 5-20 years
contexts
This section provides contextual characteristics associated with information in traits.
Within the contexts section is a list of contextual properties, each encapsulating information read in through a different column or created through custom_R_code or as elements within specific traits (see below).
- context_property: The context property represented by the data in the column specified by var_in.
- category: The category of contextual data. Options are plot (a distinct collection of organisms within a single geographic location, such as plants growing on different aspects or blocks in an experiment), treatment (an experimental treatment), entity_context (contextual information to record about the entity that isn’t documented elsewhere, including the entity’s sex or caste), temporal (indicating when repeat observations are made on the same individual (or population, or taxon) across time) and method (indicating the same trait was measured on the same individual (or population, or taxon) using multiple methods).
- var_in: Name of column with contextual data in the original data submitted.
- find: The contextual values in the original data submitted (optional)
- value: The standardised contextual values, aligning syntax and wording with other studies.
- description: A description of the contextual values.
If the contextual values read in are appropriate and no substitutions are required, the field find can be omitted, with the values from the data.csv column entered under the field value. The field description can likewise be omitted if it is redundant; for instance, if the values are simply sequential observation numbers, times of day, or taxon names (e.g. insect host plants).
As with location, the context properties are not part of a controlled vocabulary, but it is best practice to align syntax with in-use properties whenever possible. These can be identified by running austraits$contexts %>% distinct(context_property).
An example of how the contexts for a study are formatted (modified from Crous_2013), is:
contexts:
- context_property: sampling season
  category: temporal_context
  var_in: month
  values:
  - find: AUG
    value: August
    description: August (late winter)
  - find: DEC
    value: December
    description: December (early summer)
  - find: FEB
    value: February
    description: February (late summer)
- context_property: temperature treatment
  category: treatment_context
  var_in: Temp-trt
  values:
  - value: ambient
    description: Plants grown at ambient temperatures; Jan average max = 29.4 deg
      C / July average min = 3.2 deg C.
  - value: elevated
    description: Plants grown 3 deg C above ambient temperatures.
- context_property: CO2 treatment
  category: treatment_context
  var_in: CO2_Treat
  values:
  - find: ambient CO2
    value: 400 ppm
    description: Plants grown at ambient CO2 (400 ppm).
  - find: added CO2
    value: 640 ppm
    description: Plants grown at elevated CO2 (640 ppm); 240 ppm above ambient.
- context_property: measurement temperature
  category: method_context
  var_in: method_context
  values:
  - find: Measurement made at 20°C
    value: 20°C
    description: Measurement made at 20°C
  - find: Measurement made at 25°C
    value: 25°C
    description: Measurement made at 25°C
traits
This section provides a translation table, mapping traits and units from a contributed study onto corresponding variables in AusTraits. The methods used to collect the data are also specified here.
For each trait submitted to AusTraits, there is the following information:
- var_in: Name of trait in the original data submitted.
- unit_in: Units of trait in the original data submitted.
- trait_name: Name of the trait sampled. Allowable values are specified in the table definitions.
- entity_type: A categorical variable specifying the entity corresponding to the trait values recorded.
- value_type: A categorical variable describing the statistical nature of the trait value recorded.
- basis_of_record: A categorical variable specifying from which kind of specimen traits were recorded.
- basis_of_value: A categorical variable describing how the trait value was obtained.
- replicates: Number of replicate measurements that comprise a recorded trait measurement. A numeric value (or range) is ideal and appropriate if the value type is a mean, median, min or max. For these value types, if replication is unknown the entry should be unknown. If the value type is raw_value the replicate value should be 1. If the trait is categorical or the value indicates a measurement for an entire species (or other taxon) the replicate value should be .na.
- collection_date: Date sample was taken, in the format yyyy-mm-dd, yyyy-mm or yyyy, depending on the resolution specified. Alternatively an overall range for the study can be indicated, with the starting and ending sample dates separated by a /, as in 2010-10/2011-03.
- measurement_remarks: Brief comments or notes accompanying the trait measurement.
- methods: A textual description of the methods used to collect the trait data. Whenever available, methods are taken near-verbatim from the referenced source. Methods can include descriptions such as ‘measured on botanical collections’, ‘data from the literature’, or a detailed description of the field or lab methods used to collect the data.
- life_stage: A field to indicate the life stage or age class of the entity measured. Standard values are adult, sapling, seedling and juvenile.
- repeat_measurements_id: A unique integer identifier for repeat measurements of a trait that comprise a single observation, such as a response curve.
The elements trait_name, entity_type, value_type, basis_of_record, and basis_of_value are controlled vocabularies; the values for these elements must be from the list of allowable values. Those for traits are listed in the traits.yml file or vignette. For the other elements, see the database structure vignette.
The fields replicates, basis_of_value, value_type, life_stage, basis_of_record, and measurement_remarks can all be specified at the dataset level or the traits level (which overrides a dataset-level entry). In each case, they can be a fixed text value or indicate a column (within the data.csv file or generated through custom_R_code) that includes the relevant information. In addition, fields can be added to specify a specific context (most commonly a method context, but occasionally a temporal context). If such a field is added, the same name must appear in both the contexts section and for some (or all) of the traits.
Two examples are as follows:
- var_in: LeafP.m
  unit_in: mg/g
  trait_name: leaf_P_per_dry_mass
  entity_type: individual # fixed value
  value_type: value_type_column # referencing a column
  basis_of_value: measurement # fixed value
  replicates: count # referencing a column
  methods: Oven-dried leaf material was used for determination of total leaf nitrogen
    and phosphorus. Dried ground leaf material was hot-digested in acid-peroxide before
    colorimetric analysis using a flow injection system (QuikChem 8500, Lachat Instruments,
    Loveland, Colorado, USA).
and
- var_in: Jmax25
  unit_in: umol/m2/s
  trait_name: Jmax_per_area
  entity_type: individual # fixed value
  value_type: raw # fixed value
  basis_of_value: measurement # fixed value
  replicates: 1 # fixed value
  method_context: 25C # optional field
  methods: Controlled photosynthetic CO2 response curve measurements were made using
    Li-Cor 6400 portable infrared gas analysers (LiCor Inc., Lincoln, NE, USA). CO2
    response curves of net CO2 assimilation (Anet) were developed at a constant temperature
    (termed 'Anet-Ci curves') for intact leaves within each tree chamber. These Anet-Ci
    curve measurements progressed at four to five specified leaf temperatures for
    the same leaf (i.e. one leaf per chamber) in each of three seasons (early summer,
    December 2010; late summer, February 2011...
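The override rule described in this section, where a traits-level entry takes precedence over a dataset-level entry, can be sketched as a simple lookup. The lists and the helper resolve_field below are hypothetical and only illustrate the precedence; they do not reproduce how traits.build resolves these fields internally.

```r
# Hypothetical metadata fragments, written as R lists for illustration.
dataset_level <- list(life_stage = "adult", basis_of_record = "field")
trait_level <- list(var_in = "Jmax25", life_stage = "sapling")

# Hypothetical helper: take the traits-level entry when present, otherwise
# fall back to the dataset-level entry, otherwise return NA.
resolve_field <- function(field, trait, dataset) {
  if (!is.null(trait[[field]])) return(trait[[field]])
  if (!is.null(dataset[[field]])) return(dataset[[field]])
  NA_character_
}

# Specified at both levels: the traits-level value is used.
resolve_field("life_stage", trait_level, dataset_level)

# Specified at the dataset level only: the dataset-level value is used.
resolve_field("basis_of_record", trait_level, dataset_level)
```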
substitutions
This section provides a list of any “find and replace” substitutions needed to get the data into the right format.
Substitutions are required whenever the exact word(s) used to describe a categorical trait value in AusTraits is different from the vocabulary used by the author in the data.csv file. It is preferable to align vocabulary using substitutions rather than changing the data.csv file. The trait definitions file provides a list of supported values for each trait.
Each substitution is documented using the following elements:
- trait_name: Trait where substitutions are required.
- find: Contributor’s trait value that needs to be changed.
- replace: AusTraits supported replacement value.
An example is as follows:
substitutions:
- trait_name: life_history
  find: p
  replace: perennial
- trait_name: plant_growth_form
  find: s
  replace: shrub
- ...
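The effect of a substitution can be pictured as an exact-match find and replace that is scoped to a single trait. The sketch below applies the two substitutions from the example to a small invented table using base R. It is an illustration under that assumption, not the code traits.build runs.

```r
# Hypothetical long-format trait data as supplied by a contributor.
traits <- data.frame(
  trait_name = c("life_history", "life_history", "plant_growth_form"),
  value = c("p", "annual", "s"),
  stringsAsFactors = FALSE
)

# The two substitutions from the example, held as a lookup table.
substitutions <- data.frame(
  trait_name = c("life_history", "plant_growth_form"),
  find = c("p", "s"),
  replace = c("perennial", "shrub"),
  stringsAsFactors = FALSE
)

# Replace a value only when both the trait name and the value match exactly.
for (i in seq_len(nrow(substitutions))) {
  hit <- traits$trait_name == substitutions$trait_name[i] &
    traits$value == substitutions$find[i]
  traits$value[hit] <- substitutions$replace[i]
}

traits
```

Matching on the trait name as well as the value means a short code such as s is only rewritten for the trait it was defined for.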
taxonomic_updates
This section provides a table of taxonomic name changes needed to align original names in the dataset with taxon names in the chosen taxonomic reference(s).
Each substitution is documented using the following elements:
- find: Name given to taxon in the original data supplied by the authors.
- replace: Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level.
- reason: Records why the change was implemented, e.g. typos, taxonomic synonyms, and standardising spellings.
Algorithms within AusTraits automatically align outdated taxonomy and taxonomic synonyms to their currently accepted scientific name, so such adjustments are not documented as substitutions.
Some examples of taxonomic updates are as follows:
taxonomic_updates:
- find: Drummondita rubroviridis
  replace: Drummondita rubriviridis
  reason: match_07_fuzzy. Fuzzy alignment with accepted canonical name in APC (2022-11-21)
  taxonomic_resolution: Species
- find: Acacia ancistrophylla/sclerophylla
  replace: Acacia sp. [Acacia ancistrophylla/sclerophylla; White_2020]
  reason: match_04. Rewording taxon where `/` indicates uncertain species identification
    to align with `APC accepted` genus (2022-11-10)
  taxonomic_resolution: genus
- find: Polyalthia (Wyvur)
  replace: Polyalthia sp. (Wyvuri B.P.Hyland RFK2632)
  reason: match_15_fuzzy. Fuzzy match alignment with species-level canonical name
    in `APC known` when everything except first 2 words ignored (2022-11-10)
  taxonomic_resolution: Species
questions
This section provides a place to record any queries we have about the dataset (recorded as a named array), including notes on any additional traits that may have been collected in the study but have not been incorporated into AusTraits.
An example is as follows:
questions:
  questions for author: Triglochin procera has very different seed masses in the main traits spreadsheet and the field seeds worksheet. Which is correct? There are a number of species with values in the field leaves worksheet that are absent in the main traits worksheet - we have included this data into Austraits; please advise if this was inappropriate.
  austraits: need to map aquatic_terrestrial onto an actual trait once one is created.
7.6 R/custom_R_code.R
The austraits.build compilation contains an extra folder, R, containing a file custom_R_code.R. This file documents any custom functions used in the compilation, called as part of the custom_R_code section of metadata files.