Structure of AusTraits data compilation

This document describes the structure of the AusTraits compilation.

AusTraits is an ongoing collaborative community resource, and we would appreciate your contribution on any of the following:

Reporting Errors: If you notice a possible error in AusTraits, please get in contact.
Refining documentation: We welcome additions and edits that make using the existing data or adding new data easier for the community.
Contributing new data: We gladly accept new data contributions to AusTraits. For full instructions on preparing data for inclusion in AusTraits, please get in contact.

AusTraits is essentially a series of linked components, which cross-link against each other:

austraits
├── traits
├── sites
├── contexts
├── methods
├── excluded_data
├── taxonomic_updates
├── taxa
├── definitions
├── contributors
├── sources
└── build_info

These include all the data and contextual information submitted with each contributed dataset. It is essential that users of AusTraits data are confident the data have the meaning they expect and were collected using methods they trust. As such, each dataset within AusTraits must include descriptions of the study, sites, and methods used as well as the data itself.

Components

The core components are defined as follows.

traits

Description: A table containing measurements of plant traits.

Content:

key	value
dataset_id	Primary identifier for each study contributed into AusTraits; most often these are scientific papers, books, or online resources. By default should be name of first author and year of publication, e.g. `Falster_2005`.
taxon_name	Currently accepted name of taxon in the Australian Plant Census or, for unplaced species, in the Australian Plant Names Index.
site_name	Name of site where individual was sampled. Cross-references between similar columns in `sites` and `traits`.
context_name	Name of contextual scenario where individual was sampled. Cross-references between similar columns in `contexts` and `traits`.
observation_id	A unique identifier for the observation, useful for joining traits coming from the same `observation_id`. These are assigned automatically, based on the `dataset_id` and row number of the raw data.
trait_name	Name of trait sampled. Allowable values specified in the table `traits`.
value	Measured value.
unit	Units of the sampled trait value after aligning with AusTraits standards.
date	Date sample was taken, in the format `yyyy-mm-dd`, but with days and months only when specified.
value_type	A categorical variable describing the type of trait value recorded.
replicates	Number of replicate measurements that comprise the data points for the trait for each measurement. A numeric value (or range) is ideal and appropriate if the value type is a `mean`, `median`, `min` or `max`. For these value types, if replication is unknown the entry should be `unknown`. If the value type is `raw_value` the replicate value should be 1. If the value type is `expert_mean`, `expert_min`, or `expert_max` the replicate value should be `.na`.
original_name	Name given to taxon in the original data supplied by the authors
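
The rules for `replicates` can be written down as a small validation function. The following is a minimal Python sketch, not part of AusTraits itself. The function name `replicates_ok` is hypothetical, and the sketch makes two assumptions: that the `mean`, `median`, `min` and `max` value types correspond to any `value_type` ending in `_mean`, `_min` or `_max` (other than the `expert_*` types), and that a range is written as two integers joined by a hyphen.

```python
import re

# Assumption: aggregate value types are those ending in _mean, _min or _max,
# excluding the expert_* types, which have their own rule.
AGGREGATE_SUFFIXES = ("_mean", "_min", "_max")


def replicates_ok(value_type, replicates):
    """Return True if `replicates` is consistent with `value_type`."""
    replicates = str(replicates)
    if value_type == "raw_value":
        return replicates == "1"
    if value_type.startswith("expert_"):
        return replicates == ".na"
    if value_type.endswith(AGGREGATE_SUFFIXES):
        # A count ("5"), a range ("3-10", assumed format), or "unknown".
        return (
            replicates == "unknown"
            or re.fullmatch(r"\d+(-\d+)?", replicates) is not None
        )
    # Other value types are not constrained by this sketch.
    return True
```

For example, `replicates_ok("site_mean", "unknown")` is accepted, whereas `replicates_ok("raw_value", 5)` is rejected.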

sites

Description: A table containing observations of site characteristics associated with information in traits. Cross referencing between the two dataframes is possible using combinations of the variables dataset_id, site_name.

Content:

key	value
dataset_id	Primary identifier for each study contributed into AusTraits; most often these are scientific papers, books, or online resources. By default should be name of first author and year of publication, e.g. `Falster_2005`.
site_name	Name of site where individual was sampled. Cross-references between similar columns in `sites` and `traits`.
site_property	The site characteristic being recorded. Name should include units of measurement, e.g. `longitude (deg)`. Ideally we have at least these variables for each site - `longitude (deg)`, `latitude (deg)`, `description`.
value	Measured value.

contexts

Description: A table containing observations of contextual characteristics associated with information in traits. Cross referencing between the two dataframes is possible using combinations of the variables dataset_id, context_name.

Content:

key	value
dataset_id	Primary identifier for each study contributed into AusTraits; most often these are scientific papers, books, or online resources. By default should be name of first author and year of publication, e.g. `Falster_2005`.
context_name	Name of contextual scenario where individual was sampled. Cross-references between similar columns in `contexts` and `traits`.
context_property	The contextual characteristic being recorded. Name should include units of measurement, e.g. `elevation (m)`.
value	Measured value.

methods

Description: A table containing details on methods with which data were collected, including time frame and source.

Content:

key	value
dataset_id	Primary identifier for each study contributed into AusTraits; most often these are scientific papers, books, or online resources. By default should be name of first author and year of publication, e.g. `Falster_2005`.
trait_name	Name of trait sampled. Allowable values specified in the table `traits`.
methods	A textual description of the methods used to collect the trait data. Whenever available, methods are taken near-verbatim from the referenced source. Methods can include descriptions such as ‘measured on botanical collections’, ‘data from the literature’, or a detailed description of the field or lab methods used to collect the data.
year_collected_start	The year data collection commenced.
year_collected_end	The year data collection was completed.
description	A 1-2 sentence description of the purpose of the study.
collection_type	A field to indicate where the majority of plants on which traits were measured were collected - in the `field`, `lab`, `glasshouse`, `botanical collection`, or `literature`. The latter should only be used when the data were sourced from the literature and the collection type is unknown.
sample_age_class	A field to indicate if the study was completed on `adult` or `juvenile` plants.
sampling_strategy	A written description of how study sites were selected and how study individuals were selected. When available, this information is lifted verbatim from a published manuscript. For botanical collections, this field ideally indicates which records were ‘sampled’ to measure a specific trait.
source_primary_citation	Citation for primary source. This detail is generated from the primary source in the metadata.
source_primary_key	Citation key for primary source in `sources`. The key is typically of format `Surname_year`.
source_secondary_citation	Citations for secondary source. This detail is generated from the secondary source in the metadata.
source_secondary_key	Citation key for secondary source in `sources`. The key is typically of format `Surname_year`.

excluded_data

Description: A table of data that did not pass quality tests and so were excluded from the master dataset.

taxonomic_updates

Description: A table of all taxonomic changes implemented in the construction of AusTraits. Changes are determined by comparing against the APC (Australian Plant Census) and APNI (Australian Plant Names Index).

Content:

key	value
dataset_id	Primary identifier for each study contributed into AusTraits; most often these are scientific papers, books, or online resources. By default should be name of first author and year of publication, e.g. `Falster_2005`.
original_name	Name given to taxon in the original data supplied by the authors
aligned_name	Name of the taxon after implementing any changes encoded for this taxon in the metadata file for the corresponding `dataset_id`.
taxonIDClean	Where it could be identified, the `taxonID` of the aligned_name for this taxon in the APC.
taxonomicStatusClean	Taxonomic status of the taxon identified by `taxonIDClean` in the APC.
alternativeTaxonomicStatusClean	The status of alternative records with the name `aligned_name` in the APC.
acceptedNameUsageID	ID of the accepted name for taxon in the APC or APNI.
taxon_name	Currently accepted name of taxon in the Australian Plant Census or, for unplaced species, in the Australian Plant Names Index.

taxa

Description: A table containing details on taxa associated with information in traits. This information has been sourced from the APC (Australian Plant Census) and APNI (Australian Plant Names Index) and is released under a CC-BY3 license.

Content:

key	value
taxon_name	Currently accepted name of taxon in the Australian Plant Census or, for unplaced species, in the Australian Plant Names Index.
source	Source of taxonomic information, either APC or APNI.
acceptedNameUsageID	Identifier for the accepted name of the taxon.
scientificNameAuthorship	Authority for the accepted name of the taxon indicated under taxon_name.
taxonRank	Rank of the taxon.
taxonomicStatus	Taxonomic status of the taxon.
family	Family of the taxon.
genus	Genus of the taxon.
taxonDistribution	Known distribution of the taxon.
ccAttributionIRI	Source of taxonomic information.

definitions

Description: A copy of the definitions for all tables and terms. Information included here was used to process data and generate any documentation for the study.

contributors

Description: A table of people contributing to each study.

Content:

key	value
dataset_id	Primary identifier for each study contributed into AusTraits; most often these are scientific papers, books, or online resources. By default should be name of first author and year of publication, e.g. `Falster_2005`.
name	Name of contributor
institution	Last known institution or affiliation
role	Their role in the study

sources

Description: Bibtex entries for all primary and secondary sources in the compilation.

build_info

Description: A description of the computing environment used to create this version of the dataset, including version number, git commit and R session_info.

Dataset IDs

The core organising unit behind AusTraits is the dataset_id. Records are organised as coming from a particular study, defined by the dataset_id. Our preferred format for dataset_id is the surname of the first author of any corresponding publication, followed by the year, as surname_year, e.g. Falster_2005. Where multiple studies would share the same id, we add a suffix _2, _3, etc., e.g. Falster_2005, Falster_2005_2.
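
The naming convention can be illustrated with a short Python sketch. This is not the code AusTraits uses; `make_dataset_id` is a hypothetical helper that assumes the ids already in use are available as a set.

```python
def make_dataset_id(surname, year, existing_ids):
    """Build a surname_year id, adding _2, _3, ... if the id is already taken."""
    base = f"{surname}_{year}"
    if base not in existing_ids:
        return base
    suffix = 2
    while f"{base}_{suffix}" in existing_ids:
        suffix += 1
    return f"{base}_{suffix}"
```

With no existing ids, `make_dataset_id("Falster", 2005, set())` gives `Falster_2005`; if that id is already taken, the next call gives `Falster_2005_2`.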

Observation IDs

As well as a dataset_id, each trait measurement has an associated observation_id. Observation IDs bind together related measurements within any dataset, and thereby allow transformation between long (e.g. with variables trait_name and value) and wide (e.g. with traits as columns) formats.

Generally, observation_id has the format dataset_id_XX where XX is a unique number within each dataset. For example, if multiple traits were collected on the same individual, the observation_id allows us to gather these together. For floras reporting species averages, the observation_id is assigned at the species level.

For datasets that arrive in wide format we assume each row has a unique observation_id. For datasets that arrive in long format, the observation_id is assigned based on a specified grouping variable. This variable can be specified in the metadata.yml file under the section variable_match. If missing, observation_id is assigned based on species_name.
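
To show how observation_id supports the move from long to wide format, here is a minimal Python sketch. It assumes trait records are held as a list of dictionaries with the columns described for the traits table; the function name `long_to_wide` and the example records are hypothetical.

```python
from collections import defaultdict


def long_to_wide(trait_rows):
    """Group long-format trait records into one dict per observation_id."""
    wide = defaultdict(dict)
    for row in trait_rows:
        wide[row["observation_id"]][row["trait_name"]] = row["value"]
    return dict(wide)


# Hypothetical records: two traits measured on the same individual,
# plus one trait measured on a second individual.
rows = [
    {"observation_id": "Falster_2005_01", "trait_name": "leaf_area", "value": "120"},
    {"observation_id": "Falster_2005_01", "trait_name": "plant_height", "value": "2.5"},
    {"observation_id": "Falster_2005_02", "trait_name": "leaf_area", "value": "95"},
]
wide = long_to_wide(rows)
```

Each key of `wide` is an observation_id, and each value holds the traits recorded for that observation as columns.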

Site names

As well as dataset_id and observation_id, where appropriate, trait values are associated with a site_name. Unique combinations of dataset_id and site_name can be used to cross-match against the sites table, which provides further details on the site sampled.
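
The cross-match can be sketched in Python as a lookup keyed on the pair (dataset_id, site_name). This is an illustration only: it assumes both tables are held as lists of dictionaries, and the helper names `index_sites` and `site_for`, as well as the example rows, are hypothetical.

```python
def index_sites(site_rows):
    """Map (dataset_id, site_name) to a dict of site properties."""
    index = {}
    for row in site_rows:
        key = (row["dataset_id"], row["site_name"])
        index.setdefault(key, {})[row["site_property"]] = row["value"]
    return index


def site_for(trait_row, site_index):
    """Return the site properties for a trait record, or {} if none match."""
    key = (trait_row["dataset_id"], trait_row["site_name"])
    return site_index.get(key, {})


# Hypothetical rows, for illustration only.
sites = [
    {"dataset_id": "Falster_2005", "site_name": "site_A",
     "site_property": "latitude (deg)", "value": "-33.7"},
    {"dataset_id": "Falster_2005", "site_name": "site_A",
     "site_property": "longitude (deg)", "value": "151.1"},
]
trait = {"dataset_id": "Falster_2005", "site_name": "site_A",
         "trait_name": "leaf_area", "value": "120"}
site_info = site_for(trait, index_sites(sites))
```

The same pattern should carry over to the contexts table by keying on dataset_id and context_name instead.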

Context names

As well as dataset_id, observation_id, and site_name, where appropriate, trait values are associated with a context_name. Unique combinations of dataset_id and context_name can be used to cross-match against the contexts table, which provides further details on the context sampled.

Values and Value types

Each record in the table of trait data has an associated value and value_type.

Traits are either numeric or categorical. For traits with numerical values, the recorded value has been converted into standardised units and we have checked that the value can be converted into a number and lies within the allowable range. For categorical variables, we only include records that are defined in the definitions. Moreover, we use a format whereby:

we use _ for multi-word terms, e.g. semi_deciduous
we use a space where there are two possible values for that trait, e.g. annual biennial for something that is either annual or biennial
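
Because underscores join the words of a single term and spaces separate alternative values, a categorical entry can be split on whitespace to recover its possible values. The Python sketch below illustrates this; the function names and the set of allowed values passed in are hypothetical and are not taken from the AusTraits definitions.

```python
def split_categorical(value):
    """Split a categorical trait value into its possible values."""
    return value.split()


def is_allowed(value, allowed_values):
    """True if every possible value is listed among the allowed values."""
    parts = split_categorical(value)
    return bool(parts) and all(part in allowed_values for part in parts)
```

Here `split_categorical("annual biennial")` returns the two values `annual` and `biennial`, while `semi_deciduous` stays a single term.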

Each trait measurement also has an associated value_type, a categorical variable describing the type of trait value recorded. Possible values are:

key	value
raw_value	Value is a direct measurement
site_min	Value is the minimum of measurements on multiple individuals of the taxon at a single site
site_mean	Value is the mean or median of measurements on multiple individuals of the taxon at a single site
site_max	Value is the maximum of measurements on multiple individuals of the taxon at a single site
multisite_min	Value is the minimum of measurements on multiple individuals of the taxon across multiple sites
multisite_mean	Value is the mean or median of measurements on multiple individuals of the taxon across multiple sites
multisite_max	Value is the maximum of measurements on multiple individuals of the taxon across multiple sites
expert_min	Value is the minimum observed for a taxon across its range or in this particular dataset, as estimated by an expert based on their knowledge of the taxon. Data fitting this category include estimates from flora that represent a taxon’s entire range, and values for categorical variables obtained from a reference book, or identified by an expert.
expert_mean	Value is the mean observed for a taxon across its range or in this particular dataset, as estimated by an expert based on their knowledge of the taxon. Data fitting this category include estimates from flora that represent a taxon’s entire range, and values for categorical variables obtained from a reference book, or identified by an expert.
expert_max	Value is the maximum observed for a taxon across its range or in this particular dataset, as estimated by an expert based on their knowledge of the taxon. Data fitting this category include estimates from flora that represent a taxon’s entire range, and values for categorical variables obtained from a reference book, or identified by an expert.
experiment_min	Value is the minimum of measurements from an experimental study either in the field or a glasshouse
experiment_mean	Value is the mean or median of measurements from an experimental study either in the field or a glasshouse
experiment_max	Value is the maximum of measurements from an experimental study either in the field or a glasshouse
individual_mean	Value is a mean of replicate measurements on an individual (usually for experimental ecophysiology studies)
individual_max	Value is a maximum of replicate measurements on an individual (usually for experimental ecophysiology studies)
literature_source	Value is a site or multi-site mean that has been sourced from an unknown literature source
unknown	Value type is not currently known

AusTraits does not include intra-individual observations. When multiple measurements per individual are submitted to AusTraits, we take the mean of the values and record the value_type as individual_mean.
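
The averaging step can be sketched in Python as follows. This is an illustration under stated assumptions rather than the AusTraits build code: the column name `individual` and the function name are hypothetical, and the sketch assumes that an individual with a single measurement keeps the value type `raw_value`.

```python
from collections import defaultdict
from statistics import mean


def collapse_to_individual_means(measurements):
    """Average repeated measurements that share an individual and trait."""
    groups = defaultdict(list)
    for m in measurements:
        groups[(m["individual"], m["trait_name"])].append(m["value"])
    return [
        {
            "individual": individual,
            "trait_name": trait_name,
            "value": mean(values),
            # Assumption: a single measurement is left as a raw value.
            "value_type": "individual_mean" if len(values) > 1 else "raw_value",
        }
        for (individual, trait_name), values in groups.items()
    ]
```

Records are returned in the order in which each individual and trait combination first appears in the input.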

Taxonomy

The latest version of AusTraits contains records for over 28640 different taxa. We have attempted to align species names with known taxonomic units in the Australian Plant Census (APC) and/or the Australian Plant Names Index (APNI).

The table taxa lists all taxa in the database, including additional information about the taxa (see Table above).

The traits table reports both the original and the updated taxon name alongside each trait record.

The table taxonomic_updates provides details on all taxonomic name changes implemented when aligning with APC and APNI.

Sources

For each dataset in the compilation there is the option to list primary and secondary citations. The primary citation is the original study in which data were collected, while the secondary citation is a subsequent study where data were compiled or re-analysed and then made available. These references are included in two places:

Within the table methods, where we provide a formatted version of each.
In the element sources, where we provide bibtex versions of all sources which can be imported into your reference library. The keys for these references are listed within the methods.

2022-07-18
