7 Data structure

This chapter describes the structure of the output of a traits.build compilation.

The information below is drawn from the file traits.build_schema.yml, which can be accessed by running get_schema or system.file("support", "traits.build_schema.yml", package = "traits.build").

A traits.build compilation results in a series of linked components that cross-reference each other:

austraits
├── traits
├── locations
├── contexts
├── methods
├── excluded_data
├── taxonomic_updates
├── taxa
├── contributors
├── sources
├── definitions
├── schema
├── metadata
└── build_info

These include all the data and contextual information submitted with each contributed dataset.
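As a rough illustration of this layout, the sketch below models a compilation as a Python dictionary with one entry per component. The component names come from the tree above; the helper names, the empty placeholder contents, and the choice of which components are treated as row-based tables are assumptions made for this example, not part of traits.build.

```python
# Hypothetical stand-in for a traits.build compilation.
# Component names follow the tree above; the contents are empty placeholders.
COMPONENT_NAMES = [
    "traits", "locations", "contexts", "methods", "excluded_data",
    "taxonomic_updates", "taxa", "contributors", "sources",
    "definitions", "schema", "metadata", "build_info",
]

# Components this chapter describes as tables (assumed here to be lists of rows).
TABULAR = {
    "traits", "locations", "contexts", "methods", "excluded_data",
    "taxonomic_updates", "taxa", "contributors",
}


def empty_compilation():
    """Return a dict with one empty entry per component."""
    return {name: [] if name in TABULAR else {} for name in COMPONENT_NAMES}


def missing_components(compilation):
    """List the expected component names that are absent from a compilation."""
    return [name for name in COMPONENT_NAMES if name not in compilation]


austraits = empty_compilation()
print(missing_components(austraits))  # []
```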

7.1 Components

The core components are defined as follows.

7.2 Traits

Description: A table containing measurements of traits.

Content:

key	value
dataset_id	Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. `Falster_2005`.
taxon_name	Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level.
observation_id	A unique integer identifier for the observation, where an observation is all measurements made on an individual at a single point in time. It is important for joining traits coming from the same `observation_id`. Within each dataset, observation_ids are unique combinations of `taxon_name`, `population_id`, `individual_id`, and `temporal_context_id`.
trait_name	Name of the trait sampled. Allowable values specified in the table `definitions`.
value	The measured value of a trait, location property or context property.
unit	Units of the sampled trait value after aligning with AusTraits standards.
entity_type	A categorical variable specifying the entity corresponding to the trait values recorded.
value_type	A categorical variable describing the statistical nature of the trait value recorded.
basis_of_value	A categorical variable describing how the trait value was obtained.
replicates	Number of replicate measurements that comprise a recorded trait measurement. A numeric value (or range) is ideal and appropriate if the value type is a `mean`, `median`, `minimum` or `maximum`. For these value types, if replication is unknown the entry should be `unknown`. If the value type is `raw` the replicate value should be 1. If the trait is categorical or the value indicates a measurement for an entire species (or other taxon), the replicate value should be `.na`.
basis_of_record	A categorical variable specifying from which kind of specimen traits were recorded.
life_stage	A field to indicate the life stage or age class of the entity measured. Standard values are `adult`, `sapling`, `seedling` and `juvenile`.
population_id	A unique integer identifier for a population, where a population is defined as individuals growing in the same location (location_id/location_name) and plot (plot_context_id, a context category) and being subjected to the same treatment (treatment_context_id, a context category).
individual_id	A unique integer identifier for an individual, with individuals numbered sequentially within each dataset by taxon by population grouping. Most often each row of data represents an individual, but in some datasets trait data collected on a single individual is presented across multiple rows of data, such as if the same trait is measured using different methods or the same individual is measured repeatedly across time.
repeat_measurements_id	A unique integer identifier for repeat measurements of a trait that comprise a single observation, such as a response curve.
temporal_context_id	A unique integer identifier assigned where repeat observations are made on the same individual (or population, or taxon) across time. The identifier links to specific information in the context table.
source_id	For datasets that are compilations, an identifier for the original data source.
location_id	A unique integer identifier for a location, with locations numbered sequentially within a dataset. The identifier links to specific information in the location table.
entity_context_id	A unique integer identifier indicating specific contextual properties of an individual, possibly including the individual’s sex or caste (for social insects).
plot_context_id	A unique integer identifier for a plot, where a plot is a distinct collection of organisms within a single geographic location, such as plants growing on different aspects or blocks in an experiment. The identifier links to specific information in the context table.
treatment_context_id	A unique integer identifier for a treatment, where a treatment is any experimental manipulation to an organism’s growing/living conditions. The identifier links to specific information in the context table.
collection_date	Date sample was taken, in the format `yyyy-mm-dd`, `yyyy-mm` or `yyyy`, depending on the resolution specified. Alternatively an overall range for the study can be indicated, with the starting and ending sample date separated by a `/`, as in 2010-10/2011-03.
measurement_remarks	Brief comments or notes accompanying the trait measurement.
method_id	A unique integer identifier to distinguish between multiple sets of methods used to measure a single trait within the same dataset. The identifier links to specific information in the methods table.
method_context_id	A unique integer identifier indicating a trait is measured multiple times on the same entity, with different methods used for each entry. This field is only used if a single trait is measured using multiple methods within the same dataset. The identifier links to specific information in the context table.
original_name	Name given to taxon in the original data supplied by the authors.
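The collection_date formats described in the table can be recognised with a small parser. The following is a minimal sketch in Python; the function name parse_collection_date is hypothetical, and the check covers only the shape of the string (it does not validate that months or days are real calendar values).

```python
import re

# One date at any of the three resolutions: yyyy, yyyy-mm or yyyy-mm-dd.
_DATE = r"\d{4}(?:-\d{2}(?:-\d{2})?)?"
_SINGLE = re.compile(rf"^{_DATE}$")
_RANGE = re.compile(rf"^({_DATE})/({_DATE})$")


def parse_collection_date(text):
    """Return (start, end) as strings; end equals start for a single date.

    Raises ValueError if the text is neither a single date nor a
    start/end range separated by "/".
    """
    text = text.strip()
    if _SINGLE.match(text):
        return (text, text)
    match = _RANGE.match(text)
    if match:
        return (match.group(1), match.group(2))
    raise ValueError(f"unrecognised collection_date: {text!r}")


print(parse_collection_date("2010-10/2011-03"))  # ('2010-10', '2011-03')
```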

Entity type

An entity is the feature of interest, indicating what a trait value applies to. While an entity can be just a component of an organism, within the scope of AusTraits, an individual is the finest scale entity that can be documented. The same study might measure some traits at a population-level (entity = population) and others at an individual-level (entity = individual).

In detail:

entity_type is a categorical variable specifying the entity corresponding to the trait values recorded. Possible values are:

key	value
individual	Value comes from a single individual.
population	Value represents a summary statistic from multiple individuals at a single location.
metapopulation	Value represents a summary statistic from individuals of the taxon across multiple locations.
species	Value represents a summary statistic for a species or infraspecific taxon across its range or as estimated by an expert based on their knowledge of the taxon. Data fitting this category include estimates from reference books that represent a taxon’s entire range and values for categorical variables obtained from a reference book or identified by an expert.
genus	Value represents a summary statistic or expert score for an entire genus.
family	Value represents a summary statistic or expert score for an entire family.
order	Value represents a summary statistic or expert score for an entire order.

Identifiers

The traits table includes 12 identifiers: dataset_id, observation_id, taxon_name, population_id, individual_id, temporal_context_id, source_id, location_id, entity_context_id, plot_context_id, treatment_context_id, and method_context_id.

dataset_id, source_id and taxon_name have easy-to-interpret values. The others are simply integer identifiers that link groups of measurements and are automatically generated through the AusTraits workflow (individual_id can either be assigned in the metadata file or generated automatically).

To expand on the definitions provided above,

- observation_id links measurements made on the same entity (individual, population, or species) at a single point in time.
- population_id indicates entities that share a common location_id, plot_context_id, and treatment_context_id. It is used to align measurements and observation_ids for individuals versus populations (i.e. distinct entity_types) that share a common population_id. It is numbered sequentially within a dataset.
- individual_id indicates a unique organism. It is numbered sequentially within a dataset by population. Multiple observations on the same organism across time (with distinct observation_id values) share a common individual_id.
- temporal_context_id indicates a distinct point in time and is used only if there are repeat measurements on a population or individual across time. The identifier links to context properties (and their associated information) in the contexts table for context properties of type temporal.
- source_id is applied if not all data within a single dataset (dataset_id) is from the same source, such as when a dataset represents a compilation for a meta-analysis.
- location_id links to a distinct location_name and associated location_properties in the location table.
- entity_context_id links to information in the contexts table for context properties (and associated values/descriptions) with category entity_context. Entity contexts include organism sex, organism caste and any other features of an entity that need to be documented.
- plot_context_id links to information in the contexts table for context properties (and associated values/descriptions) with category plot. Plot contexts include both blocks/plots within an experimental design as well as any stratified variation within a location that needs to be documented (e.g. slope position).
- treatment_context_id links to information in the contexts table for context properties (and associated values/descriptions) with category treatment. Treatment contexts are experimental manipulations applied to groups of individuals.
- method_context_id links to information in the contexts table for context properties (and associated values/descriptions) with category method. A method context indicates that the same trait was measured on or across individuals using different methods.
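Because observation_id ties together all measurements made on one entity at one point in time, the long-format traits table can be regrouped so each observation becomes a single record. The sketch below shows one way to do this in Python with plain dictionaries standing in for table rows. The rows and values are invented for illustration and the function name is hypothetical. Note that if a trait appears more than once within an observation (for example under different method contexts), this simple version keeps only the last value.

```python
from collections import defaultdict

# Invented rows carrying only the columns this sketch needs.
traits = [
    {"dataset_id": "Falster_2005", "observation_id": 1,
     "trait_name": "leaf_mass_per_area", "value": "120"},
    {"dataset_id": "Falster_2005", "observation_id": 1,
     "trait_name": "woodiness", "value": "woody"},
    {"dataset_id": "Falster_2005", "observation_id": 2,
     "trait_name": "leaf_mass_per_area", "value": "95"},
]


def traits_by_observation(rows):
    """Map (dataset_id, observation_id) -> {trait_name: value}."""
    wide = defaultdict(dict)
    for row in rows:
        key = (row["dataset_id"], row["observation_id"])
        wide[key][row["trait_name"]] = row["value"]
    return dict(wide)


wide = traits_by_observation(traits)
print(wide[("Falster_2005", 1)])
# {'leaf_mass_per_area': '120', 'woodiness': 'woody'}
```

The key includes dataset_id because the identifiers are only unique within a dataset.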

Additionally, measurement_remarks is used to document brief comments or notes accompanying the trait measurement.

Life stage, basis of record

life_stage: a field to indicate the life stage or age class of the entity measured. Standard values are adult, sapling, seedling and juvenile.
basis_of_record: a categorical variable specifying from which kind of specimen traits were recorded.

Possible values are:

key	value
field	Traits were recorded on entities living naturally in the field.
field_experiment	Traits were recorded on entities living under experimentally manipulated conditions in the field.
captive_cultivated	Traits were recorded on entities living in a common garden, arboretum, or botanical or zoological garden.
lab	Traits were recorded on entities growing in a lab, glasshouse or growth chamber.
preserved_specimen	Traits were recorded from specimens preserved in a collection, e.g. herbarium or museum.
literature	Traits were sourced from values reported in the literature, and where the basis of record is not otherwise known.

Values, value types, basis of value

Each record in the table of trait data has an associated value, value_type, and basis_of_value.

Values: A trait’s values are either numeric or categorical. For traits with numerical values, the recorded value has been converted into standardised units and the AusTraits workflow has confirmed the value can be converted into a number and lies within the allowable range. For categorical variables, records have been aligned through substitutions to values listed as allowable values (terms) in a trait’s definition.
- we use _ for multi-word terms, e.g. semi_deciduous
- we use a space for situations where two values co-occur for the same entity. For instance, a flora might indicate that a plant species can be either annual or biennial, in which case the trait is scored as annual biennial.
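These two conventions make categorical values straightforward to take apart: splitting on whitespace yields the co-occurring terms, while underscores stay inside a term. A minimal Python sketch follows; the function names are hypothetical.

```python
def split_categorical_value(value):
    """Split a categorical trait value into its co-occurring terms."""
    return value.split()


def display_term(term):
    """Turn a multi-word term such as semi_deciduous into readable text."""
    return term.replace("_", " ")


print(split_categorical_value("annual biennial"))  # ['annual', 'biennial']
print(split_categorical_value("semi_deciduous"))   # ['semi_deciduous']
print(display_term("semi_deciduous"))              # semi deciduous
```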

Value types: Each trait measurement has an associated value_type, which is a categorical variable describing the statistical nature of the trait value recorded.

Possible value types are:

key	value
raw	Value recorded for an entity.
minimum	Value is the minimum of values recorded for an entity.
mean	Value is the mean of values recorded for an entity.
median	Value is the median of values recorded for an entity.
maximum	Value is the maximum of values recorded for an entity.
mode	Value is the mode of values recorded for an entity. This is the appropriate value type for a categorical trait value.
range	Value is a range of values recorded for an entity.
bin	Value for an entity falls within specified limits.
standard_error	Value is the standard error of a mean of values recorded for an entity.
unknown	Not currently known.

Each trait measurement also has an associated basis_of_value, which is a categorical variable describing how the trait value was obtained.

Possible values are:

key	value
measurement	Value is the result of a measurement(s) made on a specimen(s).
expert_score	Value has been estimated by an expert based on their knowledge of the entity.
model_derived	Value is derived from a statistical model, for example via gap-filling.
unknown	Not currently known.
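The guidance given for the replicates column in the traits table depends on the value type and the entity type, so it can be expressed as a consistency check. The sketch below is one possible reading of that guidance in Python, not a rule enforced by traits.build. It uses the value type names from the table of possible value types, treats species, genus, family and order as taxon-level entities, and assumes a range is written as two integers joined by a hyphen; all of these are assumptions, as is the function name.

```python
import re

# Value types for which a replicate count (or range) is meaningful.
SUMMARY_TYPES = {"mean", "median", "minimum", "maximum"}
# Entity types taken here to represent an entire taxon (assumption).
TAXON_LEVEL_ENTITIES = {"species", "genus", "family", "order"}
# A count such as "5" or a range such as "3-5" (range format is assumed).
_COUNT_OR_RANGE = re.compile(r"\d+(-\d+)?")


def replicates_consistent(replicates, value_type, entity_type, is_categorical):
    """Check a replicates entry against the guidance in the traits table."""
    replicates = str(replicates)
    if is_categorical or entity_type in TAXON_LEVEL_ENTITIES:
        return replicates == ".na"
    if value_type == "raw":
        return replicates == "1"
    if value_type in SUMMARY_TYPES:
        return replicates == "unknown" or bool(_COUNT_OR_RANGE.fullmatch(replicates))
    # No guidance is given for the remaining value types.
    return True


print(replicates_consistent("5", "mean", "population", False))   # True
print(replicates_consistent("3", "raw", "individual", False))    # False
```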

7.3 Locations

Description: A table containing observations of location/site characteristics associated with information in traits. Cross referencing between the two dataframes is possible using combinations of the variables dataset_id and location_name.

Content:

key	value
dataset_id	Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. `Falster_2005`.
location_id	A unique integer identifier for a location, with locations numbered sequentially within a dataset. The identifier links to specific information in the location table.
location_name	The location name.
location_property	The location characteristic being recorded. The name should include units of measurement, e.g. `MAT (C)`. Ideally we have at least the following variables for each location, `longitude (deg)`, `latitude (deg)`, `description`.
value	The measured value of a location property.
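Since each row of this table holds one property of one location, gathering everything known about a location means collecting its rows into a single record. The Python sketch below does this with plain dictionaries standing in for rows. It matches on dataset_id and location_id because those are the location columns listed for the traits table. The sample values are invented and the function name is hypothetical.

```python
# Invented rows for illustration only.
locations = [
    {"dataset_id": "Falster_2005", "location_id": 1, "location_name": "Site A",
     "location_property": "latitude (deg)", "value": "-33.7"},
    {"dataset_id": "Falster_2005", "location_id": 1, "location_name": "Site A",
     "location_property": "longitude (deg)", "value": "151.1"},
    {"dataset_id": "Falster_2005", "location_id": 2, "location_name": "Site B",
     "location_property": "latitude (deg)", "value": "-17.1"},
]


def location_properties(rows, dataset_id, location_id):
    """Collect {location_property: value} for one location in one dataset."""
    return {
        row["location_property"]: row["value"]
        for row in rows
        if row["dataset_id"] == dataset_id and row["location_id"] == location_id
    }


print(location_properties(locations, "Falster_2005", 1))
# {'latitude (deg)': '-33.7', 'longitude (deg)': '151.1'}
```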

7.4 Contexts

Description: A table containing observations of contextual characteristics associated with information in traits. Cross referencing between the two dataframes is possible using combinations of the variables dataset_id, link_id, and link_vals.

Content:

key	value
dataset_id	Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. `Falster_2005`.
context_property	The contextual characteristic being recorded. If applicable, name should include units of measurement, e.g. `CO2 concentration (ppm)`.
category	The category of context property, with options being `plot`, `treatment`, `entity_context`, `temporal` and `method`.
value	The measured value of a context property.
description	Description of a specific context property value.
link_id	Variable indicating which identifier column in the traits table contains the specified `link_vals`.
link_vals	Unique integer identifiers that link between identifier columns in the `traits` table and the contextual properties/values in the `contexts` table.
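The link_id and link_vals columns work as an indirect join: link_id names the identifier column in traits to inspect, and link_vals lists the identifier values that the context row applies to. The Python sketch below resolves the contexts for one traits row. It assumes that link_vals may hold several identifiers separated by commas, which is not stated in the table above; the sample rows and the function name are likewise invented for illustration.

```python
# Invented context rows for illustration only.
contexts = [
    {"dataset_id": "Falster_2005", "context_property": "CO2 concentration (ppm)",
     "category": "treatment", "value": "400", "description": "ambient",
     "link_id": "treatment_context_id", "link_vals": "1"},
    {"dataset_id": "Falster_2005", "context_property": "CO2 concentration (ppm)",
     "category": "treatment", "value": "700", "description": "elevated",
     "link_id": "treatment_context_id", "link_vals": "2, 3"},
]


def contexts_for_row(trait_row, context_rows):
    """Return the context rows that apply to a single traits row."""
    matches = []
    for ctx in context_rows:
        if ctx["dataset_id"] != trait_row["dataset_id"]:
            continue
        # link_id names the traits column whose value must appear in link_vals.
        row_value = trait_row.get(ctx["link_id"])
        if row_value is None:
            continue
        linked = {v.strip() for v in str(ctx["link_vals"]).split(",")}
        if str(row_value) in linked:
            matches.append(ctx)
    return matches


trait_row = {"dataset_id": "Falster_2005", "treatment_context_id": 2}
print([c["description"] for c in contexts_for_row(trait_row, contexts)])
# ['elevated']
```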

7.5 Methods

Description: A table containing details on methods with which data were collected, including time frame and source. Cross referencing with the traits table is possible using combinations of the variables dataset_id and trait_name.

Content:

key	value
dataset_id	Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. `Falster_2005`.
trait_name	Name of the trait sampled. Allowable values specified in the table `definitions`.
methods	A textual description of the methods used to collect the trait data. Whenever available, methods are taken near-verbatim from the referenced source. Methods can include descriptions such as ‘measured on botanical collections’, ‘data from the literature’, or a detailed description of the field or lab methods used to collect the data.
method_id	A unique integer identifier to distinguish between multiple sets of methods used to measure a single trait within the same dataset. The identifier links to specific information in the methods table.
description	A 1-2 sentence description of the purpose of the study.
sampling_strategy	A written description of how study locations were selected and how study individuals were selected. When available, this information is lifted verbatim from a published manuscript. For preserved specimens, this field ideally indicates which records were ‘sampled’ to measure a specific trait.
source_primary_key	Citation key for the primary source in `sources`. The key is typically formatted as `Surname_year`.
source_primary_citation	Citation for the primary source. This detail is generated from the primary source in the metadata.
source_secondary_key	Citation key for the secondary source in `sources`. The key is typically formatted as `Surname_year`.
source_secondary_citation	Citations for the secondary source. This detail is generated from the secondary source in the metadata.
source_original_dataset_key	Citation key for the original dataset_id in sources; for compilations. The key is typically formatted as `Surname_year`.
source_original_dataset_citation	Citations for the original dataset_id in sources; for compilations. This detail is generated from the original source in the metadata.
data_collectors	The person (people) leading data collection for this study.
assistants	Names of people who played a more minor role in data collection for the study.
dataset_curators	Names of database team member(s) who contacted the data collectors and added the study to the database repository.

7.6 Excluded_data

Description: A table of data that did not pass quality tests and so were excluded from the master dataset. The structure is identical to that presented in the traits table, only with an extra column called error indicating why the record was excluded. Common reasons are missing_unit_conversions, missing_value, and unsupported_trait_value.

Content:

key	value
error	Indicating why the record was excluded. Common reasons are missing_unit_conversions, missing_value, and unsupported_trait_value.
dataset_id	Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. `Falster_2005`.
taxon_name	Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level.
observation_id	A unique integer identifier for the observation, where an observation is all measurements made on an individual at a single point in time. It is important for joining traits coming from the same `observation_id`. Within each dataset, observation_ids are unique combinations of `taxon_name`, `population_id`, `individual_id`, and `temporal_context_id`.
trait_name	Name of the trait sampled. Allowable values specified in the table `definitions`.
value	The measured value of a trait.
unit	Units of the sampled trait value after aligning with AusTraits standards.
entity_type	A categorical variable specifying the entity corresponding to the trait values recorded.
value_type	A categorical variable describing the statistical nature of the trait value recorded.
basis_of_value	A categorical variable describing how the trait value was obtained.
replicates	Number of replicate measurements that comprise a recorded trait measurement. A numeric value (or range) is ideal and appropriate if the value type is a `mean`, `median`, `minimum` or `maximum`. For these value types, if replication is unknown the entry should be `unknown`. If the value type is `raw` the replicate value should be 1. If the trait is categorical or the value indicates a measurement for an entire species (or other taxon), the replicate value should be `.na`.
basis_of_record	A categorical variable specifying from which kind of specimen traits were recorded.
life_stage	A field to indicate the life stage or age class of the entity measured. Standard values are `adult`, `sapling`, `seedling` and `juvenile`.
population_id	A unique integer identifier for a population, where a population is defined as individuals growing in the same location (location_id/location_name) and plot (plot_context_id, a context category) and being subjected to the same treatment (treatment_context_id, a context category).
individual_id	A unique integer identifier for an individual, with individuals numbered sequentially within each dataset by taxon by population grouping. Most often each row of data represents an individual, but in some datasets trait data collected on a single individual is presented across multiple rows of data, such as if the same trait is measured using different methods or the same individual is measured repeatedly across time.
repeat_measurements_id	A unique integer identifier for repeat measurements of a trait that comprise a single observation, such as a response curve.
temporal_context_id	A unique integer identifier assigned where repeat observations are made on the same individual (or population, or taxon) across time. The identifier links to specific information in the context table.
source_id	For datasets that are compilations, an identifier for the original data source.
location_id	A unique integer identifier for a location, with locations numbered sequentially within a dataset. The identifier links to specific information in the location table.
entity_context_id	A unique integer identifier indicating specific contextual properties of an individual, possibly including the individual’s sex or caste (for social insects).
plot_context_id	A unique integer identifier for a plot, where a plot is a distinct collection of organisms within a single geographic location, such as plants growing on different aspects or blocks in an experiment. The identifier links to specific information in the context table.
treatment_context_id	A unique integer identifier for a treatment, where a treatment is any experimental manipulation to an organism’s growing/living conditions. The identifier links to specific information in the context table.
collection_date	Date sample was taken, in the format `yyyy-mm-dd`, `yyyy-mm` or `yyyy`, depending on the resolution specified. Alternatively an overall range for the study can be indicated, with the starting and ending sample date separated by a `/`, as in 2010-10/2011-03.
measurement_remarks	Brief comments or notes accompanying the trait measurement.
method_id	A unique integer identifier to distinguish between multiple sets of methods used to measure a single trait within the same dataset. The identifier links to specific information in the methods table.
method_context_id	A unique integer identifier indicating a trait is measured multiple times on the same entity, with different methods used for each entry. This field is only used if a single trait is measured using multiple methods within the same dataset. The identifier links to specific information in the context table.
original_name	Name given to taxon in the original data supplied by the authors.
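A quick way to see why records were excluded is to tally the error column. The Python sketch below does this with collections.Counter over invented rows; the function name is hypothetical, and the error labels are the common reasons named in the description above.

```python
from collections import Counter

# Invented rows carrying only the columns this sketch needs.
excluded_data = [
    {"error": "missing_value", "dataset_id": "Falster_2005"},
    {"error": "unsupported_trait_value", "dataset_id": "Falster_2005"},
    {"error": "missing_value", "dataset_id": "Falster_2005"},
]


def exclusion_counts(rows):
    """Count excluded records by the reason given in the error column."""
    return Counter(row["error"] for row in rows)


counts = exclusion_counts(excluded_data)
print(counts.most_common(1))  # [('missing_value', 2)]
```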

7.7 Taxa

Description: A table containing details on taxa that are included in the table traits. We have attempted to align species names with known taxonomic units in the Australian Plant Census (APC) and/or the Australian Plant Names Index (APNI); the sourced information is released under a CC-BY3 license.

Version 0.1.0 of AusTraits contains records for 7324 different taxa.

Content:

key	value
taxon_name	Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level.
taxonomic_dataset	Name of the taxonomy (tree) that contains this concept, e.g. APC or AusMoss.
taxon_rank	The taxonomic rank of the most specific name in the scientific name.
trinomial	The infraspecific taxon name match for an original name. This column is assigned `na` for taxon names that are at a broader taxonomic_resolution.
binomial	The species-level taxon name match for an original name. This column is assigned `na` for taxon names that are at a broader taxonomic_resolution.
genus	Genus of the taxon without authorship.
family	Family of the taxon.
taxon_distribution	Known distribution of the taxon, by Australian state.
establishment_means	Statement about whether an organism or organisms have been introduced to a given place and time through the direct or indirect activity of modern humans.
taxonomic_status	The status of the use of the scientificName as a label for the taxon in regard to the ‘accepted (or valid) taxonomy’. The assigned taxonomic status must be linked to a specific taxonomic reference that defines the concept.
taxon_id	An identifier for the set of taxon information (data associated with the taxon class). May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset.
taxon_id_genus	An identifier for the set of taxon information (data associated with the taxon class) for the genus associated with a taxon name. May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset.
taxon_id_family	An identifier for the set of taxon information (data associated with the taxon class) for the family associated with a taxon name. May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset.
scientific_name	The full scientific name, with authorship and date information if known.
scientific_name_id	An identifier for the set of taxon information (data associated with the taxon class). May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset.

7.8 Taxonomic_updates

Description: A table of all taxonomic changes implemented in the construction of AusTraits. Changes are determined by comparing the originally submitted taxon name against the taxonomic names listed in the taxonomic reference files, which are best placed in a subfolder of the config folder. Cross referencing with the traits table is possible using combinations of the variables dataset_id and taxon_name.

Content:

key	value
dataset_id	Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. `Falster_2005`.
original_name	Name given to taxon in the original data supplied by the authors.
aligned_name	The taxon name without authorship after implementing automated syntax standardisation and spelling changes as well as manually encoded syntax alignments for this taxon in the metadata file for the corresponding `dataset_id`. This name has not yet been matched to the currently accepted (botanical) or valid (zoological) taxon name in cases where there are taxonomic synonyms, isonyms, orthographic variants, etc.
taxonomic_resolution	The rank of the most specific taxon name (or scientific name) to which a submitted original name resolves.
taxon_name	Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level.
aligned_name_taxon_id	An identifier for the aligned name before it is updated to the currently accepted name usage. This may be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset.
aligned_name_taxonomic_status	The status of the use of the `aligned_name` as a label for a taxon. Requires taxonomic opinion to define the scope of a taxon. Rules of priority are then used to define the taxonomic status of the nomenclature contained in that scope, combined with the expert's opinion. It must be linked to a specific taxonomic reference that defines the concept.

Both the original and the updated taxon names are included in the traits table.
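The table can therefore be read as a per-dataset lookup from the name an author supplied to the name used in the compilation. A minimal Python sketch follows; the taxon names are deliberately artificial placeholders rather than real taxa, and the function name is hypothetical.

```python
# Placeholder rows: the names are artificial and do not refer to real taxa.
taxonomic_updates = [
    {"dataset_id": "Falster_2005", "original_name": "Genus speceis",
     "aligned_name": "Genus species", "taxon_name": "Genus species"},
    {"dataset_id": "Falster_2005", "original_name": "Oldgenus species",
     "aligned_name": "Oldgenus species", "taxon_name": "Genus species"},
]


def updated_name(rows, dataset_id, original_name):
    """Return the taxon_name recorded for an original name, or None."""
    for row in rows:
        if row["dataset_id"] == dataset_id and row["original_name"] == original_name:
            return row["taxon_name"]
    return None


print(updated_name(taxonomic_updates, "Falster_2005", "Oldgenus species"))
# Genus species
```

The first placeholder row represents a spelling correction (original to aligned name) and the second a synonym update (aligned name to taxon name).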

7.9 Definitions

Description: A copy of the definitions for all tables and terms. Information included here was used to process data and generate any documentation for the study.

Details on trait definitions: The allowable trait names and trait values are defined in the definitions file. Each trait is labelled as either numeric or categorical. For a full listing, see the Trait definitions for AusTraits. An example of each type is as follows.

leaf_mass_per_area

number of records: 2261
number of studies: 13

woodiness

number of records: 0
number of studies: 0

7.10 Contributors

Description: A table of people contributing to each study.

Content:

key	value
dataset_id	Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. `Falster_2005`.
last_name	Last name of the data collector.
given_name	Given names of the data collector.
ORCID	ORCID of the data collector.
affiliation	Last known institution or affiliation.
additional_role	Additional roles of data collector, mostly contact person.

7.11 Sources

For each dataset in the compilation there is the option to list primary and secondary citations. The primary citation is defined as ‘the original study in which data were collected’. The secondary citation is defined as ‘a subsequent study where data were compiled or re-analysed’.

The element sources includes BibTeX versions of all sources, which can be imported into your reference library or viewed individually.

A formatted version of the sources also exists within the table methods.

7.12 Metadata

Description: Metadata associated with the dataset, including title, creators, license, subject, funding sources.

7.13 Build_info

Description: A description of the computing environment used to create this version of the dataset, including version number, git commit and R session_info.