7  Data structure

This chapter describes the structure of the output of a traits.build compilation.

The information below is drawn from the file traits.build_schema.yml, which can be accessed by running get_schema or system.file("support", "traits.build_schema.yml", package = "traits.build").

A traits.build compilation results in a series of components that cross-link to each other:

austraits
├── traits
├── locations
├── contexts
├── methods
├── excluded_data
├── taxonomic_updates
├── taxa
├── contributors
├── sources
├── definitions
├── schema
├── metadata
└── build_info

These include all the data and contextual information submitted with each contributed dataset.
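As a rough illustration of this layout, the sketch below models a compilation as a plain Python dictionary with one entry per component and checks that none is missing. The component names come from the tree above; the dictionary representation, the empty placeholder contents, and the helper missing_components are assumptions made for illustration, not part of traits.build.

```python
# Component names as listed in the tree above.
COMPONENTS = [
    "traits", "locations", "contexts", "methods", "excluded_data",
    "taxonomic_updates", "taxa", "contributors", "sources",
    "definitions", "schema", "metadata", "build_info",
]

# Hypothetical in-memory stand-in: every component starts as an empty placeholder.
austraits = {name: [] for name in COMPONENTS}


def missing_components(compilation):
    """Return the expected component names that a compilation lacks."""
    return [name for name in COMPONENTS if name not in compilation]
```

A check of this kind is a cheap first test before trying to cross-reference the tables.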

7.1 Components

The core components are defined as follows.

7.2 Traits

Description: A table containing measurements of traits.

Content:

key value
dataset_id Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005.
taxon_name Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level.
observation_id A unique integer identifier for the observation, where an observation is all measurements made on an individual at a single point in time. It is important for joining traits coming from the same observation_id. Within each dataset, observation_id values are unique combinations of taxon_name, population_id, individual_id, and temporal_context_id.
trait_name Name of the trait sampled. Allowable values specified in the table definitions.
value The measured value of a trait, location property or context property.
unit Units of the sampled trait value after aligning with AusTraits standards.
entity_type A categorical variable specifying the entity corresponding to the trait values recorded.
value_type A categorical variable describing the statistical nature of the trait value recorded.
basis_of_value A categorical variable describing how the trait value was obtained.
replicates Number of replicate measurements that comprise a recorded trait measurement. A numeric value (or range) is ideal and appropriate if the value type is a mean, median, min or max. For these value types, if replication is unknown the entry should be unknown. If the value type is raw_value the replicate value should be 1. If the trait is categorical or the value indicates a measurement for an entire species (or other taxon) replicate value should be .na.
basis_of_record A categorical variable specifying from which kind of specimen traits were recorded.
life_stage A field to indicate the life stage or age class of the entity measured. Standard values are adult, sapling, seedling and juvenile.
population_id A unique integer identifier for a population, where a population is defined as individuals growing in the same location (location_id/location_name) and plot (plot_context_id, a context category) and being subjected to the same treatment (treatment_context_id, a context category).
individual_id A unique integer identifier for an individual, with individuals numbered sequentially within each dataset by taxon by population grouping. Most often each row of data represents an individual, but in some datasets trait data collected on a single individual is presented across multiple rows of data, such as if the same trait is measured using different methods or the same individual is measured repeatedly across time.
repeat_measurements_id A unique integer identifier for repeat measurements of a trait that comprise a single observation, such as a response curve.
temporal_context_id A unique integer identifier assigned where repeat observations are made on the same individual (or population, or taxon) across time. The identifier links to specific information in the context table.
source_id For datasets that are compilations, an identifier for the original data source.
location_id A unique integer identifier for a location, with locations numbered sequentially within a dataset. The identifier links to specific information in the location table.
entity_context_id A unique integer identifier indicating specific contextual properties of an individual, possibly including the individual’s sex or caste (for social insects).
plot_context_id A unique integer identifier for a plot, where a plot is a distinct collection of organisms within a single geographic location, such as plants growing on different aspects or blocks in an experiment. The identifier links to specific information in the context table.
treatment_context_id A unique integer identifier for a treatment, where a treatment is any experimental manipulation to an organism’s growing/living conditions. The identifier links to specific information in the context table.
collection_date Date sample was taken, in the format yyyy-mm-dd, yyyy-mm or yyyy, depending on the resolution specified. Alternatively an overall range for the study can be indicated, with the starting and ending sample dates separated by a /, as in 2010-10/2011-03.
measurement_remarks Brief comments or notes accompanying the trait measurement.
method_id A unique integer identifier to distinguish between multiple sets of methods used to measure a single trait within the same dataset. The identifier links to specific information in the methods table.
method_context_id A unique integer identifier indicating a trait is measured multiple times on the same entity, with different methods used for each entry. This field is only used if a single trait is measured using multiple methods within the same dataset. The identifier links to specific information in the context table.
original_name Name given to taxon in the original data supplied by the authors.
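The collection_date formats described in the table can be read with a small parser. The following is a minimal sketch, assuming the field holds either a single date or a start/end pair separated by a /; the function names are hypothetical and are not part of traits.build.

```python
from datetime import datetime

# Formats named in the collection_date description, most specific first.
_FORMATS = ("%Y-%m-%d", "%Y-%m", "%Y")


def parse_date_part(text):
    """Parse one date and return (datetime, matching format string)."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt), fmt
        except ValueError:
            continue
    raise ValueError(f"unrecognised collection_date part: {text!r}")


def parse_collection_date(value):
    """Split an optional start/end range and parse each part."""
    return [parse_date_part(part) for part in value.split("/")]
```

Returning the matched format alongside the date keeps the original resolution, which would otherwise be lost because missing months and days default to 1.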

Entity type

An entity is the feature of interest, indicating what a trait value applies to. While an entity can be just a component of an organism, within the scope of AusTraits, an individual is the finest scale entity that can be documented. The same study might measure some traits at a population-level (entity = population) and others at an individual-level (entity = individual).

In detail:

  • entity_type is a categorical variable specifying the entity corresponding to the trait values recorded. Possible values are:
key value
individual Value comes from a single individual.
population Value represents a summary statistic from multiple individuals at a single location.
metapopulation Value represents a summary statistic from individuals of the taxon across multiple locations.
species Value represents a summary statistic for a species or infraspecific taxon across its range or as estimated by an expert based on their knowledge of the taxon. Data fitting this category include estimates from reference books that represent a taxon’s entire range and values for categorical variables obtained from a reference book or identified by an expert.
genus Value represents a summary statistic or expert score for an entire genus.
family Value represents a summary statistic or expert score for an entire family.
order Value represents a summary statistic or expert score for an entire order.

Identifiers

The traits table includes 12 identifiers: dataset_id, observation_id, taxon_name, population_id, individual_id, temporal_context_id, source_id, location_id, entity_context_id, plot_context_id, treatment_context_id, and method_context_id.

dataset_id, source_id and taxon_name have easy-to-interpret values. The others are simply integer identifiers that link groups of measurements and are automatically generated through the AusTraits workflow (individual_id can either be assigned in the metadata file or be automatically generated).

To expand on the definitions provided above:

  • observation_id links measurements made on the same entity (individual, population, or species) at a single point in time.

  • population_id indicates entities that share a common location_id, plot_context_id, and treatment_context_id. It is used to align measurements and observation_id values for individuals versus populations (i.e. distinct entity_type values) that share a common population_id. It is numbered sequentially within a dataset.

  • individual_id indicates a unique organism. It is numbered sequentially within a dataset by population. Multiple observations on the same organism across time (with distinct observation_id values), share a common individual_id.

  • temporal_context_id indicates a distinct point in time and is used only if there are repeat measurements on a population or individual across time. The identifier links to context properties (and their associated information) in the contexts table for context properties of type temporal.

  • source_id is applied if not all data within a single dataset (dataset_id) is from the same source, such as when a dataset represents a compilation for a meta-analysis.

  • location_id links to a distinct location_name and associated location_properties in the location table.

  • entity_context_id links to information in the contexts table for context properties (and associated values/descriptions) with category entity_context. Entity contexts include organism sex, organism caste and any other features of an entity that need to be documented.

  • plot_context_id links to information in the contexts table for context properties (and associated values/descriptions) with category plot. Plot contexts include both blocks/plots within an experimental design as well as any stratified variation within a location that needs to be documented (e.g. slope position).

  • treatment_context_id links to information in the contexts table for context properties (and associated values/descriptions) with category treatment. Treatment contexts are experimental manipulations applied to groups of individuals.

  • method_context_id links to information in the contexts table for context properties (and associated values/descriptions) with category method. A method context indicates that the same trait was measured on or across individuals using different methods.

Additionally, measurement_remarks is used to document brief comments or notes accompanying the trait measurement.
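To make the observation_id definition concrete, the sketch below assigns one identifier to each unique combination of taxon_name, population_id, individual_id, and temporal_context_id within a dataset. It is an illustration only: the actual AusTraits workflow may format or order identifiers differently, and the function name and the use of plain integers are assumptions.

```python
KEY_FIELDS = ("taxon_name", "population_id", "individual_id", "temporal_context_id")


def assign_observation_ids(rows):
    """Add an integer observation_id per unique key combination, per dataset."""
    seen = {}  # dataset_id -> {key combination -> observation_id}
    for row in rows:
        per_dataset = seen.setdefault(row["dataset_id"], {})
        key = tuple(row.get(field) for field in KEY_FIELDS)
        # The first sighting of a combination gets the next sequential number.
        row["observation_id"] = per_dataset.setdefault(key, len(per_dataset) + 1)
    return rows
```

Two traits measured on the same individual at the same time therefore share an observation_id, which is what allows them to be joined later.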

Life stage, basis of record

  • life_stage: a field to indicate the life stage or age class of the entity measured. Standard values are adult, sapling, seedling and juvenile.

  • basis_of_record: a categorical variable specifying from which kind of specimen traits were recorded.

Possible values are:

key value
field Traits were recorded on entities living naturally in the field.
field_experiment Traits were recorded on entities living under experimentally manipulated conditions in the field.
captive_cultivated Traits were recorded on entities living in a common garden, arboretum, or botanical or zoological garden.
lab Traits were recorded on entities growing in a lab, glasshouse or growth chamber.
preserved_specimen Traits were recorded from specimens preserved in a collection, e.g. herbarium or museum.
literature Traits were sourced from values reported in the literature, and where the basis of record is not otherwise known.

Values, value types, basis of value

Each record in the table of trait data has an associated value, value_type, and basis_of_value.

Values: A trait’s values are either numeric or categorical. For traits with numerical values, the recorded value has been converted into standardised units and the AusTraits workflow has confirmed the value can be converted into a number and lies within the allowable range. For categorical variables, records have been aligned through substitutions to values listed as allowable values (terms) in a trait’s definition. Two conventions apply to categorical values:
- We use _ for multi-word terms, e.g. semi_deciduous.
- We use a space where two values co-occur for the same entity. For instance, a flora might indicate that a plant species can be either annual or biennial, in which case the trait is scored as annual biennial.
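Because underscores join the words of one term and spaces separate co-occurring terms, a categorical value can be split on whitespace to recover its terms. The sketch below shows this, along with a check against a set of allowable terms; the function names and the allowed set passed in are hypothetical.

```python
def split_categorical(value):
    """Split a categorical trait value into its co-occurring terms."""
    return value.split()


def unsupported_terms(value, allowed):
    """Return the terms in a value that are not in the allowable set."""
    return [term for term in split_categorical(value) if term not in allowed]
```

For example, split_categorical("annual biennial") yields two terms, while "semi_deciduous" remains a single term.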

Value types: Each trait measurement has an associated value_type, which is a categorical variable describing the statistical nature of the trait value recorded.

Possible value types are:

key value
raw Value recorded for an entity.
minimum Value is the minimum of values recorded for an entity.
mean Value is the mean of values recorded for an entity.
median Value is the median of values recorded for an entity.
maximum Value is the maximum of values recorded for an entity.
mode Value is the mode of values recorded for an entity. This is the appropriate value type for a categorical trait value.
range Value is a range of values recorded for an entity.
bin Value for an entity falls within specified limits.
standard_error Value is the standard error of a mean of values recorded for an entity.
unknown Not currently known.

Each trait measurement also has an associated basis_of_value, which is a categorical variable describing how the trait value was obtained.

Possible values are:

key value
measurement Value is the result of a measurement(s) made on a specimen(s).
expert_score Value has been estimated by an expert based on their knowledge of the entity.
model_derived Value is derived from a statistical model, for example via gap-filling.
unknown Not currently known.
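The controlled vocabularies above lend themselves to a simple validity check. The following sketch hard-codes the value_type and basis_of_value terms from the two tables; in practice these would be read from the schema file, and the function name is an assumption.

```python
# Allowed terms copied from the value_type and basis_of_value tables above.
ALLOWED = {
    "value_type": {
        "raw", "minimum", "mean", "median", "maximum",
        "mode", "range", "bin", "standard_error", "unknown",
    },
    "basis_of_value": {"measurement", "expert_score", "model_derived", "unknown"},
}


def invalid_fields(row):
    """Return, sorted, the fields whose value is missing or not an allowed term."""
    return sorted(
        field for field, allowed in ALLOWED.items() if row.get(field) not in allowed
    )
```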

7.3 Locations

Description: A table containing observations of location/site characteristics associated with information in traits. Cross referencing between the two dataframes is possible using combinations of the variables dataset_id and location_name.

Content:

key value
dataset_id Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005.
location_id A unique integer identifier for a location, with locations numbered sequentially within a dataset. The identifier links to specific information in the location table.
location_name The location name.
location_property The location characteristic being recorded. The name should include units of measurement, e.g. MAT (C). Ideally we have at least the following variables for each location: longitude (deg), latitude (deg), and description.
value The measured value of a location property.
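Since the locations table is stored in long format (one row per location property), it is often convenient to gather all properties of a location before attaching them to trait records. The sketch below does this with plain dictionaries. It keys on dataset_id and location_id, which appear in both tables as listed in this chapter; the function names and the row-as-dictionary representation are assumptions.

```python
from collections import defaultdict


def locations_wide(locations):
    """Pivot long location rows into one dict of properties per location."""
    wide = defaultdict(dict)
    for row in locations:
        key = (row["dataset_id"], row["location_id"])
        wide[key]["location_name"] = row["location_name"]
        wide[key][row["location_property"]] = row["value"]
    return dict(wide)


def attach_location(trait_row, wide):
    """Return a copy of a trait row with its location details nested inside."""
    key = (trait_row["dataset_id"], trait_row.get("location_id"))
    return {**trait_row, "location": wide.get(key, {})}
```

Nesting the location details under their own key avoids clashes between the value column of traits and the value column of locations.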

7.4 Contexts

Description: A table containing observations of contextual characteristics associated with information in traits. Cross referencing between the two dataframes is possible using combinations of the variables dataset_id, link_id, and link_vals.

Content:

key value
dataset_id Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005.
context_property The contextual characteristic being recorded. If applicable, name should include units of measurement, e.g. CO2 concentration (ppm).
category The category of context property, with options being plot, treatment, individual_context, temporal and method.
value The measured value of a context property.
description Description of a specific context property value.
link_id Variable indicating which identifier column in the traits table contains the specified link_vals.
link_vals Unique integer identifiers that link between identifier columns in the traits table and the contextual properties/values in the contexts table.
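The link_id and link_vals columns work as an indirect join: link_id names the identifier column in traits, and link_vals gives the identifier to match in that column. A minimal sketch of the lookup follows, assuming each contexts row carries a single identifier in link_vals; the function name is hypothetical.

```python
def contexts_for(trait_row, contexts):
    """Return the context rows that apply to one trait record."""
    matches = []
    for ctx in contexts:
        if ctx["dataset_id"] != trait_row["dataset_id"]:
            continue
        # link_id names an identifier column of the traits table,
        # e.g. treatment_context_id; link_vals is the value to match there.
        if trait_row.get(ctx["link_id"]) == ctx["link_vals"]:
            matches.append(ctx)
    return matches
```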

7.5 Methods

Description: A table containing details on methods with which data were collected, including time frame and source. Cross referencing with the traits table is possible using combinations of the variables dataset_id and trait_name.

Content:

key value
dataset_id Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005.
trait_name Name of the trait sampled. Allowable values specified in the table definitions.
methods A textual description of the methods used to collect the trait data. Whenever available, methods are taken near-verbatim from the referenced source. Methods can include descriptions such as ‘measured on botanical collections’, ‘data from the literature’, or a detailed description of the field or lab methods used to collect the data.
method_id A unique integer identifier to distinguish between multiple sets of methods used to measure a single trait within the same dataset. The identifier links to specific information in the methods table.
description A 1-2 sentence description of the purpose of the study.
sampling_strategy A written description of how study locations were selected and how study individuals were selected. When available, this information is lifted verbatim from a published manuscript. For preserved specimens, this field ideally indicates which records were ‘sampled’ to measure a specific trait.
source_primary_key Citation key for the primary source in sources. The key is typically formatted as Surname_year.
source_primary_citation Citation for the primary source. This detail is generated from the primary source in the metadata.
source_secondary_key Citation key for the secondary source in sources. The key is typically formatted as Surname_year.
source_secondary_citation Citations for the secondary source. This detail is generated from the secondary source in the metadata.
source_original_dataset_key Citation key for the original dataset_id in sources; for compilations. The key is typically formatted as Surname_year.
source_original_dataset_citation Citations for the original dataset_id in sources; for compilations. This detail is generated from the original source in the metadata.
data_collectors The person (people) leading data collection for this study.
assistants Names of people who played a more minor role in data collection for the study.
dataset_curators Names of database team member(s) who contacted the data collectors and added the study to the database repository.
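When a dataset measures the same trait with more than one set of methods, dataset_id and trait_name alone no longer identify a single methods row, and method_id is needed as well. The sketch below indexes the methods table on all three fields and rejects duplicates; the function names are hypothetical.

```python
def index_methods(methods):
    """Index methods rows by (dataset_id, trait_name, method_id)."""
    index = {}
    for row in methods:
        key = (row["dataset_id"], row["trait_name"], row["method_id"])
        if key in index:
            raise ValueError(f"duplicate methods entry for {key}")
        index[key] = row
    return index


def methods_for(trait_row, index):
    """Look up the methods row for one trait record, or None if absent."""
    key = (trait_row["dataset_id"], trait_row["trait_name"], trait_row["method_id"])
    return index.get(key)
```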

7.6 Excluded_data

Description: A table of data that did not pass quality tests and so were excluded from the master dataset. The structure is identical to that presented in the traits table, only with an extra column called error indicating why the record was excluded. Common reasons are missing_unit_conversions, missing_value, and unsupported_trait_value.

Content:

key value
error Indicating why the record was excluded. Common reasons are missing_unit_conversions, missing_value, and unsupported_trait_value.
dataset_id Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005.
taxon_name Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level.
observation_id A unique integer identifier for the observation, where an observation is all measurements made on an individual at a single point in time. It is important for joining traits coming from the same observation_id. Within each dataset, observation_id values are unique combinations of taxon_name, population_id, individual_id, and temporal_context_id.
trait_name Name of the trait sampled. Allowable values specified in the table definitions.
value The measured value of a trait.
unit Units of the sampled trait value after aligning with AusTraits standards.
entity_type A categorical variable specifying the entity corresponding to the trait values recorded.
value_type A categorical variable describing the statistical nature of the trait value recorded.
basis_of_value A categorical variable describing how the trait value was obtained.
replicates Number of replicate measurements that comprise a recorded trait measurement. A numeric value (or range) is ideal and appropriate if the value type is a mean, median, min or max. For these value types, if replication is unknown the entry should be unknown. If the value type is raw_value the replicate value should be 1. If the trait is categorical or the value indicates a measurement for an entire species (or other taxon) replicate value should be .na.
basis_of_record A categorical variable specifying from which kind of specimen traits were recorded.
life_stage A field to indicate the life stage or age class of the entity measured. Standard values are adult, sapling, seedling and juvenile.
population_id A unique integer identifier for a population, where a population is defined as individuals growing in the same location (location_id/location_name) and plot (plot_context_id, a context category) and being subjected to the same treatment (treatment_context_id, a context category).
individual_id A unique integer identifier for an individual, with individuals numbered sequentially within each dataset by taxon by population grouping. Most often each row of data represents an individual, but in some datasets trait data collected on a single individual is presented across multiple rows of data, such as if the same trait is measured using different methods or the same individual is measured repeatedly across time.
repeat_measurements_id A unique integer identifier for repeat measurements of a trait that comprise a single observation, such as a response curve.
temporal_context_id A unique integer identifier assigned where repeat observations are made on the same individual (or population, or taxon) across time. The identifier links to specific information in the context table.
source_id For datasets that are compilations, an identifier for the original data source.
location_id A unique integer identifier for a location, with locations numbered sequentially within a dataset. The identifier links to specific information in the location table.
entity_context_id A unique integer identifier indicating specific contextual properties of an individual, possibly including the individual’s sex or caste (for social insects).
plot_context_id A unique integer identifier for a plot, where a plot is a distinct collection of organisms within a single geographic location, such as plants growing on different aspects or blocks in an experiment. The identifier links to specific information in the context table.
treatment_context_id A unique integer identifier for a treatment, where a treatment is any experimental manipulation to an organism’s growing/living conditions. The identifier links to specific information in the context table.
collection_date Date sample was taken, in the format yyyy-mm-dd, yyyy-mm or yyyy, depending on the resolution specified. Alternatively an overall range for the study can be indicated, with the starting and ending sample dates separated by a /, as in 2010-10/2011-03.
measurement_remarks Brief comments or notes accompanying the trait measurement.
method_id A unique integer identifier to distinguish between multiple sets of methods used to measure a single trait within the same dataset. The identifier links to specific information in the methods table.
method_context_id A unique integer identifier indicating a trait is measured multiple times on the same entity, with different methods used for each entry. This field is only used if a single trait is measured using multiple methods within the same dataset. The identifier links to specific information in the context table.
original_name Name given to taxon in the original data supplied by the authors.
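A quick way to review excluded records is to tally them by the error column. The sketch below assumes each excluded row is a dictionary with a single error string; the function name is hypothetical.

```python
from collections import Counter


def exclusion_summary(excluded_data):
    """Count excluded records by the reason recorded in the error column."""
    return Counter(row["error"] for row in excluded_data)
```

Calling exclusion_summary(rows).most_common() then lists the reasons from most to least frequent.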

7.7 Taxa

Description: A table containing details on taxa that are included in the table traits. We have attempted to align species names with known taxonomic units in the Australian Plant Census (APC) and/or the Australian Plant Names Index (APNI); the sourced information is released under a CC-BY3 license.

Version 0.1.0 of AusTraits contains records for 7324 different taxa.

Content:

key value
taxon_name Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level.
taxonomic_dataset Name of the taxonomy (tree) that contains this concept, e.g. APC or AusMoss.
taxon_rank The taxonomic rank of the most specific name in the scientific name.
trinomial The infraspecific taxon name match for an original name. This column is assigned na for taxon names that are at a broader taxonomic_resolution.
binomial The species-level taxon name match for an original name. This column is assigned na for taxon names that are at a broader taxonomic_resolution.
genus Genus of the taxon without authorship.
family Family of the taxon.
taxon_distribution Known distribution of the taxon, by Australian state.
establishment_means Statement about whether an organism or organisms have been introduced to a given place and time through the direct or indirect activity of modern humans.
taxonomic_status The status of the use of the scientificName as a label for the taxon in regard to the ‘accepted (or valid) taxonomy’. The assigned taxonomic status must be linked to a specific taxonomic reference that defines the concept.
taxon_id An identifier for the set of taxon information (data associated with the taxon class). May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset.
taxon_id_genus An identifier for the set of taxon information (data associated with the taxon class) for the genus associated with a taxon name. May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset.
taxon_id_family An identifier for the set of taxon information (data associated with the taxon class) for the family associated with a taxon name. May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset.
scientific_name The full scientific name, with authorship and date information if known.
scientific_name_id An identifier for the set of taxon information (data associated with the taxon class). May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset.

7.8 Taxonomic_updates

Description: A table of all taxonomic changes implemented in the construction of AusTraits. Changes are determined by comparing the originally submitted taxon name against the taxonomic names listed in the taxonomic reference files, which are best placed in a subfolder of the config folder. Cross referencing with the traits table is possible using combinations of the variables dataset_id and taxon_name.

Content:

key value
dataset_id Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005.
original_name Name given to taxon in the original data supplied by the authors.
aligned_name The taxon name without authorship after implementing automated syntax standardisation and spelling changes as well as manually encoded syntax alignments for this taxon in the metadata file for the corresponding dataset_id. This name has not yet been matched to the currently accepted (botanical) or valid (zoological) taxon name in cases where there are taxonomic synonyms, isonyms, orthographic variants, etc.
taxonomic_resolution The rank of the most specific taxon name (or scientific name) to which a submitted original name resolves.
taxon_name Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level.
aligned_name_taxon_id An identifier for the aligned name before it is updated to the currently accepted name usage. This may be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset.
aligned_name_taxonomic_status The status of the use of the aligned_name as a label for a taxon. Requires taxonomic opinion to define the scope of a taxon. Rules of priority then are used to define the taxonomic status of the nomenclature contained in that scope, combined with the expert’s opinion. It must be linked to a specific taxonomic reference that defines the concept.

Both the original and the updated taxon names are included in the traits table.
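The taxonomic_updates table can be used as a lookup from a submitted name to the name used in traits. A minimal sketch follows, assuming rows are dictionaries and that each (dataset_id, original_name) pair maps to one taxon_name; the function names are hypothetical and the names in the usage note are placeholders, not real taxa.

```python
def build_name_lookup(taxonomic_updates):
    """Map (dataset_id, original_name) to the updated taxon_name."""
    return {
        (row["dataset_id"], row["original_name"]): row["taxon_name"]
        for row in taxonomic_updates
    }


def resolve_name(dataset_id, original_name, lookup):
    """Return the updated name, or the submitted name if no update is recorded."""
    return lookup.get((dataset_id, original_name), original_name)
```

For instance, given a row mapping the placeholder "Genus oldname" to "Genus newname" for Falster_2005, resolve_name("Falster_2005", "Genus oldname", lookup) returns "Genus newname", while a name with no recorded update is returned unchanged.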

7.9 Definitions

Description: A copy of the definitions for all tables and terms. Information included here was used to process data and generate any documentation for the study.

Details on trait definitions: The allowable trait names and trait values are defined in the definitions file. Each trait is labelled as either numeric or categorical. An example of each type is given below; see also the Trait definitions for AusTraits.

leaf_mass_per_area

  • number of records: 2261
  • number of studies: 13

woodiness

  • number of records: 0
  • number of studies: 0

7.10 Contributors

Description: A table of people contributing to each study.

Content:

key value
dataset_id Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005.
last_name Last name of the data collector.
given_name Given names of the data collector.
ORCID ORCID of the data collector.
affiliation Last known institution or affiliation.
additional_role Additional roles of data collector, mostly contact person.

7.11 Sources

For each dataset in the compilation there is the option to list primary and secondary citations. The primary citation is defined as ‘The original study in which data were collected’. The secondary citation is defined as ‘A subsequent study where data were compiled or re-analysed’.

The element sources includes BibTeX versions of all sources, which can be imported into your reference library or viewed individually.

A formatted version of the sources also exists within the table methods.

7.12 Metadata

Description: Metadata associated with the dataset, including title, creators, license, subject, funding sources.

7.13 Build_info

Description: A description of the computing environment used to create this version of the dataset, including version number, git commit and R session_info.