7 Data structure
This chapter describes the structure of the output of a traits.build compilation.
Note that the information below is based on the file traits.build_schema.yml, which can be accessed by running get_schema or system.file("support", "traits.build_schema.yml", package = "traits.build").
A traits.build compilation consists of a series of components that cross-link to one another:
austraits
├── traits
├── locations
├── contexts
├── methods
├── excluded_data
├── taxonomic_updates
├── taxa
├── contributors
├── sources
├── definitions
├── schema
├── metadata
└── build_info
These include all the data and contextual information submitted with each contributed dataset.
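As a rough illustration of the layout above, the compilation can be pictured as a mapping from component names to their contents. The sketch below is in Python rather than R, and both the dictionary representation and the helper name missing_components are assumptions made for illustration; they are not part of traits.build.

```python
# Component names taken from the tree above.
EXPECTED_COMPONENTS = [
    "traits", "locations", "contexts", "methods", "excluded_data",
    "taxonomic_updates", "taxa", "contributors", "sources",
    "definitions", "schema", "metadata", "build_info",
]


def missing_components(compilation):
    """Return the expected component names absent from a compilation dict."""
    return [name for name in EXPECTED_COMPONENTS if name not in compilation]


# Hypothetical, partially loaded compilation.
partial = {"traits": [], "taxa": []}
absent = missing_components(partial)
```

A check like this is only a convenience for spotting an incomplete export before attempting joins between tables.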
7.1 Components
The core components are defined as follows.
7.2 Traits
Description: A table containing measurements of traits.
Content:
key | value |
---|---|
dataset_id | Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005. |
taxon_name | Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level. |
observation_id | A unique integer identifier for the observation, where an observation is all measurements made on an individual at a single point in time. It is important for joining traits coming from the same observation_id. Within each dataset, observation_id values are unique combinations of taxon_name, population_id, individual_id, and temporal_context_id. |
trait_name | Name of the trait sampled. Allowable values are specified in the table definitions. |
value | The measured value of a trait, location property or context property. |
unit | Units of the sampled trait value after aligning with AusTraits standards. |
entity_type | A categorical variable specifying the entity corresponding to the trait values recorded. |
value_type | A categorical variable describing the statistical nature of the trait value recorded. |
basis_of_value | A categorical variable describing how the trait value was obtained. |
replicates | Number of replicate measurements that comprise a recorded trait measurement. A numeric value (or range) is ideal and appropriate if the value type is a mean, median, min or max. For these value types, if replication is unknown the entry should be unknown. If the value type is raw_value the replicate value should be 1. If the trait is categorical or the value indicates a measurement for an entire species (or other taxon), the replicate value should be .na. |
basis_of_record | A categorical variable specifying from which kind of specimen traits were recorded. |
life_stage | A field to indicate the life stage or age class of the entity measured. Standard values are adult, sapling, seedling and juvenile. |
population_id | A unique integer identifier for a population, where a population is defined as individuals growing in the same location (location_id /location_name) and plot (plot_context_id, a context category) and being subjected to the same treatment (treatment_context_id, a context category). |
individual_id | A unique integer identifier for an individual, with individuals numbered sequentially within each dataset by taxon by population grouping. Most often each row of data represents an individual, but in some datasets trait data collected on a single individual is presented across multiple rows of data, such as if the same trait is measured using different methods or the same individual is measured repeatedly across time. |
repeat_measurements_id | A unique integer identifier for repeat measurements of a trait that comprise a single observation, such as a response curve. |
temporal_context_id | A unique integer identifier assigned where repeat observations are made on the same individual (or population, or taxon) across time. The identifier links to specific information in the context table. |
source_id | For datasets that are compilations, an identifier for the original data source. |
location_id | A unique integer identifier for a location, with locations numbered sequentially within a dataset. The identifier links to specific information in the location table. |
entity_context_id | A unique integer identifier indicating specific contextual properties of an individual, possibly including the individual’s sex or caste (for social insects). |
plot_context_id | A unique integer identifier for a plot, where a plot is a distinct collection of organisms within a single geographic location, such as plants growing on different aspects or blocks in an experiment. The identifier links to specific information in the context table. |
treatment_context_id | A unique integer identifier for a treatment, where a treatment is any experimental manipulation to an organism’s growing/living conditions. The identifier links to specific information in the context table. |
collection_date | Date the sample was taken, in the format yyyy-mm-dd, yyyy-mm or yyyy, depending on the resolution specified. Alternatively, an overall range for the study can be indicated, with the starting and ending sample dates separated by a /, as in 2010-10/2011-03. |
measurement_remarks | Brief comments or notes accompanying the trait measurement. |
method_id | A unique integer identifier to distinguish between multiple sets of methods used to measure a single trait within the same dataset. The identifier links to specific information in the methods table. |
method_context_id | A unique integer identifier indicating a trait is measured multiple times on the same entity, with different methods used for each entry. This field is only used if a single trait is measured using multiple methods within the same dataset. The identifier links to specific information in the context table. |
original_name | Name given to taxon in the original data supplied by the authors. |
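The collection_date formats described in the table (yyyy-mm-dd, yyyy-mm, yyyy, or two such dates separated by a /) can be recognised with a small parser. This is a minimal sketch in Python; the function names parse_collection_date and resolution are hypothetical, and the code only checks the shape of the string, not whether the month or day is a valid calendar value.

```python
import re

# One date at year, month or day resolution.
_DATE = r"\d{4}(?:-\d{2}(?:-\d{2})?)?"
# Either a single date or a start/end pair separated by "/".
_PATTERN = re.compile(rf"^({_DATE})(?:/({_DATE}))?$")


def parse_collection_date(text):
    """Return (start, end); end is None when a single date is given."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised collection_date: {text!r}")
    return match.group(1), match.group(2)


def resolution(date_text):
    """Classify a single date string by its length."""
    return {4: "year", 7: "month", 10: "day"}[len(date_text)]


start, end = parse_collection_date("2010-10/2011-03")
```

Keeping the start and end as strings preserves the resolution the contributor supplied, which would be lost if they were coerced to full dates.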
Entity type
An entity is the feature of interest, indicating what a trait value applies to. While an entity can be just a component of an organism, within the scope of AusTraits, an individual is the finest scale entity that can be documented. The same study might measure some traits at a population-level (entity = population) and others at an individual-level (entity = individual).
In detail:
entity_type is a categorical variable specifying the entity corresponding to the trait values recorded. Possible values are:
key | value |
---|---|
individual | Value comes from a single individual. |
population | Value represents a summary statistic from multiple individuals at a single location. |
metapopulation | Value represents a summary statistic from individuals of the taxon across multiple locations. |
species | Value represents a summary statistic for a species or infraspecific taxon across its range or as estimated by an expert based on their knowledge of the taxon. Data fitting this category include estimates from reference books that represent a taxon’s entire range and values for categorical variables obtained from a reference book or identified by an expert. |
genus | Value represents a summary statistic or expert score for an entire genus. |
family | Value represents a summary statistic or expert score for an entire family. |
order | Value represents a summary statistic or expert score for an entire order. |
Identifiers
The traits table includes 12 identifiers: dataset_id, observation_id, taxon_name, population_id, individual_id, temporal_context_id, source_id, location_id, entity_context_id, plot_context_id, treatment_context_id, and method_context_id.
dataset_id, source_id and taxon_name have easy-to-interpret values. The others are simply integer identifiers that link groups of measurements and are automatically generated through the AusTraits workflow (individual_id can be assigned in the metadata file or automatically generated).
To expand on the definitions provided above:
- observation_id links measurements made on the same entity (individual, population, or species) at a single point in time.
- population_id indicates entities that share a common location_id, plot_context_id, and treatment_context_id. It is used to align measurements and observation_id values for individuals versus populations (i.e. distinct entity_types) that share a common population_id. It is numbered sequentially within a dataset.
- individual_id indicates a unique organism. It is numbered sequentially within a dataset by population. Multiple observations on the same organism across time (with distinct observation_id values) share a common individual_id.
- temporal_context_id indicates a distinct point in time and is used only if there are repeat measurements on a population or individual across time. The identifier links to context properties (and their associated information) in the contexts table for context properties of type temporal.
- source_id is applied if not all data within a single dataset (dataset_id) is from the same source, such as when a dataset represents a compilation for a meta-analysis.
- location_id links to a distinct location_name and associated location_properties in the locations table.
- entity_context_id links to information in the contexts table for context properties (and associated values/descriptions) with category entity_context. Entity contexts include organism sex, organism caste and any other features of an entity that need to be documented.
- plot_context_id links to information in the contexts table for context properties (and associated values/descriptions) with category plot. Plot contexts include both blocks/plots within an experimental design as well as any stratified variation within a location that needs to be documented (e.g. slope position).
- treatment_context_id links to information in the contexts table for context properties (and associated values/descriptions) with category treatment. Treatment contexts are experimental manipulations applied to groups of individuals.
- method_context_id links to information in the contexts table for context properties (and associated values/descriptions) with category method. A method context indicates that the same trait was measured on or across individuals using different methods.
Additionally, measurement_remarks is used to document brief comments or notes accompanying the trait measurement.
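Because observation_id ties together all measurements made on one entity at one point in time, it can be used to reshape the long traits table into one record per observation. The sketch below is in Python with hypothetical rows and a hypothetical helper name, traits_by_observation; it assumes each trait appears at most once per observation, so repeat measurements or multiple methods would need an extra key.

```python
from collections import defaultdict


def traits_by_observation(rows):
    """Group long-format trait rows into {(dataset_id, observation_id): {trait: value}}."""
    wide = defaultdict(dict)
    for row in rows:
        key = (row["dataset_id"], row["observation_id"])
        # A later row for the same trait overwrites the earlier one.
        wide[key][row["trait_name"]] = row["value"]
    return dict(wide)


# Hypothetical rows; the values are illustrative only.
rows = [
    {"dataset_id": "Falster_2005", "observation_id": "01",
     "trait_name": "leaf_mass_per_area", "value": "120"},
    {"dataset_id": "Falster_2005", "observation_id": "01",
     "trait_name": "woodiness", "value": "woody"},
    {"dataset_id": "Falster_2005", "observation_id": "02",
     "trait_name": "leaf_mass_per_area", "value": "95"},
]
by_observation = traits_by_observation(rows)
```

The key includes dataset_id because the identifiers are generated within each dataset and are not unique across the whole compilation.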
Life stage, basis of record
- life_stage: a field to indicate the life stage or age class of the entity measured. Standard values are adult, sapling, seedling and juvenile.
- basis_of_record: a categorical variable specifying from which kind of specimen traits were recorded.
Possible values for basis_of_record are:
key | value |
---|---|
field | Traits were recorded on entities living naturally in the field. |
field_experiment | Traits were recorded on entities living under experimentally manipulated conditions in the field. |
captive_cultivated | Traits were recorded on entities living in a common garden, arboretum, or botanical or zoological garden. |
lab | Traits were recorded on entities growing in a lab, glasshouse or growth chamber. |
preserved_specimen | Traits were recorded from specimens preserved in a collection, e.g. herbarium or museum. |
literature | Traits were sourced from values reported in the literature, and where the basis of record is not otherwise known. |
Values, value types, basis of value
Each record in the table of trait data has an associated value, value_type, and basis_of_value.
Values:
A trait’s values are either numeric or categorical. For traits with numerical values, the recorded value has been converted into standardised units and the AusTraits workflow has confirmed the value can be converted into a number and lies within the allowable range. For categorical variables, records have been aligned through substitutions to values listed as allowable values (terms) in a trait’s definition.
- We use _ for multi-word terms, e.g. semi_deciduous.
- We use a space for situations where two values co-occur for the same entity. For instance, a flora might indicate that a plant species can be either annual or biennial, in which case the trait is scored as annual biennial.
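The two conventions for categorical values (underscores inside a term, spaces between co-occurring terms) mean a value can be split into its terms on whitespace alone. A minimal Python sketch, with hypothetical helper names:

```python
def split_categorical_value(value):
    """Split a categorical trait value into its co-occurring terms."""
    return value.split()


def display_term(term):
    """Turn a multi-word term such as semi_deciduous into readable text."""
    return term.replace("_", " ")


terms = split_categorical_value("annual biennial")
labels = [display_term(t) for t in split_categorical_value("semi_deciduous")]
```

Splitting before replacing underscores matters: doing it the other way round would break a single multi-word term into separate values.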
Value types:
Each trait measurement has an associated value_type, which is a categorical variable describing the statistical nature of the trait value recorded.
Possible value types are:
key | value |
---|---|
raw | Value recorded for an entity. |
minimum | Value is the minimum of values recorded for an entity. |
mean | Value is the mean of values recorded for an entity. |
median | Value is the median of values recorded for an entity. |
maximum | Value is the maximum of values recorded for an entity. |
mode | Value is the mode of values recorded for an entity. This is the appropriate value type for a categorical trait value. |
range | Value is a range of values recorded for an entity. |
bin | Value for an entity falls within specified limits. |
unknown | Not currently known. |
Each trait measurement also has an associated basis_of_value, which is a categorical variable describing how the trait value was obtained.
Possible values are:
key | value |
---|---|
measurement | Value is the result of a measurement(s) made on a specimen(s). |
expert_score | Value has been estimated by an expert based on their knowledge of the entity. |
model_derived | Value is derived from a statistical model, for example via gap-filling. |
unknown | Not currently known. |
7.3 Locations
Description: A table containing observations of location/site characteristics associated with information in traits. Cross referencing between the two dataframes is possible using combinations of the variables dataset_id, location_name.
Content:
key | value |
---|---|
dataset_id | Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005. |
location_id | A unique integer identifier for a location, with locations numbered sequentially within a dataset. The identifier links to specific information in the location table. |
location_name | The location name. |
location_property | The location characteristic being recorded. The name should include units of measurement, e.g. MAT (C). Ideally we have at least the following variables for each location: longitude (deg), latitude (deg), description. |
value | The measured value of a location property. |
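Since the locations table stores one row per location property, a common step is to pivot it to one record per location and then attach that record to trait rows. The Python sketch below uses hypothetical helper names and sample data, and it assumes the join is made on dataset_id and location_id, the two columns listed for both tables above.

```python
def locations_wide(location_rows):
    """Pivot long location rows into {(dataset_id, location_id): {property: value}}."""
    wide = {}
    for row in location_rows:
        key = (row["dataset_id"], row["location_id"])
        entry = wide.setdefault(key, {"location_name": row["location_name"]})
        entry[row["location_property"]] = row["value"]
    return wide


def attach_location(trait_row, wide):
    """Return a copy of a trait row with its location properties merged in."""
    key = (trait_row["dataset_id"], trait_row["location_id"])
    return {**trait_row, **wide.get(key, {})}


# Hypothetical location rows; coordinates are illustrative only.
location_rows = [
    {"dataset_id": "Falster_2005", "location_id": "01", "location_name": "Site A",
     "location_property": "latitude (deg)", "value": "-33.5"},
    {"dataset_id": "Falster_2005", "location_id": "01", "location_name": "Site A",
     "location_property": "longitude (deg)", "value": "151.0"},
]
wide_locations = locations_wide(location_rows)
```

Trait rows without a matching location are returned unchanged, which suits datasets where some records have no location information.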
7.4 Contexts
Description: A table containing observations of contextual characteristics associated with information in traits. Cross referencing between the two dataframes is possible using combinations of the variables dataset_id, link_id, and link_vals.
Content:
key | value |
---|---|
dataset_id | Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005. |
context_property | The contextual characteristic being recorded. If applicable, name should include units of measurement, e.g. CO2 concentration (ppm). |
category | The category of context property, with options being plot, treatment, individual_context, temporal and method. |
value | The measured value of a context property. |
description | Description of a specific context property value. |
link_id | Variable indicating which identifier column in the traits table contains the specified link_vals. |
link_vals | Unique integer identifiers that link between identifier columns in the traits table and the contextual properties/values in the contexts table. |
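The link_id and link_vals columns work as an indirect join: link_id names the identifier column in traits, and link_vals gives the identifier value to match. The Python sketch below illustrates that lookup under two assumptions that are not confirmed by the text above: each contexts row carries a single identifier in link_vals, and identifiers can be compared as strings. The helper name contexts_for and the sample rows are hypothetical.

```python
def contexts_for(trait_row, context_rows):
    """Return (context_property, value) pairs that apply to one trait row."""
    found = []
    for ctx in context_rows:
        if ctx["dataset_id"] != trait_row["dataset_id"]:
            continue
        # link_id names the traits column that holds the identifier.
        trait_identifier = trait_row.get(ctx["link_id"])
        if trait_identifier is None:
            continue
        if str(trait_identifier) == str(ctx["link_vals"]):
            found.append((ctx["context_property"], ctx["value"]))
    return found


# Hypothetical contexts rows.
context_rows = [
    {"dataset_id": "Falster_2005", "context_property": "CO2 concentration (ppm)",
     "category": "treatment", "value": "400",
     "link_id": "treatment_context_id", "link_vals": "01"},
    {"dataset_id": "Falster_2005", "context_property": "CO2 concentration (ppm)",
     "category": "treatment", "value": "550",
     "link_id": "treatment_context_id", "link_vals": "02"},
]
trait_row = {"dataset_id": "Falster_2005", "treatment_context_id": "02"}
matched = contexts_for(trait_row, context_rows)
```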
7.5 Methods
Description: A table containing details on methods with which data were collected, including time frame and source. Cross referencing with the traits table is possible using combinations of the variables dataset_id, trait_name.
Content:
key | value |
---|---|
dataset_id | Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005. |
trait_name | Name of the trait sampled. Allowable values are specified in the table definitions. |
methods | A textual description of the methods used to collect the trait data. Whenever available, methods are taken near-verbatim from the referenced source. Methods can include descriptions such as ‘measured on botanical collections’, ‘data from the literature’, or a detailed description of the field or lab methods used to collect the data. |
method_id | A unique integer identifier to distinguish between multiple sets of methods used to measure a single trait within the same dataset. The identifier links to specific information in the methods table. |
description | A 1-2 sentence description of the purpose of the study. |
sampling_strategy | A written description of how study locations were selected and how study individuals were selected. When available, this information is lifted verbatim from a published manuscript. For preserved specimens, this field ideally indicates which records were ‘sampled’ to measure a specific trait. |
source_primary_key | Citation key for the primary source in sources. The key is typically formatted as Surname_year. |
source_primary_citation | Citation for the primary source. This detail is generated from the primary source in the metadata. |
source_secondary_key | Citation key for the secondary source in sources. The key is typically formatted as Surname_year. |
source_secondary_citation | Citations for the secondary source. This detail is generated from the secondary source in the metadata. |
source_original_dataset_key | Citation key for the original dataset_id in sources; for compilations. The key is typically formatted as Surname_year. |
source_original_dataset_citation | Citations for the original dataset_id in sources; for compilations. This detail is generated from the original source in the metadata. |
data_collectors | The person (people) leading data collection for this study. |
assistants | Names of additional people who played a more minor role in data collection for the study. |
dataset_curators | Names of AusTraits team member(s) who contacted the data collectors and added the study to the AusTraits repository. |
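To look up the methods behind a trait record, the methods table can be indexed on the columns it shares with traits. The Python sketch below keys on dataset_id, trait_name and method_id; including method_id is an assumption made here so that datasets with several method sets for one trait resolve to a single row. The helper names and sample rows are hypothetical.

```python
def index_methods(method_rows):
    """Build a lookup keyed by (dataset_id, trait_name, method_id)."""
    return {
        (m["dataset_id"], m["trait_name"], m["method_id"]): m
        for m in method_rows
    }


def methods_for(trait_row, index):
    """Return the methods row for a trait row, or None if there is no match."""
    key = (trait_row["dataset_id"], trait_row["trait_name"], trait_row["method_id"])
    return index.get(key)


# Hypothetical methods rows.
method_rows = [
    {"dataset_id": "Falster_2005", "trait_name": "leaf_mass_per_area",
     "method_id": "01", "methods": "measured on fresh leaves"},
    {"dataset_id": "Falster_2005", "trait_name": "leaf_mass_per_area",
     "method_id": "02", "methods": "measured on botanical collections"},
]
method_index = index_methods(method_rows)
```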
7.6 Excluded_data
Description: A table of data that did not pass quality tests and so were excluded from the master dataset. The structure is identical to that presented in the traits table, only with an extra column called error indicating why the record was excluded. Common reasons are missing_unit_conversions, missing_value, and unsupported_trait_value.
Content:
key | value |
---|---|
error | Indicating why the record was excluded. Common reasons are missing_unit_conversions, missing_value, and unsupported_trait_value. |
dataset_id | Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005. |
taxon_name | Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level. |
observation_id | A unique integer identifier for the observation, where an observation is all measurements made on an individual at a single point in time. It is important for joining traits coming from the same observation_id. Within each dataset, observation_id values are unique combinations of taxon_name, population_id, individual_id, and temporal_context_id. |
trait_name | Name of the trait sampled. Allowable values are specified in the table definitions. |
value | The measured value of a trait. |
unit | Units of the sampled trait value after aligning with AusTraits standards. |
entity_type | A categorical variable specifying the entity corresponding to the trait values recorded. |
value_type | A categorical variable describing the statistical nature of the trait value recorded. |
basis_of_value | A categorical variable describing how the trait value was obtained. |
replicates | Number of replicate measurements that comprise a recorded trait measurement. A numeric value (or range) is ideal and appropriate if the value type is a mean, median, min or max. For these value types, if replication is unknown the entry should be unknown. If the value type is raw_value the replicate value should be 1. If the trait is categorical or the value indicates a measurement for an entire species (or other taxon), the replicate value should be .na. |
basis_of_record | A categorical variable specifying from which kind of specimen traits were recorded. |
life_stage | A field to indicate the life stage or age class of the entity measured. Standard values are adult, sapling, seedling and juvenile. |
population_id | A unique integer identifier for a population, where a population is defined as individuals growing in the same location (location_id /location_name) and plot (plot_context_id, a context category) and being subjected to the same treatment (treatment_context_id, a context category). |
individual_id | A unique integer identifier for an individual, with individuals numbered sequentially within each dataset by taxon by population grouping. Most often each row of data represents an individual, but in some datasets trait data collected on a single individual is presented across multiple rows of data, such as if the same trait is measured using different methods or the same individual is measured repeatedly across time. |
repeat_measurements_id | A unique integer identifier for repeat measurements of a trait that comprise a single observation, such as a response curve. |
temporal_context_id | A unique integer identifier assigned where repeat observations are made on the same individual (or population, or taxon) across time. The identifier links to specific information in the context table. |
source_id | For datasets that are compilations, an identifier for the original data source. |
location_id | A unique integer identifier for a location, with locations numbered sequentially within a dataset. The identifier links to specific information in the location table. |
entity_context_id | A unique integer identifier indicating specific contextual properties of an individual, possibly including the individual’s sex or caste (for social insects). |
plot_context_id | A unique integer identifier for a plot, where a plot is a distinct collection of organisms within a single geographic location, such as plants growing on different aspects or blocks in an experiment. The identifier links to specific information in the context table. |
treatment_context_id | A unique integer identifier for a treatment, where a treatment is any experimental manipulation to an organism’s growing/living conditions. The identifier links to specific information in the context table. |
collection_date | Date the sample was taken, in the format yyyy-mm-dd, yyyy-mm or yyyy, depending on the resolution specified. Alternatively, an overall range for the study can be indicated, with the starting and ending sample dates separated by a /, as in 2010-10/2011-03. |
measurement_remarks | Brief comments or notes accompanying the trait measurement. |
method_id | A unique integer identifier to distinguish between multiple sets of methods used to measure a single trait within the same dataset. The identifier links to specific information in the methods table. |
method_context_id | A unique integer identifier indicating a trait is measured multiple times on the same entity, with different methods used for each entry. This field is only used if a single trait is measured using multiple methods within the same dataset. The identifier links to specific information in the context table. |
original_name | Name given to taxon in the original data supplied by the authors. |
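The error column makes it straightforward to see why records were excluded. A minimal Python sketch that tallies excluded rows by reason, using hypothetical rows and the hypothetical helper name error_summary:

```python
from collections import Counter


def error_summary(excluded_rows):
    """Count excluded records by the reason given in the error column."""
    return Counter(row["error"] for row in excluded_rows)


# Hypothetical excluded rows showing only the relevant columns.
excluded_rows = [
    {"dataset_id": "Falster_2005", "trait_name": "leaf_mass_per_area", "error": "missing_value"},
    {"dataset_id": "Falster_2005", "trait_name": "woodiness", "error": "unsupported_trait_value"},
    {"dataset_id": "Falster_2005", "trait_name": "leaf_mass_per_area", "error": "missing_value"},
]
summary = error_summary(excluded_rows)
```

Grouping by (dataset_id, error) instead would show which contributed datasets need the most attention.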
7.7 Taxa
Description: A table containing details on taxa that are included in the table traits. We have attempted to align species names with known taxonomic units in the Australian Plant Census (APC) and/or the Australian Plant Names Index (APNI); the sourced information is released under a CC-BY3 license.
Version 0.1.0 of AusTraits contains records for 6739 different taxa.
Content:
key | value |
---|---|
taxon_name | Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level. |
taxonomic_dataset | Name of the taxonomy (tree) that contains this concept, e.g. APC, AusMoss. |
taxon_rank | The taxonomic rank of the most specific name in the scientific name. |
trinomial | The infraspecific taxon name match for an original name. This column is assigned na for taxon names that are at a broader taxonomic_resolution. |
binomial | The species-level taxon name match for an original name. This column is assigned na for taxon names that are at a broader taxonomic_resolution. |
genus | Genus of the taxon without authorship. |
family | Family of the taxon. |
taxon_distribution | Known distribution of the taxon, by Australian state. |
establishment_means | Statement about whether an organism or organisms have been introduced to a given place and time through the direct or indirect activity of modern humans. |
taxonomic_status | The status of the use of the scientificName as a label for the taxon in regard to the ‘accepted (or valid) taxonomy’. The assigned taxonomic status must be linked to a specific taxonomic reference that defines the concept. |
taxon_id | An identifier for the set of taxon information (data associated with the taxon class). May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset. |
taxon_id_genus | An identifier for the set of taxon information (data associated with the taxon class) for the genus associated with a taxon name. May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset. |
taxon_id_family | An identifier for the set of taxon information (data associated with the taxon class) for the family associated with a taxon name. May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset. |
scientific_name | The full scientific name, with authorship and date information if known. |
scientific_name_id | An identifier for the set of taxon information (data associated with the taxon class). May be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset. |
7.8 Taxonomic_updates
Description: A table of all taxonomic changes implemented in the construction of AusTraits. Changes are determined by comparing the originally submitted taxon name against the taxonomic names listed in the taxonomic reference files, best placed in a subfolder in the config folder. Cross referencing with the traits table is possible using combinations of the variables dataset_id and taxon_name.
Content:
key | value |
---|---|
dataset_id | Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005. |
original_name | Name given to taxon in the original data supplied by the authors. |
aligned_name | The taxon name without authorship after implementing automated syntax standardisation and spelling changes as well as manually encoded syntax alignments for this taxon in the metadata file for the corresponding dataset_id. This name has not yet been matched to the currently accepted (botanical) or valid (zoological) taxon name in cases where there are taxonomic synonyms, isonyms, orthographic variants, etc. |
taxonomic_resolution | The rank of the most specific taxon name (or scientific name) to which a submitted original name resolves. |
taxon_name | Scientific name of the taxon on which traits were sampled, without authorship. When possible, this is the currently accepted (botanical) or valid (zoological) scientific name, but might also be a higher taxonomic level. |
aligned_name_taxon_id | An identifier for the aligned name before it is updated to the currently accepted name usage. This may be a global unique identifier or an identifier specific to the data set. Must be resolvable within this dataset. |
aligned_name_taxonomic_status | The status of the use of the aligned_name as a label for a taxon. Requires taxonomic opinion to define the scope of a taxon. Rules of priority then are used to define the taxonomic status of the nomenclature contained in that scope, combined with the expert’s opinion. It must be linked to a specific taxonomic reference that defines the concept. |
Both the original and the updated taxon names are included in the traits table.
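Because traits carries both original_name and taxon_name, the intermediate aligned_name can be recovered by looking up the matching row in taxonomic_updates. The Python sketch below keys the lookup on dataset_id, original_name and taxon_name; treating that combination as unique is an assumption, and the helper name and the placeholder taxon names are hypothetical.

```python
def name_history(trait_row, update_rows):
    """Return (original_name, aligned_name, taxon_name) for a trait row, or None."""
    for update in update_rows:
        if (update["dataset_id"] == trait_row["dataset_id"]
                and update["original_name"] == trait_row["original_name"]
                and update["taxon_name"] == trait_row["taxon_name"]):
            return (update["original_name"], update["aligned_name"], update["taxon_name"])
    return None


# Placeholder names, not real taxa.
update_rows = [
    {"dataset_id": "Falster_2005", "original_name": "Aus buss",
     "aligned_name": "Aus bus", "taxon_name": "Aus cus"},
]
trait_row = {"dataset_id": "Falster_2005", "original_name": "Aus buss", "taxon_name": "Aus cus"}
history = name_history(trait_row, update_rows)
```

Here the hypothetical history reads as a spelling correction (original to aligned) followed by a synonym update (aligned to accepted taxon name).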
7.9 Definitions
Description: A copy of the definitions for all tables and terms. Information included here was used to process data and generate any documentation for the study.
Details on trait definitions: The allowable trait names and trait values are defined in the definitions file. Each trait is labelled as either numeric or categorical. An example of each type is as follows. For more detail, see the Trait definitions for AusTraits.
leaf_mass_per_area
- number of records: 2815
- number of studies: 13
woodiness
- number of records: 0
- number of studies: 0
7.10 Contributors
Description: A table of people contributing to each study.
Content:
key | value |
---|---|
dataset_id | Primary identifier for each study contributed to AusTraits; most often these are scientific papers, books, or online resources. By default this should be the name of the first author and year of publication, e.g. Falster_2005. |
last_name | Last name of the data collector. |
given_name | Given names of the data collector. |
ORCID | ORCID of the data collector. |
affiliation | Last known institution or affiliation. |
additional_role | Additional roles of data collector, mostly contact person. |
7.11 Sources
For each dataset in the compilation there is the option to list primary and secondary citations. The primary citation is defined as the original study in which data were collected. The secondary citation is defined as a subsequent study where data were compiled or re-analysed.
The element sources includes BibTeX versions of all sources, which can be imported into your reference library:
Or individually viewed:
A formatted version of the sources also exists within the table methods.
7.12 Metadata
Description: Metadata associated with the dataset, including title, creators, license, subject, funding sources.
7.13 Build_info
Description: A description of the computing environment used to create this version of the dataset, including version number, git commit and R session_info.