vignettes/austraits_file_structure.Rmd
The main directory for the austraits.build repository contains the following files and folders, with purpose as indicated.
├── README.md # landing page
├── .github # folder containing github actions, issue templates, code of conduct
├── LICENCE
├── NEWS.md
├── _pkgdown.yml # used to create the pkgdown website
├── docs # contains website
├── Dockerfile # creates an image of R environment used in build
├── NAMESPACE # functions being exported
├── DESCRIPTION # R package description
├── man # function documentation (.Rd help files)
├── R # folder with functions and scripts to build AusTraits
├── tests # defines tests applied to datasets
├── vignettes # documentation of repo file structure, AusTraits database structure, definitions, data input processes
The folder config contains three files which govern the building of the dataset.
config
├── definitions.yml
├── taxon_list.csv
└── unit_conversions.csv
The file definitions.yml defines the structure of the database and all terms. There are three main sections to the definitions.yml file: the trait definitions used to compile AusTraits, the allowable trait values, and definitions of all elements included in the tables that comprise AusTraits. The trait definitions and definitions of all elements in AusTraits (and value types) are fully described in additional vignettes. A .yml file is a structured data file where information is presented in a hierarchical format (see appendix for details).
The file taxon_list.csv is our master list of known taxa. Each species is listed once, with links to species’ identifiers provided by the Australian Plant Name Index (APNI). The file taxon_list.csv is added to if a study includes taxa not previously represented in AusTraits. These can be names included in the APC or APNI, which are compilations of taxonomic concepts (APC) or names (APNI) for plants that are either native to or naturalised in Australia, or they can be taxa without recognised names.
taxon_name | family | scientificNameAuthorship | source | taxonomicStatusClean | taxonIDClean |
---|---|---|---|---|---|
Abelia | Caprifoliaceae | R.Br. | APC | accepted | https://id.biodiversity.org.au/node/apni/2892114 |
Abelia x grandiflora | Caprifoliaceae | (Rovelli ex André) Rehder | APC | accepted | https://id.biodiversity.org.au/node/apni/2914209 |
Abelmoschus | Malvaceae | Medik. | APC | accepted | https://id.biodiversity.org.au/node/apni/2898872 |
Abelmoschus ficulneus | Malvaceae | (L.) Wight | APC | accepted | https://id.biodiversity.org.au/node/apni/2897916 |
Abelmoschus manihot | Malvaceae | (L.) Medik. | APC | accepted | https://id.biodiversity.org.au/node/apni/2901085 |
Abelmoschus moschatus | Malvaceae | Medik. | APC | accepted | https://id.biodiversity.org.au/node/apni/2900572 |
Abildgaardia | Cyperaceae | Vahl | APC | accepted | https://id.biodiversity.org.au/node/apni/2905759 |
Abildgaardia ovata | Cyperaceae | (Burm.f.) Kral | APC | accepted | https://id.biodiversity.org.au/node/apni/2919627 |
Abildgaardia vaginata | Cyperaceae | R.Br. | APC | accepted | https://id.biodiversity.org.au/node/apni/2899106 |
Abrodictyum | Hymenophyllaceae | C.Presl | APC | accepted | https://id.biodiversity.org.au/node/apni/7402562 |
The file unit_conversions.csv defines the unit conversions that are used when converting contributed trait data to common units, e.g.
unit_from | unit_to | function |
---|---|---|
% | mg/g | x*10 |
% | g/g | x*0.01 |
% | mg/mg | x*0.01 |
% | mg/kg | x*10000 |
% | n/n | x*0.01 |
% | dimensionless | x*.01 |
years | month | x*12 |
count/m2 | count/mm2 | x*1/1000000 |
cm | m | x*0.01 |
cm | mm | x*10 |
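As a rough illustration of how one row of this table could be applied, the sketch below evaluates the text from the function column with x bound to the contributed values. The helper name apply_conversion and the column name fun are hypothetical choices made for this example; the actual austraits.build code may apply conversions differently.

```r
# Hypothetical helper (not part of austraits.build): evaluate a conversion
# expression such as "x*10" with `x` bound to the values being converted.
apply_conversion <- function(x, fun_text) {
  eval(parse(text = fun_text), envir = list(x = x))
}

# A few rows copied from the table above; the third column is called `fun`
# here only to avoid the awkward column name `function`.
conversions <- data.frame(
  unit_from = c("%", "cm", "years"),
  unit_to   = c("mg/g", "m", "month"),
  fun       = c("x*10", "x*0.01", "x*12"),
  stringsAsFactors = FALSE
)

# Look up the rule for % -> mg/g and apply it to a value of 2.5 %
rule <- conversions[conversions$unit_from == "%" & conversions$unit_to == "mg/g", ]
converted <- apply_conversion(2.5, rule$fun)
converted
```

Storing each rule as text keeps the table editable in a spreadsheet, at the cost of evaluating strings as code, so the file should only ever contain trusted expressions.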
The folder data contains the raw data from individual studies included in AusTraits.
Records within the data folder are organised as coming from a particular study, defined by the dataset_id. Data from each study is organised into a separate folder, with two files:
data.csv: a table containing the actual trait data.
metadata.yml: a file that contains study metadata (source, methods, sites, and context), maps trait names and units onto standard types, and lists any substitutions applied to the data in processing.
The folder data thus contains a long list of folders, one for each study and each containing two files:
data
├── Angevin_2010
│ ├── data.csv
│ └── metadata.yml
├── Barlow_1981
│ ├── data.csv
│ └── metadata.yml
├── Bean_1997
│ ├── data.csv
│ └── metadata.yml
├── ....
where Angevin_2010, Barlow_1981, and Bean_1997 are each a unique dataset_id in the final dataset.
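The layout described above can be checked with a few lines of base R. The sketch below first builds a throwaway copy of the folder structure in a temporary directory (purely for illustration; in practice root would point at the repository's data folder) and then confirms that every study folder holds both required files.

```r
# Build a throwaway folder that mimics data/<dataset_id>/ (illustration only)
root <- file.path(tempdir(), "austraits_demo", "data")
for (id in c("Angevin_2010", "Barlow_1981")) {
  dir.create(file.path(root, id), recursive = TRUE, showWarnings = FALSE)
  file.create(file.path(root, id, "data.csv"))
  file.create(file.path(root, id, "metadata.yml"))
}

# Each sub-folder name is a dataset_id
dataset_ids <- list.dirs(root, full.names = FALSE, recursive = FALSE)

# Check that every study folder holds both required files
has_both <- vapply(dataset_ids, function(id) {
  all(file.exists(file.path(root, id, c("data.csv", "metadata.yml"))))
}, logical(1))
has_both
```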
The file data.csv contains raw measurements and can be in either long or wide format.
Required columns include the taxon name, the trait name (column in long format, header in wide format), units (column in long format, part of header in wide format), site (if applicable), context (if applicable), date (if available), and trait values.
It is important that all trait measurements made on the same individual or that are the mean of a species’ measurements from the same site are kept linked.
If the data are in wide format, each row should include measurements made on a single individual or a single species-by-site mean, with different trait values as consecutive columns.
If the data are in long format, an additional column is required to keep together the multiple trait measurements that were made on the same individual, or that are the mean of a species’ measurements from the same site: observation_id (or another identifier) must be assigned to identify which rows of measurements (for different traits) are linked to a unique individual or site.
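To make the role of the identifier concrete, the sketch below reshapes a small long-format table into wide format with base R. The trait values are invented for illustration, and the column names are assumptions that follow the terms used in this document; rows sharing an observation_id end up on a single row.

```r
# Invented long-format data: two observations, two traits each
long <- data.frame(
  observation_id = c(1, 1, 2, 2),
  taxon_name     = c("Abelia", "Abelia", "Abelmoschus", "Abelmoschus"),
  trait_name     = c("leaf_area", "plant_height", "leaf_area", "plant_height"),
  value          = c(120, 1.5, 300, 0.8),
  stringsAsFactors = FALSE
)

# Rows that share an observation_id collapse into one row, one column per trait
wide <- reshape(
  long,
  idvar     = c("observation_id", "taxon_name"),
  timevar   = "trait_name",
  direction = "wide"
)
wide
```

Without observation_id there would be no way to tell which leaf_area value belongs with which plant_height value once several rows share the same taxon.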
We aim to keep the data file in the rawest form possible (i.e. with as few changes as possible), but it must be a single csv file. Additional custom R code may be required to make the file exactly compatible with the AusTraits format, but these changes should be executed as AusTraits is compiled and should be in the metadata.yml file under config/custom_R_code (see below). Any files used to create the submitted data.csv file (e.g. Excel …) should be archived in a sub-folder within the study folder named raw.
The metadata is compiled in a .yml file, a structured data file where information is presented in a hierarchical format (see Appendix for details). There are 11 values at the top hierarchical level: source, people, dataset, sites, contexts, config, traits, substitutions, taxonomic_updates, exclude_observations, questions. These are each described below.
As a start, you may want to check out some examples from existing studies in AusTraits, e.g. Angevin_2010 or Wright_2009.
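The hierarchical structure is easy to explore from R with the yaml package. The sketch below parses a heavily abbreviated metadata file, inlined as text so that the example is self-contained (a real file would be read with yaml::read_yaml() instead), and pulls out a few nested values.

```r
library(yaml)

# A heavily abbreviated metadata.yml, inlined as text for illustration
txt <- "
source:
  primary:
    key: Falster_2005_1
    bibtype: Article
people:
- name: Daniel Falster
  institution: Macquarie University
  role: collector, contact, contributor
config:
  data_is_long_format: yes
"

metadata <- yaml.load(txt)

# Top-level sections present in this abbreviated file
names(metadata)

# Nested values are reached with $ (maps) and [[ ]] (sequences)
metadata$source$primary$key
metadata$people[[1]]$name
```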
This section provides citation details for the original source(s) for the data, whether it is a published journal article, book, website, or thesis. In general we aim to reference the primary source. References are written in structured yml format, under the category source and then sub-groupings primary and secondary.
General guidelines for describing a source:
The key takes the form Surname_year and should be identical to the name given to the dataset folder. A second instance of the identical Surname_year should have the key Surname_year_2.
… contributor of the dataset.
An example of a primary source that is a journal article is:
source:
  primary:
    key: Falster_2005_1
    bibtype: Article
    author: Daniel S. Falster, Mark Westoby
    year: 2005
    title: Alternative height strategies among 45 dicot rain forest species from tropical Queensland, Australia
    journal: Journal of Ecology
    volume: 93
    pages: 521--535
    publisher: Wiley-Blackwell
    doi: 10.1111/j.0022-0477.2005.00992.x
If a secondary source is included it may look like:
primary:
  key: Choat_2006
  bibtype: Article
  year: '2006'
  author: B. Choat and M. C. Ball and J. G. Luly and C. F. Donnelly and J. A. M.
    Holtum
  journal: Tree Physiology
  title: Seasonal patterns of leaf gas exchange and water relations in dry rain
    forest trees of contrasting leaf phenology
  volume: '26'
  number: '5'
  pages: 657--664
  doi: 10.1093/treephys/26.5.657
secondary:
  key: Choat_2005
  bibtype: Article
  year: '2005'
  author: Brendan Choat and Marilyn C. Ball and Jon G. Luly and Joseph A. M. Holtum
  journal: Trees
  title: Hydraulic architecture of deciduous and evergreen dry rainforest tree species
    from north-eastern Australia
  volume: '19'
  number: '3'
  pages: 305--311
  doi: 10.1007/s00468-004-0392-1
This section provides a list of the key contributors to the study, their respective institutions and roles in the study. Roles are defined as follows:
key | value |
---|---|
collector | The person (people) leading data collection (generally 1-2 people) |
contributor | Person responsible for entering data into AusTraits |
lab_leader | Leader of lab group at time of collection |
assistant | Anyone else who assisted in collection of the data |
contact | The person to contact with questions about the data set |
An example is as follows:
people:
- name: Daniel Falster
  institution: Macquarie University
  role: collector, contact, contributor
- name: Mark Westoby
  institution: Macquarie University
  role: lab_leader
Note that only the AusTraits custodians have the contributors’ email addresses on file. This information will not be directly available to AusTraits users or new contributors via GitHub.
This section includes study details, including study description, sampling strategy, sampling time frame, and sample age class.
The following elements are included under the element dataset:
collection_type: field, lab, glasshouse, botanical collection, or literature. The latter should only be used when the data were sourced from the literature and the collection type is unknown.
sample_age_class: adult or juvenile plants.
An example is as follows:
year_collected_start: 2004
year_collected_end: 2004
description: Trait values for species with faster versus slower height growth following disturbance for Myall Lakes species.
collection_type: field
sample_age_class: adult
sampling_strategy: Fire is a recurrent disturbance in the park (interval – 0–30 years; Fox and Fox 1986). A mosaic of fire histories has facilitated previous use of space-for-time substitutions in studies of small mammal succession (Fox and McKay 1981). Here we employ the same methodology to reconstruct species height-growth trajectories (Enright and Goldblum 1999). Sites were identified at a range of times since fire with the use of NSW national parks GIS fire history records and personal observations of Karen Ross (Ross et al. 2002). Patches of vegetation 1, 2, 4, 8, 10, 12, 15, 27 and 28 years since fire were identified. Where possible several patches within a given age class were surveyed to determine species presence or absence. Nineteen species recorded in a majority of patches were selected for further study. This included eight resprouting species and 11 obligate seeders (full list in Appendix 1).
original_file: Falster & Westoby 2005 Oikos appendix.doc
notes: none
This section includes information on the format of the submitted data file.
Values are as follows:
data_is_long_format: yes or no.
variable_match: matches AusTraits terminology to the column names in data.csv, excluding the actual trait data. One element within variable_match must be taxon_name. Datasets with data_is_long_format set to yes must also identify which column includes data on value and trait_name. Other allowed values include date, site_name, observation_id, and context_name, if these columns are present in the data.csv file or created with custom R code.
custom_R_code: .na indicates no custom R code was used.
An example is
config:
  data_is_long_format: yes
  variable_match:
    taxon_name: Taxon
    value: trait value
    trait_name: trait
  custom_R_code: '
    data %>%
    mutate(`trait value` = ifelse(trait == "flowering time",
    convert_month_range_vec_to_binary(`trait value`), `trait value`)
    )
    '
A common use of the custom_R_code is to automate the conversion of a verbal description of flowering or fruiting periods into the supported trait values, as occurs in this example. It might also be used if values for a single trait are expressed across multiple columns and need to be merged. See Catford_2014 as an example of this. The adding data vignette provides additional examples of code regularly implemented in custom_R_code, including functions specifically developed for AusTraits data manipulations.
This section provides a translation table mapping traits and units from a contributed study onto corresponding variables in AusTraits. Also specified here are methods used to collect the data.
For each trait submitted to AusTraits, there is the following information: var_in, unit_in, trait_name, value_type, replicates, and methods.
replicates: a numeric value is appropriate when the value type is mean, median, min or max. For these value types, if replication is unknown the entry should be unknown. If the value type is raw_value the replicate value should be 1. If the value type is expert_mean, expert_min, or expert_max the replicate value should be .na.
Values under trait_name must be allowable values, listed in the definitions file or vignette. Similarly, values under value_type must be allowable values, also listed in the definitions file.
An example is as follows:
traits:
- var_in: LMA (mg mm-2)
  unit_in: mg/mm2
  trait_name: specific_leaf_area
  value_type: site_mean
  replicates: 3
  methods: LMA was calculated as the leaf dry mass (oven-dried for 48 hours at 65 °C) divided by leaf size. It was measured on the first five fully expanded leaves at the tip of each individual.
- var_in: leaf size (mm2)
  unit_in: mm2
  trait_name: leaf_area
  value_type: site_mean
  replicates: 3
  methods: Leaf size was calculated as the one-sided leaf area (flat bed scanner). It was measured on the first five fully expanded leaves at the tip of each individual.
This section provides a list of any “find and replace” substitutions needed to get the data into the right format.
Substitutions are required whenever the exact word(s) used to describe a categorical trait value in AusTraits differ from the vocabulary used by the author in the data.csv file. It is preferable to align vocabulary using substitutions rather than changing the data.csv file. The trait definitions file provides a list of supported values for each trait.
Each substitution is documented using the following elements: trait_name, find, and replace.
An example is as follows:
substitutions:
- trait_name: life_history
  find: p
  replace: perennial
- trait_name: plant_growth_form
  find: s
  replace: shrub
- ...
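A minimal sketch of how such a list might be applied to long-format data is given below. The function apply_substitutions is hypothetical and written only for this example, and it assumes a substitution applies when the whole value matches find exactly; the matching rules used by the actual build code may differ.

```r
# Invented long-format data using the author's own vocabulary
data <- data.frame(
  trait_name = c("life_history", "plant_growth_form", "plant_growth_form"),
  value      = c("p", "s", "tree"),
  stringsAsFactors = FALSE
)

# The substitutions from the example above, as they look after YAML parsing
substitutions <- list(
  list(trait_name = "life_history", find = "p", replace = "perennial"),
  list(trait_name = "plant_growth_form", find = "s", replace = "shrub")
)

# Hypothetical helper: replace a value only where both the trait and the
# whole value match (an assumption made for this sketch)
apply_substitutions <- function(data, substitutions) {
  for (s in substitutions) {
    hit <- data$trait_name == s$trait_name & data$value == s$find
    data$value[hit] <- s$replace
  }
  data
}

result <- apply_substitutions(data, substitutions)
result$value
```

Keying each substitution on trait_name matters because the same abbreviation can mean different things for different traits.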
This section provides a table of taxonomic name changes needed to align original names in the dataset with taxa in the APC and APNI.
Each substitution is documented using the following elements: find, replace, and reason.
The replacement taxon_name can indicate an alignment with a name included in the comprehensive Australian Plant Name Index (APNI) but not currently marked as accepted in APC, or a name that cannot be aligned to available lists of Australian plant names.
Reasons include typos, taxonomic synonyms, and standardising spellings.
Algorithms within AusTraits automatically align outdated taxonomy and taxonomic synonyms to their currently accepted scientific name, so such adjustments are not documented as substitutions.
An example is as follows:
taxonomic_updates:
- find: Eucalyptus albens X crebra
  replace: Eucalyptus albens x Eucalyptus crebra
  reason: Change wording for hybrid species (Elizabeth Wenk, 2020-06-30)
- find: Eucalyptus albens x moluccana
  replace: Eucalyptus albens x Eucalyptus moluccana
  reason: Change wording for hybrid species (Elizabeth Wenk, 2020-06-30)
This section provides a place to record any queries we have about the dataset (recorded as a named array), including notes on any additional traits that may have been collected in the study but have not been incorporated into AusTraits.
An example is as follows:
questions:
  questions for author: Triglochin procera has very different seed masses in the main traits spreadsheet and the field seeds worksheet. Which is correct? There are a number of species with values in the field leaves worksheet that are absent in the main traits worksheet - we have included this data into Austraits; please advise if this was inappropriate.
  austraits: need to map aquatic_terrestrial onto an actual trait once one is created.
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. This is a common format for storing tables of data in a simple text file. You can edit it in Excel or in a text editor.
The yml file extension (pronounced “YAML”) denotes a type of structured data file that is both human and machine readable. You can edit it in any text editor, or in RStudio. Generally, yml is used in situations where a table does not suit because of variable lengths and/or nested structures. It has the advantage over a spreadsheet that the nested “headers” can have variable numbers of categories. The data under each of the hierarchical headings are easily extracted by R.
Occasionally the changes we want to make to a dataset may not fit into the prescribed workflow used in AusTraits. For example, we assume each trait has a single unit, but there are a few datasets where data on different rows have different units. In such cases we want to make some custom modifications to the particular dataset before the common pipeline of operations gets applied. To make this possible, the workflow allows for some custom R code to be run as a first step in the processing pipeline. That pipeline (in the function read_data_study) looks like:
data <-
read_csv(filename_data_raw, col_types = cols()) %>%
custom_manipulation(metadata[["config"]][["custom_R_code"]])() %>%
parse_data(dataset_id, metadata) %>%
...()
Note the custom_manipulation step, which runs the custom R code before the data are parsed.
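The trailing () in that step suggests that custom_manipulation returns a function, which the pipe then calls with the data. The sketch below shows one way such a function factory could work. It is a hypothetical stand-in named custom_manipulation_sketch, not the implementation in austraits.build, and it assumes that .na (or a missing entry) means "leave the data unchanged".

```r
# Hypothetical stand-in for custom_manipulation(); the real function in
# austraits.build may differ.
custom_manipulation_sketch <- function(txt) {
  # No custom code supplied: return a function that leaves data untouched
  if (is.null(txt) || is.na(txt) || txt == ".na") {
    return(identity)
  }
  # Otherwise return a function that evaluates the code text with the
  # table bound to the name `data`
  function(data) {
    eval(parse(text = txt), envir = list(data = data))
  }
}

# Invented raw data and a one-line piece of custom code (base R only)
raw  <- data.frame(site = c("A", "B"), height_cm = c(150, 80))
code <- "transform(data, height_m = height_cm * 0.01)"

modified  <- custom_manipulation_sketch(code)(raw)
untouched <- custom_manipulation_sketch(".na")(raw)
modified
```

Returning a function rather than a modified table is what lets the step sit inside a pipe: the expression before () is evaluated once to build the function, and the piped data are then passed to it.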
As an example, for Blackman_2010 we want to combine two columns to create an appropriate site variable. Here is the code that was included in data/Blackman_2010/metadata.yml under custom_R_code.
data %>% mutate(
site = ifelse(site == "Mt Field" & habitat == "Montane rainforest", "Mt Field_wet", site),
site = ifelse(site == "Mt Field" & habitat == "Dry sclerophyll", "Mt Field_dry", site)
)
This is the finished solution, but to get there we did as follows.
Generally, this code should
operate on an object called data, and apply whatever fixes are needed
use functions such as mutate, rename, etc
use ; at the end of each statement
First, load an object called data:
library(readr)
library(yaml)
data <- read_csv(file.path("data", "Blackman_2010", "data.csv"), col_types = cols(.default = "c"))
data
Second, write your code to manipulate data, like the example above.
Third, once you have some working code, you then want to add it into your yml file under a group config -> custom_R_code.
Finally, check it works. Let’s assume you added it in. The function metadata_check_custom_R_code loads the data and applies the custom R code:
metadata_check_custom_R_code("Blackman_2010")