This vignette explains the protocol for adding a new study to AusTraits. Before starting, you should read about how AusTraits is organised. It is important that all steps are followed so that our automated workflow proceeds without problems.
The main steps are:

1. Clone the austraits.build repository from GitHub and create a new branch in the repo with the name of the dataset_id, e.g. Gallagher_2014.
2. Create a new folder within data with a name matching the dataset_id, e.g. Gallagher_2014.
3. Prepare the data.csv file and place it within the new folder (details here).
4. Prepare the metadata.yml file and place it within the new folder (details here).
5. Edit the data.csv and metadata.yml files as necessary (details here).

Beyond these steps, the only files you should need to edit are your study's data.csv and metadata.yml files.

It may help to download one of the existing datasets and use it as a template for your own files and a guide on required content. You should look at the files in the config folder, in particular the definitions file, for the list of traits we cover and the supported trait values for each trait. The GitHub repository also hosts a compiled trait definitions table.
Once you have prepared your data.csv
and metadata.yml
files within a folder in the data
directory, you can incorporate the new data into AusTraits by running austraits_rebuild_remake_setup().
This step updates the file remake.yml with appropriate rules for the new dataset; do the same if you remove datasets. (At this stage, remake offers no looping constructs, so for now we generate the remake file using whisker.)
You can then rebuild AusTraits, including your dataset.
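That is, with the same commands used in the quality-check steps later in this vignette:

austraits_rebuild_remake_setup()
austraits <- remake::make("austraits")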
The austraits.build
repository includes a number of functions that help build the repository. To use these, you'll need to make them available. The easiest way to load the functions into your workspace is to run the following (from within the repository):
devtools::load_all()
Add a new folder within the data
folder. Its name should be the study’s dataset_id
, the core organising unit behind AusTraits.
Our preferred format for dataset_id is the surname of the first author of any corresponding publication, followed by the year, as surname_year, e.g. Falster_2005. Wherever there are multiple studies with the same id, we add a suffix _2, _3, etc., e.g. Falster_2005, Falster_2005_2.
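If you prefer, the folder can be created from R rather than in a file manager; a minimal base R sketch (the dataset_id here is hypothetical):

# run from the root of the austraits.build repository
dir.create(file.path("data", "Yang_2028"))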
The data.csv file

All data for a study (dataset_id) must be merged into a single spreadsheet, data.csv.
Each row must include the taxon name, site_name (if appropriate), context (if appropriate), and sample_date (if appropriate). Trait data can be in either wide format (one column for each trait, with the trait name as the column header) or long format (one column for all trait values and additional columns for trait name and units). See the sections on adding sites and adding contexts below for more information on compiling site and context data.

Include a site name, even if all measurements were made at a single site:
read_csv("data/dataset_id/data.csv") %>%
mutate(site_name = "Daintree National Park")
Include a sampling date if the information is provided in the manuscript or separately by the contributor. Whenever possible, add dates in yyyy-mm-dd format (e.g. 2020-03-05) or, if the day of the month isn't known, as yyyy-mm (e.g. 2020-03).

(Note: development is in progress to allow AusTraits to recognise and include multiple contextual columns.)
Variation in contextual values must be summarised in a single column. This is easy in most circumstances, where context values will be a single column of values, distinguishing, for instance, between wet_season
vs. dry_season
or sun_leaves
vs. shade_leaves
. It can become much more difficult for experimental studies with multiple manipulations, applied factorially to the study plants. For instance, measurements may have been made under high
vs. low
light, high
vs. low
CO2 concentration, and well-watered
vs. drought
conditions. A contributor will likely have three columns, labeled light_levels, CO2_concentration and water_treatment, which need to be merged into a single column, context. The values in context must combine the contributor's three context columns, creating unique values for each factorial treatment. For instance:
read_csv("data/dataset_id/data.csv") %>%
mutate(
context =
paste(light_levels, "light_and", CO2_concentration, "CO2_and",
water_treatment,
sep = "_"
)
)
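If the extra literal labels (light_and, CO2_and) aren't needed, an equivalent merge can be sketched with tidyr's unite, assuming the same three contributor columns:

read_csv("data/dataset_id/data.csv") %>%
  unite("context", light_levels, CO2_concentration, water_treatment,
        sep = "_", remove = FALSE) # keep the original columns alongside the new context column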
If multiple measurements were made on each individual, summarise them to one value per individual: austraits$traits does not include multiple measurements per individual, although the data are preserved in the contributor's raw data files.

When there is just a single column of values to summarise, use:
read_csv("data/dataset_id/raw/raw_data.csv") %>%
group_by(individual, `species name`, site, context, etc) %>%
summarise(individual_mean = mean(replicate)) %>%
ungroup()
(Make sure you group_by all categorical variables you want to retain, as only grouping columns will be kept.)
When you want to take the mean of a series of continuous variables, use:
read_csv("data/dataset_id/raw/raw_data.csv") %>%
group_by(individual, `species name`, site, context, etc) %>%
summarise_all(mean) %>%
ungroup()
(Categorical variables not included as grouping variables will return NA
)
When you want to choose a different function for each variable, use:
read_csv("data/dataset_id/raw/raw_data.csv") %>%
group_by(individual, `species name`, site, context, etc) %>%
summarise(
mean_buds = mean(buds_per_branch), max_height = max(`plant height`),
fire_response = first(`killed or alive`)
) %>%
ungroup()
(This allows you to retain character variables, but can be tedious with many columns. Generally use the function first for categorical variables - it simply retains the trait value in the first row of each group. In the rare case when rows in a particular grouping have different categorical values, more complex manipulations are required.)
If different traits for the same observations are in separate spreadsheets, merge them with full_join, specifying all conditions that need to be matched across spreadsheets (e.g. individual, species, site, context). Ensure the column names are identical between spreadsheets, or specify the columns that need to be matched.
read_csv("data/dataset_id/raw/data_file_1.csv") -> data_1
read_csv("data/dataset_id/raw/data_file_2.csv") -> data_2
data_1 %>% full_join(data_2, by = c("Individual", "Taxon", "Site", "Context"))
If additional observations are in a second spreadsheet, stack the spreadsheets with bind_rows. Ensure the column names for taxon name, site name, context, individual, and sample date are identical between spreadsheets. If there are data for the same traits in both spreadsheets, make sure those column headers are identical as well.
read_csv("data/dataset_id/raw/data_file_1.csv") -> data_1
read_csv("data/dataset_id/raw/data_file_2.csv") -> data_2
data_1 %>% bind_rows(data_2)
For example, joining spreadsheets of continuous and categorical data measured on the same taxa:

read_csv("data/dataset_id/raw/continuous_data.csv") -> continuous_data
read_csv("data/dataset_id/raw/categorical_data.csv") -> categorical_data
continuous_data %>% full_join(categorical_data, by = "Taxon")
If taxa are identified by codes, merge in a species key:

read_csv("data/dataset_id/raw/species_key.csv") -> species_key
read_csv("data/dataset_id/raw/data_file.csv") %>%
  left_join(species_key, by = "code")
Note that when Excel saves an .xls file as a .csv file, it only preserves the number of significant figures that are displayed on the screen. This means that if, for some reason, a column has been set to show values with a very low number of significant figures, or a column is very narrow, data quality is lost.

If the first rows of a column are blank or non-numeric, read_csv may fail to register the column as numeric. This is fixed by adding the argument guess_max:
read_csv("data/dataset_id/raw/raw_data.csv", guess_max = 10000)
This checks 10,000 rows of data before deciding on the column type. The value can be set even higher if needed.
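Alternatively, you can declare a column's type explicitly rather than relying on type guessing; a sketch assuming a numeric column named height:

read_csv("data/dataset_id/raw/raw_data.csv",
         col_types = cols(height = col_double())) # unlisted columns are still guessed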
The metadata.yml file

One way to construct the metadata.yml file is to use one of the existing files and modify yours to follow the same format. As a start, check out some examples from existing studies in AusTraits, e.g. Angevin_2010 or Wright_2009.

Note, when editing the metadata.yml file, edits should be made in a proper text editor (Microsoft Word tends to mangle the formatting). RStudio, TextMate, Sublime Text, and Visual Studio Code are all good editors.
To assist you in constructing the metadata.yml
file, we have developed functions to help fill in the different sections of the file. You then manually edit the file further to fill in missing details.
First run the following, to make the functions available
devtools::load_all()
Next, create a basic template for the metadata.yml
file for your study. Note, it requires you to have already created a file data.csv
in the folder data/your_dataset_id
.
Let’s imagine you’re entering a study called Yang_2028
current_study <- "Yang_2028"
metadata_create_template(current_study)
The function will ask a series of questions and then create a relatively empty file data/your_dataset_id/metadata.yml
. The key questions are:
- which columns contain the units and trait_name (for data in long format)
- which columns contain the species_name, site_name, and context_name

If your data.csv file does not yet have site or context columns, this information can later be added manually.
Three functions are available to help entering citation details for the source data.
The function metadata_create_template
creates a template for the primary source with default fields for a journal article, which you can then edit manually.
If you have a doi
for your study, use the function:
metadata_add_source_doi(dataset_id = current_study, doi = "doi")
and the different elements within source will automatically be generated. Double check the information added to ensure:

1. The title is in sentence case
2. Overall, the information isn't in all caps (information from a few journals is read in like this)
3. Page numbers are present and added as, for example, 123 -- 134
By default, details are added as the primary source. If multiple sources are linked to a single dataset_id, you can specify a source as secondary:

metadata_add_source_doi(dataset_id, doi, type = "secondary")

- Edit the key in the metadata.yml file to be the appropriate author_yyyy code for the secondary reference. Sequential qualifiers can be used if necessary (e.g. author_yyyy_2).
- If there are multiple secondary sources, in the metadata.yml file manually change the source's header from secondary to secondary_01 (and then secondary_02, etc.). See Richards_2008 for an example of a complex source list.

Alternatively, if you have reference details saved in a bibtex file called myref.bib, you can use the function
metadata_add_source_bibtex(dataset_id, file = "myref.bib")
(These options require the packages rcrossref and RefManageR to be installed.)
For a book, the proper format is:
source:
primary:
key: Cooper_2013
bibtype: Book
year: 2013
author: Wendy Cooper and William T. Cooper
title: Australian rainforest fruits
publisher: CSIRO Publishing
pages: 272
For an online resource, the proper format is:
source:
primary:
key: TMAG_2009
bibtype: Online
author: '{Tasmanian Herbarium}'
year: 2009
title: Flora of Tasmania Online
publisher: Tasmanian Museum & Art Gallery (Hobart)
url: http://www.tmag.tas.gov.au/floratasmania
For a thesis, the proper format is:
source:
primary:
key: Kanowski_2000
bibtype: Thesis
year: 1999
author: John Kanowski
title: Ecological determinants of the distribution and abundance of the folivorous
marsupials endemic to the rainforests of the Atherton uplands, north Queensland.
type: PhD
institution: James Cook University, Townsville
For an unpublished dataset, the proper format is:
source:
primary:
key: Ooi_2018
bibtype: Unpublished
year: 2018
author: Mark K. J. Ooi
title: "Unpublished data: Herbivory survey within Royal National Park"
Note: if you manually add information and there is a colon (:) or apostrophe (') in a reference, the text for that line must be in quotes (").
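For example (hypothetical title):

title: "Seed size and dispersal: a continental synthesis"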
The skeletal metadata.yml
file created by the function metadata_create_template
includes a template for entering details about people. Edit it manually, duplicating if details for multiple people are required.
For many studies there are changes we want to make to a dataset before the data.csv file is read into AusTraits. These most often include applying a function to transform data, a function to filter data, or a function to replace a contributor’s “measurement missing” placeholder symbol with NA
. In each case it is appropriate to leave the rawer data in data.csv
.
In each case we want to make some custom modifications to a particular dataset before the common pipeline of operations gets applied. To make this possible, the workflow allows for some custom R code to be run as a first step in the processing pipeline. That pipeline (in the function load_study
) looks like this:
data <-
read_csv(filename_data_raw, col_types = cols(), guess_max = 1e5) %>%
custom_manipulation(metadata[["config"]][["custom_R_code"]])() %>%
parse_data(dataset_id, metadata)
Note the second line. This is where the custom code gets applied, right after the file is loaded.
When writing custom R code:

- assume the code receives a single data frame, called data, and apply whatever fixes are needed
- use standard functions such as mutate, rename, etc., and otherwise avoid external packages
- make the helper functions available by first running devtools::load_all()
- write the code as a single pipeline (no ; at the end of each statement)

Most datasets from herbaria record flowering_time
and fruiting_time
as a span of months, while AusTraits codes these variables as a sequence of 12 N’s and Y’s for the 12 months. A series of functions make this conversion in custom_R_code. These include:
- format_flowering_months (create flowering times from start to end pair)
- convert_month_range_string_to_binary (converts flowering and fruiting month ranges to 12 element character strings of binary data)
- convert_month_range_vec_to_binary (convert vectors of month range to 12 element character strings of binary data)
- collapse_multirow_phenology_data_to_binary_vec (converts multirow phenology data to a 12 digit binary string)

Many datasets from herbaria record traits like leaf_length
, leaf_width
, seed_length
, etc. as a range (e.g. 2-8
). The function separate_range
separates this data into a pair of columns with minimum
and maximum
values.
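If you need finer control, the same split can be sketched with tidyr's separate, assuming a leaf_length column holding ranges such as 2-8:

data %>%
  separate(leaf_length, into = c("leaf_length_min", "leaf_length_max"),
           sep = "-", convert = TRUE, fill = "right") # single values fill the minimum; convert makes the new columns numeric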
Duplicate values within a study need to be filtered out.
If a species-level measurement has been entered for all within-site replicates, you need to filter out the duplicates. This is true for both numeric and categorical values.
data %>%
group_by(Species) %>%
mutate_at(
vars(leaf_percentN, `plant growth form`),
~ replace(.x, duplicated(.x), NA)
) %>%
ungroup()
Note: You would use group_by(Species, Site)
if there are unique values at the species x site level.
Values that were sourced from a different study need to be filtered out. See Duplicates between studies below - functions to automate this process are in progress.
Author has represented missing data values with a symbol, such as 0
:
data %>% mutate_at(vars(`height (cm)`, `leaf area (mm2)`), ~ na_if(., 0))
When some entries in a column also provide values for a second trait in AusTraits, those data values can be duplicated in a second temporary column. In the example below, some data in the contributor's growth_form column also apply to the trait parasitic in AusTraits:
data %>% mutate(parasitic = ifelse(`growth form` == "parasitic herb", "parasitic", NA))
When some entries in a column are instead values for a second trait in AusTraits, those data values can be moved to a second column (second trait), using the function move_values_to_new_trait. In the example below, some data in the contributor's growth_form column only apply to the trait parasitic in AusTraits:
data %>% move_values_to_new_trait(original_trait_name = "growth form", new_trait_name = "parasitic", original_values = "parasitic", value_to_use = "parasitic")
Sometimes the data.csv file includes raw data that you want to manipulate into a trait, or the contributor presents the data in a different formulation than AusTraits uses:
data %>% mutate(root_mass_fraction = `root mass` / (`root mass` + `shoot mass`))
Custom R code can also be used for adding sites or manipulating site names. This is only recommended for studies with a single (or few) sites, where manually adding the site data to the metadata.yml file is fast, since it precludes automatically propagating site data into metadata (see Adding site details). An example (Blackman_2010 dataset):
data %>%
mutate(
site = ifelse(site == "Mt Field" & habitat == "Montane rainforest", "Mt Field_wet", site),
site = ifelse(site == "Mt Field" & habitat == "Dry sclerophyll", "Mt Field_dry", site)
)
Custom R code can be used to convert sampling dates supplied into the yyyy-mm-dd format, or to add a date column. The examples below use functions from the lubridate package.

Converting from any mdy format to yyyy-mm-dd (e.g. Dec 3 2015 to 2015-12-03):
data %>% mutate(Date = Date %>% mdy())
Converting from any dmy
format to yyyy-mm-dd
(e.g. 3-12-2015
to 2015-12-03
)
data %>% mutate(Date = Date %>% dmy())
Converting from mmm-yyyy
(string) format to yyyy-mm
(e.g. Dec 2015
to 2015-12
)
data %>% mutate(Date = parse_date_time(Date, orders = "my") %>% format.Date("%Y-%m"))
Converting from mdy
format to yyyy-mm
(e.g. Excel has reinterpreted the data as full dates 12-01-2015
but the resolution should be “month” 2015-12
)
data %>% mutate(Date = parse_date_time(Date, orders = "mdy") %>% format.Date("%Y-%m"))
Add a single date in yyyy-mm
format to an entire dataset
data %>% mutate(Date = "2015-12")
After you’ve added the custom R code to a file, check that it has completed the intended data frame manipulation:
metadata_check_custom_R_code("Blackman_2010")
Begin by automatically adding all traits to your skeletal metadata.yml
file:
metadata_add_traits(current_study)
You will be asked to indicate the columns you wish to keep as distinct traits. Include all columns with trait data.
This automatically propagates each trait selected into metadata.yml
as follows:
- var_in: leaf area (mm2)
unit_in: .na
trait_name: .na
value_type: .na
replicates: .na
methods: .na
The trait details then need to be filled in manually.
units: fill in the units specified by the author - such as mm2. If you’re uncertain about the syntax/format used for some more complex units, look through the definitions file (config/definitions.yml
) or the file showing unit conversions (config/unit_conversions.csv
). For categorical variables, leave this as .na
.
trait_name: This is the appropriate trait name from config/definitions.yml
. If no appropriate trait exists in AusTraits, a new trait can often be added - just ensure it is a trait
where data will be comparable across studies and has been measured for a fair number (~>50) species. For currently unsupported traits, we leave this as .na
but then fill in the rest of the data and flag this study as having a potential new trait. Then in the future, when this trait is added to the definitions file, the data can be read into AusTraits by simply replacing the .na
with a trait name.
value_type: See the bottom of config/definitions.yml
for a list of accepted value types.
replicates: Fill in with the appropriate value. For categorical variables, leave this as .na
.
methods: This information can usually be copied verbatim from a manuscript. In general, methods sections extracted from PDFs include "special characters" (non-UTF-8 characters). Non-English alphabet characters are recognised (e.g. é, ö) and should remain unchanged. Other characters will be re-formatted during the study input process, so double check that degree symbols (º), en-dashes (–), em-dashes (—), and curly quotes (‘,’,“,”) have been maintained or reformatted with a suitable alternative. Greek letters and some other characters are replaced with their Unicode equivalent (e.g. <U+03A8> replaces Psi (Ψ)); for these it is best to replace the symbol with an interpretable English-character equivalent.
Site data includes site names, latitude/longitude coordinates, verbal site descriptions, and any additional abiotic/biotic site variables provided by the contributor (or in the accompanying manuscript). For studies with more than a few sites, it is most efficient to create a table of this data that is automatically read into the metadata.yml
file.
Site names must be identical (including syntax, case) to those in data.csv
Columns headers for latitude and longitude data must read latitude (deg)
and longitude (deg)
Latitude and longitude must be in decimal degrees (e.g. -46.5832). There are many online converters to convert from degrees-minutes-seconds format or UTM. Or use the following formula (see the sketch after this list): decimal_degrees = degrees + (minutes/60) + (seconds/3600)

If there is a column with a general vegetation description (e.g. rainforest, coastal heath), it should be titled description.
There are no rules for the column headers for other site data, but in general have headers that include a general description of the measurement, with the units in parentheses (e.g. MAP (mm)
or total soil N (%)
)
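A sketch of the formula above in R, assuming hypothetical lat_deg, lat_min, and lat_sec columns (Australian latitudes are south, hence the negation; negate longitudes similarly only if west):

site_data %>%
  mutate(`latitude (deg)` = -1 * (lat_deg + lat_min / 60 + lat_sec / 3600))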
A few contributors provide a standalone file of all site data. Otherwise, the following sequence works well:
read_csv("data/dataset_id/data.csv") %>%
distinct(site, .keep_all = TRUE) %>% # the argument `.keep_all` ensures columns aren't dropped
select(site, rainfall, lat, lon) %>% # list of relevant columns to keep
rename(`latitude (deg)` = lat, `longitude (deg)` = lon) %>% # rename columns to how you want them to appear in the metadata file. Faster to do it once here than repeatedly in the metadata file
write_csv("data/dataset_id/raw/site_data.csv")
Open the spreadsheet in Excel (or any editor of your choice) and manually add any additional data from the manuscript. Save as .csv file.
Open in R
read_csv("data/dataset_id/raw/site_data.csv") -> site_data
As an example of what the site table should look like:
#> # A tibble: 2 × 6
#> site_name description `elevation (m)` `latitude (deg)` `longitude (deg)` `rainfall (mm)`
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Atherton Tropical rain forest vegetation. 800 -17.1 146. 2000
#> 2 Cape Tribulation Complex mesophyll vine forest in tropical rain forest. 25 -16.1 145. 3500
Then add the site data to metadata.yml:
metadata_add_sites(current_study, site_data)
You are first prompted to identify the column with the site name and then to list all columns that contain site data. This automatically fills in the site component on the metadata file.
Context data includes the context name (see above), type (usually experiment
or field
), description, and additional variables (columns) for some studies. This data needs to be compiled into a table (or manually added into metadata for simple studies).
Recall the earlier example, where measurements were made under high vs. low light, high vs. low CO2 concentration, and well-watered vs. drought conditions. To create a single column with the context name, we merged the contributor's three columns, light_levels, CO2_concentration and water_treatment, into a single column, context, with unique values for each factorial treatment.

Next, the meaning of the overall context names and their various components needs to be defined. For instance:
read_csv("data/dataset_id/data.csv") %>%
distinct(context, .keep_all = TRUE) %>%
select(context, light_levels, CO2_concentration, water_treatment) %>%
mutate(
light_levels = gsub("high", "1800 PAR", light_levels),
light_levels = gsub("low", "1200 PAR", light_levels)
) %>%
# add lines for other contexts
mutate(type = "experiment") %>%
write_csv("data/dataset_id/raw/context_data.csv")
Then in Excel you'd add a column named description and write out a description for each factorial context, such as "Plants were grown at high light (1800 PAR), low CO2 (300 ppm), and were well-watered". Save as a .csv file.
At the end, reopen in R
read_csv("data/dataset_id/raw/context_data.csv") -> context_data
As an example of what the context table should look like:
#> # A tibble: 13 × 3
#> context_name description type
#> <chr> <chr> <chr>
#> 1 April samples collected in April, the start of the dry season field_harsh
#> 2 August samples collected in August, during the dry season field_harsh
#> 3 December samples collected in December, at the start of the wet season field_favourable
#> 4 dry samples collected during the dry season field_harsh
#> 5 February samples collected in February, during the wet (growing) season field_favourable
#> 6 January samples collected in January, during the wet (growing) season field_favourable
#> 7 July samples collected in July, during the dry season field_harsh
#> 8 June samples collected in June, during the dry season field_harsh
#> 9 March samples collected in March, toward the end of the wet (growing) season field_favourable
#> 10 May samples collected in May, during the dry season field_harsh
#> 11 November samples collected in November, toward the end of the dry season field_harsh
#> 12 October samples collected in October, toward the end of the dry season field_harsh
#> 13 September samples collected in September, during the dry season field_harsh
Then add the context data to metadata.yml:
metadata_add_contexts(current_study, context_data)
You are first prompted to identify the column with the context name and then to list all columns that contain context data. This automatically fills in the context component on the metadata file.
It is very unlikely that a contributor will use entirely identical categorical trait values to those in the definitions.yml
file. You need to add substitutions for those that do not exactly align with the wording and syntax supported by AusTraits. Combinations of multiple trait values are allowed: simply list them, space delimited (e.g. shrub tree for a species whose growth form includes both).
Single substitutions can be added by running:
metadata_add_substitution(current_study, "trait_name", "find", "replace")
where trait_name
is the AusTraits defined trait name, find
is the trait value used in the data.csv file and replace
is the trait value supported by AusTraits.
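For example (hypothetical values):

metadata_add_substitution(current_study, "plant_growth_form", "Tree", "tree")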
If you have many substitutions to add, the following may be more efficient:
Add a single substitution via the function and then copy and paste the lines many times in the metadata.yml file, changing the relevant fields
Create a spreadsheet with a list of all trait_name
by trait_value
combinations requiring substitutions. The spreadsheet would have four columns with headers dataset_id
, trait_name
, find
and replace
. This spreadsheet can be read directly into the metadata.yml
file. This is described below under Adding many substitutions.
There are a few additional metadata fields that need to be filled in manually:
Under config
> variable_match
you can add another subheading date:
where you’d report the column in the data.csv file (or the column defined through custom R code) containing sampling dates.
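A sketch of what this looks like in metadata.yml, assuming the dates sit in a column named collection_date:

config:
  variable_match:
    date: collection_date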
Toward the top of the metadata file is a section titled dataset:. This needs to be filled in manually.
- year_collected_start: Year data sampling began.
- year_collected_end: Year data sampling was completed.
- description: 1-2 sentence description of the study’s goals. The abstract of a manuscript usually includes some good sentences/phrases to borrow from.
- collection_type: Possible values include: field, botanical collection, field_experiment, glasshouse, literature
- sample_age_class: Possible values include: adult, sapling, seedling, unknown
- sampling_strategy: Often a quite long description of the sampling strategy, extracted verbatim from a manuscript.
- original_file: The name of the file initially submitted to AusTraits and archived in a Google Drive folder and usually in the studies folder, in a subfolder named raw
.
- notes: Notes about the study and processing of data, especially if there were complications or if some data is suspected duplicates with another study and was filtered out.
This section is to list specific trait values or taxon names that are in the data.csv
file but should be excluded from AusTraits.
It includes three elements:
- variable: A variable from the traits table, typically taxon_name
, site_name
or context_name
- find: Value of variable to remove
- reason: Records why the data was removed, e.g. exotic
Multiple, comma-delimited values can be added under find
.
For example, in Munroe_2019:
exclude_observations:
- variable: taxon_name
  find: Campylopus introflexus, Dicranoloma menziesii, Philonotis tenuis, Polytrichastrum
    alpinum, Polytrichum juniperinum, Sphagnum cristatum
  reason: moss (E Wenk, 2020.06.18)
- variable: taxon_name
  find: Xanthoparmelia semiviridis
  reason: lichen (E Wenk, 2020.06.18)
At the very bottom of the metadata.yml
file is a section titled questions
. This is a location to:
- list any questions for the data contributor under contributor: (indented once), with additional questions as question2:, etc.
- flag any traits in the dataset that are not yet traits supported by AusTraits. Use the following syntax, indented once: additional_traits:, followed by a list of traits.
, followed by a list of traits.Before starting the quality checks, it is helpful to assign a variable, current_study
:
current_study <- "Wright_2001"
This lets you keep a standard list of tests to run for each study; you just reassign a new dataset_id to current_study.
It is best to run tests and fix formatting first.
The "clear formatting" code below simply reads and re-writes the yaml file, the same process that is repeated when running functions that automatically add substitutions or check taxonomy. Running it first ensures that formatting issues introduced (or fixed) during the read/write process are identified and solved first.

For instance, the write_metadata function inserts line breaks every 80 characters and reworks other line breaks (except in custom_R_code). It also reformats special characters in the text, substituting in its accepted format for degree symbols, en-dashes, em-dashes, and quotes, and substituting in Unicode codes for more obscure symbols.
f <- file.path("data", current_study, "metadata.yml")
read_metadata(f) %>% write_metadata(f)
You begin by running some automated tests to ensure the dataset meets required set up. The tests run through a collection of pre-specified checks on the files for each study. The output alerts you to possible issues needing to be fixed, by comparing the data in the files with expected structure and allowed values, as specified in the definitions.
Certain special characters may show up as errors and need to be manually adjusted in the metadata.yml file.

The tests also identify mismatches between the site names in the data.csv file vs. the metadata.yml file (same for contexts), unsupported trait names, etc.
To run the tests, the variable dataset_ids
must be defined in the global namespace, containing a vector of ids to check. For example
# load relevant functions
devtools::load_all()
# Tests run test on one study
dataset_ids <- "Bragg_2002"
austraits_run_tests()
# Tests run test on one study using `current_study` variable
dataset_ids <- current_study
austraits_run_tests()
# Tests run test on all studies
dataset_ids <- dir("data")
austraits_run_tests()
Fix as many errors as you can and then rerun austraits_run_tests()
repeatedly until no errors remain.
See below for suggestions to implement large numbers of trait value substitutions.
Now incorporate the new study into AusTraits:
austraits_rebuild_remake_setup()
austraits <- remake::make("austraits")
AusTraits automatically excludes data for a number of reasons. These are available in the frame excluded_data
.
When you are finished running quality checks, no data should be excluded due to Missing unit conversion and Unsupported trait.
A few values may be legitimately excluded due to other errors, but check each entry.
The best way to view excluded data for a study is:
austraits$excluded_data %>%
filter(
dataset_id == current_study,
error != "Missing value",
error != "Observation excluded in metadata"
) %>%
View()
Or, if you also want to see missing values and intentionally excluded observations:
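A sketch, dropping the two error filters from the query above:

austraits$excluded_data %>%
  filter(dataset_id == current_study) %>%
  View()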
Possible reasons for excluding trait value include:
Missing species name: Species name missing from data.csv file for a given row of data. This usually occurs when there are stray characters in the data.csv file below the data – delete these rows.
Missing value: Value was missing. Do nothing – these almost always represent honestly missing data. Large numbers of missing values are often present if two data frames with values for different traits are connected together with bind_rows()
or if duplicate values have been omitted using custom R code.
Missing unit conversion: Value was present but appropriate unit conversion was missing -> you need to add it to the file config/unit_conversions.csv
. Add additional conversions near similar unit conversions already in the file for easier searching in the future.
Observation excluded in metadata: Specific values, usually certain taxon names can be excluded in the metadata. This is generally used when a study includes a number of non-native and non-naturalised species that need to be excluded. These should be intentional exclusions, as they have been added by you.
Time contains non-number: Indicates a problem with the value entered into the traits flowering_time
and fruiting_time
. (Note to AusTraits custodians: This error should no longer appear - will retain for now as a placeholder.)
Unsupported trait: trait_name
not listed in config/definitions.yml
, under traits
. Double check you have used the correct spelling/exact syntax for the trait_name
, adding a new trait
to the definitions
file if appropriate. If there is a trait that is currently unsupported by AusTraits, leave trait_name: .na
- do not fill in an arbitrary name. (Note, if you have data in long format, all traits are automatically added to the list of traits in metadata.yml
and even if you indicate trait_name: .na
these appear as unsupported trait
in the excluded_data
table.)
Unsupported trait value: This error, referencing categorical traits, means the value
for a trait is not included in the list of supported trait values for that trait in config/definitions.yml
. See Adding many substitutions
if these are many trait values requiring substitutions. If appropriate add another trait value to the definitions file.
Value does not convert to numeric: Is there a strange character in the file preventing easy conversion? This error is rare and generally justified.
Value out of allowable range: This error, referencing numeric traits, means the trait value, after unit conversions, falls outside of the allowable range specified for that trait in config/definitions.yml. Sometimes the AusTraits range is too narrow and other times the author's value is truly an outlier that should be excluded. Look closely at these and adjust the range in config/definitions.yml if justified. Generally, don't change the range until you've created a report for the study and confirmed that the general cloud of data aligns with other studies as expected. Most frequently it is the units or unit conversion that is incorrect.
You can also ask how many of each error type are present for a study:
austraits$excluded_data %>%
filter(dataset_id == "Cheal_2017") %>%
pull(error) %>%
table()
#> .
#> Missing value Observation excluded in metadata Unsupported trait value
#> 41389 12 264
Or produce a table of error type by trait:
austraits$excluded_data %>%
filter(
dataset_id == "Cheal_2017",
error != "Missing value"
) %>%
select(trait_name, error) %>%
table()
#> error
#> trait_name Observation excluded in metadata Unsupported trait value
#> fire_and_establishing 2 1
#> fire_cued_seeding 0 47
#> fire_response 2 55
#> fire_response_juvenile 2 55
#> fire_response_on_maturity 2 55
#> life_history 1 0
#> lifespan 1 1
#> reproductive_maturity 2 3
#> seed_longevity 0 47
Note, most studies have no excluded data. This study is the extreme example!
For categorical traits, if you want to create a list of all values that require substitutions:
austraits$excluded_data %>%
filter(
dataset_id == current_study,
error == "Unsupported trait value"
) %>%
distinct(dataset_id, trait_name, value) %>%
rename(find = value) %>%
select(-dataset_id) %>%
write_csv("data/dataset_id/raw/substitutions_required.csv")
For studies with a small number of substitutions, add them individually, using:
metadata_add_substitution(dataset_id, trait_name, find, replace)
For studies with large number of substitutions required, you can add an additional column to this table, replace
, and fill in all the correct trait values. Then read the list of substitutions directly into the metadata file:
substitutions_to_add <-
read_csv("data/dataset_id/raw/substitutions_required_after_editing.csv")
metadata_add_substitutions_list(dataset_id, substitutions_to_add)
The species names used in a given dataset must be aligned with the currently accepted APC/APNI taxonomy used in AusTraits.
The AusTraits config files include the file taxon_list.csv
, a list of all taxa in AusTraits with names recognised by APC/APNI, including outdated/obsolete/misapplied taxonomic names (e.g. nomenclatural synonym, taxonomic synonym). Outdated names are automatically aligned with currently recognised taxonomy.
In addition, the folder config/NSL includes complete lists of taxa recognised by APC/APNI, including names not yet represented in AusTraits.
To check if there are any taxon names in the study unrecognised by APC/APNI, use:
metadata_check_taxa(dataset_id = current_study)
This function aligns slightly misspelled names through fuzzy matching (default is up to 3 characters different).
Additional arguments can be added to the function:
metadata_check_taxa(dataset_id, max_distance_abs = 2, max_distance_rel = 0.2, try_outside_guesses = FALSE)
- max_distance_abs is the maximum number of different characters allowed with fuzzy matching.
- max_distance_rel is the maximum proportion of characters that are allowed to be different with fuzzy matching.
- try_outside_guesses offers a menu of matching suggestions for each species. Set this to FALSE for the first run of the function; it is best to start with automated matching and use this option for the small number of unmatched species at the end.

Run the function once to align the easily re-aligned names.
Messages output during the checking include:
All taxa are already known: All taxa in the study being input are exact matches to names in taxon_list.csv
. You’re done.
Automatic alignment with name in APC/APNI: The taxon name is on the complete APC/APNI list, but not yet in AusTraits.
species are not yet matched, checking for close matches in APC & APNI: The matching algorithm is using fuzzy matching to try and align the taxon name in the dataset. If a match is found, you will also receive a second message for that taxon, indicating the match and its source (i.e. APC list (accepted))
Skipping - not assessing anything ending in sp.
Note, genus is in APC: Names formatted as genus sp.
or genus sp. [characters]
are recognised as members of a given genus. For instance, the name in the data file is Acacia sp.
, and the algorithm has successfully matched the genus, but not the species.
Taxa not found. Note, genus is in APC: The algorithm has successfully matched the genus of the submitted name, but cannot find a suitable match at the species level.
Taxon alignments made by metadata_check_taxa are added to the taxonomic_updates: section of the metadata.yml file. Check all substitutions added to the metadata file to confirm they make sense.
- find: Byrsonima crassifolia
  replace: Boronia crassifolia
  reason: Automatic alignment with name in APC list (accepted) (2021-09-16)
- find: Hydrocotyle vulgaris
  replace: Hydrocotyle rivularis
  reason: Automatic alignment with name in APC list (accepted) (2021-09-16)
If the species is obviously a misspelled Australian species but was mismatched, manually enter the correct alignment.
Run the function a second time to see the (hopefully) short list of names that couldn’t be easily matched and require manual matches or substitutions.
For the remaining names, rerun the function with try_outside_guesses set to TRUE:
metadata_check_taxa(dataset_id, max_distance_abs = 3, max_distance_rel = 0.2, try_outside_guesses = TRUE)
Species that can't be aligned can be excluded, as in the exclude_observations section of eFLOWER_2021:

exclude_observations:
- variable: taxon_name
  find: Antirrhinum majus, Atropa belladonna, Averrhoa carambola, Byrsonima crassifolia,
    Carpodetus serratus, Citronella suaveolens, Cornus mas, Cussonia spicata, Echinops
    exaltatus, Elaeis guineensis, Garrya elliptica, Gymnosporia senegalensis, Homalanthus
    populneus, Myoporum mauritianum, Najas minor, Pimelodendron zoanthogyne, Rourea
    minor, Siphonodon celastrineus, Trachycarpus fortunei
  reason: These are 'unplaced' species that likely show up in the APNI list because
    they are horticultural species, but they are not widespread naturalised species
Leave unchanged any names formatted as genus sp. or genus sp. [characters]. The matching algorithm has already recognised them as members of the given genus.

Remaining names can be aligned using the metadata_add_taxonomic_change function:
metadata_add_taxonomic_change("study", "find", "replace", "reason")
or manually typing the information into the metadata.yml
file:
- find: Senna sturtii
  replace: Senna artemisioides subsp. x sturtii
  reason: Change spelling to align with APC or ALA species lists (Sam Andrew, 2018-02-07)
When possible, align unknown species to a name in the format genus sp. [characters]. For instance, Acacia sp. or Acacia sp. [long leaf], not long leaf Acacia species:
- find: Celmisia 'pulchella'
replace: Celmisia sp. Pulchella (M.Gray & C.Totterdell 7079)
reason: Alignment with known name in APC list (accepted) (Elizabeth Wenk, 2020-06-30)
See the taxon lists in config/NSL for examples, or the extensive list of substitutions in White_2020.

For a study requiring a large number of manual taxonomic changes, you can create a data frame with headers dataset_id
, find
(original name), replace
(APC/APNI accepted name), and reason
to manually fill in and read into the taxonomic_updates
section:
Begin by generating a list of taxa that haven’t been matched:
recognised_taxa <-
  austraits$taxa %>%
  filter(source %in% c("APC", "APNI"))

austraits$traits %>%
  filter(dataset_id == current_study) %>%
  distinct(taxon_name) %>%
  anti_join(recognised_taxa, by = "taxon_name") %>%
  rename(find = taxon_name) %>%
  write_csv("data/dataset_id/raw/unmatched_taxa.csv")
Add columns for replace
and reason
, fill in the missing values and save as csv file. Then
taxon_substitutions <-
read_csv("data/dataset_id/raw/unmatched_taxa_edited.csv")
metadata_add_taxonomic_changes_list(current_study, taxon_substitutions)
Finally, and very importantly, update the list of taxa in taxon_list.csv by running:
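austraits_rebuild_taxon_list() # the same function used in the final rebuild steps below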
AusTraits strives to have no duplicate entries for numeric (continuous) trait measurements.
When you receive/solicit a dataset, ask the data contributor if all data submitted were collected for the specific study and if they suspect other studies from their lab/colleagues may also have contributed any of this data.
In addition, there are tests to check for duplicates within and across dataset_ids.
To check for duplicates:
austraits_deduped <- remove_suspected_duplicates(austraits)
duplicates_for_dataset_id <-
austraits_deduped$excluded_data %>%
filter(
dataset_id == current_study,
error != "Missing value"
)
View(duplicates_for_dataset_id)
First sort duplicates_for_dataset_id by the column error and scan for duplicates within the study (these will be entries under error that begin with the same dataset_id as the dataset being processed).
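One way to do this:

duplicates_for_dataset_id %>%
  arrange(error) %>%
  View()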
For legitimate duplicates, do nothing. For instance, if %N has been measured on 50 replicates of a species and is reported to the nearest 0.01% it is quite likely there will be a few duplicate values within the study.
If a species-level measurement has been entered for all within-site replicates, you need to filter out the duplicates. This is true for both numeric and categorical values. Enter the following code as custom_R_code
in the dataset’s metadata file:
data %>%
group_by(Species) %>%
mutate_at(
vars(leaf_percentN, `plant growth form`),
~ replace(.x, duplicated(.x), NA)
) %>%
ungroup()
Note: using custom R code, instead of filtering the values in the data.csv file itself, ensures the relevant trait values are still associated with each line of data in the data.csv file but are only read into AusTraits a single time. Note: you would use group_by(Species, Site) if there are unique values at the species x site level.
AusTraits does not attempt to filter out duplicates in categorical traits between studies. The commonly duplicated traits like life_form
, plant_growth_form
, photosynthetic_pathway
, fire_response
, etc. are legitimately duplicated and if the occasional study reported a different plant_growth_form
or fire_response
it would be important to have documented that one trait value was much more common than another. Such categorical trait values may have been sourced from a reference material or measured/identified by this research team.
Identifying duplicates in numeric traits between studies can be difficult, but is essential - we attempt to filter out all duplicate occurrences of the same measurement. Some common patterns of duplication include:
For a single trait, if there are a large number of values duplicated in a specific other dataset_id (i.e. the error
repeatedly starts with the same dataset_id
), be suspicious. Before contacting the author, check the metadata for the two datasets, especially authors and study locations, to see if it is likely these are data values that have been jointly collected and shared across studies. Similar site names/locations, identical university affiliations, or similar lists of traits being measured are good clues.
plant_height
, leaf_length
, leaf_width
, seed_length
, seed_width
and seed_mass
are the numeric variables that are most frequently sourced from reference material (e.g. floras, herbarium collections, reference books, Kew seed database, etc.)
The following datasets are flagged in AusTraits as reference studies and are the source of most duplicates for the variables listed above: Kew_2019_1, Kew_2019_2, Kew_2019_3, Kew_2019_4, Kew_2019_5, Kew_2019_6, ANBG_2019, GrassBase_2014, CPBR_2002, NTH_2014, RBGK_2014, NHNSW_2016, RBGSYD_2014_2, RBGSYD_2014, TMAG_2009, WAH_1998, WAH_2016, Brock_1993, Barlow_1981, Hyland_2003, Cooper_2013.
Data from these studies are assumed to be the source, and the other study with the value is assumed to have sourced it from the reference study. We recognise this is not always accurate, especially for compilations within Kew_2019_1, Kew's seed mass database, and have already filtered certain contributors' data from Kew_2019_1.
Data for wood_density
is also often sourced from other studies, most commonly Ilic_2000
or Zanne_2009
.
Data from a number of studies from Leishman
and Wright
have been extensively shared within the trait ecology community, especially through TRY.
If the dataset you are processing has a number of numeric trait duplicates that follow one of the patterns of duplication
listed, the duplicates should be filtered out. Any other data explicitly indicated in the manuscript as sourced should also be filtered out. Most difficult are studies that have partially sourced data, often from many small studies, and partially collected new data, but not identified the source of each value.
Filtering duplicate data is a three-step process. In brief:
1. Filter duplicates_for_dataset_id to remove rows that you believe are legitimate duplicates, including duplicate values due to replicate measurements within a single study and stray duplicates across studies that are likely true, incidental duplicate values. Carefully consider which datasets and traits to include/exclude from the filter.
2. Add columns to data.csv, identifying certain trait values as duplicates.
3. Add custom R code that filters out the identified duplicates when the study is merged into AusTraits.

As an example:
# Note, this code will be replaced by a function in the future.
duplicates_to_filter <-
duplicates_for_dataset_id %>%
mutate(
dataset_with_duplicate =
error %>%
gsub("Duplicate of ", "", .) %>%
gsub("[[:alnum:]]$", "", .) %>%
gsub("[[:punct:]]$", "", .)
) %>%
filter(dataset_with_duplicate %in% c("Ilic_2000", "Zanne_2009", "Kew_2019_1", "Barlow_1981", "NTH_2014")) %>%
filter(trait_name %in% c("wood_density", "seed_mass", "leaf_length", "leaf_width"))
Next, add columns to data.csv that identify specific values as duplicates:
# Note, this code will be replaced by a function in the future.
wood_density_duplicates <-
duplicates_to_filter %>%
filter(trait_name == "wood_density") %>%
select(error, original_name) %>%
rename(wood_density_duplicate = error)
seed_mass_duplicates <-
duplicates_to_filter %>%
filter(trait_name == "seed_mass") %>%
select(error, original_name) %>%
rename(seed_mass_duplicate = error)
leaf_width_min_duplicates <-
duplicates_to_filter %>%
filter(trait_name == "leaf_width", value_type == "expert_min") %>%
select(error, original_name) %>%
rename(leaf_width_min_duplicate = error)
leaf_width_max_duplicates <-
duplicates_to_filter %>%
filter(trait_name == "leaf_width", value_type == "expert_max") %>%
select(error, original_name) %>%
rename(leaf_width_max_duplicate = error)
leaf_length_min_duplicates <-
duplicates_to_filter %>%
filter(trait_name == "leaf_length", value_type == "expert_min") %>%
select(error, original_name) %>%
rename(leaf_length_min_duplicate = error)
leaf_length_max_duplicates <-
duplicates_to_filter %>%
filter(trait_name == "leaf_length", value_type == "expert_max") %>%
select(error, original_name) %>%
rename(leaf_length_max_duplicate = error)
read_csv("data/dataset_id/data.csv") %>%
left_join(wood_density_duplicates, by = c("column_with_species_name" = "original_name")) %>%
left_join(seed_mass_duplicates, by = c("column_with_species_name" = "original_name")) %>%
left_join(leaf_width_min_duplicates, by = c("column_with_species_name" = "original_name")) %>%
left_join(leaf_width_max_duplicates, by = c("column_with_species_name" = "original_name")) %>%
left_join(leaf_length_min_duplicates, by = c("column_with_species_name" = "original_name")) %>%
left_join(leaf_length_max_duplicates, by = c("column_with_species_name" = "original_name")) %>%
write_csv("data/dataset_id/data.csv")
Finally, add custom R code that converts values flagged as duplicates to NA as the dataset is read into AusTraits:
data %>%
mutate(
`wood density` = ifelse(is.na(wood_density_duplicate), `wood density`, NA),
`seed mass (mg)` = ifelse(is.na(seed_mass_duplicate), `seed mass (mg)`, NA),
`leaf width minimum (mm)` = ifelse(is.na(leaf_width_min_duplicate), `leaf width minimum (mm)`, NA),
`leaf width maximum (mm)` = ifelse(is.na(leaf_width_max_duplicate), `leaf width maximum (mm)`, NA),
`leaf length minimum (mm)` = ifelse(is.na(leaf_length_min_duplicate), `leaf length minimum (mm)`, NA),
`leaf length maximum (mm)` = ifelse(is.na(leaf_length_max_duplicate), `leaf length maximum (mm)`, NA)
)
A final quality check is generating a report on the data in each study.
Rebuild the taxon list and AusTraits to ensure other changes made during the quality checks are implemented. Then build the study report:
austraits_rebuild_taxon_list()
austraits <- remake::make("austraits")
build_study_report(current_study)
Check the study report to ensure the data align with expectations. If necessary, cycle back through earlier steps to fix any errors, rebuilding the study report as necessary.
At the very end, re-clear formatting, re-run tests, rebuild AusTraits, rebuild report.
If you’re uncertain, also recheck excluded data and duplicates before these final steps.
f <- file.path("data", current_study, "metadata.yml")
read_metadata(f) %>% write_metadata(f)
dataset_ids <- current_study
austraits_run_tests()
austraits <- remake::make("austraits")
build_study_report(current_study, overwrite = TRUE)
To generate a report for a collection of studies:
build_study_reports(c("Falster_2005_1", "Wright_2002"), overwrite = TRUE)
Or for all studies:
build_study_reports(overwrite = TRUE)
Add the argument overwrite=TRUE
if you already have a copy of a specific report stored in your computer and want to replace it with a newer version.
(Reports are written in Rmarkdown and generated via the knitr package. The template is stored in scripts/report_study.html
).
By far our preferred way of contributing is for you to add the relevant files directly into the repository and then send a pull request with your input.
Before you make a substantial pull request, you should always file an issue and make sure someone from the team agrees that it’s worth pursuing the problem. If you’ve found a bug, create an associated issue and illustrate the bug with a minimal reprex illustrating the issue.
If this is not possible, you could email the relevant files (see above) to the AusTraits email: austraits.database@gmail.com.
There are multiple ways to merge a pull request, including using GitHub's built-in options for merging and squashing. When merging a PR, we ideally want the work recorded as a single commit, attributed to the original author.

There are two ways to do this. For both you need to be an approved maintainer.
You can merge in your own PR after you’ve had someone else review it.
When merging in someone else’s PR, the built-in options aren’t ideal, as they either take all of the commits on a branch (ugh, messy), OR make the commit under the name of the person merging the request.
The workflow below describes how to merge a pull request from the command line, with a single commit & attributing the work to the original author. Lets assume a branch of name Smith_1995
.
First from the master branch in the repo, run the following:
git merge --squash origin/Smith_1995
Then, in R, rebuild AusTraits to confirm the merged changes build cleanly.
Now back in the terminal
git add .
git commit
Add a commit message, referencing relevant pull requests and issues, e.g.
Smith_1995: Import new data
For #224, closes #286
And finally, amend the commit author, to reference the person who did all the work!
git commit --amend --author "XXX <XXX@gmail.com>"
Informative commit messages are ideal. Where possible, these should reference the issue being addressed. They should clearly describe the work done and value added to AusTraits in a few, clear, bulleted points.
Releases of the dataset are snapshots that are archived and available for use.
We use semantic versioning to label our versions. As discussed in Falster et al 2019, semantic versioning can apply to datasets as well as code.
The version number will have 3 components for actual releases, and 4 for development versions. The structure is major.minor.patch.dev
, where dev
is at least 9000. The dev
component provides a visual signal that this is a development version. So, if the current version is 0.9.1.9000, the next release will be 0.9.2, 0.10.0 or 1.0.0.
Our approach to incrementing version numbers is:

- major: increment when you make changes to the structure that are likely incompatible with any code written to work with previous versions.
- minor: increment to communicate any changes to the structure that are likely to be compatible with any code written to work with the previous versions (i.e., allows code to run without error). Such changes might involve adding new data within the existing structure, so that the previous dataset version exists as a subset of the new version. For tabular data, this includes adding columns or rows. On the other hand, removing data should constitute a major version because records previously relied on may no longer exist.
- patch: increment to communicate correction of errors in the actual data, without any changes to the structure. Such changes are unlikely to break or change analyses written with the previous version in a substantial way.

Figure: Semantic versioning communicates to users the types of changes that have occurred between successive versions of an evolving dataset, using a tri-digit label where increments in a number indicate major, minor, and patch-level changes, respectively. From Falster et al 2019 (CC-BY).
The process of making a release is as follows. Note that corresponding releases and versions are needed in both austraits
and austraits.build
:
1. Update the version number in the DESCRIPTION file.
2. Compile austraits.build.
3. Update the documentation.
4. Commit and push to GitHub.
5. Make a release on GitHub, adding the version number.
6. Prepare for the next version by updating version numbers.
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. This is a common format for storing tables of data in a simple text file. You can edit it in Excel or in a text editor. For more, see here.
The yml file extension (pronounced "YAML") is a type of structured data file that is both human- and machine-readable. You can edit it in any text editor, or in RStudio. Generally, yml is used in situations where a table does not suit because of variable lengths and/or nested structures. It has the advantage over a spreadsheet that the nested "headers" can have variable numbers of categories. The data under each of the hierarchical headings are easily extracted by R.
If you encounter a PDF table of data and need to extract values, this can be achieved with the tabula-java
tool. There’s actually an R wrapper (called tabulizer
), but we haven’t succeeded in getting this running. However, it’s easy enough to run the java tool at the command line on OSX.
Download latest release of tabula-java
and save the file in your path
Run
java -jar tabula-1.0.3-jar-with-dependencies.jar my_table.pdf -o my_data.csv
This should output the data from the table in my_table.pdf into the csv my_data.csv.