library(traits.build)
source("R/custom_R_code.R")
23 Tutorial 7: Adding a long-format dataset
23.1 Overview
This is the seventh tutorial on adding datasets to your traits.build database.
Before you begin this tutorial, ensure you have installed traits.build, cloned the traits.build-template repository, and successfully built a database from the example datasets in traits.build-template. Instructions are available at Tutorial: Example compilation.
It is also recommended that you first work through some of the earlier tutorials, as many steps for adding datasets to a traits.build database are only thoroughly described in the early tutorials.
Goals
Learn how to add a long dataset
Learn how to add units from a column
New functions introduced
- none.
23.2 Adding tutorial_dataset_7
This dataset is a subset of data from ABRS_1981 in AusTraits. These are data from the original Flora of Australia volumes (Australian Biological Resources Study) and are therefore all species-level trait values.
This tutorial focuses on how to input a dataset in long format, where there is a single column with all trait values and a column specifying the trait documented in each row of the data file.
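To make the distinction concrete, here is a minimal sketch in base R. The column names follow those used later in this tutorial (species_name, trait, value, units); the taxon and all values are invented placeholders, not rows from data.csv.

```r
# Hypothetical long-format rows: one trait value per row.
# The taxon and values are placeholders, not taken from data.csv.
long_data <- data.frame(
  species_name = rep("Taxon A", 3),
  trait = c("leaf length maximum", "seed length minimum", "seed length maximum"),
  value = c("120", "3", "5"),
  units = c("mm", "mm", "mm"),
  stringsAsFactors = FALSE
)

# The wide-format equivalent: one row per taxon, one column per trait.
wide_data <- reshape(
  long_data[, c("species_name", "trait", "value")],
  idvar = "species_name",
  timevar = "trait",
  direction = "wide"
)

print(long_data)
print(wide_data)
```

In the long layout, adding a new trait means adding rows; in the wide layout it means adding a column.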
Ensure the dataset folder contains the correct data files
In the traits.build-template repository, there is a folder titled tutorial_dataset_7 within the data folder.
Ensure that this folder exists on your computer.
- The file data.csv exists within the tutorial_dataset_7 folder.
- There is a folder raw nested within the tutorial_dataset_7 folder that contains two files, locations.csv and tutorial_dataset_7_notes.txt.
Source necessary functions
- If you have restarted RStudio since last adding a dataset, ensure all functions are loaded from both the traits.build package and the custom functions file:
library(traits.build)
source("R/custom_R_code.R")
Create a metadata.yml file
Create a metadata template
To create the metadata template, run:
metadata_create_template("tutorial_dataset_7")
As with previous datasets, the first question asks whether this is a long or wide dataset. This time, select long.
As in the previous tutorials, this function leads you through a series of menus requiring user input. Ensure you select:
data format: long
The remaining prompts are now slightly different, since you have to identify columns for trait_name and value:
Select column for taxon_name 1: species_name
Select column for trait_name 2: trait
Select column for value 4: value
location_name column: 1: NA
individual_id column: 1: NA
collection_date column: 1: NA
Enter collection_date range in format ‘2007/2009’: unknown/1981
Do all traits need repeat_measurements_id’s? 2: No
Notes:
- All long-format datasets require an identifier to group rows of data referring to the same entity. If neither a location_name nor an individual_id is provided (as is the case for all flora-derived datasets), the taxon_name becomes the identifier that is used to unite measurements into a single observation.
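The grouping idea can be sketched in base R as follows. This illustrates the concept only and is not how traits.build assigns identifiers internally; the column observation_group and the data values are hypothetical.

```r
# Hypothetical long-format rows for two taxa (placeholder values).
long_data <- data.frame(
  species_name = c("Taxon A", "Taxon A", "Taxon B"),
  trait = c("leaf length maximum", "seed length maximum", "leaf length maximum"),
  value = c("120", "5", "35"),
  stringsAsFactors = FALSE
)

# With no location_name or individual_id, the taxon name is the grouping key.
# match() against unique() numbers each taxon in order of first appearance.
long_data$observation_group <- match(
  long_data$species_name,
  unique(long_data$species_name)
)

print(long_data)
```

Rows sharing a taxon name receive the same group number, so their trait values are treated as belonging to one observation.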
Navigate to the dataset’s folder and open the metadata.yml file in Visual Studio Code, to ensure information is added to the expected sections as you work through the tutorial.
Propagate source information into the metadata.yml file
Since this dataset is not from a published study with a DOI, the source information needs to be added manually:
bibtype: Online
year: 1981
author: '{Australian Biological Resources Study}'
title: Flora of Australia, Australian Biological Resources Study, Canberra.
publisher: Department of Climate Change, Energy, the Environment and Water, Canberra.
url: http://www.ausflora.org.au
There are other bibtype values you will encounter as well, including Unpublished, Book, Misc, Thesis, InBook (for chapters), Report, and TechReport. For each there are different required and optional fields (per BibTeX’s rules). See the complete guide to adding datasets for examples of each.
Add traits
To select columns in the data.csv file that include trait data, run:
metadata_add_traits(dataset_id = "tutorial_dataset_7")
For long datasets, this function outputs a list of unique values within the trait names column:
Indicate all columns you wish to keep as distinct traits in tutorial_dataset_7 (by number separated by space; e.g. ‘1 2 4’):
1: leaf length maximum
2: leaf type
3: seed length maximum
4: seed length minimum
Select 1 2 3 4, as you want to include all four traits.
Then fill in the details for each trait column in the traits section of the metadata file.
trait | trait concept | units_in | entity_type | value_type | basis_of_value | replicates |
---|---|---|---|---|---|---|
leaf length maximum | leaf_length | units | species | maximum | measurement | .na |
leaf type | leaf_compoundness | .na | species | mode | expert_score | .na |
seed length maximum | seed_length | units | species | maximum | measurement | .na |
seed length minimum | seed_length | units | species | minimum | measurement | .na |
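Two rows of the table share the trait concept seed_length. As an optional sanity check, the sketch below re-enters three columns of the table in base R and confirms that no combination of trait concept and value_type repeats. The check and the object names are illustrative and are not part of traits.build.

```r
# Three columns of the table above, re-entered by hand.
trait_map <- data.frame(
  trait = c("leaf length maximum", "leaf type",
            "seed length maximum", "seed length minimum"),
  trait_concept = c("leaf_length", "leaf_compoundness",
                    "seed_length", "seed_length"),
  value_type = c("maximum", "mode", "maximum", "minimum"),
  stringsAsFactors = FALSE
)

# Is any trait concept used by more than one row?
concept_reused <- any(duplicated(trait_map$trait_concept))

# Is any (trait concept, value_type) pair used by more than one row?
pair_reused <- any(duplicated(trait_map[, c("trait_concept", "value_type")]))

print(c(concept_reused = concept_reused, pair_reused = pair_reused))
```

Here a concept is reused but no concept and value_type pair is, so each row still describes a distinct kind of value.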
Notes:
- You may have noticed in the data.csv file that there is also a column units. For many long datasets there is a fixed unit for each trait, as is standard for wide datasets. In such cases the fixed units values are entered in the traits section of the metadata file, just as for most wide datasets. In this dataset there is a column documenting the units, as different taxa have leaf length and seed length reported in different units. The column for units can be mapped in at the trait level, as indicated here, or, for a long dataset, it could be mapped in a single time in the dataset section of the metadata, units_in: units, and then you’d delete the line referring to units_in from each of the traits.
- There are two different trait names that refer to seed length, seed length maximum and seed length minimum. It is not a problem that these both map to the trait concept seed_length, as they are different value types.
- Because these are species-level trait values, even the numeric traits do not have a replicate count. The range of values should represent all individuals of the species.
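One way to decide whether a units column is needed, rather than a fixed unit per trait, is to count the distinct units recorded for each trait. The following base R sketch uses invented rows; only the column names trait, value, and units come from this dataset, and mixed_unit_traits is a hypothetical name.

```r
# Hypothetical rows in which one trait is reported in two different units.
long_data <- data.frame(
  trait = c("leaf length maximum", "leaf length maximum",
            "seed length maximum", "seed length maximum"),
  value = c("12", "120", "3", "5"),
  units = c("cm", "mm", "mm", "mm"),
  stringsAsFactors = FALSE
)

# Count the distinct units recorded for each trait.
units_per_trait <- tapply(
  long_data$units,
  long_data$trait,
  function(u) length(unique(u))
)

# Traits with more than one unit cannot use a single fixed units_in value.
mixed_unit_traits <- names(units_per_trait)[units_per_trait > 1]
print(mixed_unit_traits)
```

If mixed_unit_traits were empty, a fixed unit could be entered for each trait; otherwise the units column has to be mapped in.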
Adding contributors
The file data/tutorial_dataset_7/raw/tutorial_dataset_7_notes.txt indicates the main data_contributor for this study.
Dataset fields
The file data/tutorial_dataset_7/raw/tutorial_dataset_7_notes.txt indicates how to fill in the unknown dataset fields for this study.
Testing, error fixes, and report building
At this point, run the dataset tests, rebuild the dataset, and check for excluded data:
dataset_test("tutorial_dataset_7")
build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")
traits.build_database$excluded_data %>%
  filter(dataset_id == "tutorial_dataset_7") %>% View()
The excluded data includes four rows of data with the error Unsupported trait value for the trait leaf_compoundness. The term articles does not describe a leaf’s compoundness. As articles are always simple leaves, you can add a substitution:
metadata_add_substitution(dataset_id = "tutorial_dataset_7",
trait_name = "leaf_compoundness",
find = "articles", replace = "simple")
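Conceptually, a substitution replaces one recorded value with another for a single trait. The sketch below mimics that on a plain character vector. It assumes the find value is matched exactly, and it does not call any traits.build function. The vector contents are placeholders, and "compound" is included only as an example of a value that is left untouched.

```r
# Hypothetical leaf_compoundness values as they might appear in the data file.
values <- c("simple", "articles", "compound", "articles")

# The find and replace arguments from the substitution above.
find <- "articles"
replace <- "simple"

# Assumption: only values exactly equal to `find` are replaced.
values[values == find] <- replace

print(values)
```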
Then rebuild the database and again check excluded data to ensure the substitution has worked as intended.
traits.build_database$build_info$version <- "5.0.0" # a fix because the function was built around specific AusTraits versions
dataset_report("tutorial_dataset_7", traits.build_database, overwrite = TRUE)