library(traits.build)
source("R/custom_R_code.R")23 Tutorial 7: Adding long format dataset
23.1 Overview
This is the seventh tutorial on adding datasets to your traits.build database.
Before you begin this tutorial, ensure you have installed traits.build, cloned the traits.build-template repository, and have successfully build a database from the example datasets in traits.build-template. Instructions are available at Tutorial: Example compilation.
It is also recommended that you first work through some of the earlier tutorials, as many steps for adding datasets to a traits.build database are only thoroughly described in the early tutorials.
Goals
- Learn how to add a long dataset 
- Learn how to add units from a column 
New functions introduced
- none.
23.2 Adding tutorial_dataset_7
This dataset is a subset of data from ABRS_1981 in AusTraits. These are data from the original Flora of Australia volumes (Australian Biological Resources Study) and are therefore all species-level trait values.
This tutorial focuses on how to input a dataset in long format, where there is a single column with all trait values and a column specifying the trait documented in each row of the data file.
Ensure the dataset folder contains the correct data files
In the traits.build-template repository, there is a folder titled tutorial_dataset_7 within the data folder. 
- Ensure that this folder exists on your computer. 
- The file - data.csvexists within the- tutorial_dataset_7folder.
- There is a folder - rawnested within the- tutorial_dataset_7folder, that contains two files,- locations.csvand- tutorial_dataset_7_notes.txt.
source necessary functions
- If you have restarted R Studio since last adding a dataset, ensure all functions are loaded from both the traits.buildpackage and the custom functions file:
Create a metadata.yml file
Create a metadata template
To create the metadata template, run:
metadata_create_template("tutorial_dataset_7")As with previous datasets, the first question asks whether this is a long or wide dataset. You now select long:
As with in the previous tutorials, this function leads you through a series of menus requiring user input. Ensure you select:
data format: long
The remaining prompts are now slightly different, since you have to identify columns for trait_name and value:
Select column for taxon_name 1: species_name
Select column for trait_name 2: trait
Select column for value 4: value
location_name column: 1: NA
individual_id column: 1: NA
collection_date column: 1: NA
Enter collection_date range in format ‘2007/2009’: unknown/1981
Do all traits need repeat_measurements_id’s? 2: No
Notes:
- All long-format datasets require an identifier to group rows of data referring to the same entity. If neither a location_namenor anindividual_idis provided (as is the case for all flora-derived datasets), thetaxon_namebecomes the identifier that is used to unite measurements into a single observation.
Navigate to the dataset’s folder and open the metadata.yml file in Visual Studio Code, to ensure information is added to the expected sections as you work through the tutorial.
Propagate source information into the metadata.yml file
Since this dataset is not from a published study with a doi, the source information needs to be manually added:
  bibtype: Online
  year: 1981
  author: '{Australian Biological Resources Study}'
  title: Flora of Australia, Australian Biological Resources Study, Canberra.
  publisher: Department of Climate Change, Energy, the Environment and Water, Canberra.
  url: http://www.ausflora.org.auThere are other bibtype’s you will encounter as well, including Unpublished, Book, Misc, Thesis, InBook (for chapters), Report and TechReport. For each there are different required and optional fields (per BibTex’s rules). See the complete guide to adding datasets for examples of each.
Add traits
To select columns in the data.csv file that include trait data, run:
metadata_add_traits(dataset_id = "tutorial_dataset_7")For long datasets, this function outputs a list of unique values within the trait names column:
Indicate all columns you wish to keep as distinct traits in tutorial_dataset_7 (by number separated by space; e.g. ‘1 2 4’): 1: leaf length maximum 2: leaf type 3: seed length maximum 4: seed length minimum
Select columns 1 2 3 4, you want to include all four traits.
Then fill in the details for each trait column in the traits section of the metadata file.
| trait | trait concept | units_in | entity_type | value_type | basis_of_ value | replicates | 
|---|---|---|---|---|---|---|
| leaf length maximum | leaf_length | units | species | maximum | measurement | .na | 
| leaf type | leaf_compoundness | .na | species | mode | expert_score | .na | 
| seed length maximum | seed_length | units | species | maximum | measurement | .na | 
| seed length minimum | seed_length | units | species | minimum | measurement | .na | 
Notes:
- You may have noticed in the data.csv file that there is also a column - units. For many long datasets there is a fixed unit for each trait, just as is standardly the case for wide datasets. In such cases fixed units values are mapped into the traits section of the metadata file, just as occurs with most wide datasets. In this dataset there is a column documenting the units, as different tax have leaf length and seed length reported in different units. The column for units can be mapped in at the trait level, as indicated here, or, for a long dataset, it could be mapped in a single time in the dataset section of the metadata,- units_in: unitsand then you’d delete the line referring to- units_infrom each of the traits.
- There are two different trait names that refer to seed length, - seed length maximumand- seed length minimum. It is not a problem that these both map to the trait concept- seed_lengthas they are different value types.
 
- Because these are species-level trait values, even the numeric traits do not have a replicate count. The range of values should represent all individuals of the species. 
 
Adding contributors
The file data/tutorial_dataset_7/raw/tutorial_dataset_7_notes.txt indicates the main data_contributor for this study.
Dataset fields
The file data/tutorial_dataset_7/raw/tutorial_dataset_7_notes.txt indicates how to fill in the unknown dataset fields for this study.
Testing, error fixes, and report building
At this point, run the dataset tests, rebuild the dataset, and check for excluded data:
dataset_test("tutorial_dataset_7")
build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")
traits.build_database$excluded_data %>% 
  filter(dataset_id == "tutorial_dataset_7") %>%  View()The excluded data includes four rows of data with the error Unsupported trait value for the trait leaf_compoundness. The term article does not describe a leaf’s compoundness. As articles are always simple leaves you can add a substitution:
metadata_add_substitution(dataset_id = "tutorial_dataset_7", 
                          trait_name = "leaf_compoundness", 
                          find = "articles", replace = "simple")Then rebuild the database and again check excluded data to ensure the substitution has worked as intended.
traits.build_database$build_info$version <- "5.0.0"  
    # a fix because the function was built around specific AusTraits versions
dataset_report("tutorial_dataset_7", traits.build_database, overwrite = TRUE)