library(traits.build)
17 Tutorial 1: Adding a simple dataset
17.1 Overview
This is the first of five tutorials on adding datasets to your traits.build database. It introduces the basic functions, the user input required, and the manual edits needed to complete the dataset’s metadata file. The next four tutorials introduce progressively more complex datasets, functions, and decisions.
Before you begin this tutorial, ensure you have installed traits.build, cloned the traits.build-template repository, and successfully built a database from the datasets in traits.build-template. Instructions are available at Tutorial: Example compilation.
Goals
Learn how to build a metadata.yml file for a dataset.
Learn how to merge a new dataset into a traits.build database.
New functions introduced
metadata_create_template
metadata_add_source_doi
metadata_add_locations
metadata_add_traits
dataset_test
build_setup_pipeline
17.2 Adding tutorial_dataset_1
Ensure the dataset folder contains the correct data files
In the traits.build-template repository, there is a folder titled tutorial_dataset_1 within the data folder.
- Ensure that this folder exists on your computer.
- The file data.csv exists within the tutorial_dataset_1 folder.
- There is a folder raw nested within the tutorial_dataset_1 folder that contains one file, notes.txt.
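The checks above can also be run from R. The following is a minimal sketch using only base R; the helper name check_dataset_folder is hypothetical (it is not part of traits.build), and the final call assumes the working directory is the root of your cloned traits.build-template repository.

```r
# Hypothetical helper (not part of traits.build): report whether the
# expected pieces of a dataset folder are present.
check_dataset_folder <- function(dataset_id, data_dir = "data") {
  folder <- file.path(data_dir, dataset_id)
  c(
    folder_exists = dir.exists(folder),
    data_csv_exists = file.exists(file.path(folder, "data.csv")),
    raw_folder_exists = dir.exists(file.path(folder, "raw"))
  )
}

# Assumes the working directory is the repository root.
check_dataset_folder("tutorial_dataset_1")
```

Any FALSE in the returned vector points to the piece that is missing.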
Source necessary functions
- Source the functions in the traits.build package:
library(traits.build)
Use functions to create a metadata.yml file
Create a metadata template
All dataset metadata is documented within a .yml file that also resides within the dataset’s folder.
The function metadata_create_template quickly creates a skeletal metadata.yml file:
metadata_create_template("tutorial_dataset_1")
This function cycles through a series of user-input menus, querying about both the data format (long versus wide) and which columns contain which variables (taxon name, location name, individual identifiers, collection date).
The menus are shown below, each followed by the appropriate user input.
Is the data long or wide format?
1: Long
2: Wide
Selection: 2
This dataset is considered wide, because the data for each trait is documented in its own column.
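To make the distinction concrete, here is a small sketch in base R that holds the same measurements in both layouts. The column names and values are hypothetical and are not taken from tutorial_dataset_1.

```r
# Hypothetical values, for illustration only.
wide <- data.frame(
  Species = c("species_a", "species_b"),
  lma = c(0.10, 0.20),
  leaf_n = c(0.02, 0.03)
)

# Long format: one row per species-by-trait combination,
# with the trait name and its value in separate columns.
long <- data.frame(
  Species = rep(wide$Species, times = 2),
  trait = rep(c("lma", "leaf_n"), each = nrow(wide)),
  value = c(wide$lma, wide$leaf_n)
)

dim(wide) # 2 rows, 3 columns
dim(long) # 4 rows, 3 columns
```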
Select column for taxon_name
1: Species
2: site
3: LMA (mg mm-2)
4: Leaf nitrogen (mg mg-1)
5: leaf size (mm2)
6: latitude (deg)
7: longitude (deg)
8: description
Selection: 1
Select 1, since taxon names are documented in the column Species.
Select column for location_name
1: NA
2: Species
3: site
4: LMA (mg mm-2)
5: Leaf nitrogen (mg mg-1)
6: leaf size (mm2)
7: latitude (deg)
8: longitude (deg)
9: description
Selection: 3
Select 3, since location names are documented in the column site.
Select column for individual_id
1: NA
2: Species
3: site
4: LMA (mg mm-2)
5: Leaf nitrogen (mg mg-1)
6: leaf size (mm2)
7: latitude (deg)
8: longitude (deg)
9: description
Selection: 1
This dataset does not include a column for individual_id, so 1: NA is the appropriate input.
Select column for collection_date
1: NA
2: Species
3: site
4: LMA (mg mm-2)
5: Leaf nitrogen (mg mg-1)
6: leaf size (mm2)
7: latitude (deg)
8: longitude (deg)
9: description
Selection: 1
This dataset does not include a column for collection_date, so 1: NA is the appropriate input.
A follow-up question then allows you to add a fixed collection_date as a range. The information can be manually updated later.
Enter collection_date range in format ‘2007/2009’: 2002-11/2002-11
A final user prompt asks if, for any traits, a sequence of rows represents repeat observations.
Do all traits need repeat_measurements_id’s?
1: Yes
2: No
2
Repeat measurements only occur if the dataset documents response curve data (e.g. an A-ci or light response curve for plants, or a temperature response curve for animal or plant behaviour), so the answer is almost always no.
Navigate to the dataset’s folder to find the metadata.yml file.
Open this file in Visual Studio Code (or another text-based editor of choice; NOT Word!), so you can see how it is progressively filled in as you work through the next steps.
Propagate source information into the metadata.yml file
This dataset is from a published source with a DOI, and therefore the source information can be added with a single line of code:
metadata_add_source_doi(
  dataset_id = "tutorial_dataset_1",
  doi = "10.1111/j.0022-0477.2005.00992.x"
)
The following information is automatically propagated into the source field:
source:
  primary:
    key: Test_1
    bibtype: Article
    year: '2005'
    author: Daniel S. Falster and Mark Westoby
    journal: Journal of Ecology
    title: Alternative height strategies among 45 dicot rain forest species from tropical Queensland, Australia
    volume: '93'
    number: '3'
    pages: 521--535
    doi: 10.1111/j.0022-0477.2005.00992.x
Once you’ve run this line of code, look at the metadata file to confirm:
- the authors’ names are formatted as first name last name or first initial last name (Daniel S. Falster, or D. S. Falster if first names weren’t available)
- sequential authors’ names are separated by and
- the article title is in sentence case
- the page numbers are filled in as a range, separated by a double dash (521--535 is correct)
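Two of these checks are mechanical enough to sketch in base R. The snippet below is only an illustration using the values shown for this dataset. It is not a traits.build function, and it does not replace reading the file yourself.

```r
# Values as propagated into the source field for this dataset.
pages <- "521--535"
author <- "Daniel S. Falster and Mark Westoby"

# Page range: digits, a double dash, then digits.
pages_ok <- grepl("^[0-9]+--[0-9]+$", pages)

# Authors: splitting on " and " should recover the individual names.
authors <- strsplit(author, " and ", fixed = TRUE)[[1]]

pages_ok
authors
```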
Note, there is also a function metadata_add_source_bibtex if your source information is in this format.
Add location details
Location data can be automatically propagated into the metadata file if it is available in tabular format. For instance, for this study:
library(readr) # read_csv()
library(dplyr) # select(), distinct(), %>%

locations <- read_csv("data/tutorial_dataset_1/data.csv") %>%
  select(site, description, `latitude (deg)`, `longitude (deg)`) %>%
  distinct()
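It may be worth confirming that the resulting table has exactly one row per site, since a site recorded with inconsistent descriptions or coordinates would survive distinct() as several rows. The sketch below uses a stand-in data frame (locations_example, a hypothetical name) built from the values reported for this dataset, so that it runs on its own.

```r
# Stand-in for the `locations` table, using the values reported in this tutorial.
locations_example <- data.frame(
  site = c("Atherton", "Cape Tribulation"),
  description = c(
    "Tropical rain forest vegetation",
    "Complex mesophyll vine forest in tropical rain forest"
  ),
  `latitude (deg)` = c(-17.117, -16.1),
  `longitude (deg)` = c(145.65, 145.45),
  check.names = FALSE # keep the original column names, including spaces
)

# TRUE when every site name appears exactly once.
one_row_per_site <- anyDuplicated(locations_example$site) == 0
one_row_per_site
```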
You can then add this location information directly into the metadata file by running:
metadata_add_locations(dataset_id = "tutorial_dataset_1", location_data = locations)
This leads to the following user prompts:
Select column for location_name
1: site
2: description
3: latitude (deg)
4: longitude (deg)
Selection: 1
Select the same column that you indicated contained location names when you created the metadata template.
Indicate all columns you wish to keep as distinct location_properties in tutorial_dataset_1 (by number separated by space; e.g. ‘1 2 4’):
1: description
2: latitude (deg)
3: longitude (deg)
Selection: 1 2 3
Select all columns that include location properties that should be documented within the metadata.yml file. In this case, it is all three columns.
Following locations added to metadata for tutorial_dataset_1: ‘Atherton’, ‘Cape Tribulation’
with variables ‘description’, ‘latitude (deg)’, ‘longitude (deg)’
Please complete information in data/tutorial_dataset_1/metadata.yml
All available location data has now been automatically added to the metadata.yml file.
locations:
  Atherton:
    description: Tropical rain forest vegetation
    latitude (deg): -17.117
    longitude (deg): 145.65
  Cape Tribulation:
    description: Complex mesophyll vine forest in tropical rain forest
    latitude (deg): -16.1
    longitude (deg): 145.45
Add traits
The next step is to select which columns in the data.csv file have trait information you want to include in the database.
The function metadata_add_traits automatically adds the trait scaffold to metadata.yml:
metadata_add_traits(dataset_id = "tutorial_dataset_1")
The user is prompted to select the columns with trait data.
Indicate all columns you wish to keep as distinct traits in tutorial_dataset_1 (by number separated by space; e.g. ‘1 2 4’):
1: Species
2: site
3: LMA (mg mm-2)
4: Leaf nitrogen (mg mg-1)
5: leaf size (mm2)
6: latitude (deg)
7: longitude (deg)
8: description
Selection: 3 4 5
You select columns 3, 4, 5, as these contain trait data.
Following traits added to metadata for tutorial_dataset_1: ‘LMA (mg mm-2)’, ‘Leaf nitrogen (mg mg-1)’, ‘leaf size (mm2)’
Please complete information in data/tutorial_dataset_1/metadata.yml
metadata.yml now includes a framework in which to manually fill in details about each trait:
traits:
- var_in: LMA (mg mm-2)
  unit_in: unknown
  trait_name: unknown
  entity_type: unknown
  value_type: unknown
  basis_of_value: unknown
  replicates: unknown
  methods: unknown
- var_in: Leaf nitrogen (mg mg-1)
  unit_in: unknown
  trait_name: unknown
  entity_type: unknown
  value_type: unknown
  basis_of_value: unknown
  replicates: unknown
  methods: unknown
- var_in: leaf size (mm2)
  unit_in: unknown
  trait_name: unknown
  entity_type: unknown
  value_type: unknown
  basis_of_value: unknown
  replicates: unknown
  methods: unknown
Manual filling in of metadata
The remaining fields within the metadata.yml file must now be filled in manually.
These include:
* the contributors section
* the description, basis_of_record, life_stage, sampling_strategy, original_file, and notes under the dataset section
* details for each trait, including unit_in, trait_name, entity_type, value_type, basis_of_value, replicates and methods
These are all fields that contain the word unknown.
Adding contributors
Contributor field | Information to add |
---|---|
last_name, first_name | The contributor’s first and last names should be available from the source |
ORCID | Contributors are identified by their ORCID, available for most active researchers at orcid.org |
affiliation | Available from the source or the orcid.org website. Use the same syntax for the same affiliation throughout your database. |
additional_role | For the lead dataset contributor, add the field: additional_role: contact |
You can add multiple data collectors by duplicating the relevant 4 lines of code; see the Adding dataset vignette for protocols on who to add as a data collector.
The line assistants: can be deleted if there aren’t any assistants’ names to add.
Add yourself as the dataset_curator.
Dataset fields
The file data/tutorial_dataset_1/raw/tutorial_dataset_1_notes.txt indicates how to fill in the unknown dataset fields for this study.
In general, the information to fill in these fields should be available from the source (article) or obtained directly from the dataset contributor.
Dataset field | Information to add |
---|---|
basis_of_record | See traits.build_schema for allowable terms. |
life_stage | See traits.build_schema for allowable terms. |
description | A 1-2 sentence summary of the dataset. This can generally be drawn from the abstract. |
sampling_strategy | A description of how sites and sampling protocols were chosen. Can generally be taken verbatim from the methods section of a manuscript. |
original_file | Name of the file submitted by the data contributor and archived in the raw folder. |
notes | Any notes added by the data curator about data quality or edits made to the data during dataset curation. For this study, none (or .na). |
Trait details
trait_name
The trait_name must match a trait_name within the traits dictionary. For this example:
column in dataset | trait concept |
---|---|
LMA (mg mm-2) | leaf_mass_per_area |
Leaf nitrogen (mg mg-1) | leaf_N_per_dry_mass |
leaf size (mm2) | leaf_area |
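The mapping in the table can be kept as a named vector and checked against the list of dictionary trait names before it is typed into metadata.yml. This is only a sketch: dictionary_traits below is a hypothetical stand-in, and in practice the names would come from your own traits dictionary.

```r
# Mapping from the table above: dataset column -> trait concept.
trait_lookup <- c(
  "LMA (mg mm-2)" = "leaf_mass_per_area",
  "Leaf nitrogen (mg mg-1)" = "leaf_N_per_dry_mass",
  "leaf size (mm2)" = "leaf_area"
)

# Hypothetical stand-in for the trait names defined in a traits dictionary.
dictionary_traits <- c("leaf_area", "leaf_mass_per_area", "leaf_N_per_dry_mass")

# Trait names that are absent from the dictionary (empty if all match).
missing_traits <- setdiff(trait_lookup, dictionary_traits)
missing_traits
```

An empty result means every proposed trait_name has a match. Anything listed is either a typo or a candidate for a new trait definition.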
A dataset curator must be familiar with the likely traits in their discipline to accurately match those in a contributed dataset to traits in the dictionary, and be able to determine if a new trait definition is warranted.
unit_in
Units are formatted according to the UCUM convention:
- units in the numerator are separated by a ‘.’
- units in the denominator are each preceded by a ‘/’
- “extra” information that is commonly informally included as part of the units for clarity can be included in curly brackets, {}
As examples:
units | UCUM format |
---|---|
milligram per square millimetre | mg/mm2 |
micromole per square metre second | umol/m2/s |
micromole carbon dioxide per square metre second | umol{CO2}/m2/s |
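The three formatting rules can be expressed as a small string-building helper. The function ucum_unit below is hypothetical and written only to illustrate the rules. It assumes the curly-bracket annotation attaches to the numerator, as in the last row of the table, and it performs no validation against the UCUM specification.

```r
# Hypothetical helper: assemble a unit string following the three rules above.
ucum_unit <- function(numerator, denominator = character(0), annotation = NULL) {
  # Numerator units are joined with "."
  unit <- paste(numerator, collapse = ".")
  # Optional extra information goes in curly brackets
  if (!is.null(annotation)) {
    unit <- paste0(unit, "{", annotation, "}")
  }
  # Each denominator unit is preceded by "/"
  if (length(denominator) > 0) {
    unit <- paste0(unit, paste0("/", denominator, collapse = ""))
  }
  unit
}

ucum_unit("mg", "mm2")
ucum_unit("umol", c("m2", "s"))
ucum_unit("umol", c("m2", "s"), annotation = "CO2")
```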
If the units being read in for a specific trait differ from those defined for the trait in the traits dictionary, the trait values are converted using the conversion rules specified in unit_conversions.csv.
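As a worked example of what such a conversion amounts to, suppose (as an assumption, not something stated in this tutorial) that the dictionary defines leaf_mass_per_area in g/m2 while the dataset reports mg/mm2. The function name and input values below are hypothetical.

```r
# Assumption: dictionary units are g/m2; the dataset reports mg/mm2.
# 1 mg = 1e-3 g and 1 mm2 = 1e-6 m2, so 1 mg/mm2 = 1e-3 / 1e-6 = 1000 g/m2.
mg_per_mm2_to_g_per_m2 <- function(x) x * 1000

# Hypothetical LMA values in mg/mm2.
mg_per_mm2_to_g_per_m2(c(0.05, 0.12))
```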
entity_type, value_type, basis_of_value, replicates, methods
field | value for this dataset | description |
---|---|---|
entity_type | population | The entity corresponding to the trait value. Uses a controlled vocabulary. See traits.build_schema for allowable terms. |
value_type | mean | The statistical nature of the trait value. Uses a controlled vocabulary. See traits.build_schema for allowable terms. |
basis_of_value | measurement | How the trait value was obtained. See traits.build_schema for allowable terms. |
replicates | 3 | The number of replicate measurements that comprise the trait measurement recorded in the spreadsheet. |
methods | See the study’s metadata_notes.txt file | A verbatim (free-form) text field documenting the methods used to collect the trait measurements. This is generally available from the reference or directly from the author. |
The values for entity_type, value_type, basis_of_value, and replicates can vary by trait, and indeed by measurement, but for this study are identical for all traits.
Final steps
Double check the metadata.yml file
You should now have a completed metadata.yml file, with no unknown fields.
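One quick way to confirm this is to search the file for the word unknown. The helper below is a hypothetical base R sketch, not a traits.build function, and the path assumes the working directory is the repository root.

```r
# Hypothetical helper: return the lines of a file that still contain "unknown".
find_unknown_fields <- function(path) {
  lines <- readLines(path, warn = FALSE)
  hits <- grep("unknown", lines, fixed = TRUE)
  data.frame(line_number = hits, text = trimws(lines[hits]))
}

# Assumes the working directory is the repository root.
metadata_path <- "data/tutorial_dataset_1/metadata.yml"
if (file.exists(metadata_path)) {
  find_unknown_fields(metadata_path)
}
```

A data frame with zero rows means no unknown fields remain.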
You’ll notice five sections we haven’t used: contexts, substitutions, taxonomic_updates, exclude_observations, and questions.
These should each contain an .na (as in substitutions: .na). They will be explored in future lessons.
Run tests on the metadata file
Confirm there are no errors in the metadata.yml file:
dataset_test("tutorial_dataset_1")
This should result in the following output:
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 79 ]
Add dataset to the database
Next, add the dataset_id to the build file and rebuild the database:
build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")
Build dataset report
As a final step, build a report for the study:
# a fix because the function was built around specific AusTraits versions
traits.build_database$build_info$version <- "5.0.0"
dataset_report("tutorial_dataset_1", traits.build_database, overwrite = TRUE)
Have a look at the report, though these reports become much more interesting once there are more datasets in the database.