17  Tutorial 1: Adding a simple dataset

17.1 Overview

This is the first of five tutorials on adding datasets to your traits.build database. It introduces the basic functions, the user input required, and the manual edits needed to complete the dataset’s metadata file. The next four tutorials introduce progressively more complex datasets, functions, and decisions.

Before you begin this tutorial, ensure you have installed traits.build, cloned the traits.build-template repository, and successfully built a database from the datasets in traits.build-template. Instructions are available at Tutorial: Example compilation.
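If you are setting up from scratch, a minimal sketch of the installation steps is below (the GitHub paths assume the package and template live in the traitecoevo organisation; see Tutorial: Example compilation for the authoritative instructions):

# install the traits.build package from GitHub
# install.packages("remotes")
remotes::install_github("traitecoevo/traits.build")

# then clone the template repository and open it as your working directory:
# git clone https://github.com/traitecoevo/traits.build-template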

Goals

New functions introduced

  • metadata_create_template

  • metadata_add_source_doi

  • metadata_add_locations

  • metadata_add_traits

  • dataset_test

  • build_setup_pipeline


17.2 Adding tutorial_dataset_1

Ensure the dataset folder contains the correct data files

In the traits.build-template repository, there is a folder titled tutorial_dataset_1 within the data folder. 

  • Ensure that this folder exists on your computer. 

  • Confirm that the file data.csv exists within the tutorial_dataset_1 folder. 

  • Confirm there is a folder raw nested within the tutorial_dataset_1 folder, which contains one file, tutorial_dataset_1_notes.txt

Source necessary functions

  • Source the functions in the traits.build package:
library(traits.build)

Use functions to create a metadata.yml file

Create a metadata template

All dataset metadata is documented within a .yml file that also resides within the dataset’s folder.

The function metadata_create_template quickly creates the skeletal metadata.yml file:

metadata_create_template("tutorial_dataset_1")

This function cycles through a series of user-input menus, querying about both the data format (long versus wide) and which columns contain which variables (taxon name, location name, individual identifiers, collection date).

The menus are shown below, with the menu in blue and the appropriate user input in red.

Is the data long or wide format?

1: Long
2: Wide

Selection: 2

This dataset is considered wide, because the data for each trait is documented in its own column.

Select column for taxon_name

1: Species
2: site
3: LMA (mg mm-2)
4: Leaf nitrogen (mg mg-1)
5: leaf size (mm2)
6: latitude (deg)
7: longitude (deg)
8: description

Selection: 1

Select 1 since taxon names are documented in the column Species.

Select column for location_name

1: NA
2: Species
3: site
4: LMA (mg mm-2)
5: Leaf nitrogen (mg mg-1)
6: leaf size (mm2)
7: latitude (deg)
8: longitude (deg)
9: description

Selection: 3

Select 3 since location names are documented in the column site.

Select column for individual_id

1: NA
2: Species
3: site
4: LMA (mg mm-2)
5: Leaf nitrogen (mg mg-1)
6: leaf size (mm2)
7: latitude (deg)
8: longitude (deg)
9: description

Selection: 1

This dataset does not include a column for individual_id, so 1: NA is the appropriate input.

Select column for collection_date

1: NA
2: Species
3: site
4: LMA (mg mm-2)
5: Leaf nitrogen (mg mg-1)
6: leaf size (mm2)
7: latitude (deg)
8: longitude (deg)
9: description

Selection: 1

This dataset does not include a column for collection_date, so 1: NA is the appropriate input.

A follow-up question then allows you to add a fixed collection_date as a range. The information can be manually updated later.

Enter collection_date range in format ‘2007/2009’: 2002-11/2002-11

A final user prompt asks if, for any traits, a sequence of rows represents repeat observations.

Do all traits need repeat_measurements_id’s?

1: Yes
2: No

Selection: 2

repeat_measurements_id’s are only needed if the dataset documents response curve data (e.g. an A-ci or light response curve for plants, or a temperature response curve for animal or plant behaviour), so the answer is almost always no.

Navigate to the dataset’s folder to find the metadata.yml file.

Open this file in Visual Studio Code (or another text-based editor of choice; NOT Word!), so you can see how it is progressively filled in as you work through the next steps.
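At this point the dataset section of the file already records your menu answers. A partial sketch of what you should see follows (field names follow the traits.build schema; the exact order and the additional fields may vary between traits.build versions):

dataset:
  data_is_long_format: no
  taxon_name: Species
  location_name: site
  collection_date: 2002-11/2002-11
  description: unknown

(plus further fields still set to unknown, to be filled in below)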


Propagate source information into the metadata.yml file

This dataset is from a published source with a DOI, so the source information can be added with a single line of code:

metadata_add_source_doi(
  dataset_id = "tutorial_dataset_1", 
  doi = "10.1111/j.0022-0477.2005.00992.x"
  )

The following information is automatically propagated into the source field:

primary:
  key: Test_1
  bibtype: Article
  year: '2005'
  author: Daniel S. Falster and Mark Westoby
  journal: Journal of Ecology
  title: Alternative height strategies among 45 dicot rain forest species from tropical Queensland, Australia
  volume: '93'
  number: '3'
  pages: 521--535
  doi: 10.1111/j.0022-0477.2005.00992.x

Once you’ve run this line of code, look at the metadata file to confirm:

  1. the authors’ names are formatted as first name last name or first initial last name (Daniel S. Falster, or D. S. Falster if first names weren’t available)
  2. sequential authors’ names are separated by and 
  3. the article title is in sentence case
  4. the page numbers are filled in as a range, separated by a double dash (521--535 is correct) 

Note, there is also a function metadata_add_source_bibtex if your source information is instead in bibtex format.
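If you do have a bibtex file, the call might look something like this (the file path here is hypothetical; check ?metadata_add_source_bibtex for the exact arguments):

metadata_add_source_bibtex(
  dataset_id = "tutorial_dataset_1",
  file = "data/tutorial_dataset_1/source.bib"  # hypothetical path to a .bib file
  )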


Add location details

Location data can be automatically propagated into the metadata file if it is available in tabular format. For instance, for this study:

library(tidyverse)  # loads read_csv, select, distinct and the pipe used below

locations <-
  read_csv("data/tutorial_dataset_1/data.csv") %>%
  select(site, description, `latitude (deg)`, `longitude (deg)`) %>%
  distinct()
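Before adding the locations to the metadata file, it is worth printing the tibble as a sanity check:

locations
# expect two rows (Atherton, Cape Tribulation) and four columns:
# site, description, latitude (deg), longitude (deg)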

You can then add this location information directly into the metadata file by running:

metadata_add_locations(dataset_id = "tutorial_dataset_1", location_data = locations)

This leads to the following user prompts:

Select column for location_name 

1: site
2: description
3: latitude (deg)
4: longitude (deg)

Selection: 1 

Select the same column that you indicated contained location names when you created the metadata template.

Indicate all columns you wish to keep as distinct location_properties in tutorial_dataset_1 (by number separated by space; e.g. ‘1 2 4’):

1: description
2: latitude (deg)
3: longitude (deg)

Selection: 1 2 3

Select all columns that include location properties that should be documented within the metadata.yml file. In this case, it is all three columns.

Following locations added to metadata for tutorial_dataset_1: ‘Atherton’, ‘Cape Tribulation’
with variables ‘description’, ‘latitude (deg)’, ‘longitude (deg)’
Please complete information in data/tutorial_dataset_1/metadata.yml

All available location data has now been automatically added to the metadata.yml file.

locations:
  Atherton:
    description: Tropical rain forest vegetation
    latitude (deg): -17.117
    longitude (deg): 145.65
  Cape Tribulation:
    description: Complex mesophyll vine forest in tropical rain forest
    latitude (deg): -16.1
    longitude (deg): 145.45

Add traits

The next step is to select which columns in the data.csv file have trait information you want to include in the database.

The function metadata_add_traits automatically adds the trait-scaffold to metadata.yml:

metadata_add_traits(dataset_id = "tutorial_dataset_1")

The user is prompted to select the columns with trait data.

Indicate all columns you wish to keep as distinct traits in tutorial_dataset_1 (by number separated by space; e.g. ‘1 2 4’):

1: Species
2: site
3: LMA (mg mm-2)
4: Leaf nitrogen (mg mg-1)
5: leaf size (mm2)
6: latitude (deg)
7: longitude (deg)
8: description

Selection: 3 4 5

You select columns 3, 4, 5, as these contain trait data.

Following traits added to metadata for tutorial_dataset_1: ‘LMA (mg mm-2)’, ‘Leaf nitrogen (mg mg-1)’, ‘leaf size (mm2)’
Please complete information in data/tutorial_dataset_1/metadata.yml

metadata.yml now includes a framework in which to manually fill in details about each trait:

traits:
- var_in: LMA (mg mm-2)
  unit_in: unknown
  trait_name: unknown
  entity_type: unknown
  value_type: unknown
  basis_of_value: unknown
  replicates: unknown
  methods: unknown
- var_in: Leaf nitrogen (mg mg-1)
  unit_in: unknown
  trait_name: unknown
  entity_type: unknown
  value_type: unknown
  basis_of_value: unknown
  replicates: unknown
  methods: unknown
- var_in: leaf size (mm2)
  unit_in: unknown
  trait_name: unknown
  entity_type: unknown
  value_type: unknown
  basis_of_value: unknown
  replicates: unknown
  methods: unknown

Manual filling in of metadata

The remaining fields within the metadata.yml file must now be filled in manually.

These include:

  • the contributors section

  • the description, basis_of_record, life_stage, sampling_strategy, original_file, and notes fields under the dataset section

  • details for each trait, including unit_in, trait_name, entity_type, value_type, basis_of_value, replicates, and methods

These are all fields that contain the word unknown.

Adding contributors

For each contributor field, the information to add:

  • last_name, first_name: the contributor’s first and last names, which should be available from the source

  • ORCID: contributors are identified by their ORCID, available for most active researchers at orcid.org

  • affiliation: available from the source or the orcid.org website; use the same syntax for the same affiliation throughout your database

  • additional_role: for the lead dataset contributor, add the field additional_role: contact
  • You can add multiple data collectors by duplicating the relevant 4 lines of code; see the Adding dataset vignette for protocols on who to add as a data collector.

  • The line assistants: can be deleted if there aren’t any assistants’ names to add.

  • Add yourself as the dataset_curator.
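Putting this together, a sketch of a completed contributors section follows (the author names come from the source; the ORCIDs and affiliations shown are placeholders, and the field names should match those in your template file):

contributors:
  data_collectors:
  - last_name: Falster
    first_name: Daniel
    ORCID: 0000-0000-0000-0000          # placeholder; look up the real ORCID at orcid.org
    affiliation: Example University, Australia   # placeholder
    additional_role: contact
  - last_name: Westoby
    first_name: Mark
    ORCID: 0000-0000-0000-0000          # placeholder
    affiliation: Example University, Australia   # placeholder
  assistants: .na                       # or delete this line if there are no assistants
  dataset_curators: Your Name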

Dataset fields

  • The file data/tutorial_dataset_1/raw/tutorial_dataset_1_notes.txt indicates how to fill in the unknown dataset fields for this study.

  • In general, the information to fill in these fields should be available from the source (article) or obtained directly from the dataset contributor.

For each dataset field, the information to add:

  • basis_of_record: see traits.build_schema for allowable terms

  • life_stage: see traits.build_schema for allowable terms

  • description: a 1-2 sentence summary of the dataset, which can generally be formulated from information in the abstract

  • sampling_strategy: a description of how sites and sampling protocols were chosen; can generally be taken verbatim from the methods section of a manuscript

  • original_file: the name of the file submitted by the data contributor and archived in the raw folder

  • notes: none (or .na) for this study; in general, any notes added by the data curator about data quality or edits made during dataset curation

Trait details

trait_name

The trait_name must match a trait_name within the traits dictionary. For this example:

For each column in the dataset, the matching trait concept:

  • LMA (mg mm-2): leaf_mass_per_area

  • Leaf nitrogen (mg mg-1): leaf_N_per_dry_mass

  • leaf size (mm2): leaf_area

A dataset curator must be familiar with the likely traits in their discipline to accurately match those in a contributed dataset to traits in the dictionary, and be able to determine if a new trait definition is warranted.

unit_in

Units are formatted according to the UCUM convention:

  • units in the numerator are separated by a ‘.’
  • units in the denominator are each preceded by a ‘/’
  • “extra” information that is commonly informally included as part of the units for clarity can be included in curly brackets, {}

As examples:

  • milligram per square millimetre: mg/mm2
  • micromole per square metre per second: umol/m2/s
  • micromole carbon dioxide per square metre per second: umol{CO2}/m2/s

If the units being read in for a specific trait differ from those defined for the trait in the traits dictionary, the trait values are converted using the conversion rules specified in unit_conversions.csv.
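For example, if the traits dictionary defines leaf_mass_per_area in g/m2 (check the definition in your own dictionary), values read in as mg/mm2 would be multiplied by 1000, since 1 mg/mm2 = 10^-3 g / 10^-6 m2 = 1000 g/m2.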

entity_type, value_type, basis_of_value, replicates, methods

For each field, the value for this dataset, followed by a description:

  • entity_type: population. The entity corresponding to the trait value; uses a controlled vocabulary. See traits.build_schema for allowable terms.

  • value_type: mean. The statistical nature of the trait value; uses a controlled vocabulary. See traits.build_schema for allowable terms.

  • basis_of_value: measurement. How the trait value was obtained. See traits.build_schema for allowable terms.

  • replicates: 3. The number of replicate measurements that comprise the trait measurement recorded in the spreadsheet.

  • methods: see the study’s tutorial_dataset_1_notes.txt file. A verbatim (free-form) text field documenting the methods used to collect the trait measurements; this is generally available from the reference or directly from the author.

The values for entity_type, value_type, basis_of_value, and replicates can vary by trait – and indeed by measurement – but for this study are identical for all traits.
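Pulling these pieces together, the completed entry for the LMA column should look roughly like this (the methods text below is a placeholder; paste in the verbatim methods from the notes file or the source):

- var_in: LMA (mg mm-2)
  unit_in: mg/mm2
  trait_name: leaf_mass_per_area
  entity_type: population
  value_type: mean
  basis_of_value: measurement
  replicates: 3
  methods: Full methods text copied from the study’s notes file (placeholder).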

Final steps

Double check the metadata.yml file

You should now have a completed metadata.yml file, with no unknown fields.

You’ll notice five sections we haven’t used: contexts, substitutions, taxonomic_updates, exclude_observations, and questions.

These should each contain an .na (as in substitutions: .na). They will be explored in future tutorials.
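In the metadata file these sections should therefore simply read:

contexts: .na
substitutions: .na
taxonomic_updates: .na
exclude_observations: .na
questions: .na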

Run tests on the metadata file

Confirm there are no errors in the metadata.yml file:

dataset_test("tutorial_dataset_1")

This should result in the following output:

[ FAIL 0 | WARN 0 | SKIP 0 | PASS 79 ]

Add dataset to the database

Next, add the dataset_id to the build file that builds the database, then rebuild the database:

build_setup_pipeline(method = "base", database_name = "traits.build_database")
source("build.R")
Build dataset report

As a final step, build a report for the study:

traits.build_database$build_info$version <- "5.0.0"  
    # a fix because the function was built around specific AusTraits versions
dataset_report("tutorial_dataset_1", traits.build_database, overwrite = TRUE)

Have a look at the report, but note that these reports become much more interesting once there are more datasets in the database.