2  Motivation

2.1 The problem

Ecological data are often collected in disparate formats, making it difficult to combine datasets for analysis. This is particularly true for trait data, which come in a variety of formats and are rarely stored in a relational database. As a result, trait data are hard to combine with other ecological data, such as abundance or biomass.

Moreover, trait data are often collected in fragments, each corresponding to a scientific paper. Different traits may be collected for different species, or different traits may be collected for the same species at different times or locations.

To create a large harmonised dataset, we need to combine these different datasets into a unified whole, with common names, units, values for categorical traits, and so on. This is a time-consuming process.

We are not the first group to tackle this problem. Our field (plant ecology) has many trait datasets, most compiled from diverse sources. Most start small, with a few sources that can be handled within a spreadsheet. But as they grow, they encounter issues. The largest is TRY, with more than 15 million records from over 300,000 plant taxa. However, each group has had to tackle the compilation challenge anew, as there are no common pathways for creating a harmonised dataset.

2.2 The solution

The traits.build data standard, R package, and workflow offer a solution to this problem, with a set of open-source tools that enable users to create open-source, harmonised, reproducible databases from disparate datasets, underpinned by a sophisticated ontology able to handle the complexities inherent to ecological data.

We developed these tools to create AusTraits, an open-source database of Australian plant traits. We figured others may want to replicate these efforts, so we transformed the code and workflow into a standalone package, allowing anyone to build a trait database for their own region or taxa.

Along the way, we developed a suite of tools:

  • traits.build data standard: a relational database structure that fully documents the contextual data essential to interpreting ecological data.
  • traits.build R package: a set of functions that enable users to create a harmonised database from disparate datasets.
  • APD: The AusTraits Plant Dictionary, with detailed descriptions for more than 500 plant trait concepts.
  • APCalign: an R package to align and update Australian plant taxon name strings with the Australian Plant Census.

The tools are open-source, so users can adapt them to suit their needs at no cost.