5 Long vs Wide data
5.1 Background
Concepts
There are two distinct formats for storing data, wide
and long
. With wide
format, each row is a separate set of measurements on an entity (individual, taxon), and data for each trait are recorded in successive columns. With long
format data, each row includes only a single trait measurement, with a column for the trait name and a column that indicates which rows of data belong to the same entity. Most ecological datasets are in wide
format, as the same trait measurements are made on each entity. However, when merging together datasets that measure different traits, it becomes more efficient to store data in long
format. Otherwise, you end up with a very “holey” table, as each trait may only be measured on a small proportion of entities.
5.2 Our approach
The traits.build output tables (traits, locations, contexts, etc.) are each in long
format. Within a traits.build database different traits, location properties, and context properties are always likely to be collected by different datasets, leading to a database with many, many columns with low completeness. Under this scenario, long
format significantly reduces the size of the data object, reducing storage costs and increasing the speed of loading. However, most users will prefer a wide
format for analysis as multiple trait measurements for the same entity will be collapsed to a single row. This becomes practical once the full database is subsetted to just include specific traits.
Pivotting between long and wide
Two {tidyr} functions, pivot_longer
and pivot_wider
make it quick to convert data from long
to wide
format in R. The AusTraits tutorial offers multiple examples for how these functions can be used to pivot the various relational tables in a traits.build database, inuding the core traits table.