When working with biodiversity data, it is important to verify
taxonomic names with an authoritative list and correct any out-of-date
names. The APCalign package simplifies this process by:
- Accessing up-to-date taxonomic information from the Australian Plant Census (APC) and the Australian Plant Name Index (APNI).
- Aligning authoritative names to your taxonomic names using our fuzzy matching algorithm.
- Updating your taxonomic names in a transparent, reproducible manner.
Installation
install.packages("remotes")
remotes::install_github("traitecoevo/APCalign")
library(APCalign)
To demonstrate how to use APCalign, we will use an example dataset, gbif_lite, which is documented in ?gbif_lite.
dim(gbif_lite)
#> [1] 129 7
gbif_lite |> print(n = 6)
#> # A tibble: 129 × 7
#> species infraspecificepithet taxonrank decimalLongitude decimalLatitude scientificname
#> <chr> <chr> <chr> <dbl> <dbl> <chr>
#> 1 Tetratheca… <NA> SPECIES 145. -37.4 Tetratheca ci…
#> 2 Peganum ha… <NA> SPECIES 139. -33.3 Peganum harma…
#> 3 Calotis mu… <NA> SPECIES 115. -24.3 Calotis multi…
#> 4 Leptosperm… <NA> SPECIES 151. -34.0 Leptospermum …
#> 5 Lepidosper… <NA> SPECIES 142. -37.3 Lepidosperma …
#> 6 Enneapogon… <NA> SPECIES 129. -17.8 Enneapogon po…
#> # ℹ 123 more rows
#> # ℹ 1 more variable: verbatimscientificname <chr>
Retrieve taxonomic resources
The first step is to retrieve the entire APC and APNI name databases
and store them locally as taxonomic resources. We achieve this using
load_taxonomic_resources(). The resources are compressed as
parquet files to speed up download and local loading.
There are two versions of the databases that you can retrieve with
the stable_or_current_data argument. Calling:
- stable will retrieve the most recent, archived version of the databases from our GitHub releases. This is set as the default option.
- current will retrieve the up-to-date databases directly from the APC and APNI website.
Note that the databases are reasonably large so the initial retrieval
of the core data will take a few minutes. Once the taxonomic resources
have been stored locally, subsequent retrievals will take less time.
Retrieving current resources will always take longer since
it is accessing the latest information from the website in an
uncompressed format.
# Benchmarking the retrieval of `stable` or `current` resources
stable_start_time <- Sys.time()
stable_resources <- load_taxonomic_resources(stable_or_current_data = "stable")
#> Loading resources......done
stable_end_time <- Sys.time()
current_start_time <- Sys.time()
current_resources <- load_taxonomic_resources(stable_or_current_data = "current")
#> Loading resources......done
current_end_time <- Sys.time()
# Compare times
stable_end_time - stable_start_time
#> Time difference of 16.48976 secs
current_end_time - current_start_time
For a more reproducible workflow, we recommend specifying the exact
stable version you want to use.
resources <- load_taxonomic_resources(stable_or_current_data = "stable", version = "0.0.2.9000")
#> Loading resources......done
Align and update plant taxon names
Now we can query our taxonomic names against the taxonomic resources
we just retrieved using create_taxonomic_update_lookup().
This all-in-one function will:
- Align your taxonomic names to APC and APNI using our matching algorithms.
- Update names to an APC-accepted species or infraspecific name whenever possible.
- Return a suggested name for all names, defaulting to an accepted_name when available, and otherwise providing an APNI name or a name where only a genus-level alignment is possible.
If you would like to learn more about each of these steps, take a look at the section Closer look at name standardisation with ‘APCalign’.
library(dplyr)
updated_gbif_names <- gbif_lite |>
pull(species) |>
create_taxonomic_update_lookup(resources = resources)
#> Checking alignments of 121 taxa
#> -> 0 names already matched; 0 names checked but without a match; 121 taxa yet to be checked
updated_gbif_names |>
print(n = 6)
#> # A tibble: 129 × 12
#> original_name aligned_name accepted_name suggested_name genus taxon_rank taxonomic_dataset
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Tetratheca c… Tetratheca … Tetratheca c… Tetratheca ci… Tetr… species APC
#> 2 Peganum harm… Peganum har… Peganum harm… Peganum harma… Pega… species APC
#> 3 Calotis mult… Calotis mul… Calotis mult… Calotis multi… Calo… species APC
#> 4 Leptospermum… Leptospermu… Leptospermum… Leptospermum … Lept… species APC
#> 5 Lepidosperma… Lepidosperm… Lepidosperma… Lepidosperma … Lepi… species APC
#> 6 Enneapogon p… Enneapogon … Enneapogon p… Enneapogon po… Enne… species APC
#> # ℹ 123 more rows
#> # ℹ 5 more variables: taxonomic_status <chr>, scientific_name_authorship <chr>,
#> # aligned_reason <chr>, update_reason <chr>, number_of_collapsed_taxa <dbl>
The original_name is the taxon name used in your
original data. The aligned_name is the taxon name we used
to link with the APC to identify any synonyms. The
accepted_name is the currently accepted taxon name used by
the Australian Plant Census. The suggested_name is the best
possible name option for the original_name.
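As a quick illustration of how these columns relate to each other, the sketch below flags the rows in which the suggested name differs from the original name. It runs on a small hand-made data frame rather than on real APCalign output, so the rows are assumptions chosen for illustration; only the column names follow the output shown above.

```r
# Hand-made stand-in for the lookup table; not real APCalign output
lookup <- data.frame(
  original_name  = c("Tetratheca ciliata", "Dryandra sessilis"),
  aligned_name   = c("Tetratheca ciliata", "Dryandra sessilis"),
  accepted_name  = c("Tetratheca ciliata", "Banksia sessilis"),
  suggested_name = c("Tetratheca ciliata", "Banksia sessilis"),
  stringsAsFactors = FALSE
)

# Keep only the rows where the name was changed.
# which() drops any comparisons that evaluate to NA.
changed <- lookup[
  which(lookup$original_name != lookup$suggested_name),
  c("original_name", "suggested_name")
]

changed
```

The same comparison could be applied to a real lookup table to review which names were updated before merging them back into your data.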
Plant established status across states/territories
‘APCalign’ can also provide the state/territory distribution for established status (native/introduced) from the APC.
We can access the established status data by state/territory using
create_species_state_origin_matrix():
# Retrieve status data by state/territory
status_matrix <- create_species_state_origin_matrix(resources = resources)
Here is a breakdown of all possible values for origin:
library(purrr)
library(janitor)
# Obtain unique values
status_matrix |>
select(-species) |>
flatten_chr() |>
tabyl()
#> flatten_chr(select(status_matrix, -species)) n percent
#> doubtfully naturalised 1120 2.371003e-03
#> formerly naturalised 277 5.863998e-04
#> native 40336 8.538997e-02
#> native and doubtfully naturalised 9 1.905270e-05
#> native and naturalised 136 2.879075e-04
#> native and uncertain origin 2 4.233933e-06
#> naturalised 8765 1.855521e-02
#> not present 421606 8.925258e-01
#> presumed extinct 101 2.138136e-04
#> uncertain origin 22 4.657327e-05
You can also obtain the breakdown of species by established status
for a particular state/territory using state_diversity_counts():
state_diversity_counts("NSW", resources = resources)
#> # A tibble: 7 × 3
#> origin state num_species
#> <chr> <chr> <table[1d]>
#> 1 doubtfully naturalised NSW 93
#> 2 formerly naturalised NSW 8
#> 3 native NSW 5958
#> 4 native and doubtfully naturalised NSW 2
#> 5 native and naturalised NSW 34
#> 6 naturalised NSW 1580
#> 7 presumed extinct NSW 8
Using the established status data and state/territory information, we
can check whether a plant taxon is native anywhere in Australia using
native_anywhere_in_australia():
library(dplyr)
updated_gbif_names |>
sample_n(1) |> # Choosing a random species
pull(suggested_name) |> # Extracting this APC accepted name
native_anywhere_in_australia(resources = resources)
#> # A tibble: 1 × 2
#> species native_anywhere_in_aus
#> <chr> <chr>
#> 1 Solanum prinophyllum considered native to Australia by APC
Closer look at name standardisation with ‘APCalign’
create_taxonomic_update_lookup() is a simple wrapper
function for novice users who want to quickly check and standardise
taxon names. More experienced users can take a look at the
sub-functions match_taxa(), align_taxa() and
update_taxonomy() to see how taxon names are processed,
aligned and updated.
Aligning names to APC and APNI
The function align_taxa will:
- Clean up your taxonomic names
  - The functions standardise_names, strip_names and strip_names_extra standardise infraspecific taxon designations and clean up punctuation and whitespace
- Find the best alignment with APC or APNI to your taxonomic name using the function match_taxa
  - A taxonomic name flows through a progression of 50 match algorithms until it is able to be aligned to a name on either the APC or APNI list.
  - These include exact and fuzzy matches. Fuzzy matches are designed to capture small spelling mistakes and syntax errors in phrase names.
  - These include matches to the entire name string and matches on just select words in the sequence.
  - The sequence of matches has been carefully curated to align names with the fewest mistakes.
- Determine the taxon_rank to which the name can be resolved, based on its syntax.
  - For names that can only be resolved to genus, reformats the name to offer a standardised genus sp. name, with additional information/notes provided as part of the original name in square brackets, as in Acacia sp. [skinny leaves] or Acacia sp. [Broken Hill]
- Determine the taxonomic_reference (APC or APNI) of each name-alignment.
Note that align_taxa does
not seek to update outdated taxonomy. That process occurs
during the update_taxonomy step.
align_taxa instead aligns each name input to the closest
match amongst names documented by the APC and APNI.
library(dplyr)
aligned_gbif_taxa <- gbif_lite |>
pull(species) |>
align_taxa(resources = resources)
#> Checking alignments of 121 taxa
#> -> 0 names already matched; 0 names checked but without a match; 121 taxa yet to be checked
aligned_gbif_taxa |>
print(n = 6)
#> # A tibble: 129 × 7
#> original_name cleaned_name aligned_name taxonomic_dataset taxon_rank aligned_reason
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Tetratheca ciliata Tetratheca … Tetratheca … APC species Exact match o…
#> 2 Peganum harmala Peganum har… Peganum har… APC species Exact match o…
#> 3 Calotis multicaulis Calotis mul… Calotis mul… APC species Exact match o…
#> 4 Leptospermum triner… Leptospermu… Leptospermu… APC species Exact match o…
#> 5 Lepidosperma latera… Lepidosperm… Lepidosperm… APC species Exact match o…
#> 6 Enneapogon polyphyl… Enneapogon … Enneapogon … APC species Exact match o…
#> # ℹ 123 more rows
#> # ℹ 1 more variable: alignment_code <chr>
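To make the idea of a progression of matches more concrete, here is a minimal sketch of a three-step cascade: an exact match on the whole name, then on the first two words, then on the first word only. It is not the package's implementation (which uses many more steps and includes fuzzy matches); align_one, the reference vector and the matched_on labels are hypothetical names introduced for this illustration.

```r
# Hypothetical reference list standing in for the APC/APNI names
reference <- c("Acacia aneura", "Acacia", "Banksia serrata")

# Hypothetical helper, not an APCalign function: try progressively
# looser exact matches and report which step succeeded.
align_one <- function(name, reference) {
  words <- strsplit(trimws(name), "\\s+")[[1]]
  candidates <- c(
    full_name       = paste(words, collapse = " "),
    first_two_words = paste(head(words, 2), collapse = " "),
    first_word      = words[1]
  )
  # Stop at the first step that finds the candidate in the reference list
  for (step in names(candidates)) {
    if (candidates[[step]] %in% reference) {
      return(c(aligned_name = candidates[[step]], matched_on = step))
    }
  }
  # No step matched
  c(aligned_name = NA_character_, matched_on = NA_character_)
}

align_one("Acacia aneura", reference)           # matched on the full name
align_one("Acacia aneura var. nova", reference) # matched on the first two words
align_one("Acacia sp. Broken Hill", reference)  # matched on the first word (genus)
align_one("Unknownus plantus", reference)       # no match
```

Ordering the steps from strictest to loosest means a name is only aligned at a coarser level when no more precise match exists, which is the reasoning behind the curated sequence described above.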
For every aligned_name, align_taxa() will
provide an aligned_reason, which you can review as a table of
counts:
library(janitor)
aligned_gbif_taxa |>
pull(aligned_reason) |>
tabyl() |>
tibble()
#> # A tibble: 6 × 4
#> `pull(aligned_gbif_taxa, aligned_reason)` n percent valid_percent
#> <chr> <int> <dbl> <dbl>
#> 1 Exact match of taxon name to an APC-accepted canonical name o… 118 0.915 0.929
#> 2 Exact match of taxon name to an APC-known canonical name once… 6 0.0465 0.0472
#> 3 Exact match of taxon name to an APNI-listed canonical name on… 1 0.00775 0.00787
#> 4 Exact match of the first two words of the taxon name to an AP… 1 0.00775 0.00787
#> 5 Exact match of the first word of the taxon name to an APC-acc… 1 0.00775 0.00787
#> 6 <NA> 2 0.0155 NA
Configuring matching precision and aligned output
There are arguments in align_taxa that allow you to
select which of the 50 matching algorithms are activated/deactivated and
the degree of fuzziness of the fuzzy matching function:
- fuzzy_matches turns fuzzy matching on / off (it defaults to TRUE).
- fuzzy_abs_dist and fuzzy_rel_dist control the degree of fuzzy matching (they default to fuzzy_abs_dist = 3 & fuzzy_rel_dist = 0.2).
- imprecise_fuzzy_matches turns imprecise fuzzy matching on / off (it defaults to FALSE; when TRUE it uses fuzzy_abs_dist = 5 & fuzzy_rel_dist = 0.25).
- APNI_matches turns matches to the APNI list on / off (it defaults to TRUE).
- identifier allows you to specify a text string that is added to genus-level matches, indicating the site, study, etc., e.g. Acacia sp. [Blue Mountains]
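To give a feel for what the two distance limits mean, the sketch below tests whether a candidate name falls within an absolute and a relative edit-distance limit. It assumes that the distance is a Levenshtein edit distance (computed here with base R's utils::adist) and that the relative distance is the edit distance divided by the number of characters in the input name; the package's own rules may differ, and within_fuzzy_limits is a hypothetical helper, not an APCalign function.

```r
# Hypothetical helper: TRUE when `candidate` is within both limits of `input`.
# Assumes relative distance = edit distance / nchar(input).
within_fuzzy_limits <- function(input, candidate,
                                abs_dist = 3, rel_dist = 0.2) {
  d <- as.integer(utils::adist(input, candidate)) # Levenshtein edit distance
  d <= abs_dist && (d / nchar(input)) <= rel_dist
}

# One substituted letter in a 19-character name: within both limits
within_fuzzy_limits("Eucalyptus globulis", "Eucalyptus globulus")

# One missing letter in a 12-character name: within both limits
within_fuzzy_limits("Acacia anura", "Acacia aneura")

# A different genus: too many edits, outside the limits
within_fuzzy_limits("Acacia", "Banksia")
```

Using both limits together keeps short names from matching on a handful of edits that would change most of the word, while still tolerating small typos in long names.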
Updating to APC-accepted names
update_taxonomy() uses the information generated by
align_taxa() to update names to APC-accepted names
whenever possible.
updated_gbif_taxa <- aligned_gbif_taxa |>
update_taxonomy(resources = resources)
updated_gbif_taxa |>
print(n = 6)
#> # A tibble: 129 × 21
#> original_name aligned_name accepted_name suggested_name genus family taxon_rank
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Tetratheca ciliata Tetratheca c… Tetratheca c… Tetratheca ci… Tetr… Elaeo… species
#> 2 Peganum harmala Peganum harm… Peganum harm… Peganum harma… Pega… Nitra… species
#> 3 Calotis multicaulis Calotis mult… Calotis mult… Calotis multi… Calo… Aster… species
#> 4 Leptospermum trinervium Leptospermum… Leptospermum… Leptospermum … Lept… Myrta… species
#> 5 Lepidosperma laterale Lepidosperma… Lepidosperma… Lepidosperma … Lepi… Cyper… species
#> 6 Enneapogon polyphyllus Enneapogon p… Enneapogon p… Enneapogon po… Enne… Poace… species
#> # ℹ 123 more rows
#> # ℹ 14 more variables: taxonomic_dataset <chr>, taxonomic_status <chr>,
#> # taxonomic_status_aligned <chr>, aligned_reason <chr>, update_reason <chr>,
#> # subclass <chr>, taxon_distribution <chr>, scientific_name_authorship <chr>,
#> # taxon_ID <chr>, taxon_ID_genus <chr>, scientific_name_ID <chr>, canonical_name <chr>,
#> # row_number <dbl>, number_of_collapsed_taxa <dbl>
Taxonomic resources used for updating names
The APC includes all previously recorded taxonomic names for a current taxon concept, designating the currently-accepted name as taxonomic_status: accepted, while previously used or inappropriately used names for the taxon concept have alternative taxonomic statuses documented (e.g. taxonomic synonym, orthographic variant, misapplied).
The APC includes a column acceptedNameUsageID that links a taxon name with an alternative taxonomic status to the current taxon name, allowing outdated/inappropriately used names to be synced to their current name.
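The linking role of acceptedNameUsageID can be pictured with a small lookup on a toy table. This is only a sketch: the two rows, the IDs and the status labels are invented for illustration, and the column names other than acceptedNameUsageID are assumptions rather than the actual APC schema.

```r
# Toy table; IDs and status labels are made up for illustration
apc <- data.frame(
  taxon_ID            = c("id-001", "id-002"),
  canonical_name      = c("Corymbia maculata", "Eucalyptus maculata"),
  taxonomic_status    = c("accepted", "taxonomic synonym"),
  acceptedNameUsageID = c("id-001", "id-001"),
  stringsAsFactors = FALSE
)

# Follow acceptedNameUsageID back to the row holding the accepted name
apc$accepted_name <- apc$canonical_name[
  match(apc$acceptedNameUsageID, apc$taxon_ID)
]

apc[, c("canonical_name", "taxonomic_status", "accepted_name")]
```

Both rows resolve to the same accepted name, which is how an outdated name can be synced to its current one.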
Note: Names listed on the APNI but absent from the APC are
those that are designated as taxonomic_dataset: APNI by
APCalign. These are names that are currently unknown
by the APC. Over time, this list shrinks, as
taxonomists link ever more occasionally used name variants to an
APC-accepted taxon. However, for now, names listed only on the APNI
cannot be updated.
Name updates at different taxonomic levels
- update_taxonomy() divides names into lists based on the taxon_rank and taxonomic_dataset assigned by align_taxa, as each list requires different updating algorithms.
- Only taxonomic names that are designated as taxon_rank = species/infraspecific and taxonomic_dataset = APC can be updated to an APC-accepted name.
- For all other taxa, it may be possible to align the genus-name to an APC-accepted genus.
- For all taxa, a suggested_name is provided, selecting the accepted_name when available, and otherwise the aligned_name, but with, if possible, an updated, APC-accepted genus name.
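The fallback rule for suggested_name can be sketched in a couple of lines of base R. This is a simplified illustration on a hand-made data frame, not the package's code: it covers only the "accepted name if available, otherwise aligned name" part of the rule and leaves out the genus-name update.

```r
# Hand-made example rows; the second name has no APC-accepted name
taxa <- data.frame(
  aligned_name  = c("Tetratheca ciliata", "Acacia sp. [Broken Hill]"),
  accepted_name = c("Tetratheca ciliata", NA),
  stringsAsFactors = FALSE
)

# Use the accepted name when there is one, otherwise fall back to the aligned name
taxa$suggested_name <- ifelse(
  is.na(taxa$accepted_name),
  taxa$aligned_name,
  taxa$accepted_name
)

taxa
```

Because every row has an aligned name to fall back on, this rule returns a suggested name for every row even when no accepted name exists.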
Taxonomic splits
Taxonomic splits refer to instances where a single taxon concept is subsequently split into multiple taxon concepts. For such taxa, when the aligned_name is the “old” taxon concept name, it is impossible to know which of the currently accepted taxon concepts the name represents.
The function update_taxonomy includes an argument taxonomic_splits, offering three alternative outputs for taxon concepts that have been split.
- most_likely_species is the default value, and returns the accepted_name of the original taxon_concept; alternative names are documented in square brackets as part of the suggested name (Acacia aneura [alternative possible names: Acacia minyura (pro parte misapplied) | Acacia paraneura (pro parte misapplied) | Acacia quadrimarginea (misapplied)]).
- return_all returns all currently accepted names that were split from the original taxon_concept; this leads to an increase in the number of rows in the output table. (Acacia aneura, Acacia minyura and Acacia paraneura are each output as a separate row, each with a unique taxon_ID.)
- collapse_to_higher_taxon declares that for split names, there is no way to be certain about which accepted name is appropriate and therefore that the best possible match is at the genus level; no accepted_name is returned, the taxon_rank is demoted to genus and the suggested name documents the possible species-level names in square brackets (Acacia sp. [collapsed names: Acacia aneura (accepted) | Acacia minyura (pro parte misapplied) | Acacia paraneura (pro parte misapplied)]).
library(dplyr)
aligned_gbif_taxa |>
update_taxonomy(taxonomic_splits = "most_likely_species",
resources = resources) |>
filter(original_name == "Acacia aneura") # Subsetting Acacia aneura as an example
#> # A tibble: 1 × 21
#> original_name aligned_name accepted_name suggested_name genus family taxon_rank
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Acacia aneura Acacia aneura Acacia aneura Acacia aneura [alternat… Acac… Fabac… species
#> # ℹ 14 more variables: taxonomic_dataset <chr>, taxonomic_status <chr>,
#> # taxonomic_status_aligned <chr>, aligned_reason <chr>, update_reason <chr>,
#> # subclass <chr>, taxon_distribution <chr>, scientific_name_authorship <chr>,
#> # taxon_ID <chr>, taxon_ID_genus <chr>, scientific_name_ID <chr>, canonical_name <chr>,
#> # row_number <dbl>, number_of_collapsed_taxa <dbl>
aligned_gbif_taxa |>
update_taxonomy(taxonomic_splits = "return_all",
resources = resources) |>
filter(original_name == "Acacia aneura") # Subsetting Acacia aneura as an example
#> # A tibble: 3 × 21
#> original_name aligned_name accepted_name suggested_name genus family taxon_rank
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Acacia aneura Acacia aneura Acacia aneura Acacia aneura Acacia Fabaceae species
#> 2 Acacia aneura Acacia aneura Acacia minyura Acacia minyura Acacia Fabaceae species
#> 3 Acacia aneura Acacia aneura Acacia paraneura Acacia paraneura Acacia Fabaceae species
#> # ℹ 14 more variables: taxonomic_dataset <chr>, taxonomic_status <chr>,
#> # taxonomic_status_aligned <chr>, aligned_reason <chr>, update_reason <chr>,
#> # subclass <chr>, taxon_distribution <chr>, scientific_name_authorship <chr>,
#> # taxon_ID <chr>, taxon_ID_genus <chr>, scientific_name_ID <chr>, canonical_name <chr>,
#> # row_number <dbl>, number_of_collapsed_taxa <dbl>
aligned_gbif_taxa |>
update_taxonomy(taxonomic_splits = "collapse_to_higher_taxon",
resources = resources) |>
filter(original_name == "Acacia aneura") # Subsetting Acacia aneura as an example
#> # A tibble: 1 × 21
#> original_name aligned_name accepted_name suggested_name genus family taxon_rank
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Acacia aneura Acacia aneura Acacia sp. Acacia sp. [collapsed n… Acac… Fabac… species
#> # ℹ 14 more variables: taxonomic_dataset <chr>, taxonomic_status <chr>,
#> # taxonomic_status_aligned <chr>, aligned_reason <chr>, update_reason <chr>,
#> # subclass <chr>, taxon_distribution <chr>, scientific_name_authorship <chr>,
#> # taxon_ID <chr>, taxon_ID_genus <chr>, scientific_name_ID <chr>, canonical_name <chr>,
#> # row_number <dbl>, number_of_collapsed_taxa <dbl>