Function notes

APCalign functions

APCalign exports 10 functions to facilitate the alignment of submitted plant names to scientific names on the APC and APNI lists. They are listed in order of likelihood of use.

Taxon name alignment and updating functions

create_taxonomic_update_lookup

description: This function takes a list of Australian plant names that need to be reconciled with current taxonomy and generates a lookup table of the best-possible scientific name match for each input name. It first runs the function align_taxa, then the function update_taxonomy, to produce the output. The aligned name is a plant name that has been aligned to a taxon name in the APC or APNI by the align_taxa function.

usage notes: This is APCalign’s core function, merging together the alignment and updating of taxonomy.

arguments:

taxa                                      #input vector of taxon names
stable_or_current_data = "stable"  
version = default_version()  
taxonomic_splits = "most_likely_species"  #options for names with ambiguous taxonomic histories
full = FALSE                              #outputs fewer (FALSE) or more (TRUE) columns
APNI_matches = TRUE                       #include (TRUE) or exclude (FALSE) APNI list
imprecise_fuzzy_matches = FALSE           #disallow (FALSE) or allow (TRUE) imprecise fuzzy matches
identifier = NA_character_                #include a unique identifier as part of informal names
resources = load_taxonomic_resources()  
output = NULL

output: A data frame with rows representing each taxon and columns documenting taxon metadata (original_name, aligned_name, accepted_name, suggested_name, genus, family, taxon_rank, taxonomic_dataset, taxonomic_status, taxonomic_status_aligned, aligned_reason, update_reason, subclass, taxon_distribution, scientific_name_authorship, taxon_ID, taxon_ID_genus, scientific_name_ID, row_number, number_of_collapsed_taxa).

example:

input <- c("Banksia serrata", "Banksia serrate", "Banksia cerrata", "Banksea serrata", "Banksia serrrrata", "Dryandra")
resources <- load_taxonomic_resources()

updated_taxa <- 
  APCalign::create_taxonomic_update_lookup(
    taxa = input,
    identifier = "APCalign test",
    full = TRUE,
    resources = resources
  )

Alternatively, start with a CSV file that contains a column of taxon names to align:

taxon_list <-                          #or load data through the RStudio menu
  readr::read_csv(here::here("inst", "extdata", "test_taxa.csv"),
    show_col_types = FALSE
  )
resources <- load_taxonomic_resources()

updated_taxa <- 
  APCalign::create_taxonomic_update_lookup(
    taxa = taxon_list$original_name,
    identifier = "APCalign test",
    full = TRUE,
    resources = resources
  )

notes
- If you will be running the function APCalign::create_taxonomic_update_lookup many times, it is best to load the taxonomic resources separately using resources <- load_taxonomic_resources(), then add the argument resources = resources
- The name Banksia cerrata does not align as the fuzzy matching algorithm does not allow the first letter of the genus and species epithet to change.
- The argument taxonomic_splits allows you to choose the outcome for updating the names of taxa with ambiguous taxonomic histories. This applies to scientific names that were once attached to a more broadly circumscribed taxon concept that was then split into several more narrowly circumscribed taxon concepts, one of which retains the original name. There are three options: most_likely_species returns the name that is retained, with alternative names documented in square brackets; return_all adds additional rows to the output, one for each possible taxon concept; collapse_to_higher_taxon returns the genus with possible names in square brackets.
- The argument identifier allows you to add a fixed text string to all genus- and family-level names. For example, identifier = "Royal NP" would return `Acacia sp. [Royal NP]`.
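To see how the three taxonomic_splits options behave on your own data, one approach is to run the lookup once per option and compare the results. The sketch below is only illustrative: the object names (split_options, lookups) are not part of APCalign, and the input names are reused from the example above, so they may not have an ambiguous taxonomic history. Substitute names from your own list.

```r
library(APCalign)

# Load the taxonomic resources once and reuse them for every call
resources <- load_taxonomic_resources()

# Illustrative input; replace with names from your own data
input <- c("Banksia serrata", "Dryandra")

# The three options described in the notes above
split_options <- c("most_likely_species", "return_all", "collapse_to_higher_taxon")

# Run the lookup once per option
lookups <- lapply(split_options, function(opt) {
  create_taxonomic_update_lookup(
    taxa = input,
    taxonomic_splits = opt,
    resources = resources
  )
})
names(lookups) <- split_options

# Compare row counts; return_all may add rows for names with several possible taxon concepts
sapply(lookups, nrow)
```

Comparing the row counts is a quick way to check whether any input name was affected by a split before choosing an option for the full dataset.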

align_taxa

description: This function finds taxonomic alignments in the APC or APNI. It uses the internal function match_taxa to attempt to match input strings to taxon names in the APC/APNI. It sequentially searches for matches against more than 20 different string patterns, prioritising exact matches (to accepted names as well as synonyms and orthographic variants) over fuzzy matches. It prioritises matches to taxa in the APC over names in the APNI. It identifies string patterns in input names that suggest a name can only be aligned to a genus (hybrids that are not in the APC/APNI; graded species; taxa not identified to species), and indicates that these names only have a genus-rank match.

usage notes: Run this function if you wish to see the details of the matching algorithm, including the many output columns of name variants that the matching function compares against as it seeks the best alignment. You may also choose this function if you want to adjust the “fuzziness” level for fuzzy matches, an option not available in create_taxonomic_update_lookup. This function is the first half of create_taxonomic_update_lookup.

arguments:

original_name                           #input vector of taxon names
output = NULL  
full = FALSE                            #outputs fewer (FALSE) or more (TRUE) columns
resources = load_taxonomic_resources()  
fuzzy_abs_dist = 3                      #set number of characters allowed to be different for fuzzy match
fuzzy_rel_dist = 0.2                    #set proportion of characters allowed to be different for fuzzy match
fuzzy_matches = TRUE                    #disallow (FALSE) or allow (TRUE) any fuzzy matches
imprecise_fuzzy_matches = FALSE         #disallow (FALSE) or allow (TRUE) imprecise fuzzy matches
APNI_matches = TRUE                     #include (TRUE) or exclude (FALSE) APNI list
identifier = NA_character_              #include a unique identifier as part of informal names

output: A data frame with rows representing each taxon and with columns documenting the alignment made, the reason for this alignment, and a selection of taxon name mutations to which the original name was compared (original_name, aligned_name, taxonomic_dataset, taxon_rank, aligned_reason, alignment_code, cleaned_name, stripped_name, stripped_name2, trinomial, binomial, genus, fuzzy_match_genus, fuzzy_match_genus_synonym, fuzzy_match_genus_APNI, fuzzy_match_cleaned_APC, fuzzy_match_cleaned_APC_synonym, fuzzy_match_cleaned_APC_imprecise, fuzzy_match_cleaned_APC_synonym_imprecise, fuzzy_match_binomial, fuzzy_match_binomial_APC_synonym, fuzzy_match_trinomial, fuzzy_match_trinomial_synonym, fuzzy_match_cleaned_APNI, fuzzy_match_cleaned_APNI_imprecise).

example:

input <- c("Banksia serrata", "Banksia serrate", "Banksia cerrata", "Banksia serrrrata", "Dryandra sp.", "Banksia big red flowers")
resources <- load_taxonomic_resources()


aligned_taxa <-
  APCalign::align_taxa(
    original_name = input,
    identifier = "APCalign test",
    full = TRUE,
    resources = resources
  )

notes
- If you will be running the function APCalign::align_taxa many times, it is best to load the taxonomic resources separately using resources <- load_taxonomic_resources(), then add the argument resources = resources.
- The name Banksia cerrata does not align as the fuzzy matching algorithm does not allow the first letter of the genus and species epithet to change.
- With this function you have the option of changing the fuzzy matching parameters. The defaults, which only allow fuzzy matches that differ by 3 (or fewer) characters AND by 20% (or less) of characters, have been carefully calibrated to catch just about all typos while very rarely mis-aligning a name. If you wish to introduce less conservative fuzzy matching, it is recommended that you manually check the aligned names.
- It is recommended that you begin with imprecise_fuzzy_matches = FALSE (the default), as quite a few of the less precise fuzzy matches are likely to be erroneous. This argument should be turned on only if you plan to check all alignments manually.
- The argument identifier allows you to add a fixed text string to all genus- and family-level names. For example, identifier = "Royal NP" would return Acacia sp. [Royal NP].
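The interaction between the two default thresholds and the first-letter rule can be illustrated with a small base R sketch. This is not APCalign's internal matching code: the helper passes_fuzzy_thresholds is hypothetical, it uses utils::adist for the edit distance, and it assumes the relative distance is measured against the length of the input name.

```r
# Hypothetical helper illustrating the rules in the notes above.
# It is NOT the function APCalign uses internally.
passes_fuzzy_thresholds <- function(input, candidate,
                                    fuzzy_abs_dist = 3,
                                    fuzzy_rel_dist = 0.2) {
  # Edit distance (insertions, deletions, substitutions) from base R
  dist <- drop(utils::adist(input, candidate))

  # Assumption: relative distance is taken against the input's length
  rel <- dist / nchar(input)

  # First letter of each word (genus, epithet, ...) must be unchanged
  first_letters <- function(x) substr(strsplit(x, " ")[[1]], 1, 1)
  same_initials <- identical(first_letters(input), first_letters(candidate))

  # Both thresholds AND the first-letter rule must hold
  dist <= fuzzy_abs_dist && rel <= fuzzy_rel_dist && same_initials
}

passes_fuzzy_thresholds("Banksia serrate", "Banksia serrata")    # TRUE: 1 edit
passes_fuzzy_thresholds("Banksia cerrata", "Banksia serrata")    # FALSE: first letter of epithet changed
passes_fuzzy_thresholds("Banksia serrrrata", "Banksia serrata")  # TRUE: 2 edits, about 12% of characters
```

Raising fuzzy_abs_dist or fuzzy_rel_dist in align_taxa loosens the corresponding threshold, which is why manual checking is advised when moving away from the defaults.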

update_taxonomy

description: This function uses the APC to update the taxonomy of names aligned to a taxon concept listed in the APC to the currently accepted name for the taxon concept. The aligned_data data frame that is input must contain 5 columns, original_name, aligned_name, taxon_rank, taxonomic_dataset, and aligned_reason, which are the columns output by the function APCalign::align_taxa(). The aligned name is a plant name that has been aligned to a taxon name in the APC or APNI by the align_taxa function.

usage notes: As the input for this function is a table with 5 columns (output by align_taxa), this function will only be used when you explicitly want to separate the alignment and updating components of APCalign. This function is the second half of create_taxonomic_update_lookup.

arguments:

aligned_data                              #input table of aligned names and information about the aligned name
taxonomic_splits = "most_likely_species"  #options for names with ambiguous taxonomic histories
output = NULL   
resources = load_taxonomic_resources()
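
A two-step run, separating alignment from updating, might look like the sketch below. It uses only the arguments listed above; the input vector is illustrative, and the column selection assumes the default (full = FALSE) output of align_taxa contains the five columns named in the description.

```r
library(APCalign)

# Load the taxonomic resources once and reuse them in both steps
resources <- load_taxonomic_resources()

# Illustrative input names
input <- c("Banksia serrata", "Banksia serrate", "Dryandra sp.")

# Step 1: align the submitted names to names in the APC or APNI
aligned_taxa <- align_taxa(
  original_name = input,
  resources = resources
)

# Inspect the alignments before updating the taxonomy
aligned_taxa[, c("original_name", "aligned_name", "aligned_reason")]

# Step 2: update the aligned names to currently accepted names
updated_taxa <- update_taxonomy(
  aligned_data = aligned_taxa,
  taxonomic_splits = "most_likely_species",
  resources = resources
)
```

Splitting the workflow this way gives you a chance to review the alignments between the two steps.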

Diversity and distribution functions

create_species_state_origin_matrix

description: This function processes the geographic data available in the APC and returns the state-level origin status (native, introduced, or a more complicated origin) for all taxa.

arguments:

resources = load_taxonomic_resources()

output: A data frame with rows representing each species and columns for taxon name and each state. The values in each cell represent the origin of the species in that state.
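
A minimal call, assuming the resources have been loaded once beforehand, could look like this. The object name origin_matrix is illustrative.

```r
library(APCalign)

# Load the taxonomic resources once
resources <- load_taxonomic_resources()

# Build the species-by-state origin table
origin_matrix <- create_species_state_origin_matrix(resources = resources)

# Check the dimensions and preview the first rows
dim(origin_matrix)
head(origin_matrix)
```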

native_anywhere_in_australia

description: This function checks if the given species is native anywhere in Australia according to the APC. Note that this will not detect within-Australia introductions, e.g. if a species is from Western Australia and is invasive on the east coast.

arguments:

species                                  #input vector of taxon names
resources = load_taxonomic_resources()

output: A data frame with rows representing each taxon and two columns: species, which is the same as the unique values of the input species, and native_anywhere_in_aus, a vector indicating whether each species is native anywhere in Australia, introduced by humans from elsewhere, or unknown with respect to the APC resource.
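
As a short usage sketch, the function can be called on a vector of names as below. The species chosen are illustrative only, and no particular result is implied for them.

```r
library(APCalign)

# Load the taxonomic resources once
resources <- load_taxonomic_resources()

# Illustrative input names
species <- c("Banksia serrata", "Pinus radiata")

# One row per unique input name, with its native status according to the APC
native_check <- native_anywhere_in_australia(
  species = species,
  resources = resources
)
native_check
```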

state_diversity_counts

description: This function calculates state-level diversity for native, introduced, and more complicated species origins based on the geographic data available in the APC.

arguments:

state                                    #state for which diversity should be summarised
resources = load_taxonomic_resources()

output: A data frame with three columns: “origin” indicating the origin of the species, “state” indicating the Australian state or territory, and “num_species” indicating the number of species for that origin and state.
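
A usage sketch for a single state follows. The value "NSW" is an assumption about the state codes the function accepts; check the function's help page for the full list of valid values.

```r
library(APCalign)

# Load the taxonomic resources once
resources <- load_taxonomic_resources()

# Assumption: "NSW" is an accepted state code
nsw_counts <- state_diversity_counts(
  state = "NSW",
  resources = resources
)

# Columns: origin, state, num_species
nsw_counts
```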

Utility functions

load_taxonomic_resources

description: This function loads two taxonomic datasets for Australia’s vascular plants, the APC and APNI, into the global environment. It accesses taxonomic data from a dataset using the provided version number or the default version. The function creates several data frames by filtering and selecting data from the loaded lists.

usage notes: This function is called by many other APCalign functions, but is unlikely to be used independently by an APCalign user.

arguments:

stable_or_current_data = "stable"  
version = default_version()  
reload = FALSE

output: Several data frames that include subsets of the APC/APNI based on taxon rank and taxonomic status.

standardise_names

description: This function standardises taxon names by performing a series of text substitutions to remove common inconsistencies in taxonomic nomenclature. The function takes a character vector of taxon names as input and returns a character vector of taxon names using standardised taxonomic syntax as output. In particular it standardises taxon rank abbreviations and qualifiers (subsp., var., f.), as people use many variants of these terms. It also standardises or removes a few additional filler words used within taxon names (affinis becomes aff.; s.l. and s.s. are removed).

arguments:

taxon_names           #input vector of taxon names

output: A character vector of standardised taxon names.
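
A quick way to see the substitutions is to pass in a few inconsistently written names. The input strings below are made up for illustration, and the exact output is not shown because it depends on the package version.

```r
library(APCalign)

# Illustrative names with non-standard rank abbreviations and filler words
messy_names <- c(
  "Banksia integrifolia ssp. integrifolia",
  "Acacia affinis aneura",
  "Eucalyptus globulus s.l."
)

# Returns a character vector of the same length using standardised syntax
clean_names <- standardise_names(messy_names)

# Compare the input and output side by side
data.frame(original = messy_names, standardised = clean_names)
```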

strip_names

description: Given a vector of taxonomic names, this function removes subtaxa designations (“subsp.”, “var.”, “f.”, and “ser”), special characters (e.g., “-”, “.”, “(”, “)”, “?”), and extra whitespace. The resulting vector of names is also converted to lowercase.

arguments:

taxon_names            #input vector of taxon names

output: A character vector of stripped taxonomic names, with subtaxa designations, special characters, and extra whitespace removed, and all letters converted to lowercase.

strip_names_2

description: Given a vector of taxonomic names, this function removes subtaxa designations (“subsp.”, “var.”, “f.”, and “ser”), additional filler words and characters (“ x ” [hybrid taxa], “sp.”), special characters (e.g., “-”, “.”, “(”, “)”, “?”), and extra whitespace. The resulting vector of names is also converted to lowercase.

arguments:

taxon_names             #input vector of taxon names

output: A character vector of stripped taxonomic names, with subtaxa designations, special characters, additional filler words and extra whitespace removed, and all letters converted to lowercase.
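
Running strip_names and strip_names_2 on the same input shows how the two functions differ. The names below are illustrative, and the outputs are not shown here.

```r
library(APCalign)

# Illustrative names: a subspecies, an unidentified species, and a hybrid
names_in <- c(
  "Banksia integrifolia subsp. integrifolia",
  "Eucalyptus sp.",
  "Banksia ericifolia x Banksia spinulosa"
)

# strip_names_2 additionally removes filler such as " x " and "sp."
stripped <- data.frame(
  original = names_in,
  strip_names = strip_names(names_in),
  strip_names_2 = strip_names_2(names_in)
)
stripped
```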

Elizabeth Wenk

2024-01-22
