Function notes
Elizabeth Wenk
2024-01-22
Source:vignettes/articles/function_notes.Rmd
function_notes.Rmd
APCalign functions
APCalign exports 10 functions to facilitate the alignment of submitted plant names to scientific names on the APC and APNI lists. They are listed in order of likelihood of use.
Taxon name alignment and updating functions
create_taxonomic_update_lookup
description: This function takes a list of
Australian plant names that need to be reconciled with current taxonomy
and generates a lookup table of the best-possible scientific name match
for each input name. It uses first the function align_taxa
,
then the function update_taxonomy
to achieve the output.
The aligned name is plant name that has been aligned to a taxon name in
the APC or APNI by the align_taxa function.
usage notes: This is APCalign’s core function, merging together the alignment and updating of taxonomy.
arguments:
taxa #input vector of taxon names
stable_or_current_data = "stable"
version = default_version()
taxonomic_splits = "most_likely_species" #options for names with ambiguous taxonomic histories
full = FALSE #outputs fewer (FALSE) or more (TRUE) columns
APNI_matches = TRUE #include (TRUE) or exclude (FALSE) APNI list
imprecise_fuzzy_matches = FALSE #disallow (FALSE) or allow (TRUE) imprecise fuzzy matches
identifier = NA_character_ #include a unique identifier as part of informal names
resources = load_taxonomic_resources()
output = NULL
output: A data frame with rows representing each taxon and columns documenting taxon metadata (original_name, aligned_name, accepted_name, suggested_name, genus, family, taxon_rank, taxonomic_dataset, taxonomic_status, taxonomic_status_aligned, aligned_reason, update_reason, subclass, taxon_distribution, scientific_name_authorship, taxon_ID, taxon_ID_genus, scientific_name_ID, row_number, number_of_collapsed_taxa).
example:
input <- c("Banksia serrata", "Banksia serrate", "Banksia cerrata", "Banksea serrata", "Banksia serrrrata", "Dryandra")
resources <- load_taxonomic_resources()
updated_taxa <-
APCalign::create_taxonomic_update_lookup(
taxa = input,
identifier = "APCalign test",
full = TRUE,
resources = resources
)
or, start with a csv file where there is a column of taxon names to align
taxon_list <- #or load data through the R studio menu
readr::read_csv(here("inst/", "extdata", "test_taxa.csv"),
show_col_types = FALSE
)
resources <- load_taxonomic_resources()
updated_taxa <-
APCalign::create_taxonomic_update_lookup(
taxa = taxon_list$original_name,
identifier = "APCalign test",
full = TRUE,
resources = resources
)
notes
- If you will be running the function
APCalign::create_taxonomic_update_lookup
many times, it is
best to load the taxonomic resources separately using
resources <- load_taxonomic_resources()
, then add the
argument resources = resources
- The name Banksia cerrata
does not align as the fuzzy
matching algorithm does not allow the first letter of the genus and
species epithet to change.
- The argument taxonomic_splits
allows you to choose the
outcome for updating the names of taxa with ambiguous taxonomic
histories; this applies to scientific names that were once attached to a
more broadly circumscribed taxon concept, that was then split into
several more narrowly circumscribed taxon concepts, one of which retains
the original name. There are three options:
most_likely_species
returns the name that is retained, with
alternative names documented in square brackets; return_all
adds additional rows to the output, one for each possible taxon concept;
collapse_to_higher_taxon
returns the genus with possible
names in square brackets.
- The argument identifier
allows you to add a fix text
string to all genus- and family- level names, such as
identifier = "Royal NP"
would return `Acacia sp. [Royal
NP]`.
align_taxa
description: This function finds taxonomic
alignments in the APC or APNI. It uses the internal function
match_taxa
to attempt to match input strings to taxon names
in the APC/APNI. It sequentially searches for matches against more than
20 different string patterns, prioritising exact matches (to accepted
names as well as synonyms, orthographic variants) over fuzzy matches. It
prioritises matches to taxa in the APC over names in the APNI. It
identifies string patterns in input names that suggest a name can only
be aligned to a genus (hybrids that are not in the APC/ANI; graded
species; taxa not identified to species), and indicates these names only
have a genus-rank match.
usage notes: Users will run this function if they
wish to see the details of the matching algorithms, the many output
columns that the matching function compares to as it seeks the best
alignment. They may also select this function if they want to adjust the
“fuzziness” level for fuzzy matches, options not allowed in
create_taxonomic_update_lookup
. This function is the first
half of create_taxonomic_update_lookup
.
arguments:
original_name #input vector of taxon names
output = NULL
full = FALSE #outputs fewer (FALSE) or more (TRUE) columns
resources = load_taxonomic_resources()
fuzzy_abs_dist = 3 #set number of characters allowed to be different for fuzzy match
fuzzy_rel_dist = 0.2 #set proportion of characters allowed to be different for fuzzy match
fuzzy_matches = TRUE #disallow (FALSE) or allow (TRUE) any fuzzy matches
imprecise_fuzzy_matches = FALSE #disallow (FALSE) or allow (TRUE) imprecise fuzzy matches
APNI_matches = TRUE #include (TRUE) or exclude (FALSE) APNI list
identifier = NA_character #include a unique identifier as part of informal names
output: A data frame with rows representing each taxon and with columns documenting the alignment made, the reason for this alignment, and a selection of taxon name mutations to which the original name was compared (original_name, aligned_name, taxonomic_dataset, taxon_rank, aligned_reason, alignment_code, cleaned_name, stripped_name, stripped_name2, trinomial, binomial, genus, fuzzy_match_genus, fuzzy_match_genus_synonym, fuzzy_match_genus_APNI, fuzzy_match_cleaned_APC, fuzzy_match_cleaned_APC_synonym, fuzzy_match_cleaned_APC_imprecise, fuzzy_match_cleaned_APC_synonym_imprecise, fuzzy_match_binomial, fuzzy_match_binomial_APC_synonym, fuzzy_match_trinomial, fuzzy_match_trinomial_synonym, fuzzy_match_cleaned_APNI, fuzzy_match_cleaned_APNI_imprecise).
example:
input <- c("Banksia serrata", "Banksia serrate", "Banksia cerrata", "Banksia serrrrata", "Dryandra sp.", "Banksia big red flowers")
resources <- load_taxonomic_resources()
aligned_taxa <-
APCalign::align_taxa(
original_name = input,
identifier = "APCalign test",
full = TRUE,
resources = resources
)
notes
- If you will be running the function
APCalign::create_taxonomic_update_lookup
many times, it is
best to load the taxonomic resources separately using
resources <- load_taxonomic_resources()
, then add the
argument resources = resources
- The name Banksia cerrata
does not align as the fuzzy
matching algorithm does not allow the first letter of the genus and
species epithet to change.
- With this function you have the option of changing the fuzzy matching
parameters. The defaults, with fuzzy matches only allowing changes of 3
(or fewer) characters AND 20% (or less) of characters has been carefully
calibrated to catch just about all typos, but very, very rarely
mis-align a name. If you wish to introduce less conservative fuzzy
matching it is recommended you manually check the aligned names.
- It is recommended that you begin with
imprecise_fuzzy_matches = FALSE
(the default), as quite a
few of the less precise fuzzy matches are likely to be erroneous. This
argument should be turned on only if you plan to check all alignments
manually.
- The argument identifier
allows you to add a fix text
string to all genus- and family- level names, such as
identifier = "Royal NP"
would return
Acacia sp. [Royal NP]
.
update_taxonomy
description: This function uses the APC to update
the taxonomy of names aligned to a taxon concept listed in the APC to
the currently accepted name for the taxon concept. The aligned_data data
frame that is input must contain 5 columns, originial_name
,
aligned_name
, taxon_rank
,
taxonomic_dataset
, and aligned_reason
, the
columns output by the function APCalign::align_taxa()
. The
aligned name is a plant name that has been aligned to a taxon name in
the APC or APNI by the align_taxa function.
usage notes: As the input for this function is a
table with 5 columns (output by align_taxa
), this function
will only be used when you explicitly want to separate the
aligment
and updating
components of APCalign.
This function is the second half of
create_taxonomic_update_lookup
.
arguments:
aligned_data #input table of aligned names and information about the aligned name
taxonomic_splits = "most_likely_species" #options for names with ambiguous taxonomic histories
output = NULL
resources = load_taxonomic_resources()
output: A data frame with rows representing each taxon and columns documenting taxon metadata (original_name, aligned_name, accepted_name, suggested_name, genus, family, taxon_rank, taxonomic_dataset, taxonomic_status, taxonomic_status_aligned, aligned_reason, update_reason, subclass, taxon_distribution, scientific_name_authorship, taxon_ID, taxon_ID_genus, scientific_name_ID, row_number, number_of_collapsed_taxa).
Diversity and distribution functions
create_species_state_origin_matrix
description: This function processes the geographic data available in the APC and returns state level native, introduced and more complicated origins status for all taxa.
arguments:
resources = load_taxonomic_resources()
output: A data frame with rows representing each species and columns for taxon name and each state . The values in each cell represent the origin of the species in that state.
native_anywhere_in_australia
description: This function checks if the given species is native anywhere in Australia according to the APC. Note that this will not detect within-Australia introductions, e.g. if a species is from Western Australia and is invasive on the east coast.
arguments:
species #input vector of taxon names
resources = load_taxonomic_resources()
output: A data frame with rows representing each
taxon and two columns: species
, which is the same as the
unique values of the input species
, and
native_anywhere_in_aus
, a vector indicating whether each
species is native anywhere in Australia, introduced by humans from
elsewhere, or unknown with respect to the APC resource.
state_diversity_counts
description: This function calculates state-level diversity for native, introduced, and more complicated species origins based on the geographic data available in the APC.
arguments:
state #state for which diversity should be summarised
resources = load_taxonomic_resources()
output: A data frame with three columns: “origin” indicating the origin of the species, “state” indicating the Australian state or territory, and “num_species” indicating the number of species for that origin and state.
Utility functions
load_taxonomic_resources
description: This function loads two taxonomic datasets for Australia’s vascular plants, the APC and APNI, into the global environment. It accesses taxonomic data from a dataset using the provided version number or the default version. The function creates several data frames by filtering and selecting data from the loaded lists.
usage notes: This function is called by many other APC functions, but is unlikely to be used independently by a APCalign user.
arguments:
stable_or_current_data = "stable"
version = default_version()
reload = FALSE
output: Several dataframes that include subsets of the APC/APNI based on taxon rank and taxonomic status.
standardise_names
description: This function standardises taxon names by performing a series of text substitutions to remove common inconsistencies in taxonomic nomenclature. The function takes a character vector of taxon names as input and returns a character vector of taxon names using standardised taxonomic syntax as output. In particular it standardises taxon rank abbreviations and qualifiers (subsp., var., f.), as people use many variants of these terms. It also standardises or removes a few additional filler words used within taxon names (affinis becomes aff.; s.l. and s.s. are removed).
arguments:
taxon_names #input vector of taxon names
output: A character vector of standardised taxon names.
strip_names
description: Given a vector of taxonomic names, this function removes subtaxa designations (“subsp.”, “var.”, “f.”, and “ser”), special characters (e.g., “-”, “.”, “(”, “)”, “?”), and extra whitespace. The resulting vector of names is also converted to lowercase.
arguments:
taxon_names #input vector of taxon names
output: A character vector of stripped taxonomic names, with subtaxa designations, special characters, and extra whitespace removed, and all letters converted to lowercase.
strip_names_2
description: Given a vector of taxonomic names, this function removes subtaxa designations (“subsp.”, “var.”, “f.”, and “ser”), additional filler words and characters (” x ” [hybrid taxa], “sp.”), special characters (e.g., “-”, “.”, “(”, “)”, “?”), and extra whitespace. The resulting vector of names is also converted to lowercase.
arguments:
taxon_names #input vector of taxon names
output: A character vector of stripped taxonomic names, with subtaxa designations, special characters, additional filler words and extra whitespace removed, and all letters converted to lowercase.