GBIF.org
The Global Biodiversity Information Facility (GBIF.org) has become the standard open-access online database of occurrence records for all manner of biological organisms. It was initially a clearinghouse for museum records (such as herbarium specimens), but now includes iNaturalist observations (those that are rated ‘research’ grade), survey data, and a growing variety of taxonomic and checklist sources.
While GBIF’s expansion increases the overall value of the database, it also means we need to be more circumspect in how we use the data. When I first encountered GBIF decades ago, I used it as one of several sources for herbarium records. I searched for the species I was looking for, and received a list of museum specimens. Nowadays most, maybe all, online herbarium data is mirrored by GBIF, so I no longer need to chase down multiple websites to round up all the herbarium records I need.
However, I can no longer assume that a GBIF record represents a physical specimen. It could be: a human observation, with or without an associated image; documentation harvested from sequence data submitted to GenBank; or an entry from a field survey, and surveys sometimes contain records for both presences and absences. And of course, all of these records are subject to any number of issues: transcription errors, identification errors, georeferencing errors.
All of which is to say, we need a way to filter the results of our GBIF query to ensure the data we receive is fit for purpose.
Getting GBIF data
Step one is actually getting the data from GBIF. You can do this from the website, but I now prefer to do this in my R scripts, using the rgbif package.
Finding taxon names
To get started, we need to match our name up with the GBIF taxonomic backbone:
library(rgbif)
conyza <- name_backbone("Conyza canadensis")
erigeron <- name_backbone("Erigeron canadensis")
This snippet searches the GBIF database for “Conyza canadensis” and “Erigeron canadensis” and returns the closest matches. The results can be a bit confusing to interpret. To make sense of it, we need to understand a few terms.
A `usageKey` is a unique number associated with every taxon in the database, at every level. This includes species, subspecies, genera etc., and importantly, it includes both accepted names and synonyms. Complicating things, GBIF taxonomy tables use the term `usageKey`, but in the individual observation records the term `taxonKey` is used instead. They both mean the same thing: you'll use the `usageKey` you get from your `name_backbone` search as the `taxonKey` in the query you submit (see below).

The `acceptedUsageKey` (or `acceptedTaxonKey`) is the number associated with every accepted taxon. For taxa that are synonyms, their `acceptedUsageKey` is the `usageKey` of the accepted taxon they belong to.
To illustrate the distinctions, let’s look at the values for the examples above.
| Key | Value |
|---|---|
| scientificName | Conyza canadensis (L.) Cronquist |
| usageKey | 5404801 |
| status | SYNONYM |
| acceptedUsageKey | 3146791 |

| Key | Value |
|---|---|
| scientificName | Erigeron canadensis L. |
| usageKey | 3146791 |
| status | ACCEPTED |
In this case Conyza canadensis is a synonym of Erigeron canadensis. The name Conyza canadensis has its own `usageKey`: 5404801. Its `acceptedUsageKey` is 3146791, which is the `usageKey` for Erigeron canadensis. An additional wrinkle is that accepted taxa only have a `usageKey`; they don't have an `acceptedUsageKey`. You can also use the `status` field to check whether a taxon is accepted or a synonym.
Understanding this is important to make sure you get the records you're after when you query the database. If you request data for `taxonKey` 5404801 (the `usageKey` for Conyza canadensis), you'll get records with that name on them, but not records for Erigeron canadensis. On the other hand, if you search for `taxonKey` 3146791, you'll get records for Erigeron canadensis, and also all records for any synonyms of that name, including Conyza canadensis.
In other words, searching for an accepted taxon will return results for that taxon including all its synonyms. Searching for a synonym will return results only for that synonym.
You can search for records by name, without using the `usageKey`. But it's safer to look up and use the `usageKey`, to confirm that the name you asked for matches something in the GBIF database.
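Putting the pieces above together, a small helper can resolve any name, synonym or accepted, to the key of its accepted taxon. This is a sketch: `accepted_key` is a hypothetical function name, and running it requires an internet connection to query the GBIF backbone.

```r
library(rgbif)

# Resolve a name to the usageKey of its accepted taxon.
# Synonyms carry an acceptedUsageKey; accepted names only have a usageKey,
# so we fall back to that when acceptedUsageKey is absent.
accepted_key <- function(name) {
  hit <- name_backbone(name)
  if ("acceptedUsageKey" %in% names(hit)) {
    hit$acceptedUsageKey
  } else {
    hit$usageKey
  }
}

accepted_key("Conyza canadensis")   # 3146791 (resolved via synonymy)
accepted_key("Erigeron canadensis") # 3146791 (already accepted)
```

Either way you arrive at the same `taxonKey` to use in your query, which is the behaviour you usually want.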
Preparing a query
Once you have a list of one or more `taxonKey` values, you're ready to request your data. For large record sets, and especially if you want to request multiple species at once, the `rgbif` function `occ_download_queue` is very convenient. Note that you need a (free) GBIF account in order to use this.

This is a three-step process. You create your query with `occ_download_prep`, submit the query to GBIF with `occ_download_queue`, and once the query is done you download the results via `occ_download_get`.

Starting with `occ_download_prep`, a basic query only requires your account credentials and one or more `taxonKey` values:
myQuery <- occ_download_prep(
  pred_in("taxonKey", c(3146791, 3189859)),
  pred("hasCoordinate", TRUE),
  format = "DWCA",
  user = "YourUserName",
  pwd = "YourGBIFPassword",
  email = "your@email.address"
)
SECURITY NOTE You can provide your GBIF username and password in your script as I have done, but there are more secure ways to submit your credentials without listing them in your code. See the rgbif documentation for better options.
This will create a query for two taxa, as specified by the provided keys; it will filter the results to include only records with coordinates (i.e., `hasCoordinate` is TRUE); and the results will be in the Darwin Core Archive format (i.e., "DWCA"). We reviewed taxon keys above. If we're mapping our records, and don't have time or need to do any georeferencing ourselves, we can save time by limiting our results to records that already have coordinates.
The default format is “DWCA”, which includes a lot of columns in the results. You can choose “SIMPLE_CSV” instead, and this will give you a subset of commonly used fields. However, this limits your ability to filter records after download, so I recommend sticking with “DWCA”.
There are many other ways to filter records, documented in `?download_predicate_dsl`. Depending on your focus, you might want to restrict the results by `basisOfRecord`, to select only "HumanObservation" or "PreservedSpecimen"; or by `datasetKey` to select a specific project (e.g., iNaturalist). However, I've found that these terms are not applied consistently, so it may be better to download everything and filter after you've had a chance to inspect the tables yourself.
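If you do decide to filter at download time, the predicates slot into the same query structure. A sketch, reusing the placeholder credentials from above and restricting to specimens and human observations; `mySpecimens` is a hypothetical name:

```r
library(rgbif)

# Same basic query as before, with an extra predicate limiting
# basisOfRecord to the two categories of interest. Credentials
# are placeholders.
mySpecimens <- occ_download_prep(
  pred_in("taxonKey", c(3146791, 3189859)),
  pred("hasCoordinate", TRUE),
  pred_in("basisOfRecord", c("PRESERVED_SPECIMEN", "HUMAN_OBSERVATION")),
  format = "DWCA",
  user = "YourUserName",
  pwd = "YourGBIFPassword",
  email = "your@email.address"
)
```

Just remember the caveat above: these fields aren't always applied consistently, so a narrow download may silently drop records you'd have kept after inspection.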
Submitting a query
With a query ready, you can now submit it to GBIF via:
out <- occ_download_queue(.list = list(myQuery))
We could have submitted the query directly, by using `occ_download` above instead of `occ_download_prep`. The latter method offers two advantages. First, we can submit multiple queries at once, e.g.,

out <- occ_download_queue(.list = list(queryA, queryB, queryC))

Second, GBIF only allows you to run up to three download requests at a time. If you have more, you have to wait until one of the earlier requests is finished. `occ_download_queue` keeps track of this for you, submitting the first three requests, and sending additional requests as the earlier ones finish.
Retrieving your query
`occ_download_queue` returns the details of your submission(s):
$`46c5a45e8f1f2e64fc9eda5318e74972`
<<gbif download>>
Your download is being processed by GBIF:
...
Check status with
occ_download_wait('0025322-231120084113126')
After it finishes, use
d <- occ_download_get('0025322-231120084113126') %>%
occ_download_import()
to retrieve your download.
...
Usually it takes a few minutes to process your request. You can check its progress with `occ_download_wait('...')`, using the details provided. Once the query is done, you can download it via `occ_download_get`, and read it into R with `occ_download_import`, as shown.
Cleaning GBIF data
Now that we have our records downloaded, we need to review and clean the data before analysis. Note that in the code below, I’m using a large download to demonstrate my investigations. I haven’t included this data in this tutorial, but you can try the code on your own data.
Filtering based on the data
basisOfRecord
With the data in hand, we can take a closer look at what kind of records they are, and where they came from. Starting with `basisOfRecord`:
d1 <- occ_download_get(out[[1]], path = "./dl/") %>%
  occ_download_import()
sort(table(d1$basisOfRecord), decreasing = TRUE)
HUMAN_OBSERVATION PRESERVED_SPECIMEN OCCURRENCE
4224215 325278 112588
OBSERVATION MATERIAL_CITATION LIVING_SPECIMEN
104335 17703 3123
MATERIAL_SAMPLE MACHINE_OBSERVATION FOSSIL_SPECIMEN
1698 222 10
In this case, I have nine different kinds of record. The definitions of some of these terms are listed in the Darwin Core Quick Reference Guide. HUMAN_OBSERVATION includes, among other things, iNaturalist records. PRESERVED_SPECIMEN consists mostly of herbarium records, and captures most of them. I usually want both of these groups.
OCCURRENCE and OBSERVATION aren’t well defined or consistently used, so require further examination to determine if we want to retain them.
MATERIAL_CITATION, MATERIAL_SAMPLE, and MACHINE_OBSERVATION are a little vague, inconsistently used, and also not used often, so I remove them.
LIVING_SPECIMEN and FOSSIL_SPECIMEN are self-explanatory, and usually not what I want for my work.
iNaturalist and Human Observations
Research-grade iNaturalist records are imported to GBIF every few weeks. We can extract these using the field `datasetKey`, with the value `"50c9509d-22c7-4a22-a47d-8c48425ef4a7"`:
sum(d1$datasetKey == "50c9509d-22c7-4a22-a47d-8c48425ef4a7")
357863 # iNaturalist Records in my data
You might be tempted to filter on `datasetName`, since there is a dataset called "iNaturalist research-grade observations". Unfortunately, this name isn't used consistently; the name "iNaturalist observations" is also used for some records in the same dataset. In fact, the word iNaturalist appears in a variety of different dataset names:
table(d1$datasetName[grep("iNaturalist", d1$datasetName)])
"Flora of Russia" on iNaturalist: a trusted backlog
131
iNaturalist observations
25
iNaturalist research-grade observations
357838
iNaturalist XicotliData observations
7
Which leaves us with the unwieldy `datasetKey == "50c9509d-22c7-4a22-a47d-8c48425ef4a7"` as the most reliable way to get the official iNaturalist dataset.
All of the iNaturalist records are labelled as `HUMAN_OBSERVATION`:
table(d1$basisOfRecord[d1$datasetKey == "50c9509d-22c7-4a22-a47d-8c48425ef4a7"])
HUMAN_OBSERVATION
357863
But this is only a small fraction of the 4,224,215 HUMAN_OBSERVATION records in my query. In my data, there are over 1500 different HUMAN_OBSERVATION datasets. This includes a number of surveys, and in some cases these surveys record presences and absences, as recorded in the `occurrenceStatus` field:
table(d1$occurrenceStatus)
ABSENT PRESENT
27896 4761276
Whether you want to include a heterogeneous collection of surveys in your data depends on the question you're asking.
But if you are planning to do distribution modeling, you may not want to include ABSENT records in your training data. It will depend on the scale of your modeling, and the scale of the surveys that contributed their data to GBIF. Fine-scale local surveys may include a mix of PRESENT and ABSENT records within a small area (say a few hundred meters). If your modeling is at the scale of 30-second climate rasters (~1 km^2), that's a problem: a species can be absent from a 10 m^2 quadrat, but present in the larger 1 km^2 raster cell.
Here are a few examples of filters you might want to use:
herbarium <- d1[d1$basisOfRecord == "PRESERVED_SPECIMEN", ]
iNat <- d1[d1$datasetKey == "50c9509d-22c7-4a22-a47d-8c48425ef4a7", ]
present <- d1[d1$occurrenceStatus == "PRESENT" &
                d1$basisOfRecord == "HUMAN_OBSERVATION", ]
Common location errors
The R package CoordinateCleaner provides some functions for dealing with common problems:

- `cc_cen`: Identifies records close to the centroid of a country. Automated georeferencing will often use the centroid of a country as the coordinates for a specimen that doesn't have any other location data. This could be hundreds or thousands of kilometers away from the actual location.
- `cc_inst`: Identifies records close to the location of museums. Automated georeferencing will often use the location of a museum as the location for specimens it contains, regardless of where the specimen was actually collected.
- `cc_sea`: Identifies records that are in the ocean, which are clearly errors for terrestrial organisms. However, this could also be a consequence of relatively minor errors in record coordinates or the mapping of coastlines.
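A sketch of how these checks might be chained on the downloaded table. Note that CoordinateCleaner defaults to lowercase column names, so the Darwin Core column names from a GBIF download need to be passed explicitly; check `?cc_cen` for the current argument defaults before relying on this.

```r
library(CoordinateCleaner)

# With value = "flagged", each test returns a logical vector:
# TRUE means the record passed, FALSE means it was flagged as suspect.
ok_cen  <- cc_cen(d1, lon = "decimalLongitude", lat = "decimalLatitude",
                  species = "species", value = "flagged")
ok_inst <- cc_inst(d1, lon = "decimalLongitude", lat = "decimalLatitude",
                   species = "species", value = "flagged")
ok_sea  <- cc_sea(d1, lon = "decimalLongitude", lat = "decimalLatitude",
                  value = "flagged")

# Keep only records that pass all three tests
clean <- d1[ok_cen & ok_inst & ok_sea, ]
```

Flagging rather than dropping lets you inspect the suspect records before discarding them, which is in keeping with the download-everything-then-filter approach used above.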
For a more thorough overview of these kinds of issues, see this post by John Waller on the GBIF data blog.
Identification Errors
Now that we’ve dealt with the more ‘mechanical’ sorts of errors we might find in our data, we can review our records for obvious or suspected identification errors. If you have a large dataset, especially if it includes many species, or species you are not familiar with, this can be a daunting task.
Two important botanical resources can help us with this, at least in North America: the Flora of North America and NatureServe.org. These provide a ‘sanity check’ for our GBIF records, as both websites host carefully curated data on plant distributions.
The Flora of North America includes taxonomic treatments of all plant species that occur outside of cultivation in Canada and the United States1. The maps aren’t very detailed, showing us only the states or provinces each taxon is found in. But those maps are all based on actual specimens examined by a taxonomist with expertise on that plant. If New York appears on an FNA distribution map, it means there is at least one documented record of that species, confirmed by an expert, in that state. And conversely, if a state or province isn’t included on the map, it means that a thorough search of herbaria has failed to find a single record of the species.
NatureServe is similar, but the distribution maps are based on state/provincial/regional natural heritage programs. These are generally expert field botanists, rather than taxonomists. Their job is to document which plants grow in their jurisdiction. If a NatureServe distribution map includes Ontario, that means an expert field botanist has evidence (which could be a herbarium voucher, or a reliable observation) that the species occurs (or occurred) in Ontario. And again, if a state or province isn’t included on the map, then no such evidence has been found.
NatureServe and FNA distribution maps will usually match fairly closely. NatureServe is more responsive to new information, and those maps will periodically be updated. FNA isn’t yet complete (although it’s getting close), but for the taxa that it covers it provides a comprehensive review of their distribution at the time of publication (i.e., it doesn’t get updated).
Neither source is complete, and plants do move around. But, if you have records in your GBIF dataset from areas beyond the range documented in NatureServe or FNA, these are the ones I’d be most concerned about verifying. Use the map interface on the GBIF website to locate them, and look at the record details for clues to confirm or refute them. If they trace back to iNaturalist records, you may be able to confirm the identification yourself from the photos, or ask the observer for more details.
You won’t likely be able to confirm the identifications of all of your records, but you can often validate or exclude the outlying records that have the greatest potential to distort your analysis.
The FNA project is more properly the “Flora of North America north of Mexico”.↩︎