
Preparing GBIF records for distribution modeling


GBIF.org

The Global Biodiversity Information Facility (GBIF.org) has become the standard open-access online database of occurrence records for all manner of biological organisms. It was initially a clearinghouse for museum records (such as herbarium specimens), but now includes iNaturalist observations (those rated ‘research grade’), survey data, and a growing variety of taxonomic and checklist sources.

While GBIF’s expansion increases the overall value of the database, it also means we need to be more circumspect in how we use the data. When I first encountered GBIF decades ago, I used it as one of several sources for herbarium records. I searched for the species I was looking for, and received a list of museum specimens. Nowadays most, maybe all, online herbarium data is mirrored by GBIF, so I no longer need to chase down multiple websites to round up all the herbarium records I need.

However, I can no longer assume that a GBIF record represents a physical specimen. It could be: a human observation, with or without an associated image; documentation harvested from sequence data submitted to GenBank; or an entry from a field survey, and surveys sometimes contain records for both presences and absences. And of course, all of these records are subject to any number of issues: transcription errors, identification errors, georeferencing errors.

All of which is to say, we need a way to filter the results of our GBIF query to ensure the data we receive is fit for purpose.

Getting GBIF data

Step one is actually getting the data from GBIF. You can do this from the website, but I now prefer to do this in my R scripts, using the rgbif package.

Finding taxon names

To get started, we need to match our name up with the GBIF taxonomic backbone:

library(rgbif)
# Match each name against the GBIF taxonomic backbone
conyza <- name_backbone("Conyza canadensis")
erigeron <- name_backbone("Erigeron canadensis")

This snippet searches the GBIF database for “Conyza canadensis” and “Erigeron canadensis” and returns the closest matches. The results can be a bit confusing to interpret. To make sense of them, we need to understand a few terms.

A usageKey is a unique number associated with every taxon in the database, at every level. This includes species, subspecies, genera, etc., and importantly, it includes both accepted taxa and synonyms. Complicating things, GBIF taxonomy tables use the term usageKey, but the individual occurrence records use the term taxonKey instead. They both mean the same thing: you’ll use the usageKey you get from your name_backbone search as the taxonKey in the query you submit (see below).

The acceptedUsageKey (or acceptedTaxonKey) links a synonym to its accepted taxon: for taxa that are synonyms, their acceptedUsageKey is the usageKey of the accepted taxon they belong to.

To illustrate the distinctions, let’s look at the values for the examples above.

Key                Value
scientificName     Conyza canadensis (L.) Cronquist
usageKey           5404801
status             SYNONYM
acceptedUsageKey   3146791

Key                Value
scientificName     Erigeron canadensis L.
usageKey           3146791
status             ACCEPTED

In this case Conyza canadensis is a synonym of Erigeron canadensis. The name Conyza canadensis has its own usageKey: 5404801. Its acceptedUsageKey is 3146791, which is the usageKey for Erigeron canadensis. An additional wrinkle is that accepted taxa only have a usageKey; they don’t have an acceptedUsageKey. You can also use the status field to check whether a taxon is accepted or a synonym.

Understanding this is important to make sure you get the records you’re after when you query the database. If you request data for taxonKey 5404801 (the usageKey for Conyza canadensis), you’ll get records with that name on them, but not records for Erigeron canadensis. On the other hand, if you search for taxonKey 3146791, you’ll get records for Erigeron canadensis, and also all records for any synonyms of that name, including Conyza canadensis.

In other words, searching for an accepted taxon will return results for that taxon including all its synonyms. Searching for a synonym will return results only for that synonym.

You can search for records by name, without using the usageKey. But it’s safer to look up and use the usageKey, to confirm that the name you asked for actually matches something in the GBIF database.
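Putting this together, here’s a minimal sketch of resolving any name to its accepted key before querying. The resolve_key helper is my own illustration, not part of rgbif, and it assumes name_backbone found a match:

# Return the accepted taxon key for a name, whether the name
# itself is accepted or a synonym
resolve_key <- function(name) {
  match <- name_backbone(name)
  if (match$status == "SYNONYM") {
    match$acceptedUsageKey  # synonyms point to their accepted taxon
  } else {
    match$usageKey          # accepted taxa have no acceptedUsageKey
  }
}

resolve_key("Conyza canadensis")   # 3146791
resolve_key("Erigeron canadensis") # 3146791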

Preparing a query

Once you have a list of one or more taxonKey values, you’re ready to request your data. For large record sets, and especially if you want to request multiple species at once, the rgbif function occ_download_queue is very convenient. Note that you need to have a (free) GBIF account in order to use this.

This is a three-step process. You create your query with occ_download_prep, submit the query to GBIF with occ_download_queue, and once the query is done you download the results via occ_download_get.

Starting with occ_download_prep, a basic query only requires your account credentials and one or more taxonKeys:

myQuery <- occ_download_prep(
  pred_in("taxonKey", c(3146791, 3189859)),  # one or more taxon keys
  pred("hasCoordinate", TRUE),               # georeferenced records only
  format = "DWCA",                           # Darwin Core Archive output
  user = "YourUserName",
  pwd = "YourGBIFPassword",
  email = "your@email.address"
)

SECURITY NOTE: You can provide your GBIF username and password in your script as I have done, but there are more secure ways to submit your credentials without listing them in your code. See the rgbif documentation for better options.
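One of those options, described in the rgbif setup documentation, is to store your credentials in your ~/.Renviron file, where rgbif finds them automatically:

# Contents of ~/.Renviron (restart R after editing):
GBIF_USER=YourUserName
GBIF_PWD=YourGBIFPassword
GBIF_EMAIL=your@email.address

With these set, you can omit the user, pwd, and email arguments from occ_download_prep entirely.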

The occ_download_prep call above creates a query for two taxa, as specified by the provided keys; it filters the results to include only records with coordinates (i.e., hasCoordinate is TRUE); and it requests the results in Darwin Core Archive format (i.e., “DWCA”). We reviewed taxon keys above. If we’re mapping our records, and don’t have the time or need to do any georeferencing ourselves, we can save time by limiting our results to records that already have coordinates.

The default format is “DWCA”, which includes a lot of columns in the results. You can choose “SIMPLE_CSV” instead, which gives you a subset of commonly used fields. However, that limits your ability to filter records after download, so I recommend sticking with “DWCA”.

There are many other ways to filter records, documented in ?download_predicate_dsl. Depending on your focus, you might want to restrict the results by basisOfRecord, to select only “HUMAN_OBSERVATION” or “PRESERVED_SPECIMEN” records; or by datasetKey, to select a specific project (e.g., iNaturalist). However, I’ve found that these terms are not applied consistently, so it may be better to download everything and filter after you’ve had a chance to inspect the tables yourself.
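If you do decide to filter at download time, here’s a sketch of adding a basisOfRecord predicate to the query above (credentials omitted, assuming the .Renviron setup described earlier):

myQuery <- occ_download_prep(
  pred_in("taxonKey", c(3146791, 3189859)),
  pred("hasCoordinate", TRUE),
  # restrict to the two best-defined record types
  pred_in("basisOfRecord", c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")),
  format = "DWCA"
)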

Submitting a query

With a query ready, you can now submit it to GBIF via:

out <- occ_download_queue(.list = list(myQuery))

We could have submitted the query directly, using occ_download instead of occ_download_prep. The prep-and-queue approach offers two advantages. First, we can submit multiple queries at once:

out <- occ_download_queue(.list = list(queryA, queryB,
                                       queryC)) 

Second, GBIF only allows three download requests to run at a time. If you have more, you have to wait until one of the earlier queries finishes. occ_download_queue keeps track of this for you, submitting the first three requests and sending additional requests as the earlier ones complete.

Retrieving your query

occ_download_queue returns the details of your submission(s):

$`46c5a45e8f1f2e64fc9eda5318e74972`
<<gbif download>>
  Your download is being processed by GBIF:
  ...
  Check status with
  occ_download_wait('0025322-231120084113126')
  After it finishes, use
  d <- occ_download_get('0025322-231120084113126') %>%
    occ_download_import()
  to retrieve your download.
  ...

Usually it takes a few minutes to process your request. You can check its progress with occ_download_wait('...'), using the details provided. Once the query is done, you can download it via occ_download_get, and read it into R with occ_download_import, as shown.

Cleaning GBIF data

Now that we have our records downloaded, we need to review and clean the data before analysis. Note that in the code below, I’m using a large download to demonstrate my investigations. I haven’t included this data in this tutorial, but you can try the code on your own data.

Filtering based on the data

basisOfRecord

With the data in hand, we can take a closer look at what kind of records they are, and where they came from. Starting with basisOfRecord:

# Fetch the finished download and read it into R
d1 <- occ_download_get(out[[1]], path = "./dl/") %>% 
  occ_download_import()
sort(table(d1$basisOfRecord), decreasing = TRUE)
  HUMAN_OBSERVATION  PRESERVED_SPECIMEN          OCCURRENCE 
            4224215              325278              112588 
        OBSERVATION   MATERIAL_CITATION     LIVING_SPECIMEN 
             104335               17703                3123 
    MATERIAL_SAMPLE MACHINE_OBSERVATION     FOSSIL_SPECIMEN 
               1698                 222                  10 

In this case, I have nine different kinds of record. The definitions of some of these terms are listed in the Darwin Core Quick Reference Guide. HUMAN_OBSERVATION includes, among other things, iNaturalist records. PRESERVED_SPECIMEN is mostly herbarium records, and captures most of them. I usually want both of these groups.

OCCURRENCE and OBSERVATION aren’t well defined or consistently used, so they require further examination to determine whether we want to retain them.

MATERIAL_CITATION, MATERIAL_SAMPLE, and MACHINE_OBSERVATION are a little vague, inconsistently used, and also not used often, so I remove them.

LIVING_SPECIMEN and FOSSIL_SPECIMEN are self-explanatory, and usually not what I want for my work.
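Putting those decisions together, a minimal filter for this stage (a sketch, assuming the d1 table downloaded above) might be:

# Keep the two well-defined categories; review OCCURRENCE and
# OBSERVATION separately before deciding whether to add them back
keep <- c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN")
d1_keep <- d1[d1$basisOfRecord %in% keep, ]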

iNaturalist and Human Observations

Research-grade iNaturalist records are imported to GBIF every few weeks. We can extract these using the field datasetKey, with the value "50c9509d-22c7-4a22-a47d-8c48425ef4a7":

sum(d1$datasetKey == "50c9509d-22c7-4a22-a47d-8c48425ef4a7")
357863 # iNaturalist Records in my data

You might be tempted to filter on datasetName, since there is a dataset called “iNaturalist research-grade observations”. Unfortunately, this name isn’t used consistently; the name “iNaturalist observations” is also used for some records in the same dataset. In fact, the word iNaturalist appears in a variety of different dataset names:

table(d1$datasetName[grep("iNaturalist", d1$datasetName)])
"Flora of Russia" on iNaturalist: a trusted backlog 
                                                131 
                           iNaturalist observations 
                                                 25 
            iNaturalist research-grade observations 
                                             357838 
               iNaturalist XicotliData observations 
                                                  7 

Which leaves us with the unwieldy datasetKey == "50c9509d-22c7-4a22-a47d-8c48425ef4a7" as the most reliable way to get the official iNaturalist dataset.

All of the iNaturalist records are labelled as HUMAN_OBSERVATION:

table(d1$basisOfRecord[d1$datasetKey ==
                       "50c9509d-22c7-4a22-a47d-8c48425ef4a7"]) 
HUMAN_OBSERVATION 
           357863 

But this is only a small fraction of the 4,224,215 HUMAN_OBSERVATION records in my query. In my data, there are over 1500 different HUMAN_OBSERVATION datasets. These include a number of surveys, some of which record both presences and absences, as captured in the occurrenceStatus field:

table(d1$occurrenceStatus)
 ABSENT PRESENT 
  27896 4761276 

Whether you want to include a heterogeneous collection of surveys in your data depends on the question you’re asking.

But if you are planning to do distribution modeling, you may not want to include ABSENT records in your training data. It will depend on the scale of your modeling, and the scale of the surveys that contributed their data to GBIF. Fine-scale local surveys may include a mix of PRESENT and ABSENT records within a small area (say, a few hundred meters). If your modeling is at the scale of 30 arc-second climate rasters (~1 km^2), that’s a problem: a species can be absent from a 10 m^2 quadrat, but present in the larger 1 km^2 raster cell.

Here are a few examples of filters you might want to use:

herbarium <- d1[d1$basisOfRecord == "PRESERVED_SPECIMEN", ]
iNat <- d1[d1$datasetKey ==
             "50c9509d-22c7-4a22-a47d-8c48425ef4a7", ]
present <- d1[d1$occurrenceStatus == "PRESENT" &
                d1$basisOfRecord == "HUMAN_OBSERVATION", ]

Common location errors

The R package CoordinateCleaner provides some functions for dealing with common problems:

cc_cen : Identifies records close to the centroid of a country. Automated georeferencing will often use the centroid of a country as the coordinates for a specimen that doesn’t have any other location data. This could be hundreds or thousands of kilometers away from the actual location.

cc_inst : Identifies records close to the location of museums. Automated georeferencing will often use the location of a museum as the location for specimens it contains, regardless of where the specimen was actually collected.

cc_sea : Identifies records that are in the ocean, which are clearly errors for terrestrial organisms. However, this could also be a consequence of relatively minor errors in record coordinates or the mapping of coastlines.
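These functions take a data frame and the names of its coordinate columns, and by default return the table with flagged records removed. A minimal sketch of applying all three, assuming the decimalLongitude and decimalLatitude columns present in a “DWCA” download:

library(CoordinateCleaner)

# Apply the three tests in sequence; each removes flagged records
# by default (pass value = "flagged" to inspect before deleting)
d2 <- cc_cen(d1, lon = "decimalLongitude", lat = "decimalLatitude")
d2 <- cc_inst(d2, lon = "decimalLongitude", lat = "decimalLatitude")
d2 <- cc_sea(d2, lon = "decimalLongitude", lat = "decimalLatitude")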

For a more thorough overview of these kinds of issues, see this post by John Waller on the GBIF data blog.

Identification Errors

Now that we’ve dealt with the more ‘mechanical’ sorts of errors we might find in our data, we can review our records for obvious or suspected identification errors. If you have a large dataset, especially if it includes many species, or species you are not familiar with, this can be a daunting task.

Two important botanical resources can help us with this, at least in North America: the Flora of North America and NatureServe.org. These provide a ‘sanity check’ for our GBIF records, as both websites host carefully curated data on plant distributions.

The Flora of North America includes taxonomic treatments of all plant species that occur outside of cultivation in Canada and the United States¹. The maps aren’t very detailed, showing us only the states or provinces each taxon is found in. But those maps are all based on actual specimens examined by a taxonomist with expertise on that plant. If New York appears on an FNA distribution map, it means there is at least one documented record of that species, confirmed by an expert, in that state. And conversely, if a state or province isn’t included on the map, it means that a thorough search of herbaria has failed to find a single record of the species.

NatureServe is similar, but the distribution maps are based on state/provincial/regional natural heritage programs. These are generally expert field botanists, rather than taxonomists. Their job is to document which plants grow in their jurisdiction. If a NatureServe distribution map includes Ontario, that means an expert field botanist has evidence (which could be a herbarium voucher, or a reliable observation) that the species occurs (or occurred) in Ontario. And again, if a state or province isn’t included on the map, then no such evidence has been found.

NatureServe and FNA distribution maps will usually match fairly closely. NatureServe is more responsive to new information, and its maps are periodically updated. FNA isn’t yet complete (although it’s getting close), but for the taxa it covers it provides a comprehensive review of their distribution at the time of publication (i.e., it doesn’t get updated).

Neither source is complete, and plants do move around. But if your GBIF dataset includes records from areas beyond the range documented in NatureServe or FNA, those are the ones I’d be most concerned about verifying. Use the map interface on the GBIF website to locate them, and look at the record details for clues to confirm or refute them. If they trace back to iNaturalist records, you may be able to confirm the identification yourself from the photos, or ask the observer for more details.
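As a rough first screen, you can tabulate your records by state or province and compare against the published maps. In this sketch, expected_range is a hypothetical vector you would compile yourself from the FNA or NatureServe maps; stateProvince is a standard Darwin Core field in the “DWCA” download:

# Hypothetical sanity check: which records fall outside the
# documented range? expected_range must be compiled by hand.
expected_range <- c("Ontario", "Quebec", "New York")  # placeholder values
outliers <- d1[!(d1$stateProvince %in% expected_range), ]
table(outliers$stateProvince)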

You won’t likely be able to confirm the identifications of all of your records, but you can often validate or exclude the outlying records that have the greatest potential to distort your analysis.


1. The FNA project is more properly the “Flora of North America north of Mexico”.