GBIF Data
GBIF is the main online clearing house for occurrence data in the world. It includes most (but not all) online herbarium databases. It also includes iNaturalist records, as well as a number of other survey repositories. I’m not familiar with all of the sources included.
The main page presents a search bar. Enter your species name (e.g., Rubus chamaemorus) in the box, and select the ‘occurrences’ tab. You’ll be taken to a list of results that match your species (2.9e6 results for R. chamaemorus). This may include records for other species, presumably because the name of the species you are searching for is recorded in the metadata for those records. We don’t usually want these.
To fix this, look for the prompt: “Your search matches a Species: ‘Rubus chamaemorus L.’. Do you wish to limit your search to this taxon only?”, in the left-hand tool bar. Select yes to restrict the results to only records for your species (i.e., excluding records where your species is only mentioned in the data). For R. chamaemorus we drop from 2.9e6 records to 87e3.
Basis of Record
In the right hand toolbar, look for the option: Basis of Record. Expanding that tab, you see a list of options, along with the number of records for each option. For R. chamaemorus I see:
- Observation (6)
- Machine Observation (46)
- Human Observation (79686)
- Material Sample (516)
- Material Citation (1)
- Preserved Specimens (6032)
- Fossil Specimen (133)
- Living Specimen (28)
- Occurrence (1222)
The Basis of Record field is no longer recommended, and is deprecated in the DarwinCore system. Looking through the various categories, that’s probably a good idea. I’m not sure what the difference is between ‘Observation’, ‘Machine Observation’, ‘Occurrence’ etc, and it looks like the groups are applied very inconsistently.
‘Human Observation’ does include iNaturalist records, but it also includes a lot of other data. The iNaturalist records are updated periodically, but it’s also possible to download them directly from iNaturalist. That’s my preference, as then I don’t need to concern myself with what else might be included under ‘Human Observation’.
To restrict ourselves to herbarium records, we’ll select ‘Preserved Specimens’ and move on. For R. chamaemorus that brings us down to 5.9e3 records.
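The same species and Basis of Record filters can also be expressed as a query against GBIF's public occurrence search API. The sketch below only builds the URL and does not send a request. The function name `build_search_url` is made up for illustration, and the parameter names (`scientificName`, `basisOfRecord`, `limit`) and the `PRESERVED_SPECIMEN` value are my understanding of the API, so check them against the current GBIF documentation before relying on them.

```python
from urllib.parse import urlencode

# Base URL of the GBIF occurrence search endpoint (assumed, see lead-in).
GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_search_url(scientific_name, basis_of_record="PRESERVED_SPECIMEN", limit=0):
    """Build a GBIF occurrence search URL. No request is sent."""
    params = {
        "scientificName": scientific_name,
        "basisOfRecord": basis_of_record,  # herbarium specimens only
        "limit": limit,                    # number of records to return per page
    }
    return f"{GBIF_SEARCH}?{urlencode(params)}"


url = build_search_url("Rubus chamaemorus")
print(url)
```

Opening the printed URL in a browser should return JSON that includes a `count` field, which can be compared with the number shown on the website.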
Geographic Filters
With our data filtered to just herbarium records, we can see where they are from by clicking on the ‘map’ tab at the top of the table. Doing that I see that of the 5900 records available, 2700 have coordinates already.
The rectangle and pentagon symbols in the upper right provide tools for further filtering the results by rectangle or polygon, respectively. You can also use the buttons on the upper left to zoom in and out, and to examine record details for an area (the arrow pointing at a circle). In my case, I want all records for the planet, so I’ll skip those options.
Downloading
Now I can download the records. I think GBIF requires that you have a (free) account for this, which helps them track use. The ‘simple’ format is sufficient; the DarwinCore format includes a lot of extra columns.
VERY IMPORTANT!
Record the DOI for your download! You will need this to properly cite the data you use in a publication. If you have a GBIF account, you may be able to recover it after the fact. You can still cite GBIF in your paper without a DOI for the download, but it’s much less useful.
For my R. chamaemorus data, GBIF provides the following citation:
GBIF.org (23 March 2022) GBIF Occurrence Download https://doi.org/10.15468/dl.uhqfg3
For large queries, it may take several minutes or longer for the data to be ready. You’ll get an email when it’s available.
Data Processing
Uploading
Once you have your data, we’ll upload it to our shared Google Drive for processing. From the shared drive, select New -> File upload, and select the unzipped .csv file from your GBIF download. Double-click on the file, and select ‘Open With Google Sheets’ at the top of the page.
The filename is listed in the top left corner. Change that to the species name, possibly with any additional details to distinguish it from other files with the same species. For most cases, when we have a single global file for the species, just the name is fine.
However, we do want to link our data to the GBIF DOI. I do this by renaming the sheet to the DOI url: https://doi.org/10.15468/dl.uhqfg3. If we need to maintain separate datasets for the same species, they can be added as new sheets. This hasn’t come up yet.
Prepping the Spreadsheet
Even the ‘simple’ export has a lot of extra columns we don’t need. These can be ‘hidden’ from view to make it easier to work with the file. The main columns we use are:
countryCode
locality
stateProvince
decimalLatitude
decimalLongitude
coordinateUncertaintyInMeters
georeferencedBy
georeferencedDate
georeferenceSources
georeferenceProtocol
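If you prefer to trim the file before uploading it, the column selection can be sketched in Python. This assumes the ‘simple’ download is tab-delimited (which is my understanding, even though the file is named .csv). The helper name `select_columns` and the sample row are made up for illustration.

```python
import csv
import io

# The columns listed above; any that are absent from the file come back empty.
MAIN_COLUMNS = [
    "countryCode", "locality", "stateProvince",
    "decimalLatitude", "decimalLongitude", "coordinateUncertaintyInMeters",
    "georeferencedBy", "georeferencedDate",
    "georeferenceSources", "georeferenceProtocol",
]


def select_columns(handle, columns=MAIN_COLUMNS):
    """Read a tab-delimited file and keep only the requested columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [{c: row.get(c, "") for c in columns} for row in reader]


# Made-up two-line file standing in for a real download.
sample = io.StringIO(
    "gbifID\tcountryCode\tlocality\tstateProvince\t"
    "decimalLatitude\tdecimalLongitude\tcoordinateUncertaintyInMeters\n"
    "1\tCA\tOttawa\tOntario\t\t\t\n"
)
rows = select_columns(sample)
print(rows[0]["countryCode"], rows[0]["locality"])
```

For a real file, replace `sample` with `open(path, newline="", encoding="utf-8")`.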
We do need to add several columns for our georeferencing. Find the column coordinateUncertaintyInMeters, and add five columns to the right. The names of these columns are:
georeferencedBy
georeferencedDate
georeferenceSources
georeferenceProtocol
georeferenceRemarks
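For anyone preparing the file outside of Google Sheets, adding the columns amounts to a small list operation on the header row. This is only a sketch: `add_georef_columns` is a hypothetical helper, and the example header is shortened and made up.

```python
GEOREF_COLUMNS = [
    "georeferencedBy",
    "georeferencedDate",
    "georeferenceSources",
    "georeferenceProtocol",
    "georeferenceRemarks",
]


def add_georef_columns(header, anchor="coordinateUncertaintyInMeters"):
    """Return a new header with any missing georeference columns
    placed immediately to the right of `anchor`."""
    missing = [c for c in GEOREF_COLUMNS if c not in header]
    i = header.index(anchor) + 1
    return header[:i] + missing + header[i:]


header = ["decimalLatitude", "decimalLongitude",
          "coordinateUncertaintyInMeters", "eventDate"]
print(add_georef_columns(header))
```

Columns that already exist are left where they are, so running the helper twice does not create duplicates.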
Data Entry
For records that already have coordinates, don’t change them or add any additional notes.
For records that you locate:
- Add your name to georeferencedBy. Use the same spelling/abbreviation every time.
- Add the date in yyyy-mm-dd format (e.g., 2022-03-23) to georeferencedDate.
- Add the coordinates to decimalLatitude and decimalLongitude.
- Add the uncertainty in meters to coordinateUncertaintyInMeters.
- Add the location source to georeferenceSources (usually this will be “GoogleMaps” or “GeoLocate”).
georeferenceRemarks is for comments regarding your efforts to locate the record. Some possible values include:
- AttemptedGoogleMaps = I tried to find the location using Google Maps, and could not
- AttemptedGetty = I tried to find the location using Getty Thesaurus of Geographic Names
- AttemptedGNIS = I tried to find the location using Geographic Names Information System (GNIS)
- AttemptedNRC = I tried to find the location using Natural Resources Canada, geographical place/feature names of Canada
- LocationNotFound = I could not find this location / I could not georeference it
- ContradictingLocation = I could not georeference the location as there are two locations mentioned which are not close by each other
- Duplicate = duplicate georeference; other georeferences share the same lat/long and uncertainty
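The data entry rules above can be summarized as a small function. This is a sketch rather than part of our actual workflow: `record_georeference` is a hypothetical helper, the coordinates and name in the example are placeholders, and formatting to five decimal places follows the guidance given later in this document.

```python
from datetime import date


def record_georeference(row, lat, lon, uncertainty_m, who, source,
                        when=None, remarks=""):
    """Fill in the georeference fields for a record without coordinates.

    Records that already have coordinates are returned unchanged.
    """
    if row.get("decimalLatitude") or row.get("decimalLongitude"):
        return row
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("coordinates out of range")
    when = when or date.today()
    updated = dict(row)  # copy, so the original row is not modified
    updated.update({
        "decimalLatitude": f"{lat:.5f}",
        "decimalLongitude": f"{lon:.5f}",
        "coordinateUncertaintyInMeters": str(uncertainty_m),
        "georeferencedBy": who,
        "georeferencedDate": when.isoformat(),  # yyyy-mm-dd
        "georeferenceSources": source,
        "georeferenceRemarks": remarks,
    })
    return updated


row = {"locality": "Ottawa", "decimalLatitude": "", "decimalLongitude": ""}
done = record_georeference(row, 45.4215, -75.6972, 1000,
                           "A. Person", "GoogleMaps", when=date(2022, 3, 23))
print(done["georeferencedDate"], done["decimalLatitude"], done["decimalLongitude"])
```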
Problems
Inconsistencies in Source Data
We don’t normally need to modify the original data. However, if you see the following known errors they can be corrected as indicated.
- 0, 0 latitude and longitude is an error. It can be overwritten; do so as you are working through the data.
- coordinateUncertaintyInMeters without latitude and longitude coordinates is an error. It can be overwritten; do so as you are working through the data.
- Data with no country code: if no other locality data is given, the record cannot be georeferenced; if locality data is given, you can add the country code and province information.
- Localities with no province should be corrected as you are working through the data. Either add the province that is listed in the locality to the province column, or add the province that applies given the country and locality to the province column.
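These known errors are easy to screen for automatically before you start working through the rows. The sketch below flags them without changing anything. The function name and the problem labels are made up for illustration.

```python
def find_problems(row):
    """Return a list of labels for the known source-data errors in one record."""
    def value(column):
        return (row.get(column) or "").strip()

    lat, lon = value("decimalLatitude"), value("decimalLongitude")
    problems = []
    if lat and lon and float(lat) == 0 and float(lon) == 0:
        problems.append("zero_coordinates")
    if value("coordinateUncertaintyInMeters") and not (lat and lon):
        problems.append("uncertainty_without_coordinates")
    if not value("countryCode"):
        problems.append("missing_country_code")
    if not value("stateProvince"):
        problems.append("missing_province")
    return problems


print(find_problems({"decimalLatitude": "0.0", "decimalLongitude": "0",
                     "countryCode": "CA", "stateProvince": "Ontario"}))
```

A record with no issues returns an empty list, so the result can be used directly as a filter.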
Finding Coordinates
Start by sorting the spreadsheet by country, state within country, and locality within state. This will save time, as you’ll be processing records from the same location together. In many cases you’ll have multiple records in the same state, and possibly in the same town. If you’re lucky, there will be records with coordinates for the same location as records without coordinates. Sorting allows you to take advantage of this.
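In Google Sheets this is a multi-column sort. The same ordering can be sketched in Python for a file handled outside of Sheets. The helper name and sample rows are made up, and lower-casing the values before comparing them is my own choice so that capitalization differences don't split up a location.

```python
def sort_for_georeferencing(rows):
    """Sort records by country, then state/province, then locality."""
    def key(row):
        return tuple(
            (row.get(column) or "").strip().lower()
            for column in ("countryCode", "stateProvince", "locality")
        )
    return sorted(rows, key=key)


records = [
    {"countryCode": "SE", "stateProvince": "", "locality": "Vrigstad"},
    {"countryCode": "CA", "stateProvince": "Ontario", "locality": "Ottawa"},
    {"countryCode": "CA", "stateProvince": "Alberta", "locality": "Banff"},
]
for record in sort_for_georeferencing(records):
    print(record["countryCode"], record["stateProvince"], record["locality"])
```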
The final step is actually finding the coordinates. Hopefully there’s enough detail in the locality field that you can search for landmarks in GoogleMaps, and use the satellite view to get close to the actual location. It really depends on the record what will be possible.
Google Maps
Google Maps has a very straightforward interface, and is the simplest option. The GeoLocate web client provides a slightly more sophisticated search tool, which allows you to use administrative boundaries or editable circles to set uncertainty values.
GeoLocate
GeoLocate also supports batch processing. This allows you to upload a file and process a group of records together, automatically. You will still need to review the results to make sure they make sense, but you can manually edit them directly in the map viewer. When you’re done you can export the results as a csv file. To use this approach, you need to prepare the spreadsheet according to the instructions, and we need a simple approach for exporting from Google Sheets, and reimporting back into Google Sheets when we’re done.
Record coordinates to five decimal places. At the equator, 1 degree is about 111 km, so 0.00001 degrees is about 1 m. We don’t need sub-meter accuracy, and none of these records are that precise.
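The arithmetic behind the five-decimal rule can be checked with a few lines of Python. This is a rough sketch that treats the Earth as a sphere with about 111.32 km per degree at the equator. The function name is made up, and the cosine scaling for longitude is a standard approximation rather than anything specific to our workflow.

```python
import math

# Approximate length of one degree at the equator, in meters.
METERS_PER_DEGREE = 111_320


def decimal_place_resolution_m(places, latitude=0.0):
    """Ground distance (north-south, east-west) in meters represented by
    one unit in the last decimal place of a coordinate at `latitude`."""
    step = 10 ** -places
    north_south = step * METERS_PER_DEGREE
    # Degrees of longitude get narrower toward the poles.
    east_west = north_south * math.cos(math.radians(latitude))
    return north_south, east_west


for places in (3, 4, 5):
    ns, ew = decimal_place_resolution_m(places, latitude=60)
    print(f"{places} places: {ns:.2f} m north-south, {ew:.2f} m east-west")
```

Five decimal places works out to roughly a meter everywhere, and east-west distances only get smaller at higher latitudes.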
Uncertainty
Estimating uncertainty is a judgement call. If the locality data is very precise (“the corner of Main and Second Avenue in Springfield”) you might have a very small uncertainty. If all you have is a state, your coordinate will be the center of the state, and the uncertainty will be the diameter of the state. Don’t worry about being precise: 1m, 100m, 1000m, 10000m (or bigger) are fine.
For vague/approximate locations, do what you can. Try to think like a field biologist. How far from a town do you need to be before you stop describing your location as “South of X”, and start describing it as “between X and Y”?
Priorities
Some records take a lot of work to locate, and some are impossible. We’re most interested in filling in gaps. If we have a thousand records from Ottawa, we don’t need to waste a lot of energy tracking down the coordinates for one more record. But if we only have one record for China, it will be worth spending some time trying to find it. Similarly, earlier records are rarer and more valuable: the 100th record from New York (in 2021) doesn’t add as much value as the first one (from 1830).
Tips For Deciphering Labels
Compiled by Laura Kostyniuk.
Abbreviations seen in data
Latin
- opp. = likely oppidum, ‘town’ in English
- fl. = likely flumen, ‘river’ in English
- urb. = likely urbs/urbem, ‘city’ in English
Other languages
- VC or v.c. in Great Britain codes likely stands for vice county. Not sure if this applies to other countries.
- Arred. / Arred. de in Portugal is an abbreviation for Arredores, meaning ‘surroundings of’.
- Locations in Sweden starting with W often do not show up in Google Maps. Sometimes they can be found if W is changed to V (e.g., Wrigstad = no hits, Vrigstad = village in the right general location).
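The Swedish W/V tip is simple enough to turn into a helper that lists the spellings worth searching for. This is only an illustration of the tip above; `swedish_spelling_variants` is a made-up name, and it only handles a leading W.

```python
def swedish_spelling_variants(name):
    """Return the original place name plus the W -> V variant, if any."""
    variants = [name]
    if name[:1] in ("W", "w"):
        replacement = "V" if name[0] == "W" else "v"
        variants.append(replacement + name[1:])
    return variants


print(swedish_spelling_variants("Wrigstad"))
```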
Other Sources
For valuable records you can’t locate in GoogleMaps or GeoLocate. Compiled by Laura Kostyniuk and Shannon Ascensio.
Place/feature names, gazetteers
- Canadian Geographical Names Database (Canada)
- USGS Geographic Names Information System (GNIS) (USA)
- Getty Thesaurus of Geographic Names (Worldwide)
Coordinate converters
- Tool-online Coordinate converter: search by country or by world if the type of coordinate system you are looking to convert is available
- Converter for Swiss coordinates: If numbers are written as 123.4/567.8, enter 567.8 as the Y coordinate 567 800, and 123.4 as the X coordinate 123 400
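The Swiss label notation can be expanded into the values the converter expects with a short function. This sketch follows the X/Y labelling given in the note above (first number is X, second is Y). The function name is made up, and which axis a given converter calls X or Y should be checked on the converter itself.

```python
from decimal import Decimal


def parse_swiss_pair(text):
    """Expand a label value like '123.4/567.8' (kilometers) into
    whole-meter X and Y values for the coordinate converter."""
    first, second = (part.strip() for part in text.split("/"))
    return {
        "X": int(Decimal(first) * 1000),   # 123.4 -> 123400
        "Y": int(Decimal(second) * 1000),  # 567.8 -> 567800
    }


print(parse_swiss_pair("123.4/567.8"))
```

`Decimal` is used instead of `float` so that the multiplication by 1000 stays exact.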
Useful maps
- OldMapsonline: Old map searching tool; search by location, year, and add filters to search.
- MapCarta: Good general map with more locations than Google; sometimes have to manually search the map though, locations will not always show up when searched in search bar.
- FindLatitudeLongitude: Another general mapping / searchable tool.