Georeferencing Notes

§sdm §data

GBIF Data

GBIF is the main online clearing house for occurrence data in the world. It includes most (but not all) online herbarium databases. It also includes iNaturalist records, as well as a number of other survey repositories. I’m not familiar with all of the sources included.

The main page presents a search bar. Enter your species name (e.g., Rubus chamaemorus) in the box, and select the ‘occurrences’ tab. You’ll be taken to a list of results that match your species (2.9e6 results for R. chamaemorus). This may include records for other species, presumably because the name of the species you are searching for is recorded in the metadata for those records. We don’t usually want these.

To fix this, look for the prompt: “Your search matches a Species: ‘Rubus chamaemorus L.’. Do you wish to limit your search to this taxon only?”, in the left-hand toolbar. Select yes to restrict the results to only records for your species (i.e., excluding records where your species is only mentioned in the metadata). For R. chamaemorus we drop from 2.9e6 records to 87e3.

Basis of Record

In the right-hand toolbar, look for the option Basis of Record. Expanding that tab, you see a list of options, along with the number of records for each option. For R. chamaemorus I see:

  • Observation (6)
  • Machine Observation (46)
  • Human Observation (79686)
  • Material Sample (516)
  • Material Citation (1)
  • Preserved Specimens (6032)
  • Fossil Specimen (133)
  • Living Specimen (28)
  • Occurrence (1222)

The Basis of Record field is no longer recommended, and is deprecated in the DarwinCore system. Looking through the various categories, deprecating it seems like a good idea. I’m not sure what the difference is between ‘Observation’, ‘Machine Observation’, ‘Occurrence’, etc., and it looks like the groups are applied very inconsistently.

‘Human Observation’ does include iNaturalist records, but it also includes a lot of other data. The iNaturalist records are updated periodically, but it’s also possible to download them directly from iNaturalist. That’s my preference, as then I don’t need to concern myself with what else might be included under ‘Human Observation’.

To restrict ourselves to herbarium records, we’ll select ‘Preserved Specimens’ and move on. For R. chamaemorus that brings us down to 5.9e3 records.
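The same two filters (taxon and Basis of Record) can also be written as a query against the GBIF web API, which is handy for checking counts without clicking through the site. This is a minimal sketch and not part of our workflow: it only builds the request URLs, it does not fetch anything, and the taxon key 12345 is a made-up placeholder (look up the real key with the species match URL first).

```python
from urllib.parse import urlencode

GBIF_API = "https://api.gbif.org/v1"


def species_match_url(name):
    """URL asking GBIF to resolve a scientific name to its taxon key."""
    return GBIF_API + "/species/match?" + urlencode({"name": name})


def occurrence_search_url(taxon_key, basis_of_record="PRESERVED_SPECIMEN", limit=20):
    """URL for occurrence records of one taxon, filtered by Basis of Record."""
    params = {
        "taxonKey": taxon_key,
        "basisOfRecord": basis_of_record,
        "limit": limit,
    }
    return GBIF_API + "/occurrence/search?" + urlencode(params)


print(species_match_url("Rubus chamaemorus"))
# 12345 is a placeholder, not the real key for R. chamaemorus.
print(occurrence_search_url(12345))
```

Opening either URL in a browser returns JSON. This is only useful for a quick look; the actual data should still come from a proper download, so that you get a DOI to cite.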

Geographic Filters

With our data filtered to just herbarium records, we can see where they are from by clicking on the ‘map’ tab at the top of the table. Doing that I see that of the 5900 records available, 2700 have coordinates already.

The rectangle and pentagon symbols in the upper right provide tools for further filtering the results by rectangle or polygon, respectively. You can also use the buttons on the upper left to zoom in and out, and to examine record details for an area (the arrow pointing at a circle). In my case, I want all records for the planet, so I’ll skip those options.

Downloading

Now I can download the records. I think GBIF requires that you have a (free) account for this, which helps them track use. The ‘simple’ format is sufficient; the DarwinCore format includes a lot of extra columns.

VERY IMPORTANT!

Record the DOI for your download! You will need this to properly cite the data you use in a publication. If you have a GBIF account, you may be able to recover it after the fact. You can still cite GBIF in your paper without a DOI for the download, but it’s much less useful.

For my R. chamaemorus data, GBIF provides the following citation:

GBIF.org (23 March 2022) GBIF Occurrence Download https://doi.org/10.15468/dl.uhqfg3

For large queries, it may take several minutes or longer for the data to be ready. You’ll get an email when it’s available.

Data Processing

Uploading

Once you have your data, we’ll upload it to our shared Google Drive for processing. From the shared drive, select New -> File upload, and select the unzipped .csv file from your GBIF download. Double-click on the file, and select ‘Open With Google Sheets’ at the top of the page.

The filename is listed in the top left corner. Change that to the species name, possibly with any additional details to distinguish it from other files with the same species. For most cases, when we have a single global file for the species, just the name is fine.

However, we do want to link our data to the GBIF DOI. I do this by renaming the sheet to the DOI url: https://doi.org/10.15468/dl.uhqfg3. If we need to maintain separate datasets for the same species, they can be added as new sheets. This hasn’t come up yet.

Prepping the Spreadsheet

Even the ‘simple’ export has a lot of extra columns we don’t need. These can be ‘hidden’ from view to make it easier to work with the file. The main columns we use are:

  • countryCode
  • locality
  • stateProvince
  • decimalLatitude
  • decimalLongitude
  • coordinateUncertaintyInMeters
  • georeferencedBy
  • georeferencedDate
  • georeferenceSources
  • georeferenceProtocol

We do need to add several columns for our georeferencing. Find the column coordinateUncertaintyInMeters, and add five columns to the right. The names of these columns are:

  • georeferencedBy
  • georeferencedDate
  • georeferenceSources
  • georeferenceProtocol
  • georeferenceRemarks
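If you ever need to do this step outside Google Sheets, the column insertion can be sketched as below. This is only an illustration: the function name add_georef_columns is my own, and it works on a plain list of header names rather than on a real file.

```python
ANCHOR = "coordinateUncertaintyInMeters"
GEOREF_COLUMNS = [
    "georeferencedBy",
    "georeferencedDate",
    "georeferenceSources",
    "georeferenceProtocol",
    "georeferenceRemarks",
]


def add_georef_columns(header):
    """Return a new header with any missing georeference columns inserted
    directly to the right of coordinateUncertaintyInMeters.

    Columns that already exist are left where they are. If the anchor
    column is absent, the new columns go at the end.
    """
    missing = [col for col in GEOREF_COLUMNS if col not in header]
    position = header.index(ANCHOR) + 1 if ANCHOR in header else len(header)
    return header[:position] + missing + header[position:]


header = ["gbifID", "decimalLatitude", "decimalLongitude", ANCHOR, "eventDate"]
print(add_georef_columns(header))
```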

Data Entry

For records that already have coordinates, don’t change them or add any additional notes.

For records that you locate:

  • Add your name to georeferencedBy. Use the same spelling/abbreviation every time

  • Add the date in yyyy-mm-dd format (e.g., 2022-03-23) to georeferencedDate

  • Add the coordinates to decimalLatitude and decimalLongitude

  • Add the uncertainty in meters to coordinateUncertaintyInMeters

  • Add the location source to georeferenceSources (usually this will be “GoogleMaps” or “GeoLocate”)

georeferenceRemarks is for comments regarding your efforts to locate the record. Some possible values include:

  • AttemptedGoogleMaps: I tried to find the location using Google Maps, and could not
  • AttemptedGetty: I tried to find the location using the Getty Thesaurus of Geographic Names
  • AttemptedGNIS: I tried to find the location using the Geographic Names Information System (GNIS)
  • AttemptedNRC: I tried to find the location using Natural Resources Canada, geographical place/feature names of Canada
  • LocationNotFound: I could not find this location / I could not georeference it
  • ContradictingLocation: I could not georeference the location as there are two locations mentioned which are not close to each other
  • Duplicate: duplicate georeference; other georeferences share the same lat/long and uncertainty
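The data-entry rules above are easy to get slightly wrong when you are working through hundreds of rows, so a quick automated check can help. The sketch below is a suggestion rather than something we currently run: the function name entry_problems and the wording of the messages are my own, and it assumes each row is a dictionary of text cells keyed by column name.

```python
import re
from datetime import date


def entry_problems(row):
    """List the data-entry problems found in one georeferenced row."""
    problems = []

    if not (row.get("georeferencedBy") or "").strip():
        problems.append("georeferencedBy is empty")

    stamp = (row.get("georeferencedDate") or "").strip()
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", stamp):
        problems.append("georeferencedDate is not yyyy-mm-dd")
    else:
        try:
            date.fromisoformat(stamp)  # rejects e.g. 2022-02-30
        except ValueError:
            problems.append("georeferencedDate is not a real calendar date")

    try:
        lat = float(row.get("decimalLatitude") or "")
        lon = float(row.get("decimalLongitude") or "")
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            problems.append("coordinates out of range")
    except ValueError:
        problems.append("coordinates missing or not numeric")

    try:
        if float(row.get("coordinateUncertaintyInMeters") or "") <= 0:
            problems.append("uncertainty must be positive")
    except ValueError:
        problems.append("uncertainty missing or not numeric")

    if not (row.get("georeferenceSources") or "").strip():
        problems.append("georeferenceSources is empty")

    return problems


row = {
    "georeferencedBy": "A. Name",
    "georeferencedDate": "23/03/2022",
    "decimalLatitude": "45.42153",
    "decimalLongitude": "-75.69719",
    "coordinateUncertaintyInMeters": "1000",
    "georeferenceSources": "GoogleMaps",
}
print(entry_problems(row))
```

Only run a check like this on rows you located yourself; records that arrived with coordinates are left untouched and will not have these fields filled in.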

Problems

Inconsistencies in Source Data

We don’t normally need to modify the original data. However, if you see the following known errors they can be corrected as indicated.

  • A latitude and longitude of 0, 0 is an error. It can be overwritten; do so as you work through the data.

  • A coordinateUncertaintyInMeters value without latitude and longitude coordinates is an error. It can be overwritten; do so as you work through the data.

  • Data with no country code: if no other locality data is given, the record cannot be georeferenced. If locality data is given, you can add the country code and province information.

  • Localities with no province should be corrected as you work through the data. Either add the province that is listed in the locality to the stateProvince column, or add the province that applies given the country and locality.
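These known errors can be flagged automatically before you start, so you know which rows to look at. The sketch below is one way to do it, under the assumption that each row is a dictionary of text cells keyed by column name; the flag names are made up for illustration.

```python
def _number(value):
    """Parse a text cell as a float, or return None if blank or invalid."""
    try:
        return float((value or "").strip())
    except ValueError:
        return None


def source_flags(row):
    """Return the known source-data problems that apply to one row."""
    lat = _number(row.get("decimalLatitude"))
    lon = _number(row.get("decimalLongitude"))
    has_coords = lat is not None and lon is not None

    flags = []
    if has_coords and lat == 0 and lon == 0:
        flags.append("zero_coordinates")
    if not has_coords and (row.get("coordinateUncertaintyInMeters") or "").strip():
        flags.append("uncertainty_without_coordinates")
    if not (row.get("countryCode") or "").strip():
        flags.append("missing_country_code")
    if not (row.get("stateProvince") or "").strip():
        flags.append("missing_state_province")
    return flags


row = {"decimalLatitude": "0", "decimalLongitude": "0.0",
       "countryCode": "CA", "stateProvince": "Ontario"}
print(source_flags(row))
```

The flags only point you at rows to check. Deciding what the corrected value should be is still a manual job.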

Finding Coordinates

Start by sorting the spreadsheet by country, state within country, and locality within state. This will save time, as you’ll be processing records from the same location together. In many cases you’ll have multiple records in the same state, and possibly in the same town. If you’re lucky, there will be records with coordinates for the same location as records without coordinates. Sorting allows you to take advantage of this.
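In Google Sheets this is just a multi-column sort, but the same idea can be sketched in code, including a quick way to spot the lucky cases where a locality has records both with and without coordinates. The function names here are my own, and matching localities by exact (case-insensitive) text is a simplifying assumption: real locality strings rarely match this cleanly.

```python
from collections import defaultdict

SORT_COLUMNS = ("countryCode", "stateProvince", "locality")


def place_key(row):
    """Country, state and locality, normalised for sorting and matching."""
    return tuple((row.get(col) or "").strip().casefold() for col in SORT_COLUMNS)


def has_coordinates(row):
    return bool((row.get("decimalLatitude") or "").strip()
                and (row.get("decimalLongitude") or "").strip())


def sort_by_place(rows):
    """Sort by country, then state within country, then locality."""
    return sorted(rows, key=place_key)


def shared_localities(rows):
    """Places with at least one record with and one without coordinates."""
    seen = defaultdict(set)
    for row in rows:
        seen[place_key(row)].add(has_coordinates(row))
    return [key for key, states in seen.items() if states == {True, False}]


rows = [
    {"gbifID": "1", "countryCode": "SE", "stateProvince": "Jonkoping",
     "locality": "Vrigstad", "decimalLatitude": "", "decimalLongitude": ""},
    {"gbifID": "2", "countryCode": "CA", "stateProvince": "Ontario",
     "locality": "Ottawa", "decimalLatitude": "45.42153", "decimalLongitude": "-75.69719"},
    {"gbifID": "3", "countryCode": "CA", "stateProvince": "Ontario",
     "locality": "ottawa ", "decimalLatitude": "", "decimalLongitude": ""},
]
print([row["gbifID"] for row in sort_by_place(rows)])
print(shared_localities(rows))
```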

The final step is actually finding the coordinates. Hopefully there’s enough detail in the locality field that you can search for landmarks in Google Maps, and use the satellite view to get close to the actual location. What is possible really depends on the record.

Google Maps

Google Maps has a very straightforward interface, and is the simplest option. The GeoLocate web client provides a slightly more sophisticated search tool, which allows you to use administrative boundaries or editable circles to set uncertainty values.

GeoLocate

GeoLocate also supports batch processing. This allows you to upload a file and process a group of records together, automatically. You will still need to review the results to make sure they make sense, but you can manually edit them directly in the map viewer. When you’re done you can export the results as a csv file. To use this approach, you need to prepare the spreadsheet according to the instructions, and we need a simple approach for exporting from Google Sheets, and reimporting back into Google Sheets when we’re done.

Record coordinates to five decimal places. At the equator, 1 degree is about 111 km, so 0.00001 degrees is about 1 m. We don’t need sub-meter accuracy, and none of these records are that precise.
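The arithmetic behind the five-decimal rule can be checked directly. This sketch treats the Earth as a sphere with a mean radius of 6371 km, which is an approximation but plenty for this purpose; the function names are my own.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean radius; a spherical Earth is an assumption
METERS_PER_DEGREE_LAT = math.pi / 180 * EARTH_RADIUS_M  # roughly 111 km


def meters_per_degree_lon(latitude):
    """Length of one degree of longitude, which shrinks toward the poles."""
    return METERS_PER_DEGREE_LAT * math.cos(math.radians(latitude))


def round_coordinate(value, places=5):
    """Round a coordinate (text or number) to five decimal places."""
    return round(float(value), places)


# Size on the ground of the fifth decimal place, in meters.
print(METERS_PER_DEGREE_LAT * 0.00001)
print(meters_per_degree_lon(60) * 0.00001)

print(round_coordinate("45.4215296"))  # 45.42153
```

At higher latitudes a step of 0.00001 degrees of longitude covers even less ground than at the equator, so five decimal places is never too coarse for our records.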

Uncertainty

Estimating uncertainty is a judgement call. If the locality data is very precise (“the corner of Main and Second Avenue in Springfield”) you might have a very small uncertainty. If all you have is a state, your coordinate will be the center of the state, and the uncertainty will be the distance from that center to the farthest edge of the state, since coordinateUncertaintyInMeters is a radius around the point. Don’t worry about being precise: 1 m, 100 m, 1000 m, 10000 m (or bigger) are fine.

For vague/approximate locations, do what you can. Try to think like a field biologist. How far from a town do you need to be before you stop describing your location as “South of X”, and start describing it as “between X and Y”?
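If you want a number to start from, you can measure the distance between your chosen point and the farthest place the collector could plausibly have meant, then snap it up to a coarse step. The sketch below is only a helper for that judgement call: the haversine formula assumes a spherical Earth, and rounding up to the next power of ten is my reading of the coarse values suggested above, not a fixed rule.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean radius; spherical Earth is an assumption


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in meters between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def coarse_uncertainty(meters):
    """Round an estimate up to the next power of ten (1, 10, 100, ...)."""
    step = 1
    while step < meters:
        step *= 10
    return step


# Two points 0.02 degrees of longitude apart on the equator.
spread = distance_m(0, 0, 0, 0.02)
print(round(spread), coarse_uncertainty(spread))
```

Rounding up rather than to the nearest step errs on the side of a larger uncertainty, which is the safer mistake to make.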

Priorities

Some records take a lot of work to locate, and some are impossible. We’re most interested in filling in gaps. If we have a thousand records from Ottawa, we don’t need to waste a lot of energy tracking down the coordinates for one more record. But if we only have one record for China, it will be worth spending some time trying to find it. Similarly, earlier records are rarer and more valuable: the 100th record from New York (in 2021) doesn’t add as much value as the first one (from 1830).
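One rough way to turn these priorities into a work order is to count how many located records each country already has, then sort the unlocated records so that poorly covered countries and older collections come first. This is a sketch of that idea only: it assumes the export has a year column, uses country as the only measure of a gap, and the function name georeferencing_queue is my own.

```python
from collections import Counter


def _has_coords(row):
    return bool((row.get("decimalLatitude") or "").strip()
                and (row.get("decimalLongitude") or "").strip())


def _year(row):
    """Collection year as an integer; unknown years sort last."""
    try:
        return int((row.get("year") or "").strip())
    except ValueError:
        return 9999


def georeferencing_queue(rows):
    """Unlocated records, rarest country first, then oldest first."""
    located = Counter(row.get("countryCode") or ""
                      for row in rows if _has_coords(row))
    todo = [row for row in rows if not _has_coords(row)]
    return sorted(todo, key=lambda row: (located[row.get("countryCode") or ""],
                                         _year(row)))


rows = [
    {"gbifID": "1", "countryCode": "CA", "year": "1990",
     "decimalLatitude": "45.4", "decimalLongitude": "-75.7"},
    {"gbifID": "2", "countryCode": "CA", "year": "2001",
     "decimalLatitude": "46.1", "decimalLongitude": "-74.2"},
    {"gbifID": "3", "countryCode": "CA", "year": "2021",
     "decimalLatitude": "", "decimalLongitude": ""},
    {"gbifID": "4", "countryCode": "CN", "year": "1950",
     "decimalLatitude": "", "decimalLongitude": ""},
    {"gbifID": "5", "countryCode": "CN", "year": "1830",
     "decimalLatitude": "", "decimalLongitude": ""},
]
print([row["gbifID"] for row in georeferencing_queue(rows)])
```

The order is a starting point, not a rule. A record at the top of the queue that turns out to be impossible to locate should still be marked and skipped.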

Tips For Deciphering Labels

Compiled by Laura Kostyniuk.

Abbreviations seen in data

Latin

  • opp. = likely oppidum, town in English
  • fl. = likely flumen, river in English
  • urb. = likely urbs/urbem, city in English

Other languages

  • VC or v.c. in Great Britain codes likely stands for vice county. I’m not sure whether this applies to other countries
  • Arred. / Arred. de in Portugal is an abbreviation for Arredores, meaning ‘surroundings of’
  • Locations in Sweden starting with W often do not show up in Google Maps. Sometimes they can be found if the W is changed to a V (e.g., Wrigstad = no hits, Vrigstad = village in the right general location).
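The Swedish W/V tip is mechanical enough to script if you have a long list of place names to try. This is a tiny sketch with a made-up function name; it only handles the leading letter, as described above, and makes no claim about other spelling changes.

```python
def swedish_spelling_variants(name):
    """Return the name as written, plus a V-for-W variant if it starts with W."""
    variants = [name]
    if name.startswith("W"):
        variants.append("V" + name[1:])
    elif name.startswith("w"):
        variants.append("v" + name[1:])
    return variants


print(swedish_spelling_variants("Wrigstad"))  # ['Wrigstad', 'Vrigstad']
```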

Other Sources

Use these for valuable records you can’t locate in Google Maps or GeoLocate. Compiled by Laura Kostyniuk and Shannon Ascensio.

Place/feature names, gazetteers

Coordinate converters

Useful maps