
Tyler Smith

Plant taxonomist
Adjunct professor
Field botanist


The Flora of North America is a great resource for botanists. The books are nice, and even better, almost all of the keys and images are also freely available online. These keys are generally the best available for a genus or family, unless you’re lucky enough to work in an area with a very recent local flora.

However, since the keys have to include all the species found anywhere in the US and Canada, they tend to be long and convoluted. On occasion I’ve rewritten some of the keys, to trim them down to the species where I live. This is tedious work, and I usually give up before I’ve done more than one or two of my favourite Carex sections.

Enter Python. I’ve been looking for a good project to try Python for a while, and it turns out it’s a great tool for scraping websites and reformatting the data to suit your needs.

(If you’re not interested in programming, you may want to skip to the end product, which is my draft key to the sedges of Ontario. The rest of this post is about the code I wrote to make it. On the other hand, if you are interested in Python, what follows may horrify you. It’s my first Python program, so it’s bound to be ugly. This may lead you to wonder who exactly is the intended audience for this post. That’s a good question.)

In a nutshell, it works like this:

import fna
## Scrape the Eriophorum pages from the site
eriophorum = fna.scrapeTaxon("Eriophorum")
## Turn the data into a key (actually a nested list)
eriophKey = fna.makeKey(eriophorum)
## Extract only the clues that lead to Ontario species
eriophON = fna.selectKey(eriophKey, reg = "Ont.")
## Write your key to a file, formatted for LaTeX
fna.writeLatexKey(eriophON, outfile = "eriophON.tex", 
              title = "\\emph{Eriophorum}", abbrev = "\\emph{E.}~")

What comes out the other end, in this case saved to the file eriophON.tex, is the FNA key, except with only the species found in Ontario. The actual output is designed to be used with the dichokey LaTeX package, so it needs to be wrapped with appropriate headers and closing tags. I’m collating multiple keys into a single document, so I have a master file that looks something like this:

\documentclass[twocolumn]{article}
\usepackage[landscape,margin=0.75in]{geometry}
\usepackage{dichokey}
\usepackage{gensymb}
\usepackage{tgschola}
\usepackage[T1]{fontenc}

\title{Keys to the Cyperaceae of Ontario}

\begin{document}
\input{eriophON}
\end{document}

If you aren’t familiar with LaTeX, you can also use fna.writeHtmlKey(). I haven’t put much time into that yet, but it does produce a self-contained HTML file.
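To make the selectKey step in the pipeline above a little more concrete, here is a toy sketch of the idea in Python 3. It is not the code from the fna module. The clue layout ([label, text, target]), the species dictionaries, and the name prune are all assumptions made up for illustration, and the species are placeholders.

```python
def prune(couplet, region):
    """Return a pruned copy of a key, keeping only clues that lead to `region`.

    Assumed layout (hypothetical): a key is a list of clues, each clue is
    [label, text, target], and a target is either a species dict or a
    nested key.
    """
    kept = []
    for label, text, target in couplet:
        if isinstance(target, dict):
            # Leaf: keep the clue only if the species occurs in the region
            if region in target["regions"]:
                kept.append([label, text, target])
        else:
            # Nested key: prune it first, keep the clue if anything survives
            sub = prune(target, region)
            if sub:
                kept.append([label, text, sub])
    return kept


# Placeholder data, not real FNA content
toy_key = [
    ["1", "first lead", {"name": "Species A", "regions": ["Ont.", "Que."]}],
    ["1+", "second lead", [
        ["2", "third lead", {"name": "Species B", "regions": ["Alaska"]}],
        ["2+", "fourth lead", {"name": "Species C", "regions": ["Ont."]}],
    ]],
]

ontario = prune(toy_key, "Ont.")
```

In this sketch a couplet that loses one of its two leads is left as a single-lead couplet. Tidying those up is part of the hand-editing a truncated key still needs.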

The source code is available from my Bitbucket repository. It’s a work in progress, and a first effort, so comments and criticisms are welcome.

A few more comments on things I found interesting:

Memoizing url requests

The FNA website is kind of slow, and working on a scraper involves sending a lot of requests. To speed things up, I build a local cache of the webpages, so only the first request for a webpage goes to the net; all subsequent calls use the local version:

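import urllib2  # Python 2; in Python 3, use urllib.request instead
from bs4 import BeautifulSoup

# Cache of raw page source, keyed by URL
URLDICT = dict()

def fetchUrl(url, verbose=False):
    if url not in URLDICT:
        if verbose:
            print("***fetching from the network***")
        # Only the first request for a URL goes to the network
        URLDICT[url] = urllib2.urlopen(url).read()
    elif verbose:
        print("***fetching from cache***")
    # Parsing happens on every call; only the raw HTML is cached
    return BeautifulSoup(URLDICT[url], "lxml")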

If you want to save the cache at the end of a session, use fna.saveDict(). Reload it with fna.URLDICT = fna.loadDict(). Unfortunately, processing the raw html with Beautiful Soup is also slow, and there’s no straightforward way to save the result to file.

This is called memoization, which I read about in Conrad Barski’s fantastic book Land of Lisp. It’s a really simple trick, and saves a few minutes every time I have to reparse Carex section Ovales. It will be even more useful when Crataegus goes online, which will hopefully happen later this year.
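Here is a minimal, network-free sketch of the same caching idea in Python 3, with one possible way to persist the cache between sessions. The names make_cached_fetcher, save_cache, load_cache and fake_fetch are hypothetical, and using pickle is my assumption for the sketch. It is not necessarily how fna.saveDict() and fna.loadDict() work.

```python
import pickle


def make_cached_fetcher(fetch, cache=None):
    """Wrap `fetch` so each URL is only requested once."""
    cache = {} if cache is None else cache

    def cached(url):
        if url not in cache:
            cache[url] = fetch(url)  # first request: do the real work
        return cache[url]            # later requests: reuse the stored page

    return cached, cache


def save_cache(cache, path):
    """Write the cache dictionary to disk (assumes picklable values)."""
    with open(path, "wb") as fh:
        pickle.dump(cache, fh)


def load_cache(path):
    """Read a cache dictionary previously written by save_cache."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Stand-in for a slow network call, so the sketch runs offline
calls = []

def fake_fetch(url):
    calls.append(url)
    return "<html>" + url + "</html>"


fetch_page, cache = make_cached_fetcher(fake_fetch)
fetch_page("http://example.org/a")
fetch_page("http://example.org/a")  # served from the cache
```

After the two calls, calls holds a single entry, because the second request never reaches fake_fetch. Storing raw page source as plain strings or bytes keeps the cache easy to pickle.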

Coping with idiosyncratic formatting

I’m trying to build a general set of tools, but it’s challenging because the FNA is not entirely consistent. Some of the keys contain errors, or are missing entirely (i.e., Cyperus). There are also monotypic genera, and genera with one or more levels of sub-sectioning. Carex has several levels of keys above the sectional keys, and then only some of the sections have proper keys themselves. To deal with this, I use a lower-level approach to fine-tune the keys I extract. This is the function scrapeTaxonDev, which uses taxon_id and, optionally, key_no, in place of the name of the taxon itself. You can find these numbers on the links on the FNA website. For example, the key to Carex section Ovales west of the Rockies, which is linked from the main section Ovales, is http://www.efloras.org/florataxon.aspx?flora_id=1&taxon_id=302719&key_no=2. So taxon_id = 302719 and key_no = 2.
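If you would rather not read the numbers off the link by eye, the Python 3 standard library can pull them out of the URL. This is only a sketch: the helper name fna_ids is hypothetical and is not part of the fna module.

```python
from urllib.parse import urlparse, parse_qs


def fna_ids(url):
    """Return (taxon_id, key_no) from an FNA link; key_no may be None."""
    query = parse_qs(urlparse(url).query)
    taxon_id = int(query["taxon_id"][0])
    # key_no is optional: many taxon pages have only one key
    key_no = int(query["key_no"][0]) if "key_no" in query else None
    return taxon_id, key_no


url = ("http://www.efloras.org/florataxon.aspx"
       "?flora_id=1&taxon_id=302719&key_no=2")
taxon_id, key_no = fna_ids(url)
```

For the section Ovales link above, this gives taxon_id 302719 and key_no 2, matching the values read from the URL by hand.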

In addition, both scrapeTaxon and scrapeTaxonDev take an optional depth argument. This allows you to tell the scraper how deep to go in the website before it starts recording names. For example, the Eleocharis key starts with a key to subgenera. If you call scrapeTaxon("Eleocharis") on its own, the species names come with the subgenus attached. scrapeTaxon("Eleocharis", depth = 1) fixes that.

Pass by magic

Python doesn’t pass arguments like normal languages. It’s not pass by value, and it’s not pass by reference. It’s something different, referred to as pass by sharing, or call by object. I don’t really understand it yet. But it lets you do things like this:

eleoch = fna.scrapeTaxon("Eleocharis", depth = 1)
eleochKey = fna.makeKey(eleoch)
eleochON = fna.selectKey(eleochKey, reg = "Ont.")
tmp = fna.getLabel(eleochON, label = "3+")
subeleoch = tmp[2]
tmp[2] = fna.endText("subgenus Eleocharis")

The first three lines prepare the Eleocharis key, as above. Then I extract the clue with the label 3+ and point the variable tmp at it. Then I point the variable subeleoch at the third element of this clue, which is its target. Then I point the third element of tmp at a new terminal key text.

After all that, the original eleochON is truncated: the target of clue 3+ is now a single text element, rather than a separate branch of the key. But that branch of the key still exists, and is accessed via subeleoch. This allows me to cut the big Eleocharis key into two pieces, and process them both separately.
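The same behaviour can be seen without the fna module at all. The following Python 3 sketch uses a toy nested list as a stand-in for a key. The clue layout and all of the text in it are placeholders I made up for illustration.

```python
# Toy stand-in for a key: each clue is [label, text, target]
branch = [["4", "placeholder lead", "species P"],
          ["4+", "placeholder lead", "species Q"]]
key = [["3", "placeholder lead", "species R"],
       ["3+", "placeholder lead", branch]]

clue = key[1]           # a second name for the same list object, not a copy
sub_key = clue[2]       # a second name for the nested branch
clue[2] = "end of key"  # mutating the shared list changes what key sees


def rebind(items):
    items = ["replaced"]   # rebinds the local name only


def mutate(items):
    items.append("added")  # changes the object the caller also holds


data = ["original"]
rebind(data)  # data is unchanged
mutate(data)  # data now has a second element
```

Assignment to a name never copies the list, so key, clue and sub_key can all reach the same objects. Changing a list in place is visible through every name that points at it, while pointing a name at something new only affects that name.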

Moving forward

It’s not exactly a work of art, but I find it useful. Excluding Carex, which is a bit tricky with all the sections, and Cyperus, which is missing, I used that code to generate a key to the sedges of Ontario in about half an hour. It still needs a bit of personal attention to correct odd formatting and other minor glitches, but the core information is all there. Carex is underway; it just requires more hand-coding to deal with the various sections and nested keys.

This is stage one of my master plan. Stage two is updating the keys to reflect recent taxonomic work, and more ambitiously, simplifying the keys beyond the naive truncation that can be done with code. This latter step will require actual botanical work, rather than weekend hacking. Since I’m not officially working on sedges at the moment, that may be a slow process. If anyone wants to contribute ideas, I’d be happy to work them in.

The current key is posted here, and I’ll keep updating it as I work. I hope to have the Cyperaceae complete by the beginning of the field season (at least, complete inasmuch as I have scraped and formatted all the FNA keys; updating the taxonomy will take longer). I’ll continue to update the code on Bitbucket as well. Drop me a line if you find it useful, or want to add anything.