It’s been more than a year since I posted my first experiments in reading
fsa files in R. Since then,
I’ve put together a complete working solution that lets me run
my entire AFLP analysis workflow within R, without resorting to expensive
proprietary programs like GeneMapper or GeneMarker.
So far I’ve tested it on only two Debian GNU/Linux installations and one Windows machine, so I’m very curious whether it works on other setups. The graphical bin editor in particular may require some platform-dependent tweaking.
I haven’t put together a vignette, but the help files have a complete
working example – see
- reading ABI
- normalizing electropherograms
- identifying and sizing peaks
- viewing individual electropherograms
- dropping, adding and renaming samples
- automated peak binning, using the RawGeno algorithm
- visually editing bins
- generating presence-absence matrices for further analysis in R (or export for use in other programs, if you like)
Reading .fsa files
The first three steps are glued together in the
readFSA function. The
slowest part of the entire process is calibrating the size standard,
necessary for sizing peaks. I use the algorithm from the
slightly tweaked. The bottleneck occurs in a loop that runs
lm on various
subsets of potential ladder peaks to determine the best fit. It may be
worth re-implementing this in C. As is, it may take 6-10 minutes to read in
a large (100+) number of
It just occurred to me that the length of this step is also affected by the choice of ladder. Specifically, we use the GS500(-250) ladder. This standard has a ‘250’ bp peak that you are supposed to ignore (it is usually closer to 246 bp in my tests). Consequently, the algorithm has to spend considerable cycles figuring out which peak corresponds to the ‘250’ peak so it can be excluded. Other standards might give you a slightly faster processing time, if that’s important to you.
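To give a feel for why this loop gets expensive, here is a minimal sketch of the general idea. It is not the code used in binner: the function name `best_ladder_fit`, the toy data, and the choice of a plain linear fit scored by residual sum of squares are all assumptions made for illustration. It tries every ordered subset of candidate peaks against the known ladder sizes and keeps the subset that `lm` fits best.

```r
# Hypothetical helper (not part of binner): choose the subset of candidate
# peaks that best matches the known ladder sizes, scored by the residual
# sum of squares of a simple linear fit.
best_ladder_fit <- function(candidate_scans, ladder_sizes) {
  candidate_scans <- sort(candidate_scans)
  # Each column of `subsets` is one ordered choice of candidate peaks.
  subsets <- combn(candidate_scans, length(ladder_sizes))
  rss <- apply(subsets, 2, function(scans) {
    fit <- lm(ladder_sizes ~ scans)
    sum(residuals(fit)^2)
  })
  best <- subsets[, which.min(rss)]
  list(scans = best, fit = lm(ladder_sizes ~ best), n_fits = ncol(subsets))
}

# Toy data: five real ladder peaks plus two spurious peaks (scans 1310, 2050).
ladder_sizes <- c(100, 150, 200, 300, 350)
true_scans   <- c(1002, 1499, 2000, 3001, 3498)
candidates   <- c(true_scans, 1310, 2050)

result <- best_ladder_fit(candidates, ladder_sizes)
result$scans   # the candidate peaks chosen as the ladder
result$n_fits  # how many lm() calls were needed
```

The number of fits is `choose(n_candidates, n_ladder_peaks)`, so every extra spurious peak (or a peak that has to be identified and excluded) multiplies the work. That is the kind of growth that makes the calibration step the slow part.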
binner uses the same sizing calculations as the commercial PeakScanner
and GeneMapper programs (i.e., the Local Southern method). So other than being
(slightly) slower, it produces near-identical results. (There are slight
differences due to different smoothing parameters.)
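For readers who haven’t met the Local Southern method, here is a rough sketch of it as it is usually described: a reciprocal curve, size = L0 + c / (scan - m0), is passed exactly through three neighbouring ladder peaks, once using two peaks below and one above the unknown, and once using one below and two above; the two estimates are averaged. This is not binner’s implementation. The function names `southern_fit` and `local_southern` and the toy ladder are assumptions for illustration only.

```r
# Fit size = L0 + c / (scan - m0) exactly through three ladder points.
# Hypothetical helper; returns a function that sizes a scan position.
southern_fit <- function(scan, size) {
  stopifnot(length(scan) == 3, length(size) == 3)
  A <- ((size[1] - size[2]) * (scan[3] - scan[2])) /
       ((size[2] - size[3]) * (scan[2] - scan[1]))
  m0 <- (scan[3] - A * scan[1]) / (1 - A)  # undefined if the points are collinear
  c_ <- (size[1] - size[2]) * (scan[1] - m0) * (scan[2] - m0) /
        (scan[2] - scan[1])
  L0 <- size[1] - c_ / (scan[1] - m0)
  function(x) L0 + c_ / (x - m0)
}

# Size one unknown peak; assumes ladder_scan is sorted in increasing order.
local_southern <- function(peak_scan, ladder_scan, ladder_size) {
  below <- which(ladder_scan <= peak_scan)
  above <- which(ladder_scan > peak_scan)
  # Two ladder peaks are needed on each side; otherwise leave the size undefined.
  if (length(below) < 2 || length(above) < 2) return(NA_real_)
  i <- max(below)            # nearest ladder peak at or below the unknown
  lower <- (i - 1):(i + 1)   # two below, one above
  upper <- i:(i + 2)         # one below, two above
  est1 <- southern_fit(ladder_scan[lower], ladder_size[lower])(peak_scan)
  est2 <- southern_fit(ladder_scan[upper], ladder_size[upper])(peak_scan)
  mean(c(est1, est2))
}

# Toy ladder generated from a known reciprocal curve (made-up parameters).
toy_curve   <- function(s) -200 + (-3e6) / (s - 12000)
ladder_scan <- c(2000, 3000, 4000, 5000, 6000, 7000)
ladder_size <- toy_curve(ladder_scan)

local_southern(4500, ladder_scan, ladder_size)  # compare with toy_curve(4500)
local_southern(2500, ladder_scan, ladder_size)  # NA: only one ladder peak below
```

Because each peak is sized from only its nearest ladder peaks, small differences in how peak positions are detected (for example, from smoothing) shift the result slightly, which is consistent with the near-identical but not identical results mentioned above.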
The most innovative feature of this package, and the only thing not
currently available in either
RawGeno, is a facility for
manually reviewing and editing bin boundaries. The main value of this is in
allowing you to exclude regions where there is no clear division between
bins. The binning algorithm will make a decision, but it’s not always
appropriate. With the bin editor, you can decide for yourself if you want
to delete questionable bins, or modify the boundaries manually.
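To make the effect of a manual edit concrete, here is a small sketch of scoring peaks against bins before and after deleting a questionable bin. It does not use binner’s own data structures or functions: the `peaks` and `bins` data frames, their column names, and the `score_bins` helper are all hypothetical.

```r
# Hypothetical data: sized peaks per sample, and bins as start/end ranges in bp.
peaks <- data.frame(
  sample = c("s1", "s1", "s2", "s2", "s3"),
  size   = c(101.2, 150.4, 100.9, 203.1, 150.6),
  stringsAsFactors = FALSE
)
bins <- data.frame(
  name  = c("b100", "b150", "b200"),
  start = c(100.5, 150.0, 202.5),
  end   = c(101.5, 151.0, 203.5),
  stringsAsFactors = FALSE
)

# Score each sample as 1 (peak present) or 0 (absent) for every bin.
score_bins <- function(peaks, bins) {
  samples <- unique(peaks$sample)
  pa <- matrix(0L, nrow = length(samples), ncol = nrow(bins),
               dimnames = list(samples, bins$name))
  for (j in seq_len(nrow(bins))) {
    hit <- peaks$size >= bins$start[j] & peaks$size <= bins$end[j]
    if (any(hit)) pa[unique(peaks$sample[hit]), j] <- 1L
  }
  pa
}

pa_all <- score_bins(peaks, bins)

# Manual edit: drop a bin judged unreliable, then score again.
pa_edited <- score_bins(peaks, bins[bins$name != "b150", ])
```

Deleting a bin simply removes that column from the presence-absence matrix; moving a boundary would instead change the `start` or `end` value before scoring.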
There are many possible tweaks to extend the value of
processing AFLP data. However, without some external interest they will
probably not get implemented. Our current lab focus is on microsatellites.
As a result, I’m looking at the microsatellite processing options available in
R, and will look to extend
binner to help fill any gaps in
providing a complete workflow within R.