I’ve been working on a lot of AFLP data this winter. I’d really like to be able to do all the analysis in R, for a few reasons. First, it would mean no more fighting with GeneMapper, which is incredibly frustrating: it’s Windows-only, expensive, closed-source and painfully underpowered for the job. Second, presumably if I can figure out how to code this myself I will develop a deeper understanding of the system. And third, if I can get the code working in R, I will be able to automate most of the process.
There are two R projects already in progress for working with AFLP data. RawGeno is one option. It doesn’t yet allow for importing fsa files directly, but the example scripts provide some clues about how to do this. I couldn’t get the code to work as written, but I was able to steal some ideas from it.
The other R package is AFLP. This package includes a read.fsa() function, but it doesn’t seem to work yet. I understand they’ve only recently switched to ABI sequencers, and haven’t yet updated their code. AFLP also combines reading the fsa files, calibrating the sizing, and defining the bins into one step. That’s a sensible thing to do, but I’m not that clever. I need to break things into small pieces if I hope to get anywhere.
Since one of my goals is self-education, I’m not concerned about duplicating some of the effort of these other projects. In fact, I’m going to try to steal as much as I can from them. That’s one of the benefits of Free Software: we get to learn from each other.
Step one, reading the raw data
Lucky for me, most of the work involved in actually getting the contents of an .fsa file into R has already been done, via the package seqinr. All that I need to do is extract the useful bits and reformat them into a data.frame.
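Roughly, that reshaping step could look like the sketch below. It assumes the list returned by seqinr's read.abif() keeps the traces in its Data element under names like DATA.1 and DATA.105. The helper name abif_to_long() and the column names tag, chan, time and peak are illustrative choices, not part of seqinr.

```r
# Sketch: turn the list returned by seqinr::read.abif() into a long data.frame.
# Assumption: traces live in abif$Data under names such as "DATA.1", "DATA.105".
# abif_to_long() and its column names are hypothetical, chosen for illustration.
abif_to_long <- function(abif, tag, channels = c(1, 105)) {
  pieces <- lapply(channels, function(ch) {
    trace <- abif$Data[[paste0("DATA.", ch)]]
    if (is.null(trace)) stop("channel not found: DATA.", ch)
    data.frame(tag = tag,                # sample label
               chan = ch,                # DATA channel number
               time = seq_along(trace),  # index of the reading within the run
               peak = as.numeric(trace), # fluorescence at that reading
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

# Stand-in for a real file, so the example runs without an .fsa file on hand.
# With real data: abif <- seqinr::read.abif("sample1.fsa")
fake_abif <- list(Data = list(DATA.1   = c(-3L, 12L, 250L, 40L),
                              DATA.105 = c(5L, 300L, 8L, -1L)))
raw <- abif_to_long(fake_abif, tag = "sample1")
```

Keeping everything in one long table (one row per sample x dye x time) makes it easy to stack many samples with rbind and to subset by dye later.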
sig.channel is a vector of the DATA channels to read from the fsa file. I’m using FAM dye, which gets recorded in channel 1. lad.channel is the DATA channel where the size standard is found. We use the orange dye for the ladder, which is in channel 105. pretrim and posttrim are conveniences, for removing leading and trailing strings from the filenames, via tag.trimmer.
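As an illustration of the filename clean-up, here is a guess at what a tag.trimmer-style helper might do. The function name tag_trimmer_sketch(), its signature, and the example file names are all assumptions. It treats pretrim and posttrim as fixed strings rather than regular expressions.

```r
# Hypothetical stand-in for a tag.trimmer-style helper: strip a fixed leading
# string (pretrim) and a fixed trailing string (posttrim) from each file name.
tag_trimmer_sketch <- function(x, pretrim = NULL, posttrim = NULL) {
  if (!is.null(pretrim)) {
    has_prefix <- startsWith(x, pretrim)
    x[has_prefix] <- substring(x[has_prefix], nchar(pretrim) + 1)
  }
  if (!is.null(posttrim)) {
    has_suffix <- endsWith(x, posttrim)
    x[has_suffix] <- substring(x[has_suffix], 1,
                               nchar(x[has_suffix]) - nchar(posttrim))
  }
  x
}

# Made-up file names, just to show the effect.
files <- c("AFLP_plate1_A01.fsa", "AFLP_plate1_B07.fsa")
tags <- tag_trimmer_sketch(files, pretrim = "AFLP_plate1_", posttrim = ".fsa")
```

Names that don't carry the given prefix or suffix are left untouched, so a stray file with a different naming pattern keeps its full name instead of being mangled.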
The actual data, in my case, is composed of 8959 rows for each sample x dye combination. Each row is the reading from the laser at that point in the run (the time), which is a stand-in for the size of the fragments that are migrating past the window at that particular time. Since we have multiple readings for each time, the time column allows us to refer to information from different dyes and different samples that were detected at the same time. peak is the strength of the fluorescence associated with each sample x dye x time combination. The negative numbers are obviously noise. You can use the thresh argument to clear out all rows that are below a particular fluorescence value. This isn’t necessary unless you’ve got a really big data set. I ran this on nearly 200 samples and had no problems; I don’t think you’re likely to run into issues with fewer than 1000 samples.
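The effect of a threshold like this can be sketched on a toy table. The column names and the cut-off of 50 are arbitrary choices for illustration, not recommended values.

```r
# Toy long-format data: one sample, one dye, six time points.
raw <- data.frame(tag = "sample1", chan = 1, time = 1:6,
                  peak = c(-4, 2, 180, 950, 60, -1))

# Drop every row whose fluorescence is below the threshold.
thresh <- 50  # arbitrary value for this example
kept <- raw[raw$peak >= thresh, ]
```

Because the time column travels with each row, the surviving rows can still be lined up against other dyes and samples after filtering.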
This isn’t useful yet. First we need to convert the times into actual fragment sizes, in base pairs. In the meantime, we can at least plot our raw data in R now:
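A minimal sketch of such a plot, using base graphics and a simulated trace in place of real data. The noise level and the three artificial peaks are invented purely so the example runs on its own.

```r
# Toy trace standing in for one sample x dye; real runs have thousands of rows.
set.seed(1)
trace <- data.frame(time = 1:500, peak = rnorm(500, mean = 0, sd = 5))
trace$peak[c(120, 260, 410)] <- c(800, 450, 1200)  # three artificial peaks

# Fluorescence against time, drawn as a line like an electropherogram.
plot(trace$time, trace$peak, type = "l",
     xlab = "time (reading index)", ylab = "fluorescence",
     main = "Raw trace, one sample and one dye")
abline(h = 0, col = "grey")  # baseline, to make the negative noise visible
```

With real data the x axis is still time rather than base pairs, so the plot is only good for eyeballing signal quality at this stage.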
Next up is finding the peaks in each channel, matching up the size standard peaks to the known values, and using that to convert the rest of the peaks from time to base pairs.