<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Tutorials on plantarum.ca</title>
    <link>https://plantarum.ca/tutorials/</link>
    <description>Recent content in Tutorials on plantarum.ca</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 29 Jul 2021 00:00:00 +0000</lastBuildDate>
    
        <atom:link href="https://plantarum.ca/tutorials/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Plant Metabarcoding with the MinION</title>
      <link>https://plantarum.ca/2026/03/16/metabarcoding/</link>
      <pubDate>Mon, 16 Mar 2026 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2026/03/16/metabarcoding/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;This is my bespoke pipeline for processing
&lt;a href=&#34;https://nanoporetech.com/products/sequence/minion&#34;&gt;MinION&lt;/a&gt; metabarcoding
data, focused specifically on ITS2 in vascular plants. It hasn’t yet been
reviewed by anyone with any special expertise in this area, but you may
nevertheless find it useful.&lt;/p&gt;
&lt;p&gt;I find the documentation of bioinformatics tools and pipelines generally
lacking, or at least targeted at people with more background knowledge than
I have. To compensate for this, I move very slowly through a protocol, and
examine intermediate outputs with standard text-processing tools. This is
essential for my learning process, but may be tedious if you do have the
understanding the bioinformaticians expect you to have.&lt;/p&gt;
&lt;p&gt;In any case, what follows is a very detailed walk-through of my pipeline.
Now that it works, I’ll convert it to a single script that runs through all
the bits below.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;library-preparation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Library Preparation&lt;/h1&gt;
&lt;p&gt;The wet lab protocol we’re using is very similar to that used by
ONTBarcoder &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-SrivathsanEtAl_2024&#34;&gt;Srivathsan et al., 2024&lt;/a&gt;)&lt;/span&gt;, except we’re using ITS2 instead of COI.
The ONTBarcoder pipeline isn’t limited to COI, but it does expect the
target amplicon to be a consistent length. That’s not the case with ITS2,
so I put together the following pipeline.&lt;/p&gt;
&lt;p&gt;Our wet lab protocol consists of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;DNA extraction&lt;/li&gt;
&lt;li&gt;amplification with ITS2-S2F and ITS4, both primers with a 9bp index&lt;/li&gt;
&lt;li&gt;PCR products further processed with the Nanopore ligation kit
&lt;a href=&#34;https://store.nanoporetech.com/ca/ligation-sequencing-kit-v14.html&#34;&gt;SQK-LSK114&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The completed library is:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;LSK adapter - Forward Index - ITS2-S2F - amplicon - 
        - ITS4 - Reverse Index - LSK adapter&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We expect the length of individual products to be 218-378 bp. This accounts
for documented variation in ITS2 length (160-320 bp) plus the two 20 bp
primers and two 9 bp indexes (58 bp in total). Adding a bit of buffer, I’ll
consider sequences in the range 200-500 bp as potentially ‘valid’ reads.&lt;/p&gt;
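&lt;p&gt;As a rough check on that arithmetic, here is a minimal sketch in Bash. The
variable names are my own, and it assumes the amplicon is flanked by two
20 bp primers and two 9 bp indexes, as in the library layout above:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# Hypothetical variable names; all lengths in bp
PRIMER=20       # ITS-S2F and ITS4 are both 20 bp long
INDEX=9         # one index outside each primer
ITS2_MIN=160    # documented ITS2 length range
ITS2_MAX=320
FLANK=$(( 2 * PRIMER + 2 * INDEX ))
echo &amp;quot;$(( ITS2_MIN + FLANK ))-$(( ITS2_MAX + FLANK )) bp&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;218-378 bp&lt;/code&gt;&lt;/pre&gt;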
&lt;p&gt;The &lt;a href=&#34;https://nanoporetech.com/document/chemistry-technical-document#adapter-sequences&#34;&gt;LSK Adapter
sequence&lt;/a&gt;
is:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT
        GGACATGAAGCAAGTCAATGCATAACG&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Primers:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;ITS-S2F: ATGCGATACTTGGTGTGAAT&lt;/li&gt;
&lt;li&gt;ITS4: TCCTCCGCTTATTGATATGC&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our indexes are 9 bp long:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th&gt;Forward&lt;/th&gt;
&lt;th&gt;Reverse&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;GGAGAAGAA&lt;/td&gt;
&lt;td&gt;CTCACAAGG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;CAGAGAGAA&lt;/td&gt;
&lt;td&gt;AAGAGCAGG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;ACTCCAGAA&lt;/td&gt;
&lt;td&gt;TAACTGCGA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td&gt;TCCAAGGAA&lt;/td&gt;
&lt;td&gt;GTCATCCGA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;a href=&#34;https://software-docs.nanoporetech.com/dorado/latest/&#34;&gt;Dorado&lt;/a&gt; will trim
the LSK adapters, either as part of basecalling (which is the default) or
in a separate step. However, in my early trials this seemed to leave an
inconsistent residue at the ends of the reads, which was (initially at
least) kind of confusing. This turned out to be a non-issue with the
approach I took.&lt;/p&gt;
&lt;p&gt;Another issue is that a small proportion of reads are concatenated into
chimeras during sequencing: two sequences are read as one long one, containing
two amplicons along with their flanking primers and indexes.
ONTBarcoder deals with this by splitting long sequences in two, and then
treating each half as a separate read. In our first run only 1.1 percent of
reads were long enough to be chimeras, so I’m just going to filter out
these long reads and get on with my life.&lt;/p&gt;
&lt;p&gt;With all that in mind, our pipeline will be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;basecalling via &lt;a href=&#34;https://software-docs.nanoporetech.com/dorado/latest/&#34;&gt;Dorado&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;extracting amplicons + flanking primers/indices via &lt;a href=&#34;https://bioinf.shenwei.me/seqkit/usage/#amplicon&#34;&gt;seqkit amplicon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;length filtering via &lt;a href=&#34;https://bioinf.shenwei.me/seqkit/usage/#seq&#34;&gt;seqkit seq&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;demultiplex via &lt;a href=&#34;https://demultiplex.readthedocs.io/en/latest/index.html&#34;&gt;demultiplex demux&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;trim primers/indices via &lt;a href=&#34;https://bioinf.shenwei.me/seqkit/usage/#amplicon&#34;&gt;seqkit amplicon&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;dereplicate via &lt;a href=&#34;https://github.com/torognes/vsearch&#34;&gt;vsearch --fastx_uniques&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;match to reference via &lt;a href=&#34;https://github.com/torognes/vsearch&#34;&gt;vsearch --usearch_global&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;further processing in &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;basecalling&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Basecalling&lt;/h1&gt;
&lt;p&gt;In the end, it doesn’t really matter if we trim the adapters or not. We’re
going to excise the amplicons in the next step, which will remove
everything external to the indices.&lt;/p&gt;
&lt;p&gt;In addition to the
&lt;a href=&#34;https://software-docs.nanoporetech.com/dorado/latest/&#34;&gt;Dorado&lt;/a&gt; program, we
need the basecalling models. We can download them with Dorado:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;dorado download --model all&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This downloads a whole load of models. The one we want will be in a
directory named &lt;code&gt;dna_r10.4.1_e8.2_400bps_sup@v5.2.0&lt;/code&gt;. The particulars of
the model are encoded in the directory name:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;dna&lt;/dt&gt;
&lt;dd&gt;
DNA sequence
&lt;/dd&gt;
&lt;dt&gt;r10.4.1&lt;/dt&gt;
&lt;dd&gt;
FLO-MIN114 flow cell
&lt;/dd&gt;
&lt;dt&gt;e8.2&lt;/dt&gt;
&lt;dd&gt;
chemistry type, Kit 14
&lt;/dd&gt;
&lt;dt&gt;400bps&lt;/dt&gt;
&lt;dd&gt;
translocation speed
&lt;/dd&gt;
&lt;dt&gt;sup&lt;/dt&gt;
&lt;dd&gt;
super accurate
&lt;/dd&gt;
&lt;dt&gt;v5.2.0&lt;/dt&gt;
&lt;dd&gt;
model version
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;We will always use &lt;code&gt;dna_r10.4.1_e8.2_400bps_sup&lt;/code&gt;, as these parameters are
set by the hardware we’re using. We may want to update to the newer model
versions as they are released.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;dorado basecaller \
       ./path/to/model/dna_r10.4.1_e8.2_400bps_sup@v5.2.0 \
       ./path/to/pod5 \
       --device cuda:all --emit-fastq --no-trim \
       &amp;gt; output_dir/basecalls.fastq&lt;/code&gt;&lt;/pre&gt;
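&lt;p&gt;NB: &lt;code&gt;--device cuda:all&lt;/code&gt; tells Dorado to use all available CUDA GPUs, &lt;code&gt;--emit-fastq&lt;/code&gt; writes FASTQ output, and &lt;code&gt;--no-trim&lt;/code&gt; leaves the adapters on the reads.&lt;/p&gt;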
&lt;p&gt;This takes about 20 minutes on the cluster (all times will depend on the
size of your read files of course; I include mine here to give you an idea
of the relative time required at each step).&lt;/p&gt;
&lt;p&gt;Check the number of reads:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;  wc -l &amp;lt; output_dir/basecalls.fastq &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;4362908&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Remember, four lines per read, so total reads = 1,090,727.&lt;/p&gt;
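&lt;p&gt;To skip the mental arithmetic, the division can be done in the shell. This
is only a sketch: &lt;code&gt;count_reads&lt;/code&gt; is a hypothetical helper name, and it
assumes a FASTQ file with exactly four lines per record (no wrapped sequence
lines).&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# count_reads FILE: number of records in a four-line-per-record FASTQ file
count_reads() {
    echo $(( $(wc -l &amp;lt; &amp;quot;$1&amp;quot;) / 4 ))
}

# e.g.: count_reads output_dir/basecalls.fastq&lt;/code&gt;&lt;/pre&gt;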
&lt;/div&gt;
&lt;div id=&#34;amplicon-selection&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Amplicon Selection&lt;/h1&gt;
&lt;p&gt;In initial trials, this worked best when I selected exactly 9 bp external
to each primer, matching the expected size of the indexes. This
means we’ll drop any reads with an indel in the index, which is
probably a good thing to do in any case.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;seqkit amplicon -F ATGCGATACTTGGTGTGAAT \
       -R TCCTCCGCTTATTGATATGC -r -9:9 -f \
       output_dir/basecalls.fastq -m 4 &amp;gt; \
       output_dir/amplicons_9m4.fastq&lt;/code&gt;&lt;/pre&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;-F&lt;/code&gt; and &lt;code&gt;-R&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;
primer sequences
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;-r -9:9 -f&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;
extract the sequence starting 9 bp before the forward primer and
extending 9 bp after the reverse primer
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;-m 4&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;
allow a maximum of four mismatches in primer sites
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;Takes about 2 minutes.&lt;/p&gt;
&lt;p&gt;As a quick sanity check, look at the first read in the output:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;head -2 output_dir/amplicons_9m4.fastq&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;@d2b8520e-173a-441d-ab34-b9f8ea53d679	qs:f:9.27552 [ ... ]
TTCCAACAAATGCGATACTTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCCCGAGGCCTCCTGGTCGAGGGCACGTCTGCCTGGGTGTCACGCATCGTCGCCCCCACTCCCCTCGGCTCACGAGGGCGGGGGCGGATACTGGTCTCCCGCGCGCTCCCGCCCGTGGTTGGCCTAAAATCGAGTCCTCGGCGACGGTCGCCACGACAAGCGGTGGTTGAGATAGCTCGATGGTCGGTGTGTGTCGTTGCCGCCCTGGGGAACTCCCGGTACCGCCGAGCATTTGGCTTCAGGGATGCTCGCTGGGACCCCAGCGTGGCAGGACCCTCTAAGCATATCAATAAGCGGAGGATGTCGTGTT&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And compare it to the raw read:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;grep -A 1 @d2b8520e output_dir/basecalls.fastq&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;@d2b8520e-173a-441d-ab34-b9f8ea53d679	qs:f:9.27552 [ ... ]
&amp;gt; TTAAGTTGTAACCTACTCGACTTTCAGTTACGTATTGCTAACACGACATCCTCCGCTTATTGATATGCTTAGAGGGTCCTGCCACGCTGGGGTCCCAGCGAGCATCCCTGAAGCCAAATGCTCGGCGGTACCGGGAGTTCCCCAGGGCGGCAACGACACACACCGACCATCGAGCTATCTCAACCACCGCTTGTCGTGGCGACCGTCGCCGAGGACTCGATTTTAGGCCAACCACGGGCGGGAGCGCGCGGGAGACCAGTATCCGCCCCCGCCCTCGTGAGCCGAGGGGAGTGGGGGCGACGATGCGTGACACCCAGGCAGACGTGCCCTCGACCAGGAGGCCTCGGGCGCAACTTGCGTTCAAAGACTCGATGGTTCACGGGATTCTGCAATTCACACCAAGTATCGCATTTGTTGGAAAGCAATGCGTT&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I visually matched the raw read (reverse complemented in this case, the top
line) to the output from &lt;code&gt;amplicon&lt;/code&gt; (the middle line):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;AACGCATTGCTTTCCAACAA ATGCGATACTTGGTGTGAAT TGCAGAATCC [ ... ]
           TTCCAACAA ATGCGATACTTGGTGTGAAT TGCAGAATCC [ ... ]
           INDEX     Forward Primer       AMPLICON

[ ... ] GTGGCAGGACCCTCTAA GCATATCAATAAGCGGAGGA TGTCGTGTT AGCAATACG [ ... ]
[ ... ] GTGGCAGGACCCTCTAA GCATATCAATAAGCGGAGGA TGTCGTGTT
[ ... ] AMPLICON          Reverse Primer (rc)  INDEX&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;All is as expected, so we can proceed with the analysis.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;length-filtering&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Length Filtering&lt;/h1&gt;
&lt;p&gt;Extract read lengths:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;awk &amp;#39;NR % 4 == 2 { print length }&amp;#39; \
    output_dir/amplicons_9m4.fastq &amp;gt; \
    output_dir/9m4_lengths.txt&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Review read lengths in R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;reads &amp;lt;- read.table(&amp;quot;output_dir/9m4_lengths.txt&amp;quot;,
                    sep = &amp;quot;\t&amp;quot;)
readsTrim &amp;lt;- reads[reads &amp;lt;= 1000, ]
hist(readsTrim, breaks = 50, xlab = &amp;quot;Read Length&amp;quot;,
       ylab = &amp;quot;Number of Reads&amp;quot;, main = &amp;quot;&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/reads9M4Hist.jpg&#34; title=&#34;Histogram of the distribution of read lengths&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Note that in this case I’m not plotting reads longer than 1000 bp, of which
there are 160 (from 848,255 total, about 0.02%).&lt;/p&gt;
&lt;p&gt;Filtering my reads to 200-500 bp drops ~10% of the total:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;seqkit seq -m 200 -M 500 output_dir/amplicons_9m4.fastq &amp;gt; \
       output_dir/amp9M4_500.fastq&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;wc -l &amp;lt; output_dir/amp9M4_500.fastq&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;3031992&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;757,998 reads retained. Takes 10 seconds.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;demultiplexing&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Demultiplexing&lt;/h1&gt;
&lt;p&gt;Note: &lt;code&gt;demultiplex&lt;/code&gt; appends output to existing files. So we need to remove
any files from a previous run before we re-run it on the same files (e.g.,
when tweaking parameters).&lt;/p&gt;
&lt;p&gt;&lt;code&gt;for_indexes.tsv&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;F1	GGAGAAGAA
F2	CAGAGAGAA
F3	ACTCCAGAA
F4	TCCAAGGAA
...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;rev_indexes.tsv&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;R1	GAGTACGGA
R2	AGAACCGGA
R3	TAACTGCGA
R4	GTCATCCGA
...&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;rm -r output_dir/demux9/*
demultiplex demux -p output_dir/demux9 -s 1 -e 9 \
            for_indexes.tsv output_dir/amp9M4_500.fastq&lt;/code&gt;&lt;/pre&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;-s 1 -e 9&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;
match indexes against base pairs [&lt;strong&gt;S&lt;/strong&gt;]tarting at bp 1 and [&lt;strong&gt;E&lt;/strong&gt;]nding
at bp 9.
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;Takes one minute.&lt;/p&gt;
&lt;p&gt;I’m not sure how to tell &lt;code&gt;demultiplex&lt;/code&gt; to look in the last 9 bp for the
barcodes. So I’m going to reverse complement the sequences here. This is
very quick and can be done on the head node:&lt;/p&gt;
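&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;for FILE in output_dir/demux9/*fastq
do
    # name the output after the input, with an _rc suffix
    OUT=$(basename &amp;quot;$FILE&amp;quot; .fastq)_rc.fastq
    seqkit seq -r -p -t dna &amp;quot;$FILE&amp;quot; &amp;gt; &amp;quot;output_dir/demux9/$OUT&amp;quot;
done&lt;/code&gt;&lt;/pre&gt;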
&lt;p&gt;Now that the reverse index is at the start of each read, we can demultiplex
a second time with the reverse indexes:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;rm output_dir/demux9/demux9b/*
for FILE in output_dir/demux9/*_rc.fastq
do
    demultiplex demux -p output_dir/demux9/demux9b \
                -s 1 -e 9 rev_indexes.tsv $FILE
done
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Takes one minute.&lt;/p&gt;
&lt;p&gt;Checking one sample:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;wc -l &amp;lt; output_dir/demux9/demux9b/amp9M4_500_F8_rc_R11.fastq &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;105244&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;i.e., 26,311 reads for this sample.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;trimming&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Trimming&lt;/h1&gt;
&lt;p&gt;Return to &lt;code&gt;seqkit amplicon&lt;/code&gt; to trim our primer sites:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;rm output_dir/trim/*
for FILE in output_dir/demux9/demux9b/*_rc_R*
do
    OUT=$(basename $FILE .fastq)
    OUT=${OUT/_rc_/_}_trim.fastq
    seqkit amplicon -F ATGCGATACTTGGTGTGAAT \
           -R TCCTCCGCTTATTGATATGC \
           -r 20:-20 $FILE -m 4 &amp;gt; output_dir/trim/${OUT}
done&lt;/code&gt;&lt;/pre&gt;
&lt;dl&gt;
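&lt;dt&gt;&lt;code&gt;-r 20:-20&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;
Keep only the region from position 20 to position -20 of each amplicon,
dropping the primer sites at either end. seqkit region coordinates appear to
be 1-based and inclusive, in which case this keeps the innermost base of each
20 bp primer; &lt;code&gt;21:-21&lt;/code&gt; would remove the primers completely.
&lt;/dd&gt;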
&lt;dt&gt;&lt;code&gt;-m 4&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;
maximum four mismatches in primer sites
&lt;/dd&gt;
&lt;/dl&gt;
&lt;/div&gt;
&lt;div id=&#34;dereplication&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Dereplication&lt;/h1&gt;
&lt;p&gt;We are expecting a small number of distinct sequences, each of which may be
present in very high numbers. We don’t want to match each of the duplicate
sequences to the database, so the next step is to extract a single
representative of each unique sequence.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;for FILE in output_dir/trim/*trim.fastq
do
    NAME=$(basename $FILE)
    OUT=${NAME%trim.fastq}

    vsearch --fastx_uniques $FILE \
            --sizeout --quiet --minuniquesize 2 \
            --fastqout output_dir/derep/${OUT}dr.fq
done
&lt;/code&gt;&lt;/pre&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;--fastqout&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;
write the output in &lt;code&gt;fastq&lt;/code&gt; format to the named file
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;--sizeout&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;
include number of replicates in output
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;--minuniquesize 2&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;
Discard singleton sequences, as they are likely to be sequencing errors
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;Takes a few seconds.&lt;/p&gt;
&lt;p&gt;Note that as we’ve requested &lt;code&gt;--sizeout&lt;/code&gt;, every retained read will have the
string &lt;code&gt;size=&lt;/code&gt; in its header line. Counting these tells us how many unique
sequences were found in a sample.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;grep -c &amp;quot;size=&amp;quot; output_dir/derep/amp9M4_500_F8_R11_dr.fq&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;683&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Summing the values listed for &lt;code&gt;size&lt;/code&gt; tells us how many of the original
reads we retained (i.e., after excluding singletons).&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;awk &amp;#39;BEGIN { FS = &amp;quot;size=&amp;quot; }
       /size/ { tot = tot + $2}
       END { print tot}&amp;#39; output_dir/derep/amp9M4_500_F8_R11_dr.fq&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;6542&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Recall we had 26,311 reads for this sample after demultiplexing, so roughly 20k reads were singletons.&lt;/p&gt;
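&lt;p&gt;Both numbers can be pulled out in a single pass. The following is a minimal
sketch rather than part of the pipeline: &lt;code&gt;derep_summary&lt;/code&gt; is a
hypothetical helper name, and it assumes the dereplicated file has four lines
per record with a &lt;code&gt;size=&lt;/code&gt; annotation in every header.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# derep_summary FILE: unique sequences and total reads in a dereplicated FASTQ
derep_summary() {
    # header lines are lines 1, 5, 9, ...; the text after size= is the count
    awk -F &amp;#39;size=&amp;#39; &amp;#39;NR % 4 == 1 { n++; tot += $2 }
        END { printf &amp;quot;%d unique sequences, %d reads\n&amp;quot;, n, tot }&amp;#39; &amp;quot;$1&amp;quot;
}

# e.g.: derep_summary output_dir/derep/amp9M4_500_F8_R11_dr.fq&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the sample above this should agree with the two separate commands, i.e.,
683 unique sequences and 6542 reads.&lt;/p&gt;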
&lt;/div&gt;
&lt;div id=&#34;taxon-matching&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Taxon Matching&lt;/h1&gt;
&lt;p&gt;&lt;code&gt;globaldb.fasta&lt;/code&gt; is from &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-QuaresmaEtAl_2024&#34;&gt;Quaresma et al., 2024&lt;/a&gt;)&lt;/span&gt;. It’s comprehensive, based on
a lightly cleaned set of sequences from GenBank. A locally curated,
&lt;em&gt;validated&lt;/em&gt; database will be better whenever available. If it’s not, be
sure to examine the individual reference sequences matched.&lt;/p&gt;
&lt;p&gt;There are quite a few output options. We can get everything we need from
&lt;code&gt;--biomout&lt;/code&gt; or &lt;code&gt;--uc&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;for FILE in ${output_dir}/derep/*dr.fq
do
    NAME=$(basename $FILE)
    OUT=$(basename $NAME _dr.fq)
    vsearch --usearch_global ${output_dir}/derep/$NAME \
            --db globaldb.fasta \
            --uc ${output_dir}/taxon/$OUT.uc \
            --biomout ${output_dir}/taxon/$OUT.biom \
            --maxaccepts 100 --id 0.98 --sizeout
done&lt;/code&gt;&lt;/pre&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;--maxaccepts 100&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;
by default, the search stops at the first target that passes
the matching criterion (i.e., at least 0.98 identity). This may not be the
best match! Set to 0 to remove the limit, or to another number N to
keep searching until N references have passed the criterion before
deciding which one is the best.
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;--id 0.98&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;
reject a sequence match if the pairwise identity is less than 0.98
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;--sizeout&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;
add abundance annotations to the &lt;code&gt;--uc&lt;/code&gt; file
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;Takes about an hour.&lt;/p&gt;
&lt;p&gt;We can check the &lt;code&gt;--uc&lt;/code&gt; output from the command line. Each query sequence
has a line in the file. The first column indicates &lt;em&gt;H&lt;/em&gt;its (matches) or &lt;em&gt;N&lt;/em&gt;o
matches. The ninth column includes the query sequence ID and the number of
replicates (i.e., &lt;code&gt;size=&lt;/code&gt;). The 10th column indicates the taxon the query
matched to, in case of matches.&lt;/p&gt;
&lt;p&gt;Unique query sequences:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;wc -l &amp;lt; output_dir/taxon/amp9M4_500_F8_R11.uc&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;683&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Matching queries:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;grep -c &amp;#39;^H&amp;#39; output_dir/taxon/amp9M4_500_F8_R11.uc&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;330&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Total sequences matched:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;grep &amp;#39;^H&amp;#39; output_dir/taxon/amp9M4_500_F8_R11.uc | cut -f9 | \
    awk &amp;#39;BEGIN { FS = &amp;quot;size=&amp;quot; }
                 { tot = tot + $2}
           END { print tot}&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;2405&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Our reference has multiple sequences for many taxa. We can tally the
matching unique sequences by species, most frequent first:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;grep &amp;#39;^H&amp;#39; output_dir/taxon/amp9M4_500_F8_R11.uc | cut -f10 | \
    sed &amp;#39;s/^.*s://&amp;#39; | sort | uniq -c | sort -rn&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;   128 Cucumis_sativus;
    83 Trichophorum_pumilum;
    53 Rhododendron_tomentosum;
    39 Capsicum_frutescens;
    15 Brassica_napus;
     9 Solanum_lycopersicum;
     2 Rhododendron_groenlandicum;
     1 Rhododendron_ferrugineum;&lt;/code&gt;&lt;/pre&gt;
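&lt;p&gt;The tally above counts unique query sequences per species. To count reads
instead, we can sum the &lt;code&gt;size=&lt;/code&gt; values from column 9 for each species.
This is a sketch under a couple of assumptions: &lt;code&gt;reads_per_species&lt;/code&gt; is
a hypothetical helper name, and the target labels in column 10 are assumed to
end in &lt;code&gt;s:Genus_species;&lt;/code&gt;, as the &lt;code&gt;sed&lt;/code&gt; pattern above implies.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;# reads_per_species FILE.uc: total reads (sum of size=) per matched species
reads_per_species() {
    awk -F &amp;#39;\t&amp;#39; &amp;#39;$1 == &amp;quot;H&amp;quot; {
        # abundance of this unique query, from its label
        n = $9;   sub(/.*size=/, &amp;quot;&amp;quot;, n);  sub(/;.*/, &amp;quot;&amp;quot;, n)
        # species name, from the target label
        sp = $10; sub(/.*s:/, &amp;quot;&amp;quot;, sp);    sub(/;.*/, &amp;quot;&amp;quot;, sp)
        reads[sp] += n
    }
    END { for (sp in reads) print reads[sp] &amp;quot;\t&amp;quot; sp }&amp;#39; &amp;quot;$1&amp;quot; | sort -rn
}

# e.g.: reads_per_species output_dir/taxon/amp9M4_500_F8_R11.uc&lt;/code&gt;&lt;/pre&gt;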
&lt;p&gt;At this point the analysis moves from the realm of bioinformatics and back
into the realm of botany. Further assessment is easiest (for me) to do in
R. I’ll leave that for another post.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references csl-bib-body hanging-indent&#34;&gt;
&lt;div id=&#34;ref-QuaresmaEtAl_2024&#34; class=&#34;csl-entry&#34;&gt;
Quaresma, A., M. J. Ankenbrand, C. A. Y. Garcia, J. Rufino, M. Honrado, J. Amaral, R. Brodschneider, et al. 2024. &lt;a href=&#34;https://doi.org/10.1038/s41597-024-02962-5&#34;&gt;Semi-automated sequence curation for reliable reference datasets in &lt;span&gt;ITS2&lt;/span&gt; vascular plant &lt;span&gt;DNA&lt;/span&gt; (meta-)barcoding&lt;/a&gt;. &lt;em&gt;Scientific Data&lt;/em&gt; 11: 129.
&lt;/div&gt;
&lt;div id=&#34;ref-SrivathsanEtAl_2024&#34; class=&#34;csl-entry&#34;&gt;
Srivathsan, A., V. Feng, D. Suárez, B. Emerson, and R. Meier. 2024. &lt;a href=&#34;https://doi.org/10.1111/cla.12566&#34;&gt;&lt;span&gt;ONTbarcoder&lt;/span&gt; 2.0: Rapid species discovery and identification with real-time barcoding facilitated by &lt;span&gt;Oxford Nanopore R10&lt;/span&gt;.4&lt;/a&gt;. &lt;em&gt;Cladistics&lt;/em&gt; 40: 192–203.
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Parallelizing Bash loops</title>
      <link>https://plantarum.ca/2025/09/18/parallelloops/</link>
      <pubDate>Thu, 18 Sep 2025 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2025/09/18/parallelloops/</guid>
      <description>&lt;h1 id=&#34;routine-data-maintenance&#34;&gt;Routine data maintenance&lt;/h1&gt;
&lt;p&gt;Processing large numbers of (large) files is a common task when working on
a cluster. For operations that take only a few seconds, maybe a few
minutes total for the full set of files, you can do this right on the head
node. If you need to, you can run a job on the head node for a few hours.
But anything that takes more than 15 or 30 minutes is worth considering for
submission as a job for the scheduler. Among the many advantages, this
allows you to use multiple CPUs to speed up your work.&lt;/p&gt;
&lt;h1 id=&#34;a-simple-bash-loop&#34;&gt;A simple Bash loop&lt;/h1&gt;
&lt;p&gt;For processing files, this often involves loops. A previous post explains
how to &lt;a href=&#34;https://plantarum.ca/2025/01/24/slurm-arrays&#34;&gt;use array jobs&lt;/a&gt; to automate submitting
each cycle through a loop as a separate job.&lt;/p&gt;
&lt;p&gt;Another option, especially handy for Bash scripts, is to execute each
command as a background process. For example, let&amp;rsquo;s start with this loop:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; FILE in raw_data/*
&lt;span style=&#34;color:#66d9ef&#34;&gt;do&lt;/span&gt;
    FILE&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;$(&lt;/span&gt;basename $FILE&lt;span style=&#34;color:#66d9ef&#34;&gt;)&lt;/span&gt;
    echo &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;FILE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;
    processor raw_data/&lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;FILE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt; &amp;gt; clean_data/&lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;FILE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;
&lt;span style=&#34;color:#66d9ef&#34;&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This loops over all files in the &lt;code&gt;raw_data&lt;/code&gt; directory, runs the program
&lt;code&gt;processor&lt;/code&gt; on each one, and writes the result to the &lt;code&gt;clean_data&lt;/code&gt; directory. That
works, but it only processes one file at a time.&lt;/p&gt;
&lt;h1 id=&#34;running-commands-in-the-background&#34;&gt;Running commands in the background&lt;/h1&gt;
&lt;p&gt;If we direct the shell to run &lt;code&gt;processor&lt;/code&gt; in the background, we can
process multiple files at once:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; FILE in raw_data/*
&lt;span style=&#34;color:#66d9ef&#34;&gt;do&lt;/span&gt;
    &lt;span style=&#34;color:#f92672&#34;&gt;(&lt;/span&gt;
    FILE&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;$(&lt;/span&gt;basename $FILE&lt;span style=&#34;color:#66d9ef&#34;&gt;)&lt;/span&gt;
    echo &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;FILE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;
    processor raw_data/&lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;FILE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt; &amp;gt; clean_data/&lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;FILE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;
    &lt;span style=&#34;color:#f92672&#34;&gt;)&lt;/span&gt; &amp;amp;
&lt;span style=&#34;color:#66d9ef&#34;&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;&amp;amp;&lt;/code&gt; tells the shell to run the preceding command (here, everything inside the parentheses) in the background,
and immediately proceed to the next command. In this case, that means the
next trip through the loop.&lt;/p&gt;
&lt;h1 id=&#34;controlling-the-number-of-background-jobs&#34;&gt;Controlling the number of background jobs&lt;/h1&gt;
&lt;p&gt;That&amp;rsquo;s an improvement, but if we have 1000 files to process, we probably
don&amp;rsquo;t have enough CPUs to actually run them all at once. The job may fail,
or the cluster may try to manage the competing processes itself. This will
be slow.&lt;/p&gt;
&lt;p&gt;Better to specifically request the number of CPUs we want, and then to
limit our loop to never use more than that many at a time. We can do this
with the following:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;N&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;8&lt;/span&gt;

&lt;span style=&#34;color:#66d9ef&#34;&gt;for&lt;/span&gt; FILE in raw_data/*
&lt;span style=&#34;color:#66d9ef&#34;&gt;do&lt;/span&gt;
    &lt;span style=&#34;color:#f92672&#34;&gt;(&lt;/span&gt;
    FILE&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;$(&lt;/span&gt;basename $FILE&lt;span style=&#34;color:#66d9ef&#34;&gt;)&lt;/span&gt;
    echo &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;FILE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;
    processor raw_data/&lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;FILE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt; &amp;gt; clean_data/&lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;FILE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;
    &lt;span style=&#34;color:#f92672&#34;&gt;)&lt;/span&gt; &amp;amp;

    &lt;span style=&#34;color:#66d9ef&#34;&gt;if&lt;/span&gt; &lt;span style=&#34;color:#f92672&#34;&gt;[[&lt;/span&gt; &lt;span style=&#34;color:#66d9ef&#34;&gt;$(&lt;/span&gt;jobs -r -p | wc -l&lt;span style=&#34;color:#66d9ef&#34;&gt;)&lt;/span&gt; -ge $N &lt;span style=&#34;color:#f92672&#34;&gt;]]&lt;/span&gt;; &lt;span style=&#34;color:#66d9ef&#34;&gt;then&lt;/span&gt;
        wait -n
    &lt;span style=&#34;color:#66d9ef&#34;&gt;fi&lt;/span&gt;

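&lt;span style=&#34;color:#66d9ef&#34;&gt;done&lt;/span&gt;

&lt;span style=&#34;color:#75715e&#34;&gt;# wait for the remaining background jobs before the script exits&lt;/span&gt;
wait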
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &lt;code&gt;jobs&lt;/code&gt; shell builtin reports the process ID (via the &lt;code&gt;-p&lt;/code&gt; flag) of the
running jobs (the &lt;code&gt;-r&lt;/code&gt; flag), one ID per line. The &lt;code&gt;wc -l&lt;/code&gt; program tells us
how many lines (i.e., how many running jobs) there are in the output. &lt;code&gt;-ge $N&lt;/code&gt; checks if that number is greater than or equal to the value of &lt;code&gt;N&lt;/code&gt;,
which we set to &lt;code&gt;8&lt;/code&gt; on the first line of the script.&lt;/p&gt;
&lt;p&gt;If that&amp;rsquo;s true (there are 8 jobs running), the &lt;code&gt;wait&lt;/code&gt; command will pause
all further commands until any of the jobs finishes (the &lt;code&gt;-n&lt;/code&gt; flag).&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s exactly what we needed. Now we can ask for the number of CPUs we
want in our SLURM script:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#SBATCH --cpus-per-task=8
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;And make sure that number matches the value we set for &lt;code&gt;N&lt;/code&gt; in our Bash
loop. The higher the number, the more files will get processed at once. But
it may also take longer for your job to get scheduled, depending on the
size of your cluster and how busy it is.&lt;/p&gt;
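&lt;p&gt;One way to keep the two numbers in step is to read the count from the environment instead of typing it twice. This is a minimal sketch, assuming your cluster exports &lt;code&gt;SLURM_CPUS_PER_TASK&lt;/code&gt; when &lt;code&gt;--cpus-per-task&lt;/code&gt; is set; the fallback of 8 is only there so the line also works outside of a job:&lt;/p&gt;

```shell
# Use the CPU count Slurm allocated to this job.
# Assumption: SLURM_CPUS_PER_TASK is exported by the scheduler;
# fall back to 8 when the variable is not set.
N=${SLURM_CPUS_PER_TASK:-8}
echo "Running up to ${N} jobs at once"
```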
&lt;h1 id=&#34;references&#34;&gt;References&lt;/h1&gt;
&lt;p&gt;I found this approach on
&lt;a href=&#34;https://unix.stackexchange.com/a/436713/6096&#34;&gt;Unix &amp;amp; Linux Stack Exchange&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Replacing Loops with Array Jobs</title>
      <link>https://plantarum.ca/2025/01/24/array-jobs/</link>
      <pubDate>Fri, 24 Jan 2025 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2025/01/24/array-jobs/</guid>
      <description>&lt;h1 id=&#34;loops-in-bioinformatic-pipelines&#34;&gt;Loops in Bioinformatic Pipelines&lt;/h1&gt;
&lt;p&gt;A common requirement of many bioinformatics pipelines is completing the
same analysis on a list of files. This is simple to do with a bit of &lt;a href=&#34;https://tldp.org/LDP/abs/html/&#34;&gt;Bash
scripting&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#!/bin/bash
&lt;/span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --job-name=ustacks&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --output=ustacks.log&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --time=360:00:00&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --ntasks=1&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --cpus-per-task=8&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --mem-per-cpu=8G&lt;/span&gt;
source ~/miniconda3/etc/profile.d/conda.sh
conda activate stacksM

M&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;4&lt;/span&gt;

reads_dir&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;./prort/concat/
out_dir&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;ustacks
threads&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;8&lt;/span&gt;

&lt;span style=&#34;color:#66d9ef&#34;&gt;while&lt;/span&gt; read SAMPLE; &lt;span style=&#34;color:#66d9ef&#34;&gt;do&lt;/span&gt;
    echo &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt;$SAMPLE&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt;
    ustacks -t gzfastq -f &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;reads_dir&lt;span style=&#34;color:#e6db74&#34;&gt;}${&lt;/span&gt;SAMPLE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;.1.fq.gz &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;&lt;/span&gt;            -o &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;out_dir&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt; -m &lt;span style=&#34;color:#ae81ff&#34;&gt;3&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;&lt;/span&gt;            --name &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;SAMPLE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt; -M $M -p &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;threads&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;
&lt;span style=&#34;color:#66d9ef&#34;&gt;done&lt;/span&gt; &amp;lt;SAMPLES.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This script reads each line of the file &lt;code&gt;SAMPLES.txt&lt;/code&gt; and passes that value to
the program &lt;code&gt;ustacks&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s fine, but it can require a &lt;em&gt;lot&lt;/em&gt; of cluster time. In this case, I
have 96 files listed in &lt;code&gt;SAMPLES.txt&lt;/code&gt;, and each of them will take three or
four hours to process. Run one after another, that&amp;rsquo;s about 15 days. In reality, it may take much longer,
as the cluster may not be able to schedule a job of that
length immediately (at least on the cluster I use!).&lt;/p&gt;
&lt;h1 id=&#34;replacing-loops-with-array-jobs&#34;&gt;Replacing Loops with Array Jobs&lt;/h1&gt;
&lt;p&gt;However, each of these 96 jobs can be processed at the same time. There&amp;rsquo;s
no interaction between them. If instead of asking for a single 15 day job,
I ask for 96 six-hour jobs, it&amp;rsquo;s easier for the cluster to schedule them,
and my total run time drops to something closer to the length of a single
job (maybe 6-12 hours, depending on how quickly they get scheduled).&lt;/p&gt;
&lt;p&gt;This is what array jobs do. I can replace the script above with this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#!/bin/bash
&lt;/span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --job-name=ustacks%a&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --output=ustacks_%a.log&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --time=6:00:00&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --ntasks=1&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --cpus-per-task=8&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --mem-per-cpu=8G&lt;/span&gt;
source ~/miniconda3/etc/profile.d/conda.sh
conda activate stacksM

M&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;4&lt;/span&gt;

reads_dir&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;./prort/concat/
out_dir&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;ustacks
threads&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;8&lt;/span&gt;

SAMPLE&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;$(&lt;/span&gt;sed &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;SLURM_ARRAY_TASK_ID&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;q;d&amp;#34;&lt;/span&gt; SAMPLES.txt&lt;span style=&#34;color:#66d9ef&#34;&gt;)&lt;/span&gt;

echo &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt;$SAMPLE&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt;
ustacks -t gzfastq -f &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;reads_dir&lt;span style=&#34;color:#e6db74&#34;&gt;}${&lt;/span&gt;SAMPLE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;.1.fq.gz &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;&lt;/span&gt;        -o &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;out_dir&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt; -m &lt;span style=&#34;color:#ae81ff&#34;&gt;3&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;&lt;/span&gt;        --name &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;SAMPLE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt; -M $M -p &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;threads&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;It&amp;rsquo;s nearly the same as the original. The key difference is I&amp;rsquo;ve replaced
the &lt;code&gt;while&lt;/code&gt; loop with a single variable, set using a line of
&lt;a href=&#34;https://www.gnu.org/software/sed/manual/sed.html&#34;&gt;sed&lt;/a&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SAMPLE=$(sed &amp;quot;${SLURM_ARRAY_TASK_ID}q;d&amp;quot; SAMPLES.txt)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Sed is a very powerful tool for processing text. In this case, I&amp;rsquo;ve asked
it to search through the file &lt;code&gt;SAMPLES.txt&lt;/code&gt;, skip forward to the line
&lt;code&gt;${SLURM_ARRAY_TASK_ID}&lt;/code&gt;, print that line, and then quit.&lt;/p&gt;
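&lt;p&gt;You can try the &lt;code&gt;sed&lt;/code&gt; expression on its own before submitting anything. The sketch below uses a throwaway list and sets &lt;code&gt;SLURM_ARRAY_TASK_ID&lt;/code&gt; by hand as a stand-in for the value Slurm would provide; the sample names are only placeholders:&lt;/p&gt;

```shell
# Build a throwaway sample list (placeholder sample names).
list=$(mktemp)
printf '%s\n' B12A10 BWA11 HHA19 RBB8 > "$list"

# Stand-in for the value Slurm sets inside an array job.
SLURM_ARRAY_TASK_ID=3

# "3q;d": delete every line before line 3, then print line 3 and quit.
SAMPLE=$(sed "${SLURM_ARRAY_TASK_ID}q;d" "$list")
echo "$SAMPLE"   # prints HHA19

rm -f "$list"
```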
&lt;p&gt;Where does &lt;code&gt;SLURM_ARRAY_TASK_ID&lt;/code&gt; come from? That variable is set for us
automatically when we submit our script as an array job. Instead of the
usual invocation of Slurm:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sbatch my_script.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We&amp;rsquo;ll do this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sbatch --array=1-96 my_script.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This tells Slurm that we want to submit 96 jobs. Each job
runs identical code, except for one thing: Slurm sets
the variable &lt;code&gt;SLURM_ARRAY_TASK_ID&lt;/code&gt; to a different value (1 to 96) in each one.
That value is also inserted into my log file name, using the &lt;code&gt;%a&lt;/code&gt;
placeholder (Slurm documents &lt;code&gt;%a&lt;/code&gt; as a filename pattern, so it may be left unexpanded in the job name):&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --job-name=ustacks%a&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --output=ustacks_%a.log&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;With that little change, instead of waiting a month for my 15 day job to
run, I wait a day for 96 six-hour jobs to run.&lt;/p&gt;
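&lt;p&gt;Rather than hard-coding 96, the upper bound of the array can be counted from the list itself. This is a small sketch using a throwaway list with placeholder contents; it only prints the &lt;code&gt;sbatch&lt;/code&gt; command rather than running it:&lt;/p&gt;

```shell
# Throwaway list standing in for SAMPLES.txt (placeholder names).
list=$(mktemp)
printf '%s\n' B12A10 BWA11 HHA19 RBB8 > "$list"

# grep -c '' counts every line in the file.
N_SAMPLES=$(grep -c '' "$list")

# Print the command instead of submitting it.
CMD="sbatch --array=1-${N_SAMPLES} my_script.sh"
echo "$CMD"   # prints: sbatch --array=1-4 my_script.sh

rm -f "$list"
```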
&lt;h2 id=&#34;selectively-repeating-failed-jobs&#34;&gt;Selectively Repeating Failed Jobs&lt;/h2&gt;
&lt;p&gt;Another benefit of this approach is that it allows you to easily resubmit
only a subset of your jobs. In the example above I&amp;rsquo;ve asked for six hours
for each file. If some of those files happen to require more time than
that, they will fail with a timeout. You&amp;rsquo;ll see this in your job logs.
Here&amp;rsquo;s an example from an array where I requested 36 hours per job:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;date
sacct --jobs&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;3069231&lt;/span&gt; --format&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;jobid,jobname,state,elapsed | grep -v &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#39;batch\|extern&amp;#39;&lt;/span&gt;

: Fri &lt;span style=&#34;color:#ae81ff&#34;&gt;24&lt;/span&gt; Jan &lt;span style=&#34;color:#ae81ff&#34;&gt;2025&lt;/span&gt; 12:10:04 PM EST
: JobID           JobName      State    Elapsed 
: ------------ ---------- ---------- ---------- 
: 3069231_1    ustacks23+  COMPLETED   19:14:14 
: 3069231_2    ustacks23+  COMPLETED   17:51:58 
: 3069231_3    ustacks23+  COMPLETED   13:20:56 
: 3069231_4    ustacks23+  COMPLETED   23:43:24 
: 3069231_5    ustacks23+    TIMEOUT 1-12:00:01 
: 3069231_6    ustacks23+  COMPLETED   17:43:17 
: 3069231_7    ustacks23+  COMPLETED   17:07:07 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You can see that that wasn&amp;rsquo;t enough time for the fifth job. I can rerun
just that job (after increasing the requested time) with:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sbatch --array=5 my_script.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Or, if there is a group of failed jobs, I can combine them like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;sbatch --array=5,7,20-23 my_script.sh
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Much easier than repeating the analysis of all 96 samples just because one
or two of them needed more time.&lt;/p&gt;
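&lt;p&gt;With many failures, the list of indices can be pulled out of the &lt;code&gt;sacct&lt;/code&gt; output rather than read off by eye. This is a sketch that parses three lines copied from the listing above; on a cluster the text would come from &lt;code&gt;sacct&lt;/code&gt; itself, and the column positions are an assumption based on the &lt;code&gt;--format&lt;/code&gt; string used there:&lt;/p&gt;

```shell
# Three lines copied from the sacct listing above.
sacct_output='3069231_4    ustacks23+  COMPLETED   23:43:24
3069231_5    ustacks23+    TIMEOUT 1-12:00:01
3069231_6    ustacks23+  COMPLETED   17:43:17'

# Keep rows whose third column (State) is TIMEOUT, strip the
# "jobid_" prefix from the first column, and join with commas.
FAILED=$(printf '%s\n' "$sacct_output" |
    awk '$3 == "TIMEOUT" { sub(/^[0-9]+_/, "", $1); print $1 }' |
    paste -sd, -)

echo "sbatch --array=${FAILED} my_script.sh"   # prints: sbatch --array=5 my_script.sh
```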
&lt;h1 id=&#34;tips-for-using-array-jobs&#34;&gt;Tips for Using Array Jobs&lt;/h1&gt;
&lt;h2 id=&#34;automating-making-lists-to-use-in-array-jobs&#34;&gt;Automating the Lists Used in Array Jobs&lt;/h2&gt;
&lt;p&gt;You can create the lists of samples or files you want to cycle over by
hand. That&amp;rsquo;s probably the fastest way to do it, if you only need to do it
once. But as soon as you start repeating an analysis on different files,
you might benefit from some automation.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://tldp.org/LDP/abs/html/&#34;&gt;Bash scripting&lt;/a&gt; provides a huge variety of
useful tools for creating lists of files or samples you want to cycle
through with &lt;code&gt;SLURM_ARRAY_TASK_ID&lt;/code&gt;. I won&amp;rsquo;t provide a full tutorial, but a
few examples may be useful in showing you why it might be worth investing
some time to learn more about this.&lt;/p&gt;
&lt;h3 id=&#34;filtering-filenames-with-bash-pipes&#34;&gt;Filtering Filenames with Bash Pipes&lt;/h3&gt;
&lt;p&gt;I have a directory containing a list of files like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;B12A10.1.fq.gz	    BWA11.rem.2.fq.gz  HHA19.rem.1.fq.gz  RBB8.1.fq.gz
B12A10.2.fq.gz	    BWA1.2.fq.gz       HHA19.rem.2.fq.gz  RBB8.2.fq.gz
B12A10.rem.1.fq.gz  BWA13.1.fq.gz      HHA1.rem.1.fq.gz   RBB8.rem.1.fq.gz
B12A10.rem.2.fq.gz  BWA13.2.fq.gz      HHA1.rem.2.fq.gz   RBB8.rem.2.fq.gz
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Each sample has a file of matched forward (&lt;code&gt;.1.fq.gz&lt;/code&gt;) and reverse reads
(&lt;code&gt;.2.fq.gz&lt;/code&gt;), and a file of forward and reverse unmatched reads
(&lt;code&gt;.rem.1.fq.gz&lt;/code&gt; and &lt;code&gt;.rem.2.fq.gz&lt;/code&gt;). In my processing I&amp;rsquo;ll want to loop
over each individual sample, not over all four files for each sample. I do
this with the following line:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;ls reads/*.1.fq.gz | grep -v rem | &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;&lt;/span&gt;    xargs -n &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;  basename -s .1.fq.gz &amp;gt; 2023_list.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The first clause, &lt;code&gt;ls reads/*.1.fq.gz&lt;/code&gt;, lists all the forward read files.
This will include both &lt;code&gt;B12A10.1.fq.gz&lt;/code&gt; and &lt;code&gt;B12A10.rem.1.fq.gz&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Next, I use &lt;code&gt;grep -v rem&lt;/code&gt;. That tells &lt;code&gt;grep&lt;/code&gt; to filter the values we give it
for the string &lt;code&gt;rem&lt;/code&gt;. The &lt;code&gt;-v&lt;/code&gt; argument tells &lt;code&gt;grep&lt;/code&gt; to &amp;lsquo;invert&amp;rsquo; the
search. In other words, remove all the values that include &lt;code&gt;rem&lt;/code&gt;. At this
point, we have a single file name per sample, e.g. &lt;code&gt;B12A10.1.fq.gz&lt;/code&gt; and
&lt;code&gt;RBB8.1.fq.gz&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The next step is a combination. First, we pass the values from grep to the
program &lt;code&gt;xargs&lt;/code&gt; with the argument &lt;code&gt;-n 1&lt;/code&gt;. That tells &lt;code&gt;xargs&lt;/code&gt; to pass the
values it receives one at a time to the following command. That command is
&lt;code&gt;basename&lt;/code&gt;, which will strip off the suffix we identify with the &lt;code&gt;-s&lt;/code&gt;
argument: &lt;code&gt;.1.fq.gz&lt;/code&gt;.&lt;/p&gt;
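&lt;p&gt;To see the three steps working together without touching real data, you can run them against a scratch directory of empty files. This is a sketch only: the directory is temporary and the two sample names are borrowed from the listing above:&lt;/p&gt;

```shell
# Scratch directory with empty stand-ins for two samples.
demo=$(mktemp -d)
mkdir "$demo/reads"
for s in B12A10 RBB8; do
    for suffix in 1 2 rem.1 rem.2; do
        touch "$demo/reads/$s.$suffix.fq.gz"
    done
done

# Same pipeline as above, run from inside the scratch directory
# so that the temporary path itself is never matched by grep.
LIST=$(cd "$demo"; ls reads/*.1.fq.gz | grep -v rem | xargs -n 1 basename -s .1.fq.gz)
echo "$LIST"   # prints B12A10 and RBB8, one per line

rm -r "$demo"
```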
&lt;p&gt;Finally, the results are stored in a new file, &lt;code&gt;2023_list.txt&lt;/code&gt;, which will
look like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;B12A10
BWA11
HHA19
RBB8
...
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;If you&amp;rsquo;d rather not have another file cluttering up your directory, you can
include the entire pipeline in your Slurm script:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#!/bin/bash
&lt;/span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;&lt;/span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --job-name=ustacks%a&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --output=ustacks_%a.log&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --time=6:00:00&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --ntasks=1&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --cpus-per-task=8&lt;/span&gt;
&lt;span style=&#34;color:#75715e&#34;&gt;#SBATCH --mem-per-cpu=8G&lt;/span&gt;
source ~/miniconda3/etc/profile.d/conda.sh
conda activate stacksM

M&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;4&lt;/span&gt;

reads_dir&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;./prort/concat/
out_dir&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;ustacks
threads&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;8&lt;/span&gt;

SAMPLE&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;$(&lt;/span&gt;ls reads/*.1.fq.gz | grep -v rem | &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;&lt;/span&gt;    xargs -n &lt;span style=&#34;color:#ae81ff&#34;&gt;1&lt;/span&gt;  basename -s .1.fq.gz | &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;&lt;/span&gt;    sed &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;SLURM_ARRAY_TASK_ID&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;q;d&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#66d9ef&#34;&gt;)&lt;/span&gt;

echo &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt;$SAMPLE&lt;span style=&#34;color:#e6db74&#34;&gt;&amp;#34;&lt;/span&gt;
ustacks -t gzfastq -f &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;reads_dir&lt;span style=&#34;color:#e6db74&#34;&gt;}${&lt;/span&gt;SAMPLE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;.1.fq.gz &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;&lt;/span&gt;        -o &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;out_dir&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt; -m &lt;span style=&#34;color:#ae81ff&#34;&gt;3&lt;/span&gt; &lt;span style=&#34;color:#ae81ff&#34;&gt;\
&lt;/span&gt;&lt;span style=&#34;color:#ae81ff&#34;&gt;&lt;/span&gt;        --name &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;SAMPLE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt; -M $M -p &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;threads&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h3 id=&#34;parameter-substitutions&#34;&gt;Parameter Substitutions&lt;/h3&gt;
&lt;p&gt;Another set of useful tools is &lt;a href=&#34;https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html&#34;&gt;shell parameter
substitutions&lt;/a&gt;.
They can be a convenient way of trimming prefixes and suffixes off of
names, particularly if you use (or receive) consistently formatted file
names. For example, if my samples are coded as
&lt;code&gt;population&lt;/code&gt;_&lt;code&gt;individual&lt;/code&gt;.&lt;code&gt;suffix&lt;/code&gt;, I can isolate the various bits like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;FILE&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;NS01_01.1.fq.gz
SAMPLE&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;FILE%.1.fq.gz&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;
POP&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;FILE%_*&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;
IND&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;SAMPLE#&lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;POP&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;_&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;

echo File: &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;FILE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;
echo   Sample: &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;SAMPLE&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;, Pop: &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;POP&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;, Ind: &lt;span style=&#34;color:#e6db74&#34;&gt;${&lt;/span&gt;IND&lt;span style=&#34;color:#e6db74&#34;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;blockquote&gt;
&lt;p&gt;File: NS01_01.1.fq.gz &lt;br&gt;
Sample: NS01_01, Pop: NS01, Ind: 01&lt;/p&gt;
&lt;/blockquote&gt;
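&lt;p&gt;The same expansions work inside a loop, which is handy when building a list for an array job. This sketch uses made-up file names in the same &lt;code&gt;population&lt;/code&gt;_&lt;code&gt;individual&lt;/code&gt; format; &lt;code&gt;${SAMPLE#*_}&lt;/code&gt; is a slight variation that removes everything up to the first underscore, so it doesn&amp;rsquo;t depend on &lt;code&gt;POP&lt;/code&gt;:&lt;/p&gt;

```shell
# Hypothetical file names following the population_individual pattern.
for FILE in NS01_01.1.fq.gz NS01_02.1.fq.gz NS02_01.1.fq.gz; do
    SAMPLE=${FILE%.1.fq.gz}   # strip the suffix
    POP=${SAMPLE%_*}          # strip the shortest match of "_*" from the end
    IND=${SAMPLE#*_}          # strip the shortest match of "*_" from the start
    echo "${POP} ${IND}"
done
```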
&lt;h1 id=&#34;tips-for-emacs-users&#34;&gt;Tips for Emacs Users&lt;/h1&gt;
&lt;p&gt;I&amp;rsquo;ve written the above for use by anyone working on a cluster that uses
Slurm for job submission. However, as I explained in my &lt;a href=&#34;https://plantarum.ca/2025/01/10/slurm-yasnippet/&#34;&gt;previous
post&lt;/a&gt;,
&lt;a href=&#34;https://www.gnu.org/software/emacs/&#34;&gt;Emacs&lt;/a&gt; snippets provide an extra
layer of convenience for managing Slurm scripts. Following the approach I
outline there, I write my Slurm scripts as code blocks in
&lt;a href=&#34;https://orgmode.org/&#34;&gt;Org-mode&lt;/a&gt;. In this context, an array job will look something
like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre style=&#34;color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4&#34;&gt;&lt;code class=&#34;language-org&#34; data-lang=&#34;org&#34;&gt;&lt;span style=&#34;color:#75715e&#34;&gt;#+begin_src &lt;/span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;bash&lt;/span&gt;&lt;span style=&#34;color:#75715e&#34;&gt; :results output
&lt;/span&gt;&lt;span style=&#34;color:#75715e&#34;&gt;&lt;/span&gt;  date
  sbatch --array&lt;span style=&#34;color:#f92672&#34;&gt;=&lt;/span&gt;1-96 &lt;span style=&#34;color:#e6db74&#34;&gt;&amp;lt;&amp;lt;SUBMITSCRIPT
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  #!/bin/bash
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  #SBATCH --job-name=my_job_%a
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  #SBATCH --output=my_job_%a.log
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  #SBATCH --time=1-00:00:00
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  #SBATCH --ntasks=1
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  #SBATCH --cpus-per-task=4
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  #SBATCH --mem-per-cpu=96GB
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  source ~/miniconda3/etc/profile.d/conda.sh
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  conda activate stacksM
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  THREADS=8
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  FILE=\$(ls stacks23_out/*alleles.tsv.gz | grep -v catalog | sed -n &amp;#34;\${SLURM_ARRAY_TASK_ID}p&amp;#34;)
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  DIR=\$(dirname \$FILE)
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  SAMPLE=\$(basename -s .alleles.tsv.gz \$FILE)
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  echo &amp;#34;\$DIR : \$SAMPLE&amp;#34;
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  cd \${DIR}
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  sstacks -c ../\${DIR} -s ../\${DIR}/\$SAMPLE -p \$THREADS
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;
&lt;/span&gt;&lt;span style=&#34;color:#e6db74&#34;&gt;  SUBMITSCRIPT&lt;/span&gt;              
&lt;span style=&#34;color:#75715e&#34;&gt;#+end_src&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
    </item>
    
    <item>
      <title>Taming Slurm with Emacs&#39; Yasnippet</title>
      <link>https://plantarum.ca/2025/01/10/slurm-yasnippet/</link>
      <pubDate>Fri, 10 Jan 2025 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2025/01/10/slurm-yasnippet/</guid>
      <description>&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/yasnippet.gif&#34; alt=&#34;Animation&#34;&gt;&lt;/p&gt;
&lt;p&gt;In a previous post I explained how to set up &lt;a href=&#34;https://plantarum.ca/2024/03/28/org-cluster/&#34;&gt;Emacs Org-mode for working on
a remote cluster&lt;/a&gt;. If your cluster uses
&lt;a href=&#34;https://slurm.schedmd.com/overview.html&#34;&gt;Slurm&lt;/a&gt; to manage jobs, you will
need to specify a set of options for submission. This isn&amp;rsquo;t difficult, but
it&amp;rsquo;s tedious. We can automate away most of the tedium with Emacs,
specifically the &lt;a href=&#34;http://joaotavora.github.io/yasnippet/&#34;&gt;YASnippet
package&lt;/a&gt;. (YASnippet stands for
&amp;ldquo;yet another snippet package&amp;rdquo;, by the way). The animation above shows what
we&amp;rsquo;re aiming for.&lt;/p&gt;
&lt;h1 id=&#34;install-yasnippet&#34;&gt;Install YASnippet&lt;/h1&gt;
&lt;p&gt;The homepage includes instructions on how to install it. The easiest method uses
the &lt;a href=&#34;https://github.com/joaotavora/yasnippet?tab=readme-ov-file#install-with-package-install&#34;&gt;Emacs package
manager&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you want to use it by default, you should add the following to your
config:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;(yas-global-mode 1)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Alternatively, you can set it for each mode you want it applied to. For
example, the following turns it on for &lt;code&gt;org-mode&lt;/code&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;(add-hook &amp;#39;org-mode-hook #&amp;#39;yas-minor-mode)&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You may also need the following line to ensure all your previously defined
templates are loaded:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;(yas-reload-all)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h1 id=&#34;writing-snippets&#34;&gt;Writing Snippets&lt;/h1&gt;
&lt;p&gt;A snippet is a bit of text that you can automatically insert into a file.
It can include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;text, inserted &amp;lsquo;as-is&amp;rsquo;;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;tab stop fields, which you can navigate through and enter text (with or
without default values) when the template is inserted;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;more complex variations combining fields, transformations, mirrors and
elisp code.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To get started, call the command &lt;code&gt;M-x yas-new-snippet&lt;/code&gt;. This will open a
new buffer with the following text:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# -*- mode: snippet -*-
# name: 
# key: 
# --

&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The cursor will be on the &lt;code&gt;name&lt;/code&gt; row. This value is for our benefit, so
something descriptive is appropriate. I&amp;rsquo;ll use the phrase &amp;lsquo;Slurm header&amp;rsquo;.&lt;/p&gt;
&lt;p&gt;This buffer is actually a snippet itself, so we can use the &lt;code&gt;TAB&lt;/code&gt; key to
jump to the next tab stop, which is on the next line.&lt;/p&gt;
&lt;p&gt;The key is a combination of one or more letters that we&amp;rsquo;ll use to insert
the template. It can&amp;rsquo;t contain a space. It should be memorable, but not
something we&amp;rsquo;ll type in other contexts. In this case, &lt;code&gt;slrm&lt;/code&gt; will do. Hit
&lt;code&gt;TAB&lt;/code&gt; again, and point will move down into the body of the snippet.&lt;/p&gt;
&lt;p&gt;This is where we put the text for our submission script.&lt;/p&gt;
&lt;p&gt;For starters, we can include any text we want to appear &amp;lsquo;as-is&amp;rsquo;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# -*- mode: snippet -*-
# name: Slurm header
# key: slrm
# --
#+BEGIN_SRC bash :results output
  date
  sbatch &amp;lt;&amp;lt;SUBMITSCRIPT
  #!/bin/bash
  #SBATCH --job-name=MYJOB
  #SBATCH --output=MYJOB.log
  #SBATCH --open-mode=truncate
  #SBATCH --partition=standard
  #SBATCH --time=24:00:00
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=1

  $0

  SUBMITSCRIPT
#+END_SRC
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I start and end with the &lt;code&gt;org-mode&lt;/code&gt; block header/footer, with the language
set to &lt;code&gt;bash&lt;/code&gt; and the &lt;code&gt;:results output&lt;/code&gt; option. These never change.&lt;/p&gt;
&lt;p&gt;I start the code chunk with the &lt;code&gt;date&lt;/code&gt; command, so it will capture the time and
date I submitted the script, which will be recorded in the &lt;code&gt;org&lt;/code&gt; file.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;&amp;lt;SUBMITSCRIPT
...
SUBMITSCRIPT
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This here-document defines the script that is passed to Slurm, as discussed in my
&lt;a href=&#34;https://plantarum.ca/2024/03/28/org-cluster/&#34;&gt;previous post&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;After that we find the actual Slurm directives with default values for each
line.&lt;/p&gt;
&lt;p&gt;There is one special field here, &lt;code&gt;$0&lt;/code&gt;. This is where the cursor will be
after the template is inserted.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s fine, but we still need to go back and fill in the actual values for
the directives. We can improve this by adding tab stops, along with
default values and &lt;code&gt;mirrors&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# -*- mode: snippet -*-
# name: Slurm header
# expand-env: ((yas-indent-line &#39;fixed))
# key: slrm
# --
#+BEGIN_SRC bash :results output
  date
  sbatch &amp;lt;&amp;lt;SUBMITSCRIPT
  #!/bin/bash
  #SBATCH --job-name=${1:NAME}
  #SBATCH --output=$1.log
  #SBATCH --open-mode=truncate
  #SBATCH --partition=standard
  #SBATCH --time=${2:24:00:00}
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=${3:1}

  $0

  SUBMITSCRIPT
#+END_SRC
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;When we insert this template, the cursor starts at the &lt;code&gt;--job-name&lt;/code&gt;
directive where the &lt;code&gt;${1:NAME}&lt;/code&gt; field is. The &lt;code&gt;1&lt;/code&gt; indicates this is the
first tab stop. The value &lt;code&gt;NAME&lt;/code&gt; is the default for this field. Whatever
value we enter here will be mirrored to the line below with the suffix
&lt;code&gt;.log&lt;/code&gt;, i.e., &lt;code&gt;--output=NAME.log&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;After we&amp;rsquo;ve added the name, pressing &lt;code&gt;&amp;lt;tab&amp;gt;&lt;/code&gt; takes us to the second tab
stop, for the &lt;code&gt;time&lt;/code&gt; directive. My default is &lt;code&gt;24:00:00&lt;/code&gt;, but I can change
that to whatever I need. The third tab stop is for &lt;code&gt;cpus-per-task&lt;/code&gt;. After
that we tab to stop zero, where we can enter the body of our script.&lt;/p&gt;
&lt;p&gt;One final tweak: I want yasnippet to leave the indentation of this template
exactly as I&amp;rsquo;ve written it (i.e., &amp;lsquo;fixed&amp;rsquo;). To accomplish this, I&amp;rsquo;ve set
the &lt;code&gt;expand-env&lt;/code&gt; options in the header to use fixed indentation.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# expand-env: ((yas-indent-line &#39;fixed))
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You can test out your snippet by calling &lt;code&gt;M-x yas-tryout-snippet&lt;/code&gt;, which is
bound to &lt;code&gt;C-c C-t&lt;/code&gt;. This will open a temporary buffer with the body of your
snippet inserted. You can then enter text and tab through the stops to see
how it works. &lt;code&gt;C-x k&lt;/code&gt; will kill the buffer when you&amp;rsquo;re done.&lt;/p&gt;
&lt;p&gt;Once you&amp;rsquo;re happy with it, you can install the snippet with &lt;code&gt;C-c C-c&lt;/code&gt; (or &lt;code&gt;M-x yas-load-snippet-buffer-and-close&lt;/code&gt;). Emacs will ask you what mode you want
to use the snippet in. In this case, it&amp;rsquo;s &lt;code&gt;org-mode&lt;/code&gt;.&lt;/p&gt;
&lt;h1 id=&#34;using-snippets&#34;&gt;Using Snippets&lt;/h1&gt;
&lt;p&gt;With that out of the way, you can insert your snippet in an org file with
the key sequence &lt;code&gt;slurm&amp;lt;tab&amp;gt;&lt;/code&gt;. Yasnippet provides a lot of additional
features. If you&amp;rsquo;re interested, see the &lt;a href=&#34;http://joaotavora.github.io/yasnippet/index.html&#34;&gt;online
manual&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&#34;interacting-with-slurm&#34;&gt;Interacting with Slurm&lt;/h1&gt;
&lt;p&gt;Once you&amp;rsquo;ve composed your Slurm script, you&amp;rsquo;ll want to submit it to the
cluster. If you&amp;rsquo;ve set up your org file to point to the cluster, as I
described in &lt;a href=&#34;https://plantarum.ca/2024/03/28/org-cluster/&#34;&gt;my previous post&lt;/a&gt;, all you need to
do is type &lt;code&gt;C-c C-c&lt;/code&gt;, while the cursor (point) is in your Slurm code block.&lt;/p&gt;
&lt;p&gt;It may take a few moments to connect, depending on your network. Once it
does, you should see something like this appear in your org file:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#+RESULTS:
: Tue 27 Aug 2024 12:17:15 PM EDT
: Submitted batch job 2931015
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;We could log into the server to check on the status of our job. But with
another snippet, we can have Emacs do that for us.&lt;/p&gt;
&lt;p&gt;I use the following snippet for this purpose:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# name: sacct
# key: sss
# expand-env: ((yas-indent-line &#39;fixed))
# --
#+BEGIN_SRC bash :results output 
date
sacct --jobs=`(save-excursion
                   (re-search-backward &amp;quot;^: Submitted batch job \\([[:digit:]]+\\)&amp;quot;)
                   (buffer-substring-no-properties
                     (match-beginning 1) (match-end 1)))`$0
#+END_SRC
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;I&amp;rsquo;ve used a bit of elisp code to do some work here. Yasnippet will process
any text between back ticks (&amp;quot;`&amp;quot;) as elisp code. Let&amp;rsquo;s step through that:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;`(save-excursion 
   (re-search-backward &amp;quot;^: Submitted batch job \\([[:digit:]]+\\)&amp;quot;)
   (buffer-substring-no-properties (match-beginning 1) 
     (match-end 1)))`
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;We start with &lt;code&gt;save-excursion&lt;/code&gt;. That tells Emacs that we want to come back
to the spot we started when the code is finished.&lt;/p&gt;
&lt;p&gt;Next we use &lt;code&gt;re-search-backward&lt;/code&gt; to find a line that starts with (&lt;code&gt;^&lt;/code&gt;) the
string &amp;ldquo;: Submitted batch job&amp;rdquo;, followed by a number. The number is wrapped
in &lt;code&gt;\\(&lt;/code&gt; and &lt;code&gt;\\)&lt;/code&gt;: these symbols tell Emacs to record the number as a
&amp;lsquo;match group&amp;rsquo;.&lt;/p&gt;
&lt;p&gt;Finally, we return the number with the &lt;code&gt;buffer-substring-no-properties&lt;/code&gt;
call. This returns the substring (text) in the current buffer, starting
with &lt;code&gt;(match-beginning 1)&lt;/code&gt; (the position of the first digit in our number), and
ending with &lt;code&gt;(match-end 1)&lt;/code&gt;, the position just after the last digit.&lt;/p&gt;
&lt;p&gt;This results in the following text being inserted in the buffer:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#+BEGIN_SRC bash :results output 
date
sacct --jobs=3472303
#+END_SRC
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;As written, this bit of code assumes the job we want to see the status of
is somewhere above the point that we call this snippet. It&amp;rsquo;s not very
robust if this isn&amp;rsquo;t true, but in my limited use-case it works well enough.&lt;/p&gt;
&lt;p&gt;The snippet leaves the cursor inside the code block. If you type &lt;code&gt;C-c C-c&lt;/code&gt;
at that point, it will submit the command to the cluster, and after a
moment or two you should see something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#+RESULTS:
: Fri 10 Jan 2025 05:28:53 PM EST
: JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
: ------------ ---------- ---------- ---------- ---------- ---------- -------- 
: 3472303          my_job   standard grdi_gena+          1  COMPLETED      0:0 
: 3472303.bat+      batch            grdi_gena+          1  COMPLETED      0:0 
: 3472303.ext+     extern            grdi_gena+          1  COMPLETED      0:0 
&lt;/code&gt;&lt;/pre&gt;&lt;h1 id=&#34;viewing-job-output&#34;&gt;Viewing Job Output&lt;/h1&gt;
&lt;p&gt;The preceding covers most of my needs for interacting with my cluster. One
last trick that can be handy is linking to log files or program output.&lt;/p&gt;
&lt;p&gt;You can insert a link to a file in org mode with &lt;code&gt;C-c C-l&lt;/code&gt;. You will be
prompted for the file location. This can include remote files with the
syntax: &lt;code&gt;/ssh:user@host:path/to/file&lt;/code&gt;; this can be shortened to
&lt;code&gt;/ssh:host:path/to/file&lt;/code&gt; if you&amp;rsquo;ve set the appropriate options in
&lt;code&gt;.ssh/config&lt;/code&gt; as I describe in a &lt;a href=&#34;https://plantarum.ca/2023/05/02/notes-on-gpsc-accounts/#configure-keys-and-addresses&#34;&gt;previous
post&lt;/a&gt;.
Next you&amp;rsquo;re prompted for the name of the link, which can be anything you
like. Once the link is complete, Emacs will colour it to let you know it
has a special power: you can view the file by pressing enter on the link.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t do this often enough to have made a snippet for it.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Spatial Tutorials Update</title>
      <link>https://plantarum.ca/2024/05/10/terra-time/</link>
      <pubDate>Fri, 10 May 2024 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2024/05/10/terra-time/</guid>
      <description>


&lt;p&gt;A quick update. The &lt;a href=&#34;https://rspatial.org/&#34;&gt;spatial analysis libraries&lt;/a&gt; in
the &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R Project&lt;/a&gt; have undergone a substantial
change in the past couple of years. The details are laid out in the &lt;a href=&#34;https://r-spatial.org/&#34;&gt;R
spatial blog&lt;/a&gt;, but the crux of the issue is that
legacy packages &lt;code&gt;rgdal&lt;/code&gt; and &lt;code&gt;rgeos&lt;/code&gt; have been retired, and packages that
depend on them (such as &lt;code&gt;raster&lt;/code&gt; and &lt;code&gt;sp&lt;/code&gt;) have been modified to use
new dependencies, or replaced entirely. For the most part, the things we
used to do with &lt;code&gt;raster&lt;/code&gt; we now do with
&lt;a href=&#34;https://rspatial.org/index.html&#34;&gt;&lt;code&gt;terra&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The transition was a bit rough, and for a while we needed to translate back
and forth between &lt;code&gt;terra&lt;/code&gt; and &lt;code&gt;raster&lt;/code&gt; in our work. That &lt;em&gt;should&lt;/em&gt; now be
over, with all current packages using the new &lt;code&gt;terra&lt;/code&gt;-based workflow.&lt;/p&gt;
&lt;p&gt;I have already updated my &lt;a href=&#34;https://plantarum.ca/2023/02/13/terra-maps&#34;&gt;quick mapping tutorial&lt;/a&gt;,
and I’ve just updated my &lt;a href=&#34;https://plantarum.ca/2023/07/28/ecospat-terra&#34;&gt;ecospat tutorial&lt;/a&gt;, now
that &lt;code&gt;ecospat&lt;/code&gt; has been fully updated to use &lt;code&gt;terra&lt;/code&gt; too. Some of the other
spatial tutorials you find here may not work properly, or at all, until I
have a chance to review them. If you happen to find anything that isn’t
working as expected, please let me know!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Preparing GBIF records for distribution modeling</title>
      <link>https://plantarum.ca/2024/04/04/record-cleaning/</link>
      <pubDate>Thu, 04 Apr 2024 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2024/04/04/record-cleaning/</guid>
      <description>


&lt;div id=&#34;gbif.org&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;GBIF.org&lt;/h1&gt;
&lt;p&gt;The Global Biodiversity Information Facility
(&lt;a href=&#34;https://www.gbif.org/&#34;&gt;GBIF.org&lt;/a&gt;) has become the standard open-access
online database of occurrence records for all manner of biological
organisms. It was initially a clearinghouse for museum records (such as
herbarium specimens), but now includes
&lt;a href=&#34;https://www.inaturalist.org&#34;&gt;iNaturalist&lt;/a&gt; observations (those that are
rated &lt;a href=&#34;https://www.gbif.org/dataset/50c9509d-22c7-4a22-a47d-8c48425ef4a7&#34;&gt;‘research’
grade&lt;/a&gt;),
survey data, and a growing variety of taxonomic and checklist sources.&lt;/p&gt;
&lt;p&gt;While GBIF’s expansion increases the overall value of the database, it
also means we need to be more circumspect in how we use the data. When I
first encountered GBIF decades ago, I used it as one of several sources for
herbarium records. I searched for the species I was looking for, and
received a list of museum specimens. Nowadays most, maybe all, online
herbarium data is mirrored by GBIF, so I no longer need to chase down
multiple websites to round up all the herbarium records I need.&lt;/p&gt;
&lt;p&gt;However, I can no longer assume that a GBIF record represents a physical
specimen. It could be: a human observation, with or without an associated
image; documentation harvested from sequence data submitted to
&lt;a href=&#34;https://www.ncbi.nlm.nih.gov/genbank/&#34;&gt;Genbank&lt;/a&gt;; an entry from a field
survey, which sometimes contains records for both &lt;em&gt;presences&lt;/em&gt; and
&lt;em&gt;absences&lt;/em&gt;. And of course, all of these records are subject to any number
of issues: transcription errors, identification errors, georeferencing
errors.&lt;/p&gt;
&lt;p&gt;All of which is to say, we need a way to filter the results of our GBIF query
to ensure the data we receive is fit for purpose.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-gbif-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Getting GBIF data&lt;/h1&gt;
&lt;p&gt;Step one is actually getting the data from GBIF. You can do this from the
website, but I now prefer to do this in my &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt;
scripts, using the
&lt;a href=&#34;https://docs.ropensci.org/rgbif/articles/rgbif.html&#34;&gt;rgbif&lt;/a&gt; package.&lt;/p&gt;
&lt;div id=&#34;finding-taxon-names&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Finding taxon names&lt;/h2&gt;
&lt;p&gt;To get started, we need to match our name up with the GBIF taxonomic
backbone:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(rgbif)
conyza &amp;lt;- name_backbone(&amp;quot;Conyza canadensis&amp;quot;)
erigeron &amp;lt;- name_backbone(&amp;quot;Erigeron canadensis&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This snippet searches the GBIF database for “&lt;em&gt;Conyza canadensis&lt;/em&gt;” and
“&lt;em&gt;Erigeron canadensis&lt;/em&gt;” and returns the closest matches. The results can be
a bit confusing to interpret. To make sense of it, we need to understand a
few terms.&lt;/p&gt;
&lt;p&gt;A &lt;code&gt;usageKey&lt;/code&gt; is a unique number associated with every taxon in the
database, at every level. This includes species, subspecies, genera, etc.,
and importantly, it includes both &lt;em&gt;accepted species&lt;/em&gt; and &lt;em&gt;synonyms&lt;/em&gt;.
Complicating things, GBIF taxonomy tables use the term &lt;code&gt;usageKey&lt;/code&gt;, but in
the individual observation records the term &lt;code&gt;taxonKey&lt;/code&gt; is used instead.
They both mean the same thing - you’ll use the &lt;code&gt;usageKey&lt;/code&gt; you get from your
&lt;code&gt;name_backbone&lt;/code&gt; search as the &lt;code&gt;taxonKey&lt;/code&gt; in the query you submit (see
below).&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;acceptedUsageKey&lt;/code&gt; (or &lt;code&gt;acceptedTaxonKey&lt;/code&gt;) is the number associated
with every &lt;em&gt;accepted&lt;/em&gt; taxon. For taxa that are synonyms, their
&lt;code&gt;acceptedUsageKey&lt;/code&gt; is the &lt;code&gt;usageKey&lt;/code&gt; of the accepted taxon they belong to.&lt;/p&gt;
&lt;p&gt;To illustrate the distinctions, let’s look at the values for the examples
above.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Key&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;scientificName&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Conyza canadensis (L.) Cronquist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;usageKey&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;5404801&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;status&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;SYNONYM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;acceptedUsageKey&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3146791&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;hr /&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr class=&#34;header&#34;&gt;
&lt;th align=&#34;left&#34;&gt;Key&lt;/th&gt;
&lt;th align=&#34;left&#34;&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;scientificName&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;Erigeron canadensis L.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;even&#34;&gt;
&lt;td align=&#34;left&#34;&gt;usageKey&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;3146791&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&#34;odd&#34;&gt;
&lt;td align=&#34;left&#34;&gt;status&lt;/td&gt;
&lt;td align=&#34;left&#34;&gt;ACCEPTED&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;In this case &lt;em&gt;Conyza canadensis&lt;/em&gt; is a synonym of &lt;em&gt;Erigeron canadensis&lt;/em&gt;. The
name &lt;em&gt;Conyza canadensis&lt;/em&gt; has its own &lt;code&gt;usageKey&lt;/code&gt;: 5404801. Its
&lt;code&gt;acceptedUsageKey&lt;/code&gt; is 3146791, which is the &lt;code&gt;usageKey&lt;/code&gt; for &lt;em&gt;Erigeron
canadensis&lt;/em&gt;. An additional wrinkle is that accepted taxa only have a
&lt;code&gt;usageKey&lt;/code&gt;; they don’t have an &lt;code&gt;acceptedUsageKey&lt;/code&gt;. You can also use the
&lt;code&gt;status&lt;/code&gt; field to check if a taxon is &lt;em&gt;accepted&lt;/em&gt; or a &lt;em&gt;synonym&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Understanding this is important to make sure you get the records you’re
after when you query the database. If you request data for &lt;code&gt;taxonKey&lt;/code&gt;
5404801 (the &lt;code&gt;usageKey&lt;/code&gt; for &lt;em&gt;Conyza canadensis&lt;/em&gt;), you’ll get records with
that name on them, but &lt;em&gt;not&lt;/em&gt; records for &lt;em&gt;Erigeron canadensis&lt;/em&gt;. On the
other hand, if you search for &lt;code&gt;taxonKey&lt;/code&gt; 3146791, you’ll get records for
&lt;em&gt;Erigeron canadensis&lt;/em&gt;, and also all records for any synonyms of that name,
including &lt;em&gt;Conyza canadensis&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;In other words, searching for an &lt;em&gt;accepted&lt;/em&gt; taxon will return results for
that taxon including all its synonyms. Searching for a &lt;em&gt;synonym&lt;/em&gt; will
return results only for that synonym.&lt;/p&gt;
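&lt;p&gt;As a minimal sketch of that rule, the snippet below picks the key to query from a small table of backbone matches. The data frame is a hand-typed copy of the values in the tables above, with &lt;code&gt;acceptedUsageKey&lt;/code&gt; set to &lt;code&gt;NA&lt;/code&gt; for the accepted name, and &lt;code&gt;query_key&lt;/code&gt; is a hypothetical helper, not an &lt;code&gt;rgbif&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Hand-typed copy of the two matches shown above (not a live GBIF query)
matches &amp;lt;- data.frame(
  scientificName = c(&amp;quot;Conyza canadensis (L.) Cronquist&amp;quot;,
                     &amp;quot;Erigeron canadensis L.&amp;quot;),
  usageKey = c(5404801, 3146791),
  status = c(&amp;quot;SYNONYM&amp;quot;, &amp;quot;ACCEPTED&amp;quot;),
  acceptedUsageKey = c(3146791, NA)
)

## query_key() is a hypothetical helper: use the accepted key for
## synonyms, and fall back to the usageKey for accepted names
query_key &amp;lt;- function(usageKey, acceptedUsageKey) {
  ifelse(is.na(acceptedUsageKey), usageKey, acceptedUsageKey)
}

unique(query_key(matches$usageKey, matches$acceptedUsageKey))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both names resolve to the same key, 3146791, so a query built from it would cover records filed under either name.&lt;/p&gt;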
&lt;p&gt;You can search for records by name, without using the &lt;code&gt;usageKey&lt;/code&gt;. But it’s
safer to look up and use the &lt;code&gt;usageKey&lt;/code&gt;, to confirm that the name you asked
for matches with something in the GBIF database.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;preparing-a-query&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Preparing a query&lt;/h2&gt;
&lt;p&gt;Once you have a list of one or more &lt;code&gt;taxonKey&lt;/code&gt; values, you’re ready to
request your data. For large record sets, and especially if you want to
request multiple species at once, the &lt;code&gt;rgbif&lt;/code&gt; function &lt;code&gt;occ_download_queue&lt;/code&gt;
is very convenient. Note that you need to have a (free) GBIF account in
order to use this.&lt;/p&gt;
&lt;p&gt;This is a three step process. You create your query with
&lt;code&gt;occ_download_prep&lt;/code&gt;, submit the query to GBIF with &lt;code&gt;occ_download_queue&lt;/code&gt;,
and once the query is done you download the results via &lt;code&gt;occ_download_get&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Starting with &lt;code&gt;occ_download_prep&lt;/code&gt;, a basic query only requires your account
credentials and one or more &lt;code&gt;taxonKeys&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;myQuery &amp;lt;- occ_download_prep(
  pred_in(&amp;quot;taxonKey&amp;quot;, c(3146791, 3189859)),
  pred(&amp;quot;hasCoordinate&amp;quot;, TRUE),
  format = &amp;quot;DWCA&amp;quot;,
  user = &amp;quot;YourUserName&amp;quot;,
  pwd = &amp;quot;YourGBIFPassword&amp;quot;,
  email = &amp;quot;your@email.address&amp;quot;
)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;SECURITY NOTE&lt;/strong&gt; You can provide your GBIF username and password in your
script as I have done, but there are more secure ways to submit your
credentials without listing them in your code. I have mine stored in my
&lt;code&gt;~/.Renviron&lt;/code&gt; file, which looks like this:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;GBIF_USER=&amp;quot;my_user_name&amp;quot;
GBIF_PWD=&amp;quot;my_password&amp;quot;
GBIF_EMAIL=&amp;quot;my@email.ca&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;See the &lt;a href=&#34;https://docs.ropensci.org/rgbif/articles/gbif_credentials.html&#34;&gt;rgbif
documentation&lt;/a&gt;
for more options.&lt;/p&gt;
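&lt;p&gt;With those variables in place, &lt;code&gt;rgbif&lt;/code&gt; can pick up the credentials from the environment, so &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;pwd&lt;/code&gt; and &lt;code&gt;email&lt;/code&gt; can be left out of the call. Before submitting anything, it may be worth checking that R can actually see them. The following sketch uses only base R, and assumes the variable names shown above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Look up the three variables; Sys.getenv() returns &amp;quot;&amp;quot; for any that are unset
creds &amp;lt;- Sys.getenv(c(&amp;quot;GBIF_USER&amp;quot;, &amp;quot;GBIF_PWD&amp;quot;, &amp;quot;GBIF_EMAIL&amp;quot;))
unset &amp;lt;- names(creds)[creds == &amp;quot;&amp;quot;]

## Report the names only, never the values
if (length(unset) &amp;gt; 0) {
  message(&amp;quot;Not set: &amp;quot;, paste(unset, collapse = &amp;quot;, &amp;quot;))
} else {
  message(&amp;quot;All three GBIF variables are set&amp;quot;)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;R reads &lt;code&gt;~/.Renviron&lt;/code&gt; at startup, so you’ll need to restart your session after editing the file.&lt;/p&gt;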
&lt;p&gt;The &lt;code&gt;occ_download_prep&lt;/code&gt; call above creates a query for two taxa, as specified by the provided keys;
it will filter the results to include only records with coordinates (i.e.,
&lt;code&gt;hasCoordinate&lt;/code&gt; is TRUE); and the results will be in the &lt;a href=&#34;https://dwc.tdwg.org/&#34;&gt;Darwin Core
Archive&lt;/a&gt; format (i.e., “DWCA”). We reviewed taxon keys
above. If we’re mapping our records, and don’t have time or need to do any
georeferencing ourselves, we can save time by limiting our results to
records that already have coordinates.&lt;/p&gt;
&lt;p&gt;The default format is “DWCA”, which includes a &lt;em&gt;lot&lt;/em&gt; of columns in the
results. You can choose “SIMPLE_CSV” instead, and this will give you a
subset of commonly used fields. However, this limits your ability to filter
records after download, so I recommend sticking with “DWCA”.&lt;/p&gt;
&lt;p&gt;There are many other ways to filter records, documented in
&lt;code&gt;?download_predicate_dsl&lt;/code&gt;. Depending on your focus, you might want to
restrict the results by
&lt;a href=&#34;https://dwc.tdwg.org/terms/#dwc:basisOfRecord&#34;&gt;basisOfRecord&lt;/a&gt;, to select
only “HumanObservation” or “PreservedSpecimen”; or by
&lt;a href=&#34;https://dwc.tdwg.org/terms/#dwc:datasetID&#34;&gt;datasetID&lt;/a&gt; to select a specific
project (e.g., iNaturalist). However, I’ve found that these terms are not
applied consistently, so it may be better to download everything and filter
after you’ve had a chance to inspect the tables yourself.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;submitting-a-query&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Submitting a query&lt;/h2&gt;
&lt;p&gt;With a query ready, you can now submit it to GBIF via:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;out &amp;lt;- occ_download_queue(.list = list(myQuery))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We could have submitted the query directly, by using &lt;code&gt;occ_download&lt;/code&gt; above
instead of &lt;code&gt;occ_download_prep&lt;/code&gt;. Preparing the query first and submitting it
with &lt;code&gt;occ_download_queue&lt;/code&gt; offers two advantages. First, we can submit
multiple queries at once, e.g.:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;out &amp;lt;- occ_download_queue(.list = list(queryA, queryB,
                                       queryC)) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Second, GBIF allows you to submit up to three queries at a time. If you
have more, you have to wait until one of the earlier queries is finished.
&lt;code&gt;occ_download_queue&lt;/code&gt; keeps track of this for you, submitting three
requests, and sending additional requests as the first three finish.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;retrieving-your-query&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Retrieving your query&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;occ_download_queue&lt;/code&gt; returns the details of your submission(s):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;$`46c5a45e8f1f2e64fc9eda5318e74972`
&amp;lt;&amp;lt;gbif download&amp;gt;&amp;gt;
  Your download is being processed by GBIF:
  ...
  Check status with
  occ_download_wait(&amp;#39;0025322-231120084113126&amp;#39;)
  After it finishes, use
  d &amp;lt;- occ_download_get(&amp;#39;0025322-231120084113126&amp;#39;) %&amp;gt;%
    occ_download_import()
  to retrieve your download.
  ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Usually it takes a few minutes to process your request. You can check its
progress with &lt;code&gt;occ_download_wait(&#39;...&#39;)&lt;/code&gt;, using the details provided. Once
the query is done, you can download it via &lt;code&gt;occ_download_get&lt;/code&gt;, and read it
into &lt;code&gt;R&lt;/code&gt; with &lt;code&gt;occ_download_import&lt;/code&gt;, as shown.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: once your query is submitted by &lt;code&gt;occ_download_queue&lt;/code&gt;, it will be
processed remotely by GBIF. If something should happen in the meantime –
your computer crashes, or you cancel the function call for any reason –
the query will still be processed. However, should this happen you won’t
have the &lt;code&gt;downloadKey&lt;/code&gt; you need to retrieve the results. You can get this
key by logging into your GBIF account on the website, navigating to the
Downloads section of your profile, and clicking on either the &lt;code&gt;DOI&lt;/code&gt; or the
&lt;code&gt;SHOW&lt;/code&gt; links for the download in question. This will take you to a page
with the metadata for the query. It doesn’t include the actual
&lt;code&gt;downloadKey&lt;/code&gt;, but that value is present as the final part of the url, as
shown here:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;./gbif-downloadKey.jpg&#34; alt=&#34;The GBIF download summary page, showing the downloadKey value in the URL&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;The GBIF download summary page, showing the downloadKey value in the
URL&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;You can also click the &lt;code&gt;Download&lt;/code&gt; button on the right side to download the
file from your browser, if you prefer.&lt;/p&gt;
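&lt;p&gt;If you copy the address of that summary page, base R can pull the key out for you. This is a small sketch; the URL below is an assumed example, built from the key in the output above, so substitute the one from your own browser:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Assumed example URL; the last path component is the downloadKey
url &amp;lt;- &amp;quot;https://www.gbif.org/occurrence/download/0025322-231120084113126&amp;quot;
downloadKey &amp;lt;- basename(url)
downloadKey

## The key can then be passed on as usual, e.g.:
## d &amp;lt;- occ_download_get(downloadKey) %&amp;gt;% occ_download_import()&lt;/code&gt;&lt;/pre&gt;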
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;cleaning-gbif-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Cleaning GBIF data&lt;/h1&gt;
&lt;p&gt;Now that we have our records downloaded, we need to review and clean the
data before analysis. Note that in the code below, I’m using a large
download to demonstrate my investigations. I haven’t included this data in
this tutorial, but you can try the code on your own data.&lt;/p&gt;
&lt;div id=&#34;filtering-based-on-the-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Filtering based on the data&lt;/h2&gt;
&lt;div id=&#34;basisofrecord&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;basisOfRecord&lt;/h3&gt;
&lt;p&gt;With the data in hand, we can take a closer look at what kind of records
they are, and where they came from. Starting with &lt;code&gt;basisOfRecord&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;d1 &amp;lt;- occ_download_get(out[[1]], path = &amp;quot;./dl/&amp;quot;) %&amp;gt;% 
  occ_download_import()
sort(table(d1$basisOfRecord), decreasing = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;  HUMAN_OBSERVATION  PRESERVED_SPECIMEN          OCCURRENCE 
            4224215              325278              112588 
        OBSERVATION   MATERIAL_CITATION     LIVING_SPECIMEN 
             104335               17703                3123 
    MATERIAL_SAMPLE MACHINE_OBSERVATION     FOSSIL_SPECIMEN 
               1698                 222                  10 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, I have nine different kinds of record. The definitions of
&lt;em&gt;some&lt;/em&gt; of these terms are listed in the &lt;a href=&#34;https://dwc.tdwg.org/terms/#livingspecimen&#34;&gt;Darwin Core Quick Reference
Guide&lt;/a&gt;. &lt;em&gt;HUMAN_OBSERVATION&lt;/em&gt;
includes, among other things, &lt;a href=&#34;https://www.inaturalist.org&#34;&gt;iNaturalist&lt;/a&gt;
records. &lt;em&gt;PRESERVED_SPECIMEN&lt;/em&gt; includes mostly (and most) herbarium records.
I usually want both of these groups.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;OCCURRENCE&lt;/em&gt; and &lt;em&gt;OBSERVATION&lt;/em&gt; aren’t well defined or consistently used, so
require further examination to determine if we want to retain them.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;MATERIAL_CITATION&lt;/em&gt;, &lt;em&gt;MATERIAL_SAMPLE&lt;/em&gt;, and &lt;em&gt;MACHINE_OBSERVATION&lt;/em&gt; are a
little vague, inconsistently used, and also not used often, so I remove
them.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;LIVING_SPECIMEN&lt;/em&gt; and &lt;em&gt;FOSSIL_SPECIMEN&lt;/em&gt; are self-explanatory, and usually
not what I want for my work.&lt;/p&gt;
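&lt;p&gt;Choices like these can be written as a single filter. Here is a sketch on a made-up five-row table (the &lt;code&gt;toy&lt;/code&gt; data frame and its values are invented for illustration); with real data you would apply the same test to &lt;code&gt;d1&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Made-up stand-in for the downloaded table
toy &amp;lt;- data.frame(
  gbifID = 1:5,
  basisOfRecord = c(&amp;quot;HUMAN_OBSERVATION&amp;quot;, &amp;quot;PRESERVED_SPECIMEN&amp;quot;,
                    &amp;quot;FOSSIL_SPECIMEN&amp;quot;, &amp;quot;MATERIAL_SAMPLE&amp;quot;,
                    &amp;quot;HUMAN_OBSERVATION&amp;quot;)
)

## The record types to retain; adjust this list to suit your own work
keep &amp;lt;- c(&amp;quot;HUMAN_OBSERVATION&amp;quot;, &amp;quot;PRESERVED_SPECIMEN&amp;quot;)

## Keep only the rows whose basisOfRecord is in the list
kept &amp;lt;- toy[toy$basisOfRecord %in% keep, ]
table(kept$basisOfRecord)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Listing the types you want to keep, rather than the ones you want to drop, means any type you haven’t considered is excluded by default.&lt;/p&gt;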
&lt;/div&gt;
&lt;div id=&#34;inaturalist-and-human-observations&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;iNaturalist and Human Observations&lt;/h3&gt;
&lt;p&gt;Research-grade &lt;a href=&#34;https://www.gbif.org/dataset/50c9509d-22c7-4a22-a47d-8c48425ef4a7&#34;&gt;iNaturalist
records&lt;/a&gt;
are imported to GBIF every few weeks. We can extract these using the field
&lt;code&gt;datasetKey&lt;/code&gt;, with the value &lt;code&gt;&#34;50c9509d-22c7-4a22-a47d-8c48425ef4a7&#34;&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sum(d1$datasetKey == &amp;quot;50c9509d-22c7-4a22-a47d-8c48425ef4a7&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;357863 # iNaturalist Records in my data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You might be tempted to filter on &lt;code&gt;datasetName&lt;/code&gt;, since there is a dataset
called “iNaturalist research-grade observations”. Unfortunately, this name
isn’t used consistently: the name “iNaturalist observations” is also used
for some records in the &lt;em&gt;same&lt;/em&gt; dataset. In fact, the word &lt;code&gt;iNaturalist&lt;/code&gt;
appears in a variety of different dataset names:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(d1$datasetName[grep(&amp;quot;iNaturalist&amp;quot;, d1$datasetName)])&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;&amp;quot;Flora of Russia&amp;quot; on iNaturalist: a trusted backlog 
                                                131 
                           iNaturalist observations 
                                                 25 
            iNaturalist research-grade observations 
                                             357838 
               iNaturalist XicotliData observations 
                                                  7 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Which leaves us with the unwieldy &lt;code&gt;datasetKey == &#34;50c9509d-22c7-4a22-a47d-8c48425ef4a7&#34;&lt;/code&gt; as the most reliable way to get the
official iNaturalist dataset.&lt;/p&gt;
&lt;p&gt;All of the iNaturalist records are labelled as &lt;code&gt;HUMAN_OBSERVATION&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(d1$basisOfRecord[d1$datasetKey ==
                       &amp;quot;50c9509d-22c7-4a22-a47d-8c48425ef4a7&amp;quot;]) &lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;HUMAN_OBSERVATION 
           357863 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But this is only a small fraction of the 4,224,215 &lt;code&gt;HUMAN_OBSERVATION&lt;/code&gt;
records in my query. In my data, there are over 1500 different
&lt;code&gt;HUMAN_OBSERVATION&lt;/code&gt; datasets. This includes a number of surveys, and in
some cases these surveys record presences &lt;em&gt;and&lt;/em&gt; absences, as recorded in
the &lt;code&gt;occurrenceStatus&lt;/code&gt; field:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;table(d1$occurrenceStatus)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt; ABSENT PRESENT 
  27896 4761276 &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Whether you want to include a heterogeneous collection of surveys in your
data depends on the question you’re asking.&lt;/p&gt;
&lt;p&gt;But if you are planning to do distribution modeling, you may not want to
include &lt;code&gt;ABSENT&lt;/code&gt; records in your training data. It will depend on the scale
of your modeling, and the scale of the surveys that contributed their data
to GBIF. Fine-scale local surveys may include a mix of PRESENT and ABSENT
records within a small area (say a few hundred meters). If your modeling is
at the scale of 30 arc-second climate rasters (~1km^2), that’s a problem. A
species can be absent in a 10m^2 quadrat, but present in the larger 1km^2
raster grid.&lt;/p&gt;
&lt;p&gt;Here are a few examples of filters you might want to use:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Row filters go before the comma: d1[rows, columns]
herbarium &amp;lt;- d1[d1$basisOfRecord == &amp;quot;PRESERVED_SPECIMEN&amp;quot;, ]
iNat &amp;lt;- d1[d1$datasetKey ==
             &amp;quot;50c9509d-22c7-4a22-a47d-8c48425ef4a7&amp;quot;, ]
present &amp;lt;- d1[d1$occurrenceStatus == &amp;quot;PRESENT&amp;quot;
              &amp;amp; d1$basisOfRecord == &amp;quot;HUMAN_OBSERVATION&amp;quot;, ]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;common-location-errors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Common location errors&lt;/h2&gt;
&lt;p&gt;The &lt;code&gt;R&lt;/code&gt; package &lt;code&gt;CoordinateCleaner&lt;/code&gt; provides some
functions for dealing with common problems:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;cc_cen&lt;/code&gt;
: Identifies records close to the centroid of a country. Automated
georeferencing will often use the centroid of a country as the
coordinates for a specimen that doesn’t have any other location data.
This could be 100s/1000s of kilometers away from the actual location.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;cc_inst&lt;/code&gt;
: Identifies records close to the location of museums. Automated
georeferencing will often use the location of a museum as the
location for specimens it contains, regardless of where the specimen
was actually collected.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;cc_sea&lt;/code&gt;
: Identifies records that are in the ocean, which are clearly errors
for terrestrial organisms. However, this could also be a consequence
of relatively minor errors in record coordinates or the mapping of
coastlines.&lt;/p&gt;
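&lt;p&gt;To give a sense of what a test like &lt;code&gt;cc_cen&lt;/code&gt; is checking, here is a base R sketch of the underlying idea: measure the distance from each record to a reference point, and flag the records that are suspiciously close. This is not how &lt;code&gt;CoordinateCleaner&lt;/code&gt; implements it. The helper &lt;code&gt;haversine_km&lt;/code&gt;, the coordinates, the reference point and the 1 km threshold are all made up for illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Great-circle distance in km (haversine formula, Earth radius ~6371 km)
haversine_km &amp;lt;- function(lon1, lat1, lon2, lat2) {
  rad &amp;lt;- pi / 180
  dlon &amp;lt;- (lon2 - lon1) * rad
  dlat &amp;lt;- (lat2 - lat1) * rad
  a &amp;lt;- sin(dlat / 2)^2 +
    cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * 6371 * asin(sqrt(pmin(1, a)))
}

## Made-up records, and a made-up reference point standing in for a centroid
recs &amp;lt;- data.frame(
  decimalLongitude = c(-75.70, -79.38, -75.7001),
  decimalLatitude = c(45.42, 43.65, 45.4201)
)
centroid &amp;lt;- c(lon = -75.70, lat = 45.42)

d &amp;lt;- haversine_km(recs$decimalLongitude, recs$decimalLatitude,
                  centroid[[&amp;quot;lon&amp;quot;]], centroid[[&amp;quot;lat&amp;quot;]])

## Flag anything within 1 km of the reference point
recs$near_centroid &amp;lt;- d &amp;lt; 1
recs&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first and third records sit on or within a few meters of the reference point and are flagged; the second is hundreds of kilometers away and is not. A flag like this marks a record for review; it doesn’t prove the coordinates are wrong.&lt;/p&gt;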
&lt;p&gt;For a more thorough overview of these kinds of issues, see this post by
John Waller on the &lt;a href=&#34;https://data-blog.gbif.org/post/gbif-filtering-guide/&#34;&gt;GBIF data
blog&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;identification-errors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Identification Errors&lt;/h2&gt;
&lt;p&gt;Now that we’ve dealt with the more ‘mechanical’ sorts of errors we might
find in our data, we can review our records for obvious or suspected
identification errors. If you have a large dataset, especially if it
includes many species, or species you are not familiar with, this can be
a daunting task.&lt;/p&gt;
&lt;p&gt;Two important botanical resources can help us with this, at least in North
America: the &lt;a href=&#34;http://floranorthamerica.org&#34;&gt;Flora of North America&lt;/a&gt; and
&lt;a href=&#34;https://www.natureserve.org/&#34;&gt;NatureServe.org&lt;/a&gt;. These provide a ‘sanity
check’ for our GBIF records, as both websites host carefully curated data
on plant distributions.&lt;/p&gt;
&lt;p&gt;The Flora of North America includes taxonomic treatments of all plant
species that occur outside of cultivation in Canada and the United
States&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;. The maps aren’t very detailed, showing us only the states or
provinces each taxon is found in. But those maps are all based on actual
specimens examined by a taxonomist with expertise on that plant. If New
York appears on an FNA distribution map, it means there is at least one
documented record of that species, confirmed by an expert, in that state.
And conversely, if a state or province isn’t included on the map, it means
that a thorough search of herbaria has failed to find a single record of
the species.&lt;/p&gt;
&lt;p&gt;NatureServe is similar, but the distribution maps are based on data from
state/provincial/regional natural heritage programs. These are generally
staffed by expert field botanists, rather than taxonomists. Their job is to document
which plants grow in their jurisdiction. If a NatureServe distribution map
includes Ontario, that means an expert field botanist has evidence (which
could be a herbarium voucher, or a reliable observation) that the species
occurs (or occurred) in Ontario. And again, if a state or province isn’t
included on the map, then no such evidence has been found.&lt;/p&gt;
&lt;p&gt;NatureServe and FNA distribution maps will usually match fairly closely.
NatureServe is more responsive to new information, and those maps will
periodically be updated. FNA isn’t yet complete (although it’s getting
close), but for the taxa that it covers it provides a comprehensive review
of their distribution at the time of publication (i.e., it doesn’t get
updated).&lt;/p&gt;
&lt;p&gt;Neither source is complete, and plants do move around. But, if you have
records in your GBIF dataset from areas beyond the range documented in
NatureServe or FNA, these are the ones I’d be most concerned about
verifying. Use the map interface on the GBIF website to locate them,
and look at the record details for clues to confirm or refute them. If they
trace back to iNaturalist records, you may be able to confirm the
identification yourself from the photos, or &lt;a href=&#34;https://www.inaturalist.org/observations/189766549&#34;&gt;ask the
observer for more
details&lt;/a&gt;.&lt;/p&gt;
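&lt;p&gt;One way to shortlist those records is to compare the state or province
of each record against the jurisdictions shown on the reference map. This
is a minimal sketch, assuming your records are in a data frame with a
&lt;code&gt;stateProvince&lt;/code&gt; column; the record IDs and the list of documented
jurisdictions are made up for the example:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Hypothetical records for a single species
recs &amp;lt;- data.frame(
  gbifID = c(&amp;quot;a1&amp;quot;, &amp;quot;a2&amp;quot;, &amp;quot;a3&amp;quot;, &amp;quot;a4&amp;quot;),
  stateProvince = c(&amp;quot;Ontario&amp;quot;, &amp;quot;New York&amp;quot;, &amp;quot;Texas&amp;quot;, NA))

## Jurisdictions shown on the reference map (invented here)
documented &amp;lt;- c(&amp;quot;Ontario&amp;quot;, &amp;quot;Quebec&amp;quot;, &amp;quot;New York&amp;quot;, &amp;quot;Vermont&amp;quot;)

## TRUE for records outside the documented range. Records with no
## stateProvince are flagged too, since they cannot be checked.
recs$review &amp;lt;- !(recs$stateProvince %in% documented)

## The records to look at first:
recs[recs$review, ]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;stateProvince&lt;/code&gt; field is free text, so in real data you will
probably need to tidy up spellings and abbreviations before matching.&lt;/p&gt;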
&lt;p&gt;You likely won’t be able to confirm the identifications of all of your
records, but you can often validate or exclude the outlying records that
have the greatest potential to distort your analysis.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;The FNA project is more properly the “Flora of North America north of
Mexico”.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Introduction to Emacs&#39; Org Mode for cluster computing</title>
      <link>https://plantarum.ca/2024/03/28/org-cluster/</link>
      <pubDate>Thu, 28 Mar 2024 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2024/03/28/org-cluster/</guid>
      <description>


&lt;div id=&#34;getting-started&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Getting Started&lt;/h1&gt;
&lt;p&gt;This tutorial builds on my previous
&lt;a href=&#34;https://www.gnu.org/software/emacs/&#34;&gt;Emacs&lt;/a&gt; posts (i.e.,
&lt;a href=&#34;https://plantarum.ca/2020/06/16/emacs-tutorial-01/&#34;&gt;introduction&lt;/a&gt;,
&lt;a href=&#34;https://plantarum.ca/2020/06/17/emacs-tutorial-02/&#34;&gt;Orgmode&lt;/a&gt;, &lt;a href=&#34;https://plantarum.ca/2020/12/30/emacs-03/&#34;&gt;R and
ESS&lt;/a&gt;). If you’re not familiar with Emacs, you might
want to look through those first, particularly
&lt;a href=&#34;https://plantarum.ca/2020/06/17/emacs-tutorial-02/&#34;&gt;Orgmode&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In this post, I will show you how to set up an &lt;code&gt;Org mode&lt;/code&gt; (or &lt;code&gt;Orgmode&lt;/code&gt;,
&lt;code&gt;org-mode&lt;/code&gt;) file on your local machine (laptop or office desktop) to manage
and run a cluster computing project.&lt;/p&gt;
&lt;p&gt;While not absolutely necessary, working this way is &lt;em&gt;much&lt;/em&gt; easier if you’ve
configured your machine to provide passwordless access to the server. I’ve
talked about setting up your &lt;a href=&#34;https://plantarum.ca/2023/05/02/notes-on-gpsc-accounts/#configure-keys-and-addresses&#34;&gt;.ssh/config
file&lt;/a&gt;
to manage usernames and hostnames/addresses previously, as well as setting
up &lt;a href=&#34;https://plantarum.ca/2014/08/19/medium-performance-cluster-computing#security&#34;&gt;RSA
keys&lt;/a&gt;
so you can securely access remote servers without a password.&lt;/p&gt;
&lt;p&gt;In addition, within Emacs you’ll likely want to set the variable
&lt;code&gt;org-confirm-babel-evaluate&lt;/code&gt; to &lt;code&gt;nil&lt;/code&gt;, so that you don’t get asked for
confirmation each time you try to evaluate a code block. You’ll also want to
configure which languages you would like to use in code blocks, via the
&lt;code&gt;Org Babel Load Languages&lt;/code&gt; setting. I step through this in my &lt;a href=&#34;https://plantarum.ca/2020/06/17/emacs-tutorial-02#setting-up&#34;&gt;previous
post&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;org-file-headers&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Org file headers&lt;/h1&gt;
&lt;p&gt;First off, we want to tell &lt;code&gt;Org Mode&lt;/code&gt; where we want to run our code. Putting
this information in the file header ensures it will apply to all the code
blocks in the file (unless we explicitly change it for a particular block).&lt;/p&gt;
&lt;p&gt;My file template includes the following two lines as the first two lines of
my &lt;code&gt;.org&lt;/code&gt; file:&lt;/p&gt;
&lt;pre class=&#34;org&#34;&gt;&lt;code&gt;# -*- org-export-babel-evaluate: nil -*-
#+PROPERTY: header-args:bash :results output :dir /ssh:gpsc:./path/to/my/project&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The first line tells &lt;code&gt;Orgmode&lt;/code&gt; not to evaluate code blocks if/when we
export our source file (i.e., if we want to create a pdf or html report).
If we don’t do this, then every time we export the document it will
resubmit all of our cluster jobs, which is almost certainly not what we
want.&lt;/p&gt;
&lt;p&gt;The second line sets the default properties for &lt;code&gt;bash&lt;/code&gt; code blocks.
&lt;code&gt;:results output&lt;/code&gt; means we want the text printed out by the code in our
code blocks inserted into the file. &lt;code&gt;:dir /ssh:gpsc:./path/to/my/project&lt;/code&gt;
means we want these code blocks to be run on the remote host &lt;code&gt;gpsc&lt;/code&gt;, which
we access through &lt;code&gt;ssh&lt;/code&gt;, and in the directory &lt;code&gt;./path/to/my/project&lt;/code&gt;
(relative to my home directory).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When this is properly configured, we can edit the &lt;code&gt;org&lt;/code&gt; file on our local
machine, and send all the code to the server to run!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you make a mistake in the header lines, or just want to change the path
or other details, be sure to update the settings by pressing &lt;code&gt;C-c C-c&lt;/code&gt; with
the cursor on the header line, or, in the menus, select ‘Org’ -&amp;gt;
‘Refresh/Reload’ -&amp;gt; ‘Refresh setup current buffer’.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;executing-code-on-the-server&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Executing code on the server&lt;/h1&gt;
&lt;p&gt;Now we’re ready to run some code. To create a block, enter the following
text in the file:&lt;/p&gt;
&lt;pre class=&#34;org&#34;&gt;&lt;code&gt;#+begin_src bash
  date
  hostname
  ls
#+end_src&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The lines &lt;code&gt;#+begin_src ...&lt;/code&gt; and &lt;code&gt;#+end_src&lt;/code&gt; denote the code block, with the
argument &lt;code&gt;bash&lt;/code&gt; indicating that this is the language for this block.&lt;/p&gt;
&lt;p&gt;With the cursor (point) anywhere in the block, hit &lt;code&gt;C-c C-c&lt;/code&gt; to evaluate
the code. The results will show up in a few moments:&lt;/p&gt;
&lt;pre class=&#34;org&#34;&gt;&lt;code&gt;#+RESULTS:
: Thu 28 Mar 2024 09:31:21 PM UTC
: hostname.tld.ca
: bin  data  man	miniconda3  ovbin  rubus  slurm_sample.sh  srcs  wp2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The last line lists the contents of my home directory on &lt;code&gt;hostname.tld.ca&lt;/code&gt; - success!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;write-submit-job-requests-with-slurm&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Write &amp;amp; submit job requests (with slurm)&lt;/h1&gt;
&lt;p&gt;Now we can run bash scripts on the head node. That’s fine for installing
packages with &lt;code&gt;conda&lt;/code&gt;/&lt;code&gt;mamba&lt;/code&gt; or whatever system you’re using. But for
&lt;em&gt;real&lt;/em&gt; work, we’ll need to submit a job request. In my case we use the
&lt;code&gt;slurm&lt;/code&gt; scheduler. I use two different methods to run these scripts.&lt;/p&gt;
&lt;div id=&#34;heredocs&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Heredocs&lt;/h2&gt;
&lt;p&gt;In most cases, I include the entire script as a
&lt;a href=&#34;https://linuxize.com/post/bash-heredoc/&#34;&gt;heredoc&lt;/a&gt;:&lt;/p&gt;
&lt;pre class=&#34;org&#34;&gt;&lt;code&gt;#+begin_src bash :results output
  sbatch &amp;lt;&amp;lt;SUBMITSCRIPT
  #!/bin/bash
  #SBATCH --job-name=myjob
  #SBATCH --output=myjob.log
  #SBATCH --time=24:00:00
  #SBATCH --cpus-per-task=1
  #SBATCH --mem-per-cpu=1G

  VARIABLE=&amp;quot;My variable&amp;quot;

  source ~/miniconda3/etc/profile.d/conda.sh
  conda activate bowtie2

  echo \$VARIABLE

  bowtie2 --help

  SUBMITSCRIPT
#+end_src&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Of course the actual details will depend on how your server is set up,
and whether it uses slurm, qsub, or another scheduler.)&lt;/p&gt;
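&lt;p&gt;Note that when you include bash variables in scripts submitted as
&lt;code&gt;heredocs&lt;/code&gt;, you need to escape the &lt;code&gt;$&lt;/code&gt; when you reference them (e.g.,
&lt;code&gt;\$VARIABLE&lt;/code&gt;). Otherwise the shell that calls &lt;code&gt;sbatch&lt;/code&gt; expands the
variable before the script is submitted, when it is not yet defined.&lt;/p&gt;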
&lt;p&gt;Again, we can submit this script by pressing &lt;code&gt;C-c C-c&lt;/code&gt; when the cursor is
anywhere in the code block. In this case, the output we get is from
&lt;code&gt;sbatch&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;org&#34;&gt;&lt;code&gt;#+RESULTS:
: Submitted batch job 2075188&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Having a record of the job number allows us to track its progress with
another code block:&lt;/p&gt;
&lt;pre class=&#34;org&#34;&gt;&lt;code&gt;#+BEGIN_SRC bash
sacct --jobs=2075188 --format=jobid,jobname,state,elapsed,ReqMem,MaxRSS
#+END_SRC&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Running this (&lt;code&gt;C-c C-c&lt;/code&gt; again) returns the status of my job:&lt;/p&gt;
&lt;pre class=&#34;org&#34;&gt;&lt;code&gt;#+RESULTS:
: JobID           JobName      State    Elapsed     ReqMem     MaxRSS 
: ------------ ---------- ---------- ---------- ---------- ---------- 
: 2075188         ustacks    RUNNING   00:00:07        64G            
: 2075188.bat+      batch    RUNNING   00:00:07                       
: 2075188.ext+     extern    RUNNING   00:00:07                       &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we’re getting somewhere! In this one file we can include our freehand
notes and metadata, alongside the scripts we used in the analysis. One
file per project, so we don’t need to keep track of different script
versions for each step in a long workflow.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;standalone-scripts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Standalone scripts&lt;/h2&gt;
&lt;p&gt;Most job scripts are pretty simple, so the &lt;code&gt;heredoc&lt;/code&gt; approach is fine, and
keeps all the details together in the &lt;code&gt;org&lt;/code&gt; file. However, if you have a
more involved script, you may want to keep it in a file of its own. That
will give you access to the language-specific editing features of Emacs,
and may help keep your main &lt;code&gt;org&lt;/code&gt; file to a manageable length.&lt;/p&gt;
&lt;p&gt;If you choose this approach, you can include a link to the external script
with the format:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;[[file:/ssh:servername:path/to/script.sh][script.sh]]&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;You’ll want this file to be in the same directory you specified in the
&lt;code&gt;org&lt;/code&gt; file header.&lt;/p&gt;
&lt;p&gt;Using this link format, pressing enter with the cursor on the link text
will open that file for you, allowing you to read and edit it with ease.&lt;/p&gt;
&lt;p&gt;To submit the script, we can use a much smaller code block:&lt;/p&gt;
&lt;pre class=&#34;org&#34;&gt;&lt;code&gt;#+begin_src bash
  sbatch script.sh
#+end_src&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I would submit jobs the same way if the analysis uses other languages,
such as Perl, Python, or R.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;streamlining-all-the-text-entry&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Streamlining all the text entry&lt;/h1&gt;
&lt;p&gt;This system works very well, and being based on plain text files it’s
pretty robust and portable. However, there are a lot of bits of text to
enter, and remembering all the syntax can be tedious. &lt;code&gt;orgmode&lt;/code&gt; provides
some conveniences to simplify this.&lt;/p&gt;
&lt;p&gt;To enter a new code block, you can use the shortcut &lt;code&gt;C-c C-,&lt;/code&gt; to open the
block menu, and select source code with &lt;code&gt;s&lt;/code&gt;. Then add the header &lt;code&gt;bash&lt;/code&gt; and
you’re ready to go. You can do the same thing via the menus with ‘Org’ -&amp;gt;
‘Editing’ -&amp;gt; ‘Add block structure’.&lt;/p&gt;
&lt;p&gt;The headers for &lt;code&gt;slurm&lt;/code&gt; scripts can be quite involved as well. I use &lt;a href=&#34;https://joaotavora.github.io/yasnippet/&#34;&gt;the
yasnippet extension&lt;/a&gt; to insert
templates for these. I’ll leave the details for another post.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-about-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;What about R?&lt;/h1&gt;
&lt;p&gt;You can run &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; scripts in batch mode on a
server just like any other script. But you might want to run an interactive
process for exploring the results of your scripts. I do this by passing the
output from code blocks run on the server to a local instance of R. I’ll
explain that in a future post as well.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Extrapolation Detection (exDet) for SDMs</title>
      <link>https://plantarum.ca/2023/12/19/exdet/</link>
      <pubDate>Tue, 19 Dec 2023 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2023/12/19/exdet/</guid>
      <description>


&lt;div id=&#34;identifying-non-analogous-climate-conditions&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Identifying Non-Analogous Climate Conditions&lt;/h1&gt;
&lt;p&gt;A major concern when projecting species distribution models to new contexts
(e.g., invaded ranges, or future climates) is establishing whether (and
where) the environments in the new context are analogous to those in the
training region (&lt;a href=&#34;https://plantarum.ca/2020/06/15/maxent#projection&#34;&gt;see my notes&lt;/a&gt;). A common
approach is to compare each variable in isolation, and construct a
“Multivariate Environmental Similarity Surface”, or MESS &lt;span class=&#34;citation&#34;&gt;(Elith et al., &lt;a href=&#34;#ref-ElithEtAl_2010&#34;&gt;2010&lt;/a&gt;)&lt;/span&gt;.
Areas in the new context that are outside the range of any variable from
the training context will have values below 0, with lower values indicating
greater departures.&lt;/p&gt;
&lt;p&gt;However, the MESS approach does not account for correlations among
variables. Perhaps the native range of a species includes areas with hot,
wet summers, but not hot, dry ones. The MESS analysis would consider
conditions ‘analogous’ as long as they were within the temperature range
and precipitation range of the training region. This could result in novel
environmental combinations being incorrectly identified as analogous
climate:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;mess.png&#34; alt=&#34;Visualization of MESS analysis. The green area indicates the reference climate conditions, and the red square shows the climate space MESS will identify as ‘analog’.&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Visualization of MESS analysis. The green area indicates the reference
climate conditions, and the red square shows the climate space MESS will
identify as ‘analog’.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Note particularly the upper left corner. This is well outside the range of
conditions in the training region (the green area), but by considering each
variable separately, MESS will treat this region as if it were analogous.&lt;/p&gt;
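&lt;p&gt;The problem can be reproduced with a few made-up numbers. This sketch
applies only the per-variable range test that underlies MESS (the full
index is a continuous similarity score, not a yes/no answer). The variable
names and values are invented for illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Toy training climate: hot cells are wet, cool cells are dry
ref &amp;lt;- cbind(temp = c(10, 15, 20, 25, 30, 12, 28),
             precip = c(250, 380, 620, 790, 1000, 300, 900))

## Two projection cells: one typical, one hot but dry
new &amp;lt;- rbind(typical = c(temp = 20, precip = 600),
             hotDry = c(temp = 29, precip = 260))

## Is each variable, taken alone, inside the training range?
lo &amp;lt;- apply(ref, 2, min)
hi &amp;lt;- apply(ref, 2, max)
inRange &amp;lt;- sweep(new, 2, lo, &amp;quot;&amp;gt;=&amp;quot;) &amp;amp;
  sweep(new, 2, hi, &amp;quot;&amp;lt;=&amp;quot;)

## A cell passes if every variable is in range:
analog &amp;lt;- apply(inRange, 1, all)
analog&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both cells pass, even though no training cell is anything like the hot, dry
one.&lt;/p&gt;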
&lt;/div&gt;
&lt;div id=&#34;extrapolation-detection&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Extrapolation Detection&lt;/h1&gt;
&lt;p&gt;Despite the name, MESS is only just barely multivariate: it evaluates
multiple variables, but each one is considered in isolation. Extrapolation
Detection &lt;span class=&#34;citation&#34;&gt;(Mesgaran et al., &lt;a href=&#34;#ref-MesgaranEtAl_2014&#34;&gt;2014&lt;/a&gt;)&lt;/span&gt;, or &lt;code&gt;exDet&lt;/code&gt;, provides a more sophisticated
approach, based on the Mahalanobis distance. This is a common multivariate
distance measure, designed to account for covariation among variables. In
this context, our analysis will include the following steps:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Calculate the Mahalanobis distance for every cell in the training region&lt;/li&gt;
&lt;li&gt;Scale the distances by dividing by the maximum distance, such that they
range between 0 and 1&lt;/li&gt;
&lt;li&gt;Calculate the Mahalanobis distance for every cell in the novel
region/time period, again dividing by the maximum distance from the
&lt;strong&gt;training&lt;/strong&gt; (not the new) region.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This will produce a raster map with cell values ranging from 0 to infinity,
which serves as the Novelty Index. &lt;span class=&#34;citation&#34;&gt;Mesgaran et al. (&lt;a href=&#34;#ref-MesgaranEtAl_2014&#34;&gt;2014&lt;/a&gt;)&lt;/span&gt; refer to it as NT2,
to distinguish it from the &lt;code&gt;MESS&lt;/code&gt; index which they call NT1. Cells with NT2
values less than 1 are within the range of conditions present in the
training range; cells with NT2 &amp;gt; 1 are outside that range, with higher
values indicating greater departure:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;exdet.png&#34; alt=&#34;Visualization of exDet analysis. The green area indicates the reference climate conditions, and the red ellipse shows the climate space exDet will identify as ‘analog’.&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Visualization of exDet analysis. The green area indicates the reference
climate conditions, and the red ellipse shows the climate space exDet will
identify as ‘analog’.&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The result more closely captures the (co)variation in conditions present in
the training area.&lt;/p&gt;
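&lt;p&gt;The three steps can be sketched in base &lt;code&gt;R&lt;/code&gt; on a toy data set
before we move to real rasters. The matrices &lt;code&gt;ref&lt;/code&gt; and &lt;code&gt;new&lt;/code&gt;
and all of their values are invented. Note that &lt;code&gt;mahalanobis()&lt;/code&gt;
returns the squared distance, which changes the scale of the index but not
which cells fall above 1:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Toy training climate: hot cells are wet, cool cells are dry
ref &amp;lt;- cbind(temp = c(10, 15, 20, 25, 30, 12, 28),
             precip = c(250, 380, 620, 790, 1000, 300, 900))

## Two projection cells: one typical, one hot but dry
new &amp;lt;- rbind(typical = c(temp = 20, precip = 600),
             hotDry = c(temp = 29, precip = 260))

## 1. (squared) distance of each training cell from the
##    training centroid
mu &amp;lt;- colMeans(ref)
sigma &amp;lt;- var(ref)
refD &amp;lt;- mahalanobis(ref, mu, sigma)

## 2. rescale by the training maximum, so training cells span 0-1
refNT2 &amp;lt;- refD / max(refD)

## 3. new cells use the training centroid, covariance and maximum
newNT2 &amp;lt;- mahalanobis(new, mu, sigma) / max(refD)
newNT2&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The typical cell scores below 1, and the hot, dry cell scores well above 1,
although each of its variables is within the training range when considered
alone.&lt;/p&gt;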
&lt;/div&gt;
&lt;div id=&#34;exdet-in-r&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;exDet in R&lt;/h1&gt;
&lt;p&gt;There is an R package that calculates &lt;code&gt;exDet&lt;/code&gt;:
&lt;a href=&#34;https://github.com/luismurao/ntbox&#34;&gt;ntbox&lt;/a&gt;. However, it is still using the
&lt;code&gt;raster&lt;/code&gt;-based workflow, which is now obsolete. Luckily, calculating
Mahalanobis distances only requires a few lines of code, so we can do it
ourselves.&lt;/p&gt;
&lt;p&gt;For this example, I’ll use the &lt;em&gt;Lythrum salicaria&lt;/em&gt; data from my previous
tutorials, but updated to use the new
&lt;a href=&#34;https://rspatial.org/index.html&#34;&gt;terra&lt;/a&gt; workflow.&lt;/p&gt;
&lt;div id=&#34;data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data&lt;/h2&gt;
&lt;p&gt;I use the same GBIF records previously downloaded in my &lt;a href=&#34;https://plantarum.ca/2023/07/28/ecospat-terra&#34;&gt;Ecospat with Terra
tutorial&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(terra)
library(geodata)
library(rgbif) ## not actually necessary, used to download
               ## the occurrence data

library(usdm) ## for vif collinearity test below

load(&amp;quot;../data/2021-07-29-ls-gbif-recs.Rda&amp;quot;)
lsOccs &amp;lt;- lsGBIF$data

## convert to a spatial vector:
lsOccs &amp;lt;- vect(lsOccs, geom = c(&amp;quot;decimalLongitude&amp;quot;,
                                &amp;quot;decimalLatitude&amp;quot;),
               crs = &amp;quot;+proj=longlat +datum=WGS84&amp;quot;)

wrld &amp;lt;- world(path = &amp;quot;../data/maps/&amp;quot;)

wclim &amp;lt;- worldclim_global(var = &amp;quot;bio&amp;quot;, res = 10,
                          path = &amp;quot;../data/&amp;quot;)

## North America basemap:
nAmCountries &amp;lt;- c(&amp;quot;CAN&amp;quot;, &amp;quot;MEX&amp;quot;, &amp;quot;USA&amp;quot;)
nAm &amp;lt;- gadm(nAmCountries, level = 0,
            path = &amp;quot;../data/maps/&amp;quot;,
            resolution = 2)

## Eurasia basemap:
eurasiaCountries &amp;lt;- country_codes(&amp;quot;Asia|Europe&amp;quot;)

## remove missing countries:
eurasiaCountries &amp;lt;-
  eurasiaCountries[!eurasiaCountries$ISO3 %in%
                    c(&amp;quot;HKG&amp;quot;, &amp;quot;MAC&amp;quot;, &amp;quot;XNC&amp;quot;),] 
eur &amp;lt;- gadm(eurasiaCountries$ISO3, level = 0,
            path = &amp;quot;../data/maps/&amp;quot;,
            resolution = 2)

## subset occurrences
lsNA &amp;lt;- lsOccs[nAm]
lsEA &amp;lt;- lsOccs[eur]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;training-region&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Training Region&lt;/h2&gt;
&lt;p&gt;There are a number of considerations when deciding on the region to use to
train our model. For this example, I’ll just use a 500km buffer around
occurrences in the native range.&lt;/p&gt;
&lt;p&gt;We can now build the training region and plot it along with the occurrences:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;train &amp;lt;- buffer(lsEA, width = 500000)
train &amp;lt;- aggregate(train)

plot(wrld, mar = c(1.5, 1.5, 0.5, 0.5))
plot(train, col = &amp;#39;#00FF0080&amp;#39;, add = TRUE)
plot(lsEA, add = TRUE, cex = 0.5)
plot(lsNA, add = TRUE, cex = 0.5, col = &amp;#39;blue&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:training-region&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://plantarum.ca/2023/12/19/exdet/index_files/figure-html/training-region-1.png&#34; alt=&#34;Lythrum salicaria distribution. Green shading shows the training region, black points are occurrences in the native range, and blue points are occurrences in the invaded range.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Lythrum salicaria distribution. Green shading shows the training region, black points are occurrences in the native range, and blue points are occurrences in the invaded range.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Note that the training region extends into the ocean, but the underlying
worldclim data doesn’t, so we don’t need to worry about cleaning this up.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;mahalanobis-distance&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Mahalanobis Distance&lt;/h2&gt;
&lt;p&gt;Now we can calculate the NT2 index. We start by clipping the climate data
to our training and projection regions, and then selecting a set of
variables that aren’t highly collinear (in the training region):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainWC &amp;lt;- mask(wclim, train)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
|---------|---------|---------|---------|
=========================================
                                          &lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainVal &amp;lt;- values(trainWC)

naWC &amp;lt;- mask(wclim, nAm)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
|---------|---------|---------|---------|
=========================================
                                          &lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;naVal &amp;lt;- values(naWC)

## screen out collinear variables:
vifSel &amp;lt;- vifstep(data.frame(trainVal), th = 5,
                  size = 20000)

VARS &amp;lt;- vifSel@results$Variables&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;VARS&lt;/code&gt; contains the names of the variables that passed the collinearity screen. For a real analysis you
probably should be more deliberate in selecting which variables to retain,
in order to consider both their biological interest and statistical
properties.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trainMeans &amp;lt;- colMeans(trainVal[, VARS], na.rm = TRUE)
trainVar &amp;lt;- var(trainVal[, VARS], na.rm = TRUE)
trainMah &amp;lt;- mahalanobis(trainVal[, VARS], trainMeans,
                        trainVar, na.rm = TRUE)&lt;/code&gt;&lt;/pre&gt;
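&lt;p&gt;&lt;code&gt;trainMah&lt;/code&gt; now contains the Mahalanobis distance from each cell in the
training region to the climate centroid of the training region. ‘0’ means
the point is in the center of the climate space. By the nature of the
Mahalanobis distance, this value accounts for covariation among our climate
variables. Note that R’s &lt;code&gt;mahalanobis()&lt;/code&gt; function returns the &lt;em&gt;squared&lt;/em&gt;
distance; this changes the scale of the NT2 index, but not which cells
fall above or below 1.&lt;/p&gt;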
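&lt;p&gt;If you want to see what &lt;code&gt;mahalanobis()&lt;/code&gt; is doing, the value for a
single cell can be written out by hand and compared with the function’s
output. This is a small check on invented data (the matrix &lt;code&gt;ref&lt;/code&gt;
is not part of the analysis):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Toy climate values, invented for this check
ref &amp;lt;- cbind(temp = c(10, 15, 20, 25, 30, 12, 28),
             precip = c(250, 380, 620, 790, 1000, 300, 900))
mu &amp;lt;- colMeans(ref)
sigma &amp;lt;- var(ref)

## Squared distance for the first cell, written out as
## t(x - mu) %*% solve(sigma) %*% (x - mu)
d &amp;lt;- ref[1, ] - mu
byHand &amp;lt;- drop(t(d) %*% solve(sigma) %*% d)

## Compare with the built-in function:
all.equal(byHand, mahalanobis(ref, mu, sigma)[1])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Multiplying by the inverse of the covariance matrix is what lets the
distance account for correlations among the variables.&lt;/p&gt;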
&lt;p&gt;As a quick check, let’s take a look at the Mahalanobis distances for the
training region:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;hist(trainMah, main = &amp;quot;Mahalanobis Distances in Eurasia&amp;quot;,
     xlab = &amp;quot;Distance&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/12/19/exdet/index_files/figure-html/hist-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;That’s a bit odd. We have a small number of outliers with very high
distances. I think this is likely a consequence of idiosyncrasies in our
climate rasters: an artifact of the analysis rather than a biologically
meaningful value. To clean this up, I’ll identify values above the 95th
percentile as outliers and remove them. Then we can proceed with
calculating the distances for the invaded range.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;thresh &amp;lt;- quantile(trainMah, probs = 0.95, na.rm = TRUE)
trainMah[which(trainMah &amp;gt; thresh)] &amp;lt;- NA
hist(trainMah, main = &amp;quot;Trimmed Mahalanobis Distances in Eurasia&amp;quot;,
     xlab = &amp;quot;Distance&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/12/19/exdet/index_files/figure-html/outliers-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;That looks better. Moving on, we can now calculate NT2, which is the
Mahalanobis distance divided by the maximum Mahalanobis distance in the
training range:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;maxMah &amp;lt;- max(trainMah, na.rm = TRUE)
trainMah &amp;lt;- trainMah/maxMah
trainMahR &amp;lt;- trainWC[[1]]
values(trainMahR) &amp;lt;- trainMah

naMah &amp;lt;- mahalanobis(naVal[, VARS], 
                     trainMeans, trainVar, na.rm = TRUE)
naMah &amp;lt;- naMah/maxMah

## Create a new raster map for North America:
naMahR &amp;lt;- naWC[[1]]

## set the values to our Mahalanobis distances:
values(naMahR) &amp;lt;- naMah&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we have a raster map with the NT2 index values for each cell in the
training and invaded ranges. Values between 0 and 1 are within the training
range. We can visualize that directly:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;COLS &amp;lt;- c(&amp;quot;green&amp;quot;, &amp;quot;lightgreen&amp;quot;, &amp;quot;aquamarine&amp;quot;, &amp;quot;blue&amp;quot;, 
           &amp;quot;orange&amp;quot;, &amp;quot;brown&amp;quot;, &amp;quot;red&amp;quot;)

BREAKS &amp;lt;- c(0, 0.25, 0.5, 1, 2, 4, 8, 16)

plot(trainMahR, breaks = BREAKS, col = COLS,
     mar = c(2, 2, 1, 5))
plot(naMahR, breaks = BREAKS, col = COLS, add = TRUE,
     legend = FALSE)
plot(wrld, border = &amp;#39;lightgrey&amp;#39;, add = TRUE, lwd = 0.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/12/19/exdet/index_files/figure-html/mahPlot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This completes the main part of the &lt;code&gt;exDet&lt;/code&gt; analysis. We can see the training
range all falls within the range 0-1, which is by design. Transferring that
distribution to North America, we can see most of Canada and the north
central US falls within the same range. The high arctic, the SE US and
western US are all outside of the training range, with Mexico even more
novel. Projections into these areas should be treated with caution, or
avoided entirely.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;most-influential-covariate-mic&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Most Influential Covariate, MIC&lt;/h2&gt;
&lt;p&gt;While the map above is our primary interest in &lt;code&gt;exDet&lt;/code&gt; analysis, we may
want to know which variables are most influential in creating novel
environments. To do this, we need to calculate the distance map repeatedly,
removing one variable at a time. For each cell, we can then identify the
variable that contributes the most to its NT2 index, by calculating the
difference between the distance for all variables, and the distance with
that variable removed.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;names(naMahR) &amp;lt;- &amp;quot;all&amp;quot;

## Remove variables one at a time and recalculate distances:

for(V in VARS){
  SEL &amp;lt;- VARS[VARS != V]

  tmpMeans &amp;lt;- colMeans(trainVal[, SEL], na.rm = TRUE)
  tmpVar &amp;lt;- var(trainVal[, SEL], na.rm = TRUE)
  tmpMah &amp;lt;- mahalanobis(trainVal[, SEL], tmpMeans,
                        tmpVar, na.rm = TRUE)

  thresh &amp;lt;- quantile(tmpMah, probs = 0.95, na.rm = TRUE)
  tmpMah[which(tmpMah &amp;gt; thresh)] &amp;lt;- NA
  tmpMax &amp;lt;- max(tmpMah, na.rm = TRUE)

  resMah &amp;lt;- mahalanobis(naVal[, SEL], 
                     tmpMeans, tmpVar, na.rm = TRUE)
  resMah &amp;lt;- resMah/tmpMax
  resMahR &amp;lt;- naWC[[1]]
  
  ## calculate difference from full distance:
  values(resMahR) &amp;lt;- values(naMahR$all) - resMah

  ## add a layer to our NA raster:
  naMahR[[V]] &amp;lt;- resMahR
}

## Find the maximum difference for each cell:
naMaxMah &amp;lt;- max(naMahR[[-1]])
naTest &amp;lt;- naMahR[[-1]] == naMaxMah

## Convert TRUE/FALSE to category numbers:

for(L in seq_along(names(naTest))){
  values(naTest[[L]]) &amp;lt;- values(naTest[[L]]) * L
}

## Collapse layers into a single raster:
naCat &amp;lt;- max(naTest)

## convert to a factor
levs &amp;lt;- data.frame(vals = seq_along(VARS),
                   ## clean up variable names:
                   levels = gsub(&amp;quot;wc2.1_10m_&amp;quot;, &amp;quot;&amp;quot;, VARS))
levels(naCat) &amp;lt;- levs&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can visualize which individual variables are contributing most to
the NT2 index for each cell:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(naCat, xlim = c(-180, -40), ylim = c(10, 90),
     mar = c(2, 2, 4, 5),
     main = &amp;quot;Most Influential Covariate&amp;quot;)
plot(wrld, border = &amp;#39;lightgrey&amp;#39;, lwd = 0.5, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/12/19/exdet/index_files/figure-html/MIC-plot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Whether or not that information is useful to you will depend on your
research question, of course.&lt;/p&gt;
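&lt;p&gt;The leave-one-out logic is easier to follow on a small matrix than on
rasters. In this sketch the covariate names, the values, and the helper
function &lt;code&gt;nt2&lt;/code&gt; are all hypothetical, and the 95th percentile trimming
step is left out to keep it short:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Toy training data: three standardized, uncorrelated covariates
ref &amp;lt;- as.matrix(expand.grid(temp = c(-1, 1), precip = c(-1, 1),
                             season = c(-1, 1)))

## One projection cell, extreme only in the third covariate
new &amp;lt;- matrix(c(0, 0, 5), nrow = 1,
              dimnames = list(NULL, colnames(ref)))

## NT2 of the new cell, using only the covariates named in vars
nt2 &amp;lt;- function(vars) {
  mu &amp;lt;- colMeans(ref[, vars, drop = FALSE])
  sigma &amp;lt;- var(ref[, vars, drop = FALSE])
  refD &amp;lt;- mahalanobis(ref[, vars, drop = FALSE], mu, sigma)
  mahalanobis(new[, vars, drop = FALSE], mu, sigma) / max(refD)
}

full &amp;lt;- nt2(colnames(ref))

## Change in NT2 when each covariate is left out in turn:
drops &amp;lt;- sapply(colnames(ref),
                function(v) full - nt2(setdiff(colnames(ref), v)))

## The covariate with the largest drop is the most influential:
mic &amp;lt;- names(which.max(drops))
mic&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here the most influential covariate is &lt;code&gt;season&lt;/code&gt;, the only one in
which the new cell is extreme. The raster code above does the same
comparison for every cell at once.&lt;/p&gt;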
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-ElithEtAl_2010&#34;&gt;
&lt;p&gt;Elith, J., M. Kearney, and S. Phillips. 2010. The art of modelling range-shifting species. &lt;em&gt;Methods in Ecology and Evolution&lt;/em&gt; 1: 330–342.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-MesgaranEtAl_2014&#34;&gt;
&lt;p&gt;Mesgaran, M. B., R. D. Cousens, and B. L. Webber. 2014. Here be dragons: A tool for quantifying novelty due to covariate range and correlation change when projecting species distribution models. &lt;em&gt;Diversity and Distributions&lt;/em&gt; 20: 1147–1159.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Niche Quantification with Ecospat and Terra</title>
      <link>https://plantarum.ca/2023/07/28/ecospat-terra/</link>
      <pubDate>Fri, 28 Jul 2023 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2023/07/28/ecospat-terra/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;p&gt;This is an update of my previous &lt;a href=&#34;https://plantarum.ca/2021/07/29/ecospat&#34;&gt;ecospat
tutorial&lt;/a&gt;. Spatial analysis in R is shifting to
&lt;code&gt;terra&lt;/code&gt; and &lt;code&gt;sf&lt;/code&gt; as the primary packages, so I’ve translated my old,
&lt;code&gt;raster&lt;/code&gt;-based tutorial to the new workflow. I also took this opportunity
to clean up and extend the original tutorial.&lt;/p&gt;
&lt;p&gt;See the &lt;a href=&#34;https://rspatial.org/spatial/index.html&#34;&gt;RSpatial tutorial&lt;/a&gt; for a
more detailed introduction/overview of using &lt;code&gt;terra&lt;/code&gt; for GIS/spatial
analysis.&lt;/p&gt;
&lt;p&gt;&lt;del&gt;Note this analysis depends on the &lt;code&gt;ecospat&lt;/code&gt; package, and as of 2023-07-28
&lt;code&gt;ecospat&lt;/code&gt; doesn’t support the spatial objects produced by &lt;code&gt;terra&lt;/code&gt;. There
are a couple of work-arounds in the code below to account for this. I’ll
update the code once &lt;code&gt;ecospat&lt;/code&gt; is fully compatible with &lt;code&gt;terra&lt;/code&gt;&lt;/del&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;UPDATE 2024-05-10&lt;/strong&gt; &lt;code&gt;ecospat&lt;/code&gt; has now been updated to use the &lt;code&gt;terra&lt;/code&gt;
package for spatial analysis. The code here has been updated to reflect
this. Thank you to Lin Lin at Yunnan University for bringing this to my
attention!&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;ecospat&lt;/code&gt; package &lt;span class=&#34;citation&#34;&gt;(Cola et al. &lt;a href=&#34;#ref-ColaEtAl_2017&#34;&gt;2017&lt;/a&gt;)&lt;/span&gt; provides code to quantify and
compare the environmental and geographic niche of two species, or of the
same species in different contexts (e.g., in its native and invaded
ranges). The included vignette explains how to do such analyses.&lt;/p&gt;
&lt;p&gt;However, the vignette assumes you already have a matrix of occurrence
records, along with the climate data for each of those records. In our
work, we typically have to construct those matrices from observation data
(herbarium records, iNaturalist observations, etc) and climate rasters
&lt;span class=&#34;citation&#34;&gt;(e.g. Fick and Hijmans &lt;a href=&#34;#ref-FickHijmans_2017&#34;&gt;2017&lt;/a&gt;)&lt;/span&gt;. This short tutorial will walk through the steps
necessary to do this.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;required-packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Required Packages&lt;/h1&gt;
&lt;p&gt;In addition to &lt;code&gt;ecospat&lt;/code&gt;, we’ll use &lt;code&gt;terra&lt;/code&gt; &lt;span class=&#34;citation&#34;&gt;(Hijmans &lt;a href=&#34;#ref-Hijmans_2023&#34;&gt;2023&lt;/a&gt;)&lt;/span&gt; to download
WorldClim &lt;span class=&#34;citation&#34;&gt;(Fick and Hijmans &lt;a href=&#34;#ref-FickHijmans_2017&#34;&gt;2017&lt;/a&gt;)&lt;/span&gt; rasters, and manipulate the spatial data;
&lt;code&gt;rgbif&lt;/code&gt; &lt;span class=&#34;citation&#34;&gt;(Chamberlain et al. &lt;a href=&#34;#ref-ChamberlainEtAl_2021&#34;&gt;2021&lt;/a&gt;)&lt;/span&gt; to download GBIF records, &lt;code&gt;geodata&lt;/code&gt;
&lt;span class=&#34;citation&#34;&gt;(Hijmans et al. &lt;a href=&#34;#ref-HijmansEtAl_2023&#34;&gt;2023&lt;/a&gt;)&lt;/span&gt; to get a world basemap for plots, and &lt;code&gt;ade4&lt;/code&gt;
&lt;span class=&#34;citation&#34;&gt;(Thioulouse et al. &lt;a href=&#34;#ref-ThioulouseEtAl_2018&#34;&gt;2018&lt;/a&gt;)&lt;/span&gt; to perform the principal components analysis of the
climate data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ecospat)
library(terra)
library(rgbif)
library(geodata)
library(ade4)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Getting Data&lt;/h1&gt;
&lt;div id=&#34;gbif-occurrence-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;GBIF Occurrence Data&lt;/h2&gt;
&lt;p&gt;We’ll start by sourcing our data. For observations, let’s take a look at
Purple Loosestrife, a wetland species that is native to Europe, and
invasive in North America. For actual research work, I normally download
the files directly from GBIF, and examine them carefully to check for
errors or missing data. For this demo we’ll use the &lt;code&gt;rgbif&lt;/code&gt; package to
download the data directly into R, and we’ll assume there are no problems
with the data that need to be corrected.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lsGBIF &amp;lt;- occ_search(scientificName = &amp;quot;Lythrum salicaria&amp;quot;,
                    limit = 10000,
                    basisOfRecord = &amp;quot;Preserved_Specimen&amp;quot;,
                    hasCoordinate = TRUE,
                    fields = c(&amp;quot;decimalLatitude&amp;quot;,
                               &amp;quot;decimalLongitude&amp;quot;, &amp;quot;year&amp;quot;,
                               &amp;quot;country&amp;quot;, &amp;quot;countryCode&amp;quot;))

save(lsGBIF, file = &amp;quot;../data/2021-07-29-ls-gbif-recs.Rda&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This returned an object with 7969 records. I saved that locally, so that
I’m not making GBIF search their database every time I work on this demo.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;load(&amp;quot;../data/2021-07-29-ls-gbif-recs.Rda&amp;quot;)
lsOccs &amp;lt;- lsGBIF$data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;lsGBIF$data&lt;/code&gt; is the table with the actual records in it. That’s what we’ll
be working with. The other components of &lt;code&gt;lsGBIF&lt;/code&gt; are metadata related to
the original GBIF search. That’s useful to have, but not needed for the
rest of this example.&lt;/p&gt;
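&lt;p&gt;Before going further, it can be worth a quick look at what came back. The
following is only a minimal sketch using base R functions; it assumes the
&lt;code&gt;year&lt;/code&gt; and &lt;code&gt;countryCode&lt;/code&gt; fields requested in the search above are
present in the table:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## number of records returned:
nrow(lsOccs)

## range of collection years (and how many are missing):
summary(lsOccs$year)

## countries contributing the most records:
head(sort(table(lsOccs$countryCode), decreasing = TRUE))&lt;/code&gt;&lt;/pre&gt;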
&lt;p&gt;Next, we tell R which columns are the coordinates, which allows us to map
the observations. This also converts our observation matrix to a
&lt;code&gt;SpatVector&lt;/code&gt; object.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;crs&lt;/code&gt; argument here tells R that our points are in lat/lon
(unprojected) coordinates. If all of your data use lat/lon, you don’t need
to specify this, but it’s important if you need to reproject your data, or
combine data with different projections.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lsOccs &amp;lt;- vect(lsOccs, geom = c(&amp;quot;decimalLongitude&amp;quot;,
                                &amp;quot;decimalLatitude&amp;quot;),
               crs = &amp;quot;+proj=longlat +datum=WGS84&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;basemap&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Basemap&lt;/h2&gt;
&lt;p&gt;We’ll also need a world map to use in our plots. The &lt;code&gt;world&lt;/code&gt; function from
the &lt;code&gt;geodata&lt;/code&gt; package will download one for us. The first time you call
this function in a directory, it downloads the data from the internet,
and saves it locally according to the &lt;code&gt;path&lt;/code&gt; argument. Subsequent calls
will load your local copy of the data, to speed things up.&lt;/p&gt;
&lt;p&gt;In this case I’ve set the &lt;code&gt;path&lt;/code&gt; argument to store the downloaded files in
a location that is convenient for me. You can set this to anything you
like, or leave it out to use the current R working directory.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wrld &amp;lt;- world(path = &amp;quot;../data/maps/&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a relatively low resolution map of the continents. For higher
resolution maps, see the &lt;code&gt;gadm&lt;/code&gt; function.&lt;/p&gt;
&lt;p&gt;Now we can plot our data. Note that we set the margins (via &lt;code&gt;mar&lt;/code&gt;) &lt;em&gt;inside&lt;/em&gt;
the &lt;code&gt;plot&lt;/code&gt; call, not via the &lt;code&gt;par&lt;/code&gt; function used in most base R plots.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(wrld, border = &amp;quot;gray80&amp;quot;, mar = c(0, 0, 0, 0))
points(lsOccs, col = 2, cex = 0.3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/07/28/ecospat-terra/index_files/figure-html/base-plot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;climate-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Climate Data&lt;/h2&gt;
&lt;p&gt;To get our climate data, we can use geodata’s &lt;code&gt;worldclim_*&lt;/code&gt; functions.
I’m using the coarsest resolution (10 minutes) to speed things up for this
demonstration. The &lt;code&gt;path&lt;/code&gt; argument works the same way as for &lt;code&gt;world&lt;/code&gt;,
storing a local copy of the files. In this instance we’ll use
&lt;code&gt;worldclim_global&lt;/code&gt; to get the bioclim variables for the world all at once:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wclim &amp;lt;- worldclim_global(var = &amp;quot;bio&amp;quot;, res = 10,
                          path = &amp;quot;../data/&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can take a look at one layer:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(wclim$wc2.1_10m_bio_1, main = &amp;quot;bio1&amp;quot;,
     mar = c(0.1, 0.1, 2, 4), legend = TRUE,
     axes = FALSE)
par(mar = c(0.1, 0.1, 2, 4))
box()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/07/28/ecospat-terra/index_files/figure-html/climate-plot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;sampling-bias&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Sampling Bias&lt;/h2&gt;
&lt;p&gt;One of the challenges we deal with in herbarium data is that observations
tend to be clustered together in non-random ways. Sites near universities,
museums, or in well-used field stations will have more records than more
remote locations. We can’t completely account for this bias, but we can
reduce it by spatial thinning. This is the process of randomly selecting a
small number (often just one) of records for each grid cell in our
analysis. The result is that those highly-sampled locations will be
represented by a single record, meaning they will have a less exaggerated
influence on the results.&lt;/p&gt;
&lt;p&gt;We will need the full set of records below, so I’ll make a copy of the
un-thinned data named &lt;code&gt;lsOccsAll&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lsOccsAll &amp;lt;- lsOccs ## keep this for later

## select one record per cell:
lsOccs &amp;lt;- spatSample(lsOccs, size = 1, strata = wclim)&lt;/code&gt;&lt;/pre&gt;
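&lt;p&gt;To see how much the thinning changed the dataset, we can compare the
number of records before and after. This is just a quick check, not part of
the analysis itself:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## records before and after thinning:
nrow(lsOccsAll)
nrow(lsOccs)

## proportion of the original records that were kept:
nrow(lsOccs) / nrow(lsOccsAll)&lt;/code&gt;&lt;/pre&gt;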
&lt;/div&gt;
&lt;div id=&#34;linking-climate-data-to-species-observations&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Linking Climate Data to Species Observations&lt;/h2&gt;
&lt;p&gt;Next, we need to extract the environmental values from the climate rasters
for each of our observation records:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lsOccs &amp;lt;- cbind(lsOccs, extract(wclim, lsOccs))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the process of extracting &lt;code&gt;wclim&lt;/code&gt; values for our observations, we
usually end up with a few missing values. This is a consequence of
mismatches between the observation coordinates and the climate rasters. In
some cases, the observations are placed off the coast in the ocean, or in
another area where there is no climate data available. We need to exclude
these missing values from our analysis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lsOccs &amp;lt;- lsOccs[complete.cases(data.frame(lsOccs)), ]&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;splitting-records-into-native-and-introduced-ranges&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Splitting Records into Native and Introduced Ranges&lt;/h1&gt;
&lt;p&gt;At this point, all the data we need for the Niche Quantification analysis
is in &lt;code&gt;lsOccs&lt;/code&gt; and &lt;code&gt;wclim&lt;/code&gt;. We need to split this data into native and
invasive regions for our comparison. We’ll restrict ourselves to the
northern hemisphere, and consider all records from Eurasia as native, and
all records from North America as invasive.&lt;/p&gt;
&lt;p&gt;We can use GADM to download maps of the continents, and use this to select
subsets of the datasets. I will select the country-level borders (&lt;code&gt;level = 0&lt;/code&gt;) and low resolution (&lt;code&gt;resolution = 2&lt;/code&gt;). For ‘real’ analyses you should
use high resolution (&lt;code&gt;resolution = 1&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;To speed up analyses further down, I’m just going to download Canada, USA,
and Mexico for my “North America” map.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nAmCountries &amp;lt;- c(&amp;quot;CAN&amp;quot;, &amp;quot;MEX&amp;quot;, &amp;quot;USA&amp;quot;)
nAm &amp;lt;- gadm(nAmCountries, level = 0,
            path = &amp;quot;../data/maps/&amp;quot;,
            resolution = 2)

## select Europe OR Asia:
eurasiaCountries &amp;lt;- country_codes(&amp;quot;Asia|Europe&amp;quot;)
eur &amp;lt;- gadm(eurasiaCountries$ISO3, level = 0,
            path = &amp;quot;../data/maps/&amp;quot;,
            resolution = 2)

lsNA &amp;lt;- lsOccs[nAm]
lsEA &amp;lt;- lsOccs[eur]

plot(wrld, axes = FALSE, xlim = c(-140, 150),
     ylim = c(10, 80), mar = c(0, 0, 0, 0))
points(lsNA, col = &amp;#39;red&amp;#39;, cex = 0.5)
points(lsEA, col = &amp;#39;darkgreen&amp;#39;, cex = 0.5)
par(mar = c(0,0,0,0))
box()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/07/28/ecospat-terra/index_files/figure-html/splitting-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Note that this code will generate some warnings, “this file does not
exist”. There are a few countries listed in the &lt;code&gt;country_codes&lt;/code&gt; database
that don’t have corresponding maps in the GADM repository (e.g. Hong Kong).
We’ll ignore this for now.&lt;/p&gt;
&lt;p&gt;For the Niche Quantification, we need a matrix of the background
environment present in each of the native and invasive ranges, as well as
the global environment covering the combined extent of the native and
introduced ranges. After masking the rasters, we use &lt;code&gt;values&lt;/code&gt; to
convert each one to a matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Crop Climate Layers:
naEnvR &amp;lt;- mask(wclim, nAm)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;eaEnvR &amp;lt;- mask(wclim, eur)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;globalEnvR &amp;lt;- mask(wclim, rbind(nAm, eur))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Extract values to matrix:
naEnvM &amp;lt;- values(naEnvR)
eaEnvM &amp;lt;- values(eaEnvR)
globalEnvM &amp;lt;- values(globalEnvR)

## Clean out missing values:
naEnvM &amp;lt;- naEnvM[complete.cases(naEnvM), ]
eaEnvM &amp;lt;- eaEnvM[complete.cases(eaEnvM), ]
globalEnvM &amp;lt;- globalEnvM[complete.cases(globalEnvM), ]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that for the geographic projection below, it is essential that the
&lt;code&gt;globalEnvM&lt;/code&gt; be constructed directly from the &lt;code&gt;globalEnvR&lt;/code&gt; rasters. If you
try to combine &lt;code&gt;naEnvM&lt;/code&gt; and &lt;code&gt;eaEnvM&lt;/code&gt; to make &lt;code&gt;globalEnvM&lt;/code&gt; it will end up
scrambling the values in the geographic projection. If you don’t do a
geographic projection it doesn’t matter.&lt;/p&gt;
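&lt;p&gt;Since the PCA below relies on all three matrices having the same
variables in the same order, a quick consistency check may be worthwhile.
This is a sketch using only base R; it stops with an error if the column
names differ:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## all three matrices should share the same bioclim columns:
stopifnot(identical(colnames(naEnvM), colnames(globalEnvM)),
          identical(colnames(eaEnvM), colnames(globalEnvM)))

## rows are the grid cells with complete climate data:
dim(naEnvM)
dim(eaEnvM)
dim(globalEnvM)&lt;/code&gt;&lt;/pre&gt;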
&lt;/div&gt;
&lt;div id=&#34;niche-quantification&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Niche Quantification&lt;/h1&gt;
&lt;div id=&#34;pca&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;PCA&lt;/h2&gt;
&lt;p&gt;The Niche Quantification analysis starts with a Principal Components
Analysis of the environmental data. The actual ordination uses the global
data, with the observation records and the native and invasive background
environment treated as supplemental rows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca.clim &amp;lt;- dudi.pca(globalEnvM, center = TRUE,
                    scale = TRUE, scannf = FALSE, nf = 2)
global.scores &amp;lt;- pca.clim$li

nativeLS.scores &amp;lt;-
  suprow(pca.clim,
         data.frame(lsEA)[, colnames(globalEnvM)])$li   
invasiveLS.scores &amp;lt;-
  suprow(pca.clim,
         data.frame(lsNA)[, colnames(globalEnvM)])$li

## native range = Eurasia, invasive range = North America:
nativeEnv.scores &amp;lt;- suprow(pca.clim, eaEnvM)$li
invasiveEnv.scores &amp;lt;- suprow(pca.clim, naEnvM)$li&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s break that down. &lt;code&gt;dudi.pca&lt;/code&gt; runs a PCA on &lt;code&gt;globalEnvM&lt;/code&gt;,
which is a matrix of all the environmental variables over the entire study
area. We use that to create a two-dimensional summary of the total
environmental variability.&lt;/p&gt;
&lt;p&gt;Next, we map our observation data (&lt;code&gt;lsEA&lt;/code&gt; and &lt;code&gt;lsNA&lt;/code&gt;) into that
2-dimensional ordination, using the &lt;code&gt;suprow&lt;/code&gt; function. &lt;code&gt;lsEA&lt;/code&gt; and &lt;code&gt;lsNA&lt;/code&gt;
are &lt;code&gt;SpatVector&lt;/code&gt; objects. Sometimes you can treat them as if
they were data.frames, but other times you need to explicitly convert them.
This is one of those times, hence I’ve wrapped them in &lt;code&gt;data.frame()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Recall that &lt;code&gt;lsEA&lt;/code&gt; and &lt;code&gt;lsNA&lt;/code&gt; have more columns than the environmental
matrix: they also include &lt;code&gt;year&lt;/code&gt;, &lt;code&gt;countryCode&lt;/code&gt;, &lt;code&gt;country&lt;/code&gt;. We only want to
include the environmental variables when we project the observations into
the ordination. To make sure that we use the same variables as in the
original ordination of &lt;code&gt;globalEnvM&lt;/code&gt;, in the same order, I select the
columns explicitly to match that object:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data.frame(lsEA)[, colnames(globalEnvM)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output of &lt;code&gt;dudi.pca&lt;/code&gt; and &lt;code&gt;suprow&lt;/code&gt; includes a lot of information that we
aren’t using here. We only need the &lt;code&gt;li&lt;/code&gt; element, so I’ve selected that
from each of the function outputs.&lt;/p&gt;
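&lt;p&gt;A two-axis summary is only useful if those axes capture a reasonable
share of the climate variation, so it may be worth checking. As a rough
sketch, the eigenvalues stored in the &lt;code&gt;eig&lt;/code&gt; element of the &lt;code&gt;dudi.pca&lt;/code&gt;
output give the proportion of variance on each axis, and the &lt;code&gt;co&lt;/code&gt;
element holds the column coordinates of each climate variable:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## proportion of total variance captured by each axis:
propVar &amp;lt;- pca.clim$eig / sum(pca.clim$eig)
round(100 * propVar[1:2], 1)

## coordinates of each climate variable on the two retained axes:
pca.clim$co&lt;/code&gt;&lt;/pre&gt;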
&lt;/div&gt;
&lt;div id=&#34;occurrence-density-grids&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Occurrence Density Grids&lt;/h2&gt;
&lt;p&gt;Finally we’re ready to do the Niche Quantification/Comparisons. We’ll use
the PCA scores for the global environment, the native and invasive
environments, and the native and invasive occurrence records.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nativeGrid &amp;lt;- ecospat.grid.clim.dyn(global.scores,
                                   nativeEnv.scores,
                                   nativeLS.scores)

invasiveGrid &amp;lt;- ecospat.grid.clim.dyn(global.scores,
                                   invasiveEnv.scores, 
                                   invasiveLS.scores)

ecospat.plot.niche.dyn(nativeGrid, invasiveGrid,
                       quant = 0.05) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/07/28/ecospat-terra/index_files/figure-html/grid.clim.dyn-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The resulting plot shows us the environmental conditions present in Eurasia
(inside the green line) and North America (inside the red line). The green
area represents environments occupied by &lt;em&gt;Lythrum salicaria&lt;/em&gt; in Eurasia,
but not in North America, the red area shows environments occupied in North
America and not Eurasia, and the blue area shows environments occupied in
both ranges. We can also see that there are a few areas in Eurasia with
environments not present in North America, and vice versa. However, for the
most part, &lt;em&gt;Lythrum salicaria&lt;/em&gt; doesn’t occur in these environments.&lt;/p&gt;
&lt;p&gt;To get the index values we use &lt;code&gt;ecospat.niche.dyn.index&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;indexVals &amp;lt;- ecospat.niche.dyn.index(nativeGrid,
                                     invasiveGrid) 
indexVals$dynamic.index.w&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##  expansion  stability  unfilling 
## 0.00598794 0.99401206 0.02349507&lt;/code&gt;&lt;/pre&gt;
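&lt;p&gt;Expansion and stability are expressed as proportions of the invasive
niche, and unfilling as a proportion of the native niche. If you also want
a single overlap statistic, &lt;code&gt;ecospat&lt;/code&gt; provides &lt;code&gt;ecospat.niche.overlap&lt;/code&gt;,
which returns Schoener’s D and the I statistic. The following is a sketch
only; it assumes the &lt;code&gt;cor&lt;/code&gt; argument behaves as documented in your
installed version of the package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## niche overlap between the two grids; cor = TRUE corrects the
## occurrence densities by the availability of each environment:
overlapVals &amp;lt;- ecospat.niche.overlap(nativeGrid, invasiveGrid,
                                     cor = TRUE)
overlapVals$D
overlapVals$I&lt;/code&gt;&lt;/pre&gt;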
&lt;/div&gt;
&lt;div id=&#34;projecting-climate-space-to-geographic-space&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Projecting Climate Space to Geographic Space&lt;/h2&gt;
&lt;p&gt;We can take these results and project them onto a geographic map. This will
show us which areas of the native and invasive range of &lt;em&gt;Lythrum salicaria&lt;/em&gt;
fall into each of the three categories.&lt;/p&gt;
&lt;p&gt;Note that, since the update mentioned in the introduction, the function
&lt;code&gt;ecospat.niche.dynIndexProjGeo&lt;/code&gt; accepts the &lt;code&gt;SpatRaster&lt;/code&gt; objects
created by the &lt;code&gt;terra&lt;/code&gt; package, so we can pass &lt;code&gt;globalEnvR&lt;/code&gt; to it
directly, without converting it to a &lt;code&gt;raster&lt;/code&gt; stack first.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;geoProj &amp;lt;-
  ecospat.niche.dynIndexProjGeo(nativeGrid,
                                invasiveGrid,
                                env = globalEnvR)

plot(geoProj, legend = FALSE, 
     col = c(&amp;quot;grey&amp;quot;, &amp;quot;green&amp;quot;, &amp;quot;red&amp;quot;, &amp;quot;blue&amp;quot;),
     ylim = c(20, 80), axes = FALSE, mar = c(0, 0, 0, 0))
points(lsEA, col = &amp;#39;darkgreen&amp;#39;,
       cex = 0.25, pch = 23, bg = &amp;quot;white&amp;quot;)
points(lsNA, col = &amp;#39;darkred&amp;#39;,
       cex = 0.25, pch = 23, bg = &amp;quot;white&amp;quot;)
par(mar = c(0,0,0,0))
box()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/07/28/ecospat-terra/index_files/figure-html/geographic_projection-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;geographic-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Geographic Comparisons&lt;/h1&gt;
&lt;p&gt;You can also apply this analysis to geographic locations, instead of
environmental conditions. This won’t make much sense for native vs invaded
range comparisons, but it could be useful for comparing different species
within the same area.&lt;/p&gt;
&lt;p&gt;To demonstrate, let’s compare the distribution of &lt;em&gt;Lythrum salicaria&lt;/em&gt; in
North America before and after 1950. In this case, I need to go back to the
original occurrence data, since I need to thin the records before and
after 1950 separately. Otherwise, we won’t accurately capture the locations
where &lt;em&gt;Lythrum salicaria&lt;/em&gt; was collected in both time periods.&lt;/p&gt;
&lt;p&gt;We use geographic coordinates here, so no need for a PCA. We do need to
generate the ‘background’ coordinates. I’ll use &lt;code&gt;expand.grid&lt;/code&gt; to create the
locations for this. I’ve broken up the NA extent into a 500 x 500 grid.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lsNA_All &amp;lt;- lsOccsAll[nAm]

## subset on the year column of the un-thinned records:
lsNAearly &amp;lt;- subset(lsNA_All, lsNA_All$year &amp;lt;= 1950)
## thin records:
lsNAearly &amp;lt;- spatSample(lsNAearly, size = 1, strata = wclim)

lsNAlate &amp;lt;- subset(lsNA_All, lsNA_All$year &amp;gt; 1950)
## thin records:
lsNAlate &amp;lt;- spatSample(lsNAlate, size = 1, strata = wclim)

geoGrid &amp;lt;- expand.grid(longitude =
                        seq(-160, -40, length.out = 500),
                      latitude =
                        seq(20, 90, length.out = 500))

earlyGeoGrid &amp;lt;- ecospat.grid.clim.dyn(geoGrid, geoGrid,
                                     crds(lsNAearly))

lateGeoGrid &amp;lt;- ecospat.grid.clim.dyn(geoGrid, geoGrid,
                                    crds(lsNAlate))

ecospat.plot.niche.dyn(earlyGeoGrid, lateGeoGrid, quant = 0)
plot(nAm, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/07/28/ecospat-terra/index_files/figure-html/temporal-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This looks pretty good. However, &lt;code&gt;ecospat&lt;/code&gt; uses a kernel density formula to
model the occurrence distributions. As a consequence, it projects out into
the ocean, which isn’t very realistic. To correct this, we need to mask the
analysis to the continental land mass. This requires we have a vector map
of the desired area.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;earlyGeoGrid &amp;lt;- ecospat.grid.clim.dyn(geoGrid, geoGrid,
                                     crds(lsNAearly),
                                     geomask = nAm)

lateGeoGrid &amp;lt;- ecospat.grid.clim.dyn(geoGrid, geoGrid,
                                    crds(lsNAlate),
                                    geomask = nAm)

ecospat.plot.niche.dyn(earlyGeoGrid, lateGeoGrid, quant = 0)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/07/28/ecospat-terra/index_files/figure-html/masked-geography-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;That gives more reasonable results. The blue area is the range occupied by
&lt;em&gt;Lythrum salicaria&lt;/em&gt; prior to 1950, and the red area is the range it expanded
into after that year. The small green areas are regions where it hasn’t
been collected since 1950.&lt;/p&gt;
&lt;p&gt;Note that this visualization is weighted by the density of points, so there
might be a few pre-1950 records in the red area, or a few post-1950 records
in the green area. That’s expected.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;summary&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Summary&lt;/h1&gt;
&lt;p&gt;This is a fairly quick overview of this workflow. You’ll almost certainly
want to consider thinning your observations, among other data cleaning
procedures. I’ve also set the study extent very crudely. That might be
appropriate for very large scale (global) studies. But you’ll usually want
to think a bit more carefully about how you set your extent. The way you
process your data will also differ depending on your context.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-ChamberlainEtAl_2021&#34;&gt;
&lt;p&gt;Chamberlain, Scott, Vijay Barve, Dan Mcglinn, Damiano Oldoni, Peter Desmet, Laurens Geffert, and Karthik Ram. 2021. “Rgbif: Interface to the Global Biodiversity Information Facility API.” Manual. &lt;a href=&#34;https://CRAN.R-project.org/package=rgbif&#34;&gt;https://CRAN.R-project.org/package=rgbif&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-ColaEtAl_2017&#34;&gt;
&lt;p&gt;Cola, Valeria Di, Olivier Broennimann, Blaise Petitpierre, Frank T. Breiner, Manuela D’Amen, Christophe Randin, Robin Engler, et al. 2017. “Ecospat: An R Package to Support Spatial Analyses and Modeling of Species Niches and Distributions.” &lt;em&gt;Ecography&lt;/em&gt; 40 (6): 774–87. &lt;a href=&#34;https://doi.org/10.1111/ecog.02671&#34;&gt;https://doi.org/10.1111/ecog.02671&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-FickHijmans_2017&#34;&gt;
&lt;p&gt;Fick, Stephen E., and Robert J. Hijmans. 2017. “WorldClim 2: New 1-Km Spatial Resolution Climate Surfaces for Global Land Areas.” &lt;em&gt;International Journal of Climatology&lt;/em&gt; 37 (12): 4302–15. &lt;a href=&#34;https://doi.org/10.1002/joc.5086&#34;&gt;https://doi.org/10.1002/joc.5086&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-Hijmans_2023&#34;&gt;
&lt;p&gt;Hijmans, Robert J. 2023. “Terra: Spatial Data Analysis.” Manual. &lt;a href=&#34;https://CRAN.R-project.org/package=terra&#34;&gt;https://CRAN.R-project.org/package=terra&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-HijmansEtAl_2023&#34;&gt;
&lt;p&gt;Hijmans, Robert J., Márcia Barbosa, Aniruddha Ghosh, and Alex Mandel. 2023. “Geodata: Download Geographic Data.” Manual. &lt;a href=&#34;https://CRAN.R-project.org/package=geodata&#34;&gt;https://CRAN.R-project.org/package=geodata&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-ThioulouseEtAl_2018&#34;&gt;
&lt;p&gt;Thioulouse, Jean, Stéphane Dray, Anne-Béatrice Dufour, Aurélie Siberchicot, Thibaut Jombart, and Sandrine Pavoine. 2018. &lt;em&gt;Multivariate Analysis of Ecological Data with Ade4&lt;/em&gt;. New York, NY: Springer New York. &lt;a href=&#34;https://doi.org/10.1007/978-1-4939-8850-1&#34;&gt;https://doi.org/10.1007/978-1-4939-8850-1&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Managing Absolute Paths in Reproducible Analyses</title>
      <link>https://plantarum.ca/2023/02/14/path_switching/</link>
      <pubDate>Tue, 14 Feb 2023 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2023/02/14/path_switching/</guid>
      <description>


&lt;p&gt;In a previous post on &lt;a href=&#34;https://plantarum.ca/2022/10/17/data_management&#34;&gt;reproducible analysis&lt;/a&gt;,
I explained the importance of using relative paths in your scripts, and
organizing your data in a single directory, in order to maintain
portability. You want to be able to pack up your analysis in a zip file, or
upload it as a single directory to GitHub or Dropbox, in order to share it
with colleagues, or transfer it to a new computer.&lt;/p&gt;
&lt;p&gt;This is best practice, but you may run into problems achieving this. One
challenge is dealing with large data sets that you will use in multiple
analyses. For example, we use &lt;a href=&#34;https://worldclim.org/&#34;&gt;WorldClim&lt;/a&gt; for a lot
of our distribution work. Each copy of the global 30s dataset fills nearly
10GB (compressed). That would quickly fill my laptop hard drive if I stored
a separate copy for each project.&lt;/p&gt;
&lt;p&gt;I have created a separate directory to store such datasets, so that I only
need to maintain a single copy on my computer. All analyses that use the
WorldClim data will look for it in &lt;code&gt;~/data/worldclim/&lt;/code&gt; on my laptop. On my
workstation, I store the same data in &lt;code&gt;~/data/enm/worldclim&lt;/code&gt;. I could
simplify this by using the same absolute path on both machines, but that
wouldn’t help anyone else trying to use my script on their own machine.&lt;/p&gt;
&lt;p&gt;The approach I’ve come up with for managing this requires a few lines of
code to set the paths appropriately:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;switch(system2(&amp;quot;hostname&amp;quot;, stdout = TRUE),
       LAPTOP.HOSTNAME =  ## my laptop:
         {worldClimPath &amp;lt;- &amp;quot;~/data/worldclim/&amp;quot;}, 
       WORKSTATION.HOSTNAME =   ## my workstation:
         {worldClimPath &amp;lt;- &amp;quot;~/data/enm/worldclim/&amp;quot;},
       {worldClimPath &amp;lt;- &amp;quot;dl/worldclim/&amp;quot;}) ## default&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;First, I check for the machine name with the call &lt;code&gt;system2(&#34;hostname&#34;, stdout = TRUE)&lt;/code&gt;. This runs the &lt;code&gt;hostname&lt;/code&gt; command on the underlying
operating system (which should work on Linux, Windows, and Mac). &lt;code&gt;hostname&lt;/code&gt;
returns the network hostname for your computer, which should be unique (at
least within your organization). The &lt;code&gt;switch&lt;/code&gt; function then compares this
value to the names for my different machines, which I’ve already looked up.
I can then use that information to set the correct path for my shared data.&lt;/p&gt;
&lt;p&gt;In the case that I’m not on either of my machines, I set a default,
relative path. That will allow other people to use my script without using
my hard-coded paths.&lt;/p&gt;
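&lt;p&gt;If you’d rather not call out to the operating system, base R’s
&lt;code&gt;Sys.info()&lt;/code&gt; also reports the machine name, which usually matches what
&lt;code&gt;hostname&lt;/code&gt; returns. The following is a minimal sketch of the same idea, not
a replacement for the code above. It reuses my placeholder names
&lt;code&gt;LAPTOP.HOSTNAME&lt;/code&gt; and &lt;code&gt;WORKSTATION.HOSTNAME&lt;/code&gt;, which you would swap
for your own machine names (put them in quotes if they contain hyphens):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Sys.info() is base R; &amp;quot;nodename&amp;quot; holds the machine name
host &amp;lt;- Sys.info()[[&amp;quot;nodename&amp;quot;]]

## switch() returns the value of the matching branch, so we can
## assign the result directly. The unnamed last value is the default.
worldClimPath &amp;lt;- switch(host,
                        LAPTOP.HOSTNAME = &amp;quot;~/data/worldclim/&amp;quot;,
                        WORKSTATION.HOSTNAME = &amp;quot;~/data/enm/worldclim/&amp;quot;,
                        &amp;quot;dl/worldclim/&amp;quot;)
worldClimPath&lt;/code&gt;&lt;/pre&gt;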
&lt;p&gt;In the case of WorldClim, the &lt;code&gt;geodata&lt;/code&gt; package provides a convenient way
to download the rasters:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(geodata)
bio &amp;lt;- worldclim_global(&amp;quot;bio&amp;quot;, res = 10,
                        path = worldClimPath) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The function &lt;code&gt;worldclim_global&lt;/code&gt; will check the &lt;code&gt;path&lt;/code&gt; argument. If it finds
the requested data there, it loads the local copy. If the data isn’t there,
it downloads a new copy from the internet, and stores it there.&lt;/p&gt;
&lt;p&gt;This makes for a convenient solution: running on my computers, all of my
analyses will use the same shared data, and I won’t have to wait for
downloads or exhaust my hard drives. But I can also share my code as-is
with collaborators, and it will &lt;em&gt;just work&lt;/em&gt;, without their having to change
any paths.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Simple Maps in R with Terra</title>
      <link>https://plantarum.ca/2023/02/13/terra-maps/</link>
      <pubDate>Mon, 13 Feb 2023 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2023/02/13/terra-maps/</guid>
      <description>


&lt;div id=&#34;reference&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Reference&lt;/h1&gt;
&lt;p&gt;This is an update of my previous &lt;a href=&#34;https://plantarum.ca/2020/10/30/simple-maps-r&#34;&gt;mapping
tutorial&lt;/a&gt;. Spatial analysis in R is shifting to
&lt;code&gt;terra&lt;/code&gt; and &lt;code&gt;sf&lt;/code&gt; as the primary packages, so I’ve translated my old,
&lt;code&gt;raster&lt;/code&gt;-based tutorial to the new workflow. See the &lt;a href=&#34;https://rspatial.org/spatial/index.html&#34;&gt;RSpatial
tutorial&lt;/a&gt; for a more detailed
introduction/overview of using &lt;code&gt;terra&lt;/code&gt; for GIS/spatial analysis.&lt;/p&gt;
&lt;p&gt;The following tutorial walks through some common plotting tasks I use for
distribution models.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;basemaps&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Basemaps&lt;/h1&gt;
&lt;p&gt;The &lt;code&gt;geodata&lt;/code&gt; package provides several convenient functions for downloading
raster and vector maps for use as basemaps and in spatial analysis. The first
time you use these functions, they will download the requested maps from
the internet and save them in your working directory, or in a
location specified with the &lt;code&gt;path&lt;/code&gt; argument. The next time you request the
same data, if the functions find them in the local directory (or the specified
&lt;code&gt;path&lt;/code&gt;), they will be loaded from there, saving the time and bandwidth
necessary to download them.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(geodata)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loading required package: terra&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## terra 1.7.29&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;us &amp;lt;- gadm(country = &amp;quot;USA&amp;quot;, level = 1, resolution = 2,
             path = &amp;quot;../data/maps/&amp;quot;)
canada &amp;lt;- gadm(country = &amp;quot;CAN&amp;quot;, level = 1, resolution = 2,
               path = &amp;quot;../data/maps&amp;quot;)
mexico &amp;lt;- gadm(country = &amp;quot;MEX&amp;quot;, level = 1, resolution = 2,
               path = &amp;quot;../data/maps&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Countries are specified by their ISO code, which you can find by calling
the function &lt;code&gt;country_codes()&lt;/code&gt;. By default, &lt;code&gt;country_codes()&lt;/code&gt; returns a
table of countries and their various ISO codes, as well as the
continents they are in. The &lt;code&gt;query&lt;/code&gt; argument lets you filter this table on
any value. For example, you can get a table of North American countries
with:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;country_codes(&amp;quot;North America&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;or, just their ISO codes with:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;country_codes(&amp;quot;North America&amp;quot;)$ISO3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The other arguments allow you to select national, subnational, or
subprovincial borders (&lt;code&gt;level&lt;/code&gt; 0-2), the &lt;code&gt;resolution&lt;/code&gt; (high: 1, low: 2),
and the &lt;code&gt;path&lt;/code&gt; to the directory where you want the maps stored.&lt;/p&gt;
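&lt;p&gt;As a rough illustration of the &lt;code&gt;level&lt;/code&gt; argument, this sketch requests
Canada at two levels and compares them. It needs an internet connection, and it
saves the files to a temporary directory (&lt;code&gt;tempdir()&lt;/code&gt;), which is my choice
for the example only; in a real project you would use your own &lt;code&gt;path&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(geodata)

## the national outline:
can0 &amp;lt;- gadm(country = &amp;quot;CAN&amp;quot;, level = 0, resolution = 2,
             path = tempdir())
## provinces and territories:
can1 &amp;lt;- gadm(country = &amp;quot;CAN&amp;quot;, level = 1, resolution = 2,
             path = tempdir())

nrow(can0)         ## number of polygons (rows) at the national level
nrow(can1)         ## number of polygons (rows) at the provincial level
head(can1$NAME_1)  ## names of the provinces and territories&lt;/code&gt;&lt;/pre&gt;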
&lt;p&gt;These maps can be plotted directly with the &lt;code&gt;plot&lt;/code&gt; command. If you want to
combine them, use the &lt;code&gt;add = TRUE&lt;/code&gt; argument to the second &lt;code&gt;plot&lt;/code&gt; call:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(us, lwd = 2)
plot(canada, add = TRUE, col = &amp;#39;red&amp;#39;)
plot(mexico, add = TRUE, border = &amp;quot;green&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/02/13/terra-maps/index_files/figure-html/map_plots-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;code&gt;col&lt;/code&gt; sets the fill colour; &lt;code&gt;border&lt;/code&gt; sets the outline colour.&lt;/p&gt;
&lt;p&gt;You can also request multiple countries as a single set of polygons, by
passing a character vector to the &lt;code&gt;country&lt;/code&gt; argument.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;NorthAmerica &amp;lt;- gadm(country = country_codes(&amp;quot;North America&amp;quot;)$ISO3,
                     level = 0, resolution = 2,
                     path = &amp;quot;../data/maps/&amp;quot;)

plot(NorthAmerica, xlim = c(-180, -50))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/02/13/terra-maps/index_files/figure-html/combining_vectors-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Note that &lt;code&gt;xlim&lt;/code&gt; and &lt;code&gt;ylim&lt;/code&gt; work as you would expect for plotting.&lt;/p&gt;
&lt;p&gt;If you want to combine multiple polygons into a single object, use &lt;code&gt;rbind&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;CanUS &amp;lt;- rbind(us, canada)
plot(CanUS, xlim = c(-180, -50))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/02/13/terra-maps/index_files/figure-html/rbind_polygons-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;These maps are ‘unprojected’, meaning they are plotted in
latitude/longitude degrees. That makes it easy to set the plot boundaries:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(CanUS, xlim = c(-100, -50), ylim = c(30, 60))
plot(NorthAmerica, lwd = 2, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/02/13/terra-maps/index_files/figure-html/zooming%20a%20map-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;NB:&lt;/strong&gt; The size of your plot canvas is fixed, but a map can’t stretch. The
x and y dimensions have to maintain the same aspect ratio. That means zooming in
one dimension (e.g., latitude only) won’t necessarily change the zoom of
your map, if the other dimension fills the canvas. You’ll have to play
around with the plot size, and both x and y dimensions together, to tweak
your zoom.&lt;/p&gt;
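&lt;p&gt;Another option is to cut the map down to the area of interest before plotting,
using the &lt;code&gt;ext&lt;/code&gt; and &lt;code&gt;crop&lt;/code&gt; functions from &lt;code&gt;terra&lt;/code&gt;. The aspect ratio
is still preserved, but the default plot limits then follow the cropped area. This
is only a sketch, with a made-up rectangle standing in for a real basemap:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(terra)

## A made-up rectangle standing in for a basemap:
box &amp;lt;- vect(&amp;quot;POLYGON ((-120 20, -40 20, -40 70, -120 70, -120 20))&amp;quot;)
crs(box) &amp;lt;- &amp;quot;+proj=longlat +datum=WGS84&amp;quot;

## ext() takes xmin, xmax, ymin, ymax:
zoom &amp;lt;- ext(-100, -50, 30, 60)
boxCrop &amp;lt;- crop(box, zoom)

ext(boxCrop)   ## the extent of the cropped polygon
plot(boxCrop)&lt;/code&gt;&lt;/pre&gt;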
&lt;p&gt;It’s handy to have a shapefile of the Great Lakes, for making prettier
maps. I created this one in QGIS and use it for plotting:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;greatlakes &amp;lt;- vect(&amp;quot;../data/maps/greatlakes.shp&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Adding Data&lt;/h1&gt;
&lt;div id=&#34;vectors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Vectors&lt;/h2&gt;
&lt;p&gt;You can add points to the plot like a regular scatter plot:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(scales)  ## for the alpha function below&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## 
## Attaching package: &amp;#39;scales&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## The following object is masked from &amp;#39;package:terra&amp;#39;:
## 
##     rescale&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gbif &amp;lt;- read.table(&amp;quot;../data/trich-gbif.csv&amp;quot;)
## Set the line color to gray to focus on the data points:
plot(CanUS, xlim = c(-100, -50), ylim = c(30, 60),
     border = &amp;quot;gray&amp;quot;)
points(gbif$X, gbif$Y, pch = 16,
       col = alpha(&amp;quot;green&amp;quot;, 0.2))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/02/13/terra-maps/index_files/figure-html/adding%20points-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;You can also convert your points to a spatial vector object, in which case
R will know which columns to use for plotting. This is also necessary
before we can project our data (see below).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gbif &amp;lt;- vect(gbif, geom = c(&amp;quot;X&amp;quot;, &amp;quot;Y&amp;quot;))
## plot(CanUS, xlim = c(-100, -50), ylim = c(30, 60),
##      border = &amp;quot;gray&amp;quot;)
## points(gbif, pch = 16, col = alpha(&amp;quot;green&amp;quot;, 0.2))&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;subsetting-vectors&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Subsetting vectors&lt;/h3&gt;
&lt;p&gt;You often need to select polygons that contain points, or select only those
points that occur within a particular polygon. You can do this with R’s &lt;code&gt;[&lt;/code&gt;
subsetting syntax. For instance, &lt;code&gt;&amp;lt;polygons&amp;gt;[&amp;lt;points&amp;gt;, ]&lt;/code&gt; will select
&lt;code&gt;polygons&lt;/code&gt; that contain one or more &lt;code&gt;points&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(CanUS, xlim = c(-100, -50), ylim = c(30, 60),
      border = &amp;quot;gray&amp;quot;)
## Select states and provinces where Trichophorum is
## present: 
plot(CanUS[gbif, ], col = &amp;#39;red&amp;#39;, add = TRUE)
points(gbif, pch = 21, col = &amp;#39;white&amp;#39;, bg = &amp;#39;black&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/02/13/terra-maps/index_files/figure-html/states_with_trich-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Conversely, &lt;code&gt;&amp;lt;points&amp;gt;[&amp;lt;polygons&amp;gt;, ]&lt;/code&gt; will select points that are within
polygons. In this example, we’ll use the fact that our &lt;code&gt;CanUS&lt;/code&gt; &lt;code&gt;SpatVector&lt;/code&gt;
object has a column named &lt;code&gt;NAME_1&lt;/code&gt; that holds the name of the state or
province of each polygon:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(CanUS, xlim = c(-100, -50), ylim = c(30, 60),
      border = &amp;quot;gray&amp;quot;)
plot(CanUS[gbif, ], col = &amp;#39;red&amp;#39;, add = TRUE)
## Select only points in NY state:
points(gbif[CanUS[CanUS$NAME_1 == &amp;quot;New York&amp;quot;, ], ],
       pch = 21, col = &amp;#39;white&amp;#39;, bg = &amp;#39;black&amp;#39;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/02/13/terra-maps/index_files/figure-html/NY_trich-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
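&lt;p&gt;To see how this selection behaves without downloading anything, here is a toy
example. The two squares and three points are invented for illustration; only the
subsetting syntax is the same as above:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(terra)

## Two adjacent squares, combined with rbind:
west &amp;lt;- vect(&amp;quot;POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))&amp;quot;)
east &amp;lt;- vect(&amp;quot;POLYGON ((10 0, 20 0, 20 10, 10 10, 10 0))&amp;quot;)
squares &amp;lt;- rbind(west, east)
squares$NAME_1 &amp;lt;- c(&amp;quot;West&amp;quot;, &amp;quot;East&amp;quot;)

## Two points inside the western square, one outside both squares:
pts &amp;lt;- vect(data.frame(X = c(2, 4, 30), Y = c(5, 5, 5)),
            geom = c(&amp;quot;X&amp;quot;, &amp;quot;Y&amp;quot;))

squares[pts, ]$NAME_1   ## names of the squares that contain points
nrow(pts[squares, ])    ## how many points fall inside a square&lt;/code&gt;&lt;/pre&gt;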
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;rasters&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Rasters&lt;/h2&gt;
&lt;p&gt;Similarly, you can plot rasters with &lt;code&gt;plot&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trichPreds &amp;lt;- rast(&amp;quot;../data/trichPreds.grd&amp;quot;)
plot(trichPreds, xlim = c(-100, -50), ylim = c(30, 60))
plot(CanUS, border = &amp;quot;gray&amp;quot;, lwd = 0.5, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/02/13/terra-maps/index_files/figure-html/loading%20rasters-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Cells with &lt;code&gt;NA&lt;/code&gt; values are transparent. In this case, a species
distribution model, low values are displayed in gray. This may be useful
for visualizing the extent of the model. However, it looks a bit odd, and
makes it hard to see the limits of the high-suitability areas. You can tweak
this by playing with the color ramp, but it’s also handy to ‘turn off’ the
low values entirely (for visualization, &lt;strong&gt;not&lt;/strong&gt; for analysis!):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trichPredsTrim &amp;lt;- trichPreds
trichPredsTrim[trichPredsTrim &amp;lt;
               quantile(values(trichPreds),
                        probs = 0.75, na.rm = TRUE)] &amp;lt;- NA
plot(trichPredsTrim, xlim = c(-100, -50), ylim = c(30, 60))
plot(CanUS, border = &amp;quot;grey&amp;quot;, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/02/13/terra-maps/index_files/figure-html/trimming%20predictions-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The test I used here, &lt;code&gt;trichPredsTrim &amp;lt; quantile(values(trichPreds), probs = 0.75, na.rm = TRUE)&lt;/code&gt;, identifies all cells in the lower 75% of the
suitability scores, which I then set to &lt;code&gt;NA&lt;/code&gt; to make them invisible. I
decided on 75% after experimenting with different values. In this case, 75%
drops most of the grey background (the very lowest values), without eating
into the areas that the prediction indicates are suitable.&lt;/p&gt;
&lt;p&gt;You could also use an absolute value here, but then you’d need to know the
actual distribution of the suitability scores. &lt;code&gt;quantile&lt;/code&gt; is easier to
tweak.&lt;/p&gt;
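&lt;p&gt;Here is the same trimming step on a small made-up raster, where it is easy to
check the result. I’ve used &lt;code&gt;ifel&lt;/code&gt; from &lt;code&gt;terra&lt;/code&gt; in place of the bracket
assignment above; as far as I know, both approaches give the same raster:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(terra)

## A 10 x 10 raster holding the values 1 to 100 (made-up data):
r &amp;lt;- rast(nrows = 10, ncols = 10, vals = 1:100)

## The value below which 75% of the cells fall:
cutoff &amp;lt;- quantile(values(r), probs = 0.75, na.rm = TRUE)

## Set cells below the cutoff to NA, keep the rest:
rTrim &amp;lt;- ifel(r &amp;lt; cutoff, NA, r)

sum(!is.na(values(rTrim)))  ## number of cells that survive the trim&lt;/code&gt;&lt;/pre&gt;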
&lt;p&gt;Alternatively, you can assign colours to the prediction raster based on the
value of each pixel, after splitting them into categories:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suitability &amp;lt;- extract(trichPreds, gbif, ID = FALSE)[, 1]

predCat &amp;lt;- (trichPreds &amp;gt; quantile(suitability, 0.25)) + 
  (trichPreds &amp;gt; quantile(suitability, 0.15)) +
  (trichPreds &amp;gt; quantile(suitability, 0.05)) +
  (trichPreds &amp;gt; quantile(suitability, 0.025))

predCols &amp;lt;- c(&amp;quot;white&amp;quot;, &amp;quot;grey95&amp;quot;, &amp;quot;yellow3&amp;quot;, &amp;quot;orange&amp;quot;, &amp;quot;red&amp;quot;) 

plot(predCat, xlim = c(-100, -50), colNA = &amp;#39;lightblue&amp;#39;,
     col = predCols, ylim = c(30, 60), legend = FALSE)
plot(CanUS, border = &amp;quot;grey&amp;quot;, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/02/13/terra-maps/index_files/figure-html/colour_categories-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;I find using a small number of categories makes it easier to read the
suitability map than trying to interpret the colour gradients usually used.&lt;/p&gt;
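&lt;p&gt;The sum of comparisons used to build &lt;code&gt;predCat&lt;/code&gt; may look cryptic. Each
comparison yields 1 (true) or 0 (false) for every cell, so adding four of them
counts how many thresholds a cell exceeds. A sketch with plain numbers (the scores
and thresholds here are invented):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;scores &amp;lt;- c(0.01, 0.10, 0.30, 0.50, 0.90)
thresholds &amp;lt;- c(0.05, 0.20, 0.40, 0.80)

## Same idea as predCat: add up one comparison per threshold
cats &amp;lt;- (scores &amp;gt; thresholds[4]) +
  (scores &amp;gt; thresholds[3]) +
  (scores &amp;gt; thresholds[2]) +
  (scores &amp;gt; thresholds[1])
cats  ## 0 1 2 3 4: one category per score, lowest to highest

## Each category picks one colour, as with predCols:
predCols &amp;lt;- c(&amp;quot;white&amp;quot;, &amp;quot;grey95&amp;quot;, &amp;quot;yellow3&amp;quot;, &amp;quot;orange&amp;quot;, &amp;quot;red&amp;quot;)
predCols[cats + 1]&lt;/code&gt;&lt;/pre&gt;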
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;projections&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Projections&lt;/h1&gt;
&lt;p&gt;Lat/Lon maps look a bit square; we’re more used to seeing maps projected. A
common projection for Canada is Lambert Conformal Conic. We can transform
our data to this projection to make nicer maps:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## define the projection
canlam &amp;lt;- &amp;quot;+proj=lcc +lat_1=49 +lat_2=77 +lat_0=49 +lon_0=-95 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs&amp;quot;

## project our vector data:
CanUS.lcc &amp;lt;- project(CanUS, canlam)
gl.lcc &amp;lt;- project(greatlakes, canlam)

## Now we need to set the CRS of our points:
crs(gbif) &amp;lt;- &amp;quot;+proj=longlat +datum=WGS84&amp;quot;

## Finally, we can project our points to LCC:
gbif.lcc &amp;lt;- project(gbif, canlam)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that our data needs to be in an object of class &lt;code&gt;Spat*&lt;/code&gt;, and it must
have a defined coordinate reference system (CRS) before we can project it
to a new CRS. The &lt;code&gt;crs&lt;/code&gt; function allows us to explicitly set the
projection. See &lt;a href=&#34;https://rspatial.org/spatial/6-crs.html#notation&#34;&gt;the RSpatial
tutorial&lt;/a&gt; for more
details on specifying CRS with &lt;code&gt;terra&lt;/code&gt;.&lt;/p&gt;
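&lt;p&gt;&lt;code&gt;vect&lt;/code&gt; also accepts a &lt;code&gt;crs&lt;/code&gt; argument, so you can define the CRS when
you first create your points. The sketch below does that for two made-up points,
then projects them with the same Lambert Conformal Conic definition used above.
The first point sits at the origin of that projection, so its projected
coordinates should come out close to zero:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(terra)

## Two made-up points; the first is at longitude -95, latitude 49:
pts &amp;lt;- vect(data.frame(X = c(-95, -75.7), Y = c(49, 45.4)),
            geom = c(&amp;quot;X&amp;quot;, &amp;quot;Y&amp;quot;),
            crs = &amp;quot;+proj=longlat +datum=WGS84&amp;quot;)
is.lonlat(pts)     ## are the coordinates in degrees?

canlam &amp;lt;- &amp;quot;+proj=lcc +lat_1=49 +lat_2=77 +lat_0=49 +lon_0=-95 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs&amp;quot;
ptsLcc &amp;lt;- project(pts, canlam)

is.lonlat(ptsLcc)  ## no longer in degrees after projecting
crds(ptsLcc)       ## coordinates in meters&lt;/code&gt;&lt;/pre&gt;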
&lt;p&gt;The same function works for rasters (sort of):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rasterLCC &amp;lt;- project(trichPredsTrim, canlam)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will reproject my raster from lat/lon to Lambert Conformal Conic, but
we will unavoidably lose some precision when we do this. This is fine for
visualization, but you should avoid projecting raster data used in
analysis. For more details, see &lt;a href=&#34;https://rspatial.org/spatial/6-crs.html#transforming-raster-data&#34;&gt;the terra
tutorial&lt;/a&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(rasterLCC)
plot(CanUS.lcc, border = &amp;quot;grey&amp;quot;, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/02/13/terra-maps/index_files/figure-html/plot%20projected-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The units are no longer degrees of latitude and longitude, but meters. We can
read them off the plot axes to improve the zoom:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(rasterLCC, xlim = c(0, 2500000),
     ylim = c(-1500000, -400000))
plot(CanUS.lcc, border = &amp;quot;grey&amp;quot;, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/02/13/terra-maps/index_files/figure-html/projected%20zoom-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;formatting&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Formatting&lt;/h1&gt;
&lt;p&gt;With the data plotted, we can then turn to making the map a little
prettier. Note that when working with &lt;code&gt;Spat*&lt;/code&gt; objects, we can’t set the
plot margins with &lt;code&gt;par&lt;/code&gt; as we usually do. Instead, we use the &lt;code&gt;mar&lt;/code&gt;
argument &lt;em&gt;within&lt;/em&gt; the &lt;code&gt;plot&lt;/code&gt; function:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Make a panel with two plots
par(mfrow = c(1, 2))

## store the plot limits:
my_xlims &amp;lt;- c(0, 2500000) 
my_ylims &amp;lt;- c(-1300000, -200000)

## Plot the points:
plot(CanUS.lcc, xlim = my_xlims , ylim = my_ylims,
     border = &amp;quot;grey&amp;quot;, background = &amp;quot;lightblue&amp;quot;,
     col = &amp;quot;white&amp;quot;, axes = FALSE, mar = c(0.1,0.1,0.1,0))
plot(gl.lcc, add = TRUE, border = &amp;quot;grey&amp;quot;, col = &amp;quot;lightblue&amp;quot;)
points(gbif.lcc, pch = 16, col = alpha(&amp;quot;grey30&amp;quot;, 0.2),
       cex = 0.7)
box() 

plot(CanUS.lcc, xlim = my_xlims , ylim = my_ylims,
     border = &amp;quot;grey&amp;quot;, background = &amp;quot;lightblue&amp;quot;,
     col = &amp;quot;white&amp;quot;, mar = c(0.1,0,0.1,0.1), axes = FALSE)
plot(gl.lcc, add = TRUE, border = &amp;quot;grey&amp;quot;, col = &amp;quot;lightblue&amp;quot;)
plot(rasterLCC, add = TRUE, legend = FALSE, axes = FALSE)

## plotted again to put the border lines on top:
plot(CanUS.lcc, border = &amp;quot;grey&amp;quot;, add = TRUE) 
box()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2023/02/13/terra-maps/index_files/figure-html/pretty%20plot-1.png&#34; width=&#34;696&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If you want to plot the state/provincial borders &lt;em&gt;on top&lt;/em&gt; of the raster,
you need to add those layers last. But you can’t set the background colour
of the raster layer to “lightblue” (or at least I haven’t figured that
out), so the ocean stays white. I get around that by plotting the
boundaries twice, first to set the background colour, and then to put the
state lines on top of the raster.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Data Management for Reproducible Science</title>
      <link>https://plantarum.ca/2022/10/17/data_management/</link>
      <pubDate>Mon, 17 Oct 2022 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2022/10/17/data_management/</guid>
      <description>


&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;blockquote&gt;
&lt;p&gt;Research is reproducible when others can reproduce the results of a
scientific study given only the original data, code, and documentation
&lt;span class=&#34;citation&#34;&gt;(Alston and Rick, &lt;a href=&#34;#ref-AlstonRick_2021&#34;&gt;2021&lt;/a&gt;)&lt;/span&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Benefits to the Author:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Clear and complete documentation of your work makes it easier to share,
write up and extend in future work, including responding to reviewers
and developing new projects&lt;/li&gt;
&lt;li&gt;Conscientious documentation of your work involves a great deal of
error-checking, which is reassuring to you – that you haven’t missed
anything, or mis-remembered what you did; and to your readers – that
you have conducted your work in a rigorous manner&lt;/li&gt;
&lt;li&gt;Reproducible work gets cited more, and developing a data archive creates
a new citable product from your research.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Benefits to the Community:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Increases the speed and fidelity with which we can learn and apply new
approaches.&lt;/li&gt;
&lt;li&gt;Makes it easier to avoid mistakes (through the care and attention
required to create the archive), and to detect and correct them if they
do happen (by allowing others to critically review your work)&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
&lt;div id=&#34;how-to-make-your-work-reproducible&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;How to make your work reproducible&lt;/h1&gt;
&lt;p&gt;There are two related goals in producing a reproducible analysis:
&lt;strong&gt;portability&lt;/strong&gt;, and &lt;strong&gt;reproducibility&lt;/strong&gt;. Reproducibility is perhaps
obvious, but it’s not enough that you can reproduce your analysis on your
computer. You are the only person with access to your computer. Your work
should be reproducible on anyone’s computer.&lt;/p&gt;
&lt;div id=&#34;portability&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Portability&lt;/h2&gt;
&lt;p&gt;To achieve portability, we need to know which files are needed for an
analysis, and to organize them in a way that they can be readily moved from
one computer to another. In practice, this means they’ll all be in a single
directory, and that directory will only contain files for that particular
analysis. You may have several related analyses that share a directory.
That’s ok, but take time to organize them in a sensible way.&lt;/p&gt;
&lt;p&gt;While you may have related analyses together in a directory, you don’t want
to mix unrelated files and data in this directory. That will make it harder
to keep track of what is needed and what isn’t, and will waste space on the
computers where this work is ultimately archived.&lt;/p&gt;
&lt;div id=&#34;readme.txt&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;README.txt&lt;/h3&gt;
&lt;p&gt;There are many ways you can organize your work within this directory. One
absolute requirement is a clear guide to your files, a
“Table of Contents”. This should be a ‘plain text’ file: something that
can be opened by any text editor. Your archive might be around for decades,
and you don’t know if your readers will be able to find a copy of MS Word 97
when they need to read it. We can be reasonably confident that plain text
files will be accessible for a long time to come.&lt;/p&gt;
&lt;p&gt;By convention, this file is called &lt;code&gt;README.txt&lt;/code&gt;, and some data archiving
services (e.g., &lt;a href=&#34;https://datadryad.org/&#34;&gt;Dryad&lt;/a&gt;) require that you include a
file with this name. You should probably use this name, unless you have a
good reason not to.&lt;/p&gt;
&lt;p&gt;One minor exception: if you use
&lt;a href=&#34;https://www.markdownguide.org/&#34;&gt;markdown&lt;/a&gt;, or
&lt;a href=&#34;https://rmarkdown.rstudio.com/&#34;&gt;RMarkdown&lt;/a&gt;, you can use these formats for
your &lt;code&gt;README&lt;/code&gt;. They are both plain text, and even if your audience doesn’t
use them, they can open a &lt;code&gt;README.md&lt;/code&gt; or &lt;code&gt;README.Rmd&lt;/code&gt; file in any text
editor. There are other, similar simple markup formats used in different
coding communities. As long as they are saved as plain text files they will
meet our requirements.&lt;/p&gt;
&lt;p&gt;The contents of your &lt;code&gt;README&lt;/code&gt; should describe as clearly as possible the
contents of your archive. Here is an excerpt from one of &lt;a href=&#34;https://github.com/plantarum/trich&#34;&gt;mine&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Start with trich.Rmd to read our draft manuscript. See trich-prep.Rmd for
the bulk of the code used in generating this manuscript.&lt;/p&gt;
&lt;h1 id=&#34;file-list&#34;&gt;File List:&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;trich.Rmd&lt;/strong&gt; : main manuscript file&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;plantarum.json&lt;/strong&gt; : bibliography (Zotero bibliography)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;trich-prep.Rmd&lt;/strong&gt; : The bulk of the code used in the analysis. Loaded from
trich.Rmd to regenerate the figures and tables there.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;data/&lt;/strong&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;data/ssr_raw.csv&lt;/strong&gt; : raw microsatellite data. See trich-prep.Rmd (Loading
Data) for code to load and translate this to a genind object&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;data/survey-pops.csv&lt;/strong&gt; : coordinates of sampled populations&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;data/eval.opt.2020-06-24.Rda&lt;/strong&gt; : the output of the Maxent modeling,
saved as a binary R Data object. Load into R with the load() function,
so you don’t need to repeat the lengthy Maxent analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;data/trich-gbif.csv&lt;/strong&gt; : GBIF records used in the Maxent analysis&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;data/trich_soil.csv&lt;/strong&gt; : Soil analysis for each sampled population&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;data/maps&lt;/strong&gt; : Maps (shapefiles and rasters) used in the Maxent
analysis, and for some of the manuscript plots&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
&lt;p&gt;I prefer to use &lt;code&gt;RMarkdown&lt;/code&gt; to develop my manuscripts, as this allows me to
keep the code for figures, and the resulting images, together with the text
that describes the methods and interprets the results. If you prefer to
manage your code separately from your writing that’s fine too, but you may
end up structuring your archive a little differently.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;directory-organization&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Directory Organization&lt;/h3&gt;
&lt;p&gt;If you only have a few files, you may not need to do any further
organization of your archive. I find it helpful to use subfolders to keep
things organized. Depending on the needs of a project, I use:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;data/&lt;/dt&gt;
&lt;dd&gt;unprocessed data used in the analysis
&lt;/dd&gt;
&lt;dt&gt;processed/&lt;/dt&gt;
&lt;dd&gt;intermediate data files, generated from data and not stored permanently
&lt;/dd&gt;
&lt;dt&gt;downloads/&lt;/dt&gt;
&lt;dd&gt;storage of large external datasets used in the analysis, but not
stored permanently in the archive (e.g., &lt;a href=&#34;https://worldclim.org/&#34;&gt;WorldClim Climate
Data&lt;/a&gt;); be sure to include links to the source of
any external data you are not archiving yourself!
&lt;/dd&gt;
&lt;dt&gt;plots/&lt;/dt&gt;
&lt;dd&gt;images generated by the analysis
&lt;/dd&gt;
&lt;dt&gt;code/&lt;/dt&gt;
&lt;dd&gt;code used in the analysis, if it’s not in the top-level directory
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;These are not hard rules. You can use whatever structure suits your
project. Just be sure to explain it in your &lt;code&gt;README&lt;/code&gt;.&lt;/p&gt;
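&lt;p&gt;If you want to set up this skeleton from R, something like the following
should do it. This is only a sketch: the project name &lt;code&gt;my-analysis&lt;/code&gt; is
hypothetical, and I’ve put it in a temporary directory so that running the example
doesn’t touch your real files:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Hypothetical project root, in a temporary directory for illustration:
projectRoot &amp;lt;- file.path(tempdir(), &amp;quot;my-analysis&amp;quot;)
subdirs &amp;lt;- c(&amp;quot;data&amp;quot;, &amp;quot;processed&amp;quot;, &amp;quot;downloads&amp;quot;, &amp;quot;plots&amp;quot;, &amp;quot;code&amp;quot;)

## Create each subdirectory (and the root, if it is missing):
for (d in subdirs) {
  dir.create(file.path(projectRoot, d), recursive = TRUE,
             showWarnings = FALSE)
}

## Start a README that lists the subdirectories, to be filled in:
writeLines(c(&amp;quot;File List:&amp;quot;,
             paste0(&amp;quot;- &amp;quot;, subdirs, &amp;quot;/ : describe the contents here&amp;quot;)),
           file.path(projectRoot, &amp;quot;README.txt&amp;quot;))

list.files(projectRoot)&lt;/code&gt;&lt;/pre&gt;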
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;reproducibility&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Reproducibility&lt;/h2&gt;
&lt;p&gt;File organization gets us most of the way to portability. There are a few
things we need to do in our coding to complete this arrangement, and of
course to ensure we can reproduce the analysis once we move it to a new
computer.&lt;/p&gt;
&lt;div id=&#34;use-relative-paths&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Use Relative Paths&lt;/h3&gt;
&lt;p&gt;An &lt;code&gt;absolute path&lt;/code&gt; is the location of a file on a particular computer. The
absolute path to the file I’m editing right now is
&lt;code&gt;/home/smithty/blogdown/content/tutorials/2022-10-17-data-management/index.md&lt;/code&gt;.
That location will only ever exist on a Linux computer with a user named
&lt;code&gt;smithty&lt;/code&gt;. If I moved it to a Windows computer, it might be located at
&lt;code&gt;C:\Users\smithty\Documents\blogdown\content\tutorials\2022-10-17-data-management\index.md&lt;/code&gt;.
If I try to use the Linux absolute path on the Windows
machine, I won’t find the file.&lt;/p&gt;
&lt;p&gt;On the other hand, from the top directory of my blog, &lt;code&gt;blogdown/&lt;/code&gt;, this
file will have the same &lt;code&gt;relative path&lt;/code&gt; on both machines:
&lt;code&gt;content/tutorials/2022-10-17-data-management/index.md&lt;/code&gt;. That means links
to this file using the relative path will work just fine on both machines.&lt;/p&gt;
&lt;p&gt;We should use &lt;code&gt;relative paths&lt;/code&gt; in our analyses too.&lt;/p&gt;
&lt;p&gt;For example, I could use an absolute path to load my data:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;samples &amp;lt;-
  read.csv(&amp;quot;/home/smithty/nextcloud/trich/2020-06-25/data/survey-pops.csv&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But this won’t load on anyone else’s computer. If we do this instead:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;samples &amp;lt;-
  read.csv(&amp;quot;data/survey-pops.csv&amp;quot;) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Anyone with our archive can run the code as it is, as long as they have the
working directory set properly. For my projects, I do this by running my
code from the top directory of the archive. In this case, I have the
directory structure:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;├── data
│   └── survey-pops.csv
├── README.md
├── trich-prep.Rmd
└── trich.Rmd&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When I run code from &lt;code&gt;trich.Rmd&lt;/code&gt;, I set the working directory to the
location of that file. &lt;code&gt;RStudio&lt;/code&gt; manages this for you with its project
support. If you don’t use that feature, you can tell R to use that location
when it starts. After that, stick to relative file paths, and you don’t
need to worry about your code breaking when you move to a different
computer.&lt;/p&gt;
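&lt;p&gt;As a minimal sketch of this idea, the function below stops with a clear message when the relative path does not resolve from the current working directory. The name &lt;code&gt;loadSamples&lt;/code&gt; is a hypothetical helper, not part of the original analysis; &lt;code&gt;file.path()&lt;/code&gt;, &lt;code&gt;file.exists()&lt;/code&gt; and &lt;code&gt;getwd()&lt;/code&gt; are base R functions.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## hypothetical helper: load the survey data using a relative path
loadSamples &amp;lt;- function(dataFile = file.path(&amp;quot;data&amp;quot;, &amp;quot;survey-pops.csv&amp;quot;)) {
  ## fail early, and say where R was looking
  if (!file.exists(dataFile)) {
    stop(&amp;quot;Cannot find &amp;quot;, dataFile, &amp;quot; from &amp;quot;, getwd(),
         &amp;quot;: set the working directory to the top of the archive&amp;quot;)
  }
  read.csv(dataFile)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Calling &lt;code&gt;samples &amp;lt;- loadSamples()&lt;/code&gt; from the top directory of the archive should then behave the same way on any computer.&lt;/p&gt;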
&lt;/div&gt;
&lt;div id=&#34;structure-your-code-to-run-on-its-own&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Structure Your Code to Run On Its Own&lt;/h3&gt;
&lt;p&gt;Two common problems that make it hard to run your code are mixing your
‘good’ code with non-working code, and writing code that requires you to
update it by hand to finish your analysis.&lt;/p&gt;
&lt;p&gt;Mixing good and bad code might look like this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Load in our data:
myData &amp;lt;- read.csv(&amp;quot;data/myfile.csv&amp;quot;)
myDataScaled &amp;lt;- scale(myData)
myDataScaled &amp;lt;- scale(myData, center = FALSE)

myData &amp;lt;- myData[, -1]
myDataScaled &amp;lt;- scale(myData, scale = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It’s not unusual to accumulate various versions of code (should I scale
this, center it, both? Do I need the first column?). While you’re actively
working on your code, you may find you have multiple versions in the same
file. They need to be clearly commented, for your own benefit! But more
importantly, when you decide which version you’re going to use, remove the
rest. Don’t expect yourself (and certainly don’t expect anyone else) to
figure out which lines they should run, and which they should skip.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Load in our data:
myData &amp;lt;- read.csv(&amp;quot;data/myfile.csv&amp;quot;)

## drop the name column
myData &amp;lt;- myData[, colnames(myData) != &amp;quot;name&amp;quot;]

## don&amp;#39;t scale, use raw data
## myDataScaled &amp;lt;- scale(myData)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you have a few lines of code that you might want to revisit, comment
them out and leave yourself a note in a comment. If you have large blocks
of code you aren’t using, but want to keep a record of, put them in a
separate file. The end product should be a file that you can run from start
to finish, without deciding which lines to skip and which to run.&lt;/p&gt;
&lt;p&gt;A similar problem occurs when you have code that works, but requires you to
manually update it in order to complete your analysis. e.g.,&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;myExperimentData &amp;lt;- read.csv(&amp;quot;data/experiments.csv&amp;quot;)

myExperiment &amp;lt;- subset(myExperimentData, exp == 1)

myExperimentResult &amp;lt;- processingCode(myExperiment)

## repeat for exp 1-20&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Code structured like this requires you to edit and re-edit the code many
times to complete your work. That is tedious, and it’s easy to make a
mistake. And when you update &lt;code&gt;processingCode&lt;/code&gt;, you need to rerun everything
by hand. You can avoid this with loops and lists.&lt;/p&gt;
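&lt;p&gt;For instance, a minimal sketch of the loop-and-list approach might look like the following. Here &lt;code&gt;processingCode&lt;/code&gt; is a hypothetical stand-in that just averages a made-up &lt;code&gt;value&lt;/code&gt; column, and the data are invented for illustration.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## hypothetical stand-in for the real processing step
processingCode &amp;lt;- function(d) mean(d$value)

## made-up data: three experiments, two measurements each
myExperimentData &amp;lt;- data.frame(exp = rep(1:3, each = 2),
                               value = c(1, 3, 2, 4, 5, 7))

## split() makes a named list with one data frame per experiment
myExperiments &amp;lt;- split(myExperimentData, myExperimentData$exp)

## lapply() runs processingCode on every experiment and returns a list
myExperimentResults &amp;lt;- lapply(myExperiments, processingCode)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When &lt;code&gt;processingCode&lt;/code&gt; changes, rerunning this block updates every result at once, with no hand editing.&lt;/p&gt;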
&lt;/div&gt;
&lt;div id=&#34;r-programming-do-not-save-your-workspace&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;R Programming: Do Not Save Your Workspace!&lt;/h3&gt;
&lt;p&gt;R offers to save your current workspace when you close the terminal. This
is sometimes convenient, but can cause hard-to-detect problems with your
analysis. If you happen to alter one of the objects you are working with in
an R session, but don’t capture the code in your script file, you won’t
have a record of what you’ve done. If you then save your workspace, the
next time you work on your code, you will be able to keep using that
modified object. At this point, your code and data are out of sync, with
nothing to indicate how they differ.&lt;/p&gt;
&lt;p&gt;At best, this is inconvenient. At worst, if undetected, you can waste weeks
or months analyzing the wrong data! Better to avoid the risk and set R to
never save your workspace.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;metadata-data-about-your-data&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Metadata: data about your data&lt;/h3&gt;
&lt;p&gt;If your archive includes data files, these should also be in an open
format, such as comma-separated or tab-separated text files (typically with
a name like &lt;code&gt;FILE.csv&lt;/code&gt;, &lt;code&gt;DATA.txt&lt;/code&gt;, or &lt;code&gt;RECORDS.txt&lt;/code&gt;). That ensures that
anyone can open your data without needing a proprietary program to do so.&lt;/p&gt;
&lt;div id=&#34;data-tables&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Data Tables&lt;/h4&gt;
&lt;p&gt;In addition, you need to document how your &lt;a href=&#34;https://data.library.arizona.edu/data-management/best-practices/data-documentation-readme-metadata&#34;&gt;data is
coded&lt;/a&gt;.
This includes things like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Variable names and descriptions&lt;/li&gt;
&lt;li&gt;Definition of codes and classification schemes&lt;/li&gt;
&lt;li&gt;Codes of, and reasons for, missing values&lt;/li&gt;
&lt;li&gt;Definitions of specialty terminology and acronyms&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Be sure to include things like units (meters or feet?), and what your
special codes mean (is &lt;code&gt;pet_l&lt;/code&gt; the petal length or the petiole length?
What are &lt;code&gt;lf1&lt;/code&gt;, &lt;code&gt;lf2&lt;/code&gt; and &lt;code&gt;lf3&lt;/code&gt;?). This is also a chance to review your data
for consistency: are you using NA &lt;em&gt;and&lt;/em&gt; -1 for missing values? Do you have
multiple different phrasings for the same thing?&lt;/p&gt;
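&lt;p&gt;If you do find more than one missing-value code, one way to handle it when loading the data is the &lt;code&gt;na.strings&lt;/code&gt; argument of &lt;code&gt;read.csv&lt;/code&gt;. The sketch below uses invented data and a hypothetical column name, &lt;code&gt;petal_len_mm&lt;/code&gt;; fixing the codes in the data file itself, and documenting them, is still the better long-term solution.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## made-up data: petal length in mm, with missing values coded two ways
raw &amp;lt;- &amp;quot;id,petal_len_mm\n1,12.5\n2,NA\n3,-1\n4,10.1&amp;quot;

## na.strings lists every code that should be read as a missing value
flowers &amp;lt;- read.csv(text = raw, na.strings = c(&amp;quot;NA&amp;quot;, &amp;quot;-1&amp;quot;))

## count the missing values in the column
nMissing &amp;lt;- sum(is.na(flowers$petal_len_mm))&lt;/code&gt;&lt;/pre&gt;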
&lt;p&gt;Depending on your project, you may also need to distinguish between absent
evidence (you didn’t sample on a day) and evidence of absence (you sampled
on a day, but didn’t find any events/individuals). If your analysis will
include sampling events that didn’t result in any observations, you’ll need
to document these ‘true negatives’ in your data table.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;data-sources&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Data Sources&lt;/h4&gt;
&lt;p&gt;If you’re using data from outside sources, like &lt;a href=&#34;https://www.gbif.org/&#34;&gt;GBIF.org&lt;/a&gt; or
&lt;a href=&#34;https://worldclim.org/&#34;&gt;WorldClim.org&lt;/a&gt;, be sure to record their citation
details, including a DOI, if available, when you download them. This will
ensure you can properly cite them later, and that your readers will be able
to access the same data if they want to reproduce your work. Most data
providers have clear policies as to how you should cite them, and what
you’re allowed to do with the data they share (i.e., can you share it or
archive it yourself). For example, here are the policies for
&lt;a href=&#34;https://worldclim.org/about.html&#34;&gt;WorldClim&lt;/a&gt; and
&lt;a href=&#34;https://www.gbif.org/citation-guidelines&#34;&gt;GBIF&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;how-to-get-there-from-here&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;How to get there from here&lt;/h1&gt;
&lt;p&gt;Now that we have an idea of what we want our data archive to look like, how do
we get there? Start by setting up your directory structure, and making a
&lt;code&gt;README&lt;/code&gt; file. If you’re starting from scratch, make it a practice to
review your directory regularly, to see what you’ve done and what you’re no
longer using, and to update your &lt;code&gt;README&lt;/code&gt; to capture that. This kind of
regular reflection is useful to track your progress, and helps you keep on
top of your archive work so you don’t have a huge mess to wrangle at the
end of your project.&lt;/p&gt;
&lt;p&gt;If you’re already well along in your project, it may be easier to create an
‘aspirational’ directory and populate it from your existing work. I do this
regularly! I often find I’ve charged into an analysis without thinking
about archiving, and after a few weeks I have an unholy mess of files and
data to deal with. In that case, I create a new directory, a &lt;code&gt;README&lt;/code&gt;, and
copy over the main code file I’m using to that directory. When I get to the
first data file in that directory, I copy it over to the &lt;code&gt;data/&lt;/code&gt; directory
in the new archive, adjust the code to use the relative path, if necessary,
and continue. This can be very helpful in clarifying what code and files
you actually need and use, and what can be left behind.&lt;/p&gt;
&lt;p&gt;This is also a good opportunity to ensure that your analysis is structured
in a way that it can run start-to-finish without manual editing.&lt;/p&gt;
&lt;p&gt;You don’t need to delete anything in the old directory; you can keep it in
case you later decide you want to revisit some ideas in there.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;what-to-do-with-it-all&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;What to do with it all&lt;/h1&gt;
&lt;p&gt;One of the benefits of structuring your code in a single directory is there
are lots of tools that you can use to manage it.
&lt;a href=&#34;https://git-scm.com/&#34;&gt;git&lt;/a&gt; and &lt;a href=&#34;https://github.com/&#34;&gt;GitHub&lt;/a&gt; are popular,
and very powerful for managing code, especially tracking different versions
of the same files. However, they require a certain amount of discipline to
get the full benefit, and they are challenging to learn. RStudio does have
good support for GitHub repositories.&lt;/p&gt;
&lt;p&gt;You can also keep it simple, and sync your directory to Google Drive,
Dropbox, Nextcloud, or many other options. Once your work is published, you
can archive it permanently on &lt;a href=&#34;https://zenodo.org/&#34;&gt;Zenodo&lt;/a&gt;,
&lt;a href=&#34;https://datadryad.org/&#34;&gt;DataDryad&lt;/a&gt;, or other online services.&lt;/p&gt;
&lt;p&gt;All of these options will support housing a single directory and its
subdirectories. None of these options will be easy to deal with if you have
files spread across multiple directories and mixed with files from other
projects!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;examples&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Examples&lt;/h1&gt;
&lt;p&gt;I’ve been doing some version of this for my own work for years, but have
only recently moved to permanent, public archives of my work. Here are
three recent examples:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class=&#34;citation&#34;&gt;Hayes et al. (&lt;a href=&#34;#ref-HayesEtAl_2022&#34;&gt;2022&lt;/a&gt;)&lt;/span&gt;: &lt;a href=&#34;https://github.com/plantarum/celtisSSR&#34;&gt;The Genetic Diversity of Triploid &lt;em&gt;Celtis pumila&lt;/em&gt; and
its Diploid Relatives&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;citation&#34;&gt;Nowell et al. (&lt;a href=&#34;#ref-NowellEtAl_2022&#34;&gt;2022&lt;/a&gt;)&lt;/span&gt;: &lt;a href=&#34;https://github.com/plantarum/trich&#34;&gt;Conservation assessment of a range-edge population of
&lt;em&gt;Trichophorum planifolium&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&#34;citation&#34;&gt;Foster et al. (&lt;a href=&#34;#ref-FosterEtAl_2022&#34;&gt;2022&lt;/a&gt;)&lt;/span&gt;: &lt;a href=&#34;https://doi.org/10.5061/dryad.cfxpnvx8f&#34;&gt;Testing the assumption of environmental equilibrium in
an invasive plant species over a 130 year
history&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I’m still figuring out how best to do this, and my practice will definitely
continue to change and evolve. Regardless of the specifics, I have
benefited enormously from investing the time needed to make coherent
archives of my projects.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-AlstonRick_2021&#34;&gt;
&lt;p&gt;Alston, J. M., and J. A. Rick. 2021. A Beginner’s Guide to Conducting Reproducible Research. &lt;em&gt;The Bulletin of the Ecological Society of America&lt;/em&gt; 102: e01801.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-FosterEtAl_2022&#34;&gt;
&lt;p&gt;Foster, S. L., H. M. Kharouba, and T. W. Smith. 2022. Testing the assumption of environmental equilibrium in an invasive plant species over a 130 year history. &lt;em&gt;Ecography&lt;/em&gt;: e06284.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-HayesEtAl_2022&#34;&gt;
&lt;p&gt;Hayes, A., S. Wang, A. T. Whittemore, and T. W. Smith. 2022. The Genetic Diversity of Triploid &lt;em&gt;Celtis pumila&lt;/em&gt; and its Diploid Relatives &lt;em&gt;C. occidentalis&lt;/em&gt; and &lt;em&gt;C. laevigata&lt;/em&gt; (Cannabaceae). &lt;em&gt;Systematic Botany&lt;/em&gt; 47: 441–451.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;ref-NowellEtAl_2022&#34;&gt;
&lt;p&gt;Nowell, V. J., S. Wang, and T. W. Smith. 2022. Conservation assessment of a range-edge population of &lt;em&gt;Trichophorum planifolium&lt;/em&gt; (Cyperaceae) reveals range-wide inbreeding and locally divergent environmental conditions. &lt;em&gt;Botany&lt;/em&gt; 100: 631–642.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Georeferencing Notes</title>
      <link>https://plantarum.ca/2022/03/23/georeferencing/</link>
      <pubDate>Wed, 23 Mar 2022 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2022/03/23/georeferencing/</guid>
      <description>&lt;h1 id=&#34;gbif-data&#34;&gt;GBIF Data&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://www.gbif.org/&#34;&gt;GBIF&lt;/a&gt; is the main online clearing house for
occurrence data in the world. It includes most (but not all) online
herbarium databases. It also includes
&lt;a href=&#34;https://www.inaturalist.org&#34;&gt;iNaturalist&lt;/a&gt; records, as well as a number of
other survey repositories. I&amp;rsquo;m not familiar with all of the sources
included.&lt;/p&gt;
&lt;p&gt;The main page presents a search bar. Enter your species name (e.g., &lt;em&gt;Rubus
chamaemorus&lt;/em&gt;) in the box, and select the &amp;lsquo;occurrences&amp;rsquo; tab. You&amp;rsquo;ll be taken
to a list of results that match your species (about 2.9 million results for &lt;em&gt;R.
chamaemorus&lt;/em&gt;). This may include records for other species, presumably
because the name of the species you are searching for is recorded in the
metadata for those records. We don&amp;rsquo;t usually want these.&lt;/p&gt;
&lt;p&gt;To fix this, look for the prompt: &amp;ldquo;Your search matches a Species: &amp;lsquo;Rubus
chamaemorus L.&amp;rsquo;. Do you wish to limit your search to this taxon only?&amp;rdquo;, in
the left-hand tool bar. Select yes to restrict the results to only records
for your species (i.e., excluding records where your species is only mentioned in
the metadata). For &lt;em&gt;R. chamaemorus&lt;/em&gt; we drop from about 2.9 million records to about 87,000.&lt;/p&gt;
&lt;h2 id=&#34;basis-of-record&#34;&gt;Basis of Record&lt;/h2&gt;
&lt;p&gt;In the right hand toolbar, look for the option: &lt;strong&gt;Basis of Record&lt;/strong&gt;.
Expanding that tab, you see a list of options, along with the number of
records for each option. For &lt;em&gt;R. chamaemorus&lt;/em&gt; I see:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Observation (6)&lt;/li&gt;
&lt;li&gt;Machine Observation (46)&lt;/li&gt;
&lt;li&gt;Human Observation (79686)&lt;/li&gt;
&lt;li&gt;Material Sample (516)&lt;/li&gt;
&lt;li&gt;Material Citation (1)&lt;/li&gt;
&lt;li&gt;Preserved Specimens (6032)&lt;/li&gt;
&lt;li&gt;Fossil Specimen (133)&lt;/li&gt;
&lt;li&gt;Living Specimen (28)&lt;/li&gt;
&lt;li&gt;Occurrence (1222)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The &lt;strong&gt;Basis of Record&lt;/strong&gt; field is no longer recommended, and is deprecated
in the DarwinCore system. Looking through the various categories, that&amp;rsquo;s
probably a good idea. I&amp;rsquo;m not sure what the difference is between
&amp;lsquo;Observation&amp;rsquo;, &amp;lsquo;Machine Observation&amp;rsquo;, &amp;lsquo;Occurrence&amp;rsquo;, etc., and it looks like
the groups are applied very inconsistently.&lt;/p&gt;
&lt;p&gt;&amp;lsquo;Human Observation&amp;rsquo; does include &lt;a href=&#34;https://www.inaturalist.org&#34;&gt;iNaturalist&lt;/a&gt;
records, but it also includes a lot of other data. The &lt;code&gt;iNaturalist&lt;/code&gt; records
are updated periodically, but it&amp;rsquo;s also possible to download them directly
from iNaturalist. That&amp;rsquo;s my preference, as then I don&amp;rsquo;t need to concern
myself with what else might be included under &amp;lsquo;Human Observation&amp;rsquo;.&lt;/p&gt;
&lt;p&gt;To restrict ourselves to herbarium records, we&amp;rsquo;ll select &amp;lsquo;Preserved
Specimens&amp;rsquo; and move on. For &lt;em&gt;R. chamaemorus&lt;/em&gt; that brings us down to about 5,900
records.&lt;/p&gt;
&lt;h2 id=&#34;geographic-filters&#34;&gt;Geographic Filters&lt;/h2&gt;
&lt;p&gt;With our data filtered to just herbarium records, we can see where they are
from by clicking on the &amp;lsquo;map&amp;rsquo; tab at the top of the table. Doing that I see
that of the 5900 records available, 2700 have coordinates already.&lt;/p&gt;
&lt;p&gt;The rectangle and pentagon symbols in the upper right provide tools for
further filtering the results by rectangle or polygon, respectively. You
can also use the buttons on the upper left to zoom in and out, and to
examine record details for an area (the arrow pointing at a circle). In my
case, I want all records for the planet, so I&amp;rsquo;ll skip those options.&lt;/p&gt;
&lt;h2 id=&#34;downloading&#34;&gt;Downloading&lt;/h2&gt;
&lt;p&gt;Now I can download the records. I think GBIF requires that you have a
(free) account for this. That helps them track use. The &amp;lsquo;simple&amp;rsquo; format is
sufficient; the DarwinCore format includes a lot of extra columns.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;VERY IMPORTANT!&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Record the DOI for your download! You will need this to properly cite the
data you use in a publication. If you have a GBIF account, you may be able
to recover it after the fact. You can still cite GBIF in your paper without
a DOI for the download, but it&amp;rsquo;s much less useful.&lt;/p&gt;
&lt;p&gt;For my &lt;em&gt;R. chamaemorus&lt;/em&gt; data, GBIF provides the following citation:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;GBIF.org (23 March 2022) GBIF Occurrence Download https://doi.org/10.15468/dl.uhqfg3
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;For large queries, it may take several minutes or longer for the data to be
ready. You&amp;rsquo;ll get an email when it&amp;rsquo;s available.&lt;/p&gt;
&lt;h1 id=&#34;data-processing&#34;&gt;Data Processing&lt;/h1&gt;
&lt;h2 id=&#34;uploading&#34;&gt;Uploading&lt;/h2&gt;
&lt;p&gt;Once you have your data, we&amp;rsquo;ll upload it to our shared Google Drive for
processing. From the shared drive, select &lt;strong&gt;New&lt;/strong&gt; -&amp;gt; &lt;strong&gt;File upload&lt;/strong&gt;, and
select the unzipped &lt;code&gt;.csv&lt;/code&gt; file from your GBIF download. Double-click on
the file, and select &amp;lsquo;Open With Google Sheets&amp;rsquo; at the top of the page.&lt;/p&gt;
&lt;p&gt;The filename is listed in the top left corner. Change that to the species
name, possibly with any additional details to distinguish it from other
files with the same species. For most cases, when we have a single global
file for the species, just the name is fine.&lt;/p&gt;
&lt;p&gt;However, we do want to link our data to the GBIF DOI. I do this by renaming
the sheet to the DOI url: &lt;code&gt;https://doi.org/10.15468/dl.uhqfg3&lt;/code&gt;. If we need
to maintain separate datasets for the same species, they can be added as
new sheets. This hasn&amp;rsquo;t come up yet.&lt;/p&gt;
&lt;h2 id=&#34;prepping-the-spreadsheet&#34;&gt;Prepping the Spreadsheet&lt;/h2&gt;
&lt;p&gt;Even the &amp;lsquo;simple&amp;rsquo; export has a lot of extra columns we don&amp;rsquo;t need. These
can be &amp;lsquo;hidden&amp;rsquo; from view to make it easier to work with the file. The main
columns we use are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;countryCode&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;locality&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stateProvince&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;decimalLatitude&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;decimalLongitude&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;coordinateUncertaintyInMeters&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;georeferencedBy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;georeferencedDate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;georeferenceSources&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;georeferenceProtocol&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We do need to add several columns for our georeferencing. Find the column
&lt;code&gt;coordinateUncertaintyInMeters&lt;/code&gt;, and add five columns to the right. The names of
these columns are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;georeferencedBy&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;georeferencedDate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;georeferenceSources&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;georeferenceProtocol&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;georeferenceRemarks&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;data-entry&#34;&gt;Data Entry&lt;/h2&gt;
&lt;p&gt;For records that already have coordinates, don&amp;rsquo;t change them or add any
additional notes.&lt;/p&gt;
&lt;p&gt;For records that you locate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Add your name to &lt;code&gt;georeferencedBy&lt;/code&gt;. Use the same spelling/abbreviation
every time.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add the date in yyyy-mm-dd format (e.g., 2022-03-23) to
&lt;code&gt;georeferencedDate&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add the coordinates to &lt;code&gt;decimalLatitude&lt;/code&gt; and &lt;code&gt;decimalLongitude&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add the uncertainty in meters to &lt;code&gt;coordinateUncertaintyInMeters&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add the location source to &lt;code&gt;georeferenceSources&lt;/code&gt; (usually this will be
&amp;ldquo;GoogleMaps&amp;rdquo; or &amp;ldquo;GeoLocate&amp;rdquo;)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;code&gt;georeferenceRemarks&lt;/code&gt; is for comments regarding your efforts to locate the
record. Some possible values include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;AttemptedGoogleMaps&lt;/code&gt;: I tried to find the location using Google Maps,
and could not&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AttemptedGetty&lt;/code&gt;: I tried to find the location using Getty Thesaurus of
Geographic Names&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AttemptedGNIS&lt;/code&gt;: I tried to find the location using Geographic Names
Information System (GNIS)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;AttemptedNRC&lt;/code&gt;: I tried to find the location using Natural Resources
Canada, geographical place/feature names of Canada&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LocationNotFound&lt;/code&gt;: I could not find this location / I could not
georeference it&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ContradictingLocation&lt;/code&gt;: I could not georeference the location as there
are two locations mentioned which are not close to each other&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Duplicate&lt;/code&gt;: duplicate georeference; other georeferences share the same
lat/long and uncertainty&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;problems&#34;&gt;Problems&lt;/h2&gt;
&lt;h3 id=&#34;inconsistencies-in-source-data&#34;&gt;Inconsistencies in Source Data&lt;/h3&gt;
&lt;p&gt;We don&amp;rsquo;t normally need to modify the original data. However, if you see the
following known errors they can be corrected as indicated.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;0, 0&lt;/code&gt; latitude and longitude is an error. Can be overwritten, do so as
you are working through the data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;coordinateUncertaintyInMeters&lt;/code&gt; without latitude and longitude
coordinates is an error. Can be overwritten, do so as you are working
through the data&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Data with no country code: If no other locality data given, cannot be
given a georeference; If locality data given, can add the country code
and province information.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Localities with no province should be corrected as you are working
through the data. Either add the province that is listed in the locality
to the province column, or add the province that applies given the
country and locality to the province column.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
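&lt;p&gt;The first two checks can also be screened for automatically. The following is a minimal R sketch, assuming the sheet has been exported and read into a data frame; the &lt;code&gt;records&lt;/code&gt; data frame here is invented for illustration, using the GBIF column names listed above.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## made-up records using the GBIF column names
records &amp;lt;- data.frame(
  decimalLatitude = c(45.4, 0, NA),
  decimalLongitude = c(-75.7, 0, NA),
  coordinateUncertaintyInMeters = c(100, NA, 5000)
)

## rows with 0, 0 coordinates; which() drops the NA comparisons
zeroZero &amp;lt;- with(records,
                 which(decimalLatitude == 0 &amp;amp; decimalLongitude == 0))

## rows with an uncertainty but no coordinates
orphanUncertainty &amp;lt;- with(records,
                          which(is.na(decimalLatitude) &amp;amp;
                                !is.na(coordinateUncertaintyInMeters)))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The two vectors hold the row numbers to review by hand; the sketch only flags records, it does not change them.&lt;/p&gt;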
&lt;h1 id=&#34;finding-coordinates&#34;&gt;Finding Coordinates&lt;/h1&gt;
&lt;p&gt;Start by sorting the spreadsheet by country, state within country, and
locality within state. This will save time, as you&amp;rsquo;ll be processing records
from the same location together. In many cases you&amp;rsquo;ll have multiple records
in the same state, and possibly in the same town. If you&amp;rsquo;re lucky, there
will be records with coordinates for the same location as records without
coordinates. Sorting allows you to take advantage of this.&lt;/p&gt;
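&lt;p&gt;We do this sorting in Google Sheets, but the same ordering can be sketched in R for anyone working from an exported file. The &lt;code&gt;records&lt;/code&gt; data frame below is invented for illustration; the column names are the GBIF ones listed earlier.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## made-up records using the GBIF column names
records &amp;lt;- data.frame(
  countryCode = c(&amp;quot;US&amp;quot;, &amp;quot;CA&amp;quot;, &amp;quot;CA&amp;quot;),
  stateProvince = c(&amp;quot;New York&amp;quot;, &amp;quot;Ontario&amp;quot;, &amp;quot;Ontario&amp;quot;),
  locality = c(&amp;quot;Ithaca&amp;quot;, &amp;quot;Ottawa&amp;quot;, &amp;quot;Kingston&amp;quot;)
)

## order() sorts by country, then state within country,
## then locality within state
sorted &amp;lt;- records[order(records$countryCode,
                        records$stateProvince,
                        records$locality), ]
&lt;/code&gt;&lt;/pre&gt;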
&lt;p&gt;The final step is actually finding the coordinates. Hopefully there&amp;rsquo;s
enough detail in the &lt;code&gt;locality&lt;/code&gt; field that you can search for landmarks in
GoogleMaps, and use the satellite view to get close to the actual location.
It really depends on the record what will be possible.&lt;/p&gt;
&lt;h2 id=&#34;google-maps&#34;&gt;Google Maps&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://www.google.com/maps&#34;&gt;Google Maps&lt;/a&gt; has a very straightforward
interface, and is the simplest option. The &lt;a href=&#34;http://www.geo-locate.org/web/WebGeoref.aspx&#34;&gt;GeoLocate web
client&lt;/a&gt; provides a slightly
more sophisticated search tool, which allows you to use administrative
boundaries or editable circles to set uncertainty values.&lt;/p&gt;
&lt;h2 id=&#34;geolocate&#34;&gt;GeoLocate&lt;/h2&gt;
&lt;p&gt;GeoLocate also supports &lt;a href=&#34;http://www.geo-locate.org/web/WebFileGeoref.aspx&#34;&gt;batch
processing&lt;/a&gt;. This allows
you to upload a file and process a group of records together,
automatically. You will still need to review the results to make sure they
make sense, but you can manually edit them directly in the map viewer. When
you&amp;rsquo;re done you can export the results as a &lt;code&gt;csv&lt;/code&gt; file. To use this
approach, you need to prepare the spreadsheet according to &lt;a href=&#34;http://www.geo-locate.org/standalone/tutorial.html&#34;&gt;the
instructions&lt;/a&gt;, and we
need a simple approach for exporting from Google Sheets, and reimporting
back into Google Sheets when we&amp;rsquo;re done.&lt;/p&gt;
&lt;p&gt;Record coordinates to five decimal places. At the equator, 1 degree is
about 111 km, so 0.00001 degrees is about 1 m. We don&amp;rsquo;t need sub-meter accuracy,
and none of these records are that precise.&lt;/p&gt;
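&lt;p&gt;The arithmetic behind the five-decimal rule can be checked with a few lines of R. This is only a sketch: the length of one degree is an approximate, assumed value, and the coordinate is made up.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;## approximate length of one degree at the equator, in metres (assumed value)
metresPerDegree &amp;lt;- 111320

## size of the fifth decimal place, in metres: a little over 1 m
fifthDecimal &amp;lt;- metresPerDegree * 1e-5

## round a made-up latitude to five decimal places
lat &amp;lt;- round(45.4215296, 5)
&lt;/code&gt;&lt;/pre&gt;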
&lt;h2 id=&#34;uncertainty&#34;&gt;Uncertainty&lt;/h2&gt;
&lt;p&gt;Estimating uncertainty is a judgement call. If the locality data is very
precise (&amp;ldquo;the corner of Main and Second Avenue in Springfield&amp;rdquo;) you might
have a very small uncertainty. If all you have is a state, your coordinate
will be the center of the state, and the uncertainty will be the diameter
of the state. Don&amp;rsquo;t worry about being precise: 1 m, 100 m, 1000 m, 10000 m (or
bigger) are fine.&lt;/p&gt;
&lt;p&gt;For vague/approximate locations, do what you can. Try to think like a field
biologist. How far from a town do you need to be before you stop describing
your location as &amp;ldquo;South of X&amp;rdquo;, and start describing it as &amp;ldquo;between X and
Y&amp;rdquo;?&lt;/p&gt;
&lt;h1 id=&#34;priorities&#34;&gt;Priorities&lt;/h1&gt;
&lt;p&gt;Some records take a lot of work to locate, and some are impossible. We&amp;rsquo;re
most interested in filling in gaps. If we have a thousand records from
Ottawa, we don&amp;rsquo;t need to waste a lot of energy tracking down the
coordinates for one more record. But if we only have one record for China,
it will be worth spending some time trying to find it. Similarly, earlier
records are rarer and more valuable: the 100th record from New York (in
2021) doesn&amp;rsquo;t add as much value as the first one (from 1830).&lt;/p&gt;
&lt;h1 id=&#34;tips-for-deciphering-labels&#34;&gt;Tips For Deciphering Labels&lt;/h1&gt;
&lt;p&gt;Compiled by Laura Kostyniuk.&lt;/p&gt;
&lt;h2 id=&#34;abbreviations-seen-in-data&#34;&gt;Abbreviations seen in data&lt;/h2&gt;
&lt;h3 id=&#34;latin&#34;&gt;Latin&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;opp.&lt;/code&gt; = likely oppidum, town in English&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fl.&lt;/code&gt; = likely flumen, river in English&lt;/li&gt;
&lt;li&gt;&lt;code&gt;urb.&lt;/code&gt; = likely urbs/urbem, city in English&lt;/li&gt;
&lt;/ul&gt;
&lt;h3 id=&#34;other-languages&#34;&gt;Other languages&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;VC&lt;/code&gt; or &lt;code&gt;v.c.&lt;/code&gt; in &lt;strong&gt;Great Britain&lt;/strong&gt; codes likely stands for &lt;a href=&#34;https://en.wikipedia.org/wiki/Vice-county&#34;&gt;vice
county&lt;/a&gt;. Not sure if this applies to other countries&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Arred.&lt;/code&gt; / &lt;code&gt;Arred. de&lt;/code&gt; in &lt;strong&gt;Portugal&lt;/strong&gt; is an abbreviation for &lt;code&gt;Arredores&lt;/code&gt;, meaning ‘surroundings of’&lt;/li&gt;
&lt;li&gt;Locations in &lt;strong&gt;Sweden&lt;/strong&gt; starting with &lt;code&gt;W&lt;/code&gt; often do not show up in Google
Maps. They can sometimes be found if &lt;code&gt;W&lt;/code&gt; is changed to &lt;code&gt;V&lt;/code&gt; (e.g., Wrigstad = no
hits, Vrigstad = village in right general location).&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&#34;other-sources&#34;&gt;Other Sources&lt;/h1&gt;
&lt;p&gt;These sources are for valuable records you can&amp;rsquo;t locate in GoogleMaps or GeoLocate. Compiled
by Laura Kostyniuk and Shannon Ascensio.&lt;/p&gt;
&lt;h2 id=&#34;placefeature-names-gazetteers&#34;&gt;Place/feature names, gazetteers&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;http://www4.rncan.gc.ca/search-place-names/search?lang=en&#34;&gt;Canadian Geographical Names Database (Canada)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://geonames.usgs.gov/apex/f?p=138:1:0&#34;&gt;USGS Geographic Names Information System (GNIS) (USA)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://www.getty.edu/research/tools/vocabularies/tgn/index.html&#34;&gt;Getty Thesaurus of Geographic Names (Worldwide)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;coordinate-converters&#34;&gt;Coordinate converters&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://tool-online.com/en/coordinate-converter.php&#34;&gt;Tool-online Coordinate
converter&lt;/a&gt;: search
by country or by world if the type of coordinate system you are looking
to convert is available&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://www.ornitho.ch/index.php?m_id=48&amp;amp;action=XY&#34;&gt;Converter for Swiss
coordinates&lt;/a&gt;: If
numbers are written as 123.4/567.8, 567.8 is 567 800 as a Y coordinate,
and 123 400 as an X coordinate&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;useful-maps&#34;&gt;Useful maps&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://mapire.eu/en/map/europe-19century-secondsurvey/?bbox=280827.2190911589%2C6392475.771436083%2C1198071.558513274%2C7003971.997717493&amp;amp;layers=158%2C164&#34;&gt;Europe in the XIX century&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://www.komsta.net/atpol/&#34;&gt;Poland: ATPOL map&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://www.randymajors.com/p/township-range-on-google-maps.html&#34;&gt;US: PLSS system of township/section/range&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://digital.library.mcgill.ca/countyatlas/searchmapframes.php&#34;&gt;Ontario: Historic County/Township maps&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://www.gisapplication.lrc.gov.on.ca/matm/Index.html?site=Make_A_Topographic_Map&amp;amp;viewer=MATM&amp;amp;locale=en-US&#34;&gt;Ontario: Lot/Concession Maps&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://upload.wikimedia.org/wikipedia/commons/1/10/Comt%C3%A9s_du_Qu%C3%A9bec_1920.jpg&#34;&gt;Quebec: historic counties (1920)&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://www.oldmapsonline.org/&#34;&gt;OldMapsonline&lt;/a&gt; Old map searching tool;
search by location, year, and add filters to search.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://mapcarta.com/&#34;&gt;MapCarta&lt;/a&gt;: Good general map with more locations
than Google; sometimes have to manually search the map though, locations
will not always show up when searched in search bar.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://www.findlatitudeandlongitude.com/&#34;&gt;FindLatitudeLongitude&lt;/a&gt;
Another general mapping / searchable tool.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;http://centrodedescargas.cnig.es/CentroDescargas/locale?request_locale=en#&#34;&gt;Spanish grid
map&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&#34;https://kartor.eniro.se/&#34;&gt;Scandinavia&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Schoener&#39;s D and Study Extent</title>
      <link>https://plantarum.ca/2021/12/02/schoenersd/</link>
      <pubDate>Thu, 02 Dec 2021 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2021/12/02/schoenersd/</guid>
      <description>
&lt;script src=&#34;https://plantarum.ca/2021/12/02/schoenersd/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;background&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Background&lt;/h1&gt;
&lt;p&gt;Schoener’s D was created by &lt;span class=&#34;citation&#34;&gt;&lt;a href=&#34;#ref-Schoener_1968&#34; role=&#34;doc-biblioref&#34;&gt;Schoener&lt;/a&gt; (&lt;a href=&#34;#ref-Schoener_1968&#34; role=&#34;doc-biblioref&#34;&gt;1968&lt;/a&gt;)&lt;/span&gt;. He was studying the feeding
niche of anoles, and needed a way to quantify the overlap in prey items for
different species. This is what he came up with:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[D(p_X, p_Y) = 1 - \frac{1}{2} \sum_i \vert p_{X,i} - p_{Y,i} \vert\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Here, &lt;span class=&#34;math inline&#34;&gt;\(p_{X,i}\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(p_{Y,i}\)&lt;/span&gt; are the frequencies for species &lt;span class=&#34;math inline&#34;&gt;\(X\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(Y\)&lt;/span&gt;,
respectively, for the &lt;span class=&#34;math inline&#34;&gt;\(i^{th}\)&lt;/span&gt; category. For Schoener, the categories were
prey sizes. In the context of distribution modeling, they would be regions
along an environmental gradient, and the ‘frequencies’ are the fitted
values from an SDM, or the density values from an Ecospat dynamic niche
grid.&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;citation&#34;&gt;&lt;a href=&#34;#ref-WarrenEtAl_2008&#34; role=&#34;doc-biblioref&#34;&gt;Warren et al.&lt;/a&gt; (&lt;a href=&#34;#ref-WarrenEtAl_2008&#34; role=&#34;doc-biblioref&#34;&gt;2008&lt;/a&gt;)&lt;/span&gt; pointed out some subtle theoretical issues with Schoener’s
D in this context, and proposed their own index &lt;em&gt;I&lt;/em&gt;, based on the Hellinger
distance, to better account for them.&lt;/p&gt;
&lt;p&gt;Hellinger’s distance:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[H(p_X, p_Y) = \sqrt{\sum_i(\sqrt{p_{X,i}} - \sqrt{p_{Y,i}})^2}\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;Warren’s &lt;em&gt;I&lt;/em&gt;:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[I(p_X, p_Y) = 1 - \frac{1}{2} H(p_X, p_Y)^2\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;In application, Schoener’s D suggests that the &lt;span class=&#34;math inline&#34;&gt;\(p_{X, i}\)&lt;/span&gt; values reflect
relative use of a particular habitat. However, ENM predictions indicate the
relative ‘suitability’ of a cell for &lt;em&gt;occupancy&lt;/em&gt; (i.e., presence or
absence) by the study species, but do not necessarily reflect density.&lt;/p&gt;
&lt;p&gt;However, Warren also noted that despite the potential issues, in practice
there is little difference in the qualitative results following from &lt;em&gt;D&lt;/em&gt;
and &lt;em&gt;I&lt;/em&gt;. I think Schoener’s &lt;em&gt;D&lt;/em&gt; is more commonly used now, but either or
both may show up in distribution modeling studies.&lt;/p&gt;
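&lt;p&gt;As a quick check on the arithmetic, here is a minimal sketch that computes
both indices by hand. The frequencies are made up for illustration (they are
not taken from Schoener or Warren), and &lt;em&gt;I&lt;/em&gt; is computed in the same form
used by the code later in this post:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## hypothetical frequencies for two species across three
## categories; each vector sums to 1
pX &amp;lt;- c(0.5, 0.3, 0.2)
pY &amp;lt;- c(0.2, 0.3, 0.5)

## Schoener&amp;#39;s D: one minus half the summed absolute differences
D &amp;lt;- 1 - sum(abs(pX - pY)) / 2

## Warren&amp;#39;s I, from the squared Hellinger distance
H2 &amp;lt;- sum((sqrt(pX) - sqrt(pY))^2)
I &amp;lt;- 1 - H2 / 2

c(D = D, I = I)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The absolute differences are 0.3, 0 and 0.3, so &lt;em&gt;D&lt;/em&gt; works out to 0.7.&lt;/p&gt;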
&lt;/div&gt;
&lt;div id=&#34;overlap-vs-correlation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Overlap vs Correlation&lt;/h1&gt;
&lt;p&gt;&lt;span class=&#34;citation&#34;&gt;&lt;a href=&#34;#ref-Warren_2018&#34; role=&#34;doc-biblioref&#34;&gt;Warren&lt;/a&gt; (&lt;a href=&#34;#ref-Warren_2018&#34; role=&#34;doc-biblioref&#34;&gt;2018&lt;/a&gt;)&lt;/span&gt; made an interesting contrast between two species’ niche
overlap (D), and the correlation between their suitability scores.
Schoener’s &lt;em&gt;D&lt;/em&gt; quantifies the extent to which a pair of species may
interact in the same space (i.e., they’re both likely to be present
together in a location). This is important to know, especially in the
context of niche-shift studies &lt;span class=&#34;citation&#34;&gt;(e.g. &lt;a href=&#34;#ref-AtwaterBarney_2021&#34; role=&#34;doc-biblioref&#34;&gt;Atwater and Barney, 2021&lt;/a&gt;)&lt;/span&gt;. But while it
tells us about where species are found along an environmental gradient, it
doesn’t tell us anything about how they respond to that gradient. In fact,
species with &lt;em&gt;perfectly opposite&lt;/em&gt; responses to the environment
may still have relatively high niche overlap, &lt;em&gt;D&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Let’s revisit the example from &lt;span class=&#34;citation&#34;&gt;&lt;a href=&#34;#ref-Warren_2018&#34; role=&#34;doc-biblioref&#34;&gt;Warren&lt;/a&gt; (&lt;a href=&#34;#ref-Warren_2018&#34; role=&#34;doc-biblioref&#34;&gt;2018&lt;/a&gt;)&lt;/span&gt;. We start with the &lt;code&gt;olaps&lt;/code&gt;
helper function, which calculates the statistics of interest:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ggplot2)
library(grid)

olaps &amp;lt;- function(sp1, sp2){
  ## Calculate Schoener&amp;#39;s D, Warren&amp;#39;s I, and Spearman
  ## Correlation for sp1 and sp2

  ## sp1 and sp2 are the relative occupancy values for each
  ## species along the same environmental gradient

  ## rescale the values for each species so they sum to 1
  sp1 &amp;lt;- sp1/sum(sp1)
  sp2 &amp;lt;- sp2/sum(sp2)
  
  plot.table &amp;lt;- data.frame(
    species = c(rep(&amp;quot;sp1&amp;quot;, length(sp1)),
                rep(&amp;quot;sp2&amp;quot;, length(sp2))),
    env = c(seq_along(sp1), seq_along(sp2)),
    suitability = c(sp1, sp2))

  D = 1 - sum(abs(sp1 - sp2))/2
  I = 1 - sum((sqrt(sp1) - sqrt(sp2))^2)/2
  cor = cor(sp1, sp2, method = &amp;quot;spearman&amp;quot;)

  grob &amp;lt;- grobTree(textGrob(paste(&amp;quot;D =&amp;quot;, round(D, 2),
                                 &amp;quot;  I =&amp;quot;, round(I, 2),
                                 &amp;quot;  Cor =&amp;quot;, round(cor, 2)),
                           x = 0.1,  y = 0.95, hjust = 0,
                           gp = gpar(fontsize = 15)))
  

  suitplot = qplot(env, suitability, data = plot.table,
                   col = species, geom = &amp;quot;line&amp;quot;) +
    annotation_custom(grob)

  return(list(
    D = D, I = I, cor = cor, suitplot = suitplot
  ))
  
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can recreate the examples from &lt;span class=&#34;citation&#34;&gt;&lt;a href=&#34;#ref-Warren_2018&#34; role=&#34;doc-biblioref&#34;&gt;Warren&lt;/a&gt; (&lt;a href=&#34;#ref-Warren_2018&#34; role=&#34;doc-biblioref&#34;&gt;2018&lt;/a&gt;)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sp1 &amp;lt;- seq(0.1, 1.0, 0.001)
sp2 &amp;lt;- seq(0.1, 1.0, 0.001)

olaps(sp1, sp2)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:example-1&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://plantarum.ca/2021/12/02/schoenersd/index_files/figure-html/example-1-1.png&#34; alt=&#34;Identical Species&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Identical Species
&lt;/p&gt;
&lt;/div&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sp1 &amp;lt;- seq(0.1, 1.0, 0.001)
sp2 &amp;lt;- seq(1.0, 0.1, -0.001)

olaps(sp1, sp2)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:example-2&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://plantarum.ca/2021/12/02/schoenersd/index_files/figure-html/example-2-1.png&#34; alt=&#34;Species with Inverse Environmental Response&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2: Species with Inverse Environmental Response
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The point that &lt;span class=&#34;citation&#34;&gt;&lt;a href=&#34;#ref-Warren_2018&#34; role=&#34;doc-biblioref&#34;&gt;Warren&lt;/a&gt; (&lt;a href=&#34;#ref-Warren_2018&#34; role=&#34;doc-biblioref&#34;&gt;2018&lt;/a&gt;)&lt;/span&gt; was making is that two species may occupy more
or less similar locations along an environmental gradient, while having
very different &lt;em&gt;responses&lt;/em&gt; to that gradient. This isn’t a problem. But it
does highlight the importance of clearly articulating the question you are
asking in your research, and making sure that the analyses you choose are
actually answering that question.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;niche-overlap-vs-study-extent&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Niche Overlap vs Study Extent&lt;/h1&gt;
&lt;p&gt;Something else struck me reading Warren’s post. The toy examples he used
represent a very narrow slice of an environmental gradient; that is, the
portion where both species are present. Applying these analyses to global
patterns, as we do when comparing the distribution of invasive species in
their native and introduced range, and especially when we apply these
analyses to large numbers of species, we can (potentially) include much
broader gradients. And this can have a significant impact on these
statistics.&lt;/p&gt;
&lt;p&gt;Here’s an (ever so slightly) more realistic example to illustrate. We’ll
define our gradient over the range 0 to 50:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;env &amp;lt;- seq(0, 50, by = 0.01)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Then we’ll define two species, with partially overlapping ranges:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sp1 &amp;lt;- dnorm(env, mean = 22.5, 2)
sp2 &amp;lt;- dnorm(env, mean = 27.5, 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now compare the species ‘globally’:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;olaps(sp1, sp2)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:global-analysis&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://plantarum.ca/2021/12/02/schoenersd/index_files/figure-html/global-analysis-1.png&#34; alt=&#34;Global Analysis&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 3: Global Analysis
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;At this scale, their response to the gradient appears to be highly
correlated, while they have low niche overlap.&lt;/p&gt;
&lt;p&gt;If we zoom in a bit, and ‘trim’ off the lowest and highest 1000 values on
our gradient, we can emulate a ‘continental’ extent:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## ignore the lowest and highest 1000
## environmental values 
slice &amp;lt;- 1000:4000 

olaps(sp1[slice], sp2[slice])&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:continental-analysis&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://plantarum.ca/2021/12/02/schoenersd/index_files/figure-html/continental-analysis-1.png&#34; alt=&#34;Continental Analysis&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 4: Continental Analysis
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Correlation drops, but niche overlap remains identical. On reflection, this
makes sense. Locations where neither species is present get no weight in
the calculation of D, so dropping ‘empty’ gradient has no impact. On the
other hand, those locations do contribute to inflating correlation.&lt;/p&gt;
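&lt;p&gt;The same effect shows up in a stripped-down sketch. The two short vectors
and the zero padding are hypothetical, chosen only to make the point: here the
‘empty’ gradient is exactly zero, whereas the tails of the normal curves above
are merely very small. &lt;code&gt;schoener_d&lt;/code&gt; is a helper defined just for this
illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## helper for this sketch only: rescale to sum to 1, then
## compute Schoener&amp;#39;s D
schoener_d &amp;lt;- function(a, b) {
  a &amp;lt;- a / sum(a)
  b &amp;lt;- b / sum(b)
  1 - sum(abs(a - b)) / 2
}

## two species with perfectly opposite responses
core1 &amp;lt;- c(1, 2, 3, 4)
core2 &amp;lt;- c(4, 3, 2, 1)

## &amp;#39;empty&amp;#39; gradient on either side, where neither occurs
pad &amp;lt;- rep(0, 20)
wide1 &amp;lt;- c(pad, core1, pad)
wide2 &amp;lt;- c(pad, core2, pad)

## D is the same with or without the empty gradient
d_core &amp;lt;- schoener_d(core1, core2)
d_wide &amp;lt;- schoener_d(wide1, wide2)

## the correlation is not: the shared zeros dominate the ranks
cor_core &amp;lt;- cor(core1, core2, method = &amp;quot;spearman&amp;quot;)
cor_wide &amp;lt;- cor(wide1, wide2, method = &amp;quot;spearman&amp;quot;)

c(d_core = d_core, d_wide = d_wide,
  cor_core = cor_core, cor_wide = cor_wide)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The padded cells add nothing to the sum of absolute differences, so D is
unchanged, while the correlation flips from perfectly negative to strongly
positive.&lt;/p&gt;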
&lt;p&gt;Now what if we shift our focus, such that the distribution of our species
is not equally represented:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;slice &amp;lt;- 1000:2500 
olaps(sp1[slice], sp2[slice])&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:regional-analysis&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://plantarum.ca/2021/12/02/schoenersd/index_files/figure-html/regional-analysis-1.png&#34; alt=&#34;Regional Analysis&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 5: Regional Analysis
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Correlation jumps up, as both species increase together over most
of the sampled gradient. And with this particular slice, our niche overlap
is twice the ‘true’ value when we consider the full gradient.&lt;/p&gt;
&lt;p&gt;Finally, we can zoom in on the center of the gradient, where both species
are equally represented (although with inverse responses):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;slice &amp;lt;- 2000:3000
olaps(sp1[slice], sp2[slice])&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:contact-analysis&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://plantarum.ca/2021/12/02/schoenersd/index_files/figure-html/contact-analysis-1.png&#34; alt=&#34;Contact Zone&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 6: Contact Zone
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Correlation drops again, accurately reflecting the inverse pattern. And D
is back down close to the ‘true’ value. That’s ‘lucky’, as my toy species
have perfectly symmetrical distributions, so sufficiently large, symmetrical
regions around the mid-point between the two of them will give reasonably
accurate estimates.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;implications&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Implications&lt;/h1&gt;
&lt;p&gt;Why does this matter? If you’re interested in comparing the environmental
responses of two species, the results (correlation) can vary quite
dramatically depending on the extent of your study.&lt;/p&gt;
&lt;p&gt;On the other hand, Schoener’s D is robust to data that includes ‘too much’
of a gradient (i.e., extending beyond the region occupied by either
species). But it can be sensitive to undersampling a gradient, where the
relative occupancy of each species varies depending on how you set your
extent.&lt;/p&gt;
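&lt;p&gt;To see the sensitivity side by side, here is a sketch that recomputes D
for each of the extents used above. It rebuilds the toy species so it can be
run on its own; the &lt;code&gt;schoener_d&lt;/code&gt; helper and the names in the
&lt;code&gt;extents&lt;/code&gt; list are my own labels for this illustration, not part of
any package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## the same toy gradient and species as above
env &amp;lt;- seq(0, 50, by = 0.01)
sp1 &amp;lt;- dnorm(env, mean = 22.5, sd = 2)
sp2 &amp;lt;- dnorm(env, mean = 27.5, sd = 2)

## helper for this sketch only
schoener_d &amp;lt;- function(a, b) {
  a &amp;lt;- a / sum(a)
  b &amp;lt;- b / sum(b)
  1 - sum(abs(a - b)) / 2
}

## index ranges matching the slices used in this post
extents &amp;lt;- list(global      = seq_along(env),
                continental = 1000:4000,
                regional    = 1000:2500,
                contact     = 2000:3000)

## D for each extent
d_by_extent &amp;lt;- sapply(extents, function(s) {
  schoener_d(sp1[s], sp2[s])
})
round(d_by_extent, 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The global and continental values should agree closely, while the regional
slice, which cuts off most of the second species, should come out much
higher.&lt;/p&gt;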
&lt;p&gt;In other words, if you’re interested in the ‘underlying models’ that govern
species’ comparative distributions along a gradient, you need to be very
clear about the scope of the question, and how much of the environmental
gradient you sample. But if you want to quantify niche overlap (or relative
niche shift), then you want to include environments well beyond the regions
actually occupied by your study organisms.&lt;/p&gt;
&lt;p&gt;All of which is trivial to do when you get to create the species on the
computer, and much trickier when you need to infer the details from museum
records and climate rasters!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references csl-bib-body hanging-indent&#34;&gt;
&lt;div id=&#34;ref-AtwaterBarney_2021&#34; class=&#34;csl-entry&#34;&gt;
Atwater, D. Z., and J. N. Barney. 2021. Climatic niche shifts in 815 introduced plant species affect their predicted distributions. &lt;em&gt;Global Ecology and Biogeography&lt;/em&gt; 30: 1671–1684.
&lt;/div&gt;
&lt;div id=&#34;ref-Schoener_1968&#34; class=&#34;csl-entry&#34;&gt;
Schoener, T. W. 1968. The Anolis Lizards of Bimini: Resource Partitioning in a Complex Fauna. &lt;em&gt;Ecology&lt;/em&gt; 49: 704–726.
&lt;/div&gt;
&lt;div id=&#34;ref-Warren_2018&#34; class=&#34;csl-entry&#34;&gt;
Warren, D. 2018. Species In Space: Why add correlations for suitability scores? &lt;em&gt;Species In Space&lt;/em&gt;. Website &lt;a href=&#34;https://enmtools.blogspot.com/2018/10/why-add-correlations-for-suitability.html&#34;&gt;https://enmtools.blogspot.com/2018/10/why-add-correlations-for-suitability.html&lt;/a&gt; [accessed 2 December 2021].
&lt;/div&gt;
&lt;div id=&#34;ref-WarrenEtAl_2008&#34; class=&#34;csl-entry&#34;&gt;
Warren, D. L., R. E. Glor, and M. Turelli. 2008. Environmental Niche Equivalency Versus Conservatism: Quantitative Approaches to Niche Evolution. &lt;em&gt;Evolution&lt;/em&gt; 62: 2868–2883.
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Thinning Occurrence Records in R</title>
      <link>https://plantarum.ca/2021/10/26/r-gridsample/</link>
      <pubDate>Tue, 26 Oct 2021 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2021/10/26/r-gridsample/</guid>
      <description>


&lt;blockquote&gt;
&lt;p&gt;Note that this tutorial refers to the thinning method used in the old
version of the &lt;code&gt;rspatial.org&lt;/code&gt; tutorial, which used the &lt;code&gt;raster&lt;/code&gt; package
(along with &lt;code&gt;dismo&lt;/code&gt;) for the GIS computations. The &lt;code&gt;terra&lt;/code&gt; package will
shortly be replacing &lt;code&gt;raster&lt;/code&gt;, and all new code should use this instead.
The details of spatial thinning with &lt;code&gt;terra&lt;/code&gt; are presented in my &lt;a href=&#34;https://plantarum.ca/2023/07/28/ecospat-terra/#sampling-bias&#34;&gt;new
ecospat tutorial&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A common approach to reducing spatial bias in occurrence records is to
randomly select one (or a small number) of samples present in each cell in
the landscape. This uses the &lt;code&gt;gridSample&lt;/code&gt; function from the package &lt;code&gt;dismo&lt;/code&gt;
&lt;span class=&#34;citation&#34;&gt;(Hijmans et al. &lt;a href=&#34;#ref-HijmansEtAl_2017&#34;&gt;2017&lt;/a&gt;)&lt;/span&gt;, as described at
&lt;a href=&#34;https://rspatial.org/raster/sdm/2_sdm_occdata.html#sampling-bias&#34;&gt;RSpatial.org&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;However, the code presented at RSpatial uses a newly-created raster layer
to thin the records. This layer is based on the extent of your occurrence
data; even if you set the resolution to match the resolution of the
environmental rasters you use, they won’t necessarily be aligned. That
means the cells will be the same size, but the edges won’t line up.&lt;/p&gt;
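&lt;p&gt;One way to check for this kind of mismatch is to compare the resolution
and origin of the two rasters. This is a minimal sketch with two made-up
grids (the extents are hypothetical, not the WorldClim or &lt;em&gt;acaule&lt;/em&gt; data),
using &lt;code&gt;res&lt;/code&gt; and &lt;code&gt;origin&lt;/code&gt; from the &lt;code&gt;raster&lt;/code&gt; package:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(raster)

## hypothetical climate grid: 1-degree cells on whole degrees
clim &amp;lt;- raster(xmn = -70, xmx = -60, ymn = -20, ymx = -10,
               resolution = 1)

## hypothetical sampling grid built from an occurrence extent,
## with the same cell size but edges that fall between degrees
samp &amp;lt;- raster(xmn = -68.3, xmx = -62.3, ymn = -17.6,
               ymx = -12.6, resolution = 1)

## the cell sizes match ...
same_res &amp;lt;- isTRUE(all.equal(res(clim), res(samp)))

## ... but the origins differ, so the cell edges don&amp;#39;t line up
same_origin &amp;lt;- isTRUE(all.equal(origin(clim), origin(samp)))

c(same_res = same_res, same_origin = same_origin)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Two grids are only aligned when both the resolution and the origin
agree.&lt;/p&gt;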
&lt;p&gt;A consequence of this is that you might end up keeping more than one
sample, or removing all samples, from a single cell in your environmental
data, even after thinning to one sample per cell in your newly-created
raster. This led to some strange behaviour in one of my downstream
analyses, where the results for a small data set changed each time I reran
the analysis.&lt;/p&gt;
&lt;p&gt;I’ll demonstrate using the example from RSpatial:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dismo)
library(maptools)
library(sp)
library(raster)

wclim &amp;lt;- getData(&amp;quot;worldclim&amp;quot;, var = &amp;quot;bio&amp;quot;, res = 10,
                path = &amp;quot;../data&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning in getData(&amp;quot;worldclim&amp;quot;, var = &amp;quot;bio&amp;quot;, res = 10, path = &amp;quot;../data&amp;quot;): getData will be removed in a future version of raster
## . Please use the geodata package instead&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wc1 &amp;lt;- wclim[[1]]

## crop the climate data to speed up grid creation
wc1crop &amp;lt;- crop(wc1, extent(c(-70, -60),
                           ylim = c(-20, -10)))

data(acaule)
data(wrld_simpl)

acgeo &amp;lt;- subset(acaule, !is.na(lon) &amp;amp; !is.na(lat))
dups2 &amp;lt;- duplicated(acgeo[, c(&amp;#39;lon&amp;#39;, &amp;#39;lat&amp;#39;)])
acg &amp;lt;- acgeo[!dups2, ]
i &amp;lt;- acg$lon &amp;gt; 0 &amp;amp; acg$lat &amp;gt; 0
acg$lon[i] &amp;lt;- -1 * acg$lon[i]
acg$lat[i] &amp;lt;- -1 * acg$lat[i]
acg &amp;lt;- acg[acg$lon &amp;lt; -50 &amp;amp; acg$lat &amp;gt; -50, ]
coordinates(acg) &amp;lt;- ~lon+lat
crs(acg) &amp;lt;- crs(wrld_simpl)

plot(acg, pch = 20)
plot(wclim[[1]], legend = FALSE, add = TRUE)
points(acg, pch = 20)
plot(wrld_simpl, add=T, border=&amp;#39;blue&amp;#39;, lwd=2)
box()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/10/26/r-gridsample/index_files/figure-html/setup-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This is the same example from RSpatial. See the link above for more
details.&lt;/p&gt;
&lt;p&gt;Now we want to thin our records, such that we retain only one observation
for each cell in the WorldClim climate layer. To track what’s going on
here, I’ll zoom in on one of the crowded areas:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(acg, pch = 20, xlim = c(-68.5, -67),
     ylim = c(-17.5, -16.5))
plot(wc1crop, legend = FALSE, add = TRUE)
points(acg, pch = 20)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/10/26/r-gridsample/index_files/figure-html/zoom-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now let’s overlay the grid generated by the code from RSpatial:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;r &amp;lt;- raster(acg)
# set the resolution of the cells to the same as wclim
res(r) &amp;lt;- res(wclim)
# expand (extend) the extent of the RasterLayer a little
r &amp;lt;- extend(r, extent(r)+1)
p &amp;lt;- rasterToPolygons(r)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(acg, pch = 20, xlim = c(-68.5, -67),
     ylim = c(-17.5, -16.5))
plot(wc1crop, legend = FALSE, add = TRUE)
points(acg, pch = 20)
plot(p, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/10/26/r-gridsample/index_files/figure-html/gridPlot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Notice how the climate cells (the coloured squares) are offset from the
sampling grid (the black gridlines).&lt;/p&gt;
&lt;p&gt;Now we use this grid for &lt;code&gt;gridSample&lt;/code&gt;. I’ll plot a blue ring around the retained
occurrences; the records we drop are left as points:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)
acsel &amp;lt;- gridSample(acg, r, n=1)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(acg, pch = 20, xlim = c(-68.5, -67),
     ylim = c(-17.5, -16.5))
plot(wc1crop, legend = FALSE, add = TRUE)
points(acg, pch = 20)
plot(p, add = TRUE)
points(acsel, col = &amp;#39;blue&amp;#39;, cex = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/10/26/r-gridsample/index_files/figure-html/gridSample_plot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Take a close look at that last plot. Notice that there are climate cells
with no retained observations:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/10/26/r-gridsample/index_files/figure-html/gridMissing-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;As well as climate cells with multiple observations:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/10/26/r-gridsample/index_files/figure-html/gridExtra-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This is fine if you are using the occurrence records in a purely spatial
analysis (i.e., without incorporating climate data for each observation).
But if you are intending to retain at most one observation for every cell
in your climate map, this is not what you were hoping for.&lt;/p&gt;
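&lt;p&gt;You don’t have to find these cases by eye. A sketch of a numerical check,
assuming a small made-up grid and set of points (not the data above), is to
look up the climate cell for each retained record with &lt;code&gt;cellFromXY&lt;/code&gt; and
test for repeats:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(raster)

## hypothetical 4 x 4 climate grid with 1-unit cells
clim &amp;lt;- raster(xmn = 0, xmx = 4, ymn = 0, ymx = 4,
               resolution = 1)

## hypothetical retained records; the first two fall in the
## same climate cell
pts &amp;lt;- cbind(x = c(0.2, 0.8, 1.5, 3.9),
             y = c(0.2, 0.7, 2.5, 3.1))

## cell number in the climate grid for each record
cells &amp;lt;- cellFromXY(clim, pts)

## TRUE if any climate cell holds more than one record
has_repeats &amp;lt;- any(duplicated(cells))
has_repeats

## number of records in each occupied cell
table(cells)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With real data you would pass the climate layer and the thinned records
in place of &lt;code&gt;clim&lt;/code&gt; and &lt;code&gt;pts&lt;/code&gt;.&lt;/p&gt;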
&lt;p&gt;A better way to achieve our desired result is to use the climate layer
directly in &lt;code&gt;gridSample&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)
acselClimate &amp;lt;- gridSample(acg, wc1, n=1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s visualize the grid we sampled on:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(acg, pch = 20, xlim = c(-68.5, -67),
     ylim = c(-17.5, -16.5))
plot(wc1crop, legend = FALSE, add = TRUE)
points(acg, pch = 20)

## using a cropped layer because this is a slow operation:
climGrid &amp;lt;- rasterToPolygons(wc1crop)
plot(climGrid, add = TRUE)
points(acselClimate, col = &amp;#39;blue&amp;#39;, cex = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/10/26/r-gridsample/index_files/figure-html/climateGrid-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Everything matches up as we expect now: the sampling grids are perfectly
aligned with the climate cells.&lt;/p&gt;
&lt;p&gt;This approach will only work if your climate raster covers the full extent
of your occurrence records. Which it really should - if it doesn’t, the
records that aren’t covered will end up getting dropped from your analysis
since there’s no climate data at those locations.&lt;/p&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references&#34;&gt;
&lt;div id=&#34;ref-HijmansEtAl_2017&#34;&gt;
&lt;p&gt;Hijmans, Robert J., Steven Phillips, John Leathwick, and Jane Elith. 2017. “Dismo R Package Version 1.1-4.” &lt;a href=&#34;https://CRAN.R-project.org/package=dismo&#34;&gt;https://CRAN.R-project.org/package=dismo&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Emacs for Bioinformatics #4: RMarkdown</title>
      <link>https://plantarum.ca/2021/10/03/emacs-tutorial-rmarkdown/</link>
      <pubDate>Sun, 03 Oct 2021 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2021/10/03/emacs-tutorial-rmarkdown/</guid>
      <description>



&lt;p&gt;This is part four in my series of Emacs tutorials aimed at bioinformatics
(and other scientific analysis) workflows. See the rest on my
&lt;a href=&#34;https://plantarum.ca/tutorials/&#34;&gt;tutorials&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;Emacs provides full support for editing
&lt;a href=&#34;https://rmarkdown.rstudio.com/&#34;&gt;RMarkdown&lt;/a&gt; documents. RMarkdown has
extensive documentation, both at the previous RStudio link, and several
free online books by Xie et al. (notably &lt;a href=&#34;https://bookdown.org/yihui/rmarkdown/&#34;&gt;R Markdown: The Definitive
Guide&lt;/a&gt;, but also several others
listed on &lt;a href=&#34;https://bookdown.org/yihui/rmarkdown/&#34;&gt;Yihui Xie’s Bookdown
page&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;Most of these references assume you are using the
&lt;a href=&#34;https://rstudio.com/&#34;&gt;RStudio&lt;/a&gt; development environment. The purpose of
this tutorial is to get you started editing RMarkdown documents in Emacs.&lt;/p&gt;
&lt;div id=&#34;installation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Installation&lt;/h1&gt;
&lt;div id=&#34;prerequisites&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;You need to have &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; installed, of course. You
will also need &lt;a href=&#34;https://pandoc.org/&#34;&gt;Pandoc&lt;/a&gt; in order to take full
advantage of all the output options available. If you want to create PDF
documents, you’ll need &lt;a href=&#34;https://www.latex-project.org/&#34;&gt;LaTeX&lt;/a&gt; as well.&lt;/p&gt;
&lt;p&gt;All three of these programs are provided in the package repositories for
most major Linux distributions. See the links above for instructions for
installing on Windows or Apple computers.&lt;/p&gt;
&lt;p&gt;You will also need to install the &lt;code&gt;rmarkdown&lt;/code&gt; R package. You can do this
from within R via:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(&amp;quot;rmarkdown&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will also install the other R requirements, notably the
&lt;a href=&#34;https://yihui.org/knitr/&#34;&gt;knitr&lt;/a&gt; package.&lt;/p&gt;
&lt;p&gt;The &lt;a href=&#34;https://bookdown.org/yihui/rmarkdown/&#34;&gt;bookdown&lt;/a&gt; package provides some
more advanced citation features. I won’t discuss them in this short
tutorial, but in order to use them you need to install that package too:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;install.packages(&amp;quot;bookdown&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;emacs-packages&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Emacs Packages&lt;/h2&gt;
&lt;p&gt;We need a few additional Emacs packages to comfortably edit RMarkdown
documents. These are:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;a href=&#34;https://github.com/jrblevin/markdown-mode&#34;&gt;Markdown Mode&lt;/a&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;The major mode for editing files in markdown format. &lt;strong&gt;This tutorial uses
features added after 6 January 2021.&lt;/strong&gt;&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;a href=&#34;https://ess.r-project.org/&#34; title=&#34;ESS&#34;&gt;ESS&lt;/a&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;The collection of modes for editing R code and interacting with the R
program.&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;a href=&#34;https://polymode.github.io/&#34;&gt;poly-R (Polymode)&lt;/a&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;&lt;code&gt;polymode&lt;/code&gt; is a ‘glue’ mode. The &lt;code&gt;poly-R&lt;/code&gt; variant extends markdown mode
to allow us to edit embedded code snippets in R (and other languages too)&lt;/p&gt;
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;(&lt;code&gt;poly-R&lt;/code&gt; also supports files in &lt;code&gt;.Rnw&lt;/code&gt; format, which mix LaTeX and R
code. We won’t cover that here)&lt;/p&gt;
&lt;p&gt;&lt;code&gt;polymode&lt;/code&gt; started out as a collection of modes to support files with
different combinations of languages. As it has grown, many of those
different modes have been split out into separate packages. When we
install &lt;code&gt;poly-R&lt;/code&gt;, it will automatically install the core of the
&lt;code&gt;polymode&lt;/code&gt; system for us.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;This tutorial uses features added after 29 September 2021.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;As in previous tutorials, (see &lt;a href=&#34;https://plantarum.ca/2020/12/30/emacs-tutorial-03/&#34;&gt;my
blog&lt;/a&gt; or the &lt;a href=&#34;https://www.youtube.com/watch?v=So1LYzSk9o0&#34;&gt;demo on
Youtube&lt;/a&gt;), we can install all
three of these packages from &lt;a href=&#34;https://melpa.org/#/&#34;&gt;MELPA&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Once we have the required packages installed, no further configuration
should be necessary. When we next open a file with a &lt;code&gt;.Rmd&lt;/code&gt; extension,
Emacs will know to use the &lt;code&gt;poly-markdown+R-mode&lt;/code&gt; for these files. If
everything is working properly, you’ll see &lt;code&gt;Markdown PM-Rmd&lt;/code&gt; in the
modeline at the bottom of the window for these files, and &lt;code&gt;Markdown&lt;/code&gt;,
&lt;code&gt;RMarkdown&lt;/code&gt;, and &lt;code&gt;Polymode&lt;/code&gt; menus at the top of the Emacs frame.&lt;/p&gt;
&lt;div id=&#34;configuration-note&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Configuration Note&lt;/h3&gt;
&lt;p&gt;Depending on how you have installed &lt;code&gt;poly-R&lt;/code&gt;, it may be loaded
automatically, or you might need to load it yourself in your config. If it
isn’t loaded automatically, you might see errors like &lt;code&gt;(void-function poly-gfm+r-mode)&lt;/code&gt; when you try to open an RMarkdown file.&lt;/p&gt;
&lt;p&gt;You can fix this by adding the following line to your Emacs config.
The location isn’t critical, but it’s probably most convenient to put it at
the beginning of any configuration you use for ESS/R/Markdown.&lt;/p&gt;
&lt;pre class=&#34;lisp&#34;&gt;&lt;code&gt;(require &amp;#39;poly-R)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;github-flavoured-markdown-and-code-blocks&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Github Flavoured Markdown and Code Blocks&lt;/h3&gt;
&lt;p&gt;Markdown mode supports several different options for code blocks. To take
full advantage of the RMarkdown support provided by the &lt;code&gt;rmarkdown&lt;/code&gt; R
package, we need to use fenced code blocks, along with language strings
wrapped in braces&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;. I’ll explain this in more detail below.&lt;/p&gt;
&lt;p&gt;This variant of markdown is referred to as “Github Flavoured Markdown”, and
the &lt;code&gt;markdown-mode&lt;/code&gt; package provides &lt;code&gt;gfm-mode&lt;/code&gt; with a few extra features
particular to it. To turn on &lt;code&gt;gfm-mode&lt;/code&gt; for Rmd files, add the
following line to your Emacs configuration&lt;a href=&#34;#fn2&#34; class=&#34;footnote-ref&#34; id=&#34;fnref2&#34;&gt;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;:&lt;/p&gt;
&lt;pre class=&#34;lisp&#34;&gt;&lt;code&gt;;; associate the new polymode to Rmd files:
(add-to-list &amp;#39;auto-mode-alist
             &amp;#39;(&amp;quot;\\.[rR]md\\&amp;#39;&amp;quot; . poly-gfm+r-mode))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You will also need the following line, if you want &lt;code&gt;gfm-mode&lt;/code&gt; to
automatically insert braces for code blocks (described below):&lt;/p&gt;
&lt;pre class=&#34;lisp&#34;&gt;&lt;code&gt;;; uses braces around code block language strings:
(setq markdown-code-block-braces t)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This will switch you from using &lt;code&gt;poly-markdown+R-mode&lt;/code&gt; to
&lt;code&gt;poly-gfm+r-mode&lt;/code&gt;, which shows up in your modeline as “PM-Rmd(gfm)”. The two
are nearly identical; the main difference is the support for fenced code
blocks.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;rmarkdown&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;RMarkdown&lt;/h1&gt;
&lt;div id=&#34;editing-markdown&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Editing Markdown&lt;/h2&gt;
&lt;p&gt;&lt;a href=&#34;https://www.markdownguide.org/basic-syntax/&#34;&gt;Markdown syntax&lt;/a&gt; is designed
to be easily entered by hand, which means if you’re already familiar with
the format you can just get going. Markdown mode will provide you with
syntax highlighting automatically:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;markdown-mode.jpg&#34; alt=&#34;A Markdown Mode buffer showing syntax highlighting&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;A Markdown Mode buffer showing syntax highlighting&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Of course, there are lots of shortcuts available. You can explore the most
frequently used in the &lt;code&gt;Markdown&lt;/code&gt; menu. I’ll summarize some of the main
ones here to get you started.&lt;/p&gt;
&lt;div id=&#34;headings&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Headings&lt;/h3&gt;
&lt;p&gt;Markdown has two different kinds of headings. The “Atx” style uses &lt;code&gt;#&lt;/code&gt;
symbols at the beginning of the heading, and, optionally, also at the end:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
# Heading Level 1

## Heading Level 2 ##
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can insert these headings with the following commands:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;C-c C-s h&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;insert a heading at the same level as the previous heading.&lt;/p&gt;
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;If the &lt;code&gt;region&lt;/code&gt; is active, the contents of the region will be used as the
header text. If &lt;code&gt;point&lt;/code&gt; is on a line with text, the line will be
converted into a header. Otherwise, an empty header will be created.&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;C-c C-s {1-6}&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;insert a heading at the specified level, e.g., &lt;code&gt;C-c C-s 3&lt;/code&gt; inserts a
third-level heading. &lt;code&gt;region&lt;/code&gt; and &lt;code&gt;point&lt;/code&gt; can be used to set the heading
text as for the previous command.&lt;/p&gt;
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;You can manipulate headings with the following commands:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;C-c &amp;lt;up&amp;gt;&lt;/code&gt; and &lt;code&gt;C-c &amp;lt;down&amp;gt;&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;move a heading and all of its content up or down in the document.&lt;/p&gt;
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;i.e., turn this:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;markdown-move1.jpg&#34; alt=&#34;Markdown headings in original order&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Markdown headings in original order&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;into this:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;markdown-move2.jpg&#34; alt=&#34;Markdown headings with subheading 2 ahead of subheading 1&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Markdown headings with subheading 2 ahead of subheading 1&lt;/p&gt;
&lt;/div&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;C-c &amp;lt;left&amp;gt;&lt;/code&gt; and &lt;code&gt;C-c &amp;lt;right&amp;gt;&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;promote or demote a heading.&lt;/p&gt;
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;i.e., turn this:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;markdown-move1.jpg&#34; alt=&#34;Markdown headings in hierarchy&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Markdown headings in hierarchy&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;into this (and vice versa):&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;demotion.jpg&#34; alt=&#34;Markdown with subheading 2 demoted&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Markdown with subheading 2 demoted&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;If you prefer asymmetric headings (i.e., with &lt;code&gt;#&lt;/code&gt; symbols only at the
beginning of the line), you can configure this by setting the variable
&lt;code&gt;markdown-asymmetric-header&lt;/code&gt; to &lt;code&gt;t&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;lisp&#34;&gt;&lt;code&gt;;; set in your ~/.emacs or ~/.emacs.d/init.el
(setq markdown-asymmetric-header t)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Alternatively, you can do this via &lt;code&gt;M-x customize-variable markdown-asymmetric-header&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Markdown mode also supports the &lt;code&gt;setext&lt;/code&gt; style headings:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;
Heading Level 1
===============

Heading Level 2
---------------
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Only two levels are supported, which you can insert automatically with the
commands &lt;code&gt;C-c C-s !&lt;/code&gt; (level 1) and &lt;code&gt;C-c C-s @&lt;/code&gt; (level 2).&lt;/p&gt;
&lt;div id=&#34;heading-navigation&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Heading Navigation&lt;/h4&gt;
&lt;p&gt;You can move from heading to heading with the following commands:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;C-c C-n&lt;/code&gt; and &lt;code&gt;C-c C-p&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;move to next and previous headings&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;C-c C-f&lt;/code&gt; and &lt;code&gt;C-c C-b&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;move forward and backward to headings at the same level&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;C-c C-u&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;move up to parent heading&lt;/p&gt;
&lt;/dd&gt;
&lt;/dl&gt;
&lt;/div&gt;
&lt;div id=&#34;heading-visibility&#34; class=&#34;section level4&#34;&gt;
&lt;h4&gt;Heading Visibility&lt;/h4&gt;
&lt;p&gt;You can hide and show different sections in documents by pressing the
&lt;code&gt;&amp;lt;TAB&amp;gt;&lt;/code&gt; key with point on a heading. For example, with point on the &lt;code&gt;# Installation&lt;/code&gt; heading, when I press &lt;code&gt;&amp;lt;TAB&amp;gt;&lt;/code&gt; I move from this:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;unhidden.jpg&#34; alt=&#34;Markdown buffer with all sections visible&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Markdown buffer with all sections visible&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;to this:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;hidden.jpg&#34; alt=&#34;Markdown buffer with the Installation section hidden&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;Markdown buffer with the Installation section hidden&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;This doesn’t change any of the text in your file; it only hides the parts
you don’t want to see. Repeatedly pressing the &lt;code&gt;&amp;lt;TAB&amp;gt;&lt;/code&gt; key will cycle
through the various levels of hiding and showing.&lt;/p&gt;
&lt;p&gt;To change the visibility of all headings at once, use &lt;code&gt;Shift-&amp;lt;TAB&amp;gt;&lt;/code&gt;.
You can use this to collapse your
entire document to a table of contents:&lt;/p&gt;
&lt;div class=&#34;figure&#34;&gt;
&lt;img src=&#34;toc.jpg&#34; alt=&#34;A Markdown buffer with only headings visible&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;A Markdown buffer with only headings visible&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;links-and-images&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Links and Images&lt;/h3&gt;
&lt;p&gt;Inserting links is done with &lt;code&gt;C-c C-l&lt;/code&gt;. Emacs will first prompt you for the
link URL, followed by the link text, and finally the tooltip text. Only the
URL is required. To open a link from Emacs, use &lt;code&gt;C-c C-o&lt;/code&gt;, which will take
you to the webpage in your browser.&lt;/p&gt;
&lt;p&gt;Images are handled similarly, and are inserted with &lt;code&gt;C-c &amp;lt;TAB&amp;gt;&lt;/code&gt; or &lt;code&gt;C-c C-i&lt;/code&gt;. The URL can be a web resource (e.g.,
&lt;code&gt;https://my-images.ca/image1.jpg&lt;/code&gt;), or a local file (e.g.,
&lt;code&gt;./images/image1.jpg&lt;/code&gt;). The image will appear as:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;![Image Caption](image URL)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can toggle displaying the actual image in the buffer with &lt;code&gt;C-c C-x C-i&lt;/code&gt;.&lt;/p&gt;
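&lt;p&gt;A couple of variables control how inline images are displayed. This is a
sketch that assumes a reasonably recent &lt;code&gt;markdown-mode&lt;/code&gt; providing both
variables; the size limit is an arbitrary value chosen for illustration:&lt;/p&gt;
&lt;pre class=&#34;lisp&#34;&gt;&lt;code&gt;;; cap inline images at (max-width . max-height), in pixels
;; (600 x 400 is just an example value):
(setq markdown-max-image-size &amp;#39;(600 . 400))
;; also fetch and display images that are given as web URLs:
(setq markdown-display-remote-images t)&lt;/code&gt;&lt;/pre&gt;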
&lt;/div&gt;
&lt;div id=&#34;tables&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Tables&lt;/h3&gt;
&lt;p&gt;To insert a new table, use the command &lt;code&gt;C-c C-s t&lt;/code&gt;. You will be prompted
for the number of rows and columns, and the alignment you want. When you’re
done, you’ll have a proper markdown table ready to edit:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;|   |   |   |   |
|---|---|---|---|
|   |   |   |   |
|   |   |   |   |
|   |   |   |   |
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With point in any of the cells, you can &lt;code&gt;&amp;lt;TAB&amp;gt;&lt;/code&gt; into the next cell, or
&lt;code&gt;Shift-&amp;lt;TAB&amp;gt;&lt;/code&gt; into the previous cell. Each time you hit tab the cells will
resize automatically to accommodate your text.&lt;/p&gt;
&lt;p&gt;Additional commands are available for moving, adding and deleting rows and
columns; see the &lt;code&gt;Markdown -&amp;gt; Tables&lt;/code&gt; menu for the options.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;other-markup&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Other Markup&lt;/h3&gt;
&lt;p&gt;Markdown mode also provides shortcuts for other markup elements. See the
&lt;code&gt;Markdown&lt;/code&gt; menu for some of the options. I find most of the basics (bold,
emphasis, unordered lists) are just as fast to type by hand as they are to
insert using shortcuts.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;working-with-r-code&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Working with R Code&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;rmarkdown&lt;/code&gt; uses fenced code blocks with braces around the language string, e.g.:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;    ```{R code-block-example}
    ## R code goes here!
    1 + 1
    ```&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you’ve set up &lt;code&gt;gfm-mode&lt;/code&gt; as described above, you can create one of these
code blocks with the command &lt;code&gt;markdown-insert-gfm-code-block&lt;/code&gt;, bound to
&lt;code&gt;C-c C-s C&lt;/code&gt; by default. Alternatively, simply entering three “`” characters
at the beginning of a line will call the function for you&lt;a href=&#34;#fn3&#34; class=&#34;footnote-ref&#34; id=&#34;fnref3&#34;&gt;&lt;sup&gt;3&lt;/sup&gt;&lt;/a&gt;. Either way,
you’ll be prompted for the language of the code block (which will be R most
of the time, but you can use others!). You can also add a label for the
code block at the prompt, and any additional options you want to use for
the chunk. You can also add options later if you change your mind.&lt;a href=&#34;#fn4&#34; class=&#34;footnote-ref&#34; id=&#34;fnref4&#34;&gt;&lt;sup&gt;4&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
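&lt;p&gt;If you would rather type the backquotes yourself and only insert chunks
on request, the electric behaviour can be switched off. A minimal sketch,
using the variable named in the footnote:&lt;/p&gt;
&lt;pre class=&#34;lisp&#34;&gt;&lt;code&gt;;; typing three backquotes no longer triggers the code block prompt;
;; C-c C-s C still inserts a block on request:
(setq markdown-gfm-use-electric-backquote nil)&lt;/code&gt;&lt;/pre&gt;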
&lt;p&gt;Once you have created a code block, &lt;code&gt;polymode&lt;/code&gt; will work its magic. You can
continue to edit the markdown portions of your document, with all the
features of &lt;code&gt;gfm-mode&lt;/code&gt;. But when point is in an R code block, you’ll be
editing it in &lt;code&gt;ESS[R]&lt;/code&gt; mode. That allows you to use all the features of
that package (see
&lt;a href=&#34;https://plantarum.ca/tutorials/emacs-tutorial-03/&#34;&gt;plantarum.ca&lt;/a&gt; for a
quick tutorial/refresher).&lt;/p&gt;
&lt;p&gt;Polymode provides some additional conveniences:&lt;/p&gt;
&lt;div id=&#34;navigation&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Navigation&lt;/h3&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;polymode-next-chunk&lt;/code&gt;/&lt;code&gt;polymode-previous-chunk&lt;/code&gt;, bound to &lt;code&gt;M-n C-n&lt;/code&gt; and &lt;code&gt;M-n C-p&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;move to the next/previous chunk, e.g., move from a Markdown chunk to
the next R code chunk.&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;polymode-next-chunk-same-type&lt;/code&gt;/&lt;code&gt;polymode-previous-chunk-same-type&lt;/code&gt;, bound to &lt;code&gt;M-n M-C-n&lt;/code&gt; and &lt;code&gt;M-n M-C-p&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;move to the next/previous chunk of the same type, e.g., move from one R
code chunk to the next R code chunk.&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;polymode-kill-chunk&lt;/code&gt;, bound to &lt;code&gt;M-n M-k&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;kill the current chunk&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;polymode-toggle-chunk-narrowing&lt;/code&gt;, bound to &lt;code&gt;M-n C-t&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;toggle narrowing the buffer to display only the current chunk, or to
display the entire document&lt;/p&gt;
&lt;/dd&gt;
&lt;/dl&gt;
&lt;/div&gt;
&lt;div id=&#34;evaluation&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Evaluation&lt;/h3&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;polymode-eval-region-or-chunk&lt;/code&gt;, bound to &lt;code&gt;M-n v&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;evaluate all code chunks in the active region, or the chunk at point if there
is no active region&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;polymode-eval-buffer&lt;/code&gt;, bound to &lt;code&gt;M-n b&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;evaluate all code chunks in the buffer&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;polymode-eval-buffer-from-beg-to-point&lt;/code&gt;/&lt;code&gt;polymode-eval-buffer-from-point-to-end&lt;/code&gt;, bound to &lt;code&gt;M-n u&lt;/code&gt; or &lt;code&gt;M-n ↑&lt;/code&gt;, and &lt;code&gt;M-n d&lt;/code&gt; or &lt;code&gt;M-n ↓&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;evaluate all code chunks from the beginning of the buffer to point (&lt;code&gt;u&lt;/code&gt; and &lt;code&gt;↑&lt;/code&gt;), or from point to the end of the buffer (&lt;code&gt;d&lt;/code&gt; and &lt;code&gt;↓&lt;/code&gt;)&lt;/p&gt;
&lt;/dd&gt;
&lt;/dl&gt;
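&lt;p&gt;If you evaluate chunks often, you may want a shorter binding of your own.
The following is only a sketch: it assumes that &lt;code&gt;polymode-mode-map&lt;/code&gt; is the
keymap shared by all polymodes and that the library provides the &lt;code&gt;polymode&lt;/code&gt;
feature, and &lt;code&gt;&amp;lt;f9&amp;gt;&lt;/code&gt; is an arbitrary key picked for illustration.&lt;/p&gt;
&lt;pre class=&#34;lisp&#34;&gt;&lt;code&gt;;; hypothetical extra binding; the default M-n v keeps working
(with-eval-after-load &amp;#39;polymode
  (define-key polymode-mode-map (kbd &amp;quot;&amp;lt;f9&amp;gt;&amp;quot;)
    #&amp;#39;polymode-eval-region-or-chunk))&lt;/code&gt;&lt;/pre&gt;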
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;exporting-rmarkdown&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Exporting RMarkdown&lt;/h2&gt;
&lt;p&gt;The most important ‘convenience’ of &lt;code&gt;polymode&lt;/code&gt; is that it connects Emacs to
the programs used to export RMarkdown files to presentation formats (i.e.,
pdf, html, slides). The main function you need for this is
&lt;code&gt;polymode-export&lt;/code&gt;, bound to &lt;code&gt;M-n e&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The first time you run this, you’ll be asked which exporter you would like
to use. There are two choices, &lt;code&gt;markdown&lt;/code&gt; and &lt;code&gt;markdown-ess&lt;/code&gt;. &lt;code&gt;markdown&lt;/code&gt;
means &lt;code&gt;polymode&lt;/code&gt; will start a new, self-contained R process and compile
your file there. When compilation is finished, the process will be closed.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;markdown-ess&lt;/code&gt; will use an existing R process, or start a new one if there
isn’t an active process available. When compilation is complete, the R
process remains active. This allows you to check the values of various
objects interactively. This can be useful as you develop a new script.&lt;/p&gt;
&lt;p&gt;RMarkdown files can be compiled to produce a variety of output formats. You
will be prompted to select which one you want the first time you run the
exporter. &lt;code&gt;polymode&lt;/code&gt; remembers this setting, so you don’t get prompted a
second time. If you want to switch, say from pdf output to html, you can
reset the target via &lt;code&gt;C-u M-n e&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://github.com/jrblevin/markdown-mode/pull/581&#34;&gt;Feature added 6 January
2021&lt;/a&gt;.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn2&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://github.com/polymode/poly-R/pull/27&#34;&gt;Feature added 29 September
2021&lt;/a&gt;.&lt;a href=&#34;#fnref2&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn3&#34;&gt;&lt;p&gt;By default; you can turn this feature off by setting the variable
&lt;code&gt;markdown-gfm-use-electric-backquote&lt;/code&gt; to nil.&lt;a href=&#34;#fnref3&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li id=&#34;fn4&#34;&gt;&lt;p&gt;I’m working on tab-completion for R chunk options, but haven’t
decided how best to set it up yet. Watch this space!&lt;a href=&#34;#fnref4&#34; class=&#34;footnote-back&#34;&gt;↩&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Evaluating Invasion Stage with SDMs</title>
      <link>https://plantarum.ca/2021/08/11/invasion-stage/</link>
      <pubDate>Wed, 11 Aug 2021 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2021/08/11/invasion-stage/</guid>
      <description>
&lt;script src=&#34;https://plantarum.ca/2021/08/11/invasion-stage/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;This is my attempt to recreate the invasion stage analysis developed by
&lt;span class=&#34;citation&#34;&gt;&lt;a href=&#34;#ref-GallienEtAl_2012&#34; role=&#34;doc-biblioref&#34;&gt;Gallien et al.&lt;/a&gt; (&lt;a href=&#34;#ref-GallienEtAl_2012&#34; role=&#34;doc-biblioref&#34;&gt;2012&lt;/a&gt;)&lt;/span&gt;, inspired by seeing it applied by &lt;span class=&#34;citation&#34;&gt;&lt;a href=&#34;#ref-EckertEtAl_2020&#34; role=&#34;doc-biblioref&#34;&gt;Eckert et al.&lt;/a&gt; (&lt;a href=&#34;#ref-EckertEtAl_2020&#34; role=&#34;doc-biblioref&#34;&gt;2020&lt;/a&gt;)&lt;/span&gt;. We’ll
continue with the &lt;em&gt;Lythrum salicaria&lt;/em&gt; data from my tutorial on &lt;a href=&#34;https://plantarum.ca/2021/07/29/ecospat/&#34;&gt;niche
quantification analysis&lt;/a&gt;. Specifically, I’ll model
how the niche space this species occupies in its invaded range in North
America relates to its global niche.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ecospat)
library(raster)
library(rgbif)
library(maptools)
library(magrittr)
library(dismo)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;span class=&#34;citation&#34;&gt;&lt;a href=&#34;#ref-GallienEtAl_2012&#34; role=&#34;doc-biblioref&#34;&gt;Gallien et al.&lt;/a&gt; (&lt;a href=&#34;#ref-GallienEtAl_2012&#34; role=&#34;doc-biblioref&#34;&gt;2012&lt;/a&gt;)&lt;/span&gt; used an ensemble of SDMs, which is (should be) more
robust than applying a single approach. Nevertheless, for this short
tutorial, I’ll stick to Maxent. I’m also cutting a lot of corners with
respect to variable selection, model validation and other important steps.
See my &lt;a href=&#34;https://plantarum.ca/2020/06/15/maxent/&#34;&gt;Maxent notebook&lt;/a&gt; for pointers.&lt;/p&gt;
&lt;div id=&#34;niche-models&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Niche Models&lt;/h1&gt;
&lt;p&gt;We start by constructing SDMs for the global and North American
distribution of &lt;em&gt;L. salicaria&lt;/em&gt;.&lt;/p&gt;
&lt;div id=&#34;data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Data&lt;/h2&gt;
&lt;p&gt;We need occurrence data and environmental data, and we’ll need to create
background (pseudoabsence) samples.&lt;/p&gt;
&lt;p&gt;The occurrence data comes from GBIF, with details in my &lt;a href=&#34;https://plantarum.ca/2021/07/29/ecospat/&#34;&gt;previous
post&lt;/a&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;load(&amp;quot;../data/2021-07-29-ls-gbif-recs.Rda&amp;quot;)
lsOccs &amp;lt;- lsGBIF$data

coordinates(lsOccs) &amp;lt;- c(&amp;quot;decimalLongitude&amp;quot;,
                        &amp;quot;decimalLatitude&amp;quot;) 
  ## Set the projection
crs(lsOccs) &amp;lt;- &amp;#39;+proj=longlat +datum=WGS84&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## NOTE: rgdal::checkCRSArgs: no proj_defs.dat in PROJ.4 shared files&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data(wrld_simpl) # load the maptools worldmap

par(mar = c(0,0, 0, 0))
plot(wrld_simpl, border = &amp;quot;gray80&amp;quot;)
points(lsOccs, pch = 16, col = 2, cex = 0.3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/08/11/invasion-stage/index_files/figure-html/observation-data-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We’ll use the same climate data as well, sourced from WorldClim
&lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-FickHijmans_2017&#34; role=&#34;doc-biblioref&#34;&gt;Fick and Hijmans 2017&lt;/a&gt;)&lt;/span&gt; and imported using functions provided in the &lt;code&gt;raster&lt;/code&gt;
&lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-Hijmans_2021&#34; role=&#34;doc-biblioref&#34;&gt;Hijmans 2021&lt;/a&gt;)&lt;/span&gt; package. Note that I use the &lt;code&gt;path&lt;/code&gt; argument to direct the
download to a particular location. This is the same location I used in the
previous post, and the data is still there, so it doesn’t get downloaded
again.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wclim &amp;lt;- getData(&amp;quot;worldclim&amp;quot;, var = &amp;quot;bio&amp;quot;, res = 10,
                path = &amp;quot;../data&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We need to define our study extent for selecting background points. I’ll
use a 200 km buffer around our observations. We’re working at the global
scale, and &lt;em&gt;Lythrum salicaria&lt;/em&gt; is a strong disperser, so a relatively large
scale is appropriate here. You’ll need to consider the aims of your own
study when setting your extent.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;studyExtent &amp;lt;- buffer(lsOccs, 200000, dissolve = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loading required namespace: rgeos&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## NOTE: rgdal::checkCRSArgs: no proj_defs.dat in PROJ.4 shared files
## NOTE: rgdal::checkCRSArgs: no proj_defs.dat in PROJ.4 shared files&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(wrld_simpl, border = &amp;quot;gray80&amp;quot;)
plot(studyExtent, col = &amp;#39;lightgreen&amp;#39;, add = TRUE)
points(lsOccs, pch = 16, col = 2, cex = 0.3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/08/11/invasion-stage/index_files/figure-html/extent-buffer-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The 200 km buffer creates some isolated pockets in North America. The
extent should represent the area the species can access. The buffer I made
includes the west coast inwards to Alberta, the east coast inwards to
Saskatchewan, with an isolated patch in the center of Canada which looks
like it’s at the Alberta/Saskatchewan border, with similar ‘islands’ in the
western US:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(wrld_simpl, border = &amp;quot;gray80&amp;quot;, xlim = c(-135, -90),
     ylim = c(45, 60))
plot(studyExtent, col = &amp;#39;lightgreen&amp;#39;, add = TRUE)
points(lsOccs, pch = 16, col = 2, cex = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/08/11/invasion-stage/index_files/figure-html/extent-buffer2-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Those islands are likely the leading edge of the same invasion, not
separate invasions! I’m going to increase our buffer to 300 km to capture
the intervening area on the map:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;studyExtent &amp;lt;- buffer(lsOccs, 300000, dissolve = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## NOTE: rgdal::checkCRSArgs: no proj_defs.dat in PROJ.4 shared files
## NOTE: rgdal::checkCRSArgs: no proj_defs.dat in PROJ.4 shared files&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(wrld_simpl, border = &amp;quot;gray80&amp;quot;)
plot(studyExtent, col = &amp;#39;lightgreen&amp;#39;, add = TRUE)
points(lsOccs, pch = 16, col = 2, cex = 0.3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/08/11/invasion-stage/index_files/figure-html/extent-buffer-plot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This is better. I prefer to use ecoregions to set study extent, but for the
purposes of this demo I’ll continue with this.&lt;/p&gt;
&lt;p&gt;One further issue: our study extent includes the ocean. Let’s trim it back
to the land:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;land &amp;lt;- aggregate(wrld_simpl) ## dissolve country borders

  ## clip buffer to land:
studyExtent &amp;lt;- intersect(studyExtent, land) 

plot(wrld_simpl, border = &amp;quot;gray80&amp;quot;)
plot(studyExtent, col = &amp;#39;lightgreen&amp;#39;, add = TRUE)
points(lsOccs, pch = 16, col = 2, cex = 0.3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/08/11/invasion-stage/index_files/figure-html/crop-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;(This generates some warnings, likely related to missing values in my data
or issues with the shapefile manipulations. It seems safe to proceed.)&lt;/p&gt;
&lt;p&gt;The aggregation is a bit rough, but that should work for my purposes today.
Now we can select our background points. I’m using 10000 points, and
excluding any cells with a &lt;em&gt;Lythrum salicaria&lt;/em&gt; occurrence.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;  ## Convert landmass polygon to a raster:
landMask &amp;lt;- rasterize(land, wclim)

  ## sample points from the raster:  
background &amp;lt;- randomPoints(landMask, n = 10000, p = lsOccs)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;global-sdm&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Global SDM&lt;/h2&gt;
&lt;p&gt;Now we can fit our Maxent model. To reduce bias, I’ll thin the samples to
5 observations per grid cell (ca. 20 km square). Normally I work on a
finer resolution (1 km&lt;sup&gt;2&lt;/sup&gt;), and thin to 1 observation per cell. Again, the
details depend on your study area and goals.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lsThin &amp;lt;- gridSample(lsOccs, wclim, n = 5) %&amp;gt;%
  as.data.frame

coordinates(lsThin) &amp;lt;-
  c(&amp;quot;decimalLongitude&amp;quot;, &amp;quot;decimalLatitude&amp;quot;)
glMax &amp;lt;- maxent(wclim, p = lsThin, a = background)
glPred &amp;lt;- predict(glMax, wclim)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;(Warnings here tell us that some of the occurrences are in locations where
there is no climate data. That’s normal, and not a problem as long as there
are only a few points lost this way. If you are working with small data
sets, you’ll want to investigate further to see if you can better match
your records with the climate rasters.)&lt;/p&gt;
&lt;p&gt;Here’s the model prediction:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(glPred)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/08/11/invasion-stage/index_files/figure-html/global-maxent-plot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;north-america-sdm&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;North America SDM&lt;/h2&gt;
&lt;p&gt;For the SDM in the invaded range in North America, we need to crop our
observations and background. Here I’m repeating the functions I used above
to create a raster mask for land, but applying it only to the area of
Canada, United States, and Mexico (our species isn’t in the Caribbean). I’m
using the pipe (&lt;code&gt;%&amp;gt;%&lt;/code&gt;) feature from &lt;code&gt;magrittr&lt;/code&gt;, which makes it easier to
follow the process.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;NA_polygon &amp;lt;- wrld_simpl %&amp;gt;%
  subset(NAME %in%
         c(&amp;quot;Canada&amp;quot;, &amp;quot;United States&amp;quot;, &amp;quot;Mexico&amp;quot;)) %&amp;gt;%
  aggregate()

NA_mask &amp;lt;- rasterize(NA_polygon, wclim)

NA_background &amp;lt;-
  randomPoints(NA_mask, n = 10000, p = lsOccs) %&amp;gt;%
  as.data.frame()

coordinates(NA_background) &amp;lt;- c(&amp;quot;x&amp;quot;, &amp;quot;y&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the original paper of &lt;span class=&#34;citation&#34;&gt;&lt;a href=&#34;#ref-GallienEtAl_2012&#34; role=&#34;doc-biblioref&#34;&gt;Gallien et al.&lt;/a&gt; (&lt;a href=&#34;#ref-GallienEtAl_2012&#34; role=&#34;doc-biblioref&#34;&gt;2012&lt;/a&gt;)&lt;/span&gt;, the background points were
weighted using the values from the global model. I don’t think
&lt;span class=&#34;citation&#34;&gt;&lt;a href=&#34;#ref-EckertEtAl_2020&#34; role=&#34;doc-biblioref&#34;&gt;Eckert et al.&lt;/a&gt; (&lt;a href=&#34;#ref-EckertEtAl_2020&#34; role=&#34;doc-biblioref&#34;&gt;2020&lt;/a&gt;)&lt;/span&gt; applied this weighting, and it’s not clear to me how to
do so with Maxent. For now I’ll skip it.&lt;/p&gt;
&lt;p&gt;For the occurrence records, I’ll take my previously thinned data, and crop
it to North America:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;  ## NA_polygon is the North America outline created above
lsNAThin &amp;lt;- intersect(lsThin, NA_polygon)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can construct the SDM for North America:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;naMax &amp;lt;- maxent(wclim, p = lsNAThin, a = NA_background)
naPred &amp;lt;- predict(naMax, wclim)
plot(naPred)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/08/11/invasion-stage/index_files/figure-html/NA-maxent-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Note that I didn’t crop the WorldClim layer for the North American SDM
model fitting. Maxent only uses the data for the presence and background
points, so it doesn’t matter if the climate layers cover the whole planet
for this step.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;invasion-stage-analysis&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Invasion Stage Analysis&lt;/h1&gt;
&lt;p&gt;Now that we have completed both a global and a local (North America) SDM
for &lt;em&gt;L. salicaria&lt;/em&gt;, we’re ready to compare the results.&lt;/p&gt;
&lt;div id=&#34;niche-space&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Niche Space&lt;/h2&gt;
&lt;p&gt;The values we need are the model predictions corresponding to each
observation in North America.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;globalVals &amp;lt;- extract(glPred, lsNAThin)
naVals &amp;lt;- extract(naPred, lsNAThin)

plot(naVals ~ globalVals, pch = 16, xlim = c(0, 1),
     ylim = c(0, 1),
     xlab = &amp;quot;Global model predictions&amp;quot;,
     ylab = &amp;quot;Regional model predictions&amp;quot;,
     col = &amp;quot;#00000050&amp;quot;)
abline(h = 0.5, lty = 2)
abline(v = 0.5, lty = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/08/11/invasion-stage/index_files/figure-html/model-predictions-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This plot compares the default Maxent output, the complementary log-log
value. This is an estimate of the probability of presence, which is more
appropriate than the other options for this kind of analysis (raw values
would be difficult to interpret). However, I’m not sure 50% is the most
appropriate value to use in the analysis that follows. &lt;span class=&#34;citation&#34;&gt;&lt;a href=&#34;#ref-EckertEtAl_2020&#34; role=&#34;doc-biblioref&#34;&gt;Eckert et al.&lt;/a&gt; (&lt;a href=&#34;#ref-EckertEtAl_2020&#34; role=&#34;doc-biblioref&#34;&gt;2020&lt;/a&gt;)&lt;/span&gt;
used &lt;code&gt;optim.thresh&lt;/code&gt; from the (now defunct) SDMTools package to determine
the best threshold for their study.&lt;/p&gt;
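&lt;p&gt;As a rough illustration of what choosing a threshold involves, here is a
minimal base R sketch that picks the cutoff maximizing sensitivity plus
specificity. This is one common criterion, not necessarily the one that
&lt;code&gt;optim.thresh&lt;/code&gt; or Eckert et al. used. The scores are simulated:
&lt;code&gt;presVals&lt;/code&gt; and &lt;code&gt;bgVals&lt;/code&gt; are hypothetical stand-ins for model predictions at
presence and background points.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Simulated suitability scores (hypothetical, for illustration only)
set.seed(42)
presVals &amp;lt;- rbeta(200, 5, 2)   # presences: mostly high suitability
bgVals &amp;lt;- rbeta(1000, 2, 5)    # background: mostly low suitability

## Candidate thresholds to evaluate
cutoffs &amp;lt;- seq(0.01, 0.99, by = 0.01)

## Sensitivity: proportion of presences at or above each cutoff
sens &amp;lt;- sapply(cutoffs, function(cutoff) mean(presVals &amp;gt;= cutoff))
## Specificity: proportion of background points below each cutoff
spec &amp;lt;- sapply(cutoffs, function(cutoff) mean(bgVals &amp;lt; cutoff))

## The cutoff with the largest sensitivity + specificity
bestThreshold &amp;lt;- cutoffs[which.max(sens + spec)]
bestThreshold&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With real data, the simulated vectors could presumably be replaced with
values extracted from the prediction raster at the occurrence and background
points.&lt;/p&gt;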
&lt;p&gt;Following &lt;span class=&#34;citation&#34;&gt;&lt;a href=&#34;#ref-GallienEtAl_2012&#34; role=&#34;doc-biblioref&#34;&gt;Gallien et al.&lt;/a&gt; (&lt;a href=&#34;#ref-GallienEtAl_2012&#34; role=&#34;doc-biblioref&#34;&gt;2012&lt;/a&gt;)&lt;/span&gt;, we interpret the four quadrants of this plot
as follows:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;Upper right&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;High suitability in both the local and global models. Observations here are
occupying locations that fall within both the global and invaded niche,
interpreted as ‘stabilizing.’&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;Upper left&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;High suitability in the local model, but low suitability in the global
model. Observations are occupying locations that are within the invaded
niche, but outside the global niche, interpreted as populations
demonstrating local adaptation.&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;Lower right&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;High suitability in the global model, but low suitability in the local
model. These are interpreted as regional colonizations: the conditions
here are within the global niche, but are only starting to be
occupied in the invaded range.&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;Lower left&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;Low suitability in both the local and global model. Presumably sink
populations (not likely to persist).&lt;/p&gt;
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;Let’s tabulate the results:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tally &amp;lt;- c(stabilizing
          = sum(globalVals &amp;gt;= 0.5 &amp;amp; naVals &amp;gt;= 0.5,
                na.rm = TRUE),
          adapting = sum(globalVals &amp;lt; 0.5 &amp;amp; naVals &amp;gt;= 0.5,
                         na.rm = TRUE),
          sinks = sum(globalVals &amp;lt; 0.5 &amp;amp; naVals &amp;lt; 0.5,
                      na.rm = TRUE),
          colonizing = sum(globalVals &amp;gt;= 0.5 &amp;amp; naVals &amp;lt; 0.5,
                           na.rm = TRUE))

barplot(tally, ylab = &amp;quot;Occurrences&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/08/11/invasion-stage/index_files/figure-html/niche-space-tally-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
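&lt;p&gt;If you want a stage label for each individual observation, rather than just
the totals, something like the following sketch might help. The function
&lt;code&gt;classifyStage&lt;/code&gt; and the example vectors are hypothetical names introduced
here for illustration; the logic mirrors the tally above, with the same 0.5
threshold as a default.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Hypothetical helper: assign each observation to a quadrant
classifyStage &amp;lt;- function(global, regional, threshold = 0.5) {
  stage &amp;lt;- ifelse(global &amp;gt;= threshold,
                  ifelse(regional &amp;gt;= threshold,
                         &amp;quot;stabilizing&amp;quot;, &amp;quot;colonizing&amp;quot;),
                  ifelse(regional &amp;gt;= threshold,
                         &amp;quot;adapting&amp;quot;, &amp;quot;sinks&amp;quot;))
  factor(stage, levels = c(&amp;quot;stabilizing&amp;quot;, &amp;quot;adapting&amp;quot;,
                           &amp;quot;sinks&amp;quot;, &amp;quot;colonizing&amp;quot;))
}

## Made-up predictions, one per quadrant, plus a missing value
exGlobal &amp;lt;- c(0.9, 0.2, 0.1, 0.8, NA)
exRegional &amp;lt;- c(0.7, 0.6, 0.3, 0.2, 0.5)

## Observations with a missing prediction get NA and drop out of the table
exStage &amp;lt;- classifyStage(exGlobal, exRegional)
table(exStage)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Applied to &lt;code&gt;globalVals&lt;/code&gt; and &lt;code&gt;naVals&lt;/code&gt;, the resulting factor could be used to
colour the points on a map by stage.&lt;/p&gt;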
&lt;p&gt;We can plot these regions on the map as well (apologies for the opaque
raster algebra; there should be a clearer way to calculate this, but I can’t
think of it at the moment).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;suitabilityThreshold &amp;lt;- 0.5
na_Niche &amp;lt;- naPred &amp;gt; suitabilityThreshold
gl_Niche &amp;lt;- glPred &amp;gt; suitabilityThreshold

stable_Niche &amp;lt;- (na_Niche + gl_Niche) == 2
expansion_Niche &amp;lt;- ((2 * na_Niche) - gl_Niche) == 2
contraction_Niche &amp;lt;- ((2 * gl_Niche) - na_Niche) == 2

NicheRaster &amp;lt;- stable_Niche + (2 * expansion_Niche) +
  (3 * contraction_Niche)

plot(NicheRaster, xlim = c(-140, -60), ylim = c(30, 70),
     col = c(&amp;quot;white&amp;quot;, &amp;quot;blue&amp;quot;, &amp;quot;red&amp;quot;, &amp;quot;green&amp;quot;),
     legend = FALSE)
plot(wrld_simpl, add = TRUE)
points(lsNAThin, pch = 16, cex = 0.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/08/11/invasion-stage/index_files/figure-html/map-comparisons-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In this plot, the blue depicts areas identified as suitable habitat in both
the global and regional model. The green is area identified as suitable
habitat in the global model, but not the North American model. There are
some occurrences in this area, but they aren’t as numerous as in the blue
regions. Finally, the red areas were identified by the North American model
as suitable habitat, but they were not part of the global model’s suitable
habitat. Following Gallien’s framework, any points in the white areas would be
‘sinks.’ I think it’s more likely that they’re the current leading edge of
the invasion front.&lt;/p&gt;
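&lt;p&gt;One possibly clearer way to write the raster algebra is to multiply the 0/1
layers directly, so that each line reads as ‘suitable here and not there.’
The sketch below uses plain logical vectors as stand-ins for the two niche
layers (the &lt;code&gt;_ex&lt;/code&gt; names are hypothetical). The same arithmetic should
carry over to raster layers, but it would be worth checking the result
against the original code before relying on it.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Stand-ins for na_Niche and gl_Niche: all four possible combinations
na_ex &amp;lt;- c(TRUE, TRUE, FALSE, FALSE)
gl_ex &amp;lt;- c(TRUE, FALSE, TRUE, FALSE)

## Suitable in both models
stable_ex &amp;lt;- na_ex * gl_ex
## Suitable in the regional model only
expansion_ex &amp;lt;- na_ex * (1 - gl_ex)
## Suitable in the global model only
contraction_ex &amp;lt;- gl_ex * (1 - na_ex)

## Same coding as NicheRaster: 0 = neither, 1 = stable,
## 2 = expansion, 3 = contraction
niche_ex &amp;lt;- stable_ex + (2 * expansion_ex) + (3 * contraction_ex)
niche_ex&lt;/code&gt;&lt;/pre&gt;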
&lt;p&gt;Obviously, there’s a lot going on here, and each of these steps will
warrant careful consideration and additional checks, validations, and
optimizations. I hope this simplified outline is enough to get you started.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references csl-bib-body hanging-indent&#34;&gt;
&lt;div id=&#34;ref-EckertEtAl_2020&#34; class=&#34;csl-entry&#34;&gt;
Eckert, Sandra, Amina Hamad, Charles Joseph Kilawe, Theo E. W. Linders, Wai‐Tim Ng, Purity Rima Mbaabu, Hailu Shiferaw, Arne Witt, and Urs Schaffner. 2020. &lt;span&gt;“Niche Change Analysis as a Tool to Inform Management of Two Invasive Species in Eastern Africa.”&lt;/span&gt; &lt;em&gt;Ecosphere&lt;/em&gt; 11 (2). &lt;a href=&#34;https://doi.org/10.1002/ecs2.2987&#34;&gt;https://doi.org/10.1002/ecs2.2987&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-FickHijmans_2017&#34; class=&#34;csl-entry&#34;&gt;
Fick, Stephen E., and Robert J. Hijmans. 2017. &lt;span&gt;“WorldClim 2: New 1-Km Spatial Resolution Climate Surfaces for Global Land Areas.”&lt;/span&gt; &lt;em&gt;International Journal of Climatology&lt;/em&gt; 37 (12): 4302–15. &lt;a href=&#34;https://doi.org/10.1002/joc.5086&#34;&gt;https://doi.org/10.1002/joc.5086&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-GallienEtAl_2012&#34; class=&#34;csl-entry&#34;&gt;
Gallien, Laure, Rolland Douzet, Steve Pratte, Niklaus E. Zimmermann, and Wilfried Thuiller. 2012. &lt;span&gt;“Invasive Species Distribution Models – How Violating the Equilibrium Assumption Can Create New Insights.”&lt;/span&gt; &lt;em&gt;Global Ecology and Biogeography&lt;/em&gt; 21 (11): 1126–36. &lt;a href=&#34;https://doi.org/10.1111/j.1466-8238.2012.00768.x&#34;&gt;https://doi.org/10.1111/j.1466-8238.2012.00768.x&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-Hijmans_2021&#34; class=&#34;csl-entry&#34;&gt;
Hijmans, Robert J. 2021. &lt;span&gt;“Raster: Geographic Data Analysis and Modeling.”&lt;/span&gt; Manual. &lt;a href=&#34;https://CRAN.R-project.org/package=raster&#34;&gt;https://CRAN.R-project.org/package=raster&lt;/a&gt;.
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Niche Quantification with Ecospat</title>
      <link>https://plantarum.ca/2021/07/29/ecospat/</link>
      <pubDate>Thu, 29 Jul 2021 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2021/07/29/ecospat/</guid>
      <description>


&lt;div id=&#34;update&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Update&lt;/h1&gt;
&lt;p&gt;NB: this tutorial is now out of date and depends on deprecated versions of
R packages! Please refer to the new version of my &lt;a href=&#34;https://plantarum.ca/2023/07/28/ecospat-terra&#34;&gt;ecospat
tutorial&lt;/a&gt; for the current packages and workflow
for this analysis.&lt;/p&gt;
&lt;p&gt;I’ll leave the old version below in case it’s of any interest, but it won’t
work on current versions of R.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;archived-version&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Archived Version&lt;/h1&gt;
&lt;p&gt;The &lt;code&gt;ecospat&lt;/code&gt; package &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-ColaEtAl_2017&#34;&gt;Cola et al. 2017&lt;/a&gt;)&lt;/span&gt; provides code to quantify and
compare the environmental and geographic niche of two species, or of the
same species in different contexts (e.g., in its native and invaded
ranges). The included vignette explains how to do such analyses.&lt;/p&gt;
&lt;p&gt;However, the vignette assumes you already have a matrix of occurrence
records, along with the climate data for each of those records. In our
work, we typically have to construct those matrices from observation data
(herbarium records, iNaturalist observations, etc) and climate rasters
&lt;span class=&#34;citation&#34;&gt;(e.g. &lt;a href=&#34;#ref-FickHijmans_2017&#34;&gt;Fick and Hijmans 2017&lt;/a&gt;)&lt;/span&gt;. This short tutorial will walk through the steps
necessary to do this.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;packages&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Packages&lt;/h1&gt;
&lt;p&gt;In addition to &lt;code&gt;ecospat&lt;/code&gt;, we’ll use &lt;code&gt;raster&lt;/code&gt; &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-Hijmans_2021&#34;&gt;Hijmans 2021&lt;/a&gt;)&lt;/span&gt; to download
WorldClim &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-FickHijmans_2017&#34;&gt;Fick and Hijmans 2017&lt;/a&gt;)&lt;/span&gt; rasters, and manipulate the spatial data;
&lt;code&gt;rgbif&lt;/code&gt; &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-ChamberlainEtAl_2021&#34;&gt;Chamberlain et al. 2021&lt;/a&gt;)&lt;/span&gt; to download GBIF records, and &lt;code&gt;maptools&lt;/code&gt;
&lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-BivandLewin-Koh_2021&#34;&gt;Bivand and Lewin-Koh 2021&lt;/a&gt;)&lt;/span&gt; to get a world basemap for plots.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(ecospat)
library(raster)
library(rgbif)
library(maptools)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;strong&gt;NB&lt;/strong&gt; there is a &lt;a href=&#34;https://github.com/ecospat/ecospat/issues/18&#34;&gt;bug in
ecospat&lt;/a&gt; that prevents us
from using the argument &lt;code&gt;geomask&lt;/code&gt; (see below). This has been fixed, but as
of 2021-07-29, the bug fix has not made it into the released package,
currently version 3.2. Consequently, you need to install directly from the
development sources:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(devtools)
install_github(repo = &amp;quot;ecospat/ecospat/ecospat&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Presumably this won’t be necessary for version 3.3 or newer (once
released).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Getting Data&lt;/h1&gt;
&lt;p&gt;We’ll start by sourcing our data. For observations, let’s take a look at
Purple Loosestrife, a wetland species that is native to Europe, and
invasive in North America. For actual research work, I normally download
the files directly from GBIF, and examine them carefully to check for
errors or missing data. For this demo we’ll use the &lt;code&gt;rgbif&lt;/code&gt; package to
download the data directly into R, and we’ll assume there are no problems.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lsGBIF &amp;lt;- occ_search(scientificName = &amp;quot;Lythrum salicaria&amp;quot;,
                    limit = 10000,
                    basisOfRecord = &amp;quot;Preserved_Specimen&amp;quot;,
                    hasCoordinate = TRUE,
                    fields = c(&amp;quot;decimalLatitude&amp;quot;,
                               &amp;quot;decimalLongitude&amp;quot;, &amp;quot;year&amp;quot;,
                               &amp;quot;country&amp;quot;, &amp;quot;countryCode&amp;quot;))

save(lsGBIF, file = &amp;quot;../data/2021-07-29-ls-gbif-recs.Rda&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This returned an object with 7969 records. I saved that locally, so that
I’m not making GBIF search their database every time I work on this demo.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;load(&amp;quot;../data/2021-07-29-ls-gbif-recs.Rda&amp;quot;)
lsOccs &amp;lt;- lsGBIF$data&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;lsGBIF$data&lt;/code&gt; is the table with the actual records in it. That’s what we’ll
be working with. The other components of &lt;code&gt;lsGBIF&lt;/code&gt; are metadata related to
the original GBIF search. That’s useful to have, but not needed for the
rest of this example.&lt;/p&gt;
&lt;p&gt;Next, we tell R which columns are the coordinates, which allows us to map
the observations. This also converts our observation matrix to a
&lt;code&gt;SpatialPointsDataFrame&lt;/code&gt; object.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;coordinates(lsOccs) &amp;lt;- c(&amp;quot;decimalLongitude&amp;quot;,
                        &amp;quot;decimalLatitude&amp;quot;) 
data(wrld_simpl) # load the maptools worldmap

par(mar = c(0,0, 0, 0))
plot(wrld_simpl, border = &amp;quot;gray80&amp;quot;)
points(lsOccs, pch = 16, col = 2, cex = 0.3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;To get our climate data, we can use raster’s &lt;code&gt;getData&lt;/code&gt; function. The first
time you call this function in a directory, it downloads the data from the
internet, and saves it locally. Subsequent calls will load your local copy
of the data, which is much faster. I’m using the coarsest resolution (10
minutes) to keep this demonstration quick:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;wclim &amp;lt;- getData(&amp;quot;worldclim&amp;quot;, var = &amp;quot;bio&amp;quot;, res = 10,
                path = &amp;quot;../data&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can take a look at one layer:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mar = c(0,0, 3, 1))
plot(wclim[[&amp;quot;bio1&amp;quot;]], main = &amp;quot;bio1&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we need to extract the environmental values from the climate rasters
for each of our observation records:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lsOccs &amp;lt;- cbind(lsOccs, extract(wclim, lsOccs))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the process of extracting &lt;code&gt;wclim&lt;/code&gt; values for our observations, we
usually end up with a few missing values. This is a consequence of
mismatches between the observation coordinates and the climate rasters. In
some cases, the observations are placed off the coast in the ocean, or in
another area where there is no climate data available. We need to exclude
these missing values from our analysis.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lsOccs &amp;lt;- lsOccs[complete.cases(data.frame(lsOccs)), ]&lt;/code&gt;&lt;/pre&gt;
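&lt;p&gt;To see what &lt;code&gt;complete.cases&lt;/code&gt; is doing here, consider this small sketch
with a made-up table (&lt;code&gt;exOccs&lt;/code&gt; and its values are hypothetical). Any row with a
missing value in any column is dropped:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Made-up records: two of the three have a missing climate value
exOccs &amp;lt;- data.frame(year = c(1990, 2005, 2012),
                     bio1 = c(85, NA, 102),
                     bio12 = c(900, 750, NA))

## TRUE only for rows with no missing values
complete.cases(exOccs)

## Keep only the complete rows
exClean &amp;lt;- exOccs[complete.cases(exOccs), ]
nrow(exClean)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;It’s worth comparing the number of rows before and after this step, so you
know how many records you’ve lost.&lt;/p&gt;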
&lt;/div&gt;
&lt;div id=&#34;splitting-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Splitting Data&lt;/h1&gt;
&lt;p&gt;At this point, all the data we need for the Niche Quantification analysis
is in &lt;code&gt;lsOccs&lt;/code&gt; and &lt;code&gt;wclim&lt;/code&gt;. We need to split this data into native and
invasive regions for our comparison. We’ll restrict ourselves to the
northern hemisphere north of 20 degrees, and consider all records from
Eurasia as native, and all records from North America as invasive.&lt;/p&gt;
&lt;p&gt;I’ve created extents to cover the rough outlines of the areas in question.
In practice, you could use a more carefully constructed vector map to split
your data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## North America: na
## Longitude from 40 to 180W, Latitude from 20 to 90N
naExt &amp;lt;- extent(c(-180, -40, 20, 90))
lsNA &amp;lt;- crop(lsOccs, naExt)

## Eurasia: ea
## Longitude from 40W to 180E, Latitude from 20 to 90N
eaExt &amp;lt;- extent(c(-40, 180, 20, 90))
lsEA &amp;lt;- crop(lsOccs, eaExt)

par(mar = c(1, 0, 0, 0))
plot(wrld_simpl, ylim = c(20, 80), axes = FALSE)
points(lsNA, pch = 16, col = &amp;#39;red&amp;#39;, cex = 0.5)
points(lsEA, pch = 16, col = &amp;#39;darkgreen&amp;#39;, cex = 0.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the Niche Quantification, we need to have a matrix with the background
environment present in the native and invasive ranges, as well as the
complete global environment, covering the combined extent of the native and
introduced environments. After cropping, we use &lt;code&gt;getValues&lt;/code&gt; to convert each
raster to a matrix.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Crop Climate Layers:
naEnvR &amp;lt;- crop(wclim, naExt)
eaEnvR &amp;lt;- crop(wclim, eaExt)

## Extract values to matrix:
naEnvM &amp;lt;- getValues(naEnvR)
eaEnvM &amp;lt;- getValues(eaEnvR)

## Clean out missing values:
naEnvM &amp;lt;- naEnvM[complete.cases(naEnvM), ]
eaEnvM &amp;lt;- eaEnvM[complete.cases(eaEnvM), ]

## Combined global environment:
globalEnvM &amp;lt;- rbind(naEnvM, eaEnvM)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;niche-quantification&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Niche Quantification&lt;/h1&gt;
&lt;div id=&#34;pca&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;PCA&lt;/h2&gt;
&lt;p&gt;The Niche Quantification analysis starts with a Principal Components
Analysis of the environmental data. The actual ordination uses the global
data, with the observation records and the native and invasive background
environment treated as supplemental rows.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pca.clim &amp;lt;- dudi.pca(globalEnvM, center = TRUE,
                    scale = TRUE, scannf = FALSE, nf = 2)
global.scores &amp;lt;- pca.clim$li

nativeLS.scores &amp;lt;-
  suprow(pca.clim,
         data.frame(lsEA)[, colnames(globalEnvM)])$li   
invasiveLS.scores &amp;lt;-
  suprow(pca.clim,
         data.frame(lsNA)[, colnames(globalEnvM)])$li

## Native range is Eurasia (ea); invaded range is North America (na)
nativeEnv.scores &amp;lt;- suprow(pca.clim, eaEnvM)$li
invasiveEnv.scores &amp;lt;- suprow(pca.clim, naEnvM)$li&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Let’s break that down. &lt;code&gt;dudi.pca&lt;/code&gt; runs a PCA on &lt;code&gt;globalEnvM&lt;/code&gt;,
which is a matrix of all the environmental variables over the entire study
area. We use that to create a two-dimensional summary of the total
environmental variability.&lt;/p&gt;
&lt;p&gt;Next, we map our observation data (&lt;code&gt;lsEA&lt;/code&gt; and &lt;code&gt;lsNA&lt;/code&gt;) into that
2-dimensional ordination, using the &lt;code&gt;suprow&lt;/code&gt; function. &lt;code&gt;lsEA&lt;/code&gt; and &lt;code&gt;lsNA&lt;/code&gt;
are &lt;code&gt;SpatialPointsDataFrame&lt;/code&gt; objects. Sometimes you can treat them as if
they were data.frames, but other times you need to explicitly convert them.
This is one of those times, hence I’ve wrapped them in &lt;code&gt;data.frame()&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Recall that &lt;code&gt;lsEA&lt;/code&gt; and &lt;code&gt;lsNA&lt;/code&gt; have more columns than the environmental
matrix: they also include &lt;code&gt;year&lt;/code&gt;, &lt;code&gt;countryCode&lt;/code&gt;, &lt;code&gt;country&lt;/code&gt;. We only want to
include the environmental variables when we project the observations into
the ordination. To make sure that we use the same variables as in the
original ordination of &lt;code&gt;globalEnvM&lt;/code&gt;, in the same order, I select the
columns explicitly to match that object:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;data.frame(lsEA)[, colnames(globalEnvM)]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The output of &lt;code&gt;dudi.pca&lt;/code&gt; and &lt;code&gt;suprow&lt;/code&gt; includes a lot of information that we
aren’t using here. We only need the &lt;code&gt;li&lt;/code&gt; element, so I’ve selected that
from each of the function outputs.&lt;/p&gt;
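&lt;p&gt;If the idea of supplemental rows is unfamiliar, this sketch shows the same
concept using base R’s &lt;code&gt;prcomp&lt;/code&gt; and &lt;code&gt;predict&lt;/code&gt; instead of &lt;code&gt;dudi.pca&lt;/code&gt; and
&lt;code&gt;suprow&lt;/code&gt;. It is an analogue, not the code used above, and the two approaches
may scale the scores slightly differently. The data are simulated, and all the
object names are hypothetical.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Simulated &amp;quot;global environment&amp;quot;: 100 cells, 3 climate variables
set.seed(1)
envAll &amp;lt;- matrix(rnorm(300), ncol = 3,
                 dimnames = list(NULL, c(&amp;quot;bio1&amp;quot;, &amp;quot;bio2&amp;quot;, &amp;quot;bio3&amp;quot;)))

## The ordination is fit to the global environment only
pcaEx &amp;lt;- prcomp(envAll, center = TRUE, scale. = TRUE)
globalEx &amp;lt;- pcaEx$x[, 1:2]

## Pretend the first five cells are our occurrence records
occEx &amp;lt;- envAll[1:5, ]

## Project them into the existing ordination without refitting it
occScores &amp;lt;- predict(pcaEx, newdata = occEx)[, 1:2]

## Projected rows land exactly where they fall in the original PCA
all.equal(unname(occScores), unname(globalEx[1:5, ]))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The key point is that the projected rows don’t influence the axes: they are
placed in a space defined entirely by the global environment.&lt;/p&gt;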
&lt;/div&gt;
&lt;div id=&#34;occurence-densities-grid&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Occurrence Densities Grid&lt;/h2&gt;
&lt;p&gt;Finally we’re ready to do the Niche Quantification/Comparisons. We’ll use
the PCA scores for the global environment, the native and invasive
environments, and the native and invasive occurrence records.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;nativeGrid &amp;lt;- ecospat.grid.clim.dyn(global.scores,
                                   nativeEnv.scores,
                                   nativeLS.scores)

invasiveGrid &amp;lt;- ecospat.grid.clim.dyn(global.scores,
                                   invasiveEnv.scores, 
                                   invasiveLS.scores)

ecospat.plot.niche.dyn(nativeGrid, invasiveGrid,
                       quant = 0.05) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The resulting plot shows us the environmental conditions present in Eurasia
(inside the green line) and North America (inside the red line). The green
area represents environments occupied by &lt;em&gt;Lythrum salicaria&lt;/em&gt; in Eurasia,
but not in North America, the red area shows environments occupied in North
America and not Eurasia, and the blue area shows environments occupied in
both ranges. We can also see that there are a few areas in Eurasia with
environments not present in North America, and vice versa. However, for the
most part, &lt;em&gt;Lythrum salicaria&lt;/em&gt; doesn’t occur in these environments (except
for a tiny bit of green in the center of the plot).&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;geographic-comparisons&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Geographic Comparisons&lt;/h1&gt;
&lt;p&gt;You can also apply this analysis to geographic locations, instead of
environmental conditions. This won’t make much sense for native vs invaded
range comparisons, but it could be useful for comparing different species
within the same area.&lt;/p&gt;
&lt;p&gt;To demonstrate, let’s compare the distribution of &lt;em&gt;Lythrum salicaria&lt;/em&gt; in
North America before and after 1950. We use geographic coordinates here, so
no need for a PCA. We do need to generate the ‘background’ coordinates.
I’ll use &lt;code&gt;expand.grid&lt;/code&gt; to create the locations for this. I’ve broken up the
NA extent into a 500 x 500 grid.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;lsNAearly &amp;lt;- subset(lsNA, year &amp;lt;= 1950)
lsNAlate &amp;lt;- subset(lsNA, year &amp;gt; 1950)
geoGrid &amp;lt;- expand.grid(longitude =
                        seq(-160, -40, length.out = 500),
                      latitude =
                        seq(20, 90, length.out = 500))

earlyGeoGrid &amp;lt;- ecospat.grid.clim.dyn(geoGrid, geoGrid,
                                     coordinates(lsNAearly))

lateGeoGrid &amp;lt;- ecospat.grid.clim.dyn(geoGrid, geoGrid,
                                    coordinates(lsNAlate))

ecospat.plot.niche.dyn(earlyGeoGrid, lateGeoGrid, quant = 0)
plot(wrld_simpl, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This looks pretty good. However, &lt;code&gt;ecospat&lt;/code&gt; uses a kernel density formula to
model the occurrence distributions. As a consequence, it projects out into
the ocean, which isn’t very realistic. To correct this, we need to mask the
analysis to the continental land mass. This requires we have a vector map
of the desired area. I’ll combine the US, Canada, and Mexico polygons from
&lt;code&gt;wrld_simpl&lt;/code&gt; for this purpose.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;naMask &amp;lt;- bind(subset(wrld_simpl, NAME == &amp;quot;Canada&amp;quot;),
              subset(wrld_simpl, NAME == &amp;quot;United States&amp;quot;),
              subset(wrld_simpl, NAME == &amp;quot;Mexico&amp;quot;))

earlyGeoGrid &amp;lt;- ecospat.grid.clim.dyn(geoGrid, geoGrid,
                                     coordinates(lsNAearly),
                                     geomask = naMask)

lateGeoGrid &amp;lt;- ecospat.grid.clim.dyn(geoGrid, geoGrid,
                                    coordinates(lsNAlate),
                                    geomask = naMask)

ecospat.plot.niche.dyn(earlyGeoGrid, lateGeoGrid, quant = 0)
plot(wrld_simpl, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;That gives more reasonable results.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;summary&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Summary&lt;/h1&gt;
&lt;p&gt;This is a fairly quick overview of this workflow. You’ll almost certainly
want to consider thinning your observations, among other data cleaning
procedures. I’ve also set the study extent very crudely. That might be
appropriate for very large scale (global) studies. But you’ll usually want
to think a bit more carefully about how you set your extent. The way you
process your data will also differ depending on your context.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references csl-bib-body hanging-indent&#34; entry-spacing=&#34;0&#34;&gt;
&lt;div id=&#34;ref-BivandLewin-Koh_2021&#34; class=&#34;csl-entry&#34;&gt;
Bivand, Roger, and Nicholas Lewin-Koh. 2021. &lt;span&gt;“Maptools: Tools for Handling Spatial Objects.”&lt;/span&gt; Manual. &lt;a href=&#34;https://CRAN.R-project.org/package=maptools&#34;&gt;https://CRAN.R-project.org/package=maptools&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-ChamberlainEtAl_2021&#34; class=&#34;csl-entry&#34;&gt;
Chamberlain, Scott, Vijay Barve, Dan Mcglinn, Damiano Oldoni, Peter Desmet, Laurens Geffert, and Karthik Ram. 2021. &lt;span&gt;“Rgbif: Interface to the Global Biodiversity Information Facility API.”&lt;/span&gt; Manual. &lt;a href=&#34;https://CRAN.R-project.org/package=rgbif&#34;&gt;https://CRAN.R-project.org/package=rgbif&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-ColaEtAl_2017&#34; class=&#34;csl-entry&#34;&gt;
Cola, Valeria Di, Olivier Broennimann, Blaise Petitpierre, Frank T. Breiner, Manuela D’Amen, Christophe Randin, Robin Engler, et al. 2017. &lt;span&gt;“Ecospat: An R Package to Support Spatial Analyses and Modeling of Species Niches and Distributions.”&lt;/span&gt; &lt;em&gt;Ecography&lt;/em&gt; 40 (6): 774–87. &lt;a href=&#34;https://doi.org/10.1111/ecog.02671&#34;&gt;https://doi.org/10.1111/ecog.02671&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-FickHijmans_2017&#34; class=&#34;csl-entry&#34;&gt;
Fick, Stephen E., and Robert J. Hijmans. 2017. &lt;span&gt;“WorldClim 2: New 1-Km Spatial Resolution Climate Surfaces for Global Land Areas.”&lt;/span&gt; &lt;em&gt;International Journal of Climatology&lt;/em&gt; 37 (12): 4302–15. &lt;a href=&#34;https://doi.org/10.1002/joc.5086&#34;&gt;https://doi.org/10.1002/joc.5086&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-Hijmans_2021&#34; class=&#34;csl-entry&#34;&gt;
Hijmans, Robert J. 2021. &lt;span&gt;“Raster: Geographic Data Analysis and Modeling.”&lt;/span&gt; Manual. &lt;a href=&#34;https://CRAN.R-project.org/package=raster&#34;&gt;https://CRAN.R-project.org/package=raster&lt;/a&gt;.
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>GBS Admixture Analysis Workflow</title>
      <link>https://plantarum.ca/2021/06/01/admixture/</link>
      <pubDate>Tue, 01 Jun 2021 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2021/06/01/admixture/</guid>
      <description>
&lt;script src=&#34;https://plantarum.ca/2021/06/01/admixture/index_files/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;&lt;a href=&#34;https://dalexander.github.io/admixture/index.html&#34;&gt;Admixture&lt;/a&gt; is a program
for completing
&lt;a href=&#34;https://web.stanford.edu/group/pritchardlab/structure.html&#34;&gt;STRUCTURE&lt;/a&gt;-style
analyses of large SNP datasets, such as we get with GBS
&lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-ElshireEtAl_2011&#34; role=&#34;doc-biblioref&#34;&gt;Elshire et al. 2011&lt;/a&gt;)&lt;/span&gt;. This short tutorial covers getting our SNP data from
STACKS &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-RochetteEtAl_2019&#34; role=&#34;doc-biblioref&#34;&gt;Rochette, Rivera‐Colón, and Catchen 2019&lt;/a&gt;)&lt;/span&gt; into a format that Admixture will understand,
running the analysis, and importing the results into
&lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; for further investigation &amp;amp; plotting.&lt;/p&gt;
&lt;div id=&#34;converting-stacks-output-to-admixture-input&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Converting Stacks Output to Admixture Input&lt;/h1&gt;
&lt;p&gt;Both Stacks and Admixture can process
&lt;a href=&#34;https://www.cog-genomics.org/plink2/formats&#34;&gt;PLINK&lt;/a&gt; data. However, there
are a few ‘gotchas’ that took a while to sort out. The simplest way I found
to bridge the two programs was to export my Stacks data to &lt;code&gt;vcf&lt;/code&gt;, clean it
up on the command line, and then use the &lt;code&gt;plink&lt;/code&gt; program to convert it to a
&lt;code&gt;plink&lt;/code&gt; file that Admixture could parse.&lt;/p&gt;
&lt;p&gt;I’ll start from the &lt;code&gt;vcf&lt;/code&gt; file generated by Stacks’
&lt;a href=&#34;http://catchenlab.life.illinois.edu/stacks/comp/populations.php&#34;&gt;populations&lt;/a&gt;
program. We expect to have thousands of contigs in a typical GBS dataset,
each of which is numbered in the Stacks output:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;head -20 populations.haps.vcf&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##fileformat=VCFv4.2
##fileDate=20210531
##source=&amp;quot;Stacks v2.3e&amp;quot;
##INFO=&amp;lt;ID=AD,Number=R,Type=Integer,Description=&amp;quot;Total Depth for Each Allele&amp;quot;&amp;gt;
...
##FORMAT=&amp;lt;ID=GT,Number=1,Type=String,Description=&amp;quot;Genotype&amp;quot;&amp;gt;
##INFO=&amp;lt;ID=loc_strand,Number=1,Type=Character,Description=&amp;quot;Genomic strand the co
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NIAP1-1011  NIAP1-0342  
26  1   .   TTTCG   TATCG   .   PASS    snp_columns=61,93,136,137,177   GT  0/0 0/0 
46  1   .   AAAGTT  AACATT  .   PASS    snp_columns=13,15,48,73,125,192 GT  0/1 0/1 
103 1   .   CCCACATGATACGCCGC   CCCATATAAGCCGCCGC   .   PASS    snp_columns=11,16   
149 1   .   ACACTGT ACATTGT .   PASS    snp_columns=47,66,84,91,121,126,163 GT  0/0 
271 1   .   CATTGTGCGGAATATGT   TACCTCATTTGCATTAC,CATTGTGCGGAAAATGT .   PASS&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This causes a problem when we try to use &lt;code&gt;plink&lt;/code&gt;, which won’t accept
&lt;code&gt;#CHROM&lt;/code&gt; values higher than 21. To fix this, we need to prepend a letter to
the &lt;code&gt;#CHROM&lt;/code&gt; numbers:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;sed &amp;#39;/^[[:digit:]]/s/^/c/&amp;#39; populations.haps.vcf &amp;gt; popC.haps.vcf&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;head -20 popC.haps.vcf&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##fileformat=VCFv4.2
##fileDate=20210531
##source=&amp;quot;Stacks v2.3e&amp;quot;
##INFO=&amp;lt;ID=AD,Number=R,Type=Integer,Description=&amp;quot;Total Depth for Each Allele&amp;quot;&amp;gt;
...
##FORMAT=&amp;lt;ID=GT,Number=1,Type=String,Description=&amp;quot;Genotype&amp;quot;&amp;gt;
##INFO=&amp;lt;ID=loc_strand,Number=1,Type=Character,Description=&amp;quot;Genomic strand the co
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NIAP1-1011  NIAP1-0342  
c26 1   .   TTTCG   TATCG   .   PASS    snp_columns=61,93,136,137,177   GT  0/0 
c46 1   .   AAAGTT  AACATT  .   PASS    snp_columns=13,15,48,73,125,192 GT  0/1 
c103    1   .   CCCACATGATACGCCGC   CCCATATAAGCCGCCGC   .   PASS
c149    1   .   ACACTGT ACATTGT .   PASS    snp_columns=47,66,84,91,121,126,163 GT
c271    1   .   CATTGTGCGGAATATGT   TACCTCATTTGCATTAC,CATTGTGCGGAAAATGT .   PASS&lt;/code&gt;&lt;/pre&gt;
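&lt;p&gt;If you want to convince yourself that the &lt;code&gt;sed&lt;/code&gt; expression only touches the
data lines, you can try it on a toy file first. This is just a sketch: the
temporary directory and the file names &lt;code&gt;toy.vcf&lt;/code&gt; and &lt;code&gt;toyC.vcf&lt;/code&gt; are made up
for the illustration, and the four lines are not real Stacks output.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;## Hypothetical toy file: two header lines and two made-up data lines
tmp=$(mktemp -d)
printf &amp;#39;##fileformat=VCFv4.2\n#CHROM\tPOS\n26\t1\n103\t1\n&amp;#39; &amp;gt; &amp;quot;$tmp/toy.vcf&amp;quot;

## Same expression as above: only lines that start with a digit get the prefix
sed &amp;#39;/^[[:digit:]]/s/^/c/&amp;#39; &amp;quot;$tmp/toy.vcf&amp;quot; &amp;gt; &amp;quot;$tmp/toyC.vcf&amp;quot;

## The header lines start with &amp;#39;#&amp;#39;, so they should come through unchanged
cat &amp;quot;$tmp/toyC.vcf&amp;quot;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The two data lines should now begin with &lt;code&gt;c26&lt;/code&gt; and &lt;code&gt;c103&lt;/code&gt;, while the lines
starting with &lt;code&gt;#&lt;/code&gt; are left alone.&lt;/p&gt;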
&lt;p&gt;With that small addition, we can now create a &lt;code&gt;plink&lt;/code&gt; file with the &lt;code&gt;plink&lt;/code&gt;
program:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;plink --vcf popC.haps.vcf --make-bed --out pop.admix --allow-extra-chr 0&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This generates a few files: &lt;code&gt;pop.admix.bed&lt;/code&gt;, &lt;code&gt;pop.admix.bim&lt;/code&gt;,
&lt;code&gt;pop.admix.fam&lt;/code&gt;, &lt;code&gt;pop.admix.log&lt;/code&gt;, &lt;code&gt;pop.admix.nosex&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;running-admixture&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Running Admixture&lt;/h1&gt;
&lt;p&gt;We can now run admixture itself. We need all the files generated by &lt;code&gt;plink&lt;/code&gt;
together in the same directory. We’ll pass the &lt;code&gt;.bed&lt;/code&gt; file as an argument
to Admixture, but it will look for the other files when it’s running.&lt;/p&gt;
&lt;p&gt;The other key argument for admixture is the number of clusters to look for.
We most likely will want to try a range of different values, in order to
determine the optimal number (if there is one). We can do this in bash with
a loop:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;for K in `seq -w 1 20` 
do
    admixture --cv pop.admix.bed $K &amp;gt; ktests/k${K}.out
done&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;seq&lt;/code&gt; command generates a sequence of numbers, and the &lt;code&gt;-w&lt;/code&gt; flag tells
it to pad the numbers with zeros (i.e., 01, 02, … 19, 20).&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;--cv&lt;/code&gt; flag tells admixture to calculate cross-validation error rates,
which we will use to determine the optimal K value.&lt;/p&gt;
&lt;p&gt;We direct the output to files in the directory &lt;code&gt;ktests&lt;/code&gt;. Make sure this
directory exists before you start.&lt;/p&gt;
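&lt;p&gt;A minimal sketch of that setup step, assuming a POSIX-like shell with
&lt;code&gt;seq&lt;/code&gt; available: &lt;code&gt;mkdir -p&lt;/code&gt; creates the directory if it’s missing and does
nothing if it’s already there, so it’s safe to run every time. The second
command just previews the first few zero-padded values the loop will use.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;## Create the output directory; -p makes this a no-op if it already exists
mkdir -p ktests

## Preview the first few zero-padded K values
seq -w 1 20 | head -3&lt;/code&gt;&lt;/pre&gt;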
&lt;p&gt;Once the loop is finished, we’ll want to examine the results:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;grep -h CV ktests/*out&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;CV error (K=1): 0.39835
CV error (K=2): 0.31327
CV error (K=3): 0.26516
CV error (K=4): 0.19929
CV error (K=5): 0.18499
...&lt;/code&gt;&lt;/pre&gt;
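&lt;p&gt;If you’d like a quick ranking before leaving the shell, the same &lt;code&gt;grep&lt;/code&gt;
output can be piped through &lt;code&gt;sort&lt;/code&gt;. This is only a sketch: the &lt;code&gt;demo&lt;/code&gt;
directory and its three log files are stand-ins I’ve made up so the example
runs on its own, and &lt;code&gt;LC_ALL=C&lt;/code&gt; is there on the assumption that your locale
might not use a period as the decimal mark.&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;## Hypothetical stand-in logs, so the sketch runs without admixture output
demo=$(mktemp -d)
echo &amp;quot;CV error (K=1): 0.39835&amp;quot; &amp;gt; &amp;quot;$demo/k01.out&amp;quot;
echo &amp;quot;CV error (K=2): 0.31327&amp;quot; &amp;gt; &amp;quot;$demo/k02.out&amp;quot;
echo &amp;quot;CV error (K=3): 0.26516&amp;quot; &amp;gt; &amp;quot;$demo/k03.out&amp;quot;

## Field 4 is the error rate; sort numerically so the lowest comes first
grep -h CV &amp;quot;$demo&amp;quot;/*out | LC_ALL=C sort -k4,4n&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For the real run, point &lt;code&gt;grep&lt;/code&gt; at &lt;code&gt;ktests/*out&lt;/code&gt; instead. The lowest CV error
isn’t automatically the best K, so it’s still worth looking at the whole
curve.&lt;/p&gt;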
&lt;p&gt;Now we’re ready to move into R to explore the results.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;r-plotting&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;R Plotting&lt;/h1&gt;
&lt;p&gt;First, we’ll take a look at the CV values. Since they’re scattered in 20
different log files, we’ll use grep to collect them into a single file:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;grep -h CV ktests/*out &amp;gt; CV.csv&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can load that into R:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;CVs &amp;lt;- read.table(&amp;quot;CV.csv&amp;quot;, sep = &amp;quot; &amp;quot;)
CVs &amp;lt;- CVs[, 3:4] ## drop the first two columns
## Remove the formatting around the K values:
CVs[, 1] &amp;lt;- gsub(x = CVs[, 1], pattern = &amp;quot;\\(K=&amp;quot;,
                replacement = &amp;quot;&amp;quot;)
CVs[, 1] &amp;lt;- gsub(x = CVs[, 1], pattern = &amp;quot;\\):&amp;quot;,
                replacement = &amp;quot;&amp;quot;) 
head(CVs)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   V3      V4
## 1  1 0.39835
## 2  2 0.31327
## 3  3 0.26516
## 4  4 0.19929
## 5  5 0.18499
## 6  6 0.15408&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(CVs, xlab = &amp;quot;K&amp;quot;, ylab = &amp;quot;CV error&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/06/01/admixture/index_files/figure-html/CV-plot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;In our case, there isn’t a clear optimum. K = 9 is about the bottom of
the ‘elbow,’ so we’ll use that to make our plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ad9 &amp;lt;- read.table(&amp;quot;pop.admix.9.Q&amp;quot;)
head(ad9)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##         V1    V2    V3    V4    V5       V6    V7    V8    V9
## 1 0.000010 1e-05 1e-05 1e-05 1e-05 0.999920 1e-05 1e-05 1e-05
## 2 0.000010 1e-05 1e-05 1e-05 1e-05 0.999920 1e-05 1e-05 1e-05
## 3 0.000010 1e-05 1e-05 1e-05 1e-05 0.999920 1e-05 1e-05 1e-05
## 4 0.000010 1e-05 1e-05 1e-05 1e-05 0.999920 1e-05 1e-05 1e-05
## 5 0.020244 1e-05 1e-05 1e-05 1e-05 0.979686 1e-05 1e-05 1e-05
## 6 0.003441 1e-05 1e-05 1e-05 1e-05 0.996489 1e-05 1e-05 1e-05&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We also need a popmap file to annotate our plot. This file lists the
population of every sample, and critically, it must be in the same order as
the rows in &lt;code&gt;pop.admix.9.Q&lt;/code&gt;. In this case, I’ll use the popmap data that I
used with the &lt;code&gt;populations&lt;/code&gt; program to generate the original vcf files we
started with, plus an additional column containing the population names.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;popmap &amp;lt;- read.table(&amp;quot;popmap.csv&amp;quot;, header = TRUE)
head(popmap)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       sample popnum    popname
## 1 NIAP1-1011      1 Niapsikau1
## 2 NIAP1-0342      1 Niapsikau1
## 3 NIAP1-1004      1 Niapsikau1
## 4 NIAP1-1017      1 Niapsikau1
## 5 NIAP1-1014      1 Niapsikau1
## 6 NIAP1-0923      1 Niapsikau1&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At this point, the two tables are in the same order. Before I do any
manipulations, I’ll combine them. This allows me to sort them in any order,
and the names will stay associated with the correct samples.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ad9 &amp;lt;- cbind(popmap, ad9)
head(ad9)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##       sample popnum    popname       V1    V2    V3    V4    V5       V6    V7
## 1 NIAP1-1011      1 Niapsikau1 0.000010 1e-05 1e-05 1e-05 1e-05 0.999920 1e-05
## 2 NIAP1-0342      1 Niapsikau1 0.000010 1e-05 1e-05 1e-05 1e-05 0.999920 1e-05
## 3 NIAP1-1004      1 Niapsikau1 0.000010 1e-05 1e-05 1e-05 1e-05 0.999920 1e-05
## 4 NIAP1-1017      1 Niapsikau1 0.000010 1e-05 1e-05 1e-05 1e-05 0.999920 1e-05
## 5 NIAP1-1014      1 Niapsikau1 0.020244 1e-05 1e-05 1e-05 1e-05 0.979686 1e-05
## 6 NIAP1-0923      1 Niapsikau1 0.003441 1e-05 1e-05 1e-05 1e-05 0.996489 1e-05
##      V8    V9
## 1 1e-05 1e-05
## 2 1e-05 1e-05
## 3 1e-05 1e-05
## 4 1e-05 1e-05
## 5 1e-05 1e-05
## 6 1e-05 1e-05&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In my case, the samples aren’t sorted by population, so I need to sort
them prior to plotting:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ad9 &amp;lt;- ad9[order(ad9$popnum), ]&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we’re ready to plot:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;barplot(t(as.matrix(ad9[, -1:-3])), col=rainbow(9), 
        space = 0, xlab=&amp;quot;Population&amp;quot;, ylab = &amp;quot;Ancestry&amp;quot;, 
        border=NA, axisnames = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/06/01/admixture/index_files/figure-html/admixture-plot-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Let’s break that down. First, we excluded the first three columns, &lt;code&gt;ad9[, -1:-3]&lt;/code&gt;, so that we don’t include our labels in the data. Then we transpose
the matrix, so that each individual sample is represented by a column of
Ancestry proportions. I removed the borders on the bars, and the spaces
between them (&lt;code&gt;border = NA&lt;/code&gt;, and &lt;code&gt;space = 0&lt;/code&gt;), and set the axis labels.&lt;/p&gt;
&lt;p&gt;That’s nice, but we’d like to label our original populations on the plot,
so we can see how they compare to the clusters produced by admixture. I use
the &lt;code&gt;aggregate&lt;/code&gt; function for this.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;xlabels &amp;lt;- aggregate(1:nrow(ad9),
                    by = list(ad9[, &amp;quot;popname&amp;quot;]),
                    FUN = mean)
xlabels&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      Group.1     x
## 1   Fantome4  58.0
## 2   Fantome5  83.5
## 3     Havre6 107.5
## 4     Havre7 125.5
## 5  Marteau11 149.0
## 6 Niapsikau1   7.0
## 7 Niapsikau2  30.5&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, I’ve grouped the rows by population name, and then calculated the
mean row number for each group. That will be handy, as I can then use that
mean value to plot the name of each population centered beneath it.&lt;/p&gt;
&lt;p&gt;Similarly, I can find the borders of the groups:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sampleEdges &amp;lt;- aggregate(1:nrow(ad9),
                        by = list(ad9[, &amp;quot;popname&amp;quot;]), 
                        FUN = max)
sampleEdges&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##      Group.1   x
## 1   Fantome4  68
## 2   Fantome5  98
## 3     Havre6 116
## 4     Havre7 134
## 5  Marteau11 163
## 6 Niapsikau1  13
## 7 Niapsikau2  47&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In this case, I find the highest row for each population, which I’ll use to
draw a line between them.&lt;/p&gt;
&lt;p&gt;Putting this all together, we get:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;barplot(t(as.matrix(ad9[, -1:-3])), col=rainbow(9), 
        space = 0, xlab=&amp;quot;Population&amp;quot;, ylab = &amp;quot;Ancestry&amp;quot;, 
        border=NA, axisnames = FALSE)
abline(v = sampleEdges$x, lwd = 2)
axis(1, at = xlabels$x - 0.5, labels = xlabels$Group.1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/2021/06/01/admixture/index_files/figure-html/admixture-plot-complete-1.png&#34; width=&#34;864&#34; /&gt;&lt;/p&gt;
&lt;p&gt;And we’re done!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;see-also&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;See Also&lt;/h1&gt;
&lt;p&gt;For an alternative using ggplot2, see &lt;a href=&#34;https://luisdva.github.io/rstats/model-cluster-plots/&#34;&gt;Luis D. Verde Arregoitia’s
blog&lt;/a&gt;. There’s
another tutorial that you might find helpful at
&lt;a href=&#34;https://speciationgenomics.github.io/ADMIXTURE/&#34;&gt;SpeciationGenomics.github.io&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references csl-bib-body hanging-indent&#34;&gt;
&lt;div id=&#34;ref-ElshireEtAl_2011&#34; class=&#34;csl-entry&#34;&gt;
Elshire, Robert J., Jeffrey C. Glaubitz, Qi Sun, Jesse A. Poland, Ken Kawamoto, Edward S. Buckler, and Sharon E. Mitchell. 2011. &lt;span&gt;“A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species.”&lt;/span&gt; &lt;em&gt;PLoS ONE&lt;/em&gt; 6 (5). &lt;a href=&#34;https://doi.org/10.1371/journal.pone.0019379&#34;&gt;https://doi.org/10.1371/journal.pone.0019379&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-RochetteEtAl_2019&#34; class=&#34;csl-entry&#34;&gt;
Rochette, Nicolas C., Angel G. Rivera‐Colón, and Julian M. Catchen. 2019. &lt;span&gt;“Stacks 2: Analytical Methods for Paired-End Sequencing Improve RADseq-Based Population Genomics.”&lt;/span&gt; &lt;em&gt;Molecular Ecology&lt;/em&gt; 28 (21): 4737–54. &lt;a href=&#34;https://doi.org/10.1111/mec.15253&#34;&gt;https://doi.org/10.1111/mec.15253&lt;/a&gt;.
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Adding Lat/Lon Grids to Maps in R</title>
      <link>https://plantarum.ca/2021/02/22/graticules-r/</link>
      <pubDate>Mon, 22 Feb 2021 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2021/02/22/graticules-r/</guid>
      <description>
&lt;script src=&#34;https://plantarum.ca/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;In a previous post, I outlined my workflow for &lt;a href=&#34;https://plantarum.ca/2020/10/30/simple-maps-r/&#34;&gt;preparing maps in
R&lt;/a&gt;. Today I had to add a
&lt;a href=&#34;https://www.merriam-webster.com/dictionary/graticule&#34;&gt;graticule&lt;/a&gt;, a grid
of latitude and longitude lines, to my maps. That’s easy enough to do with
unprojected maps, as the plot coordinates are latitude and longitude, so
your X and Y axes are already graticules. But if you’ve projected your
data, the plot coordinates are on a different scale, so you need to do a
bit of tuning.&lt;/p&gt;
&lt;p&gt;I couldn’t find a direct way to do this in the R &lt;code&gt;sp&lt;/code&gt; package. However,
&lt;code&gt;sp&lt;/code&gt; (&lt;code&gt;sp&lt;/code&gt; for ‘spatial’) is slowly being replaced by
&lt;a href=&#34;https://r-spatial.github.io/sf/index.html&#34;&gt;sf&lt;/a&gt; (&lt;code&gt;sf&lt;/code&gt; for &lt;a href=&#34;https://en.wikipedia.org/wiki/Simple_Features&#34;&gt;simple
feature&lt;/a&gt;), and &lt;code&gt;sf&lt;/code&gt; does
support graticules. Here are the steps required to add them to your plots:&lt;/p&gt;
&lt;div id=&#34;importing-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Importing Data&lt;/h1&gt;
&lt;p&gt;We can use &lt;code&gt;raster::getData&lt;/code&gt; to get our map data again. It’s
straightforward to convert objects between the &lt;code&gt;sp&lt;/code&gt; (&lt;code&gt;Spatial*&lt;/code&gt;) and &lt;code&gt;sf&lt;/code&gt; (&lt;code&gt;sf*&lt;/code&gt;)
formats, with the functions &lt;code&gt;st_as_sf&lt;/code&gt; (to convert from a
&lt;code&gt;Spatial*&lt;/code&gt; to &lt;code&gt;sf*&lt;/code&gt;), and &lt;code&gt;as&lt;/code&gt; (to convert from &lt;code&gt;sf*&lt;/code&gt; to a &lt;code&gt;Spatial*&lt;/code&gt;
object). As it turns out, &lt;code&gt;getData&lt;/code&gt; also supports downloading data directly
into &lt;code&gt;sf&lt;/code&gt; format:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(sf)
library(raster)
us &amp;lt;- getData(&amp;quot;GADM&amp;quot;, country = &amp;quot;USA&amp;quot;, level = 1,
             path = &amp;quot;./data/maps/&amp;quot;, type = &amp;quot;sf&amp;quot;)
canada &amp;lt;- getData(&amp;quot;GADM&amp;quot;, country = &amp;quot;CAN&amp;quot;, level = 1,
                 path = &amp;quot;./data/maps&amp;quot;, type = &amp;quot;sf&amp;quot;)
mexico &amp;lt;- getData(&amp;quot;GADM&amp;quot;, country = &amp;quot;MEX&amp;quot;, level = 1,
                 path = &amp;quot;./data/maps&amp;quot;, type = &amp;quot;sf&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This uses the undocumented &lt;code&gt;type&lt;/code&gt; argument, set to &lt;code&gt;sf&lt;/code&gt;. Since it’s not
documented, it may change in the future. Be warned!&lt;/p&gt;
&lt;p&gt;You can also use the function &lt;code&gt;st_read&lt;/code&gt; to read shapefiles directly:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;greatlakes &amp;lt;- st_read(&amp;quot;data/maps/greatlakes.shp&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Reading layer `greatlakes&amp;#39; from data source 
##   `/home/smithty/blogdown/content/tutorials/data/maps/greatlakes.shp&amp;#39; 
##   using driver `ESRI Shapefile&amp;#39;
## Simple feature collection with 2 features and 5 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -365240.6 ymin: -2741892 xmax: 1888977 ymax: 509590.2
## proj4string:   +proj=laea +lat_0=45 +lon_0=-100 +x_0=0 +y_0=0 +a=6370997 +b=6370997 +units=m +no_defs&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In the previous tutorial, I used &lt;code&gt;bind&lt;/code&gt; to combine two &lt;code&gt;Spatial*&lt;/code&gt; objects.
With &lt;code&gt;sf&lt;/code&gt; we need &lt;code&gt;rbind&lt;/code&gt; instead:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;na &amp;lt;- rbind(us, canada, mexico)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## old-style crs object detected; please recreate object with a recent sf::st_crs()
## old-style crs object detected; please recreate object with a recent sf::st_crs()
## old-style crs object detected; please recreate object with a recent sf::st_crs()
## old-style crs object detected; please recreate object with a recent sf::st_crs()
## old-style crs object detected; please recreate object with a recent sf::st_crs()
## old-style crs object detected; please recreate object with a recent sf::st_crs()
## old-style crs object detected; please recreate object with a recent sf::st_crs()
## old-style crs object detected; please recreate object with a recent sf::st_crs()
## old-style crs object detected; please recreate object with a recent sf::st_crs()
## old-style crs object detected; please recreate object with a recent sf::st_crs()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Plotting complex vector maps like this can be a slow process, especially
when you’re constantly tweaking and adjusting them. You can speed this up
by simplifying the layers:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;na.simp &amp;lt;- st_simplify(na, dTolerance = 0.01)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;On my laptop, plotting the original map takes a minute or more, compared to
2 seconds for the simplified vector. I set the tolerance by trial and
error. The higher the tolerance, the smoother the map will be. At 0.01, it
still looks nearly identical at the scale I’m plotting it, but is much
smaller and faster to plot. &lt;code&gt;sf&lt;/code&gt; does warn me about not correctly
simplifying the data, but since I’m only using this for display that’s not
a concern. I wouldn’t simplify a vector if I was going to use it in an analysis.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;plotting-maps&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Plotting Maps&lt;/h1&gt;
&lt;p&gt;When it comes to plotting, we need to tell R to plot only the geometry. By
default it will plot multiple maps, one for each attribute. That’s not what
we want here.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(st_geometry(na.simp), xlim = c(-130, -70),
     ylim = c(35, 45))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2021-02-22-r-maps-graticules_files/figure-html/plot%20sf%20map-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;projections&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Projections&lt;/h1&gt;
&lt;p&gt;To project our unprojected data, we need to define a projection, and transform the object.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;laea = CRS(&amp;quot;+proj=laea +lat_0=30 +lon_0=-95&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## NOTE: rgdal::checkCRSArgs: no proj_defs.dat in PROJ.4 shared files&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;na.la &amp;lt;- st_transform(na.simp, laea)
plot(st_geometry(na.la), xlim = c(-500000, 2000000),
     ylim = c(-400000, 2100000))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2021-02-22-r-maps-graticules_files/figure-html/projection-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can add layers just as we did in the previous post:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;gl.la &amp;lt;- st_transform(greatlakes, laea)
plot(st_geometry(gl.la), col = &amp;#39;lightblue&amp;#39;, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2021-02-22-r-maps-graticules_files/figure-html/plotting%20the%20great%20lakes%20for%20real-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;You can also mix &lt;code&gt;sf&lt;/code&gt; and &lt;code&gt;Spatial*&lt;/code&gt; objects on the same plot, as long as
they’re in the same projection.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;graticules&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Graticules&lt;/h1&gt;
&lt;p&gt;Now we have everything we need to add graticules to our map. This includes
the map we want to plot, and the CRS data for the graticules we want to
overlay. In our case, we’ll use the original, unprojected layer as the
source of our CRS:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(st_geometry(na.la),
     xlim = c(-500000, 2000000), ylim = c(-400000, 2100000),
     graticule = st_crs(na.simp),
     bgc = &amp;#39;lightblue&amp;#39;, ## Background color for the ocean
     col = &amp;#39;white&amp;#39;,
     axes = TRUE)
plot(st_geometry(gl.la), col = &amp;#39;lightblue&amp;#39;, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2021-02-22-r-maps-graticules_files/figure-html/plotting%20with%20graticules-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If you want to control where the graticule lines are drawn, you can use
the arguments &lt;code&gt;lat&lt;/code&gt; and &lt;code&gt;lon&lt;/code&gt;.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Emacs for Bioinformatics #3: R and ESS</title>
      <link>https://plantarum.ca/2020/12/30/emacs-tutorial-03/</link>
      <pubDate>Wed, 30 Dec 2020 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2020/12/30/emacs-tutorial-03/</guid>
      <description>
&lt;script src=&#34;https://plantarum.ca/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;p&gt;This is part three in my series of Emacs tutorials aimed at bioinformatics
(and other scientific analysis) workflows. See the rest on my
&lt;a href=&#34;https://plantarum.ca/tutorials/&#34;&gt;tutorials&lt;/a&gt; page.&lt;/p&gt;
&lt;p&gt;Emacs support for the R programming language is provided by the
&lt;a href=&#34;https://ess.r-project.org/&#34; title=&#34;ESS&#34;&gt;ESS&lt;/a&gt; package (AKA, “Emacs Speaks
Statistics”). ESS has been around since at least 1994, and is supported by
a very active development team. It provides most or all of the features of
the more widely-known &lt;a href=&#34;https://rstudio.com/&#34; title=&#34;RStudio&#34;&gt;RStudio&lt;/a&gt;, as
well as a great many more. Like all things Emacs, if it doesn’t have a
feature you want, it’s likely someone else has written a package that
provides it; failing that, the motivated hacker can create their own
customizations using the built-in scripting language, elisp.&lt;/p&gt;
&lt;p&gt;However, let’s not let all that potential scare us off. Getting up and
running with ESS doesn’t require much effort at all.&lt;/p&gt;
&lt;div id=&#34;installation&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Installation&lt;/h1&gt;
&lt;div id=&#34;prerequisites&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Prerequisites&lt;/h2&gt;
&lt;p&gt;You need to have &lt;code&gt;R&lt;/code&gt; installed in order to use &lt;code&gt;ESS&lt;/code&gt;!&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;installing-ess-from-melpa&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Installing ESS from MELPA&lt;/h2&gt;
&lt;p&gt;The easiest way to install ESS is to use the &lt;a href=&#34;https://melpa.org/&#34; title=&#34;MELPA&#34;&gt;MELPA&lt;/a&gt; package repository. MELPA hosts Emacs packages provided by hackers
who are not part of the Emacs development team. (“Packages” here has
roughly the same meaning as “plugins” or “extensions” in other software
systems.)&lt;/p&gt;
&lt;p&gt;If you aren’t already using MELPA, you need to add it to your configuration
file (typically &lt;code&gt;~/.emacs.d/init.el&lt;/code&gt;, or &lt;code&gt;~/.emacs&lt;/code&gt;):&lt;/p&gt;
&lt;pre class=&#34;elisp&#34;&gt;&lt;code&gt;(require &amp;#39;package)
(add-to-list &amp;#39;package-archives
             &amp;#39;(&amp;quot;melpa&amp;quot; . &amp;quot;https://melpa.org/packages/&amp;quot;) t)
(package-initialize)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once this code is evaluated, you can view the complete list of packages
available on MELPA via &lt;code&gt;M-x package-list-packages&lt;/code&gt;. It’s a &lt;em&gt;big&lt;/em&gt; list, and
it will take a few seconds for Emacs to get the latest version from the
server (you need an internet connection for this).&lt;/p&gt;
&lt;p&gt;Search down to the entry for &lt;code&gt;ESS&lt;/code&gt;, select it by pressing &lt;code&gt;i&lt;/code&gt;, and then
install it by pressing &lt;code&gt;x&lt;/code&gt;. &lt;code&gt;ESS&lt;/code&gt; is one of the larger packages, so it may
take a few seconds to download and install all the files.&lt;/p&gt;
&lt;p&gt;Once this is done, you have &lt;code&gt;ESS&lt;/code&gt;, and don’t need to return to
&lt;code&gt;package-list-packages&lt;/code&gt; until you want to update to a new version (or add
some other packages).&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-started&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Getting Started&lt;/h1&gt;
&lt;p&gt;ESS comes with a comprehensive manual. That will be your canonical
reference for learning about this package. However, you can get started
with just a few commands.&lt;/p&gt;
&lt;div id=&#34;interactive-r-session&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Interactive R Session&lt;/h2&gt;
&lt;p&gt;From within Emacs, start R with the command &lt;code&gt;M-x R&lt;/code&gt;. You will be prompted
for the project starting directory. Select whatever you like and press
enter. You will then be presented with an R shell:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/ess.jpg&#34; /&gt;&lt;/p&gt;
&lt;p&gt;You can enter code and view results here, just as you would in the terminal
in RStudio, or with R running on the command line. ESS uses the same code
to manage this as for &lt;a href=&#34;https://plantarum.ca/2020/06/16/emacs-tutorial-01/&#34;&gt;shell mode&lt;/a&gt;. That
means we can use the same keybindings here:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;&amp;lt;tab&amp;gt;&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;with the cursor at the active prompt, &lt;code&gt;tab&lt;/code&gt; will complete function and
variable names, as well as the arguments for functions&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;&amp;lt;enter&amp;gt;&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;with the cursor at the active prompt, send the command on the prompt to R
for evaluation&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;&amp;lt;enter&amp;gt;&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;With the cursor on a previous command, re-enter that command&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;M-p&lt;/code&gt; and &lt;code&gt;M-n&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;Move through your command history at the active prompt&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;C-c C-p&lt;/code&gt; and &lt;code&gt;C-c C-n&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;Move the cursor to the &lt;em&gt;previous&lt;/em&gt; and &lt;em&gt;next&lt;/em&gt; prompts&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;C-c &amp;lt;enter&amp;gt;&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;With the cursor on a previous command, copy that command to the active
prompt, but don’t enter it. This allows you to edit a previous command
before sending a new variation to R for evaluation&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;C-c C-o&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;Delete the output from the previous command&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;C-c C-v&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;Opens a prompt to select a help file, which will be displayed in Emacs
(you can also open help files from the prompt via &lt;code&gt;?&amp;lt;function&amp;gt;&lt;/code&gt;)&lt;/p&gt;
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;These few commands cover most general interactions. There are a lot more
features available. Check the &lt;code&gt;iESS&lt;/code&gt; menu item on the toolbar for some of
them; see the manual for the details.&lt;/p&gt;
&lt;div id=&#34;plots&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Plots&lt;/h3&gt;
&lt;p&gt;Calling plotting commands will create a new window (frame) for your figure.
There isn’t a dedicated pane in Emacs to display them, like in RStudio, and
you can’t scroll forward and backward through your history of images. You
can, however, create multiple image windows, and view them side by side.&lt;/p&gt;
&lt;p&gt;To create and manipulate new image windows, you’ll need the following
commands:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dev.new() ## Create a new plot window, and make it the
          ## active window
dev.set() ## If more than one plot window is open,
          ## set the next window to be the active
          ## window&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;See the &lt;code&gt;?dev&lt;/code&gt; help page for more details.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;writing-r-scripts&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Writing R Scripts&lt;/h2&gt;
&lt;p&gt;After &lt;code&gt;ESS&lt;/code&gt; is installed, anytime you open a file with a &lt;code&gt;.r&lt;/code&gt; or &lt;code&gt;.R&lt;/code&gt;
extension, it will be in &lt;code&gt;ESS[R]&lt;/code&gt; mode. You can enter text as usual, and
additionally have the following helpful commands available:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;C-c C-n&lt;/code&gt; or &lt;code&gt;C-c &amp;lt;enter&amp;gt;&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;send the current line to the R process and step to the next&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;C-c C-r&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;send the current region to the R process&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;C-c C-f&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;send the current function to the R process&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;C-c C-c&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;send the current region, paragraph, or function to the R process&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;C-c C-b&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;send the entire buffer to the R process&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;C-c C-v&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;prompt for a help file to open&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;M-tab&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;tab completion of objects (functions, variables, file names) and function
arguments&lt;/p&gt;
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;If there is no &lt;code&gt;R&lt;/code&gt; process running when you try to send code, you will be
prompted for a working directory in which to start a new process. In
addition, you can manage &lt;code&gt;R&lt;/code&gt; processes with the following commands:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;C-c C-z&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;switch from the script buffer to the process buffer (and vice versa)&lt;/p&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;C-c C-s&lt;/code&gt;&lt;/dt&gt;
&lt;dd&gt;&lt;p&gt;change the process linked to the current script buffer (e.g., if you want
to run multiple R processes at once, with different scripts in each process)&lt;/p&gt;
&lt;/dd&gt;
&lt;/dl&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;next-steps&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Next Steps&lt;/h1&gt;
&lt;p&gt;This may well be all you need, and if that’s the case, you’re all done.
However, there is a lot more available to you, including support for
writing documentation, package development, managing git repositories,
editing on remote servers, and more.&lt;/p&gt;
&lt;p&gt;My advice is to start slowly. The pointers on this page will get you up and
running. When you find yourself repeating something tedious multiple times,
it may be time to investigate if there’s a shortcut available to make your
life easier. I recommend skimming the manual, to get a sense of all that’s
available, and if something catches your eye see about incorporating it
into your workflow.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Plotting Simple Maps in R</title>
      <link>https://plantarum.ca/2020/10/30/simple-maps-r/</link>
      <pubDate>Fri, 30 Oct 2020 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2020/10/30/simple-maps-r/</guid>
      <description>


&lt;p&gt;NOTE: This tutorial uses older R packages that are scheduled to be
deprecated at the end of 2023. I have updated this tutorial using the new
packages. Unless you need to use older code, you should use the new
&lt;a href=&#34;https://plantarum.ca/2023/02/13/terra-maps&#34;&gt;Terra-based approach&lt;/a&gt; instead of this!&lt;/p&gt;
&lt;div id=&#34;reference&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Reference&lt;/h1&gt;
&lt;p&gt;See the &lt;a href=&#34;https://rspatial.org/raster/spatial/index.html&#34;&gt;RSpatial
tutorial&lt;/a&gt; for a
more detailed introduction/overview of using R for GIS/spatial analysis.
The following tutorial walks through some common plotting tasks I use for
distribution models.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;basemaps&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Basemaps&lt;/h1&gt;
&lt;p&gt;The &lt;code&gt;raster&lt;/code&gt; package provides the function &lt;code&gt;getData&lt;/code&gt;, which is a handy way
to download basemaps for plotting. (You can also use it to get WorldClim
data; see the man page.) The first time you call it, it will download the
requested maps from the internet. It will save the data in your working
directory, or in a location specified with the &lt;code&gt;path&lt;/code&gt; argument. The next
time you request the same map from &lt;code&gt;getData&lt;/code&gt;, if it finds it in the local
directory it will load it from there, rather than downloading it again.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(raster)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loading required package: sp&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Loading required package: methods&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: multiple methods tables found for &amp;#39;metadata&amp;#39;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;us &amp;lt;- getData(&amp;quot;GADM&amp;quot;, country = &amp;quot;USA&amp;quot;, level = 1,
             path = &amp;quot;./data/maps/&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning in getData(&amp;quot;GADM&amp;quot;, country = &amp;quot;USA&amp;quot;, level = 1, path = &amp;quot;./data/maps/&amp;quot;): getData will be removed in a future version of raster
## . Please use the geodata package instead&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;canada &amp;lt;- getData(&amp;quot;GADM&amp;quot;, country = &amp;quot;CAN&amp;quot;, level = 1,
                 path = &amp;quot;./data/maps&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning in getData(&amp;quot;GADM&amp;quot;, country = &amp;quot;CAN&amp;quot;, level = 1, path = &amp;quot;./data/maps&amp;quot;): getData will be removed in a future version of raster
## . Please use the geodata package instead&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These maps can be plotted directly with the &lt;code&gt;plot&lt;/code&gt; command. If you want to
combine them, use the &lt;code&gt;add = TRUE&lt;/code&gt; argument to the second &lt;code&gt;plot&lt;/code&gt; call:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(us)
plot(canada, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-10-30-r-maps_files/figure-html/map%20plots-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;You can combine multiple vector maps into a single map with &lt;code&gt;bind&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;na &amp;lt;- bind(us, canada)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These maps are ‘unprojected’, meaning they are plotted in
latitude/longitude degrees. That makes it easy to set the plot boundaries:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(na, xlim = c(-100, -50), ylim = c(30, 60))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-10-30-r-maps_files/figure-html/zooming%20a%20map-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;NB:&lt;/strong&gt; The size of your plot canvas is fixed, but a map can’t stretch. The
x and y dimensions have to maintain the same aspect ratio. That means zooming in
one dimension (e.g., latitude only) won’t necessarily change the zoom of
your map, if the other dimension fills the canvas. You’ll have to play
around with the plot size, and both x and y dimensions together, to tweak
your zoom.&lt;/p&gt;
&lt;p&gt;It’s handy to have a shapefile of the Great Lakes, for making prettier
maps. I created this one in QGIS and use it for plotting:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;greatlakes &amp;lt;- shapefile(&amp;quot;data/maps/greatlakes.shp&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;adding-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Adding Data&lt;/h1&gt;
&lt;p&gt;You can add points to the plot like a regular scatter plot:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(scales)  ## for the alpha function below
gbif &amp;lt;- read.table(&amp;quot;data/trich-gbif.csv&amp;quot;)
## Set the line color to gray to focus on the data points:
plot(na, xlim = c(-100, -50), ylim = c(30, 60),
     border = &amp;quot;gray&amp;quot;)
points(gbif$X, gbif$Y, pch = 16,
       col = alpha(&amp;quot;green&amp;quot;, 0.2))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-10-30-r-maps_files/figure-html/adding%20points-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;You can also convert your points to a spatial points object, in which case
R will know which columns to use for plotting. This is also necessary
before we can project our data (see below).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;coordinates(gbif) &amp;lt;- ~X+Y
## plot(na, xlim = c(-100, -50), ylim = c(30, 60),
##      border = &amp;quot;gray&amp;quot;)
## points(gbif, pch = 16, col = alpha(&amp;quot;green&amp;quot;, 0.2))&lt;/code&gt;&lt;/pre&gt;
&lt;div id=&#34;rasters&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Rasters&lt;/h2&gt;
&lt;p&gt;Similarly, you can plot rasters with &lt;code&gt;plot&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trichPreds &amp;lt;- raster(&amp;quot;./data/trichPreds&amp;quot;)
plot(trichPreds, xlim = c(-100, -50), ylim = c(30, 60))
plot(na, border = &amp;quot;gray&amp;quot;, lwd = 0.5, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-10-30-r-maps_files/figure-html/loading%20rasters-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Cells with &lt;code&gt;NA&lt;/code&gt; values are transparent. In this case, a species
distribution model, low values are displayed in gray. This may be useful
for visualizing the extent of the model. However, it looks a bit odd, and
makes it hard to see the limits of the high-suitability areas. You can tweak
this by playing with the color ramp, but it’s also handy to ‘turn off’ the
low values entirely (for visualization, &lt;strong&gt;not&lt;/strong&gt; for analysis!!)&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;trichPredsTrim &amp;lt;- trichPreds
trichPredsTrim[trichPredsTrim &amp;lt;
               quantile(getValues(trichPreds),
                        probs = 0.75, na.rm = TRUE)] &amp;lt;- NA
plot(trichPredsTrim, xlim = c(-100, -50), ylim = c(30, 60))
plot(na, border = &amp;quot;grey&amp;quot;, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-10-30-r-maps_files/figure-html/trimming%20predictions-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The test I used here, &lt;code&gt;trichPredsTrim &amp;lt; quantile(getValues(trichPreds), probs = 0.75, na.rm = TRUE)&lt;/code&gt; identifies all cells in the lower 75% of the
suitability scores, which I then set to &lt;code&gt;NA&lt;/code&gt; to make them invisible. I
decided on 75% after experimenting with different values. In this case, 75%
drops most of the grey background (the very lowest values), without eating
into the areas that the prediction indicates are suitable.&lt;/p&gt;
&lt;p&gt;You could also use an absolute value here, but then you’d need to know the
actual distribution of the suitability scores. &lt;code&gt;quantile&lt;/code&gt; is easier to
tweak.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;projections&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Projections&lt;/h1&gt;
&lt;p&gt;Lat/Lon maps look a bit square; we’re more used to seeing maps projected. A
common projection for Canada is Lambert Conformal Conic. We can transform
our data to this projection to make nicer maps:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## define the projection
canlam &amp;lt;- CRS(&amp;quot;+proj=lcc +lat_1=49 +lat_2=77 +lat_0=49 +lon_0=-95 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs&amp;quot;)

## project our vector data:
na.lcc &amp;lt;- spTransform(na, canlam)
gl.lcc &amp;lt;- spTransform(greatlakes, canlam)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Warning: PROJ support is provided by the sf and terra packages among others&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## We already converted gbif to a spatial points object above!
## Now we need to set the projection of our points:
crs(gbif) &amp;lt;- CRS(&amp;quot;+proj=longlat +datum=WGS84&amp;quot;)

## Finally, we can project our points to LCC:
gbif.lcc &amp;lt;- spTransform(gbif, canlam)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that our data needs to be in an object of class &lt;code&gt;Spatial*&lt;/code&gt;, and it
must have a defined coordinate reference system (CRS) before we can project
it to a new CRS. Setting the coordinates of our points via the
&lt;code&gt;coordinates&lt;/code&gt; function creates a &lt;code&gt;Spatial*&lt;/code&gt; object. The &lt;code&gt;crs&lt;/code&gt; function
allows us to explicitly set the projection. We need to know the PROJ string
(or EPSG code) for the projection to use this. The function &lt;code&gt;make_EPSG&lt;/code&gt; in the package
&lt;code&gt;rgdal&lt;/code&gt; is helpful for finding this information. See &lt;a href=&#34;https://rspatial.org/raster/spatial/6-crs.html#notation&#34;&gt;the RSpatial
tutorial&lt;/a&gt; for
details.&lt;/p&gt;
&lt;p&gt;There are a few more steps for raster layers:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;rasterLCC &amp;lt;- projectExtent(trichPredsTrim, canlam)
res(rasterLCC) &amp;lt;- 10000 ## set the cell size to 10km
predLCC &amp;lt;- projectRaster(trichPredsTrim, rasterLCC)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that I set the resolution to 10km here. That’s the size of the raster
cells. The original raster cells, in the lat/lon projection, were at 30
arc-second resolution, which is about 1km. I could have set a smaller cell
size here. However, since I’m only using this map for visualization, 10km
is plenty big enough for my plot, and will run faster (and take less
memory) than a map with 1km cell size.&lt;/p&gt;
&lt;p&gt;Now we can plot our data in the Lambert Conformal Conic projection:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(predLCC)
plot(na.lcc, border = &amp;quot;grey&amp;quot;, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-10-30-r-maps_files/figure-html/plot%20projected-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;p&gt;The units are no longer Lat/Lon, but meters. We can read them off the plot
to improve the zoom:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(predLCC, xlim = c(0, 2500000),
     ylim = c(-1500000, -400000))
plot(na.lcc, border = &amp;quot;grey&amp;quot;, add = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-10-30-r-maps_files/figure-html/projected%20zoom-1.png&#34; width=&#34;672&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;formatting&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Formatting&lt;/h1&gt;
&lt;p&gt;With the data plotted, we can then turn to making the map a little
prettier:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Make a panel with two plots, set the right margin tight:
par(mar = c(0.1,0.1,0.1,0), mfrow = c(1, 2))

## store the plot limits:
my_xlims &amp;lt;- c(0, 2500000) 
my_ylims &amp;lt;- c(-1300000, -200000)

## Plot the points:
plot(na.lcc, xlim = my_xlims , ylim = my_ylims,
     border = &amp;quot;grey&amp;quot;, bg = &amp;quot;lightblue&amp;quot;, col = &amp;quot;white&amp;quot;)
plot(gl.lcc, add = TRUE, border = &amp;quot;grey&amp;quot;, col = &amp;quot;lightblue&amp;quot;)
points(gbif.lcc, pch = 16, col = alpha(&amp;quot;grey30&amp;quot;, 0.2),
       cex = 0.7)
box() 

## tighten up the left margin:
par(mar = c(0.1,0,0.1,0.1))
plot(na.lcc, xlim = my_xlims , ylim = my_ylims,
     border = &amp;quot;grey&amp;quot;, bg = &amp;quot;lightblue&amp;quot;, col = &amp;quot;white&amp;quot;)
plot(gl.lcc, add = TRUE, border = &amp;quot;grey&amp;quot;, col = &amp;quot;lightblue&amp;quot;)
plot(predLCC, add = TRUE, legend = FALSE)

## plotted again to put the border lines on top:
plot(na.lcc, border = &amp;quot;grey&amp;quot;, add = TRUE) 
box()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-10-30-r-maps_files/figure-html/pretty%20plot-1.png&#34; width=&#34;696&#34; /&gt;&lt;/p&gt;
&lt;p&gt;If you want to plot the state/provincial borders &lt;em&gt;on top&lt;/em&gt; of the raster,
you need to add those layers last. But you can’t set the background colour
of the raster layer to “lightblue” (or at least I haven’t figured that
out), so the ocean stays white. I get around that by plotting the
boundaries twice, first to set the background colour, and then to put the
state lines on top of the raster.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Emacs for Bioinformatics #2: Orgmode</title>
      <link>https://plantarum.ca/2020/06/17/emacs-tutorial-02/</link>
      <pubDate>Wed, 17 Jun 2020 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2020/06/17/emacs-tutorial-02/</guid>
      <description>


&lt;p&gt;In the &lt;a href=&#34;https://plantarum.ca/2020/06/16/emacs-tutorial-01/&#34;&gt;previous post&lt;/a&gt; we took a first look at Emacs, including creating and editing a script file, and passing commands from the file to the shell terminal. At the end of that post, I recommended you check out the built-in tutorial (accessible via &lt;code&gt;C-h t&lt;/code&gt; from within Emacs). In this post I assume you’ve done so, although I won’t expect you’ve understood everything you found there.&lt;/p&gt;
&lt;div id=&#34;orgmode&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Orgmode&lt;/h1&gt;
&lt;p&gt;Last time, I promised a better way to integrate scripts, output, and notes in a single file. The better way is provided by &lt;a href=&#34;https://orgmode.org/&#34; title=&#34;Orgmode Website&#34;&gt;orgmode&lt;/a&gt;, which comes bundled with Emacs. Orgmode evolved from a simple task manager into a full-fledged information management system, especially useful for people whose work includes computer code. This lesson will focus on getting started with orgmode, and using it to help explore some Illumina data (you don’t need to understand Illumina data to follow along).&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;objectives&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Objectives&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Create a new org file with:
&lt;ul&gt;
&lt;li&gt;text notes&lt;/li&gt;
&lt;li&gt;Bash code blocks&lt;/li&gt;
&lt;li&gt;the results of executing the code blocks&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;setting-up&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Setting Up&lt;/h1&gt;
&lt;p&gt;First off, we need to create a new file. We do this from the menu “File - Visit New File” option. Name it &lt;code&gt;gbs.org&lt;/code&gt;, and put it in the directory we created last time: &lt;code&gt;~/gbs-analysis&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Emacs will recognize this file as an &lt;code&gt;org&lt;/code&gt; file, based on the &lt;code&gt;.org&lt;/code&gt; suffix, and will turn on &lt;code&gt;orgmode&lt;/code&gt; for us. The file is still a normal plain text file, but Emacs will look for special tags that identify which parts are code, what is a heading, which text to make bold, etc.&lt;/p&gt;
&lt;p&gt;We need to customize one of the &lt;code&gt;orgmode&lt;/code&gt; options before we start. For security, by default &lt;code&gt;orgmode&lt;/code&gt; won’t allow you to execute code in programming languages other than Emacs’ built-in language &lt;code&gt;elisp&lt;/code&gt;. We need to add &lt;code&gt;bash&lt;/code&gt; to the list of permitted languages, so we can use it for our scripts.&lt;/p&gt;
&lt;p&gt;We can find the options in the menu “Org - Customize”. At first, there are only two options here. Select the “Expand this Menu” option, then open the “Org - Customize” menu again. Now you have a second menu item named “Customize”. This will lead you to a long list of options. We’ll ignore them, and pick the “Babel” option from up near the top:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/config-menu.jpg&#34; title=&#34;The Org Customize Menu&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This opens up the Babel customize window:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/babel-custom.jpg&#34; title=&#34;The Babel Customization Window&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Scroll down until you find the &lt;code&gt;Org Babel Load Languages&lt;/code&gt; option. Click on the triangle to reveal the current settings:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/babel-load.jpg&#34; title=&#34;The Babel Load Languages Defaults&#34; /&gt;&lt;/p&gt;
&lt;p&gt;To begin with, there’s only one entry, for Emacs Lisp. Press the &lt;code&gt;INS&lt;/code&gt; button to insert a new option. The &lt;code&gt;Value Menu&lt;/code&gt; will show “Awk”, which we need to change. Click on the “Value Menu” button, and enter “Shell Script” and press enter.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/babel-add-shell.jpg&#34; title=&#34;Adding Shell Scripts to the Load Languages&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now press the “Save” button in the tool bar to set all options, and press the &lt;code&gt;q&lt;/code&gt; key to close the window.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;writing-our-script&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Writing our Script&lt;/h1&gt;
&lt;p&gt;We can now enter whatever we like in the file: introductory notes, comments about the code we will create, and the code itself. This is just a plain text file, so there is no restriction.&lt;/p&gt;
&lt;p&gt;However, if we use special tags, we can insert “code blocks” that we can run directly in this file. A shell code block looks like this:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;#+begin_src bash
ls
#+end_src&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With the cursor anywhere in this code block, we can run the code by pressing &lt;code&gt;C-c C-c&lt;/code&gt;. Note that Emacs will ask us to confirm we want to run the code, and then it runs it:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/ls-out.jpg&#34; title=&#34;ls command output&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Emacs has taken our &lt;code&gt;ls&lt;/code&gt; command, run it through a shell interpreter, and inserted the results back in our file. Again, the file is still just plain text, so we can add any comments we like, although it’s best to keep our comments out of the code blocks, and not put anything between the code block and the results it generates:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/ls-plus-text.jpg&#34; title=&#34;ls with annotations&#34; /&gt;&lt;/p&gt;
&lt;p&gt;It’s often handy to have basic commands like &lt;code&gt;ls&lt;/code&gt; included in your scripts, to confirm the files you think you are working with are in fact where you want them to be.&lt;/p&gt;
&lt;p&gt;I use code blocks to build templates for my analyses, along with useful
notes and sanity checks. The following template includes two ‘checks’ before and after the command &lt;code&gt;process_radtags&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;Examine the contents of our sequencing files:
#+begin_src bash
zcat data/Vaccinium3-*.fastq.gz | head -4
#+end_src

Demultiplex: expect this to take about 10 minutes:
#+begin_src bash
process_radtags -p ./data/ -b ./data/barcodes.csv \
                -i gzfastq -o ./output -q -r \
                --inline-null -e pstI
#+end_src

Demultiplexed sequencing reads:
#+begin_src bash
zcat output/samp-001.fq.gz | head -4 
#+end_src&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that the &lt;code&gt;process_radtags&lt;/code&gt; command is longer than a single line. Put a single &lt;code&gt;\&lt;/code&gt; symbol as the very last character on each line except the last, so that the shell treats the whole thing as a single command.&lt;/p&gt;
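&lt;p&gt;As a minimal sketch of how the shell handles that trailing &lt;code&gt;\&lt;/code&gt;, the two assignments below should capture the same output, because the shell joins the continued lines before it runs the command. The variable names &lt;code&gt;single&lt;/code&gt; and &lt;code&gt;split&lt;/code&gt; are invented for this illustration:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;## One command written on a single line:
single=$(echo one two three)

## The same command split across two lines; the backslash must be the
## very last character on its line (no trailing spaces):
split=$(echo one two \
             three)

## Both variables now hold: one two three
echo $single
echo $split&lt;/code&gt;&lt;/pre&gt;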
&lt;p&gt;I can step through this sequence, pressing &lt;code&gt;C-c C-c&lt;/code&gt; on each block, to work through the analysis. My notes remind me that I’ll have time to get a cup of tea while I wait for &lt;code&gt;process_radtags&lt;/code&gt; to finish.&lt;/p&gt;
&lt;p&gt;After I run the code, my file looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/multi-chunk.jpg&#34; title=&#34;Three code chunks with their output&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Note that the results for the &lt;code&gt;process_radtags&lt;/code&gt; chunk are empty. That’s good, that means &lt;code&gt;process_radtags&lt;/code&gt; completed without issuing any warnings or errors. The files it produced are saved on my hard drive, and I confirm that by peeking inside one of them in the third chunk.&lt;/p&gt;
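&lt;p&gt;The &lt;code&gt;head -4&lt;/code&gt; in those checks relies on each FASTQ record taking up four lines in the usual Illumina layout (header, sequence, separator, quality scores), so four lines is one read. Here is a small self-contained sketch of the same kind of check. The two reads are made up for illustration, and it pipes them through &lt;code&gt;gzip&lt;/code&gt; rather than reading a real file:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;## Build two tiny made-up FASTQ records, compress them, then peek at
## the first record only, as in the checks above:
printf %s\\n @read1 ACGT + IIII @read2 TTGA + IIII | gzip -c | zcat | head -4&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Only the four lines of the first record should come back. If you see something else in a real file, it probably isn’t the FASTQ you expected.&lt;/p&gt;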
&lt;p&gt;This is still a simple example, but I hope that now you can start to see some of the potential for using &lt;code&gt;orgmode&lt;/code&gt; to structure your analysis.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;conveniences&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Conveniences&lt;/h1&gt;
&lt;p&gt;If you do think you’d like to use &lt;code&gt;orgmode&lt;/code&gt; in your analyses, you might think entering all those &lt;code&gt;#+BEGIN&lt;/code&gt; tags will get tedious. That’s true. There are keyboard shortcuts to help though. On older versions of &lt;code&gt;orgmode&lt;/code&gt;, up to version 9.1, you can enter the following shortcut. Starting at the beginning of a line, type&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;&amp;lt;s&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and then press the &lt;code&gt;TAB&lt;/code&gt; key. The &lt;code&gt;&amp;lt;s&lt;/code&gt; characters will be replaced by:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#+begin_src
#+end_src&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You can then add the &lt;code&gt;bash&lt;/code&gt; keyword at the end of the first line, and start your code between the two lines.&lt;/p&gt;
&lt;p&gt;If that doesn’t work, your &lt;code&gt;orgmode&lt;/code&gt; is probably one of the newer releases. Starting with &lt;code&gt;orgmode&lt;/code&gt; version 9.2, instead of the above, you can press &lt;code&gt;C-c C-,&lt;/code&gt; (i.e., control-c, control-comma), then select &lt;code&gt;s&lt;/code&gt; for source code, and you’ll get the &lt;code&gt;begin&lt;/code&gt; and &lt;code&gt;end&lt;/code&gt; tags you need.&lt;/p&gt;
&lt;p&gt;One other shortcut: you’ll probably get tired of Emacs asking you if you really want to execute code each time you run a code chunk. You can turn off that check in the customize menu: “Org - Customize - Customize - Babel”, scroll down to find “Org Confirm Babel Evaluate”, click the &lt;code&gt;toggle&lt;/code&gt; button to “off”, and click the “Save” button on the menu. You’ll never be asked again, so be careful!&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Emacs for Bioinformatics: Getting Started</title>
      <link>https://plantarum.ca/2020/06/16/emacs-tutorial-01/</link>
      <pubDate>Tue, 16 Jun 2020 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2020/06/16/emacs-tutorial-01/</guid>
      <description>
&lt;script src=&#34;https://plantarum.ca/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;


&lt;div id=&#34;emacs-for-bioinformatics&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Emacs for Bioinformatics&lt;/h1&gt;
&lt;p&gt;&lt;a href=&#34;https://www.gnu.org/software/emacs/&#34; title=&#34;Emacs webpage&#34;&gt;GNU Emacs&lt;/a&gt; is likely one of the oldest pieces of software still in active development. It is also one of the most powerful systems for editing code, built by and for hackers. However, it does have a reputation for unwieldy complexity. I think this is largely undeserved. While it would take years of study to understand all its nooks and crannies, if you focus on just those features that you actually need, you can get going fairly quickly.&lt;/p&gt;
&lt;p&gt;The purpose of this series of posts is to introduce new coders to the benefits of Emacs. I’m specifically targeting biologists, but hopefully the content will be generally useful.&lt;/p&gt;
&lt;p&gt;You may have heard of &lt;a href=&#34;https://orgmode.org/&#34; title=&#34;Orgmode Website&#34;&gt;orgmode&lt;/a&gt;. This is one of Emacs’ killer features, and will feature prominently in future posts. However, for this first post we’ll stick to more straightforward features, and a simple task: developing a Bash script.&lt;/p&gt;
&lt;p&gt;One last caveat: I work on Linux, and so the examples here will assume you are too. Most of Emacs’ features work the same on Windows and Macs, but interacting with external processes, like a Bash shell, may require additional configuration on those systems.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;objectives&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Objectives&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;Create a new bash shell script:
&lt;ul&gt;
&lt;li&gt;write the code&lt;/li&gt;
&lt;li&gt;run the code in a shell&lt;/li&gt;
&lt;li&gt;save the results&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;More importantly, by the end of this quick example, I hope you’ll see that Emacs, while a little weird, isn’t &lt;em&gt;that&lt;/em&gt; different from more conventional programs. And if I can convince you of that, then we’ll be ready to explore some of the more useful features it has for us.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;getting-started&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Getting Started&lt;/h1&gt;
&lt;p&gt;You’ll need to install Emacs if it isn’t already installed. On Ubuntu, you can do this directly from the Software Center. Other distributions will undoubtedly have it in their repositories, and you can get the latest release directly from &lt;a href=&#34;https://www.gnu.org/software/emacs/&#34; title=&#34;GNU Emacs&#34;&gt;GNU&lt;/a&gt; for Windows and Mac (and Linux)&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Start by opening Emacs from the launcher, or with the command &lt;code&gt;emacs&lt;/code&gt; on the command line. This should open up a new graphical window that looks something like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/emacs-start.jpg&#34; title=&#34;The Emacs Welcome Screen&#34; /&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NB:&lt;/strong&gt; Emacs uses a &lt;strong&gt;lot&lt;/strong&gt; of keyboard shortcuts. I’ll be introducing them slowly, to keep things simple. However, it’s possible you’ll accidentally hit one, and something strange will happen. Most of the time, you can fix this by &lt;code&gt;quitting&lt;/code&gt;, which you can do with the keyboard shortcut &lt;code&gt;C-g&lt;/code&gt; (that is, hold the &lt;strong&gt;C&lt;/strong&gt;ontrol key down while pressing &lt;code&gt;g&lt;/code&gt;). It may also be helpful to know there’s an &lt;code&gt;undo&lt;/code&gt; option in the “Edit” menu, if you accidentally change a bunch of text.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;First off, we’ll create a new file: click on the “File” menu, and select “Visit New File.” You’ll see a file browser. We use the browser to create a new folder &lt;code&gt;gbs-analysis&lt;/code&gt;, and open a new file &lt;code&gt;script.sh&lt;/code&gt; in that folder.&lt;/p&gt;
&lt;p&gt;Now we have an empty file for our script. We’ll add a few commands to set up our project:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;mkdir data ## for our raw data
mkdir output ## analysis results
cp ~/dl/Vaccinium3-SingleRead300_S1_L001_R1_001.fastq.gz data&lt;/code&gt;&lt;/pre&gt;
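&lt;p&gt;One caveat if you end up running these lines more than once: plain &lt;code&gt;mkdir&lt;/code&gt; reports an error when the directory already exists. A possible variant, assuming you’re happy to reuse existing directories, is &lt;code&gt;mkdir -p&lt;/code&gt;, which quietly does nothing in that case. The scratch directory from &lt;code&gt;mktemp -d&lt;/code&gt; is only there so this sketch doesn’t touch your real project:&lt;/p&gt;
&lt;pre class=&#34;bash&#34;&gt;&lt;code&gt;cd $(mktemp -d)        ## scratch directory, for illustration only
mkdir -p data output   ## creates both directories
mkdir -p data output   ## running it again is not an error
ls                     ## lists: data output&lt;/code&gt;&lt;/pre&gt;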
&lt;p&gt;To save the file, we can use the menu “File - Save.” We’ll do this a lot, so we can use a keyboard shortcut instead: &lt;code&gt;C-x C-s&lt;/code&gt;. That is, hold the &lt;strong&gt;C&lt;/strong&gt;ontrol key down and press &lt;code&gt;x&lt;/code&gt;, then &lt;code&gt;s&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Now we’re ready to run the code in a shell. We can start a shell from the menu: “Tools - Shell Commands - Run Shell Interactively.”&lt;/p&gt;
&lt;p&gt;This opens a new shell terminal inside Emacs. In my case, it opens below the script window. Depending on the size and shape of your screen, it might open beside your script instead:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/emacs-shell.jpg&#34; title=&#34;Emacs Interactive Shell&#34; /&gt;&lt;/p&gt;
&lt;p&gt;This terminal is almost normal: you can enter commands at the prompt and view the output, just as you would with a regular terminal. To do that, you need to move the cursor to the prompt. You can do that with your mouse. However, moving back and forth between different windows in Emacs is something that we’ll do a lot, so there’s a keyboard shortcut for that too: &lt;code&gt;C-x o&lt;/code&gt;. That moves the cursor from one window to the &lt;strong&gt;o&lt;/strong&gt;ther. Try that a few times.&lt;/p&gt;
&lt;p&gt;Now we have a script window, and a terminal, and we’d like to run a few lines from our script in the terminal. First, we move the cursor back to our script window (&lt;code&gt;C-x o&lt;/code&gt;), and then move it to the beginning of the first line (you can use the arrow keys).&lt;/p&gt;
&lt;p&gt;Here we encounter one of Emacs’ quirks: it doesn’t use the usual &lt;code&gt;C-c&lt;/code&gt;/&lt;code&gt;C-v&lt;/code&gt; copy and paste convention. Emacs was already 10 or 15 years old when this was developed, and it already had its own way: “killing” and “yanking.” We kill with &lt;code&gt;C-k&lt;/code&gt;, and yank with &lt;code&gt;C-y&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;So to copy the first line, we first kill it with &lt;code&gt;C-k&lt;/code&gt;. And it’s gone!&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/emacs-kill.jpg&#34; title=&#34;Emacs Killing Text&#34; /&gt;&lt;/p&gt;
&lt;p&gt;We can get it back by “yanking” it with &lt;code&gt;C-y&lt;/code&gt;. Now we’re back where we started, except that the contents of the line are stored in the “kill-ring.” Now switch back to the shell prompt and yank again:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/emacs-yank.jpg&#34; title=&#34;Emacs Yanking Text&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Now the line we killed is on the command prompt and ready to run. Hit enter and we’ll create our new directory. Repeat with the next line. The third line won’t work of course, because you don’t have the same file on your computer, but I’ll use it here to round out my example.&lt;/p&gt;
&lt;p&gt;We can enter commands directly in the terminal. I’ll use &lt;code&gt;ls&lt;/code&gt; to check that the new files and directories are all where we put them:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/images/emacs-shell-output.jpg&#34; title=&#34;Emacs Shell Output&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Remember I said the terminal is ‘almost’ normal. One of the things that’s not normal about it is that you can move around in it with the arrow keys, and kill and yank text if you like. You can even enter additional text in the transcript. That might be useful if you want to annotate something; just be careful not to hit &lt;code&gt;enter&lt;/code&gt; when you do this, or the text will get entered at the prompt, which you usually don’t want to do.&lt;/p&gt;
&lt;p&gt;You can also save the text in the terminal window as a text file, again using the &lt;code&gt;File - Save&lt;/code&gt; dialog, or the keyboard shortcut &lt;code&gt;C-x C-s&lt;/code&gt;. That’s not very useful for this toy example. But if you had spent hours on a bioinformatic pipeline, you could then save a record of everything you’d done as a text file.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;so-what&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;So… what?&lt;/h1&gt;
&lt;p&gt;That may have been a little underwhelming, if you’ve ever listened to an Emacs zealot ramble on about how mind-blowing this program is. What I hope you got from this short introduction is:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Emacs has some quirks, but it’s basically just a text editor that you can use like any other editor, with a few concessions&lt;/li&gt;
&lt;li&gt;Having your script file and terminal in the same program is handy for developing scripts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are some obvious deficiencies here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;transferring text from the script to the shell is a bit clunky&lt;/li&gt;
&lt;li&gt;maintaining a script file and a separate file with the shell output will get confusing quickly&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It would be much better if we could:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;write our script in a single file&lt;/li&gt;
&lt;li&gt;have the commands sent automatically to the shell&lt;/li&gt;
&lt;li&gt;have the results pasted back into the script file at the appropriate location&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That would be a big improvement. Emacs can do that and more:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;mix different languages, including Bash, R, Python and more in one file&lt;/li&gt;
&lt;li&gt;include human language, including sections, formatting (bold, italics), and links to other files and websites – &lt;em&gt;all in the same source file&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That’s where &lt;code&gt;orgmode&lt;/code&gt; comes in, and it really is a killer feature. But before we get there, we need to get a bit more familiar with the basics of Emacs. For this, I strongly recommend the built-in tutorial. You can start it with &lt;code&gt;C-h t&lt;/code&gt; from Emacs. It takes 20 or 30 minutes, and explains the basics of getting around in the program. Try that out, and we’ll look at &lt;code&gt;orgmode&lt;/code&gt; in the next post.&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;There are several customized Emacs distributions that provide improved default settings. Some of them are quite good, but I’m going to stick to standard Emacs to minimize distractions.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Publication Quality R Figures</title>
      <link>https://plantarum.ca/2014/02/19/r-graphics/</link>
      <pubDate>Wed, 19 Feb 2014 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2014/02/19/r-graphics/</guid>
      <description>
&lt;script src=&#34;https://plantarum.ca/rmarkdown-libs/header-attrs/header-attrs.js&#34;&gt;&lt;/script&gt;

&lt;div id=&#34;TOC&#34;&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#introduction&#34;&gt;Introduction&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#learning-objectives&#34;&gt;Learning Objectives&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#pre-requisites&#34;&gt;Pre-requisites&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#motivation&#34;&gt;Motivation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#building-our-plot&#34;&gt;Building Our Plot&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#size&#34;&gt;Size&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#content&#34;&gt;Content&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#plot-symbols&#34;&gt;Plot Symbols&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#margins&#34;&gt;Margins&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#axes&#34;&gt;Axes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#the-finished-plot&#34;&gt;The finished plot&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#exercise-1-adding-a-legend&#34;&gt;Exercise 1: adding a legend&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#additional-customization&#34;&gt;Additional Customization&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#selecting-plot-symbols&#34;&gt;Selecting Plot Symbols&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#panels&#34;&gt;Panels&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#exercise-2-completing-the-panel&#34;&gt;Exercise 2: Completing the Panel&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#image-formats&#34;&gt;Image Formats&lt;/a&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;#raster-images&#34;&gt;Raster Images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;#vector-images&#34;&gt;Vector Images&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;

&lt;div class=&#34;figure&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:final&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://plantarum.ca/tutorials/2020-07-20-base-R-plots_files/figure-html/final-1.png&#34; alt=&#34;A. Iris Sepal Size by Species. B. Iris Petal Width&#34; width=&#34;696&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: A. Iris Sepal Size by Species. B. Iris Petal Width
&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;introduction&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Introduction&lt;/h1&gt;
&lt;div id=&#34;learning-objectives&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Learning Objectives&lt;/h2&gt;
&lt;p&gt;At the end of this lesson, you should be able to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Customize plots produced with the R base graphics system&lt;/li&gt;
&lt;li&gt;Design multi-panel plots&lt;/li&gt;
&lt;li&gt;Design plots to suit the publication requirements of a journal&lt;/li&gt;
&lt;li&gt;Save your plots as high-resolution raster or vector image files as
required by your publisher&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;pre-requisites&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Pre-requisites&lt;/h2&gt;
&lt;p&gt;You will need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A recent version of R installed on your computer&lt;/li&gt;
&lt;li&gt;Familiarity with editing R scripts and passing commands from a script to the R
interpreter&lt;/li&gt;
&lt;li&gt;Note that RStudio is not ideal for this lesson, due to limitations in how
it processes plotting commands; the default RGui installed on Windows or
Mac will work better&lt;/li&gt;
&lt;li&gt;These notes!&lt;/li&gt;
&lt;li&gt;Optionally, some of your own data to work with during the exercises&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div id=&#34;motivation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Motivation&lt;/h2&gt;
&lt;p&gt;You have several options for plotting with R. The simplest is the built-in
or base graphics package. Base graphics are less powerful than newer
alternatives like lattice or ggplot2. On the other hand, it’s much easier
to customize base graphics than the others. For this reason, I prefer to
use the built-in functions when preparing single-panel plots.&lt;/p&gt;
&lt;p&gt;&lt;code&gt;ggplot2&lt;/code&gt; is definitely worth investigating, especially if you want to
produce complex multi-panel faceted plots. The &lt;a href=&#34;http://ggplot2.org/&#34;&gt;official
website&lt;/a&gt; has all the documentation. Roger Peng has
also posted a very nice introductory &lt;a href=&#34;https://www.youtube.com/watch?v=HeqHMM4ziXA&amp;amp;feature=youtube_gdata_player&#34;&gt;video on
YouTube&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;building-our-plot&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Building Our Plot&lt;/h1&gt;
&lt;p&gt;For this example, we’ll use the guidelines provided by the &lt;a href=&#34;https://bsapubs.onlinelibrary.wiley.com/hub/journal/15372197/homepage/forauthors#ps&#34;&gt;American
Journal of
Botany&lt;/a&gt;. AJB
accepts figures 3.5 inches (1 column), 5-6 inches (1.5 columns), or 7.25
inches wide (2 columns). The height can be up to 9 inches. We’ll start with
a one-column plot, so the dimensions should be 3.5 inches wide.&lt;/p&gt;
&lt;p&gt;The figure we’ll plot is from the built-in &lt;code&gt;iris&lt;/code&gt; data set. We’ll do a
simple scatterplot of &lt;code&gt;Sepal.Length&lt;/code&gt; against &lt;code&gt;Sepal.Width&lt;/code&gt;.&lt;/p&gt;
&lt;div id=&#34;size&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Size&lt;/h2&gt;
&lt;p&gt;Let’s start with a square. If we need more height, we can increase the size
as necessary. Similarly, if we decide we need to stretch our figure over
two columns, we can change the width later.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Note that RStudio isn’t the best environment for this exercise.&lt;/strong&gt;
Unfortunately, it’s not possible to create new plot windows in RStudio, so
you can’t specify the dimensions of the figure for the on-screen display.
Consequently, when you save your figure to an image file, you won’t
necessarily get exactly what you see on the screen. In many cases, this may
well be fine. But double-check the image file to make sure that you get
what you expected. If you didn’t, it may be because the on-screen display
and the saved image were not close enough to the same dimensions.&lt;/p&gt;
&lt;p&gt;To set up the canvas for our plot, start a new device:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dev.new(height = 3.5, width = 3.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;If you are using RStudio, &lt;code&gt;dev.new()&lt;/code&gt; won’t work. Instead, drag the edges
of the plot window to get as close to a 3.5&#34; square as possible.&lt;/p&gt;
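&lt;p&gt;If you aren’t sure what size your current device actually is, one way to check is &lt;code&gt;dev.size()&lt;/code&gt;, which reports the dimensions of the active device. This is only a quick sanity check I’m suggesting here, not part of the plotting code itself:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Report the width and height of the active graphics device, in inches.
## After resizing a plot window by hand, re-run this to see how close
## you are to the 3.5 x 3.5 target.
dev.size(&amp;quot;in&amp;quot;)&lt;/code&gt;&lt;/pre&gt;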
&lt;/div&gt;
&lt;div id=&#34;content&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Content&lt;/h2&gt;
&lt;p&gt;Now that our canvas is ready, we can start placing our graphics. Let’s
start with the default plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(Sepal.Length ~ Sepal.Width, data = iris)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-07-20-base-R-plots_files/figure-html/iris-fig-default-source-1.png&#34; width=&#34;336&#34; /&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;plot-symbols&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Plot Symbols&lt;/h2&gt;
&lt;p&gt;The default plot uses the same symbol for each point. However, our data
frame includes samples from three different species:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;str(iris)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## &amp;#39;data.frame&amp;#39;:    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels &amp;quot;setosa&amp;quot;,&amp;quot;versicolor&amp;quot;,..: 1 1 1 1 1 1 1 1 1 1 ...&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Recall that a factor is just a vector of integers, with each integer having
its own label. We can display the underlying numbers by converting from
factor to numeric:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;as.numeric(iris$Species)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is a very useful feature. It means we can set a different plot symbol
for each species:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(Sepal.Length ~ Sepal.Width, pch = as.numeric(Species),
     data = iris)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-07-20-base-R-plots_files/figure-html/iris-fig-1-1.png&#34; width=&#34;336&#34; /&gt;&lt;/p&gt;
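&lt;p&gt;If the link between a factor’s integer codes and its labels still feels abstract, here is a small check you can run. The variable names (&lt;code&gt;sp&lt;/code&gt;, &lt;code&gt;codes&lt;/code&gt;, &lt;code&gt;labs&lt;/code&gt;) are just for illustration. It rebuilds the species names from the codes, and should return &lt;code&gt;TRUE&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;sp &amp;lt;- iris$Species
codes &amp;lt;- as.integer(sp)   # the underlying integer codes: 1, 2 or 3
labs &amp;lt;- levels(sp)        # one label for each code
## Indexing the labels by the codes rebuilds the original values
identical(labs[codes], as.character(sp))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is the same kind of lookup that &lt;code&gt;pch = as.numeric(Species)&lt;/code&gt; relies on: each observation’s code picks out one entry from a short vector.&lt;/p&gt;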
&lt;/div&gt;
&lt;div id=&#34;margins&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Margins&lt;/h2&gt;
&lt;p&gt;Now we can see what we’re working with, but the layout isn’t ideal. In
particular, the plot is very small relative to the size of the figure. We
can fix this with the &lt;code&gt;mar&lt;/code&gt; parameter. &lt;code&gt;mar&lt;/code&gt; takes a vector of four
numbers, which set the width of the margin on the bottom, left, top and
right sides respectively (remember clockwise from the bottom!). These
numbers refer to the width of each margin in &lt;code&gt;lines&lt;/code&gt; — i.e., the width
required for a single line of text. The default is &lt;code&gt;c(5, 4, 4, 2) + 0.1&lt;/code&gt;.
The top margin in particular is usually too wide. We will very rarely add a
title to a published figure, so we don’t need to set aside space for it.&lt;/p&gt;
&lt;p&gt;Set the value of &lt;code&gt;mar&lt;/code&gt; with the &lt;code&gt;par&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mar = c(3, 3, 0.5, 0.5))
plot(Sepal.Length ~ Sepal.Width, pch = as.numeric(Species),
     data = iris)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-07-20-base-R-plots_files/figure-html/iris-fig-2-1.png&#34; width=&#34;336&#34; /&gt;&lt;/p&gt;
&lt;p&gt;That’s better. But we’ve lost our axis labels. They aren’t actually lost,
but they are plotted outside of the margins we’ve set, so they are no
longer visible. I find the defaults that R uses for the axes to be larger
than we need. Better to turn off the axes entirely and replot them
ourselves.&lt;/p&gt;
&lt;p&gt;Note that once the margins are set with &lt;code&gt;par()&lt;/code&gt;, they will keep their value
until we open a new plot window, or reset them.&lt;/p&gt;
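&lt;p&gt;One common way to reset them, sketched here as an optional habit rather than something the rest of this lesson depends on, is to keep the value that &lt;code&gt;par()&lt;/code&gt; returns when you change a setting. The name &lt;code&gt;op&lt;/code&gt; is arbitrary:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## par() returns the previous settings (invisibly) when you change them
op &amp;lt;- par(mar = c(3, 3, 0.5, 0.5))
par(&amp;quot;mar&amp;quot;)     # check the margins now in effect
plot(Sepal.Length ~ Sepal.Width, pch = as.numeric(Species),
     data = iris)
par(op)          # restore the margins we had before&lt;/code&gt;&lt;/pre&gt;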
&lt;/div&gt;
&lt;div id=&#34;axes&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Axes&lt;/h2&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mar = c(3, 3, 0.5, 0.5))
plot(Sepal.Length ~ Sepal.Width, pch = as.numeric(Species),
     data = iris, 
     ann = FALSE,      # turn off axis labels
     axes = FALSE)     # turn off axis ticks&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-07-20-base-R-plots_files/figure-html/iris-fig-3-1.png&#34; width=&#34;336&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Notice that &lt;code&gt;axes = FALSE&lt;/code&gt; has turned off the box around our plot. We can put it back easily:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;box()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now we can explicitly add each axis, specifying its size and placement:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;axis(side = 1, tcl = -0.2, mgp = c(3, 0.3, 0),
     cex.axis = 0.8) 
axis(side = 2, tcl = -0.2, mgp = c(3, 0.3, 0),
     cex.axis = 0.8) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-07-20-base-R-plots_files/figure-html/iris-fig-3b-1.png&#34; width=&#34;336&#34; /&gt;&lt;/p&gt;
&lt;p&gt;What just happened? Let’s break down the arguments to &lt;code&gt;axis&lt;/code&gt;:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;side&lt;/code&gt;:&lt;/dt&gt;
&lt;dd&gt;which side of the plot, clockwise from bottom, same as for &lt;code&gt;mar&lt;/code&gt;
above
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;tcl&lt;/code&gt;:&lt;/dt&gt;
&lt;dd&gt;length of the ticks. Negative values indicate extending outwards
from plot, positive values extend inward. The default is -0.5, which I
find a bit too long.
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;mgp&lt;/code&gt;:&lt;/dt&gt;
&lt;dd&gt;&lt;code&gt;margin line&lt;/code&gt;, a vector of three numbers, which indicate the
position of the axis title, axis labels, and axis line, respectively. The
values are the number of &lt;code&gt;lines&lt;/code&gt; away from the plot border to place each
item, with &lt;code&gt;0&lt;/code&gt; indicating the margin of the plot area. Note that title
doesn’t matter here, since we aren’t using an axis title (yet).
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;cex.axis&lt;/code&gt;:&lt;/dt&gt;
&lt;dd&gt;axis character expansion. Scale the size of the tick labels.
&amp;lt; 1 reduces the size, &amp;gt; 1 increases the size.
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;Now we can add our axis titles back in:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mtext(&amp;quot;Sepal Width&amp;quot;, side = 1, line = 1.5)
mtext(&amp;quot;Sepal Length&amp;quot;, side = 2, line = 1.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here, we can use &lt;code&gt;line&lt;/code&gt; to adjust the distance between the label text and the axis.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-finished-plot&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;The finished plot&lt;/h2&gt;
&lt;p&gt;Putting this all together gives us the following plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mar = c(3, 3, 0.5, 0.5))
plot(Sepal.Length ~ Sepal.Width, pch = as.numeric(Species),
     data = iris, 
     ann = FALSE,      # turn off axis labels
     axes = FALSE)     # turn off axis ticks
box()
axis(side = 1, tcl = -0.2, mgp = c(3, 0.3, 0),
     cex.axis = 0.8) 
axis(side = 2, tcl = -0.2, mgp = c(3, 0.3, 0),
     cex.axis = 0.8) 
mtext(&amp;quot;Sepal Width&amp;quot;, side = 1, line = 1.5)
mtext(&amp;quot;Sepal Length&amp;quot;, side = 2, line = 1.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-07-20-base-R-plots_files/figure-html/iris-finished-plot-1.png&#34; width=&#34;336&#34; /&gt;&lt;/p&gt;
&lt;div id=&#34;exercise-1-adding-a-legend&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Exercise 1: adding a legend&lt;/h3&gt;
&lt;p&gt;We now have a complete figure. We could provide an explanation of the
symbols in the caption, but it might be nicer to have a legend plotted on
the figure. This is easily done with the &lt;code&gt;legend()&lt;/code&gt; function.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;legend(legend = levels(iris$Species), x=&amp;quot;topleft&amp;quot;,
       pch = 1:3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-07-20-base-R-plots_files/figure-html/legend-code-1.png&#34; width=&#34;336&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Since we set the &lt;code&gt;pch&lt;/code&gt; argument in our plots using the factor
&lt;code&gt;iris$Species&lt;/code&gt;, we can use the &lt;code&gt;levels()&lt;/code&gt; function to extract the labels for
the legend. &lt;code&gt;pch&lt;/code&gt; indicates the actual symbols to use, and &lt;code&gt;x&lt;/code&gt; is the
location of the legend.&lt;/p&gt;
&lt;p&gt;This is clearly not “publication quality.” Our plot needs a bit more space
for the legend. See if you can make an attractive plot. The following
options might be helpful:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;&lt;code&gt;dev.new()&lt;/code&gt;:&lt;/dt&gt;
&lt;dd&gt;&lt;code&gt;width, height&lt;/code&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;plot()&lt;/code&gt;:&lt;/dt&gt;
&lt;dd&gt;&lt;code&gt;xlim, ylim, cex&lt;/code&gt;
&lt;/dd&gt;
&lt;dt&gt;&lt;code&gt;legend()&lt;/code&gt;:&lt;/dt&gt;
&lt;dd&gt;&lt;code&gt;x, y, bty, horiz, cex, pt.cex, text.width&lt;/code&gt;
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;If you need a hint, take a look at the next section.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;additional-customization&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Additional Customization&lt;/h1&gt;
&lt;div id=&#34;selecting-plot-symbols&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Selecting Plot Symbols&lt;/h2&gt;
&lt;p&gt;If you want to select different symbols, it’s easy to do using R’s
subsetting syntax. By default, for three levels of our &lt;code&gt;Species&lt;/code&gt; factor, we
get symbols 1, 2, and 3. If instead we wanted to use symbols&lt;a href=&#34;#fn1&#34; class=&#34;footnote-ref&#34; id=&#34;fnref1&#34;&gt;&lt;sup&gt;1&lt;/sup&gt;&lt;/a&gt; 19, 5, and
3, we could do this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mar = c(3, 3, 0.5, 0.5))
mysymbols &amp;lt;- c(19, 5, 3)
plot(Sepal.Length ~ Sepal.Width,
     pch = mysymbols[as.numeric(Species)], data = iris,
     ylim = c(4.0, 8.5), ann = FALSE, axes = FALSE)
box()
axis(side = 1, tcl = -0.2, mgp = c(3, 0.3, 0),
     cex.axis = 0.8) 
axis(side = 2, tcl = -0.2, mgp = c(3, 0.3, 0),
     cex.axis = 0.8) 
mtext(&amp;quot;Sepal Width&amp;quot;, side = 1, line = 1.5)
mtext(&amp;quot;Sepal Length&amp;quot;, side = 2, line = 1.5)
legend(legend = levels(iris$Species), x = &amp;quot;top&amp;quot;,
       pch = mysymbols, horiz = TRUE, bty = &amp;#39;n&amp;#39;,
       cex = 0.9, text.width = c(0.6, 0.7, 0.6))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-07-20-base-R-plots_files/figure-html/symbols-1.png&#34; width=&#34;336&#34; /&gt;&lt;/p&gt;
&lt;p&gt;That’s an important application of &lt;code&gt;R&lt;/code&gt;’s subsetting commands, so make sure
you follow what happened — we subset the &lt;code&gt;mysymbols&lt;/code&gt; vector with the
longer &lt;code&gt;as.numeric(iris$Species)&lt;/code&gt; vector, which converted the values from
&lt;code&gt;(1, 2, 3)&lt;/code&gt; to &lt;code&gt;(19, 5, 3)&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mysymbols&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 19  5  3&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;as.numeric(iris$Species)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3
## [112] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [149] 3 3&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mysymbols[as.numeric(iris$Species)]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;##   [1] 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19
##  [26] 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19
##  [51]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
##  [76]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
## [101]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
## [126]  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3&lt;/code&gt;&lt;/pre&gt;
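&lt;p&gt;A variation you might find easier to read, offered only as a sketch, is to name each symbol after its species and look it up by name rather than by position. The vector &lt;code&gt;named_symbols&lt;/code&gt; is my own hypothetical example; the result is the same set of symbols as above, so the final line should return &lt;code&gt;TRUE&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## Hypothetical alternative: name each symbol after its species
named_symbols &amp;lt;- c(setosa = 19, versicolor = 5, virginica = 3)
## Look up each observation&amp;#39;s symbol by its species name
by_name &amp;lt;- named_symbols[as.character(iris$Species)]
## Compare with the positional lookup used above
all(by_name == c(19, 5, 3)[as.numeric(iris$Species)])&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking up by name doesn’t depend on the order of the factor levels, which may matter if you ever reorder them.&lt;/p&gt;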
&lt;/div&gt;
&lt;div id=&#34;panels&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Panels&lt;/h2&gt;
&lt;p&gt;&lt;code&gt;ggplot2&lt;/code&gt; provides a very sophisticated system for producing multi-panel
plots. But it’s easy enough to create a simple panel using the base
graphics. For this example, let’s do a two-plot horizontal panel, with our
scatter plot in the first position, and a boxplot of petal widths in the
second position. A two-column plot in AJB is 7.25 inches wide:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;dev.new(width = 7.25, height = 3.5)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Next, we need to inform R that we’re splitting the figure into two panels:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mfrow = c(1, 2))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;mfrow&lt;/code&gt; sets the graphics device for rows and columns, in this case one
row, two columns. We can now put our first plot in the first spot:&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-07-20-base-R-plots_files/figure-html/panel1-1.png&#34; width=&#34;696&#34; /&gt;&lt;/p&gt;
&lt;p&gt;After dividing a plot device into panels with &lt;code&gt;mfrow&lt;/code&gt;, the first high-level
plotting command (e.g., &lt;code&gt;plot, boxplot&lt;/code&gt;) will be placed in the first
panel. All subsequent low-level plotting commands (e.g., &lt;code&gt;legend, axis, mtext&lt;/code&gt;) will be added to this same panel. When the next high-level
command is called, it will be placed in the next panel, and focus shifts
with it. So we can now add our boxplot to the second panel:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;boxplot(Petal.Width ~ Species, data = iris)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-07-20-base-R-plots_files/figure-html/panel2-1.png&#34; width=&#34;696&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Note that the margins we set for the first panel are still in effect.
Consequently, we’ve lost the axis labels on our second plot. It’s going to
need some attention to make it look right. We’ll leave that for the next
exercise.&lt;/p&gt;
&lt;p&gt;In the meantime, we have one more requirement to meet. On multi-figure
panels, AJB requires an uppercase letter (A, B, etc) to label each plot.
This label should go in the upper-left corner of each panel. This is easy
to do with the &lt;code&gt;text&lt;/code&gt; command. At the moment, we don’t have space in the
upper-left corner, so we’ll put the labels in the lower-right temporarily.
For example:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## For the first panel:
text(&amp;quot;A&amp;quot;, x = 4.2, y = 4.5, cex = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## For the second panel:
text(&amp;quot;B&amp;quot;, x = 3.2, y = 0.24, cex = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-07-20-base-R-plots_files/figure-html/panel2B-1.png&#34; width=&#34;696&#34; style=&#34;display: block; margin: auto;&#34; /&gt;&lt;/p&gt;
&lt;p&gt;Unfortunately, each figure is plotted on different scales, so placing the
letters in the same position is not straightforward. Luckily, R provides a
function for getting “universal” coordinates for every plot.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## For the first panel:
text(&amp;quot;A&amp;quot;, x = grconvertX(0.9, from=&amp;quot;npc&amp;quot;, to=&amp;quot;user&amp;quot;), 
     y = grconvertY(0.1, from = &amp;quot;npc&amp;quot;, to=&amp;quot;user&amp;quot;), cex = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## For the second panel:
text(&amp;quot;B&amp;quot;, x = grconvertX(0.9, from=&amp;quot;npc&amp;quot;, to=&amp;quot;user&amp;quot;), 
     y = grconvertY(0.1, from = &amp;quot;npc&amp;quot;, to=&amp;quot;user&amp;quot;), cex = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The functions &lt;code&gt;grconvertX&lt;/code&gt; and &lt;code&gt;grconvertY&lt;/code&gt; convert between different
coordinate systems. &lt;code&gt;npc&lt;/code&gt; is “normalized plot coordinates.” In this system,
(0, 0) is the lower left corner of the plot, and (1, 1) is the upper right
corner. &lt;code&gt;user&lt;/code&gt;, on the other hand, is the coordinate system in effect for
the actual plotted data. For our Panel A, that means the lower left
corner is ca. (2.0, 4.0) and the upper right corner is ca. (4.5, 8.5). So
&lt;code&gt;grconvertX(0.9, from = &#34;npc&#34;, to = &#34;user&#34;)&lt;/code&gt; returns the X coordinate to
plot our text 90% of the way from the left edge of the plot to the right edge, regardless of
the scale used in that plot. With this addition, we have the following
code, and the generated panels:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;par(mfrow = c(1, 2))
par(mar = c(3, 3, 0.5, 0.5))
mysymbols &amp;lt;- c(19, 5, 3)
plot(Sepal.Length ~ Sepal.Width,
     pch = mysymbols[as.numeric(Species)], data = iris,
     ylim = c(4.0, 8.5), ann = FALSE, axes = FALSE)
box()
axis(side = 1, tcl = -0.2, mgp = c(3, 0.3, 0),
     cex.axis = 0.8) 
axis(side = 2, tcl = -0.2, mgp = c(3, 0.3, 0),
     cex.axis = 0.8) 
mtext(&amp;quot;Sepal Width&amp;quot;, side = 1, line = 1.5)
mtext(&amp;quot;Sepal Length&amp;quot;, side = 2, line = 1.5)
legend(legend = levels(iris$Species), x = &amp;quot;top&amp;quot;,
       pch = mysymbols, horiz = TRUE, bty = &amp;#39;n&amp;#39;,
       cex = 0.9, text.width = c(0.6, 0.7, 0.6))
## For the first panel:
text(&amp;quot;A&amp;quot;, x = grconvertX(0.9, from=&amp;quot;npc&amp;quot;, to=&amp;quot;user&amp;quot;), 
     y = grconvertY(0.1, from = &amp;quot;npc&amp;quot;, to=&amp;quot;user&amp;quot;), cex = 2)
boxplot(Petal.Width ~ Species, data = iris)
## For the second panel:
text(&amp;quot;B&amp;quot;, x = grconvertX(0.9, from=&amp;quot;npc&amp;quot;, to=&amp;quot;user&amp;quot;), 
     y = grconvertY(0.1, from = &amp;quot;npc&amp;quot;, to=&amp;quot;user&amp;quot;), cex = 2)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;img src=&#34;https://plantarum.ca/tutorials/2020-07-20-base-R-plots_files/figure-html/panel2Bc-1.png&#34; width=&#34;696&#34; /&gt;&lt;/p&gt;
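&lt;p&gt;If you’d like to convince yourself of what &lt;code&gt;grconvertX&lt;/code&gt; is doing, you can reproduce the conversion by hand from &lt;code&gt;par(&amp;quot;usr&amp;quot;)&lt;/code&gt;, which holds the limits of the plot region in user coordinates. This is a minimal check for a plot with ordinary (linear) axes; the names &lt;code&gt;usr&lt;/code&gt;, &lt;code&gt;x_manual&lt;/code&gt; and &lt;code&gt;x_converted&lt;/code&gt; are just for illustration:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plot(Sepal.Length ~ Sepal.Width, data = iris)
## c(x-left, x-right, y-bottom, y-top) in user coordinates
usr &amp;lt;- par(&amp;quot;usr&amp;quot;)
## 90% of the way from the left edge to the right edge, by hand
x_manual &amp;lt;- usr[1] + 0.9 * (usr[2] - usr[1])
## The same position, as computed by grconvertX
x_converted &amp;lt;- grconvertX(0.9, from = &amp;quot;npc&amp;quot;, to = &amp;quot;user&amp;quot;)
all.equal(x_manual, x_converted)&lt;/code&gt;&lt;/pre&gt;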
&lt;div id=&#34;exercise-2-completing-the-panel&#34; class=&#34;section level3&#34;&gt;
&lt;h3&gt;Exercise 2: Completing the Panel&lt;/h3&gt;
&lt;p&gt;There are still a few problems with our panel:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The title of the Y axis on the second panel is not visible&lt;/li&gt;
&lt;li&gt;The panel labels (A and B) are in the wrong positions — they should be
in the top left corners&lt;/li&gt;
&lt;li&gt;Fixing the panel labels will require moving the legend for the first figure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you change &lt;code&gt;ylim&lt;/code&gt;, you can put the legend on the bottom, and make space
for the label at the top. Go ahead and see what you can do with this. You
can use my example, Figure 1 at the top of this article, as a model. There
is a trick to formatting the x-axis of boxplots. When calling the function
&lt;code&gt;axis&lt;/code&gt;, you’ll have to set the &lt;code&gt;at&lt;/code&gt; argument to indicate where to plot the
labels (which should be &lt;code&gt;c(1, 2, 3)&lt;/code&gt;), and you’ll have to set the &lt;code&gt;labels&lt;/code&gt;
argument to indicate what the labels should be.&lt;/p&gt;
&lt;p&gt;Alternatively, try formatting your own data according to the requirements
of a journal in your field.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;image-formats&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Image Formats&lt;/h1&gt;
&lt;p&gt;R can save graphics to a variety of formats, including anything your target
journal might require. In general, you can store your images in one of two
classes of file format:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;raster:&lt;/dt&gt;
&lt;dd&gt;images are stored as a matrix of values, with each value indicating the color of a single pixel in the grid. Best used for photographs. Examples: jpg, tiff, png.
&lt;/dd&gt;
&lt;dt&gt;vector:&lt;/dt&gt;
&lt;dd&gt;images are stored as a series of mathematical instructions for re-creating the display: lines, polygons, text etc. Best used for line drawings. Examples: eps, svg.
&lt;/dd&gt;
&lt;/dl&gt;
&lt;div id=&#34;raster-images&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Raster Images&lt;/h2&gt;
&lt;p&gt;Raster images are stored as a grid of numbers called pixels. Each number
records the colour of a single pixel in the image. As a consequence, the
image resolution is limited by the number of pixels recorded in the file.
In our example, we need a figure 3.5 inches wide. AJB requires a resolution
of 1000 dots per inch (DPI) for line drawings, which means we need a source
image 3500 pixels in both the x and y dimensions. We don’t actually need to do
these calculations, though; R will handle it for us. We just need to pick a
format and set the final resolution.&lt;/p&gt;
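&lt;p&gt;For the curious, the arithmetic is easy to reproduce. The sketch below also gives a rough estimate of the uncompressed image size, under my own assumption of 3 bytes (red, green, blue) per pixel; the actual size of a file depends on the format and its options:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;width_in &amp;lt;- 3.5    # figure width (and height) in inches
res_dpi &amp;lt;- 1000    # resolution required for line drawings
## Pixels along each side of the image
pixels &amp;lt;- width_in * res_dpi
pixels
## Rough uncompressed size in megabytes, assuming 3 bytes per pixel
pixels^2 * 3 / 1e6&lt;/code&gt;&lt;/pre&gt;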
&lt;p&gt;AJB prefers TIFF format for raster files. To generate one we will use the
&lt;code&gt;tiff()&lt;/code&gt; function. Note that this function only sets the file details for
our plot; we need to add the plotting code after we open the file, and
close it when we’re done with &lt;code&gt;dev.off()&lt;/code&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;tiff(filename = &amp;quot;iris.tiff&amp;quot;, width = 3.5, height = 3.5,
     units=&amp;quot;in&amp;quot;, res = 1000, compression = &amp;quot;lzw&amp;quot;)
par(mar = c(3, 3, 0.5, 0.5))
plot(Sepal.Length ~ Sepal.Width, pch = as.numeric(Species),
     data = iris,
     ann = FALSE,      # turn off titles and axis labels
     axes = FALSE)     # turn off the axes and surrounding box
## Insert additional plot code here
dev.off()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;width&lt;/code&gt;, &lt;code&gt;height&lt;/code&gt; and &lt;code&gt;units&lt;/code&gt; set the size of the image; &lt;code&gt;res&lt;/code&gt; sets the
resolution in pixels per inch. &lt;code&gt;compression&lt;/code&gt; reduces the size of the file.
The &lt;code&gt;lzw&lt;/code&gt; option is only available for &lt;code&gt;tiff&lt;/code&gt; files. It’s lossless, which
means the compressed image is just as good as the original, so there’s no
reason not to use it. In this case, it reduces the file size from 36 MB to
366 KB, a 99% reduction!&lt;/p&gt;
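&lt;p&gt;If you’d like to check the effect of compression on your own machine,
something like the following sketch should work. It assumes your R build
has TIFF support, it writes to temporary files (the names &lt;code&gt;plain&lt;/code&gt; and
&lt;code&gt;lzw&lt;/code&gt; are just placeholders), and the exact sizes you see will probably
differ from mine:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;plain &amp;lt;- tempfile(fileext = &amp;quot;.tiff&amp;quot;)
lzw   &amp;lt;- tempfile(fileext = &amp;quot;.tiff&amp;quot;)

## Same plot, written once without compression...
tiff(filename = plain, width = 3.5, height = 3.5,
     units = &amp;quot;in&amp;quot;, res = 1000, compression = &amp;quot;none&amp;quot;)
plot(Sepal.Length ~ Sepal.Width, data = iris)
dev.off()

## ...and once with lossless LZW compression
tiff(filename = lzw, width = 3.5, height = 3.5,
     units = &amp;quot;in&amp;quot;, res = 1000, compression = &amp;quot;lzw&amp;quot;)
plot(Sepal.Length ~ Sepal.Width, data = iris)
dev.off()

## Compare the two file sizes, in KB
file.size(c(plain, lzw)) / 1024&lt;/code&gt;&lt;/pre&gt;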
&lt;p&gt;To create the same image as a &lt;code&gt;jpg&lt;/code&gt; with the same resolution we’d use:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;jpeg(filename = &amp;quot;iris.jpg&amp;quot;, width = 3.5, height = 3.5,
     units = &amp;quot;in&amp;quot;, res = 1000, quality = 85)
## insert plot code here!
dev.off()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;code&gt;jpg&lt;/code&gt; files are always compressed, and they use lossy compression. That
means there is some degradation of the image quality associated with the
compression. The &lt;code&gt;quality&lt;/code&gt; argument determines how aggressively the image
is compressed. Higher values produce larger, less-degraded images. As a
rule of thumb, 85 usually produces fine images at a reasonable size. In
this case, the file is 550K, so a little larger than the compressed TIFF
file.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;vector-images&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Vector Images&lt;/h2&gt;
&lt;p&gt;Vector images are stored as a list of instructions: ‘draw a line from here
to here, put a circle at this coordinate’ etc. As a consequence, they don’t
have an inherent resolution; rather, they can be printed at any resolution
necessary. So we don’t worry about the resolution when creating them, just
the height and width. There are other options we need to be concerned with
here:&lt;/p&gt;
&lt;dl&gt;
&lt;dt&gt;paper:&lt;/dt&gt;
&lt;dd&gt;&lt;code&gt;&#34;special&#34;&lt;/code&gt; indicates that we are making a single image, not a full page
&lt;/dd&gt;
&lt;dt&gt;onefile:&lt;/dt&gt;
&lt;dd&gt;&lt;code&gt;FALSE&lt;/code&gt; indicates that we are making a new file for each image (probably
not necessary with a single image)
&lt;/dd&gt;
&lt;dt&gt;horizontal:&lt;/dt&gt;
&lt;dd&gt;&lt;code&gt;FALSE&lt;/code&gt; indicates we don’t want landscape orientation
&lt;/dd&gt;
&lt;/dl&gt;
&lt;p&gt;To create an &lt;code&gt;eps&lt;/code&gt; file:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;postscript(&amp;quot;iris.eps&amp;quot;, height = 3.5, width = 3.5,
           paper = &amp;quot;special&amp;quot;, onefile = FALSE,
           horizontal = FALSE) 
## insert plot code here!
dev.off()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This file is only 11K, and can be printed at any resolution. That makes
&lt;code&gt;eps&lt;/code&gt; a very convenient format to use. However, you may run into issues
with fonts. By default, &lt;code&gt;eps&lt;/code&gt; files produced by R don’t include the fonts,
just the position of the letters to place on the image. If you need to
embed the fonts, you need to explicitly request this:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;embedFonts(&amp;quot;iris.eps&amp;quot;, outfile=&amp;quot;iris-embed.eps&amp;quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This command creates a new file, &lt;code&gt;iris-embed.eps&lt;/code&gt;, that has the font
information embedded in the file. Fonts can be tricky, and specific details
vary between Windows, Mac and Linux. It’s easiest to stick to the default
font settings, and only dive into custom fonts and settings if you are
required by the publisher.&lt;/p&gt;
&lt;p&gt;Note that &lt;code&gt;pdf&lt;/code&gt; files are more common than &lt;code&gt;eps&lt;/code&gt;. You can create &lt;code&gt;pdf&lt;/code&gt;
image files directly from &lt;code&gt;R&lt;/code&gt; as well, using:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;pdf(&amp;quot;iris.pdf&amp;quot;, height = 3.5, width = 3.5,
    paper = &amp;quot;special&amp;quot;, onefile = FALSE) 
## insert plot code here!
dev.off()&lt;/code&gt;&lt;/pre&gt;
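&lt;p&gt;The other vector format mentioned above, &lt;code&gt;svg&lt;/code&gt;, follows the same
pattern. This is only a sketch: the &lt;code&gt;svg()&lt;/code&gt; device needs an R build with
cairo support, so the example checks for that first, and the file name is
just a placeholder:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;## svg() is only usable when R was built with cairo support
if (capabilities(&amp;quot;cairo&amp;quot;)) {
  svg(&amp;quot;iris.svg&amp;quot;, height = 3.5, width = 3.5)
  par(mar = c(3, 3, 0.5, 0.5))
  plot(Sepal.Length ~ Sepal.Width, pch = as.numeric(Species),
       data = iris)
  dev.off()
}&lt;/code&gt;&lt;/pre&gt;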
&lt;/div&gt;
&lt;/div&gt;
&lt;div class=&#34;footnotes&#34;&gt;
&lt;hr /&gt;
&lt;ol&gt;
&lt;li id=&#34;fn1&#34;&gt;&lt;p&gt;How did I pick 19, 5 and 3? You can see all 25 symbols
with &lt;code&gt;plot(1:25, pch = 1:25)&lt;/code&gt;.&lt;a href=&#34;#fnref1&#34; class=&#34;footnote-back&#34;&gt;↩︎&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Preparing Rubus samples for herbarium study</title>
      <link>https://plantarum.ca/2013/08/25/rubus-herbarium/</link>
      <pubDate>Sun, 25 Aug 2013 00:00:00 +0000</pubDate>
      
      <guid>https://plantarum.ca/2013/08/25/rubus-herbarium/</guid>
      <description>&lt;p&gt;Collecting blackberries and their relatives (&lt;em&gt;Rubus&lt;/em&gt; spp.) for herbarium
study is particularly challenging, and even experienced field-botanists may
not appreciate everything that is involved. More than in other vascular
plant groups, to make a good &lt;em&gt;Rubus&lt;/em&gt; specimen, you need to understand a
bit about their life cycle.&lt;/p&gt;
&lt;figure&gt;
    &lt;img src=&#34;assets/TWS3673-web.jpg&#34;
         alt=&#34;A single Rubus allegheniensis specimen, with the first-year primocane on the right, the second-year floricane on the left, and my expert Rubus presser Charlotte in the middle. Note that the primocane is unbranched, while the floricane has many flowering branches&#34; width=&#34;100%&#34;/&gt; &lt;figcaption&gt;
            &lt;p&gt;A single &lt;em&gt;Rubus allegheniensis&lt;/em&gt; specimen, with the first-year primocane on the right, the second-year floricane on the left, and my expert &lt;em&gt;Rubus&lt;/em&gt; presser Charlotte in the middle. Note that the primocane is unbranched, while the floricane has many flowering branches&lt;/p&gt;
        &lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Blackberries, and most other &lt;em&gt;Rubus&lt;/em&gt;, are perennial shrubs. However, their
stems, called canes, are biennial. In the first year of growth a cane
remains in a vegetative state, and does not flower. Usually, it doesn&amp;rsquo;t
even form any side branches. These canes are referred to as &lt;strong&gt;primocanes&lt;/strong&gt;.
At the end of the season, buds form in the leaf axils and at the shoot tip.&lt;/p&gt;
&lt;p&gt;In the second year of growth, flowering shoots will emerge from these buds.
At this stage, the cane is called a &lt;strong&gt;floricane&lt;/strong&gt;. Growth stops after the
fruits are formed; the floricane doesn&amp;rsquo;t produce any over-wintering buds.&lt;/p&gt;
&lt;p&gt;A mature blackberry plant produces both primocanes and floricanes at the
same time. The floricanes start earlier in the season, since they get a
head-start building from last year&amp;rsquo;s primocanes. By the time the flowers
appear in June or early July, the next batch of primocanes are only just
beginning to emerge from the underground rhizome.&lt;/p&gt;
&lt;p&gt;Unlike those of other angiosperms, blackberry flowers aren&amp;rsquo;t generally useful
taxonomically&lt;sup id=&#34;fnref:1&#34;&gt;&lt;a href=&#34;#fn:1&#34; class=&#34;footnote-ref&#34; role=&#34;doc-noteref&#34;&gt;1&lt;/a&gt;&lt;/sup&gt;. The really interesting characters are the primocane
leaves, the stem armature (prickles, bristles, hairs and glands) and the
mature inflorescences of the floricane. This means the best time to collect
specimens is later in the season, when the fruits are forming or ripe and
the primocanes are well developed.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;First, confirm that you are collecting both a floricane and a primocane,
and they both come from the same rootstock. This is important, as we&amp;rsquo;ve
seen a lot of mixed populations!&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;assets/TWS3669-web.jpg&#34; alt=&#34;Rubus rhizome, confirming the primocane and floricane are part of the same plant. The upper cane is the primocane, and the lower one is the floricane. Primocanes are usually green, and floricanes usually red. This can vary though!&#34; title=&#34;The base of two *Rubus*
canes, with the soil uncovered, showing that they are indeed part of the
same plant.&#34;&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Make at least two separate sheets for each plant. For the first,
select a representative section of the cane from the central part of the
primocane, with a few well-formed mature leaves attached. The cane tips
are easily damaged, and don&amp;rsquo;t generally press well. Similarly, the
leaves at the base of the cane are often damaged or misshapen.&lt;/p&gt;
&lt;p&gt;Pressing the leaves is tricky, particularly on prickly specimens. Keep
in mind that on the mounted sheet you will want to see the intact
outline of a complete leaf, as well as examples of the upper and lower
surface of the leaves.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;assets/TWS3675-web.jpg&#34; alt=&#34;Preparing a herbarium sheet from a Rubus primocane. Note there are two well-formed leaves, and we&amp;rsquo;re pressing them so that both upper and lower surfaces will be visible.&#34; title=&#34;Two *Rubus*
leaves, connected to the same stem, laid flat in a herbarium press&#34;&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Make sure the specimen includes a nice segment of the most heavily-armed
section of the stem. Sometimes the best leaves and the nastiest prickles
are on different sections of the stem. If this is the case, include a
section of the prickliest part of the stem, with the leaves removed, in
your sample.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;assets/TWS3664-web.jpg&#34; alt=&#34;Include a section of the prickliest part of the cane in your collection&#34; title=&#34;A close up of a *Rubus* cane,
showing the large prickles and glandular hairs.&#34;&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Use a second sheet, with a separate newsprint folder, for the floricane.
Get a section of the stem with two or three well-formed inflorescence
branches. Again, the little shoots at the very top or bottom of the cane
are often misshapen, so use the center of the cane. As for the
primocane, make sure you&amp;rsquo;ve got a section of the prickliest cane.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;assets/TWS3678-web.jpg&#34; alt=&#34;Preparing a Rubus floricane sheet&#34; title=&#34;A *Rubus*
cane, showing the flowering branches, laid flat in a herbarium press.&#34;&gt;&lt;/p&gt;
&lt;p&gt;Floricane leaves are incredibly variable. If you can include a few
well-formed leaves in the folder, great. They don&amp;rsquo;t get much attention
taxonomically, though, due in large part to their inconstancy.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you&amp;rsquo;re collecting tissue for DNA work, the freshly-formed leaves at
the end of the primocane are the best source. I had some bad experiences
as a grad student, and so I collect enough for at least four
extractions, and four flow cytometry analyses. Typically 50-100
cm&lt;sup&gt;2&lt;/sup&gt;. Double-bagged, with labels inside and out in case of
catastrophe.&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;assets/TWS3678-web.jpg&#34; alt=&#34;Preparing a sample of Rubus leaf tissue for drying in silica&#34; title=&#34;Two *Rubus* leaves, and a ziplock bag
full of orange silica crystals.&#34;&gt;&lt;/p&gt;
&lt;p&gt;&lt;img src=&#34;assets/TWS3685-web.jpg&#34; alt=&#34;Preparing a sample of Rubus leaf tissue for drying in silica&#34; title=&#34;Two *Rubus* leaves, broken into pieces,
inside the ziplock bag full of orange silica crystals. The inner
bag is inside a larger ziplock, and a label is written on the bag in
black marker, and on a paper slip visible inside the bag.&#34;&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Finally, make sure you note the habit of the plant for the label. The
major growth forms include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;highbush blackberries, with upright, ascending, or somewhat arching stems
that don&amp;rsquo;t reach the ground and don&amp;rsquo;t root at the tip&lt;/li&gt;
&lt;li&gt;doming forms, that start out upright, but arch and may or may not root at
the tips, possibly trailing along the ground as well&lt;/li&gt;
&lt;li&gt;trailing forms, which may either grow prostrate on the ground, or trail
and scramble over other low vegetation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Do your best to describe what you see. It can be challenging to
determine if you&amp;rsquo;re looking at a trailing form, or a highbush form that
is now growing on the ground after having been weighed down by snow or
fallen branches. And some highbush species will look strange and
stunted when growing in poor conditions.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The shape of the inflorescence should be visible on the sheet. The fruits,
however, will either rot or dry down to a pile of seeds. So definitely
include a note on their size (length and width). Also flavour, if you get
some ripe ones. Better in your belly, with a good description on the label,
than pressed into mash untasted!&lt;/p&gt;
&lt;p&gt;Note that this protocol reflects the minimum necessary material for a
useful herbarium collection. If you have time and space for more than two
sheets, and you see interesting bits you&amp;rsquo;d like to include, you can always
collect more. Some of our collections include five or six sheets for a
single plant.&lt;/p&gt;
&lt;section class=&#34;footnotes&#34; role=&#34;doc-endnotes&#34;&gt;
&lt;hr&gt;
&lt;ol&gt;
&lt;li id=&#34;fn:1&#34; role=&#34;doc-endnote&#34;&gt;
&lt;p&gt;What I should say here is that flowers have not been found useful by
previous taxonomic research on blackberries. Sadly, previous taxonomic
research on blackberries is quite a mess, and I am prepared to accept
that anything I think I know about this group is completely wrong. &lt;a href=&#34;#fnref:1&#34; class=&#34;footnote-backref&#34; role=&#34;doc-backlink&#34;&gt;&amp;#x21a9;&amp;#xfe0e;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/section&gt;
</description>
    </item>
    
  </channel>
</rss>
