Structural variants and the SAM format - the long (reads) and short (reads) of it

The SAM file format and its specification is pretty amazing, but also terse and abstract. To really understand what is going on with your reads, you need to play with real-world data and learn from tutorials.

I’ll relay some things I’ve learned over the years, focused on how SAM concepts relate to structural variants.

Disclaimer: I’m a developer of JBrowse 2. This document has some screenshots and links for it, feel free to try it at https://jbrowse.org.

Basics

What is a `SAM` file, and how does it relate to `BAM` and `CRAM`?

A SAM file generally contains “reads” from a sequencer, with information about how they are mapped to a reference genome ¹².
It is generally produced when an aligner takes in raw, unaligned reads (often stored in FASTQ format files) and aligns them to a reference genome ³.
A SAM file is a text format that you can read with your text editor. BAM and CRAM are compressed representations of the SAM format.

You generally see BAM or CRAM in the wild instead of SAM since the files are very large, so the compressed versions are better to store on disk.

See Appendix E on how to convert between SAM/BAM/CRAM.

What is in a `SAM`/`BAM`/`CRAM` file?

A SAM file contains a “header” and a series of “records”.

What is a `SAM` “record”?

A record is a single line in a SAM file, and it generally corresponds to a single read, but as we will see, a split alignment may produce multiple records that refer to the same source read.

Note that the words “read” and “record” are sometimes used interchangeably, but record has the more specific meaning of being a single line in the SAM file.

What are tags in a `SAM` file?

A SAM file has a core set of required fields, and then an arbitrary list of extra columns called tags. The tags have a two-character abbreviation like MQ (mapping quality) or many others. They can be upper or lower case. Upper case tags are reserved for official usages (except those starting with X, Y, or Z, which are reserved for local/user-defined use). See SAMtags.pdf for more details.

What is a `CIGAR` string, and how do you interpret it?

A CIGAR string is a “compact idiosyncratic gapped alignment report”. It tells you about insertions, deletions, and clipping. It is a series of “operators” with lengths.

Insertion example:

50M50I50M

That would be 50bp of matching bases (50M), followed by a 50bp insertion (50I), followed by another 50bp of matches (50M). The 50bp insertion means the read contains 50 bases in the middle, which did not match the reference genome that you are comparing the read to.

Clipping example:

50S50M50S

This means that 50bp matched (50M in the middle of the CIGAR string) and both sides of the read are soft-clipped. The clipping means the aligner was not able to align the reads on either side.

Notes:

Finding mismatches: A CIGAR string match like 50M means 50 bases “matched” the reference genome, but that only means that there are no insertions or deletions in those 50 bases. There could be underlying mismatches in the read compared to the reference. Note: there is also “extended CIGAR” that replaces M with = (exact match) and X (mismatch). Also see Appendix D on the MD tag and finding where the mismatches are, but note that MD tag is tricky
Ambiguity of representation: A CIGAR string with insertions and deletions could be 50M1D1I50M. This string had a 1bp deletion and a 1bp insertion back-to-back. This could be just a mismatch! There is ambiguity in sequence alignment representations. Downstream programs must accommodate this.
Split records and soft-clipping: A CIGAR string with soft-clipping 500S50M this means that 500 bases of the read were not aligned at this position, but 50 bases were!

See SAMv1.pdf for all the CIGAR operators.

If you are working with SAM data, you will often write loops that directly parse CIGAR strings. See Appendix B for handy functions for parsing CIGAR strings. Don’t fear the CIGAR!

Note regarding soft clipping: this will be discussed further in this post, but “supplementary alignments” might indicate that e.g. those 500 bases that were soft-clipped aligned to another region on the genome.

How do you determine “start” and “end” and “strand” of a record?

SAM records do not have easy-to-interpret “start” and “end” and “strand” fields like, say, lines in a BED file do

Instead, each SAM record has:

A single POS coordinate
A CIGAR string
A flags field, that might say “read reverse complemented”

Using this, you can find the “end” coordinate of a record by starting at the POS coordinate and enumerating the reference-consuming operators of the CIGAR: matches (M/=/X), deletions (D), and skips (N). Processing the whole CIGAR effectively gives you the “end” coordinate on the reference genome, so “end” is derived from the CIGAR. The “start” field corresponds to the POS field

What are “forward” and “reverse” strand reads?

Reads can align to the “forward strand” of the reference genome (e.g. it matches the DNA letters as written in the reference genome FASTA file), resulting in “forward strand reads”.

Or, they can align to the reverse complement of the DNA in the reference genome, resulting in “reverse strand reads”.

Strandedness is recorded for each record in the SAM/BAM/CRAM file by having the SAM flags bitwise flag 16 enabled (https://broadinstitute.github.io/picard/explain-flags.html), where this flag is called “read reverse strand”.

Do you count backwards with the `CIGAR` for reverse strand reads?

As mentioned above, you get the “start” and “end” by processing the CIGAR, so one might wonder whether you count “downwards” from POS for reverse strand reads. This is actually not correct though

Here is what the SAM spec has to say about this

“POS is the 1-based leftmost mapping POSition of the first CIGAR operation that “consumes” a reference base”

And regarding CIGAR and other fields:

“For segments that have been mapped to the reverse strand, the recorded SEQ is reverse complemented from the original unmapped sequence and CIGAR, QUAL, and strand-sensitive optional fields are reversed and thus recorded consistently with the sequence bases as represented”

That is a little bit of a mind boggle but effectively, no matter whether the record is forward or reverse strand, you count upwards when processing the CIGAR.

You can see in an example of this in SAMv1 section 1.1 — the read r001/2 is a “reverse strand” read, and its POS, or “start” position, is 37, and then the CIGAR (9M, 9 matches) counts upwards from there, giving it an “end” position of 45.

How are reads ordered in a `SAM` file?

When you run an aligner program like bwa or minimap2, you type e.g.

minimap2 -a reference_genome.fa reads.fq > out.sam

out.sam will be ordered in the same order as the fastq. For preparation for loading into analysis tools or genome browsers, we will often sort the reads using samtools sort which will group by chromosome name, and then increasing in coordinate start position. Then samtools index creates an index file (e.g. bam.bai, cram.crai). The index file lets you quickly query the reads in a specific genomic region (e.g. with samtools view myfile chr10:1000-2000). See Appendix G for how this indexing actually works.

What happens to reads that don’t align to the genome?

If a read failed to align to the reference genome, it may still be in your SAM file, marked as unmapped using the flag column. Sometimes, “dumpster diving” (looking at the unmapped records from a SAM file) can be used to aid structural variant searches (e.g. there may be novel sequence in there not from the reference genome that could be assembled)

Detecting SVs from long reads

Long reads offer a wide array of methods for detecting SVs

Small insertions/deletions: Long reads can completely span moderate sized insertions and deletions, indicated by I or D in a CIGAR string.
Large insertions/deletions: If a long read does not completely span an insertion or deletion, it may be split aligned on either side of the SV or could be soft/hard clipped where it can’t align all the way through an insertion.
Translocations: A split long alignment can span long-range or even inter-chromosomal translocations, so part of the read maps to one chromosome and one part maps to the other
Inversions: A split alignment can span an inversion. In this case, long reads can be split into multiple parts, one part of it aligns in the reverse orientation, while the other part aligns in the forward orientation

There are many other methods for detecting SVs from long reads — not all use mapped reads (some use de novo assembly for example) — but mapped-read methods are still useful to know.

What are split/supplementary/chimeric alignments?

Split alignments, or chimeric alignments, are alignments where part of the read maps to one place, and another part to another. For example, part of a long read may map to chr1 and part of it maps to chr4. It is worth reading the definition of “Chimeric alignment” from SAMv1.pdf when you get the chance.

As SAMv1.pdf tells us, one record is marked as “representative”, sometimes also called the “primary” record, while the other components of the split read are marked “supplementary”, given the 2048 flag. The “primary” record generally has a SEQ field that represents the entirety of the original read’s sequence (with CIGAR soft clipping operators saying which part of that sequence aligned), and the “supplementary alignments” will have SEQ field but sometimes just segments of the original read’s sequence with CIGAR hard clipping operators indicating that it is partial.

Supplementary alignments are especially common with long reads and can signal structural variants: a read may split across a chromosomal fusion, align to either side of a large deletion, or span an inversion (part aligning forward, part reverse).

There is no limitation on how many splits might occur so the split can align to 3, 4, or more different places. Each part of the split puts a new line in the SAM file, and note that all the records also have the same read name, or QNAME (first column of SAM).

What are secondary alignments/multi-mappers?

Secondary alignments generally come from “multi-mappers” where the entire read maps equally well (or at least somewhat equally well) to, say, somewhere on both chr4 and chr1. “Multi mapping” results in secondary alignments, while split reads result in supplementary alignments. See “Multiple mapping” in the SAMv1.pdf for the definition of multi-mapping. Note also that secondary alignments sometimes are missing the SEQ field entirely too, see https://github.com/lh3/minimap2/issues/458#issuecomment-516661855.

I wrote a tool called secondary_rewriter to add the SEQ field back to secondary alignments, which may help in some cases.

What is the `SA` tag?

The SA tag is outputted on each part of the supplementary/split/chimeric alignment, e.g. the primary contains an SA tag that refers to information (e.g. the location) of where all the supplementary alignments were placed, and each of the supplementary alignments also contains an SA tag that refers to the primary alignment and each other supplementary alignment.

Fun fact: The SA tag conceptually can result in a ‘quadratic explosion’ of data, because each part of the split contains references to every other part. For example, if a read is split into 4 pieces, then each record would have an SA tag with 3 segments, so 3*4 segments will be documented in the SA tag. In many cases, this is not a problem, but if you imagine a finished chromosome aligned to a draft assembly, it may get split so many times that this could be a factor.

See SAMtags.pdf for more info on the SA tag.

Visualizing split reads across a breakend or translocation

For an inter-chromosomal translocation, JBrowse 2’s “breakpoint split view” visualizes support for the event, showing reads that are split-aligned across the SV (and connections between paired-end reads across it too).

The SV inspector workflow helps you set these up, and you can launch the breakpoint split view from it. Both work best on Breakends (VCF 4.3 section 5.4) and <TRA> events from VCF, or BEDPE SV calls.

Visualizing a ‘read vs reference’ view given a split alignment

Given any one record of a split read, we can reconstruct what the whole read looks like compared to the reference genome. We don’t need the primary specifically: the record’s SA tag lists every other piece, so the record plus its SA tag describe the whole read.

The primary is special in one respect: it carries the SEQ of the entire read, so it’s where the full read sequence lives if you want to display the bases. (Whether supplementary records also carry the full SEQ depends on the aligner: minimap2 and bwa hard-clip them by default so they store only their aligned portion, but -Y makes them soft-clip and keep the entire SEQ.) JBrowse 2 implements this reconstruction with a synteny-style rendering. [1]

Figure showing JBrowse 2 piecing together a long read vs the reference genome from a single read

To do this, JBrowse takes the CIGAR of the record you started from plus each piece in its SA tag, and sorts them by the length of the clip at the start of the read, counting soft (S) and hard (H) clips alike (since supplementary records may be either, depending on the aligner). A larger leading clip means the piece starts further along the read, so the clip lengths give the order of the pieces — telling us where each came from in the original read, which we plot in a synteny-style display.

[1] Similar functionality also exists in GenomeRibbon https://genomeribbon.org

`SAM` vs `VCF` - Breakends vs split alignments

A nice property of SAM is that from a single record I can reconstruct the “derived” genome around a region of interest. The SA tag is just a set of linear alignments that piece together easily, and you only need that one record to do it.

Doing the same from the VCF Breakend specification (section 5.4 of VCF4.3.pdf) is harder: a Breakend is only an edge in a graph (sequences are the nodes), so reconstructing a structural variant means building that graph and decoding paths through it. An inversion, for example, may take 4 VCF records to express. Some tools do tackle this — the GRIDSS / PURPLE / LINX pipeline reconstructs derivative chromosomes from breakends, and so does my derivative-chromosome-utils — but it remains more involved than piecing together an SA tag.

The catch with single reads is that they can be noisy (contain errors). de novo assembled contigs can also be stored in SAM format and are much less noisy, being built from many reads.

Haplotype-tagged reads

A new trend has been to create SAM/BAM/CRAM files with tagged reads, which tells us which haplotype a read was inferred to have come from. This is commonly done with the HP tag, which might have HP=1 and HP=2 for a diploid genome. Tools like whatshap can add these tags to a SAM file, and IGV and JBrowse 2 can color and sort by these tags.

Screenshot of JBrowse 2 with the “Color by tag” and “Sort by tag” settings enabled (coloring and sorting by the HP tag), letting us see that only one haplotype has a deletion. Tutorial for how to do this in JBrowse 2 here: https://jbrowse.org/jb2/docs/user_guides/alignments_track/#sort-color-and-filter-by-tag

How do you detect SVs with paired-end reads?

Distance between pairs being abnormally large or short

The distance between pairs is encoded by the TLEN column. For well-mapped pairs it is relatively constant, called the “insert size”, and comes from how paired-end sequencing reads both ends of a fragment.

But, if you are mapping reads vs the reference genome, and you observe that they are abnormally far apart, say 50kb apart instead of 1kb apart, this may indicate that your sample contains a deletion relative to the reference.

Screenshot of JBrowse 1 with “View as pairs” enabled with “mate paired” sequencing, and large insert size colored as red (from https://jbrowse.org/docs/paired_reads.html). Note that some of JBrowse 1’s View as pairs features are not yet available in JBrowse 2

Mate pairs vs paired end reads

“Mate paired” reads have a larger distance between pairs (aka the insert size is larger). Think: mate pairs have 2-5kb between pairs instead of a couple hundred bp. Mate pairs can be useful in scaffolding de novo genome assemblies since the larger range can span gaps, giving evidence that multiple contigs are connected.

“Paired-end” reads typically have like 200-500bp distance between pairs (aka, the insert size).

Data sheet https://web.archive.org/web/20210827021305/https://www.illumina.com/Documents/products/datasheets/datasheet_genomic_sequence.pdf

The above JBrowse screenshot shows mate pairs. Note that mate pairs have a different pair orientation, and pairs don’t necessarily point at each other, they “point away from each other” in normal conditions due to how the library is constructed. See Appendix A for more info.

Linked reads aka simulated long reads

Briefly: linked reads are another method that uses short reads but gives long read information, by tagging short reads that came from the same long molecule with a shared barcode. Note that 10x Genomics, the company that originally popularized this method, has since discontinued commercial linked-read sequencing to focus on single cell. https://www.10xgenomics.com/products/linked-reads

An abundance of reads being “clipped” at a particular position

This can indicate that part of the reads map well, but then there was an abrupt stop to the mapping. This might mean that there is a sequence that was an insertion at that position, or a deletion, or a translocation.

The clipping is indicated by the CIGAR string, either at the start or end of it by an S or an H. The S indicates “soft clipping”, and indicates that the sequence of the clipped portion can be found in the SEQ field of the primary alignment. The H is hard clipped, and the sequence that is hard clipped will not appear in the SEQ.

Screenshot of JBrowse 2 showing blue clipping indicator with a “pileup” of soft-clipping at a particular position shown in blue. The clipping is an “interbase” operation (it occurs between base pair coordinates) so it is plotted separately from the normal coverage histogram.

Screenshot of JBrowse 2 showing an insertion with Nanopore (top), PacBio (middle) and Illumina short reads. The long reads may completely span the insertion, so the CIGAR string on those have an I operator and are indicated by the purple triangle above the reads. For the short reads, the reads near the insertion will be clipped since they will not properly map to the reference genome and cannot span the insertion. The “Show soft clipping” setting in JBrowse 2 and IGV can be used to show visually the bases that extend into the insertion (shown on the bottom track).

Unexpected pair orientation

With standard paired end sequencing, the pairs normally point at each other

forward          reverse
--->             <---

If the stranded-ness of the pair is off, then it could indicate a structural variant. See Appendix A for a handy function for calculating pair orientation.

This guide from IGV is helpful for interpreting the pair directionality with patterns of SVs using “Color by pair orientation”

https://software.broadinstitute.org/software/igv/interpreting_pair_orientations

Figure: JBrowse 2 showing an inverted (tandem) duplication in 1000 genomes data. It uses the same coloring as IGV for pair orientation. The tandem duplication can produce green arrows which have reads pointing in opposite directions e.g. <-- and -->, while blue arrows which can indicate an inversion point in the same direction e.g. --> and -->

Caveat about TLEN

Note that TLEN is a field in the SAM format that is somewhat ill-defined, at least in the sense that different tools may use it differently https://github.com/pysam-developers/pysam/issues/667#issuecomment-381521767

If needed, you can compute the distance yourself from the paired records, but I haven’t had trouble relying on the TLEN in the data files.

Calling copy number variants with your short or long reads

Another type of SV you can get from SAM files is copy number variants (CNVs). Looking at depth-of-coverage for abnormalities can reveal them; a tool like mosdepth quickly produces a file of coverage across the genome.

Be aware that if you are comparing the coverage counts from different tools, they have different defaults that may affect comparison. Some discard QC_FAIL, DUP, and SECONDARY flagged reads. This is probably appropriate, and corresponds to what most genome browsers will display (see https://gist.github.com/cmdcolin/9f677eca28448d8a7c5d6e9917fc56af for a short summary of depth calculated from different tools)

Both long and short reads can be used for CNV detection; long reads may be more accurate thanks to better mapping through difficult regions.

Screenshot showing coverage in BigWig format from nanopore reads on normal and tumor tissue from a melanoma cancer cell line (COLO829) plotted using JBrowse 2. This coverage data is calculated from nanopore sequencing from here using mosdepth, converted from BedGraph to BigWig, and loaded into JBrowse 2. See (demo and tutorial)

The future, with graph genomes and de novo assemblies

Currently, SV visualization is mostly based on comparing data to a reference genome (and SAM is a signature of this: it stores everything in reference coordinates). In the future it may look more like comparative genomics, comparing an SV to a population-specific or graph-genome reference.

de novo assembly has more power to detect SVs than some read-based methods (https://twitter.com/lh3lh3/status/1362921612690010118/photo/1), so as assembled genomes improve and become more widespread, we may see a shift in how SVs are called. I’d also like to see fast ‘on the fly’ gene prediction on assembled genomes, to see what SNPs or altered splicing look like in duplicated gene copies.

Fun fact: the GAF (graphical alignment format) is a strict superset of PAF (pairwise alignment format) by storing graph node labels in the target name slot of PAF, and can refer to an rGFA (reference genome graph)! Looking forward to the graph genome world.

Conclusion

Algorithms that call structural variants face many challenges, but understanding how the reads are encoded in SAM format, and seeing what they look like in the genome browser is a useful first step to gaining a better understanding.

In summary, some of the signatures of SVs may include:

Aberrant insert size (TLEN) detection (longer for deletion, shorter for insertion)
Aberrant pair orientation (pairs are not pointing at each other)
Split-read detection (SA tag)
CIGAR string processing (D operator for deletions, I operator for insertions)
Over-abundance of clipping (S or H operators in CIGAR)
Depth of coverage changes for CNVs
Aligning de novo assembly vs a reference genome (https://twitter.com/lh3lh3/status/1362921612690010118/photo/1)

That last point is worth expanding on: aligning a de novo assembly to a reference can output SAM, but it can also output PAF format, which can be loaded into the JBrowse 2 synteny views. The techniques for detecting SVs from PAF are fundamentally similar to those listed above, but may look a bit different (see the cs tag in PAF for example, which is a modified CIGAR-like string).

If you have any ideas I should include here, let me know!

Appendix A: Interpreting pair orientation

This is fairly low-level, but it’s a handy utility for determining pair orientation from a single SAM/BAM/CRAM record.

// @param flags - flags from a single read
// @param ref - this read's RNAME (reference sequence name)
// @param rnext - the RNEXT field, used to check the mate is on the same reference
// @param tlen - the TLEN field from SAM
// @return e.g. F1R2 normal paired end orientation
function getPairOrientation(
  flags: number,
  ref: string,
  rnext: string,
  tlen: number,
) {
  // this read is not unmapped &&
  // this read's mate is also not unmapped &&
  // this read's mate is on the same reference genome (RNEXT '=' means same as RNAME)
  if (!(flags & 4) && !(flags & 8) && (rnext === '=' || ref === rnext)) {
    const s1 = flags & 16 ? 'R' : 'F'
    const s2 = flags & 32 ? 'R' : 'F'
    let o1 = ' '
    let o2 = ' '

    // if first in pair
    if (flags & 64) {
      o1 = '1'
      o2 = '2'
    }

    // else if second in pair
    else if (flags & 128) {
      o1 = '2'
      o2 = '1'
    }

    const tmp = []
    if (tlen > 0) {
      tmp[0] = s1
      tmp[1] = o1
      tmp[2] = s2
      tmp[3] = o2
    } else {
      tmp[2] = s1
      tmp[3] = o1
      tmp[0] = s2
      tmp[1] = o2
    }
    return tmp.join('')
  }
  return null
}

Then this can be broken down further by orientation type

Paired end reads are “fr”
Mate pair reads are “rf”

So you can interpret e.g. F1R2 in relation to being a paired end read (fr) or mate pair (rf) below and with this link https://software.broadinstitute.org/software/igv/interpreting_pair_orientations

{
  "fr": {
    "F1R2": "LR",
    "F2R1": "LR",

    "F1F2": "LL",
    "F2F1": "LL",

    "R1R2": "RR",
    "R2R1": "RR",

    "R1F2": "RL",
    "R2F1": "RL"
  },

  "rf": {
    "R1F2": "LR",
    "R2F1": "LR",

    "R1R2": "LL",
    "R2R1": "LL",

    "F1F2": "RR",
    "F2F1": "RR",

    "F1R2": "RL",
    "F2R1": "RL"
  }
}

Appendix B - `CIGAR` parsing

// @param cigar: CIGAR string in text form
// @returns an array of elements like ['30','M', '2','I', '50','M', '40','D']
// which you can consume in a loop two elements at a time
function parseCigar(cigar: string) {
  return cigar.split(/([MIDNSHPX=])/)
}

Then parse the returned array two at a time

// this function does nothing, but is informative for how to interpret a
// CIGAR string
// @param cigar - CIGAR string from record
// @param readSeq -  the SEQ from record
// @param refSeq -  the reference sequence underlying the read
function interpretCigar(cigar: string, readSeq: string, refSeq: string) {
  const opts = parseCigar(cigar)
  let qpos = 0 // query position, position on the read
  let tpos = 0 // target position, position on the reference sequence

  // opts will be an array like this ['30','M', '2','I', '50','M', '40','D']
  // which we parse two elements at a time
  for (let i = 0; i < opts.length; i += 2) {
    const len = +opts[i]
    const op = opts[i + 1]
    // do things. refer to the CIGAR chart in SAMv1.pdf for which operators
    // "consume reference" to see whether to increment
    if (op === 'M' || op === '=') {
      // matches consume query and reference
      const refMatch = refSeq.slice(tpos, tpos + len)
      const readMatch = readSeq.slice(qpos, qpos + len)
      for (let i = 0; i < len; i++) {
        if (refMatch[i] !== readMatch[i]) {
          // SNP at this position
        }
      }
      qpos += len
      tpos += len
    }
    if (op === 'I') {
      // insertions only consume query
      // sequence of the insertion from the read is
      const insSeq = readSeq.slice(qpos, qpos + len)
      qpos += len
    }
    if (op === 'D') {
      // deletions only consume reference
      // sequence of the deletion from the reference is
      const delSeq = refSeq.slice(tpos, tpos + len)
      tpos += len
    }
    if (op === 'N') {
      // skips only consume reference
      // skips are similar to deletions but are related to spliced alignments
      tpos += len
    }
    if (op === 'X') {
      // mismatch using the extended CIGAR format
      // could lookup the mismatch letter in a string containing the reference
      const mismatch = refSeq.slice(tpos, tpos + len)
      qpos += len
      tpos += len
    }
    if (op === 'H') {
      // does not consume query or reference
      // hardclip is just an indicator
    }
    if (op === 'S') {
      // softclip consumes query
      // below gets the entire soft clipped portion
      const softClipStr = readSeq.slice(qpos, qpos + len)
      qpos += len
    }
  }
}

Note for example, that to determine how long a record is on the reference sequence, you have to combine the records start position with the CIGAR string, basically parsing the CIGAR string to add up tpos and return tpos

Appendix C - align `FASTQ` directly to `CRAM`

This example from the htslib documentation (http://www.htslib.org/workflow/fastq.html) shows how you can stream directly from FASTQ to CRAM (and generate the index file .crai too)

If you want, you can make this a little shell script, easy_align_shortreads.sh

easy_align_shortreads.sh

#!/bin/bash
minimap2 -t 8 -a -x sr "$1" "$2" "$3"  | \
samtools fixmate -u -m - - | \
samtools sort -u -@2 - | \
samtools markdup -@8 --reference "$1" - --write-index "$4"

Similar idea for longreads, except just a single fastq file is generally used for longreads

easy_align_longreads.sh

#!/bin/bash
# add the preset for your platform, e.g. -x map-ont, -x map-hifi, or -x map-pb
minimap2 -t 8 -a "$1" "$2"  | \
samtools fixmate -u -m - - | \
samtools sort -u -@2 - | \
samtools markdup -@8 --reference "$1" - --write-index "$3"

Then call

bash easy_align_shortreads.sh ref.fa reads1.fq reads2.fq out.cram
bash easy_align_longreads.sh ref.fa reads.fq out.cram

## output BAM instead
bash easy_align_shortreads.sh ref.fa reads1.fq reads2.fq out.bam
bash easy_align_longreads.sh ref.fa reads.fq out.bam

This same concept works with other common aligners as well like bwa

Bonus: CRAM to bigwig, for looking at CNV/coverage

#!/bin/bash
# quickalign.sh ref.fa 1.fq 2.fq out.cram
# produces out.cram and out.bw
samtools faidx $1
minimap2 -t 8 -a -x sr "$1" "$2" "$3"  | \
samtools fixmate -u -m - - | \
samtools sort -u -@2 - | \
samtools markdup -@8 --reference "$1" - --write-index "$4"


mosdepth -f $1 $4 $4
gunzip $4.per-base.bed.gz
bedGraphToBigWig $4.per-base.bed $1.fai $4.bw

Call as “quickalign.sh ref.fa 1.fq 2.fq out.cram” gives you out.cram, out.cram.crai, and out.cram.bw (coverage)

Appendix D - the `MD` tag and finding SNPs in reads

The MD tag helps tell you where the mismatches are without looking at the reference genome. This is useful because as I mentioned, CIGAR can say 50M (50 matches) but some letters inside those 50 matches can be mismatches, it only says there are no insertions/deletions in those 50 bases, but you have to determine where in those 50 bases where the mismatches are.

The MD tag can help tell you where those are, but it is somewhat complicated to decode (https://vincebuffalo.com/notes/2014/01/17/md-tags-in-bam-files.html). You have to combine it with the CIGAR to get the position of the mismatches on the reference genome. If you have a reference genome to look at, you might just compare all the bases within the 50M to the reference genome and look for mismatches yourself and forget about the MD tag

The MD tag is also not required to exist, but the command samtools calmd yourfile.bam --reference reference.fa can add MD tags to your BAM file. It is generally not useful for CRAM because CRAM actually does store mismatches with the reference genome in its compression format.

Note that there are some oddities about MD tag representation leading to complaints (e.g. https://github.com/samtools/hts-specs/issues/505), which could lend credence to “doing it yourself” e.g. finding your own mismatches by comparing the read sequence with the reference, instead of relying on the MD tag. (see Appendix B)

Appendix E: Converting between SAM/BAM/CRAM

You can convert SAM to BAM with samtools

samtools view file.sam -o file.bam

You can also convert a BAM back to SAM with samtools view

samtools view -h file.bam -o file.sam

The -h just makes sure to preserve the header.

If you are converting SAM to CRAM, it may require the -T argument to specify your reference sequence (this is because the CRAM is “reference compressed”)

samtools view -T reference.fa file.sam -o file.cram

Note that in some cases you can pipe data directly from e.g. an aligner straight to CRAM. See Appendix C: piping FASTQ from minimap2 directly to CRAM

Appendix F: Why is my read alignment coverage uneven?

When looking at coverage plots, you might notice that coverage is far from uniform. There are both biological and technical reasons for this.

Biological reasons

GC content bias: Regions with very high or very low GC content tend to have lower coverage. This affects both short and long read sequencing, though to different degrees. DNA with extreme GC content is harder to denature and amplify during library preparation.
Repetitive regions: Tandem repeats, transposable elements, segmental duplications, and other repetitive sequences can cause reads to multi-map, leading to apparent coverage drops if multi-mappers are filtered out, or inflated coverage if they are randomly assigned. Centromeres and telomeres are particularly affected.
Copy number variation: True biological differences in copy number between your sample and the reference genome will show up as coverage differences. Duplicated regions will have higher coverage, and deleted regions will have lower or zero coverage.
Heterochromatin and chromatin structure: Tightly packed heterochromatic regions can be harder to sequence due to reduced accessibility during library preparation, leading to lower coverage.
Segmental duplications and paralogous sequences: Highly similar duplicated regions in the genome can cause reads to map ambiguously, resulting in mapping quality drops and uneven coverage.

Technical reasons

PCR amplification bias: Library preparation protocols that involve PCR can introduce amplification bias. Some fragments amplify more efficiently than others, leading to uneven representation. PCR-free library prep protocols can reduce this.
Fragmentation bias: Enzymatic or mechanical fragmentation of DNA during library preparation is not perfectly uniform, which can lead to some regions being over- or under-represented.
Mappability: Some regions of the genome are inherently difficult to map reads to uniquely. Short reads in particular struggle with repetitive regions, leading to mapping quality zero reads or unmapped reads. Long reads improve this but don’t fully eliminate it.
Sequencing errors in homopolymers: Nanopore and older PacBio sequencing can have higher error rates in homopolymer runs (e.g. AAAAAAA), which can cause local alignment artifacts and apparent coverage fluctuations.
Library complexity: Low input DNA or over-amplified libraries result in many duplicate reads. After duplicate marking/removal, the effective coverage in some regions can drop significantly.
Target capture bias: For exome or targeted sequencing, probe hybridization efficiency varies across targets, leading to highly uneven coverage across the captured regions.
Edge effects: Coverage tends to drop at the edges of chromosomes (telomeres) and near gaps (Ns) in the reference assembly.
Reference genome quality: Errors or gaps in the reference genome itself can cause reads to fail to map in certain regions, or cause pileups of misaligned reads in others.

Understanding these sources of coverage variation is important when interpreting CNV calls, as not every coverage dip or peak represents a true biological copy number change.

Divergent strains mapped to a single reference

An extreme case of uneven coverage occurs when you map reads from a divergent strain or wild isolate to a reference genome. If the sample is highly divergent from the reference in certain genomic regions, reads from those regions will either fail to map entirely or map with very high mismatch rates, creating dramatic coverage drops and spikes at a fine scale.

This is particularly well-studied in C. elegans, where wild isolates (e.g. DL238, JU2526, EG4725) have “hyper-divergent” regions compared to the standard N2 reference. These regions can be so different from N2 that they resemble a different species, and the coverage profile when mapping to N2 looks extremely jagged: sharp transitions between well-mapped conserved regions and poorly-mapped or unmapped divergent regions. The read pileup in these regions will also show an abundance of colored mismatches.

These hyper-divergent haplotypes in C. elegans often overlap with genes involved in environmental response such as chemoreceptors and innate immunity genes, and are thought to be maintained by balancing selection. This phenomenon is not unique to C. elegans; any organism with high population diversity or where you are mapping a divergent strain to a single reference can show similar patterns.

The solution is not to “fix” the library prep but rather to use a strain-specific assembly, a pangenome reference, or a graph genome approach that can represent the diversity in these regions. Alternatively, using a more lenient aligner setting may recover some reads, but at the cost of noisier alignments.

Appendix G: How BAM/CRAM index files work

When you run samtools index, why does the resulting .bai/.csi/.crai let you jump straight to chr10:1,000-2,000 without reading the whole file? Two ingredients:

BGZF block compression: BAM is compressed with BGZF, a flavor of gzip made of independent ~64kb blocks. A reader can seek to the start of any block and decompress from there instead of from the top of the file. Positions are stored as “virtual offsets” that pack the compressed block offset and the offset-within-the-uncompressed-block into a single 64-bit number.
A two-part index: the .bai combines a binning index and a linear index.
- The binning index uses the UCSC hierarchical binning scheme: the genome is split into bins at several size levels (the smallest is 16kb), and each read is filed into the smallest bin that fully contains it. To answer a region query, you compute which bins overlap the region at each level and gather the file chunks filed under them.
- The linear index stores, for each 16kb window, the smallest virtual offset where reads overlapping that window begin, so the reader can skip chunks that end before the region of interest.

The .csi index generalizes this with a configurable bin size and depth (needed for chromosomes longer than ~512Mb). CRAM’s .crai is a bit different: it is a gzipped list of the file’s containers/slices with their reference id, start, span, and file offsets.

If you want to actually see which bins and chunks a given query touches, I made an interactive BAM index visualizer that is kind of fun to play with: https://cmdcolin.github.io/bam_index_visualizer/

SAM can contain any type of sequence, not specifically reads. If you created a de novo assembly, you could align the contigs of the de novo assembly to a reference genome and store the results in SAM. ↩
Does not always have to have information about mapping to a reference genome. You can also store unaligned data in SAM/BAM/CRAM (so-called uBAM for example) but most of the time, the reads in SAM format are aligned to a reference genome. ↩
Examples of programs that do alignment include bwa, bowtie, and minimap2 (there are many others). These programs can all produce SAM outputs. ↩