Misc scribbles

Structural variants and the SAM format - the long (reads) and short (reads) of it

2022-02-06

The SAM specification is pretty amazing (https://samtools.github.io/hts-specs/SAMv1.pdf) but it is also fairly terse and abstract. True understanding might come from playing with real world data. I will try to relay some things I have learned over the years, with a bit of a focus on how SAM file concepts can relate to structural variants.

Disclaimer: I'm a developer of JBrowse 2. This document has some screenshots and links for it, feel free to try it at https://jbrowse.org.

#Basics

#What is a SAM file and how does it relate to BAM and CRAM?

You can convert SAM to BAM with samtools

samtools view file.sam -o file.bam

You can also convert a BAM back to SAM with samtools view

samtools view -h file.bam -o file.sam

The -h just makes sure to preserve the header.

If you are converting SAM to CRAM, it may require the -T argument to specify your reference sequence (this is because the CRAM is "reference compressed")

samtools view -T reference.fa file.sam -o file.cram

Also see Appendix C: piping FASTQ from minimap2 directly to CRAM

[1] SAM can contain any type of sequence, not specifically reads. If you created a de novo assembly, you could align the contigs of the de novo assembly to a reference genome and store the results in SAM.

[2] Does not always have to have information about mapping to a reference genome. You can also store unaligned data in SAM/BAM/CRAM (so-called uBAM for example) but most of the time, the reads in SAM format are aligned to a reference genome.

[3] Examples of programs that do alignment include bwa, bowtie, and minimap2 (there are many others). These programs all can produce SAM outputs

#What is in a SAM file

A SAM file contains a header (BAM and CRAM files also have the SAM header) and a series of records. A record is a single line in a SAM file, and it generally corresponds to a single read, but as we will see, a split alignment may produce multiple records that refer to the same source read.

Note: if a read failed to align to the reference genome, it may still be in your SAM file, marked as unmapped using the flag column. Sometimes, "dumpster diving" (looking at the unmapped records from a SAM file) can be used to aid structural variant searches (e.g. there may be novel sequence in there not from the reference genome that could be assembled)

#What are tags in a SAM file

A SAM file has a core set of required fields, and then an arbitrary list of extra columns called tags. The tags have a two character abbreviation like MQ (mapping quality) or many others. They can be upper or lower case. Upper case are reserved for official usages (except those with X, Y, or Z prefixed). See SAMtags.pdf for more details

#What is a CIGAR string

A CIGAR string is a "compact idiosyncratic gapped alignment report". It tells you about insertions, deletions, and clipping. It is a series of "operators" with lengths.

Insertion example:

50M50I50M

That would be 50bp of matching bases (50M), followed by a 50bp insertion (50I), followed by another 50bp of matches (50M). The 50bp insertion means the read contains 50 bases in the middle which did not match the reference genome that you are comparing the read to.

Clipping example:

50S50M50S

This means that 50bp matched (50M in the middle of the CIGAR string) and both sides of the read are soft clipped. The clipping means the aligner was not able to align the reads on either side.

Notes:

See SAMv1.pdf for all the CIGAR operators.

If you are working with SAM data, you will often write loops that directly parse CIGAR strings. See Appendix B for handy functions for parsing CIGAR strings. Don't fear the CIGAR!

#Detecting SVs from long reads

Long reads offer a wide array of methods for detecting SVs

Note that there are many different methods for detecting SVs from long reads, e.g. not all use mapped reads from SAM files, some use de novo assembly, but it is still useful to be familiar with mapped read methods.

#What are split/supplementary/chimeric alignments?

Split alignments, or chimeric alignments, are alignments where part of the read maps to one place, and another part to another. For example, part of a long read may map to chr1 and part of it maps to chr4. It is worth reading the definition of "Chimeric alignment" from SAMv1.pdf when you get the chance.

As SAMv1.pdf tells us, one record is marked as "representative", sometimes also called the "primary" record, while the other components of the split read are marked "supplementary", given the 2048 flag. The "primary" record generally has a SEQ field that represents the entirety of the original read's sequence (with CIGAR soft clipping operators saying which part of that sequence aligned), and the "supplementary alignments" will have SEQ field but sometimes just segments of the original read's sequence with CIGAR hard clipping operators indicating that it is partial.

Supplementary alignments are especially common with long reads, and it can be a signal for structural variants e.g. where two chromosomes are fused together, and parts of the read align to multiple chromosomes, or the split alignment may align to either side of a large deletion, or they may be split to align through an inversion (part of it aligns to the forward strand, part of it to the reverse strand, and again the forward strand)

There is no limitation on how many splits might occur so the split can align to 3, 4, or more different places. Each part of the split puts a new line in the SAM file, and note that all the records also have the same read name, or QNAME (first column of SAM).

#What are secondary alignments/multi-mappers

Secondary alignments generally come from "multi-mappers" where the entire read maps equally well (or at least somewhat equally well) to, say, somewhere on both chr4 and chr1. "Multi mapping" results in secondary alignments, while split reads result in supplementary alignments. See "Multiple mapping" in the SAMv1.pdf for the definition of multi-mapping. Note also that secondary alignments sometimes are missing the SEQ field entirely too, see https://github.com/lh3/minimap2/issues/458#issuecomment-516661855

I wrote a tool called secondary_rewriter to add the SEQ field back to secondary alignments, which may help in some cases

#What is the SA tag?

The SA tag is outputted on each part of the supplementary/split/chimeric alignment, e.g. the primary contains an SA tag that refers to information (e.g. the location) of where all the supplementary alignments where placed, and each of the supplementary alignments also contains an SA tag that refers to the primary alignment and each other supplementary alignment.

Fun fact: The SA tag conceptually can result in a 'quadratic explosion' of data, because each part of the split contains references to every other part. For example, if a read is split into 4 pieces, then each record would would have an SA tag with 3 segments, so 3*4 segments will be documented in the SA tag. In many cases, this is not a problem, but if you imagine a finished chromosome aligned to a draft assembly, it may get split so many times this could be a factor.

See SAMtags.pdf for more info on the SA tag.

#Visualizing split reads across a breakend or translocation

This is a specialized JBrowse 2 feature, but if there is an inter-chromosomal translocation, you can load this into JBrowse and visualize support for this event using our "breakpoint split view". This view shows the evidence for the reads that are split aligned across an SV, and can show connections between paired-end reads across an SV too.

We also have a workflow called the "SV inspector" that helps you setup the "breakpoint split views" (https://jbrowse.org/jb2/docs/user_guide/#sv-inspector. The SV inspector and Breakpoint split view work best on Breakends (e.g. VCF 4.3 section 5.4) and <TRA> (translocation) events from VCF, or BEDPE formatted SV calls, and you can launch the "breakpoint split view" from the "SV inspector"

#Visualizing a 'read vs reference' view given a split alignment

If we are given the the primary alignment of an arbitrary split read, then we can construct what that split looks like compared to the reference genome.

If we are not given the primary alignment (e.g. we are starting from a supplementary alignment) then we can search the SA list for the one that is primary, because at least one will be.

Now that we have the primary alignment, it will have the SEQ (of the entire read, the supplementary alignments typically have a blank SEQ!) and the SA tag containing the CIGAR of all the different parts of the split. We can then construct how the entire read, not just a particular record of the split alignment, compares to the genome. In JBrowse 2 we implemented this and it uses a synteny-style rendering. [1]

Figure showing JBrowse 2 piecing together a long read vs the reference genome from a single read

In order to do this reconstruction, JBrowse 2 takes the CIGAR strings of the primary alignment and each of the pieces of the SA tag (it is a semi-colon separated list of chunks), sort them by the amount of softclipping (the softclipping values will progressively trim off more of the SEQ telling you it aligned further and further on in the long read), and then this tells me where each piece of the split alignment came from in the original SEQ, so we can plot the alignments of the read vs the reference genome using synteny style display.

[1] Similar functionality also exists in GenomeRibbon https://genomeribbon.org

#SAM vs VCF - Breakends vs split alignments

An interesting outcome (to me) is that from a single record in a SAM file, I can reconstruct the "derived" genome around a region of interest from a single read.

If I was to try to do this with the VCF Breakend specification (section 5.4 of VCF4.3.pdf), it may actually be more challenging than from a SAM read. This is because a Breakend in VCF format is only an edge in a graph (and the sequences are nodes). Therefore, in order to properly reconstruct a structural variant from a VCF with Breakends, I would have to construct a graph and decode paths through it.

I like the ability to reconstruct the derived genome from a single read, but individual reads can be noisy (contain errors). That said, de novo assembled contigs can also be stored in SAM format and is significantly less noisy (being composed of the aggregate information of many reads).

The point though is that interpretation of the VCF breakend specification is challenging due to imposing a sequence graph on the genome, while the SA tag remains just a simple set of linear alignments that can easily be pieced together, and you only need to refer to a single record in the SAM file to do so.

I am not aware of a lot of tools that work on the VCF Breakend graph, and expect more will need to be created to truly work with this standard. An inversion for example may create 4 record in the VCF file (see section 5.4 in the VCF4.3.pdf for example), and needs careful interpretation.

#Haplotype tagged reads

A new trend has been to create SAM/BAM/CRAM files with tagged reads, which tells us which haplotype a read was inferred to have come from. This is commonly done with the HP tag, which might have HP=0 and HP=1 for a diploid genome. Tools like whatshap can add these tags to a SAM file, and IGV and JBrowse 2 can color and sort by these tags.

Screenshot of JBrowse 2 with the "Color by tag" and "Sort by tag" setting enabled (coloring and sorting by the HP tag) letting us see that only one haplotype has a deletion. Tutorial for how to do this in JBrowse 2 here https://jbrowse.org/jb2/docs/user_guide/#sort-color-and-filter-by-tag

#How do you detect SVs with paired-end reads?

#Distance between pairs being abnormally large or short

The distance between pairs is encoded by the TLEN column in the SAM format. The distance between pairs with good mapping is relatively constant and called the "insert length". This comes from how the sequencing is done: paired-end sequencing performs sequencing on both ends of a fragment.

But, if you are mapping reads vs the reference genome, and you observe that they are abnormally far apart, say 50kb apart instead of 1kb apart, this may indicate there your sample contains a deletion relative to the reference.

Screenshot of JBrowse 1 with "View as pairs" enabled with "mate paired" sequencing, and large insert size colored as red (from https://jbrowse.org/docs/paired_reads.html. Note that some of JBrowse 1's View as pairs features are not yet available in JBrowse 2

#Mate pairs vs paired end reads

"Mate paired" reads have a larger distance between pairs (aka the insert size is larger). Think: mate pairs have 2-5kb between pairs instead of a couple hundred bp. Mate pairs can be useful in scaffolding de novo genome assemblies since the larger range can span gaps, giving evidence that multiple contigs are connected.

"Paired-end" reads typically have like 200-500bp distance between pairs (aka, the insert size).

Data sheet https://web.archive.org/web/20210827021305/https://www.illumina.com/Documents/products/datasheets/datasheet_genomic_sequence.pdf

The above JBrowse screenshot shows mate pairs. Note that mate pairs have a different pair orientation, and pairs don't necessarily point at each other, they "point away from each other" in normal conditions due. See Appendix A for more info.

#Linked reads aka simulated long reads

Linked reads are another method that uses short reads but gives long read information. I don't have a lot of info on this but will update this section if I do. The company, 10x genomics, that originally used this discontinued commercial sequencing using this method to focus instead on single cell AFAIK. https://www.10xgenomics.com/products/linked-reads

#An abundance of reads being "clipped" at a particular position

This can indicate that part of the reads map well, but then there was an abrupt stop to the mapping. This might mean that there is a sequence that was an insertion at that position, or a deletion, or a translocation.

The clipping is indicated by the CIGAR string, either at the start or end of it by an S or an H. The S indicates "soft clipping", and indicates that the sequence of the clipped portion can be found in the SEQ field of the primary alignment. The H is hard clipped, and the sequence that is hard clipped will not appear in the SEQ.

Screenshot of JBrowse 2 showing blue clipping indicator with a "pileup" of soft-clipping at a particular position shown in blue. The clipping is an "interbase" operation (it occurs between base pair coordinates) so it is plotted separately from the normal coverage histogram.

Screenshot of JBrowse 2 showing an insertion with Nanopore (top), PacBio (middle) and Illumina short reads. The long reads may completely span the insertion, so the CIGAR string on those have an I operator and are indicated by the purple triangle above the reads. For the short reads, the reads near the insertion will be clipped since they will not properly map to the reference genome and cannot span the sinsertion. The "Show soft clipping" setting in JBrowse 2 and IGV can be used to show visually the bases that extend into the insertion (shown on the bottom track).

#Unexpected pair orientation

With standard paired end sequencing, the pairs normally point at each other

forward reverse
 --->    <---

If the stranded-ness of the pair is off, then it could indicate a structural variant. See Appendix A for a handy function for calculating pair orientation.

This guide from IGV is helpful for interpreting the pair directionality with patterns of SVs using "Color by pair orientation"

https://software.broadinstitute.org/software/igv/interpreting_pair_orientations

Figure: JBrowse 2 showing an inverted (tandem) duplication in 1000 genomes data. It uses the same coloring as IGV for pair orientation. The tandem duplication can produce green arrows which have reads pointing in opposite directions e.g. <-- and -->, while blue arrows which can indicate an inversion point in the same direction e.g. --> and -->

#Caveat about TLEN

Note that TLEN is a field in the SAM format that is somewhat ill defined, at least in the sense that different tools may use it differently https://github.com/pysam-developers/pysam/issues/667#issuecomment-381521767

If needed, you can calculate TLEN yourself if you process the file yourself (e.g. process all reads, get the actual records for the pairs, and calculate distance) but I have not had trouble with relying on the TLEN from the data files themselves.

#Calling copy number variants with your short or long reads

Another type of SV that you can get from your SAM files are copy number variants (CNVs). By looking at the depth-of-coverage for your data files, you can look for abnormalities that may indicate copy number variants. By using a tool like mosdepth, you can quickly get a file showing the coverage across the genome.

Be aware that if you are comparing the coverage counts from different tools, that they have different defaults that may affect comparison. Some discard QC_FAIL, DUP, and SECONDARY flagged reads. This is probably appropriate, and corresponds to what most genome browsers will display (see https://gist.github.com/cmdcolin/9f677eca28448d8a7c5d6e9917fc56af for a short summary of depth calculated from different tools)

Note that both long and short reads can be used for CNV detection. Long reads may give more accurate measurements also, with their better ability to map smoothly through difficult regions of the genome.

Screenshot showing coverage in BigWig format from nanopore reads on normal and tumor tissue from a melanoma cancer cell line (COLO829) plotted using JBrowse 2. This coverage data is calculated from nanopore sequencing from here using mosdepth, converted from BedGraph to BigWig, and loaded into JBrowse 2. See (demo and tutorial)

#The future, with graph genomes and de novo assemblies

Currently, SV visualization is highly based on comparing data versus a reference genome (and the SAM format is a signature of this: it stores data in terms of reference genome coordinates). In the future, SV visualization may look more similar to comparative genomics, where we compare an SV to a population specific reference from a graph genomes or something like this.

It is known that de novo assembly has more power to detect SVs than some read operations (https://twitter.com/lh3lh3/status/1362921612690010118/photo/1 as de novo assembled genomes improve and become more widespread, we may see a shift in how SVs are called

I would also like to see improved ability to do fast or 'on the fly' gene prediction on the de novo assembled genomes, and we can see what SNPs or modified splicing might look like in copies of genes (e.g. derived regions of the CNV duplications).

Fun fact: the GAF (graphical alignment format) is a strict superset of PAF (pairwise alignment format) by storing graph node labels in the target name slot of PAF, and can refer to an rGFA (reference genome graph)! Looking forward to the graph genome world.

#Conclusion

Algorithms that actually call structural variants face many challenges, but understanding how the reads are encoded in SAM format, and seeing what they look like in the genome browser is a useful first step to gaining a better understanding.

In summary, some of the signatures of SVs may include:

If you have any ideas I should include here, let me know!

#Appendix A: Interpreting pair orientation

This may be unnecessarily low level code for this article, but it could be helpful utility to help determine the pair orientation from a single SAM/BAM/CRAM record.

// @param flags - flags from a single read
// @param ref - the string of the reference sequence, just used to determine if it matches rnext
// @param rnext - the string of the RNEXT, just used to determine if it matches ref
// @param tlen - the TLEN field from SAM
// @return e.g. F1R2 normal paired end orientation
function getPairOrientation(
  flags: number,
  ref: string,
  rnext: string,
  tlen: number,
) {
  // this read is not unmapped &&
  // this read's mate is also not unmapped &&
  // this read's mate is on the same reference genome
  if (!flags & 4 && !flags & 8 && ref === rnext) {
    const s1 = flags & 16 ? 'R' : 'F'
    const s2 = flags & 32 ? 'R' : 'F'
    let o1 = ' '
    let o2 = ' '
 
    // if first in pair
    if (flags & 64) {
      o1 = '1'
      o2 = '2'
    }
 
    // else if second in pair
    else if (flags & 128) {
      o1 = '2'
      o2 = '1'
    }
 
    const tmp = []
    if (tlen > 0) {
      tmp[0] = s1
      tmp[1] = o1
      tmp[2] = s2
      tmp[3] = o2
    } else {
      tmp[2] = s1
      tmp[3] = o1
      tmp[0] = s2
      tmp[1] = o2
    }
    return tmp.join('')
  }
  return null
}

Then this can be broken down further by orientation type

So you can interpret e.g. F1R2 in relation to being a paired end read (fr) or mate pair (rf) below and with this link https://software.broadinstitute.org/software/igv/interpreting_pair_orientations

{
  "fr": {
    "F1R2": "LR",
    "F2R1": "LR",
 
    "F1F2": "LL",
    "F2F1": "LL",
 
    "R1R2": "RR",
    "R2R1": "RR",
 
    "R1F2": "RL",
    "R2F1": "RL"
  },
 
  "rf": {
    "R1F2": "LR",
    "R2F1": "LR",
 
    "R1R2": "LL",
    "R2R1": "LL",
 
    "F1F2": "RR",
    "F2F1": "RR",
 
    "F1R2": "RL",
    "F2R1": "RL"
  }
}

#Appendix B - CIGAR parsing

// @param cigar: CIGAR string in text form
// @returns an array of elements like ['30','M', '2','I', '50','M', '40','D']
// which you can consume in a loop two elements at a time
function parseCigar(cigar: string) {
  return cigar.split(/([MIDNSHPX=])/)
}

Then parse the returned array two at a time

// this function does nothing, but is informative for how to interpret a
// CIGAR string
// @param cigar - CIGAR string from record
// @param readSeq -  the SEQ from record
// @param refSeq -  the reference sequence underlying the read
function interpretCigar(cigar: string, readSeq: string, refSeq: string) {
  const opts = parseCigar(cigar)
  let qpos = 0 // query position, position on the read
  let tpos = 0 // target position, position on the reference sequence
 
  // opts will be an array like this ['30','M', '2','I', '50','M', '40','D']
  // which we parse two elements at a time
  for (let i = 0; i < opts.length; i += 2) {
    const length = +opts[i]
    const operator = opts[i + 1]
    // do things. refer to the CIGAR chart in SAMv1.pdf for which operators
    // "consume reference" to see whether to increment
    if (op === 'M' || op === '=') {
      // matches consume query and reference
      const refMatch = refSeq.slice(tpos, tpos + len)
      const readMatch = readSeq.slice(qpos, qpos + len)
      for (let i = 0; i < len; i++) {
        if (refMatch[i] !== readMatch[i]) {
          // SNP at this position
        }
      }
      qpos += len
      tpos += len
    }
    if (op === 'I') {
      // insertions only consume query
      // sequence of the insertion from the read is
      const insSeq = readSeq.slice(qpos, qpos + len)
      qpos += len
    }
    if (op === 'D') {
      // deletions only consume reference
      // sequence of the deletion from the reference is
      const delSeq = refSeq.slice(tpos, tpos + len)
      tpad += len
    }
    if (op === 'N') {
      // skips only consume reference
      // skips are similar to deletions but are related to spliced alignments
      tpad += len
    }
    if (op === 'X') {
      // mismatch using the extended CIGAR format
      // could lookup the mismatch letter in a string containing the reference
      const mismatch = refSeq.slice(tpos, tpos + len)
      qpos += len
      tpos += len
    }
    if (op === 'H') {
      // does not consume query or reference
      // hardclip is just an indicator
    }
    if (op === 'S') {
      // softclip consumes query
      // below gets the entire soft clipped portion
      const softClipStr = readSeq.slice(qpos, qpos + len)
      qpos += len
    }
  }
}

Note for example, that to determine how long a record is on the reference sequence, you have to combine the records start position with the CIGAR string, basically parsing the CIGAR string to add up tpos and return tpos

#Appendix C - align FASTQ directly to CRAM

This example from the htslib documentation (http://www.htslib.org/workflow/fastq.html shows how you can stream directly from FASTQ to CRAM (and generate the index file .crai too)

If you want, you can make this a little shell script, easy_align_shortreads.sh

easy_align_shortreads.sh

#!/bin/bash
minimap2 -t 8 -a -x sr "$1" "$2" "$3"  | \
samtools fixmate -u -m - - | \
samtools sort -u -@2 - | \
samtools markdup -@8 --reference "$1" - --write-index "$4"

Similar idea for longreads, except just a single fastq file is generally used for longreads

easy_align_longreads.sh

#!/bin/bash
minimap2 -t 8 -a "$1" "$2"  | \
samtools fixmate -u -m - - | \
samtools sort -u -@2 - | \
samtools markdup -@8 --reference "$1" - --write-index "$3"

Then call

bash easy_align_shortreads.sh ref.fa reads1.fq reads2.fq out.cram
bash easy_align_longreads.sh ref.fa reads.fq out.cram
 
## output BAM instead
bash easy_align_shortreads.sh ref.fa reads1.fq reads2.fq out.bam
bash easy_align_longreads.sh ref.fa reads.fq out.bam

This same concept works with other common aligners as well like bwa

Bonus: CRAM to bigwig, for looking at CNV/coverage

#!/bin/bash
# quickalign.sh ref.fa 1.fq 2.fq out.cram
# produces out.cram and out.bw
samtools faidx $1
minimap2 -t 8 -a -x sr "$1" "$2" "$3"  | \
samtools fixmate -u -m - - | \
samtools sort -u -@2 - | \
samtools markdup -@8 --reference "$1" - --write-index "$4"
 
 
mosdepth $4 -f $1 $4
gunzip $4.per-base.bed.gz
bedGraphToBigWig $4.per-base.bed $1.fa.fai $4.bw

Call as "quickalign.sh ref.fa 1.fq 2.fq out.cram" gives you out.cram, out.cram.crai, and out.cram.bw (coverage)

#Appendix D - the MD tag and finding SNPs in reads

The MD tag helps tell you where the mismatches are without looking at the reference genome. This is useful because as I mentioned, CIGAR can say 50M (50 matches) but some letters inside those 50 matches can be mismatches, it only says there are no insertions/deletions in those 50 bases, but you have to determine where in those 50 bases where the mismatches are. The MD tag can help tell you where those are, but it is somewhat complicated to decode (https://vincebuffalo.com/notes/2014/01/17/md-tags-in-bam-files.html). You have to combine it with the CIGAR to get the position of the mismatches on the reference genome. If you have a reference genome to look at, you might just compare all the bases within the 50M to the reference genome and look for mismatches yourself and forget about the MD tag

The MD tag is also not required to exist, but the command samtools calmd yourfile.bam --reference reference.fa can add MD tags to your BAM file. It is generally not useful for CRAM because CRAM actually does store mismatches with the reference genome in it's compression format. Note that there are also some oddities about MD tag representation leading to complaints (e.g. https://github.com/samtools/hts-specs/issues/505) leading more credence to "doing it yourself" e.g. finding your own mismatches by comparing the read sequence with the reference, keeping track of where you are on the read and ref position with the CIGAR string (a la Appendix B)