Structural variants and the SAM format - the long (reads) and short (reads) of it
2022-02-06
The SAM
specification is pretty amazing
(https://samtools.github.io/hts-specs/SAMv1.pdf) but it is also fairly terse and
abstract. True understanding might come from playing with real world data. I
will try to relay some things I have learned over the years, with a bit of a
focus on how SAM
file concepts can relate to structural variants.
Disclaimer: I'm a developer of JBrowse 2. This document has some screenshots and links for it, feel free to try it at https://jbrowse.org.
#Basics
#What is a SAM
file and how does it relate to BAM
and CRAM
?
-
A
SAM
file generally contains "reads" from a sequencer, with information about how they are mapped to a reference genome [1][2]. -
A
SAM
file is generally produced when an aligner takes in raw unaligned reads (often stored inFASTQ
format files) and aligns them to a reference genome [3]. -
A
SAM
file is a text format that you can read with your text editor.BAM
andCRAM
are compressed representations of theSAM
format.
You can convert SAM
to BAM
with samtools
You can also convert a BAM
back to SAM
with samtools view
The -h just makes sure to preserve the header.
If you are converting SAM
to CRAM
, it may require the -T argument to specify
your reference sequence (this is because the CRAM
is "reference compressed")
Also see Appendix C: piping FASTQ from minimap2
directly to CRAM
[1] SAM
can contain any type of sequence, not specifically reads. If you
created a de novo assembly, you could align the contigs of the de novo
assembly to a reference genome and store the results in SAM
.
[2] Does not always have to have information about mapping to a reference
genome. You can also store unaligned data in SAM
/BAM
/CRAM
(so-called
uBAM
for example) but most of the time, the reads in SAM
format are aligned
to a reference genome.
[3] Examples of programs that do alignment include bwa
, bowtie
, and
minimap2
(there are many others). These programs all can produce SAM
outputs
#What is in a SAM
file
A SAM
file contains a header (BAM
and CRAM
files also have the SAM
header) and a series of records. A record is a single line in a SAM
file, and
it generally corresponds to a single read, but as we will see, a split alignment
may produce multiple records that refer to the same source read.
Note: if a read failed to align to the reference genome, it may still be in your
SAM
file, marked as unmapped using the flag column. Sometimes, "dumpster
diving" (looking at the unmapped records from a SAM
file) can be used to aid
structural variant searches (e.g. there may be novel sequence in there not from
the reference genome that could be assembled)
#What are tags in a SAM
file
A SAM file has a core set of required fields, and then an arbitrary list of
extra columns called tags. The tags have a two character abbreviation like MQ
(mapping quality) or many others. They can be upper or lower case. Upper case
are reserved for official usages (except those with X, Y, or Z prefixed). See
SAMtags.pdf for more details
#What is a CIGAR
string
A CIGAR
string is a "compact idiosyncratic gapped alignment report". It tells
you about insertions, deletions, and clipping. It is a series of "operators"
with lengths.
Insertion example:
50M50I50M
That would be 50bp of matching bases (50M
), followed by a 50bp insertion
(50I
), followed by another 50bp of matches (50M
). The 50bp insertion means
the read contains 50 bases in the middle which did not match the reference
genome that you are comparing the read to.
Clipping example:
50S50M50S
This means that 50bp matched (50M
in the middle of the CIGAR
string) and
both sides of the read are soft clipped. The clipping means the aligner was not
able to align the reads on either side.
Notes:
-
Finding mismatches: A
CIGAR
string match like50M
means 50 bases "matched" the reference genome, but that only means that there are no insertions or deletions in those 50 bases. There could be underlying mismatches in the read compared to the reference. Note: there is also extendedCIGAR
that replacesM
with=
(exact match) andX
(mismatch). Also see Appendix D on theMD
tag and finding where the mismatches are, but note thatMD
tag is tricky -
Ambiguity of representation: A
CIGAR
string with insertions and deletions could be50M1D1I50M
. This string had a 1bp deletion and a 1bp insertion back-to-back. This could be just a mismatch! There is ambiguity in sequence alignment representations. Downstream programs must accommodate this. -
Split records and soft-clipping: A
CIGAR
string with soft-clipping500S50M
this means that 500 bases of the read were not aligned at this position, but 50 bases were! Note that the alignment might have been a split alignment (see section on split alignments below) so another record in theSAM
file, linked by theSA
(supplementary alignment) tag, might contain info on where the other 500 bases aligned! (or, they might not map anywhere). The linked split or supplementary alignments all have the same read name (QNAME
).
See SAMv1.pdf for all the CIGAR operators.
If you are working with SAM
data, you will often write loops that directly
parse CIGAR strings. See Appendix B for handy functions for parsing CIGAR
strings. Don't fear the CIGAR
!
#Detecting SVs from long reads
Long reads offer a wide array of methods for detecting SVs
- Small insertions/deletions: Long reads can completely span moderate sized
insertions and deletions, indicated by
I
orD
in aCIGAR
string. - Large insertions/deletions: If a long read does not completely span an insertion or deletion, it may be split aligned on either side of the SV or could be soft/hard clipped where it can't align all the way through an insertion.
- Translocations: A split long alignment can span long range or even inter-chromosomal translocations, so part of the read maps to one chromosome and one part maps to the other
- Inversions: A split alignment can span an inversion, the long read is split into multiple parts, one part of it aligns in the reverse orientation, while the other part aligns in the forward orientation
Note that there are many different methods for detecting SVs from long reads, e.g. not all use mapped reads from SAM files, some use de novo assembly, but it is still useful to be familiar with mapped read methods.
#What are split/supplementary/chimeric alignments?
Split alignments, or chimeric alignments, are alignments where part of the read
maps to one place, and another part to another. For example, part of a long read
may map to chr1
and part of it maps to chr4
. It is worth reading the
definition of "Chimeric alignment" from
SAMv1.pdf when you get the
chance.
As SAMv1.pdf tells us, one
record is marked as "representative", sometimes also called the "primary"
record, while the other components of the split read are marked "supplementary",
given the 2048 flag. The "primary" record generally has a SEQ
field that
represents the entirety of the original read's sequence (with CIGAR soft
clipping operators saying which part of that sequence aligned), and the
"supplementary alignments" will have SEQ
field but sometimes just segments of
the original read's sequence with CIGAR hard clipping operators indicating that
it is partial.
Supplementary alignments are especially common with long reads, and it can be a signal for structural variants e.g. where two chromosomes are fused together, and parts of the read align to multiple chromosomes, or the split alignment may align to either side of a large deletion, or they may be split to align through an inversion (part of it aligns to the forward strand, part of it to the reverse strand, and again the forward strand)
There is no limitation on how many splits might occur so the split can align to
3, 4, or more different places. Each part of the split puts a new line in the
SAM file, and note that all the records also have the same read name, or QNAME
(first column of SAM
).
#What are secondary alignments/multi-mappers
Secondary alignments generally come from "multi-mappers" where the entire read
maps equally well (or at least somewhat equally well) to, say, somewhere on both
chr4
and chr1
. "Multi mapping" results in secondary alignments, while split
reads result in supplementary alignments. See "Multiple mapping" in the
SAMv1.pdf for the definition
of multi-mapping. Note also that secondary alignments sometimes are missing the
SEQ
field entirely too, see
https://github.com/lh3/minimap2/issues/458#issuecomment-516661855
I wrote a tool called secondary_rewriter to add the SEQ
field back to
secondary alignments, which may help in some cases
#What is the SA
tag?
The SA
tag is outputted on each part of the supplementary/split/chimeric
alignment, e.g. the primary contains an SA
tag that refers to information
(e.g. the location) of where all the supplementary alignments where placed, and
each of the supplementary alignments also contains an SA
tag that refers to
the primary alignment and each other supplementary alignment.
Fun fact: The SA
tag conceptually can result in a 'quadratic explosion' of
data, because each part of the split contains references to every other part.
For example, if a read is split into 4 pieces, then each record would would have
an SA
tag with 3 segments, so 3*4 segments will be documented in the SA
tag. In many cases, this is not a problem, but if you imagine a finished
chromosome aligned to a draft assembly, it may get split so many times this
could be a factor.
See SAMtags.pdf for more
info on the SA
tag.
#Visualizing split reads across a breakend or translocation
This is a specialized JBrowse 2 feature, but if there is an inter-chromosomal translocation, you can load this into JBrowse and visualize support for this event using our "breakpoint split view". This view shows the evidence for the reads that are split aligned across an SV, and can show connections between paired-end reads across an SV too.
We also have a workflow called the "SV inspector" that helps you setup the
"breakpoint split views" (https://jbrowse.org/jb2/docs/user_guide/#sv-inspector.
The SV inspector and Breakpoint split view work best on Breakends (e.g. VCF 4.3
section 5.4) and <TRA>
(translocation) events from VCF
, or BEDPE
formatted
SV calls, and you can launch the "breakpoint split view" from the "SV inspector"
#Visualizing a 'read vs reference' view given a split alignment
If we are given the the primary alignment of an arbitrary split read, then we can construct what that split looks like compared to the reference genome.
If we are not given the primary alignment (e.g. we are starting from a
supplementary alignment) then we can search the SA
list for the one that is
primary, because at least one will be.
Now that we have the primary alignment, it will have the SEQ
(of the entire
read, the supplementary alignments typically have a blank SEQ
!) and the SA
tag containing the CIGAR
of all the different parts of the split. We can then
construct how the entire read, not just a particular record of the split
alignment, compares to the genome. In JBrowse 2 we implemented this and it uses
a synteny-style rendering. [1]
Figure showing JBrowse 2 piecing together a long read vs the reference genome from a single read
In order to do this reconstruction, JBrowse 2 takes the CIGAR
strings of the
primary alignment and each of the pieces of the SA
tag (it is a semi-colon
separated list of chunks), sort them by the amount of softclipping (the
softclipping values will progressively trim off more of the SEQ
telling you it
aligned further and further on in the long read), and then this tells me where
each piece of the split alignment came from in the original SEQ
, so we can
plot the alignments of the read vs the reference genome using synteny style
display.
[1] Similar functionality also exists in GenomeRibbon https://genomeribbon.org
#SAM
vs VCF
- Breakends vs split alignments
An interesting outcome (to me) is that from a single record in a SAM
file, I
can reconstruct the "derived" genome around a region of interest from a single
read.
If I was to try to do this with the VCF
Breakend specification (section 5.4 of
VCF4.3.pdf), it may actually
be more challenging than from a SAM
read. This is because a Breakend in VCF
format is only an edge in a graph (and the sequences are nodes). Therefore, in
order to properly reconstruct a structural variant from a VCF
with Breakends,
I would have to construct a graph and decode paths through it.
I like the ability to reconstruct the derived genome from a single read, but
individual reads can be noisy (contain errors). That said, de novo assembled
contigs can also be stored in SAM
format and is significantly less noisy
(being composed of the aggregate information of many reads).
The point though is that interpretation of the VCF
breakend specification is
challenging due to imposing a sequence graph on the genome, while the SA
tag
remains just a simple set of linear alignments that can easily be pieced
together, and you only need to refer to a single record in the SAM
file to do
so.
I am not aware of a lot of tools that work on the VCF
Breakend graph, and
expect more will need to be created to truly work with this standard. An
inversion for example may create 4 record in the VCF
file (see section 5.4 in
the VCF4.3.pdf for example),
and needs careful interpretation.
#Haplotype tagged reads
A new trend has been to create SAM
/BAM
/CRAM
files with tagged reads, which
tells us which haplotype a read was inferred to have come from. This is commonly
done with the HP
tag, which might have HP=0
and HP=1
for a diploid genome.
Tools like whatshap
can add these tags to a SAM
file, and IGV and JBrowse 2
can color and sort by these tags.
Screenshot of JBrowse 2 with the "Color by tag" and "Sort by tag" setting
enabled (coloring and sorting by the HP
tag) letting us see that only one
haplotype has a deletion. Tutorial for how to do this in JBrowse 2 here
https://jbrowse.org/jb2/docs/user_guide/#sort-color-and-filter-by-tag
#How do you detect SVs with paired-end reads?
#Distance between pairs being abnormally large or short
The distance between pairs is encoded by the TLEN
column in the SAM
format.
The distance between pairs with good mapping is relatively constant and called
the "insert length". This comes from how the sequencing is done: paired-end
sequencing performs sequencing on both ends of a fragment.
But, if you are mapping reads vs the reference genome, and you observe that they are abnormally far apart, say 50kb apart instead of 1kb apart, this may indicate there your sample contains a deletion relative to the reference.
Screenshot of JBrowse 1 with "View as pairs" enabled with "mate paired" sequencing, and large insert size colored as red (from https://jbrowse.org/docs/paired_reads.html. Note that some of JBrowse 1's View as pairs features are not yet available in JBrowse 2
#Mate pairs vs paired end reads
"Mate paired" reads have a larger distance between pairs (aka the insert size is larger). Think: mate pairs have 2-5kb between pairs instead of a couple hundred bp. Mate pairs can be useful in scaffolding de novo genome assemblies since the larger range can span gaps, giving evidence that multiple contigs are connected.
"Paired-end" reads typically have like 200-500bp distance between pairs (aka, the insert size).
The above JBrowse screenshot shows mate pairs. Note that mate pairs have a different pair orientation, and pairs don't necessarily point at each other, they "point away from each other" in normal conditions due. See Appendix A for more info.
#Linked reads aka simulated long reads
Linked reads are another method that uses short reads but gives long read information. I don't have a lot of info on this but will update this section if I do. The company, 10x genomics, that originally used this discontinued commercial sequencing using this method to focus instead on single cell AFAIK. https://www.10xgenomics.com/products/linked-reads
#An abundance of reads being "clipped" at a particular position
This can indicate that part of the reads map well, but then there was an abrupt stop to the mapping. This might mean that there is a sequence that was an insertion at that position, or a deletion, or a translocation.
The clipping is indicated by the CIGAR
string, either at the start or end of
it by an S
or an H
. The S
indicates "soft clipping", and indicates that
the sequence of the clipped portion can be found in the SEQ
field of the
primary alignment. The H
is hard clipped, and the sequence that is hard
clipped will not appear in the SEQ
.
Screenshot of JBrowse 2 showing blue clipping indicator with a "pileup" of soft-clipping at a particular position shown in blue. The clipping is an "interbase" operation (it occurs between base pair coordinates) so it is plotted separately from the normal coverage histogram.
Screenshot of JBrowse 2 showing an insertion with Nanopore (top), PacBio
(middle) and Illumina short reads. The long reads may completely span the
insertion, so the CIGAR
string on those have an I
operator and are indicated
by the purple triangle above the reads. For the short reads, the reads near the
insertion will be clipped since they will not properly map to the reference
genome and cannot span the sinsertion. The "Show soft clipping" setting in
JBrowse 2 and IGV can be used to show visually the bases that extend into the
insertion (shown on the bottom track).
#Unexpected pair orientation
With standard paired end sequencing, the pairs normally point at each other
forward reverse
---> <---
If the stranded-ness of the pair is off, then it could indicate a structural variant. See Appendix A for a handy function for calculating pair orientation.
This guide from IGV is helpful for interpreting the pair directionality with patterns of SVs using "Color by pair orientation"
https://software.broadinstitute.org/software/igv/interpreting_pair_orientations
Figure: JBrowse 2 showing an inverted (tandem) duplication in 1000 genomes data.
It uses the same coloring as IGV for pair orientation. The tandem duplication
can produce green arrows which have reads pointing in opposite directions e.g.
<--
and -->
, while blue arrows which can indicate an inversion point in the
same direction e.g. -->
and -->
#Caveat about TLEN
Note that TLEN
is a field in the SAM format that is somewhat ill defined, at
least in the sense that different tools may use it differently
https://github.com/pysam-developers/pysam/issues/667#issuecomment-381521767
If needed, you can calculate TLEN
yourself if you process the file yourself
(e.g. process all reads, get the actual records for the pairs, and calculate
distance) but I have not had trouble with relying on the TLEN
from the data
files themselves.
#Calling copy number variants with your short or long reads
Another type of SV that you can get from your SAM
files are copy number
variants (CNVs). By looking at the depth-of-coverage for your data files, you
can look for abnormalities that may indicate copy number variants. By using a
tool like mosdepth
, you can quickly get a file showing the coverage across the
genome.
Be aware that if you are comparing the coverage counts from different tools,
that they have different defaults that may affect comparison. Some discard
QC_FAIL
, DUP
, and SECONDARY
flagged reads. This is probably appropriate,
and corresponds to what most genome browsers will display (see
https://gist.github.com/cmdcolin/9f677eca28448d8a7c5d6e9917fc56af for a short
summary of depth calculated from different tools)
Note that both long and short reads can be used for CNV detection. Long reads may give more accurate measurements also, with their better ability to map smoothly through difficult regions of the genome.
Screenshot showing coverage in BigWig
format from nanopore reads on normal and
tumor tissue from a melanoma cancer cell line (COLO829) plotted using JBrowse 2.
This coverage data is calculated from nanopore sequencing from
here using
mosdepth
, converted from BedGraph
to
BigWig
, and loaded into JBrowse 2. See
(demo
and
tutorial)
#The future, with graph genomes and de novo assemblies
Currently, SV visualization is highly based on comparing data versus a reference
genome (and the SAM
format is a signature of this: it stores data in terms of
reference genome coordinates). In the future, SV visualization may look more
similar to comparative genomics, where we compare an SV to a population specific
reference from a graph genomes or something like this.
It is known that de novo assembly has more power to detect SVs than some read operations (https://twitter.com/lh3lh3/status/1362921612690010118/photo/1 as de novo assembled genomes improve and become more widespread, we may see a shift in how SVs are called
I would also like to see improved ability to do fast or 'on the fly' gene prediction on the de novo assembled genomes, and we can see what SNPs or modified splicing might look like in copies of genes (e.g. derived regions of the CNV duplications).
Fun fact: the
GAF
(graphical alignment format) is a strict superset of
PAF
(pairwise alignment
format) by storing graph node labels in the target name
slot of PAF
, and can
refer to an rGFA
(reference genome graph)! Looking forward to the graph genome
world.
#Conclusion
Algorithms that actually call structural variants face many challenges, but understanding how the reads are encoded in SAM format, and seeing what they look like in the genome browser is a useful first step to gaining a better understanding.
In summary, some of the signatures of SVs may include:
- Aberrant insert size (
TLEN
) detection (longer for deletion, shorter for insertion) - Aberrant pair orientation (pairs are not pointing at each other)
- Split-read detection (
SA
tag) CIGAR
string processing (D
operator for deletions,I
operator for insertions)- Over-abundance of clipping (
S
orH
operators inCIGAR
) - Depth of coverage changes for CNVs
- Aligning de novo assembly vs a reference genome
(https://twitter.com/lh3lh3/status/1362921612690010118/photo/1) which can
output
SAM
, but it can also outputPAF
format (which can be loaded in JBrowse 2 in the synteny views). Techniques of detecting SVs on PAF will be fundamentally pretty similar to the techniques listed above but may look a bit different (seecs
tag inPAF
for example, it is a modifiedCIGAR
-like string)
If you have any ideas I should include here, let me know!
#Appendix A: Interpreting pair orientation
This may be unnecessarily low level code for this article, but it could be helpful utility to help determine the pair orientation from a single SAM/BAM/CRAM record.
Then this can be broken down further by orientation type
- Paired end reads are "fr"
- Mate pair reads are "rf"
So you can interpret e.g. F1R2 in relation to being a paired end read (fr) or mate pair (rf) below and with this link https://software.broadinstitute.org/software/igv/interpreting_pair_orientations
#Appendix B - CIGAR
parsing
Then parse the returned array two at a time
Note for example, that to determine how long a record is on the reference sequence, you have to combine the records start position with the CIGAR string, basically parsing the CIGAR string to add up tpos and return tpos
#Appendix C - align FASTQ
directly to CRAM
This example from the htslib documentation
(http://www.htslib.org/workflow/fastq.html shows how you can stream directly
from FASTQ
to CRAM
(and generate the index file .crai too)
If you want, you can make this a little shell script, easy_align_shortreads.sh
easy_align_shortreads.sh
Similar idea for longreads, except just a single fastq file is generally used for longreads
easy_align_longreads.sh
Then call
This same concept works with other common aligners as well like bwa
Bonus: CRAM to bigwig, for looking at CNV/coverage
Call as "quickalign.sh ref.fa 1.fq 2.fq out.cram" gives you out.cram, out.cram.crai, and out.cram.bw (coverage)
#Appendix D - the MD
tag and finding SNPs in reads
The MD
tag helps tell you where the mismatches are without looking at the
reference genome. This is useful because as I mentioned, CIGAR
can say 50M
(50 matches) but some letters inside those 50 matches can be mismatches, it only
says there are no insertions/deletions in those 50 bases, but you have to
determine where in those 50 bases where the mismatches are. The MD
tag can
help tell you where those are, but it is somewhat complicated to decode
(https://vincebuffalo.com/notes/2014/01/17/md-tags-in-bam-files.html). You have
to combine it with the CIGAR
to get the position of the mismatches on the
reference genome. If you have a reference genome to look at, you might just
compare all the bases within the 50M to the reference genome and look for
mismatches yourself and forget about the MD
tag
The MD
tag is also not required to exist, but the command
samtools calmd yourfile.bam --reference reference.fa
can add MD
tags to your
BAM
file. It is generally not useful for CRAM
because CRAM
actually does
store mismatches with the reference genome in it's compression format. Note that
there are also some oddities about MD
tag representation leading to complaints
(e.g. https://github.com/samtools/hts-specs/issues/505) leading more credence to
"doing it yourself" e.g. finding your own mismatches by comparing the read
sequence with the reference, keeping track of where you are on the read and ref
position with the CIGAR
string (a la Appendix B)