The
SAM
file format and it's specification
is pretty amazing , but it is also fairly terse and abstract. To really
understand what is going on with your reads, you probably gotta play around with
real world data and learn from tutorials.
I will try to relay some things I have learned over the years, with a bit of a
focus on how SAM
file concepts can relate to structural variants.
Disclaimer: I'm a developer of JBrowse 2. This document has some screenshots and links for it, feel free to try it at https://jbrowse.org.
SAM
file and how does it relate to BAM
and CRAM
?A SAM
file generally contains "reads" from a sequencer, with information
about how they are mapped to a reference genome [^1][^2].
A SAM
file is generally produced when an aligner takes in raw unaligned
reads (often stored in FASTQ
format files) and aligns them to a reference
genome [^3].
A SAM
file is a text format that you can read with your text editor. BAM
and CRAM
are compressed representations of the SAM
format.
You generally see BAM
or CRAM
in the wild instead of SAM
since the files
are very large, so the compressed versions are better to store on disk.
See Appendix E on how to convert between SAM/BAM/CRAM.
[^1] SAM
can contain any type of sequence, not specifically reads. If you
created a de novo assembly, you could align the contigs of the de novo
assembly to a reference genome and store the results in SAM
.
[^2] Does not always have to have information about mapping to a reference
genome. You can also store unaligned data in SAM
/BAM
/CRAM
(so-called
uBAM
for example) but most of the time, the reads in SAM
format are aligned
to a reference genome.
[^3] Examples of programs that do alignment include bwa
, bowtie
, and
minimap2
(there are many others). These programs all can produce SAM
outputs
SAM
/BAM
/CRAM
file?A SAM
file contains a "header" and a series of "records".
SAM
"record"?A record is a single line in a SAM
file, and it generally corresponds to a
single read, but as we will see, a split alignment may produce multiple records
that refer to the same source read.
Note that the word read and record are sometimes used interchangeably, but
record has the more specific meaning of being a single line in the SAM
file.
SAM
fileA SAM file has a core set of required fields, and then an arbitrary list of
extra columns called tags. The tags have a two character abbreviation like MQ
(mapping quality) or many others. They can be upper or lower case. Upper case
are reserved for official usages (except those with X, Y, or Z prefixed). See
SAMtags.pdf for more details
CIGAR
string and how do you interpret it?A CIGAR
string is a "compact idiosyncratic gapped alignment report". It tells
you about insertions, deletions, and clipping. It is a series of "operators"
with lengths.
Insertion example:
50M50I50M
That would be 50bp of matching bases (50M
), followed by a 50bp insertion
(50I
), followed by another 50bp of matches (50M
). The 50bp insertion means
the read contains 50 bases in the middle which did not match the reference
genome that you are comparing the read to.
Clipping example:
50S50M50S
This means that 50bp matched (50M
in the middle of the CIGAR
string) and
both sides of the read are soft clipped. The clipping means the aligner was not
able to align the reads on either side.
Notes:
Finding mismatches: A CIGAR
string match like 50M
means 50 bases
"matched" the reference genome, but that only means that there are no
insertions or deletions in those 50 bases. There could be underlying
mismatches in the read compared to the reference. Note: there is also extended
CIGAR
that replaces M
with =
(exact match) and X
(mismatch). Also see
Appendix D on the MD
tag and finding where the mismatches are, but note that
MD
tag is tricky
Ambiguity of representation: A CIGAR
string with insertions and deletions
could be 50M1D1I50M
. This string had a 1bp deletion and a 1bp insertion
back-to-back. This could be just a mismatch! There is ambiguity in sequence
alignment representations. Downstream programs must accommodate this.
Split records and soft-clipping: A CIGAR
string with soft-clipping
500S50M
this means that 500 bases of the read were not aligned at this
position, but 50 bases were! Note that the alignment might have been a split
alignment (see section on split alignments below) so another record in the
SAM
file, linked by the SA
(supplementary alignment) tag, might contain
info on where the other 500 bases aligned! (or, they might not map anywhere).
The linked split or supplementary alignments all have the same read name
(QNAME
).
See SAMv1.pdf for all the CIGAR operators.
If you are working with SAM
data, you will often write loops that directly
parse CIGAR strings. See Appendix B for handy functions for parsing CIGAR
strings. Don't fear the CIGAR
!
Records do not have easy-to-interpret "start" and "end" and "strand" field like,
say, lines in a BED
file do
Instead, each record has:
POS
coordinateCIGAR
stringUsing this, you can find the "end" coordinate of a record by, starting at the
POS coordinate, and enumerating all the matches and deletions and insertions and
clipping of the CIGAR
. Once you have processed the whole CIGAR
string, you
can find where the record "ends" on the reference genome. And the "start" will
be POS
Reads can align to the "forward strand" of the reference genome (e.g. it matches the DNA letters as written in the reference genome FASTA file), resulting in "forward strand reads".
Or, they can align to the reverse complement of the DNA in the reference genome, resulting in "reverse strand reads".
Strandedness is recorded for each record in the SAM/BAM/CRAM file by having the SAM flags bitwise flag 16 enabled (https://broadinstitute.github.io/picard/explain-flags.html), the flag there is referred to as "read reverse strand"
CIGAR
for reverse strand reads?As mentioned above, you get the "start" and "end" by processing the CIGAR
, so
one might wonder whether you count "downwards" from POS for reverse strand
reads. This is actually not correct though
Here is what the SAM
spec has to say about this
"
POS
is the 1-based leftmost mapping POSition of the firstCIGAR
operation that “consumes” a reference base"
And regarding CIGAR
and other fields:
"For segments that have been mapped to the reverse strand, the recorded
SEQ
is reverse complemented from the original unmapped sequence andCIGAR
,QUAL
, and strand-sensitive optional fields are reversed and thus recorded consistently with the sequence bases as represented"
That is a little bit of a mind boggle but effectively, no matter whether the
record is forward or reverse strand, you count upwards when processing the
CIGAR
.
You can see in an example of this in SAMv1 section 1.1 -- the read r001/2
is a
"reverse strand" read, and its POS
, or "start" position, is 37, and then the
CIGAR
(9M, 9 matches) counts upwards from there, giving it an "end" position
of 45.
SAM
fileWhen you run an aligner program like bwa or minimap2, you type e.g.
minimap2 -a reference_genom.fa reads.fq > out.sam
out.sam will be ordered in the same order as the fastq. For preparation for
loading into analysis tools or genome browsers, we will often sort the reads
using samtools sort
which will group by chromosome name, and then increasing
in coordinate start position. Then samtools index
creates an index file (e.g.
bam.bai, cram.crai) that aids getting the reads in a specific genomic region
If a read failed to align to the reference genome, it may still be in your SAM
file, marked as unmapped using the flag column. Sometimes, "dumpster diving"
(looking at the unmapped records from a SAM
file) can be used to aid
structural variant searches (e.g. there may be novel sequence in there not from
the reference genome that could be assembled)
Long reads offer a wide array of methods for detecting SVs
I
or D
in a CIGAR
string.Note that there are many different methods for detecting SVs from long reads, e.g. not all use mapped reads from SAM files, some use de novo assembly, but it is still useful to be familiar with mapped read methods.
Split alignments, or chimeric alignments, are alignments where part of the read
maps to one place, and another part to another. For example, part of a long read
may map to chr1
and part of it maps to chr4
. It is worth reading the
definition of "Chimeric alignment" from
SAMv1.pdf when you get the
chance.
As SAMv1.pdf tells us, one
record is marked as "representative", sometimes also called the "primary"
record, while the other components of the split read are marked "supplementary",
given the 2048 flag. The "primary" record generally has a SEQ
field that
represents the entirety of the original read's sequence (with CIGAR soft
clipping operators saying which part of that sequence aligned), and the
"supplementary alignments" will have SEQ
field but sometimes just segments of
the original read's sequence with CIGAR hard clipping operators indicating that
it is partial.
Supplementary alignments are especially common with long reads, and it can be a signal for structural variants e.g. where two chromosomes are fused together, and parts of the read align to multiple chromosomes, or the split alignment may align to either side of a large deletion, or they may be split to align through an inversion (part of it aligns to the forward strand, part of it to the reverse strand, and again the forward strand)
There is no limitation on how many splits might occur so the split can align to
3, 4, or more different places. Each part of the split puts a new line in the
SAM file, and note that all the records also have the same read name, or QNAME
(first column of SAM
).
Secondary alignments generally come from "multi-mappers" where the entire read
maps equally well (or at least somewhat equally well) to, say, somewhere on both
chr4
and chr1
. "Multi mapping" results in secondary alignments, while split
reads result in supplementary alignments. See "Multiple mapping" in the
SAMv1.pdf for the definition
of multi-mapping. Note also that secondary alignments sometimes are missing the
SEQ
field entirely too, see
https://github.com/lh3/minimap2/issues/458#issuecomment-516661855
I wrote a tool called secondary_rewriter to add the SEQ
field back to
secondary alignments, which may help in some cases
SA
tag?The SA
tag is outputted on each part of the supplementary/split/chimeric
alignment, e.g. the primary contains an SA
tag that refers to information
(e.g. the location) of where all the supplementary alignments where placed, and
each of the supplementary alignments also contains an SA
tag that refers to
the primary alignment and each other supplementary alignment.
Fun fact: The SA
tag conceptually can result in a 'quadratic explosion' of
data, because each part of the split contains references to every other part.
For example, if a read is split into 4 pieces, then each record would would have
an SA
tag with 3 segments, so 3*4 segments will be documented in the SA
tag. In many cases, this is not a problem, but if you imagine a finished
chromosome aligned to a draft assembly, it may get split so many times this
could be a factor.
See SAMtags.pdf for more
info on the SA
tag.
This is a specialized JBrowse 2 feature, but if there is an inter-chromosomal translocation, you can load this into JBrowse and visualize support for this event using our "breakpoint split view". This view shows the evidence for the reads that are split aligned across an SV, and can show connections between paired-end reads across an SV too.
We also have a workflow called the "SV inspector" that helps you setup the
"breakpoint split views" (https://jbrowse.org/jb2/docs/user_guide/#sv-inspector.
The SV inspector and Breakpoint split view work best on Breakends (e.g. VCF 4.3
section 5.4) and <TRA>
(translocation) events from VCF
, or BEDPE
formatted
SV calls, and you can launch the "breakpoint split view" from the "SV inspector"
If we are given the the primary alignment of an arbitrary split read, then we can construct what that split looks like compared to the reference genome.
If we are not given the primary alignment (e.g. we are starting from a
supplementary alignment) then we can search the SA
list for the one that is
primary, because at least one will be.
Now that we have the primary alignment, it will have the SEQ
(of the entire
read, the supplementary alignments typically have a blank SEQ
!) and the SA
tag containing the CIGAR
of all the different parts of the split. We can then
construct how the entire read, not just a particular record of the split
alignment, compares to the genome. In JBrowse 2 we implemented this and it uses
a synteny-style rendering. [1]
Figure showing JBrowse 2 piecing together a long read vs the reference genome from a single read
In order to do this reconstruction, JBrowse 2 takes the CIGAR
strings of the
primary alignment and each of the pieces of the SA
tag (it is a semi-colon
separated list of chunks), sort them by the amount of softclipping (the
softclipping values will progressively trim off more of the SEQ
telling you it
aligned further and further on in the long read), and then this tells me where
each piece of the split alignment came from in the original SEQ
, so we can
plot the alignments of the read vs the reference genome using synteny style
display.
[1] Similar functionality also exists in GenomeRibbon https://genomeribbon.org
SAM
vs VCF
- Breakends vs split alignmentsAn interesting outcome (to me) is that from a single record in a SAM
file, I
can reconstruct the "derived" genome around a region of interest from a single
read.
If I was to try to do this with the VCF
Breakend specification (section 5.4 of
VCF4.3.pdf), it may actually
be more challenging than from a SAM
read. This is because a Breakend in VCF
format is only an edge in a graph (and the sequences are nodes). Therefore, in
order to properly reconstruct a structural variant from a VCF
with Breakends,
I would have to construct a graph and decode paths through it.
I like the ability to reconstruct the derived genome from a single read, but
individual reads can be noisy (contain errors). That said, de novo assembled
contigs can also be stored in SAM
format and is significantly less noisy
(being composed of the aggregate information of many reads).
The point though is that interpretation of the VCF
breakend specification is
challenging due to imposing a sequence graph on the genome, while the SA
tag
remains just a simple set of linear alignments that can easily be pieced
together, and you only need to refer to a single record in the SAM
file to do
so.
I am not aware of a lot of tools that work on the VCF
Breakend graph, and
expect more will need to be created to truly work with this standard. An
inversion for example may create 4 record in the VCF
file (see section 5.4 in
the VCF4.3.pdf for example),
and needs careful interpretation.
A new trend has been to create SAM
/BAM
/CRAM
files with tagged reads, which
tells us which haplotype a read was inferred to have come from. This is commonly
done with the HP
tag, which might have HP=0
and HP=1
for a diploid genome.
Tools like whatshap
can add these tags to a SAM
file, and IGV and JBrowse 2
can color and sort by these tags.
Screenshot of JBrowse 2 with the "Color by tag" and "Sort by tag" setting
enabled (coloring and sorting by the HP
tag) letting us see that only one
haplotype has a deletion. Tutorial for how to do this in JBrowse 2 here
https://jbrowse.org/jb2/docs/user_guide/#sort-color-and-filter-by-tag
The distance between pairs is encoded by the TLEN
column in the SAM
format.
The distance between pairs with good mapping is relatively constant and called
the "insert length". This comes from how the sequencing is done: paired-end
sequencing performs sequencing on both ends of a fragment.
But, if you are mapping reads vs the reference genome, and you observe that they are abnormally far apart, say 50kb apart instead of 1kb apart, this may indicate there your sample contains a deletion relative to the reference.
Screenshot of JBrowse 1 with "View as pairs" enabled with "mate paired" sequencing, and large insert size colored as red (from https://jbrowse.org/docs/paired_reads.html. Note that some of JBrowse 1's View as pairs features are not yet available in JBrowse 2
"Mate paired" reads have a larger distance between pairs (aka the insert size is larger). Think: mate pairs have 2-5kb between pairs instead of a couple hundred bp. Mate pairs can be useful in scaffolding de novo genome assemblies since the larger range can span gaps, giving evidence that multiple contigs are connected.
"Paired-end" reads typically have like 200-500bp distance between pairs (aka, the insert size).
The above JBrowse screenshot shows mate pairs. Note that mate pairs have a different pair orientation, and pairs don't necessarily point at each other, they "point away from each other" in normal conditions due. See Appendix A for more info.
Linked reads are another method that uses short reads but gives long read information. I don't have a lot of info on this but will update this section if I do. The company, 10x genomics, that originally used this discontinued commercial sequencing using this method to focus instead on single cell AFAIK. https://www.10xgenomics.com/products/linked-reads
This can indicate that part of the reads map well, but then there was an abrupt stop to the mapping. This might mean that there is a sequence that was an insertion at that position, or a deletion, or a translocation.
The clipping is indicated by the CIGAR
string, either at the start or end of
it by an S
or an H
. The S
indicates "soft clipping", and indicates that
the sequence of the clipped portion can be found in the SEQ
field of the
primary alignment. The H
is hard clipped, and the sequence that is hard
clipped will not appear in the SEQ
.
Screenshot of JBrowse 2 showing blue clipping indicator with a "pileup" of soft-clipping at a particular position shown in blue. The clipping is an "interbase" operation (it occurs between base pair coordinates) so it is plotted separately from the normal coverage histogram.
Screenshot of JBrowse 2 showing an insertion with Nanopore (top), PacBio
(middle) and Illumina short reads. The long reads may completely span the
insertion, so the CIGAR
string on those have an I
operator and are indicated
by the purple triangle above the reads. For the short reads, the reads near the
insertion will be clipped since they will not properly map to the reference
genome and cannot span the sinsertion. The "Show soft clipping" setting in
JBrowse 2 and IGV can be used to show visually the bases that extend into the
insertion (shown on the bottom track).
With standard paired end sequencing, the pairs normally point at each other
forward reverse ---> <---
If the stranded-ness of the pair is off, then it could indicate a structural variant. See Appendix A for a handy function for calculating pair orientation.
This guide from IGV is helpful for interpreting the pair directionality with patterns of SVs using "Color by pair orientation"
https://software.broadinstitute.org/software/igv/interpreting_pair_orientations
Figure: JBrowse 2 showing an inverted (tandem) duplication in 1000 genomes data.
It uses the same coloring as IGV for pair orientation. The tandem duplication
can produce green arrows which have reads pointing in opposite directions e.g.
<--
and -->
, while blue arrows which can indicate an inversion point in the
same direction e.g. -->
and -->
Note that TLEN
is a field in the SAM format that is somewhat ill defined, at
least in the sense that different tools may use it differently
https://github.com/pysam-developers/pysam/issues/667#issuecomment-381521767
If needed, you can calculate TLEN
yourself if you process the file yourself
(e.g. process all reads, get the actual records for the pairs, and calculate
distance) but I have not had trouble with relying on the TLEN
from the data
files themselves.
Another type of SV that you can get from your SAM
files are copy number
variants (CNVs). By looking at the depth-of-coverage for your data files, you
can look for abnormalities that may indicate copy number variants. By using a
tool like mosdepth
, you can quickly get a file showing the coverage across the
genome.
Be aware that if you are comparing the coverage counts from different tools,
that they have different defaults that may affect comparison. Some discard
QC_FAIL
, DUP
, and SECONDARY
flagged reads. This is probably appropriate,
and corresponds to what most genome browsers will display (see
https://gist.github.com/cmdcolin/9f677eca28448d8a7c5d6e9917fc56af for a short
summary of depth calculated from different tools)
Note that both long and short reads can be used for CNV detection. Long reads may give more accurate measurements also, with their better ability to map smoothly through difficult regions of the genome.
Screenshot showing coverage in BigWig
format from nanopore reads on normal and
tumor tissue from a melanoma cancer cell line (COLO829) plotted using JBrowse 2.
This coverage data is calculated from nanopore sequencing from
here using
mosdepth
, converted from BedGraph
to
BigWig
, and loaded into JBrowse 2. See
(demo
and
tutorial)
Currently, SV visualization is highly based on comparing data versus a reference
genome (and the SAM
format is a signature of this: it stores data in terms of
reference genome coordinates). In the future, SV visualization may look more
similar to comparative genomics, where we compare an SV to a population specific
reference from a graph genomes or something like this.
It is known that de novo assembly has more power to detect SVs than some read operations (https://twitter.com/lh3lh3/status/1362921612690010118/photo/1 as de novo assembled genomes improve and become more widespread, we may see a shift in how SVs are called
I would also like to see improved ability to do fast or 'on the fly' gene prediction on the de novo assembled genomes, and we can see what SNPs or modified splicing might look like in copies of genes (e.g. derived regions of the CNV duplications).
Fun fact: the
GAF
(graphical alignment format) is a strict superset of
PAF
(pairwise alignment
format) by storing graph node labels in the target name
slot of PAF
, and can
refer to an rGFA
(reference genome graph)! Looking forward to the graph genome
world.
Algorithms that actually call structural variants face many challenges, but understanding how the reads are encoded in SAM format, and seeing what they look like in the genome browser is a useful first step to gaining a better understanding.
In summary, some of the signatures of SVs may include:
TLEN
) detection (longer for deletion, shorter for
insertion)SA
tag)CIGAR
string processing (D
operator for deletions, I
operator for
insertions)S
or H
operators in CIGAR
)SAM
, but it can also output
PAF
format (which can
be loaded in JBrowse 2 in the synteny views). Techniques of detecting SVs on
PAF will be fundamentally pretty similar to the techniques listed above but
may look a bit different (see cs
tag in PAF
for example, it is a modified
CIGAR
-like string)If you have any ideas I should include here, let me know!
This may be unnecessarily low level code for this article, but it could be helpful utility to help determine the pair orientation from a single SAM/BAM/CRAM record.
Then this can be broken down further by orientation type
So you can interpret e.g. F1R2 in relation to being a paired end read (fr) or mate pair (rf) below and with this link https://software.broadinstitute.org/software/igv/interpreting_pair_orientations
CIGAR
parsingThen parse the returned array two at a time
Note for example, that to determine how long a record is on the reference sequence, you have to combine the records start position with the CIGAR string, basically parsing the CIGAR string to add up tpos and return tpos
FASTQ
directly to CRAM
This example from the htslib documentation
(http://www.htslib.org/workflow/fastq.html shows how you can stream directly
from FASTQ
to CRAM
(and generate the index file .crai too)
If you want, you can make this a little shell script, easy_align_shortreads.sh
easy_align_shortreads.sh
Similar idea for longreads, except just a single fastq file is generally used for longreads
easy_align_longreads.sh
Then call
This same concept works with other common aligners as well like bwa
Bonus: CRAM to bigwig, for looking at CNV/coverage
Call as "quickalign.sh ref.fa 1.fq 2.fq out.cram" gives you out.cram, out.cram.crai, and out.cram.bw (coverage)
MD
tag and finding SNPs in readsThe MD
tag helps tell you where the mismatches are without looking at the
reference genome. This is useful because as I mentioned, CIGAR
can say 50M
(50 matches) but some letters inside those 50 matches can be mismatches, it only
says there are no insertions/deletions in those 50 bases, but you have to
determine where in those 50 bases where the mismatches are.
The MD
tag can help tell you where those are, but it is somewhat complicated
to decode (https://vincebuffalo.com/notes/2014/01/17/md-tags-in-bam-files.html).
You have to combine it with the CIGAR
to get the position of the mismatches on
the reference genome. If you have a reference genome to look at, you might just
compare all the bases within the 50M to the reference genome and look for
mismatches yourself and forget about the MD
tag
The MD
tag is also not required to exist, but the command
samtools calmd yourfile.bam --reference reference.fa
can add MD
tags to your
BAM
file. It is generally not useful for CRAM
because CRAM
actually does
store mismatches with the reference genome in it's compression format.
Note that there are some oddities about MD
tag representation leading to
complaints (e.g. https://github.com/samtools/hts-specs/issues/505), which could
lend credence to "doing it yourself" e.g. finding your own mismatches by
comparing the read sequence with the reference, instead of relying on the MD
tag. (see Appendix B)
You can convert SAM
to BAM
with samtools
You can also convert a BAM
back to SAM
with samtools view
The -h just makes sure to preserve the header.
If you are converting SAM
to CRAM
, it may require the -T argument to specify
your reference sequence (this is because the CRAM
is "reference compressed")
Note that in some cases you can pipe data directly from e.g. an aligner straight
to CRAM. See Appendix C: piping FASTQ from minimap2
directly to CRAM