CRAM codec specification (version 3.1)

Table of Contents

This document covers the compression and decompression algorithms (codecs) specific to the CRAM format. All bar the first of these were introduced in CRAM v3.1.

It does not cover the CRAM format itself. For that see http://samtools.github.io/hts-specs/.

1.1 Pseudocode introduction#

Various parts of this specification are written in a simplistic pseudocode. This intentionally does not make explicit use of data types and has minimal error checking. The number of operators is kept to a minimum, but some are necessary and may be language specific. Due to the lack of explicit data types, we also have different division operators to symbolise floating point and integer divisions.

The pseudocode doesn’t prescribe any particular programming paradigm - functional, procedural or object oriented - but it does have a few implicit assumptions. Variables are considered to be passed between functions via unspecified means. For example the Range Coder sets range\textit{range} and code\textit{code} during creation and these are used during the decoding steps, but are not explicitly passed in as variables. We make the implicit assumption that they are simply local variables of the particular usage of this range coder. Other than ephemeral loop counters, we do not reuse variable names so the intention should be clear.

The exception to the above is occasionally we need to have multiple instances of a particular data type, such as Order-1 decoding will have many models. Here we use an object oriented way of describing the problem with instance\textit{instance}.Function notation.

Note some functions may return multiple items, such as return (value, length), but the calling code may assign a single variable to this result. In this case the first value value will be used and length will be discarded.

1.2 Mathematical operators#

Operator Description
a+ba + b Addition
aba - b Subtraction
a×ba \times b Multiplication
a/ba\ /\ b Floating point division a/ba/b
adivba \mathbin{\text{div}} b Integer division a/ba/b, equivalent to a/b\lfloor{a/b}\rfloor
amodba \bmod b Integer modulo (remainder) ab×(adivb)a - b\times(a \mathbin{\text{div}} b)
a=ba = b Compares aa and bb variables, yielding true if they match, false otherwise
aba \gets b Assigns value of bb to variable aa
aba \ll b Bit-wise left shift aa by bb bits, shifting in zeros
aba \gg b Bit-wise right shift aa by bb bits, shifting in zeros
aANDba \mathbin{\text{AND}} b Bit-wise AND operator, joining values aa, bb
aORba \mathbin{\text{OR}} b Bit-wise OR operator, joining values aa, bb
a𝒐𝒓ba \mathbin{\textbf{or}} b Logical OR operator, joining expressions aa, bb
a𝒂𝒏𝒅ba \mathbin{\textbf{and}} b Logical AND operator, joining expressions aa, bb
a++ba \text{++} b String concatenation of aa and bb: abab
ViV_i Element ii of vector VV The entire vector VV may be passed into a function
Wi,jW_{i,j} Element i,ji,j of two-dimensional vector WW. The entire vector WW or a one dimensional slice WiW_i (of size jj) may be passed into a function.

Note that string concatenation with the ++\text{++} operator assumes the left and right values are converted to string form. For example “level” ++42\text{++} 42 will convert the integer 42 to “42” and produce the string “level42”.

1.3 Implicit functions#

Operator Description
Min(a,b)(a,\ b) Smaller of aa and bb
Max(a,b)(a,\ b) Larger of aa and bb
ReadUint8 Read an 8-bit unsigned integer (1 byte) from unspecified input source
ReadUint32 Read a 32-bit unsigned little-endian integer from unspecified input source
ReadUint8(𝑠𝑟𝑐)(\textit{src}) Read an 8-bit unsigned integer (1 byte) from input 𝑠𝑟𝑐\textit{src}
ReadUint32(𝑠𝑟𝑐)(\textit{src}) Read a 32-bit unsigned little-endian integer from input 𝑠𝑟𝑐\textit{src}
ReadData(𝑙𝑒𝑛)(\textit{len}) Read 𝑙𝑒𝑛\textit{len} bytes (8-bit unsigned) from an unspecified input source
ReadData(𝑙𝑒𝑛,𝑠𝑟𝑐)(\textit{len}, \textit{src}) Read 𝑙𝑒𝑛\textit{len} bytes (8-bit unsigned) from input 𝑠𝑟𝑐\textit{src}
ReadChar(𝑠𝑟𝑐)(\textit{src}) Read a single character from input 𝑠𝑟𝑐\textit{src}
ReadString(𝑠𝑟𝑐)(\textit{src}) Read a nul-terminated string from input 𝑠𝑟𝑐\textit{src}
EOF Returns true if the input source is exhausted.
Char(a)(a) Converts integer aa to a single character of appropriate ASCII value
Length(a)(a) Returns length of string aa excluding any nul-termination bytes
Swap(a,b)(a,\ b) Swaps the contents of aa and bb variables

Many of the input functions here and below are defined to read either from an unspecified input source (such as the input file descriptor, or a global buffer that has not been explicitly stated in the pseudocode), but also have forms that may decode from specified inputs / buffers. They both consume their input sources in the same manner, using an implicit offset of how many bytes so far have been read.

1.4 Other basic functions#

7-bit integer encoding stores values 7-bits at a time with the top bit set if further bytes are required.

// Read a variable sized unsigned integer 7-bits at a time.  Returns the value.
Function ReadUint7(source)  // If source is unspecified then it is the default input stream
  value ← 0
  length ← 0
  repeat:
    c ← ReadUint8()
    value ← (value 7) + (c 127)
    length ← length + 1
  until c < 128
  return value

ITF8 integer encoding stores the additional number of bytes needed in the count of the top bits set in the initial byte (ending with a zero bit), followed by any subsequent whole bytes. See the main CRAM specification for more details.

// Read a variable sized unsigned integer with ITF8 encoding.  Returns the value.
Function ReadITF8(source)  // If source is unspecified then it is the default input stream
  v ← ReadUint8()
  if i >= 0xf0:  // 1111xxxx => +4 bytes
    v ← (v 0x0f) 28
    v ← v + ( ReadUint8() 20)
    v ← v + ( ReadUint8() 12)
    v ← v + ( ReadUint8() 4)
    v ← v + ( ReadUint8() 4)
  else if i >= 0xe0:  // 1110xxxx => +3 bytes
    v ← (v 0x0f) 24
    v ← v + ( ReadUint8() 16)
    v ← v + ( ReadUint8() 8)
    v ← v + ReadUint8()
  else if i >= 0xc0:  // 110xxxxx => +2 bytes
    v ← (v 0x1f) 16
    v ← v + ( ReadUint8() 8)
    v ← v + ReadUint8()
  else if i >= 0x80:  // 10xxxxxx => +1 bytes
    v ← (v 0x3f) 8
    v ← v + ReadUint8()
  return v

2 rANS 4x8 - Asymmetric Numeral Systems#

This is the rANS format first defined in CRAM v3.0.

rANS is the range-coder variant of the Asymmetric Numerical Systems1.

The structure of the external rANS codec consists of several components: meta-data consisting of compression-order, and compressed and uncompressed sizes; normalised frequencies of the alphabet systems to be encoded, either in Order-0 or Order-1 context; and the rANS encoded byte stream itself.

Here “Order” refers to the number of bytes of context used in computing the frequencies. It will be 0 or 1. Ignoring punctuation and space, an Order-0 analysis of English text may observe that ‘e’ is the most common letter (12-13%), and that ‘u’ occurs only around 2.5% of the time. If instead we consider the frequency of a letter in the context of one previous letter (Order-1) then these statistics change considerably; we know that if the previous letter was ‘q’ then ‘e’ becomes a rare letter while ‘u’ is the most likely.

These observed frequencies are inversely related to the amount of storage required to encode a symbol (e.g. an alphabet letter)2.

2.0.1 rANS 4x8 compressed data structure#

A compressed data block consists of the following logical parts:

Value data type Name Description
byte order the order of the codec, either 0 or 1
uint32 compressed size the size in bytes of frequency table and compressed blob
uint32 data size raw or uncompressed data size in bytes
byte[] frequency table byte frequencies of input data written using RLE
byte[] compressed blob compressed data

2.1 Frequency table#

The alphabet used here has a maximum of 256 possible symbols (all byte values), but alphabets where fewer symbols are permitted too.

The symbol frequency table indicates which symbols are present and what their relative frequencies are. The total sum of symbol frequencies are normalised to add up to 40953. Given rounding differences when renormalising to a fixed sum, it is up to the encoder to decide how to distribute any remainder or remove excess frequencies. The normalised frequency tables below are examples and not prescriptive of a specific normalisation strategy.

Formally, this is an ordered alphabet A\mathbb{A} containing symbols ss where sis_{i} with the ii-th symbol in A\mathbb{A}, occurring with the frequency freqi\textit{freq}_{i}.

2.1.1 Order-0 encoding#

The normalised symbol frequencies are then written out as {symbol, frequency} pairs in ascending order of symbol (0 to 255 inclusive). If a symbol has a frequency of 0 then it is omitted.

To avoid storing long consecutive runs of symbols if all are present (eg a-z in a long piece of English text) we use run-length-encoding on the alphabet symbols. If two consecutive symbols have non-zero frequencies then a counter of how many other non-zero frequency consecutive symbols is output directly after the second consecutive symbol, with that many symbols being subsequently omitted.

For example for non-zero frequency symbols ‘a’, ‘b’, ‘c’, ‘d’ and ‘e’ we would write out symbol ‘a’, ‘b’ and the value 3 (to indicate ‘c’, ‘d’ and ‘e’ are also present).

The frequency is output after every symbol (whether explicit or implicit) using ITF8 encoding. This means that frequencies 0-127 are encoded in 1 byte while frequencies 128-4095 are encoded in 2 bytes.

Finally the symbol 0 is written out to indicate the end of the symbol-frequency table.

As an example, take the string abracadabra.

Symbol frequency:

Symbol Frequency
a 5
b 2
c 1
d 1
r 2

Normalised to sum to 4095:

Symbol Frequency
a 1863
b 744
c 372
d 372
r 744

Encoded as:

0x61      0x87 0x47      # `a'           <1863>
0x62 0x02 0x82 0xe8      # `b' <+2: c,d>  <744>
          0x81 0x74      # `c' (implicit) <372>
          0x81 0x74      # `d' (implicit) <372>
0x72      0x82 0xe8      # `r'            <744>
0x00                     # <0>

2.1.2 Order-1 encoding#

To encode Order-1 statistics typically requires a larger table as for an NN sized alphabet we need to potentially store an NNxNN matrix. We store these as a series of Order-0 tables.

We start with the outer context byte, emitting the symbol if it is non-zero frequency. We perform the same run-length-encoding as we use for the Order-0 table and end the contexts with a nul byte. After each context byte we emit the Order-0 table relating to that context.

One last caveat is that we have no context for the first byte in the data stream (in fact for 4 equally spaced starting points, see “interleaving” below). We use the ASCII value (‘\0’) as the starting context for each interleaved rANS state and so need to consider this in our frequency table.

Consider abracadabraabracadabraabracadabraabracadabrad as example input. Note for the additional trailing “d” giving us 45 characters instead of 44. This can be broken into 4 approximate equal portions abracadabra abracadabra abracadabra abracadabrad. We operate one independent rANS stream per portion, providing us the opportunity to exploit CPU data parallelism.

Naively observed Order-1 frequencies:

Context Symbol Frequency
\0 a 4
a a 3
b 8
c 4
d 5
b r 8
c a 4
d a 4
r a 8

Normalised (per Order-0 statistics):

Context Symbol Frequency
\0 a 4095
a a 614
b 1639
c 819
d 1023
b r 4095
c a 4095
d a 4095
r a 4095

Note that the above table has redundant entries. While our complete string had three cases of two consecutive “a” characters (“…cadabraabraca…”), these spanned the junction of our split streams and each rANS state is operating independently, starting with the same last character of nul (0). Hence during decode we will not need to access the table for the frequency of “a” in the context of a previous “a”. A similar issue occurs for the very last byte used for each rANS state, which will not be used as a context. In extreme cases this may even be the only time that symbols occurs anywhere. While these scenarios represent unnecessary data to store, and these frequency entries can be safely omitted, their presence does not invalidate the data format and it may be simpler to use a more naive algorithm when producing the frequency tables.

The above tables are encoded as:

0x00                 # `\0' context
0x61      0x8f 0xff  # a  <4095>
0x00                 # end of Order-0 table

0x61                 # `a' context
0x61      0x82 0x66  # a             <614>
0x62 0x02 0x86 0x67  # b <+2: c,d>  <1639>
          0x83 0x33  # c (implicit)  <819>
          0x83 0xff  # d (implicit) <1023>
0x00                 # end of Order-0 table

0x62 0x02            # `b' context, <+2: c, d>
0x72      0x8f 0xff  # r <4095>
0x00                 # end of Order-0 table

                     # `c' context (implicit)
0x61      0x8f 0xff  # a <4095>
0x00                 # end of Order-0 table

                     # `d' context (implicit)
0x61      0x8f 0xff  # a <4095>
0x00                 # end of Order-0 table

0x72                 # `r' context
0x61      0x8f 0xff  # a <4095>
0x00                 # end of Order-0 table

0x00                 # end of contexts

2.2 rANS entropy encoding#

The encoder takes a symbol ss and a current state xx (initially LL below) to produce a new state xx' with function CC.

x=C(s,x)x' = C(s,x)

The decoding function DD is the inverse of CC such that C(D(x))=xC(D(x)) = x.

D(x)=(s,x)D(x') = (s,x)

The entire encoded message can be viewed as a series of nested CC operations, with decoding yielding the symbols in reverse order, much like popping items off a stack. This is where the asymmetric part of ANS comes from.

As we encode into xx the value will grow, so for efficiency we ensure that it always fits within known bounds. This is governed by

Lx<bL1L \leq x < bL-1

where bb is the base and LL is the lower-bound.

We ensure this property is true before every use of CC and after every use of DD. Finally to end the stream we flush any remaining data out by storing the end state of xx.

Implementation specifics

We use an unsigned 32-bit integer to hold xx. In encoding it is initialised to LL. For decoding it is read little-endian from the input stream.

Recall freqi\textit{freq}_{i} is the frequency of the ii-th symbol sis_{i} in alphabet A\mathbb{A}. We define cfreqi\textit{cfreq}_i to be cumulative frequency of all symbols up to but not including sis_{i}:

cfreqi={0if i<1cfreqi1+freqi1if i1\textit{cfreq}_{i} = \left\{ \begin{array}{l l} 0 & \quad \textrm{if } i < 1 \\ \textit{cfreq}_{i-1} + \textit{freq}_{i-1} & \quad \textrm{if } i \geq 1 \end{array} \right.

We have a reverse lookup table cfreq_to_symc\textit{cfreq\_to\_sym}_c from 0 to 4095 (0xfff) that maps a cumulative frequency cc to a symbol ss.

cfreq_to_symc=siwherec:cfreqic<cfreqi+freqi\textit{cfreq\_to\_sym}_c = s_{i} \quad \textit{where} \quad c: \enskip \textit{cfreq}_i \leq c < \textit{cfreq}_i + \textit{freq}_i

The x=C(s,x)x' = C(s,x) function used for the ii-th symbol ss is:

x=(x/freqi)×0x1000+cfreqi+(xmodfreqi)x' = (x/\textit{freq}_i) \times \texttt{0x1000} + \textit{cfreq}_i + (x \bmod \textit{freq}_i)

The D(x)=(s,x)D(x') = (s,x) function used to produce the ii-th symbol ss and a new state xx is:

c=xAND0xfffsi=cfreq_to_symcx=freqi(x/0x1000)+ccfreqi\begin{aligned} c = x' \mathbin{\text{AND}} \texttt{0xfff} \\ s_{i} = \textit{cfreq\_to\_sym}_{c} \\ x = \textit{freq}_{i} (x' / \texttt{0x1000}) + c - \textit{cfreq}_{i} \end{aligned}

Most of these operations can be implemented as bit-shifts and bit-AND, with the encoder modulus being implemented as a multiplication by the reciprocal, computed once only per alphabet symbol.

We use L=0x800000L = \texttt{0x800000} and b=256b = 256, permitting us to flush out one byte at a time (encoded and decoded in reverse order).

Before every encode C(s,x)C(s,x) we renormalise xx, shifting out the bottom 8 bits of xx until x<0x80000×freqix < \texttt{0x80000} \times \textit{freq}_i. After finishing all encoding we flush 4 more bytes (lowest 8-bits first) from xx.

After every decoded D(x)D(x') we renormalise xx', shifting in the bottom 8 bits until x0x800000x \geq \texttt{0x800000}.

2.2.1 Interleaving#

For efficiency, we interleave 4 separate rANS codecs at the same time4. For the Order-0 codecs these simply encode or decode the 4 neighbouring bytes in cyclic fashion using interleaved codec 0, 1, 2, and 3, sharing the same output buffer (so the output bytes get interleaved).

For the Order-1 codec we cannot do this as we need to know the previous byte value as the context for the next byte. We therefore split the input data into 4 approximately equal sized fragments5 starting at 0, len/4\lfloor{}\textit{len}/4\rfloor{}, len/4×2\lfloor{}\textit{len}/4\rfloor{}\times2 and len/4×3\lfloor{}\textit{len}/4\rfloor{}\times 3. Each Order-1 codec operates in a cyclic fashion as with Order-0, all starting with 0 as their state and sharing the same compressed output buffer. Any remainder, when the input buffer is not divisible by 4, is processed at the end by the 4th rANS state.

We do not permit Order-1 encoding of data streams smaller than 4 bytes.

2.3 rANS decode pseudocode#

A naïve implementation of a rANS decoder follows. This pseudocode is for clarity only and is not expected to be performant and we would normally rewrite this to use lookup tables for maximum efficiency. The function ReadUint8 fetches the next single unsigned byte from an unspecified input source. Similarly for ReadITF8 (variable size integer) and ReadUint32 (32-bit unsigned integer in little endian format).

Procedure RansDecode(input, output)
  order ← ReadUint8()  // Implicit read from input
  n_in ← ReadUint32()
  n_out ← ReadUint32()
  if order = 0:
    RansDecode0(output, n_out)
  else:
    RansDecode1(output, n_out)

2.3.1 rANS order-0#

The Order-0 code is the simplest variant. Here we also define some of the functions for manipulating the rANS state, which are shared between Order-0 and Order-1 decoders.

// Reads a table of Order-0 symbol frequencies F_i
// and sets the cumulative frequency table C_i+1 = C_i+F_i
Procedure ReadFrequencies0(F, C)
  s ← ReadUint8()  // Next alphabet symbol
  last_sym ← s
  rle ← 0
  repeat:
    F_s ← ReadITF8()
    if rle > 0:
      rle ← rle-1
      s ← s+1
    else:
      s ← ReadUint8()
      if s = last_sym+1:
        rle ← ReadUint8()
    last_sym ← s
  until s = 0
  //   (Compute cumulative frequencies C_i from F_i
  C_0 ← 0
  for s ← 0 to 255:
    C_s+1 ← C_s + F_s

// Bottom 12 bits of our rANS state R are our frequency
Function RansGetCumulativeFreq(R)
  return R 0xfff

// Convert frequency to a symbol. Find s such that C_s ≤ f < C_s+1
// We would normally implement this via a lookup table
Function RansGetSymbolFromFreq(C, f)
  s ← 0
  while f >= C_s+1:
    s ← s+1
  return s

// Compute the next rANS state R given frequency f and cumulative freq c
Function RansAdvanceStep(R, c, f)
  return f * (R 12) + (R 0xfff) - c

// If too small, feed in more bytes to the rANS state R
Function RansRenorm(R)
  while R < (1 23):
    R ← (R 8) + ReadUint8()
  return R

Procedure RansDecode0(output, nbytes)
  ReadFrequencies0(F, C)
  for j ← 0 to 3:  // Initialise the 4 interleaved streams
    R_j ← ReadUint32()  // Unsigned 32-bit little endian
  for i ← 0 to nbytes-1:
    j ← i mod 4
    f ← RansGetCumulativeFreq(R_j)
    s ← RansGetSymbolFromFreq(C, f)
    output_i ← s
    R_j ← RansAdvanceStep(R_j, C_s, F_s)
    R_j ← RansRenorm(R_j)

2.3.2 rANS order-1#

As described above, the decode logic is very similar to rANS Order-0 except we have a two dimensional array of frequencies to read and the decode uses the last character as the context for decoding the next one. In the pseudocode we illustrate this by using two dimensional vectors Ci,jC_{i,j} and Fi,jF_{i,j}. For simplicity, we reuse the Order-0 code by referring to CiC_i and FiF_i of the 2D vectors to get a single dimensional vector that operates in the same manner as the Order-0 code. This is not necessarily the most efficient implementation.

Note the code for dealing with the remaining bytes when an output buffer is not an exact multiple of 4 is less elegant in the Order-1 code. This is correct, but it is unfortunately a design oversight.

// Reads a table of Order-1 symbol frequencies F_i,j
// and sets the cumulative frequency table C_i,j+1 = C_i,j+F_i,j
Procedure ReadFrequencies1(F, C)
  sym ← ReadUint8()  // Next alphabet symbol
  last_sym ← sym
  rle ← 0
  repeat:
    ReadFrequencies0(F_i, C_i)
    if rle > 0:
      rle ← rle-1
      sym ← sym+1
    else:
      sym ← ReadUint8()
      if sym = last_sym+1:
        rle ← ReadUint8()
    last_sym ← sym
  until sym = 0

Procedure RansDecode1(output, nbytes)
  ReadFrequencies1(F, C)
  for j ← 0 to 3:  // Initialise 4 interleaved streams
    R_j ← ReadUint32()  // Unsigned 32-bit little endian
    L_j ← 0  // Last symbol
  i ← 0
  while i < ⌊ nbytes/4 ⌋:
    for j ← 0 to 3:
      f ← RansGetCumulativeFreq(R_j)
      s ← RansGetSymbolFromFreq(C_L_j, f)
      output_i + j * ⌊ nbytes/4 ⌋ ← s
      R_j ← RansAdvanceStep(R_j, C_L_j,s, F_L_j,s)
      R_j ← RansRenorm(R_j)
      L_j ← s
    i ← i+1
  //     (Now deal with the remainder if buffer size is not a multiple of 4,
  //     (using rANS state 3 exclusively.
  for i ← i * 4 to len-1:
    f ← RansGetCumulativeFreq(R_3)
    s ← RansGetSymbolFromFreq(C_L_3, f)
    output_i ← s
    R_3 ← RansAdvanceStep(R_3, C_L_3,s, F_L_3,s)
    R_3 ← RansRenorm(R_3)
    L_3 ← s

3 rANS Nx16#

CRAM version 3.1 defines an additional rANS entropy encoder, using 16-bit renormalisation instead of the 8-bit used in CRAM 3.0 and with inbuilt bit-packing and run-length encoding. The lower-bound and initial encoder state LL is also changed to 0x8000. The Order-1 rANS Nx16 encoder has also been modified to permit a maximum frequency of 1024 instead of 4096. This offers better cache performance for the large Order-1 tables and usually has minimal impact on compression ratio.

Additionally it permits adjustment of the number of interleaved rANS states from the fixed 4 used in rANS 4x8 to either 4 or 32 states. The benefit of the 32-way interleaving is in enabling efficient use of SIMD instructions for faster encoding and decoding speeds. However it has a small cost in size and initialisation times so it is not recommended on smaller blocks of data.

Frequencies are now stored using uint7 format instead of ITF8. The tables are also stored differently, separating the list of symbols present in the alphabet (those with frequency greater than zero) from the frequencies themselves.

Finally transformations may be applied to the data prior to compression (or after decompression). These consist of stripe, for structured data where every Nth byte is sent to one of N separate compression streams, Run Length Encoding replacing repeated strings of symbols with a symbol and count, and bit-packing where reduced alphabets can combine multiple symbols into a byte prior to entropy encoding.

The initial “Order” byte is expanded with additional bits to list the transformations to be applied. The specifics of each sub-format are listed below, in the order they are applied.

3.1 Frequency tables#

Frequency tables in rANS Nx16 separate the list of symbols from their frequencies. The symbol list must be stored in ascending ASCII order, with their frequency values in the same ordering as their corresponding symbols. For the Order-1 frequency table this list of symbols is those used in any context, thus we only have one alphabet recorded for all contexts. This means in some contexts some (potentially many) symbols will have zero frequency. To reduce the Order-1 table size an additional zero run-length encoding step is used. Finally the Order-1 frequencies may optionally be compressed using the Order-0 rANS Nx16 codec.

Frequencies must always add up to a power of 2, but do not necessarily have to match the final power of two used in the Order-0 (4096) and Order-1 (1024, 4096) entropy decoder algorithm. A normalisation step is applied after reading the frequencies to scale them appropriately. This is required as the Order-1 frequencies may be scaled differently for each context.

// Reads a set of symbols A used in our alphabet
Function ReadAlphabet()
  s ← ReadUint8()
  last_sym ← s
  rle ← 0
  repeat:
    A ← (A,s)  // Append s to the symbol set A
    if rle > 0:
      rle ← rle-1
      s ← s+1
    else:
      s ← ReadUint8()
      if s = last_sym+1:
        rle ← ReadUint8()
    last_sym ← s
  until s = 0
  return A
// Reads a table of Order-0 symbol frequencies F_i
// and sets the cumulative frequency table C_i+1 = C_i+F_i
Procedure ReadFrequenciesNx16_0(F, C)
  F ← (0, ...)
  A ← ReadAlphabet()
  for each i in A:
    F_i ← ReadUint7()
  
  NormaliseFrequenciesNx16_0(F, 12)
  
  C_0 ← 0
  for s ← 0 to 255:  // All 256 possible byte values
    C_s+1 ← C_s + F_s
// Normalises a table of frequencies F_i to sum to a specified power of 2.
Procedure NormaliseFrequenciesNx16_0(F, bits)
  tot ← 0
  for i ← 0 to 255:
    tot ← tot + F_i
  if tot = 0 tot = (1 bits):
    return 
  
  shift ← 0
  while tot < (1 bits):
    tot ← tot*2
    shift ← shift+1
  
  for i ← 0 to 255:
    F_i ← F_i shift

The Order-1 frequencies also store the complete alphabet of observed symbols (ignoring context) followed by a table of frequencies for each symbol in the alphabet. Given many frequencies will be zero where a symbol is present in one context but not in others, all zero frequencies are followed by a run length to omit adjacent zeros.

The order-1 frequency table itself may still be quite large, so is optionally compressed using the order-0 rANS Nx16 codec with a fixed 4-way interleaving. This is specified in the bottom bit of the first byte. If this is 1, it is followed by 7-bit encoded uncompressed and compressed lengths and then the compressed frequency data. The pseudocode here differs slightly to elsewhere as it indicates the input sources, which are either the uncompressed frequency buffer or the default (unspecified) source. The top 4 bits of the first byte indicate the number of bits used for the frequency tables. Permitted values are 10 and 12.

// Reads a table of Order-1 symbol frequencies F_i,j
// and sets the cumulative frequency table C_i,j+1 = C_i,j+F_i,j
Procedure ReadFrequenciesNx16_1(F, C, bits)
  comp ← ReadUint8()
  bits ← comp 4
  if (comp 1) ≠ 0:
    u_size ← ReadUint7()  // Uncompressed size
    c_size ← ReadUint7()  // Compressed size
    c_data ← ReadData(c_size)
    source ← RansDecodeNx16_0(c_data, 4)  // Create u_size bytes of source from c_data
  else:
    (define source to be the default input stream)

  F ← ((0, ...), ...)
  A ← ReadAlphabet(from source)
  for each i in A:
    run ← 0
    for each j in A:
      if run > 0:
        run ← run-1  // F_i,j is implicitly zero already
      else:
        F_i,j ← ReadUint7(from source)
        if F_i,j = 0:
          run ← ReadUint8(from source)
    NormaliseFrequenciesNx16_0(F_i, bits)
    
    C_i,0 ← 0
    for j ← 0 to 255:
      C_i,j+1 ← C_i,j + F_i,j

3.2 rANS Nx16 Order-0#

To decode an Order-0 encoded byte stream we first decode the symbol frequencies as described above and then decode the N interleaved rANS states. This is similar to the old (4x8) rANS decoder, but the RansRenorm function is replaced by a single 16-bit renormalisation instead of a loop using 8-bit values and can interleave to different amounts.

Function RansGetCumulativeFreqNx16(R, bits)
  return R ((1 bits) -1)

Function RansAdvanceStepNx16(R, c, f, bits)
  return f * (R bits) + (R ((1 bits) -1) - c

Function RansRenormNx16(R)
  if R < (1 15):
    R ← (R 16) + ReadUint16()
  return R

Function RansDecodeNx16_0(len, N)
  ReadFrequenciesNx16_0(F, C)
  for j ← 0 to N-1:
    R_j ← ReadUint32()
  for i ← 0 to len-1:
    j ← i mod N
    f ← RansGetCumulativeFreqNx16(R_j, 12)
    s ← RansGetSymbolFromFreq(C, f)
    out_i ← s
    R_j ← RansAdvanceStepNx16(R_j, C_s, F_s, 12)
    R_j ← RansRenormNx16(R_j)
  return out

3.3 rANS Nx16 Order-1#

The Order-1 code is comparable to Order-0 but with an extra dimension (the previous value) to the FF and CC matrices and a more complex system for storing the frequencies. The frequencies may also add up to either 1024 or 4096 (10 or 12 bits).

The N rANS states don’t operate on interleaved data either, but in distinct regions so they can utilise their previous symbols without inter-dependency between the decoded output of the rANS states. This makes the handling of data that isn’t a multiple of N is a little more complex too.

Function RansDecodeNx16_1(len, N)
  ReadFrequenciesNx16_1(F, C, bits)
  for j ← 0 to N-1:
    R_j ← ReadUint32()
    L_j ← 0  // Last symbol
  // The primary unrollable loop
  for i ← 0 to ⌊ len/N ⌋ - 1:
    for j ← 0 to N-1:
      f ← RansGetCumulativeFreqNx16(R_j, bits)
      s ← RansGetSymbolFromFreq(C_L_j, f)
      out_i + j * ⌊ len/N ⌋ ← s
      R_j ← RansAdvanceStepNx16(R_j, C_L_j, s, F_L_j, s, bits)
      R_j ← RansRenormNx16(R_j)
      L_j ← s
  // The remainder for data not a multiple of N in size, using R_N-1 throughout.
  for i ← i * N to len-1:
    f ← RansGetCumulativeFreqNx16(R_N-1, bits)
    s ← RansGetSymbolFromFreq(C_L_N-1, f)
    out_i ← s
    R_N-1 ← RansAdvanceStepNx16(R_N-1, C_L_N-1,s, F_L_N-1,s, bits)
    R_N-1 ← RansRenormNx16(R_N-1)
    L_N-1 ← s
  return out

3.4 rANS Nx16 Run Length Encoding#

For symbols that occur many times in succession, we can replace them with a single symbol and a count. In this specification, run lengths are always provided for certain symbols (even if the run length is 1) and never for the other symbols (even if many are consecutive).

The data stream is split into two parts: the meta-data holding run-lengths and the run-removed data itself.

Bytes Type Name Description
? uint7 𝑟𝑙𝑒_𝑚𝑒𝑡𝑎_𝑙𝑒𝑛\textit{rle\_meta\_len} RLE meta-data-size times 2. The bottom bit is a flag to indicate whether 𝑟𝑙𝑒_𝑚𝑒𝑡𝑎\textit{rle\_meta} is uncompressed (1) or compressed (0).
? uint7 𝑟𝑙𝑒_𝑙𝑒𝑛\textit{rle\_len} Size of uncompressed data before DecodeRLE is applied
? uint7 (𝑐𝑜𝑚𝑝_𝑚𝑒𝑡𝑎_𝑙𝑒𝑛)(\textit{comp\_meta\_len}) Only stored if bottom bit of 𝑟𝑙𝑒_𝑚𝑒𝑡𝑎_𝑙𝑒𝑛\textit{rle\_meta\_len} is unset. Size of compressed RLE meta data.
? uint8[] 𝑟𝑙𝑒_𝑚𝑒𝑡𝑎\textit{rle\_meta} RLE meta-data. Decompress with RansDecodeNx16_0 if bottom bit of 𝑟𝑙𝑒_𝑚𝑒𝑡𝑎_𝑙𝑒𝑛\textit{rle\_meta\_len} is unset.

The meta-data format starts with the count of symbols which have runs associated with them (zero being interpreted as all of them) and the list of these symbol values. This is followed by the run lengths encoded as variable sized integers in the uint7 format.

// Reads and optionally uncompresses the blob of run-lengths and the array L
// indicating which symbols have associates run-lengths.
Function DecodeRLEMeta(N)
  L ← (0, ...)
  rle_meta_len ← ReadUint7()
  len ← ReadUint7()  // Length of uncompressed O0/O1 data, pre-expansion
  if rle_meta_len 1:
    rle_meta ← ReadData(⌊rle_meta_len/2⌋)
  else:
    comp_meta_len ← ReadUint7()
    rle_meta ← ReadData(comp_meta_len)
    $rle_meta \gets

  n ← ReadUint8(metadata)
  if n = 0:
    n ← 256
  for i ← 0 to n-1:
    s ← ReadUint8(metadata)
    L_s ← 1

  return (L, rle_meta, len)

The use of the run length meta-data occurs when expanding the uncompressed data, after Order-0 or Order-1 data decompression.

// Expands data (in) using run-length metadata
Function DecodeRLE(in, L, metadata, in_len)
  j ← 0
  for i ← 0 to in_len - 1:
    sym ← ReadUint8(in)
    if L_s:
      run ← ReadUint7(metadata)
      for k ← 0 to run:
        out_j+k ← s
      j ← j + run + 1
    else:
      out_j ← s
      j ← j + 1
  return out

3.5 rANS Nx16 Bit Packing#

If the alphabet of used symbols in the uncompressed data stream is small - no more than 16 - then we can pack multiple symbols together to form bytes prior to compression. This permits 2, 4, 8 and infinite (all symbols are the same) numbers per byte. The distinct symbol values do not need to be adjacent as a mapping table PP converts mapped value xx to original symbol PxP_x.

The packed format is split into uncompressed meta-data (below) and the compressed packed data.

Bytes Type Name Description
1 byte 𝑛𝑠𝑦𝑚\textit{nsym} Number of distinct symbols
𝑛𝑠𝑦𝑚\textit{nsym} byte[] PP Symbol map
? uint7 𝑙𝑒𝑛\textit{len} Length of packed data

The first meta-data byte holds nsym\textit{nsym}, the number of distinct values, followed by nsym\textit{nsym} bytes to construct the PP map. If nsym=1\textit{nsym} = 1 then the byte stream is a stream of constant values and no bit-packing is done (we know every value already). If nsym=2\textit{nsym} = 2 then each symbol is 1 bit (8 per byte), if 2<nsym42 < \textit{nsym} \le 4 symbols are 2 bits each (4 per byte) and if 4<nsym164 < \textit{nsym} \le 16 symbols are 4 bits each (2 per byte). It is not permitted to have nsym>16\textit{nsym} > 16 or nsym=0\textit{nsym} = 0 as bit packing is not possible. Bits are unpacked from low to high.

Decoding this meta-data is implemented by the DecodePackMeta function below.

Function DecodePackMeta()
  nsym ← ReadUint8()
  for i ← 0 to nsym-1:
    P_i ← ReadUint8()
  len ← ReadUint7()
  return (P, nsym, len)

After uncompressing the main data stream, it should be unpacked using DecodePack below. The format of this data stream is packed data as described above.

Function DecodePack(data, P, nsym, len)
  j ← 0  // Index into data; i is index into output
  if nsym ≤ 1:  // Constant value
    for i ← 0 to len-1:
      out_i ← P_0

  else if nsym ≤ 2:  // 1 bit per value
    for i ← 0 to len-1:
      if i mod 8 = 0:
        v ← data_j
        j ← j+1
      out_i ← P_(v 1)
      v = v 1

  else if nsym ≤ 4:  // 2 bits per value
    for i ← 0 to len-1:
      if i mod 4 = 0:
        v ← data_j
        j ← j+1
      out_i ← P_(v 3)
      v = v 2

  else if nsym ≤ 16:  // 4 bits per value
    for i ← 0 to len-1:
      if i mod 2 = 0:
        v ← data_j
        j ← j+1
      out_i ← P_(v 15)
      v = v 4

  else:
    Error()

  return out

3.6 Striped rANS Nx16#

If we have a series of 32-bit values, we can often get better compression by treating it as a series of 4 8-bit values representing the first to last bytes in each 32-bit word, than we can by simply processing it as a stream of 8-bit values. Each 4th4{th} byte is sent to its own stream producing 4 interleaved streams, so the 1st1^{st} stream will hold data from byte 0, 4, 8, etc while the 2nd2^{nd} stream will hold data from byte 1, 5, 9, etc. Each of those four streams is then itself compressed using this compression format.

For example an input block of small unsigned 32-bit little-endian numbers may use RLE for the first three streams as they are mostly zero, and a non-RLE Order-0 entropy encoder for the last stream.

In the general case we describe this as NN-way interleaved streams. We can consider this interleaving process to be equivalent to a table transpose of MM rows by NN columns to NN rows by MM columns, followed by compressing each NN row independently.

The byte stream consists of a 7-bit encoded uncompressed combined length, a byte holding the value of NN, followed by NN compressed lengths also 7-bit encoded. Finally the data sub-streams themselves, each a valid cdata\textit{cdata} stream, follow.

Normally our cdata\textit{cdata} format will include the decoded size, but with Stripe we can omit this from the internal compressed sub-streams (using the NoSize flag) as given the total length we know how to compute the sub-lengths.

Reproducing the original uncompressed data involves decoding the NN sub-streams and interleaving them together again (reversing the table transpose). The uncompressed data length may not necessary be an exact multiple of NN, in which case the latter uncompressed sub-streams may be 1 byte shorter.

As an example starting with input data DD we define the transposed data TT as:

D=aAAbBBcCCdDDeT=[ abcde, ABCD, ABCD ]\begin{aligned} D = aAAbBBcCCdDDe \\ \\ T = [\ \textit{abcde},\ \textit{ABCD},\ ABCD\ ] \end{aligned}

Note our example data is not a multiple of NN long, missing EE, which gives TT fragments of length [5, 4, 4].

If DiD_i is the ithi^{th} character in DD and Tj,iT_{j,i} is the ithi^{th} character of the jthj^{th} substring in TT, transformations between DD and TT are defined as:

Tj,i=DiN+jDi=T(imodN), (idivN)\begin{aligned} T_{j,i} = D_{i N +j} \\ \\ D_i = T_{(i \bmod N),\ (i \mathbin{\text{div}} N)} \end{aligned}
Function RansDecodeStripe(len)
  N ← ReadUint8()
  for j ← 0 to N:  // Fetch N compressed lengths
    clen_j ← ReadUint7()

  for j ← 0 to N:  // Decode N streams
    ulen_j ← (len div N) + ((len mod N) > j)  // (x > y) expression being 1 if true, 0 if false
    T_j ← RansDecodeNx16(ulen_j)

  for j ← 0 to N - 1:  // Stripe
    for i ← 0 to ulen_j - 1:
      out_i * N + j ← T_j,i
  return out

3.7 Combined rANS Nx16 Format#

We combine the Order-0 and Order-1 rANS Nx16 encoder with optional run-length encoding, bit-packing and four-way interleaving into a single data stream.

Bits Type Name Description
8 uint8 𝑓𝑙𝑎𝑔\textit{flag} Data format bit field
Unless NoSize flag is set:
? uint7 ulen Uncompressed length
If Stripe flag is set:
8 uint8 N Number of sub-streams
? uint7[] clen[] N copies of compressed sub-block length
? uint8[] cdata[] N copies of Compressed data sub-block (recurse)
If Cat flag is set (and Stripe flag is unset):
? uint8[] udata Uncompressed data stream
If Pack flag is set (and neither Stripe or Cat flags are set):
? uint8[] pack_meta Pack lookup table
If RLE flag is set (and neither Stripe or Cat flags are set):
? uint8[] rle_meta RLE meta-data
If neither Stripe or Cat flags are set:
? uint8[] cdata Entropy encoded data stream (see Order flag)

The first byte of our generalised data stream is a bit-flag detailing the type of transformations and entropy encoders to be combined, followed by optional meta-data, followed by the actual compressed data stream. The bit-flags are defined below, but note not all combinations are permitted.

Bit AND value Code Description
1 Order Order-0 or Order-1 entropy coding
2 reserved Reserved (for possible order-2/3)
4 N32 Interleave N=32 rANS states (else N=4)
8 Stripe** multi-way interleaving of byte streams
16 NoSize original size is not recorded (for use by Stripe)
32 Cat Data is uncompressed
64 RLE Run length encoding, with runs and literals encoded separately
128 Pack Pack 2, 4, 8 or infinite symbols per byte
(**) Not to be used in conjunction with other bit-field values except NoSize.

Bit-packing and run length encoding transforms have their own meta-data, which is decoded prior to the main compressed data stream. After decoding, these transforms are applied in order of RLE followed by unpack as required and in that order.

For example a flag\textit{flag} data-format value of 197 indicates a byte stream should decode the pack meta-data, the RLE meta-data and the Order-1 compressed data itself, all using 32 rANS states, and then apply the RLE followed by Unpack transforms to yield the original uncompressed data.

RansDecodeNx16 describes the decoding of the generalised RANS Nx16 decoder.

Function RansDecodeNx16(len)
  flags ← ReadUint8()
  if flags NoSize ≠ 0:
    len ← ReadUint7()
  if flags Stripe:
    data ← RansDecodeStripe(len)
    return data
  if flags N32:
    N ← 32
  else:
    N ← 4
  // Read meta-data
  if flags Pack:
    pack_len ← len
    (P, nsym, len) ← DecodePackMeta()
  if flags RLE:
    rle_len ← len
    (L, rle_meta, len) ← DecodeRLEMeta(N)
  // Uncompress main data block
  if flags Cat:
    data ← ReadData(len)
  else if flags Order:
    data ← RansDecodeNx16_1(len, N)
  else:
    data ← RansDecodeNx16_0(len, N)
  // Apply data transformations
  if flags RLE:
    data ← DecodeRLE(data, L, rle_meta, rle_len)
  if flags Pack:
    data ← DecodePack(data, P, nsym, pack_len)
  return data

4 Range coding#

The range coder is a byte-wise arithmetic coder that operates by repeatedly reducing a probability range (for example 0.0 to 1.0) one symbol (byte) at a time, with the complete compressed data being represented by any value within the final range.

This is easiest demonstrated with a worked example, so let us imagine we have an alphabet of 4 symbols, ‘t’, ‘c’, ‘g’, and ‘a’ with probabilities 0.2, 0.3, 0.3 and 0.2 respectively. We can construct a cumulative distribution table and apply probability ranges to each of the symbols:

Symbol Probability Range low Range high
t 0.2 0.0 0.2
c 0.3 0.2 0.5
g 0.3 0.5 0.8
a 0.2 0.8 1.0

As a conceptual example (note: this is not how it is implemented in practice, see below) using arbitrary precision floating point mathematics this could operate as follows.

If we wish to encode a message, such as “cat” then we will encode one symbol at a time (‘c’, ‘a’, ‘t’) successively reducing the initial range of 0.0 to 1.0 by the cumulative distribution for that symbol. At each point the new range is adjusted to be the proportion of the previous range covered by the cumulative symbol range. See the table footnotes below for the worked mathematics.

Range low / high Symbol Sym. low / high New range low New range high
0.000 1.000 c 0.2 0.5 0+(10)×.20+(1-0)\times.2 0+(10)×.50+(1-0)\times.5
0.200 0.500 a 0.8 1.0 .2+(.5.2)×.8.2+(.5-.2)\times.8 .2+(.5.2)×1.2+(.5-.2)\times 1
0.440 0.500 t 0.0 0.2 .44+(.5.44)×0.44+(.5-.44)\times 0 .44+(.5.44)×.2.44+(.5-.44)\times .2
0.440 0.452 <end>

Our final range is 0.44 to 0.452 with any value in that range representing “cat”, thus 0.45 would suffice. A pictorial example of this process is below.

A pictorial demonstration of range reduction.

Decoding is simply the reverse of this. In the above picture we can see that 0.45 would read off ‘c’, ‘a’ and ‘t’ by repeatedly comparing the symbol ranges to the current range and using those to identify the symbol and produce a new range.

Range low Range high Fraction into range Symbol
0.000 1.000 0.450 c
0.200 0.500 0.833a a
0.440b 0.500 0.167 t

a. 0.45 into range 0.2 to 0.5: (0.450.2)/(0.50.2)=0.833(0.45-0.2)/(0.5-0.2) = 0.833. This falls within the 0.8 to 1.0 symbol range for 'a'.

b. 'a' symbol range 0.8 to 1.0 applied to range 0.2 to 0.5: 0.2+0.8×(0.50.2)=0.440.2+0.8\times(0.5-0.2) = 0.44 and 0.2+1.0×(0.50.2)=0.50.2+1.0\times(0.5-0.2) = 0.5.

Note: The above example is not how the actual implementation works6. For efficiency, we use integer values having a starting range of 0 to 23212^{32}-1. We write out the top 8-bits of the range when low and high become the same value. Special care needs to be taken to handle small values that are numerically close but stradding a top byte value, such as 0x37ffba20 to 0x38000034. The decoder does not need to do anything special here, but the encoder must track the number of 0xff or 0x00 values to emit in order to avoid needing arbitrary precision integers.

Pseudocode for the range codec decoding follows. This implementation uses code (next few bytes in the current bit-stream) and range instead of low and high, both 32-bit unsigned integers. This specification focuses on decoding, but given the additional complexity of the precision overflows in encoder we describe this implementation too.

RangeDecodeCreate initialises the range coder, reading the first bytes of the compressed data stream.

Function RangeDecodeCreate()
  range ← 2^32-1  // Maximum 32-bit unsigned value
  code ← 0  // 32-bit unsigned
  for i ← 0 to 4:
    code ← (code 8) +ReadUint8()
  code ← code 2^32-1
  return this range coder (range, code)

Decoding each symbol is in two parts; getting the current frequency and updating the range.

Function RangeGetFreq(tot_freq)
  range ← range div tot_freq
  return code div range
Procedure RangeDecode(sym_low, sym_freq, tot_freq)
  code ← code - sym_low * range
  range ← range * sym_freq
  while range < 2^24:  // Renormalise
    range ← range 8
    code ← (code8) +ReadUint8()

As mentioned above, the encoder is more complex as it cannot shift out the top byte until it has determined the value. This can take a considerable while if our current low / high (low + range) are very close but span a byte boundary, such as 0x37ffba20 to 0x38000034, where ultimately we will later emit either 0x37 or 0x38. To handle this case, when the range gets too small but the top bytes still differ, the encoder caches the top byte of low (0x37) and keeps track of how many 0xff or 0x00 values will need to be written out once we finally observe which value the range has shrunk to.

The RangeEncode function is a straight forward reversal of the RangeDecode, with the exception of the special code for shifting the top byte out of the low\textit{low} variable.

Procedure RangeEncode(sym_low, sym_freq, tot_freq)
  old_low ← low
  range ← range div tot_freq
  low ← low + sym_low * range
  range ← range * sym_freq

  if low < old_low:
    carry ← 1  // overflow
  while range < 2^24:  // Renormalise
    range ← range 8
    RangeShiftLow()

RangeShiftLow is the main heart of the encoder renormalisation. It tracks the total number of extra bytes to emit and carry\textit{carry} indicates whether they are a string of 0xFF or 0x00 values.

Procedure RangeShiftLow()
  if low < 0xff000000 carry ≠ 0:
    if carry = 0:
      WriteByte(cache)  // top byte cache plus FFs
      while FFnum > 0:
        WriteByte(0xff)
        FFnum ← FFnum - 1
    else:
      WriteByte(cache + 1)  // top byte cache+1 plus 00s
      while FFnum > 0:
        WriteByte(0)
        FFnum ← FFnum - 1
    cache ← low 24  // Copy of top byte ready for next flush
    carry ← 0
  else:
    FFnum ← FFnum + 1

  low ← low 8

For completeness, the Encoder initialisation and finish functions are below.

Procedure RangeEncodeStart()
  low ← 0
  range ← 2^32-1
  FFnum ← 0
  carry ← 0
  cache ← 0
Procedure RangeEncodeEnd()
  for i ← 0 to 4:  // Flush any residual state in low
    RangeShiftLow()

4.1 Adaptive Modelling#

The probabilities passed to the range coder may be fixed for all scenarios (as we had in the “cat” example), or they may be adaptive and context aware. For example the letter ‘u’ occurs around 3% of time in English text, but if the previous letter was ‘q’ it is close to 100% and if the previous letter was ‘u’ it is close to 0%. Using the previous letter is known as an Order-1 entropy encoder, but the context can be anything. We can also adaptively adjust our probabilities as we encode or decode, learning the likelihoods and thus avoiding needing to store frequency tables in the data stream covering all possible contexts.

To do this we use a statistical model, containing an array of symbols SS and their frequencies FF. The sum of these frequences must be less than 216162^{16}-16 so after adjusting the frequencies it never go above the maximum unsigned 16-bit integer. When they get too high, they are renormalised by approximately halving the frequencies (ensuring none drop to zero).

Typically an array of models are used where the array index represents the current context.

To encode any symbol the entropy encoder needs to know the frequency of the symbol to encode, the cumulative frequencies of all symbols prior to this symbol, and the total of all frequencies. For decoding a cumulative frequency is obtained given the frequency total and the appropriate symbol is found matching this frequency. Symbol frequencies are updated after each encode or decode call and the symbols are kept in order of most-frequent symbol first in order to reduce the overhead of scanning through the cumulative frequencies.

ModelCreate initialises a model by setting every symbol to have a frequency of 1. (At no point do we permit any symbol to have zero frequency.)

Function ModelCreate(num_sym)
  total_freq ← num_sym
  max_sym ← num_sym-1
  for i ← 0 to max_sym:
    S_i ← i
    F_i ← 1
  return this model (total_freq, max_sym, S, F)

ModelDecode is called once for each decoded symbol. It returns the next symbol and updates the model frequencies automatically.

Function ModelDecode(rc)
  freq ← rc.RangeGetFrequency(total_freq)
  x ← 0
  acc ← 0
  while acc + F_x ≤ freq:
    acc ← acc + F_x
    x ← x+1
  rc.RangeDecode(acc, F_x, total_freq)
  F_x ← F_x + 16  // Update model frequencies
  total_freq ← total_freq + 16
  if total_freq > 2^16-17:
    ModelRenormalise()
  sym ← S_x
  if x > 0 F_x > F_x-1:
    Swap(F_x, F_x-1)
    Swap(S_x, S_x-1)
  return sym

ModelRenormalise is called whenever the total frequencies get too high. The frequencies are halved, taking sure to avoid any zero frequencies being created.

Procedure ModelRenormalise()
  total_freq ← 0
  for i ← 0 to max_sym:
    F_i ← F_i - (F_i div 2)
    total_freq ← total_freq + F_i

4.2 Order-0 and Order-1 Encoding#

We can combine the model defined above and the range coder to provide a simple function to perform Order-0 entropy decoder.

Function DecodeOrder0(len)
  max_sym ← ReadUint8()
  if max_sym = 0:
    max_sym ← 256
  model_lit ← ModelCreate(max_sym)

  rc ← RangeDecodeCreate()
  for i ← 0 to len-1:
    out_i ← model_lit.ModelDecode(rc)
  return out

The Order-1 variant simply uses an array of models and selects the appropriate model based on the previous value encoded or decoded. This array index is our “context”.

Function DecodeOrder1(len)
  max_sym ← ReadUint8()
  if max_sym = 0:
    max_sym ← 256
  for i ← 0 to max_sym-1:
    model_lit_i ← ModelCreate(max_sym)

  rc ← RangeDecodeCreate()
  last ← 0
  for i ← 0 to len-1:
    out_i ← model_lit_last.ModelDecode(rc)
    last ← out_i
  return out

4.3 RLE with Order-0 and Order-1 Encoding#

The DecodeOrder0 and DecodeOrder1 codecs can be expanded to include a count of how many runs of each symbol should be decoded. Both order 0 and order 1 variants are possible.

After the symbol is decoded, the run length must be decoded to indicate how many extra copies of this symbol occur. Long runs are broken into a series of lengths of no more than 3. If length 3 is decoded it indicates we must decode an additional length and add to the current one. The context used for the run length model is the symbol itself for the initial run, 256 for the first continuation run (if 4\ge 4) and 257 for any further continuation runs. Thus encoding 10 ‘A’ characters would first store symbol ‘A’ followed by run length 3 (with context ‘A’), length 3 (context 256), length 3 (context 257), and length 1 (context 257).

For example, if we have the string “ABBCCCCDDDDD” we will record “A”<0> “B”<1> “C”<3,0> and “D”<3,1>.

Function DecodeRLE0(len)
  max_sym ← ReadUint8()
  if max_sym = 0:
    max_sym ← 256
  model_lit ← ModelCreate(max_sym)
  for i ← 0 to 257:
    model_run_i ← ModelCreate(4)

  rc ← RangeDecodeCreate()
  i ← 0
  while i < len:
    out_i ← model_lit.ModelDecode(rc)
    part ← model_run_out_i.ModelDecode(rc)
    run ← part
    rctx ← 256
    while part = 3:
      part ← model_run_rctx.ModelDecode(rc)
      rctx ← 257
      run ← run + part
    for j ← 1 to run:
      out_i+j ← out_i
    i ← run+1
  return out

The order-1 run length variant is identical to order-0 except the previous symbol is used as the context for the next literal. The context for the run length does not change.

Function DecodeRLE1(len)
  max_sym ← ReadUint8()
  if max_sym = 0:
    max_sym ← 256
  for i ← 0 to max_sym-1:
    model_lit_i ← ModelCreate(max_sym)
  for i ← 0 to 257:
    model_run_i ← ModelCreate(4)

  rc ← RangeDecodeCreate()
  last ← 0
  i ← 0
  while i < len:
    out_i ← model_lit_last.ModelDecode(rc)
    last ← out_i
    part ← model_run_last.ModelDecode(rc)
    run ← part
    rctx ← 256
    while part = 3:
      part ← model_run_rctx.ModelDecode(rc)
      rctx ← 257
      run ← run + part
    for j ← 1 to run:
      out_i+j ← last
    i ← run+1
  return out

We wrap up the Order-0 and 1 entropy encoder, both with and without run length encoding, into a data stream that specifies the type of encoded data and also permits a number of additional transformations to be applied. These transformations support bit packing (for example a data block with only 4 distinct values can be packed with 4 values per byte), no-op for tiny data blocks where entropy encoding would grow the data and N-way interleaving of the 8-bit components of a 32-bit value.

Bits Type Name Description
8 uint8 𝑓𝑙𝑎𝑔\textit{flag} Data format bit field
Unless NoSize flag is set:
? uint7 ulen Uncompressed length
If Stripe flag is set:
8 uint8 N Number of sub-streams
? uint7[] clen[] N copies of compressed sub-block length
? uint8[] cdata[] N copies of Compressed data sub-block (recurse)
If Cat flag is set (and Stripe flag is unset):
? uint8[] udata Uncompressed data stream
If Pack flag is set (and neither Stripe or Cat flags are set):
? uint8[] pack_meta Pack lookup table
If neither Stripe or Cat flags are set:
? uint8[] cdata Entropy encoded data stream (see Order / RLE / Ext flags)

The first byte of our generalised data stream is a bit-flag detailing the type of transformations and entropy encoders to be combined, followed by optional meta-data, followed by the actual compressed data stream. The bit-flags are defined below, but note not all combinations are permitted.

Bit AND value Code Description
1 Order** Order-0 or Order-1 entropy coding
2 reserved Reserved (for possible order-2/3)
4 Ext "External" compression via bzip2
8 Stripe N-way interleaving of byte streams
16 NoSize Original size is not recorded (used by Stripe)
32 Cat Data is uncompressed
64 RLE** Run length encoding, with runs and literals encoded separately
128 Pack Pack 2, 4, 8 or infinite symbols per byte

(**) Has no effect when Ext flag is set.

(†) Not to be used in conjunction with other flags except Pack and NoSize.

Of these Stripe is the most complex. As with the rANS Nx16 encoder, the data is rearranged such that every NthN^{th} byte is adjacent in a single block producing N distinct sub-blocks. Each of the N streams is then itself compressed using this compression format.

For example an input block of small unsigned 32-bit little-endian numbers may use RLE for the first three streams as they are mostly zero, and a non-RLE Order-0 entropy encoder of the last stream. Normally our data format will include the decoded size, but with Stripe we can omit this from the internal compressed streams as we know their size will be a computable fraction of the combined data.

The data layout differs for each of these bit types, as described below in the ArithDecode function. Some of these can be used in combination, so the order needs to be observed. The Pack format has additional meta data. This is decoded first, before entropy decoding and finally expanding the specified pack transformation after decompression. For example value 193 is indicates a byte stream should be decoded with an RLE aware order-1 entropy encoder and then unpacked.

Function ArithDecode(len)
  flags ← ReadUint8()
  if flags NoSize ≠ 0:
    len ← ReadUint7()
  if flags Stripe:
    data ← DecodeStripe(len)
    return data
  if flags Pack:
    pack_len ← len
    (P, nsym, len) ← DecodePackMeta()
  // Entropy Decoding
  if flags Cat:
    data ← ReadData(len)
  else if flags Ext:
    data ← DecodeEXT(len)
  else if flags RLE:
    if flags Order:
      data ← DecodeRLE1(len)
    else:
      data ← DecodeRLE0(len)
  else:
    if flags Order:
      data ← DecodeOrder1(len)
    else:
      data ← DecodeOrder0(len)
  // Apply data transformations
  if flags Pack:
    data ← DecodePack(data, P, nsym, pack_len)
  return data

The specifics of each sub-format are described below, in the order (minus meta-data specific shuffling) they are applied.

5 Name tokenisation codec#

Sequence names (identifiers) typically follow a structured pattern and compression based on columns within those structures usually leads to smaller sizes. The sequence name (identifier) tokenisation relies heavily on the rANS Nx16 and Adaptive arithmetic coders described above.

As an example, take a series of names:

I17_08765:2:123:61541:01763#9
I17_08765:2:123:1636:08611#9
I17_08765:2:124:45613:16161#9

We may wish to tokenise each of these into 7 tokens, e.g. “I17_08765:2:”, “123”, ”:”, “61541”, ”:”, “01763”and “#9”. Some of these are multi-byte strings, some single characters, and some numeric, possibily with a leading zero. We also observe some regularly have values that match the previous line (the initial prefix string, colons, “#9”) while others are numerically very close to the value in the previous line (124 vs 123).

The name tokeniser compares each name against a previous name (which is not necessarily the one immediately prior) and encodes this name either as a series of differences to the previous name or marking it as an exact duplicate. A maximum of 128 tokens are permitted within any single read name.

Token IDs (types) are listed below.

ID Type Value Description
0 TYPE Type Used to determine the type of token at a given position
5 DUP Integer (distance) The entire name is a duplicate of an earlier one. Used in position 0 only
6 DIFF Integer (distance) The entire name differs to earlier ones. Used in position 0 only
1 STRING String A nul-terminated string of characters
2 CHAR Byte A single character
7 DIGITS 00 \le Int <232< 2^{32} A numerical value, not containing a leadng zero
3 DIGITS0 00 \le Int <232< 2^{32} A numerical value possibly starting in leading zeros
4 DZLEN Int length Length of associated DIGITS0 token
8 DELTA 00 \le Int <256< 256 A numeric value being stored as the difference to the numeric value of this token on the previous name
9 DELTA0 00 \le Int <256< 256 As DELTA, but for numeric values starting with leading zeros
10 MATCH (none) This token is identical type and value to the same position in the previous name (NB: not permitted for DELTA/DELTA0)
11 NOP (none) Does nothing
12 END (none) Marks end of name

The tokens and values are stored in a 2D array of byte streams, Bpos,typeB_{\textit{pos},\textit{type}}, where pos 0 is reserved for name meta-data (whether it is a duplicate name) and pos 1 onwards is for the first, second and later tokens. Type\textit{Type} is one of the token types listed above, corresponding to the type of data being stored. Some token types may also have associated values. Bpos,TYPEB_{\textit{pos},\texttt{TYPE}} (type\textit{type}) holds the token type itself and that is then used to retrieve any associated value(s) if appropriate from Bpos,typeB_{\textit{pos},\textit{type}}. Thus multiple types at the same token position will have their values encoded in distinct data streams, e.g. if position 5 is of type either DIGITS or DELTA then data streams will exist for B5,TYPEB_{5,\texttt{TYPE}}, B5,DIGITSB_{5,\texttt{DIGITS}} and B5,DELTAB_{5, \texttt{DELTA}}. Decoding per name continues until a token of type END is observed.

More detail on the token types is given below.

Decoding needs some simple functions to read successive bytes from our token byte streams, as 8-bit characters or unsigned integers, as 32-bit unsigned integers and nul-terminated strings. We reuse the ReadUint32 and related functions with the byte array specified as input.

// Convert an integer to a string form in base-10 digits, at least len bytes long with leading zeros
Function LeftPadNumber(val, len)
  str ← val  // Implicit language-specific Integer to String conversion
  while Length(str) < len:
    str ← `0' ++ str
  return str

For the main name decoding loop, we use a single dimensional array of names decoded so far, NN, and a two dimensional array of their tokens TT (indexed by name number nn and token position tt within that name). We define a function to decode the nthn^{th} name (NnN_n) using a previous mthm^{th} name (NmN_m). The tokens TT are used in MATCH and DELTA token types to copy data from when constructing the name.

Now we have the basic primitives for pulling from the BB byte streams, decoding the nthn^{th} individual name is as follows7:

// Decodes the n^th name, returning N_n and updating globals N_n and T_n)}
Function DecodeSingleName(n)
  type ← ReadUint8(B_0,TYPE)
  dist ← ReadUint32(B_0,type)
  m ← n-dist
  if type = DUP:
    N_n ← N_m
    T_n ← T_m  // Copy for all T_n,*
    return N_n

  t ← 1  // Token number t
  repeat:
    type ← ReadUint8(B_t,TYPE)
    if type = CHAR:
      T_n,t ← ReadChar(B_t,CHAR)
    else if type = STRING:
      T_n,t ← ReadString(B_t,STRING)
    else if type = DIGITS:
      T_n,t ← ReadUint32(B_t,DIGITS)
    else if type = DIGITS0:
      d ← ReadUnt32(B_t,DIGITS0)
      l ← ReadUint8(B_t,DZLEN)
      T_n,t ← LeftPadNumber(d, l)
    else if type = DELTA:
      T_n,t ← T_m,t + ReadUint8(B_t,DELTA)
    else if type = DELTA0:
      d ← T_m,t + ReadUint8(B_t,DELTA0)
      l ← Length(T_m,t)  // String length including leading zeros
      T_n,t ← LeftPadNumber(d, l)
    else if type = MATCH:
      T_n,t ← T_m,t
    else:
      T_n,t ← `'
    N_n ← N_n ++ T_n,t
    t ← t+1
  until type = END
  return N_n

Given a complex name with both position and type specific values, this can lead to many separate data streams. The name tokeniser codec is a format within a format, as the multiple byte streams Bpos,typeB_{\textit{pos},\textit{type}} are serialised into a single byte stream.

The serialised data stream starts with two unsigned little endian 32-bit integers holding the total size of uncompressed name buffer and the number of read names, and a flag byte indicating whether data is compressed with arithmetic coding or rANS Nx16. Note the uncompressed size is calculated as the sum of all name lengths including a termination byte per name (e.g. the nul char). This is irrespective of whether the implementation produces data in this form or whether it returns separate name and name-length arrays.

This is then followed by serialised data and meta-data for each token stream. Token types, ttype\textit{ttype} holds one of the token ID values listed above in the list above, plus special values to indicate certain additional flags. Bit 6 (64) set indicates that this entire token data stream is a duplicate of one earlier. Bit 7 (128) set indicates the token is the first token at a new position. This way we only need to store token types and not token positions.

The total size of the serialised stream needs to be already known, in order to determine when the token types finish.

Bytes Type Name Description
4 uint32 𝑢𝑛𝑐𝑜𝑚𝑝_𝑙𝑒𝑛𝑔𝑡ℎ\textit{uncomp\_length} Length of uncompressed name buffer
4 uint32 𝑛𝑢𝑚_𝑟𝑒𝑎𝑑𝑠\textit{num\_reads} Number of read names
1 uint8 𝑢𝑠𝑒_𝑎𝑟𝑖𝑡ℎ\textit{use\_arith} Whether compression is arithmetic (1) or rANS Nx16 (0)
For each token data stream
1 uint8 𝑡𝑡𝑦𝑝𝑒\textit{ttype} Token type code plus flags (64=duplicate, 128=next token position).
If ttype AND 64 (duplicate)
1 uint8 𝑑𝑢𝑝_𝑝𝑜𝑠\textit{dup\_pos} Duplicate from this token position
1 uint8 𝑑𝑢𝑝_𝑡𝑦𝑝𝑒\textit{dup\_type} Duplicate from this token type ID
else if not duplicate
? uint7 𝑐𝑙𝑒𝑛\textit{clen} compressed length
𝑐𝑙𝑒𝑛\textit{clen} uint8[] 𝑐𝑑𝑎𝑡𝑎\textit{cdata} compressed data stream

A few tricks are used to remove some byte streams. In addition to the explicit marking of duplicate bytes streams, if a byte stream of token types is entirely MATCH apart from the very first value it is discarded. It is possible to regenerate this during decode by observing the other byte streams. For example if we have a byte stream B5,DIGITSB_{5,\textit{DIGITS}} but no B5,TYPEB_{5,\textit{TYPE}} then we assume the contents of B5,TYPEB_{5,\textit{TYPE}} consist of one DIGITS type followed by as many MATCH types as are needed.

The cdata\textit{cdata} stream itself is as described in the relevant entropy encoder section above (rANS or arithmetic coding).

// Decodes and uncompresses the serialised token byte streams
Function DecodeTokenByteStreams(use_arith)
  sz ← 0
  t ← -1
  repeat:
    ttype ← ReadUint8()
    tok_new ← ttype 128
    tok_dup ← ttype 64
    type ← ttype 63
    if tok_new ≠ 0:
      t ← t+1
      if type ≠ TYPE:
        B_t,TYPE ← (type, TOK_MATCH, TOK_MATCH, ...)  // for nnames-1 times

    if tok_dup ≠ 0:
      dup_pos ← ReadUint8()
      dup_type ← ReadUint8()
      B_t,type ← B_dup_pos,dup_type
    else:
      clen ← ReadUint7()
      data ← ReadData(clen)
      if use_arith:
        B_t,type ← ArithDecode(clen, source=data)
      else:
        B_t,type ← RansDecodeNx16(clen, source=data)
  until EOF()
  return B
// Decodes all names, returning N
Function DecodeNames()
  ulen ← ReadUint32()
  nnames ← ReadUint32()
  use_arith ← ReadUint8()
  B ← DecodeTokenByteStreams(use_arith)
  for n ← 0 to nnames-1:
    N_n ← DecodeSingleName(n)
  return N

6 FQZComp quality codec#

The FQZComp quality codec uses an adaptive statistical model to predict the next quality value in a given context (comprised of previous quality values, position along this sequence, whether the sequence is the second in a pair, and a running total of number of times the quality has changed in this sequence).

For each position along the sequence, the models produce probabilities for all possible next quality values, which are passed into an arithmetic entropy encoder to encode or decode the actual next quality value. The models are then updated based on the actual next quality in order to learn the statistical properties of the quality data stream. This step wise update process is identical for both encoding and decoding.

The algorithm is a generalisation on the original fqzcomp program, described in Compression of FASTQ and SAM Format Sequencing Data by Bonfield JK, Mahoney MV (2013). PLoS ONE 8(3): e59190. https://doi.org/10.1371/journal.pone.0059190

6.1 FQZComp Models#

The FQZComp process utilises knowledge of the read lengths, complement (qualities reversed) status, and a generic parameter selector, but in order to maintain a strict separation between CRAM data series this knowledge is stored (duplicated) within the quality data stream itself. Note the complement model is only needed in CRAM 3.1 as CRAM 4 natively stores the quality in the original orientation already. Both reversed and duplication models have no context and are boolean values.

The parameter selector model also has no context associated with it and encodes max_sel\textit{max\_sel} distinct values. The selector value may be quantised further using stab\textit{stab} (Selector Table) to reduce the selector to fewer sets of parameters. This is useful if we wish to use the selector bits directly in the context using the same parameters. The selector is arbitrary and may be used for distinguishing READ1 from READ2, as a precalculated “delta” instead of the running total, distinguishing perfect alignments from imperfect ones, or any other factor that is shown to improve quality predictability and increase compression ratio (average quality, number of mismatches, tile, swathe, proximity to tile edge, etc).

The quality model has a 16-bit context used to address an array of 2162^{16} models, each model permitting max_sym\textit{max\_sym} distinct quality values. The context used is defined by the FQZcomp parameters, of which there may be multiple sets, selected using the selector model. There are 4 read length models each having max_sym\textit{max\_sym} of 256. Each model is used for the 4 successive bytes in a 32-bit length value.

The entropy encoder used is shared between all models, so the bit streams are multiplexed together.

The 16-bit quality value context is constructed by adding sub-contexts together consisting of previous quality values, position along the current record, a running count (per record) of how many times the quality value has differed to the previous one (delta), and an arbitrary stored selector value, each shifted to a defined location within the combined context value (qloc\textit{qloc}, ploc\textit{ploc}, dloc\textit{dloc} and sloc\textit{sloc} respectively). The qual, pos and delta sub-contexts are computed from the previous data while the selector, if used, is read directly from the compressed data stream. The selector may be used to switch parameter sets, or simply to group quality strings into arbitrary user-defined sub-sets. The numeric values for each of these components can be passed through lookup tables (qtab\textit{qtab} for quality, ptab\textit{ptab} for positions, dtab\textit{dtab} for running delta and stab\textit{stab} for turning the selector ss into a parameter index xx). These all convert the monotonically increasing range 0\rightarrowM to a (usually smaller) monotonically increasing 0\rightarrowN. For example if we wish to use the approximate position along a 100 byte string, we may uniformly map 0\rightarrow127 to 0\rightarrow15 to utilise 4 bits of our 16-bit combined context.

An example FQZComp configuration.

As some sequencing instruments produce binned qualities, e.g. 0, 10, 25, 35, these values are squashed to incremental values from 0 to max_sym1\textit{max\_sym}-1 where max_sym\textit{max\_sym} is the maximum number of distinct quality values observed. If this transform is required, the flag have_qmap\textit{have\_qmap} will be set and a mapping table (qmap\textit{qmap}) will hold the original quality values. The encoded qualities will be the smaller mapped range.

The quality sub-context is constructed by shifting left the previous quality sub-context by qshift\textit{qshift} bits and adding the current quality after passing through the qmap\textit{qmap} transform and if defined through the qtab\textit{qtab} lookup table. The quality context is limited to qbits\textit{qbits} long and is added to the combined context starting at bit qloc\textit{qloc}. The quality sub-context is reset to zero at the start of each new record. 8

The position context is simply the number of remaining quality values in this record, so is a value starting at record length (minus 1) and decrementing. As with the quality context it may be passed through a lookup table ptab\textit{ptab} before shifting left by ploc\textit{ploc} bits and adding to the combined context.

Delta is a count of the number of times the quality value has changed from one value to a different one. Thus a run of identical values will not increase delta. It gets reset to zero at the start of every record. It may be adjusted by the dtab\textit{dtab} lookup table and is shifted by dloc\textit{dloc} before adding to the combined context.

The selector value may also be used as a sub-context, if the do_sel\textit{do\_sel} paramter is set. The initial context value (reset per record) is defined within each parameter set, providing a more general purpose alternative to adding the selector value at a defined location (sloc\textit{sloc}) into the context.

Thus the full context can be updated after each decoded quality with the following pseudocode. Note for brevity this is assuming the pos\textit{pos}, delta\textit{delta}, prevq\textit{prevq}, qctx\textit{qctx} and sel\textit{sel} parameters referred are global and updateable.

// Add quality q to produce and return a new context ctx
Function FQZUpdateContext(params, q)
  ctx ← params.context  // Also the initial value
  qctx ← (qctx params.qshift) + qtab_q
  ctx ← ctx + ((qctx (2^params.qbits-1)) params.qloc)
  if params.pflags 32:  // have_ptab
    p ← Min(pos, 1023)
    ctx ← ctx + (ptab_p params.ploc)
  if params.pflags 64:  // have_dtab
    d ← Min(delta, 255)
    ctx ← ctx + (dtab_d params.dloc)
    if prevq ≠ q:
      delta ← delta+1
    prevq ← q
  if params.pflags 8:  // do_sel
    ctx ← ctx + (sel params.sloc)
  return ctx(2^16-1)

In summary context is produced using the following models:

Model Max symbol Context size Description
𝑚𝑜𝑑𝑒𝑙_𝑞𝑢𝑎𝑙\textit{model\_qual} 𝑚𝑎𝑥_𝑠𝑦𝑚\textit{max\_sym} 2162^{16} Primary model for quality values
𝑚𝑜𝑑𝑒𝑙_𝑙𝑒𝑛\textit{model\_len} 256 4 Read length models with the context 0-3 being successive byte numbers (little endian order)
𝑚𝑜𝑑𝑒𝑙_𝑟𝑒𝑣\textit{model\_rev} 2 none Used if 𝑝𝑓𝑙𝑎𝑔𝑠.𝑑𝑜_𝑟𝑒𝑣\textit{pflags}.\textit{do\_rev} is defined. Indicating which strings to reverse.
𝑚𝑜𝑑𝑒𝑙_𝑑𝑢𝑝\textit{model\_dup} 2 none Used if 𝑝𝑓𝑙𝑎𝑔𝑠.𝑑𝑜_𝑑𝑢𝑝\textit{pflags}.\textit{do\_dup} is defined. Indicates if this whole string is a duplicate of the last one
𝑚𝑜𝑑𝑒𝑙_𝑠𝑒𝑙\textit{model\_sel} 𝑚𝑎𝑥_𝑠𝑒𝑙\textit{max\_sel} none Used if 𝑔𝑓𝑙𝑎𝑔𝑠.𝑚𝑢𝑙𝑡𝑖_𝑝𝑎𝑟𝑎𝑚\textit{gflags}.\textit{multi\_param} or 𝑝𝑓𝑙𝑎𝑔𝑠.𝑑𝑜_𝑠𝑒𝑙\textit{pflags}.\textit{do\_sel} are defined.

6.2 FQZComp Data Stream#

The start of an FQZComp data stream consists of the parameters used by the decoder. The data layout is as follows.

Bits Type Name Description
8 uint8 𝑣𝑒𝑟𝑠𝑖𝑜𝑛\textit{version} FQZComp format version: must be 5
8 uint8 𝑔𝑓𝑙𝑎𝑔𝑠\textit{gflags} Global FQZcomp bit-flags. From lowest bit to highest: 1: 𝑚𝑢𝑙𝑡𝑖_𝑝𝑎𝑟𝑎𝑚\textit{multi\_param}: indicates more than one parameter block is present. Otherwise set 𝑛𝑝𝑎𝑟𝑎𝑚=1\textit{nparam} = 1 2: ℎ𝑎𝑣𝑒_𝑠𝑡𝑎𝑏\textit{have\_stab}: indicates the parameter selector is mapped through 𝑠𝑡𝑎𝑏\textit{stab}. Otherwise set 𝑠𝑡𝑎𝑏i=i\textit{stab}_i = i 4: 𝑑𝑜_𝑟𝑒𝑣\textit{do\_rev}: 𝑚𝑜𝑑𝑒𝑙_𝑟𝑒𝑣𝑐𝑜𝑚𝑝\textit{model\_revcomp} will be used (CRAM v3.1)
If 𝑚𝑢𝑙𝑡𝑖_𝑝𝑎𝑟𝑎𝑚\textit{multi\_param} gflag is set:
8 uint8 𝑛𝑝𝑎𝑟𝑎𝑚\textit{nparam} Number of parameter blocks (defaults to 1)
If ℎ𝑎𝑣𝑒_𝑠𝑡𝑎𝑏\textit{have\_stab} gflag is set:
8 uint8 𝑚𝑎𝑥_𝑠𝑒𝑙\textit{max\_sel} Maximum parameter selector value
variable array 𝑠𝑡𝑎𝑏\textit{stab} Parameter selector table
Parameter block: repeated 𝑛𝑝𝑎𝑟𝑎𝑚\textit{nparam} times: (selected via 𝑚𝑜𝑑𝑒𝑙_𝑠𝑒𝑙\textit{model\_sel} and 𝑠𝑡𝑎𝑏\textit{stab})
16 uint16 𝑐𝑜𝑛𝑡𝑒𝑥𝑡\textit{context} Starting context value
8 uint8 𝑝𝑓𝑙𝑎𝑔𝑠\textit{pflags} Per-parameter block bit-flags. From lowest bit to highest: 1: Reserved 2: 𝑑𝑜_𝑑𝑒𝑑𝑢𝑝\textit{do\_dedup}: model_dup will be used 4: 𝑑𝑜_𝑙𝑒𝑛\textit{do\_len}: model_len will be used for every record 8: 𝑑𝑜_𝑠𝑒𝑙\textit{do\_sel}: model_sel will be used 16: ℎ𝑎𝑣𝑒_𝑞𝑚𝑎𝑝\textit{have\_qmap}: indicates quality map is present 32: ℎ𝑎𝑣𝑒_𝑝𝑡𝑎𝑏\textit{have\_ptab}: Load 𝑝𝑡𝑎𝑏\textit{ptab}, otherwise position contexts are unused 64: ℎ𝑎𝑣𝑒_𝑑𝑡𝑎𝑏\textit{have\_dtab}: Load 𝑑𝑡𝑎𝑏\textit{dtab}, otherwise delta contexts are unused 128: ℎ𝑎𝑣𝑒_𝑞𝑡𝑎𝑏\textit{have\_qtab}: Load 𝑞𝑡𝑎𝑏\textit{qtab}, otherwise set 𝑞𝑡𝑎𝑏i=i\textit{qtab}_i = i
8 uint8 𝑚𝑎𝑥_𝑠𝑦𝑚\textit{max\_sym} Total number of distinct quality values
4 uint4 (high) 𝑞𝑏𝑖𝑡𝑠\textit{qbits} Total number of bits for quality context
4 uint4 (low) 𝑞𝑠ℎ𝑖𝑓𝑡\textit{qshift} Left bit shift per successive quality in quality context
4 uint4 (high) 𝑞𝑙𝑜𝑐\textit{qloc} Bit position of quality context
4 uint4 (low) 𝑠𝑙𝑜𝑐\textit{sloc} Bit position of selector context
4 uint4 (high) 𝑝𝑙𝑜𝑐\textit{ploc} Bit position of position context
4 uint4 (low) 𝑑𝑙𝑜𝑐\textit{dloc} Bit position of delta context
If ℎ𝑎𝑣𝑒_𝑞𝑚𝑎𝑝\textit{have\_qmap} pflag is set:
variable uint8[𝑚𝑎𝑥_𝑠𝑦𝑚\textit{max\_sym}] 𝑞𝑚𝑎𝑝\textit{qmap} Map for unbinning quality values.
If ℎ𝑎𝑣𝑒_𝑞𝑡𝑎𝑏\textit{have\_qtab} pflag is set:
variable array 𝑞𝑡𝑎𝑏\textit{qtab} Quality context lookup table
If ℎ𝑎𝑣𝑒_𝑡𝑎𝑏\textit{have\_tab} pflag is set:
variable array 𝑝𝑡𝑎𝑏\textit{ptab} Position context lookup table
If ℎ𝑎𝑣𝑒_𝑡𝑎𝑏\textit{have\_tab} pflag is set:
variable array 𝑑𝑡𝑎𝑏\textit{dtab} Delta context lookup table

FQZDecodeParams below describes the pseudocode for reading the parameter block.

Procedure FQZDecodeParams()
  vers ← ReadUint8()
  if vers ≠ 5:
    ERROR
  gflags ← ReadUint8()
  if gflags 1:  // multi_param
    nparam ← ReadUint8()
    max_sel ← nparam
  else:
    nparam ← 1
    max_sel ← 0
  if gflags 2:  // have_stab
    max_sel ← ReadUint8()
    stab ← ReadArray(256)
  max_sym ← 0
  for p ← 0 to nparam-1:
    param_p ← FQZDecodeSingleParam()
    if max_sym < param_p.max_sym:
      max_sym ← param_p.max_sym  // Maximum across all param sets
Function FQZDecodeSingleParam()
  p.context ← ReadUint16()
  p.flags ← ReadUint8()
  p.max_sym ← ReadUint8()
  p.first_len ← 1
  x ← ReadUint8()
  p.qbits ← x div 16
  p.qshift ← x mod 16
  x ← ReadUint8()
  p.qloc ← x div 16
  p.sloc ← x mod 16
  x ← ReadUint8()
  p.ploc ← x div 16
  p.dloc ← x mod 16
  if p.flags16:  // Have qmap
    for i ← 0 to p.max_sym-1:
      p.qmap_i ← ReadUint8()
  if p.flags128:  // Have qtab
    p.qtab ← ReadArray(256)
  else:
    for i ← 0 to 256:
      p.qtab_i ← i
  if p.flags32:  // Have ptab
    p.ptab ← ReadArray(1024)
  if p.flags64:  // Have dtab
    p.dtab ← ReadArray(256)
  return p

FQZCreateModels creates the decoder models based on the above parameters and the shared range coder.

Procedure FQZCreateModels()
  rc ← RangeDecodeCreate()
  for i ← 0 to 3:
    model_len_i ← ModelCreate(256)
  for i ← 0 to 2^16-1:
    model_qual_i ← ModelCreate(max_sym+1)
  model_dup ← ModelCreate(2)
  model_rev ← ModelCreate(2)
  if max_sel > 0:
    model_sel ← ModelCreate(max_sel+1)

ReadArray reads an array AA of size nn which maps values 0 to n1n-1 to a smaller range (0 to m1m-1), both monotonically increasing. For efficiency this is done using a two-level run length encoding.

Assuming m<nm < n there will be runs of the same value. We measure run lengths for all values (even if they are zero). For example an array A={0,1,3,4,5,6,7,7,7,7}A = \{0,1,3,4,5,6,7,7,7,7\} may be converted to run lengths R={1,1,0,1,1,1,1,4}R = \{1,1,0,1,1,1,1,4\}. To keep values in this array fitting within one byte, long runs are broken down in a successive series of 255 values, so a run of length 600 becomes 255 255 90.

This array RR is no longer monotonically increasing but may still have repeated values, so is run-length encoded by storing the number of additional values whenever the last two lengths match. This converts RR to R2={1,1,+0,0,1,1,+2,4}R2 = \{1, 1, +0, 0, 1, 1, +2, 4\} where the ’+’ symbol is shown purely to indicate the values representing the additional run-length copy numbers. (This also now turns the example run of 600 above into 255 255 0 90.)

The final array R2 is the stored data stream. The decoder process is the reverse of the above, starting by converting R2 to RR and then AA. The following pseudocode demonstrates this process.

Function ReadArray(n)
  i,j,z ← 0
  last ← -1
  while z < n:  // Convert R2 to R
    run ← ReadUint8()
    R_j ← run
    j ← j+1
    z ← z + run
    if run = last:
      copy ← ReadUint8()
      for x ← 1 to copy:
        R_j ← run
        j ← j+1
      z ← z + run * copy
    last ← run

  i,j,z ← 0
  while z < n:  // Convert R to A
    run_len ← 0
    repeat:
      part ← R_j
      j ← j + 1
      run_len ← run_len + part
    until part ≠ 255
    for x ← 1 to run_len:
      A_z ← i
      z ← z+1
    i ← i+1

  return A

The FQZComp main loop decodes data in the following order per read: read length (if not fixed), the flag for whether this is read 2 (if needed), a bit flag to indicate if the quality is duplicated (if needed), followed by record length number of quality values using various data gathered since the start of this read as context.

The output of this function is an array of quality values in the variable output\textit{output}, indexed with the ithi^{th} value via outputi\textit{output}_i. The output buffer is a concatenation of all quality values for each record. The record lengths are recorded, but note this is the number of qualities encoded in CRAM for this sequence record and this does not necessarily have to match the number of base calls (for example where qualities are explicitly specified for SNP bases but not elsewhere).

Function FQZNewRecord()
  sel ← 0
  x ← 0
  if max_sel > 0:  // Find parameter selector
    sel ← model_sel.ModelDecode(rc)
    if have_stab:
      x ← stab_sel
  param ← params_x

  if param.do_len param.first_len:  // Decode read length
    rec_len ← DecodeLength(rc)
    param.last_len ← rec_len
    param.first_len = 0
  else:
    rec_len ← param.last_len
  pos ← rec_len

  if param.do_rev:  // Check if needs reversal
    rev_rec ← model_rev.ModelDecode(rc)
    len_rec ← rec_len
  rec ← rec+1

  is_dup ← 0
  if do_dedup:  // Duplicate last string if appropriate
    if model_dup.ModelDecode(rc) > 0:
      is_dup ← 1

  qctx ← 0
  delta ← 0
  prevq ← 0
  return x  // Tabulated parameter selector
Procedure FQZDecode()
  buf_len ← ReadUint7()
  FQZDecodeParams()
  FQZCreateModels()
  i ← 0  // Position in total quality block
  pos ← 0  // Remaining base count current quality string
  next_record::
  while i < buf_len:
    if pos = 0:  // Reset state at start of each new record
      x ← FQZNewRecord()
      if is_dup = 1:
        for j ← 0 to rec_len-1:
          output_i+j ← output_i+j-rec_len
        i ← i+rec_len
        pos ← 0
        \Goto{next_record}

      param ← params_x
      ctx ← param.context

    q ← model_qual_ctx.ModelDecode(rc)  // Decode a single quality value
    if param.have_qmap:
      output_i ← qmap_q
    else:
      output_i ← q

    ctx ← FQZUpdateContext(param, q)  // Also updates qctx, prevq and delta

    i ← i + 1
    pos ← pos - 1
  if do_rev:
    ReverseQualities(output, buf_len, rev, len)

Read lengths are encoded as 4 8-bit bytes, each having its own model.

Function DecodeLength(rc)
  rec_len ← model_len_0.ModelDecode(rc)
  rec_len ← rec_len + (model_len_1.ModelDecode(rc)8)
  rec_len ← rec_len + (model_len_2.ModelDecode(rc)16)
  rec_len ← rec_len + (model_len_3.ModelDecode(rc)24)
  return rec_len

For CRAM v4.0 quality values are stored in their original FASTQ orientation. For CRAM v3.1 they are stored in their alignment orientation and it may be beneficial for compression purposes to reverse them first. If so do_rev\textit{do\_rev} will be set and the ReverseQualities procedure called below after decoding.

Procedure ReverseQualities(qual, qual_len, rev, len)
  rec ← 0
  i ← 0
  while i < qual_len:
    if rev_rec ≠ 0:
      j ← 0
      k ← len_rec-1
      while j < k:
        Swap(qual_i+j, qual_i+k)
        j ← j+1
        k ← k-1
      i ← i + len_rec
      rec ← rec+1

Footnotes#

  1. J. Duda, Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding, http://arxiv.org/abs/1311.2540

  2. C.E. Shannon, A Mathematical Theory of Communication, Bell System Technical Journal, vol. 27, pp. 379-423, 623-656, July, October, 1948

  3. While the maths work fine up to 4096, for historical reasons this has always been documented as having a limit of 4095. Implementations may wish to validate decoding on <=4096<= 4096, but we recommend they use a limit of 4095 in their encoding output.

  4. F. Giesen, Interleaved entropy coders, http://arxiv.org/abs/1402.3392

  5. This was why the ‘\0’ \to ‘a’ context in the example above had a frequency of 4 instead of 1.

  6. This implementation was designed by Eugene Shelwein, based on Michael Schindler’s earlier work.

  7. For simplicity of algorithm description, we take a flexible approach as to whether we read/write TT in numeric or string form. For example a DELTA token will fetch the previous token as a string, interpret it as a numeric value, add to it, and then write it back as a string. Practical implementations may wish to separate out T into distinct integer and string arrays.

  8. For example if we have 4 quality values in use – 0, 10, 25 and 35 – we will be encoding quality values 0, 1, 2 and 3. We may wish to define qbits\textit{qbits} to be 6 and qshift\textit{qshift} to be 2 such that the previous 3 quality values can be used as context, for the prediction of the next quality value. There will likely be little reason to use qtab\textit{qtab} in this scenario, but an encoder could define qtab\textit{qtab} to convert {0, 1, 2, 3} to {0, 0, 0, 1} and use qshift\textit{qshift} of 1 instead, giving us knowledge of which of the previous 6 values were maximum quality.