Compression Filters

Filters to compress the data in snapshots can be applied to reduce the disk footprint of the datasets. The filters provided by SWIFT are filters natively provided by HDF5, implying that the library will automatically and transparently apply the reverse filter when reading the data stored on disk. They can be applied in combination with, or instead of, the lossless gzip compression filter.

These compression filters are lossy, meaning that they modify the data written to disk

The filters will reduce the accuracy of the data stored. No check is made inside SWIFT to verify that the applied filters make sense. Poor choices can lead to all the values of a given array reduced to 0, Inf, or to have lost too much accuracy to be useful. The onus is entirely on the user to choose wisely how they want to compress their data.

The filters are not applied when using parallel-hdf5.

The name of any filter applied is carried by each individual field in the snapshot using the meta-data attribute Lossy compression filter.

The available filters are listed below.

N-bit filters for long long integers

The N-bit filter takes a long long and saves only the most significant N bits.

This can be used in cases similar to the particle IDs. For instance, if they cover the range \([1, 10^{10}]\) then 64-bits is too many and a lot of disk space is wasted storing the 0s. In this case \(\left\lceil{\log_2(10^{10})}\right\rceil + 1 = 35\) bits are sufficient (The extra “+1” is for the sign bit).

SWIFT implements 5 variants of this filter:

  • Nbit36 stores the 36 most significant bits (Numbers up to \(3.4\times10^{10}\), comp. ratio: 1.78)
  • Nbit40 stores the 40 most significant bits (Numbers up to \(5.4\times10^{11}\), comp. ratio: 1.6)
  • Nbit44 stores the 44 most significant bits (Numbers up to \(8.7\times10^{12}\), comp. ratio: 1.45)
  • Nbit48 stores the 48 most significant bits (Numbers up to \(1.4\times10^{14}\), comp. ratio: 1.33)
  • Nbit56 stores the 56 most significant bits (Numbers up to \(3.6\times10^{16}\), comp. ratio: 1.14)

Note that if the data written to disk is requiring more than the N bits then part of the information written to the snapshot will lost. SWIFT does not apply any verification before applying the filter.

Scaling filters for floating-point numbers

The D-scale filters can be used to round floating-point values to a fixed absolute accuracy.

They start by computing the minimum of an array that is then deducted from all the values. The array is then multiplied by \(10^n\) and truncated to the nearest integer. These integers are stored with the minimal number of bits required to store the values. That process is reversed when reading the data.

For an array of values

1.2345 -0.1267 0.0897

and \(n=2\), we get stored on disk (but hidden to the user):

136 0 22

This can be stored with 8 bits instead of the 32 bits needed to store the original values in floating-point precision, realising a gain of 4x.

When reading the values (for example via h5py or swiftsimio), that process is transparently reversed and we get:

1.2333 -0.1267 0.0933

Using a scaling of \(n=2\) hence rounds the values to two digits after the decimal point.

SWIFT implements 4 variants of this filter:

  • DScale1 scales by \(10^1\)
  • DScale2 scales by \(10^2\)
  • DScale3 scales by \(10^3\)
  • DScale4 scales by \(10^4\)
  • DScale5 scales by \(10^5\)
  • DScale6 scales by \(10^6\)

An example application is to store the positions with pc accuracy in simulations that use Mpc as their base unit by using the DScale6 filter.

The compression rate of these filters depends on the data. On an EAGLE-like simulation (100 Mpc box), compressing the positions from Mpc to pc (via Dscale6) leads to rate of around 2.2x.

Modified floating-point representation filters

These filters modify the bit-representation of floating point numbers to get a different relative accuracy.

In brief, floating point (FP) numbers are represented in memory as \((\pm 1)\times a \times 2^b\) with a certain number of bits used to store each of \(a\) (the mantissa) and \(b\) (the exponent) as well as one bit for the overall sign [1]. For example, a standard 4-bytes float uses 23 bits for \(a\) and 8 bits for \(b\). The number of bits in the exponent mainly drives the range of values that can be represented whilst the number of bits in the mantissa drives the relative accuracy of the numbers.

Converting to the more familiar decimal notation, we get that the number of decimal digits that are correctly represented is \(\log_{10}(2^{n(a)+1})\), with \(n(x)\) the number of bits in \(x\). The range of positive numbers that can be represented is given by \([2^{-2^{n(b)-1}+2}, 2^{2^{n(b)-1}}]\). For a standard float, this gives a relative accuracy of \(7.2\) decimal digits and a representable range of \([1.17\times 10^{-38}, 3.40\times 10^{38}]\). Numbers above the upper limit are labeled as Inf and below this range they default to zero.

The filters in this category change the number of bits in the mantissa and exponent. When reading the values (for example via h5py or swiftsimio) the numbers are transparently restored to regular float but with 0s in the bits of the mantissa that were not stored on disk, hence changing the result from what was stored originally before compression.

These filters offer a fixed compression ratio and a fixed relative accuracy. The available options in SWIFT for a float (32 bits) output are:

Filter name \(n(a)\) \(n(b)\) Accuracy Range Compression ratio
No filter 23 8 7.22 digits \([1.17\times 10^{-38}, 3.40\times 10^{38}]\)
FMantissa13 13 8 4.21 digits \([1.17\times 10^{-38}, 3.40\times 10^{38}]\) 1.45x
FMantissa9 9 8 3.01 digits \([1.17\times 10^{-38}, 3.40\times 10^{38}]\) 1.78x
BFloat16 7 8 2.41 digits \([1.17\times 10^{-38}, 3.40\times 10^{38}]\) 2x
HalfFloat 10 5 3.31 digits \([6.1\times 10^{-5}, 6.5\times 10^{4}]\) 2x

Same for a double (64 bits) output:

Filter name \(n(a)\) \(n(b)\) Accuracy Range Compression ratio
No filter 52 11 15.9 digits \([2.2\times 10^{-308}, 1.8\times 10^{308}]\)
DMantissa13 13 11 4.21 digits \([2.2\times 10^{-308}, 1.8\times 10^{308}]\) 2.56x
DMantissa9 9 11 3.01 digits \([2.2\times 10^{-308}, 1.8\times 10^{308}]\) 3.05x

The accuracy given in the table corresponds to the number of decimal digits that can be correctly stored. The “no filter” row is displayed for comparison purposes.

In the first table, the first two filters are useful to keep the same range as a standard float but with a reduced accuracy of 3 or 4 decimal digits. The last two are the two standard reduced precision options fitting within 16 bits: one with a much reduced relative accuracy and one with a much reduced representable range.

The compression filters for the double quantities are useful if the values one want to store fall outside the exponent range of float numbers but only a lower relative precision is necessary.

An example application is to store the densities with the FMantissa9 filter as we rarely need more than 3 decimal digits of accuracy for this quantity.

[1]Note that the representation in memory of FP numbers is more complicated than this simple picture. See for instance this Wikipedia article.