====== Compression ======
~~NOTOC~~

===== Goals =====

The goal of this effort is to establish conventions for the quantities of lossless and, primarily, lossy compression algorithms that are useful for users to define and set.

{{ :std:compression:example-data-simplex-206-sigbits-3bits.png?200|}}

In detail:

  * Identify all quantities that are, from the user perspective, useful to set on a compression algorithm, i.e., quantities that help users control the compression rate and performance.
  * Define these quantities properly and assign an understandable abbreviation to each of them.
  * Foster the development of APIs and tools that use these quantities.

===== Contribution =====

Please register on our [[https://www.vi4io.org/listinfo/IO-compression-API|mailing list]].

Participating institutions:

  * DKRZ, Germany
  * Universität Hamburg, Germany
  * Institut Pierre Simon Laplace, France
  * RIKEN, Japan

===== Approach =====

Our strategy and timeline for establishing these conventions are as follows:

  * Identify the first set of user-defined quantities that users shall be allowed to set for the compression (please see the current list of quantities below).
  * Invite international experts to this effort.
  * Identify relevant quantities.
  * Publish a document with the definitions of the quantities.

===== Quantities =====

The following list of quantities contains candidates for the standardisation. They can be classified into: 1) accuracy/precision loss bounding quantities for lossy algorithms; 2) performance-related quantities; 3) other quantities.

==== Accuracy bounding quantities ====

These quantities define the tolerable error on individual values or multidimensional fields of data of a given datatype. The definition is mostly based on the notion of the error, which is the residual obtained when subtracting the (lossy) compressed value (d) from the true value (v).
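To make the notion of the residual concrete, the following minimal Python sketch computes it for a small array and derives a few of the candidate quantities from it. All function names are hypothetical and not part of any agreed API. Two points are assumptions rather than definitions from this page: the PSNR peak is taken to be the value range of the original data, and the combined relative/finest absolute tolerance is interpreted as the larger of the two bounds for each value.

```python
import math


def residuals(original, decompressed):
    """Residual v - d for each true value v and decompressed value d."""
    return [v - d for v, d in zip(original, decompressed)]


def max_absolute_error(original, decompressed):
    """Largest abs(v - d) over all values."""
    return max(abs(r) for r in residuals(original, decompressed))


def max_relative_error(original, decompressed):
    """Largest abs(v - d) / abs(v); values with v == 0 are skipped
    because the relative error is undefined there."""
    return max(
        (abs(v - d) / abs(v) for v, d in zip(original, decompressed) if v != 0),
        default=0.0,
    )


def mean_squared_error(original, decompressed):
    """Arithmetic mean of the squared residuals."""
    res = residuals(original, decompressed)
    return sum(r * r for r in res) / len(res)


def psnr(original, decompressed):
    """Peak signal-to-noise ratio in dB.
    Assumption: the peak is the value range (max - min) of the original data."""
    mse = mean_squared_error(original, decompressed)
    if mse == 0:
        return math.inf
    peak = max(original) - min(original)
    return 10 * math.log10(peak * peak / mse)


def effective_tolerance(v, rel_tol, finest_abs_tol):
    """Assumed interpretation of 'relative error with finest absolute
    tolerance': the relative bound applies, but never gets finer than
    the finest absolute tolerance."""
    return max(rel_tol * abs(v), finest_abs_tol)


# Small illustrative data set (made up for this sketch).
v = [0.0, 1.0, 2.0, 4.0]   # true values
d = [0.0, 1.0, 2.5, 3.5]   # values after lossy compression and decompression

print("residuals:         ", residuals(v, d))
print("max absolute error:", max_absolute_error(v, d))
print("max relative error:", max_relative_error(v, d))
print("MSE:               ", mean_squared_error(v, d))
print("PSNR [dB]:         ", psnr(v, d))
print("tolerance at 0.01: ", effective_tolerance(0.01, 0.01, 0.01))
print("tolerance at 100:  ", effective_tolerance(100.0, 0.01, 0.01))
```

A checker built on such functions could compare the measured errors against the tolerances a user has set, independently of the compressor that produced the data.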
  * **Absolute error tolerance** is the maximum magnitude of the residual error; abs(v-d) < absolute error tolerance.
  * **Relative error tolerance** is a measure of the absolute error compared to the magnitude of the true value.
  * **Relative error with finest absolute tolerance** is a combination of two quantities. With a relative tolerance alone, small numbers around 0 are problematic for compressors, e.g., a 1% relative error for the data value 0.01 results in a compressed accuracy of 0.01±0.0001. The finest absolute tolerance limits the smallest relative error. In our example, setting a finest absolute tolerance of 0.01 would result in an error of ±0.01 for small numbers, while for large numbers the relative error is considered. Thus, it is the lower bound and guaranteed error for relative error bounds, whereas the absolute tolerance is the guaranteed resolution for all data points.
  * **Precision bits and precision digits** indicate how many bits or decimal digits are required to represent the array values.
  * **Mean squared error (MSE)** is the arithmetic mean of the squared errors between the decompressed and the original values.
  * **Standard deviation** is the square root of the mean squared error (exact when the mean error is zero).
  * **Average absolute deviation** summarises the statistical dispersion or variability.
  * **Peak signal-to-noise ratio (PSNR)** is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation.
  * **Preserved values** are values that must be preserved literally, i.e., only lossless compression can be applied to them.

==== Performance quantities ====

  * **Compression/decompression speed** sets a throughput limit. If it is not set, a default is used to achieve the maximum error tolerance.

==== Other quantities ====

  * **Rate limitation** defines the mean number of bits to be used for compression.
Based on the entropy and, thus, the compressibility of information, the precision of data is reduced to meet the overall mean rate.

===== Publications =====