
An Efficient Lossless Compression Algorithm for Trajectories of Atom Positions and Volumetric Data

Martin Brehm* and Martin Thomas

Institut für Chemie – Theoretische Chemie, Martin-Luther-Universität Halle–Wittenberg, Von-Danckelmann-Platz 4, 06120 Halle (Saale), Germany

E-mail: Martin [email protected]

J. Chem. Inf. Model., DOI: 10.1021/acs.jcim.8b00501

Abstract

We present our newly developed and highly efficient lossless compression algorithm for trajectories of atom positions and volumetric data. The algorithm is designed as a two-step approach. In a first step, efficient polynomial extrapolation schemes reduce the information entropy of the data by exploiting both spatial and temporal continuity. The second step processes the data by a series of transformations (Burrows–Wheeler, move-to-front, run-length encoding), and finally compresses the stream with multi-table canonical Huffman coding. Our approach reaches a compression ratio of around 15:1 for typical position trajectories in XYZ format. For volumetric data trajectories in Gaussian Cube format (such as electron density), even a compression ratio of around 35:1 is reached, which is by far the smallest size of all formats compared here. At the same time, compression and decompression are still reasonably fast for everyday use. The precision of the data can be selected by the user. For storage of the compressed data, we introduce the BQB file format, which is very robust, flexible, and efficient. In contrast to most archiving formats, it allows fast random access to individual trajectory frames. Our method is implemented in C++ and provided as free software under the GNU LGPL license. It has been included in the TRAVIS program package, but is also available as a stand-alone tool and as a library (“libbqb”) for use in other projects.

Introduction

During the last decades, computer simulation methods such as molecular dynamics (MD) or Monte Carlo (MC) have become an invaluable tool in the field of computational chemistry. The direct result of an MD or MC simulation is a trajectory, which can be considered as a path through the high-dimensional phase space of the system. The properties of interest are, however, mostly of low dimensionality, such as histograms and spectra. Therefore, algorithms (so-called “analyses”) are required to obtain these properties from trajectories. There are two general approaches to achieve this. The typical way is to temporarily store the simulation trajectory on a hard disk drive while the simulation is running, and subsequently perform all desired analyses on this trajectory file. This often requires large amounts of disk space. Despite the large capacity and reasonable cost of modern hard disk drives, this can even be the limiting factor with regard to simulation time or system size. The second approach is to perform the analysis “on-the-fly” while the simulation is running. This does not require any temporary storage space, but comes with several other drawbacks. First, such an on-the-fly analysis is easy to implement for structural properties (which do not depend on history), but much harder for dynamical quantities such as lifetimes or spectra (which require knowledge of the history). Secondly, the set (and parameters) of analyses needs to be defined before the start of the simulation. Any mistake in the analysis setup, or the desire to further investigate some phenomenon by follow-up analyses, immediately requires re-running the whole simulation, which can be very demanding. Based on this rationale, saving the trajectory to disk is still the standard way of performing simulations. Methods are therefore required to mitigate the hefty storage space requirements.


Many different file formats for storage of atomistic simulation trajectories have emerged over the decades. On the one hand, there are text file formats for this purpose. They carry the inherent disadvantage of leading to relatively large file sizes, because alphanumeric symbols use only a small subset of the value range that ASCII characters offer. However, they have the advantage of being truly platform-independent (consider, e. g., endianness) and being directly human-readable (which is often useful for checking). Popular text trajectory formats include XMol XYZ, 1 PDB, 2,3 and mol2. While XYZ files only carry information on element labels and atom coordinates, the latter two formats are able to also store additional information such as system topology and cell vectors. On the other hand, several binary trajectory file formats exist, which typically lead to reduced file size when compared to text formats. Among those are the DCD file format from NAMD, the XTC format of Gromacs, 4–6 and the HDF5 format 7,8 (which is not a dedicated trajectory file format, but a flexible multi-purpose format which can also be used for trajectories).

Trajectories can not only be composed of atom positions (and possibly velocities), but also of volumetric data, i. e., three-dimensional Cartesian grids of scalar values; the latter are called volumetric trajectories (in contrast to position trajectories). Volumetric trajectories can also be the result of an atomistic simulation. If, for example, an ab initio MD simulation is carried out, the total electron density within the simulation cell is inherently available, and can be stored together with the atom coordinates. This was demonstrated to be useful, e. g., for computing atomic partial charges 9,10 and vibrational spectra of liquids. 11–14 Other possibly interesting quantities which are available as volumetric data from ab initio MD simulations include the electrostatic potential and molecular orbitals. In the field of computational chemistry, volumetric data (both single frames and trajectories) is often stored in the Gaussian Cube file format, 15,16 which is a text file format. Another format for this kind of data is the gOpenMol 17 PLT file format, which offers both binary file and text file storage options. Apart from those, also the HDF5 format can be used to store volumetric data due
to its flexibility. Compression algorithms for similar data sets have been developed in the field of video encoding 18,19 and medical 3D imaging. 20

Trajectories of volumetric data are typically huge files. To give an example, the raw data of our recently reported bulk phase Raman optical activity (ROA) spectrum 14 comprises 500 000 frames of volumetric electron density, which would require 24.5 TiB of storage space in Gaussian Cube file format. Such amounts of data can nowadays be handled and stored, but still require major effort. To reduce the file size, general compression algorithms could be employed. For example, with bzip2 compression 21 at the highest compression ratio, the data set from above would still require 5.1 TiB of space (and a very long compression time). Apart from that, one would lose the ability to randomly access frames, so that this approach would be at most a mediocre solution for archiving, but not suitable for scientific work on the data.

To find a solution to this problem, we developed a specialized compression algorithm for volumetric data, which takes maximum advantage of the special structure that those data sets typically possess (i. e., a certain smoothness in time and space). This is in contrast to general compression methods such as bzip2, 21 which aim at all kinds of input data and do not exploit any special structure. Therefore, such a specialized compression scheme might in principle lead to much higher compression ratios than achieved with general compressed formats. However, it would not be acceptable for scientific applications if the accuracy of the data suffered from the storage approach. Hence, only lossless compression algorithms are acceptable for this purpose. In computers, data is always stored with finite precision, and in a wider sense, this discretization is lossy, which leaves some room for debate on what a lossless algorithm has to deliver. Here, we call an algorithm lossless if the user can decide on a required accuracy, and the algorithm then guarantees that the data possesses at least this accuracy after decompression.


In this article, we present our newly developed lossless compression algorithm for trajectories of atom positions and volumetric data. To efficiently store the compressed data, we designed a flexible and robust file format, which we call the BQB format (original meaning was “binary cube”). In the following part of the manuscript, we discuss the mathematical and computational details of the compression algorithm as well as the file format. A schematic flowchart of our work is presented in Figure 1, which is separated into three parts that will be independently discussed in the next three sections. Subsequently, we present some benchmark results on “real” trajectories of atomistic simulations. We end this article with conclusions.

Figure 1: Flowchart of the compression algorithm presented here, vertically divided into three parts, which will be discussed independently in the following sections.


We implemented all methods from this article as C++ code, which we license under the GNU LGPL v3 license; it can be freely used by the scientific community. Apart from a stand-alone program to compress and decompress trajectories (“bqbtool”), we also offer a library with this functionality (“libbqb”), which can be included in other software projects (such as simulation and visualization programs, which might directly write/read compressed trajectories). These implementations are, together with a documentation of the BQB file format, freely available on the internet at “www.brehm-research.de/bqb”. Furthermore, we included all methods in our free trajectory analysis tool TRAVIS, 22 which can now directly read and analyze compressed trajectories in BQB format without prior decompression. The plots in Figures 2, 5, and 6 have been created with Xmgrace. 23 The image in the lower panel of Figure 4 and the table of contents image were rendered with Povray. 24

The Front-Ends

As depicted in Figure 1, the aim of the different front-ends is to convert the input data into a stream of integer numbers (32 bit signed integers in our implementation). This is not yet a reduction in size; the integer stream typically has the same symbol count as the input data. To yield high compression efficiency in the subsequent compression step, this stream should possess a small entropy. Apart from the entropy (which does not take into account the ordering of the symbols), it is advantageous to preserve locality, which means that similar integers following each other increase the compression ratio (due to the multi-table Huffman coding; see the corresponding section below).

The key to reducing information entropy is a good extrapolation scheme, which predicts the current (unknown) value based on known values from the history of the trajectory. Such an extrapolation is typically not exact, and there remains a difference between the extrapolated value and the true value, which we call the residual. However, with a suitable extrapolation scheme,
the residuals will be much smaller than the absolute values. Because the extrapolation scheme is known and can also be applied while decompressing the data later on, it is sufficient to store only the residuals in the compressed file. This yields a significant reduction of entropy. In the ideal case of a perfect extrapolation, all residuals would be zero, which in theory would lead to an infinite compression ratio.

Binary files can also be compressed; then, no front-end is active, and the byte stream from the file is directly passed to the compressor. This corresponds to an operating mode very similar to bzip2. 21

In the following sections, the extrapolation schemes for position trajectories as well as volumetric data trajectories will be described.

Position Trajectories

A position trajectory is a time series of atom positions in a given system, and is the direct result of a molecular dynamics simulation. Apart from the positions, other information might also belong to such a trajectory (simulation cell vectors, atom labels, topology, . . . ). However, this additional information is typically either fixed during the simulation run (atom labels, topology) or small in size (cell vectors). Therefore, the focus in compressing position trajectories is clearly on the atom position time series.

The steps performed in the position trajectory front-end are depicted in the upper middle part of Figure 1. In theory, the atom positions are real numbers. However, in practice, numerical integration algorithms are used to perform MD simulations, which typically work with a finite precision (i. e., machine numbers). Usually, the numbers are not written with full machine precision to the trajectory file. Depending on the application, a certain precision in atom positions is required to obtain fully converged results, and it would be a waste of
space to save the coordinates in higher precision. Although this approach is not lossless, the loss of precision is not related to the compression algorithm presented here. When the chosen storage precision matches the input trajectory precision, the algorithm is guaranteed to be strictly lossless, i. e., the decompressed file is bitwise identical to the input trajectory. When a real number is to be truncated to finite precision, there are two different approaches. One approach is to set a relative precision, such that each truncated number possesses the same number of significant digits. The second approach is to apply an absolute precision, such that there exists a smallest possible difference, which is identical for all numbers (e. g., keeping a fixed number of digits after the decimal point). The fundamental difference between both approaches is that applying relative precision implies a special importance of the coordinate origin, because numbers close to zero are stored with very high absolute precision. With absolute precision, on the other hand, there is no special meaning of the origin. As systems in MD simulations are typically invariant under translation, the choice of the origin is arbitrary, and it is clear that the atom positions should be stored with a given absolute precision. In the implementation presented here, this precision can be set by the user, and the default value is a precision of 10^-3 pm, which should be sufficient for all atomistic simulations.
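To make the meaning of a fixed absolute precision concrete, the following minimal C++ sketch (a hypothetical helper written for illustration, not taken from the BQB source) maps a coordinate to an integer multiple of the chosen precision and back:

#include <cmath>
#include <cstdint>

// Map a coordinate (in pm) to an integer multiple of the absolute precision,
// here the default of 10^-3 pm, and restore it again after decompression.
std::int64_t discretize(double coord_pm, double precision_pm = 1e-3) {
    return static_cast<std::int64_t>(std::llround(coord_pm / precision_pm));
}

double restore(std::int64_t value, double precision_pm = 1e-3) {
    return value * precision_pm;   // exact up to the chosen absolute precision
}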

Polynomial Extrapolation

As explained above, the central step in reducing the data entropy is the extrapolation scheme which is applied to the position time series. As it is known that solutions of Newton’s equations of motion can be very well approximated by polynomials within small time intervals, 25 we resort to polynomial extrapolation here. Note that such an extrapolation is only suitable if the atom coordinates are not wrapped to the primary image of the periodic simulation cell, and therefore non-wrapped positions should be used in the input trajectory (wrapping can easily be performed later after decompression). An example of such an extrapolation is given in Figure 2. On the vertical axis, the X position of one selected atom from a bulk phase simulation in each time step is given by circles,
whereas the horizontal axis depicts the simulation time. The aim is to predict the position at the red arrow, and the last (n + 1) positions may be used for extrapolation (depending on the order n of the polynomial). Often, it is observed that the extrapolation quality initially rises with increasing polynomial order, but then starts to fall again after transcending an optimal value. In the inset of Figure 2, it can be seen that the best prediction is achieved with n = 4, i. e., by using a 4th-order polynomial, in this example.

Figure 2: Schematic illustration of extrapolation polynomials of different degree n (lines) fitted to the trajectory of one atom from a simulation (circles). Red arrow depicts position to be extrapolated.

A polynomial of order n in one dimension (time) can be written as

P^n(t) := c_0 + c_1 (t - t_0) + c_2 (t - t_0)^2 + \cdots + c_n (t - t_0)^n,     (1)

where t_0 is the reference time and can be arbitrarily chosen. The coefficients c_0, \dots, c_n can be determined by solving the linear system

A c = q     (2)


defined by 

1  1  A :=  .  ..   1

(−1)

···

(−2) .. .

··· ...

− (n + 1)



···



    c q (t − 1) 0   0        n   c1   q (t0 − 2)  (−2)       , c :=  .  , q :=   (3) .. .   ..    . . .         n   − (n + 1) cn q t0 − (n + 1) (−1)n

with (n + 1) equations and unknowns, where q(t) denotes the (known) position at time t, which is sampled at the times (t_0 - 1), \dots, (t_0 - (n + 1)) from the history of the trajectory. For a non-singular matrix A, the solution c can be obtained by

c = A^{-1} q.     (4)

Note that A does not depend on the input data, and is identical for all extrapolations within one trajectory. Therefore, it is sufficient to determine the inverse matrix A^{-1} only once at the beginning of the run. A typical choice is to set t_0 to the time at which the position shall be extrapolated; then, the predicted value is simply found as the c_0 coefficient of the polynomial. The extrapolated value \tilde{q} can be computed as a simple dot product by

\tilde{q} = c_0 = \langle a_1, q \rangle,     (5)

with a_1 being the vector formed by the first row of the matrix A^{-1}. Thus, a full polynomial extrapolation requires only (n + 1) floating point multiplications and n floating point additions, where n is the order of the extrapolation, which renders this approach very efficient.
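The following C++ sketch illustrates this scheme under the equations above; the class name and the use of the Eigen library are our assumptions for illustration and not part of the BQB implementation. The matrix A from equation (3) is built and inverted once, and each prediction is then the dot product of equation (5).

#include <Eigen/Dense>
#include <cmath>
#include <vector>

class PolyExtrapolator {
public:
    // Build A from Eq. (3) for polynomial degree n and keep only the first
    // row of its inverse, which is all that Eq. (5) requires.
    explicit PolyExtrapolator(int n) : n_(n) {
        Eigen::MatrixXd A(n + 1, n + 1);
        for (int i = 0; i <= n; ++i)        // row i: sample at time t0 - (i+1)
            for (int j = 0; j <= n; ++j)    // column j: power j of (t - t0)
                A(i, j) = std::pow(-(i + 1.0), j);
        Eigen::MatrixXd Ainv = A.inverse();
        a1_ = Ainv.row(0);
    }

    // history[0] = q(t0 - 1), ..., history[n] = q(t0 - (n+1)).
    double predict(const std::vector<double>& history) const {
        double q = 0.0;
        for (int i = 0; i <= n_; ++i)
            q += a1_(i) * history[i];       // q~ = <a1, q>, Eq. (5)
        return q;
    }

private:
    int n_;
    Eigen::RowVectorXd a1_;
};

For n = 1, for example, this reduces to the familiar linear extrapolation \tilde{q} = 2 q(t_0 - 1) - q(t_0 - 2).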

Up to now, only cases were considered in which the number of data points used for extrapolation equals the degree of the polynomial plus one, i. e., A is a square matrix. However, we found that the compression ratio (defined as the quotient of input file size and compressed
output file size) often increases if additional data points from the history are taken into account, leading to an over-determined linear system. In general, an exact solution no longer exists in such cases, and one has to resort to a least-squares solution of the system. 26 Such a least-squares solution of an over-determined system

A x = b     (6)

can be efficiently computed by

x = A^+ b,     (7)

where A^+ denotes the Moore–Penrose pseudoinverse 27–29 of the matrix A (note that the classical inverse A^{-1} is not defined for non-square matrices). The Moore–Penrose pseudoinverse is a generalization of the classical matrix inverse (identical for non-singular square matrices, but also defined for all other matrices), and can be efficiently computed from the singular value decomposition (SVD) 30 A = U Σ V^* of matrix A by A^+ := V Σ^+ U^* (Σ is a diagonal matrix, so that Σ^+ is simply determined by taking the reciprocal of all non-zero diagonal elements of Σ; we use a threshold of 10^-13 in our implementation).
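A minimal sketch of such a pseudoinverse, assuming the Eigen library as linear algebra back-end (an assumption on our side; the BQB source may use a different SVD routine), could look as follows. Singular values below the threshold are treated as zero, as described above.

#include <Eigen/Dense>

// Moore-Penrose pseudoinverse via SVD: A+ = V * Sigma+ * U^T (real matrices).
Eigen::MatrixXd pseudoInverse(const Eigen::MatrixXd& A, double threshold = 1e-13) {
    Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeThinU | Eigen::ComputeThinV);
    Eigen::VectorXd s = svd.singularValues();
    Eigen::VectorXd sInv = Eigen::VectorXd::Zero(s.size());
    for (Eigen::Index i = 0; i < s.size(); ++i)
        if (s(i) > threshold)
            sInv(i) = 1.0 / s(i);      // reciprocal of the non-zero singular values
    return svd.matrixV() * sInv.asDiagonal() * svd.matrixU().transpose();
}

The constant matrix (W A)^+ W of equation (9) below can then be precomputed once at the beginning of the run, e. g., as pseudoInverse(W * A) * W.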

When computing a least-squares solution to an over-determined linear system with n unknowns and m > n equations, the individual equations can be weighted by real numbers w_1, \dots, w_m. Equations with a larger weighting factor have a stronger influence on the solution. Such a weighted system can be expressed as

W A x = W b, \quad W := \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_m \end{pmatrix}     (8)

with weight matrix W, which is equivalent to multiplying both the i-th row of matrix A and the i-th component of vector b with the weight factor w_i. The solution is given by

x = (W A)^+ W b.     (9)

If A is a non-singular square matrix, the unique solution of the system is preserved independently of the weighting factors, which can easily be seen from (W A)^{-1} W = A^{-1}. As above, the matrix (W A)^+ W is constant, and can be precomputed at the beginning of the run. In polynomial extrapolation, each equation of the linear system corresponds to one known data point from the history of the trajectory. A typical choice of the weights w_i is to assign larger weights to equations which correspond to data points with smaller temporal distance, so that the samples which date back farthest have the smallest influence on the extrapolation result. We found that a good choice is given by

w_i := \left( \frac{1}{1 + i} \right)^e,     (10)

where w_i is the weight for an equation which corresponds to a data point from i time steps earlier than the extrapolated value, and the value e is called the temporal weighting exponent.

The extrapolation scheme for position trajectories described in this section possesses three adjustable parameters which can be specified by the user: The number of data points m taken from the history of the trajectory, the degree n of the extrapolation polynomial, and the weighting exponent e. It is required that m ≥ n + 1 to avoid under-determined linear systems. In the special case of m = n + 1, the linear system possesses a unique solution, and the weighting exponent e becomes meaningless. To give an example for typical parameter values, the best compression ratio for system E with ∆t = 0.5 fs (see 4th numeric row in Table 2) was achieved with a history of m = 9 samples, a polynomial degree of n = 5, and a weighting exponent of e = 0.546.


When considering the first frame of a position trajectory, there is no known position value from past steps, and no extrapolation is possible. Therefore, both the number of history points and the degree of the polynomial rise within the first steps of the trajectory, until they reach the value given by the user. The nominal value of the compression ratio is only reached after this number of steps has been passed.

Center of Mass Correction

In an ideal MD simulation, the total momentum is preserved (which follows from Noether’s first theorem 31 due to the translation invariance of the system), so that the center of mass remains fixed during the simulation if the initial velocities are chosen accordingly. In reality, however, this is often not the case due to numerical issues. When compressing position trajectories, it was found to be advantageous to take special care of this phenomenon. In the implementation presented here, the center of mass is treated as an atom, and stored as described above by using polynomial extrapolation. All other atom coordinates are stored relative to the center of mass. If the whole system drifts, this is no longer reflected in the individual atom coordinates, which slightly increases the compression efficiency.

Sorting by Atom Mass

In a well-equilibrated atomistic system in the NVT ensemble, the average velocity of an atom is approximately proportional to the square root of the reciprocal atom mass (according to the Maxwell–Boltzmann distribution, the average velocity of the atoms in an ideal gas is exactly proportional to the square root of the reciprocal atom mass). The polynomial extrapolation described above can be expected to yield a constant relative prediction accuracy. When the total position change of an atom between successive time steps becomes larger, the residual will on average become larger by the same factor. This means that light atoms will on average have larger residuals than heavy atoms, because they possess higher velocities. As described above, the compression algorithm takes great advantage of locality, i. e., similar
symbols closely following each other in the integer stream. Based on this rationale, it is an obvious choice to sort the atoms by their mass before storing the coordinates, so that residuals of similar magnitude follow each other. Due to the way the Huffman tables are constructed (see the section below), it is advantageous to start with the heaviest atoms, which leads to the smallest residuals appearing first in the stream. The atom labels are stored in the original atom order in the compressed file. Both the compressor and the decompressor use a stable sorting algorithm to sort this atom list by mass. This ensures that compressor and decompressor are guaranteed to obtain the same atom permutation, and the original atom order can be restored. The atom sorting is activated by default, but can be switched off by the user.
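A minimal sketch of such a permutation in C++ (a hypothetical helper, not the BQB source): std::stable_sort guarantees that atoms of equal mass keep their relative order, so compressor and decompressor arrive at exactly the same permutation.

#include <algorithm>
#include <numeric>
#include <vector>

// Returns perm such that perm[k] is the original index of the k-th stored atom,
// ordered from heaviest to lightest; ties keep their original relative order.
std::vector<int> massPermutation(const std::vector<double>& masses) {
    std::vector<int> perm(masses.size());
    std::iota(perm.begin(), perm.end(), 0);
    std::stable_sort(perm.begin(), perm.end(),
                     [&](int a, int b) { return masses[a] > masses[b]; });
    return perm;
}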

Discretizing Residuals

Finally, the obtained residuals (which are still real numbers) are discretized to integer numbers by taking into account the number of digits required by the target absolute precision that was given by the user. As the positions are stored with fixed absolute precision, the residuals can simply be expressed as integer multiples of the smallest position difference. Special care is required to avoid glitches due to rounding artifacts of floating point numbers. Each residual is discretized such that it yields the correct position when added to the position predicted from the last n already discretized positions. Most of these steps rely on integer arithmetic, which minimizes the probability of a deviation due to non-standard behavior of floating point arithmetic.
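The key point, illustrated by the hypothetical sketch below (reusing the discretization helper from above; not the actual BQB code), is that the residual is formed against a prediction computed from already discretized positions, so that the decompressor can invert the step exactly in integer arithmetic.

#include <cmath>
#include <cstdint>

// The residual is the integer difference between the true position and the
// prediction, both expressed in units of the absolute precision; adding it
// back to the (identically computed) prediction restores the stored coordinate.
std::int64_t residual(double truePos_pm, double predictedPos_pm,
                      double precision_pm = 1e-3) {
    std::int64_t trueInt = std::llround(truePos_pm / precision_pm);
    std::int64_t predInt = std::llround(predictedPos_pm / precision_pm);
    return trueInt - predInt;
}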

Volumetric Data Trajectories

A volumetric data trajectory is a time series of atom positions in a given system inside a cell, together with a uniform three-dimensional grid of real values in the cell which is stored for every frame. The grid is characterized by the resolution (i. e., number of entries) along each of the three cell vectors. Furthermore, the grid is not necessarily rectangular—the
grid vectors are parallel to the cell vectors, and also non-orthorhombic cells are allowed. Volumetric trajectories are often produced from ab initio MD simulations, where quantities such as the total electron density or the electrostatic potential are available in each time step. The volumetric electron density along a trajectory can be used to compute vibrational spectra 12–14 or atomic partial charges 9,10 of the system, among many other applications. However, this leads to huge amounts of data (multiple Terabytes) which need to be stored (at least temporarily). Therefore, it is an important task to efficiently compress time series of volumetric data. As described, a volumetric data trajectory often contains not only volumetric data, but also atom positions. The atom positions can be stored by the method described in the last section. Therefore, only the storage of the volumetric data will be discussed here.

The volumetric data on the grid consists of real numbers, which need to be truncated in some way before storage. As described above, this truncation could be performed either by relative precision or by absolute precision. In contrast to atom positions (which are invariant under translations), the volumetric data often corresponds to a physical quantity where zero has a special meaning, such as the electron density (where zero means “no electron density”) or the electrostatic potential (where the sign of the value is very important). Therefore, the volumetric data is stored with a given relative precision, i. e., the number of significant digits is constant. This is in line with the Gaussian Cube file format, where the values are stored in a mantissa–exponent representation, which also yields a constant number of significant digits. The original Gaussian Cube file format possesses exactly 5 significant digits in the mantissa. 15 In the implementation presented here, the user can choose the desired number of significant digits for volumetric data storage; up to 9 significant digits are supported. By default, 5 significant digits are processed. If the compression precision matches the precision of the input volumetric data, the decompressed file is guaranteed to be bitwise identical to the input Cube file.


Certain applications require the storage of complex numbers (e. g., wave functions) or vectors (if vector fields shall be stored) instead of real numbers. According to the documentation of the Gaussian Cubegen utility, 16 Cube files can also store this type of data, where every grid element contains multiple components. Our algorithm is able to handle such cases by simply considering each component individually. Compression efficiency might be increased by applying some bijective transformation to the data before separation into components. In case of complex numbers, this might be, e. g., to store absolute value and argument instead of real and imaginary part.

Polynomial Extrapolation

In contrast to position trajectories discussed above, volumetric data trajectories possess both spatial and temporal dimensions along which extrapolation can be performed. As most functions of interest (electron densities, molecular orbitals, electrostatic potentials) are usually smooth in space, it is a good choice to apply polynomial extrapolation to both spatial and temporal dimensions of the volumetric data. In the standard case of three-dimensional volumetric data sets, a four-dimensional extrapolation problem results.

An obvious choice for maximum extrapolation quality is to perform polynomial extrapolation directly in four-dimensional space. The sample points are then taken both from the spatial and the temporal history of the volumetric trajectory (when traversing the volumetric grid along some well-defined path), and the polynomial contains terms depending on all four variables x, y, z, and t in several powers, where also cross-terms (i. e., products of powers of different variables) are allowed. This extrapolation problem can be handled by the same approach for over-determined systems as discussed above for position trajectories. However, it turned out that the compression efficiency was unexpectedly low when using this combined four-dimensional approach. One reason might be that the spatial and temporal dimensions behave fundamentally differently, and should not be treated together in one single scheme.


As a second approach, we designed a predictor–corrector extrapolation scheme, which yields much better compression efficiency, and will be described in the following. The predictor step operates only along the temporal dimension. The value of one certain grid element within the volumetric data set is predicted based on the values of this grid element in previous trajectory frames, without taking into account any other grid elements at all. This temporal extrapolation is performed by the approach described for position trajectories above (i. e., polynomial extrapolation with more sample points than degree of polynomial, using weighting factors based on a weighting exponent). A typical example of a compressed volumetric trajectory (“system E” with ∆t = 0.5 fs, as discussed in the results section below) uses a temporal history of 6, a temporal polynomial degree of 2, and a temporal weighting exponent of 0.43 for the volumetric predictor.

After the value of one grid element has been predicted from its temporal history, a corrector step is applied, which takes into account spatial information to refine the predicted value. For all grid elements already visited before in the current time frame (i. e., all grid elements which are located before the current element on some well-defined traversal path through the grid), both the true value and the predicted value from the temporal predictor step are known. The difference of these values depicts the deviation of the predictor step results from the true value. By taking into account all these deviations from grid point history, the corrector tries to extrapolate the deviation at the current grid point. This extrapolated deviation is then used to refine the predicted value. For the spatial extrapolation of the deviations, a rectangular region from the volumetric data grid is selected, from which the sample points are taken. For the sake of clarity, a simplified two-dimensional example will be discussed here, which is presented in Figure 3. The blue and green grid elements have already been visited before (canonical X-before-Y grid traversal); the values on these elements are therefore known. The value at the pink grid
element at position x_0, y_0 shall be predicted by extrapolation. All grid elements within the green region have been selected as samples to construct the extrapolation polynomial; the region is defined by its size (“X Range” and “Y Range”) as well as the offset at which the target element is found within the region (“X Offset” and “Y Offset”). If the volumetric data set possesses periodic boundary conditions (as in most MD simulations), periodicity is fully exploited for extrapolation. Furthermore, back in the three-dimensional case, an extrapolation polynomial P(x, y, z) needs to be specified. It is defined by the highest power of the three spatial variables that it contains (“X Order”, “Y Order”, and “Z Order”), and typically also contains all cross-terms, which are given by all possible products of monomials of different variables. In complete analogy to the extrapolation for position trajectories described above, this leads to an over-determined linear system, which contains one equation per sample grid element and one unknown per polynomial coefficient. As in the one-dimensional case above, we choose x_0 = y_0 = z_0 = 0, so that the extrapolated value is simply given by the c_0 coefficient of the polynomial. Again, weights are assigned to the equations to tune their influence on the extrapolation polynomial.

Figure 3: Schematic two-dimensional illustration of grid points in the volumetric corrector algorithm.


Equations which correspond to grid elements spatially close to the extrapolated element should possess a large weight. In analogy to equation (10), we obtained good results by using

w_{xyz} := \left( \frac{1}{\sqrt{1 + x^2 + y^2 + z^2}} \right)^e,     (11)

for the weighting factors, where w_{xyz} is the weighting factor corresponding to a grid element with spatial distance of x, y, and z from the extrapolated element in the X, Y, and Z dimension, respectively. The spatial weighting exponent e determines how fast grid elements lose influence with rising distance.

A typical example of a compressed volumetric trajectory (“system E” with ∆t = 0.5 fs, as discussed in the results section below) uses a spatial range of 7, a spatial offset of 3, and a spatial polynomial degree of 3 (for all three space dimensions) together with a spatial weighting exponent of 3.14 for the volumetric corrector. This leads to a system of 342 equations (i. e., sample points) and 20 coefficients when constructing the spatial extrapolation polynomial. It takes around one second to compute the Moore–Penrose pseudoinverse of the corresponding 342 × 20 matrix (which needs to be performed only once in the beginning).

Hilbert Curve Re-Ordering

Locality is an important concept in chemistry, 32–34 and—as already discussed above—data locality in the integer stream is very important for the compressor to reach a high compression ratio. In a three-dimensional grid, locality means that neighboring grid entries are also close to each other in the one-dimensional stream of integer numbers. When traversing the grid in canonical order (with three nested loops for X, Y, Z index), locality is very bad. Neighbors in X direction are still neighbors in the stream, but neighbors in Y and especially Z direction are very far separated from each other. To overcome this, non-canonical traversal schemes can be applied to preserve some degree of locality. A good choice in such a case is the space-filling
Hilbert curve. 35 It is guaranteed to touch every grid point exactly once (and therefore is a valid traversal path), and it preserves locality. Two illustrations of such a Hilbert curve are presented in Figure 4 in two dimensions (upper panel) and three dimensions (lower panel). It is visible that neighboring grid points are often (not always) close to each other on the traversal path. Our implementation uses a fast algorithm (without recursion) to construct three-dimensional Hilbert curves. 36 The Hilbert curve traversal is activated by default, but can be switched off by the user. In a typical example application (“system E” with ∆t = 0.5 fs, as discussed in the results section below), the Hilbert curve traversal reduces the output file size by around 10 %.

Figure 4: Illustration of grid traversal by a space-filling Hilbert curve in two dimensions (upper panel) and three dimensions (lower panel); the locality of neighboring grid elements is preserved.
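For illustration, the following C++ sketch gives the classic two-dimensional conversion from a position d along the Hilbert curve to (x, y) grid coordinates; it is only meant to convey the idea, since the actual implementation uses a non-recursive three-dimensional construction (Ref. 36).

#include <cstdint>
#include <utility>

// Convert the index d along the Hilbert curve of a side x side grid
// (side must be a power of two) into grid coordinates (x, y).
void hilbertD2XY(std::uint32_t side, std::uint32_t d,
                 std::uint32_t& x, std::uint32_t& y) {
    x = 0;
    y = 0;
    std::uint32_t t = d;
    for (std::uint32_t s = 1; s < side; s *= 2) {
        std::uint32_t rx = 1 & (t / 2);
        std::uint32_t ry = 1 & (t ^ rx);
        if (ry == 0) {                       // rotate the quadrant if necessary
            if (rx == 1) {
                x = s - 1 - x;
                y = s - 1 - y;
            }
            std::swap(x, y);
        }
        x += s * rx;                         // move into the correct quadrant
        y += s * ry;
        t /= 4;
    }
}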


Discretizing Residuals

After all the above steps have been performed, the resulting residuals need to be discretized to integer numbers. This is slightly more involved than in the position trajectory case, as here the grid elements are stored with fixed relative precision. To avoid glitches due to different floating point hardware, the grid values are not stored as floating point machine numbers; instead, the mantissa and the exponent are stored separately as integer numbers, so that only integer arithmetic (which is strictly reliable) needs to be performed. Now, a case discrimination is applied. If the predicted value has the same exponent as the true value (this is the typical case), then the difference between both mantissas is stored as an integer number. If the exponents of predicted and true value differ by exactly 1 in either direction, then a special symbol is put to the integer stream, followed by the residual either multiplied by 10 or divided by 10, such that the true value is reached with the required relative precision. In all other cases, another special symbol is put to the stream, followed by the integer numbers for mantissa and exponent of the true value (not the residual). This ensures that the true value can be stored if the prediction is too far off from the true value (e. g., due to a discontinuity in the volumetric trajectory).
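The branching logic can be sketched as follows; the escape symbol values and the exact way the rescaled residual is formed are our assumptions for illustration, not the actual BQB encoding.

#include <cstdint>
#include <vector>

// Hypothetical escape symbols; the real bit-level layout differs.
constexpr std::int64_t SYM_EXP_UP   = 1000000000;  // true exponent = predicted + 1
constexpr std::int64_t SYM_EXP_DOWN = 1000000001;  // true exponent = predicted - 1
constexpr std::int64_t SYM_FULL     = 1000000002;  // prediction too far off

void emitGridValue(std::int64_t mantTrue, int expTrue,
                   std::int64_t mantPred, int expPred,
                   std::vector<std::int64_t>& stream) {
    if (expTrue == expPred) {                        // typical case: mantissa difference
        stream.push_back(mantTrue - mantPred);
    } else if (expTrue == expPred + 1) {             // rescale the prediction by 10
        stream.push_back(SYM_EXP_UP);
        stream.push_back(mantTrue - mantPred / 10);
    } else if (expTrue == expPred - 1) {
        stream.push_back(SYM_EXP_DOWN);
        stream.push_back(mantTrue - mantPred * 10);
    } else {                                         // store the true value directly
        stream.push_back(SYM_FULL);
        stream.push_back(mantTrue);
        stream.push_back(expTrue);
    }
}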

A special case arises when considering volumetric data (e. g., electron density) of gas phase systems, where one molecule is contained in a large box of vacuum. One would assume that grid elements far from the molecule should be zero. However, as grid elements are stored with fixed relative precision, these elements are not zero, but rather contain numerical noise with very small exponents. As noise carries very high entropy and also cannot be predicted by extrapolation, this is very bad for compression efficiency. Our implementation offers the option to introduce a lower bound for the absolute precision (defined by a smallest permissible exponent in the mantissa–exponent representation), such that all values below this bound are truncated to zero. This does not significantly influence the data, but increases the compression ratio by almost an order of magnitude for gas phase systems. By default, the smallest allowed
exponent is set to −12, which can be modified by the user (depending, e. g., on the unit of the volumetric data). However, the algorithm is then no longer lossless in a strict sense. This only concerns gas-phase trajectories—in bulk phase systems, the electron density never becomes so low that the numerical background noise becomes visible.

The Compressor

As depicted in Figure 1, the compressor receives a stream of integer numbers with reduced entropy, and reduces the size of the input data as far as possible by a sequence of transformations. The front-ends reduce the entropy of the data, but do not compress the data at all, so that the compressor is the only part where size reduction takes place. The direct output of the compressor is a sequence of bits, where byte boundaries do not play any role. This is important for a high compression ratio, because fields smaller than 8 bits do not waste space. The sequence of transformations presented below is inspired by the bzip2 compression tool, 21 which was developed by Julian Seward. The source code of bzip2 is freely available on the internet under a BSD-like license. No parts of the bzip2 source code were copied when writing the implementation presented here; only the general concept is very similar.

Alphabet Creation

The first step in the compressor is the creation of an alphabet (this step is not performed in bzip2, because, in contrast to this work, the input data there are bytes, and therefore the data range is already quite limited). The use of an alphabet transforms the symbol space from all possible integers to a consecutive range of non-negative integers. If n different symbols are present in the input integer stream, then the output stream will contain the values 0 . . . (n − 1). The alphabet is created in such a way that the symbols (which are integers) are sorted in ascending order. This means that a zero in the output stream corresponds to the
smallest integer from the input stream, and a (n − 1) corresponds to the largest integer. To accelerate the transformation of the input stream by the alphabet (especially for very large n), a hash table 37,38 is utilized. To be able to decompress the data, the alphabet needs to be stored along with the compressed data. Because the symbols in the alphabet are in ascending order, only the first symbol (i. e., the smallest integer from the input stream) needs to be stored directly. All other symbols are stored as the difference to their predecessor. This resulting list of differences can be stored in one of the following ways:

1. Store the list by Huffman coding 39 with a normal (non-canonical) tree.

2. Store the list by Huffman coding with a canonical tree. 40

3. Use the “border storage” algorithm: Let f be the bit count required to store the largest element from the list. Let b < f be a natural number. Loop over all list elements. If the current list element requires ≤ b bits for storage, push a “0” bit to the stream, followed by b bits for the element. Otherwise, push a “1” bit to the stream, followed by f bits for the element. Find an optimal b such that the resulting total bit sequence becomes shortest.

Each time an alphabet is to be stored, our implementation tries each of these three methods. The method with the lowest resulting bit count is chosen and put to the output bit stream.
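The cost model behind the “border storage” option can be sketched as follows (a hypothetical helper written for illustration; the actual bit-stream writing is omitted): for each candidate b, every element costs one flag bit plus either b or f payload bits, and the b with the smallest total is selected.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bits needed to store v (at least 1).
static int bitsNeeded(std::uint32_t v) {
    int n = 1;
    while (v >>= 1)
        ++n;
    return n;
}

// Returns the border b that minimizes the total bit count; b = f serves as fallback.
int optimalBorder(const std::vector<std::uint32_t>& values, std::size_t& bestCost) {
    std::uint32_t maxVal = 0;
    for (std::uint32_t v : values)
        maxVal = std::max(maxVal, v);
    const int f = bitsNeeded(maxVal);
    int bestB = f;
    bestCost = values.size() * (1 + f);
    for (int b = 1; b < f; ++b) {
        std::size_t cost = 0;
        for (std::uint32_t v : values)
            cost += 1 + (bitsNeeded(v) <= b ? b : f);  // flag bit + payload
        if (cost < bestCost) {
            bestCost = cost;
            bestB = b;
        }
    }
    return bestB;
}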

Applying Transformations

After the alphabet creation, a stream of integer numbers within some well-defined interval results, which could subsequently be compressed directly by entropy encoding. However, in many applications it has proven advantageous to apply a series of transformations to this stream first. A popular combination of such transformations is applied in bzip2, namely
the sequence of Burrows–Wheeler transformation, 41 Move-to-Front transformation, 42 and run-length encoding 43 of zero runs. This sequence tries to exploit remaining redundancy and repetitiveness in the integer stream. The Burrows–Wheeler transformation, also known as block sorting, is a permutation of the input stream which often re-shuffles the integers in such a way that identical values follow each other. The subsequent Move-to-Front transformation turns these runs of identical values into runs of zeroes. Finally, the run-length encoding of zero runs stores these blocks efficiently through base-2 bijective numeration, which reduces the total number of symbols in the stream. A detailed explanation of all three transformations is given in the Supporting Information, including a worked-out example and two schematic illustrations in Figures S1 and S2. As described above, this sequence of transformations aims at exploiting and eliminating repetitiveness in the integer stream. It is clear that certain kinds of input data, such as text files, possess a high degree of repetitiveness (because the same words appear several times), such that these transforms massively contribute to a good compression efficiency (as in bzip2). However, in the case of trajectory compression discussed here, it is not clear where repetitiveness in the integer stream should come from. In the ideal case, the extrapolation scheme should already have exploited all structure in the data, such that the residuals only represent white noise that cannot be extrapolated. In such a case, the sequence of transformations from above would be of no use. And indeed, it turns out that the compression efficiency is often even decreased by applying these transformations. The algorithm presented here therefore automatically determines which combination of the transformations yields the strongest compression. If there is only very little gain through one of the transformations, it is typically switched off to save computer time. This is especially the case for the Burrows–Wheeler transformation, which requires the sorting of very long sequences. 44
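As an example of one of these stages, the Move-to-Front transformation can be written down in a few lines of C++ (a textbook version for illustration, operating on symbols that the alphabet step has already mapped to 0 . . . (n − 1)):

#include <list>
#include <vector>

// Runs of identical input symbols become runs of zeroes in the output,
// which the subsequent zero run-length encoding stores very compactly.
std::vector<int> moveToFront(const std::vector<int>& input, int alphabetSize) {
    std::list<int> table;
    for (int i = 0; i < alphabetSize; ++i)
        table.push_back(i);
    std::vector<int> output;
    output.reserve(input.size());
    for (int symbol : input) {
        int rank = 0;
        auto it = table.begin();
        while (*it != symbol) {            // current position of the symbol
            ++it;
            ++rank;
        }
        output.push_back(rank);
        table.erase(it);                   // move the symbol to the front
        table.push_front(symbol);
    }
    return output;
}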


Huffman Coding

The most important step for size reduction is the Huffman coding. 39 Both bzip2 and the implementation presented herein strongly profit from an advanced multi-table Huffman coding. Before this is covered below, Huffman coding with a single table will be explained. Huffman coding is a widely used entropy encoding method (e. g., in the JPEG image format and the MP3 audio format). It aims at reducing the size of a stream of symbols by maximizing the entropy. This is achieved by creating a variable-length prefix code where each symbol from the input corresponds to a variable-length sequence of bits. According to Shannon’s source coding theorem, 45 such a code is optimal if the length of each bit sequence is approximately proportional to −log_2(p), 46 where p is the probability of finding the corresponding symbol within the input sequence. In simple words, the symbols which appear most often will have the shortest corresponding bit sequences. This is not always strictly fulfilled, but it can be shown that Huffman coding produces an optimal code among all methods that encode each symbol separately. 47 An efficient algorithm to construct a Huffman coding table makes use of a binary tree structure. First, a list of 2-tuples is created where each tuple contains one symbol from the alphabet and the frequency of that symbol, i. e., the total count of this symbol in the input data. This list is sorted in ascending order by frequency, such that it starts with the least frequent symbols. The first two tuples are taken and removed from the list. A new tree node is created, with these two tuples as children. The new tree node is also represented as a tuple, with its frequency being the sum of the two children’s frequencies. This newly created node is now inserted into the list of tuples at the correct position, such that the list remains sorted with respect to the tuple frequencies. Then, the process is repeated by taking and removing the first two tuples from the list, and so on. The algorithm stops when there is only one tuple left in the list. This is the root of a binary tree. All symbols from the alphabet represent leaves in this tree. To find the Huffman code of a given symbol, the tree is traversed starting from the root until the leaf is reached which corresponds to this
symbol. During this traversal, a “0” bit is produced each time the left child of a node is entered, and a “1” bit each time the right child of a node is accessed. The resulting sequence of bits is the Huffman code word for the symbol in the leaf. Following from this, the length of the code word for some symbol is equivalent to the depth of this symbol within the binary tree.
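A compact sketch of this construction in C++ (a standard textbook formulation using a binary heap rather than the sorted list described above; names and structure are ours, not taken from the BQB source) returns, for each symbol, the length of its code word, i. e., its depth in the tree:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <queue>
#include <vector>

struct Node {
    std::uint64_t freq;
    int symbol;                              // -1 for internal nodes
    std::unique_ptr<Node> left, right;
};

static void collectDepths(const Node* n, int depth, std::vector<int>& lengths) {
    if (n->symbol >= 0) {                    // leaf: its depth is the code word length
        lengths[n->symbol] = std::max(depth, 1);
        return;
    }
    collectDepths(n->left.get(), depth + 1, lengths);
    collectDepths(n->right.get(), depth + 1, lengths);
}

// freq[s] is the number of occurrences of symbol s in the input stream.
std::vector<int> huffmanCodeLengths(const std::vector<std::uint64_t>& freq) {
    auto cmp = [](const Node* a, const Node* b) { return a->freq > b->freq; };
    std::priority_queue<Node*, std::vector<Node*>, decltype(cmp)> heap(cmp);
    for (std::size_t s = 0; s < freq.size(); ++s)
        if (freq[s] > 0)
            heap.push(new Node{freq[s], static_cast<int>(s), nullptr, nullptr});
    while (heap.size() > 1) {                // merge the two least frequent nodes
        Node* a = heap.top(); heap.pop();
        Node* b = heap.top(); heap.pop();
        heap.push(new Node{a->freq + b->freq, -1,
                           std::unique_ptr<Node>(a), std::unique_ptr<Node>(b)});
    }
    std::vector<int> lengths(freq.size(), 0);
    if (!heap.empty()) {
        std::unique_ptr<Node> root(heap.top());
        collectDepths(root.get(), 0, lengths);
    }
    return lengths;
}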

Decoding of Huffman-encoded data is simple if the binary tree is known (or can be reconstructed). Then, bits are taken from the input bit stream, and depending on the bit value, the left or right child of a tree node is entered, starting at the root node. As soon as a leaf is encountered, the end of the code word is reached (due to the prefix property of the code). The leaf’s symbol value is written to the output, and the position in the tree is reset to the root node. However, it is inefficient to store the whole Huffman tree in the compressed file. To reconstruct the tree at decode time, the exact symbol frequencies would be required, and their storage would also reduce efficiency. A more efficient solution to store the Huffman table is the so-called canonical Huffman code, which is used in both bzip2 and our implementation.

Canonical Huffman Coding

Canonical Huffman coding 40 takes the original Huffman code (i. e., a table which assigns a code word of bits to each symbol) created above as an input, but subsequently modifies the code. This modification is performed as follows. The Huffman code table is sorted first in ascending order by code word length, and secondly by symbol. All code words preserve their bit lengths, but the bits are overwritten. The first code word from the sorted list is overwritten by all zero bits (keeping the length). Subsequent code words are overwritten by the next binary number in sequence, such that the code word is larger in value than all preceding code words. If a code word with larger bit length is reached, the required amount of zero bits is appended at the right side after increasing the binary number. This procedure yields a canonical Huffman code table. All symbols are encoded by code words with the same bit length as in the non-canonical Huffman code, and therefore, the compression efficiency is not influenced. However, the canonical Huffman table itself can be stored with much less information, as explained in the following.

For storage, the canonical Huffman table is first sorted in ascending order by symbol, so that the code words of the symbols will be stored in alphabetical order. For each symbol, one needs to store three pieces of information: the symbol itself, the bit length of the code word, and the code word itself. However, as the symbols are stored in alphabetical order, the symbol itself can be omitted. As this is a canonical Huffman code, we have the additional information that an (alphabetically) later code word will always be higher in value than an earlier one of the same bit length. Taking this together, it is only necessary to store the bit length of each symbol’s code word in order to unambiguously reconstruct the canonical Huffman table. This is a large saving when compared to the storage requirements of a non-canonical Huffman tree.

As a final question, it will be discussed how the list of code word lengths for the canonical Huffman code is stored efficiently. Code word lengths are typically short, but symbols that occur very infrequently can have quite long code words, and the list of symbols can be very long (> 10^5). It would be inefficient to store every code word length with a constant number of bits large enough to also fit the longest code words. One possibility is the “border storage” which was explained in the section on alphabet storage. Another possibility is to take the list of code word lengths as input data for another subordinate canonical Huffman coding. The implementation presented here tests these possibilities, and chooses the one which requires the smallest number of bits. Indeed, the subordinate canonical Huffman coding can decide to store its code word length list by another canonical Huffman coding if this is the most efficient choice. The nesting level of this process is not limited in our implementation. In typical applications (see below), there are around three nested instances of canonical Huffman coding before the symbol count becomes so low that a simpler storage algorithm becomes more space efficient.
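The reconstruction of the canonical code words from the stored bit lengths can be sketched in a few lines (illustrative only; symbols are assumed to be the indices 0 to n−1, each with a nonzero code word length):

    // Illustrative sketch: canonical code words from code word lengths.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    std::vector<std::uint32_t> canonicalCodes(const std::vector<int> &length) {
        std::vector<int> order(length.size());
        for (std::size_t i = 0; i < order.size(); i++) order[i] = static_cast<int>(i);
        std::sort(order.begin(), order.end(), [&](int a, int b) {   // by length, then by symbol
            return length[a] != length[b] ? length[a] < length[b] : a < b;
        });
        std::vector<std::uint32_t> code(length.size(), 0);
        std::uint32_t next = 0;                       // first code word: all zero bits
        int prevLen = length[order.front()];
        for (int s : order) {
            next <<= (length[s] - prevLen);           // append zero bits when the length grows
            code[s] = next++;                         // following code word = next binary number
            prevLen = length[s];
        }
        return code;
    }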

Multi-Table Huffman Coding

Huffman coding possesses disadvantages when the input data features some degree of locality. A symbol which occurs infrequently in the complete input data will be assigned a long code word; even if there is a short block in the input data in which this symbol is very frequent, it keeps this long code word. One way to take advantage of such situations is to use multiple Huffman tables. The input data is separated into short blocks, and each such block is encoded with the Huffman table that leads to the smallest output size. This raises the question of how to construct different Huffman tables such that many blocks can be encoded efficiently with one of them. In the bzip2 source code, an elegant solution is found, which will be explained in the following.

As an example, three independent Huffman tables shall be constructed for an input data stream with known symbol frequencies. In the beginning, the three Huffman tables are constructed in such a way that each symbol is contained in exactly one Huffman table, and each Huffman table contains symbols which account for approximately 1/3 of the total input data. To achieve this, the alphabet is sorted in descending order by symbol frequency, so that the most frequent symbol is found in the beginning of the list. Then, symbols from the beginning of the list are assigned to the first Huffman table, until the sum of symbol frequencies exceeds 1/3 of the total symbol count. The following symbols are assigned to the second table, until the sum of symbol frequencies also in this table exceeds 1/3 of the total symbol count. All remaining symbols are assigned to the third table. The three Huffman tables are now constructed, based on the symbols assigned to them and their respective frequencies.

The input data is separated into small blocks. For each block, it is decided which of the three Huffman tables is most efficient in encoding the block data. If a symbol from the block is already contained in the Huffman table, it is clear how many bits its code word would occupy. If a symbol is not yet contained in this table, it is computed how much additional storage would be required by adding this symbol to the table (including the effect on all preceding blocks that have been encoded with this table). After all blocks of the input data have been processed, each of the blocks is assigned to one of the Huffman tables, and the Huffman tables have been augmented with many additional symbols (many symbols now appear in more than one Huffman table). The whole process is iteratively repeated, starting again from the first input data block, and deciding for each block which table would be most efficient to encode it. The total required storage size reduces with each iteration. Specialized Huffman tables will form, which are very efficient for specific types of data blocks. The iteration is stopped if the total size no longer decreases or if a maximum cycle count is reached.

After the iterative Huffman table optimization has finished, it is clear which block should be encoded by which Huffman table. This information needs to be stored at the start of each block to ensure correct decoding. Instead of simply storing the integer number of the used Huffman table in front of each block, a more efficient approach is used in bzip2. The table number is stored by unary encoding, writing a sequence of n bits of value 1, terminated by a single bit of value 0. The number n of “1” bits represents the Huffman table that was used. Additionally, tables are stored in a move-to-front list, so that a table that is used will be moved to the beginning of the list. If the same table is used more than once in subsequent blocks, only a “0” bit needs to be stored in front of the block, because the Huffman table is already at the front of the list. The length of this unary code for table switching is also taken into account when determining which table is optimal for encoding a given block; therefore, the table from the preceding block has a slight bonus due to the very short table switching code.
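For illustration, the per-block table choice including the unary table switching cost can be sketched as follows (simplified: every table is assumed to already contain all symbols, i. e., the augmentation step described above is omitted):

    // Illustrative sketch: pick the cheapest Huffman table for one block,
    // including the unary/move-to-front cost of announcing the table.
    #include <cstddef>
    #include <vector>

    int pickTable(const std::vector<int> &block,                       // symbols of one block
                  const std::vector<std::vector<int>> &bitsPerSymbol,  // [table][symbol] code word lengths
                  const std::vector<int> &mtf)                         // current move-to-front order of table ids
    {
        int best = -1;
        long bestCost = 0;
        for (std::size_t pos = 0; pos < mtf.size(); pos++) {
            int t = mtf[pos];
            long cost = static_cast<long>(pos) + 1;                    // unary switch code: pos "1" bits + one "0" bit
            for (int s : block)
                cost += bitsPerSymbol[t][s];
            if (best < 0 || cost < bestCost) { best = t; bestCost = cost; }
        }
        return best;   // caller emits the unary code, moves the table to the front, then encodes the block
    }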

The optimal block length and Huffman table count strongly depend on the nature of the input data. Bzip2 uses 6 tables in its highest compression setting (“–9”), and a block length of 50 symbols. The implementation presented here is able to use very high table counts (> 100), but this no longer gives any practical advantage. The table count as well as the block length can be chosen by the user. For a discussion of the parameter dependence of the compression ratio and the default values, please see the “Results and Benchmarks” section.

Automated Tuning of Compression Parameters

As described above, the compression ratio is significantly influenced by many parameters which can be adjusted by the user, both in the front-end and compressor. There exist, for example, 24 numeric and 14 Boolean parameters which control the compression efficiency of volumetric data from Gaussian Cube files. The optimal choice of these parameters depends very much on the input data, which is the reason why these internal settings were left open to the user. However, it is very hard, if not impossible, for a user to determine the optimal combination of those parameters by a simple trial-and-error approach. Therefore, we implemented a scheme which automatically tunes the parameters for a particular input trajectory. To do so, first, a number of frames is read from the beginning of the trajectory (50 for position trajectories, 20 for volumetric trajectories) and stored in memory. Based on this sample, many different combinations of parameters are probed. The underlying approach is driven by a decision tree: based on the information already obtained from the input trajectory, only parameter combinations which are promising for this specific type of trajectory are considered. The parameter optimization can be performed on different levels, specified by an integer number between 0 and 4, where 0 means no automated parameter optimization, and 4 means “exhaustive optimization”. By default, optimization level 2 is selected for position trajectories (which typically takes a few seconds), while optimization level 1 is used for volumetric data trajectories (which normally runs for around 5 minutes). The optimal parameter combination is written to screen and log file, enabling the user to re-use this set of parameters to compress similar trajectories in the future without requiring another optimization pass. The input and output of parameters can be performed either by a sequence of command-line arguments (which is a very long line for ≈ 40 parameters) or by a single alphanumerical code which contains all relevant parameter values. Such a code is much more compact and easier to handle than a command line argument list, and enables the efficient archiving of good compression settings for different trajectories. For decompression, no parameters need to be specified, because all relevant information is contained in the compressed files. The parameters which were used for compression are written to screen upon decompression.
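The generic shape of such a tuning pass can be sketched as follows (purely illustrative; the decision-tree pruning of the candidate list and the actual parameter names are omitted):

    // Illustrative sketch: compress the in-memory sample with every candidate
    // parameter set and keep the set producing the smallest output.
    #include <cstddef>
    #include <vector>

    template <class ParamSet, class CompressedSizeFn>   // CompressedSizeFn(params) -> size in bytes
    ParamSet tuneParameters(const std::vector<ParamSet> &candidates, CompressedSizeFn compressedSize) {
        ParamSet best = candidates.front();             // assumes at least one candidate
        std::size_t bestSize = compressedSize(best);
        for (const ParamSet &p : candidates) {
            std::size_t s = compressedSize(p);
            if (s < bestSize) { bestSize = s; best = p; }
        }
        return best;
    }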

The BQB File Format

The output data from the compressor is a stream of bits. Such a raw sequence is impractical to handle, and needs to be encapsulated in a suitable file format for practical use. To this aim, we present the BQB file format here. Initially, it was considered as a “Binary Cube” file format (therefore the name BQB), but in the end it was designed as a multi-purpose file format, which can carry many different kinds of payload data related to trajectories. It does not offer the degree of flexibility known from formats such as HDF5; however, in contrast to HDF5, it considers a high compression ratio as a central design goal. Even the headers and control structures are compressed, and the format is built as a bit stream (i. e., not respecting byte boundaries) to avoid any unnecessary bits.

The BQB file format is defined on a per-frame basis. A BQB frame starts with a frame header, which contains a “magic number”, followed by a frame type specifier, the total payload size in bytes, and a CRC-32 code 48 of the frame payload. Depending on the frame type, a frame can either be a “short frame” or a “long frame”. A long frame can contain many different chunks of data, which can encode different types of information in one single frame. A short frame contains a fixed type of data for a specific application. There exist, e. g., short frames for storing compressed position data (when a position trajectory is compressed) and for storing both position and volumetric data in the same frame (when a volumetric trajectory is to be compressed). Both frame formats can optionally contain information on the (possibly non-orthorhombic) cell vectors and the atom types in the system, which often are not required in every single frame.
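The frame header fields just listed can be pictured as follows (a purely illustrative C++ sketch; the actual on-disk layout is a compressed bit stream defined by the BQB format documentation, not by this struct):

    // Illustrative in-memory view of the frame header fields named above.
    #include <cstdint>
    #include <vector>

    struct BqbFrameSketch {
        std::uint32_t             magic;         // "magic number" marking the start of a frame
        std::uint32_t             frameType;     // short frame, long frame, index frame, ...
        std::uint64_t             payloadBytes;  // total payload size in bytes
        std::uint32_t             crc32;         // CRC-32 of the payload for integrity checking
        std::vector<std::uint8_t> payload;       // compressed chunk(s) of position/volumetric data
    };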

Apart from that, also a short frame format for compressing arbitrary binary files is available (similar to bzip2). Most BQB files possess a special frame type at the end of the file, which is an index frame. Index frames in the middle of a BQB file are simply ignored. This brings the following advantage: if additional frames are written to an existing BQB file, an updated index frame is simply appended at the end, and the existing file data is not modified in any way. The index frame contains a list of all preceding frames in the file, together with their respective frame types and file offsets. This enables fast random access to arbitrary frames. Such an index frame is optional; it can be left out if it is not required. An application can probe the existence of an index frame by considering the last bytes of a BQB file. Index frames end with a special signature. If this signature is found, then the length of the index frame is stored in the bytes directly in front of this signature, so that the start of the index frame can be reconstructed without scanning the file. This index concept is inspired by the PDF document file format, where it is called a “catalog” and has proven to be suitable for many applications.

A BQB trajectory is simply a sequence of BQB frames. No data structures exist above the BQB frame level. This means that several BQB frames/trajectories can be concatenated by simply concatenating the underlying raw files. If required, a new index frame can be created for the concatenated file and appended at the very end. BQB files can also be concatenated by creating a so-called list file, which is a text file that contains the absolute file names of BQB files in each of its rows. The access to the individual BQB files is handled transparently, such that the list file behaves as a standard BQB file.

When trajectory data is compressed with polynomial extrapolation, the exact data from the previous n frames is required to decompress one frame. This means that to decompress a single frame in the middle of the BQB file, one needs to start at the first frame, and decompress all frames up to the desired frame in sequential order.

In many applications, trajectories are read sequentially anyway, and this might not be a concern. If random access to decompressed frames is desired, one can take a compromise: every 100 frames (for example), the temporal polynomial extrapolation order is reset to zero, so that one frame appears which is not dependent on any preceding frame. All following frames increase the extrapolation order by 1 each, up to the desired global extrapolation order. This concept is very similar to “key frames” in modern video encoding. If a random frame shall be decompressed, only 99 preceding frames need to be decompressed first in the worst case. The key frame interval can be adapted to the specific application. Key frames possess a special frame type, and can therefore be easily identified in the BQB index.

The implementation published together with this article offers support for compressing and decompressing position trajectories, volumetric trajectories, and binary files. Furthermore, BQB trajectories can be merged together or split into several smaller files. Finally, there exist functions to compare BQB trajectories to other trajectories (both the volumetric data and the atom positions), to add an index frame at the end of a BQB trajectory, and to check BQB trajectories for integrity (CRC-32 code and correctness of the index).
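As an illustration of the key frame mechanism described above, random access to an arbitrary frame can be sketched as follows (decompressFrame is a hypothetical callable, not a function of the published library):

    // Illustrative sketch: decode frame i by starting at the nearest preceding
    // key frame and rolling forward so the temporal extrapolation has its history.
    template <class DecompressFn>
    void decodeWithKeyFrames(int i, int keyInterval, DecompressFn decompressFrame) {
        int key = (i / keyInterval) * keyInterval;   // nearest key frame at or before frame i
        for (int f = key; f <= i; f++)
            decompressFrame(f);                      // earlier frames feed the extrapolation of frame i
    }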

Results and Benchmarks

To benchmark the compression efficiency of the algorithms presented here, some existing molecular dynamics trajectories have been used. Because electron density data is required for benchmarking the volumetric data compression, mostly DFT-based ab initio MD (AIMD) simulations were chosen. An overview of the trajectories considered here is given in Table 1. System P is a bulk phase AIMD simulation of liquid (R)-propylene oxide that was used to compute bulk phase vibrational spectra (ROA and Raman) before. 14 System G is similar to the previous system, but contains only one molecule of (R)-propylene oxide in the gas phase. System H equals system P, but possesses a higher grid resolution for the volumetric data. System E is a bulk phase AIMD simulation of the ionic liquid 1-ethyl-3-methylimidazolium acetate ([EMIm][OAc]), which was studied before. 12,49–52 Finally, system W is a force field simulation of 4096 water molecules described by the TIP4P model, 53,54 which by definition constrains all bonds and angles. Systems P, G, and H were simulated at 300 K with the CP2k program package, 55,56 while the other two simulations were conducted at 350 K using LAMMPS. 57

Table 1: Systems used as benchmarks for compression.

System | Composition            | Atoms | Cell Size (pm) | Volumetric Grid
P      | 32 (R)-propylene oxide | 320   | 1558           | 160 × 160 × 160
G      | 1 (R)-propylene oxide  | 10    | 1558           | 160 × 160 × 160
H      | 32 (R)-propylene oxide | 320   | 1558           | 240 × 240 × 240
E      | 36 [EMIm][OAc]         | 936   | 2121           | 216 × 216 × 216
W      | 4096 TIP4P water       | 12288 | 5014           | –

Position Trajectories

For benchmarking the compression of position trajectories, the systems E and W were chosen. As the position extrapolation is carried out independently for each atom, the compression ratio is almost independent of the total atom count, so that the system size is not of importance here. The full set of resulting output data sizes can be found in Table 2. “XYZ” is a simple text file format, which contains one atom position per row; no unnecessary characters (multiple whitespaces, more digits than required, etc.) have been written to ensure the smallest possible file size. “bzip2” and “xz” refer to command line compression utilities, 21,58 which have been used (in highest compression settings) to compress the XYZ text file. Please note that “xz” gives a high compression ratio, but is extremely slow at compressing data. “XTC” is a binary position trajectory format used by the Gromacs program package. 4–6 “BQB” corresponds to the compressed files created with the algorithms from this article. The DCD file format is not benchmarked here, because it simply stores the coordinates as double-precision (64 bit) floating point numbers, and is therefore not very size-effective. All numbers are averaged over 1 000 consecutive trajectory steps with the trajectory stride ∆t given in the second column.

Table 2: Comparison of compressed size for different position trajectories, trajectory strides ∆t, and coordinate precisions (see Table 1), averaged over 1 000 steps each. Compression parameters were automatically determined at optimization level 2 (default). “shuffle” means randomly selecting 1 000 time steps from a long trajectory. Last column depicts XYZ to BQB compression ratio. All sizes in kiB/frame.

System | ∆t (fs) | Precision (pm) | XYZ    | bzip2 –9 | xz –9  | XTC   | BQB   | BQB Ratio
E      | 0.5     | 1              | 17.13  | 3.19     | 1.15   | 3.59  | 0.52  | 32.94
E      | 0.5     | 0.1            | 19.87  | 5.63     | 2.76   | 4.73  | 0.78  | 25.47
E      | 0.5     | 0.01           | 22.61  | 7.12     | 4.06   | 5.89  | 1.11  | 20.37
E      | 0.5     | 0.001          | 25.36  | 8.78     | 5.32   | 6.99  | 1.56  | 16.26
E      | 1.0     | 0.001          | 25.35  | 8.83     | 5.81   | 7.00  | 2.69  | 9.42
E      | 2.0     | 0.001          | 25.36  | 8.85     | 6.42   | 7.00  | 4.73  | 5.36
E      | 4.0     | 0.001          | 25.37  | 8.87     | 7.32   | 7.00  | 5.58  | 4.55
E      | 10.0    | 0.001          | 25.39  | 8.88     | 8.58   | 7.00  | 6.07  | 4.18
E      | 40.0    | 0.001          | 25.56  | 8.91     | 9.52   | 7.00  | 6.79  | 3.76
E      | shuffle | 0.001          | 25.56  | 8.95     | 9.68   | 7.00  | 7.96  | 3.21
W      | 1.0     | 0.001          | 351.87 | 117.37   | 66.38  | 90.61 | 20.51 | 17.16
W      | 4.0     | 0.001          | 351.86 | 117.94   | 85.10  | 90.61 | 46.82 | 7.52
W      | 10.0    | 0.001          | 351.87 | 118.21   | 109.29 | 90.62 | 72.61 | 4.85
W      | 40.0    | 0.001          | 352.12 | 118.53   | 122.19 | 90.84 | 88.61 | 3.97

In the first four rows, the dependence of the compression ratio on the absolute precision of the coordinates is investigated. As expected, all file formats increase in size if more digits are required to be stored. For practical applications, a precision of 0.01 or 0.001 pm is recommended; therefore, these rows are most interesting. In the remaining rows, the influence of the trajectory stride on the compression efficiency is studied. Please note that the fourth row with ∆t = 0.5 fs is also part of this comparison. The last row for system E (“shuffle”) contains 1 000 randomly selected steps from a 100 000 step trajectory, so that the atoms no longer obey any continuous movement. The XYZ text format, the bzip2 compressed file, and the XTC file size are independent of the trajectory stride, which means that these formats do not exploit similarities between consecutive steps. The xz compressed file takes slight advantage from the continuous movement. A very strong effect is observed for the BQB file, which almost doubles its size when going from ∆t = 0.5 fs to 1.0 fs. This can be understood from the decreasing accuracy and reliability of the polynomial extrapolation with increasing trajectory stride. Interestingly, this means that 2 000 steps with ∆t = 0.5 fs require almost the same storage space as 1 000 steps with ∆t = 1.0 fs, which might encourage scientists to write trajectories with a finer stride. The BQB format is very effective at a trajectory stride of ∆t = 0.5 fs, yielding a compression ratio of 16:1 for system E with a precision of 10^-3 pm, which corresponds to 4.55 bits per coordinate. Lower precision results in even higher compression ratios (1.52 bits per coordinate at 1 pm precision). In the shuffled system, the BQB compression efficiency breaks down dramatically because the extrapolation scheme no longer leads to any entropy decrease. But even in this extreme case, the BQB file is still slightly smaller than the bzip2 and xz compressed trajectories, with only the Gromacs XTC file being smaller than BQB. In all cases except “shuffle”, the BQB file is the smallest among the methods compared here.

The results for system W with four different trajectory strides are presented in the last four rows of Table 2. As described above, all bonds and angles were constrained in this simulation; therefore, the fast movement of the hydrogen atoms is suppressed. This leads to a higher compression efficiency even at larger trajectory strides, which is easily seen by comparison to the results for system E. Just as noticed above, the BQB format is the most size-effective storage format among all competitors also for system W.

The fourth row from Table 2 (i. e., system E, ∆t = 0.5 fs, precision 0.001 pm) is visualized in Figure 5, together with the compression and decompression times of all methods. “FP32” refers to storing atom coordinates in single-precision floating point numbers (i. e., 32 bits per coordinate). It can be seen that apart from the best compression ratio, the BQB file format also has very reasonable compression and decompression times of 2.1 ms and 1.7 ms per frame, respectively (measured on a single CPU core “Intel Xeon E5-2609” at 2.5 GHz).

Finally, it is interesting to consider which contributions in the BQB file make up which percentage of the total size. The results are presented in Table 3 for system E, ∆t = 0.5 fs, precision 0.001 pm. For a description of the contributions, please refer to the section “The Compressor”.

Figure 5: Comparison of compression ratio and timing for a typical position trajectory (system E, see Table 1) with ∆t = 0.5 fs and a precision of 10^-3 pm. Averaged over 1 000 steps. Compression parameters were automatically determined at optimization level 2 (default). The algorithm presented here is on the far right.

Table 3: Size contributions to compressed BQB frame of position trajectory (system E, see Table 1) with ∆t = 0.5 fs and precision 0.001 pm. Averaged over 1 000 frames.

Contribution    | Bytes/Frame | %
Overhead        | 11.86       | 0.74
Alphabet        | 21.81       | 1.37
Huffman Tables  | 71.15       | 4.47
Table Switching | 14.63       | 0.92
Payload         | 1473.63     | 92.50
Total           | 1593.08     | 100.00


In the table, “Overhead” depicts all BQB format data fields (magic number, CRC-32, etc.), the atom labels and cell vectors, and the compressor flags (transformations which were applied, etc.). “Alphabet” is the storage of the alphabet data. “Huffman tables” are the storage of the canonical Huffman tables. “Table switching” corresponds to the unary encoding of the Huffman table in front of each block. Finally, “Payload” characterizes the actual Huffman-encoded atom position data. It can be seen that this contribution makes up more than 90 % of the total file size.

Volumetric Data Trajectories

The resulting file sizes from the compression of the benchmark volumetric data trajectories are presented in Table 4. As described above, the total electron density on a grid from the DFT calculation was used as volumetric data in each frame. “Cube” corresponds to the Gaussian Cube file format, 15 which is a simple text format for volumetric data with 5 significant digits; therefore, also 5 significant digits were stored in the BQB files. “FP32” is the hypothetical size of storing every grid entry in a 32 bit (single precision) floating point number, such as performed in the gOpenMol 17 PLT file format. “bzip2” and “xz” are command-line compression utilities, 21,58 which were used in highest compression settings to compress the Cube files (as xz is very slow, this took days to complete). “BQB” depicts the resulting size of the BQB trajectory, obtained with the algorithms presented here. All data was averaged over 1 000 volumetric data frames, except for the last row, where only one single volumetric frame was compressed.

In the first four rows, the dependence of the compression ratio on the trajectory stride ∆t is given. As expected, the “Cube”, “FP32”, and “bzip2” formats do not depend on the trajectory stride; they do not exploit similarities between consecutive frames. In the “xz” format, a small dependence is found. As already observed when compressing position trajectories, the largest trajectory stride dependence is encountered for the BQB format. The size of one BQB frame increases with rising trajectory stride because the polynomial extrapolation becomes less accurate and reliable with increasing trajectory stride.

Table 4: Comparison of compressed size for different volumetric data trajectories and trajectory strides (see Table 1), all with 5 significant digits, averaged over 1 000 steps each, except for “single”, where only a single volumetric frame is compressed. Compression parameters were automatically determined at optimization level 1 (default). Last column shows the Cube to BQB compression ratio. All sizes in MiB/frame.

System | ∆t (fs) | Cube   | FP32  | bzip2 –9 | xz –9 | BQB  | BQB Ratio
P      | 0.1     | 51.46  | 15.63 | 10.61    | 5.99  | 0.90 | 57.18
P      | 0.5     | 51.46  | 15.63 | 10.60    | 8.62  | 1.36 | 37.84
P      | 1.0     | 51.46  | 15.63 | 10.59    | 9.52  | 1.71 | 30.09
P      | 4.0     | 51.46  | 15.63 | 10.60    | 9.73  | 2.25 | 22.87
G      | 0.5     | 51.46  | 15.63 | 11.98    | 3.08  | 0.85 | 60.54
H      | 0.5     | 173.60 | 52.73 | 35.69    | 32.14 | 3.37 | 51.51
E      | 0.5     | 126.59 | 38.44 | 25.95    | 23.87 | 3.37 | 37.56
E      | single  | 126.59 | 38.44 | 26.01    | 24.10 | 6.61 | 19.15

The BQB format is very effective at ∆t = 0.5 fs, yielding a compression ratio of 37:1, which corresponds to only 2.79 bits per grid element (5 significant digits). The fifth row shows the data for the gas phase trajectory, where most of the volumetric data elements are very small in value. As described above, a smallest exponent of −12 was chosen here, so that all electron density values below 10^-12 were truncated to zero. Strictly speaking, this is no longer a lossless compression, but it preserves all data which is relevant for practical use, while it reaches a remarkable compression ratio of around 60:1. The sixth row depicts the data from the bulk phase system with a finer volumetric grid; it can be seen that a finer grid leads to a higher BQB compression ratio.

Finally, the last two lines contain the data for system E, both with ∆t = 0.5 fs and for a single volumetric frame. While in the former case a very good compression ratio is achieved in the BQB file, the latter case is pathological, as no temporal extrapolation is possible for a single frame. However, the BQB file is still by far smaller than the bzip2 and xz files (because spatial extrapolation still works), while it takes a much shorter compression time than bzip2 and xz. Even for this complicated case, the result is impressive: the single Gaussian Cube frame is reduced from 126 MiB to 6.6 MiB (5.5 bits per grid element) within seconds, and is therefore, e. g., suitable as an email attachment. It can be concluded that the BQB file format gives the smallest results among the competitors in this table in all cases, even for single frames without continuity in time.

The volumetric data set that was mentioned in the introduction, which would have required 24.5 TiB of storage space in Gaussian Cube file format, could be stored in a BQB file with a size of only 664 GiB, so that it fits on any single modern hard disk drive. In contrast to archive formats such as bzip2, random access to volumetric data frames is still possible in the compressed file, which is very useful for scientific applications.

Figure 6: Comparison of compression ratio and timing for a typical volumetric data trajectory (system E, see Table 1) with ∆t = 0.5 fs and 5 significant digits. Averaged over 1 000 steps. Compression parameters were automatically determined at optimization level 1 (default). The algorithm presented here is on the far right.

The seventh data row from Table 4 (i. e., system E, ∆t = 0.5 fs) is graphically presented in Figure 6, together with the compression and decompression times of the individual methods. It is visible that, apart from the by far best compression ratio, the BQB file format also has the shortest compression and decompression times, which are 8.5 s and 3.1 s per frame, respectively (all measured on a single CPU core “Intel Xeon E5-2609” at 2.5 GHz).

When considering the size of an uncompressed frame in Gaussian Cube format, this corresponds to a compression and decompression data rate of 15.0 MiB s^-1 and 40.6 MiB s^-1, respectively.

Table 5: Size contributions to compressed BQB frame of volumetric data trajectory (system E, ∆t = 0.5 fs, see Table 1) with single (cols. 2, 3) and multiple (cols. 4, 5) Huffman tables. Averaged over 1 000 frames.

Contribution    | 1 Table (kiB/Frame) | 1 Table (%) | 6 Tables, BL=20 (kiB/Frame) | 6 Tables, BL=20 (%)
Overhead        | 0.071               | 0.002       | 0.089                       | 0.003
Alphabet        | 0.246               | 0.007       | 0.246                       | 0.007
Huffman Tables  | 0.309               | 0.009       | 0.637                       | 0.019
Table Switching | −                   | −           | 105.927                     | 3.069
Payload         | 3621.663            | 99.982      | 3344.985                    | 96.903
Total           | 3622.289            | 100.000     | 3451.885                    | 100.000

Finally, the size contributions in the compressed BQB volumetric data trajectory (i. e., system E, ∆t = 0.5 fs) are inspected; see Table 5. For a description of the rows, please refer to the discussion of Table 3 above. The first two numeric columns correspond to a compressor run with only one Huffman table (i. e., no multi-table Huffman coding). Consequently, no size contribution for table switching appears here. Around 99.98 % of the total average frame size are actual payload data. When using multiple Huffman tables instead (numeric columns 3 and 4), a significant amount of the compressed data (more than 100 kiB per frame) is occupied by table switching. However, the compression with multiple Huffman tables is much more efficient, and the decrease in payload data size completely compensates for the additional table switching contributions, leading to an overall average frame size of almost 200 kiB (around 5 %) less than in the former case.

Conclusion

In this article, we presented our newly developed algorithm for lossless compression of position and volumetric data trajectories. In contrast to general compression algorithms (such as bzip2), our approach fully exploits the special structure of the input data (i. e., smoothness in time and space) to reach very high compression efficiency. We show that an accurate extrapolation scheme is the key to reducing data entropy and therefore reaching high compression ratios.

After an in-depth discussion of the mathematical and computational details involved in this project, we presented some benchmark data from compressing trajectories that we have published in the literature before. Concerning position trajectories at a precision of 10^-3 pm, our method reaches a compression ratio of around 15:1 with respect to XYZ files if the trajectory was written in every simulation step. This corresponds to 4.55 bits per coordinate on average. Lower precision results in even higher compression ratios (1.52 bits per coordinate at 1 pm precision). For larger trajectory strides, the compression efficiency is reduced (e. g., 5:1 if every 20th step was written). However, this is still the smallest output among all methods compared here, including the Gromacs XTC format and bzip2. Only for randomly shuffled simulation steps (where temporal smoothness cannot be utilized at all), our result is almost on par with bzip2, with the XTC file being slightly smaller. Despite yielding a superior compression ratio in almost all cases, our method is still reasonably fast both in compression and decompression, and therefore well suited for scientific use.

When it comes to volumetric data trajectories, compression ratios of around 35:1 with respect to Gaussian Cube files (5 significant digits) can be achieved if every simulation step is written to file, corresponding to only 2.79 bits per grid element. This compression ratio is almost a factor of 10 better than that obtained with bzip2, which is remarkable. In case the volumetric data contains large empty areas (e. g., electron density of a cluster of molecules in vacuum), we show that even better compression can be reached. In analogy to position trajectories, an increasing trajectory stride leads to a decrease in compression ratio also here. However, even if no temporal smoothness can be exploited, there is still some degree of spatial smoothness in the data. A single volumetric data frame (electron density of a bulk phase system) can still be compressed with a ratio of around 20:1 w.r.t. Gaussian Cube format.

The original cube file of 126.6 MiB was reduced to only 6.6 MiB (5.5 bits per grid element) within seconds, now being suitable as an email attachment. Also here it has to be noted that both compression and decompression are reasonably fast, especially when compared to methods like bzip2 and xz. We would like to mention that complex grid values (as encountered in wave functions) and vector fields can also be stored by our method; this will be implemented and benchmarked in a future update. Apart from that, the extrapolation-based compression might also be useful for image and video encoding, which shall be investigated in a follow-up study.

To efficiently and safely store the compressed data from our algorithms, we developed the BQB file format. Being specifically designed for applications in computational chemistry, it can not only store the trajectories themselves, but also many additional optional data fields, such as topology, atom labels, cell vectors, charges, velocities, etc. While this format is very space-efficient on the one hand, it is robust and safe on the other hand, as it contains several checks, such as CRC codes of the payload data. BQB files possess an index block at the end of the file, which enables quick random access to arbitrary frames.

The methods which are described in this article have been implemented in C++ code, which is licensed under the GNU LGPL v3, and is therefore free to use for the scientific community. We offer both a stand-alone program to compress and decompress trajectories (“bqbtool”) and a C/C++ library with this functionality (“libbqb”), which can be included in other software projects (such as simulation and visualization programs, which might directly write/read compressed trajectories). The source code of our implementations as well as a documentation of the BQB file format are freely available on the internet at “www.brehm-research.de/bqb”. We also included all methods into our free trajectory analysis tool TRAVIS, 22 which now can directly read and analyze compressed trajectories in BQB format without prior decompression.

Acknowledgement

MB acknowledges financial support by the Deutsche Forschungsgemeinschaft (DFG) through project Br 5494/1-1.

Conflict of Interest

The authors declare no competing financial interest.

Supporting Information Available

Detailed explanation of Burrows–Wheeler transformation, Move-to-Front transformation, and run-length encoding.

References

(1) XYZ File Format, see http://www.ccl.net/chemistry/resources/messages/1996/10/21.005dir/index.html (accessed July 24th 2018). (2) Berman, H. M. The Protein Data Bank: A Historical Perspective. Acta Crystallographica Section A 2007, 64.1, 88–95. (3) PDB file format, see http://www.wwpdb.org/documentation/file-format (accessed July 24th 2018). (4) XTC File Format, see http://manual.gromacs.org/online/xtc.html (accessed 24th July 2018). (5) Van Der Spoel, D.; Lindahl, E.; Hess, B.; Groenhof, G.; Mark, A. E.; Berendsen, H. J. C. Gromacs: Fast, Flexible, and Free. J. Comput. Chem. 2005, 26, 1701–1718.

(6) Hess, B.; Kutzner, C.; van der Spoel, D.; Lindahl, E. Gromacs 4: Algorithms for Highly Efficient, Load-balanced, and Scalable Molecular Simulation. J. Chem. Theory Comput. 2008, 4, 435–447. (7) de Buyl, P.; Colberg, P. H.; H¨ofling, F. H5MD: A Structured, Efficient, and Portable File Format for Molecular Data. Comput. Phys. Commun. 2014, 185, 1546–1553. (8) HDF5 File Format; see https://www.hdfgroup.org/ (accessed 24th July 2018). (9) Hirshfeld, F. L. Bonded-Atom Fragments for Describing Molecular Charge Densities. Theor. Chim. Acta 1977, 44, 129–138. (10) Fonseca, G. C.; Handgraaf, J.-W.; Baerends, E. J.; Bickelhaupt, F. M. Voronoi Deformation Density (VDD) Charges: Assessment of the Mulliken, Bader, Hirshfeld, Weinhold, and VDD Methods for Charge Analysis. J. Comput. Chem. 2004, 25, 189–210. (11) Thomas, M.; Brehm, M.; Fligg, R.; V¨ohringer, P.; Kirchner, B. Computing Vibrational Spectra from ab initio Molecular Dynamics. Phys. Chem. Chem. Phys. 2013, 15, 6608– 6622. (12) Thomas, M.; Brehm, M.; Kirchner, B. Voronoi Dipole Moments for the Simulation of Bulk Phase Vibrational Spectra. Phys. Chem. Chem. Phys. 2015, 17, 3207–3213. (13) Thomas, M.; Kirchner, B. Classical Magnetic Dipole Moments for the Simulation of Vibrational Circular Dichroism by ab initio Molecular Dynamics. J. Phys. Chem. Lett. 2016, 7, 509–513. (14) Brehm, M.; Thomas, M. Computing Bulk Phase Raman Optical Activity Spectra from ab initio Molecular Dynamics Simulations. J. Phys. Chem. Lett. 2017, 8, 3409–3414. (15) Gaussian Cube File Format; see http://paulbourke.net/dataformats/cube/ (accessed 24th July 2018).

(16) Gaussian Cubegen Utility; see http://gaussian.com/cubegen/ (accessed 24th July 2018). (17) Laaksonen, L. gOpenMol: Program for the Display and Analysis of Molecular Structures. Centre for Scientific Computing: Espoo, Finland. (18) Khan, N.; Masud, S.; Ahmad, A. A variable block size motion estimation algorithm for real-time H.264 video encoding. Signal Process. Image Commun. 2006, 21, 306 – 315. (19) Yin, H.; Cai, H.; Yang, E.; Zhou, Y.; Wu, J. An efficient all-zero block detection algorithm for high efficiency video coding with RDOQ. Signal Process. Image Commun. 2018, 60, 79 – 90. (20) Guarda, A. F.; Santos, J. M.; da Silva Cruz, L. A.; Assun¸ca˜o, P. A.; Rodrigues, N. M.; de Faria, S. M. A method to improve HEVC lossless coding of volumetric medical images. Signal Process. Image Commun. 2017, 59, 96 – 104. (21) Bzip2 Program, see http://www.bzip.org/ (accessed 24th July 2018). (22) Brehm, M.; Kirchner, B. TRAVIS – A Free Analyzer and Visualizer for Monte Carlo and Molecular Dynamics Trajectories. J. Chem. Inf. Model. 2011, 51, 2007–2023. (23) http://plasma gate.weizmann.ac.il/Grace/, Grace, (c) 1996–2015 Grace Development Team, see http://plasma-gate.weizmann.ac.il/Grace (accessed July 24th 2018). (24) Persistence of Vision Pty. Ltd. (2004), Persistence of Vision Raytracer (version 3.7), see http://www.povray.org (accessed 24th July 2018). (25) Verlet, L. Computer “Experiments” on Classical Fluids. I. Thermodynamical Properties of Lennard–Jones Molecules. Phys. Rev. 1967, 159, 98–103. (26) Anton, H.; Rorres, C. Elementary Linear Algebra, 9th ed.; John Wiley and Sons, Hoboken, New Jersey, 2005.

Journal of Chemical Information and Modeling

(27) Moore, E. H. On the Reciprocal of the General Algebraic Matrix. Bulletin of the American Mathematical Society 1920, 26, 394–395. (28) Bjerhammar, A. Application of Calculus of Matrices to Method of Least Squares; with Special References to Geodetic Calculations. Trans. Roy. Inst. Tech. Stockholm 1951, 49, 1–86. (29) Penrose, R. A Generalized Inverse for Matrices. Math. Proc. Cambridge Philos. Soc. 1955, 51, 406–413. (30) Golub, G.; Kahan, W. Calculating the Singular Values and Pseudo-inverse of a Matrix. Journal of the Society for Industrial and Applied Mathematics Series B Numerical Analysis 1965, 2, 205–224. (31) Noether, E. Invariante Variationsprobleme. Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-Physikalische Klasse 1918, 1918, 235–257. (32) Wendler, K.; Zahn, S.; Dommert, F.; Berger, R.; Holm, C.; Kirchner, B.; Delle Site, L. Locality and Fluctuations: Trends in Imidazolium-Based Ionic Liquids and Beyond. J. Chem. Theory Comput. 2011, 7, 3040–3044, PMID: 26598146. (33) Wang, H.; Hartmann, C.; Schütte, C.; Delle Site, L. Grand-Canonical-like Molecular-Dynamics Simulations by Using an Adaptive-Resolution Technique. Phys. Rev. X 2013, 3, 011018. (34) Agarwal, A.; Wang, H.; Schütte, C.; Site, L. D. Chemical potential of liquids and mixtures via adaptive resolution simulation. The Journal of Chemical Physics 2014, 141, 034102. (35) Hilbert, D. Über die stetige Abbildung einer Linie auf ein Flächenstück. Mathematische Annalen 1891, 38, 459–460.

(36) Butz, A. R. Alternative Algorithm for Hilbert’s Space-Filling Curve. IEEE Trans. Comp. 1971, 20, 424–426. (37) Knuth, D. The Art of Computer Programming. 3: Sorting and Searching, 2nd ed.; Addison-Wesley, Boston, US, 1998. (38) Cormen, T. H.; Leiserson, C. E.; Rivest, R. L.; Stein, C. Introduction to Algorithms, 2nd ed.; MIT Press, Cambridge, Massachusetts, 2001. (39) Huffman, D. A. A Method for the Construction of Minimum-Redundancy Codes. Proc. IRE 1952, 40, 1098–1101. (40) Klein, S. T. Space- and Time-efficient Decoding with Canonical Huffman Trees. Combinatorial Pattern Matching. Berlin, Heidelberg, 1997; pp 65–75. (41) Burrows, M.; Wheeler, D. J. A Block-Sorting Lossless Data Compression Algorithm; 1994. (42) Tech Correspondence, C. Technical Correspondence. Commun. ACM 1987, 30, 792–796. (43) Robinson, A. H.; Cherry, C. Results of a Prototype Television Bandwidth Compression Scheme. Proc. IEEE 1967, 55, 356–364. (44) Hoare, C. A. R. Algorithm 64: Quicksort. Commun. ACM 1961, 4, 321. (45) Shannon, C. E. A Mathematical Theory of Communication. Bell System Technical Journal 1948, 27, 379–423, 623–656. (46) MacKay, D. Information Theory, Inference, and Learning Algorithms; Cambridge University Press, Cambridge, England, 2003. (47) Gilbert, E. N.; Moore, E. F. Variable-Length Binary Encodings. Bell System Technical Journal 1959, 38, 933–967.

Journal of Chemical Information and Modeling

(48) Peterson, W. W.; Brown, D. T. Cyclic Codes for Error Detection. Proc. IRE 1961, 49, 228–235. (49) Brehm, M.; Weber, H.; Pensado, A. S.; Stark, A.; Kirchner, B. Proton Transfer and Polarity Changes in Ionic Liquid-Water Mixtures: A Perspective on Hydrogen Bonds from ab initio Molecular Dynamics at the Example of 1-Ethyl-3-methylimidazolium Acetate-Water Mixtures – Part 1. Phys. Chem. Chem. Phys. 2012, 14, 5030–5044. (50) Brehm, M.; Weber, H.; Pensado, A. S.; Stark, A.; Kirchner, B. Liquid Structure and Cluster Formation in Ionic Liquid/Water Mixtures – An Extensive ab initio Molecular Dynamics Study on 1-Ethyl-3-methylimidazolium Acetate/Water Mixtures – Part 2. Z. Phys. Chem. 2013, 227, 177–203. (51) Thomas, M.; Brehm, M.; Hollóczki, O.; Kelemen, Z.; Nyulászi, L.; Pasinszki, T.; Kirchner, B. Simulating the Vibrational Spectra of Ionic Liquid Systems: 1-Ethyl-3-methylimidazolium Acetate and its Mixtures. J. Chem. Phys. 2014, 141, 024510. (52) Brehm, M.; Sebastiani, D. Simulating Structure and Dynamics in small Droplets of 1-Ethyl-3-methylimidazolium Acetate. J. Chem. Phys. 2018, 148, 193802. (53) Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L. Comparison of Simple Potential Functions for Simulating Liquid Water. J. Chem. Phys. 1983, 79, 926–935. (54) Jorgensen, W. L.; Madura, J. D. Temperature and Size Dependence for Monte Carlo Simulations of TIP4P Water. Mol. Phys. 1985, 56, 1381–1392. (55) The CP2k Developers Group, 2018. CP2k is freely available from http://www.cp2k.org/ (accessed 24th July 2018). (56) Hutter, J.; Iannuzzi, M.; Schiffmann, F.; VandeVondele, J. Cp2k: Atomistic Simulations of Condensed Matter Systems. WIREs Comput. Mol. Sci. 2013, 4, 15–25.

(57) Plimpton, S. Fast Parallel Algorithms for short-range Molecular Dynamics. J. Comp. Phys. 1995, 117, 1–19. (58) Xz File Format, see https://tukaani.org/xz/format.html (accessed 24th July 2018).
