Subject: Design note: Data compression (draft)
Date: Saturday 27th July 2013 09:30:42 UTC
Data compression

Overview

This note examines design and implementation issues of a transparent data compression feature for Tux3. First, we look at the various parts of the code base that will be affected, with a brief description of the changes or additions required. Then design details are considered, including rationale for the design decisions in terms of efficiency and implementation complexity.

We contemplate this work being done in two or three phases. Initially, compressed data will be stored with no attempt to pack compressed extent tail fragments efficiently. Expected compression ratio would be good but not stellar. A second implementation phase would introduce a space-efficient means of storing compressed tail fragments. A final clean up phase would remove some restrictions adopted to ease initial implementation, such as "compression stride", with a further modest improvement in compression ratio expected.

Overview of Changes

mount options - enable/disable transparent compression globally. Specify compression method and possibly compression control parameters.

ioctl interface - enable/disable compression for a given file and possibly specify compression parameters (alternatively, rely on global mount option parameters). Possibly set/clear a directory "sticky" bit so that new files created in the directory have compression on/off.

metadata - compressed data size is generally not the same as logical data size and is measured in bytes as opposed to blocks. Metadata format (dleaf) must at minimum record the number of physical blocks used by the extent, which is currently recorded implicitly as the difference between logical addresses of successive dleaf entries. Only sufficient metadata needs to be available in the dleaf to allow a compressed block to be loaded. Metadata not needed to load a compressed extent, such as compression method and parameters, is recorded in the header of the compressed block itself.
map_region - maps cache page to media data for read or write. Updates metadata for the latter. Normally, data may be transferred directly between media and cache pages, but when compression/decompression is required then the data must be transformed, so the transfer becomes a multi stage operation.

read path - the traditional block IO library scheme for buffered reads is not well suited to the multi stage transfer required for decompression. Rather than work around this we will reimplement that functionality inside Tux3 in a form suited to decompression, and possibly also better suited to its traditional task. It is likely that implementing a new cache read scheme will be of comparable complexity to working around deficiencies of the traditional scheme, while delivering a superior result.

Metadata format

Our existing metadata format requires only minor elaboration to accommodate compression. A dleaf itself only needs to specify how to load a compressed extent, that is, its location and length. Only the latter is new, because compressed block count is potentially different from logical block count. Additional details needed to perform the actual decompression, such as compression algorithm, can be stored in the compressed extent itself. So we need just one new field per dleaf extent entry: the physical extent block count.

The physical count can be limited to a fairly small number because there is not much advantage in compressing very long regions in terms of compression efficiency, and a distinct disadvantage for random IO. So 8 bits for the compressed block count should be sufficient. Currently, we provide a 32 bit "version" field in each entry of the dleaf2 format, which is more precision than necessary. We can reasonably reduce that to 24 bits to accommodate the new compressed extent count field.
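As a rough illustration of the proposed field split, the 32 bit entry word could carry a 24 bit version plus the new 8 bit physical block count. The sketch below shows only the packing arithmetic; the names and bit positions are illustrative, not the actual on-media Tux3 dleaf2 layout:

```python
# Sketch of splitting the existing 32 bit dleaf2 "version" word into a
# 24 bit version plus an 8 bit compressed (physical) block count.
# Names and bit positions are illustrative, not the real Tux3 format.

VERSION_BITS = 24
VERSION_MASK = (1 << VERSION_BITS) - 1   # 0x00ffffff
COUNT_MASK = 0xff

def pack_entry(version, physical_count):
    """Pack a 24 bit version and 8 bit physical block count into 32 bits."""
    assert 0 <= version <= VERSION_MASK and 0 <= physical_count <= COUNT_MASK
    return (physical_count << VERSION_BITS) | version

def unpack_entry(word):
    """Recover (version, physical_count) from a packed 32 bit word."""
    return word & VERSION_MASK, (word >> VERSION_BITS) & COUNT_MASK

word = pack_entry(version=0x123456, physical_count=5)
assert unpack_entry(word) == (0x123456, 5)
```

An 8 bit count bounds a compressed extent at 255 physical blocks, comfortably above any stride we contemplate below.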
Compression stride

Writing a compressed extent that logically overlaps an existing compressed extent would require reading the original and rewriting the entire affected region, either shortening the original or the new extent, or merging them. Multiple compressed extents could possibly be overlapped. If uncompressed extents are overlapped by a new compressed extent, the uncompressed extent does not need to be read into cache but can be shortened just by updating dleaf2 entries. The complexity of handling overlap on compressed extent write is considerable, therefore it is expedient to plan an initial implementation that prevents overlapping compressed writes by adopting a fixed logical "compression stride". A compression stride of 16 would mean always compressing 16 block regions aligned on 16 block boundaries. To save a modest amount of CPU on arithmetic, the compression stride may be restricted to a power of 2.

Compression tends to improve with the size of input data available. For typical sliding window schemes like zlib, window size might be up to 64K. Therefore, a very small compression stride would reduce compression efficiency. (How much is an open question, which we should research.) On the other hand, an overly large stride would harm random rewrite performance. Where the optimum lies is a good question, to which there may be no fixed answer. However, there would appear to be a range of possible values with both good compression ratio and good random rewrite performance. For the time being, our default compression stride will be 16 blocks.

To provide a limited form of forward compatibility, only writes should be restricted by compression stride. Reads should allow unrestricted extent alignment. Once chosen for a given file, the compression stride can be made smaller only at the cost of reintroducing the possibility of write overlap.
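The arithmetic saving from a power of two stride is simple masking. A minimal sketch of rounding a logical write region outward to stride boundaries (function names are illustrative):

```python
# Illustrative arithmetic for a power-of-two compression stride:
# expand a logical block region [start, start + count) outward to
# stride boundaries using bit masks instead of division.

STRIDE = 16  # blocks; the default stride proposed in this note

def stride_span(start, count, stride=STRIDE):
    """Return (aligned_start, aligned_count) covering the region,
    expanded to stride boundaries. Requires stride to be a power of 2."""
    mask = stride - 1
    first = start & ~mask                    # round start down
    last = (start + count + mask) & ~mask    # round end up
    return first, last - first

# A 3 block write at block 5 compresses the whole first 16 block region.
assert stride_span(5, 3) == (0, 16)
# A write spanning a stride boundary covers two regions.
assert stride_span(30, 4) == (16, 32)
```

Every compressed write then covers whole strides, so no compressed extent can partially overlap another.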
Compressed extent tail fragments

With short compressed extents, wastage in the compressed tail fragment becomes relatively large. If the average size of a compressed extent is three blocks then wastage in the tail fragment will be roughly 14%, rather large when the principal objective is to save space. This could be amortized across a larger compressed extent size, with possible negative impact on random read/write efficiency; however, this will not help with small files. We therefore desire a means of packing tail fragments together with other filesystem data or metadata to eliminate this wastage, while hopefully not greatly reducing read or write efficiency.

The filesystem wastage in the absence of packing could range from zero to considerably more than 50% in extreme cases, with an expected benefit in the range of 10%. In compression terms, this is significant, however not so significant that we are compelled to implement such an optimization immediately. We therefore regard tail fragment packing as a follow on project from the initial compression work, in the sense that a respectable compression ratio is expected for typical usage even while wasting considerable slack space in tail fragments. Nonetheless, we should have a plausible design for tail packing now, to serve as a reference for other development work and to facilitate this follow on work.

To avoid false sharing and its negative effect on write performance, tail fragments should be packed together with other data that is functionally or temporally related, in the sense that it tends to be needed in cache at the same time and has a similar lifetime on media. The current Tux3 design and implementation already supports a variable number of variable length attributes in inodes, which looks attractive as a means of storing compressed tail fragments. The lifetime of a tail fragment tends to be similar to the lifetime of the inode, and the problem of false sharing with other files is avoided.
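The 14% figure above can be checked with back-of-envelope arithmetic, assuming a tail block that is on average half full:

```python
# Back-of-envelope check of the tail wastage estimate: an "average
# three block" compressed extent is modeled as three full blocks plus
# a half-full tail block, so expected slack is half a block out of 3.5.

avg_extent_blocks = 3.5   # three full blocks plus a partial tail block
avg_slack_blocks = 0.5    # expected unused space in the tail block

wastage = avg_slack_blocks / avg_extent_blocks
assert 0.14 < wastage < 0.15   # roughly 14%
```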
Some difficulties arise. The size of a tail fragment should be limited to some fraction of the block size in order to avoid introducing harmful corner cases in btree splitting. However, a tail fragment can be as big as a block, less a byte. It therefore seems clear that we need to support storing multiple fragment attributes per tail fragment. We would store the logical, sub-block offset of each tail fragment in the attribute header. There may be multiple compressed extents with tail fragments, which must be uniquely indexed. The physical block address of the compressed extent would suffice for this purpose. So a tail fragment attribute header would contain, in addition to attribute type and version:

* physical block address of compressed extent (lookup key)
* fragment offset within tail block
* fragment length

The fragment data follows immediately. During the usual attribute scan at inode open time, fragments will be reassembled and cached until demanded by read access to a non-present cache block backed by the associated compressed extent. The blocks of the compressed extent, less the tail fragment, will be read into an intermediate buffer (the volmap cache may serve for this purpose). Then the compressed extent, including the reassembled tail fragment, will be decompressed into the file page cache.

A weakness of this scheme is that an inode that includes many compressed extent tail fragments may become inconveniently large, spanning many inode tree blocks and increasing inode load and save times. A simple resolution is to limit the number of tail fragments stored, then fall back to storing tails together with the rest of the compressed extent, just as in the case where no tail fragment packing scheme is available, with consequently degraded compression ratio. As a more elaborate but superior approach, a new kind of fixed length attribute, a "fragment tree", could be an extent tree, similar to a dtree, where fragments are stored end to end in each referenced extent.
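To make the header layout concrete, a tail fragment attribute might serialize as follows. The field widths here are assumptions chosen for illustration only; the actual Tux3 attribute encoding would differ:

```python
import struct

# Hypothetical serialization of a tail fragment attribute, following
# the field list above: attribute type, version, physical block address
# of the compressed extent (the lookup key), fragment offset within the
# tail block, fragment length, then the fragment data itself.
# Field widths are assumptions for illustration, not the Tux3 format.

HDR = struct.Struct("<BIQHH")  # type, version, block address, offset, length

def pack_fragment(attr_type, version, extent_block, offset, data):
    """Serialize one tail fragment attribute: header plus raw bytes."""
    return HDR.pack(attr_type, version, extent_block, offset, len(data)) + data

def unpack_fragment(buf):
    """Parse a serialized tail fragment attribute back into its fields."""
    attr_type, version, extent_block, offset, length = HDR.unpack_from(buf)
    data = buf[HDR.size:HDR.size + length]
    return attr_type, version, extent_block, offset, data

blob = pack_fragment(0x21, 7, extent_block=123456, offset=100, data=b"tail bytes")
assert unpack_fragment(blob) == (0x21, 7, 123456, 100, b"tail bytes")
```

The fragment tree variant described above would index many such records, stored end to end in dedicated extents rather than inline in the inode.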
Such a fragment tree would avoid expending work on tail fragments until actually demanded, and would not require the entire structure to be rewritten if only one of the extents is affected by an update. It is apparent that improving compression ratio modestly by storing tail fragments compactly requires nontrivial implementation work. However, the gains will most probably justify the effort.

Transfer path

With compression, data no longer moves directly from cache to media; instead it is compressed or decompressed using an intermediate buffer, then sent to its final destination. Compression also changes the size of data, breaking the normal one-to-one correspondence between cache blocks and media blocks.

By way of review, map_region is responsible for mapping logically mapped file cache IO requests to media blocks. Map_region itself does not perform data IO, but may read metadata as necessary into volume cache. Map_region takes a file cache region as input and returns some number of physical media extents as a result. For initial write or nondestructive rewrite, these extents are newly allocated from volume free space, while for read or destructive rewrite, existing extents keyed by logical file address are retrieved from cached file metadata. For write, extent metadata is added for newly allocated extents and preexisting extents are resized or removed as necessary. For read, cache blocks not covered by existing extent mappings are reported as holes.

Transparent compression imposes additional requirements on map_region. Unlike normal media extents, compressed extents are indivisible in the sense that an extent must be available in its entirety in order to decompress it. For write, any partially overwritten compressed extents must be read into cache and partially recompressed, which introduces a new requirement for read before write.

Write Path

Compression is done at delta marshal time, for all dirty inodes in a delta.
Memory lifetime of compressed data is from marshal to delta complete. Therefore a common, expandable buffer may be used to hold compressed extent blocks. Each compressed write bio references this intermediate buffer rather than the original file cache pages. Read before write is required in the case of rewrite within an existing compressed extent.

Under high fragmentation conditions, a single compressed extent may be split across multiple physical extents. We may ignore this possibility in our initial implementation, and accept the risk of allocation failure under rare fragmentation conditions. Eventually, multiple physical extents per compressed extent should be supported in order to guarantee full utilization of a fragmented volume. When multiple physical extents per compressed extent are allowed, map_region needs to know which physical extents belong to a given compressed extent. No new metadata is strictly required for this: the logical length of all but the last of the dleaf entries for the extent will be zero (in dleaf2 format, the logical offsets are equal). However, it may be expedient to explicitly mark "continuation" extents in order to facilitate finding the first extent.

Read Path

We already replace the core kernel dirty inode flushing mechanism with our own, because we cannot tolerate core's current behavior of flushing inodes effectively randomly without reference to our consistency requirements, but we still use core's multi-page buffer read scheme, which interfaces to a filesystem via the ->get_block hook. However, this traditional API is inadequate for transparent compression because of the assumption that the mapped data should be read directly into file cache. In fact, we must read the data into some temporary buffer before decompressing into file cache. Unlike the write case, cache reads take place as frontend activity at unpredictable times.
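In miniature, the decompress-on-read flow (read the whole compressed blocks into a temporary buffer, append any separately packed tail fragment, decompress into file cache) might be modeled like this. This is a toy sketch with illustrative names; zlib merely stands in for whatever compression method Tux3 adopts:

```python
import os
import zlib

BLOCK_SIZE = 4096

# Toy model of the read path: whole blocks of a compressed extent are
# read into a temporary buffer, the packed tail fragment is appended,
# and the reassembled extent is decompressed. zlib is a stand-in for
# the eventual compression method; names are illustrative.

def read_compressed_extent(media, extent_blocks, tail_fragment):
    """Reassemble a compressed extent from whole media blocks plus a
    separately stored tail fragment, then decompress it."""
    buf = b"".join(media[b] for b in extent_blocks)  # temporary buffer
    return zlib.decompress(buf + tail_fragment)

# Fake media: compress an 8 block region, store the whole compressed
# blocks in a dict and keep the partial tail block as a fragment.
data = os.urandom(8 * BLOCK_SIZE)
comp = zlib.compress(data)
whole = len(comp) // BLOCK_SIZE
media = {b: comp[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE] for b in range(whole)}
fragment = comp[whole * BLOCK_SIZE:]

assert read_compressed_extent(media, range(whole), fragment) == data
```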
We might allocate and free such temporary read buffers on demand per compressed read, or we might operate our own pool, if that is more efficient. A large read might cover many compressed extents. A simple, synchronous read, decompress, copy loop would block the CPU on IO wait when it could be doing decompression work or resolving other read mappings. Instead, our read mechanism should be content just to submit all required reads, with decompression and cache loading work being scheduled by read completions and done in a worker thread.
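The submit-then-decompress structure suggested above can be sketched as follows, with a thread pool standing in for bio completion plus a decompression worker; names and the use of zlib are illustrative assumptions:

```python
import concurrent.futures
import zlib

# Sketch of the proposed read structure: submit all extent reads up
# front, then decompress each extent as its read completes instead of
# blocking the submitter on IO. The executor stands in for bio
# completion plus a worker thread; names are illustrative.

def read_extent(extent):
    """Stand-in for a media read (bio submission and completion)."""
    return extent["compressed"]

def decompress_into_cache(raw):
    """Stand-in for the worker-thread decompression step."""
    return zlib.decompress(raw)

def read_many(extents):
    """Map logical position to decompressed data for many extents."""
    cache = {}
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(read_extent, e): e["logical"] for e in extents}
        for fut in concurrent.futures.as_completed(futures):
            cache[futures[fut]] = decompress_into_cache(fut.result())
    return cache

extents = [{"logical": i, "compressed": zlib.compress(bytes([i]) * 4096)}
           for i in range(4)]
cache = read_many(extents)
assert all(cache[i] == bytes([i]) * 4096 for i in range(4))
```

In the real filesystem the scheduling would of course be driven by read completion callbacks rather than a thread pool, but the shape is the same: no CPU sits in IO wait while other mappings could be resolved or other extents decompressed.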