From 791d9daa51e413d5dc63fc938213b754ebe9b871 Mon Sep 17 00:00:00 2001 From: Dan Reif <881372+BMDan@users.noreply.github.com> Date: Thu, 24 Mar 2022 15:10:01 -0700 Subject: [PATCH 1/2] Correct formatting errors in Encodings.md In particular, address the issue that formatting caused the text after "The data stream looks like:" in "Delta-length byte array" to disappear entirely. The remainder of changes are simply for consistency and readability. --- Encodings.md | 60 +++++++++++++++++++++++++++++++++------------------- 1 file changed, 38 insertions(+), 22 deletions(-) diff --git a/Encodings.md b/Encodings.md index 4f56104c7..679695253 100644 --- a/Encodings.md +++ b/Encodings.md @@ -22,9 +22,9 @@ Parquet encoding definitions This file contains the specification of all supported encodings. -### Plain: (PLAIN = 0) +### Plain (PLAIN = 0) -Supported Types: all +*Supported Types: all* This is the plain encoding that must be supported for types. It is intended to be the simplest encoding. Values are encoded back to back. @@ -151,7 +151,7 @@ Note that the BIT_PACKED encoding method is only supported for encoding repetition and definition levels. ### Delta Encoding (DELTA_BINARY_PACKED = 5) -Supported Types: INT32, INT64 +*Supported Types: INT32, INT64* This encoding is adapted from the Binary packing described in ["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf) by D. Lemire and L. Boytsov. @@ -192,40 +192,56 @@ If, in the last block, less than `````` miniblo The following examples use 8 as the block size to keep the examples short, but in real cases it would be invalid. #### Example 1 +``` 1, 2, 3, 4, 5 +``` -After step 1), we compute the deltas as: +After step 1, we compute the deltas as: +``` 1, 1, 1, 1 +``` The minimum delta is 1 and after step 2, the deltas become +``` 0, 0, 0, 0 +``` The final encoded data is: - header: -8 (block size), 1 (miniblock count), 5 (value count), 1 (first value) - - block -1 (minimum delta), 0 (bitwidth), (no data needed for bitwidth 0) +``` +header: + 8 (block size), 1 (miniblock count), 5 (value count), 1 (first value) +block: + 1 (minimum delta), 0 (bitwidth), (no data needed for bitwidth 0) +``` #### Example 2 -7, 5, 3, 1, 2, 3, 4, 5, the deltas would be +``` +7, 5, 3, 1, 2, 3, 4, 5 +``` +The deltas would be: + +``` -2, -2, -2, 1, 1, 1, 1 +``` The minimum is -2, so the relative deltas are: +``` 0, 0, 0, 3, 3, 3, 3 +``` -The encoded data is - - header: -8 (block size), 1 (miniblock count), 8 (value count), 7 (first value) +The encoded data is: - block --2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2 bits) +``` +header: + 8 (block size), 1 (miniblock count), 8 (value count), 7 (first value) +block: + -2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2 bits) +``` #### Characteristics This encoding is similar to the [RLE/bit-packing](#RLE) encoding. However the [RLE/bit-packing](#RLE) encoding is specifically used when the range of ints is small over the entire page, as is true of repetition and definition levels. It uses a single bit width for the whole page. @@ -233,7 +249,7 @@ The delta encoding algorithm described above stores a bit width per miniblock an ### Delta-length byte array: (DELTA_LENGTH_BYTE_ARRAY = 6) -Supported Types: BYTE_ARRAY +*Supported Types: BYTE_ARRAY* This encoding is always preferred over PLAIN for byte array columns. @@ -244,15 +260,15 @@ and possibly better compression in the data (it is no longer interleaved with th The data stream looks like: +``` +``` -For example, if the data was "Hello", "World", "Foobar", "ABCDEF": - -The encoded data would be DeltaEncoding(5, 5, 6, 6) "HelloWorldFoobarABCDEF" +For example, if the data was "Hello", "World", "Foobar", "ABCDEF", the encoded data would be `DeltaEncoding(5, 5, 6, 6)` followed by `"HelloWorldFoobarABCDEF"`. ### Delta Strings: (DELTA_BYTE_ARRAY = 7) -Supported Types: BYTE_ARRAY +*Supported Types: BYTE_ARRAY* This is also known as incremental encoding or front compression: for each element in a sequence of strings, store the prefix length of the previous entry plus the suffix. @@ -264,7 +280,7 @@ the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY). ### Byte Stream Split: (BYTE_STREAM_SPLIT = 9) -Supported Types: FLOAT DOUBLE +*Supported Types: FLOAT, DOUBLE* This encoding does not reduce the size of the data but can lead to a significantly better compression ratio and speed when a compression algorithm is used afterwards. From 32f90947128049dbe6abee05239b357c72ec517f Mon Sep 17 00:00:00 2001 From: Dan Reif <881372+BMDan@users.noreply.github.com> Date: Thu, 24 Mar 2022 15:15:46 -0700 Subject: [PATCH 2/2] Update Encodings.md Remove remaining colons in category headers. Ensure all encodings have anchors corresponding to their constant names. --- Encodings.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/Encodings.md b/Encodings.md index 679695253..2b551d52c 100644 --- a/Encodings.md +++ b/Encodings.md @@ -46,7 +46,7 @@ For native types, this outputs the data as little endian. Floating For the byte array type, it encodes the length as a 4 byte little endian, followed by the bytes. -### Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8) +### Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8) The dictionary encoding builds a dictionary of values encountered in a given column. The dictionary will be stored in a dictionary page per column chunk. The values are stored as integers using the [RLE/Bit-Packing Hybrid](#RLE) encoding. If the dictionary grows too big, whether in size @@ -123,7 +123,7 @@ data: * Dictionary indices * Boolean values in data pages, as an alternative to PLAIN encoding -### Bit-packed (Deprecated) (BIT_PACKED = 4) +### Bit-packed (Deprecated) (BIT_PACKED = 4) This is a bit-packed only encoding, which is deprecated and will be replaced by the [RLE/bit-packing](#RLE) hybrid encoding. Each value is encoded back to back using a fixed width. @@ -150,7 +150,7 @@ bit label: ABCDEFGH IJKLMNOP QRSTUVWX Note that the BIT_PACKED encoding method is only supported for encoding repetition and definition levels. -### Delta Encoding (DELTA_BINARY_PACKED = 5) +### Delta Encoding (DELTA_BINARY_PACKED = 5) *Supported Types: INT32, INT64* This encoding is adapted from the Binary packing described in ["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf) by D. Lemire and L. Boytsov. @@ -247,7 +247,7 @@ block: This encoding is similar to the [RLE/bit-packing](#RLE) encoding. However the [RLE/bit-packing](#RLE) encoding is specifically used when the range of ints is small over the entire page, as is true of repetition and definition levels. It uses a single bit width for the whole page. The delta encoding algorithm described above stores a bit width per miniblock and is less sensitive to variations in the size of encoded integers. It is also somewhat doing RLE encoding as a block containing all the same values will be bit packed to a zero bit width thus being only a header. -### Delta-length byte array: (DELTA_LENGTH_BYTE_ARRAY = 6) +### Delta-length byte array (DELTA_LENGTH_BYTE_ARRAY = 6) *Supported Types: BYTE_ARRAY* @@ -266,7 +266,7 @@ The data stream looks like: For example, if the data was "Hello", "World", "Foobar", "ABCDEF", the encoded data would be `DeltaEncoding(5, 5, 6, 6)` followed by `"HelloWorldFoobarABCDEF"`. -### Delta Strings: (DELTA_BYTE_ARRAY = 7) +### Delta Strings (DELTA_BYTE_ARRAY = 7) *Supported Types: BYTE_ARRAY* @@ -278,7 +278,7 @@ For a longer description, see https://en.wikipedia.org/wiki/Incremental_encoding This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY). -### Byte Stream Split: (BYTE_STREAM_SPLIT = 9) +### Byte Stream Split (BYTE_STREAM_SPLIT = 9) *Supported Types: FLOAT, DOUBLE*