From 791d9daa51e413d5dc63fc938213b754ebe9b871 Mon Sep 17 00:00:00 2001
From: Dan Reif <881372+BMDan@users.noreply.github.com>
Date: Thu, 24 Mar 2022 15:10:01 -0700
Subject: [PATCH 1/2] Correct formatting errors in Encodings.md
In particular, address the issue that formatting caused the text after "The data stream looks like:" in "Delta-length byte array" to disappear entirely.
The remainder of changes are simply for consistency and readability.
---
Encodings.md | 60 +++++++++++++++++++++++++++++++++-------------------
1 file changed, 38 insertions(+), 22 deletions(-)
diff --git a/Encodings.md b/Encodings.md
index 4f56104c7..679695253 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -22,9 +22,9 @@ Parquet encoding definitions
This file contains the specification of all supported encodings.
-### Plain: (PLAIN = 0)
+### Plain (PLAIN = 0)
-Supported Types: all
+*Supported Types: all*
This is the plain encoding that must be supported for types. It is
intended to be the simplest encoding. Values are encoded back to back.
@@ -151,7 +151,7 @@ Note that the BIT_PACKED encoding method is only supported for encoding
repetition and definition levels.
### Delta Encoding (DELTA_BINARY_PACKED = 5)
-Supported Types: INT32, INT64
+*Supported Types: INT32, INT64*
This encoding is adapted from the Binary packing described in ["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf) by D. Lemire and L. Boytsov.
@@ -192,40 +192,56 @@ If, in the last block, less than `````` miniblo
The following examples use 8 as the block size to keep the examples short, but in real cases it would be invalid.
#### Example 1
+```
1, 2, 3, 4, 5
+```
-After step 1), we compute the deltas as:
+After step 1, we compute the deltas as:
+```
1, 1, 1, 1
+```
The minimum delta is 1 and after step 2, the deltas become
+```
0, 0, 0, 0
+```
The final encoded data is:
- header:
-8 (block size), 1 (miniblock count), 5 (value count), 1 (first value)
-
- block
-1 (minimum delta), 0 (bitwidth), (no data needed for bitwidth 0)
+```
+header:
+ 8 (block size), 1 (miniblock count), 5 (value count), 1 (first value)
+block:
+ 1 (minimum delta), 0 (bitwidth), (no data needed for bitwidth 0)
+```
#### Example 2
-7, 5, 3, 1, 2, 3, 4, 5, the deltas would be
+```
+7, 5, 3, 1, 2, 3, 4, 5
+```
+The deltas would be:
+
+```
-2, -2, -2, 1, 1, 1, 1
+```
The minimum is -2, so the relative deltas are:
+```
0, 0, 0, 3, 3, 3, 3
+```
-The encoded data is
-
- header:
-8 (block size), 1 (miniblock count), 8 (value count), 7 (first value)
+The encoded data is:
- block
--2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2 bits)
+```
+header:
+ 8 (block size), 1 (miniblock count), 8 (value count), 7 (first value)
+block:
+ -2 (minimum delta), 2 (bitwidth), 00000011111111b (0,0,0,3,3,3,3 packed on 2 bits)
+```
#### Characteristics
This encoding is similar to the [RLE/bit-packing](#RLE) encoding. However the [RLE/bit-packing](#RLE) encoding is specifically used when the range of ints is small over the entire page, as is true of repetition and definition levels. It uses a single bit width for the whole page.
@@ -233,7 +249,7 @@ The delta encoding algorithm described above stores a bit width per miniblock an
### Delta-length byte array: (DELTA_LENGTH_BYTE_ARRAY = 6)
-Supported Types: BYTE_ARRAY
+*Supported Types: BYTE_ARRAY*
This encoding is always preferred over PLAIN for byte array columns.
@@ -244,15 +260,15 @@ and possibly better compression in the data (it is no longer interleaved with th
The data stream looks like:
+```
+```
-For example, if the data was "Hello", "World", "Foobar", "ABCDEF":
-
-The encoded data would be DeltaEncoding(5, 5, 6, 6) "HelloWorldFoobarABCDEF"
+For example, if the data was "Hello", "World", "Foobar", "ABCDEF", the encoded data would be `DeltaEncoding(5, 5, 6, 6)` followed by `"HelloWorldFoobarABCDEF"`.
### Delta Strings: (DELTA_BYTE_ARRAY = 7)
-Supported Types: BYTE_ARRAY
+*Supported Types: BYTE_ARRAY*
This is also known as incremental encoding or front compression: for each element in a
sequence of strings, store the prefix length of the previous entry plus the suffix.
@@ -264,7 +280,7 @@ the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).
### Byte Stream Split: (BYTE_STREAM_SPLIT = 9)
-Supported Types: FLOAT DOUBLE
+*Supported Types: FLOAT, DOUBLE*
This encoding does not reduce the size of the data but can lead to a significantly better
compression ratio and speed when a compression algorithm is used afterwards.
From 32f90947128049dbe6abee05239b357c72ec517f Mon Sep 17 00:00:00 2001
From: Dan Reif <881372+BMDan@users.noreply.github.com>
Date: Thu, 24 Mar 2022 15:15:46 -0700
Subject: [PATCH 2/2] Update Encodings.md
Remove remaining colons in category headers. Ensure all encodings have anchors corresponding to their constant names.
---
Encodings.md | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/Encodings.md b/Encodings.md
index 679695253..2b551d52c 100644
--- a/Encodings.md
+++ b/Encodings.md
@@ -46,7 +46,7 @@ For native types, this outputs the data as little endian. Floating
For the byte array type, it encodes the length as a 4 byte little
endian, followed by the bytes.
-### Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
+### Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
The dictionary encoding builds a dictionary of values encountered in a given column. The
dictionary will be stored in a dictionary page per column chunk. The values are stored as integers
using the [RLE/Bit-Packing Hybrid](#RLE) encoding. If the dictionary grows too big, whether in size
@@ -123,7 +123,7 @@ data:
* Dictionary indices
* Boolean values in data pages, as an alternative to PLAIN encoding
-### Bit-packed (Deprecated) (BIT_PACKED = 4)
+### Bit-packed (Deprecated) (BIT_PACKED = 4)
This is a bit-packed only encoding, which is deprecated and will be replaced by the [RLE/bit-packing](#RLE) hybrid encoding.
Each value is encoded back to back using a fixed width.
@@ -150,7 +150,7 @@ bit label: ABCDEFGH IJKLMNOP QRSTUVWX
Note that the BIT_PACKED encoding method is only supported for encoding
repetition and definition levels.
-### Delta Encoding (DELTA_BINARY_PACKED = 5)
+### Delta Encoding (DELTA_BINARY_PACKED = 5)
*Supported Types: INT32, INT64*
This encoding is adapted from the Binary packing described in ["Decoding billions of integers per second through vectorization"](http://arxiv.org/pdf/1209.2137v5.pdf) by D. Lemire and L. Boytsov.
@@ -247,7 +247,7 @@ block:
This encoding is similar to the [RLE/bit-packing](#RLE) encoding. However the [RLE/bit-packing](#RLE) encoding is specifically used when the range of ints is small over the entire page, as is true of repetition and definition levels. It uses a single bit width for the whole page.
The delta encoding algorithm described above stores a bit width per miniblock and is less sensitive to variations in the size of encoded integers. It is also somewhat doing RLE encoding as a block containing all the same values will be bit packed to a zero bit width thus being only a header.
-### Delta-length byte array: (DELTA_LENGTH_BYTE_ARRAY = 6)
+### Delta-length byte array (DELTA_LENGTH_BYTE_ARRAY = 6)
*Supported Types: BYTE_ARRAY*
@@ -266,7 +266,7 @@ The data stream looks like:
For example, if the data was "Hello", "World", "Foobar", "ABCDEF", the encoded data would be `DeltaEncoding(5, 5, 6, 6)` followed by `"HelloWorldFoobarABCDEF"`.
-### Delta Strings: (DELTA_BYTE_ARRAY = 7)
+### Delta Strings (DELTA_BYTE_ARRAY = 7)
*Supported Types: BYTE_ARRAY*
@@ -278,7 +278,7 @@ For a longer description, see https://en.wikipedia.org/wiki/Incremental_encoding
This is stored as a sequence of delta-encoded prefix lengths (DELTA_BINARY_PACKED), followed by
the suffixes encoded as delta length byte arrays (DELTA_LENGTH_BYTE_ARRAY).
-### Byte Stream Split: (BYTE_STREAM_SPLIT = 9)
+### Byte Stream Split (BYTE_STREAM_SPLIT = 9)
*Supported Types: FLOAT, DOUBLE*