PARQUET-1950: Define core features #164
It might make sense to also call out the status of data page v2 here.
You mean as a note? I am currently not sure whether data page v2 is worth supporting at all. It was never widely used, and its benefits come mostly from using some more advanced encodings rather than from the v2 page header itself.
So, I am not sure about the status of v2.
Doesn't the data page v2 header allow selective compression of individual pages?
Based on parquet.thrift, V2 only allows selecting whether a page is compressed or not. The compression codec is specified in ColumnMetaData, so it can be set per column (not per page) for both V1 and V2.
Well, if some pages don't compress well you can skip the CPU cost of compression for them (I suppose that was the rationale for adding this feature?).
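For illustration, a minimal writer-side sketch of that idea, using java.util.zip.Deflater as a stand-in compressor and a hypothetical PageToWrite holder (this is not parquet-mr code; the flag corresponds conceptually to the is_compressed field in DataPageHeaderV2):

```java
import java.util.Arrays;
import java.util.zip.Deflater;

// Hypothetical holder for the bytes and flag that would end up in a V2 page header.
final class PageToWrite {
    final byte[] body;
    final boolean isCompressed;   // maps to DataPageHeaderV2.is_compressed
    PageToWrite(byte[] body, boolean isCompressed) {
        this.body = body;
        this.isCompressed = isCompressed;
    }
}

final class PageCompressor {
    // Compress the page body, but keep the uncompressed bytes if compression does not help.
    static PageToWrite preparePage(byte[] uncompressed) {
        Deflater deflater = new Deflater();
        deflater.setInput(uncompressed);
        deflater.finish();
        byte[] buf = new byte[uncompressed.length + uncompressed.length / 10 + 64];
        int n = deflater.deflate(buf);
        deflater.end();
        if (n > 0 && n < uncompressed.length) {
            return new PageToWrite(Arrays.copyOf(buf, n), true);
        }
        // Incompressible page: store it as-is and spare the reader the decompression cost.
        return new PageToWrite(uncompressed, false);
    }
}
```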
I meant that they aren't widely supported, so if you use it you will find implementations that can't read your files (per the chicken/egg problem above)
So, what would be the general approach here? Do we want to add V2 to the first release of core features, so that we expect all implementations to support it, or do we leave it for a later release?
Is there a consensus about whether DataPageV2 is considered a stable feature now?
It would help if the spec clearly stated whether or not DataPageV2 is finalized and expected to be supported, rather than skipping mention of it here.
"Stable" is unfortunately underdefined here. It certainly seems stable as far as the format is concerned. It's not necessarily widely implemented, though both parquet-mr and Parquet C++ support it.
Trino also supports reading V2 data pages. However, when we attempted to use it in our Parquet writer, we discovered that other engines failed to read it (trinodb/trino#6377).
There was confusion about whether V2 is officially finalised (trinodb/trino#7953 (comment)).
I believe support for V2 pages in other engines has improved since then.
If it's simply a matter of adoption, then it would help to have some clarity about it in the spec (like for example acknowledging it as a core feature here) so that someone implementing a parquet writer can be assured that writing V2 pages is not some experimental feature that can be deprecated from the format.
I think parquet-cpp and parquet-mr both support unsigned, right? I checked Impala and it would just interpret it as a signed integer. I don't think it's the only engine consuming Parquet that has only signed types.
I guess including unsigned here seems fairly reasonable, but it would come with the caveat that how the cast to signed is handled (e.g. whether overflow is allowed) is implementation-dependent.
parquet-cpp definitely does, since C++ has native unsigned integers (i.e. no cast to signed is involved).
I am aware of only one implementation where the cast can be a problem: parquet-mr. Java has no native support for unsigned primitive values. In parquet-mr we give back the original bit pattern in the corresponding primitive (so a long for an unsigned int64), and it is up to the client to deal with the value. If the client does not check the logical type (and so realize that the value is unsigned), the value might be represented incorrectly.
I think this would be the case in Hive. Meanwhile, I am not aware of any SQL engines that write unsigned int values, so currently it does not seem to be a problem.
If we want to be on the safe side we should not add unsigned values because of the Java limitations, but I am not sure about this because the issue would be on the client side of parquet-mr and not in parquet-mr itself. What do you think?
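To make the Java limitation concrete: a client that gets back the raw bits has to reinterpret them itself, for example with the JDK's unsigned helpers. A minimal standalone sketch (not parquet-mr code):

```java
final class UnsignedReadExample {
    public static void main(String[] args) {
        // A UINT_64 column comes back from parquet-mr as the raw bits in a Java long.
        long raw64 = -1L;                                    // bit pattern 0xFFFFFFFFFFFFFFFF
        System.out.println(raw64);                           // -1 if the logical type is ignored
        System.out.println(Long.toUnsignedString(raw64));    // 18446744073709551615 when treated as unsigned

        // A UINT_32 column comes back as a Java int; the unsigned widening conversion avoids overflow.
        int raw32 = -1;                                      // bit pattern 0xFFFFFFFF
        long asUnsigned = Integer.toUnsignedLong(raw32);     // 4294967295
        System.out.println(asUnsigned);
    }
}
```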
The last three seem highly application-specific. For example, Arrow doesn't have native types for them.
This is a good point. I think this is true for logical types in general - if the application doesn't support the type at all, then it's a limitation of the application, not the application's Parquet implementation. It's reasonable for applications to not want to handle, say, timestamps.
Of course parquet-mr and parquet-cpp need to be able to at least pass through the data to the application in all of these cases.
So maybe all of these need to be qualified:
E.g. suppose you have an application that only processes DECIMAL values. It might be a bad and useless application, but that's not Parquet's problem. If it can read and write all DECIMAL encodings correctly, then it's compliant. If it couldn't read decimals stored as int64, or if it wrote decimals with precision < 10, it's not compliant (see https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal).
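As a rough illustration of what "reading all DECIMAL encodings" involves, the physical representations allowed by LogicalTypes.md can be decoded along these lines (a sketch in plain Java, not parquet-mr code):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

final class DecimalDecoding {
    // DECIMAL stored as INT32 or INT64: the physical value is the unscaled integer.
    static BigDecimal fromLong(long unscaled, int scale) {
        return BigDecimal.valueOf(unscaled, scale);
    }

    // DECIMAL stored as BINARY / FIXED_LEN_BYTE_ARRAY: big-endian two's-complement unscaled value.
    static BigDecimal fromBytes(byte[] unscaledBigEndian, int scale) {
        return new BigDecimal(new BigInteger(unscaledBigEndian), scale);
    }
}
```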
That sounds fine to me.
I think this topic is not that simple. Although parquet-mr usually returns values based on the primitive types only (no conversion for the logical type), there is some additional logic that depends on the logical type.
To summarize, we might not add all of the logical types (or not all of their primitive representations) to the core features, but I think we should add at least the ones that are already widely used. The whole core-features idea is about helping interoperability, and logical types are a big part of it.
From an interoperability point of view, a read-only Parquet implementation that cannot do anything with the concept of DECIMAL might not implement reading it. But if one would like to write a UTF-8 string, then we should require using the STRING logical type in the schema for it.
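For example, in parquet-mr the writer-side requirement would look roughly like the following sketch, annotating the BINARY field with the STRING logical type when building the schema (treat the exact builder calls as a sketch; they may differ slightly across versions):

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

final class StringSchemaExample {
    // A UTF-8 string column should be written as BINARY annotated with the STRING logical type,
    // not as a bare BINARY, so readers know how to interpret (and compare) the bytes.
    static MessageType schema() {
        return Types.buildMessage()
            .required(PrimitiveTypeName.BINARY)
                .as(LogicalTypeAnnotation.stringType())
                .named("name")
            .named("example_schema");
    }
}
```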
In practice that shouldn't be very problematic, as most JSON should be UTF8-encoded (note that the IETF standard requires either UTF-8, UTF-16 or UTF-32, and states that UTF-8 is the default encoding).
Agreed. The question is: what does supporting a logical type mean? For example, if I accept JSON data and simply present it as regular string data, is that enough to say that I support it? Otherwise, am I supposed to do something specific?
It's a bit difficult for me to read; what do you mean by "related type" or "related version of core features"?
Yeah, I guessed so. I am not very good at this, and that's why I've tried to write something here instead of in the document in the first place.
I would like to formalize the following idea. A writer writes a logical type if it has a data type with similar semantics; it is allowed not to write some logical types if there are no similar values to write (kind of common sense). A reader has to recognize all the listed logical types and either convert them (or read them as-is) into one of its own types, or, if no such internal type exists, it may refuse to read such values but has to inform the user. We do not want misinterpreted or silently skipped values.
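A rough sketch of that reader-side rule (the dispatch helper is hypothetical; only the parquet-mr schema classes are real): recognize the annotation, map it to an internal type where one exists, and fail loudly instead of silently falling back to the physical type:

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.PrimitiveType;

final class LogicalTypeDispatch {
    // Hypothetical reader hook: decide how to surface a column, never silently ignore the annotation.
    static String targetType(PrimitiveType primitive) {
        LogicalTypeAnnotation annotation = primitive.getLogicalTypeAnnotation();
        if (annotation == null) {
            return primitive.getPrimitiveTypeName().name();   // plain physical type
        }
        if (annotation instanceof LogicalTypeAnnotation.StringLogicalTypeAnnotation) {
            return "utf8_string";
        }
        if (annotation instanceof LogicalTypeAnnotation.DecimalLogicalTypeAnnotation) {
            return "decimal";
        }
        // Unknown or unsupported logical type: refuse rather than misinterpret the values.
        throw new UnsupportedOperationException("Unsupported logical type: " + annotation);
    }
}
```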
So it is not ok for a reader to ignore the logical type and still interpret the data based on the physical type?
It is not. Logical types can significantly modify the meaning of a primitive type. To come back to one of my examples: a BINARY primitive type can be a STRING or a DECIMAL, and even the statistics (min/max) are computed differently for these two.
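As a small standalone Java illustration (hypothetical code, not from any implementation) of why the min/max statistics differ: the same BINARY bytes order differently when compared as unsigned bytes (STRING) versus as signed two's-complement unscaled values (DECIMAL):

```java
import java.math.BigInteger;

final class BinaryOrderingExample {
    public static void main(String[] args) {
        byte[] a = new byte[] { (byte) 0x80 };   // as a DECIMAL unscaled value: -128
        byte[] b = new byte[] { (byte) 0x01 };   // as a DECIMAL unscaled value: 1

        // DECIMAL ordering: signed two's-complement big-endian comparison => a < b.
        int asDecimal = new BigInteger(a).compareTo(new BigInteger(b));

        // STRING ordering: unsigned lexicographic byte comparison => a > b.
        int asString = Integer.compare(a[0] & 0xFF, b[0] & 0xFF);

        System.out.println("as DECIMAL: " + asDecimal + ", as STRING: " + asString);
        // The min/max statistics for the two interpretations therefore disagree.
    }
}
```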
That makes sense, thank you.
This is the first time these abbreviations are used.
The parts describing how the parquet-mr implementation works were not meant to be part of the final document. As I don't know much about the other implementations, I felt it was good to note the current behavior so others may understand the situation better. After a consensus on the related topic is reached, I would remove this note.
While this notes the current status for parquet-mr, it would greatly help to have a clarification in the spec about whether the delta encodings are only allowed with V2 data pages or whether it's legal to use delta encodings in V1 data pages as well. In other words, is the choice of encodings linked in any way to the choice of V1 or V2 data pages?
AFAIU, the choice of data encodings is orthogonal to the choice of V1 or V2 data pages. But different default values may be applied based on which version the user asks for.
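For instance, in parquet-mr the requested writer version is what switches the defaults (and, in parquet-mr's case, also the data page version); a rough sketch using the builder API, with the schema assumed to be defined elsewhere:

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;

final class WriterVersionExample {
    // Asking for PARQUET_2_0 changes which encodings parquet-mr picks by default (delta encodings,
    // RLE_DICTIONARY) and also switches it to V2 data pages; in the format itself the two choices
    // are independent.
    static ParquetWriter<Group> openWriter(Path path, MessageType schema) throws IOException {
        return ExampleParquetWriter.builder(path)
            .withType(schema)
            .withWriterVersion(WriterVersion.PARQUET_2_0)
            .build();
    }
}
```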
That's what I'm hoping for as well; however, the spec doesn't explicitly say whether it is valid to write delta encodings in V1 pages. The parquet-mr implementation does not use any delta encodings in V1 pages. Keeping this ambiguous in the spec means that someone implementing a writer cannot be sure whether the various projects implementing readers will consider this valid input. It's also not clear to me whether parquet-mr has tests guaranteeing that this works, given that parquet-mr itself will not do this.
Can you open a separate issue about that?
Sure, what's the appropriate place to open issues? GH issues seem to be disabled on this repo.
Yes, you need to open a JIRA issue. See here: https://parquet.apache.org/community/
While parquet-mr does not use this directly, the spec currently says that "This encoding is always preferred over PLAIN for byte array columns" at https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
So why doesn't parquet-mr use DELTA_LENGTH_BYTE_ARRAY instead of PLAIN for byte arrays?
As an implementor, it's not clear to me how the line "parquet-mr: Not used directly" should be interpreted. Is this recommending doing exactly what parquet-mr does regarding the usage of encodings, should implementors use their own judgement, or should we follow the prescription of the spec at https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6 regardless of what parquet-mr is doing?
Probably because DELTA_BYTE_ARRAY should be more efficient in general than DELTA_LENGTH_BYTE_ARRAY, thanks to prefix compression?
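For intuition, a toy sketch (plain Java, not from any implementation) of what DELTA_BYTE_ARRAY conceptually stores per value, a shared-prefix length plus the remaining suffix, whereas DELTA_LENGTH_BYTE_ARRAY stores every byte of every value:

```java
import java.nio.charset.StandardCharsets;

final class PrefixCompressionSketch {
    public static void main(String[] args) {
        String[] values = { "apache_parquet_column_a", "apache_parquet_column_b", "apache_parquet_format" };
        byte[] previous = new byte[0];
        for (String v : values) {
            byte[] current = v.getBytes(StandardCharsets.UTF_8);
            int prefix = commonPrefixLength(previous, current);
            // DELTA_BYTE_ARRAY conceptually stores (prefix length, suffix) per value;
            // DELTA_LENGTH_BYTE_ARRAY would store all current.length bytes.
            System.out.printf("prefix=%d, suffix=%s%n", prefix, v.substring(prefix));
            previous = current;
        }
    }

    static int commonPrefixLength(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        int i = 0;
        while (i < n && a[i] == b[i]) {
            i++;
        }
        return i;
    }
}
```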
That might be true from the point of view of compression; however, the decoder logic is more complicated with DELTA_BYTE_ARRAY and decoding speed could be slower.
If DELTA_BYTE_ARRAY is always the better choice, the spec should instead recommend that DELTA_BYTE_ARRAY should always be preferred over PLAIN for byte array columns and explicitly discourage using DELTA_LENGTH_BYTE_ARRAY directly as an encoding for byte arrays.
I don't think the spec should recommend an encoding over another one. It should clearly document the advantages and drawbacks of each encoding, so that implementations and applications can make an informed choice.
If the spec simply provided guidelines and were not opinionated, consistently in every case, that would be okay too. However, that doesn't appear to be the case today, given the above recommendation of DELTA_LENGTH over PLAIN in the current spec. This PR also highlights the specific encodings chosen by parquet-mr along with the choice of V1/V2 page format, which gives me the impression that any choice other than what is mentioned here is not really recommended, even if it exists in the spec.
This may deserve a separate issue, then.
Looking at the source code, the three DELTA encodings are not yet supported in Arrow parquet-cpp? (@emkornfield is that correct?)
Parquet C++ should now support DELTA_BINARY_PACKED for reading.
Update: Parquet C++ now supports all encodings on the read side (which is sufficient here).
I think in Arrow parquet-cpp you can use RLE_DICTIONARY (when asking for version=2) without also using V2 data pages.
Impala 3.4 and below wouldn't support this, FWIW.
I also recently implemented this in Impala and on the read side it's orthogonal to the data page version, but does require additional code: IMPALA-6434
Another general choice here: do we want to add encodings that might not be widely supported yet in the first round, or add only the ones we regularly use and leave the rest (maybe some even newer ones) for a later release?
Is LZO deprecated? If so, we should state that.
I am not sure if we want to list every possibility based on the format and state whether it is supported or not. LZO was removed after these comments.
I would love to use LZ4/ZSTD more because the performance/density benefits can be significant; not sure which implementations have support yet, though (Impala has both as of the 3.3 release).
The big question here is whether the other implementations can guarantee reading/writing these compressions. For example, parquet-mr uses Hadoop for some of these, so it depends on the environment whether they are supported or not. If we put these into the core features, parquet-mr cannot rely on the environment to provide the related codecs. (According to https://github.com/airlift/aircompressor it does not seem to be a big deal, just requires some effort and testing.)
parquet-cpp has recently fixed its LZ4 support to be compatible with the weird encoding used by parquet-mr.
What does it mean that it uses Hadoop for the codec? Does that mean it's difficult or impossible in some environments to enable the compression codecs?
We ran into the same LZ4 issue too: https://issues.apache.org/jira/browse/IMPALA-8617
Parquet-mr does not ship all of the codec implementations it supports. E.g. we do not unit test LZO, LZ4 and ZSTD because they are "not distributed in the default version of Hadoop". That means it is up to the user to install the related codecs into the Hadoop environment so parquet-mr can use them. (It might mean that in some environments it is not possible to use some codecs, but I don't have much experience with this.) This is the current situation.
That's why I've said that if we add these codecs to the core features, then parquet-mr has to add the implementations of these codecs (as direct dependencies) so it can guarantee the codecs are accessible in any situation.
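For reference, selecting a codec in parquet-mr looks roughly like the sketch below; whether the underlying ZSTD/LZ4 implementation is actually available can still depend on the Hadoop environment, which is the concern above (treat the exact enum values and builder methods as a sketch):

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;

final class CodecSelectionExample {
    // Requesting ZSTD records the codec in the file metadata and asks parquet-mr for the
    // corresponding codec implementation; if none is available on the classpath, writing fails.
    static ParquetWriter<Group> openWriter(Path path, MessageType schema) throws IOException {
        return ExampleParquetWriter.builder(path)
            .withType(schema)
            .withCompressionCodec(CompressionCodecName.ZSTD)
            .build();
    }
}
```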
For the record, we (meaning @chairmank) discovered an incompatibility in the LZ4 compatibility code that was added to Parquet C++ to handle Hadoop-produced files: see https://issues.apache.org/jira/browse/ARROW-11301
As an update, while we (parquet-cpp) managed to fix the aforementioned compatibility issue, another compatibility issue appeared in the other direction (LZ4-compressed files produced by parquet-cpp cannot be read back by parquet-java): https://issues.apache.org/jira/browse/PARQUET-1974
At this point, it seems the lack of specification around LZ4 compression makes it basically impossible to implement in a compatible manner. Therefore it shouldn't be in this spec.
I've proposed a new LZ4_RAW codec in #168, where I also deprecate the current LZ4 codec.
I'll remove LZ4 from here. I guess we cannot add LZ4_RAW here yet.
It might be useful to clarify how this differs from the version key above?
It's a good point, @jorisvandenbossche. However, I am not sure what the other version is for. Maybe it would be better to specify that one correctly, and then it would be clear that this one is a different thing.
Currently parquet-mr always writes 1 to that version field. I guess this is a file format version that we never incremented, as it is backward compatible with the first files we wrote. If that is the case, then I think we will never increment it.
In Arrow (parquet-cpp) this version field actually gets populated with 1 or 2, and depending on that, different features are used to write the file (version 2 enables some additional features, like writing nanosecond timestamps, or using RLE_DICTIONARY instead of PLAIN_DICTIONARY).
That doesn't sound good. It means that we are using a required field for different purposes, and because it is required we cannot deprecate it easily.
parquet-mr does not read this field; it only ever writes 1. What does parquet-cpp do with this value on the read path?
AFAIK, the field is not used when reading (but I am not super familiar with the C++ code)
FWIW, Impala reads this field and fails if it is 2 (https://issues.apache.org/jira/browse/IMPALA-11838?filter=-2)