Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PARQUET-1950: Define core features #164

Draft
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

gszadovszky
Copy link
Contributor

Make sure you have checked all steps below.

Jira

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

The whole document is up to discussion but the parts which are marked with a ? or TODO are the ones where I don't have a hard opinion. Feel free to add any comment about content or wording.

CoreFeatures.md Outdated Show resolved Hide resolved
CoreFeatures.md Outdated Show resolved Hide resolved
* `LZO` **(?)**
* `BROTLI` **(?)**
* `LZ4` **(?)**
* `ZSTD` **(?)**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love to use LZ4/ZSTD more cause the performance/density benefits can be significant; not sure which implementations have support yet though (Impala has both as of the 3.3 release).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The big question here is if the other implementations can guarantee to read/write these compressions. For example parquet-mr uses hadoop for some of these therefore it depends on the environment if support them or not. If we put these to the core features parquet-mr cannot rely on the environment to provide the related codecs. (According to https://github.com/airlift/aircompressor it does not seem to be a big deal just requires some efforts and testing.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parquet-cpp has recently fixed its LZ4 support to be compatible with the weird encoding used by parquet-mr.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it mean that it uses hadoop for the codec? Does that mean it's difficult or impossible in some environments to enable the compression codecs?

We ran into the same LZ4 issue too - https://issues.apache.org/jira/browse/IMPALA-8617

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet-mr does not ship all of the codec implementations it supports. E.g. we do not unit test LZO, LZ4 and ZSTD because they are "not distributed in the default version of Hadoop". It means it is up to the user to install the related codecs to the hadoop environment so parquet-mr can use them. (It might mean that in some environments it is not possible to use some codecs but I don't have too mush experience in this.) This is the current situation.
That's why I've said if we add these codecs to the core features than parquet-mr have to add the implementations of these codecs (as direct dependencies) so it can guarantee the codecs are accessible in any situation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, we (meaning @chairmank) discovered an incompatibility in the LZ4 compatibility code that was added to Parquet C++ to handle Hadoop-produced files: see https://issues.apache.org/jira/browse/ARROW-11301

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an update, while we (parquet-cpp) managed to fix the aforementioned compatibility issue, another compatibility issue appeared in the other direction (LZ4-compressed files produced by parquet-cpp cannot be read back by parquet-java): https://issues.apache.org/jira/browse/PARQUET-1974

At this point, it seems the lack of specification around LZ4 compression makes it basically impossible to implement in a compatible manner. Therefore it shouldn't be in this spec.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've proposed a new LZ4_RAW codec in #168, where I also deprecate the current LZ4 codec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove LZ4 from here. I guess we cannot add LZ4_RAW here yet.

CoreFeatures.md Outdated Show resolved Hide resolved
CoreFeatures.md Outdated Show resolved Hide resolved
CoreFeatures.md Outdated Show resolved Hide resolved
CoreFeatures.md Outdated Show resolved Hide resolved
CoreFeatures.md Outdated Show resolved Hide resolved
`FileMetaData` must be set according to the specification.
The following page types are supported:
* Data page V1 (see `DataPageHeader`)
* Dictionary page (see `DictionaryPageHeader`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might make sense to also call out here status of data page v2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean as a note? I am currently not sure if data page v2 worth to be supported at all. It was never widely used and the benefits of it are more about using some more advanced encodings and not the v2 page header itself.
So, I am not sure about the status of v2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the data page v2 header allow selective compression of individual pages?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on parquet.thrift V2 only allows to select if a page is compressed or not. The compression codec is specified in ColumnMetaData so it can be set by column (and not page) for both V1 and V2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if some pages don't compress well you can skip the CPU cost of compression for them (I suppose that was the rationale for adding this feature?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that they aren't widely supported, so if you use it you will find implementations that can't read your files (per the chicken/egg problem above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, what would be the general approach here? Do we want to add V2 to the first release of core features so we expect all the implementations to support it or we leave it for a next release?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a consensus about whether DataPageV2 is considered a stable feature now ?
It would help if the spec clearly says whether or not DataPageV2 is finalized and is expected to be supported rather than skipping mention of it here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Stable" is unfortunately underdefined here. It certainly seems stable as far as the format is concerned. It's not necessarily widely implemented, though both parquet-mr and Parquet C++ support it.

Copy link
Contributor

@raunaqmorarka raunaqmorarka May 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trino also supports reading V2 data pages. However, when we attempted to use it in our parquet writer, we discovered other engines failed to read it trinodb/trino#6377
There was confusion about whether V2 is officially finalised trinodb/trino#7953 (comment)
I believe support for V2 pages in other engines has improved since then.
If it's simply a matter of adoption, then it would help to have some clarity about it in the spec (like for example acknowledging it as a core feature here) so that someone implementing a parquet writer can be assured that writing V2 pages is not some experimental feature that can be deprecated from the format.

CoreFeatures.md Outdated Show resolved Hide resolved
CoreFeatures.md Outdated Show resolved Hide resolved
gszadovszky and others added 2 commits December 14, 2020 12:13
* Separate requirements for writers and readers
* Add requirement for interoperability tests
* Remove TODO about the dictionary encoding of FIXED values as it seems
  only parquet-mr has limitations but the other implementations do not
* Remove LZO because licensing issues makes hard to include them in the
  implementations
@gszadovszky
Copy link
Contributor Author

@julienledem, @rdblue, could you please add your notes/ideas about this?

* `UNKNOWN` **(?)**
* `JSON` **(?)**
* `BSON` **(?)**
* `UUID` **(?)**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last three seem highly application-specific. For example, Arrow doesn't have native types for them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. I think this is true for logical types in general - if the application doesn't support the type at all, then it's a limitation of the application, not the application's Parquet implementation. It's reasonable for applications to not want to handle, say, timestamps.

Of course parquet-mr and parquet-cpp need to be able to at least pass through the data to the application in all of these cases.

So maybe all of these need to be qualified:

  • If the implementation is general-purpose Parquet library, users of the library must be able to access data with all logical types. It may just return the data in the same format as the primitive type. E.g. if JSON is just returned as a STRING, that seems fine.
  • If the implementation is anything else (SQL engine, computation framework, etc), then if it supports a primitive or logical type, it must support all core features associated with that type.

E.g. suppose you have an application that only processes DECIMAL values. It might be a bad and useless application, but that's not Parquet's problem. If it can read and write all DECIMAL encodings correctly, then it's compliant. If it couldn't read decimals stored as int64, or if it wrote decimals with precision < 10, it's not compliant (see https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds fine to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, this topic is not that simple. However parquet-mr usually returns values related to the primitive types only (no conversion for logical type) there are some additional logic that depends on the logical type.

  • For example the sorting order matters when we are talking about statistics (min/max for row groups or column indexes). The sorting order might depend on the logical type (e.g. BINARY as a STRING or as a DECIMAL).
  • In your example about JSON you cannot return a String if you don't know that the related BINARY primitive contains an UTF-8 encoded value.
  • We have many problems with the different semantics used for timestamps in the different projects. This problem cannot be solved entirely in parquet but it can be the base of the solution. That's why we introduced the new timestamp logical type.
  • Some parquet-mr bindings (e.g. parquet-avro) do depend on the logical types when converting values
  • We already found some issues with the string representation. Impala did not write the logical type UTF-8 for strings only the primitive type BINARY. Meanwhile, iceberg required to have this logical type otherwise it recognized the value as a binary and it did not work with it.

To summarize, we might not add all of the logical types (or not all of the primitive representations) to the core features but I think, we shall add at least the ones already used widely. The whole core features idea is about to help interoperability. Logical types are a big part of it.

Although, from interoperability point of view a read-only parquet implementation that cannot do anything with the concept of DECIMAL it might not implement the reading of it. But if one would like to write a UTF-8 string than we shall require to use a STRING logical type in the schema for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your example about JSON you cannot return a String if you don't know that the related BINARY primitive contains an UTF-8 encoded value.

In practice that shouldn't be very problematic, as most JSON should be UTF8-encoded (note that the IETF standard requires either UTF-8, UTF-16 or UTF-32, and states that UTF-8 is the default encoding).

The whole core features idea is about to help interoperability. Logical types are a big part of it.

Agreed. The question is: what does supporting a logical type mean? For example, if I accept JSON data and simply present it as regular string data, is that enough to say that I support it? Otherwise, am I supposed to do something specific?

Copy link
Member

@pitrou pitrou Mar 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit difficult to read to me, what do you call "related type" or "related version of core features"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I guessed so. I am not very good at it and that's why I've tried to write something here instead of in the document at the first place.

I would like to formalize the following idea. Writer writes a logical type if has a data type with similar semantics. It is allowed to not to write some logical types if there are no similar values to write (kind of common sense). Reader have to recognize all the listed logical types and either convert (or read as is) to one of its own types or if no such internal types exists it may not read such values but have to inform the user. We do not want misinterpreted or silently skipped values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it is not ok for a reader to ignore the logical type and still interpret the data based on the physical type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not. Logical types can significantly modify the meaning of a primitive type. Back to one of my examples. A BINARY primitive type can be a STRING or a DECIMAL. Even the statistics (min/max) are different for these two.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, thank you.

@pitrou
Copy link
Member

pitrou commented Dec 16, 2020

For the record, we started documenting the features supported by parquet-cpp (which is part of the Arrow codebase).

@nevi-me
Copy link

nevi-me commented Jan 8, 2021

For the record, we started documenting the features supported by parquet-cpp (which is part of the Arrow codebase).

I've opened https://issues.apache.org/jira/browse/ARROW-11181 so we can do the same for the Rust implementation

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits, but having an implementation standard sounds good to me!

CoreFeatures.md Outdated
compliance level then it must use only features from the *core feature list* of
that parquet-format release. If a reader implementation claims the same if must
implement all of the listed features. This way it is easier to ensure
compatibility between the different parquet implementations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet

CoreFeatures.md Outdated

We cannot and don't want to stop our clients to use any features that are not
on this list but it shall be highlighted that using these features might make
the written parquet files unreadable by other implementations. We can say that
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet

CoreFeatures.md Outdated
this document under the same major version. (We might deprecate some, though.)
Because of the semantic versioning if one implementation supports the core
features of the parquet-format release `a.b.x` it must be able to read any
parquet files written by implementations supporting the release `a.d.y` where
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet

CoreFeatures.md Outdated
Because of the semantic versioning if one implementation supports the core
features of the parquet-format release `a.b.x` it must be able to read any
parquet files written by implementations supporting the release `a.d.y` where
`b >= d`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: switch b and d in this example

CoreFeatures.md Outdated
parquet files written by implementations supporting the release `a.d.y` where
`b >= d`.

If a parquet file is written according to a released version of this document
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet

parquet-mr: As per the spec this encoding is deprecated while we still use it
for V1 page dictionaries.
* [RLE](Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3)
parquet-mr: Used for both V1 and V2 pages to encode RL and DL and for BOOLEAN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first time these abbreviations are used

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parts describing how the parquet-mr implementations work were not meant to be part of the final document. As I don't know too much about other implementations I've felt it is good to note the current behavior so others may understand the situation better. After a consensus of the related topic made I would remove this note.

The following compression algorithms are supported (including `UNCOMPRESSED`).
* `SNAPPY`
* `GZIP`
* `BROTLI` **(?)**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is LZO deprecated? If so we should state that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we want to list every possibility based on the format and state whether it is supported or not. LZO was removed after these comments.

README.md Outdated
@@ -239,6 +239,13 @@ There are many places in the format for compatible extensions:
- Encodings: Encodings are specified by enum and more can be added in the future.
- Page types: Additional page types can be added and safely skipped.

## Compatibility
Because of the many features got into the Parquet format it is hard for the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"got into the" -> "that have been added to the"?

README.md Outdated
@@ -239,6 +239,13 @@ There are many places in the format for compatible extensions:
- Encodings: Encodings are specified by enum and more can be added in the future.
- Page types: Additional page types can be added and safely skipped.

## Compatibility
Because of the many features got into the Parquet format it is hard for the
different implementations to keep up. We introduced the list of "core
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it hard? I don't think it's our place to say. We should stay objective and state that some implementations have not kept up.

README.md Outdated
## Compatibility
Because of the many features got into the Parquet format it is hard for the
different implementations to keep up. We introduced the list of "core
features". This document is versioned by the parquet format releases and defines
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet

@emkornfield
Copy link
Contributor

It would be nice to cover the status of external file column chunk (apache/arrow#8130 was opened for C++)

@gszadovszky
Copy link
Contributor Author

@emkornfield, in parquet-mr there was another reason to use the file_path in the footer. The feature is called summary files. The idea was to have a separate file containing a summarized footer of several parquet files so you might do filtering and pruning without even checking a file's own footer. As far as I know this implementation exists in parquet-mr only and there are no specification for it in parquet-format.
This feature is more or less abandoned meaning during the development of some newer features (e.g. column indexes, bloom filters) the related parts might not updated properly. There were a couple of discussions about this topic in the dev list: here and here.

Because non of the ideas of external column chunks nor the summary files were spread across the different implementations (because of the lack of specification) I think we should not include the usage of the field file_path in this document or even explicitly specify that this field is not supported.

I am open to specify such features properly and after the required demonstration we may include them in a later version of the core features. However, I think these requirements (e.g. snapshot API, summary files) are not necessarily needed by all of our clients or already implemented in some ways (e.g. storing statistics in HMS, Iceberg).

@emkornfield
Copy link
Contributor

Because non of the ideas of external column chunks nor the summary files were spread across the different implementations (because of the lack of specification) I think we should not include the usage of the field file_path in this document or even explicitly specify that this field is not supported.

Being explicit seems reasonable to me if others are OK with it.

@raduteo
Copy link

raduteo commented Feb 1, 2021

@gszadovszky and @emkornfield it's highly coincidental that I was just looking into cleaning up apache/arrow#8130 when I noticed this thread.
External column chunks support is one of the key features that attracted me to parquet in the first place and I would like the chance to lobby for keeping it and actually expanding its adoption - I already have the complete PR mentioned above and I can help with supporting it across other implementations.
There are a few major domains where I see this as valuable component:

  1. Allowing concurrent read to fully flushed row groups while parquet file is still being appended to. A slight variant of this is allowing subsequent row group appends to a parquet file without impacting potential readers.
  2. Being able to aggregate multiple data sets in a master parquet file: One scenario if cumulative recordings like stock prices that get collected daily and need to be presented as one unified historical file, another the case of enrichment where we want to add new columns to an existing data set.
  3. Allowing for bi-temporal changes to parquet file: External columns chunks allows one to apply small corrections by simply creating delta files and new footers that simply swap out the chunks that require changes and point to the new ones.

If the above use cases are addressed by other parquet overlays or they don't line up with the intended usage of parquet I can look elsewhere but it seems like huge opportunity and the development cost for supporting it are quite minor by comparison

@emkornfield
Copy link
Contributor

@raduteo the main driver for this PR is there has been a lot of confusion as what is defined as needing core support. I think once we finish this PR I'm not fully opposed to the idea of supporting this field but I think we need to go into greater detail in the specification of what supporting the individual files actually means (and i think willing to help both Java and C++ support both can go a long way to convincing people that it should become a core feature).

@timarmstrong
Copy link
Contributor

+1 to @emkornfield's comment - the intent of this is to establish a clear baseline about what is supported widely in practice - there are a bunch of Parquet features that are in the standard but are hard to use in practice because they don't have read support from other implementatoins. I think it should ultimately make it easier to get adoption on new features cause the status of each feature will be clearer.

@ggershinsky
Copy link
Contributor

to add the parquet encryption angle to this discussion. This feature adds protection of confidentiality and integrity of parquet files (when they have columns with sensitive data). These security layers will make it difficult to support many of the legacy features mentioned above, like external chunks or merging multiple files into a single master file (this interferes with definition of file integrity). Reading encrypted data is also difficult before file writing is finished. All of these are not impossible, but challenging, and would require an explicit scaffolding plus some Thrift format changes. If there is a strong demand for using encryption with these legacy features, despite them being deprecated (or with some of the mentioned new features), we can plan this for future versions of parquet-format, parquet-mr etc.

@gszadovszky
Copy link
Contributor Author

@ggershinsky, I think it is totally fine to say that encryption does not support external chunks and similar features even if they were fully supported by the implementations.

BTW, as we are already talking about the encryption I did not plan to include this features here for now. I think this feature is not mature enough yet and also it is not something that every implementation requires. Also, it might be a good candidate for a later release of core features.

@ggershinsky
Copy link
Contributor

@gszadovszky I certainly agree the encryption feature is not ready yet to be on this list. According to the definition, we need to "have at least two different implementations that are released and widely tested". While we already have parquet-mr and parquet-cpp implementations, their release and testing status is not yet at that point. We can revisit this for a later version of the CoreFeatures.md.

@raduteo
Copy link

raduteo commented Feb 3, 2021

+1 to @emkornfield's comment - the intent of this is to establish a clear baseline about what is supported widely in practice - there are a bunch of Parquet features that are in the standard but are hard to use in practice because they don't have read support from other implementatoins. I think it should ultimately make it easier to get adoption on new features cause the status of each feature will be clearer.

Thank you @emkornfield and @timarmstrong for the clarifications!
Btw, I am 100% in favor of the current initiative and I can relate to the world of pain one has to go through navigating parquet incompatibilities and I can definitely see how this can mitigate those issues while allowing the standard and underlying implementations to evolve.

/**
* This field might be set with the version number of a parquet-format release
* if this file is created by using only the features listed in the related
* list of core features. See CoreFeatures.md for details.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful to clarify how this differs from the version key above ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point, @jorisvandenbossche. However, I am not sure what the other version is for. Maybe, it would be better to specify that one correctly and it would be clear that this one is different thing.
Currently parquet-mr always writes 1 to that version field. I guess this is a file format version that we never incremented as it is backward compatible to the first ones we wrote. If it is the case then I think we will never increment it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Arrow (parquet-cpp) this version field actually gets populated with 1 or 2, and depending on that different features are used to write the file (version 2 enabled some additional features, like writing nanosecond timestamps, or using RLE_DICTIONARY instead of PLAIN_DICTIONARY)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't sound good. It means that we are using a required field for different purposes. Because it is required we cannot deprecate it easily.
parquet-mr does not read this field only writes 1 all the time. What does parquet-cpp does with this value at the read path?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, the field is not used when reading (but I am not super familiar with the C++ code)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, Impala reads this field and fails if it is 2 (https://issues.apache.org/jira/browse/IMPALA-11838?filter=-2)

Comment on lines +144 to +153
* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
**(?)**
parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
**(?)**
parquet-mr: Not used directly
* [DELTA\_BYTE\_ARRAY](Encodings.md#delta-strings-delta_byte_array--7)
**(?)**
parquet-mr: Used for V2 pages to encode BYTE\_ARRAY and
FIXED\_LEN\_BYTE\_ARRAY values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the source code, the three DELTA encodings are not yet supported in Arrow parquet-cpp? (@emkornfield is that correct?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parquet C++ should now support DELTA_BINARY_PACKED for reading.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: Parquet C++ now supports all encodings on the read side (which is sufficient here).

FIXED\_LEN\_BYTE\_ARRAY values
* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
**(?)**
parquet-mr: Used for V2 page dictionaries
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in Arrow parquet-cpp you can use RLE_DICTIONARY (when asking for version=2) without also using V2 data pages

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impala 3.4 and below wouldn't support this, FWIW.

I also recently implemented this in Impala and on the read side it's orthogonal to the data page version, but does require additional code: IMPALA-6434

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another general choice here. Do we want to add encodings that might not supported widely yet at the first round or add only the ones we regularly use and leave the rest (maybe some even newer ones) for a later release?

* `DATE`
* `TIME`: **(Which unit, utc?)**
* `TIMESTAMP`: **(Which unit, utc?)**
* `INTEGER`: (all bitwidth 8, 16, 32, 64) **(unsigned?)**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think parquet-cpp and parquet-mr both support unsigned right? I checked Impala and it would just interpret it as a signed integer. I think it's not the only engine consuming parquet than only has signed types.

I guess included unsigned here seems fairly reasonable, but it would come with the caveat that the behaviour is implementation-dependent about how the cast to signed would be handled (e.g. allowing overflow or not).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think parquet-cpp and parquet-mr both support unsigned right?

parquet-cpp definitely does, since C++ has native unsigned integers (i.e. no cast to signed is involved).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am aware of only one implementation where the cast can be a problem, this is parquet-mr. There is no native support for unsigned values in primitives in java. In parquet-mr we give back the original bitmap in the related primitive (so a long for an unsigned int64) and it is up to the client to deal with the value. If the related client does not check the logical type (so realize that the value is unsigned) then the value might be represented incorrectly.
I think this would be a case in Hive. Meanwhile, I am not aware of any SQL engines that would write unsigned int values so currently it does not seem to be a problem.
If we want to be on the safe side we should not add unsigned values because of the java limitations but I am not sure about this because of the issue would be on the client side of parquet-mr and not parquet-mr itself. What do you think?

FIXED\_LEN\_BYTE\_ARRAY values
* [RLE\_DICTIONARY](Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8)
**(?)**
parquet-mr: Used for V2 page dictionaries
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impala 3.4 and below wouldn't support this, FWIW.

I also recently implemented this in Impala and on the read side it's orthogonal to the data page version, but does require additional code: IMPALA-6434

@shangxinli
Copy link

@gszadovszky Are you still looking forward to merging this PR?

@gszadovszky
Copy link
Contributor Author

@shangxinli, I don't have enough time to work on it and also lost the contacts to Impala. This should be a cross community effort. It would be nice if someone could take it over but if there is no much interest it might not worth the effort.

@emkornfield
Copy link
Contributor

What isn't clear to me how we make sure we've reached all communities. To enumerate Parquet implementations I know about:

  • parquet-mr
  • impala
  • Rust (Arrow and Arrow2)
  • C++ (parquet-cpp/arrow)
  • Python (parquet-cpp/pyarrow and fastparquet).

What do we believe is needed to move forward?

@gszadovszky
Copy link
Contributor Author

I don't know other implementation either. Since parquet-format is managed by this community I would expect the "implementors" to listen to the dev mailing list at least. I believe there is twitter channel for parquet managed by @julienledem. It might also help reaching parquet implementor communities.
To move forward someone should pick up all the discussions in this PR and agree on solutions. There might be TODOs left in it as well. Unfortunately, I don't have the time to work on it anymore.

parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
* [DELTA\_LENGTH\_BYTE\_ARRAY](Encodings.md#delta-length-byte-array-delta_length_byte_array--6)
**(?)**
parquet-mr: Not used directly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While parquet-mr does not use this directly, the spec currently says that "This encoding is always preferred over PLAIN for byte array columns" at https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6
So why doesn't parquet-mr use DELTA_LENGTH_BYTE_ARRAY instead of PLAIN for byte arrays ?
As an implementor, it's not clear to me how should the line parquet-mr: Not used directly be interpreted.
Is this recommending doing exactly what parquet-mr does regarding usage of encodings or should implementors use their own judgement or should we follow the prescription of the spec at https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-length-byte-array-delta_length_byte_array--6 regardless of what parquet-mr is doing ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably because DELTA_BYTE_ARRAY should be more efficient in general than DELTA_LENGTH_BYTE_ARRAY, thanks to prefix compression?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be true from the point of view of compression, however the decoder logic is more complicated with DELTA_BYTE_ARRAY and decoding speed could be slower.
If DELTA_BYTE_ARRAY is always the better choice, the spec should instead recommend that DELTA_BYTE_ARRAY should always be preferred over PLAIN for byte array columns and explicitly discourage using DELTA_LENGTH_BYTE_ARRAY directly as an encoding for byte arrays.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the spec should recommend an encoding over another one. It should clearly document the advantages and drawbacks of each encoding, so that implementations and applications can make an informed choice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the spec would simply provide guidelines and not be opinionated consistently in every case, that would be okay too. However, that doesn't appear to be the case today given the above recommendation of DELTA_LENGTH over PLAIN in the current spec. This PR is also highlighting the specific encodings chosen by parquet-mr along with choice of V1/V2 page format, which gives me the impression that any choice than what is mentioned here is not really recommended even if it exists in the spec.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, that doesn't appear to be the case today given the above recommendation of DELTA_LENGTH over PLAIN in the current spec.

This may deserve a separate issue, then.

`FileMetaData` must be set according to the specification.
The following page types are supported:
* Data page V1 (see `DataPageHeader`)
* Dictionary page (see `DictionaryPageHeader`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a consensus about whether DataPageV2 is considered a stable feature now ?
It would help if the spec clearly says whether or not DataPageV2 is finalized and is expected to be supported rather than skipping mention of it here.

values in case of V2 pages
* [DELTA\_BINARY\_PACKED](Encodings.md#delta-encoding-delta_binary_packed--5)
**(?)**
parquet-mr: Used for V2 pages to encode INT32 and INT64 values.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this notes the current status for parquet-mr, it would greatly help to have a clarification in the spec about whether the delta encodings are only allowed with V2 Data pages or if it's legal to use delta encodings in V1 Data pages as well. Or in other words, is the choice of encodings linked in any way to the choice of V1 or V2 in the data page ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU, the choice of data encodings is orthogonal to the choice of V1 or V2 data pages. But different default values may be applied based on which version the user asks for.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the choice of data encodings is orthogonal to the choice of V1 or V2 data pages

That's what I'm hoping for as well, however the spec doesn't say explicitly if it is valid to write delta encodings in V1 pages. The parquet-mr implementation does not use any delta encodings in V1 pages. Keeping this ambiguous in the spec means that someone implementing a writer cannot be sure about whether the various projects implementing readers will consider this a valid input. It's also not clear to me whether parquet-mr has tests guaranteeing that this works given that parquet-mr itself will not do this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I'm hoping for as well, however the spec doesn't say explicitly if it is valid to write delta encodings in V1 pages.

Can you open a separate issue about that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, what's the appropriate place to open issues ? GH issues seem to be disabled on this repo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you need to open a JIRA issue. See here: https://parquet.apache.org/community/

@pitrou
Copy link
Member

pitrou commented May 31, 2023

Wild idea: instead of defining core features, how about rephrasing this in terms of presets?

We could have a growing number of calendar-versioned presets, example:

  • Preset 2023.06 : v2 data pages + delta encodings + ZSTD + Snappy + ZLib (+ logical types etc.)
  • Preset 2024.11 : the former + byte stream split encoding + LZ4_RAW
  • ...

I'm also skeptical that this needs to be advertised in the Thrift metadata. Presets would mostly serve as a guideline for implementations and as an API simplification for users.

(edit: could name this "profiles" instead of "presets" if that's more familiar to users)

@emkornfield
Copy link
Contributor

We could have a growing number of calendar-versioned presets, example:

@pitrou I'm not really familiar with the term presets or profiles. What are the implications across parquet-mr and parquet-cpp for them? How would they be used?

@pitrou
Copy link
Member

pitrou commented Jun 3, 2023

A preset would be a set of features.

On the read side, each implementation would document the presets it's compatible with (meaning they are able to read all the features in the preset).

On the write side, the user would be able to ask for a given preset when writing a Parquet file (potentially using any feature in the preset, but not features outside of it).

@pitrou
Copy link
Member

pitrou commented Jun 3, 2023

For an analogy, you can think of RISC-V profiles or MPEG profiles, but I would suggest Parquet profiles (or presets) to be monotonically increasing for ease of use (i.e. each profile is a superset of any earlier profile).

@jaychia
Copy link

jaychia commented Jul 11, 2023

I've been doing more digging into parquet-format and versioning of the spec has definitely been one of the more confusing pieces. I'm glad that there is an effort to define "core" features, or feature presets.

I made a simple tool that just reads Parquet files and produces a boolean checklist of features that are being used in each file: https://github.com/Eventual-Inc/parquet-benchmarking. Running a tool like this on "data in the wild" has been our approach so far for understanding what features are actively being used by the main Parquet producers (Arrow, Spark, Trino, Impala etc).

It could be useful for developers of Parquet-producing frameworks to start producing Parquet read/write feature compatibility documentation. As a community we can then start to understand which features have been actively adopted and which ones should be considered more experimental or specific to a given framework!


A cool idea might be to have a parquet-features repository where I can contribute Daft (our dataframe framework) code which:

  1. Reads canonical Parquet files that uses certain feature presets
  2. Writes Parquet files to use certain feature presets

The repository can then have automated CI to publish reports on each implementation's supported features.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.