Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13794: [C++] Deprecate PARQUET_VERSION_2_0 #11031

Closed
wants to merge 5 commits into from

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Aug 30, 2021

Introduce PARQUET_VERSION_2_4 and PARQUET_VERSION_2_6 which allow fine-grained selection of post-1.0 Parquet format features.

@github-actions
Copy link

ParquetVersion_V2" parquet::ParquetVersion::PARQUET_2_0"
ParquetVersion_V2_0" parquet::ParquetVersion::PARQUET_2_0"
ParquetVersion_V2_4" parquet::ParquetVersion::PARQUET_2_4"
ParquetVersion_V2_6" parquet::ParquetVersion::PARQUET_2_6"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should latest be exposed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now it's not being used in the PyArrow code. We can expose it when needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a user point of view, would it be useful to have this though? That you can specify version="latest" or version="2.x" or something to always get the latest version? (although that might not be good practice to do anyway)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. As you say, it may not be good practice. If people need features from a specific version, they should probably use that version?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can always add this later, in case there is demand for it / a good use case.

@emkornfield
Copy link
Contributor

A few small nits but LGTM. Thanks for cleaning this up.

pitrou added 3 commits August 31, 2021 10:17
Introduce PARQUET_VERSION_2_4 and PARQUET_VERSION_2_6 which allow fine-grained selection of post-1.0 Parquet format features.
@pitrou pitrou force-pushed the ARROW-13794-parquet-version branch from cb54ed6 to 279778a Compare August 31, 2021 08:21
@pitrou
Copy link
Member Author

pitrou commented Aug 31, 2021

@jorisvandenbossche Can you check the Python changes here?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for doing this!
Mostly some minor docstring comments.

///
/// The Parquet file metadata stores the version as a single integer,
/// therefore the returned value can only be an approximation of the
/// Parquet format version actually supported by the producer of the file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sidenote: It's also not very clear in general what this field is for (cfr the core features discussion at apache/parquet-format#164 (comment), parquet-mr always uses 1)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow. Thanks for the pointer. We may want to deprecate this API as a separate JIRA, then...

Comment on lines 34 to 37
/// Enable only pre-2.0 Parquet format features when writing
///
/// This setting is useful for maximum compatibility with legacy readers.
PARQUET_1_0,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know into what detail we want to go here, but this is not "exactly" correct that it only enables pre-2.0 features. We currently already do write for example TIMESTAMP_MILLIS with version="1.0", although strictly speaking this was only introduced in Parquet format 2.2. I think it are only the several integer logical types that we don't write with version 1.0 (as those can be interpreted incorrectly if you don't understand the type annotation, since they are physically stored as int32 or int64; which is what matters most for compatibility with older readers).

So I think from a practical point of view, you can ignore the above nitpick :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. The problem is that, as for PARQUET_2_0, the set of features enabled does not map to a well-defined version. That said, I may write "pre-2.2" instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIMESTAMP_MILLIS is actually only from 2.2, so "pre 2.2" doesn't help that much ... (but, yeah, this is a mess :))

case ``allow_truncated_timestamps=True`` can be used to suppress the
raised exception.
microseconds ('us'), and seconds to milliseconds ('ms') by default.
For ``version='2.4'``, nanoseconds will be cast to microseconds.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This didn't change compared to version=1.0 (the sentence before)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, you're right. Parquet doesn't have seconds.

raised exception.
microseconds ('us'), and seconds to milliseconds ('ms') by default.
For ``version='2.4'``, nanoseconds will be cast to microseconds.
For ``version='2.6'``, the original resolution is always preserved.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe clarify that this sentence is only about nanoseconds? (I think seconds are still cast to milliseconds?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, seconds are still cast.

ParquetVersion_V2" parquet::ParquetVersion::PARQUET_2_0"
ParquetVersion_V2_0" parquet::ParquetVersion::PARQUET_2_0"
ParquetVersion_V2_4" parquet::ParquetVersion::PARQUET_2_4"
ParquetVersion_V2_6" parquet::ParquetVersion::PARQUET_2_6"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a user point of view, would it be useful to have this though? That you can specify version="latest" or version="2.x" or something to always get the latest version? (although that might not be good practice to do anyway)

Files written with version='2.4' or '2.6' may not be readable in all
Parquet implementations, so version='1.0' is likely the choice that
maximizes file compatibility.
Logical types and UINT32 are only available with version '2.4'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually write logical types for version 1.0 ..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, thanks. This is a mess.

@pitrou
Copy link
Member Author

pitrou commented Aug 31, 2021

@jorisvandenbossche Could you take a look at the updated wording?

of resolution. Seconds are always cast to milliseconds ('ms') by default,
as Parquet does not have any temporal type with seconds resolution.
If the casting results in loss of data, it will raise an exception
unless ``allow_truncated_timestamps=True`` is given.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clear now!

@pitrou pitrou closed this in 542e81b Aug 31, 2021
@pitrou pitrou deleted the ARROW-13794-parquet-version branch August 31, 2021 12:51
zeroshade pushed a commit to zeroshade/arrow that referenced this pull request Sep 12, 2021
Introduce PARQUET_VERSION_2_4 and PARQUET_VERSION_2_6 which allow fine-grained selection of post-1.0 Parquet format features.

Closes apache#11031 from pitrou/ARROW-13794-parquet-version

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants