-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-13794: [C++] Deprecate PARQUET_VERSION_2_0 #11031
Conversation
ParquetVersion_V2" parquet::ParquetVersion::PARQUET_2_0" | ||
ParquetVersion_V2_0" parquet::ParquetVersion::PARQUET_2_0" | ||
ParquetVersion_V2_4" parquet::ParquetVersion::PARQUET_2_4" | ||
ParquetVersion_V2_6" parquet::ParquetVersion::PARQUET_2_6" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should latest be exposed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now it's not being used in the PyArrow code. We can expose it when needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a user point of view, would it be useful to have this though? That you can specify version="latest"
or version="2.x"
or something to always get the latest version? (although that might not be good practice to do anyway)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know. As you say, it may not be good practice. If people need features from a specific version, they should probably use that version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can always add this later, in case there is demand for it / a good use case.
A few small nits but LGTM. Thanks for cleaning this up. |
Introduce PARQUET_VERSION_2_4 and PARQUET_VERSION_2_6 which allow fine-grained selection of post-1.0 Parquet format features.
cb54ed6
to
279778a
Compare
@jorisvandenbossche Can you check the Python changes here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for doing this!
Mostly some minor docstring comments.
cpp/src/parquet/metadata.h
Outdated
/// | ||
/// The Parquet file metadata stores the version as a single integer, | ||
/// therefore the returned value can only be an approximation of the | ||
/// Parquet format version actually supported by the producer of the file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sidenote: It's also not very clear in general what this field is for (cfr the core features discussion at apache/parquet-format#164 (comment), parquet-mr always uses 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow. Thanks for the pointer. We may want to deprecate this API as a separate JIRA, then...
cpp/src/parquet/type_fwd.h
Outdated
/// Enable only pre-2.0 Parquet format features when writing | ||
/// | ||
/// This setting is useful for maximum compatibility with legacy readers. | ||
PARQUET_1_0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know into what detail we want to go here, but this is not "exactly" correct that it only enables pre-2.0 features. We currently already do write for example TIMESTAMP_MILLIS with version="1.0"
, although strictly speaking this was only introduced in Parquet format 2.2. I think it are only the several integer logical types that we don't write with version 1.0 (as those can be interpreted incorrectly if you don't understand the type annotation, since they are physically stored as int32 or int64; which is what matters most for compatibility with older readers).
So I think from a practical point of view, you can ignore the above nitpick :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. The problem is that, as for PARQUET_2_0, the set of features enabled does not map to a well-defined version. That said, I may write "pre-2.2" instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIMESTAMP_MILLIS is actually only from 2.2, so "pre 2.2" doesn't help that much ... (but, yeah, this is a mess :))
python/pyarrow/parquet.py
Outdated
case ``allow_truncated_timestamps=True`` can be used to suppress the | ||
raised exception. | ||
microseconds ('us'), and seconds to milliseconds ('ms') by default. | ||
For ``version='2.4'``, nanoseconds will be cast to microseconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This didn't change compared to version=1.0 (the sentence before)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you're right. Parquet doesn't have seconds.
python/pyarrow/parquet.py
Outdated
raised exception. | ||
microseconds ('us'), and seconds to milliseconds ('ms') by default. | ||
For ``version='2.4'``, nanoseconds will be cast to microseconds. | ||
For ``version='2.6'``, the original resolution is always preserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe clarify that this sentence is only about nanoseconds? (I think seconds are still cast to milliseconds?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, seconds are still cast.
ParquetVersion_V2" parquet::ParquetVersion::PARQUET_2_0" | ||
ParquetVersion_V2_0" parquet::ParquetVersion::PARQUET_2_0" | ||
ParquetVersion_V2_4" parquet::ParquetVersion::PARQUET_2_4" | ||
ParquetVersion_V2_6" parquet::ParquetVersion::PARQUET_2_6" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a user point of view, would it be useful to have this though? That you can specify version="latest"
or version="2.x"
or something to always get the latest version? (although that might not be good practice to do anyway)
python/pyarrow/parquet.py
Outdated
Files written with version='2.4' or '2.6' may not be readable in all | ||
Parquet implementations, so version='1.0' is likely the choice that | ||
maximizes file compatibility. | ||
Logical types and UINT32 are only available with version '2.4'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually write logical types for version 1.0 ..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, thanks. This is a mess.
@jorisvandenbossche Could you take a look at the updated wording? |
of resolution. Seconds are always cast to milliseconds ('ms') by default, | ||
as Parquet does not have any temporal type with seconds resolution. | ||
If the casting results in loss of data, it will raise an exception | ||
unless ``allow_truncated_timestamps=True`` is given. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very clear now!
Introduce PARQUET_VERSION_2_4 and PARQUET_VERSION_2_6 which allow fine-grained selection of post-1.0 Parquet format features. Closes apache#11031 from pitrou/ARROW-13794-parquet-version Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Introduce PARQUET_VERSION_2_4 and PARQUET_VERSION_2_6 which allow fine-grained selection of post-1.0 Parquet format features.