[C++][Python] Switch default Parquet version to 2.4 #28022
Comments
Antoine Pitrou / @pitrou: |
Antoine Pitrou / @pitrou:
|
Joris Van den Bossche / @jorisvandenbossche: See also this overview of which converted/logical types were added in which versions: https://nbviewer.jupyter.org/gist/jorisvandenbossche/3cc9942eaffb53564df65395e5656702 (for types, not for encodings). My conclusion in that email thread was also that NANOS might be problematic to enable by default already (I don't know what the status of this feature is in other implementations). Another option could also be to have a
There is indeed no spec about this; there was some discussion about it on the "core features" PR: apache/parquet-format#164 (comment) |
Antoine Pitrou / @pitrou: |
Micah Kornfield / @emkornfield: Thank you for the analysis. Based on it, I would suggest that instead of 1.9 we try to make this value correspond with a release (introduce a 2.3 and 2.5), if we don't think it will make the code too horrendous. |
As I said above, we want "2.0" to still enable all features. But the additional version must be lower than "2.0", because it will enable only some of the features. |
Joris Van den Bossche / @jorisvandenbossche: |
That's already the case. What change are you suggesting? |
Joris Van den Bossche / @jorisvandenbossche: But anyway, that's only a naming discussion, and both ways to name the version have pros and cons. The main discussion point is whether we need such an additional version number to have more fine-grained control over which features are used. If that makes it easier to make "1.9"/"2.4" the default, then I think that's a good idea. |
Antoine Pitrou / @pitrou: |
Antoine Pitrou / @pitrou: Also, I am not convinced we need more fine-grained feature selection. That's more control than most people want to have. My primary concern here is that people don't get a completely outdated feature set (no UINT32!) by default. |
Micah Kornfield / @emkornfield: |
Antoine Pitrou / @pitrou: |
Micah Kornfield / @emkornfield:
|
Antoine Pitrou / @pitrou: |
Antoine Pitrou / @pitrou: |
Joris Van den Bossche / @jorisvandenbossche: On another note, this is still tagged as 4.0. But it might not be the best feature to switch just before the release. It might be safer to switch directly after the 4.0 release, so we have some time to gather feedback? (although that depends on how many people use the dev version, of course ..) |
Antoine Pitrou / @pitrou: |
Antoine Pitrou / @pitrou: |
Jorge Leitão / @jorgecarleitao: For the data pages, I do not think there are so many differences between 1 and 2, right? It is mostly where the compression is applied and where the byte length of the def and rep levels is declared (in the page data or in the header). So, in that context, keeping data pages v1 by default seems OK. |
Micah Kornfield / @emkornfield: |
Joris Van den Bossche / @jorisvandenbossche: But +1 on Micah's proposal to notify the mailing list (and put it in the release notes for 5.0.0 maybe as well) that we plan to switch in the next release. |
Antoine Pitrou / @pitrou: |
Micah Kornfield / @emkornfield: |
Antoine Pitrou / @pitrou: |
Micah Kornfield / @emkornfield: |
Joris Van den Bossche / @jorisvandenbossche: |
Micah Kornfield / @emkornfield: |
Antoine Pitrou / @pitrou: |
Raúl Cumplido / @raulcd: |
Antoine Pitrou / @pitrou: |
Micah Kornfield / @emkornfield:
This will still potentially cause issues with imports into BQ if unsigned types are used I think. I think the project has generally been pretty patient, so I understand if there is a strong desire to move forward with it. Will can probably give a better timeline on when BQ would be able to handle the logical types. |
Krisztian Szucs / @kszucs: |
Antoine Pitrou / @pitrou: |
Antoine Pitrou / @pitrou: |
Currently, Parquet write APIs default to maximum-compatibility Parquet version "1.0", which disables some logical types such as UINT32. We may want to switch the default to "2.0" instead, to allow faithful representation of more types.
Reporter: Antoine Pitrou / @pitrou
Assignee: Raúl Cumplido / @raulcd
Note: This issue was originally created as ARROW-12203. Please see the migration documentation for further details.