
[C++][Parquet] Should we support PARQUET_2_8 version? #35776

Open
mapleFU opened this issue May 26, 2023 · 11 comments
mapleFU (Member) commented May 26, 2023

Describe the enhancement requested

We currently support BYTE_STREAM_SPLIT in Parquet. However, when writing, the highest format version we expose is PARQUET_2_6. So, should we support Parquet 2.8 or a higher version?

Changelogs: https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-280

Component(s)

C++, Parquet

mapleFU (Member Author) commented May 26, 2023

cc @pitrou @wgtmac

wgtmac (Member) commented May 26, 2023

To provide some facts:

It seems that WriterProperties does not check whether an enabled feature belongs to the selected format version. I have only observed a few places where the version is checked:

} else if ((version == ParquetVersion::PARQUET_1_0 ||
            version == ParquetVersion::PARQUET_2_4) &&
           source_type.unit() == ::arrow::TimeUnit::NANO) {
  // Absent superseding user instructions, when writing Parquet version <= 2.4 files,
  // timestamps in nanoseconds are coerced to microseconds
  std::shared_ptr<ArrowWriterProperties> properties =
      (ArrowWriterProperties::Builder())
          .coerce_timestamps(::arrow::TimeUnit::MICRO)
          ->disallow_truncated_timestamps()
          ->build();
  return WriteCoerce(properties.get());

if (properties_->version() == ParquetVersion::PARQUET_1_0) {
  thrift_encodings.push_back(ToThrift(Encoding::PLAIN));
} else {
  thrift_encodings.push_back(ToThrift(properties_->dictionary_page_encoding()));

if (properties.version() == ::parquet::ParquetVersion::PARQUET_1_0) {
  type = ParquetType::INT64;
} else {
  type = ParquetType::INT32;
  logical_type = LogicalType::Int(32, false);

parquet-mr does not recognize any 2.x format version and always hardcodes version 1 in the footer metadata:

https://github.com/apache/parquet-mr/blob/d5a4ce05643e6709312e3060838cb9236e882014/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1339-L1343

wgtmac (Member) commented May 26, 2023

My question is: should we check whether any enabled feature is beyond the support of the specified format version? If yes, should we support deducing the version from the enabled feature set? It is not easy for a user to know which format version to set, but it is much easier to know which features are needed.

wgtmac self-assigned and then unassigned this on May 26, 2023
mapleFU (Member Author) commented May 26, 2023

Hmmm, I went through the Rust implementation and found that it just uses "1.0" or "2.0". All implementations use different ad hoc ways of setting this...
And a user may already be using BYTE_STREAM_SPLIT with 2.6 in Arrow; I guess it would be a disaster if we really started checking this...

wgtmac (Member) commented May 26, 2023

How about adding a Status WriterProperties::validate_format()? This won't break current users but provides a way to check format integrity.

wgtmac (Member) commented May 26, 2023

cc @emkornfield

jorisvandenbossche (Member) commented May 26, 2023

Hmmm, I went through the Rust implementation and found that it just uses "1.0" or "2.0". All implementations use different ad hoc ways of setting this...

Yeah, this "version" field in the footer metadata is not very well specified. See also related discussion at apache/parquet-format#164 (comment)

pitrou (Member) commented May 26, 2023

How about adding a Status WriterProperties::validate_format()? This won't break current users but provides a way to check format integrity.

Something like that could be useful, yes.

mapleFU (Member Author) commented Jun 6, 2023

I guess it's a bit hard, because the checking is scattered across different places...

  1. Under FieldToNode in src/parquet/arrow/schema.cc:

     case ArrowTypeId::TIMESTAMP:
       RETURN_NOT_OK(
           GetTimestampMetadata(static_cast<::arrow::TimestampType&>(*field->type()),
                                properties, arrow_properties, &type, &logical_type));
       break;

  2. When writing timestamps:

     } else if ((version == ParquetVersion::PARQUET_1_0 ||
                 version == ParquetVersion::PARQUET_2_4) &&
                source_type.unit() == ::arrow::TimeUnit::NANO) {
       // Absent superseding user instructions, when writing Parquet version <= 2.4 files,
       // timestamps in nanoseconds are coerced to microseconds
       std::shared_ptr<ArrowWriterProperties> properties =
           (ArrowWriterProperties::Builder())
               .coerce_timestamps(::arrow::TimeUnit::MICRO)
               ->disallow_truncated_timestamps()
               ->build();
       return WriteCoerce(properties.get());

I guess we need a validate_format like:

validate_format(const WriterProperties& properties, const ArrowWriterProperties& arrow_properties, Schema);

wgtmac (Member) commented Jun 6, 2023

I see the problem. IMO, we can add a new option in WriterProperties to enable format validation and call the validate_format you proposed when creating the Parquet writer? @mapleFU

mapleFU (Member Author) commented Jun 9, 2023

I'll try to add PARQUET_2_9, add BYTE_STREAM_SPLIT checking, and try not to break the previous implementation.
