
[C++][Parquet] Should we support PARQUET_2_8 version? #35776

Open
mapleFU opened this issue May 26, 2023 · 11 comments
mapleFU (Member) commented May 26, 2023

Describe the enhancement requested

We currently support BYTE_STREAM_SPLIT in Parquet. However, when writing, the highest format version we expose is PARQUET_2_6. So, should we support Parquet 2.8 or a higher version?

Changelogs: https://github.com/apache/parquet-format/blob/master/CHANGES.md#version-280

Component(s)

C++, Parquet

mapleFU (Member Author) commented May 26, 2023

cc @pitrou @wgtmac

wgtmac (Member) commented May 26, 2023

To provide some facts:

It seems that WriterProperties does not check whether an enabled feature belongs to the selected format version. I have only observed a few places where the version is checked:

} else if ((version == ParquetVersion::PARQUET_1_0 ||
            version == ParquetVersion::PARQUET_2_4) &&
           source_type.unit() == ::arrow::TimeUnit::NANO) {
  // Absent superseding user instructions, when writing Parquet version <= 2.4 files,
  // timestamps in nanoseconds are coerced to microseconds
  std::shared_ptr<ArrowWriterProperties> properties =
      (ArrowWriterProperties::Builder())
          .coerce_timestamps(::arrow::TimeUnit::MICRO)
          ->disallow_truncated_timestamps()
          ->build();
  return WriteCoerce(properties.get());

if (properties_->version() == ParquetVersion::PARQUET_1_0) {
  thrift_encodings.push_back(ToThrift(Encoding::PLAIN));
} else {
  thrift_encodings.push_back(ToThrift(properties_->dictionary_page_encoding()));

if (properties.version() == ::parquet::ParquetVersion::PARQUET_1_0) {
  type = ParquetType::INT64;
} else {
  type = ParquetType::INT32;
  logical_type = LogicalType::Int(32, false);

parquet-mr does not recognize any 2.x format version and always hardcodes version 1 in the footer metadata:

https://github.com/apache/parquet-mr/blob/d5a4ce05643e6709312e3060838cb9236e882014/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1339-L1343

wgtmac (Member) commented May 26, 2023

My question is: should we check whether any enabled feature is beyond the support of the specified format version? If yes, should we support deducing the version from the enabled feature set? It is not easy for a user to know which format version to set, but it is much easier to know which features are needed.

wgtmac self-assigned and then unassigned this on May 26, 2023
mapleFU (Member Author) commented May 26, 2023

Hmmm, I went through the Rust implementation and found that it just uses "1.0" or "2.0". All implementations use different ad hoc ways of setting this...
And a user may already be using BYTE_STREAM_SPLIT with 2.6 in Arrow; I guess it would be a disaster if we really started checking this...

wgtmac (Member) commented May 26, 2023

How about adding a Status WriterProperties::validate_format()? This won't break current users but provides a way to check format integrity.

wgtmac (Member) commented May 26, 2023

cc @emkornfield

jorisvandenbossche (Member) commented May 26, 2023

Hmmm, I went through the Rust implementation and found that it just uses "1.0" or "2.0". All implementations use different ad hoc ways of setting this...

Yeah, this "version" field in the footer metadata is not very well specified. See also related discussion at apache/parquet-format#164 (comment)

pitrou (Member) commented May 26, 2023

How about adding a Status WriterProperties::validate_format()? This won't break current users but provides a way to check format integrity.

Something like that could be useful, yes.

mapleFU (Member Author) commented Jun 6, 2023

I guess it's a bit hard, because the checking is scattered across different places...

  1. Under FieldToNode in src/parquet/arrow/schema.cc:

     case ArrowTypeId::TIMESTAMP:
       RETURN_NOT_OK(
           GetTimestampMetadata(static_cast<::arrow::TimestampType&>(*field->type()),
                                properties, arrow_properties, &type, &logical_type));
       break;

  2. When writing timestamps:

     } else if ((version == ParquetVersion::PARQUET_1_0 ||
                 version == ParquetVersion::PARQUET_2_4) &&
                source_type.unit() == ::arrow::TimeUnit::NANO) {
       // Absent superseding user instructions, when writing Parquet version <= 2.4 files,
       // timestamps in nanoseconds are coerced to microseconds
       std::shared_ptr<ArrowWriterProperties> properties =
           (ArrowWriterProperties::Builder())
               .coerce_timestamps(::arrow::TimeUnit::MICRO)
               ->disallow_truncated_timestamps()
               ->build();
       return WriteCoerce(properties.get());

I guess we need a validate_format like:

validate_format(const WriterProperties& properties, const ArrowWriterProperties& arrow_properties, Schema);

wgtmac (Member) commented Jun 6, 2023

I see the problem. IMO, we can add a new option in WriterProperties to enable format validation and call the validate_format you proposed when creating the Parquet writer? @mapleFU

mapleFU (Member Author) commented Jun 9, 2023

I'll try to add PARQUET_2_9, add BYTE_STREAM_SPLIT checking, and try not to break the previous implementation.
