
[BUG] Can't open Parquet files created in Spark anymore #121

Open
veranscoto opened this issue Nov 11, 2024 · 9 comments

@veranscoto

veranscoto commented Nov 11, 2024

Parquet Viewer Version
3.1.0 (also tried 2.8).

Where was the parquet file created?
Apache Spark
File metadata: org.apache.spark.timeZone = GMT, org.apache.spark.version = 3.5.1, org.apache.spark.legacyINT96, org.apache.spark.legacyDateTimeJ; created by parquet-mr version 1.14.3

Sample File
Cannot share one (it's a client's file).

Describe the bug
Earlier files opened without issue. Recently created files give the error below when opened, but they parse correctly with pandas (Python), and the file creators have confirmed the files are valid.

Additional context
The data is partitioned across 20+ parquet files. Previous deliveries were also partitioned and read without issue, but the newer ones fail. Could this be a change in Apache Spark, or in the parquet file format? The files parse correctly in Python using pandas' read_parquet with the default engine (pyarrow, I believe); see the sketch below.
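For reference, the read that succeeds looks roughly like this (a minimal sketch; the directory path is a placeholder):

import pandas as pd

# pandas delegates to pyarrow by default; pointing it at the output
# directory reads all the part files as a single dataset
df = pd.read_parquet("path/to/parquet_output_dir/")
print(df.shape)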

Error message:

Something went wrong

Could not load parquet file.

If the problem persists please consider opening a bug ticket in the project repo: Help → About

ParquetViewer.Engine.Exceptions.FileReadException: Encountered an error reading file.
 ---> System.IO.IOException: only 34063 out of 6260620 bytes are available
   at Parquet.Extensions.StreamExtensions.ReadBytesExactly(Stream s, Int32 count)
   at Parquet.Meta.Proto.ThriftCompactProtocolReader.ReadString()
   at Parquet.Meta.ColumnChunk.Read(ThriftCompactProtocolReader proto)
   at Parquet.Meta.RowGroup.Read(ThriftCompactProtocolReader proto)
   at Parquet.Meta.FileMetaData.Read(ThriftCompactProtocolReader proto)
   at Parquet.ParquetActor.ReadMetadataAsync(CancellationToken cancellationToken)
   at Parquet.ParquetReader.InitialiseAsync(CancellationToken cancellationToken)
   at Parquet.ParquetReader.CreateAsync(String filePath, ParquetOptions parquetOptions, CancellationToken cancellationToken)
   at ParquetViewer.Engine.ParquetEngine.OpenFileAsync(String parquetFilePath, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at ParquetViewer.Engine.ParquetEngine.OpenFileAsync(String parquetFilePath, CancellationToken cancellationToken)
   at ParquetViewer.MainForm.OpenFieldSelectionDialog(Boolean forceOpenDialog)

@veranscoto veranscoto added the bug label Nov 11, 2024
@mukunku
Owner

mukunku commented Nov 11, 2024

This will be hard to resolve without a sample file. Any chance you could create a dummy test file to share here?

@veranscoto
Author

Unfortunately, I don't create the files myself, which is also why I can't share one. It's a client's file and I don't have permission to provide it to others. Is there anything else I can do or provide to help? Information or otherwise?

@veranscoto
Author

Is there perhaps a method you're aware of that would scramble the data itself without changing the format? I could perhaps use Python, but I'm not sure whether that would change the format afterwards.
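For example, something like the following with pyarrow might work (a rough sketch; paths are placeholders). One caveat: rewriting the data means the output is produced by parquet-cpp (pyarrow's writer) rather than parquet-mr, so the scrambled file might no longer reproduce the bug.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Read the original file to get its schema and row count
table = pq.read_table("path/to/input.parquet")

# Replace each column's values while keeping the exact same schema
scrambled = []
for column, field in zip(table.columns, table.schema):
    n = len(column)
    if pa.types.is_integer(field.type):
        scrambled.append(pa.array(np.random.randint(0, 1_000_000, n), type=field.type))
    elif pa.types.is_floating(field.type):
        scrambled.append(pa.array(np.random.random(n), type=field.type))
    else:
        # For other types, shuffle the existing values within the column
        indices = pa.array(np.random.permutation(n))
        scrambled.append(column.combine_chunks().take(indices))

pq.write_table(pa.Table.from_arrays(scrambled, schema=table.schema), "path/to/scrambled.parquet")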

@mukunku
Owner

mukunku commented Dec 22, 2024

You mentioned your files are partitioned (20+ files). What happens if you write a parquet file with just one record using the same schema? Do you still get the error in that case?

import org.apache.spark.sql.SaveMode

// Read the Parquet file
val df = spark.read.parquet("path/to/your/parquet/file.parquet")

// Select the top 1 record
val topRecord = df.limit(1)

// Write the top record to a new Parquet file
topRecord.write
  .mode(SaveMode.Overwrite)
  .parquet("path/to/output/scrambled.parquet")

Scrambling the data would be great, but I'm not sure of the easiest way besides writing a custom spark-sql query to scramble the data as you fetch it from the source. I asked ChatGPT and it recommended the following randomization code in Scala. I'm not sure if it's good or not, so please take it with a grain of salt 😅: chatgpt.zip

@mukunku mukunku changed the title from [BUG] to [BUG] Can't open Parquet files created in Spark anymore Dec 22, 2024
@mukunku
Owner

mukunku commented Dec 24, 2024

The latest Spark 3.5.4 comes with parquet-mr version 1.13.1. Can you clarify how you are using 1.14.3?

I tried manually updating my JARs to 1.14.3, and afterwards the generated parquet files indeed can't be opened, but Spark doesn't seem to support parquet-mr 1.14.0+ yet.
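If it helps with diagnosing, the writer that produced a file is recorded in the footer's created_by string, which can be read with pyarrow (a small sketch; the path is a placeholder):

import pyarrow.parquet as pq

# The parquet footer records the writer, e.g. "parquet-mr version 1.14.3 (build ...)"
metadata = pq.ParquetFile("path/to/part-00000.parquet").metadata
print(metadata.created_by)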

@mukunku
Owner

mukunku commented Dec 26, 2024

I also opened this ticket to see if folks in the Parquet.NET repo can help: aloneguid/parquet-dotnet#583

@veranscoto
Author

The latest Spark 3.5.4 comes with parquet-mr version 1.13.1. Can you clarify how you are using 1.14.3?

I tried manually updating my JARs to 1.14.3, and afterwards the generated parquet files indeed can't be opened, but Spark doesn't seem to support parquet-mr 1.14.0+ yet.

I'm getting the files from a Palantir backend that uses Spark, so I only consume the files; I don't get to generate them. I'm assuming they have updated their version to 1.14.x?

@mukunku mukunku added the parquet-dotnet-bug label and removed the bug label Jan 9, 2025
@wjoeri

wjoeri commented Jan 28, 2025

fact_codes_example.zip

You can see the problem here: the v154 file cannot be opened by this tool, while the other one can. It seems output from the newer Spark code is not handled correctly in this version. Hope this helps!

@aloneguid

@mukunku this is fixed in 5.1.0!
