
[BUG] Can't open Parquet files created in Spark anymore #121

Open
veranscoto opened this issue Nov 11, 2024 · 9 comments

@veranscoto

veranscoto commented Nov 11, 2024

Parquet Viewer Version
3.1.0 (also tried 2.8).

Where was the parquet file created?
Apache Spark
File metadata: org.apache.spark.timeZone = GMT, org.apache.spark.version = 3.5.1, org.apache.spark.legacyINT96, org.apache.spark.legacyDateTimeJ; created by parquet-mr version 1.14.3

Sample File
Cannot share one (it's a client's file).

Describe the bug
Earlier files opened without issue. Recently created files give the error below when opened, but they parse correctly with pandas (Python), and the file creators have confirmed the files are valid.

Additional context
The data is partitioned across 20+ parquet files. Previous deliveries were also partitioned and read without issue, but the newer ones fail. Could this be a change in Apache Spark, or in the parquet file format? The files parse correctly in Python using pandas' read_parquet with the default engine (pyarrow, I believe); see the sketch below.
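For reference, the read that succeeds looks roughly like this (a minimal sketch; the directory path is a placeholder):

import pandas as pd

# pandas delegates to pyarrow by default; pointing it at the output
# directory reads all the part files as a single dataset
df = pd.read_parquet("path/to/parquet_output_dir/")
print(df.shape)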

Error message:

Something went wrong

Could not load parquet file.

If the problem persists please consider opening a bug ticket in the project repo: Help → About

ParquetViewer.Engine.Exceptions.FileReadException: Encountered an error reading file.
 ---> System.IO.IOException: only 34063 out of 6260620 bytes are available
   at Parquet.Extensions.StreamExtensions.ReadBytesExactly(Stream s, Int32 count)
   at Parquet.Meta.Proto.ThriftCompactProtocolReader.ReadString()
   at Parquet.Meta.ColumnChunk.Read(ThriftCompactProtocolReader proto)
   at Parquet.Meta.RowGroup.Read(ThriftCompactProtocolReader proto)
   at Parquet.Meta.FileMetaData.Read(ThriftCompactProtocolReader proto)
   at Parquet.ParquetActor.ReadMetadataAsync(CancellationToken cancellationToken)
   at Parquet.ParquetReader.InitialiseAsync(CancellationToken cancellationToken)
   at Parquet.ParquetReader.CreateAsync(String filePath, ParquetOptions parquetOptions, CancellationToken cancellationToken)
   at ParquetViewer.Engine.ParquetEngine.OpenFileAsync(String parquetFilePath, CancellationToken cancellationToken)
   --- End of inner exception stack trace ---
   at ParquetViewer.Engine.ParquetEngine.OpenFileAsync(String parquetFilePath, CancellationToken cancellationToken)
   at ParquetViewer.MainForm.OpenFieldSelectionDialog(Boolean forceOpenDialog)

@veranscoto veranscoto added the bug label Nov 11, 2024
@mukunku
Owner

mukunku commented Nov 11, 2024

This will be hard to resolve without a sample file. Any chance you could create a dummy test file to share here?

@veranscoto
Author

Unfortunately, I don't create the files myself, which is also why I can't share one. It's a client's file and I don't have permission to provide it to others. Is there anything else I can do or provide to help? Information or otherwise?

@veranscoto
Author

Is there perhaps a method you're aware of that would scramble the data itself without changing the format? I could perhaps use Python, but I'm not sure whether that would change the format afterwards.
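For example, something like the following with pyarrow might work (a rough sketch; paths are placeholders). One caveat: rewriting the data means the output is produced by parquet-cpp (pyarrow's writer) rather than parquet-mr, so the scrambled file might no longer reproduce the bug.

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Read the original file to get its schema and row count
table = pq.read_table("path/to/input.parquet")

# Replace each column's values while keeping the exact same schema
scrambled = []
for column, field in zip(table.columns, table.schema):
    n = len(column)
    if pa.types.is_integer(field.type):
        scrambled.append(pa.array(np.random.randint(0, 1_000_000, n), type=field.type))
    elif pa.types.is_floating(field.type):
        scrambled.append(pa.array(np.random.random(n), type=field.type))
    else:
        # For other types, shuffle the existing values within the column
        indices = pa.array(np.random.permutation(n))
        scrambled.append(column.combine_chunks().take(indices))

pq.write_table(pa.Table.from_arrays(scrambled, schema=table.schema), "path/to/scrambled.parquet")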

@mukunku
Owner

mukunku commented Dec 22, 2024

You mentioned your files are partitioned (20+ files). What happens if you write a parquet file with just one record using the same schema? Do you still get the error in that case?

import org.apache.spark.sql.SaveMode

// Read the Parquet file
val df = spark.read.parquet("path/to/your/parquet/file.parquet")

// Select the top 1 record
val topRecord = df.limit(1)

// Write the top record to a new Parquet file
topRecord.write
  .mode(SaveMode.Overwrite)
  .parquet("path/to/output/scrambled.parquet")

Scrambling the data would be great, but I'm not sure of the easiest way besides writing a custom spark-sql query to scramble the data as you fetch it from the source. I asked ChatGPT and it recommended the following randomization code in Scala. I'm not sure if it's good or not, so please take it with a grain of salt 😅: chatgpt.zip

@mukunku mukunku changed the title from [BUG] to [BUG] Can't open Parquet files created in Spark anymore Dec 22, 2024
@mukunku
Owner

mukunku commented Dec 24, 2024

The latest Spark 3.5.4 comes with parquet-mr version 1.13.1. Can you clarify how you are using 1.14.3?

I tried manually updating my JARs to 1.14.3, and afterwards the generated parquet files indeed can't be opened, but Spark doesn't seem to support parquet-mr 1.14.0+ yet.
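If it helps with diagnosing, the writer that produced a file is recorded in the footer's created_by string, which can be read with pyarrow (a small sketch; the path is a placeholder):

import pyarrow.parquet as pq

# The parquet footer records the writer, e.g. "parquet-mr version 1.14.3 (build ...)"
metadata = pq.ParquetFile("path/to/part-00000.parquet").metadata
print(metadata.created_by)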

@mukunku
Owner

mukunku commented Dec 26, 2024

I also opened this ticket to see if folks in the Parquet.NET repo can help: aloneguid/parquet-dotnet#583

@veranscoto
Author

The latest Spark 3.5.4 comes with parquet-mr version 1.13.1. Can you clarify how you are using 1.14.3?

I tried manually updating my JARs to 1.14.3, and afterwards the generated parquet files indeed can't be opened, but Spark doesn't seem to support parquet-mr 1.14.0+ yet.

I'm getting the files from a Palantir backend that uses Spark, so I only consume the files; I don't get to generate them. I'm assuming they have updated their version to 1.14.x?

@mukunku mukunku added the parquet-dotnet-bug label and removed the bug label Jan 9, 2025
@wjoeri

wjoeri commented Jan 28, 2025

fact_codes_example.zip

You can see the problem here: the v154 file cannot be opened by this tool, while the other one can. It seems output from the newer Spark code is not handled correctly in this version. Hope this helps!

@aloneguid

@mukunku this is fixed in 5.1.0!
