[BUG] Can't open Parquet files created in Spark anymore #121
Comments
This will be hard to resolve without a sample file. Any chance you could create a dummy test file to share here?
Unfortunately, I don't create the files myself, which is also why I can't share one. It's a client's file and I don't have permission to provide it to others. Is there anything else I can do or provide to help, information or otherwise?
Is there perhaps a method you're aware of that I could use to scramble the data itself without changing the format? I think I could use Python for that, but I'm not sure whether it would change the format afterwards.
You mentioned your files are partitioned (20+ files). What happens if you have a parquet file with just one record using the same schema? Do you still get the error in that case?
Scrambling the data would be great, but I'm not sure what the easiest way is besides writing a custom spark-sql query to scramble the data as you fetch it from the source. I asked ChatGPT and it recommended the following randomization code in Scala; I'm not sure if it's good or not, so please take it with a grain of salt 😅: chatgpt.zip
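Since the reporter mentioned Python as an option, here is a minimal hedged sketch of the scrambling idea in pandas: shuffle numeric columns and replace string values with same-length placeholder text, keeping column names and dtypes intact. The `scramble` function and the sample DataFrame are illustrative, not code from the thread, and this simple shuffle is not real anonymization (shuffled numbers still leak the original values).

```python
import numpy as np
import pandas as pd

def scramble(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Return a copy of df with values obscured but the schema preserved."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Shuffle numeric values in place (keeps the dtype).
            out[col] = rng.permutation(out[col].to_numpy())
        else:
            # Replace strings with random same-length placeholder text.
            out[col] = out[col].map(
                lambda v: "".join(rng.choice(list("abcdefgh"), size=len(str(v))))
            )
    return out

df = pd.DataFrame({"id": [1, 2, 3], "name": ["alice", "bob", "carol"]})
scrambled = scramble(df)
# scrambled.to_parquet("scrambled.parquet")  # same schema, fake data
```

As with the one-record idea, re-writing through pandas/pyarrow would change the writer, so a spark-sql query applied before the files are written (as suggested above) is the only way to keep the exact parquet-mr output.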
The latest Spark 3.5.4 comes with parquet-mr version 1.13.1. Can you clarify how you are using 1.14.3? I tried manually updating my JARs to 1.14.3 and afterwards the generated parquet files indeed can't be opened, but Spark doesn't seem to support parquet-mr 1.14.0+ yet.
I also opened this ticket to see if folks in the Parquet.NET repo can help: aloneguid/parquet-dotnet#583
I'm getting the files from a Palantir backend that uses Spark, so I only consume the files; I don't get to generate them. I'm assuming they have updated their version to 1.14.x?
You can see the problem here: the v154 file cannot be opened by this tool, while the other one can. It seems the newer Spark output is not handled correctly in this version. Hope this helps you guys!
@mukunku this is fixed in 5.1.0! |
Parquet Viewer Version
3.1.0, also tried/used 2.8.
Where was the parquet file created?
Apache Spark
Footer metadata:
org.apache.spark.timeZone = GMT
org.apache.spark.legacyINT96
org.apache.spark.version = 3.5.1
org.apache.spark.legacyDateTimeJ
parquet-mr version 1.14.3
Sample File
Cannot share one (client data; see comments).
Describe the bug
Previous/earlier files were openable without issue. Recently created files give the error below when opening, but they are parsed correctly by pandas (Python), and the creators confirmed the files are valid.
Additional context
The dataset is split across 20+ parquet files. Previous files were also split and read without issue, but newer files fail. Was there a change in Apache Spark, or in the parquet file format? The files parse correctly in Python using pandas' read_parquet with the default engine (pyarrow, I believe).
Error message:
Something went wrong
Could not load parquet file.
If the problem persists please consider opening a bug ticket in the project repo: Help → About
ParquetViewer.Engine.Exceptions.FileReadException: Encountered an error reading file.
---> System.IO.IOException: only 34063 out of 6260620 bytes are available
at Parquet.Extensions.StreamExtensions.ReadBytesExactly(Stream s, Int32 count)
at Parquet.Meta.Proto.ThriftCompactProtocolReader.ReadString()
at Parquet.Meta.ColumnChunk.Read(ThriftCompactProtocolReader proto)
at Parquet.Meta.RowGroup.Read(ThriftCompactProtocolReader proto)
at Parquet.Meta.FileMetaData.Read(ThriftCompactProtocolReader proto)
at Parquet.ParquetActor.ReadMetadataAsync(CancellationToken cancellationToken)
at Parquet.ParquetReader.InitialiseAsync(CancellationToken cancellationToken)
at Parquet.ParquetReader.CreateAsync(String filePath, ParquetOptions parquetOptions, CancellationToken cancellationToken)
at ParquetViewer.Engine.ParquetEngine.OpenFileAsync(String parquetFilePath, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at ParquetViewer.Engine.ParquetEngine.OpenFileAsync(String parquetFilePath, CancellationToken cancellationToken)
at ParquetViewer.MainForm.OpenFieldSelectionDialog(Boolean forceOpenDialog)