Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Parquet float/double statistic is wrong when float/double column contains NaN #13948

Closed
res-life opened this issue Aug 24, 2023 · 4 comments
Labels
bug Something isn't working cuIO cuIO issue

Comments

@res-life
Copy link
Contributor

Describe the bug
Parquet float/double statistic is wrong when float/double column contains NaN.
For example, a double column contains 2 values [NaN, 1.0d].

The CPU org.apache.parquet.format.Statistics: min_value 1.0, max_value: NaN.

max_value = b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'

The GPU org.apache.parquet.format.Statistics: min_value 1.0, max_value: 1.0.

max_value = b'\x00\x00\x00\x00\x00\x00\xf0?'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'

Steps/Code to reproduce bug
CPU:

import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

public class TestDoubleStatistic {

  @Test
  public void testDouble() throws IOException {
    MessageType schema = MessageTypeParser.parseMessageType(
      "message schema {\n" +
        "    repeated double d;\n" +
        "}");

    File file = new File("/tmp/test-001.parquet");
    if (file.exists()) {
      file.delete();
    }
    Path fsPath = new Path(file.getAbsolutePath());
    SimpleGroupFactory factory = new SimpleGroupFactory(schema);

    ParquetWriter writer = ExampleParquetWriter.builder(fsPath)
      .withType(schema)
      .build();

    Group group = factory.newGroup()
      .append("d", Double.NaN)
      .append("d", 1.0d);
    writer.write(group);
    writer.close();

    System.out.println("ok");
  }
}

parquet-tools inspect --detail /tmp/test-001.parquet

max_value = b'\x00\x00\x00\x00\x00\x00\xf8\x7f'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'

GPU:

TEST_F(MyDebugTests, TestDoubleStatistic)
{
  auto nan = 0.0 / 0.0;
  auto d = std::vector<double> {nan, 1.0};
  float64_col dc(d.begin(), d.end());
  cudf::table_view expected({dc});

  cudf::io::table_input_metadata expected_metadata(expected);
  expected_metadata.column_metadata[0].set_name("c1");

  auto filepath = "/tmp/TestDoubleStatistic.parquet";
  cudf::io::parquet_writer_options out_opts =
    cudf::io::parquet_writer_options::builder(cudf::io::sink_info{filepath}, expected)
      .metadata(expected_metadata);
  cudf::io::write_parquet(out_opts);
}

parquet-tools inspect --detail TestDoubleStatistic.parquet

max_value = b'\x00\x00\x00\x00\x00\x00\xf0?'
min_value = b'\x00\x00\x00\x00\x00\x00\xf0?'

Expected behavior
Make float/double stat consistent with CPU.

Environment overview (please complete the following information)

Environment details
cuDF: branch-23.10
parquet: apache-parquet-1.12.2

Additional context
parquet-tools link

Parquet convert from org.apache.parquet.format.Statistics to org.apache.parquet.column.statistics:

https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.2/parquet-column/src/main/java/org/apache/parquet/column/statistics/Statistics.java#L123-L126

        if (min.isNaN() || max.isNaN()) {
          stats.setMinMax(0.0, 0.0);
          ((Statistics<?>) stats).hasNonNullValue = false;
        }

If there are NaN, then org.apache.parquet.column.statistics min/max are converted to 0.0/0.0 and make they are invalid:

hasNonNullValue = false

So GPU should make sure min/max contains a NaN.

@res-life res-life added bug Something isn't working Needs Triage Need team to review and classify cuIO cuIO issue labels Aug 24, 2023
@etseidl
Copy link
Contributor

etseidl commented Sep 4, 2023

Statistics behavior in the presence of NaN values is currently in flux. See the discussion in apache/parquet-format#196

From the current parquet thrift specification:

   *     When writing statistics the following rules should be followed:
   *     - NaNs should not be written to min or max statistics fields.
   *     - If the computed max value is zero (whether negative or positive),
   *       `+0.0` should be written into the max statistics field.
   *     - If the computed min value is zero (whether negative or positive),
   *       `-0.0` should be written into the min statistics field.

@res-life
Copy link
Contributor Author

res-life commented Sep 13, 2023

@etseidl You are right, but it's a new version behavior. And there are also an extra field nan_count to record the nan count in the new version. For the older version of Parquet, there are no nan_count, so the problem is GPU does not follow the old version and not follow the new version. The CPU file in the reproduce steps is generated by Parquet API, I think GPU should follow the same behavior.

Actually, CPU Spark 330 returns wrong result when reading the GPU file with spark.sql.parquet.aggregatePushdown enabled.

# enable agg push down
spark.conf.set("spark.sql.parquet.aggregatePushdown", "true") 

# use V2 datasource
spark.conf.set("spark.sql.sources.useV1SourceList", "") 

df = spark.read.parquet("/tmp/my-parquet")
df.createOrReplaceTempView("tab")

# agg push down, directly use the max stat, it's wrong.
spark.sql("select max(v) from tab").show()
+------+
|max(v)|
+------+
|   2.0|
+------+

spark.read.parquet("/tmp/my-parquet").show() # show all values
+---+
|  v|
+---+
|1.0|
|2.0|
|NaN|
+---+

#add additinal filter will disable this agg push down, then gets correct result.
spark.sql("select max(v) from tab where v != 1.0").show()  
+------+
|max(v)|
+------+
|   NaN|
+------+

In scala, the min/max value are NaN for values (1.0, 2.0, NaN)

scala> val array = Array(1.0d, 2.0d, Double.NaN)
array: Array[Double] = Array(1.0, 2.0, NaN)

scala> array.max
res2: Double = NaN

scala> array.min
res3: Double = NaN

The GPU generated file:
a.parquet.zip

parquet-cli meta a.parquet

Row group 0:  count: 3  14.00 B records  start: 4  total: 42 B
--------------------------------------------------------------------------------
   type      encodings count     avg size   nulls   min / max
v  DOUBLE    S RB_     3         14.00 B    0       "1.0" / "2.0"

@revans2

@etseidl
Copy link
Contributor

etseidl commented Sep 13, 2023

@res-life I mentioned the new discussion of NaN handling to say things might change at some point in the future. But given the current parquet specification, here's what it has to say about floating point ordering:

   * (*) Because the sorting order is not specified properly for floating
   *     point values (relations vs. total ordering) the following
   *     compatibility rules should be applied when reading statistics:
   *     - If the min is a NaN, it should be ignored.
   *     - If the max is a NaN, it should be ignored.
   *     - If the min is +0, the row group may contain -0 values as well.
   *     - If the max is -0, the row group may contain +0 values as well.
   *     - When looking for NaN values, min and max should be ignored.
   * 
   *     When writing statistics the following rules should be followed:
   *     - NaNs should not be written to min or max statistics fields.
   *     - If the computed max value is zero (whether negative or positive),
   *       `+0.0` should be written into the max statistics field.
   *     - If the computed min value is zero (whether negative or positive),
   *       `-0.0` should be written into the min statistics field.

I'm no expert, but looking at issues such as https://issues.apache.org/jira/browse/PARQUET-1222 I'd contend that the current GPU behavior is not incorrect. The specification was ambiguous in the past, so different implementations had different (incompatible it seems) orderings and min/max behaviors. I believe the current GPU behavior of not writing NaN is consistent with the current spec. Being different from the parquet-mr implementation is not necessarily a bug.

@GregoryKimball
Copy link
Contributor

Thank you all for the discussion. I'll close this for now.

@bdice bdice removed the Needs Triage Need team to review and classify label Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working cuIO cuIO issue
Projects
None yet
Development

No branches or pull requests

4 participants