Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Discussion] Inconsistent ORC Index Usage Based on Row Count Threshold #24702

Open
feloxx opened this issue Jan 14, 2025 · 0 comments
Open

[Discussion] Inconsistent ORC Index Usage Based on Row Count Threshold #24702

feloxx opened this issue Jan 14, 2025 · 0 comments

Comments

@feloxx
Copy link

feloxx commented Jan 14, 2025

Issue Description

When reading ORC files in Trino, the index usage is determined solely by comparing the stripe row count with orc.row.index.stride. This design may cause the index to be unstable, especially when dealing with wide-row data (rows with large fields).

Current Behavior

In StripeReader.java, the decision to use index is made based on this condition:

lib/trino-orc/src/main/java/io/trino/orc/StripeReader.java

if (rowsInRowGroup.isPresent() && stripe.getNumberOfRows() > rowsInRowGroup.getAsInt()) {
// Use index, Such as bloomfilter
} else {
// Skip index usage
}

This means:

  • If stripe rows < row.index.stride: No index usage
  • If stripe rows > row.index.stride: Use index

Problem Statement

  1. Inconsistent Index Usage

    • When dealing with wide-row data, a stripe might contain significant data volume but few rows
    • This leads to situations where some stripes use indexes while others don't, even within the same table
    • The actual data size isn't considered in the decision-making process
  2. External Control Dependencies

    • While we can control this through external writers (e.g., Flink ORC writer configurations, the size of the file output to ensure the fit of the index), this creates:
      • The cost of communication before the user, Data is written to a different group of users using the data
      • Implicit dependencies on external configurations
      • Lack of clear error/warning messages when index usage is skipped
  3. Performance Implications

    • For wide-row data, skipping index usage based solely on row count might lead to unnecessary full stripe reads
    • This could be particularly impactful when dealing with predicate pushdown scenarios

Questions for Discussion

  1. Is there a specific reason for using only row count as the threshold for index usage?
  2. Would adding size-based or hybrid criteria for index usage introduce any performance concerns?
  3. Should we consider adding configuration options to give users more control over index usage strategies?
  4. How can we better handle the trade-off between index overhead and query performance for wide-row data?

idea

  1. Adding session parameters, If there is a BloomFilter, force use, facilitates debugging or troubleshooting
  2. Whether to use an index, such as BloomFilter, based on the file size
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

1 participant