[Discussion] Inconsistent ORC Index Usage Based on Row Count Threshold #24702

feloxx · 2025-01-14T06:43:56Z

Issue Description

When Trino reads ORC files, whether the index is used is decided solely by comparing the stripe's row count with orc.row.index.stride. This can make index usage inconsistent, especially for wide-row data (rows with large fields).

Current Behavior

In StripeReader.java, the decision to use the index is based on this condition:

lib/trino-orc/src/main/java/io/trino/orc/StripeReader.java

if (rowsInRowGroup.isPresent() && stripe.getNumberOfRows() > rowsInRowGroup.getAsInt()) {
    // Use the index (e.g. Bloom filter)
} else {
    // Skip index usage
}

This means:

If stripe rows <= orc.row.index.stride: no index usage
If stripe rows > orc.row.index.stride: use index
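
To make the boundary explicit, here is a minimal standalone sketch of the same predicate. The class and method names are hypothetical and not part of Trino; the stride of 10,000 is only an example value.

```java
import java.util.OptionalInt;

// Standalone sketch of the condition quoted above; not the actual Trino class.
final class RowCountIndexDecision {
    private RowCountIndexDecision() {}

    // Mirrors: rowsInRowGroup.isPresent() && stripeRows > rowsInRowGroup.getAsInt()
    static boolean shouldUseIndex(OptionalInt rowsInRowGroup, long stripeRows) {
        return rowsInRowGroup.isPresent() && stripeRows > rowsInRowGroup.getAsInt();
    }

    public static void main(String[] args) {
        OptionalInt stride = OptionalInt.of(10_000); // example stride value
        System.out.println(shouldUseIndex(stride, 10_001));              // true
        System.out.println(shouldUseIndex(stride, 10_000));              // false: equal counts skip the index
        System.out.println(shouldUseIndex(stride, 4_096));               // false
        System.out.println(shouldUseIndex(OptionalInt.empty(), 50_000)); // false: no stride known
    }
}
```

Note that a stripe with exactly as many rows as the stride falls into the "skip" branch, since the comparison is strict.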

Problem Statement

Inconsistent Index Usage
- When dealing with wide-row data, a stripe might contain significant data volume but few rows
- This leads to situations where some stripes use indexes while others don't, even within the same table
- The actual data size isn't considered in the decision-making process
External Control Dependencies
- While this can be controlled on the writer side (e.g. Flink ORC writer configuration, or sizing the output files so that stripes are large enough for the index to be used), this creates:
  - Communication overhead between teams, since the data is often written by one group of users and queried by another
  - Implicit dependencies on external configurations
  - Lack of clear error/warning messages when index usage is skipped
Performance Implications
- For wide-row data, skipping index usage based solely on row count might lead to unnecessary full stripe reads
- This could be particularly impactful when dealing with predicate pushdown scenarios
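
As a rough illustration of the wide-row case, the sketch below estimates how many rows fit in a stripe of a given size. All numbers (a 64 MiB stripe, 16 KiB and 256-byte average rows, a stride of 10,000) are illustrative assumptions, and the helper name is hypothetical.

```java
// Back-of-the-envelope estimate; ignores compression and encoding overhead.
final class WideRowExample {
    private WideRowExample() {}

    // Approximate number of rows that fit in one stripe.
    static long rowsPerStripe(long stripeBytes, long avgRowBytes) {
        return stripeBytes / avgRowBytes;
    }

    public static void main(String[] args) {
        long stripeBytes = 64L << 20; // 64 MiB, assumed stripe size
        long stride = 10_000;         // assumed row index stride

        long wide = rowsPerStripe(stripeBytes, 16L << 10); // 16 KiB rows -> 4096 rows
        long narrow = rowsPerStripe(stripeBytes, 256);     // 256-byte rows -> 262144 rows

        System.out.println("wide rows:   " + wide + ", index used: " + (wide > stride));
        System.out.println("narrow rows: " + narrow + ", index used: " + (narrow > stride));
    }
}
```

Under these assumptions both stripes hold the same amount of data, but only the narrow-row stripe passes the row count check.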

Questions for Discussion

Is there a specific reason for using only row count as the threshold for index usage?
Would adding size-based or hybrid criteria for index usage introduce any performance concerns?
Should we consider adding configuration options to give users more control over index usage strategies?
How can we better handle the trade-off between index overhead and query performance for wide-row data?

Ideas

Add a session property that forces index usage whenever a Bloom filter is present; this would also help with debugging and troubleshooting
Decide whether to use an index, such as a Bloom filter, based on the file size
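
One possible shape for a hybrid rule combining both ideas is sketched below. This is not existing Trino behavior: the class, the forceIndex flag, and the minStripeBytesForIndex threshold are hypothetical names introduced only for discussion, and the sketch uses stripe size rather than file size as the data-volume signal.

```java
import java.util.OptionalInt;

// Hypothetical hybrid policy; none of these names exist in Trino.
final class HybridIndexDecision {
    private HybridIndexDecision() {}

    static boolean shouldUseIndex(
            OptionalInt rowsInRowGroup,
            long stripeRows,
            long stripeBytes,
            long minStripeBytesForIndex, // assumed size threshold
            boolean forceIndex)          // assumed session-level override
    {
        // Without a known stride there are no row groups to index against.
        if (!rowsInRowGroup.isPresent()) {
            return false;
        }
        if (forceIndex) {
            return true;
        }
        // Existing row count rule.
        if (stripeRows > rowsInRowGroup.getAsInt()) {
            return true;
        }
        // Additional size-based rule for wide-row stripes.
        return stripeBytes >= minStripeBytesForIndex;
    }

    public static void main(String[] args) {
        OptionalInt stride = OptionalInt.of(10_000);
        long threshold = 32L << 20; // 32 MiB, example value

        // Few rows but a large stripe: the size rule enables the index.
        System.out.println(shouldUseIndex(stride, 4_096, 64L << 20, threshold, false));
        // Few rows and a small stripe: the index is skipped unless forced.
        System.out.println(shouldUseIndex(stride, 4_096, 1L << 20, threshold, false));
        System.out.println(shouldUseIndex(stride, 4_096, 1L << 20, threshold, true));
    }
}
```

Whether the extra index reads pay off for small-row-count stripes is exactly the trade-off raised in the questions above, so the threshold would need measurement rather than a fixed default.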
