-
Notifications
You must be signed in to change notification settings - Fork 437
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-675: Add INTERVAL_YEAR_MONTH and INTERVAL_DAY_TIME types #43
base: master
Are you sure you want to change the base?
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -162,19 +162,48 @@ enum ConvertedType { | |
BSON = 20; | ||
|
||
/** | ||
* @Deprecated: use INTERVAL_YEAR_MONTH or INTERVAL_DAY_TIME | ||
* since the SQL standard defines either YEAR_MONTH or DAY_TIME unit. | ||
* This is deprecated in favor of those 2 types | ||
* | ||
* An interval of time | ||
* | ||
* | ||
* This type annotates data stored as a FIXED_LEN_BYTE_ARRAY of length 12 | ||
* This data is composed of three separate little endian unsigned | ||
* integers. Each stores a component of a duration of time. The first | ||
* integer identifies the number of months associated with the duration, | ||
* the second identifies the number of days associated with the duration | ||
* and the third identifies the number of milliseconds associated with | ||
* and the third identifies the number of milliseconds associated with | ||
* the provided duration. This duration of time is independent of any | ||
* particular timezone or date. | ||
*/ | ||
INTERVAL = 21; | ||
|
||
|
||
/** | ||
* An interval of time with a year-month unit. | ||
* | ||
* This type annotates data stored as an INT32 | ||
* This data is stored as a little endian unsigned | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This should be signed. SQL standard does not mention this explicitly, but defines There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. agreed There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 |
||
* integer identifying the number of months associated with the duration. | ||
* This duration of time is independent of any | ||
* particular timezone or date. | ||
*/ | ||
INTERVAL_YEAR_MONTH = 22; | ||
|
||
/** | ||
* An interval of time with days-milliseconds unit | ||
* | ||
* This type annotates data stored as a FIXED_LEN_BYTE_ARRAY of length 8 | ||
* This data is composed of two separate little endian unsigned | ||
* integers. Each stores a component of a duration of time. | ||
* | ||
* The first identifies the number of days associated with the duration | ||
* and the second identifies the number of milliseconds associated with | ||
* the provided duration. This duration of time is independent of any | ||
* particular timezone or date. | ||
*/ | ||
INTERVAL_DAY_TIME = 23; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there an alternative to embedding two values in a single byte array? The problem is that this is incompatible with stats filtering and efficient integer encodings. I think those are significant enough that we should consider alternatives. I like the idea of using this annotation for a group of two int32 values. That way we get good compression and encoding, stats filtering works, and we get can get independent dictionary encoding, which would work well for the days field. What are the downsides to that approach instead of the byte array? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the downside is the two values not being contiguous for processing. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Given that a consumer would be unlikely to read/consume the two values separately, storing them separately seems like overkill. Since we already do the set of three values in the existing interval, it seems like storing two in a single 8 bytes would be okay and consistent. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My understanding is that using a single INT64 with milliseconds (1 day = 86400000 milliseconds) would be the most convenient solution. Besides better encoding and statistics it has the advantage of unique representation (does not allow for There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with Jacques that this is lossy because a day isn't consistently 86400000 ms. We need to store 2 values, but I'm not convinced that there is much value in storing these in a single byte array. What about users that only use days or only use micros? The proposed type would use 8 bytes per value either way. I don't think it is worth that cost when we can deserialize two columns to a single arrow column instead. Is there a micro-benchmark to back up the benefit of using this format instead of a 2-column format? Using an int64 won't work because with 2 values that aren't constrained by one another, min and max stats are invalid (days always overrides micros). In fact, because days and micros can't be converted to one another, I'd argue that there is no consistent ordering for this interval type. So we would get compression for the days value but at the cost of more confusion about how to produce stats. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am guessing that by day not equal 86400000 you mean the influence of leap seconds. I don't know any database system which would implement those, especially for intervals. Most just store one component for days and seconds (among them Spark) and while for example Postgres uses separate components for them, it still makes a day equal 24 hours for comparison purposes. But if you do decide to store milliseconds with intention of it being an independent dimension, 32 bits seem not enough: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Oh, another reason you may mean for different day lengths is if they span different DST time zones (on the DST change dates). SQL does not provide a meaningful way to handle those as SQL time stamps only carry time zone offset (in minutes) and not the time zone in the traditional sense (location). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The case I was thinking about was DST. The length of a day may be 23 hours, 24 hours, or 25 hours. That makes 1 day incomparable with 24 hours, unless the SQL spec includes a way to compare them. I'm fine with not having a comparison for Parquet, even if the SQL spec defines one. The other consideration is whether we should constrain hours, minutes, and seconds to less than 24 hours. The other comment from the SQL spec appears to support that, but constraining the microseconds field to < 86,400,000 will require conversion logic somewhere in order to support arithmetic on interval values: 0 days, 22 hours + 0 days 4 hours = ???. Parquet could just require the microseconds field to be less than 86,400,000, but that shifts the problem to processing engines. And either way, it doesn't support the idea of storing the entire interval using microseconds. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @rdblue, there are basically two approaches:
Going forward with option 1) is fine (as long as it is clearly documented), but then I don't see the advantage over the existing interval type with this is deprecating. |
||
|
||
} | ||
|
||
/** | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does the SQL spec say?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See links in https://issues.apache.org/jira/browse/PARQUET-675.
For example: "An interval is defined as the difference between two dates and times. Intervals are expressed in one of two different ways. One is a year-month interval that expresses intervals in terms of years and an integral number of months. The other is a day-time interval that expresses intervals in terms of days, minutes, and seconds. These two types of intervals are distinct and cannot be mixed, because months can have varying numbers of days."
https://msdn.microsoft.com/en-us/library/ms716506(v=vs.85).aspx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I add this quote in there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SQL 92 Spec:
Further:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How are intervals comparable? Are there details in the spec for resolving cases like (1 day, 0 ms) <=> (0 days, 86400000 ms)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the SQL definition 1 day = 24 hours = 86400 seconds.
and
Of course this is just the standard SQL, Parquet does not have to follow this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michal-databricks, can you post the reference for 1 day = 24 hours = 86400 seconds? I'm not sure that what you've quoted here requires it. The second paragraph, "When it is necessary...", is talking about conversion and the first constrains the fields. That doesn't mean that "1 day" and "23 hours 59 minutes 59.999999 seconds" are comparable. It just means that 86400 seconds is an invalid interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rdblue, this is the complete paragraph defining the comparison of day-time intervals where the quote was taken from.
To my understanding, this means that, for example, comparing
INTERVAL '100' HOUR
withINTERVAL '4 04:00' DAY TO MINUTE
requires conversion to the common super type, in this caseINTERVAL DAY TO MINUTE
. During conversion least significant fields are filled with zeros and most significant "in a manner obeying the natural rules for dates and times associated with the Gregorian calendar", since the non-most-significant fields must conform to the constraints in the table (comes from the same chapter). In this caseINTERVAL '100' HOUR
becomesINTERVAL '4 04:00' DAY TO MINUTE
. Now they are compared field by field (making them equal).This is the best reference from the SQL '92 standard I can give to argument 1 day = 24 hours. And while implementations of the standard vary they still tend to use the above method for comparisons.