[Proposal]: Add UUID conversion to and from 16 byte fixed sequences #100

Open
wants to merge 1 commit into
base: main

urmastalimaa
Contributor

UUIDs are often passed around in application code in their canonical hex-string representation, e.g. `"550e8400-e29b-41d4-a716-446655440000"`. Encoding a UUID as an Avro `"string"` takes 37 bytes (a 1-byte length prefix plus 36 characters), while the binary form fits into a 16-byte `"fixed"`, saving 21 bytes per encoding.
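
The 37-byte figure can be checked with a minimal sketch, assuming Avro's binary encoding of a string as a zigzag varint length followed by the UTF-8 bytes. The module `AvroSizeSketch` and its functions are hypothetical names used only for illustration and are not part of this library.

```elixir
defmodule AvroSizeSketch do
  import Bitwise

  # Zigzag-encode a signed 64-bit integer, as Avro does for "long" values.
  def zigzag(n), do: bxor(n <<< 1, n >>> 63)

  # Bytes needed by a variable-length integer: 7 payload bits per byte.
  def varint_size(n) when n < 128, do: 1
  def varint_size(n), do: 1 + varint_size(n >>> 7)

  # Encoded size of an Avro "string": length prefix plus the raw bytes.
  def string_size(str) do
    varint_size(zigzag(byte_size(str))) + byte_size(str)
  end
end

uuid = "550e8400-e29b-41d4-a716-446655440000"
IO.inspect(AvroSizeSketch.string_size(uuid))
# prints 37
```

A 16-byte `"fixed"` carries no length prefix, so the difference is 37 - 16 = 21 bytes.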

This change allows application code to keep passing around canonical hex UUIDs while converting to the compact encoding, requiring only `uuid_format: :canonical_string` to be given in decode options.

The [Java reference implementation][java-implementation] also supports encoding UUIDs as both strings and 16-byte fixed sequences.

  • Encoding is augmented such that a 16-byte fixed schema with `%{"logicalType" => "uuid"}` converts a hex-string UUID to its 16-byte binary representation.

  • Decoding is augmented such that, given `uuid_format: :canonical_string` in decode options, the binary representation is converted to the canonical hex-string representation.
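
The two conversions described above can be sketched as follows. This is an illustration only: the PR delegates the codec to the `uniq` library, whereas this sketch uses the standard library's `Base` module, and the module name `UuidFixedSketch` is hypothetical.

```elixir
defmodule UuidFixedSketch do
  # Canonical hex string (8-4-4-4-12, 36 characters) -> {:ok, 16 raw bytes}.
  def to_binary(
        <<a::binary-size(8), ?-, b::binary-size(4), ?-, c::binary-size(4), ?-,
          d::binary-size(4), ?-, e::binary-size(12)>>
      ) do
    # Returns :error if any group contains non-hex characters.
    Base.decode16(a <> b <> c <> d <> e, case: :mixed)
  end

  def to_binary(_other), do: :error

  # 16 raw bytes -> canonical lowercase hex string.
  def to_canonical_string(
        <<a::binary-size(4), b::binary-size(2), c::binary-size(2),
          d::binary-size(2), e::binary-size(6)>>
      ) do
    Enum.map_join([a, b, c, d, e], "-", &Base.encode16(&1, case: :lower))
  end
end

{:ok, bin} = UuidFixedSketch.to_binary("550e8400-e29b-41d4-a716-446655440000")
IO.inspect(byte_size(bin))
IO.inspect(UuidFixedSketch.to_canonical_string(bin))
```

Round-tripping a canonical string through both functions should return the original string in lowercase.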

The encoding change is nearly backwards-compatible: previously, an incorrectly sized "fixed" with `{"logicalType": "uuid"}` raised an error, whereas now a conversion is attempted.

The decoding change is fully backwards-compatible, as `uuid_format` defaults to `:binary`.
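
A minimal sketch of how the `:binary` default preserves the old decode behaviour. The function name `decode_uuid_fixed/3` and the `{:ok, value, rest}` return shape are assumptions made for illustration, not the library's actual API; only the `Keyword.get(opts, :uuid_format, :binary)` lookup is taken from the diff.

```elixir
defmodule FixedDecodeSketch do
  # Hypothetical decoder for a "fixed" of `size` bytes with logicalType "uuid".
  def decode_uuid_fixed(data, size, opts \\ []) when is_binary(data) do
    <<fixed::binary-size(size), rest::binary>> = data

    value =
      case Keyword.get(opts, :uuid_format, :binary) do
        # Default: return the raw bytes, exactly as before the change.
        :binary -> fixed
        # Opt-in: convert to the canonical hex-string representation.
        :canonical_string -> to_canonical_string(fixed)
      end

    {:ok, value, rest}
  end

  defp to_canonical_string(
         <<a::binary-size(4), b::binary-size(2), c::binary-size(2),
           d::binary-size(2), e::binary-size(6)>>
       ) do
    Enum.map_join([a, b, c, d, e], "-", &Base.encode16(&1, case: :lower))
  end
end

bin = <<0x55, 0x0E, 0x84, 0x00, 0xE2, 0x9B, 0x41, 0xD4,
        0xA7, 0x16, 0x44, 0x66, 0x55, 0x44, 0x00, 0x00>>

IO.inspect(FixedDecodeSketch.decode_uuid_fixed(bin, 16))
IO.inspect(FixedDecodeSketch.decode_uuid_fixed(bin, 16, uuid_format: :canonical_string))
```

Callers that pass no options keep receiving the 16 raw bytes, so existing code is unaffected.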

For the UUID codec, the `uniq` library was added (it has no transitive dependencies).

@urmastalimaa requested a review from a team as a code owner on February 12, 2025, 17:24
[java-implementation]: https://github.com/apache/avro/blob/230414abbb68e63e68f3b55bfc0cbca94f2737f6/lang/java/avro/src/main/java/org/apache/avro/LogicalTypes.java#L291-L309
when is_binary(data) do
<<fixed::binary-size(size), rest::binary>> = data

case Keyword.get(opts, :uuid_format, :binary) do
Contributor Author

Without any `opts`-based configuration, the change would be backwards-incompatible.
I'll gladly accept input on whether configuration is necessary at all and, if so, on the key and value names.
