Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(connector): resolve $ref in json schema #20446

Merged
merged 10 commits into from
Feb 19, 2025
Merged

Conversation

yuhao-su
Copy link
Contributor

@yuhao-su yuhao-su commented Feb 11, 2025

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

close #20351

This PR only resolve the $ref in a json schema. Other fields like $anchor $id etc was not resolved or checked.

This PR does not handle the json schema used by confluent schema registry.

Checklist

  • I have written necessary rustdoc comments.
  • I have added necessary unit tests and integration tests.
  • I have added test labels as necessary.
  • I have added fuzzing tests or opened an issue to track them.
  • My PR contains breaking changes.
  • My PR changes performance-critical code, so I will run (micro) benchmarks and present the results.
  • My PR contains critical fixes that are necessary to be merged into the latest release.

Documentation

  • My PR needs documentation updates.
Release note

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

license-eye has checked 5568 files.

Valid Invalid Ignored Fixed
2348 1 3219 0
Click to see the invalid file list
  • src/connector/codec/tests/integration_tests/json.rs
Use this command to fix any missing license headers
```bash

docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix

</details>

let mut schema = match self.schema_cache.get(&ref_no_fragment) {
Some(cached_schema) => cached_schema.clone(),
None => {
if url_schema == "http" || url_schema == "https" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure we want to support schema reference from http/fs?
Should we support retrieve references from schema registry instead? https://docs.confluent.io/platform/current/schema-registry/fundamentals/serdes-develop/serdes-json.html#schema-references-in-json-schemas

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this feature a customer request? Please provide more context, thanks.

Copy link
Contributor Author

@yuhao-su yuhao-su Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User only requested schema ref resolution without an external link.

I thought the HTTP support can be a general approve to support schema registry. But I just realized there is a version segment after the subject. I'll change this pr to only support schema resolution without external resource for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we do allow users using http and file. So it won't hurt to add those support.

It's just for the schema registry, we need to investigate more into how it works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though schemas are identified by URIs, those identifiers are not necessarily network-addressable. They are just identifiers. Generally, implementations don't make HTTP requests (https://) or read from the file system (file://) to fetch schemas. Instead, they provide a way to load schemas into an internal schema database. When a schema is referenced by it's URI identifier, the schema is retrieved from the internal schema database.

https://json-schema.org/understanding-json-schema/structuring

Copy link
Contributor Author

@yuhao-su yuhao-su Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they provide a way to load schemas into an internal schema database

It's true a for a deref library to focus on derefing instead of fetching (in fact it's referencing do). But if we want do do so we will have to ask users to provide all the URL in the schema ref or parse the schema to get all the $ref recurrently before ref.

@yuhao-su yuhao-su marked this pull request as ready for review February 11, 2025 20:31
@yuhao-su yuhao-su requested a review from a team as a code owner February 11, 2025 20:31
@yuhao-su yuhao-su requested a review from xxchan February 12, 2025 03:58
let mut schema = match self.schema_cache.get(&ref_no_fragment) {
Some(cached_schema) => cached_schema.clone(),
None => {
if url_schema == "http" || url_schema == "https" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though schemas are identified by URIs, those identifiers are not necessarily network-addressable. They are just identifiers. Generally, implementations don't make HTTP requests (https://) or read from the file system (file://) to fetch schemas. Instead, they provide a way to load schemas into an internal schema database. When a schema is referenced by it's URI identifier, the schema is retrieved from the internal schema database.

https://json-schema.org/understanding-json-schema/structuring

Comment on lines +69 to +70
#[derive(Debug)]
pub struct JsonRef {
Copy link
Contributor

@xiangjinwu xiangjinwu Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledge that $id and $anchor are currently ignored and unsupported. To be honest I feel like this shall be supported by the json schema library rather than by a hack in the caller that replaces the schema.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did check the jsonschema library. It's more like designed for validation and is vary hard to use it for dereffing the schema file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we move the impl to a standalone crate? I feel that the kernel should not touch the logic of JsonRef.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. We may need to hack it to support confluent schema registry. The confluent schema registry should be handled differently. IIUC they can use reference name to retrieve a reference instead of URI.

@yuhao-su yuhao-su added this pull request to the merge queue Feb 19, 2025
Merged via the queue into main with commit b8feb75 Feb 19, 2025
28 of 29 checks passed
@yuhao-su yuhao-su deleted the yuhao/use-jsonschema branch February 19, 2025 21:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

handle json reference in json schema
4 participants