support simple file ingestion from public remote file #14812

Open
cyrilou242 opened this issue Jan 14, 2025 · 1 comment

Comments

@cyrilou242
Contributor

cyrilou242 commented Jan 14, 2025

This is a feature request.

Problem

As of today, small datasets can be ingested to try Pinot quickly through the /ingestFromURI and /ingestFromFile endpoints.
See https://docs.pinot.apache.org/basics/data-import/batch-ingestion#ingestfromuri and
https://docs.pinot.apache.org/basics/data-import/batch-ingestion#ingestfromfile.

The problems are the following:

  1. ingestFromURI can ingest from S3, but does not support public buckets.
  2. ingestFromFile requires the file to be passed in the request payload, so the client must have the file locally. For a remote file, the client usually has to download it first and then upload it in the payload, which is inefficient.
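
To make problem 2 concrete, here is a rough sketch of the download-then-upload round trip a client has to do today. It assumes Java 11+ and the `tableNameWithType` / `batchConfigMapStr` query parameters and `file` form field as I recall them from the batch-ingestion docs linked above; the class and method names are made up for illustration, so treat this as a sketch rather than a reference client.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical client helper, not part of Pinot.
class DownloadThenUpload {
  static String enc(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8);
  }

  // Step 1: the client pulls the public file to local disk.
  static Path download(HttpClient client, String fileUrl) throws IOException, InterruptedException {
    Path local = Files.createTempFile("pinot-ingest-", ".tmp");
    HttpRequest get = HttpRequest.newBuilder(URI.create(fileUrl)).GET().build();
    client.send(get, HttpResponse.BodyHandlers.ofFile(local));
    return local;
  }

  // Wraps the file bytes in a single-part multipart/form-data body.
  static byte[] multipartBody(String boundary, String fileName, byte[] content) {
    byte[] head = ("--" + boundary + "\r\n"
        + "Content-Disposition: form-data; name=\"file\"; filename=\"" + fileName + "\"\r\n"
        + "Content-Type: application/octet-stream\r\n\r\n").getBytes(StandardCharsets.UTF_8);
    byte[] tail = ("\r\n--" + boundary + "--\r\n").getBytes(StandardCharsets.UTF_8);
    byte[] body = new byte[head.length + content.length + tail.length];
    System.arraycopy(head, 0, body, 0, head.length);
    System.arraycopy(content, 0, body, head.length, content.length);
    System.arraycopy(tail, 0, body, head.length + content.length, tail.length);
    return body;
  }

  // Step 2: the client pushes the very same bytes back out to the controller.
  static HttpRequest uploadRequest(String controller, String tableNameWithType,
      String batchConfigJson, Path file) throws IOException {
    String boundary = "----pinot-" + System.nanoTime();
    String url = controller + "/ingestFromFile"
        + "?tableNameWithType=" + enc(tableNameWithType)
        + "&batchConfigMapStr=" + enc(batchConfigJson);
    byte[] body = multipartBody(boundary, file.getFileName().toString(), Files.readAllBytes(file));
    return HttpRequest.newBuilder(URI.create(url))
        .header("Content-Type", "multipart/form-data; boundary=" + boundary)
        .POST(HttpRequest.BodyPublishers.ofByteArray(body))
        .build();
  }
}
```

The file crosses the network twice (remote host to client, client to controller) even though the controller could reach the remote host itself.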

Proposal

It'd be nice to be able to ingest from a public remote file.
I suggest adding a new parameter to the /ingestFromFile endpoint: remoteFileUrl.
If this parameter is set, Pinot will download the file directly instead of reading it from the payload.
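
From the client side, a call with the proposed parameter could look roughly like the sketch below. `remoteFileUrl` does not exist in Pinot today; it is the parameter proposed here. The other query parameter names are as I recall them from the linked docs, the class and method names are hypothetical, and whether the endpoint would accept an empty body in this mode is an open design question.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical client helper, not part of Pinot.
class RemoteFileIngestRequest {
  static String enc(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8);
  }

  // Builds the request URI; remoteFileUrl is the PROPOSED parameter (assumption).
  static URI ingestUri(String controller, String tableNameWithType,
      String batchConfigJson, String remoteFileUrl) {
    return URI.create(controller + "/ingestFromFile"
        + "?tableNameWithType=" + enc(tableNameWithType)
        + "&batchConfigMapStr=" + enc(batchConfigJson)
        + "&remoteFileUrl=" + enc(remoteFileUrl));
  }

  // No file in the body: the controller would fetch the file on its own.
  static HttpRequest ingestRequest(URI uri) {
    return HttpRequest.newBuilder(uri)
        .POST(HttpRequest.BodyPublishers.noBody())
        .build();
  }
}
```

For example, `ingestUri("http://localhost:9000", "foo_OFFLINE", "{\"inputFormat\":\"csv\"}", "https://example.com/data.csv")` yields a single request, with no local copy of the file needed on the client.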

Around this line:
https://github.com/apache/pinot/blob/6eddacfc32055e959ad72634684b904cf4098e20/pinot-controller/src/main/java/org/apache/pinot/controller/api/resources/PinotIngestionRestletResource.java#L154C81-L154C92

The pseudocode would look like this:

if (remoteFileUrl != null) {
  // NEW - download file directly
  dataToIngest = new DataPayload(new URI(remoteFileUrl));
} else {
  // CURRENT CODE - get file from request payload
  dataToIngest = new DataPayload(fileUpload);
}
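
A slightly fuller, self-contained version of that branch might look like the sketch below. The `DataPayload` here is only a stand-in for the controller's class so the snippet compiles on its own, and the upload is modeled as a byte array for simplicity. Restricting the URL to http(s) is my own assumption about what "public remote file" should mean, not something Pinot does today.

```java
import java.net.URI;

// Hypothetical sketch, not the actual controller code.
class PayloadSelection {
  // Stand-in for the controller's DataPayload, reduced to the two cases discussed here.
  static final class DataPayload {
    final URI uri;        // set when the controller should fetch the file itself
    final byte[] upload;  // set when the file arrived in the request payload

    DataPayload(URI uri) {
      this.uri = uri;
      this.upload = null;
    }

    DataPayload(byte[] upload) {
      this.uri = null;
      this.upload = upload;
    }
  }

  static DataPayload select(String remoteFileUrl, byte[] fileUpload) {
    if (remoteFileUrl != null && !remoteFileUrl.isBlank()) {
      // NEW - download file directly.
      // URI.create throws IllegalArgumentException on malformed input.
      URI uri = URI.create(remoteFileUrl);
      String scheme = uri.getScheme();
      // Assumption: only plain http(s) URLs are accepted, so file:// and similar are rejected.
      if (scheme == null || !(scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))) {
        throw new IllegalArgumentException("remoteFileUrl must be an http(s) URL: " + remoteFileUrl);
      }
      return new DataPayload(uri);
    }
    if (fileUpload == null) {
      throw new IllegalArgumentException("Either remoteFileUrl or a file upload is required");
    }
    // CURRENT CODE - get file from request payload.
    return new DataPayload(fileUpload);
  }
}
```

The scheme check is there because letting the controller fetch arbitrary URIs on behalf of a caller may be risky; the exact rules would be up to the maintainers.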
@hpvd

hpvd commented Jan 15, 2025

+1. This would also simplify testing with standardized/public data and reproducing errors with shared data.
