RDF Dataset Description #110
There are a lot of topics in your email. I'm not sure I understand all of them correctly. (Please forgive me if I get something wrong.)

I think (I could easily be wrong) that at one point you suggest that one request will return the RDF info for all datasets in a given ERDDAP. In general, I think that is a bad idea. Many ERDDAPs have a large number of datasets (3,000 to 30,000), so this response would be huge. If you envision some external system polling this frequently (every 15 minutes) to maintain all the metadata from a given ERDDAP, that seems like a very inefficient system that will take up lots of ERDDAP resources (especially given that most datasets won't have changed in that time period). It would be better to build an external system that subscribes to all of an ERDDAP's datasets so that it can then request metadata just for datasets that change, and immediately after they have changed. Subscriptions (plus subsequent requests for the related metadata file for a dataset) are by far the most efficient (fastest, with minimal data transfer) way to detect changes to a given dataset.

I understand that there will always be new ways of formatting each dataset's metadata and that different users prefer different formats. That's a fundamental feature of ERDDAP. I see that you would find your RDF format variants useful. Okay. So I see the value in making new response format(s) for the erddap/info/ or erddap/metadata/ system in ERDDAP. These files could then be requested by clients when the dataset has changed. So that sounds like a good addition to ERDDAP.

I don't understand all of your fancy header options. What do they provide that simple, direct requests (e.g., give me the .ttl file for this dataset) don't?

So I pushed back on some of your ideas. I suspect my suggestions aren't what you want to hear. Please tell me more and give me use-case examples, so that I understand what you want, and why, and why your approach is needed.

Best wishes.
Hey there,
I've been working on an RDF-based way to describe ERDDAP datasets using DCAT, for a European project (FAIR-EASE).

The main idea is to create an external client that harvests a list of ERDDAP servers (or data providers in general) and builds a database (in our case a triplestore) containing the useful information about every public dataset: its description, where to find it, details such as the subsetting URL, etc.

We could then query this triplestore to find the dataset best suited to our needs (based on its description) and get its URL (plus, ideally, the associated subsetting URL, so we retrieve only the minimal amount of data).

For instance, if we are looking for a dataset covering a certain spatial region and time range, the client would ask the triplestore for every dataset matching these constraints (across the multiple data provider servers) and return the corresponding URLs and, if possible, the best-suited subsetting URLs.
I'm using an external RDF library (Apache Jena) to generate an in-memory graph and serialize it.
See: https://jena.apache.org/tutorials/rdf_api.html
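The fork itself uses Jena (a Java library), but the underlying idea (keep an in-memory set of triples, then serialize it on demand) can be sketched in a few lines of plain Python. Everything below is illustrative only, not the Jena API and not the fork's actual model:

```python
# Minimal illustration of an in-memory RDF graph: a set of
# (subject, predicate, object) triples plus a naive N-Triples serializer.
# This is a sketch of the concept only; the real fork uses Apache Jena.

def serialize_ntriples(triples):
    """Serialize (s, p, o) triples to N-Triples. Objects starting
    with 'http' are written as IRIs, everything else as literals."""
    lines = []
    for s, p, o in sorted(triples):
        obj = f"<{o}>" if o.startswith("http") else f'"{o}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)

DCT = "http://purl.org/dc/terms/"
DCAT = "http://www.w3.org/ns/dcat#"
ds = "https://coastwatch.pfeg.noaa.gov/erddap/griddap/erdMWcflh1day"

# Placeholder metadata values, for illustration only.
graph = {
    (ds, DCT + "title", "Illustrative dataset title"),
    (ds, DCAT + "landingPage", ds + ".html"),
}

print(serialize_ntriples(graph))
```

Jena's `Model` plays the role of the `graph` set here, and its `write` methods take the place of the serializer, with all the RDF formats listed below supported out of the box.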
This is not intended to replace the `.das` or `.dds` formats; rather, it lets a user (or bot) get an RDF description for every dataset.

Currently, the RDF model looks like this:
We can now generate these file formats:

- `.jsonld`: JSON-LD
- `.n3`: Notation3
- `.nt`: N-Triples
- `.nq`: N-Quads
- `.rdfxml`: RDF/XML
- `.trig`: TriG
- `.ttl`: Turtle

For example, for the dataset https://coastwatch.pfeg.noaa.gov/erddap/griddap/erdMWcflh1day.html
if you ask for the RDF model in Turtle, http://localhost:8080/erddap/griddap/erdMWcflh1day.ttl would return the dataset's DCAT description in Turtle.
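The actual response isn't reproduced here. As a rough, hand-written illustration of what a DCAT description in Turtle can look like (the property choices and every value below are placeholders, not the fork's real output):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<https://coastwatch.pfeg.noaa.gov/erddap/griddap/erdMWcflh1day>
    a dcat:Dataset ;
    dct:title "Illustrative dataset title" ;
    dct:description "Illustrative description." ;
    dcat:landingPage <https://coastwatch.pfeg.noaa.gov/erddap/griddap/erdMWcflh1day.html> ;
    dcat:distribution [
        a dcat:Distribution ;
        dcat:accessURL <https://coastwatch.pfeg.noaa.gov/erddap/griddap/erdMWcflh1day> ;
        dct:format "application/x-netcdf"
    ] .
```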
Another important part is being able, from a single URL, to access the RDF description of every available and accessible dataset, while keeping some control over the number of results.
For that, I've created two URLs. The first,

`/{warName}/info/catalog.{RDF format}?page={..}&itemsPerPage={itemsPerPage}`

lists and wraps the RDF descriptions of the first `{itemsPerPage}` datasets (sorted by the `latestModifiedDate`) inside an RDF `dcat:Catalog`.

I also created another URL (even if it's technically the same),

`/{warName}/info/catalog.{RDF format}`

which contains and lists every available

`/{warName}/info/catalog.{RDF format}?page={..}&itemsPerPage={..}`

URL.

With this, from a single static URL, a user/bot can navigate by itself and get every dataset without any prior knowledge of the server, retrieving only the minimal information it needs.
Another cool feature would be, depending on the headers of the request, to redirect it to the corresponding URL (a kind of content negotiation):

```shell
curl -L -X GET https://coastwatch.pfeg.noaa.gov/erddap/griddap/erdMWcflh1day.html -H "Accept: */*"
```

-> returns the default HTML page

```shell
curl -L -X GET https://coastwatch.pfeg.noaa.gov/erddap/griddap/erdMWcflh1day.html -H "Accept: text/turtle"
```

-> gets redirected to https://coastwatch.pfeg.noaa.gov/erddap/griddap/erdMWcflh1day.ttl
-> returns the Turtle content
I have already done this for the default RDF formats, and created these redirections:

- `/tabledap|griddap/{datasetid}.{format}` with an RDF Accept header -> `/tabledap|griddap/{datasetid}.{corresponding RDF format}`
- `/info/index.html` with an RDF Accept header -> `/info/catalog.{corresponding RDF format}`
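The header-based redirection can be sketched as a lookup from Accept media types to ERDDAP file-type extensions (the media types below are the standard RDF registrations; the lookup itself is a simplification of what the fork does and ignores q-values and wildcards):

```python
# Map RDF media types from an Accept header to the matching ERDDAP
# extension; return None when no RDF type is requested, in which case
# the server serves the default page. Simplified sketch only.

RDF_MEDIA_TYPES = {
    "application/ld+json": ".jsonld",
    "text/n3": ".n3",
    "application/n-triples": ".nt",
    "application/n-quads": ".nq",
    "application/rdf+xml": ".rdfxml",
    "application/trig": ".trig",
    "text/turtle": ".ttl",
}

def rdf_extension_for(accept_header):
    """Return the extension for the first matching RDF media type."""
    for media_type in accept_header.split(","):
        ext = RDF_MEDIA_TYPES.get(media_type.strip().split(";")[0])
        if ext is not None:
            return ext
    return None

print(rdf_extension_for("text/turtle"))             # .ttl
print(rdf_extension_for("text/html, text/turtle"))  # .ttl
print(rdf_extension_for("*/*"))                     # None
```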
All changes can be found in this ERDDAP fork : https://github.com/vliz-be-opsci/FAIR-EASE-erddap
There's also a complete, runnable Docker build environment inside, but that is not the main topic.
I don't think we should see this as "the final" or optimal solution, but I really think it can be a pretty good starting point for thinking about the next step toward a more FAIR ERDDAP.