Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dictionaries #125

Open
guybedford opened this issue Nov 8, 2022 · 9 comments
Open

Dictionaries #125

guybedford opened this issue Nov 8, 2022 · 9 comments
Labels
enhancement New feature or request

Comments

@guybedford
Copy link
Collaborator

Are dictionaries on the roadmap for the component model? At the moment it seems this requires representation via list<tuple<string, ...>>, where a dictionary could be a relatively straightforward host-level representation over such a datatype.

@lukewagner
Copy link
Member

That's a good question and indeed this is one of those types on the edge. Some reasons why it seems like we should hold off for now:

  • We have record types which subsume a lot of what would otherwise be dictionary use cases, particularly with option fields, while providing much more declarative interface information and a more-efficient representation. To wit, the Web IDL dictionary type is effectively a record and used extensively in Web APIs, while the Web IDL record type is effectively a dictionary and used relatively infrequently.
  • It would be a bunch of tricky design and implementation work, since there's a wide design and implementation space of options (esp. around sortedness/uniqueness).

Thus, as with #56, while I can see the use case, I'd like to see if we can get by without these in an MVP release.

@guybedford
Copy link
Collaborator Author

I would of course tend towards JS semantics here, so records having unique strings and ordering supported (ie exactly a list<tuple<string, ...>> representation underneath). In theory uniqueness could be managed by the host, in the same way valid USV strings are managed by the host without requiring a check everytime, since it's a strong property of the nature of the data it returns.

I can certainly get by with a list, but my concern is that it's a wide enough pattern that there's a natural temptation to want to decorate the interface with a JS object to give a nice representation - which then forces me back outside of the component model to a space of component model + sprinkles of JS, which seems non-ideal.

@guybedford
Copy link
Collaborator Author

I suppose records & tuples alignment in JS would imply unordered record objects these days, so perhaps sorting might be preferable, it's hard to know.

Either way, agreed as with #56 that this is nice-to-have but not necessarily MVP.

@lukewagner
Copy link
Member

Agreed. And just explaining a bit more about the design challenges:

It makes sense that when passing in a host value, the host can already have guarantees of uniqueness; the complexity comes when the Canonical ABI lifts a dictionary from linear memory since the host can't trust anything and thus must verify things dynamically every time.

For ordering: requiring ordered keys (in the list<tuple<string, ...>>) is great for lifting performance (each string can be compared with its predecessor, fused with the copy loop) but JS objects and maps (and, in general, any hash-table-based dictionary) aren't lexicographically ordered, so this would add a non-trivial sorting operation when lowering these into wasm (negating any win from knowing uniqueness). But if we drop the ordering requirement, detecting dupes when lifting linear memory is rather more expensive.

Lastly, we could have dictionary<K,V> simply mean list<tuple<K,V>> and not impose any uniqueness/ordering requirements, which would achieve your goal of being able to derive a nice idiomatic JS interface automatically. But then the worry is that dupes end up leading to corner case bugs.

I'm not sure what the right answer is here, just that it'd be nice to punt on this in the MVP.

@yordis
Copy link

yordis commented Aug 26, 2024

Could we take inspiration from Protobuf, while also Cap'n Proto seems to do list<tuple<string, ...>>

@dzmitry-lahoda
Copy link

dzmitry-lahoda commented Nov 14, 2024

I did not made this text pretty with ai. Main idea is how to find types which are maybe_goodtoaddtowit.

There are https://en.wikipedia.org/wiki/Refinement_type and https://en.wikipedia.org/wiki/Dependent_type which ultimately allow to encode something like JSON Shema or XSD and prove things about composability.
I have not seen further more typed way to define types.
And exiting example https://www.jolie-lang.org/index.html .

What types/properties for these most modern languages can do up to some degree?
It is ordered and unique.

type maybe_map = list<tuple<string, ...>> is not ordered no unique, nor there is a way to verify that. why?
because parsing maybe_map is O(size), while verifying unique is O(size * log size) which is dos attack.
That is https://protobuf.dev/programming-guides/proto3/#maps which is barely usable. What about O(const) of comparison?

Some people tend to reject idea of ordered to be any relevance to unique so https://www.ietf.org/archive/id/draft-devault-bare-11.html#section-2.2 , does not feels good. Specifically he asks to raise error if non unique, but that is dos and cannot be part of general standard.

So from this, may be need to think about ordered first, which allows unique.

We can have key ordered maps, which are not binding to list of tuples, because there is specific key value, and just ordered lists, basis for sets.

So what to do with set<tuple<MyType>>, which would imply that any primitive type as defined has canonical order, and each record has canonical order too. In general can define something. But not sure if I know all alignments. And is order ascending or descending? What order of fields in record used as key?
UPDATE: May be rule that order decided by first 2 items will work.

So having ordered can be hard, hence unique too.

So what can be done?
I can document that my ``;

/// ordered
/// unique
type ordered_set<T> = list<T>

or I can make docs structured:

/// refinements:
///  - ordered
///  - unique
type ordered_set<T> = list<T>

or I can think of attributes

#[refinements(ordered, unique)]
type ordered_set<T> = list<T>

Is safe type subset possible?
Possible safe subset which can be done is ascending ordered set where element is fixed primitive type or fixed size list or tuple of such types or when key is like element in set.

How to handle not fixed elements like strings?
Some trustful(by "admin"/"sudo") operation can be initialized with something to be validate by O(n log n) to be set/map and stored. And after that set or map of index elements(index of stored element) to be send as key of map/set. So

So may having structured tags or custom attributes is way to have dictionary?

#58

@oovm
Copy link

oovm commented Dec 24, 2024

I agree with the solution of using refinement type to enhanc dict<K,V>.

The guest language can choose the mapping based on its support for refined dict.

type dict<K, V>; // HashMap<K, V>
@predicate(key_sorted)
type sorted_string_dict = dict<string, string> // BTreeMap<String, String>

@ouillie
Copy link

ouillie commented Feb 5, 2025

We have record types which subsume a lot of what would otherwise be dictionary use cases.

This is the wrong way to think about dictionary types. If the key space is finite, records are always preferred. Dictionaries should support the separate use case where the key space is "infinite", e.g. "a phone book", HTTP headers, JSON blobs, etc. Using the latter to emulate the former always turns into tech debt (I know JS does exactly that, but that's exactly why TS and JSON schemas are so popular), so we should think of them as disjoint use cases.

type maybe_map = list<tuple<string, ...>> is not ordered no unique, nor there is a way to verify that. why?
because parsing maybe_map is O(size), while verifying unique is O(size * log size) which is dos attack.
That is https://protobuf.dev/programming-guides/proto3/#maps which is barely usable. What about O(const) of comparison?

This assumes B-tree or similar implementation. Hash tables are amortized Θ(n) to lift and lower and can enforce uniqueness, and they're more broadly support in standard libraries.

The canonical ABI for map<K, V> could be equivalent to list<tuple<K, V>> with overwrite on key collision. Lifting would involve, effectively:

fn lift(entries: Vec<(K, V)>) -> HashMap<K, V> {
    let dictionary = HashMap::with_capacity(entries.len());
    for (key, value) in entries.iter() {
        dictionary.insert(key, value);  // Overwrite on key collision. Uniqueness enforced. 👍
    }
    return dictionary;
}

I suppose the alternative would be to trap on key collisions, but I don't see why that's necessary.

Guest languages could lift them into B-trees but that would involve sorting. Sorted maps are not as standardly-supported as general associative maps and would be better represented as lists anyway, so I think the basic map type should be semantically unsorted and potentially generalize to non-string keys, like the JS Map, but could be restricted to string keys at first.

Having the capacity known ahead of time (as per canonical list ABI) means no resizing during lifting, and hashing tends to be very performant (especially for strings). If the hash function is silly, a malicious component could engineer hash collisions for unequal keys to induce O(n^2), but even MD5 would be difficult to exploit in that way at scale.

Happy to help push this boulder up the hill any way that I can.

@lukewagner
Copy link
Member

I've also been thinking that list<tuple<K, V>>-semantics is probably the best we can do (leaving it up to the lowering wasm code to catch/handle duplicates... or not). One thing we should probably say, spec-wise, is that, just like how we non-deterministically allow but do not require NaN payloads to be scrambled (allowing, but not requiring, hosts to canonicalize NaNs at cross-component boundaries), we could likewise non-deterministically allow-but-not-require duplicate keys to be merged (with well-defined last-value-overwrites semantics when such merging occurs).

One thought in the meantime is that, K=string may (1) cover a large percentage of the use cases, (2) be much simpler to implement in bindings generators in the wide variety of associative maps out there, so it might make sense to start with just string keys.

Another thought is that the lazy lowering ABI may end up having a significant impact on the implementation, so to avoid churn, we might want to sequence this feature after that? Although if anyone wanted to do any prototyping in wit-bindgen in the short term, that'd be great too. Having heard a number of use cases and interest in this since the issue was opened, I do think this feature makes sense to include in 1.0 if folks are willing to implement it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants