Extend PartialStruct to represent non-contiguously defined fields #57304

Open · wants to merge 23 commits into base: master
Conversation

@serenity4 (Contributor)

So far, PartialStruct has been unable to represent non-contiguously defined fields, where e.g. a struct would have fields 1 and 3 defined but not field 2. This PR extends it so that such information may be represented with PartialStruct, extending the applicability of optimizations such as those introduced in #55297 by @aviatesk or in #57222.

The semantics of `new` prevent the creation of a struct with non-contiguously defined fields, so this change is mostly relevant for modeling mutable structs whose fields may have been set previously or are assumed to be defined after creation, or immutable structs whose creation is opaque.
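As a quick illustration (not part of the PR itself), `new` can only leave a *trailing* suffix of fields undefined, so non-contiguous definedness can only arise later, e.g. via `setfield!` on a mutable struct:

```julia
# Hypothetical demonstration: `B` is an invented example type.
mutable struct B
    x
    y
    z
    B() = new()   # all three fields start out undefined
end

b = B()
b.z = 3           # now z is defined while x and y are not: non-contiguous
@assert !isdefined(b, :x) && !isdefined(b, :y) && isdefined(b, :z)
```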

Notably, with this change we may now infer information about structs in the following case:

mutable struct A; x; y; z; A() = new(); end

function f()
    mut = A()
   
    # some opaque call preventing optimizations
    # who knows, maybe `identity` will set fields from `mut` in a future world age!
    invokelatest(identity, mut)
   
    isdefined(mut, :z) && isdefined(mut, :x) || return
   
    isdefined(mut, :x) & isdefined(mut, :z) # this now infers as `true`
    isdefined(mut, :y) # this does not
end

whereas previously, only information gained successively with isdefined(mut, :x) && isdefined(mut, :y) && isdefined(mut, :z) could allow inference to conclude that mut has its z field defined.

@topolarity (Member) left a comment

Instead of a separate BitVector, would it be better to add a tombstone / marker for non-Const fields (e.g. Any)?

That seems easier to iterate and more memory-efficient for the common case of small structs.

@serenity4 (Contributor, Author) commented Feb 7, 2025

The issue with choosing Any is that we wouldn't be able to represent that a field is defined but its type is unknown. For example, for ref::RefValue{Any}, if we have fields = Any[Any], that means ref.x is defined and access will not throw.
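A minimal check of the ambiguity being described: for a RefValue{Any}, the declared field type is already Any, so an Any entry cannot be distinguished from "no information", yet the field is defined and access cannot throw.

```julia
# Sketch of the ambiguity, using only Base functionality.
r = Ref{Any}(42)                            # RefValue{Any} with its field set
@assert fieldtype(typeof(r), :x) === Any    # no type refinement is possible
@assert isdefined(r, :x)                    # yet access is guaranteed safe
@assert r.x == 42                           # does not throw
```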

We could have a sentinel value like nothing instead. I don't know whether that changes much about memory efficiency or performance, but perhaps it would be simpler by not having to map type field indices to .fields indices. On the other hand, I guess using a BitVector makes it faster to look up whether a field is defined (mostly relevant for isdefined tfuncs; I'd guess getfield requires knowing both the defined-ness and the type of a given field).

@vtjnash (Member) commented Feb 7, 2025

The sentinel for a value that definitely throws on access is Union{} and appears throughout inference for that purpose. If you are only tracking may-throw, then you need something like our VarInfo to add a boolean for it, but that complicates the type lattice.

@topolarity (Member)

> The sentinel for a value that definitely throws on access

I think @serenity4 wants the opposite here - something that definitely doesn't throw on access. The pattern to copy is probably VarState, which has an undef::Bool to track maybe-undefined

That makes this two separate enhancements: (1) handling non-contiguous fields, and (2) tracking type / defined-ness independently - seems fair to me

@serenity4 (Contributor, Author)

We need a very basic "may-throw" information here, which in the current state is encoded as a lack of information. The idea was to avoid complicating the type lattice, which is why I went for a solution that does not require introducing a new sentinel value, and instead keeps track externally of available information.

But on the other hand, it makes little sense to iterate this .fields data structure without first knowing what specific field of the wrapped type it corresponds to, so we might as well put a sentinel value that is never meant to be manipulated as a lattice element.

@serenity4 (Contributor, Author) commented Feb 7, 2025

Note that (2) is already implemented in current master, when a lattice element is present but does not add any more information than fieldtype(T, i).

@topolarity (Member)

Is it possible to represent on current master a field that is Const(3) but may also be undefined?

@serenity4 (Contributor, Author)

No, Const(3) implies that the field is guaranteed to be defined and to have value 3. In fact, so long as you associate a lattice element with a field, it models the guarantee that the field is defined.

@topolarity (Member) commented Feb 7, 2025

Hmm, it feels strange that you can represent a defined && !Const field but not a !defined && Const one TBH

My vote would be to either go in the VarState direction and make definedness independent from the type information, or else keep PartialStruct's existing limitations for !defined && Const fields but use nothing as a tombstone (that way it's clear this doesn't work like VarState)
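The two options under discussion could be sketched roughly as follows (all names here are invented for illustration, not actual compiler code):

```julia
# Option A, VarState-style: definedness is tracked independently of the type
# information, so a refined type survives even for a maybe-undefined field.
struct FieldState
    typ                  # refined element, meaningful even if maybeundef is true
    maybeundef::Bool
end

# Option B, tombstone-style: `nothing` marks a maybe-undefined field, so a
# refined type can only be recorded for fields known to be defined.
fields_b = Any[Int, nothing, Float64]   # field 2 maybe-undefined, no type kept

fs = FieldState(Int, true)              # Option A can say: maybe-undef, but Int if defined
```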

@serenity4 (Contributor, Author)

In what situation would you want to model !defined && Const? This information seems contradictory to me; if a field is undefined, then it must not have any value set, by definition. The only Const in this case that seems to make sense would be something like NULL, which is redundant with !defined (but I feel you are talking about something else).

In the current state (master and this PR), PartialStruct covers the following cases:

  • maybe-defined (no information, possibly no PartialStruct formed in the first place)
  • defined
  • defined-and-refined (e.g. Int refined as Const(3), Any refined as Int, etc)

The only thing it doesn't model is certainty that a field is undefined, but I don't find any situation where this would be particularly useful. (Plus, since we can't unset fields, it would be hard to make this guarantee in the first place, unless the struct is immutable or all operations on the object are known since its creation.)

@topolarity (Member)

> In what situation would you want to model !defined && Const?

You should read this as "maybe-undefined and Const when it is defined" because the defined bit represents "must-not-be-undefined"

That is what VarState can represent which PartialStruct currently can't - It's fine to preserve that limitation, but I do think a tombstone is clearer (and probably faster). The defined::Bool information made me think this would behave more like VarState, which it doesn't

@serenity4 (Contributor, Author)

Ah I see, indeed we have been conservative with PartialStruct so far but it might be useful to model things like "getfield may throw, but if it doesn't, it has this type/value".

In fact, it is already possible to do so using conditional branching information with Conditional, which can express that externally. Besides Conditional, perhaps a Union of PartialStructs may also be used in practice to achieve a similar effect, although both of these probably wouldn't scale when holding this two-state information for many fields.

My understanding is that since VarState is not part of the inference lattice, it cannot use the same mechanisms and therefore it is more important to encode this information internally.

I'd suggest to keep relying on the same mechanisms to express that, and perhaps use nothing (or another non-lattice element) as a sentinel value instead of having a bitvector. @aviatesk what would be your thoughts on this?

@aviatesk (Member)

The fields in PartialStruct tell us two things:

  • If fields[i] exists, then the i-th field has definitely been assigned a value.
  • If fields[i] ⊏ fieldtype(obj, i), then we have refined type information about the i-th field.

So, right now, the fields array mixes two kinds of information about refined field types and field definedness information. And as long as a new refactored data structure can express two pieces of information separately, there doesn't seem to be a fundamental difference between the approach in this PR (adding a defined::BitVector field to PartialStruct) and the idea of storing special data like VarState (or some sentinel) in the fields array of PartialStruct? I think we can choose whichever is easier to implement or more efficient.

Though since field defined-ness is really similar to the variable defined-ness information represented by VarState, I think it might be better to follow the VarState pattern and use (maybe)undef::BitVector instead of defined::BitVector.

Regardless of which approach we take, we'll need to carefully audit all the places where PartialStruct's fields are accessed and make the necessary updates. But adding undef::BitVector seems less likely to break existing code, so how about we move forward with that approach and then measure the performance later, and think about the further refactoring if there is any performance issue detected?

@topolarity (Member)

> My understanding is that since VarState is not part of the inference lattice, it cannot use the same mechanisms and therefore it is more important to encode this information internally

VarState's .typ field is a lattice element, so I think it really can - but it applies in other circumstances since Conditional requires that you have the local branch condition available to refer to (e.g. in a loop or after an assignment, it may be lost) and does not comment on defined-ness for top-level variables (e.g. a Conditional(b, Const(false), Union{}) can be undefined even if b is true, although sometimes we make up for that by carrying a second Conditional to encode the defined-ness information). I'm pretty sure a Union of PartialStruct can't be made equivalent either, although I might be wrong about that.

> And as long as a new refactored data structure can express two pieces of information separately, there doesn't seem to be a fundamental difference

I think that's the main point of discussion - the current data structure can't represent those two pieces of information separately, since defined::Bool just indicates a "skipped" field not in the array (it is equivalent to a "missing" entry in the array or a tombstone). If we want to make this more like VarState, then the "skipped field" representation doesn't work because you can't represent maybe-undef + refined type

That said, I'm a fan of @serenity4's plan above to leave that enhancement for later - it seems likely to impact a lot more code vs. the smaller changeset here

@aviatesk (Member)

> I think that's the main point of discussion - the current data structure can't represent those two pieces of information separately, since defined::Bool just indicates a "skipped" field not in the array (it is equivalent to a "missing" entry in the array or a tombstone). If we want to make this more like VarState, then the "skipped field" representation doesn't work because you can't represent maybe-undef + refined type

I'm a bit confused about this. Are you saying we can't represent a Base.RefValue{Any} with a "maybe undefined" Const(42) field like this?

struct PartialStruct
    typ
    fields::Vector{Any}
    maybeundef::BitVector
end
x = PartialStruct(Base.RefValue{Any}, Any[Const(42)], trues(1))

We'd need to handle the "maybe undef" information correctly everywhere we access x.fields. But I think that's also true if we represent it as e.g. PartialStruct(Base.RefValue{Any}, Any[MaybeUndef(Const(42))]).

@serenity4 (Contributor, Author)

We would essentially go with a vectorized version of VarState then:

  • undef::BitVector that indicates whether a field may be undefined (true) or must be defined (false), essentially expressing exactly what the new defined field introduced (undef == .!defined) but in a way that is more consistent with other data structures (such as VarState).
  • fields::Vector{Any} that always has size fieldcount(typ), as opposed to count(.!undef)/count(defined), with lattice elements that are always equivalent to or more specific than fieldtype(typ, i). (special care will be required for types like Tuple where the number of fields is unknown)

The main difference with the current state of this PR (besides undef != !defined) is that fields then always has an entry for a given field index; what @topolarity referred to is that right now, if the field may be undefined (!defined[i]), we just don't have an entry in this vector (so length(fields) <= fieldcount(typ)).

In this way, we move out the information that the i-th field is defined to the bitvector, instead of encoding it as length(fields) >= i.
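The layout being proposed could be sketched like this (the type and field names here are invented for illustration; they are not the PR's actual definitions):

```julia
# Sketch of the vectorized VarState-like layout described above.
struct PartialStructSketch
    typ
    undef::BitVector     # undef[i]: field i may be undefined
    fields::Vector{Any}  # length fieldcount(typ); fields[i] refines fieldtype(typ, i)
end

mutable struct A2; x; y; z; A2() = new(); end

# x known defined (refined to Int), y maybe-undefined, z known defined:
ps = PartialStructSketch(A2, BitVector([false, true, false]), Any[Int, Any, Any])
```

Note that `fields` keeps an entry for the maybe-undefined field y (here just Any, i.e. no refinement), which is exactly what the current skipped-entry encoding cannot express.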

@serenity4 (Contributor, Author) commented Feb 10, 2025

If we are worried about memory efficiency/performance (notably for large structs), we can always keep the optimization that if there exists i such that undef[j] === true && fields[j] == fieldtype(typ, j) for all j >= i (i.e. this PartialStruct provides no new information for those fields), we may resize fields (and possibly undef) to length i - 1 (to be updated as soon as new information becomes available for a field >= i). Unless we notice performance degradations, we can keep that for later though.

That seems to be the cost of being able to express both maybe-undef and refined type information for a given field (which the current PR avoids by only keeping type for fields that are known to be defined, or master by giving up on storing refined field type information starting from the first possibly undefined field).
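The trailing-truncation optimization described above might look roughly like this (a hypothetical helper, not code from the PR):

```julia
# Drop trailing entries that are maybe-undefined and carry no type refinement,
# so uninformative suffixes cost no storage.
function truncate_uninformative!(fields::Vector{Any}, undef::BitVector, typ)
    n = length(fields)
    while n > 0 && undef[n] && fields[n] === fieldtype(typ, n)
        n -= 1
    end
    resize!(fields, n)
    resize!(undef, n)
    return fields
end

mutable struct D; a::Int; b; end

fields = Any[Int, Any]           # neither entry refines its declared field type
undef  = BitVector([true, true]) # both fields maybe-undefined
truncate_uninformative!(fields, undef, D)   # both entries are dropped
```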

@vtjnash (Member) commented Feb 11, 2025

> The fields in PartialStruct tell us two things:
>
>   • If fields[i] exists, then the i-th field has definitely been assigned a value.
>   • If fields[i] ⊏ fieldtype(obj, i), then we have refined type information about the i-th field.
>
> So, right now, the fields array mixes two kinds of information about refined field types and field definedness information.

This description sounds concerning, as it seems to mean we cannot correctly represent the tmerge of two of these such types

@aviatesk (Member)

We can't simply tmerge the fields of PartialStruct even right now.
When dealing with something like PartialStruct(Base.RefValue{Any}, Any[]) ⊔ PartialStruct(Base.RefValue{Any}, Any[Const(42)]) = PartialStruct(Base.RefValue{Any}, Any[]), the tmerge_partial_struct function needs to behave differently depending on the length(fields) of each PartialStruct.
Since we need to handle this maybe-undefined-ness anyway, I think it would be better to encode that information explicitly with undef::BitVector instead of relying on the current length(fields) to represent it?
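With explicit bits, the definedness part of the merge becomes straightforward (a minimal sketch, not `tmerge` itself): after a merge, a field is guaranteed defined only if both branches guarantee it, so the maybe-undef bits combine with `|`.

```julia
# Hypothetical helper: merge maybe-undef bits from two branches.
merge_undef(a::BitVector, b::BitVector) = a .| b

# Branch 1 guarantees both fields defined; branch 2 only guarantees field 1.
u = merge_undef(BitVector([false, true]), BitVector([false, false]))
# u[1] is false (still guaranteed defined), u[2] is true (maybe undefined)
```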

@serenity4 force-pushed the extend-partialstruct branch from 8a9f31d to 67b9168 on February 11, 2025
@vtjnash (Member) commented Feb 11, 2025

> the tmerge_partial_struct function needs to behave differently depending on the length(fields) of each PartialStruct.

Oh, I see, that is slightly confusing, since it goes in the opposite direction of how the lattice is normally implemented, and so it discards information slightly faster than expected. The change to the bitvector approach (with an optimization that it only gets allocated if any field is maybe-undefined) seems slightly clearer semantically.

@serenity4 (Contributor, Author) left a comment

Tests seem to be passing (besides one OOM in the threading tests, not sure if that's related). Should we run benchmarks to see if the current design has any performance regressions? It seemed locally that compile times had increased a fair bit, but I don't observe that in CI. I would nonetheless recommend checking.

@serenity4 marked this pull request as ready for review on February 14, 2025 at 16:36
@aviatesk (Member)

@nanosoldier runbenchmarks("inference", vs=":master")

@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@serenity4 (Contributor, Author)

@nanosoldier runbenchmarks("inference", vs=":master")
