Allow custom `BINARY_OP` specializations to be registered at runtime. #162

brandtbucher · 2021-12-08T23:19:26Z

brandtbucher
Dec 8, 2021
Maintainer

(@markshannon, not sure if this accurately captures your vision for this. Please let me know if not!)

During discussions about the recent unification of all of the BINARY_*/INPLACE_* ops into BINARY_OP, the possibility of providing hooks for "custom" specializations (say, for third-party types) briefly came up. I figured we could attempt to flesh out the ideas here a bit more. I think that, in addition to dramatically increasing the flexibility of our specialization machinery, this also has the opportunity to (a) make adding and experimenting with new specializations easier, and (b) clean up our existing operator specializations quite a bit.

The way I see it, the bare minimum information required for pluggable operator specializations would include:

- the operator being specialized
- the type being specialized on
- a callback to actually perform the specialized operation

This information could be provided to the interpreter through a simple API that would be called during interpreter creation (or, for third-party/stdlib modules, during import):

// int _PyEval_HookBinaryOp(int op, PyTypeObject *type, binaryfunc hook);
if (_PyEval_HookBinaryOp(NB_ADD, &PyLong_Type, (binaryfunc)_PyLong_Add) ||
    _PyEval_HookBinaryOp(NB_INPLACE_ADD, &PyLong_Type, (binaryfunc)_PyLong_Add))
{
    // ...
}
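As a rough illustration of what could sit behind such an API, here is a minimal standalone sketch of a hook registry. It is not CPython code: `TypeObject` and the local `binaryfunc` typedef are stand-ins so it compiles without `Python.h`, and `MAX_HOOKS`, `hook_binary_op`, `find_hook`, and `return_lhs` are hypothetical names. It assumes a nonzero return value signals failure, which is how the `if (... || ...)` call site above reads.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch compiles without Python.h. */
typedef struct { const char *name; } TypeObject;   /* for PyTypeObject */
typedef void *(*binaryfunc)(void *, void *);        /* for binaryfunc */

typedef struct {
    int op;            /* operator being specialized, e.g. NB_ADD */
    TypeObject *type;  /* type being specialized on */
    binaryfunc hook;   /* callback that performs the operation */
} hookinfo;

#define MAX_HOOKS 32   /* hypothetical fixed capacity */

hookinfo hooks[MAX_HOOKS];
size_t nhooks = 0;

/* Register a hook. Returns 0 on success and -1 when the table is full
   (assumption: nonzero means failure). */
int
hook_binary_op(int op, TypeObject *type, binaryfunc hook)
{
    assert(type != NULL && hook != NULL);
    if (nhooks == MAX_HOOKS) {
        return -1;
    }
    hooks[nhooks].op = op;
    hooks[nhooks].type = type;
    hooks[nhooks].hook = hook;
    nhooks++;
    return 0;
}

/* Linear scan; the first registered match wins. NULL means "no hook". */
binaryfunc
find_hook(int op, const TypeObject *type)
{
    for (size_t i = 0; i < nhooks; i++) {
        if (hooks[i].op == op && hooks[i].type == type) {
            return hooks[i].hook;
        }
    }
    return NULL;
}

/* Toy hook for experimenting: returns its left operand unchanged. */
void *
return_lhs(void *lhs, void *rhs)
{
    (void)rhs;
    return lhs;
}
```

A linear scan is probably acceptable here because it only runs when an instruction is being specialized, not on every execution of the specialized instruction.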

Specializing BINARY_OP instructions would just be a matter of consulting these hooks for a match. I'm imagining something like this:

void
_Py_Specialize_BinaryOp(PyObject *lhs, PyObject *rhs, _Py_CODEUNIT *instr,
                        SpecializedCacheEntry *caches)
{
    _PyAdaptiveEntry *adaptive = &caches[0].adaptive;
    if (!Py_IS_TYPE(lhs, Py_TYPE(rhs))) {
        SPECIALIZATION_FAIL(BINARY_OP, SPEC_FAIL_DIFFERENT_TYPES);
        goto failure;
    }
    adaptive->version = Py_TYPE(lhs)->tp_version_tag;
    if (UINT16_MAX < adaptive->version) {
        SPECIALIZATION_FAIL(BINARY_OP, SPEC_FAIL_OUT_OF_VERSIONS);
        goto failure;
    }
    // Check the registered hooks one-by-one until we get a match (or we've
    // exhausted them all):
    int op = _Py_OPARG(*instr);
    PyInterpreterState *interpreter = _PyInterpreterState_GET();
    for (Py_ssize_t i = 0; i < interpreter->nhooks; i++) {
        hookinfo *info = &interpreter->hooks[i];
        if (info->op == op && Py_IS_TYPE(lhs, info->type)) {
            _PyBinaryFuncCache *binary = &caches[-1].binary;
            binary->func = info->hook;
            *instr = _Py_MAKECODEUNIT(BINARY_OP_HOOKED, _Py_OPARG(*instr));
            STAT_INC(BINARY_OP, specialization_success);
            adaptive->counter = initial_counter_value();
            return;
        }
    }
    SPECIALIZATION_FAIL(BINARY_OP, SPEC_FAIL_OTHER);
failure:
    STAT_INC(BINARY_OP, specialization_failure);
    cache_backoff(adaptive);
}

The instruction implementation itself would be pretty simple:

TARGET(BINARY_OP_HOOKED) {
    PyObject *lhs = SECOND();
    PyObject *rhs = TOP();
    DEOPT_IF(!Py_IS_TYPE(lhs, Py_TYPE(rhs)), BINARY_OP);
    SpecializedCacheEntry *caches = GET_CACHE();
    _PyAdaptiveEntry *adaptive = &caches[0].adaptive;
    DEOPT_IF(Py_TYPE(lhs)->tp_version_tag != adaptive->version, BINARY_OP);
    STAT_INC(BINARY_OP, hit);
    _PyBinaryFuncCache *binary = &caches[-1].binary;
    PyObject *res = binary->func(lhs, rhs);
    if (res == NULL) {
        goto error;
    }
    STACK_SHRINK(1);
    Py_DECREF(lhs);
    Py_DECREF(rhs);
    SET_TOP(res);
    DISPATCH();
}

I'm not sure if allowing more sophisticated specialization criteria (like different LHS and RHS types) would be worth it, especially considering that all of our existing specializations (except for BINARY_OP_INPLACE_ADD_UNICODE) would work perfectly fine with the proposed scheme. It would probably also require an additional cache entry per instruction.

It would also be an open question whether it's worth converting our existing BINARY_OP specializations to these hooks. If there's no measurable slowdown, I think it would be quite nice to clean up all of the existing special-case logic we have for these specializations.

markshannon · 2022-01-13T13:51:11Z

markshannon
Jan 13, 2022
Collaborator

I think the registered functions should consume the references to the arguments.

This pushes work into the client code, but when temporary variables are used, it would allow in-place modification.

Given the lengths that NumPy goes to in order to avoid creating temporaries, it should be popular with third parties.
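To make the benefit concrete, here is a small standalone sketch of a reference-consuming binary function that reuses its left operand when it holds the only reference. `Box` is a toy refcounted type standing in for a real object, and every name in it (`box_new`, `box_decref`, `box_add_consuming`) is hypothetical. It also assumes that operands are consumed even when allocation fails, which is a convention the thread does not settle.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy refcounted box holding a long; a stand-in for a real object. */
typedef struct {
    long refcnt;
    long value;
} Box;

Box *
box_new(long value)
{
    Box *box = malloc(sizeof(Box));
    if (box != NULL) {
        box->refcnt = 1;
        box->value = value;
    }
    return box;
}

void
box_decref(Box *box)
{
    assert(box->refcnt > 0);
    if (--box->refcnt == 0) {
        free(box);
    }
}

/* Reference-consuming add: the callee owns one reference to each operand
   and must either release it or hand it back as the result. */
Box *
box_add_consuming(Box *lhs, Box *rhs)
{
    Box *res;
    if (lhs->refcnt == 1) {
        /* We hold the only reference to lhs, so nobody else can observe
           the mutation: reuse it as the result instead of allocating. */
        lhs->value += rhs->value;
        res = lhs;
    }
    else {
        /* lhs is shared: allocate a fresh result and drop our reference.
           Assumption: operands are consumed even if allocation fails. */
        res = box_new(lhs->value + rhs->value);
        box_decref(lhs);
    }
    box_decref(rhs);
    return res;
}
```

With a borrowing calling convention the callee can never see a refcount of 1 for a temporary, because the caller still holds its own reference during the call; consuming the references is what makes the `refcnt == 1` test meaningful.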

markshannon · 2022-01-13T13:56:21Z

markshannon
Jan 13, 2022
Collaborator

Another, somewhat related, possibility is to have the compiler clear the lhs in l += r, or l = l + r.
So, instead of compiling to

LOAD_FAST                0 (l)
LOAD_FAST                1 (r)
BINARY_OP               13 (+=)
STORE_FAST               0 (l)

we would compile to:

LOAD_FAST                0 (l)
DELETE_FAST              0 (l)
LOAD_FAST                1 (r)
BINARY_OP               13 (+=)  # Refcount of lhs is 1 (if not referenced elsewhere), allowing inplace modification
STORE_FAST               0 (l)

(The LOAD_FAST; DELETE_FAST could be combined into a CLEAR_AND_LOAD instruction)
This is a subtle change in behavior: l will not be visible in the debugger during the += operation.

Doing this would allow us to get rid of the special case for string addition, so it shouldn't add any complexity overall.

8 replies

brandtbucher Jan 13, 2022
Maintainer Author

> Why do the test _Py_OPCODE(*next_instr) == STORE_FAST && GETLOCAL(_Py_OPARG(*next_instr)) == lhs at runtime, when it can be done at compile time?

I believe it covers more cases, is simpler, and is less error-prone.

markshannon Jan 13, 2022
Collaborator

I don't see why it is less error-prone, nor do I see that it covers more cases. It is simpler, but also slower.

The good thing is that it doesn't really matter, as we can change from one implementation to the other without breaking anything.

brandtbucher Jan 13, 2022
Maintainer Author

> I don't see why it is less error-prone

It doesn't clear l on error, and doesn't require extra logic in the compiler to distinguish between l += r, l[x] += r, l.x += r, l = l + r, l += l + r, l += (l := r), etc.

> nor do I see that it covers more cases

It correctly optimizes any case where lhs has the same identity as the name being stored, including unnamed temporaries.

markshannon Jan 13, 2022
Collaborator

Except when it crashes and burns due to a use after free.
E.g.

def __add__(this, that):
    del sys._getframe(1).f_locals['l']
    # `this` has now been freed.

If l has already been cleared, then it is safe.

brandtbucher Jan 13, 2022
Maintainer Author

Ah, sneaky. So maybe combine the two approaches, and replace Py_DECREF(lhs); in the fast path with SETLOCAL(_Py_OPARG(*next_instr), NULL);, and restore it with SETLOCAL(_Py_OPARG(*next_instr), lhs); in the case where res == NULL.

We'd probably need to require that lhs is never mutated in the error case, but that wouldn't be too painful.

markshannon · 2022-01-13T14:10:10Z

markshannon
Jan 13, 2022
Collaborator

We probably want to specialize on builtin, immutable classes only, and on both operands (as we want to specialize for (int, float) and other cases where type(lhs) != type(rhs)).

So the guard code would probably look something like:

DEOPT_IF(Py_TYPE(lhs)->tp_version_tag != (adaptive->version & 255), BINARY_OP);
DEOPT_IF(Py_TYPE(rhs)->tp_version_tag != (adaptive->version >> 8), BINARY_OP);

We will need to reserve the first 255 versions for these classes.
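The packing that this guard relies on can be sketched in standalone C as follows. The helper names `pack_versions` and `versions_match` are hypothetical, and the layout (lhs in the low byte, rhs in the high byte) is read off the guard code above. Version 0 is excluded because a `tp_version_tag` of 0 means the type has no valid version, which is why 8 bits give 255 usable versions.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 8-bit type versions into one 16-bit cache field:
   lhs in the low byte, rhs in the high byte. Both must be in 1..255. */
uint16_t
pack_versions(uint32_t lhs_version, uint32_t rhs_version)
{
    assert(lhs_version >= 1 && lhs_version <= 255);
    assert(rhs_version >= 1 && rhs_version <= 255);
    return (uint16_t)(lhs_version | (rhs_version << 8));
}

/* Returns 1 if both operand versions match the cached pair, meaning the
   specialized instruction may proceed; 0 means "deoptimize". */
int
versions_match(uint16_t cached, uint32_t lhs_version, uint32_t rhs_version)
{
    /* Parenthesize the mask and shift: in C, != binds tighter than &. */
    return lhs_version == (uint32_t)(cached & 255)
        && rhs_version == (uint32_t)(cached >> 8);
}
```

Because the full 32-bit version is compared against a single byte, any type whose version falls outside the reserved range fails the guard automatically.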

If we want to guarantee that a registered function is always called once registered, then the non-specialized form will need to perform the lookup efficiently. So we will need some sort of hashtable mapping (op, lhs-version, rhs-version) to the registered function. We can always add this feature later.
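One possible shape for such a table is a small open-addressing hash table keyed on the packed triple. This is only a sketch under assumptions: the table size, the key layout, the multiplicative hash constant, and all names (`Slot`, `make_key`, `table_register`, `table_lookup`, `return_lhs`) are hypothetical, and it assumes operators fit in 8 bits and versions are in the reserved 1..255 range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void *(*binaryfunc)(void *, void *);  /* stand-in for binaryfunc */

#define TABLE_SIZE 64  /* hypothetical; slot_index() yields 6 bits to match */

typedef struct {
    uint32_t key;      /* 0 marks an empty slot */
    binaryfunc func;
} Slot;

Slot slots[TABLE_SIZE];

/* Pack (op, lhs-version, rhs-version) into a nonzero key, or return 0 if
   any part is out of range (op 0..255, versions 1..255). */
uint32_t
make_key(int op, uint32_t lhs_version, uint32_t rhs_version)
{
    if (op < 0 || op > 255 ||
        lhs_version < 1 || lhs_version > 255 ||
        rhs_version < 1 || rhs_version > 255) {
        return 0;
    }
    return ((uint32_t)op << 16) | (lhs_version << 8) | rhs_version;
}

size_t
slot_index(uint32_t key)
{
    assert(key != 0);
    /* Multiplicative hash; keep the top 6 bits (0..63). */
    return (size_t)((uint32_t)(key * 2654435761u) >> 26);
}

/* Insert or overwrite. Returns 0 on success, -1 on bad input or full table. */
int
table_register(int op, uint32_t lhs_version, uint32_t rhs_version,
               binaryfunc func)
{
    uint32_t key = make_key(op, lhs_version, rhs_version);
    if (key == 0 || func == NULL) {
        return -1;
    }
    size_t start = slot_index(key);
    for (size_t probe = 0; probe < TABLE_SIZE; probe++) {
        Slot *slot = &slots[(start + probe) % TABLE_SIZE];
        if (slot->key == 0 || slot->key == key) {
            slot->key = key;
            slot->func = func;
            return 0;
        }
    }
    return -1;
}

/* Linear probing lookup; NULL means nothing is registered for the triple. */
binaryfunc
table_lookup(int op, uint32_t lhs_version, uint32_t rhs_version)
{
    uint32_t key = make_key(op, lhs_version, rhs_version);
    if (key == 0) {
        return NULL;
    }
    size_t start = slot_index(key);
    for (size_t probe = 0; probe < TABLE_SIZE; probe++) {
        const Slot *slot = &slots[(start + probe) % TABLE_SIZE];
        if (slot->key == key) {
            return slot->func;
        }
        if (slot->key == 0) {
            return NULL;  /* empty slot: the key was never registered */
        }
    }
    return NULL;
}

/* Toy hook for experimenting: returns its left operand unchanged. */
void *
return_lhs(void *lhs, void *rhs)
{
    (void)rhs;
    return lhs;
}
```

Entries are never deleted in this sketch, which keeps linear probing correct without tombstones.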

markshannon · 2022-01-13T18:04:43Z

markshannon
Jan 13, 2022
Collaborator

FTR, my original HotPy used table lookup for binary operators, to avoid the overhead of tracing the double-dispatch dance for simple types.
It worked well.

markshannon · 2022-06-23T10:07:58Z

markshannon
Jun 23, 2022
Collaborator

Thinking about this further, we want a design that:

- Doesn't need much, if any, additional space in the bytecode
- Has as few memory reads as possible
- Has only one indirect call or jump

To keep the additional space to a minimum, we want to use an index into a table, rather than a version number and a function pointer. If we are willing to pay additional cost when de-optimizing and creating the co_code bytes, we can store the index in the oparg. Another option is to use the top 8 bits of the counter, and use an 8-bit counter when adapting binary ops. Or we can just add another 16 bits to the cache, which is much simpler. We'll probably just add the 16 bits for maintainability reasons.

We can avoid multiple tests if we use a 64 bit version number in the table.

So we want:

- A small integer index in the instruction
- That indexes into a table of typedef struct { uint64_t version_pair; binaryfunc funcptr; } TableEntry;

The code for the specialized form would look something like:

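TableEntry *ptr = &TheTable[cache->index];
/* Widen before shifting: a 32-bit value shifted by 32 is undefined. */
uint64_t version_pair = ((uint64_t)Py_TYPE(a)->tp_version_tag << 32) |
                        Py_TYPE(b)->tp_version_tag;
DEOPT_IF(version_pair != ptr->version_pair, BINARY_OP);
PyObject *res = ptr->funcptr(a, b); /* Consumes references */
...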

We will also want an ancillary table to look up the index from the version number pair when specializing, but that's much less performance critical.
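Put together, the main table, its guard, and the ancillary lookup might look roughly like the standalone sketch below. `TableEntry` and `TheTable` follow the names used above; everything else (`TABLE_LEN`, `TableOps`, `table_add`, `table_find`, `table_guard`, `return_lhs`) is hypothetical. The ancillary lookup also keys on the operator, which is my assumption: two operators on the same pair of types share a version pair, so the pair alone would be ambiguous. A linear scan stands in for whatever structure the ancillary table would really use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void *(*binaryfunc)(void *, void *);  /* stand-in for binaryfunc */

typedef struct {
    uint64_t version_pair;  /* lhs version in the high 32 bits, rhs in the low */
    binaryfunc funcptr;
} TableEntry;

#define TABLE_LEN 32        /* hypothetical capacity */

TableEntry TheTable[TABLE_LEN];
int TableOps[TABLE_LEN];    /* ancillary: operator of each entry (assumption) */
int table_used = 0;

uint64_t
make_version_pair(uint32_t lhs_version, uint32_t rhs_version)
{
    /* Widen first: shifting a 32-bit value left by 32 is undefined in C. */
    return ((uint64_t)lhs_version << 32) | rhs_version;
}

/* Append an entry; returns its index, or -1 if the table is full. */
int
table_add(int op, uint32_t lhs_version, uint32_t rhs_version, binaryfunc func)
{
    assert(func != NULL);
    if (table_used == TABLE_LEN) {
        return -1;
    }
    TheTable[table_used].version_pair =
        make_version_pair(lhs_version, rhs_version);
    TheTable[table_used].funcptr = func;
    TableOps[table_used] = op;
    return table_used++;
}

/* Slow path, used only when specializing: find the index to cache. */
int
table_find(int op, uint32_t lhs_version, uint32_t rhs_version)
{
    uint64_t pair = make_version_pair(lhs_version, rhs_version);
    for (int i = 0; i < table_used; i++) {
        if (TableOps[i] == op && TheTable[i].version_pair == pair) {
            return i;
        }
    }
    return -1;
}

/* Fast path: one entry read and one 64-bit compare.
   Returns the function to call, or NULL meaning "deoptimize". */
binaryfunc
table_guard(int index, uint32_t lhs_version, uint32_t rhs_version)
{
    assert(index >= 0 && index < table_used);
    const TableEntry *entry = &TheTable[index];
    if (entry->version_pair != make_version_pair(lhs_version, rhs_version)) {
        return NULL;
    }
    return entry->funcptr;
}

/* Toy hook for experimenting: returns its left operand unchanged. */
void *
return_lhs(void *lhs, void *rhs)
{
    (void)rhs;
    return lhs;
}
```

The fast path never consults `TableOps`: the cached index already identifies the operator, so the guard only has to confirm that the operand types are unchanged.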

markshannon · 2022-06-23T12:48:57Z

markshannon
Jun 23, 2022
Collaborator

Looking at the stats, we should be able to virtually eliminate failed specializations with 20 or 30 entries.
That might seem like a lot of extra code, but much of the code already exists; it just needs to be registered.

markshannon · 2022-08-23T11:17:22Z

markshannon
Aug 23, 2022
Collaborator

We can also handle BINARY_SUBSCR with the same code. It will need a different opcode to allow deoptimization, but the instruction code can be the same.

sh471 · 2023-08-19T20:29:58Z

sh471
Aug 19, 2023

Just found this. I'm so sorry for the extremely delayed 😔 response.
