Thread-local arenas #8692

Open
wants to merge 1 commit into
base: main

kddnewton

Summary

Currently, all threads use the same arena for imaging. When there are enough workers, in regular Python the GIL will be under lots of contention and in free-threaded Python the mutex will be under lots of contention.

This commit instead introduces lockless thread-local arenas for environments that support it. For environments that do not support thread-locals (or for environments where we couldn't determine if they do or not) we fall back to either the GIL or a mutex if there is no GIL.

This has some implications for statistics, as statistics are now thread-specific. This could be solved in a couple of ways (in C or in Python), or left unsolved and just documented. I think either way is fine.

Code

Most of the code doesn't actually need to change. The bulk of the changes were getting setup.py to emit the proper compilation definitions so that we could check at compile time which kind of thread-local declarations are supported. Other than that, the declaration of the default arena now carries the thread-local declaration, and in the places where we previously locked the mutex, the macro name has changed to reflect that it is specific to the thread-local arena.
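
A minimal sketch of the kind of compile-time probing described above, assuming a helper that tries each candidate thread-local keyword against the C compiler and emits a definition for the first one that works. The names `detect_thread_local`, `compiles_with_cc` and the macro `HAVE_THREAD_LOCAL_KEYWORD` are hypothetical and not taken from this PR's setup.py, and the probe assumes a `cc` binary on the path.

```python
import os
import shutil
import subprocess
import tempfile

# Candidate thread-local storage specifiers, most standard first.
CANDIDATES = ["_Thread_local", "__thread", "__declspec(thread)"]


def compiles_with_cc(keyword, cc="cc"):
    """Return True if the C compiler accepts `keyword` (assumes a `cc` binary)."""
    if shutil.which(cc) is None:
        return False
    source = f"static {keyword} int counter;\nint main(void) {{ return counter; }}\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "probe.c")
        with open(path, "w") as f:
            f.write(source)
        result = subprocess.run(
            [cc, "-c", path, "-o", os.path.join(tmp, "probe.o")],
            capture_output=True,
        )
        return result.returncode == 0


def detect_thread_local(try_compile=compiles_with_cc):
    """Return define_macros entries for the first keyword that compiles, else []."""
    for keyword in CANDIDATES:
        if try_compile(keyword):
            # Hypothetical macro name; the C side would use it in the arena declaration.
            return [("HAVE_THREAD_LOCAL_KEYWORD", keyword)]
    # Nothing compiled: the C code would fall back to locking (GIL or mutex).
    return []
```

The returned list has the `(name, value)` shape that a setuptools `Extension` accepts for `define_macros`; an empty list corresponds to the fallback path.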

Benchmarks

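For regular Python, this didn't make much of a difference (the difference between the samples wasn't statistically significant at the 95% confidence level). For free-threaded Python, however, the difference was fairly massive: mean run time dropped by roughly two thirds (from about 0.128 s to about 0.042 s per run, averaged across the samples below), i.e. about a 3x speedup.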

v3.13.0 on main

Max: 0.439743 Mean: 0.355661 Min: 0.305783
Max: 0.415384 Mean: 0.361710 Min: 0.304075
Max: 0.427207 Mean: 0.366160 Min: 0.300381
Max: 0.460026 Mean: 0.388431 Min: 0.316797
Max: 0.419853 Mean: 0.361484 Min: 0.309495
Max: 0.393699 Mean: 0.350330 Min: 0.302294
Max: 0.443584 Mean: 0.372369 Min: 0.311351
Max: 0.404041 Mean: 0.355057 Min: 0.309706
Max: 0.420880 Mean: 0.341415 Min: 0.280980
Max: 0.408922 Mean: 0.320707 Min: 0.228622

v3.13.0t on main

Max: 0.218140 Mean: 0.143962 Min: 0.091831
Max: 0.195644 Mean: 0.124187 Min: 0.079139
Max: 0.169986 Mean: 0.124365 Min: 0.081508
Max: 0.194228 Mean: 0.136258 Min: 0.103134
Max: 0.192837 Mean: 0.131196 Min: 0.094301
Max: 0.180463 Mean: 0.126546 Min: 0.079336
Max: 0.181516 Mean: 0.126875 Min: 0.083507
Max: 0.178397 Mean: 0.120558 Min: 0.083620
Max: 0.182262 Mean: 0.129299 Min: 0.087499
Max: 0.167291 Mean: 0.114647 Min: 0.074147

v3.13.0 on branch

Max: 0.429302 Mean: 0.362776 Min: 0.314723
Max: 0.406314 Mean: 0.355255 Min: 0.299485
Max: 0.438540 Mean: 0.378539 Min: 0.308898
Max: 0.425942 Mean: 0.368141 Min: 0.310095
Max: 0.408924 Mean: 0.365672 Min: 0.313756
Max: 0.419717 Mean: 0.361498 Min: 0.307699
Max: 0.418639 Mean: 0.355136 Min: 0.314148
Max: 0.426816 Mean: 0.377236 Min: 0.321773
Max: 0.424230 Mean: 0.358225 Min: 0.291148
Max: 0.421029 Mean: 0.363783 Min: 0.315103

v3.13.0t on branch

Max: 0.103066 Mean: 0.041306 Min: 0.018575
Max: 0.121496 Mean: 0.043042 Min: 0.018622
Max: 0.129727 Mean: 0.040726 Min: 0.014389
Max: 0.124282 Mean: 0.037581 Min: 0.018034
Max: 0.112015 Mean: 0.042051 Min: 0.017231
Max: 0.123254 Mean: 0.042117 Min: 0.019646
Max: 0.129165 Mean: 0.043886 Min: 0.017393
Max: 0.157608 Mean: 0.045151 Min: 0.017874
Max: 0.117050 Mean: 0.043070 Min: 0.016238
Max: 0.131859 Mean: 0.044563 Min: 0.017736
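
As a rough check on the free-threaded numbers, the per-run means listed above can be averaged and compared. This is plain arithmetic on the reported values; the variable names are only illustrative.

```python
from statistics import mean

# Per-run mean times in seconds for free-threaded Python (3.13.0t),
# copied from the "on main" and "on branch" listings above.
main_means = [0.143962, 0.124187, 0.124365, 0.136258, 0.131196,
              0.126546, 0.126875, 0.120558, 0.129299, 0.114647]
branch_means = [0.041306, 0.043042, 0.040726, 0.037581, 0.042051,
                0.042117, 0.043886, 0.045151, 0.043070, 0.044563]

main_avg = mean(main_means)
branch_avg = mean(branch_means)

# Fraction of run time removed, and the equivalent speedup factor.
reduction = 1 - branch_avg / main_avg
speedup = main_avg / branch_avg

print(f"main: {main_avg:.4f}s  branch: {branch_avg:.4f}s")
print(f"reduction: {reduction:.1%}  speedup: {speedup:.2f}x")
```

The reduction comes out at roughly two thirds, which corresponds to about a threefold speedup.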

Script

Below is the script that I used to run these benchmarks.

bench.py
import concurrent.futures
import os
import threading
import time

from PIL import Image

num_threads = 16
num_images = 1024


def operation():
    images = []
    for i in range(num_images):
        img = Image.new(
            "RGB", (100, 100), color=(i % 256, (i // 256) % 256, (i // 65536) % 256)
        )
        images.append(img)

    for img in images:
        img = img.convert("CMYK")

    images.clear()


def worker(barrier):
    barrier.wait()
    runtimes = []

    for _ in range(5):
        start_time = time.time()
        operation()
        end_time = time.time()
        runtimes.append(end_time - start_time)

    return runtimes


def benchmark():
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
        barrier = threading.Barrier(num_threads)
        futures = [executor.submit(worker, barrier) for _ in range(num_threads)]

        run_times = []
        for future in concurrent.futures.as_completed(futures):
            try:
                run_times.extend(future.result())
            except IndexError:
                os._exit(-1)

        min_time = min(run_times)
        max_time = max(run_times)
        mean_time = sum(run_times) / len(run_times)
        print(f"Max: {max_time:.6f} Mean: {mean_time:.6f} Min: {min_time:.6f}")


benchmark()

@aclark4life
Member

aclark4life commented Jan 13, 2025

@kddnewton Can we say "environment" instead of "arena" here? Otherwise, thank you for the PR! Oh, or alternatively please explain what "arena" is in this context, haven't heard that one before. At a glance, it looks like either "arena" is another word for "project" or it's an imaging term I'm not familiar with 😄

@hugovk hugovk added the Free-threading PEP 703 support label Jan 13, 2025
@kddnewton
Author

@aclark4life No problem! Arenas in this context are memory arenas, which are already in use inside Pillow. The general idea is that an arena is a large contiguous block of memory from which you hand out smaller allocations yourself, avoiding the cost of calling malloc/free for each one. Below is a super simplified example:

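#include <stdlib.h>

/* Placeholder definition so the example compiles; the field is illustrative. */
struct my_struct {
  int value;
};

int main(void) {
  /* Three separate heap allocations, each with its own malloc call. */
  struct my_struct *s1 = malloc(sizeof(struct my_struct));
  struct my_struct *s2 = malloc(sizeof(struct my_struct));
  struct my_struct *s3 = malloc(sizeof(struct my_struct));

  /* do something */

  /* Each allocation has to be released individually. */
  free(s1);
  free(s2);
  free(s3);

  return EXIT_SUCCESS;
}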

In this example we manually allocate memory for all three structs, and then manually free them. This can cause heap fragmentation, and every malloc and free is a separate call into the allocator (and potentially a system call). Instead:

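#include <stdint.h>
#include <stdlib.h>

/* Placeholder definition so the example compiles; the field is illustrative. */
struct my_struct {
  int value;
};

struct my_arena {
  uint8_t *memory; /* start of the backing block */
  size_t size;     /* number of bytes handed out so far */
};

/* Bump allocation: return the next unused bytes and advance the offset.
   Bounds checks and alignment handling are omitted for brevity. */
void *my_malloc(struct my_arena *arena, size_t size) {
  void *result = arena->memory + arena->size;
  arena->size += size;
  return result;
}

int main(void) {
  /* One real allocation backs every object below. */
  struct my_arena arena = { .memory = malloc(1024), .size = 0 };
  if (arena.memory == NULL) {
    return EXIT_FAILURE;
  }

  struct my_struct *s1 = my_malloc(&arena, sizeof(struct my_struct));
  struct my_struct *s2 = my_malloc(&arena, sizeof(struct my_struct));
  struct my_struct *s3 = my_malloc(&arena, sizeof(struct my_struct));

  /* do something */

  /* One free releases everything at once. */
  free(arena.memory);

  return EXIT_SUCCESS;
}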

In this example we make a single memory allocation and then a single free, which means all of the memory is contiguous (helping with locality) and only two allocator calls are made (more efficient). I'm omitting a couple of details here about bookkeeping, such as bounds checks and alignment, but that's the general gist.

This is already in place in Pillow. There is a single global arena that is used for all memory allocations. This is great, and helps a lot in terms of performance. However the downside is that in a multi-threaded environment, the mutex that wraps access to the arena (be it the GIL or a Python mutex in free-threaded Python) falls under a lot of contention because everyone is trying to use the same arena.

This commit instead makes a separate arena for each thread, so that each thread manages its own memory. This means it never falls under contention, and you can see in the benchmarks that it drastically speeds up free-threaded Python because it never has to lock anything.
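
The difference between the two designs can be sketched in Python with a toy block pool. This is only an illustration of the idea, not Pillow's C implementation; `ToyArena`, `shared_alloc`, `local_arena` and `worker` are hypothetical names.

```python
import threading

BLOCK_SIZE = 64


class ToyArena:
    """Toy stand-in for a block cache: reuses freed buffers instead of reallocating."""

    def __init__(self):
        self.free_blocks = []

    def alloc(self):
        if self.free_blocks:
            return self.free_blocks.pop()
        return bytearray(BLOCK_SIZE)

    def release(self, block):
        self.free_blocks.append(block)


# Shared design: one arena, every access serialized by a lock.
shared_arena = ToyArena()
shared_lock = threading.Lock()


def shared_alloc():
    with shared_lock:
        return shared_arena.alloc()


# Thread-local design: each thread lazily gets its own arena, so no lock is needed.
_local = threading.local()


def local_arena():
    if not hasattr(_local, "arena"):
        _local.arena = ToyArena()
    return _local.arena


def worker(seen):
    # Allocate and release using only this thread's arena.
    arena = local_arena()
    block = arena.alloc()
    arena.release(block)
    seen.append(arena)
```

Running `worker` on several threads yields one distinct arena per thread, whereas every call to `shared_alloc` goes through the same lock.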

I hope I explained that sufficiently, let me know if there's anything I can clear up!

@hugovk
Member

hugovk commented Jan 13, 2025

cc @lysnikolaou who's been helping with the free-threaded work.

@kddnewton kddnewton force-pushed the thread-local-arenas-2 branch from e799ace to 8445d50 Compare January 13, 2025 15:32
Contributor

@lysnikolaou lysnikolaou left a comment

Looks great @kddnewton! Only suggested a change to setup.py, so that it's a bit clearer.

@kddnewton kddnewton force-pushed the thread-local-arenas-2 branch 2 times, most recently from 350283e to e76b4d4 Compare January 13, 2025 16:22
@kddnewton kddnewton force-pushed the thread-local-arenas-2 branch 2 times, most recently from 51a476e to 20f2f4c Compare January 13, 2025 17:23
@kddnewton kddnewton force-pushed the thread-local-arenas-2 branch from 20f2f4c to f751960 Compare January 13, 2025 17:37
@kddnewton
Author

@lysnikolaou are those test failures related to my changes? Doesn't seem like it but since setup.py infects everything I'm not so sure.

@radarhere
Member

The test failures should be fixed by #8686
The docs failure has been fixed in main by #8691

@hugovk
Member

hugovk commented Jan 13, 2025

The test failures should be fixed by #8686

Just merged, please update this PR from main.

@kddnewton kddnewton force-pushed the thread-local-arenas-2 branch from f751960 to fa6a6b0 Compare January 13, 2025 19:24
Currently, all threads use the same arena for imaging. This can
result in a lot of contention when there are enough workers and
the mutex is constantly being checked.

This commit instead introduces lockless thread-local arenas for
environments that support it.
@kddnewton kddnewton force-pushed the thread-local-arenas-2 branch from fa6a6b0 to cfb2dcd Compare January 13, 2025 19:26
@kddnewton
Author

@hugovk done!

@wiredfool
Member

wiredfool commented Jan 13, 2025

The ImagingMemoryArena is an implicit default for the image -- it's not recorded anywhere that I see. What happens if an image is passed from thread to thread?

This is the image struct:

struct ImagingMemoryInstance {

And this is where the memory is released back into the pool:

ImagingDestroyArray(Imaging im) {

@kddnewton
Author

kddnewton commented Jan 13, 2025

@wiredfool do you have an example of passing it from thread to thread? I'm not sure if I know how that would happen.

@wiredfool
Member

wiredfool commented Jan 13, 2025

(sorry, managed to edit rather than comment)

I've done it in the past where I had an app where all of the processing was offloaded to worker threads, using queuing. Scanner -> initial processing -> thumbnailing -> uploading were all done off the main thread.

Anything where you're doing something with a UI main thread and processing elsewhere -- there are a bunch of operations that will create a new image. If you then have a reference on the main thread you won't be able to release it.

I'm also thinking that it's going to interfere with the lifetimes for arrow support, because that could potentially be freed from a thread that's not even part of our process.

Actually -- is memory in thread local storage actually available outside of the thread?

@kddnewton
Author

The honest answer is I'm not sure. I think we should test this out. Just so that I can properly replicate what you're saying, are you describing: create images in parent thread, child threads pick them off queue and process them, child threads exit, parent thread resumes?

As for TLS being visible outside of the thread, I think the answer is it depends on the implementation. Linux has actual instructions for TLS, whereas macOS implements it as a library from what I understand. I imagine this would impact the answer.

@aclark4life
Member

aclark4life commented Jan 13, 2025

I was going to raise "does this help #1888?" so I'm curious to know the answer too … thanks all!

@wiredfool
Member

I think something like:

  1. Open image in parent thread
  2. Call image.resize in a child thread (e.g. via threading.Thread)
  3. Return resized image to parent thread

would probably be enough to do it.

Actually, looking at gcc's tls:

lifetimes that match the thread lifetime, and destructors that cleanup the unique per-thread storage

That concerns me on a couple of fronts --

The image memory is probably accessible outside the thread, since it's a malloc, and it's just the original struct that's going to be in the TLS storage. However, if we have a oneshot thread that passes the image off, the arena will be deallocated before the image is freed.

So not only will the mutex and the arena likely be wrong in the child, there's going to be a pretty good memory leak there because we won't necessarily get a chance to clean up the malloc'd items.
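
The arena mismatch can be illustrated with a toy Python model, assuming the image does not record which arena its memory came from and release always goes to the calling thread's arena. All names here (`ToyArena`, `current_arena`, `make_image`, `destroy_image`, `run_scenario`) are hypothetical. Unlike real thread-local storage in C, the toy keeps each arena alive after its thread exits so the counters can be inspected.

```python
import threading


class ToyArena:
    """Toy per-thread block pool; counts blocks handed out and blocks returned."""

    def __init__(self, owner):
        self.owner = owner
        self.allocated = 0
        self.released = 0

    def alloc(self):
        self.allocated += 1
        return bytearray(16)

    def release(self, block):
        self.released += 1


_local = threading.local()
all_arenas = []  # kept only so the example can inspect arenas after threads exit


def current_arena():
    # Implicit default arena: whichever one belongs to the calling thread.
    if not hasattr(_local, "arena"):
        _local.arena = ToyArena(threading.current_thread().name)
        all_arenas.append(_local.arena)
    return _local.arena


def make_image():
    # The image does not record which arena its block came from.
    return current_arena().alloc()


def destroy_image(block):
    # Released into the *calling* thread's arena, which may not be the origin.
    current_arena().release(block)


def run_scenario():
    result = []
    child = threading.Thread(target=lambda: result.append(make_image()), name="child")
    child.start()
    child.join()
    destroy_image(result[0])  # the parent frees an image the child allocated
    return {a.owner: (a.allocated, a.released) for a in all_arenas}
```

In this model the child's arena ends with one block allocated and none released, while the parent's arena receives a block it never handed out.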

@kddnewton
Author

@wiredfool Okay that helps a lot. I'll put together some example code and see what I can see. Maybe we'll need to add some logic on moving between threads to ensure everything is working properly. In the meantime let's put a pause on this PR until I can answer your questions.

@wiredfool
Member

Ok, Some thoughts here --

  1. I don't think that image storage tied to the life of the thread is a good idea. It breaks how we think about Python objects. However, any tests that fail on this branch because of that but pass on main should be added to main, because this is clearly an undertested corner. I suspect some of the tests might only fail under valgrind.
  2. Alternatively, it might be possible to reduce the scope of the locks, so that we're only locking things that actually modify the arena struct. E.g., we don't need to lock around the (re)malloc, only the addition into the block list. Reads are probably ok, and comparisons to mostly static values like blocks_max and block_size probably don't need locks. Finer-grained locks might reduce contention.
  3. Or have a set of 8 or 16 or n memory arenas and choose which one of them to use with a hash of some thread id. We'd need to store a pointer to the arena in the image struct though, and follow that for destruction. If the thread goes away, we're not actually losing any arena. There could potentially be arenas that never get allocated from though, or that get allocated from and then never drained. The settings for the block cache are per arena, so there's the potential for n times the expected memory to be retained when images are freed.
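
The second option, narrowing the locked region, can be sketched as a toy block cache in Python. This is an illustration under assumptions rather than Pillow's code: `BlockCache`, `get_block` and `return_block` are hypothetical names, and `blocks_max` is assumed to be set once before any worker threads start.

```python
import threading

BLOCK_SIZE = 4096


class BlockCache:
    """Toy block cache that holds its lock only while touching the shared list."""

    def __init__(self, blocks_max):
        self.blocks_max = blocks_max  # assumed fixed before threads start
        self._blocks = []
        self._lock = threading.Lock()

    def get_block(self):
        with self._lock:  # short critical section: pop from the cached list
            block = self._blocks.pop() if self._blocks else None
        if block is None:
            block = bytearray(BLOCK_SIZE)  # the expensive allocation runs unlocked
        return block

    def return_block(self, block):
        with self._lock:  # short critical section: push onto the cached list
            if len(self._blocks) < self.blocks_max:
                self._blocks.append(block)
                return True
        return False  # cache is full: the caller simply drops the block
```

The lock is still shared by every thread, so this reduces the time spent holding it rather than removing contention entirely.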

@ngoldbaum

ngoldbaum commented Jan 15, 2025

The original version of this PR I saw used pthread and windows key/value TLS. While it would require a pthreads dependency on Unix-like platforms, I wonder if that would avoid the problems with the thread_local keyword. Although that said I don't know if the storage ultimately being tied to a thread lifetime would still be an issue.

Thanks for pointing out that these arenas can be shared between threads! An important point. The uses of thread_local in NumPy are for places where NumPy used to use static scratch spaces local to a compilation unit as a performance optimization.

@wiredfool
Member

Had some back of the mind thinking about this -- I think under TLS the mutex is essentially a no-op, because it's never going to have contention. How could it, when the only things that lock it are on the same thread? So while it's not (IMO) a valid approach to speeding this up due to object lifetime, I think it's a good indication of the speedup that could happen from eliminating the bottleneck.

If there were a static array of N memory arenas, and each thread took the next one in order, e.g. (m++ % N), there would only need to be a single int stored in TLS. There's a bit of subtlety about pool sizes vs number of pools, but it's going to be a bit of a second-order tuning thing. We'd definitely need to tag the image with its source pool and read that on destroy.

API-wise, I'm wondering if the Python TLS support would be a good place to look rather than native TLS, as it's in the stable ABI. https://docs.python.org/3/c-api/init.html#thread-local-storage-support I'm not sure if that's going to work on PyPy though.

  • For single threaded programs, this would wind up just using a single arena like now.
  • For one-shot threads, the allocations would rotate between pools.
  • For n workers, you'd likely want to configure n pool threads so there's no contention, otherwise you'd get threads/pools contention.
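
A minimal Python sketch of the round-robin idea above, assuming a fixed number of arenas, a per-thread index assigned on first use, and an image that records its source arena. `ToyArena`, `ToyImage`, `arena_index` and `NUM_ARENAS` are hypothetical names; the real change would live in Pillow's C code.

```python
import itertools
import threading

NUM_ARENAS = 4


class ToyArena:
    """Toy arena: a lock plus a list of cached blocks."""

    def __init__(self):
        self.lock = threading.Lock()
        self.free_blocks = []


ARENAS = [ToyArena() for _ in range(NUM_ARENAS)]  # static array of N arenas
_next_slot = itertools.count()  # plays the role of m++
_slot_lock = threading.Lock()
_local = threading.local()  # only a single int lives in thread-local storage


def arena_index():
    # Each thread takes the next arena in order the first time it allocates.
    if not hasattr(_local, "index"):
        with _slot_lock:
            _local.index = next(_next_slot) % NUM_ARENAS
    return _local.index


class ToyImage:
    def __init__(self):
        self.arena_index = arena_index()  # tag the image with its source pool
        arena = ARENAS[self.arena_index]
        with arena.lock:
            self.block = arena.free_blocks.pop() if arena.free_blocks else bytearray(64)

    def destroy(self):
        # Read the tag, so the block returns to its source arena
        # regardless of which thread calls destroy().
        arena = ARENAS[self.arena_index]
        with arena.lock:
            arena.free_blocks.append(self.block)
        self.block = None
```

Because the arenas are static and the image carries its own index, an image created on a short-lived thread can still be destroyed from any other thread after that thread has exited.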

@kddnewton
Author

@wiredfool I like that approach a lot. Sorry I got caught up with other work stuff. I'll take a look at this first thing next week (starting Tuesday).

Labels
Free-threading PEP 703 support

7 participants