-
-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve allocation caching #709
base: main
Are you sure you want to change the base?
Conversation
src/tensor/cache.rs
Outdated
// TODO: default max size | ||
max_size: RwLock::new(1_000_000), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@coreylowman Thoughts on a default max_size
? I think that 1gb is reasonable in most cases, but maybe we can do something like a percentage of available system memory or we could require the user to specify this value upon enabling the cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've decided to require users to specify the maximum size of the cache when they enable it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like it, great idea! Thoughts on using an enum to represent different options for this? I'm definitely going to forget what a plain usize would represent
dev.enable_cache(CacheSize::Unlimited)
dev.enable_cache(CacheSize::NumItems(1000))
dev.enable_cache(CacheSize::Bytes(10))
dev.enable_cache(CacheSize::MB(10))
dev.enable_cache(CacheSize::GB(10))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional num bytes? Option<usize>
So are these the main benefits of this PR?
|
Yes, and this should also be more memory safe. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the size limiting idea a lot, I don't even think that's configurable in pytorch so that'd be a great add!
Will need to comb over to determine if safety is still met. The new layers of abstraction here do make it a bit more difficult to reason about the safety at each part, which may be a downside of trying to abstract between cpu/cuda.
And something to think about: I think we may want to enable re-using allocations that are bigger than what is requested (e.g. I need 100MB, and the allocation cache gives me a 200MB buffer). These changes shouldn't make this harder or anything, but worth thinking about what changes would need to be made in both cases
I still think this is good for memory safety because it reduces the surface area of memory unsafety to CacheWrapper and CacheStorage, rather than having cache operations using memory unsafe operations directly.
I was actually thinking about this, and implementing this could be super easy due to the use of a BTreeMap, because BTreeMap's make it fast to get the next largest key after the one you requested. My main concern is that some operations may rely on the physical size of the buffer to get physical_numel or the number of threads to launch, and we would need to implement a |
let src_layout = Layout::new::<T>().pad_to_align(); | ||
let dst_layout = Layout::new::<T2>().pad_to_align(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if it's okay to use these with cuda - I don't think they follow the same layout/alignment rules as rust does.
It seems like they are just used to get number of bytes. Maybe just use std::mem::size_of::<>() * self.data.len()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that cuda formats memory in much the same way as the cpu, as we are able to dtoh
and htod
directly from one to another, any my understanding is that this is a byte-for-byte copy of the host array. I believe that we should keep this in case we wish to arrays with structs, as this is able to compute the padded size of a type as it would appear in an array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah byte to byte I think is fine, I'm just not sure about the sizes/alignment.
pad_to_align says the following, which makes it sound like it can modify the length?
Creates a layout by rounding the size of this layout up to a multiple of the layout’s alignment.
As far as alignment it sounds like cuda is always aligned to at least 256, which is definitely not the case for rust:
Any address of a variable residing in global memory or returned by one of the memory allocation routines from the driver or runtime API is always aligned to at least 256 bytes.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
|
||
/// Uses transmute_elements to convert to an element type with alignment `align` before dropping. | ||
/// This **must** be a memory safe way to drop self, given the correct alignment | ||
unsafe fn drop_with_alignment(self, align: usize) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on removing default impl of this? This implementation is the CPU one, and I'm not sure we should be doing all of this for cuda.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this is correct for cuda, and while it may not be necessary, it doesn't have a significant performance impact as transmute_elements is very fast.
src/tensor/cache.rs
Outdated
// Tracks the number of matching 'AllocationKey's in drop_queue to ignore. This is used to | ||
// "remove" the next instance of the matching AllocationKey in the drop_queue, without having | ||
// to run an O(n) operation to actually remove the key. | ||
ignore_drops: usize, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is all this just because we have the invariant of "if a key is in the tree, it must have > 0 values"? I wonder if it would be better to just remove that assumption, and just keep keys in the tree forever.
Also, have you noticed speedups when doing this? I'm wondering how much of an impact on speed the key removal actually has. My main concern is that this will be hard to understand/maintain whenever we revisit. I think the performance benefit would have to be pretty high for this to be worth it IMO.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"if a key is in the tree, it must have > 0 values"
Keys now remain in the map if they have either ignore_drops > 0
or allocations.len() > 0
. This is to satisfy this constraint, which states that
(instances of key in drop_queue) = allocations[key].ignore_drops + allocations[key].allocations.len()
for all keys.
Also, have you noticed speedups when doing this?
This exists mainly for correctness, as failing to remove keys from the drop_queue upon popping would cause preferential removal of frequently used keys over old keys. While this implementation works, it has a major flaw that I've overlooked until now. If shrink
is never called, drop_queue will grow every time insert
is called, without bound.
} | ||
std::mem::drop(cache); | ||
self.shrink(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think its worth only calling this if necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a small modification to shrink
that minimizes the number of RwLock operations done by shrink when no shrinking needs to occur.
So far, makes the TensorCache api cleaner using the CacheStorage trait to allow calling buffer conversion logic within the context of a TensorCache. Also adds CacheWrapper, which implements Drop and encapsulates all of the unsafe operations in the cache, aside from returning uninitialized memory.
Todo:
Set a reasonable default maximum cache size, and allow users to configure cache size.