Fix seastar::resource::allocate() error on EC2 m7gd.16xlarge instance #2624
Conversation
I was asked to review, so I did. I'm not familiar with this code, but from a high level it looks good to me.
Someone more familiar with this should also review.
lgtm in general, in addition to the inlined comments:
- could you please use a more specific prefix in the title of the commit message? like: "resource: "
- and use a more specific title, like "fall back to single io group if hwloc fails to work".
BTW, you could even split the commit into two: one for moving the non-hwloc code up, the other for using it when hwloc fails to tell the CPU topology.
src/core/resource.cc (Outdated)

```cpp
// cannot receive error from the API.
// Therefore, we have to detect cpu topology availability in our code.
static bool is_hwloc_available() {
    const std::string cpux_properties[] = {
```
nit, `std::string_view` would suffice.
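For illustration, a minimal sketch of the nit (the entries shown are assumptions for illustration; the real list lives in src/core/resource.cc):

```cpp
#include <string_view>

// Sketch only: these constant file names never need heap allocation,
// so std::string_view (or even const char*) suffices here.
static constexpr std::string_view cpux_properties[] = {
    "core_id",             // assumed example entry
    "physical_package_id", // assumed example entry
};
```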
and better off referencing the related hwloc function, so that posterity understands why we are using this logic to determine if hwloc is able to identify the CPU topology.
src/core/resource.cc (Outdated)

```cpp
}

    // cpu0 might be offline, try to check first online cpu.
    auto online = read_first_line_as<std::string>("/sys/devices/system/cpu/online");
```
I noticed we could potentially use `read_first_line_as<unsigned>("/sys/devices/system/cpu/online")` and handle the exception. If this approach wouldn't significantly simplify the implementation, feel free to keep the current solution.
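A minimal sketch of that suggestion, assuming `read_first_line_as` throws on a parse failure (the helper comes from resource.cc; the wrapper name `first_online_cpu` and the fallback parsing are made up for illustration):

```cpp
#include <string>

// Sketch only: how read_first_line_as<unsigned> reacts to a range
// string such as "0-95" (parse the prefix vs. throw) is an assumption.
static unsigned first_online_cpu() {
    try {
        return read_first_line_as<unsigned>("/sys/devices/system/cpu/online");
    } catch (...) {
        // Fall back to string parsing: take the number before '-' or ','.
        auto online = read_first_line_as<std::string>("/sys/devices/system/cpu/online");
        return std::stoul(online.substr(0, online.find_first_of("-,")));
    }
}
```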
I think this is too extreme. Here's hwloc-ls output on m7gd.16xlarge:
So it aborted linux discovery, but was still able to keep going. Maybe hwloc-ls is telling hwloc to fall back to alternative methods if needed, and we are not.
I think the problem is that hwloc doesn't report the memory as belonging to any NUMA nodes (on a normal machine the NUMA nodes have memory counts).

```cpp
// Divide local memory to cpus
for (auto&& cs : cpu_sets()) {
    auto cpu_id = hwloc_bitmap_first(cs);
    assert(cpu_id != -1);
    auto node = cpu_to_node.at(cpu_id);
    cpu this_cpu;
    this_cpu.cpu_id = cpu_id;
    size_t remain = mem_per_proc - alloc_from_node(this_cpu, node, topo_used_mem, mem_per_proc);
    remains.emplace_back(std::move(this_cpu), remain);
}
// Divide the rest of the memory
auto depth = hwloc_get_type_or_above_depth(topology, HWLOC_OBJ_NUMANODE);
for (auto&& [this_cpu, remain] : remains) {
    auto node = cpu_to_node.at(this_cpu.cpu_id);
    auto obj = node;
    while (remain) {
        remain -= alloc_from_node(this_cpu, obj, topo_used_mem, remain);
        do {
            obj = hwloc_get_next_obj_by_depth(topology, depth, obj);
        } while (!obj);
        if (obj == node)
            break;
    }
    assert(!remain);
    ret.cpus.push_back(std::move(this_cpu));
}
```

We need to add a third loop that allocates non-NUMA memory if the second loop fails.
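A minimal sketch of what such a third pass could look like, assuming `alloc_from_node()` can also draw from the machine-level root object (this is not the final patch, which ended up taking a different approach; see below):

```cpp
// Sketch only: if a cpu still has unsatisfied memory after walking the
// NUMA nodes, charge the remainder to the machine-wide (root) object
// instead of tripping the assert.
auto machine_obj = hwloc_get_root_obj(topology);
for (auto&& [this_cpu, remain] : remains) {
    // ...NUMA-node walk from the second loop above...
    if (remain) {
        remain -= alloc_from_node(this_cpu, machine_obj, topo_used_mem, remain);
    }
    assert(!remain);
    ret.cpus.push_back(std::move(this_cpu));
}
```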
Bad NUMA detection:

Good NUMA detection:

So we just need to detect the case where the detected NUMA memory (here 0) is less than the memory we want to allocate, and treat that case too.
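A sketch of that detection, using the same topology names as the snippet above; `wanted_memory` is a made-up placeholder:

```cpp
// Sketch only: total up what the NUMA nodes claim to own and compare
// it with what we intend to allocate.
hwloc_uint64_t numa_mem = 0;
for (hwloc_obj_t obj = hwloc_get_next_obj_by_type(topology, HWLOC_OBJ_NUMANODE, nullptr);
     obj != nullptr;
     obj = hwloc_get_next_obj_by_type(topology, HWLOC_OBJ_NUMANODE, obj)) {
    numa_mem += obj->attr->numanode.local_memory; // hwloc >= 2.0 field layout
}
if (numa_mem < wanted_memory) {
    // hwloc's NUMA discovery is unreliable here; fall back to
    // machine-wide memory accounting.
}
```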
Force-pushed from 7e81ac7 to 4f14ad6
@avikivity I completely rewrote the patch; now it does not use the non-hwloc functions, it fixes up the memory size on hwloc objects instead.
Looks good, I'll wait for @tchaikov to review too.
@tchaikov Could you review the patch again?
i am sorry for the latency.
left a few comments.
src/core/resource.cc (Outdated)

```cpp
@@ -305,13 +305,26 @@ size_t div_roundup(size_t num, size_t denom) {
    return (num + denom - 1) / denom;
}

static size_t alloc_from_node(cpu& this_cpu, hwloc_obj_t node, std::unordered_map<hwloc_obj_t, size_t>& used_mem, size_t alloc) {
static inline hwloc_uint64_t get_local_memory_from_node(hwloc_obj_t node) {
```
i am interested in why you added the `inline` specifier here? not because i am against this, just because i don't understand why it is necessary.
drop `inline` since it is not necessary.
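For background, a general C++ note (not from the patch) on why the specifier is redundant here:

```cpp
// A file-scope function marked `static` already has internal linkage,
// so `inline` changes nothing about the ODR; whether the call is
// actually inlined is up to the optimizer either way.
static hwloc_uint64_t get_local_memory_from_node(hwloc_obj_t node);
```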
src/core/resource.cc (Outdated)

```cpp
// This code does not assume that there are multiple nodes,
// but when this 'if' condition is met, hwloc fails to detect
// the hardware configuration and is expected to operate as
// a single node configuration, so it should work correctly.
```
probably could improve this comment a little bit (just for better readability, and assuming the single-node configuration does indeed apply):
```cpp
// If hwloc fails to detect the hardware topology, it falls back to treating
// the system as a single-node configuration. While this code supports
// multi-node setups, the fallback behavior is safe and will function
// correctly in this case.
```
Merged the new comment.
src/core/resource.cc (Outdated)

```cpp
@@ -305,13 +305,26 @@ size_t div_roundup(size_t num, size_t denom) {
    return (num + denom - 1) / denom;
}

static size_t alloc_from_node(cpu& this_cpu, hwloc_obj_t node, std::unordered_map<hwloc_obj_t, size_t>& used_mem, size_t alloc) {
static inline hwloc_uint64_t get_local_memory_from_node(hwloc_obj_t node) {
```
better off renaming this function to `get_memory_of_node()`, so that we can reuse it for retrieving the total size of memory of this machine, where the node is an object of `HWLOC_OBJ_MACHINE`. also, because we retrieve `total_memory` when `HWLOC_API_VERSION >= 0x00020000`.
Applied this idea, using common functions for both machine and node. The new function names are `get_memory_from_hwloc_obj()` and `set_memory_from_hwloc_obj()`.
```cpp
@@ -579,6 +592,10 @@ resources allocate(configuration& c) {
#else
    auto available_memory = machine->memory.total_memory;
#endif
    if (!available_memory) {
```
could reuse `get_local_memory_from_node()` for retrieving the size of memory owned by the machine.
Applied
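For context, a sketch of the fallback under discussion, using the sysconf-based size the commit message mentions (exact placement is an assumption):

```cpp
#include <unistd.h>

// Sketch: when hwloc reports zero memory for the machine, fall back to
// the same calculation the non-hwloc build of allocate() uses.
if (!available_memory) {
    available_memory = size_t(sysconf(_SC_PAGESIZE)) * size_t(sysconf(_SC_PHYS_PAGES));
}
```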
src/core/resource.cc
Outdated
// but when this 'if' condition is met, hwloc fails to detect | ||
// the hardware configuration and is expected to operate as | ||
// a single node configuration, so it should work correctly. | ||
set_local_memory_to_node(node, available_memory); |
i am confused. you're assigning the total size of memory of the whole machine to a NUMA node. on my machine, it is 32GiB. assuming my machine has two NUMA nodes and the OS is NUMA-aware, with your algorithm, we will be splitting 64GiB memory across all the processing units, am i right?
how do we assure that there is only a single NUMA node here? is this guaranteed by the algorithm or is it based on our observation? can we verify this in the code, and at least print out an error message if this assumption is violated?
Because when hwloc hits the `hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.` error, it does not proceed with the rest of the HW detection code, so it does not have NUMA node information, memory information, etc.
see: https://github.com/open-mpi/hwloc/blob/09f88977dd85b01b2e4915a48c95a90cea74754c/hwloc/topology-linux.c#L5987
I verified this on m8g.metal-48xl, which has 2 NUMA nodes.
On m8g.metal-48xl, the cpu topology issue does not occur, so I emulated the issue by hiding the files under the topology directory, like this:

```sh
for i in /sys/devices/system/cpu/cpu*/topology; do sudo mount -t ramfs none $i; done
```
Before mounting ramfs, the Seastar output looks like this:

```
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU0 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU1 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU2 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU3 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU4 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU5 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU6 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU7 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU8 to NUMA0
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU9 to NUMA0
...
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU100 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU101 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU102 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU103 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU104 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU105 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU106 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU107 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU108 to NUMA1
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU109 to NUMA1
...
DEBUG 2025-01-30 11:56:56,832 seastar - Assign CPU191 to NUMA1
```

So hwloc reported 2 NUMA nodes.
After mounting ramfs, hwloc reports a single NUMA node, like this:

```
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU0 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU1 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU2 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU3 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU4 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU5 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU6 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU7 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU8 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU9 to NUMA0
...
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU100 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU101 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU102 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU103 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU104 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU105 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU106 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU107 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU108 to NUMA0
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU109 to NUMA0
...
DEBUG 2025-01-30 11:57:49,039 seastar - Assign CPU191 to NUMA0
```

So we can say that when the issue occurs, there should be only a single NUMA node.
> can we verify this in the code, and at least print out an error message if this assumption is violated?

I added `assert(num_nodes == 1);` for safety.
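Putting the pieces from this thread together, the guarded fallback presumably looks something like this (a sketch combining names quoted above, not the exact patch):

```cpp
// Sketch: only spread machine-wide memory over the node when hwloc's
// broken discovery left us with exactly one NUMA node.
assert(num_nodes == 1);
set_memory_from_hwloc_obj(node, available_memory);
```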
thank you! it helps to convince us it's indeed safe.
Force-pushed from 4f14ad6 to 662bc23
…hwloc

On Fedora 41 AMI on some aarch64 instances such as m7gd.16xlarge, a Seastar program such as Scylla fails to start up with the following error message:

```
$ /opt/scylladb/bin/scylla --log-to-stdout 1
WARNING: debug mode. Not for benchmarking or production
hwloc/linux: failed to find sysfs cpu topology directory, aborting linux discovery.
scylla: seastar/src/core/resource.cc:683: resources seastar::resource::allocate(configuration &): Assertion `!remain' failed.
```

It seems like hwloc fails to initialize because /sys/devices/system/cpu/cpu0/topology/ is not available on the instance.

I debugged src/core/resource.cc to find out why the assert occurred, and found that alloc_from_node() is failing because node->total_memory is 0. It is likely because of the hwloc initialization failure described above. I also found that calculate_memory() goes wrong since machine->total_memory is also 0.

To avoid the error on such environments, we need to fix up the memory size on both machine->total_memory and node->total_memory. We can use sysconf(_SC_PAGESIZE) * sysconf(_SC_PHYS_PAGES) for this, just like we do in the non-hwloc version of allocate().

Fixes scylladb/scylladb#22382
Related scylladb/scylla-pkg#4797
Force-pushed from 662bc23 to 63d95e9
lgtm
@avikivity Can we now merge it? We need it for returning to ASG for aarch64.
@yaronkaikov Opened backport PRs for branch-2024.1 and branch-2024.2:
@yaronkaikov Seems like branch-6.1 and branch-6.2 are using the master branch of seastar, so we need to update the submodule instead of backporting the patch.
I think it's because we haven't had any updates since we branched. IIUC we should now create the branch and do the backport, right @avikivity @denesb?
Right.
@denesb @avikivity Can you create the new branch and backport this change? We need it in all releases so we can switch back to ASG for aarch64.
We need the same for 2025.1.
@yaronkaikov @syuu1228 this discussion doesn't really belong in the seastar repository, but since it was started here already... I created branch-2025.1, branch-6.2 and branch-6.1 in scylla-seastar. Please open backport PRs and I'll merge them.
I queued commits that change .gitmodules to point to scylla-seastar.git.
On Fedora 41 AMI on some aarch64 instances such as m7gd.16xlarge, a Seastar program such as Scylla fails to start up with the following error message:

It seems like hwloc fails to initialize because /sys/devices/system/cpu/cpu0/topology/ is not available on the instance.

I debugged src/core/resource.cc to find out why the assert occurred, and found that alloc_from_node() is failing because node->total_memory is 0. It is likely because of the hwloc initialization failure described above.

To avoid the error on such environments, we should stop using hwloc in resource.cc.

The hwloc initialization function does not return an error code even when an error message is printed, so we need to check whether the "topology" directory is available under /sys. Since resource.cc has code to build Seastar without libhwloc, we need to call that code path if the "topology" directory is not available.
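A minimal sketch of such a check, mirroring the `is_hwloc_available()` helper visible in the early revision above (the probed path comes from this description; everything else is an assumption):

```cpp
#include <filesystem>

// Sketch: hwloc prints an error but still reports success, so probe
// sysfs ourselves before trusting its topology. cpu0 might be offline,
// so a fuller version would check the first online cpu instead.
static bool is_hwloc_available() {
    return std::filesystem::exists("/sys/devices/system/cpu/cpu0/topology");
}
```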
Fixes scylladb/scylladb#22382
Related scylladb/scylla-pkg#4797