fix(e2e): treat corner cases in ds_writer.go (aquasecurity#4554)
fix(ebpf): fix handling of compat tasks in syscall checkers

fix(ebpf): treat sched_process_exit corner cases

The sched_process_exit event may be triggered by a standard exit path,
such as the exit syscall, or by alternative kernel paths, making it
unsafe to assume that it is always associated with a syscall exit.

do_exit and do_group_exit, while typically invoked by the exit and
exit_group syscalls, can also be reached through internal kernel
mechanisms such as signal handling. A concrete example of this occurs
when a syscall returns, enters signal handling, and subsequently calls
do_exit after get_signal. Both get_signal and do_exit contain
tracepoints (signal_deliver and sched_process_exit, respectively).

A real execution flow illustrating this scenario in the kernel is as
follows:

entry_SYSCALL_64
  ├── do_syscall_64
  ├── syscall_exit_to_user_mode
  ├── __syscall_exit_to_user_mode_work
  ├── exit_to_user_mode_prepare
  ├── exit_to_user_mode_loop
  ├── arch_do_signal_or_restart
  ├── get_signal  (has signal_deliver tracepoint)
  ├── do_group_exit
  └── do_exit  (has sched_process_exit tracepoint)

feat(events): convert syscall arg to name at processing stage (aquasecurity#4563)

Since signatures now receive unparsed event arguments, converting syscall IDs to names became an issue: a signature has no way to know whether the event was generated on an x86_64 or ARM64 system.
To solve this, the syscall ID argument is converted to its name at the event processing stage, so the conversion happens regardless of argument parsing.
This is applied to the `suspicious_syscall_source` and `stack_pivot` events.
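A minimal sketch of the idea (table contents and helper names are illustrative, not Tracee's actual API): the conversion keys off the architecture the event was generated on, since the same ID maps to different syscalls per arch.

package main

import "fmt"

// Illustrative per-architecture syscall tables. Tracee resolves IDs
// against the host's real table at the event processing stage, before
// any signature sees the argument.
var syscallNames = map[string]map[int32]string{
    "amd64": {0: "read", 1: "write"},
    "arm64": {63: "read", 64: "write"},
}

// convertSyscallID resolves an ID using the arch the event came from.
func convertSyscallID(arch string, id int32) string {
    if name, ok := syscallNames[arch][id]; ok {
        return name
    }
    return fmt.Sprintf("unknown(%d)", id)
}

func main() {
    fmt.Println(convertSyscallID("amd64", 1)) // write
    fmt.Println(convertSyscallID("arm64", 1)) // unknown(1): same ID, different syscall table
}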

fix: get exit code and signal values

When a process exits normally via exit(n), the exit code is
stored in the upper byte (exit_code << 8). The lower byte is
used for signal information if the process was terminated by
a signal.

Also, align the type of exit_code with the one used in struct task_struct.
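A minimal sketch of the decoding in Go, assuming the standard kernel wait-status layout described above:

package main

import "fmt"

// decodeExitCode splits task_struct->exit_code into its components:
// the status passed to exit(n) lives in the upper byte, while the
// lower 7 bits carry the number of the terminating signal, if any.
func decodeExitCode(exitCode int32) (status, signal int32) {
    return (exitCode >> 8) & 0xff, exitCode & 0x7f
}

func main() {
    status, sig := decodeExitCode(42 << 8) // normal exit(42)
    fmt.Printf("status=%d signal=%d\n", status, sig) // status=42 signal=0

    status, sig = decodeExitCode(9) // terminated by SIGKILL
    fmt.Printf("status=%d signal=%d\n", status, sig) // status=0 signal=9
}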

fix(ebpf): revise thread stack identification logic
Identifying thread stacks by a VMA is unreliable because VMA splitting
and joining may cause the searched VMA to not align with the thread
stack VMA.
Instead, we can identify a thread stack by an address it contains,
making the identification straightforward.

fix(pipeline): fix stack-addresses not working (aquasecurity#4579)

fix: type parsing net_tcp_connect

The code attempted to assert "type" as a string, which caused issues
because the actual value is of a different type. This commit changes
the expected type to int32, aligning it with parsers.SOCK_STREAM.Value().
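A hedged sketch of the corrected parsing (syscall.SOCK_STREAM stands in for parsers.SOCK_STREAM.Value() to keep the example self-contained):

package main

import (
    "fmt"
    "syscall"
)

// getSockType asserts the "type" argument as int32 instead of string.
func getSockType(value interface{}) (int32, error) {
    sockType, ok := value.(int32)
    if !ok {
        return 0, fmt.Errorf("unexpected argument type: %T", value)
    }
    return sockType, nil
}

func main() {
    st, err := getSockType(int32(syscall.SOCK_STREAM))
    fmt.Println(st == int32(syscall.SOCK_STREAM), err) // true <nil>
}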

test: add test for event net_tcp_connect

perf(proc): use formatters for procfs file paths

Since the type of the converted primitive is already known, formatter
helpers should be used to construct procfs file paths instead of relying
on `fmt.Sprintf`. Using `fmt.Sprintf` is relatively costly due to its
internal formatting logic, which is unnecessary for simple path
construction.
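For illustration, a sketch of the two approaches (helper names hypothetical, not Tracee's actual formatters):

package main

import (
    "fmt"
    "strconv"
)

// Sprintf-based construction goes through fmt's generic formatting logic.
func statPathSprintf(pid int32) string {
    return fmt.Sprintf("/proc/%d/stat", pid)
}

// Concatenation with strconv produces the same string without
// format-string parsing, and with fewer allocations.
func statPathConcat(pid int32) string {
    return "/proc/" + strconv.FormatInt(int64(pid), 10) + "/stat"
}

func main() {
    fmt.Println(statPathSprintf(1) == statPathConcat(1)) // true
}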

perf(proc): introduce ReadFile for /proc

`os.ReadFile` is not efficient for reading files in `/proc` because it
attempts to determine the file size before reading. This step is
unnecessary for `/proc` files, as they are virtual files with sizes
that are often reported as unknown or `0`.

`proc.ReadFile` is a new function designed specifically for reading
files in `/proc`. It reads directly into a buffer and is more efficient
than `os.ReadFile` because it allows tuning the initial buffer size to
better suit the characteristics of `/proc` files.
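A minimal sketch of the approach (the real proc.ReadFile signature may differ; the benchmark below exercises the actual implementation):

package main

import (
    "fmt"
    "io"
    "os"
)

// readProcFile skips the size probe that os.ReadFile performs and reads
// straight into a buffer whose initial capacity is tuned for /proc files.
func readProcFile(path string, initialBufSize int) ([]byte, error) {
    f, err := os.Open(path)
    if err != nil {
        return nil, err
    }
    defer f.Close()

    buf := make([]byte, 0, initialBufSize)
    for {
        if len(buf) == cap(buf) {
            // Buffer is full: grow it (append reallocates, then reslice back).
            buf = append(buf, 0)[:len(buf)]
        }
        n, err := f.Read(buf[len(buf):cap(buf)])
        buf = buf[:len(buf)+n]
        if err != nil {
            if err == io.EOF {
                return buf, nil
            }
            return nil, err
        }
    }
}

func main() {
    data, err := readProcFile("/proc/self/stat", 512)
    if err != nil {
        panic(err)
    }
    fmt.Printf("read %d bytes\n", len(data))
}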

Running tool: /home/gg/.goenv/versions/1.22.4/bin/go test -benchmem
-run=^$ -tags ebpf -bench ^BenchmarkReadFile$
github.com/aquasecurity/tracee/pkg/utils/proc -benchtime=10000000x

goos: linux
goarch: amd64
pkg: github.com/aquasecurity/tracee/pkg/utils/proc
cpu: AMD Ryzen 9 7950X 16-Core Processor
BenchmarkReadFile/ProcFSReadFile/Empty_File-32        10000000  3525 ns/op  408 B/op  4 allocs/op
BenchmarkReadFile/OsReadFile/Empty_File-32            10000000  4070 ns/op  872 B/op  5 allocs/op
BenchmarkReadFile/ProcFSReadFile/Small_File-32        10000000  3961 ns/op  408 B/op  4 allocs/op
BenchmarkReadFile/OsReadFile/Small_File-32            10000000  4538 ns/op  872 B/op  5 allocs/op
BenchmarkReadFile/ProcFSReadFile/Exact_Buffer_Size-32 10000000  4229 ns/op  920 B/op  5 allocs/op
BenchmarkReadFile/OsReadFile/Exact_Buffer_Size-32     10000000  4523 ns/op  872 B/op  5 allocs/op
BenchmarkReadFile/ProcFSReadFile_/proc/self/stat-32   10000000  4043 ns/op  408 B/op  4 allocs/op
BenchmarkReadFile/OsReadFile_/proc/self/stat-32       10000000  4585 ns/op  872 B/op  5 allocs/op
PASS
ok  	github.com/aquasecurity/tracee/pkg/utils/proc	334.751s

perf(proc): improve stat file parsing

Remove the use of library functions to parse the stat file and instead
parse it manually (on the fly) to reduce the number of allocations and
improve performance.
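A hedged illustration of on-the-fly parsing (field set and types simplified; comm may contain spaces, so later fields are located relative to the last ')'):

package main

import (
    "bytes"
    "fmt"
)

// parseStat extracts the first stat fields without strings.Split or
// per-field strconv allocations.
func parseStat(data []byte) (pid int32, comm string, state byte, ppid int32, err error) {
    sp := bytes.IndexByte(data, ' ')      // end of the pid field
    end := bytes.LastIndexByte(data, ')') // end of the comm field
    if sp < 0 || end < 0 {
        return 0, "", 0, 0, fmt.Errorf("malformed stat")
    }
    for _, c := range data[:sp] {
        pid = pid*10 + int32(c-'0')
    }
    comm = string(data[sp+2 : end]) // skip " ("
    rest := data[end+2:]            // skip ") "
    state = rest[0]
    for _, c := range rest[2:] {
        if c == ' ' {
            break
        }
        ppid = ppid*10 + int32(c-'0')
    }
    return pid, comm, state, ppid, nil
}

func main() {
    pid, comm, state, ppid, _ := parseStat([]byte("1234 (my prog) S 1 0 0"))
    fmt.Printf("pid=%d comm=%q state=%c ppid=%d\n", pid, comm, state, ppid)
}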

chore(proc): align parsing of stat fields with the format sizes

This also aligns parsing sizes with the formats to avoid incorrect
parsing of the stat file. The internal fields are represented in line
with the actual kernel field types to avoid signed/unsigned confusion.

perf(proctree/proc): align fields to real size

Propagate values based on their real sizes, which in most cases are
smaller than int (64-bit). This change reduces the memory footprint, or
at least the pressure on the stack/heap.

perf(proc): improve status file parsing

Remove the use of library functions to parse the status file and instead
parse it manually (on the fly) to reduce the number of allocations and
improve performance.
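A sketch of the same idea for the status file, scanning for a single "Key:\tvalue" line instead of splitting the whole file:

package main

import (
    "bytes"
    "fmt"
)

// statusValue returns the value of one status field, scanning line by
// line and allocating only for the returned value.
func statusValue(data []byte, key string) string {
    prefix := []byte(key + ":\t")
    for len(data) > 0 {
        line := data
        if i := bytes.IndexByte(data, '\n'); i >= 0 {
            line, data = data[:i], data[i+1:]
        } else {
            data = nil
        }
        if bytes.HasPrefix(line, prefix) {
            return string(bytes.TrimSpace(line[len(prefix):]))
        }
    }
    return ""
}

func main() {
    content := []byte("Name:\tbash\nTgid:\t1234\nPid:\t1234\n")
    fmt.Println(statusValue(content, "Tgid")) // 1234
}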

perf(proc): improve ns

Reduce ProcNS memory footprint by using the right member type sizes: a
namespace id is a uint32, since it is the inode number in
struct ns_common.

This change also improves the performance of GetAllProcNS(), GetProcNS()
and GetMountNSFirstProcesses().

chore: introduce builders with specific fields

- NewProcStatFields()
- NewThreadStatFields()
- NewProcStatusFields()
- NewThreadStatusFields()

perf(proctree): remove stat call

Calling stat on /proc/<pid> only widens the window in which the
process can terminate between the stat call and the read of the file.

This also replaces fmt.Sprintf with string concatenation and
strconv.FormatInt for better performance.

perf(proctree)!: rearrange struct fields

Mind the padding and the alignment of the fields to avoid wasting
memory.

BREAKING CHANGE: invoked_from_kernel, the sched_process_exec event arg,
is now a bool.
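The effect of field ordering, in a self-contained sketch (sizes for a 64-bit platform; the structs are illustrative, not Tracee's):

package main

import (
    "fmt"
    "unsafe"
)

type badLayout struct {
    a bool  // 1 byte + 7 bytes padding
    b int64 // 8 bytes
    c bool  // 1 byte + 7 bytes trailing padding
}

type goodLayout struct {
    b int64 // 8 bytes
    a bool  // 1 byte
    c bool  // 1 byte + 6 bytes trailing padding
}

func main() {
    fmt.Println(unsafe.Sizeof(badLayout{}))  // 24
    fmt.Println(unsafe.Sizeof(goodLayout{})) // 16
}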

perf(proctree): centralize child and thread Maps

Previously, each Process maintained its own maps for children and
threads, leading to significant overhead in the process tree. This
commit moves those maps into the ProcessTree, which now centrally
manages the children and threads for every process.

Additionally, this change allows us to simplify the Process struct by
removing the dedicated mutex that was solely used for protecting the
individual maps.
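A hedged sketch of the new ownership model (names and key types hypothetical):

package main

import (
    "fmt"
    "sync"
)

// ProcessTree owns one central map per relation; a single lock replaces
// the per-Process mutexes that previously guarded individual maps.
type ProcessTree struct {
    mu       sync.RWMutex
    children map[uint32]map[uint32]struct{} // parent hash -> child hashes
    threads  map[uint32]map[uint32]struct{} // process hash -> thread hashes
}

func NewProcessTree() *ProcessTree {
    return &ProcessTree{
        children: make(map[uint32]map[uint32]struct{}),
        threads:  make(map[uint32]map[uint32]struct{}),
    }
}

func (pt *ProcessTree) AddChild(parent, child uint32) {
    pt.mu.Lock()
    defer pt.mu.Unlock()
    if pt.children[parent] == nil {
        pt.children[parent] = make(map[uint32]struct{})
    }
    pt.children[parent][child] = struct{}{}
}

func main() {
    pt := NewProcessTree()
    pt.AddChild(1, 42)
    fmt.Println(len(pt.children[1])) // 1
}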

chore(proctree): use atomic types
geyslan committed Feb 10, 2025
1 parent 47c94d8 commit c72de60
Showing 55 changed files with 2,642 additions and 780 deletions.
2 changes: 1 addition & 1 deletion docs/docs/events/builtin/extra/sched_process_exec.md
@@ -34,7 +34,7 @@ security, and auditing.
12. **interp** (`const char*`): Specifies the interpreter of the binary.
13. **stdin_type** (`umode_t`): Mode of the standard input.
14. **stdin_path** (`char*`): Path of the standard input.
-15. **invoked_from_kernel** (`int`): Flag to determine if the process was initiated by the kernel.
+15. **invoked_from_kernel** (`bool`): Flag to determine if the process was initiated by the kernel.
16. **env** (`const char**`): Environment variables associated with the process.

## Hooks
10 changes: 5 additions & 5 deletions pkg/changelog/changelog_test.go
@@ -8,7 +8,7 @@ import (

"github.com/stretchr/testify/assert"

"github.com/aquasecurity/tracee/pkg/utils"
"github.com/aquasecurity/tracee/pkg/utils/tests"
)

func getTimeFromSec(second int) time.Time {
@@ -189,16 +189,16 @@ func TestChangelog_StructType(t *testing.T) {
// Run it as DEBUG test to see the output.
func TestChangelog_PrintSizes(t *testing.T) {
changelog1 := NewChangelog[int](1)
-utils.PrintStructSizes(os.Stdout, changelog1)
+tests.PrintStructSizes(t, os.Stdout, changelog1)

entry1 := entry[int]{}
-utils.PrintStructSizes(os.Stdout, entry1)
+tests.PrintStructSizes(t, os.Stdout, entry1)

//

changelog2 := NewChangelog[string](1)
-utils.PrintStructSizes(os.Stdout, changelog2)
+tests.PrintStructSizes(t, os.Stdout, changelog2)

entry2 := entry[string]{}
-utils.PrintStructSizes(os.Stdout, entry2)
+tests.PrintStructSizes(t, os.Stdout, entry2)
}
217 changes: 217 additions & 0 deletions pkg/ebpf/c/common/memory.h
@@ -13,6 +13,15 @@ statfunc unsigned long get_arg_end_from_mm(struct mm_struct *);
statfunc unsigned long get_env_start_from_mm(struct mm_struct *);
statfunc unsigned long get_env_end_from_mm(struct mm_struct *);
statfunc unsigned long get_vma_flags(struct vm_area_struct *);
statfunc struct vm_area_struct *find_vma(void *ctx, struct task_struct *task, u64 addr);
statfunc bool vma_is_file_backed(struct vm_area_struct *vma);
statfunc bool vma_is_main_stack(struct vm_area_struct *vma);
statfunc bool vma_is_main_heap(struct vm_area_struct *vma);
statfunc bool vma_is_anon(struct vm_area_struct *vma);
statfunc bool vma_is_golang_heap(struct vm_area_struct *vma);
statfunc bool vma_is_vdso(struct vm_area_struct *vma);
statfunc enum vma_type get_vma_type(struct vm_area_struct *vma);
statfunc bool address_in_thread_stack(task_info_t *task_info, u64 address);

// FUNCTIONS

@@ -51,4 +60,212 @@ statfunc struct mount *real_mount(struct vfsmount *mnt)
return container_of(mnt, struct mount, mnt);
}

/**
* A busy process can have somewhere in the ballpark of 1000 VMAs.
* In an ideally balanced tree, this means that the max depth is ~10.
* A poorly balanced tree can have a leaf node that is up to twice as deep
* as another leaf node, which in the worst case scenario places its depth
* at 2*10 = 20.
* To be extra safe and accommodate VMA counts higher than 1000,
* we define the max traversal depth as 25.
*/
#define MAX_VMA_RB_TREE_DEPTH 25

static bool alerted_find_vma_unsupported = false;

// Given a task, find the first VMA which contains the given address.
statfunc struct vm_area_struct *find_vma(void *ctx, struct task_struct *task, u64 addr)
{
/**
* TODO: from kernel version 6.1, the data structure with which VMAs
* are managed changed from an RB tree to a maple tree.
* We currently don't support finding VMAs on such systems.
*/
struct mm_struct *mm = BPF_CORE_READ(task, mm);
if (!bpf_core_field_exists(mm->mm_rb)) {
if (!alerted_find_vma_unsupported) {
tracee_log(ctx, BPF_LOG_LVL_WARN, BPF_LOG_FIND_VMA_UNSUPPORTED, 0);
alerted_find_vma_unsupported = true;
}
return NULL;
}

struct vm_area_struct *vma = NULL;
struct rb_node *rb_node = BPF_CORE_READ(mm, mm_rb.rb_node);

#pragma unroll
for (int i = 0; i < MAX_VMA_RB_TREE_DEPTH; i++) {
barrier(); // without this, the compiler refuses to unroll the loop

if (rb_node == NULL)
break;

struct vm_area_struct *tmp = container_of(rb_node, struct vm_area_struct, vm_rb);
unsigned long vm_start = BPF_CORE_READ(tmp, vm_start);
unsigned long vm_end = BPF_CORE_READ(tmp, vm_end);

if (vm_end > addr) {
vma = tmp;
if (vm_start <= addr)
break;
rb_node = BPF_CORE_READ(rb_node, rb_left);
} else
rb_node = BPF_CORE_READ(rb_node, rb_right);
}

return vma;
}

statfunc bool vma_is_file_backed(struct vm_area_struct *vma)
{
return BPF_CORE_READ(vma, vm_file) != NULL;
}

statfunc bool vma_is_main_stack(struct vm_area_struct *vma)
{
struct mm_struct *vm_mm = BPF_CORE_READ(vma, vm_mm);
if (vm_mm == NULL)
return false;

u64 vm_start = BPF_CORE_READ(vma, vm_start);
u64 vm_end = BPF_CORE_READ(vma, vm_end);
u64 start_stack = BPF_CORE_READ(vm_mm, start_stack);

// logic taken from include/linux/mm.h (vma_is_initial_stack)
if (vm_start <= start_stack && start_stack <= vm_end)
return true;

return false;
}

statfunc bool vma_is_main_heap(struct vm_area_struct *vma)
{
struct mm_struct *vm_mm = BPF_CORE_READ(vma, vm_mm);
if (vm_mm == NULL)
return false;

u64 vm_start = BPF_CORE_READ(vma, vm_start);
u64 vm_end = BPF_CORE_READ(vma, vm_end);
u64 start_brk = BPF_CORE_READ(vm_mm, start_brk);
u64 brk = BPF_CORE_READ(vm_mm, brk);

// logic taken from include/linux/mm.h (vma_is_initial_heap)
if (vm_start < brk && start_brk < vm_end)
return true;

return false;
}

statfunc bool vma_is_anon(struct vm_area_struct *vma)
{
return !vma_is_file_backed(vma);
}

// The golang heap consists of arenas which are memory regions mapped using mmap.
// When allocating arenas, golang supplies mmap with an address hint, an
// address at which the kernel should place the mapping.
// Hints for x86_64 begin at 0xc000000000 and for ARM64 at 0x4000000000.
// From observation, when allocating arenas the MAP_FIXED flag is used which forces
// the kernel to use the specified address or fail the mapping, so it is safe to
// rely on the address pattern to determine if it belongs to a heap arena.
#define GOLANG_ARENA_HINT_MASK 0xffffffff00000000UL
#if defined(bpf_target_x86)
#define GOLANG_ARENA_HINT (0xc0UL << 32)
#elif defined(bpf_target_arm64)
#define GOLANG_ARENA_HINT (0x40UL << 32)
#else
#error Unsupported architecture
#endif
// We define a max hint that we assume golang allocations will never exceed.
// This translates to the address 0xff00000000.
// This means that we assume that a golang program will never allocate more than
// 256GB of memory on x86_64, or 768GB on ARM64.
#define GOLANG_ARENA_HINT_MAX (0xffUL << 32)

statfunc bool vma_is_golang_heap(struct vm_area_struct *vma)
{
u64 vm_start = BPF_CORE_READ(vma, vm_start);

// Check if the VMA address is in the range provided by golang heap arena address hints.
// Of course, any program can also allocate memory at these addresses which will result
// in a false positive for this check, so any caller of this function must make sure
// that a false positive for this check is acceptable.
return (vm_start & GOLANG_ARENA_HINT_MASK) >= GOLANG_ARENA_HINT &&
(vm_start & GOLANG_ARENA_HINT_MASK) <= GOLANG_ARENA_HINT_MAX;
}

statfunc bool vma_is_vdso(struct vm_area_struct *vma)
{
struct vm_special_mapping *special_mapping =
(struct vm_special_mapping *) BPF_CORE_READ(vma, vm_private_data);
if (special_mapping == NULL)
return false;

// read only 6 characters (7 with NULL terminator), enough to compare with "[vdso]"
char mapping_name[7];
bpf_probe_read_str(&mapping_name, 7, BPF_CORE_READ(special_mapping, name));
return strncmp("[vdso]", mapping_name, 7) == 0;
}

statfunc enum vma_type get_vma_type(struct vm_area_struct *vma)
{
// The check order is a balance between how expensive the check is and how likely it is to pass

if (vma_is_file_backed(vma))
return VMA_FILE_BACKED;

if (vma_is_main_stack(vma))
return VMA_MAIN_STACK;

if (vma_is_main_heap(vma))
return VMA_HEAP;

if (vma_is_anon(vma)) {
if (vma_is_golang_heap(vma))
return VMA_GOLANG_HEAP;

if (vma_is_vdso(vma))
return VMA_VDSO;

return VMA_ANON;
}

return VMA_UNKNOWN;
}

statfunc const char *get_vma_type_str(enum vma_type vma_type)
{
switch (vma_type) {
case VMA_FILE_BACKED:
return "file backed";
case VMA_ANON:
return "anonymous";
case VMA_MAIN_STACK:
return "main stack";
case VMA_THREAD_STACK:
return "thread stack";
case VMA_HEAP:
return "heap";
case VMA_GOLANG_HEAP:
// Goroutine stacks are allocated on the golang heap
return "golang heap/stack";
case VMA_VDSO:
return "vdso";
case VMA_UNKNOWN:
default:
return "unknown";
}
}

statfunc bool address_in_thread_stack(task_info_t *task_info, u64 address)
{
// Get the stack area for this task
address_range_t *stack = &task_info->stack;
if (stack->start == 0 && stack->end == 0)
// This thread's stack isn't tracked
return false;

return address >= stack->start && address <= stack->end;
}

#endif
4 changes: 2 additions & 2 deletions pkg/ebpf/c/common/task.h
@@ -28,7 +28,7 @@ statfunc u64 get_task_start_time(struct task_struct *task);
statfunc u32 get_task_host_pid(struct task_struct *task);
statfunc u32 get_task_host_tgid(struct task_struct *task);
statfunc struct task_struct *get_parent_task(struct task_struct *task);
-statfunc u32 get_task_exit_code(struct task_struct *task);
+statfunc int get_task_exit_code(struct task_struct *task);
statfunc int get_task_parent_flags(struct task_struct *task);
statfunc const struct cred *get_task_real_cred(struct task_struct *task);

@@ -194,7 +194,7 @@ statfunc struct task_struct *get_leader_task(struct task_struct *task)
return BPF_CORE_READ(task, group_leader);
}

-statfunc u32 get_task_exit_code(struct task_struct *task)
+statfunc int get_task_exit_code(struct task_struct *task)
{
return BPF_CORE_READ(task, exit_code);
}