Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: Coordinator crashes with OOM in 468 #24572

Closed
guyco33 opened this issue Dec 24, 2024 · 49 comments · Fixed by #24706
Closed

Regression: Coordinator crashes with OOM in 468 #24572

guyco33 opened this issue Dec 24, 2024 · 49 comments · Fixed by #24706
Labels
bug Something isn't working RELEASE-BLOCKER

Comments

@guyco33
Copy link
Member

guyco33 commented Dec 24, 2024

Coordinator in 468 consumes much more memory than 467 that leads to OS OOM.
What could be the reason for such memory consumption change ?

Listed below the output from free memory stats and gc logs during execution from both coordinator versions.

467
===
** after 15 minutes 

[2024-12-24T11:10:18.671+0000][7691][info][gc,start       ] GC(60) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T11:10:18.672+0000][7691][info][gc,task        ] GC(60) Using 4 workers of 4 for evacuation
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Pre Evacuate Collection Set: 0.87ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Merge Heap Roots: 0.44ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Evacuate Collection Set: 159.86ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Post Evacuate Collection Set: 1.28ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,phases      ] GC(60)   Other: 0.12ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Eden regions: 100->0(89)
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Survivor regions: 8->9(14)
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Old regions: 34->34
[2024-12-24T11:10:18.834+0000][7691][info][gc,heap        ] GC(60) Humongous regions: 0->0
[2024-12-24T11:10:18.834+0000][7691][info][gc,metaspace   ] GC(60) Metaspace: 291741K(295488K)->291741K(295488K) NonClass: 254345K(256448K)->254345K(256448K) Class: 37395K(39040K)->37395K(39040K)
[2024-12-24T11:10:18.834+0000][7691][info][gc             ] GC(60) Pause Young (Normal) (G1 Evacuation Pause) 4449M->1279M(16384M) 162.817ms
[2024-12-24T11:10:18.834+0000][7691][info][gc,cpu         ] GC(60) User=0.43s Sys=0.04s Real=0.17s
               total        used        free      shared  buff/cache   available
Mem:              30           7          19           0           4          23
Swap:              0           0           0
Tue Dec 24 11:10:26 UTC 2024


** after 1:15 hour

[2024-12-24T12:09:59.680+0000][7686][info][gc,marking     ] GC(173) Concurrent Mark From Roots 2367.862ms
[2024-12-24T12:09:59.680+0000][7686][info][gc,marking     ] GC(173) Concurrent Preclean
[2024-12-24T12:09:59.681+0000][7686][info][gc,marking     ] GC(173) Concurrent Preclean 1.161ms
[2024-12-24T12:09:59.681+0000][7691][info][gc,start       ] GC(173) Pause Remark
[2024-12-24T12:09:59.724+0000][7691][info][gc             ] GC(173) Pause Remark 5988M->5691M(16384M) 42.204ms
[2024-12-24T12:09:59.724+0000][7691][info][gc,cpu         ] GC(173) User=0.15s Sys=0.00s Real=0.04s
[2024-12-24T12:09:59.726+0000][7686][info][gc,marking     ] GC(173) Concurrent Mark 2413.972ms
[2024-12-24T12:09:59.726+0000][7686][info][gc,marking     ] GC(173) Concurrent Rebuild Remembered Sets and Scrub Regions
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Rebuild Remembered Sets and Scrub Regions 483.491ms
[2024-12-24T12:10:00.210+0000][7691][info][gc,start       ] GC(173) Pause Cleanup
[2024-12-24T12:10:00.210+0000][7691][info][gc             ] GC(173) Pause Cleanup 5725M->5725M(16384M) 0.054ms
[2024-12-24T12:10:00.210+0000][7691][info][gc,cpu         ] GC(173) User=0.00s Sys=0.00s Real=0.00s
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Clear Claimed Marks
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Clear Claimed Marks 0.427ms
[2024-12-24T12:10:00.210+0000][7686][info][gc,marking     ] GC(173) Concurrent Cleanup for Next Mark
[2024-12-24T12:10:00.228+0000][7686][info][gc,marking     ] GC(173) Concurrent Cleanup for Next Mark 17.547ms
[2024-12-24T12:10:00.228+0000][7686][info][gc             ] GC(173) Concurrent Mark Cycle 3285.446ms
               total        used        free      shared  buff/cache   available
Mem:              30          16          10           0           4          14
Swap:              0           0           0
Tue Dec 24 12:10:41 UTC 2024

===
468
===
** after 15 minutes 

[2024-12-24T11:10:04.019+0000][7679][info][gc,start       ] GC(62) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T11:10:04.019+0000][7679][info][gc,task        ] GC(62) Using 4 workers of 4 for evacuation
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Pre Evacuate Collection Set: 0.32ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Merge Heap Roots: 0.22ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Evacuate Collection Set: 139.19ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Post Evacuate Collection Set: 2.22ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,phases      ] GC(62)   Other: 0.08ms
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Eden regions: 118->0(104)
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Survivor regions: 8->10(16)
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Old regions: 33->33
[2024-12-24T11:10:04.161+0000][7679][info][gc,heap        ] GC(62) Humongous regions: 0->0
[2024-12-24T11:10:04.161+0000][7679][info][gc,metaspace   ] GC(62) Metaspace: 292209K(296000K)->292209K(296000K) NonClass: 254676K(256768K)->254676K(256768K) Class: 37533K(39232K)->37533K(39232K)
[2024-12-24T11:10:04.161+0000][7679][info][gc             ] GC(62) Pause Young (Normal) (G1 Evacuation Pause) 5009M->1277M(16384M) 142.330ms
[2024-12-24T11:10:04.162+0000][7679][info][gc,cpu         ] GC(62) User=0.44s Sys=0.04s Real=0.14s
               total        used        free      shared  buff/cache   available
Mem:              30          16           9           0           4          13
Swap:              0           0           0
Tue Dec 24 11:10:27 UTC 2024


** after 1:15 hour

[2024-12-24T12:10:12.713+0000][7679][info][gc,start       ] GC(179) Pause Young (Normal) (G1 Evacuation Pause)
[2024-12-24T12:10:12.714+0000][7679][info][gc,task        ] GC(179) Using 4 workers of 4 for evacuation
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Pre Evacuate Collection Set: 0.34ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Merge Heap Roots: 0.24ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Evacuate Collection Set: 77.50ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Post Evacuate Collection Set: 2.34ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,phases      ] GC(179)   Other: 0.09ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Eden regions: 303->0(302)
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Survivor regions: 4->5(39)
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Old regions: 113->113
[2024-12-24T12:10:12.795+0000][7679][info][gc,heap        ] GC(179) Humongous regions: 1->1
[2024-12-24T12:10:12.795+0000][7679][info][gc,metaspace   ] GC(179) Metaspace: 296387K(300352K)->296387K(300352K) NonClass: 258630K(260800K)->258630K(260800K) Class: 37756K(39552K)->37756K(39552K)
[2024-12-24T12:10:12.795+0000][7679][info][gc             ] GC(179) Pause Young (Normal) (G1 Evacuation Pause) 13361M->3699M(16384M) 82.112ms
[2024-12-24T12:10:12.795+0000][7679][info][gc,cpu         ] GC(179) User=0.26s Sys=0.01s Real=0.08s
               total        used        free      shared  buff/cache   available
Mem:              30          29           1           0           0           0
Swap:              0           0           0
Tue Dec 24 12:10:25 UTC 2024


*** syslog OOM output 
grep -i oom /var/log/syslog

2024-12-24T12:14:35.481240+00:00 ip-172-30-20-32 kernel: http-worker-691 invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
2024-12-24T12:14:35.496433+00:00 ip-172-30-20-32 kernel:  oom_kill_process+0x116/0x270
2024-12-24T12:14:35.496434+00:00 ip-172-30-20-32 kernel:  ? oom_evaluate_task+0x143/0x1e0
2024-12-24T12:14:35.496435+00:00 ip-172-30-20-32 kernel:  __alloc_pages_may_oom+0x10c/0x1d0
2024-12-24T12:14:35.511859+00:00 ip-172-30-20-32 kernel: [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
2024-12-24T12:14:35.511912+00:00 ip-172-30-20-32 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cloud-final.service,task=trino-server,pid=7634,uid=0
2024-12-24T12:14:35.511912+00:00 ip-172-30-20-32 kernel: Out of memory: Killed process 7634 (trino-server) total-vm:36015048kB, anon-rss:29973656kB, file-rss:2948kB, shmem-rss:0kB, UID:0 pgtables:61856kB oom_score_adj:0
2024-12-24T12:14:35.792898+00:00 ip-172-30-20-32 systemd[1]: cloud-final.service: A process of this unit has been killed by the OOM killer.
2024-12-24T12:14:35.803968+00:00 ip-172-30-20-32 systemd[1]: cloud-final.service: Failed with result 'oom-kill'.
2024-12-24T12:14:35.806272+00:00 ip-172-30-20-32 systemd[1]: system.slice: A process of this unit has been killed by the OOM killer.

Trino coordinator is running on EC2 machine type r6a.xlarge with OpenJDK Runtime Environment Temurin-23.0.1+11 (build 23.0.1+11)

jvm.conf is

-server
-Xmx16G
-Xms16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+ExplicitGCInvokesConcurrent
-XX:+ExitOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-Djava.io.tmpdir=/data/tmpdir
-XX:ReservedCodeCacheSize=512M
-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000
-Dfile.encoding=UTF-8
-Dcom.sun.jndi.ldap.object.disableEndpointIdentification=true
-Djdk.attach.allowAttachSelf=true
-Djdk.nio.maxCachedBufferSize=2000000
# Reduce starvation of threads by GClocker, recommend to set about the number of cpu cores (JDK-8192647)
-XX:+UnlockDiagnosticVMOptions
-XX:G1NumCollectionsKeepPinned=10000000
-XX:GCLockerRetryAllocationCount=32
-Xlog:gc*:/data/trino/var/log/gc.log:time,level,tags,tid:filecount=10,filesize=100M
@guyco33
Copy link
Member Author

guyco33 commented Dec 24, 2024

Executing same load on r6a.2xlarge with double memory size and it seems that OOM will happen soon.

Memory consumption after 1.5 hours:

===
467
===

$ date && free -g && (top -n 20| grep trino) && date && free -g

Tue Dec 24 15:00:31 UTC 2024
               total        used        free      shared  buff/cache   available
Mem:              61          24          32           0           4          36
Swap:              0           0           0

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   7766 root      20   0   44.6g  23.0g  30676 S  18.2  37.4  35:48.76 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  20.3  37.4  35:49.37 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  18.6  37.4  35:49.93 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  16.3  37.4  35:50.42 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  15.0  37.4  35:50.87 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  16.6  37.4  35:51.37 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  12.3  37.4  35:51.74 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  11.3  37.4  35:52.08 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  13.7  37.4  35:52.49 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  12.3  37.4  35:52.86 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S  11.0  37.4  35:53.19 trino-s+
   7766 root      20   0   44.6g  23.0g  30676 S   7.6  37.4  35:53.42 trino-s+
   7766 root      20   0   44.5g  23.0g  30676 S   7.3  37.4  35:53.64 trino-s+
   7766 root      20   0   44.5g  23.0g  30676 S   6.0  37.4  35:53.82 trino-s+
   7766 root      20   0   44.5g  23.0g  30676 S   6.6  37.4  35:54.02 trino-s+
   7766 root      20   0   44.5g  23.0g  30676 S   6.6  37.4  35:54.22 trino-s+
   7766 root      20   0   44.4g  23.0g  30676 S   8.0  37.4  35:54.46 trino-s+
   7766 root      20   0   44.4g  23.0g  30676 S   6.3  37.4  35:54.65 trino-s+
   7766 root      20   0   44.4g  23.0g  30676 S   7.0  37.4  35:54.86 trino-s+
   7766 root      20   0   44.6g  23.1g  30676 S 278.7  37.6  36:03.25 trino-s+
Tue Dec 24 15:01:29 UTC 2024
===
468
===

$ date && free -g && (top -n 20| grep trino) && date && free -g

Tue Dec 24 15:00:30 UTC 2024
               total        used        free      shared  buff/cache   available
Mem:              61          43          13           0           4          17
Swap:              0           0           0

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   7784 root      20   0   61.7g  42.1g  30820 S  10.0  68.6  39:12.02 trino-s+
   7784 root      20   0   61.7g  42.2g  30820 S   8.3  68.6  39:12.27 trino-s+
   7784 root      20   0   61.8g  42.2g  30820 S   7.0  68.7  39:12.48 trino-s+
   7784 root      20   0   61.8g  42.2g  30820 S  10.3  68.7  39:12.79 trino-s+
   7784 root      20   0   61.9g  42.2g  30820 S   7.3  68.7  39:13.01 trino-s+
   7784 root      20   0   61.9g  42.3g  30820 S   5.6  68.8  39:13.18 trino-s+
   7784 root      20   0   61.9g  42.3g  30820 S   5.7  68.8  39:13.35 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   5.0  68.9  39:13.50 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   3.0  68.9  39:13.59 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   2.0  68.9  39:13.65 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   2.3  68.8  39:13.72 trino-s+
   7784 root      20   0   61.8g  42.3g  30820 S   1.7  68.8  39:13.77 trino-s+
   7784 root      20   0   61.8g  42.2g  30820 S   2.0  68.7  39:13.83 trino-s+
   7784 root      20   0   61.8g  42.2g  30820 S   2.0  68.7  39:13.89 trino-s+
   7784 root      20   0   62.0g  42.4g  30820 S 181.4  69.0  39:19.35 trino-s+
   7784 root      20   0   62.9g  43.0g  30820 S 220.7  70.0  39:25.97 trino-s+
   7784 root      20   0   63.4g  43.5g  30820 S  41.2  70.8  39:27.21 trino-s+
   7784 root      20   0   63.8g  44.0g  30820 S  43.9  71.6  39:28.53 trino-s+
   7784 root      20   0   64.6g  44.7g  30820 S 155.0  72.7  39:33.18 trino-s+
   7784 root      20   0   64.9g  45.1g  30820 S  34.6  73.3  39:34.22 trino-s+
Tue Dec 24 15:01:27 UTC 2024
               total        used        free      shared  buff/cache   available
Mem:              61          46          10           0           4          14
Swap:              0           0           0

@guyco33 guyco33 added bug Something isn't working RELEASE-BLOCKER labels Dec 24, 2024
@guyco33
Copy link
Member Author

guyco33 commented Dec 24, 2024

OS OOM happens again

$ grep oom /var/log/syslog

2024-12-24T17:14:34.706917+00:00 ip-172-30-21-101 kernel: query-execution invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
2024-12-24T17:14:34.744709+00:00 ip-172-30-21-101 kernel:  oom_kill_process+0x116/0x270
2024-12-24T17:14:34.744711+00:00 ip-172-30-21-101 kernel:  ? oom_evaluate_task+0x143/0x1e0
2024-12-24T17:14:34.744712+00:00 ip-172-30-21-101 kernel:  __alloc_pages_may_oom+0x10c/0x1d0
2024-12-24T17:14:34.792485+00:00 ip-172-30-21-101 kernel: [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
2024-12-24T17:14:34.792510+00:00 ip-172-30-21-101 kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/cloud-final.service,task=trino-server,pid=7784,uid=0
2024-12-24T17:14:34.792511+00:00 ip-172-30-21-101 kernel: Out of memory: Killed process 7784 (trino-server) total-vm:74788372kB, anon-rss:63005676kB, file-rss:2412kB, shmem-rss:0kB, UID:0 pgtables:128700kB oom_score_adj:0
2024-12-24T17:14:37.446266+00:00 ip-172-30-21-101 kernel: oom_reaper: reaped process 7784 (trino-server), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
2024-12-24T17:14:37.759101+00:00 ip-172-30-21-101 systemd[1]: cloud-final.service: Failed with result 'oom-kill'.

@wendigo
Copy link
Contributor

wendigo commented Dec 24, 2024

@guyco33 can you do a heapdump when gc pressure goes up?

@guyco33
Copy link
Member Author

guyco33 commented Dec 29, 2024

I've run the same exact test queries for about 1 hour on both 467 and 468 and created a trace dump comparison with VisualVM.
Maybe the increase in objects count might explain the increase in overall memory although the total heap size is less than 2GB

Image Image Image

@losipiuk
Copy link
Member

It looks we need to be looking for something off-heap (java heap seems fine).

  • @wendigo did you recall updating any libs which use off-heap memory in 467..468 period? Anyway this should ratter impacty workers not coordinator.
  • Do python functions support has potential of using off-heap memeroy?
  • @guyco33 what connector/table format are you using?

@wendigo
Copy link
Contributor

wendigo commented Dec 30, 2024

@losipiuk not really

@losipiuk
Copy link
Member

As for python:
this code

Instance instance = Instance.builder(PYTHON_MODULE)
.withMachineFactory(PythonModule::create)
.withImportValues(importValues)
.build();
allocate = instance.export("allocate");
deallocate = instance.export("deallocate");
setup = instance.export("setup");
execute = instance.export("execute");
memory = instance.memory();

is potentially leaky. We are not closing Instance explicitly - just assuming GC will do the job - and IDK if it does not hold some native memory.
Also it looks like we are not always have allocate/deallocate pairs. E.g for argTypeAddress and returnTypeAddress in PythonEngine.doSetup. IDK if that is a problem or not. @electrum PTAL.
But this is probably not what @guyco33 is observing as this should not be called if Python functions are not in use.

@guyco33
Copy link
Member Author

guyco33 commented Dec 30, 2024

what connector/table format are you using?

I'm using postgres, mysql and hive tables with ORC and csv gzip file formats

@electrum
Copy link
Member

@losipiuk There is no close for Instance. It doesn't hold any native memory. The WASM memory is allocated on-heap as a single contiguous array. The argType and returnType pointers passed to setup() are saved and held onto for the lifetime of the WASM engine.

There is a potential leak of PythonEngine which creates a Jimfs instance that may get registered with the JDK file system machinery. We create a ScalarFunctionImplementation with an instanceFactory so that the engine can create one per-thread as needed. If this is a problem, we'll need to change the engine to support Closeable function instances, use java.lang.ref.Cleaner, etc. Also, we may need memory tracking for the function instances, as language functions will typically use more memory than other types of functions.

That said, that seems unrelated to this issue, as there is no evidence they are being used here.

@guyco33
Copy link
Member Author

guyco33 commented Jan 2, 2025

I ran the same tests after deleting ./plugin/functions-python/ directory, but OS OOM still happened on the coordinator, so it seems that the leaky code comes from something else.
What would be the best approach to tracking off-heap leaks ?

@wendigo
Copy link
Contributor

wendigo commented Jan 2, 2025

@guyco33 we don't allocate off-heap memory directly, the only known instances that I'm aware of is the usage of the Apache Arrow (BigQuery, Snowflake connectors) but this is not the case here.

@wendigo
Copy link
Contributor

wendigo commented Jan 2, 2025

@guyco33 how much memory is assigned to the JVM process? Can you lower Xmx to 70% of the quota?

@teddy-hackle
Copy link

We are also experiencing the same problem after upgrading to 468.

We are using s3, hive metastore and iceberg through the jdbc connector.

Due to this issue, the -XX:TrimNativeHeapInterval=5000 option was added among the JVM options, and in terms of metric, native heap memory is removed periodically and fluctuates, but it continues to increase and eventually OOM occurs.

Image

@wendigo
Copy link
Contributor

wendigo commented Jan 3, 2025

@teddy-hackle @guyco33 can you do a thread dump?

@guyco33
Copy link
Member Author

guyco33 commented Jan 3, 2025

how much memory is assigned to the JVM process? Can you lower Xmx to 70% of the quota?

@wendigo I'm using 16G for Xmx on r6a.xlarge instance with 32G

@wendigo
Copy link
Contributor

wendigo commented Jan 3, 2025

@guyco33 can you provide a thread dump?

@losipiuk
Copy link
Member

losipiuk commented Jan 3, 2025

@guyco33 we don't allocate off-heap memory directly

I used to observe behaviour (in different project) which suggested that AWS SDK is allocating stuff off-heap when doing PUTs for S3.
I did not manage to get to root of that though. It does not much what coordinator is doing though.

@wendigo
Copy link
Contributor

wendigo commented Jan 3, 2025

Can one of you run Trino with JVM flag -XX:NativeMemoryTracking=detail and then capture the statistics for native memory using jps to get Trino PID and then jcmd [PID] VM.native_memory baseline scale=MB ?

@guyco33
Copy link
Member Author

guyco33 commented Jan 3, 2025

@losipiuk OOM happens also when hive is not reachable from the coordinator and s3 operations are not involved - I have an assumption that it's something about the planner.
@wendigo Attached some output files from jcmd

statistics.txt
summary.diff.txt
detail.diff.txt

@wendigo
Copy link
Contributor

wendigo commented Jan 3, 2025

I'm worried about this Other section which contains ByteBuffers and such (Unsafe allocations):

-                     Other (reserved=5839MB +449MB, committed=5839MB +449MB)
                            (malloc=5839MB +449MB #3755 +278)

Trino itself is not using ByteBuffer's as the Slice is wrapping around heap-allocated byte[] arrays and that's the primary memory allocation/manipulation unit. I think that the allocations are coming from 3rd party code, i.e. Netty or AWS SDK.

Is it possible for you to attach a debugger to a running process and check Unsafe.allocateMemory calls?

@wendigo
Copy link
Contributor

wendigo commented Jan 3, 2025

@dain @electrum ideas how to approach that?

@wendigo
Copy link
Contributor

wendigo commented Jan 3, 2025

I think that these allocations are happening in Jetty:

Breakpoint reached
	at jdk.internal.misc.Unsafe.allocateMemory(Unsafe.java:625)
	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:115)
	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:363)
	at org.eclipse.jetty.util.BufferUtil.allocateDirect(BufferUtil.java:137)
	at org.eclipse.jetty.util.BufferUtil.allocate(BufferUtil.java:153)
	at org.eclipse.jetty.io.ArrayByteBufferPool.acquire(ArrayByteBufferPool.java:208)
	at org.eclipse.jetty.server.internal.HttpConnection$SendCallback.process(HttpConnection.java:779)
	at org.eclipse.jetty.util.IteratingCallback.processing(IteratingCallback.java:262)
	at org.eclipse.jetty.util.IteratingCallback.iterate(IteratingCallback.java:243)
	at org.eclipse.jetty.server.internal.HttpConnection$HttpStreamOverHTTP1.send(HttpConnection.java:1427)
	at org.eclipse.jetty.server.HttpStream$Wrapper.send(HttpStream.java:185)
	at org.eclipse.jetty.server.internal.HttpChannelState$ChannelResponse.write(HttpChannelState.java:1300)
	at org.eclipse.jetty.server.Response$Wrapper.write(Response.java:768)
	at org.eclipse.jetty.server.handler.EventsHandler$EventsResponse.write(EventsHandler.java:353)
	at org.eclipse.jetty.server.Response$Wrapper.write(Response.java:768)
	at org.eclipse.jetty.server.handler.ContextResponse.write(ContextResponse.java:56)
	at org.eclipse.jetty.ee10.servlet.ServletContextResponse.write(ServletContextResponse.java:288)
	at org.eclipse.jetty.ee10.servlet.HttpOutput.channelWrite(HttpOutput.java:215)
	at org.eclipse.jetty.ee10.servlet.HttpOutput.channelWrite(HttpOutput.java:200)
	at org.eclipse.jetty.ee10.servlet.HttpOutput.write(HttpOutput.java:848)
	at org.glassfish.jersey.servlet.internal.ResponseWriter$NonCloseableOutputStreamWrapper.write(ResponseWriter.java:298)
	at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:170)
	at org.glassfish.jersey.message.internal.CommittingOutputStream.flushBuffer(CommittingOutputStream.java:286)
	at org.glassfish.jersey.message.internal.CommittingOutputStream.commit(CommittingOutputStream.java:235)
	at org.glassfish.jersey.message.internal.CommittingOutputStream.close(CommittingOutputStream.java:250)
	at org.glassfish.jersey.message.internal.OutboundMessageContext.close(OutboundMessageContext.java:568)
	at org.glassfish.jersey.server.ContainerResponse.close(ContainerResponse.java:403)
	at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:763)
	at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:398)
	at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:388)
	at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:277)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
	at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
	at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
	at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:266)
	at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:253)
	at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:696)
	at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:397)
	at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:349)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:358)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:312)
	at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
	at org.eclipse.jetty.ee10.servlet.ServletHolder.handle(ServletHolder.java:736)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1619)
	at io.airlift.http.server.tracing.TracingServletFilter.doFilter(TracingServletFilter.java:113)
	at org.eclipse.jetty.ee10.servlet.FilterHolder.doFilter(FilterHolder.java:205)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1591)
	at org.eclipse.jetty.ee10.servlet.ServletHandler$MappedServlet.handle(ServletHandler.java:1552)
	at org.eclipse.jetty.ee10.servlet.ServletChannel.dispatch(ServletChannel.java:819)
	at org.eclipse.jetty.ee10.servlet.ServletChannel.handle(ServletChannel.java:436)
	at org.eclipse.jetty.ee10.servlet.ServletHandler.handle(ServletHandler.java:469)
	at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:611)
	at org.eclipse.jetty.server.handler.ContextHandler.handle(ContextHandler.java:1060)
	at org.eclipse.jetty.server.Handler$Wrapper.handle(Handler.java:740)
	at org.eclipse.jetty.server.handler.EventsHandler.handle(EventsHandler.java:81)
	at org.eclipse.jetty.server.Server.handle(Server.java:182)
	at org.eclipse.jetty.server.internal.HttpChannelState$HandlerInvoker.run(HttpChannelState.java:662)
	at org.eclipse.jetty.server.internal.HttpConnection.onFillable(HttpConnection.java:418)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:322)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:99)
	at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:480)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:443)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:293)
	at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:201)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:311)
	at org.eclipse.jetty.util.thread.MonitoredQueuedThreadPool$1.run(MonitoredQueuedThreadPool.java:73)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:979)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1209)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1164)
	at java.lang.Thread.runWith(Thread.java:1588)
	at java.lang.Thread.run(Thread.java:1575)

I've reported this issue to Jetty folks

@guyco33
Copy link
Member Author

guyco33 commented Jan 5, 2025

Attached output files of jcmd $(pgrep trino) VM.native_memory detail.diff from both 467 and 468 after running the same queries for about 1 hour.
detail.diff.467.txt
detail.diff.468.txt

The Other section in 468 seems to be the issue.

467

-                     Other (reserved=63MB +48MB, committed=63MB +48MB)
                            (malloc=63MB +48MB #1111 +328)

468

-                     Other (reserved=6586MB +5151MB, committed=6586MB +5151MB)
                            (malloc=6586MB +5151MB #3986 +2434)

@guyco33
Copy link
Member Author

guyco33 commented Jan 5, 2025

I think that the allocations are coming from 3rd party code, i.e. Netty or AWS SDK.

@wendigo
I tried to run 468 after downgrading io.netty to 4.1.115.Final #24506 and AWS SDK to 2.29.29 #24450 & 1.12.779 #24485 (same as in 467) but memory leak still occurs.

@wendigo
Copy link
Contributor

wendigo commented Jan 5, 2025

How did you downgrade them?

@guyco33
Copy link
Member Author

guyco33 commented Jan 5, 2025

Build 468 after updating pom.xml with the previous versions

@wendigo
Copy link
Contributor

wendigo commented Jan 5, 2025

Did the native memory (other) change with the downgrade?

@guyco33
Copy link
Member Author

guyco33 commented Jan 5, 2025

Did the native memory (other) change with the downgrade?

No. Same high value

@robinsinghstudios
Copy link

robinsinghstudios commented Jan 5, 2025

Not sure if it was the same reason, but I believe I faced the same issue on 467 as well. Might it be related to the shift to the new UBI?

@guyco33
Copy link
Member Author

guyco33 commented Jan 5, 2025

I have a production cluster running on 467 and the coordinator has been running for almost 3 weeks with no OS OOM issues.

@wendigo
Copy link
Contributor

wendigo commented Jan 5, 2025

@robinsinghstudios base image doesn't affect runtime

@robinsinghstudios
Copy link

Understood. Thanks @wendigo .

@guyco33
Copy link
Member Author

guyco33 commented Jan 5, 2025

I've tested two snapshot versions of 468 and found that the issue started in one of the changes done between #24305 (a7e72d4) and #22974 (4718011)

a7e72d4 (test) - no issue

-                     Other (reserved=95MB +91MB, committed=95MB +91MB)
                            (malloc=95MB +91MB #1542 +1451)

4718011 (HEAD -> test2) - OS OOM

-                     Other (reserved=4213MB +2180MB, committed=4213MB +2180MB)
                            (malloc=4213MB +2180MB #3592 +879)
* 4718011a39 (HEAD -> test2) Use BigQuery storage read API when reading external BigLake tables
* 9b24b74f80 Add assertConsistently
* 9499dc82f2 Update google-sheets api to v4-rev20241203-2.0.0
* b869fe2f43 Update exasol to 24.2.1
* ccfefa284d Update s3mock testcontainers to 3.12.0
* bcfbcc3248 Update nessie to 0.101.1
* 5553fc29c4 Update JLine to 3.28.0
* f2c1b83c67 Update AWS SDK v2 to 2.29.31
* c794993acd Update swagger to 2.2.27
* a017723a76 Update reactor-core to 3.7.1
* 6704bf467c Update metrics-core to 4.2.29
* 0c01c04242 Update airlift to 292
* ee5cba4407 Update airbase to 204
* 4fef0f0bee Update Hudi library to 1.0.0
* a7e72d424f (test) Add missing table features when adding timestamp_ntz in Delta

@wendigo
Copy link
Contributor

wendigo commented Jan 6, 2025

First 7 commits doesn't matter which leaves reactor, metrics, aws sdk, airlift

@wendigo
Copy link
Contributor

wendigo commented Jan 6, 2025

* 4718011a39 (HEAD -> test2) Use BigQuery storage read API when reading external BigLake tables <- doesn't matter - different connector
* 9b24b74f80 Add assertConsistently <- doesn't matter - test only
* 9499dc82f2 Update google-sheets api to v4-rev20241203-2.0.0 <- doesn't matter - different connector
* b869fe2f43 Update exasol to 24.2.1 <- doesn't matter - different connector
* ccfefa284d Update s3mock testcontainers to 3.12.0 <- doesn't matter - test only
* bcfbcc3248 Update nessie to 0.101.1 <- doesn't matter - different connector
* 5553fc29c4 Update JLine to 3.28.0 <- doesn't matter - CLI only
* f2c1b83c67 Update AWS SDK v2 to 2.29.31 <- candidate
* c794993acd Update swagger to 2.2.27 <- doesn't matter
* a017723a76 Update reactor-core to 3.7.1 <- candidate (but this applies to Azure SDK only)
* 6704bf467c Update metrics-core to 4.2.29 <- candidate
* 0c01c04242 Update airlift to 292 <- candidate
* ee5cba4407 Update airbase to 204 <- candidate
* 4fef0f0bee Update Hudi library to 1.0.0 <- doesn't matter - different connector
* a7e72d424f (test) Add missing table features when adding timestamp_ntz in Delta <- doesn't matter - different connector

@guyco33
Copy link
Member Author

guyco33 commented Jan 6, 2025

so it seems to be the update of airbase from version 203 to version 204

* ee5cba4407 (bad) Update airbase to 204 
* 4fef0f0bee (good) Update Hudi library to 1.0.0
* a7e72d424f Add missing table features when adding timestamp_ntz in Delta

airbase was updated 3 times in 467..678

$ git log --oneline --decorate --graph 467..468 | grep 'airbase'
* 547cc430e7 Update airbase to 206 and airlift to 293
* c1cc49fce2 Update airbase to 205
* ee5cba4407 Update airbase to 204

@wendigo
Copy link
Contributor

wendigo commented Jan 6, 2025

Airlift 204 is among other things Jetty 12.0.16: https://github.com/airlift/airbase/releases/tag/204

@wendigo
Copy link
Contributor

wendigo commented Jan 6, 2025

I've created https://github.com/airlift/airlift/pull/1349/files to bound/reuse memory/heap allocations in Jetty

@guyco33
Copy link
Member Author

guyco33 commented Jan 6, 2025

I've just approved that running 468 with airbase version 203 solves the memory leak issue

* 5f9b222b93 (HEAD -> test) rollback to airbase 203
* cb7a9cc1cf (tag: 468) [maven-release-plugin] prepare release 468

@wendigo
Copy link
Contributor

wendigo commented Jan 6, 2025

@guyco33 can you test with the airflit fix in place?

@guyco33
Copy link
Member Author

guyco33 commented Jan 6, 2025

@wendigo Doesn't it make sense that the fix would be in airbase ?

@wendigo
Copy link
Contributor

wendigo commented Jan 6, 2025

@guyco33 airbase is just a set of dependency declarations. Airbase defines versions for airlift and airlift is used by Trino. We need Jetty 12.0.16 as it solves other issue that we were facing so it's not really an option to downgrade.

@wendigo
Copy link
Contributor

wendigo commented Jan 6, 2025

We've filled jetty/jetty.project#12670 to track Jetty improvement

@guyco33
Copy link
Member Author

guyco33 commented Jan 6, 2025

@wendigo I can approve the your airlift fix solves the issue !
Thanks for your professional and exceptional workmanship on this fix.

@sajjoseph
Copy link
Contributor

@wendigo - I wonder whether the above issue is impacting HTTP 1.x alone. I have been running 468 code in a cluster for past 14 days with HTTP 2.0 on and no OOMs so far. I didn't send a ton of traffic through that cluster though.

@mosabua - How about adding a warning in 468 release page about this known issue?

@wendigo
Copy link
Contributor

wendigo commented Jan 7, 2025

@sajjoseph yep, it's affecting HTTP/1 only

@wendigo
Copy link
Contributor

wendigo commented Jan 7, 2025

I've run locally 0.5M queries in the last 2 hrs to see whether memory will leak and the cluster is stable so I think we will be good here.

Image

@mosabua
Copy link
Member

mosabua commented Jan 7, 2025

@sajjoseph ... we don't really add regressions to old release notes since it would be spreading many of them all over the place and would also require testing and knowing where any issues apply. We just add a release notes entry for the fix when it comes... this one will be in 469 from what I can see at the moment.

@wendigo
Copy link
Contributor

wendigo commented Jan 14, 2025

Image

it seems pretty stable now :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working RELEASE-BLOCKER
Development

Successfully merging a pull request may close this issue.

8 participants