When running Gluten + Celeborn (RSS shuffle), I noticed that one of the tasks consistently failed and the executor crashed as well:
# JRE version: OpenJDK Runtime Environment (ByteOpenJDK) (17.0.9+10) (build 17.0.9+10)
# Java VM: OpenJDK 64-Bit Server VM (ByteOpenJDK) (17.0.9+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0x28c550] AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<548964ul, G1BarrierSet>, (AccessInternal::BarrierType)2, 548964ul>::oop_access_barrier(void*)+0x0
--------------- T H R E A D ---------------
Current thread (0x00007fb2713f1a00): JavaThread "Executor task launch worker for task 125.1 in stage 4.0 (TID 330)" daemon [_thread_in_vm, id=76, stack(0x00007fb26cfff000,0x00007fb26d400000)]
Stack: [0x00007fb26cfff000,0x00007fb26d400000], sp=0x00007fb26d3fd038, free space=4088k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x28c550] AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<548964ul, G1BarrierSet>, (AccessInternal::BarrierType)2, 548964ul>::oop_access_barrier(void*)+0x0
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j io.glutenproject.vectorized.ShuffleWriterJniWrapper.split(JIJJ)V+0
j org.apache.spark.shuffle.VeloxCelebornHashBasedColumnarShuffleWriter.internalWrite(Lscala/collection/Iterator;)V+265
j org.apache.spark.shuffle.CelebornHashBasedColumnarShuffleWriter.write(Lscala/collection/Iterator;)V+2
j org.apache.spark.shuffle.ShuffleWriteProcessor.write(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/ShuffleDependency;JLorg/apache/spark/TaskContext;Lorg/apache/spark/Partition;)Lorg/apache/spark/scheduler/MapStatus;+63
j org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Lorg/apache/spark/scheduler/MapStatus;+189
j org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+2
j org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;Lscala/collection/immutable/Map;Lscala/Option;)Ljava/lang/Object;+226
j org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Lorg/apache/spark/executor/Executor$TaskRunner;Lscala/runtime/BooleanRef;)Ljava/lang/Object;+36
j org.apache.spark.executor.Executor$TaskRunner$$Lambda$935+0x00007fb2aa6342c8.apply()Ljava/lang/Object;+8
j org.apache.spark.util.Utils$.tryWithSafeFinally(Lscala/Function0;Lscala/Function0;)Ljava/lang/Object;+4
j org.apache.spark.executor.Executor$TaskRunner.run()V+428
j java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+92 java.base@17.0.9
j java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 java.base@17.0.9
j java.lang.Thread.run()V+11 java.base@17.0.9
v ~StubRoutines::call_stub
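One way to narrow down native crashes like the one above is the JVM's checked JNI mode (-Xcheck:jni). On Spark it can be enabled for executors through standard configuration; the config key and JVM flags below are standard, while the error-file path is just an illustrative choice:

```properties
# spark-defaults.conf (or pass via --conf to spark-submit)
spark.executor.extraJavaOptions=-Xcheck:jni -XX:ErrorFile=/tmp/hs_err_pid%p.log
```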
When I add -Xcheck:jni to the executor JVM arguments, the following error appears before the core dump:
FATAL ERROR in native method: Non-array passed to JNI array operations
at io.glutenproject.vectorized.ShuffleWriterJniWrapper.close(Native Method)
at org.apache.spark.shuffle.VeloxCelebornHashBasedColumnarShuffleWriter.closeShuffleWriter(VeloxCelebornHashBasedColumnarShuffleWriter.scala:209)
at org.apache.spark.shuffle.CelebornHashBasedColumnarShuffleWriter.stop(CelebornHashBasedColumnarShuffleWriter.scala:143)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:84)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:134)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:538)
at org.apache.spark.executor.Executor$TaskRunner$$Lambda$935/0x00007f568b6279d0.apply(Unknown Source)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1618)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:541)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.9/ThreadPoolExecutor.java:1136)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.9/ThreadPoolExecutor.java:635)
at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
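For context, -Xcheck:jni validates arguments on every JNI call, and "Non-array passed to JNI array operations" means a jobject that is not actually an array was handed to a JNI array function such as GetArrayLength or GetByteArrayElements, which suggests a stale or mistyped handle crossing the Gluten JNI boundary in close(). A minimal, purely illustrative C++ analogy of the kind of tag check the checked mode performs (not Gluten's or the JVM's actual code):

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical sketch: every reference handle carries a runtime kind, and
// array operations must verify the handle really refers to an array.
// Handing a plain object to an array operation is exactly what
// "Non-array passed to JNI array operations" reports.

enum class RefKind { Object, ByteArray };

struct Ref {               // stand-in for an opaque jobject handle
    RefKind kind;          // runtime type tag
    const uint8_t* data;   // array payload, if any
    size_t len;            // payload length, if any
};

// Mimics GetArrayLength under -Xcheck:jni: reject non-array handles.
// Returns -1 here instead of aborting, so the check is easy to observe;
// the real checked JNI mode raises a FATAL ERROR and kills the process.
long checkedArrayLength(const Ref& ref) {
    if (ref.kind != RefKind::ByteArray) {
        return -1;  // -Xcheck:jni would abort at this point
    }
    return static_cast<long>(ref.len);
}
```

The sketch returns a sentinel rather than aborting only to keep it demonstrable; the point is that the crash stems from the handle's type, not from the array contents.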
Spark version
Spark 3.2
Spark configurations
Not related.
System information
No response
Relevant logs
Backend
VL (Velox)