[VL] Gluten + rss shuffle(celeborn) encounter core dump in some situations #8685

Closed
zjuwangg opened this issue Feb 7, 2025 · 0 comments · Fixed by #8686
Labels
bug Something isn't working triage

Comments

Contributor

zjuwangg commented Feb 7, 2025

Backend

VL (Velox)

Bug description

When running Gluten with Celeborn (RSS shuffle), I noticed that one task failed consistently and its executor crashed as well. The hs_err log from the crash:

# JRE version: OpenJDK Runtime Environment (ByteOpenJDK) (17.0.9+10) (build 17.0.9+10)
# Java VM: OpenJDK 64-Bit Server VM (ByteOpenJDK) (17.0.9+10, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x28c550]  AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<548964ul, G1BarrierSet>, (AccessInternal::BarrierType)2, 548964ul>::oop_access_barrier(void*)+0x0
---------------  T H R E A D  ---------------

Current thread (0x00007fb2713f1a00):  JavaThread "Executor task launch worker for task 125.1 in stage 4.0 (TID 330)" daemon [_thread_in_vm, id=76, stack(0x00007fb26cfff000,0x00007fb26d400000)]

Stack: [0x00007fb26cfff000,0x00007fb26d400000],  sp=0x00007fb26d3fd038,  free space=4088k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x28c550]  AccessInternal::PostRuntimeDispatch<G1BarrierSet::AccessBarrier<548964ul, G1BarrierSet>, (AccessInternal::BarrierType)2, 548964ul>::oop_access_barrier(void*)+0x0

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  io.glutenproject.vectorized.ShuffleWriterJniWrapper.split(JIJJ)V+0
j  org.apache.spark.shuffle.VeloxCelebornHashBasedColumnarShuffleWriter.internalWrite(Lscala/collection/Iterator;)V+265
j  org.apache.spark.shuffle.CelebornHashBasedColumnarShuffleWriter.write(Lscala/collection/Iterator;)V+2
j  org.apache.spark.shuffle.ShuffleWriteProcessor.write(Lorg/apache/spark/rdd/RDD;Lorg/apache/spark/ShuffleDependency;JLorg/apache/spark/TaskContext;Lorg/apache/spark/Partition;)Lorg/apache/spark/scheduler/MapStatus;+63
j  org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Lorg/apache/spark/scheduler/MapStatus;+189
j  org.apache.spark.scheduler.ShuffleMapTask.runTask(Lorg/apache/spark/TaskContext;)Ljava/lang/Object;+2
j  org.apache.spark.scheduler.Task.run(JILorg/apache/spark/metrics/MetricsSystem;Lscala/collection/immutable/Map;Lscala/Option;)Ljava/lang/Object;+226
j  org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Lorg/apache/spark/executor/Executor$TaskRunner;Lscala/runtime/BooleanRef;)Ljava/lang/Object;+36
j  org.apache.spark.executor.Executor$TaskRunner$$Lambda$935+0x00007fb2aa6342c8.apply()Ljava/lang/Object;+8
j  org.apache.spark.util.Utils$.tryWithSafeFinally(Lscala/Function0;Lscala/Function0;)Ljava/lang/Object;+4
j  org.apache.spark.executor.Executor$TaskRunner.run()V+428
j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+92 java.base@17.0.9
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5 java.base@17.0.9
j  java.lang.Thread.run()V+11 java.base@17.0.9
v  ~StubRoutines::call_stub

When I add -Xcheck:jni to the executor JVM arguments, the following error appears before the core dump:

FATAL ERROR in native method: Non-array passed to JNI array operations
	at io.glutenproject.vectorized.ShuffleWriterJniWrapper.close(Native Method)
	at org.apache.spark.shuffle.VeloxCelebornHashBasedColumnarShuffleWriter.closeShuffleWriter(VeloxCelebornHashBasedColumnarShuffleWriter.scala:209)
	at org.apache.spark.shuffle.CelebornHashBasedColumnarShuffleWriter.stop(CelebornHashBasedColumnarShuffleWriter.scala:143)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:84)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:134)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:538)
	at org.apache.spark.executor.Executor$TaskRunner$$Lambda$935/0x00007f568b6279d0.apply(Unknown Source)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1618)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:541)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.9/ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.9/ThreadPoolExecutor.java:635)
	at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)
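
For anyone trying to reproduce this, one way to pass -Xcheck:jni to the executors is through Spark's spark.executor.extraJavaOptions setting. The sketch below only builds the spark-submit arguments. The helper name with_check_jni is hypothetical and used purely for illustration, and it assumes any existing executor JVM options are given as a single space-separated string.

```python
from typing import List

CHECK_JNI_FLAG = "-Xcheck:jni"


def with_check_jni(existing_opts: str = "") -> List[str]:
    """Build spark-submit --conf arguments that enable JNI checking on executors.

    existing_opts is an assumed space-separated string of executor JVM
    options that are already in use; the flag is appended only if missing.
    """
    opts = existing_opts.split()
    if CHECK_JNI_FLAG not in opts:
        opts.append(CHECK_JNI_FLAG)
    return ["--conf", "spark.executor.extraJavaOptions=" + " ".join(opts)]


if __name__ == "__main__":
    # Print the arguments to append to an existing spark-submit command line.
    print(" ".join(with_check_jni("-XX:+UseG1GC")))
```

The check adds noticeable overhead, so it is probably best enabled only while debugging the failing stage.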

Spark version

Spark 3.2

Spark configurations

Not related.

System information

No response

Relevant logs

@zjuwangg zjuwangg added bug Something isn't working triage labels Feb 7, 2025
@zjuwangg zjuwangg changed the title rss shuffle causing executor core dump gluten + rss shuffle encounter core dump in some situations Feb 7, 2025
@zjuwangg zjuwangg changed the title gluten + rss shuffle encounter core dump in some situations Gluten + rss shuffle encounter core dump in some situations Feb 7, 2025
@zjuwangg zjuwangg changed the title Gluten + rss shuffle encounter core dump in some situations Gluten + rss shuffle(celeborn) encounter core dump in some situations Feb 7, 2025
@Yohahaha Yohahaha changed the title Gluten + rss shuffle(celeborn) encounter core dump in some situations [VL] Gluten + rss shuffle(celeborn) encounter core dump in some situations Feb 7, 2025