
[VL] Should convert kSpillReadBufferSize and kShuffleSpillDiskWriteBufferSize to number #8684

Open

boneanxs wants to merge 2 commits into main from fix_unexpected_character
Conversation

boneanxs (Contributor) commented Feb 7, 2025

What changes were proposed in this pull request?

Fix an issue introduced by #8045: the following error occurs if spark.unsafe.sorter.spill.reader.buffer.size is manually set to a value like 2m.

org.apache.gluten.exception.GlutenException: Non-whitespace character found after end of conversion: "m"
	at org.apache.gluten.vectorized.PlanEvaluatorJniWrapper.nativeCreateKernelWithIterator(Native Method)
	at org.apache.gluten.vectorized.NativePlanEvaluator.createKernelWithBatchIterator(NativePlanEvaluator.java:68)
	at org.apache.gluten.backendsapi.velox.VeloxIteratorApi.genFirstStageIterator(VeloxIteratorApi.scala:204)
	at org.apache.gluten.execution.GlutenWholeStageColumnarRDD.$anonfun$compute$1(GlutenWholeStageColumnarRDD.scala:88)
	at org.apache.gluten.utils.Arm$.withResource(Arm.scala:25)
	at org.apache.gluten.metrics.GlutenTimeMetric$.millis(GlutenTimeMetric.scala:37)
	at org.apache.gluten.execution.GlutenWholeStageColumnarRDD.compute(GlutenWholeStageColumnarRDD.scala:77)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:344)
	at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:106)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:344)

We should parse the value to a number of bytes before putting it into nativeConf.
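
For illustration, a minimal sketch of the intended conversion (the helper name toBytesString is hypothetical; Spark's JavaUtils.byteStringAsBytes is one way to do the parsing):

import org.apache.spark.network.util.JavaUtils

// Hypothetical helper: normalize a Spark byte-size string such as "2m"
// to a plain byte count before it is put into nativeConf.
def toBytesString(value: String): String =
  JavaUtils.byteStringAsBytes(value).toString // "2m" -> "2097152"

// e.g. nativeConf.put("spark.unsafe.sorter.spill.reader.buffer.size",
//   toBytesString(conf("spark.unsafe.sorter.spill.reader.buffer.size")))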

(Fixes: #ISSUE-ID)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

github-actions bot added the CORE (works for Gluten Core) label on Feb 7, 2025

github-actions bot commented Feb 7, 2025

Thanks for opening a pull request!

Could you open an issue for this pull request on GitHub Issues?

https://github.com/apache/incubator-gluten/issues

Then could you also rename the commit message and pull request title in the following format?

[GLUTEN-${ISSUES_ID}][COMPONENT]feat/fix: ${detailed message}

See also:


github-actions bot commented Feb 7, 2025

Run Gluten Clickhouse CI on x86

boneanxs (Contributor, Author) commented Feb 7, 2025

@jinchengchenghh @FelixYBW Hey, could you please help review this? Thanks!

Yohahaha (Contributor) commented Feb 7, 2025

object GlutenConfigUtil {
  private def getConfString(configProvider: ConfigProvider, key: String, value: String): String = {
    Option(ConfigEntry.findEntry(key))
      .map {
        _.readFrom(configProvider) match {
          case o: Option[_] => o.map(_.toString).getOrElse(value)
          case null => value
          case v => v.toString
        }
      }
      .getOrElse(value)
  }

  def parseConfig(conf: Map[String, String]): Map[String, String] = {
    val provider = new MapProvider(conf.filter(_._1.startsWith("spark.gluten.")))
    conf.map {
      case (k, v) =>
        if (k.startsWith("spark.gluten.")) {
          (k, getConfString(provider, k, v))
        } else {
          (k, v)
        }
    }.toMap
  }
}

spark.unsafe.sorter.spill.reader.buffer.size=2m should be converted by the code above; could you investigate why it doesn't work?

Yohahaha (Contributor) commented Feb 8, 2025

Oh, GlutenConfigUtil only processes configs whose key starts with 'spark.gluten.'.
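
For example (a hypothetical check against the parseConfig shown above):

// The key lacks the "spark.gluten." prefix, so parseConfig returns the
// value untouched, and the native side later fails to parse "2m".
GlutenConfigUtil.parseConfig(
  Map("spark.unsafe.sorter.spill.reader.buffer.size" -> "2m"))
// -> Map("spark.unsafe.sorter.spill.reader.buffer.size" -> "2m")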

boneanxs (Contributor, Author) commented Feb 8, 2025

Do we need to extend this function to handle all configs?

Yohahaha (Contributor) commented Feb 8, 2025

Yeah, we may need to add a new method GlutenConfigUtil#get(ConfigEntry) to process non-SQL configs.

boneanxs (Contributor, Author)

Hey @Yohahaha, we cannot use ConfigEntry since it is private to the Spark package:

https://github.com/apache/spark/blob/cea79dc1918b7f03870fe1cb189da9a152e3bbaf/core/src/main/scala/org/apache/spark/internal/config/package.scala#L1892-L1899

How about extracting a specific method that handles byte-size values only, to reduce duplication?

Yohahaha (Contributor)

Sounds good to me.
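
A minimal sketch of what the agreed extraction could look like (the helper name parseBytesConfig and the key set are assumptions, using Spark's JavaUtils.byteStringAsBytes for the conversion):

import org.apache.spark.network.util.JavaUtils

// Assumed set of byte-size configs that must reach the native side as
// plain numbers (kSpillReadBufferSize / kShuffleSpillDiskWriteBufferSize).
private val bytesConfKeys = Set(
  "spark.unsafe.sorter.spill.reader.buffer.size",
  "spark.shuffle.spill.diskWriteBufferSize")

def parseBytesConfig(conf: Map[String, String]): Map[String, String] =
  conf.map {
    case (k, v) if bytesConfKeys.contains(k) =>
      (k, JavaUtils.byteStringAsBytes(v).toString) // "2m" -> "2097152"
    case other => other
  }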

boneanxs force-pushed the fix_unexpected_character branch from 3cf701b to 590a0a7 on February 12, 2025 08:33

Run Gluten Clickhouse CI on x86

Labels: CORE (works for Gluten Core)
Projects: None yet
2 participants