Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GLUTEN-8557][CH]Optimize nested function calls for And/Or/GetStructField/GetJsonObject #8558

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

KevinyhZou
Copy link
Contributor

@KevinyhZou KevinyhZou commented Jan 17, 2025

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

(Fixes: #8557)

How was this patch tested?

test by ut

@KevinyhZou KevinyhZou marked this pull request as draft January 17, 2025 09:11
@github-actions github-actions bot added CORE works for Gluten Core CLICKHOUSE labels Jan 17, 2025
Copy link

#8557

Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou KevinyhZou changed the title [GLUTEN-8557][CH]Optimize nested function calls [GLUTEN-8557][CH]Optimize nested function calls for And, GetStructField Jan 17, 2025
@KevinyhZou KevinyhZou changed the title [GLUTEN-8557][CH]Optimize nested function calls for And, GetStructField [GLUTEN-8557][CH]Optimize nested function calls for And/GetStructField Jan 17, 2025
Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou KevinyhZou changed the title [GLUTEN-8557][CH]Optimize nested function calls for And/GetStructField [GLUTEN-8557][CH]Optimize nested function calls for And/Or/GetStructField/GetJsonObject Jan 21, 2025
@KevinyhZou KevinyhZou closed this Jan 21, 2025
@KevinyhZou KevinyhZou reopened this Jan 21, 2025
@KevinyhZou KevinyhZou marked this pull request as ready for review January 21, 2025 11:06
Copy link

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Feb 4, 2025

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Feb 4, 2025

Run Gluten Clickhouse CI on x86

Copy link

github-actions bot commented Feb 5, 2025

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

Copy link

Run Gluten Clickhouse CI on x86

@KevinyhZou
Copy link
Contributor Author

性能测试

性能测试:测试数据量6千万行,分别执行三次,查看端到端执行时间

get_json_object: select count(1) from test_tbl1 where get_json_object(get_json_object(d, '$.a'), '$.b') = 'c';
优化前:53.194s, 52.18s, 51.603s
优化后:24.894s, 25.042s, 24.724s

and: select count(1) from test_tbl34 where id = 1 and d != 'axx' and d != 'cxx' and d != 'zxx';
优化前: 3.918s,4.106s,3.835s
优化后: 3.123s, 3.303s, 3.288s

or: select count(1) from test_tbl34 where id = 1 or d != 'axx' or d != 'cxx' or d != 'zxx';
优化前:2.611s,2.765s, 2.484s
优化后:2.3s, 2.409s, 2.283s;

get_struct_field: select count(1) from test_tbl31 where d.e.d.y = 'y123';
优化前:80.819s, 80.517s,83.3s
优化后:80.885s, 80.108s, 81.121s;

.internal()
.doc("Collapse nested functions as one for optimization.")
.stringConf
.createWithDefault("get_struct_field,get_json_object");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about and, or

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and, or has conflict with ReusedExchange optimize rule, it would make the rule not take effect at some suitation, about 3 ut failed in the ci: https://opencicd.kyligence.com/job/gluten/job/gluten-ci/14509/consoleText. so I disable them by default.

@@ -97,7 +97,10 @@ object ExpressionConverter extends SQLConfHelper with Logging {
if (udf.udfName.isEmpty) {
throw new GlutenNotSupportException("UDF name is not found!")
}
val substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get)
var substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collapsedFunctionsMap和udf有什么关系?看起来逻辑上没必要耦合在一块,最好能解耦

import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.types.{DataType, DataTypes}

case class CollapseNestedExpressions(spark: SparkSession) extends Rule[SparkPlan] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add necessary comment

}

private def canBeOptimized(plan: SparkPlan): Boolean = plan match {
case p: ProjectExecTransformer =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can expression in generate operator be optimized ?

@@ -462,8 +463,55 @@ class GetJsonObjectImpl

static size_t getNumberOfIndexArguments(const DB::ColumnsWithTypeAndName & arguments) { return arguments.size() - 1; }

bool insertResultToColumn(DB::IColumn & dest, typename JSONParser::Element & root, std::vector<std::shared_ptr<DB::GeneratorJSONPath<JSONParser>>> & generator_json_paths, size_t & json_path_pos) const
Copy link
Contributor

@taiyang-li taiyang-li Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgbo-ustc pls review changes related to get_json_object

@@ -59,9 +59,9 @@ class GetJSONObjectParser : public FunctionParser
DB::ActionsDAG & actions_dag) const override
{
const auto & args = substrait_func.arguments();
if (args.size() != 2)
if (args.size() < 2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgbo-ustc
Copy link
Contributor

lgbo-ustc commented Feb 12, 2025

解释下get_json_object的优化思路,不是太明白为何在这里要改成多个路径参数。
已经有 #8305 优化的情况下,这个改动带来哪些变化?

@lgbo-ustc
Copy link
Contributor

建议不同函数的优化拆分到不同的PR里面

mutable size_t total_normalized_rows = 0;

template<typename JSONParser, typename JSONStringSerializer>
void insertResultToColumn(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be complex, I guess there should be a simpler implement with less branches

Copy link
Contributor

@lgbo-ustc lgbo-ustc Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should explain which case it is for each branch

Comment on lines +801 to +802
const char * query_begin = reinterpret_cast<const char *>(sub_field.c_str());
const char * query_end = sub_field.c_str() + sub_field.size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not used the normalized one?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLICKHOUSE CORE works for Gluten Core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CH]Optimize nested function calls
3 participants