-
Notifications
You must be signed in to change notification settings - Fork 461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GLUTEN-8557][CH]Optimize nested function calls for And
/Or
/GetStructField
/GetJsonObject
#8558
base: main
Are you sure you want to change the base?
Conversation
Run Gluten Clickhouse CI on x86 |
And
, GetStructField
And
, GetStructField
And
/GetStructField
Run Gluten Clickhouse CI on x86 |
And
/GetStructField
And
/Or
/GetStructField
/GetJsonObject
Run Gluten Clickhouse CI on x86 |
b86d9e7
to
2cc64f2
Compare
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
Run Gluten Clickhouse CI on x86 |
… into opt_nested_funcs
Run Gluten Clickhouse CI on x86 |
性能测试
|
.internal() | ||
.doc("Collapse nested functions as one for optimization.") | ||
.stringConf | ||
.createWithDefault("get_struct_field,get_json_object"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about and
, or
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and
, or
has conflict with ReusedExchange
optimize rule, it would make the rule not take effect at some suitation, about 3 ut failed in the ci: https://opencicd.kyligence.com/job/gluten/job/gluten-ci/14509/consoleText. so I disable them by default.
@@ -97,7 +97,10 @@ object ExpressionConverter extends SQLConfHelper with Logging { | |||
if (udf.udfName.isEmpty) { | |||
throw new GlutenNotSupportException("UDF name is not found!") | |||
} | |||
val substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get) | |||
var substraitExprName = UDFMappings.scalaUDFMap.get(udf.udfName.get) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
collapsedFunctionsMap和udf有什么关系?看起来逻辑上没必要耦合在一块,最好能解耦
import org.apache.spark.sql.execution.SparkPlan | ||
import org.apache.spark.sql.types.{DataType, DataTypes} | ||
|
||
case class CollapseNestedExpressions(spark: SparkSession) extends Rule[SparkPlan] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add necessary comment
} | ||
|
||
private def canBeOptimized(plan: SparkPlan): Boolean = plan match { | ||
case p: ProjectExecTransformer => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can expression in generate operator be optimized ?
@@ -462,8 +463,55 @@ class GetJsonObjectImpl | |||
|
|||
static size_t getNumberOfIndexArguments(const DB::ColumnsWithTypeAndName & arguments) { return arguments.size() - 1; } | |||
|
|||
bool insertResultToColumn(DB::IColumn & dest, typename JSONParser::Element & root, std::vector<std::shared_ptr<DB::GeneratorJSONPath<JSONParser>>> & generator_json_paths, size_t & json_path_pos) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lgbo-ustc pls review changes related to get_json_object
@@ -59,9 +59,9 @@ class GetJSONObjectParser : public FunctionParser | |||
DB::ActionsDAG & actions_dag) const override | |||
{ | |||
const auto & args = substrait_func.arguments(); | |||
if (args.size() != 2) | |||
if (args.size() < 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
解释下get_json_object的优化思路,不是太明白为何在这里要改成多个路径参数。 |
建议不同函数的优化拆分到不同的PR里面 |
mutable size_t total_normalized_rows = 0; | ||
|
||
template<typename JSONParser, typename JSONStringSerializer> | ||
void insertResultToColumn( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to be complex, I guess there should be a simpler implement with less branches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should explain which case it is for each branch
const char * query_begin = reinterpret_cast<const char *>(sub_field.c_str()); | ||
const char * query_end = sub_field.c_str() + sub_field.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not used the normalized one?
What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
(Fixes: #8557)
How was this patch tested?
test by ut