feat(mango): rolling execution statistics (exploration) #4735
Conversation
Seems neat and tidy to me. I'm surprised we're losing this information today, or that 'completion' is ambiguous, but perhaps I don't understand your concerns in the ticket info.
force-pushed from 292639e to 29d9b85
Thanks for the comments @rnewson! If I understand correctly, the way how the … (see `couchdb/src/fabric/src/fabric_view_map.erl`, lines 178 to 204 at 86df356).

But … (see `couchdb/src/mango/src/mango_cursor_view.erl`, lines 526 to 540 at 86df356).

That is, in terms of … (see `couchdb/src/fabric/src/fabric_view_map.erl`, lines 106 to 122 at 86df356).

There … (see `couchdb/src/rexi/src/rexi_utils.erl`, lines 41 to 64 at 86df356).

Then the main loop in … (see `couchdb/src/fabric/src/fabric_view_map.erl`, lines 63 to 68 at 86df356).

That is why I was unsure whether the approach in this PR is the right one, or whether it only tackles the symptoms and does not fix the underlying root cause. But I am afraid that …

My other concern was certainly the bandwidth usage itself. With this approach, the number of messages sent per row doubles, which may have its own performance implications. But I cannot judge how much that means in practice, or whether there is an optimization somewhere down in the stack that helps with that and can make the associated costs amortized.
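To make the failure mode concrete, here is a minimal sketch of the pattern being discussed, with hypothetical module, record, and function names (this is not the actual `fabric_view_map` or `mango_cursor_view` code): the worker-side callback accumulates its statistics row by row and only ships them together with the final `complete` message, so if the coordinator returns `stop` and the worker is killed before that point, the accumulated stats never reach the coordinator.

```erlang
%% Illustrative sketch only; names are hypothetical, not the actual
%% mango_cursor_view / fabric_view_map code.
-module(stats_loss_sketch).
-export([worker_cb/2]).

-record(acc, {row_count = 0, docs_examined = 0}).

%% Worker-side view callback: statistics are accumulated per row but are
%% only sent back to the coordinator together with the `complete` message.
worker_cb({row, _Row} = Msg, #acc{row_count = N, docs_examined = D} = Acc) ->
    ok = send_to_coordinator(Msg),
    {ok, Acc#acc{row_count = N + 1, docs_examined = D + 1}};
worker_cb(complete, #acc{} = Acc) ->
    %% The only point where the collected stats leave the worker. If the
    %% coordinator already returned `stop` and killed this worker, this
    %% clause never runs and the stats are silently dropped.
    ok = send_to_coordinator({execution_stats, Acc}),
    ok = send_to_coordinator(complete),
    {ok, Acc}.

send_to_coordinator(_Msg) ->
    %% Stand-in for the rexi-based streaming back to the coordinator.
    ok.
```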
That's very helpful background. You are right that it wasn't anticipated we'd need information back from a worker we know we no longer need to calculate the response. It would be better to address that directly if we can.
To expand on that last comment, perhaps we alter how workers are killed? Today we do … This way we send no more messages than today, at the small cost of making the coordinator wait for one message from each worker it would normally have unilaterally and asynchronously killed.
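A rough sketch of what that alternative might look like, with hypothetical names and under the assumption that workers can be asked to flush their statistics before exiting: instead of killing the remaining workers unilaterally, the coordinator requests one final stats message from each of them and waits for those replies before cleaning up.

```erlang
%% Hypothetical sketch of the suggested alternative; not actual fabric code.
-module(graceful_stop_sketch).
-export([stop_workers/1]).

%% Instead of unilateral termination, ask every worker still running to
%% report its execution stats, then wait for one reply from each before
%% tearing the workers down.
stop_workers(Workers) ->
    [ask_for_stats(W) || W <- Workers],
    collect_stats(length(Workers), []).

collect_stats(0, Acc) ->
    {ok, Acc};
collect_stats(N, Acc) ->
    receive
        {execution_stats, Stats} ->
            collect_stats(N - 1, [Stats | Acc])
    after 5000 ->
        %% Fall back to the current behaviour if a worker never answers.
        {timeout, Acc}
    end.

ask_for_stats(_Worker) ->
    %% Stand-in for sending a "flush your stats and exit" message.
    ok.
```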
force-pushed from abd0850 to fbfe7b5
In case of map-reduce views, the arrival of the `complete` message is not guaranteed for the view callback (at the shard) when a `stop` is issued during the aggregation (at the coordinator). Due to that, internally collected shard-level statistics may not be fed back to the coordinator, which can cause data loss and hence inaccuracy in the overall execution statistics.

Address this issue by switching to a "rolling" model where row-level statistics are immediately streamed back to the coordinator. Support mixed-version cluster upgrades by activating this model only if requested through the map-reduce arguments and the given shard supports that.

Fixes apache#4560
force-pushed from fbfe7b5 to 2cb7139
changed the title: feat(mango): rolling execution statistics → feat(mango): rolling execution statistics (exploration)
Because I did not want to lose the original description of this PR along with the discussion here, and I wanted to put the change in a different perspective in the light of #4812, I forked it as #4958. After talking with @chewbranca about the problem and its solution, and given that he has been working on a model that would follow a similar approach, the current code seems feasible. I have studied @rnewson's suggestions, but neither of them brought a clear solution to the problem. I am inclined to believe (perhaps I am wrong here) that fixing the issue from the side of …
Closing this in favor of #4958.
Overview
In case of map-reduce views, the arrival of the `complete` message is not guaranteed for the view callback (at the shard) when a `stop` is issued during the aggregation (at the coordinator). Due to that, internally collected shard-level statistics may not be fed back to the coordinator, which can cause data loss and hence inaccuracy in the overall execution statistics.

Address this issue by switching to a "rolling" model where row-level statistics are immediately streamed back to the coordinator. Support mixed-version cluster upgrades by activating this model only if requested through the map-reduce arguments and the given shard supports that.
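As a rough illustration of the proposed rolling model (hypothetical names and shapes, not the actual patch): when rolling stats have been requested through the map-reduce arguments and the shard understands the feature, the worker sends a small stats delta alongside every row, and the coordinator folds those deltas into its running totals as they arrive, so an early `stop` no longer loses anything.

```erlang
%% Illustrative sketch of the "rolling" model; names are hypothetical.
-module(rolling_stats_sketch).
-export([worker_cb/2, coordinator_cb/2]).

%% Worker side: emit a per-row stats delta right after each row when the
%% coordinator asked for rolling stats via the map-reduce arguments.
worker_cb({row, _Row} = Msg, #{rolling := true} = Acc) ->
    ok = send_to_coordinator(Msg),
    ok = send_to_coordinator({execution_stats, #{docs_examined => 1}}),
    {ok, Acc};
worker_cb({row, _Row} = Msg, Acc) ->
    %% Old behaviour for mixed-version clusters: nothing extra per row.
    ok = send_to_coordinator(Msg),
    {ok, Acc};
worker_cb(complete, Acc) ->
    %% Nothing left to flush in rolling mode; everything was streamed.
    ok = send_to_coordinator(complete),
    {ok, Acc}.

%% Coordinator side: fold the deltas into the running totals as they
%% arrive, so statistics survive even if a `stop` comes before `complete`.
coordinator_cb({execution_stats, Delta}, #{docs_examined := D} = Total) ->
    {ok, Total#{docs_examined := D + maps:get(docs_examined, Delta, 0)}};
coordinator_cb(_Other, Total) ->
    {ok, Total}.

send_to_coordinator(_Msg) ->
    %% Stand-in for the rexi-based streaming back to the coordinator.
    ok.
```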
This is only a proposal, a way to explore the approach; comments and feedback are welcome. Remarks:
- … `execution_stats` messages. Is this acceptable?
- … `complete`)?

Testing recommendations
Running the respective Mango unit and integration test suites might suffice (which is done by the CI):
make eunit apps=mango
make mango-test MANGO_TEST_OPTS="15-execution-stats-test"
But there is a detailed description in the related ticket (see below) on how to trigger the problem. Feel free to kick the tires.
Related Issues or Pull Requests
Fixes #4560
Checklist