feat(mango): rolling execution statistics #4958
Conversation
2cb7139 to 17c243f
de20e1c to d2c439a
+1 lgtm but see a few style suggestions here and there (as inline comments)
78c2d23 to c4ba325
Sorry, I did mean to send feedback when I saw this ages ago. I'm a little concerned at the increase in messages here. Specifically, would it not be possible to alter how we implement the end of the workers, so they do get an opportunity to send their final stats while exiting (as their exit reason, say, which I believe we can capture)?
@rnewson that is a fair concern. Even though we have started to explore ways to make sure that the final statistics are delivered, based on the discussion with @chewbranca and others the main idea is that these rolling execution statistics would anyway be superseded by the CSRT framework at some later point. But it is not clear when that will be finished and merged, and until that happens I would still like to have a simplified version added to Mango.
Would some batching work, to accumulate stats for some time (100 msec?) and only send them intermittently? It might miss some stats at the end, but for a long-running request it should capture the stats as it goes along without emitting too many messages?
@pgj the idea is that it would be the difference between not sending any data back if …
I believe this does not work for short-lived requests. Actually, the real cause of this problem is a race condition: workers often die before they could report their final statistics.

But why is the increase in the message count problematic? @chewbranca's upcoming work takes much the same approach. Or did I miss something, perhaps I misunderstood that? Isn't there an optimization somewhere down the stack that enables coalescing of messages? The messages would not increase the required bandwidth much due to their small size...?
I was mainly suggesting a way to have a bound on the uncertainty in the stats, if the increase in messages is a problem. It could be a time bound (milliseconds) or a docs-examined count bound (if N > 100, emit ...). It's a half-way hack; it's better than getting no stats at all, but maybe less expensive than emitting stats for every row. But I don't know if it's a performance issue, I think you and @chewbranca did some measurements?
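To illustrate the kind of bound I mean, here is a rough, purely hypothetical sketch (none of these module or function names exist in CouchDB; `Send` would be something like `rexi:stream2/1` on the worker side). It only shows the "flush after N docs or after T milliseconds, whichever comes first" idea:

```erlang
%% Hypothetical sketch only: accumulate per-row stats in the worker and flush
%% them either after Limit examined docs or after IntervalMs milliseconds.
-module(stats_batch_sketch).
-export([new/2, bump/2, maybe_flush/2]).

-record(batch, {count = 0, limit, last_flush, interval_ms}).

new(Limit, IntervalMs) ->
    #batch{limit = Limit,
           last_flush = erlang:monotonic_time(millisecond),
           interval_ms = IntervalMs}.

%% Called once per examined doc (or per N docs) in the worker loop.
bump(N, #batch{count = C} = B) ->
    B#batch{count = C + N}.

%% Send a stats message only when one of the two bounds is hit.
maybe_flush(Send, #batch{count = C, limit = L,
                         last_flush = T0, interval_ms = I} = B) ->
    Now = erlang:monotonic_time(millisecond),
    case C >= L orelse Now - T0 >= I of
        true ->
            Send({execution_stats, C}),
            B#batch{count = 0, last_flush = Now};
        false ->
            B
    end.
```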
From what I remember, in mango the issue was happening when we have a …
I don't know if rexi has message batching. There is buffering if the send channel is blocked, and there is an un-acked message limit. But you're also right that the lower-level dist and TCP channels will probably do the batching, depending on how the environment is set up.
If @chewbranca's work would also significantly increase the message count in a way that adversely affected performance (or the latency of other requests trying to make progress in between such messages), I would also want to see that improved before it was merged (and, if it were unavoidably severe, not merged). I'll focus on reviewing the work in hand here and not speculate further about other changes until they are ready for review.

I like Nick's middle ground of having the workers send info periodically; we would benefit from seeing those stats rising during the processing of a long request. Though I think @pgj's point is also relevant: very short requests won't appear at all, and to my mind that lines up with my original point that we should improve how the end of the request is handled. Ideally we would wait for all workers to return their final message (perhaps as their exit reason, as already noted), but without forcing the user to wait for more responses than are needed to form a complete response (i.e., the first response from each shard range).

Put simply, it was not anticipated that we would care about the other workers when this was built; the code fires off the jobs and wants to return as soon as it can, killing the remaining processes. A good fix here should tackle that directly.
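For what it's worth, a minimal sketch of the exit-reason idea using plain Erlang monitors (this is not fabric/rexi's actual code; all names are illustrative):

```erlang
%% Sketch of "final stats in the exit reason": the worker exits with its
%% accumulated stats, and a monitoring coordinator picks them up from the
%% 'DOWN' message even after it has stopped consuming rows.
-module(exit_stats_sketch).
-export([worker_exit/1, collect_final_stats/2]).

%% Worker side: carry the stats in the exit reason.
worker_exit(Stats) ->
    exit({shutdown, {execution_stats, Stats}}).

%% Coordinator side: harvest stats from the monitor message.
collect_final_stats(MonRef, Acc) ->
    receive
        {'DOWN', MonRef, process, _Pid, {shutdown, {execution_stats, S}}} ->
            [S | Acc];
        {'DOWN', MonRef, process, _Pid, _OtherReason} ->
            Acc
    after 100 ->
        Acc   % do not block the response on stragglers
    end.
```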
@nickva I am afraid we cannot allow any uncertainty in the statistics if downstream consumers want to build use cases on top of this mechanism that involve billing.
@chewbranca might have some performance results on his side, but I have not collected any myself. Probably I should do some, just out of curiosity, and include them here.
Yes, but if I recall correctly there is always a limit assigned for the …
This can be done easily: replace the …
@rnewson do you have any pointers or ideas where to start with this? In the previous discussions, you had some suggestions, but I could not make them work.
Hmm, if billing is involved then even the current implementation of the PR will not produce repeatable results, unless we're talking about Q=1 databases only. I was under the impression that this is mostly about providing slightly more accurate results (fewer zero ones). Maybe this uncertainty is OK, depending on the downstream consumers' requirements, but if billing is involved, it's good to make that clear. To convince ourselves, I quickly patched an old script I had to run a few _find requests: https://gist.github.com/nickva/31e0f1f1c2a5a651259dc897a1bb5cfa With 5000 docs we get a wild variety of results:
To be clear, I don't know if sending all the stats rows has a performance impact; I was merely mentioning a possible strategy to apply if it is a concern. All this is under the assumption that there is already some variability in the results anyway. I am slightly concerned with the synchronous stream finalization strategy: we'd be dealing with a streaming protocol during the active phase, only to switch back to a synchronous termination, which involves extra waiting applied to every request (finalize all Q workers, wait for all Q workers to respond, possibly also handle broken or timed-out cases, etc.). That would add some fixed delay to all requests (an extra 10-20 msec?). Maybe we could do the finalization after we respond to the user, but we'd have to see how that interacts with the request resource cleanup part.
I have done some preliminary benchmarking with … Here are the results:
There seems to be a 10-20% performance penalty for the streaming statistics, while the synchronous termination causes 66%.
I do not think it should produce repeatable results. If a query happened to cause the scanning of 2,000 documents once and then 1,500 on another run, that is acceptable — that is how the system works. But returning zero in either case is definitely a miss on the side of accounting.
Great, thanks @nickva!
Unfortunately there is (see my comment above) :-(
If that is what replacing …
I wonder whether we need to decouple the implementation of statistics collection for user-facing results from the collection used by `chttpd_stats`. For the user-facing case, performance of a few extra tens of milliseconds seems unlikely to matter so much, as it's typically a one-time analysis (e.g. via Fauxton) - a correct answer via a synchronous solution seems appropriate. For the `chttpd_stats` case …
Noting that "billing" of requests is not a CouchDB concern.
Yes, I agree. But it is CouchDB that could ensure that the data that is going to …
A higher-level question occurs. It seems to me you want to account for the cost of the workers that did not contribute to the result (they were killed by the coordinator because a response from another copy was received first), whereas CouchDB has an interest in not doing the unnecessary work. Did I misunderstand what the "missing" data is here?
Actually, I'm going to assume none of us mean the execution stats of non-contributing workers. What I think is being proposed here sounds like an attempt to fix the current, and by implication quite weird, execution stats code. By inference, the execution stats have been calculated and transmitted out-of-band relative to the returned rows, and this, if I'm right, is a mistake we should fix.

That is, I suggest that the workers include the number of rows/docs they examined and discarded with each matching row they return to the coordinator. This is a small amount of extra data on an existing message. For very efficient query executions, it's the number 0 on every row. The coordinator can obviously count how many rows it passed to the client and can sum these new numbers to calculate how many rows/docs were examined. As it's the same message, we can't get the wrong answer.

Now I'll wait until the morning to learn how completely wrong I am about the problem.
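A hedged sketch of that per-row accounting, with made-up names and assuming map-shaped rows (this is not the PR's actual code, just the shape of the idea):

```erlang
%% Sketch only: each emitted row carries how many docs were examined and
%% discarded since the previous emitted row; the coordinator sums that field.
-module(row_stats_sketch).
-export([emit_row/3, summarize/1]).

%% Worker side: DiscardedSinceLastRow is reset after every emitted row.
emit_row(Row, DiscardedSinceLastRow, Send) when is_map(Row) ->
    Send({row, Row#{discarded_before => DiscardedSinceLastRow}}).

%% Coordinator side: docs examined = rows returned + all discarded counts.
summarize(Rows) ->
    Returned = length(Rows),
    Discarded = lists:sum([maps:get(discarded_before, R, 0) || R <- Rows]),
    #{docs_examined => Returned + Discarded, rows_returned => Returned}.
```

Because the count rides on the same message as the row itself, the coordinator cannot see a row without also seeing its share of the examined/discarded work.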
Thanks for sharing your thoughts @rnewson. Yes, the execution stats are currently passed around in a dedicated message, out of band relative to the rows. The problem is that this can happen even on a single-node cluster. Although I can imagine the scenario you described, I am still inclined to believe that the use of …

Your suggestion looks similar to the one contained in this PR. However, I notice that you recommended extending the view rows themselves.
Essentially that, yes. But a bit further: the only execution stat we need at the row level is how many rows were examined but discarded prior to the row we are sending.
Thanks for running the performance check, @pgj!
It will depend not just on how many were scanned but also on the order in which they arrived and how they were merged at the coordinator. The idea is that there is some fudge factor there as it is. I updated the query script to query the API endpoint multiple times in a row for the same docs:
I like @rnewson's idea to add stats to the view row. As a bonus, we'd also then avoid odd out-of-order processing of stats like we have in `src/fabric/src/fabric_view_all_docs.erl`, lines 260 to 274 (at 2501fe6).
With per-row stats we may miss some stats emitted at the end by workers which had already sent rows that just didn't go towards producing the response. So there is some discrepancy between the total work induced in the cluster as a result of the API request and the work which took place on the workers before the emitted rows, if those rows were included in the response.
Ah, that's interesting to think about chttpd_stats, @willholley. That's in chttpd and it takes in request/response objects. But if we wanted to emit online stats for each request as it's being processed (a long-running request), we could alter the API there such that each worker would get some set of request bits (path, principal, nonce, request id? ...) passed to it (in #mrargs extras), and then the worker could report stats independently without having to shuffle them back to the coordinator. Kind of a larger change, but that way it could account for all the work generated as a side effect of an API call, even if it didn't make it into the response.
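A made-up sketch of that shape (none of these names are chttpd_stats' real API; `couch_log` stands in for whatever sink would actually receive the reports, and the `request_ctx` option is an assumption):

```erlang
%% Illustrative only: the coordinator tucks a few request identifiers into the
%% options the worker already receives, and the worker reports its stats
%% directly, keyed by those identifiers, instead of shipping them back.
-module(worker_report_sketch).
-export([add_request_ctx/2, report_from_worker/2]).

%% Coordinator side: ExtraOpts would end up in #mrargs.extra.
add_request_ctx(ExtraOpts, #{nonce := _, path := _} = ReqCtx) ->
    [{request_ctx, ReqCtx} | ExtraOpts].

%% Worker side: report independently of the coordinator.
report_from_worker(ExtraOpts, Stats) ->
    case lists:keyfind(request_ctx, 1, ExtraOpts) of
        {request_ctx, #{nonce := Nonce}} ->
            couch_log:notice("request ~s worker stats: ~p", [Nonce, Stats]);
        false ->
            ok
    end.
```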
c4ba325 to 1c1a48c
I have implemented half of @rnewson's suggestions and the code is now sending execution stats along with the view rows only. That is still a separate message, but it has already improved the performance (same configuration as above):
It seems to be only 5-10% off with this change.
a52c3d7 to 0b0834c
There is an updated version now where the stats are sent along with the row data. Instead of creating a new record type, I switched to using maps and added support for coordinators and workers of the old version. The benchmark shows good performance (same configuration as earlier):
I tested this manually with old worker—new coordinator and new worker—old coordinator configurations with a random set of …

Also, the combined message format revealed that in case of limits, there is always an extra key and sometimes an extra document examined. Previously this information was lost since the corresponding …

The only question that remains (for me) is whether …
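A tiny sketch of the kind of old/new handling the mixed-version testing implies on the coordinator side (the record below is a local stand-in for the real `#view_row{}` definition in fabric's headers, and the `stats` key is an assumption):

```erlang
%% Sketch of backward compatibility: accept both the new map-shaped rows with
%% embedded stats and the old record-shaped rows from not-yet-upgraded workers.
-module(mixed_cluster_sketch).
-export([row_stats/1]).

%% Local stand-in for the view_row record defined in fabric's include files.
-record(view_row, {key, id, value, doc}).

row_stats(Row) when is_map(Row) ->
    %% new worker: stats ride along with the row
    maps:get(stats, Row, #{});
row_stats(#view_row{}) ->
    %% old worker: no per-row stats, fall back to the separate stats message
    #{}.
```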
Fantastic work, @pgj!
I like the new view row map. I wonder if we can go a step further and make all the view rows into maps, even the map/reduce and the other cases. In other words eventually be able to remove the view_row record altogether. But I think it should also be its own PR.
In the separate PR for mrviews, we'd update https://github.com/apache/couchdb/blob/main/src/fabric/src/fabric_rpc.erl#L480-L531. Based on the `#mrargs.extra` option we'd either use `view_cb` or `view_cb_compact`. One would return a map, the other the record. The replicator might be another place which uses `view_row` records, so we might need to update those as well. Same idea as you did for mango rows.

Since we're trying to return stats with every row, we'd also want to account for all the rows emitted after the last row and until the `complete` message. As is, we'd still need to send `rexi:stream2({execution_stats, Stats})` just before the `complete` is emitted. What if, while we're changing the view row formats, we also update complete to be `{complete, #{stats => Stats}}` instead of `complete`? Another one to update might be the `no_pass` message, which could also get an additional map for stats or other things.

To summarize, I think it's a great change; it's about time we switch away from sending hard-coded records across the cluster. But I think we can take this effort even further and first have a PR to use maps for view rows and complete callbacks. And then, in a separate PR, add rolling stats, which should by then be trivial.
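A rough sketch of that dispatch, assuming a hypothetical `view_row_map` option in `#mrargs.extra` (the option name and the callback bodies are invented; only `rexi:stream2/1`, `rexi:stream_last/1`, and `couch_util:get_value/3` are existing calls):

```erlang
%% Sketch of choosing the row callback based on an option in #mrargs.extra.
-module(view_cb_dispatch_sketch).
-export([choose_view_cb/1, view_cb_map/2, view_cb_record/2]).

choose_view_cb(ExtraOpts) ->
    case couch_util:get_value(view_row_map, ExtraOpts, false) of
        true  -> fun view_cb_map/2;      % map rows, stats travel with each row
        false -> fun view_cb_record/2    % classic behaviour for old coordinators
    end.

%% Map flavour: per-row stats, and `complete` also carries the final stats.
view_cb_map({row, Row}, #{stats := Stats} = Acc) when is_map(Row) ->
    rexi:stream2({view_row, Row#{stats => Stats}}),
    {ok, Acc#{stats => #{}}};
view_cb_map(complete, #{stats := Stats} = Acc) ->
    rexi:stream_last({complete, #{stats => Stats}}),
    {ok, Acc}.

%% Record flavour: behaves like today, stats stay out of band (or are dropped).
view_cb_record({row, Row}, Acc) ->
    rexi:stream2({view_row, Row}),
    {ok, Acc};
view_cb_record(complete, Acc) ->
    rexi:stream_last(complete),
    {ok, Acc}.
```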
@nickva thanks for the confirmation and the suggestions! I will address them in the other PR about introducing the map-based view rows.
fe80c21 to 7de0a80
7de0a80 to 342f568
In case of map-reduce views, the arrival of the `complete` message is not guaranteed for the view callback (at the shard) when a `stop` is issued during the aggregation (at the coordinator). Due to that, internally collected shard-level statistics may not be fed back to the coordinator, which can cause data loss, hence inaccuracy in the overall execution statistics.

Address this issue by switching to a "rolling" model where row-level statistics are immediately streamed back to the coordinator. Support mixed-version cluster upgrades by activating this model only if requested through the map-reduce arguments and the given shard supports that.

Fixes apache#4560
342f568 to 2d944ab
@pgj nicely done!
Also kudos to @rnewson for catching the performance regression and coming up with a better design for sending rows stats (#4958 (comment))
Thank you all for your help too!
In case of map-reduce views, the arrival of the `complete` message is not guaranteed for the view callback (at the shard) when a `stop` is issued during the aggregation (at the coordinator). Due to that, internally collected shard-level statistics may not be fed back to the coordinator, which can cause data loss, hence inaccuracy in the overall execution statistics.

Address this issue by switching to a "rolling" model where row-level statistics are immediately streamed back to the coordinator. Support mixed-version cluster upgrades by activating this model only if requested through the map-reduce arguments and the given shard supports that.
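For context, a minimal sketch of how such an opt-in could look; the option name `execution_stats_rolling` is purely illustrative and not necessarily what the PR uses, only `couch_util:get_value/3` is an existing call:

```erlang
%% Illustrative only: the rolling model is opt-in via the map-reduce arguments.
%% An old worker never acts on the unknown option and keeps the previous
%% behaviour, which is what makes mixed-version clusters safe.
-module(rolling_optin_sketch).
-export([request_rolling_stats/1, rolling_stats_enabled/1]).

%% Coordinator side: add the (hypothetical) flag to the extra options that
%% end up in #mrargs.extra.
request_rolling_stats(ExtraOpts) ->
    [{execution_stats_rolling, true} | ExtraOpts].

%% Worker side: stream per-row stats only when the coordinator asked for it.
rolling_stats_enabled(ExtraOpts) ->
    couch_util:get_value(execution_stats_rolling, ExtraOpts, false) =:= true.
```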
Note that this change shares some similarities with @chewbranca's work on Couch Stats Resource Tracker, though it stems from a different origin. In the long run, the goal is to either have it merged into or even replaced by that approach, to facilitate convergence. This change has a smaller scope, hence it is less complicated, and it is ready for inclusion.
Testing recommendations
Running the respective Mango unit and integration test suites shall suffice (this is also done by the CI):
```sh
make eunit apps=mango
make mango-test MANGO_TEST_OPTS="15-execution-stats-test"
```
But there is a detailed description in the related ticket (see below) on how to trigger the problem, which might also be used to exercise the code change.
Related Issues or Pull Requests
Originates from #4735, forerunner of #4812
Fixes #4560
Checklist