-
Notifications
You must be signed in to change notification settings - Fork 545
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert background cluster networking calls to async - Gossip, Replication #914
Conversation
So, the meet lock is there to throttle bursts of meet requests. The same connection is used for gossip and if that gets delayed a timeout will occur which will break and re-initialize the connection, unless gossip-delay is increases. I would make the WriteLock async if that is an issue. |
Should be fine to go ahead and remove that lock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
…hub.com/microsoft/garnet into users/padgupta/remove_cluster_meet_lock
There are multiple cluster networking calls that are being converted to async to improve performance and responsiveness of server. Generally there are cross node communication in a mesh like metadata communication system, if a node is unresponsive in this mesh and we're making synchronous calls we can block the .NET threadpool and cause thread starvation. .NET will eventually create more threads if Max limit is not there but that will cause memory overhead and perf hit. So it's safer to convert these calls to async.
There should be no major overhead of async state machine in non critical path like these but the gains in the scenarios described above are immense.
Additional minor change to remove Meet Lock:
The lock is not really protecting anything in this context as we still proceed with Meet rather than returning the thread until it's available. So doesn't seem like we really need a lock. In some cases during startup if there are multiple meets (and one or more nodes are unresponsive), it can cause CAS retries during it's release so better to remove it if it's not needed.
Also note that MEET is currently Blocking and consumes a threadpool thread.
CPU stacks indicating WriteUnlock taking time