Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert background cluster networking calls to async - Gossip, Replication #914

Merged
merged 9 commits into from
Jan 16, 2025

Conversation

msft-paddy14
Copy link
Contributor

@msft-paddy14 msft-paddy14 commented Jan 14, 2025

There are multiple cluster networking calls that are being converted to async to improve performance and responsiveness of server. Generally there are cross node communication in a mesh like metadata communication system, if a node is unresponsive in this mesh and we're making synchronous calls we can block the .NET threadpool and cause thread starvation. .NET will eventually create more threads if Max limit is not there but that will cause memory overhead and perf hit. So it's safer to convert these calls to async.
There should be no major overhead of async state machine in non critical path like these but the gains in the scenarios described above are immense.

Additional minor change to remove Meet Lock:
The lock is not really protecting anything in this context as we still proceed with Meet rather than returning the thread until it's available. So doesn't seem like we really need a lock. In some cases during startup if there are multiple meets (and one or more nodes are unresponsive), it can cause CAS retries during it's release so better to remove it if it's not needed.
Also note that MEET is currently Blocking and consumes a threadpool thread.

CPU stacks indicating WriteUnlock taking time
image

@vazois
Copy link
Contributor

vazois commented Jan 14, 2025

So, the meet lock is there to throttle bursts of meet requests. The same connection is used for gossip and if that gets delayed a timeout will occur which will break and re-initialize the connection, unless gossip-delay is increases. I would make the WriteLock async if that is an issue.

@vazois vazois self-requested a review January 14, 2025 17:23
@msft-paddy14 msft-paddy14 changed the title Remove meet lock in TryMeet Convert background cluster networking calls to async - Gossip, Replication Jan 15, 2025
@vazois
Copy link
Contributor

vazois commented Jan 15, 2025

So, the meet lock is there to throttle bursts of meet requests. The same connection is used for gossip and if that gets delayed a timeout will occur which will break and re-initialize the connection, unless gossip-delay is increases. I would make the WriteLock async if that is an issue.

Should be fine to go ahead and remove that lock

Copy link
Contributor

@vazois vazois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@TalZaccai TalZaccai merged commit 18fce91 into main Jan 16, 2025
18 checks passed
@TalZaccai TalZaccai deleted the users/padgupta/remove_cluster_meet_lock branch January 16, 2025 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants