Infinite wakeup loop issue in version 0.23 and onwards #72
Comments
Sorry about this regression! Are you upgrading from tokio-rustls 0.22 (which is pretty old at this point -- corresponds to rustls 0.19, released 3 years ago) to something more current? I wonder if this is also related to the stuff in #68.
Yeah, it's pretty old but has been pretty stable. We tried the latest version of tokio-rustls and noticed this triggering. Then we found the earliest version to trigger this issue, so it's not like it only happens on 0.23; it's just the earliest version that triggers it.
Are you able to reproduce this issue to the extent that you can put in some debugging code? Are you trying to upgrade to the latest version? Given that potentially "abnormal" rustls state is in play, it would be good to get to the bottom of this. cc @ctz
Getting the actual TCP packets at play is probably not going to be possible, but my next step was to fork rustls itself and add logging to see which of the conditionals that can lead to this state is being hit.
Normally we would not enter this wake. Entering it means read_tls returned 0, but rustls should check whether EOF occurred when reading: https://github.com/rustls/rustls/blob/v/0.23.5/rustls/src/conn.rs#L239
Another possibility is that read_tls reads normally but wants_read returns false (no readable plaintext exists); if wake is not called at that point, the task may hang indefinitely.
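To make the two scenarios above concrete, here is a rough sketch of the decision the read path has to make after pumping the TLS session. This is illustrative only, not the actual tokio-rustls source; the function and parameter names are made up for the example:

```rust
use std::io;
use std::task::{Context, Poll};

// Illustrative stand-in for the decision the read path has to make after
// pumping the TLS session; not the actual tokio-rustls code.
fn after_read_tls(
    cx: &mut Context<'_>,
    bytes_read: usize,         // what read_tls() returned
    plaintext_available: bool, // did process_new_packets() yield plaintext?
) -> Poll<io::Result<usize>> {
    if bytes_read == 0 {
        // Transport EOF: rustls should surface this through its own state
        // (the conn.rs line linked above) rather than asking for more reads.
        return Poll::Ready(Ok(0));
    }
    if plaintext_available {
        // Normal case: decrypted bytes can be handed to the caller.
        return Poll::Ready(Ok(bytes_read));
    }
    // Ciphertext was read but no plaintext became readable. A self-wake here
    // retries immediately; if the session can never make progress from this
    // state, the wake repeats forever -- the loop under discussion.
    cx.waker().wake_by_ref();
    Poll::Pending
}
```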
Exactly, and normally it doesn't trigger. However, it does happen, and when it does, that wake causes quite a bit of headache. So, rustls itself aside, is relying on an implementation detail of the runtime (i.e. the coop budget) the best way to handle it? Especially since even the comment acknowledges this issue.
This issue is not expected; we need to find the underlying problem. The precondition for the infinite wakeup is the "but if rustls state is wrong" case.
I agree with you, it would be good to find out what's going on. I'm not familiar enough with rustls internals to be able to accurately say what constitutes "bad" state, though.
So, what do we know? Feels like the proper solution would be #60, but we probably need to find the bug here first.
OK, so I tried running with the latest tokio-rustls (0.26) to avoid any issues that have been fixed since 0.23, but pointed it at my own rustls fork (still at 0.23.5), where I added a line of logging when we run into the state described in rustls/rustls#959, and indeed that logged an awful lot. So: we read a close_notify, haven't seen an EOF, and there is still data in the buffer.
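For clarity, the probe was of roughly this shape. This is a self-contained illustration, not the actual patch: the two booleans stand in for rustls-internal state, and the real field names in conn.rs differ:

```rust
// Self-contained stand-in for the one-line probe added to the rustls fork.
// The booleans model internal rustls state; the actual patch touched
// rustls' conn.rs directly.
fn log_if_stuck_state(received_close_notify: bool, deframer_has_buffered_data: bool) {
    if received_close_notify && deframer_has_buffered_data {
        eprintln!(
            "close_notify seen, no EOF, deframer not empty -- \
             the state described in rustls/rustls#959"
        );
    }
}
```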
Thanks for the investigation!
Which buffer are you referring to here?
If you look at where I linked my addition, it reaches that line. So it has seen close_notify and the deframer is not empty, but the current code will still try and read again... ergo, exactly like that issue I linked from rustls.
That's not necessary: just run
Trying to update from 0.22.x.
In certain situations, we are seeing a massive increase in wakeups for rustls without any progress, causing high CPU utilization: essentially a busy poll.
We narrowed the issue (or at least the trigger for this behaviour) down to this bit of code: https://github.com/rustls/tokio-rustls/blob/main/src/common/mod.rs#L233
This is part of the diff for the 0.22 -> 0.23 update: v/0.22.0...v/0.23.0#diff-07e58a4ba3e21a351457fb113f95290184c934033ffc82fbc0ab0f343b6fdb82
The code seems to have originally been added in this PR:
tokio-rs/tls#79; in particular, this comment:
tokio-rs/tls#79 (comment)
describes pretty closely the behaviour we are seeing.
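To make the "busy poll" concrete, here is a minimal, self-contained toy future (not tokio-rustls code) showing the mechanism: a future that wakes itself on every poll gets re-polled in a tight loop instead of being parked:

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Toy future demonstrating the failure mode: it never completes, but it
/// schedules itself again on every poll, so the executor spins on it
/// (high CPU, no progress) instead of parking the task.
struct SelfWaking;

impl Future for SelfWaking {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        cx.waker().wake_by_ref(); // "wake me again immediately"
        Poll::Pending             // ...while making no progress
    }
}
// Awaiting SelfWaking on any executor pegs a core: poll -> wake -> poll -> ...
```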
Removing that wakeup seems to resolve the issue, but I am unsure whether that code is load-bearing in some other sense and actually required for correctness, or whether it was introduced as some sort of "eh, try again" fallback or purely as an optimization.
Either way, relying on an implicit task budget to handle infinite loops doesn't seem healthy or robust to me.
Are there reasons this code can't be removed?
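For context on what the task budget does and does not buy here, a hedged sketch, assuming tokio's tokio::task::consume_budget API (available in recent tokio releases): exhausting the budget forces a spinning task to yield back to the scheduler so other tasks stay responsive, but the task is immediately rescheduled and keeps spinning, which is consistent with the high CPU utilization described above.

```rust
// Sketch: the coop budget bounds how long a task hogs a worker per poll,
// but it does not terminate a loop that never makes progress.
async fn hot_loop_bounded_by_budget() {
    loop {
        // ...attempt work that never progresses (e.g. the wakeup above)...

        // Spends one unit of this task's coop budget; once the budget is
        // exhausted, this await yields to the tokio scheduler. The task is
        // then rescheduled and keeps spinning.
        tokio::task::consume_budget().await;
    }
}
```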
Digging a bit lower into what happens on the rustls side: we are using tokio-rustls in a somewhat adversarial environment where not all packets can be trusted to be well-formed, and I believe rustls/rustls#959 might be related, as the code it references is still in rustls. That is still under investigation.