Infinite wakeup loop issue in version 0.23 and onwards #72

Closed
VladimirBramstedt opened this issue May 15, 2024 · 14 comments · Fixed by rustls/rustls#1950

@VladimirBramstedt

Trying to update from 0.22.x.
In certain situations we are seeing a massive increase in wakeups for rustls without any progress, causing high CPU utilization, essentially a busy poll.
We narrowed the issue (or at least the trigger for this behaviour) down to this bit of code: https://github.com/rustls/tokio-rustls/blob/main/src/common/mod.rs#L233
This is part of the diff for the 0.22 -> 0.23 update: v/0.22.0...v/0.23.0#diff-07e58a4ba3e21a351457fb113f95290184c934033ffc82fbc0ab0f343b6fdb82
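For reference, the pattern in question looks roughly like the sketch below. This is a paraphrase, not a verbatim copy of tokio-rustls; the free-standing helper and the choice of ClientConnection are mine, and io_pending stands for whether the preceding "read a packet" loop hit Poll::Pending on the socket.

```rust
use std::io::{self, Read};
use std::task::{Context, Poll};

use rustls::ClientConnection;
use tokio::io::ReadBuf;

// Hypothetical distillation of the wake-on-WouldBlock pattern around
// common/mod.rs#L233: after the packet loop, try to copy decrypted
// plaintext out of the session.
fn finish_read(
    session: &mut ClientConnection,
    cx: &mut Context<'_>,
    buf: &mut ReadBuf<'_>,
    io_pending: bool,
) -> Poll<io::Result<()>> {
    match session.reader().read(buf.initialize_unfilled()) {
        // Ok(0) with a non-empty buffer means the session closed cleanly.
        Ok(0) => Poll::Ready(Ok(())),
        Ok(n) => {
            buf.advance(n);
            Poll::Ready(Ok(()))
        }
        Err(err) if err.kind() == io::ErrorKind::WouldBlock => {
            if !io_pending {
                // No waker was handed to the socket, so wake ourselves to get
                // polled again. If rustls never makes progress, this self-wake
                // is exactly the busy poll described above.
                cx.waker().wake_by_ref();
            }
            Poll::Pending
        }
        Err(err) => Poll::Ready(Err(err)),
    }
}
```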

The code originally seems to have been added in this PR:
tokio-rs/tls#79, and in particular this comment:
tokio-rs/tls#79 (comment)

If this doesn't look good, we can make it call read_io again only when it io_pending. but if rustls state is wrong, it will fall into an infinite loop, which may be worse than panic.

That comment describes pretty closely the behaviour we are seeing.
Removing that wakeup seems to resolve the issue, but I am unsure whether that code is load-bearing in some other sense and actually required for correctness, or whether it was introduced as some sort of "eh, try again" fallback or purely as an optimization.
Either way, relying on an implicit task budget to handle infinite loops doesn't seem healthy or robust to me.
Are there reasons this code can't be removed?

Digging a bit deeper into what happens on the rustls side: we are using tokio-rustls in a somewhat adversarial environment where not all packets can be trusted to be well-formed, and I believe rustls/rustls#959 might be related, as the code it references is still in rustls, but that is still under investigation.

@djc
Member

djc commented May 15, 2024

Sorry about this regression!

Are you upgrading from tokio-rustls 0.22 (which is pretty old at this point -- corresponds to rustls 0.19, released 3 years ago) to something more current?

I wonder if this is also related to the stuff in #68.

@VladimirBramstedt
Author

Yeah, it's pretty old, but it has been pretty stable. We tried the latest version of tokio-rustls and noticed this triggering; then we found the earliest version that triggers this issue. So it's not like it only happens on 0.23, that's just the earliest version that triggers it.

@djc
Member

djc commented May 15, 2024

Are you able to reproduce this issue to the extent that you can put in some debugging code? Are you trying to upgrade to the latest version? Given that potentially "abnormal" rustls state is in play, it would be good to get to the bottom of this.

cc @ctz

@VladimirBramstedt
Author

Getting the actual TCP packets at play is probably not going to be possible, but my next step was to fork rustls itself and add logging to see which of the conditionals that can lead to WouldBlock is triggering, to confirm or rule out it being related to rustls/rustls#959.

@quininer
Member

quininer commented May 15, 2024

Normally we would not enter this wake; entering it means read_tls returned 0, but rustls should check whether EOF occurred when reading: https://github.com/rustls/rustls/blob/v/0.23.5/rustls/src/conn.rs#L239

Another possibility is that read_tls reads normally, but wants_read returns false (no readable plaintext exists). If wake is not called at this time, it may hang indefinitely.
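To make that second case concrete, here is a minimal sketch (a hypothetical helper, not tokio-rustls code) of why returning Pending without the wake can hang: once the socket read has returned Ready, tokio has consumed the readiness event without registering our waker.

```rust
use std::task::{Context, Poll};

// `made_progress` models "read_tls read some bytes" during this poll.
fn finish_poll(cx: &mut Context<'_>, made_progress: bool) -> Poll<()> {
    if made_progress {
        // The socket was ready and we consumed the data, so no waker is
        // registered anywhere. Without this manual wake nothing would ever
        // poll the future again and it would hang indefinitely.
        cx.waker().wake_by_ref();
    }
    // If we did NOT make progress, the socket read returned Pending and
    // registered cx.waker() itself, so plain Pending is safe here.
    Poll::Pending
}
```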

@VladimirBramstedt
Author

Normally we would not enter this wake; entering it means read_tls returned 0, but rustls should check whether EOF occurred when reading: https://github.com/rustls/rustls/blob/v/0.23.5/rustls/src/conn.rs#L239

Exactly, and normally it doesn't trigger. However, it does happen, and when it does, that wake causes quite a bit of headache. So, rustls itself aside, is relying on an implementation detail of the runtime (i.e. the coop budget) the best way to handle it? Even the original comment acknowledged this issue.

@quininer
Member

This issue is not expected; we need to find out the problem. The prerequisite for the infinite wakeup is "but if rustls state is wrong".

@VladimirBramstedt
Author

This issue is not expected; we need to find out the problem. The prerequisite for the infinite wakeup is "but if rustls state is wrong".

I agree with you, it would be good to find out what's going on. I'm not familiar enough with rustls internals to be able to accurately say what constitutes "bad" state, though.

@djc
Member

djc commented May 15, 2024

So what do we know:

  • In the "read a packet" loop, we either read EOF or we read data and session.wants_read() started returning false
    • wants_read() returning false means that either:
      • received_plaintext() is not empty, or
      • we received CloseNotify, or
      • the session cannot send application data and there is still TLS data to send
  • session.reader().read(buf.initialize_unfilled()) yielded WouldBlock
    • Even though buf was non-empty, rustls was not able to copy any received_plaintext into it
    • rustls (a) has not seen EOF and there is no pending TLS data, and (b) has not seen a CloseNotify alert

I think this means that, in the case where we did read EOF in the middle of an encrypted data packet, wants_read() would return false while read() might yield WouldBlock because the deframer buffer is not empty. This doesn't make sense because if we've seen EOF read() would yield UnexpectedEof instead of WouldBlock.

Feels like the proper solution would be #60, but we probably need to find the bug here first.
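Pulling the analysis above together, here is a toy model of the stuck state (field names and logic are assumptions made for illustration; this mirrors the reasoning in this thread, not actual rustls internals):

```rust
use std::io;

// Toy model of the session state discussed above (assumed names, not rustls internals).
struct SessionState {
    has_plaintext: bool,
    received_close_notify: bool,
    deframer_buffer_empty: bool,
}

impl SessionState {
    // wants_read() goes false as soon as close_notify has been seen
    // (or plaintext is buffered).
    fn wants_read(&self) -> bool {
        !self.has_plaintext && !self.received_close_notify
    }

    // What reader().read() reports for this state, per the analysis in this thread.
    fn plaintext_read(&self) -> io::Result<usize> {
        if self.has_plaintext {
            Ok(1) // some plaintext would be copied out
        } else if self.received_close_notify && self.deframer_buffer_empty {
            Ok(0) // clean EOF
        } else {
            // No plaintext and the connection is not cleanly closed. In the stuck
            // case below this is because close_notify was seen while the deframer
            // still holds unprocessed bytes; before rustls/rustls#1950 this kept
            // reporting WouldBlock.
            Err(io::ErrorKind::WouldBlock.into())
        }
    }
}

fn main() {
    let stuck = SessionState {
        has_plaintext: false,
        received_close_notify: true,
        deframer_buffer_empty: false,
    };
    // wants_read() is false, so the packet loop never touches the socket again;
    // the plaintext read yields WouldBlock, the stream self-wakes, and the next
    // poll repeats the same steps: an infinite wakeup loop with no progress.
    assert!(!stuck.wants_read());
    assert!(matches!(
        stuck.plaintext_read(),
        Err(e) if e.kind() == io::ErrorKind::WouldBlock
    ));
}
```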

@VladimirBramstedt
Author

OK, so I tried running with the latest tokio-rustls (0.26) to avoid any issues that have been fixed since 0.23, but pointed it at my own rustls fork (still at 0.23.5), where I added a log line for when we run into the state described in rustls/rustls#959, and indeed that logged an awful lot.

So: we read a close_notify, haven't seen an EOF, and there is still data in the buffer.

@djc
Member

djc commented May 15, 2024

Thanks for the investigation!

So: we read a close_notify, haven't seen an EOF, and there is still data in the buffer.

Which buffer are you referring to here?

@VladimirBramstedt
Author

If you look at where I linked my addition, it reaches that line. So it has seen close_notify and the deframer is not empty, but the current code will still try to read again... ergo, exactly like that rustls issue I linked.

@VladimirBramstedt
Author

@ctz @djc this issue isn't really closed until a release of tokio-rustls itself also depends on rustls 0.23.6, is it?

@djc
Member

djc commented May 17, 2024

That's not necessary: just run cargo update and you should get the fixed version.
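For example (assuming rustls appears in Cargo.toml with a normal caret requirement such as "0.23", so only the lockfile needs refreshing), something like:

```sh
cargo update -p rustls
cargo tree -i rustls   # check that the resolved version is 0.23.6 or newer
```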
