Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SQLClient connection leak in getConnectionAwait(), leads to a deadlock #195

Closed
ph4r05 opened this issue May 13, 2021 · 11 comments
Closed

SQLClient connection leak in getConnectionAwait(), leads to a deadlock #195

ph4r05 opened this issue May 13, 2021 · 11 comments
Assignees
Labels

Comments

@ph4r05
Copy link

ph4r05 commented May 13, 2021

Version

  • vertx 3.9.7
  • kotlin 1.5.0 (tested also on 1.4.32)
  • open-jdk-14

Context

I observed a resource leak in JDBCClient, namely in the vertx-lang-kotlin:3.9.7 function

suspend fun SQLClient.getConnectionAwait(): SQLConnection 

The problem is if coroutine calling getConnectionAwait() is cancelled before the handler on the following method is called:

SQLClient getConnection(Handler<AsyncResult<SQLConnection>> handler);

If such situation happens (e.g., calling REST connection is terminated / timed out), the obtained connection, passed to the handler, is not closed and returned to the pool. This causes leakage, depleting all free connections (if using with connection pool like C3P0). The program ends up in the terminal state without usable DB connection.

I believe that the problem lies in the

suspend fun <T> awaitEvent(block: (h: Handler<T>) -> Unit): T {

suspend fun <T> awaitEvent(block: (h: Handler<T>) -> Unit): T {
  return suspendCancellableCoroutine { cont: CancellableContinuation<T> ->
    try {
      block.invoke(Handler { t ->
        cont.resume(t) // <---- here, if cont is cancelled, t remains opened
      })
    } catch (e: Exception) {
      cont.resumeWithException(e)
    }
  }
}

I solved the problem by implementing the coroutine bridging on my own with a if-branch handling cancelled coroutine, such as

    try {
      block.invoke(Handler { t ->
        if (cont.isActive)
          cont.resume(t)
        } else {
          t.close()  // <--- close connection if coroutine is already cancelled when handler is invoked
        } 
      })
    } catch (e: Exception) {
      cont.resumeWithException(e)
    }

I noticed that this method is deprecated vert-x3/vertx-jdbc-client#196

@Deprecated(message = "Instead use getConnection returning a future and chain with await()", replaceWith = ReplaceWith("getConnection().await()"))
suspend fun SQLClient.getConnectionAwait(): SQLConnection {
but this may help others having the same issues if they didn't migrate yet.

Also, vertx 3 API is planned to be supported also in vertx 4 so this race condition can potentially affect a lot of users, making their servers unresponsive. Btw was it deprecated also because of this potential issue?

The similar issues can be present also elsewhere in the same pattern, i.e., call await method (coroutine), wait for handler, when handler is called, coroutine is already cancelled, thus resource obtained & returned in the handler is not properly closed.

Do you have a reproducer?

Not yet, I can create one if description is not enough.

Steps to reproduce

  1. setup MySQL JDBC-backed demo
  2. spawn 2000 worker threads simulating clients, each querying the database (code below), let clients timeout randomly
  3. watch number of failures / timeouts
val rnd = Random()
withTimeout(rnd.nextLong().rem(1000L)) {
    dbClient.getConnectionAwait().use {
        it.queryAwait("SELECT 1")
    }
}
@ph4r05 ph4r05 added the bug Something isn't working label May 13, 2021
@vietj vietj added this to the 4.1.0 milestone May 14, 2021
@vietj
Copy link
Contributor

vietj commented May 14, 2021

can you provide a reproducer project ? that would help

@ph4r05
Copy link
Author

ph4r05 commented May 14, 2021

@vietj I hoped that my description was detailed enough to avoid implementing PoC as I am quite busy 😅

Here is the POC:
https://github.com/ph4r05/vertx-sqlconn-poc

  • Check out readme for the usage. There is also a fix implemented.
  • If you cannot reproduce it, increase number of clients from 500 to 2000 and dos-time from 10_000 to 20_000.
  • Any local MySQL server works, no DB is needed for a simple ping SELECT 1. Unfortunately I don't have spare time to implement it without requiring a real MySQL database.

Code of interest:
https://github.com/ph4r05/vertx-sqlconn-poc/blob/659e211be57ffabde712fa750c5b5dd36f30ccde/src/main/kotlin/me/deadcode/vertx/WebServer.kt#L106-L112

Buggy ping
https://github.com/ph4r05/vertx-sqlconn-poc/blob/659e211be57ffabde712fa750c5b5dd36f30ccde/src/main/kotlin/me/deadcode/vertx/WebServer.kt#L129-L133

Ping with fixed race condition
https://github.com/ph4r05/vertx-sqlconn-poc/blob/659e211be57ffabde712fa750c5b5dd36f30ccde/src/main/kotlin/me/deadcode/vertx/WebServer.kt#L135-L179

@vietj vietj self-assigned this May 25, 2021
@vietj
Copy link
Contributor

vietj commented May 25, 2021

can you elaborate on how the coroutine can get cancelled in your case ?

@vietj
Copy link
Contributor

vietj commented May 25, 2021

I understand the issue and it seems it is related to the fact that Vert.x futures cannot be cancelled and I am wondering how we could solve this. e.g we could a close called when the coroutine is cancelled and the future result implements an interface like Closeable.

@vietj
Copy link
Contributor

vietj commented May 25, 2021

Perhaps we could make Vert.x coroutines as non cancellable ?

@ph4r05
Copy link
Author

ph4r05 commented May 25, 2021

can you elaborate on how the coroutine can get cancelled in your case ?

Coroutine gets usually cancelled due to timeout in withTimeout() as we want to limit execution time of some code paths, not to lock too many resources if server is overloaded (e.g., if DB server is overloaded or has opened maximum amount of connections) and to respond in timely manner.

Another reason is client disconnect during the request, e.g., in the websocket server handler, client may stop responding to pings, socket can get disconnected due to network issue, client disconnects when using mobile data. The coroutine chain triggered by the client request get cancelled. Similarly, the CoroutineScope may be cancelled from another reason (e.g., reloading TLS certificates, maintenance mode).

e.g we could a close called when the coroutine is cancelled and the future result implements an interface like Closeable.

I was also thinking along these lines. It seems like a fine workaround. It t comes from the Future, it is probably sound to close it if coroutine is not active anymore.

Perhaps we could make Vert.x coroutines as non cancellable ?

I am not sure about the impact of this. Personally, it feels a bit risky and maybe too restrictive.

@vietj
Copy link
Contributor

vietj commented May 25, 2021

I would actually discourage using such cancellation. The more I think about it, the more I see coroutine cancellation as something similar to killing a thread without having an opportunity to cleanly close all the resources associated with the coroutines.

Instead you should try to use timeout on the resources you are using and let them fail for you, e.g if your HTTP server uses a WebClient then set timeout on the HTTP requests. It will have a similar effect and has a more reliable handling.

@ph4r05
Copy link
Author

ph4r05 commented May 25, 2021

I actually quite like structured concurrency principles coroutines offer. As cancellation is cooperative and it throws an exception, it is possible to reasonably close resources with r.use { } or try{ } finally{ }. However, bridging coroutines with Future callbacks is a bit more tricky with respect to cancellation. https://medium.com/androiddevelopers/cancellation-in-coroutines-aa6b90163629

The problem is that with current Future design, when handler is invoked and coroutine is not active anymore, we have no way to signal the completion to the calling code, so the logical idea is to close acquired resource.

suspend fun <T> awaitEvent(block: (h: Handler<T>) -> Unit): T {
  return suspendCancellableCoroutine { cont: CancellableContinuation<T> ->
       try {
        block.invoke(Handler { t ->
        if (cont.isActive)
          cont.resume(t)
        } else if(it is java.io.Closeable) {
          t.close()  // <--- close connection if coroutine is already cancelled when handler is invoked
        } 
      })
    } catch (e: Exception) {
      cont.resumeWithException(e)
    }
  }
}

When I experimented with non-cancellable coroutine in awaitEvent, the whole PoC froze, dbclient keeps the system locked. I would like cancelable coroutines better (as currently are) as it is more Kotlin-ic, IMHO. It provides server a way to recover from risk states and to cancel blocked coroutines.

Good point with the resources timeouts. Using timeout also on the top level of coroutine processing makes sure we didn't forget setting any timeouts and guarantee maximal execution time for a request (availability, scalability).

@vietj vietj modified the milestones: 4.1.0, 4.1.1 Jun 1, 2021
@vietj vietj modified the milestones: 4.1.1, 4.2.0 Jul 2, 2021
@tsegismont tsegismont added wontfix and removed bug Something isn't working labels Oct 14, 2021
@tsegismont tsegismont removed this from the 4.2.0 milestone Oct 14, 2021
@tsegismont
Copy link
Contributor

Closing as we prefer not moving forward with this.

@ph4r05
Copy link
Author

ph4r05 commented Nov 1, 2021

@tsegismont so if I understand correctly, you basically say that deadlock will be preserved in code and there is nothing you will do about it? :)

@vietj
Copy link
Contributor

vietj commented Nov 2, 2021

no, we are saying that we will not implement cancelation as this is not what we want to support, e.g supporting loom in the future will not be able to deal with cancelation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

No branches or pull requests

3 participants