Race condition in server when accepting a connection #94
@sophokles73 can you provide a patch and possibly a test?
@vietj discussing with @sophokles73 I'll take a look into it, but if he already has a patch + tests it could be useful :-)
it would be great to have a patch soon that we can backport to 3.5.2
providing a patch is easy, writing a test case probably isn't ...
You can try to connect and then instantly publish a large number of times, so I think the right test should look like that, but of course it will not guarantee that the code is not broken.
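For what it's worth, a minimal sketch of that kind of stress test using the vert.x MqttClient API; the port, host, topic and iteration count are made-up illustration values, and there is no guarantee this actually hits the window:

```java
import io.netty.handler.codec.mqtt.MqttQoS;
import io.vertx.core.Vertx;
import io.vertx.core.buffer.Buffer;
import io.vertx.mqtt.MqttClient;

public class ConnectThenPublishImmediately {

  public static void main(String[] args) {
    Vertx vertx = Vertx.vertx();

    // Repeat the connect-then-publish cycle many times, publishing as soon as
    // the CONNACK arrives, to try to hit the suspected window before the
    // server has set its connected flag.
    for (int run = 0; run < 1000; run++) {
      MqttClient client = MqttClient.create(vertx);
      client.connect(1883, "localhost", conAck -> {
        if (conAck.succeeded()) {
          client.publish("telemetry", Buffer.buffer("hello"),
              MqttQoS.AT_MOST_ONCE, false, false);
          client.disconnect();
        }
      });
    }
  }
}
```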
the more I think about it, the less I believe it is because of lacking synchronization or field visibility. The MQTT server's handlers all run on the same event loop, so the problem is probably more related to sending the CONNACK packet to the client via Netty (which probably sends it on a different thread than the event loop thread). In this case, simply setting the connected flag before sending the packet will probably already do the trick. WDYT?
@sophokles73 I think the best would be a reproducer so we can check if that's a race condition or something else
@vietj By reproducer you mean a piece of code that demonstrates the problem (reliably)? If so, that is what I meant by ...
How can I interrupt/pause the execution of ...
yes, and that exhibits the failure
As indicated above, I have no clue how I could do that based on the existing code. Do you have any idea?
I am closing this issue because I am not able to create a reproducer and I am also no longer convinced that there exists a race condition (at least in theory, vert.x should prevent the situation described above from happening).
thanks @sophokles73, don't hesitate to reopen if you have more info
I have run into this issue again and I now also have the log output which includes the names of threads:
It is a log of a client (a unit test case) interacting with a server (running in a Docker container). The server's log statements are prefixed with MQTT.
The thread names seem to support my original assumption that the publish handler in the MQTT server is not running on the same thread as the connection handler. The MQTT server is deployed as a Verticle, thus I believe that there is not much potential for starting the server wrongly, but I might be mistaken ...
can you provide a reproducer project?
are you running this in a verticle? can you show the code that starts the server?
This is part of the integration test suite we are running for the Eclipse Hono project. The error does not occur reliably; it happens only once in a while (that's why I suspect a race condition).
how can I reproduce it?
The server is running as a verticle. The client is run as a JUnit test using the ...
The bindInsecureMqttServer method is invoked as part of the Verticle's start() method ...
You can check out the Eclipse Hono project and run the integration tests. However, there is no way to reliably reproduce the issue, i.e. you may or may not run into the problem ...
can you give instructions here about the git repo to clone and the command to execute?
Sure, the repo is https://github.com/eclipse/hono |
I've been able to install the project but when I try to run the test I get:
have you been following the Getting Started guide and have you built the Docker images on your local Docker daemon?
I'm able to run the tests now; I will try until I see this in the console:
does it make the test fail as well?
I've seen this failure after 3 runs:
is it related to this case?
You are saying that the MQTT-prefixed log is from the server, however I can see:
MQTT 14:24:26.449 [vert.x-eventloop-thread-3] INFO o.e.h.c.i.AbstractRequestResponseClient - enabling caching of responses from registration/27b2ce55-7e45-4327-9e3b-5e7197f86fbf
This is from the client, can you clarify?
No, this happens when the client sends more messages (using QoS 0) during the test run than the server can handle at a given time. However, the test simply ignores this.
No, all of this is from the MQTT server component. Hono consists of multiple (micro-)services; the MQTT adapter is one of them. When an MQTT client connects, it calls out to several other services to e.g. authenticate the client, assert the client's registration status and finally forward the published data to downstream consumers. The interesting thing (again) is that the statements have been issued from different threads, isn't it? The MQTT adapter is deployed as a single Verticle which starts the MQTT server and registers a connection and publish handler. Shouldn't all these handlers then run on the same (single) thread, i.e. the one that the verticle is running on?
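For reference, a minimal sketch of the setup being described, using the vert.x MqttServer API in a verticle; the class name, port and log statements are illustrative only, not Hono's actual adapter code. With this kind of setup one would indeed expect both handlers to be dispatched on the verticle's event loop:

```java
import io.vertx.core.AbstractVerticle;
import io.vertx.mqtt.MqttServer;

public class MqttAdapterVerticle extends AbstractVerticle {

  @Override
  public void start() {
    MqttServer server = MqttServer.create(vertx);

    server.endpointHandler(endpoint -> {
      // expected to run on this verticle's event loop thread
      System.out.println("CONNECT handled on " + Thread.currentThread().getName());

      endpoint.publishHandler(message -> {
        // expected to run on the same event loop thread
        System.out.println("PUBLISH handled on " + Thread.currentThread().getName());
      });

      // sessionPresent = false: accept the connection, sending the CONNACK
      endpoint.accept(false);
    });

    server.listen(1883, ar -> {
      if (ar.succeeded()) {
        System.out.println("MQTT server listening on port " + ar.result().actualPort());
      }
    });
  }
}
```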
In the meantime I have seen this happening twice again on our Travis instance. Once during the execution of the EventMqttIT (publishing messages using QoS 1):
and again during execution of the TelemetryMqttQoS1IT (publishing telemetry messages using QoS 1):
Do we agree that the different handlers registered when starting the MQTT server should actually all be running on the same thread? Maybe we are doing something wrong in registering the handlers? Or is there in fact a problem in how MqttServerImpl registers the handlers with Netty?
@vietj any news on this?
no time to look at it unfortunately :-( , I'll do my best to have a look
a simple reproducer would have made things easier
@vietj can you at least state your opinion on my latest comment?
yes @sophokles73, as far as I remember, when I looked at the code I think I found callbacks that were registering things in vertx from an external thread (coming from Spring); they should actually run on the verticle context to register things, to avoid several event loops, for instance ...
you can read that to understand more: https://github.com/vietj/vertx-materials/blob/master/src/main/asciidoc/output/Demystifying_the_event_loop.adoc
so you want to make sure that all registrations are done using the correct event loop
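A rough sketch of that advice, assuming a callback that is invoked on some external (non-vert.x) thread, e.g. by a Spring-managed executor; `onExternalResult` and its parameters are hypothetical and only illustrate hopping back onto the verticle's context before touching vert.x state:

```java
import io.vertx.core.Context;

public class ContextHopping {

  // Hypothetical callback invoked by an external (non-vert.x) library thread.
  // 'verticleContext' would have been captured in the verticle's start()
  // method via vertx.getOrCreateContext().
  static void onExternalResult(Context verticleContext, String result) {
    // Do not complete futures or register handlers directly here: this code
    // may be running on a foreign thread or a different event loop.
    verticleContext.runOnContext(v -> {
      // Back on the verticle's event loop: safe to interact with the
      // MqttEndpoint, futures and handler registrations.
      System.out.println("handled " + result + " on " + Thread.currentThread().getName());
    });
  }
}
```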
this sounds very reasonable and in fact might already be the solution to our problem. I will check and see if I can make sure to register the handlers all on the same (event loop) thread and let you know ...
@vietj ok, the problem is indeed related to the fact that the connection handler that we register for the MqttServer invokes other services which run on a different event-loop thread. When this other service completes the result future held by the connection handler, then the CONNACK packet is sent on that other service's context (and underlying thread). The MqttEndpoint's connected flag is then set to true ...
@sophokles73 can you provide a simple reproducer case based on this reasoning? In general we don't use volatile but instead we synchronize on the connection, that's why a reproducer would help me better understand the case
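To illustrate what "synchronize on the connection" could look like in principle, here is a hedged sketch; the field and method names are assumptions and this is not the actual MqttEndpointImpl code, just the general pattern of guarding the flag with the same lock on both the accept path and the publish path:

```java
// Sketch only: names are illustrative, not the real MqttEndpointImpl.
class EndpointSketch {

  private final Object conn = new Object(); // stands in for the connection object
  private boolean connected;

  void accept() {
    synchronized (conn) {
      connected = true;
      // ... write the CONNACK packet while holding the lock ...
    }
  }

  void handlePublish() {
    synchronized (conn) {
      if (!connected) {
        throw new IllegalStateException("connection not accepted yet");
      }
      // ... process the published message ...
    }
  }
}
```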
awesome, I believe we can get a fix for 3.5.3
I think the test you created is incorrect because the check has ..., so it does not really reproduce what we are trying to test.
I've been trying to reproduce this bug with a real case and it's quite impossible. I see that when ... The issue ...
I've also tried to run the hono reproducer a lot of times and can't get it reproduced. So for 3.5.3 I will fix the synchronisation and it should fix the issue even if it is hard to reproduce.
actually I was able to reproduce it once. I will still work on the same fix and see if I can run it many times without reproducing it.
Do you mean endpoint is null? I agree that synchronizing access to the connected field should solve this problem. Have you checked if the test case then succeeds?
null as far as I remember. I'm currently running IT tests intensively to check the fix.
can you test the 3.5 branch on your side if I push the fix?
I can use it for doing development and see if the problem occurs again when running integration tests. However, I will also take a look at the unit test.
sure thing
I am connecting a client using vertx-mqtt to a server implemented on top of vertx-mqtt. The client is implemented to connect to the server, wait for the server's CONNACK and then start publishing some messages.
Every once in a while, the server runs into the following problem when the client sends its first message:
I suspect a race condition in MqttEndpointImpl between the connect handler and the publish handler. The former sends the CONNACK packet to the client before it sets its connected flag to true. The latter then checks if the client is connected before processing the published message. What seems to happen is that the server sends the CONNACK packet and is then interrupted before setting the connected flag to true. The client then sends its first message and the connected check in the publish handler fails ... My understanding is that access to the connected flag should either be synchronized, or the connected flag should be declared volatile and the connect handler should set it to true before sending the CONNACK packet ...
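As a rough sketch of the fix proposed above (again, not the actual vertx-mqtt source; method and field names are assumptions): declare the flag volatile and flip it before the CONNACK is written, so that a PUBLISH arriving right after the CONNACK always observes the endpoint as connected:

```java
// Illustrative only; names are assumptions, not MqttEndpointImpl.
class EndpointOrderingSketch {

  private volatile boolean connected;

  void accept() {
    // set the flag first ...
    connected = true;
    // ... then send the CONNACK, so a client that publishes immediately after
    // receiving the CONNACK cannot observe connected == false
    writeConnAck();
  }

  boolean isConnected() {
    return connected;
  }

  private void writeConnAck() {
    // placeholder for writing the MQTT CONNACK packet to the channel
  }
}
```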