-
Notifications
You must be signed in to change notification settings - Fork 86
Exception in auto-generated serialization code #20
Comments
The code that seems to be crashing is auto-generated serialization code. Naiad should be producing this code as a file as a byproduct of the compilation process, and if you can snag a copy for us to check out that would help a bunch. Although the filenames will be a bit junky, if you can find the one with the corresponding method ParsedInputPono.TrySerializeMany(), that would be super helpful! |
And just to double-check, is the "plain old .NET object" a struct or a class? A class may cause problems, as I believe the generated code just stuffs the fields with deserialized values, and doesn't go out of its way to ensure that the destination is allocated. |
I've checked, the "plain old .NET object" is a struct. |
So I'm trying to figure this out. On OS X it seems that it uses somewhat randomized temporary folder names, in Logging.LogLevel = LoggingLevel.Debug;
Logging.LogStyle = LoggingStyle.Console; Then it should produce some output looking like:
and if you pull out the directory from that, in this case
you should be able to see the .cs files it has generated. I did this with release_0.4, as a caveat, but I think things should be pretty similar (although the serialization code has changed, I think our logging of it probably hasn't). |
Thanks for your help, I did not think about looking in I just noticed there are unsafe section in the generated code: I should probably also mention we're experiencing occasional segfaults of the mono runtime (that we initially ascribed to bugs in mono itself). |
That's a bunch of code! So, the only objects that might be null dereferences in TrySerializeMany (upon inspection) are the string fields of the struct (assuming we haven't passed a SubArray in with a null array; this could be possible, but surprising). It seems that the accesses to .Length, which are the things that would assert, are all guarded by testing that the string is non-null. The two things I can think of are:
I'll chat with Michael about a good way to work on this. I'm not allowed (I think) to contribute to the repo, for reasons of MSFT not accepting IP from non-MSFT folks. But, we'll try to get you some patched code with some diagnostics somehow. |
I am happy to help. There is an implicit contract that you don’t mutate a buffer (or object) after you have sent it in a Naiad channel, and I suppose it’s conceivable your code is violating that. However if you are using structs, and just passing them to a VertexOutputBufferForTime object, it’s hard to see how it could be your fault since the struct will be copied, not passed by reference. As Frank suggests, this might be a Naiad bug… If I were using Visual Studio for this I would ask it to break on thrown exceptions, and run the program single threaded (-t 1) in the debugger, at which point it would break at the time of the dereference in the generated code. I am afraid I have never used Mono: is there any way to do something similar? If the exception happens when you run with a single process and a single thread, that definitely eliminates some possible bugs with mutating objects. Is that the case, or does it only happen when you run multi-threaded? From: Frank McSherry [mailto:[email protected]] That's a bunch of code! So, the only objects that might be null dereferences in TrySerializeMany (upon inspection) are the string fields of the struct (assuming we haven't passed a SubArray in with a null array; this could be possible, but surprising). It seems that the accesses to .Length, which are the things that would assert, are all guarded by testing that the string is non-null. The two things I can think of are:
I'll chat with Michael about a good way to work on this. I'm not allowed (I think) to contribute to the repo, for reasons of MSFT not accepting IP from non-MSFT folks. But, we'll try to get you some patched code with some diagnostics somehow. — |
Sorry, as Frank points out you definitely won’t have the bug single-process single-thread, since it won’t do serialization in that case… However 2-process single-thread each vs 2-process 2-thread each is still an interesting distinction. From: Michael Isard I am happy to help. There is an implicit contract that you don’t mutate a buffer (or object) after you have sent it in a Naiad channel, and I suppose it’s conceivable your code is violating that. However if you are using structs, and just passing them to a VertexOutputBufferForTime object, it’s hard to see how it could be your fault since the struct will be copied, not passed by reference. As Frank suggests, this might be a Naiad bug… If I were using Visual Studio for this I would ask it to break on thrown exceptions, and run the program single threaded (-t 1) in the debugger, at which point it would break at the time of the dereference in the generated code. I am afraid I have never used Mono: is there any way to do something similar? If the exception happens when you run with a single process and a single thread, that definitely eliminates some possible bugs with mutating objects. Is that the case, or does it only happen when you run multi-threaded? From: Frank McSherry [mailto:[email protected]] That's a bunch of code! So, the only objects that might be null dereferences in TrySerializeMany (upon inspection) are the string fields of the struct (assuming we haven't passed a SubArray in with a null array; this could be possible, but surprising). It seems that the accesses to .Length, which are the things that would assert, are all guarded by testing that the string is non-null. The two things I can think of are:
I'll chat with Michael about a good way to work on this. I'm not allowed (I think) to contribute to the repo, for reasons of MSFT not accepting IP from non-MSFT folks. But, we'll try to get you some patched code with some diagnostics somehow. — |
In the meantime, I'll start a review of the places we use buffer pooling! |
So a quick look at our buffer pooling suggests:
Another helpful bit of information would be to build the program with debug information, and run it with I'll keep looking at how we're using the pooling to see if anything leaps out at me, but it's mostly sane looking so far, without additional clues. |
Additional information: there is a counterpart to Michael's implicit contract about not changing something once you've sent it, which is that you should not hold on to a |
Thank you both for your help.
This struct is only used at the very beginning of the computation, as follows:
where
While I believe
The Exception does not happen on local test runs, for which we use a reduced input dataset and where debugging is feasible. It took a few tries for us to be able to run with some stability with the full input dataset (we're pushing around 50k input lines every second, with a pretty heavy computation) and it is still unclear whether the system is able to sustain the load for an extended period of time. If avoiding messing with the payload does not solve the issue, I'll try and run with Thank you again for your help. I know I haven't addressed all your comments, I'll get back to you tomorrow morning (we're in UTC+1). |
In fact, just modifying payload[i].field17 shouldn’t be an issue: the memory won’t be reused until after OnRecv returns so the danger is holding on to the memory after that, not within the call, and as you say copying into the list seems like it should prevent problems. It’s pretty mysterious to me what could be happening. Looking back at your original stack trace it’s the send from SelectMany that is crashing, so I don’t think it’s likely to be a problem with NextStep. And SelectMany has been used a lot :) It would certainly be very helpful to know what line is generating the null dereference. It looks as if the problem should be triggered by a program that just does csvLineInput.SelectMany (csvLine => RenamedStruct.createAndSetNull(csvLine)).Distinct(); so it might be worth testing with that just to reduce the sources of error. From: Andrea Lattuada [mailto:[email protected]] Thank you both for your help. it’s conceivable your code is violating that. However if you are using structs, and just passing them to a VertexOutputBufferForTime object, it’s hard to see how it could be your fault since the struct will be copied, not passed by reference. As Frank suggests, this might be a Naiad bug. This struct is only used at the very beginning of the computation, as follows: csvLineInput.SelectMany (csvLine => RenamedStruct.createAndSetNull(csvLine))
where csvLineInput is a BatchedDataSource and SelectMany is Naiad's Lindi.SelectMany. there is a counterpart to Michael's implicit contract about not changing something once you've sent it, which is that you should not hold on to a Message<S, T> or its .payload after OnRecv returns. Naiad assumes that it owns the message and will go off and use it for something else (including stashing it in a pool). NextStep.OnRecv is in fact modifying all the payloads and then storing them in a List for later use (they will be combined and flushed in a later OnNotify). We have a List items and we run something equivalent to:
While I believe items.Add(message.payload[i]); should be fine (as the struct is copied when passed as an argument), message.payload[i].field17 = message.time.epoch; is most likely an issue. I'll change our code so we copy the struct before modifying it. If I were using Visual Studio for this I would ask it to break on thrown exceptions, and run the program single threaded (-t 1) in the debugger, at which point it would break at the time of the dereference in the generated code. I am afraid I have never used Mono: is there any way to do something similar? Another helpful bit of information would be to build the program with debug information, and run it with --debug flag, which the mono folks suggest will reveal line numbers (http://www.mono-project.com/docs/debug+profile/debug/). The Exception does not happen on local test runs, for which we use a reduced input dataset and debugging is feasible. It took a few tries for us to be able to run with some stability with the full input dataset (we're pushing around 50k input lines every second, with a pretty heavy computation) and it is still unclear whether the system is able to sustain the load for an extended period of time. If avoiding messing with the payload does not solve the issue, I'll try and run with --debug and see if we can stay up long enough to see the error again. Thank you again for your help. I know I haven't addressed all your comments, I'll get back to you tomorrow morning (we're in UTC+1). — |
In fact, copying
Unfortunately mono exited cleanly and Note that this code is successfully (de)serializing tens of thousands of structs before failing and the failure does not seem deterministic. |
Thanks, this is very helpful. In the interest of you making progress in the meantime, I recommend disabling the buffer pooling by changing the lines in the Message struct's allocate and release methods. If we're right that it is a buffer pooling issue (and, it seems likely, given that I can't see why that could would assert otherwise) this should avoid it for you at the cost of spending a bit more time playing with the heap. I'll see if I can get some better info about what might be the cause, or some more diagnostics to put in there to help us pinpoint the root cause. |
Actually, I'm sorry, I've changed my mind. :) Reading the code more closely, it seems that the first thing that it does around the loop is make a copy of the struct. So, even if there was a buffer pooling issue, at this point we would have read some struct and it should be immutable for the rest of the loop. The only thing I can think otherwise is unsafe code, and it is totally possible that some of our unsafe stuff has glitches, and accidentally stomps on memory it shouldn't. Are you still seeing roughly as many mono segfaults as before? |
A few things are still confusing me. I don’t understand Mono or precisely how its exceptions are generated, but naively a NullReferenceException at that line could only be triggered if value.field11 is null. The only way that could happen given the control flow is if another thread is mutating value.field11. But since value is stack allocated, it’s not even heap corruption… It is conceivable that there is a memory error that repeatedly trashes a thread stack in the same place but never causes other crashes, but it seems quite hard to construct. Of course it may be that the Mono exception doesn’t exactly mean what I think. I am surprised by the stack trace you included: did you delete some lines for obfuscation purposes? Otherwise there must have been some optimizations (the PipelineChannel holding a string calls straight into the VertexOutputBuffer holding a RenamedStruct) which might be misleading us. Is this a debug or release build? My next step would be to simplify as much as possible. Can you try the program that just goes straight from the SelectMany to a Distinct and see if you can trigger the error? Otherwise, are there any other parts of the program that do anything that ought to make us suspicious, that might be causing memory corruption that shows up here? Do you use any serialized objects anywhere or are they all structs? Do you happen to use a struct containing a short anywhere? (Frank just found what looks like a bug in the short serialization code.) Also, I would definitely try running –t 1 in each process: if you can cause a crash that way it rules out a whole bunch of potential bugs. From: Frank McSherry [mailto:[email protected]] Actually, I'm sorry, I've changed my mind. :) Reading the code more closely, it seems that the first thing that it does around the loop is make a copy of the structhttps://gist.github.com/utaal/3de8ccb84ba215a76759#file-generated-renamedstruct-cs-L416. So, even if there was a buffer pooling issue, at this point we would have read some struct and it should be immutable for the rest of the loop. The only thing I can think otherwise is unsafe code, and it is totally possible that some of our unsafe stuff has glitches, and accidentally stomps on memory it shouldn't. Are you still seeing roughly as many mono segfaults as before? — |
Our jobs last around 20 minutes and we can't parallelize, so I can't really say I have a good sample size but yes - mono is still randomly segfaulting.
I've double-checked and the stack trace is verbatim with the exception of the name of the struct.
Working on these now. |
Running with
I'll try running again to see if I can get a non-native Exception. |
One more run ( We have access to a few Windows VMs, I'll try and see if we can run this on .net to potentially rule out a bug somewhere in mono. |
If you could get the problem to manifest in a Windows VM that would definitely help, since at this point the program is simple enough that we should be able to repro it with purely synthetic data and I can debug from my end. From: Andrea Lattuada [mailto:[email protected]] One more run (-t 1) resulted in a new inexplicable NullPointerException, this time in our code: https://gist.github.com/utaal/ae01eaa91ef6bccaba51 We have access to a few Windows VMs, I'll try and see if we can run this on .net to potentially rule out a bug somewhere in mono. — |
Hi,
The following exception occasionally takes down our computation. We don't really know how to reproduce it in isolation as it happens only when running on large input datasets and after a few minutes of execution. I am not yet sure that our computation is fully deterministic when run multiple times on the same input (we're verifying this now) so the fact that this happens occasionally may be due to non-deterministic behaviour in our code.
"ParsedInputPono" (renamed) is a plain-old-NET-object with a few public fields (of type
long
,int
,string
) and a constructor that only copies the arguments to the respective fields.We're running a debug build on linux with mono (I'll try to rerun with
--debug
to get line numbers) on about 16 machines with 4 workers each (-t 4).Can you help us out?
Thank you.
The text was updated successfully, but these errors were encountered: