https://github.com/MicrosoftResearch/Naiad/issues/20 #21
Comments
Hello, can you check that you are running the most recent version? There was a race condition in approximately that part of the code which was fixed fairly recently (1-2 weeks ago). Assuming it is not that, it also looks like an issue we've recently seen reported by others. Line 313 is trying to write into a payload array, and the array is never supposed to be un-allocated or inappropriately sized. Some other folks have seen what seems like memory corruption, and we haven't tracked down whether it is Naiad's unsafe code, or Mono, or what is going on. You'll notice that all access to I'll take a closer look, and keep you up to date with what we learn in the other case. We are trying to get a reproducible test case, or something reproducing on a non-Mono runtime, just to narrow down whose fault it is and where we can look to fix it. For clarity, which OS are you hosting in the VM? |
Ok, reviewed a bit of code, and I suspect this is totally our fault, not mono or any other nice people. Let me explain: The This is all well and good, but nobody told the general purpose vertex flushing code about that lock. So, if anyone calls If you would be so kind, would you consider adding this.AddOnFlushAction (() => Console.Error.WriteLine ("Flushing UpdateAggregator?!?")); in between lines 170 and 171 of In the meantime, I'll start to figure out a fix, for example just not registering Thanks very much for the bug report! |
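The mismatch described in the comment above, where one code path takes a lock and another path touches the same buffer without it, can be sketched in isolation. Every name here (`GuardedBuffer`, `Send`, `Flush`, `FlushRaceDemo`) is hypothetical and is not one of Naiad's actual types. The sketch only shows the safe shape: every path that writes to or replaces the payload array holds the same lock.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for an output buffer; not Naiad's real class.
public sealed class GuardedBuffer
{
    private readonly object gate = new object();
    private int[] payload = new int[256];
    private int count = 0;
    private long flushed = 0;

    // Total number of records handed to a flush so far.
    public long Flushed { get { lock (gate) { return flushed; } } }

    // Writer path: appends under the lock and flushes when the array is full.
    public void Send(int record)
    {
        lock (gate)
        {
            payload[count++] = record;
            if (count == payload.Length) FlushLocked();
        }
    }

    // External flush path: takes the same lock, so it can never swap the
    // array out from under a Send that is halfway through a write.
    public void Flush()
    {
        lock (gate) { FlushLocked(); }
    }

    // Caller must hold the lock.
    private void FlushLocked()
    {
        flushed += count;
        payload = new int[256];
        count = 0;
    }
}

public static class FlushRaceDemo
{
    // Runs several writers plus one concurrent flusher and returns the
    // number of records that were flushed in total.
    public static long Run(int writers, int perWriter)
    {
        var buffer = new GuardedBuffer();
        var tasks = new Task[writers + 1];
        for (int w = 0; w < writers; w++)
        {
            tasks[w] = Task.Run(() =>
            {
                for (int i = 0; i < perWriter; i++) buffer.Send(i);
            });
        }
        tasks[writers] = Task.Run(() =>
        {
            for (int i = 0; i < 1000; i++) buffer.Flush();
        });
        Task.WaitAll(tasks);
        buffer.Flush(); // drain whatever is still buffered
        return buffer.Flushed;
    }
}
```

Because every access goes through the same lock, `FlushRaceDemo.Run(4, 100000)` accounts for all 400,000 records. Removing the `lock` from `Flush` is the kind of change that would reintroduce lost or corrupted writes.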
Hello Frank, I can put my code here. It is a really simple one. Our use case is benchmarking, so running big data sets is very important for us:
Main:
|
That seems pretty simple, and I have a Windows 8 VM! Is the input data something that can be generated / shared? |
It is the TPC-H lineitem table for the big data set. Download: https://www.dropbox.com/s/3atragjp6pr5d9r/lineitem_big.csv?dl=0 |
Thanks. I've grabbed it. I don't suppose you would be willing to share your project too, so that I can just run it and see it explode (and not have to re-implement things like LINEITEM)? Thanks! |
I tried to create a simple Naiad program in which the error occurs in my VM: https://www.dropbox.com/s/pw2lp64c2k8gp0c/TestNaiad.zip?dl=0 We need to run the program with the parameter -t 4. Best, |
Thanks very much! I'll fire it up and report back. |
Well, actually I think in the short term I'll let Michael see if he can reproduce it. My copy of Visual Studio has expired, the "Community" version installer errors out with "can't find package", and ... yeah. MSFT. I'll see if I can get it up and running on Mono and have it explode similarly, but we'll need to wait for Michael to revive from his travels otherwise. |
Just a quick comment that might help in the meantime: the program you've sent uses pretty small batches (10 records), resulting in 60,000 epochs. Naiad currently scales pretty badly (linearly) with the number of outstanding work items, so it's taking maybe 60,000x longer than it should. If you change the batch size to 1,000, the program completes in about a second on Mono with four threads. That isn't a solution to the bug, but it might help you work around it for the moment. I'm back to trying to get it to reproduce (how long before the crash for you, usually? It looks like it will run for a while due to the above issue). |
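The relationship between batch size and epoch count mentioned above is simple arithmetic: 60,000 epochs of 10 records implies roughly 600,000 records, which becomes 600 epochs at a batch size of 1,000. A minimal sketch, assuming one batch is supplied per epoch; the `Batching` helper and its method names are illustrative only and are not part of Naiad or of the attached program.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for choosing a batch size; not part of Naiad.
public static class Batching
{
    // Number of batches needed to cover all records (ceiling division).
    public static int EpochCount(int records, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        return (records + batchSize - 1) / batchSize;
    }

    // Lazily splits the source into consecutive arrays of at most batchSize
    // items. The last batch may be shorter than the others.
    public static IEnumerable<T[]> Batches<T>(IReadOnlyList<T> source, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        for (int start = 0; start < source.Count; start += batchSize)
        {
            int length = Math.Min(batchSize, source.Count - start);
            var batch = new T[length];
            for (int i = 0; i < length; i++) batch[i] = source[start + i];
            yield return batch;
        }
    }
}
```

Each array produced by `Batches` would then be handed to the computation as the input for one epoch, so a larger `batchSize` directly reduces the number of outstanding work items.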
Thanks very much! It works when we increase the batch size. However, there is a strange behavior that is difficult for me to understand: in all cases, running with more threads is slower than running with a single thread. As far as I know, -t 4 will create at least 4 vertices, and it should be much faster than running with one thread, which uses only one vertex. |
Hi, I get the same running time for both
For something with very little computation, most of the time ends up being spent in data ingress. All but 0.5s is spent in public static LINEITEM parseLineItem(string s) { /* per-record code from getLineItem() */ } the time improves with two threads to
When I take it up to four threads, it slows down, which makes sense for me, at least, because I have two real cores and two hyper threaded cores. There is nothing in the code causing memory misses, so the hyper threading isn't buying me anything. Given that it takes about 1.3s for me to load up the strings in the first place, this is a difference of 6s -> 4s, which isn't horrible. If it is still bad on a machine with multiple cores (I'm not sure what you are using), let me know and I'll see if I can help out. As regards the crash, I let it run for about 45 minutes and ... nothing. Could you indicate the precise steps to reproduce? Like, Release/Debug build, if Debug, running with/without debugging (a separate thing from the build), etc. |
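Since most of the time in the comment above goes to turning strings into records, the per-record parse is the part worth spreading across cores. The sketch below shows that idea outside of Naiad, using PLINQ rather than Naiad operators, so it is a swapped-in technique and not the change made in the thread. `ParsedRow`, `Ingress`, the two-field layout, and the delimiter parameter are all assumptions; the real `LINEITEM` type and file format are not shown in the thread.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical two-field record; the thread's LINEITEM type is not shown.
public struct ParsedRow
{
    public long Key;
    public double Amount;
}

public static class Ingress
{
    // Per-record parse, kept free of shared state so it is safe to run
    // on several threads at once.
    public static ParsedRow ParseRow(string line, char delimiter)
    {
        string[] fields = line.Split(delimiter);
        return new ParsedRow
        {
            Key = long.Parse(fields[0], CultureInfo.InvariantCulture),
            Amount = double.Parse(fields[1], CultureInfo.InvariantCulture)
        };
    }

    // Parses all lines with a bounded number of worker threads while
    // keeping the output in the same order as the input.
    public static ParsedRow[] ParseAll(string[] lines, char delimiter, int threads)
    {
        return lines.AsParallel()
                    .AsOrdered()
                    .WithDegreeOfParallelism(threads)
                    .Select(line => ParseRow(line, delimiter))
                    .ToArray();
    }
}
```

Setting `threads` to the number of physical cores is a reasonable starting point, in line with the observation above that hyper-threaded cores did not help for this workload.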
I should also say, thread performance scaling might be all sorts of weird when using a virtual machine. It will depend a lot on how many cores the VM decides to use, for example. |
Hello,
We are running calculations on Naiad, and the computation fails on large datasets. With a single thread and small datasets, there is no problem. We are running Naiad in a virtual machine on a Mac.
Can you help us find the reason?
Thank you.
Error:
Logging initialized to console
00:07:57.5913840, Graph 0 failed on scheduler 1 with exception:
System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.Research.Naiad.Dataflow.VertexOutputBufferPerTime`2.Send(TRecord record) in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Frameworks\StandardVertices.cs:line 313
   at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateAggregator.ConsiderFlushingBufferedUpdates() in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Runtime\Progress\UpdateAggregator.cs:line 157
   at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateAggregator.OnRecv(Dictionary`2 deltas) in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Runtime\Progress\UpdateAggregator.cs:line 77
   at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateProducer.Start() in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Runtime\Progress\UpdateProducer.cs:line 92
   at Microsoft.Research.Naiad.Scheduling.Scheduler.DrainMessagesForComputation(Int32 computationIndex) in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Runtime\Scheduling\Scheduler.cs:line 343
00:07:57.5941910, Cancelling execution of graph 0, due to exception:
System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.Research.Naiad.Dataflow.VertexOutputBufferPerTime`2.Send(TRecord record) in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Frameworks\StandardVertices.cs:line 313
   at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateAggregator.ConsiderFlushingBufferedUpdates() in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Runtime\Progress\UpdateAggregator.cs:line 157
   at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateAggregator.OnRecv(Dictionary`2 deltas) in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Runtime\Progress\UpdateAggregator.cs:line 77
   at Microsoft.Research.Naiad.Runtime.Progress.ProgressUpdateProducer.Start() in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Runtime\Progress\UpdateProducer.cs:line 92
   at Microsoft.Research.Naiad.Scheduling.Scheduler.DrainMessagesForComputation(Int32 computationIndex) in c:\Users\khgl\Desktop\codebase\Naiad\Naiad\Runtime\Scheduling\Scheduler.cs:line 343