@vycezhong I ran this PR with our MXNet VGG-16 test to check for regressions. I used 2 worker nodes with 8 GPUs each, plus 2 server nodes. One of the server nodes core dumps, and it happens consistently. Is this something you've seen before? I didn't change the test to use gradient compression, so dmlc/ps-lite#168 shouldn't matter here.
[00:06:35] byteps/server/server.cc:430: BytePS server engine uses 16 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[00:06:35] byteps/server/server.cc:438: Enable engine scheduling for BytePS server
[00:06:35] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[00:06:35] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[00:06:35] src/van.cc:421: Bind to role=server, ip=xxxxxxx, port=48413, is_recovery=0
[00:06:35] src/./zmq_van.h:287: Start ZMQ recv thread
[00:06:35] src/van.cc:510: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=xxxxxxxx, port=48413, is_recovery=0 } }. THIS IS NOT DATA MSG!
[00:07:34] src/van.cc:535: 1 => 2147483647. Meta: request=0, timestamp=3, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=xxx.196, port=35657, is_recovery=0 role=server, id=8, ip=xxx.195, port=61601, is_recovery=0 role=server, id=10, ip=xxx.144, port=48413, is_recovery=0 role=worker, id=11, ip=xxx.142, port=29591, is_recovery=0 role=scheduler, id=1, ip=xxx.195, port=9000, is_recovery=0 } }. THIS IS NOT DATA MSG!
[00:07:34] src/van.cc:370: S[10] is connected to others
[00:07:35] src/van.cc:510: ? => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:07:35] src/van.cc:535: 1 => 10. Meta: request=0, timestamp=8, control={ cmd=BARRIER, barrier_group=-564201712 }. THIS IS NOT DATA MSG!
[00:07:35] src/van.cc:510: ? => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[00:07:35] src/van.cc:535: 11 => 10. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140723023324848, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=256 data_size=4
[00:07:35] src/van.cc:535: 9 => 10. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140724865464560, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=256 data_size=4
Segmentation fault (core dumped) bpslaunch
Originally posted by @pleasantrabbit in #225 (comment)