TensorFlow - hang in google/protobuf/pyext/_message.so at exit


This is TensorFlow 1.0.1, installed via pip. It runs via an embedded CPython (libpython).

Sometimes (in maybe 30% of the runs), it hangs in Py_Finalize(); see the backtrace:

/work/asr2/zeyer/sprint-executables/20160902.235443.fad8965.linux-x86_64-standard/flf/flf-tool.linux-intel-standard(_zn17assertionsprivate15safe_stacktraceei+0x21)[0xc5b891]
/work/asr2/zeyer/sprint-executables/20160902.235443.fad8965.linux-x86_64-standard/flf/flf-tool.linux-intel-standard[0xc5b8ef]
/u/zeyer/tools/glibc217/libpthread.so.0(+0x113d0)[0x2b6d89bad3d0]
/u/zeyer/tools/glibc217/libpthread.so.0(raise+0x29)[0x2b6d89bad2a9]
/u/zeyer/py-envs/py2-ubuntu16/local/lib/python2.7/site-packages/faulthandler.so(+0x3198)[0x2b6dc2372198]
/u/zeyer/tools/glibc217/libpthread.so.0(+0x113d0)[0x2b6d89bad3d0]
/u/zeyer/py-envs/py2-ubuntu16/local/lib/python2.7/site-packages/google/protobuf/pyext/_message.so(+0xaa943)[0x2b6dc14f0943]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x160f6b)[0x2b6d8b23af6b]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0xc8f0e)[0x2b6d8b1a2f0e]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x15d747)[0x2b6d8b237747]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyDict_SetItem+0x7b)[0x2b6d8b23becb]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(_PyModule_Clear+0xb5)[0x2b6d8b278565]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyImport_Cleanup+0x437)[0x2b6d8b2280e7]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(Py_Finalize+0xfe)[0x2b6d8b1fed9e]
/work/asr2/zeyer/sprint-executables/20160902.235443.fad8965.linux-x86_64-standard/flf/flf-tool.linux-intel-standard(_zn6python11initializer19atexituninithandlerev+0x2e)[0xff80de]
/u/zeyer/tools/glibc217/libc.so.6(+0x39fe8)[0x2b6d8bc39fe8]
/u/zeyer/tools/glibc217/libc.so.6(+0x3a035)[0x2b6d8bc3a035]
/u/zeyer/tools/glibc217/libc.so.6(__libc_start_main+0xf7)[0x2b6d8bc20837]
/work/asr2/zeyer/sprint-executables/20160902.235443.fad8965.linux-x86_64-standard/flf/flf-tool.linux-intel-standard[0x7d6991]

Or from gdb:

(gdb) bt full
#0  0x00002b6dc14f0943 in std::tr1::_hashtable<google::protobuf::descriptorpool const*, std::pair<google::protobuf::descriptorpool const* const, google::protobuf::python::pydescriptorpool*>, std::allocator<std::pair<google::protobuf::descriptorpool const* const, google::protobuf::python::pydescriptorpool*> >, std::_select1st<std::pair<google::protobuf::descriptorpool const* const, google::protobuf::python::pydescriptorpool*> >, std::equal_to<google::protobuf::descriptorpool const*>, google::protobuf::hash<google::protobuf::descriptorpool const*>, std::tr1::__detail::_mod_range_hashing, std::tr1::__detail::_default_ranged_hash, std::tr1::__detail::_prime_rehash_policy, false, false, true>::erase (
    __k=@0x7ffd1bbea740: 0x8269780, this=0x2b6dc1826e40 <google::protobuf::python::descriptor_pool_map>)
    at /opt/rh/devtoolset-2/root/usr/include/c++/4.8.2/tr1/hashtable.h:1041
        __slot = <optimized out>
        __saved_slot = <optimized out>
        __code = 136746880
        __n = 0
        __result = 0
#1  google::protobuf::python::cdescriptor_pool::dealloc (self=0x2b6dc0d86880)
    at google/protobuf/pyext/descriptor_pool.cc:152
No locals.
#2  0x00002b6d8b23af6b in ?? () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
No symbol table info available.
#3  0x00002b6d8b1a2f0e in ?? () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
No symbol table info available.
#4  0x00002b6d8b237747 in ?? () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
No symbol table info available.
#5  0x00002b6d8b23becb in PyDict_SetItem () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
No symbol table info available.
#6  0x00002b6d8b278565 in _PyModule_Clear () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
No symbol table info available.
#7  0x00002b6d8b2280e7 in PyImport_Cleanup () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
No symbol table info available.
#8  0x00002b6d8b1fed9e in Py_Finalize () from /usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0
No symbol table info available.
#9  0x0000000000ff80de in python::initializer::atexituninithandler() ()
No symbol table info available.
#10 0x00002b6d8bc39fe8 in ?? () from /u/zeyer/tools/glibc217/libc.so.6
No symbol table info available.
#11 0x00002b6d8bc3a035 in exit () from /u/zeyer/tools/glibc217/libc.so.6
No symbol table info available.
#12 0x00002b6d8bc20837 in __libc_start_main () from /u/zeyer/tools/glibc217/libc.so.6
No symbol table info available.
#13 0x00000000007d6991 in _start ()
No symbol table info available.

I.e. it happens in _PyModule_Clear, inside google/protobuf/pyext/_message.so, which is why I think it is TensorFlow-related.

In the cases where it does not hang, I see this output:

Exception AttributeError: AttributeError("'NoneType' object has no attribute 'raise_exception_on_not_ok_status'",) in <bound method Session.__del__ of <tensorflow.python.client.session.Session object at 0x2afd625b12d0>> ignored

I asked upstream on TF, and it was suggested that I post it here.

Any idea why it might hang, and how to resolve this?

Note that the crash is happening inside a callback registered via std::atexit. My guess is that the problem is that some stuff from Google or std has already been cleaned up before we call Py_Finalize in our atexit handler, which leads to the crash. I think that should not happen, though.

Anyway, I kind of worked around the problem by not using std::atexit and using my own exit handler logic instead (which will not work if I directly use exit() anywhere).
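
The workaround could look roughly like the following. This is only a minimal sketch, not the actual code from the application: the class ExitHandlers and its methods add() and run() are hypothetical names made up for illustration. The idea is that handlers are kept in an ordinary registry and run explicitly before main() returns, while all shared libraries and their static objects are still intact.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical replacement for std::atexit: handlers live in a plain
// registry and only run when the application explicitly asks for it.
class ExitHandlers {
public:
    // Register a handler. Handlers run in reverse order of registration,
    // mirroring the ordering that std::atexit uses.
    void add(std::function<void()> handler) {
        handlers_.push_back(std::move(handler));
    }

    // Run all pending handlers exactly once and return how many were run.
    // Each handler is removed before it is called, so a second call to
    // run() does nothing.
    std::size_t run() {
        std::size_t count = 0;
        while (!handlers_.empty()) {
            std::function<void()> handler = std::move(handlers_.back());
            handlers_.pop_back();
            handler();
            ++count;
        }
        return count;
    }

private:
    std::vector<std::function<void()>> handlers_;
};
```

In the embedding application, the handler that calls Py_Finalize() would be registered with add() right after Py_Initialize(), and run() would be called as the last statement of main(). As noted above, a direct call to exit() bypasses this registry, so the handlers are skipped in that case.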

