kubernetes - Cannot assign a device to node in distributed TensorFlow


I am trying to run distributed TensorFlow, following the Google Cloud ML example here.

I am running on a Kubernetes cluster (2 ps, 2 workers) and believe the environment variables are configured properly, but I get the following error:

2017-04-07t21:36:51.092443795z {"environment": "cloud", "cluster": {"ps": ["census-ps-0:5000", "census-ps-1:5000"], "worker": ["census-worker-0:5000", "census-worker-1:5000"], "master": ["census-worker-0:5000"]}, "task": {"type": "master", "inxex": 0}}
2017-04-07t21:36:51.092473871z {u'environment': u'cloud', u'cluster': {u'ps': [u'census-ps-0:5000', u'census-ps-1:5000'], u'worker': [u'census-worker-0:5000', u'census-worker-1:5000'], u'master': [u'census-worker-0:5000']}, u'task': {u'type': u'master', u'inxex': 0}}
2017-04-07t21:36:51.907203514z w tensorflow/core/platform/cpu_feature_guard.cc:45] tensorflow library wasn't compiled use sse3 instructions, these available on machine , speed cpu computations.
2017-04-07t21:36:51.907227466z w tensorflow/core/platform/cpu_feature_guard.cc:45] tensorflow library wasn't compiled use sse4.1 instructions, these available on machine , speed cpu computations.
2017-04-07t21:36:51.907231184z w tensorflow/core/platform/cpu_feature_guard.cc:45] tensorflow library wasn't compiled use sse4.2 instructions, these available on machine , speed cpu computations.
2017-04-07t21:36:51.907234415z w tensorflow/core/platform/cpu_feature_guard.cc:45] tensorflow library wasn't compiled use avx instructions, these available on machine , speed cpu computations.
2017-04-07t21:36:51.907237325z w tensorflow/core/platform/cpu_feature_guard.cc:45] tensorflow library wasn't compiled use avx2 instructions, these available on machine , speed cpu computations.
2017-04-07t21:36:51.907240325z w tensorflow/core/platform/cpu_feature_guard.cc:45] tensorflow library wasn't compiled use fma instructions, these available on machine , speed cpu computations.
2017-04-07t21:36:51.914365914z tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] initialize grpcchannelcache job master -> {0 -> localhost:5000}
2017-04-07t21:36:51.914383815z tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] initialize grpcchannelcache job ps -> {0 -> census-ps-0:5000, 1 -> census-ps-1:5000}
2017-04-07t21:36:51.914387511z tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] initialize grpcchannelcache job worker -> {0 -> census-worker-0:5000, 1 -> census-worker-1:5000}
2017-04-07t21:36:51.914974731z tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] started server target: grpc://localhost:5000
2017-04-07t21:36:54.784234307z tensorflow/core/distributed_runtime/master_session.cc:1012] start master session dd8a251a59872860 config:
2017-04-07t21:36:54.784259971z gpu_options {
2017-04-07t21:36:54.784263535z   per_process_gpu_memory_fraction: 1
2017-04-07t21:36:54.784266273z }
2017-04-07t21:36:54.784268677z
2017-04-07t21:36:54.861483497z export tf_config='{"environment": "cloud", "cluster": {"ps": ["census-ps-0:5000", "census-ps-1:5000"], "worker": ["census-worker-0:5000", "census-worker-1:5000"], "master": ["census-worker-0:5000"]}, "task": {"type": "master", "inxex": 0}}'
starting census: please lauch tensorboard see results: tensorboard --logdir=$model_dir
2017-04-07t21:36:54.86148432z traceback (most recent call last):
2017-04-07t21:36:54.861527172z   file "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
2017-04-07t21:36:54.861535317z     "__main__", fname, loader, pkg_name)
2017-04-07t21:36:54.861540705z   file "/usr/lib/python2.7/runpy.py", line 72, in _run_code
2017-04-07t21:36:54.861627932z     exec code in run_globals
2017-04-07t21:36:54.861641191z   file "/code/task.py", line 192, in <module>
2017-04-07t21:36:54.86166076z     learn_runner.run(generate_experiment_fn(**arguments), job_dir)
2017-04-07t21:36:54.861668307z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 106, in run
2017-04-07t21:36:54.861692382z     return task()
2017-04-07t21:36:54.861698247z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 459, in train_and_evaluate
2017-04-07t21:36:54.86177589z     self.train(delay_secs=0)
2017-04-07t21:36:54.86178479z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 281, in train
2017-04-07t21:36:54.861792289z     monitors=self._train_monitors + extra_hooks)
2017-04-07t21:36:54.861795862z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 280, in new_func
2017-04-07t21:36:54.861845229z     return func(*args, **kwargs)
2017-04-07t21:36:54.863930393z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 426, in fit
2017-04-07t21:36:54.863933057z     loss = self._train_model(input_fn=input_fn, hooks=hooks)
2017-04-07t21:36:54.863935517z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 981, in _train_model
2017-04-07t21:36:54.863938172z     config=self.config.tf_config) mon_sess:
2017-04-07t21:36:54.863940574z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 315, in monitoredtrainingsession
2017-04-07t21:36:54.863943261z     return monitoredsession(session_creator=session_creator, hooks=all_hooks)
2017-04-07t21:36:54.863945685z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 601, in __init__
2017-04-07t21:36:54.863948181z     session_creator, hooks, should_recover=true)
2017-04-07t21:36:54.863950474z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 434, in __init__
2017-04-07t21:36:54.863952972z     self._sess = _recoverablesession(self._coordinated_creator)
2017-04-07t21:36:54.863955292z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 767, in __init__
2017-04-07t21:36:54.863957783z     _wrappedsession.__init__(self, self._create_session())
2017-04-07t21:36:54.863960045z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 772, in _create_session
2017-04-07t21:36:54.863965454z     return self._sess_creator.create_session()
2017-04-07t21:36:54.863967812z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 494, in create_session
2017-04-07t21:36:54.863970316z     self.tf_sess = self._session_creator.create_session()
2017-04-07t21:36:54.863972622z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 366, in create_session
2017-04-07t21:36:54.863975112z     self._scaffold.finalize()
2017-04-07t21:36:54.863977366z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 183, in finalize
2017-04-07t21:36:54.863979905z     self._saver.build()
2017-04-07t21:36:54.863982274z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1081, in build
2017-04-07t21:36:54.863984743z     restore_sequentially=self._restore_sequentially)
2017-04-07t21:36:54.863987905z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 671, in build
2017-04-07t21:36:54.86399038z     restore_sequentially, reshape)
2017-04-07t21:36:54.863992624z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 445, in _addshardedrestoreops
2017-04-07t21:36:54.863995148z     name="restore_shard"))
2017-04-07t21:36:54.863997503z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 402, in _addrestoreops
2017-04-07t21:36:54.863999968z     tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
2017-04-07t21:36:54.864002332z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 242, in restore_op
2017-04-07t21:36:54.864004812z     [spec.tensor.dtype])[0])
2017-04-07t21:36:54.864007694z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 668, in restore_v2
2017-04-07t21:36:54.864010199z     dtypes=dtypes, name=name)
2017-04-07t21:36:54.864012414z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
2017-04-07t21:36:54.86401491z     op_def=op_def)
2017-04-07t21:36:54.864017117z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
2017-04-07t21:36:54.864028044z     original_op=self._default_original_op, op_def=op_def)
2017-04-07t21:36:54.864030331z   file "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
2017-04-07t21:36:54.864032899z     self._traceback = _extract_stack()
2017-04-07t21:36:54.864035157z
2017-04-07t21:36:54.864037633z invalidargumenterror (see above traceback): cannot assign device node 'save/restorev2_102': not satisfy explicit device specification '/job:ps/task:1/device:cpu:0' because no devices matching specification registered in process; available devices: /job:master/replica:0/task:0/cpu:0, /job:ps/replica:0/task:0/cpu:0, /job:worker/replica:0/task:0/cpu:0
2017-04-07t21:36:54.864043209z   [[node: save/restorev2_102 = restorev2[dtypes=[dt_string], _device="/job:ps/task:1/device:cpu:0"](save/const, save/restorev2_102/tensor_names, save/restorev2_102/shape_and_slices)]]
2017-04-07t21:36:54.864046084z
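
One detail in the output above that may be worth checking: the task entry of TF_CONFIG is printed with the key "inxex", whereas TF_CONFIG conventionally uses "type" and "index". Whether this is related to the device error is not established here. The following is only a minimal sketch of a sanity check on the printed TF_CONFIG string; the helper name check_tf_config and the constant TF_CONFIG_FROM_LOG are hypothetical and not part of TensorFlow or the census example.

```python
import json

# TF_CONFIG exactly as it is printed in the log above.
TF_CONFIG_FROM_LOG = (
    '{"environment": "cloud", "cluster": {"ps": ["census-ps-0:5000", '
    '"census-ps-1:5000"], "worker": ["census-worker-0:5000", '
    '"census-worker-1:5000"], "master": ["census-worker-0:5000"]}, '
    '"task": {"type": "master", "inxex": 0}}'
)


def check_tf_config(raw):
    """Hypothetical helper: list problems found in a TF_CONFIG JSON string."""
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    problems = []

    # The task type should name one of the jobs declared in the cluster.
    task_type = task.get("type")
    if task_type not in cluster:
        problems.append("task type %r is not a job in the cluster" % (task_type,))

    # Assumption: the task entry is expected to carry an "index" key.
    if "index" not in task:
        extra = sorted(key for key in task if key != "type")
        problems.append(
            "task has no 'index' key; other keys present: %s" % ", ".join(extra)
        )
    elif task_type in cluster:
        index = task["index"]
        size = len(cluster[task_type])
        if not 0 <= index < size:
            problems.append(
                "task index %d is out of range for job %r (%d tasks)"
                % (index, task_type, size)
            )
    return problems


for problem in check_tf_config(TF_CONFIG_FROM_LOG):
    print(problem)
```

Running a check like this in each pod (master, ps, worker) before starting training would at least rule out a malformed TF_CONFIG. It does not verify that census-ps-1:5000 is actually reachable, which the "available devices" list in the error (only task 0 of each job) suggests is also worth confirming.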

