kubernetes - Cannot assign a device to node in distributed TensorFlow
I am trying to run distributed TensorFlow following the Google Cloud ML example here.
I am running on a Kubernetes cluster and have the environment variables configured properly (2 ps, 2 workers). I get the following error:
2017-04-07T21:36:51.092443795Z {"environment": "cloud", "cluster": {"ps": ["census-ps-0:5000", "census-ps-1:5000"], "worker": ["census-worker-0:5000", "census-worker-1:5000"], "master": ["census-worker-0:5000"]}, "task": {"type": "master", "inxex": 0}}
2017-04-07T21:36:51.092473871Z {u'environment': u'cloud', u'cluster': {u'ps': [u'census-ps-0:5000', u'census-ps-1:5000'], u'worker': [u'census-worker-0:5000', u'census-worker-1:5000'], u'master': [u'census-worker-0:5000']}, u'task': {u'type': u'master', u'inxex': 0}}
2017-04-07T21:36:51.907203514Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907227466Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907231184Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907234415Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907237325Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.907240325Z W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-04-07T21:36:51.914365914Z I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job master -> {0 -> localhost:5000}
2017-04-07T21:36:51.914383815Z I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> census-ps-0:5000, 1 -> census-ps-1:5000}
2017-04-07T21:36:51.914387511Z I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> census-worker-0:5000, 1 -> census-worker-1:5000}
2017-04-07T21:36:51.914974731Z I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:5000
2017-04-07T21:36:54.784234307Z I tensorflow/core/distributed_runtime/master_session.cc:1012] Start master session dd8a251a59872860 with config:
2017-04-07T21:36:54.784259971Z gpu_options {
2017-04-07T21:36:54.784263535Z   per_process_gpu_memory_fraction: 1
2017-04-07T21:36:54.784266273Z }
2017-04-07T21:36:54.784268677Z
2017-04-07T21:36:54.861483497Z export TF_CONFIG='{"environment": "cloud", "cluster": {"ps": ["census-ps-0:5000", "census-ps-1:5000"], "worker": ["census-worker-0:5000", "census-worker-1:5000"], "master": ["census-worker-0:5000"]}, "task": {"type": "master", "inxex": 0}}'Starting Census: Please lauch tensorboard to see results: tensorboard --logdir=$MODEL_DIR
2017-04-07T21:36:54.86148432Z Traceback (most recent call last):
2017-04-07T21:36:54.861527172Z   File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
2017-04-07T21:36:54.861535317Z     "__main__", fname, loader, pkg_name)
2017-04-07T21:36:54.861540705Z   File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
2017-04-07T21:36:54.861627932Z     exec code in run_globals
2017-04-07T21:36:54.861641191Z   File "/code/task.py", line 192, in <module>
2017-04-07T21:36:54.86166076Z     learn_runner.run(generate_experiment_fn(**arguments), job_dir)
2017-04-07T21:36:54.861668307Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/learn_runner.py", line 106, in run
2017-04-07T21:36:54.861692382Z     return task()
2017-04-07T21:36:54.861698247Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 459, in train_and_evaluate
2017-04-07T21:36:54.86177589Z     self.train(delay_secs=0)
2017-04-07T21:36:54.86178479Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 281, in train
2017-04-07T21:36:54.861792289Z     monitors=self._train_monitors + extra_hooks)
2017-04-07T21:36:54.861795862Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 280, in new_func
2017-04-07T21:36:54.861845229Z     return func(*args, **kwargs)
2017-04-07T21:36:54.863930393Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 426, in fit
2017-04-07T21:36:54.863933057Z     loss = self._train_model(input_fn=input_fn, hooks=hooks)
2017-04-07T21:36:54.863935517Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 981, in _train_model
2017-04-07T21:36:54.863938172Z     config=self.config.tf_config) as mon_sess:
2017-04-07T21:36:54.863940574Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 315, in MonitoredTrainingSession
2017-04-07T21:36:54.863943261Z     return MonitoredSession(session_creator=session_creator, hooks=all_hooks)
2017-04-07T21:36:54.863945685Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 601, in __init__
2017-04-07T21:36:54.863948181Z     session_creator, hooks, should_recover=True)
2017-04-07T21:36:54.863950474Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 434, in __init__
2017-04-07T21:36:54.863952972Z     self._sess = _RecoverableSession(self._coordinated_creator)
2017-04-07T21:36:54.863955292Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 767, in __init__
2017-04-07T21:36:54.863957783Z     _WrappedSession.__init__(self, self._create_session())
2017-04-07T21:36:54.863960045Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 772, in _create_session
2017-04-07T21:36:54.863965454Z     return self._sess_creator.create_session()
2017-04-07T21:36:54.863967812Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 494, in create_session
2017-04-07T21:36:54.863970316Z     self.tf_sess = self._session_creator.create_session()
2017-04-07T21:36:54.863972622Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 366, in create_session
2017-04-07T21:36:54.863975112Z     self._scaffold.finalize()
2017-04-07T21:36:54.863977366Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 183, in finalize
2017-04-07T21:36:54.863979905Z     self._saver.build()
2017-04-07T21:36:54.863982274Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1081, in build
2017-04-07T21:36:54.863984743Z     restore_sequentially=self._restore_sequentially)
2017-04-07T21:36:54.863987905Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 671, in build
2017-04-07T21:36:54.86399038Z     restore_sequentially, reshape)
2017-04-07T21:36:54.863992624Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 445, in _AddShardedRestoreOps
2017-04-07T21:36:54.863995148Z     name="restore_shard"))
2017-04-07T21:36:54.863997503Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 402, in _AddRestoreOps
2017-04-07T21:36:54.863999968Z     tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
2017-04-07T21:36:54.864002332Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 242, in restore_op
2017-04-07T21:36:54.864004812Z     [spec.tensor.dtype])[0])
2017-04-07T21:36:54.864007694Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 668, in restore_v2
2017-04-07T21:36:54.864010199Z     dtypes=dtypes, name=name)
2017-04-07T21:36:54.864012414Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
2017-04-07T21:36:54.86401491Z     op_def=op_def)
2017-04-07T21:36:54.864017117Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
2017-04-07T21:36:54.864028044Z     original_op=self._default_original_op, op_def=op_def)
2017-04-07T21:36:54.864030331Z   File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
2017-04-07T21:36:54.864032899Z     self._traceback = _extract_stack()
2017-04-07T21:36:54.864035157Z
2017-04-07T21:36:54.864037633Z InvalidArgumentError (see above for traceback): Cannot assign a device to node 'save/RestoreV2_102': Could not satisfy explicit device specification '/job:ps/task:1/device:CPU:0' because no devices matching that specification are registered in this process; available devices: /job:master/replica:0/task:0/cpu:0, /job:ps/replica:0/task:0/cpu:0, /job:worker/replica:0/task:0/cpu:0
2017-04-07T21:36:54.864043209Z 	 [[Node: save/RestoreV2_102 = RestoreV2[dtypes=[DT_STRING], _device="/job:ps/task:1/device:CPU:0"](save/Const, save/RestoreV2_102/tensor_names, save/RestoreV2_102/shape_and_slices)]]
2017-04-07T21:36:54.864046084Z
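For anyone debugging a similar setup, here is a minimal sketch of how the TF_CONFIG value printed in the log above could be sanity-checked before launching training. It assumes the usual TF_CONFIG layout, where the "task" entry carries "type" and "index" keys, and it assumes device names follow the pattern shown in the error message. The helper names `check_tf_config` and `expected_devices` are hypothetical and not part of TensorFlow. It only surfaces discrepancies worth looking at; it does not by itself establish the cause of the error.

```python
import json

# Keys a TF_CONFIG "task" entry is expected to carry (assumption: usual layout).
EXPECTED_TASK_KEYS = {"type", "index"}

# TF_CONFIG exactly as printed in the log above.
TF_CONFIG_FROM_LOG = (
    '{"environment": "cloud", "cluster": {'
    '"ps": ["census-ps-0:5000", "census-ps-1:5000"], '
    '"worker": ["census-worker-0:5000", "census-worker-1:5000"], '
    '"master": ["census-worker-0:5000"]}, '
    '"task": {"type": "master", "inxex": 0}}'
)

# Devices listed as available in the error message above.
AVAILABLE_FROM_ERROR = [
    "/job:master/replica:0/task:0/cpu:0",
    "/job:ps/replica:0/task:0/cpu:0",
    "/job:worker/replica:0/task:0/cpu:0",
]


def check_tf_config(raw):
    """Return a list of human-readable problems found in a TF_CONFIG string."""
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    problems = []

    # Flag unexpected (possibly misspelled) and missing keys in the task entry.
    for key in sorted(set(task) - EXPECTED_TASK_KEYS):
        problems.append("unexpected task key: %r" % key)
    for key in sorted(EXPECTED_TASK_KEYS - set(task)):
        problems.append("missing task key: %r" % key)

    # The task type must name a job in the cluster; the index must be in range.
    job = task.get("type")
    if job not in cluster:
        problems.append("task type %r is not a job in the cluster" % job)
    elif "index" in task and not 0 <= task["index"] < len(cluster[job]):
        problems.append(
            "task index %r out of range for job %r" % (task["index"], job)
        )
    return problems


def expected_devices(raw):
    """List the CPU device name that each task in the cluster would register."""
    cluster = json.loads(raw).get("cluster", {})
    return sorted(
        "/job:%s/replica:0/task:%d/cpu:0" % (job, i)
        for job, hosts in cluster.items()
        for i in range(len(hosts))
    )


for problem in check_tf_config(TF_CONFIG_FROM_LOG):
    print(problem)

# Devices the cluster spec implies but the error message does not list.
missing = [
    d for d in expected_devices(TF_CONFIG_FROM_LOG)
    if d not in AVAILABLE_FROM_ERROR
]
print(missing)
```

Run against the values from the log, the check reports that the task entry has an "inxex" key and no "index" key, and that task 1 of both the ps and worker jobs is absent from the available devices. Whether the two observations are related would need to be confirmed by correcting the key in every pod's TF_CONFIG and rerunning.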