tensorflow - How to execute distributed training where each node has multiple workers
What command should I run for distributed training on multiple nodes when each node has multiple GPUs? The example in https://github.com/tensorflow/models/tree/master/inception shows the case where each node has 1 GPU and 1 worker. In my cluster, each node has 4 GPUs, which should require 4 workers.
I tried the following commands.
On node 0:
bazel-bin/inception/imagenet_distributed_train --batch_size=32 --data_dir=$HOME/imagenet-data --job_name='worker' --task_id=0 --ps_hosts='ps0.example.com:2222' --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
......
bazel-bin/inception/imagenet_distributed_train --batch_size=32 --data_dir=$HOME/imagenet-data --job_name='worker' --task_id=3 --ps_hosts='ps0.example.com:2222' --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222'
On node 1:
bazel-bin/inception/imagenet_distributed_train --batch_size=32 --data_dir=$HOME/imagenet-data --job_name='worker' --task_id=4 --ps_hosts='ps0.example.com:2222' --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
......
bazel-bin/inception/imagenet_distributed_train --batch_size=32 --data_dir=$HOME/imagenet-data --job_name='worker' --task_id=7 --ps_hosts='ps0.example.com:2222' --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222'
Note the & at the end of each command, so that they can be executed in parallel. This fails with an out-of-GPU-memory error.
I also tried using 1 worker on each node, with each worker using 4 GPUs.
On node 0:
bazel-bin/inception/imagenet_distributed_train --batch_size=32 --data_dir=$HOME/imagenet-data --job_name='worker' --gpus=4 --task_id=0 --ps_hosts='ps0.example.com:2222' --worker_hosts='worker0.example.com:2222,worker1.example.com:2222'
On node 1:
bazel-bin/inception/imagenet_distributed_train --batch_size=32 --data_dir=$HOME/imagenet-data --job_name='worker' --gpus=4 --task_id=1 --ps_hosts='ps0.example.com:2222' --worker_hosts='worker0.example.com:2222,worker1.example.com:2222'
But in the end, each node uses only 1 GPU.
So what exact command should I use? Thanks.
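For reference, here is a minimal sketch of how the 4-workers-per-node launch could be laid out. It is not a confirmed fix. It rests on two assumptions that are not in the original post: that each worker on a node listens on its own port (2222 to 2225 here, chosen arbitrarily), since two processes on one host cannot bind the same port, and that each worker is pinned to one GPU with the CUDA_VISIBLE_DEVICES environment variable, since a TensorFlow process by default reserves memory on every GPU it can see. The helper names worker_hosts and launch_commands are hypothetical. The script only prints the command lines and does not start anything.

```python
# Sketch only: builds one launch command per GPU worker.
# Assumptions (not from the original post): ports BASE_PORT..BASE_PORT+3 are
# free on every node, and CUDA_VISIBLE_DEVICES pins each worker to one GPU.

NODES = ["worker0.example.com", "worker1.example.com"]
GPUS_PER_NODE = 4
BASE_PORT = 2222  # hypothetical; any free port range would do


def worker_hosts(nodes, gpus_per_node, base_port):
    """Return one unique host:port entry per worker, ordered by task id."""
    return [
        f"{host}:{base_port + gpu}"
        for host in nodes
        for gpu in range(gpus_per_node)
    ]


def launch_commands(nodes, gpus_per_node, base_port, ps_hosts):
    """Map each node to the shell command lines to run on that node."""
    hosts = ",".join(worker_hosts(nodes, gpus_per_node, base_port))
    commands = {host: [] for host in nodes}
    for node_index, host in enumerate(nodes):
        for gpu in range(gpus_per_node):
            # Task ids follow the order of entries in --worker_hosts.
            task_id = node_index * gpus_per_node + gpu
            commands[host].append(
                f"CUDA_VISIBLE_DEVICES={gpu} "
                "bazel-bin/inception/imagenet_distributed_train "
                "--batch_size=32 --data_dir=$HOME/imagenet-data "
                f"--job_name='worker' --task_id={task_id} "
                f"--ps_hosts='{ps_hosts}' --worker_hosts='{hosts}' &"
            )
    return commands


if __name__ == "__main__":
    all_commands = launch_commands(
        NODES, GPUS_PER_NODE, BASE_PORT, "ps0.example.com:2222"
    )
    for host, cmds in all_commands.items():
        print(f"# on {host}:")
        for cmd in cmds:
            print(cmd)
```

Running the script prints four commands per node, with task ids 0 to 3 for worker0.example.com and 4 to 7 for worker1.example.com. Every process receives the same 8-entry --worker_hosts list, and the position of an entry in that list is assumed to correspond to the worker's --task_id.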