tensorflow - How to execute distributed training where each node has multiple workers


What command should I run to do distributed training on multiple nodes where each node has multiple GPUs? The example at https://github.com/tensorflow/models/tree/master/inception shows the case where each node has 1 GPU and 1 worker. In my cluster, each node has 4 GPUs, which should require 4 workers.

I tried the following commands. On node 0:

bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=0 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
......
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=3 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222'

On node 1:

bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=4 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
......
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=7 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222'
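
One thing that may be worth checking in the commands above: the --worker_hosts list repeats the same host:port pair four times per node. A task is addressed only by its host:port entry, so tasks that share an entry cannot be told apart by the rest of the cluster. Here is a minimal sketch of such a check in plain Python. The function name find_duplicate_hosts is hypothetical and is not part of TensorFlow or the inception scripts.

```python
from collections import Counter


def find_duplicate_hosts(worker_hosts):
    """Return the host:port entries that appear more than once.

    `worker_hosts` is the comma-separated string passed to --worker_hosts.
    """
    entries = [h.strip() for h in worker_hosts.split(",") if h.strip()]
    counts = Counter(entries)
    # Counter keeps first-seen order, so the report is stable.
    return [h for h in counts if counts[h] > 1]


if __name__ == "__main__":
    hosts = ",".join(
        ["worker0.example.com:2222"] * 4 + ["worker1.example.com:2222"] * 4
    )
    for entry in find_duplicate_hosts(hosts):
        print("duplicate worker address:", entry)
```

An empty result means every worker has its own address, which is probably a precondition for running several workers on one machine.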

Note the & at the end of each command, so that they run in parallel; this fails with an out-of-GPU-memory error.
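
The out-of-memory error is consistent with TensorFlow's default behaviour: a process reserves memory on every GPU it can see, so four workers on one node each try to claim all four GPUs. A common approach is to give each worker its own port and restrict it to one GPU with the CUDA_VISIBLE_DEVICES environment variable. The sketch below only builds the launch commands as strings. The port numbering (2222 plus the local GPU index) and the helper names build_worker_hosts and build_launch_commands are assumptions for illustration, and the result has not been verified against the inception scripts.

```python
def build_worker_hosts(nodes, gpus_per_node, base_port=2222):
    """One host:port per GPU, with a distinct port for each worker on a node."""
    return [
        "{}:{}".format(node, base_port + gpu)
        for node in nodes
        for gpu in range(gpus_per_node)
    ]


def build_launch_commands(node_index, nodes, gpus_per_node, ps_hosts):
    """Build the shell commands to start on one node (returned as strings)."""
    worker_hosts = ",".join(build_worker_hosts(nodes, gpus_per_node))
    commands = []
    for gpu in range(gpus_per_node):
        # Task ids are numbered node by node, matching the order of worker_hosts.
        task_id = node_index * gpus_per_node + gpu
        commands.append(
            # CUDA_VISIBLE_DEVICES limits this process to a single GPU.
            "CUDA_VISIBLE_DEVICES={gpu} "
            "bazel-bin/inception/imagenet_distributed_train "
            "--batch_size=32 "
            "--data_dir=$HOME/imagenet-data "
            "--job_name='worker' "
            "--task_id={task_id} "
            "--ps_hosts='{ps}' "
            "--worker_hosts='{workers}' &".format(
                gpu=gpu, task_id=task_id, ps=ps_hosts, workers=worker_hosts
            )
        )
    return commands


if __name__ == "__main__":
    nodes = ["worker0.example.com", "worker1.example.com"]
    for cmd in build_launch_commands(1, nodes, 4, "ps0.example.com:2222"):
        print(cmd)
```

Run with node index 0 on the first machine and 1 on the second; the printed lines can then be pasted into a shell on the corresponding node.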

I also tried using 1 worker on each node, with each worker using 4 GPUs. On node 0:

bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --gpus=4 \
  --task_id=0 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker1.example.com:2222'

On node 1:

bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --gpus=4 \
  --task_id=1 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker1.example.com:2222'

But in the end, each node uses only 1 GPU.

So what exact command should I use? Thanks.

