tensorflow - How to execute distributed training where each node has multiple workers

August 15, 2011

What command runs distributed training on multiple nodes when each node has multiple GPUs? The example in https://github.com/tensorflow/models/tree/master/inception shows the case where each node has 1 GPU and 1 worker. In my cluster, each node has 4 GPUs, which should require 4 workers per node.

I tried the following commands. On node 0:

bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=0 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
......
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=3 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222'

On node 1:

bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=4 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222' &
......
bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --task_id=7 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker0.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222,worker1.example.com:2222'

Note that there is an & at the end of each command so that the commands can be executed in parallel, but this results in an out-of-GPU-memory error.
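A possible explanation for the out-of-memory error, offered as an assumption rather than a confirmed diagnosis: by default a TensorFlow process reserves memory on every GPU it can see, so four workers started on one node all claim the same four devices. In addition, the four entries for each host in --worker_hosts share port 2222, and only one process on a host can listen on a given port. The dry-run sketch below prints (without executing) one launch command per GPU, pinning each worker to a single device with the CUDA environment variable CUDA_VISIBLE_DEVICES and assigning ports 2222 to 2225. The port range, the helper names worker_hosts and print_node_commands, and the choice to pin GPUs this way are illustrative assumptions, not part of the original post.

```shell
#!/bin/sh
# Dry-run sketch: prints (does not execute) one launch command per GPU.
# Assumptions (not from the original post): ports 2222..2225 are free on
# every worker node, and CUDA_VISIBLE_DEVICES pins each worker to one GPU.

GPUS_PER_NODE=4
BASE_PORT=2222

# Build the comma-separated --worker_hosts value: one host:port entry per
# GPU, with a distinct port for each worker that shares a host.
worker_hosts() {
  hosts=""
  for host in "$@"; do
    gpu=0
    while [ "$gpu" -lt "$GPUS_PER_NODE" ]; do
      hosts="${hosts:+$hosts,}$host:$((BASE_PORT + gpu))"
      gpu=$((gpu + 1))
    done
  done
  printf '%s\n' "$hosts"
}

# Print the launch commands for the node at position $1 in the host list
# (0 for worker0, 1 for worker1); $2 is the full --worker_hosts value.
print_node_commands() {
  node=$1
  all_workers=$2
  gpu=0
  while [ "$gpu" -lt "$GPUS_PER_NODE" ]; do
    # Global task id = node index * GPUs per node + local GPU index.
    task_id=$((node * GPUS_PER_NODE + gpu))
    printf '%s\n' "CUDA_VISIBLE_DEVICES=$gpu bazel-bin/inception/imagenet_distributed_train --batch_size=32 --data_dir=\$HOME/imagenet-data --job_name='worker' --task_id=$task_id --ps_hosts='ps0.example.com:2222' --worker_hosts='$all_workers' &"
    gpu=$((gpu + 1))
  done
}

WORKERS=$(worker_hosts worker0.example.com worker1.example.com)
print_node_commands 0 "$WORKERS"   # commands intended for node 0
```

Calling print_node_commands 1 "$WORKERS" prints the commands for node 1, with task IDs 4 to 7. Inside each pinned process the single visible GPU appears as device 0. Whether this removes the out-of-memory error on a given cluster is not confirmed by the original post.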

I also tried using 1 worker on each node, with each worker using 4 GPUs. On node 0:

bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --gpus=4 \
  --task_id=0 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker1.example.com:2222'

On node 1:

bazel-bin/inception/imagenet_distributed_train \
  --batch_size=32 \
  --data_dir=$HOME/imagenet-data \
  --job_name='worker' \
  --gpus=4 \
  --task_id=1 \
  --ps_hosts='ps0.example.com:2222' \
  --worker_hosts='worker0.example.com:2222,worker1.example.com:2222'

But in the end each node uses only 1 GPU.

So what exact command should I use? Thanks.
