Distributed-Training-Example
Developing (outside the Docker container)
Start Downloading
mkdir ./dataset_dir
cd ./dataset_dir
wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar zxvf cifar-10-python.tar.gz
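Extracting the archive yields a `cifar-10-batches-py` directory of pickled batch files (`data_batch_1` to `data_batch_5`, plus `test_batch`). Before training, it can help to confirm that a batch is readable. The following is a minimal sketch only: `load_batch` is a hypothetical helper, not part of this repository, and unpickling the real files needs NumPy installed because the image data is stored as an array.

```python
import pickle
from pathlib import Path


def load_batch(path):
    """Unpickle one CIFAR-10 batch file and return (data, labels)."""
    with Path(path).open("rb") as f:
        # The batches were pickled under Python 2, so the keys come back as bytes.
        batch = pickle.load(f, encoding="bytes")
    return batch[b"data"], batch[b"labels"]
```

Calling `load_batch("dataset_dir/cifar-10-batches-py/data_batch_1")` from the directory where `dataset_dir` was created should return one row of 3072 pixel values per image, along with a matching list of integer labels.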
Start Training
python3 -m torch.distributed.run \
--nproc_per_node=1 \
--nnodes=2 \
--node_rank=0 \
--rdzv_id=21046 \
--rdzv_backend=c10d \
--rdzv_endpoint=192.168.1.46:21046 \
main.py
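The launcher starts `main.py` once per process and passes rank information through environment variables such as `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`. How `main.py` consumes them is not shown here. A minimal sketch, using the hypothetical names `DistEnv` and `read_dist_env`, might look like this:

```python
import os
from dataclasses import dataclass


@dataclass
class DistEnv:
    rank: int        # global rank of this process across all nodes
    local_rank: int  # rank of this process on its own node
    world_size: int  # total number of processes in the job


def read_dist_env(env=None):
    """Read the rank variables exported by torch.distributed.run."""
    env = os.environ if env is None else env
    # Defaults describe a single-process run started without the launcher.
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )
```

With the flags above (`--nproc_per_node=1`, `--nnodes=2`), each node runs one process, so `world_size` is 2. In a training script these values would typically feed `torch.distributed.init_process_group` and the device selection.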
Testing (in the Docker container)
Build overlay network in Docker
# Master: Init the cluster
docker swarm init --advertise-addr=192.168.1.46 --listen-addr=192.168.1.46:2377
# Worker: Join the cluster
docker swarm join --token TOKEN_FROM_MASTER 192.168.1.46:2377
# Master: Create an overlay network in Docker
docker network create --driver overlay --attachable train-net
Start Downloading Image
mkdir ./dataset_dir
docker run -it --rm -v "$(pwd)/dataset_dir":/dataset YOUR_IMAGE
Start Training Image
docker run -it \
--rm \
--network train-net \
--runtime=nvidia \
--gpus all \
--name train-0 \
-v "$(pwd)/dataset_dir":/dataset \
-v "$(pwd)/output":/output \
-e GPU_NUM=1 \
-e NODE_NUM=2 \
-e NODE_RANK=0 \
-e MASTER_IP=train-0 \
-e MASTER_PORT=21046 \
snsd0805/cifar100-train:v3 bash
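The `-e` variables mirror the launcher flags from the non-Docker run (`GPU_NUM=1` and `--nproc_per_node=1`, `NODE_NUM=2` and `--nnodes=2`, `MASTER_PORT=21046` and the rendezvous port). How the image consumes them is not shown here. One plausible mapping is sketched below with a hypothetical helper, `build_launch_cmd`, which assumes the rendezvous ID reuses the port number as in the earlier command:

```python
import os


def build_launch_cmd(env=None, script="main.py"):
    """Assemble a torch.distributed.run command from the container's variables."""
    env = os.environ if env is None else env
    port = env["MASTER_PORT"]
    return [
        "python3", "-m", "torch.distributed.run",
        f"--nproc_per_node={env['GPU_NUM']}",
        f"--nnodes={env['NODE_NUM']}",
        f"--node_rank={env['NODE_RANK']}",
        f"--rdzv_id={port}",  # assumption: reuse the port as the job ID
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={env['MASTER_IP']}:{port}",
        script,
    ]
```

Inside the container, `print(" ".join(build_launch_cmd()))` shows the resulting command for review before it is run. A second container on the other host would presumably be started the same way with a different `--name` and `NODE_RANK=1`, keeping `MASTER_IP=train-0` so that both nodes rendezvous at the first container over `train-net`.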