# Distributed-Training-Example

## Developing (not in the docker container)

### Start Downloading

```
mkdir ./dataset_dir
cd ./dataset_dir
wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar zxvf cifar-10-python.tar.gz
```

### Start Training

```
python3 -m torch.distributed.run \
    --nproc_per_node=1 \
    --nnodes=2 \
    --node_rank=0 \
    --rdzv_id=21046 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=192.168.1.46:21046 \
    main.py
```

## Testing (in the docker container)

### Build an overlay network in Docker

```
# Master: Init the cluster
docker swarm init --advertise-addr=192.168.1.46 --listen-addr=192.168.1.46:2377

# Worker: Join the cluster
docker swarm join --token TOKEN_FROM_MASTER 192.168.1.46:2377

# Master: Create an overlay network in Docker
docker network create --driver overlay --attachable train-net
```

### Start the Downloading Container

```
mkdir ./dataset_dir
docker run -it --rm -v ./dataset_dir:/dataset YOUR_IMAGE
```

### Start the Training Container

```
docker run -it \
    --rm \
    --network train-net \
    --runtime=nvidia \
    --gpus all \
    --name train-0 \
    -v ./dataset_dir:/dataset \
    -v ./output:/output \
    -e GPU_NUM=1 \
    -e NODE_NUM=2 \
    -e NODE_RANK=0 \
    -e MASTER_IP=train-0 \
    -e MASTER_PORT=21046 \
    snsd0805/cifar100-train:v3 bash
```
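The container's entrypoint presumably maps the `GPU_NUM`, `NODE_NUM`, `NODE_RANK`, `MASTER_IP`, and `MASTER_PORT` environment variables onto the same `torch.distributed.run` flags used in the bare-metal command above. A minimal sketch of that mapping (the helper name `build_launch_cmd` and the fallback defaults are assumptions, not part of the actual image):

```python
import os

def build_launch_cmd(env=None):
    # Translate the container environment variables (set via `docker run -e ...`)
    # into a torch.distributed.run command line. The defaults mirror the
    # two-node example above and are illustrative only.
    env = os.environ if env is None else env
    port = env.get("MASTER_PORT", "21046")
    return [
        "python3", "-m", "torch.distributed.run",
        f"--nproc_per_node={env.get('GPU_NUM', '1')}",
        f"--nnodes={env.get('NODE_NUM', '2')}",
        f"--node_rank={env.get('NODE_RANK', '0')}",
        f"--rdzv_id={port}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={env.get('MASTER_IP', 'train-0')}:{port}",
        "main.py",
    ]

if __name__ == "__main__":
    print(" ".join(build_launch_cmd()))
```

Because the rendezvous backend is `c10d`, every node points `--rdzv_endpoint` at the same master address (here the `train-0` container name resolved over the `train-net` overlay network), and only `--node_rank` differs between nodes.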