
Distributed-Training-Example

Developing (outside the Docker container)

Start Downloading

mkdir ./dataset_dir
cd ./dataset_dir
wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
tar zxvf cifar-10-python.tar.gz
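
The archive unpacks into a `cifar-10-batches-py` directory whose batch files are Python pickles. As a quick sanity check that the download and extraction worked, one batch can be loaded as in the sketch below. `load_batch` is a hypothetical helper for illustration, not part of this repository, and unpickling the real files assumes NumPy is installed because the image data is stored as NumPy arrays.

```python
import pickle
from pathlib import Path


def load_batch(path):
    """Load one CIFAR-10 batch file and return (data, labels).

    Hypothetical helper for illustration; not part of this repository.
    """
    with Path(path).open("rb") as f:
        # The files were pickled under Python 2, so the keys come back as bytes.
        batch = pickle.load(f, encoding="bytes")
    return batch[b"data"], batch[b"labels"]


# Example call, run from the directory that contains dataset_dir:
# data, labels = load_batch("./dataset_dir/cifar-10-batches-py/data_batch_1")
# print(len(data), len(labels))
```

For an intact training batch, both lengths should be 10000.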

Start Training

python3 -m torch.distributed.run \
    --nproc_per_node=1 \
    --nnodes=2 \
    --node_rank=0 \
    --rdzv_id=21046 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=192.168.1.46:21046 \
    main.py
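
The command above starts one process on node 0 of a two-node job, so a second node is expected to join the same rendezvous endpoint before training begins. The launcher exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` to each worker process. The sketch below shows one way a script such as `main.py` might read them; `DistEnv` and `read_dist_env` are hypothetical names, and the repository's actual `main.py` may handle this differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global rank of this process across all nodes
    local_rank: int  # rank of this process within its own node
    world_size: int  # total number of processes in the job


def read_dist_env(env=None):
    """Read the variables the launcher sets for each worker process.

    Falls back to single-process values so the script can also be
    started directly, without the launcher.
    """
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )
```

With `--nnodes=2` and `--nproc_per_node=1`, each worker should see a `world_size` of 2 and a `local_rank` of 0.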

Testing (in the Docker container)

Build an overlay network in Docker

# Master: Init the cluster
docker swarm init --advertise-addr=192.168.1.46 --listen-addr=192.168.1.46:2377

# Worker: Join the cluster
docker swarm join --token TOKEN_FROM_MASTER 192.168.1.46:2377

# Master: Create an overlay network in Docker
docker network create --driver overlay --attachable train-net

Start Downloading Image

mkdir ./dataset_dir
docker run -it --rm -v ./dataset_dir:/dataset YOUR_IMAGE

Start Training Image

docker run -it \
    --rm \
    --network train-net \
    --runtime=nvidia \
    --gpus all \
    --name train-0 \
    -v ./dataset_dir:/dataset \
    -v ./output:/output \
    -e GPU_NUM=1 \
    -e NODE_NUM=2 \
    -e NODE_RANK=0 \
    -e MASTER_IP=train-0 \
    -e MASTER_PORT=21046 \
    snsd0805/cifar100-train:v3 bash
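
The `-e` variables mirror the launcher flags used outside the container. How the image consumes them is not shown in this README, so the following is only a sketch of one plausible mapping. `build_launch_command` is a hypothetical helper, and reusing the port as the rendezvous id is an assumption copied from the host example, where both are 21046.

```python
import os


def build_launch_command(env=None, script="main.py"):
    """Translate the container's environment variables into launcher arguments.

    Assumed mapping for illustration; the image's real entrypoint may differ.
    """
    env = os.environ if env is None else env
    master_port = env["MASTER_PORT"]
    return [
        "python3", "-m", "torch.distributed.run",
        f"--nproc_per_node={env['GPU_NUM']}",
        f"--nnodes={env['NODE_NUM']}",
        f"--node_rank={env['NODE_RANK']}",
        # Assumption: reuse the port as the rendezvous id, as in the host example.
        f"--rdzv_id={master_port}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={env['MASTER_IP']}:{master_port}",
        script,
    ]
```

Inside the container, the returned list could be passed to `subprocess.run(cmd, check=True)`. Setting `MASTER_IP=train-0` relies on Docker resolving the container name on the `train-net` network, so a second container would presumably keep `MASTER_IP=train-0` while using a different `--name` and `NODE_RANK=1`.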