This experiment is an extension of the original paper. MGD can naturally work with current unsupervised learning frameworks, e.g., Momentum Contrast (MoCo) and Simple Siamese Learning (SimSiam). In this repo, we initially investigate MoCo-v2 training with MGD; the other parts are still a work in progress.
- PyTorch 1.8.1
Prepare the ImageNet-1K dataset following the official PyTorch ImageNet training code.
Directory Structure
```
path/to/${ImageNet-1K}/root/folder
|-- train
|   |-- n01440764
|   |-- n01734418
|   |-- ...
|   `-- n15075141
`-- val
    |-- n01440764
    |-- n01734418
    |-- ...
    `-- n15075141
```
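With this layout, the official PyTorch ImageNet code reads the classes directly from the folder names. A minimal sketch of that loading step (the path is a placeholder):

```python
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Each class folder (n01440764, ...) under train/ becomes one label.
train_dataset = datasets.ImageFolder(
    "path/to/ImageNet-1K/train",  # placeholder path
    transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]),
)
print(len(train_dataset.classes))  # 1000 classes for ImageNet-1K
```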
Copy the shared MGD sampler into this sub-project:

```bash
cp -r ../mgd/sampler.py mgd
```
Please download the pre-trained ResNet-50 weights (md5: `59fd9945`, epochs: 200) from the MoCo-v2 Models page and load them via the `--resume` argument.
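To sanity-check the download, you can compare the md5 digest against the value above (a minimal sketch; the filename matches the `--resume` argument used below):

```python
import hashlib

# The digest should start with 59fd9945, the short md5 listed above.
with open("moco_v2_200ep_pretrain.pth.tar", "rb") as f:
    print(hashlib.md5(f.read()).hexdigest()[:8])
```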
To perform unsupervised pre-training of a ResNet-18 model with MGD on ImageNet on an 8-GPU machine, run:
```bash
python main_moco_mgd.py \
  -a resnet18 \
  --lr 0.03 \
  --batch-size 256 \
  --dist-url 'tcp://localhost:10001' --multiprocessing-distributed --world-size 1 --rank 0 \
  --mlp --moco-t 0.2 --aug-plus --cos \
  --resume moco_v2_200ep_pretrain.pth.tar \
  [your imagenet-folder with train and val folders]
```
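The same command should also cover the ResNet-34 student reported below by swapping in `-a resnet34` (assuming the architecture flag follows the usual MoCo conventions).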
| method | model | pre-train epochs | training logs |
|---|---|---|---|
| MGD | ResNet-50 distills ResNet-34 | 200 | Baidu Pan [ bkr5 ] |
| MGD | ResNet-50 distills ResNet-18 | 200 | Baidu Pan [ jbcv ] |
Note:

- The MGD distiller is driven by AMP (absolute max pooling); a sketch is given after this list.
- The teacher is ResNet-50 by default.
- The MGD hyper-parameters, such as the loss factors, are the same as in supervised training; we did not run a hyper-parameter search. Judging from the training logs, we believe performance can be improved by tuning the hyper-parameters, for example, increasing the loss factor from `1e-4` to `1e-2`.
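For illustration, here is a hedged sketch of absolute max pooling, assuming AMP reduces teacher channels by keeping, per channel group and spatial location, the element with the largest absolute value (sign preserved); the function name and grouping scheme are illustrative, not the repo's exact implementation:

```python
import torch

def absolute_max_pooling(teacher_feat: torch.Tensor, groups: int) -> torch.Tensor:
    # Split channels into `groups` contiguous groups, then keep the element
    # with the largest |value| in each group at every spatial location.
    n, c, h, w = teacher_feat.shape
    x = teacher_feat.view(n, groups, c // groups, h, w)
    idx = x.abs().argmax(dim=2, keepdim=True)  # index of max |value| per group
    return x.gather(2, idx).squeeze(2)         # (n, groups, h, w), sign kept

# e.g. reduce a ResNet-50 stage (2048 channels) to ResNet-18 width (512):
out = absolute_max_pooling(torch.randn(2, 2048, 7, 7), groups=512)
print(out.shape)  # torch.Size([2, 512, 7, 7])
```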
Linear classification follows the same protocol as MoCo-v2. Results on ImageNet using this repo with 8 NVIDIA TITAN Xp GPUs:
| method | model | pre-train epochs | MoCo v2 top-1 acc. | MoCo v2 top-5 acc. |
|---|---|---|---|---|
| Teacher | ResNet-50 | 200 | 67.5 | - |
| Student | ResNet-34 | 200 | 57.2 | 81.5 |
| MGD | ResNet-34 | 200 | 58.5 | 82.7 |
| Student | ResNet-18 | 200 | 52.5 | 77.0 |
| MGD | ResNet-18 | 200 | 53.6 | 78.7 |
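To reproduce these numbers, run MoCo's standard linear-classification step (`main_lincls.py` in the MoCo codebase) with `--pretrained` pointing at the MGD pre-trained checkpoint; this assumes the repo follows the MoCo-v2 recipe unchanged, as stated above.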
The schedule for updating the MGD matching matrix differs from the one in the original paper: we scale it with a log function, i.e., the matching matrix is updated at epochs [1, 2, 3, 6, 9, 15, 26, 43, 74, 126].
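A minimal sketch of how this schedule plugs into the training loop (`update_matching_matrix` is a hypothetical stand-in for the repo's actual update routine):

```python
# Log-scaled epochs at which the matching matrix is re-solved (from the text above).
UPDATE_EPOCHS = {1, 2, 3, 6, 9, 15, 26, 43, 74, 126}

def update_matching_matrix():
    # Hypothetical stand-in: MGD re-solves the teacher-student channel
    # assignment here before distillation continues.
    pass

for epoch in range(200):  # 200 pre-training epochs, per the tables above
    if epoch in UPDATE_EPOCHS:
        update_matching_matrix()
    # ... run one epoch of MoCo-v2 + MGD training ...
```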