This repository provides a basic implementation of a variational autoencoder (VAE) on the MNIST dataset with training on high-performance computing (HPC) systems (specifically NUS HPC) in mind.
It is intended to serve as a reference for setting up and training more complex deep learning architectures on a variety of datasets; the MNIST dataset has therefore been extracted into individual image files and an annotation file to closely simulate the structure of a custom dataset.
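After the extraction step described below, the data ends up in a layout along the following lines; the exact file and directory names are determined by `extract-mnist.ipynb`, so the annotation file names shown here are illustrative only:

```
data/
├── train/        # individual training images
├── test/         # individual test images
├── train.csv     # hypothetical name: annotations mapping image file to digit label
└── test.csv      # hypothetical name: test-set annotations
```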
`./environment-cpu.yml` and `./environment-cuda.yml` are provided for use on local machines, but note that the conda environments have been created for Python 3.8.5 and PyTorch 2.0.0, the versions used by the target Singularity image on NUS HPC. Note that `./requirements.txt` is intended for use on the HPC system.
The following commands should replicate a working environment for the desired Python and PyTorch versions:

```bash
conda create -n <environment-name> python=<version>
conda install ipykernel
# install the desired PyTorch build (see the PyTorch installation instructions)
pip install matplotlib pandas tqdm
```
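As a concrete example, the following should mirror the HPC image's versions on a CPU-only machine; the environment name is arbitrary, and the CUDA variant of the `torch` install should be taken from the PyTorch website instead:

```bash
conda create -n vae-mnist python=3.8.5   # "vae-mnist" is an arbitrary name
conda activate vae-mnist
conda install ipykernel
# CPU-only wheel; use the matching CUDA command from pytorch.org on GPU machines
pip install torch==2.0.0 --index-url https://download.pytorch.org/whl/cpu
pip install matplotlib pandas tqdm
```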
The original MNIST data files are provided in `./data/`. Run all cells in `./extract-mnist.ipynb` to extract the individual image files and create `.csv` annotations.
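If you prefer not to open Jupyter interactively, the notebook can also be executed from the command line with stock `nbconvert` (a convenience, not part of the repository's own instructions):

```bash
jupyter nbconvert --to notebook --execute --inplace extract-mnist.ipynb
```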
Move the resulting `./train/` and `./test/` directories into `./data/`.
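Assuming the notebook writes `./train/` and `./test/` to the repository root, this is a single move:

```bash
mv ./train ./test ./data/
```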
`scp` the following files and directories into the target working directory on the HPC system (an example invocation follows the list):

- Data files
- `./modules/`
- `./utils/`
- `./requirements.txt`
- `./train.pbs`
- `./train.py`
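For example, with placeholder values for the username, login node, and remote directory (replace these with your own account details and the actual NUS HPC hostname):

```bash
# -r is required for the directories; ./data holds the extracted dataset
scp -r ./data ./modules ./utils ./requirements.txt ./train.pbs ./train.py \
    <user>@<hpc-login-node>:<target-working-directory>/
```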
Set up the necessary packages in the desired Singularity image:

```bash
module load singularity
singularity exec <singularity-image> bash
pip install -r requirements.txt
exit
```
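The same installation can be done non-interactively; note that because the image itself is typically read-only, `pip` will usually fall back to installing the packages into your user site-packages:

```bash
singularity exec <singularity-image> pip install -r requirements.txt
```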
Ensure that `./train.pbs` will load the desired Singularity image. Modify the training hyperparameters and the PBS compute request if necessary.
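For orientation, a minimal PBS script might look like the sketch below. This is not the repository's `train.pbs`: the queue name, resource selection, and image path are assumptions to be replaced with values appropriate for your NUS HPC account.

```bash
#!/bin/bash
#PBS -q <queue>                               # site-specific queue name
#PBS -l select=1:ncpus=4:ngpus=1:mem=16gb     # assumed compute request
#PBS -l walltime=02:00:00

cd "$PBS_O_WORKDIR"     # run from the directory the job was submitted from
module load singularity
# --nv exposes the NVIDIA GPU; redirect stderr so it can be read mid-run
singularity exec --nv <singularity-image> python train.py 2> "stderr.$PBS_JOBID"
```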
Submit the job to the queue:

```bash
qsub train.pbs
```

Check the status of the job (`Q`: queued, `R`: running, `E`: exiting, `F`: finished):

```bash
qstat -xfn
```
`stderr.$PBS_JOBID` will update periodically to reflect the console output of `train.py`.
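To watch this output live from a login-node shell (standard `tail`, with the numeric job ID substituted in):

```bash
tail -f stderr.<job-id>
```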
When the job is complete, the state dictionaries of the VAE and the Adam optimiser will be saved in the working directory, together with a log of hyperparameters and training loss. These files can then be `scp`'d back to the local machine for use.
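For example (the `*.pt` and `*.log` patterns are assumptions about the output filenames; adjust them to whatever `train.py` actually writes):

```bash
scp '<user>@<hpc-login-node>:<target-working-directory>/*.pt' .
scp '<user>@<hpc-login-node>:<target-working-directory>/*.log' .
```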