This repository presents the second coursework for the MATH70076 Data Science module at Imperial College London. The project showcases different machine and deep learning models for image classification, evaluating both the models' performance and their complexity.

The project compares the performance of different machine and deep learning models on an image classification task, and also considers each model's complexity, i.e. how many parameters it has. Each model is implemented as a Python3 class, and the implemented models are used as modules within the main programme of the data project.
All models used the same training and test datasets. The dataset used for this study was the Fashion-MNIST dataset, which contains 28x28 pixel grayscale images of clothing items from ten categories.
The final part of the project compares the methods and provides a broad recommendation on which method is best for image classification of black-and-white images with limited tuning. The results are presented in the table below. Given its performance, the neural network is recommended for image classification of black-and-white images, although it is a more complex model than the machine learning methods. The comparison can be reproduced by running the main.py programme as it is in the repository.
Model | Accuracy (%) | Number of model parameters |
---|---|---|
Gaussian naive-Bayes | 61.46 | 20 |
$k$-nearest neighbours | 85.77 | - |
Neural network | 86.18 | 118,282 |
Convolutional neural network | 61.97 | 1,593,620 |
Graph neural network | 35.47 | 126,507 |
- The dataset used for this comparative study was the Fashion-MNIST dataset, which can be downloaded directly from the dataset repository's README page
- The raw folder contains all the raw data downloaded from the dataset's README page into this local directory, ready for processing and use in the models in this study
- The test and train datasets for the machine learning models (Gaussian naive-Bayes and $k$-nearest neighbours) are loaded from the downloaded files in the raw folder and imported as numpy arrays
- The test and train datasets for the deep learning models (neural network, convolutional neural network, and graph neural network) are loaded via the PyTorch utilities and imported as PyTorch datasets
- The data are loaded twice because the two families of models take different dataset inputs into their classes for processing; a minimal sketch of the two loading paths is given below
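The snippet below is a minimal sketch of these two loading paths, not the repository's own loaders; it assumes the standard gzipped Fashion-MNIST IDX file names sit in the raw folder and that torchvision is available.

```python
# Illustrative sketch only: the repository's loaders may differ.
import gzip

import numpy as np
from torchvision import datasets, transforms

def load_idx_images(path):
    """Read a gzipped IDX image file into a (n_samples, 784) numpy array."""
    with gzip.open(path, "rb") as f:
        data = np.frombuffer(f.read(), dtype=np.uint8, offset=16)
    return data.reshape(-1, 28 * 28)

def load_idx_labels(path):
    """Read a gzipped IDX label file into a (n_samples,) numpy array."""
    with gzip.open(path, "rb") as f:
        return np.frombuffer(f.read(), dtype=np.uint8, offset=8)

# Machine learning models: plain numpy arrays from the raw folder
X_train = load_idx_images("../data/raw/train-images-idx3-ubyte.gz")
y_train = load_idx_labels("../data/raw/train-labels-idx1-ubyte.gz")

# Deep learning models: PyTorch dataset via torchvision
train_ds = datasets.FashionMNIST(
    root="../data", train=True, download=True, transform=transforms.ToTensor()
)
```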
The code to produce the models and results for the comparative study can be found in src.
Before using the code, it is best to set up and start a Python virtual environment using the requirements file, in order to avoid potential package clashes:
# Navigate into the data project directory
# Create a virtual environment
python3 -m venv <env-name>
# Activate virtual environment
source <env-name>/bin/activate
# Install dependencies for code
pip3 install -r requirements.txt
# When finished with virtual environment
deactivate
The amount of hardcoding has been reduced as much as possible by creating a configuration file. This means that if the location of the main programme outputs, the data files, or the naming attributes for output files need to change, they are updated in this .json file. Here is an additional example of a configuration object which can be used in place of the current object:
# configuration .json object
{
"loggingName": "example",
"runNumber": "1",
"dataPath": "../data/raw/",
"outputFigPath": "../outputs/learning-curves/",
"outputValPath": "../outputs/test-predictions/",
"outputModelPath": "../outputs/saved-models/"
}
- `loggingName`: the name assigned to the outputs of the main programme run, for end-user file identification
- `runNumber`: the run number of this configuration file, set to prevent overwriting previous runs; it can be reset to 1 for a new run campaign but this is not strictly required
- `dataPath`: the relative path to the data files that are loaded into numpy arrays for the machine learning models
- `outputFigPath`: the relative path for saving the training curves for the models
- `outputValPath`: the relative path for saving the prediction image outputs of the models
- `outputModelPath`: the relative path for saving the models or model parameters for each of the models
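As a hypothetical illustration of how these fields might be consumed in the main programme (the file name config.json and the variable names below are assumptions, not taken from the repository):

```python
# Hypothetical sketch of reading the configuration; "config.json" is an
# assumed file name, not necessarily the repository's.
import json

with open("config.json", "r") as f:
    config = json.load(f)

data_path = config["dataPath"]                                 # e.g. "../data/raw/"
run_tag = f"{config['loggingName']}-run{config['runNumber']}"  # names output files
figure_file = config["outputFigPath"] + run_tag + ".png"       # e.g. a learning curve
```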
The following hyperparameters can be easily tuned and modified in the main.py programme for each of the models before execution. Further hyperparameters and model choices can be changed, but this requires the user to modify the models' class code in the respective Python3 modules.
- `smoothing`: hyperparameter to prevent numerical instabilities when calculating the exponent of the Gaussian distribution used in the model (preventing division by zero when the variance is very small)
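A minimal sketch of how such a smoothing term stabilises the Gaussian log-likelihood (the function and variable names are illustrative, not the repository's):

```python
import numpy as np

def gaussian_log_likelihood(x, mean, var, smoothing=1e-9):
    """Per-feature Gaussian log-density with a smoothed variance."""
    var = var + smoothing  # avoid dividing by a (near-)zero variance
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)
```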
- `kmin`: minimum $k$ neighbours to test when finding the optimum number of neighbours for a given dataset
- `kmax`: maximum $k$ neighbours to test when finding the optimum number of neighbours for a given dataset
- `nSplits`: number of cross-validation splits used when testing each value of $k$ while finding the optimum number of neighbours for a given dataset
- One element of this algorithm that has not been made flexible in the `main.py` programme is the distance used to calculate the $k$-nearest neighbours; it is set by default to the Euclidean distance for this study. The distance calculation can be changed, but this requires development or modification of the `knn.py` module, which end users are free to do. A sketch of Euclidean-distance prediction is given after this list.
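The following is an illustrative numpy sketch of Euclidean-distance $k$-NN prediction, not the repository's `knn.py` implementation:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Predict labels by majority vote among the k Euclidean-nearest neighbours."""
    preds = np.empty(len(X_test), dtype=y_train.dtype)
    for i, x in enumerate(X_test):
        dists = np.linalg.norm(X_train - x, axis=1)        # Euclidean distances
        nearest = np.argsort(dists)[:k]                     # indices of the k nearest points
        preds[i] = np.bincount(y_train[nearest]).argmax()   # majority vote
    return preds
```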
For each of the deep learning models (neural network, convolutional neural network, and graph neural network) the same two hyperparameters are exposed; a sketch of where they enter a training loop is given after this list.

- `lr`: the learning rate, i.e. the step size used in the optimisation procedure when training the network towards a local minimum of the loss function assigned during training
- `epochs`: the number of full dataset passes to be run during the training procedure
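The sketch below shows where `lr` and `epochs` typically enter a PyTorch training loop; the optimiser, loss, and data loader choices here are illustrative placeholders, not necessarily those used in the repository:

```python
import torch

def train(model, train_loader, lr=1e-3, epochs=10):
    """Generic training loop: `epochs` full passes, step size controlled by `lr`."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):                      # one full pass over the data per epoch
        for images, labels in train_loader:
            optimiser.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()                      # backpropagate gradients
            optimiser.step()                     # parameter update of size set by lr
    return model
```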
Before running the code steps below, ensure that one has navigated to the data project's directory:
- Initialise the Python virtual environment as guided in Python virtual environment
- Set up the configuration according to the user's purposes using the configuration JSON file as described in Configuration, and set the relevant hyperparameters as described in Hyperparameters
- Change directories so that the user is in the source code directory
- Run the command `python3 main.py` to execute the main programme
- The programme will print output as it progresses, updating the user on what it has completed and on its progress through the different models' training procedures
- Once the programme has output `Completed programme`, all the outputs detailed in Outputs will be saved and ready for analysis
An example of the programme logging output produced by main.py can be viewed in this logged file. This is not a standard output of the programme, but is simply included in the repository for completeness and to aid end users.
The model parameters, or the models themselves, for each of the methods are saved in the saved-models folder:

- Gaussian naive-Bayes: the mean and standard deviation for each category in the dataset are saved as model parameters to a parameters file; the `smoothing` hyperparameter is also saved for completeness and potential later use
- $k$-nearest neighbours: the optimum value of $k$ found within the range `kmin` to `kmax`, and the number of splits used for cross-validation during the optimisation process, are saved as hyperparameters to the parameters file for completeness and potential later use
- Neural network: the model was implemented in PyTorch, so the entire trained model is saved as a `.pth` file for completeness and potential later use
- Convolutional neural network: the model was implemented in PyTorch, so the entire trained model is saved as a `.pth` file for completeness and potential later use
- Graph neural network: the model was implemented in PyTorch, so the entire trained model is saved as a `.pth` file for completeness and potential later use
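A minimal sketch of saving and later reloading one of these PyTorch models as a `.pth` file (the placeholder model and file name are illustrative):

```python
import torch

model = torch.nn.Linear(784, 10)  # placeholder standing in for a trained model

# Save the whole model object (not just its state_dict) to a .pth file.
torch.save(model, "neural-network.pth")

# Reload it later for inference; on recent PyTorch versions, loading a full
# pickled model may additionally require torch.load(..., weights_only=False).
model = torch.load("neural-network.pth")
model.eval()
```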
The learning process for each of the methods is saved, and the training progress can be viewed via the training curve graphs plotted in the learning-curves folder. The Gaussian naive-Bayes model does not have a learning curve output because the implemented algorithm has no iterative training process.
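An illustrative sketch of how such a training curve could be plotted and saved (the loss values and file name below are placeholders):

```python
import matplotlib.pyplot as plt

train_losses = [0.92, 0.61, 0.47, 0.40, 0.36]  # placeholder per-epoch losses
plt.plot(range(1, len(train_losses) + 1), train_losses, label="training loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.savefig("learning-curve-example.png")  # the repository writes to outputFigPath
```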
The main programme, main.py, produces a single prediction for each of the methods: a selected test image from the dataset is hard-coded into the programme for each method, and the image is saved together with its true classification and its predicted classification. Examples of the model prediction outputs can be found in the test-predictions folder.
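The sketch below illustrates how such a prediction image might be saved; the placeholder arrays and file name are assumptions, and the class names are the standard Fashion-MNIST labels:

```python
import numpy as np
import matplotlib.pyplot as plt

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

# Placeholders standing in for the real test set and a model's predictions.
X_test = np.random.rand(10, 784)
y_test = np.zeros(10, dtype=int)
y_pred = np.zeros(10, dtype=int)

idx = 0  # a hard-coded test image index, as described above
plt.imshow(X_test[idx].reshape(28, 28), cmap="gray")
plt.title(f"True: {class_names[y_test[idx]]} / Predicted: {class_names[y_pred[idx]]}")
plt.axis("off")
plt.savefig("prediction-example.png")  # the repository writes to outputValPath
```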
This section provides end users with resources for understanding the theory behind the implementations of the methods used in this project. For each method, background theory resources are provided in the form of papers, articles, and textbooks; alternative code implementations where possible; and, in the case of the graph neural network, previous implementations of the class from the presented article.
- Gaussian naive-Bayes
  - Textbook Chapters:
    - Müller, A. C., & Guido, S. (2016). Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly Media. Chapter 2.
    - Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 4.2.
  - Papers:
    - Langley, P., Iba, W., & Thompson, K. (1992). An Analysis of Bayesian Classifiers. In AAAI (pp. 223-228).
  - Alternative Implementation:
- $k$-nearest neighbours
  - Textbook Chapters:
    - Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 2.5.
    - Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer. Chapter 13.3.
  - Papers:
    - Cover, T., & Hart, P. (1967). Nearest Neighbor Pattern Classification. IEEE Transactions on Information Theory, 13(1), 21-27.
  - Alternative Implementation:
- Neural network
  - Textbook Chapters:
    - Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 6.
    - Nielsen, M. (2015). Neural Networks and Deep Learning. Chapter 1.
  - Papers:
    - Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature, 323(6088), 533-536.
  - Alternative Implementation:
- Convolutional neural network
  - Textbook Chapters:
    - Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 9.
    - Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media. Chapter 14.
  - Papers:
    - LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541-551.
  - Alternative Implementation:
- Graph neural network
  - Textbook Chapters:
    - Hamilton, W. L. (2020). Graph Representation Learning. Morgan & Claypool Publishers. Chapters 1-3.
  - Papers:
    - Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. arXiv:1706.03762.
    - Kipf, T. N., & Welling, M. (2016). Semi-Supervised Classification with Graph Convolutional Networks. arXiv:1609.02907.
  - Articles:
  - Implementation followed from graph neural network article: