Collection of scripts being used in building Deep Neural Network Tensorflow Model for viral genomes classification.
From data fetching, through clearing and storing the unified dataset within in-file DB, to Estimator based production-ready prediction classificator.
The genomes contains raw biological information about species and could be reasonable factor for classification based on chains of nucleotides.
Where the human would find it difficult to maintain, the machine (ML) can use it's operational capabilities to predicate basing on genomes like DNA/RNA.
After many adjustments to fit the case into rough borders of minimal resources and simplicity (just for experimentation & personal research), the gathered genomes of viruses are used to predicate by it into 7 taxonomy groups (Baltimore Classification):
I: dsDNA viruses (e.g. Adenoviruses, Herpesviruses, Poxviruses)
II: ssDNA viruses (+ strand or "sense") DNA (e.g. Parvoviruses)
III: dsRNA viruses (e.g. Reoviruses)
IV: (+)ssRNA viruses (+ strand or sense) RNA (e.g. Picornaviruses, Togaviruses)
V: (−)ssRNA viruses (− strand or antisense) RNA (e.g. Orthomyxoviruses, Rhabdoviruses)
VI: ssRNA-RT viruses (+ strand or sense) RNA with DNA intermediate in life-cycle (e.g. Retroviruses)
VII: dsDNA-RT viruses DNA with RNA intermediate in life-cycle (e.g. Hepadnaviruses)
Accomplished partially with less than moderate results.
Will install python setup based on pip's freezed requirements.txt
.
make install
Fetches viruses genomes and metadata to text files in specific specie directory.
make fetch_data
Processes saved files from fetch
and stores them normalized and unified in JSON format file (TinyDB).
make store_into_db
Train the neural network and watch the prediction result 👊
make train
Show basic information about stored species and their genomes (eg count, largest genome, groups counts etc).
make analyze_data