Analyzing single nucleotide polymorphism (SNP) frequencies from baculovirus sequencing data

bacsnp - Deciphering baculovirus populations

This repository accompanies the following publication and provides the code, data, and tools described therein:

  • Wennmann, J.T., Fan, J., Jehle, J.A. (2020). Bacsnp: Using Single Nucleotide Polymorphism (SNP) Specificities and Frequencies to Identify Genotype Composition in Baculoviruses. Viruses, https://doi.org/10.3390/v12060625.

Introduction

Baculoviruses and other nuclear arthropod-specific large dsDNA viruses (NALDV) have large genomes of up to hundreds of kbp with up to hundreds of open reading frames (ORFs). The size of the genome makes it very difficult to analyse the composition of virus populations, since current sequencing techniques either cannot (Illumina sequencing) or can only with great difficulty (Nanopore sequencing) sequence entire genomes or significant genome fragments. Genetic markers such as insertions/deletions or single nucleotide polymorphisms (SNPs) are usually used for virus population analysis. SNPs are particularly suitable for analysing intra-isolate variation, as many established bioinformatic tools and workflows are available to detect SNPs from raw sequencing data (usually in fastq or fastqsanger format).

In virus populations, SNP positions can be used to count the frequency of the occurring nucleotides. In theory, A, T, G and C can occur in any position. The nucleotides can be determined using the sequencing data because not only one genome but a multitude of genomes is sequenced, which reflects the virus population including its genetic variations.
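The counting step can be sketched in a few lines of base R. The read counts below are made-up illustrative values, not data from the publication, and the variable names are hypothetical:

```r
# Hypothetical read counts per nucleotide at one SNP position (illustrative values).
counts <- c(A = 120, T = 0, G = 80, C = 0)

# Relative frequency of each nucleotide among all reads covering the position.
freqs <- counts / sum(counts)

# Nucleotides that were actually observed at this position.
observed <- names(freqs)[freqs > 0]

print(round(freqs, 2))
print(observed)
```

In this sketch, two nucleotides share the position, which is what one would expect if the sequenced population contains at least two genotypes.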

If a SNP position is specific to a sequenced sample or isolate, then this position and the nucleotide occurring there can be used as a marker to detect this isolate or sample in other sequenced samples, even if it is only present in a mixture with another type.
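The marker idea can be illustrated with a small base R sketch. This is not the algorithm implemented in bacsnp; the frequency matrix, the isolate and position names, and the `min_freq` cut-off are all assumptions chosen for illustration:

```r
# Hypothetical alternative-nucleotide frequencies
# (rows: SNP positions, columns: sequenced isolates).
alt_freq <- matrix(
  c(0.98, 0.00, 0.00,
    0.00, 0.95, 0.97,
    0.45, 0.50, 0.55),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("pos_1021", "pos_5530", "pos_9912"),
                  c("isoA", "isoB", "isoC"))
)

# Assumed cut-off: the alternative nucleotide counts as present above this frequency.
min_freq <- 0.05

# Label each position with the isolates that carry the alternative nucleotide.
specificity <- apply(alt_freq, 1, function(f) {
  paste(colnames(alt_freq)[f > min_freq], collapse = "+")
})

# Positions whose alternative nucleotide occurs in exactly one isolate
# are candidate markers for that isolate.
markers <- names(specificity)[rowSums(alt_freq > min_freq) == 1]

print(specificity)
print(markers)
```

Under these assumptions, only the first position would serve as a marker for a single isolate; the other two are shared between isolates and cannot distinguish them on their own.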

The bacsnp tool was developed to assign specificities to SNP positions. The tool is written in the R programming language and uses Variant Call Format (VCF) files generated by mpileup as input.

Requirements

The bacsnp tool is developed in the R programming language, so make sure that R is installed and working. In addition, the devtools, vcfR and ggplot2 packages are required for bacsnp to work.

if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}

if (!requireNamespace("vcfR", quietly = TRUE)) {
  install.packages("vcfR")
}

Installation

There are two ways to install the bacsnp package:

  1. Installation from GitHub (latest version)

    Until the package is available on CRAN, you can install the most recent version directly from GitHub. To do this, you need the devtools package (see Requirements), which provides the install_github function:

    library(devtools)
    
    install_github("https://github.com/wennj/bacsnp", build_vignettes = TRUE)

    The build_vignettes = TRUE argument ensures that the vignette (a detailed introduction and documentation of the package) is also installed.

  2. Installation from CRAN (planned for a future release)

    The bacsnp package is not yet available on CRAN. Once it is published, you will be able to install it directly from the R console with the following command:

    install.packages("bacsnp") # coming soon

Starting the bacsnp package

After installation, you can load and use the package as usual:

library(bacsnp)

All required dependencies, such as vcfR and ggplot2, will be automatically installed during the installation process if they are not already available on your system.


Example workflow

Here, I would like to show you an example of how to analyse the composition of a baculovirus isolate using previously sequenced isolates. The example workflow can be transferred to your own analysis.

The core of the analysis is the set of variable SNP positions, which act as markers and can be specific for certain isolates. To determine variable SNP positions across several sequenced isolates, mpileup and bcftools are used. The output is in Variant Call Format (VCF), which is the required input for the bacsnp tool. To create the VCF file, sequencing data can be processed with usegalaxy.eu. A Galaxy platform is particularly suitable for beginners and for people who do not have sufficient computing capacity or do not regularly perform bioinformatic analyses.
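To give an idea of what such a VCF file contains, the following base R sketch parses a tiny, made-up VCF fragment and derives the alternative-nucleotide frequency per isolate. The records, the sample names, and the presence of an AD (allelic depth) tag in the FORMAT column are assumptions; the actual tags depend on the mpileup settings used in the workflow. For real files, the vcfR package listed under Requirements offers read.vcfR() and extract.gt() for this purpose.

```r
# A tiny, made-up VCF fragment with two isolates (values are illustrative only).
# Assumption: each sample field is "GT:AD" with AD = "ref_depth,alt_depth".
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tisoA\tisoB",
  "ref1\t1021\t.\tA\tG\t200\t.\tDP=300\tGT:AD\t1:2,148\t0:150,0",
  "ref1\t5530\t.\tC\tT\t180\t.\tDP=250\tGT:AD\t0:100,0\t1:30,120"
)

# Drop the meta lines (##) and read the rest as a tab-separated table.
body <- vcf_lines[!grepl("^##", vcf_lines)]
vcf <- read.delim(text = body, check.names = FALSE, colClasses = "character")

samples <- c("isoA", "isoB")

# For each sample: strip the GT field, split AD into ref/alt depth,
# and compute the share of reads supporting the alternative nucleotide.
alt_freq <- sapply(samples, function(s) {
  ad <- sub("^[^:]*:", "", vcf[[s]])
  depth <- do.call(rbind, lapply(strsplit(ad, ","), as.numeric))
  depth[, 2] / rowSums(depth)
})
rownames(alt_freq) <- vcf$POS

print(round(alt_freq, 2))
```

The resulting matrix has one row per variable position and one column per isolate, which is the kind of table a specificity assignment can start from.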

Galaxy Workflow

The bioinformatic pipeline for variable SNP position determination involves only three tools:

How the tools interact and process the sequencing data can be seen in the Galaxy workflow itself. There, you will also find all the individual parameters that have been set for each tool.

Link to Galaxy workflow file.

Link to Galaxy workflow on usegalaxy.eu.
