SIGI-HMM crash with some datasets #135

Closed
innovate-invent opened this issue Dec 13, 2018 · 9 comments

@innovate-invent
Collaborator

SIGI-HMM inconsistently crashes with some datasets.
#89 #134

@innovate-invent
Collaborator Author

@klgray25
Colombo v3.8 did not fail for any of the data:
15584_genome.gbk 15597_genome.gbk 15598_genome.gbk 15599_genome.gbk 15600_genome.gbk 15602_genome.gbk

@innovate-invent
Collaborator Author

innovate-invent commented Jun 5, 2019

It appears 15584_genome.gbk will crash SigiHMM. The catch is the tool used to convert from gbk to embl: Bioperl generates slightly different (possibly older) embl output than Biopython. IslandCompare uses Biopython for the conversion, while my test originally used Bioperl. I can now consistently reproduce this for SigiHMM version 3.8.
SigiHMM 4.0 also can't parse the Biopython output, but it emits the following message rather than crashing:

This line could not be parsed:                 seq:caa)

and discards a large number of islands. This line is from 15584_genome.embl as generated by Biopython:

FT                   /anticodon=(pos:complement(1123552..1123554),aa:Leu,
FT                   seq:caa)

The matching line from 15584_genome.embl from Bioperl:

FT                   /anticodon="(pos:complement(1123552..1123554),aa:Leu,seq:ca
FT                   a)"

The difference between biopython and bioperl is that bioperl quotes /anticodon values when they wrap across lines.
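
To check which converted files are affected, something like the following could scan an EMBL feature table for qualifiers whose value wraps onto a continuation line without being quoted. This is only a minimal sketch: the function name is made up for illustration, and it assumes the standard FT layout in which qualifiers and their continuation lines start at column 22 ("FT" followed by 19 spaces). It has not been checked against SigiHMM's actual parser.

```python
import re

# Assumption: qualifier lines look like "FT" + 19 spaces + "/name=value".
QUALIFIER_START = re.compile(r"^FT {19}/(\w+)=(.*)$")
# Assumption: continuation lines share the indent but do not start with "/".
CONTINUATION = re.compile(r"^FT {19}(?!/)(.*)$")


def find_unquoted_wrapped_qualifiers(lines):
    """Return (line_number, qualifier_name) for each qualifier whose value
    continues on the next line without an opening double quote."""
    hits = []
    current = None  # (line_no, name, is_quoted) of the qualifier being read
    for line_no, line in enumerate(lines, 1):
        line = line.rstrip("\n")
        match = QUALIFIER_START.match(line)
        if match:
            current = (line_no, match.group(1), match.group(2).startswith('"'))
            continue
        if CONTINUATION.match(line) and current is not None:
            if not current[2]:
                hits.append((current[0], current[1]))
                current = None  # report each qualifier only once
            continue
        # Feature key lines, non-FT lines, and location continuations
        # reset the state.
        current = None
    return hits
```

Run over the two snippets above, the Biopython form would be reported and the Bioperl form would not, since its value opens with a quote.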

Moving forward we have three options:

  • Add a perl dependency specifically for SigiHMM (yuck)
  • Discard the offending anticodon annotations and keep biopython
  • Change biopython to line wrap the same as bioperl (this may actually be a bug in biopython)
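
For reference, the second option could be done as a plain text filter on the converted EMBL file, without touching the converter itself. The sketch below is an illustration only (the function name and its `name` parameter are hypothetical), and it assumes the same column-22 qualifier layout as the snippets above. It drops the qualifier line and any continuation lines that belong to it.

```python
def strip_qualifier(lines, name="anticodon"):
    """Return the lines with every /<name> qualifier (and its continuation
    lines) removed from the EMBL feature table."""
    prefix = "FT" + " " * 19  # assumed qualifier/continuation indent
    kept = []
    skipping = False
    for line in lines:
        if line.startswith(prefix):
            body = line[len(prefix):]
            if body.startswith("/"):
                # A new qualifier starts: skip it only if it is the target.
                # Caveat: a wrapped quoted value whose continuation happens
                # to begin with "/" would be misread as a new qualifier.
                skipping = (body.startswith(f"/{name}=")
                            or body.rstrip() == f"/{name}")
            # Continuation lines inherit the current skipping state.
            if skipping:
                continue
        else:
            # Feature key lines and non-FT lines end any skipped qualifier.
            skipping = False
        kept.append(line)
    return kept
```

The trade-off is that the anticodon information is lost from the EMBL copy, although the original gbk file is left untouched.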

I'll need @fionabrinkman and @cbertell to weigh in on this.

innovate-invent added this to the MVP Release milestone on Jun 5, 2019
@fionabrinkman

fionabrinkman commented Jun 6, 2019 via email

@fionabrinkman

fionabrinkman commented Jun 6, 2019 via email

@innovate-invent
Copy link
Collaborator Author

innovate-invent commented Jun 6, 2019

Can we discuss the update in the relevant thread? Mixing it in here makes this issue difficult to follow.
SIGI update discussion: #134

@innovate-invent
Collaborator Author

innovate-invent commented Jun 6, 2019

How line wrapping is handled is not covered by the GenBank/EMBL format specification. I have contacted NIH requesting clarification. They have been good in the past at getting back quickly. If this is a bug in biopython, I would prefer to fix it rather than pursue the alternatives.

Edit: I corrected a mistake in my post above analysing the difference between biopython and bioperl output. I must have opened the wrong file when I inspected the bioperl output, because when I reran it today all wrapping /anticodon annotations were quoted.

@innovate-invent
Collaborator Author

innovate-invent commented Jun 6, 2019

A change to the biopython converter has resolved the issue for now.
brinkmanlab/galaxy-tools@164d4c3

@innovate-invent
Collaborator Author

SigiHMM 3.8 is now crashing for a different reason.
Something about 15596_genome.gbk is throwing:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 10
	at java.lang.String.substring(String.java:1963)
	at io.RowEMBL.parseSingle(RowEMBL.java:409)
	at io.RowEMBL.parseLocation(RowEMBL.java:370)
	at io.RowEMBL.setLocation(RowEMBL.java:198)
	at io.FileEMBL.parse(FileEMBL.java:331)
	at io.BatchFileReader.readFile(BatchFileReader.java:57)
	at GenericBatch.execute(GenericBatch.java:140)
	at SigiHMM.main(SigiHMM.java:10)

This exception is thrown for data from both bioperl and biopython.
SigiHMM version 4.0 does not throw this exception.

@innovate-invent
Collaborator Author

This issue is sidestepped by filtering failed datasets out of the workflow.
Issue #134 is now the best course of action.
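
For anyone running the tool outside the workflow, the same idea can be approximated with a small wrapper that runs each dataset separately and keeps only those where the process exits cleanly. This is a rough sketch, not what the workflow actually does: `filter_failed` and `build_command` are hypothetical names, and the real SigiHMM command line is deliberately left to the caller because it is not given in this thread. It relies on an uncaught Java exception, like the one above, producing a non-zero exit status.

```python
import subprocess


def filter_failed(datasets, build_command):
    """Run one command per dataset and split the datasets by exit status.

    build_command is a caller-supplied (hypothetical) function that maps a
    dataset path to the argument list to execute for it.
    """
    passed, failed = [], []
    for path in datasets:
        result = subprocess.run(build_command(path),
                                capture_output=True, text=True)
        # A non-zero return code is treated as a failed dataset.
        (passed if result.returncode == 0 else failed).append(path)
    return passed, failed
```

Downstream steps would then only receive the `passed` list, while `failed` can be logged for later inspection.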

Referencing relevant issues and closing:
biopython/biopython#2112
biojava/biojava#843
