SIGI-HMM crash with some datasets #135
@klgray25 |
It appears 15584_genome.gbk will crash SigiHMM. The catch is the tool used to convert from gbk to embl: Bioperl generates slightly different (possibly older) embl output than Biopython. IslandCompare uses Biopython for the conversion, while my test originally used Bioperl. I can now consistently reproduce this for SigiHMM version 3.8.
SigiHMM 4.0 also can't parse the Biopython output, but rather than crashing it emits "This line could not be parsed: seq:caa)" and discards a large number of islands. This line is from 15584_genome.embl from Biopython:
The matching line from 15584_genome.embl from Bioperl:
The difference between Biopython and Bioperl is that Bioperl quotes anticodons if they wrap. Moving forward we have three options:
I'll need @fionabrinkman and @cbertell to weigh in on this. |
Yes, this crashing issue is known.
Claire, can you confirm which SigiHMM version you wanted, as there seems to be some uncertainty on this? I do think it’s confusing to have IslandViewer and IslandCompare make different predictions, but I may have forgotten a conversation about this and any plans to change IslandCompare too.
… On Jun 5, 2019, at 4:45 PM, Nolan Woods ***@***.***> wrote:
It appears 15584_genome.gbk will crash SigiHMM. The catch is the tool being used to convert from gbk to embl. Bioperl generates slightly different (possibly older) embl output than Biopython. IslandCompare uses Biopython for the conversion while my test originally used Bioperl. I can now consistently reproduce this for SigiHMM version 3.8.
SigiHMM 4.0 also can't parse the Biopython output, but emits the following rather than crashing:
This line could not be parsed: seq:caa)
and discards a large number of islands. This line is from 15584_genome.embl from Biopython:
FT /anticodon=(pos:complement(1123552..1123554),aa:Leu,
FT seq:caa)
The matching line from 15584_genome.embl from Bioperl:
FT /anticodon=(pos:complement(1123552..1123554),aa:Leu,seq:caa
FT )
Moving forward we have two options:
Add a perl dependency specifically for SigiHMM (yuck)
Discard the offending anticodon annotations and keep biopython
The difference between Biopython and Bioperl seems to be the criteria they use when determining where to wrap lines. Bioperl could potentially have this issue too if the anticodon annotations are longer in other datasets. I will contact GenBank, as I believe the anticodon annotations do not actually conform to their own standard.
I'll need @fionabrinkman and @cbertell to weigh in on this.
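For anyone wanting to check other converted files for the same pattern, here is a minimal sketch (not part of SigiHMM, Biopython, or Bioperl) that scans EMBL feature-table lines for unquoted /anticodon values that span more than one FT line. The function name `find_wrapped_anticodons` is hypothetical, and the sketch assumes a value continues onto the next FT line for as long as its parentheses are unbalanced.

```python
import re

# Matches the start of an /anticodon qualifier on an EMBL feature-table line.
ANTICODON = re.compile(r"^FT\s+/anticodon=(.*)$")
# Matches any feature-table line and captures its text.
FT_TEXT = re.compile(r"^FT\s+(.*)$")


def paren_depth(text):
    """Number of parentheses still open at the end of text."""
    return text.count("(") - text.count(")")


def find_wrapped_anticodons(lines):
    """Return (line_number, joined_value) for each unquoted /anticodon
    value that wraps across FT lines. Line numbers are 1-based.

    Assumption: continuation lines are detected by unbalanced parentheses.
    """
    hits = []
    i = 0
    while i < len(lines):
        match = ANTICODON.match(lines[i])
        if not match:
            i += 1
            continue
        start = i
        value = match.group(1)
        # Consume following FT lines until the parentheses balance.
        while paren_depth(value) > 0 and i + 1 < len(lines):
            nxt = FT_TEXT.match(lines[i + 1])
            if not nxt:
                break
            value += nxt.group(1).strip()
            i += 1
        if i > start and not value.startswith('"'):
            hits.append((start + 1, value))
        i += 1
    return hits


if __name__ == "__main__":
    # The two wrap styles quoted in this thread.
    biopython_style = [
        "FT /anticodon=(pos:complement(1123552..1123554),aa:Leu,",
        "FT seq:caa)",
    ]
    bioperl_style = [
        "FT /anticodon=(pos:complement(1123552..1123554),aa:Leu,seq:caa",
        "FT )",
    ]
    print(find_wrapped_anticodons(biopython_style))
    print(find_wrapped_anticodons(bioperl_style))
```

Both wrap styles quoted above are reported, since either one leaves a parenthesis open at the end of the first line.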
|
Kristen, I should really have said “Claire and Kristen” below, not just Claire!
|
Can we discuss the update in the relevant thread? Mixing it in here makes it difficult to follow the issue. |
The GenBank/EMBL format specification does not say how line wrapping should be handled. I have contacted NIH requesting clarification; they have been good at getting back quickly in the past. If this is a bug in Biopython, I would prefer to fix it rather than pursue the alternatives. Edit: I corrected a mistake in my post above analysing the difference between Biopython and Bioperl output. I must have opened the wrong file when I inspected the Bioperl output, because when I reran it today all wrapping /anticodon annotations were quoted. |
A change to the biopython converter has resolved the issue for now. |
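The thread does not show the actual converter change. Purely as an illustration of the "discard the offending anticodon annotations" option, a text-level sketch that drops /anticodon qualifiers (including any wrapped continuation lines) from EMBL output could look like the following. `strip_anticodon_qualifiers` is a hypothetical helper, and it assumes continuation lines can be recognised by unbalanced parentheses.

```python
import re

# Start of an /anticodon qualifier on an EMBL feature-table line.
ANTICODON = re.compile(r"^FT\s+/anticodon=(.*)$")
# Any feature-table line; captures its text.
FT_TEXT = re.compile(r"^FT\s+(.*)$")


def _depth(text):
    """Net count of parentheses opened in text."""
    return text.count("(") - text.count(")")


def strip_anticodon_qualifiers(lines):
    """Return lines with every /anticodon qualifier removed.

    Assumption: a wrapped value continues on following FT lines until
    its parentheses balance.
    """
    kept = []
    depth = 0  # > 0 while inside a wrapped /anticodon value
    for line in lines:
        if depth > 0:
            cont = FT_TEXT.match(line)
            if cont:
                # Still inside the wrapped value: drop this line too.
                depth += _depth(cont.group(1))
                continue
            depth = 0  # not an FT line, so the value ended unexpectedly
        start = ANTICODON.match(line)
        if start:
            depth = _depth(start.group(1))
            continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    # The /anticodon lines are from this thread; the /product line is a
    # hypothetical neighbour added for illustration.
    sample = [
        "FT /anticodon=(pos:complement(1123552..1123554),aa:Leu,",
        "FT seq:caa)",
        'FT /product="tRNA-Leu"',
    ]
    for line in strip_anticodon_qualifiers(sample):
        print(line)
```

Dropping the qualifier loses the anticodon annotation in the converted file, so this only makes sense if downstream tools do not rely on it.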
SigiHMM 3.8 is now crashing for a different reason.
This exception is thrown for data from both bioperl and biopython. |
This issue is sidestepped by filtering failed datasets from the workflow. Referencing relevant issues and closing: |
SIGI-HMM inconsistently crashes with some datasets.
#89 #134