-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
-max_target_seqs 1 is not guaranteed to find the best hit in the database even in -task blastn mode #322
Comments
Thanks for the report, @jianshu93. I am aware of this issue with At time of writing, the current release of If my understanding is incorrect (for instance, you may have an example pair of genomes for which the current ANIb implementation gives non-negligibly different results with One thing I would be interested to know is whether any such difference is greater than the variation in output observed when modifying ANIb parameters for fragment size, or the start position of the first fragment, when using the ANIb algorithm (as published). The relative instability of ANIb to these (arbitrary and heuristic) choices is one reason why I recommend that ANIm is used in preference to ANIb. L. |
Reminder (for myself) while I remember that we discussed adding in a check about this to |
Hello All, many thanks for those discussion. @widdowquinn , I agree with you generally and kostas and I actually had a very long discussion on this topic last week. In early days for legacy blastn and also blast version 2.6(blastN), setting max_targe_seq to 100 and 10 differ by a very small amount (the reason you provided) even for close related genomes like Shewanella Baltic OS195 and OS185 (I tested it using the enveomics package ani.rb L286: https://github.com/lmrodriguezr/enveomics/blob/master/Scripts/ani.rb). That's why we use 1 in the end, but at that time the issue has not been found (the bioinformatics paper you mentioned). The OrthoANI is a little bit different, usearch -search_local algorithm also offers the -maxaccecpts option to find only the highest identity match. The algorithm itself is not subject to the mentioned issue for blast because there is a different heuristics (I confirmed this with Robert Edgar for this. I think we should also add usearch -search_local support for pyani, free 32bit version is enough). For mummer, it performs bad around 70-75% ANI and we need to be very careful even ~80% ANI. All 3 tools, pyani, ani.rb and OrthoANI now use -max_targe_seqs 1/-maxaccecpts 1. My testing shows that they are nearly the same for the 2 close genomes mentioned. But apparently, pyani is much better for a lot of comparisons. For general purpose blastn, still the -max_targe_seqs 1 issue should be taken into account. e.g., metagenome contigs, reads mapping to those metagenome contigs (there might be mixed close genomes) using blastn will not necessarily found the best identity match when set to 1. Many thanks, Jianshu |
Summary:
Line 434-439 in anib.py, blastn options -max_target_seqs 1 is not guaranteed to find the best hit in the database even in -task blastn mode
Description:
Database search heuristics implemented in the original blastn and megablast does not guarante that you will find the best hit if max_target_seqs only set to 1. blastn help info use a default of 500 to increase the probability that it can find the best hits and then filter the output according to identity for each query. ANI calculator by the kostas lab use this idea to increase the probability to find the best hit.
Reproducible Steps:
Please report a series of steps (e.g. command-line instructions) that would enable someone else to reproduce the issue. If possible, please attach a small set of input files/data that would be sufficient to allow someone else to reproduce the issue.
If it's not possible to reproduce the issue, please include a description of how you discovered the issue.
Current Output:
The output you get from
pyani
or at the terminal, with this issue. It is very useful to know the current "wrong" behavior on your system.Expected Output:
Describe what you expect the output to be. It is also very useful to know the expected "correct" behavior on your systems.
-max_target_seqs should be increased to 100 or so to increase the probability of finding best hits and then filter according to identity and alignment ratio for each query.
pyani Version:
current GitHub version
installed dependencies
If you are running a version of
pyani
v0.3 or later, then please run the commandpyani listdeps
at the command line, and enter the output below.current git version
Python Version:
3.8
Operating System:
MacOS
Jianshu Kostas lab
The text was updated successfully, but these errors were encountered: