pyani anim loads sequences into database before checking class/label files #129
@widdowquinn, Is the checking you mean here the The only thing that looks like it deals with the formatting of those
def add_run_genomes(
    session, run, indir: Path, classpath: Path, labelpath: Path, **kwargs
) -> List:
    """Add genomes for a run to the database.

    :param session:  live SQLAlchemy session of pyani database
    :param run:  Run object describing the parent pyani run
    :param indir:  path to the directory containing genomes
    :param classpath:  path to the file containing class information for each genome
    :param labelpath:  path to the file containing class information for each genome

    This function expects a single directory (indir) containing all FASTA files
    for a run, and optional paths to plain text files that contain information
    on class and label strings for each genome.

    If the genome already exists in the database, then a Genome object is recovered
    from the database. Otherwise, a new Genome object is created. All Genome objects
    will be associated with the passed Run object.

    The session changes are committed once all genomes and labels are added to the
    database without error, as a single transaction.
    """
    # Get list of genome files and paths to class and labels files
    infiles = get_fasta_and_hash_paths(indir)  # paired FASTA/hash files
    class_data = {}  # type: Dict[str,str]
    label_data = {}  # type: Dict[str,str]
    all_keys = []  # type: List[str]
    if classpath:
        class_data = load_classes_labels(classpath)
        all_keys += list(class_data.keys())
    if labelpath:
        label_data = load_classes_labels(labelpath)
        all_keys += list(label_data.keys())

    # Make dictionary of labels and/or classes
    new_keys = set(all_keys)
    label_dict = {}  # type: Dict
    for key in new_keys:
        label_dict[key] = LabelTuple(label_data[key] or "", class_data[key] or "")

    # Get hash and sequence description for each FASTA/hash pair, and add
    # to current session database
    genome_ids = []
    for fastafile, hashfile in infiles:
        try:
            inhash, _ = read_hash_string(hashfile)
            indesc = read_fasta_description(fastafile)
        except Exception:
            raise PyaniORMException("Could not read genome files for database import")
        abspath = fastafile.absolute()
        genome_len = get_genome_length(abspath)

        # If the genome is not already in the database, add it as a Genome object
        genome = session.query(Genome).filter(Genome.genome_hash == inhash).first()
        if not isinstance(genome, Genome):
            try:
                genome = Genome(
                    genome_hash=inhash,
                    path=str(abspath),
                    length=genome_len,
                    description=indesc,
                )
                session.add(genome)
            except Exception:
                raise PyaniORMException(f"Could not add genome {genome} to database")

        # Associate this genome with the current run
        try:
            genome.runs.append(run)
        except Exception:
            raise PyaniORMException(
                f"Could not associate genome {genome} with run {run}"
            )

        # If there's an associated class or label for the genome, add it
        if inhash in label_dict:
            try:
                session.add(
                    Label(
                        genome=genome,
                        run=run,
                        label=label_dict[inhash].label,
                        class_label=label_dict[inhash].class_label,
                    )
                )
            except Exception:
                raise PyaniORMException(
                    f"Could not add labels for {genome} to database."
                )
        genome_ids.append(genome.genome_id)

    try:
        session.commit()
    except Exception:
        raise PyaniORMException("Could not commit new genomes in database.")

    return
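A side note on the snippet above: the loop that builds label_dict indexes both label_data and class_data with every key from their union, so a key present in only one of the two files (or a run given only one of the two paths) would appear to raise KeyError. The following is a minimal sketch of a more tolerant merge, not pyani's actual code. LabelTuple is a local stand-in namedtuple carrying the two fields the snippet uses, and merge_labels_classes is a hypothetical helper name.

```python
from collections import namedtuple
from typing import Dict

# Stand-in for pyani's LabelTuple (assumed fields: label, class_label)
LabelTuple = namedtuple("LabelTuple", "label class_label")


def merge_labels_classes(
    label_data: Dict[str, str], class_data: Dict[str, str]
) -> Dict[str, LabelTuple]:
    """Combine label and class mappings keyed by genome hash.

    A key missing from either mapping gets an empty string rather than
    raising KeyError.
    """
    merged = {}
    for key in set(label_data) | set(class_data):
        merged[key] = LabelTuple(
            label_data.get(key) or "", class_data.get(key) or ""
        )
    return merged
```

With this shape, a genome that has a class but no label still gets an entry, with the label left as an empty string.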
@widdowquinn is this issue still relevant?
I think I was referring to the order of processing, which was, approximately: (1)
This meant that it was possible to generate a new row in the database, but then for the run to fail because of a formatting or other error in the labels/classes files. This could probably be dealt with at the same time as #136.
A more sensible order of processing would be: (2)
It's still relevant if the order of operations looks like (1) and not like (2).
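The reordering this issue asks for, checking the class/label files before anything is written to the database, could be sketched roughly as below. This is an illustration under stated assumptions, not the pyani implementation: the three-column tab-separated layout (hash, filename, label) is assumed, and parse_label_file, validate_then_load, LabelFileFormatError and the load_genomes callback are all hypothetical names.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


class LabelFileFormatError(ValueError):
    """Raised when a class/label file line is malformed (hypothetical)."""


def parse_label_file(path: Path) -> Dict[str, str]:
    """Parse rows of <hash>TAB<filename>TAB<label> (assumed layout)."""
    data = {}
    with path.open() as handle:
        for lineno, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line.strip():
                continue  # ignore blank lines
            fields = line.split("\t")
            if len(fields) != 3 or not all(f.strip() for f in fields):
                raise LabelFileFormatError(
                    f"{path}:{lineno}: expected 3 non-empty tab-separated fields"
                )
            genome_hash, _filename, label = fields
            data[genome_hash] = label
    return data


def validate_then_load(
    classpath: Optional[Path],
    labelpath: Optional[Path],
    load_genomes: Callable[[Dict[str, str], Dict[str, str]], object],
):
    """Parse both metadata files first; touch the database only on success."""
    class_data = parse_label_file(classpath) if classpath else {}
    label_data = parse_label_file(labelpath) if labelpath else {}
    # Reached only if both files parsed cleanly, so no rows are created
    # for a run that would later fail on a formatting error.
    return load_genomes(class_data, label_data)
```

Here load_genomes stands in for whatever step adds genomes to the session; a malformed file raises before that step is ever called.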
Summary:
The pyani anim command should check that the class/label files are correctly formatted before committing sequences to the database; this would save time.
Description:
Currently, all sequences are processed and loaded into the database, and then label/class files are checked. This slows things down if there's an error. It would be better to check the formats first.
pyani Version:
v0.3.0dev
Python Version:
3.6
Operating System:
CentOS6