Improve overall stats, fix test_filter issue #326

Merged: 8 commits, Feb 3, 2025
CHANGELOG.md (6 additions, 3 deletions)

@@ -9,7 +9,7 @@

### `Added`

- [#320](https://github.com/nf-core/metatdenovo/pull/320) improvements to Diamond taxonomy plus documentation
- [#320](https://github.com/nf-core/metatdenovo/pull/320) added taxonomy directly with Diamond, part 2
- [#312](https://github.com/nf-core/metatdenovo/pull/312) added taxonomy directly with Diamond, see `--diamond_dbs`
- [#286](https://github.com/nf-core/metatdenovo/pull/286) added an option to save the fasta file output from the formatspades.nf module
- [#285](https://github.com/nf-core/metatdenovo/pull/285) added nf-test for default settings.
@@ -18,6 +18,7 @@

### `Changed`

- [#326](https://github.com/nf-core/metatdenovo/pull/326) - Clean up overall stats table
- [#323](https://github.com/nf-core/metatdenovo/pull/323) - Modified param names for input of assembly and ORFs; added name params for output file naming
- [#323](https://github.com/nf-core/metatdenovo/pull/323) - Removed default for `assembler` and `orf_caller` parameters
- [#311](https://github.com/nf-core/metatdenovo/pull/311) - Update modules and subworkflows
@@ -29,8 +30,10 @@

### `Fixed`

- [#305](https://github.com/nf-core/ampliseq/pull/681) - Make EUKulele counts output optional as it's not always created
- [#269](https://github.com/nf-core/ampliseq/pull/681) - Make Transdecoder work better with `-resume`
- [#326](https://github.com/nf-core/metatdenovo/pull/326) - Fix resources for test cases
- [#326](https://github.com/nf-core/metatdenovo/pull/326) - Fix output file names for Eukulele and Kofamscan
- [#305](https://github.com/nf-core/metatdenovo/pull/305) - Make EUKulele counts output optional as it's not always created
- [#269](https://github.com/nf-core/metatdenovo/pull/269) - Make Transdecoder work better with `-resume`

### `Dependencies`

conf/modules.config (3 additions, 3 deletions)

@@ -272,13 +272,13 @@ process {
path: { "${params.outdir}/summary_tables/" },
pattern: "kofamscan.tsv.gz",
mode: params.publish_dir_mode,
saveAs: { filename -> "${params.assembly ? 'user_assembly' : params.assembler}.${params.gff ? 'user_orfs' : params.orf_caller}.${filename}" }
saveAs: { filename -> "${params.assembler ?: params.user_assembly_name}.${params.orf_caller ?: params.user_orfs_name}.${filename}" }
],
[
path: { "${params.outdir}/kofamscan/" },
pattern: "kofamscan_output.tsv.gz",
mode: params.publish_dir_mode,
saveAs: { filename -> "${params.assembly ? 'user_assembly' : params.assembler}.${params.gff ? 'user_orfs' : params.orf_caller}.${filename}" }
saveAs: { filename -> "${params.assembler ?: params.user_assembly_name}.${params.orf_caller ?: params.user_orfs_name}.${filename}" }
]
]
}
@@ -337,7 +337,7 @@ process {
path: { "${params.outdir}/summary_tables" },
mode: params.publish_dir_mode,
pattern: '*.tsv.gz',
saveAs: { filename -> "${params.assembly ? 'user_assembly' : params.assembler}.${params.gff ? 'user_orfs' : params.orf_caller}.${params.eukulele_db ?: 'userdb'}.eukulele.taxonomy.tsv.gz" }
saveAs: { filename -> "${params.assembler ?: params.user_assembly_name}.${params.orf_caller ?: params.user_orfs_name}.${params.eukulele_db ?: 'userdb'}.eukulele.taxonomy.tsv.gz" }
]
}
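The renamed `saveAs` closures above drop the ternary checks on `params.assembly`/`params.gff` in favour of the Groovy Elvis operator, falling back to the user-supplied name parameters introduced in #323. A minimal sketch of how the new expression resolves; the parameter values here are hypothetical, not the pipeline's defaults:

```groovy
// Sketch: the Elvis operator (?:) returns its left operand unless it is
// null or empty, so a configured assembler wins over the user-assembly name.
def params = [assembler: null, user_assembly_name: 'myassembly',
              orf_caller: 'prodigal', user_orfs_name: null]
def filename = 'kofamscan.tsv.gz'
def name = "${params.assembler ?: params.user_assembly_name}." +
           "${params.orf_caller ?: params.user_orfs_name}.${filename}"
assert name == 'myassembly.prodigal.kofamscan.tsv.gz'
```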

conf/test_eggnog.config (8 additions)

@@ -10,6 +10,14 @@
----------------------------------------------------------------------------------------
*/

process {
resourceLimits = [
cpus: 4,
memory: '15.GB',
time: '1.h'
]
}
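This `resourceLimits` block, repeated in each test profile in this PR, is the "Fix resources for test cases" entry from the changelog: since Nextflow 24.04, `resourceLimits` caps whatever a process requests, so test runs fit on small CI machines instead of failing with unsatisfiable requests. A sketch of the effect, with a hypothetical process selector and request values:

```groovy
process {
    resourceLimits = [ cpus: 4, memory: '15.GB', time: '1.h' ]

    // Hypothetical process whose request exceeds the limits above:
    // Nextflow clamps the effective values to 4 CPUs / 15.GB / 1.h
    // rather than submitting a job the runner cannot schedule.
    withName: 'MEGAHIT' {
        cpus   = 12
        memory = '72.GB'
        time   = '8.h'
    }
}
```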

params {
config_profile_name = 'Test eggnog profile'
config_profile_description = 'Minimal test dataset to check pipeline with eggnog function added'
conf/test_eukulele.config (8 additions)

@@ -10,6 +10,14 @@
----------------------------------------------------------------------------------------
*/

process {
resourceLimits = [
cpus: 4,
memory: '15.GB',
time: '1.h'
]
}

params {
config_profile_name = 'Test profile for eukulele taxonomic annotation'
config_profile_description = 'Minimal test dataset to check pipeline function'
conf/test_filter.config (8 additions)

@@ -10,6 +10,14 @@
----------------------------------------------------------------------------------------
*/

process {
resourceLimits = [
cpus: 4,
memory: '15.GB',
time: '1.h'
]
}

params {
config_profile_name = 'Test profile'
config_profile_description = 'Minimal test dataset to check pipeline function, including removal of contaminating sequences (e.g. rRNA)'
conf/test_full.config (8 additions)

@@ -10,6 +10,14 @@
----------------------------------------------------------------------------------------
*/

process {
resourceLimits = [
cpus: 4,
memory: '15.GB',
time: '1.h'
]
}

params {
config_profile_name = 'Full test profile'
config_profile_description = 'Full test dataset to check pipeline function'
conf/test_kofamscan.config (8 additions)

@@ -10,6 +10,14 @@
----------------------------------------------------------------------------------------
*/

process {
resourceLimits = [
cpus: 4,
memory: '15.GB',
time: '1.h'
]
}

params {
config_profile_name = 'Test kofamscan profile'
config_profile_description = 'Minimal test dataset to check pipeline with kofamscan function added'
conf/test_prokka.config (8 additions)

@@ -10,6 +10,14 @@
----------------------------------------------------------------------------------------
*/

process {
resourceLimits = [
cpus: 4,
memory: '15.GB',
time: '1.h'
]
}

params {
config_profile_name = 'Test profile for prokka orf caller'
config_profile_description = 'Minimal test dataset to check pipeline function'
conf/test_spades.config (8 additions)

@@ -10,6 +10,14 @@
----------------------------------------------------------------------------------------
*/

process {
resourceLimits = [
cpus: 4,
memory: '15.GB',
time: '1.h'
]
}

params {
config_profile_name = 'Test spades assembler profile'
config_profile_description = 'Minimal test dataset to check pipeline function'
conf/test_transdecoder.config (8 additions)

@@ -10,6 +10,14 @@
----------------------------------------------------------------------------------------
*/

process {
resourceLimits = [
cpus: 4,
memory: '15.GB',
time: '1.h'
]
}

params {
config_profile_name = 'Test profile for transdecoder orf caller'
config_profile_description = 'Minimal test dataset to check pipeline function'
modules/local/collect_stats.nf (46 additions, 59 deletions)

@@ -27,101 +27,88 @@ process COLLECT_STATS {
d = map(
sample,
function(s) {
fread(cmd = sprintf("grep 'Reads written (passing filters)' %s*trimming_report.txt | sed 's/.*: *//' | sed 's/ .*//' | sed 's/,//g'", s)) %>%
as_tibble()
read_tsv(
pipe(sprintf("grep 'Reads written (passing filters)' %s*trimming_report.txt | sed 's/.*: *//' | sed 's/ .*//' | sed 's/,//g'", s)),
col_names = c('n_trimmed'),
col_types = 'i'
) %>%
mutate(n_trimmed = n_trimmed * 2)
}
)
) %>%
unnest(d) %>%
rename(n_trimmed = V1) %>%
mutate(n_trimmed = n_trimmed*2) %>%
unnest(d)
"""
} else {
read_trimlogs = "%>%"
}

if (mergetab) {
if ( mergetab ) {
read_mergetab = """

mergetab <- list.files(pattern = "*_merged_table.tsv.gz" ) %>%
map_df(~read_tsv(., show_col_types = FALSE)) %>%
mutate(sample = as.character(sample))

mergetab <- read_tsv("${mergetab}", show_col_types = FALSE)
"""
} else {
read_mergetab = """
mergetab <- data.frame(sample = character(), stringsAsFactors = FALSE)
mergetab <- tibble(sample = character())
"""
}

"""
#!/usr/bin/env Rscript

library(data.table)
library(dtplyr)
library(dplyr)
library(readr)
library(purrr)
library(tidyr)
library(stringr)

TYPE_ORDER = c('n_trimmed', 'n_non_contaminated', 'idxs_n_mapped', 'idxs_n_unmapped', 'n_feature_count')
start <- tibble(sample = c("${samples.join('", "')}"))

# Collect stats for each sample, create a table in long format that can be appended to
t <- tibble(sample = c("${samples.join('", "')}")) ${read_trimlogs}
# add samtools idxstats output
mutate(
i = map(
sample,
function(s) {
fread(cmd = sprintf("grep -v '^*' %s*idxstats", s), sep = '\\t', col.names = c('chr', 'length', 'idxs_n_mapped', 'idxs_n_unmapped')) %>%
lazy_dt() %>%
summarise(idxs_n_mapped = sum(idxs_n_mapped), idxs_n_unmapped = sum(idxs_n_unmapped)) %>%
as_tibble()
}
)
) %>%
unnest(i) %>%
pivot_longer(2:ncol(.), names_to = 'm', values_to = 'v') %>%
union(
# Total observation after featureCounts
tibble(file = Sys.glob('*.counts.tsv.gz')) %>%
mutate(d = map(file, function(f) fread(cmd = sprintf("gunzip -c %s", f), sep = '\\t'))) %>%
as_tibble() %>%
unnest(d) %>%
mutate(sample = as.character(sample)) %>%
group_by(sample) %>% summarise(n_feature_count = sum(count), .groups = 'drop') %>%
pivot_longer(2:ncol(.), names_to = 'm', values_to = 'v')
)

# Add in stats from BBDuk, if present
trimming <- tibble(sample = c("${samples.join('", "')}")) ${read_trimlogs}

idxs <- read_tsv(
pipe("grep -Hv '^*' *.idxstats"),
col_names = c('c', 'length', 'idxs_n_mapped', 'idxs_n_unmapped'),
col_types = 'ciii'
) %>%
separate(c, c('sample', 'chr'), sep = ':') %>%
transmute(sample = str_remove(sample, '.idxstats'), idxs_n_mapped, idxs_n_unmapped) %>%
group_by(sample) %>% summarise(idxs_n_mapped = sum(idxs_n_mapped), idxs_n_unmapped = sum(idxs_n_unmapped))

counts <- read_tsv("${fcs}", col_types = 'cciicicid') %>%
group_by(sample) %>% summarise(n_feature_count = sum(count))


bbduk <- tibble(sample = character(), n_non_contaminated = integer())
for ( f in Sys.glob('*.bbduk.log') ) {
s = str_remove(f, '.bbduk.log')
t <- t %>% union(
fread(cmd = sprintf("grep 'Result:' %s | sed 's/Result:[ \\t]*//; s/ reads.*//'", f), col.names = c('v')) %>%
as_tibble() %>%
mutate(sample = s, m = 'n_non_contaminated')
)
bbduk <- bbduk %>%
union(
read_tsv(
pipe(sprintf("grep 'Result:' %s | sed 's/Result:[ \t]*//; s/ reads.*//' | sed 's/:/\t/'", f)),
col_names = c('n_non_contaminated'),
col_types = 'i'
) %>%
mutate(sample = s)
)
}
if ( nrow(bbduk) == 0 ) bbduk <- bbduk %>% select(sample)

# Add in stats from taxonomy and function
${read_mergetab}

# Write the table in wide format
t %>%
mutate(m = parse_factor(m, levels = TYPE_ORDER, ordered = TRUE)) %>%
arrange(sample, m) %>%
pivot_wider(names_from = m, values_from = v) %>%
left_join(mergetab, by = 'sample') %>%
write_tsv('${prefix}.overall_stats.tsv.gz')
# Write output
start %>%
left_join(trimming, by = join_by(sample)) %>%
left_join(bbduk, by = join_by(sample)) %>%
left_join(idxs, by = join_by(sample)) %>%
left_join(counts, by = join_by(sample)) %>%
left_join(mergetab, by = join_by(sample)) %>%
arrange(sample) %>%
write_tsv("${meta.id}.overall_stats.tsv.gz")

writeLines(
c(
"\\"${task.process}\\":",
paste0(" R: ", paste0(R.Version()[c("major","minor")], collapse = ".")),
paste0(" dplyr: ", packageVersion('dplyr')),
paste0(" dtplyr: ", packageVersion('dtplyr')),
paste0(" data.table: ", packageVersion('data.table')),
paste0(" readr: ", packageVersion('readr')),
paste0(" purrr: ", packageVersion('purrr')),
paste0(" tidyr: ", packageVersion('tidyr')),
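The net effect of the rewrite above: COLLECT_STATS now reads one merged summary table (`${mergetab}`) and one featureCounts table (`${fcs}`) passed in explicitly instead of globbing per-sample files, and it builds the overall stats by `left_join`ing each source onto a seed tibble of sample names (`start`), so a sample missing from an optional input keeps its row with NA rather than vanishing from the old `union`/`pivot_wider` assembly. A hedged sketch of how the module might be wired at the workflow level; the channel names and input cardinality here are assumptions, not the pipeline's actual code:

```groovy
// Hypothetical wiring; the real input declarations live in collect_stats.nf.
COLLECT_STATS(
    ch_collect_stats,      // tuple: meta + list of sample names
    ch_trim_logs,          // Trim Galore reports (optional; empty if skipped)
    ch_bbduk_logs,         // BBDuk logs (optional)
    ch_idxstats,           // samtools idxstats, one file per sample
    ch_feature_counts,     // single counts table, becomes ${fcs}
    ch_merged_table        // single *_merged_table.tsv.gz, becomes ${mergetab}
)
```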
modules/local/eggnog/sum.nf (1 addition, 1 deletion)

@@ -45,7 +45,7 @@ process EGGNOG_SUM {
group_by(sample) %>%
drop_na() %>%
summarise( value = sum(count), .groups = 'drop') %>%
add_column(database = "eggnog", field = "eggnog_n_counts") %>%
add_column(database = "eggnog", field = "n") %>%
relocate(value, .after = last_col()) %>%
write_tsv('${meta.id}.eggnog_summary.tsv.gz')

modules/local/merge_summary_tables.nf (3 additions, 3 deletions)

@@ -8,8 +8,7 @@ process MERGE_TABLES {
'biocontainers/mulled-v2-b2ec1fea5791d428eebb8c8ea7409c350d31dada:a447f6b7a6afde38352b24c30ae9cd6e39df95c4-1' }"

input:

tuple val(meta), path(eggtab), path(taxtab), path(kofamscan)
tuple val(meta), path(tables)

output:
tuple val(meta), path("${meta.id}_merged_table.tsv.gz") , emit: merged_table
@@ -34,7 +33,8 @@
Sys.glob('*.tsv.gz') %>%
read_tsv() %>%
mutate(sample = as.character(sample)) %>%
pivot_wider(names_from = c(database,field), values_from = value) %>%
arrange(field, database) %>%
pivot_wider(names_from = c(field,database), values_from = value) %>%
write_tsv('${prefix}_merged_table.tsv.gz')

writeLines(
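MERGE_TABLES now takes a single `path(tables)` staging however many summary tables exist (eggnog, taxonomy, kofamscan) rather than three fixed inputs, and the added `arrange(field, database)` before `pivot_wider` gives the merged columns a stable order. With the `field = "n"` renames in the eggnog and kofamscan summary modules, the widened columns come out as, e.g., `n_eggnog` and `n_kofamscan`. A sketch of how the upstream channels could be combined; the channel names are assumptions:

```groovy
// Hypothetical: gather all per-sample summary tables under one meta
// so MERGE_TABLES receives them as a single list of paths.
ch_eggnog_summary
    .mix(ch_taxonomy_summary, ch_kofamscan_summary)
    .groupTuple()                 // [ meta, [ table1, table2, ... ] ]
    .set { ch_summary_tables }

MERGE_TABLES(ch_summary_tables)
```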
modules/local/sum_kofamscan.nf (1 addition, 1 deletion)

@@ -45,7 +45,7 @@ process SUM_KOFAMSCAN {
inner_join(kofams, by = 'orf') %>%
group_by(sample) %>%
summarise(value = sum(count), .groups = 'drop') %>%
add_column(database = "kofamscan", field = "kofamscan_n_counts") %>%
add_column(database = "kofamscan", field = "n") %>%
relocate(value, .after = last_col()) %>%
write_tsv('${meta.id}.kofamscan_summary.tsv.gz')

modules/local/sumtaxonomy/environment.yml (7 additions, new file)
@@ -0,0 +1,7 @@
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/environment-schema.json
channels:
- conda-forge
- bioconda
dependencies:
- "conda-forge::r-tidyverse=2.0.0 conda-forge::r-dtplyr=1.3.1 conda-forge::r-data.table=1.14.8"