detect duplicated columns - columns with similar values across all rows #167

Open
Karim-Mane opened this issue Aug 6, 2024 · 1 comment
@Karim-Mane
Member

No description provided.

@Karim-Mane Karim-Mane converted this from a draft issue Aug 6, 2024
@Bisaloo
Member

Bisaloo commented Aug 7, 2024

For future reference: one potential implementation we discussed (still to be double-checked) is to convert each column to a factor and then compare the resulting integer level codes:

library(magrittr)

dat <- simulist::sim_linelist() 

head(dat)
#>   id         case_name case_type sex age date_onset date_admission   outcome
#> 1  1     Uqbah al-Omar confirmed   m  30 2023-01-01           <NA> recovered
#> 2  2    Gaitha al-Alli confirmed   f  15 2023-01-07           <NA> recovered
#> 3  3  Muna el-Siddique  probable   f  90 2023-01-05           <NA> recovered
#> 4  4    Keauna Vickers confirmed   f  21 2023-01-12           <NA> recovered
#> 5  6 Nazeeha al-Habeeb confirmed   f  26 2023-01-14           <NA> recovered
#> 6  8     Delaney Clark confirmed   f  65 2023-01-09           <NA> recovered
#>   date_outcome date_first_contact date_last_contact ct_value
#> 1         <NA>               <NA>              <NA>     24.2
#> 2         <NA>         2023-01-04        2023-01-05     24.2
#> 3         <NA>         2022-12-31        2023-01-05       NA
#> 4         <NA>         2023-01-04        2023-01-08     24.2
#> 5         <NA>         2023-01-05        2023-01-09     24.2
#> 6         <NA>         2023-01-04        2023-01-07     24.2

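# Encode each column as integer codes in order of first appearance, so two
# columns get the same codes whenever their values map one-to-one
lvls <- dat %>%
  vapply(function(col) as.integer(factor(col, levels = unique(as.character(col)))), integer(nrow(.)))

# All pairs of column indices
cols_to_compare <- combn(ncol(dat), 2, simplify = FALSE)

# A pair is flagged when both columns have identical codes in every row
duplicated_columns <- vapply(cols_to_compare, function(x) {
  identical(lvls[, x[1]], lvls[, x[2]])
}, logical(1))

message("Duplicated columns: ", sprintf("\n- %s", vapply(cols_to_compare[duplicated_columns], paste, character(1), collapse = "/")))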
#> Duplicated columns: 
#> - 1/2

Created on 2024-08-07 with reprex v2.1.1
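
The same idea can be sketched without `simulist`, on a small hand-built data frame, reporting column names instead of indices. This is only an illustration under stated assumptions: `find_duplicated_columns()` is a hypothetical helper name, not an existing function in any package, and the toy data is made up for the example.

```r
# Hypothetical helper (the name is an assumption for this sketch).
# Returns "colA/colB" for every pair of columns whose values map one-to-one.
find_duplicated_columns <- function(dat) {
  # combn() needs at least two columns to form a pair
  if (ncol(dat) < 2L) {
    return(character(0))
  }
  # Integer codes in order of first appearance; NA values stay NA
  codes <- lapply(dat, function(col) {
    as.integer(factor(col, levels = unique(as.character(col))))
  })
  # All pairs of column names
  pairs <- combn(names(dat), 2, simplify = FALSE)
  is_dup <- vapply(pairs, function(p) {
    identical(codes[[p[1]]], codes[[p[2]]])
  }, logical(1))
  vapply(pairs[is_dup], paste, character(1), collapse = "/")
}

# Made-up toy data: sex_code is a recoding of sex
dat <- data.frame(
  id        = c(1, 2, 3, 4, 6),
  case_name = c("A", "B", "C", "D", "E"),
  sex       = c("m", "f", "f", "m", NA),
  sex_code  = c(1L, 2L, 2L, 1L, NA),
  stringsAsFactors = FALSE
)

dups <- find_duplicated_columns(dat)
print(dups)
#> [1] "id/case_name" "sex/sex_code"
```

Note that `id/case_name` is flagged as well: any two columns whose values are all distinct get the codes 1 to n, so they compare as identical. This is consistent with the `1/2` result in the reprex above, and it may need to be filtered out depending on what should count as a duplicated column.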

@Karim-Mane Karim-Mane self-assigned this Aug 8, 2024