Support alphanumeric and hyphenated words #45

nuno-agostinho opened this issue Feb 3, 2020 · 2 comments

@nuno-agostinho
I am using the following words in my package:

  • RNA-seq
  • 1st
  • 2nd
  • EIF4G1

After inserting these words in inst/WORDLIST and running spelling::spell_check_package(), the function reports that the words seq, st, nd and EIF are misspelled.

Currently, my WORDLIST includes the words seq, st, nd and EIF to avoid triggering the spell checker, but I would prefer to include the full words. Thanks.

@jmbarbone
I ran into the same issue with ordinal indicators (1st, 2nd, and so on). It looks like the problem is in the hunspell parser, which drops the digits and splits on the hyphen:

hunspell::hunspell_parse(c("1st", "RNA-seq", "EIF4G1"))
#> [[1]]
#> [1] "st"
#> 
#> [[2]]
#> [1] "RNA" "seq"
#> 
#> [[3]]
#> [1] "EIF" "G"

Created on 2021-02-06 by the reprex package (v0.3.0)
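
For comparison, here is a minimal base R sketch of the tokenization this issue is asking for, where digits and internal hyphens stay inside a word. The function name `tokenize_keep_compounds` is hypothetical and is not part of hunspell or spelling; it only illustrates the expected result and is not a proposed replacement for the hunspell parser.

```r
# Hypothetical helper (not part of hunspell or spelling).
# A token is a run of letters/digits, optionally followed by
# hyphen-joined runs, so "RNA-seq" and "EIF4G1" stay whole.
tokenize_keep_compounds <- function(x) {
  regmatches(x, gregexpr("[[:alnum:]]+(-[[:alnum:]]+)*", x))
}

unlist(tokenize_keep_compounds(c("1st", "RNA-seq", "EIF4G1")))
#> [1] "1st"     "RNA-seq" "EIF4G1"
```

Note that this simple pattern still splits on apostrophes (for example "line's" becomes "line" and "s"), so it is only meant to show the behaviour for alphanumeric and hyphenated words.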

@jmbarbone
Implementing a pre-filter right before the parse step here could work:

spelling/R/check-files.R

Lines 118 to 123 in a2b5f29

spell_check_file_plain <- function(path, format, dict){
  lines <- readLines(path, warn = FALSE, encoding = 'UTF-8')
  words <- hunspell::hunspell_parse(lines, format = format, dict = dict)
  text <- vapply(words, paste, character(1), collapse = " ")
  spell_check_plain(text, dict = dict)
}

This feels more like a quick fix than a real solution, because it splits the text with strsplit(), drops the ignored words, and paste()s the rest back together before the lines are sent to the actual parsing function.

ignore_words <- c("1st", "RNA-seq", "EIF4G1")

lines <- c(
  "This is the 1st line.  It has first written in it.",
  "The second has RNA-seq inside. But does not use RNAseq -- without the '-'",
  "EIF4G1 but not EIF4G1fdsadf is used",
  "This line's words are fine!"
)

pre_filter_plain <- function(lines, ignore = character()) {
  word_list <- strsplit(lines, "([^-[:alnum:][:punct:]])")
  
  vapply(
    word_list,
    function(i) {
      paste(i[!i %in% ignore], collapse = " ")
    },
    character(1)
  )
}

pre_filter_plain(lines, ignore_words)
#> [1] "This is the line.  It has first written in it."                   
#> [2] "The second has inside. But does not use RNAseq -- without the '-'"
#> [3] "but not EIF4G1fdsadf is used"                                     
#> [4] "This line's words are fine!"

Created on 2021-02-06 by the reprex package (v0.3.0)
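
One limitation of the filter above is that it only drops tokens that match an ignored word exactly, so "RNA-seq," or "EIF4G1." (with trailing punctuation) would still reach the parser. A possible variant, sketched here under the assumption that losing the attached punctuation is acceptable for spell checking, compares each token with its leading and trailing punctuation stripped. The name `pre_filter_punct` is hypothetical and is not part of the spelling package or the referenced commit.

```r
# Hypothetical variant of pre_filter_plain (not part of spelling).
pre_filter_punct <- function(lines, ignore = character()) {
  # Split on whitespace only, so hyphens and digits stay inside tokens
  tokens <- strsplit(lines, "[[:space:]]+")

  vapply(
    tokens,
    function(tok) {
      # Strip leading/trailing punctuation for the comparison only
      core <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tok)
      paste(tok[!core %in% ignore], collapse = " ")
    },
    character(1)
  )
}

ignore_words <- c("1st", "RNA-seq", "EIF4G1")

pre_filter_punct(
  c("We used RNA-seq, then EIF4G1.", "The 1st run (RNA-seq) failed."),
  ignore_words
)
#> [1] "We used then"    "The run failed."
```

Unlike the original filter, this one collapses runs of whitespace, which should not matter for spell checking but does change the text that is passed on.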

jmbarbone added a commit to jmbarbone/spelling that referenced this issue Feb 10, 2021
This is meant to be a quick fix; issue should probably be resolved in hunspell parser instead
* remove "ignore" words from WORDLIST before parsing in hunspell
* replaces complex if ... else if ... statement with simpler switch()