Support alphanumeric and hyphenated words #45

nuno-agostinho · 2020-02-03T18:25:44Z

I am using the following words in my package:

RNA-seq
1st
2nd
EIF4G1

After inserting these words in inst/WORDLIST and running spelling::spell_check_package(), the function reports that the words seq, st, nd and EIF are misspelled.

Currently, my WORDLIST includes the words seq, st, nd and EIF to avoid triggering the spell checker, but I would prefer to include the full words. Thanks.

jmbarbone · 2021-02-06T22:31:35Z

I have the same issue, picked up with ordinal indicators. It looks like this is a problem with the hunspell parser:

hunspell::hunspell_parse(c("1st", "RNA-seq", "EIF4G1"))
#> [[1]]
#> [1] "st"
#> 
#> [[2]]
#> [1] "RNA" "seq"
#> 
#> [[3]]
#> [1] "EIF" "G"

Created on 2021-02-06 by the reprex package (v0.3.0)
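For comparison, here is a minimal base-R sketch of the tokenization this issue is asking for, where digits and single internal hyphens stay inside a word. This is not how hunspell parses text; the function name `tokenize_keep_alnum()` and the regular expression are assumptions made only for illustration.

```r
# Illustrative tokenizer (NOT hunspell's algorithm): a token is a run of
# letters/digits, optionally joined to further runs by single hyphens.
tokenize_keep_alnum <- function(x) {
  pattern <- "[[:alnum:]]+(?:-[[:alnum:]]+)*"
  # gregexpr() finds every match per string; regmatches() extracts them
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

tokenize_keep_alnum(c("1st", "RNA-seq", "EIF4G1"))
```

With this pattern each of the three inputs comes back as a single token, so a WORDLIST entry such as RNA-seq could be matched as a whole word.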

jmbarbone · 2021-02-07T00:49:18Z

Implementing a pre-filter right before the parse step here could work:

spelling/R/check-files.R

Lines 118 to 123 in a2b5f29

spell_check_file_plain <- function(path, format, dict){
  lines <- readLines(path, warn = FALSE, encoding = 'UTF-8')
  words <- hunspell::hunspell_parse(lines, format = format, dict = dict)
  text <- vapply(words, paste, character(1), collapse = " ")
  spell_check_plain(text, dict = dict)
}

It feels more like a quick fix, because it splits the text with strsplit() and then paste()s it back together before sending it to the actual parsing function.

ignore_words <- c("1st", "RNA-seq", "EIF4G1")

lines <- c(
  "This is the 1st line.  It has first written in it.",
  "The second has RNA-seq inside. But does not use RNAseq -- without the '-'",
  "EIF4G1 but not EIF4G1fdsadf is used",
  "This line's words are fine!"
)

pre_filter_plain <- function(lines, ignore = character()) {
  word_list <- strsplit(lines, "([^-[:alnum:][:punct:]])")
  
  vapply(
    word_list,
    function(i) {
      paste(i[!i %in% ignore], collapse = " ")
    },
    character(1)
  )
}

pre_filter_plain(lines, ignore_words)
#> [1] "This is the line.  It has first written in it."                   
#> [2] "The second has inside. But does not use RNAseq -- without the '-'"
#> [3] "but not EIF4G1fdsadf is used"                                     
#> [4] "This line's words are fine!"

Created on 2021-02-06 by the reprex package (v0.3.0)
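One limitation of pre_filter_plain() as written: the split keeps punctuation attached to each token, so an ignored word followed by a comma or full stop (for example "RNA-seq," or "1st.") would not match the ignore list exactly and would still reach the parser. A possible variant, sketched below, removes the ignored words with a regular expression instead. The name `pre_filter_regex()` and the pattern are hypothetical and not part of the spelling package.

```r
# Hypothetical variant of pre_filter_plain(); the name and approach are
# assumptions for illustration, not code from the spelling package.
pre_filter_regex <- function(lines, ignore = character()) {
  if (length(ignore) == 0L) {
    return(lines)
  }
  # \Q...\E makes PCRE treat each ignored word literally, so characters
  # such as "-" or "." in a word are not interpreted as regex syntax.
  literal <- paste0("\\Q", ignore, "\\E", collapse = "|")
  # The lookarounds reject matches glued to another letter, digit or hyphen,
  # so "EIF4G1fdsadf" is left in place for the spell checker.
  pattern <- paste0("(?<![[:alnum:]-])(?:", literal, ")(?![[:alnum:]-])")
  gsub(pattern, "", lines, perl = TRUE)
}

pre_filter_regex(
  c("We ran RNA-seq, then the 1st.", "EIF4G1 but not EIF4G1fdsadf"),
  ignore = c("1st", "RNA-seq", "EIF4G1")
)
```

Like the original, this is still a workaround applied before hunspell sees the text, and it assumes no ignored word contains the literal sequence `\E`.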

This is meant to be a quick fix; the issue should probably be resolved in the hunspell parser instead.
* remove "ignore" words from WORDLIST before parsing in hunspell
* replace complex if ... else if ... statement with simpler switch()

jmbarbone mentioned this issue Jul 26, 2022

45 alnumeric hyphenated #67

Open

jmbarbone mentioned this issue Aug 19, 2024

can't ignore words with numbers? #85

Open
