Discussion of grapheme clusters in ilreq using pre-Unicode 15.1 definitions #43

Open
andjc opened this issue Feb 3, 2025 · 2 comments

andjc commented Feb 3, 2025

In the section on typographic units there is a discussion on extended grapheme clusters, using स्कूल as an example. The text says:

There are two syllables in this word: SA+VIRAMA+KA+UU and LA. Note, however, that there are three Unicode grapheme clusters here: SA+VIRAMA, KA+UU and LA.

Styling is done on the basis of the whole orthographic syllable, not the first character, nor even the first grapheme.

Unicode 15.1, UAX #29 added a new rule specifically for some Indic scripts:

GB9c rule only applies to extended grapheme clusters:
Do not break within certain combinations with Indic_Conjunct_Break (InCB)=Linker.

So the following characters:

                                Character properties                                
┌──────┬──────┬────────────────────────┬────────────┬────────────┬─────┬──────┬────┐
│ char │ cp   │ name                   │ script     │ block      │ cat │ bidi │ cc │
├──────┼──────┼────────────────────────┼────────────┼────────────┼─────┼──────┼────┤
│ ्     │ 094D │ DEVANAGARI SIGN VIRAMA │ Devanagari │ Devanagari │ Mn  │ NSM  │ 9  │
│ ্     │ 09CD │ BENGALI SIGN VIRAMA    │ Bengali    │ Bengali    │ Mn  │ NSM  │ 9  │
│ ્     │ 0ACD │ GUJARATI SIGN VIRAMA   │ Gujarati   │ Gujarati   │ Mn  │ NSM  │ 9  │
│ ୍     │ 0B4D │ ORIYA SIGN VIRAMA      │ Oriya      │ Oriya      │ Mn  │ NSM  │ 9  │
│ ్     │ 0C4D │ TELUGU SIGN VIRAMA     │ Telugu     │ Telugu     │ Mn  │ NSM  │ 9  │
│ ്     │ 0D4D │ MALAYALAM SIGN VIRAMA  │ Malayalam  │ Malayalam  │ Mn  │ NSM  │ 9  │
└──────┴──────┴────────────────────────┴────────────┴────────────┴─────┴──────┴────┘
                             String: [\p{InCB=Linker}]      

now keep the consonants on either side of them in the same extended grapheme cluster.
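The property columns in the table above can be reproduced with Python's standard `unicodedata` module, as a rough sketch. The module does not expose Indic_Conjunct_Break, so the list of linkers here is copied by hand from the table rather than derived from `\p{InCB=Linker}`; the names `INCB_LINKERS` and `virama_rows` are illustrative only.

```python
import unicodedata

# Code points copied from the table above (assumption: this hand-written
# list stands in for \p{InCB=Linker}, which unicodedata cannot query).
INCB_LINKERS = [0x094D, 0x09CD, 0x0ACD, 0x0B4D, 0x0C4D, 0x0D4D]

def virama_rows():
    """Return (cp, name, category, bidi class, combining class) per linker."""
    rows = []
    for cp in INCB_LINKERS:
        ch = chr(cp)
        rows.append((
            f"{cp:04X}",
            unicodedata.name(ch),
            unicodedata.category(ch),
            unicodedata.bidirectional(ch),
            unicodedata.combining(ch),
        ))
    return rows

if __name__ == "__main__":
    for row in virama_rows():
        print(*row, sep="\t")
```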

So स्कूल will be three extended grapheme clusters (['स्', 'कू', 'ल'] – SA+VIRAMA, KA+UU and LA) in Unicode 15.0 and prior versions, and two extended grapheme clusters (['स्कू', 'ल'] – SA+VIRAMA+KA+UU and LA) in Unicode 15.1 onwards.

So the effect of extended grapheme cluster level segmentation will depend on the version of Unicode the toolchain is using at the point of segmentation.
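To make the difference concrete, here is a deliberately simplified sketch of the two behaviours. It is not the UAX #29 algorithm: it approximates Extend and SpacingMark with the general categories Mn, Me and Mc, approximates InCB=Consonant with category Lo, and hard-codes the six linkers from the table. The function names and the `gb9c` flag are hypothetical, introduced only for this illustration.

```python
import unicodedata

# The six viramas listed in the table (hand-copied stand-in for InCB=Linker).
LINKERS = {"\u094D", "\u09CD", "\u0ACD", "\u0B4D", "\u0C4D", "\u0D4D"}

def is_mark(ch):
    # Rough approximation of Extend / SpacingMark (rules GB9 and GB9a).
    return unicodedata.category(ch) in ("Mn", "Me", "Mc")

def is_consonant(ch):
    # Rough approximation of InCB=Consonant; real data is more selective.
    return unicodedata.category(ch) == "Lo"

def ends_with_linked_consonant(cluster):
    # Scan backwards over nonspacing marks; require at least one linker
    # before reaching a consonant (the left-hand side of GB9c).
    seen_linker = False
    for ch in reversed(cluster):
        if ch in LINKERS:
            seen_linker = True
        elif unicodedata.category(ch) == "Mn":
            continue
        else:
            return seen_linker and is_consonant(ch)
    return False

def clusters(text, gb9c=True):
    """Split text into approximate extended grapheme clusters."""
    out = []
    for ch in text:
        joins = is_mark(ch) or (
            gb9c and is_consonant(ch) and ends_with_linked_consonant(out[-1])
        ) if out else False
        if joins:
            out[-1] += ch
        else:
            out.append(ch)
    return out

word = "\u0938\u094D\u0915\u0942\u0932"  # SA, VIRAMA, KA, UU, LA
print(clusters(word, gb9c=False))  # pre-15.1 behaviour: three clusters
print(clusters(word, gb9c=True))   # 15.1 behaviour: two clusters
```

With `gb9c=False` the word splits as SA+VIRAMA, KA+UU, LA; with `gb9c=True` it splits as SA+VIRAMA+KA+UU, LA. A real toolchain would get this behaviour from whichever Unicode data version its segmentation library ships, not from a flag.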


r12a commented Feb 7, 2025

Thanks for bringing this up, @andjc, i will make the necessary changes.


r12a commented Feb 7, 2025

@andjc i started by updating the gap analysis documents for Bengali, Devanagari, and Gujarati. You can see the result for deva here: https://www.w3.org/TR/2025/DNOTE-deva-gap-20250207/#issue87_segmentation

Any comments on that, before i move on to other stuff ?
