Discussion of grapheme clusters in ilreq using pre-Unicode 15.1 definitions #43

andjc · 2025-02-03T21:33:45Z

In the section on typographic units there is a discussion on extended grapheme clusters, using स्कूल as an example. The text says:

There are two syllables in this word: SA+VIRAMA+KA+UU and LA. Note, however, that there are three Unicode grapheme clusters here: SA+VIRAMA, KA+UU and LA.

Styling is done on the basis of the whole orthographic syllable, not the first character, nor even the first grapheme.

In Unicode 15.1, UAX #29 added a new rule specifically for some Indic scripts:

GB9c rule only applies to extended grapheme clusters:
Do not break within certain combinations with Indic_Conjunct_Break (InCB)=Linker.

So the following characters:

                                Character properties                                
┌──────┬──────┬────────────────────────┬────────────┬────────────┬─────┬──────┬────┐
│ char │ cp   │ name                   │ script     │ block      │ cat │ bidi │ cc │
├──────┼──────┼────────────────────────┼────────────┼────────────┼─────┼──────┼────┤
│ ्     │ 094D │ DEVANAGARI SIGN VIRAMA │ Devanagari │ Devanagari │ Mn  │ NSM  │ 9  │
│ ্     │ 09CD │ BENGALI SIGN VIRAMA    │ Bengali    │ Bengali    │ Mn  │ NSM  │ 9  │
│ ્     │ 0ACD │ GUJARATI SIGN VIRAMA   │ Gujarati   │ Gujarati   │ Mn  │ NSM  │ 9  │
│ ୍     │ 0B4D │ ORIYA SIGN VIRAMA      │ Oriya      │ Oriya      │ Mn  │ NSM  │ 9  │
│ ్     │ 0C4D │ TELUGU SIGN VIRAMA     │ Telugu     │ Telugu     │ Mn  │ NSM  │ 9  │
│ ്     │ 0D4D │ MALAYALAM SIGN VIRAMA  │ Malayalam  │ Malayalam  │ Mn  │ NSM  │ 9  │
└──────┴──────┴────────────────────────┴────────────┴────────────┴─────┴──────┴────┘
                             String: [\p{InCB=Linker}]

can now join a following consonant into the same extended grapheme cluster.

So स्कूल will be three extended grapheme clusters (['स्', 'कू', 'ल'] – SA+VIRAMA, KA+UU and LA) in Unicode 15.0 and prior versions, and two extended grapheme clusters (['स्कू', 'ल'] – SA+VIRAMA+KA+UU and LA) in Unicode 15.1 onwards.
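To make the difference concrete, here is a minimal Python sketch that reproduces both segmentations for this one word. It is not an implementation of UAX #29: `segment`, `is_mark`, `is_consonant` and `links_forward` are hypothetical helpers, "consonant" is approximated as DEVANAGARI KA..HA only, and "extend" is approximated as any character with general category Mn, Mc or Me. Only the GB9c idea (no break between a consonant-plus-linker sequence and a following consonant) is modelled.

```python
import unicodedata

# The six InCB=Linker characters from the table above.
LINKERS = {"\u094D", "\u09CD", "\u0ACD", "\u0B4D", "\u0C4D", "\u0D4D"}


def is_mark(ch):
    # Assumption: any combining mark stands in for Extend/SpacingMark.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def is_consonant(ch):
    # Assumption: only DEVANAGARI KA..HA are treated as InCB=Consonant here.
    return "\u0915" <= ch <= "\u0939"


def links_forward(cluster):
    # True if the marks after the last consonant in the cluster include a linker.
    for i in range(len(cluster) - 1, -1, -1):
        if is_consonant(cluster[i]):
            return any(c in LINKERS for c in cluster[i + 1:])
    return False


def segment(text, conjunct_aware):
    # conjunct_aware=False approximates Unicode 15.0 and earlier;
    # conjunct_aware=True adds a simplified form of rule GB9c (15.1 onwards).
    clusters = []
    for ch in text:
        if clusters and is_mark(ch):
            clusters[-1] += ch  # marks attach to the preceding cluster
        elif (conjunct_aware and clusters and is_consonant(ch)
              and links_forward(clusters[-1])):
            clusters[-1] += ch  # simplified GB9c: no break after a linker
        else:
            clusters.append(ch)
    return clusters


word = "\u0938\u094D\u0915\u0942\u0932"  # SA, VIRAMA, KA, UU, LA
print(len(segment(word, conjunct_aware=False)))  # 3 clusters
print(len(segment(word, conjunct_aware=True)))   # 2 clusters
```

A real toolchain should rely on a segmentation library built against a known Unicode version rather than on hand-written tables like these.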

So the effect of extended grapheme cluster level segmentation will depend on the version of Unicode the toolchain is using at the point of segmentation.
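As a rough illustration of that dependency, the sketch below checks whether a given Unicode version string is at or past 15.1, the version that introduced GB9c. `parse_version` and `has_gb9c` are hypothetical helper names. Python's `unicodedata.unidata_version` is used only as an example of where a version string might come from; segmentation libraries generally carry their own Unicode data, so their version would need to be checked separately.

```python
import unicodedata


def parse_version(text):
    # "15.1.0" -> (15, 1, 0), so versions compare numerically, not as strings.
    return tuple(int(part) for part in text.split("."))


def has_gb9c(unicode_version):
    # GB9c was added to UAX #29 in Unicode 15.1.
    return parse_version(unicode_version) >= (15, 1)


# Version of the character database bundled with this Python build.
print(unicodedata.unidata_version, has_gb9c(unicodedata.unidata_version))
```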

r12a · 2025-02-07T12:03:35Z

Thanks for bringing this up, @andjc, i will make the necessary changes.

r12a · 2025-02-07T16:23:43Z

@andjc i started by updating the gap analysis documents for Bengali, Devanagari, and Gujarati. You can see the result for deva here: https://www.w3.org/TR/2025/DNOTE-deva-gap-20250207/#issue87_segmentation

Any comments on that, before i move on to other stuff ?
