Integrate ko Wikidata into Unicode Inflection #63

grhoten · 2025-01-22T09:23:54Z

The revised dictionary-parser can parse Wikidata, but some issues need to be resolved.

The initial issues include:

The riel-end detection is probably needed in dictionary-parser
Combining the English and Korean data for consonant-end and verb-end will likely improve the quality of the data.
The dictionary-parser output needs to be addressed

Tool output that needs to be addressed:

Line 434105: Q465800 is not a known part of speech grammeme for L748866(가다)
Line 466714: Q465800 is not a known part of speech grammeme for L1012187(버리다)
Line 1116130: Q465800 is not a known part of speech grammeme for L719342(있다)
Line 1202464: Q175026 is not a known grammeme for L14167(너희)

Here are the currently generated lexical dictionary files for reference.

ko.zip

nciric · 2025-01-24T19:38:35Z

There are 519 nouns in Korean, see this query.

grhoten · 2025-01-24T19:48:19Z

Actually, Korean is different from the other languages. Most of the inflection capability comes with loan words because it's a phonetic language. That's why I mentioned combining the English and Korean data. It's also why I did not mention fixing the unit tests. The generated dictionary passes the unit tests, but it would be helpful to remove the hard-coded list in supplemental_ko.lst.

grhoten · 2025-01-24T19:49:22Z

Adding nouns won't improve functionality. Getting more verbs to handle the register might be interesting, but that's new functionality.

jungshik · 2025-01-25T00:35:45Z

If the primary purpose of this project is to use the result in message formatting, I don't think Korean needs any extensive data collection unless there is a need for verb/adjective conjugation (tense, honorific/degree of politeness, etc.) in message formatting, which is extremely rare, if it happens at all, for Korean.

As for nouns (proper nouns included) and pronouns, there are always two alternative case markers/post-positions for each case, and which one to use is 100% determined by whether the preceding noun ends with a closed syllable (that is, it has a coda consonant) or an open syllable (that is, it ends with a medial vowel). For instance, '은' (topic marker) is used if the preceding noun ends with a closed syllable and '는' (alternate topic marker) is used otherwise. The same holds for '을/를' (object marker) and '이/가' (subject marker).

C form: 은, 을, 이, 으로
O form: 는, 를, 가, 로

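Let U be the Unicode code point of the last syllable of the preceding noun (a precomposed Hangul syllable, U+AC00 to U+D7A3)
Offset = U - 0xAC00
R = Offset % 28
If R = 0 (no final consonant), use the 'O' form.
If R = 8 (final ㄹ), use '로' instead of '으로'; the other particles still take the 'C' form.
Otherwise, use the 'C' form.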
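
The rule above could be sketched in Python roughly as follows. This is only an illustration of the arithmetic, not the project's actual implementation: the names `PARTICLES`, `final_index`, and `attach` are made up for this example, and the sketch assumes the noun ends in a precomposed Hangul syllable. Nouns in other scripts are simply rejected here.

```python
HANGUL_BASE = 0xAC00    # first precomposed syllable, 가
HANGUL_LAST = 0xD7A3    # last precomposed syllable, 힣
FINALS_PER_BLOCK = 28   # 27 final consonants plus "no final consonant"
RIEUL_INDEX = 8         # final-consonant index of ㄹ

# Hypothetical role names mapped to (C form, O form).
PARTICLES = {
    "topic": ("은", "는"),
    "object": ("을", "를"),
    "subject": ("이", "가"),
    "direction": ("으로", "로"),
}


def final_index(word):
    """Return R for the last syllable, or None if it is not a Hangul syllable."""
    if not word:
        return None
    code = ord(word[-1])
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        return None
    return (code - HANGUL_BASE) % FINALS_PER_BLOCK


def attach(word, role):
    """Append the particle form selected by the last syllable of word."""
    closed_form, open_form = PARTICLES[role]
    r = final_index(word)
    if r is None:
        # Latin-script nouns and the like need pronunciation data instead.
        raise ValueError(f"cannot pick a particle for {word!r}")
    # Open syllable, or final ㄹ before (으)로: use the O form.
    if r == 0 or (r == RIEUL_INDEX and role == "direction"):
        return word + open_form
    return word + closed_form


print(attach("책", "topic"))       # closed syllable
print(attach("나무", "object"))    # open syllable
print(attach("서울", "direction"))  # final ㄹ
```

The `None` return from `final_index` marks exactly the case this rule cannot decide on its own, such as a noun written in the Latin script.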

grhoten · 2025-01-25T10:20:08Z

As far as attaching the particles is concerned, you are generally correct for the Korean script. The current implementation already does that. You also have to take other scripts into account, such as nouns in the Latin script, which do occur, and that's the main problem that this issue needs to address.

The code can also be used when you have a term of address. This is important when generating a message with the correct level of register (degree of politeness). That's incomplete in Korean right now. I'm not aware of any frameworks that can do that, and I'm sure most people just use a default register in Korean. In other languages, this topic is important to properly refer to someone (e.g. in a contact) with the right gender. Such word choices generally affect many parts of speech. The gender of the speaker (first person), the audience (second person), the other human in the message (third person), and the non-human object all affect word choices.

Just because you normally default the register of a sentence in Korean, it doesn't mean that it shouldn't be allowed to change to match the current context. The tense is generally not being modified because that would be too complicated, and generally not necessary.

With that being said, I'm also fine with deferring the replacement of the Latin script data that is already included, if it proves to be too difficult to replace.

grhoten added this to the 0.1 milestone Jan 22, 2025
