Integrate ko Wikidata into Unicode Inflection #63

Open
grhoten opened this issue Jan 22, 2025 · 5 comments
@grhoten
Member

grhoten commented Jan 22, 2025

The revised dictionary-parser can parse Wikidata, but some issues need to be resolved.

The initial issues include:

  • Riel-end detection (a noun ending in a final ㄹ) is probably needed in dictionary-parser
  • Combining the English and Korean data for consonant-end and verb-end will likely improve the quality of the data.
  • The dictionary-parser output needs to be addressed

Tool output that needs to be addressed:

Line 434105: Q465800 is not a known part of speech grammeme for L748866(가다)
Line 466714: Q465800 is not a known part of speech grammeme for L1012187(버리다)
Line 1116130: Q465800 is not a known part of speech grammeme for L719342(있다)
Line 1202464: Q175026 is not a known grammeme for L14167(너희)

Here are the current generated lexical dictionary files for reference.

ko.zip

@grhoten grhoten added this to the 0.1 milestone Jan 22, 2025
@nciric
Contributor

nciric commented Jan 24, 2025

There are 519 nouns in Korean; see this query.

@grhoten
Member Author

grhoten commented Jan 24, 2025

Actually, Korean is different from the other languages. Most of the inflection capability comes with loan words because it's a phonetic language. That's why I mentioned combining the English and Korean data. It's also why I did not mention fixing the unit tests. The generated dictionary passes the unit tests, but it would be helpful to remove the hard-coded list in supplemental_ko.lst.

@grhoten
Member Author

grhoten commented Jan 24, 2025

Adding nouns won't improve functionality. Getting more verbs to handle the register might be interesting, but that's new functionality.

@jungshik

If the primary purpose of this project is to use the result in message formatting, I don't think Korean needs any extensive data collection unless there is a need for verb/adjective conjugation (tense, honorific/degree of politeness, etc.) in message formatting, which is extremely rare, if it happens at all, for Korean.

As for nouns (proper nouns included) and pronouns, there are always two alternative case markers/postpositions for each case, and which one to use is 100% determined by whether the preceding noun ends with a closed syllable (that is, it has a coda consonant) or an open syllable (that is, it ends with a medial vowel). For instance, '은' (topic marker) is used if the preceding noun ends with a closed syllable, and '는' (alternate topic marker) is used otherwise. The same holds for '을/를' (object marker) and '이/가' (subject marker).

  • C form: 은, 을, 이, 으로
  • O form: 는, 를, 가, 로
  1. Let U be the Unicode code point of the last syllable of the preceding noun
  2. Offset = U - 0xAC00
  3. R = Offset % 28
  4. If R = 0, use the 'O' form. If R = 8 (the final consonant is ㄹ), use '로' instead of '으로'
  5. Otherwise, use 'C' form
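
The steps above can be sketched in Python as follows. This is only an illustration of the rule as described in this comment, not the project's implementation; the function name `attach_particle`, the role labels, and the `PARTICLES` table are made up for the example, and it assumes the noun ends in a precomposed Hangul syllable (U+AC00 to U+D7A3).

```python
HANGUL_BASE = 0xAC00   # first precomposed Hangul syllable (가)
HANGUL_LAST = 0xD7A3   # last precomposed Hangul syllable (힣)
FINALS_PER_VOWEL = 28  # 27 final consonants plus "no final"
RIEUL_FINAL = 8        # index of the final consonant ㄹ

# (C form, O form) for each particle; the role labels are hypothetical.
PARTICLES = {
    "topic": ("은", "는"),
    "object": ("을", "를"),
    "subject": ("이", "가"),
    "direction": ("으로", "로"),
}


def attach_particle(noun: str, role: str) -> str:
    """Append the C or O form of a particle based on the noun's last syllable."""
    closed_form, open_form = PARTICLES[role]
    last = ord(noun[-1])
    if not HANGUL_BASE <= last <= HANGUL_LAST:
        # Other scripts (for example Latin) cannot be decided from the code point.
        raise ValueError(f"{noun!r} does not end in a Hangul syllable")
    r = (last - HANGUL_BASE) % FINALS_PER_VOWEL
    if r == 0:
        return noun + open_form
    if r == RIEUL_FINAL and role == "direction":
        # A final ㄹ takes 로 rather than 으로; other particles keep the C form.
        return noun + open_form
    return noun + closed_form


print(attach_particle("사과", "topic"))      # open syllable
print(attach_particle("책", "topic"))        # closed syllable
print(attach_particle("서울", "direction"))  # final ㄹ
```

The sketch raises an error for a noun that does not end in Hangul, since the code point arithmetic says nothing about how such a noun is pronounced.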

@grhoten
Member Author

grhoten commented Jan 25, 2025

As far as attaching the particles is concerned, you are generally correct for the Korean script. The current implementation already does that. You also have to take other scripts into account, like nouns in the Latin script, which do occur, and that's the main problem that this issue needs to address.
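
To illustrate why Latin-script nouns need dictionary data, here is a rough Python sketch. It is not how the project stores or looks up this information: the `LATIN_READINGS` table, its two sample entries, the `topic_marker` function, and the `은(는)` fallback for unknown words are all assumptions made for the example.

```python
# Hypothetical sample data: Latin-script nouns mapped to a Hangul reading.
# A real table would have to come from dictionary data, not a hard-coded dict.
LATIN_READINGS = {
    "galaxy": "갤럭시",  # ends in an open syllable
    "iphone": "아이폰",  # ends in a closed syllable (final ㄴ)
}


def topic_marker(noun: str) -> str:
    """Pick the topic marker for a noun written in Hangul or in Latin script."""
    last = noun[-1]
    if not "\uac00" <= last <= "\ud7a3":
        # Not a precomposed Hangul syllable: the spelling alone does not tell
        # us how the noun is pronounced, so fall back to the lookup table.
        reading = LATIN_READINGS.get(noun.lower())
        if reading is None:
            return "은(는)"  # assumed fallback that shows both forms
        last = reading[-1]
    has_final_consonant = (ord(last) - 0xAC00) % 28 != 0
    return "은" if has_final_consonant else "는"


for word in ("Galaxy", "iPhone", "사과"):
    print(word + topic_marker(word))
```

The point of the sketch is only that the code point rule cannot be applied to the Latin spelling itself; some pronunciation data is needed first.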

The code can also be used when you have a term of address. This is important when generating a message with the correct level of register (degree of politeness). That's incomplete in Korean right now. I'm not aware of any frameworks that can do that, and I'm sure most people just use a default register in Korean. In other languages, this topic is important to properly refer to someone (e.g. in a contact) with the right gender. Such word choices generally affect many parts of speech. The gender of the speaker (first person), the audience (second person), the other human in the message (third person), and the non-human object all affect word choices.

Just because you normally default the register of a sentence in Korean, it doesn't mean that it shouldn't be allowed to change to match the current context. The tense is generally not being modified because that would be too complicated, and generally not necessary.

With that being said, I'm also fine with deferring the replacement of the Latin script data that is already included, if it proves to be too difficult to replace.
