Integrate ko Wikidata into Unicode Inflection #63
Comments
There are 519 Korean nouns in Wikidata; see this query.
Actually, Korean is different from the other languages. Most of the inflection capability comes with loan words because the writing system is phonetic. That's why I mentioned combining the English and Korean data. It's also why I did not mention fixing the unit tests. The generated dictionary passes the unit tests, but it would be helpful to remove the hard-coded list in supplemental_ko.lst.
Adding nouns won't improve functionality. Getting more verbs to handle the register might be interesting, but that's new functionality.
If the primary purpose of this project is to use the result in message formatting, I don't think Korean needs any extensive data collection unless there is a need for verb/adjective conjugation (tense, honorific/degree of politeness, etc.) in message formatting, which is extremely rare, if it happens at all, for Korean. As for nouns (proper nouns included) and pronouns, there are always two alternative case markers/postpositions for each case, and which one to use is 100% determined by whether the preceding noun ends with a closed syllable (that is, it has a coda consonant) or an open syllable (that is, it ends with a medial vowel). For instance, '은' (topic marker) is used if the preceding noun ends with a closed syllable and '는' (alternate topic marker) is used otherwise. Likewise, '을/를' (object marker), '이/가' (subject marker)
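The closed/open syllable rule described in this comment can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the names `PARTICLES`, `has_coda`, and `attach` are hypothetical, and the sketch assumes the noun ends in a precomposed Hangul syllable (U+AC00 to U+D7A3), where the final-consonant index is the code point offset modulo 28.

```python
import unicodedata

HANGUL_FIRST = 0xAC00   # '가', first precomposed Hangul syllable
HANGUL_LAST = 0xD7A3    # '힣', last precomposed Hangul syllable
FINAL_COUNT = 28        # possible finals per syllable, index 0 means "no coda"

# Hypothetical table: (form after a closed syllable, form after an open syllable)
PARTICLES = {
    "topic": ("은", "는"),
    "object": ("을", "를"),
    "subject": ("이", "가"),
}


def has_coda(syllable: str) -> bool:
    """Return True if a precomposed Hangul syllable ends in a consonant."""
    code = ord(syllable)
    if not HANGUL_FIRST <= code <= HANGUL_LAST:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    return (code - HANGUL_FIRST) % FINAL_COUNT != 0


def attach(noun: str, role: str) -> str:
    """Append the particle variant that matches the noun's last syllable."""
    # Normalize so that conjoining jamo sequences become precomposed syllables.
    noun = unicodedata.normalize("NFC", noun)
    closed_form, open_form = PARTICLES[role]
    return noun + (closed_form if has_coda(noun[-1]) else open_form)


print(attach("책", "topic"))     # closed syllable
print(attach("나무", "topic"))   # open syllable
```

Because the choice depends only on the last syllable, no per-noun dictionary entry is needed for nouns written in Hangul, which matches the point made above.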
As far as attaching the particles is concerned, you are generally correct for the Korean script. The current implementation already does that. You also have to take other scripts into account, like nouns in the Latin script, which do occur, and that's the main problem that this issue needs to address. The code can also be used when you have a term of address. This is important when generating a message with the correct register (degree of politeness). That's incomplete in Korean right now. I'm not aware of any frameworks that can do that, and I'm sure most people just use a default register in Korean. In other languages, this topic is important for properly referring to someone (e.g. in a contact) with the right gender. Such word choices generally affect many parts of speech. The gender of the speaker (first person), the audience (second person), the other human in the message (third person), and the non-human object all affect word choices. Just because you normally default the register of a sentence in Korean doesn't mean that it shouldn't be allowed to change to match the current context. The tense is generally not modified because that would be too complicated, and it is generally not necessary. That being said, I'm also fine with deferring the replacement of the Latin script data that is already included if it proves too difficult to replace.
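To illustrate why Latin-script nouns need data while Hangul nouns do not, here is a rough Python sketch. Everything in it is an assumption made for illustration: the lookup table `LATIN_ENDS_IN_CONSONANT` is a hand-made stand-in (it is not the content or format of supplemental_ko.lst), the function names are hypothetical, and the "은(는)" fallback is just one common way to write both variants when the pronunciation is unknown.

```python
import unicodedata
from typing import Optional

# Hypothetical data: does the Korean pronunciation of the word end in a consonant?
LATIN_ENDS_IN_CONSONANT = {
    "iphone": True,    # pronounced 아이폰, ends in ㄴ
    "safari": False,   # pronounced 사파리, ends in a vowel
}


def ends_closed(noun: str) -> Optional[bool]:
    """Return True/False when the ending is known, None when it cannot be decided."""
    last = unicodedata.normalize("NFC", noun)[-1:]
    if last and 0xAC00 <= ord(last) <= 0xD7A3:
        # Precomposed Hangul: a final-consonant index of 0 means an open syllable.
        return (ord(last) - 0xAC00) % 28 != 0
    # Other scripts: spelling alone is not enough, so consult the lookup table.
    return LATIN_ENDS_IN_CONSONANT.get(noun.casefold())


def with_topic_marker(noun: str) -> str:
    """Attach the topic marker, falling back to the combined form if unsure."""
    closed = ends_closed(noun)
    if closed is None:
        return noun + "은(는)"
    return noun + ("은" if closed else "는")


for word in ("책", "iPhone", "Safari", "Foo"):
    print(with_topic_marker(word))
```

The Hangul branch needs no data at all, whereas every Latin-script word needs an entry or some pronunciation source, which is why that data is the hard part to replace.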
The revised dictionary-parser can parse Wikidata, but some issues need to be resolved.
The initial issues include:
Tool output that needs to be addressed:
Here are the currently generated lexical dictionary files for reference.
ko.zip