unreliable detection - windows1250 #70
Comments
Hey bro, still bad? I just tested the new VS Code Insiders build, which uses jschardet; check here:
Hey! I just saw your comment on microsoft/vscode#208550. Let me take a look into this. Also, don't-brow-me ;P.
Yeah, that code page is in the group of tests I wasn't able to come up with a string for. I was only able to test Thanks for providing one (windows1250.zip). I'm going to use it to create the test and figure out what's going on. Fwiw, Sublime Text is also not able to detect it correctly. I'm less familiar with the Eastern European languages. Out of curiosity, can you tell me the difference between the two? And when is one used more than the other?
Afaik Microsoft used this sentence to test fonts (pangram = showcase of accented characters). I'd suggest that, if unsure, whenever Windows-1252 or ISO-8859-2 is detected with any confidence, you also add 1250 to the possible results with a slightly lower confidence level. Its similarity and its characters are also mentioned on Wikipedia under Windows-1250. So if there is a chance it's 1252, there's also a chance it's 1250. That way, current users would still get what they expected before, and in VS Code, where I can provide multiple guess candidates in recent Insiders builds, I would get what I need if I configure it to look only for a chance of 1250. That would be an immediate quick fix, "to be perfected" later.

Also, I suggest you rename it. It was never named Hungarian afaik; its official name is "Central European" because it's used in many countries there, and naming it after one country may upset the others.
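The suggestion above can be sketched as a small post-processing step over a detector's results. This is only an illustration in Python, not jschardet's actual code: the function name `add_windows_1250_candidate`, the `(name, confidence)` tuple shape, and the `penalty` value of 0.05 are all assumptions made for the example.

```python
# Hypothetical post-processing of detection results; not part of jschardet.
# Encodings that are easily confused with windows-1250 (names compared in lowercase).
LOOKALIKES = {"windows-1252", "iso-8859-2"}


def add_windows_1250_candidate(candidates, penalty=0.05):
    """Return the candidates plus a windows-1250 guess when a look-alike was detected.

    candidates: list of (encoding_name, confidence) tuples, confidence in [0, 1].
    penalty: assumed amount by which the added guess ranks below its trigger.
    """
    names = {name.lower() for name, _ in candidates}
    if "windows-1250" in names:
        # The detector already proposed it; leave the results untouched.
        return list(candidates)

    triggers = [conf for name, conf in candidates
                if name.lower() in LOOKALIKES and conf > 0]
    if not triggers:
        # No look-alike was detected, so there is nothing to add.
        return list(candidates)

    extra = ("windows-1250", max(0.0, max(triggers) - penalty))
    # Highest confidence first, so existing callers still see the same top guess.
    return sorted(list(candidates) + [extra], key=lambda c: c[1], reverse=True)


if __name__ == "__main__":
    print(add_windows_1250_candidate([("windows-1252", 0.9)]))
```

Because the added guess always ranks below the encoding that triggered it, callers that only read the top result keep their current behaviour, while callers that filter for windows-1250 can now find it.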
This is in relation to #70. The differences between the two are minimal, so this will be a workaround for now. These two encodings use very different models for detection. The windows-1252 detector is based purely on the occurrence probability of each character's class. The windows-1250 detector uses a Hungarian language model to detect whether the text is in Hungarian. This is brittle, as there are other languages that use windows-1250.
I was busy on the weekend. I think your suggestion makes sense, so I went ahead and implemented just that. The reason it's named "windows-1250 (Hungarian)" is that it uses a Hungarian language model to predict whether the text is in Hungarian. Like you mentioned, other countries used the same encoding, so I imagine that's the reason we're getting no match at all. But it could also be that windows-1252 is just being detected with fewer characters than windows-1250 needs to come up with any confidence. This is something that I might look into in the future, as it could also affect other encodings. Ah yeah, the Slavic countries have been a source of invasion and dispute for centuries 😬. I didn't follow the breakup of those countries after the fall of the USSR, but I still remember Yugoslavia and Czechoslovakia being on the news.
windows-1250 is labeled as Hungarian, but it is really Central European, so the text may also be Slovak or Czech, or maybe even another language. The proper name is "Central European". The accented characters to recognize are, for example, čČšŠťŤžŽéÉľĽ
I found this in VS Code, which uses this library: text saved as 1250 gets detected on reopen as 1252, as one of the other 125* code pages, or even as ISO-8859-2, etc. The result depends on which subset of these non-basic characters is in the content.
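Why the result depends on the subset of characters can be checked with Python's built-in codecs. The sketch below encodes the sample characters as windows-1250 and reports which of them decode to the same character under another single-byte encoding; the helper name `same_in_both` is made up for this example, and `cp1250`, `cp1252`, and `iso8859_2` are Python's codec names for the three encodings.

```python
def same_in_both(text: str, other_codec: str) -> str:
    """Return the characters of `text` whose windows-1250 byte decodes to the
    same character under `other_codec` (both are single-byte encodings)."""
    raw = text.encode("cp1250")
    # Bytes that are undefined in the other encoding become U+FFFD.
    misread = raw.decode(other_codec, errors="replace")
    return "".join(ch for ch, seen in zip(text, misread) if ch == seen)


sample = "čČšŠťŤžŽéÉľĽ"
print(same_in_both(sample, "cp1252"))     # šŠžŽéÉ
print(same_in_both(sample, "iso8859_2"))  # čČéÉ
```

A file that only uses š, ž, and é is therefore byte-for-byte identical in windows-1250 and windows-1252, so no detector can tell the two apart from the bytes alone; only characters such as č, ť, or ľ distinguish them.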