Support utf-8 encoding guessing #84495

sunbohong · 2019-11-11T15:04:50Z

Guessing of utf-8 (along with the rest of ['ascii', 'utf-8', 'utf-16', 'utf-32']) is disabled by this line: "Ignore encodings that cannot guess correctly".
But this link doesn't mention it.
And in my test, the following text is correctly guessed as utf-8 (result='utf-8').

jschardet result:

On the other hand, with the following setting, it is detected as GBK

vscodebot · 2019-11-11T15:04:56Z

(Experimental duplicate detection)
Thanks for submitting this issue. Please also check if it is already covered by an existing one, like:

sunbohong · 2019-11-12T05:27:52Z

It is not as simple as it looks...
See these issues:
https://github.com/microsoft/vscode/labels/file-guess-encoding
aadsm/jschardet#48
aadsm/jschardet#49

These issues will be solved by #84503

sunbohong · 2019-11-12T06:09:01Z

It is not as simple as it looks...
See these issues:
https://github.com/microsoft/vscode/labels/file-guess-encoding
aadsm/jschardet#48
aadsm/jschardet#49

In particular, aadsm/jschardet#48 will be treated as utf-8

bpasero · 2019-11-12T08:57:32Z

I like this change because it makes jschardet more deterministic by giving it full control over the detection. I still think more work is needed to increase the confidence of the detection, but that can continue in other issues.

bpasero · 2019-11-12T17:49:59Z

@sunbohong I noticed a bad regression though that made me push ddfca30 to ignore the guessed encoding "ascii" for a simple reason:

user opens a text file with just ascii characters
we guess the encoding to be "ascii"
user types special characters (like German umlauts)
user saves and closes the file
user reopens the file

=> the file is still guessed as "ascii" because it was not saved with a proper encoding
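The loop described above can be sketched with two toy functions. Both are hypothetical stand-ins, not VS Code or jschardet code: `encodeAsciiLossy` assumes that saving as "ascii" replaces every non-ASCII character with `?`, and `guessToy` assumes a detector that reports "ascii" for pure 7-bit content.

```typescript
// Toy stand-in for a lossy ASCII encoder (assumption): characters outside
// 0x00-0x7F are replaced with "?" (0x3f) when the file is saved.
function encodeAsciiLossy(text: string): Uint8Array {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    bytes[i] = code <= 0x7f ? code : 0x3f;
  }
  return bytes;
}

// Toy stand-in for a detector (assumption): pure 7-bit content is "ascii".
function guessToy(bytes: Uint8Array): string {
  return bytes.some((b) => b > 0x7f) ? "unknown" : "ascii";
}

// The user typed umlauts into a file that was guessed as "ascii" and saved it.
const saved = encodeAsciiLossy("Grüße");

// On reopen the bytes are still pure ASCII, so the guess stays "ascii".
const guessAfterReopen = guessToy(saved);

// Decode the saved bytes again to see what survived the save.
let reopenedText = "";
saved.forEach((b) => {
  reopenedText += String.fromCharCode(b);
});
```

Under these assumptions the special characters are lost on save and the file never leaves the "ascii" guess, which is the reason for ignoring that guess entirely.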

bpasero · 2019-11-19T18:07:08Z

Verification:

configure "files.autoGuessEncoding": true
configure "files.encoding": "windows1252" (any non-UTF-8 encoding will do)
save a file containing special characters (e.g. 私は和食が好きです。) as UTF-8
close all files
open it in VSCode

=> the status bar should show UTF-8 as the encoding.
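As a rough illustration of why text like the sample above is a good candidate for UTF-8 guessing, here is a simplified structural check. It is not how jschardet works internally; `looksLikeUtf8` is a hypothetical helper that only validates lead and continuation byte patterns and does not reject overlong forms or surrogates.

```typescript
// Simplified structural UTF-8 check (hypothetical helper, not jschardet code).
function looksLikeUtf8(bytes: Uint8Array): boolean {
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    let continuation: number;
    if (lead <= 0x7f) continuation = 0;                       // single-byte (ASCII)
    else if (lead >= 0xc2 && lead <= 0xdf) continuation = 1;  // 2-byte sequence
    else if (lead >= 0xe0 && lead <= 0xef) continuation = 2;  // 3-byte sequence
    else if (lead >= 0xf0 && lead <= 0xf4) continuation = 3;  // 4-byte sequence
    else return false;                                        // invalid lead byte

    // A sequence that runs past the end of the buffer is truncated.
    if (i + continuation >= bytes.length) return false;

    // Every continuation byte must match the bit pattern 10xxxxxx.
    for (let k = 1; k <= continuation; k++) {
      if ((bytes[i + k] & 0xc0) !== 0x80) return false;
    }
    i += continuation + 1;
  }
  return true;
}

// "私" (U+79C1) encoded as UTF-8 is the three bytes E7 A7 81.
const watashi = new Uint8Array([0xe7, 0xa7, 0x81]);
const watashiOk = looksLikeUtf8(watashi);

// "ü" in windows-1252 is the single byte FC, which is never valid in UTF-8.
const umlautOk = looksLikeUtf8(new Uint8Array([0xfc]));
```

Multi-byte UTF-8 sequences follow a strict pattern, so bytes written in a legacy single-byte encoding rarely pass a check like this by accident.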

bpasero · 2019-12-05T09:36:25Z

Given scary issues such as #85821 I am putting UTF-16 and UTF-32 back on the list of ignored encodings for guessing. UTF-8 can still be guessed.
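A minimal sketch of what such an ignore list could look like after this change. The names `IGNORED_GUESSES` and `acceptGuess` are hypothetical, and the prefix match (so that a variant such as "UTF-16LE" is also ignored) is an assumption rather than the actual VS Code logic.

```typescript
// Hypothetical ignore list reflecting the state described above:
// "ascii", UTF-16 and UTF-32 guesses are discarded, UTF-8 is allowed.
const IGNORED_GUESSES = ["ascii", "utf-16", "utf-32"];

// guessed: the encoding name reported by a detector, or null for no guess.
// Returns the lower-cased name when the guess is usable, otherwise null so
// the caller can fall back to the configured encoding.
function acceptGuess(guessed: string | null): string | null {
  if (guessed === null) {
    return null;
  }
  const name = guessed.toLowerCase();
  for (const ignored of IGNORED_GUESSES) {
    // Prefix match (assumption) so that e.g. "utf-16le" is ignored as well.
    if (name.indexOf(ignored) === 0) {
      return null;
    }
  }
  return name;
}
```

With this shape, `acceptGuess("UTF-8")` yields a usable guess, while `acceptGuess("ascii")` and `acceptGuess("UTF-16LE")` both fall through to the configured encoding.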

sunbohong · 2019-12-05T14:32:00Z

@bpasero Judging from issue #85821, we really need to replace jschardet with a new library

sunbohong mentioned this issue Nov 11, 2019

Update encoding.ts sunbohong/vscode#1

Merged

sunbohong closed this as completed in sunbohong/vscode#1 Nov 11, 2019

sunbohong mentioned this issue Nov 11, 2019

Support utf-8 encoding guessing #84504

Merged

sunbohong reopened this Nov 11, 2019

alexdima assigned bpasero Nov 11, 2019

bpasero added feature-request, file-guess-encoding labels Nov 12, 2019

bpasero added this to the November 2019 milestone Nov 12, 2019

bpasero closed this as completed in #84504 Nov 12, 2019

MatthiasWinkelmann mentioned this issue Nov 16, 2019

Silent Corruption of UT8-Encoded Files missidentified as UTF8-BOM #84973

Closed

bpasero added the verification-needed label Nov 19, 2019

connor4312 added verified, verification-found and removed verified, verification-found labels Dec 3, 2019

vscodebot bot locked and limited conversation to collaborators Dec 27, 2019
