Looking for non english sources to test encoding detection

Started by Jenna, March 01, 2009, 01:16:30 PM

Jenna

I'm currently experimenting with Mozilla's charset detection for C::B (see this thread: http://forums.next.codeblocks.org/index.php/topic,10159.msg70493.html#msg70493)

I'm looking for files whose encodings are not correctly recognized by C::B's encoding detection.

I mean any files that can only be opened after conversion to UTF-8, by forcing a specific fallback encoding, or by bypassing C::B's autodetection.

Files that contain Chinese, Japanese, Cyrillic, Eastern European or Hebrew characters are especially welcome.

It would be nice to have both a native and a UTF-8 version of each file, to see whether the characters are detected and displayed correctly.

Please don't attach such files to your posts; send them by mail to "chardet at jenslody dot de" instead, so we avoid unnecessary server load.

I will put them on my server so that others can test with them if they want.
They will be available at http://chardet.jenslody.de/ (empty at the moment).

If you don't want your files to be published, please put a short note in the mail.

I'm interested in single files and, of course, also in complete (short example) projects/workspaces.
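
To illustrate why a native/UTF-8 pair is useful: if both versions decode to the same text, the guessed encoding for the native file is at least plausible. The following is only a rough sketch of such a comparison in Python, not anything C::B does internally; the function name `same_text` and its parameters are made up for this example.

```python
def same_text(native: bytes, native_encoding: str, utf8: bytes) -> bool:
    """Return True if the native-encoded bytes and the UTF-8 bytes hold the same text.

    Hypothetical helper for illustration only.
    """
    try:
        native_text = native.decode(native_encoding)
        # "utf-8-sig" strips a leading BOM if there is one.
        utf8_text = utf8.decode("utf-8-sig")
    except UnicodeDecodeError:
        # The bytes are not valid in the assumed encoding.
        return False
    # Ignore differences in line endings between the two versions.
    return native_text.replace("\r\n", "\n") == utf8_text.replace("\r\n", "\n")
```

A wrong guess for the native encoding usually shows up as a mismatch, although two similar single-byte encodings can agree on a file that only uses characters they share.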

ollydbg

OK, I can report some files located in the Code::Blocks source folder:

src/plugins/codecompletion/parser/tokenizer.cpp

src/sdk/wxscintilla/src/scintilla/src/LexMatlab.cxx

src/sdk/wxscintilla/src/scintilla/src/LexErlang.cxx

src/sdk/wxscintilla/src/scintilla/src/Editor.cxx

src/sdk/resources/lexers/lexer_css.xml

src/plugins/compilergcc/compilergcc.cpp


Thank you!


If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.

Jenna

The last two files are identified correctly both in pure trunk and with the Mozilla detection (one as UTF-8 with BOM and the other as UTF-8 without BOM).
The others only work on trunk when using the system fallback, and are detected as CP1252 (Windows-1252) by the Mozilla detector.
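
For readers unfamiliar with the three cases mentioned here (UTF-8 with BOM, UTF-8 without BOM, and a single-byte fallback), this is a very simplified sketch in Python. It is neither the trunk logic nor the Mozilla detector, which uses statistical models; `guess_encoding` and its `fallback` parameter are names invented for this example.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Return a Python codec name guessed for `data` (illustrative heuristic only)."""
    # 1. A UTF-8 byte order mark is an unambiguous signal.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # 2. Without a BOM, accept UTF-8 only if the bytes decode strictly.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Otherwise use the configured fallback (assumed here to be CP1252).
    return fallback
```

Step 3 is where a real detector has to do the hard work: a plain fallback cannot tell CP1252 apart from, say, a Cyrillic or Chinese legacy encoding.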

ollydbg


Jenna

Quote from: nanyu on March 02, 2009, 03:22:16 AM
I send one.

Thanks, nanyu.

With the Mozilla detection, C::B detects the non-UTF-8 file as Simplified Chinese (cp936). More precisely, the encoding detector reports gb18030, but I change it internally to cp936 (windows-936), because that is the only one wxWidgets knows.
The trunk version only opens the UTF-8 file correctly on my system (detected as UTF-8 with BOM).
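
The remapping described above (detector says gb18030, toolkit gets cp936) could look roughly like this. This is only a Python sketch of the idea, not the actual C::B code; the table `DETECTOR_TO_TOOLKIT` and the function `toolkit_encoding` are hypothetical names, and the table contains only the one mapping mentioned in this post.

```python
# Hypothetical alias table: name reported by the detector -> name the toolkit accepts.
DETECTOR_TO_TOOLKIT = {
    "gb18030": "cp936",
}


def toolkit_encoding(detected: str) -> str:
    """Map a detector-reported charset name to one the GUI toolkit is assumed to know."""
    key = detected.strip().lower()
    # Names without an entry are passed through unchanged.
    return DETECTOR_TO_TOOLKIT.get(key, key)
```

Keep in mind that gb18030 covers more characters than cp936, so such a substitution can fail for text that uses the four-byte gb18030 sequences.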

In my test version all characters are identical in both files, but some seem to be missing: lines 18 to 22 show a square as the first character.
That's most likely a limitation of the character set on my system, because Iceweasel (the Debian name for Firefox) shows the same.

nanyu

Quote from: jens on March 02, 2009, 07:13:52 AM

......
..... but some seem to miss: line 18 to 22 show a square as first character....


:D Don't worry about it! Those four square characters ARE meant to be four square characters.

ollydbg

Quote from: nanyu on March 02, 2009, 09:36:51 AM
Quote from: jens on March 02, 2009, 07:13:52 AM

......
..... but some seem to miss: line 18 to 22 show a square as first character....


:D Don't worry about it! Those four square characters ARE meant to be four square characters.

:D Yes, maybe Jens' system can't display Chinese characters.

Jenna

Quote from: ollydbg on March 02, 2009, 09:51:12 AM
Quote from: nanyu on March 02, 2009, 09:36:51 AM
Quote from: jens on March 02, 2009, 07:13:52 AM

......
..... but some seem to miss: line 18 to 22 show a square as first character....


:D Don't worry about it! Those four square characters ARE meant to be four square characters.
:D Yes, maybe Jens' system can't display Chinese characters.

My Linux system at home can display them, but not my Windows system (even after installing support for Chinese characters in XP).
Maybe I'm missing something.
<EDIT>
After installing support for East Asian languages, it works in C::B. Windows seems to need more files than just new fonts to display them correctly.
</EDIT>

But I cannot read Chinese, so I did not know whether the squares were intended or just replacement characters.

(My father was able to read and speak a little Chinese, but he died 15 months ago, so he cannot help me.)

nanyu

Those squares are intended; they are not replacements. Now you see?

vix

I've just sent a file with characters used in Italian (à, è, é, ì, ò and ù).
Not working in SVN 5696 and 5716.
Works in 5678 and older.

Jenna

Quote from: vix on August 04, 2009, 08:23:06 AM
I've just sent a file with characters used in Italian (à, è, é, ì, ò and ù).
Not working in SVN 5696 and 5716.
Works in 5678 and older.

Thanks, I found the cause of your problems; the answer is here.

