In both the official release and the SVN version, I find that Code::Blocks is compiled in ANSI mode rather than UTF-8.
Because I am from Taiwan, some Chinese words cannot be displayed in Code::Blocks, not even in a simple comment.
So I really, really hope that it could support reading UTF-8 source code.
By the way, will the next official release ship with the wx lib? Or will we need to compile the wx lib ourselves? That is not easy for a beginner.
Thanks.
Hello,
AFAIK C::B RC2 supports UNICODE (http://forums.next.codeblocks.org/index.php?topic=1162.0).
You should compile wxWidgets with UNICODE and then C::B sources.
You can also download Therion's wxWindows 2.6.2 build (see http://paginas.terra.com.br/informatica/mauricio/codeblocks/). This package includes dll and static libraries for GCC 3.4.4 (both Unicode and NonUnicode).
Michael
Are there any disadvantages to having C::B compiled in Unicode mode for the official releases (e.g. RC3)?
Hi, Michael
In the version built with Therion's wxWindows 2.6.2 build,
Help -> About still says wx2.6.2 (Windows, ANSI),
and I cannot use Code::Blocks to open source code that is encoded in UTF-8.
I know that the libs he provides include a Unicode version, but what I mean is that
the Code::Blocks editor still cannot open UTF-8 source.
Thanks~~~ ^_^
Quote from: Michael on December 12, 2005, 01:04:26 PM
Hello,
AFAIK C::B RC2 supports UNICODE (http://forums.next.codeblocks.org/index.php?topic=1162.0).
You should compile wxWidgets with UNICODE and then C::B sources.
You can also download Therion's wxWindows 2.6.2 build (see http://paginas.terra.com.br/informatica/mauricio/codeblocks/). This package includes dll and static libraries for GCC 3.4.4 (both Unicode and NonUnicode).
Michael
I'm afraid no one is making Unicode builds of Code::Blocks.
Quote from: Takeshi Miya on December 12, 2005, 01:21:34 PM
I'm afraid no one is making Unicode builds of Code::Blocks.
But you can make a UNICODE build of C::B, can't you? From what I understood from the post "Version 1.0rc2 released!" (http://forums.next.codeblocks.org/index.php?topic=1162.0), C::B supports UNICODE.
Michael
Yes, anyone can, but no one is distributing Unicode builds of C::B for Win32.
"C::B supports Unicode" means that it can be compiled in Unicode, not that it is compiled in Unicode.
Quote from: Takeshi Miya on December 12, 2005, 01:43:30 PM
"C::B supports Unicode" means that it can be compiled in Unicode, not that it is compiled in Unicode.
OK, so I understood correctly. Thank you.
I think, dbtsai, that you will have to make a UNICODE build of C::B with the UNICODE wxWidgets from Therion (or with a UNICODE wxWidgets that you compile yourself, if you prefer).
Michael
Quote from: Takeshi Miya on December 12, 2005, 01:06:31 PM
Are there any disadvantages to having C::B compiled in Unicode mode for the official releases (e.g. RC3)?
Yes, there are disadvantages. Unicode support is not 100% finished and tested. Also, at least one third party library used in Code::Blocks does not support wide character strings (even though it apparently still works, somehow).
ANSI, on the other hand, works 100% certain and is officially supported.
No doubt, some day Code::Blocks will switch to Unicode altogether (as that will work universally), but I dare not say when that will be.
Hi,
OK, I will try to compile it myself. If there is any good news, I will post it. ^_^
In my case, a Chinese character takes two bytes in ANSI mode,
but in C::B, when I press the Delete key, it deletes only one byte, which is half of a Chinese character.
That is not correct. Most Chinese or Japanese programs need to take this problem into consideration, and
programmers need to solve it by themselves. That is why I very, very much hope C::B will support UTF-8.
Thanks
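The half-deleted character described above can be illustrated with a minimal C++ sketch. It uses UTF-8 rather than a double-byte ANSI code page, purely for illustration, and it assumes the string holds valid UTF-8. The helper names erase_last_utf8_char and erase_last_byte are hypothetical and are not part of Code::Blocks or wxWidgets.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: remove the last whole UTF-8 character from s.
// Assumes s holds valid UTF-8. Continuation bytes have the form 10xxxxxx,
// so we walk backwards over them until we reach the lead byte.
void erase_last_utf8_char(std::string& s)
{
    if (s.empty()) return;
    std::size_t pos = s.size() - 1;
    while (pos > 0 && (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80) {
        --pos;
    }
    s.erase(pos);
}

// The naive per-byte delete, for comparison: it can cut a character in half.
void erase_last_byte(std::string& s)
{
    if (!s.empty()) s.pop_back();
}
```

For a string ending in the three-byte character U+4E2D ("\xE4\xB8\xAD"), erase_last_byte leaves two orphaned bytes behind, while erase_last_utf8_char removes all three.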
Quote from: thomas on December 12, 2005, 02:53:08 PM
Also, at least one third party library used in Code::Blocks does not support wide character strings (even though it apparently still works, somehow).
ANSI, on the other hand, works 100% certain and is officially supported.
Which specific libraries don't support wide chars, and what can we do to make them support them, apart from submitting a feature request?
This is one I know about, and the most important at the same time:
Quote from: http://www.grinninglizard.com/tinyxmldocs/index.html
TinyXml supports UTF-8 allowing to manipulate XML files in any language.
[...]
TinyXml does not use or directly support wchar, TCHAR, or Microsofts _UNICODE at this time.
Apparently, it still works ... somehow. Although I do not understand how it works, it actually seems to do o.k. in Unicode builds. But it still does not feel good.
And here might just be the first case where it doesn't....
http://forums.next.codeblocks.org/index.php?topic=1618.0
I tried but didn't have time to fight with it. It's running stable in ANSI so I left it there. :?
Once I take care of my "level 1 problems (most important bugs to fix IMO)," I might work on this again.
Quote from: http://www.grinninglizard.com/tinyxmldocs/index.html
TinyXml supports UTF-8 allowing to manipulate XML files in any language.
[...]
TinyXml does not use or directly support wchar, TCHAR, or Microsofts _UNICODE at this time.
This makes little sense to me. UTF-8 is a particular representation of Unicode text requiring at least 8 bits per character, widely used because ASCII text is byte-for-byte identical in UTF-8. Supporting UTF-8 should be enough for Unicode operability in any language.
WCHAR and TCHAR are just Windows-specific typedefs, as far as I know. (Reference: MSDN (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/winprog/winprog/windows_data_types.asp))
_UNICODE is a preprocessor definition used by Microsoft's compiler. (Reference: Microsoft (http://www.microsoft.com/globaldev/getwr/steps/wrg_unicode.mspx))
What, then, do WCHAR, TCHAR, and _UNICODE have to do with a proper/complete implementation of Unicode support?
Let's see: TinyXml loads files in UTF-8; it can't load any other Unicode encoding (neither from a file nor from memory).
In memory it stores the UTF-8 encoded strings as an array of chars. Each byte IS NOT a character (only in English text does a byte happen to equal a character).
However, wxWidgets, or Windows for that matter, handles Unicode in memory in another encoding: UTF-16 <multibyte encoding>.
So if we want the Unicode data from TinyXml, we must convert UTF-8->UTF-16. And if we want to talk from wxWidgets to TinyXml, UTF-16->UTF-8.
The reason it appears to work "somehow", as thomas mentioned above, may be that wxWidgets uses (I think) the wxMBConv classes to do this conversion by default (it assumes UTF-8 if you don't specify another encoding) when compiled in Unicode mode.
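The UTF-8->UTF-16 direction described above can be sketched in standard C++. This is not how wxWidgets or TinyXml implement it. The function name utf8_to_utf16 is made up for illustration, and the decoder is deliberately minimal: it rejects malformed lead and continuation bytes but does not check for overlong forms or encoded surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | static_cast<char32_t>(c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            // Fits in a single 16-bit code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Needs a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The three bytes 0xE4 0xB8 0xAD decode to the single code unit 0x4E2D, so the byte count and the character count differ, which is the point made above about bytes not being characters.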
UTF-8 is a way to write Unicode in a backwards-compatible way on media that support 8-bit-per-character tokens. It is a variable-length format which uses between 1 octet (for ASCII characters) and 4 octets (the original definition allowed up to 6). Most European languages can be represented with sequences of 1-2 octets per character; Chinese, Japanese, and Korean characters typically need 3.
Unicode is a family of standards (I know at least two different standards) which represent characters in words of 16 bits or 32 bits. Maybe there are even more standards which I do not know about, but that does not matter. The characters that UTF-8 encodes are really words of 16 or 32 bits.
If you are to represent Unicode text in a wxString, this is done by using wchar_t characters. On Windows, these are 16 bits; on my Linux box, they are 32 bits. Whatever the size, sizeof(wchar_t) != sizeof(char), so if you pass "ABC" you do not really pass 0x41, 0x42, 0x43 -- in reality, you pass two (or four) times as much data, for example 0x41, 0x00, 0x42, 0x00, 0x43, 0x00. (In fact I have no idea about the actual encoding -- what matters, though, is that these are 16/32 bit values.)
So obviously it cannot work reliably if you hand this data to some library which expects characters to be octets. It may work for a while, and then fail randomly due to a thing as simple as calling strlen() on a character string that happens to have 0x00 as the upper byte somewhere.
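The strlen() failure mode can be reproduced without wxWidgets. The sketch below lays out 16-bit characters as little-endian bytes (an assumption that matches x86 Windows) and shows what a char-based routine sees when handed that buffer. The helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Lay out UTF-16 code units as little-endian bytes, the way a buffer of
// 16-bit wide characters looks in memory on x86, followed by a 16-bit
// terminator.
std::vector<char> as_little_endian_bytes(const std::u16string& text)
{
    std::vector<char> bytes;
    for (char16_t unit : text) {
        bytes.push_back(static_cast<char>(unit & 0xFF));  // low byte first
        bytes.push_back(static_cast<char>(unit >> 8));    // then high byte
    }
    bytes.push_back('\0');
    bytes.push_back('\0');
    return bytes;
}

// The length a char-based library computes if it treats that buffer as an
// ordinary C string.
std::size_t length_seen_by_char_library(const std::u16string& text)
{
    const std::vector<char> bytes = as_little_endian_bytes(text);
    return std::strlen(bytes.data());
}
```

For u"ABC" the buffer is 0x41, 0x00, 0x42, 0x00, 0x43, 0x00 plus the two-byte terminator, so length_seen_by_char_library returns 1: strlen() stops at the first 0x00 upper byte.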
Quote from: Takeshi Miya
Let's see: TinyXml loads files in UTF-8, it can't load any other Unicode encoding (neither from a file or in memory).
There's no reason why anyone should require support of multiple Unicode encodings. People may prefer UTF-16 or some other structure, but it is perfectly possible to convert losslessly. In any case, this is a tangential discussion, irrelevant to my original point.
Quote from: thomas
So obviously it cannot work reliably if you hand this data to some library which expects characters to be octets.
A library that properly supports UTF-8 does not expect each character to be eight bits. UTF-8 is a variable-width representation. Characters "wider" than 8 bits come into play when the MSB is set. (Of course, this only occurs when the encoding is not strict ASCII.)
If the code assumes a particular width or byte alignment when it does not exist (as is clearly the case with UTF-8), then it is an improper implementation -- to say the least. The claim that TinyXML supports UTF-8 would therefore be false.
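As a small illustration of the variable-width point, here is a minimal C++ sketch that counts characters (code points) in a UTF-8 string by looking at the top bits of each byte. It assumes the input is valid UTF-8, and the function name utf8_code_points is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string. Bytes of the form 10xxxxxx are
// continuation bytes that belong to the preceding lead byte, so every
// byte that is NOT a continuation byte starts a new character.
// Assumes the input is valid UTF-8.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For pure ASCII the result equals the byte count. For the six bytes "\xE4\xB8\xAD\xE6\x96\x87" (two Chinese characters) it is 2, so code that equates size() with the number of characters is making exactly the width assumption discussed above.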
Understand what I meant now?
TinyXml supports UTF-8.
How are you supposed to store UTF-8 encoded text in memory then...?
We can probably work around it. When first writing the ConfigManager, I used mb_str() a few times when passing data to tinyXML, and c_str() in other places. Actually, I don't remember any more why I did that in the first place. I think it was because certain things were ANSI anyway.... Either way, Yiannis was nice enough to change most of them to mb_str() while I wasn't looking, and that was really a good idea. It means that now we are feeding tinyXML octet streams (with very few exceptions), so it should really work.
The reason it still does not work 100% is that the CRC calculation for the layout is not good and that we may have missed one or two spots. But I expect it to work reliably once that has been addressed.
So.... hopefully no reason to worry about tinyXML any more.
Quote from: Takeshi Miya on December 16, 2005, 06:13:42 PM
TinyXml supports UTF-8.
How are you supposed to store UTF-8 encoded text in memory then...?
What encoding do you use to store your text in RAM, you mean? I see two optimal ways:
1.) As UTF-8 (which, once again, is variable-width)
2.) As UTF-16 (Windows and other systems seem to accept Unicode data most often using this encoding)
#1 makes it a simple matter to read and write data between disk and RAM, since you'll very likely be using UTF-8 for both. The latter option is better if you're commonly calling functions from system or third-party libraries that require UTF-16. The alternative to #2 in the same situation is multiple copies of the text in different encodings, which is not only messy, tedious, and a potential source of bugs, but also a misuse of RAM and processing.
In any case, thomas sounds like he knows how to manage whatever the problem is/was. I still do not completely understand the nature of the problem; hence why I asked.
The problem is that we store all text in UTF-16 using wchar[], and we do not have a choice to do otherwise. tinyXML does not support wchar. Therefore, we convert to UTF-8 just before passing the data to tinyXML.
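The "convert to UTF-8 just before passing the data" step can be sketched in standard C++. This is not the wxWidgets conversion code; it only shows the kind of conversion meant, and the function name utf16_to_utf8 is hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode UTF-16 code units as a UTF-8 octet string,
// suitable for a library that only accepts char-based input.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }

        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

Unless the text itself contains U+0000, the result has no embedded zero bytes, so it can safely be handed to code that treats it as an ordinary C string.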
Also, wxScintilla might not be completely Unicode-safe. This is only a suspicion, not necessarily true. While browsing the sources, I have spotted several places where they use chars as indices or compare against const char values. Unless these are only applied on text fragments which have been converted to UTF-8 (which I don't know, maybe they are?), this may be an issue. In that case, we will have another problem which is not easily solved.
Regarding Scintilla, I once asked the SciTE developers if support for Unicode filenames was feasible, and they answered that someone had once started working on that, but it wasn't an easy task and required a rather major rewrite.
Anyways, Unicode text in Scintilla seems to work OK, but we can always expect bugs because they have their own string class, and I noticed some const char* around the code too, so I'm not sure if it fully supports Unicode.
Well, I would like to try a UTF-8 build of C::B,
but I cannot get it to compile....
Could anyone release a UTF-8 compiled version?
Then people could try it and see what goes wrong!!
^_^
Quote from: dbtsai on December 20, 2005, 12:57:21 PM
Well, I would like to try utf-8 version C::B,
but I can not compile it well....
http://forums.next.codeblocks.org/index.php?topic=1701.0