Friday, April 25, 2014

Unicode anyone?

As you may have guessed from my previous post, I've been digging into Unicode. It may seem odd that I haven't done this sooner, but when you work on legacy applications it is hard to justify supporting more modern things when all your time goes into making the legacy software better.

Fortunately, I'm now in a position to write fresh code that can be built on a strong foundation. As I thought through the things I wanted to do, I quickly came to the conclusion that, moving forward, I needed to write everything using Unicode.

Should be easy. Unicode has been around for 20 years now, so, unlike C++11 with its mere 3 years since adoption, everything should work and the skids should be smooth and well greased.

Of course not!

Windows seems to like UTF-16. That's fine. Other operating systems seem to like UTF-8 or UTF-16.

Out of the box on Mac and Ubuntu I can do something like the following:

    std::cout << u8"Α-Ωα-ω\n";

Of course, on Windows not only is the u8"..." string literal syntax not yet supported, but even if you build a std::string that holds a proper UTF-8 encoding, sending it to std::cout won't work either.

It turns out that I can set the cmd.exe console encoding to UTF-8, and that works great if I use printf for my string, but std::cout still doesn't work. To top things off, Microsoft decided to explicitly disallow UTF-8 in their std::locale implementation, so I can't tell std::cout to send things to the console as UTF-8. Instead, it appears that I will need to use printf or find some other obscure way of outputting Unicode in my console-based unit test application.
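For what it's worth, here is a minimal sketch of the printf route described above. It assumes the console font can display Greek, and it assumes the code page is switched with the Win32 call SetConsoleOutputCP(CP_UTF8), which is only one way of setting the console encoding. The helper names greek_sample_utf8 and print_utf8 are made up for this example, and the bytes are spelled out as hex escapes so the code does not depend on u8"..." support.

```cpp
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>  // SetConsoleOutputCP, CP_UTF8
#endif

// Hypothetical helper: the text "Α-Ωα-ω\n" written as explicit UTF-8 bytes,
// so no u8"..." literal support is required from the compiler.
std::string greek_sample_utf8() {
    // Α = CE 91, Ω = CE A9, α = CE B1, ω = CF 89
    return "\xCE\x91-\xCE\xA9\xCE\xB1-\xCF\x89\n";
}

// Hypothetical helper: switch the Windows console to UTF-8 (assumption: this
// stands in for whatever method you use to set the encoding), then write the
// bytes with printf rather than std::cout.
void print_utf8(const std::string& text) {
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);
#endif
    std::printf("%s", text.c_str());
}
```

Calling print_utf8(greek_sample_utf8()) from main is all the usage there is. On Mac and Ubuntu the #ifdef block compiles away and the bytes go straight to the terminal.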

I'm not sure what this means, but it does give hope to those worried that the machines will take over. It will likely take them several decades to figure out the mess we have made with software.

