D Unicode string literals: can't print specific Unicode character

Question

I'm just trying to pick up D having come from C++. I'm sure it's something very basic, but I can't find any documentation to help me. I'm trying to print the character à, which is U+00E0. I am trying to assign this character to a variable and then use write() to output it to the console.

I'm told by this website that U+00E0 is encoded as 0xC3 0xA0 in UTF-8, 0x00E0 in UTF-16 and 0x000000E0 in UTF-32.

Note that for everything I've tried, I've tried replacing string with char[] and wstring with wchar[]. I've also tried with and without the w or d suffixes after wide strings.

These methods return the compiler error, "Invalid trailing code unit":

string str = "à";
wstring str = "à"w;
dstring str = "à"d;

These methods print a totally different character (Ò U+00D2):

string str = "\xE0";
string str = hexString!"E0";

And all these methods print what looks like ˧á (note á ≠ à!), which is UTF-16 0x2E7 0x00E1:

string str = "\xC3\xA0";
wstring str = "\u00E0"w;
dstring str = "\U000000E0"d;

Any ideas?

unicode

d

unicode-string

unicode-escapes

asked on Stack Overflow Nov 23, 2018 by

Joe C • edited Nov 23, 2018 by

0xdd

2 Answers

I confirmed it works on my Windows box, so gonna type this up as an answer now.

In the source code, if you copy/paste the characters directly, make sure your editor is saving the file in UTF-8 encoding. The D compiler insists on it, so if it gives a compile error about a UTF thing, that's probably why. I have never used c:b (Code::Blocks), but an old answer on the web said edit->encodings... Either way, it is a setting somewhere in the editor.

Or, you can replace the characters in your source code with \uxxxx in the strings. Do NOT use the hexstring thing, that is for binary bytes, but your example of "\u00E0" is good, and will work for any type of string (not just wstring like in your example).
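To illustrate the point that the same \u escape works for every string type, here is a minimal sketch. The helper name codeUnitCounts is made up for this example; it only relies on the fact that an unsuffixed literal converts implicitly to string, wstring, or dstring, and that .length counts code units, not characters.

```d
import std.stdio;

// Hypothetical helper: code-unit counts of the same character, U+00E0,
// stored as UTF-8, UTF-16 and UTF-32.
size_t[3] codeUnitCounts()
{
    string  s = "\u00E0"; // UTF-8: two code units (0xC3 0xA0)
    wstring w = "\u00E0"; // UTF-16: one code unit (0x00E0)
    dstring d = "\u00E0"; // UTF-32: one code unit (0x000000E0)
    return [s.length, w.length, d.length];
}

void main()
{
    // The escape is identical in all three cases; only the storage differs.
    writeln(codeUnitCounts());
}
```

The compiler does the encoding for you, so there is no need to spell out the UTF-8 bytes by hand or to pick a different escape per string type.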

Then, on the output side, it depends on your target because the program just outputs bytes, and it is up to the recipient program to interpret it correctly. Since you said you are on Windows, the key is to set the console code page to utf-8 so it knows what you are trying to do. Indeed, the same C function can be called from D too. Leading to this program:

import core.sys.windows.windows;
import std.stdio;

void main() {
    SetConsoleOutputCP(65001);
    writeln("Hi \u00E0");
}

printing it successfully. On older Windows versions, you might need to change your font to see the character too (as opposed to the generic box it shows because some fonts don't have all the characters), but on my Windows 10 box, it just worked with the default font.

BTW, technically the console code page is a shared setting (after the program runs and exits, you can still hit properties on your console window and see the change reflected there), and you should perhaps set it back when your program exits. You could get the current value at startup with the get function ( https://docs.microsoft.com/en-us/windows/console/getconsoleoutputcp ), store it in a local var, and set it back on exit. You could do auto ccp = GetConsoleOutputCP(); SetConsoleOutputCP(65001); scope(exit) SetConsoleOutputCP(ccp); right at startup - the scope exit will run when the function exits, so doing it in main would be kinda convenient. Just add some error checking if you want.
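Spelled out as a full program, the save-and-restore idea might look like the sketch below. It is only a sketch: the version (Windows) guards and the error message are my additions so the file still compiles elsewhere, and the only Windows calls used are GetConsoleOutputCP and SetConsoleOutputCP.

```d
import std.stdio;

version (Windows)
{
    import core.sys.windows.windows;
}

void main()
{
    // A version block does not open a new scope, so the scope(exit)
    // below belongs to main and fires when main returns.
    version (Windows)
    {
        // Remember the code page the console had before we touched it.
        immutable previous = GetConsoleOutputCP();

        // 65001 is the UTF-8 code page; the call returns 0 on failure.
        if (SetConsoleOutputCP(65001) == 0)
            stderr.writeln("could not switch the console to UTF-8");

        // Put the old code page back on the way out, even after an exception.
        scope (exit) SetConsoleOutputCP(previous);
    }

    writeln("Hi \u00E0");
}
```

Whether restoring is worth it is a judgment call; the upside is that the console is left the way the user had it.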

The Microsoft docs don't say anything about setting it back, so it probably doesn't actually matter, but still I wanna mention it just in case. But also the knowledge that it is shared and persists can help in debugging - if it still works after you comment the call out, it isn't because the code isn't necessary, it is just because the code page was set previously and not unset yet!

Note that running it from an IDE might not be exactly the same, because IDEs often pipe the output instead of running it right out to the Windows console. If that happens, lemme know and we can type up some stuff about that for future readers too. But you can also open your own copy of the console (run the program outside the IDE) and it should show correctly for you.

answered on Stack Overflow Nov 25, 2018 by

Adam D. Ruppe

D source code needs to be encoded as UTF-8. My guess is that you're putting a UTF-16 character into the UTF-8 source file.

E.g.

import std.stdio;
void main() {
    writeln(cast(char)0xC3, cast(char)0xA0);
}

Will output as UTF-8 the character you seek.

Which you can then hard code like so:

import std.stdio;
void main() {
    string str = "à";
    writeln(str);
}
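If you want to check what bytes actually end up in a string, a small sketch like this can help. The helper isValidUtf8 is a made-up name for this example; representation (std.string) and validate (std.utf) are the Phobos functions it leans on.

```d
import std.stdio;
import std.string : representation;
import std.utf : validate, UTFException;

// Hypothetical helper: true if the string holds well-formed UTF-8.
bool isValidUtf8(string s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string escaped  = "\u00E0";   // the compiler encodes U+00E0 as UTF-8
    string rawBytes = "\xC3\xA0"; // the same two bytes written by hand
    string lone     = "\xE0";     // one byte, not a complete UTF-8 sequence

    writefln("%(%02X %)", escaped.representation); // the stored bytes in hex
    writeln(escaped == rawBytes);                  // true: identical bytes
    writeln(isValidUtf8(lone));                    // false: 0xE0 alone is malformed
}
```

This would explain why "\xE0" shows up as some unrelated character: the program emits a single stray byte and the console interprets it in whatever code page it happens to be using.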

answered on Stack Overflow Nov 23, 2018 by

Richard Andrew Cattermole

User contributions licensed under CC BY-SA 3.0