Compare a Unicode std::string with an ordinary "" literal or a u8"" declaration


On Windows with Visual Studio 2015

    // Ü
    //    UTF-8  (hex) 0xC3 0x9C 
    //    UTF-16 (hex) 0x00DC 
    //    UTF-32 (hex) 0x000000DC 

    using namespace std::string_literals;
    const auto narrow_multibyte_string_s = "\u00dc"s;
    const auto wide_string_s             = L"\u00dc"s;
    const auto utf8_encoded_string_s     = u8"\u00dc"s;
    const auto utf16_encoded_string_s    = u"\u00dc"s;
    const auto utf32_encoded_string_s    = U"\u00dc"s;

    assert(utf8_encoded_string_s     == "\xC3\x9C");
    assert(narrow_multibyte_string_s ==        "Ü");
    assert(utf8_encoded_string_s     ==      u8"Ü");

    // here is the question
    assert(utf8_encoded_string_s != narrow_multibyte_string_s);

"\u00dc"s is not the same as u8"\u00dc"s or "Ü"s is not the same as u8"Ü"s

Apparently the default encoding for an ordinary string literal is not UTF-8 (UTF-16, perhaps?), and I cannot just compare two std::string objects without knowing their encoding, even when they represent the same text.

What is the recommended practice for performing such string comparisons in Unicode-enabled C++ application development?

For example, an API like this:

    #include <string>

    class MyDatabase
    {
    public:
        bool isAvailable(const std::string& key)
        {
            // *compare* key against the keys stored in the database
            if (key == "Ü")
                return true;
            else
                return false;
        }
    };

Other programs may call isAvailable with a std::string in UTF-8 or in the default (UTF-16?) encoding. How can I guarantee that the comparison is done properly?

Can I detect an encoding mismatch at compile time?

Note: I prefer C++11/14 features, and std::string over std::wstring.

c++11
unicode
utf-8
stdstring
string-literals
asked on Stack Overflow Dec 15, 2016 by rnd_nr_gen

1 Answer


"\u00dc" is a char[] encoded in whatever the compiler/OS's default 8-bit encoding happens to be, so it can be different on different machines. On Windows, that tends to be the OS's default Ansi encoding, or it could be the encoding that the source file is saved as.

L"\u00dc" is a wchar_t[] encoded with either UTF-16 or UTF-32, depending on the compiler's definition of wchar_t (which is 16-bit on Windows, so UTF-16).

u8"\u00dc" is a char[] encoded in UTF-8.

u"\u00dc" is a char16_t[] encoded in UTF-16.

U"\u00dc" is a char32_t[] encoded in UTF-32.

The ""s suffix simply returns a std::string, std::wstring, std::u16string, or std::u32string, depending on whether a char[], wchar_t[], char16_t[], or char32_t[] is passed to it.

When comparing two strings, make sure they are in the same encoding first. This is especially important for your char[]/std::string data, as it could be in any number of 8-bit encodings, depending on the systems involved. This is not so much a problem if the app is generating the strings itself, but it is important if one or more of the strings is coming from an external source (file, user input, network protocol, etc).
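
One way to get two differently encoded strings into the same encoding before comparing is the C++11 std::wstring_convert facility (shown here as a sketch; it was deprecated later in C++17, but it fits the C++11/14 requirement):

    #include <cassert>
    #include <codecvt>
    #include <locale>
    #include <string>

    int main()
    {
        using namespace std::string_literals;

        const auto utf16_string = u"\u00dc"s;   // std::u16string, UTF-16
        const auto utf8_string  = u8"\u00dc"s;  // std::string,    UTF-8

        // Convert the UTF-16 string to UTF-8, then compare in one known encoding.
        std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
        assert(converter.to_bytes(utf16_string) == utf8_string);
    }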

In your example, "\u00dc" and "Ü" are not necessarily guaranteed to produce the same char[] sequence, depending on how the compiler interprets those different literals. But even if they did (which seems to be the case in your example), neither of them will likely produce UTF-8 (you have to go to extra measures to force that), which is why your comparison to utf8_encoded_string_s fails.
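
You can see what the compiler actually emitted by dumping the raw bytes of each literal (dump_bytes is just an illustrative helper; the expected output assumes a Western-European Windows machine whose ANSI code page is 1252):

    #include <cstdio>
    #include <string>

    // Print every byte of the string in hex so the underlying encoding is visible.
    void dump_bytes(const std::string& s)
    {
        for (unsigned char c : s)
            std::printf("%02X ", c);
        std::printf("\n");
    }

    int main()
    {
        dump_bytes("\u00dc");   // e.g. "DC"    (ANSI code page 1252)
        dump_bytes(u8"\u00dc"); //      "C3 9C" (UTF-8)
    }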

So, if you are expecting a string literal to be UTF-8, use u8"" to ensure that. If you are getting string data from an external source and need it to be in UTF-8, convert it to UTF-8 in code as soon as possible, if it is not already (which means you have to know the encoding used by the external source).
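
For example, on Windows a string that arrives in the OS's ANSI code page can be converted to UTF-8 with the Win32 API before any comparison (ansi_to_utf8 is just an illustrative name, and error handling is omitted):

    #include <string>
    #include <windows.h>

    // Convert an ANSI (CP_ACP) string to UTF-8 by going through UTF-16.
    std::string ansi_to_utf8(const std::string& ansi)
    {
        if (ansi.empty())
            return std::string();

        // ANSI -> UTF-16
        int wideLen = MultiByteToWideChar(CP_ACP, 0, ansi.data(), (int)ansi.size(), nullptr, 0);
        std::wstring wide(wideLen, L'\0');
        MultiByteToWideChar(CP_ACP, 0, ansi.data(), (int)ansi.size(), &wide[0], wideLen);

        // UTF-16 -> UTF-8
        int utf8Len = WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(), nullptr, 0, nullptr, nullptr);
        std::string utf8(utf8Len, '\0');
        WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(), &utf8[0], utf8Len, nullptr, nullptr);
        return utf8;
    }

Once the key has been normalized that way, MyDatabase::isAvailable can compare it against a u8"Ü" literal and the result no longer depends on the caller's code page.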

answered on Stack Overflow Dec 15, 2016 by Remy Lebeau • edited Dec 15, 2016 by Remy Lebeau

User contributions licensed under CC BY-SA 3.0