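The problem definition explicitly states that the 8-bit character encoding is UTF-8. That makes this a trivial problem: converting from one UTF encoding to another takes nothing more than a little bit-twiddling.
Just look at the encodings described on the Wikipedia pages for UTF-8, UTF-16, and UTF-32.
The principle is simple: walk through the input and assemble a 32-bit Unicode code point according to one UTF encoding, then emit that code point according to the other. The code points themselves need no translation, which any other character encoding would require. That is what makes this a simple problem.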
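To make the "assemble a code point, then emit it" idea concrete, here is a minimal sketch of the UTF-16 half of that round trip. The helper names `to_utf16_units` and `from_surrogates` are made up for illustration, and the sketch assumes the input is a valid code point (no validation is attempted).

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: emit one code point as UTF-16 code units.
// Code points above U+FFFF are split into a high/low surrogate pair.
std::vector<std::uint16_t> to_utf16_units(std::uint32_t cp)
{
    if (cp <= 0xffff)
        return { static_cast<std::uint16_t>(cp) };
    cp -= 0x10000;  // 20 significant bits remain
    return { static_cast<std::uint16_t>(0xd800 + (cp >> 10)),      // upper 10 bits
             static_cast<std::uint16_t>(0xdc00 + (cp & 0x03ff)) }; // lower 10 bits
}

// Hypothetical helper: reassemble the code point from a surrogate pair.
std::uint32_t from_surrogates(std::uint16_t hi, std::uint16_t lo)
{
    return 0x10000
         + ((static_cast<std::uint32_t>(hi) - 0xd800) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xdc00);
}
```

For example, U+1F600 splits into the pair 0xD83D, 0xDE00, and feeding that pair back into `from_surrogates` yields 0x1F600 again.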
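Here's a quick implementation of wchar_t to UTF-8 conversion and vice versa. It assumes that the input is already properly encoded; the old saying "garbage in, garbage out" applies here. I believe that verifying the encoding is best done as a separate step.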
    #include <string>

    std::string wchar_to_UTF8(const wchar_t * in)
    {
        std::string out;
        unsigned int codepoint = 0;
        for (; *in != 0; ++in)
        {
            if (*in >= 0xd800 && *in <= 0xdbff)
                // High surrogate: keep the upper 10 bits and wait for the low surrogate.
                codepoint = ((*in - 0xd800) << 10) + 0x10000;
            else
            {
                if (*in >= 0xdc00 && *in <= 0xdfff)
                    // Low surrogate: fill in the lower 10 bits.
                    codepoint |= *in - 0xdc00;
                else
                    codepoint = *in;

                // Emit the code point as 1 to 4 UTF-8 bytes.
                if (codepoint <= 0x7f)
                    out.append(1, static_cast<char>(codepoint));
                else if (codepoint <= 0x7ff)
                {
                    out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
                    out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
                }
                else if (codepoint <= 0xffff)
                {
                    out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
                    out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                    out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
                }
                else
                {
                    out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
                    out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
                    out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                    out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
                }
                codepoint = 0;
            }
        }
        return out;
    }
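The above code works for both UTF-16 and UTF-32 input, simply because the values in the range 0xd800 through 0xdfff are not valid code points; seeing one indicates that you're decoding UTF-16. If you know that wchar_t is 32 bits, you could remove some code to optimize the function.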
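As a rough illustration of what remains once the surrogate handling is dropped, here is a sketch of an encoder for input known to be UTF-32. The function name `utf32_to_UTF8` and the use of `char32_t` instead of `wchar_t` are assumptions made for this example; like the code above, it assumes the input is already valid.

```cpp
#include <string>

// Hypothetical variant: every input unit is already a complete code point,
// so no surrogate state has to be carried between iterations.
std::string utf32_to_UTF8(const char32_t * in)
{
    std::string out;
    for (; *in != 0; ++in)
    {
        const char32_t codepoint = *in;
        if (codepoint <= 0x7f)
            out.append(1, static_cast<char>(codepoint));
        else if (codepoint <= 0x7ff)
        {
            out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
            out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
        }
        else if (codepoint <= 0xffff)
        {
            out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
            out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
            out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
        }
        else
        {
            out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
            out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
            out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
            out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
        }
    }
    return out;
}
```

Called with `U"\u20AC"` (the euro sign), this produces the three bytes E2 82 AC.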
    std::wstring UTF8_to_wchar(const char * in)
    {
        std::wstring out;
        unsigned int codepoint = 0;
        while (*in != 0)
        {
            unsigned char ch = static_cast<unsigned char>(*in);
            if (ch <= 0x7f)
                codepoint = ch;                              // single-byte (ASCII)
            else if (ch <= 0xbf)
                codepoint = (codepoint << 6) | (ch & 0x3f);  // continuation byte
            else if (ch <= 0xdf)
                codepoint = ch & 0x1f;                       // lead byte of a 2-byte sequence
            else if (ch <= 0xef)
                codepoint = ch & 0x0f;                       // lead byte of a 3-byte sequence
            else
                codepoint = ch & 0x07;                       // lead byte of a 4-byte sequence
            ++in;
            // The code point is complete once the next byte is not a continuation byte.
            if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
            {
                if (sizeof(wchar_t) > 2)
                    out.append(1, static_cast<wchar_t>(codepoint));
                else if (codepoint > 0xffff)
                {
                    // Surrogates encode the 20-bit offset from 0x10000, not the raw code point.
                    codepoint -= 0x10000;
                    out.append(1, static_cast<wchar_t>(0xd800 + (codepoint >> 10)));
                    out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
                }
                else if (codepoint < 0xd800 || codepoint >= 0xe000)
                    out.append(1, static_cast<wchar_t>(codepoint));
            }
        }
        return out;
    }
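Again, if you know that wchar_t is 32 bits you could remove some code from this function, but in this case it shouldn't make any difference. The expression sizeof(wchar_t) > 2 is known at compile time, so any decent compiler will recognize the dead code and remove it.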
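If you would rather not rely on the optimizer, one option is to make the compile-time choice explicit. The sketch below assumes a C++17 compiler (for `if constexpr`) and uses a hypothetical helper named `append_codepoint`; it is one possible way to factor out the emit step, not part of the code above.

```cpp
#include <string>

// Hypothetical helper: append one code point to a wide string, choosing
// UTF-32 or UTF-16 output at compile time based on the size of wchar_t.
inline void append_codepoint(std::wstring & out, unsigned int codepoint)
{
    if constexpr (sizeof(wchar_t) > 2)
    {
        // 32-bit wchar_t: store the code point directly.
        out.append(1, static_cast<wchar_t>(codepoint));
    }
    else
    {
        if (codepoint > 0xffff)
        {
            // 16-bit wchar_t: split into a surrogate pair.
            codepoint -= 0x10000;
            out.append(1, static_cast<wchar_t>(0xd800 + (codepoint >> 10)));
            out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
        }
        else if (codepoint < 0xd800 || codepoint >= 0xe000)
            out.append(1, static_cast<wchar_t>(codepoint));
    }
}
```

With `if constexpr` the branch that does not apply is discarded during compilation rather than left for the optimizer, though in practice the generated code is likely to be the same either way.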