Various fixes for converting cp1252, cp1251, cp1250 characters to and from utf-8#1288
Lord-Nightmare wants to merge 3 commits into nwnxee:master
… properly
Fixed cp1251 characters from 0x80-0xbf being inconsistently converted to unicode
Fixed utf-8 output of any unicode character whose unicode character code was above 0x7ff
|
At the code-reading level this looks good, but I still need to test it somehow to make sure the mappings are correct. By the way, I'm wondering whether it would be easier to just use a [128] array for every codepage and simplify the code. |
|
Checked the cp1251 logic – looks good. I haven't actually verified it, but the previously supported characters should keep working as they did. |
Doing this is definitely an option, and would simplify the FromUTF8 function as well. |
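To make the [128]-array idea concrete, here is a minimal sketch, not the code in this PR. It assumes one table per codepage covering bytes 0x80-0xff, with 0 marking bytes the codepage leaves unassigned. The names `HighHalfTable`, `BuildCp1252Table`, `ToUnicode` and `FromUnicode` are hypothetical and only serve the illustration; the cp1252 values are the standard Windows-1252 assignments.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical type: one entry per byte 0x80-0xff; 0 means "unassigned".
using HighHalfTable = std::array<std::uint16_t, 128>;

// Builds the cp1252 table: 0x80-0x9f are special, 0xa0-0xff map to themselves.
HighHalfTable BuildCp1252Table()
{
    static const std::uint16_t special[32] = {
        0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
        0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178};

    HighHalfTable table{};
    for (std::size_t i = 0; i < 32; ++i)
        table[i] = special[i];
    for (std::size_t i = 32; i < table.size(); ++i)
        table[i] = static_cast<std::uint16_t>(0x80 + i);
    return table;
}

// Codepage byte -> unicode character code (0 for an unassigned byte >= 0x80).
std::uint32_t ToUnicode(const HighHalfTable& table, std::uint8_t byte)
{
    return byte < 0x80 ? byte : table[byte - 0x80];
}

// Unicode character code -> codepage byte, or -1 if the codepage lacks it.
// A linear search over the whole table is enough for 128 entries.
int FromUnicode(const HighHalfTable& table, std::uint32_t code)
{
    if (code < 0x80)
        return static_cast<int>(code);
    for (std::size_t i = 0; i < table.size(); ++i)
        if (table[i] != 0 && table[i] == code)
            return static_cast<int>(0x80 + i);
    return -1;
}
```

With this layout, supporting cp1250 or cp1251 would only mean supplying another table; both conversion functions stay the same.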
|
This also makes me wonder a few things: Also, should I add support in the ToUTF8 function for outputting unicode characters above 0xFFFF (i.e. 4-byte utf-8, for unicode character codes 0x10000 thru 0x10FFFF)? This would allow (in theory) some characters to be rendered as emojis etc. |
I'll try to explain it (in the patterns below, the x bits hold the unicode character code, most significant bits first):
A utf-8 character with unicode character code <= 0x7f is just sent as 0x00-0x7f (0b00000000 thru 0b01111111)
A utf-8 character with unicode character code >= 0x80 and <= 0x7ff is broken into two bytes, sent one after the other: 0b110xxxxx 0b10xxxxxx
A utf-8 character with unicode character code >= 0x800 and <= 0xffff is broken into three bytes, sent one after the other: 0b1110xxxx 0b10xxxxxx 0b10xxxxxx
A utf-8 character with unicode character code >= 0x10000 and <= 0x10ffff (in theory this could be as high as 0x1fffff but unicode defined the max range as 0x10ffff for compatibility with UTF-16) is broken into four bytes, sent one after the other: 0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx
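The four ranges described here can be sketched as a single encoding routine. This is an illustrative sketch rather than the ToUTF8 code in this PR; `EncodeCodePoint` is a hypothetical name. Returning an empty string for surrogates and out-of-range values is an assumption about how invalid input might be handled.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one unicode character code as utf-8.
// Returns an empty string for surrogates (0xD800-0xDFFF) and for
// anything above 0x10FFFF.
std::string EncodeCodePoint(std::uint32_t code)
{
    std::string out;
    if (code >= 0xD800 && code <= 0xDFFF)
        return out; // UTF-16 surrogates are not valid on their own

    if (code <= 0x7F) {
        // 1 byte: 0b0xxxxxxx
        out += static_cast<char>(code);
    } else if (code <= 0x7FF) {
        // 2 bytes: 0b110xxxxx 0b10xxxxxx
        out += static_cast<char>(0xC0 | (code >> 6));
        out += static_cast<char>(0x80 | (code & 0x3F));
    } else if (code <= 0xFFFF) {
        // 3 bytes: 0b1110xxxx 0b10xxxxxx 0b10xxxxxx
        out += static_cast<char>(0xE0 | (code >> 12));
        out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (code & 0x3F));
    } else if (code <= 0x10FFFF) {
        // 4 bytes: 0b11110xxx 0b10xxxxxx 0b10xxxxxx 0b10xxxxxx
        out += static_cast<char>(0xF0 | (code >> 18));
        out += static_cast<char>(0x80 | ((code >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (code & 0x3F));
    }
    return out;
}
```

For example, `EncodeCodePoint(0x2026)` (horizontal ellipsis) takes the three-byte branch and yields the bytes 0xE2 0x80 0xA6.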
|
|
No point in adding features that won't be used. As far as I know, none of cp1250, cp1251 and cp1252 have 4-byte UTF-8 symbols (do they even have 3-byte symbols?) and we don't support anything else. Same with codepoint overrides: they just add a lot of complexity that I doubt anyone will need. And if someone does need them, it's a simple isolated change to modify the mapping table, so they can just maintain it in their fork. I think it makes sense to move the ToUTF8 encoding to be a |
Yes, there are several 3-byte symbols already: 0x2026 (horizontal ellipsis), for instance, is what cp1252 0x85 translates to (it is used many times in dialog.tlk), and is a 3-byte utf-8 character: 0xE2 0x80 0xA6 |
…odepages, and fixed FromUTF8 to also search the entirety of these tables.
Fixed cp1252 characters from 0x80-0x9f not being converted to unicode properly
Fixed cp1251 characters from 0x80-0xbf being inconsistently converted to unicode
Fixed utf-8 output of any unicode character whose unicode character code was above 0x7ff