BUSH HID THE FACTS - WHY DOES IT HAPPEN?
For those of you using Windows, do the following:
- Open an empty Notepad file.
- Type “Bush hid the facts” (without the quotes).
- Save it as whatever you want.
- Close it, and re-open it.
Are you surprised by what you see??
Instead of your text, you get a row of boxes or Chinese characters. This is why it happens:
You see, text files containing Unicode (more correctly, UTF-16-encoded Unicode) are supposed to start with a “Byte-Order Mark” (BOM), which is a two-byte flag that tells a reader how the following UTF-16 data is encoded. Given that these two bytes are exceedingly unlikely to occur at the beginning of an ASCII text file, the BOM is commonly used to tell whether a text file is encoded in UTF-16.
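To make the BOM check concrete, here is a minimal sketch in Python. The function name utf16_bom is made up for this example, and this is only an illustration of the idea, not what Notepad actually does internally.

```python
import codecs

def utf16_bom(data: bytes):
    """Return the UTF-16 flavour announced by a leading BOM, or None.

    Hypothetical helper for illustration only.
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    return None                                # no BOM: the reader has to guess

print(utf16_bom(b"\xff\xfeB\x00u\x00"))       # BOM present
print(utf16_bom(b"Bush hid the facts"))       # plain ASCII, no BOM
```

When the function returns None, the reader is in exactly the situation described next: there is no marker, so it has to guess.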
But plenty of applications don’t bother writing this marker at the beginning of a UTF-16-encoded file. So what’s an app like Notepad to do?
Windows helpfully provides a function called IsTextUnicode(): you pass it some data, and it tells you whether it’s UTF-16-encoded or not.
It actually runs a couple of heuristics over the first 256 bytes of the data and provides its best guess. As it turns out, these tests aren’t terribly reliable for very short ASCII strings that contain an even number of lower-case letters, like “this app can break”, or more appropriately, “this api can break”. “Bush hid the facts” is one of these strings: its 18 ASCII bytes get read as nine two-byte UTF-16 characters.
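You can reproduce the garbling without Notepad. The sketch below (the helper name misread_as_utf16 is mine) takes the ASCII bytes and decodes them the way a reader would if it wrongly guessed little-endian UTF-16. It only imitates the wrong guess, not the heuristic that makes it.

```python
def misread_as_utf16(text: str) -> str:
    """Encode text as ASCII, then decode the same bytes as UTF-16 (little-endian).

    Hypothetical helper for illustration. The text must have an even
    number of characters, otherwise the last byte has no partner and
    Python raises UnicodeDecodeError.
    """
    raw = text.encode("ascii")        # one byte per character
    return raw.decode("utf-16-le")    # every two bytes become one code unit

garbled = misread_as_utf16("Bush hid the facts")
print(len(garbled))                               # half as many characters as bytes
print([f"U+{ord(ch):04X}" for ch in garbled])     # code points of the result
```

Each pair of ASCII letters lands on a code point like U+7542 (“B” is 0x42, “u” is 0x75), which falls in the CJK ideograph range. That is why you see Chinese characters, or boxes if your font doesn’t have them.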
The documentation for IsTextUnicode() says:
These tests are not foolproof. The statistical tests assume certain amounts of variation between low and high bytes in a string, and some ASCII strings can slip through. For example, if lpBuffer points to the ASCII string 0x41, 0x0A, 0x0D, 0x1D (A\n\r^Z), the string passes the IS_TEXT_UNICODE_STATISTICS test, though failure would be preferable.
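If you’re on Windows and want to poke at the function yourself, here is a rough sketch using Python’s ctypes. The wrapper name is_text_unicode is my own. I’m assuming the documented signature (a buffer, its size in bytes, and an optional pointer to the result flags) and passing NULL for the flags, which the documentation says runs all available tests. The answer you get may differ between Windows versions, so I’m not promising any particular result.

```python
import ctypes
import sys

def is_text_unicode(data: bytes):
    """Ask Windows whether `data` looks like UTF-16 text.

    Hypothetical wrapper for illustration. Returns True or False on
    Windows, and None on any other platform where the API is missing.
    """
    if sys.platform != "win32":
        return None
    advapi32 = ctypes.WinDLL("advapi32")          # IsTextUnicode lives in Advapi32.dll
    func = advapi32.IsTextUnicode
    func.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.POINTER(ctypes.c_int)]
    func.restype = ctypes.c_int                   # Win32 BOOL
    # None is passed as NULL for lpiResult: run every available test.
    return bool(func(data, len(data), None))

print(is_text_unicode(b"Bush hid the facts"))
print(is_text_unicode(b"this api can break"))
```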