mikeash.com pyblog/friday-qa-2010-02-19-character-encodings.html comments

Michael - 2010-12-27 16:21:28

Mon, 27 Dec 2010 16:21:28 GMT

Fantastic article Mike! It really explains things very clearly and I've learnt a lot.

Alex Pretzlav - 2010-02-28 19:35:37

Sun, 28 Feb 2010 19:35:37 GMT

I would like to point out one other Unicode subtlety which is particularly relevant to OS X. HFS+ requires all filenames (stored as UTF-16) to use decomposed characters. Unicode allows the same character to be represented by more than one sequence of code points. To quote Apple's HFS+ technote:

"The character 'é' can be represented as the single Unicode character u+00E9 (latin small letter e with acute), or as the two Unicode characters u+0065 and u+0301 (the letter 'e' plus a combining acute symbol)."
http://developer.apple.com/mac/library/technotes/tn/tn1150.html#UnicodeSubtleties

Storing text in the former form is called "precomposed" and the latter is "decomposed". Other filesystems, like Linux's ext3, don't have this limitation. I've run into issues where Finder exhibits bizarre behavior when trying to read NFS or Samba network filesystems which use precomposed characters. More info here:
http://en.wikipedia.org/wiki/Unicode_normalization#Errors_due_to_normalization_differences
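
To make the difference concrete, here is a minimal sketch in plain C (which also compiles as Objective-C). It encodes the two forms of 'é' from the technote quote and shows that they are different byte sequences, so a byte-for-byte filename comparison treats them as different names. The helper names (utf8_encode, utf8_encode_sequence, same_utf8_bytes) are made up for this illustration, and it uses UTF-8 rather than the UTF-16 that HFS+ stores, purely to keep the example short. It performs no normalization itself.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes needed).
   Returns the number of bytes written, or 0 for an invalid code point. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates are not encodable characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Encode a sequence of code points; out needs 4 * count bytes. */
static size_t utf8_encode_sequence(const uint32_t *cps, size_t count,
                                   unsigned char *out)
{
    size_t total = 0;
    for (size_t i = 0; i < count; i++)
        total += utf8_encode(cps[i], out + total);
    return total;
}

/* Returns 1 if both code point sequences encode to identical bytes.
   Assumes each sequence has at most 16 code points. */
static int same_utf8_bytes(const uint32_t *a, size_t na,
                           const uint32_t *b, size_t nb)
{
    unsigned char bufa[64], bufb[64];
    size_t la = utf8_encode_sequence(a, na, bufa);
    size_t lb = utf8_encode_sequence(b, nb, bufb);
    return la == lb && memcmp(bufa, bufb, la) == 0;
}
```

Calling same_utf8_bytes with { 0x00E9 } and { 0x0065, 0x0301 } returns 0: the precomposed form encodes as C3 A9 and the decomposed form as 65 CC 81. Both sides need to be normalized to the same form before comparing. In Cocoa, NSString's decomposedStringWithCanonicalMapping and precomposedStringWithCanonicalMapping exist for that.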

Jean-Daniel Dupas - 2010-02-25 18:21:23

Thu, 25 Feb 2010 18:21:23 GMT

"libicucore ships with both Mac OS X and iPhone OS (though the latter does not ship the headers)"

Neither does the former. libicucore is considered an SPI on Mac OS X, but you can use it with some care:

http://lists.apple.com/archives/xcode-users/2005/Jun/msg00633.html

Hamish - 2010-02-25 14:10:32

Thu, 25 Feb 2010 14:10:32 GMT

Thanks Mike and Jordy!

I don't know about converting an ICU encoding into an NSStringEncoding. Perhaps via CFStringEncoding (CFStringConvertIANACharSetNameToEncoding() + CFStringConvertEncodingToNSStringEncoding())? But it may be better to use ucsdet_getUChars() because NSStringEncoding only seems to cover the more popular encodings. (There may be better coverage in CFStringEncoding.)

Jordy/Jediknil - 2010-02-24 08:59:28

Wed, 24 Feb 2010 08:59:28 GMT

@Hamish: Cool! Also, don't forget to close (release) the charsetDetector on failure!

Is there an easy way to turn an ICU encoding into an NSStringEncoding? Or would it be better to read the string using ICU's ucsdet_getUChars()?

mikeash - 2010-02-24 05:37:27

Wed, 24 Feb 2010 05:37:27 GMT

Nice function. Completely unrelated to your main point, but don't use dataWithContentsOfMappedFile: unless you know that the file is on the boot drive. A forcible unmounting of the drive containing the file can cause your program to segfault otherwise.

Hamish - 2010-02-23 22:56:25

Tue, 23 Feb 2010 22:56:25 GMT

libicucore ships with both Mac OS X and iPhone OS (though the latter does not ship the headers):



#define UOnFailReturnNil(errorCode) if (U_FAILURE(errorCode)) { NSLog(@"%s (%d): %s", __PRETTY_FUNCTION__, __LINE__, u_errorName(errorCode)); return nil; }

- (NSString *)charsetForTextFileAtPath:(NSString *)path
{
    UErrorCode errorCode = U_ZERO_ERROR;

    UCharsetDetector *charsetDetector = ucsdet_open(&errorCode);
    UOnFailReturnNil(errorCode);

    NSData *characterData = [NSData dataWithContentsOfMappedFile:path];

    ucsdet_setText(charsetDetector, [characterData bytes], [characterData length], &errorCode);
    UOnFailReturnNil(errorCode);

    const UCharsetMatch *bestMatch = ucsdet_detect(charsetDetector, &errorCode);
    UOnFailReturnNil(errorCode);

    const char *encodingName = ucsdet_getName(bestMatch, &errorCode);
    UOnFailReturnNil(errorCode);

    NSString *encodingNameString = [NSString stringWithUTF8String:encodingName];
    ucsdet_close(charsetDetector);

    return encodingNameString;
}

mikeash - 2010-02-20 12:50:16

Sat, 20 Feb 2010 12:50:16 GMT

I seem to recall some Apple document saying that these methods were very basic: they just checked for a BOM and otherwise fell back to UTF-8, and that was it. But last time I looked, I couldn't find where it said that.

charles - 2010-02-20 04:50:45

Sat, 20 Feb 2010 04:50:45 GMT

Vasi:

+ (id)stringWithContentsOfURL:(NSURL *)url usedEncoding:(NSStringEncoding *)enc error:(NSError **)error

+ (id)stringWithContentsOfFile:(NSString *)path usedEncoding:(NSStringEncoding *)enc error:(NSError **)error

- (id)initWithContentsOfURL:(NSURL *)url usedEncoding:(NSStringEncoding *)enc error:(NSError **)error

- (id)initWithContentsOfFile:(NSString *)path usedEncoding:(NSStringEncoding *)enc error:(NSError **)error

Vasi - 2010-02-20 02:50:38

Sat, 20 Feb 2010 02:50:38 GMT

Some textual data will come with an initial byte-order mark (BOM), which despite its name also indicates character encoding. It is usually more likely to be present in longer streams of textual data, especially those from files. If you think you may receive data with a BOM, it may be worth checking for one. http://en.wikipedia.org/wiki/Byte-order_mark

It may be worth adding a BOM to any text that you store that may be read by another application. Don't do this for #! files, though: the kernel expects "#!" to be the very first bytes of the file, so they'll stop working. The NSString documentation doesn't seem to be clear on when it produces a BOM when converting to an encoding, which is annoying.

PS: What is this mysterious encoding-auto-detection NSString method? I can't find it for the life of me.
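
For anyone who wants to do the BOM check by hand, here is a rough sketch in plain C (it also compiles as Objective-C). The names bom_kind and detect_bom are invented for this example and are not part of any Apple or ICU API. It only looks at the first few bytes and makes no attempt to guess an encoding when no BOM is present.

```c
#include <stddef.h>
#include <string.h>

typedef enum {
    BOM_NONE,
    BOM_UTF8,
    BOM_UTF16_BE,
    BOM_UTF16_LE,
    BOM_UTF32_BE,
    BOM_UTF32_LE
} bom_kind;

/* Returns which BOM, if any, starts the buffer. If bom_length is not NULL,
   it receives the number of bytes to skip (0 when there is no BOM). */
static bom_kind detect_bom(const unsigned char *bytes, size_t length,
                           size_t *bom_length)
{
    /* UTF-32 is checked before UTF-16 because the UTF-32LE BOM
       (FF FE 00 00) begins with the UTF-16LE BOM (FF FE). This means
       UTF-16LE text starting with U+0000 is reported as UTF-32LE. */
    static const struct {
        bom_kind kind;
        size_t length;
        unsigned char bytes[4];
    } table[] = {
        { BOM_UTF32_BE, 4, { 0x00, 0x00, 0xFE, 0xFF } },
        { BOM_UTF32_LE, 4, { 0xFF, 0xFE, 0x00, 0x00 } },
        { BOM_UTF8,     3, { 0xEF, 0xBB, 0xBF } },
        { BOM_UTF16_BE, 2, { 0xFE, 0xFF } },
        { BOM_UTF16_LE, 2, { 0xFF, 0xFE } },
    };

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (length >= table[i].length &&
            memcmp(bytes, table[i].bytes, table[i].length) == 0) {
            if (bom_length)
                *bom_length = table[i].length;
            return table[i].kind;
        }
    }

    if (bom_length)
        *bom_length = 0;
    return BOM_NONE;
}
```

With an NSData you would pass [data bytes] and [data length], then decode the remainder starting at the returned offset using whatever encoding the BOM indicates.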

mikeash - 2010-02-20 00:03:20

Sat, 20 Feb 2010 00:03:20 GMT

It all depends on your needs in any given situation.

If you anticipate receiving arbitrary text from crazy sources and you want a best shot at picking the correct encoding, then something like Universal Detector will be a good choice, as long as you accept the extra complexity.

For a simpler situation where you just need a failsafe, it can be better to use a simpler scheme. It's easier to debug when the results aren't what you expect, for one. Trying to be too clever can lead to bad results, like this:

http://en.wikipedia.org/wiki/Bush_hid_the_facts
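
To see why that particular string trips up a detector, here is a small sketch in plain C. The heuristic is deliberately naive and is my own invention for illustration, not the check Windows actually performs: it pairs up the bytes as UTF-16LE code units and reports a match if every unit falls in the CJK Unified Ideographs block (U+4E00 to U+9FFF).

```c
#include <stddef.h>
#include <stdint.h>

/* Combine two bytes into one little-endian UTF-16 code unit. */
static uint16_t utf16le_unit(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* True if the code unit lies in the CJK Unified Ideographs block. */
static int is_cjk_ideograph(uint16_t unit)
{
    return unit >= 0x4E00 && unit <= 0x9FFF;
}

/* Hypothetical over-clever heuristic: claims "UTF-16LE Chinese text" when
   the byte count is even and every pair decodes to a CJK ideograph. */
static int naive_guess_is_utf16le_cjk(const unsigned char *bytes,
                                      size_t length)
{
    if (length == 0 || length % 2 != 0)
        return 0;
    for (size_t i = 0; i < length; i += 2) {
        if (!is_cjk_ideograph(utf16le_unit(bytes + i)))
            return 0;
    }
    return 1;
}
```

The 18 ASCII bytes of "Bush hid the facts" pass this test: the first pair, 'B' (0x42) and 'u' (0x75), becomes U+7542, and the other eight pairs also land inside the ideograph range. Plain English text is therefore a perfectly plausible UTF-16LE string as far as this kind of check can tell.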

Finally, if you really have tight control it can be best not to try to fail safe at all, but to just try the one encoding that's mandated (which I hope is UTF-8) and immediately fail if your data isn't valid, rather than trying to persevere. It all depends on just what you're trying to do.
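
As a sketch of the strict approach, the function below (plain C, also valid Objective-C) accepts a buffer only if it is well-formed UTF-8 and rejects it otherwise, with no fallback. The name is_valid_utf8 is made up for this example. In Cocoa you would more likely get the same effect by checking whether initWithData:encoding: returns nil for NSUTF8StringEncoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the buffer is well-formed UTF-8, 0 otherwise.
   Rejects stray continuation bytes, truncated sequences, overlong
   encodings, surrogates, and code points above U+10FFFF. */
static int is_valid_utf8(const unsigned char *s, size_t length)
{
    size_t i = 0;

    while (i < length) {
        unsigned char lead = s[i];
        size_t extra;   /* number of continuation bytes expected */
        uint32_t cp;    /* code point being assembled */
        uint32_t min;   /* smallest code point allowed for this length */

        if (lead < 0x80) {
            i++;
            continue;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1; cp = lead & 0x1F; min = 0x80;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2; cp = lead & 0x0F; min = 0x800;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3; cp = lead & 0x07; min = 0x10000;
        } else {
            return 0; /* continuation byte or invalid lead byte */
        }

        if (length - i - 1 < extra)
            return 0; /* sequence runs past the end of the buffer */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80)
                return 0; /* not a continuation byte */
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }

        if (cp < min)
            return 0; /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return 0; /* beyond the Unicode range */

        i += extra + 1;
    }

    return 1;
}
```

Data in Latin-1 or another legacy encoding that contains any non-ASCII bytes will almost always fail this check, which is what makes "try UTF-8 and fail immediately" a workable policy.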

astrange - 2010-02-19 23:06:20

Fri, 19 Feb 2010 23:06:20 GMT

Um, how about using Universal Detector instead?

http://code.google.com/p/theunarchiver/source/browse/#svn/trunk/UniversalDetector

It uses the Mozilla universalchardet library, which I think has been completely rewritten since then but somehow hasn't gained any more functionality.

There's an autodetection method added to NSString in 10.6 as well, but somehow I doubt it works that well.