<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"><channel><title>mikeash.com pyblog/friday-qa-2010-02-19-character-encodings.html comments</title><link>http://www.mikeash.com/?page=pyblog/friday-qa-2010-02-19-character-encodings.html#comments</link><description>mikeash.com Recent Comments</description><lastBuildDate>Sun, 10 May 2026 04:57:35 GMT</lastBuildDate><generator>PyRSS2Gen-1.0.0</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Michael - 2010-12-27 16:21:28</title><link>http://www.mikeash.com/?page=pyblog/friday-qa-2010-02-19-character-encodings.html#comments</link><description>Fantastic article Mike! It really explains things very clearly and I've learnt a lot.</description><guid isPermaLink="true">7cbd03f3ff53281cf2f7047e66bfa942</guid><pubDate>Mon, 27 Dec 2010 16:21:28 GMT</pubDate></item><item><title>Alex Pretzlav - 2010-02-28 19:35:37</title><link>http://www.mikeash.com/?page=pyblog/friday-qa-2010-02-19-character-encodings.html#comments</link><description>I would like to point out one other Unicode subtlety which is particularly relevant to OS X.  HFS+ requires all filenames (stored as UTF-16) to use decomposed characters.  Unicode allows multiple encodings of the same character sequence.  To quote Apple's HFS+ technote:
&lt;br /&gt;
&lt;br /&gt;"The character 'é' can be represented as the single Unicode character u+00E9 (latin small letter e with acute), or as the two Unicode characters u+0065 and u+0301 (the letter 'e' plus a combining acute symbol)."
&lt;br /&gt;&lt;a href="http://developer.apple.com/mac/library/technotes/tn/tn1150.html#UnicodeSubtleties"&gt;http://developer.apple.com/mac/library/technotes/tn/tn1150.html#UnicodeSubtleties&lt;/a&gt;
&lt;br /&gt;
&lt;br /&gt;Storing text in the former form is called "precomposed" and the latter is "decomposed".   Other filesystems, like Linux's ext3, don't have this limitation.  I've run into issues where Finder exhibits bizarre behavior when trying to read NFS or Samba network filesystems which use precomposed characters.  More info here:
&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/Unicode_normalization#Errors_due_to_normalization_differences"&gt;http://en.wikipedia.org/wiki/Unicode_normalization#Errors_due_to_normalization_differences&lt;/a&gt;</description><guid isPermaLink="true">53c9f0be15f7b58164694344c95910bf</guid><pubDate>Sun, 28 Feb 2010 19:35:37 GMT</pubDate></item><item><title>Jean-Daniel Dupas - 2010-02-25 18:21:23</title><link>http://www.mikeash.com/?page=pyblog/friday-qa-2010-02-19-character-encodings.html#comments</link><description>&lt;div class="blogcommentquote"&gt;&lt;div class="blogcommentquoteinner"&gt;libicucore ships with both Mac OS X and iPhone OS (though the latter does not ship the headers): &lt;/div&gt;&lt;/div&gt;
&lt;br /&gt;
&lt;br /&gt;Neither does the former.  libicucore is considered an SPI on Mac OS X, but you can use it with some care:
&lt;br /&gt;
&lt;br /&gt;&lt;a href="http://lists.apple.com/archives/xcode-users/2005/Jun/msg00633.html"&gt;http://lists.apple.com/archives/xcode-users/2005/Jun/msg00633.html&lt;/a&gt;
&lt;br /&gt;
&lt;br /&gt;</description><guid isPermaLink="true">fc91e4ad2cd5cc62660d52eb7963d15c</guid><pubDate>Thu, 25 Feb 2010 18:21:23 GMT</pubDate></item><item><title>Hamish - 2010-02-25 14:10:32</title><link>http://www.mikeash.com/?page=pyblog/friday-qa-2010-02-19-character-encodings.html#comments</link><description>Thanks Mike and Jordy!
&lt;br /&gt;
&lt;br /&gt;I don't know about converting an ICU encoding into an NSStringEncoding. Perhaps via CFStringEncoding (CFStringConvertIANACharSetNameToEncoding() + CFStringConvertEncodingToNSStringEncoding())? But it may be better to use ucsdet_getUChars() because NSStringEncoding only seems to cover the more popular encodings. (There may be better coverage in CFStringEncoding.)
&lt;br /&gt;</description><guid isPermaLink="true">171324688ed25fa4a8c8714ee4347eb2</guid><pubDate>Thu, 25 Feb 2010 14:10:32 GMT</pubDate></item><item><title>Jordy/Jediknil - 2010-02-24 08:59:28</title><link>http://www.mikeash.com/?page=pyblog/friday-qa-2010-02-19-character-encodings.html#comments</link><description>@Hamish: Cool! Also, don't forget to close (release) the charsetDetector on failure!
&lt;br /&gt;
&lt;br /&gt;Is there an easy way to turn an ICU encoding into an NSStringEncoding? Or would it be better to read the string using ICU's ucsdet_getUChars()?</description><guid isPermaLink="true">e9f21894bf05d4f5f903f56a6e1f9c8b</guid><pubDate>Wed, 24 Feb 2010 08:59:28 GMT</pubDate></item><item><title>mikeash - 2010-02-24 05:37:27</title><link>http://www.mikeash.com/?page=pyblog/friday-qa-2010-02-19-character-encodings.html#comments</link><description>Nice function. Completely unrelated to your main point, but don't use dataWithContentsOfMappedFile: unless you know that the file is on the boot drive. A forcible unmounting of the drive containing the file can cause your program to segfault otherwise.</description><guid isPermaLink="true">2e43cf16b97de27fbb73d392e53ef543</guid><pubDate>Wed, 24 Feb 2010 05:37:27 GMT</pubDate></item><item><title>Hamish - 2010-02-23 22:56:25</title><link>http://www.mikeash.com/?page=pyblog/friday-qa-2010-02-19-character-encodings.html#comments</link><description>libicucore ships with both Mac OS X and iPhone OS (though the latter does not ship the headers):
&lt;br /&gt;
&lt;br /&gt;&lt;code&gt;
&lt;br /&gt;#define UOnFailReturnNil(errorCode) if (U_FAILURE(errorCode)) { NSLog(@"%s (%d): %s", __PRETTY_FUNCTION__, __LINE__, u_errorName(errorCode)); return nil; }
&lt;br /&gt;
&lt;br /&gt;- (NSString *)charsetForTextFileAtPath:(NSString *)path
&lt;br /&gt;{
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;UErrorCode errorCode = U_ZERO_ERROR;
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;UCharsetDetector *charsetDetector = ucsdet_open(&amp;amp;errorCode);
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;UOnFailReturnNil(errorCode);
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;NSData *characterData = [NSData dataWithContentsOfMappedFile:path];
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;ucsdet_setText(charsetDetector, [characterData bytes], [characterData length], &amp;amp;errorCode);
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;UOnFailReturnNil(errorCode);
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;const UCharsetMatch *bestMatch = ucsdet_detect(charsetDetector, &amp;amp;errorCode);
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;UOnFailReturnNil(errorCode);
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;const char *encodingName = ucsdet_getName(bestMatch, &amp;amp;errorCode);
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;UOnFailReturnNil(errorCode);
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;NSString *encodingNameString = [NSString stringWithUTF8String:encodingName];
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;ucsdet_close(charsetDetector);
&lt;br /&gt;
&lt;br /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;return encodingNameString;
&lt;br /&gt;}
&lt;br /&gt;&lt;/code&gt;
&lt;br /&gt;</description><guid isPermaLink="true">b52edc5b0e147642ac1093d5671519c6</guid><pubDate>Tue, 23 Feb 2010 22:56:25 GMT</pubDate></item><item><title>mikeash - 2010-02-20 12:50:16</title><link>http://www.mikeash.com/?page=pyblog/friday-qa-2010-02-19-character-encodings.html#comments</link><description>I seem to recall some Apple document saying that these methods were very basic, and basically just checked for a BOM and otherwise fell back to UTF-8 and that was it, but last time I looked I couldn't find where it said that....</description><guid isPermaLink="true">f52023eafd4b9a97d734d80532566afb</guid><pubDate>Sat, 20 Feb 2010 12:50:16 GMT</pubDate></item><item><title>charles - 2010-02-20 04:50:45</title><link>http://www.mikeash.com/?page=pyblog/friday-qa-2010-02-19-character-encodings.html#comments</link><description>Vasi:
&lt;br /&gt;
&lt;br /&gt;+ (id)stringWithContentsOfURL:(NSURL *)url usedEncoding:(NSStringEncoding *)enc error:(NSError **)error
&lt;br /&gt;
&lt;br /&gt;+ (id)stringWithContentsOfFile:(NSString *)path usedEncoding:(NSStringEncoding *)enc error:(NSError **)error
&lt;br /&gt;
&lt;br /&gt;- (id)initWithContentsOfURL:(NSURL *)url usedEncoding:(NSStringEncoding *)enc error:(NSError **)error
&lt;br /&gt;
&lt;br /&gt;- (id)initWithContentsOfFile:(NSString *)path usedEncoding:(NSStringEncoding *)enc error:(NSError **)error
&lt;br /&gt;
&lt;br /&gt;</description><guid isPermaLink="true">4af8986fef6dd77945647a6fd4c201b5</guid><pubDate>Sat, 20 Feb 2010 04:50:45 GMT</pubDate></item><item><title>Vasi - 2010-02-20 02:50:38</title><link>http://www.mikeash.com/?page=pyblog/friday-qa-2010-02-19-character-encodings.html#comments</link><description>Some textual data will come with an initial byte-order mark (BOM), which despite its name also indicates character encoding. It is usually more likely to be present in longer streams of textual data, especially those from files. If you think you may receive data with a BOM, it may be worth checking for one. &lt;a href="http://en.wikipedia.org/wiki/Byte-order_mark"&gt;http://en.wikipedia.org/wiki/Byte-order_mark&lt;/a&gt;
&lt;br /&gt;
&lt;br /&gt;It may be worth adding a BOM to any text that you store that may be read by another application. Don't do this for #! files though: with a BOM in front, the file no longer starts with #! and they'll stop working. The NSString documentation doesn't seem to be clear on when it produces a BOM when converting to an encoding, which is annoying.
&lt;br /&gt;
&lt;br /&gt;PS: What is this mysterious encoding-auto-detection NSString method? I can't find it for the life of me.</description><guid isPermaLink="true">c88daaa992f3c6ba68f6df6f4ac70339</guid><pubDate>Sat, 20 Feb 2010 02:50:38 GMT</pubDate></item><item><title>mikeash - 2010-02-20 00:03:20</title><link>http://www.mikeash.com/?page=pyblog/friday-qa-2010-02-19-character-encodings.html#comments</link><description>It all depends on your needs in any given situation.
&lt;br /&gt;
&lt;br /&gt;If you anticipate receiving arbitrary text from crazy sources and you want a best shot at picking the correct encoding, then something like Universal Detector will be a good choice, as long as you accept the extra complexity.
&lt;br /&gt;
&lt;br /&gt;For a simpler situation where you just need a failsafe, it can be better to use a simpler scheme. It's easier to debug when the results aren't what you expect, for one. Trying to be too clever can lead to bad results, like this:
&lt;br /&gt;
&lt;br /&gt;&lt;a href="http://en.wikipedia.org/wiki/Bush_hid_the_facts"&gt;http://en.wikipedia.org/wiki/Bush_hid_the_facts&lt;/a&gt;
&lt;br /&gt;
&lt;br /&gt;Finally, if you really have tight control it can be best not to try to fail safe at all, but to just try the one encoding that's mandated (which I hope is UTF-8) and immediately fail if your data isn't valid, rather than trying to persevere. It all depends on just what you're trying to do.</description><guid isPermaLink="true">048d46bb7580da4c41cd3ae81102e23d</guid><pubDate>Sat, 20 Feb 2010 00:03:20 GMT</pubDate></item><item><title>astrange - 2010-02-19 23:06:20</title><link>http://www.mikeash.com/?page=pyblog/friday-qa-2010-02-19-character-encodings.html#comments</link><description>Um, how about using Universal Detector instead?
&lt;br /&gt;
&lt;br /&gt;&lt;a href="http://code.google.com/p/theunarchiver/source/browse/#svn/trunk/UniversalDetector"&gt;http://code.google.com/p/theunarchiver/source/browse/#svn/trunk/UniversalDetector&lt;/a&gt;
&lt;br /&gt;
&lt;br /&gt;It uses the Mozilla universalchardet library, which I think has been completely rewritten since then but somehow hasn't added any more functionality.
&lt;br /&gt;
&lt;br /&gt;There's an autodetection method added to NSString in 10.6 as well, but somehow I doubt it works that well.</description><guid isPermaLink="true">86e0769b95ec627be0606744cb751aff</guid><pubDate>Fri, 19 Feb 2010 23:06:20 GMT</pubDate></item></channel></rss>
