<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"><channel><title>mikeash.com pyblog/performance-comparisons-of-common-operations.html comments</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>mikeash.com Recent Comments</description><lastBuildDate>Sat, 06 Jun 2026 20:39:01 GMT</lastBuildDate><generator>PyRSS2Gen-1.0.0</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Compiled code? - 2012-03-10 15:34:28</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>Is it possible for you to also provide a compiled executable, simply for disassembly? (Or attach the disassembled Intel-syntax assembly code.)
&lt;br /&gt;
&lt;br /&gt;Just to allow us to see what's actually being executed, and if you provide an O3 and O0 version (I have no obj-c compiler atm) I'm sure we can fix it up so that O3 doesn't break any of the loops.</description><guid isPermaLink="true">f23b5658c35ea649f3a46e0dd5eaf002</guid><pubDate>Sat, 10 Mar 2012 15:34:28 GMT</pubDate></item><item><title>mikeash - 2008-01-13 05:02:49</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>Good idea. There were some interesting changes, read more here: &lt;a href="http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations-leopard-edition.html"&gt;http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations-leopard-edition.html&lt;/a&gt;</description><guid isPermaLink="true">d8cba10839b791f9334f362dfd43e622</guid><pubDate>Sun, 13 Jan 2008 05:02:49 GMT</pubDate></item><item><title>Ahruman - 2008-01-12 21:44:01</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>It’d be nice to see some Leopard numbers here. My tests on a 2.4 GHz iMac give about the same numbers for most things, but NSInvocation is twice as fast and objc message sends are slower. It’d be more interesting with numbers from the same computer.</description><guid isPermaLink="true">c61a712250170d5ee4358c42cae6091d</guid><pubDate>Sat, 12 Jan 2008 21:44:01 GMT</pubDate></item><item><title>mikeash - 2007-09-04 23:28:00</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>Marcel, obviously personal experiences will differ. I’ve written my share of high-performance software and never had memory allocation of any kind be an important factor.
I would assume that individual programming styles will influence these things a lot, since if you subconsciously write code that does something less often, then you will not run into that particular trouble as frequently.
&lt;br /&gt;
&lt;br /&gt;In general I would say that if you’ve run into a problem a few times you should be experienced enough to know when to deal with it, and if you haven’t then you should follow the general principle of avoiding premature optimization, writing good, clean code, and waiting until you can profile to determine where the problems lie.
&lt;br /&gt;</description><guid isPermaLink="true">d1bee087fa2d89915d1bb2e0dab9a65a</guid><pubDate>Tue, 04 Sep 2007 23:28:00 GMT</pubDate></item><item><title>Marcel Weiher - 2007-09-01 18:58:00</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>My experience is that performance needs to be considered and designed for from the start.  Not the “small efficiencies” whose pursuit has been characterized as the “root of all evil”, but the design choices that make good performance possible.  Object-allocation patterns are definitely among the latter, not the former.  This is not theory or corner cases, but hard-won experience.  YMMV.
&lt;br /&gt;</description><guid isPermaLink="true">8c4d2403372d5b9dc19d8c29afe875e1</guid><pubDate>Sat, 01 Sep 2007 18:58:00 GMT</pubDate></item><item><title>leeg - 2007-08-29 09:28:00</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>OK, so now I stop talking from a position of ignorance and refer to my Paper Singh, specifically section 8.15 “Memory Allocation in User Space”.  It seems that OS X does have scalable allocation categories, not quite the same as but analogous to Solaris large pages (because they’re not backed by large pages in the hardware – they just adapt the bookkeeping).  Your 16-byte malloc()s are handled by a special case called the tiny region allocator, the quantum for which is exactly 16 bytes.  Your 16 MB malloc()s are on the border of the large and huge allocations; I think they’re huge though.  I’m slightly interested in finding out whether creating a 16 MB zone and doing the large malloc()s into that zone improves anything, but I currently have no use for that information as I have no reason to optimise malloc() calls anywhere.
&lt;br /&gt;
&lt;br /&gt;BTW you’re correct of course that pages are created/activated by faults, not by allocation.  My mistake.
&lt;br /&gt;</description><guid isPermaLink="true">f7a63c61d73642c6cd6432f9aab8ff23</guid><pubDate>Wed, 29 Aug 2007 09:28:00 GMT</pubDate></item><item><title>mikeash - 2007-08-29 01:04:00</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>I’m not sure exactly what happens in the malloc itself, but I do know that the OS will put off a lot of the work until you actually start touching the pages. Since my test just does a malloc followed by an immediate free without ever writing to the pages, a great deal of that overhead is skipped.
&lt;br /&gt;</description><guid isPermaLink="true">9bfaf246679df5712a8bad126e5ddd90</guid><pubDate>Wed, 29 Aug 2007 01:04:00 GMT</pubDate></item><item><title>leeg - 2007-08-28 18:48:00</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>Presumably large malloc()s are slow not only (and possibly not even mainly) because “every allocation goes straight to the kernel and you pay for that in the form of syscall overhead” but also because every allocation requires a large number of pages to be created (zeroed?  Not sure whether that happens because I never assume it does), and potentially other pages need to be added to the laundry list to let those new pages become activated.  Solaris on SPARC has some ridiculously large page sizes (like 4M) in order to avoid a number of cache misses and paging operations happening so heavily in Oracle^Wapplications with large working data sets.
&lt;br /&gt;
&lt;br /&gt;Nice investigation.
&lt;br /&gt;</description><guid isPermaLink="true">ec79e02fc8447c77c6dd55375fccdba4</guid><pubDate>Tue, 28 Aug 2007 18:48:00 GMT</pubDate></item><item><title>mikeash - 2007-08-28 02:20:00</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>Marcel, &lt;i&gt;anything&lt;/i&gt; can be the make-or-break factor in theory. Certainly in very specific types of code this can be the make-or-break factor in practice. But this falls solidly on the side of the line that says you shouldn’t even bother to think about it until after you have written an entire working program, determined it to be slow, and profiled it to find out why. Contrast with something like, say, spawning a new process in a loop which you know must execute at least 1000 times per second, where you should definitely be reconsidering your design from the start.
&lt;br /&gt;
&lt;br /&gt;Greg, thanks for the information on the dyld stub. That certainly explains why I saw no difference.
&lt;br /&gt;</description><guid isPermaLink="true">dc86e549e38a120c8425c21bf1dcb43d</guid><pubDate>Tue, 28 Aug 2007 02:20:00 GMT</pubDate></item><item><title>Greg Parker - 2007-08-28 01:11:00</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>Accelerated Objective-C dispatch on PPC is faster because it avoids the dyld stub (which is otherwise used for all cross-library calls). On i386 the dyld stub is much faster than the PPC equivalent, so we didn’t bother doing extra work to bypass it.
&lt;br /&gt;</description><guid isPermaLink="true">63429be30b1123e8a78b7b37bbeb10fe</guid><pubDate>Tue, 28 Aug 2007 01:11:00 GMT</pubDate></item><item><title>Marcel Weiher - 2007-08-28 01:03:00</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>Nice summary information!  I would disagree somewhat on the conclusions, though:  allocating objects often can be the make-or-break factor in terms of performance.  See:  &lt;a href="http://www.metaobject.com/blog/2007/08/high-performance-objective-c-i.html" rel="nofollow"&gt;http://www.metaobject.com/blog/2007/08/high-performance-objective-c-i.html&lt;/a&gt;
&lt;br /&gt;</description><guid isPermaLink="true">b1cab3f73d2eba7c83dafdbee99718fd</guid><pubDate>Tue, 28 Aug 2007 01:03:00 GMT</pubDate></item><item><title>Ahruman - 2007-08-27 16:06:00</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>Elliott, in your case compiler optimizations were probably converting the remainder operation to less costly operations. A quick check shows that x % 82 compiles to &lt;em&gt;mulhwu r5,r31,r30; srwi r5,r5,6; mulli r5,r5,82; subf r5,r5,r31&lt;/em&gt; (multiply high word unsigned, shift right word immediate, multiply low immediate, subtract from) on PPC, while x % y compiles to &lt;em&gt;divwu r5,r31,r30; mullw r5,r5,r30; subf r5,r5,r31&lt;/em&gt; (divide word unsigned, multiply low word, subtract from). The costly division is the explanation there.
&lt;br /&gt;
&lt;br /&gt;However, as you can see from the PPC assembler listing on my entry (linked above), the (unoptimized) benchmark code was generating division instructions for both the floating-point and the integer cases. I fully expect this was also the case in Mike’s original x86 case.
&lt;br /&gt;</description><guid isPermaLink="true">ad7d59206a94d03ec33ddc907df6849f</guid><pubDate>Mon, 27 Aug 2007 16:06:00 GMT</pubDate></item><item><title>Elliott Hughes - 2007-08-27 15:26:00</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>Were the divisions by constants or non-constants? It makes a big difference, on both x86 and PPC. See here for details: &lt;a href="http://elliotth.blogspot.com/2007/07/still-no-free-lunch-surprising-cost-of.html" rel="nofollow"&gt;http://elliotth.blogspot.com/2007/07/still-no-free-lunch-surprising-cost-of.html&lt;/a&gt;
&lt;br /&gt;</description><guid isPermaLink="true">046a662b29b28a13598f6cb3f8f71626</guid><pubDate>Mon, 27 Aug 2007 15:26:00 GMT</pubDate></item><item><title>Ahruman - 2007-08-27 13:34:00</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>Mike, accelerated dispatch is not implemented for x86. The private function &lt;strong&gt;rtp_init()&lt;/strong&gt; (objc4/runtime/objc-rtp.m) does nothing if !defined(__ppc__). Looking into what &lt;strong&gt;rtp_init()&lt;/strong&gt; actually does, the reason would seem to be that it’s easy to find a &lt;strong&gt;blr&lt;/strong&gt; in (fixed-instruction-size) PPC code, but fiddly to do the equivalent for x86 code. Presumably the flag is quietly ignored on x86 so people don’t have to fiddle about setting up &lt;strong&gt;GCC_FAST_OBJC_DISPATCH_ppc&lt;/strong&gt;.
&lt;br /&gt;
&lt;br /&gt;Also, the export file has the non-nil entry points commented out, for both PPC and x86, with the helpful comment “non-nil entry points disabled for now”. I was really only interested in non-nil sends in the hope that they wouldn’t actually be much more efficient. Still, if they were, there might conceivably be an advantage in putting performance-sensitive code which can make that assumption in a category in a separate file with custom build flags. Possibly.
&lt;br /&gt;
&lt;br /&gt;(Code extracts removed because Textile sucks.)
&lt;br /&gt;</description><guid isPermaLink="true">a0beb8fab7ec88a11d8452f123f4ce84</guid><pubDate>Mon, 27 Aug 2007 13:34:00 GMT</pubDate></item><item><title>mikeash - 2007-08-26 21:08:00</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>Colin, thanks for the update. I hope you didn’t take my jab badly; it aroused legitimate curiosity and the results were a bit surprising, both in the allocation being slower and the drawing being faster than I expected. I had thought there would be about two orders of magnitude between them, not just one.
&lt;br /&gt;
&lt;br /&gt;Ahruman, thanks for the comparison. For the record, I compiled the code above with -O0 because -O3 screws up the do-nothing loops as you noted, and -O0 doesn’t seem to hurt anything for this particular test. I also tried accelerated Objective-C method dispatch and to my surprise, there was no effect on the results. I didn’t try non-nil receivers because I believe that it’s not a useful measurement for real code.
&lt;br /&gt;</description><guid isPermaLink="true">f1b570e64116aea336e706125bda529e</guid><pubDate>Sun, 26 Aug 2007 21:08:00 GMT</pubDate></item><item><title>Ahruman - 2007-08-26 16:19:00</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>I intended to run the tests on a PPC last night, but forgot on account of brain not working at 2 am. (It’s probably a good thing you distracted me from coding in that state.) Anyway, I’ve posted them at &lt;a href="http://jens.ayton.se/blag/performance-comparisons-of-common-operations-ppc-edition/" rel="nofollow"&gt;http://jens.ayton.se/blag/performance-comparisons-of-common-operations-ppc-edition/&lt;/a&gt; for space and formatting reasons. Spoiler: Objective-C message dispatch is not twice as fast as FP divide on PPC.
&lt;br /&gt;
&lt;br /&gt;I’m not entirely satisfied about the accuracy of the smaller numbers, but the order is probably right.
&lt;br /&gt;</description><guid isPermaLink="true">220f9bbe5c83f75c653497f7cca75eb1</guid><pubDate>Sun, 26 Aug 2007 16:19:00 GMT</pubDate></item><item><title>Colin Barrett - 2007-08-26 02:39:00</title><link>http://www.mikeash.com/?page=pyblog/performance-comparisons-of-common-operations.html#comments</link><description>I stand corrected, Mike. For what it’s worth, I discovered we were doing ObjC allocations in a similar situation anyway; looks like it was just my own preconceived notions and premature optimization at work.
&lt;br /&gt;</description><guid isPermaLink="true">3b16ee5c4258ecba3d9374f1b85a3c2e</guid><pubDate>Sun, 26 Aug 2007 02:39:00 GMT</pubDate></item></channel></rss>
