mikeash.com: just this guy, you know?

Posted at 2009-03-06 22:19
Tags: clang fridayqna
Friday Q&A 2009-03-06: Using the Clang Static Analyzer
by Mike Ash  

Welcome back to another exciting Friday Q&A. This week's topic, suggested by Ed Wynne, will be an overview of the Clang Static Analyzer and an example of how to use it.

What Is It?
Clang is part of the LLVM project. LLVM is essentially a compiler and JIT virtual machine framework. Some of the compiler bits are currently available in Mac OS X as llvm-gcc, which fits a gcc parser/front-end to the LLVM code generator/back-end. Clang aims to essentially fill in the other half, and provide a parser/front-end as part of the LLVM project itself, which will allow a pure LLVM compiler.

What's the point of this, and why not just use gcc? It's actually pretty simple: gcc is old and crufty and slow. It has a huge amount of legacy baggage and is not very easy to work with. Clang is considerably more lightweight and its code is much more modular.

That last part is important here, because some enterprising people have taken Clang and implemented a static code analyzer with it. In essence, it's a compiler that, instead of translating your code to machine language, goes through and looks for mistakes.

The Clang Static Analyzer (which I will now abbreviate as CSA even though everybody calls it "clang", because Clang is actually the name for the entire front-end, not just CSA) is still early in development and very incomplete, but it is already very useful.

Where Is It?
The main CSA web page can be found at http://clang.llvm.org/StaticAnalysis.html, and it can be downloaded using the link at the bottom right. I won't link directly to the download because it's still in very active development and so the download link updates frequently.

How To Use It
Using CSA is extremely easy. It provides a scan-build command which you simply invoke at the command line, passing the command to build your code as the parameters. scan-build will do some funky business to convince gcc to pass control over to CSA as it builds, allowing CSA to analyze all of your code instead of actually getting it built.

Since an example is worth a thousand words:

    $ gcc -framework Foundation test.m
    $ scan-build gcc -framework Foundation test.m
    ANALYZE: test.m main
    test.m:5:16: warning: Value stored to 'x' is never read
        int x = 0; x = 1;
                   ^   ~
    1 diagnostic generated.
    scan-build: 1 bugs found.
    scan-build: Run 'scan-view /var/folders/YT/YTiq3QDl2RW4ME+BYnLyRU+++TM/-Tmp-/scan-build-2009-03-06-3' to examine bug reports.
    $ 

And there it is, found a bug. If you run the command it mentions at the end, it gives a really swank HTML view.

Note that the scan-build command can be used not only with gcc but also with xcodebuild and even make. Running an analysis of your Xcode project is just a single command, usually as simple as scan-build xcodebuild in your project's directory.

A Better Example
Let's actually look at some code. I made the following contrived buggy code:

    #import <Foundation/Foundation.h>
    
    static void TestFunc(char *inkind, char *inname)
    {
        NSString *kind = [[NSString alloc] initWithUTF8String:inkind];
        NSString *name = [NSString stringWithUTF8String:inname];
        if(!name)
            return;
        
        const char *kindC = NULL;
        const char *nameC = NULL;
        if(kind)
            kindC = [kind UTF8String];
        if(name)
            nameC = [name UTF8String];
        if(!isalpha(kindC[0]))
            return;
        if(!isalpha(nameC[0]))
            return;
        
        [kind release];
        [name release];
    }

Obviously this code doesn't actually do anything useful, but of course it's meant only for illustration. There are several bugs in this code. Instead of trying to find them by looking, let's ask CSA:

    $ scan-build gcc -c test.m
    ANALYZE: test.m TestFunc
    test.m:5:23: warning: Potential leak of object allocated on line 5 and store into 'kind'
        NSString *kind = [[NSString alloc] initWithUTF8String:inkind];
                          ^
    test.m:18:17: warning: Dereference of null pointer.
        if(!isalpha(nameC[0]))
                    ^~~~~~~~
    2 diagnostics generated.
    scan-build: 2 bugs found.
    scan-build: Run 'scan-view /var/folders/YT/YTiq3QDl2RW4ME+BYnLyRU+++TM/-Tmp-/scan-build-2009-03-06-6' to examine bug reports.

And there we are: two bugs! They're both pretty subtle, too. The object that's leaked does get released at the end of the function. The problem is simply that there are some return statements in the middle that can cause that code not to be reached. CSA is clever enough to trace out those code paths and find the problem. The other bug requires a similar depth of analysis to find, as the null dereference can only happen if a previous if statement isn't taken.
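For comparison, here is a minimal plain-C sketch of one way to restructure a function like this so that both kinds of warning go away. It is an analogue, not the Cocoa code: malloc and free stand in for alloc and release, and the names copy_string and both_start_with_letter are made up for illustration.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a heap copy of s, or NULL if s is NULL
   or the allocation fails. Stands in for creating an owned string. */
char *copy_string(const char *s)
{
    if (s == NULL)
        return NULL;
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, s, len);
    return copy;
}

/* Returns 1 if both inputs start with a letter, 0 otherwise.
   There are no early returns, so every path reaches the cleanup code,
   and each pointer is checked before it is dereferenced. */
int both_start_with_letter(const char *inkind, const char *inname)
{
    int result = 0;
    char *kind = copy_string(inkind);
    char *name = copy_string(inname);

    if (kind != NULL && name != NULL &&
        isalpha((unsigned char)kind[0]) &&
        isalpha((unsigned char)name[0]))
        result = 1;

    free(kind);   /* free(NULL) is a no-op, so no extra checks needed */
    free(name);
    return result;
}
```

The design choice is simply "one exit, one cleanup": with no return statements in the middle, there is no path on which the frees can be skipped.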

You may have noticed that it missed a bug, though. This function releases name, which points to an object that it does not own. I'm not sure why CSA missed this, but it's important to keep in mind that it's not perfect and it won't catch everything.

CSA also sometimes sees false positives. These mostly occur when doing funky cross-method memory management tricks. For example, it's common when displaying a sheet to pass an object in to the void *context parameter so that the receiver of the end-sheet message can get information out of it. Proper memory management here requires retaining the context object when making the call, and then releasing it in the callback. Previous versions of CSA would consider the initial retain a leak, since it couldn't see that it was later balanced in another method. They appear to have fixed this particular case now, but other such cases will still be around, simply because it can't be perfect.
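To make the shape of that pattern concrete, here is a rough plain-C sketch, assuming malloc and free in place of retain and release. All of the names (SheetContext, begin_sheet, end_sheet, sheet_did_end, show_sheet) are hypothetical stand-ins, not real Cocoa API. Whether a given analyzer version complains about show_sheet depends on how it treats pointers that escape into other functions.

```c
#include <stdlib.h>

/* Hypothetical context object handed through a void * parameter. */
typedef struct {
    int document_id;
} SheetContext;

typedef void (*SheetCallback)(int return_code, void *context);

/* Stand-ins for the sheet machinery: remember the callback and the
   context until the (simulated) sheet ends. */
static SheetCallback pending_callback;
static void *pending_context;

/* Records what was "closed", so the effect of the callback is visible. */
int last_closed_document = -1;

void begin_sheet(SheetCallback callback, void *context)
{
    pending_callback = callback;
    pending_context = context;
}

void end_sheet(int return_code)
{
    SheetCallback callback = pending_callback;
    void *context = pending_context;
    pending_callback = NULL;
    pending_context = NULL;
    if (callback != NULL)
        callback(return_code, context);
}

/* The callback takes over ownership of the context and frees it. */
void sheet_did_end(int return_code, void *context)
{
    SheetContext *ctx = context;
    if (return_code == 1)
        last_closed_document = ctx->document_id;
    free(ctx);
}

/* Read on its own, this function allocates and never frees. The
   matching free lives in sheet_did_end, in a different function. */
int show_sheet(int document_id)
{
    SheetContext *ctx = malloc(sizeof *ctx);
    if (ctx == NULL)
        return 0;
    ctx->document_id = document_id;
    begin_sheet(sheet_did_end, ctx);
    return 1;
}
```

The allocation and its release are balanced, but only if you look at both functions together, which is exactly the kind of reasoning that is hard for a tool that analyzes one function at a time.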

Conclusion
The Clang Static Analyzer, although limited, is an extremely useful tool. I guarantee that if you run it for the first time on any substantial base of Cocoa code, you will be surprised and frightened at what it finds. For tracking down leaks and many other common programming errors, it is invaluable. And it's under active development as part of a project with a great deal of support from Apple, so it will only get better.

That wraps up this week's Friday Q&A. Come back next week for another exciting installment. If you have a topic you'd like to see discussed, please write in. Friday Q&A is driven by your submissions, and the more I get, the better topics I can choose. Post your ideas below or e-mail them (and tell me if you don't want me to use your name).


Comments:

> I will now abbreviate as CSA even though everybody calls it "clang"

In fact, not everybody calls it "clang"; some people also use "as-yet-unnamed clang static analyzer" ;-)

The clang community is looking for a better name than "scan-build", or CSA.

This tool is young and is missing some important features (like cross-module analysis), but it is really useful.

It even reports things like a missing release call in dealloc for synthesized properties, and checks for correct usage of NSError** parameters.

Also see analysis tool for a nice front end.

http://www.karppinen.fi/analysistool/

(It uses clang but adds additional checks)

-john
Er, is it just me, or is it kindC[0] in the example that should be giving the null-dereference warning? You return if name is NULL, so nameC should always be a valid pointer.
Yep, you're right, that's a bug with the static analyzer. I wrote the code with the expectation that one would be good and one would be a bug, and I never realized that the output was backwards!
I'm almost certain the static analysis comes from clang itself, not LLVM. Clang does lots of cheap bitvector dataflow inside the C frontend so it can provide warnings no matter what the optimization level is; these are just a kind of really thorough warning. (you'll notice the analyses still give false uninitialized warnings where gcc does, and can't see through function calls)

And gcc internals may not be very modular but they certainly aren't as bad as you think. (well, except for the backend, anyway)
The static analysis comes from a separate component in the Clang project, but it's not the same as the rest of the compiler.
Specifically, Clang consists of the following components:
Basic: Support code
Lex: Lexing and preprocessing
Parse: Parsing
AST: AST representation
Sema: Semantic analysis, builds the AST
Analyze: Static code analysis, the core logic of CSA
Rewrite: Code rewriting support

CSA combines everything except Rewrite to form the tool, but the real analysis is done in libanalyze.

Note, also, that CSA has two modes of running: flow-based analysis and path-based analysis. Flow-based is faster, but less accurate. It produces the false uninitialized warnings like GCC, which astrange mentioned. This is because it doesn't consider the possible paths through control structures. As such, it will give a false positive here:

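#include <stdio.h>

void f(int a)
{
  int b;                       /* deliberately left uninitialized */
  if (a > 0) b = 1;
  if (a > 5) printf("%d", b);  /* only reached when a > 0, so b is set */
}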

Path-based tracks possible value ranges of variables and traces out every possible path through the function. That makes it a lot slower (runtime is exponential in the number of branches, in theory, though the component aggressively culls paths), but also more accurate. The path-based analysis recognizes that it's impossible to enter the second if if the first wasn't also entered, and will not warn about an uninitialized b.
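As a hedged aside, one way to sidestep the question entirely is to restructure the code so that the variable is only in scope where it has a value. The sketch below is a hypothetical variant of f; the name f_nested and the int return value are assumptions added here so the behavior can be checked. With this shape there is no read of an uninitialized b for either mode to reason about.

```c
#include <stdio.h>

/* Prints the same thing as f for every input, but b is declared and
   initialized in the branch that uses it. Returns the number of
   characters printed (0 if nothing was printed). */
int f_nested(int a)
{
    int printed = 0;
    if (a > 0) {
        int b = 1;
        if (a > 5)
            printed = printf("%d", b);
    }
    return printed;
}
```

The second test is nested inside the first, which states in the code's structure what path-based analysis otherwise has to work out from the value ranges of a.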
> And gcc internals may not be very modular but they certainly aren't as bad as you think. (well, except for the backend, anyway)

So, gcc internals aren't as bad as I think, except for where they are? Funny.
I never tried clang until this post inspired me. I wasn't quite surprised and frightened, but I was very impressed with what it found, like rare memory leaks caused by early returns and me passing a BOOL when I should be passing an enum. AnalysisTool is great, too.
I'm pretty sure that 'name' is returned as an autoreleased object, so it is not leaked.
bork: What's your point? I don't believe anyone ever said that name is leaked.
Wow, so Clang is really responsive to bug reports. Version 0.170 of the static analyzer, now available from the web site, fixes the nameC/kindC mixup with this code. Way cool.
thx John McLaughlin, for pointing to the 'karppinen' tool. however, this won't be useful on PPC. it doesn't look like a UB build.
> So, gcc internals aren't as bad as I think, except for where they are? Funny.

Just for completeness: there are three ends in a compiler, and the middle-end (platform-independent optimization) is the most important.
