mikeash.com: just this guy, you know?

Tags: audio fft fridayqa guest
Friday Q&A 2012-11-02: Building the FFT

In the last post in this mini-series, Mike gave an overview of the Fourier Transform and then showed you how to use Apple's implementation of the Fast Fourier Transform (FFT).

In this installment, I'll show you how we get from point A to point B. Specifically, I'll talk a bit about the magic behind the Fast Fourier Transform.

A bit of math (British localization: Some maths)
Fourier Transforms have some interesting mathematical properties. Most importantly, the Fourier Transform is a linear operation. That is, if we use $\mathcal{F}(h(t))$ to denote the Fourier Transform of $h(t)$:

$$\mathcal{F}(a\,h_1(t) + b\,h_2(t)) = a\,\mathcal{F}(h_1(t)) + b\,\mathcal{F}(h_2(t))$$

So whether you scale or add two signals in the time or frequency domain is up to you. That's pretty handy when you're working on signal processing code. But I digress.

In addition to the above, there are some more interesting properties relating the time and frequency domain representations of a function. We'll use $h(t) \Leftrightarrow H(f)$ to relate the time and frequency domain representations. For example, scaling and shifting in one domain have simple counterparts in the other:

$$h(at) \Leftrightarrow \frac{1}{|a|} H\!\left(\frac{f}{a}\right)$$

$$h(t - t_0) \Leftrightarrow H(f)\, e^{2\pi i f t_0}$$

Don't focus too much on the individual relations themselves. The main point that I'm trying to get across is that we can manipulate the Fourier Transform and the signal in meaningful ways, and relate those changes between domains.

The Discrete Fourier Transform
Before we continue, I'd like to make a clarification of sorts. The mathematical properties above use terminology specific to Fourier Transforms of continuous functions defined over infinite time and frequency.

Obviously we're working in a digital world, and we don't have the luxury of continuous signals to work with. In software, we're dealing with sampled data, which is where the Discrete Fourier Transform comes in. For $N$ samples $h_0, h_1, \ldots, h_{N-1}$, it's defined as:

$$H_k = \sum_{n=0}^{N-1} h_n\, e^{2\pi i n k / N}$$

This is basically what Mike already gave you in code form in his last post on the topic, and I encourage you to take another look at his code to understand how it relates to this equation.

What's important is that you understand that the Discrete and Continuous Fourier Transforms are closely related, and share almost exactly the same mathematical properties as described above. For what we're discussing, you can safely ignore the "almost" part, and look to Wikipedia's definition for more discussion.

The Danielson-Lanczos lemma
Using a combination of the mathematical properties of the Fourier Transform above, Danielson and Lanczos discovered that you can rewrite a Discrete Fourier Transform of length N as a sum of two Discrete Fourier Transforms of length N/2: one from the even-numbered and one from the odd-numbered points of input.

This is their proof:

$$
H_k = \sum_{n=0}^{N-1} e^{2\pi i n k/N} h_n
= \sum_{m=0}^{N/2-1} e^{2\pi i (2m) k/N} h_{2m} + \sum_{m=0}^{N/2-1} e^{2\pi i (2m+1) k/N} h_{2m+1}
= \sum_{m=0}^{N/2-1} e^{2\pi i m k/(N/2)} h_{2m} + W^k \sum_{m=0}^{N/2-1} e^{2\pi i m k/(N/2)} h_{2m+1}
= H_k^{e} + W^k H_k^{o}
$$

Here, $W^k = e^{2\pi i k/N}$, and $H_k^{e}$ and $H_k^o$ are the transforms of the even- and odd-indexed input points, respectively.

Again, it's not important to totally understand the above, or how they got from A to B. This is where I wave my hands and defer to the fact that the mathematical properties of the Discrete Fourier Transform have been used in the derivation.

The Fast Fourier Transform
The discovery by Danielson and Lanczos, combined with the fact that it can be applied recursively, is the basis of the Fast Fourier Transform. Breaking the problem down one more step, we will end up with a combination of $H_k^{ee}$, $H_k^{eo}$, $H_k^{oe}$, and $H_k^{oo}$.

If we stick with power-of-two inputs to the Fourier Transform, we guarantee that the problem continues to decompose until we reach a Fourier Transform on a single element. And, guess what? That's just a copy of the input value:

$$H_0 = \sum_{n=0}^{0} h_n\, e^{2\pi i \cdot 0} = h_0$$

There is a way to derive which input value each of these single-element transforms corresponds to, but I'm choosing to wave my hands again.

Instead, I'm going to let recursion do the work for us. I vote for this option to keep this post from exploding out of control. If you want to go deeper, get your hands on a copy of Numerical Recipes.

(Fairly) Straightforward Implementation
I put together a compact implementation that demonstrates how this all works. It is based on the first optimization, because I think it's a good balance of readability with just a hint of cleverness. (I started out based on this resource, but massaged the implementation for clarity, and to closely match my math above.)

    static complex double *FFT_recurse( complex double *x, int N, int skip ) {
        complex double *X = (complex double *)malloc( sizeof(complex double) * N );
        complex double *O, *E;
        
        // We've hit the scalar case: the transform of a single element is just
        // a copy of the input.
        if ( N == 1 ) {
            X[0] = x[0];
            return X;
        }
        
        E = FFT_recurse( x, N/2, skip * 2 );
        O = FFT_recurse( x + skip, N/2, skip * 2 );
        
        // Fold the twiddle factor W^k = e^(2*pi*i*k/N) into the odd transform.
        for ( int k = 0; k < N / 2; k++ ) {
            O[k] = cexp( 2.0 * I * M_PI * k / N ) * O[k];
        }
        
        // While E[k] and O[k] are of length N/2, and X[k] is of length N, E[k] and
        // O[k] are periodic in k with length N/2. Since W^(k + N/2) = -W^k, the
        // upper half of the output is E[k] - O[k]. See p.609 of Numerical Recipes
        // in C (3rd Ed, 2007). [CL]
        for ( int k = 0; k < N / 2; k++ ) {
            X[k] = E[k] + O[k];
            X[k + N/2] = E[k] - O[k];
        }
        
        free( O );
        free( E );
        
        return X;
    }
    
    complex double *FFT( complex double *x, int N ) {
        return FFT_recurse( x, N, 1 );
    }


It's really not that complicated, but the improvement in performance is immense. I put together some driver code and tossed it all up on my github account. Some simple timings revealed that a straightforward "math definition" of the algorithm took approximately 12.1s, and the FFT implementation above took a mere 0.1s. More than a 100x speed increase, and that's with a whole bunch of malloc()s and free()s strewn about!

In closing
I hope that this explanation was somewhat helpful in demystifying the Fast Fourier Transform and how it works. It's one of many examples of algorithms that exploit mathematics in order to gain an order-of-magnitude speedup.

Oh, and in case Mike didn't make it clear, you should never implement this yourself. Use the vDSP routines in Accelerate.framework!

Apple's performance team continues to push the limits of their FFT implementation year after year on all platforms. A combination of mathematicians, physicists, engineers, scientists, and assembly language wizards are working hard to ensure that Accelerate.framework is always as fast and power-efficient as possible.

I'm of the opinion that Apple's performance team is largely responsible for what makes the "cool stuff" on the iPhone possible. Think about live audio effects in GarageBand, video effects in iMovie, the processing in iPhoto, and so forth. All of that stuff depends on Accelerate.framework in some capacity.

The next time you visit WWDC, make some time to stop by their lab with any performance questions you may have that relate to Accelerate.framework. They're really nice folks, and astoundingly smart, too!


Something that's been bugging me for a while but that I haven't looked into yet - maybe you know the answer! - is the problem of non-power-of-two sized data sets.

Is it possible to compute the FT in O(NlogN) in case the size is not a power of two? (padding the data with zeroes and applying the FFT obviously generates wrong results; I think the same goes for padding with a repeated version of the data)
Zero-padding the FFT does not produce incorrect results. In fact, it's the recommended approach when you're dealing with NPOT (non-power-of-two) data sets.

If you're getting bad results, it's because you're either doing something wrong, or perhaps your expectations are incorrect. Care to elaborate more on this?

As for answering your question, I believe there are algorithms that 'degrade gracefully' such that they're not as slow as the straightforward implementation, but they're definitely nowhere near the performance of a power-of-two data set. I'm not sure on their O notation runtimes though.

Check out the KissFFT implementation if you're interested in their approach. It's a fairly readable implementation of the FFT that also handles NPOT data sets.
Thanks for the KissFFT link - this does give me the info I wanted. What their implementation does is get the prime factors of the dataset size and then use the divide&conquer approach for each prime factor. As far as I can tell, each butterfly step is quadratic in the prime factor size, something like O((N/p) * p^2) = O(N*p), so for powers of two you get O(N) for each stage, but for i.e. a prime N you get O(N^2).

On to padding; I guess it depends on what you're using FFT for. I'm remarking that AFAICS there is no way to compute DFT for a NPOT dataset using the outlined algorithm with any padding - for example, the sine wave frequencies of the distribution are different?

For some applications it might not matter, I guess - i.e. using FFT as a means of fast convolution should work with zero-padding.

If you *are* interested in frequency response by itself, however, why should you pad with zeroes? The padding values change the result of the FFT, which means that you can't choose the padding values arbitrarily. If your 5-sample signal is meant to be periodic (with a period of 5), then the "right" way is to pad with the same sequence - but adjusting the size to 8 or 16 should give you different frequency responses, i.e.:

http://www.wolframalpha.com/input/?i=DFT%5B1%2C+2%2C+3%5D%2C+DFT%5B1%2C+2%2C+3%2C+0%5D%2C+DFT%5B1%2C+2%2C+3%2C+1%5D%2C+DFT%5B1%2C+2%2C+3%2C+1%2C+2%2C+3%2C+1%2C+2%5D
"If your 5-sample signal is meant to be periodic (with a period of 5), then the "right" way is to pad with the same sequence - but adjusting the size to 8 or 16 should give you different frequency responses"

You're presuming:
- with a known period,
- and all higher components (harmonic or otherwise) are even multiples of the fundamental period.

An 8-sample window of a 5-sample signal might give a wrong zero-padded result, but you are remembering to window your samples with an apodization function, right? You can't just stick zeroes at the end of {0,1,2,0,1} and expect it to work; the samples have to be tapered. This causes spectral leakage, but that's a consequence of sampling a non-contrived signal, and it's unavoidable.

You can only avoid having to use a windowing function by having prior knowledge of the signal, and if you require that level of precision, for an input of arbitrary length, I don't think the FFT is the droid you're looking for.