The Fastest Fourier Transform in the West

Matteo Frigo
Vanu Inc. and MIT Laboratory for Computer Science

Abstract

FFTW is a system for computing the discrete Fourier transform (DFT) of both real and complex multidimensional data. In constrast to existing DFT libraries that implement a fixed algorithm, FFTW automatically adapts the computation to the underlying hardware so as to maximize performance. FFTW is generally at least as fast as comparable implementations of the DFT, including those that are tuned for a specific machine.

FFTW is an extreme example of automatic program specialization. Specifically, FFTW computes the DFT in three stages.

The first stage occurs before compilation time, when the ``genfft'' domain-specific compiler specializes a high-level specification of a DFT algorithm into a set of ``codelets.'' A codelet is an optimized computational kernel written in C that either computes the DFT of a small input array or can be composed with other codelets to compute the DFT of a larger input. genfft is written in Objective Caml, and its partial-evaluation capabilities are powerful enough to derive DFT algorithms for real data from a complex-data algorithm. In some cases, genfft even derives algorithms that were not previously known. genfft uses the theory of ``cache-oblivious algorithms'' [Frigo et al., 1999] to produce C code for which a good register allocation exists.

The second stage occurs during program initialization, after the codelets have been compiled and linked together to form an executable. In this stage, a ``planner'' accepts a description of the DFT problem to be solved and produces a ``plan,'' which is a composition of codelets that solves that particular problem. The planner performs a dynamic-programming search over the space of allowable plans and selects the fastest by explicit time measurements. This is the machine-dependent step in which FFTW adapts itself to the hardware.

Finally, an ``executor'' accepts a plan and the data to be transformed and computes the DFT of the data. A plan can be used to compute as many DFTs as desired. In the current FFTW system, plans are Lisp-like closures that can be invoked directly, but in older FFTW implementations the executor was an explicit interpreter for a special-purpose DFT language.

In this talk, I present the FFTW system and I argue that, because of the unpredictability of current computer architectures, FFTW represents a reasonable programming paradigm for high-performance software.

Joint work with Steven G. Johnson of the MIT Department of Physics.

This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant F30602-97-1-0270 and by the Special Research Program SFB F011 ``AURORA'' of the Austrian Science Fund FWF.