Fast Fourier Transformation

Introduction

Let us first define the Discrete Fourier Transformation (abbreviated as DFT) and fix the notation for subsequent discussion.

Discrete Fourier Transformation

Given a sequence of \( N \) complex numbers, namely \(X_N = (x_n)_{n=0}^{N-1}\) its discrete Fourier transformation \(Y_N = (y_m)_{m=0}^{N-1}\) is defined by the following rule

\[\displaystyle y_m = \sum_{n=0}^{N - 1} x_n \ \exp\left(\frac{2 \pi i}{N} \, m n\right) \ \ \ \forall \ 0 \leq m \leq N - 1\]

Moreover the inverse of above map (called inverse discrete Fourier transform) is defined as

\[\displaystyle x_n = \frac{1}{N} \sum_{m=0}^{N - 1} y_m \ \exp\left(- \frac{2 \pi i}{N} \, m n \right) \ \ \ \forall \ 0 \leq n \leq N - 1\]

Let \(\mathbb{F}_N\) denote the \(N \times N\) matrix whose \((m, n)\) entry is given by \(\displaystyle \left( \mathbb{F}_N \right)_{(m, n)} = \exp\left(\frac{2 \pi i}{N} \, m n \right)\) then the above rule can be succintly expressed as the following matrix vector multiplication

\[\displaystyle Y_N = \mathbb{F}_N \, X_N\]

Fast Fourier Transformation

Looking at the equation above one can see that straightforward calculation will require \(O(N^2)\) computations to calculate \(Y_N\). Fast Fourier Transformation (abbreviated as FFT) is an \(O(N \log N )\) algorithm to compute the same.

From now (without loss of generality) we will make an assumption that \(N\) is a power of two. If that is not the case, one can simply pad the input vector \(X_N\) with required number of zeros at the end. Doing that will at most double the input size so that the complexity of FFT algorithm will not be affected.

Even though the matrix \(\mathbb{F}_N\) is far from being sparse, yet we can exploit the special structure it possesses. Namely,

\[\displaystyle \mathbb{F}_N = \begin{pmatrix}\mathbb{I}_{\frac{N}{2}} & \mathbb{D}_{\frac{N}{2}} \\ \mathbb{I}_{\frac{N}{2}} & - \mathbb{D}_{\frac{N}{2}} \end{pmatrix} \begin{pmatrix} \mathbb{F}_{\frac{N}{2}} & 0 \\ 0 & \mathbb{F}_{\frac{N}{2}} \end{pmatrix} \mathbb{S}_N\]

Here \(\mathbb{I}_{\frac{N}{2}}\) is the identity matrix and \(\mathbb{D}_N\) is a diagonal matrix defined as

\[\mathbb{D}_N = \begin{pmatrix} 1 & & & & \\ & \omega_N & & & \\ & & \omega_N^2 & & \\ & & & \ddots & \\ & & & & \omega_N^{N - 1} \end{pmatrix}\]

where \(\omega_N\) is primitive \(N^{\mbox{th}}\) root of unity, i.e. \(\omega_N = \exp\left(\frac{2 \pi i}{N}\right)\). Moreover the permutation matrix \(\mathbb{S}_N\) shuffles even indices before the odd ones. For example

\[\mathbb{S}_4 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \ \ \mbox{ and in general } \ \ \mathbb{S}_N = \begin{pmatrix} \mathbf{e}_0 \\ \mathbf{e}_2 \\ \ldots \\ \mathbf{e}_{N} \\ \mathbf{e}_1 \\ \mathbf{e}_3 \\ \ldots \\ \mathbf{e}_{N - 1} \end{pmatrix}\]

If we recursively factorize the matrix \(\mathbb{F}_N\) as above then there will be \(O(\log N )\) many matrix products where each matrix would have at most two non zero entries in each row. Thus if one computes the product successively from the right then each matrix-vector product would take \(O(N)\) time. Thus all in all whole algorithm computes the required product in \(O(N \log N )\) time.

Furthermore the inverse fourier transform can be obtained by recognizing that \(\frac{1}{\sqrt{N}} \mathbb{F}_N\) is a unitary and symmetric matrix, in other words

\[\mathbb{F}^{-1}_N = \frac{1}{N} \ \overline{\mathbb{F}_N}\]

where \(\overline{\mathbb{F}_N}\) is complex conjugate of the matrix \(\mathbb{F}_N\).

An Illustrative Example

Let us take consider the case when \(N = 4\) with \(X_4 = (x_0, x_1, x_2, x_3)\).

We shall calculate its discrete Fourier trasnform (and inverse discrete Fourier transform) by direct calculation as well as by using the FFT described above.

DFT Calculation

Fourier transformation \(Y_4 = \mathbb{F}_4 \ X_4 = (y_0, y_1, y_2, y_3)\) by definition is given by

\[\begin{align*} y_0 &= \sum_{n = 0}^3 x_n \ \exp\left( \frac{2 \pi i}{4} 0 \cdot n \right) = x_0 + x_1 + x_2 + x_3 \\ y_1 &= \sum_{n = 0}^3 x_n \ \exp\left( \frac{2 \pi i}{4} 1 \cdot n \right) = x_0 + i x_1 - x_2 - i x_3 \\ y_2 &= \sum_{n = 0}^3 x_n \ \exp\left( \frac{2 \pi i}{4} 2 \cdot n \right) = x_0 - x_1 + x_2 - x_3 \\ y_3 &= \sum_{n = 0}^3 x_n \ \exp\left( \frac{2 \pi i}{4} 3 \cdot n \right) = x_0 - i x_1 - x_2 + i x_3 \end{align*}\]

Moreover the inverse discrete Fourier transform is given as

\[\begin{align*} x_0 &= \frac{1}{4} \sum_{m = 0}^3 y_m \ \exp\left( - \frac{2 \pi i}{4} 0 \cdot m \right) = \frac{1}{4}\left( y_0 + y_1 + y_2 + y_3 \right) \\ x_1 &= \frac{1}{4} \sum_{m = 0}^3 y_m \ \exp\left( - \frac{2 \pi i}{4} 1 \cdot m \right) = \frac{1}{4} \left( y_0 - i y_1 - y_2 + i y_3 \right) \\ x_2 &= \frac{1}{4}\sum_{m = 0}^3 y_m \ \exp\left( - \frac{2 \pi i}{4} 2 \cdot m \right) = \frac{1}{4} \left( y_0 - y_1 + y_2 - y_3 \right) \\ x_3 &= \frac{1}{4} \sum_{m = 0}^3 y_m \ \exp\left( - \frac{2 \pi i}{4} 3 \cdot m \right) = \frac{1}{4} \left( y_0 + i y_1 - y_2 - i y_3 \right) \end{align*}\]

FFT Calculation

For the sake of clarity, let us first see what \(\mathbb{F}_1\) and \(\mathbb{F}_2\) matrices look like. For the boundary case of \(N = 1\), by definition one has \(\mathbb{F}_1 = 1\).

On the other hand for \(N = 2\) by definition we have

\[\mathbb{F}_2 = \begin{pmatrix}\mathbb{I}_{1} & \mathbb{D}_{1} \\ \mathbb{I}_{1} & - \mathbb{D}_{1} \end{pmatrix} \begin{pmatrix} \mathbb{F}_{1} & 0 \\ 0 & \mathbb{F}_{1} \end{pmatrix} \mathbb{S}_2 = \begin{pmatrix} 1 & 1 \\ 1 & - 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & - 1 \end{pmatrix}\]

Thus for the record, given an input \((a, b)\) the output would be \(\mathbb{F}_2 \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} a + b \\ a - b \end{pmatrix}\)

Getting back to the problem, we have

\[\begin{align*} \begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{pmatrix} &= \mathbb{F}_4 \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix}\mathbb{I}_{2} & \mathbb{D}_{2} \\ \mathbb{I}_{2} & - \mathbb{D}_{2} \end{pmatrix} \begin{pmatrix} \mathbb{F}_{2} & 0 \\ 0 & \mathbb{F}_{2} \end{pmatrix} \mathbb{S}_4 \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix}\mathbb{I}_{2} & \mathbb{D}_{2} \\ \mathbb{I}_{2} & - \mathbb{D}_{2} \end{pmatrix} \begin{pmatrix} \mathbb{F}_{2} & 0 \\ 0 & \mathbb{F}_{2} \end{pmatrix} \begin{pmatrix} x_0 \\ x_2 \\ x_1 \\ x_3 \end{pmatrix} \\ &= \begin{pmatrix} \mathbb{I}_{2} & \mathbb{D}_{2} \\ \mathbb{I}_{2} & - \mathbb{D}_{2} \end{pmatrix} \begin{pmatrix} \mathbb{F}_2 \begin{pmatrix} x_0 \\ x_2 \end{pmatrix} \\ \mathbb{F}_2 \begin{pmatrix} x_1 \\ x_3 \end{pmatrix} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & i \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -i \end{pmatrix} \begin{pmatrix} x_0 + x_2 \\ x_0 - x_2 \\ x_1 + x_3 \\ x_1 - x_3 \end{pmatrix} = \begin{pmatrix} x_0 + x_1 + x_2 + x_3 \\ x_0 + i x_1 - x_2 - i x_3 \\ x_0 - x_1 + x_2 - x_3 \\ x_0 - i x_1 - x_2 + i x_3 \end{pmatrix} \end{align*}\]

For the case of inverse transform similarly we have

\[\begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{pmatrix} = \mathbb{F}^{-1}_4 \begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{pmatrix} = \frac{1}{4} \, \overline{\mathbb{F}_4} \begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \end{pmatrix} = \frac{1}{4} \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & - i \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & i \end{pmatrix} \begin{pmatrix} y_0 + y_2 \\ y_0 - y_2 \\ y_1 + y_3 \\ y_1 - y_3 \end{pmatrix} = \frac{1}{4} \begin{pmatrix} y_0 + y_1 + y_2 + y_3 \\ y_0 - i y_1 - y_2 + i y_3 \\ y_0 - y_1 + y_2 - y_3 \\ y_0 + i y_1 - y_2 - i y_3 \end{pmatrix}\]

Implementation in C

First one notes that upto complex conjugation (flip of sign in imaginary part of primitive root) and scaling factor both FFT and inverse FFT routines are same. So one can build helper function (named fft_h in code) which is common to both of these functions.

/* size is assumed to be the power of 2 */
void fft(double* restrict output_re, double* restrict output_img,
	       const double* input_re,	const double* input_img, int size)
{
	fft_h(output_re, output_img, input_re, input_img, size, 1);
}

/* size is assumed to be the power of 2 */
void inv_fft(double* restrict output_re, double* restrict output_img,
	       const double* input_re,	const double* input_img, int size)
{
	fft_h(output_re, output_img, input_re, input_img, size, - 1);


	for(int idx = 0; idx < size; ++idx)
	{
		output_re[idx] /= size;
		output_img[idx] /= size;
	}
}

Next, for fft_h function, the boundary case for size = 1 is dealt with as follows

void fft_h(double* restrict output_re, double* restrict output_img,
	       const double* input_re,	const double* input_img, int size, int sign)
{
	/* Deal with the boundary case first */
	if(size == 1)
	{
		output_re[0] = input_re[0];
		output_img[0] = input_img[0];
		return;
	}

    /* ... Rest of the code */
}

For the sake of simplicity (and to assist in parallelization, which will be the topic of next post) we will make separate copies of inputs and output for the next recursive call. Doing so will only use double \(1 + 2 + 4 + \ldots + 2^{N - 1} = 2^N - 1\) the amount of memory required to store input variables.

int h_size = (size / 2);
	
double* buf_mem = (double*) malloc( 4 * size * sizeof(double) );
	
double* input_ev_re = buf_mem;
double* input_ev_img = buf_mem + h_size;

double* input_odd_re = buf_mem + 2 * h_size;
double* input_odd_img = buf_mem + 3 * h_size;

double* output_ev_re = buf_mem + 4 * h_size;
double* output_ev_img = buf_mem + 5 * h_size; 

double* output_odd_re = buf_mem + 6 * h_size;
double* output_odd_img = buf_mem + 7 * h_size;


/* Copy the original input to the inputs for the next recursive calls */

for(int idx = 0; idx < h_size; ++idx)
{
	int d_idx = (idx << 1);
		
	input_ev_re[idx] = input_re[d_idx];
	input_ev_img[idx] = input_img[d_idx];

	input_odd_re[idx] = input_re[d_idx + 1];
	input_odd_img[idx] = input_img[d_idx + 1];
}

/* ... Rest of the code ... */

free(buf_mem);

Next, goes the recursive calls (which can run independent of each other)

fft_h(output_ev_re, output_ev_img, input_ev_re, 
		input_ev_img, h_size, sign);

fft_h(output_odd_re, output_odd_img, input_odd_re,
		input_odd_img, h_size, sign);

Rest of the code assembles output from the outputs of recursive calls. It involves adding those up weighted by the power of primitive root. In general for FFT, one can choose any primitive root of unity one likes but for inverse FFT it should exactly be the complex conjugate of the previously chosen root. Keeping this in mind, below is the code to generate final output

double angle = (2 * PI) / size;
	
double prim_root_re = cos(angle);
double prim_root_img = sign * sin(angle); /* sign = 1: FFT and  -1: Inverse FFT */

/* factor variable carries the integer power of primitive root.  
	At iteration i of loop below factor = (primitive root) ^ i */

double factor_re = 1.0;
double factor_img = 0.0;

for(int idx = 0; idx < h_size; ++idx)
{
	double s_re = factor_re * output_odd_re[idx] - factor_img * output_odd_img[idx];
	double s_img = factor_re * output_odd_img[idx] + factor_img * output_odd_re[idx];
	
	output_re[idx] = output_ev_re[idx] + s_re; 
	output_img[idx] = output_ev_img[idx] + s_img;
		
	output_re[h_size + idx] = output_ev_re[idx] - s_re;
	output_img[h_size + idx] = output_ev_img[idx] - s_img;

		
	double t0 = factor_re * prim_root_re - factor_img * prim_root_img;
	double t1 = factor_img * prim_root_re + factor_re * prim_root_img;
		
	factor_re = t0;
	factor_img = t1;
}

Future Work

So far we have only dealt with the case of taking FFT where the input size N is a power of two. In the next post, we shall look at the more general case. Moreover we shall see how the code above can be parallelized (in either of OpenMP, MPI and CUDA frameworks). There is ample opportunity for “instruction level parallelization” as many of the instructions inside the loop can be run simultaneously. We shall ponder over all of these issues in the next post.