Introduction
The use of digital polyphase filter banks in modern signal processing systems has
increased dramatically over recent years. This increased use stems from a desire to
deliver more capable, innovative, and accessible systems while sharing a common
physical infrastructure. This is especially true when strict regulations on Radio
Frequency (RF) spectrum use result in spectrum congestion. Systems engineers are
attempting to push beyond the boundaries of existing practical limits by leveraging
hardware platforms that deliver greater processing bandwidth to compensate for
different adverse effects that become evident at higher frequencies.
Today’s state-of-the-art systems use extremely high RF carrier frequencies that
operate at millimeter wavelengths. This dramatic expansion opens up many
unoccupied (“white”) bands that system engineers can leverage to extend the
operational bandwidth of their systems without interference. The ability to expand
the operational bandwidth provides an opportunity to expand usage and features for
almost every RF application. This is especially true for communication systems,
which use wide bandwidths to provide higher data rates over the channel, as defined
by the Shannon capacity theorem.
Shannon formula: C = BW·log2(1 + SNR)
The Shannon channel capacity theorem dictates that if a system uses higher
bandwidth, it can operate at a much lower signal-to-noise ratio (SNR), which results
in a lower transmit power density. For communications systems, this translates to a
lower probability of intercept or jamming, while also giving these systems better
resilience to multi-path effects by operating with a shorter pulse duration for
transmitted bits. Modern radar systems also benefit from the ability to operate at
wider bandwidths through increased resolution and a better ability to discriminate
two adjacent range reflections. Synthetic aperture radar (SAR) requires an extremely
wide waveform bandwidth to acquire terrain images with resolution of inches.
FPGAs provide an ideal processing platform for the implementation of polyphase
filter banks for wideband communications and radar systems. Currently, the largest
available FPGA device includes over 10,000 high-performance mathematical
blocks, randomly accessible memory, and programmable routing that enable the
use of multiple processing pipes executing in parallel while maintaining overall
power consumption efficiency. FPGAs are widely used to implement digital
filter-bank applications.
This technical white paper dives deep into the capabilities of polyphase fast Fourier
transforms (FFTs) for Intel® FPGAs.
Table of Contents
Introduction
Theory of Operations
Polyphase Filter Bank
Oversampling Channelizer
Filter Design
Inverse Channelizer
Fixed-Point vs Floating-Point Numeric Precision
Implementation
Polyphase Filter Implementation in DSP Builder for Intel FPGAs
Channelizer Forward FFT
Polyphase FFT
Real FFT
Oversampling Channelizer Structure
Inverse Channelizer – Inverse FFT
Conclusion
References
Intel FPGAs enable high-performance wideband digital polyphase filter banks
for modern communication and radar applications.
Authors
Dan Pritsker
Engineering Manager
Intel® Corporation
Hong Shan Neoh
Design Engineer
Intel Corporation
Colman Cheung
Design Engineer
Intel Corporation
Amulya Vishwanath
Product Marketing Manager
Intel Corporation
Tom Hill
Product Line Manager
Intel Corporation
Versatile Channelizer with
DSP Builder for Intel® FPGAs
Intel® FPGA White Paper
Theory of Operations
A channelizer, or analysis filter bank, takes a signal with a certain bandwidth and
divides it into several narrowband signals. Each output band, or channel, is a
baseband signal itself. A channelizer achieves this by modulating (frequency
shifting) the band of interest to baseband, filtering out the unwanted bands, and then
decimating the result to a narrowband signal, as shown on the left-hand side of
Figure 1. Alternatively, the band of interest can be band-passed first with a complex
filter, modulated to baseband, and then decimated to a narrowband signal, as shown
on the right-hand side of Figure 1.
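As a concrete illustration of the left-hand path in Figure 1, the Python sketch below extracts a single channel by mixing it to baseband, low-pass filtering, and decimating. The sample rate, channel center frequency, filter length, and channel count are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.signal import firwin, lfilter

# Illustrative parameters (assumptions, not from the paper)
fs, f_center, K = 1024e6, 96e6, 64          # input rate, channel center, channel count
n = np.arange(8192)
x = np.random.randn(len(n))                 # stand-in wideband input

# 1) Modulate (frequency shift) the band of interest down to baseband
baseband = x * np.exp(-2j * np.pi * f_center / fs * n)

# 2) Low-pass filter to the channel bandwidth fs/K
h = firwin(numtaps=256, cutoff=0.5 * fs / K, fs=fs)
filtered = lfilter(h, 1.0, baseband)

# 3) Decimate by K down to the narrowband output rate fs/K
channel = filtered[::K]
```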
Polyphase Filter Bank
Instead of extracting each channel individually, all channels can be extracted very
efficiently with a polyphase filter bank (PFB) and an FFT, as shown in Figure 2.
The details of the mathematical derivation of the FFT-based PFB channelizer can be
found in [1], [2], and [3]; the intuition is offered below. To aid the explanation, let K
be the total number of channels and k be the channel index, ranging from 0 to K-1.
Let p be the phase or branch index of the PFB, also ranging from 0 to K-1. Let T be
the number of coefficient taps per phase; the overall filter therefore has K*T
coefficients.
A polyphase filter (PF) is formed by breaking down a prototype low-pass filter into
K sub-filters, interleaving the coefficients of the prototype filter across K groups. For
example, a prototype filter with taps [h0, h1, ..., h(T*K-1)] can be arranged as K
sub-filters as shown in Figure 3. These K sub-filters still bear the character of the
original prototype filter, just downsampled by K. The sub-filters are also separated
in phase by 2π/K.
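The arrangement in Figure 3 amounts to reshaping the prototype coefficients; a minimal numpy sketch (K and T are arbitrary illustration values):

```python
import numpy as np

K, T = 8, 4                  # channels and taps per phase (illustrative values)
h = np.arange(K * T)         # prototype coefficients h0, h1, ..., h(K*T-1)

# Sub-filter p holds every K-th coefficient starting at h_p:
#   H_p = [h_p, h_(p+K), h_(p+2K), ..., h_(p+(T-1)K)]
H = h.reshape(T, K).T        # row p is sub-filter H_p(n)
```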
Figure 1. Channelizer Illustrations for Individual Channel
Figure 2. Block Diagram of PFB and FFT Channelizer
The selector on the left-hand side of Figure 2 also downsamples the input by K:1
into each sub-filter. At the output of the polyphase filter, there are K downsampled
outputs with different phase information. Notice that the amount of information at
the input and output of the PF has not changed; it has only been transformed and
band-limited per the prototype filter design.
If the outputs of the PF are simply summed, only the channel at baseband (the 0th
channel, k=0) is preserved, as shown in the top-left of Figure 4. Information on all
other channels is destructively canceled out. If each phase p of the PFB output is
spun by W_K^{-p} before being summed, then the next (k=1) channel is extracted to
baseband, as shown in the top-right of Figure 4. If the PFB phase outputs are spun by
W_K^{-kp} before being summed, then the k-th channel is extracted to baseband, as
shown in the bottom of Figure 4.
The outputs X_0(m), X_1(m), and X_k(m) are the 0th, 1st, and k-th outputs of a
regular DFT or FFT. The FFT efficiently modulates all the PFB phase outputs into
individual baseband signals. The amount of computation for the modulation is
reduced from K^2 to K·log2(K). It should be clear that the FFT operation here has
nothing to do with time-to-frequency domain conversion; it is just a computationally
efficient way to implement the frequency spinners W.
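Putting the pieces together, a critically sampled PFB channelizer can be sketched in a few lines of numpy. This is a simplified illustration under assumptions of my own (commutator ordering, and the filter applied as a correlation without time reversal), not the DSP Builder implementation described later.

```python
import numpy as np
from scipy.signal import firwin

def pfb_channelizer(x, K, T):
    """Critically sampled PFB sketch: K channels, T taps per phase."""
    h = firwin(K * T, 1.0 / K)                 # prototype low-pass (cutoff ~ half a channel width)
    H = h.reshape(T, K)                        # row t holds taps h_(tK) ... h_(tK+K-1)
    nblocks = len(x) // K - (T - 1)
    out = np.empty((nblocks, K), dtype=complex)
    for m in range(nblocks):
        block = x[m * K:(m + T) * K].reshape(T, K)   # T consecutive groups of K input samples
        phases = np.sum(H * block, axis=0)           # one T-tap dot product per phase
        out[m] = np.fft.fft(phases)                  # the FFT spins every channel to baseband at once
    return out                                        # out[m, k] is the m-th output of channel k

# With a real input, channels k and K-k come out as complex-conjugate mirrors.
```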
Figure 3. Formation of Polyphase Filter Coefficients
Figure 4. PFB Channelizer Mechanism for Channel 0, 1 and k
[Figure 1: the left path modulates the band of interest to baseband, low-pass filters, and downsamples; the right path band-pass filters, modulates to baseband, and downsamples. Figure 2: X(n) feeds sub-filters H_0(n) through H_{K-1}(n), whose outputs drive a K-point FFT producing X_0(m) through X_{K-1}(m). Figure 3: sub-filter H_p(n) holds coefficients h_p, h_{p+K}, h_{p+2K}, ..., h_{p+(T-1)K}. Figure 4: the sub-filter outputs are spun by W_K^0, W_K^{-k}, ..., W_K^{-k(K-1)} and summed to extract channel k.]
Oversampling Channelizer
There are variations of channelizers with different characteristics. The
aforementioned one (the critically sampled channelizer) is the basic case, where the
bandwidth of each channel is exactly 1/K of the original signal spectrum. Naturally,
the output baseband is also decimated by K. It is possible to design a channelizer so
that the bandwidth of each channel, 1/M, is larger than 1/K of the original spectrum.
In other words, the channel spectrum overlaps with its neighbors. This is a very
useful class of channelizer and is called an oversampling channelizer, an overlapping
channelizer, or a weighted overlap-and-add (WOLA) structure in [3].
Let M be the channelizer decimation ratio (between the input sample rate and the
output sample rate of a channel). A critically sampled channelizer has M equal to K.
An oversampling channelizer has M < K. Figure 5 shows the effect of an
oversampling channelizer achieved by overlapping the input samples. The ratio K/M
is commonly known as the oversampling ratio; for every M inputs, there are K
outputs (one for each channel).
To provide more output samples, the inputs to the polyphase filter must be increased
accordingly. This is done by making the decimation rate M less than K, so there are
K-M overlapping input samples to the PFB. Figure 6 shows the arrangement of data
feeding into the corresponding sub-filters for the dot multiplications. Each column of
orange boxes represents a group of polyphase filter input data that generates one
column of (K) outputs. Notice that two successive groups of input data are only M
samples apart; hence, the decimation rate is M. For example, if M=K/2, successive
PFB inputs are only K/2 samples apart, and the bottom half of the first data group
overlaps with the top half of the second group. In that case, both the PF and the FFT
must work twice as hard to produce 2X the output samples.
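The data arrangement of Figure 6 can be sketched directly: successive polyphase input groups advance by only M samples, so the rest of each group is reused from earlier input. The sizes below are illustrative assumptions.

```python
import numpy as np

K, M, T = 8, 4, 3                      # channels, decimation, taps per phase (illustrative)
x = np.arange(10 * K * T)              # input samples D_0, D_1, D_2, ...

def pfb_input_group(g):
    """K*T samples for the g-th PFB invocation; row p feeds sub-filter H_p(n)."""
    block = x[g * M : g * M + K * T]   # groups start only M samples apart
    return block.reshape(T, K).T       # row p = [D_{gM+p}, D_{gM+p+K}, D_{gM+p+2K}, ...]

first, second = pfb_input_group(0), pfb_input_group(1)
# Each new invocation consumes only M new samples (here 4);
# the remaining K*T - M samples are reused from earlier input.
```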
Figure 5. Input/Output Relationship of the Oversampling Channelizer
There is one additional computation step when M is not equal to K. Referring to
Figure 7, which summarizes the relationship between M and K, the modulation term
W_K^{-kmM} becomes 1 when M=K, but evaluates to different values based on m
and k, where m is the output index. This modulating term is applied after the FFT
output in the PFB implementation. The top portion of Figure 8 shows a complete
oversampling channelizer structure.
One optimization is that modulation after the FFT is the same as a time delay
(shifting the input) before the FFT. Doing so helps reduce the number of required
multipliers, at the expense of buffering memory. Such buffering may come for free if
buffering is required anyway. The bottom portion of Figure 8 shows such a structure.
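The equivalence exploited in the bottom of Figure 8 is easy to check numerically: multiplying the k-th FFT output by W_K^{-kmM} = e^{-2πj·k·mM/K} gives the same result as circularly shifting the FFT input by mM samples first. A numpy sketch with arbitrary sizes:

```python
import numpy as np

K, M, m = 16, 8, 3                       # channels, decimation, output index (illustrative)
x = np.random.randn(K) + 1j * np.random.randn(K)
k = np.arange(K)

post_fft = np.fft.fft(x) * np.exp(-2j * np.pi * k * m * M / K)   # modulate after the FFT
pre_shifted = np.fft.fft(np.roll(x, m * M))                      # circularly shift before the FFT
assert np.allclose(post_fft, pre_shifted)                        # identical outputs
```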
Figure 6. Polyphase Filter Input Data Arrangement
Figure 7. Channelizer Characteristics per M and K Relationship
Figure 8. Oversampling Channelizer Structures
[Figure 5: a critically sampled channelizer (decimation M equals FFT length K, 0% overlap) versus a 2X oversampled channelizer (M = K/2, 50% overlap); the overlap between successive input groups is K-M samples.]
[Figure 6: in the first group, sub-filter H_p(n) is fed D_p, D_{p+K}, D_{p+2K}, ...; in the second group it is fed D_{M+p}, D_{M+p+K}, D_{M+p+2K}, ...; successive groups are only M samples apart.]
X_k(m) = W_K^{-kmM} · Σ_{n=-∞}^{∞} [h(mM - n)·W_K^{k(mM-n)}]·x(n), where the bracketed term band-pass filters and the leading W_K^{-kmM} term downconverts.
When M = K:
• Decimation equals the channel count
• Bands are critically sampled
• W_K^{-kmM} = 1
• Nyquist zone aliasing
When M < K:
• Decimation is less than the channel count
• Bands are oversampled
• Requires additional shifting in the frequency domain (or a sample delay in the time domain)
• When M = K/2 (50% channel overlap), W_K^{-kmM} = -1 for odd k and odd m
• When M = 3K/4 (25% channel overlap), W_K^{-kmM} is a quarter-cycle spinning
Equation parameters:
• K: number of channels
• M: decimation ratio
• k: channel index
• n, m: input/output index
• W_K^{-kmM} = e^{-2πj·k·m·(M/K)}
[Figure 8: top, the sub-filter outputs pass through the FFT and each channel output X_k(m) is then modulated by W_K^{-kmM}; bottom, a circular shifting buffer placed before the FFT replaces the bank of post-FFT modulators.]
Filter Design
The prototype filter of a polyphase filter is basically a low-pass decimation filter with
a bandwidth of 1/M. In an oversampling channelizer, 1/M is larger than 1/K. A
conservative design may set the stop band at 1/M, as in the top of Figure 9, so that
the transition band is (1/M - F_pass)/2. A more resource-efficient design doubles the
transition bandwidth, as shown in the bottom of Figure 9, so that half of the
transition band is within 1/M and the other half is outside; in other words,
(F_stop + F_pass)/2 equals 1/M. The energy outside 1/M folds back into the
transition band within 1/M, and the passband is not affected. It may nevertheless
introduce slightly more noise from the larger transition band. Doubling the transition
bandwidth roughly halves the number of coefficients.
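The coefficient savings can be seen from a Kaiser-window design estimate: the required tap count is roughly inversely proportional to the transition width, so doubling the transition bandwidth of Figure 9 roughly halves the taps. The attenuation and transition width below are illustrative assumptions.

```python
from scipy.signal import kaiserord

atten_db = 80.0   # desired stop-band attenuation (assumption)
trans = 0.01      # transition width, normalized so 1.0 is the Nyquist rate (assumption)

taps_tight, _ = kaiserord(atten_db, trans)        # stop band held entirely inside 1/M
taps_relaxed, _ = kaiserord(atten_db, 2 * trans)  # transition band doubled, centered on 1/M
print(taps_tight, taps_relaxed)                   # roughly a 2:1 reduction in coefficients
```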
Inverse Channelizer
An inverse channelizer, or synthesis filter bank, is simply the reverse of the
channelizer. Figure 10 shows a back-to-back channelizer and inverse channelizer. For
x(n) at the output to be perfectly reconstructed from x(n) at the input, the filters H(z)
and G(z) have to be inverses of each other; that is, H(z)G(z) = c·z^{-d}·I_K, where c
and d are constants.
Fixed-Point vs Floating-Point Numeric Precision
When implementing algorithms, hardware developers need to trade off between
floating-point and fixed-point numeric precision. Algorithms are typically developed
using double-precision floating-point numbers. This provides almost unlimited
dynamic range and extremely fine resolution, and allows the algorithm developer to
focus on the mathematical part of the algorithm without being concerned about the
effects of a reduced-precision data type. When implementing these algorithms in
hardware, however, double-precision floating-point is expensive, forcing the
designer to evaluate reduced-precision data type formats, typically single-precision
floating-point or fixed-point. These data types can reduce the cost and power of the
final system, but they add to the complexity of the design process by requiring
analysis of the reduced numeric precision's effects on the algorithm response.
A fixed-point number dedicates a fixed number of bits to the integer and fractional
parts of the number. No matter how large or small the number, the same number of
bits is used for each portion. For example, a 16-bit fixed-point number is typically
defined using the nomenclature "unsigned (16,13)", which means an unsigned
number that uses a total of 16 bits, where 13 bits are used for the fractional part
(right of the binary point) and 3 bits are used for the integer part (left of the binary
point). If more numeric range is required, more bits can be allocated to the integer
portion; if finer resolution is required, more bits can be assigned to the fractional
portion.
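As a small illustration, the sketch below quantizes a value to that unsigned (16,13) format by scaling to 2^13 steps, rounding, and saturating; it is an illustrative helper, not part of any Intel tool.

```python
def to_ufixed_16_13(value):
    """Quantize to unsigned (16,13): 16 bits total, 13 fractional, 3 integer."""
    frac_bits, total_bits = 13, 16
    scale = 1 << frac_bits                         # 2**13 = 8192 steps per unit
    raw = int(round(value * scale))                # nearest representable step
    raw = max(0, min(raw, (1 << total_bits) - 1))  # saturate to the 16-bit range [0, 8)
    return raw / scale

print(to_ufixed_16_13(3.14159))   # 3.1416015625 -- resolution is 2**-13, about 0.000122
```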
In contrast, a floating-point number does not reserve a specific number of bits for the
integer or fractional part. Instead, it reserves a certain number of bits for the number
itself (called the mantissa or significand) and a certain number of bits to indicate
where within that number the decimal point sits (called the exponent). So, a
floating-point number that takes up 10 digits with 2 digits reserved for the exponent
might represent a largest value of 9.9999999e+50 and a smallest value of
0.0000001e-49. An IEEE 754 double-precision floating-point number requires 64
bits, and an IEEE 754 single-precision floating-point number requires 32 bits.
Converting an algorithm from floating point to fixed point can be a complex and
tedious process that requires analysis of the effects of the reduced numeric precision
on the algorithm's performance, but it can yield hardware cost and power savings.
Intel FPGAs support both floating- and fixed-point numbers, but historically FPGAs
have supported "soft" implementations of floating-point functionality built from
fixed-point digital signal processing (DSP) blocks and other logic resources. Such
implementations provided good numerical performance but suffered from FPGA
resource over-usage and high power consumption compared to fixed-point
implementations. Intel Stratix 10 and Intel Arria® 10 FPGAs have solved this
problem by incorporating IEEE 754 single-precision floating-point functionality into
each DSP block. This makes the usage decision much easier, as it carries no resource
penalty and a negligible power consumption difference versus the fixed-point mode
of the DSP block.
Figure 9. Low Pass Filter Design
Figure 10. Back-to-Back Channelizer and Inverse Channelizer
Implementation
Polyphase Filter Implementation in DSP Builder for Intel FPGAs
The way the K logical phases of a polyphase filter are arranged when designing for an FPGA depends on the sample rate and the
clock speed of the implementation. A comfortable clock rate for traditional FPGA signal processing is about 200 to 400 MHz.
The phases can be time-division multiplexed (TDM) in one datapath, spread across multiple parallel datapaths, or a combination of both.
For example, if the input sample rate is 1024 Msps and there are 512 channels or phases, each channel only requires an output
sample rate of 2 Msps. The most efficient implementation would use four datapaths with each datapath carrying 128
channels. These variations of parallel paths and TDM paths can be conveniently implemented in the DSP Builder for Intel
FPGAs tool.
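The folding arithmetic behind that example is shown below, assuming a 256 MHz fabric clock (an assumption; the paper only states the 200 to 400 MHz range):

```python
sample_rate_msps = 1024          # input sample rate from the example above
clock_mhz = 256                  # assumed fabric clock within the 200 to 400 MHz range
channels = 512

parallel_paths = sample_rate_msps // clock_mhz      # 4 samples must be consumed per clock cycle
tdm_channels_per_path = channels // parallel_paths  # 128 channels time-multiplexed on each path
per_channel_rate = sample_rate_msps / channels      # 2 Msps at every channel output
```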
The DSP Builder for Intel FPGAs is a MathWorks* Simulink*-based tool with Intel FPGA-specific components. The MathWorks
Simulink tool natively handles vector signals and vector processing engines. Filters can be designed with primitive components
or with the built-in filter library model. For the polyphase filter design, a custom library template was built using primitive
components, with a configurable number of parallel paths, number of TDM channels, and prototype coefficients.
When the Simulink or DSP Builder for Intel FPGAs Advanced Blockset (DSPBA) model runs, the polyphase filter is built at compile
time. The number of parallel filters is set by the vector width. The corresponding coefficients (embedded with the channel
counts, taps-per-phase, and data type) are passed down to each parallel filter as shown in Figure 11, which also supports TDM
channels and vector processing (for taps-per-phase).
The primitive lter uses streaming interface at the input and output. The streaming interface consists of 3 signals – valid
(v), channel count (c), and data (d). The channel count indicates the TDM phase and selects the corresponding phase of 14
coecients in parallel. The 14 coecients are dot multiplied with a vector of 14 data and a scalar result is generated.
Channelizer Forward FFT
In a standard superheterodyne structure, the input signal is mixed with a frequency provided by a local oscillator to translate it
to baseband. In the polyphase channelizer architecture, the complex multiplier of the superheterodyne structure is absorbed
into an FFT. This provides an efficient implementation compared to direct multiplication. The FFT acts as a bank of spinners,
or phase rotators, that efficiently convert each channel to baseband. Like a beamforming operation, it constructively sums the
selected aliased frequency components located in each path while simultaneously canceling the remaining aliased spectral
components destructively. While the polyphase filter performs the differential phase shifts across the same channel
bandwidths, the FFT performs the phase alignment of the band center for each of the aliased spectral bands.
In the channelizer design, the FFT operation divides the input bandwidth into a number of evenly spaced frequency bands,
commonly referred to as "bins", so that the frequency content of an input signal can be analyzed. Instead of applying an
individual set of phase rotators to the filter stage outputs and summing them to form each channel output, the discrete Fourier
transform (DFT) can be used to apply the phase shifters for all the channels to be extracted from the aliased signal set
simultaneously. For computational efficiency, the FFT algorithm is used to implement the DFT. In terms of computational
complexity and execution time, the polyphase FIR filter grows at the rate of O(N), while the FFT grows at the rate of O(N log N),
where N is the number of channels. As the number of channels increases, the FFT dominates the overall execution. Hence, it is
critical to have an efficient implementation of the FFT.
Figure 11. Polyphase Filter Implementation with DSPBA
Direct implementation of the DFT requires on the order of N^2 operations to compute the channel values y_0, ..., y_{N-1} for a single
block of N data points. This is called the discrete Fourier transform. When N = 2^i, where i is a positive integer, and intermediate
values are reused, the number of computations can be reduced to the order of N·log N operations; hence the name fast Fourier
transform.
Polyphase FFT
For a polyphase FFT with multiple parallel phases, the FFT decomposition breaks the FFT into smaller building blocks based
on the number of phases and the FFT size. This is similar to the Cooley-Tukey decomposition shown in the following equation.
X(k) = Σ_{n=0}^{N-1} x(n)·W_N^{kn}

X(N_2·k_1 + k_2) = Σ_{n_1=0}^{N_1-1} [ W_N^{n_1 k_2} · ( Σ_{n_2=0}^{N_2-1} x(N_1·n_2 + n_1)·W_{N_2}^{n_2 k_2} ) ] · W_{N_1}^{n_1 k_1}, where N = N_1·N_2
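The decomposition can be verified numerically with a short numpy sketch (the factor sizes are arbitrary illustration values; W_M^e denotes e^{-2πj·e/M}):

```python
import numpy as np

N1, N2 = 4, 8
N = N1 * N2
x = np.random.randn(N) + 1j * np.random.randn(N)

W = lambda M, e: np.exp(-2j * np.pi * e / M)   # twiddle factor W_M^e

X = np.zeros(N, dtype=complex)
for k1 in range(N1):
    for k2 in range(N2):
        acc = 0j
        for n1 in range(N1):
            # inner N2-point DFT over n2, then the W_N^{n1*k2} twiddle,
            # then the outer N1-point DFT term W_{N1}^{n1*k1}
            inner = sum(x[N1 * n2 + n1] * W(N2, n2 * k2) for n2 in range(N2))
            acc += W(N, n1 * k2) * inner * W(N1, n1 * k1)
        X[N2 * k1 + k2] = acc

assert np.allclose(X, np.fft.fft(x))   # matches the direct N-point FFT
```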
The radix-2² algorithm referenced in [4] is a very efficient hardware architecture in which pipelined feedforward structures
use separate hardware for each stage. This allows streaming data to be processed continuously, in contrast to a block-based
FFT, which processes the data in place and requires separate buffers.
Radix-2² is based on radix-2, and the flow graph of a radix-2² DIF FFT can be obtained from the graph of a radix-2 DIF
FFT. The main difference is the twiddle factors. As proposed by Garrido, in the radix-2² case the twiddle factors are reduced
to a trivial rotation in the odd stages and a non-trivial rotation in the successive stage, as shown in Figure 12.
As a result, half or fewer of the stages require complex multipliers for the butterfly rotations, reducing the computational
requirement.
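The trivial twiddles (1, i, -1, -i in Figure 12) need no multiplier at all; each is just a swap and/or negation of the real and imaginary parts, as the small sketch below shows (in hardware this reduces to wire swaps and adder sign control):

```python
# Quarter-cycle rotations implemented with swaps and negations only (no multiplier).
def trivial_rotate(re, im, quarter_turns):
    """Rotate (re + j*im) by (-j)**quarter_turns."""
    q = quarter_turns % 4
    if q == 0:                 # * 1
        return re, im
    if q == 1:                 # * -j : (a + jb)(-j) = b - ja
        return im, -re
    if q == 2:                 # * -1
        return -re, -im
    return -im, re             # * +j : (a + jb)(j) = -b + ja
```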
The radix-2² structure inherently supports parallelism. Figure 13 shows a 16-point, 4-parallel radix-2² feedforward FFT
architecture. The architecture is made up of radix-2 butterflies (R2), non-trivial rotators, trivial rotators (which are diamond-
shaped), and shuffling structures, which consist of buffers and multiplexers. The lengths of the buffers are indicated by a
number.
[Figure 12: four radix-2² stages; trivial twiddles (1, i, -1, -i) follow stages 1 and 3, a full complex multiplier with twiddle memory follows stage 2, and each stage combines the butterfly with data reordering using delay lines (Z^{-k}) and multiplexers.]
Figure 12. Radix-2² FFT Structure
However, as the parallelism of the parallel radix-2² topology continues to increase, the degree of coupling
between the FFT paths grows increasingly complex, and the connections become difficult to route and close timing on. This has
led to a hybrid FFT architecture that leverages the radix-2² concepts but combines serial and parallel FFT structures to
create a scalable, high-performance, and resource-efficient implementation.
The serial FFTs are more memory efficient. The single-path delay feedback architecture stores the butterfly output in
feedback shift registers, where the feedback delays essentially store half the elements at each stage. This keeps the memory
footprint minimal. A fully parallel FFT using a split-radix architecture takes advantage of constant-multiplier
optimization and uses fewer overall multiplications.
Serial FFTs can be combined with parallel FFTs to produce hybrid FFTs. When handling an input with multiple phases,
the serial FFT section consists of multiple single-wire serial FFTs operating in parallel, and there is naturally a twiddle block
between the serial and parallel sections. The parallel section is more multiplier efficient but less memory efficient than the
serial section. To take advantage of both architectures, the serial FFT section implements the early stages of the FFT, where the
shuffling structures have longer feedback buffers, and the latter stages are implemented using a parallel FFT section. This
provides a good trade-off between memory and multiplier resource usage. Figure 14 shows a 16-path hybrid 512K-point FFT
where the first six stages are implemented using a serial FFT section and the latter stages are implemented using a parallel
FFT section, with the twiddle stages in between.
The FFT intellectual property (IP) block within the model-based DSP Builder for Intel FPGAs tool flow allows full
parameterization of the core, from the FFT size, number of parallel phases, and number of serial/parallel stages to the twiddle
factor implementation options that trade off resources between logic and dedicated DSP blocks. The core also supports
both fixed-point and single-precision floating-point data types. For the fixed-point configuration, there are various pruning
strategies that allow SNR performance and resource usage to be traded off.
Figure 13. 4-Parallel Radix-2² Feedforward Architecture for the Computation of the 16-Point DIF FFT
Figure 14. Architecture of a 512K-Point Hybrid FFT with Serial and Parallel FFT Sections
[Figure 13: four parallel lanes of radix-2 butterflies (R2), shuffling buffers, and trivial and non-trivial rotators across four stages. Figure 14: sixteen 64-point serial FFTs feed, through twiddle stages, an 8K-point 16-path parallel FFT to form the 512K-point hybrid FFT.]
Real FFT
There are many cases where the input to the PFB is real instead of complex, for example, when the input comes directly
from an analog-to-digital converter (ADC) with a real-only output. The FFT is natively complex; a simple approach is to set the
imaginary part of the data to zero and proceed with the FFT. Half of the FFT outputs can then be ignored, because a real input
produces a mirrored spectrum.
It is possible to take advantage of the unused imaginary input and perform the transform with a half-length FFT plus additional
processing. The steps are outlined below; see [2] for more details.
1. g is a real-valued sequence of 2N points and is mapped onto an N-point complex FFT, so that
x(n) = x_1(n) + j·x_2(n), where x_1(n) = g(2n) and x_2(n) = g(2n+1)
2. Transform x(n) to X(k) with the N-point FFT. With the help of x*(n), note that
x_1(n) = ½[x(n) + x*(n)] and x_2(n) = (1/2j)[x(n) - x*(n)]
3. After the FFT, form X_1(k) and X_2(k) from X(k) and X(N-k):
X_1(k) = ½[X(k) + X*(N-k)], X_2(k) = (1/2j)[X(k) - X*(N-k)]
Note that the time reverse of X(k) is X(N-k-1), not X(N-k), for k = 0, 1, ..., N-1.
When k = 0, X(N-k) = X(N) = X(0); thus X(N-k) is a shifted time reverse.
4. Form G(k) as a function of X_1(k) and X_2(k):
G(k) = X_1(k) + W_{2N}^k·X_2(k), and G(k+N) = X_1(k) - W_{2N}^k·X_2(k), for k = 0, 1, ..., N-1
Only G(k) is needed; G(k+N) is the mirrored output at negative frequency.
Figure 15 shows the computation chain implementing the real FFT. This process also works with a parallel FFT, albeit in a
slightly more complicated form.
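The steps above can be checked with a short numpy sketch; the helper name and sizes are illustrative, and the result is compared against a direct FFT of the real sequence:

```python
import numpy as np

def real_fft_via_half_length(g):
    """FFT of a 2N-point real sequence using one N-point complex FFT (steps 1-4 above)."""
    N = len(g) // 2
    x = g[0::2] + 1j * g[1::2]              # step 1: x1 = even samples, x2 = odd samples
    X = np.fft.fft(x)                       # step 2: N-point complex FFT
    Xr = np.conj(np.roll(X[::-1], 1))       # X*(N-k), the shifted time reverse of X(k)
    X1 = 0.5 * (X + Xr)                     # step 3
    X2 = (X - Xr) / 2j
    W = np.exp(-2j * np.pi * np.arange(N) / (2 * N))   # W_2N^k
    return X1 + W * X2                      # step 4: G(k) for k = 0..N-1

g = np.random.randn(32)
assert np.allclose(real_fft_via_half_length(g), np.fft.fft(g)[:16])
```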
When the FFT size is large, this approach may not be very efficient, because the shifted time reverse operation needs additional
double-buffer memory, and the latency is also increased by the FFT length. For large FFTs or latency-sensitive applications, it
may be simpler to first convert the real input to a complex signal with the same signal bandwidth and then proceed with a
half-length FFT. This can be done efficiently with a quarter-band shift and a 2:1 decimating half-band filter.
Oversampling Channelizer Structure
As mentioned in the Theory of Operations section, an oversampling channelizer is characterized by M, the decimation rate, being
less than K, the number of channels. Depending on how much smaller M is than K, there are two different architectures.
First, if K is only slightly more than M, say by 20%, the polyphase filter and the FFT can simply run 20% faster to provide the
computation bandwidth. Second, when K is about twice M, the amount of computation doubles relative to the M=K case, so
the processing pipe can be doubled while keeping the same operating clock.
It may not be trivial to design the data buffer that provides the data sequences shown in Figure 6. Figures 16(a) and 16(b) show
conceptual schematics for generating the overlapping data when TDM equals 1 and S, respectively; the control logic is not
shown. T is the number of taps per phase. New data is represented by blue wires and old data by brown wires. When TDM=1,
all K channels are parallel paths. Input data arriving at the lower rate is buffered with a FIFO. The M new data samples, together
with M*(T-1) + (K-M)*T old samples, provide the data needed for filtering.
When there are S TDM slots, K/S phases or channels are handled every clock cycle. Some cycles may not use new
data at all. Consider an oversampling ratio of K/M: for every M inputs there should be K outputs, so there are (K-M)/S cycles
in which only old data from memory is used.
Figure 15. FFT-Optimized for Real Input
[Figure 15: even/odd interleaving of g(n) → half-length FFT → shifted time reverse → transform and twiddle → G(k).]
G(k) = Σ_{n=0}^{N-1} g(2n)·W_{2N}^{2nk} + Σ_{n=0}^{N-1} g(2n+1)·W_{2N}^{(2n+1)k} = Σ_{n=0}^{N-1} x_1(n)·W_N^{nk} + W_{2N}^k·Σ_{n=0}^{N-1} x_2(n)·W_N^{nk}
(a) Data Delay Memory Implementation with TDM = 1
(b) Data Delay Memory Implementation with TDM = S
Figure 16. Data Delay Memory Implementations with TDM = 1 and TDM = S
It may also not be obvious that, in the case of dual processing pipes (K is about 2*M), one data delay memory may serve both
polyphase filters, as shown in Figure 17. Doing so saves quite a bit of memory.
Inverse Channelizer – Inverse FFT
The inverse channelizer, also referred to as the synthesizer, reconstructs a wideband time-domain signal from the
frequency-domain channel outputs of the channelizer. This is done using an inverse FFT. The inverse FFT structure is similar to
the forward FFT block used in the channelizer; a simple approach is to swap the real and imaginary parts, as shown in Figure 18.
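The swap identity behind Figure 18 is easy to verify numerically: applying the real/imaginary swap S before and after a forward FFT, then dividing by N, reproduces the inverse FFT. A numpy sketch:

```python
import numpy as np

def swap(z):
    """S(x) = Im(x) + j*Re(x): exchange the real and imaginary parts."""
    return z.imag + 1j * z.real

X = np.random.randn(64) + 1j * np.random.randn(64)
inverse = swap(np.fft.fft(swap(X))) / len(X)   # forward FFT plus swaps and a 1/N scaling
assert np.allclose(inverse, np.fft.ifft(X))    # matches a native inverse FFT
```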
[Figure 17: a single data delay RAM with control feeds two polyphase filter / circular buffer / FFT pipes, each handling K/S phases per clock.]
Figure 17. Oversampling Channelizer with Dual Processing Pipes
[Figure 18: the real and imaginary parts of X(m) are swapped, passed through an N-point forward FFT, divided by N, and swapped again to produce the inverse transform.]
F^{-1}(x) = (1/N)·S(F(S(x))), where S(x) = Im(x) + j·Re(x)
Figure 18. Compute Inverse FFT Using Forward FFT Architecture by Swapping Real and Imaginary Parts
Conclusion
Intel FPGAs provide a custom hardware processing platform that delivers the high performance and parallel processing
paths necessary to implement state-of-the-art wideband digital polyphase filter banks for modern communications and radar
applications. The DSP Builder for Intel FPGAs provides a Simulink-based hardware design environment that includes a library
of highly optimized Intel FPGA DSP building blocks, including an FFT, and allows the creation of the custom radix-2 parallel and
serial FFTs required to achieve high performance with hardware efficiency. The DSP Builder for Intel FPGAs also enables rapid
prototyping on supported hardware platforms.
References
[1] F. J. Harris, Multirate Signal Processing for Communication Systems, Upper Saddle River, NJ: Prentice Hall, 2004.
[2] J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications, Upper Saddle River, NJ: Prentice Hall, 1996.
[3] R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing, Englewood Cliffs, NJ: Prentice Hall, 1983.
[4] J. Garrido, J. Grajal, M. A. Sanchez, and O. Gustafsson, "Pipelined Radix-2^k Feedforward FFT Architectures," IEEE Trans. VLSI Syst., vol. 15, no. 1, 2013.
[5] M. Parker, S. Finn, and H. S. Neoh, "Multi-GSPS FFT Using FPGAs," IEEE NAECON and Ohio Innovation Summit, 2016.
WP-01275-1.0