Header image  
by Ralph Glasgal
line decor
Home Tutorials Tech
Kudos and
Demos Bio Free Ambio
Glossary The Home
Concert Hall
Rec Engineers
FAQ/Forum Links Contact us
line decor

Real-Time Partitioned Convolution for Ambiophonics Surround Sound

By Anders Torger, University of Parma Industrial Engineering Dept. V.Scienze 181/A, 43100 PARMA, ITALY
Email: torger@ludd.luth.se

and Angelo Farina University of Parma Industrial Engineering Dept. V.Scienze 181/A, 43100 PARMA, ITALY
Email: farina@pcfarina.eng.unipr.it

3. Computational performances

From a theoretical point of view, the process described in the previous sub-section is less efficient than the unpartitioned overlap-and-save algorithm. It requires a higher number of arithmetic operations, and more memory references. It does do some earnings on the FFT calculations, since one  point FFT is replaced withP L-point FFTs, but with the slow increase of computational load of the FFT algorithm as M grows larger (M⋅log(M)), the performance increase is (theoretically) small.

The cost added by partitioned convolution is the P multiply steps with P-1 sums needed for each L-point output block. With a typical value of P = 16, the number of memory references is increased 16 times, the multiply step is performed 16 times more often, and the sums are not even needed in unpartitioned convolution! However, the same memory is referenced several times, and the total amount of memory referenced is slightly reduced, since the input and output buffers are only L points each, instead of M points as for the unpartitioned case. Additionally, the multiplication of spectra and the subsequent sum takes only a fraction of operations compared to FFT, so the cost is not as high as it may seem at first glance.

From a practical point of view, there are some non-obvious advantages of partitioned convolution. First of all, as the size of the FFTs exceeds the cache size of the processor, the performance degradation is larger than theoretically expected. Furthermore, as the number of partitions go up, the major part of the computational load is moved from the FFT calculations, to the simple multiply and add step, which is very easily optimized. On modern processors this translates into a hand-coded SIMD assembler loop of a few lines of code. The SIMD instruction set allows for executing several arithmetic operations in a single instruction, making the resulting code very efficient.

To increase the performance further, this critical loop is extended with cache preload instructions, which improves memory access performance. The task of optimizing the FFT algorithm is much more complex, and therefore the available implementations seldom make the most out of the given hardware. Here, the well-known and highly efficient FFTW library [11] is employed for the FFT calculations.

We use the open-source BruteFIR convolver [8], which was originally designed with Ambiophonics in mind, but has become a general-purpose audio convolver. In this case, it is configured for 10 channel Ambiophonics, meaning 2 inputs and 10 outputs, with 2 IRs per output, a total of 20 independent ones having the lengths indicated. The sampling rate is 48 kHz and the internal resolution is 32 bit ¡ oating point. Table 1 and 2 show how large part of the processor time available is needed for achieving realtime operation. Thus, a lower value is better, and a value larger than 1.0 means that the machine cannot handle the processing in real-time.

The performance benchmarks clearly show that the partitioned convolution algorithm actually outperforms unpartitioned convolution in all measured cases, if a suitable number of partitions is chosen. Considering the added bene¿ts of reduced latency and more flexible IR lengths, partitioned convolution should always be preferred. The performance comparison between the two approaches, becomes largely a comparison of the speed of the FFT implementation and the small assembler loop performing the multiply/add step, as can be seen in table 3. FFTW works better on Intel processors than on AMD, thus the gain from partitioning is less evident on the Pentium test system. The higher memory bandwidth of the AMD test system improves the performance of the multiply/add step which is strictly limited by memory bandwidth, like most algorithms on today’s standard computers.

Since Ambiophonics uses many more filters than there are inputs and outputs, unpartitioned convolution has an advantage over partitioned. This is due to that partitioned convolution gains speed from moving processing time from FFT to the multiply/add step, and FFT is done per input and output (12 in total), while multiply/add is done per filter (20 in total). Thus, in cases where there is one ¿ lter per input and output even larger performance gains by using partitioned convolution should be expected (which actually can be seen in table 4). However, in the rare cases where an optimal FFT algorithm is available, unpartitioned overlap-and-save will out-perform partitioned convolution in terms of throughput, as observed from informal tests made with Lopez software [5]. Lopez uses a heavily optimized proprietary library from Intel, which unfortunately is both closed-source and limited to the MicrosoftWindows platform, and is therefore not employed in BruteFIR.

IR length unpartitioned 4 part. 8 part. 16 part.
32,768 0.59 0.53 0.59 0.78
65,536 0.68 0.61 0.64 0.80
131,072 0.74 0.81 0.73 0.86

Table 1: Ambiophonics performance on a Pentium III 550 MHz,
256 kB cache, 100 MHz RAM.

3.1. Comparison with non-uniform partitioning

As described by Gardner [8], partitioned convolution can be arithmetically optimized by using non-uniform partitions, starting with a small size to provide low I/O delay, and increasing the size further back in the impulse response for increased efficiency. This way a large part of the multiply/add step can be traded for FFTs of longer length. The processing time required grows much slower with increased IR lengths than with uniform-sized partitions, thus it is a more scalable algorithm. However, it is much more complex to implement since it requires scheduling and synchronisation of parallel convolution tasks.

To compare the two partition strategies we have created a dummy program which does not carry out real convolution but performs the FFTs, spectral products and sums required for non-uniform partitioned convolution, and the corresponding for the uniform and unpartitioned case. The program simulates convolution of a single ¿ lter with one input and one output. In table 4 the results are presented. The IR lengths have been chosen to be as near as possible to 131,072 while exactly ¿ tting into the non-uniform partitioning scheme with constant processor demand suggested by Gardner [8]. The execution times have been divided with the time for 131,072 taps unpartitioned convolution, thus a value higer than 1.0 means slower than that reference. The lowest latency of 1024 samples was chosen since at 48 kHz it is near the practical limit of a personal computer implementation (about 20 ms). The 1000 MHz Athlon computer from table 2 generated the results.

As seen, the non-uniform partitioning is never faster than the unpartitioned case, but can provide very low latency without a dramatic increase in execution time. However, for Ambiophonics, which is targeted at home use and reproduction of previously made recordings, high throughput is central and latency is of less importance, and therefore uniform partitions is the better choice.

IR length unpartitioned 4 part. 8 part. 16 part.
32,768 0.37 0.28 0.32 0.43
65,536 0.56 0.42 0.34 0.45
131,072 0.64 0.51 0.59 0.47

Table 2: Ambiophonics performance on an Athlon 1000 MHz, 256
kB cache, 266 MHz RAM.

  I/O FFT Mix/Scale Multiply/add
unpartitioned 6% 71% 16% 7%
partitioned 7% 15% 17% 61%

Table 3: CPU time distribution – N = 131,072.

IR length 114,688 122,880 126,976 130,048
latency 16384 8192 4096 1024
number of

Table 4: Uniform (top) vs. non-uniform (bottom) partitioned convolution.