The Science of Domestic Concert Hall Design  
by Ralph Glasgal 




RealTime Partitioned Convolution for Ambiophonics Surround Sound By Anders Torger, University of Parma Industrial Engineering Dept. V.Scienze 181/A, 43100 PARMA, ITALY 3. Computational performances From a theoretical point of view, the process described in the previous subsection is less efficient than the unpartitioned overlapandsave algorithm. It requires a higher number of arithmetic operations, and more memory references. It does do some earnings on the FFT calculations, since one point FFT is replaced withP Lpoint FFTs, but with the slow increase of computational load of the FFT algorithm as M grows larger (M⋅log(M)), the performance increase is (theoretically) small. The cost added by partitioned convolution is the P multiply steps with P1 sums needed for each Lpoint output block. With a typical value of P = 16, the number of memory references is increased 16 times, the multiply step is performed 16 times more often, and the sums are not even needed in unpartitioned convolution! However, the same memory is referenced several times, and the total amount of memory referenced is slightly reduced, since the input and output buffers are only L points each, instead of M points as for the unpartitioned case. Additionally, the multiplication of spectra and the subsequent sum takes only a fraction of operations compared to FFT, so the cost is not as high as it may seem at first glance. From a practical point of view, there are some nonobvious advantages of partitioned convolution. First of all, as the size of the FFTs exceeds the cache size of the processor, the performance degradation is larger than theoretically expected. Furthermore, as the number of partitions go up, the major part of the computational load is moved from the FFT calculations, to the simple multiply and add step, which is very easily optimized. On modern processors this translates into a handcoded SIMD assembler loop of a few lines of code. The SIMD instruction set allows for executing several arithmetic operations in a single instruction, making the resulting code very efficient. To increase the performance further, this critical loop is extended with cache preload instructions, which improves memory access performance. The task of optimizing the FFT algorithm is much more complex, and therefore the available implementations seldom make the most out of the given hardware. Here, the wellknown and highly efficient FFTW library [11] is employed for the FFT calculations. We use the opensource BruteFIR convolver [8], which was originally designed with Ambiophonics in mind, but has become a generalpurpose audio convolver. In this case, it is configured for 10 channel Ambiophonics, meaning 2 inputs and 10 outputs, with 2 IRs per output, a total of 20 independent ones having the lengths indicated. The sampling rate is 48 kHz and the internal resolution is 32 bit ¡ oating point. Table 1 and 2 show how large part of the processor time available is needed for achieving realtime operation. Thus, a lower value is better, and a value larger than 1.0 means that the machine cannot handle the processing in realtime. The performance benchmarks clearly show that the partitioned convolution algorithm actually outperforms unpartitioned convolution in all measured cases, if a suitable number of partitions is chosen. Considering the added bene¿ts of reduced latency and more flexible IR lengths, partitioned convolution should always be preferred. The performance comparison between the two approaches, becomes largely a comparison of the speed of the FFT implementation and the small assembler loop performing the multiply/add step, as can be seen in table 3. FFTW works better on Intel processors than on AMD, thus the gain from partitioning is less evident on the Pentium test system. The higher memory bandwidth of the AMD test system improves the performance of the multiply/add step which is strictly limited by memory bandwidth, like most algorithms on today’s standard computers. Since Ambiophonics uses many more filters than there are inputs and outputs, unpartitioned convolution has an advantage over partitioned. This is due to that partitioned convolution gains speed from moving processing time from FFT to the multiply/add step, and FFT is done per input and output (12 in total), while multiply/add is done per filter (20 in total). Thus, in cases where there is one ¿ lter per input and output even larger performance gains by using partitioned convolution should be expected (which actually can be seen in table 4). However, in the rare cases where an optimal FFT algorithm is available, unpartitioned overlapandsave will outperform partitioned convolution in terms of throughput, as observed from informal tests made with Lopez software [5]. Lopez uses a heavily optimized proprietary library from Intel, which unfortunately is both closedsource and limited to the MicrosoftWindows platform, and is therefore not employed in BruteFIR.
Table 1: Ambiophonics performance on a Pentium III 550 MHz, 3.1. Comparison with nonuniform partitioning As described by Gardner [8], partitioned convolution can be arithmetically optimized by using nonuniform partitions, starting with a small size to provide low I/O delay, and increasing the size further back in the impulse response for increased efficiency. This way a large part of the multiply/add step can be traded for FFTs of longer length. The processing time required grows much slower with increased IR lengths than with uniformsized partitions, thus it is a more scalable algorithm. However, it is much more complex to implement since it requires scheduling and synchronisation of parallel convolution tasks. To compare the two partition strategies we have created a dummy program which does not carry out real convolution but performs the FFTs, spectral products and sums required for nonuniform partitioned convolution, and the corresponding for the uniform and unpartitioned case. The program simulates convolution of a single ¿ lter with one input and one output. In table 4 the results are presented. The IR lengths have been chosen to be as near as possible to 131,072 while exactly ¿ tting into the nonuniform partitioning scheme with constant processor demand suggested by Gardner [8]. The execution times have been divided with the time for 131,072 taps unpartitioned convolution, thus a value higer than 1.0 means slower than that reference. The lowest latency of 1024 samples was chosen since at 48 kHz it is near the practical limit of a personal computer implementation (about 20 ms). The 1000 MHz Athlon computer from table 2 generated the results. As seen, the nonuniform partitioning is never faster than the unpartitioned case, but can provide very low latency without a dramatic increase in execution time. However, for Ambiophonics, which is targeted at home use and reproduction of previously made recordings, high throughput is central and latency is of less importance, and therefore uniform partitions is the better choice.
Table 2: Ambiophonics performance on an Athlon 1000 MHz, 256
Table 3: CPU time distribution – N = 131,072.
Table 4: Uniform (top) vs. nonuniform (bottom) partitioned convolution. 