The Science of Domestic Concert Hall Design
by Ralph Glasgal
Ambiophonics, 2nd Edition
Replacing Stereophonics to Achieve Concert-Hall Realism
By Ralph Glasgal, Founder Ambiophonics Institute, Rockleigh, New Jersey www.ambiophonics.org
The Psychoacoustic Flaws in Both Stereo and 5.1 Music Reproduction and Why Multi-Channel Recording Cannot Correct For Them
In the 21st century, it seems reasonable for videophiles and audiophiles to ask where the bridge from stereo reproduction to the next sonic century is leading, or even whether there is such a bridge. Stereophonic sound reproduction dates from 1931 and, unfortunately, as we shall see in this book, has serious, unredeemable flaws. But it only makes sense to replace it if there is something better that is reasonably practical and of true high-end quality. Fortunately, there is such a paradigm, as described in the chapters that follow.
What Is Realism in Sound Reproduction
In this book, realism in staged music, game, or movie sound reproduction is understood to mean the generation of a sound field realistic enough to satisfy any normal ear-brain system that it is in the same space as the performers, that this is a space that could physically exist, and that the sound sources in this space are as full bodied and as easy to locate as at a live event. Realism does not necessarily equate to accuracy. For instance, a recording made in Symphony Hall but reproduced as if it were in Carnegie Hall is still realistic even if inaccurate. In a similar vein, realism achieved carelessly does not always mean perfection. If a full symphony orchestra is recorded in Carnegie Hall but played back as if it were crammed into Carnegie Recital Hall, one may have achieved realism but certainly not perfection. Likewise, as long as localization is as effortless as in real life, the reproduced locations of discrete sound sources need not have precisely the same, sometimes exaggerated, perspective as at the recording site to meet the standards of realism discussed here. An example of this occurs if a recording site has a stage width of 120 degrees but is played back on a stage that is only 90 degrees wide. What this really means in the context of realism is that the listener has moved back in the reproduced auditorium some twenty rows from the first row, but either stage perspective can be legitimately real. Finally, mere localization of a sound source does not guarantee that such a source will sound real. For example, a piano reproduced entirely via one loudspeaker, as in mono, or by two in stereo, is easy to localize but almost never sounds real. The mantra goes: Mere Localization Is No Guarantor of Realism. Interestingly, one can have monophonic realism, as when you hear a live orchestra from the last row of the balcony but can't tell (without looking) whether the horns are left, right, or center.
Since most of us are quite familiar with what live music in an auditorium sounds like, we soon realize that something is missing in our stereo systems. What is missing is soundfield completeness and psychoacoustic consistency. One can only achieve realism if all of the ear's hearing mechanisms are simultaneously satisfied without contradictions. If we assume that we know exactly how the ears work, then we could conceivably come up with a sound recording and reproduction system that would be quite realistic. But if we take the position that we don't know all the ear's characteristics or more significantly that we don't know how much they vary from one individual to another or that we don't know the relative importance of the hearing mechanisms we do know about, then the only thing we can do, until a greater understanding dawns, is what Manfred Schroeder suggested over a quarter of a century ago, and deliver to the remote ears an exact replica of what those same ears would have heard if present where and when the sound was originally generated. The old saw that, since we only have two ears, we only need two channels in reproduction has been justly disparaged. I would rephrase this hazy axiom to read, that since humans have only two ear canals, to achieve realism in reproduction, we need only provide the same sound pressure at the entrance to a particular listener's ear canal, even in the presence of head movement, that this same listener would have experienced at his ear canals had he himself been present at the recording session. Fortunately, it does turn out that only two recorded channels are in fact needed for realistic frontal music reproduction (more are actually detrimental) and it is the purpose of this book to show why this is so and how to do it. For music, movies, or games in the round only four recorded channels are needed. These principles also apply to electronically generated music or sound effects.
This axiom requires that all reproduced mid and higher frequency direct or ambient sound come from as close to the correct direction as possible so as to reach the ear canal over a path that traverses the normal pinna structures and head parts. Thus home reproduced hall reverberation should reach the ears from many sideward and rearward locations and the early reflections from a variety of appropriate front, side and rear directions. This is why just the two rear surround speakers of 5.1 can never provide psychoacoustically satisfying hall ambience. Likewise central sound sources should come from straight ahead rather than from two speakers spanning 60 degrees. (A center speaker is no help in this regard, as we will show below.) Another precept that must be kept in mind is that your pinnae are unique, like fingerprints. Using somebody else's pinna or pinna response, unless you get desperate, is not a good audiophile practice. A case in point is the use of dummy head microphones with pinnae. If the sound is reproduced by loudspeakers, then all the sounds pass by two pinnae, one of which is not even yours, and the result is strange and often in your head. If you listen using normal pinna-compressing earphones or ear buds, then you are listening with someone else's pinnae and there is no proper directional component at higher frequencies. The usual result is that the sound seems to be inside your head. If the dummy head doesn't have molded pinnae, and you listen with earphones, there are no pinnae at all and the sound again seems to be inside your head or strange. You can't fool Mother Nature.
While there are some widely held hi-end beliefs that may have to give way to psychoacoustic reality, the basic audiophile ideal that two channel recordings can deliver concert-hall caliber musical realism is not that far off the mark. However, having only two recorded channels does not mean being limited to only two playback loudspeakers. I call the coming replacement for today's stereo 'ambio' optimized but uncompromised for the recording and reproduction of frontal acoustic performances such as concerts, operas, movies, and video. By definition, and as substantiated below, where audiophile purity is concerned, multi-channel recording, especially with a center front channel, not only is not needed but is actually psychoacoustically counter productive. The sonic 3D genie cannot be squeezed into the 5.1, 6.1, or 7.1 or 10.2 moving picture surround sound bottle.
There are two basic theoretical technologies that are prime candidates to replace stereo or 5.1 where mass marketing and complex technical concepts should not be (but of course are) major stumbling blocks. One is the wavefront reconstruction method, which often employs hundreds of microphones and speaker walls, or Ambisonics. The Ambisonic wavefront reconstruction method generates the correct sound pressure and sound direction in a region that at least encompasses one listener's head. The other is the binaural technology method, which directly duplicates the live experience at each ear. Both technologies aim to deliver to the entrance of your ear canal an accurate replica of the original sound field. The Ambisonic method does have the advantage that it can reproduce direct sound sources from any angle and so is quite well suited to non-concert events or movies. But since the Ambisonic wavefront reconstruction method requires a special microphone, a minimum of three (or better, four) recording channels, and a very complex decoder, is not as user-friendly as binaural technologies, and does nothing for the existing library of LPs and CDs, it will not be considered further here.
As we shall show, the advantage of a binaural technology method such as Ambiophonics is that only two recorded channels, two front loudspeakers, and a scaleable number of optional ambience speakers are necessary. Although using a single pinnaless dummy head microphone (Ambiophone) works best, this new playback technology does not obsolete the vast library of LPs and CDs; it enhances most of them almost beyond belief. Ambio is also room shape and decorator friendly in that the front speakers can be very close together and thus be placed almost anywhere in a room. Another difference between direct wavefront reconstruction, such as Ambisonics and wavefield synthesis, and binaural field synthesis, as in Ambiophonics, is that in the latter case one can season the experience by moving one's virtual seat or changing the space entirely to suit the music or your taste. As explained in later chapters, this is not logical with 5.1 multi-channel recording systems, since to make such changes you would be incurring the expense of a processor to undo the original expense of recording and storing the now superfluous center and rear surround tracks.
I Vant To Be Alone, Or, The Listening Mob Fallacy
The concept that dedicated video, game, or music listening in the concert hall, theater, jazz venue, or at home is a group activity is superficial. Yes, there may be 2500 people in the opera house, but while the curtain is up there is, ideally, no interaction between them. Each member of the audience might just as well be sitting alone unless you believe in ESP. Listeners in a concert hall are also restricted as to the size of their sweet spot. They can't slouch to the floor or stand up, their permitted side to side or back to front movement is not extensive and there are plenty of seats in most halls where the sound and the view are not quite so sweet.
At home, how often does the gang come over to sit with you for five hours of Die Götterdämmerung? Certainly, serious home listening to classical music and, to a lesser extent, longer popular genres such as Broadway shows, games, movies, jazz concerts, etc., is, sad to say, a solitary or at most a two couple pursuit. Of course we all want to demonstrate our great reproduction systems to friends and family, but since these sessions usually last just a few minutes, one can show off the system to one or two people at a time and, after everyone has heard it at its best from the sweet spot, the party can go on.
The point here is that it is difficult enough to correct the inherent defects of stereo and create a concert-hall caliber soundfield at home without making compromises in the design in order to unduly enlarge the sweet spot. Note that stereo, Ambisonics, VMAx, 5.1, 7.1, etc. all have listening box limitations that one must live with. In the case of stereo, if one moves towards the loudspeakers one senses a hole in the middle. The stage is sensed as being half to the left and the other half to the right. As one moves back from the speakers the stage becomes narrower and eventually one seems to be listening to just one speaker. If one moves to the side then one soon localizes to the nearest speaker and can clearly hear just one channel. 5.1 has similar problems except that, since the dialog is already mostly in the center speaker, it remains so even if one is offset to one side, closer, or farther away.
By contrast, in Ambiophonics, if one moves closer to the speakers, one hears ordinary stereo. If one moves back, the stage remains wide and very little changes. If you move to the side, you still hear both channels, which is why a center speaker is never needed. This happens because each speaker is fed both its direct signal and a slightly delayed version of the other channel plus the center channel (if present as in 5.1). In Ambiophonics, one can stand, recline, nod, or rotate the head without affecting localization. This is in contrast to using earphones where, if you move your head, the apparent stage moves with you. Headtracking is never needed for loudspeaker binaural.
Why Stereo Can't Deliver Realism Without Some Fixing
By now, everyone in the industry has recognized that when a two channel recording is played back through two loudspeakers that form a 60 or 90 degree angle at the listener, each such speaker communicates with both ears, producing interaural crosstalk. The deleterious effects of this crosstalk at both low and especially the higher frequencies have been greatly underappreciated. For openers, crosstalk is what almost always prevents any sound source from appearing to come from beyond the angular position of the loudspeakers. This result is intuitively obvious, since if we postulate an extreme-right sound source, and can safely ignore the contribution from the left speaker, we can now hear the right speaker by itself, as usual with both ears, and no matter how we turn our heads the sound will always be localized to the right speaker as in any normal hearing situation. However, if we keep the right speaker sound from getting so easily to the left ear, then the brain thinks that the sound must be at a larger angle to the right, well beyond the, say, 30 degree position of the loudspeaker, since, as in the concert hall, the lesser sound reaching the left ear is now fully attenuated and delayed by the head and filtered by the left pinna. So, for starters, stereo, because of its crosstalk, inadvertently compresses the width of its own sound stage.
A second, perhaps more serious, defect is also caused by this same crosstalk. For centrally located (mono) sound sources, two almost equally loud acoustic signals reach each ear (instead of one as in the concert hall), but one of these signals, in the normal stereo listening setup, travels about half a head's width, or 300 µs, longer than the sound from the nearer speaker. This produces multiple peaks and nulls in the frequency response at each ear from 1500 Hz up, known as comb filtering. Since the nulls are narrow, and are muddied by even later crosstalk coming around the back or over the top of the head, and since the other ear is also getting a similar, but not precisely identical, set of peaks and nulls, the ear seldom perceives this comb filtering as a change in timbre; but it can and does perceive these gratuitous dips and peaks as a kind of second, but foreign, pinna function, and this causes confusion in the brain's mechanism for locating musical transients. Remember, in real halls the ear can hear a one degree shift in angular position, but not if strong comb-filter effects occur in the same 2-10 kHz region where the ear is most sensitive to its own intrapinna convolution effects and interpinna intensity differences. As long as this wrongful interaural crosstalk is allowed to persist, the sound stage will never be as natural or as tactile as it could be, and for some people such listening is fatiguing after a while and all 60 (or LRC) degree stereo reproduction sounds canned to them.
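The comb-filter arithmetic can be checked with a short calculation. The sketch below is an idealized model, not a measurement: it assumes the crosstalk arrives exactly as loud as the direct sound (a gain of 1.0), takes the 300 microsecond delay quoted above, and ignores head shadowing and the later crosstalk paths around the head. The function names and the 343 m/s speed of sound are assumptions introduced for illustration.

```python
import cmath
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed room-temperature value, not from the text
CROSSTALK_DELAY_S = 300e-6  # the 300 microsecond figure quoted in the text


def comb_null_frequencies(delay_s, f_max_hz):
    """Frequencies (Hz) up to f_max_hz where a signal plus an equally loud
    copy delayed by delay_s cancel: f = (2k + 1) / (2 * delay)."""
    nulls = []
    k = 0
    while (2 * k + 1) / (2.0 * delay_s) <= f_max_hz:
        nulls.append((2 * k + 1) / (2.0 * delay_s))
        k += 1
    return nulls


def comb_magnitude(freq_hz, delay_s, crosstalk_gain=1.0):
    """Magnitude of 1 + g * exp(-j * 2 * pi * f * delay) at one ear."""
    phase = -2.0 * math.pi * freq_hz * delay_s
    return abs(1.0 + crosstalk_gain * cmath.exp(1j * phase))


if __name__ == "__main__":
    # Extra distance travelled by the crosstalk signal under the assumptions.
    extra_path_m = SPEED_OF_SOUND_M_S * CROSSTALK_DELAY_S
    print(f"extra path length: {extra_path_m * 100:.1f} cm")
    for f in comb_null_frequencies(CROSSTALK_DELAY_S, 20000.0):
        print(f"null near {f:.0f} Hz")
```

Under these assumptions the first null falls near 1.7 kHz, in line with the "from 1500 Hz up" figure, and the nulls repeat roughly every 3.3 kHz. With a crosstalk gain below 1.0 the nulls become dips rather than complete cancellations.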
Pinna-Sensitive Front Speaker Positioning
Just as there are optical illusions, so there are sonic illusions. One can create sonic illusions by using complex filters to create virtual sound sources that float in mid air or rise up in front of you. As with optical illusions some people detect them and some people don't. The most prominent audio illusion is in stereo where phantom images are created between two speakers. You may have observed that most optical illusions are two-dimensional drawings, that imply ephemeral three dimensions. Likewise there is something indistinct about the stereo phantom illusion. This is because the phantom image is largely based on lower frequency interaural cues and barely succeeds in the face of the higher frequency head and pinna localization contradictions. The fact that earphone systems such as Toltec based processors, a host of PC virtual reality systems, SRS, Lexicon, etc. can move images in circles just by manipulating high frequency head and pinna response curves, even if not of great high-end quality, does show that these hearing characteristics are of considerable importance. Thus the direction from which complex sounds with energy over 1500 Hz originate, particularly from the frontal stage, should be as close to correct as possible.
Most stage sounds, particularly soloists and small ensembles, originate in the center twenty degrees or so. Remember that we want to launch sounds as much as possible from the directions they originate. Thus it makes much more sense to move the front channel speakers to where the angle between each of them to the listening position is perhaps ten degrees. This eliminates the pinna processing error for the bulk of the stage. But, of course, if the speakers are so close together, what happens to the separation? The answer is that with the crosstalk eliminated, as is necessary anyway, separation, as in earphone binaural, is no longer dependent on angular speaker spacing.
Crosstalk elimination is not a concept new to just Ambiophonics; but most of the older electronic crosstalk elimination circuits such as those of Lexicon, Carver, Polk etc. assume the stereo triangle and have, therefore, had to make compromises to enlarge the sweet spot size over which they are effective. I would hesitate to class any of them as high-end components, especially as they still promote pinna position errors. Usually good crosstalk cancellers require complex compensation for the fact that the crosstalk signal being canceled has had to go around the head and over the pinna on its way to the remote ear. Since Carver, Lexicon, etc. don't know what your particular head and pinna are like, they assume an average response and thus can't do a very good job of cancellation at high frequencies. If they try, most listeners experience phasiness, a sort of unease or pressure particularly if they move about. But when the speakers are in front of you there is not much of the head to get in the way and so the head response functions are much simpler, less deleterious if ignored or averaged, and head motions make little difference. Ambiodipoles are just now appearing but you can easily achieve an inexpensive and truly high-end result using a simple three foot square six inch thick absorbent panel set on edge at the listening position. You get used to the panel rather quickly and it is a high-end tweak that needs no cables and produces no grunge. Either electronic processors or panels allow complete freedom of head motion without audible effect and afford more squirm room at the listening position than one has in a concert hall. Two people can be accommodated comfortably but usually one needs to be directly behind the other for optimum results, not unlike high-resolution stereo.
Earlier crosstalk cancellation systems were less than satisfactory because they were not recursive. That is, in the earlier systems the unwanted signal from the left speaker at the right ear was cancelled by an inverted signal from the right speaker and that was the end of it; but this right speaker cancellation signal also reaches the left ear, and so one has a new form of crosstalk.
In Ambiophonics, this later crosstalk is cancelled over and over again until its level is inaudible. A comparison can be made with reverberation time in concert halls. Normally, the reverberation time is specified as the time it takes for a sound to decrease by 60 decibels. This implies that the human ear is sensitive to concert hall reverberation at this low a level. Likewise, crosstalk is still deleterious even if its level gets to be quite low after several cycles of successive cancellation. We call this process recursive Ambiophonic crosstalk elimination (RACE).
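The recursive idea can be sketched in a few lines. This is a simplified illustration and not the actual RACE implementation: it assumes each cancellation signal is just an attenuated, delayed, inverted copy of the opposite speaker's output, with no frequency shaping. The function names, the delay in samples, and the gain are all hypothetical values chosen for the example.

```python
import math


def recursive_crosstalk_cancel(left, right, delay_samples, gain):
    """Feed each speaker its own channel minus an attenuated, delayed copy
    of the OTHER speaker's output. Because the correction is taken from the
    outputs, every cancellation signal is itself cancelled in turn."""
    if delay_samples < 1:
        raise ValueError("delay_samples must be at least 1")
    n = min(len(left), len(right))
    out_l = [0.0] * n
    out_r = [0.0] * n
    for i in range(n):
        j = i - delay_samples
        past_l = out_l[j] if j >= 0 else 0.0
        past_r = out_r[j] if j >= 0 else 0.0
        out_l[i] = left[i] - gain * past_r
        out_r[i] = right[i] - gain * past_l
    return out_l, out_r


def cycles_until_below(gain, floor_db=-60.0):
    """Number of cancellation passes before the residue drops to floor_db,
    assuming each pass scales the residue by `gain` (0 < gain < 1)."""
    return math.ceil(floor_db / (20.0 * math.log10(gain)))


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 9
    silence = [0.0] * 10
    print(recursive_crosstalk_cancel(impulse, silence, delay_samples=3, gain=0.5))
    print(cycles_until_below(0.5))
```

A single impulse in the left channel produces a train of ever smaller correction pulses alternating between the two speakers. At an assumed attenuation of about 6 dB per pass, it takes ten passes for the residue to fall 60 dB.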
One can have video with crosstalk cancellation (XTC) but adding a picture can have its misleading side. One reason that so many listeners are impressed with the realism of movie surround sound systems is the presence of the visual image. While the research in this field is not definitive, it stands to reason that a brain preoccupied with processing a fast moving visual image is not going to have too much processing power left over to detect fine nuances of sound. Certainly, if you close your eyes while listening to any system, your sensitivity to the faults of the sound field is heightened. Thus when a seemingly great home theater system is used to play music only, without a picture, the experience is often less than thrilling. Adding a picture to Ambio seems to make fine adjustments to the ambient field much less audible, but one must observe that most people keep their eyes open at concerts and so perhaps an image is desirable to provide the ultimate home musical experience.
Nothing we have done to make the front stage image more realistic and psychoacoustically correct has required any extra recorded channels. I call all these changes to standard stereophonics Ambiophonics, or Ambio. Ambio does not rely on the fluky phantom image mechanism. But there still remains one further difficulty with the stereo triangle, and that is that we need a proper ambient field coming from more directions than just those of our now crosstalk-free, pinna-correct, front speakers.
The Case For Ambience By Reconstruction
Like a federal budget agreement, a method of achieving that air, space, and appropriate concert hall ambience at home, has technical devils in its details. The most obvious suggestion, based on movie and video surround-sound techniques, to just stick the ambient sound on additional DVD multi-channel tracks, on closer examination, just can't do it for hi-enders. The problem with using third, fourth or fifth microphones at or facing the rear of the hall and then recording these signals on a multi-channel DVD, is that these microphones inevitably pick up direct sound which, when played back from the rear or side speakers, causes crosstalk, pinna angle confusion, and comb filter notching. It is also pinnatically incorrect to have all rear hall ambience coming from just two point sources even if these surround speakers are THX dipoles. Remember, using rear dipoles implies a live listening room, which will thus also increase unwanted early reflections from the front speakers. Additionally, recording hall ambience directly is really not cost effective or necessary. Unlike movies, the acoustical signature of Carnegie Hall (despite its always ongoing renovations) does not change with every measure, so why waste bits recording its very static ambience over and over again? It is much more cost and acoustically effective to measure the hall response once from the best seat (or several) for say five, left, right, and center positions on the stage (If the hall is symmetric, the measurement process is simpler) and either include this data in a preamble on the DVD, store it in your playback system or provide it as part of a DVD-ROM library of the best ambient fields of the world.
The process of combining a frontal, two (I hope) channel recording with the hall impulse response is called convolution and convolution is the job of the ambience regenerator which may be a PC or a special purpose DSP computer or it may be a part of the DVD/CD DAC. The use of ambience reconstruction would obviate the need for DTS or Dolby Digital multi-channel recordings at least where classical music is concerned. Unlike frontal sound, ambience can and should come from as many speakers as one can afford or has room for. Crosstalk, and comb-filtering are not problems with ambient sound sources if these signals are uncorrelated (unrelated closely in time, amplitude, frequency response, duration, etc.) which is normally the case both with concert halls and good ambience convolvers.
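As a rough illustration of what the ambience regenerator computes, the sketch below convolves a short signal with a toy impulse response in plain Python. The impulse response values are invented for the example and stand in for a measured hall response; a practical convolver would use FFT-based methods on responses that are seconds long.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution: every input sample launches a scaled copy
    of the impulse response into the output."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, sample in enumerate(signal):
        for j, tap in enumerate(impulse_response):
            out[i + j] += sample * tap
    return out


# Hypothetical impulse response for one ambience speaker: no direct sound
# (the first taps are zero), then two reflections of decreasing strength.
TOY_HALL_RESPONSE = [0.0, 0.0, 0.5, 0.0, 0.25]

if __name__ == "__main__":
    dry_recording = [1.0, 0.0, 0.0, -1.0]  # stand-in for one recorded channel
    print(convolve(dry_recording, TOY_HALL_RESPONSE))
```

In this picture each ambience speaker would get its own impulse response, so adding a speaker means adding one more convolution rather than one more recorded channel.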
An Uphill Political Struggle
The cause of concert-hall early reflection and reverberation tail synthesis by digital signal processors (DSP) in computers or audio products was set back by the late Michael Gerzon, the Oxford Ambisonics pioneer, who wrote in 1974, "Ideally, one would like a surround-sound system (yes, he did use this term in 1974) to recreate exactly, over a reasonable listening area, the original sound field of the concert hall.... Unfortunately, arguments from information theory can be used to show that to recreate a sound field over a two-meter diameter listening area for frequencies up to 20 kHz, one would need 400,000 channels and loudspeakers. These would occupy 8 GHz of bandwidth equivalent to the space used by 1000, 625-line television channels!"
Later, however, Gerzon did not let information theory prevent him from capturing a 98% complete concert-hall sound field using a single coincident array of four microphones. Indeed the complete impulse response of a hall can be measured and stored on one floppy disk by placing an orthogonal array of three microphone pairs at the best seat in the house and launching a test signal from the stage during the recording session or at any time.
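The figures in Gerzon's 1974 estimate are mutually consistent, as a quick check shows. The only assumption added here is that each of the 400,000 channels is allotted the full 20 kHz audio bandwidth; the variable names are for illustration only.

```python
CHANNELS = 400_000           # loudspeaker channels in Gerzon's estimate
AUDIO_BANDWIDTH_HZ = 20_000  # assumed bandwidth allotted to each channel
TV_CHANNELS = 1000           # television channels in his comparison

total_bandwidth_hz = CHANNELS * AUDIO_BANDWIDTH_HZ
per_tv_channel_hz = total_bandwidth_hz / TV_CHANNELS

if __name__ == "__main__":
    print(f"total: {total_bandwidth_hz / 1e9:.0f} GHz")
    print(f"per television channel: {per_tv_channel_hz / 1e6:.0f} MHz")
```

The product is 8 GHz, which works out to 8 MHz for each of the 1000 television channels in the comparison.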
Convolution to The Rescue
An audiophile-friendly approach to ambience reconstruction is to derive the surround speaker feeds by convolution of a two channel recording, preferably made using the microphone technique described below, that limits rear hall pickup. The questions to be asked are these:
There may never be a definitive answer to the first question. Just as there is no sure recipe for physical concert hall design, there is no best virtual concert hall specification. But, adjusting the number, placement, and shape of early reflections is easily more audible than changing amplifiers or cables and offers a tweaker delights that can last a lifetime. I can only say that in my own experience, just as there are thousands of real concert halls that differ in spite of being real, so there are thousands of ambience combinations that sound perfectly realistic even if not perfect. How do you get more real than real? Remember, absolute, particular hall parameter accuracy is not essential to achieve realism. By analogy, even if one sits on the side, in the last row of the balcony at Carnegie Hall where the ambience is lopsided, the sonic experience is still real. In my opinion the best software for this purpose is based on impulse response measurements made in actual concert halls as was done by JVC and Yamaha some 10 years ago for consumer products and is being done all the time by acoustical architects tuning auditoriums. Others, such as Dr. Dave Griesinger at Lexicon, create ambience signals using an imaginary model. I am not talking here about professional effects synthesizers that generate artifacts never heard by anybody in any physically existing space. Someday, I presume, we will have a DVD-ROM that contains the ambient parameters of Leo Beranek's 76 greatest concert houses of the world and a simple mouse click will yield a selection. With enough hall impulse responses stored, you could even select a seat and a stage width. (If it's a solo recital one wants only central derived early reflections, if a symphony orchestra, the works, etc.). There are already over 100 impulse responses of concert halls available on the Internet.
While I may not be the best one at executing my own theories, I have gotten startlingly good results using the new convolvers available. It is a rare AES convention that does not describe advances in the state of this art. Another important point is that ambience regeneration is scaleable. As computers get faster and cheaper, and as convolution software gets better, it is easy to upgrade or add more ambience speakers. The hall ambience storage method is also inherently tolerant of speaker type; the precise location and response of each speaker matter little, and changing them is akin to repainting the balcony or curving a wall in the concert hall.
The fact is that the brain is not all that sensitive to whether there are 30 early reflections from the right and only 25 from the left or whether they come from 50 degrees instead of the concert-hall ideal (according to Ando) of 55 degrees. If the reverberant field is not precisely diffuse or decays in 1.8 seconds instead of 2.0 seconds, that may only mean you are in Carnegie Hall instead of Symphony Hall. I make no claim to be an authority on setting ambience hall parameters, and I am sure many audiophiles could do better at this game. I now use 2 large area speakers at the sides and rear to provide a reverberation field as diffuse as possible.
Since central early proscenium reflections come from the recording via the main front speakers, these need not be regenerated and, of course, by definition they are natural and are coming from the proper directions. For side, overhead, and rear ambience using the left channel to recreate left leaning early reflections (some of which may end up coming from the right) and the right channel to produce a set of right reflections, the early reflection patterns for different instruments on the stage have enough diversity to exceed the threshold of the brain's reality barrier.
Whither Recording In An Ambiophonic Hi-End World
While audiophiles do not often concern themselves with recording techniques over which they have little control, almost any LP or CD made with either coincident or spaced microphones is greatly enhanced by Ambio playback. But one can heighten the accuracy, if not gild the lily of realism, by taking advantage, in the microphone arrangement, of the knowledge that, in playback, the rear half and side parts of the hall ambience will be synthesized, that there is no crosstalk, that the front loudspeakers are relatively close together, and that thus listening room reflections are minimized. To make a long story short, exceptionally realistic "You-are-there" recordings can be made by using a head-shaped, pinnaless ball with holes at the ear canal positions to hold the microphones. The Schoeps KFM-6 is a good example of such a microphone even though it is a sphere and an oval would be slightly better. However, for best results, this microphone should be well baffled to prevent most rear hall ambience pickup. KFM-6 recordings are a feature of the PGM label, produced by the late Gabe Wiener, who was a staunch advocate of this recording method, first expounded by Guenther Theile. As expected, these PGM recordings are exceptionally lifelike when played back Ambiophonically so as to be free of crosstalk or pinna distortion. The Ambiophone is a microphone array specifically designed to make recordings optimized for Ambio playback.
The reason such a microphone is optimum is that particularly for central sounds the sound rays reach the ears almost as they do in the concert hall. That is, one ray from a central instrument reaches the left ear of the microphone, goes to the left speaker where it is sent straight ahead to the left pinna and ear. The fact that the head response transfer function of the microphone is not the same as the listener's is not significant for central sound sources that don't cross either head. For side sources the microphone ball becomes a substitute for the listener's HRTF but at least there is still only one HRTF and one real pinna in the chain. Perhaps the hardest part of migrating to Ambio will be to convince recording engineers, who are usually rugged individualists, to use microphones and positions that are Ambio compatible.
Law of the First Impression
No matter how many great stereo systems I listen to, they still never have the impact that my first Emory Cook stereo disc had. Likewise, I still compare the multichannel systems I hear now to the mental image of air and presence I retain of the first RCA CD-4 true discrete quad LP of Mahler's 2nd I heard in the early 70's. The moral of this phenomenon is that the first time anyone hears a major upgrade in reproduction, particularly when going beyond two speakers for the first time, they are always very favorably impressed. Dissatisfaction with systems like the Hafler arrangement, SQ, Dolby Pro Logic, etc. only sets in later. This is the scenario with the new discrete multi-channel format for music as well. At first 5.1 or even 7.1 sounds really exciting and a great contrast to stereo, but in the end it fails as a realistic replica of the live music concert-hall experience.