Hail to the hall - Environmental Acoustics
by Dennis Gustafsson on 05/28/14 07:36:00 am   Featured Blogs

The following blog post, unless otherwise noted, was written by a member of Gamasutra’s community.
The thoughts and opinions expressed are those of the writer and not Gamasutra or its parent company.

 

One of our early goals with Smash Hit was to combine audiovisual realism with highly abstract landscapes and environments. A lot of effort was put into making realistic shadows and visuals, and our sound designer spent long hours finding the perfect glass breaking sound. However, without proper acoustics to back up the different environments, the sense of presence simply wouldn't be there.

To achieve full control over the audio processing and add environmental effects, I needed to do all the mixing myself. Platform-dependent solutions like OpenAL and OpenSL cannot be trusted here, because support for environmental effects is device- and firmware-specific and missing in most mobile implementations. Even if it were available, it would be virtually impossible to reliably map parameters between OpenSL and OpenAL. As in most cases with multi-platform game development, DIY is the way to go.

Showcasing a few different acoustic environments

Software mixing

Writing a software mixer is quite rewarding – a small, well-defined task with a handful of operations performed on a large chunk of data, which makes it very suitable for SIMD optimization. A software mixer is also one of those few subsystems that has, or can have, a real-world analogue – the physical audio mixer. I chose a conventional, physical abstraction, so my interface classes are named Mixer, Channel, Effect, etc., but there might be better ways to structure it.
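
As a rough illustration, an interface along those lines might look something like the sketch below. The names and signatures here are hypothetical, not the actual Smash Hit code; samples are assumed to be interleaved floats.

#include <vector>

// Hypothetical sketch of a physically-inspired mixer interface.
class Effect {
public:
    virtual ~Effect() {}
    // Process 'frames' sample frames in place.
    virtual void process(float* buffer, int frames) = 0;
};

class Channel {
public:
    virtual ~Channel() {}
    // Pull the next block of audio from the playing sample into 'buffer'.
    virtual void render(float* buffer, int frames) = 0;
    virtual bool isPlaying() const = 0;
};

class Mixer {
public:
    void addChannel(Channel* c) { mChannels.push_back(c); }
    void addEffect(Effect* e) { mEffects.push_back(e); }
    void mix(float* out, int frames); // combine channels, then run the effect chain
private:
    std::vector<Channel*> mChannels;
    std::vector<Effect*> mEffects;
};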

The biggest hurdle when writing a software mixer turned out to be the actual mixing. Two samples playing at the same time are added together, but what happens if they both play at maximum volume? The intuitive implementation, and what also happens in the real world, is clipping. This is what most real audio programs do. Clipping is a form of distortion where audio levels above the maximum or below the minimum are simply clamped at the threshold, reshaping or destroying the waveform. In audio software you would typically adjust the levels manually to avoid clipping, but in games, where audio is interactive, this can be really tricky. Say for instance you have a click sound for buttons. If no other sounds are playing you want the click played back at maximum volume, but if there is music in the background the volume level needs to be lowered. If there is an explosion nearby it needs to be lowered even further to avoid clipping.
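
For reference, hard clipping is nothing more than a clamp on the mixed sample. A minimal sketch, assuming floating-point samples in the -1..1 range:

// Hard clipping: clamp the mixed value to the legal range, distorting loud peaks.
inline float hardClip(float x)
{
    if (x > 1.0f) return 1.0f;
    if (x < -1.0f) return -1.0f;
    return x;
}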

One way to reduce clipping is to transform the output signal in a non-linear fashion, so that it never quite reaches the maximum level. The problem is that this also affects the result when only one sample is playing. Hence, when the click is played back in isolation, it won't be at maximum volume.
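
A common choice for such a non-linear shaping function is tanh. This is just a sketch of the idea, not what Smash Hit uses; note that even a lone full-volume sample comes out attenuated (tanh(1.0) is roughly 0.76):

#include <math.h>

// Soft clipping: a smooth curve that never quite reaches +/-1,
// but also attenuates a signal that is alone at full volume.
inline float softClip(float x)
{
    return tanhf(x);
}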

Some people suggest that the output should be averaged when there are multiple channels, so with three sounds A, B and C playing at once, you mix them as (A+B+C)/3. This is not a good way to do it, because the formula knows nothing about the content of each channel – B and C can for instance be silent, which still results in A being played back at a third of its volume.

What we need is some form of audio compression – an algorithm that compresses audio dynamically, based on the current levels. Real audio compressors are pretty advanced, with a sliding window that analyzes the current audio content and adjusts the levels accordingly. Fortunately there is a "magic formula" that sounds good enough in most cases. I found this solution by Viktor T. Toth: mix = A + B - A*B, but when adapting it to floating-point math I realized that a slight modification to mix = A + B - abs(A)*B deals better with negative sample values. Each channel is added to the mix separately, one at a time, using the following pseudo code:

mix = 0
for each channel C
  mix = mix+C - abs(mix)*C
next

This means that if there is only one channel playing, it will pass unmodified through the mixer. The same applies if there are two channels but one is completely silent. If both channels are at the maximum value (1.0), the result will be 1.0, and anything in between is compressed dynamically. It is definitely not the best or most accurate way to do it, but considering how cheap it is, it sounds amazingly good. I use this for all mixing in Smash Hit, and there are typically 10–20 channels playing simultaneously, so it does handle complex scenarios quite well.
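
A minimal per-sample implementation of that loop might look like the following. The buffer layout and function name are just assumptions for the sketch; samples are floats in the -1..1 range.

#include <math.h>

// Mix 'numChannels' channel buffers into 'out' using mix = mix + c - abs(mix)*c.
// A single playing channel passes through unmodified; louder combinations are
// compressed so the result never leaves the -1..1 range.
void mixChannels(float* out, float** channels, int numChannels, int frames)
{
    for (int i = 0; i < frames; i++) {
        float mix = 0.0f;
        for (int c = 0; c < numChannels; c++) {
            float s = channels[c][i];
            mix = mix + s - fabsf(mix) * s;
        }
        out[i] = mix;
    }
}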

I use three separate mixers in Smash Hit – the HUD mixer, which is used for all button clicks and menu sounds; the gameplay mixer, which represents all 3D sounds; and the music mixer, which is used for streaming music. The gameplay mixer has a series of audio effects attached to it to emulate the acoustics of different room types.
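
In terms of the mixer interface sketched earlier, that setup could be expressed roughly like this. The effect class names and the roomParams parameter are made up for the example:

// Three independent mixers; only the gameplay mixer gets the room acoustics chain.
Mixer hudMixer;
Mixer musicMixer;
Mixer gameplayMixer;

// Hypothetical effect classes implementing the Effect interface above.
gameplayMixer.addEffect(new ReverbEffect(roomParams));
gameplayMixer.addEffect(new EchoEffect(roomParams));
gameplayMixer.addEffect(new LowPassEffect(roomParams));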

Reverb

Given how useful a reverb effect is in game development, it's quite surprising to me how difficult it was to find any implementations, or even an explanation, online. At first glance reverb seems much like a long series of small echoes, but implemented that way it sounds exactly like that – a long series of small echoes, not the warm, rich acoustics of a big church. If you make the echoes shorter, it just turns more and more metallic, like being inside a sewage pipe.

There is a great series of blog posts about digital reverberation by Christian Floisand that contains a lot of the theory and also a practical implementation: Digital reverberation and Algorithmic Reverbs: The Moorer Design.

It uses a bank of parallel comb filters whose summed output is passed through all-pass filters in series. Each comb filter is basically a short delay line with feedback, representing reflected sound, while the all-pass filters are used to thicken and diffuse the reflections by altering the phase. I don't know enough signal theory to fully understand the all-pass filter, but it works great and the implementation is fairly easy.
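
As a sketch of the two building blocks, here is a plain feedback comb filter and a Schroeder-style all-pass – not the exact filters used in Smash Hit:

#include <vector>

// Comb filter: a delay line with feedback, producing a decaying series of reflections.
class CombFilter {
public:
    CombFilter(int delaySamples, float feedback)
        : mBuffer(delaySamples, 0.0f), mPos(0), mFeedback(feedback) {}
    float process(float input) {
        float delayed = mBuffer[mPos];
        mBuffer[mPos] = input + delayed * mFeedback;
        mPos = (mPos + 1) % (int)mBuffer.size();
        return delayed;
    }
private:
    std::vector<float> mBuffer;
    int mPos;
    float mFeedback;
};

// All-pass filter: passes all frequencies at equal gain but smears the phase,
// which diffuses the output of the comb filters.
class AllPassFilter {
public:
    AllPassFilter(int delaySamples, float gain)
        : mBuffer(delaySamples, 0.0f), mPos(0), mGain(gain) {}
    float process(float input) {
        float delayed = mBuffer[mPos];
        float v = input + mGain * delayed;
        mBuffer[mPos] = v;
        mPos = (mPos + 1) % (int)mBuffer.size();
        return delayed - mGain * v;
    }
private:
    std::vector<float> mBuffer;
    int mPos;
    float mGain;
};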

In addition to the comb filters and all-pass filters I also added a couple of tap delays (delay lines without feedback), representing early reflections off hard surfaces, as well as a low-pass filter in each comb filter, which gives a great way to control the room characteristics. Christian's article suggests using six comb filters, but for performance reasons I cut it down to four. I'm using four tap delays and two all-pass filters, plus a pre-delay on the entire late-reflection network.
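
Putting the pieces together, the late-reflection part boils down to four parallel combs summed and fed through two all-passes in series. A condensed sketch using the filter classes above (pre-delay, tap-delay early reflections and the per-comb low-pass are left out for brevity):

// Rough late-reflection network: four parallel combs summed, then two all-passes in series.
float processLateReverb(float input, CombFilter* combs, AllPassFilter* allPass)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; i++)
        sum += combs[i].process(input);
    sum *= 0.25f; // keep the summed comb outputs roughly in range
    for (int i = 0; i < 2; i++)
        sum = allPass[i].process(sum);
    return sum;
}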

All audio in Smash Hit is processed in stereo, so the reverb needs to run separately on the left and right channels. I randomize the loop times in the comb filters slightly differently for the left and right channels, which gives the final mix a very nice stereo spread and a much better sense of presence.
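
One simple way to get that spread is to offset each comb delay length slightly, and differently per channel. A sketch of the idea with made-up numbers (the real values would be randomized at setup time):

// Give left and right reverbs slightly different comb delay lengths for stereo spread.
// Base delays (in samples) and offsets are arbitrary example values.
const int kBaseDelay[4] = { 1491, 1557, 1617, 1687 };

int combDelaySamples(int comb, int channel) // channel 0 = left, 1 = right
{
    int offset = (channel == 0 ? -1 : 1) * (13 + 7 * comb);
    return kBaseDelay[comb] + offset;
}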

In addition to the reverb I also implemented a regular echo as well as a low-pass filter. The parameters of these three effects are used to give each room its unique acoustics.
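
The echo is essentially one long feedback delay, and the low-pass can be as simple as a one-pole filter, so the per-room acoustics end up as a small set of numbers. A sketch with hypothetical parameter names:

// One-pole low-pass: a coefficient near 1 lets most of the treble through,
// a coefficient near 0 muffles the sound heavily.
class OnePoleLowPass {
public:
    OnePoleLowPass(float coeff) : mCoeff(coeff), mState(0.0f) {}
    float process(float input) {
        mState += mCoeff * (input - mState);
        return mState;
    }
private:
    float mCoeff;
    float mState;
};

// A per-room parameter set along these lines would drive the three effects.
struct RoomAcoustics {
    float reverbFeedback;   // comb feedback, controls decay time
    float reverbDamping;    // low-pass inside the combs
    int   echoDelaySamples; // echo delay line length
    float echoFeedback;     // echo decay
    float lowPassCoeff;     // overall muffling of the room
};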

 


Comments


Jean-Philippe Lafortune
Nice acoustics Dennis!

I was wondering if you would be interested in a sound renderer source code. A few years ago we developed a versatile plugin for 3ds Max, including an off-line ray tracer and a realtime engine. It also comes with a bunch of "sonic" materials that affect sound reflection, transmission and absorption.

Just let me know!

Roger Haagensen
A small piece of advice: make sure all sounds, music, video audio and dialog in your games are mastered/normalized to the EBU R128 standard. If that does not fit your setup, or you can't use that form of level measurement, then simply normalize so the loudness levels sit at about -23 dBFS RMS (full-scale, Z-weighted sine wave), which is close to the EBU R128 standard. Why Z-weighted (no loudness contour)? Because it's super simple to do in code, so slapping together an improvised tool is easy if needed.
In the case of stereo (for example stereo music), take the average RMS of the left and right channels and use that as the final RMS.
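
For what it's worth, a plain (unweighted) RMS measurement and the gain needed to hit a target level is only a few lines of code. A rough sketch, assuming float samples in the -1..1 range; a real EBU R128 measurement adds K-weighting and gating, which this skips:

#include <math.h>

// RMS of a block of samples.
float rmsOf(const float* samples, int count)
{
    double sum = 0.0;
    for (int i = 0; i < count; i++)
        sum += (double)samples[i] * samples[i];
    return (float)sqrt(sum / count);
}

// Gain to multiply the samples by to reach a target RMS in dBFS (e.g. -23.0f).
float gainToTarget(float rms, float targetDbfs)
{
    if (rms <= 0.0f) return 1.0f; // silence: leave untouched
    return powf(10.0f, targetDbfs / 20.0f) / rms;
}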

Now, with game audio mixing of samples A+B+C and dividing by 3... if samples A and B are silent, then C playing at only 1/3 of the volume is actually correct, because why would you play back a silent sample? And you surely don't have 3 channels mixing when only one is in use.
So if this is an issue, then the audio chain/audio engine really needs to be redesigned; this is a waste of resources.

If you really want sample C to be 2/3rds louder, then simply do not mix in the silent samples A and B. If you are not using or needing A and B, then do not use them – it's that simple.

Do not think in channels – that is normally the final presentation layer (which may or may not be user defined).
Instead you place a sound in 3D space or on a 2D plane. If you have two sounds at once you do (A+B)/2, if three then (A+B+C)/3, and so on.

The more sounds are layered, the more they drown out frequencies and parts of each other, just like in real life. At a certain point too many sounds will only muddle the audio and make it very unclear what is being heard – the same as being in a room full of people talking. If it gets that crowded, you can, thanks to it being software, simply fade out the oldest sounds.

"This is what most real audio programs do" this is not true, audio tracks are mixed down and during mixdown the common (channel+channel)/numberofchannels is always used.
during audio preview however some may allow clipping or give a clipping warning to make it easier to home in on problem areas.

Just as there is a CPU budget there is also an aural budget – there is a limit to how much sound you can layer at once. If certain sounds tend to always trigger at the same time, then consider pre-mixing those two together. If possible I'd advise mixing the two sounds at load time of the game or level, as this gives some flexibility and reuse possibilities as well as saving storage space.

If you start using compression you run the risk of listening fatigue; the sound will be unnaturally flat or harsh at times. That math, while preventing clipping, actually adds a different kind of distortion by flattening/damaging the peaks. A simple sum/average only causes a halving of the volume, with no distortion or damage to the audio. There is a reason why volume knobs were invented – if it sounds too low, turn up the volume instead.

The only time you'd want to use compression is when creating the original sounds (to ensure the dynamics are consistent and an explosion won't be super quiet and then blast at max – which would appear quiet if RMS-scanned, but obviously not when the tail part is heard).
Likewise, compression may be used at the final stage, which could be the user's/soundcard's stereo mixdown or 7.1 upmix. I use Dolby Headphone mixing a lot, which does its own compression and adjustment of frequencies to suit headphones (using a 5.1 or 7.1 virtual mode as the source that a game/software presents the audio to).

Some users may have audio modes like nightmode or music mode or movie mode or game mode or may have special custom speaker placements.

The loudness war in the music industry is bad enough; games have not been too bad on this, and I'd like to see that improve even further.

Also note that mix=A+B-A*B is effectively a sum minus a product.
I'm not sure, but I suspect you could get aliasing issues due to this.
That math was written by a guy in 2000 (14 years ago) for 8-bit audio.
Today, with 16-bit audio even on crappy devices and 96 dB of dynamic range, you have plenty of dynamic range, considering most hardware doesn't have a noise floor or dynamic range better than that anyway.
In fact, trying to listen to the full 96 dB range would permanently damage your ears after a few minutes anyway.

Also note that the mean of two sounds does not diminish the quality of either: if any frequencies are shared they will be at 100% relative loudness, and if one of the two sounds being mixed has a frequency the other does not, then that frequency will only be at 50% relative volume, which is correct. The math you referenced would instead present it at 75% relative volume.
This makes it artificially louder than proper mixing. You can't mix 2 channels into 1 without at least halving the volume of each.

With 32-bit floating point you have 1680 dB (!) of dynamic range and 24 bits of precision. Any detail loss is a non-issue at this point. And if you do the calculations themselves as 64-bit floats you have an insane dynamic range and 52 bits of precision – you could chain-reprocess the same audio hundreds of times with no perceived quality loss (something nobody would do). What this means is that with 32-bit floats you could mix around 32 samples as (smpl+smpl.....+smpl)/num samples without any loss; mix beyond that and the softest audio will begin to be obscured by the louder audio. What is the limit? No idea – I'm sure somebody out there has tested. The latest hardware has 128 channels (which use mean averaging when mixed down to 2-channel stereo, for example), and I seem to recall one sound card that could do 512 hardware/software channels.

And note that if you mix two sounds (or channels), then doing mix=(A+B)*0.5 is better, since you avoid a divide by 2.
Switching between integer and float carries a performance penalty too, so stay in 32-bit float if possible.

And a final note on -23 dBFS RMS / the EBU R128 standard: this form of loudness normalization allows the user to crank up the volume (since it will all be relatively quieter), allowing the amplifier to push more current to the speaker elements and thus more power to drive that bass.
Audio hardware noise levels are so good now that the noise floor is nowhere near audible.
If audio is compressed to "fit", not allowing proper peaks and valleys, it ends up flat, and if it is at the same time maxed out at full signal, a user may not be able to increase the volume – in fact they may feel the need to decrease it, causing the bass and boom to vanish as the power to the speakers is reduced instead.

Instead of compromising sound quality when too many sounds play at once, try to prioritize instead. A UI bleep should have higher priority than the fizzling tail end of some background ambience.

And always, always make sure you have enough headroom (EBU R128 or -23dBFS RMS does provide a lot of headroom to work with).

PS! Do note that thanks to Creative dropping the ball on OpenAL, the way to go is OpenAL Soft, which is a software implementation of OpenAL. I believe it exists for Android as well, and you should be able to do EAX 2 stuff with it.

Dennis Gustafsson
Wow, this is the first time I have got a comment that is longer than the original post! Averaging channels is not a good mixing algorithm. The reason is not precision, but the fact that in interactive audio you do not know in advance in what order samples will be played, or how many at once. Classic example: You have background music playing and an explosion goes off. Right when the explosion sample starts playing, you wouldn't hear that the music is taken to a lower volume, but at the end of the sample, when the explosion fades out to silence, you will go from two channels playing – music and explosion (which is now completely silent at the very end!) – down to only playing the music channel, and the volume will suddenly pop to double volume instantly, which will be VERY noticeable.

Your other argument about giving samples enough headroom also doesn't work, because in interactive audio you simply don't know how many will be played at once. There might be a scenario where two explosions go off at the same time, or three, or five, or ten... You WILL hit the roof at some point, regardless of headroom. Compression is not bad, it's just a smoother alternative to clipping.

Did you implement your ideas in an actual game? I'd be very interested to hear how it worked out. From the way you argue, it sounds like you come from a music production or other linear-audio background.

Roger Haagensen
Use an audio "budget", where you give yourself X number of sounds that can be played at once. Let's say 20 sounds at once is the budget.

Then if sound number 21 is triggered, the oldest (possibly sound number 1) is killed.

"You have background music playing and an explosion goes off. Right when the explosion sample starts playing, you wouldn't hear that the music is taken to a lower volume, but at the end of the sample, when the explosion fades out to silence, you will go from two channels playing – music and explosion (which is now completely silent at the very end!) down to only playing the music channel, the volume will suddenly pop to double volume instantly, which will be VERY noticable."

Why would the music become louder? This would imply the music was reduced prior (known as "ducking").
If you have an audio budget of two, then the music is already at 50% and the explosion is at 50%. Combined that is 100% (no need to use sum average).

Now you might say that there might be music (2 channels) + 20 sound effects (across left and right).
A sound budget there would imply that the music gets only 10% of the volume.
If this is not desirable (usually it is not), then you need to reserve 50% of the volume for the music and the other 50% for the effects.
This means that each of those 20 sound effects can only use about 5% of the total volume.

If there are 20 bullets being fired at once then this is correct and should sound as expected.
But if there are 19 bullets and 1 explosion, then what? In that case you must duck the bullet sounds during most of the explosion.
This is why the RMS or loudness is so critical – you need to store that value so your audio engine can take it into account.
If the RMS of the 19 bullets is twice that of the explosion, then the volume of the bullets only needs to be halved and the explosion halved (total volume would thus be 100%).

Also, if you suddenly get 10 explosions, in real life that would not be 100% x 10. The same sound played twice increases the SPL (Sound Pressure Level) by about 3 dB. 10 dB is perceived by humans as twice as loud. Just for reference, the smallest volume difference perceived by a human under normal circumstances is about 0.1 dB. Movie theaters have the loudness set to 83 dB SPL. Listening for more than 8 hours at 83 dB SPL can permanently damage your ears. A quick calculation shows that about 72 dB is more or less safe for 24/7/365 exposure without causing hearing loss. Most humans prefer the loudest and softest sounds to be within 30 dB of each other. This means that if your RMS is at -23 dBFS then your audio peaks should be from -8 dBFS to -38 dBFS, giving about 7 dB of headroom for extreme peaks (you should always leave the top 1 dB below 0 dB free, as lossy encodings tend to cause rounding issues and sometimes clip when decoded).
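
The dB arithmetic used here is easy to keep straight in code. A small sketch of just the standard conversions, nothing specific to any engine:

#include <math.h>

// Convert a level change in dB to a linear amplitude gain, and back.
float dbToGain(float db)   { return powf(10.0f, db / 20.0f); }
float gainToDb(float gain) { return 20.0f * log10f(gain); }

// Doubling acoustic power adds 10*log10(2), i.e. about 3 dB.
const float kDoubledPowerDb = 10.0f * log10f(2.0f);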

10 explosions at once, if compressed to make the full 1.0 volume fit (10 x 1.0 = 10.0 in floating point), would mean severe compression, and the explosions would sound as if the microphone was broken or the speakers were damaged, due to the distortion caused by the compression. Squeezing the loudness down from +20 dBFS to full scale (1.0) will do horrible things to the quality.
There is a reason why some shooters use a high-pitched beeping sound to indicate aural overload (and possibly simulated temporary ear damage – a good thing they aren't simulating tinnitus as well).

Now if you had an audio/sound budget then you would have room for those 10 explosions going off at once. But if the budget is 10 sounds at once, that means your game would sound ten times quieter than a game crushing the audio together so it fits. But once players figure out that they have a volume knob they can turn, they will find that the game sounds ten times better (as no compression is used/needed).

If you need a marketable term to label it with you could call it HiFi Mode or Audiophile Mode or Full Dynamics mode with a note that the player will need to crank up the volume of their system.

Just try it out yourself.
Take 10 explosions, let them overlap somewhat unevenly, and use your calculation to mix them into a sample; let's call it Loud.wav.
Now do the same again – 10 explosions that overlap somewhat (identically to how you did it a moment ago if possible), do a traditional sum average, and call the sample Dynamic.wav.
Warning: Loud.wav needs to be reduced in volume by about 20 dB before playback to match the level of Dynamic.wav. Crank your system volume to, um, 4 times as loud as you would normally have it. Listen to the two samples (use both headphones and speakers when comparing).
If one of the two sounds significantly louder than the other, then I suggest you apply an RMS normalization to both so you can compare quality to quality, as louder sounds better than quieter (our brains are stupid like that).
From quick tests here I can easily hear how flat and distorted Loud.wav sounds, and the bass is squished to nothing, while Dynamic.wav truly sounds dynamic and the bass is nice and rumbly.

I also noticed something else: 10 explosions at once sound nowhere near as cool as one really loud one, so with a sound budget of 10 you would maybe do 1 explosion using up 8 slots while two others get 1 each. This lets you give the nearest explosion dominance – it will sound awesome and the other two add color.

The more you mix sounds together, the more you get a cacophony; each individual sound loses what makes it distinct, its individuality. 10 explosions at once (using the same sound) sound like an odd echo effect rather than 10 different explosions. Less is more, at least in this case.

Also, you would normally not set off 10 explosion sounds at max volume (the first bomb to hit the player would kill them – no need to play the sound of the other 9, is there?).
With distance added, the other explosions are further away; after all, if they are too close then the explosions will bleed into each other and sound like a single one, so playing ten would be a waste.

Also, if you are worried about, say, 20 guns being shot at the same time and then gun 21 appears, you can stop gun 1 and allow gun 21 to be heard, or alternatively not play gun 21's sound at all. Now, if gun 21 is visually in view for the player this could be noticeable, so stopping the sound of gun 1, which is not visible, would be smarter.

I do realize this demands a more advanced audio engine – you will need to do an RMS or EBU R128 loudness scan of all the sounds so the audio engine can use that information when mixing and balancing the audio loudness budget against the limit on simultaneous samples/channels.

Also, if somebody is sitting on a bus playing a game, neither the quality nor the loudness will matter that much anyway; it is in quieter environments that you want the quality to shine through.
There is also a game CPU budget that can affect this, so an enable/disable "Dynamic Audio" option in the game's audio settings might make sense, perhaps enabled automatically if the hardware is detected as good enough to handle it.

Today one simply cannot do cold mixing (by cold mixing I refer to just mixing two sounds and sending them through some magic "box" to fix things). Instead, one needs to design the soundscape: use/add audio cues or hints in the environment, use RMS/loudness info stored with the samples, use alternative sounds if certain combos are used, use ducking, and use an audio/sound slot budget.

And for the love of all that is holy, if there is dialog or narration, please make sure the dialog volume can be adjusted. A recent game – a huge-budget one with heavy marketing and a triple-A budget behind it – did not allow adjusting the dialog volume; if it had not been for the dialog subtitles, I would have had no idea what the characters were saying, as the music and environment sounds drowned out the dialog.

As for me, I am coding routines for live streaming audio mixing and related mixing tools (using pre-calculated RMS, fade-in and fade-out duration cues, and silence detection) on recorded music, mixed with live microphone/line-in audio. It has to sound balanced and evenly matched, in realtime, without any loss in dynamics; this obviously means the volume level will be quieter than the competition, but in return the quality will be way better.
As for music production and linear audio editing, I am also working on a future music album (or albums), and due to the special way I'm making it I have to impose a sound slot budget on myself so I don't end up layering 100 sounds on top of each other, which would just be a muddy mess if done wrongly.

You may not know what the player is doing at any given time, but neither do the script guys or graphics artists. Audio is an integral part of the game itself; it's not just something sitting on the side that audio is flung into. Audio must be part of the game environment itself. If a sound is on the other side of a wall it must be muffled or lower.
If a huge explosion is going off right next to you, there is no way the player should be able to hear the guard snoring in the next room, so why mix in the snoring at all? The only exception would be if there were a giant monster snoring as loud as an explosion.

As an audio artist you have to direct the audio of the game; as an audio programmer you have to direct it, or make it possible to direct it.

Once this is taken into account in your audio engine then all your future work will benefit from it.

Damn. I'm blabbing on again aren't I?...

Dennis Gustafsson
"But if there is 19 bullets and 1 explosion, then what? In that case you must duck the bullet sounds during most of the explosion." This is exactly what compression is, but instead of precomputing the RMS and adjusting manually it's all automatic based on the dynamic content of what is being played.

I like your idea of a "hifi mode", but even with that I think it would be hard to adjust everything manually and make it fit in a predefined budget. There are just too many sounds where you don't know when they will be played or how loud.

About prioritizing sounds – taking away the snoring guard sound while the explosion is playing IS a form of compression. If your snoring sound is less loud a compressor will automatically prioritize the explosion sound.

Roger Haagensen
"If your snoring sound is less loud a compressor will automatically prioritize the explosion sound."

But it is still playing despite not being audible, so it's eating up a "channel" or slot.
Also, the snoring sound and the explosion may be equally loud (if stored as 16-bit there is a digital max value per sample); in fact an explosion is momentary while snoring is more or less continuous.

"I think it would be hard to adjust everything manually and make it fit in a predefined budget"
Only if your entire game is procedural and the content/levels are user-made, directly or indirectly.
Usually a game has designed levels – you know how many enemies there are or can be.
The sound slot budget can even be different per level or area as needed.

Also note that, depending on the hardware/software/drivers, there is already a limit (64, 128 or 256) on the number of channels that are possible, and they do not use compression when mixing those: if 64 of 128 channels are in use then the channels are reduced internally in relation to the number of channels, and if all 128 are used the volume is reduced even more.

BTW! Did you do the test I asked about? You can hear the distortion that occurs with just 10 sounds at once when using compression vs. just sum-and-average – now imagine what would happen with 128 sounds at once with compression; my ears hurt just thinking about it.

Compression is great when you want to reduce or flatten peaks, or if you intentionally want distortion, but using it to reduce volume is just wrong (unless you want that distortion) – you are basically adding software clipping by doing that (you are clipping the tops off the waveforms).

Although I haven't done any tests on this, just doing sum-and-average with slot management seems likely to use less CPU than a compression routine would.

PS! None of the hardware nor any of the software manufacturers (ranging from professional gear to home) use compression to reduce volume for mixing; they all use sum-and-average, they all have a channel/sound slot budget, and they usually have some mixing headroom too (ranging from 1 dB to usually 3 dB, and in some cases 6 dB or 11 dB) to handle intersample peaks and thus avoid clipping/distortion.
FMOD, MILES, BASS, etc. all use sum-and-average as well. If some guy back in 2000 found a better way, then why is nobody using it professionally?
The only ones I see using it are (no offense intended) amateurs, and on the net this method pops back up every few years; then somebody points out that you actually damage the audio quality doing that, then a few years pass and up it pops again.

I'm not trying to knock you or your audio engine; I'm just saying that doing it the way you did gives your engine worse sound quality than other audio engines, which is not fair to you/the game/company/customers. Quality is a competitive edge today: the better the quality and the fewer the bugs, the better the reviews and word of mouth you will get.
"Compression" is usually an option the user can enable (great in cases where they are gaming, listening to music, or watching movies on the bus, for example).

Dennis Gustafsson
No offense taken. You might not have realized it, but these are not just vague ideas I have about audio – they are the actual techniques used in Smash Hit, one of the most popular mobile games this year, with over 50 million players and excellent reviews, especially for the audio.

Let's just say you and I have different opinions on this one, have a nice day!

Sauli Lehtinen
You are both correct here. There are good practices, and then there are things that just work. In the case of Smash Hit, a sort of loudness-warrey (what an anti-word) approach is IMO justified as an artistic decision.

Sauli Lehtinen
And as a side note, I really dislike the idea that non-linear media should try to follow the loudness level targets of linear media. No matter how hard you try to target something like the EBU R128 loudness recommendations, you will never succeed.

Sauli Lehtinen
I've had some pretty good results with sidechain- and priority-based mixing, where you assign sounds to different buses by their priority and then let the higher-priority buses turn down the volume of the lower-priority ones.

This isn't 100% foolproof and requires lots of tweaking to get desirable results, especially with long looping sounds, but I've found it kind of automatically solves many problems, like having a meaningful difference between one and five simultaneous explosions, and not having to sacrifice the detail of very quiet sounds (like clothing foley and footsteps) in a quiet environment.

Of course, if your game consists mainly of a huge amount of evenly prioritized glass breaking sounds, it doesn't solve anything :)
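
A minimal sketch of that bus-priority ducking idea, assuming each bus exposes some measure of its current level (all names here are hypothetical):

// Duck lower-priority buses based on the level of higher-priority ones.
struct Bus {
    int   priority; // higher number = more important
    float level;    // measured level of the audio currently on the bus
    float gain;     // applied when mixing the bus into the master
};

void updateDucking(Bus* buses, int count, float duckAmount)
{
    for (int i = 0; i < count; i++) {
        float aboveLevel = 0.0f;
        for (int j = 0; j < count; j++)
            if (buses[j].priority > buses[i].priority && buses[j].level > aboveLevel)
                aboveLevel = buses[j].level;
        // The louder the higher-priority buses are, the more this bus is turned down.
        buses[i].gain = 1.0f / (1.0f + duckAmount * aboveLevel);
    }
}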

