Evaluating Perceptual Coders

This article presents a technique for evaluating perceptual coders for audio bit rate reduction.

Perceptual coders, audio bitrate reduction algorithms, are known by their more common names, MP3, AAC, WMA and so on. There are a few lossless ones, such as FLAC and WMA Lossless, which don't quite fall into the category of perceptual coders, but they do reduce the bitrate of digital audio files.

If the studio master tape or the CD is the "original sound", then MP3, AAC and WMA are the LP, 8-Track, Compact Cassette, FM and AM radio of the early 21st Century. All of these technologies delivered sound to the end user with varying levels of fidelity, and were popular or unpopular with consumers for various technical reasons (availability, reliability, convenience) that may or may not include audio fidelity.

Perceptual coders use perceptual models (a set of rules that describe a statistical model of typical human hearing). In order to reduce the bandwidth of an audio file by any significant amount, data must be discarded. Missing data is an "error"; it adds noise to the signal that wasn't there in the original audio. The perceptual coders try to hide the noise behind the remaining audio in such a way that we humans don't perceive the loss -- or at least, we're not too annoyed by it.

The merits and drawbacks of the various coders are hotly debated on forums, resulting in quite a lot of superstition and religious wars. There are several reasons for this. First of all, the coders have specific objectives. A particular coder may not be optimized for what the user is asking it to do. Coders also use different perceptual models and algorithms to accomplish the bit rate reduction. It is a fact of life that full-bandwidth audio is not suitable for all applications. It must be bandwidth limited, and this involves trade-offs. The coders attempt to minimize the trade-offs by making the missing data as unobtrusive as possible. Competing algorithms do this using subjective rules, which will always be, well, subjective. Not everyone agrees on what sounds best.

Not everyone fits the accepted statistical models that the coders use to hide the error noise. Some people have types of hearing loss that "unmask" the distortion that was so carefully hidden by the coder. Some people are more sensitive to distortion than others. Equalization settings, speaker and headphone characteristics, as well as a well-trained ear can all make the perceptual coders less effective. By "less effective", I mean that for a given listener not to notice the loss in fidelity, a higher bit rate is needed. For some coders, a trained ear might be able to recognize a particular "signature" that is introduced by the algorithm no matter what bit rate is used.

The usual way to evaluate perceptual coders is to compare them in double-blind A/B listening tests. This usually requires a rather complex laboratory setup with specially calibrated sound systems and a wide array of sound clips, in which trained and un-trained listeners are asked to provide a "better or worse" judgement of sound A vs. sound B. Since the sound is constantly changing, it requires many repetitions and statistical analysis to determine which algorithm is "better" or "worse". Even then, it isn't always easy for the algorithm designers to understand why listeners consider something "better" or "worse", let alone what to do about it.

In the end, listening is the final test, because listening is what the coders are for. But I became curious about just what it is that separates the encoded audio from the original sound. I wanted to know in real time, what sound is being discarded. We know what the original sounds like, and we know what the encoded audio sounds like, because we hear it all the time. But what does the difference sound like? Knowing this would help me to better understand what the coder is doing at every instant, and it might also help train my ear to better hear the artifacts. It also might destroy my enjoyment of the format, but enhanced knowledge has its risks.

So, how do we hear what's missing? It's quite simple, really. All you need is a sound file editor, such as Audacity, that allows you to import and export various encoding formats, normalize, invert the phase, time shift, and sum channels. These are all fairly standard operations.

In order to hear the difference between the original sound and the encoded sound, we need only subtract one from the other, sample by sample. You do this by inverting the phase (making each sample the additive inverse) of the original sound and summing that with the encoded version. Basic algebra tells us that adding a negative number (summing) is the same as subtraction. You can listen to the resulting sound file to hear the audio that the encoder actually discarded in order to reduce the bandwidth.
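For readers who would rather see the arithmetic than take it on faith, here is a minimal sketch of the invert-and-sum step in Python. It is not what Audacity does internally; it simply assumes both tracks are already decoded to lists of integer samples of the same length and already aligned. The function name `difference` and the sample values are made up for illustration.

```python
def difference(original, encoded):
    """Invert the original and sum it with the encoded track.

    The result is encoded - original, sample by sample.
    Assumes both tracks are equal-length, aligned lists of integer samples.
    """
    if len(original) != len(encoded):
        raise ValueError("tracks must be the same length and already aligned")
    inverted = [-s for s in original]                    # "invert the phase"
    return [e + i for e, i in zip(encoded, inverted)]    # sum the two tracks


# Hypothetical sample values, for illustration only.
original = [0, 1000, -2000, 3000]
lossless = [0, 1000, -2000, 3000]   # a bit-accurate copy
lossy = [0, 990, -2010, 3005]       # a copy with small errors

print(difference(original, lossless))  # all zeros: nothing was discarded
print(difference(original, lossy))     # the residual the coder has to hide
```

A bit-accurate copy nulls to silence, while any deviation shows up directly as a nonzero residual.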

There are a few caveats: In order for this to work, you have to encode and decode the original audio yourself. You probably cannot compare an encoded track from some third party source with a CD track that you happen to have. There are too many variables that would make it impossible to do a sample-by-sample comparison, which is what we are doing here. We want to know what the perceptual coder did to the audio, not whatever else might have happened to it on its long and winding road to your computer. Also, I have found that the coders produce some time shifting in the resulting files, so some inspection is necessary to re-establish the correct sample alignment. If the samples are not in perfect alignment, the entire comparison is invalid. Especially when comparing very low bit rate encoded files, it can get very difficult to find the correct alignment, because the encoded waveforms are so different from the original.
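I did the alignment by eye, but the search can also be sketched in code. The following is one possible approach, not the method used for this article: try each candidate shift and keep the one that leaves the smallest residual. It assumes the encoded track starts later than the original (never earlier) and that the true shift is no larger than `max_lag`. The function name `find_offset` and the sample values are hypothetical.

```python
def find_offset(original, encoded, max_lag):
    """Return the delay of `encoded` relative to `original`, in samples.

    Brute-force search: for each lag in 0..max_lag, compute the mean
    squared difference over the overlapping samples and keep the lag
    with the smallest error. Assumes `encoded` is delayed, not advanced.
    """
    best_lag, best_err = 0, float("inf")
    for lag in range(max_lag + 1):
        n = min(len(original), len(encoded) - lag)   # overlapping samples
        if n <= 0:
            break
        err = sum((encoded[i + lag] - original[i]) ** 2 for i in range(n)) / n
        if err < best_err:
            best_lag, best_err = lag, err
    return best_lag


# Hypothetical data: the "encoded" track is the original delayed by 3 samples.
original = [0, 5, -3, 8, 2, -7, 4, 1]
encoded = [0, 0, 0] + original

lag = find_offset(original, encoded, max_lag=5)
aligned = encoded[lag:lag + len(original)]   # trim so the samples line up again
print(lag, aligned == original)
```

With a real lossy file the minimum error is never zero, and at very low bit rates the minimum can be shallow, which matches the difficulty described above.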

I began by ripping track 05 Money, from a copy of Pink Floyd's Dark Side of the Moon (CDP 7 46001 2) that I purchased ca. 1984, shortly after CDs first came out. I chose this track because it is familiar to many people, it is a high quality recording, and it has many interesting percussive sound effects (coins jingling, cash register, etc.) as well as vocals, guitars, drums, synth and sax. I ripped the track to 16-bit WAV format, which is a bit-accurate uncompressed format. The bit stream is identical to whatever was on the CD. I imported this into Audacity, normalized it, and exported it in the following formats:
  • WAV (reference)
  • FLAC (lossless)
  • LAME MP3 Medium Preset (Variable Bit Rate 145~185 kbps)
  • LAME MP3 Standard Preset (Variable Bit Rate 170~210 kbps)
  • LAME MP3 Extreme Preset (Variable Bit Rate 220~260 kbps)
  • LAME MP3 Insane Preset (320 kbps)
  • LAME MP3 (Constant Bit Rate 32 kbps) (minimum for 44.1 kHz sample rate)
  • LAME MP3 (Constant Bit Rate 64 kbps)
  • LAME MP3 (Constant Bit Rate 128 kbps)
  • LAME MP3 (Constant Bit Rate 256 kbps)
  • WMA Constant Bit Rate 32 kbps (minimum for 44.1 kHz sample rate)
  • WMA Constant Bit Rate 64 kbps
  • WMA Constant Bit Rate 128 kbps
  • WMA Constant Bit Rate 256 kbps
  • WMA Constant Bit Rate 320 kbps (max)
Then, I inverted the phase of the normalized WAV reference, and exported that as a 16-bit WAV file (inverted WAV reference). Finally, for each encoding listed above, I opened the inverted WAV reference file in Audacity, and imported the encoded version. I restored the correct sample time alignment and exported the sum as a difference file for each encoding.
To reduce download times, I converted these files to high quality M4A (AAC) files, fine for listening. The original exact WAV files are available on request. Note: these files are on an FTP server. Your browser might not play, or even download them automatically. If not, right click on the link, and select "Save target as..." (or words to that effect), and play the file from your download location.

Analysis, Mr. Spock
As would be expected, the first two files (comparison to WAV and FLAC) are completely silent, indicating no difference from the original, i.e., that they are a bit-accurate representation of the original sound. This serves as a control for the experiment, and validates the methodology. It also proves to any doubters (and I still see them on the forums regularly), that FLAC is indeed lossless, but smaller than WAV (44,448,836 bytes vs. 6,068,064,236 bytes). It isn't magic; it's just more efficient than WAV at storing the data.

The rest of the files have varying amounts of audible content, representing infidelity to the original sound. I have not normalized or scaled these files, so the residual levels are representative of the actual amount of noise that needs to be masked by your perception of sound -- as estimated by the perceptual coder. If you have a media player or audio editor with sound level or VU meters, you can see how far down in dB the noise level is (dBFS -- compared to the original sound, which I normalized to 0 dB or full scale).
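If you don't have meters handy, the level of a difference file can be estimated directly from its samples. This is a rough sketch, assuming 16-bit signed samples (so full scale is 32768) held in a plain list. The function names and sample values are made up for illustration, and a real meter may use different ballistics or weighting.

```python
import math

FULL_SCALE = 32768  # magnitude of full scale for 16-bit signed samples (assumption)


def peak_dbfs(samples):
    """Peak level in dB relative to full scale; -inf for digital silence."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")
    return 20 * math.log10(peak / FULL_SCALE)


def rms_dbfs(samples):
    """RMS level in dB relative to full scale; -inf for digital silence."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20 * math.log10(rms / FULL_SCALE)


# Hypothetical residual peaking at half of full scale: about -6 dBFS.
residual = [0, 16384, -8192, 4096]
print(round(peak_dbfs(residual), 2), round(rms_dbfs(residual), 2))

# A perfectly nulled (lossless) difference file is digital silence.
print(peak_dbfs([0, 0, 0, 0]))
```

The lower the figure, the less noise the coder is asking your ear to ignore.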

As you might expect, the lower bit rate encodings are noisier than the higher bit rate ones. In addition, WMA seems to do a better job (to my ear) of making the errors less "jarring", by smoothing them out better than MP3 (although this might merely be a preference, and it doesn't mean that it sounds that way when listening to the normal encoding).

These files are interesting, but they represent the encoding error, completely unmasked. The whole idea of perceptual encoding is to mask the error behind the remaining audio in such a way that the listener doesn't hear the error, or doesn't notice the error, or doesn't find it too distracting, or can still stand to listen to it, or... But I do find it interesting to know what it is that the encoder is trying to hide.


  1. Thanks, I appreciate you doing this analysis and posting the findings on your blog. I have been using LAME (as an add-on app for iTunes) for my audio files to put on my portable players, either with the INSANE or EXTREME settings. It is interesting to see (in Audacity with the wave forms) and hear what is being lost when compressing files, in the song you tested. I may venture to use something closer to a lossless format.

    I wonder if it matters, though, if I'm using an ipod classic. Does an ipod truly have the capability through its headphone jack to produce accurate sound quality from a lossless file (if using quality headphones)?

    When I tried comparing apple lossless files and high quality (320 kbps LAME) mp3 files on my ipod in the past (not blind tested, though), I could not tell if I was hearing a difference, or imagining a difference, or if I would really miss any difference, real or imagined.

    I seem to remember hearing a difference in the sound when listening for the "echo" of the instruments. I felt like the lossless audio might have had more "space", that the mp3 lacked slightly. The last thing I compared was a Phil Collins track from "...But Seriously", and I listened for the drums, and it seemed the lossless file had more spaciousness around the drums and acoustics, than the mp3. But it FELT like a slight difference. I wasn't sure if I would really miss it, or notice it, since I was comparing on my grado sr80i headphones, which are not what I use all the time, and I really had to concentrate to hear a difference. If I was doing a blind test of the files I probably would not have chosen one as being better than the other, but I don't know.

    Anyway, downloading and listening to some of your files from the test you did seems to confirm that I may very well have been hearing a difference in the sound from my own listening test, especially in regards to the spaciousness and "echo".

    1. Black.Ink.On.Paper,

      Most people use lossy compression where storage space is at a premium, while keeping the bitrate as high as possible. This is a compromise, of course. I use lossless for archival storage, and convert to a lossy format when copying to a mobile device.

      My iPod shuffle exhibits extremely high quality performance with good quality headphones. AAC 256 VBR sounds quite good, although I haven't done any ABX listening tests. Let's just say nothing jumps out to annoy me and reduce my enjoyment.

      When I get a chance, I might go back and repeat this test with AAC at various bitrates.

  2. Hi,
    I listen a lot to DI Premium (Digitally Imported), from my iPad through my Apple TV to my receiver.
    Which quality would be best?
    The options are:

    MP3: 320 kbits/sec
    AAC: 128 kbits/sec
    AAC-HE: 64 and 40 kbits/sec

    Thanks !

  3. I took a listen to the differences. I think you left out one difference that wasn't accounted for: volume. In realistic tests, volume matching should occur. In many of the non-silent recordings, the difference between the tracks above the noise floor is actually audible; it has a steady beat, and even some vocals can be made out. It's actually pretty cool!

    1. The encoded and the original sound were precisely matched. The fact that the difference files have different loudness is part of the result set. The difference files directly represent the noise created by the encoding process, which would ideally be zero -- absolute silence.