Perceptual coders -- audio bitrate reduction algorithms -- are better known by their common names: MP3, AAC, WMA and so on. There are also a few lossless formats, such as FLAC and WMA Lossless, which don't quite fall into the category of perceptual coders, but they do reduce the bitrate of digital audio files.
If the studio master tape or the CD is the "original sound", then MP3, AAC and WMA are the LP, 8-Track, Compact Cassette, FM and AM radio of the early 21st Century. All of these technologies delivered sound to the end user with varying levels of fidelity, and each was popular or unpopular with consumers for various technical reasons (availability, reliability, convenience) that may or may not have included audio fidelity.
Perceptual coders use perceptual models (a set of rules that describe a statistical model of typical human hearing). In order to reduce the bandwidth of an audio file by any significant amount, data must be discarded. Missing data is an "error"; it adds noise to the signal that wasn't there in the original audio. The perceptual coders try to hide the noise behind the remaining audio in such a way that we humans don't perceive the loss -- or at least, we're not too annoyed by it.
The merits and drawbacks of the various coders are hotly debated on forums, resulting in quite a lot of superstition and religious wars. There are several reasons for this. First of all, the coders have specific objectives. A particular coder may not be optimized for what the user is asking it to do. Coders also use different perceptual models and algorithms to accomplish the bitrate reduction. It is a fact of life that full-bandwidth audio is not suitable for all applications. It must be bandwidth limited, and this involves trade-offs. The coders attempt to minimize the trade-offs by making the missing data as unobtrusive as possible. Competing algorithms do this using subjective rules, which will always be, well, subjective. Not everyone agrees on what sounds best.
Not everyone fits the accepted statistical models that the coders use to hide the error noise. Some people have types of hearing loss that "unmask" the distortion that was so carefully hidden by the coder. Some people are more sensitive to distortion than others. Equalization settings, speaker and headphone characteristics, as well as a well-trained ear can all make the perceptual coders less effective. By "less effective", I mean that a higher bit rate is needed for a given listener not to notice the loss in fidelity. For some coders, a trained ear might be able to recognize a particular "signature" that is introduced by the algorithm no matter what bit rate is used.
The usual way to evaluate perceptual coders is to compare them in double-blind A/B listening tests. This usually requires a rather complex laboratory setup with specially calibrated sound systems and a wide array of sound clips, in which trained and untrained listeners are asked to provide a "better or worse" judgement of sound A vs. sound B. Since the sound is constantly changing, it requires many repetitions and statistical analysis to determine which algorithm is "better" or "worse". Even then, it isn't always easy for the algorithm designers to understand why listeners consider something "better" or "worse", let alone what to do about it.
In the end, listening is the final test, because listening is what the coders are for. But I became curious about just what it is that separates the encoded audio from the original sound. I wanted to know in real time, what sound is being discarded. We know what the original sounds like, and we know what the encoded audio sounds like, because we hear it all the time. But what does the difference sound like? Knowing this would help me to better understand what the coder is doing at every instant, and it might also help train my ear to better hear the artifacts. It also might destroy my enjoyment of the format, but enhanced knowledge has its risks.
So, how do we hear what's missing? It's quite simple, really. All you need is a sound file editor, such as Audacity, that allows you to import and export various encoding formats, normalize, invert the phase, time-shift, and sum channels. These are all fairly standard operations.
To hear the difference between the encoded sound and the original, we need only subtract the encoded sound from the original sound, sample by sample. You do this by inverting the phase of the original sound (making each sample its additive inverse) and summing that with the encoded version. Basic algebra tells us that adding a negative number is the same as subtracting. You can listen to the resulting sound file to hear the audio that the encoder actually discarded in order to reduce the bandwidth.
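The subtraction above can be sketched in a few lines of code. This is only an illustration of the arithmetic, not what Audacity does internally; the function name is mine, and it assumes the tracks are already loaded as sequences of 16-bit signed samples:

```python
import array

def residual(original, encoded):
    """Invert the original and sum it with the encoded version:
    -original + encoded, which equals encoded - original.
    The difference is clamped to the 16-bit range, as a sample
    editor would do when summing full-scale signals."""
    assert len(original) == len(encoded), "tracks must be sample-aligned"
    return array.array('h', (max(-32768, min(32767, e - o))
                             for o, e in zip(original, encoded)))

# Control case: a bit-accurate (lossless) copy nulls out to pure silence.
lossless = array.array('h', [100, -250, 3000, -32768])
print(all(s == 0 for s in residual(lossless, lossless)))  # True
```

The control case is exactly why the WAV and FLAC comparisons described later come out silent: identical samples subtract to zero everywhere.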
There are a few caveats: In order for this to work, you have to encode and decode the original audio yourself. You probably cannot compare an encoded track from some third-party source with a CD track that you happen to have. There are too many variables that would make it impossible to do a sample-by-sample comparison, which is what we are doing here. We want to know what the perceptual coder did to the audio, not whatever else might have happened to it on its long and winding road to your computer. Also, I have found that the coders produce some time shifting in the resulting files, so some inspection is necessary to re-establish the correct sample alignment. If the samples are not in perfect alignment, the entire comparison is invalid. Especially when comparing very low bit rate encoded files, it can get very difficult to find the correct alignment, because the encoded waveforms are so different from the original.
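Finding that alignment can be automated rather than eyeballed: slide one file against the other and keep the lag with the strongest cross-correlation. A brute-force sketch (the function name and the small search window are my own assumptions; in practice the encoder delay is usually modest, so only a limited range of lags needs to be searched):

```python
import random

def find_offset(reference, shifted, max_shift=64):
    """Return the lag (in samples) that best aligns `shifted` to
    `reference`, found by brute-force cross-correlation."""
    n = min(len(reference), len(shifted)) - max_shift
    best_lag, best_score = 0, float('-inf')
    for lag in range(max_shift):
        score = sum(reference[i] * shifted[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Demo: delay a noise burst by 7 samples, then recover the delay.
random.seed(1)
signal = [random.randint(-1000, 1000) for _ in range(400)]
delayed = [0] * 7 + signal
print(find_offset(signal, delayed))  # 7
```

This works because the correlation peaks sharply when the two waveforms line up; with very low bit rate encodings the peak gets flatter, which matches the difficulty described above.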
I began by ripping track 05, "Money", from a copy of Pink Floyd's Dark Side of the Moon (CDP 7 46001 2) that I purchased ca. 1984, shortly after CDs first came out. I chose this track because it is familiar to many people, it is a high quality recording, and it has many interesting percussive sound effects (coins jingling, cash register, etc.) as well as vocals, guitars, drums, synth and sax. I ripped the track to 16-bit WAV format, which is a bit-accurate uncompressed format: the bit stream is identical to whatever was on the CD. I imported this into Audacity, normalized it, and exported it in the following formats:
- WAV (reference)
- FLAC (lossless)
- LAME MP3 Medium Preset (Variable Bit Rate 145~185 Kbps)
- LAME MP3 Standard Preset (Variable Bit Rate 170~210 Kbps)
- LAME MP3 Extreme Preset (Variable Bit Rate 220~260 Kbps)
- LAME MP3 Insane Preset (320 Kbps)
- LAME MP3 (Constant Bit Rate 32Kbps) (minimum for 44.1KHz sample rate)
- LAME MP3 (Constant Bit Rate 64Kbps)
- LAME MP3 (Constant Bit Rate 128Kbps)
- LAME MP3 (Constant Bit Rate 256Kbps)
- WMA Constant Bit Rate 32Kbps (minimum for 44.1KHz sample rate)
- WMA Constant Bit Rate 64Kbps
- WMA Constant Bit Rate 128Kbps
- WMA Constant Bit Rate 256Kbps
- WMA Constant Bit Rate 320Kbps (max)
- Original Sound Inverted + WAV
- Original Sound Inverted + FLAC
- Original Sound Inverted + LAME MP3 Medium Preset (Variable Bit Rate 145~185 Kbps)
- Original Sound Inverted + LAME MP3 Standard Preset (Variable Bit Rate 170~210 Kbps)
- Original Sound Inverted + LAME MP3 Extreme Preset (Variable Bit Rate 220~260 Kbps)
- Original Sound Inverted + LAME MP3 Insane Preset (320 Kbps)
- Original Sound Inverted + LAME MP3 (Constant Bit Rate 32Kbps) (minimum for 44.1KHz sample rate)
- Original Sound Inverted + LAME MP3 (Constant Bit Rate 64Kbps)
- Original Sound Inverted + LAME MP3 (Constant Bit Rate 128Kbps)
- Original Sound Inverted + LAME MP3 (Constant Bit Rate 256Kbps)
- Original Sound Inverted + WMA Constant Bit Rate 32Kbps (minimum for 44.1KHz sample rate)
- Original Sound Inverted + WMA Constant Bit Rate 64Kbps
- Original Sound Inverted + WMA Constant Bit Rate 128Kbps
- Original Sound Inverted + WMA Constant Bit Rate 256Kbps
- Original Sound Inverted + WMA Constant Bit Rate 320Kbps (max)
Analysis, Mr. Spock
As would be expected, the first two files (the comparisons to WAV and FLAC) are completely silent, indicating no difference from the original, i.e., that they are a bit-accurate representation of the original sound. This serves as a control for the experiment and validates the methodology. It also proves to any doubters (and I still see them on the forums regularly) that FLAC is indeed lossless, but smaller than WAV (44,448,836 bytes vs. 68,064,236 bytes). It isn't magic; it's just more efficient than WAV at storing the data.
The rest of the files have varying amounts of audible content, representing infidelity to the original sound. I have not normalized or scaled these files, so the residual levels are representative of the actual amount of noise that needs to be masked by your perception of sound -- as estimated by the perceptual coder. If you have a media player or audio editor with sound level or VU meters, you can see how far down in dB the noise level is (dBFS -- compared to the original sound, which I normalized to 0 dB or full scale).
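If your player or editor has no meters, the residual level is simple to compute directly. A minimal sketch, assuming 16-bit samples and taking the RMS level relative to digital full scale (the function name is mine):

```python
import math

def rms_dbfs(samples, full_scale=32768.0):
    """RMS level of 16-bit signed samples, in dB relative to full scale.
    Digital silence returns -infinity."""
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float('-inf')
    return 10 * math.log10(mean_square / (full_scale * full_scale))

# A half-scale square wave sits about 6 dB below full scale.
print(round(rms_dbfs([16384, -16384] * 100), 1))  # -6.0
```

Running this over each residual file gives a single dBFS figure per encoding, which makes the bit-rate comparison in the next paragraph easy to quantify.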
As you might expect, the lower bit rate encodings are noisier than the higher bit rate ones. In addition, WMA seems to do a better job (to my ear) of making the errors less "jarring", by smoothing them out better than MP3 (although this might merely be a preference, and it doesn't mean that it sounds that way when listening to the normal encoding).
These files are interesting, but they represent the error encoding, completely unmasked. The whole idea of perceptual encoding is to mask the error behind the remaining audio in such a way that the listener doesn't hear the error, or doesn't notice the error, or doesn't find it too distracting, or can still stand to listen to it, or... But I do find it interesting to know what it is that the encoder is trying to hide.