Back in early 2000 I was approached by a publisher to write a book about MP3 audio technology; I only wrote the first two chapters before my senior project duties eclipsed book-writing and I needed to shelve the project indefinitely. 20 years later, in 2020, I’ve resurrected what I had written and am re-publishing it.
The Guts of Music Technology
In this section and the ones following, things are going to get increasingly technical. I’m going to start off pretty simple and slowly ramp up to some considerably involved topics, so please feel free to skip the parts that you already know to get to the juicy stuff. It’s possible that you may find some parts overwhelming. Don’t worry yourself too much about it, just feel free to simply skim. To make this easy for you, I’ve bolded the key definitions throughout the text. And if you get bored? Just go to the next chapter. Nobody’s quizzing you on this!
Digital Audio Basics
Computers work by passing small charges through aluminum trenches etched in silicon and shoving these charges through various gates: If this charge is here and that one is too then the chip will create a charge in another place. The computer does all of its computations in ones and zeroes. Integers, like -4, 15, 0, or 3, can be represented with combinations of ones and zeroes in an arithmetic system called binary. Humans normally use a “decimal” system with ten symbols per space: we count 1, 2, 3,…8, 9, 10, 11. In the binary system there are only two symbols per space: one counts 1, 10, 11, 100, 101, 110, 111, 1000, etc.!
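If you'd like to see the binary counting for yourself, here is a minimal Python sketch. The helper name `to_binary` is my own invention for illustration; it simply leans on Python's built-in number formatting.

```python
# Count from 1 to 8 in both decimal and binary.
def to_binary(n: int) -> str:
    """Return the binary digits of a non-negative integer as a string."""
    return format(n, "b")

for n in range(1, 9):
    print(n, "->", to_binary(n))
```

Going the other way, `int("1000", 2)` turns the binary string back into the decimal number 8.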
If the computer is to understand how to store music, music must be represented as a series of ones and zeroes. How can we do this? Well, one thing to keep in mind throughout all of this discussion is that we’re going to be focusing on making music for humans to hear. While that may sound trite, it will allow us to “cheat” and throw out the parts of the music that people can’t hear: a dog might not be able to appreciate Mozart as much after we’re done with things, but if it sounds just the same to an average Jane, then we’ve accomplished our true mission - to have realistic music come from a computer!
We first need to understand what sound is. When you hear a sound, like a train whistle or your favorite hip-hop artist, your eardrum is getting squished in and out by air. Speakers, whistles, voices, and anything else that makes sound repeatedly squishes air and then doesn’t. When the sound gets to your ear, it pushes your eardrum in and out. If the air gets squished in and out at a constant rate, like 440 times a second, you’ll hear a constant tone, like when someone whistles a single note. The faster the air gets squished in and out, the higher tone you hear; likewise, the low bass tones of a drum squish the air in and out very slowly, about 50 times a second. Engineers use the measurement Hertz, abbreviated Hz, to mean “number of times per second” and kilohertz, or kHz, to mean “thousands of times per second.” Some people with very good hearing can hear sounds as low as 20Hz and as high as 20kHz. Also, the more violently the air is compressed and decompressed, the louder the signal is.
Now we can understand what a microphone does. A microphone consists of a thin diaphragm that acts a lot like your eardrum: as music is being played, the diaphragm of the microphone gets pushed in and out. The more pushed in the diaphragm is, the more electrical charge the microphone sends back to the device into which you’ve plugged your mic. What if you plug the mic into your computer? The computer is good at dealing with discrete numbers, also known as digital information, but the amount that the microphone is being compressed is always changing; it is analog information. There is a small piece of hardware in a computer that allows it to record music from a microphone: it is called an Analog-to-Digital Converter, or ADC for short. It is impossible for us to record a smooth signal as ones and zeroes and reproduce it perfectly on a computer. The ADC does not attempt to perfectly record the signal. Instead, several thousand times a second it takes a peek at how squished in the microphone is. The rate at which it checks on the microphone is called the sampling rate. If the microphone is 100% squished in, we’ll give it the number 64,000. If the microphone is not squished in at all, we’ll give it a 0, and we’ll assign it a number correspondingly for in-between values: halfway squished in would merit a 32,000. We call these values samples.
The Nyquist Theorem says that as long as our sampling rate is at least twice the frequency of the highest tone we want to record, we’ll be able to accurately reproduce the tone. Since humans can’t hear anything higher than 22kHz, if we sample the microphone 44,000 times a second, we’ll be able to reproduce the highest tones that people can hear. In fact, CDs sample at 44.1kHz and, as suggested above, store the amount the microphone was squished as a number between 0 and 65,535, using 16 ones and zeros, or bits, for every sample. In this way, we’d say that CDs have a sample resolution of 16 bits.
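To make sampling a little more concrete, here is a rough Python sketch of what an ADC does to a pure tone. It is only an illustration under my own assumptions: the "microphone" is simulated with a sine wave, and the names `sample_tone`, `SAMPLE_RATE`, and `MAX_LEVEL` are made up for this example.

```python
import math

SAMPLE_RATE = 44_100   # peeks per second, as on a CD
MAX_LEVEL = 65_535     # largest number that fits in 16 bits (2**16 - 1)

def sample_tone(freq_hz: float, duration_s: float) -> list[int]:
    """Sample a simulated pure tone and map each reading onto 0..65535."""
    count = round(SAMPLE_RATE * duration_s)
    samples = []
    for n in range(count):
        t = n / SAMPLE_RATE                              # time of this peek
        pressure = math.sin(2 * math.pi * freq_hz * t)   # -1.0 .. 1.0
        squish = (pressure + 1.0) / 2.0                  # 0.0 .. 1.0
        samples.append(round(squish * MAX_LEVEL))        # 16-bit sample
    return samples

# One hundredth of a second of a 440Hz 'A'.
samples = sample_tone(440.0, 0.01)
print(len(samples), samples[:5])
```

Each entry in `samples` is one 16-bit number describing how "squished in" the simulated microphone was at that instant.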
All of this data ends up taking a great deal of space: if we sample a left and a right channel for stereo sound at 44.1kHz, using 16 bits for every sample, that’s 1.4 million bits for every second of music! On a 28.8k modem, it would take you about 50 seconds to transmit a single second of uncompressed music to a friend! We clearly need a way to use fewer bits to transmit the music.
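The arithmetic is easy to check. The sketch below assumes the modem delivers its full 28,800 bits per second with no overhead, which is optimistic; the constant names are my own.

```python
SAMPLE_RATE = 44_100            # samples per second, per channel
BITS_PER_SAMPLE = 16
CHANNELS = 2                    # left and right
MODEM_BITS_PER_SECOND = 28_800  # assumed ideal throughput of a 28.8k modem

bits_per_second = SAMPLE_RATE * BITS_PER_SAMPLE * CHANNELS
seconds_to_send = bits_per_second / MODEM_BITS_PER_SECOND

print(bits_per_second)   # 1411200
print(seconds_to_send)   # 49.0
```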
Those of you comfortable with computers may suggest we use a compression program like WinZIP or StuffitDeluxe to reduce the size of these music files. Unfortunately, this does not work very well. These compression programs were designed largely with text in mind. These programs were also designed to perfectly reproduce every bit: If you compress a document to put it on a floppy, it had better not be missing anything when you decompress it on a friend’s machine! Compression algorithms work best when they know what they are compressing. Specialized algorithms can squish down video to a 100th of its original size, and people routinely use the JPEG (.JPG) compression format to reduce the size of pictures on the web. JPEG is lossy; that is to say, it destroys some data. If you scan in a beautifully detailed picture and squish it down to a small JPEG file, you will see that there are noticeable differences between the original and the compressed versions, but in general it is throwing away the information that is less important for your eye to see to understand what the picture is about. In the same way, we will get much better compression of sound if we use an algorithm that understands the way that people hear and destroys the parts of the sound that we cannot perceive. Already, we have done this in a small way by ignoring any sounds above 22kHz. We might have done things differently if we were making an audio system for a dog or a whale; we have already exploited some knowledge of the human ear to our advantage, now it comes time for us to further use this knowledge to compress the sound.
In order to compress the sound, we need to understand what parts are okay to throw away; that is to say, what the least important parts of the sound are. That way, we can keep the most important parts of the sound so we can stream them live through, say, a 28.8k modem.
Now as it turns out, sound is very tonal. This means that sounds tend to maintain their pitch for periods of time: a trumpet will play a note for half a second, a piano will sound a chord, etc. If I were to whistle an ‘A’ for a second, your eardrum may be wiggling in and out very quickly, but the tone stays constant. While recording the “wiggling” of the signal going in and out would take a great deal of numbers to describe, in this case it would be much simpler to simply record the tone and how long it went for, i.e., “440Hz (that’s A!) for 1.0 seconds.” In this way, I’ve replaced tens of thousands of numbers with two numbers.
While clearly most signals are not so compressible, the concept applies: sound pressure, or the amount that your eardrum is compressed, changes very rapidly (tens of thousands of times a second), while frequency information, or the tones that are present in a piece of music, tend not to change very frequently (32 notes per second is pretty fast for a pianist!). If we only had a way to look at sound in the frequency domain, we could probably get excellent compression.
Luckily for us, Jean-Baptiste Joseph Fourier, a 19th century mathematician, came up with a nifty way for transforming a chunk of samples into their respective frequencies. While describing the method in detail has occupied many graduate-level electrical engineering books, the concept is straightforward: if I take a small chunk of audio samples from the microphone as you are whistling, I take the discrete numbers that describe the microphone’s state and run them through a Discrete Fourier Transform, also known as a DFT. What I get out is a set of numbers that describe what frequencies are present in the signal and how strong they are, i.e., “There is a very loud tone playing an A# and there is a quiet G flat, too.” I call the chunk of samples that I feed the DFT my input window.
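For the curious, here is a small Python sketch of a DFT picking the tones out of an input window. It uses the slow textbook formula rather than anything a real codec would ship, and the sample rate, window size, and test tones are all assumptions I chose so that the numbers come out cleanly.

```python
import cmath
import math

def dft(window):
    """Naive Discrete Fourier Transform: slow, but fine for a short window."""
    n = len(window)
    return [
        sum(window[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

SAMPLE_RATE = 8_000   # assumed rate, kept low so the example runs quickly
WINDOW_SIZE = 400     # 50 ms of audio at that rate

# A loud 440Hz tone plus a quiet 1000Hz tone.
window = [
    1.0 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
    + 0.1 * math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
    for t in range(WINDOW_SIZE)
]

spectrum = dft(window)
bin_width = SAMPLE_RATE / WINDOW_SIZE   # each output number covers 20Hz
half = WINDOW_SIZE // 2                 # the second half mirrors the first
strengths = [abs(c) for c in spectrum[:half]]

loudest_bin = max(range(half), key=lambda k: strengths[k])
print("loudest tone:", loudest_bin * bin_width, "Hz")
```

The entry for 440Hz comes out ten times stronger than the entry for 1000Hz, matching the loud and quiet tones that went in.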
There is an interesting tradeoff here: if I take a long input window, meaning I record a long chunk of audio from the microphone and run it all through the DFT at once, I’ll be able to pick out what tone a user was whistling with great precision. And, just like with people, if I only let the computer hear a sound for a short moment, it will have poor frequency resolution, i.e., it will be difficult for it to tell what tone was whistled. Likewise, suppose I’m trying to nail down exactly when a user begins to whistle into a microphone. If I take short windows, I’ll be able to pick out close to the exact time when they started to whistle; but if I take very long windows, the Fourier transform won’t tell me when a tone began, only how loud it is. I’d have trouble nailing down when it began and could be said to have poor time resolution. Frequency resolution and time resolution work against each other: the more you need to know exactly when a sound happened, the less you know what tone it is; the more exactly you need to know what frequencies are present in a signal, the less precisely you know the time at which those frequencies started or stopped.
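You can put rough numbers on this tradeoff. In the sketch below (the function name `resolution` and the window sizes are my own choices), a window of N samples gives DFT outputs spaced SAMPLE_RATE / N apart in frequency, but it also smears everything that happened during those N samples together.

```python
SAMPLE_RATE = 44_100   # samples per second

def resolution(window_size: int) -> tuple[float, float]:
    """Return (frequency spacing in Hz, window length in ms) for a window."""
    freq_res_hz = SAMPLE_RATE / window_size
    time_res_ms = 1000 * window_size / SAMPLE_RATE
    return freq_res_hz, time_res_ms

for size in (256, 1024, 4096):
    f, t = resolution(size)
    print(f"{size:5d} samples: {f:7.2f} Hz apart, {t:6.2f} ms long")
```

Multiply the two numbers together and you always get the same constant: making one smaller necessarily makes the other larger.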
As a real world example of where this is applicable, Microsoft’s MS Audio 4 codec uses very long windows. As a result, music encoded in that format is bright and captures properly the tone of music, but quick, sharp sounds like hand claps, hihats, or cymbals sound mushy and drawn out. These kinds of quick bursts of sound are called transients in the audio compression world. Later on, we’ll learn how MP3 deals with this. (AAC and AC-3 use similar techniques to MP3.)
In 1965, two programmers, J. Tukey and J. Cooley, invented a way to perform Fourier transforms a lot faster than had been done before. They decided to call this algorithm the Fast Fourier Transform, or FFT. You will likely hear this term used quite a bit in compression literature to refer to the Fourier transform (the process of looking at what tones are present in a sound).
The Biology of Hearing
Now that we understand how computers listen to sounds and how frequencies work, we can begin to understand how the human ear actually hears sound. So I’m going to take a bit of a “time out” from all of this talk about computer technology to explain some of the basics of ear biology.
As I mentioned before, when sound waves travel through the air, they cause the eardrum to vibrate, pushing in and out of the ear canal. The back of the eardrum is attached to an assembly of the three smallest bones in your body, known as the hammer, anvil, and stirrup. These three bones are pressed up against an oval section of a spiral fluid cavity in your inner ear shaped like a snail shell, known as the cochlea. (Cochlea is actually Latin for “snail shell”!) The vibrations from the bones pushing against the oval window of the cochlea cause hairs within the cochlea to vibrate.
Depending on the frequency of the vibrations, different sets of hairs in the cochlea vibrate: high tones excite the hairs near the base of the cochlea, while low tones excite the hairs at the center of the cochlea. When the hairs vibrate, they send electrical signals to the brain; the brain then perceives these signals as sound. The astute reader may notice that this means that the ear is itself performing a Fourier transform of sorts! The incoming signal (the vibrations of the air waves) is broken up into frequency components and transmitted to the brain. This means that thinking about sound in terms of frequency is not only useful because of the tonality of music, but also because it corresponds to how we actually perceive sound!
The sensitivity of the cochlear hairs is mind-boggling. The human ear can sense as little as a picowatt of energy per square foot of sound compression, but can take up to a full watt of energy before starting to feel pain. Visualize dropping a grain of sand on a huge sheet and being able to sense it. Now visualize dropping an entire beachful of sand (or, say, an anvil) onto the same sheet, without the sheet tearing and also being able to sense that. This absurdly large range of scales necessitated the creation of a new system of acoustic measurement, called the bel, named after the inventor of the telephone, Alexander Graham Bell. If one sound is a bel louder than another, it is ten times louder. If a sound is two bels louder than another, it is a hundred times louder than the first. If a sound is three bels louder than another, it is a thousand times louder. Get it? A bel corresponds roughly to however many digits there are after the first digit. A sound 100,000 times louder than another would be 5 bels louder. This system lets us deal with manageably small numbers that can represent very large numbers. Mathematicians call these logarithmic numbering systems.
People traditionally have used “tenths of bels,” or decibels (dB) to describe relative sound strengths. In this system, one sound that was 20dB louder than another would be 2 bels louder, which means it is actually 100 times louder than the other. People are comfortable with sounds that are a trillion times louder than the quietest sounds they can hear! This corresponds to 12 bels, or 120dB of difference.
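If you want to convert between "times louder" and decibels yourself, the math is a single logarithm. A minimal sketch, with function names of my own choosing:

```python
import math

def ratio_to_db(times_louder: float) -> float:
    """Convert a 'how many times louder' ratio into decibels."""
    return 10 * math.log10(times_louder)

def db_to_ratio(db: float) -> float:
    """Convert decibels back into a 'how many times louder' ratio."""
    return 10 ** (db / 10)

print(ratio_to_db(100))     # two bels, i.e. about 20dB
print(ratio_to_db(1e12))    # a trillion times louder, about 120dB
print(db_to_ratio(30))      # three bels, about a thousand times louder
```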
If a set of hairs is excited, it impairs the ability of nearby hairs to pick up detailed signals; we’ll cover this in the next section. It’s also worth noting that our brain groups these hairs into 25 frequency bands, called critical bands: this was discovered by acoustic researchers Zwicker, Flottorp, and Stevens in 1957. We’ll review critical bands a bit later on. Now, equipped with a basic knowledge of the functioning of the ear, we can tackle understanding the parts of a sound less important to the ear.
Your ear adapts to the sounds in the environment around you. If all is still and quiet, you can hear a twig snap hundreds of feet away. But when you’re at a concert with rock music blaring, it can be difficult to hear your friend, who is shouting right into your ear. This is called masking, because the louder sounds mask the quieter sounds. There are several different kinds of masking that occur in the human ear.
Your ear obviously has certain inherent thresholds: you can’t hear a mosquito buzzing 5 miles away even in complete silence, even though it might theoretically be possible to do so with sufficiently sensitive instrumentation. The human ear is also more sensitive to some frequencies than to others: our best hearing is around 4000Hz, unsurprisingly not too far from the frequency range of most speech.
If you were to plot a curve graphing the quietest tone a person can hear versus frequency, as is done to the right, it would look like a “U,” with a little downwards notch around 4000Hz. Interestingly enough, people who have listened to too much loud music have a lump in this curve at 4000Hz, where they should have a notch. This is why it’s hard to hear people talk right after a loud concert. Continued exposure to loud music will actually permanently damage your cochlear hair cells, and unlike the hair on your head, cochlear hairs never grow back.
This curve, naturally, varies from person to person, and hearing becomes less sensitive the older the subject is, especially in the higher frequencies. Translation: old people usually have trouble hearing. Theoretically, this variance could be used to create custom compression for a given person’s hearing capability, but this would require a great deal of CPU horsepower for a server delivering 250 custom streams at once!
Pure tones, like a steady whistle, mask out nearby tones: if I were to whistle a C very loudly and you were to whistle a C# very softly, an onlooker (or “on-listener,” really) would not be able to hear the C#. If, however, you were to whistle an octave or two above me, I might have a better chance of noticing it. The farther apart the two tones are, the less they mask each other. The louder a tone is, the more surrounding frequencies it masks out.
Noise often encompasses a large number of frequencies. When you hear static on the radio, you’re hearing a whole slew of frequencies at once. Noise actually masks out sounds better than tones: It’s easier to whisper to someone at even a loud classical music concert than it is under a waterfall.
Critical Bands and Prioritization
As mentioned in our brief review of the biology of hearing, frequencies fall into one of 25 human psychoacoustic “critical bands.” This means that we can treat frequencies within a given band in a similar manner, allowing us to have a simpler mechanism for computing what parts of a sound are masked out.
So how do we use all of our newly-acquired knowledge about masking to compress data? Well, we first grab a window of sound, usually about 1/100th of a second’s worth, and we take a look at the frequencies present. Based on how strong the frequency components are, we compute what frequencies will mask out what other frequencies.
We then assign a priority based on how much a given frequency pokes up above the masking threshold: a pure sine wave in quiet would receive nearly all of our attention, whereas with noise all of our attention would be spread around the entire signal. Giving more “attention” to a given frequency means allocating more bits to that frequency than others. In this way, I describe exactly how much energy is at that frequency with greater precision than for other frequencies.
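Here is one way the "attention" idea could be sketched in Python. To be clear, this is a toy of my own devising and not the allocation routine from any real codec: the band levels and masking thresholds are made-up numbers in dB, the function name `allocate_bits` is hypothetical, and I assume the common rule of thumb that each extra bit buys roughly 6dB of precision.

```python
def allocate_bits(levels_db, thresholds_db, budget):
    """Greedily hand out `budget` bits to the bands that need them most."""
    # How far each band pokes up above its masking threshold.
    need = [level - thresh for level, thresh in zip(levels_db, thresholds_db)]
    bits = [0] * len(need)
    for _ in range(budget):
        band = max(range(len(need)), key=lambda i: need[i])
        if need[band] <= 0:
            break              # everything left is already masked
        bits[band] += 1
        need[band] -= 6.0      # assumed: one bit is worth about 6dB
    return bits

levels = [60.0, 30.0, 45.0, 10.0]       # hypothetical loudness per band
thresholds = [20.0, 35.0, 30.0, 15.0]   # hypothetical masking thresholds
print(allocate_bits(levels, thresholds, 8))   # [6, 0, 2, 0]
```

The two bands sitting below their masking thresholds get no bits at all, while the band that pokes up the most gets the lion's share.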
How are the numbers encoded with different resolutions? That is to say, how can I use more bits to describe one number than another? The answer involves a touch of straightforward math. Do you remember scientific notation? It uses numbers like 4.02 x 10^32. The 4.02 is called the mantissa. The 32 is usually called the exponent, but we’re going to call it the scale factor. Since frequencies in the same critical band are treated similarly by our ear, we give them all the same scale factor and allocate a certain (fixed) number of bits to the mantissa of each. For example, let’s say I had the numbers 149.32, -13.29, and 0.12 - I’d set a scale factor of 3, since 10^3 = 1000 and our largest number is 0.14932 x 10^3. In this way, I’m guaranteed that all of my mantissas will be between -1 and 1. Do you see why the exponent is called a scale factor now? I would encode the numbers above as 0.14932, -0.01329, and 0.00012 using a special algorithm known as fixed-point quantization.
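The shared scale factor can be sketched in a few lines of Python. This follows the decimal illustration in the text, taking the scale factor to be the power of ten that brings the largest value just under 1; the function name is my own, and it assumes at least one value is non-zero.

```python
import math

def scale_and_mantissas(values):
    """Pick one power-of-ten scale factor so every mantissa lands in (-1, 1)."""
    largest = max(abs(v) for v in values)   # assumed to be non-zero
    scale_factor = math.floor(math.log10(largest)) + 1
    mantissas = [v / 10 ** scale_factor for v in values]
    return scale_factor, mantissas

scale_factor, mantissas = scale_and_mantissas([149.32, -13.29, 0.12])
print(scale_factor)   # 3
print(mantissas)      # roughly 0.14932, -0.01329, 0.00012
```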
Have you ever played the game where someone picks a number between 1 and 100 and you have to guess what it is, but are told if your guess is high or low? Everybody knows that the best way to play this game is to first guess 50, then 25 or 75 depending, etc., each time halving the possible numbers left. Fixed-point quantization works in a very similar fashion. The best way to describe it is to walk through the quantization of a number, like 0.65. Since we start off knowing the number is between -1 and 1, we should record a 0 if the number is greater than or equal to 0, and a 1 if it is less than 0. Our number is greater than zero, so we record 0: now we know the number is between 0 and 1, so we record a 0 if the number is greater than or equal to 0.5, and a 1 otherwise. Being greater, we record 0 again, narrowing the range to between 0.5 and 1. On the next step, we note that our number (0.65) is less than 0.75 and record a 1, bringing our total number to 001. You can here see how with each successive “less-than, greater-than” decision we record a one or a zero and come twice as close to the answer. The more decisions I am allowed, the more precisely I may know a number. We can use a lot of fixed-point quantization decisions on the frequencies that are most important to our ears and only a few on those that are less. In this way, we “spend” our bits wisely.
We can reconstruct a number by reversing the process: with 001, we first see that the number is between 0 and 1, then that it is between 0.5 and 1, and finally that it is between 0.5 and 0.75. Once we’re at the end, we’ll guess the number to be in the middle of the range of numbers we have left: 0.625 in this case. While we didn’t get it exactly right, our quantization error is only 0.025 - not bad for three ones and zeroes to match a number so closely! Naturally, the more ones and zeroes that are given, the smaller the quantization error.
The above technique roughly describes the MPEG Layer 2 codec (techie jargon for compression / decompression algorithm) and is the basis for more advanced codecs like Layer 3, AAC, and AC-3, all of which incorporate their own extra tricks, like predicting what the audio is going to do in the next second based on the past second. At this point you understand the basic foundations of modern audio compression and are getting comfortable with the language used; it is time to move to a comprehensive review of modern audio codecs.