Chapter 2: The Guts of Music Technology
In this section and the ones following, things are going to get
increasingly technical. I'm going to start off pretty simple and
slowly ramp up to some considerably involved topics, so please feel
free to skip the parts that you already know to get to the juicy
stuff. You may find some parts overwhelming. Don't worry too much
about it; just feel free to skim. To
make this easy for you, I've bolded the key definitions throughout the
text. And if you get bored? Just go to the next chapter. Nobody's
quizzing you on this!
Digital Audio Basics
Computers work by passing small charges through aluminum trenches
etched in silicon and shoving these charges through various gates: If
this charge is here and that one is too then the chip will create a
charge in another place. The computer does all of its computations in
ones and zeroes. Integers, like -4, 15, 0, or 3, can be represented
with combinations of ones and zeroes in an arithmetic system called
binary. Humans normally use a "decimal" system with ten symbols per
space: we count 1, 2, 3,...8, 9, 10, 11. In the binary system there
are only two symbols per space: one counts 1, 10, 11, 100, 101, 110,
111, 1000, etc.!
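If you'd like to see the counting pattern for yourself, here is a
small Python sketch. The helper name to_binary is made up for this
illustration, and it only handles zero and positive integers
(negative numbers need an extra convention that isn't covered here).

```python
def to_binary(n):
    """Return the binary digits of a non-negative integer as a string."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # the remainder is the next digit, right to left
        n //= 2                    # move on to the next "space"
    return "".join(reversed(digits))

# Count from one to eight the way the text does.
counting = [to_binary(i) for i in range(1, 9)]
print(", ".join(counting))
```

Each trip through the loop peels off one binary digit, which is why
bigger numbers need more ones and zeroes.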
If the computer is to understand how to store music, music must be
represented as a series of ones and zeroes. How can we do this? Well,
one thing to keep in mind throughout all of this discussion is that
we're going to be focusing on making music for humans to hear. While
that may sound trite, it allows us to "cheat" and throw out the
parts of the music that people can't hear: a dog might not be able to
appreciate Mozart as much after we're done with things, but if it
sounds just the same to an average Jane, then we've accomplished our
true mission - to have realistic music come from a computer!
We first need to understand what sound is. When you hear a sound, like
a train whistle or your favorite hip-hop artist, your eardrum is
getting squished in and out by air. Speakers, whistles, voices, and
anything else that makes sound repeatedly squishes air and then
doesn't. When the sound gets to your ear, it pushes your eardrum in
and out. If the air gets squished in and out at a constant rate, like
440 times a second, you'll hear a constant tone, like when someone
whistles a single note. The faster the air gets squished in and out,
the higher tone you hear; likewise, the low bass tones of a drum
squish the air in and out very slowly, about 50 times a
second. Engineers use the unit hertz, abbreviated Hz, to mean
"number of times per second" and kilohertz, or kHz, to mean "thousands
of times per second." Some people with very good hearing can hear
sounds as low as 20Hz and as high as 20kHz. Also, the more violently
the air is compressed and decompressed, the louder the signal is.
Now we can understand what a microphone does. A microphone consists of
a thin diaphragm that acts a lot like your eardrum: as music is being
played, the diaphragm of the microphone gets pushed in and out. The
more pushed in the diaphragm is, the more electrical charge the
microphone sends back to the device into which you've plugged your
mic. What if you plug the mic into your computer? The computer is
good at dealing with discrete numbers, also known as digital
information, but the amount that the microphone is being compressed is
always changing; it is analog information. There is a small piece of
hardware in a computer that allows it to record music from a
microphone: it is called an Analog-to-Digital Converter, or ADC for
short. It is impossible for us to record a smooth signal as ones and
zeroes and reproduce it perfectly on a computer. The ADC does not
attempt to perfectly record the signal. Instead, several thousand
times a second it takes a peek at how squished in the microphone
is. The rate at which I check on the microphone is called the sampling
rate. If the microphone is 100% squished in, we'll give it the number
64,000. If the microphone is not squished in at all, we'll give it a
0, and we'll assign it a number correspondingly for in-between values:
halfway squished in would merit a 32,000. We call these values
samples.
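Here is a rough Python sketch of what the ADC is doing, assuming a
made-up microphone whose diaphragm swings smoothly between "not
squished" (0.0) and "100% squished" (1.0) while someone whistles a
pure tone. The 8,000-peeks-per-second rate and the function name
sample_tone are assumptions chosen to keep the example small; the
64,000 full-scale value comes from the text.

```python
import math

SAMPLE_RATE = 8000   # peeks per second (assumed, deliberately low)
FULL_SCALE = 64000   # the number given to "100% squished in"

def sample_tone(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Peek at a pure tone sample_rate times a second; return integer samples."""
    count = round(duration_s * sample_rate)
    samples = []
    for n in range(count):
        t = n / sample_rate  # the moment of this peek, in seconds
        # Pretend diaphragm position: 0.0 = not squished, 1.0 = fully squished.
        squish = 0.5 + 0.5 * math.sin(2 * math.pi * freq_hz * t)
        samples.append(round(squish * FULL_SCALE))
    return samples

samples = sample_tone(440, 0.01)  # one hundredth of a second of an 'A'
print(len(samples), samples[:5])
```

The smooth motion of the diaphragm is gone; all the computer keeps
is a list of whole numbers, one per peek.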
The Nyquist Theorem says that as long as
our sampling rate is at least twice the frequency of the highest tone
we want to record, we'll be able to accurately reproduce the tone.
Since humans can't hear anything higher than 22kHz, if we sample the
microphone 44,000 times a second, we'll be able to reproduce the
highest tones that people can hear. In fact, CDs sample at 44.1kHz
and, as suggested above, store the amount the microphone was squished
as a number between 0 and 65,535, using 16 ones and zeros, or bits,
for every sample. In this way, we'd say that CDs have a sample
resolution of 16 bits.
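What happens if a tone is higher than half the sampling rate? The
sketch below, which uses pure cosine tones and a hypothetical helper
called sample_cosine, suggests the answer: a 30kHz tone sampled at
44.1kHz produces the very same numbers as a 14.1kHz tone, so the
computer can't tell them apart.

```python
import math

SAMPLE_RATE = 44100

def sample_cosine(freq_hz, count, sample_rate=SAMPLE_RATE):
    """Return count samples of a pure cosine tone."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate) for n in range(count)]

too_high = sample_cosine(30000, 50)               # above half the sampling rate
alias = sample_cosine(SAMPLE_RATE - 30000, 50)    # 14.1kHz, below half the rate

# If the two lists match, the samples alone cannot distinguish the tones.
max_difference = max(abs(a - b) for a, b in zip(too_high, alias))
print(max_difference)
```

The difference is nothing but floating-point rounding noise, which is
why tones above half the sampling rate have to be kept out of the
recording in the first place.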
All of this data ends up taking a great deal
of space: if we sample a left and a right channel for stereo sound at
44.1kHz, using 16 bits for every sample, that's 1.4 million bits for
every second of music! On a 28.8k modem, it would take you nearly 50
seconds to transmit a single second of uncompressed music to a friend!
We clearly need a way to use fewer bits to transmit the music.
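The arithmetic is easy to check. This sketch assumes that "28.8"
means an ideal 28,800 bits per second with no overhead at all; real
modems do somewhat worse.

```python
CHANNELS = 2                    # left and right
SAMPLE_RATE = 44100             # samples per second, per channel
BITS_PER_SAMPLE = 16
MODEM_BITS_PER_SECOND = 28800   # assumed ideal throughput

bits_per_second = CHANNELS * SAMPLE_RATE * BITS_PER_SAMPLE
seconds_to_send = bits_per_second / MODEM_BITS_PER_SECOND

print(bits_per_second)   # bits needed for one second of stereo music
print(seconds_to_send)   # modem seconds needed per second of music
```

Under these assumptions one second of CD-quality stereo is 1,411,200
bits, or about 49 seconds of modem time.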
Those of you comfortable with computers may suggest we use a
compression program like WinZip or StuffIt Deluxe to reduce the size of
these music files. Unfortunately, this does not work very well. These
compression programs were designed largely with text in mind. These
programs were also designed to perfectly reproduce every bit: If you
compress a document to put it on a floppy, it had better not be
missing anything when you decompress it on a friend's machine!
Compression algorithms work best when they know what they are
compressing. Specialized algorithms can squish down video to a 100th
of its original size, and people routinely use the JPEG (.JPG)
compression format to reduce the size of pictures on the web. JPEG
is lossy; that is to say, it destroys some data. If you scan
in a beautifully detailed picture and squish it down to a small JPEG
file, you will see that there are noticeable differences between the
original and the compressed versions, but in general it is throwing
away the information that is less important for your eye to see to
understand what the picture is about. In the same way, we will get
much better compression of sound if we use an algorithm that
understands the way that people hear and destroys the parts of the
sound that we cannot perceive. Already, we have done this in a small
way by ignoring any sounds above 22kHz. We might have done things
differently if we were making an audio system for a dog or a whale; we
have already exploited some knowledge of the human ear to our
advantage. Now it is time for us to further use this knowledge to
compress the sound.
Understanding Fourier
In order to compress the sound, we need to understand what parts are
okay to throw away; that is to say, what the least important parts of
the sound are. That way, we can keep the most important parts of the
sound so we can stream them live through, say, a 28.8k modem.
Now as it turns out, sound is very tonal. This means that sounds tend
to maintain their pitch for periods of time: a trumpet will play a
note for a half-second, a piano will sound a chord, etc. If I were to
whistle an 'A' for a second, your eardrum may be wiggling in and out
very quickly, but the tone stays constant. While recording the
"wiggling" of the signal going in and out would take a great deal of
numbers to describe, in this case it would be much simpler to simply
record the tone and how long it went for, i.e., "440Hz (that's A!)
for 1.0 seconds." In this way, I've replaced tens of thousands of
numbers with two numbers.
While clearly most signals are not so compressible, the concept
applies: sound pressure, or the amount that your eardrum is
compressed, changes very rapidly (tens of thousands of times a
second), while frequency information, or the tones that are present in
a piece of music, tend not to change very frequently (32 notes per
second is pretty fast for a pianist!). If we only had a way to look at
sound in the frequency domain, we could probably get excellent
compression.
Luckily for us, J. B. Joseph Fourier, a 19th century mathematician,
came up with a nifty way for transforming a chunk of samples into
their respective frequencies. While describing the method in detail
has occupied many graduate-level electrical engineering books, the
concept is straightforward: if I take a small chunk of audio samples
from the microphone as you are whistling, I take the discrete numbers
that describe the microphone's state and run them through a Discrete
Fourier Transform, also known as a DFT. What I get out is a set of
numbers that describe what frequencies are present in the signal and
how strong they are, i.e., "There is a very loud tone playing an A#
and there is a quiet G flat, too." I call the chunk of samples that I
feed the DFT my input window.
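To make this concrete, here is a minimal, deliberately slow DFT in
Python. It is only a sketch: the 8,000Hz sampling rate, the
400-sample input window, and the helper names dft and
loudest_frequency are all assumptions picked for the example, and a
real codec would use a much faster routine.

```python
import cmath
import math

def dft(samples):
    """Naive Discrete Fourier Transform: one complex number per frequency."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def loudest_frequency(samples, sample_rate):
    """Frequency (Hz) of the strongest component below half the sample rate."""
    spectrum = dft(samples)
    half = len(samples) // 2
    strongest = max(range(half), key=lambda k: abs(spectrum[k]))
    return strongest * sample_rate / len(samples)

SAMPLE_RATE = 8000   # assumed for illustration
WINDOW = 400         # input window length, in samples

# A pretend whistle: a pure 440Hz tone.
whistle = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(WINDOW)]
peak_hz = loudest_frequency(whistle, SAMPLE_RATE)
print(peak_hz)
```

The size of each number in the spectrum says how strong that
frequency is; here the strongest one lands on the 440Hz of the
whistle.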
There is an interesting tradeoff here: if I take a long input window,
meaning I record a long chunk of audio from the microphone and run it
all through the DFT at once, I'll be able to pick out what tone a user
was whistling with great precision. And, just like with people, if I
only let the computer hear a sound for a short moment, it will have
poor frequency resolution, i.e., it will be difficult for it to tell
what tone was whistled. Likewise, if I'm trying to nail down exactly
when a user begins to whistle into a microphone and I take short
windows, I'll be able to pick out close to the exact time when they
started to whistle; but if I take very long windows, the Fourier
transform won't tell me when a tone began, only how loud it is. I'd
have trouble nailing down when it began and could be said to have poor
time resolution. Frequency resolution and time resolution work against
each other: the more you need to know exactly when a sound happened,
the less you know what tone it is; the more exactly you need to know
what frequencies are present in a signal, the less precisely you know
the time at which those frequencies started or stopped.
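You can put rough numbers on this tradeoff. For a DFT, the
frequencies it reports are spaced (sampling rate / window length)
apart, while the window itself covers (window length / sampling
rate) seconds. The sketch below assumes CD-rate audio; the function
name window_tradeoff is made up for the example.

```python
def window_tradeoff(window_samples, sample_rate):
    """Return (frequency spacing in Hz, window duration in seconds)."""
    frequency_resolution_hz = sample_rate / window_samples
    time_span_s = window_samples / sample_rate
    return frequency_resolution_hz, time_span_s

SAMPLE_RATE = 44100
for window_samples in (256, 1024, 4096):
    hz, seconds = window_tradeoff(window_samples, SAMPLE_RATE)
    print(f"{window_samples:5d} samples: tones {hz:7.2f} Hz apart, "
          f"window lasts {seconds * 1000:6.2f} ms")
```

Multiply the two numbers together and you always get 1: making one
of them smaller necessarily makes the other bigger.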
As a real world example of where this is applicable, Microsoft's MS
Audio 4 codec uses very long windows. As a result, music encoded in
that format is bright and captures properly the tone of music, but
quick, sharp sounds like hand claps, hi-hats, or cymbals sound mushy
and drawn out. These kinds of quick bursts of sound are called
transients in the audio compression world. Later on, we'll learn how
MP3 deals with this. (AAC and AC-3 use similar techniques to MP3.)
In 1965, two researchers, J. Tukey and J. Cooley, published a way to
perform Fourier transforms a lot faster than had been done
before. They decided to call this algorithm the Fast Fourier
Transform, or FFT. You will likely hear this term used quite a bit in
compression literature to refer to the Fourier transform (the process
of looking at what tones are present in a sound).
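For the curious, here is a textbook recursive "radix-2" FFT in
Python next to the slow, direct method. Treat it as a sketch of the
general idea and not necessarily the exact form published in 1965;
it assumes the number of samples is a power of two, and the test
signal is made up.

```python
import cmath

def naive_dft(samples):
    """Direct method: roughly n * n multiplications for n samples."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    evens = fft(samples[0::2])   # transform of the even-numbered samples
    odds = fft(samples[1::2])    # transform of the odd-numbered samples
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odds[k]
        result[k] = evens[k] + twiddle
        result[k + n // 2] = evens[k] - twiddle
    return result

signal = [0.0, 1.0, 0.5, -0.25, -1.0, 0.75, 0.25, -0.5]  # made-up samples
largest_gap = max(abs(a - b) for a, b in zip(fft(signal), naive_dft(signal)))
print(largest_gap)
```

Both routines give the same answer up to rounding noise. The speedup
comes from splitting the work in half again and again, so the effort
grows roughly like n log n instead of n squared.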
The Biology of Hearing
Now that we understand how computers listen to sounds and how
frequencies work, we can begin to understand how the human ear
actually hears sound. So I'm going to take a bit of a "time out" from
all of this talk about computer technology to explain some of the
basics of ear biology.
As I mentioned before, when sound waves travel through the air, they
cause the eardrum to vibrate, pushing in and out of the ear canal. The
back of the eardrum is attached to an assembly of the three smallest
bones in your body, known as the hammer, anvil, and stirrup. These
three bones are pressed up against an oval section of a spiral fluid
cavity in your inner ear shaped like a snail shell, known as the
cochlea. (Cochlea is actually Latin for "snail shell"!) The vibrations
from the bones pushing against the oval window of the cochlea cause
hairs within the cochlea to vibrate.
Depending on the frequency of the vibrations, different sets of hairs
in the cochlea vibrate: high tones excite the hairs near the base of
the cochlea, while low tones excite the hairs at the center of the
cochlea. When the hairs vibrate, they send electrical signals to the
brain; the brain then perceives these signals as sound. The astute
reader may notice that this means that the ear is itself performing a
Fourier transform of sorts! The incoming signal (the vibrations of the
air waves) is broken up into frequency components and transmitted to
the brain. This means that thinking about sound in terms of frequency
is not only useful because of the tonality of music, but also because
it corresponds to how we actually perceive sound!
The sensitivity of the cochlear hairs is mind-boggling. The human ear
can sense as little as a picowatt of sound power per square meter,
but can take up to a full watt per square meter before starting
to feel pain. Visualize dropping a grain of sand on a huge sheet and
being able to sense it. Now visualize dropping an entire beachful of
sand (or, say, an anvil) onto the same sheet, without the sheet
tearing and also being able to sense that. This absurdly large range
of scales necessitated the creation of a new system of acoustic
measurement, called the bel, named after the inventor of the
telephone, Alexander Graham Bell. If one sound is a bel louder than
another, it is ten times louder. If a sound is two bels louder than
another, it is a hundred times louder than the first. If a sound is
three bels louder than another, it is a thousand times louder. Get it?
The number of bels corresponds roughly to however many digits there
are after the first digit of the ratio. A sound 100,000 times louder
than another would mean there were 5 bels of difference. This system
lets us deal with
manageably small numbers that can represent very large
numbers. Mathematicians call these logarithmic numbering systems.
People traditionally have used "tenths of bels," or decibels (dB) to
describe relative sound strengths. In this system, one sound that was
20dB louder than another would be 2 bels louder, which means it is
actually 100 times louder than the other. People are comfortable with
sounds that are a trillion times louder than the quietest sounds they
can hear! This corresponds to 12 bels, or 120dB of difference.
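The conversion is a single logarithm. In this sketch the function
names are made up, and "times louder" is taken to mean the ratio of
sound power, as in the examples above.

```python
import math

def ratio_to_bels(power_ratio):
    """How many bels louder a sound is, given how many times stronger it is."""
    return math.log10(power_ratio)

def ratio_to_decibels(power_ratio):
    """Same thing in tenths of bels."""
    return 10 * ratio_to_bels(power_ratio)

for ratio in (10, 100, 100_000, 10**12):
    print(f"{ratio:>17,} times -> {ratio_to_bels(ratio):4.1f} bels "
          f"= {ratio_to_decibels(ratio):5.1f} dB")
```

A trillion-to-one ratio, which would be unwieldy to write out every
time, collapses to a tidy 120dB.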
If a set of hairs is excited, it impairs the ability of nearby hairs
to pick up detailed signals; we'll cover this in the next section. It's
also worth noting that our brain groups these hairs into 25 frequency
bands, called critical bands: this was discovered by acoustic
researchers Zwicker, Flottorp, and Stevens in 1957. We'll review
critical bands a bit later on. Now, equipped with a basic knowledge of
the functioning of the ear, we can tackle understanding the parts of a
sound less important to the ear.
Psychoacoustic Masking
Your ear adapts to the sounds in the environment around you. If all is
still and quiet, you can hear a twig snap hundreds of feet away. But
when you're at a concert with rock music blaring, it can be difficult
to hear your friend, who is shouting right into your ear. This is
called masking, because the louder sounds mask the quieter
sounds. There are several different kinds of masking that occur in the
human ear.
Normal Masking
Your ear obviously has certain inherent thresholds: you can't hear a
mosquito buzzing 5 miles away even in complete silence, even though,
theoretically it might be possible to do it with sufficiently
sensitive instrumentation. The human ear is also more sensitive to
some frequencies than to others: our best hearing is around 4000Hz,
unsurprisingly not too far from the frequency range of most speech.
If you were to plot a curve graphing the quietest tone a person can
hear versus frequency, as is done to the right, it would look like a
"U," with a little downwards notch around 4000Hz. Interestingly
enough, people who have listened to too much loud music have a lump in
this curve at 4000Hz, where they should have a notch. This is why it's
hard to hear people talk right after a loud concert. Continued
exposure to loud music will actually permanently damage your cochlear
hair cells, and unlike the hair on your head, cochlear hairs never
grow back.
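If you want to play with the shape of this curve, one widely quoted
approximation for a young listener with healthy hearing is
Terhardt's formula. It does not come from this chapter; it is
brought in here purely as an assumption for illustration, and the
function name is made up. Levels are in decibels relative to a
standard reference pressure.

```python
import math

def threshold_in_quiet_db(freq_hz):
    """Approximate quietest audible level (in dB) at a given frequency.

    Uses Terhardt's approximation, an assumption not taken from the text.
    """
    khz = freq_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 0.001 * khz ** 4)

for freq in (50, 100, 1000, 4000, 10000, 16000):
    print(f"{freq:6d} Hz: {threshold_in_quiet_db(freq):6.1f} dB")
```

The numbers are high at both ends and dip lowest in the
neighborhood of 4000Hz, which is the "U" with a notch described
above.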
This curve, naturally, varies from person to person, and the range of
audible sounds shrinks the older the subject is, especially in the higher
frequencies. Translation: old people usually have trouble
hearing. Theoretically, this variance could be used to create custom
compression for a given person's hearing capability, but this would
require a great deal of CPU horsepower for a server delivering 250
custom streams at once!
Tone Masking
Pure tones, like a steady whistle, mask out nearby tones: if I were to
whistle a C very loudly and you were to whistle a C# very softly, an
onlooker (or "on-listener," really) would not be able to hear the
C#. If, however, you were to whistle an octave or two above me, the
listener would have a better chance of noticing it. The farther apart the two
tones are, the less they mask each other. The louder a tone is, the
more surrounding frequencies it masks out.
Noise Masking
Noise often encompasses a large number of frequencies. When you hear
static on the radio, you're hearing a whole slew of frequencies at
once. Noise actually masks out sounds better than tones: It's easier
to whisper to someone at even a loud classical music concert than it
is under a waterfall.
Critical Bands and Prioritization
As mentioned in our brief review of the biology of hearing,
frequencies fall into one of 25 human psychoacoustic "critical bands."
This means that we can treat frequencies within a given band in a
similar manner, allowing us to have a simpler mechanism for computing
what parts of a sound are masked out.
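One common way to decide which critical band a frequency belongs to
is the "Bark" scale approximation credited to Zwicker and Terhardt.
The formula is an outside assumption, not something defined in this
chapter, and the helper names below are made up; real codecs
typically use lookup tables of band edges instead.

```python
import math

def hz_to_bark(freq_hz):
    """Approximate position of a frequency on the critical-band (Bark) scale."""
    return (13 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

def critical_band_index(freq_hz):
    """Whole-number band index, counting from 0."""
    return int(hz_to_bark(freq_hz))

for freq in (100, 440, 1000, 4000, 10000, 20000):
    print(f"{freq:6d} Hz -> band {critical_band_index(freq)}")
```

With this approximation, everything from the lowest tones up to
20kHz lands in bands 0 through 24, which lines up with the 25 bands
mentioned above. Notice how the bands are narrow at low frequencies
and very wide at high ones.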
So how do we use all of our newly-acquired knowledge about masking to
compress data? Well, we first grab a window of sound, usually about
1/100th of a second's worth, and we take a look at the frequencies
present. Based on how strong the frequency components are, we compute
what frequencies will mask out what other frequencies.
We then assign a priority based on how much a given frequency pokes up
above the masking threshold: a pure sine wave in quiet would receive
nearly all of our attention, whereas with noise all of our attention
would be spread around the entire signal. Giving more "attention" to a
given frequency means allocating more bits to that frequency than
others. In this way, I describe exactly how much energy is at that
frequency with greater precision than for other frequencies.
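Here is one simple way the "attention" idea could be sketched in
Python. It is a greedy scheme under stated assumptions, not the
exact procedure of any particular codec: each band starts with a
number saying how far it pokes above the masking threshold (in dB),
every bit we spend on a band is assumed to push its error down by
about 6dB, and we keep handing the next bit to whichever band is
currently worst off. The function name and the example numbers are
made up.

```python
def allocate_bits(signal_to_mask_db, total_bits, db_per_bit=6.02):
    """Greedily hand out bits to the bands that poke furthest above the mask."""
    bits = [0] * len(signal_to_mask_db)
    # How far each band's error would still stick out above its threshold.
    audible_db = list(signal_to_mask_db)
    for _ in range(total_bits):
        neediest = max(range(len(bits)), key=lambda band: audible_db[band])
        if audible_db[neediest] <= 0:
            break  # everything left is masked, so stop spending bits
        bits[neediest] += 1
        audible_db[neediest] -= db_per_bit
    return bits

pure_tone = [0.0, 60.0, 0.0, 0.0]   # one band far above the threshold
noisy = [20.0, 20.0, 20.0, 20.0]    # every band a little above it
print(allocate_bits(pure_tone, 12))
print(allocate_bits(noisy, 12))
```

For the pure tone, every bit that gets spent goes to the single band
that matters; for the noisy signal, the same budget is spread evenly
across all of the bands.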
Fixed-Point Quantization
How are the numbers encoded with different resolutions? That is to
say, how can I use more bits to describe one number than another? The
answer involves a touch of straightforward math. Do you remember
scientific notation? It uses numbers like 4.02 x 10^32.
The 4.02 is called the mantissa. The 32 is usually called the
exponent, but we're going to call it the scale factor.
Since frequencies in the same
critical band are treated similarly by our ear, we give them all the
same scale factor and allocate a certain (fixed) number of bits to the
mantissa of each. For example, let's say I had the numbers 149.32,
-13.29, and 0.12 - I'd set a scale factor of 3, since 10^3 = 1000 and
our largest number is 0.14932 x 10^3. In this way, I'm guaranteed that
all of my mantissas will be between -1 and 1. Do you see why the
exponent is called a scale factor now? I would encode the
numbers above as 0.14932, -0.01329, and 0.00012 using a special
algorithm known as fixed-point quantization.
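Picking the shared scale factor can be sketched in a few lines of
Python. This assumes a power-of-ten scale factor, as in the example;
the function names are made up for illustration.

```python
import math

def shared_scale_factor(values):
    """Smallest power of ten that brings every value strictly between -1 and 1."""
    largest = max(abs(v) for v in values)
    if largest == 0:
        return 0
    return math.floor(math.log10(largest)) + 1

def to_mantissas(values):
    """Return the shared scale factor and each value's mantissa."""
    scale = shared_scale_factor(values)
    return scale, [v / 10 ** scale for v in values]

scale, mantissas = to_mantissas([149.32, -13.29, 0.12])
print(scale, mantissas)
```

For the three example numbers, this picks 10^3 = 1000 as the shared
scale, which gives the mantissas 0.14932, -0.01329, and 0.00012
listed above (give or take a little floating-point rounding).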
Have you ever played the game where someone picks a number between 1
and 100 and you have to guess what it is, but are told if your guess
is high or low? Everybody knows that the best way to play this game is
to first guess 50, then 25 or 75 depending, etc., each time halving
the possible numbers left. Fixed-point quantization works in a very
similar fashion. The best way to describe it is to walk through the
quantization of a number, like 0.65. Since we start off knowing the
number is between -1 and 1, we should record a 0 if the number is
greater than or equal to 0, and a 1 if it is less than 0. Our number
is greater than zero, so we record 0: now we know the number is
between 0 and 1, so we record a 0 if the number is greater than or
equal to 0.5. Being greater, we record 0 again, narrowing the range to
between 0.5 and 1. On the next step, we note that our number (0.65)
is less than 0.75 and record a 1, bringing our total number to
001. Here you can see how with each successive "less-than,
greater-than" decision we record a one or a zero and come twice as
close to the answer. The more decisions I am allowed, the more
precisely I may know a number. We can use a lot of fixed-point
quantization decisions on the frequencies that are most important to
our ears and only a few on those that are less. In this way, we
"spend" our bits wisely.
We can reconstruct a number by reversing the process: with 001, we
first see that the number is between 0 and 1, then that it is between
0.5 and 1, and finally that it is between 0.5 and 0.75. Once we're at
the end, we'll guess the number to be in the middle of the range of
numbers we have left: 0.625 in this case. While we didn't get it
exactly right, our quantization error is only 0.025 - not bad for
three ones and zeroes to match a number so closely! Naturally, the
more ones and zeroes that are given, the smaller the quantization
error.
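The whole guessing game fits in a short Python sketch. It follows
the convention used in the walkthrough (record a 0 for the upper
half of the range and a 1 for the lower half); the function names
quantize and dequantize are made up for illustration.

```python
def quantize(value, bit_count):
    """Describe a number between -1 and 1 using bit_count halving decisions."""
    low, high = -1.0, 1.0
    bits = ""
    for _ in range(bit_count):
        middle = (low + high) / 2
        if value >= middle:
            bits += "0"    # upper half of the remaining range
            low = middle
        else:
            bits += "1"    # lower half of the remaining range
            high = middle
    return bits

def dequantize(bits):
    """Replay the decisions, then guess the middle of what is left."""
    low, high = -1.0, 1.0
    for bit in bits:
        middle = (low + high) / 2
        if bit == "0":
            low = middle
        else:
            high = middle
    return (low + high) / 2

code = quantize(0.65, 3)
guess = dequantize(code)
print(code, guess, abs(0.65 - guess))
```

With three decisions, 0.65 becomes 001 and comes back as 0.625.
Asking for more bits, say quantize(0.65, 8), shrinks the error
further.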
Conclusion
The above technique roughly describes the MPEG Layer 2 codec (techie
jargon for compression / decompression algorithm) and is the basis for
more advanced codecs like Layer 3, AAC, and AC-3, all of which
incorporate their own extra tricks, like predicting what the audio is
going to do in the next second based on the past second. At this point
you understand the basic foundations of modern audio compression and
are getting comfortable with the language used; it is time to move to
a comprehensive review of modern audio codecs.