Digital Audio Technology
A Primer: Part I
MI 313
Center for Audio Recording Arts (CARA)
by: Robert S. Thompson, Ph.D.
Historic Overview
From the outset audio recording was a mechanical process; it was not until the twentieth century that electronics began to play a part. What follows is background information everyone should know.
1870's Thomas Edison, Emile Berliner
1898 Valdemar Poulsen's Telegraphone magnetic wire recorder (100 years ago!)
The invention of the triode vacuum tube (1906) launched the era of electronics but it was not until 1924 that electronically produced recordings became practical.
1922 Optical sound recording on film demonstrated
1930's Sound recording on tape with powdered magnetic material
(Developed in Germany and spread to the rest of the world after WW-II.)
The old wire and steel band recorders, in use prior to the German Magnetophon, required soldering or welding to make a splice (no thanks!). The Magnetophons and their descendants were analog recorders.
analog (sometimes analogue) - refers to the fact that the waveform encoded on the tape is a close analogy to the original sound waveform picked up by a microphone.
Analog recording continues to be refined, but faces fundamental physical limits. These limits are most apparent when making copies (dubbing) from one analog medium to another - the addition of noise is inescapable.
Experimental Digital Recording
There are a number of core concepts in digital audio that one must master and the most important of these is sampling.
sampling - the conversion of continuous analog signals (such as those coming from a microphone) into discrete time-sampled signals.
The theoretical underpinning of sampling is the sampling theorem, which specifies the relation between the sampling rate and the audio bandwidth.
sampling rate - how often the continuous waveform is sampled.
bandwidth - the overall frequency range encompassed by the process of sampling.
The sampling theorem is also known as the Nyquist theorem after the work of Harold Nyquist at Bell Labs in 1928. It is interesting to note however that an earlier form of this theorem was first stated in 1841 by French mathematician Augustin Louis Cauchy.
After the development of the sampling theorem work progressed steadily on digital recording, storage and reproduction of sound.
1938 A. Reeves - patents the first PCM (pulse code modulation) system for the transmission of messages in "amplitude-dichotomized, time-quantized" (digital) form.
Note: even today, digital recording is sometimes called "PCM recording"
1948 Information Theory developed, contributes to the understanding of digital audio transmission
1950's Max Mathews (the "father" of computer music) generated the first synthetic sounds using a digital computer. He played his samples back on a custom-built 12-bit vacuum tube "digital to sound converter".
1960 Theory of digital error correction developed by Hamming and others
Note: later work on the error correction problem by a host of engineers resulted in the first usable and practical systems for digital audio recording.
1960's The first practical digital audio recorder - one channel - based on a videotape mechanism (helical scanning) was demonstrated by NHK, the Japan Broadcasting Corporation.
1970's Denon refines the NHK prototype and the race begins to bring digital audio recorders to the market.
1977 Sony introduces the first commercial recording system, the PCM-1 processor, designed to encode 13 bit digital audio signals on Sony Beta videocassette recorders.
1978 PCM-1 displaced by newer 16 bit encoders such as the PCM-1600. At this point two lines of production of digital audio technology developed: professional and "consumer".
Note: although two product lines were developed it is clear that a mass market for this type of recording technology never really materialized. The DAT machine has been the most successful in the "semi-professional" arena.
1980's Sony PCM 1610 and PCM 1630 become the standards for compact disc mastering. The format of the PCM-F1, also known as EIAJ (after the Electronic Industries Association of Japan), became the de facto standard for low-cost digital audio recording on videocassette.
1982 First CD reaches the public due to collaboration between Sony and Philips.
Note: since the introduction of the CD a variety of products have been derived from the technology notably CD-ROM (read only memory) and CD-I (Interactive) as well as other formats that mix audio data, text and images.
1985 Audio Engineering Society establishes two standard sampling frequencies, 44.1 and 48 KHz. In 1992 they revised their specifications to include other rates such as 32 KHz, which is used in broadcast.
1986 Higher resolution encoders are developed - Mitsubishi X-86 reel to reel recorder encoded 20 bits at 96 KHz. The high resolution recorder is quickly becoming standardized and will enter the "semi-professional" market shortly.
1990's Manufacturers target the need for recordable digital media. Various stereo media appeared including Digital Audio Tape (DAT), Digital Compact Cassette (DCC), the MiniDisc (MD) and recordable CDs (CD-R). Brace yourselves for the innovation that will follow the introduction of the Digital Video Disc (DVD)!
Refinements for Audio Production
While the new CD players had inexpensive 16-bit Digital to Analog Converters (DACs), high quality converters (appropriate for audiophile recording) attached to computers were not common before 1988. Although MIDI was introduced in 1983, and allowed computers to interface with hardware synthesizers, the direct recording of sound, its manipulation and subsequent playback were not possible until the late 1980's.
After the development of low-cost, high-quality Analog to Digital Converters (ADCs) and DACs for use with "workstation" computers, the use of these systems for recording, editing, processing and sound synthesis became widespread. One might argue that this is the direction that music and sound production will take. Many workstation systems are able to operate as "stand-alone systems" and will "emulate" the technology of a mixing console and audio recorder, making the hardware equivalents somewhat redundant. Clearly the DAW (Digital Audio Workstation) has become a permanent tool in the modern recording studio.
Digital Multi-track Recording
Multitrack recorders have several discrete channels or tracks that can be recorded on at different times - in contrast to a stereo recorder, which records both the left and right channels at the same time. The multi-track recorder is the foundation of the professional audio recording field.
The digital revolution was fundamentally oriented to providing high-quality digital multi-track recorders for the professional audio community.
As early as 1976, the BBC had developed an experimental 10-channel digital multitrack recorder. Two years later, 3M in collaboration with the BBC developed the first 32-track recorder as well as a simple digital audio editor. The first computer based random-access editor for digital audio was introduced by Soundstream in 1983. This system allowed mixing of up to 8 tracks at one time. Both Soundstream and 3M have left the multi-track marketplace. The dominant forces now are Sony, Mitsubishi and Studer for high-end professional multi-tracks, all of which are extremely expensive and complex machines.
The trend of digital multitracks being the tools of only very well established studios was reversed in the early 1990s with the introduction of low-cost digital multitrack tape recorders by Alesis and Tascam. There are several types of tape-based recorders all using helical scanning video tape drives as the transport mechanism.
The digital audio workstation flourishes in the late 1990's and it is clearly the harbinger of what is to come. A cursory investigation of the SoundForge program or of a Pro-Tools or MicroSound system will reveal a very different way of conceiving multi-track audio production. That computers will replace dedicated stereo and multitrack recorders is inevitable. Memory becomes cheaper as computing power increases (Moore's Law). Clearly, for ease of use, cost and productivity factors alone, digital audio workstations out-perform the digital multi-track recorder and console paradigm.
Sound Signals: a (very) brief review
periodic waveform - a waveform which creates sound pressure variations according to a repeating pattern
noise - a waveform which exhibits no discernible repeating pattern
cycle - one repetition of a periodic waveform
fundamental - the lowest partial, the one that gives "pitch"
frequency - the number of cycles per one second of time
wavelength (or period) - the length of one cycle, measured in distance (wavelength) or in time (period); as it increases, the frequency in cycles per second decreases, and vice versa.
Hz - Hertz, a unit suffix that means "cycles per second", just as cps does.
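The inverse relation between frequency and the length of a cycle can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the speed of sound (343 m/s, roughly air at room temperature) is an assumption that does not come from these notes.

```python
# Assumption: speed of sound in air at about 20 degrees C.
SPEED_OF_SOUND_M_PER_S = 343.0

def period_seconds(frequency_hz):
    """Duration of one cycle: the reciprocal of the frequency."""
    return 1.0 / frequency_hz

def wavelength_meters(frequency_hz, speed=SPEED_OF_SOUND_M_PER_S):
    """Physical length of one cycle as it travels through the air."""
    return speed / frequency_hz

# The extremes of the audio range and concert A.
for f in (20.0, 440.0, 20000.0):
    print(f, "Hz:", period_seconds(f), "s,", wavelength_meters(f), "m")
```

Raising the frequency shortens both the period and the wavelength, which is the "vice versa" in the definition above.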
Time-Domain Representation of Sound (TDR)
The simplest way to depict a sound waveform is to draw it in the form of a graph of sound pressure (in air) versus time. This is called a time-domain representation of sound. When the curve is near the bottom of the graph the air pressure is lower, and when the curve is near the top the air pressure is higher.
amplitude - the amount of air-pressure change
Amplitude in the TDR (time-domain representation) is expressed as the vertical distance from the zero pressure point (also called a zero crossing) to the highest (or lowest) points of a given waveform segment.
TDRs are interesting, necessary and important to the study of sound and acoustics, yet they have an inherent limitation. When the resolution of the TDR is high enough that we can study the waveform in detail (recall: all waveforms are subtly changing in time if they are not synthetic in origin), it becomes easy to lose perspective, because wavelengths within the audio range (20 Hz-20 KHz) vary so widely in size. A graph might show only a very short sound, often on the order of .01 sec.
Frequency-Domain Representation of Sound (FDR)
Besides the fundamental frequency, each complex sound will have a number of other frequencies present. A Frequency-Domain Representation or a spectrum shows the frequency content of a sound.
spectrum - a FDR which shows individual components in a composite sound
harmonics (or partials) - the individual components revealed by the FDR. Harmonics are simple integer multiples of the fundamental, whereas partials do not require this specific relationship.
frequency and amplitude pairs - each component can be expressed as a frequency(n) at amplitude(n)
Displaying frequency content of a Waveform
Frequency content can be displayed in many different ways. Perhaps the most simple is a list of frequency and amplitude pairs:
440 1
880 .5
1760 .25
etc.
or,
112 .01
132.2 .12
1790.1 .7
4323.7 .23
etc.
A standard way to display frequency and amplitude content is to plot a single vertical line for each component. The position of the line along the x-axis indicates the frequency, and the height of the line on the y-axis indicates the relative strength (amplitude) at that frequency. For meaningful displays some averaging of values into a spectral graph of a desired frequency resolution will be required.
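A list of frequency and amplitude pairs is enough to rebuild a waveform by adding one sine wave per pair. The sketch below does this for the first list above. The function name `synthesize`, the 44100 Hz rate and the choice to start every component at zero phase are assumptions made for illustration only.

```python
import math

def synthesize(pairs, sample_rate, num_samples):
    """Add one sine wave per (frequency_hz, amplitude) pair, all starting at zero phase."""
    wave = []
    for n in range(num_samples):
        t = n / sample_rate  # time of this sample in seconds
        wave.append(sum(a * math.sin(2 * math.pi * f * t) for f, a in pairs))
    return wave

# The harmonic example from the text: 440 Hz plus two of its harmonics.
harmonic_spectrum = [(440.0, 1.0), (880.0, 0.5), (1760.0, 0.25)]

# 441 samples at 44100 Hz is 0.01 sec of sound.
wave = synthesize(harmonic_spectrum, 44100, 441)
```

Plotting `wave` against time gives the time-domain view of the same sound that the pairs describe in the frequency domain.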
Phase
The starting point of a periodic waveform on the y or amplitude axis is its initial phase. For example, a sine wave begins its waveform "trajectory" at zero on the y axis and completes one cycle at zero. If we shift the starting point by PI/2 on the horizontal axis (or 90 degrees), then the sinusoidal wave begins and ends at 1 on the y axis. This is known as a cosine wave by convention. A cosine is equivalent to a sine wave that is phase shifted by 90 degrees.
phase aligned - when two signals start at the same point they are in-phase or phase aligned
out of phase - when a signal is slightly delayed in respect to another signal
reversed polarity - when a signal is 180 degrees out of phase (exact opposite) in respect to another signal, also known as phase-inverted
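These phase relationships are easy to verify numerically. The following sketch (the helper name `sine` and the choice of 8 points per cycle are arbitrary) builds one cycle of a sine wave at three initial phases.

```python
import math

def sine(phase_degrees, num_points=8):
    """One cycle of a unit sine wave with the given initial phase."""
    offset = math.radians(phase_degrees)
    return [math.sin(2 * math.pi * k / num_points + offset) for k in range(num_points)]

in_phase = sine(0)
shifted_90 = sine(90)   # should match a cosine
inverted = sine(180)    # reversed polarity

cosine = [math.cos(2 * math.pi * k / 8) for k in range(8)]

# Largest mismatch between the 90-degree sine and a true cosine.
cosine_error = max(abs(a - b) for a, b in zip(shifted_90, cosine))

# Adding a signal to its polarity-reversed copy should leave (almost) nothing.
residue = max(abs(a + b) for a, b in zip(in_phase, inverted))
```

Both `cosine_error` and `residue` come out at the level of floating-point rounding, which is the numerical way of saying "equal" and "cancels completely".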
It is sometimes said that the ear is not sensitive to phase, because two signals that are exactly the same except for their initial phase are difficult to distinguish. Current research indicates that 180-degree differences in absolute phase or polarity can be distinguished by some people under laboratory conditions.
Apart from this special and currently theoretical case, phase is an important concept for several reasons.
1. Every filter uses phase shifts to alter signals
2. A filter phase shifts a signal (by delaying its input for a short time) then combines the phase shifted version with the original signal to create frequency-dependent phase cancellation effects that alter the spectrum of the original. Frequency-dependent means that not all frequency components are affected equally.
3. When phase shifting is time-varying and continuous, the affected frequency bands also vary, creating the sweeping sound effect called phasing or flanging.
4. Phase information is crucial at the onset of sounds (transients) in that transients are very susceptible to phase problems.
5. In audio components phase regularity is crucial. Frequency-dependent phase shifts distort musical signals audibly and interfere with loudspeaker imaging (the ability of a set of loudspeakers to create a stable "audio picture" where each audio source is localized to a specific place within the picture.)
phase distortion - unwanted phase shifting between and among audio signals
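The delay-and-combine idea in point 2 can be sketched for the simplest possible case: the output is the input plus a copy of itself delayed by a fixed number of samples. The 48000 Hz rate and the 24-sample delay are hypothetical values chosen to make the arithmetic clean; real filters use more elaborate combinations.

```python
import cmath
import math

def delay_and_sum_gain(frequency_hz, delay_samples, sample_rate):
    """Gain of y[n] = x[n] + x[n - delay] for a steady sine wave at frequency_hz."""
    # Phase shift that the delay imposes on this particular frequency.
    phase_shift = 2 * math.pi * frequency_hz * delay_samples / sample_rate
    return abs(1 + cmath.exp(-1j * phase_shift))

sample_rate = 48000
delay = 24  # hypothetical delay: 24 samples = 0.5 ms

for f in (0, 500, 1000, 2000):
    print(f, "Hz gain:", round(delay_and_sum_gain(f, delay, sample_rate), 3))
```

The same delay is a 180-degree shift at 1000 Hz (complete cancellation) but a 360-degree shift at 2000 Hz (full reinforcement), which is what "frequency-dependent" means here. Sweeping `delay` over time moves the cancelled frequencies and gives the flanging effect of point 3.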
Continuous Time Representations of Sound
Time-varying quantities such as voltage and amplitude are more or less analogous to each other. A graph of the air pressure variations picked up by a microphone looks very much like the variations in a graph of loudspeaker position when the sound is played back. The very term "analog" serves as a reminder of how these quantities are related.
Many representations of sound, especially in the analog domain, are CTRs (Continuous Time Representations). A phonograph groove holds within its walls a CTR of the sound stored in that record. As the needle glides through the groove, the needle moves back and forth in lateral motion. The lateral motion is changed into a fluctuating voltage which is eventually amplified and reaches the loudspeaker.
In a similar manner, the analog tape recorder creates a CTR of sound in magnetic fluctuations across the head-gap which are imprinted as magnetism on mylar.
When analog recordings are copied, the limitations of the system are revealed. Each copy, no matter what level of equipment was used, will have inherent signal degradation through the addition of unwanted noise.
first generation recording - this is the first copy from an original analog recording; signal loss and noise are not likely to be objectionable
As copies are made from copies, by the fourth copy noise will begin to be a problem and as the process continues the recording will eventually be reduced to noise only.
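A rough model shows why the noise builds up. Assume, purely for illustration, that every copy adds the same amount of uncorrelated noise, so the noise powers add. The 60 dB figure for a single generation is a hypothetical number, not a measurement of any machine.

```python
import math

def snr_after_copies(single_generation_snr_db, generations):
    """Signal-to-noise ratio in dB after a chain of copies.

    Assumes each generation adds equal, uncorrelated noise, so noise power
    grows in proportion to the number of generations.
    """
    noise_power = generations * 10 ** (-single_generation_snr_db / 10)
    return -10 * math.log10(noise_power)

for generation in (1, 2, 4, 8, 16):
    print(generation, round(snr_after_copies(60.0, generation), 1), "dB")
```

Under this model every doubling of the number of generations costs about 3 dB. Real tape also loses high frequencies and adds distortion with each pass, so actual degradation is usually worse than this sketch suggests.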
Digital Representations of Sound
A digital signal is a discrete list of values, typically binary numbers.
Analog to Digital Conversion (ADC)
Rather than CT signals typical of the analog domain, digital recording handles discrete-time (DT) signals. A DT signal is produced by an analog to digital converter (ADC) which converts voltages into a string of numbers at each period of the sample clock.
sample clock - the counting device which triggers the "sampling" of a continuous analog voltage - for a modern DAT recorder this clock ticks once every 1/44100 of a second on both of two channels
The binary numbers that result from AD conversion are stored in some kind of memory: tape, disc or RAM, for example.
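The conversion step can be imitated in software. The sketch below is not a model of any particular converter: the name `quantize`, the full-scale range of -1.0 to +1.0 volts and the rounding rule are all assumptions made for the sake of the example.

```python
import math

def quantize(voltage, bits=16, full_scale=1.0):
    """Map a voltage in [-full_scale, +full_scale] to a signed integer code."""
    max_code = 2 ** (bits - 1) - 1      # 32767 for 16 bits
    min_code = -(2 ** (bits - 1))       # -32768 for 16 bits
    code = round(voltage / full_scale * max_code)
    # Voltages beyond full scale are clipped to the largest available code.
    return max(min_code, min(max_code, code))

# One "tick" of the sample clock per list entry: a 440 Hz sine at 44100 Hz.
sample_rate = 44100
samples = [quantize(math.sin(2 * math.pi * 440 * n / sample_rate)) for n in range(10)]
```

Each entry of `samples` is the number the converter would hand to storage at one period of the sample clock.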
Binary Numbers, Bits and Bytes
Base ten numbers are also known as decimal numbers and they use the digits from 0-9. Binary numbers are base two and use only the numbers 0 and 1.
bit - a binary digit, either 0 or 1
Both integers and floating-point numbers can be represented as binary numbers. How the binary number is physically encoded in the recording medium depends on the properties of that medium. On a digital audio tape recorder a 1 might be represented by a positive magnetic charge while a 0 is indicated by the absence of such a charge. This is quite different from an analog recording in which the signal is represented as a continuously varying charge.
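To see what a stored sample looks like as bits, the sketch below converts signed sample values to and from bit strings. It assumes two's-complement encoding, which is common for integer audio samples but is not stated in these notes; the function names are illustrative.

```python
def to_binary(value, bits=16):
    """Two's-complement bit string for a signed integer sample (assumed encoding)."""
    mask = 2 ** bits - 1
    return format(value & mask, "0{}b".format(bits))

def from_binary(bit_string):
    """Recover the signed integer from a two's-complement bit string."""
    bits = len(bit_string)
    value = int(bit_string, 2)
    # A leading 1 marks a negative number in two's complement.
    return value - 2 ** bits if value >= 2 ** (bits - 1) else value

print(to_binary(5, 8))
print(to_binary(-1, 8))
print(from_binary(to_binary(-12345)))
```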
Digital to Analog Conversion (DAC)
In the DAC, stored values in binary numbers are read from storage one at a time and passed through the conversion process where the device, driven by the sample clock, changes the stream of numbers into a series of voltage levels. These voltages are smoothed by passage through a lowpass filter to produce a CT waveform.
MIDI data is also a string of binary data but it is quite different in terms of both the amount and type of information it encodes. MIDI also has a clock which runs much slower than the clock in the ADC. MIDI does not convert values; it simply records events according to a scheduler and then allows for their storage and subsequent playback. Since binary data is used, transformations are facilitated. It was interesting to note the confusion among novices when MIDI was introduced!
Imagine a quarter note played four times at BPM of 60!
...for a MIDI data stream - 16 pieces of information (4 notes x 4 items: begin, end, pitch, velocity); timing information is handled by the UART in the MIDI interface, which does time tagging
...for a digital tape recorder at 44.1KHz - 352,800 pieces of information
44.1KHz x 2 (stereo) x 4 seconds
Needless to say, the storage requirements for audio are large (though much smaller than for video). Using 16-bit samples, it takes over 700,000 bytes to store a 4-second sound.
A 48 track MIDI sequencer, running on a PC might handle 4000 bytes/second while a 48-track digital multitrack recorder might handle more than 4.6 Mbytes of information per second - over one thousand times the data rate of MIDI.
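The storage figures above follow from simple multiplication, sketched here. The helper name `audio_bytes` is made up, and the 48-track calculation assumes 16-bit samples at 48 KHz per track, which the text does not state but which reproduces its 4.6 Mbyte figure.

```python
def audio_bytes(sample_rate, channels, seconds, bits_per_sample=16):
    """Bytes needed to store uncompressed PCM audio."""
    samples = sample_rate * channels * seconds
    return samples * bits_per_sample // 8

# Four seconds of stereo at 44.1 KHz: 352,800 samples, two bytes each.
four_second_stereo = audio_bytes(44100, 2, 4)

# Assumed: 48 tracks, 48 KHz, 16 bits, for one second.
multitrack_per_second = audio_bytes(48000, 48, 1)

midi_per_second = 4000  # the sequencer figure quoted in the text
ratio = multitrack_per_second / midi_per_second

print(four_second_stereo, multitrack_per_second, ratio)
```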
The advantage of digital recording is that it can capture any sound that can be recorded by a microphone - MIDI is restricted to recording control signals associated with performances on MIDI instruments.
Sampling
A digital signal is only defined at certain points in time - the signal is sampled at certain times - and one sample contains the information encoded at a certain point in time. Each sample represents a number smaller or larger depending upon the shape of the waveform sampled.
The number of bits used to represent each sample determines both the noise level and the amplitude range that can be handled by the system. A CD player uses 16 bits per sample which is the current industry standard.
8 bits - telephone, personal computers, other tools
12 bits - early PCM, Emulator Emax, drum machines
16 bits - current consumer, semi-pro standard
20 - 24 bits - extended word lengths for hi-resolution audio
32 bits - floating-point numbers used for Digital Signal Processing
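For the integer word lengths in this list, the link between bits and amplitude range can be put in numbers. The sketch uses the usual rule of thumb that the ratio of full scale to one quantization step is 2 to the power of the number of bits, about 6 dB per bit. It ignores dither and converter imperfections and does not apply to the 32-bit floating-point case.

```python
import math

def quantization_levels(bits):
    """Number of distinct values an integer sample of this size can take."""
    return 2 ** bits

def dynamic_range_db(bits):
    """Ratio of full scale to one quantization step, in decibels."""
    return 20 * math.log10(2 ** bits)

for bits in (8, 12, 16, 20, 24):
    print(bits, "bits:", quantization_levels(bits), "levels,",
          round(dynamic_range_db(bits), 1), "dB")
```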
sampling frequency (rate) - the rate at which samples are taken, expressed in samples per second or in Hz; often called the sampling rate
Sampling frequencies around 50KHz are common in digital audio systems, although both higher and lower rates can be found. At a rate of 50KHz one minute of stereo sound requires 6,000,000 numbers.
The duration of each sample is extremely narrow, but it is conceivable that a waveform could change during the sampling period, and this change may not be reflected in the list of samples. What makes digital sound recording possible is the fact that if the signal is bandlimited, the DAC and associated hardware can exactly reconstruct the original signal from these samples. This means that, given certain conditions, the missing part of the signal "between the samples" can be restored. This occurs when the data stream is passed through the DAC and smoothing filter. The smoothing filter "connects the dots" between the discrete samples and thereby re-creates the original signal.
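The claim that the signal between the samples can be restored can be tried out in software. The sketch below uses the textbook interpolation formula for bandlimited signals (a sum of shifted sinc functions) as a stand-in for the ideal smoothing filter. Real hardware only approximates this, and because the list of samples here is finite, the result is close to the true value rather than exact. All names and numbers are chosen for illustration.

```python
import math

def sinc(x):
    """Normalized sinc function, sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def reconstruct(samples, sample_rate, t):
    """Estimate the signal value at time t (seconds) from its samples."""
    position = t * sample_rate  # time measured in sample periods
    return sum(s * sinc(position - n) for n, s in enumerate(samples))

# Two seconds of a 125 Hz sine sampled at 1000 Hz (8 samples per cycle).
sample_rate = 1000
frequency = 125
samples = [math.sin(2 * math.pi * frequency * n / sample_rate) for n in range(2000)]

# Pick a moment that falls between two samples, near the middle of the list.
t = 1.00037
estimate = reconstruct(samples, sample_rate, t)
exact = math.sin(2 * math.pi * frequency * t)
print(estimate, exact)
```

The two printed values agree closely even though no sample was ever taken at that instant.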
Aliasing (Foldover)
Though the process of sampling seems relatively straightforward, there are some problems to overcome. Just as an audio amplifier or loudspeaker can introduce distortion, sampling can also play tricks with sound. The problems with sampling can be seen easily when we consider what happens when wavelength is changed without changing the sampling rate. Under certain circumstances the wavelength of the resynthesized (DAC) sampled sound will be different from the original. This would mean that the sound played from the DAC would sound at a different pitch than the original. This kind of distortion is called aliasing or foldover.
Fortunately for us, the frequencies at which foldover will occur can be predicted.
For example:
1000 Hz sampling rate
125 Hz wave = 8 samples per cycle (1000/8=125)
500 Hz wave = 2 samples per cycle (1000/2=500)
1100 Hz wave = output occurs at a rate of 1000/10 = 100 Hz - this one is aliased
In the last case the frequency is folded over into the lower range. Note that the frequency (1100) is above the SRATE (1000). In this last case the sound has been changed by a "sample rate conversion" process.
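The 1100 Hz case can be confirmed directly: sampled at 1000 Hz, an 1100 Hz sine produces the same list of numbers as a 100 Hz sine, so nothing downstream can tell them apart. A minimal sketch, with an illustrative helper name:

```python
import math

def sample_sine(frequency_hz, sample_rate, num_samples):
    """Sample a unit sine wave of the given frequency."""
    return [math.sin(2 * math.pi * frequency_hz * n / sample_rate)
            for n in range(num_samples)]

high = sample_sine(1100, 1000, 20)  # above the sampling rate
low = sample_sine(100, 1000, 20)    # its alias

# Largest difference between the two sample lists.
largest_difference = max(abs(a - b) for a, b in zip(high, low))
print(largest_difference)
```

The difference is at the level of floating-point rounding: once sampled, the two signals are the same signal.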
The Sampling Theorem
For general purposes we can say that as long as there are at least two samples per period of the original waveform, we can assume that the resynthesized waveform will have the same frequency. But, when there are fewer than 2 samples per period, the frequency (and perhaps also the timbre) of the original is lost. In this case, the new frequency can be found by the following formula. If the original frequency is higher than half the sampling frequency, then:
foldover = new frequency = sampling frequency - original frequency
This formula is not mathematically complete here, but it is sufficient for this discussion. It means the following: suppose we have chosen a fixed sampling frequency. We start with a signal at a low frequency, sample it, and resynthesize the signal after sampling. As we raise the pitch of the input signal (keeping the sampling rate constant) the pitch of the resynthesized signal is the same as the pitch of the input signal until we reach a pitch that corresponds to one half of the sampling frequency. As we raise the pitch of the input signal beyond this point, the pitch of the output signal goes down to the lowest frequencies! When the input signal reaches the sampling frequency, the entire process repeats itself.
Another concrete example:
a signal at 26 KHz patched to ADC at SRATE of 50 KHz is shifted to 24 KHz by aliasing - since 50-26=24!
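The foldover rule, including the repetition above the sampling frequency described earlier, can be written as one small function. This is a sketch for a single sine-wave input; the function name is made up.

```python
def alias_frequency(input_hz, sample_rate):
    """Frequency that comes out after sampling, folded into 0..sample_rate/2."""
    # The pattern repeats every multiple of the sampling rate.
    remainder = input_hz % sample_rate
    # Above half the sampling rate, the frequency folds back down.
    return min(remainder, sample_rate - remainder)

print(alias_frequency(26000, 50000))  # the 26 KHz example: 50 - 26 = 24 KHz
print(alias_frequency(1100, 1000))    # the earlier 1100 Hz example
print(alias_frequency(125, 1000))     # below half the rate: unchanged
```

For inputs between half the sampling frequency and the sampling frequency itself, this reduces to the formula given above.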
The Nyquist Rate
In his own words:
"For any given deformation of the received signal, the transmitted frequency range must be increased in direct proportion to the signaling speed..., the conclusion is that the frequency band is directly proportional to the speed." (Harold Nyquist, 1928)
or, in other words...
"In order to be able to reconstruct a signal, the sampling frequency must be at least twice the frequency of the signal being sampled."
The highest frequency that can be produced by a digital audio system is called the Nyquist frequency and is half of the sampling rate. In most applications, using high-speed ADCs, the Nyquist frequency lies above the threshold of human hearing - above 20KHz. Consider the 44.1 KHz rate and the total frequency bandwidth.
DACs and associated hardware cannot reliably reconstruct signals that lie close to the Nyquist frequency; sampling rates are therefore set a bit higher than twice the highest frequency of interest.
Ideal Sampling Frequency
The ideal sampling rate and for that matter sample size are points of debate among industry professionals. And, there are some inherent problems with digital audio theory:
1. mathematical theory and engineering practice often conflict
2. converter clocks are not stable
3. converter voltages are not linear
4. filters introduce phase distortion
Many people hear information in the 20 KHz range and beyond as "air" and as desirable. For some people, even up to age 41, the frequency range extends to over 23 KHz as it did for Rudolf Koenig. It seems strange that a new CD would have less frequency bandwidth than a phonograph record made in the 1960's or a new digital recorder have less frequency bandwidth than a 20 year-old analog recorder. Many analog systems produce frequencies beyond 25 KHz and the study of psychoacoustics has revealed that sound has effect above 22 KHz in both the physiological and subjective areas.
We are moving to a 96 KHz sampling rate which will greatly extend the realizable upper frequency limit. Even at 16 bit sample length this would be a significant achievement. With larger word lengths, digital audio will finally throw off the limitations of technology that have characterized the first 20 years.
Anti-aliasing and Anti-imaging Filters
In order to make sure that a digital recording system works properly, two important filters are needed in the hardware design. One filter is placed before the ADC to ensure that nothing (or as little as possible) in the input signal occurs at a frequency higher than the Nyquist frequency. As long as this filter is working, aliasing should not occur in the recording process. This filter is called an anti-aliasing filter.
The other filter is placed after the DAC. Its main function is to change the samples stored digitally into a smooth, continuous representation of the signal. This low-pass filter is called an anti-imaging or smoothing filter.
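Both filters are low-pass filters, and what such a filter does can be illustrated digitally even though the filters described here are hardware. The sketch below designs a simple low-pass filter by the windowed-sinc method and measures how much of a sine wave gets through at two frequencies. The 96000 Hz rate, the 20000 Hz cutoff, the 101 coefficients and the Hamming window are all assumptions picked for the example, not a description of any real converter.

```python
import cmath
import math

def lowpass_taps(cutoff_hz, sample_rate, num_taps=101):
    """Hamming-windowed sinc low-pass filter coefficients (num_taps must be odd)."""
    fc = cutoff_hz / sample_rate       # cutoff as a fraction of the sampling rate
    middle = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - middle
        # Ideal low-pass impulse response, truncated to num_taps points.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        # The Hamming window tapers the ends to reduce ripple.
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [tap / total for tap in taps]   # normalize so the gain at 0 Hz is 1

def gain_at(taps, frequency_hz, sample_rate):
    """Magnitude of the filter's response to a steady sine at frequency_hz."""
    w = 2 * math.pi * frequency_hz / sample_rate
    return abs(sum(tap * cmath.exp(-1j * w * n) for n, tap in enumerate(taps)))

taps = lowpass_taps(20000, 96000)
print(gain_at(taps, 1000, 96000))    # in the passband
print(gain_at(taps, 30000, 96000))   # well above the cutoff
```

A tone in the audio range passes almost unchanged while a tone well above the cutoff is strongly attenuated, which is the job of both the anti-aliasing and the anti-imaging filter.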
End of Part I: rst