What Is Sound?
Before a machine can understand speech, recognise a musical instrument, or generate a human voice, it needs to turn sound into numbers. But what is sound in the first place? It's a pressure wave — a disturbance that propagates through air (or any medium) by alternating regions of high and low pressure. When you speak, your vocal cords vibrate, pushing air molecules together and pulling them apart in rapid succession. Those pressure fluctuations travel outward until they reach a microphone, which converts them into a tiny electrical voltage that rises and falls in step with the pressure. An analog-to-digital converter (ADC) then measures that voltage at regular intervals, producing a sequence of numbers — each one an amplitude value representing the pressure at that instant. That sequence is digital audio.
The key parameter controlling this process is the sampling rate ($f_s$): the number of measurements (samples) taken per second. Common rates include 16 kHz (used by speech models like Whisper), 44.1 kHz (CD-quality audio), and 48 kHz (professional video and broadcast). At 16 kHz, one second of audio becomes 16,000 numbers. One minute becomes 960,000 numbers. Ten minutes of a podcast becomes 9.6 million numbers. That's an enormous amount of data for a model to process directly, and much of this track is about how we compress and transform that raw signal into something more manageable.
But why 16 kHz for speech and 44.1 kHz for music? The answer comes from the Nyquist-Shannon sampling theorem (Shannon, 1949): to perfectly capture a frequency $f$ in a continuous signal, you must sample at a rate of at least $2f$. If you sample slower than that, high-frequency content gets "folded back" into lower frequencies — a phenomenon called aliasing — corrupting the signal in a way that cannot be undone. The highest frequency that a given sampling rate can faithfully represent is called the Nyquist frequency:

$$f_{\text{max}} = \frac{f_s}{2}$$
Here $f_s$ is the sampling rate and $f_{\text{max}}$ is the Nyquist frequency — the absolute ceiling on what we can represent. Any frequency content above $f_{\text{max}}$ will alias into lower frequencies and corrupt the signal. Below $f_{\text{max}}$, the original continuous signal can be perfectly reconstructed from the discrete samples (up to the quantisation error introduced by a finite number of bits per sample). At 16 kHz sampling, $f_{\text{max}} = 8{,}000$ Hz. Human speech fundamentals sit between roughly 85 Hz (deep male voice) and 300 Hz (child's voice), with consonant energy and sibilance reaching up to about 8 kHz, so 16 kHz captures speech well. Human hearing, however, extends to roughly 20 kHz, which is why CD audio uses 44.1 kHz ($f_{\text{max}} = 22{,}050$ Hz) — enough headroom to cover the full audible range.
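To see aliasing concretely, here is a small standalone sketch (separate from the plotting code in this article): a 9 kHz tone sampled at 16 kHz produces exactly the same samples as a phase-inverted 7 kHz tone, so once sampled, the two are indistinguishable.

```python
import math

fs = 16000             # sampling rate
f_high = 9000          # above the 8000 Hz Nyquist limit
f_alias = fs - f_high  # folds back to 7000 Hz

n = 16  # compare the first 16 samples
high = [math.sin(2 * math.pi * f_high * k / fs) for k in range(n)]
# Sampling cannot distinguish the 9 kHz tone from a phase-inverted 7 kHz tone
alias = [-math.sin(2 * math.pi * f_alias * k / fs) for k in range(n)]

max_diff = max(abs(a - b) for a, b in zip(high, alias))
print(f"{f_high} Hz sampled at {fs} Hz is indistinguishable from {f_alias} Hz")
print(f"Max sample difference: {max_diff:.2e}")
```

This is why ADCs place an analog anti-aliasing filter before sampling: any content above $f_s/2$ must be removed in the analog domain, because after sampling the damage is unrecoverable.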
To make this concrete, the plot below shows a 440 Hz sine wave (the musical note A4, the standard tuning reference) as a continuous signal, alongside discrete samples taken at 16 kHz. Each dot is one number that the ADC produces — the full waveform between dots is lost, but thanks to Nyquist (since 440 Hz is far below the 8 kHz limit), we could reconstruct it perfectly.
import math, json, js

# Generate a 440 Hz sine wave (A4 note)
freq = 440           # Hz
duration = 0.005     # 5 ms — enough to show ~2 cycles
sample_rate = 16000  # 16 kHz

# "Continuous" signal: very dense sampling for a smooth curve
n_continuous = 500
continuous_t = [i * duration / n_continuous for i in range(n_continuous)]
continuous_y = [math.sin(2 * math.pi * freq * t) for t in continuous_t]

# Discrete samples at 16 kHz
n_samples = int(sample_rate * duration)  # 80 samples in 5 ms
sample_t = [i / sample_rate for i in range(n_samples)]
sample_y = [math.sin(2 * math.pi * freq * t) for t in sample_t]

# Convert time to milliseconds for readability
continuous_t_ms = [round(t * 1000, 4) for t in continuous_t]
sample_t_ms = [round(t * 1000, 4) for t in sample_t]

plot_data = [
    {
        "title": "440 Hz Sine Wave: Continuous vs Sampled at 16 kHz",
        "x_label": "Time (ms)",
        "y_label": "Amplitude",
        "x_data": continuous_t_ms,
        "lines": [
            {"label": "Continuous signal", "data": [round(y, 4) for y in continuous_y], "color": "#3b82f6"}
        ]
    },
    {
        "title": "Discrete Samples (16 kHz) — Each Dot Is One Number",
        "x_label": "Time (ms)",
        "y_label": "Amplitude",
        "x_data": sample_t_ms,
        "lines": [
            {"label": "Samples (16 kHz)", "data": [round(y, 4) for y in sample_y], "color": "#ef4444", "dotted": True}
        ]
    }
]
js.window.py_plot_data = json.dumps(plot_data)

print(f"Frequency: {freq} Hz (A4 note)")
print(f"Sampling rate: {sample_rate} Hz")
print(f"Nyquist frequency: {sample_rate // 2} Hz")
print(f"Samples in 5 ms: {n_samples}")
print(f"440 Hz is well below {sample_rate // 2} Hz => no aliasing")
From Waveforms to Frequency: The Fourier Transform
A waveform plot shows amplitude over time — it tells you when the signal is loud or quiet, but not which frequencies are present. Look at a waveform of someone saying "hello" and you'll see a wiggly line that gives almost no clue about the vowel formants, the consonant bursts, or the pitch of the speaker's voice. To extract that information, we need to decompose the signal into its constituent frequencies. That's what the Fourier Transform does.
The core intuition is surprisingly simple: any signal, no matter how complex, can be expressed as a sum of sine waves at different frequencies, amplitudes, and phases. A piano chord is a sum of the fundamental frequencies of each note plus their harmonics. A spoken vowel is a sum of the vocal cord's fundamental frequency plus resonant frequencies shaped by the throat and mouth. The Fourier Transform tells us exactly which sine waves to add together to reconstruct the original signal — it converts a time-domain representation (amplitude vs time) into a frequency-domain representation (amplitude vs frequency).
For discrete digital audio (a finite list of $N$ samples), we use the Discrete Fourier Transform (DFT):

$$X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-i \, 2\pi k n / N}$$
Let's unpack every symbol. $x[n]$ is the $n$-th sample of the signal — one of our amplitude values from the ADC. $X[k]$ is the $k$-th frequency bin — a complex number whose magnitude $|X[k]|$ tells us how strong frequency $k$ is in the signal, and whose phase $\angle X[k]$ tells us the timing offset of that frequency component. $N$ is the total number of samples in the analysis window. The term $e^{-i \, 2\pi k n / N}$ is a complex sinusoid (by Euler's formula, $e^{-i\theta} = \cos\theta - i\sin\theta$) at frequency $k$. The summation computes the dot product of the signal with this sinusoid — it measures how much the signal "correlates with" or "looks like" a sine wave at frequency $k$. If the signal contains a strong component at that frequency, the dot product is large; if not, the terms cancel out and the result is near zero.
The index $k$ ranges from 0 to $N - 1$, but for real-valued signals (which audio always is), the spectrum is symmetric: $X[k]$ and $X[N - k]$ are complex conjugates. So only bins $k = 0$ through $k = N/2$ carry unique information. At the boundaries: $k = 0$ gives the DC component — the sum of all samples, which is $N$ times the signal's average value. $k = N/2$ corresponds to the Nyquist frequency — the highest frequency representable at this sampling rate. Between these extremes, each bin $k$ corresponds to a frequency of $k \cdot f_s / N$ Hz.
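These boundary properties are easy to verify numerically. A quick standalone check using `cmath` and the naive DFT (on an arbitrary random signal, not tied to any of the examples above):

```python
import math, cmath, random

random.seed(0)
N = 16
x = [random.uniform(-1, 1) for _ in range(N)]  # arbitrary real-valued signal

# Naive DFT: X[k] = sum_n x[n] * exp(-i 2 pi k n / N)
X = [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
     for k in range(N)]

# Conjugate symmetry: X[k] == conj(X[N-k]) for real-valued input
sym_err = max(abs(X[k] - X[N - k].conjugate()) for k in range(1, N))
print(f"Max conjugate-symmetry error: {sym_err:.2e}")
# The DC bin is the sum of the samples; the Nyquist bin is purely real
print(f"X[0] (DC) = {X[0].real:.4f}  vs  sum(x) = {sum(x):.4f}")
print(f"Imaginary part of Nyquist bin X[N/2]: {X[N // 2].imag:.2e}")
```

All three errors come out at floating-point noise level, confirming that the second half of the spectrum is redundant for real signals.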
Computing the DFT naively requires $N$ multiplications for each of $N$ frequency bins, giving $O(N^2)$ complexity. The Fast Fourier Transform (FFT) (Cooley & Tukey, 1965) exploits symmetries in the complex exponentials to compute the same result in $O(N \log N)$. For a typical window of $N = 512$ samples, that's roughly 4,600 operations instead of 262,000 — a 57x speedup. The FFT is one of the most important algorithms in all of signal processing, and it's what makes real-time audio analysis practical.
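As an illustration of the divide-and-conquer idea (a simplified recursive sketch, not Cooley and Tukey's in-place formulation), a radix-2 FFT fits in a few lines and can be checked against the naive DFT:

```python
import math, cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    even = fft(x[0::2])  # DFT of even-indexed samples
    odd = fft(x[1::2])   # DFT of odd-indexed samples
    twiddles = [cmath.exp(-2j * math.pi * k / N) for k in range(N // 2)]
    # Combine the two half-size DFTs into the full-size DFT
    return ([even[k] + twiddles[k] * odd[k] for k in range(N // 2)] +
            [even[k] - twiddles[k] * odd[k] for k in range(N // 2)])

# Check against the naive O(N^2) DFT on a small test signal
N = 64
x = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]  # 5 cycles per window
naive = [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
         for k in range(N)]
fast = fft(x)
err = max(abs(a - b) for a, b in zip(naive, fast))
print(f"Max |FFT - DFT| difference: {err:.2e}")
print(f"Peak magnitude at bin 5: {abs(fast[5]):.1f} (expected N/2 = {N / 2})")
```

The two transforms agree to floating-point precision; the recursion does the same work with $O(N \log N)$ multiplications instead of $O(N^2)$.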
The plot below demonstrates the Fourier Transform in action. We create a signal that's the sum of three sine waves (200 Hz, 500 Hz, and 1200 Hz) and then compute its magnitude spectrum. The three peaks in the frequency domain correspond exactly to the three frequencies we mixed together.
import math, json, js

# Build a signal from 3 sine waves: 200 Hz, 500 Hz, 1200 Hz
sample_rate = 16000
duration = 0.025                 # 25 ms window
N = int(sample_rate * duration)  # 400 samples

# Generate the composite signal
signal = []
for n in range(N):
    t = n / sample_rate
    val = (0.8 * math.sin(2 * math.pi * 200 * t)
           + 0.5 * math.sin(2 * math.pi * 500 * t)
           + 0.3 * math.sin(2 * math.pi * 1200 * t))
    signal.append(val)

# Compute DFT magnitudes (only the first N/2 + 1 bins — the unique part)
half_N = N // 2 + 1
magnitudes = []
for k in range(half_N):
    re = 0.0
    im = 0.0
    for n in range(N):
        angle = 2 * math.pi * k * n / N
        re += signal[n] * math.cos(angle)
        im -= signal[n] * math.sin(angle)
    mag = math.sqrt(re * re + im * im) / N  # normalise
    magnitudes.append(round(mag, 4))

# Frequency axis: each bin k -> k * fs / N Hz
freqs = [round(k * sample_rate / N, 1) for k in range(half_N)]
# Time axis in ms for the waveform
time_ms = [round(n / sample_rate * 1000, 3) for n in range(N)]

plot_data = [
    {
        "title": "Composite Signal: 200 Hz + 500 Hz + 1200 Hz",
        "x_label": "Time (ms)",
        "y_label": "Amplitude",
        "x_data": time_ms,
        "lines": [
            {"label": "Signal", "data": [round(s, 4) for s in signal], "color": "#3b82f6"}
        ]
    },
    {
        "title": "DFT Magnitude Spectrum — Peaks at 200, 500, 1200 Hz",
        "x_label": "Frequency (Hz)",
        "y_label": "Magnitude",
        "x_data": freqs,
        "lines": [
            {"label": "Magnitude", "data": magnitudes, "color": "#10b981"}
        ]
    }
]
js.window.py_plot_data = json.dumps(plot_data)

print(f"Window: {N} samples ({duration*1000:.0f} ms at {sample_rate} Hz)")
print(f"Frequency bins: {half_N} (0 to {sample_rate//2} Hz)")
print(f"Frequency resolution: {sample_rate/N} Hz per bin")
print(f"Peak bins near 200, 500, 1200 Hz visible in the spectrum")
Spectrograms: Frequency Over Time
The DFT gives us the frequency content of a signal, but it analyses the entire signal at once. That's fine for a steady tone, but speech and music change rapidly — a single word might contain a voiced vowel, a fricative consonant, and a silence, each with completely different frequency profiles. If we run a single DFT over the whole word, those different sounds get averaged together and we lose the ability to see when each frequency was active. We need a way to see how the frequency content evolves over time.
The solution is the Short-Time Fourier Transform (STFT) : chop the signal into short, overlapping windows, and compute the DFT on each window independently. Each window is short enough that the signal is approximately stationary within it (the frequency content doesn't change much over 25 milliseconds), but long enough to give reasonable frequency resolution.
Three parameters control the STFT:
- Window size (n_fft): the number of samples in each analysis frame. Typically 25 ms, which at 16 kHz is 400 samples. This determines frequency resolution: $f_s / \text{n\_fft} = 16{,}000 / 400 = 40$ Hz per bin. Larger windows give finer frequency resolution but blur the time axis.
- Hop size (hop_length): how far we advance between consecutive windows. Typically 10 ms (160 samples at 16 kHz). A hop shorter than the window size means windows overlap, ensuring we don't miss transient events that fall between frames.
- Window function: a taper applied to each frame before computing the DFT. The standard choice is a Hann window ($0.5 - 0.5 \cos(2\pi n / N)$), which smoothly fades the signal to zero at the frame edges. Without this, the abrupt truncation at frame boundaries creates artificial high-frequency artefacts called spectral leakage.
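A small standalone sketch of the leakage effect (the window length and bin choices here are arbitrary): a tone whose frequency falls between DFT bins smears energy across the whole spectrum under a rectangular window, but far less under a Hann window.

```python
import math

N = 64
# A tone completing 5.5 cycles per window lands between DFT bins --
# the worst case for spectral leakage
x = [math.sin(2 * math.pi * 5.5 * n / N) for n in range(N)]
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / N) for n in range(N)]
windowed = [x[n] * hann[n] for n in range(N)]

def dft_mag(sig, k):
    """Magnitude of DFT bin k of sig."""
    re = sum(sig[n] * math.cos(2 * math.pi * k * n / N) for n in range(N))
    im = -sum(sig[n] * math.sin(2 * math.pi * k * n / N) for n in range(N))
    return math.sqrt(re * re + im * im)

# Leakage far from the tone (bin 20), relative to the peak region (bin 5)
leak_rect = dft_mag(x, 20) / dft_mag(x, 5)
leak_hann = dft_mag(windowed, 20) / dft_mag(windowed, 5)
print(f"Rectangular window: bin-20 leakage = {leak_rect:.6f} of peak")
print(f"Hann window:        bin-20 leakage = {leak_hann:.6f} of peak")
```

The Hann window trades a slightly wider main lobe for drastically lower sidelobes, which is why it (or a close relative) is the default in virtually every STFT implementation.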
The result is a 2D matrix called a spectrogram. One axis is time (each column is one window), the other is frequency (each row is one frequency bin), and the values are magnitudes $|X[k]|$. With a 25 ms window and a 10 ms hop, one second of audio produces about 100 time frames, each with $\text{n\_fft}/2 + 1 = 201$ frequency bins. So one second of audio becomes a $201 \times 100$ matrix — because of the window overlap, this is actually slightly more numbers than the original 16,000 raw samples, but it organises the information in a far more useful way, and the mel filterbank in the next section compresses it substantially.
To illustrate, the code below generates a chirp signal — a sine wave whose frequency increases linearly from 200 Hz to 3000 Hz over a tenth of a second — and computes its STFT spectrogram. In the output, you can see how the peak frequency in each time frame shifts upward, exactly as we'd expect from a chirp. This is information the raw waveform hides but the spectrogram reveals immediately.
import math, json, js

# Generate a chirp: frequency sweeps from 200 Hz to 3000 Hz over 0.1 s
sample_rate = 16000
duration = 0.1                         # 100 ms to keep computation small
N_total = int(sample_rate * duration)  # 1600 samples
f_start, f_end = 200, 3000

signal = []
for n in range(N_total):
    t = n / sample_rate
    # Instantaneous frequency rises linearly: f(t) = f_start + (f_end - f_start) * t / duration.
    # The phase is the integral of 2 * pi * f(t).
    phase = 2 * math.pi * (f_start * t + 0.5 * (f_end - f_start) * t * t / duration)
    signal.append(math.sin(phase))

# STFT parameters
n_fft = 256
hop_length = 128

# Hann window
hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]

# Compute the STFT frame by frame
n_frames = (N_total - n_fft) // hop_length + 1
half_bins = n_fft // 2 + 1  # 129 frequency bins

# For the table, show the peak frequency per frame
rows = []
for frame_idx in range(n_frames):
    start = frame_idx * hop_length
    # Apply the Hann window
    windowed = [signal[start + n] * hann[n] for n in range(n_fft)]
    # DFT of the windowed frame (only positive frequencies)
    best_k = 0
    best_mag = 0.0
    for k in range(half_bins):
        re = 0.0
        im = 0.0
        for n in range(n_fft):
            angle = 2 * math.pi * k * n / n_fft
            re += windowed[n] * math.cos(angle)
            im -= windowed[n] * math.sin(angle)
        mag = math.sqrt(re * re + im * im)
        if mag > best_mag:
            best_mag = mag
            best_k = k
    peak_freq = best_k * sample_rate / n_fft
    time_ms = round((start + n_fft / 2) / sample_rate * 1000, 1)
    rows.append([str(frame_idx), f"{time_ms}", f"{peak_freq:.0f}"])

js.window.py_table_data = json.dumps({
    "headers": ["Frame", "Centre Time (ms)", "Peak Frequency (Hz)"],
    "rows": rows
})

print(f"Chirp: {f_start} Hz -> {f_end} Hz over {duration*1000:.0f} ms")
print(f"Window: {n_fft} samples, Hop: {hop_length} samples")
print(f"Frames: {n_frames}, Freq bins: {half_bins}")
print(f"Peak frequency rises with each frame — the spectrogram reveals the chirp")
Notice how the peak frequency climbs steadily across frames, tracing the chirp's sweep from 200 Hz toward 3000 Hz. A raw waveform would just look like a wiggly line getting slightly faster — the spectrogram makes the frequency structure explicit. This is why spectrograms (and their mel-scaled variants, coming next) are the standard input representation for speech and audio models.
The Mel Scale: Hearing Like a Human
The linear spectrogram treats all frequencies equally: the gap between 100 Hz and 200 Hz gets the same number of bins as the gap between 7,900 Hz and 8,000 Hz. But human hearing doesn't work that way. The jump from 100 Hz to 200 Hz — one octave — sounds like a dramatic pitch change (think of the lowest note on a bass guitar versus one octave up). The jump from 5,000 Hz to 5,100 Hz is barely perceptible. Our ears have roughly logarithmic frequency resolution: we're very sensitive to differences at low frequencies and increasingly coarse at high frequencies.
The mel scale (Stevens, Volkmann & Newman, 1937) formalises this perceptual warping. It maps linear frequency (in Hz) to a scale that better matches how humans perceive pitch:

$$m = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right)$$
Here $f$ is frequency in Hz and $m$ is the corresponding mel value. Let's see what this formula actually does. The key is the argument to the logarithm: $1 + f/700$. When $f$ is small relative to 700 (say, $f = 100$ Hz), $f/700 \approx 0.143$, and $\log_{10}(1.143) \approx 0.058$, so $m \approx 2595 \times 0.058 \approx 150$. This is roughly proportional to $f$ — the mapping is nearly linear at low frequencies. But when $f$ is large (say, $f = 8{,}000$ Hz), $f/700 \approx 11.4$, and the $+1$ becomes negligible, so $\log_{10}(1 + f/700) \approx \log_{10}(f/700)$. Now the mapping is logarithmic — doubling $f$ adds a fixed constant ($2595 \log_{10} 2 \approx 781$ mel) in mel space. This is exactly the behaviour we want: fine-grained resolution where human hearing is sensitive (low frequencies) and coarser resolution where it's not (high frequencies).
At the boundaries: $f = 0$ gives $m = 2595 \cdot \log_{10}(1) = 0$ mel. $f = 700$ Hz gives $m = 2595 \cdot \log_{10}(2) \approx 781$ mel — this is roughly the transition point between the linear and logarithmic regimes. $f = 8{,}000$ Hz (the Nyquist limit at 16 kHz sampling) gives $m = 2595 \cdot \log_{10}(1 + 8000/700) \approx 2595 \cdot \log_{10}(12.43) \approx 2840$ mel.
The table below computes mel values for several important frequencies in speech and audio, illustrating the scale's compression at high frequencies.
import math, json, js

def hz_to_mel(f):
    return 2595 * math.log10(1 + f / 700)

freqs = [
    (0, "Silence / DC"),
    (85, "Low male voice fundamental"),
    (200, "Typical female voice fundamental"),
    (300, "Child's voice fundamental"),
    (700, "Linear-to-log transition"),
    (1000, "Reference frequency (1 kHz)"),
    (2000, "Vowel second formant region"),
    (4000, "Consonant energy / sibilance"),
    (8000, "Nyquist limit at 16 kHz"),
    (16000, "Nyquist limit at 32 kHz"),
    (22050, "Nyquist limit at 44.1 kHz (CD)")
]

rows = []
for f, desc in freqs:
    m = hz_to_mel(f)
    rows.append([f"{f:,}", f"{m:.0f}", desc])

js.window.py_table_data = json.dumps({
    "headers": ["Frequency (Hz)", "Mel Value", "Description"],
    "rows": rows
})

print("Key insight: 0-1000 Hz spans ~1000 mel, but 1000-8000 Hz")
print("(7x the Hz range) spans only ~1840 mel.")
print("The mel scale compresses high frequencies aggressively.")
To build a mel spectrogram, we don't just convert the frequency axis — we apply a mel filterbank: a set of overlapping triangular bandpass filters whose centre frequencies are evenly spaced on the mel scale. Because mel values are compressed at high frequencies, this packs more filters into the low-frequency range (where human perception is fine-grained) and fewer into the high-frequency range. A typical filterbank for speech uses 80 mel channels (as in Whisper). Each filter sums up the energy from several adjacent frequency bins of the linear spectrogram, producing one number per filter per time frame.
The final step is to take the logarithm of each filterbank energy. This has two motivations: human perception of loudness is roughly logarithmic (a sound must double in power to seem noticeably louder), and log-compression reduces the dynamic range, making the values easier for neural networks to work with. The full pipeline is: waveform $\rightarrow$ STFT $\rightarrow$ magnitude spectrogram $\rightarrow$ mel filterbank $\rightarrow$ log $\rightarrow$ log-mel spectrogram. This is what Whisper, most speech recognition systems, and text-to-speech models like F5-TTS use as their input representation.
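To make the filterbank construction concrete, here is a minimal standalone sketch (a simplified version of what audio libraries do internally; the 10-filter count is only for readability, and real implementations use exact fractional bin positions rather than rounding):

```python
import math

def hz_to_mel(f):
    return 2595 * math.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

fs = 16000
n_fft = 400              # 25 ms at 16 kHz
n_bins = n_fft // 2 + 1  # 201 linear frequency bins
n_mels = 10              # small for readability; Whisper uses 80

# Centre frequencies: evenly spaced in mel, converted back to Hz,
# then mapped to the nearest STFT bin
lo, hi = hz_to_mel(0), hz_to_mel(fs / 2)
mel_points = [lo + i * (hi - lo) / (n_mels + 1) for i in range(n_mels + 2)]
bin_points = [round(mel_to_hz(m) * n_fft / fs) for m in mel_points]

# Filter m rises from bin_points[m-1] to bin_points[m], falls to bin_points[m+1]
filterbank = []
for m in range(1, n_mels + 1):
    left, centre, right = bin_points[m - 1], bin_points[m], bin_points[m + 1]
    filt = [0.0] * n_bins
    for k in range(left, centre):
        filt[k] = (k - left) / max(centre - left, 1)
    for k in range(centre, right):
        filt[k] = (right - k) / max(right - centre, 1)
    filterbank.append(filt)

# Applying it: mel_energy[m] = log(dot(filterbank[m], power_spectrum) + 1e-10)

# Low-frequency filters are narrow (fine resolution); high-frequency ones wide
widths = [bin_points[m + 1] - bin_points[m - 1] for m in range(1, n_mels + 1)]
print("Filter widths in STFT bins:", widths)
```

The printed widths grow from a handful of bins at low frequency to dozens near Nyquist, which is exactly the perceptual warping the mel scale encodes.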
The plot below shows the mel scale curve, making the transition from near-linear (below ~700 Hz) to logarithmic (above ~700 Hz) clearly visible.
import math, json, js

def hz_to_mel(f):
    return 2595 * math.log10(1 + f / 700)

# Generate the curve from 0 to 22,050 Hz
freqs = [i * 50 for i in range(442)]  # 0 to 22,050 Hz in 50 Hz steps
mels = [round(hz_to_mel(f), 1) for f in freqs]

# Mark key points
annotations = {
    85: "Low voice",
    300: "Child's voice",
    700: "Linear/log transition",
    4000: "Consonants",
    8000: "Nyquist (16 kHz)",
}

plot_data = [
    {
        "title": "The Mel Scale: Hz to Mel Mapping",
        "x_label": "Frequency (Hz)",
        "y_label": "Mel Value",
        "x_data": freqs,
        "lines": [
            {"label": "Mel scale", "data": mels, "color": "#8b5cf6"}
        ]
    }
]
js.window.py_plot_data = json.dumps(plot_data)

print("Below ~700 Hz: nearly linear (m roughly proportional to f)")
print("Above ~700 Hz: logarithmic (doubling f adds ~780 mel)")

table_rows = []
for f, label in sorted(annotations.items()):
    table_rows.append([f"{f:,} Hz", f"{hz_to_mel(f):.0f} mel", label])
js.window.py_table_data = json.dumps({
    "headers": ["Frequency", "Mel Value", "Description"],
    "rows": table_rows
})
MFCCs: The Pre-Deep-Learning Standard
Before deep learning took over, speech recognition systems needed a compact, fixed-size feature vector for each audio frame. The log-mel spectrogram was a good start — 80 numbers per frame, perceptually motivated — but its filterbank channels are correlated (adjacent mel filters overlap and capture similar energy), which caused problems for the statistical models of the era (particularly Gaussian Mixture Models, which assumed independent features). Mel-Frequency Cepstral Coefficients (MFCCs) solve this by applying one more transform to decorrelate the features.
The pipeline is: waveform $\rightarrow$ STFT $\rightarrow$ mel filterbank energies $\rightarrow$ log $\rightarrow$ Discrete Cosine Transform (DCT) $\rightarrow$ keep the first 12-13 coefficients. The DCT is similar in spirit to the DFT but operates on real-valued data and produces a set of cosine basis coefficients. The key property is that it packs most of the signal's energy into the first few coefficients. The low-order MFCCs capture the broad spectral shape (which vowel is being spoken, the overall timbre), while the high-order coefficients capture fine spectral detail that's usually noise for speech recognition purposes. By discarding everything above the 13th coefficient, we get a compact 13-dimensional vector per frame that captures the essential spectral envelope.
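As a sketch of the final DCT step (with made-up log-mel energies and only 10 channels for brevity), the DCT-II can be written directly:

```python
import math

# Hypothetical log-mel energies for one frame (10 channels for brevity;
# real systems apply the DCT to 26-80 mel channels)
log_mel = [2.1, 2.3, 2.8, 3.0, 2.6, 2.2, 1.9, 1.7, 1.6, 1.5]
M = len(log_mel)
n_mfcc = 5  # real recognisers typically keep 12-13

# DCT-II: project the log-mel vector onto cosine basis functions
mfcc = []
for c in range(n_mfcc):
    coeff = sum(log_mel[m] * math.cos(math.pi * c * (m + 0.5) / M)
                for m in range(M))
    mfcc.append(coeff)

# Coefficient 0 is the sum of the log energies (overall loudness);
# the low-order coefficients trace the broad spectral envelope
print("MFCCs:", [round(c, 4) for c in mfcc])
```

Truncating the coefficient list acts as a low-pass filter on the spectral envelope: keeping only the first few cosines smooths away fine spectral detail while preserving the broad shape.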
MFCCs dominated speech recognition for decades throughout the GMM-HMM era (Gaussian Mixture Model–Hidden Markov Model systems, roughly 1990–2012). Every major speech recogniser — from CMU Sphinx to Kaldi's early recipes — used MFCCs as the primary input feature. They're still relevant today: HuBERT (Hsu et al., 2021) uses MFCCs in its initial k-means clustering step to bootstrap pseudo-labels before the model has learned any representations of its own.
That said, modern deep learning systems have largely moved past MFCCs. The reason is straightforward: a neural network with enough capacity can learn better features from the data than any hand-crafted pipeline can produce. Log-mel spectrograms give the network a perceptually motivated starting point while preserving more information than MFCCs (80 channels vs 13 coefficients), and some architectures (wav2vec 2.0, HuBERT) skip the spectrogram entirely, learning directly from raw waveforms. The trend is clear: move the feature-extraction boundary deeper into the model and let gradient descent figure out the best representation.
The Pipeline: From Air to Model Input
Let's put the entire chain together. When you speak into a microphone and a machine learning model processes your words, here's what happens at each stage:
- Air pressure waves → a microphone converts pressure variations into an electrical voltage signal.
- Analog-to-digital conversion → an ADC samples the voltage at $f_s$ times per second (e.g. 16,000), producing a sequence of amplitude values.
- STFT → the sample sequence is chopped into overlapping 25 ms windows (with hop of 10 ms), each multiplied by a Hann window, and each transformed via FFT into a frequency spectrum.
- Magnitude spectrogram → the complex FFT outputs are converted to magnitudes, yielding a 2D time-frequency matrix.
- Mel filterbank → triangular filters spaced on the mel scale compress the frequency axis from $N/2 + 1$ linear bins to 80 (or 128) mel channels.
- Logarithm → log-compression reduces dynamic range and aligns with human loudness perception, producing the final log-mel spectrogram .
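The six stages above can be tallied with a quick shape check (using the standard 25 ms / 10 ms framing and no edge padding; libraries that centre-pad the signal get the round 100 frames per second figure):

```python
# Shape bookkeeping for one second of audio through the pipeline
fs = 16000    # stage 2: sampling rate
n_fft = 400   # stage 3: 25 ms window
hop = 160     # stage 3: 10 ms hop
n_mels = 80   # stage 5: mel channels (as in Whisper)

n_samples = fs * 1                         # raw samples per second
n_frames = 1 + (n_samples - n_fft) // hop  # STFT frames (no padding)
n_bins = n_fft // 2 + 1                    # frequency bins per frame

print(f"Raw waveform:          {n_samples} samples")
print(f"Magnitude spectrogram: {n_bins} bins x {n_frames} frames")
print(f"Log-mel spectrogram:   {n_mels} channels x {n_frames} frames")
```

The mel filterbank is where the real compression happens: the frequency axis shrinks from 201 bins to 80 channels while keeping the perceptually important structure.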
This is the mel spectrogram path, and it's what most current production systems use. Whisper, F5-TTS, and many speech emotion recognition models all start here. The signal processing is explicit, well-understood, and decades-proven. But there's a second modern path:
The raw waveform path skips most of the signal processing and feeds the raw sample sequence directly into a learned encoder. Models like wav2vec 2.0 (Baevski et al., 2020) and HuBERT (Hsu et al., 2021) use a convolutional feature encoder that takes raw 16 kHz waveforms and learns to extract whatever features the downstream task requires. Neural audio codecs like EnCodec (Défossez et al., 2022) similarly operate on raw waveforms, compressing them into discrete tokens. The advantage is that the model isn't constrained by the assumptions baked into mel filterbanks — it can discover features that humans wouldn't think to engineer.
Both paths ultimately serve the same goal: turn a high-dimensional, highly redundant raw signal into a compact representation that a transformer or other sequence model can process efficiently. The rest of this track covers what happens after that transformation: how models like Whisper encode speech for recognition (article 2), how self-supervised models learn audio representations without labels (article 3), how neural codecs discretise audio into tokens (article 4), how speech synthesis works (article 5), and how multimodal models combine audio with text and vision (articles 6–7).
Quiz
Test your understanding of audio signal processing fundamentals.
At a sampling rate of 16 kHz, what is the highest frequency that can be faithfully represented?
What does the magnitude $|X[k]|$ of a DFT bin represent?
Why does the mel scale use a logarithmic mapping above ~700 Hz?
Why have modern deep learning systems largely moved away from MFCCs in favour of log-mel spectrograms or raw waveforms?