Every Sound Is a Sum of Sine Waves

In 1807, Joseph Fourier made a claim so bold that the leading mathematicians of his time refused to believe it: any periodic signal can be decomposed into a sum of sine waves of different frequencies, amplitudes, and phases. A violin playing middle A isn't just a single vibration at 440 Hz — it's a fundamental tone at 440 Hz plus overtones at 880 Hz, 1320 Hz, and beyond, each with its own amplitude. Strip away the overtones and you get a sterile sine wave; add them back and the richness of the instrument returns. This insight is why Fourier analysis matters for audio: speech and music are complex sums of frequencies, and understanding those frequencies is the first step toward understanding the sound.

Let's make this concrete. We'll build a complex-looking waveform from just three sine waves. A sine wave is defined by three quantities: its frequency (how many cycles per second, in Hz), its amplitude (how tall the wave is), and its phase (where in its cycle it starts). The general formula is:

$$x(t) = A \sin(2\pi f t + \phi)$$

where $A$ is the amplitude, $f$ is the frequency in Hz, $t$ is time in seconds, and $\phi$ is the phase offset in radians. When $t = 0$, the wave starts at $A \sin(\phi)$. When $t = 1/f$, one complete cycle has elapsed and the argument has increased by $2\pi$. Doubling $f$ squeezes twice as many cycles into the same time interval.
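As a quick numerical check of these boundary cases (the values of $A$, $f$, and $\phi$ below are arbitrary):

```python
import math

A, f, phi = 1.5, 4.0, math.pi / 4  # arbitrary example values

def x(t):
    # x(t) = A * sin(2*pi*f*t + phi)
    return A * math.sin(2 * math.pi * f * t + phi)

print(round(x(0), 5))               # A*sin(phi): 1.06066
print(abs(x(1 / f) - x(0)) < 1e-9)  # one full period later, same value: True
```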

The code below creates three individual sine waves — at 3 Hz, 7 Hz, and 13 Hz — each with a different amplitude. It then adds them together sample-by-sample to produce a composite signal. Watch how the composite wave looks nothing like a simple sine wave, yet it's built entirely from three predictable components.

import math, json, js

# Parameters for three sine waves
freqs  = [3, 7, 13]        # Hz
amps   = [1.0, 0.6, 0.3]   # amplitudes
phases = [0, 0, 0]          # no phase offset for clarity

# Sample 1 second at 200 samples/sec (enough to see all waves clearly)
sr = 200
N = sr  # 200 samples = 1 second
t_vals = [i / sr for i in range(N)]

# Generate each sine wave and the composite
waves = []
for f, a, p in zip(freqs, amps, phases):
    wave = [a * math.sin(2 * math.pi * f * t + p) for t in t_vals]
    waves.append(wave)

composite = [sum(w[i] for w in waves) for i in range(N)]

# Build plot: 4 lines (3 individual + composite)
colors = ["#94a3b8", "#94a3b8", "#94a3b8", "#3b82f6"]
labels = [f"{freqs[i]} Hz (A={amps[i]})" for i in range(3)] + ["Composite"]

lines = []
for idx, (wave, color, label) in enumerate(zip(waves + [composite], colors, labels)):
    lines.append({
        "label": label,
        "data": [round(v, 4) for v in wave],
        "color": color
    })

plot_data = [{
    "title": "Three Sine Waves and Their Sum",
    "x_label": "Sample index",
    "y_label": "Amplitude",
    "x_data": list(range(N)),
    "lines": lines
}]
js.window.py_plot_data = json.dumps(plot_data)

print("Three sine waves at 3 Hz, 7 Hz, and 13 Hz are added together.")
print("The blue composite wave looks complex, but it is EXACTLY the sum of the three grey waves.")
print("Fourier analysis works in reverse: given only the blue wave, recover the three components.")

The composite signal looks irregular and unpredictable, yet it contains no information beyond three frequencies, three amplitudes, and three phases. Fourier's insight is that this decomposition works in reverse: given only the composite signal, we can recover which frequencies are present and how strong each one is. The tool that does this is the Discrete Fourier Transform.

The Discrete Fourier Transform (DFT)

Given $N$ equally spaced samples of a signal, the Discrete Fourier Transform (DFT) finds how much of each frequency is present. It does this by correlating the signal with a set of reference sine and cosine waves — one for each frequency bin $k$. If the signal contains energy at frequency $k$, the correlation is strong; if not, the correlation is near zero.

The DFT formula is:

$$X[k] = \sum_{n=0}^{N-1} x[n] \cdot e^{-i \, 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1$$

Let's unpack every symbol. $x[n]$ is the $n$-th sample of the input signal (a real number). $X[k]$ is the $k$-th output coefficient (a complex number). $N$ is the total number of samples. The exponential $e^{-i \, 2\pi k n / N}$ is Euler's formula in disguise — it equals $\cos(2\pi k n / N) - i \sin(2\pi k n / N)$, a rotating point on the unit circle in the complex plane. By multiplying each sample by this rotating reference and summing, we're computing the correlation between the input signal and a sinusoid at frequency $k$.

💡 The minus sign in the exponent ($e^{-i\ldots}$) is a convention. It means we're correlating with a clockwise-rotating complex exponential. The inverse DFT uses $e^{+i\ldots}$ and divides by $N$ to reconstruct the original signal. Some references swap the sign convention or place the $1/N$ factor differently — the math is equivalent as long as the forward and inverse transforms are consistent.
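The consistency of the two conventions is easy to sanity-check with a tiny forward/inverse round trip (a minimal sketch; the four sample values are arbitrary):

```python
import cmath, math

def dft(x):
    # Forward transform: e^{-i...}
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    # Inverse transform: e^{+i...} plus a 1/N factor
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

x = [0.5, 1.0, -0.3, 0.2]  # arbitrary test signal
x_rec = idft(dft(x))
err = max(abs(a - b) for a, b in zip(x, x_rec))
print(err < 1e-12)  # True: the round trip reproduces the original signal
```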

The output $X[k]$ is complex, and its two components tell us different things:

  • Magnitude $|X[k]|$: how much energy is present at frequency bin $k$. This is what we usually care about. A peak in the magnitude spectrum means the signal contains a strong component at that frequency.
  • Phase $\angle X[k]$: where in its cycle that frequency component starts. For most machine learning applications (speech recognition, music analysis), phase carries less perceptually important information, so models typically work with magnitudes only.

Let's check the boundary behaviour. When $k = 0$, the exponential becomes $e^0 = 1$ for all $n$, so $X[0] = \sum x[n]$ — the sum of all samples. This is the DC component (the average value of the signal, scaled by $N$). A signal centred around zero has $X[0] \approx 0$. When $k = N/2$ (if $N$ is even), the exponential alternates between $+1$ and $-1$, measuring the highest frequency representable in the signal — the Nyquist frequency. Frequencies above this cannot be distinguished from lower ones (a phenomenon called aliasing), which is why audio must be sampled at a rate at least twice the highest frequency of interest.
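Both boundary cases are easy to verify numerically (the four sample values below are arbitrary):

```python
import cmath, math

x = [0.3, -0.1, 0.7, 0.5]  # arbitrary test signal
N = len(x)

# k = 0: the exponential is 1 everywhere, so X[0] is the plain sum of the samples
X0 = sum(x[n] * cmath.exp(-2j * math.pi * 0 * n / N) for n in range(N))
print(abs(X0 - sum(x)) < 1e-12)  # True

# k = N/2: the exponential alternates +1, -1, +1, -1, ...
Xny = sum(x[n] * cmath.exp(-2j * math.pi * (N // 2) * n / N) for n in range(N))
alt = sum(v * (-1) ** n for n, v in enumerate(x))
print(abs(Xny - alt) < 1e-12)  # True
```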

Now let's implement the DFT from scratch using only Python's built-in cmath module for complex arithmetic. We'll apply it to our three-frequency composite signal and see if the DFT correctly identifies the three peaks.

import math, cmath, json, js

# Build the same composite signal: 3 Hz + 7 Hz + 13 Hz
freqs  = [3, 7, 13]
amps   = [1.0, 0.6, 0.3]
sr = 64          # samples per second
N  = sr          # 1 second of data = 64 samples
t_vals = [n / sr for n in range(N)]
signal = [0.0] * N
for f, a in zip(freqs, amps):
    for n in range(N):
        signal[n] += a * math.sin(2 * math.pi * f * t_vals[n])

# ---- DFT from scratch ----
def dft(x):
    N = len(x)
    X = []
    for k in range(N):
        s = 0 + 0j
        for n in range(N):
            angle = -2 * math.pi * k * n / N
            s += x[n] * cmath.exp(1j * angle)
        X.append(s)
    return X

X = dft(signal)

# Magnitude spectrum (only first N/2 bins are meaningful for real signals)
half = N // 2
magnitudes = [abs(X[k]) / (N / 2) for k in range(half)]
freq_bins  = [k * sr / N for k in range(half)]

# Plot the magnitude spectrum
lines = [{
    "label": "DFT Magnitude",
    "data": [round(m, 4) for m in magnitudes],
    "color": "#3b82f6"
}]

plot_data = [{
    "title": "DFT Magnitude Spectrum",
    "x_label": "Frequency (Hz)",
    "y_label": "Amplitude",
    "x_data": [round(f, 1) for f in freq_bins],
    "lines": lines,
    "y_min": 0
}]
js.window.py_plot_data = json.dumps(plot_data)

print("DFT of composite signal (3 Hz + 7 Hz + 13 Hz):")
for f, a in zip(freqs, amps):
    k = int(f * N / sr)
    measured = abs(X[k]) / (N / 2)
    print(f"  Bin k={k} => freq={f} Hz, expected amplitude={a}, measured={measured:.4f}")

The magnitude spectrum shows three sharp peaks at exactly 3 Hz, 7 Hz, and 13 Hz — the three frequencies we put in. The heights match the amplitudes we used (1.0, 0.6, 0.3). The DFT has perfectly decomposed our composite signal into its constituent sine waves, just as Fourier predicted. Every other frequency bin is essentially zero, confirming that no other frequencies are present.

The Fast Fourier Transform (FFT)

Our DFT implementation works, but look at its structure: for each of the $N$ output bins, we loop over all $N$ input samples. That's $N \times N = N^2$ complex multiplications. For roughly one second of 16 kHz audio padded to the next power of two ($N = 16{,}384$ samples), that's about 268 million operations — far too slow for real-time processing.

In 1965, James Cooley and John Tukey published an algorithm (Cooley & Tukey, 1965) that computes exactly the same result in $O(N \log N)$ time. The idea: split the $N$-point DFT into two $N/2$-point DFTs — one over the even-indexed samples, one over the odd-indexed samples — and combine the results. Each half can be split again, recursively, until we reach trivial 1-point DFTs. This is the Fast Fourier Transform (FFT), and it's one of the most important algorithms in all of computing.

The key mathematical identity that makes this split possible is:

$$X[k] = E[k] + e^{-i \, 2\pi k / N} \cdot O[k]$$
$$X[k + N/2] = E[k] - e^{-i \, 2\pi k / N} \cdot O[k]$$

where $E[k]$ is the DFT of the even-indexed samples $x[0], x[2], x[4], \ldots$ and $O[k]$ is the DFT of the odd-indexed samples $x[1], x[3], x[5], \ldots$. The factor $e^{-i \, 2\pi k / N}$ is called a twiddle factor. Notice that both equations use the same $E[k]$ and $O[k]$ — we compute each half-size DFT once and reuse it twice (once with addition, once with subtraction). This reuse is what cuts the work in half at each level of recursion.

💡 The FFT requires $N$ to be a power of 2 for the simplest version (radix-2). In practice, audio frames are always chosen to be powers of 2 (256, 512, 1024, 2048) for this reason. Mixed-radix FFTs handle other sizes, but powers of 2 are fastest.
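The two butterfly equations translate almost directly into code. Below is a minimal radix-2 sketch (nothing like the heavily optimised FFT libraries used in practice), checked against the naive DFT:

```python
import cmath, math

def fft(x):
    # Radix-2 Cooley-Tukey; assumes len(x) is a power of 2
    N = len(x)
    if N == 1:
        return list(x)
    E = fft(x[0::2])  # DFT of even-indexed samples
    O = fft(x[1::2])  # DFT of odd-indexed samples
    X = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * math.pi * k / N) * O[k]  # twiddle factor times O[k]
        X[k] = E[k] + t           # X[k]       = E[k] + twiddle * O[k]
        X[k + N // 2] = E[k] - t  # X[k + N/2] = E[k] - twiddle * O[k]
    return X

def dft(x):
    # Naive O(N^2) reference implementation
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

sig = [math.sin(2 * math.pi * 3 * n / 16) for n in range(16)]
err = max(abs(a - b) for a, b in zip(fft(sig), dft(sig)))
print(err < 1e-9)  # True: same result, far fewer operations
```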

How much does this save? At each recursion level, we do $O(N)$ work (the combine step), and we recurse $\log_2 N$ levels deep. Total: $O(N \log N)$. The table below compares the operation counts at different signal lengths:

import math, json, js

rows = []
for exp in [6, 8, 10, 12, 14, 16]:
    N = 2 ** exp
    dft_ops = N * N
    fft_ops = N * exp
    speedup = dft_ops / fft_ops
    ctx = ""
    if exp == 10:
        ctx = "~64 ms at 16 kHz"
    elif exp == 14:
        ctx = "~1 sec at 16 kHz"
    elif exp == 16:
        ctx = "~4 sec at 16 kHz"
    rows.append([
        f"{N:,}",
        ctx,
        f"{dft_ops:,}",
        f"{fft_ops:,}",
        f"{speedup:,.0f}x"
    ])

js.window.py_table_data = json.dumps({
    "headers": ["N (samples)", "Duration", "DFT ops (N\u00b2)", "FFT ops (N log N)", "Speedup"],
    "rows": rows
})

print("DFT vs FFT operation counts")
print(f"At N = 16,384 (1 sec of 16 kHz audio):")
print(f"  DFT: {16384**2:,} operations")
print(f"  FFT: {16384 * 14:,} operations")
print(f"  Speedup: {16384 // 14:,}x")

At $N = 16{,}384$ (one second of 16 kHz audio), the FFT is over 1,000 times faster than the DFT. At $N = 65{,}536$ (four seconds), it's over 4,000 times faster. Without the FFT, real-time audio analysis would be computationally infeasible. With it, a modern CPU can process hundreds of FFT frames per second — fast enough for live speech recognition, music analysis, and audio effects.

The Short-Time Fourier Transform (STFT)

The FFT tells us which frequencies are present in a signal, but it gives us one answer for the entire signal. That's a problem for audio, because audio changes over time — a spoken sentence starts with one set of frequencies, transitions through others, and ends with different ones entirely. If we FFT the whole sentence at once, we get the average frequency content but lose all information about when each frequency was present.

The solution is simple: chop the signal into short, overlapping frames (typically 25 ms each, sliding forward by 10 ms at a time), and apply the FFT to each frame independently. This is the Short-Time Fourier Transform (STFT). Each frame is short enough that the signal is approximately constant within it, so the FFT of that frame gives a meaningful snapshot of the frequency content at that moment in time.

But there's a fundamental tradeoff lurking here, often called the time-frequency uncertainty principle:

$$\Delta t \cdot \Delta f \geq \frac{1}{4\pi}$$

where $\Delta t$ is the time resolution (how precisely we can localise an event in time) and $\Delta f$ is the frequency resolution (how precisely we can distinguish two nearby frequencies). This inequality says we cannot have perfect resolution in both simultaneously. Let's check the boundaries:

  • Long window ($\Delta t$ large): more samples per FFT means finer frequency bins ($\Delta f$ small), so we can distinguish closely spaced frequencies. But we lose the ability to pinpoint when something happened — the event could be anywhere within the long window.
  • Short window ($\Delta t$ small): we know precisely when something happened, but fewer samples per FFT means coarser frequency bins ($\Delta f$ large), so nearby frequencies blur together.
  • Infinitely long window: $\Delta f \to 0$ (perfect frequency resolution), but $\Delta t \to \infty$ (no time information at all). This is just the regular FFT of the entire signal.
  • Infinitely short window (single sample): $\Delta t \to 0$ (perfect time resolution), but $\Delta f \to \infty$ (no frequency information). We just have the raw waveform.

The standard 25 ms window at 16 kHz gives $N = 400$ samples per frame, which means a frequency resolution of $16{,}000 / 400 = 40$ Hz. That's fine for speech (formant frequencies are hundreds of Hz apart) but too coarse for precise musical pitch detection (adjacent piano keys differ by only a few Hz in the bass range).
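The same arithmetic, $\Delta f = \mathrm{sr} / N$, can be tabulated for other window lengths (a quick calculation, assuming a 16 kHz sample rate):

```python
sr = 16_000  # sample rate in Hz (assumed)

for win_ms in [10, 25, 50, 100]:
    N = sr * win_ms // 1000  # samples per window
    df = sr / N              # frequency resolution (bin spacing) in Hz
    print(f"{win_ms:>4} ms window -> N = {N:>5} samples, freq resolution = {df:.1f} Hz")
```

Doubling the window length halves the bin spacing, but also doubles the time span each spectrum averages over.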

There's a second subtlety: we can't just chop the signal with a rectangle. When we extract a frame by multiplying the signal with a rectangular window (1 inside the frame, 0 outside), the sharp edges at the start and end of the frame act as artificial discontinuities. The FFT interprets these discontinuities as high-frequency content that isn't actually in the signal — a phenomenon called spectral leakage. Energy that should be concentrated in a single frequency bin "leaks" into neighbouring bins, smearing the spectrum.

The fix is to use a smooth window function that tapers gently to zero at the edges. The most common choice is the Hann window:

$$w[n] = 0.5 \left(1 - \cos\!\left(\frac{2\pi n}{N - 1}\right)\right), \quad n = 0, 1, \ldots, N-1$$

At the boundaries: when $n = 0$, $\cos(0) = 1$, so $w[0] = 0.5(1 - 1) = 0$. When $n = (N-1)/2$ (the midpoint), $\cos(\pi) = -1$, so $w = 0.5(1 - (-1)) = 1$. When $n = N-1$, $\cos(2\pi) = 1$, so $w[N-1] = 0$. The window smoothly rises from 0 to 1 and back to 0, resembling a bell curve. Multiplying the signal by this window before the FFT eliminates the sharp edges and dramatically reduces spectral leakage.
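A quick numerical check of these three boundary values (using an odd $N$ here so the midpoint $(N-1)/2$ lands exactly on a sample):

```python
import math

N = 129  # odd, so (N - 1) / 2 = 64 is an integer sample index
w = [0.5 * (1 - math.cos(2 * math.pi * n / (N - 1))) for n in range(N)]

print(w[0], w[(N - 1) // 2], w[N - 1])  # 0.0 1.0 0.0
```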

The Gaussian window serves a similar purpose but follows a Gaussian (bell-curve) shape. For a deeper exploration of the Gaussian distribution and its properties, see the probability distributions article.

The plot below shows the Hann window shape, a raw signal frame, and the same frame after windowing. Notice how the windowed signal smoothly decays to zero at the edges, eliminating the sharp jumps that cause spectral leakage.

import math, json, js

N = 128  # samples in one frame

# Hann window
hann = [0.5 * (1 - math.cos(2 * math.pi * n / (N - 1))) for n in range(N)]

# Create a signal frame: 5 Hz + 12 Hz sine waves
sr = 128
signal = [0.8 * math.sin(2 * math.pi * 5 * n / sr) + 0.5 * math.sin(2 * math.pi * 12 * n / sr) for n in range(N)]

# Windowed signal
windowed = [signal[n] * hann[n] for n in range(N)]

x_data = list(range(N))

plot_data = [{
    "title": "Hann Window and Its Effect on a Signal Frame",
    "x_label": "Sample index",
    "y_label": "Amplitude",
    "x_data": x_data,
    "lines": [
        {"label": "Hann window", "data": [round(v, 4) for v in hann], "color": "#f59e0b"},
        {"label": "Raw signal", "data": [round(v, 4) for v in signal], "color": "#94a3b8"},
        {"label": "Windowed signal", "data": [round(v, 4) for v in windowed], "color": "#3b82f6"}
    ]
}]
js.window.py_plot_data = json.dumps(plot_data)

print("The yellow Hann window tapers smoothly from 0 to 1 and back to 0.")
print("The grey raw signal has abrupt edges at sample 0 and sample 127.")
print("The blue windowed signal (raw * Hann) decays to zero at both edges.")
print("This eliminates spectral leakage caused by the rectangular window's sharp cuts.")

Let's see the impact on the frequency spectrum directly. We'll take the same signal frame, compute the DFT with and without the Hann window, and compare the magnitude spectra. The unwindowed version shows energy leaking across many bins, while the windowed version concentrates the energy into sharp peaks.

import math, cmath, json, js

N = 128
sr = 128

# Signal: 5 Hz + 12 Hz
signal = [0.8 * math.sin(2 * math.pi * 5 * n / sr) + 0.5 * math.sin(2 * math.pi * 12 * n / sr) for n in range(N)]

# Hann window
hann = [0.5 * (1 - math.cos(2 * math.pi * n / (N - 1))) for n in range(N)]
windowed = [signal[n] * hann[n] for n in range(N)]

# DFT helper
def dft(x):
    M = len(x)
    result = []
    for k in range(M):
        s = 0 + 0j
        for n in range(M):
            s += x[n] * cmath.exp(-2j * math.pi * k * n / M)
        result.append(s)
    return result

X_raw = dft(signal)
X_win = dft(windowed)

half = N // 2
# Normalise magnitudes
mag_raw = [abs(X_raw[k]) / (N / 2) for k in range(half)]
mag_win = [abs(X_win[k]) / (N / 2) * 2 for k in range(half)]  # x2 to compensate Hann amplitude loss
freq_bins = [k * sr / N for k in range(half)]

plot_data = [{
    "title": "Spectral Leakage: Rectangle vs Hann Window",
    "x_label": "Frequency (Hz)",
    "y_label": "Magnitude",
    "x_data": [round(f, 1) for f in freq_bins],
    "y_min": 0,
    "lines": [
        {"label": "No window (rectangle)", "data": [round(m, 4) for m in mag_raw], "color": "#ef4444"},
        {"label": "Hann window", "data": [round(m, 4) for m in mag_win], "color": "#3b82f6"}
    ]
}]
js.window.py_plot_data = json.dumps(plot_data)

print("Red (no window): energy leaks from the 5 Hz and 12 Hz peaks into neighbouring bins.")
print("Blue (Hann window): peaks are sharper and the noise floor is much lower.")
print("The Hann window reduces spectral leakage at the cost of slightly wider main peaks.")

💡 Why do overlapping frames work? Adjacent frames share most of their samples (with 25 ms windows and 10 ms hop, neighbouring frames overlap by 60%). Because the Hann window attenuates the edges of each frame, the overlap ensures that every part of the signal is at full weight in at least one frame. This is called the Constant Overlap-Add (COLA) property, and it guarantees that the original signal can be perfectly reconstructed from the STFT if needed.
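The COLA property can be checked numerically. The sketch below uses the "periodic" Hann variant (denominator $N$ rather than $N-1$), which at 50% overlap sums to exactly 1 everywhere away from the signal edges:

```python
import math

frame_len = 64
hop = frame_len // 2  # 50% overlap

# Periodic Hann: denominator N (not N - 1), the variant that satisfies COLA exactly
hann = [0.5 * (1 - math.cos(2 * math.pi * n / frame_len)) for n in range(frame_len)]

# Overlap-add the window itself across a longer stretch of samples
total = [0.0] * 256
for start in range(0, len(total) - frame_len + 1, hop):
    for i in range(frame_len):
        total[start + i] += hann[i]

interior = total[frame_len:-frame_len]  # skip edge regions covered by fewer frames
print(min(round(v, 9) for v in interior), max(round(v, 9) for v in interior))  # 1.0 1.0
```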

Reading a Spectrogram

When we stack the magnitude spectra of all STFT frames side by side, we get a 2D matrix called a spectrogram. The x-axis is time (one column per frame), the y-axis is frequency (one row per frequency bin), and the value at each cell is the magnitude (or its logarithm) of that frequency at that moment. Rendered as an image, brighter pixels mean more energy.

Here's how to read the three main visual patterns:

  • Horizontal bands: a sustained tone at a constant frequency. A violin holding a note produces a bright horizontal line at the fundamental frequency, with fainter lines above it at the harmonics.
  • Vertical stripes: a broadband transient — energy at many frequencies simultaneously but briefly. Consonants like "t" and "k" in speech, or a drum hit in music, appear as thin vertical columns.
  • Formants: in speech, the resonance frequencies of the vocal tract create broad horizontal bands called formants. The spacing and movement of formants is what distinguishes vowels — "ee" has a different formant pattern than "ah" — and tracking formants over time reveals the structure of spoken language.

For machine learning, the spectrogram is revolutionary because it turns a 1D audio signal into a 2D representation — essentially an image of sound. And we know how to process images with CNNs and transformers. This is why the mel spectrogram (a perceptually weighted version covered in article 1) became the standard input for speech models.

Let's build a spectrogram from scratch. We'll create a synthetic signal where frequencies change over time — a 5 Hz tone that's present in the first half, a 15 Hz tone in the second half, and a 10 Hz tone throughout. The spectrogram should clearly show these patterns.

import math, cmath, json, js

# Synthetic signal: 3 tones that turn on/off at different times
sr = 128      # samples per second
duration = 2  # seconds
N_total = sr * duration  # 256 samples

signal = [0.0] * N_total
for n in range(N_total):
    t = n / sr
    # 10 Hz present throughout
    signal[n] += 0.7 * math.sin(2 * math.pi * 10 * t)
    # 5 Hz present only in first half (0-1 sec)
    if t < 1.0:
        signal[n] += 1.0 * math.sin(2 * math.pi * 5 * t)
    # 15 Hz present only in second half (1-2 sec)
    if t >= 1.0:
        signal[n] += 0.8 * math.sin(2 * math.pi * 15 * t)

# STFT parameters
frame_len = 64
hop = 32  # 50% overlap
n_frames = (N_total - frame_len) // hop + 1

# Hann window
hann = [0.5 * (1 - math.cos(2 * math.pi * i / (frame_len - 1))) for i in range(frame_len)]

# Compute STFT
half = frame_len // 2
spectrogram = []
for f_idx in range(n_frames):
    start = f_idx * hop
    frame = [signal[start + i] * hann[i] for i in range(frame_len)]
    # DFT of frame
    mags = []
    for k in range(half):
        s = 0 + 0j
        for n in range(frame_len):
            s += frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
        mags.append(abs(s) / (frame_len / 2) * 2)  # compensate Hann
    spectrogram.append(mags)

# Display as table: rows = selected frequency bins, cols = time frames
freq_bins = [k * sr / frame_len for k in range(half)]

# Select key frequencies to display
key_freqs = [0, 3, 5, 7, 10, 12, 15, 17, 20, 25, 30]
key_indices = []
for target in key_freqs:
    best_k = min(range(half), key=lambda k: abs(freq_bins[k] - target))
    if best_k not in [ki for ki, _ in key_indices]:
        key_indices.append((best_k, freq_bins[best_k]))

# Build table: rows are frequencies (high to low), columns are time frames
headers = ["Freq (Hz)"] + [f"t={f_idx * hop / sr:.2f}s" for f_idx in range(n_frames)]
rows = []
for k, freq in reversed(key_indices):
    row = [f"{freq:.0f} Hz"]
    for f_idx in range(n_frames):
        mag = spectrogram[f_idx][k]
        # Use symbols for magnitude
        if mag > 0.6:
            row.append(f"{mag:.2f} ***")
        elif mag > 0.2:
            row.append(f"{mag:.2f} *")
        else:
            row.append(f"{mag:.2f}")
    rows.append(row)

js.window.py_table_data = json.dumps({
    "headers": headers,
    "rows": rows
})

print("Spectrogram of a synthetic signal (magnitudes at each time-frequency cell)")
print()
print("*** = strong energy, * = moderate, no symbol = weak")
print()
print("Expected pattern:")
print("  5 Hz:  strong in first half (t < 1.0), gone in second half")
print("  10 Hz: strong throughout (present the entire 2 seconds)")
print("  15 Hz: gone in first half, strong in second half (t >= 1.0)")

Now let's visualise the same spectrogram as a plot. We'll display the magnitude over time for each of the three key frequency bins, so you can see the on/off patterns clearly.

import math, cmath, json, js

# Rebuild the same signal and STFT
sr = 128
duration = 2
N_total = sr * duration
signal = [0.0] * N_total
for n in range(N_total):
    t = n / sr
    signal[n] += 0.7 * math.sin(2 * math.pi * 10 * t)
    if t < 1.0:
        signal[n] += 1.0 * math.sin(2 * math.pi * 5 * t)
    if t >= 1.0:
        signal[n] += 0.8 * math.sin(2 * math.pi * 15 * t)

frame_len = 64
hop = 32
n_frames = (N_total - frame_len) // hop + 1
hann = [0.5 * (1 - math.cos(2 * math.pi * i / (frame_len - 1))) for i in range(frame_len)]
half = frame_len // 2

spectrogram = []
for f_idx in range(n_frames):
    start = f_idx * hop
    frame = [signal[start + i] * hann[i] for i in range(frame_len)]
    mags = []
    for k in range(half):
        s = 0 + 0j
        for n in range(frame_len):
            s += frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
        mags.append(abs(s) / (frame_len / 2) * 2)
    spectrogram.append(mags)

freq_bins = [k * sr / frame_len for k in range(half)]
time_vals = [round(f_idx * hop / sr, 2) for f_idx in range(n_frames)]

# Find bins closest to 5, 10, 15 Hz
targets = [5, 10, 15]
colors = ["#f59e0b", "#3b82f6", "#10b981"]
lines = []
for target, color in zip(targets, colors):
    best_k = min(range(half), key=lambda k: abs(freq_bins[k] - target))
    mags_over_time = [round(spectrogram[f_idx][best_k], 3) for f_idx in range(n_frames)]
    lines.append({
        "label": f"{freq_bins[best_k]:.0f} Hz bin",
        "data": mags_over_time,
        "color": color
    })

plot_data = [{
    "title": "Spectrogram: Energy at 5, 10, and 15 Hz Over Time",
    "x_label": "Time (seconds)",
    "y_label": "Magnitude",
    "x_data": time_vals,
    "y_min": 0,
    "lines": lines
}]
js.window.py_plot_data = json.dumps(plot_data)

print("Yellow (5 Hz): strong in the first second, drops to zero in the second.")
print("Blue (10 Hz): steady throughout — this tone is always present.")
print("Green (15 Hz): zero in the first second, rises in the second.")
print()
print("This is exactly what a spectrogram encodes: which frequencies are active at each point in time.")

This is the power of the STFT spectrogram: it reveals the time-varying frequency structure of a signal. A single FFT would have shown peaks at 5, 10, and 15 Hz with no indication of when each was active. The spectrogram preserves both dimensions of information, and that's what makes it useful as input to audio models.

Why This Matters for Deep Learning

The Fourier transform gave us the first principled way to represent audio for machine learning. Instead of feeding raw waveforms (which are just a sequence of pressure measurements over time and carry no explicit frequency information), we can feed a spectrogram — a 2D time-frequency image — that makes the signal's structure legible to a model. Here's how the major audio architectures use this:

  • Whisper (Radford et al., 2022): converts raw audio to an 80-channel log-mel spectrogram using the STFT + mel filterbank, then processes it with a transformer encoder-decoder. The mel spectrogram is a fixed, non-learned preprocessing step — all the learning happens downstream.
  • wav2vec 2.0 (Baevski et al., 2020): takes the raw waveform as input, passes it through a CNN feature encoder, then a transformer. The CNN layers implicitly learn a representation similar to a spectrogram — the first convolutional layer often learns filters that resemble windowed sinusoids, effectively rediscovering the Fourier transform from data.
  • Neural codecs like EnCodec (Défossez et al., 2022): encode raw waveforms into discrete tokens using a learned encoder-quantiser-decoder architecture. The encoder learns a compressed representation (not necessarily frequency-based) that captures the perceptually important structure of audio in far fewer bits than the raw signal.

The mel spectrogram remains the most common input format for audio models because it combines three desirable properties: it's grounded in well-understood signal processing (the Fourier transform), it's perceptually motivated (mel scaling matches human hearing), and it's compact (80 mel channels vs. hundreds of raw FFT bins). Models that start from raw waveforms must learn these properties from data, which requires more computation and more training examples.
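For reference, the widely used HTK-style mel scale maps a frequency $f$ in Hz to $m = 2595 \log_{10}(1 + f/700)$ mels — roughly linear below 1 kHz and logarithmic above, mirroring how pitch perception compresses high frequencies:

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale: near-linear below ~1 kHz, logarithmic above
    return 2595.0 * math.log10(1.0 + f / 700.0)

for f in [100, 500, 1000, 4000, 8000]:
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```

Mel filterbanks space their bands evenly on this axis, which is why 80 mel channels can summarise hundreds of linearly spaced FFT bins without losing much perceptually relevant detail.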

To summarise the pipeline from sound to model input:

import json, js

rows = [
    ["1. Raw waveform", "1D array of pressure samples", "x[n], n = 0..N-1"],
    ["2. Framing", "Chop into overlapping 25 ms frames", "Frames of 400 samples (at 16 kHz)"],
    ["3. Windowing", "Multiply each frame by Hann window", "Smooth edges to zero"],
    ["4. FFT", "Frequency spectrum per frame", "Complex-valued X[k] per frame"],
    ["5. Magnitude", "|X[k]| gives energy at each freq", "Discard phase"],
    ["6. Mel filterbank", "Group FFT bins into mel-spaced bands", "80 mel channels (perceptual scale)"],
    ["7. Log", "log(mel energies)", "Compress dynamic range"],
    ["8. Model input", "2D matrix: time x mel channels", "Looks like an image of sound"],
]

js.window.py_table_data = json.dumps({
    "headers": ["Step", "What happens", "Result"],
    "rows": rows
})

print("Audio preprocessing pipeline: from raw waveform to mel spectrogram")
print("Each step builds on the Fourier transform concepts from this article.")

In the next article, we'll see how models like Whisper and wav2vec 2.0 learn audio representations from data — building on the spectral foundation we've laid here to achieve state-of-the-art speech recognition, translation, and beyond.

Quiz

Test your understanding of the Fourier transform and its role in audio processing.

What does the magnitude $|X[k]|$ of the DFT output at bin $k$ represent?

Why is the Hann window applied to each frame before computing the FFT?

A longer STFT window gives better frequency resolution but worse time resolution. What is this tradeoff called?

The FFT computes the same result as the DFT but in $O(N \log N)$ instead of $O(N^2)$. What is the core idea that makes this speedup possible?