Browser-based audio analysis tool for motion design sync workflows
Audio Cue Mapper is a browser-based tool that analyses an audio file and produces a structured set of visual cues for use in After Effects. Drop in any standard audio or video file. The tool performs all signal processing locally in the browser, renders an interactive waveform and tiered cue map, and exports an After Effects ExtendScript file that populates the composition timeline with automatically generated marker layers.
The tool was built to support The Dream Team — a pixel-art animation synced to a music track by Fievel Is Glauque — where existing beat-detection tools were unsuitable for the loose, live-recorded character of the source material. No data is uploaded anywhere. No server is involved at any stage.
The entire application is one HTML file. All analysis, rendering, and export logic runs client-side. There are no JavaScript dependencies, no build step, no npm, no framework, and no backend; the only external resources are the two web fonts loaded from Google Fonts.
| Layer | Detail |
|---|---|
| Hosting | GitHub Pages — static file, zero infrastructure |
| Embedding | Cargo 3 iframe with postMessage resize bridge |
| Runtime | Vanilla JS, Web Audio API, Canvas 2D API |
| Fonts | IBM Plex Mono + DM Serif Display via Google Fonts |
| Dependencies | None |
Because Cargo 3 has no server-side scripting and limited <head> access, the tool is hosted externally and embedded. After analysis completes, the tool calls window.parent.postMessage() with the new document height. The parent page listens and resizes the iframe accordingly, preventing content clipping.
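The resize bridge described above can be sketched as two small functions, one for each side of the iframe boundary. This is a minimal illustration, not the tool's actual code: the message shape `{ type: 'acm-resize', height }` and the function names are assumptions, and a production listener would normally also check `event.origin` before trusting a message.

```javascript
// Child side (inside the iframe): report the document height to the parent.
// The message shape { type, height } is an assumption made for this sketch.
function reportHeight(parentWindow, height) {
  parentWindow.postMessage({ type: 'acm-resize', height: height }, '*');
}

// Parent side: apply a received height to the iframe element.
// Returns true when the message was recognised and applied.
function applyResize(iframe, event) {
  const data = event.data;
  if (!data || data.type !== 'acm-resize' || typeof data.height !== 'number') {
    return false; // ignore unrelated messages
  }
  iframe.style.height = Math.ceil(data.height) + 'px';
  return true;
}

// Browser wiring (shown as comments so the sketch also loads outside a browser):
//   In the tool:      reportHeight(window.parent, document.documentElement.scrollHeight);
//   In the host page: window.addEventListener('message', (e) => applyResize(iframeEl, e));
```

Passing the target window and iframe in as arguments keeps both halves free of globals, which makes them easy to exercise with plain mock objects.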
When a file is dropped or selected, the browser reads it as an ArrayBuffer and passes it through seven sequential stages. All computation runs on the main thread, broken into yielding chunks via setTimeout(r, 0) so the UI remains responsive during analysis.
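A minimal sketch of the yielding pattern, assuming the work can be expressed as a per-frame callback. The names `runChunked`, `chunkSize`, and `onProgress` are illustrative and do not come from the tool's source.

```javascript
// Yield to the event loop so pending UI work (painting, input) can run.
function yieldToUI() {
  return new Promise((r) => setTimeout(r, 0));
}

// Call processFrame(i) for i in [0, total), yielding after every chunk.
// chunkSize and onProgress are hypothetical parameters for this sketch.
async function runChunked(total, processFrame, chunkSize = 256, onProgress = () => {}) {
  for (let start = 0; start < total; start += chunkSize) {
    const end = Math.min(start + chunkSize, total);
    for (let i = start; i < end; i++) processFrame(i);
    onProgress(end / total); // fraction complete, e.g. for a progress bar
    await yieldToUI();
  }
}
```

A larger chunk size reduces scheduling overhead, while a smaller one keeps the interface more responsive; the value would need tuning against real frame costs.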
The raw ArrayBuffer is decoded using the Web Audio API's decodeAudioData(), which decodes MP3, WAV, FLAC, OGG, AAC, and M4A natively in most modern browsers, although exact codec support varies by browser. The decoded AudioBuffer is immediately mixed to mono by averaging the left and right channel sample arrays. Mono mixing halves the processing load for stereo input and is standard practice for onset detection, which does not benefit from stereo information.
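The mono mix can be sketched as a plain function over channel arrays. In the browser the inputs would come from `audioBuffer.getChannelData(c)`; the function name `mixToMono` is an assumption, and the sketch generalises the left/right average to any channel count.

```javascript
// Average equal-length channels into a single mono Float32Array.
function mixToMono(channels) {
  if (channels.length === 1) return channels[0]; // already mono
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (const channel of channels) {
    for (let i = 0; i < length; i++) mono[i] += channel[i];
  }
  const scale = 1 / channels.length;
  for (let i = 0; i < length; i++) mono[i] *= scale;
  return mono;
}
```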
The mono sample array (roughly 4 million samples for a 90-second track at 44.1 kHz) is downsampled to 1,200 min/max pairs for display. Each pair captures the peak positive and peak negative sample within a window, producing the filled waveform shape characteristic of audio editors. This representation is visual only and plays no part in the analysis.
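A minimal sketch of the min/max reduction, assuming equal-width windows. The function name and the returned `{ mins, maxs }` shape are illustrative.

```javascript
// Reduce a sample array to `buckets` min/max pairs for drawing.
function downsampleMinMax(samples, buckets = 1200) {
  const mins = new Float32Array(buckets);
  const maxs = new Float32Array(buckets);
  const step = samples.length / buckets;
  for (let b = 0; b < buckets; b++) {
    const start = Math.floor(b * step);
    // Guarantee at least one sample per bucket when the input is short.
    const end = Math.max(start + 1, Math.floor((b + 1) * step));
    let lo = Infinity;
    let hi = -Infinity;
    for (let i = start; i < end && i < samples.length; i++) {
      if (samples[i] < lo) lo = samples[i];
      if (samples[i] > hi) hi = samples[i];
    }
    mins[b] = lo === Infinity ? 0 : lo;   // empty bucket falls back to silence
    maxs[b] = hi === -Infinity ? 0 : hi;
  }
  return { mins, maxs };
}
```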
Onset detection is the core of the analysis. The algorithm steps through the audio in overlapping frames:
| Parameter | Value |
|---|---|
| Frame size | 2048 samples |
| Hop size | 512 samples (~11.6ms at 44.1kHz) |
| Window function | Hanning — reduces spectral leakage at frame edges |
| Transform | Cooley–Tukey radix-2 FFT, implemented from scratch in JS |
For each frame the algorithm computes the positive difference between the current magnitude spectrum and the previous frame's spectrum — this is spectral flux. A spike in flux indicates new frequency energy entering the signal: a drum hit, a chord onset, a melodic phrase beginning. The flux values form a continuous time series across the track duration.
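The framing, windowing, transform, and flux steps can be sketched together as follows. This is an illustrative implementation under the parameters in the table above, not the tool's own code: the function names are assumptions, the twiddle factors are recomputed per butterfly for brevity, and the first frame's flux is set to zero because it has no predecessor.

```javascript
// In-place iterative radix-2 Cooley-Tukey FFT. re.length must be a power of two.
function fft(re, im) {
  const n = re.length;
  // Bit-reversal permutation.
  for (let i = 1, j = 0; i < n; i++) {
    let bit = n >> 1;
    for (; j & bit; bit >>= 1) j ^= bit;
    j ^= bit;
    if (i < j) {
      [re[i], re[j]] = [re[j], re[i]];
      [im[i], im[j]] = [im[j], im[i]];
    }
  }
  // Butterfly passes over sub-transforms of length 2, 4, 8, ...
  for (let len = 2; len <= n; len <<= 1) {
    const ang = (-2 * Math.PI) / len;
    for (let i = 0; i < n; i += len) {
      for (let k = 0; k < len / 2; k++) {
        const wr = Math.cos(ang * k);
        const wi = Math.sin(ang * k);
        const a = i + k;
        const b = a + len / 2;
        const tr = re[b] * wr - im[b] * wi;
        const ti = re[b] * wi + im[b] * wr;
        re[b] = re[a] - tr;
        im[b] = im[a] - ti;
        re[a] += tr;
        im[a] += ti;
      }
    }
  }
}

// Spectral flux: sum of positive magnitude differences between consecutive frames.
function spectralFlux(samples, frameSize = 2048, hop = 512) {
  const hann = new Float64Array(frameSize);
  for (let i = 0; i < frameSize; i++) {
    hann[i] = 0.5 * (1 - Math.cos((2 * Math.PI * i) / (frameSize - 1)));
  }
  const bins = frameSize / 2;
  const re = new Float64Array(frameSize);
  const im = new Float64Array(frameSize);
  const flux = [];
  let prev = null;
  for (let start = 0; start + frameSize <= samples.length; start += hop) {
    for (let i = 0; i < frameSize; i++) {
      re[i] = samples[start + i] * hann[i];
      im[i] = 0;
    }
    fft(re, im);
    const mag = new Float64Array(bins);
    let sum = 0;
    for (let k = 0; k < bins; k++) {
      mag[k] = Math.hypot(re[k], im[k]);
      if (prev && mag[k] > prev[k]) sum += mag[k] - prev[k];
    }
    flux.push(sum); // stays 0 for the first frame, which has no predecessor
    prev = mag;
  }
  return flux;
}
```

Frame `i` of the returned array corresponds to time `i * hop / sampleRate`, which is how flux indices map back onto the track.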
The flux array is lightly smoothed and local maxima above the 85th percentile are identified, with a minimum distance constraint between picks to prevent double-triggering. Peaks are tiered by strength relative to the global maximum:
| Tier | Threshold | Use |
|---|---|---|
| 1 — Major | ≥72% of max | Strongest hits. Key poses, cuts, major transitions. |
| 2 — Mid | 48–72% | Mid-level onsets. Secondary accents, phrase markers. |
| 3 — Minor | <48% | Subtle onsets. Texture, micro-timing, detail passes. |
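Peak picking and tiering can be sketched as below. The 85th-percentile threshold and the 72% / 48% tier boundaries come from the description above; the smoothing radius, the minimum gap of 8 frames, and all names are assumptions chosen for illustration.

```javascript
// Pick local maxima of a flux array and assign each a tier.
// smooth (radius in frames) and minGap (frames) are assumed values.
function pickPeaks(flux, opts = {}) {
  const { percentile = 0.85, minGap = 8, smooth = 1, hop = 512, sampleRate = 44100 } = opts;
  const n = flux.length;
  // Light moving-average smoothing.
  const s = new Array(n);
  for (let i = 0; i < n; i++) {
    let sum = 0;
    let count = 0;
    for (let j = Math.max(0, i - smooth); j <= Math.min(n - 1, i + smooth); j++) {
      sum += flux[j];
      count++;
    }
    s[i] = sum / count;
  }
  // Percentile threshold over the smoothed values.
  const sorted = s.slice().sort((a, b) => a - b);
  const threshold = sorted[Math.floor(percentile * (n - 1))];
  const picks = [];
  for (let i = 1; i < n - 1; i++) {
    const isPeak = s[i] > threshold && s[i] >= s[i - 1] && s[i] > s[i + 1];
    if (!isPeak) continue;
    const last = picks[picks.length - 1];
    if (last && i - last.frame < minGap) {
      // Too close to the previous pick: keep whichever is stronger.
      if (s[i] > last.value) picks[picks.length - 1] = { frame: i, value: s[i] };
    } else {
      picks.push({ frame: i, value: s[i] });
    }
  }
  let max = 0;
  for (const v of s) if (v > max) max = v;
  return picks.map((p) => {
    const strength = max > 0 ? p.value / max : 0;
    const tier = strength >= 0.72 ? 1 : strength >= 0.48 ? 2 : 3;
    return { time: (p.frame * hop) / sampleRate, frame: p.frame, strength, tier };
  });
}
```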
Root mean square (RMS) energy is computed per frame, giving a rough proxy for loudness over time. The RMS array is used for waveform background shading and for structural segmentation.
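A minimal per-frame RMS sketch. The source does not state the frame and hop sizes used for RMS, so reusing the onset-detection values (2048 and 512) as defaults is an assumption, as is the function name.

```javascript
// Root mean square energy for each frame of a sample array.
function frameRms(samples, frameSize = 2048, hop = 512) {
  const out = [];
  for (let start = 0; start + frameSize <= samples.length; start += hop) {
    let sumSq = 0;
    for (let i = start; i < start + frameSize; i++) sumSq += samples[i] * samples[i];
    out.push(Math.sqrt(sumSq / frameSize));
  }
  return out;
}
```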
The RMS envelope is heavily smoothed over a two-second window and local minima are identified. These represent moments where the music breathes between phrases. The five deepest minima become section boundaries, dividing the track into up to six structural sections. This is a heuristic — it works well for music with clear phrase structure and less well for continuous or through-composed material.
Autocorrelation is performed on the spectral flux signal across a range of lags corresponding to 50–240 BPM. The lag with the highest correlation is taken as the beat period, and the resulting tempo is folded into a musically sensible range (60–200 BPM). This estimate drives the beat grid export and is approximate; for tracks with irregular pulse the grid should be treated as a starting reference.
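A minimal sketch of the tempo estimate. Two choices here are assumptions rather than details from the source: the autocorrelation is left unnormalised, so exact ties between a period and its multiples resolve to the shorter lag, and "folding" is implemented as repeated doubling or halving of the tempo.

```javascript
// Estimate tempo from an onset-strength (flux) signal by autocorrelation.
// framesPerSecond is sampleRate / hop, about 86.13 for 44.1 kHz and hop 512.
function estimateBpm(flux, framesPerSecond, opts = {}) {
  const { minBpm = 50, maxBpm = 240, foldLo = 60, foldHi = 200 } = opts;
  let mean = 0;
  for (const v of flux) mean += v;
  mean /= flux.length;
  const x = Array.from(flux, (v) => v - mean); // remove the DC offset
  const minLag = Math.max(1, Math.floor((60 * framesPerSecond) / maxBpm));
  const maxLag = Math.min(x.length - 1, Math.ceil((60 * framesPerSecond) / minBpm));
  let bestLag = minLag;
  let bestScore = -Infinity;
  for (let lag = minLag; lag <= maxLag; lag++) {
    let score = 0; // raw autocorrelation at this lag
    for (let i = 0; i + lag < x.length; i++) score += x[i] * x[i + lag];
    if (score > bestScore) {
      bestScore = score;
      bestLag = lag;
    }
  }
  let bpm = (60 * framesPerSecond) / bestLag;
  while (bpm < foldLo) bpm *= 2; // fold up by octaves
  while (bpm > foldHi) bpm /= 2; // fold down by octaves
  return bpm;
}
```

Because lags are whole frames, the estimate is quantised: near 120 BPM at roughly 86 frames per second, adjacent lags differ by about 3 BPM, which is one reason the result is best treated as approximate.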
Results are rendered to two HTML <canvas> elements using the Canvas 2D API. Both use device pixel ratio scaling to render crisply on high-DPI displays and resize responsively on window resize events.
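Device pixel ratio scaling usually follows the pattern sketched below: size the backing store in device pixels, keep the CSS size in layout pixels, and scale the context so drawing code can keep using CSS units. The function name is an assumption; in the browser the `dpr` argument would be `window.devicePixelRatio || 1`.

```javascript
// Size a canvas for crisp rendering at the given device pixel ratio.
function fitCanvasToDisplay(canvas, cssWidth, cssHeight, dpr) {
  canvas.width = Math.round(cssWidth * dpr);   // backing store, device pixels
  canvas.height = Math.round(cssHeight * dpr);
  canvas.style.width = cssWidth + 'px';        // layout size, CSS pixels
  canvas.style.height = cssHeight + 'px';
  const ctx = canvas.getContext('2d');
  ctx.setTransform(dpr, 0, 0, dpr, 0, 0);      // draw in CSS pixel coordinates
  return ctx;
}
```

Calling the same function from a resize handler and then redrawing covers the responsive case, since assigning `canvas.width` clears the canvas.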
The 1,200-point min/max array is drawn as a filled polygon — the top edge traces positive peaks, the bottom edge traces negative peaks in reverse, closed and filled with a gradient. Section boundaries overlay as dashed vertical lines.
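The polygon construction can be sketched as a pure function that returns the outline, plus a small helper that fills it. Both names are illustrative, and the gradient setup is left to the caller via `ctx.fillStyle`.

```javascript
// Build the closed waveform outline: maxima left to right along the top,
// then minima right to left along the bottom. Samples in [-1, 1] map onto
// the full canvas height, with positive peaks above the midline.
function waveformPolygon(mins, maxs, width, height) {
  const n = maxs.length;
  const mid = height / 2;
  const xAt = (i) => (n > 1 ? (i / (n - 1)) * width : 0);
  const points = [];
  for (let i = 0; i < n; i++) points.push([xAt(i), mid - maxs[i] * mid]);
  for (let i = n - 1; i >= 0; i--) points.push([xAt(i), mid - mins[i] * mid]);
  return points;
}

// Fill the outline on a Canvas 2D context using its current fillStyle.
function fillPolygon(ctx, points) {
  ctx.beginPath();
  points.forEach(([x, y], i) => (i === 0 ? ctx.moveTo(x, y) : ctx.lineTo(x, y)));
  ctx.closePath();
  ctx.fill();
}
```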
Three tiers of peaks are drawn as vertical bars centred on the canvas midpoint, with bar height scaled by peak strength.
Section boundaries appear as dashed green vertical lines with section index labels. A 5-second grid provides temporal orientation. Hovering shows a tooltip with the nearest peak's timecode, tier, and strength percentage.
The primary deliverable is a .jsx ExtendScript file. Run in After Effects via File › Scripts › Run Script File, it creates five null layers in the active composition, each carrying a specific category of markers.
| Layer | AE label | Contents |
|---|---|---|
| ♪ Beat grid | Sea Foam | Every beat at estimated BPM. Downbeats labelled M1, M2… Off-beats as 1.2, 1.3, 1.4. |
| § Sections | Yellow | One marker per structural boundary with section index and timecode. |
| ▲ Major hits | Tangerine | Tier 1 peaks. Timecode and strength percentage. |
| ● Mid hits | Aqua | Tier 2 peaks. Timecode and strength percentage. |
| · Minor hits | Lavender | Tier 3 onsets. Timecode and strength percentage. |
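The beat grid labelling scheme in the table can be sketched as follows. Two assumptions are made that the source does not state: the grid starts at time zero with no phase offset, and there are four beats per bar (implied by the 1.2, 1.3, 1.4 labels).

```javascript
// Generate beat times and labels: M1, 1.2, 1.3, 1.4, M2, 2.2, ...
function beatGrid(bpm, duration, beatsPerBar = 4) {
  const period = 60 / bpm; // seconds per beat
  const beats = [];
  for (let n = 0; n * period < duration; n++) {
    const bar = Math.floor(n / beatsPerBar) + 1;
    const beat = (n % beatsPerBar) + 1;
    const label = beat === 1 ? 'M' + bar : bar + '.' + beat;
    beats.push({ time: n * period, label });
  }
  return beats;
}
```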
Layers are added in reverse order so the beat grid sits at the top of the layer stack. All markers are placed using MarkerValue objects and setValueAtTime() on each null layer's Marker property, the standard ExtendScript API for layer markers. After Effects snaps keyframes and layer in/out points to markers when Shift is held while dragging, making the marker layers directly usable as a snap grid.
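On the browser side, the export amounts to building a string of ExtendScript. The sketch below generates the code for a single marker layer; the function name and the `{ time, label }` cue shape are assumptions, and label colours are omitted. The ExtendScript calls it emits (`comp.layers.addNull()`, `layer.property("Marker")`, `setValueAtTime()`, `MarkerValue`) are part of the After Effects scripting API.

```javascript
// Build ExtendScript source that adds one null layer with a marker per cue.
// layerName and the cue shape { time, label } are illustrative.
function buildMarkerLayerJsx(layerName, cues) {
  const lines = [
    'var comp = app.project.activeItem;',
    'if (comp && comp instanceof CompItem) {',
    '  app.beginUndoGroup("Audio cue markers");',
    '  var layer = comp.layers.addNull();',
    '  layer.name = ' + JSON.stringify(layerName) + ';',
    '  var markers = layer.property("Marker");',
  ];
  for (const cue of cues) {
    // JSON.stringify escapes quotes and backslashes in the marker comment.
    lines.push(
      '  markers.setValueAtTime(' + cue.time.toFixed(3) +
      ', new MarkerValue(' + JSON.stringify(cue.label) + '));'
    );
  }
  lines.push('  app.endUndoGroup();', '}');
  return lines.join('\n');
}
```

In the browser, the returned string could be wrapped in a `Blob` and offered as a `.jsx` download; wrapping the layer creation in an undo group lets the whole import be reverted with a single undo in After Effects.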
The tool shares a design language with the Pokédex viewer and Dream Team case study on the same portfolio site, creating coherence across all embedded tools.
| Token | Hex | Role |
|---|---|---|
| --bg | #141414 | Page background; matches Pokédex viewer canvas |
| --accent | #76c17d | Primary / Tier 1 (Pokédex green) |
| --accent2 | #b9b0ff | Tier 2 (periwinkle) |
| --accent3 | #e8b86d | Tier 3 / beat grid (warm amber) |
| --text | #e1e1e1 | Primary text; 12:1 contrast on --bg (WCAG AAA) |
| --muted | #888888 | Secondary text; 4.6:1 contrast (WCAG AA) |
IBM Plex Mono throughout, matching the portfolio site's typographic system. DM Serif Display italic for the track title and upload prompt — a tonal counterpoint to the monospace grid.
During analysis a quarter note hops along a five-line staff via CSS keyframe animation. Ghost notes trail behind on the staff lines. The animation is entirely CSS — no JavaScript, no canvas — and uses the same CSS variable palette as the rest of the interface.
decodeAudioData() does not support all container/codec variants. Some MP4 files with unusual audio encoding may fail to decode. WAV and FLAC are the most reliable formats.