The case study
Most interfaces wait for a click.
A click is simple. A discrete event. Pressed or not pressed. On or off. Much of web interaction design is built on that binary.
This interface had to do something different. It had to listen to time.
Not just whether something was pressed — but for how long. And how long the silence lasted after. And the ratio between the two. And whether a second input arrived before the first had finished. And whether all of that happened within tolerances measured in milliseconds, on a device that was never designed to think about any of it.
The hardware.
The platform needed to support professional-grade physical input devices — specialized hardware that generates patterns based on how and how long it's pressed. Two device types with fundamentally different interaction signatures.
The simpler type: a single contact. Press and hold for a long element. Press and release for a short one. The operator's hand controls everything — duration, spacing, rhythm.
The more complex type: two contacts, generating automatic alternating sequences when both are held simultaneously. The operator controls the sequence; the hardware manages the precision.
Both connect to a computer via a standard audio cable. Neither was designed for a web browser. There is no dedicated Web API for this kind of input. No library. No established pattern. The browser hears an audio signal — and that's where the known terrain ends.
The problems that had to be solved.
Building this from first principles meant confronting problems that standard web development resources simply don't address.
Input latency. Any perceptible delay between physical press and audio response breaks the feedback loop that makes practice useful. The operator listens to what they're sending in real time: hand and ear in conversation. Introduce lag and that conversation becomes noise. The Web Audio API is built around scheduling sound against its own audio clock rather than reacting to input events the instant they arrive. Working with that architecture, not against it, was the first engineering constraint.
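One way to work with that scheduling model, sketched here rather than taken from the product's code: keep a tone running continuously and treat each press or release as a gain change scheduled at the audio clock's current time. The names `GainParamLike`, `keyDown`, `keyUp`, and `EDGE_TIME_CONSTANT` are illustrative assumptions. The two methods on the interface mirror the real `AudioParam` methods of the same names.

```typescript
// Minimal subset of the Web Audio AudioParam interface, so the sketch can be
// exercised with a stand-in object outside the browser.
interface GainParamLike {
  cancelScheduledValues(cancelTime: number): unknown;
  setTargetAtTime(target: number, startTime: number, timeConstant: number): unknown;
}

// Assumed smoothing constant in seconds: short enough to feel instant,
// long enough to avoid an audible click at the edges of the tone.
const EDGE_TIME_CONSTANT = 0.003;

// Contact closed: drop anything previously scheduled and open the gate,
// timed against the audio clock rather than the input event's timestamp.
function keyDown(gain: GainParamLike, audioNow: number): void {
  gain.cancelScheduledValues(audioNow);
  gain.setTargetAtTime(1, audioNow, EDGE_TIME_CONSTANT);
}

// Contact opened: close the gate the same way.
function keyUp(gain: GainParamLike, audioNow: number): void {
  gain.cancelScheduledValues(audioNow);
  gain.setTargetAtTime(0, audioNow, EDGE_TIME_CONSTANT);
}
```

In a browser this would be called as `keyDown(gainNode.gain, audioContext.currentTime)`, with an oscillator that never stops feeding the gain node, so no node has to be created on the critical path of a press.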
Timing discrimination. The difference between a short and a long element is a ratio, not a fixed duration. At higher speeds, tolerances tighten considerably. The algorithm that reads the input stream has to calibrate to the operator's personal timing signature and update that calibration as the session progresses. A fixed threshold fails as soon as the operator speeds up or slows down. An adaptive one doesn't.
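A minimal sketch of what an adaptive threshold could look like, not the product's actual algorithm. It assumes a starting long-to-short ratio of 3:1 and tracks each element type with an exponential moving average. `AdaptiveClassifier`, the `smoothing` parameter, and the 3:1 seed are all assumptions made for illustration.

```typescript
type ElementKind = "short" | "long";

class AdaptiveClassifier {
  private shortMs: number;
  private longMs: number;
  private readonly smoothing: number;

  constructor(initialShortMs: number, smoothing = 0.2) {
    this.shortMs = initialShortMs;
    this.longMs = initialShortMs * 3; // assumed 3:1 starting ratio
    this.smoothing = smoothing;
  }

  // The decision boundary sits midway between the two running averages,
  // so it moves whenever the operator's own timing moves.
  get thresholdMs(): number {
    return (this.shortMs + this.longMs) / 2;
  }

  // Classify one press, then fold its duration into the matching average.
  classify(durationMs: number): ElementKind {
    const kind: ElementKind = durationMs < this.thresholdMs ? "short" : "long";
    if (kind === "short") {
      this.shortMs += this.smoothing * (durationMs - this.shortMs);
    } else {
      this.longMs += this.smoothing * (durationMs - this.longMs);
    }
    return kind;
  }
}
```

Seeded with `new AdaptiveClassifier(100)`, the boundary starts at 200 ms and drifts upward if the operator's presses lengthen over the session, which is the behavior a fixed threshold cannot provide.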
Simultaneous inputs. The more complex device type creates input events that arrive faster than a naive decode cycle can process them. The solution required a dedicated state machine that could handle simultaneous inputs, queued outputs, and mid-sequence corrections without losing track of where it was.
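The core of such a state machine can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the product's implementation: it covers alternation and one queued input, and leaves out mid-sequence corrections. `DualContactKeyer`, `notePress`, and `next` are hypothetical names, and starting with the short element when both contacts close together is an arbitrary choice made for the sketch.

```typescript
type KeyElement = "short" | "long";

interface Contacts {
  short: boolean;
  long: boolean;
}

class DualContactKeyer {
  private last: KeyElement | null = null;
  private latched: Contacts = { short: false, long: false };

  // Call when contacts change while an element is still sounding. A press on
  // the opposite contact is remembered so that a brief tap is not lost.
  notePress(contacts: Contacts): void {
    if (contacts.short && this.last !== "short") this.latched.short = true;
    if (contacts.long && this.last !== "long") this.latched.long = true;
  }

  // Call at each element boundary to decide what to send next.
  // Returns null when nothing is held or queued.
  next(held: Contacts): KeyElement | null {
    const short = held.short || this.latched.short;
    const long = held.long || this.latched.long;
    this.latched = { short: false, long: false };

    let choice: KeyElement | null = null;
    if (short && long) {
      // Both contacts active: alternate with whatever was sent last.
      choice = this.last === "short" ? "long" : "short";
    } else if (short) {
      choice = "short";
    } else if (long) {
      choice = "long";
    }
    this.last = choice;
    return choice;
  }
}
```

Because decisions are made only at element boundaries, input events that arrive mid-element update the latch instead of competing with the decode step.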
iOS audio. Safari on iOS will not start audio output until the audio context is created or resumed inside an explicit user gesture, such as a tap. A session that begins without that gesture produces silence, and silence, in a product where the sound is the entire point, is a critical failure. Every entry point into the practice environment had to be audited and wired to the audio initialization flow.
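A shared helper is one way to make every entry point go through the same initialization path. The sketch below is an assumption about how that could be structured. `ensureAudioRunning` and `AudioContextLike` are hypothetical names, while `state` and `resume()` are the real `AudioContext` members they stand in for.

```typescript
// Minimal subset of AudioContext, so the helper can be tested with a stand-in.
interface AudioContextLike {
  readonly state: string; // "suspended", "running", or "closed"
  resume(): Promise<void>;
}

// Call from inside a user-gesture handler (tap, click, key press).
// Resolves to true once the context is actually producing audio.
async function ensureAudioRunning(ctx: AudioContextLike): Promise<boolean> {
  if (ctx.state === "closed") return false; // a closed context cannot be revived
  if (ctx.state !== "running") {
    await ctx.resume();
  }
  return ctx.state === "running";
}

// Hypothetical wiring for one entry point:
//   startButton.addEventListener("click", () => {
//     void ensureAudioRunning(audioContext).then((ok) => {
//       if (ok) startSession();
//     });
//   });
```

Checking the state again after `resume()` settles, rather than assuming success, lets the caller show a prompt instead of starting a silent session.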
Designing the feedback layer.
Getting the input working was the engineering problem. Designing what the user experienced around it was the design problem — and the two were inseparable.
The decision to show timing feedback came from the transfer principle: a good score is worthless if the skill doesn't transfer. Accuracy tells you whether the right element was sent. Timing tells you whether the skill is being built correctly.
The feedback is visual and immediate — comparison bars showing sent timing against ideal timing for each element. Not graded. Not scored. Just visible. The goal is awareness, not judgment.
Earlier iterations used numerical feedback. Testing revealed this was too cognitively demanding during active practice — learners couldn't process numbers while also processing audio and managing physical input. The bar visualization maps naturally to duration without requiring conscious interpretation.
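The mapping from durations to bars can be as small as one function. This sketch assumes both bars share a scale on which the ideal duration sits at the midpoint, so over-long and too-short presses are equally easy to see. `timingBars` and `scaleFactor` are illustrative names, not taken from the product.

```typescript
interface TimingBars {
  sentPercent: number;  // width of the "sent" bar, 0 to 100
  idealPercent: number; // width of the "ideal" bar, 0 to 100
}

// idealMs must be greater than zero. With the default scaleFactor of 2 the
// ideal bar fills half the track and the sent bar is clamped at full width.
function timingBars(sentMs: number, idealMs: number, scaleFactor = 2): TimingBars {
  const fullScaleMs = idealMs * scaleFactor;
  const toPercent = (ms: number): number =>
    Math.min(100, Math.max(0, (ms / fullScaleMs) * 100));
  return { sentPercent: toPercent(sentMs), idealPercent: toPercent(idealMs) };
}
```

A 150 ms press against a 100 ms ideal yields widths of 75 and 50, a difference the eye reads as "a bit long" without any number being shown.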
The toggle to hide timing feedback came from a different insight: beginners who see their timing before they've developed any timing instinct can become anxious about the data before they understand what it means. The toggle lets them choose when they're ready.
Feedback that feels like a coach watching quietly, not a test being scored.
The detection flow.
Before any of the practice interaction was designed, there was a prior problem: how does a user know if their hardware is connected correctly?
A device connected via audio cable produces no visual confirmation that anything is working. The browser has no dedicated API to confirm receipt. The user might have the wrong input selected, the wrong cable, the wrong jack. They might be connected correctly and not know it.
The solution was a guided detection wizard — a short flow that walks the user through connection, asks them to send a test input, confirms receipt, and advances when successful. Platform-specific guidance for iOS, which requires additional system settings to route audio correctly, is built into the flow as a conditional branch.
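The flow described above can be sketched as a small step function plus a signal check. Everything here is an assumption made for illustration: the step names, the `nextStep` and `hasSignal` functions, and the 0.05 RMS threshold are not taken from the product.

```typescript
type WizardStep = "connect" | "iosSettings" | "awaitTestInput" | "confirmed";
type WizardEvent = "next" | "signalDetected";

// Advance the wizard. iOS gets one extra step for system audio settings.
function nextStep(step: WizardStep, event: WizardEvent, isIOS: boolean): WizardStep {
  switch (step) {
    case "connect":
      if (event !== "next") return step;
      return isIOS ? "iosSettings" : "awaitTestInput";
    case "iosSettings":
      return event === "next" ? "awaitTestInput" : step;
    case "awaitTestInput":
      // Only a detected test input moves the user forward from here.
      return event === "signalDetected" ? "confirmed" : step;
    case "confirmed":
      return step;
  }
}

// Treat a block of audio samples as "input received" when its RMS level
// reaches the threshold. Samples are expected in the range -1 to 1.
function hasSignal(samples: Float32Array, threshold = 0.05): boolean {
  if (samples.length === 0) return false;
  let sumSquares = 0;
  samples.forEach((s) => {
    sumSquares += s * s;
  });
  return Math.sqrt(sumSquares / samples.length) >= threshold;
}
```

In a browser, the samples could come from an `AnalyserNode` via `getFloatTimeDomainData`, polled while the wizard sits in the `awaitTestInput` step.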
The wizard exists because the first experience with hardware input is also the most fragile. Getting a user through that moment successfully changes their relationship with the product. Leaving them alone with a silent interface does the opposite.
Thirty seconds of verification eliminates the most demoralizing failure mode entirely.
Input that knows what it's listening for.
The two device types have different interaction signatures and required different implementations — but they share a single interface contract with the rest of the product. Either type can be used in any practice mode. Switching between them doesn't change the experience, only the input path.
A code audit during a recent instrumentation pass found that one device type had been silently excluded from behavioral tracking since the feature was first wired up. The cause was a single guard clause that checked device type before deciding whether to record the event. A one-line fix. The kind of gap that only surfaces when you look for it deliberately.
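A shared contract of this kind might look like the sketch below, in which tracking subscribes to the interface and never branches on device type. All names here (`KeyInput`, `ManualInput`, `attachTracking`, `TrackedEvent`) are hypothetical, and `ManualInput` is an in-memory stand-in rather than either real device implementation.

```typescript
type SentElement = "short" | "long";
type DeviceType = "single" | "dual";
type ElementListener = (element: SentElement, durationMs: number) => void;

// The contract the rest of the product depends on.
interface KeyInput {
  readonly deviceType: DeviceType;
  onElement(listener: ElementListener): void;
}

// In-memory stand-in that can play the role of either device type.
class ManualInput implements KeyInput {
  readonly deviceType: DeviceType;
  private listeners: ElementListener[] = [];

  constructor(deviceType: DeviceType) {
    this.deviceType = deviceType;
  }

  onElement(listener: ElementListener): void {
    this.listeners.push(listener);
  }

  emit(element: SentElement, durationMs: number): void {
    this.listeners.forEach((listener) => listener(element, durationMs));
  }
}

interface TrackedEvent {
  deviceType: DeviceType;
  element: SentElement;
  durationMs: number;
}

// Device type is recorded as data, never used as a condition for recording.
function attachTracking(input: KeyInput, log: TrackedEvent[]): void {
  input.onElement((element, durationMs) => {
    log.push({ deviceType: input.deviceType, element, durationMs });
  });
}
```

A test that emits one element from each device type and expects two log entries would have caught the excluded device without a manual audit.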
What the browser learned to hear.
There's a satisfaction in building something the platform wasn't designed to support — not by working around its constraints, but by understanding them deeply enough to work with them.
A user can pick up their hardware, connect it to their phone or laptop, open a browser tab, and practice at professional quality. The timing feedback is real. The audio is precise. The detection is reliable. The skill that builds here transfers to real-world application.
That happened because the design process started with the physics, not the components. With what the interaction actually required, not what the toolkit made easy.
When a tap becomes a signal — the browser learns to hear what it was never designed to listen for.