Asura's Harp: Direct Latent Control of Neural Sound
The goal of this project is to develop a technique for turning a corpus of sound into a neural instrument that can be played in real time. No a priori structure is imposed on the latent space; the interface is simply a grid of unlabeled knobs.
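For concreteness, here is a minimal sketch of what such a "grid of unlabeled knobs" interface could look like, assuming a fixed-size latent vector; the names (`LATENT_DIM`, `knobs_to_latent`, `decode_frame`) are illustrative, not the project's actual API.

```python
# A minimal sketch of the "grid of unlabeled knobs" idea: each knob maps
# directly onto one latent dimension, with no semantic labels attached.
# All names here are illustrative assumptions, not the project's real code.
import numpy as np

LATENT_DIM = 16  # assumed size of the latent vector / knob grid

def knobs_to_latent(knob_values, lo=-3.0, hi=3.0):
    """Map raw knob positions in [0, 1] onto an unstructured latent vector.

    No per-dimension meaning is imposed; the player discovers the mapping by ear.
    """
    knobs = np.asarray(knob_values, dtype=np.float32)
    assert knobs.shape == (LATENT_DIM,)
    return lo + knobs * (hi - lo)

# Example: center all knobs, nudge one, and hand the latent to a decoder.
z = knobs_to_latent(np.full(LATENT_DIM, 0.5))
z[3] += 0.8
# audio_frame = decode_frame(z)   # decoder call is project-specific
```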
The project also aims to run on local devices while generating 44 kHz stereo output with relatively low control latency (~500 samples). The current version of the architecture, available at the GitHub link above, comes close to this goal but doesn't quite reach it: generation runs at 0.9x realtime on my M3 MacBook Air, or 0.25x realtime on a Jetson Orin Nano.
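For a sense of scale, here is a quick back-of-the-envelope sketch of what those numbers mean, assuming a hypothetical 2048-sample generation frame (the real frame size may differ):

```python
# Rough arithmetic for the latency / realtime targets mentioned above.
# Sample rate and control latency are taken from the text; the frame size
# used in the example is an assumption.
SAMPLE_RATE = 44_000           # 44 kHz target output rate
CONTROL_LATENCY_SAMPLES = 500  # ~500 samples of control latency

latency_ms = 1000 * CONTROL_LATENCY_SAMPLES / SAMPLE_RATE
print(f"control latency ≈ {latency_ms:.1f} ms")  # ≈ 11.4 ms

def realtime_factor(frame_samples, render_seconds):
    """>1.0 means the model renders audio faster than it plays back."""
    return (frame_samples / SAMPLE_RATE) / render_seconds

# e.g. a hypothetical 2048-sample frame rendered in 52 ms -> ~0.9x realtime
print(f"{realtime_factor(2048, 0.052):.2f}x realtime")
```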
On this page you can hear samples from this system trained on radio dance mixes. Candidly, while the model has picked up an interesting space of rhythmic variations, the frequency-space decoder seems unable to build stable harmonic structure, leaving little to no tonal content. When the autoregressive renoising is turned down to make it easier to carry phase from one frame to the next, the signal tends to degenerate or overflow in power. TL;DR: there's lots of room for improvement!
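The renoising trade-off described above can be sketched roughly as follows; this is an illustration of the general idea under assumed names, not the project's actual implementation, and the decoder refinement step is omitted.

```python
# Hedged sketch of autoregressive renoising in the frequency domain: each new
# frame starts from the previous frame's spectrum, partially blended with fresh
# noise before the (omitted) decoder refines it. Lower `renoise_strength` keeps
# more phase continuity, but as the text notes, the model can then drift toward
# degenerate or overpowered output in the missing decoder step.
import numpy as np

rng = np.random.default_rng(0)

def renoise(prev_spec, renoise_strength=0.7):
    """Blend the previous frame's complex spectrum with fresh Gaussian noise."""
    noise = rng.standard_normal(prev_spec.shape) + 1j * rng.standard_normal(prev_spec.shape)
    return np.sqrt(1.0 - renoise_strength) * prev_spec + np.sqrt(renoise_strength) * noise

# Toy loop over a few frames; the model refinement step is project-specific.
spec = rng.standard_normal(1025) + 1j * rng.standard_normal(1025)
for _ in range(4):
    spec = renoise(spec, renoise_strength=0.2)
    # spec = decoder_step(spec, z)   # refinement by the trained decoder
```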
In the background you can see a UMAP projection of this model's learned latent space.
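For reference, a projection like this can be produced with umap-learn from a dump of encoded latent vectors; the file path and parameters below are placeholders rather than the project's actual pipeline.

```python
# Minimal sketch of plotting a UMAP projection of latent vectors, assuming
# they have been saved as a (num_frames, latent_dim) array. "latents.npy"
# is a hypothetical path.
import numpy as np
import umap               # pip install umap-learn
import matplotlib.pyplot as plt

latents = np.load("latents.npy")          # hypothetical dump of encoder outputs
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(latents)

plt.figure(figsize=(6, 6))
plt.scatter(embedding[:, 0], embedding[:, 1], s=2, alpha=0.4)
plt.axis("off")
plt.savefig("latent_umap.png", dpi=200)
```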