SmartDJ: Declarative Audio Editing with Audio Language Model

University of Pennsylvania  

ICLR'26
NeurIPS'25 GenProCC Oral

TL;DR: We propose SmartDJ for intelligent audio editing. Given a user's declarative prompt, the system generates the corresponding step-by-step audio editing operations and carries out the full edit.

SmartDJ (Ours) • Audio Editing Examples

Overview

Audio editors today usually expect template-like commands (e.g., “add birds”, “remove rain”), and most operate only on mono audio. In practice, users want to provide a single declarative instruction (“make it sound like a quiet sunny forest”) and let the system figure out the steps — while preserving spatial cues in stereo audio.

SmartDJ = Planner + Editor

  • ALM Planner: perceives the original audio + interprets the goal → emits an edit recipe (atomic steps).
  • LDM Editor: executes each atomic step sequentially in stereo latent space.

The intermediate plan is natural language, so it's inspectable, editable, and enables human-in-the-loop workflows.

Pipeline snapshot

Method

From declarative goal → atomic edits → high-quality stereo diffusion editing

Problem: Declarative Audio Editing

Input: an original stereo waveform a0 and a high-level instruction P (e.g., “make it sound like a sunny forest”). The goal is to output an edited waveform an that achieves P while preserving all unedited content from the original audio.

SmartDJ decomposes P into a sequence of atomic steps S = {s1, …, sn}, then applies them sequentially:

S = ALM(a0, P)
ai = LDM(ai-1, si)   for i = 1..n
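The two equations above amount to a plan-then-execute loop. The following is a minimal sketch of that control flow only, assuming the planner and editor are available as plain callables. The names `run_declarative_edit`, `toy_planner`, `toy_editor` and the list-of-channels `Audio` type are hypothetical placeholders, not part of SmartDJ.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a stereo waveform: [left_channel, right_channel].
Audio = List[List[float]]


def run_declarative_edit(
    a0: Audio,
    instruction: str,
    planner: Callable[[Audio, str], Sequence[str]],
    editor: Callable[[Audio, str], Audio],
) -> List[Audio]:
    """Return [a0, a1, ..., an], applying one atomic step per editor call."""
    steps = planner(a0, instruction)  # S = ALM(a0, P)
    states = [a0]
    for step in steps:  # ai = LDM(ai-1, si)
        states.append(editor(states[-1], step))
    return states


# Toy stand-ins so the loop can be exercised without any model.
def toy_planner(audio: Audio, instruction: str) -> List[str]:
    return ["Remove the sound of rain", "Add the sound of birds at left by 3dB"]


def toy_editor(audio: Audio, step: str) -> Audio:
    # Copies the samples unchanged; a real editor would return edited audio.
    return [channel[:] for channel in audio]


states = run_declarative_edit(
    [[0.0, 0.1], [0.0, -0.1]], "make it sound like a sunny forest",
    toy_planner, toy_editor,
)
```

Keeping every intermediate state makes it easy to listen to, or roll back to, the result of any single step.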

Atomic Edit Operations

Each step targets a specific event/property:

Add, Remove, Extract, Volume Change, Direction, Time Shift, Reverb, Timbre

Why atomic? It makes complex edits composable and keeps the plan interpretable.
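Because the plan is plain text, each step can be turned into a small structured record for inspection or manual editing. The sketch below is an illustration only: the `AtomicStep` record, the action names, and the regular expressions are assumptions inferred from the phrasing of the example steps on this page, and they cover only add, remove, volume, and direction steps.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AtomicStep:
    action: str  # "add", "remove", "turn_up", "turn_down", "change_direction"
    target: str
    direction: Optional[str] = None
    gain_db: Optional[float] = None


_DIR = r"(?P<direction>left|right|front)"
_GAIN = r"(?P<gain>-?\d+(?:\.\d+)?) ?db"

# Hypothetical grammar, guessed from the example steps; not the authors' format spec.
_PATTERNS = [
    ("add", re.compile(
        rf"^add the sound of (?P<target>.+?) at (?:the )?{_DIR}(?: (?:by|with) {_GAIN})?$", re.I)),
    ("remove", re.compile(
        rf"^remove the sound of (?P<target>.+?)(?: at (?:the )?{_DIR})?$", re.I)),
    ("turn_up", re.compile(rf"^turn up the sound of (?P<target>.+?) by {_GAIN}$", re.I)),
    ("turn_down", re.compile(rf"^turn down the sound of (?P<target>.+?) by {_GAIN}$", re.I)),
    ("change_direction", re.compile(
        rf"^change the sound of (?P<target>.+?) to (?:the )?{_DIR}$", re.I)),
]


def parse_step(text: str) -> AtomicStep:
    """Parse one natural-language step into an AtomicStep, or raise ValueError."""
    text = text.strip()
    for action, pattern in _PATTERNS:
        match = pattern.match(text)
        if match:
            groups = match.groupdict()
            direction = groups.get("direction")
            gain = groups.get("gain")
            return AtomicStep(
                action=action,
                target=groups["target"],
                direction=direction.lower() if direction else None,
                gain_db=float(gain) if gain is not None else None,
            )
    raise ValueError(f"unrecognized step: {text!r}")


step = parse_step("Add the sound of seagulls squawking at left by 3dB")
```

A structured form like this is one way a human-in-the-loop tool could let users tweak a gain or direction before the editor runs.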

Framework

ALM Planner

The Audio Language Model takes both audio and text and generates a step-by-step edit recipe.

  • Audio grounding: encode a0 using a pretrained audio encoder (e.g., CLAP).
  • Fusion: inject audio embedding into the LLM via adapter layers.
  • Efficient tuning: freeze the audio encoder; fine-tune adapters + LoRA on the LLM.
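For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update on a single frozen linear layer, written in plain Python for clarity. The rank, scaling factor, and which LLM layers receive LoRA are not stated on this page, so the values here are arbitrary, and `lora_forward` and `matvec` are hypothetical helper names.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r)
    would receive gradient updates during fine-tuning.
    """
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen pretrained weight (toy identity)
A = [[0.5, 0.5]]              # rank r = 1
x = [2.0, 4.0]

# With B initialised to zeros, the layer reproduces the frozen model exactly.
y_init = lora_forward(W, A, [[0.0], [0.0]], x, alpha=1.0)
# After training, a non-zero B adds a low-rank correction.
y_tuned = lora_forward(W, A, [[1.0], [2.0]], x, alpha=1.0)
```

The zero-initialised `B` is the usual LoRA convention: fine-tuning starts from the pretrained behaviour and only gradually departs from it.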

Stereo LDM Editor

The editor performs each step as conditional diffusion in a compressed latent space.

  • Stereo VAE: encode stereo audio into latent.
  • Edit conditioning: concatenate previous latent with a noised latent, condition on step text via cross-attention.
  • Inference: DDIM sampling + classifier-free guidance for strong prompt adherence.
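Classifier-free guidance, mentioned in the last bullet, combines a conditional and an unconditional noise prediction at each sampling step. The sketch below shows only that combination rule in its common form. The function name `cfg_combine` and the flat-list representation of a noise prediction are assumptions for illustration, and the guidance scale SmartDJ uses is not given here.

```python
from typing import List


def cfg_combine(eps_uncond: List[float], eps_cond: List[float],
                guidance_scale: float) -> List[float]:
    """eps = eps_uncond + w * (eps_cond - eps_uncond).

    w = 0 ignores the text condition, w = 1 is the plain conditional
    prediction, and w > 1 pushes the sample further toward the prompt.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_u = [0.0, 1.0]  # toy prediction with the step text dropped
eps_c = [1.0, 1.0]  # toy prediction conditioned on the step text

plain = cfg_combine(eps_u, eps_c, 1.0)
guided = cfg_combine(eps_u, eps_c, 3.0)
```

Larger scales generally strengthen prompt adherence at some cost to naturalness, which is why the scale is normally treated as a tunable inference parameter.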

Dataset

Designer–Composer Pipeline

To train SmartDJ, we synthesize editable stereo scenes by sampling labeled single-event clips, mixing them into a base scene, and then rendering stepwise edits from an LLM-generated recipe.

  • Sample K labeled events (e.g., “car engine”, “bell ring”)
  • LLM generates a declarative instruction + atomic steps
  • DSP composer applies each step to produce a1…an
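To make the composer step concrete, here is a minimal sketch of how an "add" or "turn up/down" step could be rendered with basic signal processing: a decibel-to-amplitude conversion and a constant-power stereo pan. This is an assumption about what such a composer might do, not the authors' implementation. The helper names, the pan convention (-1 = left, 0 = front, +1 = right), and the use of simple panning rather than a fuller spatial rendering are all hypothetical.

```python
import math
from typing import List, Tuple

Stereo = Tuple[List[float], List[float]]  # (left, right) sample lists


def db_to_gain(db: float) -> float:
    """Convert a level change in dB to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def pan_mono(samples: List[float], pan: float) -> Stereo:
    """Constant-power pan of a mono event: pan=-1 left, 0 front, +1 right."""
    angle = (pan + 1.0) * math.pi / 4.0
    left_gain, right_gain = math.cos(angle), math.sin(angle)
    return [s * left_gain for s in samples], [s * right_gain for s in samples]


def mix_event(scene: Stereo, event: List[float], pan: float, gain_db: float) -> Stereo:
    """Add a panned, gain-adjusted mono event on top of an existing stereo scene."""
    gain = db_to_gain(gain_db)
    ev_left, ev_right = pan_mono(event, pan)
    left = [a + gain * b for a, b in zip(scene[0], ev_left)]
    right = [a + gain * b for a, b in zip(scene[1], ev_right)]
    return left, right


silence: Stereo = ([0.0, 0.0], [0.0, 0.0])
left, right = mix_event(silence, [1.0, -1.0], pan=-1.0, gain_db=0.0)
```

Under this convention a 6 dB increase roughly doubles the amplitude, since 10^(6/20) is about 1.995.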

Dataset illustration

Dataset pipeline figure

High‑level audio editing examples

High-level instruction: “Make this sound like a workshop by the dock”

ALM-inferred atomic editing steps:

  • Remove the sound of metal knock
  • Add the sound of seagulls squawking at left by 3dB
  • Turn down the sound of motorboat running by 2dB
  • Add the sound of waves lapping at right by 2dB

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

High-level instruction: “Make this sound like a protest in a city”

ALM-inferred atomic editing steps:

  • Turn up the sound of emergency siren by 3dB
  • Remove the sound of man speech
  • Add the sound of crowd chanting at front by 3dB

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

High-level instruction: “Make this sound like a serene beach”

ALM-inferred atomic editing steps:

  • Remove the sound of whistling
  • Turn up the sound of wave crash by 4dB
  • Add the sound of seagulls calling at front by 3dB

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

High-level instruction: “Make this sound like a busy city street”

ALM-inferred atomic editing steps:

  • Add the sound of distant sirens at left by 3 dB
  • Add the sound of footsteps on pavement at right by 2 dB
  • Turn down the sound of engine rev by 2dB
  • Remove the sound of bell ring

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

High-level instruction: “Make this sound like a cozy living room”

ALM-inferred atomic editing steps:

  • Add the sound of fireplace crackle at left by 3dB
  • Turn down the sound of woman speech by 2dB
  • Remove the sound of cat meow

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

High-level instruction: “Make this sound like in an outdoor concert”

ALM-inferred atomic editing steps:

  • Remove the sound of whistle
  • Turn down the sound of woman speech by 2dB
  • Add the sound of guitar strumming at left by 2dB

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

High-level instruction: “Make this sound like a busy office”

ALM-inferred atomic editing steps:

  • Remove the sound of drilling
  • Add the sound of phone ringing at right by 3dB
  • Turn up the sound of typewriter type by 2dB

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

High-level instruction: “Make this sound like a quiet workshop”

ALM-inferred atomic editing steps:

  • Add the sound of soft hammering at right by 2dB
  • Remove the sound of tractor thud
  • Change the sound of object crumple to left

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

High-level instruction: “Make this sound like a city park”

ALM-inferred atomic editing steps:

  • Add the sound of bicycle bells at left by 3dB
  • Add the sound of footsteps on gravel at right by 2dB
  • Remove the sound of engine rev
  • Turn up the sound of baby laugh by 2dB

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

High-level instruction: “Craft this sound like a peaceful farm night”

ALM-inferred atomic editing steps:

  • Add the sound of crickets chirping at front by 2dB
  • Turn up the sound of snore by 3dB
  • Turn down the sound of goat bleat by 2dB

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

High-level instruction: “Make this sound like a busy daycare center”

ALM-inferred atomic editing steps:

  • Turn up the sound of child cry by 3dB
  • Remove the sound of car engine
  • Add the sound of toys clattering at left by 2dB

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

High-level instruction: “Make this sound like a military training ground”

ALM-inferred atomic editing steps:

  • Add the sound of cannon fire at right by 4dB
  • Turn up the sound of clank by 2dB

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

Atomic editing action: Add

Edit instruction: “Add the sound of water falling at the front with -1 db”

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

Edit instruction: “Add the sound of engine revs at the left with 0 db”

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

Edit instruction: “Add the sound of music playing and people singing at the right with 0 db”

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours)

Atomic editing action: Remove

Edit instruction: “Remove the sound of baby crying at the front”

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours), Target (Ground truth)

Edit instruction: “Remove the sound of man speaking at left”

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours), Target (Ground truth)

Atomic editing action: Extract

Edit instruction: “extract the sound of water pours, horn honks, and man speaks at the front”

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours), Target (Ground truth)

Edit instruction: “extract the sound of whistles”

Audio samples: Original, ZETA, AudioEditor, Audit, SmartDJ (Ours), Target (Ground truth)

Atomic editing action: Change sound direction

Edit instruction: “Change the sound of woman speaking, food frying at the front to the right”

Audio samples: Original, Audit, SmartDJ (Ours), Target (Ground truth)

Edit instruction: “change the sound of whistling and male speech to the left”

Audio samples: Original, Audit, SmartDJ (Ours), Target (Ground truth)

Atomic editing action: Turn up/down

Edit instruction: “Turn up the sound of waves crashing, wind blows by 6 db”

Audio samples: Original, Audit, SmartDJ (Ours), Target (Ground truth)

Edit instruction: “Turn down the sound of typewriter by 6 db”

Audio samples: Original, Audit, SmartDJ (Ours), Target (Ground truth)

BibTeX

@article{lan2025guiding,
  title={Guiding audio editing with audio language model},
  author={Lan, Zitong and Hao, Yiduo and Zhao, Mingmin},
  journal={arXiv preprint arXiv:2509.21625},
  year={2025}
}