SmartDJ: Declarative Audio Editing with Audio Language Model

Zitong Lan, Yiduo Hao, Mingmin Zhao

University of Pennsylvania

ICLR'26
NeurIPS'25 GenProCC Oral

TL;DR: We propose SmartDJ for intelligent audio editing. Given a user's declarative prompt, the system generates the corresponding step-by-step audio editing operations and carries out the full edit.

SmartDJ (Ours) • Audio Editing Examples

Overview

Audio editors today usually expect template-like commands (e.g., “add birds”, “remove rain”), and most operate only on mono audio. In practice, users want to provide a single declarative instruction (“make it sound like a quiet sunny forest”) and let the system figure out the steps — while preserving spatial cues in stereo audio.

SmartDJ = Planner + Editor

ALM Planner: perceives the original audio + interprets the goal → emits an edit recipe (atomic steps).
LDM Editor: executes each atomic step sequentially in stereo latent space.

The intermediate plan is natural language, so it's inspectable, editable, and enables human-in-the-loop workflows.

Pipeline snapshot

Method

From declarative goal → atomic edits → high-quality stereo diffusion editing

Problem: Declarative Audio Editing

Input: an original stereo waveform a₀ and a high-level instruction P (e.g., “make it sound like a sunny forest”). The goal is to output an edited waveform a_n that achieves P while preserving all unedited content from the original audio.

SmartDJ decomposes P into a sequence of atomic steps S = {s₁, …, s_n}, then applies them sequentially:

S = ALM(a₀, P)
a_i = LDM(a_{i-1}, s_i)   for i = 1, …, n
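The plan-then-edit recursion above can be sketched as a short loop. This is a minimal illustration, not the authors' code: `smartdj_edit`, `toy_planner`, and `toy_editor` are hypothetical names, and the toy functions are stand-ins for the ALM and LDM so that the control flow is runnable.

```python
from typing import Callable, List

# Hypothetical stereo container: [left_samples, right_samples].
Audio = List[List[float]]


def smartdj_edit(
    a0: Audio,
    instruction: str,
    planner: Callable[[Audio, str], List[str]],
    editor: Callable[[Audio, str], Audio],
) -> List[Audio]:
    """Return the trajectory [a_0, a_1, ..., a_n] of intermediate audios."""
    steps = planner(a0, instruction)  # S = ALM(a_0, P)
    trajectory = [a0]
    for step in steps:  # a_i = LDM(a_{i-1}, s_i)
        trajectory.append(editor(trajectory[-1], step))
    return trajectory


def toy_planner(audio: Audio, instruction: str) -> List[str]:
    # Stand-in for the ALM: always returns the same two-step recipe.
    return ["Remove the sound of rain", "Add the sound of birds at left by 3dB"]


def toy_editor(audio: Audio, step: str) -> Audio:
    # Stand-in for the LDM: copies the audio without changing it.
    return [channel[:] for channel in audio]


if __name__ == "__main__":
    result = smartdj_edit([[0.0, 0.1], [0.0, 0.1]], "quiet sunny forest",
                          toy_planner, toy_editor)
    print(len(result) - 1, "edit steps applied")
```

Because each step consumes the previous output, a user can inspect or rewrite the recipe returned by the planner before any editing runs.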

Atomic Edit Operations

Each step targets a specific event/property:

Add · Remove · Extract · Volume · Change Direction · Time Shift · Reverb · Timbre

Why atomic? It makes complex edits composable and keeps the plan interpretable.
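One way to see why a natural-language recipe stays machine-usable: the planner-style steps shown on this page follow a regular enough pattern to be parsed into structured records. The sketch below is an assumption for illustration only. `AtomicStep`, `parse_step`, and the regular expression are hypothetical, and the grammar covers only the step phrasings used in the high-level examples on this page.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AtomicStep:
    action: str                      # e.g. "add", "remove", "turn up"
    target: str                      # the sound event being edited
    direction: Optional[str] = None  # "left", "right", or "front"
    gain_db: Optional[float] = None  # level in dB, when the step gives one


# Hypothetical grammar: "<action> the sound of <target> [at|to <dir>] [by <n>dB]"
_STEP = re.compile(
    r"^(?P<action>add|remove|extract|turn up|turn down|change)\s+the sound of\s+"
    r"(?P<target>.+?)"
    r"(?:\s+(?:at|to)\s+(?:the\s+)?(?P<direction>left|right|front))?"
    r"(?:\s+by\s+(?P<gain>-?\d+(?:\.\d+)?)\s*dB)?$",
    re.IGNORECASE,
)


def parse_step(text: str) -> AtomicStep:
    """Parse one recipe line into an AtomicStep, or raise ValueError."""
    match = _STEP.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized step: {text!r}")
    direction = match.group("direction")
    gain = match.group("gain")
    return AtomicStep(
        action=match.group("action").lower(),
        target=match.group("target"),
        direction=direction.lower() if direction else None,
        gain_db=float(gain) if gain is not None else None,
    )


if __name__ == "__main__":
    print(parse_step("Add the sound of seagulls squawking at left by 3dB"))
```

A structured form like this makes it straightforward to validate or edit a recipe in a human-in-the-loop tool before passing each step on as text.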

Framework

ALM Planner

The Audio Language Model takes both audio and text and generates a step-by-step edit recipe.

Audio grounding: encode a₀ using a pretrained audio encoder (e.g., CLAP).
Fusion: inject audio embedding into the LLM via adapter layers.
Efficient tuning: freeze the audio encoder; fine-tune adapters + LoRA on the LLM.
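For readers unfamiliar with LoRA, its standard low-rank parameterization is shown below. The page does not state the rank r, the scaling α, or which LLM weight matrices are adapted, so those are left symbolic here.

```latex
h = W_0 x + \Delta W\, x = W_0 x + \frac{\alpha}{r}\, B A\, x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Here W_0 is the frozen pretrained weight and only A and B are trained, which keeps the number of tuned parameters small.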

Stereo LDM Editor

The editor performs each step as conditional diffusion in a compressed latent space.

Stereo VAE: encode stereo audio into latent.
Edit conditioning: concatenate previous latent with a noised latent, condition on step text via cross-attention.
Inference: DDIM sampling + classifier-free guidance for strong prompt adherence.
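The classifier-free guidance step mentioned above combines a conditional and an unconditional noise prediction. The sketch below shows the usual combination rule on plain Python lists. `cfg_combine` and the guidance scale value are illustrative assumptions, since the page does not give the scale used by SmartDJ or how its denoiser is invoked.

```python
from typing import List


def cfg_combine(eps_uncond: List[float], eps_cond: List[float],
                scale: float) -> List[float]:
    """Classifier-free guidance: eps = eps_uncond + scale * (eps_cond - eps_uncond).

    scale = 0 ignores the text prompt, scale = 1 is the plain conditional
    prediction, and scale > 1 pushes further toward the prompt.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    # Toy noise predictions for a two-element latent.
    print(cfg_combine([0.0, 1.0], [1.0, 1.0], scale=3.0))
```

In a sampler, the guided prediction would replace the raw denoiser output at every DDIM step.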

Dataset

Designer–Composer Pipeline

To train SmartDJ, we synthesize editable stereo scenes by sampling labeled single-event clips, mixing them into a base scene, and then rendering stepwise edits from an LLM-generated recipe.

Sample K labeled events (e.g., “car engine”, “bell ring”)
LLM generates a declarative instruction + atomic steps
DSP composer applies each step to produce a₁, …, a_n
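A composer of this kind can render exact targets because it has access to the individual event stems. The sketch below illustrates the idea under several assumptions that are not taken from the page: constant-power panning, three fixed pan positions, "by N dB" interpreted as a gain applied to the named event, and the hypothetical helpers `db_to_gain`, `pan_mono`, `apply_step`, and `render`.

```python
import math
from typing import Dict, List, Optional, Tuple

Stereo = Tuple[List[float], List[float]]  # (left, right) sample lists

# Assumed pan positions in [-1, 1]; the real pipeline may use finer angles.
PAN = {"left": -1.0, "front": 0.0, "right": 1.0}


def db_to_gain(db: float) -> float:
    """Convert a level change in dB to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def pan_mono(mono: List[float], direction: str, gain_db: float = 0.0) -> Stereo:
    """Place a mono clip in the stereo field with constant-power panning."""
    theta = (PAN[direction] + 1.0) * math.pi / 4.0  # 0 (left) .. pi/2 (right)
    g = db_to_gain(gain_db)
    left = [g * math.cos(theta) * x for x in mono]
    right = [g * math.sin(theta) * x for x in mono]
    return left, right


def apply_step(stems: Dict[str, Stereo], action: str, name: str,
               mono: Optional[List[float]] = None, direction: str = "front",
               gain_db: float = 0.0) -> Dict[str, Stereo]:
    """Return a new stem dictionary with one atomic step applied."""
    out = dict(stems)
    if action == "add":
        if mono is None:
            raise ValueError("add needs a mono clip")
        out[name] = pan_mono(mono, direction, gain_db)
    elif action == "remove":
        del out[name]
    elif action == "volume":
        g = db_to_gain(gain_db)
        left, right = out[name]
        out[name] = ([g * x for x in left], [g * x for x in right])
    else:
        raise ValueError(f"unsupported action: {action}")
    return out


def render(stems: Dict[str, Stereo]) -> Stereo:
    """Sum all stems sample by sample into one stereo mix."""
    length = max((len(left) for left, _ in stems.values()), default=0)
    mix_l, mix_r = [0.0] * length, [0.0] * length
    for left, right in stems.values():
        for i, (a, b) in enumerate(zip(left, right)):
            mix_l[i] += a
            mix_r[i] += b
    return mix_l, mix_r
```

Rendering the mix after each call to `apply_step` would yield the sequence of intermediate audios, one per recipe step.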

Dataset illustration

High‑level audio editing examples

High-level instruction: “Make this sound like a workshop by the dock”

ALM-inferred atomic editing steps:

Remove the sound of metal knock

Add the sound of seagulls squawking at left by 3dB

Turn down the sound of motorboat running by 2dB

Add the sound of waves lapping at right by 2dB

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

High-level instruction: “Make this sound like a protest in a city”

ALM-inferred atomic editing steps:

Turn up the sound of emergency siren by 3dB

Remove the sound of man speech

Add the sound of crowd chanting at front by 3dB

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

High-level instruction: “Make this sound like a serene beach”

ALM-inferred atomic editing steps:

Remove the sound of whistling

Turn up the sound of wave crash by 4dB

Add the sound of seagulls calling at front by 3dB

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

High-level instruction: “Make this sound like a busy city street”

ALM-inferred atomic editing steps:

Add the sound of distant sirens at left by 3 dB

Add the sound of footsteps on pavement at right by 2 dB

Turn down the sound of engine rev by 2dB

Remove the sound of bell ring

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

High-level instruction: “Make this sound like a cozy living room”

ALM-inferred atomic editing steps:

Add the sound of fireplace crackle at left by 3dB

Turn down the sound of woman speech by 2dB

Remove the sound of cat meow

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

High-level instruction: “Make this sound like in an outdoor concert”

ALM-inferred atomic editing steps:

Remove the sound of whistle

Turn down the sound of woman speech by 2dB

Add the sound of guitar strumming at left by 2dB

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

High-level instruction: “Make this sound like a busy office”

ALM-inferred atomic editing steps:

Remove the sound of drilling

Add the sound of phone ringing at right by 3dB

Turn up the sound of typewriter type by 2dB

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

High-level instruction: “Make this sound like a quiet workshop”

ALM-inferred atomic editing steps:

Add the sound of soft hammering at right by 2dB

Remove the sound of tractor thud

Change the sound of object crumple to left

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

High-level instruction: “Make this sound like a city park”

ALM-inferred atomic editing steps:

Add the sound of bicycle bells at left by 3dB

Add the sound of footsteps on gravel at right by 2dB

Remove the sound of engine rev

Turn up the sound of baby laugh by 2dB

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

High-level instruction: “Craft this sound like a peaceful farm night”

ALM-inferred atomic editing steps:

Add the sound of crickets chirping at front by 2dB

Turn up the sound of snore by 3dB

Turn down the sound of goat bleat by 2dB

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

High-level instruction: “Make this sound like a busy daycare center”

ALM-inferred atomic editing steps:

Turn up the sound of child cry by 3dB

Remove the sound of car engine

Add the sound of toys clattering at left by 2dB

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

High-level instruction: “Make this sound like a military training ground”

ALM-inferred atomic editing steps:

Add the sound of cannon fire at right by 4dB

Turn up the sound of clank by 2dB

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

Atomic editing action: Add

Edit instruction: “Add the sound of water falling at the front with -1 db”

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

Edit instruction: “Add the sound of engine revs at the left with 0 db”

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

Edit instruction: “Add the sound of music playing and people singing at the right with 0 db”

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

Atomic editing action: Remove

Edit instruction: “Remove the sound of baby crying at the front”

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

Target (Ground truth)

Edit instruction: “Remove the sound of man speaking at left”

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

Target (Ground truth)

Atomic editing action: Extract

Edit instruction: “extract the sound of water pours, horn honks, and man speaks at the front”

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

Target (Ground truth)

Edit instruction: “extract the sound of whistles”

Original

ZETA

AudioEditor

Audit

SmartDJ (Ours)

Target (Ground truth)

Atomic editing action: Change sound direction

Edit instruction: “Change the sound of woman speaking, food frying at the front to the right”

Original

Audit

SmartDJ (Ours)

Target (Ground truth)

Edit instruction: “change the sound of whistling and male speech to the left”

Original

Audit

SmartDJ (Ours)

Target (Ground truth)

Atomic editing action: Turn up/down

Edit instruction: “Turn up the sound of waves crashing, wind blows by 6 db”

Original

Audit

SmartDJ (Ours)

Target (Ground truth)

Edit instruction: “Turn down the sound of typewriter by 6 db”

Original

Audit

SmartDJ (Ours)

Target (Ground truth)

BibTeX

@article{lan2025guiding,
  title={Guiding audio editing with audio language model},
  author={Lan, Zitong and Hao, Yiduo and Zhao, Mingmin},
  journal={arXiv preprint arXiv:2509.21625},
  year={2025}
}
