AceStepTransformer1DModel
A 1D Diffusion Transformer from ACE-Step 1.5 for music generation. The model applies flow matching to the 25 Hz stereo latents produced by AutoencoderOobleck. It is built on a Qwen3-derived backbone (grouped-query attention, rotary position embeddings, RMSNorm, AdaLN-Zero timestep conditioning) and adds cross-attention to the text, lyric, and timbre conditions built by AceStepConditionEncoder.
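As a rough illustration of what the 25 Hz latent rate means for sequence length, the sketch below converts an audio duration into latent frames and then into transformer tokens. It assumes the constructor default `patch_size` of 2 and assumes the frame count is padded up to a multiple of the patch size; the helper names are hypothetical and not part of the library.

```python
import math

LATENT_RATE_HZ = 25  # latent frame rate stated in the model description
PATCH_SIZE = 2       # constructor default; assumed unchanged here


def latent_frames(duration_s: float, rate_hz: int = LATENT_RATE_HZ) -> int:
    """Number of latent frames needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * rate_hz)


def transformer_tokens(duration_s: float, patch_size: int = PATCH_SIZE) -> int:
    """Sequence length after 1D patching.

    Assumption: the frame count is padded up to a multiple of `patch_size`.
    """
    return math.ceil(latent_frames(duration_s) / patch_size)


print(latent_frames(30), transformer_tokens(30))
```

Under these assumptions, a 30-second clip corresponds to 750 latent frames and 375 tokens in the transformer stack.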
AceStepTransformer1DModel
class diffusers.AceStepTransformer1DModel
(
    hidden_size: int = 2048,
    intermediate_size: int = 6144,
    num_hidden_layers: int = 24,
    num_attention_heads: int = 16,
    num_key_value_heads: int = 8,
    head_dim: int = 128,
    in_channels: int = 192,
    audio_acoustic_hidden_dim: int = 64,
    patch_size: int = 2,
    rope_theta: float = 1000000.0,
    attention_bias: bool = False,
    attention_dropout: float = 0.0,
    rms_norm_eps: float = 1e-06,
    sliding_window: int = 128,
    layer_types: typing.Optional[typing.List[str]] = None,
    encoder_hidden_size: typing.Optional[int] = None,
    is_turbo: bool = False,
    model_version: typing.Optional[str] = None
)
Diffusion Transformer for ACE-Step 1.5 music generation.
Generates audio latents conditioned on text, lyrics, and timbre. Uses 1D patch embedding (Conv1d with stride patch_size) followed by a stack of AceStepTransformerBlocks with alternating sliding-window / full attention on the self-attention branch. Cross-attention consumes the packed encoder_hidden_states produced by AceStepConditionEncoder.
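The patch embedding and the alternating attention pattern described above can be sketched in plain PyTorch. This is a minimal illustration using the constructor defaults, not the library's implementation: setting `kernel_size` equal to the stride (non-overlapping patches), the layer-type strings, and the choice of which attention type comes first are all assumptions.

```python
import torch
from torch import nn

# Constructor defaults taken from the class signature.
IN_CHANNELS, HIDDEN_SIZE, PATCH_SIZE, NUM_LAYERS = 192, 2048, 2, 24

# A Conv1d whose stride equals patch_size merges every PATCH_SIZE neighbouring
# latent frames into one token. kernel_size == stride is an assumption.
patch_embed = nn.Conv1d(IN_CHANNELS, HIDDEN_SIZE, kernel_size=PATCH_SIZE, stride=PATCH_SIZE)

frames = torch.randn(1, 50, IN_CHANNELS)      # (batch, seq_len, channels)
tokens = patch_embed(frames.transpose(1, 2))  # Conv1d expects (batch, channels, seq_len)
tokens = tokens.transpose(1, 2)               # (batch, seq_len // PATCH_SIZE, hidden_size)

# Alternating self-attention pattern across the block stack. The labels and
# the starting type are hypothetical placeholders.
layer_types = [
    "sliding_attention" if i % 2 == 0 else "full_attention"
    for i in range(NUM_LAYERS)
]

print(tuple(tokens.shape), layer_types[:4])
```

With 50 input frames and a patch size of 2, the block stack in this sketch sees 25 tokens of width 2048.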
forward
( hidden_states: Tensor, timestep: Tensor, timestep_r: Tensor, encoder_hidden_states: Tensor, context_latents: Tensor, return_dict: bool = True ) → Transformer2DModelOutput or tuple
Parameters
- hidden_states (torch.Tensor of shape (batch_size, seq_len, channels)): Noisy latent input for the diffusion process.
- timestep (torch.Tensor of shape (batch_size,)): Current diffusion timestep t.
- timestep_r (torch.Tensor of shape (batch_size,)): Reference timestep r (set equal to t for standard inference).
- encoder_hidden_states (torch.Tensor of shape (batch_size, encoder_seq_len, hidden_size)): Conditioning embeddings from the condition encoder (text + lyrics + timbre).
- context_latents (torch.Tensor of shape (batch_size, seq_len, context_dim)): Context latents (source latents concatenated with chunk masks), fed to the patchify conv alongside hidden_states.
- return_dict (bool, defaults to True): Whether to return a Transformer2DModelOutput or a plain tuple.
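To make the expected shapes concrete, the sketch below builds dummy tensors for each forward argument. The channel split is an assumption inferred from the constructor defaults: 64 latent channels (`audio_acoustic_hidden_dim`) plus a 128-channel context (64 source-latent channels and 64 chunk-mask channels) would add up to `in_channels = 192`. The concatenation order shown is likewise a guess.

```python
import torch

batch_size, seq_len, encoder_seq_len = 2, 50, 77
hidden_size = 2048  # constructor default
latent_dim = 64     # audio_acoustic_hidden_dim default
context_dim = 128   # assumption: 64 source-latent + 64 chunk-mask channels

inputs = {
    "hidden_states": torch.randn(batch_size, seq_len, latent_dim),
    "timestep": torch.full((batch_size,), 0.5),
    "encoder_hidden_states": torch.randn(batch_size, encoder_seq_len, hidden_size),
    "context_latents": torch.randn(batch_size, seq_len, context_dim),
}
# Standard inference: the reference timestep r is set equal to t.
inputs["timestep_r"] = inputs["timestep"].clone()

# Assumed input to the patchify conv: channel-wise concatenation, 128 + 64 = 192.
patchify_input = torch.cat([inputs["context_latents"], inputs["hidden_states"]], dim=-1)
print(tuple(patchify_input.shape))
```

If the channel sizes match the loaded checkpoint's configuration, a dictionary like this could presumably be passed to the model as keyword arguments, since its keys mirror the forward signature.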
Returns
Transformer2DModelOutput or tuple
The predicted velocity field.
The AceStepTransformer1DModel forward method.
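Because the forward pass returns a velocity field, a sampler turns it into latents by integrating an ODE. The following is a minimal Euler sketch with a toy velocity function standing in for the model. The time convention (t = 1 is pure noise, t = 0 is data), the uniform step schedule, and the function names are assumptions for illustration; the actual pipeline's scheduler may differ.

```python
import torch


def euler_sample(velocity_fn, noise, num_steps=8):
    """Integrate dx/dt = v(x, t) from t=1 (noise) to t=0 (data) with Euler steps."""
    timesteps = torch.linspace(1.0, 0.0, num_steps + 1)
    x = noise
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        t_batch = t.expand(x.shape[0])  # shape (batch_size,), like `timestep`
        v = velocity_fn(x, t_batch)     # stand-in for the model's forward pass
        x = x + (t_next - t) * v        # dt is negative: move toward t=0
    return x


# Toy check: on the straight path x_t = t * noise + (1 - t) * data the true
# velocity is the constant (noise - data), so Euler integration recovers `data`.
data = torch.randn(2, 50, 64)
noise = torch.randn(2, 50, 64)
sample = euler_sample(lambda x, t: noise - data, noise)
print(torch.allclose(sample, data, atol=1e-4))
```

In a real sampling loop, `velocity_fn` would wrap a call to the model with `timestep_r` set equal to `timestep`, as described for standard inference.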