---
name: audiocraft-audio-generation
description: PyTorch library for audio generation including text-to-music (MusicGen) and text-to-sound (AudioGen). Use when you need to generate music from text descriptions, create sound effects, or perform melody-conditioned music generation.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Multimodal, Audio Generation, Text-to-Music, Text-to-Audio, MusicGen]
dependencies: [audiocraft, torch>=2.0.0, transformers>=4.30.0]
---

# AudioCraft: Audio Generation

Comprehensive guide to using Meta's AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.

## When to use AudioCraft

**Use AudioCraft when:**
- Need to generate music from text descriptions
- Creating sound effects and environmental audio
- Building music generation applications
- Need melody-conditioned music generation
- Want stereo audio output
- Require controllable music generation with style transfer

**Key features:**
- **MusicGen**: Text-to-music generation with melody conditioning
- **AudioGen**: Text-to-sound effects generation
- **EnCodec**: High-fidelity neural audio codec
- **Multiple model sizes**: Small (300M) to Large (3.3B)
- **Stereo support**: Full stereo audio generation
- **Style conditioning**: MusicGen-Style for reference-based generation

**Use alternatives instead:**
- **Stable Audio**: For longer commercial music generation
- **Bark**: For text-to-speech with music/sound effects
- **Riffusion**: For spectrogram-based music generation
- **OpenAI Jukebox**: For raw audio generation with lyrics

## Quick start

### Installation

```bash
# From PyPI
pip install audiocraft

# From GitHub (latest)
pip install git+https://github.com/facebookresearch/audiocraft.git

# Or use HuggingFace Transformers
pip install transformers torch torchaudio
```

### Basic text-to-music (AudioCraft)

```python
import torchaudio
from audiocraft.models import MusicGen

# Load model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# Set generation parameters
model.set_generation_params(
    duration=8,        # seconds
    top_k=250,
    temperature=1.0
)

# Generate from text
descriptions = ["happy upbeat electronic dance music with synths"]
wav = model.generate(descriptions)

# Save audio
torchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)
```

### Using HuggingFace Transformers

```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import scipy.io.wavfile

# Load model and processor
processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
model.to("cuda")

# Generate music
inputs = processor(
    text=["80s pop track with bassy drums and synth"],
    padding=True,
    return_tensors="pt"
).to("cuda")

audio_values = model.generate(
    **inputs,
    do_sample=True,
    guidance_scale=3,
    max_new_tokens=256
)

# Save
sampling_rate = model.config.audio_encoder.sampling_rate
scipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())
```

### Text-to-sound with AudioGen

```python
import torchaudio
from audiocraft.models import AudioGen

# Load AudioGen
model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)

# Generate sound effects
descriptions = ["dog barking in a park with birds chirping"]
wav = model.generate(descriptions)

torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)
```
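### Saving with audio_write (optional)

AudioCraft also ships a small helper for writing loudness-normalized audio files, shown in the upstream README. The sketch below uses `audiocraft.data.audio.audio_write` as an alternative to `torchaudio.save`; the `strategy` and `loudness_compressor` arguments follow the README, so verify them against your installed version.

```python
from audiocraft.data.audio import audio_write
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=8)
wav = model.generate(["warm acoustic guitar ballad"])

# audio_write appends the .wav extension and applies loudness normalization
for idx, one_wav in enumerate(wav):
    audio_write(f"track_{idx}", one_wav.cpu(), model.sample_rate,
                strategy="loudness", loudness_compressor=True)
```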
## Core concepts

### Architecture overview

```
AudioCraft Architecture:

┌──────────────────────────────────────────────────────────────┐
│                      Text Encoder (T5)                       │
│                            │                                 │
│                     Text Embeddings                          │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                  Transformer Decoder (LM)                    │
│          Auto-regressively generates audio tokens            │
│         using efficient token interleaving patterns          │
└────────────────────────┬─────────────────────────────────────┘
                         │
┌────────────────────────▼─────────────────────────────────────┐
│                   EnCodec Audio Decoder                      │
│            Converts tokens back to audio waveform            │
└──────────────────────────────────────────────────────────────┘
```

### Model variants

| Model | Size | Description | Use Case |
|-------|------|-------------|----------|
| `musicgen-small` | 300M | Text-to-music | Quick generation |
| `musicgen-medium` | 1.5B | Text-to-music | Balanced |
| `musicgen-large` | 3.3B | Text-to-music | Best quality |
| `musicgen-melody` | 1.5B | Text + melody | Melody conditioning |
| `musicgen-melody-large` | 3.3B | Text + melody | Best melody |
| `musicgen-stereo-*` | Varies | Stereo output | Stereo generation |
| `musicgen-style` | 1.5B | Style transfer | Reference-based |
| `audiogen-medium` | 1.5B | Text-to-sound | Sound effects |

### Generation parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `duration` | 8.0 | Length in seconds (1-120) |
| `top_k` | 250 | Top-k sampling |
| `top_p` | 0.0 | Nucleus sampling (0 = disabled) |
| `temperature` | 1.0 | Sampling temperature |
| `cfg_coef` | 3.0 | Classifier-free guidance |

## MusicGen usage

### Text-to-music generation

```python
from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-medium')

# Configure generation
model.set_generation_params(
    duration=30,       # Up to 30 seconds
    top_k=250,         # Sampling diversity
    top_p=0.0,         # 0 = use top_k only
    temperature=1.0,   # Creativity (higher = more varied)
    cfg_coef=3.0       # Text adherence (higher = stricter)
)

# Generate multiple samples
descriptions = [
    "epic orchestral soundtrack with strings and brass",
    "chill lo-fi hip hop beat with jazzy piano",
    "energetic rock song with electric guitar"
]

# Generate (returns [batch, channels, samples])
wav = model.generate(descriptions)

# Save each
for i, audio in enumerate(wav):
    torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)
```

### Melody-conditioned generation

```python
from audiocraft.models import MusicGen
import torchaudio

# Load melody model
model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=30)

# Load melody audio
melody, sr = torchaudio.load("melody.wav")

# Generate with melody conditioning
descriptions = ["acoustic guitar folk song"]
wav = model.generate_with_chroma(descriptions, melody, sr)

torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)
```

### Stereo generation

```python
from audiocraft.models import MusicGen
import torchaudio

# Load stereo model
model = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')
model.set_generation_params(duration=15)

descriptions = ["ambient electronic music with wide stereo panning"]
wav = model.generate(descriptions)

# wav shape: [batch, 2, samples] for stereo
print(f"Stereo shape: {wav.shape}")  # [1, 2, 480000]

torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)
```

### Audio continuation

```python
import torchaudio
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")

# Load audio to continue (should match the model's 32 kHz sampling rate)
audio, sr = torchaudio.load("intro.wav")

# Process with text and audio
inputs = processor(
    audio=audio.squeeze().numpy(),
    sampling_rate=sr,
    text=["continue with an epic chorus"],
    padding=True,
    return_tensors="pt"
)

# Generate continuation
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)
```
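If you are working with the native AudioCraft API rather than Transformers, the same idea can be sketched with `MusicGen.generate_continuation`, which takes the prompt waveform plus its sample rate. Treat the exact signature and the prompt-length behavior as assumptions to verify against your installed audiocraft version; the mono downmix is an illustrative precaution.

```python
import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-medium')
model.set_generation_params(duration=20)  # target duration of the full output

# Prompt audio to continue (ideally just the last few seconds of the intro)
prompt, sr = torchaudio.load("intro.wav")
if prompt.shape[0] > 1:
    prompt = prompt.mean(dim=0, keepdim=True)  # downmix to mono for the mono model

# Continue the prompt, optionally steered by a text description
wav = model.generate_continuation(
    prompt,
    prompt_sample_rate=sr,
    descriptions=["continue with an epic chorus"]
)

torchaudio.save("continuation.wav", wav[0].cpu(), sample_rate=32000)
```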
## MusicGen-Style usage

### Style-conditioned generation

```python
import torchaudio
from audiocraft.models import MusicGen

# Load style model
model = MusicGen.get_pretrained('facebook/musicgen-style')

# Configure generation with style
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=5.0    # Style influence
)

# Configure style conditioner
model.set_style_conditioner_params(
    eval_q=3,            # RVQ quantizers (1-6)
    excerpt_length=3.0   # Style excerpt length in seconds
)

# Load style reference
style_audio, sr = torchaudio.load("reference_style.wav")

# Generate with text + style
descriptions = ["upbeat dance track"]
wav = model.generate_with_style(descriptions, style_audio, sr)
```

### Style-only generation (no text)

```python
# Generate matching the style without a text prompt
model.set_generation_params(
    duration=30,
    cfg_coef=3.0,
    cfg_coef_beta=None   # Disable double CFG for style-only
)

wav = model.generate_with_style([None], style_audio, sr)
```

## AudioGen usage

### Sound effect generation

```python
from audiocraft.models import AudioGen
import torchaudio

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=10)

# Generate various sounds
descriptions = [
    "thunderstorm with heavy rain and lightning",
    "busy city traffic with car horns",
    "ocean waves crashing on rocks",
    "crackling campfire in forest"
]

wav = model.generate(descriptions)

for i, audio in enumerate(wav):
    torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)
```

## EnCodec usage

### Audio compression

```python
from audiocraft.models import CompressionModel
import torch
import torchaudio

# Load EnCodec
model = CompressionModel.get_pretrained('facebook/encodec_32khz')

# Load audio
wav, sr = torchaudio.load("audio.wav")

# Ensure correct sample rate
if sr != 32000:
    resampler = torchaudio.transforms.Resample(sr, 32000)
    wav = resampler(wav)

# The 32 kHz codec is mono; downmix if needed
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)

# Encode to tokens
with torch.no_grad():
    encoded = model.encode(wav.unsqueeze(0))
    codes = encoded[0]  # Audio codes

# Decode back to audio
with torch.no_grad():
    decoded = model.decode(codes)

torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)
```
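The 32 kHz MusicGen codec produces roughly 50 token frames per second of audio, which is why the Transformers examples above use `max_new_tokens=256` for about 5 seconds. The helper below is an illustrative sketch (not an official API) that makes the duration-to-token conversion explicit; reading `frame_rate` from the loaded config is an assumption, so it falls back to the documented 50 Hz if the attribute is absent.

```python
from transformers import MusicgenForConditionalGeneration

model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# Token frames generated per second of audio (fall back to the documented 50 Hz)
frame_rate = getattr(model.config.audio_encoder, "frame_rate", 50)

def tokens_for_duration(seconds: float) -> int:
    """Approximate max_new_tokens needed for a target duration."""
    return int(seconds * frame_rate)

print(tokens_for_duration(5))   # ~250 tokens
print(tokens_for_duration(30))  # ~1500 tokens
```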
"epic_music.wav") ``` ### Workflow 2: Sound design batch processing ```python import json from pathlib import Path from audiocraft.models import AudioGen import torchaudio def batch_generate_sounds(sound_specs, output_dir): """ Generate multiple sounds from specifications. Args: sound_specs: list of {"name": str, "description": str, "duration": float} output_dir: output directory path """ model = AudioGen.get_pretrained('facebook/audiogen-medium') output_dir = Path(output_dir) output_dir.mkdir(exist_ok=True) results = [] for spec in sound_specs: model.set_generation_params(duration=spec.get("duration", 5)) wav = model.generate([spec["description"]]) output_path = output_dir / f"{spec['name']}.wav" torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000) results.append({ "name": spec["name"], "path": str(output_path), "description": spec["description"] }) return results # Usage sounds = [ {"name": "explosion", "description": "massive explosion with debris", "duration": 3}, {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5}, {"name": "door", "description": "wooden door creaking and closing", "duration": 2} ] results = batch_generate_sounds(sounds, "sound_effects/") ``` ### Workflow 3: Gradio demo ```python import gradio as gr import torch import torchaudio from audiocraft.models import MusicGen model = MusicGen.get_pretrained('facebook/musicgen-small') def generate_music(prompt, duration, temperature, cfg_coef): model.set_generation_params( duration=duration, temperature=temperature, cfg_coef=cfg_coef ) with torch.no_grad(): wav = model.generate([prompt]) # Save to temp file path = "temp_output.wav" torchaudio.save(path, wav[0].cpu(), sample_rate=32000) return path demo = gr.Interface( fn=generate_music, inputs=[ gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"), gr.Slider(1, 30, value=8, label="Duration (seconds)"), gr.Slider(0.5, 2.0, value=1.0, label="Temperature"), gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient") ], outputs=gr.Audio(label="Generated Music"), title="MusicGen Demo" ) demo.launch() ``` ## Performance optimization ### Memory optimization ```python # Use smaller model model = MusicGen.get_pretrained('facebook/musicgen-small') # Clear cache between generations torch.cuda.empty_cache() # Generate shorter durations model.set_generation_params(duration=10) # Instead of 30 # Use half precision model = model.half() ``` ### Batch processing efficiency ```python # Process multiple prompts at once (more efficient) descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"] wav = model.generate(descriptions) # Single batch # Instead of for desc in descriptions: wav = model.generate([desc]) # Multiple batches (slower) ``` ### GPU memory requirements | Model | FP32 VRAM | FP16 VRAM | |-------|-----------|-----------| | musicgen-small | ~4GB | ~2GB | | musicgen-medium | ~8GB | ~4GB | | musicgen-large | ~16GB | ~8GB | ## Common issues | Issue | Solution | |-------|----------| | CUDA OOM | Use smaller model, reduce duration | | Poor quality | Increase cfg_coef, better prompts | | Generation too short | Check max duration setting | | Audio artifacts | Try different temperature | | Stereo not working | Use stereo model variant | ## References - **[Advanced Usage](references/advanced-usage.md)** - Training, fine-tuning, deployment - **[Troubleshooting](references/troubleshooting.md)** - Common issues and solutions ## Resources - **GitHub**: https://github.com/facebookresearch/audiocraft - **Paper (MusicGen)**: 
## References

- **[Advanced Usage](references/advanced-usage.md)** - Training, fine-tuning, deployment
- **[Troubleshooting](references/troubleshooting.md)** - Common issues and solutions

## Resources

- **GitHub**: https://github.com/facebookresearch/audiocraft
- **Paper (MusicGen)**: https://arxiv.org/abs/2306.05284
- **Paper (AudioGen)**: https://arxiv.org/abs/2209.15352
- **HuggingFace**: https://huggingface.co/facebook/musicgen-small
- **Demo**: https://huggingface.co/spaces/facebook/MusicGen