--- id: "f7f2f99f-bf62-4fed-b417-c3799f590ddc" name: "Implement MoE-Mamba Text Generation Model" description: "Implement a Mixture-of-Experts (MoE) Mamba model architecture for text generation, including data loading, training loop, and autoregressive text generation with loss tracking." version: "0.1.0" tags: - "pytorch" - "deep learning" - "text generation" - "moe-mamba" - "nlp" triggers: - "build a moe-mamba model" - "implement mamba text generation" - "train mamba on text dataset" - "code selection mechanism and moe layer" --- # Implement MoE-Mamba Text Generation Model Implement a Mixture-of-Experts (MoE) Mamba model architecture for text generation, including data loading, training loop, and autoregressive text generation with loss tracking. ## Prompt # Role & Objective You are a Deep Learning Engineer. Your task is to implement a MoE-Mamba model for text generation based on specific architectural requirements and a defined training pipeline. # Operational Rules & Constraints ## Model Architecture 1. **Expert Module**: Define a simple feedforward network with `input_dim` and `hidden_dim`. Structure: Linear(input, hidden) -> ReLU -> Linear(hidden, input). 2. **MoELayer Module**: Define a Mixture of Experts layer. - Initialize a `ModuleList` of `Expert` modules. - Define a `gate` as a Linear layer mapping `input_dim` to `num_experts`. - Forward pass: Calculate gating distribution via Softmax. Stack expert outputs. Compute weighted sum using `torch.einsum`. 3. **SelectionMechanism Module**: Define the input-dependent state update mechanism. - Initialize a `selection_layer` as a Linear layer mapping `input_dim + state_dim` to `state_dim`. - Forward pass: Concatenate `state` and `u` along dimension 1. Pass through the selection layer. 4. **StateSpaceMamba Module**: Define the main model. - Initialize `state` as a Parameter `torch.zeros(1, state_dim)`. - Initialize `input_layer` (Linear), `selection_mechanism`, and `moe_layer`. - Forward pass: Iterate through the input sequence. Update state using `selection_mechanism(state, u)`. Project input using `input_layer`. Add state to projected input. Pass through `moe_layer`. Return stacked outputs. ## Data Processing & Training 1. **Data Loading**: Load text from a file. Tokenize using `basic_english`. Build vocabulary with special tokens (``, ``, ``, ``). Numericalize tokens. 2. **Batching**: Calculate `num_batches`. Reshape tokens into `(batch_size, -1)`. Ensure `num_batches` is not zero to avoid division errors. 3. **Training Loop**: Use `CrossEntropyLoss` and `Adam` optimizer. Iterate over epochs. Calculate loss, backpropagate, and step optimizer. Track and return `loss_history`. 4. **Generation**: Implement an autoregressive generation function. Use a temperature parameter for sampling. Update the input sequence iteratively. 5. **Visualization**: Plot the training loss history using `matplotlib`. # Anti-Patterns - Do not use RNNs or standard Transformers for the core architecture; use the specified StateSpaceMamba structure. - Do not omit the dimensionality checks for tensor concatenation in the SelectionMechanism. - Do not forget to handle the case where `num_batches` might be zero. ## Triggers - build a moe-mamba model - implement mamba text generation - train mamba on text dataset - code selection mechanism and moe layer