---
name: account-aware-training
description: "Add account state (P&L, win rate, drawdown) to RL observations + drawdown penalty in rewards. Trigger when: (1) model needs account awareness, (2) training should penalize drawdowns, (3) upgrading obs_dim 5300→5600."
author: Claude Code
date: 2024-12-26
---

# Account-Aware RL Training (v2.4)

## Experiment Overview

| Item | Details |
|------|---------|
| **Date** | 2024-12-26 |
| **Goal** | Make the RL model learn from account state (P&L, win rate, drawdown) |
| **Environment** | vectorized_env.py, inference_obs_builder.py, training notebook |
| **Status** | Success |

## Context

Prior to v2.4, the RL model was "blind" to account performance. It received:

- 53 features: price action, technicals, regime probabilities, calendar effects
- No information about cumulative P&L, win rate, or drawdown

**Problem**: The model could generate signals that were individually good but led to excessive drawdowns at the account level. It had no incentive to trade conservatively after losses.

**Solution**: Add 3 account-level features plus a drawdown penalty in rewards.

## Verified Workflow

### 1. Config Parameters (GPUEnvConfig)

```python
# In vectorized_env.py GPUEnvConfig dataclass (~line 405)
# Account-aware training (v2.4)
drawdown_penalty_threshold: float = 0.15  # Penalize when drawdown > 15%
drawdown_penalty_weight: float = 0.10     # Weight in reward function
```

### 2. Equity Tracking Tensors

```python
# In _init_state_tensors() after line 712
# Account-level equity tracking (v2.4)
self.initial_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
self.peak_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
self.current_equity = torch.ones(self.n_envs, dtype=self.dtype, device=self.device)
```

### 3. Reset Equity Tensors

```python
# In reset() after line 850
# Reset account-level equity tracking
self.initial_equity[env_ids] = 1.0
self.peak_equity[env_ids] = 1.0
self.current_equity[env_ids] = 1.0
```

### 4. Update Equity in step()

```python
# In step() after line 926
# Update account-level equity tracking (v2.4)
self.current_equity = self.initial_equity + self.total_pnl / (current_prices + 1e-8)
self.peak_equity = torch.maximum(self.peak_equity, self.current_equity)
```
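Before wiring the new features into the observation tensor, it helps to see the bookkeeping above in isolation. The following is a minimal standalone sketch (toy tensors for two envs, outside the environment class, price fixed at 100) of the step 2-4 updates, showing how `torch.maximum` keeps `peak_equity` at its high-water mark while `current_equity` dips:

```python
import torch

# Toy replay of the v2.4 equity bookkeeping for 2 envs over 3 steps.
# Mirrors step(): equity = initial + total_pnl normalized by price.
initial_equity = torch.ones(2)
peak_equity = torch.ones(2)

for total_pnl in (torch.tensor([5.0, -3.0]),
                  torch.tensor([2.0, -8.0]),
                  torch.tensor([9.0, -1.0])):
    current_equity = initial_equity + total_pnl / (100.0 + 1e-8)  # price fixed at 100
    peak_equity = torch.maximum(peak_equity, current_equity)      # high-water mark never falls
    drawdown = torch.clamp((peak_equity - current_equity) / (peak_equity + 1e-8), 0.0, 1.0)
    print(f"equity={current_equity.tolist()} peak={peak_equity.tolist()} dd={drawdown.tolist()}")
```

Env 0 dips below its 1.05 peak on the second step (drawdown ~2.9%) and sets a new peak on the third; env 1's peak stays pinned at 1.0 through its losing streak, so its drawdown tracks every loss.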
### 5. Feature Count Update

```python
# In _calculate_obs_features() line 682
# Add account features
account = 3  # total_pnl_pct, rolling_win_rate, current_drawdown_pct
return base + technical + intraday + temporal + markov + extended + multi_window + account
# Result: 53 + 3 = 56 features
```

### 6. Account Features in Observations

```python
# In _get_observations() after line 1258, before sanitization
# === ACCOUNT-LEVEL FEATURES (3) - v2.4 ===

# Feature 1: Total P&L % (normalized to [-1, 1])
total_pnl_pct = self.total_pnl / (self.initial_equity + 1e-8)
total_pnl_pct_norm = torch.tanh(total_pnl_pct * 10)
obs[:, :, feat_idx] = total_pnl_pct_norm[env_ids].unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1

# Feature 2: Rolling win rate (0.5 if no trades)
win_rate = torch.where(
    self.n_trades[env_ids] > 0,
    self.n_wins[env_ids].float() / self.n_trades[env_ids].float(),
    torch.full((n_envs,), 0.5, dtype=self.dtype, device=self.device)
)
obs[:, :, feat_idx] = win_rate.unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1

# Feature 3: Current drawdown % [0, 1]
drawdown = (self.peak_equity[env_ids] - self.current_equity[env_ids]) / (self.peak_equity[env_ids] + 1e-8)
drawdown = torch.clamp(drawdown, 0.0, 1.0)
obs[:, :, feat_idx] = drawdown.unsqueeze(1).expand(-1, self.config.window)
feat_idx += 1
```

### 7. Drawdown Penalty in Rewards

```python
# In _calculate_rewards() after line 1618
# COMPONENT 7: Drawdown penalty (v2.4)
current_drawdown = (self.peak_equity - self.current_equity) / (self.peak_equity + 1e-8)
current_drawdown = torch.clamp(current_drawdown, 0.0, 1.0)

# Quadratic penalty when over threshold
drawdown_over_threshold = torch.clamp(current_drawdown - self.config.drawdown_penalty_threshold, min=0.0)
drawdown_penalty = -drawdown_over_threshold ** 2 * 10

# Add to reward combination:
reward = (
    self.config.direction_weight * direction_reward
    + self.config.magnitude_weight * magnitude_reward
    + self.config.pnl_weight * pnl_reward
    + self.config.stop_tp_weight * stop_tp_reward
    + self.config.exploration_weight * exploration_bonus
    + self.config.slippage_weight * slippage_penalty
    + self.config.drawdown_penalty_weight * drawdown_penalty  # NEW
) * risk_adjustment
```
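As a quick sanity check on the shaping above, here is the penalty computed by hand at the v2.4 defaults (plain Python, no env needed); the threshold, the `* 10` constant, and the 0.10 reward weight are taken straight from the config and formula above:

```python
# Drawdown penalty by hand at the v2.4 defaults
# (threshold 0.15, drawdown_penalty_weight 0.10).
threshold, weight = 0.15, 0.10

for dd in (0.10, 0.16, 0.20, 0.25, 0.40):
    over = max(dd - threshold, 0.0)
    penalty = -(over ** 2) * 10  # same formula as _calculate_rewards()
    print(f"drawdown {dd:.0%}: penalty {penalty:+.4f}, weighted {weight * penalty:+.5f}")
```

At a 16% drawdown the weighted penalty is only -0.0001; at 25% it is -0.01 and at 40% it is -0.0625. That is the quadratic "gentle near the threshold, harsh beyond it" behavior called out in the insights below.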
### 8. Inference Observation Builder

```python
# In inference_obs_builder.py get_target_features_from_obs_dim()
if features == 56:
    return 56  # v2.4 with account awareness
elif features == 53:
    return 53  # v2.3
# ... legacy support

# In build_inference_observation() after line 624
# === ACCOUNT-LEVEL FEATURES (3) - v2.4 ===
# Use neutral defaults during inference
if target_features >= 56:
    obs[:, feat_idx] = 0.0  # total_pnl_pct (no prior trades)
    feat_idx += 1
    obs[:, feat_idx] = 0.5  # win_rate (neutral prior)
    feat_idx += 1
    obs[:, feat_idx] = 0.0  # drawdown (no drawdown)
    feat_idx += 1
```

## Failed Attempts (Critical)

| Attempt | Why it Failed | Lesson Learned |
|---------|---------------|----------------|
| Account features with raw P&L values | P&L scale varies by price level | Use P&L percentage normalized with tanh |
| Win rate = 0 when no trades | Invalid input during initial episodes | Default to 0.5 (neutral prior) |
| Peak equity updated by plain assignment | Peak could decrease after losses (logical error) | Use torch.maximum() to track the high-water mark |
| Linear drawdown penalty | Too harsh at moderate drawdowns | Quadratic scaling stays gentle near the threshold |
| Live inference with account state | Would need a real account connection | Use neutral defaults (0, 0.5, 0) for inference |

## Final Parameters

```yaml
# GPUEnvConfig (v2.4)
n_features: 56                    # Was 53 in v2.3
drawdown_penalty_threshold: 0.15  # 15% drawdown starts penalty
drawdown_penalty_weight: 0.10     # Moderate weight in reward

# Feature breakdown (56 total)
base_features: 7          # price action basics
technical_features: 4     # intraday technicals
temporal_features: 7      # calendar features
markov_features: 12       # 4-chain regime probabilities
extended_features: 14     # extended technicals
multi_window_features: 9  # 20/50/100 bar windows
account_features: 3       # P&L %, win rate, drawdown %

# obs_dim = n_features * window = 56 * 100 = 5600
```

## Key Insights

- **Breaking Change**: obs_dim 5300 → 5600 means v2.3 models CANNOT be used with v2.4 environments
- **Neutral Inference**: Live trading uses neutral defaults (0, 0.5, 0) since account state isn't tracked per-prediction
- **Quadratic Penalty**: The `** 2` keeps the penalty gentle at a 16% drawdown but harsh at 25%+
- **Normalized P&L**: `tanh(pnl * 10)` keeps values in [-1, 1] even for large P&L swings
- **0.5 Win Rate Prior**: Prevents model confusion during initial trades with no history

## Model Behavior Expected

With account awareness, the model should learn to:

1. **Reduce position sizing after losses** (sees the drawdown feature)
2. **Be more selective after a poor win rate** (sees the win rate feature)
3. **Avoid compounding losses** (drawdown penalty kicks in at 15%)
4. **Trade more aggressively when profitable** (sees positive P&L)

## References

- `alpaca_trading/gpu/vectorized_env.py`: lines 405 (config), 712 (tensors), 850 (reset), 926 (step), 1258 (obs)
- `alpaca_trading/gpu/inference_obs_builder.py`: lines 61-108 (feature detection), 624+ (account features)
- `notebooks/VSCode_Colab_Training_NATIVE.ipynb`: training notebook with v2.4 settings
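Finally, given the breaking change noted in the insights, a loader-side guard can fail fast when a v2.3 checkpoint meets a v2.4 environment. This is a hypothetical sketch, not code from the repo: the `ckpt_obs_dim` argument assumes checkpoint metadata that records the obs_dim the model was trained with.

```python
def check_obs_compat(ckpt_obs_dim: int, env_n_features: int, window: int = 100) -> None:
    """Hypothetical guard: refuse to pair a checkpoint with a mismatched env.

    Assumes checkpoint metadata records the trained obs_dim
    (5300 for v2.3, 5600 for v2.4); obs_dim = n_features * window.
    """
    env_obs_dim = env_n_features * window
    if ckpt_obs_dim != env_obs_dim:
        raise ValueError(
            f"checkpoint obs_dim {ckpt_obs_dim} != env obs_dim {env_obs_dim}; "
            "v2.3 (5300) models cannot run in a v2.4 (5600) environment"
        )

check_obs_compat(5600, env_n_features=56)  # OK: v2.4 model, v2.4 env
check_obs_compat(5300, env_n_features=56)  # raises ValueError: v2.3 model, v2.4 env
```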