Lesson Learned: AI Agent Adaptation Framework (Dec 11, 2025)

  • ID: ll_011
  • Date: December 11, 2025
  • Severity: INFORMATIONAL (Best Practice)
  • Category: ML Pipeline, RL Feedback Loops, System Architecture
  • Impact: Enhanced learning capability, continuous improvement

Executive Summary

Implemented an AI agent adaptation framework based on the Stanford/Princeton/Harvard taxonomy paper. The four adaptation modes (A1, A2, T1, T2) provide a systematic approach to continuous improvement.

The Framework

Taxonomy Overview

| Mode | What Adapts | Feedback Source | Our Implementation |
|------|-------------|-----------------|--------------------|
| A1 | Agent | Tool results | DiscoRL online learning + RiskAdjustedReward |
| A2 | Agent | Output evaluations | Prediction tracking for confidence calibration |
| T1 | Tools (agent frozen) | Separate training | (Future: sentiment fine-tuning) |
| T2 | Tools (agent fixed) | Agent feedback | (Future: adaptive position sizing) |
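
As a quick reference, the taxonomy can be captured in code. This is an illustrative sketch only; the AdaptationMode enum below is not an existing module in the codebase.

from enum import Enum

class AdaptationMode(Enum):
    """Illustrative labels for the four adaptation modes from the taxonomy."""
    A1 = "agent adapts from tool results"
    A2 = "agent adapts from output evaluations"
    T1 = "tools adapt via separate training (agent frozen)"
    T2 = "tools adapt from agent feedback (agent fixed)"

# Current coverage: A1/A2 implemented in PR #530, T1/T2 planned
IMPLEMENTED = {AdaptationMode.A1, AdaptationMode.A2}
PLANNED = {AdaptationMode.T1, AdaptationMode.T2}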

What Was Implemented (PR #530)

1. A1 Mode: DiscoRL Online Learning

Files: src/orchestrator/main.py, src/risk/position_manager.py

# Entry features stored when position opens
self.position_manager.track_entry(
    symbol=ticker,
    entry_date=datetime.now(),
    entry_features=momentum_signal.indicators,
)

# record_trade_outcome() called when position closes
self.rl_filter.record_trade_outcome(
    entry_state=entry_features,
    action=1,  # long
    exit_state=exit_features,
    reward=position.unrealized_plpc,
    done=True,
)

Key Learning: Entry state must be captured at trade open and persisted across sessions.
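
A minimal sketch of how that persistence could work, assuming entry features are JSON-serializable dicts; the file location and helper names are hypothetical, not the actual PositionManager API.

import json
from pathlib import Path

ENTRY_STATE_PATH = Path("data/state/entry_features.json")  # hypothetical location

def save_entry_features(symbol: str, features: dict) -> None:
    """Persist entry features so they survive a process restart."""
    ENTRY_STATE_PATH.parent.mkdir(parents=True, exist_ok=True)
    state = json.loads(ENTRY_STATE_PATH.read_text()) if ENTRY_STATE_PATH.exists() else {}
    state[symbol] = features
    ENTRY_STATE_PATH.write_text(json.dumps(state, indent=2))

def load_entry_features(symbol: str) -> dict | None:
    """Recover the features captured at trade open, even in a new session."""
    if not ENTRY_STATE_PATH.exists():
        return None
    return json.loads(ENTRY_STATE_PATH.read_text()).get(symbol)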

2. A1 Enhancement: RiskAdjustedReward

File: src/agents/rl_weight_updater.py

# Old: Binary +1/-1 rewards
# New: Multi-dimensional reward signal
reward = RiskAdjustedReward(
    return_weight=0.35,      # Annualized return
    downside_weight=0.25,    # Downside risk penalty
    sharpe_weight=0.20,      # Sharpe ratio
    drawdown_weight=0.15,    # Max drawdown penalty
    transaction_weight=0.05  # Transaction cost
)

Key Learning: Rich reward signals enable better policy learning than simple binary feedback.
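
For illustration, a weighted combination along the following lines keeps the reward bounded. The dataclass and sign conventions are assumptions, not the actual RiskAdjustedReward internals, and each component is assumed to be pre-normalized to [-1, 1].

from dataclasses import dataclass

@dataclass
class RewardComponents:
    annualized_return: float   # each component pre-normalized to [-1, 1]
    downside_risk: float
    sharpe: float
    max_drawdown: float
    transaction_cost: float

def combine_reward(c: RewardComponents) -> float:
    """Weighted sum of normalized components; penalties enter with a negative sign."""
    raw = (
        0.35 * c.annualized_return
        - 0.25 * c.downside_risk
        + 0.20 * c.sharpe
        - 0.15 * c.max_drawdown
        - 0.05 * c.transaction_cost
    )
    return max(-1.0, min(1.0, raw))  # clip to keep the reward bounded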

3. A2 Mode: Prediction Tracking

File: src/orchestrator/main.py

# At entry: Track what we predicted
telemetry.record(payload={
    "predicted_confidence": rl_result["confidence"],
    "predicted_action": "long",
    "prediction_timestamp": datetime.utcnow().isoformat()
})

# At exit: Track what actually happened
actual_return = position.unrealized_plpc
telemetry.record(payload={
    "actual_return": actual_return,
    "prediction_correct": actual_return > 0
})

Key Learning: Tracking predictions vs outcomes enables future confidence calibration.
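
Once enough paired records accumulate, calibration can be estimated by bucketing predicted confidence against the realized hit rate. A sketch assuming each record merges the entry and exit payloads shown above:

from collections import defaultdict

def calibration_by_bucket(records: list[dict], n_buckets: int = 10) -> dict[float, float]:
    """Compare predicted-confidence buckets against the realized hit rate."""
    hits = defaultdict(list)
    for r in records:
        bucket = min(int(r["predicted_confidence"] * n_buckets), n_buckets - 1) / n_buckets
        hits[bucket].append(1.0 if r["prediction_correct"] else 0.0)
    return {b: sum(v) / len(v) for b, v in sorted(hits.items())}

# A well-calibrated model shows hit rates close to each bucket's confidence level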

Verification Checklist

Before merging any RL/ML changes:

[ ] 1. Syntax check: python3 -m py_compile <file>
[ ] 2. Import test: python3 -c "from src.agents.rl_agent import RLFilter"
[ ] 3. Reward range: Ensure rewards are normalized to [-1, 1] or similar (see the check sketched after this list)
[ ] 4. Feature persistence: Verify entry features survive session restarts
[ ] 5. Telemetry: Confirm new fields don't break existing analysis
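
For item 3, a check like the following can run in tests or at logging time; the [-1, 1] bounds are the same assumption stated in the checklist.

def assert_reward_bounded(rewards: list[float], low: float = -1.0, high: float = 1.0) -> None:
    """Fail fast if any recorded reward escapes the expected normalized range."""
    out_of_range = [r for r in rewards if not low <= r <= high]
    assert not out_of_range, f"{len(out_of_range)} rewards outside [{low}, {high}]"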

Anti-Patterns to Avoid

1. Sparse Rewards

Bad: Only reward on trade close after days of holding
Good: Shape intermediate rewards or use a value function
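
One way to densify the signal is to emit a small mark-to-market reward on each holding day and the full outcome only at close; the 0.1 scaling factor below is an arbitrary illustration.

def shaped_reward(daily_return: float, is_final: bool, realized_plpc: float = 0.0) -> float:
    """Dense intermediate feedback while holding, terminal reward only on exit."""
    if is_final:
        return realized_plpc        # terminal reward: realized P/L percent at close
    return 0.1 * daily_return       # small shaping term so days of holding are not silent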

2. Feature Leakage

Bad: Using future data in entry features
Good: Strictly use data available at decision time
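
A simple guard is to filter every feature source by the decision timestamp before building the entry state; the record shape here is illustrative.

from datetime import datetime

def point_in_time(observations: list[dict], decision_time: datetime) -> list[dict]:
    """Keep only observations that were available when the entry decision was made."""
    return [obs for obs in observations if obs["timestamp"] <= decision_time]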

3. Reward Hacking

Bad: Agent learns to game the reward signal without real improvement
Good: Multi-objective rewards with diverse components

4. Catastrophic Forgetting

Bad: Full model replacement on new data
Good: 70/30 weight blending (old/new) for stability
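
The 70/30 blend described above is effectively an exponential-moving-average update on the weights. A sketch assuming weights are plain dicts of floats:

def blend_weights(old: dict[str, float], new: dict[str, float], keep_old: float = 0.70) -> dict[str, float]:
    """Blend new weights into the old ones instead of replacing the model outright."""
    keys = old.keys() | new.keys()
    return {k: keep_old * old.get(k, 0.0) + (1.0 - keep_old) * new.get(k, old.get(k, 0.0)) for k in keys}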

Future Enhancements

T1 Mode (Tools trained separately)

  • Train sentiment analyzer on trading-specific corpus
  • Learn indicator weights per market regime
  • Symbol-specific pattern recognition

T2 Mode (Tools tuned from feedback)

  • Adaptive position sizing based on confidence accuracy (sketched after this list)
  • Dynamic stop-loss thresholds from exit analysis
  • Entry timing optimization
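
As a sketch of the first T2 item: sizing could be scaled by how well past confidence matched outcomes. The base size, clamp bounds, and inputs are all assumptions, since nothing here is implemented yet.

def adaptive_position_size(base_size: float, predicted_confidence: float, realized_hit_rate: float) -> float:
    """Shrink size when the model has been overconfident, grow it when underconfident."""
    calibration_ratio = realized_hit_rate / max(predicted_confidence, 1e-6)
    return base_size * max(0.5, min(1.5, calibration_ratio))  # clamp to avoid extreme sizing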

Metrics to Track

| Metric | Baseline | Target (Day 30) | Target (Day 90) |
|--------|----------|-----------------|-----------------|
| Prediction Accuracy | Not tracked | Track baseline | >60% |
| DiscoRL Training Steps | 0 | >100 | >1000 |
| Reward Signal Richness | Binary | Multi-dim | Full 6-component |
| A1/A2 Coverage | 25% | 75% | 100% |

Integration Points

RAG Knowledge Base

  • This lesson stored in rag_knowledge/lessons_learned/
  • Queryable by RAGSafetyChecker before future ML changes
  • Pattern: “AI adaptation”, “RL feedback”, “reward function”

ML Pipeline

  • RLWeightUpdater reads audit trail daily
  • DiscoRL accumulates transitions online
  • Prediction tracking feeds future calibration

Telemetry

  • All adaptation events logged to data/audit_trail/hybrid_funnel_runs.jsonl
  • Event types: rl.online_learning, rl.prediction, position.exit (parsed in the sketch below)
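
Reading those events back is a per-line JSONL scan; the event_type field name is an assumption about the payload schema.

import json
from pathlib import Path

AUDIT_TRAIL = Path("data/audit_trail/hybrid_funnel_runs.jsonl")

def load_events(event_type: str) -> list[dict]:
    """Return audit-trail records matching one event type (e.g. rl.online_learning)."""
    events = []
    for line in AUDIT_TRAIL.read_text().splitlines():
        record = json.loads(line)
        if record.get("event_type") == event_type:  # field name assumed
            events.append(record)
    return events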

Related Lessons

  • ll_009_ci_syntax_failure_dec11.md - Why verification gates matter
  • over_engineering_trading_system.md - Keep implementations focused

Tags

#ml #rl #adaptation #feedback-loops #disco-dqn #reward-function #prediction-tracking #a1-mode #a2-mode

Change Log

  • 2025-12-11: Initial implementation of A1 mode (DiscoRL + RiskAdjustedReward)
  • 2025-12-11: Initial implementation of A2 foundation (prediction tracking)
  • 2025-12-11: Created this lessons learned entry