# Lesson Learned: AI Agent Adaptation Framework (Dec 11, 2025)
- **ID:** ll_011
- **Date:** December 11, 2025
- **Severity:** INFORMATIONAL (Best Practice)
- **Category:** ML Pipeline, RL Feedback Loops, System Architecture
- **Impact:** Enhanced learning capability, continuous improvement
## Executive Summary
Implemented an AI agent adaptation framework based on the Stanford/Princeton/Harvard taxonomy paper. Its four adaptation modes (A1, A2, T1, T2) provide a systematic approach to continuous improvement.
## The Framework

### Taxonomy Overview
| Mode | What Adapts | Feedback Source | Our Implementation |
|---|---|---|---|
| A1 | Agent | Tool results | DiscoRL online learning + RiskAdjustedReward |
| A2 | Agent | Output evaluations | Prediction tracking for confidence calibration |
| T1 | Tools (agent frozen) | Separate training | (Future: sentiment fine-tuning) |
| T2 | Tools (agent frozen) | Agent feedback | (Future: adaptive position sizing) |
## What Was Implemented (PR #530)

### 1. A1 Mode: DiscoRL Online Learning

**Files:** `src/orchestrator/main.py`, `src/risk/position_manager.py`
```python
from datetime import datetime

# Entry features are stored when the position opens
self.position_manager.track_entry(
    symbol=ticker,
    entry_date=datetime.now(),
    entry_features=momentum_signal.indicators,
)

# record_trade_outcome() is called when the position closes
self.rl_filter.record_trade_outcome(
    entry_state=entry_features,
    action=1,  # long
    exit_state=exit_features,
    reward=position.unrealized_plpc,
    done=True,
)
```
**Key Learning:** Entry state must be captured at trade open and persisted across sessions.
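Since entry features must outlive the process, here is a minimal persistence sketch assuming a hypothetical JSON-backed store; `STORE`, `save_entry_features`, and `load_entry_features` are illustrative names, not the actual `track_entry()` internals:

```python
import json
from pathlib import Path

# Hypothetical store for entry features of open positions (illustrative path)
STORE = Path("data/open_position_features.json")

def save_entry_features(symbol: str, features: dict) -> None:
    """Write entry features for an open position to disk at trade open."""
    state = json.loads(STORE.read_text()) if STORE.exists() else {}
    state[symbol] = features
    STORE.parent.mkdir(parents=True, exist_ok=True)
    STORE.write_text(json.dumps(state, indent=2))

def load_entry_features(symbol: str) -> dict | None:
    """Read entry features back after a restart; None if never tracked."""
    if not STORE.exists():
        return None
    return json.loads(STORE.read_text()).get(symbol)
```

On restart, `load_entry_features(ticker)` would rehydrate the entry state needed for `record_trade_outcome()`.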
### 2. A1 Enhancement: RiskAdjustedReward

**File:** `src/agents/rl_weight_updater.py`
```python
# Old: binary +1/-1 rewards
# New: multi-dimensional reward signal (weights sum to 1.0)
reward = RiskAdjustedReward(
    return_weight=0.35,       # Annualized return
    downside_weight=0.25,     # Downside risk penalty
    sharpe_weight=0.20,       # Sharpe ratio
    drawdown_weight=0.15,     # Max drawdown penalty
    transaction_weight=0.05,  # Transaction cost
)
```
**Key Learning:** Rich reward signals enable better policy learning than simple binary feedback.
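As a rough sketch of how such a multi-component signal could be combined into one scalar (the shipped `RiskAdjustedReward` in `src/agents/rl_weight_updater.py` may differ; the function below and its normalizations are assumptions):

```python
import numpy as np

def risk_adjusted_reward(returns: np.ndarray,
                         transaction_cost: float,
                         weights: dict) -> float:
    """Illustrative blend of return, downside, Sharpe, drawdown,
    and cost terms into a single bounded reward."""
    ann_return = float(np.mean(returns)) * 252
    downside = float(np.std(returns[returns < 0])) if (returns < 0).any() else 0.0
    sharpe = ann_return / (float(np.std(returns)) * np.sqrt(252) + 1e-9)
    equity = np.cumprod(1.0 + returns)
    drawdown = float(np.max(1.0 - equity / np.maximum.accumulate(equity)))
    raw = (weights["return"] * np.tanh(ann_return)
           - weights["downside"] * downside
           + weights["sharpe"] * np.tanh(sharpe)
           - weights["drawdown"] * drawdown
           - weights["transaction"] * transaction_cost)
    return float(np.clip(raw, -1.0, 1.0))  # keep rewards normalized to [-1, 1]
```

With the weights above (0.35/0.25/0.20/0.15/0.05), this stays a bounded scalar suitable for DQN-style updates.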
### 3. A2 Mode: Prediction Tracking

**File:** `src/orchestrator/main.py`
```python
# At entry: track what we predicted
telemetry.record(payload={
    "predicted_confidence": rl_result["confidence"],
    "predicted_action": "long",
    "prediction_timestamp": datetime.utcnow().isoformat(),
})

# At exit: track what actually happened
actual_return = position.unrealized_plpc
telemetry.record(payload={
    "actual_return": actual_return,
    "prediction_correct": actual_return > 0,
})
```
**Key Learning:** Tracking predictions against outcomes enables future confidence calibration.
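A minimal sketch of the calibration analysis this enables, assuming merged entry/exit telemetry records with the `predicted_confidence` and `prediction_correct` fields shown above (`calibration_report` is a hypothetical helper, not an existing module):

```python
from collections import defaultdict

def calibration_report(records: list[dict]) -> dict:
    """Bucket predictions by confidence decile and compare stated
    confidence with the realized hit rate in each bucket."""
    buckets: dict[int, list[bool]] = defaultdict(list)
    for rec in records:
        decile = min(int(rec["predicted_confidence"] * 10), 9)
        buckets[decile].append(rec["prediction_correct"])
    return {
        f"{d / 10:.1f}-{(d + 1) / 10:.1f}": sum(hits) / len(hits)
        for d, hits in sorted(buckets.items())
    }
```

A well-calibrated agent shows hit rates that track the bucket bounds; systematic gaps indicate over- or under-confidence worth correcting.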
## Verification Checklist

Before merging any RL/ML changes:

- [ ] 1. Syntax check: `python3 -m py_compile <file>`
- [ ] 2. Import test: `python3 -c "from src.agents.rl_agent import RLFilter"`
- [ ] 3. Reward range: ensure rewards are normalized to [-1, 1] or similar (see the sketch after this checklist)
- [ ] 4. Feature persistence: verify entry features survive session restarts
- [ ] 5. Telemetry: confirm new fields don't break existing analysis
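A sketch of how the reward-range gate (item 3) could be automated; `assert_reward_in_range` is a hypothetical helper, not an existing check:

```python
import numpy as np

def assert_reward_in_range(rewards: np.ndarray,
                           low: float = -1.0, high: float = 1.0) -> None:
    """Fail fast if any sampled reward leaves the expected normalized range."""
    out_of_range = rewards[(rewards < low) | (rewards > high)]
    if out_of_range.size:
        raise ValueError(
            f"{out_of_range.size} rewards outside [{low}, {high}]: {out_of_range[:5]}"
        )
```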
## Anti-Patterns to Avoid

### 1. Sparse Rewards

- **Bad:** Only rewarding on trade close, after days of holding
- **Good:** Shape intermediate rewards or use a value function (a sketch follows)
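A minimal shaping sketch, assuming the per-bar unrealized P/L is already available; the function and its signature are illustrative, not shipped code:

```python
def shaped_reward(prev_plpc: float, curr_plpc: float,
                  realized_plpc: float, done: bool) -> float:
    """Emit a small dense reward every bar (change in unrealized P/L)
    plus the realized P/L at close, instead of one sparse terminal signal."""
    dense = curr_plpc - prev_plpc
    return dense + (realized_plpc if done else 0.0)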
### 2. Feature Leakage

- **Bad:** Using future data in entry features
- **Good:** Strictly use data available at decision time (sketched below)
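A point-in-time sketch of leak-free feature construction, assuming a pandas price frame with a DatetimeIndex and a `close` column; the feature names are illustrative, not the actual `momentum_signal.indicators`:

```python
import pandas as pd

def entry_features_asof(prices: pd.DataFrame,
                        decision_time: pd.Timestamp) -> dict:
    """Build entry features strictly from bars at or before the decision
    timestamp, so future bars cannot leak in (assumes >= 21 bars of history)."""
    window = prices.loc[:decision_time]  # inclusive point-in-time slice
    closes = window["close"]
    return {
        "mom_20d": closes.iloc[-1] / closes.iloc[-21] - 1.0,
        "vol_20d": closes.pct_change().tail(20).std(),
    }
```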
### 3. Reward Hacking

- **Bad:** The agent learns to game the reward signal without real improvement
- **Good:** Multi-objective rewards with diverse components
### 4. Catastrophic Forgetting

- **Bad:** Full model replacement on new data
- **Good:** 70/30 weight blending (old/new) for stability (see the sketch below)
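A minimal sketch of the blending step; the 70/30 split matches the rule above, but `blend_weights` itself is an illustrative helper:

```python
import numpy as np

def blend_weights(old: np.ndarray, new: np.ndarray,
                  keep_old: float = 0.70) -> np.ndarray:
    """Stability blend: retain 70% of the existing weights and fold in
    30% of the newly learned ones, instead of wholesale replacement."""
    return keep_old * old + (1.0 - keep_old) * new
```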
## Future Enhancements

### T1 Mode (Tools trained separately)
- Train sentiment analyzer on trading-specific corpus
- Learn indicator weights per market regime
- Symbol-specific pattern recognition
### T2 Mode (Tools tuned from feedback)

- Adaptive position sizing based on confidence accuracy (sketched below)
- Dynamic stop-loss thresholds from exit analysis
- Entry timing optimization
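Since this mode is not built yet, the following is purely a hypothetical sketch of confidence-calibrated sizing, not a committed design:

```python
def adaptive_position_size(base_size: float,
                           predicted_confidence: float,
                           realized_accuracy: float) -> float:
    """Hypothetical T2 sketch: scale size by how well confidence has been
    calibrated historically; shrink when confidence outruns accuracy,
    grow modestly when it is conservative."""
    calibration = realized_accuracy / max(predicted_confidence, 1e-9)
    return base_size * min(max(calibration, 0.25), 1.5)  # clamp to 0.25x-1.5x
```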
## Metrics to Track
| Metric | Baseline | Target (Day 30) | Target (Day 90) |
|---|---|---|---|
| Prediction Accuracy | Not tracked | Track baseline | >60% |
| DiscoRL Training Steps | 0 | >100 | >1000 |
| Reward Signal Richness | Binary | Multi-dim | Full 6-component |
| A1/A2 Coverage | 25% | 75% | 100% |
## Integration Points

### RAG Knowledge Base

- This lesson is stored in `rag_knowledge/lessons_learned/`
- Queryable by RAGSafetyChecker before future ML changes
- Query patterns: "AI adaptation", "RL feedback", "reward function"
### ML Pipeline

- `RLWeightUpdater` reads the audit trail daily
- DiscoRL accumulates transitions online
- Prediction tracking feeds future calibration
### Telemetry

- All adaptation events are logged to `data/audit_trail/hybrid_funnel_runs.jsonl`
- Event types: `rl.online_learning`, `rl.prediction`, `position.exit` (a reader sketch follows)
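A small reader sketch for downstream jobs; the `event_type` field name is an assumption about the JSONL schema:

```python
import json
from pathlib import Path

AUDIT = Path("data/audit_trail/hybrid_funnel_runs.jsonl")

def load_events(event_type: str) -> list[dict]:
    """Filter the append-only audit trail by event type,
    e.g. load_events('rl.prediction') for a calibration job."""
    events: list[dict] = []
    with AUDIT.open() as fh:
        for line in fh:
            record = json.loads(line)
            if record.get("event_type") == event_type:
                events.append(record)
    return events
```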
## Related Lessons

- `ll_009_ci_syntax_failure_dec11.md` - Why verification gates matter
- `over_engineering_trading_system.md` - Keep implementations focused
## Tags
#ml #rl #adaptation #feedback-loops #disco-dqn #reward-function #prediction-tracking #a1-mode #a2-mode
## Change Log
- 2025-12-11: Initial implementation of A1 mode (DiscoRL + RiskAdjustedReward)
- 2025-12-11: Initial implementation of A2 foundation (prediction tracking)
- 2025-12-11: Created this lessons learned entry