Lesson Learned: AI Agent Adaptation Framework (Dec 11, 2025)

  • ID: ll_011
  • Date: December 11, 2025
  • Severity: INFORMATIONAL (Best Practice)
  • Category: ML Pipeline, RL Feedback Loops, System Architecture
  • Impact: Enhanced learning capability, continuous improvement

Executive Summary

Implemented an AI agent adaptation framework based on the Stanford/Princeton/Harvard taxonomy paper. The four adaptation modes (A1, A2, T1, T2) provide a systematic approach to continuous improvement.

The Framework

Taxonomy Overview

| Mode | What Adapts | Feedback Source | Our Implementation |
|------|-------------|-----------------|--------------------|
| A1 | Agent | Tool results | DiscoRL online learning + RiskAdjustedReward |
| A2 | Agent | Output evaluations | Prediction tracking for confidence calibration |
| T1 | Tools (agent frozen) | Separate training | (Future: sentiment fine-tuning) |
| T2 | Tools (agent fixed) | Agent feedback | (Future: adaptive position sizing) |
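
As a quick reference, the taxonomy can be captured in code. This is an illustrative sketch only; the AdaptationMode enum below is not an existing module in the codebase.

from enum import Enum

class AdaptationMode(Enum):
    """Illustrative labels for the four adaptation modes from the taxonomy."""
    A1 = "agent adapts from tool results"
    A2 = "agent adapts from output evaluations"
    T1 = "tools adapt via separate training (agent frozen)"
    T2 = "tools adapt from agent feedback (agent fixed)"

# Current coverage: A1/A2 implemented in PR #530, T1/T2 planned
IMPLEMENTED = {AdaptationMode.A1, AdaptationMode.A2}
PLANNED = {AdaptationMode.T1, AdaptationMode.T2}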

What Was Implemented (PR #530)

1. A1 Mode: DiscoRL Online Learning

Files: src/orchestrator/main.py, src/risk/position_manager.py

# Entry features stored when position opens
self.position_manager.track_entry(
    symbol=ticker,
    entry_date=datetime.now(),
    entry_features=momentum_signal.indicators,
)

# record_trade_outcome() called when position closes
self.rl_filter.record_trade_outcome(
    entry_state=entry_features,
    action=1,  # long
    exit_state=exit_features,
    reward=position.unrealized_plpc,
    done=True,
)

Key Learning: Entry state must be captured at trade open and persisted across sessions.
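
A minimal sketch of how that persistence could work, assuming entry features are JSON-serializable dicts; the file location and helper names are hypothetical, not the actual PositionManager API.

import json
from pathlib import Path

ENTRY_STATE_PATH = Path("data/state/entry_features.json")  # hypothetical location

def save_entry_features(symbol: str, features: dict) -> None:
    """Persist entry features so they survive a process restart."""
    ENTRY_STATE_PATH.parent.mkdir(parents=True, exist_ok=True)
    state = json.loads(ENTRY_STATE_PATH.read_text()) if ENTRY_STATE_PATH.exists() else {}
    state[symbol] = features
    ENTRY_STATE_PATH.write_text(json.dumps(state, indent=2))

def load_entry_features(symbol: str) -> dict | None:
    """Recover the features captured at trade open, even in a new session."""
    if not ENTRY_STATE_PATH.exists():
        return None
    return json.loads(ENTRY_STATE_PATH.read_text()).get(symbol)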

2. A1 Enhancement: RiskAdjustedReward

File: src/agents/rl_weight_updater.py

# Old: Binary +1/-1 rewards
# New: Multi-dimensional reward signal
reward = RiskAdjustedReward(
    return_weight=0.35,      # Annualized return
    downside_weight=0.25,    # Downside risk penalty
    sharpe_weight=0.20,      # Sharpe ratio
    drawdown_weight=0.15,    # Max drawdown penalty
    transaction_weight=0.05  # Transaction cost
)

Key Learning: Rich reward signals enable better policy learning than simple binary feedback.
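
For illustration, a weighted combination along the following lines keeps the reward bounded. The dataclass and sign conventions are assumptions, not the actual RiskAdjustedReward internals, and each component is assumed to be pre-normalized to [-1, 1].

from dataclasses import dataclass

@dataclass
class RewardComponents:
    annualized_return: float   # each component pre-normalized to [-1, 1]
    downside_risk: float
    sharpe: float
    max_drawdown: float
    transaction_cost: float

def combine_reward(c: RewardComponents) -> float:
    """Weighted sum of normalized components; penalties enter with a negative sign."""
    raw = (
        0.35 * c.annualized_return
        - 0.25 * c.downside_risk
        + 0.20 * c.sharpe
        - 0.15 * c.max_drawdown
        - 0.05 * c.transaction_cost
    )
    return max(-1.0, min(1.0, raw))  # clip to keep the reward bounded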

3. A2 Mode: Prediction Tracking

File: src/orchestrator/main.py

# At entry: Track what we predicted
telemetry.record(payload={
    "predicted_confidence": rl_result["confidence"],
    "predicted_action": "long",
    "prediction_timestamp": datetime.utcnow().isoformat()
})

# At exit: Track what actually happened
actual_return = position.unrealized_plpc
telemetry.record(payload={
    "actual_return": actual_return,
    "prediction_correct": actual_return > 0
})

Key Learning: Tracking predictions vs outcomes enables future confidence calibration.
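
Once enough paired records accumulate, calibration can be estimated by bucketing predicted confidence against the realized hit rate. A sketch assuming each record merges the entry and exit payloads shown above:

from collections import defaultdict

def calibration_by_bucket(records: list[dict], n_buckets: int = 10) -> dict[float, float]:
    """Compare predicted-confidence buckets against the realized hit rate."""
    hits = defaultdict(list)
    for r in records:
        bucket = min(int(r["predicted_confidence"] * n_buckets), n_buckets - 1) / n_buckets
        hits[bucket].append(1.0 if r["prediction_correct"] else 0.0)
    return {b: sum(v) / len(v) for b, v in sorted(hits.items())}

# A well-calibrated model shows hit rates close to each bucket's confidence level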

Verification Checklist

Before merging any RL/ML changes:

[ ] 1. Syntax check: python3 -m py_compile <file>
[ ] 2. Import test: python3 -c "from src.agents.rl_agent import RLFilter"
[ ] 3. Reward range: Ensure rewards are normalized to [-1, 1] or similar (see the check sketched after this list)
[ ] 4. Feature persistence: Verify entry features survive session restarts
[ ] 5. Telemetry: Confirm new fields don't break existing analysis
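
For item 3, a check like the following can run in tests or at logging time; the [-1, 1] bounds are the same assumption stated in the checklist.

def assert_reward_bounded(rewards: list[float], low: float = -1.0, high: float = 1.0) -> None:
    """Fail fast if any recorded reward escapes the expected normalized range."""
    out_of_range = [r for r in rewards if not low <= r <= high]
    assert not out_of_range, f"{len(out_of_range)} rewards outside [{low}, {high}]"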

Anti-Patterns to Avoid

1. Sparse Rewards

Bad: Only reward on trade close after days of holding
Good: Shape intermediate rewards or use a value function
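
One way to densify the signal is to emit a small mark-to-market reward on each holding day and the full outcome only at close; the 0.1 scaling factor below is an arbitrary illustration.

def shaped_reward(daily_return: float, is_final: bool, realized_plpc: float = 0.0) -> float:
    """Dense intermediate feedback while holding, terminal reward only on exit."""
    if is_final:
        return realized_plpc        # terminal reward: realized P/L percent at close
    return 0.1 * daily_return       # small shaping term so days of holding are not silent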

2. Feature Leakage

Bad: Using future data in entry features
Good: Strictly use data available at decision time
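
A simple guard is to filter every feature source by the decision timestamp before building the entry state; the record shape here is illustrative.

from datetime import datetime

def point_in_time(observations: list[dict], decision_time: datetime) -> list[dict]:
    """Keep only observations that were available when the entry decision was made."""
    return [obs for obs in observations if obs["timestamp"] <= decision_time]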

3. Reward Hacking

Bad: Agent learns to game the reward signal without real improvement
Good: Multi-objective rewards with diverse components

4. Catastrophic Forgetting

Bad: Full model replacement on new data
Good: 70/30 weight blending (old/new) for stability
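
The 70/30 blend described above is effectively an exponential-moving-average update on the weights. A sketch assuming weights are plain dicts of floats:

def blend_weights(old: dict[str, float], new: dict[str, float], keep_old: float = 0.70) -> dict[str, float]:
    """Blend new weights into the old ones instead of replacing the model outright."""
    keys = old.keys() | new.keys()
    return {k: keep_old * old.get(k, 0.0) + (1.0 - keep_old) * new.get(k, old.get(k, 0.0)) for k in keys}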

Future Enhancements

T1 Mode (Tools trained separately)

  • Train sentiment analyzer on trading-specific corpus
  • Learn indicator weights per market regime
  • Symbol-specific pattern recognition

T2 Mode (Tools tuned from feedback)

  • Adaptive position sizing based on confidence accuracy (sketched after this list)
  • Dynamic stop-loss thresholds from exit analysis
  • Entry timing optimization
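
As a sketch of the first T2 item: sizing could be scaled by how well past confidence matched outcomes. The base size, clamp bounds, and inputs are all assumptions, since nothing here is implemented yet.

def adaptive_position_size(base_size: float, predicted_confidence: float, realized_hit_rate: float) -> float:
    """Shrink size when the model has been overconfident, grow it when underconfident."""
    calibration_ratio = realized_hit_rate / max(predicted_confidence, 1e-6)
    return base_size * max(0.5, min(1.5, calibration_ratio))  # clamp to avoid extreme sizing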

Metrics to Track

| Metric | Baseline | Target (Day 30) | Target (Day 90) |
|--------|----------|-----------------|-----------------|
| Prediction Accuracy | Not tracked | Track baseline | >60% |
| DiscoRL Training Steps | 0 | >100 | >1000 |
| Reward Signal Richness | Binary | Multi-dim | Full 6-component |
| A1/A2 Coverage | 25% | 75% | 100% |

Integration Points

RAG Knowledge Base

  • This lesson stored in rag_knowledge/lessons_learned/
  • Queryable by RAGSafetyChecker before future ML changes
  • Pattern: “AI adaptation”, “RL feedback”, “reward function”

ML Pipeline

  • RLWeightUpdater reads audit trail daily
  • DiscoRL accumulates transitions online
  • Prediction tracking feeds future calibration

Telemetry

  • All adaptation events logged to data/audit_trail/hybrid_funnel_runs.jsonl
  • Event types: rl.online_learning, rl.prediction, position.exit (parsed in the sketch below)
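
Reading those events back is a per-line JSONL scan; the event_type field name is an assumption about the payload schema.

import json
from pathlib import Path

AUDIT_TRAIL = Path("data/audit_trail/hybrid_funnel_runs.jsonl")

def load_events(event_type: str) -> list[dict]:
    """Return audit-trail records matching one event type (e.g. rl.online_learning)."""
    events = []
    for line in AUDIT_TRAIL.read_text().splitlines():
        record = json.loads(line)
        if record.get("event_type") == event_type:  # field name assumed
            events.append(record)
    return events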

Related Lessons

  • ll_009_ci_syntax_failure_dec11.md - Why verification gates matter
  • over_engineering_trading_system.md - Keep implementations focused

Tags

#ml #rl #adaptation #feedback-loops #disco-dqn #reward-function #prediction-tracking #a1-mode #a2-mode

Change Log

  • 2025-12-11: Initial implementation of A1 mode (DiscoRL + RiskAdjustedReward)
  • 2025-12-11: Initial implementation of A2 foundation (prediction tracking)
  • 2025-12-11: Created this lessons learned entry