Lesson Learned #030: LangSmith Deep Agent Observability
ID: LL-030
Impact: Identified through automated analysis
Date: December 14, 2025
Category: Infrastructure / Monitoring
Severity: HIGH (critical for debugging and optimization)
Status: IMPLEMENTED
Executive Summary
Implemented comprehensive observability for our Deep Agent trading system using LangSmith-compatible tracing. This enables real-time visibility into agent decisions, cost tracking per operation, and evaluation datasets for A/B testing strategies.
Problem
Our trading agent operates autonomously for extended periods, making complex decisions without visibility into:
- Why specific trade decisions were made
- Which prompts/models perform best
- Cost per decision (critical for $100/mo budget)
- Error rates and latency across components
- Calibration between confidence and actual win rate
Without observability, we were “flying blind”: we saw outcomes, but not the decision-making process that produced them.
Research: LangChain Deep Agents Webinar (Dec 2025)
Harrison Chase and Nick Huang from LangChain presented key insights:
- Deep Agents are Different: Unlike simple chatbots, they run for extended periods, execute multiple sub-tasks, and make complex autonomous decisions
- Key Observability Requirements:
  - Full trace of every LLM call
  - Cost tracking per operation
  - Latency monitoring for time-sensitive decisions
  - Evaluation datasets for prompt optimization
  - Error tracking with full context
- LangSmith Features:
  - Automatic tracing of LangChain components
  - Custom spans for non-LangChain code
  - Evaluation datasets with metrics
  - Cost dashboard
  - A/B testing capabilities
Solution
Created a comprehensive observability module at `src/observability/`:
1. LangSmith Tracer (`langsmith_tracer.py`)
```python
from src.observability import traceable_decision, get_tracer

@traceable_decision(name="trade_signal")
async def generate_signal(symbol: str) -> Signal:
    # All nested operations are traced automatically
    ...

# Or use the context manager directly
tracer = get_tracer()
with tracer.trace("market_analysis") as span:
    span.add_metadata({"symbol": "BTCUSD"})
    result = await analyze_market()
    span.set_cost(input_tokens, output_tokens, model)
```
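For orientation, here is a minimal sketch of what a decorator like `traceable_decision` could look like internally, assuming only the `get_tracer()` / `trace()` API shown above; the actual `langsmith_tracer.py` implementation may differ:
```python
import functools

from src.observability import get_tracer

def traceable_decision(name: str):
    """Wrap an async function so each call runs inside a named trace span."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            tracer = get_tracer()
            with tracer.trace(name) as span:
                # Record which function ran; nested spans attach automatically
                span.add_metadata({"function": func.__qualname__})
                return await func(*args, **kwargs)
        return wrapper
    return decorator
```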
2. Trade Evaluator (`trade_evaluator.py`)
Records every decision with full context and links it to its eventual outcome:
```python
from src.observability.trade_evaluator import TradeEvaluator

evaluator = TradeEvaluator()

# Record the decision at trade time
record_id = evaluator.record_decision(
    symbol="BTCUSD",
    decision="BUY",
    confidence=0.85,
    reasoning="Strong momentum + positive sentiment",
    price=50000.0,
)

# Later, record the outcome
evaluator.record_outcome(record_id, exit_price=52500.0)

# Get aggregate metrics
metrics = evaluator.get_metrics(days=30)
print(f"Win rate: {metrics.win_rate:.1%}")
print(f"Calibration error: {metrics.calibration_error:.2f}")
```
3. Dashboard (`dashboard.py`)
Generates text reports and Prometheus metrics:
```python
from src.observability.dashboard import ObservabilityDashboard

dashboard = ObservabilityDashboard()
report = dashboard.generate_report(days=7)
print(report)

# Export for Grafana
prometheus_metrics = dashboard.export_prometheus_metrics()
```
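Prometheus's text exposition format is plain text, so the export can be a simple string render. A minimal sketch with illustrative metric names; the names actually emitted by `dashboard.py` are not confirmed here:
```python
def export_prometheus_metrics(metrics) -> str:
    """Render report metrics in Prometheus text exposition format."""
    lines = [
        "# HELP trading_win_rate Rolling win rate over the report window",
        "# TYPE trading_win_rate gauge",
        f"trading_win_rate {metrics.win_rate:.4f}",
        "# HELP trading_calibration_error Gap between confidence and win rate",
        "# TYPE trading_calibration_error gauge",
        f"trading_calibration_error {metrics.calibration_error:.4f}",
    ]
    return "\n".join(lines) + "\n"
```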
4. Orchestrator Hooks (`orchestrator_hooks.py`)
Non-invasive integration with the existing orchestrator:
```python
from src.observability.orchestrator_hooks import enable_observability

# Enable for all new orchestrators
enable_observability()

# Or for a specific instance
orchestrator = TradingOrchestrator(tickers=["BTCUSD"])
enable_observability(orchestrator)
```
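Non-invasive here means wrapping methods at runtime rather than editing their source. A minimal sketch of the per-instance case, assuming the async method names listed under Integration Points below; the real `orchestrator_hooks.py` may differ:
```python
import functools

from src.observability import get_tracer

def enable_observability(orchestrator, methods=("run", "_process_ticker", "_execute_trade")):
    """Wrap selected async methods of one orchestrator instance in trace spans."""
    tracer = get_tracer()

    def wrap(fn, span_name):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            with tracer.trace(span_name):
                return await fn(*args, **kwargs)
        return wrapper

    for name in methods:
        # Rebind the instance attribute; the class and its source stay untouched
        setattr(orchestrator, name, wrap(getattr(orchestrator, name), name))
```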
Key Features
| Feature | Benefit |
|---|---|
| Automatic tracing | See full decision chain |
| Cost tracking | Stay within $100/mo budget |
| Decision quality scoring | Excellent/Good/Lucky/Unlucky/Poor |
| Calibration metrics | Confidence vs actual accuracy |
| A/B testing | Compare strategies/models |
| Prometheus export | Grafana dashboards |
| Finetuning export | Export best decisions for training |
Cost Model
Tracks cost per million (1M) tokens for all major models:
- GPT-4o: $2.50 input / $10 output
- GPT-4o-mini: $0.15 input / $0.60 output
- Claude 3.5 Sonnet: $3 input / $15 output
- Claude 3 Haiku: $0.25 input / $1.25 output
- Gemini 2.0 Flash: $0.10 input / $0.40 output
- DeepSeek Chat: $0.14 input / $0.28 output
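With per-million-token prices, the cost of a call is a straight multiply. A minimal sketch using the prices above; the dictionary keys are illustrative, not the tracer's confirmed model identifiers:
```python
# USD per 1M tokens: (input, output)
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-5-sonnet": (3.00, 15.00),
    "claude-3-haiku": (0.25, 1.25),
    "gemini-2.0-flash": (0.10, 0.40),
    "deepseek-chat": (0.14, 0.28),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one LLM call from token counts."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```
For example, `call_cost("gpt-4o-mini", 2_000, 500)` is $0.0006, which makes clear how many decisions fit inside a $3.33 daily budget on the cheaper models.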
Decision Quality Classification
| Quality | Definition |
|---|---|
| Excellent | Right decision + high confidence + right reasoning |
| Good | Right decision + okay reasoning |
| Lucky | Right outcome but wrong reasoning |
| Unlucky | Wrong outcome but right reasoning |
| Poor | Wrong decision + wrong reasoning |
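In code, the table reduces to a small mapping from outcome/reasoning correctness (plus confidence) to a label. A minimal sketch; how reasoning correctness is judged, and the 0.8 confidence cutoff, are assumptions rather than the evaluator's confirmed logic:
```python
from enum import Enum

class DecisionQuality(Enum):
    EXCELLENT = "excellent"
    GOOD = "good"
    LUCKY = "lucky"
    UNLUCKY = "unlucky"
    POOR = "poor"

def classify(outcome_correct: bool, reasoning_correct: bool, confidence: float) -> DecisionQuality:
    """Map outcome/reasoning correctness onto the quality table above."""
    if outcome_correct and reasoning_correct:
        # Assumed cutoff: high confidence separates Excellent from Good
        return DecisionQuality.EXCELLENT if confidence >= 0.8 else DecisionQuality.GOOD
    if outcome_correct:
        return DecisionQuality.LUCKY   # right outcome, wrong reasoning
    if reasoning_correct:
        return DecisionQuality.UNLUCKY  # wrong outcome, right reasoning
    return DecisionQuality.POOR
```
Separating Lucky from Excellent matters for finetuning export: training only on decisions whose reasoning was also right avoids reinforcing reasoning that merely got lucky.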
Integration Points
- TradingOrchestrator: Auto-traces `run()`, `_process_ticker()`, `_execute_trade()`
- Pre-Trade Verification: Traces verification decisions
- HICRA Credit Assignment: Traces RL reward shaping
- Market Scanner: Traces signal generation
Expected Improvements
- Debug time: -80% (full context for every decision)
- Cost optimization: Know exactly which operations cost most
- Strategy selection: A/B test with real metrics
- Calibration: Reduce overconfidence through feedback
Files Created
- `src/observability/__init__.py`
- `src/observability/langsmith_tracer.py` (450 lines)
- `src/observability/trade_evaluator.py` (400 lines)
- `src/observability/dashboard.py` (300 lines)
- `src/observability/orchestrator_hooks.py` (200 lines)
- `tests/test_observability.py` (350 lines)
Environment Variables
```bash
# Enable LangSmith cloud (optional; local tracing works without it)
LANGSMITH_API_KEY=your-api-key

# Daily budget limit (default: $3.33 = $100 / 30 days)
DAILY_LLM_BUDGET=3.33
```
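A minimal sketch of how the module might read these at startup; the parsing shown is an assumption beyond the documented default of 3.33:
```python
import os

# None disables cloud upload; local tracing still works
LANGSMITH_API_KEY = os.getenv("LANGSMITH_API_KEY")

# USD per day; documented default is $100 / 30 days
DAILY_LLM_BUDGET = float(os.getenv("DAILY_LLM_BUDGET", "3.33"))
```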
Prevention Rules
- Instrument new agent components with tracing when they are created, not retroactively
- Add automated checks (daily budget alerts, calibration drift monitoring) so regressions surface without manual review
- Update the RAG knowledge base with this lesson
Tags
#observability #langsmith #monitoring #tracing #evaluation #cost-tracking #deep-agents