Lesson Learned #058: 68% LangSmith Trace Error Rate - Silent Bugs Hiding in Observability Code
The Failure
LangSmith dashboard showed 68% error rate across 304 runs - two-thirds of all trace attempts were silently failing. The observability system meant to help us debug was itself broken, and we had no idea.
Root Causes (7 Bugs Found)
Bug 1: Silent Exception Swallowing
File: src/orchestrator/gates.py:78
# BAD - All tracing errors completely hidden
except Exception as e:
    logger.debug("Gate tracing failed: %s", e)  # Nobody reads debug logs
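A minimal sketch of a visible alternative, assuming a hypothetical `trace_gate` helper and an `emit` callable standing in for the real tracing call (neither name comes from the original code). The failure is still contained so tracing cannot break the gate, but it is logged at WARNING with the traceback attached.

```python
import logging

logger = logging.getLogger("orchestrator.gates")


def trace_gate(emit, payload):
    """Send a trace payload; never raise, but never hide a failure either.

    `emit` is a hypothetical callable standing in for the real tracing call.
    Returns True if the trace was sent, False if it failed.
    """
    try:
        emit(payload)
        return True
    except Exception:
        # WARNING shows up at default log levels; exc_info keeps the traceback.
        logger.warning("Gate tracing failed", exc_info=True)
        return False
```

Returning a boolean also gives callers and tests something to count, which a log line alone does not.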
Bug 2: Double Span Completion
Files: orchestrator_hooks.py:57 + langsmith_tracer.py:369
Both the wrapper function AND the context manager tried to complete the span on error, corrupting trace state.
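One way to rule out double completion is to make the context manager the single owner of the span's lifecycle and to make completion idempotent. The sketch below uses a hypothetical `Span` class and `traced` context manager for illustration; it is not the code in `langsmith_tracer.py`.

```python
from contextlib import contextmanager


class Span:
    """Hypothetical stand-in for the tracer's span object."""

    def __init__(self, name):
        self.name = name
        self.completed = False
        self.error = None

    def complete(self, error=None):
        # Idempotent: a second call is ignored instead of overwriting state.
        if self.completed:
            return
        self.completed = True
        self.error = error


@contextmanager
def traced(name):
    """Context manager that alone is responsible for completing the span."""
    span = Span(name)
    try:
        yield span
    except Exception as exc:
        span.complete(error=repr(exc))
        raise  # the wrapper sees the exception but must not complete again
    else:
        span.complete()
```

With this shape, a wrapper function only needs to enter `with traced(...)` and let exceptions propagate.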
Bug 3: AttributeError in Trace Inputs
File: gates.py:1052
# BAD - ctx.macd_signal doesn't exist!
{"gate": 1, "has_macd": ctx.macd_signal is not None}
# GOOD - Use actual attribute
{"gate": 1, "has_momentum": ctx.momentum_signal is not None}
Bug 4: Wrong Schema for Extra Field
File: langsmith_tracer.py:405
# BAD - Nested structure
extra={"metadata": span.metadata}
# GOOD - Flat structure
extra=span.metadata
Bug 5: Empty Name Validation
Spans could have empty names, causing LangSmith API validation failures.
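A small guard like the following can normalize names before a span is sent. The helper name `safe_span_name` and the fallback value `"unnamed_span"` are assumptions for illustration; the original fix does not say which fallback was chosen.

```python
def safe_span_name(name, fallback="unnamed_span"):
    """Return a non-empty span name, substituting `fallback` for blank input."""
    cleaned = (name or "").strip()
    return cleaned if cleaned else fallback
```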
Bug 6: Inconsistent Project Names
Three different defaults across files: "trading-system", "igor-trading-system", "trading-rl-training"
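Keeping the default in one place prevents this kind of drift. The sketch below assumes a hypothetical shared helper `get_project_name` and assumes the `LANGCHAIN_PROJECT` environment variable is the intended override; the actual module layout is not shown in the original.

```python
import os

# Single definition of the default; other modules import it instead of
# declaring their own.
DEFAULT_PROJECT = "igor-trading-system"


def get_project_name(env=None):
    """Resolve the LangSmith project name from one place.

    `env` defaults to os.environ; passing a dict makes the function testable.
    """
    env = os.environ if env is None else env
    return env.get("LANGCHAIN_PROJECT") or DEFAULT_PROJECT
```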
Bug 7: Deprecated Datetime
Using datetime.utcnow() instead of timezone-aware datetime.now(timezone.utc)
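For reference, `datetime.utcnow()` returns a naive datetime (no `tzinfo`) and is deprecated as of Python 3.12. A tiny helper, here hypothetically named `utc_now`, keeps the timezone-aware call in one place:

```python
from datetime import datetime, timezone


def utc_now():
    """Current time as a timezone-aware UTC datetime."""
    return datetime.now(timezone.utc)
```

The aware value serializes with an explicit `+00:00` offset, so a consumer does not have to guess the timezone.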
Impact
- 68% of traces lost - Most debugging information never recorded
- Blind to failures - Couldn't diagnose trading issues
- False confidence - Thought observability was working
- Wasted LangSmith budget - Paying for traces that failed
The Fix
- Changed silent logger.debug to logger.warning for visibility
- Removed manual span.complete() - let context manager handle it
- Fixed attribute name: macd_signal → momentum_signal
- Flattened extra field schema
- Added fallback name for empty spans
- Unified project name to igor-trading-system
- Updated to timezone-aware datetime
- Added CI verification step before trading
Prevention Rules
- Test your observability - If you can't verify traces are working, assume they're not
- Never swallow exceptions silently - At minimum use logger.warning
- Add CI verification - verify_langsmith.py now runs before every trading session
- Check error rates - 68% should have triggered an alert
- Validate against schema - Test trace format before production
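The error-rate rule can be sketched as a small check over recent runs. The function names and the 5% threshold below are assumptions for illustration; the only thing assumed about each run is an `error` attribute that is empty or `None` on success.

```python
def error_rate(runs):
    """Fraction of runs with a non-empty `error` attribute (0.0 if no runs)."""
    runs = list(runs)
    if not runs:
        return 0.0
    failed = sum(1 for run in runs if getattr(run, "error", None))
    return failed / len(runs)


def check_error_rate(runs, threshold=0.05):
    """Return (ok, rate); ok is False when the rate exceeds the threshold."""
    rate = error_rate(runs)
    return rate <= threshold, rate
```

In practice the runs could come from something like `Client.list_runs(project_name=...)` in the LangSmith SDK, with a failed check wired to an alert or a failing CI step.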
Code Verification Pattern
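# verify_langsmith.py - Run before trading
import logging
from datetime import datetime, timezone

from langsmith import Client

logger = logging.getLogger(__name__)


def verify_langsmith() -> bool:
    """Send one test trace; log and return False if the call raises."""
    try:
        client = Client()
        now = datetime.now(timezone.utc)
        client.create_run(
            name="ci_verification_trace",
            inputs={"test": "verification"},
            outputs={"status": "ok"},
            run_type="chain",
            project_name="igor-trading-system",
            start_time=now,
            end_time=now,
            tags=["verification", "ci"],
        )
    except Exception:
        logger.warning("LangSmith verification failed", exc_info=True)
        return False
    return True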
Key Insight
"The observability system is the last place you expect bugs, but it's also the last place you'd notice them."
If your traces aren't working, you won't see the errors telling you your traces aren't working. This is a dangerous blind spot.
Tags
langsmith observability silent-failures debugging tracing ci-verification