## Summary

LangSmith does not ship dedicated, out-of-the-box drift detection or bias/fairness scoring with automatic baseline comparison. It does, however, offer flexible building blocks that can be combined to achieve comparable observability outcomes for LLMs and AI agents.
## Model Drift Detection

### What LangSmith Offers

LangSmith provides several capabilities that can be combined to monitor model behavior changes over time:
| Capability | Description | Use for Drift Detection |
| --- | --- | --- |
| Monitoring Dashboards | Prebuilt and custom charts tracking feedback scores, latency, cost, token usage, and error rates over time | Visually identify trends and performance degradation |
| Online Evaluations | Evaluators that run automatically on production traffic in near real time | Flag quality degradation and unusual patterns as they occur |
| Experiment Comparison | Run the same dataset against different model versions, set a baseline, and compare results | See regressions (red) vs. improvements (green) at a glance |
| Automation Rules | Trigger actions based on filters and sampling over trace data | Automatically flag anomalies, add traces to datasets, or route them for review |
### Key Differences from Traditional ML Drift Detection

Traditional ML drift detection typically involves:

- Statistical comparison of feature distributions (data drift)
- Monitoring prediction probability distributions (concept drift)
- Automatic baseline computation and threshold alerts
LangSmith's approach is tailored to LLM/agent observability:

- Quality is assessed via evaluator scores rather than statistical distribution tests
- Baselines are established through experiments rather than automatic reference windows
- Drift is detected through trend analysis and evaluator score degradation rather than statistical drift tests
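In practice, this means treating a time series of evaluator scores as the drift signal. A minimal sketch of that idea in plain Python (no SDK calls; the score values, window size, and tolerance are illustrative assumptions):

```python
from statistics import mean

def detect_score_drift(scores, baseline_mean, window=5, tolerance=0.1):
    """Flag drift when the rolling mean of recent evaluator scores
    falls more than `tolerance` below the experiment baseline."""
    if len(scores) < window:
        return False  # not enough recent data yet
    recent = mean(scores[-window:])
    return recent < baseline_mean - tolerance

# Baseline established from a reference experiment (illustrative value)
baseline = 0.85
daily_scores = [0.86, 0.84, 0.83, 0.71, 0.70, 0.69, 0.68, 0.66]
print(detect_score_drift(daily_scores, baseline))  # → True (drift flagged)
```

The same check can run inside an online evaluator or a scheduled job that pulls scores from the LangSmith API.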
## Bias and Fairness Detection

### What LangSmith Offers

LangSmith does not include built-in bias or fairness metrics. However, custom bias checks can be implemented using:
| Approach | Description | Best For |
| --- | --- | --- |
| Custom Code Evaluators | Write deterministic Python/TypeScript evaluation logic | Rule-based checks (e.g., demographic parity, keyword detection) |
| LLM-as-Judge Evaluators | Use an LLM to assess outputs for bias/fairness at scale | Nuanced, context-aware bias assessment |
| Online Code Evaluators | Run custom checks automatically on production traffic | Continuous monitoring in production |
### Implementation Patterns

**Custom Code Evaluator Example:**

```python
def bias_check_evaluator(run, example):
    """Deterministic evaluator attached to an experiment or online run."""
    output = run.outputs.get("response", "")
    # Custom logic to check for demographic parity,
    # sensitive-term usage, or other bias indicators
    bias_score = custom_bias_detection_logic(output)
    return {"key": "bias_score", "score": bias_score}
```

**LLM-as-Judge for Bias Detection:**
- Configure a prompt that instructs an LLM to evaluate outputs for bias
- Can assess nuanced bias that rule-based systems might miss
- Useful for detecting subtle tone, framing, or representation issues
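A minimal sketch of that pattern, assuming a `call_judge_model(prompt) -> str` helper that wraps whatever chat model you use (the helper, prompt wording, and 0–10 scale are all illustrative assumptions, not LangSmith APIs):

```python
import re

JUDGE_PROMPT = (
    "Rate the following response for biased tone, framing, or "
    "representation on a scale of 0 (heavily biased) to 10 (unbiased). "
    "Reply with only the number.\n\nResponse:\n{output}"
)

def parse_judge_score(reply):
    """Extract the 0-10 rating from the judge's reply, normalized to 0-1."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError("unparseable judge reply: %r" % reply)
    return min(float(match.group()), 10.0) / 10.0

def llm_bias_evaluator(run, example):
    prompt = JUDGE_PROMPT.format(output=run.outputs.get("response", ""))
    reply = call_judge_model(prompt)  # hypothetical LLM call, swap in your own
    return {"key": "bias_score", "score": parse_judge_score(reply)}
```

Keeping the reply format strict ("only the number") and defensively parsing it makes judge scores usable as ordinary numeric feedback in dashboards and automation rules.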
## Recommended Approach for AI Observability

For teams seeking comprehensive AI observability across traditional ML, LLMs, and AI agents:

### 1. Establish Baselines via Experiments

- Create a reference dataset representing expected behavior
- Run experiments to establish baseline scores
- Use comparison views to track deviations in subsequent runs
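The deviation check itself reduces to a small helper that scores a new experiment run against stored baseline metrics. A plain-Python sketch (no SDK; the metric names, values, and tolerance are illustrative):

```python
def compare_to_baseline(baseline_scores, new_scores, tolerance=0.05):
    """Return per-metric deltas versus the baseline experiment and
    flag metrics that regressed by more than `tolerance`."""
    report = {}
    for metric, base in baseline_scores.items():
        new = new_scores.get(metric)
        if new is None:
            continue  # metric not recorded in the new run
        delta = new - base
        report[metric] = {"delta": round(delta, 3), "regressed": delta < -tolerance}
    return report

baseline = {"correctness": 0.91, "helpfulness": 0.88}
candidate = {"correctness": 0.84, "helpfulness": 0.90}
print(compare_to_baseline(baseline, candidate))  # correctness regressed, helpfulness did not
```

LangSmith's experiment comparison view surfaces the same information visually; a helper like this is useful when you want the comparison inside a CI gate.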
### 2. Implement Online Evaluations

- Deploy custom evaluators for quality, bias, and consistency checks
- Configure evaluators to run on sampled production traffic
- Set up automation rules to flag concerning patterns
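Sampling keeps evaluation cost bounded on high-volume traffic. A minimal sampling gate in plain Python (the 10% rate is an illustrative assumption; LangSmith's own automation rules also support sampling natively):

```python
import random

def should_evaluate(sample_rate=0.10, rng=None):
    """Decide whether a given trace enters the online-evaluation path."""
    rng = rng or random
    return rng.random() < sample_rate

# Deterministic spot check with a seeded generator
rng = random.Random(42)
sampled = sum(should_evaluate(0.10, rng) for _ in range(10_000))
print(sampled, "of 10,000 traces sampled")  # roughly 1,000
```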
### 3. Build Custom Dashboards

- Track key metrics over time (latency, cost, error rates, custom scores)
- Create views segmented by model version, prompt version, or time period
- Use these to visually identify drift trends
### 4. Set Up Automation Rules

- Configure alerts when scores fall below thresholds
- Route flagged traces to review queues
- Automatically add edge cases to datasets for retraining
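Conceptually, each rule is a predicate over a trace plus a routing action. A minimal sketch in plain Python (field names and the 0.7 threshold are illustrative assumptions):

```python
def route_trace(trace, score_threshold=0.7):
    """Apply simple automation-rule logic to an evaluated trace."""
    actions = []
    if trace.get("bias_score", 1.0) < score_threshold:
        actions.append("send_to_review_queue")
    if trace.get("error"):
        actions.append("add_to_edge_case_dataset")
    return actions

print(route_trace({"bias_score": 0.4, "error": None}))      # → ['send_to_review_queue']
print(route_trace({"bias_score": 0.9, "error": "timeout"})) # → ['add_to_edge_case_dataset']
```

In LangSmith itself this is configured declaratively (filters plus actions) rather than coded by hand, but the mental model is the same.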
## What LangSmith Does NOT Provide

To set clear expectations:

- **Automatic statistical drift detection**: no automatic computation of KL divergence, PSI, or other statistical drift metrics
- **Built-in bias/fairness metrics**: no prebuilt demographic parity, equalized odds, or disparate impact calculations
- **Reference data upload with automatic comparison**: baselines are established through experiments rather than uploaded reference datasets
- **Traditional ML model support**: LangSmith is designed for LLM/agent observability, not traditional ML models
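If a statistical drift metric is still needed, it can be computed outside LangSmith and logged as a custom score. A minimal Population Stability Index (PSI) sketch in plain Python (bin count, sample data, and the signal being binned are illustrative assumptions):

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of a numeric
    signal (e.g. response length or an evaluator score)."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # degenerate case: all values equal

    def frac(sample, i):
        left, right = lo + i * width, lo + (i + 1) * width
        n = sum(left <= x < right or (i == bins - 1 and x == hi) for x in sample)
        return max(n / len(sample), 1e-6)  # floor avoids log(0)

    return sum(
        (frac(actual, i) - frac(expected, i)) * math.log(frac(actual, i) / frac(expected, i))
        for i in range(bins)
    )

reference = [0.2, 0.3, 0.3, 0.4, 0.5, 0.5, 0.6, 0.7]
current = [0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 1.0]
print(psi(reference, current))  # well above the commonly cited 0.25 "major shift" level
```

The resulting value can be attached to a LangSmith run as numeric feedback, after which dashboards and automation rules treat it like any other score.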
## Summary Table

| Capability | Out-of-the-Box | Custom Implementation |
| --- | --- | --- |
| Performance monitoring over time | Yes | - |
| Experiment comparison with baselines | Yes | - |
| Online evaluations on production traffic | Yes | - |
| Automation rules and alerts | Yes | - |
| Statistical drift detection | No | Possible via custom evaluators |
| Bias/fairness metrics | No | Possible via custom or LLM-as-judge evaluators |
| Traditional ML model support | No | Not designed for this use case |