Overview
This guide collects best practices for scaling and productionizing a self-hosted LangSmith instance, so that the deployment handles production workloads efficiently and reliably.
1. Scaling Recommendations
Pod Replicas and Database Sizing
Configure the number of pod replicas and database resources based on your expected read/write throughput.
Reference: LangSmith Self-Host Scaling Documentation
This documentation covers:
Recommended number of pod replicas for different throughput levels
Database sizing guidelines
Performance optimization strategies
2. Autoscaling Configuration
Set up autoscalers for the LangSmith pods so that they scale up or down with request load. Choose minimum and maximum replica counts from the scaling documentation referenced in Section 1, based on your volume.
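For orientation, CPU-based autoscaling in Kubernetes is normally expressed as a HorizontalPodAutoscaler resource. The manifest below is a minimal sketch of such a resource using the standard `autoscaling/v2` API. The Deployment name `langsmith-backend` and the namespace `langsmith` are hypothetical placeholders, not names confirmed by this guide; whether your Helm chart version creates an equivalent object for you should be checked against its values file.

```yaml
# Minimal sketch of a Kubernetes HorizontalPodAutoscaler (autoscaling/v2).
# The Deployment name and namespace are hypothetical; substitute your own.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: langsmith-backend
  namespace: langsmith
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: langsmith-backend
  minReplicas: 2            # floor kept running even when idle
  maxReplicas: 10           # ceiling under peak load
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the pods' CPU request
```

CPU utilization targets are computed relative to each container's CPU request, so the target pods need CPU requests set, and the cluster needs a metrics source such as metrics-server for the autoscaler to act.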
Key Components to Autoscale
LangSmith Backend
Configure autoscaling for the main LangSmith backend service.
Configuration: LangSmith Backend Values
# Example autoscaling configuration
langsmith:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

Platform Backend
Configure autoscaling for the platform backend service.
Configuration: Platform Backend Values
# Example autoscaling configuration
platformBackend:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 8
    targetCPUUtilizationPercentage: 70

Queue Service
Configure autoscaling for the queue service to handle background processing.
Configuration: Queue Values
# Example autoscaling configuration
queue:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 6
    targetCPUUtilizationPercentage: 70

3. External Datastores (Highly Recommended)
Instead of managing Kubernetes StatefulSets for your datastores, use managed external services for better reliability and easier maintenance.
External Redis
Use a managed Redis service such as AWS ElastiCache, running open-source (OSS) Redis rather than Redis Enterprise.
Benefits:
Better reliability and availability
Automated backups and failover
Reduced operational overhead
No Kubernetes StatefulSet management
Documentation: External Redis Setup
Recommended Providers:
AWS ElastiCache (OSS Redis)
Azure Cache for Redis
Google Cloud Memorystore
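Pointing the deployment at a managed Redis usually comes down to a small Helm values override. The fragment below is a sketch only: the key names under `redis.external` are assumptions about the chart's layout, and the connection URL is left as a placeholder because its exact format is defined in the External Redis Setup documentation.

```yaml
# Sketch of a Helm values override for an external Redis.
# Key names are assumptions; confirm them against your chart's values file.
redis:
  external:
    enabled: true
    # Placeholder: use the URL format given in the External Redis Setup docs
    connectionUrl: "<your-redis-connection-url>"
```

If your chart version supports referencing an existing Kubernetes Secret, prefer that over placing the connection URL directly in the values file.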
External PostgreSQL
Use a managed PostgreSQL service like AWS RDS for your primary database.
Benefits:
Automated backups and point-in-time recovery
High availability with multi-AZ deployments
Automated maintenance and patching
Better performance monitoring
Documentation: External Postgres Setup
Recommended Providers:
AWS RDS for PostgreSQL
Azure Database for PostgreSQL
Google Cloud SQL for PostgreSQL
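The switch to a managed PostgreSQL instance can be sketched as a Helm values override in the same style. The key names under `postgres.external` are assumptions about the chart's layout, and the connection URL is a placeholder whose format is defined in the External Postgres Setup documentation.

```yaml
# Sketch of a Helm values override for an external PostgreSQL database.
# Key names are assumptions; confirm them against your chart's values file.
postgres:
  external:
    enabled: true
    # Placeholder: use the URL format given in the External Postgres Setup docs
    connectionUrl: "<your-postgres-connection-url>"
```

As with any database credential, keeping the URL in a Kubernetes Secret rather than in plain values is the safer choice where the chart allows it.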
4. ClickHouse Configuration
High Availability Replicated Deployment
Set up a highly available (HA), replicated ClickHouse deployment in your Kubernetes cluster, using ZooKeeper for coordination.
Benefits:
Data redundancy and fault tolerance
Improved query performance with distributed reads
Zero downtime during maintenance
Instructions: Replicated ClickHouse Setup
Enable Blob Storage
Enable blob storage to offload large trace payloads from ClickHouse. This relieves pressure on the database without sacrificing query performance.
Benefits:
Reduced ClickHouse storage requirements
Lower database costs
Maintains query performance
Better separation of hot and cold data
Documentation: Blob Storage Setup
Recommended Providers:
AWS S3
Azure Blob Storage
Google Cloud Storage
MinIO (self-hosted)
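As a rough illustration, enabling blob storage against AWS S3 might look like the values fragment below. Every key name under `config.blobStorage` is an assumption about the chart's layout, and the bucket name is hypothetical; the Blob Storage Setup documentation is the authority on the actual keys and on how credentials are supplied.

```yaml
# Sketch of a Helm values override enabling S3-backed blob storage.
# Key names are assumptions; confirm them against the Blob Storage Setup docs.
config:
  blobStorage:
    enabled: true
    engine: "S3"
    bucketName: "my-langsmith-traces"               # hypothetical bucket name
    apiURL: "https://s3.us-west-2.amazonaws.com"    # regional S3 endpoint; adjust region
```

For MinIO or another S3-compatible store, the endpoint URL would point at that service instead of an AWS regional endpoint.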
5. Observability and Monitoring
Set up comprehensive observability on your LangSmith instance to gather logs, metrics, and traces for troubleshooting and alerting.
Documentation: Export Backend Observability
Key Observability Components
Logs
Application logs from all services
Error tracking and aggregation
Centralized log management
Metrics
Request rates and latencies
Resource utilization (CPU, memory, disk)
Queue depths and processing times
Database performance metrics
Traces
Distributed tracing across services
Request flow visualization
Performance bottleneck identification
Critical Alerts to Configure
High error rates
Service downtime
Database connection pool exhaustion
High queue depths
Storage capacity thresholds
Unusual latency spikes
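Two of the alerts above (service downtime and high error rates) can be sketched as Prometheus alerting rules. This assumes Prometheus is your metrics backend, which this guide does not prescribe. The job label `langsmith-backend` and the metric `http_requests_total` with a `status` label are hypothetical; replace them with the names your exporters actually emit. Only `up` is a metric Prometheus generates itself for every scrape target.

```yaml
# Sketch of a Prometheus rule file. Job and metric names are hypothetical.
groups:
  - name: langsmith-alerts
    rules:
      # Fires when the scrape target has been unreachable for 5 minutes
      - alert: LangSmithServiceDown
        expr: up{job="langsmith-backend"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "LangSmith backend target is down"

      # Fires when more than 5% of requests return 5xx for 10 minutes
      - alert: LangSmithHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="langsmith-backend", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="langsmith-backend"}[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of backend requests are failing"
```

The thresholds and `for` durations are illustrative starting points; tune them against your own baseline traffic to avoid noisy pages.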
Quick Reference Checklist
Configure pod replicas based on throughput requirements
Enable autoscaling for LangSmith Backend
Enable autoscaling for Platform Backend
Enable autoscaling for Queue service
Migrate to external managed Redis (e.g., AWS ElastiCache)
Migrate to external managed PostgreSQL (e.g., AWS RDS)
Set up HA replicated ClickHouse with ZooKeeper
Enable blob storage for trace data
Configure log collection and aggregation
Set up metrics monitoring and dashboards
Enable distributed tracing
Configure alerting rules and notifications