Summary
This guide provides essential best practices and recommendations for scaling a self-hosted LangSmith instance to handle production-level workloads. It covers key areas including pod autoscaling, database management, high availability for data stores, and implementing observability to ensure a robust and reliable deployment.
Environment
* Product: LangSmith (Self-Hosted)
* Deployment: Kubernetes (via Helm)
* Databases: PostgreSQL, Redis, ClickHouse
* Cloud Services: Recommendations include managed services like AWS RDS/ElastiCache, but the principles apply to other cloud providers (e.g., GCP, Azure).
Scaling and productionizing recommendations
To ensure your self-hosted LangSmith instance is stable, scalable, and ready for high throughput, consider implementing the following best practices.
Pod Autoscaling
Configure a Horizontal Pod Autoscaler (HPA) for key LangSmith components to automatically scale the number of replicas up or down based on CPU/memory utilization or request load. This ensures performance during traffic spikes and cost efficiency during quiet periods.
The primary components to configure for autoscaling are:
* LangSmith Backend: Manages core application logic. Configuration can be found in values.yaml.
* Platform Backend: Handles user authentication, organization management, and other platform services. Configuration can be found in values.yaml.
* Queue: Processes background tasks and data ingestion. Configuration can be found in values.yaml.
Refer to the official scaling documentation for recommendations on replica counts based on your expected trace volume.
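As one concrete illustration, a standard Kubernetes HorizontalPodAutoscaler (`autoscaling/v2`) targeting the backend could look like the sketch below. The Deployment name, replica bounds, and CPU threshold are illustrative assumptions, not chart defaults; the LangSmith Helm chart also exposes autoscaling settings directly in values.yaml, which is the preferred place to configure this.

```yaml
# Hypothetical HPA for the LangSmith backend Deployment.
# The Deployment name ("langsmith-backend") and the thresholds below are
# illustrative assumptions -- check your Helm release for actual names.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: langsmith-backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: langsmith-backend
  minReplicas: 2          # keep capacity for baseline traffic
  maxReplicas: 10         # cap scale-out during spikes
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```

The same pattern applies to the platform backend and queue components, with thresholds tuned to each workload's profile (the queue is typically the first to need more replicas as trace volume grows).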
Use External Managed Databases
For stateful services like PostgreSQL and Redis, it is highly recommended to use external, managed database services instead of running them as StatefulSets within your Kubernetes cluster. Managed services offer better reliability, scalability, and simplified maintenance.
External PostgreSQL: Use a managed service like AWS RDS or an equivalent from your cloud provider. This provides automated backups, scaling, and high availability.
Setup Guide: Self-Hosting with an External PostgreSQL Instance.
External Redis: Use a managed service like AWS ElastiCache or an equivalent. This is critical for caching and queueing performance.
Setup Guide: Self-Hosting with an External Redis Instance.
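A hedged sketch of what wiring in external databases might look like in values.yaml is shown below. The key names and structure here are assumptions for illustration only; the exact schema is defined in the chart's values.yaml and the linked setup guides.

```yaml
# Illustrative values.yaml fragment for pointing LangSmith at managed
# databases. Key names are assumptions -- consult the chart's
# values.yaml and the linked setup guides for the exact schema.
postgres:
  external:
    enabled: true
    # Hypothetical RDS endpoint; store real credentials in a Secret,
    # not in plain values files.
    connectionUrl: "postgres://user:password@my-rds-host:5432/langsmith"
redis:
  external:
    enabled: true
    # Hypothetical ElastiCache endpoint.
    connectionUrl: "redis://my-elasticache-host:6379"
```

In practice, connection strings should be referenced from Kubernetes Secrets rather than committed to a values file.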
High Availability for ClickHouse
ClickHouse stores trace and analytics data and is critical for the LangSmith UI's performance. To prevent data loss and downtime, set up a replicated, highly available (HA) ClickHouse deployment within your Kubernetes cluster. This configuration typically uses ZooKeeper (or ClickHouse Keeper) for coordination between replicas.
Implementation Guide: Replicated ClickHouse Example
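The general shape of such a configuration is sketched below; the key names and counts are assumptions, and the linked replicated ClickHouse example should be treated as the working reference.

```yaml
# Illustrative shape of a replicated ClickHouse setup in values.yaml.
# Key names are assumptions -- see the Replicated ClickHouse example
# linked above for a working configuration.
clickhouse:
  replicaCount: 3        # odd replica count so coordination can reach quorum
  zookeeper:
    enabled: true        # replicas coordinate writes via ZooKeeper
    replicaCount: 3      # ZooKeeper ensemble also needs an odd size
```

The important properties are an odd number of coordination nodes for quorum and persistent volumes for each ClickHouse replica.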
Enable Blob Storage
To reduce the storage load and pressure on your ClickHouse database, configure LangSmith to use external blob storage (e.g., AWS S3, GCS, Azure Blob Storage). This offloads large payloads (like inputs and outputs) from the database to a more cost-effective and scalable storage layer without sacrificing performance.
Configuration Guide: Configuring Blob Storage for Self-Hosted LangSmith
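For S3, the configuration might resemble the following fragment. The key names and the bucket/endpoint values are assumptions for illustration; the linked configuration guide documents the exact settings.

```yaml
# Illustrative blob-storage fragment (key names are assumptions).
# Large trace payloads are written to object storage instead of ClickHouse.
config:
  blobStorage:
    enabled: true
    bucketName: "my-langsmith-traces"           # hypothetical bucket
    apiUrl: "https://s3.us-east-1.amazonaws.com" # hypothetical regional endpoint
    # Prefer IAM roles or Secrets over inline credentials:
    accessKey: "<AWS_ACCESS_KEY_ID>"
    accessKeySecret: "<AWS_SECRET_ACCESS_KEY>"
```

Equivalent settings exist for GCS and Azure Blob Storage; only the endpoint and credential mechanism differ.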
Implement Observability
You cannot effectively manage what you cannot measure. Set up observability on your LangSmith instance to collect logs, metrics, and traces from the LangSmith components themselves. This is crucial for troubleshooting issues, monitoring performance, and configuring alerts for potential problems.
Observability Guide: Exporting Backend Traces, Metrics, and Logs
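If the components export telemetry over OTLP, a typical pattern is to inject standard OpenTelemetry exporter environment variables into each pod so data flows to a collector. The collector address and whether each LangSmith component honors these variables are assumptions here; the linked observability guide documents the supported mechanism.

```yaml
# Illustrative: standard OpenTelemetry OTLP exporter environment
# variables added to a component's pod spec. Whether each LangSmith
# component reads these is an assumption -- the linked observability
# guide documents the supported export mechanism.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.observability:4317"  # hypothetical collector
  - name: OTEL_SERVICE_NAME
    value: "langsmith-backend"                         # per-component name
```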
References
General Scaling: Scaling Self-Hosted LangSmith
Helm Chart Values: LangSmith Helm Chart values.yaml
External Postgres: Self-Hosting with an External PostgreSQL Instance
External Redis: Self-Hosting with an External Redis Instance
Replicated ClickHouse: Replicated ClickHouse Example
Blob Storage: Configuring Blob Storage for Self-Hosted LangSmith
Observability Guide: Exporting Backend Traces, Metrics, and Logs