When upgrading self-hosted LangSmith via helm upgrade, you may experience extended unavailability due to:
Container image pulls during the rollout (especially with private registries or air-gapped environments)
All pods being terminated before new ones are ready
Database migrations running sequentially across large version jumps
This guide covers three strategies to minimize or eliminate downtime during upgrades.
Prerequisites
kubectl access to your cluster
Your current values.yaml
The target chart version (check its AppVersion with helm show chart langchain/langsmith --version <version>)
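The AppVersion can be read from the chart metadata that helm show chart prints. The function below is a minimal, hypothetical helper (it is not part of Helm or the LangSmith chart) and assumes the output contains a Chart.yaml-style appVersion: line:

```shell
# Hypothetical helper (not part of Helm or the LangSmith chart):
# read Chart.yaml text on stdin and print its appVersion, dropping double quotes.
chart_app_version() {
  awk -F': *' '$1 == "appVersion" { gsub(/"/, "", $2); print $2; exit }'
}

# Usage (assumes the langchain Helm repo has already been added):
#   helm show chart langchain/langsmith --version <version> | chart_app_version
```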
1. Enable PodDisruptionBudgets
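PodDisruptionBudgets (PDBs) prevent Kubernetes from voluntarily evicting all pods of a service at once, for example when nodes are drained or scaled down while an upgrade is in progress. They do not govern the Deployment rolling update itself, which follows each Deployment's update strategy. Add the following to your values.yaml: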
backend:
  pdb:
    enabled: true
    minAvailable: 1
frontend:
  pdb:
    enabled: true
    minAvailable: 1
queue:
  pdb:
    enabled: true
    minAvailable: 1
ingestQueue:
  pdb:
    enabled: true
    minAvailable: 1
platformBackend:
  pdb:
    enabled: true
    minAvailable: 1
aceBackend:
  pdb:
    enabled: true
    minAvailable: 1

Note: PDBs require replicas >= 2 for each service. With a single replica, minAvailable: 1 leaves zero allowed disruptions, so the PDB blocks every voluntary eviction of that pod (for example, a kubectl drain of its node) instead of protecting availability.
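To confirm that no budget is currently blocking evictions, you can list each PDB's allowed disruptions from its status.disruptionsAllowed field. This is a sketch: the helper name zero_disruption_pdbs is hypothetical, and the kubectl pipeline in the usage comment assumes the budgets live in <namespace>:

```shell
# Hypothetical helper: read "NAME ALLOWED" rows on stdin and print the PDBs
# whose allowed disruptions is 0 (these block voluntary evictions).
zero_disruption_pdbs() {
  awk '$2 == "0" { print $1 }'
}

# Usage against a live cluster:
#   kubectl get pdb -n <namespace> --no-headers \
#     -o custom-columns=NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#     | zero_disruption_pdbs
```

Any name it prints usually belongs to a service still running a single replica.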
2. Pre-Pull Images Before Upgrading
Image downloads can account for a significant portion of upgrade time. You can eliminate this by caching all new images on every node before running helm upgrade.
Image tags
Most LangSmith services share the same image tag (the chart's AppVersion). The exceptions are:
| Service | Tag |
| --- | --- |
| All LangSmith services (…) | Chart AppVersion |
| … | Independent tag (see …) |
| … | Pinned tags in … |
DaemonSet template
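Deploy a DaemonSet that runs one init container per image. Each init container only needs its image to be pulled and then exits immediately; the single long-running container keeps the pod Running so that kubectl rollout status can report when every node is done. Adapt the registry and image paths to match your values.yaml: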
# prepull-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: langsmith-image-prepull
  namespace: <namespace>
spec:
  selector:
    matchLabels:
      app: langsmith-prepull
  template:
    metadata:
      labels:
        app: langsmith-prepull
    spec:
      imagePullSecrets:
        - name: <pull-secret>  # must match your imagePullSecrets
      initContainers:
        - name: pull-backend
          image: <registry>/langchain/langsmith-backend:<NEW_VERSION>
          command: ["true"]
        - name: pull-frontend
          image: <registry>/langchain/langsmith-frontend:<NEW_VERSION>
          command: ["true"]
        - name: pull-go-backend
          image: <registry>/langchain/langsmith-go-backend:<NEW_VERSION>
          command: ["true"]
        - name: pull-ace-backend
          image: <registry>/langchain/langsmith-ace-backend:<NEW_VERSION>
          command: ["true"]
        - name: pull-playground
          image: <registry>/langchain/langsmith-playground:<NEW_VERSION>
          command: ["true"]
        - name: pull-clio
          image: <registry>/langchain/langsmith-clio:<NEW_VERSION>
          command: ["true"]
        - name: pull-polly
          image: <registry>/langchain/langsmith-polly:<NEW_VERSION>
          command: ["true"]
        - name: pull-langserve-backend
          image: <registry>/langchain/hosted-langserve-backend:<NEW_VERSION>
          command: ["true"]
        - name: pull-tool-server
          image: <registry>/langchain/agent-builder-tool-server:<NEW_VERSION>
          command: ["true"]
        - name: pull-trigger-server
          image: <registry>/langchain/agent-builder-trigger-server:<NEW_VERSION>
          command: ["true"]
        - name: pull-deep-agent
          image: <registry>/langchain/agent-builder-deep-agent:<NEW_VERSION>
          command: ["true"]
        - name: pull-operator
          image: <registry>/langchain/langgraph-operator:<OPERATOR_VERSION>
          command: ["/manager"]
      containers:
        - name: pause
          image: <registry>/library/redis:7  # or any lightweight image already cached
          command: ["sleep", "infinity"]
Note on the operator image: The langgraph-operator image is a minimal Go binary and does not include /bin/sh or true. Use command: ["/manager"] for this container. If any other image also fails with command: ["true"], try command: ["/bin/sh", "-c", "exit 0"] instead.
Running the pre-pull
# 1. Deploy the DaemonSet and wait for all images to be cached on every node
kubectl apply -f prepull-daemonset.yaml
kubectl rollout status daemonset/langsmith-image-prepull -n <namespace> --timeout=600s
# 2. Once all pods are Running, proceed with the upgrade
helm upgrade <release> langchain/langsmith --version <new-version> --values values.yaml --namespace <namespace>
# 3. Clean up the pre-pull DaemonSet
kubectl delete daemonset langsmith-image-prepull -n <namespace>
You can reuse this DaemonSet for every upgrade by updating the image tags.
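One way to avoid hand-editing the tags each time is to keep the manifest as a template that still contains the <NEW_VERSION> and <OPERATOR_VERSION> placeholders and render it per upgrade. A minimal sketch using sed; the function name render_prepull and the template file name are hypothetical:

```shell
# Hypothetical helper: substitute the version placeholders in a pre-pull
# template and write the rendered manifest to stdout.
# Arguments: <template-file> <new-version> <operator-version>
render_prepull() {
  template=$1 new_version=$2 operator_version=$3
  sed -e "s/<NEW_VERSION>/${new_version}/g" \
      -e "s/<OPERATOR_VERSION>/${operator_version}/g" \
      "$template"
}

# Usage (file name is an assumption):
#   render_prepull prepull-daemonset.tmpl.yaml <new-version> <operator-version> > prepull-daemonset.yaml
```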
3. Monitor Migrations During Upgrade
Database migrations run as Kubernetes Jobs and are typically the main source of remaining downtime after image pulls are eliminated. Monitor their progress:
kubectl get jobs -n <namespace> -w
kubectl logs job/<release>-langsmith-backend-migrations -n <namespace> -f
kubectl logs job/<release>-langsmith-backend-ch-migrations -n <namespace> -f
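If you prefer a blocking check over watching the output, kubectl wait can wait for each Job's Complete condition. This is a sketch that assumes the Job names follow the pattern shown above and that the Jobs already exist when it runs (kubectl wait fails immediately on a missing Job); the function names and the 30m default timeout are hypothetical choices:

```shell
# Hypothetical helper: print the migration Job names for a Helm release,
# following the <release>-langsmith-backend-*migrations pattern.
migration_jobs() {
  release=$1
  printf '%s\n' "${release}-langsmith-backend-migrations" \
                "${release}-langsmith-backend-ch-migrations"
}

# Wait until every migration Job reports Complete; return non-zero on the
# first Job that fails or times out. The 30m default is an assumption.
wait_for_migrations() {
  release=$1 namespace=$2 wait_timeout=${3:-30m}
  for job in $(migration_jobs "$release"); do
    kubectl wait --for=condition=complete "job/${job}" \
      -n "$namespace" --timeout="$wait_timeout" || return 1
  done
}

# Usage:
#   wait_for_migrations <release> <namespace> 45m
```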
Additional Recommendations
Upgrade in smaller increments: Jump 2-3 minor versions at a time rather than spanning many versions in a single upgrade. Each jump runs fewer migrations, reducing both the migration window and rollback risk.
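A stepwise upgrade can be scripted so that each hop finishes before the next one starts. The sketch below is hypothetical: it assumes you have already chosen the intermediate chart versions (helm search repo langchain/langsmith --versions lists them) and uses helm upgrade --wait with an assumed 30m timeout per hop:

```shell
# Hypothetical helper: upgrade a release through several chart versions in
# order, stopping at the first hop that fails.
# Arguments: <release> <namespace> <values-file> <version>...
upgrade_stepwise() {
  release=$1 namespace=$2 values=$3
  shift 3
  for version in "$@"; do
    echo "Upgrading ${release} to chart version ${version}"
    # --wait blocks until the release's resources are ready; 30m is an assumption.
    helm upgrade "$release" langchain/langsmith --version "$version" \
      --namespace "$namespace" --values "$values" \
      --wait --timeout 30m || return 1
  done
}

# Usage:
#   upgrade_stepwise <release> <namespace> values.yaml <version-1> <version-2>
```

Stopping at the first failure keeps the release on the last version that upgraded cleanly, which narrows any rollback to a single hop.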
Check your ClickHouse pullPolicy: If it is set to Always, the kubelet contacts the registry every time the container starts, even when the image is already cached on the node. This adds latency and fails outright if the registry is unreachable (for example, in air-gapped environments). Consider using IfNotPresent with a pinned version tag, or include the ClickHouse image in the pre-pull DaemonSet.
Test in a staging environment first: If possible, run the upgrade on a non-production cluster to measure migration duration and catch issues before upgrading production.
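To find every pod that will contact the registry on each restart, not just ClickHouse, you can list container pull policies with kubectl custom columns. A minimal sketch; the helper name always_pull_pods is hypothetical:

```shell
# Hypothetical helper: read "POD IMAGES POLICIES" rows on stdin and print the
# pods where any container uses imagePullPolicy Always.
always_pull_pods() {
  awk '$3 ~ /Always/ { print $1 ": " $2 }'
}

# Usage:
#   kubectl get pods -n <namespace> --no-headers \
#     -o custom-columns='POD:.metadata.name,IMAGES:.spec.containers[*].image,POLICIES:.spec.containers[*].imagePullPolicy' \
#     | always_pull_pods
```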