Context
A LangSmith deployment may enter a "zombie state" where the /docs endpoint remains accessible and the service continues to accept user input, but it stops producing responses. When the service is restarted, all previously queued threads are processed, even though they correspond to past requests that are no longer relevant. This issue typically occurs due to infrastructure bottlenecks in self-hosted LangGraph environments.
Answer
This zombie state stems from an architectural split: the /docs endpoint is served by the API layer (FastAPI), where it effectively acts as a health check, while actual graph execution happens in background workers that consume from Redis queues. The workers can fail while the API layer remains healthy, so a reachable /docs endpoint does not show that runs are being processed.
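Because a reachable /docs endpoint only shows that the API layer is up, a check on run progress can be more telling. The following is a minimal sketch, not part of LangGraph: find_stalled_runs is a hypothetical helper, and it assumes each run is a dict with a run_id, a status string, and a timezone-aware ISO 8601 created_at timestamp. The 5-minute threshold is illustrative only.

```python
from datetime import datetime, timedelta, timezone


def find_stalled_runs(runs, now, max_pending=timedelta(minutes=5)):
    """Return runs that have stayed 'pending' longer than max_pending.

    Assumptions (not taken from the LangGraph docs): each run is a dict
    with 'run_id', 'status', and a timezone-aware ISO 8601 'created_at'
    string; 'now' is a timezone-aware datetime.
    """
    stalled = []
    for run in runs:
        # Only runs still waiting for a worker are of interest here.
        if run["status"] != "pending":
            continue
        created = datetime.fromisoformat(run["created_at"])
        if now - created > max_pending:
            stalled.append(run)
    return stalled
```

If this returns a non-empty list while /docs still responds, the workers have probably stopped consuming the queue even though the API layer looks healthy.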
Root causes include:
Redis disk/memory exhaustion
PostgreSQL connection pool exhaustion
Workers crashing or hanging silently
Client disconnections without proper cancellation
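Stale requests persist because both Redis (task queue) and PostgreSQL (checkpoints/threads) store data durably with exactly-once semantics, and threads have no expiration by default. Queued runs therefore survive the outage and are processed as soon as the workers come back after a restart.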
To resolve this issue:
Upgrade LangGraph to the latest version, which includes worker cancellation fixes and PostgreSQL connection pool improvements.
Reduce PostgreSQL connection pool size for Azure environments:
    LANGGRAPH_POSTGRES_POOL_MAX_SIZE=50  # Default is 150
    ASYNCPG_POOL_MIN_SIZE=1  # Reduce idle connections
Enable automatic run cancellation on client disconnect:
    async for chunk in client.runs.stream(
        thread_id=thread_id,
        assistant_id="agent",
        input={"messages": [{"role": "human", "content": "Hello"}]},
        on_disconnect="cancel",  # Enables cancellation on disconnect
    ):
        print(chunk.event, chunk.data)  # Handle each streamed event
Isolate background job event loops:
    BG_JOB_ISOLATED_LOOPS=true  # Prevents sync operations from blocking workers
Enable resumable streams for timeout handling:
    RESUMABLE_STREAM_TTL_SECONDS=300  # 5 minutes (default is 120s)
Handle stale runs with interrupt strategy:
    run = await client.runs.create(
        thread_id,
        assistant_id,
        input={"messages": [...]},
        multitask_strategy="interrupt",  # Cancels any previous pending run
    )
As a last resort, clear Redis data:
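    # Connect to your Redis instance and run the command below.
    # It deletes every key in the currently selected database, including queued runs.
    FLUSHDB
Then restart the service after applying configuration changes.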
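Before resorting to FLUSHDB, it may be enough to cancel only the runs that have gone stale. The sketch below is one way to do that, under stated assumptions: the client exposes runs.list(thread_id) and runs.cancel(thread_id, run_id) as the LangGraph SDK client does, and each run is a dict with run_id, status, and created_at. The helper name cancel_stale_runs and the cutoff value are hypothetical, and pagination of the run list is omitted for brevity.

```python
from datetime import datetime, timezone


async def cancel_stale_runs(client, thread_id, cutoff):
    """Cancel pending runs on thread_id that were created before cutoff.

    Assumptions: client.runs.list and client.runs.cancel are awaitable,
    and each run is a dict with 'run_id', 'status', and 'created_at'
    (either a timezone-aware datetime or an ISO 8601 string).
    Returns the IDs of the runs that were cancelled.
    """
    cancelled = []
    for run in await client.runs.list(thread_id):
        if run["status"] != "pending":
            continue
        created = run["created_at"]
        if isinstance(created, str):
            created = datetime.fromisoformat(created)
        # Runs queued before the cutoff are treated as no longer relevant.
        if created < cutoff:
            await client.runs.cancel(thread_id, run["run_id"])
            cancelled.append(run["run_id"])
    return cancelled
```

Called as await cancel_stale_runs(client, thread_id, cutoff) with a timezone-aware cutoff such as the time the outage began, this leaves newer runs and all checkpoints untouched, which is the main reason to try it before flushing Redis.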