Context
When running LangGraph agents with long-running async operations, particularly in production environments like Kubernetes, users may encounter CancelledError exceptions. These errors commonly occur during extended processing periods or when deploying new revisions.
Answer
There are several configuration settings you can adjust to handle long-running operations and prevent CancelledError exceptions:
Set the background job timeout using the environment variable:
BG_JOB_TIMEOUT_SECS - Extends the default timeout period for long-running operations
Configure isolated event loops:
BG_JOB_ISOLATED_LOOPS=true - Helps prevent event loop conflicts
Set a grace period for job shutdown:
BG_JOB_SHUTDOWN_GRACE_PERIOD_SECS - Determines how long jobs have to complete when a new revision is deployed (recommended value between 240 and 1200 seconds, depending on your typical run duration)
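As a rough illustration, the three settings above could be assembled and sanity-checked before being written into a deployment manifest. This is only a sketch: the helper names (build_bg_job_env, in_recommended_grace_range) and the example values (7200 and 600 seconds) are hypothetical and not part of LangGraph; only the variable names and the 240-1200 second recommendation come from this article.

```python
from __future__ import annotations

# Recommended bounds for BG_JOB_SHUTDOWN_GRACE_PERIOD_SECS, taken from the
# recommendation in this article (seconds).
RECOMMENDED_GRACE_RANGE = (240, 1200)


def build_bg_job_env(
    timeout_secs: int,
    grace_period_secs: int,
    isolated_loops: bool = True,
) -> dict[str, str]:
    """Hypothetical helper: return the env vars as name -> string value."""
    if timeout_secs <= 0 or grace_period_secs <= 0:
        raise ValueError("timeout and grace period must be positive")
    env = {
        "BG_JOB_TIMEOUT_SECS": str(timeout_secs),
        "BG_JOB_SHUTDOWN_GRACE_PERIOD_SECS": str(grace_period_secs),
    }
    # Only the "true" value appears in this article, so the variable is
    # simply left unset when isolated loops are not wanted.
    if isolated_loops:
        env["BG_JOB_ISOLATED_LOOPS"] = "true"
    return env


def in_recommended_grace_range(grace_period_secs: int) -> bool:
    """Check a grace period against the recommended 240-1200 second range."""
    low, high = RECOMMENDED_GRACE_RANGE
    return low <= grace_period_secs <= high


# Illustrative values only: a 2-hour job timeout and a 10-minute grace period.
env = build_bg_job_env(timeout_secs=7200, grace_period_secs=600)
for name, value in env.items():
    print(f"{name}={value}")
```

The printed NAME=value lines can be copied into whatever mechanism your deployment uses for environment variables (for example, a Kubernetes manifest or a .env file).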
For deployments using the streaming API, additional configurations can help:
Enable resumable streaming with streamResumable: true
Use reconnection with reconnectOnMount: true in the useStream() hook
Consider using durability modes (durability: "async" or "sync") for better persistence
What if I need these runs to last more than an hour?
Once your run kicks off, you are constrained by BG_JOB_TIMEOUT_SECS. With streamResumable: true, as long as you rejoin the stream before the timeout, BG_JOB_TIMEOUT_SECS will reset.
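To make the reset behavior concrete, here is a toy model of it, not the server's actual implementation. It assumes that each rejoin made before the current deadline restarts a full BG_JOB_TIMEOUT_SECS window from the moment of the rejoin; the article does not specify the exact reset semantics, and the function name effective_deadline is hypothetical.

```python
from __future__ import annotations


def effective_deadline(timeout_secs: int, rejoin_times: list[int]) -> int:
    """Toy model: second (since run start) at which the run would time out.

    Assumption: a rejoin before the current deadline restarts a full
    timeout window from the rejoin moment; a late rejoin has no effect.
    """
    deadline = timeout_secs
    for t in sorted(rejoin_times):
        if t >= deadline:
            break  # rejoined too late; the run has already timed out
        deadline = t + timeout_secs
    return deadline


# With a 3600-second timeout and rejoins at 3000s and 6000s, the run can
# continue until 9600s under this model.
print(effective_deadline(3600, [3000, 6000]))  # 9600
```

Under the same model, a single rejoin at 4000 seconds would arrive after the original 3600-second deadline, so the run would already have been cancelled.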
Note: If you see CancelledError during deployment of new revisions, this is expected behavior as existing runs are cancelled and restarted with the new revision.