Summary
This guide provides instructions on how to effectively use prompt caching with AWS Bedrock and the LangChain Python library, specifically with ChatBedrockConverse. It focuses on the common use case of caching a large, static system prompt for multiple independent API calls to reduce latency and cost. The article covers the correct syntax for enabling caching, clarifies the available cache types, and explains how to verify that caching is active.
Environment
- Products: LangChain, langchain-aws
- Language: Python
- Cloud Provider: AWS Bedrock
- Models: Anthropic Claude on Bedrock (or other models supporting prompt caching)
Caching System Prompts for Batch Invocations
When running multiple, independent invocations with the same large system prompt (e.g., analyzing a batch of documents), you can cache the system prompt to significantly reduce token usage and improve response times.
To enable caching, you must explicitly add a cachePoint object to your SystemMessage. The correct method is to structure the content of the SystemMessage as a list containing both the prompt text and the cachePoint dictionary.
Here is an example for caching a system prompt across multiple, non-conversational invoke calls:
from langchain_aws import ChatBedrockConverse
from langchain_core.messages import SystemMessage, HumanMessage
# Initialize the model
llm = ChatBedrockConverse(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",
    region_name="us-east-1",
)
# Define the system prompt and the cache point
system_prompt_text = "You are an expert analyst. [Your very large system prompt here...]"
# Structure the SystemMessage content as a list to include the cachePoint
CACHED_SYSTEM_PROMPT = SystemMessage(
    content=[
        {
            "type": "text",
            "text": system_prompt_text,
        },
        {
            "cachePoint": {
                "type": "default",
            },
        },
    ]
)
# A list of independent queries to process
queries = ["Analyze the first document.", "Analyze the second document.", "Analyze the third document."]
# Run batch processing
# The first invoke will write to the cache, subsequent ones will read from it.
for query in queries:
    response = llm.invoke([
        CACHED_SYSTEM_PROMPT,
        HumanMessage(content=query),
    ])
    print(response.content)
    # Check usage metadata to verify caching
    print(response.usage_metadata)

Understanding Cache Types in Bedrock
The cache types accepted by different model providers are a common source of confusion.
- "default": This is the correct cache type to use for AWS Bedrock's prompt caching feature. The cache is ephemeral by nature, typically lasting for a short duration (e.g., five minutes) and the Time-to-Live (TTL) is refreshed on each cache hit.
- "ephemeral": This cache type is associated with the direct Anthropic Claude API (cache_control parameter) and is not a valid type for the cachePoint object in AWS Bedrock. Using it will result in the cache point being ignored.
For all Bedrock caching use cases, specify {"type": "default"}.
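To make the distinction concrete, here is the Bedrock cache marker used throughout this guide, with the direct Anthropic API form shown only for contrast (commented out, since it is not valid inside a Bedrock cachePoint):

```python
# Bedrock Converse API cache marker (valid for ChatBedrockConverse)
bedrock_cache_point = {"cachePoint": {"type": "default"}}

# Direct Anthropic API marker, shown only for contrast; it is NOT valid
# inside a Bedrock cachePoint and would cause the cache point to be ignored:
# {"cache_control": {"type": "ephemeral"}}

print(bedrock_cache_point)
```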
Caching in Multi-Turn Conversations
Prompt caching can also be used within a single, ongoing conversation. In this scenario, you can set multiple cache "checkpoints" to avoid re-sending the entire conversation history.
To do this, place a cachePoint after a static portion of the conversation you wish to cache, such as after the first user message. This will cache the system prompt and the initial user turn.
from langchain_aws import ChatBedrockConverse

# Initialize the model
llm = ChatBedrockConverse(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",
    region_name="us-east-1",
)

# Create a cache point using the helper method
cache_point = ChatBedrockConverse.create_cache_point(cache_type="default")

messages = [
    ("system", "Your large system prompt here..."),
    ("human", ["Hello, this is my first message.", cache_point]),  # Cache point after the first turn
    ("ai", "Hello! How can I help you today?"),
    ("human", "This is my follow-up question."),
]

# The next invoke will use the cache up to the first human message
response = llm.invoke(messages)

Note: Models have a limit on the number of checkpoints allowed per conversation (e.g., many Claude models allow up to 4).
Verifying Cache Usage
To confirm that prompt caching is working as expected, you can inspect the usage_metadata returned in the LangChain response object.
- Cache Write: The first time a prompt prefix is sent with a cachePoint, you will see a value for cache_write_input_tokens.
- Cache Read: On subsequent calls with the identical prefix, you will see a value for cache_read_input_tokens, and the prompt_token_count will be significantly lower.
Example usage_metadata on a cache hit:
{
    "prompt_token_count": 15,
    "completion_token_count": 50,
    "total_token_count": 65,
    "cache_read_input_tokens": 1024,
    "cache_write_input_tokens": 0
}

You can also monitor caching activity through AWS CloudWatch metrics for Amazon Bedrock.
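When processing a large batch, it can be convenient to classify each response programmatically rather than eyeballing the metadata. The hypothetical helper below (not part of langchain-aws) operates on a plain dict shaped like the example above:

```python
def cache_status(usage: dict) -> str:
    """Classify a usage-metadata dict (keys as in the example above)
    as a cache hit, a cache write, or no caching activity."""
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "cache hit"
    if usage.get("cache_write_input_tokens", 0) > 0:
        return "cache write"
    return "no caching"

# Example: the metadata shown above indicates a cache hit
example = {
    "prompt_token_count": 15,
    "completion_token_count": 50,
    "total_token_count": 65,
    "cache_read_input_tokens": 1024,
    "cache_write_input_tokens": 0,
}
print(cache_status(example))  # cache hit
```

In a batch run you would expect the first invocation to report "cache write" and the remainder to report "cache hit".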
References
- AWS Bedrock User Guide: Prompt Caching
- LangChain ChatBedrock integration
- AWS ML Blog: Effectively use prompt caching on Amazon Bedrock