Token consumption
Agentic RAG is a license- and consumption-based service: you pay for the computational resources you consume, measured in Agentic RAG tokens. All public third-party LLMs base their pricing on the number of tokens processed. In the LLM world, a token is around 4-5 characters on average, so it may cover an entire word or only part of one. The number of tokens is therefore roughly proportional to the amount of text: the longer a sentence, the more tokens it takes to read or generate it. Since each third-party LLM has its own pricing, Agentic RAG tokens normalize the cost across all of them.
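As a rough illustration of the 4-5-characters-per-token heuristic, the sketch below estimates a token count from a text's length. The divisor of 4 is an assumption for illustration only; real tokenizers vary by model.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate based on the ~4-characters-per-token heuristic.

    Real tokenizers differ per model, so treat this as an order-of-magnitude guess.
    """
    return round(len(text) / chars_per_token)

question = "What is the refund policy for annual subscriptions?"
print(estimate_tokens(question))  # ~13 tokens for this 51-character question
```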
How Tokens Are Consumed in RAG
When a user asks a question to your Knowledge Box, the token consumption process follows these steps:
- Question Processing: The system finds the most relevant paragraphs to answer the question
- Context Assembly: These paragraphs are used as context when calling the LLM model
- Prompt Creation: Agentic RAG assembles the prompt, context, and question into a single string
- LLM Processing: This complete string is sent to the LLM, corresponding to a certain number of input tokens
- Answer Generation: The LLM generates the answer, which corresponds to a certain number of output tokens
Total consumption = Input tokens + Output tokens + Image tokens
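Expressed as code, the total is simply the sum of the three counters (a minimal sketch of the formula above):

```python
def total_consumption(input_tokens: float, output_tokens: float, image_tokens: float = 0.0) -> float:
    """Total Agentic RAG token consumption for a single request."""
    return input_tokens + output_tokens + image_tokens

# e.g. 13 input tokens and 34 output tokens, with no images in the context
print(total_consumption(13, 34))  # 47.0
```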
Factors That Impact Token Consumption
Token consumption is directly affected by:
- Large context: Results from using RAG strategies like "Full resource" or "Neighbouring paragraphs", or from using the `extra_context` parameter (see the rough comparison after this list)
- Long questions: More detailed or complex questions require more input tokens
- Long prompts: Extensive system prompts increase the input token count
- Detailed answers: Comprehensive responses require more output tokens
- Images in context: When using multimodal models, images included in the retrieved context significantly increase token consumption
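To see why context size is usually the dominant factor, the rough comparison below reuses the 4-characters-per-token heuristic (an assumption, not an exact tokenizer) to contrast a single retrieved paragraph with a full resource pulled in as context. The character counts are illustrative only.

```python
CHARS_PER_TOKEN = 4  # heuristic only; actual tokenization depends on the model

single_paragraph_chars = 400     # one retrieved paragraph
full_resource_chars = 20_000     # an entire resource included via "Full resource"

print(single_paragraph_chars // CHARS_PER_TOKEN)  # ~100 input tokens
print(full_resource_chars // CHARS_PER_TOKEN)     # ~5000 input tokens
```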
How to Limit and Control Token Consumption
Strategy 1: Optimize Your Parameters
The first approach to reducing token consumption is to fine-tune your request parameters:
- Optimize prompts: Ensure your prompts are concise and focused, avoiding unnecessary verbosity
- Control resource scope: When using the "Full resource" strategy, use the `count` attribute to limit the number of resources returned
- Tune neighboring context: For the "Neighbouring paragraphs" strategy, optimize the `before` and `after` attributes to balance context quality with token efficiency (see the request sketch after this list)
- Manage summary length: When using the "Hierarchical" strategy, ensure that resource summaries are appropriately sized
- Choose efficient models: Select LLMs that offer better token efficiency (typically, ChatGPT 4o-mini is more cost-effective than ChatGPT 4o)
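A request applying these limits could look like the sketch below. The endpoint URL, authentication header, and the exact payload field names (`rag_strategies`, the strategy names, and their attributes) are assumptions for illustration; check the API reference for the authoritative shape.

```python
import requests

# Placeholder Knowledge Box endpoint and credentials
ASK_URL = "https://<your-region>.example.com/api/v1/kb/<KB_ID>/ask"
HEADERS = {"X-API-KEY": "<YOUR_API_KEY>"}

payload = {
    "query": "What is the refund policy for annual subscriptions?",
    # Assumed shape: a "Neighbouring paragraphs" strategy kept small via before/after.
    # A "Full resource" strategy capped with `count` would be the alternative here.
    "rag_strategies": [
        {"name": "neighbouring_paragraphs", "before": 1, "after": 1},
    ],
}

response = requests.post(ASK_URL, headers=HEADERS, json=payload)
print(response.json())
```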
Strategy 2: Set Hard Limits
You can implement safeguards against excessive token consumption:
Use the `max_tokens` parameter on the `/ask` endpoint to set hard limits on:
- Context size: Limits the amount of retrieved information sent to the LLM
- Answer length: Limits the length of the generated response
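For example, a hard cap could be sent along with the question. Whether `max_tokens` takes a single number or separate context/answer limits is an assumption in this sketch; confirm the exact shape in the API reference.

```python
import requests

ASK_URL = "https://<your-region>.example.com/api/v1/kb/<KB_ID>/ask"  # placeholder
HEADERS = {"X-API-KEY": "<YOUR_API_KEY>"}

payload = {
    "query": "Summarize our refund policy.",
    # Assumed shape: separate hard limits for the retrieved context and the answer
    "max_tokens": {"context": 2000, "answer": 300},
}

response = requests.post(ASK_URL, headers=HEADERS, json=payload)
print(response.json())
```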
Important Considerations
Context Limitations:
- Restricting context size may result in less relevant answers since the LLM has less information to work with
- Balance between cost control and answer quality
Answer Length Limitations:
- The LLM might not complete its response if it hits the token limit, potentially cutting sentences mid-way
- Recommended approach: Include length requirements in your prompt (e.g., "Please answer in less than 200 words") rather than relying solely on hard limits
- This allows the LLM to naturally conclude its response within the desired length
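In practice this means stating the constraint in the prompt itself, as in the sketch below (the `prompt` field name is an assumption; adapt it to your request shape):

```python
payload = {
    "query": "What changed in the latest release?",
    # Let the model conclude naturally within the desired length instead of
    # relying only on a hard output-token cap.
    "prompt": (
        "Answer the user's question using the provided context. "
        "Please answer in less than 200 words."
    ),
}
```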
How to Monitor Token Consumption
Understanding Token Consumption Data
You can receive detailed token consumption information from the following endpoints that utilize LLM models: `ask`, `chat`, `remi`, `query`, `sentence`, `summarize`, `tokens`, and `rerank`.
The `rephrase` endpoint currently does not support token consumption monitoring.
To receive token consumption data, you must include the following header in your request:
```
X-SHOW-CONSUMPTION: true
```
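With the Python `requests` library, for example, the header can be attached like this (the URL and authentication header are placeholders):

```python
import requests

response = requests.post(
    "https://<your-region>.example.com/api/v1/kb/<KB_ID>/ask",  # placeholder URL
    headers={
        "X-API-KEY": "<YOUR_API_KEY>",
        "X-SHOW-CONSUMPTION": "true",  # ask the API to report token consumption
    },
    json={"query": "What is the refund policy?"},
)
```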
The token consumption data is provided in different formats depending on the response type:
- Streaming responses (`application/x-ndjson`): Token consumption appears as a separate JSON chunk with type `"consumption"`
- Standard responses (`application/json`): Token consumption is included in a `"consumption"` field within the main response
Token Consumption Response Format
Streaming (`x-ndjson`):

```json
{
  "item": {
    "type": "consumption",
    "normalized_tokens": {
      "input": 13,
      "output": 34,
      "image": 0.0
    },
    "customer_key_tokens": {
      "input": 0.0,
      "output": 0.0,
      "image": 0.0
    }
  }
}
```

Standard (`json`):

```json
{
  "consumption": {
    "normalized_tokens": {
      "input": 13,
      "output": 34,
      "image": 0.0
    },
    "customer_key_tokens": {
      "input": 0.0,
      "output": 0.0,
      "image": 0.0
    }
  }
}
```
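A minimal way to pull the consumption figures out of either response shape, assuming the formats shown above:

```python
import json
from typing import Optional

def extract_consumption(raw_response: str, content_type: str) -> Optional[dict]:
    """Return the consumption data from a streaming or standard response body."""
    if content_type == "application/x-ndjson":
        # Streaming: scan each NDJSON line for the chunk whose item type is "consumption".
        for line in raw_response.splitlines():
            if not line.strip():
                continue
            item = json.loads(line).get("item", {})
            if item.get("type") == "consumption":
                return {
                    "normalized_tokens": item["normalized_tokens"],
                    "customer_key_tokens": item["customer_key_tokens"],
                }
        return None
    # Standard JSON: consumption is a top-level field of the response
    return json.loads(raw_response).get("consumption")
```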
Understanding Token Types
Normalized Tokens (`normalized_tokens`):
- These represent the number of Agentic RAG tokens consumed, which you will be billed for
- Values are normalized across different LLM providers for consistent billing
- Include separate counts for:
  - `input`: Tokens used for the prompt, context, and question
  - `output`: Tokens used for the generated response
  - `image`: Tokens used for image processing (when applicable)
Customer Key Tokens (`customer_key_tokens`):
- These represent tokens consumed when using your own LLM API keys
- These tokens are not billed by Agentic RAG since you're using your own API keys
- Values are also normalized for comparison purposes across different providers
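To estimate what a request will be billed, sum the three `normalized_tokens` counters; the `customer_key_tokens` counters are informational only when you bring your own keys. A minimal sketch:

```python
def billed_tokens(consumption: dict) -> float:
    """Sum the normalized (billable) token counters from a consumption payload."""
    normalized = consumption["normalized_tokens"]
    return normalized["input"] + normalized["output"] + normalized["image"]

consumption = {
    "normalized_tokens": {"input": 13, "output": 34, "image": 0.0},
    "customer_key_tokens": {"input": 0.0, "output": 0.0, "image": 0.0},
}
print(billed_tokens(consumption))  # 47.0
```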