Token consumption
Agentic RAG is a license- and consumption-based service: you pay for the computational resources you consume, measured in Agentic RAG tokens. All public third-party LLMs base their pricing on the number of tokens processed. In the LLM world, a token is around 4-5 characters on average, so it may cover an entire word or only part of one. The number of tokens is therefore roughly proportional to the amount of text: the longer a sentence, the more tokens it takes to read or generate it. Since each third-party LLM has its own pricing, Agentic RAG tokens normalize the cost across all of them.
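As a rough illustration of the 4-5-characters-per-token heuristic, the sketch below estimates a token count from a text's length. The divisor of 4 is an assumption for illustration only; real tokenizers vary by model.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate based on the ~4-characters-per-token heuristic.

    Real tokenizers differ per model, so treat this as an order-of-magnitude guess.
    """
    return round(len(text) / chars_per_token)

question = "What is the refund policy for annual subscriptions?"
print(estimate_tokens(question))  # ~13 tokens for this 51-character question
```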
How Tokens Are Consumed in RAG
When a user asks a question to your Knowledge Box, the token consumption process follows these steps:
- Question Processing: The system finds the most relevant paragraphs to answer the question
- Context Assembly: These paragraphs are used as context when calling the LLM model
- Prompt Creation: Agentic RAG assembles the prompt, context, and question into a single string
- LLM Processing: This complete string is sent to the LLM, corresponding to a certain number of input tokens
- Answer Generation: The LLM generates the answer, which corresponds to a certain number of output tokens
Total consumption = Input tokens + Output tokens + Image tokens
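Expressed as code, the total is simply the sum of the three counters (a minimal sketch of the formula above):

```python
def total_consumption(input_tokens: float, output_tokens: float, image_tokens: float = 0.0) -> float:
    """Total Agentic RAG token consumption for a single request."""
    return input_tokens + output_tokens + image_tokens

# e.g. 13 input tokens and 34 output tokens, with no images in the context
print(total_consumption(13, 34))  # 47.0
```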
Factors That Impact Token Consumption
Token consumption is directly affected by:
- Large context: Results from using RAG strategies like "Full resource" or "Neighbouring paragraphs", or from using the `extra_context` parameter (see the rough comparison after this list)
- Long questions: More detailed or complex questions require more input tokens
- Long prompts: Extensive system prompts increase the input token count
- Detailed answers: Comprehensive responses require more output tokens
- Images in context: When using multimodal models, images included in the retrieved context significantly increase token consumption
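To see why context size is usually the dominant factor, the rough comparison below reuses the 4-characters-per-token heuristic (an assumption, not an exact tokenizer) to contrast a single retrieved paragraph with a full resource pulled in as context. The character counts are illustrative only.

```python
CHARS_PER_TOKEN = 4  # heuristic only; actual tokenization depends on the model

single_paragraph_chars = 400     # one retrieved paragraph
full_resource_chars = 20_000     # an entire resource included via "Full resource"

print(single_paragraph_chars // CHARS_PER_TOKEN)  # ~100 input tokens
print(full_resource_chars // CHARS_PER_TOKEN)     # ~5000 input tokens
```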
How to Limit and Control Token Consumption
Strategy 1: Optimize Your Parameters
The first approach to reducing token consumption is to fine-tune your request parameters:
- Optimize prompts: Ensure your prompts are concise and focused, avoiding unnecessary verbosity
- Control resource scope: When using the "Full resource" strategy, use the `count` attribute to limit the number of resources returned
- Tune neighboring context: For the "Neighbouring paragraphs" strategy, optimize the `before` and `after` attributes to balance context quality with token efficiency (see the request sketch after this list)
- Manage summary length: When using the "Hierarchical" strategy, ensure that resource summaries are appropriately sized
- Choose efficient models: Select LLMs that offer better token efficiency (typically, ChatGPT 4o-mini is more cost-effective than ChatGPT 4o)
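A request applying these limits could look like the sketch below. The endpoint URL, authentication header, and the exact payload field names (`rag_strategies`, the strategy names, and their attributes) are assumptions for illustration; check the API reference for the authoritative shape.

```python
import requests

# Placeholder Knowledge Box endpoint and credentials
ASK_URL = "https://<your-region>.example.com/api/v1/kb/<KB_ID>/ask"
HEADERS = {"X-API-KEY": "<YOUR_API_KEY>"}

payload = {
    "query": "What is the refund policy for annual subscriptions?",
    # Assumed shape: a "Neighbouring paragraphs" strategy kept small via before/after.
    # A "Full resource" strategy capped with `count` would be the alternative here.
    "rag_strategies": [
        {"name": "neighbouring_paragraphs", "before": 1, "after": 1},
    ],
}

response = requests.post(ASK_URL, headers=HEADERS, json=payload)
print(response.json())
```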
Strategy 2: Set Hard Limits
You can implement safeguards against excessive token consumption:
Use the `max_tokens` parameter on the `/ask` endpoint to set hard limits on:
- Context size: Limits the amount of retrieved information sent to the LLM
- Answer length: Limits the length of the generated response
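For example, a hard cap could be sent along with the question. Whether `max_tokens` takes a single number or separate context/answer limits is an assumption in this sketch; confirm the exact shape in the API reference.

```python
import requests

ASK_URL = "https://<your-region>.example.com/api/v1/kb/<KB_ID>/ask"  # placeholder
HEADERS = {"X-API-KEY": "<YOUR_API_KEY>"}

payload = {
    "query": "Summarize our refund policy.",
    # Assumed shape: separate hard limits for the retrieved context and the answer
    "max_tokens": {"context": 2000, "answer": 300},
}

response = requests.post(ASK_URL, headers=HEADERS, json=payload)
print(response.json())
```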
Important Considerations
Context Limitations:
- Restricting context size may result in less relevant answers since the LLM has less information to work with
- Balance between cost control and answer quality
Answer Length Limitations:
- The LLM might not complete its response if it hits the token limit, potentially cutting sentences mid-way
- Recommended approach: Include length requirements in your prompt (e.g., "Please answer in less than 200 words") rather than relying solely on hard limits
- This allows the LLM to naturally conclude its response within the desired length
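In practice this means stating the constraint in the prompt itself, as in the sketch below (the `prompt` field name is an assumption; adapt it to your request shape):

```python
payload = {
    "query": "What changed in the latest release?",
    # Let the model conclude naturally within the desired length instead of
    # relying only on a hard output-token cap.
    "prompt": (
        "Answer the user's question using the provided context. "
        "Please answer in less than 200 words."
    ),
}
```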
How to Monitor Token Consumption
Understanding Token Consumption Data
You can receive detailed token consumption information from the following endpoints that utilize LLM models: `ask`, `chat`, `remi`, `query`, `sentence`, `summarize`, `tokens`, and `rerank`.
The `rephrase` endpoint currently does not support token consumption monitoring.
To receive token consumption data, you must include the following header in your request:
```
X-SHOW-CONSUMPTION: true
```
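With the Python `requests` library, for example, the header can be attached like this (the URL and authentication header are placeholders):

```python
import requests

response = requests.post(
    "https://<your-region>.example.com/api/v1/kb/<KB_ID>/ask",  # placeholder URL
    headers={
        "X-API-KEY": "<YOUR_API_KEY>",
        "X-SHOW-CONSUMPTION": "true",  # ask the API to report token consumption
    },
    json={"query": "What is the refund policy?"},
)
```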
The token consumption data is provided in different formats depending on the response type:
- Streaming responses (`application/x-ndjson`): Token consumption appears as a separate JSON chunk with type `"consumption"`
- Standard responses (`application/json`): Token consumption is included in a `"consumption"` field within the main response
Token Consumption Response Format
Streaming (`x-ndjson`):

```json
{
  "item": {
    "type": "consumption",
    "normalized_tokens": {
      "input": 13,
      "output": 34,
      "image": 0.0
    },
    "customer_key_tokens": {
      "input": 0.0,
      "output": 0.0,
      "image": 0.0
    }
  }
}
```

Standard (`json`):

```json
{
  "consumption": {
    "normalized_tokens": {
      "input": 13,
      "output": 34,
      "image": 0.0
    },
    "customer_key_tokens": {
      "input": 0.0,
      "output": 0.0,
      "image": 0.0
    }
  }
}
```
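A minimal way to pull the consumption figures out of either response shape, assuming the formats shown above:

```python
import json
from typing import Optional

def extract_consumption(raw_response: str, content_type: str) -> Optional[dict]:
    """Return the consumption data from a streaming or standard response body."""
    if content_type == "application/x-ndjson":
        # Streaming: scan each NDJSON line for the chunk whose item type is "consumption".
        for line in raw_response.splitlines():
            if not line.strip():
                continue
            item = json.loads(line).get("item", {})
            if item.get("type") == "consumption":
                return {
                    "normalized_tokens": item["normalized_tokens"],
                    "customer_key_tokens": item["customer_key_tokens"],
                }
        return None
    # Standard JSON: consumption is a top-level field of the response
    return json.loads(raw_response).get("consumption")
```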
Understanding Token Types
Normalized Tokens (`normalized_tokens`):
- These represent the number of Agentic RAG tokens consumed, which you will be billed for
- Values are normalized across different LLM providers for consistent billing
- Include separate counts for:
  - `input`: Tokens used for the prompt, context, and question
  - `output`: Tokens used for the generated response
  - `image`: Tokens used for image processing (when applicable)
Customer Key Tokens (`customer_key_tokens`):
- These represent tokens consumed when using your own LLM API keys
- These tokens are not billed by Agentic RAG since you're using your own API keys
- Values are also normalized for comparison purposes across different providers
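To estimate what a request will be billed, sum the three `normalized_tokens` counters; the `customer_key_tokens` counters are informational only when you bring your own keys. A minimal sketch:

```python
def billed_tokens(consumption: dict) -> float:
    """Sum the normalized (billable) token counters from a consumption payload."""
    normalized = consumption["normalized_tokens"]
    return normalized["input"] + normalized["output"] + normalized["image"]

consumption = {
    "normalized_tokens": {"input": 13, "output": 34, "image": 0.0},
    "customer_key_tokens": {"input": 0.0, "output": 0.0, "image": 0.0},
}
print(billed_tokens(consumption))  # 47.0
```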