How to filter search results
All Agentic RAG search endpoints support filtering with the same API, using the filter_expression parameter. For POST endpoints, the expression is passed
as a JSON object, like all other parameters. For GET endpoints, it is passed as a JSON object serialized to a string, with special characters
URL-encoded. Because of this extra encoding step, we recommend using the POST endpoints.
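To make the difference between the two styles concrete, the following minimal sketch serializes the same expression both ways using only the Python standard library. The helper names as_get_query and as_post_body are hypothetical and exist only for this illustration; no request is sent, and the endpoint URL is deliberately left out.

```python
import json
from urllib.parse import parse_qs, urlencode

filter_expression = {
    "field": {"prop": "resource", "slug": "my-cool-resource"},
}


def as_get_query(expression: dict) -> str:
    """Serialize the expression to a JSON string, then URL-encode it (GET style)."""
    return urlencode({"filter_expression": json.dumps(expression)})


def as_post_body(expression: dict) -> str:
    """Embed the expression as a regular JSON object in the request body (POST style)."""
    return json.dumps({"filter_expression": expression})


query_string = as_get_query(filter_expression)
post_body = as_post_body(filter_expression)

print(query_string)  # braces, quotes and colons are percent-encoded
print(post_body)     # plain JSON, readable as-is
```

The GET form has to be decoded twice by the receiver (URL-decoding, then JSON parsing), which is one reason the POST form is easier to read and debug.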
A filter expression is composed of the following parts:
{
"field": <expr>,
"paragraph": <expr>,
"operator": <and/or>
}
- An expression to filter resource fields. This is where most of the filtering takes place: filters on resource id, slug, field type, resource labels or language are all defined here.
- An expression to filter paragraphs. This applies filters to individual paragraphs based on paragraph labels or the kind of paragraph.
- If both expressions are provided, an operator defining how to combine them: either "and" or "or".
Examples:
Search in a specified resource
{
"field": { "prop": "resource", "slug": "my-cool-resource" }
}
Search for English texts, excluding OCR paragraphs
{
"field": { "prop": "language", "language": "en" },
"paragraph": { "not": { "prop": "kind", "kind": "OCR" } },
"operator": "and"
}
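The three top-level parts can be assembled programmatically. The sketch below is one possible way to do it; build_filter_expression is a hypothetical helper, and it requires an explicit operator when both parts are given rather than relying on any server-side default (which this page does not specify).

```python
import json
from typing import Optional


def build_filter_expression(
    field: Optional[dict] = None,
    paragraph: Optional[dict] = None,
    operator: Optional[str] = None,
) -> dict:
    """Assemble the top-level filter_expression object from its parts."""
    expression = {}
    if field is not None:
        expression["field"] = field
    if paragraph is not None:
        expression["paragraph"] = paragraph
    if field is not None and paragraph is not None:
        # Assumption of this sketch: be explicit about how the two parts combine.
        if operator not in ("and", "or"):
            raise ValueError('operator must be "and" or "or" when both parts are given')
        expression["operator"] = operator
    return expression


expression = build_filter_expression(
    field={"prop": "language", "language": "en"},
    paragraph={"not": {"prop": "kind", "kind": "OCR"}},
    operator="and",
)
print(json.dumps(expression, indent=2))
```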
Filter expression
Each filter expression is a set of filters combined by operators (AND, OR, NOT). The allowed filters differ between field and paragraph expressions, but the operators are common.
Boolean operators
And
All filters must match for the expression to match
{
"field": {
"and": [<expr>, <expr>]
}
}
Or
At least one of the filters must match for the expression to match
{
"field": {
"or": [<expr>, <expr>]
}
}
Not
The filter must not match for the expression to match
{
"field": {
"not": <expr>
}
}
Nesting
Operators can be nested, producing complex expressions.
For example, to search for movies or books in English that mention neither Barcelona nor Paris, you could write:
{
"field": {
"and": [
{ "prop": "language", "language": "en" },
{
"or": [
{ "prop": "label", "labelset": "media_type", "label": "movies" },
{ "prop": "label", "labelset": "media_type", "label": "books" }
]
},
{
"not": {
"or": [
{ "prop": "entity", "subtype": "CITY", "value": "Barcelona" },
{ "prop": "entity", "subtype": "CITY", "value": "Paris" }
]
}
}
]
}
}
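Deeply nested expressions are easier to write and review when they are built from small helpers. The sketch below rebuilds the expression above; all function names (and_, or_, not_, label_filter, entity_filter, language_filter) are hypothetical conveniences for this example, not part of any SDK.

```python
import json


def and_(*exprs: dict) -> dict:
    """All sub-expressions must match."""
    return {"and": list(exprs)}


def or_(*exprs: dict) -> dict:
    """At least one sub-expression must match."""
    return {"or": list(exprs)}


def not_(expr: dict) -> dict:
    """The sub-expression must not match."""
    return {"not": expr}


def label_filter(labelset: str, label: str) -> dict:
    return {"prop": "label", "labelset": labelset, "label": label}


def entity_filter(subtype: str, value: str) -> dict:
    return {"prop": "entity", "subtype": subtype, "value": value}


def language_filter(code: str) -> dict:
    return {"prop": "language", "language": code}


expression = {
    "field": and_(
        language_filter("en"),
        or_(label_filter("media_type", "movies"), label_filter("media_type", "books")),
        not_(or_(entity_filter("CITY", "Barcelona"), entity_filter("CITY", "Paris"))),
    )
}
print(json.dumps(expression, indent=2))
```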
Resource filters
Resource ID or slug (resource)
Filters by a given resource id or slug (only one can be specified at a time).
{
"field": {
"prop": "resource",
"id": "2e601fd990790691813d1380c104ab98"
}
}
{
"field": {
"prop": "resource",
"slug": "my-slug"
}
}
Field type or specific field id (field)
Filters by a given field type or a specific field.
Type is one of text, file, link, conversation or generic.
{
"field": {
"prop": "field",
"type": "text"
}
}
{
"field": {
"prop": "field",
"type": "generic",
"name": "summary"
}
}
Documents containing a word (keyword)
Matches fields that contain a specific word.
{
"field": {
"prop": "keyword",
"word": "umbrella"
}
}
Creation date (created)
Matches documents created inside the date range.
{
"field": {
"prop": "created",
"since": "2021-03-05T02:00:00",
"until": "2021-05-15T02:00:00"
}
}
Either bound can be omitted: with only since, the filter matches documents created after that date; with only until, it matches documents created before that date.
{
"field": {
"prop": "created",
"since": "2021-03-05T02:00:00"
}
}
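If the dates come from application code, a small helper can format them consistently. This is a sketch only: date_range is a hypothetical name, and it assumes naive datetime values written without a timezone offset, matching the format of the examples on this page. The same helper works for the modified filter described in the next section.

```python
import json
from datetime import datetime
from typing import Optional


def date_range(
    prop: str,
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
) -> dict:
    """Build a date filter with an optional lower and upper bound."""
    if prop not in ("created", "modified"):
        raise ValueError('prop must be "created" or "modified"')
    if since is None and until is None:
        # Assumption of this sketch: a range without any bound is a mistake.
        raise ValueError("provide at least one of since or until")
    expr = {"prop": prop}
    if since is not None:
        expr["since"] = since.isoformat(timespec="seconds")
    if until is not None:
        expr["until"] = until.isoformat(timespec="seconds")
    return expr


expression = {"field": date_range("created", since=datetime(2021, 3, 5, 2, 0, 0))}
print(json.dumps(expression, indent=2))
```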
Modification date (modified)
Matches documents modified inside the date range.
{
"field": {
"prop": "modified",
"since": "2021-03-05T02:00:00",
"until": "2021-05-15T02:00:00"
}
}
Either bound can be omitted: with only since, the filter matches documents modified after that date; with only until, it matches documents modified before that date.
{
"field": {
"prop": "modified",
"since": "2021-03-05T02:00:00"
}
}
Origin tags (origin_tag)
Matches documents with a given origin tag (as specified at resource creation).
{
"field": {
"prop": "origin_tag",
"tag": "word"
}
}
Origin metadata (origin_metadata)
Matches documents with the given origin metadata (as specified at resource creation).
{
"field": {
"prop": "origin_metadata",
"field": "agent",
"value": "crawler"
}
}
Can also be used to match documents having the specified metadata field (without caring about its value):
{
"field": {
"prop": "origin_metadata",
"field": "agent"
}
}
Origin path (origin_path)
Matches path of the resource in the source system. It will match any path starting with the provided value.
Example: Users/JohnDoe/Documents will match files in the Documents folder of the JohnDoe user, but also the ones in Documents/Work or Documents/Personal, etc.
{
"field": {
"prop": "origin_path",
"prefix": "Users/JohnDoe/Documents"
}
}
Omit the prefix to match any resource that has an origin path set:
{
"field": {
"prop": "origin_path"
}
}
Origin source ID (origin_source)
Matches documents with a given origin source id (as specified at resource creation).
{
"field": {
"prop": "origin_source",
"id": "internet"
}
}
Omit the id to match any resource that has an origin source set:
{
"field": {
"prop": "origin_source"
}
}
Origin collaborator (origin_collaborator)
Matches documents with a given origin collaborator (as specified at resource creation).
{
"field": {
"prop": "origin_collaborator",
"collaborator": "someone"
}
}
Classification labels (label)
Matches documents with a given label.
{
"field": {
"prop": "label",
"labelset": "topic",
"label": "boats"
}
}
The label field can be omitted to match any resource that has any label in that labelset.
{
"field": {
"prop": "label",
"labelset": "topic"
}
}
Icon / Resource mimetype (resource_mimetype)
Matches the mimetype of the resource (also known as icon). You can also filter by the mimetype of a specific field (see next filter).
{
"field": {
"prop": "resource_mimetype",
"type": "application",
"subtype": "pdf"
}
}
You can also filter by category by omitting the subtype field.
{
"field": {
"prop": "resource_mimetype",
"type": "image"
}
}
Field mimetype (field_mimetype)
Matches the mimetype of the field. You can also filter by the mimetype of the resource/icon (see above).
{
"field": {
"prop": "field_mimetype",
"type": "application",
"subtype": "pdf"
}
}
You can also filter by category by omitting the subtype field.
{
"field": {
"prop": "field_mimetype",
"type": "image"
}
}
Entities / NERs (entity)
Matches fields containing the specified NER entity.
{
"field": {
"prop": "entity",
"subtype": "CITY",
"value": "Paris"
}
}
Can also match any entity in a category:
{
"field": {
"prop": "entity",
"subtype": "CITY"
}
}
Text language (language)
Matches documents containing text in the given language (even if they have other languages):
{
"field": {
"prop": "language",
"language": "en"
}
}
Matches documents written primarily in the given language:
{
"field": {
"prop": "language",
"language": "en",
"only_primary": true
}
}
Field generated by (generated)
Matches if the field was generated by the given source. Currently, it can only be used for fields generated by Data Augmentation.
{
"field": {
"prop": "generated",
"by": "data-augmentation"
}
}
Can also be used to match fields generated by a specific DA task (given the field prefix).
{
"field": {
"prop": "generated",
"by": "data-augmentation",
"da_task": "summarizer"
}
}
Paragraph filters
Classification labels (label)
Matches paragraphs with a given label.
{
"paragraph": {
"prop": "label",
"labelset": "topic",
"label": "boats"
}
}
The label field can be omitted to match any paragraph that has any label in that labelset.
{
"paragraph": {
"prop": "label",
"labelset": "topic"
}
}
Paragraph kind (kind)
Matches paragraphs of that kind. Kind can be TEXT, OCR, INCEPTION, DESCRIPTION, TRANSCRIPT, TITLE or TABLE.
{
"paragraph": {
"prop": "kind",
"kind": "TEXT"
}
}
Filters on /catalog endpoint
The /catalog endpoint can use most of the resource filters (except for field, field_mimetype, keyword and entity).
Additionally, it can also use the following filters:
Resource status (status)
Matches resources in a given processing status. Status can be PROCESSED, PENDING or ERROR.
{
"field": {
"prop": "status",
"status": "PROCESSED"
}
}
Reference documentation
The Agentic RAG API documentation is available here.
Legacy filter parameters
The parameters described below also apply filters and represent an older version of the API.
We recommend using filter_expression instead, but the documentation for the older parameters is still retained here.
Filters
The filters parameter allows you to filter the results depending on the value of different properties provided on the resource.
The following attributes are supported:
- /origin.tags: tags defined in the resource's origin property. Example: /origin.tags/blue, /origin.tags/green
- /classification.labels: labels, in the form /classification.labels/{labelset}/{label}. Example: /classification.labels/movie-genre/science-fiction
- /icon: mime type of the resource. Example: /icon/application/pdf or /icon/movie/mp4
- /metadata.status: processing status. Example: /metadata.status/PROCESSED, /metadata.status/PENDING or /metadata.status/ERROR
- /entities: resource entities, in the form /entities/{entity-type}/{entity-id}. Example: /entities/CITY/Barcelona
- /metadata.language: primary language of the document. Example: /metadata.language/ca for Catalan
- /metadata.languages: all other detected languages. Example: /metadata.languages/tr for Turkish
- /origin.metadata: metadata provided by the user. Example: /origin.metadata/fieldname/value
- /origin.path: path of the resource in the source system. It will match any path starting with the provided value. Example: /origin.path/Users/JohnDoe/Documents will match files in the Documents folder of the JohnDoe user, but also the ones in Documents/Work or Documents/Personal, etc.
Examples:
- To retrieve PNG images only, use:
  filters=/icon/image/png
- To retrieve results in which the principal language is Italian, use:
  filters=/metadata.language/it
- To retrieve results referring to the UNESCO organization, use:
  filters=/entities/ORG/UNESCO
Filters can be combined by repeating the filters parameter. This example retrieves results that are PDFs and refer to the UNESCO organization:
filters=/icon/application/pdf&filters=/entities/ORG/UNESCO
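When building such a query string in code, the parameter has to be repeated once per value rather than joined into a single value. A minimal sketch with the Python standard library follows; keeping "/" unescaped is a readability choice made here, not a documented requirement.

```python
from urllib.parse import parse_qs, urlencode

filters = ["/icon/application/pdf", "/entities/ORG/UNESCO"]

# doseq=True emits one "filters=..." pair per list element;
# safe="/" leaves the slashes unescaped so the result stays readable.
query_string = urlencode({"filters": filters}, doseq=True, safe="/")

print(query_string)
# → filters=/icon/application/pdf&filters=/entities/ORG/UNESCO
```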
Advanced filtering
As shown above, combining multiple filters will implicitly return the intersection (i.e: AND operator) between the specified filters.
If your use-case needs more complex filtering expressions, you can use the POST versions of the search endpoints to provide a filtering expression.
Filtering expressions accept the following keys: all, any, none and not_all. Here are some examples:
all
{
"filters": [
{"all": ["/icon/application/pdf", "/entities/ORG/UNESCO"]}
]
}
This is equivalent to the last example of the previous section: it returns resources that are PDFs and have the UNESCO entity associated with them.
any
{
"filters": [
{"any": ["/icon/application/pdf", "/icon/movie/mp4"]}
]
}
Will return resources that are either PDF or mp4 videos. This is equivalent to the OR logical operation.
none
{
"filters": [
{"none": ["/icon/application/pdf", "/icon/movie/mp4"]}
]
}
Will return results from documents that are neither PDF nor mp4 videos. This is equivalent to the NOT(a OR b) logical expression.
not_all
{
"filters": [
{"not_all": ["/icon/application/pdf", "/entities/ORG/UNESCO"]}
]
}
This returns the complement of the all example: every document except those that are PDFs and also have the UNESCO entity associated with them. This is equivalent to the NOT(a AND b) logical expression.
Combining
If you need even more complex filtering expressions, you can combine multiple expression terms as more elements of the filters list:
{
"filters": [
{"all": ["/icon/application/pdf"]},
{"any": ["/entities/ORG/UNESCO", "/entities/GPE/US"]}
]
}
The returned result is the implicit intersection (i.e. AND) of all the expressions. In this example, it returns all documents that are PDFs and that have either UNESCO or US as a related entity.
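The boolean semantics of the four keys, and of combining several terms, can be modelled locally. The sketch below is only an illustration of the logic described above, not how the service evaluates filters: it treats a document as a plain set of filter strings and uses exact string matching, and the function names are hypothetical.

```python
def term_matches(term: dict, tags: set) -> bool:
    """Evaluate one {"all" | "any" | "none" | "not_all": [...]} term."""
    if len(term) != 1:
        raise ValueError("each term must have exactly one key")
    key, values = next(iter(term.items()))
    hits = [value in tags for value in values]
    if key == "all":
        return all(hits)
    if key == "any":
        return any(hits)
    if key == "none":
        return not any(hits)       # NOT(a OR b)
    if key == "not_all":
        return not all(hits)       # NOT(a AND b)
    raise ValueError(f"unknown key: {key}")


def filters_match(filters: list, tags: set) -> bool:
    """Terms in the filters list are implicitly combined with AND."""
    return all(term_matches(term, tags) for term in filters)


filters = [
    {"all": ["/icon/application/pdf"]},
    {"any": ["/entities/ORG/UNESCO", "/entities/GPE/US"]},
]

pdf_about_unesco = {"/icon/application/pdf", "/entities/ORG/UNESCO"}
plain_pdf = {"/icon/application/pdf"}

print(filters_match(filters, pdf_about_unesco))  # True
print(filters_match(filters, plain_pdf))         # False
```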
Date filtering
You can filter on the creation date using:
- range_creation_start
- range_creation_end
Examples:
- To get all resources created between 2023-01-01 and 2023-12-31:
  range_creation_start=2023-01-01T00:00:00.000Z&range_creation_end=2023-12-31T23:59:59.000Z
- To get all resources created after 2023-01-01:
  range_creation_start=2023-01-01T00:00:00.000Z
Filtering will be based on the origin.created value if provided in the resource, otherwise it will default to the resource creation date (created).
Please note: all resources created before 2023-11-02 will have to be reprocessed for origin.created to be filterable.
Similarly, you can filter on the modification date using:
- range_modification_start
- range_modification_end
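Timestamps in the format used by the examples above (UTC, millisecond precision, Z suffix) can be produced from Python datetime objects. This is a minimal sketch; to_timestamp is a hypothetical helper, and leaving ":" unescaped in the query string is a readability choice rather than a documented requirement.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def to_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as UTC with milliseconds and a Z suffix."""
    utc = moment.astimezone(timezone.utc)
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")


params = {
    "range_creation_start": to_timestamp(datetime(2023, 1, 1, tzinfo=timezone.utc)),
    "range_creation_end": to_timestamp(
        datetime(2023, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
    ),
}
query_string = urlencode(params, safe=":")

print(query_string)
# → range_creation_start=2023-01-01T00:00:00.000Z&range_creation_end=2023-12-31T23:59:59.000Z
```

The same helper applies to range_modification_start and range_modification_end.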
Search in a specific field
To restrict the search to a specific field, you can use the fields parameter. It supports different field types:
- a: generic fields (= basic attributes, like title or summary)
- t: text fields
- f: file fields
- u: link fields
Example:
fields=a/title
To search in several fields, the parameter can be repeated:
fields=a/title&fields=a/summary
For content fields, the behavior depends on the endpoint: on the resource /search endpoint, it restricts the search to a single piece of content; on the main /search endpoint, it restricts the search to all content with the given id across all resources.