Token split over camelCase and alphabet-numbers - Word delimiter kind of token filter. #97159

aananthagovindarajan · 2026-02-17T07:59:50Z

I heard that custom tokenization is on the TODO list.

Until it is available, we are considering a workaround: add an extra field that holds the pre-split tokens and create the inverted index on that field, as follows:

CREATE TABLE webapp_logs_final
(
    event_time DateTime,
    raw_message String,
    search_tokens String MATERIALIZED lower(arrayStringConcat(extractAll(raw_message, '([A-Z][a-z]+|[A-Z]+|[a-z]+|[0-9]+)'), ' ')),
    INDEX inv_msg_idx search_tokens TYPE text(tokenizer = 'splitByNonAlpha') GRANULARITY 1
)
ENGINE = MergeTree
ORDER BY event_time

The tokenizer supports a preprocessor section. Would it be possible to support an expression such as lower(arrayStringConcat(extractAll(raw_message, '([A-Z][a-z]+|[A-Z]+|[a-z]+|[0-9]+)'), ' ')) in the preprocessor, so that the additional field is not needed? The regex splits words wherever letters meet digits and at camelCase boundaries.
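To make the effect of the expression concrete, here is a minimal Python sketch that emulates it outside ClickHouse. The helper name search_tokens is only illustrative, and the sketch assumes that Python's re module and the ClickHouse regex engine agree on this simple pattern; it is not a statement about how ClickHouse implements extractAll.

```python
import re

# Same alternation as in the MATERIALIZED expression. The outer capture
# group is dropped so that re.findall returns the whole match.
TOKEN_RE = re.compile(r"[A-Z][a-z]+|[A-Z]+|[a-z]+|[0-9]+")


def search_tokens(raw_message: str) -> str:
    """Emulate lower(arrayStringConcat(extractAll(raw_message, pattern), ' '))."""
    return " ".join(TOKEN_RE.findall(raw_message)).lower()


print(search_tokens("UserID123 Failed due to login problem"))
# user id 123 failed due to login problem
print(search_tokens("UserID123 LoginFailed"))
# user id 123 login failed
```

"UserID123" becomes the three tokens "user", "id" and "123", which is why a lookup for the token '123' can succeed on the derived field.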

Example with the additional field:

Record 1 - raw_message = "UserID123 Failed due to login problem"
Record 2 - raw_message = "UserID123 LoginFailed"

SELECT * FROM webapp_logs_final WHERE hasToken(search_tokens, '123') AND hasToken(search_tokens, 'failed')

This query returns both records.

Phrase query support is still missing. Since the additional field holds the tokens as a single space-separated string, it can be used as follows to emulate a phrase query:

SELECT * FROM webapp_logs_final WHERE (hasToken(search_tokens, '123') AND hasToken(search_tokens, 'failed')) AND (search_tokens LIKE '%123 login%');

This query returns only the second record, because only its search_tokens value ("user id 123 login failed") contains "123 login".
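As a rough illustration of a phrase check on the token string, the Python sketch below tests whether the phrase tokens appear as a contiguous run. The function has_phrase and the records dictionary are hypothetical; the two strings are the search_tokens values that the MATERIALIZED expression would produce for the two example records. Unlike a plain substring LIKE, this compares whole tokens, so it is slightly stricter.

```python
def has_phrase(search_tokens: str, phrase: str) -> bool:
    """Return True if the phrase tokens occur consecutively in search_tokens."""
    tokens = search_tokens.split()
    wanted = phrase.lower().split()
    n = len(wanted)
    if n == 0:
        return False
    # Slide a window of n tokens over the token list and compare.
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


# Derived search_tokens values for the two example records.
records = {
    1: "user id 123 failed due to login problem",
    2: "user id 123 login failed",
}

matches = [rid for rid, toks in records.items() if has_phrase(toks, "123 login")]
print(matches)  # [2]
```

In record 1 the token "123" is followed by "failed", so the phrase "123 login" does not occur there even though both words are present.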

Two problems are discussed here:

1. Tokenizing without the additional field
2. An option for phrase queries

If regex-based tokenization were supported via the preprocessor, without the additional field, phrase queries could be served by reading only the records that the index selects.

Does it make sense?
