Token split over camelCase and alphabet-numbers - Word delimiter kind of token filter. #97159
Unanswered
aananthagovindarajan
asked this question in
Q&A
Replies: 0 comments
I heard that custom tokenization is on the TODO list.
Until it is available, we are looking at an alternative: add an extra field that holds the tokens, and create an inverted index on that field.
The tokenizer supports a preprocessor section. Would it be possible to support an expression such as " lower(arrayStringConcat(extractAll(raw_message, '([A-Z][a-z]+|[A-Z]+|[a-z]+|[0-9]+)'), ' ')) " in the preprocessor? That would avoid the additional field. The regex splits words at camelCase boundaries and wherever letters change to digits or digits to letters.
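To make the intended split concrete, here is a minimal Python sketch of what the expression above would produce. It only mimics the lower / arrayStringConcat / extractAll chain with Python's `re` module and is not ClickHouse code; the function names `tokenize` and `preprocess` are hypothetical and chosen for illustration.

```python
import re

# Same alternation as in the expression above. The outer capture group is
# dropped because re.findall already returns whole matches without it.
TOKEN_PATTERN = re.compile(r"[A-Z][a-z]+|[A-Z]+|[a-z]+|[0-9]+")


def tokenize(raw_message):
    """Split on camelCase and letter/digit boundaries, then lowercase."""
    return [token.lower() for token in TOKEN_PATTERN.findall(raw_message)]


def preprocess(raw_message):
    """Hypothetical stand-in for the preprocessor: tokens joined by spaces."""
    return " ".join(tokenize(raw_message))


if __name__ == "__main__":
    print(preprocess("UserID123 Failed due to login problem"))
    # user id 123 failed due to login problem
    print(preprocess("UserID123 LoginFailed"))
    # user id 123 login failed
```

One caveat of this pattern, at least in Python: a run of capitals followed by a capitalized word is not split where a reader might expect, so "HTTPServer" becomes "https" and "erver".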
Example with the additional field:
Record 1 - raw_message = "UserID123 Failed due to login problem"
Record 2 - raw_message = "UserID123 LoginFailed"
This query returns both records.
Phrase query support is still missing. Since the additional field holds the array of tokens, it can be used as follows to achieve a phrase query:
This query returns only the first record.
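The difference between a plain token match and a phrase match over the token array can be sketched in Python as follows. This is an illustration under assumptions, not the queries from this post: the helper names and the search phrase "123 failed" are hypothetical, picked because both tokens occur in both records but are adjacent only in the first.

```python
import re

TOKEN_PATTERN = re.compile(r"[A-Z][a-z]+|[A-Z]+|[a-z]+|[0-9]+")


def tokenize(text):
    """Split on camelCase and letter/digit boundaries, then lowercase."""
    return [token.lower() for token in TOKEN_PATTERN.findall(text)]


def has_all_tokens(tokens, terms):
    """Token match: every search term occurs somewhere in the record."""
    return all(term in tokens for term in terms)


def has_phrase(tokens, phrase):
    """Phrase match: the terms occur consecutively and in order."""
    n = len(phrase)
    if n == 0:
        return True
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


records = [
    "UserID123 Failed due to login problem",
    "UserID123 LoginFailed",
]

if __name__ == "__main__":
    query = tokenize("123 failed")  # hypothetical search input
    for record in records:
        tokens = tokenize(record)
        print(record, has_all_tokens(tokens, query), has_phrase(tokens, query))
    # UserID123 Failed due to login problem True True
    # UserID123 LoginFailed True False
```

In this sketch the token match plays the role of the index lookup that narrows down candidate records, and the phrase check is the extra step that needs the ordered token array for each candidate.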
Two problems are discussed above: regex-based tokenization and phrase queries.
If regex-based tokenization were supported through the preprocessor, the additional field would not be needed, and phrase queries could be answered by reading only the matching records.
Does this make sense?


