Memoize file match patterns #1412
Conversation
private static final String PATTERN_CHARS_TO_ESCAPE = "\\.[]{}()*+-?^$|";
private static final Map<String, Pattern> PATTERNS = new HashMap<>();
Since it is static, I'd suggest using ConcurrentHashMap here
Good, I changed it to java.util.concurrent.ConcurrentHashMap
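For reference, a minimal sketch of the memoization approach agreed on above, using a static ConcurrentHashMap as discussed in this review thread. The class and helper names (PatternCache, buildPatternFromIgnoreLine) are hypothetical placeholders for illustration, not the actual docker-java code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;

// Sketch of the memoization idea; buildPatternFromIgnoreLine is a hypothetical
// stand-in for whatever translates an ignore line into a regex.
public final class PatternCache {

    // ConcurrentHashMap because the cache is static and may be hit from multiple threads.
    private static final ConcurrentMap<String, Pattern> PATTERNS = new ConcurrentHashMap<>();

    public static Pattern forIgnoreLine(String ignoreLine) {
        // computeIfAbsent compiles each distinct ignore line at most once.
        return PATTERNS.computeIfAbsent(ignoreLine, PatternCache::buildPatternFromIgnoreLine);
    }

    private static Pattern buildPatternFromIgnoreLine(String ignoreLine) {
        // Placeholder: the real code escapes special characters and expands glob syntax.
        return Pattern.compile(Pattern.quote(ignoreLine));
    }

    private PatternCache() {
    }
}
```

With computeIfAbsent, each distinct ignore line is compiled only once per process, and subsequent lookups skip recompilation entirely.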
@swiedenfeld I was about to merge it but then realized... if the app is long-running, such a cache may eventually explode and cause an OOM. Since the class is in
@bsideup Good point. I was unaware it could run for a long time, as I focused on its use in the build step. There, the number of ignore patterns must be bounded by the lines in
For the TTL, do you suggest a reasonable time interval until cache entries expire?
@swiedenfeld there are a few usages of docker-java that run for hours, days or even months :D Since it parses user input, there is no guarantee that the set will be final, despite the relationship to "a few lines in
For TTL, I'd suggest something like 1 hour. Alternatively, IIRC Guava can limit by the cache size, which may be another option (to limit it to 25 MB, or 10k patterns, or similar).
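For illustration, a hedged sketch of the bounded alternative suggested above, assuming Guava is on the classpath. The one-hour TTL and 10,000-entry cap mirror the numbers floated in this thread; BoundedPatternCache and its placeholder loader are hypothetical, not the code that was merged.

```java
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

// Sketch of a bounded pattern cache so that unbounded user input cannot cause an OOM.
public final class BoundedPatternCache {

    private static final LoadingCache<String, Pattern> PATTERNS = CacheBuilder.newBuilder()
            // Cap the number of cached patterns (size-based eviction).
            .maximumSize(10_000)
            // Drop entries that have not been used for an hour (the TTL suggested above).
            .expireAfterAccess(1, TimeUnit.HOURS)
            .build(new CacheLoader<String, Pattern>() {
                @Override
                public Pattern load(String ignoreLine) {
                    // Placeholder compilation step; the real code builds a glob-style regex.
                    return Pattern.compile(Pattern.quote(ignoreLine));
                }
            });

    public static Pattern forIgnoreLine(String ignoreLine) {
        return PATTERNS.getUnchecked(ignoreLine);
    }

    private BoundedPatternCache() {
    }
}
```

Combining an entry cap with an expire-after-access window keeps worst-case memory bounded even when a long-running application keeps feeding new user-supplied ignore patterns.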
@swiedenfeld ping :)

pong :)

@swiedenfeld FYI I just changed some defaults and will merge as soon as CI is green :)
I ran a profiler to analyse the performance of ScannedResult#addFilesInDirectory in the scenario described in #1409. Every ignore string must be matched against each of the 60k+ files, and a non-negligible amount of time is spent compiling the ignore strings into patterns over and over again. It seems sensible to cache already-compiled patterns in a map for quick resolution.
Unfortunately, it does not improve the situation in #1409 by much. The majority of processing time is spent walking the directory tree, which is possibly I/O-bound.