Skip to content

scripts: update log#2486

Closed
georglauterbach wants to merge 32 commits intomasterfrom
update/log
Closed

scripts: update log#2486
georglauterbach wants to merge 32 commits intomasterfrom
update/log

Conversation

@georglauterbach
Copy link
Copy Markdown
Member

Description

Cherry picked from commits in #2485, this PR holds all the log updates. The commit descriptions is somewhat detailed.

Type of change

  • Improvement (non-breaking change that does improve existing functionality)
  • This change requires a documentation update

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (README.md or the documentation under docs/)
  • If necessary I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

This is a bigger commit - it reworks the whole log functionality. The
old one was confusing and not adhering to common standards, i.e. not
following convention over configuration. With this commit, we use the
common standard log levels

4. TRACE
3. DEBUG
2. INFO
1. WARN
0. ERROR

to better provide the information to users and maintainers. With this
changes, `DMS_DEBUG` was removed and `LOG_LEVEL` introduced. `DMS_DEBUG`
is not usefule anymore (obviously), and replaced by the new `LOG_LEVEL`,
which defaults to `info`.

All scripts and tests were adjusted to use the new format. There is a
special log level (`always`) which is used only by `start-mailserver.sh`
to show the welcome messages independent of the log level.

I took some care when converting the "old" log levels. With this rework,
it should be easier to identify problems in logs and provide users with
a clear and straight forward log level variable which is easy to
understand.

This is not actually a breaking change, but I thought it is fitting for
the next major release `v11`.
@georglauterbach georglauterbach added priority/high priority/medium area/scripts kind/improvement Improve an existing feature, configuration file or the documentation kind/update Update an existing feature, configuration file or the documentation labels Mar 15, 2022
@georglauterbach georglauterbach added this to the v11.0.0 milestone Mar 15, 2022
@georglauterbach georglauterbach self-assigned this Mar 15, 2022
@georglauterbach georglauterbach mentioned this pull request Mar 15, 2022
7 tasks
1. `_errex` now logs correctly via `_notify`
2. `warn` and `error` now log to stderr
3. Some small log level adjustments
4. Corrected a bug in `setup-stack.sh` where the parent directory
   would not be present and the following `: >...` would not
   succeed
The env variable `POSTMASTER_ADDRESS` was set in a way that is not very
clear to the reader and indeed possibly wrong. The old way used the
default while the new way uses the correct variable from
`/root/.bashrc`.

Therefore, a new function `_get_dms_environment_variables` was
introduced which handles exporting vaiables from `/root/.bashrc`. This
could be used in the future as well.

I KNOW THIS IS NOT ACTUALLY RELATED TO THE LOG - PLEASE DON'T KILL ME :D
I just had to fix this... :) This is a small addition, I hope you don't
mind.
@georglauterbach
Copy link
Copy Markdown
Member Author

georglauterbach commented Mar 17, 2022

With cce33f8, I improved check-for-changes.sh a bit. I know this is not actually related to the log, but I figured the change was small enough (around 10 lines in check-for-changes.sh) to fit in here. I hope you don't mind it :) The commit as a whole does more than that (i.e. adjusting some log levels, formatting, etc.).

The `daemons-stack.sh` is now adjusted to the log level and will only
print information about the started daemon on `debug` or `trace` and
suppress the Supervisor log output (which is not needed and unwanted).
`check-for-changes.sh` should now wait more efficiently on startup.
Moreover, the `mail_ssl_letsencrypt.bats` test was adjusted to view more
log. This has become necessary as the log grew in absolute size with
this PR (the messages are longer, more bytes to log).

I CAN'T FOR THE LIFE OF ME find the reason the test in
`mail_ssl_letsencrypt.bats` in the function `_should_extract_on_changes`
which is called by `_acme_wildcard` in the test named
`ssl(letsencrypt): Traefik 'acme.json' (*.example.test)` in the file
`mail_ssl_letsencrypt.bats` fails. @polarthene I need your help on this:
the last `cp` of the wildcard ACME does not seem to cause the message
`'/etc/letsencrypt/acme.json' has changed - extracting certifcates now`
to be logged, though it should, right? Maybe this was overlooked in the
past (false positive by viewing to much of the previous log???). In all
of my tests, the wildcard ACME does not cause the log message about the
file to be logged again...
@georglauterbach
Copy link
Copy Markdown
Member Author

georglauterbach commented Mar 17, 2022

@polarathene I need your help:

The test mail_ssl_letsencrypt.bats does not work. The last call of _acme_wildcard in the test ssl(letsencrypt): Traefik 'acme.json' (*.example.test) does not succeed, because the message /etc/letsencrypt/acme.json' has changed - extracting certifcates now is not logged. I spent more than an hour trying to debug this. To me, it seems as if I found a false positive (in the past, we were viewing too much log ???).

This is the current log of _acme_wildcard:

[   INF   ]  (CHANGEDETECTOR) 2022-03-17 13:47:28 Change detected
[  TRACE  ]  Creating lock /tmp/docker-mailserver/check-for-changes.sh.lock
[  TRACE  ]  Checking file line endings
[  TRACE  ]  Regenerating postfix user list
[  DEBUG  ]  Creating user 'user1' for domain 'localhost.localdomain'
[  DEBUG  ]  Creating user 'user2' for domain 'otherdomain.tld'
[  DEBUG  ]  Creating user 'user3' for domain 'localhost.localdomain' with attributes 'userdb_mail=mbox:~/mail:INBOX=~/inbox'
[  DEBUG  ]  Adding alias '[email protected]' for user '[email protected]' to Dovecot's userdb
[  DEBUG  ]  Alias '[email protected]' is non-local (or mapped to a non-existing account) and will not be added to Dovecot's userdb
[  DEBUG  ]  Adding alias '@localdomain2.com' for user '[email protected]' to Dovecot's userdb
[  TRACE  ]  Adding regexp alias file postfix-regexp.cf
virtual_alias_maps = texthash:/etc/postfix/virtual pcre:/etc/postfix/regexp
[  TRACE  ]  Configuring root alias
[  DEBUG  ]  (CHANGEDETECTOR) 2022-03-17 13:47:28 Restarting services due to detected changes
postfix: stopped
postfix: started
dovecot: stopped
dovecot: started
[  TRACE  ]  Removed lock /tmp/docker-mailserver/check-for-changes.sh.lock
[  TRACE  ]  (CHANGEDETECTOR) 2022-03-17 13:47:31 Completed handling of detected change

This is the log of the call to _acme_rsa before:

[   INF   ]  (CHANGEDETECTOR) 2022-03-17 13:47:20 Change detected
[  TRACE  ]  Creating lock /tmp/docker-mailserver/check-for-changes.sh.lock
[  DEBUG  ]  (CHANGEDETECTOR) 2022-03-17 13:47:20 '/etc/letsencrypt/acme.json' has changed - extracting certifcates now
[  TRACE  ]  (CHANGEDETECTOR) 2022-03-17 13:47:20 Attempting to extract for '*.example.test'
[  TRACE  ]  _extract_certs_from_acme | Certificate successfully extracted for 'example.test'
[  TRACE  ]  Checking file line endings
[  TRACE  ]  Regenerating postfix user list
[  DEBUG  ]  Creating user 'user1' for domain 'localhost.localdomain'
[  DEBUG  ]  Creating user 'user2' for domain 'otherdomain.tld'
[  DEBUG  ]  Creating user 'user3' for domain 'localhost.localdomain' with attributes 'userdb_mail=mbox:~/mail:INBOX=~/inbox'
[  DEBUG  ]  Adding alias '[email protected]' for user '[email protected]' to Dovecot's userdb
[  DEBUG  ]  Alias '[email protected]' is non-local (or mapped to a non-existing account) and will not be added to Dovecot's userdb
[  DEBUG  ]  Adding alias '@localdomain2.com' for user '[email protected]' to Dovecot's userdb
[  TRACE  ]  Adding regexp alias file postfix-regexp.cf
virtual_alias_maps = texthash:/etc/postfix/virtual pcre:/etc/postfix/regexp
[  TRACE  ]  Configuring root alias
[  DEBUG  ]  (CHANGEDETECTOR) 2022-03-17 13:47:20 Restarting services due to detected changes
postfix: stopped
postfix: started
dovecot: stopped
dovecot: started
[  TRACE  ]  Removed lock /tmp/docker-mailserver/check-for-changes.sh.lock
[  TRACE  ]  (CHANGEDETECTOR) 2022-03-17 13:47:26 Completed handling of detected change

In the latter, the line '/etc/letsencrypt/acme.json' has changed - extracting certifcates now is properly shown. But I cannot for the life of me find out why this line is not shown in the former...

@georglauterbach
Copy link
Copy Markdown
Member Author

georglauterbach commented Mar 17, 2022

@polarathene indeed, the log of the changedetector on master for this test looks like this:

[[ TASKS ]]  2022-03-17 16:11:22 Start check-for-changes script.
[[  INF  ]]  2022-03-17 16:11:22 Using postmaster address [email protected]
[[  INF  ]]  2022-03-17 16:11:32 check-for-changes is ready
[[  INF  ]]  2022-03-17 16:11:36 Change detected
[[  INF  ]]  '/etc/letsencrypt/acme.json' has changed, extracting certs..
[[  INF  ]]  Attempting to extract for '*.example.test'
[ WARNING ]  _extract_certs_from_acme | Unable to find key and/or cert for '*.example.test' in '/etc/letsencrypt/acme.json'
[[  INF  ]]  Attempting to extract for 'mail.example.test'
[[  INF  ]]  _extract_certs_from_acme | Certificate successfully extracted for 'mail.example.test'
[[  INF  ]]  Checking file line endings
[[  INF  ]]  Regenerating postfix user list
[[  INF  ]]  Creating user 'user1' for domain 'localhost.localdomain'
[[  INF  ]]  Creating user 'user2' for domain 'otherdomain.tld'
[[  INF  ]]  Creating user 'user3' for domain 'localhost.localdomain' with attributes 'userdb_mail=mbox:~/mail:INBOX=~/inbox'
[[  INF  ]]  Adding regexp alias file postfix-regexp.cf
virtual_alias_maps = texthash:/etc/postfix/virtual pcre:/etc/postfix/regexp
[[  INF  ]]  Configuring root alias
[[  INF  ]]  Restarting services due to detected changes..
postfix: stopped
postfix: started
dovecot: stopped
dovecot: started
[[  INF  ]]  Removed lock /tmp/docker-mailserver/check-for-changes.sh.lock.
[[  INF  ]]  2022-03-17 16:11:44 Completed handling of detected change
[[  INF  ]]  2022-03-17 16:11:48 Change detected
[[  INF  ]]  '/etc/letsencrypt/acme.json' has changed, extracting certs..
[[  INF  ]]  Attempting to extract for '*.example.test'
[[  INF  ]]  _extract_certs_from_acme | Certificate successfully extracted for 'example.test'
[[  INF  ]]  Checking file line endings
[[  INF  ]]  Regenerating postfix user list
[[  INF  ]]  Creating user 'user1' for domain 'localhost.localdomain'
[[  INF  ]]  Creating user 'user2' for domain 'otherdomain.tld'
[[  INF  ]]  Creating user 'user3' for domain 'localhost.localdomain' with attributes 'userdb_mail=mbox:~/mail:INBOX=~/inbox'
[[  INF  ]]  Adding regexp alias file postfix-regexp.cf
virtual_alias_maps = texthash:/etc/postfix/virtual pcre:/etc/postfix/regexp
[[  INF  ]]  Configuring root alias
[[  INF  ]]  Restarting services due to detected changes..
postfix: stopped
postfix: started
dovecot: stopped
dovecot: started
[[  INF  ]]  Removed lock /tmp/docker-mailserver/check-for-changes.sh.lock.
[[  INF  ]]  2022-03-17 16:11:53 Completed handling of detected change

and there is a third '/etc/letsencrypt/acme.json' has changed, extracting certs.. somehow missing. It could be that there are only 2, which would indicate that actually all is well but we're not looking at enough log?? Are there overall only 2 changes with this test? Have I misunderstood it the whole time?

EDIT: Something odd is happening here. On master, only two changes are detected. On this branch (update/log), three changes are detected...

Copy link
Copy Markdown
Member

@polarathene polarathene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I do appreciate nice improvements like the daemons-stack.sh addition. It like others probably should have been delayed as follow-up PRs.

Similar to how some of the broader sweeping changes of this PR could have been introduced in phases. You're not only adjusting log levels, but changing the existing ones, and making adjustments to content among other changes related to the logs.

It's easier for reviewers in this case to go into a PR knowing that you're only introducing a new and improved logger method, with existing log levels being remapped in the method, then a followup PR that just focuses on changes to drop that mapping, and adjust the log levels throughout the files, especially when you're adjusting based on context (for example when inf became info, debug, trace or just a plain echo).

There's more than just that going on with this PR though. I rather you don't attempt to split the PR again due to residue issues we had with the split to this PR.

I don't particularly want to wade through the diff of this size again to catch those hiccups. So unless another reviewer explicitly requests it to assist them reviewing, please don't take that initiative 😅


One of the main concerns with changes introduced since is your new utility for importing ENV, as the relevant review comments detail, this seems to be from a lack of awareness of /etc/dms-settings or I'm mistaken and there's good justification for it.

I'm also wary of the alternative approach for sourcing POSTMASTER_ADDRESS, mostly due to the dependency on sleep. Already the log output size has affected tests, and varying layers of sleep is another common issue we run into, I prefer to minimize the need for it where possible.

Comment on lines +19 to +25
function _changedetector_exit_error
{
_changedetector_notify "${@}"
_changedetector_notify 'Aborting'

exit 1
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These aren't calling a log level? I assume they should? Do you really want to be using ${@} here when ${1} to restrict to a single message arg should suffice?

I'd personally rather methods that don't document variable arg length are kept simple to grok. When the args are fixed, I know that anything calling this is predictable in usage, but in this case I am given the impression there is usage that warrants the flexibility when AFAIK there isn't?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, there is the error log level missing. I will correct this and change "${@}" with "${1}".

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@polarathene anything wrong with my reply? Is this not what you wanted to happen; did I misunderstand anything? (because of the 👎🏼)

@polarathene
Copy link
Copy Markdown
Member

polarathene commented Mar 18, 2022

I spent more than an hour trying to debug this. To me, it seems as if I found a false positive (in the past, we were viewing too much log ???).

If we were viewing too much log, then the test should still fail, it explicitly looks for the FQDN arg the function is called with:

assert_output --partial "_extract_certs_from_acme | Certificate successfully extracted for '${EXPECTED_DOMAIN}'"

And after that test, I also added another to count duplicate log output for expected number of restarts by this point. Perhaps not exactly foolproof but should be reasonably reliable:

# Ensure change detection is not mistakenly validating against previous change events:
function _should_have_service_restart_count() {
local NUM_RESTARTS=${1}
# Count how many times postfix was restarted by the `changedetector` service:
run docker exec "${TEST_NAME}" sh -c "supervisorctl tail changedetector | grep -c 'postfix: started'"
assert_output "${NUM_RESTARTS}"
}

Detailed breakdown if helpful

Full flow

and there is a third '/etc/letsencrypt/acme.json' has changed, extracting certs.. somehow missing. It could be that there are only 2, which would indicate that actually all is well but we're not looking at enough log?? Are there overall only 2 changes with this test? Have I misunderstood it the whole time?

  • _acme_ecdsa(): Only cares that certs were extracted at startup.
  • _acme_rsa(): Certs have changed, should be detected and extracted to mail.example.test cert dir.
  • _acme_wildcard(): Same as before, but this time the first check for wildcard is successful and has priority, this is not a *.example.test dir, but example.test dir (which may seem a bit odd but was to avoid *. as part of the directory name).

Priority is due to the SSL_DOMAIN=*.example.test ENV specified early in the test as additional container args/ENV beyond the defaults:

local TEST_DOCKER_ARGS=(
--volume "${TEST_TMP_CONFIG}/letsencrypt/acme.json:/etc/letsencrypt/acme.json:ro"
--env DMS_DEBUG=1
--env PERMIT_DOCKER='container'
--env SSL_DOMAIN='*.example.test'
--env SSL_TYPE='letsencrypt'
)
common_container_setup 'TEST_DOCKER_ARGS'

Priority logic in change detection for acme.json:

_notify 'inf' "'/etc/letsencrypt/acme.json' has changed, extracting certs.."
# This breaks early as we only need the first successful extraction.
# For more details see the `SSL_TYPE=letsencrypt` case handling in `setup-stack.sh`.
#
# NOTE: HOSTNAME is set via `helpers/dns.sh`, it is not the original system HOSTNAME ENV anymore.
# TODO: SSL_DOMAIN is Traefik specific, it no longer seems relevant and should be considered for removal.
FQDN_LIST=("${SSL_DOMAIN}" "${HOSTNAME}" "${DOMAINNAME}")
for CERT_DOMAIN in "${FQDN_LIST[@]}"
do
_notify 'inf' "Attempting to extract for '${CERT_DOMAIN}'"
if _extract_certs_from_acme "${CERT_DOMAIN}"
then
# Prevent an unnecessary change detection from the newly extracted cert files by updating their hashes in advance:
CERT_DOMAIN=$(_strip_wildcard_prefix "${CERT_DOMAIN}")
ACME_CERT_DIR="/etc/letsencrypt/live/${CERT_DOMAIN}"
sed -i "\|${ACME_CERT_DIR}|d" "${CHKSUM_FILE}.new"
sha512sum "${ACME_CERT_DIR}"/*.pem >> "${CHKSUM_FILE}.new"
break
fi
done
fi

That should be calling the extract certs helper:

function _extract_certs_from_acme
{
local CERT_DOMAIN=${1}
if [[ -z ${CERT_DOMAIN} ]]
then
_notify 'err' "_extract_certs_from_acme | CERT_DOMAIN is empty"
return 1
fi
local KEY CERT
KEY=$(acme_extract.py /etc/letsencrypt/acme.json "${CERT_DOMAIN}" --key)
CERT=$(acme_extract.py /etc/letsencrypt/acme.json "${CERT_DOMAIN}" --cert)
if [[ -z ${KEY} ]] || [[ -z ${CERT} ]]
then
_notify 'warn' "_extract_certs_from_acme | Unable to find key and/or cert for '${CERT_DOMAIN}' in '/etc/letsencrypt/acme.json'"
return 1
fi
# Currently we advise SSL_DOMAIN for wildcard support using a `*.example.com` value,
# The filepath however should be `example.com`, avoiding the wildcard part:
if [[ ${SSL_DOMAIN} == "${CERT_DOMAIN}" ]]
then
CERT_DOMAIN=$(_strip_wildcard_prefix "${SSL_DOMAIN}")
fi
mkdir -p "/etc/letsencrypt/live/${CERT_DOMAIN}/"
echo "${KEY}" | base64 -d > "/etc/letsencrypt/live/${CERT_DOMAIN}/key.pem" || exit 1
echo "${CERT}" | base64 -d > "/etc/letsencrypt/live/${CERT_DOMAIN}/fullchain.pem" || exit 1
_notify 'inf' "_extract_certs_from_acme | Certificate successfully extracted for '${CERT_DOMAIN}'"
}

which should log a success or failure.. but there is two || exit 1 lines at the end there which would not log and go undetected.. I did not introduce that logic for those two lines, it was pre-existing before me. I recall briefly discussing it with casper in a review that it probably should get addressed at some point I think..?

The directory is created in advance however, and we return early with warning logged if KEY or CERT is empty. I'm not sure why that would fail to trigger || exit?

I assume there is nothing wrong with calling the python script since that is working at an earlier stage in the tests. There was a recent change to use an explicit .py extension, otherwise acme_extract.py works like so:

  1. Given a filepath to acme.json (rsa example and ecdsa wildcard example) parse it and iterate through the top-level keys looking for the Certificates key:

    def retrieve_data():
    with open(args.filepath) as json_file:
    acme_data = json.load(json_file)
    for key, value in acme_data.items():
    try:
    certs = value['Certificates'] or []

  2. The Certificates field contains an array of objects, each a certificate. Check each one for a domain field, it has the subfields main and sans, these days sans should contain the FQDN, we check that it has the FQDN we requested via args:

    for cert in certs:
    if has_fqdn(cert['domain'], args.fqdn):
    return cert[args.requested]

    def has_fqdn(domains, fqdn):
    main = domains.get('main', '')
    sans = domains.get('sans', [])
    return main == fqdn or fqdn in sans

  3. In the event of a failure, we return None which will print as an empty string (which will be caught/logged by the ssl helper function _extract_certs_from_acme):

    # One of the expected keys is missing.. return an empty result
    # Certificates: [{domain: [main, sans], key, certificate}]
    except KeyError:
    return None
    # No match == 'None', we convert to empty string for
    # existing error handling:
    result = retrieve_data() or ''
    print(result)


Actual flow issue

So that whole flow seems sound AFAIK. The issue with _acme_wildcard() in your current PR error logs detects a change but doesn't log anything about attempting to extract or that acme.json changed.. Referencing check-for-changes.sh master:

_notify 'inf' "$(_log_date) Change detected"
_create_lock # Shared config safety lock
CHANGED=$(grep -Fxvf "${CHKSUM_FILE}" "${CHKSUM_FILE}.new" | sed 's/^[^ ]\+ //')

elif [[ ${CHANGED} =~ /etc/letsencrypt/acme.json ]]
then
_notify 'inf' "'/etc/letsencrypt/acme.json' has changed, extracting certs.."

As we don't reach that 2nd step, what may have occurred is in _acme_rsa() this logic ran which would be intentional, but somehow prevented change detection of acme.json on the following _acme_wildcard() test:

_notify 'inf' "Attempting to extract for '${CERT_DOMAIN}'"
if _extract_certs_from_acme "${CERT_DOMAIN}"
then
# Prevent an unnecessary change detection from the newly extracted cert files by updating their hashes in advance:
CERT_DOMAIN=$(_strip_wildcard_prefix "${CERT_DOMAIN}")
ACME_CERT_DIR="/etc/letsencrypt/live/${CERT_DOMAIN}"
sed -i "\|${ACME_CERT_DIR}|d" "${CHKSUM_FILE}.new"
sha512sum "${ACME_CERT_DIR}"/*.pem >> "${CHKSUM_FILE}.new"
break

Actually, it looks like the first | is being escaped here, but not the 2nd, I'm not sure if that's valid or not, I've probably mentioned it in the relevant PR or commit in that PR (maybe PR merge commit message):

sed -i "\|${ACME_CERT_DIR}|d" "${CHKSUM_FILE}.new"

That doesn't explain why acme.json was not part of the detected change however. Could be a timing issue with insufficient sleep? When are you retrieving the logs you shared? If that's not in sync with when the test attempts to match, you could be right that the PR tests are reading and matching old logs, change-detection for the acme.json file may occur after that.

cp "${ACME_JSON}" "${TEST_TMP_CONFIG}/letsencrypt/acme.json"
# Change detection takes a little over 5 seconds to complete (restart services)
sleep 10
# Expected log lines from the changedetector service:
run $(_get_service_logs 'changedetector')
assert_output --partial 'Change detected'
assert_output --partial "'/etc/letsencrypt/acme.json' has changed, extracting certs"
assert_output --partial "_extract_certs_from_acme | Certificate successfully extracted for '${EXPECTED_DOMAIN}'"
assert_output --partial 'Restarting services due to detected changes'


One possible factor contributing to the observed behaviour

If file locking logic has anything to do with it, I will reference these findings I made late last year. I'd like to highlight the collapsed details block there "Switch back to flock", which points out:

The intent of locking files in the first place - flock locked the configs being interacted with, while the replacement applied locks to individual scripts instead, thus locks only blocked a specific script from being concurrently called, but not concurrent access to the actual config files being modified which was the original intent with flock being introduced.

It's possible that is part of the problem, along with sleeps and the changes to log output size affecting timing? I would look into the checksum file changes, as that should reveal more about what change is detected for _acme_wildcard() if it's not the acme.json update..


If resolving proves difficult:

I would offer to investigate this further myself, but it would presently be inconvenient. I'd be happy to try allocate some time later this month for it if needed.

Feel free to workaround or disable the test to merge this PR, then open an issue about this problem and assign me to it.

`check-for-changes.sh` now uses `/etc/dms-settings` instead of my self-written, new (and unnecessary) one.
Moreover, some small reverts (doc in `error.sh`) and an unnecessary `\` in `mail_ssl_letsencrypt.bats`.
See #2486 (comment)
for an explanation. I also incorporated PR feedback.
@georglauterbach
Copy link
Copy Markdown
Member Author

Houston, we have a problem

The Problem

So, after long and painful hours of debugging, I found some problems that are currently present with check-for-changes.sh. After I was unable to figure out why my log PR #2486 fails, I had enough and added the following lines to the script to look at the changes myself:

_changedetector_notify 'warn' 'DEBUG START'
echo -e "\nchanged: ${CHANGED}\n"
echo -e "\nnew: $(<${CHKSUM_FILE}.new)\n"
echo -e "\nold: $(<${CHKSUM_FILE})\n"
_changedetector_notify 'warn' 'DEBUG END'

For context: the current logs only show two changes being being detected in the mail_ssl_letsencrypt.bats test, while with #2486, three changes are detected... To my surprise, the last change looked like this:

changed:


new: d6825c549a553df95e3830b6a3ec3f161f320b2a600b4f1fd63540920ad33fd6b587dbdf8359ccb35dd1a17c9fd14c17f113954f73da8f19bcf83f1649fd757a  /tmp/docker-mailserver/postfix-accounts.cf
37d1a2f5cb4f99c81463b05b366f27f3877d10e83685fec1ec1f38aabf60e2d28bc32a7d4b810e66739b37912f9dd9d1424145b93c7e5969120d345c05c8f74c  /tmp/docker-mailserver/postfix-virtual.cf
cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e  /tmp/docker-mailserver/postfix-aliases.cf
cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e  /tmp/docker-mailserver/dovecot-quotas.cf
a0eafaf897cb7c793244fe71de32120d7b22897ffe20e5f9f295251e992104c48148e3103187a61c7d92ad773d1e00889d4790038ca4326b414439dfbd90045e  /etc/letsencrypt/acme.json
5f79d61f94f365052e7c079d1ea5d3ac8810e6e5c028d8b22b12d80f3ceeffc0c8182a4bace90cea265d357fc73bb3518214cb54247a7945152d6dbd94d23a31  /etc/letsencrypt/live/mail.example.test/fullchain.pem
4fd75dc009e4499cd72e12ace6f348bfd904ac4e816aa2a13200ec2749a0e9b86f362407bc35c945f7cd667b3da5449de771ed96d0ae6f7cb24c8f82fcd3d3b6  /etc/letsencrypt/live/mail.example.test/key.pem


old: d6825c549a553df95e3830b6a3ec3f161f320b2a600b4f1fd63540920ad33fd6b587dbdf8359ccb35dd1a17c9fd14c17f113954f73da8f19bcf83f1649fd757a  /tmp/docker-mailserver/postfix-accounts.cf
37d1a2f5cb4f99c81463b05b366f27f3877d10e83685fec1ec1f38aabf60e2d28bc32a7d4b810e66739b37912f9dd9d1424145b93c7e5969120d345c05c8f74c  /tmp/docker-mailserver/postfix-virtual.cf
cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e  /tmp/docker-mailserver/postfix-aliases.cf
cf83e1357eefb8bdf1542850d66d8007d620e4050b5715dc83f4a921d36ce9ce47d0d13c5d85f2b0ff8318d2877eec2f63b931bd47417a81a538327af927da3e  /tmp/docker-mailserver/dovecot-quotas.cf
a0eafaf897cb7c793244fe71de32120d7b22897ffe20e5f9f295251e992104c48148e3103187a61c7d92ad773d1e00889d4790038ca4326b414439dfbd90045e  /etc/letsencrypt/acme.json
5f79d61f94f365052e7c079d1ea5d3ac8810e6e5c028d8b22b12d80f3ceeffc0c8182a4bace90cea265d357fc73bb3518214cb54247a7945152d6dbd94d23a31  /etc/letsencrypt/live/mail.example.test/fullchain.pem
4fd75dc009e4499cd72e12ace6f348bfd904ac4e816aa2a13200ec2749a0e9b86f362407bc35c945f7cd667b3da5449de771ed96d0ae6f7cb24c8f82fcd3d3b6  /etc/letsencrypt/live/mail.example.test/key.pem
f8911bc63794d532e2af7e24f71a92dea5fbef8a3f4da5249a13285fd715c958cdd265eea82721dbc5f9921613916d2b2cf21bf2012a1d6abae2a5a76c1e6faa  /etc/letsencrypt/live/example.test/fullchain.pem
760241712a2d96a6cbfe166c9fa063458c57ba869a8caf5bed1add6547f5445e1e890320f262ff04c60ecd757da15cbfc006547c5f0717af994a08edff20009b  /etc/letsencrypt/live/example.test/key.pem

After doing a manual cmp, the result was:

$ cmp old new
cmp: EOF on old after byte 1207, line 7
$ echo $?
1
$ diff old new
7a8,9
> f8911bc63794d532e2af7e24f71a92dea5fbef8a3f4da5249a13285fd715c958cdd265eea82721dbc5f9921613916d2b2cf21bf2012a1d6abae2a5a76c1e6faa  /etc/letsencrypt/live/example.test/fullchain.pem
> 760241712a2d96a6cbfe166c9fa063458c57ba869a8caf5bed1add6547f5445e1e890320f262ff04c60ecd757da15cbfc006547c5f0717af994a08edff20009b  /etc/letsencrypt/live/example.test/key.pem

So, there are two additional lines in the diff now.

f8911bc63794d532e2af7e24f71a92dea5fbef8a3f4da5249a13285fd715c958cdd265eea82721dbc5f9921613916d2b2cf21bf2012a1d6abae2a5a76c1e6faa  /etc/letsencrypt/live/example.test/fullchain.pem
760241712a2d96a6cbfe166c9fa063458c57ba869a8caf5bed1add6547f5445e1e890320f262ff04c60ecd757da15cbfc006547c5f0717af994a08edff20009b  /etc/letsencrypt/live/example.test/key.pem

Adding an ls -lhA /etc/letsencrypt/live/** to the script yields

/etc/letsencrypt/live/example.test:
total 8.0K
-rw-r--r-- 1 root root 1.2K Mar 18 09:28 fullchain.pem
-rw-r--r-- 1 root root 1.7K Mar 18 09:28 key.pem

/etc/letsencrypt/live/mail.example.test:
total 8.0K
-rw-r--r-- 1 root root 1.3K Mar 18 09:27 fullchain.pem
-rw-r--r-- 1 root root 1.7K Mar 18 09:27 key.pem

Summing up: the additional files are picked by the changedetector-run and cmp (which actually is correct behaviour, if you ask me). I don't know why this would not be the case on the master branch... The wildcard certificate does not use the mail. prefix for example.test, therefore the new files in example.test without the mail. prefix exist. The old files not cleaned up? This causes an additional run for the changedetector.

The Solution

I adjusted the tests to query an appropriate amount of log. This "solves" the issue, but in the future, a better solution could / should be found (the log should be queried in a more intelligent way). This is however not related to this PR.

@georglauterbach
Copy link
Copy Markdown
Member Author

georglauterbach commented Mar 18, 2022

@polarathene now I need your help again again (with a test failure that may actually be an expected and therefore correct failure):

function _acme_wildcard() {
_should_extract_on_changes 'example.test' "${LOCAL_BASE_PATH}/wildcard/rsa.acme.json"
_should_have_service_restart_count '2'
# note: https://github.com/docker-mailserver/docker-mailserver/pull/2404 solves this
# TODO: Make this pass.
# As the FQDN has changed since startup, the configs need to be updated accordingly.
# This requires the `changedetector` service event to invoke the same function for TLS configuration
# that is used during container startup to work correctly. A follow up PR will refactor `setup-stack.sh` for supporting this.
# _should_have_valid_config 'example.test' 'key.pem' 'fullchain.pem'
local WILDCARD_KEY_PATH="${LOCAL_BASE_PATH}/wildcard/key.rsa.pem"
local WILDCARD_CERT_PATH="${LOCAL_BASE_PATH}/wildcard/cert.rsa.pem"
_should_have_expected_files 'example.test' "${WILDCARD_KEY_PATH}" "${WILDCARD_CERT_PATH}"
# Verify this works for wildcard certs, it should use `*.example.test` for `mail.example.test` (NOT `example.test`):
_should_succesfully_negotiate_tls 'mail.example.test'
# WARNING: This should fail...but requires resolving the above TODO.
# _should_not_have_fqdn_in_cert 'example.test'
}

Line 171 says "WARNING: This should fail...but requires resolving the above TODO. ". Or is he warning for the line below?

@github-actions
Copy link
Copy Markdown
Contributor

Documentation preview for this PR is ready! 🎉

Built with commit: 28c4842

@polarathene
Copy link
Copy Markdown
Member

I don't know why this would not be the case on the master branch... The wildcard certificate does not use the mail. prefix for example.test, therefore the new files in example.test without the mail. prefix exist. The old files not cleaned up? This causes an additional run for the changedetector.

You don't think this is related to more earlier investigation notes above? This is in particular:

As we don't reach that 2nd step, what may have occurred is in _acme_rsa() this logic ran which would be intentional, but somehow prevented change detection of acme.json on the following _acme_wildcard() test:

_notify 'inf' "Attempting to extract for '${CERT_DOMAIN}'"
if _extract_certs_from_acme "${CERT_DOMAIN}"
then
# Prevent an unnecessary change detection from the newly extracted cert files by updating their hashes in advance:
CERT_DOMAIN=$(_strip_wildcard_prefix "${CERT_DOMAIN}")
ACME_CERT_DIR="/etc/letsencrypt/live/${CERT_DOMAIN}"
sed -i "\|${ACME_CERT_DIR}|d" "${CHKSUM_FILE}.new"
sha512sum "${ACME_CERT_DIR}"/*.pem >> "${CHKSUM_FILE}.new"
break

Actually, it looks like the first | is being escaped here, but not the 2nd, I'm not sure if that's valid or not, I've probably mentioned it in the relevant PR or commit in that PR (maybe PR merge commit message):

sed -i "\|${ACME_CERT_DIR}|d" "${CHKSUM_FILE}.new"

I have not verified the sed concern, but if it's not correctly removing the lines to prevent an unnecessary change detection, then that explains why you're getting a change detection event regarding new cert files, instead of for acme.json which should be the relevant change to react to in the test - not changes that happened as a result of the cert extraction in _acme_rsa() prior, or after example.test wildcard (if that change was detected and handled, but this logic resulted in an additional change after).

I'd also have to give it some thought if that logic is testing with *.example.test or example.test, it looks like CERT_DOMAIN is originally *.example.test and then if the strip method works correctly, it should be adjusted to example.test for the remaining logic.

You say master has 2 change detections logged, and then share the third additional change log that this PR caught due to log verbosity changes.. does this mean that the first two are the correct acme.json changes as expected? And the third is an accidental event with the example.test wildcard cert files triggering detection? The failure in the letsencrypt test is now present because that 2nd change event for acme.json success with the wildcard cert got the log bumped out of the retrieved lines, with only the third change event logs that shouldn't have occurred?

@polarathene
Copy link
Copy Markdown
Member

polarathene commented Mar 18, 2022

Line 171 says "WARNING: This should fail...but requires resolving the above TODO. ". Or is he warning for the line below?

This should be fine to enable now (but probably best deferred to a separate PR).

It's referencing line 163 which would also be enabled along with it, and the TODO comment above it. They needed the change detection event to run the ssl startup script again to update the filepaths to the correct letsencrypt directory (changed from mail.example.test to example.test).

It's an edge case, most users won't be changing to/from a wildcard cert on a running instance I imagine, this test will ensure we catch it though if someone refactors check-for-changes.sh thinking it wasn't necessary.


Test failure:

not ok 92 ssl(letsencrypt): Traefik 'acme.json' (*.example.test) in 41322ms
# (from function `assert_output' in file test/test_helper/bats-assert/src/assert.bash, line 231,
#  from function `_negotiate_tls' in file test/test_helper/tls.bash, line 53,
#  from function `_should_succesfully_negotiate_tls' in file test/test_helper/tls.bash, line 30,
#  from function `_acme_wildcard' in file test/mail_ssl_letsencrypt.bats, line 174,
#  in test file test/mail_ssl_letsencrypt.bats, line 193)
#   `_acme_wildcard' failed
# mail.example.test is up and running
# [  TRACE  ]  (CHANGEDETECTOR) 2022-03-18 19:30:13 Changedetector is ready
# 
# -- output does not contain substring --
# substring : Verification: OK
# output    : 
# --
# 

This is the last part of the test case for _acme_wildcard(), where it calls a TLS helper method:

# Verify this works for wildcard certs, it should use `*.example.test` for `mail.example.test` (NOT `example.test`):
_should_succesfully_negotiate_tls 'mail.example.test'
# WARNING: This should fail...but requires resolving the above TODO.
# _should_not_have_fqdn_in_cert 'example.test'

which is saying it should match the *.example.test registered in the cert:

"domain": {
"main": "Smallstep Leaf",
"sans": ["*.example.test"]
},

Not the mail.example.test cert used in _acme_rsa() previously, which is valid for both mail.example.test and example.test:

"domain": {
"main": "Smallstep Leaf",
"sans": ["example.test", "mail.example.test"]
},

These methods are only used in the wildcard, to ensure that mail.example.test resolves to the wildcard cert. The previous used cert covers both mail.example.test and example.test FQDN explicitly. AFAIK, *.example.test will be compatible with any subdomain used, but not example.test itself.

Here is the relevant part of the ssl helper function where failure is happening:

# Provide the Root CA cert for successful verification:
CMD_OPENSSL_VERIFY=$(_generate_openssl_cmd "${PORT}" "-CAfile ${CA_CERT}")
run docker exec "${CONTAINER_NAME}" sh -c "${CMD_OPENSSL_VERIFY}"
assert_output --partial 'Verification: OK'

All certs used in the acme.json test cases are using an RSA CA root cert:

@test "ssl(letsencrypt): Traefik 'acme.json' (*.example.test)" {
# This test group changes to certs signed with an RSA Root CA key,
# These certs all support both FQDNs: `mail.example.test` and `example.test`,
# Except for the wildcard cert `*.example.test`, which should not support `example.test`.
# We want to maintain the same FQDN (mail.example.test) between the _acme_ecdsa and _acme_rsa tests.
local LOCAL_BASE_PATH="${PWD}/test/test-files/ssl/example.test/with_ca/rsa"
# Change default Root CA cert used for verifying chain of trust with openssl:
# shellcheck disable=SC2034
local TEST_CA_CERT="${TEST_FILES_CONTAINER_PATH}/ssl/example.test/with_ca/rsa/ca-cert.rsa.pem"

Given the changes in this PR, chances are that undesired third change event is restarting Postfix while trying to make the connection perhaps..


Failure will likely be resolved by preventing the unexpected change event from happening. Related check-for-changes.sh logic needs to be fixed which should prevent this happening.

In the meantime, introducing a reasonable sleep as a workaround would probably work.

Copy link
Copy Markdown
Member

@polarathene polarathene left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stay focused, refrain from extras that can wait

I want to see this PR merged, please stop adding additions not related to the feedback. This is already deviating again from the core focus that resulted in it being extracted to a separate PR.

If you feel comments could be refactored/added, unless specifically about the new log functionality being introduced, hold off on that. Create a branch and pile them up there for you to follow up with.

This PR is a good improvement, but becoming a bit frustrating to review when changes that could be deferred to another PR are thrown in (and then reverted after review when I state not to unless requested by other reviewers, sometimes with residue errors).


Under no circumstances, take the initiative to clean up this PR with a force-pushed interactive-rebase (it's unlikely, but I want to communicate that clearly).

If you feel a need to do such, wait until ready to merge, so that all feedback and discussion is resolved first, then diffing the prior state to the reworked commits won't be as much effort.

Personally, since we squash commits on merge, and this PR history is already not pleasant to wade through for traceability to a future maintainer, I rather we not complicate it further and discourage altering PR history this way.

Comment on lines +187 to +192

# ATTENTION: due to the functions that run before (`_acme_ecdsa` and `_acme_rsa`), the directory
# `/etc/letsencrypt/live/` contains some "old" files that result in the changedetector
# running more than two times, so this test will need to query more log to find the
# correct log messages; this is a bit of a race condition, and could be done better.
# For reference, see https://github.com/docker-mailserver/docker-mailserver/pull/2486
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment needs to be added.

The older cert files are fine, the change detector itself should not be triggering events from them when they're extracted from acme.json change event. That's a bug with check-for-changes.sh not this test.

Any race condition AFAIK is due to changes since this test was written that affect the expected log lines retrieved to no longer be present, or sleep and other logic changes that affect the timing, since the test relies on sleep delay to allow time for a change detection event to process and output the expected logs.

This could be handled better absolutely, but I wouldn't attribute the warning specifically to _acme_wildcard, or even this test as such changes has impacted other tests too.


  • Drop these comments, they aren't correctly identifying the issue or context.

Comment on lines +157 to +160
# due to the third restart (note the comment atop the call of this function down below),
# the restart count may not be predicted accurately with the current method, which makes
# this test "flakey". This is therefore disabled until a better log query is found.
# _should_have_service_restart_count '2'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Disagree.

If the previous test method _should_extract_on_changes passes, this serves to ensure some log lines related to the restart came after the explicit FQDN log and weren't accidentally captured from earlier log history of previous change detected.

If anything, this one should be more reliable, and should be able to leverage the full log to work correctly. I'm not sure if we have a way to do that, but pulling in a large amount of bytes should be sufficient?


  • Re-enable this method and drop the comments you added.
  • If this method is failing, bump up the log content scanned.

Comment on lines 127 to +128
function _acme_ecdsa() {
_should_have_succeeded_at_extraction 'mail.example.test'
_should_have_succeeded_at_extraction 'mail.example.test' 'mailserver'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for the explicit service arg?

Comment on lines 154 to +155
function _acme_wildcard() {
_should_extract_on_changes 'example.test' "${LOCAL_BASE_PATH}/wildcard/rsa.acme.json"
_should_have_service_restart_count '2'
_should_extract_on_changes 'example.test' "${LOCAL_BASE_PATH}/wildcard/rsa.acme.json" '2800'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're adding log bytes to retrieve here as a workaround right?

Is there value drilling it down to this point?

  • Only _acme_rsa() and _acme_wildcard() call this method.
  • _acme_ecdsa() and _acme_rsa() use different methods to check for the same log line of success for mail.example.test, but _acme_rsa() relies on detecting a single restart event to differentiate in the event it accidentally matched earlier log for _acme_ecdsa().

With the restart count safeguard, we should be able to safely raise the retrieved log bytes in _should_extract_on_changes() itself, and not need to support a separate arg just for _acme_wildcard().

There is a concern with _acme_rsa() however.. If you have identified that change detection is occurring due to the cert files only, and not acme.json, then those two log line checks would be insufficient. _should_extract_on_changes() does additionally check for a acme.json change event line, which should be fine for _acme_rsa() but fail on _acme_wildcard() if it's accidentally matching previous output from _acme_rsa() for that line..

Adding a variant of _should_have_service_restart_count but for the acme.json change event log line might be useful to catch that discrepancy. However _should_extract_on_changes() should still fail to match the successful extraction for the wildcard EXPECTED_DOMAIN. Thus the failure would be the log lacking the expected change event (presumably in the past, as this workaround implies).


  • Revert supporting configurable log bytes for _shouold_extract_on_changes(), raise the bytes within the relevant call inside the function instead, applying to _acme_rsa() as well.
  • You can add a comment prior to the call that the explicit value is a workaround for _acme_wildcard() to pass due to a newer change event occurring that shouldn't be.

Comment on lines +228 to +229
local EXPECTED_DOMAIN=${1?}
local SERVICE=${2?}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot seem to find the right keywords to search to understand what this addition is for, and no mention in the commit message or comment it references for details.

Closest I found was ${1:?message} or $? which I understand but neither seem to match what you're doing here?


# Count how many times postfix was restarted by the `changedetector` service:
run docker exec "${TEST_NAME}" sh -c "supervisorctl tail changedetector | grep -c 'postfix: started'"
run docker exec "${TEST_NAME}" sh -c "supervisorctl tail -2000 changedetector | grep -c 'postfix: started'"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
run docker exec "${TEST_NAME}" sh -c "supervisorctl tail -2000 changedetector | grep -c 'postfix: started'"
# 99999 bytes intended as the "full" log. Scanning for matches since container start.
run docker exec "${TEST_NAME}" sh -c "supervisorctl tail -99999 changedetector | grep -c 'postfix: started'"

local SERVICE=${1:-'mailserver'}

local CMD_LOGS=(docker exec "${TEST_NAME}" "supervisorctl tail ${SERVICE}")
local SERVICE=${1?}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reason for dropping the implicit default here?

It's fine for promoting this method out of the test into a test helper for other tests to adopt, but probably not relevant to this PR scope.

Additionally, this method is called 4 times in the test, you only adjusted it for a single call. Please revert unless justified.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

${1?} is very similar to ${:?}, only that with the former, there is no nullity check. With the latter, when the variable is unset or null, Bash will stop and show an error.

Implicit default values can be nice, but I had a very hard time finding out that a function call to another function call that calls this function uses the implicit value - to the point where I thought I had found a bug. I'd refrain from such default values unless it's obvious.


local CMD_LOGS=(docker exec "${TEST_NAME}" "supervisorctl tail ${SERVICE}")
local SERVICE=${1?}
local LOG_TAIL_LENGTH=${2:-1800}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1500 appears to be the default last I noted:

# Note if tests fail asserting against `supervisorctl tail changedetector` output,
# use `supervisorctl tail -<num bytes> changedetector` instead to increase log output.
# Default `<num bytes>` appears to be around 1500.

Is 1800 required for this test due to changes from this PR? If not, better to try stay closer to the apparent default.

Comment on lines 72 to +80
if [[ ${?} -eq 1 ]]
then
_notify 'inf' "$(_log_date) Change detected"
_create_lock # Shared config safety lock
CHANGED=$(grep -Fxvf "${CHKSUM_FILE}" "${CHKSUM_FILE}.new" | sed 's/^[^ ]\+ //')

_changedetector_notify 'info' 'Change detected'
_create_lock # Shared config safety lock

[[ -z ${CHANGED} ]] && _changedetector_notify 'debug' 'Found new files'

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was CHANGED re-located to the top? Is this just housekeeping? There is a lot of changes in this PR, I understand you want a tidy codebase, but please hold off from continuing to add these.

Make such changes and stash them, make note of them, or branch this PR and apply them there to rebase to master as a follow-up PR. I know that's annoying, but it's noise to the review, as well as the commit history. You still need a 2nd review approval for merge and at this rate it'll just end up being a quick glance and placing trust in me alone for being thorough so that it can be merged promptly to avoid causing conflicts for other PRs..


[[ -z ${CHANGED} ]] && _changedetector_notify 'debug' 'Found new files'

This is inaccurate? Not only new files, but modifications.

Isn't that the whole point of a change event being detected? CHANGED is just filtering out the hashes so we have a list of file names. Is this line adding anything of value?

Comment on lines 125 to 141

# regenerate postix aliases
# regenerate postix aliases; this needs `POSTMASTER_ADDRESS` to be set
# and therefore requires to call to `source /etc/dms-settings`
_create_aliases
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment belongs to the source /etc/dms-settings line, not here. A maintainer cares about why we're sourcing the settings file for ENV, the comment is only helpful here if refactoring this portion, which tests should raise failure about ideally.

Provide context to the source /etc/dms-settings line, that it is for POSTMASTER_ADDRESS which _create_aliases depends upon.

@georglauterbach
Copy link
Copy Markdown
Member Author

Stay focused, refrain from extras that can wait

I want to see this PR merged, please stop adding additions not related to the feedback. This is already deviating again from the core focus that resulted in it being extracted to a separate PR.

If you feel comments could be refactored/added, unless specifically about the new log functionality being introduced, hold off on that. Create a branch and pile them up there for you to follow up with.

This PR is a good improvement, but becoming a bit frustrating to review when changes that could be deferred to another PR are thrown in (and then reverted after review when I state not to unless requested by other reviewers, sometimes with residue errors).

Under no circumstances, take the initiative to clean up this PR with a force-pushed interactive-rebase (it's unlikely, but I want to communicate that clearly).

If you feel a need to do such, wait until ready to merge, so that all feedback and discussion is resolved first, then diffing the prior state to the reworked commits won't be as much effort.

Personally, since we squash commits on merge, and this PR history is already not pleasant to wade through for traceability to a future maintainer, I rather we not complicate it further and discourage altering PR history this way.

I'm sorry, I tried to follow your advice and only added what I deemed necessary for tests run. I don't know how to proceed with this PR without annoying anyone even further - I have therefore come to the conclusion that I will close this and provide smaller updates from time to time, which should be easier to review and easier to test.

I am very sorry to have wasted your time, but I think we nevertheless got some insights - and if it's only about ditching supervisor.

@polarathene
Copy link
Copy Markdown
Member

polarathene commented Mar 19, 2022

I don't know how to proceed with this PR without annoying anyone even further - I have therefore come to the conclusion that I will close this and provide smaller updates from time to time, which should be easier to review and easier to test.

It's only really been me providing reviews each morning I woke up. It was getting a bit overwhelming with all the changes going on, I was checking the individual commits for diffs, reading the commit messages and referencing the master version and full diff for certain files like check-for-changes.sh.

I could have communicated a bit better, or delayed review until I got a morning coffee 😅 I know sometimes my communication can come off a bit blunt and harsh, while I don't often comment on all the good changes the PR brought and my appreciation for how responsive you have been to my feedback, there was plenty of that unsaid :)

I am very sorry to have wasted your time, but I think we nevertheless got some insights - and if it's only about ditching supervisor.

I don't think you wasted my time. I chose to review regularly and invest the time that I did to provide as thorough review as I could because I value the changes you've put together here and I think you've invested a fair amount more of your time putting all this together.

I hope you do see it through, in whatever form that is, I'll continue to be happy to review it. Reviews work both ways, I can only be a better reviewer with your feedback when I overstep or get too pushy/fussy.

If you don't have the time to keep working on this, I can understand that too. I'm happy to piecemeal it through several PRs like I've done in the past with another maintainers large PR. It looks like a lot of thought went into it, it'd be a shame to lose it.

@casperklein
Copy link
Copy Markdown
Member

Just adding my 2 cents^^. PRs should be as simple as possible (just doing/introducing/fixing at best only one thing). In question, some should provide multiple PRs instead of one. This makes reviewing so much easier and faster.

There were so much discussion/changes/splitting in this PR, I was not able to keep track at some point. Then my "strategy" changed to wait until this has been calm down and then try to review it 😆

An approach for bigger PRs could also be, to open an issue and discuss the planed changes beforehand (if necessary/worth). I feel sometimes guilty, if someone took a lot of time to prepare something, for starting a general discussion about it. But just because it was a lot of work, doesn't automatically mean it's a good addition or that it can't be done better/simpler.

@georglauterbach georglauterbach mentioned this pull request Mar 19, 2022
8 tasks
@georglauterbach georglauterbach deleted the update/log branch March 19, 2022 16:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/scripts kind/improvement Improve an existing feature, configuration file or the documentation kind/update Update an existing feature, configuration file or the documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants