Background
LoggedUpdateMaintenance is a MediaWiki maintenance script that can be executed only once on each wiki. Generally, a logged update is used for data migrations and similar needs. For example, the Growth team is using it to move mentor away timestamps to Community Configuration (T347152). MediaWiki tracks whether a logged update executed in the updatelog table.
Each logged update tells MediaWiki whether it finished or not (using the return value of LoggedUpdateMaintenance::doDBUpdates(): bool). If a logged update returns true, MediaWiki records in the updatelog table that it finished, and it will not run it again (unless the user forces it using the --force parameter). If it returns false, nothing is recorded, and MediaWiki will rerun the update the next time the script is executed.
Some logged updates support dry-run. In that case, doDBUpdates() typically returns false, as we do want to be able to dry-run it multiple times. More importantly, we want to run the actual update even if a dry run was already executed.
LoggedUpdateMaintenance returns a non-zero exit code if doDBUpdates() returns false. This is because MediaWiki cannot differentiate between "the update decided not to run itself for now" (e.g. the user only wanted a dry run, which is an expected state) and "the update failed to execute" (an unexpected state). If the update deliberately did not want to execute itself, then that is probably a success (as it did what it wanted) from an exit code's perspective. If the update failed, then a non-zero exit code is definitely warranted.
mwscript-k8s can be used to execute a MediaWiki maintenance script across all wikis in a dblist. For example, running mwscript-k8s --dblist="growthexperiments" -- Version.php would execute Version.php on all wikis in growthexperiments.dblist. Internally, the looping happens in a shell script that looks like this (simplified):
for wiki in ${DBLIST}; do
    echo "-----------------------------------------------------------------"
    echo "${wiki}"
    echo "-----------------------------------------------------------------"
    ${RUNNER} ${CMD} ${wiki} "${@}"
done
Unless the FOREACHWIKI_IGNORE_ERRORS environment variable is defined, the shell script runs with set -e (bash's exit-on-error option), meaning it terminates whenever the maintenance script returns a non-zero exit code on any wiki (or whenever any of the other commands it executes fails, of course).
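The early exit can be reproduced locally. What follows is a minimal sketch under stated assumptions, not the real wrapper: fake_runner is a hypothetical stand-in for ${RUNNER} ${CMD} that is hard-coded to fail on afwiki, the wiki list is made up, and the loop runs in a child bash process so that the exit-on-error option (set -e) only terminates the loop.

```shell
#!/bin/bash
# Sketch only: reproduces the silent early exit with a fake runner.
# fake_runner and the wiki list are hypothetical, not part of the real wrapper.

loop_script='
# Enable exit-on-error unless the caller opted out.
if [ -z "${FOREACHWIKI_IGNORE_ERRORS:-}" ]; then
    set -e
fi

# Stand-in for ${RUNNER} ${CMD}: returns 1 on afwiki, 0 elsewhere.
fake_runner() {
    echo "ran on $1"
    [ "$1" != "afwiki" ]
}

for wiki in abwiki acewiki afwiki akwiki; do
    fake_runner "${wiki}"
done
'

# Default: the loop stops after afwiki without any message; akwiki is never reached.
status=0
bash -c "${loop_script}" || status=$?
echo "exit code: ${status}"

# With the variable set, all four wikis are processed and the failure on afwiki is swallowed.
status=0
FOREACHWIKI_IGNORE_ERRORS=1 bash -c "${loop_script}" || status=$?
echo "exit code: ${status}"
```

In the default run, nothing in the output says why the loop ended, which matches the behaviour described in the next section.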
Problem
There are several problems coming out of this that I'll describe in separate subsections.
Being difficult to debug
Determining that the execution stopped because of a non-zero exit code is extremely challenging from a MediaWiki developer's perspective. The output looked like this:
sgimeno@deploy2002:~$ mwscript-k8s -f --comment="T409170" --dblist="growthexperiments" -- GrowthExperiments:migrateMentorStatusAway --dry-run
GrowthExperiments:migrateMentorStatusAway: Running on growthexperiments
-----------------------------------------------------------------
abwiki
-----------------------------------------------------------------
abwiki No mentors found in config, skipping migration.
-----------------------------------------------------------------
acewiki
-----------------------------------------------------------------
acewiki No mentors found in config, skipping migration.
-----------------------------------------------------------------
adywiki
-----------------------------------------------------------------
adywiki No mentors found in config, skipping migration.
-----------------------------------------------------------------
afwiki
-----------------------------------------------------------------
afwiki Nothing new to save, skipping
sgimeno@deploy2002:~$
It looked as if the execution had been interrupted midway through, with no exception, no error message and no explanation. Initially, I assumed that the k8s pod had exhausted its resources and been killed by k8s itself, or that the kernel of the underlying machine had killed the process because it did something illegal. I ended up exec'ing into the container (while it was running) to access the foreachwikiindblist wrapper, and in there, it was quite straightforward to find the set -e that is causing this.
mwscript-k8s should make it clearer what is happening. Even saying "Skipping the rest of the wikis, because the script returned a non-zero exit code for xxwiki" would be wonderful, and it would make the debugging much quicker (as I'd know what to look for inside MediaWiki).
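One way such a message could be produced is to check the exit code explicitly inside the loop rather than relying on the shell's exit-on-error behaviour. This is only a sketch of the idea, not a proposed patch: run_on_wiki is a hypothetical stand-in for ${RUNNER} ${CMD} ${wiki} that pretends to fail on afwiki, and foreach_wiki is an illustrative name.

```shell
#!/bin/bash
# Sketch only: a loop that explains why it stopped.
# run_on_wiki and foreach_wiki are hypothetical names used for illustration.

run_on_wiki() {
    echo "ran on $1"
    [ "$1" != "afwiki" ]   # pretend the script exits non-zero on afwiki
}

foreach_wiki() {
    local wiki status
    for wiki in "$@"; do
        status=0
        run_on_wiki "${wiki}" || status=$?
        if [ "${status}" -ne 0 ]; then
            # Tell the operator which wiki failed before giving up.
            echo "Skipping the rest of the wikis, because the script returned a non-zero exit code (${status}) for ${wiki}" >&2
            return "${status}"
        fi
    done
}

foreach_wiki abwiki acewiki afwiki akwiki || echo "loop exited with $?"
```

The loop still stops at the first failure and still propagates the exit code, so existing callers would see the same behaviour; the only difference is the explanatory line on stderr.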
Even if I do know that the non-zero exit code is the problem, it is impossible to know that setting the FOREACHWIKI_IGNORE_ERRORS environment variable would fix this. The documentation at https://wikitech.wikimedia.org/wiki/Maintenance_scripts does not mention this, and neither does any other Wikitech page.
LoggedUpdateMaintenance treats all non-executions as a failure
LoggedUpdateMaintenance::doDBUpdates() documentation says:
Return true to log the update as done or false (usually on failure).
I interpret this as "return true if you want the update to be logged, and return false otherwise". For the dry-run scenario, I do not want the update to be logged, so returning false seems to be following the documentation. I treat the "usually on failure" part as advisory (saying that I should probably return false on failures, but not saying that returning false outside of a failure is an issue).
LoggedUpdateMaintenance::execute() then passes on the return value of doDBUpdates() as its own. However, the documentation for that function (coming from Maintenance::execute) is much stricter, as it says:
True for success, false for failure. Returning false for failure will cause doMaintenance.php to exit the process with a non-zero exit status.
In other words, even though doDBUpdates() might return false in a success scenario (such as "dry run finished"), LoggedUpdateMaintenance treats that as a failure, although the documented interface of doDBUpdates() does not back that assumption.
Ideally, for expected non-executions, LoggedUpdateMaintenance::execute would return true, resulting in an exit code of zero, resulting in mwscript-k8s iterating through the rest of the wikis, as expected.
Lack of documentation
I'm not sure how stable FOREACHWIKI_IGNORE_ERRORS is, or whether it is something internal the infrastructure has in the background. Regardless of the answer, https://wikitech.wikimedia.org/wiki/Maintenance_scripts should make it clear that:
- mwscript-k8s with --dblist will stop iterating whenever an error is encountered,
- it is possible to ignore errors (if the executor wishes to do so); whether that happens by setting FOREACHWIKI_IGNORE_ERRORS or by a parameter to mwscript-k8s is up for further discussion.