I was chatting with some people on -discovery about what various different servers do, and I realised that beta has no terbium equivalent... And nothing is running the maintenance scripts automatically. Is this something that should be being tested?
Description
Details
| Status | Assigned | Task |
|---|---|---|
| Open | None | T53494 Use Beta cluster as a true canary for code deployments (epic) |
| Stalled | None | T53497 Setup monitoring for Beta Cluster (tracking) |
| Invalid | None | T128357 Beta cluster job queue is unmonitored / potentially not running all jobs |
| Resolved | Krinkle | T125976 Run mediawiki::maintenance scripts in Beta Cluster |
Event Timeline
@Addshore added mediawiki::maintenance::wikidata to deployment-tin last night: https://tools.wmflabs.org/sal/log/AV3S-ZGGwg13V6285ZLD as a "do the minimum to fix the issue at hand" step.
I think we should still apply all of mediawiki::maintenance to deployment-tin (I guess we don't need a new worker VPS?) and hope that any scripts misconfigured for Beta Cluster won't break too badly :)
I think the problem will just be that some beta-specific wikis won't be covered and you'll get a ton of error messages from wikis that don't exist in beta. It won't be perfect without changing the maintenance script, which you'd likely have a lot of problems trying to fix in puppet.git
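To make the "wikis that don't exist in beta" problem concrete, here is a minimal sketch of filtering a list of wiki names down to the ones that actually exist. It assumes a plain dblist-style input (one database name per line, `#` starting a comment); `read_dblist` and `limit_to_existing` are hypothetical helper names, not anything from puppet.git or scap.

```python
from typing import Iterable, List, Set


def read_dblist(lines: Iterable[str]) -> List[str]:
    """Parse dblist-style lines: one wiki name per line, '#' starts a comment.

    Assumption: a plain list with no expressions or includes.
    """
    names = []
    for line in lines:
        name = line.split("#", 1)[0].strip()
        if name:
            names.append(name)
    return names


def limit_to_existing(requested: Iterable[str], existing: Set[str]) -> List[str]:
    """Keep only the requested wikis that exist in this environment, in order."""
    return [wiki for wiki in requested if wiki in existing]


if __name__ == "__main__":
    # Hypothetical example data, not real dblist contents.
    production_list = read_dblist(["# production wikis", "enwiki", "zuwiki", "dewiki"])
    beta_wikis = {"enwiki", "dewiki"}
    print(limit_to_existing(production_list, beta_wikis))
```

A wrapper that loops over wikis could apply a filter like this before invoking each script, which would avoid the error spam without touching the individual maintenance classes.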
Does this require a non-trivial amount of work? Otherwise we can list on a Wikitech page which scripts should be run regularly, and run them manually when needed. Note that CU and AF no longer store data there, so purge_(checkuser|abusefilter).pp can be left off the list. Thanks.
I suggest creating a fresh instance (one that is not named after a hostname in prod but has a generic name) and applying role(mediawiki_maintenance) to it. Then you will see which errors you actually get (or not). The ones that just work you can keep, and the ones that break you can disable in Hiera (the puppet class makes this easy; it already disables all the crons on the inactive maintenance server, currently codfw). I don't think that "keep a list of what should be run manually" is going to work that well.
@Dzahn Thanks for your explanation. I agree with the naming, etc. As for "see which errors you actually get", I'm afraid I won't be able to do so, nor disable things in Hiera, as I am not a project admin for the deployment-prep project. Should we involve Release-Engineering-Team here as the primary maintainers of the site? Thanks.
Also, what about deployment-maintenance with role::mediawiki_maintenance? (sorry if wrong role:: naming, puppet naming is still confusing to me)
Is anyone working on this? If not, I guess this should be expedited to enable us to test running the maintenance scripts on php 7 in production as well, as hhvm is dog slow at running cli scripts and I see this as a priority.
@Joe I don't think anyone is working on this atm. Anyone should feel free to take on this one.
It looks like the fix for running wikidata dispatching is no more: we have new deploy servers for beta, and it looks like the wikidata maintenance role was not carried over to them?
Pinging @Krenair as he created the instances.
Either we should just go ahead and make a maint server for beta now, or let's add the wikidata maint role from T125976#3520785 back to one of the servers.
Setting to high as we really want dispatching running on beta, as do the WMF media info team.
Change 462019 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: maintenance: skip mediawiki::state function
Change 462020 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: maintenance: no openldap management
I've got a stretch instance called deployment-mwmaint01 running in beta with role::mediawiki_maintenance. I made a couple of patches to make this happen: one because we don't have conftool in beta, and another because we don't have the ldap-admins group in beta (and openldap::maintenance probably isn't needed on this machine).
Note that not all scripts that run in production can be run on beta (for example, without the CheckUser extension on Beta, its purge script will either fail or be pointless to run). I suggest we be allowed to choose which scripts to run, and with which parameters, if that's not going to cause undue complications. That said, Puppet is a strange and very complicated land to me, so apologies if this is nonsensical. Regards.
Currently there is just one general switch to either enable all crons (scripts) or disable them all, and it's based on what is the active DC. It would be possible to have a separate switch for each script but that would be quite some overhead.
It seems like adding the missing extension on Beta would be the better solution.
It should probably be moved to a different machine in prod too but that's a matter for a different ticket I suppose.
I think it actually used to be there but @hashar got rid of it over 6 years ago for undefined reasons (https://gerrit.wikimedia.org/r/9796 - the exact code that does it got moved around later) - comment there refers to @Reedy? I wonder if we should put a commit up for review to re-enable it.
CheckUser stores PI in the form of (at least) IP addresses. And as basically anyone can get an account, anyone can look at the database and see the information.
As there's no way to neuter CheckUser to not store this, easiest answer was to just undeploy it
If someone wants to add some config to it so it doesn't always store that information.. Maybe we can redeploy it.. But feels kinda hacky
Sure, but it's more effort to do so. Plus then storing it somewhere, the chances of it not being noticed by someone else are slim...
Maybe it's worth a discussion with legal about it, and see how they view it
It's a cost-benefit analysis. Which is easier/quicker/whatever? Patching out the core functionality of the extension in PHP? Or patching puppet to add a config flag for whether to enable the cronjob for checkuser...
That being said... We have a hook for the cu_changes table
Hooks::run( 'CheckUserInsertForRecentChange', [ $rc, &$rcRow ] );
Use that, override the sensitive columns to '' in CommonSettings-labs.php.. Seems more sensible than a config variable to make CheckUser stop doing what it's basically supposed to do which has limited usage elsewhere....
Not sure if we need to bother about cu_log
On the other hand, if purge_checkuser detects CheckUser is not installed it will just print that the CheckUser extension is not installed and will move along. It's just a bit of logspam instead of potential privacy issues.
Change 476980 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: add mwmaint01 to mediawiki-installation
Change 476980 merged by Dzahn:
[operations/puppet@production] Beta: add mwmaint01 to mediawiki-installation
@Dzahn With the patch merged above, I assume that we have now a deployment-mwmaint01 server where to run maintenance scripts. But I assume that maintenance scripts that run in production are not yet running on beta automatically, right?
@MarcoAurelio The patch means more specifically just that a host deployment-mwmaint01.deployment-prep.eqiad.wmflabs is receiving mediawiki deployments when/if scap is running in deployment-prep.
But yea, on https://tools.wmflabs.org/openstack-browser/server/deployment-mwmaint01.deployment-prep.eqiad.wmflabs we can see that host exists and is active.
And it also tells us under Puppet classes that it is using "role::mediawiki_maintenance" among other things. (btw, i want to rename that to mediawiki::maintenance to follow the other mediawiki:: structure -> https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/479131/).
The role includes `::profile::mediawiki::maintenance`, and inside that there is this code:
    $ensure = mediawiki::state('primary_dc') ? {
        $::site => 'present',
        default => 'absent',
    }
This $ensure value is then used with all the cron jobs to decide if they should be running or not.
    # Mediawiki maintenance scripts (cron jobs)
    class { 'mediawiki::maintenance::pagetriage': ensure => $ensure }
    class { 'mediawiki::maintenance::translationnotifications': ensure => $ensure }
    class { 'mediawiki::maintenance::updatetranslationstats': ensure => $ensure }
    ...
So for production it makes sense, since the crons are automatically either stopped or running based on what the current active_dc is.
So the question is really "what is mediawiki::state('primary_dc') in deployment-prep?".
A separate question is whether each individual cron could also run in deployment-prep, and I don't know the answer. The way the code is written so far means that we can only have all or none running.
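For illustration only, the all-or-none decision and a possible per-job switch can be modelled like this. It is a rough Python sketch, not Puppet and not the actual profile code; `job_ensure` and the `overrides` mapping are hypothetical names standing in for what a per-job Hiera setting might look like.

```python
from typing import Mapping, Optional


def job_ensure(
    job: str,
    site: str,
    primary_dc: str,
    overrides: Optional[Mapping[str, bool]] = None,
) -> str:
    """Return 'present' or 'absent' for one maintenance job.

    Default behaviour mirrors the selector quoted above: every job is
    'present' only when this site is the primary DC. The optional
    overrides mapping (hypothetical) lets a single job be forced on or off.
    """
    overrides = overrides or {}
    if job in overrides:
        return "present" if overrides[job] else "absent"
    return "present" if site == primary_dc else "absent"


if __name__ == "__main__":
    # Hypothetical beta setup: everything on except the CheckUser purge.
    beta_overrides = {"purge_checkuser": False}
    for job in ("pagetriage", "purge_checkuser"):
        print(job, job_ensure(job, "eqiad", "eqiad", beta_overrides))
```

With no overrides this behaves exactly like the single switch; the extra cost is maintaining one key per job that deviates from the default.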
Change 462019 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: maintenance: skip mediawiki::state function
Change 462020 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: maintenance: no openldap management
Change 462019 merged by Ladsgroup:
[operations/puppet@production] Beta: maintenance: skip mediawiki::state function
Change 462020 merged by Ladsgroup:
[operations/puppet@production] Beta: maintenance: no openldap management
Re-applied hack from T277206#7015609:
root@deployment-puppetserver-1:/srv/git/operations/puppet# git show HEAD
commit 2a8c216dee9eb0dbf48b71509e7018cb2b670458 (HEAD -> production)
Author: root <root@deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud>
Date: Mon Jul 29 22:17:56 2024 +0000
[LOCAL HACK] Hack mw-cli-wrapper to work without conftool
'I don't like this, but broken things aren't exactly fun either.' --taavi, April 2021
Bug: T370792
Bug: T125976
diff --git a/modules/profile/files/mediawiki/maintenance/mw-cli-wrapper.py b/modules/profile/files/mediawiki/maintenance/mw-cli-wrapper.py
index 98474c34361..28f7373af25 100755
--- a/modules/profile/files/mediawiki/maintenance/mw-cli-wrapper.py
+++ b/modules/profile/files/mediawiki/maintenance/mw-cli-wrapper.py
@@ -25,6 +25,7 @@ from shlex import quote
 import yaml
 
 CONFD_FILE = Path("/etc/conftool-state/mediawiki.yaml")
+"""
 # First check if the confd file is stale or not. If it is, just exit
 try:
     subprocess.run(
@@ -35,6 +36,7 @@ try:
 except subprocess.CalledProcessError:
     print("Skipping execution: the mediawiki state file is stale.")
     sys.exit(1)
+"""
 
 state = yaml.safe_load(CONFD_FILE.read_text())
 primary_dc = state["primary_dc"]
root@deployment-puppetserver-1:/srv/git/operations/puppet#
in the interest of resolving T370792: refreshLinkRecommendation script fails in Beta cluster with FileNotFoundError.
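As a possible alternative to commenting the block out, the staleness check could be made optional. This is a minimal sketch under assumptions: `state_is_usable` and its `stale_check` parameter are hypothetical and not part of mw-cli-wrapper.py, and the real check command is elided in the diff above, so it is passed in rather than guessed.

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional, Sequence


def state_is_usable(state_file: Path, stale_check: Optional[Sequence[str]] = None) -> bool:
    """Decide whether the state file may be used.

    stale_check is the command that verifies freshness (hypothetical
    parameter). Passing None skips the check, which is what an
    environment without conftool would want.
    """
    if not state_file.is_file():
        return False
    if stale_check is None:
        return True
    try:
        subprocess.run(list(stale_check), check=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Non-zero exit means stale; a missing binary is treated the same way.
        return False
    return True


if __name__ == "__main__":
    path = Path("/etc/conftool-state/mediawiki.yaml")
    if not state_is_usable(path):
        print("Skipping execution: the mediawiki state file is missing or stale.")
        sys.exit(1)
```

Whether to skip the check would then be a matter of configuration rather than a local patch that has to be re-applied after every rebase.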
@Urbanecm_WMF Do you mind uploading it to Gerrit under https://gerrit.wikimedia.org/r/q/hashtag:beta-cherry-picked+is:open ?
This appears to be working now, and seemingly has been for a while.
The deployment-mwmaint01 host was added back in 2018 to match production. As of 2025, this has been upgraded a few times and we now have deployment-mwmaint03.
The host in question runs PHP 8.1, receives latest MediaWiki code, and has the correct profile to run all production maintenance scripts as cron jobs (systemd timers).
During the Kubernetes migration in production, the Puppet code was conditionalised such that setting kubernetes => true on profile::mediawiki::periodic_job (example) removes the job from production (since mwmaint is being sunset there), but keeps it in Beta Cluster (source).
Based on the wikitech:Maintenance server#Runbook I've confirmed that timers are active, and that the scripts generally succeed without issues (ref T289318).
krinkle@deployment-mwmaint03:~$ systemctl list-timers 'mediawiki*'
Tue 2025-07-15 18:36:00 UTC 27s left Tue 2025-07-15 18:35:00 UTC 32s ago mediawiki_job_db_lag_stats_reporter.timer mediawiki_job_db_>
Tue 2025-07-15 18:39:00 UTC 3min 27s left Tue 2025-07-15 17:39:00 UTC 56min ago mediawiki_job_wikidata_resubmit_changes_for_dispatch.timer mediawiki_job_wik>
Tue 2025-07-15 18:55:00 UTC 19min left Tue 2025-07-15 17:55:00 UTC 40min ago mediawiki_job_centralauth-backfillLocalAccounts.php-metawiki.timer mediawiki_job_cen>
…
krinkle@deployment-mwmaint03:~$ sudo journalctl -u mediawiki_job_startupregistrystats -n1000
-- Journal begins at Sat 2025-07-12 17:47:17 UTC, ends at Tue 2025-07-15 18:36:20 UTC. --
…
Jul 15 18:00:48 deployment-mwmaint03 mediawiki_job_startupregistrystats[3935841]: zhwiki: | ext.visualEditor.editCheck.experimental | 3,076 B | 12,033 B
Jul 15 18:00:48 deployment-mwmaint03 mediawiki_job_startupregistrystats[3935841]: zhwiki: | ext.guidedTour.tour.firsteditve | 1,317 B | 3,205 B
…
Jul 15 18:00:48 deployment-mwmaint03 mediawiki_job_startupregistrystats[3935841]: zhwiki:
Jul 15 18:00:48 deployment-mwmaint03 mediawiki_job_startupregistrystats[3935841]: zhwiki:
Jul 15 18:00:48 deployment-mwmaint03 mediawiki_job_startupregistrystats[3935841]: zhwiki: Sending stats...
Jul 15 18:00:48 deployment-mwmaint03 mediawiki_job_startupregistrystats[3935841]: zhwiki: Done!
Jul 15 18:00:48 deployment-mwmaint03 systemd[1]: mediawiki_job_startupregistrystats.service: Succeeded.
Jul 15 18:00:48 deployment-mwmaint03 systemd[1]: Finished MediaWiki periodic job startupregistrystats.
Jul 15 18:35:00 deployment-mwmaint03 systemd[1]: Starting MediaWiki periodic job startupregistrystats...
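If someone wants to repeat this check across many jobs rather than eyeballing the journal, a small parser over journalctl text output could do it. This is a sketch only: `last_results` is a hypothetical helper, and it assumes the systemd result lines look like the "Succeeded." line above (or the "Deactivated successfully." and "Failed with result '...'" forms that systemd also emits).

```python
import re
from typing import Dict

# Matches the per-unit result line that systemd (PID 1) writes to the journal.
_RESULT = re.compile(
    r"systemd\[1\]: (?P<unit>\S+)\.service: "
    r"(?P<result>Succeeded|Deactivated successfully|Failed with result '[^']*')\."
)


def last_results(journal_text: str) -> Dict[str, str]:
    """Map each unit name to the most recent result found in the text."""
    results: Dict[str, str] = {}
    for match in _RESULT.finditer(journal_text):
        # Later lines overwrite earlier ones, so the last run wins.
        results[match.group("unit")] = match.group("result")
    return results


if __name__ == "__main__":
    sample = (
        "Jul 15 18:00:48 deployment-mwmaint03 systemd[1]: "
        "mediawiki_job_startupregistrystats.service: Succeeded.\n"
    )
    print(last_results(sample))
```

Feeding it the output of `sudo journalctl -u 'mediawiki_job_*'` would give a quick per-job summary; jobs that have not finished a run in the captured window simply do not appear.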
Change #941479 had a related patch set uploaded (by Krinkle; author: Krinkle):
[operations/puppet@production] scap: Limit foreachwikiindblist and expanddblist to beta wikis in beta
Change #1058199 had a related patch set uploaded (by Krinkle; author: Urbanecm):
[operations/puppet@production] [LOCAL HACK] Hack mw-cli-wrapper to work without conftool