Run mediawiki::maintenance scripts in Beta Cluster
Closed, ResolvedPublic

Description

I was chatting with some people on -discovery about what various different servers do, and I realised that beta has no terbium equivalent... And nothing is running the maintenance scripts automatically. Is this something that should be being tested?

There are a variety of maintenance scripts in mediawiki::maintenance::* in operations/puppet that do not look to be running in the beta cluster. These are enabled in prod by applying the mediawiki::maintenance role to a specific host in manifests/site.pp. I'm not sure the right way to go about applying this. I could add an appropriate node clause to site.pp but we are not using that for any other deployment-prep machines.

This is needed because one of the new features in Discovery rebuilds the autocomplete indices from a cronjob and without it the indices will grow stale.

Event Timeline

greg triaged this task as Medium priority. Mar 17 2016, 1:43 PM

As a result of not having this instance for deployment-prep, wikidata dispatching doesn't actually run regularly on the beta cluster.
Problems with wikidata dispatching have recently blocked the train (T171370, T172394).

@Addshore added mediawiki::maintenance::wikidata to deployment-tin last night: https://tools.wmflabs.org/sal/log/AV3S-ZGGwg13V6285ZLD as a "do the minimum to fix the issue at hand" step.

I think we should still apply all of mediawiki::maintenance to deployment-tin (I guess we don't need a new worker vps?) and hope that any misconfigured (for Beta Cluster) scripts won't break too badly/negatively :)

I think we should still apply all of mediawiki::maintenance to deployment-tin (I guess we don't need a new worker vps?) and hope that any misconfigured (for Beta Cluster) scripts won't break too badly/negatively :)

I think the problem will just be that some beta-specific wikis won't be covered and you'll get a ton of error messages from wikis that don't exist in beta. It won't be perfect without changing the maintenance script, which you'd likely have a lot of problems trying to fix in puppet.git
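
To make the mismatch above concrete: a job that iterates over a production wiki list will hit databases that Beta does not have, and will never visit wikis that only exist in Beta. A minimal sketch, assuming hypothetical wiki lists and hypothetical helper names (nothing here is taken from puppet.git or scap):

```python
from typing import Iterable, List, Set


def limit_to_existing(requested: Iterable[str], existing: Set[str]) -> List[str]:
    """Keep only the wikis that actually exist locally, preserving order."""
    return [wiki for wiki in requested if wiki in existing]


def missing_locally(requested: Iterable[str], existing: Set[str]) -> List[str]:
    """List the wikis that would only produce errors if a job ran on them."""
    return [wiki for wiki in requested if wiki not in existing]


def not_covered(requested: Iterable[str], existing: Set[str]) -> List[str]:
    """List local wikis that the requested list would never visit."""
    wanted = set(requested)
    return sorted(wiki for wiki in existing if wiki not in wanted)


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    production_list = ["enwiki", "dewiki", "wikidatawiki"]
    beta_wikis = {"enwiki", "wikidatawiki", "betaonlywiki"}
    print(limit_to_existing(production_list, beta_wikis))
    print(missing_locally(production_list, beta_wikis))
    print(not_covered(production_list, beta_wikis))
```

Filtering the list before a script runs avoids the error noise, but it does nothing for the Beta-only wikis, which would still need their own list.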

Krinkle renamed this task from Run mediawiki::maintenance scripts? to Run mediawiki::maintenance scripts in Beta Cluster. Aug 18 2017, 6:52 PM

Does this require a non-trivial amount of work? Otherwise we can list on a Wikitech page which scripts should be run regularly, and run them manually when needed. Note that CU and AF no longer store data there, so purge_(checkuser|abusefilter).pp can be left out of the list. Thanks.

I suggest creating a fresh instance (not named after a hostname in prod, but with a generic name) and applying role(mediawiki_maintenance) to it. Then you will see which errors you actually get (or not). The ones that just work you can keep, and the ones that break you can disable in Hiera (the puppet class makes this easy; it already disables all the crons on the inactive maintenance server, currently codfw). I don't think that "keep a list of what should be run manually" is going to work that well.

@Dzahn Thanks for your explanation. I agree with the naming, etc. As for "see which errors you actually get", I'm afraid I wouldn't be able to do so, nor disable things in Hiera, as I am not a project admin for the deployment-prep project. Should we involve Release-Engineering-Team here as the primary maintainers of the site? Thanks.

Also, what about deployment-maintenance with role::mediawiki_maintenance? (sorry if wrong role:: naming, puppet naming is still confusing to me)

Is anyone working on this? If not, I guess this should be expedited to enable us to test running the maintenance scripts on php 7 in production as well, as hhvm is dog slow at running cli scripts and I see this as a priority.

@Joe I don't think anyone is working on this atm. Anyone should feel free to take on this one.

Addshore raised the priority of this task from Medium to High. Aug 30 2018, 8:12 AM

It looks like the fix for running wikidata dispatching is no more: we have new deploy servers for beta, and it looks like the wikidata maintenance role was not carried over to them?

Pinging @Krenair as he created the instances.

Either we should just go ahead and make a maint server for beta now, or let's add the wikidata maint role added in T125976#3520785 back to one of the servers.

Setting to high as we really want dispatching running on beta, as do the WMF media info team.

Change 462019 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: maintenance: skip mediawiki::state function

https://gerrit.wikimedia.org/r/462019

Change 462020 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: maintenance: no openldap management

https://gerrit.wikimedia.org/r/462020

I've got a stretch instance called deployment-mwmaint01 running in beta with role::mediawiki_maintenance. I made a couple patches to make this happen: one because we don't have conftool in beta and another because we don't have the ldap-admins group in beta (and openldap::maintenance isn't probably needed on this machine).

Note that not all scripts running in production can be run on beta (for example, without the CheckUser extension on Beta, the script will either fail or be pointless to run). I suggest we be allowed to choose which scripts to run, and with which parameters, if that's not going to cause undue complications. That said, Puppet is a strange and very complicated land for me to understand, so apologies if this is nonsensical. Regards.

Currently there is just one general switch to either enable all crons (scripts) or disable them all, and it's based on what is the active DC. It would be possible to have a separate switch for each script but that would be quite some overhead.

It seems like adding the missing extension on Beta would be the better solution.

and openldap::maintenance isn't probably needed on this machine

It should probably be moved to a different machine in prod too but that's a matter for a different ticket I suppose.

It seems like adding the missing extension on Beta would be the better solution.

I think it actually used to be there but @hashar got rid of it over 6 years ago for undefined reasons (https://gerrit.wikimedia.org/r/9796 - the exact code that does it got moved around later) - comment there refers to @Reedy? I wonder if we should put a commit up for review to re-enable it.

CheckUser stores PI in the form of (at least) IP addresses. And as basically anyone can get an account, anyone can look at the database and see the information.

As there's no way to neuter CheckUser to not store this, easiest answer was to just undeploy it

If someone wants to add some config to it so it doesn't always store that information.. Maybe we can redeploy it.. But feels kinda hacky

Anyone that can look at the DB (anyone) can deploy code that does the same thing.

Sure, but it's more effort to do so. Plus, with then storing it somewhere, the chances of it not being noticed by someone else are slim...

Maybe it's worth a discussion with legal about it, and see how they view it

It doesn't particularly matter how much effort it takes, it is possible.

It doesn't particularly matter how much effort it takes, it is possible.

It's a cost-benefit analysis. Which is easier/quicker/whatever? Patching out the core functionality of the extension in PHP? Or patching puppet to add a config flag for whether to enable the cronjob for checkuser...

That being said... We have a hook for the cu_changes table

		Hooks::run( 'CheckUserInsertForRecentChange', [ $rc, &$rcRow ] );

Use that, override the sensitive columns to '' in CommonSettings-labs.php.. Seems more sensible than a config variable to make CheckUser stop doing what it's basically supposed to do which has limited usage elsewhere....

Not sure if we need to bother about cu_log
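
The idea of blanking the sensitive columns from a hook can be modelled in a few lines. This is only a language-neutral sketch in Python, not a PHP handler for CheckUserInsertForRecentChange; the column names are assumptions and would need checking against the actual cu_changes schema:

```python
from typing import Dict, Iterable

# Assumed column names; verify against the real cu_changes schema before relying on them.
SENSITIVE_COLUMNS = ("cuc_ip", "cuc_ip_hex", "cuc_xff", "cuc_xff_hex", "cuc_agent")


def blank_sensitive(row: Dict[str, object],
                    columns: Iterable[str] = SENSITIVE_COLUMNS) -> None:
    """Overwrite sensitive fields in place, the way a by-reference hook handler would."""
    for column in columns:
        if column in row:
            row[column] = ""


if __name__ == "__main__":
    # Hypothetical row, for illustration only.
    row = {"cuc_title": "Main_Page", "cuc_ip": "192.0.2.1", "cuc_agent": "ExampleAgent/1.0"}
    blank_sensitive(row)
    print(row)
```

The row is mutated rather than copied because the hook passes $rcRow by reference, so whatever the handler leaves in it is what gets stored.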

On the other hand, if purge_checkuser detects CheckUser is not installed it will just print that the CheckUser extension is not installed and will move along. It's just a bit of logspam instead of potential privacy issues.

Make it do a file existence && run script

We can try, but this is puppet.git, and we may just get a CR-2.
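
For reference, the "file existence && run script" guard amounts to something like the following. A minimal sketch with a hypothetical helper name; the marker path and the command are whatever the job definition would supply, and none of this is existing puppet.git code:

```python
import subprocess
import sys
from pathlib import Path
from typing import Optional, Sequence


def run_if_present(marker: Path, command: Sequence[str]) -> Optional[int]:
    """Run command only when marker exists.

    Returns the command's exit code, or None when the run was skipped.
    """
    if not marker.exists():
        return None
    return subprocess.run(command).returncode


if __name__ == "__main__":
    # Hypothetical marker path, for illustration only.
    result = run_if_present(Path("/nonexistent/extension/dir"),
                            [sys.executable, "-c", "print('purging')"])
    print("skipped" if result is None else f"exit code {result}")
```

Returning None for a skipped run keeps "did not run" distinguishable from "ran and exited 0", which matters if anything monitors the job.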

Change 476980 had a related patch set uploaded (by Thcipriani; owner: Thcipriani):
[operations/puppet@production] Beta: add mwmaint01 to mediawiki-installation

https://gerrit.wikimedia.org/r/476980

Change 476980 merged by Dzahn:
[operations/puppet@production] Beta: add mwmaint01 to mediawiki-installation

https://gerrit.wikimedia.org/r/476980

@Dzahn With the patch merged above, I assume that we have now a deployment-mwmaint01 server where to run maintenance scripts. But I assume that maintenance scripts that run in production are not yet running on beta automatically, right?

@MarcoAurelio The patch means more specifically just that a host deployment-mwmaint01.deployment-prep.eqiad.wmflabs is receiving mediawiki deployments when/if scap is running in deployment-prep.

But yea, on https://tools.wmflabs.org/openstack-browser/server/deployment-mwmaint01.deployment-prep.eqiad.wmflabs we can see that host exists and is active.

And it also tells us under Puppet classes that it is using "role::mediawiki_maintenance" among other things. (btw, i want to rename that to mediawiki::maintenance to follow the other mediawiki:: structure -> https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/479131/).

The role includes ::profile::mediawiki::maintenance, and inside that there is this code:

    $ensure = mediawiki::state('primary_dc') ? {
        $::site => 'present',
        default => 'absent',
    }

This $ensure value is then used with all the cron jobs to decide if they should be running or not.

    # Mediawiki maintenance scripts (cron jobs)
    class { 'mediawiki::maintenance::pagetriage': ensure => $ensure }
    class { 'mediawiki::maintenance::translationnotifications': ensure => $ensure }
    class { 'mediawiki::maintenance::updatetranslationstats': ensure => $ensure }
...

So for production it makes sense, since automatically the crons are either stopped or running based on what the current active_dc is.

So the question is really "what is mediawiki::state('primary_dc') in deployment-prep?".

A separate question is whether each individual cron can run in deployment-prep or not, and i don't know the answer. The way the code is written so far means that we can only have all of them running or none.
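
To spell out the all-or-none behaviour: the selector above boils down to one comparison, and a per-script switch would be an extra lookup on top of it. A rough sketch in Python, where `ensure_for_job` and its `overrides` mapping are hypothetical and do not exist in the Puppet code:

```python
from typing import Dict, Optional


def ensure_for(site: str, primary_dc: str) -> str:
    """Mirror the selector: 'present' only on the primary datacenter."""
    return "present" if site == primary_dc else "absent"


def ensure_for_job(job: str, site: str, primary_dc: str,
                   overrides: Optional[Dict[str, str]] = None) -> str:
    """Hypothetical per-job switch layered on top of the global one."""
    if overrides and job in overrides:
        return overrides[job]
    return ensure_for(site, primary_dc)


if __name__ == "__main__":
    print(ensure_for("eqiad", "eqiad"))
    print(ensure_for("codfw", "eqiad"))
    # Disable a single job while the rest follow the global switch.
    print(ensure_for_job("purge_checkuser", "eqiad", "eqiad",
                         {"purge_checkuser": "absent"}))
```

With the current code every job gets the result of `ensure_for`; the override lookup is the "quite some overhead" part, since each job class would need its own setting.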

Change 462019 merged by Ladsgroup:

[operations/puppet@production] Beta: maintenance: skip mediawiki::state function

https://gerrit.wikimedia.org/r/462019

Change 462020 merged by Ladsgroup:

[operations/puppet@production] Beta: maintenance: no openldap management

https://gerrit.wikimedia.org/r/462020

Re-applied hack from T277206#7015609:

root@deployment-puppetserver-1:/srv/git/operations/puppet# git show HEAD
commit 2a8c216dee9eb0dbf48b71509e7018cb2b670458 (HEAD -> production)
Author: root <root@deployment-puppetserver-1.deployment-prep.eqiad1.wikimedia.cloud>
Date:   Mon Jul 29 22:17:56 2024 +0000

    [LOCAL HACK] Hack mw-cli-wrapper to work without conftool

    'I don't like this, but broken things aren't exactly fun either.' --taavi, April 2021

    Bug: T370792
    Bug: T125976

diff --git a/modules/profile/files/mediawiki/maintenance/mw-cli-wrapper.py b/modules/profile/files/mediawiki/maintenance/mw-cli-wrapper.py
index 98474c34361..28f7373af25 100755
--- a/modules/profile/files/mediawiki/maintenance/mw-cli-wrapper.py
+++ b/modules/profile/files/mediawiki/maintenance/mw-cli-wrapper.py
@@ -25,6 +25,7 @@ from shlex import quote
 import yaml

 CONFD_FILE = Path("/etc/conftool-state/mediawiki.yaml")
+"""
 # First check if the confd file is stale or not. If it is, just exit
 try:
     subprocess.run(
@@ -35,6 +36,7 @@ try:
 except subprocess.CalledProcessError:
     print("Skipping execution: the mediawiki state file is stale.")
     sys.exit(1)
+"""

 state = yaml.safe_load(CONFD_FILE.read_text())
 primary_dc = state["primary_dc"]
root@deployment-puppetserver-1:/srv/git/operations/puppet#

in the interest of resolving T370792: refreshLinkRecommendation script fails in Beta cluster with FileNotFoundError.
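
A less invasive shape for the same workaround would be to make the staleness check optional instead of commenting it out. This is only a sketch under assumptions: `state_is_fresh` and its `check_cmd` parameter are hypothetical, and the real check command is not shown in the diff above, so it is left to the caller here:

```python
import subprocess
import sys
from typing import Optional, Sequence


def state_is_fresh(check_cmd: Optional[Sequence[str]]) -> bool:
    """Decide whether the state file may be trusted.

    check_cmd is a hypothetical freshness checker. None means the environment
    has no such checker (as in Beta without conftool), so the check is skipped.
    """
    if check_cmd is None:
        return True
    try:
        subprocess.run(check_cmd, check=True)
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Non-zero exit, or the checker binary is missing entirely.
        return False
    return True


if __name__ == "__main__":
    if not state_is_fresh(None):
        print("Skipping execution: the mediawiki state file is stale.")
        sys.exit(1)
    print("State accepted; the wrapper would continue here.")
```

Passing None explicitly keeps the production behaviour unchanged while letting Beta opt out, rather than carrying a local patch on the puppetserver.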

Krinkle closed this task as Resolved. Jul 15 2025, 6:37 PM
Krinkle claimed this task.

This appears to be working now, and seemingly has been for a while.

The deployment-mwmaint01 host was added back in 2018 to match production. As of 2025, this has been upgraded a few times and we now have deployment-mwmaint03.

The host in question runs PHP 8.1, receives latest MediaWiki code, and has the correct profile to run all production maintenance scripts as cron jobs (systemd timers).

During the kubernetes migration in production, the Puppet code was conditionalised such that setting kubernetes => true on profile::mediawiki::periodic_job (example) removes the job from the production mwmaint hosts (since mwmaint is being sunset there), but keeps it in the Beta Cluster (source).

Based on the wikitech:Maintenance server#Runbook I've confirmed that timers are active, and that the scripts generally succeed without issues (ref T289318).

krinkle@deployment-mwmaint03:~$ systemctl list-timers 'mediawiki*'
Tue 2025-07-15 18:36:00 UTC 27s left              Tue 2025-07-15 18:35:00 UTC 32s ago            mediawiki_job_db_lag_stats_reporter.timer                                              mediawiki_job_db_>
Tue 2025-07-15 18:39:00 UTC 3min 27s left         Tue 2025-07-15 17:39:00 UTC 56min ago          mediawiki_job_wikidata_resubmit_changes_for_dispatch.timer                             mediawiki_job_wik>
Tue 2025-07-15 18:55:00 UTC 19min left            Tue 2025-07-15 17:55:00 UTC 40min ago          mediawiki_job_centralauth-backfillLocalAccounts.php-metawiki.timer                     mediawiki_job_cen>
…
krinkle@deployment-mwmaint03:~$ sudo journalctl -u mediawiki_job_startupregistrystats -n1000
-- Journal begins at Sat 2025-07-12 17:47:17 UTC, ends at Tue 2025-07-15 18:36:20 UTC. --
…
Jul 15 18:00:48 deployment-mwmaint03 mediawiki_job_startupregistrystats[3935841]: zhwiki:  | ext.visualEditor.editCheck.experimental            |    3,076 B |   12,033 B
Jul 15 18:00:48 deployment-mwmaint03 mediawiki_job_startupregistrystats[3935841]: zhwiki:  | ext.guidedTour.tour.firsteditve                    |    1,317 B |    3,205 B
…
Jul 15 18:00:48 deployment-mwmaint03 mediawiki_job_startupregistrystats[3935841]: zhwiki:
Jul 15 18:00:48 deployment-mwmaint03 mediawiki_job_startupregistrystats[3935841]: zhwiki:
Jul 15 18:00:48 deployment-mwmaint03 mediawiki_job_startupregistrystats[3935841]: zhwiki:  Sending stats...
Jul 15 18:00:48 deployment-mwmaint03 mediawiki_job_startupregistrystats[3935841]: zhwiki:  Done!
Jul 15 18:00:48 deployment-mwmaint03 systemd[1]: mediawiki_job_startupregistrystats.service: Succeeded.
Jul 15 18:00:48 deployment-mwmaint03 systemd[1]: Finished MediaWiki periodic job startupregistrystats.
Jul 15 18:35:00 deployment-mwmaint03 systemd[1]: Starting MediaWiki periodic job startupregistrystats...
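
If someone wants to repeat this check without reading the journal by eye, the last outcome of a unit can be pulled from the systemd lines. A minimal sketch with a hypothetical helper; it matches the "Succeeded." wording seen in the output above, and the "Failed with result" wording is an assumption about what this systemd version prints on failure:

```python
import re
from typing import Iterable, Optional


def last_result(lines: Iterable[str], unit: str) -> Optional[str]:
    """Return 'succeeded' or 'failed' for the last finished run, or None if none is logged."""
    pattern = re.compile(
        re.escape(unit) + r"\.service: (Succeeded\.|Failed with result)"
    )
    result = None
    for line in lines:
        match = pattern.search(line)
        if match:
            result = "succeeded" if match.group(1) == "Succeeded." else "failed"
    return result


if __name__ == "__main__":
    journal = [
        "Jul 15 18:00:48 deployment-mwmaint03 systemd[1]: "
        "mediawiki_job_startupregistrystats.service: Succeeded.",
        "Jul 15 18:35:00 deployment-mwmaint03 systemd[1]: "
        "Starting MediaWiki periodic job startupregistrystats...",
    ]
    print(last_result(journal, "mediawiki_job_startupregistrystats"))
```

The lines would come from journalctl output such as the above; a run that has started but not yet finished does not change the reported result.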

Change #941479 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/puppet@production] scap: Limit foreachwikiindblist and expanddblist to beta wikis in beta

https://gerrit.wikimedia.org/r/941479

Change #1058199 had a related patch set uploaded (by Krinkle; author: Urbanecm):

[operations/puppet@production] [LOCAL HACK] Hack mw-cli-wrapper to work without conftool

https://gerrit.wikimedia.org/r/1058199