Page MenuHomePhabricator

deployment-prep hourly databases update times out
Closed, ResolvedPublic

Description

The beta-update-databases-eqiad Jenkins job last ran on March 18th 22:20 UTC.

The next build at https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/65915/consoleFull timed out, the end of the build with UTC times (note the script output is buffered and only shown once it has completed).

23:34:42 wikifunctionswiki
23:34:42 #!/usr/bin/env php
23:34:42 Warning: session_name(): Cannot change session name when headers already sent in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 645
23:34:42 MediaWiki 1.41.0-alpha Updater
23:34:42 
23:34:42 Your composer.lock file is up to date with current dependencies!
23:34:42 Going to run database updates for wikifunctionswiki
23:34:42 Depending on the size of your database this may take a while!
23:34:42 ...collations up-to-date.
23:34:42 ...have rev_actor field in revision table.
23:34:42 ...watchlist_expiry table already exists.
23:34:42 ...page_restrictions field does not exist in page table, skipping modify field patch.
...
23:34:42 ...Update 'FixOldLogEntries' already logged as completed. Use --force to run it again.
23:34:42 ...Update 'UpdateVarDumps' already logged as completed. Use --force to run it again.
23:34:42 Purging caches...done.
23:34:42 
00:05:00 Done in 12 min 52 s.Build timed out (after 45 minutes). Marking the build as aborted.

The most recent build:

13:20:00 [beta-update-databases-eqiad] $ /bin/bash -xe /tmp/jenkins11819082651828314578.sh
13:20:00 + export TMPDIR_FS=/srv/home/jenkins-deploy/tmpfs/jenkins-3
13:20:00 + TMPDIR_FS=/srv/home/jenkins-deploy/tmpfs/jenkins-3
13:20:00 + '[' -d /srv/home/jenkins-deploy/tmpfs ']'
13:20:00 + trap global_teardown EXIT
13:20:00 + /usr/local/bin/wmf-beta-update-databases.py
14:05:00 Build timed out (after 45 minutes). Marking the build as aborted.
14:05:00 Build was aborted
14:05:00 IRC notifier plugin: Sending notification to: #wikimedia-releng
14:05:01 Finished: ABORTED

If I try to run it on the deployment machine:

hashar@deployment-deploy03:~$ /bin/bash /usr/local/bin/mwscript update.php --wiki=aawiki --quick --skip-config-validation
#!/usr/bin/env php
Warning: session_name(): Cannot change session name when headers already sent in /srv/mediawiki-staging/wmf-config/CommonSettings.php on line 645

*******************************************************************************
NOTE: Do not run maintenance scripts directly, use maintenance/run.php instead!
      Running scripts directly has been deprecated in MediaWiki 1.40.
      It may not work for some (or any) scripts in the future.
*******************************************************************************

MediaWiki 1.41.0-alpha Updater

Your composer.lock file is up to date with current dependencies!
Going to run database updates for aawiki
Depending on the size of your database this may take a while!
...collations up-to-date.
...have rev_actor field in revision table.
^C

Event Timeline

Mentioned in SAL (#wikimedia-releng) [2023-03-19T14:43:17Z] <hashar> deployment-prep: databases update hourly job times out, filed T332507

Nothing really suspicious in mediawiki/core. But for extensions PoolCounter has been removed by 52fa8a4ccc712566842aafb4275acd0a78d1d05e for T324631

commit 52fa8a4ccc712566842aafb4275acd0a78d1d05e
Author:     Umherirrender <umherirrender_de.wp@web.de>
AuthorDate: Sun Mar 19 00:08:42 2023 +0100
Commit:     Umherirrender <umherirrender_de.wp@web.de>
CommitDate: Sun Mar 19 00:08:42 2023 +0100

    Remove PoolCounter
    
    Bug: T324631
    Change-Id: I4d9b5b5ff2fd17e75628e9d86ca78e75830232a5

Notes (review):
    Verified+2: Umherirrender <umherirrender_de.wp@web.de>
    Code-Review+2: Umherirrender <umherirrender_de.wp@web.de>
    Submitted-by: Umherirrender <umherirrender_de.wp@web.de>
    Submitted-at: Sat, 18 Mar 2023 23:09:45 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/+/900799
    Project: mediawiki/extensions
    Branch: refs/heads/master

Then that is now in mediawiki/core and should have been removed from mediawiki-config a while ago.

deployment-db11

Mar 19 20:10:30 deployment-db11 mysqld[14965]: 2023-03-19 20:10:30 5858284 [Warning] Access denied for user 'wikiadmin'@'172.16.%' to database 'mainstash'
Mar 19 20:10:30 deployment-db11 mysqld[14965]: 2023-03-19 20:10:30 5858285 [Warning] Access denied for user 'wikiadmin'@'172.16.%' to database 'mainstash'
Mar 19 20:20:52 deployment-db11 mysqld[14965]: 2023-03-19 20:20:52 5858688 [Warning] Aborted connection 5858688 to db: 'apiportalwiki' user: 'wikiadmin' host: '172.16.4.233' (Got an error reading communication packets)
Mar 19 20:20:52 deployment-db11 mysqld[14965]: 2023-03-19 20:20:52 5858689 [Warning] Aborted connection 5858689 to db: 'aawiki' user: 'wikiadmin' host: '172.16.4.233' (Got an error reading communication packets)
Mar 19 20:27:52 deployment-db11 mysqld[14965]: 2023-03-19 20:27:52 5858877 [Warning] Aborted connection 5858877 to db: 'apiportalwiki' user: 'wikiadmin' host: '172.16.4.233' (Got an error reading communication packets)
Mar 19 20:27:52 deployment-db11 mysqld[14965]: 2023-03-19 20:27:52 5858878 [Warning] Aborted connection 5858878 to db: 'aawiki' user: 'wikiadmin' host: '172.16.4.233' (Got an error reading communication packets)
Mar 19 20:30:34 deployment-db11 mysqld[14965]: 2023-03-19 20:30:34 5858951 [Warning] Aborted connection 5858951 to db: 'aawiki' user: 'wikiadmin' host: '172.16.4.233' (Got an error reading communication packets)

deployment-db13

● mariadb.service - mariadb database server
     Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Sun 2023-03-19 20:33:55 UTC; 3min 41s ago
    Process: 972707 ExecStart=/opt/wmf-mariadb106/bin/mysqld $MYSQLD_OPTS (code=exited, status=1/FAILURE)
   Main PID: 972707 (code=exited, status=1/FAILURE)
     Status: "MariaDB server is down"
      Tasks: 50 (limit: 19184)
     Memory: 13.8G
        CPU: 757ms
     CGroup: /system.slice/mariadb.service
             └─16119 /opt/wmf-mariadb106/bin/mysqld

Mar 19 20:33:55 deployment-db13 mysqld[972707]: 2023-03-19 20:33:55 0 [ERROR] Plugin 'InnoDB' init function returned error.
Mar 19 20:33:55 deployment-db13 mysqld[972707]: 2023-03-19 20:33:55 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
Mar 19 20:33:55 deployment-db13 mysqld[972707]: 2023-03-19 20:33:55 0 [Note] Plugin 'FEEDBACK' is disabled.
Mar 19 20:33:55 deployment-db13 mysqld[972707]: 2023-03-19 20:33:55 0 [ERROR] Could not open mysql.plugin table: "Unknown storage engine 'Aria'". Some plugins may be not loaded
Mar 19 20:33:55 deployment-db13 mysqld[972707]: 2023-03-19 20:33:55 0 [ERROR] Failed to initialize plugins.
Mar 19 20:33:55 deployment-db13 mysqld[972707]: 2023-03-19 20:33:55 0 [ERROR] Aborting
Mar 19 20:33:55 deployment-db13 systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
Mar 19 20:33:55 deployment-db13 systemd[1]: mariadb.service: Failed with result 'exit-code'.
Mar 19 20:33:55 deployment-db13 systemd[1]: mariadb.service: Unit process 16119 (mysqld) remains running after unit stopped.
Mar 19 20:33:55 deployment-db13 systemd[1]: Failed to start mariadb database server.

and a load of Wikimedia\Rdbms\LoadMonitor::computeServerStates: host deployment-db13 is not replicating? in beta logstash (which oddly has zero logs recorded since Mar 19, 2023 @ 14:29:40)

killing the stuck mysqld process on deployment-db13 and restarting the service has brought mysql back up on db13.. it's now replicating (it's very behind but catching up..)

Mentioned in SAL (#wikimedia-releng) [2023-03-19T22:34:15Z] <TheresNoTime> (beta) samtar@deployment-kafka-logging01:~$ sudo systemctl restart kafka.service, noticed beta-logs wasn't showing logs from ~14:00UTC in T332507

Awesome @TheresNoTime . Does it mean that loosing the replica caused Wikimedia\Rdbms\LoadMonitor::computeServerStates to stale and thus prevented update.php from running? Could it be there is an infinite loop of doom somewhere in the code? ;)

Anyway looks fixed now, I am inclined to mark it resolved ;)

Does it mean that loosing the replica caused Wikimedia\Rdbms\LoadMonitor::computeServerStates to stale and thus prevented update.php from running?

I honestly have no idea — resolving this was a mixture of luck, guessing and turning it off and on again...

I honestly have no idea — resolving this was a mixture of luck, guessing and turning it off and on again...

I fully agree! Thank you for the investigation and the quick fix!