
scap no longer restarts php-fpm on canary servers
Closed, Resolved | Public | 2 Estimated Story Points | BUG REPORT

Description

Noted by @Krinkle just now in #wikimedia-releng:

14:56 <Krinkle> brennen: how surprising is it to see wmf.19 code running today?
14:57 <brennen> er...
14:57 <Krinkle> apparently 5% of today's traffic is running wmf.19 according to https://performance.wikimedia.org/arclamp/svgs/daily/2022-07-25.excimer-wall.index.svgz?x=10.0&y=1429
14:57 <Krinkle> (the rest wmf.21)
14:57 <brennen> pretty surprising: https://versions.toolforge.org/
14:57 <Krinkle> SAL says we switched group2 on 21st
14:58 <brennen> indeed i see some errors for .19
14:58 <brennen> i wonder if... something was depooled and didn't get synced on thursday?
14:58 <Krinkle> ok, please file task and escalate as you see fit. I'm OOO. Yeah, maybe ask SRE as well.
14:58 <brennen> ack, will do

Again, this should not be the case, since 1.39.0-wmf.21 (T308074) was rolled to all wikis on 2022-07-21.

Just for good measure:

21:03:16 brennen@deploy1002 /srv/mediawiki-staging (master u=) $ grep -c '[.]19' ./wikiversions.json 
0
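
A slightly more informative version of that check counts how many wikis are pinned to each version. This is a minimal sketch, assuming wikiversions.json is a flat JSON object mapping wiki database names to version directory names such as php-1.39.0-wmf.21 (that layout is an assumption here). `count_versions` is a hypothetical helper, not part of scap.

```python
import json
from collections import Counter


def count_versions(wikiversions_text):
    """Count wikis per version in a wikiversions.json document.

    Assumes a flat {"dbname": "php-<version>"} mapping; that layout is
    an assumption, not something confirmed by this task.
    """
    mapping = json.loads(wikiversions_text)
    return Counter(mapping.values())


if __name__ == "__main__":
    # Hypothetical sample data, not real wikiversions.json contents.
    sample = '{"aawiki": "php-1.39.0-wmf.21", "testwiki": "php-1.39.0-wmf.21"}'
    for version, wikis in sorted(count_versions(sample).items()):
        print(version, wikis)
```

Any key ending in `.19` in the result would mean some wiki is still explicitly pinned to the old branch, which was not the case here.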

cc: @jeena

Event Timeline

brennen set the point value for this task to 2.
brennen added a project: User-brennen.
brennen updated the task description.

Mentioned in SAL (#wikimedia-operations) [2022-07-25T21:20:50Z] <brennen> running a no-op sync-world for T313770 to hopefully get 1.39.0-wmf.21 (T308074) to all servers.

Mentioned in SAL (#wikimedia-operations) [2022-07-25T21:24:21Z] <brennen@deploy1002> Started scap: no-op deploy to get wmf.21 on all boxen (T313770)

Mentioned in SAL (#wikimedia-operations) [2022-07-25T21:27:55Z] <brennen@deploy1002> Finished scap: no-op deploy to get wmf.21 on all boxen (T313770) (duration: 03m 33s)

brennen triaged this task as Unbreak Now! priority. (Jul 25 2022, 9:55 PM)

Well, that didn't seem to do it. Still seeing errors for wmf.19. I'm going to mark this as a blocker for 1.39.0-wmf.22 (T308075).

confirmed /srv/mediawiki/php-1.39.0-wmf.21 exists on every single appserver

it has been pointed out on IRC this affects only the canary servers

canary servers have been used for testing a new PHP version

Noting also that wikiversions.json is up to date on the canaries.

Further notes from IRC (thanks Zabe and Dzahn for investigation):

  • Errors are all PHP 7.2
  • Canaries just not getting restarted?
    • Something in a recent Scap change?

Re: scap changes, maybe something related to 78cc0f85? @jnuche - any thoughts there?

Mentioned in SAL (#wikimedia-operations) [2022-07-26T00:11:06Z] <TimStarling> restarted php7.2-fpm on the 9 canary hosts in eqiad T313770

I confirmed that mw1450 was affected and was reporting wmf.19 in Special:Version.

I confirmed that restart-php-fpm-all is still functional by running it on mw1450.

Zabe suggests https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/811373 may be the problem, which seems likely.

I restarted the canaries with cumin. I concatenated /etc/dsh/group/{appserver,api_appserver}, removed comments, and ran:

# Ask each appserver directly (using it as the HTTP proxy) which MediaWiki version it serves.
for h in $(<~/all_appservers); do
    echo -n "$h "
    curl -sx "$h:80" -H 'X-Forwarded-Proto: https' 'http://en.wikipedia.org/w/api.php?action=query&format=json&meta=siteinfo' | jq .query.general.generator
done

This confirmed that all servers are now running wmf.21.
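
For a large fleet, the output of a loop like the one above is easier to check mechanically than by eye. The following is only a sketch: it assumes each output line has the form `<host> "<generator>"` as printed by that loop, and `find_stale_hosts` is a hypothetical helper name.

```python
def find_stale_hosts(lines, expected):
    """Return hosts whose reported generator does not name `expected`.

    Each line is assumed to look like: <host> "<generator string>".
    Hosts with no generator at all are reported too, since a host that
    did not answer cannot be assumed to be up to date.
    """
    stale = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        host, _, generator = line.partition(" ")
        # Compare whole tokens so that e.g. "wmf.2" does not match "wmf.21".
        tokens = generator.strip().strip('"').split()
        if expected not in tokens:
            stale.append(host)
    return stale


if __name__ == "__main__":
    # Hypothetical sample output; mw9999 is a made-up host name.
    sample = [
        'mw1450 "MediaWiki 1.39.0-wmf.19"',
        'mw9999 "MediaWiki 1.39.0-wmf.21"',
    ]
    print(find_stale_hosts(sample, "1.39.0-wmf.21"))
```

An empty result corresponds to the state reported here, with every server on wmf.21.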

From logstash, looks like last canary restart was on the 20th.

rMSCAb616a7ffd1b7 ("Clean up php fpm restart") touches canary restarts directly. Maybe that, combined with the later rMSCA78cc0f8553dd ("_restart_php: Exclude empty host lists"), causes unexpected behavior, since canary restarts were still happening after the first patch was deployed (assuming scap versions correspond to scap deploys).

I think arguably this should block backport windows as well as train deploys. At a minimum, deployers should exercise greater caution. I'll send a note to deployers for the next couple of windows until we sort out a fix. (I'm guessing a fix is relatively straightforward.)

brennen renamed this task from "Some traffic seems to be reaching 1.39.0-wmf.19 code" to "scap no longer restarts php-fpm on canary servers". (Jul 26 2022, 12:32 AM)
brennen changed the task status from Open to In Progress.
brennen added a project: Scap.
brennen moved this task from Backlog to Doing on the User-brennen board.

Is this related to the issues we were seeing in backport windows, where patches would sometimes not take effect on some servers? (The most recent case I know of was on July 20; see the #wikimedia-operations logs starting from 20:11:45.)

Un-cookie-licking.

Change 817200 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[mediawiki/tools/scap@master] Restart the canaries and testservers as well

https://gerrit.wikimedia.org/r/817200

Change 817206 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[operations/puppet@production] scap: restart php on 'scap pull'

https://gerrit.wikimedia.org/r/817206

Change 817206 merged by Giuseppe Lavagetto:

[operations/puppet@production] scap: restart php on 'scap pull'

https://gerrit.wikimedia.org/r/817206

Mentioned in SAL (#wikimedia-operations) [2022-07-26T09:31:33Z] <_joe_> running puppet on the mw-canary hosts T313770

Change 817212 had a related patch set uploaded (by Giuseppe Lavagetto; author: Giuseppe Lavagetto):

[mediawiki/tools/scap@master] Do not set the PHP variable, now unused

https://gerrit.wikimedia.org/r/817212

Change 817212 merged by jenkins-bot:

[mediawiki/tools/scap@master] Stop using utils.sudo_check_call in php restarts

https://gerrit.wikimedia.org/r/817212

Mentioned in SAL (#wikimedia-operations) [2022-07-26T12:02:27Z] <oblivian@deploy1002> Synchronized README: testing fix for php restarts T313770 (duration: 03m 15s)

Mentioned in SAL (#wikimedia-operations) [2022-07-26T12:32:21Z] <jnuche@deploy1002> Synchronized README: Verifying fix for T313770 (duration: 03m 14s)

Joe claimed this task.

What happened was a combination of factors:

  • because of T224857#5467370, scap pull was not configured to perform a php restart
  • the change in https://gerrit.wikimedia.org/r/c/mediawiki/tools/scap/+/811373 started to restart canaries as part of performing scap pull there, instead of doing that as a centralized process like we do for the rest of the servers
  • This of course failed silently because scap was configured not to perform php restarts in that case

@jnuche and I fixed both the configuration and the code so that the cause of T224857#5467370 is solved.

I still think we should rather favour a centralized approach as there isn't any visual feedback as of now, but this is enough to consider this issue resolved.
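
The silent failure described above, and the case for visible feedback, can be illustrated with a small sketch. All names here (`restart_php`, `restart_enabled`, `run_restart`) are hypothetical and do not correspond to scap's actual code or configuration keys. The only point is that a disabled restart gets reported instead of being skipped quietly.

```python
import logging

logger = logging.getLogger("restart-sketch")


def restart_php(host, restart_enabled, run_restart):
    """Restart php-fpm on `host`, or say loudly why it was skipped.

    `restart_enabled` stands in for whatever configuration controls
    restarts, and `run_restart` is a callable doing the actual restart.
    Both are hypothetical stand-ins, not scap's real interface.
    Returns True if a restart was attempted, False if it was skipped.
    """
    if not restart_enabled:
        # The failure mode in this task: the skip produced no feedback.
        logger.warning(
            "php-fpm restart skipped on %s: disabled by configuration", host
        )
        return False
    run_restart(host)
    return True
```

With a warning like this in the deploy output, a configuration that disables restarts on the canaries would presumably have been noticed on the first deploy rather than days later.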

Change 817200 abandoned by Giuseppe Lavagetto:

[mediawiki/tools/scap@master] Restart the canaries and testservers as well

Reason:

not needed.

https://gerrit.wikimedia.org/r/817200