Page MenuHomePhabricator

MW script "eval.php" failing during scap operations
Open, Needs TriagePublic

Description

Last night the train presync failed with:

04:04:18 Finished l10n-update (duration: 01m 15s)
04:04:18 Checking for new runtime errors locally
'mwscript eval.php --wiki testwiki' generated unexpected output: Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234
Warning: socket_sendto(): unable to write to socket [101]: Network is unreachable in /srv/mediawiki-staging/php-1.44.0-wmf.2/includes/debug/logger/monolog/LegacyHandler.php on line 234

04:04:34 deploy-promote failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-world', 'testwikis to 1.44.0-wmf.2  refs T375661', '-D', 'full_image_build:True']' returned non-zero exit status 1. (scap version: 4.1

this morning a backport for https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1087195 failed with the same error.

The problem happens when Scap tries to run the MW "eval.php" script inside of a Docker container. The new dockerized version was introduced recently but it had already been deployed to the deployment server and had been running successfully at least for the past week, the first error happening last night. The script runs with with --network none, so my best guess at the moment is the script's behavior has changed and now needs some external network access that it didn't need before.

Event Timeline

kostajh triaged this task as Unbreak Now! priority.Nov 5 2024, 9:04 AM

Marking as UBN, as this is blocking train and backport deployments.

Investigating right now. It would help to know on what server this happened. I assume the active deployment server?

I assume the active deployment server?

Yeah, on deploy2002

Mentioned in SAL (#wikimedia-operations) [2024-11-05T09:21:49Z] <_joe_> restarted rsyslog on deploy2002 T379044

Rsyslog had some tls errors so i restarted it, but i doubted it could be the real culprit, given it is reached via udp. But the problem seems to be that we're calling docker run with --network=none:

https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/main.py#L391 -> https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/mwscript.py?ref_type=heads#L279

and something in eval.php tries to log the call. I guess it's something that changed in mwscript that makes a log call and then socket_sendto fails ofc.

I would suggest, on the short term, to just run docker with --network=host and then check what log calls are being made.

The best thing would probably be to have a full reproduction case and follow the calls from there.

Things I was able to exclude up to now:

  • network/docker/system issues
  • Recent changes to mwscript or scap (all changes were done before the issue started and the changed code was used to complete successful deployments)

More details, successful backports were still happening yesterday, e.g.: https://sal.toolforge.org/log/6iIf-ZIBFFSCpsJz1WyO

Pinging @Umherirrender, @Dreamy_Jazz, @Jdforrester-WMF to try to get some feedback from someone more familiar with the MW scripts.

jnuche opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/543

AbstractSync: run 'mwscript eval.php --wiki testwiki' with network access

jnuche merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/543

AbstractSync: run 'mwscript eval.php --wiki testwiki' with network access

jnuche lowered the priority of this task from Unbreak Now! to Needs Triage.Nov 5 2024, 11:31 AM

Train now got past the failing stage. Removing this as a blocker

and something in eval.php tries to log the call. I guess it's something that changed in mwscript that makes a log call and then socket_sendto fails ofc.

Yeah, I do wonder if this task's issue is obscuring some other error. That is, something that would normally log to udp + stderr is getting hung up on not sending to udp and obscuring real stuff in stderr. Is that possible?

I suppose doing a call to eval.php outside the container and seeing if there's output would be the easiest way to confirm that theory.

The last change to eval.php directly was be39a1833251ba15a4447676e5994105e0807259, it is a comment only change and from end of august. I would not expect any failure from that.
I cannot help on log issues here