"JobExecutor not loaded" error for BounceHandlerJob on wikitech.wikimedia.org
Closed, Resolved | Public | Production Error

Description

Error

Request ID: XPfF4QpAMD4AADNSoHcAAAAX

message
Exception from line 68 of /srv/mediawiki/rpc/RunSingleJob.php:

JobExecutor not loaded for job:
{ "database":"labswiki",
  "mediawiki_signature":,
  "meta":{"domain":"wikitech.wikimedia.org","dt":"2019-06-05T13:38:41+00:00","id":"request_id":"XPfF4QpAMD4AADNSoHcAAAAX","schema_uri":"mediawiki\/job\/3","topic":"mediawiki.job.BounceHandlerJob","uri":"https:\/\/placeholder.invalid\/wiki\/Special:Badtitle"},
  "page_namespace":-1,"page_title":":","params":{"email":},
  "type":"BounceHandlerJob"
}

Impact

Some kinds of user actions on Wikitech wiki are presumably not working as intended.

Notes

Breakdown of last 30 days:

Screenshot 2019-06-05 at 15.03.55.png (2,110×870 px, 129 KB)

Event Timeline

Pchelolo subscribed.

Wikitech should not be using the Kafka job queue at all, per T192361#4139799.

In mediawiki-config we set wmgUseClusterJobqueue to false for Wikitech, so it should be using JobQueueDB. We also set the wikitech group to send events of TYPE_NONE, which means no events should be produced, so no jobs should get into Kafka even if the queue configuration is incorrect.
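
For illustration only, the two safeguards described above can be modelled as a small Python sketch. This is not the actual mediawiki-config code: the `WIKI_SETTINGS` table, the `metawiki` entry, the `TYPE_ALL` stand-in and both helper functions are assumptions made up for this example. Only `wmgUseClusterJobqueue`, `TYPE_NONE`, `JobQueueDB` and the `labswiki` database name come from the task.

```python
from enum import Enum


class EventType(Enum):
    TYPE_NONE = 0  # no events are produced
    TYPE_ALL = 1   # assumed stand-in for "events are produced"


# Hypothetical per-wiki settings, modelled on the description in this comment.
WIKI_SETTINGS = {
    "labswiki": {"wmgUseClusterJobqueue": False, "event_type": EventType.TYPE_NONE},
    "metawiki": {"wmgUseClusterJobqueue": True, "event_type": EventType.TYPE_ALL},
}


def queue_backend(wiki: str) -> str:
    """Return the job queue backend the wiki is expected to use."""
    return "kafka" if WIKI_SETTINGS[wiki]["wmgUseClusterJobqueue"] else "JobQueueDB"


def should_produce_event(wiki: str) -> bool:
    """Second safeguard: TYPE_NONE means nothing is sent, whatever the backend."""
    return WIKI_SETTINGS[wiki]["event_type"] is not EventType.TYPE_NONE
```

Under these assumptions either check alone would keep labswiki jobs out of Kafka, which is why the error in the description is unexpected.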

Something is clearly not working as expected, will investigate.

I think the root cause is the same as T208922: PHP Fatal Error: Class undefined: JobExecutor (jobrunners try to run labswiki jobs), namely T208922#4766050 (global config problems when spawning inter-wiki jobs). It's just that now we detect the case when JobExecutor is not loaded instead of bluntly crashing.
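
A minimal Python sketch of the "detect instead of crash" behaviour mentioned above. It is not the code in RunSingleJob.php; the `run_single_job` function, the `loaded_classes` set and the `JobExecutorNotLoaded` exception are hypothetical names used only to show the shape of the guard.

```python
import json


class JobExecutorNotLoaded(Exception):
    """Raised when the executor class is missing (hypothetical name)."""


def run_single_job(event: dict, loaded_classes: set) -> str:
    # Check explicitly for the executor and report the offending job,
    # rather than failing later with an undefined-class fatal error.
    if "JobExecutor" not in loaded_classes:
        raise JobExecutorNotLoaded(
            "JobExecutor not loaded for job: " + json.dumps(event)
        )
    return "executed " + event["type"]
```

With a guard of this kind the failure is logged as an ordinary exception that includes the job payload, as in the error quoted in the description.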

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:06 PM

OK, it has happened again. The problem is essentially that varying $wgJobTypeConf by wiki is not supported, so cross-wiki job posts for Wikitech don't work.

We should reevaluate why we can't move Wikitech to the Kafka job queue after all. When we first deployed it there were some reasons, but I don't remember them. I'll follow up and discuss with SRE.
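
One way to picture the cross-wiki problem is the following Python sketch. It is an assumed reading of the failure mode, not MediaWiki code: the `JOB_TYPE_CONF` table, the `metawiki` source wiki and both functions are hypothetical.

```python
# Intended backend per wiki (hypothetical values for illustration).
JOB_TYPE_CONF = {"metawiki": "kafka", "labswiki": "JobQueueDB"}


def backend_used(current_wiki: str, target_wiki: str) -> str:
    """Unsupported case: the posting wiki's configuration decides the backend."""
    return JOB_TYPE_CONF[current_wiki]


def backend_intended(current_wiki: str, target_wiki: str) -> str:
    """What per-wiki configuration would require: the target wiki decides."""
    return JOB_TYPE_CONF[target_wiki]
```

In this model a job posted from `metawiki` for `labswiki` lands in Kafka even though `labswiki` itself is configured for JobQueueDB, which would match the topic `mediawiki.job.BounceHandlerJob` seen in the error payload.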

T237773: Move Wikitech onto the production MW cluster and related T167973: Move database for wikitech (labswiki) to a main cluster section are what is really needed. Wikitech is a snowflake deployment of MediaWiki. Today it runs on isolated servers with an isolated db. The core cluster job runners can't deal with that.

I traced this issue to this configuration in Puppet (Codesearch):

common/mail/mx.yaml
profile::mail::mx::verp_post_connect_server: meta.wikimedia.org
profile::mail::mx::verp_bounce_post_url: api-rw.discovery.wmnet/w/api.php
…
/usr/bin/curl -H 'Host: <%= @verp_post_connect_server %>' <%= @verp_bounce_post_url %> -d "action=bouncehandler" --data-urlencode "email@-" -o /dev/null

As a result, the ChangeProp jobrunner currently makes requests to the mw* jobrunner machines with a wikitech.wikimedia.org header, and these initialise part of MediaWiki until it finds that JobExecutor is not installed there.

Given this isn't working right now, shall we disable the BounceHandler extension on Wikitech for now, citing that it is only supported in combination with EventBus/Kafka/JobQueue?

Change 818286 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Disable BounceHandler on Wikitech

https://gerrit.wikimedia.org/r/818286

Change 818286 merged by jenkins-bot:

[operations/mediawiki-config@master] Disable BounceHandler on Wikitech

https://gerrit.wikimedia.org/r/818286

Krinkle claimed this task.
Krinkle edited projects, added: Performance-Team; removed: Platform Engineering (Icebox).