cdnPurge and other jobs fail completely to execute
Closed, Resolved | Public | PRODUCTION ERROR

Description

Error

Request URL: /rpc/runSingleJob.php
Request ID: not applicable

From the logs:

2019-04-18 10:58:41 [799c42dbfaeda1474c477c48] mw1306 metawiki 1.34.0-wmf.1 JobExecutor ERROR: Failed creating job from description {"job_type":"MassMessageJob","message":"Title Special: is invalid"}

Impact

Many jobs (most notably cdnPurge and MassMessage, among those I found) are failing to execute. See T221365 for an example of the impact.

Notes

Event Timeline

Joe triaged this task as Unbreak Now! priority. Apr 18 2019, 12:57 PM

@Reedy graciously reverted group 1 for me, as this was the cause for a UBN! ticket.

We decided to revert given the spike in errors that started yesterday at 19:15 UTC, which corresponds to the SAL entry for moving group 1 to wmf.1.

We're still not sure if the problem is specific to that version though.

This is caused by https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/500171/ and specifically https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/500171/14/includes/jobqueue/Job.php where it sets the default title to Title::makeTitle( NS_SPECIAL, '').

In the Kafka queue, upon deserialization, we try to create a Title from that using newFromPrefixedDBKey, and it obviously fails.
First, core probably shouldn't assign an invalid title as a default, and second, the Kafka job queue should be prepared to deal with the new title-less jobs.
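
A minimal Python model of the failure described above, offered as a sketch rather than as MediaWiki code. The namespace table is a tiny illustrative subset, and `prefixed_db_key` and `parse_prefixed_db_key` are hypothetical helpers that only imitate the idea of building and re-parsing a prefixed DB key. The point is that an empty key in the Special namespace serialises to `Special:`, which a validating parser then rejects:

```python
from typing import Optional, Tuple

NS_MAIN = 0
NS_USER = 2
NS_SPECIAL = -1  # MediaWiki's index for the Special namespace

# Illustrative subset of namespace names (assumption: English names only).
NAMESPACE_NAMES = {NS_MAIN: "", NS_USER: "User", NS_SPECIAL: "Special"}


def prefixed_db_key(ns: int, db_key: str) -> str:
    """Build 'Prefix:Key' without validating, like an unchecked title constructor."""
    prefix = NAMESPACE_NAMES[ns]
    return f"{prefix}:{db_key}" if prefix else db_key


def parse_prefixed_db_key(text: str) -> Optional[Tuple[int, str]]:
    """Hypothetical validating parser: returns None when the title is invalid."""
    by_name = {name: ns for ns, name in NAMESPACE_NAMES.items() if name}
    prefix, sep, rest = text.partition(":")
    if sep and prefix in by_name:
        ns, key = by_name[prefix], rest
    else:
        ns, key = NS_MAIN, text
    if key == "":
        # A namespace prefix with nothing after it is not a valid title.
        return None
    return ns, key


serialised = prefixed_db_key(NS_SPECIAL, "")
print(repr(serialised))                   # the string that ends up in the queue
print(parse_prefixed_db_key(serialised))  # what the consumer gets back
```

Under these assumptions, the producer side happily emits a title that the consumer side can never reconstruct, which matches the "Title Special: is invalid" message in the log.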

Change 504884 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Revert "jobqueue: add GenericParameterJob and RunnableJob interface"

https://gerrit.wikimedia.org/r/504884

Change 504885 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@wmf/1.34.0-wmf.1] Revert "jobqueue: add GenericParameterJob and RunnableJob interface"

https://gerrit.wikimedia.org/r/504885

Change 504888 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@master] Don't use Special:'', use Special:Blankpage again

https://gerrit.wikimedia.org/r/504888

Change 504889 had a related patch set uploaded (by Reedy; owner: Reedy):
[mediawiki/core@wmf/1.34.0-wmf.1] Don't use Special:'', use Special:Blankpage again

https://gerrit.wikimedia.org/r/504889

In the Kafka queue, upon deserialization, we try to create a Title from that using newFromPrefixedDBKey, and it obviously fails.
First, core probably shouldn't assign an invalid title as a default, and second, the Kafka job queue should be prepared to deal with the new title-less jobs.

The use of an invalid title here was intentional, although that was with the idea that it would never make it into the queue. Recent work has removed the mandatory existence of a title parameter. Any jobs passing them through the main signature as before are normalised to set title as a regular params key.

We probably need to update the EventBus handler to re-use core's handling and/or synchronise it accordingly.
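
The normalisation described above could be sketched roughly as follows. This is a Python model under stated assumptions, not core's implementation: the params keys `namespace` and `title`, and the helpers `normalise_job_params` and `job_title`, are hypothetical names chosen for illustration.

```python
from typing import Any, Dict, Optional, Tuple

TitlePair = Tuple[int, str]  # (namespace index, DB key)


def normalise_job_params(
    params: Dict[str, Any], title: Optional[TitlePair] = None
) -> Dict[str, Any]:
    """Fold a title passed via the old signature into the params map.

    Assumption: the keys are called 'namespace' and 'title'.
    Jobs created without a title simply get no such keys.
    """
    out = dict(params)  # do not mutate the caller's dict
    if title is not None:
        ns, key = title
        out.setdefault("namespace", ns)
        out.setdefault("title", key)
    return out


def job_title(params: Dict[str, Any]) -> Optional[TitlePair]:
    """Return the title pair if the job carries one, else None."""
    if "namespace" in params and "title" in params:
        return params["namespace"], params["title"]
    return None


with_title = normalise_job_params({"reason": "edit"}, title=(2, "Example"))
without_title = normalise_job_params({"urls": ["https://example.org/a"]})
print(job_title(with_title))
print(job_title(without_title))
```

In this model a title-less job is represented by the absence of the keys rather than by a placeholder value, so a consumer has to handle `None` explicitly instead of trying to parse a dummy title.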

Krinkle added a project: Performance-Team.
Krinkle added a subscriber: aaron.

Assigning to self to start investigating. May need to transfer to Aaron once his day starts :)

The use of an invalid title here was intentional, although that was with the idea that it would never make it into the queue. Recent work has removed the mandatory existence of a title parameter. Any jobs passing them through the main signature as before are normalized to set the title as a regular params key.

Seems like it didn't entirely work. For example, for cdnPurge there's no title in the parameters, so we fell back to the invalid Special: title. Is the idea to eventually remove the Job::title field completely?

We probably need to update the EventBus handler to re-use core's handling and/or synchronize it accordingly.

Yup. Here's where it's parsed https://github.com/wikimedia/mediawiki-extensions-EventBus/blob/master/includes/JobExecutor.php#L178

We also should make page_title in the job event schema not required.

@Pchelolo Yeah, I also see a few other pre-existing problems here that we're lucky haven't caused problems before.

EventBus is currently doing something similar to core, but with very different (incompatible) semantics. This is actually the first time I'm seeing Title::newFromDBkey used in production code. I didn't know that existed.

The stored parameters in EventBus Kafka look incorrect. page_namespace and page_title are meant as a pair. For example, the display title User:Example is stored as (2, 'Example'). The way they are read out is with Title::makeTitle. Some examples from core:

mediawiki-core/JobQueueDB.php
			'job_namespace' => $job->getTitle()->getNamespace(),
			'job_title' => $job->getTitle()->getDBkey(),
// …
	Title::makeTitle( $row->job_namespace, $row->job_title ),

and

mediawiki-core/IJobSpecification.php
			'title'  => [
				'ns'  => $this->title->getNamespace(),
				'key' => $this->title->getDBkey()
			]
// …
		$title = Title::makeTitle( $map['title']['ns'], $map['title']['key'] );

By comparison:

EventBus/EventFactory.php
		$attrs = [
			'database' => $wiki ?: $wgDBname,
			'type' => $job->getType(),
			'page_namespace' => $job->getTitle()->getNamespace(),
			'page_title' => self::getTitleFormatter()->getPrefixedDBkey( $job->getTitle() )
		];
// …

		$title = Title::newFromDBkey( $jobEvent['page_title'] );

This looks like a mistake to me. This is applying expensive formatting for human display, and storing it in a persistent medium (Kafka), which we never do. And it isn't using the page_namespace parameter, it seems?
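
To make the difference concrete, here is a small Python model of the two storage conventions compared above. It is only a sketch: the function names are hypothetical, the namespace table is an illustrative subset, and the field names are taken from the snippets quoted in this comment.

```python
from typing import Any, Dict, Tuple

# Illustrative subset of namespace names (assumption: English names only).
NAMESPACE_NAMES = {0: "", 2: "User", -1: "Special"}


def store_core_style(ns: int, db_key: str) -> Dict[str, Any]:
    """Store the namespace and the unprefixed DB key as a pair."""
    return {"job_namespace": ns, "job_title": db_key}


def load_core_style(row: Dict[str, Any]) -> Tuple[int, str]:
    """Read the pair back unchanged; no parsing or validation is involved."""
    return row["job_namespace"], row["job_title"]


def store_eventbus_style(ns: int, db_key: str) -> Dict[str, Any]:
    """Store the namespace plus the *prefixed* key in page_title."""
    prefix = NAMESPACE_NAMES[ns]
    prefixed = f"{prefix}:{db_key}" if prefix else db_key
    return {"page_namespace": ns, "page_title": prefixed}


def load_as_pair(event: Dict[str, Any]) -> Tuple[int, str]:
    """What a reader gets if it treats the two fields as a pair."""
    return event["page_namespace"], event["page_title"]


row = store_core_style(2, "Example")
event = store_eventbus_style(2, "Example")
print(load_core_style(row))  # round-trips to the original pair
print(load_as_pair(event))   # the namespace is now encoded twice
```

In this model the pair convention round-trips any combination, including an empty key, because nothing is re-parsed. The prefixed convention makes `page_namespace` redundant and forces the reader to parse `page_title`, which is where an invalid title can be rejected.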

Anyway, patch incoming soon. No schema changes are required for this. We can make it work.

Change 504916 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/core@master] jobqueue: Follow-up for fc5d51f12936ed (added GenericParameterJob)

https://gerrit.wikimedia.org/r/504916

Change 504920 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/EventBus@master] Normalise invalid titles to a dummy Title object

https://gerrit.wikimedia.org/r/504920

Change 504920 merged by jenkins-bot:
[mediawiki/extensions/EventBus@master] Normalise invalid titles to a dummy Title object

https://gerrit.wikimedia.org/r/504920

Change 504929 had a related patch set uploaded (by Jforrester; owner: Krinkle):
[mediawiki/extensions/EventBus@wmf/1.34.0-wmf.1] Normalise invalid titles to a dummy Title object

https://gerrit.wikimedia.org/r/504929

Change 504929 merged by Mobrovac:
[mediawiki/extensions/EventBus@wmf/1.34.0-wmf.1] Normalise invalid titles to a dummy Title object

https://gerrit.wikimedia.org/r/504929

Change 504930 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/event-schemas@master] Make page_title and page_namespace parameters not required.

https://gerrit.wikimedia.org/r/504930

Mentioned in SAL (#wikimedia-operations) [2019-04-18T17:46:31Z] <mobrovac@deploy1001> Synchronized php-1.34.0-wmf.1/extensions/EventBus/includes/JobExecutor.php: Default to a dummy title for invalid titles - T221368 (duration: 01m 01s)

Change 504933 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/EventBus@master] JobExecutor: Remove use of page_title valid for jobs

https://gerrit.wikimedia.org/r/504933

mobrovac lowered the priority of this task from Unbreak Now! to High. Apr 18 2019, 6:13 PM
mobrovac subscribed.

The immediate problem with jobs failing because of invalid titles has been addressed, so lowering the priority. Next steps (which have to wait for the train to fully complete) include removing the title and namespace parameters from the job schema and not using them when enqueuing/dequeuing the jobs.

Change 504930 merged by Mobrovac:
[mediawiki/event-schemas@master] Make page_title and page_namespace parameters not required.

https://gerrit.wikimedia.org/r/504930

mobrovac raised the priority of this task from High to Unbreak Now!. Apr 18 2019, 6:22 PM

Still UBN, waiting on the mw-core side for this.

@mobrovac Okay, I've looked it over a few more times and am now confident that with this second EventBus patch, we can proceed without the core patch.

Change 504933 merged by Mobrovac:
[mediawiki/extensions/EventBus@master] JobExecutor: Remove use of page_title valid for jobs

https://gerrit.wikimedia.org/r/504933

Change 504942 had a related patch set uploaded (by Mobrovac; owner: Krinkle):
[mediawiki/extensions/EventBus@wmf/1.34.0-wmf.1] JobExecutor: Remove use of page_title valid for jobs

https://gerrit.wikimedia.org/r/504942

Change 504942 merged by Mobrovac:
[mediawiki/extensions/EventBus@wmf/1.34.0-wmf.1] JobExecutor: Remove use of page_title valid for jobs

https://gerrit.wikimedia.org/r/504942

Mentioned in SAL (#wikimedia-operations) [2019-04-18T19:10:04Z] <mobrovac@deploy1001> Synchronized php-1.34.0-wmf.1/extensions/EventBus/includes/JobExecutor.php: Remove the use of page titles in JobExecutor, file 1/2 - T221368 (duration: 01m 01s)

Mentioned in SAL (#wikimedia-operations) [2019-04-18T19:11:46Z] <mobrovac@deploy1001> Synchronized php-1.34.0-wmf.1/extensions/EventBus/includes/EventFactory.php: Remove the use of page titles in JobExecutor, file 2/2 - T221368 (duration: 00m 59s)

Mentioned in SAL (#wikimedia-operations) [2019-04-18T19:17:49Z] <mobrovac@deploy1001> Started restart [cpjobqueue/deploy@922cbc0]: Bounce CP4JQ, lots of transport broken failures - T221368

Mentioned in SAL (#wikimedia-operations) [2019-04-18T19:29:31Z] <cdanis> cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'scb*' 'disable-puppet "mobrovac: temp stop JQ for T221368" && systemctl stop cpjobqueue'

Change 504958 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/Translate@master] Remove problematic Job::$params assignments

https://gerrit.wikimedia.org/r/504958

Mentioned in SAL (#wikimedia-operations) [2019-04-18T19:36:25Z] <cdanis> cdanis@cumin1001.eqiad.wmnet ~ % sudo cookbook sre.hosts.downtime -r "mobrovac: temp stop JQ for T221368" 'scb*'

Change 504961 had a related patch set uploaded (by Krinkle; owner: Krinkle):
[mediawiki/extensions/Translate@wmf/1.34.0-wmf.1] Remove problematic Job::$params assignments

https://gerrit.wikimedia.org/r/504961

Change 504958 merged by jenkins-bot:
[mediawiki/extensions/Translate@master] Remove problematic Job::$params assignments

https://gerrit.wikimedia.org/r/504958

Change 504961 merged by jenkins-bot:
[mediawiki/extensions/Translate@wmf/1.34.0-wmf.1] Remove problematic Job::$params assignments

https://gerrit.wikimedia.org/r/504961

Mentioned in SAL (#wikimedia-operations) [2019-04-18T20:32:22Z] <cdanis> cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'scb*' 'enable-puppet "mobrovac: temp stop JQ for T221368"'

Change 504885 abandoned by Jforrester:
Revert "jobqueue: add GenericParameterJob and RunnableJob interface"

Reason:
Fingers crossed, we don't need this.

https://gerrit.wikimedia.org/r/504885

Change 504884 abandoned by Jforrester:
Revert "jobqueue: add GenericParameterJob and RunnableJob interface"

Reason:
Fingers crossed, we don't need this.

https://gerrit.wikimedia.org/r/504884

Change 504889 abandoned by Jforrester:
Don't use Special:'', use Special:Blankpage again

Reason:
Fingers crossed, we don't need this.

https://gerrit.wikimedia.org/r/504889

Change 504888 abandoned by Jforrester:
Don't use Special:'', use Special:Blankpage again

Reason:
Fingers crossed, we don't need this.

https://gerrit.wikimedia.org/r/504888

Mentioned in SAL (#wikimedia-operations) [2019-04-18T20:38:47Z] <mobrovac@deploy1001> Synchronized php-1.34.0-wmf.1/extensions/Translate/tag: Translate jobs: Remove problematic Job::$params assignments, dir 1/2 - T221368 (duration: 01m 01s)

Mentioned in SAL (#wikimedia-operations) [2019-04-18T20:40:31Z] <mobrovac@deploy1001> Synchronized php-1.34.0-wmf.1/extensions/Translate/utils/MessageUpdateJob.php: Translate jobs: Remove problematic Job::$params assignments, dir 2/2 - T221368 (duration: 01m 00s)

Confirmed to work again on mediawiki.org by editing a template and confirming from Incognito (cache enabled) that pages using it got the update propagated. Also confirmed while logged in, and via the Translate extension.

Mentioned in SAL (#wikimedia-operations) [2019-04-18T20:52:25Z] <cdanis> root@icinga1001.wikimedia.org /var/lib/icinga # for DOWNTIME in $(fgrep -B12 'comment=mobrovac: temp stop JQ for T221368 - cdanis@cumin1001' retention.dat | grep -A13 servicedowntime | grep downtime_id | cut -d= -f2); do printf "[%lu] DEL_SVC_DOWNTIME;%u\n" $(date +%s) $DOWNTIME ; done > rw/icinga.cmd

Change 504916 merged by jenkins-bot:
[mediawiki/core@master] jobqueue: Follow-up for fc5d51f12936ed (added GenericParameterJob)

https://gerrit.wikimedia.org/r/504916

mmodell changed the subtype of this task from "Task" to "Production Error". Aug 28 2019, 11:07 PM