Page MenuHomePhabricator

Fatal error during RecentChange::notifyEdit (deferred update) from ORES/RecentChangeSaveHookHandler
Closed, ResolvedPublic

Description

Error

Request ID: XO3NKQpAAEgAABpfTjMAAABG (1.34.0-wmf.6)

Request URL: any context where edits or other contributions are saved (e.g. index.php saving edits, api.php, etc.)
Request ID: XN9fRgpAAEQAAFE8V0IAAABV (1.34.0-wmf.5)

message
JobQueueError:
Could not enqueue jobs: Unable to serialize events
trace
#0 /srv/mediawiki/php-1.34.0-wmf.5/includes/jobqueue/JobQueue.php(342): JobQueueEventBus->doBatchPush(array, integer)
#1 /srv/mediawiki/php-1.34.0-wmf.5/includes/jobqueue/JobQueue.php(312): JobQueue->batchPush(array, integer)
#2 /srv/mediawiki/php-1.34.0-wmf.5/includes/jobqueue/JobQueueGroup.php(162): JobQueue->push(array)
#3 /srv/mediawiki/php-1.34.0-wmf.5/extensions/ORES/includes/Hooks/RecentChangeSaveHookHandler.php(155): JobQueueGroup->push(array)
#4 /srv/mediawiki/php-1.34.0-wmf.5/extensions/ORES/includes/Hooks/RecentChangeSaveHookHandler.php(106): ORES\Hooks\RecentChangeSaveHookHandler->triggerJob(RecentChange, array)
#5 /srv/mediawiki/php-1.34.0-wmf.5/extensions/ORES/includes/Hooks/RecentChangeSaveHookHandler.php(69): ORES\Hooks\RecentChangeSaveHookHandler->handle(RecentChange, array, boolean, array)
#6 /srv/mediawiki/php-1.34.0-wmf.5/includes/Hooks.php(174): ORES\Hooks\RecentChangeSaveHookHandler::onRecentChange_save(RecentChange)
#7 /srv/mediawiki/php-1.34.0-wmf.5/includes/Hooks.php(202): Hooks::callHook(string, array, array, NULL)
#8 /srv/mediawiki/php-1.34.0-wmf.5/includes/changes/RecentChange.php(445): Hooks::run(string, array)
#9 /srv/mediawiki/php-1.34.0-wmf.5/includes/changes/RecentChange.php(733): RecentChange->save()
#10 /srv/mediawiki/php-1.34.0-wmf.5/includes/deferred/MWCallableUpdate.php(34): Closure$RecentChange::notifyEdit()
#11 /srv/mediawiki/php-1.34.0-wmf.5/includes/deferred/DeferredUpdates.php(274): MWCallableUpdate->doUpdate()
#12 /srv/mediawiki/php-1.34.0-wmf.5/includes/deferred/DeferredUpdates.php(219): DeferredUpdates::runUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer)
#13 /srv/mediawiki/php-1.34.0-wmf.5/includes/deferred/DeferredUpdates.php(143): DeferredUpdates::execute(array, string, integer)
#14 /srv/mediawiki/php-1.34.0-wmf.5/includes/MediaWiki.php(907): DeferredUpdates::doUpdates(string)
#15 /srv/mediawiki/php-1.34.0-wmf.5/includes/MediaWiki.php(731): MediaWiki->restInPeace(string, boolean)
#16 [internal function]: Closure$MediaWiki::doPostOutputShutdown()

Impact

The fatal exception means that other systems listening for recent changes in Core and in other extensions may be broken as well, due to the exception being unhandled.

Notes

This JobQueueError exception message, with RecentChangeSaveHookHandler in the trace, was first observed in Logstash during 1.34.0-wmf.4.

Logstash query: +message:"Unable to serialize events" +exception.trace:"RecentChangeSaveHookHandler"

Event Timeline

Krinkle created this task.Jun 6 2019, 11:45 AM
Restricted Application added a project: Scoring-platform-team. · View Herald TranscriptJun 6 2019, 11:45 AM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript
JTannerWMF added subscribers: Catrope, JTannerWMF.

@Catrope will investigate

Harej triaged this task as High priority.Jun 11 2019, 9:01 PM
Harej moved this task from Untriaged to Maintenance/cleanup on the Scoring-platform-team board.

There are 1043 Logstash events in the last 30 days that match +exception.trace:"RecentChangeSaveHookHandler" and only 9 of them match +exception.trace:"RecentChangeSaveHookHandler". However, the other 1034 all have the same stack trace, which is the one from T225200: Fatal error during CirrusSearch-LinksUpdate job (CirrusTitleJob) from JobQueueGroup->push. (I checked that all of the non-ORES events match +exception.trace:"requeueError".) So they probably have the same root cause (or similar root causes).

Halfak added a subscriber: Halfak.Jun 18 2019, 9:22 PM

@Krinkle, is this still happening? Is it causing a problem with something else? I'm trying to determine if this is a problem to be solved in the ORES extension or if it just the result of some network blip (e.g. it seems many of these are related to CirrusSearch and not ORES). Any insights you have would be helpful.

The immediate impact is that a Job that the ORES extension tries to schedule, is not scheduled and as such will never execute. What that means exactly for ORES users, depends on what that Job is meant to do.

The fall out is that, due to it failing with an unhandled exception, I believe it therefore also ends up short-circuiting all other hook handlers from other/unrelated MediaWiki extensions for the onRecentChange_save event. Of which the impact is much wider and harder to measure.

The frequency is low, though, but until confirmed otherwise, I think any such loss of events has high priority to mitigate. Possibly with a stop-gap solution that prevents the fallout at least.

Per @Catrope last comment, it is indeed possible that both the CirrusSearch job and the ORES job are unable to queue their jobs for the same underlying reason. For example, it might be that they are both transparently passing non-serialisable data through to a Job constructor from the same source. Further investigation would be needed to confirm that, see also T225200#5244919.

Thank you for the additional information, @Krinkle! That's very helpful.

Urbanecm added a subscriber: Urbanecm.EditedJul 18 2019, 9:36 PM

I believe this might caused T228462. See my comment in linked task for details, not posting here due to Security.

Urbanecm raised the priority of this task from High to Unbreak Now!.Jul 18 2019, 9:36 PM

Bumping to UBN, since if I'm right this is cause of T228462, that's worth UBN I think.

Restricted Application added a subscriber: Liuxinyu970226. · View Herald TranscriptJul 18 2019, 9:36 PM

The fall out is that, due to it failing with an unhandled exception, I believe it therefore also ends up short-circuiting all other hook handlers from other/unrelated MediaWiki extensions for the onRecentChange_save event. Of which the impact is much wider and harder to measure.
The frequency is low, though, but until confirmed otherwise, I think any such loss of events has high priority to mitigate. Possibly with a stop-gap solution that prevents the fallout at least.

This seems like it would cause https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Phantom_edit:_notifications_and_watchlist_not_updating

Given that this has been going on for a month with no fix, I'd propose just disabling the ORES extension until this data loss problem can be fixed.

Request ID: XN9fRgpAAEQAAFE8V0IAAABV (1.34.0-wmf.5)

So here's the more useful log:

EventBus.log-20190518.gz:2019-05-18 01:26:31 [XN9fRgpAAEQAAFE8V0IAAABV] mw1273 eswiki 1.34.0-wmf.5 EventBus ERROR: FormatJson::encode($events) failed: Malformed UTF-8 characters, possibly incorrectly encoded. Aborting send. {"json_last_error":"Malformed UTF-8 characters, possibly incorrectly encoded"} 
EventBus.log-20190518.gz:2019-05-18 01:26:31 [XN9fRgpAAEQAAFE8V0IAAABV] mw1273 eswiki 1.34.0-wmf.5 EventBus ERROR: FormatJson::encode($events) failed: Malformed UTF-8 characters, possibly incorrectly encoded. Aborting send. {"json_last_error":"Malformed UTF-8 characters, possibly incorrectly encoded"}

It looks like it could be any job at all that fails serialization is killing all of them?

Change 524403 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/extensions/EventBus@master] Add more debugging to figure out which events are invalid

https://gerrit.wikimedia.org/r/524403

Change 524406 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/extensions/EventBus@wmf/1.34.0-wmf.14] Add more debugging to figure out which events are invalid

https://gerrit.wikimedia.org/r/524406

Change 524407 had a related patch set uploaded (by Legoktm; owner: Legoktm):
[mediawiki/extensions/EventBus@wmf/1.34.0-wmf.13] Add more debugging to figure out which events are invalid

https://gerrit.wikimedia.org/r/524407

Change 524403 merged by Ppchelko:
[mediawiki/extensions/EventBus@master] Add more debugging to figure out which events are invalid

https://gerrit.wikimedia.org/r/524403

Change 524406 merged by jenkins-bot:
[mediawiki/extensions/EventBus@wmf/1.34.0-wmf.14] Add more debugging to figure out which events are invalid

https://gerrit.wikimedia.org/r/524406

Change 524407 merged by Legoktm:
[mediawiki/extensions/EventBus@wmf/1.34.0-wmf.13] Add more debugging to figure out which events are invalid

https://gerrit.wikimedia.org/r/524407

Mentioned in SAL (#wikimedia-operations) [2019-07-19T06:30:26Z] <legoktm@deploy1001> Synchronized php-1.34.0-wmf.14/extensions/EventBus/includes/EventBus.php: Add more debugging to figure out which events are invalid: T225199 (duration: 00m 55s)

Mentioned in SAL (#wikimedia-operations) [2019-07-19T06:32:14Z] <legoktm@deploy1001> Synchronized php-1.34.0-wmf.13/extensions/EventBus/includes/EventBus.php: Add more debugging to figure out which events are invalid: T225199 (duration: 00m 55s)

OK, theoretically we'll be able to better pinpoint exactly which jobs are sending malformed utf-8. I'm going to tail the log for a little while longer before going to sleep. It's happened 9 times so far today, so it shouldn't be too long...

ArielGlenn added a subscriber: ArielGlenn.EditedJul 19 2019, 10:15 AM
2019-07-19 09:32:42 [XTGOOQpAIDEAAKzYQLoAAABP] mw1328 dewiki 1.34.0-wmf.13 EventBus ERROR: FormatJson::encode($events) failed: Malformed UTF-8 characters, possibly incorrectly encoded. Aborting send.  
[Exception Exception] (/srv/mediawiki/php-1.34.0-wmf.13/extensions/EventBus/includes/EventBus.php:198) 
  #0 /srv/mediawiki/php-1.34.0-wmf.13/extensions/EventBus/includes/EventBus.php(133): EventBus::serializeEvents(array)
  #1 /srv/mediawiki/php-1.34.0-wmf.13/extensions/EventBus/includes/adapters/monolog/EventBusMonologHandler.php(61): EventBus->send(array)
  #2 /srv/mediawiki/php-1.34.0-wmf.13/includes/deferred/MWCallableUpdate.php(38): Closure$EventBusMonologHandler::write()
  #3 /srv/mediawiki/php-1.34.0-wmf.13/includes/deferred/DeferredUpdates.php(309): MWCallableUpdate->doUpdate()
  #4 /srv/mediawiki/php-1.34.0-wmf.13/includes/deferred/DeferredUpdates.php(265): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
  #5 /srv/mediawiki/php-1.34.0-wmf.13/includes/deferred/DeferredUpdates.php(225): DeferredUpdates::handleUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, string, integer)
  #6 /srv/mediawiki/php-1.34.0-wmf.13/includes/deferred/DeferredUpdates.php(144): DeferredUpdates::handleUpdateQueue(array, string, integer)
  #7 /srv/mediawiki/php-1.34.0-wmf.13/includes/MediaWiki.php(926): DeferredUpdates::doUpdates(string)
  #8 /srv/mediawiki/php-1.34.0-wmf.13/includes/MediaWiki.php(750): MediaWiki->restInPeace(string, boolean)
  #9 [internal function]: Closure$MediaWiki::doPostOutputShutdown()
  #10 {main}

https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.07.19/mediawiki?id=AWwJk5bl-pVO1Pagj6wJ&_g=h@66534ad

Hoping this is helpful, I looked at a couple of the few weird page titles in the events field from the logstash entry.

select * from page where page_id = 700613;
...
| Toys_“R”_Us     |

But the string in the events field has Toys âRâ Us (for the serialized string at any rate).
Similarly, we have for dewikinews:

select page_title from page where page_id = 83199;
...
| Direktor_der_US-Zoll-_und_Grenzsschutzbehörde_Sanders_tritt_zurück   |

where the events field has Direktor der US-Zoll- und Grenzsschutzbehörde Sanders tritt zurÌck (again for the serialized version of the text).
Maybe these are worth checking out.

Beware that a lot of failures reported as FormatJson::encode($events) failed: Malformed UTF-8 characters, possibly incorrectly encoded. Aborting send. are due to T228496 and relate to sending log events through kafja (these are not MW jobs). These are all search related api requests.
Unless I'm searching incorrectly I can only find 3 instances of this particular failure in the last 7days (ORES RecentChangeSaveHookHandler failing to push due to invalid UTF-8 sequences). Search string I'm using: EventBus AND Malformed NOT "EventBusMonologHandler"

Change 524534 had a related patch set uploaded (by DCausse; owner: DCausse):
[mediawiki/extensions/ORES@master] Normalize the user agent before sending FetchScoreJob

https://gerrit.wikimedia.org/r/524534

Beware that a lot of failures reported as FormatJson::encode($events) failed: Malformed UTF-8 characters, possibly incorrectly encoded. Aborting send. are due to T228496 and relate to sending log events through kafja (these are not MW jobs). These are all search related api requests.
Unless I'm searching incorrectly I can only find 3 instances of this particular failure in the last 7days (ORES RecentChangeSaveHookHandler failing to push due to invalid UTF-8 sequences). Search string I'm using: EventBus AND Malformed NOT "EventBusMonologHandler"

Thanks, I definitely assumed both were the same. So if there are only 3 instances of this failure, we're still looking for another cause of RC insertion failure, since T228462 has 4 reported events missing plus 1 on VPT.

All entries in the EventBus log for today, up until now, are cirrussearch-related. Still waiting for something else to show up.

Change 524534 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Normalize the user agent before sending FetchScoreJob

https://gerrit.wikimedia.org/r/524534

Change 524534 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Normalize the user agent before sending FetchScoreJob
https://gerrit.wikimedia.org/r/524534

Should this be backported, or its waiting on some analytics of the logs?

Change 524604 had a related patch set uploaded (by Accraze; owner: Accraze):
[mediawiki/extensions/ORES@master] Handle JobQueueError exception

https://gerrit.wikimedia.org/r/524604

Change 524604 merged by jenkins-bot:
[mediawiki/extensions/ORES@master] Handle JobQueueError exception

https://gerrit.wikimedia.org/r/524604

akosiaris lowered the priority of this task from Unbreak Now! to High.Tue, Jul 23, 12:41 PM
akosiaris added a subscriber: akosiaris.

Lowering priority from UBN, since that was on Jun 19th and it's been 4 days already.

JTannerWMF added a subscriber: Etonkovidova.

Hey @Etonkovidova can you check to see if this is still happening

We should check if this still happens in production. The latest patch would have changed how the error is reported in Logstash, we should now search for "Job push failed".

Etonkovidova closed this task as Resolved.Wed, Aug 7, 12:29 AM
Etonkovidova claimed this task.

Since the patch was deployed to wmf.14, wmf.15, wmf.16 and no errors were reported, and, also, logstash for testwiki (wmf.17) doesn't display such errors - I'm closing the task as Resolved.