Page MenuHomePhabricator

Exception executing job: EchoNotificationJob "No EchoEvent found with ID ..."
Closed, ResolvedPublic

Description

Error

Request ID: W6K7VQrAEHAAAKn0538AAADW

message
Exception executing job: EchoNotificationJob ##### eventId=809751 requestId=W6K7VQrAEHAAAKn0538AAADW : MWException: No EchoEvent found with ID: 809751
stacktrace
0 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Echo/includes/model/Event.php(339): EchoEventMapper->fetchById(integer, boolean)
#1 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Echo/includes/model/Event.php(382): EchoEvent->loadFromID(integer)
#2 /srv/mediawiki/php-1.32.0-wmf.22/extensions/Echo/includes/jobs/NotificationJob.php(13): EchoEvent::newFromID(integer)
#3 /srv/mediawiki/php-1.32.0-wmf.22/extensions/EventBus/includes/JobExecutor.php(64): EchoNotificationJob->run()
#4 /srv/mediawiki/rpc/RunSingleJob.php(67): JobExecutor->execute(array)
exception
[{exception_id}] {exception_url}   MWException from line 55 of /srv/mediawiki/php-1.32.0-wmf.22/extensions/Echo/includes/mapper/EventMapper.php: No EchoEvent found with ID: 809751

Notes

Seems to only affect www.mediawiki.org. Filing retroactively given it has recorded from previous branches as well, going back to 1.32.0-wmf.18.

Details

Related Gerrit Patches:

Related Objects

Event Timeline

Krinkle created this task.Sep 19 2018, 10:16 PM
Restricted Application added a project: Growth-Team. · View Herald TranscriptSep 19 2018, 10:16 PM
Restricted Application added a subscriber: Aklapper. · View Herald Transcript

(Still seen on 1.32.0-wmf.24)

For reference, the code in the job goes like this:

MWEchoDbFactory::newFromDefault()->waitForSlaves();
$event = EchoEvent::newFromID( $this->eventId );
EchoNotificationController::notify( $event, false );

with the exception being thrown by the second line.

I have verified a handful of event_id from occurrences of this exception in logstash and they all exist in the database. Also, many of them have associated echo_notification rows so it look like the job eventually works but it spams the logs.

waitForSlaves() in the first line is clearly not working. I think LogBalancer->getMasterPos() is returning false since no connection is open at this point of the job execution.

EchoEventMapper->fetchById() has a "auto-retry to master" feature that doesn't seem to be kicking in (we would see re-entrance to the same function in the stack trace). It is controlled by EchoDbFactory->canRetryMaster(), which relies on LoadBalancer->hasOrMadeRecentMasterChanges().

My guess is that since a new LoadBalancer is instantiated in the job, it doesn't know about the recent insertion of the EchoEvent. While those operations are part of the same conceptual action, they are technically disconnected.

I think the way to fix this is to change the implementation of canRetryMaster() to be true when running from a job (it was previously changed to ensure it never kicks in from a GET request). I don't know if there's something like wfIsJob() or if wfIsCLI() would work?

We could also waitForSlaves in a deferred update before enqueuing the job. I don't know if there are undesirable consequences but it might work with minimal efforts.

Change 503404 had a related patch set uploaded (by Sbisson; owner: Sbisson):
[mediawiki/extensions/Echo@master] NotificationJob: make sure we retry to load the event from master

https://gerrit.wikimedia.org/r/503404

Change 503404 merged by jenkins-bot:
[mediawiki/extensions/Echo@master] NotificationJob: make sure we retry to load the event from master

https://gerrit.wikimedia.org/r/503404

Etonkovidova closed this task as Resolved.Apr 25 2019, 12:12 AM

Change 530701 had a related patch set uploaded (by Paladox; owner: Sbisson):
[mediawiki/extensions/Echo@REL1_33] NotificationJob: make sure we retry to load the event from master

https://gerrit.wikimedia.org/r/530701

Change 530701 merged by jenkins-bot:
[mediawiki/extensions/Echo@REL1_33] NotificationJob: make sure we retry to load the event from master

https://gerrit.wikimedia.org/r/530701

mmodell changed the subtype of this task from "Task" to "Production Error".Aug 28 2019, 11:08 PM