
MassMessage problems - multiple deliveries and missing deliveries
Open, Low, Public

Description

In this week's delivery of Tech News, there were two problems: some target pages received multiple copies of the message, and some received none.

The full target list was meant to be https://meta.wikimedia.org/wiki/Global_message_delivery/Targets/Tech_ambassadors

I definitely only clicked "Submit" once, so it's not impatient user error!

I'll tag this with WMF-JobQueue, which I suspect is related, in case that helps.

See also this possibly related recent task: T232362: Massmessages not going through, log looks fine

Event Timeline

Quiddity created this task. Mon, Sep 9, 6:57 PM
Restricted Application added a subscriber: Aklapper. Mon, Sep 9, 6:57 PM
Tgr added a subscriber: Tgr. Mon, Sep 9, 7:11 PM

A common reason for recipients not getting anything is T139380: MassMessage failed delivery claiming "readonly" although the page is not protected. That task also has a (not very user-friendly) script for identifying such issues.

Double-sending may be due to a job returning a failed status (which causes it to be rescheduled) while nevertheless saving successfully. In theory all changes are rolled back on error and this shouldn't happen, but maybe that logic is being circumvented somehow.
If that's the case, you should be able to find the job errors in Logstash.
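As an illustration of that theory, here is a minimal standalone sketch in plain PHP (no MediaWiki classes; the queue shape, retry limit, and target name are invented for the example) of how a job whose primary write is committed, but whose post-save step throws, would deliver the same message once per retry:

<?php
// Sketch of the suspected failure mode: the job's main work (the page
// edit) is committed, but a step that runs after the commit throws, so
// the executor reports failure and the queue retries an already-saved
// delivery.
$delivered = [];  // stands in for the saved talk-page edits
$queue = [ [ 'target' => 'Project:Example_page', 'attempt' => 1 ] ];
$maxAttempts = 3;

function runJob( array $job, array &$delivered ): void {
	// Primary work: "save" the message. Committed immediately.
	$delivered[] = $job['target'];
	// Post-commit step (e.g. tagging the change) that blows up.
	throw new RuntimeException( 'post-save step failed' );
}

while ( $job = array_shift( $queue ) ) {
	try {
		runJob( $job, $delivered );
	} catch ( RuntimeException $e ) {
		// The executor only sees "job failed" and re-enqueues it,
		// unaware that the edit already went through.
		if ( $job['attempt'] < $maxAttempts ) {
			$job['attempt']++;
			$queue[] = $job;
		}
	}
}

var_dump( array_count_values( $delivered ) );
// => [ 'Project:Example_page' => 3 ]: one copy per attempt.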

Re: non-delivery - AFAICT this problem is not the "readonly" bug, because the local logs don't contain any of those errors; see e.g. https://id.wikipedia.org/wiki/Istimewa:Catatan/massmessage

Re: double-sending - I'm not very familiar with Logstash, but I'll assume this suggestion was directed at the other devs who will investigate further. :)

Tgr added a comment. Mon, Sep 9, 7:42 PM

It's easy to learn! Go to https://logstash.wikimedia.org, click the jobqueue link in the navigation section, set the time filter to the last 24 hours with the button at the top right, and type massmessage into the search box; you'll see a flurry of errors like:

Exception executing job: MassMessageSubmitJob Global_message_delivery/Targets/Tech_ambassadors data={"spamlist":"Global message delivery/Targets/Tech ambassadors","subject":"[[m:Special:MyLanguage/Tech/News/2019/37|Tech News: 2019-37]]","message":"string(...)","comment":["Quiddity (WMF)","metawiki","https://meta.wikimedia.org/w/index.php?title=Global_message_delivery/Targets/Tech_ambassadors\u0026oldid=19361962"],"userId":29132638,"originWiki":"metawiki"} pages=array(839) class=MediaWiki\MassMessage\MassMessageJob namespace=0 title=Global_message_delivery/Targets/Tech_ambassadors requestId=XXaPFApAMFUAABuiLDcAAAAB : JobQueueError: Could not enqueue jobs: Unable to deliver all events: 500: Internal Server Error

Now you can go back to the main Logstash page (which shows logs of all types), set the time range again, paste the request ID from the previous message into the search box, and see what errors happened during that request. Because this is a job queue execution request, a bunch of different actions are bundled into it, so you'll also see a bunch of unrelated errors/warnings, but this:

Deferred update MWCallableUpdate_MediaWiki\MassMessage\MassMessageJob->editPage failed: At least one of: RCID, revision ID, and log ID MUST be specified when adding or removing a tag from a change!

#0 /srv/mediawiki/php-1.34.0-wmf.21/includes/changetags/ChangeTags.php(254): ChangeTags::updateTags(Array, Array, NULL, NULL, NULL, NULL, NULL)
#1 /srv/mediawiki/php-1.34.0-wmf.21/extensions/MassMessage/includes/job/MassMessageJob.php(242): ChangeTags::addTags('massmessage-del...', NULL, NULL, NULL)
#2 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/MWCallableUpdate.php(38): MediaWiki\MassMessage\MassMessageJob->MediaWiki\MassMessage\{closure}()
#3 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/DeferredUpdates.php(382): MWCallableUpdate->doUpdate()
#4 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/DeferredUpdates.php(280): DeferredUpdates::attemptUpdate(Object(MWCallableUpdate), Object(Wikimedia\Rdbms\LBFactoryMulti))
#5 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/DeferredUpdates.php(225): DeferredUpdates::run(Object(MWCallableUpdate), Object(Wikimedia\Rdbms\LBFactoryMulti), Object(Monolog\Logger), Object(BufferingStatsdDataFactory), 'post')
#6 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/DeferredUpdates.php(149): DeferredUpdates::handleUpdateQueue(Array, 'run', 2)
#7 /srv/mediawiki/php-1.34.0-wmf.21/extensions/EventBus/includes/JobExecutor.php(96): DeferredUpdates::doUpdates()
#8 /srv/mediawiki/rpc/RunSingleJob.php(76): JobExecutor->execute(Array)
#9 {main}

seems pretty relevant.
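For what it's worth, a hypothetical guard against exactly this error (a sketch assuming MediaWiki context; the names follow the 1.34 stack trace above, not any actual patch) would be to skip the tag update when the edit produced no new revision, since ChangeTags::addTags() requires at least one of RC ID, revision ID, or log ID:

// Hypothetical guard in MassMessageJob::editPage(); $status is assumed
// to be the Status returned by the page save, and $tagName stands for
// the delivery tag (truncated as 'massmessage-del...' in the trace).
$revision = $status->value['revision'] ?? null;
$revId = $revision ? $revision->getId() : null;

if ( $revId !== null ) {
	DeferredUpdates::addCallableUpdate( function () use ( $tagName, $revId ) {
		ChangeTags::addTags( $tagName, null, $revId );
	} );
}
// A null edit produces no new revision; skipping the update then avoids
// the deferred exception, and with it the spurious "job failed" retry.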

Tgr reopened this task as Open. Mon, Sep 9, 7:44 PM

Given the duplicate deliveries, this is probably not (or only partially) a duplicate.
Also, in the other task there was no confirmation that it is not a duplicate of T139380, while this one apparently isn't one.

In general it's hard to tell what caused a MassMessage (or any kind of job queue) error without log digging, so these tasks are not good merge targets.

Mbch331 added a subscriber: Mbch331. Mon, Sep 9, 8:25 PM

Adding @Pchelolo and @Ottomata per T232362 -- but there's a quirk in this task: MassMessage sent 3 copies of the message in some instances, hence it might need separate investigation/solution? Pinging whilst the related aspects are still fresh in your heads. Thanks! :)

[...] hence it might need separate investigation/solution?

Hm, this would make sense. The MassMessageJob doesn't set removeDuplicates to true, so messages are not deduplicated. Since the individual jobs are posted via a MassMessageSubmitJob, which was failing and retrying, duplicate MassMessageJobs were submitted before it hit the too-large-bulk error. We need to deduplicate MassMessage jobs.

However, simply using removeDuplicates will not actually work in this case: it kicks in deduplication by SHA-1, which only suppresses a job if one with the same SHA-1 was executed after the second job was posted (we need to be able to re-execute jobs with the same parameters), and here that might not be the case.
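For context, this is how a job normally opts into that SHA-1 deduplication (a sketch against the core Job API as of the 1.34 branch; the removeDuplicates flag and getDeduplicationInfo() are real, but the ignored parameter is a hypothetical example):

class MassMessageJob extends Job {
	public function __construct( Title $title, array $params ) {
		parent::__construct( 'MassMessageJob', $title, $params );
		// Opt into deduplication: the queue may discard this job if an
		// identical one already ran after it was enqueued.
		$this->removeDuplicates = true;
	}

	public function getDeduplicationInfo() {
		$info = parent::getDeduplicationInfo();
		// Drop fields that differ between logically identical jobs,
		// so that they hash to the same SHA-1 (hypothetical example).
		unset( $info['params']['requestId'] );
		return $info;
	}

	// run() as before.
}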

Seems like we need a new deduplication flag, like deduplicate_unique, that would make the queue ignore an exactly identical job regardless of timing. This, however, would make it impossible to send exactly the same message to the same user twice.

[...]
Seems like we need a new deduplication flag, like deduplicate_unique, that would make the queue ignore an exactly identical job regardless of timing. This, however, would make it impossible to send exactly the same message to the same user twice.

I think that would be OK?
AFAIK MassMessage isn't used for sending generic messages (such as user-warning templates) that someone might legitimately receive multiple times; I believe it is used purely for delivering unique newsletter-edition-type content.
But I'll CC some colleagues (@Johan, @Trizek-WMF) in case they know otherwise; if they're unsure, we can ask the wikitech-ambassadors list and/or the MassMessage talk page.

Johan added a comment. Tue, Sep 10, 9:24 PM

I'd ask around, to be on the safe side.

eprodromou triaged this task as Low priority. Wed, Sep 11, 5:57 PM

What about adding the log ID of the mass-message submission to the hidden comment added by MassMessageJob::makeText? Each log entry there should be associated with one edit per page, so identical content could be re-sent later without causing deduplication issues, since a new message would have a new log entry. Just an idea, from the description at m:Talk:MassMessage.
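A sketch of that idea (hypothetical; the logId parameter and the exact comment format are invented here, though the user/wiki values mirror the "comment" parameters visible in the job log earlier in this task):

private function makeText(): string {
	$text = $this->params['message'];
	// Hypothetical: the submission's log ID, threaded through the job
	// parameters, makes otherwise identical deliveries hash differently.
	$logId = $this->params['logId'] ?? null;
	$text .= "\n<!-- Message sent by User:{$this->params['comment'][0]}@" .
		"{$this->params['comment'][1]}, log entry {$logId} -->";
	return $text;
}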

Happened again this week :)

JAnD added a subscriber: JAnD. Mon, Sep 16, 6:14 PM
Johan added a comment. Mon, Sep 16, 9:43 PM

Have we seen any arguments against a new deduplication flag so far?

Johan added a comment. Mon, Sep 16, 9:50 PM

Also, Ukrainian Wikipedia hasn't received Tech News since September 4. This limits our ability to communicate technical changes to editors – what's happening, what they need to prepare for, what they can give feedback on – which means tools are more likely to go unused, preparations are not made, and there's a risk of conflict when editors weren't invited to discuss things that might change their workflows. Is that likely to be caused by T232392?