
MassMessage problems - multiple deliveries and missing deliveries
Closed, Resolved (Public)

Description

In this week's delivery of Tech News, there were two problems: some target pages received multiple copies, and others received none.

The full target list was meant to be https://meta.wikimedia.org/wiki/Global_message_delivery/Targets/Tech_ambassadors

I definitely only clicked "Submit" once, so it's not an impatient user-error!

I'll tag this with WMF-JobQueue, which I suspect is related, in case that helps.

See also this possibly related recent task: T232362: Massmessages not going through, log looks fine

Event Timeline


A common reason for recipients not getting anything is T139380: MassMessage failed delivery claiming "readonly" although the page is not protected. That task also has a (not user-friendly) script for identifying such issues.

Double-sending may be due to a job returning a failed status (which causes it to be rescheduled) but successfully saving anyway. In theory all changes are rolled back on error and this shouldn't happen, but maybe that logic is circumvented somehow.
If that's the case, you should be able to find the job errors in logstash.

Re: non-delivery - AFAICT this problem is not the "readonly" bug, because the local logs don't contain any of those errors; see e.g. https://id.wikipedia.org/wiki/Istimewa:Catatan/massmessage

Re: double-sending - I'm not very familiar with logstash (but I'll assume this suggestion was directed at the other devs who will investigate further :)

It's easy to learn! Just go to https://logstash.wikimedia.org, click on the jobqueue link in the navigation section, set the time filter to 24 hours with the button at the top right, type massmessage in the search box, and you'll see a flurry of errors like

Exception executing job: MassMessageSubmitJob Global_message_delivery/Targets/Tech_ambassadors data={"spamlist":"Global message delivery/Targets/Tech ambassadors","subject":"[[m:Special:MyLanguage/Tech/News/2019/37|Tech News: 2019-37]]","message":"string(...)","comment":["Quiddity (WMF)","metawiki","https://meta.wikimedia.org/w/index.php?title=Global_message_delivery/Targets/Tech_ambassadors\u0026oldid=19361962"],"userId":29132638,"originWiki":"metawiki"} pages=array(839) class=MediaWiki\MassMessage\MassMessageJob namespace=0 title=Global_message_delivery/Targets/Tech_ambassadors requestId=XXaPFApAMFUAABuiLDcAAAAB : JobQueueError: Could not enqueue jobs: Unable to deliver all events: 500: Internal Server Error

Now you can go back to the main logstash page (which shows logs of all types), set the time range again, paste the request ID from the previous message into the search box, and see what errors happened in that request. Because this is a job-queue execution request, there will be a bunch of different actions bundled into it, and thus a bunch of unrelated errors/warnings as well - but this:

Deferred update MWCallableUpdate_MediaWiki\MassMessage\MassMessageJob->editPage failed: At least one of: RCID, revision ID, and log ID MUST be specified when adding or removing a tag from a change!

#0 /srv/mediawiki/php-1.34.0-wmf.21/includes/changetags/ChangeTags.php(254): ChangeTags::updateTags(Array, Array, NULL, NULL, NULL, NULL, NULL)
#1 /srv/mediawiki/php-1.34.0-wmf.21/extensions/MassMessage/includes/job/MassMessageJob.php(242): ChangeTags::addTags('massmessage-del...', NULL, NULL, NULL)
#2 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/MWCallableUpdate.php(38): MediaWiki\MassMessage\MassMessageJob->MediaWiki\MassMessage\{closure}()
#3 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/DeferredUpdates.php(382): MWCallableUpdate->doUpdate()
#4 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/DeferredUpdates.php(280): DeferredUpdates::attemptUpdate(Object(MWCallableUpdate), Object(Wikimedia\Rdbms\LBFactoryMulti))
#5 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/DeferredUpdates.php(225): DeferredUpdates::run(Object(MWCallableUpdate), Object(Wikimedia\Rdbms\LBFactoryMulti), Object(Monolog\Logger), Object(BufferingStatsdDataFactory), 'post')
#6 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/DeferredUpdates.php(149): DeferredUpdates::handleUpdateQueue(Array, 'run', 2)
#7 /srv/mediawiki/php-1.34.0-wmf.21/extensions/EventBus/includes/JobExecutor.php(96): DeferredUpdates::doUpdates()
#8 /srv/mediawiki/rpc/RunSingleJob.php(76): JobExecutor->execute(Array)
#9 {main}

seems pretty relevant.
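To spell this out: in frame #1, ChangeTags::addTags() receives NULL for all three change identifiers (rc_id, rev_id, log_id), which is exactly what ChangeTags::updateTags() rejects - the situation you get when the delivery edit is a null edit and produces no new revision. A guard along these lines would avoid the deferred-update failure (an illustrative sketch, not the actual extension code; $revId and $tagName are placeholder names):

```php
// Illustrative sketch only - not the real MassMessageJob code.
// $revId would be the ID of the revision created by the delivery
// edit, or null for a null edit; $tagName stands in for the tag
// truncated in the log above ("massmessage-del...").
if ( $revId !== null ) {
	// Anchor the tag to the new revision; rc_id and log_id stay null,
	// which is fine as long as at least one identifier is given.
	ChangeTags::addTags( [ $tagName ], null, $revId );
}
// else: nothing was saved, so there is no change to tag and the
// call that threw the exception above is skipped entirely.
```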

Given the duplicate deliveries, this is probably not (or only partially) a duplicate.
Also, in the other task there was no confirmation that it is not a duplicate of T139380, while this one apparently isn't.

In general it's hard to tell what caused a massmessage (or any kind of job queue) error without log digging, so they are not good merge targets.

Adding @Pchelolo and @Ottomata per T232362 -- But there's a quirk in this task, of MassMessage sending three copies of the message in some instances, so it might need separate investigation/solution? Pinging whilst the related aspects are still fresh in your heads. Thanks! :)

Hm. This would make sense.

> Adding @Pchelolo and @Ottomata per T232362 -- But there's a quirk in this task, of MassMessage sending three copies of the message in some instances, so it might need separate investigation/solution? Pinging whilst the related aspects are still fresh in your heads. Thanks! :)

Hm. The MassMessageJob doesn't set removeDuplicates to true, so messages are not deduplicated. Since the jobs are posted via the MassMessageSubmit job, which was failing and being retried, duplicate MassMessageJobs were submitted before it hit the too-large-bulk error. We need to deduplicate MassMessage jobs.

However, simply using removeDuplicates will not actually work in this case, since that kicks in deduplication by sha1, which only discards a job if one with the same sha1 was executed after the duplicate was posted (we need to be able to re-execute jobs with the same parameters), and here that might not be the case.

Seems like we need a new deduplication flag, like deduplicate_unique, that would make the queue ignore an exactly identical job regardless of timing. This, however, would make it impossible to send exactly the same message to a user twice.
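For context on the mechanism being discussed here, this is roughly how a job opts into the existing sha1-based deduplication (a sketch of the standard Job API, not the patch that was eventually merged; the exact field dropped from the dedup info is illustrative):

```php
// Sketch of the standard MediaWiki job deduplication hooks as they
// could apply to MassMessageJob - illustrative, not the merged change.
class MassMessageJob extends Job {
	public function __construct( Title $title, array $params ) {
		parent::__construct( 'MassMessageJob', $title, $params );
		// Opt into sha1-based deduplication: the queue may discard
		// this job if an identical one already ran after it was pushed.
		$this->removeDuplicates = true;
	}

	public function getDeduplicationInfo() {
		$info = parent::getDeduplicationInfo();
		// Drop parameters that vary between otherwise-identical
		// deliveries, so their sha1 signatures match (the field
		// name here is a placeholder).
		unset( $info['params']['requestId'] );
		return $info;
	}
}
```

As noted above, this only discards a duplicate when an identical job has already run after it was enqueued, which is why the timing-independent deduplicate_unique idea comes up.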

> [...]
> Seems like we need a new deduplication flag, like deduplicate_unique, that would make the queue ignore an exactly identical job regardless of timing. This, however, would make it impossible to send exactly the same message to a user twice.

I think that would be OK?
AFAIK MassMessage isn't used for sending generic messages (such as user-warning templates) that someone might legitimately receive multiple times; I believe it is used purely for delivering unique newsletter-edition-type content.
But I'll CC some colleagues (@Johan, @Trizek-WMF) in case they know otherwise; and if they're unsure, we can ask the wikitech-ambassadors and/or the MassMessage talk page.

I'd ask around, to be on the safe side.

What about adding, in the hidden comment produced by MassMessageJob::makeText, the log ID of the mass message submission? Each log entry there should be associated with one edit per page, and that would allow identical content to be resent later without causing deduplication issues, since a new message would have a new log entry. Just an idea, from the description on m:Talk:MassMessage.
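A minimal sketch of that idea, for illustration (this is not the actual MassMessageJob::makeText(); the logId parameter is hypothetical and would have to be threaded through from the submission):

```php
// Hypothetical sketch of the suggestion above. Embedding the log ID
// of the Special:MassMessage submission in the hidden comment means
// identical message text sent via two different submissions hashes
// differently for deduplication, while retries of one and the same
// submission (same log ID) still deduplicate.
private function makeText(): string {
	$text = $this->params['message'];
	// 'logId' is an assumed job parameter carrying the ID of the
	// log entry created when the mass message was submitted.
	$text .= "\n<!-- Message sent by MassMessage; log entry ID: " .
		$this->params['logId'] . " -->";
	return $text;
}
```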

Have we seen any arguments against a new deduplication flag so far?

Also, Ukrainian Wikipedia hasn't received Tech News since September 4. This is limiting our ability to communicate technical changes to editors - what's happening, what they need to prepare for, what they can give feedback on - leading to tools going unused, preparations not being made, and a risk of conflict when editors were not invited to discuss things that might change their workflows. Is that likely to be caused by T232392?

> Have we seen any arguments against a new deduplication flag so far?

Nope. (There was only one reply to my outreach, a technical clarification from Tgr.)

In that case, I suggest getting said deduplication flag. It's bad that we're spamming people.

@Pchelolo @EvanProdromou How complicated would it be to implement the aforementioned (T232379#5480718) deduplication feature, and is that within your team's area of responsibility?
These bugs are making the MassMessage system as a whole completely unreliable, and it's one of the main systems for cross-wiki communication for a multitude of groups.
Thanks!

Change 537712 had a related patch set uploaded (by Ppchelko; owner: Ppchelko):
[mediawiki/extensions/MassMessage@master] Provide deduplication info to MassMessageJob

https://gerrit.wikimedia.org/r/537712

Hey @Pchelolo, do we know whether the patch might be merged soon? Thanks for the work there.

> Hey @Pchelolo, do we know whether the patch might be merged soon? Thanks for the work there.

Depends on the reviews :)

Change 537712 merged by jenkins-bot:
[mediawiki/extensions/MassMessage@master] Provide deduplication info to MassMessageJob

https://gerrit.wikimedia.org/r/537712

>> Hey @Pchelolo, do we know whether the patch might be merged soon? Thanks for the work there.
>
> Depends on the reviews :)

If so, I'll give you guys 5 stars?

mobrovac subscribed.

The patch has been merged, so you can expect the changes to be visible next week with the regular train.

@Johan I did not see the User-notice tag - aren't you announcing this in Tech News? People are still complaining about receiving triple copies this week, and there must be many others who just delete them.

Change 538742 had a related patch set uploaded (by Urbanecm; owner: Ppchelko):
[mediawiki/extensions/MassMessage@wmf/1.34.0-wmf.23] Provide deduplication info to MassMessageJob

https://gerrit.wikimedia.org/r/538742

This Monday, Tech News was again delivered three times. If it is run by a cron job, please check the crontab; maybe the cron job has been started three times, too. I figured out the rhythm of the duplicate deliveries: 0-1-6 minutes - the second copy followed within one minute, the third within six minutes. It has happened identically for the last three weeks.

What exactly is the procedure for delivering Tech News via MediaWiki message delivery? Who, or what script, is sending it out?

Regards,
Doc Taxon ...

And why do you use "MediaWiki message delivery" to send Tech News out?

And we don't need deduplication but prevention of duplication. That has to be the first step.

> And why do you use "MediaWiki message delivery" to send Tech News out?

@doctaxon: Please ask general Tech News questions on https://meta.wikimedia.org/wiki/Talk:Tech/News - thanks. This task is about MassMessage.

> @Johan I did not see the User-notice tag - aren't you announcing this in Tech News? People are still complaining about receiving triple copies this week, and there must be many others who just delete them.

Not sure what you are referring to?
The last two issues both mentioned this bug already (example).

>> And why do you use "MediaWiki message delivery" to send Tech News out?
>
> @doctaxon: Please ask general Tech News questions on https://meta.wikimedia.org/wiki/Talk:Tech/News - thanks. This task is about MassMessage.

On Talk:Tech/News, nobody seems to be responsible for it. If nobody is maintaining the publishing of Tech News, the service has to be stopped.

@doctaxon I'm not sure exactly what it is you're asking? The problems with multiple deliveries stem from the problem discussed in this task, for which a fix is being merged now. If you've got general questions about Tech News and how it's being delivered, I suggest you ask them at m:Talk:Tech/News and I'll be happy to answer.

Change 538742 merged by jenkins-bot:
[mediawiki/extensions/MassMessage@wmf/1.34.0-wmf.23] Provide deduplication info to MassMessageJob

https://gerrit.wikimedia.org/r/538742

Mentioned in SAL (#wikimedia-operations) [2019-09-24T11:27:58Z] <urbanecm@deploy1001> Synchronized php-1.34.0-wmf.23/extensions/MassMessage/: SWAT: rEMAMba9b209ef0b0: Provide deduplication info to MassMessageJob (T232379) (duration: 00m 57s)

> @doctaxon I'm not sure exactly what it is you're asking? The problems with multiple deliveries stem from the problem discussed in this task, for which a fix is being merged now. If you've got general questions about Tech News and how it's being delivered, I suggest you ask them at m:Talk:Tech/News and I'll be happy to answer.

Done, waiting on your reply.

mobrovac assigned this task to Pchelolo.

The code fix has been applied in production; please reopen if you experience the same issue again.

> Since the jobs are posted via the MassMessageSubmit job, which was failing and being retried, duplicate MassMessageJobs were submitted before it hit the too-large-bulk error.

So it sounds like the root cause was MassMessageSubmit succeeding but reporting failure (and thus being retried), and that still hasn't been explained or fixed, even though we now have a hack in place to catch those erroneous duplicates, right?

Just to be sure: it is not guaranteed at the moment that MassMessage will deliver the messages properly, right?

@mobrovac @Pchelolo So I'm not sure I understand whether the issue is resolved or not. Tgr seems to disagree with you? :)

> So it sounds like the root cause was MassMessageSubmit succeeding but reporting failure (and thus being retried), and that still hasn't been explained or fixed, even though we now have a hack in place to catch those erroneous duplicates, right?

MassMessageSubmit is not transactional - it posts batches of MassMessage jobs separately, so the first batch can succeed and the next batch fail, which causes the whole MassMessageSubmit job to be retried, re-posting the first, previously successful batch.

We could address that by implementing manual retries for MassMessageSubmit that would retry only the failed batches, but now that MassMessage jobs are deduplicated in the standard way, the re-posted MassMessage jobs will not be executed anyway. So I would say it's resolved.

If we get more instances of duplicated messages, that means the deduplication approach doesn't work well enough and the more complicated approach (the partial retries explained above) should be implemented; we can address that in a separate ticket if needed.
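For the record, a minimal sketch of what such partial retries could look like, under stated assumptions (the 'pages' parameter and the makeMassMessageJobs() helper are hypothetical names, and error handling is reduced to the essentials):

```php
// Hedged sketch of batch-level retries for MassMessageSubmitJob -
// not existing code. On a failed push, only the not-yet-posted pages
// are re-enqueued, so already-posted batches are never re-posted.
public function run() {
	$group = JobQueueGroup::singleton();
	$batches = array_chunk( $this->params['pages'], 100 );
	foreach ( $batches as $i => $batch ) {
		try {
			// makeMassMessageJobs() would build one MassMessageJob
			// per target page in the batch (hypothetical helper).
			$group->push( $this->makeMassMessageJobs( $batch ) );
		} catch ( JobQueueError $e ) {
			// Re-enqueue the remaining pages as a fresh submit job
			// instead of failing this one.
			$remaining = array_merge( ...array_slice( $batches, $i ) );
			$group->push( new MassMessageSubmitJob(
				$this->getTitle(),
				[ 'pages' => $remaining ] + $this->params
			) );
			break;
		}
	}
	// Report success either way, so the queue does not retry the
	// whole job and re-post the already-delivered batches.
	return true;
}
```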

Makes sense, thanks for the explanation.

Great, TA!
Also noting for posterity that recurrences of duplication should then be brought up immediately.

I don't think this is a bug, luckily. The person simply got added to two different target lists, hence the redundancy. @Johan will establish which one they should be kept on!

(There's another problem I had to work around, and I accidentally included some recipients who had already got the newsletter when I sent it out the second time. Mea culpa.)

I mentioned the issue I just had with a double delivery at T93049. Paging @Pchelolo, who said "If we get more instances of duplicated messages, that means the deduplication approach doesn't work well enough and the more complicated approach (the partial retries explained above) should be implemented; we can address that in a separate ticket if needed." - LMK how I can help with the latter?