
MassMessage problems - multiple deliveries and missing deliveries
Open, Low, Public

Description

In this week's delivery of Tech News, there were two problems: some target pages received multiple copies of the message, and some received none.

The full target list was meant to be https://meta.wikimedia.org/wiki/Global_message_delivery/Targets/Tech_ambassadors

I definitely only clicked "Submit" once, so it's not impatient user error!

I'll tag this with WMF-JobQueue, which I suspect is related, in case that helps.

See also this possibly related recent task: T232362: Massmessages not going through, log looks fine

Event Timeline

Quiddity created this task. Mon, Sep 9, 6:57 PM
Restricted Application added a subscriber: Aklapper. Mon, Sep 9, 6:57 PM
Tgr added a subscriber: Tgr. Mon, Sep 9, 7:11 PM

A common reason for recipients not getting anything is T139380: MassMessage failed delivery claiming "readonly" although the page is not protected. That task also has a (not very user-friendly) script for identifying such issues.

Double-sending may be due to a job returning a failed status (which causes it to be rescheduled) while nevertheless saving successfully. In theory all changes are rolled back on error and this shouldn't happen, but maybe that logic is being circumvented somehow.
If that's the case, you should be able to find the job errors in Logstash.
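As an illustration of that theory, here is a minimal standalone sketch in plain PHP (no MediaWiki classes; the queue shape, retry limit, and target name are invented for the example) of how a job whose primary write is committed, but whose post-save step throws, would deliver the same message once per retry:

<?php
// Sketch of the suspected failure mode: the job's main work (the page
// edit) is committed, but a step that runs after the commit throws, so
// the executor reports failure and the queue retries an already-saved
// delivery.
$delivered = [];  // stands in for the saved talk-page edits
$queue = [ [ 'target' => 'Project:Example_page', 'attempt' => 1 ] ];
$maxAttempts = 3;

function runJob( array $job, array &$delivered ): void {
	// Primary work: "save" the message. Committed immediately.
	$delivered[] = $job['target'];
	// Post-commit step (e.g. tagging the change) that blows up.
	throw new RuntimeException( 'post-save step failed' );
}

while ( $job = array_shift( $queue ) ) {
	try {
		runJob( $job, $delivered );
	} catch ( RuntimeException $e ) {
		// The executor only sees "job failed" and re-enqueues it,
		// unaware that the edit already went through.
		if ( $job['attempt'] < $maxAttempts ) {
			$job['attempt']++;
			$queue[] = $job;
		}
	}
}

var_dump( array_count_values( $delivered ) );
// => [ 'Project:Example_page' => 3 ]: one copy per attempt.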

Re: non-delivery - AFAICT this problem is not the "readonly" bug, because the local logs don't contain any of those errors; see e.g. https://id.wikipedia.org/wiki/Istimewa:Catatan/massmessage

Re: double-sending - I'm not very familiar with Logstash, but I'll assume this suggestion was directed at the other devs who will investigate further. :)

Tgr added a comment. Mon, Sep 9, 7:42 PM

It's easy to learn! Go to https://logstash.wikimedia.org, click the jobqueue link in the navigation section, set the time filter to the last 24 hours with the button at the top right, and type massmessage into the search box; you'll see a flurry of errors like:

Exception executing job: MassMessageSubmitJob Global_message_delivery/Targets/Tech_ambassadors data={"spamlist":"Global message delivery/Targets/Tech ambassadors","subject":"[[m:Special:MyLanguage/Tech/News/2019/37|Tech News: 2019-37]]","message":"string(...)","comment":["Quiddity (WMF)","metawiki","https://meta.wikimedia.org/w/index.php?title=Global_message_delivery/Targets/Tech_ambassadors\u0026oldid=19361962"],"userId":29132638,"originWiki":"metawiki"} pages=array(839) class=MediaWiki\MassMessage\MassMessageJob namespace=0 title=Global_message_delivery/Targets/Tech_ambassadors requestId=XXaPFApAMFUAABuiLDcAAAAB : JobQueueError: Could not enqueue jobs: Unable to deliver all events: 500: Internal Server Error

Now you can go back to the main Logstash page (which shows logs of all types), set the time range again, paste the request ID from the previous message into the search box, and see what errors happened during that request. Because this is a job queue execution request, a bunch of different actions are bundled into it, so you'll also see a bunch of unrelated errors/warnings, but this:

Deferred update MWCallableUpdate_MediaWiki\MassMessage\MassMessageJob->editPage failed: At least one of: RCID, revision ID, and log ID MUST be specified when adding or removing a tag from a change!

#0 /srv/mediawiki/php-1.34.0-wmf.21/includes/changetags/ChangeTags.php(254): ChangeTags::updateTags(Array, Array, NULL, NULL, NULL, NULL, NULL)
#1 /srv/mediawiki/php-1.34.0-wmf.21/extensions/MassMessage/includes/job/MassMessageJob.php(242): ChangeTags::addTags('massmessage-del...', NULL, NULL, NULL)
#2 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/MWCallableUpdate.php(38): MediaWiki\MassMessage\MassMessageJob->MediaWiki\MassMessage\{closure}()
#3 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/DeferredUpdates.php(382): MWCallableUpdate->doUpdate()
#4 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/DeferredUpdates.php(280): DeferredUpdates::attemptUpdate(Object(MWCallableUpdate), Object(Wikimedia\Rdbms\LBFactoryMulti))
#5 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/DeferredUpdates.php(225): DeferredUpdates::run(Object(MWCallableUpdate), Object(Wikimedia\Rdbms\LBFactoryMulti), Object(Monolog\Logger), Object(BufferingStatsdDataFactory), 'post')
#6 /srv/mediawiki/php-1.34.0-wmf.21/includes/deferred/DeferredUpdates.php(149): DeferredUpdates::handleUpdateQueue(Array, 'run', 2)
#7 /srv/mediawiki/php-1.34.0-wmf.21/extensions/EventBus/includes/JobExecutor.php(96): DeferredUpdates::doUpdates()
#8 /srv/mediawiki/rpc/RunSingleJob.php(76): JobExecutor->execute(Array)
#9 {main}

seems pretty relevant.
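For what it's worth, a hypothetical guard against exactly this error (a sketch assuming MediaWiki context; the names follow the 1.34 stack trace above, not any actual patch) would be to skip the tag update when the edit produced no new revision, since ChangeTags::addTags() requires at least one of RC ID, revision ID, or log ID:

// Hypothetical guard in MassMessageJob::editPage(); $status is assumed
// to be the Status returned by the page save, and $tagName stands for
// the delivery tag (truncated as 'massmessage-del...' in the trace).
$revision = $status->value['revision'] ?? null;
$revId = $revision ? $revision->getId() : null;

if ( $revId !== null ) {
	DeferredUpdates::addCallableUpdate( function () use ( $tagName, $revId ) {
		ChangeTags::addTags( $tagName, null, $revId );
	} );
}
// A null edit produces no new revision; skipping the update then avoids
// the deferred exception, and with it the spurious "job failed" retry.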

Tgr reopened this task as Open. Mon, Sep 9, 7:44 PM

Given the duplicate deliveries, this is probably not (or only partially) a duplicate.
Also, in the other task there was no confirmation that it is not a duplicate of T139380, while this one apparently isn't one.

In general it's hard to tell what caused a MassMessage (or any kind of job queue) error without log digging, so these tasks are not good merge targets.

Mbch331 added a subscriber: Mbch331. Mon, Sep 9, 8:25 PM

Adding @Pchelolo and @Ottomata per T232362 -- but there's a quirk in this task: MassMessage sent 3 copies of the message in some instances, hence it might need separate investigation/solution? Pinging whilst the related aspects are still fresh in your heads. Thanks! :)

[...] hence it might need separate investigation/solution?

Hm, this would make sense. The MassMessageJob doesn't set removeDuplicates to true, so messages are not deduplicated. Since the individual jobs are posted via a MassMessageSubmitJob, which was failing and retrying, duplicate MassMessageJobs were submitted before it hit the too-large-bulk error. We need to deduplicate MassMessage jobs.

However, simply using removeDuplicates will not actually work in this case: it kicks in deduplication by SHA-1, which only suppresses a job if one with the same SHA-1 was executed after the second job was posted (we need to be able to re-execute jobs with the same parameters), and here that might not be the case.
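For context, this is how a job normally opts into that SHA-1 deduplication (a sketch against the core Job API as of the 1.34 branch; the removeDuplicates flag and getDeduplicationInfo() are real, but the ignored parameter is a hypothetical example):

class MassMessageJob extends Job {
	public function __construct( Title $title, array $params ) {
		parent::__construct( 'MassMessageJob', $title, $params );
		// Opt into deduplication: the queue may discard this job if an
		// identical one already ran after it was enqueued.
		$this->removeDuplicates = true;
	}

	public function getDeduplicationInfo() {
		$info = parent::getDeduplicationInfo();
		// Drop fields that differ between logically identical jobs,
		// so that they hash to the same SHA-1 (hypothetical example).
		unset( $info['params']['requestId'] );
		return $info;
	}

	// run() as before.
}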

Seems like we need a new deduplication flag, like deduplicate_unique, that would make the queue ignore an exactly identical job regardless of timing. This, however, would make it impossible to send exactly the same message to the same user twice.

[...]
Seems like we need a new deduplication flag, like deduplicate_unique, that would make the queue ignore an exactly identical job regardless of timing. This, however, would make it impossible to send exactly the same message to the same user twice.

I think that would be OK?
AFAIK MassMessage isn't used for sending generic messages (such as user-warning templates) that someone might legitimately receive multiple times; I believe it is used purely for delivering unique newsletter-edition-type content.
But I'll CC some colleagues (@Johan, @Trizek-WMF) in case they know otherwise; if they're unsure, we can ask the wikitech-ambassadors list and/or the MassMessage talk page.

Johan added a comment. Tue, Sep 10, 9:24 PM

I'd ask around, to be on the safe side.

eprodromou triaged this task as Low priority. Wed, Sep 11, 5:57 PM

What about adding the log ID of the mass-message submission to the hidden comment added by MassMessageJob::makeText? Each log entry there should be associated with one edit per page, so identical content could be re-sent later without causing deduplication issues, since a new message would have a new log entry. Just an idea, from the description at m:Talk:MassMessage.
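A sketch of that idea (hypothetical; the logId parameter and the exact comment format are invented here, though the user/wiki values mirror the "comment" parameters visible in the job log earlier in this task):

private function makeText(): string {
	$text = $this->params['message'];
	// Hypothetical: the submission's log ID, threaded through the job
	// parameters, makes otherwise identical deliveries hash differently.
	$logId = $this->params['logId'] ?? null;
	$text .= "\n<!-- Message sent by User:{$this->params['comment'][0]}@" .
		"{$this->params['comment'][1]}, log entry {$logId} -->";
	return $text;
}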

Happened again this week :)

JAnD added a subscriber: JAnD. Mon, Sep 16, 6:14 PM
Johan added a comment. Mon, Sep 16, 9:43 PM

Have we seen any arguments against a new deduplication flag so far?

Johan added a comment. Mon, Sep 16, 9:50 PM

Also, Ukrainian Wikipedia hasn't received Tech News since September 4. This limits our ability to communicate technical changes to editors – what's happening, what they need to prepare for, what they can give feedback on – which means tools are more likely to go unused, preparations are not made, and there's a risk of conflict when editors weren't invited to discuss things that might change their workflows. Is that likely to be caused by T232392?