Page MenuHomePhabricator

Same MassMessage is being sent more than once
Closed, ResolvedPublic

Description

https://it.wikipedia.org/w/index.php?title=Discussioni_utente:Evil89&diff=71467952&oldid=20003624

It's the same message, I believe: there are no incoming redirects and both copies say "Evil89". JobQueue issues again?

PLEASE NOTE: this bug hits randomly, and it's unassigned.
Consider appending a preemptive apology to your MassMessage if you really need to use the tool. Having short target lists may also help (if anything, removing any duplicates later will be easier/faster as there should be fewer of them). You can use GUC to figure out if the bug has affected your latest delivery.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

For more quirkiness, I had recently brought up that in a mass message delivery, the GUC log would only feature a few entries even though the message was correctly delivered to a way longer list (and it wasn't a lag issue, this was observed several hours after sending). Please LMK if this is not helpful here - I can report separately. TYVM.

@Joe I don't have full context of how MassMessageJob and JobQueue work here, but at the very least it seems we are able to save the MassMessageJob events to Hive in the event.mediawiki_job_massmessagejob table. Unfortunetly we do run the events in these tables through a deduplicate function, which deduplicates them based on their meta.id field. The raw JSON event data from Kafka is saved in HDFS for 90 days in /wmf/data/raw/event/{eqiad,codfw}_mediawiki_job_MassMessageJob, so we could look there. We also have raw data for /wmf/data/raw/event/{eqiad,codfw}_mediawiki_job_MassMessageSubmitJob (we don't refine MassMessageSubmitJob into Hive because its format has not been consistent enough to do so, so it has been manually excluded from refinement).

Just noting for the record, I had similar problems on Monday, whilst delivering TechNews.
It delivered duplicates to 6 Wiktionary pages, at "ja, la, nl, ml, simple, vi", from this list:
https://meta.wikimedia.org/wiki/User:Johan_(WMF)/Tech_News_target_list_1
The duplicates were delivered ~53 minutes after the first delivery, e.g. at :
https://la.wiktionary.org/w/index.php?title=Victionarium:Nuntii_technici&offset=&limit=20&action=history

AMooney subscribed.

If Platform team is needed, retag us.

For more quirkiness, I had recently brought up that in a mass message delivery, the GUC log would only feature a few entries even though the message was correctly delivered to a way longer list (and it wasn't a lag issue, this was observed several hours after sending). Please LMK if this is not helpful here - I can report separately. TYVM.

This is now T259454.

I hope this doesn't happen to my delivery as well :(((
I had to contact 588 wikis earlier today. Please cross fingers, toes, whatever you have for me.

[Jul 23, 2020] In T93049#6328594, @Joe added projects: Platform Engineering:

@Pchelolo @Ottomata do we have any way to verify how this happened from eventgate and changeprop logs?

[Jul 28 2020] In T93049#6342727, @AMooney removed a project: Platform Engineering.

If Platform team is needed, retag us.

Yes, per Joe's comment. From what I can see, it was not addressed or answer yet.

At least one message (of 111) I sent out yesterday duplicated, on the Italian Wikipedia (https://it.wikipedia.org/wiki/Discussioni_utente:Keegan_(WMF)#Twice).

Edit: there were a few more duplicates left in the run, the most egregious was three copies being left at https://www.wikidata.org/wiki/User_talk:Pigsonthewing (https://www.wikidata.org/w/index.php?title=User_talk%3APigsonthewing&type=revision&diff=1281386944&oldid=1281276097)

Unfortunately the editing team does not have any expertise is this area.

@Esanders: Is the Editing-team still going to investigate this problem, despite lack of expertise?
If not, then a https://www.mediawiki.org/wiki/Code_stewardship_reviews should be initiated to define code stewards for MassMessage.

Per the page mentioned above, pinging @Legoktm , @wctaiwan and @DAlangi_WMF in case they have anything to contribute here.

While MassMessage is how users see the problem (e.g. no one really notices a page's cache being invalidated twice, unnecessarily), the real issue here is that the job queue is executing a job multiple times, and it really shouldn't be. And unfortunately I no longer really understand the internals of how the job queue works in the new kafka system to be able to debug what's going wrong here.

Ways forward: 1) someone looks into why the job queue is running jobs twice, and/or 2) MassMessage implements double-posting protection.

I think #1 needs to be looked at regardless, because running jobs twice seems problematic even besides MassMessage.

For #2, we could implement something similar to EdwardsBot's counter system. At job submission, we generate a unique Message-ID, like an email would have, include it in the wikitext of the message as a HTML comment, <!-- Message-ID: {...} --> (maybe in the same HTML comment we already add) and then before adding a new section, check for that string. The primary flaw in something like this is race conditions, so the best and probably only complete way to fix this (short of dropping the job queue entirely) is #1.

In T93049#6328594, @Joe wrote:

@Pchelolo @Ottomata do we have any way to verify how this happened from eventgate and changeprop logs?

Ping again?

Are the jobs failing near the end and getting retried? Also, job B can still be enqueued if a duplicate job A is marked as running. This was long-since the semantic logic (and the kafka queue does this too last I checked). If job A fails, it seems like the redis job queue (https://github.com/wikimedia/mediawiki-services-jobrunner/blob/master/redisJobChronService) would then allow duplicate job C, though JobQueueDB would not. Not sure about the Kafka queue in that situation. The behavior there should be standardized in any case.

That said, job-deduplication is just for performance, not for semantics, so the Job::run() code itself should also have a robust check to catch any duplicates (or be otherwise idempotent).

@Pchelolo and I looked at the logs and investigated the most recent incident reported by @Samwalton9 and in that case, we noticed that MassMessage is treating the message as two separate jobs because of the text. Below you can see the difference in the two messages. One appears to be sent by a script from User:Evad37/SPS while the other one is not.

The small change in the text created duplicates for MassMessageSubmitJob and every subscriber of the mailing list received two messages.

Message 1:

<div lang=\"en\" dir=\"ltr\" class=\"mw-content-ltr\"><div style=\"-moz-column-count:2; -webkit-column-count:2; column-count:2;\"> {{Wikipedia:Wikipedia Signpost/2020-09-27}} </div><!--Volume 16, Issue 9--> <div class=\"hlist\" style=\"margin-top:10px; font-size:90%; padding-left:5px; font-family:Georgia, Palatino, Palatino Linotype, Times, Times New Roman, serif;\"> * '''[[Wikipedia:Wikipedia Signpost|Read this Signpost in full]]''' * [[Wikipedia:Wikipedia Signpost/Single/2020-09-27|Single-page]] * [[Wikipedia:Wikipedia Signpost/Subscribe|Unsubscribe]] * [[User:MediaWiki message delivery|MediaWiki message delivery]] ([[User talk:MediaWiki message delivery|talk]]) ~~~~~ <!-- Sent via script ([[User:Evad37/SPS]]) --></div></div>

Message 2:

div lang=\"en\" dir=\"ltr\" class=\"mw-content-ltr\"><div style=\"-moz-column-count:2; -webkit-column-count:2; column-count:2;\"> {{WP:Wikipedia Signpost/2020-09-27}} </div><!--Volume 16, Issue 09--> <div class=\"hlist\" style=\"margin-top:10px; font-size:90%; padding-left:5px; font-family:Georgia, Palatino, Palatino Linotype, Times, Times New Roman, serif;\"> * '''[[Wikipedia:Wikipedia Signpost|Read this Signpost in full]]''' * [[Wikipedia:Wikipedia Signpost/Single/2020-09-27|Single-page]] * [[Wikipedia:Wikipedia Signpost/Subscribe|Unsubscribe]] * [[User:MediaWiki message delivery|MediaWiki message delivery]] ([[User talk:MediaWiki message delivery|talk]]) ~~~~~ </div></div>

While MassMessage is how users see the problem (e.g. no one really notices a page's cache being invalidated twice, unnecessarily), the real issue here is that the job queue is executing a job multiple times, and it really shouldn't be.

{{cn}}

I see no conclusive proof that's what's happening.

In this case the queue is executing the job multiple times because these are different jobs. The posted text is different. The second one contains a comment which the first one does not. We have deduplication on the jobqueue side, but it just uses whatever Job::getDeduplicationInfo gives it. Message text is a part of the job deduplication hash, so JobQueue operates correctly here.

I think the User:Evad37/SPS bot is a problem here for posting the message twice. I do not know enough about how that bot works, what does it do, or how these messages get into the system, so I can't judge if it's possible to add some better deduplication info there. If it is, we can just override getDeduplicationInfo for this job and dedupe it.

It seems like the Signpost article duplication was a user error, per the discussion at https://en.wikipedia.org/wiki/Wikipedia_talk:Wikipedia_Signpost#Duplicate_delivery%3F. The Books & Bytes example I mentioned further up, however, was definitely only sent once.

The Books & Bytes example I mentioned further up, however, was definitely only sent once.

Unfortunately the detailed logs and jobs for that one has expired by now. In the current system we have only 1 week when all the info is preserved. As soon as this happens again, please add an example here and we will investigate again.

Thanks everyone for your help here.

Ok, it did execute the job twice:

Once on 27th:

2020-10-27 19:52:28 [34499b04-8b9a-4cd1-95dd-9229906705c7] mw1303 huwiki 1.36.0-wmf.14 JobExecutor DEBUG: Beginning job execution {"job":"MassMessageJob Szerkesztővita:Tgr spamlist=Szerkesztő:Samat/MassMessages subject=Októberi online wikitalálkozó page-message= message=Szia!
[[User:Samat|Samat]] <sup>[[User vita:Samat|üzenetrögzítő]]</sup> ~~~~~ • ([[Wikipédia:Wikitalálkozók/Érdeklődők|feliratkozás és leiratkozás]]) comment=[\"Samat\",\"huwiki\",\"https://hu.wikipedia.org/w/index.php?title=Szerkeszt%C5%91:Samat/MassMessages\\u0026oldid=23070985\"] userId=949 originWiki=huwiki isSourceTranslationPage= translationPageSourceLanguage= pageMessageTitle= title=Szerkesztővita:Tgr rootJobSignature=8d2e0b34ef7fe5249c9182c2efb0063261ae50bd rootJobTimestamp=20201027195225 requestId=34499b04-8b9a-4cd1-95dd-9229906705c7","job_type":"MassMessageJob"} 
2020-10-27 19:52:35 [34499b04-8b9a-4cd1-95dd-9229906705c7] mw1303 huwiki 1.36.0-wmf.14 JobExecutor INFO: Finished job execution {"job":"MassMessageJob Szerkesztővita:Tgr spamlist=Szerkesztő:Samat/MassMessages subject=Októberi online wikitalálkozó page-message= message=Szia!
[[User:Samat|Samat]] <sup>[[User vita:Samat|üzenetrögzítő]]</sup> ~~~~~ • ([[Wikipédia:Wikitalálkozók/Érdeklődők|feliratkozás és leiratkozás]]) comment=[\"Samat\",\"huwiki\",\"https://hu.wikipedia.org/w/index.php?title=Szerkeszt%C5%91:Samat/MassMessages\\u0026oldid=23070985\"] userId=949 originWiki=huwiki isSourceTranslationPage= translationPageSourceLanguage= pageMessageTitle= title=Szerkesztővita:Tgr rootJobSignature=8d2e0b34ef7fe5249c9182c2efb0063261ae50bd rootJobTimestamp=20201027195225 requestId=34499b04-8b9a-4cd1-95dd-9229906705c7","job_type":"MassMessageJob","job_status":true,"job_duration":7.977766990661621}

Once on 28th:

2020-10-28 16:06:39 [34499b04-8b9a-4cd1-95dd-9229906705c7] mw1294 huwiki 1.36.0-wmf.14 JobExecutor DEBUG: Beginning job execution {"job":"MassMessageJob Szerkesztővita:Tgr spamlist=Szerkesztő:Samat/MassMessages subject=Októberi online wikitalálkozó page-message= message=Szia!
[[User:Samat|Samat]] <sup>[[User vita:Samat|üzenetrögzítő]]</sup> ~~~~~ • ([[Wikipédia:Wikitalálkozók/Érdeklődők|feliratkozás és leiratkozás]]) comment=[\"Samat\",\"huwiki\",\"https://hu.wikipedia.org/w/index.php?title=Szerkeszt%C5%91:Samat/MassMessages\\u0026oldid=23070985\"] userId=949 originWiki=huwiki isSourceTranslationPage= translationPageSourceLanguage= pageMessageTitle= title=Szerkesztővita:Tgr rootJobSignature=8d2e0b34ef7fe5249c9182c2efb0063261ae50bd rootJobTimestamp=20201027195225 requestId=34499b04-8b9a-4cd1-95dd-9229906705c7","job_type":"MassMessageJob"} 
2020-10-28 16:06:43 [34499b04-8b9a-4cd1-95dd-9229906705c7] mw1294 huwiki 1.36.0-wmf.14 JobExecutor INFO: Finished job execution {"job":"MassMessageJob Szerkesztővita:Tgr spamlist=Szerkesztő:Samat/MassMessages subject=Októberi online wikitalálkozó page-message= message=Szia!
[[User:Samat|Samat]] <sup>[[User vita:Samat|üzenetrögzítő]]</sup> ~~~~~ • ([[Wikipédia:Wikitalálkozók/Érdeklődők|feliratkozás és leiratkozás]]) comment=[\"Samat\",\"huwiki\",\"https://hu.wikipedia.org/w/index.php?title=Szerkeszt%C5%91:Samat/MassMessages\\u0026oldid=23070985\"] userId=949 originWiki=huwiki isSourceTranslationPage= translationPageSourceLanguage= pageMessageTitle= title=Szerkesztővita:Tgr rootJobSignature=8d2e0b34ef7fe5249c9182c2efb0063261ae50bd rootJobTimestamp=20201027195225 requestId=34499b04-8b9a-4cd1-95dd-9229906705c7","job_type":"MassMessageJob","job_status":true,"job_duration":4.2106781005859375}

looking further

Ok, a bit more:

16:06 ppchelko@deploy1001: helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' .

which corresponds with the time of the second execution.

So, we have found it: the same exact job has been executed twice. I have deployed change-prop for jobqueue right before the second execution. The deduplicator didn't work out in this case since we expire the keys in Redis after 1 hour for by-id and by-sha1 deduplicators.

This suggests a bug in commit logic. Investigation to be continued.

So, we have found it: the same exact job has been executed twice. I have deployed change-prop for jobqueue right before the second execution. The deduplicator didn't work out in this case since we expire the keys in Redis after 1 hour for by-id and by-sha1 deduplicators.

This suggests a bug in commit logic. Investigation to be continued.

Thanks for your work, please let us know how we can assist.

One more new example of a duplication.
first send (22:10) and second send (22:11).
I've checked ~50 other random entries from the distribution list I used, and that was the only duplication I could see.

One more new example of a duplication.
first send (22:10) and second send (22:11).
I've checked ~50 other random entries from the distribution list I used, and that was the only duplication I could see.

@Pchelolo ping in case this new instance has elements that may be helpful to you

@Pchelolo new instance today, sadly. GUC tells me that around 8 messages were delivered again 1h after the first delivery.

@Snaevar Thanks for the report. Unfortunately, (per the comment above in T93049#6512485) the detailed-version of the server-side logs for this system are only retained for a week, so the devs cannot investigate this example.
Note: The logs say this was only sent once (00:22 UTC), so this instance was not user-error.

@Snaevar Thanks for the report. Unfortunately, (per the comment above in T93049#6512485) the detailed-version of the server-side logs for this system are only retained for a week, so the devs cannot investigate this example.

Sure, just ping one or more of the users that have given examples on this behaviour once someone starts working on it.

Hello, I believe I have encountered this bug here: https://meta.wikimedia.org/w/index.php?title=Special:Log&logid=47116463

The message was sent again some 6 hours later: https://lij.wikipedia.org/wiki/Special:Diff/228050

Pages contacted: https://meta.wikimedia.org/w/index.php?title=Movement_Strategy_and_Governance/Delivery/it&oldid=22502575

Page sent as message: https://meta.wikimedia.org/wiki/Leadership_Development_Working_Group/Participate/Announcement

Method: Allowed MM to auto-select Italian by sending the page directly using both the new subject and the content section drop-downs

@Ottomata Ping in case this fresh example helps.
It's unclear from the last engineer comment above (Petr's at T93049#6597488) whether additional user-examples are needed for the investigation.
It's also unclear who might be able to help as Petr and Clara have left WMF since their last comments.

This happened with the Signpost today (details here). It seems that a good number of people were given double issues from the global MassMessage list on Meta. This is not user error, as the message was only requested once. For example, two successive edits to the same user talk page both cite the exact same revid (https://meta.wikimedia.org/w/index.php?title=Global_message_delivery/Targets/Signpost&oldid=24726786). I have repaired the duplicates via manual reverts, of which there were approximately 250. Most of these were from the local distribution list, but about a dozen were from the global distribution list; curiously, the doubled messages seem to be a couple hours apart.

It is rather concerning that this bug seems to have been known (and not addressed) since 2015.

jpxg raised the priority of this task from High to Needs Triage.Mar 20 2023, 7:58 PM

Ideally this would be fixed in the job runner, but given that's not happening and this extension isn't really maintained, a relatively easy to implement workarounds would be to generate a stable ID for the message (this is some kind of job double-execution trouble so it could just be a random number passed to the job), associate with MassMessage edits, and use that to detect and abort double-posts. We could use comment_data, log_params (currently the local MassMessage log only records failures, so it would become much more verbose - not sure if that's a problem), ct_params. None of those are searchable, but we'd only need to check the last N edits to the same page with a MassMessage change tag and look up the data for those. ct_params seems like the cleanest of those.

Yeah, I think #2 from @Legoktm's earlier comment is probably the easiest short term fix in MM. The long term solution is probably T290054 but I don't have bandwidth to work on it in the near future, and it doesn't look like anyone else is actively working on it.

I can try to take a look at the HTML comment idea if no one else gets to it first; alternatively if someone wants to take a crack at some version of this, I'm happy to help with code review.

I took a look at this and it's a bit more involved than I was hoping:

  1. For the HTML comment idea, it's not as simple as a string match since the comment should presumably be translated (as massmessage-hidden-comment is)
  2. For adding some kind of stable ID (I was just trying md5( $subject . $message )) to change tags and matching against it, this would probably work for the local wiki, but I don't think we can get this information from the revisions API when cross-posting to a different wiki? Looks like tags just returns each tag as a string. (Please let me know if this is actually incorrect.)

Edit: @Legoktm pointed out for the second solution we may be able to use database queries for remote wikis as well, so that might work.

Change 914347 had a related patch set uploaded (by Wctaiwan; author: Wctaiwan):

[mediawiki/extensions/MassMessage@master] Prevent duplicate deliveries

https://gerrit.wikimedia.org/r/914347

I hadn't noticed but this happened to me as well on April 19. I'm about to target 100+ wikis again, crossing fingers :(

Change 914347 merged by jenkins-bot:

[mediawiki/extensions/MassMessage@master] Prevent duplicate deliveries

https://gerrit.wikimedia.org/r/914347