
Massmessages not going through, log looks fine
Closed, Resolved (Public)

Description

I have been sending out massmessages to lists of users since September 6 (this is the first wave of the core distribution of the annual Community Insights survey), and while the massmessage log indicates that they have been successfully sent, there are a few messages that are not appearing at all on the targeted pages. There seems to be no pattern to which messages work and which don't, and no difference in the logs.

I'm wondering (1) should I try again, or do I risk the same message ultimately ending up on talk pages twice? and (2) what is going wrong, and is there anything I can do to fix it?

Here is what I see in the logs for the messages that are not showing up:
15:04, 6 September 2019 RMaung (WMF) talk contribs sent a message to CI2019List(commons,act5) (Community Insights Survey)
15:01, 6 September 2019 RMaung (WMF) talk contribs sent a message to CI2019List(commons,act4) (Community Insights Survey)
14:53, 6 September 2019 RMaung (WMF) talk contribs sent a message to CI2019List(commons,act3) (Community Insights Survey)
14:51, 6 September 2019 RMaung (WMF) talk contribs sent a message to CI2019List(commons,act2) (Community Insights Survey)
15:56, 6 September 2019 RMaung (WMF) talk contribs sent a message to CI2019List(enwiki,act3) (Community Insights Survey)
15:37, 6 September 2019 RMaung (WMF) talk contribs sent a message to CI2019List(enwiki,act2) (Community Insights Survey)
16:02, 6 September 2019 RMaung (WMF) talk contribs sent a message to CI2019List(enwiki,act5) (Community Insights Survey)
16:09, 6 September 2019 RMaung (WMF) talk contribs sent a message to CI2019List(eswiki,act3) (Community Insights Survey)
16:06, 6 September 2019 RMaung (WMF) talk contribs sent a message to CI2019List(eswiki,act2) (Community Insights Survey)
16:20, 6 September 2019 RMaung (WMF) talk contribs sent a message to CI2019List(eswiki,act5) (Community Insights Survey)
15:37, 9 September 2019 RMaung (WMF) talk contribs sent a message to CI2019List(ptwiki,act3) (Community Insights Survey) Tag: PHP7
14:36, 9 September 2019 RMaung (WMF) talk contribs sent a message to CI2019List(ptwiki,act2) (Community Insights Survey) Tag: PHP7
15:39, 9 September 2019 RMaung (WMF) talk contribs sent a message to CI2019List(ptwiki,act5) (Community Insights Survey) Tag: PHP7
16:27, 9 September 2019 RMaung (WMF) talk contribs sent a message to CI2019List(wikidata,act5) (Community Insights Survey) Tag: PHP7

Event Timeline

Note that MassMessage errors are logged on the target wiki, not in the central log. See T139380: MassMessage failed delivery claiming "readonly" although the page is not protected, which was the main cause of this kind of error in the past.

I believe it's a new issue related to the switch from the EventBus service to EventGate. See https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.09.09/mediawiki?id=AW0W2jMnaTyU0YVHzPfd&_g=h@1251ff0 in particular: it boils down to a PayloadTooLarge error from the service.

cc @Ottomata

Is there a way to know whether those messages will ever show up, or should I attempt to send again?

I don't think they will show up, and resending won't help at this point either. We need to fix the underlying issue first.

Change 535286 had a related patch set uploaded (by Ottomata; owner: Ottomata):
[operations/deployment-charts@master] Increase default EventGate max_body_size to 10mb

https://gerrit.wikimedia.org/r/535286

OK, we've found the root cause. In the job queue system, the maximum size of a serialized job is 4 MB, so the maximum body size of the request MW makes to post a job was also set to roughly 4 MB. However, MassMessage batches the jobs destined for a single wiki together with no limit, and in this particular case the batch contains ~240 individual messages for Wikidata.

So, a quick fix for submitting this particular mass message would be to split it into 2 batches, if possible.
Next, we probably want to increase the maximum allowed request size, but this is under discussion.
However, the proper solution would be to split batches that are too large into several smaller pieces. That would take quite a bit more time to implement.

It looks like this would have been a problem before the migration to EventGate too.

Our message.max.bytes is 4 MB, but we allow POSTing batches of event messages. We had also restricted the POST body size to 4 MB, but this restriction should really apply to individual events, not to the total batch size in the POST body.

We'll increase the max POST body size to 10 MB tomorrow, but we should probably also have a way to automatically split up batches that are larger than our max POST size.
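
To make the distinction between the two limits concrete, here is a small Python sketch that checks them separately. The names `check_post`, `MAX_EVENT_BYTES` and `MAX_BODY_BYTES` are hypothetical, and treating "4mb" and "10mb" as multiples of 1024 * 1024 bytes is an assumption; the real service may count differently.

```python
import json
from typing import Any

# Assumed values mirroring the figures in this thread, not read from any real config.
MAX_EVENT_BYTES = 4 * 1024 * 1024    # per-event limit (message.max.bytes)
MAX_BODY_BYTES = 10 * 1024 * 1024    # whole POST body limit (max_body_size)


def encoded_size(obj: Any) -> int:
    """Byte length of obj serialized as JSON."""
    return len(json.dumps(obj).encode("utf-8"))


def check_post(
    events: list[dict[str, Any]],
    max_event_bytes: int = MAX_EVENT_BYTES,
    max_body_bytes: int = MAX_BODY_BYTES,
) -> list[str]:
    """Return a list of problems; an empty list means the batch is acceptable."""
    problems = []
    for i, event in enumerate(events):
        if encoded_size(event) > max_event_bytes:
            problems.append(f"event {i} exceeds the per-event limit")
    if encoded_size(events) > max_body_bytes:
        problems.append("POST body exceeds the body limit")
    return problems
```

With both limits set to the same value, a batch of many small events trips only the body check, which matches the failure described here: every event is individually fine, yet the request is rejected.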

jbond triaged this task as Medium priority. Sep 10 2019, 10:09 AM

Change 535286 merged by Ottomata:
[operations/deployment-charts@master] Increase default EventGate max_body_size to 10mb

https://gerrit.wikimedia.org/r/535286

Mentioned in SAL (#wikimedia-operations) [2019-09-10T14:18:15Z] <ottomata> increasing max_body_size to 10mb for all eventgate services - T232362

Ok, max_body_size increased to 10mb.

Awesome, thank you all! I'm resending the failed messages and haven't had an issue yet today.

Pchelolo claimed this task.

I'm resolving this ticket. Filed T232392 for a followup.