
MassMessage failed delivery claiming "readonly" although the page is not protected
Closed, Resolved / Public / Bug Report

Description

MassMessage regularly fails with a readonly error code on some of the target wikis, for no obvious reason. For large target sets there can be hundreds of failures.

The failure is only logged on the local wiki and not exposed to the sender, making these errors hard to notice and cumbersome to track down (T139380#4460313 has a workaround for the latter).


(Merged duplicate description:)
The last few days, we've had reports of MassMessage not delivering messages to everyone on the target list.

For T213864, two hours after a message was sent out (14:47 UTC, 16 January 2019) to https://meta.wikimedia.org/w/index.php?oldid=18788945 a manual check shows that some of the targets have received the message, but multiple wikis have not. The queue is said to be empty.

At https://meta.wikimedia.org/wiki/Talk:Tech/News/2019/03 @IKhitron has reported messages not being delivered (seen on he.wp, where 1 user out of 11 received the issue).


See the comments below for many examples and logs.

Event Timeline

readonly means the database is locked; it has nothing to do with page protection (that error would be something like "protectedpage"). MassMessage should probably back off and then retry, like we do for edit conflicts.
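To make the back-off-and-retry idea concrete, here is a minimal sketch in Python rather than the extension's own PHP. It is not MassMessage's actual code: `ReadOnlyError`, `deliver_with_backoff`, and the delay values are all hypothetical names and numbers chosen for illustration, and the real job queue may well schedule retries differently.

```python
import time
from typing import Callable


class ReadOnlyError(Exception):
    """Hypothetical stand-in for an edit rejected with error code 'readonly'."""


def deliver_with_backoff(
    attempt_edit: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call attempt_edit until it succeeds, doubling the wait after each
    readonly failure. Returns the number of attempts used; re-raises the
    last ReadOnlyError once max_attempts is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            attempt_edit()
            return attempt
        except ReadOnlyError:
            if attempt == max_attempts:
                raise  # give up and let the caller log the failure
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Only the readonly case is retried here; any other error (for example a protected page) propagates immediately, since waiting would not help.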

This is made extra annoying by the inability to see whether message delivery was successful (short of looking through the massmessage logs on hundreds of wikis).

Here's a script to at least query which wikis had errors, given the UTC date and edit summary of the delivery (needs jq and GNU Parallel to be installed):

# Step 1: list the URLs of all open, public wikis from the sitematrix.
# Step 2: query each wiki's massmessage log for LOGS_DAY and print the entries whose subject matches LOGS_SUBJECT.
export LOGS_DAY='2018-07-12'
export LOGS_SUBJECT="Consultation on the creation of a separate user group for editing sitewide CSS/JS"
curl -s 'https://meta.wikimedia.org/w/api.php?action=sitematrix&format=json&smtype=special%7Clanguage&smstate=all&smlangprop=site&smsiteprop=url&smlimit=max' |
  jq --raw-output '(.sitematrix[]["site"]?[] | select(has("closed") or has("private") or has("fishbowl") or has("nonglobal") | not).url), (.sitematrix.specials[].url)' |
  parallel --no-notice -P5 -I @ "curl -s '@/w/api.php?action=query&format=json&list=logevents&leprop=ids%7Ctitle%7Cdetails%7Ctimestamp%7Ctype&letype=massmessage&lestart=${LOGS_DAY}T00%3A00%3A00.000Z&leend=${LOGS_DAY}T23%3A59%3A59.000Z&ledir=newer' | jq --raw-output '.query.logevents[]? | select(.params.subject == \"${LOGS_SUBJECT}\") | \"@/w/index.php?title=Special:Log&type=massmessage&offset=\(.timestamp | fromdateiso8601 | . - 1 | strftime(\"%Y%m%d%H%M%S\") )&dir=prev&limit=1 \(.logid) \(.params.reason // .action // \"\")\"'"
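For anyone who would rather post-process the API responses than read the raw output, the per-wiki filtering step of the script above can be sketched in Python. This assumes the `logevents` entries have the shape the jq filter relies on (`params.subject`, `params.reason`, `action`); the function name `summarize_outcomes` is made up for this example, and fetching the JSON is left out.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def summarize_outcomes(
    logevents: Iterable[Mapping[str, Any]], subject: str
) -> Counter:
    """Count massmessage log entries for one delivery, keyed by the error
    code in params.reason when present, otherwise by the log action.

    Mirrors the jq expression `.params.reason // .action // ""` applied to
    entries whose params.subject equals the given subject.
    """
    outcomes: Counter = Counter()
    for event in logevents:
        params = event.get("params") or {}
        if params.get("subject") != subject:
            continue  # entry belongs to a different mass message
        outcomes[params.get("reason") or event.get("action") or ""] += 1
    return outcomes
```

Feeding it the `query.logevents` list from each wiki and summing the counters gives a per-delivery tally of how many targets hit readonly versus other outcomes.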

I got a bunch of readonly errors (~200, apparently) around 2018-07-12 08:45, and there is nothing relevant in the SAL, so I suspect these errors are anomalous rather than the result of a real database lock.

Nearly 300 errors this time, again nothing in SAL. These were my only two attempts to send mass messages, so I'm pretty sure something is broken there.

Crossposting a theory here:

I wonder if there's something in the code that CommRel people are using?

The only other message that failed there was again from the team,
12:03, 4 October 2018 Delivery of "Reminder: No editing for up to an hour on 10 October" to Wikibooks:Reading room/General failed with an error code of readonly.

The page is getting other MMs: https://en.wikibooks.org/w/index.php?title=Wikibooks:Reading_room/General&action=history .

(This is also true for https://en.wikiversity.org/w/index.php?title=Special:Log&page=Wikiversity%3AColloquium , FWIW.)

When this bug occurs it is severely impacting the functionality of this tool and causing a lot of extra work for MassMessage senders. Triaging to 'high'.

Nikerabbit changed the subtype of this task from "Task" to "Bug Report".
Nikerabbit claimed this task.
Nikerabbit added a subscriber: Nikerabbit.

2019-06-12T12:50:38 is the last time I see a readonly error on enwiki.