Page MenuHomePhabricator

After operation aborted, on-wiki log is missing, but it was reported to irc.wikimedia
Closed, ResolvedPublic

Description

[15:56]	rc-pmtpa	[#meta.wikimedia] [[Special:Log/notifytranslators]]  sent  * MarcoAurelio * 
MarcoAurelio sent a notification about translating page [[Right to vanish]]; languages: ar, ca, de, en-gb, es, fr, ja, pt, ru;
deadline: none; priority: (unset); sent to 987 recipients, failed for 0 recipients, skipped for 0 recipients

No such on-wiki log: https://meta.wikimedia.org/w/index.php?title=Special%3ALog&type=notifytranslators&user=MarcoAurelio

Error message received:

Database error
From Meta, a Wikimedia project coordination wiki
To avoid creating high replication lag, this transaction was aborted because the write duration (10.131317615509) exceeded the 5 second limit.
If you are changing many items at once, try doing multiple smaller operations instead.

Event Timeline

This looks like a race condition on the rcstream- if an operation fails, it should not be logged and everything should be reverted. The error is that it should not send anything to the stream until it is confirmed on the database, right?

So 2 issues here:

  • Why the transaction fails (why is a transaction open for so long, can that be avoided?)
  • Why it logs even if there is an error (can it wait to confirm commit successful by the db?)

The first could be infrastructure or code, the second is probably code logic

Sorry for the tag-spam, fell free to revert those. In particular some, such as Performance-Team are only very indirectly related but maybe interesting for them (db transaction limit).

This happened when tried to send a notification to translators via https://meta.wikimedia.org/wiki/Special:NotifyTranslators - I think that maybe the system tries to message all subscribed translators at once, instead of doing so in batches? I really don't know. However I guess that sending 900+ talk page messages (on meta and elsewhere) and emails (because the extension also emails those who wish to be updated by email too), causes some strain to DB, right?

I also see some sendEmail / TranslationNotificationJob jobs failing during that time. No clear error message because message_normalized is truncated (could this be fixed please?).

The backtrace for this event is not very useful: https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2016.08.03/mediawiki?id=AVZQsUDP5t-gr1WUc7EK&_g=(refreshInterval:(display:Off,pause:!f,value:0),time:(from:'2016-08-03T13:52:41.821Z',mode:absolute,to:'2016-08-03T14:01:08.097Z'))

Only says it happens in doPreOutputCommit which I guess meaning that it is executing some deferred tasks? Perhaps actually executing the preference changes:

				$user->setOption( $timestampOptionName, $currentDBTime );
				$user->saveSettings();

I don't know the log entry does not get inserted though.

The actual delivery is postponed to job queue, so I find it unlikely to be the issue here.

The actual delivery is postponed to job queue, so I find it unlikely to be the issue here.

Okay, thank you. Then I have no idea. All I can say is that when I hit the notify translators button, the page loaded over and over again until the system gave me the error message I posted above. Shortly after, I started seeing TranslationNotificationBot doing the Meta-Wiki notifications and the log message sent to the IRC, however when I went chencking the on-wiki log, no log was there since June. Thanks.

The actual delivery is postponed to job queue, so I find it unlikely to be the issue here.

Even if the actual delivery is postponed to the job queue, the preferences are updated separately for each user so the transaction takes very long and transactions that take a long time are automatically aborted by MediaWiki nowadays. But that doesn't explain why the jobs got run and the IRC feed entry.. shouldn't they also have errored out?

Note: The actual issue for the long duration of the submission request is tracked at T63122.

OK, after looking at the code, I think this is what happened. MediaWiki starts the transaction with the first query. After that, the queries from User::saveSettings are issued but as it is not committed at this point, it doesn't block further code from being executed. Then the job queue insert code is executed where the jobs are sent to the Redis queue. After this, the log entry insert query is also issue and it is also sent to the IRC feed. But these queries are actually committed during preOutputCommit() where MediaWiki detects that the transaction is taking longer than the set time and so throws an exception aborting the DB transaction. However, since the jobs has already been sent to Redis and the RC entry sent to the IRC feed, they work okay as if the DB transaction wasn't aborted. I don't know for sure whether this is what actually happened since I am not very familiar with what happens during the lifetime of an MW request and also have made some assumptions about how they get executed but this seems likely.
So for atomicity, the expected behavior would be: 1) jobs should be removed from the queue if an exception was thrown after they have been sent or not sent at all if we know that something went wrong 2) the RC entry shouldn't also be sent to the IRC feed before everything is completed fine. And fix T63122: Timeout when sending translation notification (yet again) of course.

Change 304627 had a related patch set uploaded (by Aaron Schulz):
Make various MWCallableUpdate usages rollback-safe

https://gerrit.wikimedia.org/r/304627

Change 304627 merged by jenkins-bot:
Make various MWCallableUpdate usages rollback-safe

https://gerrit.wikimedia.org/r/304627

aaron claimed this task.