Page MenuHomePhabricator

Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" trying to undelete a file
Closed, DeclinedPublic

Description

While trying to undelete a file on Commons (File:Soldier Sikh LI INF of the Indian Army practices aim during MALABAR 2006.jpg), I got this error:

To avoid creating high replication lag, this transaction was aborted because the write duration (3.8845181560516) exceeded the 3 second limit. If you are changing many items at once, try doing multiple smaller operations instead.

[WrIzDQpAICoAABGltIQAAACW] 2018-03-21 10:25:27: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError"

A similar message is mentioned on T63122, and T150702 seems to be related.

Event Timeline

jcrespo subscribed.

Adding performance for triaging purposes. It could be later sent to multimedia or someone else more familiar with that code.

Aklapper renamed this task from Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" to Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" trying to delete a file.Mar 21 2018, 12:18 PM
Aklapper edited projects, added SRE-swift-storage, WMF-General-or-Unknown; removed Commons.
Yann renamed this task from Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" trying to delete a file to Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" trying to undelete a file.Mar 21 2018, 12:39 PM

Definitely not a storage problem (swift), if you want to guess the triage and not wait for performance to do it, this is a multimedia or editing team issue (probably undeleting needs batching or smaller transactions).

Krinkle triaged this task as Unbreak Now! priority.EditedMar 21 2018, 9:58 PM
Krinkle subscribed.

Details from Logstash (produced 3 entries in mediawiki-errors for the same web request)

Logstash
timestamp: 2018-03-21T10:25:27
method: POST
type:mediawiki channel:exception
url: /w/index.php?title=Special:Undelete&action=submit

Wikimedia\Rdbms\DBTransactionSizeError
Transaction spent 3.8845181560516 second(s) in writes, exceeding the limit of 3.
[...]
#9 /srv/mediawiki/php-1.31.0-wmf.25/includes/MediaWiki.php(602): Wikimedia\Rdbms\LBFactory->commitMasterChanges(string, array)
#10 /srv/mediawiki/php-1.31.0-wmf.25/includes/MediaWiki.php(575): MediaWiki::preOutputCommit(RequestContext, Closure$MediaWiki::main;328)
#11 /srv/mediawiki/php-1.31.0-wmf.25/includes/MediaWiki.php(877): MediaWiki->doPreOutputCommit(Closure$MediaWiki::main;328)
#12 /srv/mediawiki/php-1.31.0-wmf.25/includes/MediaWiki.php(524): MediaWiki->main()
#13 /srv/mediawiki/php-1.31.0-wmf.25/index.php(42): MediaWiki->run()
#14 /srv/mediawiki/w/index.php(3): include(string)

It seems user Yann on Commons was ultimately able to succeed the action when trying again, 9 minutes later (per Specal:Log).

More from Logstash:

  • before March 20, DBTransactionSizeError exceptions happened 5-15x per 3 hours.
  • a spike started around 20 March 13:00, peaked around 18:00 at 500x per 3 hours.
  • came back down, stabalised at 20-30x per 3 hours since 21 March 03:00 (UTC).

Screen Shot 2018-03-21 at 14.57.46.png (478×1 px, 91 KB)

Screen Shot 2018-03-21 at 15.10.27.png (522×2 px, 112 KB)

So apart from the spike, it seems consistently 2-3x more common now than 2 days ago.

Tagging as UBN regression against wmf.26 until we know more.

Krinkle lowered the priority of this task from Unbreak Now! to High.Mar 26 2018, 8:23 PM
Imarlier subscribed.

@Krinkle Check if this is still happening, followup if so.

Still seeing ~ 200 hits per day in Logstash for (DBTransactionSizeError) just within wiki:commonswiki AND method:POST.

Still seeing ~ 200 hits per day in Logstash for (DBTransactionSizeError) just within wiki:commonswiki AND method:POST.

I see about 3/hour or 24/day via << +wiki:commonswiki +http_method:POST +"DBTransactionSizeError" >>.

Change 443452 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] rdbms: give TransactionProfiler write time and row count for ROLLBACK

https://gerrit.wikimedia.org/r/443452

Change 443472 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[mediawiki/core@master] Change LOCK IN SHARE MODE to FOR UPDATE for local file restoration

https://gerrit.wikimedia.org/r/443472

aaron removed aaron as the assignee of this task.Jul 16 2018, 8:37 PM
aaron subscribed.

Change 443472 abandoned by Aaron Schulz:
Change LOCK IN SHARE MODE to FOR UPDATE for local file restoration

https://gerrit.wikimedia.org/r/443472

@tstarling I tagged CPT as it seems Aaron's patch wasn't right according to @Anomie. Could use a fresh perspective on this.

On second thought, while I see plenty of DBTransactionSizeError AND wiki:commonswiki in Logstash, they're with urls /w/api.php, Special:Wachlist or Special:MovePage. I don't see any from Special:Undelete in the last 30 days, which is what this task was about. I'm closing this specific task as such.

  • Separate task for watchlist issue: T171898.
  • Separate task for general undelete transaction size problem: T198176.

Change 443452 merged by jenkins-bot:
[mediawiki/core@master] rdbms: give TransactionProfiler write time and row count for ROLLBACK

https://gerrit.wikimedia.org/r/443452