Page MenuHomePhabricator

Database timeout error + significant lag when modifying a Partial Block with 10 items
Closed, ResolvedPublic

Description

This occured on Test Wikipedia — https://test.wikipedia.org/wiki/Special:Block/Nov-1-2018-test — on November 1, 2018. I didn't test for blocks of 9, 8, 7, or 6 pages but I imagine they are also time consuming.

This video shows 4 work flows: https://drive.google.com/open?id=1Msu6Hvgc3X7yEg-DkuXMItIlGbBnwGry


Use case 1: Setting a large partial block
0:00 – Start setting partial block with 10 pages for user with no current block
0:18 — Click “block this user”
0:19 — “Block succeeded” displays
OUTCOME: ✅ Success, acceptable user experience.
PROCESSING TIME: 1 second


Use case 2: Modifying a large partial block, adding and removing
00:26 — begin modifying block to remove one page from 10 and add an alternate
00:35 — click “re-block the user with these settings”
1:29 — “Database error” page displays
OUTCOME: ❌ Failure, unacceptable user experience
PROCESSING TIME: 54 seconds

[W9s60gpAIC8AABuS6FQAAAAM] /wiki/Special:Block/Nov-1-2018-test Wikimedia\Rdbms\DBTransactionSizeError from line 1354 of /srv/mediawiki/php-1.33.0-wmf.2/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Transaction spent 8.6079864501953 second(s) in writes, exceeding the limit of 3.

Backtrace:

#0 /srv/mediawiki/php-1.33.0-wmf.2/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1776): Closure$Wikimedia\Rdbms\LoadBalancer::approveMasterChanges(Wikimedia\Rdbms\DatabaseMysqli)
#1 /srv/mediawiki/php-1.33.0-wmf.2/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1364): Wikimedia\Rdbms\LoadBalancer->forEachOpenMasterConnection(Closure$Wikimedia\Rdbms\LoadBalancer::approveMasterChanges;1858)
#2 /srv/mediawiki/php-1.33.0-wmf.2/includes/libs/rdbms/lbfactory/LBFactory.php(219): Wikimedia\Rdbms\LoadBalancer->approveMasterChanges(array)
#3 /srv/mediawiki/php-1.33.0-wmf.2/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(413): Closure$Wikimedia\Rdbms\LBFactory::forEachLBCallMethod(Wikimedia\Rdbms\LoadBalancer, string, array)
#4 /srv/mediawiki/php-1.33.0-wmf.2/includes/libs/rdbms/lbfactory/LBFactory.php(222): Wikimedia\Rdbms\LBFactoryMulti->forEachLB(Closure$Wikimedia\Rdbms\LBFactory::forEachLBCallMethod;1660, array)
#5 /srv/mediawiki/php-1.33.0-wmf.2/includes/libs/rdbms/lbfactory/LBFactory.php(270): Wikimedia\Rdbms\LBFactory->forEachLBCallMethod(string, array)
#6 /srv/mediawiki/php-1.33.0-wmf.2/includes/MediaWiki.php(595): Wikimedia\Rdbms\LBFactory->commitMasterChanges(string, array)
#7 /srv/mediawiki/php-1.33.0-wmf.2/includes/MediaWiki.php(568): MediaWiki::preOutputCommit(RequestContext, Closure$MediaWiki::main;1825)
#8 /srv/mediawiki/php-1.33.0-wmf.2/includes/MediaWiki.php(878): MediaWiki->doPreOutputCommit(Closure$MediaWiki::main;1825)
#9 /srv/mediawiki/php-1.33.0-wmf.2/includes/MediaWiki.php(517): MediaWiki->main()
#10 /srv/mediawiki/php-1.33.0-wmf.2/index.php(42): MediaWiki->run()
#11 /srv/mediawiki/w/index.php(3): include(string)
#12 {main}

Use case 3: Modifying a large partial block, removing
1:40 — begin modifying block to remove one page from 10
1:45 — click “re-block the user with these settings”
2:38 — “block succeeded” displays
OUTCOME: ❌ Success, but unacceptable user experience
PROCESSING TIME: 53 seconds


Use case 3: Modifying a large partial block, adding
2:48 — begin modifying block to add one page to 9
2:54 — click “re-block the user with these settings”
2:59 — “block succeeded” displays
OUTCOME: ✅ Success, acceptable user experience.
PROCESSING TIME: 5 seconds

Event Timeline

TBolliger created this task.Nov 1 2018, 5:38 PM
Restricted Application added subscribers: MGChecker, Aklapper. · View Herald TranscriptNov 1 2018, 5:38 PM
TBolliger updated the task description. (Show Details)Nov 1 2018, 5:40 PM
TBolliger updated the task description. (Show Details)
TBolliger updated the task description. (Show Details)Nov 1 2018, 5:43 PM
dbarratt triaged this task as High priority.Nov 1 2018, 5:45 PM
dbarratt added a subscriber: dbarratt.

Setting priority for visibility.

Since the common failure is removing it seems that this has something to do with a DELETE query.

The problem is, is that we are missing an index on ipblocks_restrictions.ir_ipb_id I'm not sure how to resolve this in production.

Or at least that's what we think the issue is.

We're using ir_ipb_id for row locking (SELECT ... FOR UPDATE) and there isn't an index on that column, also we are using that column for several other operations (i.e. get restrictions by block id).

Question for DBA: Does this make sense? How could we confirm that this is the problem? If it is the problem, how should we fix this in production? Currently the table is empty on all wikis except for testwiki (and it is fine to truncate).

Catrope added a subscriber: Catrope.Nov 1 2018, 7:12 PM

also we are using that column for several other operations (i.e. get restrictions by block id).

That in itself is reason enough to have an index on ir_ipb_id. Sorry for missing this during code review.

Wait, no, you DO have an index on ir_ipb_id: it's the first field in the primary key.

MaxSem added a subscriber: MaxSem.Nov 1 2018, 7:13 PM
MariaDB [enwiki]> show create table ipblocks_restrictions\G
*************************** 1. row ***************************
       Table: ipblocks_restrictions
Create Table: CREATE TABLE `ipblocks_restrictions` (
  `ir_ipb_id` int(11) NOT NULL,
  `ir_type` tinyint(4) NOT NULL,
  `ir_value` int(11) NOT NULL,
  PRIMARY KEY (`ir_ipb_id`,`ir_type`,`ir_value`),
  KEY `ir_type_value` (`ir_type`,`ir_value`)
) ENGINE=InnoDB DEFAULT CHARSET=binary
1 row in set (0.00 sec)

Wait, no, you DO have an index on ir_ipb_id: it's the first field in the primary key.

I believe since it's a compound key, that it is also a compound index.

Wait, no, you DO have an index on ir_ipb_id: it's the first field in the primary key.

I believe since it's a compound key, that it is also a compound index.

Yes it is. That still works for getting rows by ir_ipb_id because it's the first field. I don't know how compound indexes interact with row locking though.

Change 471082 had a related patch set uploaded (by Dbarratt; owner: Dbarratt):
[mediawiki/core@master] BlockRestriction::update() unnecessarily does a SELECT on the page table.

https://gerrit.wikimedia.org/r/471082

Yes it is. That still works for getting rows by ir_ipb_id because it's the first field. I don't know how compound indexes interact with row locking though.

Ah! I see. Another possibility (that @MaxSem found) is that it is because of some copy-pasta of having the page in the SELECT
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/471082

Change 471082 merged by jenkins-bot:
[mediawiki/core@master] BlockRestriction::update() unnecessarily does a SELECT on the page table.

https://gerrit.wikimedia.org/r/471082

Change 471094 had a related patch set uploaded (by MaxSem; owner: Dbarratt):
[mediawiki/core@wmf/1.33.0-wmf.2] BlockRestriction::update() unnecessarily does a SELECT on the page table.

https://gerrit.wikimedia.org/r/471094

Change 471094 merged by jenkins-bot:
[mediawiki/core@wmf/1.33.0-wmf.2] BlockRestriction::update() unnecessarily does a SELECT on the page table.

https://gerrit.wikimedia.org/r/471094

dbarratt closed this task as Resolved.Nov 5 2018, 8:01 PM
dbarratt claimed this task.
dbarratt moved this task from Ready to Done on the Anti-Harassment (AHT Sprint 32) board.
Restricted Application added a project: User-Ryasmeen. · View Herald TranscriptNov 5 2018, 8:01 PM