Page MenuHomePhabricator

Creating an election with "all wikis" can give a DBTransactionSizeError
Closed, ResolvedPublic

Description

What's the problem?

SecurePoll is running into a "DBTransactionSizeError" when trying to create an election.

@Reedy @Zabe @Majavah Trying to create an election with "All wiki" selected on votewiki and testwiki, get some sort of timeout:

[19e3148a-a7d9-4c40-9aa2-bf5e49d774e1] /wiki/Special:SecurePoll/create Wikimedia\Rdbms\DBTransactionSizeError: Transaction spent 4.2640490531921 second(s) in writes, exceeding the limit of 3

Backtrace:

from /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1748)
#0 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/loadbalancer/LoadBalancer.php(2265): Wikimedia\Rdbms\LoadBalancer::Wikimedia\Rdbms\{closure}(Wikimedia\Rdbms\DatabaseMysqli)
#1 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1762): Wikimedia\Rdbms\LoadBalancer->forEachOpenMasterConnection(Closure)
#2 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/lbfactory/LBFactory.php(249): Wikimedia\Rdbms\LoadBalancer->approveMasterChanges(array, string, integer)
#3 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(236): Wikimedia\Rdbms\LBFactory::Wikimedia\Rdbms\{closure}(Wikimedia\Rdbms\LoadBalancer, string, array)
#4 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/lbfactory/LBFactory.php(251): Wikimedia\Rdbms\LBFactoryMulti->forEachLB(Closure, array)
#5 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/lbfactory/LBFactory.php(310): Wikimedia\Rdbms\LBFactory->forEachLBCallMethod(string, array)
#6 /srv/mediawiki/php-1.37.0-wmf.16/includes/MediaWiki.php(684): Wikimedia\Rdbms\LBFactory->commitMasterChanges(string, array)
#7 /srv/mediawiki/php-1.37.0-wmf.16/includes/MediaWiki.php(654): MediaWiki::preOutputCommit(RequestContext, Closure)
#8 /srv/mediawiki/php-1.37.0-wmf.16/includes/MediaWiki.php(948): MediaWiki->doPreOutputCommit(Closure)
#9 /srv/mediawiki/php-1.37.0-wmf.16/includes/MediaWiki.php(559): MediaWiki->main()
#10 /srv/mediawiki/php-1.37.0-wmf.16/index.php(53): MediaWiki->run()
#11 /srv/mediawiki/php-1.37.0-wmf.16/index.php(46): wfIndexMain()
#12 /srv/mediawiki/w/index.php(3): require(string)
#13 {main}

I had the same error, but on a second try I was able to create https://test.wikipedia.org/wiki/SecurePoll:828.

Oh. I tried 4 times in a row (on votewiki) but was not able to :(

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Reedy renamed this task from Creating an election with "all wikis" gives a DBTransactionSizeError to Creating an election with "all wikis" can give a DBTransactionSizeError.Aug 2 2021, 12:37 PM
Reedy triaged this task as High priority.
Reedy updated the task description. (Show Details)
Niharika raised the priority of this task from High to Unbreak Now!.Aug 2 2021, 4:49 PM
Niharika subscribed.

We need to fix this ASAP because the Board election starts in a day.

A quick fix would be to bump $wgMaxUserDBWriteDuration for votewiki. I don't see any examples of this being done elsewhere though (and likely for good reason).

There's not much we can do here as DBA unfortunately as this is more MW - not sure if that specific transaction can be optimized (maybe split it?) to make sure it takes less than 3 seconds (which is, in my opinion, a very generous limit for a database write transaction).

@aaron: @Reedy mentioned that you might be able to help out with this.

(reposting my IRC comment so it doesn't get lost)

There is the immediate problem of the current use case and the long term problem of reasonable (and standardized) configuration. If we choose to solve the former with a workaround then let's plan for rolling that back afterwards and looking for the proper solution in parallel.

phuedx lowered the priority of this task from Unbreak Now! to High.Aug 2 2021, 6:14 PM

From #wikimedia-operations in IRC:

19:03:08 <tltaylor> hi folks. per Toby the board elections are being delayed 2 weeks. so hopefully that creates more room for a smoother solution
Krinkle subscribed.

(Untagging library as this isn't a bug with the rdbms library itself. It is reporting a problem in SecurePoll).

This error usually means there is a slow query, in this case likely from the SecurePoll special page. There should be additional entires in Logstash under the same reqId, reporting the query itself and the stack trace from which it originated.

Looking at the most recent error that I can find, there seem to be a handful of "Detected no transaction for …" warnings preceding the error: https://logstash.wikimedia.org/goto/a353836b8c6abe4563cb2298b5bc225e

Hi everybody -- I wanted to clarify this comment:

From #wikimedia-operations in IRC:

19:03:08 <tltaylor> hi folks. per Toby the board elections are being delayed 2 weeks. so hopefully that creates more room for a smoother solution

Since it's critically important for the voting software to be reliable, we have decided that to ensure time for adequate troubleshooting and testing, we need an additional two weeks. This information is being sent to the election committee and the board for consideration on the timing of the election. They are aligned on the need for reliable voting software and there are additional factors they will consider for when the elections will occur.

We will communicate the results of their decision when it is made.

Hi everybody -- I wanted to clarify this comment:

From #wikimedia-operations in IRC:

19:03:08 <tltaylor> hi folks. per Toby the board elections are being delayed 2 weeks. so hopefully that creates more room for a smoother solution

Since it's critically important for the voting software to be reliable, we have decided that to ensure time for adequate troubleshooting and testing, we need an additional two weeks. This information is being sent to the election committee and the board for consideration on the timing of the election. They are aligned on the need for reliable voting software and there are additional factors they will consider for when the elections will occur.

We will communicate the results of their decision when it is made.

I know I sound like a stuck wheel here, but can we please commit to actually funding, maintaining and making sure so called "critical software" (I'm paraphrasing slightly) actually works when we need it to?

I know we struggle with institutional memory around here sometimes...

Toby, I'm pretty sure this is not the first time you've seen SecurePoll broken just before we need it? I can think of numerous occasions I've helped unbreak it when people found it was broken before, during or at the end of an election. And yes, I've also been the cause of at least one of those breakages. Tim being another person often on hook for SecurePoll related UBN.

We shouldn't be finding out basically 2 working days before we need something "critically important" that it's broken in numerous ways (from relatively recent changes too). And similarly, the day before it's needed, we shouldn't be having to delay elections. It looks unprofessional.

We've heard it before that we're going to deal with tech debt, we're going to find maintainers for stuff, but we repeatedly just don't. The situation rarely improves. And maybe when it does, teams eventually get reassigned and don't have time to deal with stuff they created in the past, and worse still, no one else really takes over the maintainership.

SecurePoll may have had some people/a team working on it recently, and that's good they're implementing features and trying to improve things, but it is critically under tested (unit tests etc). Test coverage is so good that we don't even have it displayed (I will go and fix that) - https://doc.wikimedia.org/cover-extensions/SecurePoll/

Sure, this specific issue (where it's exacerbated by many queries to many different databases on many different servers) is a bit harder to write tests for; it wouldn't have failed on beta... And as per Zabe, it doesn't always fail in real world usage either. But other two bugs? Well, they were seemingly not tested or noticed before CR (where both could have in theory have been caught), post merge or deployment. They certainly could've been caught further out when they went live.

And of course, this is not just SecurePoll. We've numerous fairly important extensions with little to no actual active or proactive maintenance.

https://doc.wikimedia.org/cover-extensions/SecurePoll/ is now live

Screenshot 2021-08-02 at 23.02.49.png (220×2 px, 51 KB)

And while SecurePoll's test coverage is not the worst (that we track!), it's far from anywhere that can be considered "good" - https://doc.wikimedia.org/cover-extensions/?sort=cov

$wgMaxUserDBWriteDuration only affects web requests IIRC, a maintenance script or just typing in the right PHP code in mwscript shell.php should bypass it. And +1 to everything Reedy said.

The code is already making a reasonable attempt at splitting up the transaction, by wrapping the write operations on each wiki with startAtomic/endAtomic. It's just that those functions are basically no-ops -- they don't actually provide separate transactions. After endAtomic() on the local DB, the write transaction stays open while all the other DBs are written to. I'm looking at doing the write operations in a separate connection with outer transaction scope, so that begin() and commit() can be used.

Change 709589 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/SecurePoll@master] Use real transactions when creating an election

https://gerrit.wikimedia.org/r/709589

With the patch above, the transactions are short, and the write queries are not even counted towards the limit. I tested it locally with my little wiki farm (1 local wiki, 2 foreign). Please merge, deploy and test.

Change 709592 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] rdbms: Log the transaction write query duration

https://gerrit.wikimedia.org/r/709592

Change 709592 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Log the transaction write query duration

https://gerrit.wikimedia.org/r/709592

Change 709589 merged by jenkins-bot:

[mediawiki/extensions/SecurePoll@master] Use real transactions when creating an election

https://gerrit.wikimedia.org/r/709589

Change 709973 had a related patch set uploaded (by Phuedx; author: Tim Starling):

[mediawiki/extensions/SecurePoll@wmf/1.37.0-wmf.16] Use real transactions when creating an election

https://gerrit.wikimedia.org/r/709973

Change 709974 had a related patch set uploaded (by Phuedx; author: Tim Starling):

[mediawiki/extensions/SecurePoll@wmf/1.37.0-wmf.17] Use real transactions when creating an election

https://gerrit.wikimedia.org/r/709974

Change 709973 merged by jenkins-bot:

[mediawiki/extensions/SecurePoll@wmf/1.37.0-wmf.16] Use real transactions when creating an election

https://gerrit.wikimedia.org/r/709973

Change 709974 merged by jenkins-bot:

[mediawiki/extensions/SecurePoll@wmf/1.37.0-wmf.17] Use real transactions when creating an election

https://gerrit.wikimedia.org/r/709974

@tstarling This issue is now resolved. I was able to create an election poll for All wikis and This wikis.
Version tested:SecurePoll 3.0.0 (f7361c0) 06:22, 5 August 2021 GPL-2.0-or-later

A BIG Thank you to all that worked to resolved this issue promptly! Go Team!

Hi everybody -- I wanted to clarify this comment:

From #wikimedia-operations in IRC:

19:03:08 <tltaylor> hi folks. per Toby the board elections are being delayed 2 weeks. so hopefully that creates more room for a smoother solution

Since it's critically important for the voting software to be reliable, we have decided that to ensure time for adequate troubleshooting and testing, we need an additional two weeks. This information is being sent to the election committee and the board for consideration on the timing of the election. They are aligned on the need for reliable voting software and there are additional factors they will consider for when the elections will occur.

We will communicate the results of their decision when it is made.

I know I sound like a stuck wheel here, but can we please commit to actually funding, maintaining and making sure so called "critical software" (I'm paraphrasing slightly) actually works when we need it to?

I know we struggle with institutional memory around here sometimes...

Toby, I'm pretty sure this is not the first time you've seen SecurePoll broken just before we need it? I can think of numerous occasions I've helped unbreak it when people found it was broken before, during or at the end of an election. And yes, I've also been the cause of at least one of those breakages. Tim being another person often on hook for SecurePoll related UBN.

We shouldn't be finding out basically 2 working days before we need something "critically important" that it's broken in numerous ways (from relatively recent changes too). And similarly, the day before it's needed, we shouldn't be having to delay elections. It looks unprofessional.

We've heard it before that we're going to deal with tech debt, we're going to find maintainers for stuff, but we repeatedly just don't. The situation rarely improves. And maybe when it does, teams eventually get reassigned and don't have time to deal with stuff they created in the past, and worse still, no one else really takes over the maintainership.

SecurePoll may have had some people/a team working on it recently, and that's good they're implementing features and trying to improve things, but it is critically under tested (unit tests etc). Test coverage is so good that we don't even have it displayed (I will go and fix that) - https://doc.wikimedia.org/cover-extensions/SecurePoll/

Sure, this specific issue (where it's exacerbated by many queries to many different databases on many different servers) is a bit harder to write tests for; it wouldn't have failed on beta... And as per Zabe, it doesn't always fail in real world usage either. But other two bugs? Well, they were seemingly not tested or noticed before CR (where both could have in theory have been caught), post merge or deployment. They certainly could've been caught further out when they went live.

And of course, this is not just SecurePoll. We've numerous fairly important extensions with little to no actual active or proactive maintenance.

Thank you, @Reedy, for pointing out this often overlooked, yet incredibly important issue. I think it is very easy for us all to forget the importance of having a more robust maintenance setup in place. This is a very good example of tech debt which caused tangible problems across the organization and movement. From my perspective, I think it would be incredibly helpful for all departments within the organization for this issue to receive more attention as we build out our capacity more. It feels like a very appropriate part of building both a thriving movement and thriving foundation.

I'm distanced from the details here, and would prefer to keep it that way, but I need to chime in about cross-database transactions: there is no such thing on our infrastructure. This can only be achieved through advanced database extensions supporting two-phase commit, which we have never integrated with for MariaDB as far as I'm aware. Without native 2PC support, you'll never be able to guarantee that you have either committed all changes, or rolled them all back. For example, if 2 out of 3 databases are changed but the 3rd fails, then you need to roll back the 2nd and 1st changes. If the rollback fails, then you're left with an inconsistent state.

The only thing we can do is to log thoroughly, and set off a loud alert if the pseudo-multi-db transaction is broken, then recover manually. I've seen this fail on WMF's CiviCRM donation handling platform, it's not just an academic question.

tstarling claimed this task.

I'm distanced from the details here, and would prefer to keep it that way, but I need to chime in about cross-database transactions: there is no such thing on our infrastructure.

Thanks for the interesting information, but that is not relevant to this task.

I'm distanced from the details here, and would prefer to keep it that way, but I need to chime in about cross-database transactions: there is no such thing on our infrastructure.

Thanks for the interesting information, but that is not relevant to this task.

Good, thanks for the clarification. Reading the patch, I see that we're actually doing the opposite of what I imagined: the change was to set autocommit mode so each statement is committed immediately rather than in a transaction. It seems that you can still get into the inconsistent state I was describing, if any of the remote wiki statements fail then we will be left with entries in some of the remote wikis, but recordElectionToNamespace will never be called. I believe the operator will see an exception in this case, which makes it possible to do manual cleanup if they are feeling diligent. Also, this only happens during the creation of a new election so not quite a user-facing issue. Is this potentially inconsistent state going to break anything?

Reading the patch, I see that we're actually doing the opposite of what I imagined: the change was to set autocommit mode so each statement is committed immediately rather than in a transaction.

It gets a transaction in autocommit mode but then does an explicit BEGIN/COMMIT around write queries relating to each wiki DB being acted on. There are about 11 queries in the local transaction and 6 queries per transaction in the subsequent foreign wiki transactions.

The bulk replacement of begin()/commit() with startAtomic()/endAtomic() a few years ago caused all transactions to be left open until request shutdown. My change restored the performance/consistency tradeoff chosen by the original author, i.e. the wiki being the atomic unit.

Is this potentially inconsistent state going to break anything?

Maybe, but if it does, it's a different bug.

[…] I need to chime in about cross-database transactions: there is no such thing on our infrastructure. […]

MediaWiki provides cross-database transactions, and has doing this by default for many years. You are technically right though from the MySQL perspective, each transaction is indeed bound to only one database.

For 99.9% of cases, MediaWiki povides shared pseudo-atomics for cross-database transactions and these work. The way it works is that we transactions are required to stay open for the duration of the request. A database error in any one query or transaction, or a PHP exception of any kind, or a segfault, would effectively rollback all DBs, and thus emulate a cross-database transaction.

It is only during three lines of code at the very end of the request, where we a tight loop to finally commit each per-db transaction, where the techninically separate transactions are exposed.

Having said that, for SecurePoll this behaviour was not intended or neccecary. And in cases like that it makes sense to locally opt-out. (And depending on how rare this use case is, and much it costs/benefits, we can revisit the default behaviour.)

jrbs subscribed.

I think I just ran into this again. Not sure what changed. This was creating an election with GPG encryption, STV with droop quota, all wikis, 14 days. May be the all wikis problem again?

I thought maybe it was some issue with the number of candidates, though I ran into the same thing with only 1.

Screenshot 2021-08-11 at 7.33.04 PM.png (341×882 px, 43 KB)