Creating an election with "all wikis" can give a DBTransactionSizeError
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Reedy
	Aug 2 2021, 12:36 PM

Description

What's the problem?

SecurePoll is running into a "DBTransactionSizeError" when trying to create an election.

In T287780#7251685, @dom_walden wrote:

@Reedy @Zabe @Majavah Trying to create an election with "All wiki" selected on votewiki and testwiki, get some sort of timeout:

[19e3148a-a7d9-4c40-9aa2-bf5e49d774e1] /wiki/Special:SecurePoll/create Wikimedia\Rdbms\DBTransactionSizeError: Transaction spent 4.2640490531921 second(s) in writes, exceeding the limit of 3

Backtrace:

from /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1748)
#0 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/loadbalancer/LoadBalancer.php(2265): Wikimedia\Rdbms\LoadBalancer::Wikimedia\Rdbms\{closure}(Wikimedia\Rdbms\DatabaseMysqli)
#1 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/loadbalancer/LoadBalancer.php(1762): Wikimedia\Rdbms\LoadBalancer->forEachOpenMasterConnection(Closure)
#2 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/lbfactory/LBFactory.php(249): Wikimedia\Rdbms\LoadBalancer->approveMasterChanges(array, string, integer)
#3 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/lbfactory/LBFactoryMulti.php(236): Wikimedia\Rdbms\LBFactory::Wikimedia\Rdbms\{closure}(Wikimedia\Rdbms\LoadBalancer, string, array)
#4 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/lbfactory/LBFactory.php(251): Wikimedia\Rdbms\LBFactoryMulti->forEachLB(Closure, array)
#5 /srv/mediawiki/php-1.37.0-wmf.16/includes/libs/rdbms/lbfactory/LBFactory.php(310): Wikimedia\Rdbms\LBFactory->forEachLBCallMethod(string, array)
#6 /srv/mediawiki/php-1.37.0-wmf.16/includes/MediaWiki.php(684): Wikimedia\Rdbms\LBFactory->commitMasterChanges(string, array)
#7 /srv/mediawiki/php-1.37.0-wmf.16/includes/MediaWiki.php(654): MediaWiki::preOutputCommit(RequestContext, Closure)
#8 /srv/mediawiki/php-1.37.0-wmf.16/includes/MediaWiki.php(948): MediaWiki->doPreOutputCommit(Closure)
#9 /srv/mediawiki/php-1.37.0-wmf.16/includes/MediaWiki.php(559): MediaWiki->main()
#10 /srv/mediawiki/php-1.37.0-wmf.16/index.php(53): MediaWiki->run()
#11 /srv/mediawiki/php-1.37.0-wmf.16/index.php(46): wfIndexMain()
#12 /srv/mediawiki/w/index.php(3): require(string)
#13 {main}

In T287780#7251693, @Zabe wrote:

I had the same error, but on a second try I was able to create https://test.wikipedia.org/wiki/SecurePoll:828.

In T287780#7251727, @dom_walden wrote:

Oh. I tried 4 times in a row (on votewiki) but was not able to :(

Details

Subject	Repo	Branch	Lines +/-
Use real transactions when creating an election	mediawiki/extensions/SecurePoll	wmf/1.37.0-wmf.17	+13 -6
Use real transactions when creating an election	mediawiki/extensions/SecurePoll	wmf/1.37.0-wmf.16	+13 -6
Use real transactions when creating an election	mediawiki/extensions/SecurePoll	master	+13 -6
rdbms: Log the transaction write query duration	mediawiki/core	master	+13 -7

Customize query in gerrit

Related Objects

Mentioned In: T288831: SecurePoll: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" when submitting a translation
T287914: Create a maintenance script to create polls
Mentioned Here: T288711: /wiki/Special:SecurePoll/create Error: Call to a member function getId() on null
rESPOf7361c0972f1: Localisation updates from https://translatewiki.net.
T287780: SecurePoll CreatePage can no longer correctly select "remote" wiki databases that aren't in the same cluster

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 2 2021, 12:37 PM

Reedy renamed this task from Creating an election with "all wikis" gives a DBTransactionSizeError to Creating an election with "all wikis" can give a DBTransactionSizeError.Aug 2 2021, 12:37 PM

Reedy triaged this task as High priority.

Reedy updated the task description. (Show Details)

Reedy added projects: MediaWiki-libs-Rdbms, Performance-Team, Platform Engineering.

We need to fix this ASAP because the Board election starts in a day.

A quick fix would be to bump $wgMaxUserDBWriteDuration for votewiki. I don't see any examples of this being done elsewhere though (and likely for good reason).

phuedx added a project: DBA.Aug 2 2021, 5:09 PM

There's not much we can do here as DBA unfortunately as this is more MW - not sure if that specific transaction can be optimized (maybe split it?) to make sure it takes less than 3 seconds (which is, in my opinion, a very generous limit for a database write transaction).

@aaron: @Reedy mentioned that you might be able to help out with this.

LSobanski subscribed.Aug 2 2021, 5:36 PM

(reposting my IRC comment so it doesn't get lost)

There is the immediate problem of the current use case and the long term problem of reasonable (and standardized) configuration. If we choose to solve the former with a workaround then let's plan for rolling that back afterwards and looking for the proper solution in parallel.

phuedx added a project: Anti-Harassment (The Letter Song).Aug 2 2021, 6:07 PM

From #wikimedia-operations in IRC:

19:03:08 <tltaylor> hi folks. per Toby the board elections are being delayed 2 weeks. so hopefully that creates more room for a smoother solution

(Untagging library as this isn't a bug with the rdbms library itself. It is reporting a problem in SecurePoll).

This error usually means there is a slow query, in this case likely from the SecurePoll special page. There should be additional entires in Logstash under the same reqId, reporting the query itself and the stack trace from which it originated.

Looking at the most recent error that I can find, there seem to be a handful of "Detected no transaction for …" warnings preceding the error: https://logstash.wikimedia.org/goto/a353836b8c6abe4563cb2298b5bc225e

Hi everybody -- I wanted to clarify this comment:

In T287859#7252963, @phuedx wrote:
From #wikimedia-operations in IRC:
19:03:08 <tltaylor> hi folks. per Toby the board elections are being delayed 2 weeks. so hopefully that creates more room for a smoother solution

Since it's critically important for the voting software to be reliable, we have decided that to ensure time for adequate troubleshooting and testing, we need an additional two weeks. This information is being sent to the election committee and the board for consideration on the timing of the election. They are aligned on the need for reliable voting software and there are additional factors they will consider for when the elections will occur.

We will communicate the results of their decision when it is made.

HakanIST subscribed.Aug 2 2021, 8:31 PM

Niharika updated the task description. (Show Details)Aug 2 2021, 8:31 PM

Zabe moved this task from Backlog to Immediate WMF elections needs on the MediaWiki-extensions-SecurePoll board.Aug 2 2021, 8:35 PM

Mh-3110 subscribed.Aug 2 2021, 8:36 PM

• dpifke subscribed.Aug 2 2021, 8:48 PM

KTC subscribed.Aug 2 2021, 8:58 PM

In T287859#7253352, @Tnegrin wrote:
Hi everybody -- I wanted to clarify this comment:
In T287859#7252963, @phuedx wrote:
From #wikimedia-operations in IRC:
19:03:08 <tltaylor> hi folks. per Toby the board elections are being delayed 2 weeks. so hopefully that creates more room for a smoother solution
Since it's critically important for the voting software to be reliable, we have decided that to ensure time for adequate troubleshooting and testing, we need an additional two weeks. This information is being sent to the election committee and the board for consideration on the timing of the election. They are aligned on the need for reliable voting software and there are additional factors they will consider for when the elections will occur.

We will communicate the results of their decision when it is made.

I know I sound like a stuck wheel here, but can we please commit to actually funding, maintaining and making sure so called "critical software" (I'm paraphrasing slightly) actually works when we need it to?

I know we struggle with institutional memory around here sometimes...

Toby, I'm pretty sure this is not the first time you've seen SecurePoll broken just before we need it? I can think of numerous occasions I've helped unbreak it when people found it was broken before, during or at the end of an election. And yes, I've also been the cause of at least one of those breakages. Tim being another person often on hook for SecurePoll related UBN.

We shouldn't be finding out basically 2 working days before we need something "critically important" that it's broken in numerous ways (from relatively recent changes too). And similarly, the day before it's needed, we shouldn't be having to delay elections. It looks unprofessional.

We've heard it before that we're going to deal with tech debt, we're going to find maintainers for stuff, but we repeatedly just don't. The situation rarely improves. And maybe when it does, teams eventually get reassigned and don't have time to deal with stuff they created in the past, and worse still, no one else really takes over the maintainership.

SecurePoll may have had some people/a team working on it recently, and that's good they're implementing features and trying to improve things, but it is critically under tested (unit tests etc). Test coverage is so good that we don't even have it displayed (I will go and fix that) - https://doc.wikimedia.org/cover-extensions/SecurePoll/

Sure, this specific issue (where it's exacerbated by many queries to many different databases on many different servers) is a bit harder to write tests for; it wouldn't have failed on beta... And as per Zabe, it doesn't always fail in real world usage either. But other two bugs? Well, they were seemingly not tested or noticed before CR (where both could have in theory have been caught), post merge or deployment. They certainly could've been caught further out when they went live.

And of course, this is not just SecurePoll. We've numerous fairly important extensions with little to no actual active or proactive maintenance.

Frostly subscribed.Aug 2 2021, 9:48 PM

ttaylor subscribed.Aug 2 2021, 9:58 PM

https://doc.wikimedia.org/cover-extensions/SecurePoll/ is now live

Screenshot 2021-08-02 at 23.02.49.png (220×2 px, 51 KB)

And while SecurePoll's test coverage is not the worst (that we track!), it's far from anywhere that can be considered "good" - https://doc.wikimedia.org/cover-extensions/?sort=cov

$wgMaxUserDBWriteDuration only affects web requests IIRC, a maintenance script or just typing in the right PHP code in mwscript shell.php should bypass it. And +1 to everything Reedy said.

Reedy mentioned this in T287914: Create a maintenance script to create polls.Aug 2 2021, 10:30 PM

• 4nn1l2 subscribed.Aug 3 2021, 1:25 AM

Risker subscribed.Aug 3 2021, 3:10 AM

The code is already making a reasonable attempt at splitting up the transaction, by wrapping the write operations on each wiki with startAtomic/endAtomic. It's just that those functions are basically no-ops -- they don't actually provide separate transactions. After endAtomic() on the local DB, the write transaction stays open while all the other DBs are written to. I'm looking at doing the write operations in a separate connection with outer transaction scope, so that begin() and commit() can be used.

Change 709589 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/extensions/SecurePoll@master] Use real transactions when creating an election

https://gerrit.wikimedia.org/r/709589

gerritbot added a project: Patch-For-Review.Aug 3 2021, 4:50 AM

With the patch above, the transactions are short, and the write queries are not even counted towards the limit. I tested it locally with my little wiki farm (1 local wiki, 2 foreign). Please merge, deploy and test.

Change 709592 had a related patch set uploaded (by Tim Starling; author: Tim Starling):

[mediawiki/core@master] rdbms: Log the transaction write query duration

https://gerrit.wikimedia.org/r/709592

Lydia_Pintscher subscribed.Aug 3 2021, 7:31 AM

Peachey88 subscribed.Aug 3 2021, 7:45 AM

Neozoon subscribed.Aug 3 2021, 10:24 AM

• CKibelka_WMF subscribed.Aug 3 2021, 1:29 PM

• marcella subscribed.Aug 3 2021, 2:50 PM

Change 709592 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Log the transaction write query duration

https://gerrit.wikimedia.org/r/709592

ReleaseTaggerBot added a project: MW-1.37-notes (1.37.0-wmf.18; 2021-08-09).Aug 3 2021, 7:00 PM

Vermont subscribed.Aug 3 2021, 10:56 PM

Change 709589 merged by jenkins-bot:

[mediawiki/extensions/SecurePoll@master] Use real transactions when creating an election

https://gerrit.wikimedia.org/r/709589

Maintenance_bot removed a project: Patch-For-Review.Aug 4 2021, 6:10 AM

Change 709973 had a related patch set uploaded (by Phuedx; author: Tim Starling):

[mediawiki/extensions/SecurePoll@wmf/1.37.0-wmf.16] Use real transactions when creating an election

https://gerrit.wikimedia.org/r/709973

Change 709974 had a related patch set uploaded (by Phuedx; author: Tim Starling):

[mediawiki/extensions/SecurePoll@wmf/1.37.0-wmf.17] Use real transactions when creating an election

https://gerrit.wikimedia.org/r/709974

Marostegui moved this task from Triage to Done on the DBA board.Aug 4 2021, 10:33 AM

Change 709973 merged by jenkins-bot:

[mediawiki/extensions/SecurePoll@wmf/1.37.0-wmf.16] Use real transactions when creating an election

https://gerrit.wikimedia.org/r/709973

Change 709974 merged by jenkins-bot:

[mediawiki/extensions/SecurePoll@wmf/1.37.0-wmf.17] Use real transactions when creating an election

https://gerrit.wikimedia.org/r/709974

Maintenance_bot removed a project: Patch-For-Review.Aug 4 2021, 11:10 AM

ReleaseTaggerBot edited projects, added MW-1.37-notes (1.37.0-wmf.17; 2021-08-02); removed MW-1.37-notes (1.37.0-wmf.18; 2021-08-09).Aug 4 2021, 12:00 PM

phuedx moved this task from Ready 🎬 (ONLY IF YOU HAVE NO MORE CODE TO REVIEW) to QA/Testing 🐞 on the Anti-Harassment (The Letter Song) board.Aug 4 2021, 12:03 PM

So9q subscribed.Aug 4 2021, 12:55 PM

Ratekreel subscribed.Aug 5 2021, 7:19 AM

• brennen subscribed.Aug 5 2021, 6:12 PM

Bluerasberry subscribed.Aug 5 2021, 7:20 PM

Is this resolved?

@tstarling This issue is now resolved. I was able to create an election poll for All wikis and This wikis.
Version tested:SecurePoll 3.0.0 (f7361c0) 06:22, 5 August 2021 GPL-2.0-or-later

A BIG Thank you to all that worked to resolved this issue promptly! Go Team!

• imaigwilo moved this task from QA/Testing 🐞 to Done: Q1 (2021-22) on the Anti-Harassment (The Letter Song) board.Aug 5 2021, 9:26 PM

In T287859#7253820, @Reedy wrote:
In T287859#7253352, @Tnegrin wrote:
Hi everybody -- I wanted to clarify this comment:
In T287859#7252963, @phuedx wrote:
From #wikimedia-operations in IRC:
19:03:08 <tltaylor> hi folks. per Toby the board elections are being delayed 2 weeks. so hopefully that creates more room for a smoother solution
Since it's critically important for the voting software to be reliable, we have decided that to ensure time for adequate troubleshooting and testing, we need an additional two weeks. This information is being sent to the election committee and the board for consideration on the timing of the election. They are aligned on the need for reliable voting software and there are additional factors they will consider for when the elections will occur.

We will communicate the results of their decision when it is made.
I know I sound like a stuck wheel here, but can we please commit to actually funding, maintaining and making sure so called "critical software" (I'm paraphrasing slightly) actually works when we need it to?

I know we struggle with institutional memory around here sometimes...

Toby, I'm pretty sure this is not the first time you've seen SecurePoll broken just before we need it? I can think of numerous occasions I've helped unbreak it when people found it was broken before, during or at the end of an election. And yes, I've also been the cause of at least one of those breakages. Tim being another person often on hook for SecurePoll related UBN.

We shouldn't be finding out basically 2 working days before we need something "critically important" that it's broken in numerous ways (from relatively recent changes too). And similarly, the day before it's needed, we shouldn't be having to delay elections. It looks unprofessional.

We've heard it before that we're going to deal with tech debt, we're going to find maintainers for stuff, but we repeatedly just don't. The situation rarely improves. And maybe when it does, teams eventually get reassigned and don't have time to deal with stuff they created in the past, and worse still, no one else really takes over the maintainership.

SecurePoll may have had some people/a team working on it recently, and that's good they're implementing features and trying to improve things, but it is critically under tested (unit tests etc). Test coverage is so good that we don't even have it displayed (I will go and fix that) - https://doc.wikimedia.org/cover-extensions/SecurePoll/

Sure, this specific issue (where it's exacerbated by many queries to many different databases on many different servers) is a bit harder to write tests for; it wouldn't have failed on beta... And as per Zabe, it doesn't always fail in real world usage either. But other two bugs? Well, they were seemingly not tested or noticed before CR (where both could have in theory have been caught), post merge or deployment. They certainly could've been caught further out when they went live.

And of course, this is not just SecurePoll. We've numerous fairly important extensions with little to no actual active or proactive maintenance.

Thank you, @Reedy, for pointing out this often overlooked, yet incredibly important issue. I think it is very easy for us all to forget the importance of having a more robust maintenance setup in place. This is a very good example of tech debt which caused tangible problems across the organization and movement. From my perspective, I think it would be incredibly helpful for all departments within the organization for this issue to receive more attention as we build out our capacity more. It feels like a very appropriate part of building both a thriving movement and thriving foundation.

I'm distanced from the details here, and would prefer to keep it that way, but I need to chime in about cross-database transactions: there is no such thing on our infrastructure. This can only be achieved through advanced database extensions supporting two-phase commit, which we have never integrated with for MariaDB as far as I'm aware. Without native 2PC support, you'll never be able to guarantee that you have either committed all changes, or rolled them all back. For example, if 2 out of 3 databases are changed but the 3rd fails, then you need to roll back the 2nd and 1st changes. If the rollback fails, then you're left with an inconsistent state.

The only thing we can do is to log thoroughly, and set off a loud alert if the pseudo-multi-db transaction is broken, then recover manually. I've seen this fail on WMF's CiviCRM donation handling platform, it's not just an academic question.

In T287859#7268116, @awight wrote:

I'm distanced from the details here, and would prefer to keep it that way, but I need to chime in about cross-database transactions: there is no such thing on our infrastructure.

Thanks for the interesting information, but that is not relevant to this task.

In T287859#7269211, @tstarling wrote:

In T287859#7268116, @awight wrote:

I'm distanced from the details here, and would prefer to keep it that way, but I need to chime in about cross-database transactions: there is no such thing on our infrastructure.

Thanks for the interesting information, but that is not relevant to this task.

Good, thanks for the clarification. Reading the patch, I see that we're actually doing the opposite of what I imagined: the change was to set autocommit mode so each statement is committed immediately rather than in a transaction. It seems that you can still get into the inconsistent state I was describing, if any of the remote wiki statements fail then we will be left with entries in some of the remote wikis, but recordElectionToNamespace will never be called. I believe the operator will see an exception in this case, which makes it possible to do manual cleanup if they are feeling diligent. Also, this only happens during the creation of a new election so not quite a user-facing issue. Is this potentially inconsistent state going to break anything?

In T287859#7269282, @awight wrote:

Reading the patch, I see that we're actually doing the opposite of what I imagined: the change was to set autocommit mode so each statement is committed immediately rather than in a transaction.

It gets a transaction in autocommit mode but then does an explicit BEGIN/COMMIT around write queries relating to each wiki DB being acted on. There are about 11 queries in the local transaction and 6 queries per transaction in the subsequent foreign wiki transactions.

The bulk replacement of begin()/commit() with startAtomic()/endAtomic() a few years ago caused all transactions to be left open until request shutdown. My change restored the performance/consistency tradeoff chosen by the original author, i.e. the wiki being the atomic unit.

Is this potentially inconsistent state going to break anything?

Maybe, but if it does, it's a different bug.

RhinosF1 subscribed.Aug 9 2021, 1:21 PM

In T287859#7268116, @awight wrote:

[…] I need to chime in about cross-database transactions: there is no such thing on our infrastructure. […]

MediaWiki provides cross-database transactions, and has doing this by default for many years. You are technically right though from the MySQL perspective, each transaction is indeed bound to only one database.

For 99.9% of cases, MediaWiki povides shared pseudo-atomics for cross-database transactions and these work. The way it works is that we transactions are required to stay open for the duration of the request. A database error in any one query or transaction, or a PHP exception of any kind, or a segfault, would effectively rollback all DBs, and thus emulate a cross-database transaction.

It is only during three lines of code at the very end of the request, where we a tight loop to finally commit each per-db transaction, where the techninically separate transactions are exposed.

Having said that, for SecurePoll this behaviour was not intended or neccecary. And in cases like that it makes sense to locally opt-out. (And depending on how rare this use case is, and much it costs/benefits, we can revisit the default behaviour.)

I think I just ran into this again. Not sure what changed. This was creating an election with GPG encryption, STV with droop quota, all wikis, 14 days. May be the all wikis problem again?

I thought maybe it was some issue with the number of candidates, though I ran into the same thing with only 1.

Screenshot 2021-08-11 at 7.33.04 PM.png (341×882 px, 43 KB)

It's a different error, related to GPG. Forked to T288711: /wiki/Special:SecurePoll/create Error: Call to a member function getId() on null

dom_walden mentioned this in T288831: SecurePoll: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" when submitting a translation.Aug 13 2021, 2:58 PM

• Kormat subscribed.Aug 17 2021, 1:45 PM

	F34591178: Screenshot 2021-08-11 at 7.33.04 PM.png
	Aug 12 2021, 2:43 AM

	F34574120: Screenshot 2021-08-02 at 23.02.49.png
	Aug 2 2021, 10:04 PM

Creating an election with "all wikis" can give a DBTransactionSizeErrorClosed, ResolvedPublicActions