Investigate issues with Tally function in SecurePoll [8Hr]
Closed, Resolved. Public. Dec 15 2020

Description

Motivation

In our meeting with @jrbs, we saw significant issues with the "Tally" function on the main page randomly not working. This task is to understand those problems and the solutions we could pursue.

Details

Due Date
Dec 15 2020, 5:00 AM

Event Timeline

Niharika triaged this task as Medium priority. Nov 30 2020, 8:02 PM
ARamirez_WMF renamed this task from Investigate issues with Tally function in SecurePoll to Investigate issues with Tally function in SecurePoll [8Hr]. Dec 2 2020, 6:05 PM
ARamirez_WMF set Due Date to Dec 15 2020, 5:00 AM.
ARamirez_WMF changed the subtype of this task from "Task" to "Deadline".

@jrbs What error message do you see when tallying fails due to there being too many votes?

Request from <my IP> via cp4031 cp4031, Varnish XID 100949954
Error: 503, Backend fetch failed at Fri, 11 Dec 2020 18:52:20 GMT

Also takes me to a generic error screen:

Screenshot 2020-12-11 at 11.00.54 AM.png (643×723 px, 46 KB)

Thanks @jrbs. Sorry I have a couple of follow-up questions:

  • Did you say it only happens for elections with large numbers of votes?
  • If so, do you know roughly how large?
  • How long do you wait roughly before you see the error?
  • Do you know if it happens only for a particular type of tallying?

Clicking on the tally button leads to calling DBStore::callbackValidVotes, which requests all the votes for a given election, then calls a callback on each one:

public function callbackValidVotes( $electionId, $callback, $voterId = null ) {
	$dbr = $this->getDB();
	$where = [
		'vote_election' => $electionId,
		'vote_current' => 1,
		'vote_struck' => 0
	];
	if ( $voterId !== null ) {
		$where['vote_voter'] = $voterId;
	}
	$res = $dbr->select(
		'securepoll_votes',
		'*',
		$where,
		__METHOD__
	);

	foreach ( $res as $row ) {
		$status = call_user_func( $callback, $this, $row->vote_record );
		if ( $status instanceof Status && !$status->isOK() ) {
			return $status;
		}
	}

	return Status::newGood();
}

Called from the tally page via:

#0 MediaWiki\Extensions\SecurePoll\DBStore->callbackValidVotes() called at [/var/www/html/mediawiki/extensions/SecurePoll/includes/Talliers/ElectionTallier.php:57]
#1 MediaWiki\Extensions\SecurePoll\Talliers\ElectionTallier->execute() called at [/var/www/html/mediawiki/extensions/SecurePoll/includes/Entities/Election.php:510]
#2 MediaWiki\Extensions\SecurePoll\Entities\Election->tally() called at [/var/www/html/mediawiki/extensions/SecurePoll/includes/Pages/TallyPage.php:136]
#3 MediaWiki\Extensions\SecurePoll\Pages\TallyPage->submitLocal() called at [/var/www/html/mediawiki/extensions/SecurePoll/includes/Pages/TallyPage.php:73]
#4 MediaWiki\Extensions\SecurePoll\Pages\TallyPage->execute() called at [/var/www/html/mediawiki/extensions/SecurePoll/includes/SpecialSecurePoll.php:87]
#5 MediaWiki\Extensions\SecurePoll\SpecialSecurePoll->execute() called at [/var/www/html/mediawiki/core/includes/specialpage/SpecialPage.php:645]
#6 SpecialPage->run() called at [/var/www/html/mediawiki/core/includes/specialpage/SpecialPageFactory.php:1403]
#7 MediaWiki\SpecialPage\SpecialPageFactory->executePath() called at [/var/www/html/mediawiki/core/includes/MediaWiki.php:310]
#8 MediaWiki->performRequest() called at [/var/www/html/mediawiki/core/includes/MediaWiki.php:945]
#9 MediaWiki->main() called at [/var/www/html/mediawiki/core/includes/MediaWiki.php:548]
#10 MediaWiki->run() called at [/var/www/html/mediawiki/core/index.php:53]
#11 wfIndexMain() called at [/var/www/html/mediawiki/core/index.php:46]

Given that there's no limit on the DB query, perhaps fetching or processing the data could be timing out.

Thanks @jrbs. Sorry I have a couple of follow-up questions:

  • Did you say it only happens for elections with large numbers of votes?
  • If so, do you know roughly how large?
  • How long do you wait roughly before you see the error?

I believe it's big elections, yeah. For example, the 2020 Farsi ArbCom election, which had 143 valid votes, tallies instantly, but the 2014 English ArbCom election, which had 594 votes, times out after just over two minutes.

  • Do you know if it happens only for a particular type of tallying?

This I don't know. It's possible that this is a factor but I'm not sure. Given the elections that are timing out are Range voting (histogram range) while the Farsi one is Schulze vote (in theory a much more complex style to tally) I imagine the issue is with volume, but I'd be guessing.

Thanks @jrbs, this is really helpful.

No worries! I just tried to tally the 2018 enwiki ArbCom election, which had 2,118 votes (so much more than 2014!) and it also timed out after two minutes, so it's at least consistent.

If I look at the timeout exception backtraces in the logs, it's mostly GpgCrypt. That's expected -- encrypted elections should be tallied offline. There's no security benefit to encryption if the tally key is stored on the server.

If we want to improve the user experience by displaying the tally for encrypted elections via the securepoll interface, what would be the best way to do that?

Problem

The page times out when attempting to tally large elections. When this happens, the admin must run the maintenance script instead.

However, admins would like to be able to tally via the web interface.

To elaborate on why the timeout is happening, the logs confirm that it happens during the loop in T269029#6689725. In this case, the callback leads to writing a file and executing a shell command via GpgCrypt for each vote cast.

The error message "the execution time limit of 200 seconds was exceeded" explains why @jrbs found that the error appears after just over a couple of minutes. Smaller elections appear to finish within this time limit.

Solutions

Not throwing errors
We shouldn't really be letting timeout errors happen like this. Short term, we could throw an exception if the election has more than a certain number of votes instead of attempting the tally and triggering the timeout. There's no perfect threshold since speed of executions will vary, so we'd have to use judgement. If we set a high threshold, we'd reduce the number of errors while still allowing tallies via the web interface wherever possible.
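As a rough illustration of the short-term guard described above, here is a minimal sketch. The constant `MAX_WEB_TALLY_VOTES`, its value of 500, and the function `assertWebTallyAllowed()` are hypothetical names chosen for this example; none of them exist in SecurePoll, and the right threshold would need to be chosen by judgement.

```php
<?php
// Hypothetical threshold. The thread notes there is no perfect value.
const MAX_WEB_TALLY_VOTES = 500;

/**
 * Refuse to start a web tally that is likely to hit the time limit.
 * Throws a clear exception instead of letting the request time out.
 */
function assertWebTallyAllowed( int $voteCount, int $maxVotes = MAX_WEB_TALLY_VOTES ): void {
	if ( $voteCount > $maxVotes ) {
		throw new RuntimeException(
			"Election has $voteCount votes (limit $maxVotes); tally from the command line instead."
		);
	}
}

// Example: the 2018 enwiki ArbCom election had 2,118 votes.
try {
	assertWebTallyAllowed( 2118 );
	$message = 'ok';
} catch ( RuntimeException $e ) {
	$message = $e->getMessage();
}
```

The caller could show `$message` on the tally page, so the admin sees an explanation rather than a generic 503 screen.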

Improving efficiency
We could look into whether we could speed the process up. E.g. could we decrypt all the votes at once, since the key seems to be per-election rather than per-vote?

We should also consider efficiency of other code-paths from this loop, not involving decryption.

Running a job
If efficiency can't be improved, we could instead run a job that is started via the web interface, tallies the election, then alerts the admin when it is done. This won't time out and will allow the admin to interact only via the web interface.
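The job idea could look roughly like the sketch below. `TallyJobQueue` is a hypothetical in-memory stand-in written only for this example; it is not MediaWiki's job queue, and a real implementation would persist jobs and notify the admin through an existing channel.

```php
<?php
// Hypothetical in-memory stand-in for a real job queue.
final class TallyJobQueue {
	/** @var int[] Election IDs waiting to be tallied. */
	private array $pending = [];
	/** @var string[] Notifications for admins, keyed by election ID. */
	public array $notifications = [];

	/** Called from the web request: records the job and returns immediately. */
	public function enqueue( int $electionId ): void {
		$this->pending[] = $electionId;
	}

	/** Called by a background worker that is not bound by the web time limit. */
	public function runNext( callable $tally ): bool {
		$electionId = array_shift( $this->pending );
		if ( $electionId === null ) {
			return false; // Nothing left to do.
		}
		$result = $tally( $electionId );
		$this->notifications[$electionId] = "Tally finished: $result";
		return true;
	}
}

$queue = new TallyJobQueue();
$queue->enqueue( 42 );
$queue->runNext( static fn ( int $id ): string => "election $id, 3 votes counted" );
```

The point of the split is that `enqueue()` is cheap enough for a web request, while the slow work happens in `runNext()`.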

Saving the results
@jrbs I'm not very familiar with the tallying workflow, but is one of the problems that elections need to be repeatedly tallied? Is there a need for us to store the result in a way that's accessible via the web interface once an election has been tallied?

If I look at the timeout exception backtraces in the logs, it's mostly GpgCrypt. That's expected -- encrypted elections should be tallied offline. There's no security benefit to encryption if the tally key is stored on the server.

Thanks @tstarling. Does calling GpgCrypt not mean that the key is stored on the server? Do we need to make any changes here?

4 out of 5 exceptions that I looked at were in setupHomeAndKeys(). So it could be improved by setting up the temp directory once at the start of tallying and reusing that directory, instead of tearing it down and recreating it for every vote. As @Tchanders says, a batch decrypt function could be implemented efficiently. That will speed up tallying whether it is done online or offline.
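To make the difference concrete, here is a minimal sketch comparing per-vote set-up with a single set-up for the whole batch. `FakeCrypt` is a hypothetical stand-in that only counts set-up calls and reverses strings; it does not call GPG and does not reproduce GpgCrypt's real interface.

```php
<?php
// Hypothetical stand-in for a GPG wrapper; it only counts set-up calls.
final class FakeCrypt {
	public int $setupCalls = 0;
	private bool $ready = false;

	/** Pretend to create the temp home directory and import keys. */
	public function setup(): void {
		$this->setupCalls++;
		$this->ready = true;
	}

	/** Pretend to remove the temp home directory. */
	public function teardown(): void {
		$this->ready = false;
	}

	/** Placeholder "decryption": reverses the string. Not real cryptography. */
	public function decrypt( string $record ): string {
		if ( !$this->ready ) {
			throw new LogicException( 'setup() must be called first' );
		}
		return strrev( $record );
	}
}

/** Pattern described in the thread: set up and tear down for every vote. */
function decryptEach( FakeCrypt $crypt, array $records ): array {
	$out = [];
	foreach ( $records as $record ) {
		$crypt->setup();
		$out[] = $crypt->decrypt( $record );
		$crypt->teardown();
	}
	return $out;
}

/** Suggested pattern: set up once, reuse for all votes, tear down once. */
function decryptBatch( FakeCrypt $crypt, array $records ): array {
	$crypt->setup();
	try {
		return array_map( [ $crypt, 'decrypt' ], $records );
	} finally {
		$crypt->teardown();
	}
}

$records = [ 'cba', 'fed', 'ihg' ];
$perVote = new FakeCrypt();
$batched = new FakeCrypt();
$a = decryptEach( $perVote, $records );
$b = decryptBatch( $batched, $records );
```

Both functions return the same plaintexts, but `$perVote->setupCalls` grows with the number of votes while `$batched->setupCalls` stays at one.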

It looks like online tallying is already discouraged by the create interface. I'm not sure how people are getting the tallying key into the server. If people really need online tallying then the tally page should have a textarea which allows you to paste in the tallying key. Then it would be posted to the server and used in that request, but not stored.

You understand, the point of using asymmetric encryption in SecurePoll is to prevent deanonymization of votes by a party with access to the server. If that is not a security goal for an election, encryption can be disabled. Then tallying will be fast and simple. Storing the private key permanently on the server defeats the purpose of encryption.

Saving the results
@jrbs I'm not very familiar with the tallying workflow, but is one of the problems that elections need to be repeatedly tallied? Is there a need for us to store the result in a way that's accessible via the web interface once an election has been tallied?

I would have to check with the Stewards who have done the scrutineering before, since I know a lot of them like to tally the results themselves before they sign off (which I assume they do offline). Paging @Urbanecm and @revi for that as two I can think of who know this system.

To answer the other question, it might be nice for others to be able to tally the election through the interface if they want to, but I don't think it's necessary since the results are posted publicly anyway, on Meta-Wiki.

I can't really speak to encryption too much since it's technically rather beyond me. :)

I'll explain this a bit more. The create page allows you to set two encryption-related properties: gpg-encrypt-key and gpg-sign-key. With only these two properties defined, GpgCrypt::canDecrypt() returns false and so tallying can't be done. If you additionally set gpg-decrypt-key (which I'm calling the tallying key), then canDecrypt() returns true and tallying can be done.
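The rule just described can be modelled in a few lines. This is a simplified sketch: the property names come from the comment above, but the free function `canDecrypt()` and the plain array of properties are assumptions for illustration, not SecurePoll's actual storage or the real `GpgCrypt::canDecrypt()` implementation.

```php
<?php
/**
 * Simplified model: decryption (and so tallying) is possible only when a
 * decrypt key is configured. The array shape is an assumption.
 */
function canDecrypt( array $props ): bool {
	return isset( $props['gpg-decrypt-key'] ) && $props['gpg-decrypt-key'] !== '';
}

// What the create page lets you set: no tallying possible.
$createPageOnly = [
	'gpg-encrypt-key' => '(public key)',
	'gpg-sign-key' => '(signing key)',
];

// With the tallying key added as well: tallying becomes possible.
$withTallyKey = $createPageOnly + [ 'gpg-decrypt-key' => '(private key)' ];

$before = canDecrypt( $createPageOnly );
$after = canDecrypt( $withTallyKey );
```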

The original idea was that the election configuration would be imported into an offline instance of SecurePoll, with gpg-decrypt-key added by editing the XML prior to import, and then tallying would be done via the command line with cli/tally.php. You can see from the backtraces that this procedure is not being followed. Instead gpg-decrypt-key is being inserted into votewiki and tallying is being done via the web interface. As I say, this is the worst of all worlds, since it's slow and receives no security benefit from encryption.

Perhaps we should re-evaluate SecurePoll's security and usability requirements to find some new tradeoff which works for users.

Ah, I see. That's helpful context, thank you!

The original idea was that the election configuration would be imported into an offline instance of SecurePoll

I think this is where things have somewhat fallen apart, given it's T&S who generally support elections through votewiki and we are not really a technical team. I personally wouldn't know the first thing about setting something like this up. I'd be happy to learn and document the process, though, assuming there are no "non-technical" docs out there already for this.

The fact that this process has gone without being understood or used for so long tells me that there might not be a need for the extra security measures that SecurePoll was designed to support. We should decide whether it is worth keeping the process that Tim just described. If we deem it not necessary, we can remove it and simplify the process. We can also look into alternate ways of enhancing security if the current process is too complicated and time-consuming.

My threat model was Board elections, specifically the risk of capture of the whole organisation, or more likely, unfounded allegations of the same. I wanted to have a process whereby Foundation officers with ultimate legal authority could be elected, and election results could be verified even in the event of allegations that the people with control of the servers wanted a particular outcome. At the time, I was arguing for majority community control of the Board, not the current system of a minority of elected members.

For ArbCom elections it is complete overkill. Nobody wants to be on the ArbCom so much that they would take over the servers to rig the election.

I'm getting a case of the nostalgias now. SecurePoll's whole security model was already present in the original version of BoardVote, which I wrote as a volunteer in May 2004. It was only the third extension to be created, after wikihiero and timeline. I advocated for voting and democratic processes in the fledgling Foundation and I was keen to do whatever technical work was required to support that.

SecurePoll was an abstraction of BoardVote, adding a concept of multiple elections, with configuration of elections in the database. I called it "SecurePoll" in reference to the encryption feature inherited from BoardVote.

So it would be sad to just delete that feature. But it doesn't work as just a box to tick on the create page. The election admins and ideally some of the voters need to understand and participate in the security model. We did have that participation up until 2011. The 2011 Board election was the last one to be hosted on servers run by an independent organisation (SPI). In 2013, vote.wikimedia.org was created, and I can see that all the Board elections on that wiki (2013, 2015 and 2017) have decryption keys stored on the server. So this misunderstanding of SecurePoll's security model started in 2013.

The fact that this process has gone without being understood or used for so long tells me that there might not be a need for the extra security measures that SecurePoll was designed to support. We should decide whether it is worth keeping the process that Tim just described. If we deem it not necessary, we can remove it and simplify the process. We can also look into alternate ways of enhancing security if the current process is too complicated and time-consuming.

It seems to me like this is an instance of the eternal dilemma with data privacy: keeping data truly private is not just technically difficult, it's inconvenient for users. Privacy restrictions are always annoying and pointless, until they aren't.

In the end, the question is who we want to trust with the information of who voted for what: do we trust a select handful of individuals, chosen for each election, or do we trust anyone who has access to production databases or their backups, now and at any time in the future?

Whatever choice we make, it should be made very clear to the person casting the vote who can see how they voted.

I'm getting a case of the nostalgias now. SecurePoll's whole security model was already present in the original version of BoardVote, which I wrote as a volunteer in May 2004. It was only the third extension to be created, after wikihiero and timeline. I advocated for voting and democratic processes in the fledgling Foundation and I was keen to do whatever technical work was required to support that.

SecurePoll was an abstraction of BoardVote, adding a concept of multiple elections, with configuration of elections in the database. I called it "SecurePoll" in reference to the encryption feature inherited from BoardVote.

Thanks for the history, that's very interesting! Wasn't aware of the origins of this tool :)

So it would be sad to just delete that feature. But it doesn't work as just a box to tick on the create page. The election admins and ideally some of the voters need to understand and participate in the security model.

I'm sort of in two minds about this. Obviously, we have a tremendous volunteer developer community who do understand the intricacies of these systems and who are willing to help with things like this. But IMO, the Venn diagram of people who 1) know the Wikimedia system, 2) know how to handle SQL, GPG and other tools you need for this, and 3) have the personal time and bandwidth to execute, is getting smaller not larger. I am of course very keen for secure, "untamperable" elections, but I do think that we are increasingly looking for unicorns or community-veterans to run these sorts of processes.

In 2013, vote.wikimedia.org was created, and I can see that all the Board elections on that wiki (2013, 2015 and 2017) have decryption keys stored on the server. So this misunderstanding of SecurePoll's security model started in 2013.

To clarify how I (and @Jalexander before me) handle this - we add the decryption key only *after* the vote has concluded and the independent Stewards assigned to the election have scrutinised the votes. This means that, from the moment voting opens to the moment someone with the decryption key inserts that key, the vote is encrypted. If I understand things correctly, even decrypted it is extremely difficult to ascertain whom an individual voted for (which would be my primary concern), but it is very easy to tally and see the results.

Personally (assuming I am correct) this is actually kind of the ideal outcome; having votewiki retain the results in this way would allow for an historical record to be retained there, too. We already have that for the smaller unencrypted elections. But if there's an obvious security aspect to inserting the decryption key after the vote has closed and the votes are "locked", then I'll happily stand corrected!

(Sidenote: It is surprisingly difficult to spell the word "encrypted".)

The fact that this process has gone without being understood or used for so long tells me that there might not be a need for the extra security measures that SecurePoll was designed to support.

Uh? People complained about the election not being run by a neutral third-party which could guarantee safe handling of the keys. The fact that concerns were steamrolled is not evidence people "did not understand" or "did not care".
https://meta.wikimedia.org/wiki/Talk:Wikimedia_Foundation_elections_2013#Software_in_the_Public_Interest?
https://lists.wikimedia.org/pipermail/wikimedia-l/2014-October/074880.html

I'm sorry, but I do not see concerns being "steamrolled" in either of those links. I don't even think these posts constitute complaints. James explained that we had stopped working with SPI due to a difficult working relationship with them, though I of course don't have context for the exact reasons behind this decision since it was made almost eight years ago.

What WMF called "logistical difficulties", community members called "expected security": that WMF not have someone on its payroll who could be ordered to provide the decryption keys. Elections from 2013 onwards never claimed to be secure; they didn't even claim to be consistent with the laws of mathematics, for that matter. Nothing new.

Elections from 2013 onwards never claimed to be secure; they didn't even claim to be consistent with the laws of mathematics, for that matter. Nothing new.

I honestly don't know what you mean here but I don't think it is relevant to this task.