Page MenuHomePhabricator

Investigate why tallying a large, encrypted election did not complete on votewiki when kicked off via the tally page [8h]
Closed, ResolvedPublic

Description

From @jrbs comments on Slack:

I started it tallying the ArbCom elections (encrypted, ~1,700 votes) yesterday, and it seems to still be tallying.
I don’t remember exactly when yesterday, but it’s at least 15 or 16 hours ago)
It definitely does not [usually take this long]. I don’t remember exactly how long it took but it’s less than an hour.

We should investigate:

  • What happens if Joe kicks this off again? (Was the failure just a one-off)
  • Is there anything in logstash for SecurePoll/votewiki at around 2021-04-14 to 2021-04-16? (Did it error out, rather than running slower?)
  • Looking at the code, are we catching all the possible errors and saving them to the database?

Finally, if there were no errors and it just looks like the tally was taking longer to run when kicked off via the web page, compared to when it was kicked off via the maintenance script, we should have a look into why this might be.

Event Timeline

Niharika triaged this task as Medium priority.Apr 15 2021, 5:16 PM
Niharika created this task.
ARamirez_WMF renamed this task from Investigate issues with Tallying to Investigate issues with Tallying [8h].Apr 21 2021, 4:12 PM
Tchanders renamed this task from Investigate issues with Tallying [8h] to Investigate why tallying a large, encrypted election did not complete on votewiki when kicked off via the tally page [8h].Apr 21 2021, 5:06 PM
Tchanders updated the task description. (Show Details)

The votewiki securepoll_properties table for this election has the following row:

 +-----------+---------------+----------+
| pr_entity | pr_key        | pr_value |
+-----------+---------------+----------+
|       808 | tally-ongoing | 1        |
+-----------+---------------+----------+

but no rows where pr_key is 'tally-result' or 'tally-error', which suggests that the tally started but it did not complete and it did not log an error (though an error may still have happened that we didn't log to the table).

See TallyPage::submitLocal for how status is logged to the securepoll_properties table.

Investigation

What happens if Joe kicks this off again?

We don't need to do this, as I think it's answered below - there would be another failure.

Is there anything in logstash for SecurePoll/votewiki at around 2021-04-14 to 2021-04-16?

Yes.

TL;DR: The request timed out, even with the deferred update. We didn't see this problem locally because it runs fast enough not to time out locally. I think we don't test such large elections on beta; or maybe beta is also quick enough not to time out.

Details: A RequestTimeoutException was thrown, with message The maximum execution time of 200 seconds was exceeded.

  • reqId 64f0bdc6-d27b-4192-af34-7399158998f0
  • message: Deferred update 'MWCallableUpdate_MediaWiki\Extensions\SecurePoll\Pages\TallyPage->submitLocal' failed to run.
  • stack trace:
from /srv/mediawiki/php-1.37.0-wmf.1/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(97)
#0 /srv/mediawiki/php-1.37.0-wmf.1/vendor/wikimedia/request-timeout/src/Detail/ExcimerTimerWrapper.php(72): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->onTimeout(integer)
#1 /srv/mediawiki/php-1.37.0-wmf.1/vendor/wikimedia/shellbox/src/Command/UnboxedExecutor.php(291): Wikimedia\RequestTimeout\Detail\ExcimerTimerWrapper->Wikimedia\RequestTimeout\Detail\{closure}(integer)
#2 /srv/mediawiki/php-1.37.0-wmf.1/vendor/wikimedia/shellbox/src/Command/UnboxedCommand.php(29): Shellbox\Command\UnboxedExecutor->execute(MediaWiki\Shell\Command)
#3 /srv/mediawiki/php-1.37.0-wmf.1/includes/shell/Command.php(227): Shellbox\Command\UnboxedCommand->execute()
#4 /srv/mediawiki/php-1.37.0-wmf.1/extensions/SecurePoll/includes/Crypt/GpgCrypt.php(274): MediaWiki\Shell\Command->execute()
#5 /srv/mediawiki/php-1.37.0-wmf.1/extensions/SecurePoll/includes/Crypt/GpgCrypt.php(378): MediaWiki\Extensions\SecurePoll\Crypt\GpgCrypt->runGpg(string, string, string, string)
#6 /srv/mediawiki/php-1.37.0-wmf.1/extensions/SecurePoll/includes/Talliers/ElectionTallier.php(111): MediaWiki\Extensions\SecurePoll\Crypt\GpgCrypt->decrypt(string)
#7 [internal function]: MediaWiki\Extensions\SecurePoll\Talliers\ElectionTallier->addRecord(MediaWiki\Extensions\SecurePoll\DBStore, string)
#8 /srv/mediawiki/php-1.37.0-wmf.1/extensions/SecurePoll/includes/DBStore.php(204): call_user_func(array, MediaWiki\Extensions\SecurePoll\DBStore, string)
#9 /srv/mediawiki/php-1.37.0-wmf.1/extensions/SecurePoll/includes/Talliers/ElectionTallier.php(79): MediaWiki\Extensions\SecurePoll\DBStore->callbackValidVotes(string, array)
#10 /srv/mediawiki/php-1.37.0-wmf.1/extensions/SecurePoll/includes/Entities/Election.php(566): MediaWiki\Extensions\SecurePoll\Talliers\ElectionTallier->execute()
#11 /srv/mediawiki/php-1.37.0-wmf.1/extensions/SecurePoll/includes/Pages/TallyPage.php(325): MediaWiki\Extensions\SecurePoll\Entities\Election->tally()
#12 /srv/mediawiki/php-1.37.0-wmf.1/includes/deferred/MWCallableUpdate.php(38): MediaWiki\Extensions\SecurePoll\Pages\TallyPage->MediaWiki\Extensions\SecurePoll\Pages\{closure}()
#13 /srv/mediawiki/php-1.37.0-wmf.1/includes/deferred/DeferredUpdates.php(513): MWCallableUpdate->doUpdate()
#14 /srv/mediawiki/php-1.37.0-wmf.1/includes/deferred/DeferredUpdates.php(390): DeferredUpdates::attemptUpdate(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti)
#15 /srv/mediawiki/php-1.37.0-wmf.1/includes/deferred/DeferredUpdates.php(221): DeferredUpdates::run(MWCallableUpdate, Wikimedia\Rdbms\LBFactoryMulti, Monolog\Logger, BufferingStatsdDataFactory, string)
#16 /srv/mediawiki/php-1.37.0-wmf.1/includes/deferred/DeferredUpdatesScope.php(267): DeferredUpdates::{closure}(MWCallableUpdate, integer)
#17 /srv/mediawiki/php-1.37.0-wmf.1/includes/deferred/DeferredUpdatesScope.php(196): DeferredUpdatesScope->processStageQueue(integer, integer, Closure)
#18 /srv/mediawiki/php-1.37.0-wmf.1/includes/deferred/DeferredUpdates.php(242): DeferredUpdatesScope->processUpdates(integer, Closure)
#19 /srv/mediawiki/php-1.37.0-wmf.1/includes/MediaWiki.php(1132): DeferredUpdates::doUpdates(string)
#20 /srv/mediawiki/php-1.37.0-wmf.1/includes/MediaWiki.php(837): MediaWiki->restInPeace()
#21 /srv/mediawiki/php-1.37.0-wmf.1/includes/MediaWiki.php(575): MediaWiki->doPostOutputShutdown()
#22 /srv/mediawiki/php-1.37.0-wmf.1/index.php(53): MediaWiki->run()
#23 /srv/mediawiki/php-1.37.0-wmf.1/index.php(46): wfIndexMain()
#24 /srv/mediawiki/w/index.php(3): require(string)
#25 {main}
Looking at the code, are we catching all the possible errors and saving them to the database?

I didn't look into this, since I think we will need to abandon the deferred update idea anyway, and remove the code introduced in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/SecurePoll/+/673601.

Next steps

I think we have 2 options:

  1. Revisit the idea of running this as a job. We have some WIP code that we could pick up again.
  2. Return to running this as a maintenance script.

Here are the pros and cons:

SolutionProsCons
JobElection admins can tally via the web interface onlyMay have to wait hours for the election result: https://www.mediawiki.org/wiki/Manual:Job_queue
Maintenance scriptCan kick off tallying instantlySomeone with server access must run the tally script

@jrbs @drochford @Niharika It would be great to hear your thoughts on which of these solutions we should go with.

@Tchanders when run as a job, is there a way to prioritize this job so it does not take hours? This is a critical feature and if the tallying takes hours, that is a no-go.

@Niharika Not to my knowledge, no - the job queue is not intended for tasks that need to be completed with urgency: https://www.mediawiki.org/wiki/Manual:Job_queue/For_developers

@Tchanders Despite the tallying taking a considerable time vi the Job method, I would prefer this way so as to remove the need for somebody with server access running the script.

@Tchanders Despite the tallying taking a considerable time vi the Job method, I would prefer this way so as to remove the need for somebody with server access running the script.

@drochford In my judgement, it is more essential to have a way to provide a quick tally. One additional point not mentioned above is that we now have a way to store results - so the script would ideally only need to be kicked off once and results will be accessible on the interface in the future.

@jrbs also looking for your input.

@Niharika that would still leave us dependent upon somebody at the Foundation i.e. Joe or Nahid manually accessing the database no?

@Tchanders David and I are talking about this. Is it possible to communicate the time remaining for the job to complete in the UI?

I'm sorry, I didn't see that I was tagged in here. Phabricator is hard :)

Next steps

I think we have 2 options:

  1. Revisit the idea of running this as a job. We have some WIP code that we could pick up again.
  2. Return to running this as a maintenance script.

Here are the pros and cons:

SolutionProsCons
JobElection admins can tally via the web interface onlyMay have to wait hours for the election result: https://www.mediawiki.org/wiki/Manual:Job_queue
Maintenance scriptCan kick off tallying instantlySomeone with server access must run the tally script

@jrbs @drochford @Niharika It would be great to hear your thoughts on which of these solutions we should go with.

Personally speaking, I think there are more pros to doing this as a job. The use of the web interface really cuts out a lot of uncertainty from the process particularly if someone non-technical ends up running the tallying from the election. I'm among the more technical T&S Specialists (no, really!) and even I'm put off by having to do this through the CLI. It's messy, it's confusing, and it requires knowledge that many people don't have (i.e. commands on the terminal).

I think if we can do this through the interface, even if it takes "hours", it is still preferable to using the command line in my personal opinion. It might be more of a problem if "hours" actually works out to be "days", especially if we're on a deadline for things like Board Elections, though there's already the scrutineering step which has an ambiguous length so I don't think it's actually that much of an issue.

Thanks @Niharika @drochford @jrbs.

Having discussed this with the engineers, it sounds like we might need to:

I'd hope that we can do this work during our next phase of SecurePoll work, but I'd also strongly suggest that we deprioritise it until after we have more progress on STV, since that needs to be done in time for the board election.

We could do some immediate follow-up work to re-instate the "use the maintenance script" message for large elections, to avoid the tally page hanging in the confusing "in progress" state which caused this investigation to be needed. (This work is smaller and I suspect more urgent.)

If it helps, I have been able to reproduce this on beta with 1000 votes: https://vote.wikimedia.beta.wmflabs.org/wiki/Special:SecurePoll/tally/632

I couldn't do this before because, at the time of testing this feature, beta was having reliability problems :(.

If it helps, I have been able to reproduce this on beta with 1000 votes: https://vote.wikimedia.beta.wmflabs.org/wiki/Special:SecurePoll/tally/632

I couldn't do this before because, at the time of testing this feature, beta was having reliability problems :(.

Thanks for looking into that @dom_walden