
[M] Add UI fallbacks for failed tally jobs
Open, High, Public

Assigned To
None
Authored By
STran
Aug 3 2021, 5:56 PM

Description

If a tally job fails silently, there's no way for the user to restart the job or cancel it from the UI. The election is untallyable until someone deletes tally-ongoing from the database. We should:

  1. Provide a way to "clear" a queued job. Once a job has been queued, afaik, there's no way to un-queue it, but if it ran and failed silently we can safely remove the flag. Because it would be less than ideal if someone re-tallied while a job was actually running, we should make it clear that this is a bailout option and not a button to press when things are merely slow.
  2. We have the capacity to distinguish between a "job is queued" state and a "job is running" state. While this might not be useful as-is, it might be useful messaging to help the user gauge whether or not they need to re-queue the job.
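A minimal sketch of the "clear" bailout from point 1, written in Python for brevity rather than the extension's PHP. Only the `tally-ongoing` flag name comes from this task; `ElectionProps`, `can_start_tally` and `clear_stuck_tally` are hypothetical names, and the in-memory dict merely stands in for the database.

```python
TALLY_ONGOING = "tally-ongoing"  # flag name taken from the task description


class ElectionProps:
    """Hypothetical in-memory stand-in for the election properties table."""

    def __init__(self):
        self._props = {}

    def set(self, key, value="1"):
        self._props[key] = value

    def has(self, key):
        return key in self._props

    def delete(self, key):
        self._props.pop(key, None)


def can_start_tally(props):
    # A new tally is only allowed while no tally is flagged as ongoing.
    return not props.has(TALLY_ONGOING)


def clear_stuck_tally(props, confirmed):
    """Bailout: drop the flag only after the user explicitly confirmed.

    Returns True if a flag was actually removed.
    """
    if not confirmed or not props.has(TALLY_ONGOING):
        return False
    props.delete(TALLY_ONGOING)
    return True
```

The explicit `confirmed` argument is there to reflect the "bailout option, not a convenience button" requirement; nothing in this sketch can tell whether a job is genuinely still running.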

Event Timeline

Niharika added subscribers: Prtksxna, Niharika.

@Prtksxna Want to bring this to your attention.

Niharika renamed this task from Add UI fallbacks for failed tally jobs to [M] Add UI fallbacks for failed tally jobs. Aug 4 2021, 4:53 PM
Niharika triaged this task as Medium priority.

For context, here's what the screen will look like when you're tallying and what it will be stuck on if the job never returns:

image.png (1×1 px, 223 KB)

A few options we discussed in standup:

  • Remove the disabled button state. Users will be able to tally again whenever. Add a warning.
  • Implement a "time tally started" data point so users (and us) can know when a job was queued to help whoever is tallying make a more informed decision about whether or not to requeue.
  • Log when a job is queued/running so that we have insight into the job state (so we don't have to distinguish in the UI between queued/running, information that is more relevant to us as developers than possibly election admins)
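To make the "time tally started" option concrete, here is a rough Python sketch of how a stored queue timestamp could drive a hint for whoever is tallying. The one-hour threshold, the function names and the wording of the hint are all assumptions for illustration; nothing here reflects how SecurePoll actually stores or displays this.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold after which a tally is treated as suspicious.
STALE_AFTER = timedelta(hours=1)


def tally_age(started_at, now):
    """How long ago the tally job was queued."""
    return now - started_at


def looks_stuck(started_at, now, stale_after=STALE_AFTER):
    """True once the job has been pending for at least `stale_after`."""
    return tally_age(started_at, now) >= stale_after


def describe(started_at, now):
    """Build a short hint that a UI or a log line could show."""
    minutes = int(tally_age(started_at, now).total_seconds() // 60)
    if looks_stuck(started_at, now):
        hint = "may have failed"
    else:
        hint = "probably still in progress"
    return f"Tally queued {minutes} min ago; {hint}."


started = datetime(2021, 8, 5, 11, 0, tzinfo=timezone.utc)
print(describe(started, started + timedelta(minutes=10)))
```

The same timestamp could equally be written to the logs instead of shown in the UI, which matches the third option above.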

Change 710323 had a related patch set uploaded (by AGueyte; author: AGueyte):

[mediawiki/extensions/SecurePoll@master] Add UI fallbacks for failed tally jobs

https://gerrit.wikimedia.org/r/710323

Hi all, I've added a way to see whether the job is queued or running.

@Prtksxna can you confirm the copy for the warnings?
"securepoll-tally-job-enqueued": "This election is queuing to be (re-)tallied. Please check back later for the results.",
"securepoll-tally-job-running": "This election is now being (re-)tallied. Please check back later for the results.",

When a tally is submitted, it is first tagged as "enqueued". Once the job actually starts running, the "enqueued" tag is deleted and a "running" tag is added to the DB.
The "running" tag is deleted on successful completion of the tally.
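The tag lifecycle described above can be modelled roughly as follows. This is a Python sketch, not the patch itself: the set of tags stands in for the DB rows, the function names are hypothetical, and a raised exception stands in for a job that dies before reaching its cleanup step. Only the two message keys are taken from this comment.

```python
# Message keys as proposed in this comment; the mapping itself is an assumption.
MESSAGE_KEYS = {
    "enqueued": "securepoll-tally-job-enqueued",
    "running": "securepoll-tally-job-running",
}


def submit_tally(tags):
    """Submitting a tally first tags it as enqueued."""
    tags.add("enqueued")


def run_tally(tags, tally):
    """Swap enqueued -> running, do the work, then clean up on success."""
    tags.discard("enqueued")
    tags.add("running")
    tally()  # if this never returns normally, the next line is never reached
    tags.discard("running")


def status_message_key(tags):
    """Pick the warning to show, or None when no tally is pending."""
    for tag in ("running", "enqueued"):
        if tag in tags:
            return MESSAGE_KEYS[tag]
    return None


def failing_tally():
    raise RuntimeError("job died")  # stand-in for a silent failure


tags = set()
submit_tally(tags)
try:
    run_tally(tags, failing_tally)
except RuntimeError:
    pass
print(tags)  # the "running" tag is left behind
```

This is the stuck state the task is about: with no cleanup, the page keeps showing the "running" message indefinitely.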

Screen Shot 2021-08-05 at 11.32.19 AM.png (438×1 px, 42 KB)

Screen Shot 2021-08-05 at 11.32.07 AM.png (342×1 px, 36 KB)

While the job is queued or running, we now show a "Force Re-tally" submit button.
@Prtksxna let me know how we can improve this display

Screen Shot 2021-08-05 at 11.02.37 AM.png (360×686 px, 40 KB)

@STran.
To force a re-tally, TallyPage::submitJob() checks whether the job is queued or running. If so, we delete any tags related to the job and let submitJob() proceed.
I am not entirely sure how to stop/delete a job. Can I have your input on what I've done? Thanks!
After forcing a re-tally, we don't have the cleanest display of warning messages. Definitely need technical and design feedback here! cc @Prtksxna
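For reviewers, the force re-tally flow described above boils down to something like this Python sketch. It is a simplified model, not the actual TallyPage::submitJob() code: the `force` parameter, the return strings and the list used as a job queue are all hypothetical.

```python
ONGOING_TAGS = {"enqueued", "running"}


def submit_job(tags, queue, election_id, force=False):
    """Queue a tally job, optionally clearing leftover tags first.

    `tags` is a set standing in for the DB tags and `queue` is a list
    standing in for the job queue. Note that nothing here stops a job
    that is genuinely still running; the old tags are simply dropped.
    """
    ongoing = tags & ONGOING_TAGS
    if ongoing and not force:
        return "already-ongoing"
    forced = bool(ongoing)
    tags.difference_update(ongoing)  # manual cleanup of stale tags
    tags.add("enqueued")
    queue.append(election_id)
    return "forced-retally" if forced else "queued"
```

A second, non-forced submission is refused while either tag is present, which is the behaviour the force option deliberately circumvents.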

"securepoll-tally-force-retally-confirm": "The ongoing tally has now been stopped",

Screen Shot 2021-08-05 at 11.18.03 AM.png (467×722 px, 41 KB)

I think this is mostly a Prateek question but I can weigh in on some technical things. You're correct in that we rely on checking the tally-ongoing flag for most of the stuff around "don't restart another job." Since we're circumventing that, if you remove the tally-ongoing flag as part of your re-tally operation, you should resolve most of these problems you're seeing.

afaik* there's no way to stop/delete a job, which is why there's a bit of uncertainty around this situation. We're not so much cancelling a job as assuming it ran and failed silently; since the cleanup step would never be reached in that case, we're doing some manual cleanup. Given that we have little insight into this (right now), that's my best guess. Additionally, I'm not sure a re-tally is the best option: if a job already failed, would re-running it do anything? And in that case, do we even need to disable the submit button and add another one? We could add a (very strong) warning and re-use that button, which already has all the cleanup built into it. I'm deferring to Prateek on the user experience for this.

This is very much my personal opinion, but I think the option should be to "cancel" the job only (so only removing the tally-ongoing flag); then ideally they could raise it to someone who might be able to look into it further (or re-run it themselves, if that's what they want).

@phuedx raised the possibility of adding more log lines and I strongly support that idea (maybe not as part of this ticket if it's unwieldy) because that would give us more insight into fixing this issue. I don't know why Dom's election failed to finish tallying and I don't know if restarting the job would fix that issue given the information we have access to right now.

*Big disclaimer: I've read through the job queue docs and casually poked around the files.

Thanks for the details and screenshots @AGueyte.

Here is what I am thinking for the queuing and tallying states:

Queuing: image.png (396×1 px, 29 KB)
Tallying: image.png (396×1 px, 28 KB)

Changes I've made:

  • Made the destructive button normal instead of primary
  • Both buttons on the same line
  • The disabled button shows the state of the tally

When someone clicks on Force Re-tally we show them a confirm dialog (OOUI or otherwise) that gives them some more details about their action:

image.png (600×796 px, 42 KB)


I had a couple of questions about the steps in the process:

  1. Is the job "queuing" or does it instantly get "queued"? Should this be reflected in the language we use?
  2. What does force re-tally do when the job is queuing? Move it to the back of the queue? When is this helpful?
  3. Does the page automatically reload or give some indication when the tallying is done?

> @phuedx raised the possibility of adding more log lines and I strongly support that idea (maybe not as part of this ticket if it's unwieldy) because that would give us more insight into fixing this issue. I don't know why Dom's election failed to finish tallying and I don't know if restarting the job would fix that issue given the information we have access to right now.

I created a skinny task: T288366: Add more logging to determine what happens to jobs in the wild - feel free to add more detail

Thanks for this, @STran.
I might have misunderstood the ticket and our conversations.
So instead of "Re-tally", should the button say "Cancel tallying"?

Thank you @Prtksxna for the screenshots.
And to answer your questions:

  1. It gets queued instantly when a tally is submitted. The job then waits in the queue.
  2. Force Re-tally was meant to "cancel" an ongoing job. If a job fails silently, it doesn't clear the status flag and therefore still appears to be running. Force Re-tally was a way to clear the status and relaunch the job.
  3. The page doesn't reload. The results appear only once the job has completed successfully, hence the message "check back later for results".

Thank you!

@Prtksxna cc @STran
Can we update the ticket with what has been decided?
Especially regarding the Force Re-tally / Cancel / "Provide a way to "clear" a queued job" part.

Thank you!

Adding screenshots for the cancel button + workflows.

No cancel button while queueing

image.png (396×1 px, 26 KB)

Show a cancel button during tallying

image.png (396×1 px, 27 KB)

image.png (600×796 px, 30 KB)
(It's okay if this is a native confirm and not OOUI)
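The button rules from these mockups could be summarised by a small state-to-controls mapping. The Python sketch below is only an illustration of that mapping: the `Button` type, the state names and every label except "Cancel tallying" (floated earlier in this thread) are assumptions, and the confirm dialog is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Button:
    label: str
    enabled: bool = True
    destructive: bool = False


def tally_controls(state):
    """Return the buttons to render for a given tally state."""
    if state == "queued":
        # Disabled button shows the state; no cancel button while queuing.
        return [Button("Queued", enabled=False)]
    if state == "tallying":
        # Disabled state button plus a destructive cancel on the same line.
        return [
            Button("Tallying", enabled=False),
            Button("Cancel tallying", destructive=True),
        ]
    # Any other state: the regular tally button.
    return [Button("Create tally")]
```

Clicking the cancel button would still go through a confirmation step before any flag is removed.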

In the past, AHT stepped in to support Trust & Safety with the time-sensitive matter of the Board Elections by providing help with SecurePoll. Unfortunately, at this time we can no longer support nor maintain SecurePoll. Per the Foundation leadership’s instructions, AHT is dedicating all of our time and energy to other critical efforts.

Change 710323 abandoned by AGueyte:

[mediawiki/extensions/SecurePoll@master] Add UI fallbacks for failed tally jobs

Reason:

No longer working on secure poll

https://gerrit.wikimedia.org/r/710323

jrbs raised the priority of this task from Medium to High. Sep 23 2022, 10:32 PM
jrbs moved this task from Backlog to Evaluated on the MediaWiki-extensions-SecurePoll board.

Evaluation:

There is quite a bit of discussion within the task, and there is an abandoned change that could perhaps be revived easily.

Aklapper added a subscriber: AGueyte.

Removing inactive task assignee (please do so as part of offboarding steps).