Pressing the Stop button in Quarry results in a 500 error
Open, High · Public · BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

  • Submit a query on Quarry, especially one that takes a while to complete (example).
  • Try to stop the execution of the query by clicking on the Stop button.

What happens?:
A dialog appears showing the 500 error returned by the server. See screenshot:

image.png (376×730 px, 58 KB)

What should have happened instead?:

  • The query should stop, or a more graceful error message should be shown.
  • After some time, the query's max execution time will be over. At this point, Quarry's UI should be able to determine that the query has been stopped, and the Stop button should switch back to the Submit Query button.

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc:
Unsure; Quarry's UI does not show which version is running. But most likely 98898f0613c962303c08ae07c1a39414d5cce4a3 because (a) it is the HEAD of the master branch right now, and (b) the feature it added is enabled in production.

Event Timeline

Huji added a subscriber: mdipietro.

This may be related to fawiki_p, which seems to leave jobs queued or running even when they are short. If the job is queued, the stop function will likely fail, as it won't find a job to stop. We could introduce logic to have it try revoking the celery worker. A potential race condition could probably be avoided by having the stop always try to revoke the session as well as stop the job, if there is one. Only one of the two should happen, though forceful stops of celery leave the query in an ugly state in the db.
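A minimal sketch of the "always try both" idea above, not Quarry's actual code. It assumes that "revoke" refers to the Celery task and "stopping the job" to killing the database query. `QueryRun`, `stop_run`, `revoke_task` and `kill_query` are hypothetical names; in a Celery setup `revoke_task` might wrap `app.control.revoke`, but that wiring is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class QueryRun:
    # Hypothetical record; field names are illustrative, not Quarry's schema.
    task_id: Optional[str]        # Celery task id, if one was recorded
    connection_id: Optional[int]  # database connection id, set once the query runs
    status: str = "queued"


def stop_run(
    run: QueryRun,
    revoke_task: Callable[[str], object],
    kill_query: Callable[[int], object],
) -> List[str]:
    """Try both stop paths; a failure in either one is tolerated.

    Returns the list of actions that succeeded, so the caller can show
    a graceful message instead of an unhandled server error.
    """
    done: List[str] = []

    # Queued (or running) job: ask the task queue to drop it.
    if run.task_id is not None:
        try:
            revoke_task(run.task_id)
            done.append("revoked")
        except Exception:
            pass  # task already gone; fall through to the other path

    # Running job: kill the query on the database side, if there is one.
    if run.connection_id is not None:
        try:
            kill_query(run.connection_id)
            done.append("killed")
        except Exception:
            pass  # nothing left to kill

    # Mark the run as stopped either way, so the UI stops showing "Stop".
    run.status = "stopped"
    return done
```

With a queued job there is no connection to kill, so only the revoke path runs; with a running job both paths are attempted and whichever still applies takes effect.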

I did not understand about half of what you said! You are clearly the expert, so I defer to you on how to handle this.

I'm experiencing the same issue with enwiki_p. I have one job stuck in "running" and another stuck in "queued", and the Stop button gives error 500 for both of them.

I suspect this is a combination of a new problem and an old problem. The new problem is that the stop function doesn't consider a job in the "queued" status; such a job needs different logic than a running job, but the function doesn't have it. The old problem is that previously, if you ran a job and selected submit on the same job page before the initial job finished, the old job would be orphaned (but still running in the db) and the new job would get its run_id. Now you can't do that, but the problem is more visible, as the stop button stays "stop" until the old job is complete. Some failure states don't come back and update the db away from running. Thus the old problem is still there, but more obvious when a job gets into a weird state.

I also have the same issue with some of my queries on itwiki_p: I launched them 4 days ago, but they are still stuck in the "queued" state and I can't stop them.

@Mess Workaround for stuck queries:

  • Click Stop
  • The button turns into a Start query button for a fraction of a second before the error alert is shown. Click it before the error shows up.

Or, more simply: just click the Stop button repeatedly, very fast.

Thanks a lot! I did it by reducing the zoom in my web browser on PC down to 30% (otherwise the webpage scrolls itself quickly to the top, preventing me from clicking that button fast enough).

Perhaps this should be triaged as "High" or even "Unbreak Now!" priority. For now, I am setting it to "High" priority, but anyone who thinks it should be UBN may change the priority to UBN.

We could pull the stop function. That would still orphan jobs stuck running, though: they would not be killed until something like the OOM killer comes and gets them. There was previously a job-killing function that would stop them after 1/2 hour, but it no longer exists. As such, pressing submit on a page with an already running job leaves it running and starts a new one. My current guess is that these are the jobs being noticed. We could also pursue T289349, which could fix it. Jobs that are stuck would still be stuck, but jobs that can be stopped would be stopped, so that would leave us in a better place. Or we could identify why some jobs fail to run, or why celery loses track of them. I'm not sure what knowledge there is on that.
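For illustration, the kind of half-hour job-killing check mentioned above could look roughly like this. It is only a sketch under assumptions: the removed function's real behaviour is not shown in this task, and `find_stale_runs`, the dict keys and `MAX_RUNTIME` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# The "1/2 hour" limit mentioned in the comment above.
MAX_RUNTIME = timedelta(minutes=30)


def find_stale_runs(
    runs: List[Dict],
    now: datetime,
    max_runtime: timedelta = MAX_RUNTIME,
) -> List[Dict]:
    """Return runs still marked "running" that started more than max_runtime ago.

    Each run is a dict with hypothetical keys "status" and "started_at";
    a periodic task could kill the returned runs and update their status.
    """
    return [
        run
        for run in runs
        if run["status"] == "running" and now - run["started_at"] > max_runtime
    ]
```

A periodic check like this would also catch orphaned jobs that no longer have a page with a Stop button attached to them.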

In the meantime, is using the fork button on a stuck job an unreasonable workaround?

I believe this is fixed by T289349; please reopen if this is not the case.

@MarioGom - Clicking the Stop button twice quickly did the trick - thanks!