Pressing the Stop button in Quarry results in a 500 error
Closed, Resolved | Public | BUG REPORT

Description

List of steps to reproduce (step by step, including full links if applicable):

  • Submit a query on Quarry, especially one that takes a while to complete (example).
  • Try to stop the execution of the query by clicking on the Stop button.

What happens?:
A dialogue is shown in which a 500 message returned by the server is displayed. See screenshot:

image.png (376×730 px, 58 KB)

What should have happened instead?:

  • The query should stop, or a more graceful error message should be shown.
  • After some time, the query's max execution time will have elapsed. At this point, Quarry's UI should be able to determine that the query has been stopped, and the Stop button should switch back to the Submit Query button.

Software version (if not a Wikimedia wiki), browser information, screenshots, other information, etc:
Unsure; Quarry's UI does not show which version is running. But most likely 98898f0613c962303c08ae07c1a39414d5cce4a3 because (a) it is the HEAD of the master branch right now, and (b) the feature it added is enabled in production.

Event Timeline

Huji added a subscriber: rook.

This may be related to fawiki_p, which seems to leave jobs queued or running even when they are short. If the job is queued, the stop function will likely fail, as it won't find a job to stop. We could introduce logic to have it try revoking the celery worker. A potential race condition could probably be avoided by having the stop always try to revoke the session as well as stopping the job if there is one. Only one of the two should happen, though forceful stops of celery leave the query in an ugly state in the db.

I did not understand about half of what you said! You are clearly the expert, so I defer to you on how to handle this.

I'm experiencing the same issue with enwiki_p. I have one job stuck in "running", another stuck in "queued", and stop button gives error 500 for both of them.

I suspect this is a combination of a new problem and an old problem. The new problem is that the stop function doesn't consider a job in the "queued" status: such a job needs different logic than a running job, and the function doesn't have it. The old problem is that previously, if you ran a job and selected submit on the same job page before the initial job finished, the old job would be orphaned (but still running in the db) and the new job would get its run_id. Now you can't do that, but the problem is more visible, as the stop button stays "stop" until the old job is complete. Some failure states don't come back and update the db away from running. Thus the old problem is still there, but more obvious when a job gets into a weird state.
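
As an illustration of the status-dependent logic discussed above, here is a minimal sketch, not Quarry's actual code. The function name, the status strings, and the two injected callables are all hypothetical; in a real deployment `revoke_task` might wrap Celery's `app.control.revoke` and `kill_connection` might issue a MySQL `KILL`.

```python
from typing import Callable, List, Optional


def stop_query_run(
    status: str,
    task_id: Optional[str],
    connection_id: Optional[int],
    revoke_task: Callable[[str], None],
    kill_connection: Callable[[int], None],
) -> List[str]:
    """Attempt to stop a query run and return the actions that were tried.

    All names here are hypothetical; this only illustrates the idea of
    handling "queued" and "running" differently.
    """
    actions = []
    # Always try to revoke the task. This covers a queued job, and also the
    # window in which a queued job is just about to start running.
    if task_id is not None:
        revoke_task(task_id)
        actions.append("revoked")
    # Only a running job has a database connection that can be killed.
    if status == "running" and connection_id is not None:
        kill_connection(connection_id)
        actions.append("killed")
    return actions


# Usage with stand-in callables that just record what they were asked to do.
calls = []
print(stop_query_run("queued", "task-1", None, calls.append, calls.append))
# prints ['revoked']
```

Revoking unconditionally is one way to sidestep the race between "queued" and "running"; whether it is safe in Quarry depends on how the worker reacts to a revoke, which this sketch does not model.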

I also have the same issue with some of my queries connected to itwiki_p: I launched them 4 days ago, but they are still stuck in the "queued" state and I can't stop them.

@Mess Workaround for stuck queries:

  • Click Stop
  • The button turns into the Start query button for a fraction of a second before the alert error is shown. Just click it before the error shows up.

Or, more simply: just click the Stop button repeatedly, very fast.

Thanks a lot! I did it by reducing the zoom in my web browser on PC down to 30% (otherwise the webpage quickly scrolls itself to the top, preventing me from clicking on that button in time).

Perhaps, this should be triaged as "High" or even "Unbreak Now!" priority. For now, I am going to set this as "High" priority, but if anyone thinks that this should be UBN, then they may change the priority to UBN.

We could pull the stop function, though that would still orphan jobs that are stuck running. They would not be killed until something like the OOM killer gets them: there was previously a job-killing function that stopped them after half an hour, but it no longer exists. As such, pressing submit on a page with an already running job leaves it running and starts a new one. My current guess is that these are the jobs being noticed. We could also pursue T289349, which could fix it. Jobs that are stuck would still be stuck, but jobs that can be stopped would be stopped, so that would leave us in a better place. Or we could identify why some jobs fail to run, or why celery loses track of them. I'm not sure what knowledge there is on that.

In the meantime, is using the fork button on a stuck job an unreasonable workaround?

I believe this is fixed from T289349, please reopen if this is not the case.

@MarioGom - Clicking the Stop button twice quickly did the trick - thanks!

I reproduced the problem on enwiki_p, including https://quarry.wmcloud.org/query/61115 which should complete in milliseconds but claims to have run for more than a day.

GeoffreyT2000 lowered the priority of this task from High to Medium. Apr 30 2022, 4:40 AM

Still not fixed after several months.

Yes, my trivial query 61115 claims to have been running for nearly four months now, and the Stop button gives error 500. Can we at least do a one-off task to stop all queries which have been in the running or queued state for more than, say, a week?
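
A one-off cleanup of this kind could look roughly like the sketch below. It is only an illustration under stated assumptions: the `query_run` table layout, the `timestamp` column, and the status strings are hypothetical stand-ins rather than Quarry's real schema, and SQLite is used so the example runs on its own.

```python
import sqlite3
from datetime import datetime, timedelta


def stop_stale_runs(conn, now, max_age=timedelta(days=7)):
    """Mark runs stuck in 'queued' or 'running' for longer than max_age
    as stopped, and return how many rows were changed.

    Assumes a hypothetical query_run(id, status, timestamp) table whose
    timestamps are stored as 'YYYY-MM-DD HH:MM:SS' text.
    """
    cutoff = (now - max_age).isoformat(sep=" ", timespec="seconds")
    cur = conn.execute(
        "UPDATE query_run SET status = 'stopped' "
        "WHERE status IN ('queued', 'running') AND timestamp < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount


# Demo against an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE query_run (id INTEGER PRIMARY KEY, status TEXT, timestamp TEXT)"
)
conn.executemany(
    "INSERT INTO query_run (status, timestamp) VALUES (?, ?)",
    [
        ("running", "2022-04-25 10:00:00"),   # stuck for months
        ("queued", "2022-08-19 09:00:00"),    # recent, left alone
        ("complete", "2022-01-01 00:00:00"),  # finished, left alone
    ],
)
print(stop_stale_runs(conn, now=datetime(2022, 8, 20)))  # prints 1
```

This only changes the recorded status; it does not kill anything that might still be executing on a database replica.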

https://quarry.wmcloud.org/query/63057 has been running for four days now. I've run this many, many times and normally it runs in a matter of just seconds or a few minutes at most.

just here to echo continued persistence of this problem, and advocating for continued pressure to address this!

I may have a workaround:

  • tick the "Don't allow site to prompt you again" box on the 500 box (text varies by browser and language), or ad-block that box and its modal background
  • replace the query by a trivial one such as SELECT 123 /* keep old query here for reference */
  • click Stop and quickly click Submit Query

The query will complete, show the answer 123 and display the "Submit Query" button ready for re-use.
That's not ideal or intuitive; a proper fix would be much better!

Change 788719 had a related patch set uploaded (by Vivian Rook; author: Vivian Rook):

[analytics/quarry/web@master] Remove stop query function

https://gerrit.wikimedia.org/r/788719

I propose we remove the feature. Would anyone here care to do a code review on https://gerrit.wikimedia.org/r/788719? It seems to run, and some brief tinkering in the dev env doesn't show anything error shaped.

Please don't remove the Stop feature. It sometimes works, and is better than nothing, both for users who (sometimes) get their query unlocked for fixing and for the servers which can cease wasting resources on pointless tasks. The feature really needs mending rather than removing.

hmm...maybe making it a separate button that is just always there alongside submit then...

Queries sometimes get stuck (T307263), and the Stop button (with the double click trick) seems to be the only workaround. So removing the Stop button altogether does not seem to be the best move?

@MarioGom if the stop function works sometimes, then would it be better to have it appear alongside the submit function, rather than replace the submit function altogether? This is a workaround rather than a fix: the stop button would, for now, still only work sometimes, but the submit button would be available all the time.

Change 789196 had a related patch set uploaded (by Vivian Rook; author: Vivian Rook):

[analytics/quarry/web@master] separate stop and submit buttons

https://gerrit.wikimedia.org/r/789196

I believe https://gerrit.wikimedia.org/r/789196 should allow for the double click workaround to be codified. The submit button is always present, and when a query is running the stop option should appear. The stop button can be pressed; if it works, that should be apparent in about 5 seconds, and if not, the submit button can be pressed to launch a new query. This ticket would remain open to address stop feature failures. Opinions?

Here's the traceback of a 500 error I got:

[2022-05-04 17:25:52,664] ERROR in app: Exception on /api/query/stop [POST]
Traceback (most recent call last):
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/srv/quarry/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "./quarry/web/api.py", line 152, in api_stop_query
    cur.execute("KILL %s;", (result_dictionary["connection_id"]))
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/cursors.py", line 148, in execute
    result = self._query(query)
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/cursors.py", line 310, in _query
    conn.query(q)
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/connections.py", line 548, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/connections.py", line 775, in _read_query_result
    result.read()
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/connections.py", line 1156, in read
    first_packet = self.connection._read_packet()
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/connections.py", line 725, in _read_packet
    packet.raise_for_error()
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/protocol.py", line 221, in raise_for_error
    err.raise_mysql_exception(self._data)
  File "/srv/quarry/venv/lib/python3.7/site-packages/pymysql/err.py", line 143, in raise_mysql_exception
    raise errorclass(errno, errval)
pymysql.err.OperationalError: (1094, 'Unknown thread id: 8295086')

It looks like this means the query has already terminated, as can be seen in this local mysql session:

mysql> select sleep(100);
(in another client)
mysql> kill 21;
Query OK, 0 rows affected (0.00 sec)

mysql> kill 21;
ERROR 1094 (HY000): Unknown thread id: 21

so it appears the backend fix would be to catch this error and tell the client the query has already stopped.
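
A minimal sketch of that idea follows; it is not the patch that was eventually merged. The function name and return values are hypothetical. To keep the example free of third-party imports it catches a generic exception and inspects the first element of `exc.args`, which is where the traceback above shows pymysql placing the MySQL error number; real code would more likely catch `pymysql.err.OperationalError` specifically.

```python
MYSQL_UNKNOWN_THREAD_ID = 1094  # "Unknown thread id", as in the traceback above


def kill_connection(cur, connection_id):
    """Issue KILL for a connection and report what happened.

    Returns "stopped" if the KILL succeeded, or "already_stopped" if the
    server no longer knows the thread. Any other error propagates.
    `cur` is any DB-API style cursor; the names are hypothetical.
    """
    try:
        # Note the trailing comma: the parameters are passed as a tuple.
        cur.execute("KILL %s;", (connection_id,))
    except Exception as exc:
        # The error number is the first element of exc.args.
        if exc.args and exc.args[0] == MYSQL_UNKNOWN_THREAD_ID:
            return "already_stopped"
        raise
    return "stopped"
```

An API handler could then map "already_stopped" to a normal response telling the client the query is no longer running, instead of letting the exception surface as a 500.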

That probably means celery isn't reporting on all the ways a query can terminate, or is somehow missing some. Rather than catching the error when the stop button is pressed, celery should have already reported that the query has terminated. Which is to say, this needs to be fixed, but not at the stop button level.

I suspect (most of?) the underlying problem when the stop button fails is that the celery worker has already died, and thus the stop command has nothing to tell to stop. This is reported in T278583. It does not mean that the job is not actually running in the DB somewhere, but without the celery process running, a wiki result isn't going to update the quarry DB with anything, so the job is, at the least, lost.

I think a reasonable path forward is to merge:
https://gerrit.wikimedia.org/r/c/analytics/quarry/web/+/789196/

Thus returning the submit button to always being available. As such if the stop button works, great, if it fails, the submit button is still there to resubmit the job and abandon the old (still, maybe, running somewhere) job, as things were before the stop button was introduced.

At the same time, the stop button could be updated so that it runs an update on quarry.query_run and sets the status to stopped. This would update the interface to stopped, though it has the potential for an oddity: if the celery job didn't actually stop and later completes, it will update the job with the output of the query and change the status to completed. This doesn't seem like a terrible thing to me. And should the celery job have indeed failed, this would track that the job was manually stopped.
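
The behaviour described above, including the oddity, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the `query_run` columns, the status strings, and the function names are hypothetical, and SQLite stands in for the quarry database.

```python
import sqlite3


def mark_stopped(conn, run_id):
    """What the stop button would do: record the run as stopped directly."""
    conn.execute("UPDATE query_run SET status = 'stopped' WHERE id = ?", (run_id,))
    conn.commit()


def worker_finished(conn, run_id):
    """What a still-alive worker would do on completion, unaware of the stop."""
    conn.execute("UPDATE query_run SET status = 'complete' WHERE id = ?", (run_id,))
    conn.commit()


def status_of(conn, run_id):
    row = conn.execute("SELECT status FROM query_run WHERE id = ?", (run_id,)).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE query_run (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO query_run (id, status) VALUES (1, 'running')")

mark_stopped(conn, 1)
print(status_of(conn, 1))  # prints stopped

# If the worker was in fact still alive, its later update wins.
worker_finished(conn, 1)
print(status_of(conn, 1))  # prints complete
```

If the worker had died, the second update never arrives and the run simply stays recorded as stopped.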

With the above two changes, I believe we will have solved the most cumbersome elements of the stop button. At that point, work should focus on resolving T278583, which should resolve most, if we're lucky all, of the running-forever and failure-to-stop cases.

Opinions?

Change 789196 merged by jenkins-bot:

[analytics/quarry/web@master] separate stop and submit buttons

https://gerrit.wikimedia.org/r/789196

Change 788719 abandoned by Vivian Rook:

[analytics/quarry/web@master] Remove stop query function

Reason:

https://gerrit.wikimedia.org/r/788719

Change 791669 had a related patch set uploaded (by Vivian Rook; author: Vivian Rook):

[analytics/quarry/web@master] Update stop status directly and catch error

https://gerrit.wikimedia.org/r/791669

Change 791669 merged by jenkins-bot:

[analytics/quarry/web@master] Update stop status directly and catch error

https://gerrit.wikimedia.org/r/791669

I believe that with the last merge we will have cleared up the 500 errors, so the user experience should be good on this front now. Background issues mentioned in this ticket are covered in https://phabricator.wikimedia.org/T278583
I'm going to close this ticket, though please re-open if anything comes up.

rook claimed this task.