Page MenuHomePhabricator

[bug] Quarry queries not completing
Open, In Progress, HighPublicBUG REPORT

Description

What happened?

I'm trying to run queries and about 1/2 are successful and the rest just run endlessly without finishing They can run for hours without completing or producing an error message.

If you look at Current query list you'll see all of the incomplete query runs.

What should have happened?

The queries should have completed their runs, usually in a minute or two, producing a page or results.

Event Timeline

One such case: https://quarry.wmcloud.org/query/83084

The query actually completed on the db in 140 seconds but there was a Trove connection reset while updating the status to "success".

sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (ConnectionResetError(104, 'Connection reset by peer'))")
(Background on this error at: https://sqlalche.me/e/14/e3q8)
[2024-06-13 20:21:18,854: ERROR/ForkPoolWorker-17] Task worker.run_query[53f203b6-0cdd-4343-90a3-d1381a7f3d78] raised unexpected: PendingRollbackError('This Session\'s transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (ConnectionResetError(104, \'Connection reset by peer\'))")\n[SQL: UPDATE query_run SET `status`=%(status)s, extra_info=%(extra_info)s WHERE query_run.id = %(query_run_id)s]\n[parameters: {\'status\': 4, \'extra_info\': \'{"resultsets": [{"headers": ["URL"], "rowcount": 2}], "runningtime": "140.66"}\', \'query_run_id\': 874393}]\n(Background on this error at: https://sqlalche.me/e/14/e3q8)')

The failure doesn't reflect on the UI due to the same reason as T365136, which has been fixed but not deployed yet.

As for the cause of actual failure, I suspect the reason is improper connection pooling causing reuse of closed connections. I don't see us setting parameters like min pool size, max pool size and pool idle timeout, which may need reconfiguring.

SD0001 triaged this task as High priority.Fri, Jun 14, 7:14 AM
SD0001 claimed this task.

The fix is live. Please reopen if the problem arises again.

I don't see that there has been any improvement. My queries are still not finishing their run.

See Current queries list

Edit: Quarry has become effectively useless at this point.

I tried https://quarry.wmcloud.org/query/83617 multiple times since T366909 was closed, and it never finished. What can a simple user do to prevent running into this problem? Starting again and again does not work.

Now, there is a little grey box saying "Explain" at the bottom of the page under Query Status (see this Quarry query ) that just produces a table of code I don't understand and isn't helpful at all, at least to me.

There are problems now with bot reports,not being generated on schedule, too, so I think this might be a database problem as well.

Now, there is a little grey box saying "Explain" at the bottom of the page under Query Status (see this Quarry query ) that just produces a table of code I don't understand and isn't helpful at all, at least to me.

That's probably unrelated to this ticket. Is probably a recently added feature.

There are problems now with bot reports,not being generated on schedule, too, so I think this might be a database problem as well.

That's probably unrelated to this ticket. enwiki replag is at 5 hours at the moment (Quarry is 5 hours behind the main SQL database). https://replag.toolforge.org/. That is outside of Quarry's control.

I don't see that there has been any improvement. My queries are still not finishing their run.

With that said, sounds like this bug is not fixed, so re-opening the ticket was a good course of action. SD0001 will see this and hopefully think of something else to try :)

Now, there is a little grey box saying "Explain" at the bottom of the page under Query Status (see this Quarry query ) that just produces a table of code I don't understand and isn't helpful at all, at least to me.

That's probably unrelated to this ticket. Is probably a recently added feature.

It's the EXPLAIN feature that broke in 2018 (T205214) that has now been revived.

For the purpose of this ticket, if clicking Explain gives you a table it means the query is actually running. If not it's stuck and will never complete.

It's surprising that this issue still occurs with pre-ping enabled – Trove connections from the pool are now validated to ensure liveness before being used. Perhaps I don't understand SQLAlchemy well enough... Also, even with Trove errors occurring, the fix from T365136 should have caused them to show up as a failure in the UI. Both fixes not working makes me wonder if there's a problem with the deployment. The updated files are visible in the pod, though.

cc @fnegri @rook - in case you have any ideas.

For my query: https://quarry.wmcloud.org/query/83617

When I wait for about 20 Minutes and click on "Explain": Nothing happens
When I start the query an click on "Explain" immediate after starting, I see some output.

In both cases it seems that the query does never finish.

When I reload the tab, the explain data is gone. And when I click on "Explain" again, no data is shown. So it seems that the query dies somewhen during execution.

But hitting "Explain" fills up the part of the page with a table where the query results should be plus it is in code that only a developer or coder would understand, not us mere editors.

So, it's not helping at all with this problem of endlessly running queries and by filling up the part of the page where the query results should be, I think it is actually impeding a solution here. Where are the results supposed to go, under or over the table or does the table disappear when the query finishes running (if it ever does)?

And looking at the Current Queries list, it looks like some queries are completing but others, like mine, are not, so is it only queries to some databases and not others?

Liz: Explain tells you in which order the query is processed and which index (if there is some) is used. This is an output of the database itself and you may find some (more or less) helpful information on the Website of the mariadb database (or sometimes on mysql pages).

To my query: It did rum successful on barwiki_p (bawarian dialect wiki). So I conclude: The query is working fine in the quarry framework.

Results will go under the explain.

I forget if the explain disappears.

The explain is useful to the sql query writer, because you can use it to see why your query is slow. For example you can see if your query is properly using indexes or not.

Well, maybe it depends on the database, queries aren't running on enwiki_p. But I've noticed that editors have just given up on running Quarry queries since about 2 days ago when it stopped working

If you look through Execution time column on Recent queries list, it actually seems like that results of virtually any query with execution time longer than ~120s will never make it back, currently (stays in running state until manually stopped). Some kind of systematic conn. timeout on BE side? Extremely annoying.

edit: for some queries, manual paging, i.e. submitting smaller batches with explicit offset/limit one by one may help (keeping exec. time of each run within this current de facto limit), e.g.: https://quarry.wmcloud.org/query/68296

The status says this case is open and is a high priority but it's not getting any attention from those who might be in a position to resolve this problem.

I've been told that "complaining" doesn't help but I'm not sure how else to draw attention to this tool malfunction that many of us use on a daily basis and sometimes throughout the day.

@Liz it is getting attention by multiple people, but it's not clear what the problem is. :)

It might also be related to some performance issues that started last Friday on one database (T367778). This is just a hypothesis though, since you opened the task on Thursday. When did you first detect this issue?

fnegri changed the task status from Open to In Progress.Tue, Jun 18, 8:33 AM