Page MenuHomePhabricator

[bug] Quarry queries not completing
Closed, ResolvedPublicBUG REPORT

Description

What happened?

I'm trying to run queries and about 1/2 are successful and the rest just run endlessly without finishing They can run for hours without completing or producing an error message.

If you look at Current query list you'll see all of the incomplete query runs.

What should have happened?

The queries should have completed their runs, usually in a minute or two, producing a page or results.

Event Timeline

One such case: https://quarry.wmcloud.org/query/83084

The query actually completed on the db in 140 seconds but there was a Trove connection reset while updating the status to "success".

sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (ConnectionResetError(104, 'Connection reset by peer'))")
(Background on this error at: https://sqlalche.me/e/14/e3q8)
[2024-06-13 20:21:18,854: ERROR/ForkPoolWorker-17] Task worker.run_query[53f203b6-0cdd-4343-90a3-d1381a7f3d78] raised unexpected: PendingRollbackError('This Session\'s transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (ConnectionResetError(104, \'Connection reset by peer\'))")\n[SQL: UPDATE query_run SET `status`=%(status)s, extra_info=%(extra_info)s WHERE query_run.id = %(query_run_id)s]\n[parameters: {\'status\': 4, \'extra_info\': \'{"resultsets": [{"headers": ["URL"], "rowcount": 2}], "runningtime": "140.66"}\', \'query_run_id\': 874393}]\n(Background on this error at: https://sqlalche.me/e/14/e3q8)')

The failure doesn't reflect on the UI due to the same reason as T365136, which has been fixed but not deployed yet.

As for the cause of actual failure, I suspect the reason is improper connection pooling causing reuse of closed connections. I don't see us setting parameters like min pool size, max pool size and pool idle timeout, which may need reconfiguring.

SD0001 triaged this task as High priority.Jun 14 2024, 7:14 AM
SD0001 claimed this task.

The fix is live. Please reopen if the problem arises again.

I don't see that there has been any improvement. My queries are still not finishing their run.

See Current queries list

Edit: Quarry has become effectively useless at this point.

I tried https://quarry.wmcloud.org/query/83617 multiple times since T366909 was closed, and it never finished. What can a simple user do to prevent running into this problem? Starting again and again does not work.

Now, there is a little grey box saying "Explain" at the bottom of the page under Query Status (see this Quarry query ) that just produces a table of code I don't understand and isn't helpful at all, at least to me.

There are problems now with bot reports,not being generated on schedule, too, so I think this might be a database problem as well.

Now, there is a little grey box saying "Explain" at the bottom of the page under Query Status (see this Quarry query ) that just produces a table of code I don't understand and isn't helpful at all, at least to me.

That's probably unrelated to this ticket. Is probably a recently added feature.

There are problems now with bot reports,not being generated on schedule, too, so I think this might be a database problem as well.

That's probably unrelated to this ticket. enwiki replag is at 5 hours at the moment (Quarry is 5 hours behind the main SQL database). https://replag.toolforge.org/. That is outside of Quarry's control.

I don't see that there has been any improvement. My queries are still not finishing their run.

With that said, sounds like this bug is not fixed, so re-opening the ticket was a good course of action. SD0001 will see this and hopefully think of something else to try :)

Now, there is a little grey box saying "Explain" at the bottom of the page under Query Status (see this Quarry query ) that just produces a table of code I don't understand and isn't helpful at all, at least to me.

That's probably unrelated to this ticket. Is probably a recently added feature.

It's the EXPLAIN feature that broke in 2018 (T205214) that has now been revived.

For the purpose of this ticket, if clicking Explain gives you a table it means the query is actually running. If not it's stuck and will never complete.

It's surprising that this issue still occurs with pre-ping enabled – Trove connections from the pool are now validated to ensure liveness before being used. Perhaps I don't understand SQLAlchemy well enough... Also, even with Trove errors occurring, the fix from T365136 should have caused them to show up as a failure in the UI. Both fixes not working makes me wonder if there's a problem with the deployment. The updated files are visible in the pod, though.

cc @fnegri @rook - in case you have any ideas.

For my query: https://quarry.wmcloud.org/query/83617

When I wait for about 20 Minutes and click on "Explain": Nothing happens
When I start the query an click on "Explain" immediate after starting, I see some output.

In both cases it seems that the query does never finish.

When I reload the tab, the explain data is gone. And when I click on "Explain" again, no data is shown. So it seems that the query dies somewhen during execution.

But hitting "Explain" fills up the part of the page with a table where the query results should be plus it is in code that only a developer or coder would understand, not us mere editors.

So, it's not helping at all with this problem of endlessly running queries and by filling up the part of the page where the query results should be, I think it is actually impeding a solution here. Where are the results supposed to go, under or over the table or does the table disappear when the query finishes running (if it ever does)?

And looking at the Current Queries list, it looks like some queries are completing but others, like mine, are not, so is it only queries to some databases and not others?

Liz: Explain tells you in which order the query is processed and which index (if there is some) is used. This is an output of the database itself and you may find some (more or less) helpful information on the Website of the mariadb database (or sometimes on mysql pages).

To my query: It did rum successful on barwiki_p (bawarian dialect wiki). So I conclude: The query is working fine in the quarry framework.

Results will go under the explain.

I forget if the explain disappears.

The explain is useful to the sql query writer, because you can use it to see why your query is slow. For example you can see if your query is properly using indexes or not.

Well, maybe it depends on the database, queries aren't running on enwiki_p. But I've noticed that editors have just given up on running Quarry queries since about 2 days ago when it stopped working

If you look through Execution time column on Recent queries list, it actually seems like that results of virtually any query with execution time longer than ~120s will never make it back, currently (stays in running state until manually stopped). Some kind of systematic conn. timeout on BE side? Extremely annoying.

edit: for some queries, manual paging, i.e. submitting smaller batches with explicit offset/limit one by one may help (keeping exec. time of each run within this current de facto limit), e.g.: https://quarry.wmcloud.org/query/68296

The status says this case is open and is a high priority but it's not getting any attention from those who might be in a position to resolve this problem.

I've been told that "complaining" doesn't help but I'm not sure how else to draw attention to this tool malfunction that many of us use on a daily basis and sometimes throughout the day.

@Liz it is getting attention by multiple people, but it's not clear what the problem is. :)

It might also be related to some performance issues that started last Friday on one database (T367778). This is just a hypothesis though, since you opened the task on Thursday. When did you first detect this issue?

fnegri changed the task status from Open to In Progress.Jun 18 2024, 8:33 AM

@fnegri: My query had this problem since June 6th.

If you look through Execution time column on Recent queries list, it actually seems like that results of virtually any query with execution time longer than ~120s will never make it back

I had to scroll back to 2 months ago to find a query with an execution time longer than 120s (this query) so it looks like something changed around 2 months ago.

I detected the problem whenever I opened this bug report. I remember filing a different one for Quarry about two weeks ago but this kind of incident, with queries running endlessly and never finishing, only happened starting last week at least for me. You'll notice in my original bug report, I say that about 1/2 of my queries were completing but that soon turned into ZERO queries finishing.

But I appreciate hearing that someone is working on this problem. Without communication, those of us who rely on these tools are kind of left in the dark, not sure if anyone has even seen these messages. Even a message saying "We don't know what the problem is" is preferable to silence. I rhink I also mentioned that many of the bots on en-wiki are also having reporting problems (reports delayed by 6-9 hours) so it might be a problem collecting information from one or more of the databases. Just a guess.

But hey, a few of my queries just issued reports! I don't know what happened since I posted this message but something has changed for the better. Surprised me.

But hey, a few of my queries just issued reports! I don't know what happened since I posted this message but something has changed for the better.

This is likely related to the issues described in T367778, that have improved today (though I'm not sure exactly why :D)

I can still see some queries that remain stuck in "status: Running" forever. It seems the reason is the one described by @SD0001 in a previous comment:

The query actually completed on the db in 140 seconds but there was a Trove connection reset while updating the status to "success".

I was able to reproduce it with a simple SELECT SLEEP(150) in Quarry. Exactly after 150 seconds, Quarry logged an exception while trying to update the status of the query in the Trove database:

[2024-06-19 15:17:50,059: ERROR/ForkPoolWorker-1] Task worker.run_query[33068350-96c8-475e-ae1e-bba25080a10f] raised unexpected: PendingRollbackError('This Session\'s transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (ConnectionResetError(104, \'Connection reset by peer\'))")\n[SQL: UPDATE query_run SET `status`=%(status)s, extra_info=%(extra_info)s WHERE query_run.id = %(query_run_id)s]\n[parameters: {\'status\': 4, \'extra_info\': \'{"resultsets": [{"headers": ["SLEEP(150)"], "rowcount": 1}], "runningtime": "150.06"}\', \'query_run_id\': 876493}]\n(Background on this error at: https://sqlalche.me/e/14/e3q8)')

Well, the queries were running and finishing up for a few hours last night (at least it was night where I was at) but now they are back to not completing at all. But for a few hours, everything was in sync.

@Liz I'm sorry that you're still having issues, I suspect that sometimes your queries take a bit longer to complete, and when that happens you run into the ConnectionResetError described above.

The time a query takes to complete can depend on many factors (e.g. how many other queries are running on the server), but eventually the results should be displayed, whereas right now it seems that the results for slow queries are just getting lost.

I will keep on investigating on what could be causing the issue but it could take some time. @SD0001 if you have other ideas on where the problem could be, let me know.

I think I found the issue: the Trove database used by Quarry had wait_timeout set to 120 seconds, which meant that all idle connections were terminated after 2 minutes.

In Trove, there was a configuration group quarry-db which included a higher value (wait_timeout=28800) but the config group was not attached to the current db instance. I attached it and now I can run queries like SELECT SLEEP(150).

Worth noting that Quarry should ideally not keep the connection to its own database open while a query is running, and it should instead create a new connection (or fetch one from the connection pool) when the query completes and it needs to write the result to its own database. See for example the discussion at https://github.com/sqlalchemy/sqlalchemy/discussions/10639.

I think raising wait_timeout should be enough to fix the issues that have been reported so far, so I will resolve this bug, but please reopen it if you have new examples of queries not completing.