
Quarry: Lost connection to MySQL server during query
Closed, Resolved · Public

Description

For the last two days I have been getting the error message quoted above when I try to run https://quarry.wmflabs.org/query/17928. I have been running it on a daily basis for over two years, and it takes about 6–11 minutes to execute. My other query, which normally executes in about a second, runs as expected.

Event Timeline

bd808 renamed this task from Lost connection to MySQL server during query to Quarry: Lost connection to MySQL server during query. Mar 5 2020, 11:31 PM
bd808 added a project: Quarry.
bd808 updated the task description.
bd808 moved this task from Backlog to Wiki replicas on the Data-Services board.

@zhuyifei1999 I see that we are wheel warring on the tags here. The data source may be the Wiki Replicas, but the runtime is Quarry. The error is either related to the query killer on the Wiki Replicas or to general instability of the Wiki Replica servers. That being said, handling server disconnects is first a client problem, and Quarry is the client in this case.

This isn't just a quarry issue: (logs)

2020-03-05 17:56:32 	<AntiComposite> 	I keep getting "Lost connection to MySQL server during query" trying to execute this query: https://quarry.wmflabs.org/query/42724 both through Quarry and through pymysql on k8s

There hasn't been a single query kill from quarry's own killer for about the last two days:

zhuyifei1999@quarry-web-01:~$ tail -n 10000 /var/log/quarry/killer.log | head
2020-03-04 06:03:01,498 pid:7947 Found 1 queries running
2020-03-04 06:03:01,498 pid:7947 Found 0 queries to kill
2020-03-04 06:03:01,498 pid:7947 Finished killer process
2020-03-04 06:04:01,802 pid:8365 Started killer process, with limit 1800
2020-03-04 06:04:01,816 pid:8365 Found 1 queries running
2020-03-04 06:04:01,816 pid:8365 Found 0 queries to kill
2020-03-04 06:04:01,816 pid:8365 Finished killer process
2020-03-04 06:05:02,620 pid:8866 Started killer process, with limit 1800
2020-03-04 06:05:02,637 pid:8866 Found 1 queries running
2020-03-04 06:05:02,637 pid:8866 Found 0 queries to kill
zhuyifei1999@quarry-web-01:~$ tail -n 10000 /var/log/quarry/killer.log | grep 'to kill' | head
2020-03-04 06:03:01,498 pid:7947 Found 0 queries to kill
2020-03-04 06:04:01,816 pid:8365 Found 0 queries to kill
2020-03-04 06:05:02,637 pid:8866 Found 0 queries to kill
2020-03-04 06:06:01,959 pid:8873 Found 0 queries to kill
2020-03-04 06:07:01,319 pid:8880 Found 0 queries to kill
2020-03-04 06:08:01,624 pid:8885 Found 0 queries to kill
2020-03-04 06:09:01,928 pid:8894 Found 0 queries to kill
2020-03-04 06:10:02,248 pid:8900 Found 0 queries to kill
2020-03-04 06:11:01,572 pid:8915 Found 0 queries to kill
2020-03-04 06:12:01,873 pid:8921 Found 0 queries to kill
zhuyifei1999@quarry-web-01:~$ tail -n 10000 /var/log/quarry/killer.log | grep 'to kill' | grep -v 'Found 0 queries' | head
zhuyifei1999@quarry-web-01:~$

Lost connection during query has been on and off since forever and has usually been down to instability; however, Quarry is not the only affected client, bots are affected as well. Yes, Quarry can be made to retry upon a lost connection, but I don't think that's the best way forward.
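
For reference, client-side retrying would look roughly like this (a minimal sketch using pymysql, not Quarry's actual code; the host, database and credentials file are placeholders):

import pymysql

def run_with_retry(sql, host, database, retries=2):
    """Re-run a query if the server drops the connection mid-query.
    Minimal sketch only; host/database/credentials are placeholders."""
    for attempt in range(retries + 1):
        conn = pymysql.connect(host=host, database=database,
                               read_default_file='~/replica.my.cnf')
        try:
            with conn.cursor() as cur:
                cur.execute(sql)
                return cur.fetchall()
        except pymysql.err.OperationalError as exc:
            # 2013 = "Lost connection to MySQL server during query"
            if exc.args[0] != 2013 or attempt == retries:
                raise
        finally:
            conn.close()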

bd808 added a subscriber: Marostegui.

I know that the backend query killer (pt-kill) was set to be more aggressive while the cluster was operating at a lower capacity due to an instance with corrupt data (https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/573855/). That was reverted in https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/576190/ and things should be back to normal now.

I actually don't know where to look on the Wiki Replica setup for query killer logs, but hopefully @Marostegui can give us some clues about where to look next.

The query killer is indeed back to its normal running times:
300 seconds for the web service (labsdb1009 and labsdb1010) and 14400 seconds for analytics (labsdb1011):

4 hosts will be targeted:
labsdb[1009-1012].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) labsdb1011.eqiad.wmnet
----- OUTPUT of 'ps aux | grep wm...l | grep -v grep' -----
wmf-pt-+ 26794  0.0  0.0 119064 39036 ?        Ss   Mar03   1:40 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 14400 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /run/mysqld/mysqld.sock F=/dev/null
===== NODE GROUP =====
(1) labsdb1010.eqiad.wmnet
----- OUTPUT of 'ps aux | grep wm...l | grep -v grep' -----
wmf-pt-+  2578  0.0  0.0 124932 46320 ?        Ss   Mar03   0:33 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /run/mysqld/mysqld.sock F=/dev/null
===== NODE GROUP =====
(1) labsdb1012.eqiad.wmnet
----- OUTPUT of 'ps aux | grep wm...l | grep -v grep' -----
wmf-pt-+ 11528  0.0  0.0  87636 18100 ?        Ss   Mar02   0:33 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 14400 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /run/mysqld/mysqld.sock F=/dev/null
===== NODE GROUP =====
(1) labsdb1009.eqiad.wmnet
----- OUTPUT of 'ps aux | grep wm...l | grep -v grep' -----
wmf-pt-+  6731  0.0  0.0 117600 39040 ?        Ss   Mar03   0:33 perl /usr/bin/wmf-pt-kill --daemon --print --kill --victims all --interval 10 --busy-time 300 --match-command Query|Execute --match-user ^[spu][0-9] --log /var/log/wmf-pt-kill/wmf-pt-kill.log -S /run/mysqld/mysqld.sock F=/dev/null

The query killer logs live at /var/log/wmf-pt-kill on each replica.
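
For reference, pulling the KILL entries out of that log is just a matter of matching the line format quoted further down in this task (a quick sketch, assuming that format; the path is the one given above):

import re

# Matches wmf-pt-kill log lines such as:
# "# 2020-03-04T14:47:19 KILL 26700276 (Query 301 sec) SELECT ..."
KILL_RE = re.compile(r'^# (\S+) KILL (\d+) \(Query (\d+) sec\)')

def killed_queries(path='/var/log/wmf-pt-kill/wmf-pt-kill.log'):
    """Yield (timestamp, thread_id, seconds) for each kill recorded in the log."""
    with open(path) as log:
        for line in log:
            match = KILL_RE.match(line)
            if match:
                timestamp, thread_id, seconds = match.groups()
                yield timestamp, int(thread_id), int(seconds)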

I have run the query at https://quarry.wmflabs.org/query/8938 and I had to kill it after 1200 seconds.
Looking at the logs, I can see the query being killed for going over 300 seconds.
This is on labsdb1009:

# 2020-03-04T14:47:19 KILL 26700276 (Query 301 sec) SELECT
  page_title,
  page_len,
  cat_pages,
  rev_timestamp
  ,rev_actor
  /* ,rev_user_text */
FROM revision
JOIN
(SELECT
   page_id,
   page_title,
   page_len,
   cat_pages
 FROM category
 RIGHT JOIN page
 ON cat_title = page_title
 LEFT JOIN categorylinks
 ON page_id = cl_from
 WHERE cl_from IS NULL
 AND page_namespace = 14
 AND page_is_redirect = 0) AS pagetmp
ON rev_page = pagetmp.page_id
AND rev_timestamp = (SELECT
                       MAX(rev_timestamp)
                     FROM revision AS last
                     WHERE last.rev_page = pagetmp.page_id)

Even when I ran the query on a non-loaded host (labsdb1012), I had to kill it after 1000 seconds.

So if the query is taking this long, it had better run on the labsdb-analytics replica (labsdb1011), whose query killer limit is 14400 seconds.

Regarding https://quarry.wmflabs.org/query/17928: that query also takes longer than the 300-second limit the web service has, and gets killed:

# 2020-03-06T03:48:15 KILL 27243067 (Query 306 sec) SELECT      CONCAT('File:', img_name) AS 'File name', DATE_FORMAT(img_timestamp, '%Y-%m-%d %T') AS 'Date', img_size AS 'Size',
                CONCAT('User:', actor_name) AS 'Uploader\'s name' FROM image
        INNER JOIN actor ON img_actor = actor_id
        WHERE img_timestamp>=TIMESTAMP('2020-03-02 00:00:00') AND (img_minor_mime='png' OR img_minor_mime='svg+xml')
                AND img_name REGEXP '([FfVv]lag|[Bb]andei?ra|[Dd]rapeau|[Vv]lajka|[Ee]nsign|[Zz]astava|[??]???)'
        ORDER BY img_minor_mime DESC, img_timestamp DESC

Hmm, I can see that Quarry is indeed running against web rather than analytics.

@Bstorm I see in the SAL on 2019-06-28:

14:37 bstorm_: changed to web replica for database queries and restarted celery workers

Do you remember what that was? Can I revert to analytics?

> Can I revert to analytics?

Yes. No matter why it was switched 8 months ago (?) it should be running against the analytics replicas today.

Mentioned in SAL (#wikimedia-cloud) [2020-03-06T19:32:03Z] <zhuyifei1999_> changed to analytics replica for database queries and restarted celery workers T246970

> Mentioned in SAL (#wikimedia-cloud) [2020-03-06T19:32:03Z] <zhuyifei1999_> changed to analytics replica for database queries and restarted celery workers T246970

https://quarry.wmflabs.org/query/40539 now seems to run for longer, but still ends with "Lost connection to MySQL server during query".

> Mentioned in SAL (#wikimedia-cloud) [2020-03-06T19:32:03Z] <zhuyifei1999_> changed to analytics replica for database queries and restarted celery workers T246970
>
> https://quarry.wmflabs.org/query/40539 now seems to run for longer, but still ends with "Lost connection to MySQL server during query".

It is likely that the query is running for more than 4h then. I am executing it on an idle labs replica under root user, to see how long it takes to actually finish on a non loaded host.

> Mentioned in SAL (#wikimedia-cloud) [2020-03-06T19:32:03Z] <zhuyifei1999_> changed to analytics replica for database queries and restarted celery workers T246970
>
> https://quarry.wmflabs.org/query/40539 now seems to run for longer, but still ends with "Lost connection to MySQL server during query".

For what it's worth, I don't see that query being killed in the query killer logs.
I see that query running now on labsdb1011 (analytics), and it has been running for around 5 minutes as of now.

> It is likely that the query is running for more than 4h then. I am executing it on an idle labs replica under root user, to see how long it takes to actually finish on a non loaded host.

It's a query I run regularly; it normally takes between 15 and 30 minutes depending on the server load, and sometimes longer if the servers are busy, but then it normally just gets killed by Quarry's 30-minute limit.

> It is likely that the query is running for more than 4h then. I am executing it on an idle labs replica under root user, to see how long it takes to actually finish on a non loaded host.
>
> It's a query I run regularly; it normally takes between 15 and 30 minutes depending on the server load, and sometimes longer if the servers are busy, but then it normally just gets killed by Quarry's 30-minute limit.

Ah, I didn't know there was a limit on Quarry itself. Then you've got your answer :-)
The query times will probably depend on how loaded the host is, so they can be different each time.
I am running it on a non-loaded host, so we can see how long it takes there for real.

> It is likely that the query is running for more than 4h then. I am executing it on an idle labs replica under root user, to see how long it takes to actually finish on a non loaded host.
>
> It's a query I run regularly; it normally takes between 15 and 30 minutes depending on the server load, and sometimes longer if the servers are busy, but then it normally just gets killed by Quarry's 30-minute limit.
>
> Ah, I didn't know there was a limit on Quarry itself. Then you've got your answer :-)
> The query times will probably depend on how loaded the host is, so they can be different each time.
> I am running it on a non-loaded host, so we can see how long it takes there for real.

If quarry kills it, then the page reports 'killed'. It's still reporting 'Lost connection to MySQL server during query' instead, though, so the issue remains. How did the test run go?

It took 22 minutes, so it is entirely possible that on a normal host it might take more than 30 minutes.

As I said, I haven't found that query being killed on the analytics host.

I don't see quarry's killer doing anything. The last command at T246970#5946798 still yields nothing

> I don't see quarry's killer doing anything. The last command at T246970#5946798 still yields nothing

I found that the killer's host, quarry-web-01, was connected to the web replica instead of analytics. I changed that, and:

zhuyifei1999@quarry-web-01:~$ tail -n 10000 /var/log/quarry/killer.log | grep 'to kill' | grep -v 'Found 0 queries'
2020-03-07 20:19:01,720 pid:18336 Found 4 queries to kill
2020-03-07 20:22:01,636 pid:18366 Found 2 queries to kill
2020-03-07 20:24:01,280 pid:18381 Found 1 queries to kill
2020-03-07 20:29:02,002 pid:18429 Found 1 queries to kill
2020-03-07 20:36:01,571 pid:19399 Found 1 queries to kill
2020-03-07 20:37:01,913 pid:19407 Found 1 queries to kill
2020-03-07 20:44:02,173 pid:19464 Found 1 queries to kill
2020-03-07 20:52:02,061 pid:19544 Found 1 queries to kill
2020-03-07 21:14:01,775 pid:20635 Found 2 queries to kill
zhuyifei1999@quarry-web-01:~$ tail -n 10000 /var/log/quarry/killer.log | grep -P 'Found [1-9] queries to kill' -B 1 -A 1
2020-03-07 20:19:01,720 pid:18336 Found 15 queries running
2020-03-07 20:19:01,720 pid:18336 Found 4 queries to kill
2020-03-07 20:19:01,723 pid:18336 Killed query with thread_id:75789902
--
2020-03-07 20:22:01,635 pid:18366 Found 15 queries running
2020-03-07 20:22:01,636 pid:18366 Found 2 queries to kill
2020-03-07 20:22:01,636 pid:18366 Killed query with thread_id:76654995
--
2020-03-07 20:24:01,279 pid:18381 Found 15 queries running
2020-03-07 20:24:01,280 pid:18381 Found 1 queries to kill
2020-03-07 20:24:01,280 pid:18381 Killed query with thread_id:76664589
--
2020-03-07 20:29:02,002 pid:18429 Found 9 queries running
2020-03-07 20:29:02,002 pid:18429 Found 1 queries to kill
2020-03-07 20:29:02,003 pid:18429 Killed query with thread_id:76691548
--
2020-03-07 20:36:01,571 pid:19399 Found 8 queries running
2020-03-07 20:36:01,571 pid:19399 Found 1 queries to kill
2020-03-07 20:36:01,572 pid:19399 Killed query with thread_id:76744265
--
2020-03-07 20:37:01,912 pid:19407 Found 8 queries running
2020-03-07 20:37:01,913 pid:19407 Found 1 queries to kill
2020-03-07 20:37:01,913 pid:19407 Killed query with thread_id:76745402
--
2020-03-07 20:44:02,173 pid:19464 Found 13 queries running
2020-03-07 20:44:02,173 pid:19464 Found 1 queries to kill
2020-03-07 20:44:02,174 pid:19464 Killed query with thread_id:76794670
--
2020-03-07 20:52:02,061 pid:19544 Found 6 queries running
2020-03-07 20:52:02,061 pid:19544 Found 1 queries to kill
2020-03-07 20:52:02,063 pid:19544 Killed query with thread_id:76760646
--
2020-03-07 21:14:01,775 pid:20635 Found 7 queries running
2020-03-07 21:14:01,775 pid:20635 Found 2 queries to kill
2020-03-07 21:14:01,775 pid:20635 Killed query with thread_id:76980145

But the issue still holds: quarry's own killer did nothing yesterday, and the connection was lost.

I'm now getting the normal 'killed' message for going over 30 minutes, rather than the MySQL error. So perhaps things are now back to normal?

> I'm now getting the normal 'killed' message for going over 30 minutes, rather than the MySQL error. So perhaps things are now back to normal?

Probably the only pending thing left was:

> I found that the killer's host, quarry-web-01, was connected to the web replica instead of analytics. I changed that, and:

@zhuyifei1999: The issue does not seem to be solved. For the last four days I have been continuously getting

Query status: failed
Error

Sort aborted: Query execution was interrupted

or, less often,

Query status: killed
This query took longer than 30 minutes to execute and was killed.

The query was executing for too long then.

> I'm now getting the normal 'killed' message for going over 30 minutes, rather than the MySQL error. So perhaps things are now back to normal?

I'm still getting the 'killed' message; I guess the server loads are still high? Daft question: if the server limit is 240 minutes, why is the Quarry cut-off 30 minutes? Could that be adjusted to, say, 60 minutes, which would give bigger queries a better chance of completing? (Which would save server time, as they wouldn't be restarted so often...)

@zhuyifei1999: But why? The query used to execute in 11 minutes max. Is it a congestion issue, as Mike Peel suspects?

> @zhuyifei1999: But why? The query used to execute in 11 minutes max. Is it a congestion issue, as Mike Peel suspects?

I don't know. I don't know what the replica servers do exactly to execute the query. All quarry does is send the query to the replica servers, wait for results, and display the results.

I can increase the timeout, but I don't want to do it lightly. It does not scale. If next year there's a ticket saying 'oh, I have a query that used to run for an hour, now it times out', do I increase it to two hours? What if later we get 'oh, I have a query that used to run for two hours, now it times out'?

Or, I could completely get rid of Quarry's lower limits and just use the replicas' time limit. This has a problem: there's a higher barrier of entry for the traditional methods of connecting to the replica servers than for using Quarry. Quarry also archives results indefinitely. Meaning, there are often people who do SELECT * FROM page;, which crashes Quarry (T172086). A query time limit and an OOM limit are some sort of guard against these. Without a lower limit, I'm not sure I'm doing anti-abuse correctly.
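
(For context, the per-query time limit boils down to a periodic job roughly like the following; this is a conceptual sketch, not Quarry's actual killer code, and the column positions assume the standard SHOW PROCESSLIST layout:)

import pymysql

TIME_LIMIT = 1800  # seconds, matching the "limit 1800" in the killer log earlier in this task

def kill_long_queries(conn):
    """Find this user's queries running over the limit and KILL them.
    Conceptual sketch only, not Quarry's actual implementation."""
    with conn.cursor() as cur:
        # SHOW PROCESSLIST columns: Id, User, Host, db, Command, Time, State, Info
        cur.execute("SHOW PROCESSLIST")
        rows = cur.fetchall()
        running = [r for r in rows if r[4] == 'Query']
        to_kill = [r for r in running if r[5] > TIME_LIMIT]
        print("Found %d queries running" % len(running))
        print("Found %d queries to kill" % len(to_kill))
        for thread_id, *_ in to_kill:
            cur.execute("KILL QUERY %d" % thread_id)
            print("Killed query with thread_id:%d" % thread_id)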

The replicas belong to Data-Services, so questions about why exactly queries are running slowly (as opposed to why Quarry is broken) are better directed to the DBAs.

Why are you using that query?

Isn't it a convoluted way of doing:

SELECT
   page_title,
   page_len,
   cat_pages,
   rev_timestamp,
   rev_actor
   /* ,rev_user_text */
   FROM category RIGHT JOIN page ON (cat_title = page_title) LEFT JOIN categorylinks
   ON (page_id = cl_from) JOIN revision ON (rev_page = page_id) WHERE cl_from IS NULL
   AND page_namespace = 14
   AND page_is_redirect = 0;

?

which seems to simply fetch a list of uncategorized category pages.

That still takes ages, though.

> @zhuyifei1999: But why? The query used to execute in 11 minutes max. Is it a congestion issue, as Mike Peel suspects?

It could definitely be a congestion issue. The analytics wiki replica handles very long-running queries, which are, by nature, very complex. So it is possible that your query now has to share resources with more tools/queries, and hence there could be some degradation in your query's runtime.
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&from=now-24h&to=now&var-server=labsdb1011&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&fullscreen&panelId=3

My advice, if doable, would be to split the query into smaller but faster ones (again, not sure if that's doable from your side).
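
For the image query above, that could mean running it in day-sized chunks of img_timestamp and combining the results, e.g. (a rough sketch only; the connection details are placeholders and the actor/REGEXP parts of the real query are left out):

import datetime
import pymysql

# Sketch: run the img_timestamp filter in one-day chunks instead of one big range.
# REPLICA_HOST and the credentials file are placeholders; adjust for your setup.
conn = pymysql.connect(host='REPLICA_HOST', database='commonswiki_p',
                       read_default_file='~/replica.my.cnf')

start = datetime.date(2020, 3, 2)
rows = []
with conn.cursor() as cur:
    for offset in range(7):  # one week, one day per query
        lo = start + datetime.timedelta(days=offset)
        hi = lo + datetime.timedelta(days=1)
        cur.execute(
            "SELECT img_name, img_timestamp, img_size FROM image"
            " WHERE img_timestamp >= %s AND img_timestamp < %s"
            " AND img_minor_mime IN ('png', 'svg+xml')",
            (lo.strftime('%Y%m%d%H%M%S'), hi.strftime('%Y%m%d%H%M%S')),
        )
        rows.extend(cur.fetchall())
conn.close()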

Looking at CPU usage at https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&from=now-30d&to=now I can't see anything obvious that would explain things running longer. An exception might be an-launcher1001, which shows repeated CPU usage spikes since about the same start time.

I'm not sure what the best way forward here is now. Options might be:

  1. File a different bug about analytics cluster lag?
  2. @zhuyifei1999 Perhaps there could be some sort of a trusted user set on quarry that can run things for longer?
  3. Is there a way to request more direct access to the replicas, ideally with an example of how to run a MySQL query and output a CSV of the results, so that for these queries we could do that instead of using quarry?

(The query I run is what enables the continued deployment of {{Wikidata Infobox}} on Commons - I haven't been able to do that for 2 weeks now and would quite like to resume it!)

> @zhuyifei1999 Perhaps there could be some sort of a trusted user set on quarry that can run things for longer?

How do you want such a list to be made? I obviously can't make it an RFA sort of thing, since Quarry has no admin web interface.

> @zhuyifei1999 Perhaps there could be some sort of a trusted user set on quarry that can run things for longer?
>
> How do you want such a list to be made? I obviously can't make it an RFA sort of thing, since Quarry has no admin web interface.

If you'd be willing to do this, I'd suggest a simple request system at https://www.mediawiki.org/wiki/Talk:Quarry that you'd then decide yes/no on based on past queries/on-wiki activity.

> a simple request system at https://www.mediawiki.org/wiki/Talk:Quarry

I don't like the idea of flooding a help page with access requests (or perhaps there will be few?). Or, thinking about things like the beta cluster, many of those access requests are handled through Phab; perhaps we could just use Phab?

However, I don't want to be doing the reviews if there are more than a few; I'm too busy too frequently.

I expect that there would be few requests. Phab would also work.

> Is there a way to request more direct access to the replicas, ideally with an example of how to run a MySQL query and output a CSV of the results, so that for these queries we could do that instead of using quarry?
>
> (The query I run is what enables the continued deployment of {{Wikidata Infobox}} on Commons - I haven't been able to do that for 2 weeks now and would quite like to resume it!)

I would be happy to help write some how-to/tutorial documentation on using either platform to run a query and get the results in some exportable format. Working with someone who has questions of their own would help me write a better tutorial, by helping me understand the "beginner's mind" questions that folks are likely to have.
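
In the meantime, the core of it on Toolforge looks roughly like this (a sketch rather than polished documentation; the host name follows the usual <db>.analytics.db.svc.eqiad.wmflabs convention and ~/replica.my.cnf holds the replica credentials, so double-check both for your setup):

import csv
import pymysql

# Sketch: run a query against a Wiki Replica from Toolforge and save the results as CSV.
conn = pymysql.connect(
    host='commonswiki.analytics.db.svc.eqiad.wmflabs',  # assumed per the usual naming scheme
    database='commonswiki_p',
    read_default_file='~/replica.my.cnf',
)

with conn.cursor() as cur:
    cur.execute("SELECT page_title, page_len FROM page"
                " WHERE page_namespace = 14 AND page_is_redirect = 0 LIMIT 100")
    with open('results.csv', 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(col[0] for col in cur.description)  # header row from column names
        for row in cur.fetchall():
            # replica columns often come back as bytes; decode them for the CSV
            writer.writerow(v.decode('utf-8') if isinstance(v, bytes) else v for v in row)

conn.close()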

Not sure if you were looking at the right side; the host involved here is labsdb1011, which has a high CPU usage.

@Marostegui I saw that, but it seems to have had high CPU usage for a long time; I couldn't see an obvious change in behaviour earlier this month (my query, at least, ran fine on 28 February). Plus, those are the metrics for only one host; I thought it was a cluster?

@bd808 Thanks, both of those options actually look promising, in particular PAWS seems to have a useful example at https://paws-public.wmflabs.org/paws-public/User:YuviPanda/examples/revisions-sql.ipynb for how to do it (and presumably similar code can be run on toolforge). I'll try to give that a go soon. If you do write some documentation on this, I'd be happy to give feedback on it - I know python, so I'm not completely a beginner, but I don't know the specifics of the Wikimedia environment.

> Not sure if you were looking at the right side; the host involved here is labsdb1011, which has a high CPU usage.
>
> @Marostegui I saw that, but it seems to have had high CPU usage for a long time; I couldn't see an obvious change in behaviour earlier this month (my query, at least, ran fine on 28 February). Plus, those are the metrics for only one host; I thought it was a cluster?

We haven't changed anything on the replicas themselves. It could just be more activity on the host, heavier queries, etc. :-(
As far as I know quarry only points to one host.

I sort of have my query running on paws now (at https://paws.wmflabs.org/paws/user/Mike_Peel/notebooks/Query%20for%20Wikidata%20Infobox.ipynb if you can access that) - it fetches 100 results at a time, with an offset for the next query. I'm running through those results to add the infobox on commons, but it's a painful process (and it's probably costing more server time than usual). There must be a better way of doing this.
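
(For reference, that kind of loop amounts to roughly the following; a sketch rather than the actual notebook code, assuming conn is an open pymysql connection as in the earlier snippets:)

# Sketch of a LIMIT/OFFSET paging loop like the one described above.
# Assumes `conn` is an open pymysql connection to the commonswiki replica.
BATCH = 100
offset = 0
while True:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT page_title FROM page"
            " WHERE page_namespace = 14 AND page_is_redirect = 0"
            " ORDER BY page_id LIMIT %s OFFSET %s",
            (BATCH, offset),
        )
        batch = cur.fetchall()
    if not batch:
        break
    # ... process this batch of up to 100 titles ...
    offset += BATCH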

Hmm, Quarry is still not completing the query, and now paws has stopped working. :-(

labsdb1011 is suffering badly lately, it seems. I see some rough replag and the connection errors spike periodically on that host https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1011&var-port=9104

That's the one that has crashed repeatedly as well (and is the one that is queried by quarry at this time IIRC).

I can also point out that I just tried to query a table in wikidatawiki with a simple select * from <blah> limit 1 and it just hung. It was a full view, not a quirky joined view. Is s4 wikidata?

I just found that I got the same result when I did it against the underlying table locally. @Marostegui I think there is a problem with the wikidatawiki database on the analytics replica (labsdb1011). The same query works fine on the other replicas.

labsdb1011 has been running a big alter table in the last few hours, so that's why it has been lagging s4 a bit - s4 is commons but it of course affects the whole server performance.
Anyway, labsdb1011 is super loaded all the time: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=labsdb1011&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&from=now-24h&to=now&fullscreen&panelId=3

The idea is to upgrade to 10.4 and Buster (T247978#5986536), I want to start with that next week.

Keep in mind that labsdb1011 was "recently" fully recloned from labsdb1012, so I don't think there are any other issues there apart from the big load this server suffers from.

Once I have depooled it, I can also try to run a disk benchmark, to see if the disks are performing at the same speed.

This query takes around 25 minutes to execute on an idle host, so on a normally loaded host it is perfectly possible that it would take longer; the reason it is being killed is the 30-minute limit Quarry has.

I'm now running my query (https://quarry.wmflabs.org/query/40539) directly on toolforge via the grid engine, and I'm still getting "pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')" (after running for ~2 hours). Should I open a separate ticket for that? Something doesn't seem right here still, though.

The query killer is set to 2 hours instead of 4, as we are still troubleshooting the affected server.

OK, I think I'm just wasting CPU time by trying to run this query at the moment. I'll pause {{Wikidata Infobox}} deployment on Commons until things are running better. Best of luck with the debugging, thank you for your work, and please comment on this ticket when the situation is better.

Thank you for your understanding

Marostegui claimed this task.

We have split the analytics role between two hosts, which looks like it is helping with the load and keeping the replication lag, as well as the InnoDB purge lag, under control.
I have executed the originally reported query (https://quarry.wmflabs.org/query/17928) at https://quarry.wmflabs.org/query/45764, and it completed in 750 seconds.

Going to consider this fixed. Thanks a lot everyone for the understanding and patience while we deal with this service degradation.

I am still getting the same error at https://quarry.wmflabs.org/query/40539 . Trying on toolforge now...

> I am still getting the same error at https://quarry.wmflabs.org/query/40539 . Trying on toolforge now...

I ran that query on an idle host and it took 25 minutes, which is right on the edge of the 30-minute limit Quarry has, so given that those hosts are shared, it is entirely possible that it goes over 30 minutes and gets killed by Quarry.

> I ran that query on an idle host and it took 25 minutes, which is right on the edge of the 30-minute limit Quarry has, so given that those hosts are shared, it is entirely possible that it goes over 30 minutes and gets killed by Quarry.

OK, it now completes on toolforge. I'll keep using that from now on, rather than quarry. I've just set pi bot running through the ~50k category backlog to add infoboxes! Thanks for adding the new host.