
Quarry query became much slower
Closed, Resolved (Public)

Description

Query https://quarry.wmflabs.org/query/12570 generates rating of ruwiki bots by the number of edits in main namespace in current year. In last few years it always was completed in a few minutes (after creating 'actor' table too), but from yesterday, it runs 30 min and gets killed.
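For illustration only, a query of this kind has roughly the following shape. This is a sketch against ruwiki_p using the standard wiki replica schema; the exact text of query 12570 as it stood on March 18 is not preserved and may differ.

-- Illustrative sketch: ranking of ruwiki bots by main-namespace edits
-- made in the current year. Names follow the wiki replica views.
USE ruwiki_p;
SELECT actor_name AS bot,
       COUNT(*)   AS edits
FROM revision
JOIN actor       ON actor_id = rev_actor
JOIN page        ON page_id = rev_page
JOIN user_groups ON ug_user = actor_user AND ug_group = 'bot'
WHERE page_namespace = 0
  AND rev_timestamp >= '20200101000000'
GROUP BY actor_name
ORDER BY edits DESC;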

Event Timeline

Keep in mind that you are using a shared resource, which is currently being used a lot: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-12h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1011&var-port=9104&fullscreen&panelId=16

https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=labsdb1011&var-datasource=eqiad%20prometheus%2Fops&var-cluster=mysql&fullscreen&panelId=3

So it is entirely possible that your query will hit metadata locking, deadlocks and so forth. The query killer on the host itself is set to 4h, but Quarry kills queries after 30 minutes.
I can see your query has now been running there for around 10 minutes.
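For reference, one quick way to see how long sessions have been running on a given host is the processlist; a minimal example, assuming a direct connection to that host:

-- Sessions that have been executing for more than 10 minutes, longest first.
SELECT id, user, time AS seconds, state, LEFT(info, 80) AS query_snippet
FROM information_schema.processlist
WHERE command = 'Query' AND time > 600
ORDER BY time DESC;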

Could it be the same as T246970?

Yes, most likely. This might be useful for your situation too: T246970#5977729

I have tried your query on the other hosts and it takes around 2-3 minutes to complete. But it is also true that they are way less loaded.
Just a quick command:

[15:11:29] marostegui@cumin1001:~$ sudo cumin 'labsdb*'.eqiad.wmnet 'w'
4 hosts will be targeted:
labsdb[1009-1012].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) labsdb1011.eqiad.wmnet
----- OUTPUT of 'w' -----
 15:11:39 up 25 days, 19:59,  1 user,  load average: 24.52, 28.98, 29.47
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
marosteg pts/0    208.80.154.86    15:07    2:45   0.07s  0.01s sshd: marostegui [priv]
===== NODE GROUP =====
(1) labsdb1009.eqiad.wmnet
----- OUTPUT of 'w' -----
 15:11:39 up 118 days,  9:00,  0 users,  load average: 15.05, 11.90, 9.47
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
===== NODE GROUP =====
(1) labsdb1012.eqiad.wmnet
----- OUTPUT of 'w' -----
 15:11:39 up 121 days,  4:46,  1 user,  load average: 1.29, 1.28, 1.25
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
marosteg pts/0    208.80.154.86    14:36   34:43   0.04s  0.01s sshd: marostegui [priv]
===== NODE GROUP =====
(1) labsdb1010.eqiad.wmnet
----- OUTPUT of 'w' -----
 15:11:39 up 120 days,  8:42,  0 users,  load average: 14.62, 13.13, 9.35

The only thing I can think of to rule out labsdb1011-specific host issues would be to point Quarry to another host for a few days (we'd need to change the query killer, the LB weights and all that) and see what happens.
@bd808 @Bstorm thoughts?

If signs point to labsdb1011 specifically being slower than the other wiki replica hosts, then it seems worth trying to shift load to another host in the pool. Is labsdb1011 currently the only host taking the *.analytics.db.svc.eqiad.wmflabs load?

We can try that, just to rule out host-specific issues. Yes, labsdb1011 is the only one.

labsdb-analytics    5M  IN CNAME    dbproxy1010.eqiad.wmnet.
labsdb-web          5M  IN CNAME    dbproxy1011.eqiad.wmnet.
wikireplica-analytics    5M  IN CNAME    dbproxy1010.eqiad.wmnet.
wikireplica-web          5M  IN CNAME    dbproxy1011.eqiad.wmnet.

And the dbproxy has labsdb1011 as primary and labsdb1010 as secondary. I can try to change that: promote labsdb1010 to be the analytics host, and move labsdb1011 to serve web in place of labsdb1010.
I would need to change the query killers too, as labsdb1010 will be set to 14400 and labsdb1011 to 300.
Is there anything that needs to be done on the Quarry side?
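For what it's worth, one thing that is easy to verify from the Quarry side once the backends are swapped is which physical host a session actually lands on; a minimal check:

-- Shows which backend behind the dbproxy served the current connection.
SELECT @@hostname AS backend_host, @@version AS server_version;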

Another thing we could also try is upgrading labsdb1011 to Buster and MariaDB 10.4.
We recently had some CPU issues with wikidatawiki (after its hosts all got InnoDB compression, like our wikireplicas), and it looked like 10.4 performed better CPU-wise. Again, not 100% sure, but what we saw on the graphs is that the 10.4 host, with the same hardware and the same load balancer weight, used less CPU.

MariaDB 10.1 goes end of life in October, so we have started to test 10.4 on production hosts (T246604) and so far we have not noticed any major regressions or issues.
For next quarter I would like to keep migrating some roles (misc, parsercache, x1, external storage...) to Buster and 10.4, and the wikireplicas are one of them, as they are multi-source. So I would like to include them, and we could start with this host and check whether it makes a difference as well.
10.4 also includes some cool new things we can use to detect slow queries, like the optimizer trace.
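As an illustration, on 10.4 the optimizer trace can be enabled per session and read back from information_schema. A minimal sketch; the SELECT in the middle stands in for whatever slow statement is being investigated:

-- Enable the trace, run the statement under investigation, inspect the trace.
SET optimizer_trace='enabled=on';
SELECT COUNT(*) FROM revision WHERE rev_timestamp >= '20200101000000';
SELECT query, trace FROM information_schema.optimizer_trace;
SET optimizer_trace='enabled=off';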

Change 591320 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] wikireplicas: Set innodb_purge_threads to 10

https://gerrit.wikimedia.org/r/591320

Change 591320 merged by Marostegui:
[operations/puppet@production] wikireplicas: Set innodb_purge_threads to 10

https://gerrit.wikimedia.org/r/591320
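The setting itself is applied through puppet (my.cnf) and on these versions generally needs a restart to take effect. A quick way to verify the value and to watch the purge backlog it is meant to address, assuming the server exposes the Innodb_history_list_length status variable as MariaDB does:

-- Check the configured purge thread count and the InnoDB purge backlog
-- (history list length), which is the lag the extra threads should reduce.
SHOW GLOBAL VARIABLES LIKE 'innodb_purge_threads';
SHOW GLOBAL STATUS LIKE 'Innodb_history_list_length';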

The problem persists. The same query https://quarry.wmflabs.org/query/12570 was killed after 30 min again.

The problem doesn't come from Quarry. Quoting @bd808:

we are down to 2 wiki replica servers right now because of ongoing hardware/software issues with labsdb1011, so I expect everything is slower for everyone. folks are working on labsdb1011 and trying to get it back into rotation

@Framawiki @bd808 Source/context for the issue, please? If there are only 3 servers in total here, are there plans to add more?

@Framawiki @bd808 Source/context for the issue, please?

The current work on labsdb1011 is T249188: Reimage labsdb1011 to Buster and MariaDB 10.4. This is an attempt to see whether newer operating system and MariaDB versions help with the ongoing performance issues that we have seen on that particular physical host.

If there are only 3 servers in total here, are there plans to add more?

All 3 physical servers in the user-facing wiki replicas cluster are due for hardware replacement in the Wikimedia Foundation's fiscal year 2020/2021 (July 2020-June 2021). There are internal discussions about various ways that we might proceed with this replacement, but no concrete plans to share yet. The current cluster design was documented in a blog post when these servers were initially put into public use. Initially running on 2 servers was an incredible improvement in query speed and system stability over the prior generation of hardware and its architecture. In the ~3.5 years that have passed since, query volume (more folks running more queries), data sizes (number of rows stored per wiki, number of wikis), and query complexity have all grown. Data sizes are reaching levels where we are forced to contemplate options beyond simple horizontal (more servers) or vertical ("bigger" servers) scaling of the current design.

Why can't you just move Quarry to another host?

Unfortunately, the hardware we use for Quarry and for all the wikireplicas in general is very specific (and very costly), so we do not have hot spares ready to take over at any moment.

Marostegui claimed this task.

The analytics role is now split between two hosts: replication lag and InnoDB purge lag are under control now. This should have improved Quarry's response time, as the hosts are splitting the load.
I have executed the query that was originally reported as being killed after 30 minutes (https://quarry.wmflabs.org/query/45763) and it now finishes in under a minute.
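For reference, replication lag per section can also be checked from the user side through the heartbeat_p view exposed on the wiki replicas, assuming the documented schema with shard, last_updated and lag columns:

-- Per-section replication lag as seen from the replica itself.
SELECT shard, last_updated, lag
FROM heartbeat_p.heartbeat
ORDER BY lag DESC;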

Going to consider this fixed for now. Hopefully we can keep up this performance.
Thank you all for your patience.

@Marostegui it isn't the original query. I recently optimized it by replacing the views: "revision" => "revision_userindex" and "actor" => "actor_revision". Before that it was killed after 30 minutes. To test the original query you can roll back these replacements.
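Schematically, the substitution described above is just a change of the views being queried; a hypothetical fragment, since the original query text is not preserved:

-- Before: the plain views, where filtering/joining on the user columns
-- cannot use the underlying indexes. After: the user-indexed variants, which can.
SELECT actor_name, COUNT(*) AS edits
FROM revision_userindex                        -- was: revision
JOIN actor_revision ON actor_id = rev_actor    -- was: actor
WHERE rev_timestamp >= '20200101000000'
GROUP BY actor_name;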

I tested the query originally reported on this task, https://quarry.wmflabs.org/query/12570, which went from 500 seconds to 42.

@Marostegui But query https://quarry.wmflabs.org/query/12570 now contains different text from what it contained on March 18, the date this ticket was created. How do you know what text this query contained on March 18? Quarry doesn't store the history of text changes within a query, as far as I know.

Right, I see! Then, yeah, there is no way to retrieve that original one.

I re-ran this query in its old form. It successfully executed in 278 seconds.