Query https://quarry.wmflabs.org/query/12570 generates a ranking of ruwiki bots by the number of edits in the main namespace in the current year. In the last few years it always completed in a few minutes (even after the 'actor' table was introduced), but since yesterday it runs for 30 minutes and gets killed.
Description
Details
| Subject | Repo | Branch | Lines +/- |
|---|---|---|---|
| wikireplicas: Set innodb_purge_threads to 10 | operations/puppet | production | +1 -0 |
Related Objects
- Mentioned In
  - T262757: SQL requests to DB replicas became work much slower, both from Quarry and from process on Toolforge
  - T251719: Quarry or the Analytics wikireplicas role creates lots of InnoDB Purge Lag
  - T249188: Reimage labsdb1011 to Buster and MariaDB 10.4
  - T246970: Quarry: Lost connection to MySQL server during query
- Mentioned Here
  - T249188: Reimage labsdb1011 to Buster and MariaDB 10.4
  - T246604: Install 1 buster+10.4 host per section
  - T246970: Quarry: Lost connection to MySQL server during query
Event Timeline
Keep in mind that you are using a shared resource, which is currently being used a lot: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-12h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1011&var-port=9104&fullscreen&panelId=16
So it is entirely possible that your query will hit metadata locking, deadlocks and so forth. The query killer is set to 4 hours on the host itself, but Quarry kills queries after 30 minutes.
I can see your query running there now; it has been going for around 10 minutes.
I have tried your query on the other hosts and it takes around 2-3 minutes to complete. But it is also true that they are way less loaded.
Just a quick command:
```
[15:11:29] marostegui@cumin1001:~$ sudo cumin 'labsdb*'.eqiad.wmnet 'w'
4 hosts will be targeted: labsdb[1009-1012].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) labsdb1011.eqiad.wmnet
----- OUTPUT of 'w' -----
 15:11:39 up 25 days, 19:59,  1 user,  load average: 24.52, 28.98, 29.47
USER     TTY      FROM           LOGIN@   IDLE   JCPU   PCPU WHAT
marosteg pts/0    208.80.154.86  15:07    2:45   0.07s  0.01s sshd: marostegui [priv]
===== NODE GROUP =====
(1) labsdb1009.eqiad.wmnet
----- OUTPUT of 'w' -----
 15:11:39 up 118 days,  9:00,  0 users,  load average: 15.05, 11.90, 9.47
USER     TTY      FROM           LOGIN@   IDLE   JCPU   PCPU WHAT
===== NODE GROUP =====
(1) labsdb1012.eqiad.wmnet
----- OUTPUT of 'w' -----
 15:11:39 up 121 days,  4:46,  1 user,  load average: 1.29, 1.28, 1.25
USER     TTY      FROM           LOGIN@   IDLE   JCPU   PCPU WHAT
marosteg pts/0    208.80.154.86  14:36   34:43   0.04s  0.01s sshd: marostegui [priv]
===== NODE GROUP =====
(1) labsdb1010.eqiad.wmnet
----- OUTPUT of 'w' -----
 15:11:39 up 120 days,  8:42,  0 users,  load average: 14.62, 13.13, 9.35
```
The only thing I can think of to rule out labsdb1011-specific host issues would be to point Quarry to another host for a few days (we'd need to change the query killer, the LB weights and all that) and see what happens.
@bd808 @Bstorm thoughts?
If signs are pointing to labsdb1011 specifically being slower than other wiki replica hosts then it seems worth trying shifting load to another host in the pool. Is labsdb1011 currently the only host taking the *.analytics.db.svc.eqiad.wmflabs load?
We can try that, just to rule out host-specific issues. labsdb1011 is the only one, yeah.
```
labsdb-analytics       5M IN CNAME dbproxy1010.eqiad.wmnet.
labsdb-web             5M IN CNAME dbproxy1011.eqiad.wmnet.
wikireplica-analytics  5M IN CNAME dbproxy1010.eqiad.wmnet.
wikireplica-web        5M IN CNAME dbproxy1011.eqiad.wmnet.
```
And dbproxy1010 has labsdb1011 as primary and labsdb1010 as secondary. I can try to swap that: promote labsdb1010 to be the analytics host, and move labsdb1011 to serve web traffic in place of labsdb1010.
I would need to change the query killers too, as labsdb1010 would need to be set to 14400 seconds and labsdb1011 to 300.
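Whatever tool enforces those limits, the sessions it would target can be inspected directly on a host. A hedged sketch (the actual WMF query-killer tooling is not shown in this task) using the standard `information_schema.PROCESSLIST` table:

```sql
-- List queries that have exceeded the analytics-tier limit of 14400 seconds.
-- Illustrative only; the real query killer is a separate daemon, not this query.
SELECT id, user, time AS seconds_running, LEFT(info, 80) AS query_head
FROM information_schema.PROCESSLIST
WHERE command = 'Query'
  AND time > 14400
ORDER BY time DESC;
```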
Is there something to do from Quarry side?
Another thing we can also try is upgrading labsdb1011 to Buster and MariaDB 10.4
We recently had some CPU issues with wikidatawiki (after the hosts all got InnoDB-compressed, like our wikireplicas) and it looked like 10.4 performed better CPU-wise. Again, not 100% sure, but what we saw on the graphs is that the 10.4 host, having the same hardware and the same load balancer weight, used less CPU.
MariaDB 10.1 goes end of life in October, so we have started to test 10.4 with production hosts (T246604) and so far we've not noticed any major regression or issues.
For next quarter I would like to keep migrating some roles (misc, parsercache, x1, external storage...) to Buster and 10.4, and wikireplicas are one of them, as they are multi-source. So I would like to include them, and we could start with this host and check whether it makes a difference as well.
Also, it includes some cool new things we can use to detect slow queries, like the optimizer trace.
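As an illustration of that feature (a sketch, not taken from this task): in MariaDB 10.4 the optimizer trace is toggled per session and read back from `information_schema`:

```sql
-- Enable tracing for this session (MariaDB 10.4+).
SET optimizer_trace = 'enabled=on';

-- Run the slow statement; a placeholder query stands in for the real one.
SELECT COUNT(*) FROM revision WHERE rev_timestamp >= '20200101000000';

-- Read the recorded trace for the last traced statement, then switch tracing off.
SELECT * FROM information_schema.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';
```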
Change 591320 had a related patch set uploaded (by Jcrespo; owner: Jcrespo):
[operations/puppet@production] wikireplicas: Set innodb_purge_threads to 10
Change 591320 merged by Marostegui:
[operations/puppet@production] wikireplicas: Set innodb_purge_threads to 10
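For reference, `innodb_purge_threads` is not a dynamic variable on these MariaDB versions, so the merged change only takes effect via the generated my.cnf plus a restart. It can then be confirmed with a routine check (not quoted from this task):

```sql
-- Confirm the setting was picked up after the restart; the expected value is 10.
SHOW GLOBAL VARIABLES LIKE 'innodb_purge_threads';
```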
The problem persists. The same query https://quarry.wmflabs.org/query/12570 was killed after 30 min again.
The problem doesn't come from Quarry. Quoting @bd808:
we are down to 2 wiki replica servers right now because of ongoing hardware/software issues with labsdb1011, so I expect everything is slower for everyone. folks are working on labsdb1011 and trying to get it back into rotation
@Framawiki @bd808 Source/context for the issue, please? If there are only 3 servers in total here, are there plans to add more?
The current work on labsdb1011 is T249188: Reimage labsdb1011 to Buster and MariaDB 10.4. This is an attempt to see if newer operating system and mariadb versions help out with the ongoing performance issues that we have seen on that particular physical host.
If there are only 3 servers in total here, are there plans to add more?
All 3 physical servers in the user-facing wiki replicas cluster are due for hardware replacement in the Wikimedia Foundation's fiscal year 2020/2021 (July 2020-June 2021). There are internal discussions about various ways that we might proceed with this replacement, but no concrete plans to share yet. The current cluster design was documented in a blog post when these servers were initially put into public use. Initially, running on 2 servers was an incredible improvement in query speed and system stability over the prior generation of hardware and its architecture. In the ~3.5 years that have passed since, query volume (more folks running more queries), data sizes (number of rows stored per wiki, number of wikis), and query complexity have all grown. Data sizes are reaching levels where we are forced to contemplate options beyond simple horizontal (more servers) or vertical ("bigger" servers) scaling of the current design.
Unfortunately, the servers that we use for Quarry and for the wiki replicas in general are very specific (and very costly), so we do not have hot spares ready to take over at any moment.
The analytics role is now split between two hosts: replication lag and InnoDB purge lag are under control now. This should have improved Quarry's response times, as the hosts are sharing the load.
I have executed the query that was originally reported as being killed after 30 minutes (https://quarry.wmflabs.org/query/45763) and it now completes in under a minute.
Going to consider this fixed for now. Hopefully we can keep up this performance.
Thank you all for your patience
@Marostegui it isn't the original query. I recently optimized it by replacing the views "revision" => "revision_userindex" and "actor" => "actor_revision". Before that, it was killed after 30 minutes. To test the original query you can roll back these replacements.
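To make the view swap concrete, here is an illustrative sketch (not the exact Quarry query, which is not reproduced in this task) of counting current-year main-namespace edits per actor using the specialized replica views; the bot filtering of the real query is omitted:

```sql
-- Sketch only: revision_userindex and actor_revision are the wiki-replica
-- views indexed for per-user access; the plain revision/actor views were
-- the slow variant this comment describes rolling back to.
SELECT actor_name, COUNT(*) AS edits
FROM revision_userindex
JOIN actor_revision ON actor_id = rev_actor
JOIN page ON page_id = rev_page
WHERE page_namespace = 0
  AND rev_timestamp >= '20200101000000'
GROUP BY actor_name
ORDER BY edits DESC;
```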
I tested the query reported originally on this task: https://quarry.wmflabs.org/query/12570 which went from 500 seconds to 42
@Marostegui But query https://quarry.wmflabs.org/query/12570 now contains different text than it contained on March 18, the date this ticket was created. How do you know what text this query contained on March 18? Quarry doesn't store the history of text changes within one query, as far as I know.