Query https://quarry.wmflabs.org/query/12570 generates a rating of ruwiki bots by the number of edits in the main namespace in the current year. For the last few years it has always completed in a few minutes (including after the 'actor' table was introduced), but since yesterday it runs for 30 minutes and gets killed.
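For context, a query of this shape would produce such a rating. This is a hedged sketch, not the exact contents of query 12570; table and column names follow the post-actor-migration MediaWiki schema views on the wiki replicas (`revision_userindex`, `actor`, `user_groups`, `page`), and the timestamp literal is an assumed example:

```sql
-- Sketch: rank ruwiki bots by main-namespace edits made in the current year.
-- Assumes the post-actor-migration wiki replica views; the real query 12570
-- may differ in details.
SELECT actor_name AS bot,
       COUNT(*)   AS edits
FROM revision_userindex
JOIN actor       ON actor_id = rev_actor
JOIN user_groups ON ug_user = actor_user AND ug_group = 'bot'
JOIN page        ON page_id = rev_page AND page_namespace = 0
WHERE rev_timestamp >= '20200101000000'   -- start of the current year (example)
GROUP BY actor_name
ORDER BY edits DESC;
```

A query like this touches a large slice of the revision table, which is why it is sensitive to overall host load.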
- Mentioned In
  - T262757: SQL requests to DB replicas became work much slower, both from Quarry and from process on Toolforge
  - T251719: Quarry or the Analytics wikireplicas role creates lots of InnoDB Purge Lag
  - T249188: Reimage labsdb1011 to Buster and MariaDB 10.4
  - T246970: Quarry: Lost connection to MySQL server during query
- Mentioned Here
  - T249188: Reimage labsdb1011 to Buster and MariaDB 10.4
  - T246604: Install 1 buster+10.4 host per section
  - T246970: Quarry: Lost connection to MySQL server during query
Keep in mind that you are using a shared resource, which is currently being used a lot: https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&from=now-12h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=labsdb1011&var-port=9104&fullscreen&panelId=16
So it is entirely possible that your query will hit metadata locking, deadlocks, and so forth. The query killer on the host itself is set to 4 hours, but Quarry kills queries after 30 minutes.
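To see how long a query has been running before either killer reaches it, one can (for example) inspect the processlist. A minimal sketch, assuming the connected account is allowed to see its own threads:

```sql
-- Show running queries and their runtime in seconds.
-- Quarry kills at 1800s (30 min); the on-host killer at 14400s (4h).
SELECT id, user, time, LEFT(info, 80) AS query
FROM information_schema.PROCESSLIST
WHERE command = 'Query'
ORDER BY time DESC;
```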
I see your query running there now for around 10 minutes.
I have tried your query on the other hosts and it takes around 2-3 minutes to complete. But it is also true that they are way less loaded.
Just a quick command:
[15:11:29] marostegui@cumin1001:~$ sudo cumin 'labsdb*.eqiad.wmnet' 'w'
4 hosts will be targeted:
labsdb[1009-1012].eqiad.wmnet
Confirm to continue [y/n]? y
===== NODE GROUP =====
(1) labsdb1011.eqiad.wmnet
----- OUTPUT of 'w' -----
 15:11:39 up 25 days, 19:59,  1 user,  load average: 24.52, 28.98, 29.47
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
marosteg pts/0    126.96.36.199    15:07    2:45   0.07s  0.01s sshd: marostegui [priv]
===== NODE GROUP =====
(1) labsdb1009.eqiad.wmnet
----- OUTPUT of 'w' -----
 15:11:39 up 118 days,  9:00,  0 users,  load average: 15.05, 11.90, 9.47
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
===== NODE GROUP =====
(1) labsdb1012.eqiad.wmnet
----- OUTPUT of 'w' -----
 15:11:39 up 121 days,  4:46,  1 user,  load average: 1.29, 1.28, 1.25
USER     TTY      FROM             LOGIN@   IDLE   JCPU   PCPU WHAT
marosteg pts/0    188.8.131.52     14:36    34:43  0.04s  0.01s sshd: marostegui [priv]
===== NODE GROUP =====
(1) labsdb1010.eqiad.wmnet
----- OUTPUT of 'w' -----
 15:11:39 up 120 days,  8:42,  0 users,  load average: 14.62, 13.13, 9.35
The only thing I can think of to rule out host-specific issues on labsdb1011 would be to point Quarry to another host for a few days (we'd need to change the query killer, the LB weights and all that) and see what happens.
@bd808 @Bstorm thoughts?
If signs are pointing to labsdb1011 specifically being slower than the other wiki replica hosts, then it seems worth trying to shift load to another host in the pool. Is labsdb1011 currently the only host taking the *.analytics.db.svc.eqiad.wmflabs load?
We can try that, just to rule out host-specific issues. Yes, labsdb1011 is the only one.
labsdb-analytics       5M IN CNAME dbproxy1010.eqiad.wmnet.
labsdb-web             5M IN CNAME dbproxy1011.eqiad.wmnet.
wikireplica-analytics  5M IN CNAME dbproxy1010.eqiad.wmnet.
wikireplica-web        5M IN CNAME dbproxy1011.eqiad.wmnet.
And the dbproxy has labsdb1011 as primary and labsdb1010 as secondary. I can try to change that: promote labsdb1010 to be the analytics host, and move labsdb1011 to serve web traffic in place of labsdb1010.
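The dbproxy hosts run HAProxy, so the primary/secondary swap amounts to flipping which backend carries the `backup` flag. A hypothetical fragment for illustration only; the real config is Puppet-managed and the listen/server names here are assumptions:

```
# Hypothetical HAProxy fragment on dbproxy1010 (analytics) after the swap:
# labsdb1010 becomes primary, labsdb1011 becomes the backup.
listen mariadb-analytics
    bind *:3306
    server labsdb1010 labsdb1010.eqiad.wmnet:3306 check
    server labsdb1011 labsdb1011.eqiad.wmnet:3306 check backup
```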
I would need to change the query killers too, as labsdb1010 would need to be set to 14400 seconds and labsdb1011 to 300.
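The query killer timeouts could be expressed with a pt-kill-style invocation. This is an illustrative sketch using Percona Toolkit's pt-kill flags; the actual killer service on these hosts is puppetized and may be configured differently:

```shell
# Sketch only: analytics host kills queries busy longer than 4 hours,
# the web host kills queries busy longer than 5 minutes.
pt-kill --busy-time 14400 --match-command Query --kill --daemonize   # analytics
pt-kill --busy-time 300   --match-command Query --kill --daemonize   # web
```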
Is there something to do from Quarry side?
Another thing we can try is upgrading labsdb1011 to Buster and MariaDB 10.4.
We recently had some CPU issues with wikidatawiki (after all its hosts got InnoDB compression, like our wikireplicas), and it looked like 10.4 performed better CPU-wise. Again, not 100% sure, but what we saw on the graphs is that the 10.4 host, with the same hardware and the same load-balancer weight, used less CPU.
MariaDB 10.1 reaches end of life in October, so we have started to test 10.4 on production hosts (T246604), and so far we've not noticed any major regressions or issues.
Next quarter I would like to keep migrating some roles (misc, parsercache, x1, external storage...) to Buster and 10.4, and the wikireplicas are one of them, as they are multi-source. So I would like to include them, and we could start with this host and check whether it makes a difference as well.
Also, 10.4 includes some cool new features we can use to detect slow queries, like the optimizer trace.
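For reference, the optimizer trace in MariaDB 10.4 is enabled per session and read back from `information_schema`; a minimal sketch (the slow query itself is elided):

```sql
-- Enable the optimizer trace for this session (MariaDB 10.4+).
SET optimizer_trace = 'enabled=on';

-- Run the slow query under investigation here.

-- Read back the optimizer's decisions (join order, index choices, costs).
SELECT * FROM information_schema.OPTIMIZER_TRACE;

SET optimizer_trace = 'enabled=off';
```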
The current work on labsdb1011 is T249188: Reimage labsdb1011 to Buster and MariaDB 10.4. This is an attempt to see if a newer operating system and MariaDB version help with the ongoing performance issues that we have seen on that particular physical host.
If there are only 3 servers in total here, are there plans to add more?
All 3 physical servers in the user-facing wiki replicas cluster are due for hardware replacement in the Wikimedia Foundation's fiscal year 2020/2021 (July 2020-June 2021). There are internal discussions about various ways that we might proceed with this replacement, but no concrete plans to share yet. The current cluster design was documented in a blog post when these servers were initially put into public use. Initially running on 2 servers was an incredible improvement in query speed and system stability over the prior generation of hardware and its architecture. In the ~3.5 years that have passed since, query volume (more folks running more queries), data sizes (number of rows stored per wiki, number of wikis), and query complexity have all grown. Data sizes are reaching levels where we are forced to contemplate options beyond simple horizontal (more servers) or vertical ("bigger" servers) scaling of the current design.
The analytics role is now split between two hosts: replication lag and InnoDB purge lag are now under control. This should have improved Quarry's response times, as the hosts are splitting the load.
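Replication lag on the wiki replicas can be checked from any connected session via the `heartbeat_p` view, which is how one would confirm the lag is under control (assuming the view is present on the host being checked):

```sql
-- Per-section replication lag in seconds, as exposed on the wiki replicas.
SELECT shard, lag
FROM heartbeat_p.heartbeat
ORDER BY lag DESC;
```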
I have executed the query that was originally reported, which was being killed after 30 minutes (https://quarry.wmflabs.org/query/45763), and it now completes in under a minute.
Going to consider this fixed for now. Hopefully we can keep up this level of performance.
Thank you all for your patience!