Tool Name: mix-n-match
Quota increase requested: +2 CPU
Reason: After the Kubernetes move severely limited tool resources, I moved (almost) all PHP services to Rust. All the usual tool background tasks now run from a single Rust job, which uses async/await and therefore deals well with the slow I/O (DB queries, HTTP GETs) involved. However, I find that it does not run at full power due to CPU limitations, so I would like to increase the CPUs to 3 for this single pod. I don't know if that requires changing per-container limits as well as per-tool ones; please make whatever changes are required.
Description
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | • | Bstorm | T216208 ToolsDB overload and cleanup
Resolved | • | Bstorm | T216441 Evaluate transferring the non-replicated tables to the new toolsdb server
Resolved | | fnegri | T236101 Find a way to remove non-replicated tables from ToolsDB
Resolved | | dcaro | T301951 toolsdb: full disk on clouddb1001 broke clouddb1002 (secondary) replication
Open | | None | T301967 toolsdb: evaluate storage usage by some tools
Open | | fnegri | T291782 Migrate largest ToolsDB users to Trove
Resolved | | Andrew | T323502 Move some of magnus's tools to Trove databases (was: Request increased quota for mix-n-match Toolforge tool)
Resolved | | Andrew | T324984 Trove volume size limit of 31Gb
Event Timeline
And while I'm at it, I would like to request that the maximum number of database connections be increased to 20. This is mostly for the tool user database, not so much for the DB replicas, in case that makes a difference.
Mentioned in SAL (#wikimedia-cloud) [2022-11-24T12:34:40Z] <arturo> bump CPU quota from 2 to 3 (T323502)
There are several things going on in this tool:
- web interface queries API which queries database
- Rust runs background jobs based on rules and times. There are currently ~20K jobs in the list, though most of them have run and won't run again. But there can be 20-30 of them running at any given time. All of them need DB connections at some point, and the job system does as well.
- Various older PHP-based jobs that run daily/weekly/monthly all require database connections
In order to have the background jobs run smoothly while not starving the API of connections, I'd like to have more connections available for peak times.
Also, the per-container limit still seems to be 1 CPU?
```
$ toolforge-jobs run --image tf-php74 --cpu 1 --mem 1500Mi --cpu 2 --continuous --command '/data/project/mix-n-match/mixnmatch_rs/run.sh' rustbot
ERROR: unable to create job: "ERROR: Requested CPU 2 is over maximum allowed per container (1)"
```
I see the following abbreviated limits:
```
status:
  hard:
    limits.cpu: "3"
    limits.memory: 8Gi
    pods: "10"
    requests.cpu: "3"
    requests.memory: 6Gi
  used:
    limits.cpu: "2"
    limits.memory: 2524Mi
    pods: "3"
    requests.cpu: 800m
    requests.memory: "1323302912"
```
EDIT: Added request limits
This seems to be the issue:
```
$ kubectl describe limits -n tool-mix-n-match
Name:       tool-mix-n-match
Namespace:  tool-mix-n-match
Type        Resource  Min    Max  Default Request  Default Limit  Max Limit/Request Ratio
----        --------  ---    ---  ---------------  -------------  -----------------------
Container   memory    100Mi  4Gi  256Mi            512Mi          -
Container   cpu       50m    1    150m             500m           -
```
Mentioned in SAL (#wikimedia-cloud) [2022-12-05T22:17:42Z] <balloons> Update cpu rangelimit to 3 T323502
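For context, the per-container cap shown by `kubectl describe limits` comes from a Kubernetes LimitRange object in the tool's namespace. An illustrative manifest (reconstructed from the describe output above, not the actual Toolforge configuration) with the container CPU max raised from 1 to 3 would look roughly like:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: tool-mix-n-match
  namespace: tool-mix-n-match
spec:
  limits:
    - type: Container
      min:
        cpu: 50m
        memory: 100Mi
      max:
        cpu: "3"        # raised from 1 to allow the 2-3 CPU container
        memory: 4Gi
      defaultRequest:
        cpu: 150m
        memory: 256Mi
      default:
        cpu: 500m
        memory: 512Mi
```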
Per T323502#8416721 it looks like this is for the user database and not the wiki replicas. We do not handle that one. If this is needed for the wikireplicas, let me know!
@Marostegui I'm sorry! I read T323502#8416721 and re-read and somehow still missed what was being asked.
@Magnus, for toolsdb, there are scaling and performance issues that currently impact the system (see T216170: toolsdb - Per-user connection limits). I don't think we can support more connections in its current state. Can you live without the increased connections? I'm also wondering what other alternatives might exist. For instance, on the Cloud VPS side, database as a service is possible. See https://wikitech.wikimedia.org/wiki/Help:Adding_a_Database_to_a_Cloud_VPS_Project#Trove:_Database_as_a_Service_for_Cloud_VPS. Would this meet your needs? It's technically possible to connect to a Cloud VPS datastore from within Toolforge. We could discuss what would work well for you if this sounds like an option.
@nskaggs Thanks, the Cloud VPS DB option looks very interesting, but I think it would be overkill to move the 120GB DB over. I'll stick with the 10 connections for now, unless you recommend that this is hosted more efficiently (for both you and me) on Cloud VPS.
Maybe this is a more general issue as well. I checked, and it looks like I "own" 4 of the 10 largest tool databases. Is there a case for moving these large ones to their own instances, to take pressure off the toolforge DB system?
The answer to that question is YES! If you're willing to pick a database (not necessarily mix-n-match) and be a test subject, I'd love to work with you on that.
Sure! The best test candidate might be s51203__baglama2_p with ~40GB.
Some things I would like:
- some guarantee that the new DB will have the same safety as the tooldb one; with VMs I'm always suspicious that storage might be deleted or somesuch, are there regular backups/replicas?
- "one-click" migration script, or you do it for me. This is more for guaranteed 1:1 copy than convenience
Let me know when you are ready to do this, so I can stop processes writing to the DB, unless you have some way to do this without interruption?
@fnegri: After a rabbit trail of trove issues, I've created the baglama2 project with trove db 'baglama2'. Want to experiment with migration there?
@Magnus: Safety levels should be similar; the VM image itself is backed up but recovery from breakage would be painful. Ultimately we plan to support replication/snapshotting of trove databases but there are a few other projects that need to be finished up before we can offer that. We won't have one-click migration but Francesco is interested in developing a process so I'm leaving this in his hands.
@Andrew I have created a new baglama2 DB there, and am currently importing the toolsdb database. For that, I made a new replica file (~/replica.trove.my.cnf) and ran this in a screen:
```
mysqldump --defaults-file=~/replica.my.cnf --host=tools-db s51203__baglama2_p | mysql --defaults-file=~/replica.trove.my.cnf -h pwupvyu6i6k.svc.trove.eqiad1.wikimedia.cloud baglama2
```
This seems to be mostly done (after ~14h). I will then point everything to the trove DB. I might also dump the tooldb, and set up a regular trove dump. What's the best place to store compressed dumps, just the tool directory?
I'll let you know when the tooldb can be nuked.
Everything worked fine but (after a few days) I now can't connect to the instance any more:
Used command:

```
/usr/bin/ssh -v -N -S none -o ControlMaster=no -o ExitOnForwardFailure=yes -o ConnectTimeout=10 -o NumberOfPasswordPrompts=3 -i /Users/mm6/SpiderOak Hive/Configurations/ssh/id_rsa -o TCPKeepAlive=no -o ServerAliveInterval=60 -o ServerAliveCountMax=1 magnus@tools-login.wmflabs.org -L 61284:pwupvyu6i6k.svc.trove.eqiad1.wikimedia.cloud:3306
```

Output:

```
OpenSSH_8.6p1, LibreSSL 3.3.6
debug1: Reading configuration data /Users/mm6/.ssh/config
debug1: /Users/mm6/.ssh/config line 5: Applying options for tools-login.wmflabs.org
debug1: /Users/mm6/.ssh/config line 14: Applying options for *
debug1: /Users/mm6/.ssh/config line 15: Deprecated option "useroaming"
debug1: /Users/mm6/.ssh/config line 23: Applying options for *
debug1: /Users/mm6/.ssh/config line 33: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 21: include /etc/ssh/ssh_config.d/* matched no files
debug1: /etc/ssh/ssh_config line 54: Applying options for *
debug1: Authenticator provider $SSH_SK_PROVIDER did not resolve; disabling
debug1: Control socket " none" does not exist
debug1: Connecting to tools-login.wmflabs.org [185.15.56.66] port 22.
debug1: fd 5 clearing O_NONBLOCK
debug1: Connection established.
debug1: identity file /Users/mm6/SpiderOak Hive/Configurations/ssh/id_rsa type 0
debug1: identity file /Users/mm6/SpiderOak Hive/Configurations/ssh/id_rsa-cert type -1
debug1: identity file /Users/mm6/.ssh/id_rsa type 0
debug1: identity file /Users/mm6/.ssh/id_rsa-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.6
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.9p1 Debian-10+deb10u2
debug1: compat_banner: match: OpenSSH_7.9p1 Debian-10+deb10u2 pat OpenSSH* compat 0x04000000
debug1: Authenticating to tools-login.wmflabs.org:22 as 'magnus'
debug1: load_hostkeys: fopen /Users/mm6/.ssh/known_hosts2: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or directory
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ssh-ed25519
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: SSH2_MSG_KEX_ECDH_REPLY received
debug1: Server host key: ssh-ed25519 SHA256:xxW0+dRvWgCzYOq7uBKXXo7Xze0FVezt0QikIkpeMKI
debug1: load_hostkeys: fopen /Users/mm6/.ssh/known_hosts2: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or directory
debug1: Host 'tools-login.wmflabs.org' is known and matches the ED25519 host key.
debug1: Found key in /Users/mm6/.ssh/known_hosts:174
debug1: rekey out after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey in after 134217728 blocks
debug1: Will attempt key: /Users/mm6/SpiderOak Hive/Configurations/ssh/id_rsa RSA SHA256:lRpWeuGfunWq42w+ScX2CmVuVy+9ZgV5GuuTGozl680 explicit agent
debug1: Will attempt key: mm6@sanger.ac.uk RSA SHA256:2BLwrTnTiH3K5enChXlTeLlVzLLVTFbamBC+ulpUaG4 agent
debug1: Will attempt key: /Users/mm6/.ssh/id_rsa RSA SHA256:lRpWeuGfunWq42w+ScX2CmVuVy+9ZgV5GuuTGozl680 explicit
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,ssh-rsa,rsa-sha2-256,rsa-sha2-512,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,hostbased
debug1: Next authentication method: publickey
debug1: Offering public key: /Users/mm6/SpiderOak Hive/Configurations/ssh/id_rsa RSA SHA256:lRpWeuGfunWq42w+ScX2CmVuVy+9ZgV5GuuTGozl680 explicit agent
debug1: Server accepts key: /Users/mm6/SpiderOak Hive/Configurations/ssh/id_rsa RSA SHA256:lRpWeuGfunWq42w+ScX2CmVuVy+9ZgV5GuuTGozl680 explicit agent
debug1: Authentication succeeded (publickey).
Authenticated to tools-login.wmflabs.org ([185.15.56.66]:22).
debug1: Local connections to LOCALHOST:61284 forwarded to remote address pwupvyu6i6k.svc.trove.eqiad1.wikimedia.cloud:3306
debug1: Local forwarding listening on ::1 port 61284.
debug1: channel 0: new [port listener]
debug1: Local forwarding listening on 127.0.0.1 port 61284.
debug1: channel 1: new [port listener]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: filesystem full
debug1: Connection to port 61284 forwarding to pwupvyu6i6k.svc.trove.eqiad1.wikimedia.cloud port 3306 requested.
debug1: channel 2: new [direct-tcpip]
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: client_input_hostkeys: searching /Users/mm6/.ssh/known_hosts for tools-login.wmflabs.org / (none)
debug1: client_input_hostkeys: searching /Users/mm6/.ssh/known_hosts2 for tools-login.wmflabs.org / (none)
debug1: client_input_hostkeys: hostkeys file /Users/mm6/.ssh/known_hosts2 does not exist
debug1: client_input_hostkeys: no new or deprecated keys from server
debug1: Remote: /usr/sbin/ssh-key-ldap-lookup:2: key options: agent-forwarding port-forwarding pty user-rc x11-forwarding
debug1: Remote: /usr/sbin/ssh-key-ldap-lookup:2: key options: agent-forwarding port-forwarding pty user-rc x11-forwarding
channel 2: open failed: connect failed: Connection refused
debug1: channel 2: free: direct-tcpip: listening port 61284 for pwupvyu6i6k.svc.trove.eqiad1.wikimedia.cloud port 3306, connect from 127.0.0.1 port 61286 to 127.0.0.1 port 61284, nchannels 3
```
More concise, from toolforge:
```
tools.glamtools@tools-sgebastion-10:~$ mysql --defaults-file=~/replica.trove.my.cnf -h pwupvyu6i6k.svc.trove.eqiad1.wikimedia.cloud baglama2
ERROR 2002 (HY000): Can't connect to MySQL server on 'pwupvyu6i6k.svc.trove.eqiad1.wikimedia.cloud' (115)
```
OK, I restarted the DB instance via Horizon when I saw that even Horizon couldn't connect to it any more.
That seems to have resolved the problem for now, but the fact that the entire instance just quietly crashed should be investigated IMHO
In other words, I would be hesitant to switch to a system where I have to manually restart the MySQL server every other week. I don't have time to work on all my tools as I would like, I can't run around kicking infrastructure as well.
Yep, I share your concern! It looks like the docker container containing the mysql server crashed; I'm investigating why that happened and why Trove didn't restart it automatically :/
Seems to have happened again just now. I was importing a rather large table (views), that has been running for hours(?). Not sure if that's related.
The host has been OOM'ing. I'm in the process of trying to resize it, if that goes poorly we may want to start over with a bigger host.
bah, the resizing is going poorly. I can try heroic measures to rescue the data or just restart from scratch -- are you ok with the latter?
Sorry that this is going poorly! It's likely that with enough RAM we'll get this stable and I want to get you settled before I go down the side-track of figuring out why resizing failed.
I've rebuilt the database instance and you should be able to start syncing again. I'm not thrilled with how Trove has been acting today, but I suspect with 4 GB of RAM this one will stay up for a while.
It took a few days but the database has been successfully copied over to trove. I am taking a final mysqldump from toolsdb now, then s51203__baglama2_p can be removed. I will post here when it's done.
Everything still going well with this? I'm curious how the move to trove is working out. Hope all is well!
I have successfully moved all data over to trove, and took a snapshot of the toolsdb version.
The web interface and the background tools have been switched over to the trove version and are reading/writing successfully.
As far as I am concerned, the toolsdb s51203__baglama2_p can be deleted.
Should I do that, or do you want to do the honors?
After T329853, can we please add something so databases don't stay vanished until I complain at "a proper support venue" (which seems to be IRC only, with the Mattermost link broken)?
> As far as I am concerned, the toolsdb s51203__baglama2_p can be deleted.
> Should I do that, or do you want to do the honors?
Sorry for not getting back sooner about this. I have just deleted the old database from ToolsDB:
```
MariaDB [(none)]> DROP DATABASE s51203__baglama2_p;
Query OK, 7 rows affected (2.19 sec)
```
Mentioned in SAL (#wikimedia-cloud) [2023-02-17T16:13:31Z] <dhinus> drop unused database s51203__baglama2_p T323502