
Move some of magnus's tools to Trove databases (was: Request increased quota for mix-n-match Toolforge tool)
Closed, Resolved · Public

Description

Tool Name: mix-n-match
Quota increase requested: +2 CPU
Reason: After the Kubernetes move severely limited tool resources, I have moved (almost) all PHP services to Rust. All (usual) tool background tasks now run from a single Rust job, which uses async/await and therefore deals well with the slow I/O (DB queries, HTTP GETs) involved. However, I find that it does not run at full power due to CPU limitations, so I would like to increase the CPUs to 3 for this single pod. I don't know whether that requires changing per-container limits as well as the per-tool quota; please make all required changes.
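
To make the pattern described above concrete, here is a minimal sketch of an async job runner: the runtime overlaps many I/O-bound jobs (DB queries, HTTP GETs) on a few threads, so throughput ends up bounded by the pod's CPU quota rather than by I/O waits. This is illustrative only and assumes the tokio and reqwest crates; the URL and job count are placeholders, not the actual mixnmatch_rs code.

// Illustrative sketch only (assumes tokio + reqwest); not the tool's real code.
use tokio::task::JoinSet;

async fn run_job(job_id: u64) -> Result<usize, reqwest::Error> {
    // Stand-in for a real background job: a slow HTTP GET followed by processing.
    let body = reqwest::get(format!("https://example.org/job/{job_id}"))
        .await?
        .text()
        .await?;
    Ok(body.len())
}

#[tokio::main]
async fn main() {
    let mut jobs = JoinSet::new();
    // 20-30 jobs can be in flight at once; while they wait on I/O they use
    // almost no CPU, so the pod's CPU limit caps how fast results are processed.
    for job_id in 0..30u64 {
        jobs.spawn(run_job(job_id));
    }
    while let Some(result) = jobs.join_next().await {
        match result {
            Ok(Ok(bytes)) => println!("job finished, {bytes} bytes fetched"),
            Ok(Err(e)) => eprintln!("job failed: {e}"),
            Err(e) => eprintln!("job panicked or was cancelled: {e}"),
        }
    }
}

With many such jobs waiting on the database and HTTP responses, the non-I/O work (parsing, matching, DB writes) is what the extra CPU would speed up.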

Event Timeline

Magnus renamed this task from "Request increased quota for <Replace Me> Toolforge tool" to "Request increased quota for mix-n-match Toolforge tool". Nov 21 2022, 2:41 PM

And while I'm at it, I would like to request that the maximum number of database connections be increased to 20. This is mostly for the tool user database, not so much for the DB replicas, in case that makes a difference.

Could you explain a bit why this is needed?

Mentioned in SAL (#wikimedia-cloud) [2022-11-24T12:34:40Z] <arturo> bump CPU quota from 2 to 3 (T323502)

aborrero moved this task from Inbox to Discussion needed on the Toolforge (Quota-requests) board.
aborrero subscribed.

Accepted the CPU quota change. The DB request is pending discussion.

> Could you explain a bit why this is needed?

There are several things going on in this tool:

  • web interface queries API which queries database
  • Rust runs background jobs based on rules and times. There are currently ~20K jobs in the list, though most of them have run and won't run again. But there can be 20-30 of them running at any given time. All of them need DB connections at some point, and the job system does as well.
  • Various older PHP-based jobs that run daily/weekly/monthly all require database connections

In order to have the background jobs run smoothly while not starving the API of connections, I'd like to have more connections available for peak times.
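
To illustrate the connection-pressure point: with a bounded connection pool, the background jobs and the API share a fixed number of ToolsDB connections, so raising the per-user limit raises the pool ceiling rather than the number of idle connections. The sketch below is generic, assuming a recent sqlx with its MySQL driver; the DSN, pool size and table name are placeholders, not the tool's real configuration.

// Generic sketch assuming the sqlx crate with its MySQL driver; the DSN,
// pool size and table name are placeholders, not the tool's real settings.
use std::time::Duration;
use sqlx::mysql::MySqlPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    let pool = MySqlPoolOptions::new()
        .max_connections(20) // keep at or below the per-user ToolsDB limit
        .acquire_timeout(Duration::from_secs(30)) // jobs queue instead of erroring at peak
        .connect("mysql://USER:PASS@tools-db/s00000__example_p")
        .await?;

    // Each job holds a connection only while a query runs, so 20-30 concurrent
    // jobs plus the API can usually share fewer connections than there are jobs.
    let (rows,): (i64,) = sqlx::query_as("SELECT COUNT(*) FROM some_table")
        .fetch_one(&pool)
        .await?;
    println!("rows: {rows}");
    Ok(())
}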

> Accepted the CPU quota change. The DB request is pending discussion.

Actually the CPU request was for +2 CPUs but your earlier comment says +1?

Also, the per-container limit still seems to be 1 CPU?

toolforge-jobs run --image tf-php74 --cpu 1 --mem 1500Mi --cpu 2 --continuous --command '/data/project/mix-n-match/mixnmatch_rs/run.sh' rustbot
ERROR: unable to create job: "ERROR: Requested CPU 2 is over maximum allowed per container (1)"

I see the following abbreviated limits:

status:
  hard:
    limits.cpu: "3"
    limits.memory: 8Gi
    pods: "10"
    requests.cpu: "3"
    requests.memory: 6Gi
  used:
    limits.cpu: "2"
    limits.memory: 2524Mi
    pods: "3"
    requests.cpu: 800m
    requests.memory: "1323302912"

EDIT: Added request limits

For the DB connections request, adding @Marostegui for review.

This seems to be the issue:

kubectl describe limits -n tool-mix-n-match
Name:       tool-mix-n-match
Namespace:  tool-mix-n-match
Type        Resource  Min    Max  Default Request  Default Limit  Max Limit/Request Ratio
----        --------  ---    ---  ---------------  -------------  -----------------------
Container   memory    100Mi  4Gi  256Mi            512Mi          -
Container   cpu       50m    1    150m             500m           -

Mentioned in SAL (#wikimedia-cloud) [2022-12-05T22:17:42Z] <balloons> Update cpu rangelimit to 3 T323502

> For the DB connections request, adding @Marostegui for review.

Per T323502#8416721, it looks like this is for the user database and not the wiki replicas. We do not handle that one. If this is needed for the wiki replicas, let me know!

@Marostegui I'm sorry! I read T323502#8416721 and re-read and somehow still missed what was being asked.

@Magnus, for ToolsDB, there are scaling and performance issues that currently impact the system (see T216170: toolsdb - Per-user connection limits). I don't think we can support more connections in its current state. Can you live without the increased connections? I'm also wondering what other alternatives might exist. For instance, on the Cloud VPS side, database-as-a-service is available. See https://wikitech.wikimedia.org/wiki/Help:Adding_a_Database_to_a_Cloud_VPS_Project#Trove:_Database_as_a_Service_for_Cloud_VPS. Would this meet your needs? It's technically possible to connect to a Cloud VPS datastore from within Toolforge. We could discuss what would work well for you if this sounds like an option.

@nskaggs Thanks, the Cloud VPS DB option looks very interesting, but I think it would be overkill to move the 120GB DB over. I'll stick with the 10 connections for now, unless you recommend that this is hosted more efficiently (for both you and me) on Cloud VPS.

Maybe this is a more general issue as well. I checked and it looks like I "own" 4 of the 10 largest tool databases. Is there a case for these large ones to move to their own instance, to take pressure off the Toolforge DB system?

> Is there a case for these large ones to move to their own instance, to take pressure off the Toolforge DB system?

The answer to that question is YES! If you're willing to pick a database (not necessarily mix-n-match) and be a test subject, I'd love to work with you on that.

Sure! The best test candidate might be s51203__baglama2_p, at ~40GB.

Some things I would like:

  • some guarantee that the new DB will have the same safety as the tooldb one; with VMs I'm always suspicious that storage might be deleted or some such. Are there regular backups/replicas?
  • a "one-click" migration script, or you do it for me. This is more for a guaranteed 1:1 copy than for convenience (a rough sketch of what such a wrapper could look like follows this comment)

Let me know when you are ready to do this, so I can stop processes writing to the DB, unless you have some way to do this without interruption?
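
On the "one-click" migration script mentioned above, a rough, hypothetical sketch of what such a wrapper could look like is below. It simply drives the same mysqldump-into-mysql pipeline that was eventually run by hand (see the command later in this task); the hostnames, database names and option-file paths are placeholders. Unlike a plain shell pipe, it checks both exit statuses, so a truncated dump would not be imported silently.

// Hypothetical migration wrapper; hostnames, DB names and option-file paths
// are placeholders in the style of the commands used later in this task.
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // Dump the ToolsDB database to stdout.
    let mut dump = Command::new("mysqldump")
        .args(["--defaults-file=replica.my.cnf", "--host=tools-db", "s51203__baglama2_p"])
        .stdout(Stdio::piped())
        .spawn()?;

    // Stream the dump straight into the Trove instance.
    let restore_status = Command::new("mysql")
        .args([
            "--defaults-file=replica.trove.my.cnf",
            "-h", "TROVE_HOSTNAME.svc.trove.eqiad1.wikimedia.cloud",
            "baglama2",
        ])
        .stdin(dump.stdout.take().expect("mysqldump stdout was piped"))
        .status()?;

    let dump_status = dump.wait()?;
    // Check both sides of the pipeline so a partial dump is never imported quietly.
    assert!(dump_status.success(), "mysqldump failed: {dump_status}");
    assert!(restore_status.success(), "mysql import failed: {restore_status}");
    println!("copy finished cleanly");
    Ok(())
}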

@fnegri: After a rabbit trail of trove issues, I've created the baglama2 project with trove db 'baglama2'. Want to experiment with migration there?

@Magnus: Safety levels should be similar; the VM image itself is backed up but recovery from breakage would be painful. Ultimately we plan to support replication/snapshotting of trove databases but there are a few other projects that need to be finished up before we can offer that. We won't have one-click migration but Francesco is interested in developing a process so I'm leaving this in his hands.

@Andrew I have created a new baglama2 DB there, and am currently importing the tooldb database. For that, I made a new replica file (~/replica.trove.my.cnf) and am running this in a screen session:

mysqldump --defaults-file=~/replica.my.cnf --host=tools-db s51203__baglama2_p | mysql --defaults-file=~/replica.trove.my.cnf -h pwupvyu6i6k.svc.trove.eqiad1.wikimedia.cloud baglama2

This seems to be mostly done (after ~14h). I will then point everything to the trove DB. I might also dump the tooldb, and set up a regular trove dump. What's the best place to store compressed dumps, just the tool directory?

I'll let you know when the tooldb can be nuked.

Fantastic! Thank you for diving in, Magnus -- let me know how things go!

Andrew renamed this task from "Request increased quota for mix-n-match Toolforge tool" to "Move some of magnus's tools to Trove databases (was: Request increased quota for mix-n-match Toolforge tool)". Dec 14 2022, 3:24 PM
Andrew claimed this task.

Everything worked fine but (after a few days) I now can't connect to the instance any more:

Used command:  /usr/bin/ssh -v -N -S none -o ControlMaster=no -o ExitOnForwardFailure=yes -o ConnectTimeout=10 -o NumberOfPasswordPrompts=3 -i /Users/mm6/SpiderOak Hive/Configurations/ssh/id_rsa -o TCPKeepAlive=no -o ServerAliveInterval=60 -o ServerAliveCountMax=1 magnus@tools-login.wmflabs.org -L 61284:pwupvyu6i6k.svc.trove.eqiad1.wikimedia.cloud:3306

OpenSSH_8.6p1, LibreSSL 3.3.6
debug1: Reading configuration data /Users/mm6/.ssh/config
debug1: /Users/mm6/.ssh/config line 5: Applying options for tools-login.wmflabs.org
debug1: /Users/mm6/.ssh/config line 14: Applying options for *
debug1: /Users/mm6/.ssh/config line 15: Deprecated option "useroaming"
debug1: /Users/mm6/.ssh/config line 23: Applying options for *
debug1: /Users/mm6/.ssh/config line 33: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 21: include /etc/ssh/ssh_config.d/* matched no files
debug1: /etc/ssh/ssh_config line 54: Applying options for *
debug1: Authenticator provider $SSH_SK_PROVIDER did not resolve; disabling
debug1: Control socket " none" does not exist
debug1: Connecting to tools-login.wmflabs.org [185.15.56.66] port 22.
debug1: fd 5 clearing O_NONBLOCK
debug1: Connection established.
debug1: identity file /Users/mm6/SpiderOak Hive/Configurations/ssh/id_rsa type 0
debug1: identity file /Users/mm6/SpiderOak Hive/Configurations/ssh/id_rsa-cert type -1
debug1: identity file /Users/mm6/.ssh/id_rsa type 0
debug1: identity file /Users/mm6/.ssh/id_rsa-cert type -1
debug1: Local version string SSH-2.0-OpenSSH_8.6
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.9p1 Debian-10+deb10u2
debug1: compat_banner: match: OpenSSH_7.9p1 Debian-10+deb10u2 pat OpenSSH* compat 0x04000000
debug1: Authenticating to tools-login.wmflabs.org:22 as 'magnus'
debug1: load_hostkeys: fopen /Users/mm6/.ssh/known_hosts2: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or directory
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: ssh-ed25519
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: SSH2_MSG_KEX_ECDH_REPLY received
debug1: Server host key: ssh-ed25519 SHA256:xxW0+dRvWgCzYOq7uBKXXo7Xze0FVezt0QikIkpeMKI
debug1: load_hostkeys: fopen /Users/mm6/.ssh/known_hosts2: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts: No such file or directory
debug1: load_hostkeys: fopen /etc/ssh/ssh_known_hosts2: No such file or directory
debug1: Host 'tools-login.wmflabs.org' is known and matches the ED25519 host key.
debug1: Found key in /Users/mm6/.ssh/known_hosts:174
debug1: rekey out after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey in after 134217728 blocks
debug1: Will attempt key: /Users/mm6/SpiderOak Hive/Configurations/ssh/id_rsa RSA SHA256:lRpWeuGfunWq42w+ScX2CmVuVy+9ZgV5GuuTGozl680 explicit agent
debug1: Will attempt key: mm6@sanger.ac.uk RSA SHA256:2BLwrTnTiH3K5enChXlTeLlVzLLVTFbamBC+ulpUaG4 agent
debug1: Will attempt key: /Users/mm6/.ssh/id_rsa RSA SHA256:lRpWeuGfunWq42w+ScX2CmVuVy+9ZgV5GuuTGozl680 explicit
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,ssh-rsa,rsa-sha2-256,rsa-sha2-512,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey,hostbased
debug1: Next authentication method: publickey
debug1: Offering public key: /Users/mm6/SpiderOak Hive/Configurations/ssh/id_rsa RSA SHA256:lRpWeuGfunWq42w+ScX2CmVuVy+9ZgV5GuuTGozl680 explicit agent
debug1: Server accepts key: /Users/mm6/SpiderOak Hive/Configurations/ssh/id_rsa RSA SHA256:lRpWeuGfunWq42w+ScX2CmVuVy+9ZgV5GuuTGozl680 explicit agent
debug1: Authentication succeeded (publickey).
Authenticated to tools-login.wmflabs.org ([185.15.56.66]:22).
debug1: Local connections to LOCALHOST:61284 forwarded to remote address pwupvyu6i6k.svc.trove.eqiad1.wikimedia.cloud:3306
debug1: Local forwarding listening on ::1 port 61284.
debug1: channel 0: new [port listener]
debug1: Local forwarding listening on 127.0.0.1 port 61284.
debug1: channel 1: new [port listener]
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug1: pledge: filesystem full
debug1: Connection to port 61284 forwarding to pwupvyu6i6k.svc.trove.eqiad1.wikimedia.cloud port 3306 requested.
debug1: channel 2: new [direct-tcpip]
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: client_input_hostkeys: searching /Users/mm6/.ssh/known_hosts for tools-login.wmflabs.org / (none)
debug1: client_input_hostkeys: searching /Users/mm6/.ssh/known_hosts2 for tools-login.wmflabs.org / (none)
debug1: client_input_hostkeys: hostkeys file /Users/mm6/.ssh/known_hosts2 does not exist
debug1: client_input_hostkeys: no new or deprecated keys from server
debug1: Remote: /usr/sbin/ssh-key-ldap-lookup:2: key options: agent-forwarding port-forwarding pty user-rc x11-forwarding
debug1: Remote: /usr/sbin/ssh-key-ldap-lookup:2: key options: agent-forwarding port-forwarding pty user-rc x11-forwarding
channel 2: open failed: connect failed: Connection refused
debug1: channel 2: free: direct-tcpip: listening port 61284 for pwupvyu6i6k.svc.trove.eqiad1.wikimedia.cloud port 3306, connect from 127.0.0.1 port 61286 to 127.0.0.1 port 61284, nchannels 3

More concisely, from Toolforge:

tools.glamtools@tools-sgebastion-10:~$ mysql --defaults-file=~/replica.trove.my.cnf -h pwupvyu6i6k.svc.trove.eqiad1.wikimedia.cloud baglama2
ERROR 2002 (HY000): Can't connect to MySQL server on 'pwupvyu6i6k.svc.trove.eqiad1.wikimedia.cloud' (115)

OK, I restarted the DB instance via Horizon when I saw that even Horizon couldn't connect to it any more.

That seems to have resolved the problem for now, but the fact that the entire instance just quietly crashed should be investigated, IMHO.

In other words, I would be hesitant to switch to a system where I have to manually restart the MySQL server every other week. I don't have time to work on all my tools as I would like; I can't run around kicking infrastructure as well.

Yep, I share your concern! It looks like the docker container containing the mysql server crashed; I'm investigating why that happened and why Trove didn't restart it automatically :/

Seems to have happened again just now. I was importing a rather large table (views); the import had been running for hours(?). Not sure if that's related.

The host has been OOMing. I'm in the process of trying to resize it; if that goes poorly we may want to start over with a bigger host.

bah, the resizing is going poorly. I can try heroic measures to rescue the data or just restart from scratch -- are you ok with the latter?

Sorry that this is going poorly! It's likely that with enough RAM we'll get this stable and I want to get you settled before I go down the side-track of figuring out why resizing failed.

Yes, that's fine. Let me know when I can re-import it.

I've rebuilt the database instance and you should be able to start syncing again. I'm not thrilled with how Trove has been acting today, but I suspect that with 4 GB of RAM this one will stay up for a while.

It took a few days, but the database has been successfully copied over to Trove. I am taking a final mysqldump from toolsdb now; then s51203__baglama2_p can be removed. I will post here when it's done.

Everything still going well with this? I'm curious how the move to trove is working out. Hope all is well!

I have successfully moved all data over to trove, and took a snapshot of the toolsdb version.
The web interface and the background tools have been switched over to the trove version and are reading/writing successfully.
As far as I am concerned, the toolsdb s51203__baglama2_p can be deleted.
Should I do that, or do you want to do the honors?

After T329853, can we please add something so databases don't stay vanished until I complain at "a proper support venue" (which seems to be only IRC, with the Mattermost link broken)?

@Magnus sorry about that, and thanks @taavi for restarting the instance. I created T329949 to investigate if we can avoid this happening in the future.

> As far as I am concerned, the toolsdb s51203__baglama2_p can be deleted.
> Should I do that, or do you want to do the honors?

Sorry for not getting back sooner about this. I have just deleted the old database from ToolsDB:

MariaDB [(none)]> DROP DATABASE s51203__baglama2_p;
Query OK, 7 rows affected (2.19 sec)

Mentioned in SAL (#wikimedia-cloud) [2023-02-17T16:13:31Z] <dhinus> drop unused database s51203__baglama2_p T323502