User Details
- User Since
- Jul 18 2022, 2:39 PM (105 w, 4 d)
- Availability
- Available
- IRC Nick
- dhinus
- LDAP User
- FNegri
- MediaWiki User
FNegri-WMF
Yesterday
Following the instructions on the wiki page, and using the setup for the catalyst project as a reference, I first added the required Hiera key to the project-proxy-acme-chief instance-puppet prefix, and ran Puppet on project-proxy-acme-chief-02.project-proxy.eqiad1.wikimedia.cloud.
Ok, after thinking about it a little bit longer I think I understand it better: the subdomain can be created, but the steps listed at https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Web_proxy#Enable_per-project_subdomain_delegation are required if you want the Web Proxy to generate valid SSL certificates for that subdomain.
I did read the wiki page, but I'm not sure how this is supposed to work. I can see there's already a DNS zone duct.wmcloud.org in the duct project. I expect it would be possible for admins of the duct project to create a wildcard subdomain in that zone from Horizon.
@Don-vip I removed the 2 extra records (the reverse DNS records), and manually updated the A records because they were still pointing to the old IPs. Now you should be able to ssh to both instances.
I have deleted instances encoding04 and encoding05 and recreated them with the same names, but I cannot ssh to the new ones.
Thu, Jul 25
The code should also just read/parse the credentials from replica.my.cnf instead of copying them.
The credentials have been rotated.
I don't think these credentials have ever been rotated; replica.my.cnf has a date from 2014 :)
Connecting to ToolsDB with the password stored in /data/project/lexica-tool/replica.my.cnf failed with:
ERROR 1045 (28000): Access denied for user 's56035'@'localhost' (using password: YES)
I can see the user and grants in ToolsDB:
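A check along these lines should show whether the user exists and what it is granted (the '%' host pattern is an assumption and may need adjusting):

  SELECT User, Host FROM mysql.user WHERE User = 's56035';
  SHOW GRANTS FOR 's56035'@'%';
  -- if the Host column doesn't match where the client connects from,
  -- the login will fail even with the correct password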
Unfortunately there was one wrong assumption in my calculations above: the number of row lock(s) reported by SHOW ENGINE INNODB STATUS does not match the actual number of rows to be deleted. The number of row locks keeps increasing, maybe because rows are locked incrementally. I assume the second number (undo log entries) does show the correct number of processed rows, and that has now increased from 1299525 to 1500150. The total number it has to reach is not 1315111 as I incorrectly assumed yesterday, but I believe it is the total number of rows in the table:
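For reference, both counters come from the TRANSACTIONS section of the InnoDB status output; a rough sketch of how to read them (the transaction id and the table names are placeholders):

  SHOW ENGINE INNODB STATUS\G
  -- in the TRANSACTIONS section, the long-running DELETE shows up as something like:
  --   ---TRANSACTION 421998, ACTIVE 86400 sec
  --   ... lock struct(s), ... row lock(s), undo log entries 1500150
  -- "undo log entries" appears to track the rows already processed by the DELETE;
  -- comparing it with the table's total row count gives a progress estimate:
  SELECT COUNT(*) FROM some_db.some_table;  -- exact, but slow on a large table
  SELECT TABLE_ROWS FROM information_schema.TABLES
   WHERE TABLE_SCHEMA = 'some_db' AND TABLE_NAME = 'some_table';  -- approximate, but instant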
Wed, Jul 24
This is an Epic that will require multiple quarters. I'm removing the milestone cloud-services-team (FY2024/2025-Q1-Q2) from this task, and adding that milestone to two subtasks of this one that I can realistically start in this quarter:
Update: I think I found a way to estimate how long the transaction will take, thanks to this brilliant StackOverflow post.
Replication is still stuck processing the same DELETE transaction. From past experience, it usually takes a few days to complete the transaction and catch up. I haven't yet found a way to estimate the time it's gonna take, as that can vary depending on the size of the transaction.
Wiki Replicas docs improvements are tracked in this newer task: T365717: [wikireplicas] Update Admin docs
This is not happening anymore.
Upgrade to Ceph v16 is tracked in T306820: [ceph] Upgrade to v16
Tue, Jul 23
Replication lag on clouddb1019 (s4) remained at 0 until 11:25 UTC today, then it started increasing again.
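For the record, the lag can be confirmed directly on the replica, independently of the alert (run on clouddb1019):

  SHOW SLAVE STATUS\G
  -- Seconds_Behind_Master is the current lag;
  -- Relay_Master_Log_File / Exec_Master_Log_Pos show how far the SQL thread has applied events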
The alert went back to green 1 minute after posting the comment above :)
The same error happened last month (T368211), and was fixed by @Jhancock.wm by reseating the cable.
I acked the alert for 24 hours; hopefully it will catch up by then.
Mon, Jul 22
As an additional test, I killed all the remaining long queries for user s52168 (tools.kmlexport), and 2 long queries for user u1115 (@dschwen):
I spent some more time analyzing long queries running on clouddb1019. I found a few long queries for user s52168 (tools.kmlexport), and killing the 3 oldest ones did indeed cause the replication lag to stop growing and to start going down:
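The find-and-kill part is standard MariaDB; roughly, the statements look like this (the thread id passed to KILL is a placeholder):

  SELECT ID, USER, TIME, LEFT(INFO, 80) AS query
    FROM information_schema.PROCESSLIST
   WHERE USER = 's52168'
   ORDER BY TIME DESC;
  KILL 123456;  -- placeholder: one of the IDs returned by the query above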
Fri, Jul 19
It seems tofu itself is already protected against this, so we may not need any locking in the cookbooks after all.
Seems to be working now:
Do we make use of the etcd backend mentioned above from the cloudcumin hosts?
Thu, Jul 18
Back from my holidays, here's a glance at what happened since my last comment:
I've added T360488: Missing Perl packages on dev.toolforge.org for anomiebot workflows as a subtask, as a reminder that anomiebot currently relies on the login-buster bastion; if we remove it, that tool is likely to break.
We should also make use of Spicerack's locking functionality to prevent two people from running apply at the same time:
https://doc.wikimedia.org/spicerack/master/introduction.html#distributed-locking
I would maybe force --branch main if you select --apply (or at least require --force to apply a different branch)
Wed, Jul 17
to allow querying hosts by their Puppet classes as well.
I just discovered that we do have dedicated Cumin instances for tools and toolsbeta, where /etc/cumin/config.yaml points to the respective puppetdb instead of the production one:
Tue, Jul 16
@ABran-WMF, @fnegri - I believe that we are ready to run the sre.wikireplicas.add-wiki cookbook for this wiki, which should make it available on both clouddb* and an-redacteddb1001 hosts.
+1
I think it makes sense to do your test first. I can change back the role before the reimage.
@BTullis I think you can proceed with your test and turn off all sections for a week. When that is done and you are confident nothing goes wrong as a result, I will proceed with the reimage. After the reimage is done, we can decommission it.
Adding a note to mention that currently Icinga alerts related to clouddb* hosts are getting tagged with team=wmcs when they are forwarded from Icinga to Alertmanager. I tried to figure out where that tagging happens, but I haven't found it. We should aim to maintain that tagging when the alerts are migrated to Alertmanager.
Mon, Jul 15
Marking this as Resolved as all the main subtasks have been completed. There are 4 subtasks left that are follow-ups to this work.
I'm removing the Data-Services tag as this is not a problem with wikireplicas.
I am fine with having this ticket stalled until IP masking (T283177) is effective,
Thu, Jul 4
@KCVelaga_WMF thanks for testing! Yes, it is kind of expected that all dbs are exposed: the SQL permissions cover all dbs ending in _p. So maybe, to solve the problem of manually listing the dbs, we could just call the db "ToolsDB" in Superset, and ask users to start all their queries with use some_db_p;. What do you think?
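For example, a query in Superset would then look like this (the database and table names are made up):

  USE s12345__mytool_p;        -- hypothetical ToolsDB database ending in _p
  SELECT COUNT(*) FROM edits;  -- hypothetical table in that database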
@Andrew @rook can you please review the PR at https://github.com/toolforge/superset-deploy/pull/26 and test if it works?
Setting this to "Stalled" as I have too many things on my plate and this is not super urgent. I'd like to get back to it at some point but feel free to claim it if you are interested in this.
Wed, Jul 3
Additional clean-up: I removed the grant for heartbeat_p as that is already implied by the grant for %\_p.
I created the user and grants in ToolsDB, similar to the Quarry ones I created in T348407.
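Roughly, the statements look something like this (the user name, host pattern, and password are placeholders, not the real values):

  CREATE USER 'superset_ro'@'%' IDENTIFIED BY '********';  -- placeholder account
  GRANT SELECT ON `%\_p`.* TO 'superset_ro'@'%';           -- read-only access to all databases ending in _p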
Two days later, it's still not looking great:
DB storage is tracked in the subtask T291782: Migrate largest ToolsDB users to Trove