User Details
- User Since: Nov 2 2014, 11:35 PM (489 w, 1 d)
- Availability: Available
- IRC Nick: andrewbogott
- LDAP User: Unknown
- MediaWiki User: Andrewbogott [ Global Accounts ]
Fri, Mar 15
I have built a very large server (cloudinfra-cloudvps-puppetserver-1.cloudinfra.eqiad1.wikimedia.cloud) which I hope will be able to handle all of cloud-vps on its own. I'll change the DNS entry on Monday.
Thu, Mar 14
+1 sounds good
Mon, Mar 11
The technical bits are mostly tracked here: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Trove#guest_containers. I wrote those docs assuming someone would be reading the upstream trove docs at the same time, though.
Pinging @dr0ptp4kt as I heard a rumor that he might know something.
Thu, Mar 7
@taavi, project-proxy seems to have been partially migrated, with some hosts using project-proxy-puppetmaster-01.project-proxy.eqiad.wmflabs and some using project-proxy-puppetserver-1.project-proxy.eqiad1.wikimedia.cloud. Can you advise about next steps? Are the clients using puppetmaster-01 slated for deletion?
If you aren't interested in diving into the trove code then yeah, zeroing out the db values is what I'd do. It turns out that maintaining quota consistency is hard.
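For reference, a rough sketch of what zeroing the values could look like, assuming the default trove database name and its usual quota_usages table (verify the schema first; '<project-id>' below is a placeholder):
  # Illustrative only; check table and column names against the live database before running.
  mysql trove -e "SELECT tenant_id, resource, in_use, reserved FROM quota_usages WHERE tenant_id='<project-id>';"
  mysql trove -e "UPDATE quota_usages SET in_use=0, reserved=0 WHERE tenant_id='<project-id>';"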
Tue, Mar 5
Notes from today's (unproductive) meeting:
puppet7 servers need > 1 GB of RAM or they swap
Mon, Mar 4
I think the right thing here is to update these to replicate the behavior of cloudbackup200[12]. I'll have a look at that.
Wed, Feb 28
The fix for this is to first switch the VM to puppet7 via hiera (profile::puppet::agent::force_puppet7: true), get a clean puppet run, and then apply the role::puppetserver::cloud_vps_project role.
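As a rough sketch of that sequence, with only the hiera key and role name taken from above (the agent invocation shown is the generic one and may differ from the local wrapper):
  # 1. In the VM's hiera (project or instance hiera), set:
  #      profile::puppet::agent::force_puppet7: true
  # 2. On the VM, get a clean run against the puppet7 server:
  sudo puppet agent --test
  # 3. Only then apply role::puppetserver::cloud_vps_project and run the agent again.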
Tue, Feb 27
I've moved Designate to cloudcontrol nodes, with pdns services still running on cloudservices nodes. This means we now have consistent addressing everywhere.
Mon, Feb 26
Please remove that bit of config.
Thu, Feb 22
I'm perfectly happy with *.internal.toolforge.org or *.infra.toolforge.org, which seems to be what Taavi prefers as well :)
Tue, Feb 20
Backy2 always persists one snapshot for each volume in order to do incremental backups. So as far as I can tell we will have this issue (or a similar one) with every volume that is ever backed up.
Sun, Feb 18
The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it on this task with a more specific project tag. Thanks!
Feb 15 2024
OK, in summary, I think taavi fixed it. What I would do next time is
Typically on a reimage we don't need to remove or rediscover hosts; the pool is based on hostname so the reimaged hosts should rejoin without any issues.
Feb 14 2024
It's back in service but only as of today.
Feb 13 2024
I noticed this morning that this broke new VMs based on images built before the new resolver IP was added. To fix, I rebuilt and installed a new Bullseye base image. I built a new Buster base image in 'testlabs', but in toolforge I just disabled the existing Buster image, since I'm hopeful that we won't be building any new Buster VMs.
We discussed this at length during our toolforge council meeting. We considered two options, neither of which is very popular.
I think this is a past bug, rather than a present bug. Many older designate records don't have an associated managed_resource_id, which is what designate-sink uses to perform the cleanup:
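(The query output itself isn't reproduced here.) As an illustrative sketch of how such rows can be found, assuming designate's usual records table and column names (verify against the live schema):
  # Illustrative only.
  mysql designate -e "SELECT COUNT(*) FROM records WHERE managed_resource_id IS NULL;"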
Feb 11 2024
This looks to be full again.
Feb 9 2024
According to cumin:
They were likely broken enough for cumin to not reach them. I'll nonetheless work on that list a bit.
Feb 8 2024
This is pretty clearly a bug with cinder or rbd, either with how things get moved to the trash, or with how the trash itself is managed.
# rbd children -a --pool eqiad1-cinder volume-e25dae8a-803a-4b62-aa0c-bdf6ff481869@snapshot-f46d30ca-e655-4892-91da-473ccb60bfd4
eqiad1-cinder/volume-80faf64d-c623-4965-af7c-9ea96103f39f (trash 8626bb8cb96871)
I suspect that the current issue is that the volume that was created and deleted is still in the trash. I cannot empty the trash, though:
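(The trash listing itself isn't captured above.) For reference, the standard rbd subcommands for inspecting and purging a pool's trash, with the pool name taken from this task (purging is the operation that was failing here):
  rbd trash ls --pool eqiad1-cinder
  rbd trash purge eqiad1-cinder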
From logstash
Feb 7 2024
The topic of this doesn't quite fit with the initial description. I /think/ that T170355 is the same ask (and it's done, and somewhat documented) but I'm confused by the 'for a user' in the task title.
There's no need to coordinate with us for cloudbackup2001; it might cause us to get a transient alert, but that service isn't the most stable anyway :)
Feb 3 2024
A lot of what you're seeing is because of having the admin flag, I think. When I log in as 'Andrew bogott mortal' and look at a project I'm a member of, this is what the creation dialog offers:
Feb 2 2024
This is silly, but I think the solution to this is moving Designate services onto cloudcontrol nodes. If we keep pdns on cloudservices nodes then all the traffic governed by clouds.yaml will be between hosts rather than local to one host, and we can use the private 172.x addresses everywhere.
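As a hypothetical sanity check once the move is done (not from the task itself), the DNS endpoints registered in keystone can be listed to confirm they point at cloudcontrol / private 172.x addresses:
  openstack endpoint list --service dns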
All existing guest VMs need a config upgrade to find the right container images
Jan 26 2024
This may have been fixed by someone else since I looked at it last night. Right now I'm waiting for the task to re-run on clouddb1015 after doing 'systemctl reset-failed'.
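For reference, the relevant commands (the unit name below is a placeholder):
  systemctl --failed                       # list units currently in the failed state
  sudo systemctl reset-failed <unit-name>  # clear the failed state so the unit can start again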
Jan 25 2024
Adapting that sysbee doc to our platform, I got down to this:
Some doc links (that I haven't finished reading):
Thanks! Let's let this sit without workload for a week or so and see if it stays up, then we can try giving it some work to do.
Jan 23 2024
This was a side-effect of galera work; it cleared itself after a bit.
Jan 22 2024
Whatever this was is now resolved.
Jan 19 2024
Galera on codfw1dev is now using private addresses. Let's see if it stays happy over the weekend.
+1 sounds good but will probably not implement until Monday
Jan 18 2024
This was a conflict between galera configs; fixed by adding --port=3306 to the command line.
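A minimal sketch of the shape of that change plus a quick check, with only the --port flag taken from this task (the actual daemon invocation on the node may differ):
  # e.g. appended to the mariadbd/galera command line (illustrative; <existing-flags> is a placeholder):
  mariadbd --port=3306 <existing-flags>
  # confirm the listener afterwards:
  ss -ltn | grep ':3306'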