Wed, Aug 9
Here are some user-facing things that I'd like to have metrics for:
Tue, Aug 8
When all is well, silver is fine -- if something breaks and increases the logging rate, I get alerts.
Mon, Aug 7
clush is back up and running on a new host. The only action-item remaining here is the deletion of the old tools-puppetmaster-02.
OK, after a quick chat with Aaron, I've created two big VMs for you:
This is up and working.
Resolved by https://gerrit.wikimedia.org/r/#/c/368454/
Sat, Aug 5
Fri, Aug 4
Thu, Aug 3
As far as I can tell, paws.wmflabs.org (the record) can only be in paws.wmflabs.org (the domain). So having the stand-alone record be in a different project from the domain-with-subrecords won't work. That means that #2 is probably impossible to do within tools if we want *.paws.wmflabs.org to be owned by the paws project.
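The constraint above comes down to longest-suffix matching: a record always lands in the most specific zone whose name is a suffix of the record's name, regardless of project. A minimal sketch of that matching rule (the zone list and function are mine, just for illustration):

```shell
# A record belongs to the most specific zone whose name is a suffix of it,
# so paws.wmflabs.org (the record) can only live in paws.wmflabs.org
# (the zone), whichever project owns that zone.
zone_for() {
    record="$1"; best=""
    for zone in paws.wmflabs.org wmflabs.org; do
        case "$record" in
            "$zone"|*."$zone")
                [ "${#zone}" -gt "${#best}" ] && best="$zone" ;;
        esac
    done
    echo "$best"
}

zone_for paws.wmflabs.org    # -> paws.wmflabs.org
zone_for tools.wmflabs.org   # -> wmflabs.org
```

So any `*.paws.wmflabs.org` record necessarily sorts into the paws zone, never into a stand-alone record in another project.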
so, ok, clearly not unused.
btw, it turns out the init script does not observe the START=no setting in /etc/default/puppet, which I tried ages ago and which is why I decided that this wasn't a puppet problem :(
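For reference, this is the guard a Debian-style init script would normally apply -- source /etc/default/puppet and bail out unless START=yes. The script on these images evidently skips it. A sketch (the function name is mine, and it takes the defaults file as an argument so it can be exercised standalone; a real script would hardcode the path):

```shell
# Guard an init script would normally run before starting the daemon.
should_start_puppet() {
    START=yes                           # default when the file says nothing
    defaults="${1:-/etc/default/puppet}"
    [ -f "$defaults" ] && . "$defaults"
    [ "$START" = "yes" ]
}

# In the init script's start) branch:
#   should_start_puppet || exit 0       # honor START=no and do nothing
```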
- Old images use puppet 3.4.3
- New images use puppet 3.8.5
Bisect blames 2b18741526dff42582d84c25e0a8a7fddec080f0 which is very surprising! I need to test more but it seems to be right.
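The actual bisect ran over the puppet repo with an image build-and-boot test at each step, which can't be reproduced here; as a self-contained illustration of the same workflow, here's `git bisect run` locating a planted bad commit in a throwaway repo:

```shell
# Toy bisect: a disposable repo where one commit breaks a "boot" check;
# `git bisect run` finds it automatically. The real case ran this loop
# with an image build + boot test as the check command.
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git config user.email demo@example.org
git config user.name demo
echo ok > boots;     git add boots; git commit -qm 'baseline (good)'
echo a > other;      git add other; git commit -qm 'unrelated (good)'
echo broken > boots; git commit -qam 'breaks boot (bad)'   # the culprit
echo b > other;      git commit -qam 'later work (also bad)'
git bisect start HEAD HEAD~3             # mark tip bad, baseline good
out=$(git bisect run sh -c 'grep -q ok boots')
echo "$out" | grep 'is the first bad commit'
git bisect reset >/dev/null
```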
Wed, Aug 2
I am able to build booting images if I roll back to puppet patch ffdfa2821bca02a0ec013d1e618d4d9690f7ec7d
Mon, Jul 31
- removing all custom-installed packages
- building on a fresh build instance
Due to various concerns we're going to just disable these passwords for now.
that was easy
Sun, Jul 30
There are quite a few google hits about hangs after that 'random: nonblocking pool is initialized' message. Things I've tried:
- building with a 4.x kernel instead of the default trusty kernel
- rearranging the disk volumes to be /sda rather than /vda
Here are similar log snippets from an old, working image.
Fri, Jul 28
Thu, Jul 27
The attached patch should resolve the root cause.
Wed, Jul 26
Tue, Jul 25
Mon, Jul 24
Here are some things that need to be thought about/figured out before we can go forward:
I'm pretty sure that #1 is moot -- at least, anytime we discuss it we conclude that the 'labs-support' vlan isn't really a useful concept and should be eliminated.
Is this tagged with cloud-services-team in error, or is there something you need from us?
Thank you, Chris! This is new hardware and we can live without it... can we leave this in your hands to follow up with Dell? Is there any additional info you need?
(I should note that there's no data of interest on that box -- reimaging is just fine)
Fri, Jul 21
I have a fix to prevent this from happening again... in the meantime I've added novaadmin back to everything.
So currently I think this was caused by a misfire in OpenStackManager's removeUserFromBastionProject():
The first sign of trouble in the keystone log is:
In total it was removed from 53 projects. I'm now checking to see if any one user is in all of those projects (other than novaadmin).
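That cross-check is just an intersection of the projects' member lists; a minimal sketch with invented sample data (real lists would come from ldap or keystone, and the project names here are placeholders):

```shell
# Find users present in every affected project, excluding novaadmin.
# The three member lists below are made-up sample data.
d=$(mktemp -d)
printf '%s\n' alice bob novaadmin   | sort > "$d/deployment-prep"
printf '%s\n' alice carol novaadmin | sort > "$d/tools"
printf '%s\n' alice novaadmin       | sort > "$d/paws"
comm -12 "$d/deployment-prep" "$d/tools" \
  | comm -12 - "$d/paws" \
  | grep -vx novaadmin
# -> alice
```

A user who shows up in the intersection of all 53 projects would be the prime suspect for whatever triggered the removals.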
We've replaced novaadmin in deployment-prep; now it's missing from the following:
I just can't think of any reason why those roles would've been removed :( Investigating.
There was a brief period when novaadmin couldn't log in, is it possible you just caught it at a bad moment? The above curl seems ok to me now.
Thu, Jul 20
Yep, all looks good to me.
Ok, I renamed MABot to 'MABot former'. I think when you retry you should log out entirely and create the account as though you are a new user -- that's the path that is most tested.
The extension is now installed and loading on wikitech-static. It looks terrible, for now -- waiting to see if a re-sync fixes things.
Now running 1.29.0 (52abe24)
I resolved this by running the query in https://ask.openstack.org/en/question/494/how-to-reset-incorrect-quota-count/
Wed, Jul 19
I see the MABot account in the wikitech user table but don't see an ldap record. It might be that the creation process you followed just doesn't work right, or it might be that there was some kind of unreported collision during creation (possibly due to the re-used email address, although that would surprise me).
It's possible that you were unlucky and hit us in the middle of an ldap outage... does the same happen if you try now?
Jessie and Stretch are updated. There are unexpected issues with the Trusty build which I'm working on.
Tue, Jul 18
Is this still meaningful now that there's no puppet config on wikitech? Does Horizon have the same issue?