cc: everyone who has been active in the toolsbeta project
The future parser has drawn few complaints, so we're ready to move on to actual upgrade testing.
Current theory is that this happens when the labs-private repo is in the process of being rebased.
Mon, Oct 16
Thu, Oct 12
Here's my latest attempt to describe what works. Once the concerned patches are merged I'll try to get this down on wikitech someplace.
Horizon can't quite delete everything yet, so I generally delete everything that Horizon can see first and then use the 'delete' link in wikitech. Horizon is /close/ to being able to do everything but it needs a bit of work.
It looks to me like this is filtered in maintain-views.yaml via logging_whitelist. However, I don't see that 'pagetranslation' has ever been in that list (or at least not since 2016-10-12, which is when the history becomes murky).
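Roughly what I'd expect the fix to look like, as a sketch only; the surrounding list entries are placeholders, and only the logging_whitelist key name comes from maintain-views.yaml itself:

    # hypothetical excerpt from maintain-views.yaml; the other entries are placeholders
    logging_whitelist:
      - block
      - delete
      - move
      - pagetranslation   # would need to be added here for those rows to be exposed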
Wed, Oct 11
When I refreshed puppet on the affected host, it included this diff:
Tue, Oct 10
ok -- I was expecting this table to be present in enwiki. If it's wikidata-specific then we're probably done. @Ladsgroup can you confirm?
I've run maintain-views, but the wb_terms table isn't getting replicated at all. I don't see any evidence of filtering in the sanitarium files but I may be looking in the wrong place... @Marostegui, any ideas?
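In case it helps narrow down where the filtering happens, this is roughly how I've been checking; a sketch only, with enwiki/enwiki_p used as placeholder database names:

    -- on the labsdb side: is there a view at all?
    SHOW FULL TABLES IN enwiki_p LIKE 'wb_terms';
    -- on the sanitarium/replica side: does the underlying table even exist?
    SHOW TABLES IN enwiki LIKE 'wb_terms';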
ok! I've raised the quota to 4 IPs. Let's leave this task open and you can nudge me when you're ready to clean up.
We don't support CamelCase in project names, so I've created a project called 'mwstake'. @MarkAHershberger is a project admin and can add other users or admins as needed.
Do you already have enough RAM/CPU quota to create the additional instances? Is it really just the IPs that are holding you back?
Approved, will do shortly
This seems to have been caused by https://gerrit.wikimedia.org/r/#/c/382415/, which has now been reverted.
Mon, Oct 9
Fri, Oct 6
I'm trying to reproduce the tools puppet compiler described here. A few things have clearly changed since this was last built... The hiera setup I seem to need looks like this:
Thu, Oct 5
I ran the export and import by hand just now, and I think we're getting the complete wiki.
I've directed shinken-wm to talk in #wikimedia-cloud-feed.
A --current dump is 8.6M, a --full dump is 7.2G. So doing --full may not be practical.
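For reference, this is roughly the shape of the two dumps being compared; a sketch using MediaWiki's dumpBackup.php, with the wiki name and output paths as assumptions rather than the exact command we run:

    # sketch only; --wiki value and output paths are assumptions
    # --current: latest revision of each page (the 8.6M dump)
    php maintenance/dumpBackup.php --wiki=labswiki --current > wikitech-current.xml
    # --full: every revision of every page (the 7.2G dump)
    php maintenance/dumpBackup.php --wiki=labswiki --full > wikitech-full.xml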
Is there anything I can do to nudge this along, short of 'clone Jaime'?
These boxes are up and installed and seem OK. Actual service implementation is tracked in T168470.
Adding a regex validation to the instance name in Horizon turns out to be non-trivial in the current version.
rabbit is now much quieter, so this is /maybe/ better. Closing for now, optimistically.
Wikitech is dumped using
Wed, Oct 4
Every labvirt is now running Linux labvirt1008 4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
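If anyone wants to double-check later, something like this works (the host range here is a guess, not the canonical inventory):

    # prints the running kernel on each labvirt; host range is a guess
    for h in labvirt10{01..18}; do echo -n "$h: "; ssh "$h" uname -r; done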
Tue, Oct 3
To add new files, copy them to download-01.download.eqiad.wmlabs:/srv/public_files/
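For example, something like this (the filename is just a placeholder):

    # placeholder filename; destination directory as above
    scp big-dataset.tar.gz download-01.download.eqiad.wmlabs:/srv/public_files/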
In order to keep big files off of NFS, I've created a static download site for things like this. Your file, for example, is now:
Approved! @chasemp will help with the specifics of performance testing.
Closed by accident or vandalism
Closed in error, best I can tell
Sun, Oct 1
Here's the latest mcelog. Without timestamps it's hard to correlate this with the failures, but it still looks bad.
The last syslog entry before the reboot was at Oct 1 01:21:01. It was down for many hours and didn't page because I had downtimed it during the hardware replacement and didn't clear the downtime before putting it back into service :( There's nothing in the syslog or kernel log to indicate distress.
Fri, Sep 29
I've rebuilt labvirt1015, 1017 and 1018 (and the labtestvirts) with 4.4.0-81. So now all of our virt nodes are running that kernel except for 1016, which needs an evacuation before I mess with it.
This is as fixed as it's going to be. Any time there's a designate outage I need to run the dnsleaks script to clean up.
An equivalent to 375941 was merged as part of a larger refactor, and I'm pretty sure this is adequate for the problem.
There are only a few left that are broken, and I've emailed all the owners.
Tue, Sep 26
I've confirmed that nova detects name collisions between 'camelcase' and 'CamelCase', so this isn't especially urgent. There's still a potential race if users get really unlucky and create instances with colliding names at exactly the same time.
Mon, Sep 25
which, due to the requirements of mw-vagrant, isn't possible for me
Fri, Sep 22
As far as I can see, the docs only describe setting ca_server once, for agents, in the [main] block. I am missing an explanation of why we would set it twice, and what setting it in [agent] does vs. what setting it in [master] does.
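To make the question concrete, this is the shape I'm asking about; a puppet.conf sketch with placeholder hostnames, not our actual config:

    # puppet.conf sketch; hostnames are placeholders
    [main]
        # documented usage: agents send cert requests to this host
        ca_server = puppetca.example.wmnet

    [agent]
        # what does setting it here add beyond [main]?
        ca_server = puppetca.example.wmnet

    [master]
        # ...and what does it mean for the master itself?
        ca_server = puppetca.example.wmnet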
This has almost totally stopped happening; when it does happen it's usually for a good (but new) reason. So I don't think this bug itself is useful anymore.
Thu, Sep 21
Done via GRANT SHOW VIEW ON *.* TO 's53508'@'%' on labsdb1001 and labsdb1003
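To double-check on each host, a standard follow-up query (not a record of what I actually ran):

    SHOW GRANTS FOR 's53508'@'%';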
Wed, Sep 20
This is modestly different from T42022, but it needs to be retitled to make that clear: T42022 is about public HTTP APIs, while this is about internal services, which can break even when the public APIs keep working.
I've added fullstack success % to the above graph. We still need to add some auto-cleanup functions to the fullstack test to keep accurate numbers.
Tue, Sep 19
I've moved labvirt1018 to 4.4.0-83 but can't reproduce this issue.
Is this something that could be done within the existing ores project?
@Reedy definitely no need to cherry-pick if this is getting pushed out today :)
Was network connectivity lost to the server at large or to the VMs running on that labvirt instance?
Sep 17 2017
@Reedy I'm on holiday, so I only got as far as seeing that that one use case produces the problem. I don't immediately know how to find all the mismatches, although there are quite a few:
That file is produced on silver via this command:
Sep 16 2017
I re-imaged labvirt1015 and 1017. They're now running 4.4.0-93-generic and rebooting fine. Do I know what just happened here? I do not.
I just upgraded labvirt1015 and 1017 to -93 and rebooted, and both lost network config just like we saw with -83. So something very bad is going on here. I'm going to re-image 1015 and see where I get.
Sep 14 2017
I have some API uptime stats at https://grafana.wikimedia.org/dashboard/db/wmcs-api-uptimes?orgId=1