Thu, May 16
I can confirm that this is present on our current Horizon. It seems like something that can be worked around in the meantime, so I'm going to mark it as low priority and hope that future upgrades address it.
This seems to be fine now. I double-checked the state of apt and dpkg, and although a few things are stuck from race conditions, there's nothing widespread or serious going on.
I still sort of want this, but I'm clearly not really working on it.
This particular thing shouldn't happen any more.
I've created this project and added @Samwalton9 as project admin. They can add additional users as needed.
I doubled the RAM and core quota for the project -- let me know if that doesn't get you where you need to go :)
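For anyone curious, the change is roughly this via the OpenStack CLI (the project name and numbers below are placeholders, not the actual new quota):

```bash
openstack quota show exampleproject                          # check current limits
openstack quota set --cores 16 --ram 32768 exampleproject    # hypothetical doubled values
```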
I'm still puzzled about this. A second look suggests that the only way the firstboot script would fail is if it's unable to resolve the host's own name. But if /that/ happens, I can't imagine things ever working.
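To make that concrete, here's a minimal sketch of the failure mode I mean, assuming the script only needs a forward lookup of its own FQDN (the real firstboot script may resolve the name differently):

```bash
# Hypothetical check -- if this fails, firstboot fails, but then the
# host's DNS is broken and nothing else would work either.
if ! getent hosts "$(hostname -f)" >/dev/null; then
    echo "cannot resolve own hostname" >&2
    exit 1
fi
```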
If only WMCS staff (or people explicitly opting in for pages to those systems) got paged, then this is probably correct behavior. @ArielGlenn says they did not get pages. @fgiunchedi /did/ get paged, though. So... ???
Wed, May 15
This is done now, published at
*bump* Chris, do you have any thoughts about what we should do next here?
So it looks like any failure at all in the firstboot script means that firstboot will be run again on subsequent boots. I'm pretty sure that this is wrong and that we should only ever run it once, even if it throws an error, but I'm going to look at more failure cases first.
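Roughly the shape of the fix I have in mind, with made-up paths (the real firstboot hook will differ):

```bash
# Hypothetical run-once guard: record that firstboot ran even if it failed,
# so an error doesn't cause it to re-run on every subsequent boot.
FLAG=/var/lib/firstboot.done    # made-up marker path
if [ ! -e "$FLAG" ]; then
    /root/firstboot.sh || echo "firstboot failed; will not retry" >&2
    touch "$FLAG"
fi
```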
Ah, so I see. Some VMs were built from a bad base image and for some reason re-run the firstboot script on every boot. That should be fixed now, and I thought I'd caught them all... I'll do a bit more research.
...and it looks like puppet caught up?
This is likely a result of puppet going out of sync. The old recursors were 126.96.36.199 and 188.8.131.52; they were replaced with 184.108.40.206 and 220.127.116.11 as per T221183.
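To spot-check a host after puppet has caught up (the IP below is a placeholder, not one of the real recursors):

```bash
# Which recursors is this host actually configured to use?
grep '^nameserver' /etc/resolv.conf
# Does a given recursor actually answer? (placeholder IP)
dig +short @192.0.2.1 wikimedia.org
```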
Tue, May 14
Drive-by comment: I've also been disappointed by transfer speeds when migrating to/from 10G systems, but never followed up to figure out what the bottleneck was. Worth investigating, I think.
Approved -- we'll create this in the next few days.
Approved -- we'll try to get this created in the next few days.
Mon, May 13
I think we should risk the slight chance of a multi-hour outage. Three days isn't enough time to give proper notice of an evacuation, and if the PDU work goes well, the evacuation will have been in vain anyway. So, I propose:
I've asked for clarification about what kind of power outage is feared here. Since emptying 1028 will cause downtime anyway, I want to know whether the expected downtime from the PDU move is more or less than the downtime associated with evacuation.
Just to clarify -- best case (normal) scenario is no interruption? And worst case is... brief power interruption? Or no power for hours?
Thanks arturo! I worked on this a bit last week but didn't make a whole lot of progress.
Fri, May 10
Thu, May 9
To be clear: The issue is the leaking files. It doesn't matter where the files wind up, they'll cause problems either way.
Wed, May 8
For the composer URLs, I ran:

sudo cumin --force --timeout 500 -o json "A:all" "sed -i 's%/r/p/%/r/%' /srv/composer/.git/config"
After merging the above patch, I corrected the URLs in /var/lib/git/operations/puppet/.git/config and /var/lib/git/operations/software/.git/config on all prod and cloud hosts.
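A quick way to double-check that nothing was missed, reusing cumin as above (the trailing || true keeps non-matching hosts from being reported as failures):

```bash
# Hypothetical verification pass: any file listed still has a stale /r/p/ URL.
sudo cumin "A:all" "grep -l '/r/p/' /srv/composer/.git/config /var/lib/git/operations/puppet/.git/config /var/lib/git/operations/software/.git/config || true"
```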
Tue, May 7
Mon, May 6
Thank you for working on all these, @Cmjohnson !
Sun, May 5
There are a fair number of timeout messages when pdns and mdns try to sync via AXFR. I suspect that's related to the problem, but so far that's all I've found.
Frustratingly, there are no log messages from pdns-recursor about these failures.
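One way to take pdns out of the picture and exercise the transfer by hand (the server and zone names below are placeholders):

```bash
# Hypothetical manual AXFR with a short timeout, to see whether the zone
# transfer itself stalls or only the pdns<->mdns sync does.
dig @mdns-master.example example.wmflabs AXFR +time=5 +tries=1
```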
Fri, May 3
Hm, I seem not to have gotten notified when this was last updated -- sorry! The good news is that self-serve account creation is now back online, so Tupino should be able to just create a new account on wikitech.
Thu, May 2
Tue, Apr 30
I still need something like https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474272/ in order to get cloudvirt1024 online (and to pave the way towards upgrading similar hardware to Stretch). Are there competing solutions at this point, or should I just be bold and merge 474272?
Sat, Apr 27
Fri, Apr 26
pdns is now running and happily talking to the db. I added some docs to the puppet class about what I did.
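For anyone following along, quick checks of that state look something like this (assuming the stock pdns tooling is installed):

```bash
systemctl status pdns --no-pager   # is the daemon up?
pdns_control ping                  # does it respond? prints PONG if so
```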
I think this is moot since we're shutting down these systems. T221857
Thu, Apr 25
@RobH, I'm supposed to assign decom hosts to you at this point, right?
To do this I need:
I hand-edited resolv.conf on these hosts so that they will survive the upcoming nameserver change.
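The edit amounts to something like this; all values below are placeholders rather than the real search domain and recursor addresses:

```bash
# Hand-pinned resolv.conf listing both old and new recursors so lookups
# keep working across the switchover. Placeholder values throughout.
cat > /etc/resolv.conf <<'EOF'
search example.wmflabs
nameserver 192.0.2.1
nameserver 192.0.2.2
EOF
```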