helium/potassium are still on precise and should be upgraded to trusty (like the pool counters in codfw) or reinstalled with jessie. This needs to be done carefully, as unavailability of the pool counters causes problems.
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Dzahn | T123525 reduce amount of remaining Ubuntu 12.04 (precise) systems in production
Resolved | | fgiunchedi | T123734 Migrate pool counters to trusty/jessie
Resolved | | fgiunchedi | T146277 Build poolcounter for jessie
Resolved | | Cmjohnson | T149106 Rename potassium / WMF3287 as poolcounter1002
Event Timeline
For one of the pool counters, it probably makes sense to move it to a ganeti VM. Requirements are minimal for a pool counter anyway, so it's a perfect candidate for virtualization. I am saying for one only because we want the 2 of them to be in different rows and our ganeti cluster is still in one row only.
@akosiaris I think that's a bad idea - a single poolcounter server dying still causes unavailability (notwithstanding the mitigations we tried to create with T105378). I'd say until T105378 is resolved somehow it's better to stick to physical hardware, which has been more stable in general.
Change 313564 had a related patch set uploaded (by Filippo Giunchedi):
poolcounter: move to modules/role
Note that in helium's case it is also the bacula director/storage. I propose we start by moving a poolcounter to a ganeti VM and moving off helium.
I tried provisioning deployment-poolcounter03 with jessie to migrate beta too but ATM the instance is not accessible via ssh and console shows
Debian GNU/Linux 8 deployment-poolcounter03 ttyS0
deployment-poolcounter03 login: 2016-10-03T09:52:06.250841+00:00 deployment-poolcounter03 nslcd[1241]: [334873] <passwd="filippo"> (re)loading /etc/nsswitch.conf
2016-10-03T09:53:22.827801+00:00 deployment-poolcounter03 puppet-agent[523]: Could not request certificate: getaddrinfo: Name or service not known
2016-10-03T09:55:22.839435+00:00 deployment-poolcounter03 puppet-agent[523]: Could not request certificate: getaddrinfo: Name or service not known
2016-10-03T09:57:22.846457+00:00 deployment-poolcounter03 puppet-agent[523]: Could not request certificate: getaddrinfo: Name or service not known
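For reference, the `getaddrinfo: Name or service not known` message in the console output is a resolver failure: the puppet agent could not resolve the name of the server it requests its certificate from. A minimal sketch of the same check, assuming a hypothetical server name `puppet` and the default puppet master port 8140 (neither is stated in this task):

```python
import socket


def can_resolve(hostname: str, port: int = 8140) -> bool:
    """Return True if the resolver can look up hostname, False otherwise.

    The default port 8140 is an assumption (the usual puppet master port);
    it only affects the returned address tuples, not whether lookup succeeds.
    """
    try:
        socket.getaddrinfo(hostname, port)
    except socket.gaierror:
        # Same class of failure as "Name or service not known" in the log.
        return False
    return True


if __name__ == "__main__":
    # "puppet" is a placeholder; substitute the server name the agent is
    # actually configured to use.
    name = "puppet"
    print(name, "resolves" if can_resolve(name) else "does not resolve")
```

This only confirms whether name resolution works from wherever it is run; it does not explain why resolution failed on the instance.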
Change 313789 had a related patch set uploaded (by Filippo Giunchedi):
deployment-prep: Move poolcounter to deployment-poolcounter04
FWIW Yuvi helped figure it out: it happens when an instance's name was used previously. Indeed, provisioning deployment-poolcounter04 worked just fine, so I've destroyed 03.
Change 313789 merged by jenkins-bot:
deployment-prep: Move poolcounter to deployment-poolcounter04
Mentioned in SAL (#wikimedia-releng) [2016-10-04T15:01:14Z] <godog> shutdown deployment-poolcounter02, replaced by deployment-poolcounter04 - T123734
@Dzahn I see technetium was provisioned in T118763 and then destroyed, is it going to be used again? If not we could just reuse the name (still in DNS) to create the VM for poolcounter
We talked about this briefly and agreed that "germanium" is free and can be used to avoid reusing a name. The VMs for PCI scanning _might_ be re-created one day and reusing names caused unexpected issues and confusion before.
As these are not one-off servers, we should rather take the opportunity to start with poolcounter1001.eqiad.wmnet and adapt the other servers as they get reimaged.
@MoritzMuehlenhoff poolcounter1001 would work for me, though usually PC lives on a shared baremetal machine since its requirements are very small. Given how critical the service is to MW though we could just go for poolcounter* while we're at it.
Change 316307 had a related patch set uploaded (by Filippo Giunchedi):
eqiad: add poolcounter1001
Change 316343 had a related patch set uploaded (by Filippo Giunchedi):
Provision poolcounter1001
Change 316356 had a related patch set uploaded (by Filippo Giunchedi):
Replace helium with poolcounter1001
Change 317853 had a related patch set uploaded (by Filippo Giunchedi):
Put helium back in service during potassium reimage
Change 317853 merged by Filippo Giunchedi:
Put helium back in service during potassium reimage
Change 317854 had a related patch set uploaded (by Filippo Giunchedi):
Rename potassium as poolcounter1002
Change 317855 had a related patch set uploaded (by Filippo Giunchedi):
Rename potassium as poolcounter1002
Change 317873 had a related patch set uploaded (by Filippo Giunchedi):
Put back potassium as poolcounter1002
Potassium reimaged as poolcounter1002, resolving. Follow-up to wrap up the rename is at T149106.