Migrate pool counters to trusty/jessie
Closed, Resolved (Public)


helium and potassium are still on precise; they should be upgraded to trusty (like the pool counters in codfw) or reinstalled with jessie. This needs to be done carefully, as unavailability of the pool counters causes problems.

Event Timeline

MoritzMuehlenhoff raised the priority of this task from to Needs Triage.
MoritzMuehlenhoff updated the task description.
MoritzMuehlenhoff added a project: SRE.
MoritzMuehlenhoff subscribed.

For one of the pool counters, it probably makes sense to move it to a ganeti VM. Requirements are minimal for a pool counter anyway, so it's a perfect candidate for virtualization. I say one only because we want the two of them to be in different rows, and our ganeti cluster still spans only one row.

@akosiaris I think that's a bad idea: a single poolcounter server dying still causes unavailability (notwithstanding the mitigations we tried to create with T105378). I'd say that until T105378 is resolved somehow, it's better to stick to physical hardware, which has been more stable in general.

I am not fond of coupling the two things (T105378 and migration to whatever). Of course T105378 should be fixed, but ganeti VMs have not exactly been unstable. Barring one obscure bug we might already have a fix for, I remember no other occurrences of ganeti VMs being less stable than hardware.

For the record, a poolcounter server failing should not cause downtime anymore.

fgiunchedi triaged this task as Medium priority. Apr 27 2016, 1:48 PM

Change 313564 had a related patch set uploaded (by Filippo Giunchedi):
poolcounter: move to modules/role

Note that helium is also the bacula director/storage. I propose we start by moving a poolcounter to a ganeti VM and moving it off helium.

Absolutely true, and I second that plan. I'll create a task and a VM for that part.

I tried provisioning deployment-poolcounter03 with jessie to migrate beta too, but at the moment the instance is not accessible via ssh and the console shows:

Debian GNU/Linux 8 deployment-poolcounter03 ttyS0

deployment-poolcounter03 login: 2016-10-03T09:52:06.250841+00:00 deployment-poolcounter03 nslcd[1241]: [334873] <passwd="filippo"> (re)loading /etc/nsswitch.conf
2016-10-03T09:53:22.827801+00:00 deployment-poolcounter03 puppet-agent[523]: Could not request certificate: getaddrinfo: Name or service not known
2016-10-03T09:55:22.839435+00:00 deployment-poolcounter03 puppet-agent[523]: Could not request certificate: getaddrinfo: Name or service not known
2016-10-03T09:57:22.846457+00:00 deployment-poolcounter03 puppet-agent[523]: Could not request certificate: getaddrinfo: Name or service not known
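
The repeated "getaddrinfo: Name or service not known" in the console log above is a name resolution failure rather than a certificate problem as such. A minimal sketch of how one might check the same thing from the instance follows; the host name "puppet" and port 8140 are assumptions for illustration (the log does not say which name the agent tried to resolve), and `can_resolve` is a hypothetical helper, not part of any Wikimedia tooling.

```python
import socket
from typing import Callable


def can_resolve(host: str, port: int = 8140,
                resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if `host` resolves, False on a getaddrinfo failure.

    `resolver` defaults to socket.getaddrinfo and is injectable so the
    check can be exercised without touching real DNS.
    """
    try:
        resolver(host, port)
    except socket.gaierror:
        # Same class of error as "Name or service not known" in the log.
        return False
    return True


if __name__ == "__main__":
    # "puppet" is a placeholder name, not taken from the task.
    print("puppet resolves:", can_resolve("puppet"))
```

If this returns False on a freshly provisioned instance, the agent cannot request a certificate regardless of the puppet master's state, which matches the symptom in the log.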

Change 313789 had a related patch set uploaded (by Filippo Giunchedi):
deployment-prep: Move poolcounter to deployment-poolcounter04

Change 313564 merged by Filippo Giunchedi:
poolcounter: move to modules/role

@Andrew, any idea what's going on there?

FWIW, Yuvi helped figure it out: it happens when an instance's name was used previously. Indeed, provisioning deployment-poolcounter04 worked just fine, so I've destroyed 03.

Change 313789 merged by jenkins-bot:
deployment-prep: Move poolcounter to deployment-poolcounter04

Mentioned in SAL (#wikimedia-releng) [2016-10-04T15:01:14Z] <godog> shutdown deployment-poolcounter02, replaced by deployment-poolcounter04 - T123734

@Dzahn I see technetium was provisioned in T118763 and then destroyed; is it going to be used again? If not, we could just reuse the name (still in DNS) to create the VM for poolcounter.

We talked about this briefly and agreed that "germanium" is free and can be used to avoid reusing a name. The VMs for PCI scanning _might_ be re-created one day, and reusing names has caused unexpected issues and confusion before.

As these are not one-off servers, we should rather use the opportunity to start with poolcounter1001.eqiad.wmnet and adapt the other servers as they get reimaged.

@MoritzMuehlenhoff poolcounter1001 would work for me, though usually PC lives on a shared baremetal machine since its requirements are very small. Given how critical the service is to MW though we could just go for poolcounter* while we're at it.

+1, poolcounter1001 sounds good

Change 316307 had a related patch set uploaded (by Filippo Giunchedi):
eqiad: add poolcounter1001

Change 316307 merged by Filippo Giunchedi:
eqiad: add poolcounter1001

Change 316343 had a related patch set uploaded (by Filippo Giunchedi):
Provision poolcounter1001

Change 316343 merged by Filippo Giunchedi:
Provision poolcounter1001

Change 316356 had a related patch set uploaded (by Filippo Giunchedi):
Replace helium with poolcounter1001

Change 316356 merged by Alexandros Kosiaris:
Replace helium with poolcounter1001

helium replaced with poolcounter1001

Change 317853 had a related patch set uploaded (by Filippo Giunchedi):
Put helium back in service during potassium reimage

Change 317853 merged by Filippo Giunchedi:
Put helium back in service during potassium reimage

Change 317854 had a related patch set uploaded (by Filippo Giunchedi):
Rename potassium as poolcounter1002

Change 317855 had a related patch set uploaded (by Filippo Giunchedi):
Rename potassium as poolcounter1002

Change 317854 merged by Filippo Giunchedi:
Rename potassium as poolcounter1002

Change 317855 merged by Filippo Giunchedi:
Rename potassium as poolcounter1002

Change 317873 had a related patch set uploaded (by Filippo Giunchedi):
Put back potassium as poolcounter1002

Change 317873 merged by Filippo Giunchedi:
Put back potassium as poolcounter1002

fgiunchedi claimed this task.

Potassium reimaged as poolcounter1002, resolving. The follow-up to wrap up the rename is at T149106.
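
The sequence of changes in this task amounts to a rolling replacement in which one host is swapped at a time, so that two pool counters stay configured throughout. The sketch below models that idea. It is only one reading of the change titles above (the actual patches are not shown here), and `simulate_swaps` is a hypothetical helper written for illustration.

```python
from typing import Iterable, List, Set, Tuple


def simulate_swaps(initial: Set[str],
                   swaps: Iterable[Tuple[str, str]],
                   minimum: int = 2) -> List[Set[str]]:
    """Apply (old, new) host swaps and return the pool after each step.

    Raises ValueError if a swap names a host that is not in service,
    or if the pool would drop below `minimum` configured hosts.
    """
    pool = set(initial)
    history = [set(pool)]
    for old, new in swaps:
        if old not in pool:
            raise ValueError(f"{old} is not in service")
        pool.remove(old)
        pool.add(new)  # a no-op if `new` is already present
        if len(pool) < minimum:
            raise ValueError(f"only {len(pool)} host(s) left after {old} -> {new}")
        history.append(set(pool))
    return history


# Assumed reading of the change titles in the timeline above.
steps = [
    ("helium", "poolcounter1001"),      # Replace helium with poolcounter1001
    ("potassium", "helium"),            # helium back in service during the reimage
    ("helium", "poolcounter1002"),      # potassium returns as poolcounter1002
]
history = simulate_swaps({"helium", "potassium"}, steps)
print(sorted(history[-1]))
```

Modelling each change as a swap makes the invariant explicit: a step that would leave fewer than two hosts configured fails instead of silently reducing capacity.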