Eliminate SPOFs in the existing eqiad kubernetes infrastructure
Closed, Resolved (Public)

Description

The current infrastructure in EQIAD has a few identified Single Points of Failure (SPOFs) that should be addressed. Those are:

  • The etcd/master cluster powering kubernetes is concentrated in a single rack row
  • BGP configuration for the pod IP space is done only on cr1-eqiad
  • Single master SPOF

Event Timeline

akosiaris added a subtask: Unknown Object (Task). Apr 3 2017, 1:56 PM

I am processing task T161702, the purchase of ganeti nodes in eqiad, in the procurement space. It is projected to result in an order for 4 new ganeti hosts for eqiad. (Just commenting here since that blocking subtask is no longer public now that pricing has been added.)

akosiaris renamed this task from Eliminate SPOFs in the existing eqiad infrastructure to Eliminate SPOFs in the existing eqiad kubernetes infrastructure. May 4 2017, 1:23 PM

Change 351836 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Create kubemaster.svc.$site.wmnet

https://gerrit.wikimedia.org/r/351836

Change 352580 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/puppet@production] lvs: Add the kubernetes master service/cluster

https://gerrit.wikimedia.org/r/352580

faidon closed subtask Unknown Object (Task) as Resolved. May 9 2017, 3:55 PM

Change 351836 merged by Alexandros Kosiaris:
[operations/dns@master] Create kubemaster.svc.$site.wmnet

https://gerrit.wikimedia.org/r/351836

Change 352580 merged by Alexandros Kosiaris:
[operations/puppet@production] lvs: Add the kubernetes master service/cluster

https://gerrit.wikimedia.org/r/352580

The single master (with a manual override) SPOF has been addressed. We now have 2 masters in eqiad behind an LVS service with a proper puppet certificate.

Change 359908 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Renumber neon.eqiad.wmnet

https://gerrit.wikimedia.org/r/359908

Change 359909 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Renumber etcd100{2,3,4,5,6}.eqiad.wmnet

https://gerrit.wikimedia.org/r/359909

Change 359910 had a related patch set uploaded (by Alexandros Kosiaris; owner: Alexandros Kosiaris):
[operations/dns@master] Renumber chlorine.eqiad.wmnet

https://gerrit.wikimedia.org/r/359910

Change 359908 merged by Alexandros Kosiaris:
[operations/dns@master] Renumber neon.eqiad.wmnet

https://gerrit.wikimedia.org/r/359908

Change 359909 merged by Alexandros Kosiaris:
[operations/dns@master] Renumber etcd100{2,3,4,5}.eqiad.wmnet

https://gerrit.wikimedia.org/r/359909

Change 359910 merged by Alexandros Kosiaris:
[operations/dns@master] Renumber chlorine.eqiad.wmnet

https://gerrit.wikimedia.org/r/359910

akosiaris updated the task description.

2 of the 3 etcd hosts are now in a different row, giving us read functionality in the worst-case scenario (failure of row A) and read/write functionality in case of failure of row C. chlorine has been moved to row A as well. This is now successfully resolved.
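
The row-failure reasoning above comes down to etcd quorum arithmetic: a 3-member cluster needs 2 members to accept writes. The following is a minimal sketch of that calculation, not anything deployed here. The host names and the exact placement (two members in row A, one in row C) are assumptions inferred from the comment above, and "read-only" here means only non-quorum (serializable, possibly stale) reads from the surviving member.

```python
from typing import Dict


def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given member count."""
    return members // 2 + 1


def mode_after_row_failure(placement: Dict[str, str], failed_row: str) -> str:
    """Return what the cluster can still do once every host in failed_row is down."""
    survivors = [host for host, row in placement.items() if row != failed_row]
    if len(survivors) >= quorum(len(placement)):
        return "read/write"  # majority intact, writes can still commit
    if survivors:
        return "read-only"  # no majority, only non-quorum reads remain
    return "unavailable"


# Hypothetical host names; the placement is an assumption, not taken from DNS.
placement = {"etcd-a1": "A", "etcd-a2": "A", "etcd-c1": "C"}

for row in ("A", "C"):
    print(f"row {row} fails: {mode_after_row_failure(placement, row)}")
```

Under these assumptions, losing row A leaves one member (below the quorum of 2), while losing row C leaves two, which matches the read versus read/write distinction described above.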