
inconsistencies between pybal configuration and IPVS status
Closed, Resolved · Public

Description

After @fselles depooled kubernetes1001.eqiad.wmnet (T213859) the following alert appeared on lvs1006 and lvs1016:

CRITICAL: Hosts in IPVS but unknown to PyBal: set(['kubernetes1001.eqiad.wmnet'])

Using ipvsadm, we can confirm that kubernetes1001 is still present in IPVS on both LVS servers:

vgutierrez@lvs1016:/var/log$ host kubernetes1001.eqiad.wmnet
kubernetes1001.eqiad.wmnet has address 10.64.0.121
vgutierrez@cumin1001:~$ sudo cumin lvs1006.wikimedia.org,lvs1016.eqiad.wmnet "ipvsadm -Ln | fgrep 10.64.0.121"
2 hosts will be targeted:
lvs1016.eqiad.wmnet,lvs1006.wikimedia.org
Confirm to continue [y/n]? y
===== NODE GROUP =====
(2) lvs1016.eqiad.wmnet,lvs1006.wikimedia.org
----- OUTPUT of 'ipvsadm -Ln | fgrep 10.64.0.121' -----
  -> 10.64.0.121:1968             Route   10     0          0

Digging a little further, we find that ipvsadm still lists kubernetes1001 as a real server for the following service:

TCP  10.2.2.29:1968 wrr
  -> 10.64.0.121:1968             Route   10     0          0
  -> 10.64.16.75:1968             Route   10     0          0
  -> 10.64.32.23:1968             Route   10     0          0
  -> 10.64.48.52:1968             Route   10     0          0
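
For reference, the lookup above (which virtual service still holds a given real server) can be scripted. This is only a rough sketch: it assumes the `ipvsadm -Ln` layout shown in this task, and the names `parse_ipvsadm` and `services_for_host` are made up for illustration. They are not part of pybal or of the existing Icinga check.

```python
import re

# Virtual service line, e.g. "TCP  10.2.2.29:1968 wrr"
SERVICE_RE = re.compile(r"^(TCP|UDP)\s+(\S+):(\d+)\s+(\S+)")
# Real server line, e.g. "  -> 10.64.0.121:1968   Route   10   0   0"
REAL_RE = re.compile(r"^\s+->\s+(\S+):(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_ipvsadm(output):
    """Map (protocol, ip, port) of each virtual service to its real server IPs."""
    services = {}
    current = None
    for line in output.splitlines():
        match = SERVICE_RE.match(line)
        if match:
            current = (match.group(1).lower(), match.group(2), int(match.group(3)))
            services[current] = []
            continue
        match = REAL_RE.match(line)
        if match and current is not None:
            services[current].append(match.group(1))
    return services


def services_for_host(services, address):
    """Return the virtual services that still list `address` as a real server."""
    return sorted(key for key, reals in services.items() if address in reals)


# Output copied from this task.
SAMPLE = """\
TCP  10.2.2.29:1968 wrr
  -> 10.64.0.121:1968             Route   10     0          0
  -> 10.64.16.75:1968             Route   10     0          0
  -> 10.64.32.23:1968             Route   10     0          0
  -> 10.64.48.52:1968             Route   10     0          0
"""

services = parse_ipvsadm(SAMPLE)
print(services_for_host(services, "10.64.0.121"))
```

The header lines that a full `ipvsadm -Ln` run prints do not match either pattern, so they are skipped.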

Currently there is no service configured in pybal for 10.2.2.29:1968:

vgutierrez@cumin1001:~$ sudo cumin -x lvs1006.wikimedia.org,lvs1016.eqiad.wmnet "fgrep 10.2.2.29 /etc/pybal/pybal.conf"
IGNORE EXIT CODES mode enabled, all commands executed will be considered successful
2 hosts will be targeted:
lvs1016.eqiad.wmnet,lvs1006.wikimedia.org
Confirm to continue [y/n]? y
===== NO OUTPUT =====
PASS:
vgutierrez@cumin1001:~$ sudo cumin -x lvs1006.wikimedia.org,lvs1016.eqiad.wmnet "fgrep 1968 /etc/pybal/pybal.conf"
IGNORE EXIT CODES mode enabled, all commands executed will be considered successful
2 hosts will be targeted:
lvs1016.eqiad.wmnet,lvs1006.wikimedia.org
Confirm to continue [y/n]? y
===== NO OUTPUT =====
PASS:
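
The two fgrep runs above boil down to one question: does IPVS hold a service that pybal.conf does not define? A minimal sketch of that comparison follows. It assumes pybal.conf is INI-style, with `ip`, `port` and `protocol` keys in each service section, and that a missing `protocol` means TCP. The sample config, its section names and the function names are hypothetical, and this is not the check that was eventually merged.

```python
import configparser


def configured_services(conf_text):
    """Collect (protocol, ip, port) for every section that defines ip and port."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    found = set()
    for name in parser.sections():
        section = parser[name]
        if "ip" in section and "port" in section:
            # Assumption: a missing protocol key means TCP.
            protocol = section.get("protocol", "tcp").lower()
            found.add((protocol, section["ip"], int(section["port"])))
    return found


def stale_services(ipvs_services, conf_text):
    """Services present in IPVS but absent from the pybal configuration."""
    return sorted(set(ipvs_services) - configured_services(conf_text))


# Hypothetical config: one unrelated service, nothing for 10.2.2.29:1968.
SAMPLE_CONF = """
[global]
example_option = yes

[example_8080]
protocol = tcp
ip = 10.2.2.1
port = 8080
"""

# What IPVS reports, as (protocol, ip, port) tuples.
in_ipvs = {("tcp", "10.2.2.1", 8080), ("tcp", "10.2.2.29", 1968)}

stale = stale_services(in_ipvs, SAMPLE_CONF)
print(stale)
```

Sections without both `ip` and `port` (such as the global one in the sample) are ignored rather than treated as errors.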

Searching our puppet repo shows that 10.2.2.29 was the service IP used for zoterov2: https://github.com/wikimedia/puppet/commit/d33c334e112d58a55c8564828606efc0db56f6f4

Event Timeline

Restricted Application added a subscriber: Aklapper.

Mentioned in SAL (#wikimedia-operations) [2019-01-17T14:03:34Z] <vgutierrez> running ipvsadm -D -t 10.2.2.29:1968 in lvs1006 - T214041

Mentioned in SAL (#wikimedia-operations) [2019-01-17T14:04:55Z] <vgutierrez> running ipvsadm -D -t 10.2.2.29:1968 in lvs1016 - T214041

After removing a service from pybal, restarting pybal is not enough to get rid of the service at the IPVS level. It has to be removed manually with ipvsadm -D -t ip:port (or -D -u ip:port if the service is UDP instead of TCP).
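
To avoid typos when doing this cleanup on several LVS hosts, the command can be generated rather than typed. The helper below is only a sketch: `delete_command` is a made-up name, it handles IPv4 addresses only, and it builds the argument list without running anything.

```python
def delete_command(protocol, ip, port):
    """Build the ipvsadm argv that deletes one virtual service (IPv4 only)."""
    flags = {"tcp": "-t", "udp": "-u"}
    key = protocol.lower()
    if key not in flags:
        raise ValueError("unsupported protocol: %r" % protocol)
    return ["ipvsadm", "-D", flags[key], "%s:%d" % (ip, port)]


# The service removed in this task.
command = delete_command("tcp", "10.2.2.29", 1968)
print(" ".join(command))
```

Returning a list instead of a string keeps the result usable with `subprocess.run` without shell quoting, should someone want to execute it after reviewing it.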

Vgutierrez triaged this task as Medium priority. Jan 17 2019, 2:30 PM

Change 485044 had a related patch set uploaded (by Vgutierrez; owner: Vgutierrez):
[operations/puppet@production] pybal: check for discrepancies in the configured services

https://gerrit.wikimedia.org/r/485044

Change 485044 merged by Vgutierrez:
[operations/puppet@production] pybal: check for discrepancies in the configured services

https://gerrit.wikimedia.org/r/485044

Mentioned in SAL (#wikimedia-operations) [2019-01-17T18:19:52Z] <vgutierrez> running ipvsadm -D -t 10.2.1.29:1968 in lvs2006 - T214041

Mentioned in SAL (#wikimedia-operations) [2019-01-17T18:22:51Z] <vgutierrez> running ipvsadm -D -t 10.2.1.29:1968 in lvs2003 - T214041

Vgutierrez claimed this task.
Vgutierrez removed a project: Patch-For-Review.