pybal doesn't fully manage LVS table leaving stale services (on IP change)
Open, Medium, Public

Description

For instance, the service IP for this VIP was changed from:

rush@lvs1006:~# sudo ipvsadm -L | grep TCP
TCP  208.80.154.82:ssh wrr

to (where both coexisted after the change):

rush@lvs1006:~# sudo ipvsadm -L | grep TCP
TCP  208.80.154.82:ssh wrr
TCP  git-ssh.eqiad.wikimedia.org: wrr

I restarted a few times and made sure this was persistent behavior.

I cleaned up the old VIP manually with:

ipvsadm -D -t 208.80.154.82:ssh
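The manual cleanup above could be scripted roughly as follows. This is only a sketch, not Pybal code: it assumes the table is listed with `ipvsadm -L -n` (numeric output, so services appear as `IP:port` rather than resolved names), and `parse_services`, `stale_delete_commands`, the `configured` set and the 192.0.2.x / 198.51.100.x addresses are all made up for illustration. It only builds the `ipvsadm -D` command lines and does not run them.

```python
import re

# Matches virtual-service lines such as "TCP  208.80.154.82:22 wrr".
# Header lines and real-server lines (which start with "->") do not match.
SERVICE_RE = re.compile(r"^(TCP|UDP)\s+(\S+)\s+(\S+)")


def parse_services(ipvsadm_output):
    """Return {(protocol, "address:port"): scheduler} per virtual service."""
    services = {}
    for line in ipvsadm_output.splitlines():
        match = SERVICE_RE.match(line)
        if match:
            proto, address, scheduler = match.groups()
            services[(proto, address)] = scheduler
    return services


def stale_delete_commands(ipvsadm_output, configured):
    """Build 'ipvsadm -D' argument lists for services not in `configured`."""
    flag = {"TCP": "-t", "UDP": "-u"}
    commands = []
    for proto, address in sorted(parse_services(ipvsadm_output)):
        if (proto, address) not in configured:
            commands.append(["ipvsadm", "-D", flag[proto], address])
    return commands


# Hypothetical listing: the old VIP from this task plus a made-up new one.
SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  208.80.154.82:22 wrr
  -> 198.51.100.7:22              Route   10     0          0
TCP  192.0.2.10:22 wrr
"""

if __name__ == "__main__":
    # Only the made-up new VIP is configured, so the old one is flagged.
    for cmd in stale_delete_commands(SAMPLE, {("TCP", "192.0.2.10:22")}):
        print(" ".join(cmd))
```

Printing the commands instead of executing them keeps a human in the loop, which seems sensible on a production load balancer.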

This may be too much of an edge case to worry about seriously.

Event Timeline

chasemp raised the priority of this task to Medium.
chasemp updated the task description.
chasemp added projects: Pybal, acl*sre-team.
chasemp added a subscriber: chasemp.

It has happened before and has actually caused site issues in the past, so definitely an issue we should solve at some point.

It's trivial to have Pybal clear the ipvsadm table on startup of course, but I deemed that undesirable behaviour back in 2006 when I wrote it, when we still had to resort to quick manual intervention/hacks often enough. :)

Should we clear the table upon pybal startup now that manual hacks are not that frequent? :)

I think wiping the whole table, even at startup, is probably not ideal (but certainly better than wiping it on shutdown!). What we should really be aiming for is better state sync. Pybal should delete unconfigured services on startup, but it shouldn't delete and then recreate ones that remained stable. So basically it needs to read the current state and work out from that the minimal actions needed to bring it into alignment with the configuration. What it seems to do now is more blind/idempotent than that, and that leads to these kinds of issues.
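As a rough illustration of that idea (not how Pybal works today), the reconciliation step could look something like the sketch below. It assumes both the current table and the configuration can be reduced to a mapping from (protocol, "address:port") to scheduler name. `plan_sync`, the action tuples and the 192.0.2.x addresses are hypothetical, and real servers behind each service are left out for brevity.

```python
def plan_sync(current, desired):
    """Compute the minimal actions that turn `current` into `desired`.

    Both arguments map (protocol, "address:port") to a scheduler name.
    Services that are identical on both sides produce no action at all.
    """
    actions = []
    # Services present in the kernel table but no longer configured.
    for key in sorted(current.keys() - desired.keys()):
        actions.append(("delete", key, None))
    # Services configured but missing from the kernel table.
    for key in sorted(desired.keys() - current.keys()):
        actions.append(("add", key, desired[key]))
    # Services present on both sides whose scheduler changed.
    for key in sorted(current.keys() & desired.keys()):
        if current[key] != desired[key]:
            actions.append(("edit", key, desired[key]))
    return actions


if __name__ == "__main__":
    current = {
        ("TCP", "208.80.154.82:22"): "wrr",  # old VIP from this task
        ("TCP", "192.0.2.20:80"): "wrr",     # made-up stable service
        ("TCP", "192.0.2.30:443"): "wlc",    # made-up, scheduler changes
    }
    desired = {
        ("TCP", "192.0.2.10:22"): "wrr",     # made-up new VIP
        ("TCP", "192.0.2.20:80"): "wrr",
        ("TCP", "192.0.2.30:443"): "sh",
    }
    for action in plan_sync(current, desired):
        print(action)
```

In this example the stable service on 192.0.2.20:80 never appears in the plan, so it would not be deleted and recreated. The three action kinds correspond to what `ipvsadm -D`, `-A` and `-E` do, whichever backend ends up applying them.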

Bonus points down the road, of course, if we can reconfigure services without restarting pybal at all (though we'll probably still want fairly smooth restarts for code upgrades regardless).

The real solution for this is to dedicate real developer time to pybal to move it to an FSM and a netlink-based Python IPVS client.

All of the ground work is already done in a few patches of mine:

https://gerrit.wikimedia.org/r/#/c/302434/
https://gerrit.wikimedia.org/r/#/c/313556
https://gerrit.wikimedia.org/r/#/c/302435/ [introduces the FSM]
https://gerrit.wikimedia.org/r/#/c/302882 [uses netlink to manage IPVS]

but all of those need more tests/finishing/code review.

If you don't feel like it's worth the effort, you could at least salvage what's in that patch to migrate to a native netlink implementation instead of shelling out to ipvsadm.

That will make things decidedly easier to implement and control.