pybal doesn't fully manage LVS table leaving stale services (on IP change)
Closed, Declined (Public)

Description

For instance, the service IP for this VIP was changed from:

rush@lvs1006:~# sudo ipvsadm -L | grep TCP
TCP  208.80.154.82:ssh wrr

to (where both entries coexisted after the change):

rush@lvs1006:~# sudo ipvsadm -L | grep TCP
TCP  208.80.154.82:ssh wrr
TCP  git-ssh.eqiad.wikimedia.org: wrr

I restarted a few times and made sure this was persistent behavior.

I cleaned up the old VIP manually with:

ipvsadm -D -t 208.80.154.82:ssh

This may be too much of an edge case to worry about seriously.

Event Timeline

chasemp raised the priority of this task from to Medium.
chasemp updated the task description.
chasemp added projects: PyBal, acl*sre-team.
chasemp subscribed.

> This may be too much of an edge case to worry about seriously.

It has happened before and has actually caused site issues in the past, so definitely an issue we should solve at some point.

It's trivial to have Pybal clear the ipvsadm table on startup of course, but I deemed that undesirable behaviour back in 2006 when I wrote it - when we still had to revert to quick manual intervention/hacks often enough. :)

> It's trivial to have Pybal clear the ipvsadm table on startup of course, but I deemed that undesirable behaviour back in 2006 when I wrote it - when we still had to revert to quick manual intervention/hacks often enough. :)

Should we clear the table upon pybal startup now that manual hacks are not that frequent? :)

I think wiping the whole table, even at startup, is probably not ideal (but certainly better than wiping it on shutdown!). What we should really be aiming for is just better state-sync. Pybal should delete unconfigured services on startup, but it shouldn't delete and then recreate ones that remained stable. So basically it needs to read the current state and work out from that the minimal actions needed to bring it into alignment with configuration. What it seems to do now is more blind/idempotent than that, but it leads to these kinds of issues.
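To make the state-sync idea concrete, here is a minimal sketch in Python of the "read current state, compute minimal actions" step. It is not PyBal code: the `Service` model, `plan_sync`, and `to_ipvsadm` are hypothetical names introduced for illustration, the tables are plain dicts supplied by the caller (reading the live kernel table is left out), and the `192.0.2.x` addresses are documentation placeholders. Only `208.80.154.82` and the `ipvsadm -D -t` form come from this task; `-A`, `-E`, `-u` and `-s` are the standard ipvsadm flags for add, edit, UDP service and scheduler.

```python
from typing import Dict, List, NamedTuple, Tuple


class Service(NamedTuple):
    """One IPVS virtual service (hypothetical model, not PyBal's own class)."""
    protocol: str  # "tcp" or "udp"
    address: str   # IPv4 VIP, e.g. "208.80.154.82"
    port: int


# Maps each virtual service to its scheduler name, e.g. "wrr".
ServiceTable = Dict[Service, str]


def plan_sync(current: ServiceTable, desired: ServiceTable) -> List[Tuple[str, Service]]:
    """Return the minimal (action, service) list turning `current` into `desired`.

    Services present in both tables with the same scheduler produce no
    action, so stable services are never deleted and recreated.
    """
    actions: List[Tuple[str, Service]] = []
    # Stale services: in the current table but no longer configured.
    for svc in sorted(current.keys() - desired.keys()):
        actions.append(("delete", svc))
    # New services: configured but missing from the current table.
    for svc in sorted(desired.keys() - current.keys()):
        actions.append(("add", svc))
    # Services that exist on both sides but changed scheduler: edit in place.
    for svc in sorted(current.keys() & desired.keys()):
        if current[svc] != desired[svc]:
            actions.append(("edit", svc))
    return actions


def to_ipvsadm(action: str, svc: Service, scheduler: str = "") -> List[str]:
    """Render one planned action as an ipvsadm argv list (IPv4 only)."""
    flag = {"add": "-A", "edit": "-E", "delete": "-D"}[action]
    proto = {"tcp": "-t", "udp": "-u"}[svc.protocol]
    argv = ["ipvsadm", flag, proto, f"{svc.address}:{svc.port}"]
    if action != "delete":
        argv += ["-s", scheduler]
    return argv


# Example: the VIP moved, while an unrelated service stayed the same.
current = {
    Service("tcp", "208.80.154.82", 22): "wrr",  # old VIP from this task
    Service("tcp", "192.0.2.20", 80): "wrr",     # placeholder, unchanged
}
desired = {
    Service("tcp", "192.0.2.10", 22): "wrr",     # placeholder for the new VIP
    Service("tcp", "192.0.2.20", 80): "wrr",
}

for action, svc in plan_sync(current, desired):
    print(" ".join(to_ipvsadm(action, svc, desired.get(svc, ""))))
```

With these inputs the plan contains one delete for the old VIP and one add for the new one; the unchanged service on port 80 is left alone, which is the behaviour a blind "clear and recreate" startup cannot give.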

Bonus points down the road of course, if we can reconfigure services without restarting pybal at all (but then we'll probably still want fairly smooth restarts for when we're doing code upgrades regardless).

The real solution for this is to dedicate real developer time to pybal to move it to use an FSM and a netlink-based Python ipvs client.

All of the ground work is already done in a few patches of mine:

https://gerrit.wikimedia.org/r/#/c/302434/
https://gerrit.wikimedia.org/r/#/c/313556
https://gerrit.wikimedia.org/r/#/c/302435/ [introduces the FSM]
https://gerrit.wikimedia.org/r/#/c/302882 [uses netlink to manage IPVS instead of shelling out to ipvsadm]

but all of those need more tests/finishing/code review.

If you don't feel like it's worth the effort, you could at least salvage what's in that patch to migrate to a native netlink implementation instead of shelling out to ipvsadm.

That will make things decidedly easier to implement and control.

The swap of Traffic for Traffic-Icebox in this ticket's set of tags was based on a bulk action for all such tickets that haven't been updated in 6 months or more. This does not imply any human judgement about the validity or importance of the task, and is simply the first step in a larger task cleanup effort. Further manual triage and/or requests for updates will happen this month for all such tickets. For more detail, have a look at the extended explanation on the main page of Traffic-Icebox. Thank you!

BCornwall subscribed.

Hello!

PyBal's role at WMF will be replaced with an upcoming project. As such, the traffic team has decided to freeze development of non-trivial changes for stability/predictability. While we would love to service this ticket, we do not have the resources to devote to PyBal any more. To keep our work queue organized and prioritized, I will decline this ticket.

(In the unlikely case that PyBal remains in use, here is a common search string for finding and re-opening these declined tickets: aiZ6ohm6)