pybal-related issue on host start can break service IPs...
Open, Medium, Public

Description

This still needs further investigation, but the scenario observed was as follows:

  1. lvs3001 was fully puppetized (new jessie install) and was up and running successfully, routing traffic via pybal + ipvsadm.
  2. Stopped the pybal service to fail over to the backup LVS (lvs3003).
  3. Rebooted lvs3001 (for ethernet driver params to take effect).
  4. On machine boot (systemd starting all services), pybal came up and began talking BGP, causing traffic to flip back to lvs3001. However, some service IPs were broken and had no backend destinations, and traffic dropped off.
  5. "service pybal restart" fixed the issue.

During the bad period when pybal was first running from boot, we had messages like these spamming dmesg/syslog/console:

[  178.195442] IPVS: wrr: TCP 91.198.174.192:80 - no destination available
[  178.195445] IPVS: sh: TCP 91.198.174.192:443 - no destination available
[  178.195473] IPVS: sh: TCP 91.198.174.204:443 - no destination available

and ipvsadm state looked like this (note lack of backends for services in log messages above):

# ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=1048576)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  91.198.174.192:80 wrr
TCP  91.198.174.192:443 sh
TCP  91.198.174.204:80 wrr
  -> 10.20.0.115:80               Route   1      95         3889
  -> 10.20.0.116:80               Route   1      77         3920
  -> 10.20.0.117:80               Route   1      92         3883
  -> 10.20.0.118:80               Route   1      94         3891
TCP  91.198.174.204:443 sh
TCP  [2620:0:862:ed1a::1]:80 wrr
  -> [2620:0:862:102:10:20:0:103]:80 Route   3      0          479
  -> [2620:0:862:102:10:20:0:104]:80 Route   3      2          472
  -> [2620:0:862:102:10:20:0:105]:80 Route   3      1          474
  -> [2620:0:862:102:10:20:0:106]:80 Route   3      1          475
  -> [2620:0:862:102:10:20:0:107]:80 Route   3      1          483
  -> [2620:0:862:102:10:20:0:108]:80 Route   3      0          482
  -> [2620:0:862:102:10:20:0:109]:80 Route   3      3          474
  -> [2620:0:862:102:10:20:0:110]:80 Route   3      1          481
  -> [2620:0:862:102:10:20:0:112]:80 Route   3      2          474
  -> [2620:0:862:102:10:20:0:113]:80 Route   3      1          476
  -> [2620:0:862:102:10:20:0:114]:80 Route   3      4          474
  -> [2620:0:862:102:10:20:0:165]:80 Route   10     10         1584
  -> [2620:0:862:102:10:20:0:166]:80 Route   10     5          1584
  -> [2620:0:862:102:10:20:0:175]:80 Route   10     7          1583
  -> [2620:0:862:102:10:20:0:176]:80 Route   10     4          1575
TCP  [2620:0:862:ed1a::1]:443 sh
  -> [2620:0:862:102:10:20:0:103]:443 Route   3      954        1065
  -> [2620:0:862:102:10:20:0:104]:443 Route   3      1041       2471
  -> [2620:0:862:102:10:20:0:105]:443 Route   3      681        568
  -> [2620:0:862:102:10:20:0:106]:443 Route   3      920        1527
  -> [2620:0:862:102:10:20:0:107]:443 Route   3      697        735
  -> [2620:0:862:102:10:20:0:108]:443 Route   3      720        824
  -> [2620:0:862:102:10:20:0:109]:443 Route   3      739        1445
  -> [2620:0:862:102:10:20:0:110]:443 Route   3      926        987
  -> [2620:0:862:102:10:20:0:112]:443 Route   3      689        2105
  -> [2620:0:862:102:10:20:0:113]:443 Route   3      944        709
  -> [2620:0:862:102:10:20:0:114]:443 Route   3      1093       1330
  -> [2620:0:862:102:10:20:0:165]:443 Route   10     3210       3502
  -> [2620:0:862:102:10:20:0:166]:443 Route   10     1609       6315
  -> [2620:0:862:102:10:20:0:175]:443 Route   10     2380       2659
  -> [2620:0:862:102:10:20:0:176]:443 Route   10     3175       3259
TCP  [2620:0:862:ed1a::1:c]:80 wrr
TCP  [2620:0:862:ed1a::1:c]:443 sh
  -> [2620:0:862:102:10:20:0:115]:443 Route   1      975        1448
  -> [2620:0:862:102:10:20:0:116]:443 Route   1      1003       1420
  -> [2620:0:862:102:10:20:0:117]:443 Route   1      983        1366
  -> [2620:0:862:102:10:20:0:118]:443 Route   1      979        1404
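
To spot this state without reading the table by eye, the listing above can be scanned for virtual services that have no real-server lines under them. The following is a minimal sketch, not part of pybal: the function name is hypothetical, and it assumes the plain "ipvsadm -Ln" text layout shown above (service lines starting with TCP or UDP, real servers starting with "->").

```python
def services_without_backends(ipvsadm_output):
    """Return (protocol, address, scheduler) for services with no real servers.

    Assumes the plain-text layout printed by `ipvsadm -Ln`.
    """
    empty = []
    current = None       # virtual service currently being scanned
    has_backend = False  # whether a "->" line followed it

    for line in ipvsadm_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("->"):
            # Real-server line. The column header also starts with "->",
            # but it appears before any service, so current is still None.
            if current is not None:
                has_backend = True
            continue
        fields = stripped.split()
        if len(fields) >= 3 and fields[0] in ("TCP", "UDP"):
            # A new service starts; record the previous one if it was empty.
            if current is not None and not has_backend:
                empty.append(current)
            current = (fields[0], fields[1], fields[2])
            has_backend = False

    # Account for the last service in the listing.
    if current is not None and not has_backend:
        empty.append(current)
    return empty
```

Fed the output above, this would report the service lines that appear without any "->" entries beneath them, such as 91.198.174.192:80 and 91.198.174.192:443.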

pybal.log indicates it was ipvsadm that failed for some reason:

2015-09-24 12:11:47.008468 ipvsadm exited with status 255 when executing cmdlist ['-a -t [2620:0:862:ed1a::1:c]:80 -r 2620:0:862:102:10:20:0:116 -w 1\n', '-a -t [2620:0:862:ed1a::1:c]:80 -r 2620:0:862:102:10:20:0:115 -w 1\n', '-a -t [2620:0:862:ed1a::1:c]:80 -r 2620:0:862:102:10:20:0:118 -w 1\n', '-a -t [2620:0:862:ed1a::1:c]:80 -r 2620:0:862:102:10:20:0:117 -w 1\n']
2015-09-24 12:11:47.008490 ipvsadm stderr output:
2015-09-24 12:11:47.008505 Memory allocation problem
2015-09-24 12:11:47.008510 Memory allocation problem
2015-09-24 12:11:47.008514 Memory allocation problem
2015-09-24 12:11:47.008518 Memory allocation problem
2015-09-24 12:11:47.008553
2015-09-24 12:11:47.008863 ipvsadm exited with status 255 when executing cmdlist ['-a -t 91.198.174.192:80 -r 10.20.0.108 -w 3\n', '-a -t 91.198.174.192:80 -r 10.20.0.105 -w 3\n', '-a -t 91.198.174.192:80 -r 10.20.0.112 -w 3\n', '-a -t 91.198.174.192:80 -r 10.20.0.107 -w 3\n', '-a -t 91.198.174.192:80 -r 10.20.0.110 -w 3\n', '-a -t 91.198.174.192:80 -r 10.20.0.175 -w 10\n', '-a -t 91.198.174.192:80 -r 10.20.0.109 -w 3\n', '-a -t 91.198.174.192:80 -r 10.20.0.166 -w 10\n', '-a -t 91.198.174.192:80 -r 10.20.0.103 -w 3\n', '-a -t 91.198.174.192:80 -r 10.20.0.104 -w 3\n', '-a -t 91.198.174.192:80 -r 10.20.0.114 -w 3\n', '-a -t 91.198.174.192:80 -r 10.20.0.113 -w 3\n', '-a -t 91.198.174.192:80 -r 10.20.0.176 -w 10\n', '-a -t 91.198.174.192:80 -r 10.20.0.106 -w 3\n', '-a -t 91.198.174.192:80 -r 10.20.0.165 -w 10\n']
2015-09-24 12:11:47.008885 ipvsadm stderr output:
2015-09-24 12:11:47.008902 Memory allocation problem
2015-09-24 12:11:47.008907 Memory allocation problem
2015-09-24 12:11:47.008912 Memory allocation problem
2015-09-24 12:11:47.008916 Memory allocation problem
2015-09-24 12:11:47.008920 Memory allocation problem
2015-09-24 12:11:47.008924 Memory allocation problem
2015-09-24 12:11:47.008928 Memory allocation problem
2015-09-24 12:11:47.008932 Memory allocation problem
2015-09-24 12:11:47.008936 Memory allocation problem
2015-09-24 12:11:47.008940 Memory allocation problem
2015-09-24 12:11:47.008944 Memory allocation problem
2015-09-24 12:11:47.008948 Memory allocation problem
2015-09-24 12:11:47.008952 Memory allocation problem
2015-09-24 12:11:47.008956 Memory allocation problem
2015-09-24 12:11:47.008960 Memory allocation problem
2015-09-24 12:11:47.008969
2015-09-24 12:11:47.009133 ipvsadm exited with status 255 when executing cmdlist ['-a -t 91.198.174.192:443 -r 10.20.0.108 -w 3\n', '-a -t 91.198.174.192:443 -r 10.20.0.105 -w 3\n', '-a -t 91.198.174.192:443 -r 10.20.0.112 -w 3\n', '-a -t 91.198.174.192:443 -r 10.20.0.107 -w 3\n', '-a -t 91.198.174.192:443 -r 10.20.0.110 -w 3\n', '-a -t 91.198.174.192:443 -r 10.20.0.175 -w 10\n', '-a -t 91.198.174.192:443 -r 10.20.0.109 -w 3\n', '-a -t 91.198.174.192:443 -r 10.20.0.166 -w 10\n', '-a -t 91.198.174.192:443 -r 10.20.0.103 -w 3\n', '-a -t 91.198.174.192:443 -r 10.20.0.104 -w 3\n', '-a -t 91.198.174.192:443 -r 10.20.0.114 -w 3\n', '-a -t 91.198.174.192:443 -r 10.20.0.113 -w 3\n', '-a -t 91.198.174.192:443 -r 10.20.0.176 -w 10\n', '-a -t 91.198.174.192:443 -r 10.20.0.106 -w 3\n', '-a -t 91.198.174.192:443 -r 10.20.0.165 -w 10\n']
2015-09-24 12:11:47.009154 ipvsadm stderr output:
2015-09-24 12:11:47.009170 Memory allocation problem
2015-09-24 12:11:47.009175 Memory allocation problem
2015-09-24 12:11:47.009179 Memory allocation problem
2015-09-24 12:11:47.009183 Memory allocation problem
2015-09-24 12:11:47.009187 Memory allocation problem
2015-09-24 12:11:47.009191 Memory allocation problem
2015-09-24 12:11:47.009195 Memory allocation problem
2015-09-24 12:11:47.009199 Memory allocation problem
2015-09-24 12:11:47.009203 Memory allocation problem
2015-09-24 12:11:47.009206 Memory allocation problem
2015-09-24 12:11:47.009210 Memory allocation problem
2015-09-24 12:11:47.009214 Memory allocation problem
2015-09-24 12:11:47.009218 Memory allocation problem
2015-09-24 12:11:47.009222 Memory allocation problem
2015-09-24 12:11:47.009226 Memory allocation problem
2015-09-24 12:11:47.009235
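
The failing virtual services in this excerpt ([2620:0:862:ed1a::1:c]:80, 91.198.174.192:80 and 91.198.174.192:443) are among those shown without backends in the ipvsadm listing. A rough way to pull that summary out of pybal.log is sketched below. The function name is hypothetical, and the regular expressions assume only the log line format visible in this excerpt.

```python
import re
from collections import Counter

# Matches the pybal.log line format shown above (an assumption based on
# this excerpt only, not on pybal's source).
FAILED = re.compile(r"ipvsadm exited with status \d+ when executing cmdlist (.*)")
# Each failed command in the cmdlist looks like "-a -t <service> -r <server> ...".
TARGET = re.compile(r"-a -t (\S+) -r")


def failed_adds_per_service(log_text):
    """Count failed 'add real server' commands per virtual service."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED.search(line)
        if match:
            counts.update(TARGET.findall(match.group(1)))
    return counts
```

Comparing the keys of the returned counter against the services that have no backends in "ipvsadm -Ln" gives a quick check on whether every empty service corresponds to a failed batch of add commands.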

Event Timeline

BBlack set the priority of this task to Needs Triage.
BBlack updated the task description.
BBlack added projects: Traffic, Pybal.
BBlack added a subscriber: BBlack.
Restricted Application added a subscriber: Aklapper.
BBlack set Security to None.

I have seen "Memory allocation problem" when referencing a pool that doesn't exist. For example, on lvs1003 this works: ipvsadm -Ln -t 10.2.2.30:9200

But this doesn't (bad port, and as such a nonexistent pool):

root@lvs1003:~# ipvsadm -Ln -t 10.2.2.30:920
Memory allocation problem

Just a note that this definitely happens when referencing a virtual service that isn't yet set up.
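
If that is what happened at boot (real servers being added before their virtual service existed), a guard along these lines could check for the service first. This is only a sketch under that assumption, not pybal's actual code. The function name and the injectable "run" parameter are hypothetical, and treating either a non-zero exit status or the "Memory allocation problem" text as "service missing" is a guess based on the observation above.

```python
import subprocess

# Message ipvsadm printed for a nonexistent service in the example above.
MISSING_SERVICE_HINT = "Memory allocation problem"


def virtual_service_exists(service, run=subprocess.run):
    """Best-effort check that a TCP virtual service is configured in IPVS.

    `service` is an "address:port" string as passed to `ipvsadm -t`.
    `run` is injectable so the check can be exercised without ipvsadm.
    """
    result = run(
        ["ipvsadm", "-Ln", "-t", service],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Assumption: a missing service shows up as a failure exit status
    # and/or the misleading "Memory allocation problem" message.
    if result.returncode != 0 or MISSING_SERVICE_HINT in result.stderr:
        return False
    return True
```

A caller could run this for each service before issuing the "-a" commands, and create the service or retry later when it returns False.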

Dzahn triaged this task as High priority. Oct 19 2015, 11:37 PM
Dzahn added a subscriber: Dzahn.
fgiunchedi lowered the priority of this task from High to Medium. Apr 14 2020, 3:40 PM
fgiunchedi added a subscriber: fgiunchedi.

We've been routinely rebooting LVS hosts and IIRC this issue hasn't come up again (?). Lowering priority.