Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping
Closed, Resolved · Public

Description

Three pages happened in the last 2 days (alert text: Socket timeout after 10 seconds):

  • 2020-07-21 15:04:09 UTC (approx.)
  • [2020-07-22 10:39:55] SERVICE ALERT: api.svc.codfw.wmnet;LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds
  • [2020-07-22 16:28:38] SERVICE ALERT: api.svc.codfw.wmnet;LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds

The times are approximate (they are when the alerts triggered): the queries fail at least twice before paging. The check has also gone into SOFT state multiple times over the last 2 days, failing only once or twice each time.

There is no straightforward reason why this is happening.
Interestingly, they seem to fail for icinga1001 and icinga2001 at different times (but are detected from both hosts).

Event Timeline

jcrespo created this task. Jul 22 2020, 4:37 PM
Restricted Application added a project: Operations. Jul 22 2020, 4:37 PM
Restricted Application added a subscriber: Aklapper.
jcrespo updated the task description. Jul 22 2020, 4:46 PM

First occurrence was June 17th, 15:10 UTC:

Jun 17 15:10:38 icinga1001 icinga: SERVICE ALERT: api.svc.codfw.wmnet;LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds

See also the per-day alert counts:

✔️ root@centrallog1001.eqiad.wmnet /srv/syslog/icinga1001 🕐⁉️ for F in * ; do echo -ne "$F\t" ; zfgrep 'SERVICE ALERT: api.svc.codfw.wmnet' $F | wc -l ; done
syslog.log 37
syslog.log-20200605.gz 0
syslog.log-20200606.gz 0
syslog.log-20200607.gz 0
syslog.log-20200608.gz 0
syslog.log-20200609.gz 0
syslog.log-20200610.gz 0
syslog.log-20200611.gz 0
syslog.log-20200612.gz 0
syslog.log-20200613.gz 0
syslog.log-20200614.gz 0
syslog.log-20200615.gz 0
syslog.log-20200616.gz 0
syslog.log-20200617.gz 0
syslog.log-20200618.gz 19
syslog.log-20200619.gz 39
syslog.log-20200620.gz 44
syslog.log-20200621.gz 40
syslog.log-20200622.gz 26
syslog.log-20200623.gz 30
syslog.log-20200624.gz 41
syslog.log-20200625.gz 50
syslog.log-20200626.gz 49
syslog.log-20200627.gz 57
syslog.log-20200628.gz 59

Dzahn claimed this task. Jul 22 2020, 5:29 PM

mw2335-mw2339 are configured as API appservers in confctl, but they are regular appservers in site.pp.

This means they are getting the appservers.svc.codfw.wmnet LVS IP on lo but they should have the api.svc.codfw.wmnet LVS IP.

That was my mistake when getting them into production apparently. Will fix it.
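
A mismatch like this could be caught by cross-checking the two sources. The sketch below is only an illustration: it assumes both sources have already been exported to plain text files with one `hostname cluster` pair per line. That file format, the cluster labels and the function name `find_pool_mismatches` are all hypothetical and are not what confctl or puppet produce directly.

```shell
# Print hosts whose cluster differs between two "hostname cluster" listings.
#   $1: listing derived from conftool-data (hypothetical export format)
#   $2: listing derived from the roles in site.pp (hypothetical export format)
find_pool_mismatches() {
    awk '
        # First file: remember the cluster conftool assigns to each host.
        NR == FNR { conftool[$1] = $2; next }
        # Second file: report hosts known to both whose clusters disagree.
        ($1 in conftool) && conftool[$1] != $2 {
            print $1 ": conftool=" conftool[$1] " puppet=" $2
        }
    ' "$1" "$2"
}
```

Hosts that appear in only one of the two listings are ignored here; reporting them would need a second pass.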

Change 615537 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] conftool-data: move mw2335-mw2339 to regular appservers

https://gerrit.wikimedia.org/r/615537

Change 615537 merged by Dzahn:
[operations/puppet@production] conftool-data: move mw2335-mw2339 to regular appservers

https://gerrit.wikimedia.org/r/615537

Dzahn added a comment. Jul 22 2020, 7:51 PM

So this happened whenever the check ended up talking to one of the servers in the mw2335-mw2339 range.

It has not happened again since the fix, which moved them to the correct section in conftool.

https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=api.svc.codfw.wmnet&service=LVS+api+codfw+port+80%2Ftcp+-+MediaWiki+API+cluster-+api.svc.eqiad.wmnet+IPv4+%23page

I checked other newish codfw appservers but they were ok.

Calling it tentatively resolved.

Mentioned in SAL (#wikimedia-operations) [2020-07-22T22:07:23Z] <cdanis> remove downtime on api.svc.codfw.wmnet T258614

Thanks for your work on this. One clarification, for those of us who are not that familiar with LVS/appservers: based on T258614#6327121, I understand that the issue was that the load balancing was not configured correctly for some codfw servers. But API appservers and user traffic servers are equivalent in terms of the traffic they can serve, right (in theory, appservers can serve API traffic and vice versa)?

No, because they are configured to only have the LVS IP address corresponding to the type of service they are destined for.

e.g. for an appserver

akosiaris@mw2235:~$ sudo ip -c addr ls |grep 10.2
    inet 10.2.1.1/32 scope global lo:LVS

vs an API server.

akosiaris@mw2299:~$ sudo ip -c addr ls |grep 10.2
    inet 10.2.1.22/32 scope global lo:LVS

But I think you have made a point with that question: there is the impression that those groups of hosts are functionally equivalent, while currently they aren't.
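
The manual check shown in the two outputs above can be scripted. This is a rough sketch that only knows the two codfw addresses quoted in this task and treats anything else as unknown. The function names and the short labels `appserver` and `api` are made up for illustration.

```shell
# Map an LVS service IP to the kind of server it belongs to.
# Only the two addresses quoted in this task are covered.
lvs_service_for_ip() {
    case "$1" in
        10.2.1.1)  echo "appserver" ;;
        10.2.1.22) echo "api" ;;
        *)         echo "unknown" ;;
    esac
}

# Read `ip addr` output on stdin, take the first address labelled lo:LVS,
# strip the prefix length and print the kind of server it implies.
lvs_service_from_ip_addr() {
    lvs_ip=$(awk '
        $1 == "inet" && $NF == "lo:LVS" && !found {
            sub(/\/.*/, "", $2); print $2; found = 1
        }')
    lvs_service_for_ip "$lvs_ip"
}

# Possible usage on a host (without -c, so no colour codes reach awk):
#   ip addr show dev lo | lvs_service_from_ip_addr
```

Comparing this result with the pool a host is listed under in conftool would flag hosts like mw2335-mw2339 before they start receiving traffic they cannot answer.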

I see, thanks.