Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping
Closed, ResolvedPublic

Assigned To

Authored By

	• jcrespo
	Jul 22 2020, 4:37 PM

Description

Three pages have fired in the last 2 days (alert text: Socket timeout after 10 seconds):

2020-07-21 15:04:09 UTC (approx.)
[2020-07-22 10:39:55] SERVICE ALERT: api.svc.codfw.wmnet;LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds
[2020-07-22 16:28:38] SERVICE ALERT: api.svc.codfw.wmnet;LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page;CRITICAL;HARD;3;CRITICAL - Socket timeout after 10 seconds

The times are approximate (they are the times at which the alerts triggered): the check fails at least twice before paging, and it has also failed multiple times in SOFT state (once or twice in a row) over the last 2 days.

There is no straightforward reason why this is happening.
Interestingly, they seem to fail for icinga1001 and icinga2001 at different times (but are detected from both hosts).

Details

	Subject	Repo	Branch	Lines +/-
	conftool-data: move mw2335-mw2339 to regular appservers	operations/puppet	production	+5 -5


Related Objects

Status	Assigned	Task
		Unknown Object (Task)
Resolved	Papaul	T241852 (Need by: TBD) rack/setup/install 86 new codfw mw systems
Resolved	Dzahn	T247021 move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet)
Resolved	Dzahn	T258614 Alert "LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4" is flapping
Declined	CDanis	T258648 monitoring for mismatched LVS realserver addresses/configurations

Event Timeline

• jcrespo created this task. Jul 22 2020, 4:37 PM

Restricted Application added a project: SRE. Jul 22 2020, 4:37 PM

Restricted Application added a subscriber: Aklapper.

https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=api.svc.codfw.wmnet&service=LVS+api+codfw+port+80%2Ftcp+-+MediaWiki+API+cluster-+api.svc.eqiad.wmnet+IPv4+%23page

• jcrespo updated the task description. Jul 22 2020, 4:46 PM

First occurrence was June 17th, 15:10 UTC:

Jun 17 15:10:38 icinga1001 icinga: SERVICE ALERT: api.svc.codfw.wmnet;LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page;CRITICAL;SOFT;1;CRITICAL - Socket timeout after 10 seconds

see also

P12020 Masterwork From Distant Lands

✔️ root@centrallog1001.eqiad.wmnet /srv/syslog/icinga1001 🕐⁉️ for F in * ; do echo -ne "$F\t" ; zfgrep 'SERVICE ALERT: api.svc.codfw.wmnet' "$F" | wc -l ; done
syslog.log 37
syslog.log-20200605.gz 0
syslog.log-20200606.gz 0
syslog.log-20200607.gz 0
syslog.log-20200608.gz 0
syslog.log-20200609.gz 0
syslog.log-20200610.gz 0
syslog.log-20200611.gz 0
syslog.log-20200612.gz 0
syslog.log-20200613.gz 0
syslog.log-20200614.gz 0
syslog.log-20200615.gz 0
syslog.log-20200616.gz 0
syslog.log-20200617.gz 0
syslog.log-20200618.gz 19
syslog.log-20200619.gz 39
syslog.log-20200620.gz 44
syslog.log-20200621.gz 40
syslog.log-20200622.gz 26
syslog.log-20200623.gz 30
syslog.log-20200624.gz 41
syslog.log-20200625.gz 50
syslog.log-20200626.gz 49
syslog.log-20200627.gz 57
syslog.log-20200628.gz 59
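
The per-file counts above lump SOFT and HARD alerts together. As a rough sketch (not something that was run for this task), the tally could be split by state type, assuming the semicolon-separated layout of the SERVICE ALERT lines quoted in this task. `count_alert_states` is a hypothetical helper name.

```shell
# Hypothetical helper: tally CRITICAL service alerts by state type.
# Assumes fields split on ';' as in the quoted log lines:
#   host part ; service description ; state ; SOFT|HARD ; attempt ; plugin output
count_alert_states() {
  awk -F';' '
    /SERVICE ALERT: / && $3 == "CRITICAL" { n[$4]++ }
    END { printf "SOFT=%d HARD=%d\n", n["SOFT"], n["HARD"] }
  '
}

# Possible usage on the log host (path taken from the paste above):
#   zcat -f /srv/syslog/icinga1001/syslog.log* \
#     | grep -F 'SERVICE ALERT: api.svc.codfw.wmnet' | count_alert_states
```

Only HARD entries correspond to pages; SOFT entries are failed attempts that recovered before the retry limit was reached.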

mw2335 - mw2339 are configured as API appservers in confctl, but they are regular appservers in site.pp.

This means they are getting the appservers.svc.codfw.wmnet LVS IP on lo but they should have the api.svc.codfw.wmnet LVS IP.

That was my mistake when getting them into production apparently. Will fix it.
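
A minimal sketch of how such a mismatch could be spotted, assuming two plain-text lists of `host cluster` pairs have already been exported, one from conftool and one from the Puppet role assignments. The file format, the cluster names and the helper name `find_cluster_mismatches` are all hypothetical and not an existing tool.

```shell
# Hypothetical helper: report hosts whose pooled cluster differs from
# the cluster implied by their Puppet role.
#   $1: file with "host pooled_cluster" lines (e.g. exported from conftool)
#   $2: file with "host puppet_cluster" lines (e.g. derived from site.pp)
find_cluster_mismatches() {
  awk '
    NR == FNR { pooled[$1] = $2; next }          # first file: remember pooled cluster
    ($1 in pooled) && pooled[$1] != $2 {         # second file: compare per host
      print $1, "pooled=" pooled[$1], "role=" $2
    }
  ' "$1" "$2"
}
```

With a host listed as `api_appserver` in the first file and `appserver` in the second, the function prints one line for that host and nothing for hosts that agree.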

Change 615537 had a related patch set uploaded (by Dzahn; owner: Dzahn):
[operations/puppet@production] conftool-data: move mw2335-mw2339 to regular appservers

https://gerrit.wikimedia.org/r/615537

gerritbot added a project: Patch-For-Review. Jul 22 2020, 5:54 PM

Change 615537 merged by Dzahn:
[operations/puppet@production] conftool-data: move mw2335-mw2339 to regular appservers

https://gerrit.wikimedia.org/r/615537

Maintenance_bot removed a project: Patch-For-Review. Jul 22 2020, 7:10 PM

So this happened whenever the check ended up talking to one of the servers in that 2335 - 2339 range.

It has not happened again so far since the fix, which moved them to the correct section in conftool.

https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=api.svc.codfw.wmnet&service=LVS+api+codfw+port+80%2Ftcp+-+MediaWiki+API+cluster-+api.svc.eqiad.wmnet+IPv4+%23page

I checked other newish codfw appservers but they were ok.

Calling it tentatively resolved.

Dzahn closed this task as Resolved. Jul 22 2020, 7:52 PM

Dzahn added a parent task: T247021: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet).

Mentioned in SAL (#wikimedia-operations) [2020-07-22T22:07:23Z] <cdanis> remove downtime on api.svc.codfw.wmnet T258614

Thanks for your work on this. One clarification, for those of us who are not that familiar with LVS/appservers. Based on T258614#6327121 I understand that the issue was that the load balancing was not well configured for some codfw servers. But API appservers and user traffic servers are equivalent in terms of the traffic they can theoretically return, right (in theory app servers can return API traffic and vice versa)?

In T258614#6328624, @jcrespo wrote:

Thanks for your work on this. One clarification, for those of us who are not that familiar with LVS/appservers. Based on T258614#6327121 I understand that the issue was that the load balancing was not well configured for some codfw servers. But API appservers and user traffic servers are equivalent in terms of the traffic they can theoretically return, right (in theory app servers can return API traffic and vice versa)?

No, because they are configured to have only the LVS IP address corresponding to the type of service they are destined for.

e.g. for an appserver

akosiaris@mw2235:~$ sudo ip -c addr ls |grep 10.2
    inet 10.2.1.1/32 scope global lo:LVS

vs an API server.

akosiaris@mw2299:~$ sudo ip -c addr ls |grep 10.2
    inet 10.2.1.22/32 scope global lo:LVS

But I think you have made a point with that question: there is the impression that those groups of hosts are functionally equivalent, while they currently aren't.
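
The manual check above could be scripted roughly as follows. This is only a sketch, assuming the `lo:LVS` label and output layout shown in the two examples; `lvs_ips` and `has_lvs_ip` are hypothetical helper names.

```shell
# Hypothetical helper: print every IPv4 address labelled lo:LVS,
# without the prefix length. Reads `ip addr` output on stdin.
lvs_ips() {
  awk '$1 == "inet" && $NF == "lo:LVS" { split($2, a, "/"); print a[1] }'
}

# Hypothetical helper: succeed only if the expected service IP ($1)
# is among the lo:LVS addresses read from stdin.
has_lvs_ip() {
  lvs_ips | grep -qxF -- "$1"
}

# Possible usage on a realserver (10.2.1.22 is the API address shown above):
#   ip -4 addr show dev lo | has_lvs_ip 10.2.1.22 || echo "API LVS IP missing on lo"
```

Plain `ip` output is used here rather than `ip -c`, so that colour escape sequences do not end up in the parsed fields.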

I see, thanks.

CDanis closed subtask T258648: monitoring for mismatched LVS realserver addresses/configurations as Declined. Jul 29 2020, 3:56 PM
