
wdqs1003 is anomalous
Closed, ResolvedPublic

Description

If we look at the graph of the high lag for last 30 days:

https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m&panelId=8&fullscreen&orgId=1&from=now-30d&to=now

it is clear that wdqs1003 has the highest incidence of lag problems. This is odd, since the three servers (1003, 1004 and 1005) should be sharing query load and updates equally, so if there were a problem with either, it should be evenly distributed.

However, it is clear that 1003 has problems more often, and they are usually more severe (higher lag, longer duration). It could be that:

  • 1003 has some hardware/software issue that makes it slower
  • our load balancing is not balanced between hosts
  • some other reason?
  • just a coincidence?

I think we need to investigate this and see if there's some issue we could fix.

Event Timeline

Restricted Application added a project: Wikidata. Jul 27 2018, 8:45 PM
Restricted Application added a subscriber: Aklapper.

wdqs1003 is a bit older, purchased on 2016-12-02 vs 2017-06-29 for wdqs100[45].

Looking at the CPU info, wdqs1003 does have the same CPU model (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz) as wdqs100[45], but its CPU max MHz is 2100.0000, vs 3000.0000 for wdqs100[45].
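That kind of per-host check can be scripted; the extraction below is a sketch, with a sample line standing in for live `lscpu` output (in practice you would run `lscpu` on each host, e.g. over ssh):

```shell
# Sketch: pull the advertised max frequency out of lscpu output.
# The sample line mirrors what wdqs1003 reported.
sample='CPU max MHz:           2100.0000'
max_mhz=$(printf '%s\n' "$sample" | awk -F: '/CPU max MHz/ {gsub(/ /,"",$2); print $2}')
echo "$max_mhz"
```

Running the same extraction on all three hosts makes a frequency mismatch like this one stand out immediately.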

Let's try to send a bit less traffic to wdqs1003.

Mentioned in SAL (#wikimedia-operations) [2018-07-31T12:29:06Z] <gehel> rebalance LVS weights to send less traffic to wdqs1003 - T200563

Another anomaly on 1003 - it gets the bulk of the timeout errors from Wikidata (T202764): https://logstash.wikimedia.org/goto/b9eada384c246bf953a3abf8fc362f90

So, weird thing: now that we switched data centers, wdqs2003 is showing the same anomaly. Could it be that our load balancing is not balancing the load evenly for these hosts?

Gehel added a subscriber: Volans. Sep 27 2018, 10:03 AM

All credit for the findings below goes to @Volans:

  • we see some dropped packets on the NICs, on both wdqs[12]003 and other servers, but the counts are higher on wdqs[12]003
  • NIC interrupts are processed only by CPU0 (see /proc/interrupts); we could spread them across CPUs and maybe increase throughput
  • dmesg shows messages like "NMI handler (ghes_notify_nmi) took too long to run: 1.259 msecs", probably related to contention on processing the NIC queues

The above would explain the correlation between high load and the error rate of T202764. Since wdqs[12]003 have more, but slower, cores and the interrupts are handled by a single core, that would explain the lower throughput on those servers.

One hard problem to solve is going to be the CPU contention between blazegraph and other processes. At some point, if blazegraph generates too much CPU load, whatever the tuning, we are going to see contention in other areas. We might be able to limit the number of parallel requests.

Potential actions:

  • reduce load on the networking stack:
    • stop logging nginx to logstash
    • tune logging for wdqs-blazegraph and wdqs-updater
  • tune networking stack to spread interrupts across all CPUs (having a look at the config of LVS servers might help)
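Spreading the interrupts boils down to writing CPU masks into /proc/irq/<n>/smp_affinity. A sketch of the mask arithmetic, where the CPU count and the eth0 interface name are assumptions, not values from this task:

```shell
# Sketch: build an affinity mask covering CPUs 0..N-1 and show how it
# would be applied to the NIC's IRQs. ncpus=4 and eth0 are assumptions.
ncpus=4
mask=$(printf '%x' $(( (1 << ncpus) - 1 )))
echo "$mask"
# As root, the mask would then be written per NIC IRQ, e.g.:
#   for irq in $(awk -F: '/eth0/ {print $1}' /proc/interrupts); do
#     echo "$mask" > /proc/irq/$irq/smp_affinity
#   done
```

With ncpus=4 the mask is f, i.e. CPUs 0-3; irqbalance or RPS/RFS are alternative ways to get the same spreading without hand-written masks.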

Change 463248 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: don't send nginx logs to logstash

https://gerrit.wikimedia.org/r/463248

Change 463254 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: cleanup logback configuration

https://gerrit.wikimedia.org/r/463254

Great work!
I am not sure though why logging would be that much of an issue, shouldn't the log code take care of batching it, etc.? As for not logging nginx - do we have these logs somewhere else? If yes, then I guess we can stop that. We could probably tune Kafka logging too - right now it's kinda verbose and we most likely don't want that in the logstash.

> Great work!

Thanks (I'll forward to @Volans)

> I am not sure though why logging would be that much of an issue, shouldn't the log code take care of batching it, etc.? As for not logging nginx - do we have these logs somewhere else? If yes, then I guess we can stop that. We could probably tune Kafka logging too - right now it's kinda verbose and we most likely don't want that in the logstash.

nginx sends logs over syslog, so one UDP packet per log message. Honestly, I'm not sure it will make much difference: the dropped packets are on RX, while nginx/syslog traffic should be TX only. Still, they compete for the same CPU, so it could help.
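For reference, that per-request syslog traffic comes from a single access_log directive; a hypothetical before/after (the server address is illustrative, not the actual puppet config):

```
# Before: one UDP datagram per request to the syslog endpoint
access_log syslog:server=localhost:10514 combined;

# After the patch: log locally only, nothing shipped to logstash
access_log /var/log/nginx/access.log combined;
```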

In the end, the problem is CPU contention, so anything that reduces it is worth taking. Nginx logs are the low-hanging fruit. Spreading the NIC-related workload across more CPUs should be doable. And making sure that Blazegraph's CPU consumption is bounded is what we should really address (but that is probably quite a bit more complex).

As far as I understand, access logs should be available via varnish kafka, in the analytics cluster / hive.

Change 463248 merged by Gehel:
[operations/puppet@production] wdqs: don't send nginx logs to logstash

https://gerrit.wikimedia.org/r/463248

Actionable tasks have been created, the investigation itself is done.

Getting also this now:

Oct 23 00:41:49 wdqs1003 wdqs-updater[10071]: 00:41:49.901 [main] ERROR o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-1, groupId=wdqs1003] Offset commit failed on partition eqiad.mediawiki.page-undelete-0 at offset 136517: The request timed out.
Oct 23 00:41:55 wdqs1003 wdqs-updater[10071]: 00:41:55.009 [main] ERROR o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-1, groupId=wdqs1003] Offset commit failed on partition eqiad.mediawiki.page-undelete-0 at offset 136517: The request timed out.

Looks like we have some serious network problems or something else is seriously messed up there, since Kafka is now failing too.

Smalyshev triaged this task as High priority. Oct 23 2018, 12:45 AM
Smalyshev moved this task from Done to Backlog on the Discovery-Wikidata-Query-Service-Sprint board.
Smalyshev added a comment (edited). Oct 23 2018, 12:57 AM

Also I notice the timestamp on 1003 is not advancing:

Oct 23 00:54:40 wdqs1003 wdqs-updater[10071]: 00:54:40.692 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-10-23T00:43:08Z at (4.9, 8.0, 5.1) updates per second and (1.8, 2256.6, 3258.3) milliseconds per second
Oct 23 00:55:45 wdqs1003 wdqs-updater[10071]: 00:55:45.315 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-10-23T00:43:08Z at (6.0, 8.4, 5.5) updates per second and (0.6, 1817.0, 3031.3) milliseconds per second
Oct 23 00:56:21 wdqs1003 wdqs-updater[10071]: 00:56:21.063 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-10-23T00:43:08Z at (9.5, 9.4, 6.0) updates per second and (0.3, 1616.9, 2915.6) milliseconds per second
Oct 23 00:57:02 wdqs1003 wdqs-updater[10071]: 00:57:02.508 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-10-23T00:43:08Z at (8.1, 9.3, 6.1) updates per second and (0.2, 1415.1, 2788.9) milliseconds per second

This does not happen on other hosts. It looks like something may be wrong in the communication with Kafka.
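A quick way to confirm the stuck timestamp is to count the distinct "Polled up to" values in the recent updater log; a sketch over sample lines like the ones quoted above:

```shell
# Sketch: count distinct "Polled up to <ts>" values in recent updater
# log output; a count of 1 means the updater is not advancing.
log='Polled up to 2018-10-23T00:43:08Z at (4.9, 8.0, 5.1)
Polled up to 2018-10-23T00:43:08Z at (6.0, 8.4, 5.5)
Polled up to 2018-10-23T00:43:08Z at (9.5, 9.4, 6.0)'
printf '%s\n' "$log" | grep -o 'Polled up to [^ ]*' | sort -u | wc -l
```

On a healthy host the count grows over time as the polled timestamp advances; here it stays at 1.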

Smalyshev closed this task as Resolved. Nov 13 2018, 6:23 PM
Smalyshev claimed this task.

I think swapping helped, so closing for now.