
wdqs1003 is anomalous
Closed, ResolvedPublic

Description

If we look at the graph of the high lag for last 30 days:

https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m&panelId=8&fullscreen&orgId=1&from=now-30d&to=now

it is clear that wdqs1003 has the highest incidence of lag problems. This is odd, since the three servers (1003, 1004 and 1005) should be sharing query load and updates equally, so if there were a problem with either, it should be evenly distributed.

However, it is clear that 1003 has problems more often, and they are usually more severe (higher lag, longer duration). It could be that:

  • 1003 has some hardware/software issue that makes it slower
  • our load balancing is not balanced between hosts
  • some other reason?
  • just a coincidence?

I think we need to investigate this and see if there's some issue we could fix.

Event Timeline

Restricted Application added a project: Wikidata. Jul 27 2018, 8:45 PM
Restricted Application added a subscriber: Aklapper.

wdqs1003 is a bit older, purchased on 2016-12-02 vs 2017-06-29 for wdqs100[45].

Looking at the CPU info, wdqs1003 does have the same CPU model (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz) as wdqs100[45], but its CPU max MHz is 2100.0000, vs 3000.0000 for wdqs100[45].
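That kind of per-host check can be scripted; the extraction below is a sketch, with a sample line standing in for live `lscpu` output (in practice you would run `lscpu` on each host, e.g. over ssh):

```shell
# Sketch: pull the advertised max frequency out of lscpu output.
# The sample line mirrors what wdqs1003 reported.
sample='CPU max MHz:           2100.0000'
max_mhz=$(printf '%s\n' "$sample" | awk -F: '/CPU max MHz/ {gsub(/ /,"",$2); print $2}')
echo "$max_mhz"
```

Running the same extraction on all three hosts makes a frequency mismatch like this one stand out immediately.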

Let's try to send a bit less traffic to wdqs1003.

Mentioned in SAL (#wikimedia-operations) [2018-07-31T12:29:06Z] <gehel> rebalance LVS weights to send less traffic to wdqs1003 - T200563

Another anomaly on 1003 - it gets the bulk of the timeout errors from Wikidata (T202764): https://logstash.wikimedia.org/goto/b9eada384c246bf953a3abf8fc362f90

So, weird thing: now that we switched data centers, wdqs2003 is showing the same anomaly. Could it be that our load balancing is not balancing the load evenly for these hosts?

Gehel added a subscriber: Volans. Sep 27 2018, 10:03 AM

All credit for the findings below goes to @Volans:

  • we see some dropped packets on the NICs, on both wdqs[12]003 and other servers, but the counts are higher on wdqs[12]003
  • NIC interrupts are processed only by CPU0 (see /proc/interrupts); we could spread them across CPUs and maybe increase throughput
  • dmesg shows messages like "NMI handler (ghes_notify_nmi) took too long to run: 1.259 msecs", probably related to contention on processing the NIC queues

The above would explain the correlation between high load and the error rate of T202764. Since wdqs[12]003 have more, but slower, cores and the interrupts are handled by a single core, that would explain the lower throughput on those servers.

One hard problem to solve is going to be the CPU contention between blazegraph and other processes. At some point, if blazegraph generates too much CPU load, whatever the tuning, we are going to see contention in other areas. We might be able to limit the number of parallel requests.

Potential actions:

  • reduce load on the networking stack:
    • stop logging nginx to logstash
    • tune logging for wdqs-blazegraph and wdqs-updater
  • tune networking stack to spread interrupts across all CPUs (having a look at the config of LVS servers might help)
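Spreading the interrupts boils down to writing CPU masks into /proc/irq/<n>/smp_affinity. A sketch of the mask arithmetic, where the CPU count and the eth0 interface name are assumptions, not values from this task:

```shell
# Sketch: build an affinity mask covering CPUs 0..N-1 and show how it
# would be applied to the NIC's IRQs. ncpus=4 and eth0 are assumptions.
ncpus=4
mask=$(printf '%x' $(( (1 << ncpus) - 1 )))
echo "$mask"
# As root, the mask would then be written per NIC IRQ, e.g.:
#   for irq in $(awk -F: '/eth0/ {print $1}' /proc/interrupts); do
#     echo "$mask" > /proc/irq/$irq/smp_affinity
#   done
```

With ncpus=4 the mask is f, i.e. CPUs 0-3; irqbalance or RPS/RFS are alternative ways to get the same spreading without hand-written masks.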

Change 463248 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: don't send nginx logs to logstash

https://gerrit.wikimedia.org/r/463248

Change 463254 had a related patch set uploaded (by Gehel; owner: Gehel):
[operations/puppet@production] wdqs: cleanup logback configuration

https://gerrit.wikimedia.org/r/463254

Great work!
I am not sure though why logging would be that much of an issue, shouldn't the log code take care of batching it, etc.? As for not logging nginx - do we have these logs somewhere else? If yes, then I guess we can stop that. We could probably tune Kafka logging too - right now it's kinda verbose and we most likely don't want that in the logstash.

> Great work!

Thanks (I'll forward to @Volans)

> I am not sure though why logging would be that much of an issue, shouldn't the log code take care of batching it, etc.? As for not logging nginx - do we have these logs somewhere else? If yes, then I guess we can stop that. We could probably tune Kafka logging too - right now it's kinda verbose and we most likely don't want that in the logstash.

nginx sends logs over syslog, so one UDP packet per log message. Honestly, I'm not sure it will make much difference: the dropped packets are on RX, while nginx/syslog traffic should be TX only. Still, they compete for the same CPU, so it could help.
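For reference, that per-request syslog traffic comes from a single access_log directive; a hypothetical before/after (the server address is illustrative, not the actual puppet config):

```
# Before: one UDP datagram per request to the syslog endpoint
access_log syslog:server=localhost:10514 combined;

# After the patch: log locally only, nothing shipped to logstash
access_log /var/log/nginx/access.log combined;
```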

In the end, the problem is CPU contention, so anything that reduces it is worth taking. Nginx logs are the low-hanging fruit. Spreading the NIC-related workload across more CPUs should be doable. And making sure that Blazegraph's CPU consumption is bounded is what we should really address (but that is probably quite a bit more complex).

As far as I understand, access logs should be available via varnish kafka, in the analytics cluster / hive.

Change 463248 merged by Gehel:
[operations/puppet@production] wdqs: don't send nginx logs to logstash

https://gerrit.wikimedia.org/r/463248

Actionable tasks have been created, the investigation itself is done.

Getting also this now:

Oct 23 00:41:49 wdqs1003 wdqs-updater[10071]: 00:41:49.901 [main] ERROR o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-1, groupId=wdqs1003] Offset commit failed on partition eqiad.mediawiki.page-undelete-0 at offset 136517: The request timed out.
Oct 23 00:41:55 wdqs1003 wdqs-updater[10071]: 00:41:55.009 [main] ERROR o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-1, groupId=wdqs1003] Offset commit failed on partition eqiad.mediawiki.page-undelete-0 at offset 136517: The request timed out.

Looks like we have some serious network problems or something else is seriously messed up there, since Kafka is now failing too.

Smalyshev triaged this task as High priority. Oct 23 2018, 12:45 AM
Smalyshev moved this task from Done to Backlog on the Discovery-Wikidata-Query-Service-Sprint board.
Smalyshev added a comment (edited). Oct 23 2018, 12:57 AM

Also I notice the timestamp on 1003 is not advancing:

Oct 23 00:54:40 wdqs1003 wdqs-updater[10071]: 00:54:40.692 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-10-23T00:43:08Z at (4.9, 8.0, 5.1) updates per second and (1.8, 2256.6, 3258.3) milliseconds per second
Oct 23 00:55:45 wdqs1003 wdqs-updater[10071]: 00:55:45.315 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-10-23T00:43:08Z at (6.0, 8.4, 5.5) updates per second and (0.6, 1817.0, 3031.3) milliseconds per second
Oct 23 00:56:21 wdqs1003 wdqs-updater[10071]: 00:56:21.063 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-10-23T00:43:08Z at (9.5, 9.4, 6.0) updates per second and (0.3, 1616.9, 2915.6) milliseconds per second
Oct 23 00:57:02 wdqs1003 wdqs-updater[10071]: 00:57:02.508 [main] INFO  org.wikidata.query.rdf.tool.Updater - Polled up to 2018-10-23T00:43:08Z at (8.1, 9.3, 6.1) updates per second and (0.2, 1415.1, 2788.9) milliseconds per second

This does not happen on other hosts. It looks like something may be wrong in the communication with Kafka.
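A quick way to confirm the stuck timestamp is to count the distinct "Polled up to" values in the recent updater log; a sketch over sample lines like the ones quoted above:

```shell
# Sketch: count distinct "Polled up to <ts>" values in recent updater
# log output; a count of 1 means the updater is not advancing.
log='Polled up to 2018-10-23T00:43:08Z at (4.9, 8.0, 5.1)
Polled up to 2018-10-23T00:43:08Z at (6.0, 8.4, 5.5)
Polled up to 2018-10-23T00:43:08Z at (9.5, 9.4, 6.0)'
printf '%s\n' "$log" | grep -o 'Polled up to [^ ]*' | sort -u | wc -l
```

On a healthy host the count grows over time as the polled timestamp advances; here it stays at 1.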

Smalyshev closed this task as Resolved. Nov 13 2018, 6:23 PM
Smalyshev claimed this task.

I think swapping helped, so closing for now.