Page MenuHomePhabricator

Some error spike from db2055
Closed, DeclinedPublic

Description

I've seen some spikes in the DB errors on Kibana all related to db2055:
https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError
See attached screenshot:

Screen Shot 2016-04-21 at 10.16.30.png (746×2 px, 133 KB)

I've found quite an amount of dropped packets on the network interface. For reference similar servers have ~10-100 dropped packets.

eth0      Link encap:Ethernet  HWaddr 40:a8:f0:35:4d:1c
          inet addr:10.192.48.7  Bcast:10.192.51.255  Mask:255.255.252.0
          inet6 addr: 2620:0:860:104:42a8:f0ff:fe35:4d1c/64 Scope:Global
          inet6 addr: fe80::42a8:f0ff:fe35:4d1c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:15868076040 errors:0 dropped:175349 overruns:0 frame:0
          TX packets:2688172933 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3124888269011 (2.8 TiB)  TX bytes:1069230584257 (995.7 GiB)
          Interrupt:26

From dmesg:

[Mon Apr 11 08:01:49 2016] Process accounting resumed
[Tue Apr 12 08:03:17 2016] Process accounting resumed
[Wed Apr 13 08:04:46 2016] Process accounting resumed
[Thu Apr 14 08:06:14 2016] Process accounting resumed
[Fri Apr 15 08:07:43 2016] Process accounting resumed
[Sat Apr 16 08:09:11 2016] Process accounting resumed
[Sun Apr 17 08:10:39 2016] Process accounting resumed
[Mon Apr 18 08:12:08 2016] Process accounting resumed
[Mon Apr 18 13:08:00 2016] systemd-sysv-generator[58201]: Overwriting existing symlink /run/systemd/generator.late/mysql.service with real service
[Mon Apr 18 13:08:01 2016] systemd-sysv-generator[58769]: Overwriting existing symlink /run/systemd/generator.late/mysql.service with real service
[Mon Apr 18 13:08:02 2016] systemd-sysv-generator[58853]: Overwriting existing symlink /run/systemd/generator.late/mysql.service with real service
[Mon Apr 18 13:08:02 2016] systemd-sysv-generator[58883]: Overwriting existing symlink /run/systemd/generator.late/mysql.service with real service
[Tue Apr 19 08:13:36 2016] Process accounting resumed
[Tue Apr 19 18:26:13 2016] perf interrupt took too long (76972 > 71428), lowering kernel.perf_event_max_sample_rate to 1750
[Wed Apr 20 08:15:05 2016] Process accounting resumed
[Wed Apr 20 10:32:56 2016] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.344 msecs
[Wed Apr 20 19:43:35 2016] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 1.352 msecs
[Thu Apr 21 08:16:33 2016] Process accounting resumed

I didn't see the dropped packets increase while checking the server, but we can check after the next peak (if any) to see if they increased.
Nothing on mysql error log and load of the server from grafana, although it's role is vslow hence has some long queries running from time to time.

Event Timeline

The errors should be easily explainable by cold buffers due to initial usage + vslow role. Do not get me wrong, I think it was important to report it, but unless it has a good actionable (other than warn about this happening), I would close this unless it happens again, or you specifically want to so something about it- we have other worse, more repeatable issues no one is working on. (Otherwise this will become another orphan ticket).

Reopen if it happens again/we start a more thorough investigation.