Fri, Jun 15
List of metrics at https://phabricator.wikimedia.org/P7262; I'll remove those if the list looks good.
Thu, Jun 14
Wed, Jun 13
I'm resolving this task since we're now alerting on uncorrectable memory errors found by EDAC. Uncorrectable errors result in either a kernel panic or a SIGBUS to the affected process. See T197084: Report problems found in server's IPMI SEL and, more importantly, T197086: Report problems found by mcelog for followups.
I researched the "panic on uncorrectable errors" behavior a bit, and it turns out it's not edac but the machine check framework that already takes care of panicking (or SIGBUS'ing the process) when uncorrectable errors are reported.
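For reference, the EDAC counters can also be inspected by hand via sysfs (standard edac kernel module paths, nothing host-specific):

# corrected (ce) and uncorrected (ue) error counts per memory controller
grep -H . /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count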
EDAC correctable memory errors have been reported for this host; raising priority to high since the CPU temperature alerts also persist.
Tue, Jun 12
A bigger nail in the coffin for GET requests is also going to be enabling caching in Apache; at least for listinfo the information doesn't change frequently, so we can safely cache it for 30min or so.
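A sketch of what that could look like (stock mod_cache/mod_cache_disk directives; the path and TTL are placeholders):

sudo a2enmod cache cache_disk
# then in the vhost, something along these lines:
#   CacheEnable disk /mailman/listinfo
#   CacheRoot /var/cache/apache2/mod_cache_disk
#   CacheDefaultExpire 1800   # 30min
sudo systemctl reload apache2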
@Cmjohnson ok! Thanks, I'll be removing the machine from swift tomorrow
Looks like the high load is back, with a whole lot of listinfo requests
The latest rsyslog release containing the fix is already packaged in Debian unstable; it'd be easier to backport that to stretch instead of jessie. Once we have a replacement for lithium in place (T195416) and running stretch, I'll test the backport there.
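The backport itself should be the usual recipe, roughly (assuming a deb-src entry for unstable and devscripts installed):

apt-get source rsyslog/unstable
cd rsyslog-*
dch --bpo 'Rebuild for stretch'
dpkg-buildpackage -us -uc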
Thanks @Cmjohnson! Please treat this with urgency; do you know if there's an ETA? If it's more than a couple of days I'll remove the machine from swift.
Yeah, I think it might have been the controller barfing and the disk is actually ok. I couldn't find related logs on lithium though, so it's hard to know for sure. The disk can be sent back; we'll order it again if need be.
Mon, Jun 11
Sun, Jun 10
May 16 2018
WRT the ms-fe servers (ms-fe1008 and ms-fe1007), please move them to asw2 and reallocate them to two different physical racks.
Ditto for some Thumbor headers:
May 15 2018
May 14 2018
Resolving; the swift (sys)log was fixed a while ago but this task was never resolved.
May 11 2018
For sure! It means the drive(s) are not healthy according to smartmontools. I'll add some details about this to https://wikitech.wikimedia.org/wiki/SMART but tl;dr smartctl --health /dev/bus/0 -d <DEVICE> will show why.
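e.g. for a drive behind a megaraid controller (the -d argument selects the device on the controller; the device number here is just an example):

sudo smartctl --health /dev/bus/0 -d megaraid,5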
May 10 2018
Nice work! Looking forward to see this working in beta.
For swift / ms servers the requirements are as follows:
- ms-fe* to be depooled and moved one at a time (depool sketch below).
- ms-be* to be moved one at a time, just a clean poweroff is enough, no depooling needed.
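A depool sketch (conftool syntax from memory; the host name is just an example):

sudo confctl select 'name=ms-fe1005.eqiad.wmnet' set/pooled=no
# move the host, then repool:
sudo confctl select 'name=ms-fe1005.eqiad.wmnet' set/pooled=yes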
Agreed with @ayounsi; please spread said servers across racks in row C as much as possible. I'll be on vacation starting Thurs 17th, but I can assist with the move before then.
May 9 2018
4 | May-06-2018 | 04:46:06 | Mem ECC Warning | Memory | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 40h
wtp2013:~$ sudo ipmi-sel
ID | Date        | Time     | Name            | Type                   | Event
1  | Jan-15-2015 | 23:04:45 | SEL             | Event Logging Disabled | Log Area Reset/Cleared
2  | Dec-21-2016 | 01:41:38 | Mem ECC Warning | Memory                 | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
3  | Dec-21-2016 | 01:41:39 | Mem ECC Warning | Memory                 | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
4  | Dec-14-2017 | 07:21:25 | Mem ECC Warning | Memory                 | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
5  | Dec-14-2017 | 07:21:25 | Mem ECC Warning | Memory                 | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
6  | Feb-22-2018 | 01:27:39 | Mem ECC Warning | Memory                 | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
7  | Feb-22-2018 | 03:08:43 | Mem ECC Warning | Memory                 | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
8  | Apr-24-2018 | 17:56:42 | Mem ECC Warning | Memory                 | transition to Non-Critical from OK ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
9  | Apr-24-2018 | 20:07:26 | Mem ECC Warning | Memory                 | transition to Critical from less severe ; OEM Event Data2 code = 90h ; OEM Event Data3 code = 80h
May 8 2018
The correctable errors check has been deployed and it is already yielding some results. @herron and I took a look at the list of hosts and there seem to be a few different "classes" or "states" (a quick way to bucket a host by hand is sketched after the list):
- high count of CEs and recent kernel messages
- low count of CEs and no recent kernel messages
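A quick way to bucket a host by hand (EDAC sysfs counters plus recent kernel messages):

grep -H . /sys/devices/system/edac/mc/mc*/ce_count
journalctl -k --since '7 days ago' | grep -i edac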
Rebalance has completed, resolving
See preliminary comments inline. Something else to keep in mind wrt big files: swift is limited by default to 5GB for a single object. Going over that means using either SLOs or DLOs: https://docs.openstack.org/swift/latest/overview_large_objects.html
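e.g. with the stock swift CLI the segmentation is handled client-side (container/file names are placeholders):

# uploads 1GB segments plus a DLO manifest; add --use-slo for a static large object
swift upload --segment-size 1073741824 <container> <big-file>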
@thcipriani for sure! The package is built; LMK when it's available and we'll deploy it
May 7 2018
Upstream has fixed the issue, should be included in the next rsyslog release. When that happens we'll try it out on the central syslog servers.
May 4 2018
In a Prometheus world CPU utilization is calculated from the number of seconds each CPU has spent in each mode, from the numbers in /proc/stat. e.g. https://grafana.wikimedia.org/dashboard/db/host-overview uses that for its CPU utilization panel, divided by the number of cores to normalize the graph to 100%. There's more information at https://www.robustperception.io/understanding-machine-cpu-usage/. AFAICS the graphs in labs-capacity-planning are using graphite/diamond as their source; were you looking to port the dashboard to Prometheus instead?
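For reference, the expression boils down to something like this (a sketch: the node_cpu metric name matches node_exporter of this era, and the Prometheus URL is a placeholder):

# averaging across cores is the same as dividing by the number of cores
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=100 - avg by (instance) (rate(node_cpu{mode="idle"}[5m])) * 100'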
Thanks for kickstarting this! +1, having syslogs in ELK would be very useful indeed. Some partial answers to the things to figure out:
That's the current behavior of the check, i.e. when things are ok it exits 0 with no output. We can change it to print "OK" or something similar, and perhaps the values/thresholds too.
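i.e. roughly the standard Nagios plugin convention (a sketch; $ce_count is a hypothetical value the check would have computed):

if [ "$ce_count" -eq 0 ]; then
    echo "OK: no correctable errors found (ce_count=${ce_count})"
    exit 0
else
    echo "WARNING: correctable errors found (ce_count=${ce_count})"
    exit 1
fi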
May 3 2018
Thanks for the feedback!
May 2 2018
@Gilles done! Should be good to go
Done! For reference, the commands I used (note this package has -2 as its Debian revision, thus the upstream source is already uploaded; we are only changing the built packages).
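A sketch of a binaries-only build/upload rather than the verbatim listing (the dput target and .changes file name are placeholders):

# build binary packages only (-b); the -2 revision means the source is untouched
dpkg-buildpackage -b -us -uc
dput <target> python-logstash_<version>-2_amd64.changes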
@Gilles sounds good, can you send a gerrit review against operations/debs/python-logstash instead since I've imported the package there after the first upload?
There was a spike of 500s yesterday in codfw, seemingly from search.wikimedia.org (tracked at T193600)
Apr 30 2018
I've put together a sample dashboard to play around with some concepts/ideas that emerged in this task at https://grafana.wikimedia.org/dashboard/db/dashboard-redesign-proposal . Notably missing is the navigation story among different dashboards; the tl;dr is that it would be based on dashboard tags to create dropdowns. Which groupings/dropdown menus make sense is still TBD.
Upstream has merged the changes I submitted, the Debian package has been uploaded to stretch-wikimedia and the puppetization merged. Resolving for now.
Apr 27 2018
Apr 26 2018
Apr 25 2018
While investigating cronspam from recent reimages I took a look at mw1247 (for example) and noticed it has two disks but no software raid (T106381). I think we should also fix that while we're reimaging with Stretch anyway.
I sent some changes upstream that I think would be beneficial, https://github.com/Dev25/mcrouter_exporter/pull/3
Apr 24 2018
I've gone ahead and reimaged restbase1010; all Cassandra instances are masked ATM, but the host is otherwise good to be tested again.
@Cmjohnson: confirmed, the raid config is the same on all of those. I rebooted the hosts showing the incorrect order, and indeed upon reboot the order is as expected:
We're not backing up graphite's data directory, though metrics are mirrored to codfw too, so we can copy back from there. Which files do you need?
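e.g. pulling a subtree back from the codfw graphite host would look roughly like this (host name and metric path are illustrative):

rsync -av graphite2001.codfw.wmnet:/var/lib/carbon/whisper/<metric-path>/ \
      /var/lib/carbon/whisper/<metric-path>/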
Host being setup in T191896: Rack and setup ms-be1040-1043
Looks like 3 out of 4 hosts have sda or sdb as one of the HDDs, not the SSDs. The remaining host has sda/sdb as SSDs and two additional mdadm raid arrays.
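For reference, a quick way to check which block devices are rotational (ROTA 1 = HDD, 0 = SSD):

lsblk -d -o NAME,ROTA,SIZE,MODEL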
No planned upgrades ATM, though a newer upstream version might help with understanding (hopefully fixing) T192456: Prometheus metrics missing for some hosts too, so definitely welcome!
@Cmjohnson restbase1010 is powered down and ready to have all of its SSDs swapped
Apr 23 2018
+1 to remove atop as a daemon/cron, possibly the package altogether too
Alerted today; a real but short-lived issue. Note that the alert is a single one even though its text can change over time (e.g. when more sites alert), so icinga needs to be instructed to re-alert whenever the text changes. Other improvements include printing the "worst" value found among all metrics that match the query.
I'll be helping with mcrouter_exporter packaging/setup/etc. I tried it and it looks like it is doing the right thing (though it asks mcrouter directly rather than using stats files)