
CDanis (Chris Danis)
SRE

User Details

User Since
Nov 5 2018, 2:54 PM (20 w, 1 d)
Availability
Available
IRC Nick
cdanis
LDAP User
CDanis
MediaWiki User
CDanis (WMF) [ Global Accounts ]

Recent Activity

Today

CDanis renamed T214640: wikitech-static cert renewal seems to stop apache2 from wikitech-static cert about to expire to wikitech-static cert renewal seems to stop apache2.
Tue, Mar 26, 2:46 PM · Operations, wikitech.wikimedia.org
CDanis added a comment to T196336: Icinga passive checks go awol and downtime stops working.

Yesterday I had a fun adventure through UNIX internals diagnosing why the secondary icinga host winds up with lots of stuck nsca processes and fixed that with r499028 above.

Tue, Mar 26, 1:39 PM · Patch-For-Review, Operations, Icinga, monitoring
CDanis edited P8272 Masterwork From Distant Lands.
Tue, Mar 26, 1:03 PM
CDanis edited P8269 Masterwork From Distant Lands.
Tue, Mar 26, 12:39 AM
CDanis reopened T214640: wikitech-static cert renewal seems to stop apache2 as "Open".

Looks like certbot renews the cert but doesn't restart apache correctly?

Tue, Mar 26, 12:28 AM · Operations, wikitech.wikimedia.org
CDanis updated the title for P8268 certbot @ wikitech-static doesn't restart apache correctly? from Masterwork From Distant Lands to certbot @ wikitech-static doesn't restart apache correctly?.
Tue, Mar 26, 12:21 AM
CDanis edited P8268 certbot @ wikitech-static doesn't restart apache correctly?.
Tue, Mar 26, 12:16 AM

Yesterday

CDanis updated the title for P8266 cdanis@icinga2001.wikimedia.org ~ % ps -eo state,lstart,pid,cmd | grep nsca | grep '^S' | sort -g | cut -d' ' -f2-6 | phaste from Masterwork From Distant Lands to cdanis@icinga2001.wikimedia.org ~ % ps -eo state,lstart,pid,cmd | grep nsca | grep '^S' | sort -g | cut -d' ' -f2-6 | phaste.
Mon, Mar 25, 8:46 PM
CDanis edited P8265 Masterwork From Distant Lands.
Mon, Mar 25, 7:55 PM

Fri, Mar 22

CDanis renamed T196336: Icinga passive checks go awol and downtime stops working from Icinga passive checks go awal and downtime stops working to Icinga passive checks go awol and downtime stops working.
Fri, Mar 22, 9:12 PM · Patch-For-Review, Operations, Icinga, monitoring

Thu, Mar 21

CDanis added a comment to T215415: mw2206.codfw.wmnet memory issues.

The memory address is the same in all of these error reports.

Thu, Mar 21, 2:03 PM · User-jijiki, serviceops, ops-codfw, Operations

Tue, Mar 19

CDanis closed T209863: graph server temperature metrics as Resolved.

proper graphs with some labels extracted from hardware metadata, plus a thermal headroom 'saturation' graph extracted from syslog mtail

Tue, Mar 19, 5:38 PM · Patch-For-Review, Operations, monitoring, User-CDanis
CDanis added a comment to T218544: ms-be1043 sdk failed.

@fgiunchedi we should set this device to 0 weight in the rings, yes? Happy to do the change if you'll review

Tue, Mar 19, 3:16 PM · Patch-For-Review, monitoring, Operations-Software-Development, Operations, ops-eqiad

Tue, Mar 12

CDanis added a comment to T213214: Visual Editor gets stuck opening article (net::ERR_SPDY_PROTOCOL_ERROR 200/Loading failed for the <script> with source ...).

Talked to @ema and he thinks this is related to T215389 and T216006, which should be fixed by https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/494937/ submitted today.

Tue, Mar 12, 6:07 PM · User-Ryasmeen, Traffic, Wikimedia-Apache-configuration, Operations, VisualEditor
CDanis updated subscribers of T213214: Visual Editor gets stuck opening article (net::ERR_SPDY_PROTOCOL_ERROR 200/Loading failed for the <script> with source ...).

Thank you @The_RedBurn, that was helpful.

Tue, Mar 12, 5:35 PM · User-Ryasmeen, Traffic, Wikimedia-Apache-configuration, Operations, VisualEditor

Fri, Mar 8

CDanis created P8173 (An Untitled Masterwork).
Fri, Mar 8, 3:59 PM

Thu, Mar 7

CDanis closed T116767: limit the impact of heavy/large graphite queries as Resolved.

Just saw the new timeout work -- query returned a 500 status after ~60 seconds. Boldly going to call this resolved; of course reopen if there's still more to be done.

Thu, Mar 7, 4:35 PM · Patch-For-Review, monitoring, Operations
CDanis assigned T217679: Graphite returning server errors (out of memory?) to Lucas_Werkmeister_WMDE.

Lucas, can you verify that this is resolved?

Thu, Mar 7, 4:18 PM · Patch-For-Review, Operations, Graphite

Wed, Mar 6

CDanis closed T217715: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries as Resolved.

Resolving this; there's followup work to be done but the 'incident' proper is over.

Wed, Mar 6, 6:19 PM · Patch-For-Review, Wikimedia-Incident, monitoring, Operations
CDanis added a comment to T217715: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries.

Turns out those queries weren't actually to execute at all. Here's the highest-cardinality metrics on k8s@codfw prometheus:

Wed, Mar 6, 6:19 PM · Patch-For-Review, Wikimedia-Incident, monitoring, Operations
CDanis added a comment to T217715: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries.

Also making sure url label can't grow with unbounded cardinality.

How do we enforce that though?
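
One possible way to bound a label's cardinality on the instrumentation side is to cap the number of distinct values and fold everything past the cap into a single overflow bucket. The sketch below is only an illustration under that assumption; the class name `BoundedLabel`, the cap, and the `"other"` bucket are hypothetical and are not taken from the task or from any Prometheus client library.

```python
from typing import Set


class BoundedLabel:
    """Map raw label values onto a bounded set (hypothetical helper).

    The first `max_values` distinct values are passed through unchanged;
    any later unseen value is replaced by the overflow bucket, so the
    label can never take more than max_values + 1 distinct values.
    """

    def __init__(self, max_values: int, overflow: str = "other") -> None:
        self.max_values = max_values
        self.overflow = overflow
        self._seen: Set[str] = set()

    def normalize(self, value: str) -> str:
        # Values that were already admitted keep their own series.
        if value in self._seen:
            return value
        # Admit new values only while there is room under the cap.
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        return self.overflow
```

The value returned by `normalize` would be used as the label value in place of the raw URL. An explicit allow-list of known routes is a stricter alternative with the same shape.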

Wed, Mar 6, 5:48 PM · Patch-For-Review, Wikimedia-Incident, monitoring, Operations
CDanis updated the task description for T217715: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries.
Wed, Mar 6, 4:40 AM · Patch-For-Review, Wikimedia-Incident, monitoring, Operations

Tue, Mar 5

CDanis updated the task description for T217715: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries.
Tue, Mar 5, 10:37 PM · Patch-For-Review, Wikimedia-Incident, monitoring, Operations
CDanis triaged T217715: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries as High priority.
Tue, Mar 5, 10:36 PM · Patch-For-Review, Wikimedia-Incident, monitoring, Operations
CDanis created T217715: INCIDENT: k8s@codfw prometheus queries disabled -- very slow to execute some queries.
Tue, Mar 5, 10:36 PM · Patch-For-Review, Wikimedia-Incident, monitoring, Operations
CDanis updated the task description for T215183: Redundant bootloaders for software RAID.
Tue, Mar 5, 4:45 PM · Patch-For-Review, Operations
CDanis added a comment to T217353: Backup strategy for Grafana.

It's almost certainly unnecessary with as little write traffic as our grafanas get, but what I've been doing for my own pre-version-upgrade backups is the following:

Tue, Mar 5, 2:13 PM · fundraising-tech-ops

Fri, Mar 1

CDanis added a comment to T200960: Logstash packet loss.

Twice in two months we've seen all the logstashen in one cluster 'lock up' at around the same time: stop processing incoming events, huge backlog of socket recv-Q bytes, JVM threads drop, CPU usage plummets, threads blocked on futex().

Fri, Mar 1, 3:21 PM · Operations, Patch-For-Review, Wikimedia-Logstash

Tue, Feb 26

CDanis added a comment to T214289: Make swift containers for docker registry cross replicated..

I'm happy to help in the future, although it will also be a learning exercise for me :)

Tue, Feb 26, 4:31 PM · Patch-For-Review, User-fsero, serviceops, Prod-Kubernetes, Kubernetes, Operations

Feb 22 2019

CDanis archived P8122 Masterwork From Distant Lands.
Feb 22 2019, 6:33 PM

Feb 21 2019

CDanis added a comment to T213708: Upgrade production prometheus-node-exporter to >= 0.16.

Looks like -n is/was a gawk option?

Feb 21 2019, 4:57 PM · Patch-For-Review, Goal, monitoring, Operations

Feb 20 2019

CDanis updated subscribers of T215183: Redundant bootloaders for software RAID.

@Joe made me aware of the existence of partman configs present on install1002 that are not in Puppet.

Feb 20 2019, 2:22 PM · Patch-For-Review, Operations
CDanis added a comment to T216611: Icinga check for ircecho should check for actual activity.

This would be an incredibly silly way to do it, but it would be very easy to write a check_prometheus invocation for outgoing network traffic from the machine:

Feb 20 2019, 12:52 PM · IRCecho, monitoring, Icinga, Operations

Feb 15 2019

CDanis added a comment to T205396: Evaluate/integrate rasdaemon as a replacement for mcelog.

in under 10 minutes after installing rasdaemon on thumbor1004 we also saw one there. that machine is such a consistent performer:

Feb 15 17:37:00 thumbor1004 kernel: [340944.806495] mce: [Hardware Error]: Machine check events logged
Feb 15 17:37:00 thumbor1004 kernel: [340944.806517] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Feb 15 17:37:00 thumbor1004 kernel: [340944.806523] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000050000800c1
Feb 15 17:37:00 thumbor1004 kernel: [340944.806524] EDAC sbridge MC1: TSC 0
Feb 15 17:37:00 thumbor1004 kernel: [340944.806525] EDAC sbridge MC1: ADDR cc68f4000
Feb 15 17:37:00 thumbor1004 kernel: [340944.806526] EDAC sbridge MC1: MISC 90840800080208c
Feb 15 17:37:00 thumbor1004 kernel: [340944.806527] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1550252220 SOCKET 1 APIC 20
Feb 15 17:37:00 thumbor1004 kernel: [340944.806548] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: overriding event (987) ras:mc_event with new print handler
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: overriding event (986) ras:aer_event with new print handler
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: overriding event (96) mce:mce_record with new print handler
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: overriding event (988) ras:extlog_mem_event with new print handler
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: Calling ras_mc_event_opendb()
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: cpu 01:rasdaemon: mce_record store: 0x12794e8
Feb 15 17:37:00 thumbor1004 mcelog: warning: 16 bytes ignored in each record
Feb 15 17:37:00 thumbor1004 mcelog: consider an update
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: rasdaemon: register inserted at db
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: <idle>-0 [4281920] 0.034092: mce_record: 2019-02-15 17:37:00 +0000 bank=a, status= 8c000050000800c1, MEMORY CONTROLLER MS_CHANNEL1_ERR Transaction: Memory scrubbing error Corrected patrol scrub error, mci=Corrected_error, n_errors=1 memory_channel=1 ranks=-1 and -1, cpu_type= Ivy Bridge EP/EX, cpu= 1, socketid= 1, misc= 90840800080208c, addr= cc68f4000, mcgstatus=0, mcgcap= 1000c19, apicid= 20
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: cpu 01:rasdaemon: mc_event store: 0x1273a88
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: rasdaemon: register inserted at db
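
The kernel's `EDAC MCn: ... CE ...` line in the paste above carries the DIMM label and the failing page, which is enough to compare error reports across events. A minimal parsing sketch follows, assuming the line format shown here and 4 KiB pages; the function name `parse_edac_ce` and the returned keys are hypothetical.

```python
import re
from typing import Dict, Optional

# Matches the corrected-error (CE) summary line as it appears in the paste.
_CE_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>\S+) "
    r"\(.*?page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+)"
)


def parse_edac_ce(line: str) -> Optional[Dict[str, object]]:
    """Extract controller, count, DIMM label and address from a CE line.

    Returns None when the line is not a corrected-error summary.
    """
    m = _CE_RE.search(line)
    if m is None:
        return None
    page = int(m.group("page"), 16)
    offset = int(m.group("offset"), 16)
    return {
        "mc": int(m.group("mc")),
        "count": int(m.group("count")),
        "dimm": m.group("label"),
        # Assumption: 4 KiB pages, so the address is page * 4096 + offset.
        "address": (page << 12) + offset,
    }
```

For the thumbor1004 line, the reconstructed address equals the `ADDR cc68f4000` value that the sbridge driver logs a few lines earlier.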

Feb 15 2019, 5:40 PM · Patch-For-Review, monitoring, Operations
CDanis created P8090 thumbor1004 correctable memory error with rasdaemon.
Feb 15 2019, 5:39 PM
CDanis edited P8089 mw2206 correctable memory error with rasdaemon.
Feb 15 2019, 5:20 PM
CDanis added a comment to T205396: Evaluate/integrate rasdaemon as a replacement for mcelog.

we got one:

Feb 15 15:09:52 mw2206 kernel: [14254431.027746] mce: [Hardware Error]: Machine check events logged
Feb 15 15:09:52 mw2206 kernel: [14254431.027780] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Feb 15 15:09:52 mw2206 kernel: [14254431.027793] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8c000051000800c2
Feb 15 15:09:52 mw2206 kernel: [14254431.027798] EDAC sbridge MC0: TSC 0
Feb 15 15:09:52 mw2206 kernel: [14254431.027800] EDAC sbridge MC0: ADDR 2bb8b1000
Feb 15 15:09:52 mw2206 kernel: [14254431.027801] EDAC sbridge MC0: MISC 90000080008228c
Feb 15 15:09:52 mw2206 kernel: [14254431.027804] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1550243392 SOCKET 0 APIC 0
Feb 15 15:09:52 mw2206 kernel: [14254431.027835] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2bb8b1 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 15:09:52 mw2206 rasdaemon[26701]: overriding event (987) ras:mc_event with new print handler
Feb 15 15:09:52 mw2206 rasdaemon[26701]: overriding event (986) ras:aer_event with new print handler
Feb 15 15:09:52 mw2206 rasdaemon[26701]: overriding event (96) mce:mce_record with new print handler
Feb 15 15:09:52 mw2206 rasdaemon[26701]: overriding event (988) ras:extlog_mem_event with new print handler
Feb 15 15:09:52 mw2206 rasdaemon[26701]: Calling ras_mc_event_opendb()
Feb 15 15:09:52 mw2206 rasdaemon[26701]: cpu 00:rasdaemon: mce_record store: 0x558d58e06f28
Feb 15 15:09:52 mw2206 mcelog: warning: 16 bytes ignored in each record
Feb 15 15:09:52 mw2206 mcelog: consider an update
Feb 15 15:09:52 mw2206 rasdaemon[26701]: rasdaemon: register inserted at db
Feb 15 15:09:52 mw2206 rasdaemon[26701]: <idle>-0 [2003682560] 1.425447: mce_record: 2019-02-15 15:09:52 +0000 bank=b, status= 8c000051000800c2, MEMORY CONTROLLER MS_CHANNEL2_ERR Transaction: Memory scrubbing error Corrected patrol scrub error, mci=Corrected_error, n_errors=1 memory_channel=2 ranks=-1 and -1, cpu_type= Ivy Bridge EP/EX, cpu= 0, socketid= 0, misc= 90000080008228c, addr= 2bb8b1000, mcgstatus=0, mcgcap= 1000c1b, apicid= 0
Feb 15 15:09:52 mw2206 rasdaemon[26701]: cpu 00:rasdaemon: mc_event store: 0x558d58e029f8
Feb 15 15:09:52 mw2206 rasdaemon[26701]: rasdaemon: register inserted at db

Feb 15 2019, 5:20 PM · Patch-For-Review, monitoring, Operations
CDanis created P8089 mw2206 correctable memory error with rasdaemon.
Feb 15 2019, 5:19 PM

Feb 14 2019

CDanis reopened T205396: Evaluate/integrate rasdaemon as a replacement for mcelog as "Open".

@jbond kindly backported the buster version of rasdaemon to stretch. I'm going to attempt installing it on a few stretch hosts that are consistently reporting memory issues

Feb 14 2019, 10:24 PM · Patch-For-Review, monitoring, Operations
CDanis merged T207721: Broken memory on thumbor1004 into T215411: thumbor1004 memory errors.
Feb 14 2019, 2:08 PM · User-jijiki, Thumbor, ops-eqiad, serviceops, Operations
CDanis merged task T207721: Broken memory on thumbor1004 into T215411: thumbor1004 memory errors.
Feb 14 2019, 2:08 PM · Operations, ops-eqiad
CDanis awarded T216088: Mapping of servers to stakeholders a Like token.
Feb 14 2019, 1:07 AM · Operations

Feb 13 2019

CDanis renamed T215183: Redundant bootloaders for software RAID from sw raid1 doesnt install grub on sdb to Redundant bootloaders for software RAID.
Feb 13 2019, 6:26 PM · Patch-For-Review, Operations
CDanis created P8077 wdqs hosts with weird partitions.
Feb 13 2019, 3:51 PM
CDanis added a comment to T216043: Sort out which RAID packages are still needed.

I'm not sure if it's helpful here, or if you know this already, but there is a raid fact in facter that is an array of the types of RAID on a given machine.

Feb 13 2019, 2:42 PM · Operations

Feb 12 2019

CDanis added a comment to T214760: icinga1001 crashed.

OK, but IMO let's keep it passive. icinga can continue to run on icinga2001 for now.

Feb 12 2019, 10:01 PM · Patch-For-Review, ops-eqiad, monitoring, Operations
CDanis reassigned T214760: icinga1001 crashed from CDanis to RobH.

icinga2001 looks stable; go for it Rob

Feb 12 2019, 7:16 PM · Patch-For-Review, ops-eqiad, monitoring, Operations
CDanis added a comment to T215611: MediaWiki errors overloading logstash.

BTW @fgiunchedi authored an incident report at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190208-logstash-mediawiki

Feb 12 2019, 3:21 PM · Core Platform Team Kanban (Done with CPT), Core Platform Team (Security, stability, performance and scalability (TEC1)), Performance-Team, Wikimedia-production-error, Wikimedia-Logstash, Operations, MediaWiki-Database, monitoring
CDanis claimed T215183: Redundant bootloaders for software RAID.
Feb 12 2019, 1:30 PM · Patch-For-Review, Operations
CDanis added a comment to T205396: Evaluate/integrate rasdaemon as a replacement for mcelog.

It appears that how to make Prometheus node_exporter play nice with rasdaemon is an unresolved issue:
https://github.com/prometheus/node_exporter/issues/986

Feb 12 2019, 1:28 PM · Patch-For-Review, monitoring, Operations
CDanis added a comment to T214529: EDAC events not being reported by node-exporter?.

Thanks @fgiunchedi, that's a good thought! However I couldn't find anything in the SEL for a selection of servers that are currently reporting / have recently reported memory issues:

Feb 12 2019, 1:10 PM · Patch-For-Review, Operations, monitoring
CDanis archived P8072 Masterwork From Distant Lands.
Feb 12 2019, 2:42 AM
CDanis edited P8072 Masterwork From Distant Lands.
Feb 12 2019, 2:42 AM
CDanis added a comment to T215855: Gerrit loads very slowly.

maybe this will be illuminating for someone -- it is stack traces from the gerrit jvm process at the time it was guzzling CPU

Feb 12 2019, 2:30 AM · serviceops, Operations, Gerrit
CDanis updated the title for P8070 Dump of gerrit stack traces generated using SIGQUIT on the jvm from Masterwork From Distant Lands to Dump of gerrit stack traces generated using SIGQUIT on the jvm.
Feb 12 2019, 2:28 AM
CDanis edited P8070 Dump of gerrit stack traces generated using SIGQUIT on the jvm.
Feb 12 2019, 2:27 AM
jijiki awarded T214760: icinga1001 crashed a Party Time token.
Feb 12 2019, 12:18 AM · Patch-For-Review, ops-eqiad, monitoring, Operations

Feb 11 2019

CDanis created T215848: icinga really needs to check puppet run success of passive icinga hosts.
Feb 11 2019, 11:35 PM · monitoring, Icinga, Operations

Feb 8 2019

CDanis added a project to T215611: MediaWiki errors overloading logstash: Core Platform Team.

I'm not nearly well-versed enough in PHP/PSR-3/Monolog or the MW codebase to suggest implementations, but it seems to me that MediaWiki itself should be doing some rate limiting/deduplication of its logging output.
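
To make the deduplication idea concrete, here is a minimal sketch in Python rather than PHP/Monolog: a `logging.Filter` that drops a message if an identical one from the same logger and level was emitted within a time window. The class name `DedupFilter`, the window length, and the injected clock are assumptions for illustration; this is not MediaWiki's implementation.

```python
import logging
import time
from typing import Callable, Dict


class DedupFilter(logging.Filter):
    """Suppress identical log messages repeated within a time window."""

    def __init__(
        self,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        super().__init__()
        self.window_seconds = window_seconds
        self._clock = clock
        # Last emission time per (logger, level, message) key.
        # Note: this map grows with the number of distinct messages;
        # a real implementation would need to evict old keys.
        self._last_emitted: Dict[str, float] = {}

    def filter(self, record: logging.LogRecord) -> bool:
        key = f"{record.name}:{record.levelno}:{record.getMessage()}"
        now = self._clock()
        last = self._last_emitted.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window: drop it
        self._last_emitted[key] = now
        return True
```

Attaching it with `handler.addFilter(DedupFilter(60.0))` would let the first occurrence through and suppress repeats for the next minute.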

Feb 8 2019, 2:22 PM · Core Platform Team Kanban (Done with CPT), Core Platform Team (Security, stability, performance and scalability (TEC1)), Performance-Team, Wikimedia-production-error, Wikimedia-Logstash, Operations, MediaWiki-Database, monitoring

Feb 7 2019

CDanis added a comment to T214529: EDAC events not being reported by node-exporter?.

Talked some with @BBlack today, who observed that there are in fact a variety of drivers that back this stuff in the kernel, and that it's very possible we're getting all events delivered to us, but that the counters can be backed by firmware which makes its own decision as to whether or not to increment said counters.
I did a bunch of digging and failed to find any obvious correlation between this behavior being observed and kernel version. There is somewhat of a correlation between the approximate 'age' of the platform -- it correlates with newer BIOS version dates, newer server model numbers, more feature flags enabled in CPUs -- but there's nothing definitive there.

Feb 7 2019, 10:15 PM · Patch-For-Review, Operations, monitoring
CDanis awarded T126989: MediaWiki logging & encryption a Love token.
Feb 7 2019, 1:41 PM · Patch-For-Review, monitoring, Wikimedia-Logstash, MediaWiki-Debug-Logger, Operations
CDanis claimed T214529: EDAC events not being reported by node-exporter?.
Feb 7 2019, 1:25 PM · Patch-For-Review, Operations, monitoring

Feb 5 2019

CDanis added a comment to T215277: Expose linux kernel firewall and connections statistics.

Ah yes, sorry to be unclear -- the full extent of the data is definitely too much for Prometheus.

Feb 5 2019, 8:34 PM · Patch-For-Review, monitoring, Operations
CDanis added a subtask for T215183: Redundant bootloaders for software RAID: T215301: codfw spare pool system for partman testing.
Feb 5 2019, 8:25 PM · Patch-For-Review, Operations
CDanis added a parent task for T215301: codfw spare pool system for partman testing: T215183: Redundant bootloaders for software RAID.
Feb 5 2019, 8:25 PM · Patch-For-Review, Operations, hardware-requests
CDanis reassigned T215301: codfw spare pool system for partman testing from CDanis to Papaul.

@Papaul when you're back in codfw and have a spare moment, can you please unplug the disk on SATA port A from wmf6653 (in row D), and attempt booting it?
The disk on port B should also have a valid boot record on it.

Feb 5 2019, 8:18 PM · Patch-For-Review, Operations, hardware-requests
CDanis added a comment to T215301: codfw spare pool system for partman testing.

Grub does seem to be installed on both MBRs, with just minimal differences (going to guess pointing at the different physical disk for /):

Feb 5 2019, 7:59 PM · Patch-For-Review, Operations, hardware-requests
CDanis added a comment to T215301: codfw spare pool system for partman testing.
  1. Overwrote first 512 bytes of /dev/sda and /dev/sdb with zeros
    • Boot automatically fell through disk to using PXE
  2. Reimaged system again, with the change for the only_debian setting
  3. BIOS boot order port A first
    • booted fine
  4. Swapped boot order: port B
    • booted fine!
Feb 5 2019, 7:54 PM · Patch-For-Review, Operations, hardware-requests
CDanis updated subscribers of T215301: codfw spare pool system for partman testing.
  1. Overwrote first 512 bytes of /dev/sda and /dev/sdb with zeros
    • Boot automatically fell through disk to using PXE
  2. Reimaged system using raid1-lvm-ext4-srv-dualboot.cfg
    • Got a warning about the LVM volume group name being already in use; clicked through it manually
  3. Installer rebooted system, leaving BIOS disk boot order to port B
    • Hung at "Booting from Hard drive C:"
  4. Reset BIOS boot order to port A first
    • Booted fine
Feb 5 2019, 7:32 PM · Patch-For-Review, Operations, hardware-requests
CDanis added a comment to T215301: codfw spare pool system for partman testing.
  1. Installed server with standard raid1-lvm-ext4-srv.cfg partman config
    • Booted fine
  2. Went into BIOS and swapped boot order of SATA devices (afterwards, port B first)
    • Server seemed to hang at "Booting from Hard drive C:"
  3. back into BIOS; swapped boot order back to original (port A first).
    • Booted fine
  4. used install-console to grub-install /dev/sdb
  5. rebooted using default boot order
    • Booted fine
  6. back into BIOS; swap boot order (port B first)
    • Booted fine!
Feb 5 2019, 7:02 PM · Patch-For-Review, Operations, hardware-requests
CDanis added a comment to T215277: Expose linux kernel firewall and connections statistics.

I only did a quick look, but I couldn't find an existing Prometheus exporter for ulogd. I suppose we could write one, or we could just write an mtail module for its log output.

Feb 5 2019, 3:34 PM · Patch-For-Review, monitoring, Operations

Feb 4 2019

CDanis added a comment to T215183: Redundant bootloaders for software RAID.

Assumption 1: the partman-auto-raid directive exactly correlates with our use of Linux software RAID in production.
Assumption 2: in order to have a working grub install on each mirror, software RAID1/10 configs must override grub-installer/bootdev to list all the relevant disks.

Feb 4 2019, 9:16 PM · Patch-For-Review, Operations
CDanis updated the title for P8050 partman configs with patrman-auto-raid but without grub-installer/bootdev from Masterwork From Distant Lands to partman configs with patrman-auto-raid but without grub-installer/bootdev.
Feb 4 2019, 9:13 PM
CDanis edited P8050 partman configs with patrman-auto-raid but without grub-installer/bootdev.
Feb 4 2019, 9:12 PM
CDanis added a comment to T215033: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit).

So we just need a http check that checks the website (without checking if the ssl cert is valid or not as we have that check already) and also a load check to make sure the load doesn't go too high otherwise it could cause noticeable impact.

Feb 4 2019, 7:57 PM · serviceops, Patch-For-Review, Icinga, monitoring, Operations, Gerrit
CDanis added a comment to T215183: Redundant bootloaders for software RAID.

I know very little about debian-installer, but here's a guess based on what I found in the puppet repo:

Feb 4 2019, 6:51 PM · Patch-For-Review, Operations
CDanis added a comment to T215033: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit).

The timeouts on that check_ssl invocation make little sense to me -- a warning after 60 seconds, but critical after 30? Those seem backwards, and also too high by something like a factor of 4.
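
The ordering problem can be shown with a small sketch: a latency check only makes sense when the warning threshold is strictly below the critical one. The function `classify_latency` and its parameters are hypothetical and are not the interface of `check_ssl` or any Icinga plugin.

```python
def classify_latency(elapsed: float, warn_after: float, crit_after: float) -> str:
    """Classify a response time against warning and critical thresholds.

    Raises ValueError when the thresholds are inverted, because a warning
    that fires later than the critical threshold can never be reported.
    """
    if warn_after >= crit_after:
        raise ValueError(
            f"warning threshold ({warn_after}s) must be below "
            f"critical threshold ({crit_after}s)"
        )
    if elapsed >= crit_after:
        return "CRITICAL"
    if elapsed >= warn_after:
        return "WARNING"
    return "OK"
```

With thresholds of 60 seconds for warning and 30 for critical, as described above, this sketch refuses the configuration instead of silently skipping the warning state.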

Feb 4 2019, 6:42 PM · serviceops, Patch-For-Review, Icinga, monitoring, Operations, Gerrit
CDanis added a comment to T215033: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit).

If the thing we want to monitor is "Gerrit is responding slowly / not at all", IMO that is the thing we should check. High CPU load is just one of the causes that could result in high latency.

Feb 4 2019, 5:57 PM · serviceops, Patch-For-Review, Icinga, monitoring, Operations, Gerrit
CDanis reopened T213664: correctable memory errors db1068 (commons primary master database) as "Open".

Seems like it is happening again

Feb 4 2019, 2:58 PM · Patch-For-Review, DBA, Operations

Jan 25 2019

CDanis added a comment to T214529: EDAC events not being reported by node-exporter?.

I did a cumin run across the whole fleet to find hosts that have memory errors in their dmesg buffers, but haven't incremented counters beyond 0.
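
The comparison itself is simple once the data is collected: flag any host whose kernel log contains an EDAC error summary while its exported counter is still zero. The sketch below assumes the dmesg text and counter totals have already been gathered per host (for example by a fleet-wide run); the function name and input shapes are hypothetical, and it does not call cumin.

```python
import re
from typing import Dict, List

# Matches kernel summary lines such as "EDAC MC0: 1 CE memory scrubbing error on ...".
_EDAC_ERROR_RE = re.compile(r"EDAC MC\d+: \d+ [CU]E ")


def hosts_with_unreported_errors(
    dmesg_by_host: Dict[str, str],
    error_count_by_host: Dict[str, int],
) -> List[str]:
    """Return hosts with EDAC errors in dmesg but a zero error counter."""
    suspects = []
    for host, dmesg in sorted(dmesg_by_host.items()):
        logged = _EDAC_ERROR_RE.search(dmesg) is not None
        # A host missing from the counter map is treated as reporting zero.
        if logged and error_count_by_host.get(host, 0) == 0:
            suspects.append(host)
    return suspects
```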

Jan 25 2019, 4:39 PM · Patch-For-Review, Operations, monitoring
CDanis created P8039 Hosts that have EDAC events reported in kernel logs in the last 30 days:.
Jan 25 2019, 4:34 PM
CDanis edited P8037 cp5006 EDAC dmesg.
Jan 25 2019, 3:10 PM
CDanis edited P8036 mw1302 EDAC dmesg.
Jan 25 2019, 3:08 PM
CDanis edited P8035 cp4026 /var/log/mcelog.
Jan 25 2019, 3:00 PM
CDanis added a comment to T178690: Better organization for SRE grafana dashboards.

Ah, forgot to update the task, but at the time @jcrespo and @fgiunchedi and I talked, and Jaime's biggest gripe was that iostat-reported "disk IO utilization" is not a very useful metric: it's the fraction of time that at least one outstanding iop was in the disk's queue. On a server that has any load at all, this metric will generally be "100%" all the time; what you actually care about are stats like "queue depth" and "request latencies".
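
The difference between the three metrics can be sketched from two samples of a Linux `/proc/diskstats` line. This is a minimal illustration of the usual iostat-style arithmetic, assuming the standard field layout (field 10 is milliseconds spent doing I/O, field 11 is the weighted milliseconds); the names `DiskSample`, `parse_diskstats_line`, and `disk_metrics` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class DiskSample:
    reads: int        # field 1: reads completed
    read_ms: int      # field 4: milliseconds spent reading
    writes: int       # field 5: writes completed
    write_ms: int     # field 8: milliseconds spent writing
    io_ticks_ms: int  # field 10: milliseconds with at least one I/O in flight
    weighted_ms: int  # field 11: milliseconds weighted by I/Os in flight


def parse_diskstats_line(line: str) -> Tuple[str, DiskSample]:
    """Parse one /proc/diskstats line into (device name, sample)."""
    f = line.split()
    # f[0] = major, f[1] = minor, f[2] = device name, f[3] = field 1, ...
    return f[2], DiskSample(
        reads=int(f[3]),
        read_ms=int(f[6]),
        writes=int(f[7]),
        write_ms=int(f[10]),
        io_ticks_ms=int(f[12]),
        weighted_ms=int(f[13]),
    )


def disk_metrics(before: DiskSample, after: DiskSample, interval_ms: float) -> Dict[str, float]:
    """Derive utilization, queue depth and latency from two samples."""
    ios = (after.reads - before.reads) + (after.writes - before.writes)
    wait_ms = (after.read_ms - before.read_ms) + (after.write_ms - before.write_ms)
    return {
        # Fraction of the interval with any I/O outstanding; saturates at 1.0.
        "utilization": (after.io_ticks_ms - before.io_ticks_ms) / interval_ms,
        # Average number of requests in flight; keeps growing under load.
        "avg_queue_depth": (after.weighted_ms - before.weighted_ms) / interval_ms,
        # Average time per completed request.
        "avg_latency_ms": wait_ms / ios if ios else 0.0,
    }
```

A disk that is busy for the whole interval reports a utilization of 1.0 whether it has one request in flight or twenty, while queue depth and latency keep distinguishing the two cases.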

Jan 25 2019, 2:57 PM · User-CDanis, Patch-For-Review, User-fgiunchedi, monitoring, Operations

Jan 24 2019

CDanis added a comment to T214529: EDAC events not being reported by node-exporter?.
cdanis@cp4026.ulsfo.wmnet ~ % edac-util -v  
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#1_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#1_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#1_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#1_Chan#1_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#1_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#1_Chan#1_DIMM#0: 0 Corrected Errors
edac-util: No errors to report.
Jan 24 2019, 6:59 PM · Patch-For-Review, Operations, monitoring

Jan 23 2019

CDanis updated the task description for T214529: EDAC events not being reported by node-exporter?.
Jan 23 2019, 11:04 PM · Patch-For-Review, Operations, monitoring
CDanis edited P8031 cp4026 EDAC dmesg.
Jan 23 2019, 11:03 PM
CDanis added a comment to T183177: memory errors not showing in icinga.

@Dzahn investigation on the mystery of cp4026 ongoing in T214529

Jan 23 2019, 10:59 PM · Traffic, Patch-For-Review, User-fgiunchedi, DC-Ops, Operations, monitoring
CDanis added projects to T214529: EDAC events not being reported by node-exporter?: monitoring, Operations.
Jan 23 2019, 10:55 PM · Patch-For-Review, Operations, monitoring
CDanis added a comment to T214529: EDAC events not being reported by node-exporter?.

One more observation:

Jan 23 2019, 10:50 PM · Patch-For-Review, Operations, monitoring
CDanis created T214529: EDAC events not being reported by node-exporter?.
Jan 23 2019, 10:50 PM · Patch-For-Review, Operations, monitoring
CDanis closed T214090: Request to add new hire David Sharpe (Dsharpe) to wmf LDAP group as Resolved.

Not sure how I missed this, but fixed now.

Jan 23 2019, 3:49 PM · Patch-For-Review, LDAP-Access-Requests
CDanis closed T214090: Request to add new hire David Sharpe (Dsharpe) to wmf LDAP group, a subtask of T213742: Onboarding David Sharpe to Security Team as Information Security Analyst, as Resolved.
Jan 23 2019, 3:49 PM · Security-Team

Jan 21 2019

CDanis added a comment to T208524: RfC: Standards for external services in the Wikimedia infrastructure..

In T208524#4895821, @daniel wrote:
For number (3), I think the question is when and how, rather than if. One, three, five years?

I would ask "if". What benefits does doing that bring that outweigh breaking compatibility with low-traffic low-effort installations?

Jan 21 2019, 4:23 PM · TechCom-RFC (TechCom-Approved), serviceops

Jan 18 2019

srishakatux awarded T213780: Add Srishti to analytics-privatedata-users a Like token.
Jan 18 2019, 6:50 PM · Patch-For-Review, Operations, SRE-Access-Requests, Developer-Advocacy (Jan-Mar 2019), Documentation
CDanis triaged T214153: Fix node vs nodejs dependency issue as Normal priority.
Jan 18 2019, 4:11 PM · Reading-Infrastructure-Team-Backlog, Operations, Maps
CDanis moved T214130: Requesting access to production for dsharpe from Awaiting User Input to SRE Meeting Review Required on the SRE-Access-Requests board.
Jan 18 2019, 4:09 PM · Operations, SRE-Access-Requests
CDanis updated subscribers of T214130: Requesting access to production for dsharpe.

@faidon @mark Could one of you approve this request?

Jan 18 2019, 4:09 PM · Operations, SRE-Access-Requests