
CDanis (Chris Danis)
SRE

User Details

User Since
Nov 5 2018, 2:54 PM (15 w, 3 d)
Availability
Available
IRC Nick
cdanis
LDAP User
CDanis
MediaWiki User
CDanis (WMF) [ Global Accounts ]

Recent Activity

Yesterday

CDanis added a comment to T213708: Upgrade production prometheus-node-exporter to >= 0.16.

Looks like -n is/was a gawk option?

Thu, Feb 21, 4:57 PM · Patch-For-Review, Goal, monitoring, Operations

Wed, Feb 20

CDanis updated subscribers of T215183: Redundant bootloaders for software RAID.

@Joe made me aware of the existence of partman configs present on install1002 that are not in Puppet.

Wed, Feb 20, 2:22 PM · Patch-For-Review, Operations
CDanis added a comment to T216611: Icinga check for ircecho should check for actual activity.

This would be an incredibly silly way to do it, but it would be very easy to write a check_prometheus invocation for outgoing network traffic from the machine:

Wed, Feb 20, 12:52 PM · IRCecho, monitoring, Icinga, Operations

Fri, Feb 15

CDanis added a comment to T205396: Evaluate/integrate rasdaemon as a replacement for mcelog.

Less than 10 minutes after installing rasdaemon on thumbor1004, we also saw one there. That machine is such a consistent performer:

Feb 15 17:37:00 thumbor1004 kernel: [340944.806495] mce: [Hardware Error]: Machine check events logged
Feb 15 17:37:00 thumbor1004 kernel: [340944.806517] EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Feb 15 17:37:00 thumbor1004 kernel: [340944.806523] EDAC sbridge MC1: CPU 1: Machine Check Event: 0 Bank 10: 8c000050000800c1
Feb 15 17:37:00 thumbor1004 kernel: [340944.806524] EDAC sbridge MC1: TSC 0
Feb 15 17:37:00 thumbor1004 kernel: [340944.806525] EDAC sbridge MC1: ADDR cc68f4000
Feb 15 17:37:00 thumbor1004 kernel: [340944.806526] EDAC sbridge MC1: MISC 90840800080208c
Feb 15 17:37:00 thumbor1004 kernel: [340944.806527] EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1550252220 SOCKET 1 APIC 20
Feb 15 17:37:00 thumbor1004 kernel: [340944.806548] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xcc68f4 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:1)
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: overriding event (987) ras:mc_event with new print handler
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: overriding event (986) ras:aer_event with new print handler
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: overriding event (96) mce:mce_record with new print handler
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: overriding event (988) ras:extlog_mem_event with new print handler
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: Calling ras_mc_event_opendb()
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: cpu 01:rasdaemon: mce_record store: 0x12794e8
Feb 15 17:37:00 thumbor1004 mcelog: warning: 16 bytes ignored in each record
Feb 15 17:37:00 thumbor1004 mcelog: consider an update
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: rasdaemon: register inserted at db
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: <idle>-0 [4281920] 0.034092: mce_record: 2019-02-15 17:37:00 +0000 bank=a, status= 8c000050000800c1, MEMORY CONTROLLER MS_CHANNEL1_ERR Transaction: Memory scrubbing error Corrected patrol scrub error, mci=Corrected_error, n_errors=1 memory_channel=1 ranks=-1 and -1, cpu_type= Ivy Bridge EP/EX, cpu= 1, socketid= 1, misc= 90840800080208c, addr= cc68f4000, mcgstatus=0, mcgcap= 1000c19, apicid= 20
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: cpu 01:rasdaemon: mc_event store: 0x1273a88
Feb 15 17:37:00 thumbor1004 rasdaemon[31103]: rasdaemon: register inserted at db

Fri, Feb 15, 5:40 PM · Patch-For-Review, monitoring, Operations
CDanis created P8090 thumbor1004 correctable memory error with rasdaemon.
Fri, Feb 15, 5:39 PM
CDanis edited P8089 mw2206 correctable memory error with rasdaemon.
Fri, Feb 15, 5:20 PM
CDanis added a comment to T205396: Evaluate/integrate rasdaemon as a replacement for mcelog.

we got one:

Feb 15 15:09:52 mw2206 kernel: [14254431.027746] mce: [Hardware Error]: Machine check events logged
Feb 15 15:09:52 mw2206 kernel: [14254431.027780] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Feb 15 15:09:52 mw2206 kernel: [14254431.027793] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 11: 8c000051000800c2
Feb 15 15:09:52 mw2206 kernel: [14254431.027798] EDAC sbridge MC0: TSC 0
Feb 15 15:09:52 mw2206 kernel: [14254431.027800] EDAC sbridge MC0: ADDR 2bb8b1000
Feb 15 15:09:52 mw2206 kernel: [14254431.027801] EDAC sbridge MC0: MISC 90000080008228c
Feb 15 15:09:52 mw2206 kernel: [14254431.027804] EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1550243392 SOCKET 0 APIC 0
Feb 15 15:09:52 mw2206 kernel: [14254431.027835] EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x2bb8b1 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c2 socket:0 ha:0 channel_mask:2 rank:0)
Feb 15 15:09:52 mw2206 rasdaemon[26701]: overriding event (987) ras:mc_event with new print handler
Feb 15 15:09:52 mw2206 rasdaemon[26701]: overriding event (986) ras:aer_event with new print handler
Feb 15 15:09:52 mw2206 rasdaemon[26701]: overriding event (96) mce:mce_record with new print handler
Feb 15 15:09:52 mw2206 rasdaemon[26701]: overriding event (988) ras:extlog_mem_event with new print handler
Feb 15 15:09:52 mw2206 rasdaemon[26701]: Calling ras_mc_event_opendb()
Feb 15 15:09:52 mw2206 rasdaemon[26701]: cpu 00:rasdaemon: mce_record store: 0x558d58e06f28
Feb 15 15:09:52 mw2206 mcelog: warning: 16 bytes ignored in each record
Feb 15 15:09:52 mw2206 mcelog: consider an update
Feb 15 15:09:52 mw2206 rasdaemon[26701]: rasdaemon: register inserted at db
Feb 15 15:09:52 mw2206 rasdaemon[26701]: <idle>-0 [2003682560] 1.425447: mce_record: 2019-02-15 15:09:52 +0000 bank=b, status= 8c000051000800c2, MEMORY CONTROLLER MS_CHANNEL2_ERR Transaction: Memory scrubbing error Corrected patrol scrub error, mci=Corrected_error, n_errors=1 memory_channel=2 ranks=-1 and -1, cpu_type= Ivy Bridge EP/EX, cpu= 0, socketid= 0, misc= 90000080008228c, addr= 2bb8b1000, mcgstatus=0, mcgcap= 1000c1b, apicid= 0
Feb 15 15:09:52 mw2206 rasdaemon[26701]: cpu 00:rasdaemon: mc_event store: 0x558d58e029f8
Feb 15 15:09:52 mw2206 rasdaemon[26701]: rasdaemon: register inserted at db

Fri, Feb 15, 5:20 PM · Patch-For-Review, monitoring, Operations
CDanis created P8089 mw2206 correctable memory error with rasdaemon.
Fri, Feb 15, 5:19 PM

Thu, Feb 14

CDanis reopened T205396: Evaluate/integrate rasdaemon as a replacement for mcelog as "Open".

@jbond kindly backported the buster version of rasdaemon to stretch. I'm going to attempt installing it on a few stretch hosts that are consistently reporting memory issues

Thu, Feb 14, 10:24 PM · Patch-For-Review, monitoring, Operations
CDanis merged T207721: Broken memory on thumbor1004 into T215411: thumbor1004 memory errors.
Thu, Feb 14, 2:08 PM · Thumbor, ops-eqiad, serviceops, Operations
CDanis merged task T207721: Broken memory on thumbor1004 into T215411: thumbor1004 memory errors.
Thu, Feb 14, 2:08 PM · Operations, ops-eqiad
CDanis awarded T216088: Mapping of servers to stakeholders a Like token.
Thu, Feb 14, 1:07 AM · Operations

Wed, Feb 13

CDanis renamed T215183: Redundant bootloaders for software RAID from sw raid1 doesnt install grub on sdb to Redundant bootloaders for software RAID.
Wed, Feb 13, 6:26 PM · Patch-For-Review, Operations
CDanis created P8077 wdqs hosts with weird partitions.
Wed, Feb 13, 3:51 PM
CDanis added a comment to T216043: Sort out which RAID packages are still needed.

I'm not sure if it's helpful here, or if you know this already, but there is a raid fact in facter that is an array of the types of RAID on a given machine.

Wed, Feb 13, 2:42 PM · Operations

Tue, Feb 12

CDanis added a comment to T214760: icinga1001 crashed.

OK, but IMO let's keep it passive. icinga can continue to run on icinga2001 for now.

Tue, Feb 12, 10:01 PM · Patch-For-Review, ops-eqiad, monitoring, Operations
CDanis reassigned T214760: icinga1001 crashed from CDanis to RobH.

icinga2001 looks stable; go for it Rob

Tue, Feb 12, 7:16 PM · Patch-For-Review, ops-eqiad, monitoring, Operations
CDanis added a comment to T215611: MediaWiki errors overloading logstash.

BTW @fgiunchedi authored an incident report at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190208-logstash-mediawiki

Tue, Feb 12, 3:21 PM · Core Platform Team Kanban (Blocked Externally), Core Platform Team (Security, stability, performance and scalability (TEC1)), Performance-Team, Wikimedia-production-error, Wikimedia-Logstash, Operations, MediaWiki-Database, monitoring
CDanis claimed T215183: Redundant bootloaders for software RAID.
Tue, Feb 12, 1:30 PM · Patch-For-Review, Operations
CDanis added a comment to T205396: Evaluate/integrate rasdaemon as a replacement for mcelog.

It appears that how to make Prometheus node_exporter play nice with rasdaemon is an unresolved issue:
https://github.com/prometheus/node_exporter/issues/986

Tue, Feb 12, 1:28 PM · Patch-For-Review, monitoring, Operations
CDanis added a comment to T214529: EDAC events not being reported by node-exporter?.

Thanks @fgiunchedi, that's a good thought! However I couldn't find anything in the SEL for a selection of servers that are currently reporting / have recently reported memory issues:

Tue, Feb 12, 1:10 PM · Patch-For-Review, Operations, monitoring
CDanis archived P8072 Masterwork From Distant Lands.
Tue, Feb 12, 2:42 AM
CDanis edited P8072 Masterwork From Distant Lands.
Tue, Feb 12, 2:42 AM
CDanis added a comment to T215855: Gerrit loads very slowly.

maybe this will be illuminating for someone -- it is stack traces from the gerrit jvm process at the time it was guzzling CPU

Tue, Feb 12, 2:30 AM · serviceops, Operations, Gerrit
CDanis updated the title for P8070 Dump of gerrit stack traces generated using SIGQUIT on the jvm from Masterwork From Distant Lands to Dump of gerrit stack traces generated using SIGQUIT on the jvm.
Tue, Feb 12, 2:28 AM
CDanis edited P8070 Dump of gerrit stack traces generated using SIGQUIT on the jvm.
Tue, Feb 12, 2:27 AM
jijiki awarded T214760: icinga1001 crashed a Party Time token.
Tue, Feb 12, 12:18 AM · Patch-For-Review, ops-eqiad, monitoring, Operations

Mon, Feb 11

CDanis created T215848: icinga really needs to check puppet run success of passive icinga hosts.
Mon, Feb 11, 11:35 PM · monitoring, Icinga, Operations

Fri, Feb 8

CDanis added a project to T215611: MediaWiki errors overloading logstash: Core Platform Team.

I'm not nearly well-versed enough in PHP/PSR-3/Monolog or the MW codebase to suggest implementations, but it seems to me that MediaWiki itself should be doing some rate limiting/deduplication of its logging output.
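
To illustrate what such rate limiting/deduplication could look like, here is a minimal sketch using Python's standard `logging` module. It is not Monolog or MediaWiki code; the class name `DedupRateLimitFilter` and its `burst`/`window` parameters are hypothetical, chosen only for this example.

```python
import logging
import time


class DedupRateLimitFilter(logging.Filter):
    """Allow at most `burst` identical records per `window` seconds.

    Hypothetical helper for illustration. Records are considered identical
    when logger name, level and the unformatted message template match.
    """

    def __init__(self, burst=5, window=60.0, clock=time.monotonic):
        super().__init__()
        self.burst = burst
        self.window = window
        self.clock = clock
        self._seen = {}  # key -> (window_start, count_in_window)

    def filter(self, record):
        key = (record.name, record.levelno, record.msg)
        now = self.clock()
        start, count = self._seen.get(key, (now, 0))
        if now - start >= self.window:
            # The previous window expired, so start counting afresh.
            start, count = now, 0
        count += 1
        self._seen[key] = (start, count)
        # Returning False tells the logging machinery to drop the record.
        return count <= self.burst
```

Attached with `handler.addFilter(DedupRateLimitFilter())`, this would drop repeats at the source rather than leaving the log pipeline to absorb them. A production version would also need to expire old keys and report how many records were suppressed.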

Fri, Feb 8, 2:22 PM · Core Platform Team Kanban (Blocked Externally), Core Platform Team (Security, stability, performance and scalability (TEC1)), Performance-Team, Wikimedia-production-error, Wikimedia-Logstash, Operations, MediaWiki-Database, monitoring

Thu, Feb 7

CDanis added a comment to T214529: EDAC events not being reported by node-exporter?.

Talked some with @BBlack today, who observed that there are in fact a variety of drivers that back this stuff in the kernel, and that it's very possible we're getting all events delivered to us, but that the counters can be backed by firmware which makes its own decision as to whether or not to increment said counters.
I did a bunch of digging and failed to find any obvious correlation between this behavior being observed and kernel version. There is somewhat of a correlation between the approximate 'age' of the platform -- it correlates with newer BIOS version dates, newer server model numbers, more feature flags enabled in CPUs -- but there's nothing definitive there.

Thu, Feb 7, 10:15 PM · Patch-For-Review, Operations, monitoring
CDanis awarded T126989: MediaWiki logging & encryption a Love token.
Thu, Feb 7, 1:41 PM · monitoring, Wikimedia-Logstash, MediaWiki-Debug-Logger, Operations
CDanis claimed T214529: EDAC events not being reported by node-exporter?.
Thu, Feb 7, 1:25 PM · Patch-For-Review, Operations, monitoring

Tue, Feb 5

CDanis added a comment to T215277: Expose linux kernel firewall and connections statistics.

Ah yes, sorry to be unclear -- the full extent of the data is definitely too much for Prometheus.

Tue, Feb 5, 8:34 PM · monitoring, Operations
CDanis added a subtask for T215183: Redundant bootloaders for software RAID: T215301: codfw spare pool system for partman testing.
Tue, Feb 5, 8:25 PM · Patch-For-Review, Operations
CDanis added a parent task for T215301: codfw spare pool system for partman testing: T215183: Redundant bootloaders for software RAID.
Tue, Feb 5, 8:25 PM · Patch-For-Review, Operations, hardware-requests
CDanis reassigned T215301: codfw spare pool system for partman testing from CDanis to Papaul.

@Papaul when you're back in codfw and have a spare moment, can you please unplug the disk on SATA port A from wmf6653 (in row D), and attempt booting it?
The disk on port B should also have a valid boot record on it.

Tue, Feb 5, 8:18 PM · Patch-For-Review, Operations, hardware-requests
CDanis added a comment to T215301: codfw spare pool system for partman testing.

Grub does seem to be installed on both MBRs, with just minimal differences (going to guess pointing at the different physical disk for /):

Tue, Feb 5, 7:59 PM · Patch-For-Review, Operations, hardware-requests
CDanis added a comment to T215301: codfw spare pool system for partman testing.
  1. Overwrote first 512 bytes of /dev/sda and /dev/sdb with zeros
    • Boot automatically fell through disk to using PXE
  2. Reimaged system again, with the change for the only_debian setting
  3. BIOS boot order port A first
    • booted fine
  4. Swapped boot order: port B
    • booted fine!
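
The state of each disk's first sector in the steps above can be checked with a short script. This is a sketch under assumptions: the helper names are hypothetical, reading `/dev/sda` or `/dev/sdb` requires root, and it only looks at the classic MBR layout (bootstrap code in the first 440 bytes, boot signature `0x55 0xAA` in the last two bytes of the 512-byte sector).

```python
SECTOR_SIZE = 512
BOOTSTRAP_LEN = 440           # bootstrap code area of a classic MBR
BOOT_SIGNATURE = b"\x55\xaa"  # bytes 510-511 of a bootable MBR


def read_first_sector(path):
    """Read the first 512 bytes of a block device or image file."""
    with open(path, "rb") as f:
        return f.read(SECTOR_SIZE)


def describe_sector(sector):
    """Classify a first sector as zeroed, bootable-looking, or neither."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("expected exactly %d bytes" % SECTOR_SIZE)
    if sector == bytes(SECTOR_SIZE):
        return "zeroed"
    if sector[510:512] == BOOT_SIGNATURE:
        return "boot signature present"
    return "no boot signature"


def differing_offsets(a, b, length=BOOTSTRAP_LEN):
    """Return the byte offsets at which two bootstrap areas differ."""
    return [i for i in range(length) if a[i] != b[i]]
```

For example, `describe_sector(read_first_sector("/dev/sdb"))` would report whether the second disk carries a boot record at all, and `differing_offsets` shows how far the two disks' boot code diverges.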
Tue, Feb 5, 7:54 PM · Patch-For-Review, Operations, hardware-requests
CDanis updated subscribers of T215301: codfw spare pool system for partman testing.
  1. Overwrote first 512 bytes of /dev/sda and /dev/sdb with zeros
    • Boot automatically fell through disk to using PXE
  2. Reimaged system using raid1-lvm-ext4-srv-dualboot.cfg
    • Got a warning about the LVM volume group name being already in use; clicked through it manually
  3. Installer rebooted system, leaving BIOS disk boot order to port B
    • Hung at "Booting from Hard drive C:"
  4. Reset BIOS boot order to port A first
    • Booted fine
Tue, Feb 5, 7:32 PM · Patch-For-Review, Operations, hardware-requests
CDanis added a comment to T215301: codfw spare pool system for partman testing.
  1. Installed server with standard raid1-lvm-ext4-srv.cfg partman config
    • Booted fine
  2. Went into BIOS and swapped boot order of SATA devices (afterwards, port B first)
    • Server seemed to hang at "Booting from Hard drive C:"
  3. back into BIOS; swapped boot order back to original (port A first).
    • Booted fine
  4. used install-console to grub-install /dev/sdb
  5. rebooted using default boot order
    • Booted fine
  6. back into BIOS; swap boot order (port B first)
    • Booted fine!
Tue, Feb 5, 7:02 PM · Patch-For-Review, Operations, hardware-requests
CDanis added a comment to T215277: Expose linux kernel firewall and connections statistics.

I only did a quick look, but I couldn't find an existing Prometheus exporter for ulogd. I suppose we could write one, or we could just write an mtail module for its log output.

Tue, Feb 5, 3:34 PM · monitoring, Operations

Mon, Feb 4

CDanis added a comment to T215183: Redundant bootloaders for software RAID.

Assumption 1: the partman-auto-raid directive exactly correlates with our use of Linux software RAID in production.
Assumption 2: in order to have a working grub install on each mirror, software RAID1/10 configs must override grub-installer/bootdev to list all the relevant disks.
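
Under those two assumptions, finding the affected configs is a text search. The sketch below is one hypothetical way to do it in Python; the function names and the example file names in the usage note are made up for illustration, and it treats lines starting with `#` as comments.

```python
def lacks_bootdev(config_text):
    """True if a config uses partman-auto-raid but never sets grub-installer/bootdev."""
    active = [
        line for line in config_text.splitlines()
        if not line.lstrip().startswith("#")
    ]
    uses_raid = any("partman-auto-raid" in line for line in active)
    sets_bootdev = any("grub-installer/bootdev" in line for line in active)
    return uses_raid and not sets_bootdev


def configs_missing_bootdev(configs):
    """Given {filename: contents}, return the sorted names of suspect configs."""
    return sorted(name for name, text in configs.items() if lacks_bootdev(text))
```

Called with a mapping such as `{"raid1-a.cfg": text_a, "raid1-b.cfg": text_b}`, it returns only the files that set up software RAID without overriding the bootloader target.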

Mon, Feb 4, 9:16 PM · Patch-For-Review, Operations
CDanis updated the title for P8050 partman configs with patrman-auto-raid but without grub-installer/bootdev from Masterwork From Distant Lands to partman configs with patrman-auto-raid but without grub-installer/bootdev.
Mon, Feb 4, 9:13 PM
CDanis edited P8050 partman configs with patrman-auto-raid but without grub-installer/bootdev.
Mon, Feb 4, 9:12 PM
CDanis added a comment to T215033: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit).

So we just need a http check that checks the website (without checking if the ssl cert is valid or not as we have that check already) and also a load check to make sure the load doesn't go too high otherwise it could cause noticeable impact.

Mon, Feb 4, 7:57 PM · serviceops, Patch-For-Review, Icinga, monitoring, Operations, Gerrit
CDanis added a comment to T215183: Redundant bootloaders for software RAID.

I know very little about debian-installer, but here's a guess based on what I found in the puppet repo:

Mon, Feb 4, 6:51 PM · Patch-For-Review, Operations
CDanis added a comment to T215033: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit).

The timeouts on that check_ssl invocation make little sense to me -- a warning after 60 seconds, but critical after 30? Those seem backwards, and also too high by something like a factor of 4.
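
A small sanity check would catch inverted thresholds like these before they reach the monitoring config. This is only a sketch: `threshold_problems` is a hypothetical helper, not part of Icinga or any check plugin, and the optional ceiling is an arbitrary parameter.

```python
def threshold_problems(warning_s, critical_s, ceiling_s=None):
    """Return a list of human-readable problems with a pair of timeouts (seconds)."""
    problems = []
    if warning_s >= critical_s:
        # A warning that fires later than (or with) the critical can never be seen alone.
        problems.append(
            "warning (%ss) should be lower than critical (%ss)" % (warning_s, critical_s)
        )
    if ceiling_s is not None and max(warning_s, critical_s) > ceiling_s:
        problems.append("thresholds exceed the ceiling of %ss" % ceiling_s)
    return problems
```

With the values quoted above, `threshold_problems(60, 30)` reports the inversion, while a pair such as `(5, 10)` passes cleanly.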

Mon, Feb 4, 6:42 PM · serviceops, Patch-For-Review, Icinga, monitoring, Operations, Gerrit
CDanis added a comment to T215033: improve Gerrit monitoring (was: Investigate why icinga did not report high cpu/load for gerrit).

If the thing we want to monitor is "Gerrit is responding slowly / not at all", IMO that is the thing we should check. High CPU load is just one of the causes that could result in high latency.

Mon, Feb 4, 5:57 PM · serviceops, Patch-For-Review, Icinga, monitoring, Operations, Gerrit
CDanis reopened T213664: correctable memory errors db1068 (commons primary master database) as "Open".

Seems like it is happening again

Mon, Feb 4, 2:58 PM · Patch-For-Review, DBA, Operations

Fri, Jan 25

CDanis added a comment to T214529: EDAC events not being reported by node-exporter?.

I did a cumin run across the whole fleet to find hosts that have memory errors in their dmesg buffers, but haven't incremented counters beyond 0.
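
The per-host comparison can be sketched as follows. This is not the command that was run; it is a hypothetical Python version that assumes corrected errors appear in the kernel log as `EDAC MC<n>: <count> CE ...` lines and that the counters are exposed under `/sys/devices/system/edac/mc/mc*/ce_count`.

```python
import glob
import re

# Matches kernel lines such as "EDAC MC1: 1 CE memory scrubbing error on ..."
CE_LINE = re.compile(r"EDAC MC(\d+): (\d+) CE ")


def logged_ce_events(kernel_log):
    """Sum the corrected-error counts reported in kernel log text."""
    total = 0
    for line in kernel_log.splitlines():
        match = CE_LINE.search(line)
        if match:
            total += int(match.group(2))
    return total


def read_ce_counters(pattern="/sys/devices/system/edac/mc/mc*/ce_count"):
    """Read the EDAC corrected-error counters from sysfs (Linux only)."""
    counters = {}
    for path in glob.glob(pattern):
        with open(path) as f:
            counters[path] = int(f.read().strip())
    return counters


def counters_look_stuck(kernel_log, counters):
    """True if the log shows corrected errors but every counter is still 0."""
    return logged_ce_events(kernel_log) > 0 and sum(counters.values()) == 0
```

On a host, `counters_look_stuck(dmesg_text, read_ce_counters())` would flag the mismatch described here, where `dmesg_text` is the captured output of `dmesg`.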

Fri, Jan 25, 4:39 PM · Patch-For-Review, Operations, monitoring
CDanis created P8039 Hosts that have EDAC events reported in kernel logs in the last 30 days:.
Fri, Jan 25, 4:34 PM
CDanis edited P8037 cp5006 EDAC dmesg.
Fri, Jan 25, 3:10 PM
CDanis edited P8036 mw1302 EDAC dmesg.
Fri, Jan 25, 3:08 PM
CDanis edited P8035 cp4026 /var/log/mcelog.
Fri, Jan 25, 3:00 PM
CDanis added a comment to T178690: Better organization for SRE grafana dashboards.

Ah, forgot to update the task, but at the time @jcrespo and @fgiunchedi and I talked, and Jaime's biggest gripe was that iostat-reported "disk IO utilization" is not a very useful metric: it's the fraction of time that at least one outstanding iop was in the disk's queue. On a server that has any load at all, this metric will generally be "100%" all the time; what you actually care about are stats like "queue depth" and "request latencies".
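
To make the difference concrete, here is a sketch that derives all three numbers from two samples of a `/proc/diskstats` line. The function names are hypothetical and only the first eleven statistics fields are used (reads completed, reads merged, sectors read, ms reading, writes completed, writes merged, sectors written, ms writing, I/Os in progress, ms doing I/O, weighted ms doing I/O).

```python
def parse_diskstats_line(line):
    """Split one /proc/diskstats line into (device name, first 11 counters)."""
    fields = line.split()
    return fields[2], [int(value) for value in fields[3:14]]


def io_metrics(before, after, interval_ms):
    """Compute utilization, average queue depth and latency between two samples."""
    _, old = parse_diskstats_line(before)
    _, new = parse_diskstats_line(after)
    delta = [n - o for o, n in zip(old, new)]
    completed = delta[0] + delta[4]  # reads + writes completed
    return {
        # Fraction of the interval with at least one I/O outstanding.
        "utilization": delta[9] / interval_ms,
        # Time-weighted average number of I/Os in flight.
        "avg_queue_depth": delta[10] / interval_ms,
        # Mean time per completed request, in milliseconds.
        "avg_latency_ms": (delta[3] + delta[7]) / completed if completed else 0.0,
    }
```

Two disks can both show a utilization of 1.0 while one has an average queue depth of 1 and the other of 10, which is why the latter two fields say more about how loaded the device really is.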

Fri, Jan 25, 2:57 PM · User-CDanis, Patch-For-Review, User-fgiunchedi, monitoring, Operations

Thu, Jan 24

CDanis added a comment to T214529: EDAC events not being reported by node-exporter?.
cdanis@cp4026.ulsfo.wmnet ~ % edac-util -v  
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#1_Chan#0_DIMM#0: 0 Corrected Errors
mc0: csrow0: CPU_SrcID#0_Ha#1_Chan#1_DIMM#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#0_Chan#1_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#1_Chan#0_DIMM#1: 0 Corrected Errors
mc0: csrow1: CPU_SrcID#0_Ha#1_Chan#1_DIMM#1: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
mc1: csrow0: 0 Uncorrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#0_Chan#1_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#1_Chan#0_DIMM#0: 0 Corrected Errors
mc1: csrow0: CPU_SrcID#1_Ha#1_Chan#1_DIMM#0: 0 Corrected Errors
edac-util: No errors to report.
Thu, Jan 24, 6:59 PM · Patch-For-Review, Operations, monitoring

Wed, Jan 23

CDanis updated the task description for T214529: EDAC events not being reported by node-exporter?.
Wed, Jan 23, 11:04 PM · Patch-For-Review, Operations, monitoring
CDanis edited P8031 cp4026 EDAC dmesg.
Wed, Jan 23, 11:03 PM
CDanis added a comment to T183177: memory errors not showing in icinga.

@Dzahn investigation on the mystery of cp4026 ongoing in T214529

Wed, Jan 23, 10:59 PM · Traffic, Patch-For-Review, User-fgiunchedi, DC-Ops, Operations, monitoring
CDanis added projects to T214529: EDAC events not being reported by node-exporter?: monitoring, Operations.
Wed, Jan 23, 10:55 PM · Patch-For-Review, Operations, monitoring
CDanis added a comment to T214529: EDAC events not being reported by node-exporter?.

One more observation:

Wed, Jan 23, 10:50 PM · Patch-For-Review, Operations, monitoring
CDanis created T214529: EDAC events not being reported by node-exporter?.
Wed, Jan 23, 10:50 PM · Patch-For-Review, Operations, monitoring
CDanis closed T214090: Request to add new hire David Sharpe (Dsharpe) to wmf LDAP group as Resolved.

Not sure how I missed this, but fixed now.

Wed, Jan 23, 3:49 PM · Patch-For-Review, LDAP-Access-Requests
CDanis closed T214090: Request to add new hire David Sharpe (Dsharpe) to wmf LDAP group, a subtask of T213742: Onboarding David Sharpe to Security Team as Information Security Analyst, as Resolved.
Wed, Jan 23, 3:49 PM · Security-Team

Jan 21 2019

CDanis added a comment to T208524: RfC: Standards for external services in the Wikimedia infrastructure..

In T208524#4895821, @daniel wrote:
For number (3), I think the question is when and how, rather than if. One, three, five years?

I would ask "if". What benefits does doing that bring that outweigh breaking compatibility with low-traffic low-effort installations?

Jan 21 2019, 4:23 PM · serviceops, TechCom, TechCom-RFC

Jan 18 2019

srishakatux awarded T213780: Add Srishti to analytics-privatedata-users a Like token.
Jan 18 2019, 6:50 PM · Patch-For-Review, Operations, SRE-Access-Requests, Developer-Advocacy (Jan-Mar 2019), Documentation
CDanis triaged T214153: Fix node vs nodejs dependency issue as Normal priority.
Jan 18 2019, 4:11 PM · Reading-Infrastructure-Team-Backlog, Operations, Maps
CDanis moved T214130: Requesting access to production for dsharpe from Awaiting User Input to SRE Meeting Review Required on the SRE-Access-Requests board.
Jan 18 2019, 4:09 PM · SRE-Access-Requests, Operations
CDanis updated subscribers of T214130: Requesting access to production for dsharpe.

@faidon @mark Could one of you approve this request?

Jan 18 2019, 4:09 PM · SRE-Access-Requests, Operations
CDanis updated the task description for T214130: Requesting access to production for dsharpe.
Jan 18 2019, 3:37 PM · SRE-Access-Requests, Operations
CDanis moved T214130: Requesting access to production for dsharpe from Untriaged to Awaiting User Input on the SRE-Access-Requests board.
Jan 18 2019, 2:24 PM · SRE-Access-Requests, Operations
CDanis added a comment to T214130: Requesting access to production for dsharpe.

Hi David,

Jan 18 2019, 2:24 PM · SRE-Access-Requests, Operations
CDanis assigned T214130: Requesting access to production for dsharpe to Dsharpe.
Jan 18 2019, 2:22 PM · SRE-Access-Requests, Operations
CDanis updated the task description for T213780: Add Srishti to analytics-privatedata-users.
Jan 18 2019, 2:09 PM · Patch-For-Review, Operations, SRE-Access-Requests, Developer-Advocacy (Jan-Mar 2019), Documentation
CDanis closed T213780: Add Srishti to analytics-privatedata-users as Resolved.

You should be all set.

Jan 18 2019, 2:09 PM · Patch-For-Review, Operations, SRE-Access-Requests, Developer-Advocacy (Jan-Mar 2019), Documentation
CDanis closed T213780: Add Srishti to analytics-privatedata-users, a subtask of T195119: Create user feedback gadget for technical documentation on Wikitech pages, as Resolved.
Jan 18 2019, 2:09 PM · Developer-Advocacy (Jan-Mar 2019), Cloud-Services, Documentation
CDanis updated the task description for T213780: Add Srishti to analytics-privatedata-users.
Jan 18 2019, 1:55 PM · Patch-For-Review, Operations, SRE-Access-Requests, Developer-Advocacy (Jan-Mar 2019), Documentation
CDanis triaged T214079: cloudstore100{8,9} - Upgrade to 10GbE as Normal priority.
Jan 18 2019, 12:34 AM · Patch-For-Review, ops-eqiad, Operations

Jan 17 2019

CDanis closed T214090: Request to add new hire David Sharpe (Dsharpe) to wmf LDAP group as Resolved.

https://tools.wmflabs.org/ldap/user/dsharpe

Jan 17 2019, 10:33 PM · Patch-For-Review, LDAP-Access-Requests
CDanis closed T214090: Request to add new hire David Sharpe (Dsharpe) to wmf LDAP group, a subtask of T213742: Onboarding David Sharpe to Security Team as Information Security Analyst, as Resolved.
Jan 17 2019, 10:33 PM · Security-Team
CDanis edited P8003 conftool/pybal view of parsoid backends.
Jan 17 2019, 8:52 PM
CDanis added a comment to T214077: Create discourse-test mailing list.

It looks like we've been down this road before...?

Jan 17 2019, 8:00 PM · Discourse, Operations, Wikimedia-Mailing-lists
CDanis triaged T214073: Fix maps puppet to make sure apt-get update runs after configuration change as Normal priority.
Jan 17 2019, 7:57 PM · Patch-For-Review, Puppet, Operations, Discovery-Search, Maps
CDanis triaged T214072: dns200[12] lack IPv6 records as Normal priority.
Jan 17 2019, 6:54 PM · Patch-For-Review, monitoring, Traffic, Pybal, Operations
CDanis updated the task description for T213780: Add Srishti to analytics-privatedata-users.
Jan 17 2019, 6:48 PM · Patch-For-Review, Operations, SRE-Access-Requests, Developer-Advocacy (Jan-Mar 2019), Documentation
CDanis claimed T213780: Add Srishti to analytics-privatedata-users.
Jan 17 2019, 6:48 PM · Patch-For-Review, Operations, SRE-Access-Requests, Developer-Advocacy (Jan-Mar 2019), Documentation
CDanis reassigned T214059: request of a new mailing list WIKI-BNCF from CDanis to Giaccai.

WIKI-BNCF list is now live.

Jan 17 2019, 6:42 PM · Operations, Wikimedia-Mailing-lists
CDanis claimed T214059: request of a new mailing list WIKI-BNCF.
Jan 17 2019, 6:18 PM · Operations, Wikimedia-Mailing-lists
CDanis triaged T214031: Investigate missing WikibaseQualityConstraints logs in logstash. as Normal priority.
Jan 17 2019, 4:51 PM · User-Addshore, Operations, Wikimedia-Logstash
CDanis triaged T213996: New MongoDB version is not DFSG-compatible, dropped by Debian as Normal priority.
Jan 17 2019, 4:51 PM · Performance-Team (Radar), VisualEditor, Software-Licensing, Operations
CDanis triaged T214024: Two test hosts for SREs as Normal priority.
Jan 17 2019, 4:50 PM · Operations, hardware-requests
CDanis added a comment to T213780: Add Srishti to analytics-privatedata-users.

Hi Srishti,

Jan 17 2019, 4:49 PM · Patch-For-Review, Operations, SRE-Access-Requests, Developer-Advocacy (Jan-Mar 2019), Documentation

Jan 16 2019

CDanis triaged T213971: Explicitly mention npm in L3 as Normal priority.
Jan 16 2019, 9:54 PM · Legalpad, Operations
CDanis assigned T213928: Reset Wikitech 2FA access for Matthias_Geisler_WMDE to Addshore.
Jan 16 2019, 3:06 PM · User-Addshore, Operations
CDanis added a comment to T213859: eqiad: rack a3 pdu swap / failure / replacement.

It's fine to simply shut down prometheus1003. We have a redundant machine prometheus1004 which will continue gathering metrics and answering queries. prometheus1003 will have a gap in its data afterwards but that can't be helped.

Jan 16 2019, 3:21 AM · Patch-For-Review, ops-eqiad, Operations

Jan 15 2019

CDanis added a comment to T213856: Degraded RAID on ms-be1016.

Dzahn, can you assign a priority for this ticket? Is 'normal' appropriate for Swift backend hosts?

Jan 15 2019, 8:58 PM · media-storage, ops-eqiad, Operations
CDanis updated the task description for T213780: Add Srishti to analytics-privatedata-users.
Jan 15 2019, 5:45 PM · Patch-For-Review, Operations, SRE-Access-Requests, Developer-Advocacy (Jan-Mar 2019), Documentation
CDanis closed T213768: LDAP nda access for RyanSteinberg as Resolved.

ryanmax is now a member of nda.

Jan 15 2019, 4:54 PM · LDAP-Access-Requests
CDanis closed T213812: Add `holger` to `wmf` LDAF group as Resolved.

Needing to do Core Platform dev work is sufficient justification, of course. Thanks and sorry for the trouble.

Jan 15 2019, 4:37 PM · Patch-For-Review, LDAP-Access-Requests