Dzahn (Daniel Zahn)
Operations Engineer at WMF

Projects (15)

User Details

User Since
Sep 30 2014, 4:39 PM (134 w, 5 d)
Availability
Available
IRC Nick
mutante
LDAP User
Dzahn
MediaWiki User
Unknown

Recent Activity

Yesterday

Dzahn added a comment to T164060: Request to add phuedx to "researchers" group.

Hi, could you add some detail on what this is for? What is SWAP, please, and what do you need access to stat1002 for? Are you sure it's stat1002 and not stat1003? The requested group "researchers" is described as "Access to stat1003 and the credentials for the MariaDB slaves in /etc/mysql/conf.d/research-client.cnf." on https://wikitech.wikimedia.org/wiki/Analytics/Data_access and the word "SWAP" doesn't seem to appear there at all. Thanks!

Sat, Apr 29, 1:40 AM · Operations, Ops-Access-Requests
Dzahn added a comment to T161158: Degraded RAID on ocg1001.

It would be nice to keep this open until ocg1001 is actually back in service.

Sat, Apr 29, 1:21 AM · ops-eqiad, Operations

Fri, Apr 28

Dzahn added a comment to T161158: Degraded RAID on ocg1001.

@Cmjohnson Yes, that is what @Volans asked about on T161158#3219587. Can you apply the fix on T150160 please?

Fri, Apr 28, 5:30 PM · ops-eqiad, Operations
Dzahn reassigned T162928: decommision nembus from Papaul to RobH.
Fri, Apr 28, 5:28 PM · Patch-For-Review, ops-codfw, hardware-requests, Operations
Dzahn edited the description of T162928: decommision nembus.
Fri, Apr 28, 5:27 PM · Patch-For-Review, ops-codfw, hardware-requests, Operations
Dzahn added a comment to T162928: decommision nembus.

@RobH please remove switch config (asw-b-codfw:ge-5/0/12)

Fri, Apr 28, 5:27 PM · Patch-For-Review, ops-codfw, hardware-requests, Operations
Dzahn added a comment to T164011: codfw: ganeti2007-ganeti2008 racking and onsite setup task.

2007 and 2008 should go into A5 then (still has to happen)

Fri, Apr 28, 5:24 PM · ops-codfw, Operations
Dzahn added a comment to T162928: decommision nembus.

removed from rack

Fri, Apr 28, 4:58 PM · Patch-For-Review, ops-codfw, hardware-requests, Operations
Dzahn added a comment to T164011: codfw: ganeti2007-ganeti2008 racking and onsite setup task.

ganeti2006 has been moved to A4 @ 17. switch: asw-a4-codfw: port ge-4/0/17 please configure switch

Fri, Apr 28, 4:51 PM · ops-codfw, Operations
Dzahn added a comment to T164011: codfw: ganeti2007-ganeti2008 racking and onsite setup task.

ganeti2005 has been moved to A4 @ 16. switch: asw-a4-codfw: port ge-4/0/16 please configure switch

Fri, Apr 28, 4:30 PM · ops-codfw, Operations
Dzahn added a comment to T164011: codfw: ganeti2007-ganeti2008 racking and onsite setup task.

We have lots of room in A2 and A4 and we can move into A4, but we can't move into A2 because there is a 10G switch and the servers only have 1G NICs (they would need an adapter).

Fri, Apr 28, 4:11 PM · ops-codfw, Operations
Dzahn added a comment to T149006: elastic2020 is powered off and does not want to restart.

mainboard is being replaced right now

Fri, Apr 28, 3:47 PM · Patch-For-Review, Discovery-Search (Current work), Discovery, ops-codfw, DC-Ops, Operations, Elasticsearch
Dzahn added a comment to T147905: investigate lead hardware issue.

Heh, I was hoping T162850 would have solved it. It's a bit concerning that an R420 (it is indeed an R420) has possibly exhibited the same symptoms. The box will be 3 years old next Monday (May 1st). Apart from looking at the thermal paste and trying to figure out if it has enough, I honestly can't think of a good enough way to get a "YES" or "NO" that does not involve powering on the box and observing it (and potentially doing the stuff in T162850).

Fri, Apr 28, 1:46 PM · Operations, ops-eqiad

Thu, Apr 27

Dzahn added a comment to T147905: investigate lead hardware issue.

This sounds similar to the other tickets linked to "tracking" task T162850. We have observed downthrottling to 200MHz on other servers before. The interesting part is that they were all R320s while lead is an R420 (or racktables says so, is it really?). It would be unfortunate if that means more models are potentially affected by this bug.

Thu, Apr 27, 10:58 PM · Operations, ops-eqiad
Dzahn added a subtask for T162850: acpi_pad issues: T147905: investigate lead hardware issue.
Thu, Apr 27, 10:54 PM · Patch-For-Review, Operations
Dzahn added a parent task for T147905: investigate lead hardware issue: T162850: acpi_pad issues.
Thu, Apr 27, 10:54 PM · Operations, ops-eqiad
Dzahn added a comment to T163870: Remove account creation throttling for JESI on 2017-04-27.

Thanks for deploying that :)

Thu, Apr 27, 10:48 PM · User-Urbanecm, Patch-For-Review, Wikimedia-Site-requests
Dzahn added a comment to T161158: Degraded RAID on ocg1001.

@Cmjohnson it's still ok to take it down, it's not getting traffic. i'll wait for the IPMI issue before putting it back to work.

Thu, Apr 27, 10:44 PM · ops-eqiad, Operations
Dzahn added a comment to T161158: Degraded RAID on ocg1001.

looks like it's ok.

Thu, Apr 27, 10:41 PM · ops-eqiad, Operations
Dzahn added a comment to T161158: Degraded RAID on ocg1001.

it has been reinstalled and re-added to puppet and salt, but i saw errors about failed deployment of the ocg app, so i didn't repool it yet... then after being afk for a while i came back and looked at it... and now ocg is running and puppet runs without errors. who fixed it? :)

Thu, Apr 27, 10:34 PM · ops-eqiad, Operations
Dzahn added a comment to T162952: decom barium.frack.eqiad.wmnet.

added decom checklist template from https://wikitech.wikimedia.org/wiki/Server_Lifecycle/reclaim_checklist

Thu, Apr 27, 7:47 PM · Patch-For-Review, ops-eqiad, Operations, fundraising-tech-ops, DC-Ops
Dzahn edited the description of T162952: decom barium.frack.eqiad.wmnet.
Thu, Apr 27, 7:46 PM · Patch-For-Review, ops-eqiad, Operations, fundraising-tech-ops, DC-Ops
Dzahn claimed T161158: Degraded RAID on ocg1001.

Thanks, i will do a reinstall later today.

Thu, Apr 27, 6:04 PM · ops-eqiad, Operations
Dzahn claimed T159756: setup netmon1002.wikimedia.org.

cool, thank you. taking it back

Thu, Apr 27, 6:03 PM · Patch-For-Review, Operations
Dzahn renamed T160947: wikistats: add new wikipedias: kbp, khw, dty and pt.wikimedia from "wikistats: add new wikipedias: kbp, khw, dty" to "wikistats: add new wikipedias: kbp, khw, dty and pt.wikimedia".
Thu, Apr 27, 3:59 PM · Labs-project-Wikistats
Dzahn added a comment to T163939: Enable keyholder for ORES deployments.

You should not need the passphrase since the key is already loaded:

Thu, Apr 27, 3:57 PM · Operations, Deployment-Systems, Ops-Access-Requests
Dzahn updated subscribers of T163939: Enable keyholder for ORES deployments.
Thu, Apr 27, 4:10 AM · Operations, Deployment-Systems, Ops-Access-Requests
Dzahn added a project to T163960: phab1001 hdd port a failure: Phabricator.
Thu, Apr 27, 3:57 AM · Phabricator, ops-eqiad, Operations

Wed, Apr 26

Dzahn added a comment to T163939: Enable keyholder for ORES deployments.

I think what needs to happen here is adding a new "identity" for ORES deployments.

Wed, Apr 26, 8:25 PM · Operations, Deployment-Systems, Ops-Access-Requests
Dzahn placed T163870: Remove account creation throttling for JESI on 2017-04-27 up for grabs.

I'm uploading the change but i'm not going to be able to deploy it and i'm not in the European time zone. I'm hoping that somebody in Europe will take it or that it will be deployed in a SWAT window.

Wed, Apr 26, 5:17 AM · User-Urbanecm, Patch-For-Review, Wikimedia-Site-requests
Dzahn triaged T163870: Remove account creation throttling for JESI on 2017-04-27 as "High" priority.
Wed, Apr 26, 4:12 AM · User-Urbanecm, Patch-For-Review, Wikimedia-Site-requests
Dzahn edited the description of T161529: Create Wikipedia Doteli.
Wed, Apr 26, 3:31 AM · Patch-For-Review, Wikimedia-Site-requests
Dzahn added a comment to T161529: Create Wikipedia Doteli.

added to wikistats.wmflabs.org (T160947#3212839)

Wed, Apr 26, 3:31 AM · Patch-For-Review, Wikimedia-Site-requests
Dzahn added a comment to T160947: wikistats: add new wikipedias: kbp, khw, dty and pt.wikimedia.

http://wikistats.wmflabs.org/display.php?t=wp
http://wikistats.wmflabs.org/detail.php?t=wp&id=317

Wed, Apr 26, 3:30 AM · Labs-project-Wikistats
Dzahn added a comment to T160947: wikistats: add new wikipedias: kbp, khw, dty and pt.wikimedia.

added "dty" since it exists now (the other 2 are still incubator redirects as of today)

Wed, Apr 26, 3:26 AM · Labs-project-Wikistats

Tue, Apr 25

Dzahn changed the status of T146746: investigate shared inbox options from "Open" to "Stalled".
Tue, Apr 25, 11:57 PM · Operations
Dzahn closed T163476: ircecho - /etc/default/ircecho puppet issue as "Invalid".
Tue, Apr 25, 11:57 PM · Patch-For-Review, Operations
Dzahn added a comment to T163476: ircecho - /etc/default/ircecho puppet issue.

This was actually invalid and i was simply confused by T163324 and still looked on einsteinium instead of tegmen.

Tue, Apr 25, 11:56 PM · Patch-For-Review, Operations
Dzahn closed T163568: Production Shell access denied (update SSH key for jmorgan) as "Resolved".

We talked on IRC. It works again.

Tue, Apr 25, 7:21 PM · Patch-For-Review, Operations
Dzahn added a comment to T163568: Production Shell access denied (update SSH key for jmorgan).

@Capt_Swing Your key has been replaced on stat1003 and bast1001 (puppet will update the other bastions soon). It should work again now.

Tue, Apr 25, 7:09 PM · Patch-For-Review, Operations
Dzahn renamed T163568: Production Shell access denied (update SSH key for jmorgan) from "Production Shell access denied" to "Production Shell access denied (update SSH key for jmorgan)".
Tue, Apr 25, 7:08 PM · Patch-For-Review, Operations
Dzahn awarded T163185: bouncycastle information disclosure [DSA 3829-1] [CVE-2015-6644] (and make Gerrit use Debian package) a Barnstar token.
Tue, Apr 25, 6:23 PM · Gerrit, Security
Dzahn added a comment to T162952: decom barium.frack.eqiad.wmnet.

@RobH @Jgreen see DNS change above. in this case it removes both main IP and mgmt at once (i realize we normally do these separately in non-fr prod), but it keeps the option to reach it per asset tag as wmf4076.mgmt.eqiad.wmnet.

Tue, Apr 25, 3:41 AM · Patch-For-Review, ops-eqiad, Operations, fundraising-tech-ops, DC-Ops
Dzahn reassigned T159756: setup netmon1002.wikimedia.org from Dzahn to Cmjohnson.
Tue, Apr 25, 1:36 AM · Patch-For-Review, Operations
Dzahn added a comment to T159756: setup netmon1002.wikimedia.org.

@Cmjohnson I fixed the typo above in mgmt DNS entry, but i still can't get on mgmt console after that.

Tue, Apr 25, 1:36 AM · Patch-For-Review, Operations
Dzahn reassigned T161158: Degraded RAID on ocg1001 from Dzahn to Cmjohnson.

Any other disk to try? can we replace sda one more time?

Tue, Apr 25, 1:09 AM · ops-eqiad, Operations
Dzahn added a comment to T161158: Degraded RAID on ocg1001.

@Cmjohnson Somehow the new /dev/sda also seems to be broken. Maybe it was used in something else before? Or it was this disk that was broken the whole time and we replaced the wrong one? I dunno, but from /var/log/syslog in installer shell this looks pretty much like broken hardware (again?/still?).

Tue, Apr 25, 1:08 AM · ops-eqiad, Operations
Dzahn added a comment to T161158: Degraded RAID on ocg1001.

I changed the boot order in BIOS (port A was still first, switched to port B), did not change the error. Still "during read on /dev/sda" at partitioning step.

Tue, Apr 25, 12:53 AM · ops-eqiad, Operations
Dzahn added a comment to T161158: Degraded RAID on ocg1001.

@Cmjohnson I attempted a reinstall but it consistently fails at the partitioning step with:

Tue, Apr 25, 12:43 AM · ops-eqiad, Operations
Dzahn claimed T159756: setup netmon1002.wikimedia.org.
Tue, Apr 25, 12:12 AM · Patch-For-Review, Operations
Dzahn edited the description of T159756: setup netmon1002.wikimedia.org.
Tue, Apr 25, 12:11 AM · Patch-For-Review, Operations

Mon, Apr 24

Dzahn added a comment to T163743: New ganeti VM for MW release pipeline work.

Should this exist in both DCs? One in eqiad and one in codfw, as is the default nowadays?

Mon, Apr 24, 9:27 PM · Operations, Security-General, Release-Engineering-Team, vm-requests
Dzahn added a comment to T161158: Degraded RAID on ocg1001.

I'll take a look at it today. Pretty sure we can just reinstall this.

Mon, Apr 24, 8:17 PM · ops-eqiad, Operations
Dzahn claimed T161158: Degraded RAID on ocg1001.
Mon, Apr 24, 8:16 PM · ops-eqiad, Operations
Dzahn changed the status of T159756: setup netmon1002.wikimedia.org from "Stalled" to "Open".
Mon, Apr 24, 7:39 PM · Patch-For-Review, Operations
Dzahn changed the status of T159756: setup netmon1002.wikimedia.org, a subtask of T156040: hardware request for netmon1001 replacement, from "Stalled" to "Open".
Mon, Apr 24, 7:39 PM · hardware-requests, Operations

Fri, Apr 21

Dzahn updated subscribers of T163432: Access request to Icinga control panel to acknowledge Performance alerts.

@Gilles @aaron @Peter @Krinkle see change above, it should give you the requested permissions. feel free to try it. when logging in at Icinga make sure you use the lower case version please. There is a caveat that the LDAP login accepts both upper case and lower case but for Icinga to get the permission part for you to run commands, it must match the Icinga contact. Go to the service (you are a contact for) and try "send acknowledgement" or "schedule downtime" or other commands from the drop down menu. Let me know if it works. If there are any other services you want to be able to control then we'll just have to add the contact group now.
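The case-sensitivity caveat above can be illustrated with a minimal sketch. This is not Icinga or LDAP code: the function names and the contact set are hypothetical, and the only assumption taken from the comment is that the login check ignores case while the contact lookup does not.

```python
from typing import Optional, Set


def ldap_accepts(login: str, ldap_user: str) -> bool:
    """Model of the LDAP front end: the comparison ignores case (assumption from the comment)."""
    return login.lower() == ldap_user.lower()


def matching_contact(login: str, contacts: Set[str]) -> Optional[str]:
    """Model of the Icinga side: the login must match a contact name exactly."""
    return login if login in contacts else None


# Hypothetical contact names, stored in lower case as the comment recommends.
contacts = {"gilles", "aaron", "peter", "krinkle"}

for login in ("gilles", "Gilles"):
    can_log_in = ldap_accepts(login, "gilles")
    can_run_commands = matching_contact(login, contacts) is not None
    print(login, "login ok:", can_log_in, "commands allowed:", can_run_commands)
```

In this model both spellings pass the login step, but only the lower-case one maps to a contact and can therefore run commands.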

Fri, Apr 21, 2:50 AM · Patch-For-Review, Performance-Team, Operations
Dzahn awarded T163368: Revisit paging strategy for frack servers a Like token.
Fri, Apr 21, 12:12 AM · Patch-For-Review, fundraising-tech-ops, Operations

Thu, Apr 20

Dzahn added a comment to T163432: Access request to Icinga control panel to acknowledge Performance alerts.

@Gilles alright. thanks. so first step is we need to create Icinga users ("contacts") for you. Krinkle has one but not the other 3 of you. And they have to match the LDAP users. The login you use to see Icinga is just LDAP auth in front of it, but from Icinga's point of view you are not users yet because you are not "contacts" for anything. It automatically gives permissions to do these things for hosts and services that a user is a contact for. The file with the Icinga contacts is in a private repo (because some contacts have phone numbers). So i'm doing that there now. i'll use your standard wikimedia.org email addresses but nothing will happen yet. Then i'll create a contactgroup for performance that has your users as members. Finally we'll add that contactgroup to the relevant service(s) via puppet and it will do 2 things at once that are directly connected. You will get email notifications about the service and you will get the right to execute commands for it (ACK it, schedule downtime, enable/disable notifications) which is what you are requesting. (Unless you really don't want email notifications).
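A minimal sketch of the relationship described above, written as a plain Python model rather than real Icinga or puppet configuration. The class names, the service name and the member list are hypothetical; the point is only that one contactgroup membership drives both notifications and command rights.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ContactGroup:
    name: str
    members: set[str] = field(default_factory=set)


@dataclass
class Service:
    description: str
    contact_groups: list[ContactGroup] = field(default_factory=list)

    def contacts(self) -> set[str]:
        """All users that are a contact for this service via any of its groups."""
        result: set[str] = set()
        for group in self.contact_groups:
            result |= group.members
        return result


def gets_notifications(user: str, service: Service) -> bool:
    return user in service.contacts()


def may_run_commands(user: str, service: Service) -> bool:
    # Same condition as notifications: both follow from being a contact.
    return user in service.contacts()


# Hypothetical data: a "performance" group attached to an example service.
performance = ContactGroup("performance", {"gilles", "aaron", "peter", "krinkle"})
service = Service("example-service", [performance])

print(may_run_commands("gilles", service), gets_notifications("gilles", service))
print(may_run_commands("someone-else", service))
```

A user who passes the LDAP login but is in no contactgroup for the service gets neither notifications nor command rights in this model.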

Thu, Apr 20, 10:08 PM · Patch-For-Review, Performance-Team, Operations
Dzahn added a comment to T163368: Revisit paging strategy for frack servers.
12:09 -!- icinga-wm [~icinga-wm@tegmen.wikimedia.org] has joined #wikimedia-fundraising
12:11 < icinga-wm> test for T163368
Thu, Apr 20, 7:20 PM · Patch-For-Review, fundraising-tech-ops, Operations
Dzahn added a comment to T163432: Access request to Icinga control panel to acknowledge Performance alerts.

@Gilles could you please define who exactly "we" is in this request? Ideally as a list of LDAP/wikitech user names (what you see after "Logged in as" when on Icinga). Thanks

Thu, Apr 20, 6:46 PM · Patch-For-Review, Performance-Team, Operations
Dzahn added a comment to T163432: Access request to Icinga control panel to acknowledge Performance alerts.

@Gilles I had already ACKed that and created T163408. The issue is that it recovered about 3 hours ago and became CRIT again. the status change removes the ACK. I just re-added it.

Thu, Apr 20, 6:45 PM · Patch-For-Review, Performance-Team, Operations
Dzahn created T163476: ircecho - /etc/default/ircecho puppet issue.
Thu, Apr 20, 6:05 PM · Patch-For-Review, Operations
Dzahn added a comment to T163368: Revisit paging strategy for frack servers.

The way the custom IRC notifications work:

Thu, Apr 20, 5:18 PM · Patch-For-Review, fundraising-tech-ops, Operations
Dzahn added a comment to T163368: Revisit paging strategy for frack servers.

So as of today if "sms" does not show up in contact_groups for a host or service, individual Ops don't get email or sms notification. If that's correct, we're much closer than I initially thought.
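The rule stated above can be written down as a small check. This is a sketch of the rule as the comment phrases it, not of the actual puppet or Icinga setup: the function name is hypothetical, the group names other than "sms" are made up, and it assumes contact_groups is a comma-separated string.

```python
def ops_individually_notified(contact_groups: str) -> bool:
    """Return True if the "sms" group is listed in a comma-separated contact_groups value."""
    groups = [g.strip() for g in contact_groups.split(",") if g.strip()]
    return "sms" in groups


# Hypothetical contact_groups values for three hosts or services.
for value in ("admins,sms", "admins", "fr-tech, sms-backup"):
    print(repr(value), "->", ops_individually_notified(value))
```

The split into exact group names matters: a substring test would wrongly treat a group such as "sms-backup" as a match.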

Thu, Apr 20, 4:32 PM · Patch-For-Review, fundraising-tech-ops, Operations
Dzahn added a comment to T163368: Revisit paging strategy for frack servers.

my suggestion would be:

Thu, Apr 20, 4:39 AM · Patch-For-Review, fundraising-tech-ops, Operations
Dzahn added a comment to T163368: Revisit paging strategy for frack servers.

relevant puppet code:

Thu, Apr 20, 3:48 AM · Patch-For-Review, fundraising-tech-ops, Operations
Dzahn claimed T149557: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*).
Thu, Apr 20, 2:51 AM · Patch-For-Review, vm-requests, Operations
Dzahn claimed T133110: Check for an oversized exim4 queue indicating mail delivery failures.
Thu, Apr 20, 2:50 AM · Monitoring, Operations
Dzahn closed T163220: Add Icinga check for CPU frequency on Dell R320 as "Resolved".

Thu, Apr 20, 1:45 AM · Patch-For-Review, Monitoring, Operations
Dzahn closed T163220: Add Icinga check for CPU frequency on Dell R320, a subtask of T162850: acpi_pad issues, as "Resolved".
Thu, Apr 20, 1:45 AM · Patch-For-Review, Operations
Dzahn added a project to T163408: (icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer : Monitoring.
Thu, Apr 20, 1:32 AM · Performance-Team, Monitoring, Operations
Dzahn added a comment to T163220: Add Icinga check for CPU frequency on Dell R320.

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=CPU+Freq

Thu, Apr 20, 1:15 AM · Patch-For-Review, Monitoring, Operations
Dzahn edited the description of T163408: (icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer .
Thu, Apr 20, 1:13 AM · Performance-Team, Monitoring, Operations
Dzahn renamed T163408: (icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer from "icinga/grafana: webpagetest-alerts is alerting: Desktop Internet Explorer render issues" to "(icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer ".
Thu, Apr 20, 1:12 AM · Performance-Team, Monitoring, Operations
Dzahn created T163408: (icinga/grafana) webpagetest-alerts: slow page rendering for Internet Explorer .
Thu, Apr 20, 1:09 AM · Performance-Team, Monitoring, Operations
Dzahn created T163405: appserver fatals - intermittent failed connections to rdb2005.
Thu, Apr 20, 12:23 AM · Operations

Wed, Apr 19

Dzahn closed T155180: codfw: mw2251-mw2260 rack/setup as "Resolved".

closing this again to handle mw2256 in a subtask (please continue on T163346)

Wed, Apr 19, 4:11 PM · Patch-For-Review, User-Elukey, Operations, ops-codfw
Dzahn created T163346: mw2256 - hardware issue.
Wed, Apr 19, 4:10 PM · Operations, ops-codfw
Dzahn added a comment to T155180: codfw: mw2251-mw2260 rack/setup.

@Papaul just the kernel panic PANIC: double fault, error_code: 0x0 and this during boot:

Wed, Apr 19, 3:53 PM · Patch-For-Review, User-Elukey, Operations, ops-codfw
Dzahn added a comment to T155180: codfw: mw2251-mw2260 rack/setup.

mw2256 died again

Wed, Apr 19, 3:39 PM · Patch-For-Review, User-Elukey, Operations, ops-codfw
Dzahn reopened T155180: codfw: mw2251-mw2260 rack/setup as "Open".
Wed, Apr 19, 3:37 PM · Patch-For-Review, User-Elukey, Operations, ops-codfw
Dzahn added a comment to T163286: Tegmen: process spawn loop + failed icinga + failing puppet.

Puppet was not running because of an Icinga configuration error

puppet runs alright now, no errors

Wed, Apr 19, 4:30 AM · Patch-For-Review, Monitoring, Operations
Dzahn removed a project from T162900: setup naos/WMF6406 as new codfw deployment server: Patch-For-Review.
Wed, Apr 19, 2:08 AM · ops-codfw, Operations
Dzahn edited the description of T162900: setup naos/WMF6406 as new codfw deployment server.
Wed, Apr 19, 2:06 AM · ops-codfw, Operations
Dzahn edited the description of T162900: setup naos/WMF6406 as new codfw deployment server.
Wed, Apr 19, 2:02 AM · ops-codfw, Operations
Dzahn edited the description of T162900: setup naos/WMF6406 as new codfw deployment server.
Wed, Apr 19, 2:00 AM · ops-codfw, Operations
Dzahn edited the description of T162900: setup naos/WMF6406 as new codfw deployment server.
Wed, Apr 19, 1:59 AM · ops-codfw, Operations
Dzahn removed a project from T119165: l10nupdate user uid mismatch between tin and mira: Patch-For-Review.
Wed, Apr 19, 1:40 AM · Operations, Deployment-Systems
Dzahn added a comment to T119165: l10nupdate user uid mismatch between tin and mira.

now: https://gerrit.wikimedia.org/r/#/c/348884/

Wed, Apr 19, 1:40 AM · Operations, Deployment-Systems
Dzahn added a comment to T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet .

@GWicke @Eevans thanks for the explanation (and i just saw you removing it and Icinga alerts followed by recoveries. looks good)

Wed, Apr 19, 1:38 AM · Patch-For-Review, ops-eqiad, Operations, Cassandra, Services (doing)
Dzahn added a project to T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet : ops-eqiad.
Wed, Apr 19, 1:26 AM · Patch-For-Review, ops-eqiad, Operations, Cassandra, Services (doing)
Dzahn added a comment to T163292: Failed disk / degraded RAID arrays: restbase1018.eqiad.wmnet .
Since these instances have already been down for some time, and no ETA for repair/replacement yet exists,
Wed, Apr 19, 1:25 AM · Patch-For-Review, ops-eqiad, Operations, Cassandra, Services (doing)
Dzahn added a comment to T162900: setup naos/WMF6406 as new codfw deployment server.
  • backups: confirmed with bconsole that naos now exists in Bacula with the same backup sets (/home and /srv/deployment are backed up on deployment servers)
  • UID issue: see this https://gerrit.wikimedia.org/r/#/c/348884/
Wed, Apr 19, 12:51 AM · ops-codfw, Operations

Tue, Apr 18

Dzahn edited the description of T162900: setup naos/WMF6406 as new codfw deployment server.
Tue, Apr 18, 11:47 PM · ops-codfw, Operations
Dzahn added a comment to T119915: Create response time monitoring for WDQS endpoint.

The Icinga/graphite check "Response time for WDQS" is in status "UNKNOWN" because there are "No valid datapoints found".

Tue, Apr 18, 11:43 PM · Discovery-Wikidata-Query-Service-Sprint, Patch-For-Review, Monitoring, Discovery, Operations, Wikidata-Query-Service, Wikidata
Dzahn reopened T119915: Create response time monitoring for WDQS endpoint as "Open".
Tue, Apr 18, 11:41 PM · Discovery-Wikidata-Query-Service-Sprint, Patch-For-Review, Monitoring, Discovery, Operations, Wikidata-Query-Service, Wikidata
Dzahn added a comment to T163158: acpi_pad consuming 100% CPU on tin.

The "Improperly owned -0:0- files in /srv/mediawiki-staging" Icinga check was failing on tin, caused by a timeout of completing the check in time.

Tue, Apr 18, 11:38 PM · Operations
Dzahn added a comment to T163280: Degraded RAID on restbase1018.

16:05 < icinga-wm> PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
16:05 < icinga-wm> PROBLEM - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
16:05 < icinga-wm> PROBLEM - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
16:05 < icinga-wm> PROBLEM - cassandra-a service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed
...

Tue, Apr 18, 11:35 PM · Operations
Dzahn added a comment to T163278: Four different PHP/HHVM versions on the cluster.

On any other day i would probably just do that since they are debug hosts. Though right now might be a bad moment. The "is running on the canaries" should cover mwdebug* though, shouldn't it?

Tue, Apr 18, 11:21 PM · Operations