CDanis (Chris Danis)
SRE

User Details

User Since
Nov 5 2018, 2:54 PM (28 w, 3 d)
Availability
Available
IRC Nick
cdanis
LDAP User
CDanis
MediaWiki User
CDanis (WMF)

Recent Activity

Today

CDanis created T224236: include the 'Server:' response header in varnishkafka.
Thu, May 23, 4:31 PM · Patch-For-Review, Traffic, Analytics, Operations

Tue, May 21

CDanis added a comment to T223952: Increased instability in MediaWiki backends (according to load balancers).

We saw one of these events at 14:48 today and pybal reported fetch failures for -- and wanted to depool -- basically the entire appserver fleet https://phabricator.wikimedia.org/P8551

Tue, May 21, 3:01 PM · User-Marostegui, Performance-Team, HHVM, serviceops, Operations
CDanis updated the title for P8551 fgrep 'May 21 14:48' /var/log/pybal.log | grep -i 'fetch failed' | egrep -o 'WARN: (mw.*wmnet)' | cut -f2 -d' ' | sort | uniq -c | sort -gr | phaste from fgrep 14:48 /var/log/pybal.log | grep -i 'fetch failed' | egrep -o 'WARN: (mw.*wmnet)' | cut -f2 -d' ' | sort | uniq -c | sort -gr | phaste to fgrep 'May 21 14:48' /var/log/pybal.log | grep -i 'fetch failed' | egrep -o 'WARN: (mw.*wmnet)' | cut -f2 -d' ' | sort | uniq -c | sort -gr | phaste.
Tue, May 21, 2:59 PM
CDanis updated the title for P8551 fgrep 'May 21 14:48' /var/log/pybal.log | grep -i 'fetch failed' | egrep -o 'WARN: (mw.*wmnet)' | cut -f2 -d' ' | sort | uniq -c | sort -gr | phaste from Masterwork From Distant Lands to fgrep 14:48 /var/log/pybal.log | grep -i 'fetch failed' | egrep -o 'WARN: (mw.*wmnet)' | cut -f2 -d' ' | sort | uniq -c | sort -gr | phaste.
Tue, May 21, 2:57 PM
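
The P8551 title above is a shell pipeline that tallies pybal fetch failures per backend host. A rough Python equivalent might look like the sketch below. The function name and the sample log lines are hypothetical; the samples only mimic the `WARN: <host>` shape the pipeline's regex expects and are not real pybal output.

```python
import re
from collections import Counter

# Same pattern as the egrep stage; the capture group starts at the hostname.
HOST_RE = re.compile(r"WARN: (mw.*wmnet)")

def count_fetch_failures(lines, timestamp):
    """Count 'fetch failed' warnings per host among lines containing timestamp."""
    counts = Counter()
    for line in lines:
        if timestamp not in line:                # fgrep 'May 21 14:48'
            continue
        if "fetch failed" not in line.lower():   # grep -i 'fetch failed'
            continue
        match = HOST_RE.search(line)             # egrep -o 'WARN: (mw.*wmnet)'
        if match:
            # cut -f2 -d' ' keeps only the text up to the next space.
            host = match.group(1).split(" ")[0]
            counts[host] += 1
    # sort | uniq -c | sort -gr: highest counts first.
    return counts.most_common()

# Hypothetical log lines, for illustration only.
SAMPLE_LINES = [
    "May 21 14:48:03 WARN: mw1261.eqiad.wmnet Fetch failed",
    "May 21 14:48:05 WARN: mw1262.eqiad.wmnet Fetch failed",
    "May 21 14:48:09 WARN: mw1261.eqiad.wmnet Fetch failed",
    "May 21 14:49:00 WARN: mw1263.eqiad.wmnet Fetch failed",
    "May 21 14:48:10 INFO: mw1264.eqiad.wmnet ok",
]

if __name__ == "__main__":
    for host, count in count_fetch_failures(SAMPLE_LINES, "May 21 14:48"):
        print(count, host)
```

Matching on the full 'May 21 14:48' string rather than just '14:48' is the same refinement the later title edit made, since a bare time would also match other days in the log.
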
CDanis added a project to T223948: media seek controls not invokeable on Android 9 (Pie) and Pixel 3 XL: Android-app-Bugs.
Tue, May 21, 12:17 PM · Wikipedia-Android-App-Backlog (Android-app-release-v2.7.28x-M-Mochi), Android-app-Bugs

Mon, May 20

CDanis added a comment to T223934: Add annotations from ops vendor maintenance calendar to Grafana.

+1. In general I think it would be a great idea to do a lot more with annotations than we presently do:

Mon, May 20, 8:24 PM · Operations
CDanis created T223924: pybal logs into logstash.
Mon, May 20, 4:54 PM · Operations, Wikimedia-Logstash

Sun, May 19

Wang_Qiliang awarded T222418: 503 errors for several Wikipedia pages a Party Time token.
Sun, May 19, 2:04 PM · Wikimedia-Incident, Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712
Ankit-Maity awarded T222418: 503 errors for several Wikipedia pages a Pterodactyl token.
Sun, May 19, 1:33 PM · Wikimedia-Incident, Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712
CDanis closed T222418: 503 errors for several Wikipedia pages as Resolved.

Thanks! We now believe this is resolved.

Sun, May 19, 12:38 PM · Wikimedia-Incident, Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712
CDanis added a comment to T222418: 503 errors for several Wikipedia pages.

For posterity:

Sun, May 19, 12:28 PM · Wikimedia-Incident, Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712

Tue, May 14

CDanis added a comment to T223319: URL shortener subdomains for useful Wikimedia infrastructure.

Just want to throw out the possibility that in the future some of these underlying tools may change, and the unique identifier for that service may no longer align with the unique ID in the new service. IOW: some of these cool URLs might change ;)

Tue, May 14, 7:50 PM · Operations
CDanis created T223319: URL shortener subdomains for useful Wikimedia infrastructure.
Tue, May 14, 7:18 PM · Operations

Mon, May 13

CDanis added a comment to T197126: Create tool to handle the state of database configuration in MediaWiki in etcd.

It would be nice to have a mockup of the API to test soon (with no production effect except maybe some debug information). That would allow us to test automation from the scripts we already have. I think that would be step #6?

Mon, May 13, 4:19 PM · User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
Jdforrester-WMF awarded T197126: Create tool to handle the state of database configuration in MediaWiki in etcd a Like token.
Mon, May 13, 3:48 PM · User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
Jdforrester-WMF awarded T197126: Create tool to handle the state of database configuration in MediaWiki in etcd a Like token.
Mon, May 13, 3:47 PM · User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
CDanis claimed T197126: Create tool to handle the state of database configuration in MediaWiki in etcd.
Mon, May 13, 3:36 PM · User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
CDanis added a comment to T197126: Create tool to handle the state of database configuration in MediaWiki in etcd.

Here's my tentative plan for moving forward with this, including a rollout procedure:

Mon, May 13, 3:33 PM · User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA

Fri, May 10

CDanis added a comment to T220212: Wikimedia Technical Conference 2019: Discussion.

+1 to what @Joe said, and to what @jijiki said. Especially speaking as someone who has been at the Foundation only six months now.

Fri, May 10, 1:07 PM · International-Developer-Events

Wed, May 8

CDanis created P8495 swift codfw-prod final rebalance.
Wed, May 8, 7:10 PM
CDanis added a comment to T219544: Make hadoop cluster able to push to swift.

Some quick notes from today's meeting:

Wed, May 8, 3:28 PM · Patch-For-Review, Analytics-Kanban, Research, Operations, Discovery, Analytics

Tue, May 7

CDanis created T222755: #wikimedia-sre is missing stashbot.
Tue, May 7, 7:26 PM · Stashbot, Operations
CDanis added a comment to T221904: swift backend decomms / rebalances are noisy.

Trying out a few things here:

Tue, May 7, 1:05 PM · observability, media-storage, Operations
CDanis claimed T221904: swift backend decomms / rebalances are noisy.
Tue, May 7, 12:59 PM · observability, media-storage, Operations
CDanis added a comment to T222620: cp1083 crashed.

Interestingly, there was a memory usage spike right before the host crashed.

Tue, May 7, 12:10 PM · Operations, ops-eqiad, Traffic

Mon, May 6

CDanis created T222654: ms-be2043 'sdd' throwing lots of errors.
Mon, May 6, 7:13 PM · User-fgiunchedi, ops-codfw, observability, media-storage, Operations
CDanis updated subscribers of T222391: Gerrit Hardware Upgrade.

cc @mark who I know is about to start looking at hardware requests for the coming FY

Mon, May 6, 5:51 PM · Release-Engineering-Team (Watching / External), serviceops, ops-eqiad, Operations, Gerrit
CDanis closed T222108: prometheus: some sort of IRC alerts on restarts? as Resolved.

We now have IRC alerting based on scraping each prometheus for its process_start_time_seconds metric.

Mon, May 6, 4:39 PM · Patch-For-Review, Wikimedia-Incident, observability, Operations
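
The restart alerting described in the T222108 closing comment relies on `process_start_time_seconds`, which holds the Unix time at which the process started, so its value changes only when the process restarts. The actual alert rule is not shown in the feed; the following is only a minimal sketch of the idea, with a hypothetical function name and made-up sample values.

```python
def detect_restarts(samples):
    """Return the new start times observed whenever the metric changes.

    samples: process_start_time_seconds values (Unix seconds) taken from
    successive scrapes of a single Prometheus instance, oldest first.
    """
    restarts = []
    previous = None
    for start_time in samples:
        # A different start time than the previous scrape means the
        # process was restarted somewhere between the two scrapes.
        if previous is not None and start_time != previous:
            restarts.append(start_time)
        previous = start_time
    return restarts

if __name__ == "__main__":
    # Made-up scrape values: two restarts are visible.
    print(detect_restarts([100.0, 100.0, 250.0, 250.0, 400.0]))
```
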
CDanis added a comment to T222605: CI is unavailable since around 10:00 UTC.

My patches are also stuck in the queue, and I'm seeing teammates manually V+2 their Puppet changes.

Mon, May 6, 1:46 PM · Wikimedia-Incident, Continuous-Integration-Config, Release-Engineering-Team
CDanis updated the title for P8478 cdanis@icinga2001.wikimedia.org ~ % fgrep 'Too many open files' /var/log/syslog.1 | awk '{print $3}' | cut -d: -f1-2 | sort | uniq -c | phaste from Masterwork From Distant Lands to cdanis@icinga2001.wikimedia.org ~ % fgrep 'Too many open files' /var/log/syslog.1 | awk '{print $3}' | cut -d: -f1-2 | sort | uniq -c | phaste.
Mon, May 6, 1:06 PM

Fri, May 3

CDanis updated the title for P8473 curl --silent 'https://gerrit.wikimedia.org/r/changes/operations%2Fpuppet~507623/detail' | head -n5 from Masterwork From Distant Lands to curl --silent 'https://gerrit.wikimedia.org/r/changes/operations%2Fpuppet~507623/detail' | head -n5.
Fri, May 3, 6:05 PM
CDanis edited P8473 curl --silent 'https://gerrit.wikimedia.org/r/changes/operations%2Fpuppet~507623/detail' | head -n5.
Fri, May 3, 6:05 PM
CDanis closed T222112: figure out why Kafka dashboard hammers Prometheus, and fix it as Resolved.

It does seem much faster now, thanks @elukey ! Impact of loading 30 days on Prometheus is also minimal now -- modest CPU usage and while there was some increase in RAM consumption over baseline while we were both playing with this, it's not concerning. Thank you :)

Fri, May 3, 2:43 PM · Wikimedia-Incident, Operations, observability
CDanis updated the title for P8471 dbctl config | head -n1 | jq . from Masterwork From Distant Lands to dbctl config | head -n1 | jq ..
Fri, May 3, 12:34 PM
CDanis edited P8471 dbctl config | head -n1 | jq ..
Fri, May 3, 12:34 PM

Thu, May 2

CDanis added a comment to T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.

Also sorry, I don't have a lot of time left over this week; can take a deeper look next week

Thu, May 2, 6:11 PM · Wikimedia-Incident, Operations, observability
CDanis added a comment to T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.

I think you should just be able to remove the "custom all value" in the dashboard settings and have it work. In this case Grafana will create its own 'all' value that is simply a regex OR'ing together all the known values, which it looks like it computes based on the cluster=kafka_jumbo hidden variable.

Thu, May 2, 6:11 PM · Wikimedia-Incident, Operations, observability
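
The comment above describes an automatically generated 'all' value as a regex that ORs together every known value of the variable. The sketch below illustrates that idea only; it is not Grafana's implementation, and the function name and broker names are hypothetical.

```python
import re

def build_all_value(values):
    """Join the known variable values into a single regex alternation."""
    # Escape each value so characters such as '.' are matched literally.
    return "(" + "|".join(re.escape(value) for value in values) + ")"

if __name__ == "__main__":
    # Hypothetical broker names, for illustration only.
    pattern = build_all_value(["kafka-jumbo1001", "kafka-jumbo1002"])
    print(re.fullmatch(pattern, "kafka-jumbo1002") is not None)
```

A pattern built this way matches only the enumerated values, so the resulting query selects exactly the series the variable already knows about.
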
CDanis updated subscribers of T219544: Make hadoop cluster able to push to swift.

I got tied up with goal work and incident response and have only had a little time to spend on this.

Thu, May 2, 5:37 PM · Patch-For-Review, Analytics-Kanban, Research, Operations, Discovery, Analytics
CDanis renamed T222362: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications from ms-be2043 /dev/sdd drive failure to swift-drive-audit unmounting a drive doesn't produce any alerts or notifications.
Thu, May 2, 1:33 PM · observability, media-storage, Operations
CDanis added a comment to T222362: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications.

I think the 'real' thing we need to notify on here is when Swift decides it wants to stop using a disk (which it did here)

Thu, May 2, 1:22 PM · observability, media-storage, Operations
CDanis created T222362: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications.
Thu, May 2, 1:15 PM · observability, media-storage, Operations

Wed, May 1

CDanis added a comment to T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.

I've modified the Kafka dashboard so that only the Summary Row is uncollapsed by default. I've also changed the default time range to last 3 hours, rather than last 24.

Wed, May 1, 1:48 PM · Wikimedia-Incident, Operations, observability

Tue, Apr 30

CDanis closed T222105: prometheus: current query limits are insufficient to prevent OOMs as Resolved.

As documented in T222112#5147131 this didn't actually fix the dashboard at fault in this particular incident, but I've heard from another large-scale Prometheus user (and Prometheus dev) that they've had similar problems and recommend 10M as a value.

Tue, Apr 30, 3:08 PM · Patch-For-Review, Wikimedia-Incident, observability, Operations
CDanis updated subscribers of T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.

I'm pretty sure it is these panels that are responsible for the most Prometheus load


They take much longer to load than the rest of the panels, and some of them errored out with the new settings.

Tue, Apr 30, 2:22 PM · Wikimedia-Incident, Operations, observability
CDanis added a comment to T219825: Update dashboards to node-exporter 0.16+ metric names.

I think https://grafana.wikimedia.org/d/000000607/cluster-overview might have been missed here? I see at least some old metrics being used there, e.g. node_memory_Cached in the "Memory per host" section.

Tue, Apr 30, 2:06 PM · Patch-For-Review, observability

Mon, Apr 29

CDanis created T222113: prometheus: upgrade to 2.9.2.
Mon, Apr 29, 9:33 PM · Wikimedia-Incident, observability, Operations
CDanis added a comment to T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.

Very easy to reproduce this presently. Not an OOM but close.

Mon, Apr 29, 9:16 PM · Wikimedia-Incident, Operations, observability
CDanis created T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.
Mon, Apr 29, 9:05 PM · Wikimedia-Incident, Operations, observability
CDanis moved T222105: prometheus: current query limits are insufficient to prevent OOMs from To Triage to Follow-up/Actionables on the Wikimedia-Incident board.
Mon, Apr 29, 8:54 PM · Patch-For-Review, Wikimedia-Incident, observability, Operations
CDanis moved T222102: prometheus: usable dashboard for meta-metrics about Prometheus itself (query durations etc) from To Triage to Follow-up/Actionables on the Wikimedia-Incident board.
Mon, Apr 29, 8:54 PM · Wikimedia-Incident, observability, Operations
CDanis moved T222108: prometheus: some sort of IRC alerts on restarts? from To Triage to Follow-up/Actionables on the Wikimedia-Incident board.
Mon, Apr 29, 8:54 PM · Patch-For-Review, Wikimedia-Incident, observability, Operations
CDanis created T222108: prometheus: some sort of IRC alerts on restarts?.
Mon, Apr 29, 8:53 PM · Patch-For-Review, Wikimedia-Incident, observability, Operations
CDanis created T222105: prometheus: current query limits are insufficient to prevent OOMs.
Mon, Apr 29, 8:19 PM · Patch-For-Review, Wikimedia-Incident, observability, Operations
CDanis updated the task description for T222102: prometheus: usable dashboard for meta-metrics about Prometheus itself (query durations etc).
Mon, Apr 29, 7:46 PM · Wikimedia-Incident, observability, Operations
CDanis created T222102: prometheus: usable dashboard for meta-metrics about Prometheus itself (query durations etc).
Mon, Apr 29, 7:45 PM · Wikimedia-Incident, observability, Operations
CDanis updated the title for P8456 cdanis@grafana1001.eqiad.wmnet /var/log/apache2 % zgrep '^2019-04-25T20:3' other_vhosts_access.log.4 | sort -g -k +2 | cut -d"$(echo -ne '\t')" -f1-10 | tail -n100 | phaste from Masterwork From Distant Lands to cdanis@grafana1001.eqiad.wmnet /var/log/apache2 % zgrep '^2019-04-25T20:3' other_vhosts_access.log.4 | sort -g -k +2 | cut -d"$(echo -ne '\t')" -f1-10 | tail -n100 | phaste.
Mon, Apr 29, 7:40 PM
CDanis moved T220838: Upgrade grafana to 6.1 from Backlog to Up next on the observability board.
Mon, Apr 29, 3:16 PM · observability, Operations
CDanis updated the title for P8455 cdanis@prometheus1004.eqiad.wmnet /var/log/apache2 % zgrep '^2019-04-25T20:3' other_vhosts_access.log.4 | sort -g -k +2 | cut -d"$(echo -ne '\t')" -f1-10 | tail -n100 | phaste from Masterwork From Distant Lands to cdanis@prometheus1004.eqiad.wmnet /var/log/apache2 % zgrep '^2019-04-25T20:3' other_vhosts_access.log.4 | sort -g -k +2 | cut -d"$(echo -ne '\t')" -f1-10 | tail -n100 | phaste.
Mon, Apr 29, 2:59 PM
CDanis updated the title for P8454 cdanis@prometheus1003.eqiad.wmnet /var/log/apache2 % zgrep '^2019-04-25T20:3' other_vhosts_access.log.4 | sort -g -k +2 | cut -d"$(echo -ne '\t')" -f1-10 | tail -n100 | phaste from Masterwork From Distant Lands to cdanis@prometheus1003.eqiad.wmnet /var/log/apache2 % zgrep '^2019-04-25T20:3' other_vhosts_access.log.4 | sort -g -k +2 | cut -d"$(echo -ne '\t')" -f1-10 | tail -n100 | phaste.
Mon, Apr 29, 2:58 PM

Fri, Apr 26

CDanis added a project to T221068: decom ms-be201[345]: media-storage.
Fri, Apr 26, 8:00 PM · decommission, ops-codfw, media-storage, User-fgiunchedi, Operations
Bstorm awarded T221985: puppet-merge shouldn't fail if `tput` doesn't grok your terminal an Evil Spooky Haunted Tree token.
Fri, Apr 26, 7:28 PM · Puppet, Operations
CDanis created T221985: puppet-merge shouldn't fail if `tput` doesn't grok your terminal.
Fri, Apr 26, 6:48 PM · Puppet, Operations
fgiunchedi awarded T221964: RIPE Atlas data in Prometheus a Love token.
Fri, Apr 26, 12:56 PM · Traffic, Operations, observability
CDanis created T221964: RIPE Atlas data in Prometheus.
Fri, Apr 26, 12:15 PM · Traffic, Operations, observability
CDanis updated the title for P8441 cdanis@prometheus1004.eqiad.wmnet ~ % sudo journalctl --unit prometheus@ops.service --since '2019-04-25 18:41:48' --until '2019-04-25 20:42:03' | phaste from Masterwork From Distant Lands to cdanis@prometheus1004.eqiad.wmnet ~ % sudo journalctl --unit prometheus@ops.service --since '2019-04-25 18:41:48' --until '2019-04-25 20:42:03' | phaste.
Fri, Apr 26, 12:08 AM

Thu, Apr 25

CDanis created T221904: swift backend decomms / rebalances are noisy.
Thu, Apr 25, 10:05 PM · observability, media-storage, Operations
CDanis updated the title for P8437 sudo -u gerrit2 jstack -l `pidof java` | phaste from Masterwork From Distant Lands to sudo -u gerrit2 jstack -l `pidof java` | phaste.
Thu, Apr 25, 2:39 PM
CDanis updated the title for P8436 ssh -p 29418 gerrit.wikimedia.org gerrit show-queue -w --by-queue from Masterwork From Distant Lands to ssh -p 29418 gerrit.wikimedia.org gerrit show-queue -w --by-queue.
Thu, Apr 25, 2:34 PM
CDanis edited P8436 ssh -p 29418 gerrit.wikimedia.org gerrit show-queue -w --by-queue.
Thu, Apr 25, 2:33 PM

Wed, Apr 24

CDanis added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

@Cwek Thank you very much for the detailed report! I've rolled back the experimental change to our DNS records, and by now, more than enough time should have passed for the TTLs to expire on the records that seemed to cause inaccessibility. Hopefully this will rectify things.

Wed, Apr 24, 4:10 AM · Performance-Team (Radar), Operations, Traffic

Apr 23 2019

CDanis added a comment to T208263: Refactor public-facing DYNA scheme for primary project hostnames in our DNS.

authdns-update complete as of ~20:33:56 UTC.

Apr 23 2019, 8:34 PM · Performance-Team (Radar), Operations, Traffic

Apr 21 2019

CDanis added a comment to T221529: Frequent puppet failures.

A bit out of date, but

~ % cat .weechat/logs/irc.znc.\#wikimedia-operations.weechatlog| grep 'Catalog fetch fail' | cut -f1 -d' ' | uniq -c
1 2018-11-08
2 2018-11-09
2 2018-11-10
2 2018-11-11
2 2018-11-13
8 2018-11-14
5 2018-11-15
1 2018-11-16
3 2018-11-17
1 2018-11-18
5 2018-11-19
1 2018-11-20
2 2018-11-21
5 2018-11-22
3 2018-11-23
1 2018-11-24
2 2018-11-25
3 2018-11-26
5 2018-11-27
9 2018-11-28
8 2018-11-29
4 2018-11-30
1 2018-12-01
4 2018-12-02
42 2018-12-03
5 2018-12-04
6 2018-12-05
10 2018-12-06
2 2018-12-07
1 2018-12-08
2 2018-12-09
2 2018-12-11
1 2018-12-12
2 2018-12-13
2 2018-12-14
1 2018-12-15
2 2018-12-16
1 2018-12-17
1 2018-12-18
4 2018-12-19
1 2018-12-23
1 2018-12-27
1 2018-12-30
4 2019-01-02
10 2019-01-03
3 2019-01-04
2 2019-01-06
1 2019-01-07
1 2019-01-08
1 2019-01-09
3 2019-01-10
2 2019-01-13
2 2019-01-14
1 2019-01-15
1 2019-01-16
1 2019-01-18
2 2019-01-19
3 2019-01-22
19 2019-01-23
4 2019-01-24
115 2019-01-25
1 2019-01-26
2 2019-01-27
1 2019-01-28
1 2019-01-29
1 2019-01-30
1 2019-01-31
1 2019-02-01
2 2019-02-02
43 2019-02-04
8 2019-02-05
2 2019-02-06
2 2019-02-07
1 2019-02-08
1 2019-02-10
11 2019-02-11
2 2019-02-12
16 2019-02-13
1 2019-02-14
4 2019-02-16
1 2019-02-19
1 2019-02-21
1 2019-02-23
2 2019-02-24
1 2019-02-26
2 2019-02-27
4 2019-02-28
188 2019-03-01
2 2019-03-02
2 2019-03-05
4 2019-03-07
2 2019-03-08
1 2019-03-09
1 2019-03-10
2 2019-03-12
3 2019-03-14
48 2019-03-15
1 2019-03-16
2 2019-03-17
1 2019-03-19
4 2019-03-20
14 2019-03-21
17 2019-03-22
13 2019-03-23
11 2019-03-24
6 2019-03-25
15 2019-03-26
6 2019-03-27
12 2019-03-28
19 2019-03-29
16 2019-03-30
13 2019-03-31
12 2019-04-01
17 2019-04-02
9 2019-04-03

Apr 21 2019, 11:43 PM · Puppet, puppet-compiler, Operations
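
The per-day tally in the T221529 comment comes from `grep | cut -f1 -d' ' | uniq -c`, which works without a `sort` because the IRC log is already in chronological order. A minimal Python sketch of the same counting follows; the function name and the sample log lines are hypothetical, and the real weechat log layout is only assumed to begin each line with the date followed by a space.

```python
from itertools import groupby

def count_per_day(lines, needle="Catalog fetch fail"):
    """Count matching lines per day, assuming lines are in date order."""
    # grep 'Catalog fetch fail' | cut -f1 -d' ': keep the leading date field.
    days = (line.split(" ")[0] for line in lines if needle in line)
    # Like uniq -c, groupby merges only adjacent equal keys, so it
    # depends on the input already being grouped by day.
    return [(len(list(group)), day) for day, group in groupby(days)]

# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    "2018-11-08 10:00:00 bot Catalog fetch fail on host1",
    "2018-11-08 11:00:00 someone unrelated chatter",
    "2018-11-09 01:00:00 bot Catalog fetch fail on host2",
    "2018-11-09 02:00:00 bot Catalog fetch fail on host3",
]

if __name__ == "__main__":
    for count, day in count_per_day(SAMPLE_LOG):
        print(count, day)
```
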

Apr 19 2019

CDanis created P8421 "removal request for address of" lldpd messages by hostname.
Apr 19 2019, 4:22 PM
CDanis created P8420 (An Untitled Masterwork).
Apr 19 2019, 4:21 PM
CDanis added projects to T221458: Special:Log on commons -- entire web request took longer than 60 seconds and timed out: DBA, MediaWiki-Database.
Apr 19 2019, 2:54 PM · Core Platform Team Kanban (Done with CPT), MW-1.34-notes (1.34.0-wmf.3; 2019-04-30), Performance, MediaWiki-Logging, MediaWiki-Database, DBA, Operations, Wikimedia-production-error
CDanis added a comment to T221458: Special:Log on commons -- entire web request took longer than 60 seconds and timed out.

Found a logstash fatal that definitely implicates the database on a commonswiki Special:Log pageload:
https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-2019.04.19/mediawiki/?id=AWo2EGn0m4XPTDeITlGR

Apr 19 2019, 2:53 PM · Core Platform Team Kanban (Done with CPT), MW-1.34-notes (1.34.0-wmf.3; 2019-04-30), Performance, MediaWiki-Logging, MediaWiki-Database, DBA, Operations, Wikimedia-production-error
CDanis added a comment to T221458: Special:Log on commons -- entire web request took longer than 60 seconds and timed out.

Looks like the database is being slow? Pretty sure this is a MW API call backing the pageload of Special:Log on commonswiki.

Apr 19 2019, 2:51 PM · Core Platform Team Kanban (Done with CPT), MW-1.34-notes (1.34.0-wmf.3; 2019-04-30), Performance, MediaWiki-Logging, MediaWiki-Database, DBA, Operations, Wikimedia-production-error
CDanis triaged T221458: Special:Log on commons -- entire web request took longer than 60 seconds and timed out as High priority.
Apr 19 2019, 2:48 PM · Core Platform Team Kanban (Done with CPT), MW-1.34-notes (1.34.0-wmf.3; 2019-04-30), Performance, MediaWiki-Logging, MediaWiki-Database, DBA, Operations, Wikimedia-production-error
CDanis created T221458: Special:Log on commons -- entire web request took longer than 60 seconds and timed out.
Apr 19 2019, 2:48 PM · Core Platform Team Kanban (Done with CPT), MW-1.34-notes (1.34.0-wmf.3; 2019-04-30), Performance, MediaWiki-Logging, MediaWiki-Database, DBA, Operations, Wikimedia-production-error

Apr 18 2019

CDanis edited P8418 Masterwork From Distant Lands.
Apr 18 2019, 1:51 PM
CDanis added a comment to T196336: Icinga passive checks go awol and downtime stops working.

similar again

Apr 18 2019, 1:42 PM · Operations, Icinga, observability

Apr 17 2019

CDanis closed T214529: EDAC events not being reported by node-exporter? as Resolved.

Calling this resolved for now -- the mtail-based events are also being monitored by Icinga, and would have caught all the previous instances missed by the node_exporter/kernel counter stats.

Apr 17 2019, 8:31 PM · Patch-For-Review, Operations, observability

Apr 16 2019

CDanis reassigned T218006: mw1280 crashed from Cmjohnson to jijiki.
Apr 16 2019, 9:04 PM · ops-eqiad, Operations, serviceops
CDanis added a comment to T218006: mw1280 crashed.

17:02:39 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : set/pooled=yes; selector: name=mw1280.eqiad.wmnet

Apr 16 2019, 9:02 PM · ops-eqiad, Operations, serviceops
CDanis added a comment to T218006: mw1280 crashed.

13:26:19 <+logmsgbot> !log cdanis@puppetmaster1001 conftool action : set/pooled=no; selector: dc=eqiad,name=mw1280.eqiad.wmnet,cluster=api_appserver

Apr 16 2019, 5:33 PM · ops-eqiad, Operations, serviceops

Apr 15 2019

CDanis closed T221035: scap no longer !log'ging to server admin log as Resolved.

looks like logmsgbot was happily chattering away in #wikimedia-overload because of some race condition (within IRC services?) reconnecting to freenode around Apr 13 17:06.

Apr 15 2019, 11:33 PM · Release-Engineering-Team (Watching / External), Stashbot, Scap, Operations
CDanis added a project to T221052: config file change canarying for logstash: Operations.
Apr 15 2019, 11:16 PM · Operations, Wikimedia-Logstash
CDanis created T221052: config file change canarying for logstash.
Apr 15 2019, 11:16 PM · Operations, Wikimedia-Logstash
CDanis created T221035: scap no longer !log'ging to server admin log.
Apr 15 2019, 7:54 PM · Release-Engineering-Team (Watching / External), Stashbot, Scap, Operations
CDanis updated the task description for T220838: Upgrade grafana to 6.1.
Apr 15 2019, 3:15 PM · observability, Operations
CDanis closed T213084: Build an understanding of our needs around external monitoring services - Q3 2018/19 goal as Resolved.
Apr 15 2019, 3:07 PM · User-CDanis, observability, Goal
CDanis added a comment to T218006: mw1280 crashed.

@Cmjohnson I'm on US East time and can handle the depool. Give me a ping when you're ready

Apr 15 2019, 2:29 PM · ops-eqiad, Operations, serviceops
CDanis created T220982: maps hosts have bad permissions under /srv/deployment.
Apr 15 2019, 2:20 PM · Operations
fgiunchedi awarded T220838: Upgrade grafana to 6.1 a Love token.
Apr 15 2019, 2:07 PM · observability, Operations
CDanis closed T220909: Degraded RAID on ms-be1013 as Invalid.
Apr 15 2019, 1:04 PM · ops-eqiad, Operations

Apr 12 2019

CDanis updated the task description for T220842: documented procedure for replacing disks in software RAID servers.
Apr 12 2019, 5:58 PM · DC-Ops, Operations
CDanis created T220842: documented procedure for replacing disks in software RAID servers.
Apr 12 2019, 5:58 PM · DC-Ops, Operations
CDanis created T220838: Upgrade grafana to 6.1.
Apr 12 2019, 4:55 PM · observability, Operations

Apr 8 2019

CDanis added a comment to T219544: Make hadoop cluster able to push to swift.

So it sounds like the firewall work is done (thanks Arzhel!)

Apr 8 2019, 7:09 PM · Patch-For-Review, Analytics-Kanban, Research, Operations, Discovery, Analytics
CDanis added a parent task for T197126: Create tool to handle the state of database configuration in MediaWiki in etcd: T220395: TEC6: Database Automation.
Apr 8 2019, 2:22 PM · User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA