CDanis (Chris Danis)
SRE

User Details

User Since
Nov 5 2018, 2:54 PM (36 w, 4 d)
Availability
Available
IRC Nick
cdanis
LDAP User
CDanis
MediaWiki User
CDanis (WMF)

Recent Activity

Yesterday

CDanis added a comment to T228443: Help people remember to merge labs/private git.

I updated the docs:
https://wikitech.wikimedia.org/w/index.php?title=Puppet&type=revision&diff=1833007&oldid=1828050

Fri, Jul 19, 3:33 AM · Puppet, cloud-services-team (Kanban)

Thu, Jul 18

CDanis awarded T228443: Help people remember to merge labs/private git a Love token.
Thu, Jul 18, 4:05 PM · Puppet, cloud-services-team (Kanban)

Wed, Jul 17

CDanis committed rOSCT14fe88f755a7: dbctl schemata: move files to match prod (authored by CDanis).
dbctl schemata: move files to match prod
Wed, Jul 17, 7:06 PM
CDanis committed rOSCT2f35cfd9ffa2: debian: release 1.1.1-1 (authored by CDanis).
debian: release 1.1.1-1
Wed, Jul 17, 5:38 PM
CDanis committed rOSCTeaeb419f8517: bump version: --version and dbctl unification fixes (authored by CDanis).
bump version: --version and dbctl unification fixes
Wed, Jul 17, 4:18 PM
CDanis committed rOSCT01925d6341be: dbctl: part 2/2 to bring schema in line with production (authored by CDanis).
dbctl: part 2/2 to bring schema in line with production
Wed, Jul 17, 3:52 PM
CDanis committed rOSCT33dbb55af062: dbctl: part 1/2 to bring schema in line with production (authored by CDanis).
dbctl: part 1/2 to bring schema in line with production
Wed, Jul 17, 3:49 PM
CDanis created P8765 (An Untitled Masterwork).
Wed, Jul 17, 3:03 PM

Tue, Jul 16

CDanis updated subscribers of P8744 checking puppetdb compiled catalogs for nrpe::monitor_service with non-boolean values for $critical.

First used to validate the change made in https://gerrit.wikimedia.org/r/c/operations/puppet/+/522992 but useful generically.

Tue, Jul 16, 8:50 PM
CDanis edited P8744 checking puppetdb compiled catalogs for nrpe::monitor_service with non-boolean values for $critical.
Tue, Jul 16, 1:42 PM

Mon, Jul 15

CDanis closed T228089: Logstash down for MediaWiki as Resolved.

The backlog in Kafka should clear in just a few more minutes. Closing this; separate issues to be opened later for followup work.

Mon, Jul 15, 11:49 PM · Wikimedia-Incident, observability, Operations, Wikimedia-Logstash
CDanis added a comment to T228089: Logstash down for MediaWiki.

I've started an incident document at https://wikitech.wikimedia.org/wiki/Incident_documentation/20190715-logstash and would appreciate more contributors.

Mon, Jul 15, 11:48 PM · Wikimedia-Incident, observability, Operations, Wikimedia-Logstash
CDanis updated subscribers of T97972: Figure out a security model for etcd.

I think we likely want to revisit this.

Mon, Jul 15, 11:42 PM · Patch-For-Review, Operations, services-tooling, discovery-system, Traffic
CDanis added a comment to T228086: Swift TCP retransmits increase.

This is very likely related to T226937: Not possible to server-side upload certain images: "An unknown error occurred in storage backend "local-swift-eqiad""

Mon, Jul 15, 6:21 PM · Operations, media-storage
CDanis committed rOSCT9cd86cc5fc14: conftool: add support for --version to all executables (authored by CDanis).
conftool: add support for --version to all executables
Mon, Jul 15, 1:01 PM
CDanis reopened T224491: PHP 7 corruption during deployment (was: PHP 7 fatals on mw1262) as "Open".
Mon, Jul 15, 12:50 PM · User-jijiki, serviceops, PHP 7.2 support, Performance-Team (Radar), Operations, Wikimedia-production-error

Sun, Jul 14

CDanis edited P8744 checking puppetdb compiled catalogs for nrpe::monitor_service with non-boolean values for $critical.
Sun, Jul 14, 7:18 PM
CDanis created P8744 checking puppetdb compiled catalogs for nrpe::monitor_service with non-boolean values for $critical.
Sun, Jul 14, 7:18 PM

Fri, Jul 12

CDanis added a comment to T92298: Investigate our mitigation strategy for HTTPS response length attacks.

Do we support TLS 1.3 yet? I'm apparently connecting over 1.2 still.

Fri, Jul 12, 6:41 PM · Security, Operations, Security-General, Traffic, HTTPS
CDanis closed T227100: monitoring::check_prometheus should error on an unquoted ! in the query as Resolved.
Fri, Jul 12, 3:43 PM · good first bug, Operations, observability
CDanis added a comment to T227100: monitoring::check_prometheus should error on an unquoted ! in the query.

Being as it's a Friday, I thought I would have a go at this. However, I didn't read the original message correctly, and before I refactor I want to double-check that you want device\\! and not device\!. I would have thought the first form would fail, but I don't know how many levels of encoding/escaping are needed.

Fri, Jul 12, 2:11 PM · good first bug, Operations, observability

Thu, Jul 11

CDanis added a comment to T198850: debmonitor: Race condition between package updated triggered by apt hook and daily cron run.

Just a note that it happened again today ;)

Thu, Jul 11, 6:18 PM · Operations, SRE-tools

Wed, Jul 10

CDanis archived P8667 asdf.
Wed, Jul 10, 1:43 PM

Wed, Jul 3

CDanis reassigned T227225: release a scap that contains I85a2161 (Remove functionality to talk to conftool) from CDanis to Volans.

Wow, that was quick, thanks! Riccardo should have time to do the deploy while I'm on vacation 🙃

Wed, Jul 3, 11:14 PM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (201907), Scap
CDanis created T227225: release a scap that contains I85a2161 (Remove functionality to talk to conftool).
Wed, Jul 3, 8:29 PM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO (201907), Scap

Tue, Jul 2

CDanis triaged T227100: monitoring::check_prometheus should error on an unquoted ! in the query as Normal priority.
Tue, Jul 2, 3:25 PM · good first bug, Operations, observability
CDanis created T227100: monitoring::check_prometheus should error on an unquoted ! in the query.
Tue, Jul 2, 3:25 PM · good first bug, Operations, observability

Mon, Jul 1

CDanis added a comment to T213708: Upgrade production prometheus-node-exporter to >= 0.16.

I think I just found another one that was missed: https://grafana.wikimedia.org/d/000000545/ganeti

Mon, Jul 1, 8:40 PM · Patch-For-Review, Goal, observability, Operations

Fri, Jun 28

CDanis added a comment to T226840: Consistent HTTP 503 Error on some urls for some logged-in users (CentralAuth Set-Cookie storm).

NB that the default limit in Varnish actually allows for slightly fewer than 64 headers -- the first line of an HTTP response is parsed into several pseudo-headers internally, so AIUI there's space for ~59 headers.

Fri, Jun 28, 4:04 PM · TimedMediaHandler, MW-1.34-notes (1.34.0-wmf.13; 2019-07-09), Wikimedia-Incident, Performance-Team (Radar), Traffic, MediaWiki-extensions-CentralAuth, Operations

Thu, Jun 27

CDanis closed T226769: consider running bastion Prometheis inside cgroups as Invalid.

I'm told the plan is to move these onto Ganeti in PoPs, so that seems just as good.

Thu, Jun 27, 9:59 PM · Operations, observability
CDanis added a comment to T226776: mobile commons GET dying in Varnish layer(?) under oddly specific conditions.

So, it looks like this 500 did in fact come from the application layer... but shouldn't we still be getting more response headers from the edge?

Thu, Jun 27, 9:56 PM · Operations, Traffic
CDanis created T226776: mobile commons GET dying in Varnish layer(?) under oddly specific conditions.
Thu, Jun 27, 9:39 PM · Operations, Traffic
CDanis created T226769: consider running bastion Prometheis inside cgroups.
Thu, Jun 27, 8:42 PM · Operations, observability
CDanis updated the title for P8667 asdf from untitled to asdf.
Thu, Jun 27, 3:40 PM

Wed, Jun 26

CDanis added a comment to T226545: Loop trying to create an account in Wikimedia Space in certain cases.

I am unable to create a Discourse account because of this loop.

Wed, Jun 26, 4:47 PM · Space (Jul-Sep-2019), Discourse
CDanis added a comment to T213918: Investigate distributed and long term storage solutions for Prometheus.

+1 for the relative simplicity of Thanos (from both a design and deployment perspective)

Wed, Jun 26, 10:37 AM · User-fgiunchedi, Goal, observability, Operations

Tue, Jun 25

CDanis created T226508: Icinga custom checks should follow our HTTP User-Agent policy.
Tue, Jun 25, 2:00 PM · observability, Operations

Mon, Jun 24

CDanis added a comment to T226109: Jobs not being executed on 1.34.0-wmf.10.

Am I alone in feeling like this probably deserves an incident report?

Mon, Jun 24, 7:16 PM · Analytics, EventBus, Services (done), Core Platform Team Workboards (Done with CPT), Operations, WMF-JobQueue, MassMessage
CDanis lowered the priority of T226394: Telia IC-307235 reported down from the eqiad side from Unbreak Now! to High.

it's just one (not-often-used) link down, not a site down; UBN is unnecessary IMO

Mon, Jun 24, 3:31 PM · Operations, netops
CDanis renamed T220838: Upgrade grafana to 6.x from Upgrade grafana to 6.1 to Upgrade grafana to 6.x.
Mon, Jun 24, 3:12 PM · observability, Operations
CDanis added a comment to T226394: Telia IC-307235 reported down from the eqiad side.

Telia reports a 'major outage' and is tracking status of our circuit in case 00993514

Mon, Jun 24, 1:41 PM · Operations, netops
CDanis created T226394: Telia IC-307235 reported down from the eqiad side.
Mon, Jun 24, 1:16 PM · Operations, netops

Jun 20 2019

CDanis updated subscribers of T226048: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster).

My guess is that the beginning of this problem correlates with the beginning of the fetch failures in the first graph panel here:
https://grafana.wikimedia.org/d/000000352/varnish-failed-fetches?orgId=1&from=now-7d&to=now

Jun 20 2019, 3:24 AM · CommRel-Specialists-Support (Jul-Sep-2019), User-notice, Performance-Team (Radar), Traffic, Operations, Performance

Jun 19 2019

CDanis added a comment to T226109: Jobs not being executed on 1.34.0-wmf.10.

@Reedy manually ran the global renames that were never queued properly.

Jun 19 2019, 4:52 PM · Analytics, EventBus, Services (done), Core Platform Team Workboards (Done with CPT), Operations, WMF-JobQueue, MassMessage

Jun 5 2019

CDanis added a comment to T225166: Gerrit crashed due to out of Heap.

Some curious stuff in the monitoring data:

Jun 5 2019, 10:32 PM · Gerrit

Jun 4 2019

CDanis created P8590 (An Untitled Masterwork).
Jun 4 2019, 7:27 PM
CDanis created P8589 import.php.
Jun 4 2019, 7:25 PM
CDanis added a comment to T224236: include the 'Server:' response header in varnishkafka.

Indeed, thanks @ema ! I talked with @fgiunchedi some about this earlier and we tweaked the wording on the Logstash dashboard to remind users that "Varnish" appearing highly in the "top n Backends" panel is not necessarily reflective of a Varnish issue.

Jun 4 2019, 2:14 PM · Analytics-Kanban, User-Elukey, Traffic, Analytics, Operations

Jun 3 2019

CDanis added a comment to T224888: Network port utilization alerts should be paging.

There is a "Nagios Compatible" transport, but it is underdocumented and seems to also only write to a local filesystem path (which is presumed to be a Nagios external command FIFO).

Jun 3 2019, 6:30 PM · Traffic, Operations, netops

Jun 1 2019

CDanis updated subscribers of T219825: Update dashboards to node-exporter 0.16+ metric names.

@Marostegui just found something we forgot: the use of Prometheus metrics in Grafana's variable definitions (e.g. by a label_values() query)

Jun 1 2019, 8:34 PM · Patch-For-Review, observability

May 31 2019

CDanis triaged T224738: add Icinga alert on Varnish backends that are close to maxing out their allowed connections to their applayer backends as Normal priority.
May 31 2019, 3:45 PM · Traffic, Operations
CDanis created T224738: add Icinga alert on Varnish backends that are close to maxing out their allowed connections to their applayer backends.
May 31 2019, 3:43 PM · Traffic, Operations
CDanis added a comment to T224236: include the 'Server:' response header in varnishkafka.

SGTM @elukey, thanks!

May 31 2019, 12:59 PM · Analytics-Kanban, User-Elukey, Traffic, Analytics, Operations

May 30 2019

CDanis created P8575 (An Untitled Masterwork).
May 30 2019, 4:11 PM

May 29 2019

CDanis assigned T224236: include the 'Server:' response header in varnishkafka to Ottomata.

Andrew, can you (or someone else) advise on rolling out this change for Analytics?

May 29 2019, 8:12 PM · Analytics-Kanban, User-Elukey, Traffic, Analytics, Operations

May 23 2019

CDanis created T224236: include the 'Server:' response header in varnishkafka.
May 23 2019, 4:31 PM · Analytics-Kanban, User-Elukey, Traffic, Analytics, Operations

May 21 2019

CDanis added a comment to T223952: Increased instability in MediaWiki backends (according to load balancers).

We saw one of these events at 14:48 today and pybal reported fetch failures for -- and wanted to depool -- basically the entire appserver fleet https://phabricator.wikimedia.org/P8551

May 21 2019, 3:01 PM · Performance-Team (Radar), User-Marostegui, HHVM, serviceops, Operations
CDanis updated the title for P8551 fgrep 'May 21 14:48' /var/log/pybal.log | grep -i 'fetch failed' | egrep -o 'WARN: (mw.*wmnet)' | cut -f2 -d' ' | sort | uniq -c | sort -gr | phaste from fgrep 14:48 /var/log/pybal.log | grep -i 'fetch failed' | egrep -o 'WARN: (mw.*wmnet)' | cut -f2 -d' ' | sort | uniq -c | sort -gr | phaste to fgrep 'May 21 14:48' /var/log/pybal.log | grep -i 'fetch failed' | egrep -o 'WARN: (mw.*wmnet)' | cut -f2 -d' ' | sort | uniq -c | sort -gr | phaste.
May 21 2019, 2:59 PM
CDanis updated the title for P8551 fgrep 'May 21 14:48' /var/log/pybal.log | grep -i 'fetch failed' | egrep -o 'WARN: (mw.*wmnet)' | cut -f2 -d' ' | sort | uniq -c | sort -gr | phaste from Masterwork From Distant Lands to fgrep 14:48 /var/log/pybal.log | grep -i 'fetch failed' | egrep -o 'WARN: (mw.*wmnet)' | cut -f2 -d' ' | sort | uniq -c | sort -gr | phaste.
May 21 2019, 2:57 PM
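
The shell pipeline in the P8551 title filters one minute of the pybal log for fetch failures and counts them per host. As a rough sketch, the same counting could be done in Python. The sample log lines below are hypothetical (the real pybal log format may differ), and the host pattern is deliberately tighter than the original `mw.*wmnet`; both are assumptions made for illustration only.

```python
import re
from collections import Counter

# Tighter than the original 'mw.*wmnet' so a match cannot span spaces
# (an assumption made for this sketch).
HOST_RE = re.compile(r"WARN: (mw\S*wmnet)")


def count_fetch_failures(lines, timestamp):
    """Count 'fetch failed' warnings per host for lines containing timestamp."""
    counts = Counter()
    for line in lines:
        if timestamp in line and "fetch failed" in line.lower():
            match = HOST_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    # Most frequent hosts first, like `sort | uniq -c | sort -gr`.
    return counts.most_common()


# Hypothetical log lines; the real pybal log format may differ.
SAMPLE = [
    "May 21 14:48:01 WARN: mw1001.eqiad.wmnet (enabled/up/pooled): Fetch failed",
    "May 21 14:48:02 WARN: mw1002.eqiad.wmnet (enabled/up/pooled): Fetch failed",
    "May 21 14:48:03 WARN: mw1001.eqiad.wmnet (enabled/up/pooled): Fetch failed",
    "May 21 14:49:00 WARN: mw1003.eqiad.wmnet (enabled/up/pooled): Fetch failed",
]

for host, count in count_fetch_failures(SAMPLE, "May 21 14:48"):
    print(count, host)
```

Matching on the full 'May 21 14:48' string rather than just '14:48' mirrors the title change above, which narrowed the filter to a single day.
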
CDanis added a project to T223948: media seek controls not invokeable on Android 9 (Pie) and Pixel 3 XL: Android-app-Bugs.
May 21 2019, 12:17 PM · Wikipedia-Android-App-Backlog (Android-app-release-v2.7.28x-M-Mochi), Android-app-Bugs

May 20 2019

CDanis added a comment to T223934: Add annotations from ops vendor maintenance calendar to Grafana.

+1. In general I think it would be a great idea to do a lot more with annotations than we presently do:

May 20 2019, 8:24 PM · Operations
CDanis created T223924: pybal logs into logstash.
May 20 2019, 4:54 PM · Operations, Wikimedia-Logstash

May 19 2019

Wang_Qiliang awarded T222418: 503 errors for several Wikipedia pages a Party Time token.
May 19 2019, 2:04 PM · Wikimedia-Incident, Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712
Ankit-Maity awarded T222418: 503 errors for several Wikipedia pages a Pterodactyl token.
May 19 2019, 1:33 PM · Wikimedia-Incident, Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712
CDanis closed T222418: 503 errors for several Wikipedia pages as Resolved.

Thanks! We now believe this is resolved.

May 19 2019, 12:38 PM · Wikimedia-Incident, Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712
CDanis added a comment to T222418: 503 errors for several Wikipedia pages.

For posterity:

May 19 2019, 12:28 PM · Wikimedia-Incident, Traffic, Operations, Wikimedia-General-or-Unknown, User-DannyS712

May 14 2019

CDanis added a comment to T223319: URL shortener subdomains for useful Wikimedia infrastructure.

Just want to throw out the possibility that, in the future, some of these underlying tools may change and the unique identifier for a service may no longer align with the unique ID in the new service. IOW: some of these cool URLs might change ;)

May 14 2019, 7:50 PM · Operations
CDanis created T223319: URL shortener subdomains for useful Wikimedia infrastructure.
May 14 2019, 7:18 PM · Operations

May 13 2019

CDanis added a comment to T197126: Create tool to handle the state of database configuration in MediaWiki in etcd.

It would be nice to have a mockup of the API to test soon (with no production effect except maybe some debug information). That would allow us to test automation from the scripts we already have. I think that would be step #6?

May 13 2019, 4:19 PM · User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
Jdforrester-WMF awarded T197126: Create tool to handle the state of database configuration in MediaWiki in etcd a Like token.
May 13 2019, 3:48 PM · User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
Jdforrester-WMF awarded T197126: Create tool to handle the state of database configuration in MediaWiki in etcd a Like token.
May 13 2019, 3:47 PM · User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
CDanis claimed T197126: Create tool to handle the state of database configuration in MediaWiki in etcd.
May 13 2019, 3:36 PM · User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA
CDanis added a comment to T197126: Create tool to handle the state of database configuration in MediaWiki in etcd.

Here's my tentative plan for moving forward with this, including a rollout procedure:

May 13 2019, 3:33 PM · User-ArielGlenn, Patch-For-Review, User-Joe, MediaWiki-Configuration, Operations, DBA

May 10 2019

CDanis added a comment to T220212: Wikimedia Technical Conference 2019: Discussion.

+1 to what @Joe said, and to what @jijiki said. Especially speaking as someone who has been at the Foundation only six months now.

May 10 2019, 1:07 PM · International-Developer-Events

May 8 2019

CDanis created P8495 swift codfw-prod final rebalance.
May 8 2019, 7:10 PM
CDanis added a comment to T219544: Make hadoop cluster able to push to swift.

Some quick notes from today's meeting:

May 8 2019, 3:28 PM · Patch-For-Review, Analytics-Kanban, Research, Operations, Discovery, Analytics

May 7 2019

CDanis created T222755: #wikimedia-sre is missing stashbot.
May 7 2019, 7:26 PM · Stashbot, Operations
CDanis added a comment to T221904: swift backend decomms / rebalances are noisy.

Trying out a few things here:

May 7 2019, 1:05 PM · observability, media-storage, Operations
CDanis claimed T221904: swift backend decomms / rebalances are noisy.
May 7 2019, 12:59 PM · observability, media-storage, Operations
CDanis added a comment to T222620: cp1083 crashed.

Interestingly, there was a memory usage spike right before the host crashed.

May 7 2019, 12:10 PM · Operations, ops-eqiad, Traffic

May 6 2019

CDanis created T222654: ms-be2043 'sdd' throwing lots of errors.
May 6 2019, 7:13 PM · User-fgiunchedi, ops-codfw, observability, media-storage, Operations
CDanis updated subscribers of T222391: Gerrit Hardware Upgrade.

cc @mark who I know is about to start looking at hardware requests for the coming FY

May 6 2019, 5:51 PM · Release-Engineering-Team (Development services), Release-Engineering-Team-TODO, serviceops, Operations, Gerrit
CDanis closed T222108: prometheus: some sort of IRC alerts on restarts? as Resolved.

We now have IRC alerting based on scraping each prometheus for its process_start_time_seconds metric.

May 6 2019, 4:39 PM · Patch-For-Review, Wikimedia-Incident, observability, Operations
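
The alert rule itself is not shown here, so the following is only a sketch of the underlying idea: a process restart shows up as a change in the scraped process_start_time_seconds value between consecutive samples. The function name and the sample values are hypothetical.

```python
def count_restarts(start_times):
    """Count how often consecutive samples of the start time differ.

    Each element is one scraped value of process_start_time_seconds;
    a change between neighbouring samples is treated as one restart.
    """
    return sum(
        1 for prev, cur in zip(start_times, start_times[1:]) if cur != prev
    )


# Hypothetical samples: the process restarted twice during this window.
samples = [1000.0, 1000.0, 1600.0, 1600.0, 1600.0, 2300.0]
print(count_restarts(samples))
```

An alert built on this idea would fire whenever the count over a recent window is greater than zero.
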
CDanis added a comment to T222605: CI is unavailable since around 10:00 UTC.

My patches are also stuck in the queue, and I'm seeing teammates manually V+2 their Puppet changes.

May 6 2019, 1:46 PM · Wikimedia-Incident, Continuous-Integration-Config, Release-Engineering-Team
CDanis updated the title for P8478 cdanis@icinga2001.wikimedia.org ~ % fgrep 'Too many open files' /var/log/syslog.1 | awk '{print $3}' | cut -d: -f1-2 | sort | uniq -c | phaste from Masterwork From Distant Lands to cdanis@icinga2001.wikimedia.org ~ % fgrep 'Too many open files' /var/log/syslog.1 | awk '{print $3}' | cut -d: -f1-2 | sort | uniq -c | phaste.
May 6 2019, 1:06 PM

May 3 2019

CDanis updated the title for P8473 curl --silent 'https://gerrit.wikimedia.org/r/changes/operations%2Fpuppet~507623/detail' | head -n5 from Masterwork From Distant Lands to curl --silent 'https://gerrit.wikimedia.org/r/changes/operations%2Fpuppet~507623/detail' | head -n5.
May 3 2019, 6:05 PM
CDanis edited P8473 curl --silent 'https://gerrit.wikimedia.org/r/changes/operations%2Fpuppet~507623/detail' | head -n5.
May 3 2019, 6:05 PM
CDanis closed T222112: figure out why Kafka dashboard hammers Prometheus, and fix it as Resolved.

It does seem much faster now, thanks @elukey ! Impact of loading 30 days on Prometheus is also minimal now -- modest CPU usage and while there was some increase in RAM consumption over baseline while we were both playing with this, it's not concerning. Thank you :)

May 3 2019, 2:43 PM · Wikimedia-Incident, Operations, observability
CDanis updated the title for P8471 dbctl config | head -n1 | jq . from Masterwork From Distant Lands to dbctl config | head -n1 | jq ..
May 3 2019, 12:34 PM
CDanis edited P8471 dbctl config | head -n1 | jq ..
May 3 2019, 12:34 PM

May 2 2019

CDanis added a comment to T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.

Also sorry, I don't have a lot of time left over this week; can take a deeper look next week

May 2 2019, 6:11 PM · Wikimedia-Incident, Operations, observability
CDanis added a comment to T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.

I think you should just be able to remove the "custom all value" in the dashboard settings and have it work. In this case Grafana will create its own 'all' value that is simply a regex OR'ing together all the known values, which it looks like it computes based on the cluster=kafka_jumbo hidden variable.

May 2 2019, 6:11 PM · Wikimedia-Incident, Operations, observability
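
To illustrate what "a regex OR'ing together all the known values" means, here is a minimal sketch. It is not Grafana's actual implementation; the function name and the broker names are hypothetical.

```python
import re


def all_value_regex(values):
    """Join known values into one alternation, escaping regex metacharacters."""
    return "(" + "|".join(re.escape(v) for v in values) + ")"


# Hypothetical variable values, e.g. broker names discovered from a label query.
pattern = all_value_regex(["broker1", "broker2"])
print(pattern)
print(re.fullmatch(pattern, "broker2") is not None)
print(re.fullmatch(pattern, "broker3") is not None)
```

Because the alternation lists only known values, a query using it selects exactly those series, whereas a hand-written custom 'all' value can match more than intended.
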
CDanis updated subscribers of T219544: Make hadoop cluster able to push to swift.

I got tied up with goal work and incident response and have only had a little time to spend on this.

May 2 2019, 5:37 PM · Patch-For-Review, Analytics-Kanban, Research, Operations, Discovery, Analytics
CDanis renamed T222362: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications from ms-be2043 /dev/sdd drive failure to swift-drive-audit unmounting a drive doesn't produce any alerts or notifications.
May 2 2019, 1:33 PM · observability, media-storage, Operations
CDanis added a comment to T222362: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications.

I think the 'real' thing we need to notify on here is when Swift decides it wants to stop using a disk (which it did here)

May 2 2019, 1:22 PM · observability, media-storage, Operations
CDanis created T222362: swift-drive-audit unmounting a drive doesn't produce any alerts or notifications.
May 2 2019, 1:15 PM · observability, media-storage, Operations

May 1 2019

CDanis added a comment to T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.

I've modified the Kafka dashboard so that only the Summary Row is uncollapsed by default. I've also changed the default time range to last 3 hours, rather than last 24.

May 1 2019, 1:48 PM · Wikimedia-Incident, Operations, observability

Apr 30 2019

CDanis closed T222105: prometheus: current query limits are insufficient to prevent OOMs as Resolved.

As documented in T222112#5147131 this didn't actually fix the dashboard at fault in this particular incident, but I've heard from another large-scale Prometheus user (and Prometheus dev) that they've had similar problems and recommend 10M as a value.

Apr 30 2019, 3:08 PM · Patch-For-Review, Wikimedia-Incident, observability, Operations
CDanis updated subscribers of T222112: figure out why Kafka dashboard hammers Prometheus, and fix it.

I'm pretty sure it is these panels that are responsible for most of the Prometheus load.

They take much longer to load than the rest of the panels, and some of them errored out with the new settings.

Apr 30 2019, 2:22 PM · Wikimedia-Incident, Operations, observability
CDanis added a comment to T219825: Update dashboards to node-exporter 0.16+ metric names.

I think https://grafana.wikimedia.org/d/000000607/cluster-overview might have been missed here? I see at least some old metrics being used there, e.g. node_memory_Cached in the "Memory per host" section.

Apr 30 2019, 2:06 PM · Patch-For-Review, observability