Tue, Jan 21

fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Tue, Jan 21, 11:40 AM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi closed T242609: Move thumbor to the logging pipeline, a subtask of T227080: Deprecate all non-Kafka logstash inputs, as Resolved.
Tue, Jan 21, 11:39 AM · observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi closed T242609: Move thumbor to the logging pipeline as Resolved.

Thumbor is now logging to localhost:11514, and from there logs are shipped to the logging pipeline. Resolving!

Tue, Jan 21, 11:39 AM · observability, Wikimedia-Logstash, User-fgiunchedi, Operations
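
As an illustration of the setup described in the T242609 comment, a Python service can send its logs to a local UDP syslog endpoint such as localhost:11514 using only the standard logging module. This is a minimal sketch, not Thumbor's actual configuration: the logger name and message format are assumptions, and UDP is assumed from the "localhost-udp compatibility endpoint" wording used in T212946.

```python
import logging
import logging.handlers


def make_syslog_logger(name, host="localhost", port=11514):
    """Return a logger that sends records as UDP syslog datagrams to host:port."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # SysLogHandler uses UDP (SOCK_DGRAM) by default when given a (host, port) tuple.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    # Assumed message layout; the real pipeline may expect a different format.
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Because the transport is UDP, creating the handler and logging succeed even when nothing is listening on the port, so delivery has to be verified on the receiving side.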

Mon, Jan 20

fgiunchedi added a comment to T242826: Increase retention time of graphite stats for CI.

No idea right off the bat @greg, although I see the panel Gate and Submit "Resident Time" on the top left is working as expected, so I take it that it is possible in some fashion

Mon, Jan 20, 9:34 AM · User-greg, Release-Engineering-Team (CI & Testing services), Graphite, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3))

Fri, Jan 17

fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Fri, Jan 17, 2:00 PM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Fri, Jan 17, 1:54 PM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Fri, Jan 17, 1:46 PM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Fri, Jan 17, 1:40 PM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi updated the task description for T213899: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs.
Fri, Jan 17, 1:33 PM · observability, Patch-For-Review, User-herron, Operations, Wikimedia-Logstash
fgiunchedi created T243065: Provision plaintext syslog collectors in esams/ulsfo/eqsin.
Fri, Jan 17, 11:03 AM · netops, observability, Operations
fgiunchedi created T243057: Move Prometheus off eqsin/ulsfo/esams bastions.
Fri, Jan 17, 10:07 AM · Operations, observability
fgiunchedi added a comment to T125408: Regularly & Automatically backup WMDE metrics stored in graphite.

@fgiunchedi Any idea if there is any sort of regular / scheduled backups of the disks for graphite nodes?
If not I'll try to put something in place for the few metrics we would like to not lose for now :)

Fri, Jan 17, 9:26 AM · Graphite, User-Addshore, Operations, WMDE-Analytics-Engineering
fgiunchedi closed T241790: (No Need By Date Provided) rack/setup/install restbase202[123] as Resolved.

All done, service is being implemented in T243000

Fri, Jan 17, 9:17 AM · Core Platform Team Workboards (Clinic Duty Team), ops-codfw, Operations
fgiunchedi updated the task description for T241790: (No Need By Date Provided) rack/setup/install restbase202[123].
Fri, Jan 17, 9:16 AM · Core Platform Team Workboards (Clinic Duty Team), ops-codfw, Operations
fgiunchedi reassigned T241790: (No Need By Date Provided) rack/setup/install restbase202[123] from fgiunchedi to Eevans.
Fri, Jan 17, 9:15 AM · Core Platform Team Workboards (Clinic Duty Team), ops-codfw, Operations

Wed, Jan 15

fgiunchedi reassigned T240882: (No Need By Date Provided) rack/setup/install logstash202[6-9].codfw.wmnet from fgiunchedi to herron.
Wed, Jan 15, 2:21 PM · Patch-For-Review, Operations, ops-codfw, Wikimedia-Logstash
fgiunchedi added a comment to T240882: (No Need By Date Provided) rack/setup/install logstash202[6-9].codfw.wmnet.

Thank you @Papaul and @Dzahn !

Wed, Jan 15, 1:51 PM · Patch-For-Review, Operations, ops-codfw, Wikimedia-Logstash
fgiunchedi added a comment to T242826: Increase retention time of graphite stats for CI.

In other panels I see data going back to Aug 2017 for core's gate-and-submit, e.g. https://grafana.wikimedia.org/d/000000108/releng-kpis?orgId=1&from=1496046004105&to=1579080951474&fullscreen&edit&panelId=2 maybe it has to do with the panel's query ?

Wed, Jan 15, 9:38 AM · User-greg, Release-Engineering-Team (CI & Testing services), Graphite, Release-Engineering-Team-TODO (2020-01 to 2020-03 (Q3))

Tue, Jan 14

fgiunchedi added a comment to T235891: Ingest production logs with ELK7.

Something else I realized today: with ELK7 we dropped our custom logstash template in favor of logstash's default, although we'll need to bump the default number of fields per index (1000 -> 4096) at least

Tue, Jan 14, 5:04 PM · User-fgiunchedi, Patch-For-Review, Operations, Wikimedia-Logstash
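
The field limit mentioned in the T235891 comment corresponds to the Elasticsearch index setting `index.mapping.total_fields.limit`, which defaults to 1000. A minimal sketch of a template body that raises it to 4096 might look like the following. The `logstash-*` pattern and the `order` value are assumptions for illustration, not the configuration actually deployed.

```python
import json


def logstash_template_override(total_fields_limit=4096):
    """Build an index-template body that only raises the per-index field limit."""
    return {
        # Hypothetical pattern; the real index naming is not given in the comment.
        "index_patterns": ["logstash-*"],
        # A higher order lets this template override settings from the default one.
        "order": 1,
        "settings": {
            # Elasticsearch defaults this limit to 1000 fields per index.
            "index.mapping.total_fields.limit": total_fields_limit,
        },
    }


print(json.dumps(logstash_template_override(), indent=2))
```

Keeping the override in a separate, small template leaves logstash's default template untouched, which fits the stated goal of dropping the custom one.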
jijiki awarded T242609: Move thumbor to the logging pipeline a Love token.
Tue, Jan 14, 10:33 AM · observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi added a comment to T232820: Security Concept Review For client side error logging js client.

AFAICT the code is basically ready to be merged, when does security review need to happen ?

@fgiunchedi - a few points:

  1. Given that the patch set is a config file change and some fairly minimal JavaScript for error-tracking, this probably wouldn't warrant a full security readiness review. Those are typically performed for new (or substantially rewritten) MediaWiki extensions, services, etc. I know our documentation isn't very clear on various review thresholds - hopefully we can improve that at some point.
  2. I'd imagine the Performance-Team might be interested in having a look at this if they haven't already.
  3. Prior to any patch review, we'd probably require that all of the relevant CI checks are passing.
  4. For the patch review itself, @Reedy or I would most likely focus on ensuring that the regexWebkit and regexGecko patterns are sufficiently hardened and if any potential attack vectors against EventGate are being introduced.
Tue, Jan 14, 10:22 AM · Security Concept Review, Security-Team

Mon, Jan 13

fgiunchedi added a subtask for T227080: Deprecate all non-Kafka logstash inputs: T225122: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline.
Mon, Jan 13, 2:30 PM · observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi added a parent task for T225122: Migrate services using deprecated Gelf logstash input to Kafka enabled logging pipeline: T227080: Deprecate all non-Kafka logstash inputs.
Mon, Jan 13, 2:30 PM · observability, Operations, Wikimedia-Logstash
fgiunchedi created T242609: Move thumbor to the logging pipeline.
Mon, Jan 13, 2:18 PM · observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi added a comment to T212946: Stream Thumbor logs to logstash.

Reviving this as part of this Q's OKRs to move services off logstash non-kafka inputs, I'll follow up with patches to move to the localhost-udp compatibility endpoint!

Mon, Jan 13, 2:14 PM · observability, Wikimedia-Logstash, User-jijiki, serviceops, Operations, Thumbor
fgiunchedi added a comment to T242579: Setup netconsole on upload@esams hosts.

In case it is helpful: we can reuse the centrallog hosts in codfw/eqiad. For site-local netconsole we'll instead need to set up local syslog collectors anyway (on ganeti VMs) for network device syslog, so we could piggyback on those.

Mon, Jan 13, 12:03 PM · Traffic, Operations
fgiunchedi created T242585: Move cassandra logging to logging pipeline.
Mon, Jan 13, 10:53 AM · observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi added a comment to T232820: Security Concept Review For client side error logging js client.

@fgiunchedi - any update or progress on this? If not, we might just want to decline this task for now until there's actually some code ready for review. Thanks.

Yes, progress in the form of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/553376 which is the client-side javascript part. The code is being reviewed and worked on although I don't think it is yet at the security-review stage. Having said that, feel free to close this task and we'll reopen/followup when needed.

Mon, Jan 13, 10:44 AM · Security Concept Review, Security-Team
fgiunchedi moved T178839: New upstream jvm-tools from Backlog to Radar on the User-fgiunchedi board.
Mon, Jan 13, 10:08 AM · Core Platform Team, User-Eevans, User-fgiunchedi, Operations
fgiunchedi moved T227080: Deprecate all non-Kafka logstash inputs from Backlog to Doing on the User-fgiunchedi board.
Mon, Jan 13, 10:08 AM · observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi moved T227108: Port varnishlog consumers to log to syslog / logging infra from Backlog to Doing on the User-fgiunchedi board.
Mon, Jan 13, 10:08 AM · Traffic, Patch-For-Review, observability, Wikimedia-Logstash, User-fgiunchedi, Operations
fgiunchedi closed T242471: Degraded RAID on ms-be1035 as Resolved.

Looks like the controller freaked out (T141756), firmware upgraded and rebooted.

Mon, Jan 13, 9:27 AM · SRE-swift-storage, ops-eqiad, Operations
fgiunchedi assigned T242511: Degraded RAID on ms-be1039 to Jclark-ctr.

@Jclark-ctr @Cmjohnson host is under warranty for another month according to netbox, please order a replacement for the failed 4TB disk (led is blinking), thanks!

Mon, Jan 13, 9:09 AM · SRE-swift-storage, ops-eqiad, Operations

Thu, Jan 9

fgiunchedi closed T241534: Degraded RAID on ms-be2035 as Resolved.

Thanks @Papaul ! Upon reboot the host booted into PXE, I assume because the first disk was present but unbootable and the host didn't fall back to booting from the second disk. Anyway, all good after a reimage, resolving.

Thu, Jan 9, 12:55 PM · SRE-swift-storage, Operations, ops-codfw

Tue, Jan 7

kostajh awarded Blog Post: The journey to Prometheus 2 a Yellow Medal token.
Tue, Jan 7, 12:23 PM
fgiunchedi added a comment to Blog Post: The journey to Prometheus 2.
In J184#2651, @mmodell wrote:

Thank you @fgiunchedi for sharing this story. Congratulations for working through some difficult bugs to roll out this important upgrade.

Tue, Jan 7, 10:51 AM
fgiunchedi added a comment to T240520: Produce dumps of commons thumbnail URLs.

How long does it take to list one of these swift containers, say the one for en wiki thumbs, which is probably among the largest?

This seems to get urls from swift at about 20k/sec, for the 1.3B commons thumbs that works out to about 18 hours. I didn't check enwiki, assuming commons would be an order of magnitude more than the others, but could look into it. If we want things to take less time that could be parallelized over the list of containers to dump (255), probably we could do 4 at a time or some such.

The script as written will also produce a listing for commonswiki, do we want that? How long would those containers take to list?

Commonswiki was the primary goal; as noted above, it takes around 18 hours. Compressed, the output is around 7GB.

How does swift hold up under this? Can we run one process doing commons and another/others doing the rest, or is that going to be a noticeable load on the servers? I'm going to add @fgiunchedi for comments on this.

Tue, Jan 7, 10:50 AM · Patch-For-Review, Dumps-Generation, Internet-Archive, Datasets-Archiving
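
The 18-hour estimate in the T240520 thread follows directly from the quoted numbers (1.3B thumbs at about 20k URLs/sec). The sketch below reproduces that arithmetic and shows what "4 at a time" would give under the assumption, not confirmed in the thread, that throughput scales linearly with the number of parallel listings and that containers are similar in size.

```python
def listing_hours(object_count, urls_per_second, parallelism=1):
    """Estimated wall-clock hours to list object_count URLs.

    Assumes throughput scales linearly with parallelism, which the thread
    does not confirm.
    """
    return object_count / (urls_per_second * parallelism) / 3600


serial = listing_hours(1_300_000_000, 20_000)
four_way = listing_hours(1_300_000_000, 20_000, parallelism=4)
print(f"serial: {serial:.1f} h, 4 at a time: {four_way:.1f} h")
# prints "serial: 18.1 h, 4 at a time: 4.5 h"
```

In practice the parallel figure is a lower bound, since the load on the Swift servers raised in the same thread may cap the achievable rate.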

Sat, Jan 4

mmodell awarded Blog Post: The journey to Prometheus 2 a Love token.
Sat, Jan 4, 8:06 PM

Thu, Jan 2

fgiunchedi changed the header image for post Blog Post: The journey to Prometheus 2.
Thu, Jan 2, 12:53 PM
fgiunchedi changed the header image for post Blog Post: The journey to Prometheus 2.
Thu, Jan 2, 12:52 PM
fgiunchedi changed the header image for post Blog Post: The journey to Prometheus 2.
Thu, Jan 2, 12:51 PM
fgiunchedi updated the post content for Blog Post: The journey to Prometheus 2.
Thu, Jan 2, 12:46 PM
fgiunchedi updated the post content for Blog Post: The journey to Prometheus 2.
Thu, Jan 2, 12:41 PM
fgiunchedi removed a project from T234698: ms-be1020 - firmware upgrade: (was: host went down): User-fgiunchedi.
Thu, Jan 2, 11:18 AM · ops-eqiad, SRE-swift-storage, Operations
fgiunchedi closed T239805: ms-fe2007 NIC failure as Resolved.

Host is back in service!

Thu, Jan 2, 11:10 AM · User-fgiunchedi, ops-codfw, Operations
fgiunchedi added a comment to T241593: cp1083: ats-tls and varnish-fe crashed due to insufficient memory.

I checked all "memory free" metrics as reported by node-exporter for the varnish case and indeed the numbers match, i.e. the kernel was reporting multiple GBs of memory free at the time of the crashes:

Thu, Jan 2, 10:58 AM · observability, Traffic, Operations
fgiunchedi added a project to T239805: ms-fe2007 NIC failure: User-fgiunchedi.
Thu, Jan 2, 10:38 AM · User-fgiunchedi, ops-codfw, Operations
fgiunchedi lowered the priority of T226373: Swift object servers become briefly unresponsive on a regular basis from High to Medium.

What is the right followup after a month? "I don't know" is an ok answer, I just want to clarify the status of the ticket, e.g. do ATS people need to be involved? Or is it just "being handled, just needs time"?

Thu, Jan 2, 10:35 AM · Performance-Team (Radar), User-jijiki, serviceops, Patch-For-Review, SRE-swift-storage, Operations
fgiunchedi merged T241714: Degraded RAID on ms-be2035 into T241534: Degraded RAID on ms-be2035.
Thu, Jan 2, 10:09 AM · SRE-swift-storage, Operations, ops-codfw
fgiunchedi merged task T241714: Degraded RAID on ms-be2035 into T241534: Degraded RAID on ms-be2035.
Thu, Jan 2, 10:09 AM · Operations, ops-codfw
fgiunchedi added a comment to T241534: Degraded RAID on ms-be2035.

@Papaul host is in warranty and looks like an SSD failed, could we get that replaced (led is blinking), thanks!

Thu, Jan 2, 10:08 AM · SRE-swift-storage, Operations, ops-codfw
fgiunchedi merged task T241535: Degraded RAID on ms-be2035 into T241534: Degraded RAID on ms-be2035.
Thu, Jan 2, 10:07 AM · SRE-swift-storage, Operations, ops-codfw
fgiunchedi merged T241535: Degraded RAID on ms-be2035 into T241534: Degraded RAID on ms-be2035.
Thu, Jan 2, 10:07 AM · SRE-swift-storage, Operations, ops-codfw
fgiunchedi added a comment to T239805: ms-fe2007 NIC failure.

Configured both ports to use PXE when booting, now the host is running the reimage correctly:

Thu, Jan 2, 9:46 AM · User-fgiunchedi, ops-codfw, Operations
fgiunchedi added a comment to T239805: ms-fe2007 NIC failure.

@Papaul thanks! I tried to reimage the host but the NIC shows its link as offline from bios:

Link Status                                           <Disconnected>         *
Thu, Jan 2, 9:36 AM · User-fgiunchedi, ops-codfw, Operations
fgiunchedi added a comment to T239805: ms-fe2007 NIC failure.

@Papaul thanks! I tried to reimage the host but the NIC shows its link as offline from bios:

Thu, Jan 2, 9:32 AM · User-fgiunchedi, ops-codfw, Operations
fgiunchedi added a comment to T211661: Automatically clean up unused thumbnails in Swift.

Analytics now publishes media access stats, might be useful to drive some/all thumbnail cleanup: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Mediarequests

Thu, Jan 2, 9:09 AM · User-jijiki, Patch-For-Review, Traffic, SRE-swift-storage, Operations, Performance-Team

Dec 20 2019

fgiunchedi awarded T226986: Client side error logging production launch a Love token.
Dec 20 2019, 8:55 AM · Desktop Improvements, Patch-For-Review, observability, Wikimedia-Logstash, User-fgiunchedi, Better Use Of Data, Product-Infrastructure-Team-Backlog, Epic, Analytics

Dec 19 2019

ema awarded Blog Post: The journey to Prometheus 2 a Burninate token.
Dec 19 2019, 3:26 PM
fgiunchedi published Blog Post: The journey to Prometheus 2.
Dec 19 2019, 2:18 PM

Dec 18 2019

fgiunchedi added a comment to T240685: MediaWiki Prometheus support.

My two cents re: prometheus_client_php current adapters, apcu and redis, since AIUI none are optimal/desirable. There might be another way inspired by what the ruby client does (see also this PromCon 2019 talk and video). Very broadly, the idea IIRC is for each process to mmap a separate file with its own metrics, then at collection time metrics are read from the files and merged.

Dec 18 2019, 9:55 AM · Operations, MediaWiki-General, observability
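
The per-process file idea sketched in the T240685 comment can be illustrated in a few lines of Python. This is a toy model under stated assumptions, not the code of any Prometheus client library: the `FileCounter` class, the `collect` function, the one-counter-per-file layout and the `<metric>_<pid>.db` naming are all hypothetical.

```python
import mmap
import os
import struct

VALUE = struct.Struct("<q")  # one signed 64-bit counter per file


class FileCounter:
    """A counter backed by a memory-mapped file owned by a single process."""

    def __init__(self, directory, metric, pid):
        self.path = os.path.join(directory, f"{metric}_{pid}.db")
        # Start from zero, then map the file so increments are plain memory writes.
        with open(self.path, "wb") as f:
            f.write(VALUE.pack(0))
        self._file = open(self.path, "r+b")
        self._map = mmap.mmap(self._file.fileno(), VALUE.size)

    def inc(self, amount=1):
        (current,) = VALUE.unpack_from(self._map, 0)
        VALUE.pack_into(self._map, 0, current + amount)

    def close(self):
        self._map.flush()
        self._map.close()
        self._file.close()


def collect(directory, metric):
    """Merge per-process files by summing their counters at scrape time."""
    total = 0
    for name in os.listdir(directory):
        if name.startswith(f"{metric}_") and name.endswith(".db"):
            with open(os.path.join(directory, name), "rb") as f:
                total += VALUE.unpack(f.read(VALUE.size))[0]
    return total
```

Since each process writes only to its own file, no cross-process locking is needed on the write path. The merge cost is paid once per collection instead.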

Dec 17 2019

fgiunchedi added a comment to T240789: Return traffic to eqiad WMCS triggering FNM.

FWIW I think if the current thresholds are good at detecting DDoS we should explicitly whitelist WMCS ranges with say 1.5x the current thresholds and see how far that gets us.

Dec 17 2019, 9:48 AM · Patch-For-Review, cloud-services-team (Kanban), Operations, netops

Dec 16 2019

fgiunchedi added a comment to T232820: Security Concept Review For client side error logging js client.

@fgiunchedi - any update or progress on this? If not, we might just want to decline this task for now until there's actually some code ready for review. Thanks.

Dec 16 2019, 3:17 PM · Security Concept Review, Security-Team
fgiunchedi updated the task description for T236075: Evaluate, suggest and choose an alert escalation solution.
Dec 16 2019, 3:08 PM · User-fgiunchedi, observability
fgiunchedi moved T231086: Picture from Commons not found from Singapore from Doing to Radar on the User-fgiunchedi board.
Dec 16 2019, 3:07 PM · Performance-Team (Radar), User-fgiunchedi, Structured-Data-Backlog, Structured Data Engineering, Multimedia, MW-1.34-notes (1.34.0-wmf.21; 2019-09-03), Patch-For-Review, Commons, MediaWiki-File-management, SRE-swift-storage, Traffic, Operations
fgiunchedi moved T156955: Standardizing our partman recipes from Backlog to Doing on the User-fgiunchedi board.
Dec 16 2019, 3:07 PM · Patch-For-Review, User-fgiunchedi, Operations
fgiunchedi edited projects for T156955: Standardizing our partman recipes, added: User-fgiunchedi; removed Patch-For-Review.
Dec 16 2019, 3:04 PM · Patch-For-Review, User-fgiunchedi, Operations
fgiunchedi closed T240798: Degraded RAID on ms-be2016 as Resolved.

hw raid firmware upgraded, resolving

Dec 16 2019, 9:25 AM · Operations, ops-codfw

Dec 13 2019

fgiunchedi updated the description for Observing the observable.
Dec 13 2019, 3:24 PM
fgiunchedi changed the profile image for blog Observing the observable.
Dec 13 2019, 3:23 PM
fgiunchedi changed the profile image for blog Observing the observable.
Dec 13 2019, 3:23 PM
fgiunchedi created Observing the observable.
Dec 13 2019, 3:11 PM
fgiunchedi updated the task description for T240667: Ingestion errors for production logs on ELK7.
Dec 13 2019, 2:14 PM · User-fgiunchedi, Operations, Wikimedia-Logstash
fgiunchedi updated the task description for T240667: Ingestion errors for production logs on ELK7.
Dec 13 2019, 2:11 PM · User-fgiunchedi, Operations, Wikimedia-Logstash
fgiunchedi updated the task description for T240667: Ingestion errors for production logs on ELK7.
Dec 13 2019, 2:09 PM · User-fgiunchedi, Operations, Wikimedia-Logstash
fgiunchedi renamed T239090: Restbase logging indexing conflict on 'res' and 'body' logging fields from Restbase logging indexing conflict to Restbase logging indexing conflict on 'res' and 'body' logging fields.
Dec 13 2019, 2:07 PM · Core Platform Team Workboards (Clinic Duty Team), User-fgiunchedi, Wikimedia-Logstash, RESTBase
fgiunchedi added a comment to T239090: Restbase logging indexing conflict on 'res' and 'body' logging fields.

After getting a little more perspective in T240667 it seems that indeed res and body are sometimes sent as strings and sometimes as nested objects.

Dec 13 2019, 2:05 PM · Core Platform Team Workboards (Clinic Duty Team), User-fgiunchedi, Wikimedia-Logstash, RESTBase
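
The conflict described in the T239090 comment arises because Elasticsearch maps each field to a single type per index, so a field that arrives as a string in one event and as a nested object in another cannot be indexed consistently. A small sketch of how such fields could be spotted in a batch of events follows. The function name and the sample events are hypothetical and not taken from the actual RESTBase logs.

```python
from collections import defaultdict


def find_type_conflicts(events):
    """Return fields seen with more than one JSON type across events."""
    seen = defaultdict(set)
    for event in events:
        for field, value in event.items():
            kind = "object" if isinstance(value, dict) else type(value).__name__
            seen[field].add(kind)
    return {field: sorted(kinds) for field, kinds in seen.items() if len(kinds) > 1}


# Hypothetical events, shaped like the conflict described above.
events = [
    {"msg": "ok", "res": "200 OK", "body": "plain text"},
    {"msg": "failed", "res": {"status": 500}, "body": {"detail": "boom"}},
]
print(find_type_conflicts(events))
```

Running a check like this on a sample of producer output would show which fields need to be normalized to one type before ingestion.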
fgiunchedi updated the task description for T240667: Ingestion errors for production logs on ELK7.
Dec 13 2019, 2:03 PM · User-fgiunchedi, Operations, Wikimedia-Logstash
fgiunchedi closed T69116: number of uploads in graphite as Resolved.

Resolving as we're using metrics from swift to plot original uploads: https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?orgId=1&from=now-3h&to=now-1m&var-DC=eqiad&var-prometheus=eqiad%20prometheus%2Fops&refresh=5m&fullscreen&panelId=26

Dec 13 2019, 1:39 PM · Wikimedia-General-or-Unknown
fgiunchedi added a comment to T240048: Make grafana-next.wm.o HTTP 302 redirect to grafana.wm.o.

CC @fgiunchedi , although maybe it was someone else from Foundations that worked on this?

Dec 13 2019, 10:53 AM · observability, Operations
fgiunchedi created T240667: Ingestion errors for production logs on ELK7.
Dec 13 2019, 10:41 AM · User-fgiunchedi, Operations, Wikimedia-Logstash
fgiunchedi added a comment to T239090: Restbase logging indexing conflict on 'res' and 'body' logging fields.

Some more details, afaict it is errors from mobileapps and graphoid that fail to be emitted and parsed correctly

Dec 13 2019, 8:59 AM · Core Platform Team Workboards (Clinic Duty Team), User-fgiunchedi, Wikimedia-Logstash, RESTBase

Dec 12 2019

fgiunchedi added a comment to T240560: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down.

rsyslog does indeed use librdkafka so it might be that! re: losing logs AFAICT that's not happening at the moment in the sense that there aren't discards reported by rsyslog-exporter (also dashboard at https://grafana.wikimedia.org/d/000000596/rsyslog)

Dec 12 2019, 4:39 PM · Operations, serviceops, observability
fgiunchedi added a comment to T236832: /etc/php/php7-fatal-error.php uses unsafe ob_start.

I thought maybe the puppet change didn't propagate to the wtp servers, but that's not the case I realise now.
@fgiunchedi That's the same class of issue, but other than being a violation of the same constraint in PHP, I see no indication that it is caused by php7-fatal-error.php.
It has a different stack trace than the one reported in this task. Or rather, these have no stack trace.

Task description
PHP Fatal Error: ob_start(): Cannot use output buffering in output buffering display handlers
from line 57 of /etc/php/php7-fatal-error.php:
Currently Logstash entries for "Cannot use output buffering" from wpt* servers
PHP Fatal error:  Unknown: Cannot use output buffering in output buffering display handlers in Unknown on line 0
exception.file	"Unknown:0"
exception.message	"Unknown: Cannot use output buffering in output buffering display handlers"
exception.trace	""

Note that all fatal errors from MediaWiki and Parsoid-php in production are caught by and reported through "php7-fatal-error.php". The issue in this task however was an output buffer bug existing in the php7-fatal-error.php script itself. Very confusing, I know :)

Dec 12 2019, 8:48 AM · Core Platform Team, Performance-Team (Radar), observability, MediaWiki-Debug-Logger

Dec 11 2019

fgiunchedi closed T188917: puppetization of check_prometheus is not robust to the use of single quotes as Resolved.

Complete! check_prometheus will fail on queries with single quotes

Dec 11 2019, 11:49 AM · observability, Puppet
fgiunchedi added a comment to T240379: Explicitly state the timezone in grafana.

grafana.wikimedia.org defaults to UTC, logged-in users can change their settings of course!

Dec 11 2019, 11:49 AM · Upstream, observability
fgiunchedi closed T186069: Icinga: page in case all MediaWiki are throwing 5xx as Resolved.

Tentatively resolving, we'll be paging if >= 1% of global traffic is 5xx

Dec 11 2019, 11:34 AM · Wikimedia-Incident, Icinga, Operations, observability

Dec 10 2019

fgiunchedi closed T154619: Export ipsec counters as Prometheus metrics as Declined.

Given we're moving off ipsec for most/all use cases I'm boldly declining

Dec 10 2019, 2:40 PM · observability, Operations
fgiunchedi added a comment to T167422: Monitoring: add link to graph for Icinga timeseries alarms.

We have dashboard_links and notes_link now for check_prometheus and monitoring::service enforces the presence of notes_url. @Volans is there anything else to do here you think ?

Dec 10 2019, 2:39 PM · Operations, observability
fgiunchedi closed T95801: Allow customizing the alert message from graphite as Declined.

We're getting rid of graphite and graphite-based alerts over time, declining but please reopen if needed!

Dec 10 2019, 2:34 PM · Operations, observability
fgiunchedi moved T192948: Upgrade prometheus-jmx-exporter on all services using it from Inbox to Backlog on the observability board.
Dec 10 2019, 2:20 PM · Core Platform Team Legacy (Watching / External), User-Elukey, observability, Analytics, Puppet, Services (watching), Cassandra
fgiunchedi moved T151009: Provide authenticated access to Prometheus native web interface from Inbox to Backlog on the observability board.
Dec 10 2019, 2:19 PM · observability, Patch-For-Review, User-fgiunchedi, Operations, Prometheus-metrics-monitoring
fgiunchedi moved T183146: Monitor resource usage on a per-cgroup basis from Inbox to Backlog on the observability board.
Dec 10 2019, 2:19 PM · Operations, observability
fgiunchedi moved T167689: Add RIPE atlas data to Prometheus from Inbox to In progress on the observability board.
Dec 10 2019, 2:18 PM · observability, Operations
fgiunchedi moved T171482: Programmatic generation of grafana dashboards from Inbox to Backlog on the observability board.
Dec 10 2019, 2:18 PM · Patch-For-Review, Graphite, User-fgiunchedi, observability, Operations
fgiunchedi moved T160071: Add slabinfo prometheus exporter from Inbox to Backlog on the observability board.
Dec 10 2019, 2:17 PM · Operations, observability
fgiunchedi moved T152967: Investigate usage of service dependencies in icinga from Inbox to Backlog on the observability board.
Dec 10 2019, 2:17 PM · observability
fgiunchedi moved T143556: Setting up grafana should also setup Anonymous read-only access for the default org from Inbox to Backlog on the observability board.
Dec 10 2019, 2:17 PM · observability, Cloud-Services, Operations
fgiunchedi moved T152445: Move prometheus entry point off port 80 from Inbox to Backlog on the observability board.
Dec 10 2019, 2:14 PM · observability, Prometheus-metrics-monitoring, Operations
fgiunchedi closed T141520: "MediaWiki exceptions and fatals per minute" alarm is too slow (half an hour delay!) as Resolved.

I believe nowadays the alert is based on metrics from logstash and appears on IRC in a timely fashion, resolving but please do reopen if it occurs again.

Dec 10 2019, 2:14 PM · Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, Operations, observability