Dec 10 2019

fgiunchedi closed T124185: Evaluate alternative web interfaces to icinga 1 core, a subtask of T124179: Improve access to and control over incident and metrics monitoring infrastructure, as Declined.
Dec 10 2019, 2:12 PM · observability, Tracking-Neverending, SRE
fgiunchedi closed T124185: Evaluate alternative web interfaces to icinga 1 core as Declined.

Declining as these points are covered by the alerting roadmap. Feel free to reopen if needed!

Dec 10 2019, 2:12 PM · observability, SRE
fgiunchedi closed T124179: Improve access to and control over incident and metrics monitoring infrastructure as Declined.

Declining as these points are covered by the alerting roadmap. Feel free to reopen if needed!

Dec 10 2019, 2:11 PM · observability, Tracking-Neverending, SRE
fgiunchedi closed T114651: non sms alternatives as Invalid.

We'll indeed be investigating non-SMS alternatives as a requirement for page escalation. Resolving, but please reopen if needed!

Dec 10 2019, 2:08 PM · SRE, observability
fgiunchedi closed T108985: Monitor MediaWiki sessions as Resolved.

This was done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/350555

Dec 10 2019, 2:05 PM · SRE, observability, Sustainability
fgiunchedi moved T238795: The "logstash-*" index pattern does not contain any of the following field types: ip from Inbox to Backlog on the observability board.
Dec 10 2019, 2:01 PM · SRE, observability
fgiunchedi moved T230570: De-noise systemd alerts (Reduce Icinga alert noise goal) from Inbox to Backlog on the observability board.
Dec 10 2019, 2:01 PM · Observability-Alerting, Patch-For-Review, Goal
fgiunchedi moved T238807: Clean up ORES metrics from Inbox to In progress on the observability board.
Dec 10 2019, 2:01 PM · observability, SRE

Dec 9 2019

fgiunchedi added a comment to T159613: Host lookup failed [-9999]: Unknown error -9999.

Looks like this error is HHVM-specific and I couldn't find other occurrences in logstash, ok to resolve and keep investigating T230245 ?

It isn't HHVM-specific (I'm not even sure it's monolog-specific, but that's the code where it actually surfaces), though it may have appeared in the logs more frequently under HHVM (i.e. in 2017/2018, when we were running HHVM).

Certainly, as per T230245#5582062, if you run the script on PHP7 and give it a high enough quantity (i.e. the 10K), you'll be able to get the error too.

With my hacky workaround in place, it's probably not happening now and as such isn't in the logs.

Dec 9 2019, 3:01 PM · observability, Wikimedia-Logstash, Wikimedia-production-error, MediaWiki-Debug-Logger
fgiunchedi added a comment to T226373: Swift object servers become briefly unresponsive on a regular basis.

Re: ATS and client timeouts and retries: yes, it seems ATS does retry on origin timeout. Otherwise a 504 is returned to the user. For cache_upload there are indeed a few 504s on the backend, but none on the TLS frontend.

Dec 9 2019, 2:18 PM · serviceops, SRE-swift-storage, SRE
fgiunchedi moved T181536: ORES worker icinga message not specific enough from Inbox to Radar on the observability board.
Dec 9 2019, 12:12 PM · ORES, Machine-Learning-Team
fgiunchedi moved T181542: Monitoring for top IPs and User-Agents hitting the ORES service from Inbox to Radar on the observability board.
Dec 9 2019, 12:12 PM · Machine-Learning-Team
fgiunchedi moved T182160: Develop tests for phabricator search to detect regressions / search quality issues from Inbox to Radar on the observability board.
Dec 9 2019, 12:11 PM · Phabricator (Search), Release-Engineering-Team (Seen), User-MModell, Browser-Tests, observability
fgiunchedi moved T186069: Icinga: page in case all MediaWiki are throwing 5xx from Inbox to Up next on the observability board.
Dec 9 2019, 12:11 PM · Sustainability (Incident Followup), Icinga, SRE, observability
fgiunchedi added a comment to T186069: Icinga: page in case all MediaWiki are throwing 5xx.

We have availability-based alerts now (i.e. 5xx / all status codes) for varnish and ATS. I believe those can be made paging now, as we haven't seen false positives with the 99.5% (warn) and 99% (crit) thresholds.
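As a minimal sketch of how such an availability check could be evaluated (not the production alert definition): availability is taken here as one minus the ratio of 5xx responses to all responses, compared against the 99.5% and 99% thresholds mentioned above. The function names, the strict "below threshold" comparison, and the handling of zero traffic are assumptions made for illustration.

```python
def availability(error_count: int, total_count: int) -> float:
    """Return the fraction of requests that did not end in a 5xx."""
    if total_count == 0:
        # Assumption: no traffic is treated as fully available.
        return 1.0
    return 1.0 - error_count / total_count


def classify(avail: float, warn: float = 0.995, crit: float = 0.99) -> str:
    """Map an availability ratio to an alert state.

    Assumption: the alert fires when availability drops strictly
    below a threshold.
    """
    if avail < crit:
        return "CRITICAL"
    if avail < warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # Hypothetical counts: (5xx responses, all responses)
    for errors, total in [(3, 1000), (7, 1000), (20, 1000)]:
        print(errors, total, classify(availability(errors, total)))
```

Keeping the ratio and the thresholds separate makes it easy to try other warn/crit values before deciding which state should page.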

Dec 9 2019, 12:11 PM · Sustainability (Incident Followup), Icinga, SRE, observability
fgiunchedi added a comment to T184714: Puppet fail to properly refresh Icinga.

I'm wondering if we've seen this behavior again? (i.e. certain icinga changes are not applied on puppet refresh)

Dec 9 2019, 12:06 PM · SRE Observability, SRE
fgiunchedi moved T180105: Set up a statsv-like endpoint for Prometheus from Inbox to Radar on the observability board.
Dec 9 2019, 12:03 PM · Grafana, SRE, observability
fgiunchedi moved T140282: Create a grafana dashboard for logstash*.eqiad.wmnet based on search dashboards from Inbox to Externally blocked on the observability board.
Dec 9 2019, 11:55 AM · Observability-Logging, observability, Wikimedia-Logstash
fgiunchedi added a comment to T140282: Create a grafana dashboard for logstash*.eqiad.wmnet based on search dashboards.

@EBernhardson has this eventually been done, and does it show up in dashboards?

Dec 9 2019, 11:55 AM · Observability-Logging, observability, Wikimedia-Logstash
fgiunchedi moved T140751: Create a logstash input filter to preprocess mysqld syslog messages from Inbox to Radar on the observability board.
Dec 9 2019, 11:53 AM · observability, Wikimedia-Logstash, Beta-Cluster-Infrastructure, OKR-Work
fgiunchedi moved T141500: [toolforge.infra] Setup centralized logging for the infra (ELK possibly) from Inbox to Radar on the observability board.
Dec 9 2019, 11:53 AM · Observability-Logging, observability, Wikimedia-Logstash, Toolforge
fgiunchedi closed T152782: Kibana functionality missing after upgrade: histograms as Invalid.

I'm boldly declining this task for now as there hasn't been activity and/or other use cases / feature requests. Feel free to reopen if needed!

Dec 9 2019, 11:53 AM · observability, Platform Team Legacy (Watching / External), SRE, Wikimedia-Logstash, Services (watching)
fgiunchedi moved T166107: Cleanup old logstash logs (application and JVM GC) from Inbox to Up next on the observability board.
Dec 9 2019, 11:50 AM · Observability-Logging, observability, Wikimedia-Logstash
fgiunchedi renamed T166107: Cleanup old logstash logs (application and JVM GC) from logrotate and logstash logs does not play well together to Cleanup old logstash logs (application and JVM GC).
Dec 9 2019, 11:50 AM · Observability-Logging, observability, Wikimedia-Logstash
fgiunchedi added a comment to T166107: Cleanup old logstash logs (application and JVM GC).

The original issue is gone, taking over this issue for general cleanup on logstash logs (including GC)

Dec 9 2019, 11:49 AM · Observability-Logging, observability, Wikimedia-Logstash
fgiunchedi moved T190455: Logstash no longer captures DB queries in debug mode from Inbox to Radar on the observability board.
Dec 9 2019, 11:44 AM · Platform Engineering, Developer Productivity, MediaWiki-libs-Rdbms, observability, MediaWiki-Debug-Logger
fgiunchedi moved T165675: Fatalmonitor on logstash still includes deprecated channel:wfLogDBError from Inbox to Radar on the observability board.
Dec 9 2019, 11:44 AM · observability, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, Wikimedia-Logstash, Wikimedia-production-error
fgiunchedi moved T180051: Reduce the number of fields declared in elasticsearch by logstash from Inbox to Up next on the observability board.
Dec 9 2019, 11:43 AM · Observability-Logging, observability, Patch-For-Review, Platform Team Legacy (Watching / External), Services (watching), SRE, Wikimedia-Logstash
fgiunchedi moved T184602: logstash-beta.wmflabs.org default dashboard missing from Inbox to Radar on the observability board.
Dec 9 2019, 11:43 AM · observability, Beta-Cluster-Infrastructure, Wikimedia-Logstash
fgiunchedi moved T189333: Changing Kibana filters is ridiculously slow from Inbox to Up next on the observability board.
Dec 9 2019, 11:43 AM · Developer Productivity, User-fgiunchedi, observability, Traffic, SRE, User-Addshore, Wikimedia-Logstash
fgiunchedi moved T204845: logstash-beta.wmflab throws multiple "Error: Could not locate that visualization" from Inbox to Radar on the observability board.
Dec 9 2019, 11:43 AM · Release-Engineering-Team (Radar), observability, SRE, Wikimedia-Logstash, Beta-Cluster-Infrastructure
fgiunchedi moved T214031: Investigate missing WikibaseQualityConstraints logs in logstash. from Inbox to Radar on the observability board.
Dec 9 2019, 11:43 AM · Observability-Logging, observability, User-Addshore, SRE, Wikimedia-Logstash
fgiunchedi closed T214309: kafka / logstash / elasticsearch lag monitoring and alerting as Resolved.

We're alerting on kafka-logging consumer lag now, resolving

Dec 9 2019, 11:42 AM · observability, Wikimedia-Logstash
fgiunchedi moved T221052: config file change canarying for logstash from Inbox to Up next on the observability board.
Dec 9 2019, 11:40 AM · observability, SRE, Wikimedia-Logstash
fgiunchedi closed T226974: Failed to index error logs from SwiftFileBackend::doStoreInternal as Declined.

We've separated indices now, so this specific error has been resolved. There are other logging conflicts still left, of course.

Dec 9 2019, 11:38 AM · observability, MediaWiki-General, Wikimedia-Logstash
fgiunchedi moved T211984: Logstash in beta fails periodically from Inbox to Radar on the observability board.
Dec 9 2019, 11:36 AM · SRE Observability (FY2021/2022-Q2), observability, Beta-Cluster-Infrastructure, Wikimedia-Logstash
fgiunchedi moved T213902: Implement sensitive logstash access control from Inbox to Up next on the observability board.
Dec 9 2019, 11:36 AM · Patch-Needs-Improvement, User-herron, Observability-Logging
fgiunchedi moved T215904: Better understanding of Logstash performance from Inbox to Up next on the observability board.
Dec 9 2019, 11:36 AM · User-fgiunchedi, observability, Wikimedia-Logstash
fgiunchedi moved T217556: Decommission old eqiad logstash hardware hosts logstash100[456] from Inbox to Radar on the observability board.
Dec 9 2019, 11:35 AM · observability, decommission-hardware, ops-eqiad, User-herron, SRE, Wikimedia-Logstash
fgiunchedi moved T159613: Host lookup failed [-9999]: Unknown error -9999 from Inbox to Radar on the observability board.
Dec 9 2019, 11:33 AM · observability, Wikimedia-Logstash, Wikimedia-production-error, MediaWiki-Debug-Logger
fgiunchedi added a comment to T159613: Host lookup failed [-9999]: Unknown error -9999.

Looks like this error is HHVM-specific and I couldn't find other occurrences in logstash, ok to resolve and keep investigating T230245 ?

Dec 9 2019, 11:33 AM · observability, Wikimedia-Logstash, Wikimedia-production-error, MediaWiki-Debug-Logger
fgiunchedi moved T230733: Expose pooled status of gdnsd and conftool managed services as metrics from Inbox to Radar on the observability board.
Dec 9 2019, 11:29 AM · User-CDanis, SRE, observability
fgiunchedi moved T224888: Network port utilization alerts should be paging from Inbox to Up next on the observability board.
Dec 9 2019, 11:18 AM · Infrastructure-Foundations, observability, Traffic, netops, SRE
fgiunchedi added a comment to T236832: /etc/php/php7-fatal-error.php uses unsafe ob_start.

Looks like we're still getting this from time to time on wtp hosts:

Dec 9 2019, 11:18 AM · Platform Engineering, Performance-Team (Radar), observability, MediaWiki-Debug-Logger
fgiunchedi moved T238296: job queue insert rate metrics gone from Grafana from Inbox to Radar on the observability board.
Dec 9 2019, 11:15 AM · Platform Team Workboards (Clinic Duty Team), serviceops, WMF-JobQueue, MediaWiki-Core-JobQueue, observability
fgiunchedi closed T238794: dropped packets to kafkamon 9000/tcp as Resolved.

Done! packet drops are gone

Dec 9 2019, 11:14 AM · Data-Platform-SRE, SRE Observability, SRE
fgiunchedi moved T239121: VE edit data stopped due to statsv falling over (?) on webperf1001 from Inbox to Radar on the observability board.
Dec 9 2019, 11:08 AM · Analytics-Radar, Performance-Team (Radar), observability, Editing-team
fgiunchedi moved T239833: StatsD Exporter drops relayed metrics from Inbox to In progress on the observability board.
Dec 9 2019, 11:07 AM · observability, SRE

Dec 5 2019

fgiunchedi removed a project from T191659: Configure a threshold for earlier notification of /srv/cassandra/instance-data: User-fgiunchedi.
Dec 5 2019, 1:35 PM · Platform Team Workboards (Platform Engineering Reliability), Platform Team Legacy (Later), Patch-For-Review, SRE, Services (next), RESTBase-Cassandra, Cassandra
fgiunchedi created T239907: mtail stuck on some mw hosts.
Dec 5 2019, 1:26 PM · observability
fgiunchedi archived P9823 Masterwork From Distant Lands.
Dec 5 2019, 1:25 PM
fgiunchedi added a comment to T236573: "etcd" Cloud VPS project jessie deprecation.

For what it's worth, I have no usage of these machines nor the project.

Dec 5 2019, 11:16 AM · Cloud-VPS (Debian Jessie Deprecation)
fgiunchedi added a comment to T174432: Unclear LVS bandwidth graph in "load balancers" dashboard.

Are the non-icmp graphs somehow LVS-specific?

Yes, the metrics are: node_ipvs_backend_connections_active, node_ipvs_incoming_packets_total, node_ipvs_incoming_bytes_total. The icmp graph instead plots node_netstat_Icmp_InMsgs.

The text panel @fgiunchedi added is correct, so I guess that should be enough to clarify the ambiguity? Alternatively, we could move the ICMP graphs to a new dashboard with host-specific metrics only.

Dec 5 2019, 10:37 AM · SRE, Traffic
fgiunchedi closed T236700: 'LVS connections' graph on Load Balancers dashboard takes a rate of a gauge as Resolved.

Fixed now and 'load balancers' dashboard adjusted

Dec 5 2019, 10:35 AM · Traffic, observability, SRE
fgiunchedi added a comment to T226373: Swift object servers become briefly unresponsive on a regular basis.

I've investigated the scope and impact of this issue a bit, namely by joining the transaction IDs for which swift reported ConnectionTimeout in server.log with swift proxy-access.log. The idea is to see what swift sent back to ATS, and with what latency.
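A rough sketch of that kind of join, for illustration only: collect the transaction IDs from server.log lines that mention ConnectionTimeout, then keep the proxy-access.log records with a matching ID. The transaction ID pattern and the record keys (`txid`, `status`, `latency`) are assumptions here; the actual log layouts are not shown in this task, so the access log is taken as already parsed into dicts.

```python
import re
from typing import Dict, Iterable, List, Set

# Assumption: transaction IDs look like "tx<hex>-<hex>".
TXID_RE = re.compile(r"\btx[0-9a-f]+-[0-9a-f]+\b")


def timed_out_txids(server_log_lines: Iterable[str]) -> Set[str]:
    """Collect transaction IDs from lines reporting ConnectionTimeout."""
    txids = set()
    for line in server_log_lines:
        if "ConnectionTimeout" in line:
            match = TXID_RE.search(line)
            if match:
                txids.add(match.group(0))
    return txids


def join_with_access(txids: Set[str], access_records: Iterable[Dict]) -> List[Dict]:
    """Keep the (pre-parsed, hypothetical) access records whose txid timed out."""
    return [rec for rec in access_records if rec["txid"] in txids]


if __name__ == "__main__":
    # Hypothetical sample data, not real log lines.
    server_lines = [
        "object-server: ERROR ConnectionTimeout (0.5s) (txn: txabc123-00def)",
        "object-server: GET ok (txn: tx999fff-00111)",
    ]
    access = [
        {"txid": "txabc123-00def", "status": 200, "latency": 1.2},
        {"txid": "tx999fff-00111", "status": 200, "latency": 0.1},
    ]
    for rec in join_with_access(timed_out_txids(server_lines), access):
        print(rec["txid"], rec["status"], rec["latency"])
```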

Dec 5 2019, 10:21 AM · serviceops, SRE-swift-storage, SRE
fgiunchedi moved T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet from Doing to Radar on the User-fgiunchedi board.
Dec 5 2019, 8:44 AM · ops-eqiad, User-fgiunchedi, SRE
fgiunchedi updated the task description for T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet.
Dec 5 2019, 8:39 AM · ops-eqiad, User-fgiunchedi, SRE
fgiunchedi reassigned T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet from fgiunchedi to Cmjohnson.

Hosts are fully in service now!

Dec 5 2019, 8:37 AM · ops-eqiad, User-fgiunchedi, SRE

Dec 4 2019

fgiunchedi added a comment to T239805: ms-fe2007 NIC failure.

@fgiunchedi the 10G NIC is dead. Two options:

  1. Replace the server with another server: https://netbox.wikimedia.org/dcim/devices/1099/
  2. Buy another 10G NIC

Dec 4 2019, 5:45 PM · User-fgiunchedi, ops-codfw, SRE
fgiunchedi renamed T239805: ms-fe2007 NIC failure from ms-fe2007 nic failure to ms-fe2007 NIC failure.
Dec 4 2019, 3:11 PM · User-fgiunchedi, ops-codfw, SRE
fgiunchedi created T239805: ms-fe2007 NIC failure.
Dec 4 2019, 12:32 PM · User-fgiunchedi, ops-codfw, SRE

Dec 3 2019

fgiunchedi added a comment to T180051: Reduce the number of fields declared in elasticsearch by logstash.

We've been working with service owners to fix the obvious offenders in terms of "fields spam" and bumped the fields limit to 2048. We're also alerting on indexing failures when Logstash gets errors from Elasticsearch. ATM only kartotherian bumps into the limit, although that doesn't necessarily mean kartotherian is the "fields spammer" in this case. I'll be following up with a patch to further bump the limit to 4096, that should be plenty to fully ingest all logs we're producing now.

Dec 3 2019, 3:09 PM · Observability-Logging, observability, Patch-For-Review, Platform Team Legacy (Watching / External), Services (watching), SRE, Wikimedia-Logstash
fgiunchedi added a comment to T189333: Changing Kibana filters is ridiculously slow.

I re-ran my analysis today, and oddly enough the total number of fields is not only similar but equal to the number of fields there were three months ago. Currently at 7,665 table columns.

That's indeed unexpected, can you share how you are doing the analysis/pulling the field names?

  1. Open a Logstash dashboard in a Chromium browser, and open the Dev Tools.
  2. Edit or create a filter bubble in the Kibana UI, and open the channel dropdown.
  3. Then, from the Console tab in Dev Tools, execute copy($$('ul.uiSelectChoices--autoWidth.ui-select-dropdown')[0].textContent)

This queries the DOM for the <ul> node that represents the channel dropdown menu, then uses textContent (recursively aggregates the textual content of all child list items and concatenates it), and copies it to your clipboard.

Then paste into a text editor, remove the empty lines by whatever method you like, and count what remains :)
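The last step can also be scripted. A minimal sketch, assuming the copied dropdown text has been saved to a string or file with one field name per line (the function name and the sample input are made up for illustration):

```python
from typing import List


def non_empty_lines(pasted: str) -> List[str]:
    """Strip whitespace and drop the empty lines from the pasted text."""
    return [line.strip() for line in pasted.splitlines() if line.strip()]


if __name__ == "__main__":
    # Hypothetical clipboard content; real content comes from the Kibana dropdown.
    sample = "channel\n\n  host  \n\nlevel\n"
    names = non_empty_lines(sample)
    print(len(names), "fields,", len(set(names)), "unique")
```

Comparing the total against the number of unique names also shows whether the dropdown lists any field more than once.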

The more direct place to get this information is to click the Management (gear/cog) link in the sidebar and select Index Patterns. This will report all the fields Kibana knows about, along with counts. Today it lists 11091 fields. I'm not sure when exactly this metadata updates, or if it's real time. The refresh button, which gives a big warning about resetting popularity counters, suggests it might not auto-update? We can compare to the actual indices with a bit of jq magic, but that would take a bit to work up.

Dec 3 2019, 1:56 PM · Developer Productivity, User-fgiunchedi, observability, Traffic, SRE, User-Addshore, Wikimedia-Logstash
fgiunchedi updated the task description for T239713: Citoid is logging all request / response headers as separate fields.
Dec 3 2019, 1:49 PM · Observability-Logging, observability, Wikimedia-Logstash, Platform Engineering (Icebox), serviceops, SRE, Citoid
fgiunchedi updated subscribers of T239713: Citoid is logging all request / response headers as separate fields.
Dec 3 2019, 1:47 PM · Observability-Logging, observability, Wikimedia-Logstash, Platform Engineering (Icebox), serviceops, SRE, Citoid
fgiunchedi created T239713: Citoid is logging all request / response headers as separate fields.
Dec 3 2019, 1:47 PM · Observability-Logging, observability, Wikimedia-Logstash, Platform Engineering (Icebox), serviceops, SRE, Citoid
fgiunchedi added a comment to T239458: Mediawiki logging indexing conflict.

Similar message but for errors

Dec 3 2019, 11:26 AM · MW-1.38-notes (1.38.0-wmf.12; 2021-12-06), Patch-For-Review, SRE Observability, observability, Wikimedia-Logstash, Editing-team (Tracking), MediaWiki-Logevents, VisualEditor, MediaWiki-General
fgiunchedi added projects to T239458: Mediawiki logging indexing conflict: User-fgiunchedi, MediaWiki-Logevents.
Dec 3 2019, 11:25 AM · MW-1.38-notes (1.38.0-wmf.12; 2021-12-06), Patch-For-Review, SRE Observability, observability, Wikimedia-Logstash, Editing-team (Tracking), MediaWiki-Logevents, VisualEditor, MediaWiki-General
fgiunchedi added a comment to T234854: Upgrade ELK Stack to version 7.

Hello! I took the liberty to ack a lot of criticals/unknowns in icinga that were related to these new hosts, IIUC these are not in production :)

Dec 3 2019, 9:07 AM · SRE Observability (FY2021/2022-Q1), observability, Patch-For-Review, SRE, Wikimedia-Logstash

Dec 2 2019

fgiunchedi added a comment to T233934: Collects metrics for CAS.

While talking metrics and such for java, please consider also adding jmx_exporter (in addition to the native metrics) to CAS' jvm as we are doing for other JVMs across the fleet in T177197: Export Prometheus-compatible JVM metrics from JVMs in production

Dec 2 2019, 2:13 PM · User-jbond, SRE
fgiunchedi updated the task description for T156955: Standardizing our partman recipes.
Dec 2 2019, 2:10 PM · Patch-For-Review, User-fgiunchedi, SRE
fgiunchedi added a comment to T151009: Provide authenticated access to Thanos native web interface.

I'm tempted to add this directly to Apereo CAS (time permitting); however, I'm curious what you had in mind for the service domain names, considering we need one each for codfw and eqiad?

Something like:

https://prometheous.codfw.wikimedia.org/
https://prometheous.eqiad.wikimedia.org/

or did you have something else in mind?

Dec 2 2019, 1:55 PM · observability, Patch-For-Review, User-fgiunchedi, SRE, Prometheus-metrics-monitoring
fgiunchedi updated the task description for T156955: Standardizing our partman recipes.
Dec 2 2019, 12:26 PM · Patch-For-Review, User-fgiunchedi, SRE
fgiunchedi added a comment to T221904: swift backend decomms / rebalances are noisy.

AFAICS through the latest rebalances we haven't observed any alerts, possibly also due to using multiple servers per port (T222366)

Dec 2 2019, 12:20 PM · Patch-For-Review, User-fgiunchedi, SRE-swift-storage, SRE
fgiunchedi updated the task description for T239054: Reimage all mediawiki servers.
Dec 2 2019, 11:51 AM · SRE, serviceops

Nov 29 2019

fgiunchedi created T239458: Mediawiki logging indexing conflict.
Nov 29 2019, 9:08 AM · MW-1.38-notes (1.38.0-wmf.12; 2021-12-06), Patch-For-Review, SRE Observability, observability, Wikimedia-Logstash, Editing-team (Tracking), MediaWiki-Logevents, VisualEditor, MediaWiki-General

Nov 28 2019

fgiunchedi closed T187708: Monitor prometheus exporters "up" status as Resolved.

All deployed now, boldly resolving

Nov 28 2019, 10:25 AM · User-fgiunchedi, observability

Nov 27 2019

fgiunchedi updated the task description for T187708: Monitor prometheus exporters "up" status.
Nov 27 2019, 5:39 PM · User-fgiunchedi, observability
fgiunchedi updated the task description for T156955: Standardizing our partman recipes.
Nov 27 2019, 4:46 PM · Patch-For-Review, User-fgiunchedi, SRE
fgiunchedi changed the status of T215904: Better understanding of Logstash performance from Stalled to Open.
Nov 27 2019, 11:05 AM · User-fgiunchedi, observability, Wikimedia-Logstash
fgiunchedi added a comment to T215904: Better understanding of Logstash performance.

Thanks for the in-depth investigation and the numbers @colewhite! Indeed, it looks like we'll need to tweak logstash pipeline parameters to >= 1000

Nov 27 2019, 11:04 AM · User-fgiunchedi, observability, Wikimedia-Logstash
fgiunchedi created T239321: Deprecate msdos partition scheme in favor of GPT.
Nov 27 2019, 10:49 AM · SRE
fgiunchedi moved T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet from Backlog to Doing on the User-fgiunchedi board.
Nov 27 2019, 8:08 AM · ops-eqiad, User-fgiunchedi, SRE
fgiunchedi added a project to T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet: User-fgiunchedi.
Nov 27 2019, 8:07 AM · ops-eqiad, User-fgiunchedi, SRE

Nov 26 2019

fgiunchedi added a comment to T237587: Determine & implement near-term method for escalating network alerts.

FTR, re: paging on librenms alerts, see this plan: https://phabricator.wikimedia.org/T224888#5690188

Nov 26 2019, 3:50 PM · Patch-For-Review, SRE, netops, observability
fgiunchedi moved T187708: Monitor prometheus exporters "up" status from Up next to Doing on the User-fgiunchedi board.
Nov 26 2019, 3:42 PM · User-fgiunchedi, observability
fgiunchedi added a comment to T224888: Network port utilization alerts should be paging.

Any preferences or thoughts re: the special tag? Right now I'm leaning towards #page as that seems the most self-explanatory.

Nov 26 2019, 3:39 PM · Infrastructure-Foundations, observability, Traffic, netops, SRE
fgiunchedi added a comment to T238909: Proposal: simplify set up of a new load-balanced service on kubernetes.

First, thank you for getting the ball rolling on this proposal! A question: do all the proposed approaches target group B actions only, or would some approaches also tackle group A? Also, I think it'll be helpful if the approaches (only the most promising?) have an outline of what group B actions will turn into.

Nov 26 2019, 2:37 PM · SRE, Prod-Kubernetes, PyBal, Traffic, serviceops
fgiunchedi added a comment to T224888: Network port utilization alerts should be paging.

I've a proposal for doing this:

  • Add some special tag like #NRPE or #page to the names of any LibreNMS alert rules we'd like to make page. For our purpose here this would just be #6 Primary outbound port utilisation over 80% and #25 Primary inbound port utilisation over 80%.
  • In a Python NRPE:

This will prevent turning any LibreNMS critical into a page for the whole team (e.g. the currently-firing "Sensor over limit" for cr3-esams). It will also mean that ACKing alerts within LibreNMS does the right thing. And it makes it fairly straightforward to add/remove alert rules from the set that pages the team.
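To illustrate the tag-based filtering idea (this is a hypothetical sketch, not the code from the proposal): given a list of currently firing alerts, only unacknowledged ones whose rule name carries the tag produce a critical result. The alert dict keys (`rule_name`, `acknowledged`) and the `#page` default are assumptions, and fetching alerts from LibreNMS is deliberately left out. The exit codes follow the standard Nagios plugin convention (0 OK, 2 CRITICAL).

```python
from typing import Dict, Iterable, Tuple

OK, CRITICAL = 0, 2  # standard Nagios/NRPE plugin exit codes


def check_paging_alerts(alerts: Iterable[Dict], tag: str = "#page") -> Tuple[int, str]:
    """Return (exit_code, message) for alerts tagged as paging.

    Assumption: each alert is a dict with a 'rule_name' string and an
    'acknowledged' boolean; how these are fetched is out of scope.
    """
    firing = [
        a["rule_name"]
        for a in alerts
        if tag in a["rule_name"] and not a["acknowledged"]
    ]
    if firing:
        return CRITICAL, "CRITICAL: " + "; ".join(firing)
    return OK, "OK: no unacknowledged alerts tagged " + tag


if __name__ == "__main__":
    # Hypothetical alert data for illustration.
    sample = [
        {"rule_name": "Sensor over limit", "acknowledged": False},
        {"rule_name": "Primary outbound port utilisation over 80% #page",
         "acknowledged": False},
    ]
    code, message = check_paging_alerts(sample)
    print(message)
    raise SystemExit(code)
```

Because untagged rules and acknowledged alerts are filtered out before the exit code is chosen, adding or removing a rule from the paging set only requires renaming the rule.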

SGTU?

Nov 26 2019, 10:43 AM · Infrastructure-Foundations, observability, Traffic, netops, SRE
fgiunchedi added a comment to T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet.

@Cmjohnson @Jclark-ctr I'm not blocked on this (thus no reassigning), but ms-be1059 is in row D judging by its IP address, while netbox says row C. I believe netbox will need updating.

Nov 26 2019, 10:34 AM · ops-eqiad, User-fgiunchedi, SRE
fgiunchedi claimed T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet.
Nov 26 2019, 10:31 AM · ops-eqiad, User-fgiunchedi, SRE
fgiunchedi added a comment to T237438: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet.

@fgiunchedi These are ready for you for implementation. I removed the ops-eqiad tag. If you have an issue, please assign to me and add the ops-eqiad tag back.

Nov 26 2019, 10:13 AM · ops-eqiad, User-fgiunchedi, SRE
fgiunchedi added a project to T239090: Restbase logging indexing conflict on 'res' and 'body' logging fields: User-fgiunchedi.
Nov 26 2019, 9:51 AM · SRE Observability, observability, Platform Team Workboards (Clinic Duty Team), Wikimedia-Logstash, RESTBase

Nov 25 2019

fgiunchedi moved T230570: De-noise systemd alerts (Reduce Icinga alert noise goal) from In progress to Inbox on the observability board.
Nov 25 2019, 4:11 PM · Observability-Alerting, Patch-For-Review, Goal
fgiunchedi updated subscribers of T239039: Varnish traffic drop alert @ codfw is noisy / codfw incoming traffic is spikey.

Took a quick look at the expression and the idea LGTM, thanks @CDanis. Also cc @ayounsi as the original implementor of the alert

Nov 25 2019, 2:39 PM · Traffic, SRE, observability
fgiunchedi moved T236700: 'LVS connections' graph on Load Balancers dashboard takes a rate of a gauge from Inbox to In progress on the observability board.
Nov 25 2019, 2:00 PM · Traffic, observability, SRE
fgiunchedi moved T237587: Determine & implement near-term method for escalating network alerts from Inbox to Up next on the observability board.
Nov 25 2019, 1:53 PM · Patch-For-Review, SRE, netops, observability
fgiunchedi moved T97297: Select a standard log shipping solution to use with applications that cannot be configured to send log events directly to Logstash and/or fluorine from Up next to Inbox on the observability board.
Nov 25 2019, 1:51 PM · observability, SRE, Wikimedia-Logstash
fgiunchedi moved T205856: Retire udp2log: onboard its producers and consumers to the logging pipeline from Up next to Inbox on the observability board.
Nov 25 2019, 1:51 PM · Data-Engineering-Icebox, Observability-Logging, observability, Analytics-Radar, Wikimedia-Logstash, SRE
fgiunchedi moved T217340: Change logstash plugin deployment to use deb packaging and deployment from Inbox to Up next on the observability board.
Nov 25 2019, 1:50 PM · Observability-Logging, SRE, Discovery-Search