Dec 10 2019
Declining as these points are covered by the alerting roadmap. Feel free to reopen if needed!
We'll indeed be investigating non-SMS alternatives as a requirement for page escalation; resolving, but please reopen if needed!
This was done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/350555
Dec 9 2019
In T159613#5723632, @Reedy wrote: In T159613#5723053, @fgiunchedi wrote: Looks like this error is HHVM-specific and I couldn't find other occurrences in logstash, ok to resolve and keep investigating T230245?
It isn't HHVM-specific (I'm not even sure it's Monolog-specific, but that's the code where it actually surfaces), but maybe where it was appearing in the logs more frequently was (i.e. in 2017/2018, when it was HHVM).
Certainly, as per T230245#5582062, if you run the script on PHP7 and give it a high enough quantity (i.e. the 10K) you'll be able to get the error too.
With my hacky workaround in place, it's probably not happening now and as such isn't in the logs.
Re: ATS and client timeouts and retries: yes, it seems ATS does retry on origin timeout; otherwise a 504 is returned to the user. For cache_upload there are indeed a few 504s on the backend, but none on the TLS frontend.
We have availability-based alerts now (i.e. 5xx over all status codes) for Varnish and ATS; I believe those can be made paging now, as we haven't seen false positives with the 99.5% (warn) and 99% (crit) thresholds.
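To illustrate what those thresholds mean in practice, here is a minimal sketch of the availability check; the response counts are invented for the example and are not real metrics:

```python
# Hedged sketch of the availability check behind the thresholds above;
# the response counts are invented for the example, not real metrics.
def availability(total_responses: int, server_errors: int) -> float:
    """Fraction of responses that were not 5xx."""
    return (total_responses - server_errors) / total_responses

WARN, CRIT = 0.995, 0.99

avail = availability(total_responses=1_000_000, server_errors=7_500)
if avail < CRIT:
    print(f"CRITICAL: availability {avail:.4f} < {CRIT}")
elif avail < WARN:
    print(f"WARNING: availability {avail:.4f} < {WARN}")
else:
    print(f"OK: availability {avail:.4f}")
```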
I'm wondering if we've seen this behavior again? (i.e. certain icinga changes are not applied on puppet refresh)
@EBernhardson has this been done in the end, and does it show up in dashboards?
I'm boldly declining this task for now as there hasn't been activity and/or other use cases / feature requests. Feel free to reopen if needed!
The original issue is gone, taking over this issue for general cleanup on logstash logs (including GC)
We're alerting on kafka-logging consumer lag now, resolving
We've separated indices now, so this specific error has been resolved; there are of course other logging conflicts still left.
Looks like this error is HHVM-specific and I couldn't find other occurrences in logstash, ok to resolve and keep investigating T230245?
Looks like we're still getting this from time to time on wtp hosts:
Done! packet drops are gone
Dec 5 2019
In T236573#5714723, @akosiaris wrote: For what it's worth, I have no usage of these machines nor of the project.
In T174432#3565169, @ema wrote: In T174432#3562830, @BBlack wrote: Are the non-ICMP graphs somehow LVS-specific?
Yes, the metrics are: node_ipvs_backend_connections_active, node_ipvs_incoming_packets_total, node_ipvs_incoming_bytes_total. The icmp graph instead plots node_netstat_Icmp_InMsgs.
The text panel @fgiunchedi added is correct, so I guess that should be enough to clarify the ambiguity? Alternatively, we could move the ICMP graphs to a new dashboard with host-specific metrics only.
Fixed now and 'load balancers' dashboard adjusted
I've investigated the scope and impact of this issue a bit, namely by joining the transaction IDs for which Swift reported ConnectionTimeout in server.log with swift proxy-access.log. The idea is to see what Swift sent back to ATS and with what latency.
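Roughly, the join looked like the sketch below (hedged: the transaction-ID regex and the proxy-access.log field positions are assumptions for illustration, not the exact log formats):

```python
# Hedged sketch of the join described above: collect the transaction IDs
# that hit ConnectionTimeout in server.log, then print the matching
# proxy-access.log entries to see the status and latency sent back to ATS.
# The txid regex and field positions are assumptions for illustration.
import re

timeout_txids = set()
with open("server.log") as f:
    for line in f:
        if "ConnectionTimeout" in line:
            m = re.search(r"\(txn: (tx[0-9a-f-]+)\)", line)  # assumed txid format
            if m:
                timeout_txids.add(m.group(1))

with open("proxy-access.log") as f:
    for line in f:
        fields = line.split()
        if len(fields) < 10:
            continue
        txid, status, latency = fields[-3], fields[8], fields[-1]  # assumed columns
        if txid in timeout_txids:
            print(txid, status, latency)
```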
Hosts are fully in service now!
Dec 4 2019
In T239805#5713046, @Papaul wrote: @fgiunchedi the 10G NIC is dead
Option 1: replace the server with another server
https://netbox.wikimedia.org/dcim/devices/1099/
Option 2: buy another 10G NIC
Dec 3 2019
We've been working with service owners to fix the obvious offenders in terms of "fields spam" and bumped the fields limit to 2048. We're also alerting on indexing failures when Logstash gets errors from Elasticsearch. ATM only kartotherian bumps into the limit, although that doesn't necessarily mean kartotherian is the "fields spammer" in this case. I'll be following up with a patch to further bump the limit to 4096, that should be plenty to fully ingest all logs we're producing now.
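For reference, the limit in question is Elasticsearch's index.mapping.total_fields.limit. A minimal sketch of bumping it to 4096 follows; the index pattern and endpoint are placeholders, and in practice the change goes through the puppetized index template rather than an ad hoc call:

```python
# Hedged sketch: bump index.mapping.total_fields.limit to 4096 on the
# logstash indices. The index pattern and endpoint are placeholders;
# the real change is made via the puppetized index template.
import json
import urllib.request

settings = {"index": {"mapping": {"total_fields": {"limit": 4096}}}}
req = urllib.request.Request(
    "http://localhost:9200/logstash-*/_settings",
    data=json.dumps(settings).encode(),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```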
In T189333#5645365, @EBernhardson wrote: In T189333#5488005, @Krinkle wrote: In T189333#5483346, @fgiunchedi wrote: In T189333#5481492, @Krinkle wrote: I re-ran my analysis today, and oddly enough the total number of fields is not only similar but equal to the number of fields there were three months ago. Currently at 7,665 table columns.
That's indeed unexpected, can you share how you are doing the analysis/pulling the field names?
- Open a Logstash dashboard in a Chromium browser, and open the Dev Tools.
- Edit or create a filter bubble in the Kibana UI, and open the channel dropdown.
- Then, from the Console tab in Dev Tools, execute copy($$('ul.uiSelectChoices--autoWidth.ui-select-dropdown')[0].textContent)
This queries the DOM for the <ul> node that represents the channel dropdown menu, then uses textContent (which recursively aggregates and concatenates the textual content of all child list items), and copies it to your clipboard.
Then paste it in a text editor, use some method of removing empty lines, and count them :)
The more direct place to get this information is to click the Management (gear/cog) link in the sidebar and select Index Patterns. This will report all the fields Kibana knows about, along with counts. Today it lists 11091 fields. I'm not sure when exactly this metadata updates, or if it's real time. The refresh button, which gives a big warning about resetting popularity counters, suggests it might not auto-update? We can compare to the actual indices with a bit of jq magic, but it would take a bit to work up.
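As a hedged sketch of what that comparison against the actual indices could look like (the endpoint is a placeholder and the ES 6/7 mapping-layout handling is an assumption, not the exact jq approach mentioned above):

```python
# Hedged sketch: count leaf fields per index straight from the mappings,
# roughly what the "bit of jq magic" above would do. The endpoint is a
# placeholder and the ES 6/7 mapping-layout handling is an assumption.
import json
import urllib.request

def count_leaf_fields(properties: dict) -> int:
    total = 0
    for field in properties.values():
        if "properties" in field:                  # object field: recurse
            total += count_leaf_fields(field["properties"])
        else:                                      # leaf field with a concrete type
            total += 1
        total += len(field.get("fields", {}))      # multi-fields count too
    return total

with urllib.request.urlopen("http://localhost:9200/logstash-*/_mapping") as resp:
    mappings = json.load(resp)

for index, body in mappings.items():
    m = body["mappings"]
    # ES 7 exposes "properties" directly; ES 6 nests them under a doc type
    props = m.get("properties") or next(iter(m.values()), {}).get("properties", {})
    print(index, count_leaf_fields(props))
```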
Similar message but for errors
In T234854#5708171, @elukey wrote: Hello! I took the liberty to ack a lot of criticals/unknowns in Icinga that were related to these new hosts, IIUC these are not in production :)
Dec 2 2019
While we're talking about metrics and such for Java, please consider also adding jmx_exporter (in addition to the native metrics) to CAS' JVM, as we are doing for other JVMs across the fleet in T177197: Export Prometheus-compatible JVM metrics from JVMs in production.
In T151009#5704732, @jbond wrote: I'm tempted to add this directly to Apereo CAS (time permitting), however I'm curious what you had in mind for the service domain names, considering we need one each for codfw and eqiad?
Something like:
https://prometheous.codfw.wikimedia.org/ https://prometheous.eqiad.wikimedia.org/ or did you have something else in mind?
AFAICS through the latest rebalances we haven't observed any alerts, possibly also due to using multiple servers per port (T222366)
Nov 29 2019
Nov 28 2019
All deployed now, boldly resolving
Nov 27 2019
Thanks for the in-depth investigation and the numbers, @colewhite! Indeed it looks like we'll need to tweak the Logstash pipeline parameters to >= 1000.
Nov 26 2019
FTR, re: paging on librenms alerts, see this plan: https://phabricator.wikimedia.org/T224888#5690188
In T224888#5693759, @CDanis wrote: Any preferences or thoughts re: the special tag? Right now I'm leaning towards #page as that seems the most self-explanatory.
First, thank you for getting the ball rolling on this proposal! A question: are all the proposed approaches targeting group B actions only, or would some approaches also tackle group A? Also, I think it would be helpful if the approaches (maybe only the most promising ones?) had an outline of what group B actions will turn into.
In T224888#5690188, @CDanis wrote: I've a proposal for doing this:
- Add some special tag like #NRPE or #page to the names of any LibreNMS alert rules we'd like to make page. For our purpose here this would just be #6 Primary outbound port utilisation over 80% and #25 Primary inbound port utilisation over 80%.
- In a Python NRPE check (see the sketch after this proposal):
- query the API's list of alert rules looking for names with this tag and collect those rule IDs https://docs.librenms.org/API/Alerts/#list_alert_rules
- query the list of state=alerting and status=critical alerts https://docs.librenms.org/API/Alerts/#list_alerts (query params state=1&severity=critical) and then filter alerts based on the above list of rule IDs
- return CRITICAL if any of those are found, UNKNOWN on any scrape errors, OK otherwise
This will prevent turning any LibreNMS critical into a page for the whole team (e.g. the currently-firing "Sensor over limit" for cr3-esams). It will also mean that ACKing alerts within LibreNMS does the right thing. And it makes it fairly straightforward to add/remove alert rules from the set that pages the team.
SGTU?
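A minimal sketch of such a check, using the documented /rules and /alerts endpoints linked above; the LibreNMS base URL, token handling, and tag name are placeholders rather than production values:

```python
#!/usr/bin/env python3
# Hedged sketch of the proposed check: look up LibreNMS alert rules whose
# name carries the paging tag, then report CRITICAL if any critical alert
# for those rules is currently firing. Base URL, token handling and tag
# name are placeholders, not the production values.
import sys
import requests

LIBRENMS = "https://librenms.example.org/api/v0"
HEADERS = {"X-Auth-Token": "REDACTED"}
TAG = "#page"

try:
    rules = requests.get(f"{LIBRENMS}/rules", headers=HEADERS, timeout=10).json()["rules"]
    paging_rule_ids = {r["id"] for r in rules if TAG in r["name"]}

    alerts = requests.get(
        f"{LIBRENMS}/alerts",
        params={"state": 1, "severity": "critical"},
        headers=HEADERS,
        timeout=10,
    ).json()["alerts"]
    firing = [a for a in alerts if a["rule_id"] in paging_rule_ids]
except Exception as exc:
    print(f"UNKNOWN: error querying the LibreNMS API: {exc}")
    sys.exit(3)

if firing:
    print("CRITICAL: paging LibreNMS alert rule(s) firing: "
          + ", ".join(str(a["rule_id"]) for a in firing))
    sys.exit(2)
print("OK: no paging LibreNMS alerts firing")
sys.exit(0)
```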
@Cmjohnson @Jclark-ctr I'm not blocked on this (thus not reassigning), but ms-be1059 is in row D judging by its IP address, while Netbox says row C. I believe Netbox will need updating.
In T237438#5690914, @Cmjohnson wrote: @fgiunchedi These are ready for you for implementation. I removed the ops-eqiad tag. If you have an issue please assign it to me and add the ops-eqiad tag back.