
Mar 30 2020

fgiunchedi moved T234900: Setup bacula backup monitoring from Inbox to Radar on the observability board.
Mar 30 2020, 2:33 PM · Data-Persistence-Backup, Patch-For-Review, Sustainability, observability, Goal, SRE
fgiunchedi closed T239837: prometheus hosts try to start rsync and fails on every puppet run as Invalid.

Has been fixed since

Mar 30 2020, 2:30 PM · observability
fgiunchedi closed T240379: Explicitly state the timezone in grafana as Invalid.

Not a bug, timezone is displayed

Mar 30 2020, 2:26 PM · Upstream, observability
fgiunchedi closed T152445: Move prometheus entry point off port 80 as Invalid.

We're moving Prometheus onto its own dedicated hosts everywhere, so I see no reason not to leave the current entry point as it is now (we also moved to apache in the meantime).

Mar 30 2020, 2:19 PM · observability, Prometheus-metrics-monitoring, SRE
fgiunchedi closed T247538: Icinga latency is skyrocketing and commands ignored as Resolved.

With https://gerrit.wikimedia.org/r/580985 merged I'm resolving this task since check latency is doing better now, and we're alerting on excessive latency.

Mar 30 2020, 2:13 PM · User-fgiunchedi, fundraising-tech-ops, observability, SRE
fgiunchedi added a comment to T248858: Create prometheus metrics for Maps OSM data disk usage.

We have https://github.com/wrouesnel/postgres_exporter deployed on the maps hosts, and I believe some or all of the metrics you are looking for are already available in grafana/prometheus. You can get a preview of them from the host itself if you wish with curl -s localhost:9187/metrics, or use Grafana's "explore" function (while logged in). The metric names start with pg_. Hope that helps!
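
As a minimal sketch of the preview step above: the snippet below pulls the distinct metric names with a given prefix out of Prometheus text-format output, such as what curl -s localhost:9187/metrics prints. The SAMPLE text, including the "gis" database name and the values, is made up for illustration and is not real output from the maps hosts.

```python
def metric_names(exposition_text, prefix="pg_"):
    """Return the sorted, unique metric names that start with `prefix`."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith(prefix):
            names.add(name)
    return sorted(names)


# Illustrative sample only; values and label values are hypothetical.
SAMPLE = """\
# HELP pg_up Whether the last scrape was able to connect.
# TYPE pg_up gauge
pg_up 1
pg_database_size_bytes{datname="gis"} 1.0e+12
go_goroutines 12
"""

print(metric_names(SAMPLE))
```

On a host, the same function could be fed the saved output of the curl command instead of SAMPLE.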

Mar 30 2020, 2:00 PM · Product-Infrastructure-Team-Backlog-Deprecated (Kanban), Maps (Maps-data)
fgiunchedi moved T248151: Big number of uploads from DPLA bot from Backlog to Radar on the User-fgiunchedi board.
Mar 30 2020, 9:55 AM · User-fgiunchedi, SRE, SRE-swift-storage, Commons
fgiunchedi moved T217142: [Proposal] Use the Kafka-Logstash logging infrastructure to log client-side errors from Up next to Doing on the User-fgiunchedi board.
Mar 30 2020, 9:54 AM · observability, User-fgiunchedi, Better Use Of Data, MW-1.34-notes (1.34.0-wmf.15; 2019-07-23), Patch-For-Review, User-herron, Product-Infrastructure-Team-Backlog-Deprecated, Wikimedia-Logstash

Mar 26 2020

fgiunchedi lowered the priority of T247538: Icinga latency is skyrocketing and commands ignored from High to Medium.

Lowering priority as I believe things are better now, pending https://gerrit.wikimedia.org/r/c/operations/puppet/+/580985 as the last attempt at lowering check latency further.

Mar 26 2020, 1:58 PM · User-fgiunchedi, fundraising-tech-ops, observability, SRE
fgiunchedi added a comment to T248174: Request increased quota for monitoring Cloud VPS project.

Thanks @Andrew ! Appreciate it

Mar 26 2020, 9:00 AM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)

Mar 24 2020

fgiunchedi updated subscribers of T247820: Decide on `service-runner` aggregated prometheus metrics and use of `service` label.

Myself, @akosiaris @colewhite and @Ottomata met today to bikesh^W understand better what service means and other related labels.

Mar 24 2020, 4:26 PM · Platform Team Workboards (External Code Reviews), Performance-Team (Radar), observability, SRE

Mar 23 2020

fgiunchedi updated the task description for T248174: Request increased quota for monitoring Cloud VPS project.
Mar 23 2020, 10:09 AM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)
fgiunchedi added a project to T248151: Big number of uploads from DPLA bot: User-fgiunchedi.
Mar 23 2020, 9:46 AM · User-fgiunchedi, SRE, SRE-swift-storage, Commons

Mar 20 2020

fgiunchedi updated subscribers of T248151: Big number of uploads from DPLA bot.

Summary of the IRC chat: the current batch of uploads is about halfway finished and will likely be done by early next week, although no byte size estimates are available. Bots don't seem to have upload rate limits enforced (thanks @Reedy) which I filed as T248177. The one file per page approach is fine as is, depending on the source we do get cases like that.

Mar 20 2020, 3:00 PM · User-fgiunchedi, SRE, SRE-swift-storage, Commons
fgiunchedi created T248177: Enforce upload rate limits for bots on commons.
Mar 20 2020, 2:35 PM · Traffic, serviceops, Wikimedia-Site-requests, Commons
fgiunchedi created T248174: Request increased quota for monitoring Cloud VPS project.
Mar 20 2020, 1:39 PM · cloud-services-team (Kanban), Cloud-VPS (Quota-requests)
fgiunchedi moved T246030: Enable client side error logging in prod for small wiki from Backlog to Radar on the User-fgiunchedi board.
Mar 20 2020, 1:32 PM · MW-1.35-notes (1.35.0-wmf.23; 2020-03-10), Product-Infrastructure-Team-Backlog-Deprecated (Kanban), Patch-For-Review, Analytics-Kanban, Performance-Team (Radar), Desktop Improvements (Vector 2022), observability, Wikimedia-Logstash, User-fgiunchedi, Better Use Of Data, Epic, Analytics
fgiunchedi moved T246997: smartd not starting properly on gen9 + buster from Backlog to Doing on the User-fgiunchedi board.
Mar 20 2020, 1:31 PM · User-fgiunchedi, SRE
fgiunchedi moved T247538: Icinga latency is skyrocketing and commands ignored from Backlog to Doing on the User-fgiunchedi board.
Mar 20 2020, 1:31 PM · User-fgiunchedi, fundraising-tech-ops, observability, SRE
fgiunchedi added a project to T246997: smartd not starting properly on gen9 + buster: User-fgiunchedi.
Mar 20 2020, 1:31 PM · User-fgiunchedi, SRE
fgiunchedi added a comment to T248151: Big number of uploads from DPLA bot.

Hi, this is me! 😳 If it's easier, I can get on Telegram or IRC to chat with you about my project. Obviously, I've been going at a high rate, but I don't really want to break Wikimedia!

Mar 20 2020, 1:28 PM · User-fgiunchedi, SRE, SRE-swift-storage, Commons
fgiunchedi added a project to T247538: Icinga latency is skyrocketing and commands ignored: User-fgiunchedi.
Mar 20 2020, 1:25 PM · User-fgiunchedi, fundraising-tech-ops, observability, SRE
fgiunchedi renamed T244208: Upgrade Grafana to 6.7 from Upgrade Grafana to 6.6 to Upgrade Grafana to 6.7.
Mar 20 2020, 1:21 PM · cloud-services-team (Kanban), Patch-For-Review, User-CDanis, SRE, observability
fgiunchedi added a comment to T248131: Prometheus jobs reduced availability alerts for Icinga exporter.

The flapping seems to have started with the latest version of the exporter AFAICS (around the end of the UTC day on March 18), so maybe that's a lead too? Also, it happens in codfw exclusively, I'm assuming due to the periodic icinga restarts/reloads we do there.

Mar 20 2020, 9:22 AM · observability
fgiunchedi created T248151: Big number of uploads from DPLA bot.
Mar 20 2020, 9:09 AM · User-fgiunchedi, SRE, SRE-swift-storage, Commons

Mar 19 2020

fgiunchedi placed T111540: Clean up labs graphite datapoints up for grabs.
Mar 19 2020, 3:47 PM · cloud-services-team (Kanban), SRE, Cloud-VPS, Shinken, Grafana
fgiunchedi placed T86546: graphite-web logs are not rotated up for grabs.
Mar 19 2020, 3:46 PM · observability, audits-data-retention, SRE, Grafana
fgiunchedi placed T88997: Improve graphite failover up for grabs.
Mar 19 2020, 3:46 PM · SRE Observability, observability, Performance-Team (Radar), SRE, Grafana
fgiunchedi placed T95429: rsync errors slowing down object-replicator up for grabs.
Mar 19 2020, 3:45 PM · SRE, SRE-swift-storage
fgiunchedi closed T99233: limit the impact of many new metrics being pushed to graphite as Declined.

Not relevant anymore as we're dialing down our graphite usage across the board

Mar 19 2020, 3:45 PM · SRE, Grafana
fgiunchedi closed T99234: improve graphite operational documentation as Resolved.

Docs have been expanded and are available at https://wikitech.wikimedia.org/wiki/Graphite

Mar 19 2020, 3:44 PM · SRE, Grafana
fgiunchedi closed T101141: UDP rcvbuferrors and inerrors on graphite hosts, a subtask of T105218: check_graphite - "UNKNOWN: More than half of the datapoints are undefined ", as Resolved.
Mar 19 2020, 3:43 PM · Patch-For-Review, SRE, HHVM, Grafana, observability
fgiunchedi closed T101141: UDP rcvbuferrors and inerrors on graphite hosts as Resolved.

Resolving since we have significantly lessened the load of udp traffic

Mar 19 2020, 3:43 PM · observability, MW-1.27-release-notes, MW-1.27-release (WMF-deploy-2016-04-12_(1.27.0-wmf.21)), SRE, Grafana
fgiunchedi placed T102575: document graphite failover/backfill procedures up for grabs.
Mar 19 2020, 3:42 PM · SRE Observability, observability, Documentation, SRE, Grafana
fgiunchedi closed T119718: Make it easier to ban misbehaving dashboards from graphite as Declined.

Declining as we haven't been experiencing this problem anymore (fewer dashboards on graphite)

Mar 19 2020, 3:42 PM · Grafana, SRE
fgiunchedi placed T119719: Enforce a minimum refresh period for grafana dashboards hitting graphite up for grabs.
Mar 19 2020, 3:41 PM · observability, SRE Observability (FY2021/2022-Q1), SRE, Grafana
fgiunchedi added a comment to T244208: Upgrade Grafana to 6.7.

See also T119719 when Grafana 6.7 is released

Mar 19 2020, 3:39 PM · cloud-services-team (Kanban), Patch-For-Review, User-CDanis, SRE, observability
fgiunchedi added a comment to T119719: Enforce a minimum refresh period for grafana dashboards hitting graphite.

GH issue is resolved, and the feature will be available in Grafana 6.7: https://github.com/grafana/grafana/blob/master/CHANGELOG.md#670-beta1-2020-03-12

Mar 19 2020, 3:38 PM · observability, SRE Observability (FY2021/2022-Q1), SRE, Grafana
fgiunchedi placed T130709: authoritative copy of 'root' files for upload.wikimedia.org is only in swift up for grabs.
Mar 19 2020, 3:36 PM · SRE-swift-storage, SRE
fgiunchedi placed T138821: extend existing graphite whisper files retention to five years up for grabs.
Mar 19 2020, 3:35 PM · Observability-Metrics, observability, SRE
fgiunchedi placed T169316: Thumbor should alert/page when thumbs aren't rendered up for grabs.
Mar 19 2020, 3:34 PM · Thumbor
fgiunchedi placed T159830: Sanity check global-multiwrite logs for ConfirmEdit usage up for grabs.
Mar 19 2020, 3:33 PM · SRE, SRE-swift-storage
fgiunchedi created T248093: Renew certs for mcrouter on all application servers..
Mar 19 2020, 2:09 PM · SRE, serviceops
fgiunchedi removed a project from T247968: Migrate logging::webrequest::ops to Buster: Epic.
Mar 19 2020, 10:15 AM · Patch-For-Review, User-fgiunchedi, observability
fgiunchedi removed a project from T247967: Migrate role::netmon to Buster: Epic.
Mar 19 2020, 10:15 AM · User-fgiunchedi, Patch-For-Review, netops, observability, SRE
fgiunchedi removed a project from T247966: Migrate role::alerting_host to Buster: Epic.
Mar 19 2020, 10:15 AM · Patch-For-Review, observability
fgiunchedi removed a project from T247963: Migrate role::graphite::production to Bullseye: Epic.
Mar 19 2020, 10:15 AM · User-fgiunchedi, Patch-For-Review, SRE Observability (FY2021/2022-Q2)
fgiunchedi added a comment to T247820: Decide on `service-runner` aggregated prometheus metrics and use of `service` label.

Good idea forking the original task. Thanks for that!

Mar 19 2020, 10:13 AM · Platform Team Workboards (External Code Reviews), Performance-Team (Radar), observability, SRE

Mar 18 2020

fgiunchedi added a comment to T247538: Icinga latency is skyrocketing and commands ignored.

Top 50 checks as of today, with a little longer time horizon than the previous audit

Mar 18 2020, 3:30 PM · User-fgiunchedi, fundraising-tech-ops, observability, SRE
fgiunchedi created T247968: Migrate logging::webrequest::ops to Buster.
Mar 18 2020, 12:41 PM · Patch-For-Review, User-fgiunchedi, observability
fgiunchedi created T247967: Migrate role::netmon to Buster.
Mar 18 2020, 12:40 PM · User-fgiunchedi, Patch-For-Review, netops, observability, SRE
fgiunchedi edited projects for T247963: Migrate role::graphite::production to Bullseye, added: observability; removed SRE.
Mar 18 2020, 12:38 PM · User-fgiunchedi, Patch-For-Review, SRE Observability (FY2021/2022-Q2)
fgiunchedi edited projects for T247966: Migrate role::alerting_host to Buster, added: observability; removed SRE.
Mar 18 2020, 12:38 PM · Patch-For-Review, observability
fgiunchedi created T247966: Migrate role::alerting_host to Buster.
Mar 18 2020, 12:38 PM · Patch-For-Review, observability
fgiunchedi created T247963: Migrate role::graphite::production to Bullseye.
Mar 18 2020, 12:36 PM · User-fgiunchedi, Patch-For-Review, SRE Observability (FY2021/2022-Q2)
fgiunchedi added a parent task for T243057: Move Prometheus off eqsin/ulsfo/esams bastions: T247962: Migrate role::prometheus to Bullseye.
Mar 18 2020, 12:26 PM · Patch-For-Review, SRE, observability
fgiunchedi added a subtask for T247962: Migrate role::prometheus to Bullseye: T243057: Move Prometheus off eqsin/ulsfo/esams bastions.
Mar 18 2020, 12:26 PM · SRE Observability (FY2021/2022-Q3)
fgiunchedi created T247962: Migrate role::prometheus to Bullseye.
Mar 18 2020, 12:25 PM · SRE Observability (FY2021/2022-Q3)
fgiunchedi added a comment to T247538: Icinga latency is skyrocketing and commands ignored.

The new baseline in eqiad for average check latency is ~70s, which isn't great IMHO but certainly better. Short of deploying more powerful hardware, I think we can continue looking for low-hanging fruit and lessen the load in terms of checks that icinga needs to run.

Mar 18 2020, 11:25 AM · User-fgiunchedi, fundraising-tech-ops, observability, SRE

Mar 17 2020

Dzahn awarded T247759: eqiad squid performances issue a Barnstar token.
Mar 17 2020, 6:59 PM · SRE
fgiunchedi added a comment to T247538: Icinga latency is skyrocketing and commands ignored.

Change 580327 had a related patch set uploaded (by Filippo Giunchedi; owner: Filippo Giunchedi):
[operations/puppet@production] base: relax interval for selected checks

https://gerrit.wikimedia.org/r/580327

Mar 17 2020, 1:53 PM · User-fgiunchedi, fundraising-tech-ops, observability, SRE
fgiunchedi closed T247759: eqiad squid performances issue as Resolved.

Fix is deployed, looking good!

Mar 17 2020, 12:07 PM · SRE
fgiunchedi added a comment to T247759: eqiad squid performances issue.

I've bumped the limits for squid on install1003 and things look good now, the permanent fix is in https://gerrit.wikimedia.org/r/580296

Mar 17 2020, 11:22 AM · SRE

Mar 16 2020

fgiunchedi added a comment to T238658: Migrate EventStreams to k8s deployment pipeline.

Thanks for the context on service @akosiaris , now it is much more clear in my mind what the status quo is. In the interest of compatibility and time (and picking our battles) I'd say let's go ahead and keep service as it is, for sure the whole conversation is interesting but definitely for another task.

Mar 16 2020, 2:47 PM · Analytics-Kanban, Analytics, Patch-For-Review, Release-Engineering-Team (Pipeline), Services (watching), Release Pipeline
fgiunchedi moved T141324: Look into shoving gerrit logs into logstash from Inbox to Up next on the observability board.
Mar 16 2020, 2:43 PM · Release-Engineering-Team-TODO (2021-01-01 to 2021-03-31 (Q3)), Release-Engineering-Team (Development services), observability, Technical-Debt, Wikimedia-Logstash, Gerrit
fgiunchedi closed T212946: Stream Thumbor logs to logstash as Resolved.

This is complete (i.e. T242609: Move thumbor to the logging pipeline), resolving. Feel free to reopen though!

Mar 16 2020, 2:27 PM · observability, Wikimedia-Logstash, User-jijiki, serviceops, SRE, Thumbor
fgiunchedi closed T212946: Stream Thumbor logs to logstash, a subtask of T216815: Upgrade Thumbor to Buster, as Resolved.
Mar 16 2020, 2:27 PM · Thumbor Migration, User-jijiki, serviceops, SRE, Thumbor
fgiunchedi moved T211125: Move service-runner to new logging infrastructure from Inbox to In progress on the observability board.
Mar 16 2020, 2:24 PM · observability, Platform Team Legacy (Watching / External), service-runner, Wikimedia-Logstash, SRE
fgiunchedi moved T219919: Move citoid logging to new logging pipeline from Inbox to Externally blocked on the observability board.
Mar 16 2020, 2:23 PM · SRE Observability, observability, Citoid, Platform Team Legacy (Watching / External), Services (watching), service-runner, Wikimedia-Logstash, SRE
fgiunchedi moved T219921: Move cxserver logging to new logging pipeline from Inbox to Externally blocked on the observability board.
Mar 16 2020, 2:23 PM · observability, CX-cxserver, Platform Team Legacy (Watching / External), Services (watching), service-runner, Wikimedia-Logstash, SRE
fgiunchedi moved T219924: Move mobileapps logging to new logging pipeline from Inbox to Externally blocked on the observability board.
Mar 16 2020, 2:23 PM · Product-Infrastructure-Team-Backlog-Deprecated, Page Content Service, observability, Platform Team Legacy (Watching / External), Services (watching), service-runner, Wikimedia-Logstash, SRE
fgiunchedi moved T219925: Move proton logging to new logging pipeline from Inbox to Externally blocked on the observability board.
Mar 16 2020, 2:23 PM · Product-Infrastructure-Team-Backlog-Deprecated, observability, Proton, Platform Team Legacy (Watching / External), Services (watching), service-runner, Wikimedia-Logstash, SRE
fgiunchedi moved T222377: Move kartotherian/tilerator logging to new logging pipeline from Inbox to Externally blocked on the observability board.
Mar 16 2020, 2:22 PM · Product-Infrastructure-Team-Backlog-Deprecated (Kanban), observability, Platform Team Legacy (Watching / External), Services (watching), Maps, service-runner, Wikimedia-Logstash, SRE
fgiunchedi moved T234565: Standardize the logging format from Inbox to In progress on the observability board.
Mar 16 2020, 2:22 PM · SRE Observability (FY2023/2024-Q4), Patch-For-Review
fgiunchedi moved T217142: [Proposal] Use the Kafka-Logstash logging infrastructure to log client-side errors from Inbox to In progress on the observability board.
Mar 16 2020, 2:22 PM · observability, User-fgiunchedi, Better Use Of Data, MW-1.34-notes (1.34.0-wmf.15; 2019-07-23), Patch-For-Review, User-herron, Product-Infrastructure-Team-Backlog-Deprecated, Wikimedia-Logstash
fgiunchedi moved T245604: Move wikifeeds to the logging pipeline from Inbox to Externally blocked on the observability board.
Mar 16 2020, 2:22 PM · Wikifeeds, observability, Wikimedia-Logstash, SRE
fgiunchedi moved T245603: Move termbox to the logging pipeline from Inbox to Externally blocked on the observability board.
Mar 16 2020, 2:22 PM · Wikidata-Termbox, observability, Wikimedia-Logstash
fgiunchedi moved T246030: Enable client side error logging in prod for small wiki from Inbox to In progress on the observability board.
Mar 16 2020, 2:21 PM · MW-1.35-notes (1.35.0-wmf.23; 2020-03-10), Product-Infrastructure-Team-Backlog-Deprecated (Kanban), Patch-For-Review, Analytics-Kanban, Performance-Team (Radar), Desktop Improvements (Vector 2022), observability, Wikimedia-Logstash, User-fgiunchedi, Better Use Of Data, Epic, Analytics
fgiunchedi moved T240685: MediaWiki Prometheus support from Inbox to Up next on the observability board.
Mar 16 2020, 2:19 PM · SRE Observability (FY2023/2024-Q4), MW-1.41-notes (1.41.0-wmf.28; 2023-09-26), MW-1.40-notes (1.40.0-wmf.27; 2023-03-13), MW-1.38-notes (1.38.0-wmf.19; 2022-01-24), MediaWiki-libs-Stats, Platform Team Workboards (External Code Reviews), serviceops, SRE, MediaWiki-General, observability
fgiunchedi moved T247538: Icinga latency is skyrocketing and commands ignored from Inbox to Up next on the observability board.
Mar 16 2020, 2:19 PM · User-fgiunchedi, fundraising-tech-ops, observability, SRE
fgiunchedi moved T243057: Move Prometheus off eqsin/ulsfo/esams bastions from Inbox to Up next on the observability board.
Mar 16 2020, 2:19 PM · Patch-For-Review, SRE, observability
fgiunchedi moved T244208: Upgrade Grafana to 6.7 from Inbox to Up next on the observability board.
Mar 16 2020, 2:19 PM · cloud-services-team (Kanban), Patch-For-Review, User-CDanis, SRE, observability
fgiunchedi moved T245176: Add Prometheus Squid exporter from Inbox to In progress on the observability board.
Mar 16 2020, 2:19 PM · Patch-For-Review, observability
fgiunchedi moved T246860: some Prometheis not scraping the full set of targets from Inbox to In progress on the observability board.
Mar 16 2020, 2:19 PM · Patch-For-Review, Traffic, observability, SRE
fgiunchedi moved T247376: Logstash: add SSD tier to ELK7 cluster from Inbox to In progress on the observability board.
Mar 16 2020, 2:18 PM · Wikimedia-Logstash, observability, SRE
fgiunchedi moved T86552: Monitor and alarm on SMART attributes [tracking] from Up next to Backlog on the observability board.
Mar 16 2020, 2:18 PM · Observability-Alerting, Epic, SRE
fgiunchedi created T247755: mw1373 power supply redundancy ipmi alert.
Mar 16 2020, 2:06 PM · DC-Ops, ops-eqiad, SRE
fgiunchedi added a comment to T238658: Migrate EventStreams to k8s deployment pipeline.

Perhaps using a single metric name e.g. 'express_router_request_duration_seconds' for all services is a bad idea? Maybe these should be named per service instead?

The metric won't be a single one for all services though, when Prometheus pulls from k8s services it'll attach tags

By tags do you mean labels, or is this a prometheus thing I don't know about?! My understanding is that the metric name here is 'express_router_request_duration_seconds', and every service-runner based app will emit a metric with the same name.

@fgiunchedi my suggestion would be to use a per service app name metric for this, instead of using one for all services that happen to use express. I currently have eventstreams_connected_clients, so this would be eventstreams_request_duration_seconds with path specific labels.
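
To make the two naming options in this exchange concrete, here is a small sketch that renders a time series identity (metric name plus labels) the way Prometheus displays it. The label names `service` and `path` and their values are assumptions for illustration; the thread does not say which labels Prometheus attaches when scraping k8s services.

```python
def series_id(name, labels):
    """Render a series as name{label="value",...} with labels sorted by key."""
    rendered = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{name}{{{rendered}}}"


# Option 1: one shared metric name, services told apart by a label.
shared_a = series_id(
    "express_router_request_duration_seconds",
    {"service": "eventstreams", "path": "/v2/stream"},  # hypothetical labels
)
shared_b = series_id(
    "express_router_request_duration_seconds",
    {"service": "citoid", "path": "/api"},  # hypothetical labels
)

# Option 2: a per-service metric name with only path-specific labels.
per_service = series_id(
    "eventstreams_request_duration_seconds",
    {"path": "/v2/stream"},  # hypothetical label
)

print(shared_a)
print(shared_b)
print(per_service)
```

With the shared name, the two services still produce distinct series because their label sets differ; with per-service names, the distinction moves into the metric name itself.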

Mar 16 2020, 12:07 PM · Analytics-Kanban, Analytics, Patch-For-Review, Release-Engineering-Team (Pipeline), Services (watching), Release Pipeline

Mar 10 2020

fgiunchedi added a comment to T246860: some Prometheis not scraping the full set of targets.

I took a look at this on both prometheus200[34] for up{instance=~"elastic2055.*9108"}: the metric first appeared yesterday on 2004 at 9:44 and on 2003 at 17:48, whereas both target files (/srv/prometheus/ops/targets/elasticsearch_codfw.yaml) were last modified at 10:10. So 2003 was definitely lagging behind and eventually discovered the target, for reasons yet to be determined.
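
For readers unfamiliar with the =~ matcher used in the query above: Prometheus anchors label regexes at both ends, so the pattern must match the whole instance value. The sketch below imitates that with Python's re.fullmatch; the instance strings are hypothetical examples, not taken from the target files.

```python
import re


def instance_matches(instance, pattern=r"elastic2055.*9108"):
    """Mimic a Prometheus =~ label match, which is anchored at both ends."""
    return re.fullmatch(pattern, instance) is not None


# Hypothetical instance label values.
candidates = [
    "elastic2055:9108",
    "elastic2055.codfw.wmnet:9108",
    "elastic2055:9109",
    "elastic2056:9108",
]

for instance in candidates:
    print(instance, instance_matches(instance))
```

Only values that start with elastic2055 and end with 9108 match, which is why the query selects a single host and exporter port.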

Mar 10 2020, 11:43 AM · Patch-For-Review, Traffic, observability, SRE
fgiunchedi added a comment to T226986: Client side error logging production launch.

I've saved this as a dashboard called mw-client-errors and linked it from the Kibana homepage.

Mar 10 2020, 11:27 AM · Analytics-Radar, MW-1.35-notes (1.35.0-wmf.24; 2020-03-17), Performance-Team (Radar), Desktop Improvements (Vector 2022), observability, Wikimedia-Logstash, User-fgiunchedi, Better Use Of Data, Product-Infrastructure-Team-Backlog-Deprecated, Epic

Mar 9 2020

fgiunchedi added a comment to T216815: Upgrade Thumbor to Buster.

Not sure if there's a more specific python3 + Thumbor task but the alpha version of Thumbor ships with Python 3 support: https://github.com/thumbor/thumbor/releases/tag/7.0.0a2

Mar 9 2020, 3:41 PM · Thumbor Migration, User-jijiki, serviceops, SRE, Thumbor
fgiunchedi added a comment to T238658: Migrate EventStreams to k8s deployment pipeline.

It does have its usefulness. As I pointed out, gauges have the problem that you will never get insights into events that last less than the current polling period (60s currently). Counters have the capability to expose that. But your point about the long-lived connections is correct. I'd say keep both?
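
A toy simulation of the gauge versus counter point above, assuming the 60s polling period mentioned in the comment. The connection start and end times are invented for illustration, and the two functions are simplified stand-ins for "currently connected clients" (gauge) and "total connections opened" (counter).

```python
SCRAPE_INTERVAL = 60  # seconds, the polling period mentioned in the comment


def gauge_at(t, connections):
    """Connections open at time t: what a gauge would report when scraped."""
    return sum(1 for start, end in connections if start <= t < end)


def counter_at(t, connections):
    """Connections opened up to time t: what a counter would report."""
    return sum(1 for start, _end in connections if start <= t)


# Hypothetical (start, end) times in seconds: two short-lived connections
# that open and close between scrapes, and one long-lived connection.
connections = [(5, 15), (70, 80), (100, 400)]

for t in range(0, 181, SCRAPE_INTERVAL):
    print(t, gauge_at(t, connections), counter_at(t, connections))
```

The gauge never sees the two short connections because they open and close between scrapes, while the counter records them; on the other hand, only the gauge shows that the long-lived connection is still open, which is the argument for keeping both.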

Mar 9 2020, 3:31 PM · Analytics-Kanban, Analytics, Patch-For-Review, Release-Engineering-Team (Pipeline), Services (watching), Release Pipeline
fgiunchedi added a comment to T226986: Client side error logging production launch.

And we're on: https://logstash.wikimedia.org/goto/edf04ab8ff11b50a69ecf9988337b7e1

Mar 9 2020, 3:25 PM · Analytics-Radar, MW-1.35-notes (1.35.0-wmf.24; 2020-03-17), Performance-Team (Radar), Desktop Improvements (Vector 2022), observability, Wikimedia-Logstash, User-fgiunchedi, Better Use Of Data, Product-Infrastructure-Team-Backlog-Deprecated, Epic
fgiunchedi added a comment to T166107: Cleanup old logstash logs (application and JVM GC).

Similarly, the gc logs rotation (openjdk 11 on buster) config doesn't seem to work:

Mar 9 2020, 3:09 PM · Observability-Logging, observability, Wikimedia-Logstash
fgiunchedi added a comment to T166107: Cleanup old logstash logs (application and JVM GC).

It doesn't look like the current log4j config is working as intended:

Mar 9 2020, 2:52 PM · Observability-Logging, observability, Wikimedia-Logstash
fgiunchedi changed the status of T219925: Move proton logging to new logging pipeline, a subtask of T211125: Move service-runner to new logging infrastructure, from Open to Stalled.
Mar 9 2020, 11:54 AM · observability, Platform Team Legacy (Watching / External), service-runner, Wikimedia-Logstash, SRE
fgiunchedi changed the status of T219925: Move proton logging to new logging pipeline, a subtask of T224602: Fix logging umbrella task, from Open to Stalled.
Mar 9 2020, 11:54 AM · Better Use Of Data, Epic, Product-Infrastructure-Team-Backlog-Deprecated
fgiunchedi changed the status of T219925: Move proton logging to new logging pipeline from Open to Stalled.

Stalling since we'll be piggybacking on Proton (and mobileapps) moving to k8s, and thus the logging pipeline. See also T219924 for the full discussion.

Mar 9 2020, 11:54 AM · Product-Infrastructure-Team-Backlog-Deprecated, observability, Proton, Platform Team Legacy (Watching / External), Services (watching), service-runner, Wikimedia-Logstash, SRE
fgiunchedi changed the status of T219924: Move mobileapps logging to new logging pipeline, a subtask of T211125: Move service-runner to new logging infrastructure, from Open to Stalled.
Mar 9 2020, 11:54 AM · observability, Platform Team Legacy (Watching / External), service-runner, Wikimedia-Logstash, SRE
fgiunchedi changed the status of T219924: Move mobileapps logging to new logging pipeline, a subtask of T224602: Fix logging umbrella task, from Open to Stalled.
Mar 9 2020, 11:53 AM · Better Use Of Data, Epic, Product-Infrastructure-Team-Backlog-Deprecated