
Joe (Giuseppe Lavagetto)
Spy

Projects (22)

User Details

User Since
Oct 3 2014, 5:57 AM (451 w, 5 d)
Availability
Available
LDAP User
Giuseppe Lavagetto
MediaWiki User
GLavagetto (WMF) [ Global Accounts ]

Recent Activity

Mon, May 29

Joe closed T337359: deployment-charts CI should allow opting out of default fixture injections, a subtask of T324117: Helmchart for OpenTelemetry Collector, as Resolved.
Mon, May 29, 1:28 PM · serviceops, Observability-Tracing
Joe closed T337359: deployment-charts CI should allow opting out of default fixture injections as Resolved.
Mon, May 29, 1:28 PM · Kubernetes, serviceops
Joe triaged T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad as High priority.

Just to make clear what I did yesterday:

Mon, May 29, 4:54 AM · All-and-every-Wikisource, serviceops, Thumbor

Sun, May 28

Joe added a comment to T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad.

I would suppose the poolcounter limit that is reached is the one for expensive files:

Sun, May 28, 12:29 PM · All-and-every-Wikisource, serviceops, Thumbor
Joe created T337649: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad.
Sun, May 28, 12:20 PM · All-and-every-Wikisource, serviceops, Thumbor

Thu, May 25

Joe added a comment to T337453: Remove tls_minimum_protocol_version from envoy config.

It would be great if envoy fixed TLS 1.3 so that it works well when two envoys talk to each other - we should check whether that has been solved in the latest versions.

Thu, May 25, 8:49 AM · SRE, Traffic, serviceops, envoy
Joe added a comment to T318707: Don't scrape every containerPort for metrics.

I think the current solution works well. Basically:

Thu, May 25, 7:26 AM · Machine-Learning-Team, Kubernetes, Observability-Metrics, serviceops
Joe closed T271822: Add support for scraping php applications to the kubernetes prometheus scraper, a subtask of T281423: New Service Request Shellbox, as Resolved.
Thu, May 25, 7:16 AM · Service-deployment-requests, Services, SRE
Joe closed T271822: Add support for scraping php applications to the kubernetes prometheus scraper as Resolved.
Thu, May 25, 7:16 AM · observability, MW-on-K8s, serviceops, SRE

Wed, May 24

Ladsgroup awarded T324003: mwdebug: people in the "deployment" group should be able to launch 'experimental' instances for testing purposes a Like token.
Wed, May 24, 6:45 PM · Release-Engineering-Team (Priority Backlog 📥), serviceops, MW-on-K8s
Joe added a subtask for T324117: Helmchart for OpenTelemetry Collector: T337359: deployment-charts CI should allow opting out of default fixture injections.
Wed, May 24, 4:55 AM · serviceops, Observability-Tracing
Joe added a parent task for T337359: deployment-charts CI should allow opting out of default fixture injections: T324117: Helmchart for OpenTelemetry Collector.
Wed, May 24, 4:55 AM · Kubernetes, serviceops
Joe changed the status of T337359: deployment-charts CI should allow opting out of default fixture injections from Open to In Progress.
Wed, May 24, 4:55 AM · Kubernetes, serviceops
Joe created T337359: deployment-charts CI should allow opting out of default fixture injections.
Wed, May 24, 4:53 AM · Kubernetes, serviceops

Sat, May 20

Joe committed rOSCTc18dded603ec: Add black formatting and enforcement (authored by Joe).
Add black formatting and enforcement
Sat, May 20, 6:13 PM
Joe committed rOSCT1cacda5fed79: Release 2.3.0 (authored by Joe).
Release 2.3.0
Sat, May 20, 6:13 PM
Joe committed rOSCT4f8aea3a6484: Support urllib 2.x (authored by Joe).
Support urllib 2.x
Sat, May 20, 6:13 PM
Joe added a comment to T324003: mwdebug: people in the "deployment" group should be able to launch 'experimental' instances for testing purposes.

Regarding the last point - I think the best way to do this is to actually leave those hostpaths empty until someone needs to do something with the code - then allow people to run a command that will:

  • Copy the code out of the latest mediawiki container to a path on the mw-debug kubernetes node
  • Re-deploy the individual developer session, mounting the code from the hostPath
  • Tell the developer where they can find their code.
Sat, May 20, 1:01 PM · Release-Engineering-Team (Priority Backlog 📥), serviceops, MW-on-K8s

Fri, May 19

Joe added a comment to T324003: mwdebug: people in the "deployment" group should be able to launch 'experimental' instances for testing purposes.

So the steps to do what Alex proposed would be:

Fri, May 19, 12:31 PM · Release-Engineering-Team (Priority Backlog 📥), serviceops, MW-on-K8s
Joe closed T336025: ResourceLoader icon rasterization fails via MediaWiki-on-Kubernetes, a subtask of T290536: Serve production traffic via Kubernetes, as Resolved.
Fri, May 19, 12:14 PM · Performance-Team (Radar), Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe closed T336025: ResourceLoader icon rasterization fails via MediaWiki-on-Kubernetes as Resolved.

AFAICT this is now resolved.

Fri, May 19, 12:14 PM · Performance-Team (Radar), Wikimedia-production-error, serviceops, MW-on-K8s
Joe updated subscribers of T324003: mwdebug: people in the "deployment" group should be able to launch 'experimental' instances for testing purposes.

An alternative idea from @akosiaris which is also very interesting:

Fri, May 19, 12:09 PM · Release-Engineering-Team (Priority Backlog 📥), serviceops, MW-on-K8s
Joe added a comment to T318223: Verify images before MW K8s deployments.

@jnuche is this task still valid? I think it's not really an issue given how scap works now, including the pre-download of images on the k8s nodes.

Fri, May 19, 8:49 AM · Release-Engineering-Team, Scap
Joe closed T318538: Honor flavor fields in MW K8s deployment config files, a subtask of T318536: Scap Mediawiki K8s deployments, as Resolved.
Fri, May 19, 8:48 AM · MW-on-K8s, Release-Engineering-Team, Scap
Joe closed T318538: Honor flavor fields in MW K8s deployment config files as Resolved.

@jnuche I'm closing this task as resolved because, AIUI, we have been doing this for months.

Fri, May 19, 8:48 AM · Release-Engineering-Team, Scap
Joe closed T336038: Add traffic sampling support to mw-on-k8s.lua ATS script, a subtask of T290536: Serve production traffic via Kubernetes, as Resolved.
Fri, May 19, 8:34 AM · Performance-Team (Radar), Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe closed T336038: Add traffic sampling support to mw-on-k8s.lua ATS script as Resolved.
Fri, May 19, 8:34 AM · SRE, Traffic, serviceops, MW-on-K8s
Joe closed T336037: Add configuration file support to mw-on-k8s.lua ATS script, a subtask of T290536: Serve production traffic via Kubernetes, as Resolved.
Fri, May 19, 8:33 AM · Performance-Team (Radar), Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe closed T336037: Add configuration file support to mw-on-k8s.lua ATS script as Resolved.
Fri, May 19, 8:33 AM · SRE, Traffic, serviceops, MW-on-K8s

Tue, May 16

Joe added a comment to T334980: Run visual diff testing without RL and other hacks to compare Parsoid rendering against legacy parser rendering.

I can see the following two paths to solve this:

Tue, May 16, 12:30 PM · Parsoid-Read-Views (Phase 1 - DiscussionTools support), Parsoid
Joe changed the status of T335560: Publish Wikimedia bookworm base Docker image from Open to Stalled.

After discussion with @MoritzMuehlenhoff - it makes sense that snapshots might be broken right now, given that things change hectically for a testing distro before the freeze - we should pick up this task after June 3rd, when the freeze is enacted.

Tue, May 16, 7:47 AM · serviceops
Joe changed the status of T335560: Publish Wikimedia bookworm base Docker image, a subtask of T335507: Build Bookworm based Toolforge Kubernetes images, from Open to Stalled.
Tue, May 16, 7:47 AM · Toolforge (Software install/update)
Joe added a comment to T335560: Publish Wikimedia bookworm base Docker image.

It's much simpler than that - debuerreotype would work correctly - the problem is that snapshots are full of broken links for bookworm - again:

Tue, May 16, 7:24 AM · serviceops
Joe added a comment to T335560: Publish Wikimedia bookworm base Docker image.

While I've added bookworm to the build process, I think I'll revert that part of my change.

Tue, May 16, 6:57 AM · serviceops

Mon, May 15

Joe added a comment to T324200: Handle edge cache invalidation for the api gateway.

Basically, add a configuration that lets us figure out article_url => derived urls as a mapping, and eliminate the need for additional configuration everywhere.

I like this idea.

Q. If this was done, would there be a need for the resource_purge event at all? Wouldn't the resource_change event (or, more ideally, the actual state change event ;) ) be enough to indicate that a purge should happen?

Mon, May 15, 2:45 PM · SRE, Traffic, Platform Team Initiatives (API Gateway), serviceops
Joe moved T335560: Publish Wikimedia bookworm base Docker image from Incoming 🐫 to Doing 😎 on the serviceops board.
Mon, May 15, 5:36 AM · serviceops
Joe claimed T335560: Publish Wikimedia bookworm base Docker image.
Mon, May 15, 5:35 AM · serviceops

Fri, May 12

Joe closed T336554: Repool jobrunners and videoscalers as Resolved.

The videoscaling has slowed down, so the current situation is that we have:

Fri, May 12, 9:17 AM · Sustainability (Incident Followup), serviceops
Joe added a comment to T324200: Handle edge cache invalidation for the api gateway.

My idea for implementing this is as follows:

  • Create a benthos container
  • Add a release containing a Deployment with N replicas (is this the best solution?) to the namespace of the service, running benthos with the adequate configuration, one per datacenter, listening to the local queue
  • Limit what we need to configure to just the prefix for the URL we get from resource-change
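
The last point above, configuring only a URL prefix, could be sketched roughly as follows. This is a minimal illustration in Python rather than an actual benthos configuration; the function name `urls_to_purge`, the example prefix, and the assumption that each resource-change event carries its URL under `meta.uri` are all hypothetical and not taken from the task.

```python
from typing import Iterable, Iterator

# Hypothetical prefix; in practice this would be the single configured value.
URL_PREFIX = "https://en.wikipedia.org/wiki/"


def urls_to_purge(events: Iterable[dict], prefix: str = URL_PREFIX) -> Iterator[str]:
    """Yield the URL of every event whose URL starts with the configured prefix."""
    for event in events:
        # Assumption: the event URL lives under meta.uri; events without it are skipped.
        uri = event.get("meta", {}).get("uri", "")
        if uri.startswith(prefix):
            yield uri
```

Anything that does not match the prefix is simply ignored, so the prefix would be the only per-service setting in this sketch.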
Fri, May 12, 7:10 AM · SRE, Traffic, Platform Team Initiatives (API Gateway), serviceops
Joe updated the task description for T324200: Handle edge cache invalidation for the api gateway.
Fri, May 12, 7:04 AM · SRE, Traffic, Platform Team Initiatives (API Gateway), serviceops
Joe triaged T324200: Handle edge cache invalidation for the api gateway as High priority.
Fri, May 12, 6:39 AM · SRE, Traffic, Platform Team Initiatives (API Gateway), serviceops
Joe added a comment to T336554: Repool jobrunners and videoscalers.

The number of servers pooled in jobrunning was just 7, and some servers (mw1458, mw1461, mw1466, mw1495) were depooled from jobrunning but receiving no traffic. This is yet another case where we should probably reduce the number of connections reused by envoy: especially for videoscaling, it means basically no load-balancing from LVS, because the connections between changeprop and the backends live "forever".

Fri, May 12, 5:42 AM · Sustainability (Incident Followup), serviceops
Joe claimed T336554: Repool jobrunners and videoscalers.
Fri, May 12, 5:29 AM · Sustainability (Incident Followup), serviceops

Thu, May 11

Joe added a comment to T336504: Transcluding Special:Prefixindex can force the default skin.

Our previous assumption that this was only happening in codfw (as far as we were able to reproduce the bug) just proved wrong: I randomly got a page on V22 for one of the reported pages while having V10 set in my preferences.

Thu, May 11, 4:49 PM · MW-1.41-notes (1.41.0-wmf.10; 2023-05-23), Beta-Cluster-reproducible, GrowthExperiments, Growth-Team, Wikimedia-production-error, Regression
Joe added a comment to T336504: Transcluding Special:Prefixindex can force the default skin.

I would also add - purging pages shouldn't help, unless we broke something fundamental in how parsercache works.

Thu, May 11, 4:44 PM · MW-1.41-notes (1.41.0-wmf.10; 2023-05-23), Beta-Cluster-reproducible, GrowthExperiments, Growth-Team, Wikimedia-production-error, Regression
Joe claimed T271822: Add support for scraping php applications to the kubernetes prometheus scraper.
Thu, May 11, 1:21 PM · observability, MW-on-K8s, serviceops, SRE
Joe moved T336037: Add configuration file support to mw-on-k8s.lua ATS script from Backlog to In Progress on the MW-on-K8s board.
Thu, May 11, 10:47 AM · SRE, Traffic, serviceops, MW-on-K8s
Joe moved T336038: Add traffic sampling support to mw-on-k8s.lua ATS script from Backlog to In Progress on the MW-on-K8s board.
Thu, May 11, 10:47 AM · SRE, Traffic, serviceops, MW-on-K8s

Wed, May 10

Joe added a comment to T209892: SecurePoll is not compatible with GPG 2.1+.

Also: who is going to take on this work?

@jrbs and @drochford may have thoughts on that, since they've been organising maintenance work on SecurePoll recently.

I can prioritise this with the contractors we have at the moment particularly if this is blocking other work.

Wed, May 10, 4:21 PM · MediaWiki-extensions-SecurePoll
Dzahn awarded T329791: Vopsbot doesn't have channel topic rights a Like token.
Wed, May 10, 3:55 PM · Observability-Alerting, SRE-OnFire, SRE
Joe committed rWISCa956a57608c4: Move sirenbot to full op (authored by Joe).
Move sirenbot to full op
Wed, May 10, 2:18 PM
Joe added a comment to T336025: ResourceLoader icon rasterization fails via MediaWiki-on-Kubernetes.

@Joe Thanks!

If it's any consolation, these are not user-generated SVGs. They're only ~1KB SVGs that are code-reviewed, checked-in, on-disk as part of the deployed software.

Wed, May 10, 2:05 PM · Performance-Team (Radar), Wikimedia-production-error, serviceops, MW-on-K8s
Joe closed T329791: Vopsbot doesn't have channel topic rights as Resolved.

This is solved thanks to @Legoktm's patch.

Wed, May 10, 1:59 PM · Observability-Alerting, SRE-OnFire, SRE
Joe added a comment to T336025: ResourceLoader icon rasterization fails via MediaWiki-on-Kubernetes.

So, in order to get this working, we would need to install rsvg-convert in the base mediawiki image in production-images, and then use that in building our mw-on-k8s images.

Wed, May 10, 9:40 AM · Performance-Team (Radar), Wikimedia-production-error, serviceops, MW-on-K8s
Joe added a comment to T336025: ResourceLoader icon rasterization fails via MediaWiki-on-Kubernetes.

For reference, the code in question is in Resourceloader\Image::rasterize:

Wed, May 10, 9:39 AM · Performance-Team (Radar), Wikimedia-production-error, serviceops, MW-on-K8s
Joe added a comment to T336025: ResourceLoader icon rasterization fails via MediaWiki-on-Kubernetes.

Indeed. The code for resourceloader wants to use rsvg-convert and do it without going through shellbox - I guess for performance reasons.

Wed, May 10, 9:32 AM · Performance-Team (Radar), Wikimedia-production-error, serviceops, MW-on-K8s
Joe edited P48059 Masterwork From Distant Lands.
Wed, May 10, 5:43 AM

Tue, May 9

Joe changed the status of T336038: Add traffic sampling support to mw-on-k8s.lua ATS script, a subtask of T290536: Serve production traffic via Kubernetes, from Open to In Progress.
Tue, May 9, 10:03 AM · Performance-Team (Radar), Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe changed the status of T336038: Add traffic sampling support to mw-on-k8s.lua ATS script from Open to In Progress.
Tue, May 9, 10:03 AM · SRE, Traffic, serviceops, MW-on-K8s
Joe changed the status of T336037: Add configuration file support to mw-on-k8s.lua ATS script, a subtask of T290536: Serve production traffic via Kubernetes, from Open to In Progress.
Tue, May 9, 9:00 AM · Performance-Team (Radar), Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Joe changed the status of T336037: Add configuration file support to mw-on-k8s.lua ATS script from Open to In Progress.
Tue, May 9, 9:00 AM · SRE, Traffic, serviceops, MW-on-K8s

Mon, May 8

Joe added a comment to T333269: Benchmark baremetal vs k8s mediawiki perf (2023).

In order to do this properly, we need to do as follows, IMHO:

  1. Pick a k8s node, or even better, reimage one appserver to act as an additional k8s node with specific node taints so that no "normal" pod can be executed
  2. Add a deployment of mw-on-k8s targeting those taints, add as many replicas as we can fit in that node; remember to also allow http connections as ab doesn't work well with TLS
  3. Use benchmw against this node (with the port you chose for http) and a depooled appserver with the same generation of hardware
Mon, May 8, 9:11 AM · Performance-Team

Fri, May 5

Joe added a comment to T334980: Run visual diff testing without RL and other hacks to compare Parsoid rendering against legacy parser rendering.

The parsoid servers (reachable via localhost:6002 on any appserver, or any server running the service mesh proxy) can reply to requests for any url that mediawiki would respond to on an api or appserver. So your code could just call the api URL connecting to port 6002 on localhost (inside mediawiki-config we might need to add a configuration for that).
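
As a rough illustration of the idea above, the request could be built as sketched below. This is only a sketch in Python (the real caller would be MediaWiki code); the helper name `build_local_api_request`, the `/w/api.php` path, and the use of a `Host` header to select the wiki are assumptions and are not stated in the comment.

```python
import urllib.request
from urllib.parse import urlencode


def build_local_api_request(wiki_host: str, params: dict, port: int = 6002) -> urllib.request.Request:
    """Build (but do not send) an API request routed through the local proxy port."""
    query = urlencode(params)
    # Assumption: the API is served under /w/api.php behind the local port.
    url = f"http://localhost:{port}/w/api.php?{query}"
    # Assumption: the target wiki is selected through the Host header.
    return urllib.request.Request(url, headers={"Host": wiki_host})
```

The request is only constructed here; sending it with `urllib.request.urlopen` would only make sense on a host where the proxy is actually listening.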

Fri, May 5, 11:42 AM · Parsoid-Read-Views (Phase 1 - DiscussionTools support), Parsoid

Thu, May 4

Joe added a comment to T334703: Pybal maintenances break safe-service-restart.py (and thus prevent scap deploys of mediawiki).

I'm frankly not sure how checking appserver.svc.eqiad.wmnet:9090 from an appserver would work - that IP resolves locally to the loopback interface on any appserver. We'd need to pick another internal IP.

Thu, May 4, 7:29 AM · Patch-For-Review, SRE-OnFire, Sustainability (Incident Followup), serviceops, Traffic, conftool
Joe closed T292818: Better scaffolding for helm charts / releases as Resolved.
Thu, May 4, 6:52 AM · Patch-For-Review, Prod-Kubernetes, serviceops

Tue, May 2

Joe closed Restricted Task, a subtask of T324678: Migrate proton (chromium-render) away from restbase, as Resolved.
Tue, May 2, 8:33 AM · Content-Transform-Team, Patch-For-Review, Proton, WMF-Architecture-Team, RESTbase Sunsetting, Epic, Foundational Technology Requests, Code-Health, Platform Engineering Roadmap, Platform Engineering Roadmap Decision Making
Joe updated the task description for T313227: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org.
Tue, May 2, 7:13 AM · SRE, Traffic, HTTPS, serviceops, Abstract Wikipedia team (Phase λ – Launch)
Joe added a comment to T313227: Get new edge & internal HTTPS certificates expanded to add wikifunctions.org and *.wikifunctions.org.

We also need to add wikifunctions to our internal certs.

Tue, May 2, 6:53 AM · SRE, Traffic, HTTPS, serviceops, Abstract Wikipedia team (Phase λ – Launch)

Apr 19 2023

Joe closed T327665: Create a cookbook to help us depool *all* services from a datacentre as Resolved.
Apr 19 2023, 8:28 AM · serviceops, Infrastructure-Foundations

Apr 18 2023

Joe added a comment to T334895: XSS via Graph extension.

Another thing to consider: given there are known XSS vectors in vega 2, this might be the first time we get a report, but not the first time this has been found. We should probably get someone to check all the past revisions of all pages containing graphs for suspicious-looking patterns.

Apr 18 2023, 2:59 PM · SecTeam-wikimedia-project-event, SecTeam-Processed, WMDE-TechWish-Sprint-2023-04-05, Editing-team, Vuln-XSS, MediaWiki-extensions-Graph, Security, Security-Team
Joe updated subscribers of T334895: XSS via Graph extension.
Apr 18 2023, 2:58 PM · SecTeam-wikimedia-project-event, SecTeam-Processed, WMDE-TechWish-Sprint-2023-04-05, Editing-team, Vuln-XSS, MediaWiki-extensions-Graph, Security, Security-Team
Joe updated subscribers of T334895: XSS via Graph extension.
Apr 18 2023, 10:51 AM · SecTeam-wikimedia-project-event, SecTeam-Processed, WMDE-TechWish-Sprint-2023-04-05, Editing-team, Vuln-XSS, MediaWiki-extensions-Graph, Security, Security-Team

Mar 30 2023

Joe added a comment to T333536: Survey RESTBase services and find which ones accesses Parsoid via RESTBase.

A starting point for this investigation can be which services currently call restbase:

Mar 30 2023, 10:05 AM · API Platform (RESTbase Deprecation Roadmap), serviceops, RESTbase Sunsetting, Epic, Platform Engineering Roadmap

Mar 29 2023

Joe closed T333069: Multiple restbase servers have not received any deployments since at least October 2022 as Resolved.

All stale nodes have been updated.

Mar 29 2023, 7:27 AM · Patch-For-Review, SRE, Parsoid, ChangeProp, RESTBase
Joe added a comment to T333069: Multiple restbase servers have not received any deployments since at least October 2022.

Adding @hnowlan as FYI so that we are careful about this in the future.

Mar 29 2023, 7:06 AM · Patch-For-Review, SRE, Parsoid, ChangeProp, RESTBase
Joe added a comment to T333069: Multiple restbase servers have not received any deployments since at least October 2022.

Very simply, those 3 servers are not in the targets file for deployment.

Mar 29 2023, 6:40 AM · Patch-For-Review, SRE, Parsoid, ChangeProp, RESTBase
Joe updated subscribers of T330507: New Service Request mediawiki-page-content-change-enrichment.

@Joe we discussed the use of page_content_change in Kafka main today, and decided that we don't need it for now after all. We may one day, but it looks like all the immediate use cases of it are more data lakey (incremental dumps, mediawiki history, etc.), so Kafka jumbo will be fine.

We'd still like to deploy to wikikube, as DSE is multi DC and also would prefer not to run long lived apps at this time. This means that while the app would consume page_change from local DC Kafka main, it would always produce page_content_change to Kafka jumbo-eqiad. This means that when MW is pooled in codfw, most page_content_change events would be produced cross DC to Kafka jumbo. FWIW, this data would be produced across DC links if we were using Kafka main too (via MirrorMaker). However, in this case, the Kafka producer in the app would have to deal with any cross DC latencies, so we'll have to be careful with producer buffering.

(Given that webrequest Kafka producers produce cross DC, I'm hoping we'll be okay :) )

Mar 29 2023, 6:37 AM · Data-Engineering, Event-Platform Value Stream (Sprint 14 A), serviceops, Service-deployment-requests

Mar 28 2023

Joe committed rOSCTcbb79c76aaa2: Add a ConftoolClient class (authored by Joe).
Add a ConftoolClient class
Mar 28 2023, 3:13 PM
Joe added a comment to T329366: Enable WarmParsoidParserCache on all wikis.

I see a few ways to enable this job on all wikis, but fundamentally the procedure I think would make sense is something like the following:

Mar 28 2023, 12:08 PM · Patch-For-Review, serviceops, Parsoid (Tracking), RESTbase Sunsetting
Joe committed rOSCT5b282e81fe8b: requestctl: fix default path for the git repo (authored by Joe).
requestctl: fix default path for the git repo
Mar 28 2023, 10:02 AM
Joe added a comment to T330507: New Service Request mediawiki-page-content-change-enrichment.

Sorry, I'm getting confused; to my understanding, WDQS/search will use mediawiki.page_change which AIUI are generated from mediawiki, not mediawiki.page_content_change.

Mar 28 2023, 9:31 AM · Data-Engineering, Event-Platform Value Stream (Sprint 14 A), serviceops, Service-deployment-requests
Joe closed T287983: Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage as Resolved.
Mar 28 2023, 6:51 AM · SRE-Sprint-Week-Sustainability-March2023, serviceops, envoy, Sustainability (Incident Followup), Traffic

Mar 27 2023

Joe created T333120: Migrate internal traffic to k8s.
Mar 27 2023, 7:33 AM · Patch-For-Review, Performance-Team (Radar), Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Mar 24 2023

Joe claimed T245059: depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled.
Mar 24 2023, 10:01 AM · SRE-Sprint-Week-Sustainability-March2023, serviceops-radar, Sustainability (Incident Followup), conftool
Joe moved T245059: depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled from Backlog to Doing on the SRE-Sprint-Week-Sustainability-March2023 board.
Mar 24 2023, 10:01 AM · SRE-Sprint-Week-Sustainability-March2023, serviceops-radar, Sustainability (Incident Followup), conftool
Joe moved T245058: Create an automated alert for 'too many nodes depooled from a service' from Backlog to Doing on the SRE-Sprint-Week-Sustainability-March2023 board.
Mar 24 2023, 10:01 AM · SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), serviceops-radar, conftool
Joe moved T197084: Report problems found in server's IPMI SEL from Doing to Done on the SRE-Sprint-Week-Sustainability-March2023 board.
Mar 24 2023, 10:00 AM · SRE Observability, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), observability
Joe moved T110169: Monitor rdb hosts for memory/disk usage (redis_lock, aka redis_misc) from Doing to Done on the SRE-Sprint-Week-Sustainability-March2023 board.
Mar 24 2023, 9:59 AM · SRE-Sprint-Week-Sustainability-March2023, serviceops, Sustainability (Incident Followup), User-jijiki, User-Joe, Incident-20150825-Redis, observability
Joe moved T278946: Add alerting for Memcached timeout errors from Doing to Done on the SRE-Sprint-Week-Sustainability-March2023 board.
Mar 24 2023, 9:59 AM · SRE-Sprint-Week-Sustainability-March2023, observability, serviceops, Sustainability (Incident Followup)
Joe added a comment to T245059: depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled.

On further thought:

Mar 24 2023, 9:56 AM · SRE-Sprint-Week-Sustainability-March2023, serviceops-radar, Sustainability (Incident Followup), conftool
Joe closed T322400: Monitor request throughput on etcd/confd hosts to prevent incidents of software requiring config reload too often, a subtask of T322360: conf* hosts ran out of disk space due to log spam, as Resolved.
Mar 24 2023, 9:40 AM · MW-1.39-notes, MW-1.40-notes (1.40.0-wmf.10; 2022-11-14), Wikimedia-Incident, Dumps-Generation, serviceops, SRE
Joe closed T322400: Monitor request throughput on etcd/confd hosts to prevent incidents of software requiring config reload too often as Resolved.
Mar 24 2023, 9:40 AM · SRE-Sprint-Week-Sustainability-March2023, SRE-OnFire, Sustainability (Incident Followup), serviceops-radar, observability
Joe moved T322400: Monitor request throughput on etcd/confd hosts to prevent incidents of software requiring config reload too often from Doing to Done on the SRE-Sprint-Week-Sustainability-March2023 board.
Mar 24 2023, 9:40 AM · SRE-Sprint-Week-Sustainability-March2023, SRE-OnFire, Sustainability (Incident Followup), serviceops-radar, observability
Joe moved T287983: Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage from Doing to Done on the SRE-Sprint-Week-Sustainability-March2023 board.
Mar 24 2023, 9:38 AM · SRE-Sprint-Week-Sustainability-March2023, serviceops, envoy, Sustainability (Incident Followup), Traffic

Mar 23 2023

Joe removed a project from T302639: How should we monitor for faulty memory modules?: SRE-Sprint-Week-Sustainability-March2023.
Mar 23 2023, 7:39 AM · SRE Observability, Infrastructure-Foundations
Joe closed T278945: Add rate limiting to the jobqueue vidoscalers to prevent overloads as Resolved.
Mar 23 2023, 7:38 AM · SRE-Sprint-Week-Sustainability-March2023, TimedMediaHandler-Transcode, WMF-JobQueue, Sustainability (Incident Followup), serviceops
Joe moved T278945: Add rate limiting to the jobqueue vidoscalers to prevent overloads from Doing to Done on the SRE-Sprint-Week-Sustainability-March2023 board.
Mar 23 2023, 7:38 AM · SRE-Sprint-Week-Sustainability-March2023, TimedMediaHandler-Transcode, WMF-JobQueue, Sustainability (Incident Followup), serviceops
Joe placed T306860: Videoscalers fail health checks while CPU is maxed up for grabs.
Mar 23 2023, 7:11 AM · Wikimedia-Video, serviceops-radar, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), WMF-JobQueue
Joe moved T306860: Videoscalers fail health checks while CPU is maxed from Doing to Not an SRE issue on the SRE-Sprint-Week-Sustainability-March2023 board.
Mar 23 2023, 7:10 AM · Wikimedia-Video, serviceops-radar, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), WMF-JobQueue
Joe edited projects for T306860: Videoscalers fail health checks while CPU is maxed, added: serviceops-radar, Wikimedia-Video; removed serviceops.

I wouldn't consider this task done, but we took all the actions that are reasonable on the SRE side of the issue. Retagging as necessary.

Mar 23 2023, 7:04 AM · Wikimedia-Video, serviceops-radar, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup), WMF-JobQueue