User Details
- User Since: Oct 15 2019, 4:02 PM (320 w, 5 d)
- Availability: Available
- IRC Nick: rzl
- LDAP User: RLazarus
- MediaWiki User: RLazarus (WMF)
Wed, Dec 3
Envoy 1.35.7 is about to come out, with security fixes: https://groups.google.com/g/envoy-announce/c/zr2OzwmJFqY
Mon, Dec 1
The conversation in #wikimedia-serviceops when this was raised:
Wed, Nov 26
(Clinic duty here! Apparently a milestone tag, like SRE Observability (FY2025/2026-Q3), is mutually exclusive with the project tag, like SRE Observability, and that means the task shows up on the clinic duty dashboard as "needs triage." I'm adding Observability-Metrics at a guess, because that also takes it off the triage list, but if you'll be using those milestone tags going forward, we may want to adjust the clinic duty dashboard query.)
Tue, Nov 25
Added to nda:
rzl@ldap-maint1001:~$ ldapsearch -x cn=nda | grep chandra-wmde
member: uid=chandra-wmde,ou=people,dc=wikimedia,dc=org
Oh, and: On top of L3 which you've already read, please ensure you're also familiar with https://wikitech.wikimedia.org/wiki/Data_Platform/Data_access#User_responsibilities and reach out if you have any questions. Thanks!
This is complete -- please allow up to 30 minutes for it to take effect, then you should be all set! If you still have any trouble, feel free to reopen the task or file a new one.
Hi, this week's clinic duty SRE here.
@Milimetric @Ahoelzl Ping - can you approve for Data Engineering please? The requester is not a WMF or WMDE employee so this needs an explicit signoff.
Mon, Nov 24
Optimistically resolving. :) @Arian_Bozorg please let us know if you have any trouble with your access, either by reopening this task or filing a new one.
Followed up with @DSmit-WMF and confirmed level 1 is what we're doing. Implementation to follow.
Fri, Nov 21
Alerts are enabled! Let's continue to monitor here a tiny bit longer, just in case they behave unexpectedly and the initial config needs tweaking -- but after a few days of finding it to be broadly working, we can declare victory, resolve this, and track any followup work separately.
Thanks @taavi.
Wed, Nov 19
(I'm not married to the specific CLI syntax in the example. Among other things, making it an --optional-flag means that the positional hosts argument would have to become optional too, which might be tricky. The argument might also have to restate the cluster name, something like eqiad-C5, if it can't be scraped out of --k8s-cluster. All that stuff is up to the implementer, IMHO -- as long as it's easier, I'm happy.)
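To make the trade-off concrete, here is a rough sketch using Python's argparse. Only --k8s-cluster comes from the discussion above; the --rack flag, the program name, and the assumption that the tool uses argparse at all are hypothetical, for illustration only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="example-cookbook")  # hypothetical name
    parser.add_argument("--k8s-cluster", required=True)
    # Hypothetical flag: select a whole rack (e.g. "eqiad-C5") instead of hosts.
    parser.add_argument("--rack")
    # nargs="?" makes the positional optional, so it can be omitted with --rack.
    parser.add_argument("hosts", nargs="?")
    return parser


def parse(argv: list[str]) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # Check by hand that exactly one selector (hosts or --rack) was given.
    if (args.rack is None) == (args.hosts is None):
        parser.error("give either hosts or --rack, not both or neither")
    return args
```

The cost of the optional flag shows up in the manual check: once hosts is optional, the parser no longer rejects an invocation that names neither hosts nor a rack, so that validation has to be written explicitly.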
Nov 7 2025
Nov 6 2025
Oct 30 2025
Oct 28 2025
Oct 27 2025
Testing this in mw-debug, there are two envoy warnings in the logs on startup:
Oct 21 2025
This is deployed to all services.
Oct 18 2025
From conversation with @DLynch we think https://gerrit.wikimedia.org/r/1196940 addresses a possible underlying cause in EditCheck: if the model fetch takes so long that users abandon the page while it's underway, that will count against the SLO, since we've already incremented Available but never increment Shown or NotShown. The fix adds a 6-second timeout (down from 5 minutes, enforced elsewhere).
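A minimal sketch of the accounting problem, in Python rather than EditCheck's own code. The counter names mirror the metrics above; the run_check helper, and the choice to record a timed-out fetch as NotShown, are assumptions made for illustration and not something the patch is confirmed to do.

```python
import asyncio
from collections import Counter

counters = Counter()


async def run_check(fetch_model, timeout_s: float = 6.0):
    """Record exactly one outcome (Shown or NotShown) for every Available."""
    counters["Available"] += 1
    try:
        # Bound the fetch so a slow model can't leave the outcome unrecorded.
        result = await asyncio.wait_for(fetch_model(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Assumption: a timed-out fetch is counted as NotShown.
        counters["NotShown"] += 1
        return None
    counters["Shown" if result else "NotShown"] += 1
    return result
```

With a short timeout, Available is always matched by an outcome within a few seconds, so fewer users can leave the page in the window where only Available has been counted.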
Oct 15 2025
This smells like a metrics issue to me -- note that on the rolling window dashboard, the "Errors" graph is regularly fluctuating between 0 and 50%, and occasionally past 100% (!) whereas in the calendar window dashboard, the "Error ratio" graph is steady between 0 and 0.03%. Those ought to be the same data, modulo the time window.
Oct 14 2025
Thanks @brouberol! Confirming sextant update runs without issue now.
Oct 8 2025
Oops, I see @brouberol is OOO for a bit. @RKemper can I talk you into taking a look?
This dates from https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1056137, where growthbook used to be one chart but was split into a frontend and a backend. That commit copied charts/growthbook/templates/vendor/... into charts/growthbook/charts/growthbook-{backend,frontend}/templates/vendor/... and removed the original. But it left the original package.json in place.
Oct 2 2025
Thanks for this! I hadn't originally thought about using charlie this way. For my use case (applying the same diff to every service, like an Envoy upgrade) the "just deploy everything without asking me" feature is tempting but also an obviously terrible idea, hence why it's deliberately not supported. But for your use case it's perfect.
Oct 1 2025
Sep 26 2025
Sep 25 2025
1.23 is gone. 🎉
I deployed most services in wikikube (in part to test https://gerrit.wikimedia.org/r/1188456). Remaining services with an Envoy upgrade to go:
Sep 10 2025
And by request from @CDanis, adding to this config update cycle:
Sep 9 2025
@elukey Thank you! Looks like an ownership issue, and yes please if you're comfortable deploying those, I'll take you up on it. (We were just talking in serviceops about the general problem of keeping the state of the world up to date with the state of the repo. In the general case it's hard and we'll need to figure it out; in the specific case your help would make a big difference!)
The global_downstream_max_connections was deprecated in the 1.28 release notes, but as of 1.29, the downstream connections resource_monitor was still a work in progress. So we won't actually switch over to it until after 1.30.
Note the event_log_path comes up in mesh.configuration._tcp_cluster, which pulls in the entire health_checks field from values.yaml, so that's where the event_log_paths are set. We can either replace those event_log_paths with event_loggers right there in the values file (verbose, messy, wrong level of abstraction) or transform them in the template (cumbersome migration but better end state).
Sep 6 2025
Sep 4 2025
Sep 3 2025
Removed the tracing item
Aug 27 2025
LGTM, thank you for the work!
Aug 26 2025
More deprecation warnings from the API Gateway (started locally after modifying charts/api-gateway/values-devel.yaml to use envoy-future):
Validated on mathoid and mw-debug (mathoid still on envoy-future, mw-debug back on 1.23 for now).
Aug 25 2025
Aug 22 2025
@ecarg Just a heads-up, we've broken the config out into per-team files to make it a little easier to work with, so the stanza I mentioned above has now moved to abstract_wikipedia.pp. Let us know how it's going!
For posterity -- I fatfingered the reprepro include the first time and included the _source.changes without the _amd64.changes, so for a couple hours we had a source-only package for 1.26, and I managed to publish envoy-future:1.26.8-1 (which actually contained the Envoy 1.23.10 binary) without noticing.
Aug 21 2025
Aug 20 2025
How's this looking?
Aug 15 2025
First SLO is up! The rolling dashboard is here, and the quarterly dashboard is here (not much to see until we've collected more data). Take an early look at the rolling dashboard, and see if the data reflects what you expect to see so far.
This is implemented, as option #2 (--local_dblist rather than --local-dblist for consistency with other flags).
Aug 13 2025
Sorry yes, I wrote that misleadingly, but I think @akosiaris and I are both addressing the question of whether the username needs to be in the email body. No objections to sending an email notification.
- Yeah, the benefit of using Istio metrics is Istio exports them for you, so you don't have to create anything. The semantics are marginally different because they're collected at the ingress level. Since you've already done the work of defining this metric the way you want it, I'm on board with using it for now and then considering a switch later.
- That's a good expression; the only trouble is that it's a success fraction (perfect is 100%) where Pyrra expects an error fraction (perfect is 0%). We can adapt it either by setting the error expression to, like,
One other note about all three ConfigMap approaches: they would insert the file into dblists/ but wouldn't update dblists-index.php. I don't think that's a problem for the mwscript use case, but if it is it might be a dealbreaker for this whole approach.
This would work fine (the Python wrapper reads the file and creates the ConfigMap with the contents) so using <(...) is a viable approach.
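A rough sketch of what "reads the file and creates the ConfigMap with the contents" could look like. The function name, its parameters, and the manifest it returns are assumptions for illustration, not the actual wrapper code.

```python
from pathlib import Path


def configmap_from_file(path: str, name: str, key: str) -> dict:
    """Build a ConfigMap manifest whose single entry holds the file's contents."""
    # Read once: with <(...) the path is a pipe such as /dev/fd/63, not seekable.
    contents = Path(path).read_text(encoding="utf-8")
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        # The key is passed explicitly: a /dev/fd path has no useful basename.
        "data": {key: contents},
    }
```

Because the contents are read in a single pass and the key is supplied by the caller, the same code path handles a regular file and a process substitution alike.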
I agree with @akosiaris (and thanks for the archaeology). It wouldn't be hard to implement this, but I think it's the wrong approach -- especially if addWiki.php is the only script using SUDO_USER, we should update the script rather than add an anachronism to pretend we're still using sudo.
Aug 12 2025
Thanks @ecarg! I should be able to help with this. A couple of questions, each of them hopefully quick:
Aug 6 2025
We talked about this in the SLO meeting today -- one possible approach is to keep ATSBackendErrorsHigh as a default policy, but keep a list of services to exclude because they have SLO-driven availability alerts (which are effectively the same, except with a thoughtfully chosen alert threshold). That way, over time fewer and fewer services are covered by the default alert.
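As a sketch of that policy: the default alert covers every service except those on an exclusion list. The function and the service names below are hypothetical, and the real configuration may well live in alerting rules rather than Python.

```python
def default_alert_services(all_services, slo_alerted):
    """Services still covered by the default alert: all those without an SLO-driven one."""
    excluded = set(slo_alerted)
    return sorted(s for s in all_services if s not in excluded)


# Hypothetical service names, for illustration only.
services = ["service-a", "service-b", "service-c"]
has_slo_alert = ["service-b"]
covered = default_alert_services(services, has_slo_alert)
```

Each time a service gains an SLO-driven availability alert it moves onto the exclusion list, so the set covered by the default alert only shrinks.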
Aug 4 2025
Implemented and documented on Wikitech.