The requested access has been merged and will be fully deployed within 30 minutes. I'll go ahead and resolve this, but please don't hesitate to re-open if any follow-up is needed. Thanks!
- Feed Queries
- All Stories
- Search
- Feed Search
- Transactions
- Transaction Logs
Jun 18 2025
Jun 17 2025
Hi @AndyRussG_volunteer, you should have just received an email regarding Kerberos, and I'll update the account data to reflect krb: present now as well.
In T397099#10925025, @gerritbot wrote: Change #1160216 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] admin: add ldap_only entry for derhexer
Thanks! I've just emailed you as well for the out of band verification step
Yes please 👍
In T397004#10920139, @AndyRussG_volunteer wrote:
- I created a new SSH key for andrew.green@extern.wikimedia.de and added it in Gerrit, as per instructions.
In T397200#10923720, @Martina_sanchez wrote: Do you have any pointers as to how I could achieve this?
Jun 16 2025
@WMDE-leszek is this for a contract with end-date, or for ongoing access?
Hello! Here are a few next-steps to complete before proceeding with access:
Hi @Anton.Kokh could you please add a unique SSH key here? Thanks in advance!
Jun 11 2025
Jun 10 2025
Since we've addressed the misleading panels, I think we're OK to resolve. There will be some follow-up down the road, like upgrading Pyrra and adapting existing dashboards to use the updated recording rules that become available in new version(s), but I think it'll be fine to track that independently.
If I'm understanding correctly, "WDQS update lag" was replaced by "Search update lag", which looks healthy. However, it seems we've hit a bug where the old SLO hasn't been fully removed from the Pyrra output rules executed by thanos-rule, so Thanos and Grafana thought it still existed. I've just cleared the stale output rule, and we should start to see the old SLO drop off.
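One way to confirm the cleanup is to check that the old SLO's output series stop being returned. A minimal sketch, assuming the slo label value here (the real one comes from the Pyrra config):

```promql
# Should return no result once the stale output rule stops being evaluated
# and the old series go stale ("wdqs-update-lag" is an assumed slo label value)
count(pyrra_window{slo="wdqs-update-lag"})
```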
In T395916#10884971, @elukey wrote: +1 for the 4w, my only doubt is about backfilling - do we have a way to do it? Or should we just start from a clean state without history?
Jun 5 2025
In T395920#10887711, @elukey wrote: @Vgutierrez @herron I added more stuff to https://wikitech.wikimedia.org/wiki/SLO/Template_instructions/Dashboards_and_alerts, especially related to alerting. I am planning to have multiple people reviewing it, but lemme know if you like it and if it is close to what we discussed over meetings.
Jun 3 2025
I went ahead and made adjustments that I think simplify the fixed-window view: https://grafana-rw.wikimedia.org/d/ccssRIenz/slo-quarterly-drilldown
I'm experimenting with self-referencing links in the Grafana SLO review/list dashboard (https://grafana-rw.wikimedia.org/d/YuUMRZ44z/slo-quarterly-review) to provide dashboard buttons that load specific date ranges (e.g. Q3, Q4, etc.). It seems promising; I think with a bit of tuning this will be an intuitive way to browse a fixed-window view, and in theory we'd only need to add new links roughly once per FY.
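Grafana dashboard links can carry fixed from/to ranges in the URL, so each button pins the time picker to one quarter. A sketch of the dashboard-JSON links section (the dates and titles are illustrative):

```json
{
  "links": [
    {
      "title": "Q3 (fixed window)",
      "type": "link",
      "url": "/d/YuUMRZ44z/slo-quarterly-review?from=2025-01-01T00:00:00Z&to=2025-03-31T23:59:59Z"
    },
    {
      "title": "Q4 (fixed window)",
      "type": "link",
      "url": "/d/YuUMRZ44z/slo-quarterly-review?from=2025-04-01T00:00:00Z&to=2025-06-30T23:59:59Z"
    }
  ]
}
```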
For a quick simulation of viewing 4w over a longer window like 8w or 12w, we could view our current 12w window over a longer period. For instance, here is a 12w (84d) Pyrra window rendered in Grafana over a 6-month period. The percentages and graphs look OK, but I do think we'll want to clarify or remove the "window" panel, which displays the pyrra_window metric (as opposed to the Grafana time picker), and likewise consider replacing the other instant percentage-value panels with graphs over time.
May 20 2025
Made a few more improvements to the Pyrra detail dashboard: more specifically, to properly include cluster labels when present, and to automatically filter the site/cluster variables each time a different SLO is selected.
Thanks, this makes a lot of sense. I've saved a version of the dashboard that pares way down on headings. FWIW, I think Cluster can stay, since some SLOs (e.g. haproxy) use it.
May 19 2025
Lift Wing Grizzly dashboards have been bulk deleted within Grafana as well
Thanks, makes sense to me. Cross-site latency is a good point; not sure why I had it in mind that we might share across all nodes when realistically it'd be per-site.
In T393797#10820949, @elukey wrote: Thanks a lot! So the changes will not be reverted by future Pyrra filesystem syncs (new SLOs etc..) as for the time window right? If so I think it is something workable at the moment,
May 14 2025
As part of this I think we should summarize the cache backends available today and document the reason for selecting the memcached backend. The other options available are:
Linking with T383570, since these alerts are evaluated by thanos-rule, which depends on thanos-query.
May 13 2025
I see a lot of "read: connection timed out" errors in the thanos-rule journal, for instance on titan1001, and I also noticed a few OOM events today affecting thanos-query and thanos-query-frontend:
Yeah, those panels are spiky and confusing. To make better use of the recording rules in place today, I added an experimental "error ratio" panel to the Pyrra details dashboard (https://grafana-rw.wikimedia.org/d/ccssRIenz/pyrra-detail?var-slo=citoid-requests) and hid the bugged panels inside a collapsed row. With this we can at least visualize error spikes as a percentage, and for the time being defer to service-specific dashboards for request rates, etc.
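For reference, the error-ratio panel boils down to a query of this shape (the metric and label names here are illustrative, not the actual recording rule names):

```promql
# Errored requests as a fraction of all requests over a 5m window
sum(rate(http_requests_total{job="citoid", code=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="citoid"}[5m]))
```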
In T393796#10815188, @elukey wrote:
@herron I think it is perfect, two manual changes are definitely ok for this use case! Do you know if the settings are kept even when we add new SLOs via Pyrra? I assume that there is an auto-sync from Pyrra to Grafana, but I am not sure about it.
May 12 2025
- In the Pyrra Grafana dashboards that are exported. Ideally we'd avoid setting the time manually and/or using a specialized URI every time we need to check; it would be nice if it were pre-generated and updated via a single, centralized config change.
Tentatively resolving
Method-based ACL has been rolled out today, and Jaeger has been granted read/write access (reduced from what was effectively read/write/delete before). Spans are still flowing into OpenSearch; I think we're good here!
May 7 2025
May 6 2025
In T391852#10796071, @elukey wrote: @herron @RLazarus There are a couple of logistical things to discuss:
- https://gerrit.wikimedia.org/r/1142596 is supposed to add/enable the SLO alerts that Pyrra generates, so we can start checking how they work. Have we already decided what alerts we are going to use? Are the Pyrra ones good enough?
Apr 25 2025
In T391687#10766878, @colewhite wrote: Another option is partitioning the data into more indexes to reduce index size.
An additional factor that I should have put in the description is the logstash pipeline(s). Backpressure on the opensearch output with the current single logstash pipeline slowed the opensearch indexing rate to about 1/5 the usual amount. We should probably factor pipelines into the partitioning scheme as well, and could align topic, logstash pipeline, and index to some degree.
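Aligning pipelines with topics could look something like this in pipelines.yml (pipeline names, paths, and topic mapping are hypothetical). Each pipeline gets its own queue and workers, so backpressure on one opensearch output no longer stalls the rest:

```yaml
# /etc/logstash/pipelines.yml (illustrative sketch)
- pipeline.id: logs-text
  path.config: "/etc/logstash/pipelines/text/*.conf"   # consumes kafka topic: logging-text
- pipeline.id: logs-json
  path.config: "/etc/logstash/pipelines/json/*.conf"   # consumes kafka topic: logging-json
```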
Can confirm topics have rebalanced as well. Thanks!
Apr 23 2025
Hi @Jhancock.wm I'll be helping out with this as a kafka-logging service owner, yes please proceed!
Apr 17 2025
I think we're OK to just decline this for now. We tried doubling refresh_interval across all indices and saw no change even under extreme load. We've reverted back to the original value (https://gerrit.wikimedia.org/r/c/operations/puppet/+/1137027). Declining this one and focusing on other tunables now.
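For the record, the tunable in question is a per-index setting applied via the index templates; roughly this shape (template pattern illustrative, 10s was the value tried and reverted):

```json
{
  "index_patterns": ["logstash-*"],
  "settings": {
    "index": {
      "refresh_interval": "10s"
    }
  }
}
```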
In T391714#10740557, @gerritbot wrote: Change #1136394 merged by Herron:
[operations/puppet@production] logstash: increase refresh_interval to 10s in index templates
Apr 16 2025
In T391714#10740656, @hashar wrote:
Apr 15 2025
Yes, although for the purposes of e.g. the citoid latency SLO, we need just a few labels: source_workload_namespace, destination_canonical_service, le, and site, along with the typical instance, job, etc.
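With just those labels, a per-site latency SLI can be computed; a sketch, assuming the standard Istio request-duration histogram is the source metric (adjust to whatever the mesh actually exposes):

```promql
# p99 citoid request latency by site (metric name is an assumption)
histogram_quantile(0.99,
  sum by (le, site) (
    rate(istio_request_duration_milliseconds_bucket{destination_canonical_service="citoid"}[5m])
  )
)
```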
Apr 14 2025
The first thing I notice is that the first panel (using a recording rule) applies rate(sum()) while the second panel applies sum(rate()).
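The order matters: rate() assumes a monotonic counter, and a pre-summed series dips whenever any single instance restarts, which rate() misinterprets as a counter reset. A sketch of the two shapes (metric and rule names illustrative):

```promql
# Panel 1 shape: recording rule stores sum(http_requests_total), panel rates it.
# A counter reset on any one instance makes the stored sum dip, producing spikes.
rate(job:http_requests:sum[5m])

# Panel 2 shape: rate each raw series first (reset-aware), then aggregate.
sum(rate(http_requests_total[5m]))
```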
In T391714#10739172, @gerritbot wrote: Change #1136394 had a related patch set uploaded (by Herron; author: Herron):
[operations/puppet@production] logstash: increase refresh_interval to 10s in index templates
I had a look at what's generating these, and they map back to various alerts defined in Grafana which make use of the graphite datasource.
Apr 11 2025
Adding shards is worth a try IMO. This got me thinking: what metrics should we monitor to know how much of an improvement sharding makes?
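A few candidates we could watch before/after the change, all from the stock stats APIs (index pattern illustrative):

```shell
# Per-node indexing throughput and time (diff the totals over an interval for a rate)
curl -s 'localhost:9200/_nodes/stats/indices/indexing?pretty'

# Per-index primary/replica shard counts, doc counts, and store size
curl -s 'localhost:9200/_cat/indices/logstash-*?v&h=index,pri,rep,docs.count,store.size'

# Per-shard view, to spot uneven shard sizes or placement across nodes
curl -s 'localhost:9200/_cat/shards/logstash-*?v'
```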
Apr 7 2025
Turns out we had the superuser reserve enabled on these filesystems as well, so I've updated that to 0% and extended the srv filesystems. With that we get the same increase in FS capacity while keeping a ~500G reserve. Since we've seen some long lead times on storage, I think it's a good idea to be conservative with reserve capacity for emergencies.
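The two steps above, sketched as commands (device path and size are placeholders, not the actual hosts' layout):

```shell
# Drop the ext4 reserved-blocks percentage (superuser reserve) to 0
tune2fs -m 0 /dev/mapper/vg-srv

# Extend the LV and resize the filesystem together (-r), sized so that
# roughly 500G stays unallocated in the VG as an emergency reserve
lvextend -r -L +<size> /dev/mapper/vg-srv
```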
Apr 2 2025
Rudimentary ACLs would be possible in Apache, but FWIW we talked about this a bit in early 2023 in 881839, and the consensus then was that the security plugin would handle this.
Mar 19 2025
Thanks to @ssingh the k8s-ingress-aux.svc.codfw.wmnet LVS is alive!
Mar 13 2025
Great -- just re-set BGP to true on the aux-k8s-(ctrl|worker)2* nodes in Netbox, and the homer diff/commit completed successfully. Thanks @cmooney!