User Details
- User Since: May 1 2020, 10:28 PM (292 w, 6 h)
- Availability: Available
- LDAP User: Unknown
- MediaWiki User: RKemper (WMF)
Today
In the past I've had to assign the ticket to the relevant person, but I don't see those instructions in the template, so hopefully I didn't mess up by not setting an assignee :)
Thu, Dec 4
Stat host reboots completed.
Oh, with respect to the patch, we should also get https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/976163/1/cookbooks/sre/hadoop/reboot-workers.py reviewed and merged at the same time since it's directly relevant to this
an-worker* reboots partially done. Made https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1214664 to allow us to reboot a subset of a cluster's hosts while still properly handling the need to restart only one journal node at a time. The patch needs a bit of fixup.
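Roughly the batching idea the patch is going for (a sketch of the concept, not the actual cookbook code; names and batch size are placeholders):

```python
def plan_reboot_batches(hosts, journal_nodes, batch_size=5):
    """Yield reboot batches containing at most one journal node each."""
    journals = [h for h in hosts if h in journal_nodes]
    workers = [h for h in hosts if h not in journal_nodes]
    while journals or workers:
        batch = []
        if journals:
            batch.append(journals.pop(0))  # only ever one journal node per batch
        while workers and len(batch) < batch_size:
            batch.append(workers.pop(0))
        yield batch
```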
Wed, Dec 3
In short, it looks like opensearch-ipoid-test-bootstrap-0 is failing to properly initialize, leading to pod/opensearch-ipoid-test-masters-0 being unable to bootstrap the cluster.
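First thing to dig into is the bootstrap pod's logs. A sketch of what that looks like with the kubernetes Python client (the namespace here is a guess on my part):

```python
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() from inside the cluster
v1 = client.CoreV1Api()
print(v1.read_namespaced_pod_log(
    name="opensearch-ipoid-test-bootstrap-0",
    namespace="opensearch",  # hypothetical namespace
))
```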
We merged the above two patches and deployed. The opensearch-ipoid-test cluster is having issues bootstrapping:
It's interesting that the deletion in the task description is still not processed. I spot-checked every wdqs host in eqiad, and they all agree that the item still exists, so there's definitely an issue here and it's not merely confined to a few hosts.
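A sketch of what that spot check looks like (host list, endpoint path, and item ID here are placeholders, not the exact ones used):

```python
import requests

EQIAD_HOSTS = ["wdqs1015.eqiad.wmnet", "wdqs1016.eqiad.wmnet"]  # placeholder subset
QUERY = "PREFIX wd: <http://www.wikidata.org/entity/> ASK { wd:Q4115189 ?p ?o }"

for host in EQIAD_HOSTS:
    resp = requests.post(
        f"http://{host}/sparql",
        data={"query": QUERY},
        headers={"Accept": "application/sparql-results+json"},
    )
    # True means the (supposedly deleted) item still has triples on this host
    print(host, resp.json()["boolean"])
```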
an-worker* reboots ongoing now
Tue, Dec 2
The current iteration of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1202049/5/modules/profile/files/thanos/recording_rules.yaml has removed the sum and rate functions, since we can rely on pyrra to compute some intermediate metrics from these SLIs.
I like the proposal of >800 for 30 minutes; it should hopefully keep the alert from flapping too much.
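The anti-flapping property boils down to: only fire if every sample in the window breaches the threshold. Toy version below; in the real alert this would just be a `for: 30m` clause on the Prometheus rule, and the numbers are the ones proposed above:

```python
def should_alert(samples, threshold=800):
    """samples: (timestamp, value) pairs covering the last 30 minutes."""
    # A single dip below the threshold resets the condition, so short
    # spikes never page.
    return bool(samples) and all(value > threshold for _, value in samples)
```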
Thu, Nov 27
Wed, Nov 26
Looks like wdqs1032 and wdqs1029 at minimum might need another reimage
Wed, Nov 19
We can see the newly indexed documents here
Tue, Nov 11
It's a little difficult to work on the new dashboard with the clusters not being used; we should probably run some test queries so we have more data plumbing through. For example, one of the most important things we want is a measure of QPS, which presumably we can get via search rate, but that metric currently has no data, so I'm not totally sure.
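Something as simple as a low-rate query loop would probably be enough to light up the search-rate panel. A rough sketch (endpoint and index are placeholders):

```python
import time
import requests

# Hypothetical endpoint/index; the point is just to generate steady traffic.
ENDPOINT = "https://opensearch-test.example.org:9200/test-index/_search"

for _ in range(600):  # ~10 minutes at roughly 1 QPS
    requests.get(ENDPOINT, params={"q": "*"}, timeout=10)
    time.sleep(1)
```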
Nov 5 2025
@elukey Brian and I just tested out a couple of operations, and everything looks good. I think we're ready for the full release.
Nov 4 2025
Nov 3 2025
We could alert when this metric hits >1000 or maybe >1500
Oct 31 2025
Discussed this with Brian. First step for me is to take a look at https://grafana-rw.wikimedia.org/d/c0a89788-c6fe-4d06-aeb2-70b63049599e/opensearch-on-k8s?orgId=1&from=now-7d&to=now&timezone=browser&var-datasource=P0AF0B00C3C579A2D&var-interval=1m&var-cluster=opensearch-test&var-node=$__all&var-shard_type=$__all&var-pool_name=$__all, pull out the most directly useful panels and create a new section at the top with those concise metrics (we'll still have the auto-generated stuff below it).
Oct 29 2025
@dcausse In this updated version of the SLI we don't want to count throttled requests as either a success or a failure, but rather exclude them entirely. However, I'm having a bit of trouble understanding how all the pieces fit together.
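Just to make sure we agree on the semantics, my understanding of the intended arithmetic (toy function, not the actual PromQL):

```python
def sli(success: int, failure: int, throttled: int) -> float:
    """Throttled requests count as neither success nor failure."""
    _ = throttled  # excluded from both the numerator and the denominator
    return success / (success + failure)
```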
Okay, we verified that the kerberos principal is set up and Justin can kinit successfully.
Oops, I needed to make it for jmoore111. I deleted the old principal and recreated it with the proper name:
Configured a kerberos principal (hopefully I was supposed to do that in this ticket and not a separate request):
Working on the new metrics here. The panel labeled success rate is what will ultimately be the SLI. There are still a couple of further changes to make:
- merge the two datacenters' metrics (they're just separated right now while the final query is getting assembled)
- subtract throttled requests
Oct 28 2025
Oct 27 2025
Direct URL seems to be https://data.muziekweb.nl/_api/datasets/MuziekwebOrganization/Muziekweb/services/Muziekweb/sparql
When I make a request via the UI it ends up going to https://www.performing-arts.ch/sparql?repository=default, so I'll try that URL
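Quick way to check that the endpoint answers SPARQL directly, independent of the UI (the trivial test query here is mine):

```python
import requests

resp = requests.get(
    "https://www.performing-arts.ch/sparql",
    params={"repository": "default", "query": "SELECT * { ?s ?p ?o } LIMIT 1"},
    headers={"Accept": "application/sparql-results+json"},
    timeout=30,
)
print(resp.status_code, resp.json())
```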
Oct 24 2025
@dcausse Here's the current state now, looks like everything's synced up:
Oct 23 2025
Met with rzl.
With respect to the thread-count approach, here's an interesting graph of a recent deadlock: https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs-main&from=2025-10-22T20:47:05.347Z&to=2025-10-23T08:26:43.628Z&timezone=utc&var-graph_type=%289102%7C919%5B35%5D%29&viewPanel=panel-22
Confirmed the patch fixes the issue:
Uploaded a proposed solution that alerts if the triple count metric is missing. A Blazegraph deadlock always leads to the triple count metric going missing.
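The condition boils down to PromQL `absent()` on the triple-count metric. Hedged sketch of the check against the Prometheus HTTP API (the metric name and endpoint here are placeholders, not the exact ones in the patch):

```python
import requests

PROM = "http://prometheus.example.org/api/v1/query"  # placeholder endpoint
resp = requests.get(PROM, params={"query": "absent(blazegraph_triples)"})
if resp.json()["data"]["result"]:
    # absent() only returns a series when the metric is missing
    print("Triple count metric missing -- possible Blazegraph deadlock")
```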
Oct 22 2025
Looks like Brian already covered it, but just to reiterate, the one-cumin-test-host approach sounds good. We'll just need to exercise a few of the codepaths. I'd probably start with something simple like just hitting the flush synced shards method, and then, if that works, move on to the cookbooks that Brian mentioned, which rely indirectly on the elasticsearch spicerack library.
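Sketch of the kind of smoke test I mean for that first step (assumes elasticsearch-py 7.x, where synced flush still exists; the host name is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://cumin-test-host.example.org:9200"])  # placeholder
resp = es.indices.flush_synced()  # the "flush synced shards" codepath
print(resp["_shards"])
```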
Oct 21 2025
Derp, turns out it was just the LIMIT order! The following query works:
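For posterity, the general shape of the fix (a reconstruction rather than a copy of the working query; the endpoint is a placeholder): pushing the LIMIT inside the SERVICE block lets the remote endpoint apply it, instead of the federated call trying to stream everything first.

```python
import requests

# Reconstruction, not the exact query: the LIMIT lives inside the SERVICE
# block as a subselect so the remote endpoint applies it.
QUERY = """
SELECT ?s ?p ?o {
  SERVICE <https://remote.example.org/sparql> {
    SELECT ?s ?p ?o { ?s ?p ?o } LIMIT 10
  }
}
"""
resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
print(resp.json()["results"]["bindings"])
```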
Oct 20 2025
@dcausse any guesses why this federation isn't working here?
Sounds like some initial investigation has been done by the team, and the issue seems to be localized to codfw hosts. We can likely fix the issue by doing data transfers of the categories graph from an eqiad host to all the codfw hosts, but we may want to take a closer look before doing so, in case we'd otherwise miss an opportunity to identify a bug in our process.
Oct 10 2025
This is done.
Oct 8 2025
Sigh, had the host all ready for the data transfer and then ran the reimage by mistake. Probably my sign to log off for the night :) This host will need a scap deploy and data transfer when done.
^ Oops, that should say wdqs-main host, not wdqs-internal-main.
wdqs1018 has been reimaged and scap-deployed. Data transfer in progress.
Oct 7 2025
Oops, updated the wrong ticket. This is failing with an upstream request timeout: https://query.wikidata.org/#SELECT%20%3Fs%20%3Fp%20%3Fo%20%7B%0A%20%20SERVICE%20%3Chttps%3A%2F%2Fquery.portal.mardi4nfdi.de%2Fsparql%3E%20%7B%0A%20%20%20%20%3Fs%20%3Fp%20%3Fo%0A%20%20%7D%0A%7D%0ALIMIT%2010
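Decoded for readability, that query is:

```
SELECT ?s ?p ?o {
  SERVICE <https://query.portal.mardi4nfdi.de/sparql> {
    ?s ?p ?o
  }
}
LIMIT 10
```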
EDIT: Ignore the below, mixed up tickets!
Alright, we've got tests passing and it looks like we're ready to merge! It's been a while since I've merged a new spicerack version; are there still a bunch of manual steps that you need to run, @elukey, or is it pretty hands-off?
Oct 6 2025
Sep 23 2025
Removed via sudo -E cumin 'A:config-master' 'rm -fv /srv/config-master/pybal/*/wdqs':
- LVS teardown was completed end of last week
- For some reason the pybal LVS links are still visible: https://config-master.wikimedia.org/pybal/codfw/wdqs. Will follow up with the Traffic team; once that's done we can resolve this ticket.
Sep 18 2025
Have pushed out various improvements to the code. Still much to do on the unit test side.
Sep 12 2025
Current state:
Sep 4 2025
Current status:
