User Details
- User Since
- May 1 2020, 10:28 PM (215 w, 6 d)
- Availability
- Available
- LDAP User
- Unknown
- MediaWiki User
- RKemper (WMF)
Yesterday
Host has been downtimed. Accidentally associated to wrong ticket: https://phabricator.wikimedia.org/T367825#9908323
Tue, Jun 4
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 250, in _run
    raw_ret = runner.run()
  File "/srv/deployment/spicerack/cookbooks/sre/wdqs/data-reload.py", line 268, in run
    self.preparation_step.run()
  File "/srv/deployment/spicerack/cookbooks/sre/wdqs/data-reload.py", line 458, in run
    self._extract_from_hdfs(tmpdir)
  File "/srv/deployment/spicerack/cookbooks/sre/wdqs/data-reload.py", line 415, in _extract_from_hdfs
    size = self._get_dump_size_from_hdfs()
  File "/srv/deployment/spicerack/cookbooks/sre/wdqs/data-reload.py", line 408, in _get_dump_size_from_hdfs
    return int(re.sub(r"^(\d+)\s+.*$", next(lines), r"\1"))
ValueError: invalid literal for int() with base 10: '\\1'
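The ValueError comes from swapped arguments: re.sub takes (pattern, repl, string), but the call above passes the input line as the replacement and the backreference r"\1" as the string to search, so int() receives the literal text "\1". A minimal sketch of the corrected extraction (the sample line and helper name here are hypothetical, not the cookbook's actual code):

```python
import re

def dump_size_from_du_line(line: str) -> int:
    """Extract the leading byte count from an `hdfs dfs -du`-style line.

    Correct argument order: re.sub(pattern, repl, string).
    The broken call had `repl` and `string` swapped.
    """
    return int(re.sub(r"^(\d+)\s+.*$", r"\1", line))

print(dump_size_from_du_line("123456789  hdfs://analytics/wikidata/dump.ttl.bz2"))  # → 123456789
```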
Tue, May 28
Confirm or deny that we do plan on setting an availability SLO target around the metric "more than 0.1% of requests to Elastic from the MW app servers fail over a 5m period".
Thu, May 23
May 21 2024
Here's what it looked like before the puppet patch:
May 8 2024
@dancy Should be fixed now. Here's a previously failing patch I ran a recheck on: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1028565/4#message-8d03a9f159d1ec58172e8606d84deaf43bbcbbb0
May 7 2024
Kicked off the second run like so:
we saw that this took about 3702 minutes, or about 2.57 hours
Typo you'll want to fix here and in the original: 2.57 days
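The correction checks out: 3702 minutes is about 61.7 hours, which is roughly 2.57 days, not hours:

```python
minutes = 3702
hours = minutes / 60
days = hours / 24
print(f"{hours:.1f} hours = {days:.2f} days")  # → 61.7 hours = 2.57 days
```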
May 2 2024
I'm realizing I don't remember enough about how we load specific graph splits (scholarly vs main). But it's possible we won't need the above nfs patch if our previous process was to manually download the relevant dump file.
Actually we'll use wdqs1021 since we're not confident NFS will work seamlessly between eqiad and codfw (we use NFS to procure the dumps that the data-reload is run from).
May 1 2024
We'll use wdqs2023 to compare against wdqs1023 (same hardware). 1023 is scholarly-articles so that's what we'll want to load 2023 with.
Apr 30 2024
Filled out the search preview SLI info in the documentation, and also updated the existing SLIs to make it clearer exactly what is being measured (page render, etc.).
Apr 27 2024
Here's an example curl command we can use to verify when the service is properly supporting application/sparql-results+xml:
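The response body itself can also be sanity-checked offline; here's a minimal stdlib sketch verifying that a body parses as the W3C SPARQL XML results format (the sample payload below is hypothetical):

```python
import xml.etree.ElementTree as ET

SPARQL_NS = "http://www.w3.org/2005/sparql-results#"

# Hypothetical sample of an application/sparql-results+xml response body.
sample = (
    '<?xml version="1.0"?>'
    f'<sparql xmlns="{SPARQL_NS}">'
    '<head><variable name="item"/></head>'
    '<results><result><binding name="item">'
    '<uri>http://www.wikidata.org/entity/Q42</uri>'
    '</binding></result></results>'
    '</sparql>'
)

root = ET.fromstring(sample)
# ElementTree renders the default namespace in Clark notation: {ns}tag.
assert root.tag == f"{{{SPARQL_NS}}}sparql"
bindings = root.findall(f".//{{{SPARQL_NS}}}binding")
print(len(bindings))  # → 1
```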
Apr 24 2024
Patch was merged and data-transfer was rerun; this is done.
Apr 18 2024
Looks good on our end, thanks!
Apr 17 2024
Was able to get a puppet run on elastic2088, but since that run a couple hours ago the host has been unreachable over SSH (connections hang indefinitely). Seeing some concerning entries in the DRAC via getsel on elastic2088.mgmt.codfw.wmnet:
Apr 15 2024
Added further context to the SLI section of the documentation explaining what each query type actually means. I believe there are no more outstanding TODOs on this task.
I vote that we choose the default based on the cluster being operated on. So 3 for eqiad and codfw, 2 for cloudelastic and 1 for relforge.
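The proposal above could be encoded as a simple per-cluster lookup; a sketch (hypothetical helper name, not existing code):

```python
# Proposed per-cluster defaults: 3 for the large production
# clusters, fewer for the small ones.
DEFAULT_PER_CLUSTER = {
    "eqiad": 3,
    "codfw": 3,
    "cloudelastic": 2,
    "relforge": 1,
}

def default_for(cluster: str) -> int:
    """Look up the proposed default for the cluster being operated on."""
    try:
        return DEFAULT_PER_CLUSTER[cluster]
    except KeyError:
        raise ValueError(f"unknown cluster: {cluster}")

print(default_for("cloudelastic"))  # → 2
```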
Apr 8 2024
Created subtask for dc-ops' side of the decom. Resolving this parent ticket now.
Apr 2 2024
With dc-ops having closed out the decom subtask, this should be all done.
Mar 28 2024
@VRiley-WMF In netbox I see cloudelastic1003 still listed as decommissioning, whereas the other cloudelastic hosts are marked as Offline. Is it just the step to set netbox status to Offline that we're missing or are there other steps that still need to be run on cloudelastic1003 as well?
Mar 21 2024
Glancing at the output of sudo journalctl -u curator_actions_apifeatureusage_eqiad on apifeatureusage1001:
Mar 20 2024
Had forgotten to properly assign dc-ops as well as tag for the DC. Straightened that out now, so this should be ready for dc-ops to do the decom.
Mar 13 2024
With https://github.com/wikimedia/restbase/pull/1336 being merged, is this ticket now resolved?
Mar 1 2024
This should be all done; @TJones can you confirm all is working as it should?
Feb 29 2024
@fnegri @brouberol Yeah, Brian and I will work on getting this tested and merged. Thanks for the heads up!
Feb 21 2024
@HinMar Okay, I think we've got the endpoints properly allowed. Queries appear to be working for me. Are you seeing the same?
@Hannah_Bast Okay, we figured out what was preventing the allowed endpoints from updating properly. https://w.wiki/6q2i no longer returns an error message, although the query itself returns no results.
@Loz.ross Yes the change to allowed endpoints did not get properly deployed; we've fixed that now. However there's another issue now that we've gotten past the Service URI not being allowed:
Feb 15 2024
Finished the upload process; next up is rolling restart of cluster and merge of https://gitlab.wikimedia.org/repos/search-platform/cirrussearch-elasticsearch-image/-/merge_requests/7?commit_id=f1028a26dff38603bea67a8edda8337dab07bbfc
Feb 6 2024
Added new threshold markers at 95% for the 4 SLO graphs. We may want to revise the % SLO upwards, but let's stick with 95% for now until we get another quarter of data.
Jan 31 2024
@Loz.ross Sorry for the delay, we've added the endpoint. Can you confirm it's working with an example query?
@HinMar Sorry for missing this request - our bad! I see your earlier comment mentioned the project expiring by end of 2023. Is the project still ongoing and therefore we should still whitelist this new endpoint or should I instead close this ticket out?
Jan 25 2024
Old masters are no longer master-eligible. They're still participating in the actual cluster; we're holding off on the physical decom until T355617 is done.
Jan 24 2024
Forgot to add the Bug: label but https://gerrit.wikimedia.org/r/c/operations/puppet/+/992826 is part of this ticket as well
Jan 23 2024
This should be all done, with the new experimental services accessible at:
Experimental microsites are up and externally reachable:
We've rolled this out following the steps in https://wikitech.wikimedia.org/wiki/Cergen#Update_a_certificate