User Details
- User Since
- Jul 26 2022, 2:11 PM (114 w, 3 d)
- Availability
- Available
- IRC Nick
- claime
- LDAP User
- Clément Goubert
- MediaWiki User
- CGoubert-WMF [ Global Accounts ]
Tue, Oct 1
It's possible although I'm not sure exactly how. Maybe someone with a better understanding of the internal error capture routing would know.
In any case we can call it resolved as it isn't happening anymore.
Mon, Sep 23
Calling to attention T375382: Post pc1013 crash, failover may need to be done by DBA before the switchover.
Fri, Sep 20
Fri, Sep 13
Logistics... Thanks for the update!
Thu, Sep 12
Wed, Sep 11
sudo systemctl status httpbb_kubernetes_mw-wikifunctions_hourly.service
● httpbb_kubernetes_mw-wikifunctions_hourly.service - Run httpbb wikifunctions tests hourly on Kubernetes mw-wikifunctions.
     Loaded: loaded (/lib/systemd/system/httpbb_kubernetes_mw-wikifunctions_hourly.service; static)
     Active: inactive (dead) since Wed 2024-09-11 08:44:02 UTC; 1min 48s ago
TriggeredBy: ● httpbb_kubernetes_mw-wikifunctions_hourly.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 3918207 ExecStart=/bin/sh -c /usr/bin/httpbb /srv/deployment/httpbb-tests/wikifunctions/*.yaml --host mw-wikifunctions.discovery.wmnet>
   Main PID: 3918207 (code=exited, status=0/SUCCESS)
        CPU: 297ms
Tue, Sep 10
Reset the RAID config and the disk is still in Foreign state, so I can't use it for a Virtual Disk. I think a replacement is in order.
It's not showing up in the system and still shows as Foreign on the RAID controller interface, but that host is part of T358489: mw2420-mw2451 do have unnecessary raid controllers (configured) and should not actually have hardware RAID. I can try running it through the procedure in that task to see what shakes out. What do you think?
Host depooled and downtimed for a week, all yours.
Yep that's ours. I'll depool the node so you can reseat when you want.
Sure, what URLs and expected HTTP codes/text would you like httpbb to test for?
Do you want serviceops to disable httpbb for wikifunctions in the meantime?
Sorry, I didn't see the updates to the discussion before merging the previous iteration. Patch up to disable puppet-agent-timer.timer.
Mon, Sep 9
Correction, it worked for puppetdb, but they got added back to debmonitor. Will investigate further.
Tested via test-cookbook on mw2428 and mw2429 and they seem to have been correctly removed from both puppetdb and debmonitor.
Sounds good.
From what I can see, poolcounter2004.codfw.wmnet and poolcounter1005.eqiad.wmnet are the least used. Which one to pick depends on whether you plan on doing the update before or after T370962: Southward Datacenter Switchover (September 2024).
Thanks for taking care of this <3
Fri, Sep 6
Sep 4 2024
@Papaul We'll be decommissioning this host, sorry :)
Pulling restricted images now works from wikikube-worker2080, resolving.
Sep 3 2024
There was a point release yesterday, which could explain why it changed. cc @elukey
- The 5 nodes with an incorrect RAID config from T358489: mw2420-mw2451 do have unnecessary raid controllers (configured) that haven't yet been reimaged
- The codfw nodes to be decommissioned
- The 3 nodes waiting for T367949: Spin down api_appserver and appserver clusters (the fourth is part of the nodes to be decommissioned)
Sep 2 2024
Our ATS configuration was wrong and as such, traffic was being sent to the mw-web cluster instead of mw-api-ext. This had functionally no impact as the clusters are identical except for the amount of resources they have, but it is now fixed.
Aug 30 2024
Only hosts left are:
- The 5 nodes with an incorrect RAID config from T358489: mw2420-mw2451 do have unnecessary raid controllers (configured) that haven't yet been reimaged
- The codfw nodes to be decommissioned
- The 3 nodes waiting for T367949: Spin down api_appserver and appserver clusters (the fourth is part of the nodes to be decommissioned)
Aug 29 2024
Aug 28 2024
Aug 26 2024
Aug 23 2024
Just a heads up that the removal of the management DNS entries for these three servers popped up in a sre.dns.netbox run this morning. Since they're in decommissioning state in Netbox, I've proceeded with it.
Aug 22 2024
All good now
cgoubert@cumin1002:~$ httpbb /srv/deployment/httpbb-tests/appserver/*.yaml --host mw-api-ext.discovery.wmnet --https_port 4447
Sending to mw-api-ext.discovery.wmnet...
PASS: 131 requests sent to mw-api-ext.discovery.wmnet. All assertions passed.
cgoubert@cumin1002:~$ curl --connect-to en.wikipedia.org:443:mw-api-ext.discovery.wmnet:4447 https://en.wikipedia.org/api/
<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
<meta charset="utf-8">
<title>APIs</title>
<meta name=viewport content="width=device-width, initial-scale=1">
<meta name="robots" content="index, follow">
<style>
body { background: #fff; margin: 7% auto 0; padding: 2em 1em 1em; font: 15px/1.6 sans-serif; color: #333; max-width: 640px; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645AD; text-decoration: underline; }
</style>
</head>
<body>
<h2>APIs</h2>
<ul>
<li><a href="/w/api.php">Action API</a>, providing rich queries, editing and content access.</li>
<li><a href="/api/rest_v1/?doc">REST API v1</a>, mainly focused on high-volume content access.</li>
</ul>
<h2>Legal</h2>
<ul>
<li><a href="https://foundation.wikimedia.org/wiki/Developer_app_guidelines">App Guidelines</a>, for developers on how to properly reuse Wikimedia data, API, trademarks, and other content.</li>
</ul>
</body>
</html>
As expected:
cgoubert@cumin1002:~$ httpbb /srv/deployment/httpbb-tests/appserver/test_main.yaml --host mwdebug1002.eqiad.wmnet
Sending to mwdebug1002.eqiad.wmnet...
PASS: 54 requests sent to mwdebug1002.eqiad.wmnet. All assertions passed.
cgoubert@cumin1002:~$ httpbb /srv/deployment/httpbb-tests/appserver/test_main.yaml --host mw-api-ext.discovery.wmnet --https_port 4447
Sending to mw-api-ext.discovery.wmnet...
https://en.wikipedia.org/api/ (/srv/deployment/httpbb-tests/appserver/test_main.yaml:48)
    Status code: expected 200, got 404.
    Body: expected to contain 'providing rich queries, editing and content access', got 'File not found.\n'.
===
FAIL: 54 requests sent to mw-api-ext.discovery.wmnet. 1 request with failed assertions.
At first I thought it could be due to this change routing /api/ to /w/rest.php for T364400
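For readers unfamiliar with this kind of check, the failing assertion in the httpbb output above (status 200 expected, 404 received, and a body substring missing) can be sketched roughly as follows. This is only an illustration and not httpbb's actual implementation: the `Expectation` class and the `check_response` function are hypothetical names, and the response values are copied from the transcript rather than fetched over the network.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Expectation:
    """Hypothetical stand-in for one assertion in a test file (not httpbb's own type)."""
    url: str
    status: int
    body_contains: Optional[str] = None


def check_response(exp: Expectation, status: int, body: str) -> List[str]:
    """Return human-readable failures; an empty list means the check passed."""
    failures = []
    if status != exp.status:
        failures.append(f"Status code: expected {exp.status}, got {status}.")
    if exp.body_contains is not None and exp.body_contains not in body:
        failures.append(
            f"Body: expected to contain {exp.body_contains!r}, got {body!r}."
        )
    return failures


# Values taken from the transcript above; no request is actually sent.
exp = Expectation(
    url="https://en.wikipedia.org/api/",
    status=200,
    body_contains="providing rich queries, editing and content access",
)

# The 404 seen on mw-api-ext before the fix: both assertions fail.
failing = check_response(exp, 404, "File not found.\n")

# A 200 response carrying the expected text: nothing to report.
passing = check_response(
    exp, 200, "Action API, providing rich queries, editing and content access."
)

for line in failing:
    print(line)
```

Keeping the comparison separate from the HTTP request makes the pass/fail logic easy to exercise without reaching any host.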
Aug 21 2024
Thank you!
Apparently the removal from the puppetserver wasn't properly done by the cookbook. I've done it manually, and it should resolve now. Sorry about that.
This is most likely caused by envoy terminating before mediawiki.
Aug 20 2024
From what I can gather, the automation is there via the reimage cookbook's --move-vlan option. I think the cabling is already correct and only the port's VLAN and the server IP need to change. We can probably take advantage of these reimages to rename the former appservers at the same time.
Changes to sidecar images are generally fine to deploy. If in doubt, you can ask on IRC in either #wikimedia-operations or #wikimedia-serviceops and someone should be able to answer. Thanks for deploying all of them :)
Aug 19 2024
I don't remember, to be honest. I created the task after digging around for a bit and finding the timing coincidental, and it fixed itself with the next run.
Aug 1 2024
Jul 30 2024
Resolved with the 10:10 UTC run.