Clement_Goubert (claime)
Senior SRE

User Details

User Since
Jul 26 2022, 2:11 PM (114 w, 3 d)
Availability
Available
IRC Nick
claime
LDAP User
Clément Goubert
MediaWiki User
CGoubert-WMF [ Global Accounts ]

Recent Activity

Tue, Oct 1

Clement_Goubert closed T374556: Some wikifunctions calls end up served by mw-web as Resolved.

It's possible although I'm not sure exactly how. Maybe someone with a better understanding of the internal error capture routing would know.
In any case we can call it resolved as it isn't happening anymore.

Tue, Oct 1, 9:59 AM · Abstract Wikipedia team (25Q2 (Oct–Dec)), MW-on-K8s, serviceops

Mon, Sep 23

Clement_Goubert added a comment to T370962: Southward Datacenter Switchover (September 2024).

Calling to attention T375382: Post pc1013 crash, failover may need to be done by DBA before the switchover.

Mon, Sep 23, 12:37 PM · Patch-For-Review, Datacenter-Switchover, serviceops

Fri, Sep 20

Clement_Goubert updated the task description for T358489: mw2420-mw2451 do have unnecessary raid controllers (configured).
Fri, Sep 20, 1:09 PM · Patch-For-Review, SRE, serviceops

Fri, Sep 13

Clement_Goubert added a comment to T374409: Degraded RAID on wikikube-worker2092.

Logistics... Thanks for the update!

Fri, Sep 13, 2:35 PM · Prod-Kubernetes, serviceops, DC-Ops, ops-codfw

Thu, Sep 12

Clement_Goubert removed a project from T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets: SRE.
Thu, Sep 12, 1:28 PM · serviceops
Clement_Goubert added a comment to T374231: wikifunctions mediawiki instance can't sustain more than 5rps.

We got paged again, and we see lots of /view/ requests being cut off at 60s. An example can be found here: https://trace.wikimedia.org/trace/00baa42e12812e63ec9ebd5d2d96a3c2. The request ID in question has been logged as well:

[3eb4cf2d-d4f0-9df4-b4f4-32ade54f52c8] /view/kho/Z14554   PHP Fatal Error from line 486 of /srv/mediawiki/php-1.43.0-wmf.22/includes/exception/MWExceptionHandler.php: Allowed memory size of 1468006400 bytes exhausted (tried to allocate 2097160 bytes)

link T374241

Thu, Sep 12, 10:37 AM · Abstract Wikipedia team (25Q1 (Jul–Sep)), serviceops-radar, Wikifunctions, WikiLambda

Wed, Sep 11

Clement_Goubert created T374556: Some wikifunctions calls end up served by mw-web.
Wed, Sep 11, 4:51 PM · Abstract Wikipedia team (25Q2 (Oct–Dec)), MW-on-K8s, serviceops
Clement_Goubert closed T374442: While mw-wikifunctions exists as a separate cluster, replace the httpbb appserver test suite with one specific to WF as Resolved.
sudo systemctl status httpbb_kubernetes_mw-wikifunctions_hourly.service
● httpbb_kubernetes_mw-wikifunctions_hourly.service - Run httpbb wikifunctions tests hourly on Kubernetes mw-wikifunctions.
     Loaded: loaded (/lib/systemd/system/httpbb_kubernetes_mw-wikifunctions_hourly.service; static)
     Active: inactive (dead) since Wed 2024-09-11 08:44:02 UTC; 1min 48s ago
TriggeredBy: ● httpbb_kubernetes_mw-wikifunctions_hourly.timer
       Docs: https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
    Process: 3918207 ExecStart=/bin/sh -c /usr/bin/httpbb /srv/deployment/httpbb-tests/wikifunctions/*.yaml --host mw-wikifunctions.discovery.wmnet>
   Main PID: 3918207 (code=exited, status=0/SUCCESS)
        CPU: 297ms
Wed, Sep 11, 8:54 AM · Abstract Wikipedia team (25Q1 (Jul–Sep)), serviceops, Wikifunctions

Tue, Sep 10

Clement_Goubert added a comment to T374409: Degraded RAID on wikikube-worker2092.

Reset the RAID config and the disk is still in Foreign state, so I can't use it for a Virtual Disk. I think a replacement is in order.

Tue, Sep 10, 4:55 PM · Prod-Kubernetes, serviceops, DC-Ops, ops-codfw
Clement_Goubert updated the task description for T341984: Update Kubernetes clusters to >1.25.
Tue, Sep 10, 4:10 PM · Patch-For-Review, Data-Platform-SRE, Kubernetes, Prod-Kubernetes, serviceops
Clement_Goubert moved T374409: Degraded RAID on wikikube-worker2092 from Incoming 🐫 to 🛠 Upgrades and Hardware on the serviceops board.
Tue, Sep 10, 4:00 PM · Prod-Kubernetes, serviceops, DC-Ops, ops-codfw
Clement_Goubert updated the task description for T358489: mw2420-mw2451 do have unnecessary raid controllers (configured).
Tue, Sep 10, 3:58 PM · Patch-For-Review, SRE, serviceops
Clement_Goubert added a comment to T374409: Degraded RAID on wikikube-worker2092.

It's not showing up in the system, and it still shows as foreign on the RAID controller interface, but that host is part of T358489: mw2420-mw2451 do have unnecessary raid controllers (configured) and should not actually have hardware RAID. I can try running it through the procedure in that task to see what shakes out. What do you think?

Tue, Sep 10, 3:47 PM · Prod-Kubernetes, serviceops, DC-Ops, ops-codfw
Clement_Goubert added a comment to T374409: Degraded RAID on wikikube-worker2092.

Host depooled and downtimed for a week, all yours.

Tue, Sep 10, 3:35 PM · Prod-Kubernetes, serviceops, DC-Ops, ops-codfw
Clement_Goubert edited projects for T374409: Degraded RAID on wikikube-worker2092, added: serviceops, Prod-Kubernetes; removed SRE.

Yep that's ours. I'll depool the node so you can reseat when you want.

Tue, Sep 10, 3:33 PM · Prod-Kubernetes, serviceops, DC-Ops, ops-codfw
Clement_Goubert added a comment to T374442: While mw-wikifunctions exists as a separate cluster, replace the httpbb appserver test suite with one specific to WF.

Sure, what URLs and expected HTTP codes/text would you like httpbb to test for?
Do you want serviceops to disable httpbb for wikifunctions in the meantime?

Tue, Sep 10, 3:08 PM · Abstract Wikipedia team (25Q1 (Jul–Sep)), serviceops, Wikifunctions
Clement_Goubert added a comment to T374351: Race condition on puppetdb in sre.hosts.rename cookbook.

Sorry I didn't see the updates to the discussion before merging the previous iteration. Patch up to disable puppet-agent-timer.timer

Tue, Sep 10, 2:56 PM · SRE-tools, Infrastructure-Foundations, serviceops-radar
Clement_Goubert added a comment to T373591: Relabel codfw kubernetes nodes.
Tue, Sep 10, 9:38 AM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert added a comment to T374249: Relabel codfw kubernetes nodes.
Tue, Sep 10, 9:36 AM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops

Mon, Sep 9

Clement_Goubert created T374380: Relabel codfw kubernetes nodes.
Mon, Sep 9, 4:38 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert added a comment to T374351: Race condition on puppetdb in sre.hosts.rename cookbook.

Correction, it worked for puppetdb, but they got added back to debmonitor. Will investigate further.

Mon, Sep 9, 2:54 PM · SRE-tools, Infrastructure-Foundations, serviceops-radar
Clement_Goubert added a comment to T374351: Race condition on puppetdb in sre.hosts.rename cookbook.

Tested via test-cookbook on mw2428 and mw2429 and they seem to have been correctly removed from both puppetdb and debmonitor.

Mon, Sep 9, 2:41 PM · SRE-tools, Infrastructure-Foundations, serviceops-radar
Clement_Goubert created T374351: Race condition on puppetdb in sre.hosts.rename cookbook.
Mon, Sep 9, 11:36 AM · SRE-tools, Infrastructure-Foundations, serviceops-radar
Clement_Goubert added a comment to T332015: Migrate poolcounter hosts to bookworm.

Sounds good.
From what I can see, poolcounter2004.codfw.wmnet and poolcounter1005.eqiad.wmnet are the least used, depending on whether you plan on doing the update before or after T370962: Southward Datacenter Switchover (September 2024)
Thanks for taking care of this <3

Mon, Sep 9, 10:59 AM · serviceops

Fri, Sep 6

Clement_Goubert created T374249: Relabel codfw kubernetes nodes.
Fri, Sep 6, 2:55 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert updated the task description for T373916: Relabel codfw kubernetes nodes.
Fri, Sep 6, 10:24 AM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops

Sep 4 2024

Clement_Goubert closed T373934: Update iDRAC on mw2260.codfw.wmnet as Invalid.

@Papaul We'll be decommissioning this host, sorry :)

Sep 4 2024, 10:27 AM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert closed T373934: Update iDRAC on mw2260.codfw.wmnet, a subtask of T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets, as Invalid.
Sep 4 2024, 10:25 AM · serviceops
Clement_Goubert updated the task description for T373916: Relabel codfw kubernetes nodes.
Sep 4 2024, 10:24 AM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert closed T373982: wikikube-worker2080.codfw.wmnet can't auth to registry as Resolved.

Pulling restricted images now works from wikikube-worker2080, resolving.

Sep 4 2024, 9:53 AM · Patch-For-Review, serviceops, netops, SRE, Wikimedia-production-error, Infrastructure-Foundations
Clement_Goubert closed T373982: wikikube-worker2080.codfw.wmnet can't auth to registry, a subtask of T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets, as Resolved.
Sep 4 2024, 9:51 AM · serviceops
Clement_Goubert changed the status of T373982: wikikube-worker2080.codfw.wmnet can't auth to registry from Open to In Progress.
Sep 4 2024, 9:18 AM · Patch-For-Review, serviceops, netops, SRE, Wikimedia-production-error, Infrastructure-Foundations

Sep 3 2024

Clement_Goubert updated the task description for T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets.
Sep 3 2024, 11:39 AM · serviceops
Clement_Goubert updated subscribers of T373819: Issues reimaging kubernetes workers due to user conflicts in systemd-timesyncd .

There was a point release yesterday, which could explain why it changed. cc @elukey

Sep 3 2024, 10:54 AM · serviceops, Infrastructure-Foundations
Clement_Goubert closed T351074: Move servers from the appserver/api cluster to kubernetes, a subtask of T290536: Serve production traffic via Kubernetes, as Resolved.
Sep 3 2024, 10:21 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T351074: Move servers from the appserver/api cluster to kubernetes as Resolved.
Sep 3 2024, 10:21 AM · serviceops, MW-on-K8s
Clement_Goubert updated the task description for T371262: decommission mw226[1-2].codfw.wmnet mw22[68-77].codfw.wmnet.
Sep 3 2024, 10:20 AM · SRE, DC-Ops, ops-codfw, serviceops, decommission-hardware

Sep 2 2024

Clement_Goubert added a comment to T364400: map the /api/ prefix to /w/rest.php.

Our ATS configuration was wrong and as such, traffic was being sent to the mw-web cluster instead of mw-api-ext. This had functionally no impact as the clusters are identical except for the amount of resources they have, but it is now fixed.
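
To illustrate the kind of path-based routing described above, here is a minimal sketch. The regex and the `pick_cluster` helper are hypothetical and are not the actual ATS configuration, which is not shown in this feed; only the cluster names mw-web and mw-api-ext come from the comment.

```python
import re

# Hypothetical pattern: the real ATS mapping rules are not shown here.
# Assumption: API traffic is anything under /api/ or hitting
# /w/api.php or /w/rest.php.
API_PATH = re.compile(r"^/(api/|w/(api|rest)\.php)")


def pick_cluster(path: str) -> str:
    """Return the cluster name a request path would be routed to."""
    if API_PATH.match(path):
        return "mw-api-ext"
    return "mw-web"


if __name__ == "__main__":
    for path in ("/api/", "/w/rest.php/v1/page/Foo", "/wiki/Main_Page"):
        print(path, "->", pick_cluster(path))
```

A misconfiguration like the one described would correspond to the pattern failing to match /api/ paths, so those requests fall through to the default cluster.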

Sep 2 2024, 3:57 PM · serviceops, Traffic, MW-Interfaces-Team
Clement_Goubert created P68532 testing /api/ and /w/rest.php regex.
Sep 2 2024, 3:15 PM
Clement_Goubert updated the task description for T371262: decommission mw226[1-2].codfw.wmnet mw22[68-77].codfw.wmnet.
Sep 2 2024, 2:13 PM · SRE, DC-Ops, ops-codfw, serviceops, decommission-hardware

Aug 30 2024

Clement_Goubert added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

Only hosts left are:

Aug 30 2024, 4:41 PM · serviceops, MW-on-K8s
Clement_Goubert edited P68327 (An Untitled Masterwork).
Aug 30 2024, 4:29 PM
Clement_Goubert merged task T373699: Relabel codfw kubernetes nodes mw237[789] into T373591: Relabel codfw kubernetes nodes.
Aug 30 2024, 4:26 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert merged T373699: Relabel codfw kubernetes nodes mw237[789] into T373591: Relabel codfw kubernetes nodes.
Aug 30 2024, 4:25 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert updated the task description for T373591: Relabel codfw kubernetes nodes.
Aug 30 2024, 4:24 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert created P68327 (An Untitled Masterwork).
Aug 30 2024, 4:00 PM
Clement_Goubert created P68324 (An Untitled Masterwork).
Aug 30 2024, 3:46 PM
Clement_Goubert merged task T373669: Relabel codfw kubernetes nodes mw2295,mw2296,mw2297 into T373591: Relabel codfw kubernetes nodes.
Aug 30 2024, 3:25 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert merged T373669: Relabel codfw kubernetes nodes mw2295,mw2296,mw2297 into T373591: Relabel codfw kubernetes nodes.
Aug 30 2024, 3:24 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert updated the task description for T373591: Relabel codfw kubernetes nodes.
Aug 30 2024, 3:23 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert created T373696: Relabel eqiad kubernetes nodes.
Aug 30 2024, 3:21 PM · SRE, ops-eqiad, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops

Aug 29 2024

Clement_Goubert updated the task description for T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets.
Aug 29 2024, 9:31 AM · serviceops

Aug 28 2024

Clement_Goubert updated the task description for T373457: Relabel codfw kubernetes nodes.
Aug 28 2024, 9:28 AM · ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert merged tasks T373505: Relabel codfw kubernetes nodes, T373491: Relabel codfw kubernetes nodes into T373457: Relabel codfw kubernetes nodes.
Aug 28 2024, 9:24 AM · ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert merged task T373491: Relabel codfw kubernetes nodes into T373457: Relabel codfw kubernetes nodes.
Aug 28 2024, 9:24 AM · ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert merged task T373505: Relabel codfw kubernetes nodes into T373457: Relabel codfw kubernetes nodes.
Aug 28 2024, 9:24 AM · ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops

Aug 26 2024

Clement_Goubert changed the status of T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets from Open to In Progress.
Aug 26 2024, 1:36 PM · serviceops
Clement_Goubert changed the status of T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets, a subtask of T354869: Re-IP hosts on codfw row A and B to new per-rack vlans/subnets, from Open to In Progress.
Aug 26 2024, 1:32 PM · netops, SRE, Infrastructure-Foundations
Clement_Goubert added a comment to T364417: deploy1003 implementation tracking.

Hi, I failed to ssh to deployment.eqiad.wmnet. The message I got is deployment.eqiad.wmnet: Permission denied (publickey). I was able to ssh in a couple of months ago. Is this related to the deployment? Is there any update on my end that could fix it?

Aug 26 2024, 9:51 AM · serviceops
ayounsi awarded T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets a 100 token.
Aug 26 2024, 6:11 AM · serviceops

Aug 23 2024

Clement_Goubert added a comment to T373149: decommission of codfw frack servers - frdb2001 frqueue2001 payments2003.

@Clement_Goubert Oh. Thanks for that. I must have forgotten it last night. Sorry about that.

Aug 23 2024, 4:17 PM · SRE, ops-codfw, DC-Ops, fundraising-tech-ops, decommission-hardware
Clement_Goubert updated the task description for T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets.
Aug 23 2024, 11:29 AM · serviceops
Clement_Goubert added a comment to T373149: decommission of codfw frack servers - frdb2001 frqueue2001 payments2003.

Just a heads up that the removal of the management DNS entries for these three servers popped up in a sre.dns.netbox run this morning. Since they're in decommissioning state in Netbox, I've proceeded with it.

Aug 23 2024, 9:50 AM · SRE, ops-codfw, DC-Ops, fundraising-tech-ops, decommission-hardware

Aug 22 2024

Clement_Goubert updated the task description for T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets.
Aug 22 2024, 12:26 PM · serviceops
Clement_Goubert updated the task description for T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets.
Aug 22 2024, 12:25 PM · serviceops
Clement_Goubert updated the task description for T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets.
Aug 22 2024, 12:22 PM · serviceops
Clement_Goubert closed T373048: https://en.wikipedia.org/api/ 404 Not Found due to extract2.php RewriteRule as Resolved.

All good now

cgoubert@cumin1002:~$ httpbb /srv/deployment/httpbb-tests/appserver/*.yaml --host mw-api-ext.discovery.wmnet --https_port 4447
Sending to mw-api-ext.discovery.wmnet...
PASS: 131 requests sent to mw-api-ext.discovery.wmnet. All assertions passed.
cgoubert@cumin1002:~$ curl --connect-to en.wikipedia.org:443:mw-api-ext.discovery.wmnet:4447 https://en.wikipedia.org/api/
<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
  <meta charset="utf-8">
  <title>APIs</title>
  <meta name=viewport content="width=device-width, initial-scale=1">
  <meta name="robots" content="index, follow">
  <style>
body { background: #fff; margin: 7% auto 0; padding: 2em 1em 1em; font: 15px/1.6 sans-serif; color: #333; max-width: 640px; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645AD; text-decoration: underline; }
</style>
</head>
<body>
        <h2>APIs</h2>
        <ul>
            <li><a href="/w/api.php">Action API</a>, providing rich queries, editing and content access.</li>
            <li><a href="/api/rest_v1/?doc">REST API v1</a>, mainly focused on high-volume content access.</li>
        </ul>
    <h2>Legal</h2>
    <ul>
        <li><a href="https://foundation.wikimedia.org/wiki/Developer_app_guidelines">App Guidelines</a>, for developers on how to properly reuse Wikimedia data, API, trademarks, and other content.</li>
    </ul>
</body>
</html>
Aug 22 2024, 10:28 AM · MW-on-K8s, Wikimedia-Apache-configuration, Regression, MW-Interfaces-Team
Clement_Goubert added a comment to T373048: https://en.wikipedia.org/api/ 404 Not Found due to extract2.php RewriteRule.

As expected:

cgoubert@cumin1002:~$ httpbb /srv/deployment/httpbb-tests/appserver/test_main.yaml --host mwdebug1002.eqiad.wmnet
Sending to mwdebug1002.eqiad.wmnet...
PASS: 54 requests sent to mwdebug1002.eqiad.wmnet. All assertions passed.
cgoubert@cumin1002:~$ httpbb /srv/deployment/httpbb-tests/appserver/test_main.yaml --host mw-api-ext.discovery.wmnet --https_port 4447
Sending to mw-api-ext.discovery.wmnet...
https://en.wikipedia.org/api/ (/srv/deployment/httpbb-tests/appserver/test_main.yaml:48)
    Status code: expected 200, got 404.
    Body: expected to contain 'providing rich queries, editing and content access', got 'File not found.\n'.
===
FAIL: 54 requests sent to mw-api-ext.discovery.wmnet. 1 request with failed assertions.
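
The failure above combines two assertions: one on the status code and one on the response body. A minimal sketch of that kind of check follows. It is not httpbb's implementation, and the `check_response` helper is hypothetical; the values are taken from the output above.

```python
def check_response(status, body, expected_status, body_contains):
    """Return a list of failure messages; an empty list means all passed."""
    failures = []
    if status != expected_status:
        failures.append(
            f"Status code: expected {expected_status}, got {status}."
        )
    if body_contains not in body:
        failures.append(
            f"Body: expected to contain {body_contains!r}, got {body!r}."
        )
    return failures


if __name__ == "__main__":
    # Values copied from the failing mw-api-ext run above.
    for message in check_response(
        status=404,
        body="File not found.\n",
        expected_status=200,
        body_contains="providing rich queries, editing and content access",
    ):
        print(message)
```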
Aug 22 2024, 10:12 AM · MW-on-K8s, Wikimedia-Apache-configuration, Regression, MW-Interfaces-Team
Clement_Goubert claimed T373048: https://en.wikipedia.org/api/ 404 Not Found due to extract2.php RewriteRule.
Aug 22 2024, 10:06 AM · MW-on-K8s, Wikimedia-Apache-configuration, Regression, MW-Interfaces-Team
Clement_Goubert added a comment to T373048: https://en.wikipedia.org/api/ 404 Not Found due to extract2.php RewriteRule.

Maybe it's an urban myth, but I always thought that if index.php/index.html exists in a directory, apache automatically sends requests for the directory there (maybe it's lighttpd only, but I have seen this work in many places)

Aug 22 2024, 10:00 AM · MW-on-K8s, Wikimedia-Apache-configuration, Regression, MW-Interfaces-Team
Clement_Goubert added a comment to T373048: https://en.wikipedia.org/api/ 404 Not Found due to extract2.php RewriteRule.

At first I thought it could be due to this change routing /api/ to /w/rest.php for T364400

Aug 22 2024, 9:31 AM · MW-on-K8s, Wikimedia-Apache-configuration, Regression, MW-Interfaces-Team

Aug 21 2024

Clement_Goubert updated the task description for T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets.
Aug 21 2024, 3:35 PM · serviceops
Clement_Goubert added a comment to T372916: Relabel codfw kubernetes nodes.

Thank you!

Aug 21 2024, 2:55 PM · ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert updated the task description for T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets.
Aug 21 2024, 2:55 PM · serviceops
Clement_Goubert added a comment to T372916: Relabel codfw kubernetes nodes.

Apparently the removal from the puppetserver wasn't properly done by the cookbook. I've done it manually and it should resolve. Sorry about that.

Aug 21 2024, 10:12 AM · ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert added a comment to T371633: Burst of GuzzleHttp Exception for http://localhost:6025/call/constraint-regex-checker.

This is most likely caused by envoy terminating before mediawiki.

Aug 21 2024, 9:56 AM · User-ItamarWMDE, Wikidata Dev Team, Wikidata, wmde-wikidata-tech, Wikimedia-production-error, Shellbox, serviceops, Wikibase-Quality-Constraints, Deployments

Aug 20 2024

Clement_Goubert created T372916: Relabel codfw kubernetes nodes.
Aug 20 2024, 4:46 PM · ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert added a comment to T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets.

From what I can gather, the automation is there with the --move-vlan option to the reimage cookbook. I think the cabling is already correct, and only the port's vlan and server IP need to change. We can probably take advantage of these reimages to rename the former appservers at the same time.

Aug 20 2024, 11:41 AM · serviceops
Clement_Goubert triaged T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets as High priority.
Aug 20 2024, 11:25 AM · serviceops
Clement_Goubert created T372878: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets.
Aug 20 2024, 11:22 AM · serviceops
Clement_Goubert edited projects for T372825: Unexpected helmfile changes when attempting a k8s deployment for a miscweb site, added: serviceops; removed collaboration-services.

Changes to sidecar images are generally fine to deploy. If in doubt, you can ask on IRC either in #wikimedia-operations or #wikimedia-serviceops and someone should be able to answer. Thanks for deploying all of them :)

Aug 20 2024, 10:26 AM · serviceops

Aug 19 2024

Clement_Goubert added a comment to T371354: startupregistrystats-testwiki maintenance job is failing.

I don't remember, to be honest. I created the task after digging around for a little bit and finding the timing coincidental, and it fixed itself with the next run.

Aug 19 2024, 9:52 AM · Regression, MW-1.43-notes (1.43.0-wmf.16; 2024-07-30), Wikimedia-production-error

Aug 1 2024

Clement_Goubert updated the task description for T360636: Phase out cergen for ServiceOps services.
Aug 1 2024, 10:14 AM · Patch-For-Review, serviceops, Epic, SRE

Jul 30 2024

Clement_Goubert closed T31186: Rename Võro Wikipedia, fiu-vro -> vro as Resolved.
Jul 30 2024, 4:12 PM · Wiki-Setup (Rename), Wikimedia-Language-setup
Clement_Goubert closed T25216: Move the Nourmande Wikipedia from nrm to nrf as Resolved.
Jul 30 2024, 4:09 PM · Patch-Needs-Improvement, Wiki-Setup (Rename), Wikimedia-Language-setup
Clement_Goubert closed T371354: startupregistrystats-testwiki maintenance job is failing as Resolved.

Resolved with the 10:10UTC run

Jul 30 2024, 10:23 AM · Regression, MW-1.43-notes (1.43.0-wmf.16; 2024-07-30), Wikimedia-production-error
Clement_Goubert closed T371354: startupregistrystats-testwiki maintenance job is failing, a subtask of T366961: 1.43.0-wmf.16 deployment blockers, as Resolved.
Jul 30 2024, 10:23 AM · User-brennen, Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments
Clement_Goubert added a subtask for T366961: 1.43.0-wmf.16 deployment blockers: T371354: startupregistrystats-testwiki maintenance job is failing.
Jul 30 2024, 10:06 AM · User-brennen, Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments
Clement_Goubert added a parent task for T371354: startupregistrystats-testwiki maintenance job is failing: T366961: 1.43.0-wmf.16 deployment blockers.
Jul 30 2024, 10:06 AM · Regression, MW-1.43-notes (1.43.0-wmf.16; 2024-07-30), Wikimedia-production-error
Clement_Goubert triaged T371354: startupregistrystats-testwiki maintenance job is failing as Medium priority.
Jul 30 2024, 10:06 AM · Regression, MW-1.43-notes (1.43.0-wmf.16; 2024-07-30), Wikimedia-production-error

Jul 29 2024

Clement_Goubert updated the task description for T358489: mw2420-mw2451 do have unnecessary raid controllers (configured).
Jul 29 2024, 3:42 PM · Patch-For-Review, SRE, serviceops
Clement_Goubert closed T367949: Spin down api_appserver and appserver clusters as Resolved.
Jul 29 2024, 3:27 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T367949: Spin down api_appserver and appserver clusters, a subtask of T290536: Serve production traffic via Kubernetes, as Resolved.
Jul 29 2024, 3:22 PM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T371262: decommission mw226[1-2].codfw.wmnet mw22[68-77].codfw.wmnet.
Jul 29 2024, 2:24 PM · SRE, DC-Ops, ops-codfw, serviceops, decommission-hardware
Clement_Goubert created T371262: decommission mw226[1-2].codfw.wmnet mw22[68-77].codfw.wmnet.
Jul 29 2024, 2:22 PM · SRE, DC-Ops, ops-codfw, serviceops, decommission-hardware
Clement_Goubert triaged T358489: mw2420-mw2451 do have unnecessary raid controllers (configured) as Low priority.
Jul 29 2024, 2:12 PM · Patch-For-Review, SRE, serviceops
Clement_Goubert triaged T371260: Relabel codfw kubernetes nodes as Low priority.
Jul 29 2024, 2:09 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert created T371260: Relabel codfw kubernetes nodes.
Jul 29 2024, 2:09 PM · SRE, ops-codfw, Kubernetes, Prod-Kubernetes, DC-Ops, serviceops
Clement_Goubert updated the task description for T367949: Spin down api_appserver and appserver clusters.
Jul 29 2024, 11:04 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Jul 25 2024

Clement_Goubert added a comment to T369011: hw troubleshooting: Management and main interfaces down for kubernetes1051.eqiad.wmnet.

Host BGP re-enabled, back in Active status and uncordoned, all looks good. Thanks @VRiley-WMF

Small note about the confusing names of the BGP state machine. "Active" means "actively trying to establish the session". "Established" means it has managed to connect, and usually things progress through all the states to "Established" in a second or two unless there is a problem. So if you see a state of "Active" when checking, it usually means there is a problem!

In this case all is good:

cmooney@lsw1-e3-eqiad> show bgp summary | match "10.64.132.28|2620:0:861:10b:10:64:132:28" 
10.64.132.28          64601        917        883       0       1     2:17:09 Establ
2620:0:861:10b:10:64:132:28       64601        920        883       0       1     2:17:09 Establ
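
The distinction above can be sketched as a small check. This is a rough illustration only: the helper names are hypothetical, and it assumes the peer address is the first whitespace-separated field and the state is the last one, as in the two lines of output above.

```python
# BGP finite state machine states as named in RFC 4271.
BGP_STATES = ("Idle", "Connect", "Active", "OpenSent", "OpenConfirm", "Established")


def parse_summary_line(line: str) -> tuple[str, str]:
    """Split a summary line into (peer, state).

    Assumption: the peer is the first field and the state is the last,
    matching the output shown above.
    """
    fields = line.split()
    return fields[0], fields[-1]


def session_is_up(state: str) -> bool:
    """Only Established means the session is up; "Active" does not.

    The output above abbreviates the state as "Establ", so match on
    that prefix.
    """
    return state.strip().lower().startswith("establ")


if __name__ == "__main__":
    line = "10.64.132.28          64601        917        883       0       1     2:17:09 Establ"
    peer, state = parse_summary_line(line)
    print(peer, session_is_up(state))
```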

Jul 25 2024, 12:23 PM · SRE, ops-eqiad, DC-Ops, Prod-Kubernetes, serviceops