
Clement_Goubert (claime)
Senior SRE

User Details

User Since
Jul 26 2022, 2:11 PM (90 w, 2 d)
Availability
Available
IRC Nick
claime
LDAP User
Clément Goubert
MediaWiki User
CGoubert-WMF

Recent Activity

Today

Clement_Goubert added a comment to T351074: Move servers from the appserver/api cluster to kubernetes.

I abandoned the CR to move more eqiad api_appservers because it would leave only 15, 5 of them canaries. We still have a bit more margin on the appserver side in eqiad.

Thu, Apr 18, 2:46 PM · Patch-For-Review, serviceops, MW-on-K8s
Clement_Goubert added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

[...]
So it seems like two separate issues.

I guess sometimes the job runner pod gets terminated in the middle of a job. That would be fine if something like https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1008403 got merged.

Thu, Apr 18, 11:58 AM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management
Clement_Goubert added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

request_terminate_timeout for mw-jobrunners should now be set to 86400, as it was on bare metal.

Thu, Apr 18, 11:33 AM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management
Clement_Goubert added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

I think I found it

Thu, Apr 18, 9:59 AM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management
Clement_Goubert added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

How can I get into a pod in the job runners namespace(?) via shell.php? I want to try some stuff.

Thu, Apr 18, 9:50 AM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management

Yesterday

Clement_Goubert added a comment to T362766: 2024-04-17 mw-* went down in eqiad.

As an aside, and contributing to the time to recovery, we observed the apache container getting OOM-killed. We strongly suspect this was caused by backpressure from the php-fpm workers, which were busy waiting for the DNS response.

Wed, Apr 17, 1:53 PM · serviceops, Sustainability (Incident Followup)
Clement_Goubert added a comment to T362766: 2024-04-17 mw-* went down in eqiad.

The change was rolled back in eqiad, and eqiad was repooled around 10:45. A terminating dot was added to the DNS name in codfw to avoid a recursive request.
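
For illustration only: a terminating dot makes a DNS name absolute, so a stub resolver queries it as-is instead of first walking its search list. The sketch below is a simplified model of the resolv.conf `search`/`ndots` rules, not the configuration involved in this incident. The host name, search domains and `ndots` value are hypothetical, and reading the "recursive request" above as search-list expansion is an assumption.

```python
def candidate_queries(name, search_domains, ndots=1):
    """Return the names a stub resolver would try, in order, for `name`.

    Simplified model of the resolv.conf `search`/`ndots` behaviour.
    """
    # A terminating dot marks the name as absolute: it is queried as-is
    # and the search list is never consulted.
    if name.endswith("."):
        return [name]

    expanded = [f"{name}.{domain}." for domain in search_domains]
    as_is = f"{name}."

    # With at least `ndots` dots the name is tried as-is first;
    # otherwise the search domains are tried before the bare name.
    if name.count(".") >= ndots:
        return [as_is] + expanded
    return expanded + [as_is]


# Hypothetical values, chosen only for illustration.
SEARCH = ["ns.svc.cluster.local", "svc.cluster.local", "cluster.local"]

print(candidate_queries("service.example.org", SEARCH, ndots=5))
print(candidate_queries("service.example.org.", SEARCH, ndots=5))
```

In this model the name without the dot produces one query per search domain before the real name is tried, while the dotted form produces a single query.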

Wed, Apr 17, 11:33 AM · serviceops, Sustainability (Incident Followup)

Tue, Apr 16

Clement_Goubert added a comment to T358308: AssembleUploadChunksJob & PublishStashedFile jobs seem to be timing out at about 3 minutes, but should be ~20 minutes.

We can exclude a bad setting of the async trait for mw-jobrunner.
From a pod in production, via shell.php:

> use Wikimedia\MWConfig\ClusterConfig;
> ClusterConfig::getInstance()->isK8s()
= true
Tue, Apr 16, 3:54 PM · Patch-For-Review, WMF-JobQueue, MediaWiki-File-management
Clement_Goubert added a comment to T362518: Deprecate buster-backports.

The following images fail docker-reporter checks because they haven't been rebuilt on top of the new buster base image:

base images
docker-registry.wikimedia.org/docker-gc:1.0.0-20230402  [FAIL]
docker-registry.wikimedia.org/golang:1.14-1-20240407  [FAIL]
docker-registry.wikimedia.org/httpd-fcgi:2.4.38-10-u5-20240407  [FAIL]
docker-registry.wikimedia.org/kubeflow-kfserving-agent:0.6.0-1-20211017  [FAIL]
docker-registry.wikimedia.org/kubeflow-kfserving-controller:0.6.0-1-20211017  [FAIL]
docker-registry.wikimedia.org/kubeflow-kfserving-storage-initializer:0.6.0-5-20211010  [FAIL]
docker-registry.wikimedia.org/loki:1.5.0-2-20230604  [FAIL]
docker-registry.wikimedia.org/mediawiki-httpd:0.1.8-s2-20240407  [FAIL]
docker-registry.wikimedia.org/php7.2-cli:0.2.0-s3-20221204  [FAIL]
docker-registry.wikimedia.org/php7.2-fpm:0.4.0-20221204  [FAIL]
docker-registry.wikimedia.org/php7.2-fpm-multiversion-base:1.0.7-20221204  [FAIL]
docker-registry.wikimedia.org/php7.4-cli-icu67:7.4.33-1-s2-20231106-20231106  [FAIL]
docker-registry.wikimedia.org/php7.4-fpm-icu67:7.4.33-3-20231106-20231106  [FAIL]
docker-registry.wikimedia.org/wikimedia-buster:20210523  [FAIL]
Tue, Apr 16, 2:00 PM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
Jdforrester-WMF awarded T362662: Rename X-Wikimedia-Debug k8s-experimental option a Like token.
Tue, Apr 16, 1:58 PM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert triaged T362662: Rename X-Wikimedia-Debug k8s-experimental option as Low priority.
Tue, Apr 16, 1:54 PM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert created T362662: Rename X-Wikimedia-Debug k8s-experimental option.
Tue, Apr 16, 1:54 PM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert created T362628: Find a way to stage updated PHP packages on wikikube.
Tue, Apr 16, 9:53 AM · Release-Engineering-Team, serviceops, MW-on-K8s, Scap

Mon, Apr 15

Clement_Goubert updated the task description for T362518: Deprecate buster-backports.
Mon, Apr 15, 11:20 AM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
Clement_Goubert updated the task description for T362518: Deprecate buster-backports.
Mon, Apr 15, 11:17 AM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
Clement_Goubert updated the task description for T362518: Deprecate buster-backports.
Mon, Apr 15, 10:25 AM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
Clement_Goubert changed the status of T362518: Deprecate buster-backports from Open to In Progress.
Mon, Apr 15, 10:09 AM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops
Clement_Goubert created T362518: Deprecate buster-backports.
Mon, Apr 15, 10:09 AM · Patch-For-Review, Infrastructure-Foundations, Release-Engineering-Team, serviceops

Fri, Apr 12

Clement_Goubert added a comment to T329857: MediaWiki deploy servers should not be mediawiki installation targets.

@Clement_Goubert I noticed the /srv/mediawiki.old.20230424.T329857 directory on deploy1002.eqiad.wmnet today. It's safe to delete.

Fri, Apr 12, 9:07 AM · serviceops, Performance-Team (Radar), Deployments, Release-Engineering-Team

Thu, Apr 11

Clement_Goubert updated the task description for T362323: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons).
Thu, Apr 11, 12:51 PM · MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T362323: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons).
Thu, Apr 11, 12:48 PM · MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s
taavi awarded T362323: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) a Barnstar token.
Thu, Apr 11, 12:33 PM · MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T290536: Serve production traffic via Kubernetes.
Thu, Apr 11, 12:32 PM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
jijiki awarded T362323: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) a Burninate token.
Thu, Apr 11, 12:23 PM · MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s
Ladsgroup awarded T362323: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) a Love token.
Thu, Apr 11, 12:22 PM · MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s
hnowlan awarded T362323: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) a Stroopwafel token.
Thu, Apr 11, 12:16 PM · MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert triaged T362323: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons) as High priority.
Thu, Apr 11, 12:14 PM · MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert created T362323: Move 100% of external traffic to Kubernetes (excluding Votewiki and Commons).
Thu, Apr 11, 12:14 PM · MoveComms-Support, SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T360763: Move 70% of mediawiki external requests to mw on k8s as Resolved.
Thu, Apr 11, 12:08 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T360763: Move 70% of mediawiki external requests to mw on k8s, a subtask of T290536: Serve production traffic via Kubernetes, as Resolved.
Thu, Apr 11, 12:05 PM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert added a comment to T362316: Migrate ml-services to mw-api-int.

Aaaand I just realized they all use http and not https, so now I can change them all.

Thu, Apr 11, 11:49 AM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T362316: Migrate ml-services to mw-api-int.
Thu, Apr 11, 11:43 AM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s
Clement_Goubert triaged T362316: Migrate ml-services to mw-api-int as Medium priority.
Thu, Apr 11, 10:36 AM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s
Clement_Goubert created T362316: Migrate ml-services to mw-api-int.
Thu, Apr 11, 10:36 AM · Patch-For-Review, Machine-Learning-Team, SRE, serviceops, MW-on-K8s

Wed, Apr 10

Clement_Goubert added a comment to T213689: Create a readiness probe for zotero.

Thanks for linking the actual current Zotero probe - I see it checks the export endpoint? Where can I see the history of that one? This was very outdated: https://wikitech.wikimedia.org/wiki/Zotero/Deploying_zotero#Monitoring

Wed, Apr 10, 12:18 PM · Patch-For-Review, SRE, Citoid, serviceops
Clement_Goubert added a comment to T213689: Create a readiness probe for zotero.

Summing up the discussion on the patch set: this is not what is wanted. Turning monitoring on in the service would turn on Prometheus metrics scraping, and zotero doesn't expose any metrics. What may be wanted instead is to add a probe of type swagger to the service definition in service.yaml, but I am unsure whether x-amples are needed for this to work correctly.

Wed, Apr 10, 11:37 AM · Patch-For-Review, SRE, Citoid, serviceops
Clement_Goubert added a comment to T213689: Create a readiness probe for zotero.

I think it's because monitoring is disabled in the service's values.yaml

Wed, Apr 10, 11:05 AM · Patch-For-Review, SRE, Citoid, serviceops
Clement_Goubert added a comment to T360636: Phase out cergen for ServiceOps services.

chartmuseum and docker-registry done

Wed, Apr 10, 10:27 AM · Patch-For-Review, serviceops, Epic, SRE
Clement_Goubert updated the task description for T360636: Phase out cergen for ServiceOps services.
Wed, Apr 10, 10:26 AM · Patch-For-Review, serviceops, Epic, SRE

Tue, Apr 9

Clement_Goubert updated the task description for T360636: Phase out cergen for ServiceOps services.
Tue, Apr 9, 10:38 AM · Patch-For-Review, serviceops, Epic, SRE

Mon, Apr 8

Clement_Goubert added a comment to T361724: scap should check if it is running within a tmux/screen.

For this use case the only keybinding you would need to know is how to exit once your run is done, which you would do the same way you exit a shell: with exit or Ctrl+D (^D).
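
As a rough sketch of the check the task title asks for, and not scap's actual implementation: tmux exports `TMUX` inside its sessions and GNU screen exports `STY`, so looking at the environment is one way to detect them. The function name `multiplexer_in_use` is hypothetical.

```python
import os


def multiplexer_in_use(environ=None):
    """Return "tmux", "screen", or None based on environment variables."""
    env = os.environ if environ is None else environ
    # tmux exports TMUX inside its sessions; GNU screen exports STY.
    if env.get("TMUX"):
        return "tmux"
    if env.get("STY"):
        return "screen"
    return None


if multiplexer_in_use() is None:
    print("Warning: not running inside tmux or screen")
```

Passing a plain dict instead of relying on `os.environ` keeps the check easy to test. Note that these variables can be lost across `sudo` or a nested SSH session, so a miss is a hint and not proof.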

Mon, Apr 8, 10:23 AM · Patch-For-Review, Sustainability (Incident Followup), Scap, Release-Engineering-Team, serviceops

Thu, Mar 28

Clement_Goubert updated the task description for T360763: Move 70% of mediawiki external requests to mw on k8s.
Thu, Mar 28, 12:13 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T333120: Migrate internal traffic to k8s.
Thu, Mar 28, 11:59 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T333120: Migrate internal traffic to k8s.
Thu, Mar 28, 11:58 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T358213: Migrate restbase from mwapi-async to mw-api-int as Resolved.
Thu, Mar 28, 11:56 AM · Patch-For-Review, RESTBase, SRE, serviceops, MW-on-K8s
Clement_Goubert closed T358213: Migrate restbase from mwapi-async to mw-api-int, a subtask of T333120: Migrate internal traffic to k8s, as Resolved.
Thu, Mar 28, 11:55 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Wed, Mar 27

Clement_Goubert closed T360867: httpbb appserver test breaks deployment of the week due to a timeout parsing page as Resolved.

--retry_on_timeout merged and deployed. Hopefully this makes deployments easier and closer to the tests we actually want to run.

Wed, Mar 27, 3:51 PM · Patch-For-Review, serviceops, Release-Engineering-Team, Deployments
Clement_Goubert added a comment to T358213: Migrate restbase from mwapi-async to mw-api-int.

50%

Wed, Mar 27, 3:47 PM · Patch-For-Review, RESTBase, SRE, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T358213: Migrate restbase from mwapi-async to mw-api-int.
Wed, Mar 27, 3:46 PM · Patch-For-Review, RESTBase, SRE, serviceops, MW-on-K8s
Clement_Goubert added a comment to T358213: Migrate restbase from mwapi-async to mw-api-int.

Things to keep an eye on:

  • Upstream error rate is higher on mw-api-int than bare-metal

  • Connection establishment time is way higher on mw-api-int

  • Upstream latencies are consistently higher on mw-api-int

Wed, Mar 27, 11:54 AM · Patch-For-Review, RESTBase, SRE, serviceops, MW-on-K8s
Clement_Goubert added a comment to T361024: NEW BUG REPORT SSL certificate verification error when using internal API endpoints from conda-analytics and Jupyter on stat host.

It wouldn't fix it for anything but conda-analytics, but you could add that environment variable to /opt/conda-analytics/etc/profile.d/conda.sh?

Wed, Mar 27, 11:34 AM · Data-Platform-SRE, Data-Platform
Clement_Goubert updated subscribers of T360867: httpbb appserver test breaks deployment of the week due to a timeout parsing page.

Some context given by @RLazarus from the CR:

At the time we added this test, the Barack Obama page did consistently load within the default timeout, and we wanted a test to make sure that remained true. Being "notoriously slow" is exactly the reason we picked it.
Have we decided it's okay for that page to take longer now? If so, we might as well just delete this test rather than bumping the timeout; there's no other reason to keep it around. If not, we should keep the test and fix it so that it passes.

Wed, Mar 27, 11:25 AM · Patch-For-Review, serviceops, Release-Engineering-Team, Deployments

Tue, Mar 26

Clement_Goubert added a comment to T358213: Migrate restbase from mwapi-async to mw-api-int.

10% of RESTbase's backend mwapi requests are now made to mw-api-int

Tue, Mar 26, 2:59 PM · Patch-For-Review, RESTBase, SRE, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T358213: Migrate restbase from mwapi-async to mw-api-int.
Tue, Mar 26, 2:55 PM · Patch-For-Review, RESTBase, SRE, serviceops, MW-on-K8s

Mon, Mar 25

Clement_Goubert updated the task description for T333120: Migrate internal traffic to k8s.
Mon, Mar 25, 12:25 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T358213: Migrate restbase from mwapi-async to mw-api-int.
Mon, Mar 25, 12:18 PM · Patch-For-Review, RESTBase, SRE, serviceops, MW-on-K8s
Clement_Goubert closed T360767: Migrate changeprop to mw-api-int, a subtask of T333120: Migrate internal traffic to k8s, as Resolved.
Mon, Mar 25, 12:18 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T360767: Migrate changeprop to mw-api-int as Resolved.
Mon, Mar 25, 12:18 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert added a comment to T360767: Migrate changeprop to mw-api-int.

mw-api-int is now receiving all calls to mwapi_uri from changeprop

Mon, Mar 25, 12:14 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert changed the status of T360767: Migrate changeprop to mw-api-int, a subtask of T333120: Migrate internal traffic to k8s, from Open to In Progress.
Mon, Mar 25, 10:47 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert changed the status of T360767: Migrate changeprop to mw-api-int from Open to In Progress.
Mon, Mar 25, 10:47 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert added a watcher for serviceops: Clement_Goubert.
Mon, Mar 25, 9:31 AM
Clement_Goubert added a watcher for MW-on-K8s: Clement_Goubert.
Mon, Mar 25, 9:30 AM

Fri, Mar 22

Clement_Goubert added a comment to T360763: Move 70% of mediawiki external requests to mw on k8s.

Given that we have increased mw-web and mw-api-ext by 53 and 10 replicas respectively to cope with handling all the appserver traffic during the codfw depool part of the switchover, the first 5% increase will, in my opinion, not need an associated replica increase.

Fri, Mar 22, 12:39 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T333120: Migrate internal traffic to k8s.
Fri, Mar 22, 12:12 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T333120: Migrate internal traffic to k8s.
Fri, Mar 22, 12:05 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert triaged T360767: Migrate changeprop to mw-api-int as High priority.
Fri, Mar 22, 12:02 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert created T360767: Migrate changeprop to mw-api-int.
Fri, Mar 22, 12:02 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T360625: Alter changeprop chart to use the service mesh as Declined.

Abandoned because the internals of changeprop make it inadvisable to add another layer. I'll create another task for its migration to mw-api-int.

Fri, Mar 22, 11:54 AM · WMF-JobQueue, ChangeProp, SRE, serviceops, MW-on-K8s
Clement_Goubert closed T360625: Alter changeprop chart to use the service mesh, a subtask of T333120: Migrate internal traffic to k8s, as Declined.
Fri, Mar 22, 11:52 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T290536: Serve production traffic via Kubernetes.
Fri, Mar 22, 11:29 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert added a comment to T360763: Move 70% of mediawiki external requests to mw on k8s.

Waiting on codfw repool as part of T357547: ☂️ Northward Datacentre Switchover (March 2024) before moving forward with this increase.

Fri, Mar 22, 11:24 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Ladsgroup awarded T360763: Move 70% of mediawiki external requests to mw on k8s a Love token.
Fri, Mar 22, 11:23 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert changed the status of T360763: Move 70% of mediawiki external requests to mw on k8s from Open to In Progress.
Fri, Mar 22, 11:23 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert changed the status of T360763: Move 70% of mediawiki external requests to mw on k8s, a subtask of T290536: Serve production traffic via Kubernetes, from Open to In Progress.
Fri, Mar 22, 11:22 AM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert created T360763: Move 70% of mediawiki external requests to mw on k8s.
Fri, Mar 22, 11:21 AM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert added a comment to T360652: imagecatalog_record.service fails due to read-only sqlite database.

Checking on deploy2002 (which we moved away from with this switchover), the catalog.sqlite file stays in place after a switchover, and is now owned by the helm user there, as you mentioned in T287130#7651203.

Fri, Mar 22, 10:52 AM · Datacenter-Switchover, serviceops
Clement_Goubert added a comment to T360625: Alter changeprop chart to use the service mesh.

That makes sense. I don't necessarily have a problem with it not using the service mesh (except for the lack of telemetry), except for the fact that it means migrating it in one go to use mw-api-int as a backend.

Fri, Mar 22, 10:28 AM · WMF-JobQueue, ChangeProp, SRE, serviceops, MW-on-K8s

Thu, Mar 21

Clement_Goubert lowered the priority of T360652: imagecatalog_record.service fails due to read-only sqlite database from High to Low.

As the action taken in production fixed the immediate problem, lowering priority.

Thu, Mar 21, 3:58 PM · Datacenter-Switchover, serviceops
Clement_Goubert added a comment to T360652: imagecatalog_record.service fails due to read-only sqlite database.
cgoubert@deploy1002:~$ sudo chown imagecatalog:imagecatalog /srv/deployment/imagecatalog/catalog.sqlite
cgoubert@deploy1002:~$ sudo systemctl restart imagecatalog_record.service
Thu, Mar 21, 3:57 PM · Datacenter-Switchover, serviceops
Clement_Goubert triaged T360652: imagecatalog_record.service fails due to read-only sqlite database as High priority.
Thu, Mar 21, 3:50 PM · Datacenter-Switchover, serviceops
Clement_Goubert created T360652: imagecatalog_record.service fails due to read-only sqlite database.
Thu, Mar 21, 3:50 PM · Datacenter-Switchover, serviceops
Clement_Goubert updated the task description for T360625: Alter changeprop chart to use the service mesh.
Thu, Mar 21, 2:43 PM · WMF-JobQueue, ChangeProp, SRE, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T360625: Alter changeprop chart to use the service mesh.
Thu, Mar 21, 12:48 PM · WMF-JobQueue, ChangeProp, SRE, serviceops, MW-on-K8s
Clement_Goubert triaged T360625: Alter changeprop chart to use the service mesh as High priority.
Thu, Mar 21, 12:31 PM · WMF-JobQueue, ChangeProp, SRE, serviceops, MW-on-K8s
Clement_Goubert created T360625: Alter changeprop chart to use the service mesh.
Thu, Mar 21, 12:31 PM · WMF-JobQueue, ChangeProp, SRE, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T333120: Migrate internal traffic to k8s.
Thu, Mar 21, 12:15 PM · Patch-For-Review, Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s

Mar 19 2024

Clement_Goubert added a comment to T357547: ☂️ Northward Datacentre Switchover (March 2024).

Some tweaking of replica counts was needed on mw-on-k8s, which was expected as this is the first switchover where more of the external traffic goes to it than to bare-metal clusters.

Mar 19 2024, 4:11 PM · Patch-For-Review, Datacenter-Switchover, Data-Persistence, SRE Observability (FY2023/2024-Q3), collaboration-services, observability, serviceops, DC-Ops, Traffic

Mar 7 2024

Clement_Goubert added a comment to T358117: Adapt scap's testing strategy to mw-on-k8s.

@dancy Thanks a bunch! \o/

Mar 7 2024, 5:32 PM · Release-Engineering-Team (Now this 🫠), Scap, SRE, serviceops, MW-on-K8s
Clement_Goubert closed T357508: Move 60% of mediawiki external requests to mw on k8s as Resolved.
Mar 7 2024, 1:26 PM · Release-Engineering-Team, SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert updated the task description for T290536: Serve production traffic via Kubernetes.
Mar 7 2024, 1:24 PM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T357508: Move 60% of mediawiki external requests to mw on k8s, a subtask of T290536: Serve production traffic via Kubernetes, as Resolved.
Mar 7 2024, 1:23 PM · Release-Engineering-Team (Seen), SRE, Traffic, serviceops, MW-on-K8s
Clement_Goubert closed T357508: Move 60% of mediawiki external requests to mw on k8s, a subtask of T357402: Scap should check errors coming from mw-on-k8s canaries during deployments, as Resolved.
Mar 7 2024, 1:23 PM · Release-Engineering-Team (Now this 🫠), Scap, SRE, serviceops, MW-on-K8s
Clement_Goubert closed T356497: Raise mw-api-int replicas for increased load from mobileapps, a subtask of T339865: PCS should use parsoid endpoints in MediaWiki, not RESTbase, as Resolved.
Mar 7 2024, 12:55 PM · Content-Transform-Team-WIP, Page Content Service, RESTBase Sunsetting
Clement_Goubert closed T356497: Raise mw-api-int replicas for increased load from mobileapps as Resolved.
Mar 7 2024, 12:55 PM · serviceops, Content-Transform-Team-WIP, Page Content Service, RESTBase Sunsetting
Clement_Goubert added projects to T359509: REST API calls suddenly all returning 400: Content-Transform-Team, Parsoid.
Mar 7 2024, 10:36 AM · MW-1.42-notes (1.42.0-wmf.23; 2024-03-19), MW-Interfaces-Team, Content-Transform-Team-WIP, Patch-For-Review, RESTBase-API

Mar 5 2024

Clement_Goubert closed T359155: Scap deployments to mw-on-k8s timing out, a subtask of T354439: 1.42.0-wmf.21 deployment blockers, as Resolved.
Mar 5 2024, 1:55 PM · Release-Engineering-Team (Now this 🫠), Release, Train Deployments
Clement_Goubert closed T359155: Scap deployments to mw-on-k8s timing out as Resolved.

This is now resolved and the train is proceeding.

Mar 5 2024, 1:55 PM · MW-on-K8s, serviceops
Clement_Goubert updated subscribers of T359155: Scap deployments to mw-on-k8s timing out.

This was caused by an error with the php-fpm image introduced in https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/994764
@jijiki is reverting this change and rebuilding the image, and doing a full rebuild of the mediawiki images following that.

Mar 5 2024, 12:48 PM · MW-on-K8s, serviceops
Clement_Goubert merged T348466: Rethink kubernetes etcd storage into T353464: Migrate wikikube control planes to hardware nodes.
Mar 5 2024, 12:45 PM · serviceops, Prod-Kubernetes, Kubernetes