Downtimed, silence ID e5915daa-08f1-45f6-b805-fee5078d64da
Today
Yesterday
I've uploaded a patch to bump the memory limit to 1G, since I've seen it spike up to 980M.
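A limit bump like this is typically a one-line change in the deployment's Helm values. A minimal sketch, assuming a standard Kubernetes resources stanza (the actual file layout and keys in the chart may differ):

```yaml
# Hypothetical values fragment: raise the container memory limit to 1Gi
# after observing spikes near 980Mi. Key names are assumptions.
resources:
  requests:
    memory: 512Mi
  limits:
    memory: 1Gi   # raised; observed usage peaked around 980Mi
```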
Wed, Apr 24
Tue, Apr 23
Fri, Apr 19
Marking this resolved as you just confirmed a big file upload going through correctly. Thanks for your help in debugging this!
Yes, alert was moved to https://gerrit.wikimedia.org/r/plugins/gitiles/operations/alerts/+/refs/heads/master/team-sre/mediawiki.yaml#159 with the correct dashboard.
I suppose that can be hotswapped? Let us know if it can't, we'll drain and cordon the host for the disk swap.
Thu, Apr 18
I abandoned the CR to move more eqiad api_appservers because it would leave only 15 servers, 5 of them canaries. We still have a bit more margin on the appserver side in eqiad.
In T358308#9724785, @Bawolff wrote:[...]
So it seems like two separate issues. I guess sometimes the job runner pod gets terminated in the middle of a job. That would be fine if something like https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1008403 got merged.
request_terminate_timeout for mw-jobrunners should now be set to 86400, as it was on bare metal.
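In plain php-fpm terms (outside Kubernetes, where this is templated by the chart), the setting referred to is the pool directive below; the pool name is illustrative:

```ini
; php-fpm pool configuration (illustrative pool name).
; Kill a worker whose single request runs longer than 24 hours,
; matching the old bare-metal jobrunner behaviour.
[www]
request_terminate_timeout = 86400
```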
I think I found it
In T358308#9724446, @Ladsgroup wrote:How can I get into a pod in job runners namespace(?) via shell.php? I want to try some stuff
Wed, Apr 17
As an aside, and contributing to the time to recovery, we observed the apache container getting OOM-killed, most likely because of backpressure from the php-fpm workers busy-waiting on the DNS response.
The change was rolled back in eqiad, and eqiad was repooled around 10:45. A terminating dot was added to the DNS name in codfw to avoid a recursive request.
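The effect of the terminating dot can be modelled simply: a name ending in "." is fully qualified, so the resolver issues exactly one query instead of first trying the search-list suffixes. A simplified sketch of that behaviour (the function and the example name are illustrative, not the actual resolver):

```python
def candidate_lookups(name: str, search_list: list[str]) -> list[str]:
    """Simplified model of resolver behaviour with a search list."""
    if name.endswith("."):
        # Fully qualified: exactly one query, no search-list expansion.
        return [name]
    # Relative name: the resolver may try each search suffix first.
    return [f"{name}.{suffix}." for suffix in search_list] + [name + "."]

# A trailing dot avoids the extra lookups entirely (hypothetical name):
print(candidate_lookups("mw-web.svc.codfw.wmnet.", ["eqiad.wmnet"]))
# → ['mw-web.svc.codfw.wmnet.']
```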
Tue, Apr 16
We can rule out a misconfiguration of the async trait for mw-jobrunner.
From a pod in production via shell.php
> use Wikimedia\MWConfig\ClusterConfig;
> ClusterConfig::getInstance()->isK8s()
= true
The following images fail docker-reporter checks because they haven't been rebuilt on top of the new buster base image:
base images:
docker-registry.wikimedia.org/docker-gc:1.0.0-20230402 [FAIL]
docker-registry.wikimedia.org/golang:1.14-1-20240407 [FAIL]
docker-registry.wikimedia.org/httpd-fcgi:2.4.38-10-u5-20240407 [FAIL]
docker-registry.wikimedia.org/kubeflow-kfserving-agent:0.6.0-1-20211017 [FAIL]
docker-registry.wikimedia.org/kubeflow-kfserving-controller:0.6.0-1-20211017 [FAIL]
docker-registry.wikimedia.org/kubeflow-kfserving-storage-initializer:0.6.0-5-20211010 [FAIL]
docker-registry.wikimedia.org/loki:1.5.0-2-20230604 [FAIL]
docker-registry.wikimedia.org/mediawiki-httpd:0.1.8-s2-20240407 [FAIL]
docker-registry.wikimedia.org/php7.2-cli:0.2.0-s3-20221204 [FAIL]
docker-registry.wikimedia.org/php7.2-fpm:0.4.0-20221204 [FAIL]
docker-registry.wikimedia.org/php7.2-fpm-multiversion-base:1.0.7-20221204 [FAIL]
docker-registry.wikimedia.org/php7.4-cli-icu67:7.4.33-1-s2-20231106-20231106 [FAIL]
docker-registry.wikimedia.org/php7.4-fpm-icu67:7.4.33-3-20231106-20231106 [FAIL]
docker-registry.wikimedia.org/wikimedia-buster:20210523 [FAIL]
Mon, Apr 15
Fri, Apr 12
In T329857#9708545, @dancy wrote:@Clement_Goubert I noticed the /srv/mediawiki.old.20230424.T329857 directory on deploy1002.eqiad.wmnet today. It's safe to delete.
Thu, Apr 11
Aaaand I just realized they all use http and not https, so now I can change them all.
Wed, Apr 10
In T213689#9703551, @Mvolz wrote:Thanks for linking the actual current Zotero probe - I see it checks the export endpoint? Where can I see the history of that one? This was very outdated: https://wikitech.wikimedia.org/wiki/Zotero/Deploying_zotero#Monitoring
Summing up the discussion on the patch set: this is not what is wanted. Turning monitoring on in the service would enable Prometheus metrics scraping, and Zotero doesn't expose any metrics. What may be wanted instead is to add a probe of type swagger to the service definition in service.yaml, but I am unsure whether x-amples are needed for this to work correctly.
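If the swagger-probe route is taken, the addition would look roughly like the fragment below; the exact schema of the probes section is an assumption here and should be checked against the service catalog definition before use:

```yaml
# Hypothetical service.yaml fragment: a swagger probe instead of
# Prometheus metrics scraping. Field names are assumptions.
probes:
  - type: swagger
```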
I think it's because monitoring is disabled in the service's values.yaml
chartmuseum and docker-registry done
Tue, Apr 9
Mon, Apr 8
For this use case the only keybinding you would need to know is how to exit once your run is done, which you would do the same way you exit a shell, with exit or Ctrl+D.
Thu, Mar 28
Mar 27 2024
--retry_on_timeout merged and deployed, hopefully this makes deployments easier and closer to the tests we actually want to run.
Things to keep an eye on:
- Upstream error rate is higher on mw-api-int than bare-metal
- Connection establishment time is way higher on mw-api-int
- Upstream latencies are consistently higher on mw-api-int
It wouldn't fix it for anything but conda-analytics but you could add that environment variable to /opt/conda-analytics/etc/profile.d/conda.sh?
Some context given by @RLazarus from the CR:
At the time we added this test, the Barack Obama page did consistently load within the default timeout, and we wanted a test to make sure that remained true. Being "notoriously slow" is exactly the reason we picked it.
Have we decided it's okay for that page to take longer now? If so, we might as well just delete this test rather than bumping the timeout; there's no other reason to keep it around. If not, we should keep the test and fix it so that it passes.
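The kind of check being debated boils down to "does this fetch finish within the default timeout". A minimal, self-contained sketch of that logic (the function name and timeout values are illustrative, not the actual test harness):

```python
import time

def loads_within(fetch, timeout_s: float) -> bool:
    """Return True iff fetch() succeeds and completes within timeout_s."""
    start = time.monotonic()
    try:
        fetch()                      # e.g. an HTTP GET of the page under test
    except Exception:
        return False                 # errors count as a failed check
    return (time.monotonic() - start) <= timeout_s
```

The concern quoted above is exactly this: bumping timeout_s until the check passes makes it vacuous, at which point deleting the test is more honest than keeping it.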
Mar 26 2024
Mar 25 2024
Mar 22 2024
Given we have increased mw-web and mw-api-ext by 53 and 10 replicas respectively to cope with handling all the appserver traffic during the codfw depool part of the switchover, the first 5% increase will, in my opinion, not need an associated replica increase.
Abandoned because the internals of changeprop make it inadvisable to add another layer. I'll create another task for its migration to mw-api-int.
Waiting on codfw repool as part of T357547: ☂️ Northward Datacentre Switchover (March 2024) before moving forward with this increase.
Checking on deploy2002 (which we moved away from with this switchover), the catalog.sqlite file stays in place after a switchover, and is now owned by the helm user there as you mentioned in T287130#7651203
That makes sense. I don't necessarily have a problem with it not using the service mesh (except for the lack of telemetry), except the fact that it means migrating it in one go to use mw-api-int as a backend.
Mar 21 2024
As the action taken in production fixed the immediate problem, lowering priority.
cgoubert@deploy1002:~$ sudo chown imagecatalog:imagecatalog /srv/deployment/imagecatalog/catalog.sqlite
cgoubert@deploy1002:~$ sudo systemctl restart imagecatalog_record.service
Mar 19 2024
Some tweaking of replica counts was needed on mw-on-k8s, which was expected as this is the first switchover where more of the external traffic goes to it than to the bare-metal clusters.
Mar 7 2024
@dancy Thanks a bunch! \o/