The problem we're trying to solve here is not an individual cache key read, or even multiple cache key reads, but more generally smoothing any read swarm we have. For now this is an experiment, and based on our past experience I don't think that caching keys with a TTL lower than the DB max replication lag could really cause consistency issues. I will go further: if this could cause consistency issues, then we're using memcached as a database with some guarantees of consistency and durability, neither of which is true, and we need to go back and question that.
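To make the intent concrete, here is a minimal sketch of that read-through pattern, assuming pymemcache and a hypothetical db_fetch() helper; the 10-second TTL is purely illustrative, the only constraint being that it stays below the DB max replication lag:

```python
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))

def read_smoothed(key, db_fetch, ttl=10):
    """Read-through with a short TTL: the swarm hits memcached, and the
    database sees roughly one read per key per TTL window."""
    value = cache.get(key)
    if value is None:
        value = db_fetch(key)              # may read from a lagged replica
        cache.set(key, value, expire=ttl)  # ttl kept below the max replication lag
    return value
```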
I have one fundamental question: I see you're using AWS for building the project. Do you plan to just run a prototype on AWS, or is it intended to be used in production as well?
Resolving this, as we no longer have services with weight 0, and "pool" now correctly refuses to pool a service if its weight is zero.
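For reference, the guard amounts to something like the following (a minimal sketch, not the actual conftool code; field names are illustrative):

```python
def pool(service):
    # Refuse to pool a service whose weight is zero.
    if service.get("weight", 0) == 0:
        raise ValueError("refusing to pool %s: weight is 0" % service.get("name", "<unknown>"))
    service["pooled"] = "yes"
    return service
```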
Wed, Jun 3
@Krinkle I think it's perfectly ok not to use changeprop - I just wanted the reasoning for why to be in the RFC, so that that analysis is explicit and documented. I have no concerns regarding the RFC as it is.
My tests went fine:
- mwdebug* servers got the datacenter-appropriate pool-$dc-testserver user
- the deployment servers got the conftool user instead
We're moving to purged, which consumes purges from Kafka, so we will set up alerting based on that rather than on htcpd.
Thu, May 28
This happens because the prometheus role somehow includes conftool::scripts but not profile::conftool::client.
The deploy strategy is simply to add the new users to etcd, move most hosts to use conftool as the root user immediately, and then progressively migrate them to the new users system in the long run.
Picking this up again - we already migrated the CDN to use https - do we need to do something for CI?
Just out of curiosity, what's the problem with running scripts from the deployment host, other than "we prefer if they're run on mwmaint for consistency"?
Wed, May 27
Setting priority to "high" as the failed disk was also used in a JBOD configuration for cassandra, which is now failing to start.
Mon, May 25
The package has been uploaded.
As of today, all appservers use envoy too.
I've been monitoring the status of new images in the following way:
Fri, May 22
Thu, May 21
I think I will try to implement the following RBAC schema:
Wed, May 20
Status update: we've deployed envoy on all mediawiki servers with the exception of:
- jobrunners (where we still have to reproduce what nginx was doing)
- all servers in the appserver cluster in eqiad with a sequence number above mw1275.
Looking at kafka, it seems there is a bizarre pattern in how the data is produced to the "netflow" topic:
So, while I find the idea of using poolcounter to limit editing concurrency (which is not rate-limiting; that's a different thing - see the sketch below) a good proposal, and in general something desirable to have (including the possibility of tuning it down to zero if we're in a crisis, for instance), I think the fundamental problem reported here is that WDQS can't ingest the updates fast enough.
- "The above suggests that the current rate limit is too high," this is not correct, the problem is that there is no rate limit for bots at all. The group explicitly doesn't have a rate limit. Adding such ratelimit was tried and caused lots of issues (even with a pretty high number).
Tue, May 19
I think we're at the point where it would be best if we could change the logic of our testing, and use docker directly, so that we can split tests into different images.
Mon, May 18
Status update: purged is now consuming purges from restbase directly via kafka and not via multicast anymore. This should unblock the complete migration of changeprop to kubernetes, amongst other things.
Thu, May 14
Wed, May 13
I took a brief peek at what flows from php-fpm to systemd over dbus:
I just realized the problem is the "buTster" typo I made in the commit. So I'm changing the priority accordingly, and I'll rewrite the bug description since the problem is different :)
Setting priority as "high" as this is blocking a project.
FWIW, I seem to remember systemctl status php7.2-fpm stalling on a busy server, but I might remember incorrectly.
Tue, May 12
This change was released to production to all wikis yesterday.
Mon, May 11
We ran this test, and it passed with flying colors:
- A transient peak of memcached errors, lasting less than 1 minute
- The gutter pool picks up the slack pretty fast
- No noticeable effect on latency.
- The cache hit ratio on the gutter pool was good (88% after less than one hour in the pool, though probably capped around that value by the 10-minute TTL)
- As soon as the server became available again, the memcached traffic moved back quickly but not instantly, over the span of ~ 2 minutes. This also reduces the risk of thundering herds from the deletes that get replayed.
Fri, May 8
I don't think we need the request_id to be preserved - purged is definitely not the place to do analysis of such data.
May 6 2020
@elukey shall we schedule this test for 6:00Z on Monday, May 11th?
While it's clear that 400 alerts flooding production are not great, this check is important for every single machine. So we can aggregate the output, but we can't suppress it. We need to know *very clearly* if even one single machine is running an outdated version of mediawiki.
So I second the aggregation, provided it's possible to show clearly in the icinga alert which machine (or machines, if their number is below, say, 90% of all mw servers) is failing the check.
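To illustrate what I mean, a rough sketch of the aggregated output (not a real icinga check; the 90% threshold is the one suggested above and everything else is a placeholder):

```python
def summarize(results, threshold=0.9):
    """results: dict mapping host name -> True if it runs the expected MediaWiki version."""
    failing = sorted(host for host, ok in results.items() if not ok)
    if not failing:
        return "OK: all %d mw servers run the expected version" % len(results)
    if len(failing) < threshold * len(results):
        return "CRITICAL: outdated MediaWiki on: " + ", ".join(failing)
    return "CRITICAL: outdated MediaWiki on %d/%d mw servers" % (len(failing), len(results))
```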
Apr 30 2020
After a discussion on the patch, it became clear to me that some information can't be removed from the message, which makes resource_change the perfect fit for our use-case.
Looking at our existing event schemas, resource_change has all the information we need, but also much more. We would like to get a much smaller object to transmit, and specifically we only want to define:
In another case, we had ~ 100 errors corresponding to a spike in latency from the backend:
Apr 29 2020
At a later time, we could think of changing the logic to make purges avoid race conditions, removing the need for rebound purges.
One way to implement this would be the following:
- No more changes are needed at the application layer
- All purged servers join a single consumer group per datacenter. This will ensure each purge message is consumed by only one purged instance.
- This instance will take care of sending the purges to all the cache backends in the DC first, and to all the frontends afterwards (sketched below).
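A rough sketch of that scheme, assuming kafka-python and the requests library (topic name, group id and host lists are placeholders, and real purge messages would of course carry more than a bare path):

```python
from kafka import KafkaConsumer
import requests

BACKENDS = ["cp-backend-1:3128", "cp-backend-2:3128"]    # placeholder hosts
FRONTENDS = ["cp-frontend-1:80", "cp-frontend-2:80"]     # placeholder hosts

consumer = KafkaConsumer(
    "resource-purge",                      # placeholder topic name
    group_id="purged-eqiad",               # one consumer group per datacenter
    bootstrap_servers=["kafka1001:9092"],  # placeholder broker
)

for message in consumer:
    path = message.value.decode("utf-8")
    # Backends first, then frontends, so a frontend miss cannot repopulate
    # from a backend that still holds the stale object.
    for host in BACKENDS + FRONTENDS:
        requests.request("PURGE", "http://%s%s" % (host, path), timeout=5)
```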
Since purged is now in production, and we have some ongoing work that will reduce the amount of purges we send (T250261), I think it's time to revisit the idea of moving purges to Kafka. This would also help with the transition of change-prop to kubernetes.
I think we should run 3 different tests, and I would run them on 1 host first (a rough sketch of the packet-drop rules follows the list):
- Stop memcached completely
- Drop all packets directed to port 11211
- Drop a percentage of incoming and outgoing packets
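For tests 2 and 3, a rough sketch of the packet-drop rules with iptables, wrapped in Python only for orchestration (the 10% probability is illustrative, and each test would be applied and reverted separately):

```python
import subprocess

def iptables(*rule):
    # Assumes root; rules added with -A would later be removed with -D.
    subprocess.run(["iptables", *rule], check=True)

# Test 2: drop all packets directed to memcached's port.
iptables("-A", "INPUT", "-p", "tcp", "--dport", "11211", "-j", "DROP")

# Test 3 (run separately): drop ~10% of incoming and outgoing memcached packets.
iptables("-A", "INPUT", "-p", "tcp", "--dport", "11211",
         "-m", "statistic", "--mode", "random", "--probability", "0.10", "-j", "DROP")
iptables("-A", "OUTPUT", "-p", "tcp", "--sport", "11211",
         "-m", "statistic", "--mode", "random", "--probability", "0.10", "-j", "DROP")
```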
A general observation first: it's not clear which application will be responsible for storing subscription data. I would assume, if we expect multiple possible sources of subscriptions, that those sources would keep track of their own subscriptions. But I can see arguments in the other direction - for example, maintaining those in a centralized place would make it easier for people to manage them. Anyway, this should be clarified in the RFC.
I have a few questions and observations about this RFC, but let's start from the basics:
Apr 28 2020
Some more data:
Just to be clearer: we achieved a much larger improvement in the average latency of requests by switching to persistent connections to our session storage:
First, the results of the real traffic test. These are averages over 10 minutes, starting after 20 minutes of having both servers pooled. This is an attempt at smoothing out the effects of very slow queries at the higher percentiles, which can be traffic-dependent.
Assuming we'll be ok with restarting php-fpm at every release, I reduced the amount of interned-strings memory and opcache allocated on mw1407 from the values in the puppet patch. I am now using 300 MB of interned strings cache and 3.3 GB of opcache space. These figures can probably be reduced further.
Apr 27 2020
After running a few benchmarks on mw1407 (where LCStoreStaticArray is used) vs mw1409 (which uses cdb files), it seemed the change made little to no difference for the following urls:
Apr 20 2020
Apr 17 2020
Apr 15 2020
I would frankly prefer to pass a flag to getCdnUrls, and return those dependent urls only if the flag has its default value. I say this because it won't make Title significantly heavier, and at the same time it will *quickly* fix a problem we have in production.
Apr 14 2020
This is now resolved; I've seen no further errors since my latest change was merged.
I've added some further retry logic for requests to parsoid; this *might* help.
Changing priority as this seems to be highly user visible.
Ehm, phab UI fail.