User Details
- User Since
- Aug 14 2018, 10:50 AM (381 w, 3 d)
- Availability
- Available
- IRC Nick
- effie
- LDAP User
- Effie Mouzeli
- MediaWiki User
- EMouzeli (WMF) [ Global Accounts ]
Thu, Dec 4
Mon, Dec 1
Fri, Nov 28
Hello! Since 1212204 was backported, it has been producing thousands of error messages https://logstash.wikimedia.org/goto/103ab3d23a65b901740f79fc62e71e9a
Thu, Nov 27
Wed, Nov 26
Bluntly closing
Tue, Nov 25
Key has been removed from puppet, please reopen if something is not right
@KartikMistry I will have a look, sorry for that
Fri, Nov 21
Thu, Nov 20
Wed, Nov 19
Mon, Nov 17
Thank you for the discussion everyone! Reading through, I would suggest proceeding with Option D for the time being. This approach not only unblocks the work without requiring any significant changes to Kafka, but also allows us to observe the workflow in practice and better understand its requirements.
That said, we can later define a set of performance expectations (eg for latency), which will then help us to assess whether any of the other options would provide sufficient benefit to justify any additional efforts.
Thu, Nov 13
Tue, Nov 11
Looking at the memory usage
Yesterday we decided to go for option B, and assess if an expansion will be needed sometime in the next quarters
Nov 4 2025
Oct 31 2025
Oct 24 2025
Oct 23 2025
Oct 21 2025
confirmed oob
confirmed oob
Oct 20 2025
Oct 16 2025
This is sorted
pinged user for out of band key confirmation
Pinged user for out of band key verification
Added to nda and wmde ldap groups.
Sep 29 2025
Sep 25 2025
My concerns with the cronjob approach are the following:
- Potentially the time it takes to spawn a pod to run the cronjob is longer than the time it takes to run a check
- upstream latency appears to have a p50 of ~60ms and a p99 of ~500ms
- 5 minutes is a very long time to assume that hcaptcha is up, and by the same token, that it is down
- If the memcached node holding this key fails, the key will be lost
- The cluster will failover pretty quickly to a spare memcached node, but this node will be cold
- Until the next cronjob, we will be operating using the default (which one?)
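The failure mode in the last two points can be sketched as follows (the key name, default value, and stand-in dict are all hypothetical; a plain dict stands in for the memcached cluster):

```python
CHECK_INTERVAL = 300   # hypothetical 5-minute cronjob cadence
DEFAULT_STATUS = "up"  # hypothetical default; which one we actually use is the open question above

cache = {}             # stand-in for the memcached cluster

def write_status(status: str) -> None:
    """What the cronjob would do every CHECK_INTERVAL seconds."""
    cache["hcaptcha_status"] = status

def read_status() -> str:
    """What request handlers would do; falls back to the default on a cache miss."""
    status = cache.get("hcaptcha_status")
    if status is None:
        # e.g. the memcached node holding the key failed and we failed over to a cold spare
        return DEFAULT_STATUS
    return status

write_status("down")
print(read_status())   # "down" while the key is present
cache.clear()          # simulate failover to a cold memcached node
print(read_status())   # falls back to DEFAULT_STATUS until the next cronjob run
```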
Sep 23 2025
Please keep in mind that tomorrow we will be performing the T399891 Southward Datacenter Switchover @ 15:00 UTC. As Phase 2 has been scheduled during the UTC morning backport window, it should not affect this work.
Sep 16 2025
For licensing reasons, as discussed in 1185314 (and to my knowledge as well), we are unable to host any proprietary code in our repos. I am afraid the same principle applies to our puppet repo as well.
Sep 10 2025
Sep 5 2025
Sep 4 2025
In wikikube we run a mw-mcrouter daemonset to interact with the memcached cluster. The dse cluster does not run mcrouter as a daemonset (and it wouldn't make sense to do so). Using the default MCROUTER_SERVER environment variable (which was incorrect) led to multiple memcached errors and may (unsure) have contributed to delays in producing dumps.
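The lesson here can be sketched as "fail loudly instead of silently falling back to an endpoint that may not exist on this cluster". A minimal sketch (the endpoint value is a made-up placeholder, not the real service address):

```python
import os

def memcached_endpoint() -> str:
    """Resolve the memcached endpoint from the environment.

    Refuses to guess when MCROUTER_SERVER is unset, rather than silently
    using a default that may be wrong on clusters (like dse) that run no
    mcrouter daemonset.
    """
    server = os.environ.get("MCROUTER_SERVER")
    if not server:
        raise RuntimeError("MCROUTER_SERVER is unset; refusing to guess")
    return server

os.environ["MCROUTER_SERVER"] = "memcached.example.internal:11211"  # hypothetical value
print(memcached_endpoint())
```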
Sep 3 2025
Sep 2 2025
After discussing with @Dreamy_Jazz @kostajh and @Ladsgroup, we have doubts that the above observations are caused by the change, so I am lowering the priority while we investigate.
Hey folks, after the deployment at ~13:15 UTC, I observed the following
- Increase in memcached traffic
  - ~40% in write traffic
Sorry for the mixup, I will reopen T402181 and remove my comment!
<comment removed as it belongs to T402181>
Aug 6 2025
The bandaid works for now. The timeout command first sends a SIGTERM (which is being ignored/blocked, as we saw earlier), and 5s later it sends a SIGKILL which, well, can't be ignored. This is by no means addressing the underlying issue; however, it improves the availability of the service.
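The bandaid above can be reproduced in miniature (the timings are illustrative, not the production values; a Python child that ignores SIGTERM stands in for the hung convert process):

```python
import signal
import subprocess
import time

# Child that ignores SIGTERM, mimicking the hung convert process.
child = subprocess.Popen(
    ["python3", "-c",
     "import signal, time; signal.signal(signal.SIGTERM, signal.SIG_IGN); time.sleep(60)"])

time.sleep(0.5)                    # give the child time to install its handler
child.send_signal(signal.SIGTERM)  # step 1: polite request, ignored here
try:
    child.wait(timeout=1)          # illustrative grace period (5s in the task above)
except subprocess.TimeoutExpired:
    child.kill()                   # step 2: SIGKILL cannot be caught or ignored
child.wait()
print(child.returncode)            # -9, i.e. killed by SIGKILL
```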
Aug 4 2025
Digging deeper with @Clement_Goubert and @Scott_French, we found that a vast number of mw-script jobs were created (not on purpose) around the 30th and 31st of July. This led to the creation of thousands of objects, jobs, and pods that had not yet been cleaned up, occupying quite a lot of etcd space and leading to this production error.
moved to k8s
Aug 1 2025
sorry folks, host's number is up for retirement, my bad. tx @Clement_Goubert
Jul 31 2025
Given that the value of the X-Wikimedia-Debug header determines whether, and to which mw-debug or mw-experimental service, a request will be routed, I would guess that this POST request was missing this header.
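A toy version of that routing decision (the service names come from the comment above; the attribute parsing and the "k8s-experimental" marker are guesses for illustration, not the actual edge logic):

```python
def route(headers: dict) -> str:
    """Pick a backend service based on the X-Wikimedia-Debug header.

    Illustrative only: the real routing happens at the traffic layer and
    understands more attributes than sketched here.
    """
    debug = headers.get("X-Wikimedia-Debug")
    if debug is None:
        return "mw-web"              # no header: normal production path
    if "k8s-experimental" in debug:  # hypothetical marker for mw-experimental
        return "mw-experimental"
    return "mw-debug"

print(route({}))                                             # mw-web
print(route({"X-Wikimedia-Debug": "backend=k8s-mwdebug"}))   # mw-debug
```

A POST request missing the header would take the first branch, which matches the guess in the comment above.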
Jul 30 2025
Take the following with a grain of salt. Attaching gdb to the hung convert process, it appears that one thread is stuck(?) in unlink(), presumably while cleaning up. Additionally, during that operation it caught a signal, which we could again assume is the SIGTERM from timeout.
