Oh hi. Nice to see you here.
User Details
- User Since
- Dec 11 2018, 9:39 PM (390 w, 6 d)
- Availability
- Available
- IRC Nick
- sukhe
- LDAP User
- Unknown
- MediaWiki User
- SSingh (WMF) [ Global Accounts ]
Yesterday
Update: we are still waiting to get clarity on what is happening here. We are pursuing this and I will update this when things change.
Fri, Jun 5
dns[2004-2006].wikimedia.org - Need special care to not cause traffic imbalance @ssingh
Wed, Jun 3
CAA works at the subdomain level so we can set the records for payments.wikimedia.org to allow them their issuance while removing the unnecessary records for the rest of the stack.
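A minimal sketch of why this works, assuming the simplified tree-climbing lookup from RFC 8659 (CNAME handling is left out): a CA uses the first CAA record set it finds while walking from the requested name up towards the root. The record values and CA names below are hypothetical and are not the real Wikimedia records.

```python
from typing import Optional

# Hypothetical CAA data for illustration only; not the real zone contents.
CAA_RECORDS = {
    "wikimedia.org": ['0 issue "ca-one.example"'],
    "payments.wikimedia.org": ['0 issue "ca-two.example"'],
}


def relevant_caa(name: str, records: dict[str, list[str]]) -> Optional[list[str]]:
    """Return the CAA record set that applies to `name`, or None.

    Walks from the full name towards the root and stops at the first
    name that has CAA records. Parent records are not merged in.
    """
    labels = name.rstrip(".").lower().split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in records:
            return records[candidate]
    return None


if __name__ == "__main__":
    for host in ("payments.wikimedia.org", "upload.wikimedia.org"):
        print(host, "->", relevant_caa(host, CAA_RECORDS))
```

Because the lookup stops at the first match, a record set on payments.wikimedia.org replaces the parent's set for that name rather than adding to it, so the subdomain set has to list every CA that should be allowed there.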
Fri, May 29
Update: We are looking into how to update the list of Bing IPs and will update the thread.
I meant we set profile::netconsole::client::ensure: absent in hieradata/role/common/cache/upload.yaml so it should not be enabled there.
Yeah it's a good question and predates me so I don't have good answers. But, it doesn't seem that it is enabled for upload? The upload role does say include profile::netconsole::client but we also set profile::netconsole::client::ensure: absent and it seems like nothing is being actually set on upload unless I am mistaken.
Thu, May 28
Per https://www.mediawiki.org/wiki/Manual:Pywikibot/User-agent, users can set a custom UA easily by setting user_agent_format. Shouldn't we be encouraging that rather than blanket allow-listing the default UA? That then falls in line with our other UA policy for library defaults: each bot should set a custom UA.
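As a rough illustration of the custom UA suggested above, a user-config.py fragment might look like the following. The bot name, URL, and contact address are made up for the example; only the user_agent_format setting and the {pwb} placeholder are taken from Pywikibot, and the manual page linked above is the place to check which placeholders a given version supports.

```python
# user-config.py fragment (sketch). The bot name, URL, and e-mail address
# are hypothetical placeholders; replace them with real contact details.
user_agent_format = (
    "ExampleArchiveBot/1.0 "
    "(https://example.org/archivebot; archivebot@example.org) "
    "{pwb}"  # Pywikibot fills this in with its own product/version token
)
```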
The host is acting up again and is depooled. It seems like this time it is memory errors but racadm points to nothing. I am going to leave it depooled while we look.
Wed, May 27
Depool for cp2044 looks good; please ping Traffic if you want us to take care of it.
Tue, May 26
Once we reboot for T426585, we can consider this resolved as well.
Fri, May 22
Thanks for the update and the explanation, @cmooney!
This hasn't happened in a while (last incident was 2023) and we have run sre.cdn.roll-reboot many times since then, so boldly resolving.
@cmooney: We plan to move to Liberica in Q1 or Q2 of APP2026. Do you think we should still consider working on this?
This hasn't happened again and it is hard to investigate now what caused these two blips. Boldly resolving this as part of regular task cleanup. If it happens again, we can look into it.
Boldly resolving for the reasons above: the issue was transient because we responded to it, there is no follow up and there is nothing on our side to indicate that this persists.
We never got to this in Q3 or even Q4. Should we plan to do this in Q1 2026?
LVS in core sites will be superseded by Liberica so we are unlikely to spend any time on this. I am taking the liberty to close this as part of regular task cleanup, please re-open if desired.
No one else has observed this issue and it has been almost two years since this was reported, with no follow-up. As the person who reported this, resolving.
It seems like the issue was transient and therefore I am taking the liberty to close this as part of regular task cleanup. Please re-open if desired.
I am curious, which error pages are we talking about?
@Ergur: Hi. Can you confirm whether this is still a problem for you or has been resolved?
All hosts were reimaged and have been back in production for a while. Resolving.
LVS in core sites will be superseded by Liberica so we are unlikely to spend any time on this.
There has been no follow-up to this in a while (and this is on k8s anyway now?) and this task has been open since 2024. I am taking the liberty to close this as part of regular task cleanup, please re-open if desired.
The only blocker in this task was the cp hosts for OpenSSL. We have already upgraded them to trixie in T401832, so this task can be resolved. A note has been made for the same in the description.
This has been completed in various iterations of the updates to geo-maps.
We have done quite a few reimages of durum since then (and reboots) and this issue was not observed. I am taking the liberty to close this as part of regular task cleanup, please re-open if desired.
Thu, May 21
Should now be resolved; @bd808 already cherry-picked but this has been rolled out to operations/puppet as well.
Wed, May 20
Can someone confirm how to reproduce this? If I try to go to Bing and do a reverse search with Commons, it seems to work for me.
Without knowing the details of this, I wanted to point out that the drmrs refresh is upcoming in Q1/Q2 of FY2026 and drmrs, like all edge sites, is on Liberica. If there is any redesign we want to do around this in drmrs, perhaps this is the time?
Yeah I should have been more careful in resolving this, my bad. @Jhancock.wm: While the DIMM was replaced, we still need to look at the RAID thing.
Error: Failed to apply catalog: Parameter source failed on File[/etc/haproxy/ip-reputation.d/top_10000_ips_requestctl_webrequest_text_7days.map]: Cannot use relative URLs '' (file: /srv/puppet_code/environments/production/modules/profile/manifests/cache/haproxy.pp, line: 510)
Tue, May 19
Hi: Following up to see if this issue still persists; I think perhaps not (see my comment above) since I think it was transient but please let us know.
Hi @BrokenImages1234: Following up to check if this issue still persists for you?
Fri, May 15
@sbassett: I am having a bit of trouble parsing what exactly is failing, but if it is just the report-to header, note that we set that in Varnish (VCL) for Network Error Logging. That looks like, for example,
Tue, May 12
Things look good and lvs2012 is happily serving traffic. Marking as resolved, thanks @Jhancock.wm!
Confirmed with @Papaul that this is actually meant for May 20, same time.
@VRiley-WMF: As John correctly pointed out, this is booting with UEFI enabled now. Is that expected and the default for all hosts now? If that is the case, we can update the partman recipes but this is an old host, so I am surprised that UEFI is the default. Any insights on that?
Thanks for the report @Ponor. Does this issue still persist? If yes, can you please undertake the steps in https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue and reach out to us at noc@wikimedia.org with the output? You can refer to this task.
Mon, May 11
Forced UEFI HTTP Boot for next reboot
Resetting chassis power status for lvs1017 to ForceRestart
Host rebooted via Redfish
Hi @BrokenImages1234, thanks for your report. The error report you indicated, 76af2b0, does indeed point to why you are seeing 429s. Upgrading to a more recent browser version will help fix the problem. Also, you can try enabling third-party cookies and that should help improve the rate-limiting somewhat.
I may be wrong but this was due to a temporary issue we had with upload.wikimedia.org in ulsfo, which matches the time of this report, and also matches the traffic from New Zealand, since that goes to ulsfo as well. I believe that this should now be resolved.
@VRiley-WMF: We may need to check this host; I can't seem to get it to come back up after a reboot (checked twice). Is there something else missing here? Perhaps a provisioning cookbook run or something? (Just guessing!)
@KOfori is out, deferring to @Kappakayala as the approver in the interim.
@Jhancock.wm: Thanks for the quick turnaround! Host is back and serving traffic, will keep a close watch for a bit before resolving this.
Record: 410
Date/Time: 05/10/2026 04:22:34
Source: system
Severity: Critical
Description: A critical diagnostic event occurred in the memory device at B2. Contact your service provider for assistance in replacing the device. (Extended ID: 0x4E42).
-------------------------------------------------------------------------------
Record: 411
Date/Time: 05/10/2026 05:21:45
Source: system
Severity: Critical
Description: A critical diagnostic event occurred in the memory device at B2. Contact your service provider for assistance in replacing the device. (Extended ID: 0x4E42).
-------------------------------------------------------------------------------
Record: 412
Date/Time: 05/10/2026 19:52:56
Source: system
Severity: Critical
Description: A critical diagnostic event occurred in the memory device at B2. Contact your service provider for assistance in replacing the device. (Extended ID: 0x4E42).
-------------------------------------------------------------------------------
Record: 413
Date/Time: 05/10/2026 19:52:57
Source: system
Severity: Critical
Description: A critical diagnostic event occurred in the memory device at B2. Contact your service provider for assistance in replacing the device. (Extended ID: 0x4E42).
-------------------------------------------------------------------------------
May 7 2026
sukhe@cumin1003:~$ sudo cumin "A:cp and not P{cp2041* or cp2042*}" "curl -s https://upload.wikimedia.org/wikipedia/test/4/45/T425216_ESAMS_overwrite_test.png | md5sum -"
111 hosts will be targeted:
cp[2043-2058].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5021,5023-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001-7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 111 hosts? Enter the number of affected hosts to confirm or "q" to quit: 111
===== NODE GROUP =====
(2) cp[3071,3078].esams.wmnet
----- OUTPUT for command #1: 'curl -s https://...t.png | md5sum -' -----
803be1a3faf69055c88a8e7061170336 -
===== NODE GROUP =====
(3) cp[2043,2053,2058].codfw.wmnet
----- OUTPUT for command #1: 'curl -s https://...t.png | md5sum -' -----
60ac3bdc2b5a68e8cda0f99118117eab -
===== NODE GROUP =====
(38) cp[2048,2050,2055].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1105,1108-1109,1112,1114].eqiad.wmnet,cp[5019,5021,5026,5031].eqsin.wmnet,cp[3066-3069,3072,3076-3077,3080-3081].esams.wmnet,cp7015.magru.wmnet
----- OUTPUT for command #1: 'curl -s https://...t.png | md5sum -' -----
203dbe0702a1c7ee976c18481893228e -
===== NODE GROUP =====
(68) cp[2044-2047,2049,2051-2052,2054,2056-2057].codfw.wmnet,cp[1100-1104,1106-1107,1110-1111,1113,1115].eqiad.wmnet,cp[5017-5018,5020,5023-5025,5027-5030,5032].eqsin.wmnet,cp[3070,3073-3075,3079].esams.wmnet,cp[7001-7014,7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
----- OUTPUT for command #1: 'curl -s https://...t.png | md5sum -' -----
64a5c589df8d0d96db7475340b5c4c91 -
================
PASS |████████████████████████████████████████████████████████████████████████████████████| 100% (111/111) [00:08<00:00, 12.77hosts/s]
FAIL | | 0% (0/111) [00:08<?, ?hosts/s]
100.0% (111/111) success ratio (>= 100.0% threshold) for command #1: 'curl -s https://...t.png | md5sum -'.
100.0% (111/111) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Yeah, no, I was wrong and I misunderstood the problem. I misread that the actual image also differs across the CDN but it clearly does not.
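For reference, the NODE GROUP sections in the cumin output above collect hosts that returned identical output. A minimal sketch of that kind of grouping, assuming the response bodies have already been fetched per host (the host names and bodies here are invented, and this is not cumin's actual implementation):

```python
import hashlib
from collections import defaultdict


def group_by_digest(bodies: dict[str, bytes]) -> dict[str, list[str]]:
    """Group host names by the MD5 hex digest of the body each one returned."""
    groups: dict[str, list[str]] = defaultdict(list)
    for host in sorted(bodies):
        digest = hashlib.md5(bodies[host]).hexdigest()
        groups[digest].append(host)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical hosts and bodies, for illustration only.
    sample = {
        "cache-a.example": b"image-v1",
        "cache-b.example": b"image-v2",
        "cache-c.example": b"image-v1",
    }
    for digest, hosts in group_by_digest(sample).items():
        print(f"({len(hosts)}) {','.join(hosts)}\n{digest}  -")
```

More than one group in the result means that at least two hosts returned different bytes for the same URL.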
May 6 2026
^ The above is not a trixie reimage but a bookworm reimage for the ulsfo work in T424686. Please disregard.
Apr 24 2026
Slight correction on my end, sorry: this host is not an active host, so you can install the NIC whenever you want before May 11 as well. We will however pick this up on our end during that week.
@VRiley-WMF: We are planning to do this the week of May 11. Does that work for you?
Hi @RobH. Any update on this from Dell's end?
It seems like @SKaram-WMF's Kerberos credentials were never created initially:
Apr 23 2026
Discussed with @Papaul a bit -- we will depool the site for all three days, just to be on the safe side and since it's ulsfo, one extra day does not really change things.

