ssingh (Sukhbir Singh)
SRE/Traffic

User Details

User Since
Dec 11 2018, 9:39 PM (390 w, 6 d)
Availability
Available
IRC Nick
sukhe
LDAP User
Unknown
MediaWiki User
SSingh (WMF)

Oh hi. Nice to see you here.

Recent Activity

Yesterday

ssingh added a comment to T425850: Bing can't search images from Commons, is Wikimedia denying their requests?.

Update: we are still waiting to get clarity on what is happening here. We are pursuing this and I will update this when things change.

Mon, Jun 8, 7:13 PM · SEO, Traffic
ssingh added a comment to T427465: Move thumbnail caching from upload cluster to text.

Contrary to what it looks like, I have no clue what I'm doing. We probably need a service in hieradata/common/service.yaml too but not 100% sure

Mon, Jun 8, 3:28 PM · Patch-For-Review, Data-Persistence, Traffic
ssingh added a comment to T428060: codfw: move public baremetal servers to per rack vlan.

@ssingh For the DNS servers, the ones peering with the core routers will have a higher priority (as-path) than the ones peering with the ToR switches. So if one can handle all the load then we're good as we won't have any redundancy issue.
We can use as-path prepending to fine tune it, but if we can avoid it it would be better.

Plan could be:

  • Move dns2004 - internet load will be shared on 2005/2006 (making 2004 roughly a hot standby, while still receiving 10.3.0.1 queries from its pod)
  • Test that 2004 works well
  • Move dns2005 - all the internet load will go to 2006
  • Shortly after, move dns2006 - all the internet load will be balanced between the 3 hosts again
Mon, Jun 8, 2:01 PM · SRE, ops-codfw, DC-Ops

Fri, Jun 5

ssingh added a comment to T428060: codfw: move public baremetal servers to per rack vlan.

dns[2004-2006].wikimedia.org - Need special care to not cause traffic imbalance @ssingh

Fri, Jun 5, 3:46 PM · SRE, ops-codfw, DC-Ops
ssingh added a comment to T414411: cp5022 is unreachable.

We discussed this and the general consensus seemed to be to just decomm the server and wait for the refresh which is happening shortly anyway. @ssingh Is that accurate, and if so, ready for me to open a task/decom it?

Fri, Jun 5, 1:46 PM · SRE, DC-Ops, ops-eqsin, Traffic

Wed, Jun 3

ssingh added a comment to T428093: Remove Digicert CAA records from most domains.

So payments is a CNAME and hence we can't add a CAA record for it.

Can we not add a CAA record to the CNAME targets (payments-eqiad/codfw)?

Wed, Jun 3, 7:03 PM · Traffic
ssingh added a comment to T428093: Remove Digicert CAA records from most domains.

CAA works at the subdomain level so we can set the records for payments.wikimedia.org to allow them their issuance while removing the unnecessary records for the rest of the stack.

Wed, Jun 3, 7:00 PM · Traffic
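
The two comments above turn on how a CA finds the CAA records that apply to a name. A minimal sketch of that lookup, assuming the tree-climbing search described in RFC 8659 and assuming CNAMEs are followed as part of the ordinary DNS query: the zone data, the `ca-a.example` / `ca-b.example` issuers, and the function names are hypothetical and only illustrate why a record on a CNAME target (such as payments-eqiad) could apply to payments.wikimedia.org.

```python
from typing import Dict, List

# Hypothetical zone data for illustration only; these are not real records.
CAA: Dict[str, List[str]] = {
    "wikimedia.org": ['0 issue "ca-a.example"'],
    "payments-eqiad.wikimedia.org": ['0 issue "ca-b.example"'],
}
CNAME: Dict[str, str] = {
    "payments.wikimedia.org": "payments-eqiad.wikimedia.org",
}


def query_caa(name: str) -> List[str]:
    """Mimic a CAA query: follow any CNAME chain, then return the CAA set."""
    seen = set()
    while name in CNAME and name not in seen:  # guard against CNAME loops
        seen.add(name)
        name = CNAME[name]
    return CAA.get(name, [])


def relevant_caa_set(name: str) -> List[str]:
    """Climb from the name toward the root; the first non-empty set wins."""
    labels = name.rstrip(".").split(".")
    while labels:
        records = query_caa(".".join(labels))
        if records:
            return records
        labels = labels[1:]  # drop the leftmost label and try the parent
    return []


if __name__ == "__main__":
    for host in ("payments.wikimedia.org", "upload.wikimedia.org"):
        print(host, relevant_caa_set(host))
```

In this sketch the CNAME owner itself carries no CAA record; the set found at the target is what gets returned, while names without their own records fall back to the parent zone's set.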
ssingh added a comment to T428052: Beta cluster haproxy does not support `warn-blocked-traffic-after` keyword.

The Beta Cluster cache nodes are Debian Bullseye running HAProxy version 2.8.18-1~bpo11+1 2025/12/26. It looks like the prod CDN edge is using Debian Trixie and HAProxy 3.2.15-1~bpo13+1.

See also: T401839: Migrate deployment-prep away from Debian Bullseye to Bookworm/Trixie

Wed, Jun 3, 4:53 PM · Traffic, SRE, Beta-Cluster-Infrastructure

Tue, Jun 2

ssingh added a comment to T117618: Add restrictive CSP to upload.wikimedia.org.

Revisiting this over the past month, it looks like we're receiving, on average, ~ 2500 report-only reports each day for the current upload.wikimedia.org CSP config:

Screenshot 2026-06-01 at 4.37.54 PM.png (587×264 px, 19 KB)

@ssingh - I think we should remove the restrictions/filters and just serve this as a report-only policy across all of upload.wikimedia.org, for all media files, for a few days. And then ultimately set it as an enforcing CSP. At this point I'm not seeing any compelling reason not to aggressively move forward with this either in the comments here or CSP report-only log data.

Tue, Jun 2, 5:46 PM · Patch-For-Review, Traffic, ContentSecurityPolicy, WMF-General-or-Unknown, Security-Team

Mon, Jun 1

ssingh updated subscribers of T427836: WE5.2.13 Dumps UA enforcement.
Mon, Jun 1, 5:44 PM · Patch-For-Review, SRE, Traffic
ssingh triaged T427836: WE5.2.13 Dumps UA enforcement as Medium priority.
Mon, Jun 1, 5:44 PM · Patch-For-Review, SRE, Traffic
ssingh created T427836: WE5.2.13 Dumps UA enforcement.
Mon, Jun 1, 5:44 PM · Patch-For-Review, SRE, Traffic

Fri, May 29

ssingh added a comment to T425850: Bing can't search images from Commons, is Wikimedia denying their requests?.

Update: We are looking into this -- on how to update the list of Bing IPs -- and will update the thread.

Fri, May 29, 8:39 PM · SEO, Traffic
ssingh added a comment to T425850: Bing can't search images from Commons, is Wikimedia denying their requests?.

The request was simple. Look in the server logs.

Fri, May 29, 7:39 PM · SEO, Traffic
ssingh added a comment to T426912: Investigate hardware RAID usage in codfw LVS hosts.

@ssingh @BBlack Okay with me switching write-back to write-through slowly through the codfw cluster or shall we leave this as-is until the refresh?

Fri, May 29, 4:13 PM · SRE, Traffic, ops-codfw, DC-Ops
ssingh added a comment to T426968: cp6015 network error.

@ssingh What made you suspect mem errors? I see from the previous boot that OOM kept getting invoked on purged but I suspect that from some non-hardware issue.

{F84961765}

Fri, May 29, 4:07 PM · DC-Ops, ops-drmrs
ssingh added a comment to T427646: netconsole being used for cache hosts?.

I meant we set profile::netconsole::client::ensure: absent in hieradata/role/common/cache/upload.yaml so it should not be enabled there.

Fri, May 29, 3:16 PM · Traffic, SRE
ssingh added a comment to T427646: netconsole being used for cache hosts?.

Yeah it's a good question and predates me so I don't have good answers. But, it doesn't seem that it is enabled for upload? The upload role does say include profile::netconsole::client but we also set profile::netconsole::client::ensure: absent and it seems like nothing is being actually set on upload unless I am mistaken.

Fri, May 29, 3:09 PM · Traffic, SRE

Thu, May 28

ssingh added a comment to T427491: x-ua-contact: recognize pywikibot user-agent strings.

Per https://www.mediawiki.org/wiki/Manual:Pywikibot/User-agent, users can set a custom UA easily by setting user_agent_format. Should we perhaps not be encouraging that vs blanket allow-listing the default UA? That then falls into line with our other UA policy for library defaults -- each bot should set a custom UA.

The default UA is customized automatically and contains contact information based on the user's login info. Why should the bot's owner have to override the user-agent string if they are already supplying their user name? All we want to know is who is running the bot.

I'm not suggesting to whitelist pywikibot, I'm suggesting to support the format in which pywikibot supplies user information in the user-agent in accordance with our policy.

Thu, May 28, 4:34 PM · Traffic
ssingh added a comment to T427491: x-ua-contact: recognize pywikibot user-agent strings.

Per https://www.mediawiki.org/wiki/Manual:Pywikibot/User-agent, users can set a custom UA easily by setting user_agent_format. Should we perhaps not be encouraging that vs blanket allow-listing the default UA? That then falls into line with our other UA policy for library defaults -- each bot should set a custom UA.

Thu, May 28, 3:14 PM · Traffic
ssingh reopened T426968: cp6015 network error as "Open".

The host is acting up again and is depooled. It seems like this time it is memory errors but racadm points to nothing. I am going to leave it depooled while we look.

Thu, May 28, 2:02 PM · DC-Ops, ops-drmrs

Wed, May 27

ssingh added a comment to T427357: codfw: rack A4 maintenance.

Depool for cp2044 looks good; please ping Traffic if you want us to take care of it.

Wed, May 27, 2:55 PM · Infrastructure-Foundations, netops, Observability-Logging, Machine-Learning-Team, Traffic, ServiceOps new, Discovery-Search
ssingh added a comment to T426109: Reboot lvs1019 for memory self-healing.

@ssingh The Dell docs mention updating the BIOS:

update BIOS to the latest revision that includes many memory Self-healing capabilities and ongoing enhancements

However, that would put our version ahead of the rest of the fleet. Is that acceptable or should I just keep it as-is?

Wed, May 27, 1:29 PM · Traffic

Tue, May 26

ssingh added a comment to T426109: Reboot lvs1019 for memory self-healing.

Once we reboot for T426585, we can consider this resolved as well.

Tue, May 26, 2:26 PM · Traffic

Fri, May 22

ssingh added a comment to T405630: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan.

Thanks for the update and the explanation, @cmooney!

Fri, May 22, 4:05 PM · Traffic, netops, Infrastructure-Foundations, SRE
ssingh closed T344674: ATS automatically restarted due to receiving SIGUSR2 on cp5024 as Resolved.

This hasn't happened in a while (last incident was 2023) and we have run sre.cdn.roll-reboot many times since then, so boldly resolving.

Fri, May 22, 4:00 PM · SRE, Traffic
ssingh added a comment to T405630: lvs1020: reimage to move primary IP from private1-d-eqiad to private1-d7-eqiad vlan.

@cmooney: We plan to move to Liberica in Q1 or Q2 of APP2026. Do you think we should still consider working on this?

Fri, May 22, 3:58 PM · Traffic, netops, Infrastructure-Foundations, SRE
ssingh closed T423667: Investigate port 80 page in text@esams for Ipv6 as Declined.

This hasn't happened again and it's hard investigating now what caused these two blips. Boldly resolving for this as part of regular task cleanup. If it happens again, we can look into it.

Fri, May 22, 3:57 PM · Traffic, SRE
ssingh moved T426109: Reboot lvs1019 for memory self-healing from Backlog to Actively Servicing on the Traffic board.
Fri, May 22, 3:55 PM · Traffic
ssingh closed T425670: images are not loading for some users (on the us west coast?) as Resolved.

Boldly resolving for the reasons above: the issue was transient because we responded to it, there is no follow up and there is nothing on our side to indicate that this persists.

Fri, May 22, 3:55 PM · Traffic
ssingh added a comment to T401025: Investigate setting init_on_alloc=0 on cache hosts.

We never got to this in Q3 or even Q4. Should we plan to do this in Q1 2026?

Fri, May 22, 3:53 PM · Traffic
ssingh closed T394789: Validate pybal config in CI as Declined.

LVS in core sites will be superseded by Liberica so we are unlikely to spend any time on this. I am taking the liberty to close this as part of regular task cleanup, please re-open if desired.

Fri, May 22, 3:52 PM · Data-Platform-SRE (2026-04-24 - 2026-05-15), Traffic
ssingh closed T372646: confd causes soft lockup when you are tailing a file with -F and the state is updated as Resolved.

No one else has observed this issue and it has been almost two years since this was reported, with no follow-up. As the person who reported this, resolving.

Fri, May 22, 3:50 PM · Traffic, SRE, conftool
ssingh closed T383013: Backend fetch failed as Resolved.

It seems like the issue was transient and therefore I am taking the liberty to close this as part of regular task cleanup. Please re-open if desired.

Fri, May 22, 3:48 PM · Traffic, SRE, Commons
ssingh added a comment to T352291: Provide better error pages for HAProxy.

I am curious, which error pages are we talking about?

Fri, May 22, 3:47 PM · Traffic
ssingh added a comment to T423991: HTTP 503 error trying to make any edits on Wikipedia.

@Ergur: Hi. Can you confirm if this is a problem for you still or has resolved?

Fri, May 22, 3:45 PM · Traffic
ssingh closed T424686: ulsfo switch work May 2026: Host reimaging, a subtask of T408892: ULSFO: New switch configuration, as Resolved.
Fri, May 22, 3:45 PM · Patch-For-Review, SRE, Infrastructure-Foundations, DC-Ops, netops, ops-ulsfo
ssingh closed T424686: ulsfo switch work May 2026: Host reimaging as Resolved.

All hosts were reimaged and have been back in production for a while. Resolving.

Fri, May 22, 3:45 PM · Infrastructure-Foundations, Traffic, DC-Ops
ssingh closed T334166: Abstract LVS restart using cookbook as Resolved.

LVS in core sites will be superseded by Liberica so we are unlikely to spend any time on this.

Fri, May 22, 3:39 PM · SRE, SRE-tools, Infrastructure-Foundations, Traffic
ssingh closed T356951: PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL as Resolved.

There has been no follow-up to this in a while (and this is on k8s anyway now?) and this task has been open since 2024. I am taking the liberty to close this as part of regular task cleanup, please re-open if desired.

Fri, May 22, 3:36 PM · Traffic, SRE
ssingh closed T342154: Upgrade Traffic hosts to bookworm as Resolved.
Fri, May 22, 3:34 PM · Patch-For-Review, Traffic
ssingh updated the task description for T342154: Upgrade Traffic hosts to bookworm.
Fri, May 22, 3:34 PM · Patch-For-Review, Traffic
ssingh added a comment to T342154: Upgrade Traffic hosts to bookworm.

The only blocker in this task was the cp hosts for OpenSSL. We have already upgraded them to trixie in T401832, so this task can be resolved. A note has been made for the same in the description.

Fri, May 22, 3:33 PM · Patch-For-Review, Traffic
ssingh closed T387774: Update South America geo-maps as Resolved.

This has been completed in various iterations of the updates to geo-maps.

Fri, May 22, 3:31 PM · Traffic
ssingh closed T419868: Startup failure for Bird on new durum hosts as Resolved.

We have done quite a few reimages of durum since then (and reboots) and this issue was not observed. I am taking the liberty to close this as part of regular task cleanup, please re-open if desired.

Fri, May 22, 3:30 PM · SRE, Traffic
ssingh changed the status of T355446: Synchronize and rotate TCP Fastopen keys for various use-cases from Open to In Progress.
Fri, May 22, 3:26 PM · Patch-For-Review, Traffic

Thu, May 21

ssingh added a comment to T414411: cp5022 is unreachable.

Without getting into pricing on this public task the options are:

  • spend more money (see T426985) to replace the CPU
    • we have no money left in expendables for this, so it would likely have to kick to July (unless mgmt approves overspend)
  • shuffle all the memory to report to the working CPU and use this CP host with only half the CPU cores
    • no clue if this is viable, this is a question for @ssingh
  • decommission the host and use as spare parts for rest of eqsin fleet
    • worst case scenario since it means a depreciated fleet quantity in eqsin.
Thu, May 21, 6:33 PM · SRE, DC-Ops, ops-eqsin, Traffic
ssingh added a comment to T414411: cp5022 is unreachable.

Apologies, this ran super late and I neglected to update the task accordingly.

The mainboard swap was successful, but it appears that one of the two CPUs has failed. Dell SG is sending over a quote for a replacement, as the system is currently remotely accessible with only one of the two CPUs installed.
I'll update the task with the new quote later today!

Thu, May 21, 2:18 PM · SRE, DC-Ops, ops-eqsin, Traffic
ssingh added a comment to T414411: cp5022 is unreachable.

Scheduled a new site visit for them to go out this Friday @ 8AM Singapore Time so my Thursday @ 4PM.

1-260037210462

Thu, May 21, 1:37 PM · SRE, DC-Ops, ops-eqsin, Traffic
ssingh closed T426822: No Puppet resources found on instance deployment-cache-upload08 on project deployment-prep as Resolved.

Should now be resolved; @bd808 already cherry-picked but this has been rolled out to operations/puppet as well.

Thu, May 21, 1:20 PM · Traffic, Beta-Cluster-Infrastructure

Wed, May 20

ssingh added a comment to T425850: Bing can't search images from Commons, is Wikimedia denying their requests?.

Can someone confirm on how to reproduce this? If I try to go to Bing and do a reverse search with Commons, it seems to work for me.

Wed, May 20, 7:45 PM · SEO, Traffic
ssingh added a comment to T362772: ASW single-point of failure for LVS VIPs at POPs.

Without knowing the details of this, I wanted to point out that the drmrs refresh is upcoming in Q1/Q2 of FY2026 and drmrs, like all edge sites, is on Liberica. If there is any redesign we want to do around this in drmrs, perhaps this is the time?

Wed, May 20, 6:38 PM · Traffic, SRE
ssingh assigned T426299: Ensure the pre-repooling checklist includes to restart liberica services whenever realserver IPs has changed to BCornwall.
Wed, May 20, 6:37 PM · Traffic, Sustainability (Incident Followup)
ssingh reassigned T425890: Degraded RAID on lvs2012 from ssingh to Jhancock.wm.
Wed, May 20, 6:24 PM · Traffic, SRE, ops-codfw, DC-Ops
ssingh added a comment to T425890: Degraded RAID on lvs2012.

Yeah I should have been more careful in resolving this, my bad. @Jhancock.wm: While the DIMM was replaced, we still need to look at the RAID thing.

Wed, May 20, 5:45 PM · Traffic, SRE, ops-codfw, DC-Ops
ssingh updated subscribers of T426822: No Puppet resources found on instance deployment-cache-upload08 on project deployment-prep.
Wed, May 20, 12:34 PM · Traffic, Beta-Cluster-Infrastructure
ssingh added a comment to T426822: No Puppet resources found on instance deployment-cache-upload08 on project deployment-prep.
Error: Failed to apply catalog: Parameter source failed on File[/etc/haproxy/ip-reputation.d/top_10000_ips_requestctl_webrequest_text_7days.map]: Cannot use relative URLs '' (file: /srv/puppet_code/environments/production/modules/profile/manifests/cache/haproxy.pp, line: 510)
Wed, May 20, 12:32 PM · Traffic, Beta-Cluster-Infrastructure

Tue, May 19

ssingh added a comment to T425670: images are not loading for some users (on the us west coast?).

Hi: Following up to see if this issue still persists; I think perhaps not (see my comment above) since I think it was transient but please let us know.

Tue, May 19, 5:14 PM · Traffic
ssingh added a comment to T425763: Error 429 for search queries and images in older browsers.

Hi @BrokenImages1234: Following up to check if this issue still persists for you?

Tue, May 19, 5:13 PM · MediaWiki-Platform-Team (Radar), Traffic

Fri, May 15

ssingh added a comment to T424058: Properly set the Reporting-Endpoints header and the report-to directive via MediaWiki's CSP implementation.

@sbassett: I am having a bit of trouble parsing what exactly is failing, but if it is just the report-to header, note that we set that in Varnish (VCL) for Network Error Logging. That looks like, for example,

Fri, May 15, 6:47 PM · MW-1.47-notes (1.47.0-wmf.2; 2026-05-12), SecTeam-Processed, Security-Team, ContentSecurityPolicy

Tue, May 12

ssingh closed T425890: Degraded RAID on lvs2012 as Resolved.

Things look good and lvs2012 is happily serving traffic. Marking as resolved, thanks @Jhancock.wm!

Tue, May 12, 7:00 PM · Traffic, SRE, ops-codfw, DC-Ops
ssingh added a comment to T416562: ulsfo: upgrade routers (2026).

Confirmed with @Papaul that this is actually meant for May 20, same time.

Tue, May 12, 6:54 PM · Infrastructure-Foundations, netops
ssingh added a comment to T416562: ulsfo: upgrade routers (2026).

@ssingh I think it will be best to depool the site. Since this will be my first time doing the draining process, I would like to be on the safe side.

Tue, May 12, 5:54 PM · Infrastructure-Foundations, netops
ssingh added a comment to T416562: ulsfo: upgrade routers (2026).

@ssingh and team: now that we are done with the switch refresh and everything is stable in ulsfo, and after we connect the missing link between cr3 and asw1-23, we would like to schedule a 3-hour downtime for the JUNOS upgrade on the core routers next week, May 14th at 9:45am CT / 10:45am EST. Please let us know if this time and date works for you.
Thanks.

Tue, May 12, 5:50 PM · Infrastructure-Foundations, netops
ssingh added a comment to T425120: Firefox: Be careful. Something doesn’t look right (Strict Transport Security).

@ssingh
Thanks. This has been an issue for some 10 days, but I haven't noticed it recently. I see that FF got a little update, so it's hard to tell what the cause was.
I thought it'd be good to report it in case someone else was having the same issues. I'll close the ticket now.

Tue, May 12, 4:40 PM · Traffic, HTTPS
ssingh added a comment to T421421: Revert lvs1017 Mellanox NIC to Broadcom.

@VRiley-WMF: As John correctly pointed out, this is booting with UEFI enabled now. Is that expected and the default for all hosts now? If that is the case, we can update the partman recipes but this is an old host, so I am surprised that UEFI is the default. Any insights on that?

Tue, May 12, 2:08 PM · SRE, Traffic
ssingh added a comment to T425120: Firefox: Be careful. Something doesn’t look right (Strict Transport Security).

Thanks for the report @Ponor. Does this issue still persist? If yes, can you please undertake the steps in https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue and reach out to us at noc@wikimedia.org with the output? You can refer to this task.

Tue, May 12, 1:28 PM · Traffic, HTTPS
ssingh closed T423331: Wikimedia Hackathon 2026: Wikimedia's Production DNS Infrastructure and GeoDNS User Routing as Declined.
Tue, May 12, 1:16 PM · Traffic, Wikimedia-Hackathon-2026

Mon, May 11

ssingh added a comment to T421421: Revert lvs1017 Mellanox NIC to Broadcom.
Forced UEFI HTTP Boot for next reboot
Resetting chassis power status for lvs1017 to ForceRestart
Host rebooted via Redfish
Mon, May 11, 7:11 PM · SRE, Traffic
ssingh added a comment to T421421: Revert lvs1017 Mellanox NIC to Broadcom.

@ssingh you are booting with UEFI?

The YAML file needs to be updated for lvs1017:

-partman/standard-efi.cfg
-partman/raid1-2dev-efi.cfg

Mon, May 11, 7:07 PM · SRE, Traffic
ssingh added a comment to T425763: Error 429 for search queries and images in older browsers.

Hi @BrokenImages1234, thanks for your report. The error report you indicated, 76af2b0, does indeed point to the issue on why you are seeing 429s. Upgrading to a more recent browser version will help fix the problem. Also, you can try enabling third-party cookies and that should help improve the rate-limiting somewhat.

Mon, May 11, 6:51 PM · MediaWiki-Platform-Team (Radar), Traffic
ssingh added a comment to T425670: images are not loading for some users (on the us west coast?).

I may be wrong but this was due to a temporary issue we had with upload.wikimedia.org in ulsfo, which matches the time of this report, and also matches the traffic from New Zealand, since that goes to ulsfo as well. I believe that this should now be resolved.

Mon, May 11, 6:44 PM · Traffic
ssingh added a comment to T421421: Revert lvs1017 Mellanox NIC to Broadcom.

@VRiley-WMF: We may need to check this host; I can't seem to get it to come back up after a reboot (checked twice). Is there something else missing here? Perhaps a provisioning cookbook run or something? (Just guessing!)

Mon, May 11, 6:26 PM · SRE, Traffic
ssingh added a comment to T421421: Revert lvs1017 Mellanox NIC to Broadcom.

@ssingh I have almost gotten it all the way through. However, it doesn't seem to take the reimage. It seems to be getting stuck at the RAID. I tried to log into the other lvs servers, but I'm unable to. Is there a specific RAID this needs to have? Let me know, thanks!

Mon, May 11, 4:41 PM · SRE, Traffic
ssingh updated subscribers of T425930: Adding cwilliams to users and ops.

@KOfori is out, deferring to @Kappakayala as the approver in the interim.

Mon, May 11, 4:12 PM · SRE, SRE-Access-Requests
ssingh added a comment to T425890: Degraded RAID on lvs2012.

@Jhancock.wm: Thanks for the quick turnaround! Host is back and serving traffic, will keep a close watch for a bit before resolving this.

Mon, May 11, 3:32 PM · Traffic, SRE, ops-codfw, DC-Ops
ssingh added a comment to T425890: Degraded RAID on lvs2012.

I pulled a replacement DIMM and an SSD from our offlined hosts.
@ssingh safe to power down the host?

Mon, May 11, 2:55 PM · Traffic, SRE, ops-codfw, DC-Ops
ssingh added a comment to T425890: Degraded RAID on lvs2012.
Record:      410
Date/Time:   05/10/2026 04:22:34
Source:      system
Severity:    Critical
Description: A critical diagnostic event occurred in the memory device at B2. Contact your service provider for assistance in replacing the device. (Extended ID: 0x4E42).
-------------------------------------------------------------------------------
Record:      411
Date/Time:   05/10/2026 05:21:45
Source:      system
Severity:    Critical
Description: A critical diagnostic event occurred in the memory device at B2. Contact your service provider for assistance in replacing the device. (Extended ID: 0x4E42).
-------------------------------------------------------------------------------
Record:      412
Date/Time:   05/10/2026 19:52:56
Source:      system
Severity:    Critical
Description: A critical diagnostic event occurred in the memory device at B2. Contact your service provider for assistance in replacing the device. (Extended ID: 0x4E42).
-------------------------------------------------------------------------------
Record:      413
Date/Time:   05/10/2026 19:52:57
Source:      system
Severity:    Critical
Description: A critical diagnostic event occurred in the memory device at B2. Contact your service provider for assistance in replacing the device. (Extended ID: 0x4E42).
-------------------------------------------------------------------------------
Mon, May 11, 2:32 PM · Traffic, SRE, ops-codfw, DC-Ops
ssingh added a comment to T421421: Revert lvs1017 Mellanox NIC to Broadcom.

Hey @ssingh Is it okay to make this change today?

Mon, May 11, 1:16 PM · SRE, Traffic

May 7 2026

ssingh updated subscribers of T425216: ESAMS and others serving older revisions of overwritten files.
sukhe@cumin1003:~$ sudo cumin "A:cp and not P{cp2041* or cp2042*}" "curl -s https://upload.wikimedia.org/wikipedia/test/4/45/T425216_ESAMS_overwrite_test.png | md5sum -"
111 hosts will be targeted:
cp[2043-2058].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1100-1115].eqiad.wmnet,cp[5017-5021,5023-5032].eqsin.wmnet,cp[3066-3081].esams.wmnet,cp[7001-7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
OK to proceed on 111 hosts? Enter the number of affected hosts to confirm or "q" to quit: 111
===== NODE GROUP =====
(2) cp[3071,3078].esams.wmnet
----- OUTPUT for command #1: 'curl -s https://...t.png | md5sum -' -----
803be1a3faf69055c88a8e7061170336  -
===== NODE GROUP =====
(3) cp[2043,2053,2058].codfw.wmnet
----- OUTPUT for command #1: 'curl -s https://...t.png | md5sum -' -----
60ac3bdc2b5a68e8cda0f99118117eab  -
===== NODE GROUP =====
(38) cp[2048,2050,2055].codfw.wmnet,cp[6001-6016].drmrs.wmnet,cp[1105,1108-1109,1112,1114].eqiad.wmnet,cp[5019,5021,5026,5031].eqsin.wmnet,cp[3066-3069,3072,3076-3077,3080-3081].esams.wmnet,cp7015.magru.wmnet
----- OUTPUT for command #1: 'curl -s https://...t.png | md5sum -' -----
203dbe0702a1c7ee976c18481893228e  -
===== NODE GROUP =====
(68) cp[2044-2047,2049,2051-2052,2054,2056-2057].codfw.wmnet,cp[1100-1104,1106-1107,1110-1111,1113,1115].eqiad.wmnet,cp[5017-5018,5020,5023-5025,5027-5030,5032].eqsin.wmnet,cp[3070,3073-3075,3079].esams.wmnet,cp[7001-7014,7016].magru.wmnet,cp[4037-4052].ulsfo.wmnet
----- OUTPUT for command #1: 'curl -s https://...t.png | md5sum -' -----
64a5c589df8d0d96db7475340b5c4c91  -
================
PASS |████████████████████████████████████████████████████████████████████████████████████| 100% (111/111) [00:08<00:00, 12.77hosts/s]
FAIL |                                                                                              |   0% (0/111) [00:08<?, ?hosts/s]
100.0% (111/111) success ratio (>= 100.0% threshold) for command #1: 'curl -s https://...t.png | md5sum -'.
100.0% (111/111) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
May 7 2026, 3:49 PM · SRE, ops-esams, DC-Ops, Traffic, MediaWiki-Core-Revision-backend, MediaWiki-File-management, Commons
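
The cumin run above groups the 111 cache hosts by the checksum each one returned. As a rough sketch of how that output could be tallied, the snippet below copies the (checksum, host count) pairs from the NODE GROUP headers by hand; the `summarize` helper and its field names are hypothetical and not part of cumin.

```python
from typing import Dict, List, Tuple

# (md5 checksum, number of hosts) pairs transcribed from the NODE GROUP
# headers in the cumin output above.
GROUPS: List[Tuple[str, int]] = [
    ("803be1a3faf69055c88a8e7061170336", 2),
    ("60ac3bdc2b5a68e8cda0f99118117eab", 3),
    ("203dbe0702a1c7ee976c18481893228e", 38),
    ("64a5c589df8d0d96db7475340b5c4c91", 68),
]


def summarize(groups: List[Tuple[str, int]]) -> Dict[str, object]:
    """Report how many hosts answered and which checksum is most common."""
    total = sum(count for _, count in groups)
    majority_sum, majority_count = max(groups, key=lambda g: g[1])
    return {
        "total_hosts": total,
        "distinct_checksums": len({checksum for checksum, _ in groups}),
        "majority_checksum": majority_sum,
        "majority_hosts": majority_count,
    }


if __name__ == "__main__":
    for key, value in summarize(GROUPS).items():
        print(f"{key}: {value}")
```

The host counts add up to the 111 targeted hosts, spread over four distinct checksums.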
ssingh added a comment to T425216: ESAMS and others serving older revisions of overwritten files.

Yeah, no, I was wrong and I misunderstood the problem. I misread that the actual image also differs across the CDN but it clearly does not.

May 7 2026, 12:33 AM · SRE, ops-esams, Traffic, DC-Ops, MediaWiki-Core-Revision-backend, MediaWiki-File-management, Commons
ssingh updated the task description for T424686: ulsfo switch work May 2026: Host reimaging.
May 7 2026, 12:16 AM · Infrastructure-Foundations, Traffic, DC-Ops

May 6 2026

ssingh updated the task description for T424686: ulsfo switch work May 2026: Host reimaging.
May 6 2026, 6:32 PM · Infrastructure-Foundations, Traffic, DC-Ops
ssingh added a comment to T425216: ESAMS and others serving older revisions of overwritten files.

I believe there is a 24 hourly script that checks cross dc consistency or something

May 6 2026, 5:10 PM · SRE, ops-esams, Traffic, DC-Ops, MediaWiki-Core-Revision-backend, MediaWiki-File-management, Commons
ssingh added a comment to T401832: Upgrade Traffic hosts to trixie.

^ The above is not a trixie reimage but a bookworm reimage for the ulsfo work in T424686. Please disregard.

May 6 2026, 4:51 PM · Traffic

Apr 29 2026

ssingh assigned T424785: [Update DNS Record Request] - wikimedia.org - Add TXT verification for Anthropic to CDobbins.
Apr 29 2026, 1:06 PM · Traffic, SRE, DNS

Apr 28 2026

ssingh updated subscribers of T424686: ulsfo switch work May 2026: Host reimaging.
Apr 28 2026, 5:27 PM · Infrastructure-Foundations, Traffic, DC-Ops
ssingh updated the task description for T424686: ulsfo switch work May 2026: Host reimaging.
Apr 28 2026, 5:25 PM · Infrastructure-Foundations, Traffic, DC-Ops
ssingh triaged T424686: ulsfo switch work May 2026: Host reimaging as Medium priority.
Apr 28 2026, 5:25 PM · Infrastructure-Foundations, Traffic, DC-Ops
ssingh created T424686: ulsfo switch work May 2026: Host reimaging.
Apr 28 2026, 5:25 PM · Infrastructure-Foundations, Traffic, DC-Ops
ssingh added a comment to T408892: ULSFO: New switch configuration.

As a side note we will need to manually change the IPs of the routed ganeti nodes in rack 23 to the 10.128.1.0/24 subnet. Normal operation would have required a re-image but to not lose the VMs and make the migration faster it's best to re-IP the hosts.

AFAIK, the last thing needed is to convert those BGP policies from Junos to Nokia:

  • Anycast4
  • Anycast6
  • Ganeti4
  • Ganeti6
  • Management
  • PyBal
  • Core
Apr 28 2026, 4:55 PM · Patch-For-Review, SRE, Infrastructure-Foundations, DC-Ops, netops, ops-ulsfo

Apr 27 2026

ssingh added a comment to T424549: Puppet agent failure detected on instance deployment-cache-text08 in project deployment-prep.

@bd808 yep it is, I missed this. I think we can spin up discovery2026 on cloud as well for consistency, I'll try to do it tomorrow!

Apr 27 2026, 4:28 PM · Traffic, Beta-Cluster-Infrastructure

Apr 24 2026

ssingh added a comment to T421421: Revert lvs1017 Mellanox NIC to Broadcom.

Slight correction on my end, sorry: this host is not an active host, so you can install the NIC whenever you want before May 11 as well. We will however pick this up on our end during that week.

Apr 24 2026, 5:56 PM · SRE, Traffic
ssingh added a comment to T420604: Deduplicate CSP between VCL and MediaWiki.

For 2), we will be removing the VCL altogether I suppose? And you meant "same changes" as in same changes on MW?

Yes, remove the VCL CSP for Wikimedia production entirely. Which I guess is just removing it entirely. Except for any odd exceptions for static assets, etc. that may already exist (I think there might be one for PDFs and/or testwiki). And then Wikimedia production will be (mostly) controlled by MediaWiki's CSP implementation. So these operations really shouldn't affect what users experience now except for the report-only header going away. But in the future we plan to define more CSP directives and further tighten the current CSP allow-list - these steps just get us further down that path.

Apr 24 2026, 3:11 PM · Traffic, Sustainability (Incident Followup), SecTeam-Processed, ContentSecurityPolicy, 2026-user-javascript-incident, Product Safety and Integrity, Security, Security-Team
ssingh added a comment to T421421: Revert lvs1017 Mellanox NIC to Broadcom.

@VRiley-WMF: We are planning to do this the week of May 11. Does that work for you?

Apr 24 2026, 2:28 PM · SRE, Traffic
ssingh added a comment to T414411: cp5022 is unreachable.

Hi @RobH. Any update on this from Dell's end?

Apr 24 2026, 2:27 PM · SRE, DC-Ops, ops-eqsin, Traffic
ssingh closed T424268: Requesting Kerberos password reset as Resolved.

It seems like @SKaram-WMF's Kerberos credentials were never created initially:

Apr 24 2026, 2:23 PM · Data-Engineering

Apr 23 2026

ssingh added a comment to T420604: Deduplicate CSP between VCL and MediaWiki.

So I think the next steps here are to:

  1. Monitor beta and check beta-logstash over the next few days to ensure stability
  2. Make these same changes within Wikimedia production
Apr 23 2026, 7:09 PM · Traffic, Sustainability (Incident Followup), SecTeam-Processed, ContentSecurityPolicy, 2026-user-javascript-incident, Product Safety and Integrity, Security, Security-Team
ssingh added a comment to T408892: ULSFO: New switch configuration.

Discussed with @Papaul a bit -- we will depool the site for all three days, just to be on the safe side and since it's ulsfo, one extra day does not really change things.

Apr 23 2026, 3:54 PM · Patch-For-Review, SRE, Infrastructure-Foundations, DC-Ops, netops, ops-ulsfo
ssingh added a comment to T420604: Deduplicate CSP between VCL and MediaWiki.

We can selectively remove the filter in Beta if desired, yes. Right now there isn't a proper way of doing this other than using the hacky etcd_filters variable. I will prepare a patch.

Ok, sounds great. Happy to help test/confirm once that's deployed.

Apr 23 2026, 3:45 PM · Traffic, Sustainability (Incident Followup), SecTeam-Processed, ContentSecurityPolicy, 2026-user-javascript-incident, Product Safety and Integrity, Security, Security-Team