ssingh (Sukhbir Singh)
SRE/Traffic

User Details

User Since
Dec 11 2018, 9:39 PM (364 w, 6 d)
Availability
Available
IRC Nick
sukhe
LDAP User
Unknown
MediaWiki User
SSingh (WMF) [ Global Accounts ]

Oh hi. Nice to see you here.

Recent Activity

Wed, Dec 3

ssingh updated the task description for T411675: druid-public-coordinator: no backend servers pooled.
Wed, Dec 3, 8:32 PM · Traffic
ssingh created T411675: druid-public-coordinator: no backend servers pooled.
Wed, Dec 3, 8:32 PM · Traffic
ssingh added a comment to T408892: ULSFO: New switch configuration.

@ssingh yes we have to depool the site, yes 10 AM CT

Wed, Dec 3, 5:15 PM · Patch-For-Review, SRE, Infrastructure-Foundations, DC-Ops, netops, ops-ulsfo

Tue, Dec 2

ssingh added a comment to T408892: ULSFO: New switch configuration.

@ssingh We are planning on doing the first phase (loopback IP change on core routers and management router) of the ULSFO refresh next week, Dec 09th at 10:00am. Please let me know if this works for you and your team.

Thanks.

Tue, Dec 2, 6:46 PM · Patch-For-Review, SRE, Infrastructure-Foundations, DC-Ops, netops, ops-ulsfo
ssingh updated subscribers of T411452: No space left on device on VRTS host.

@Dzahn has freed up some inodes. We were not out of disk space, we were out of inodes. We are trying to free up some more but for now, we should be running again. As @AntiCompositeNumber mentioned, there has been a steady rise for a while now so we should look into that.

Tue, Dec 2, 3:04 AM · Wikimedia-Incident, collaboration-services, SRE, vrts, Znuny
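
The distinction in the comment above (out of inodes rather than out of disk space) can be checked programmatically. The following is a minimal sketch, not the tooling used on the VRTS host: the function names and the 0.99 threshold are assumptions for illustration, and `os.statvfs` is only available on Unix-like systems.

```python
import os


def filesystem_usage(path="/"):
    """Return (block_fraction_used, inode_fraction_used) for the filesystem holding path."""
    st = os.statvfs(path)  # Unix-only call
    # Some filesystems report zero total inodes; treat that as 0.0 used.
    blocks = 1 - st.f_bavail / st.f_blocks if st.f_blocks else 0.0
    inodes = 1 - st.f_favail / st.f_files if st.f_files else 0.0
    return blocks, inodes


def diagnose(blocks, inodes, threshold=0.99):
    """Classify a 'No space left on device' condition (threshold is an assumed value)."""
    if inodes >= threshold and blocks < threshold:
        return "out of inodes"
    if blocks >= threshold:
        return "out of disk space"
    return "ok"


if __name__ == "__main__":
    used_blocks, used_inodes = filesystem_usage("/")
    print(f"blocks used: {used_blocks:.1%}, inodes used: {used_inodes:.1%}")
    print(diagnose(used_blocks, used_inodes))
```

Both conditions surface as the same "No space left on device" error, which is why looking at inode usage separately from block usage matters here.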

Thu, Nov 27

ssingh added a comment to T411191: hcaptcha-proxy health checks should also depool sites if their upstream is unreachable.

And while there is that fallback mechanism to the old system, this is something to keep in mind.

Thu, Nov 27, 3:06 PM · WE4.2 Bot detection, Traffic, serviceops
ssingh added a comment to T411191: hcaptcha-proxy health checks should also depool sites if their upstream is unreachable.

Yeah I think that makes sense if we want to exert control over upstream issues and how it reflects to the proxy itself. We can approach this in two ways:

Thu, Nov 27, 3:05 PM · WE4.2 Bot detection, Traffic, serviceops

Wed, Nov 26

ssingh created T411097: Deprecate low-traffic proxoid service and O:hcaptcha_proxy for the older hcaptcha proxy setup.
Wed, Nov 26, 2:17 PM · SRE, Traffic

Tue, Nov 25

ssingh updated the task description for T411043: Revisit the 1GB cache size limit for ATS.
Tue, Nov 25, 7:00 PM · Traffic, SRE
ssingh created T411043: Revisit the 1GB cache size limit for ATS.
Tue, Nov 25, 6:59 PM · Traffic, SRE
ssingh added a project to T410944: Reboot cookbook workflow leaves Puppet disabled: Traffic.
Tue, Nov 25, 1:06 AM · Traffic, Infrastructure-Foundations, SRE-tools, SRE

Fri, Nov 21

ssingh added a comment to T410201: Error: 503, Backend fetch failed.

Yeah, there are certainly other files. I think you can remove your comment that has the SVG file contents; otherwise it makes the task difficult to read.

Fri, Nov 21, 4:21 PM · Traffic, Commons

Thu, Nov 20

ssingh closed T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 as Resolved.

Once https://gerrit.wikimedia.org/r/c/operations/software/homer/deploy/+/1207915 is merged tomorrow or Monday, we will enable BGP, test the anycast address and then switch the backend to use that. But since the VMs are up and running with the desired role, marking this as resolved.

Thu, Nov 20, 8:27 PM · Patch-For-Review, Infrastructure-Foundations, Traffic, SRE, vm-requests

Wed, Nov 19

ssingh added a comment to T410201: Error: 503, Backend fetch failed.

Hi @RoyZuo: we have tried to debug this on the CDN side and can't seem to find anything there that can point us to the problem. Can you upload any file at all, or is it only this file that fails? Given that it is an SVG of 36.7Mb, it could be causing issues at the app layer.

Wed, Nov 19, 6:32 PM · Traffic, Commons
ssingh added a comment to T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14.

Oh wow, thanks @MoritzMuehlenhoff! But what was the issue for my understanding?

Basically just magru being affected by https://phabricator.wikimedia.org/T396864. I ran makevm and once it had created the VM (but not yet initiated the reimage), I moved the install3004 VM to a different node than the node where the new hcaptcha-proxy700X was added. After that the reimage proceeds as usual. We'll have the underlying issue fixed once the new dnsmasq release is out.

Wed, Nov 19, 2:17 PM · Patch-For-Review, Infrastructure-Foundations, Traffic, SRE, vm-requests
ssingh added a comment to T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14.

Oh wow, thanks @MoritzMuehlenhoff! But what was the issue for my understanding?

Wed, Nov 19, 2:06 PM · Patch-For-Review, Infrastructure-Foundations, Traffic, SRE, vm-requests

Tue, Nov 18

ssingh added a project to T408510: ULSFO: switch refresh: Traffic.
Tue, Nov 18, 8:03 PM · Traffic, SRE, Infrastructure-Foundations, DC-Ops, netops, ops-ulsfo
ssingh added a comment to T367732: POPs LVS : remove public vlan trunking.

@ssingh started working on this with https://gerrit.wikimedia.org/r/1206424 in T410047: No free IPs on public1-ulsfo vlan (Nov 2025), so boldly assigning the task to him :)

Tue, Nov 18, 6:51 PM · netops, Traffic, Infrastructure-Foundations
ssingh added a comment to T367732: POPs LVS : remove public vlan trunking.

Related: T410411.

Tue, Nov 18, 6:50 PM · netops, Traffic, Infrastructure-Foundations
ssingh triaged T410411: Cleaning up Puppet and Netbox VLAN sub-ints on edge sites as Low priority.
Tue, Nov 18, 3:49 PM · Patch-For-Review, Infrastructure-Foundations, SRE, netops, Traffic
ssingh created T410411: Cleaning up Puppet and Netbox VLAN sub-ints on edge sites.
Tue, Nov 18, 3:44 PM · Patch-For-Review, Infrastructure-Foundations, SRE, netops, Traffic

Mon, Nov 17

ssingh added a comment to T410201: Error: 503, Backend fetch failed.

Tagging Traffic for this is perfectly fine, thanks @A_smart_kitten.

Mon, Nov 17, 7:04 PM · Traffic, Commons
ssingh added a comment to T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14.

hcaptcha-proxy3001 worked just fine but hcaptcha-proxy3002 does not come up after reboot (tried twice). A manual start also did not work and sudo gnt-instance info hcaptcha-proxy3002.wikimedia.org on ganeti3005.esams.wmnet also wasn't really helpful.

This was caused by https://phabricator.wikimedia.org/T396864. I've shuffled the VM to a different node. Simply running the reimage cookbook on the node should fix it.

Once dnsmasq 2.92 is out, this will no longer affect routed Ganeti: https://phabricator.wikimedia.org/T396864#11113708

Mon, Nov 17, 5:46 PM · Patch-For-Review, Infrastructure-Foundations, Traffic, SRE, vm-requests
ssingh added a comment to T410047: No free IPs on public1-ulsfo vlan (Nov 2025).

Yeah, good point about the LVS IPs since we no longer need them given Liberica. I will be checking that with Valentin today.

It's more than that they aren't "needed"; they aren't being used. They are just incorrectly marked as in use in Netbox and need to be tidied up (also in puppet where they are referenced).

Mon, Nov 17, 3:10 PM · Patch-For-Review, Traffic, netops, Infrastructure-Foundations, SRE
ssingh added a comment to T410047: No free IPs on public1-ulsfo vlan (Nov 2025).

My plan for now to unblock the hCaptcha work was to decommission one of the Wikidough hosts in ulsfo -- which should be fine since they both average ~14 rps between them -- and then use that IP to create the hCaptcha VM. The reasoning for doing so is that hCaptcha is a more critical service than Wikidough right now and I don't want it to be running on a single VM in ulsfo. But let me know what you think about this, in general, and if I should not go down this path!

Why not just re-use the LVS IPs? I think it's better to keep Wikidough with two hosts in case one fails?

In terms of expanding the vlan I think we should do it anyway, but I agree it should not hold up hCaptcha.

Mon, Nov 17, 2:53 PM · Patch-For-Review, Traffic, netops, Infrastructure-Foundations, SRE
ssingh added a comment to T410047: No free IPs on public1-ulsfo vlan (Nov 2025).

You can use 198.35.26.5/28. It's marked as reserved for infra, but we don't need it (and we will even less need it after the network upgrade).

Mon, Nov 17, 2:52 PM · Patch-For-Review, Traffic, netops, Infrastructure-Foundations, SRE
ssingh added a comment to T410047: No free IPs on public1-ulsfo vlan (Nov 2025).

@ssingh I made a patch and can kick off the changes in Netbox and on the routers next week for this.

However I wonder what your thoughts are, how many more public IPs do you need in the short term? Reason I ask is this vlan will be doubled (quadrupled actually as we will add another public vlan for the second rack) during the T408510: ULSFO: switch refresh work which is coming up in the next few months. That work will require most hosts to be reimaged while we change the network setup to a L3 POP.

So an option, if removing unused IPs like the LVS from the vlan now gives enough for the proxy VMs, is to decline this task and increase the subnet size as planned during the larger job?

Actually I discussed with @Papaul in relation to our plans for ulsfo, and we both agree that work would be a lot simpler if we make this change now. We can discuss the way forward next week.

Mon, Nov 17, 2:45 PM · Patch-For-Review, Traffic, netops, Infrastructure-Foundations, SRE

Thu, Nov 13

ssingh added a comment to T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14.

hcaptcha-proxy3001 worked just fine but hcaptcha-proxy3002 does not come up after reboot (tried twice). A manual start also did not work and sudo gnt-instance info hcaptcha-proxy3002.wikimedia.org on ganeti3005.esams.wmnet also wasn't really helpful.

Thu, Nov 13, 7:51 PM · Patch-For-Review, Infrastructure-Foundations, Traffic, SRE, vm-requests
ssingh moved T410019: alerts should be triggered if druid fails to consume webrequest_sampled kafka topic from Radar/Not for Service to Actively Servicing on the Traffic board.
Thu, Nov 13, 6:31 PM · observability, Data-Platform-SRE (2025.11.07 - 2025.11.28), Data-Engineering (Q2 FY25/26 October 1st - December 31th), Sustainability (Incident Followup), Traffic, SRE

Wed, Nov 12

ssingh added a comment to T405623: eqiad row C/D Traffic host migrations.

All cp hosts in rows C/D have been migrated as of today (last ones done) and all that is left in Traffic realm for migration is dns1006 and lvs1020 via T405602.

We moved the existing cp hosts in C/D either last Thursday, Friday, or today. For each one I put in maint mode via icinga cookbook then logged into the cp host directly and depooled it, had the cable migrated, and then when ping resumed repooled the host. Each host was depooled for less than 5 minutes. I did not depool more than 1 host at a time from either text or upload.

Wed, Nov 12, 7:10 PM · Traffic, SRE, DC-Ops, ops-eqiad
ssingh added a comment to T405623: eqiad row C/D Traffic host migrations.

Please note this migration has shifted from Oct 15th start date to Nov 1 start date.

Wed, Nov 12, 7:02 PM · Traffic, SRE, DC-Ops, ops-eqiad
ssingh added a comment to T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14.

sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 2 --disk 20 --network public --os trixie -t T409860 --cluster <site> --group <group> <hostname>

Wed, Nov 12, 4:18 PM · Patch-For-Review, Infrastructure-Foundations, Traffic, SRE, vm-requests
ssingh added a comment to T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14.

One thing to consider: when we actually apply the role on DCs enabled with routed ganeti (esams and magru), we need to set profile::bird::routed_ganeti_apt: true to enable the apt component with the Bird package that has been enabled for routed ganeti (via hieradata/role/magru|esams/foo.yaml).

Wed, Nov 12, 3:49 PM · Patch-For-Review, Infrastructure-Foundations, Traffic, SRE, vm-requests
ssingh added a project to T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14: Infrastructure-Foundations.
Wed, Nov 12, 3:32 PM · Patch-For-Review, Infrastructure-Foundations, Traffic, SRE, vm-requests
ssingh added a comment to T409871: Local coding agents should be allowed to fetch wikipedia / wikidata.

Hi @Monneyboi: Thanks for reporting! Setting a user-agent will fix the error, as your own example above shows (thanks for trying that!). Additionally, as you mentioned, the reason this policy is now being enforced is to ensure fair use of infrastructure; user-agents are one of the ways of identifying or classifying a request, and thus we are requiring them to be set.

Wed, Nov 12, 12:38 AM · Traffic

Tue, Nov 11

ssingh added a comment to T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14.

Once the VMs are up, we will need to enable BGP for all of them in Netbox and then run homer.

Tue, Nov 11, 6:24 PM · Patch-For-Review, Infrastructure-Foundations, Traffic, SRE, vm-requests
ssingh added a comment to T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14.

Initial role can be insetup::traffic_nftables. We will reimage to hcaptcha::proxy role later, with Debian bookworm as routed Ganeti setups (magru/esams) do not have the patched bird packaged for trixie yet and we don't want to wait. (We can reimage to trixie later.)

Tue, Nov 11, 6:23 PM · Patch-For-Review, Infrastructure-Foundations, Traffic, SRE, vm-requests
ssingh renamed T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14 from eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast) to eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14.
Tue, Nov 11, 6:20 PM · Patch-For-Review, Infrastructure-Foundations, Traffic, SRE, vm-requests
ssingh created T409860: eqiad/codfw/esams/ulsfo/eqsin/drmrs/magru: 2 VM request for hCaptcha proxy (bird/anycast), total of 14.
Tue, Nov 11, 6:20 PM · Patch-For-Review, Infrastructure-Foundations, Traffic, SRE, vm-requests
ssingh added a comment to T408592: Request: Wikipedia 25 microsite hosting.

Hi @Jdrewniak: Daniel has already commented on the questions from Traffic's end (as they relate to the CDN and DNS), and what he has mentioned is correct per our understanding as well. We also set up some redirects under T408168 so we can work with those domains if required (wikipedia25.org).

Tue, Nov 11, 3:35 PM · Patch-For-Review, collaboration-services, SRE, PES1.3.3 WP25 Easter Eggs
ssingh added a comment to T409735: Meta query about why we map 31.13.103.0/24 to US.

Thanks @ssingh. I'm just reading about this RFC for the first time, I wonder longer term might it be a goal to automate the ingestion of data from such feeds to update our maps automatically? Obviously not as part of this work, just an idle thought.

Tue, Nov 11, 2:20 PM · Traffic, SRE

Mon, Nov 10

ssingh added a comment to T352245: Migrate the etcd main cluster to cfssl-based PKI.
Mon, Nov 10, 9:11 PM · Patch-For-Review, serviceops
ssingh added a comment to T117618: Add restrictive CSP to upload.wikimedia.org.

Commenting from Traffic's side: this is, in some ways, a trivial patch for us because we are simply setting an additional header. The challenge here, though, is understanding the header itself, the associated ramifications of setting it, and keeping it updated. For that, the Security team needs to be consulted, so this patch currently blocks on that happening.

Would you like to have a chat about this? I think we'd be fine with @Bawolff's suggested CSP in T117618#11072208, if we're fine breaking favicons. I'm not sure of what the best way to test this would be.

Mon, Nov 10, 8:28 PM · Patch-For-Review, Traffic, ContentSecurityPolicy, WMF-General-or-Unknown, Security-Team
ssingh added a comment to T409735: Meta query about why we map 31.13.103.0/24 to US.

Thanks for filing this task @cmooney! The geofeed link above is very helpful. So it seems from the above (57.141.8.0/24, 57.141.8.0/24), we are missing the entries in the geo-maps file so they default to codfw. (We have 57.141.4.0/24 and 57.141.5.0/24 in the geo-maps).

Mon, Nov 10, 7:01 PM · Traffic, SRE
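
The gap described above (a prefix absent from geo-maps falling through to the default) amounts to a coverage check. A minimal sketch using Python's `ipaddress` module follows; the two configured prefixes and the candidate prefix are the ones quoted in the comment, while the function name is hypothetical and the real geo-maps file format is not modeled here.

```python
import ipaddress

# Prefixes the comment says are already present in geo-maps.
configured = [
    ipaddress.ip_network("57.141.4.0/24"),
    ipaddress.ip_network("57.141.5.0/24"),
]


def is_covered(candidate: str, known) -> bool:
    """True if candidate falls inside any of the known networks."""
    net = ipaddress.ip_network(candidate)
    return any(net.subnet_of(k) for k in known)


# Prefix taken from the geofeed discussion in the comment.
candidates = ["57.141.8.0/24"]
missing = [c for c in candidates if not is_covered(c, configured)]
print(missing)
```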

Nov 6 2025

ssingh closed T409314: [Search Console Verification DNS Request] - wiktionary.org and wikibooks.org as Resolved.
Nov 6 2025, 3:59 PM · Traffic
ssingh added a comment to T409314: [Search Console Verification DNS Request] - wiktionary.org and wikibooks.org.

Ah interesting, that explains why we couldn't see the verification option on the Search Console. So just to confirm, you are set for both?

Nov 6 2025, 3:56 PM · Traffic
ssingh added a comment to T352956: Handling inbound IPIP traffic on low traffic LVS k8s based realservers.

Thanks @akosiaris, that sounds good. We would like to get this done in Q3 to resolve this blocker and to deploy Liberica everywhere, so please do factor that in for your planning. Thank you!

Nov 6 2025, 3:34 PM · Patch-For-Review, Prod-Kubernetes, Kubernetes, serviceops, Traffic
ssingh added a comment to T409314: [Search Console Verification DNS Request] - wiktionary.org and wikibooks.org.

@JKelsoteel-WMF: Can you please try to log in to wikibooks.org as well so we can see the text of the DNS record that needs to be verified?

Nov 6 2025, 3:23 PM · Traffic
ssingh added a comment to T409314: [Search Console Verification DNS Request] - wiktionary.org and wikibooks.org.

Thanks @JKelsoteel-WMF, we will be picking this up today.

Nov 6 2025, 1:57 PM · Traffic

Nov 5 2025

ssingh added a comment to T408168: Request to create the donate.wikipedia25.org domain + 301 redirect to a donate.wiki page.

Hi @BCornwall thank you for this. I have one last update request, apologies for the back and forth. I just heard from Fundraising that we had to adjust one of the UTM parameters since they weren’t able to track it correctly. Could you please update the destination URL one last time to this one: https://donate.wikimedia.org/?appeal=WP25&pym_appeal=WP25&wmf_campaign=vvid&wmf_medium=vvid&wmf_source=vvid

Thank you again for your patience and help!

Nov 5 2025, 7:25 PM · Patch-For-Review, Traffic, DNS
ssingh added a comment to T409330: Transport link saturation not alerting.

Thanks for the task @ssingh !

I agree this is definitely a major gap. In terms of the alertmanager rule you list it does make sense we should have another one (or expand it) to also cover transport / private WAN circuit. So we can absolutely do that.

Historically these alerts were triggered for us by LibreNMS. Checking there it's pretty obvious why those are no longer firing - they are turned off!

It's not at all clear to me when or why those were disabled. But in any event I have re-enabled them now which I think should make alerting work again. This setup may also explain why my attempts to get LibreNMS to alert at lower-than-line-rate for eqsin didn't work - the damn things were disabled completely!

Ultimately we can move these to alertmanager, we can work on what those alerts look like. And potentially move to basing them on dropped outbound packets taking QoS priority into account (T384052).

Nov 5 2025, 7:13 PM · Infrastructure-Foundations, SRE, Traffic, netops
ssingh moved T409314: [Search Console Verification DNS Request] - wiktionary.org and wikibooks.org from Backlog to Actively Servicing on the Traffic board.
Nov 5 2025, 6:21 PM · Traffic
ssingh added a comment to T390813: Upgrade End Of Support Junos.

@ssingh @Vgutierrez planning on doing this on Nov 19th at 10 AM CT. Thank you

Nov 5 2025, 6:17 PM · Traffic, netops, Infrastructure-Foundations
ssingh triaged T409330: Transport link saturation not alerting as High priority.
Nov 5 2025, 6:12 PM · Infrastructure-Foundations, SRE, Traffic, netops
ssingh created T409330: Transport link saturation not alerting.
Nov 5 2025, 6:12 PM · Infrastructure-Foundations, SRE, Traffic, netops

Nov 3 2025

ssingh moved T408857: ncmonitor: Migrate from deprecated API to new API from Backlog to Actively Servicing on the Traffic board.
Nov 3 2025, 2:28 PM · Traffic

Oct 29 2025

ssingh added a comment to T408689: Requesting access to analytics_privatedata_users for slyngshede.

@ssingh - for manager sign off

@Ottomata - Group access approval

Oct 29 2025, 2:01 PM · SRE, SRE-Access-Requests

Oct 28 2025

ssingh closed T408549: lvs2011 hardware issue after reboot as Resolved.

Thanks for the help @Jhancock.wm. Marking this as resolved for now.

Oct 28 2025, 2:31 PM · SRE, DC-Ops, Traffic, ops-codfw
ssingh triaged T408549: lvs2011 hardware issue after reboot as High priority.
Oct 28 2025, 1:18 PM · SRE, DC-Ops, Traffic, ops-codfw
ssingh created T408549: lvs2011 hardware issue after reboot.
Oct 28 2025, 1:18 PM · SRE, DC-Ops, Traffic, ops-codfw

Oct 27 2025

ssingh added a comment to T408202: varnishtests are broken with podman.

Brett pointed out that the regression was introduced in https://gerrit.wikimedia.org/r/q/I9fab3e43a39456432eb148df91faffba54b1926e.

Oct 27 2025, 7:44 PM · Patch-For-Review, Traffic
ssingh closed T408148: Puppet agent failure detected on instance deployment-cache-text08 in project deployment-prep, a subtask of T404826: Integrate code from the private repository into the CDN, as Resolved.
Oct 27 2025, 7:27 PM · Hiddenparma, Traffic, SRE
ssingh closed T408148: Puppet agent failure detected on instance deployment-cache-text08 in project deployment-prep as Resolved.

Sorry this took a while but this should now be resolved. Thanks to Giuseppe for taking care of it in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1198424, which we will build on further as required, with Traffic responsible for ensuring parity.

Oct 27 2025, 7:27 PM · Traffic, Beta-Cluster-Infrastructure
ssingh added a project to T406545: FY 25/26 WE 5.4.5: Enforce global rate-limits: Traffic.
Oct 27 2025, 12:54 PM · Traffic, Hiddenparma, SRE

Oct 24 2025

ssingh created P84298 (An Untitled Masterwork).
Oct 24 2025, 6:37 PM
ssingh assigned T408202: varnishtests are broken with podman to BCornwall.
Oct 24 2025, 1:46 PM · Patch-For-Review, Traffic
ssingh assigned T408168: Request to create the donate.wikipedia25.org domain + 301 redirect to a donate.wiki page to BCornwall.
Oct 24 2025, 12:38 PM · Patch-For-Review, Traffic, DNS

Oct 23 2025

ssingh added a comment to T408148: Puppet agent failure detected on instance deployment-cache-text08 in project deployment-prep.

Yeah I missed this in the previous fix. I am going to take this tomorrow since now essentially we have to guard the includes as well.

Oct 23 2025, 7:56 PM · Traffic, Beta-Cluster-Infrastructure
ssingh added a comment to T404826: Integrate code from the private repository into the CDN.

[operations/puppet@production] varnish: add conditional to varnish::common::vcl for beta

https://gerrit.wikimedia.org/r/1198132

Note that while this silenced the alert for Puppet, it did not resolve the Varnish compilation failure. Beta is kept online only by Varnish's stale memory of the last working config, because reloads fail to apply new changes:

Oct 23 2025, 7:36 PM · Hiddenparma, Traffic, SRE
ssingh added a comment to T390813: Upgrade End Of Support Junos.

@ssingh thanks for the update. I am planning on doing it before Thanksgiving; any day during the week of November 17th works for me. Let me know if that works for you and I can get back to you on the exact day and time.

Oct 23 2025, 1:32 PM · Traffic, netops, Infrastructure-Foundations

Oct 22 2025

ssingh closed T407966: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep, a subtask of T404826: Integrate code from the private repository into the CDN, as Resolved.
Oct 22 2025, 7:10 PM · Hiddenparma, Traffic, SRE
ssingh closed T407966: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep as Resolved.

Sorry about this, this should now be fixed. And glad to see that Traffic was added automatically, thanks to @bd808 and @Ladsgroup for their work on this!

Oct 22 2025, 7:10 PM · Traffic, Beta-Cluster-Infrastructure
ssingh assigned T408003: [Update DNS Record Request] - wikimedia.org to BCornwall.
Oct 22 2025, 5:37 PM · SRE, DNS, Traffic
ssingh added a comment to T404826: Integrate code from the private repository into the CDN.

https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197986 has caused puppet to break on deployment-cache-upload08.deployment-prep. Please help!

Oct 22 2025, 3:08 PM · Hiddenparma, Traffic, SRE
ssingh added a comment to T407966: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep.

This is because of:

Oct 22 2025, 2:55 PM · Traffic, Beta-Cluster-Infrastructure
ssingh added a comment to T390813: Upgrade End Of Support Junos.

@ssingh @Vgutierrez hello just checking in to see if you have a day and time for this for drmrs.
Thanks

Oct 22 2025, 2:45 PM · Traffic, Infrastructure-Foundations, netops

Oct 21 2025

ssingh added a comment to T404913: Transfer wikipedia.pt domain to community.

Hi @CRoslof: This is another ticket that we would like to take up and will need your help with so that we can reflect it in downstream services as well. Let me know if I should create a Zendesk thread for tracking by other Legal members. Thanks a lot for bearing with us and helping us clean up the ownership.

Oct 21 2025, 7:18 PM · Traffic, Domains
ssingh added a project to T407787: Alertmanager triggers an alert on IRC and email after the alert has resolved: Spicerack.

It looks like spicerack should check that alerts for the downtimed host have been resolved (not in firing state) before deleting the silence/downtime with ALERTS{alertstate="firing", instance=~"cp5018:.*"}

Oct 21 2025, 1:34 PM · Infrastructure-Foundations, SRE-tools, Spicerack, Traffic, Observability-Alerting
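
The check suggested above can be sketched as follows. This is not Spicerack's API: both function names are hypothetical, and the response shape assumed is the JSON returned by the Prometheus HTTP API for an instant query (`status` plus `data.result`). The query string matches the one quoted in the comment.

```python
def firing_alerts_query(host: str) -> str:
    """Build the PromQL selector for alerts still firing on a host.

    Assumes a short hostname with no regex metacharacters (e.g. "cp5018").
    """
    return f'ALERTS{{alertstate="firing", instance=~"{host}:.*"}}'


def safe_to_remove_silence(response: dict) -> bool:
    """True only if the query succeeded and returned no firing alerts."""
    if response.get("status") != "success":
        return False  # be conservative when the query itself failed
    return len(response["data"]["result"]) == 0


# Sample responses in the instant-query shape (made-up data for illustration).
still_firing = {
    "status": "success",
    "data": {"resultType": "vector", "result": [{"metric": {"alertname": "Example"}}]},
}
all_clear = {"status": "success", "data": {"resultType": "vector", "result": []}}

print(firing_alerts_query("cp5018"))
print(safe_to_remove_silence(still_firing), safe_to_remove_silence(all_clear))
```

Treating a failed query as "not safe" keeps the silence in place when the alert state cannot be determined.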

Oct 20 2025

ssingh triaged T407787: Alertmanager triggers an alert on IRC and email after the alert has resolved as Low priority.
Oct 20 2025, 7:01 PM · Infrastructure-Foundations, SRE-tools, Spicerack, Traffic, Observability-Alerting
ssingh created T407787: Alertmanager triggers an alert on IRC and email after the alert has resolved.
Oct 20 2025, 7:01 PM · Infrastructure-Foundations, SRE-tools, Spicerack, Traffic, Observability-Alerting
ssingh updated the task description for T401832: Upgrade Traffic hosts to trixie.
Oct 20 2025, 5:18 PM · Traffic
ssingh edited projects for T407769: Improve how we build the 'haproxy_allowed_healthcheck_sources' list of IPs, added: Traffic; removed Traffic-Icebox.

Thanks for filing this task! I think this is a good idea to reduce the manual updates to this list, and something we have failed to keep updated. We will triage this after discussion in Traffic.

Oct 20 2025, 3:31 PM · Traffic, SRE

Oct 17 2025

ssingh added a comment to T332220: Acquire enwp.org.

Nice job indeed in pursuing this over the years, Brett!

Oct 17 2025, 7:25 PM · Traffic, SRE, Domains
ssingh updated subscribers of T406880: hCaptcha: Implement alerts.

[Adding Raine @kamila as well.]

Oct 17 2025, 7:23 PM · ConfirmEdit (CAPTCHA extension), Product Safety and Integrity (Sprint Mince Pie Dec 1 - Dec 12), MW-1.45-notes (1.45.0-wmf.25; 2025-10-28), Observability-Alerting, WE4.2 Bot detection (WE4.2 hCaptcha account creation trial)
ssingh moved T407570: Test the impact of incremental increase in traffic for cache splitting experiments from Backlog to Actively Servicing on the Traffic board.
Oct 17 2025, 1:20 PM · Patch-For-Review, MW-1.46-notes (1.46.0-wmf.7; 2025-12-16), Test Kitchen (Experiment Platform Sprint 16), Essential-Work, Traffic
ssingh added a comment to T407570: Test the impact of incremental increase in traffic for cache splitting experiments.

Thanks for filing the task, @JVanderhoop-WMF. As per the discussion on Slack, the above sounds good.

Oct 17 2025, 1:18 PM · Patch-For-Review, MW-1.46-notes (1.46.0-wmf.7; 2025-12-16), Test Kitchen (Experiment Platform Sprint 16), Essential-Work, Traffic

Oct 16 2025

ssingh closed T407421: cp7007 hardware issues after reboot as Resolved.

Thanks to @Jhancock.wm for the help with this!

Oct 16 2025, 3:22 PM · DC-Ops, Traffic, ops-magru
ssingh added a comment to T392851: Q4:rack/setup/install cp20[43-58] codfw.

FWIW doing one or two hosts is more than enough. We will reimage them again anyway so it doesn't make sense IMO for you both to spend time upgrading all of them to trixie. If one or two reimage fine, please leave the rest to us.

Oct 16 2025, 3:20 PM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops

Oct 15 2025

ssingh assigned T407421: cp7007 hardware issues after reboot to BCornwall.
Oct 15 2025, 7:13 PM · DC-Ops, Traffic, ops-magru
ssingh created T407421: cp7007 hardware issues after reboot.
Oct 15 2025, 7:12 PM · DC-Ops, Traffic, ops-magru
ssingh added a comment to T407320: Package benthos/redpanda for trixie.

Do you happen to have a trixie host available that we can try the existing package on?

Oct 15 2025, 2:24 PM · Observability-Logging, Traffic
ssingh added a comment to T407156: Request to create the 25.wikipedia.org domain + 301 redirect to the org site.

I was also looped into a new request today. As part of the birthday initiative, the Fundraising team is developing a customized donation portal under the donate.wiki domain. Would it be possible to set up a redirect for this new portal as well? I don’t have the final destination URL yet, but we’d like to create the domain donate.wikipedia25.org to redirect to the donation portal once it’s ready. Is this something you could help with too?

Oct 15 2025, 1:22 PM · Traffic, DNS, Domains
ssingh added a comment to T405499: Remove lvs1018 L2 link to ssw1-e1-eqiad.

FWIW we have typically reimaged for this in the past. I am not suggesting, just sharing! And given that this is lvs1020, that might be OK? (Leaving to you both for the final decision.)

This is lvs1018. I'm comfortable enough either way, can I leave the decision with traffic?

I was thinking of trying to schedule this for tomorrow, Thurs Oct 16th if that worked for you guys?

Oct 15 2025, 1:10 PM · DC-Ops, ops-eqiad, Infrastructure-Foundations, netops, SRE

Oct 14 2025

ssingh updated the task description for T401832: Upgrade Traffic hosts to trixie.
Oct 14 2025, 7:02 PM · Traffic
ssingh updated the task description for T401832: Upgrade Traffic hosts to trixie.
Oct 14 2025, 6:59 PM · Traffic
ssingh updated the task description for T401832: Upgrade Traffic hosts to trixie.
Oct 14 2025, 6:26 PM · Traffic
ssingh updated the task description for T401832: Upgrade Traffic hosts to trixie.
Oct 14 2025, 6:25 PM · Traffic
ssingh added a comment to T406650: Copy the Traffic team on alerts for deployment-cache* hosts.

Thanks for working on this! We will try our best to follow up on our end in making sure that Puppet is not broken on the cache hosts in Beta.

Oct 14 2025, 5:49 PM · User-bd808, Traffic, Beta-Cluster-Infrastructure
ssingh added a comment to T405499: Remove lvs1018 L2 link to ssw1-e1-eqiad.

FWIW we have typically reimaged for this in the past. I am not suggesting, just sharing! And given that this is lvs1020, that might be OK? (Leaving to you both for the final decision.)

Oct 14 2025, 5:40 PM · DC-Ops, ops-eqiad, Infrastructure-Foundations, netops, SRE
ssingh closed T405102: Create boot environment of Bullseye with a 6.1 kernel, a subtask of T392851: Q4:rack/setup/install cp20[43-58] codfw, as Resolved.
Oct 14 2025, 5:38 PM · User-Elukey, SRE, Patch-For-Review, Traffic, ops-codfw, DC-Ops