
ssingh (Sukhbir Singh)
SRE/Traffic

User Details

User Since
Dec 11 2018, 9:39 PM (223 w, 3 d)
Availability
Available
IRC Nick
sukhe
LDAP User
Unknown
MediaWiki User
SSingh (WMF)

Oh hi.

Recent Activity

Thu, Mar 23

ssingh added a comment to T321309: Upgrade Traffic hosts to bullseye.

Reimaged pybal-test2003 to bullseye, added component/pybal and everything appears to be fine with the installation.

Thu, Mar 23, 2:31 PM · Patch-For-Review, Traffic, SRE

Tue, Mar 21

ssingh claimed T274431: Wikidough: Support EDNS(0) Padding: RFC 7830 and RFC 8467.
Tue, Mar 21, 4:44 PM · SRE, Traffic
ssingh claimed T252132: Deploy Wikidough: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver.
Tue, Mar 21, 4:43 PM · SRE, Traffic

Thu, Mar 16

ssingh closed T332083: Clean up and refactor the dnsrecursor module as Resolved.

This has been resolved with https://gerrit.wikimedia.org/r/898957 and all R:Class = dnsrecursor hosts are running bullseye:

Thu, Mar 16, 3:36 PM · Traffic, SRE

Wed, Mar 15

ssingh closed T287266: Unexpected auditd service restart failure as Resolved.

We reimaged two hosts to bullseye and didn't notice any auditd failure, so confirming what @MoritzMuehlenhoff said above and marking this as resolved.

Wed, Mar 15, 5:36 PM · User-MoritzMuehlenhoff, SRE, Traffic
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Wed, Mar 15, 5:35 PM · Patch-For-Review, Traffic, SRE
ssingh triaged T332202: Consider confirming the hostname by user input when running the reimaging cookbook as Low priority.
Wed, Mar 15, 5:22 PM · Patch-For-Review, Traffic, SRE, Infrastructure-Foundations
ssingh created T332202: Consider confirming the hostname by user input when running the reimaging cookbook.
Wed, Mar 15, 5:20 PM · Patch-For-Review, Traffic, SRE, Infrastructure-Foundations
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Wed, Mar 15, 5:04 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T330165: eqiad row B switches upgrade.
Wed, Mar 15, 2:26 PM · Patch-For-Review, Data Pipelines, Data-Engineering-Planning, DBA, Discovery-Search (Current work), SRE, serviceops, cloud-services-team, Machine-Learning-Team, Platform Engineering, SRE Observability, Infrastructure-Foundations, serviceops-collab, Traffic
ssingh updated the task description for T330165: eqiad row B switches upgrade.
Wed, Mar 15, 2:23 PM · Patch-For-Review, Data Pipelines, Data-Engineering-Planning, DBA, Discovery-Search (Current work), SRE, serviceops, cloud-services-team, Machine-Learning-Team, Platform Engineering, SRE Observability, Infrastructure-Foundations, serviceops-collab, Traffic

Tue, Mar 14

ssingh triaged T332083: Clean up and refactor the dnsrecursor module as Medium priority.
Tue, Mar 14, 8:29 PM · Traffic, SRE
ssingh renamed T332083: Clean up and refactor the dnsrecursor module from Cleanup and refactor the dnsrecursor module to Clean up and refactor the dnsrecursor module.
Tue, Mar 14, 8:28 PM · Traffic, SRE
ssingh created T332083: Clean up and refactor the dnsrecursor module.
Tue, Mar 14, 8:28 PM · Traffic, SRE
ssingh renamed T252132: Deploy Wikidough: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver from Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver to Deploy Wikidough: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver.
Tue, Mar 14, 7:02 PM · SRE, Traffic
ssingh closed T305589: Upgrading Wikidough and durum VMs to bullseye as Resolved.

Closing this in favour of T321309, where it is being tracked, and also given that the Ganeti reimaging cookbook exists, which was the primary motivation behind this task.

Tue, Mar 14, 7:02 PM · Patch-For-Review, Traffic, SRE
ssingh closed T305589: Upgrading Wikidough and durum VMs to bullseye, a subtask of T252132: Deploy Wikidough: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver, as Resolved.
Tue, Mar 14, 7:01 PM · SRE, Traffic
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Tue, Mar 14, 6:57 PM · Patch-For-Review, Traffic, SRE

Mon, Mar 13

ssingh updated subscribers of T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001..

@ayounsi, @cmooney: Quick question about Junos OS: so we are planning to spread ns0 over dns100[123] and ns1 over dns200[123], similar to how we are doing with ns2:

Mon, Mar 13, 3:08 PM · SRE, Traffic
ssingh added a comment to T323944: haproxy: work on systemd unit hardening (cp hosts).

Thanks to @Vgutierrez for taking care of the rollout of this. For posterity, the final result for now before we do more enhancements:

Mon, Mar 13, 1:33 PM · SRE, Traffic

Fri, Mar 10

ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Fri, Mar 10, 1:55 AM · Patch-For-Review, Traffic, SRE

Thu, Mar 9

ssingh created P45726 (An Untitled Masterwork).
Thu, Mar 9, 6:50 PM
ssingh added a comment to T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001..

cr2-codfw (replicated to cr1-codfw as well):

Thu, Mar 9, 5:56 PM · SRE, Traffic
ssingh added a comment to T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001..

cr2-eqiad (replicated to cr1-eqiad as well):

Thu, Mar 9, 5:45 PM · SRE, Traffic

Wed, Mar 8

ssingh added a comment to T331478: Ganeti reimage cookbook exception when running _clear_dhcp_cache.

@BCornwall / @ssingh we've removed the clear dhcp cache part of the cookbook. It's technically not needed at this point, as no interfaces or IPs move during reimaging.

That should allow the cookbook to run to completion, also in DRMRS.

Wed, Mar 8, 3:30 PM · SRE-tools, Infrastructure-Foundations
ssingh awarded T331478: Ganeti reimage cookbook exception when running _clear_dhcp_cache a Love token.
Wed, Mar 8, 2:40 PM · SRE-tools, Infrastructure-Foundations
ssingh added a comment to T331478: Ganeti reimage cookbook exception when running _clear_dhcp_cache.

From a quick look the current data is correct and doesn't error out:

>>> node.primary_ip.assigned_object.connected_endpoint.device
asw-0603-eqsin

Discard this: I tested it against ncredir5001 as reported in the task description, but that's a typo; the actual error is for ncredir6001, and only drmrs hosts failed.

Wed, Mar 8, 2:26 PM · SRE-tools, Infrastructure-Foundations
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Wed, Mar 8, 12:25 AM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Wed, Mar 8, 12:24 AM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Wed, Mar 8, 12:24 AM · Patch-For-Review, Traffic, SRE

Tue, Mar 7

ssingh added a comment to T331478: Ganeti reimage cookbook exception when running _clear_dhcp_cache.

First of all, thanks so much for the Ganeti cookbook -- it's a lifesaver. I can't imagine reimaging these hosts without the cookbook and all the manual hours that would have gone into that, so many thanks to @SLyngshede-WMF and @Volans for working on it!

Tue, Mar 7, 10:01 PM · SRE-tools, Infrastructure-Foundations
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Tue, Mar 7, 9:24 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Tue, Mar 7, 7:38 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Tue, Mar 7, 7:03 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Tue, Mar 7, 7:03 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Tue, Mar 7, 6:17 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Tue, Mar 7, 5:40 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Tue, Mar 7, 5:40 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Tue, Mar 7, 4:53 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Tue, Mar 7, 4:27 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Tue, Mar 7, 4:24 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Tue, Mar 7, 4:24 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Tue, Mar 7, 3:27 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T329073: eqiad row A switches upgrade.
Tue, Mar 7, 12:57 PM · Patch-For-Review, Discovery-Search (Current work), Shared-Data-Infrastructure, Data-Engineering-Planning, DBA, SRE, Platform Engineering, Infrastructure-Foundations, Traffic, serviceops, Machine-Learning-Team, cloud-services-team, Data-Persistence, SRE Observability, serviceops-collab

Mon, Mar 6

ssingh updated the task description for T329073: eqiad row A switches upgrade.
Mon, Mar 6, 5:35 PM · Patch-For-Review, Discovery-Search (Current work), Shared-Data-Infrastructure, Data-Engineering-Planning, DBA, SRE, Platform Engineering, Infrastructure-Foundations, Traffic, serviceops, Machine-Learning-Team, cloud-services-team, Data-Persistence, SRE Observability, serviceops-collab

Fri, Mar 3

ssingh updated the task description for T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001..
Fri, Mar 3, 4:14 PM · SRE, Traffic

Thu, Mar 2

ssingh added a comment to T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001..

lgtm just some curiosity :)

After the above change, we will have three DNS boxes in the core DCs, with ns0 pointing to dns1001 in eqiad and ns1 pointing to dns2001

Curious why you don't point
ns0 -> dns1001 & dns1002
ns1 -> dns2001 & dns2002

Thu, Mar 2, 3:15 PM · SRE, Traffic

Tue, Feb 28

ssingh added a comment to T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001..

@ssing

  1. for the cookbooks all that I see is that they use the A:dns-auth cumin alias, so they will follow along.
Tue, Feb 28, 12:49 AM · SRE, Traffic
ssingh updated the task description for T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001..
Tue, Feb 28, 12:45 AM · SRE, Traffic

Mon, Feb 27

ssingh triaged T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001. as Medium priority.
Mon, Feb 27, 4:08 PM · SRE, Traffic
ssingh created T330670: Deprecating the dns::auth role and moving authdns[12]001 to dns[12]001..
Mon, Feb 27, 4:08 PM · SRE, Traffic

Thu, Feb 23

ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Thu, Feb 23, 7:21 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Thu, Feb 23, 6:59 PM · Patch-For-Review, Traffic, SRE

Feb 23 2023

ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Feb 23 2023, 6:11 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Feb 23 2023, 6:09 PM · Patch-For-Review, Traffic, SRE
ssingh added a comment to T330318: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run.

Confirming that I did a new reimage and it completed successfully. Thanks to everyone who worked on this for resolving it so quickly.

Feb 23 2023, 5:54 PM · Infrastructure-Foundations, SRE
ssingh added a comment to T330318: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run.

@ssingh just run the Netbox script https://netbox.wikimedia.org/extras/scripts/interface_automation.ImportPuppetDB/ against the reimaged host (checking the commit changes checkbox) and you should be good to go.

FYI the fix is about to be released to prod.

Feb 23 2023, 2:40 PM · Infrastructure-Foundations, SRE
ssingh awarded T330318: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run a Love token.
Feb 23 2023, 2:39 PM · Infrastructure-Foundations, SRE
ssingh added a comment to T330318: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run.

Thanks for the quick response to this task, everyone!

Feb 23 2023, 2:36 PM · Infrastructure-Foundations, SRE

Feb 22 2023

ssingh created T330318: sre.hosts.reimage failed with "No commands provided" after completion of Puppet run.
Feb 22 2023, 6:11 PM · Infrastructure-Foundations, SRE
ssingh created P44735 (An Untitled Masterwork).
Feb 22 2023, 5:37 PM

Feb 21 2023

ssingh moved T309787: Remove IEContentAnalyzer from Triage to In Progress on the Traffic board.
Feb 21 2023, 3:51 PM · MW-1.41-notes (1.41.0-wmf.2; 2023-03-27), MW-1.40-notes (1.40.0-wmf.27; 2023-03-13), SRE, Traffic, Technical-Debt, MediaWiki-File-management
ssingh moved T330024: Let all requests from mainland China will be processed to codfw/esams/drmrs from Triage to Radar on the Traffic board.
Feb 21 2023, 3:50 PM · Chinese-Sites, Traffic, DNS, SRE
ssingh moved T252132: Deploy Wikidough: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver from In Progress to Queued on the Traffic board.
Feb 21 2023, 3:49 PM · SRE, Traffic

Feb 10 2023

ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Feb 10 2023, 5:33 PM · Patch-For-Review, Traffic, SRE

Feb 7 2023

ssingh updated the task description for T327925: codfw row A switches upgrade.
Feb 7 2023, 12:26 PM · Shared-Data-Infrastructure, Data-Engineering-Planning, Discovery-Search (Current work), DBA, serviceops, Traffic, Machine-Learning-Team, serviceops-collab, cloud-services-team, Platform Engineering, SRE Observability, Data-Persistence, SRE, netops, Infrastructure-Foundations

Feb 6 2023

ssingh added a comment to T327812: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook.

Hi @BCornwall: Thanks for checking! I am pretty convinced that this is not related to the NIC firmware, for the following reasons, but I may be missing something, so I am writing it down here:

Feb 6 2023, 2:12 PM · SRE, Traffic
ssingh updated the task description for T327925: codfw row A switches upgrade.
Feb 6 2023, 2:09 PM · Shared-Data-Infrastructure, Data-Engineering-Planning, Discovery-Search (Current work), DBA, serviceops, Traffic, Machine-Learning-Team, serviceops-collab, cloud-services-team, Platform Engineering, SRE Observability, Data-Persistence, SRE, netops, Infrastructure-Foundations

Feb 3 2023

ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Feb 3 2023, 1:55 PM · Patch-For-Review, Traffic, SRE
ssingh committed rODGD5602304195b6: Release 3.8.0-1~wmf2 (authored by ssingh).
Release 3.8.0-1~wmf2
Feb 3 2023, 1:14 PM
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Feb 3 2023, 2:28 AM · Patch-For-Review, Traffic, SRE

Feb 2 2023

ssingh added a comment to T321309: Upgrade Traffic hosts to bullseye.

Steps to follow for a manual upgrade of the iDRAC firmware for the cp hosts in eqiad, for us and in case someone else stumbles on this issue.

Can you please mention/link or integrate this into https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation ? We'll probably run into this with other servers as well.

Feb 2 2023, 3:20 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Feb 2 2023, 1:34 PM · Patch-For-Review, Traffic, SRE
ssingh added a comment to T321309: Upgrade Traffic hosts to bullseye.

Steps to follow for a manual upgrade of the iDRAC firmware for the cp hosts in eqiad, for us and in case someone else stumbles on this issue.

Feb 2 2023, 2:27 AM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Feb 2 2023, 1:50 AM · Patch-For-Review, Traffic, SRE
ssingh committed rODGD8efdb03f94cd: New upstream version 3.8.0 (authored by ssingh).
New upstream version 3.8.0
Feb 2 2023, 1:48 AM
ssingh committed rODGDd35782e788e9: pristine-tar data for gdnsd_3.8.0.orig.tar.gz (authored by ssingh).
pristine-tar data for gdnsd_3.8.0.orig.tar.gz
Feb 2 2023, 1:48 AM
ssingh committed rODGD0d3fed4caa01: New upstream version 3.8.0 (authored by ssingh).
New upstream version 3.8.0
Feb 2 2023, 1:48 AM
ssingh committed rODGD53742578780b: Initial upstream branch for gdnsd 3.8.0 (authored by ssingh).
Initial upstream branch for gdnsd 3.8.0
Feb 2 2023, 1:48 AM
ssingh committed rODGD381f4cefb38d: Initial upstream branch for gdnsd 3.8.0 (authored by ssingh).
Initial upstream branch for gdnsd 3.8.0
Feb 2 2023, 1:48 AM
ssingh added a comment to T321309: Upgrade Traffic hosts to bullseye.

Using a slight modification of @jbond's script in T328593, the list of cp nodes in eqiad with the outdated firmware (3.15.17.15) is basically all the cp nodes in eqiad:

Feb 2 2023, 1:05 AM · Patch-For-Review, Traffic, SRE

Feb 1 2023

ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Feb 1 2023, 8:03 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Feb 1 2023, 7:42 PM · Patch-For-Review, Traffic, SRE
ssingh added a comment to T321309: Upgrade Traffic hosts to bullseye.

@ssingh I have finished with cp1075; I have upgraded it to the most recent network, BIOS and iDRAC versions. In relation to other servers that you may have issues with, I have noticed that any machine with an iDRAC version < 3.30.30.30 first needs a manual upgrade to that version. The script can't work for iDRAC below that version, so you have to upgrade to it before progressing to a later version. After you are on 3.30.30.30 you should be able to use the script to upgrade to the most recent version (although the lower limit may change to 4.40.0.0 in the future, see T328593 for more info).
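
The version rule in the comment above can be sketched as a small check. This is only an illustration, not the script from T328593: the function names and the constant are hypothetical, and it assumes iDRAC versions are always four dot-separated integers.

```python
# Hypothetical sketch: decide whether a host needs a manual iDRAC upgrade
# before the upgrade script can be used. The 3.30.30.30 threshold is taken
# from the comment above; everything else here is an assumption.
MIN_SCRIPTED_IDRAC = (3, 30, 30, 30)


def parse_version(version):
    """Turn a dotted version string such as '3.15.17.15' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def needs_manual_upgrade(current):
    """Return True when the iDRAC version is below the scripted-upgrade minimum."""
    return parse_version(current) < MIN_SCRIPTED_IDRAC


if __name__ == "__main__":
    for version in ("3.15.17.15", "3.30.30.30"):
        print(version, needs_manual_upgrade(version))
```

Comparing tuples of integers rather than raw strings avoids lexical surprises, such as "3.4.0.0" sorting after "3.30.30.30". If the lower limit does move to 4.40.0.0, only the constant would need to change.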

Feb 1 2023, 7:27 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Feb 1 2023, 6:20 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Feb 1 2023, 6:20 PM · Patch-For-Review, Traffic, SRE
ssingh updated subscribers of T327812: eqsin hosts are not rebooting when running sre.hosts.reimage cookbook.

@Volans and I were discussing this on IRC today; some more observations with cp5019, which failed on the first attempt but worked on the second.

Feb 1 2023, 4:16 PM · SRE, Traffic

Jan 31 2023

ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Jan 31 2023, 10:07 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Jan 31 2023, 8:09 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Jan 31 2023, 6:53 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Jan 31 2023, 5:34 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Jan 31 2023, 5:31 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Jan 31 2023, 3:46 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Jan 31 2023, 2:43 AM · Patch-For-Review, Traffic, SRE
ssingh added a comment to T321309: Upgrade Traffic hosts to bullseye.

For posterity, the versions of the iDRAC and the NIC firmware that we need for the cp hosts bullseye upgrade, and that we pass to the firmware cookbook/upload on the HTTP management interface:

Jan 31 2023, 12:53 AM · Patch-For-Review, Traffic, SRE

Jan 30 2023

ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Jan 30 2023, 10:36 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Jan 30 2023, 9:02 PM · Patch-For-Review, Traffic, SRE
ssingh updated the task description for T321309: Upgrade Traffic hosts to bullseye.
Jan 30 2023, 7:58 PM · Patch-For-Review, Traffic, SRE
ssingh triaged T328343: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6] as Medium priority.
Jan 30 2023, 5:57 PM · serviceops-radar, SRE, DC-Ops, ops-codfw
ssingh created T328343: PROBLEM - IPMI Sensor Status is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status [codfw rack B6].
Jan 30 2023, 5:57 PM · serviceops-radar, SRE, DC-Ops, ops-codfw