ssingh (Sukhbir Singh)
SRE/Traffic

User Details

User Since
Dec 11 2018, 9:39 PM (280 w, 1 d)
Availability
Available
IRC Nick
sukhe
LDAP User
Unknown
MediaWiki User
SSingh (WMF) [ Global Accounts ]

Oh hi.

Recent Activity

Tue, Apr 23

ssingh created T363119: db1246 crashed.
Tue, Apr 23, 12:26 AM · SRE, ops-eqiad, DBA

Fri, Apr 19

ssingh added a comment to T362730: Q4:rack/setup/install magru misc servers.

Thanks @cmooney, looks good! One small update to the above since we will most likely transpose these to hieradata/common/lvs/interfaces.yaml: 10.140.1.3/24 is private1-b4-magru like 10.140.0.2/24 is private1-b3-magru (and not just private-b4-magru).

Fri, Apr 19, 5:55 PM · Traffic, netops, ops-magru, DC-Ops, Infrastructure-Foundations
ssingh added a comment to T362959: Grant Access to NDA for lina.farid.

@Lina_Farid_WMDE: to speed up things, you can also send an email to @KFrancis ( https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) ; kfrancis@wikimedia.org) from your email, requesting an NDA. Thanks!

Fri, Apr 19, 5:08 PM · Patch-For-Review, WMF-NDA-Requests, SRE, LDAP-Access-Requests
ssingh updated subscribers of T362959: Grant Access to NDA for lina.farid.

Hi @KFrancis: @Lina_Farid_WMDE will require an NDA as well, as I don't see their name on the spreadsheet. Thank you as always!

Fri, Apr 19, 4:24 PM · Patch-For-Review, WMF-NDA-Requests, SRE, LDAP-Access-Requests
ssingh added a comment to T362921: Authenticating wikimedia.org domain with MailChimp.

Update: we will need to add a DKIM record for MailChimp, so a patch will follow. Everything else seems to be in order.

Fri, Apr 19, 4:11 PM · Patch-For-Review, Traffic, DNS, SRE
ssingh closed T349314: cp3079 bios settings as Resolved.

We fixed it but forgot to close this task so resolving. Thanks @Dzahn!

Fri, Apr 19, 1:17 AM · DC-Ops, ops-esams, SRE, Traffic

Thu, Apr 18

ssingh added a comment to T362921: Authenticating wikimedia.org domain with MailChimp.

Discussed the goal here a bit with @EdErhart-WMF on Slack and will update this task later when there is more clarity.

Thu, Apr 18, 7:11 PM · Patch-For-Review, Traffic, DNS, SRE
ssingh closed T362731: Grant Access to 'wmf' ldap group for DErenrich to allow logstash access as Resolved.

This request is now merged; please re-open if there are any issues, thanks!

Thu, Apr 18, 6:00 PM · Patch-For-Review, SRE, LDAP-Access-Requests
ssingh added a member for WMF-NDA: derenrich.
Thu, Apr 18, 5:56 PM
ssingh closed T362812: Grant Access to ldap/wmf for kgraessle as Resolved.

@Kgraessle: The request has been merged; please let us know if there is any issue. Thanks!

Thu, Apr 18, 2:01 PM · Patch-For-Review, SRE, LDAP-Access-Requests
ssingh added a member for WMF-NDA: Kgraessle.
Thu, Apr 18, 1:58 PM
ssingh added a comment to T362421: magru network setup.

Thanks @jcrespo! I should have silenced the alert or restarted the service; both of those are in progress now so we should see this resolve soon.

Thu, Apr 18, 1:06 PM · Patch-For-Review, netops, SRE, Infrastructure-Foundations

Wed, Apr 17

ssingh added a comment to T362833: Improve HAProxy unexpected restart alert.

Thanks for the task! At least for now, I restarted haproxy so that we don't get this alert and we also don't leave it silenced in case the initial restart (below) was nothing more than a transient one.

Wed, Apr 17, 11:43 PM · Traffic
ssingh added a comment to T362812: Grant Access to ldap/wmf for kgraessle.

@DMburugu: this requires your approval, thanks!

Wed, Apr 17, 6:03 PM · Patch-For-Review, SRE, LDAP-Access-Requests
ssingh added a comment to T362731: Grant Access to 'wmf' ldap group for DErenrich to allow logstash access.

@NBaca-WMF: This needs your approval, thanks!

Wed, Apr 17, 5:44 PM · Patch-For-Review, SRE, LDAP-Access-Requests
ssingh closed T362533: Grant Access to Superset for aitolkyn as Resolved.

@Aitolkyn: I am marking this as resolved, but if that's not the case, please re-open it. Thanks!

Wed, Apr 17, 3:32 PM · Patch-For-Review, SRE, LDAP-Access-Requests
ssingh closed T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting as Resolved.

@Papaul deserves a lot of love for fixing this persistent issue. The 21.x firmware (specifically, Network_Firmware_YK81Y_WN64_21.60.22.11_03) worked in the first attempt when reimaging cp1114. I think we can consider this closed given we have observed the fix on two hosts now.

Wed, Apr 17, 3:21 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

@Papaul: Thanks for the update! Looks promising indeed, but to actually close this we should downgrade another host in eqiad and then try it out. What sometimes happens is that once a given host has reimaged successfully, it continues to reimage successfully for some period of time (we don't know how long, but at least the same day :).

Wed, Apr 17, 1:03 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
ssingh updated subscribers of T362729: Q4:rack/setup/install cp70[01-16].
Wed, Apr 17, 12:38 AM · Traffic, ops-magru, DC-Ops
ssingh added a comment to T362729: Q4:rack/setup/install cp70[01-16].

Thanks for the task @RobH! As in the previous runs, please feel free to leave these for Traffic:

Wed, Apr 17, 12:38 AM · Traffic, ops-magru, DC-Ops

Tue, Apr 16

ssingh reopened T362618: Grant Access to 'wmf' ldap group for Michael to allow logstash access as "Open".

Thanks @Muehlenhoff! And good to know @Michael that this is resolved; closing this task.

Tue, Apr 16, 3:47 PM · Patch-For-Review, SRE, LDAP-Access-Requests
ssingh closed T362618: Grant Access to 'wmf' ldap group for Michael to allow logstash access as Resolved.
Tue, Apr 16, 3:46 PM · Patch-For-Review, SRE, LDAP-Access-Requests
ssingh closed T362618: Grant Access to 'wmf' ldap group for Michael to allow logstash access as Resolved.

Added to wmf LDAP group (as well as Phabricator). Please try to access Logstash and let us know if there are any issues.

Tue, Apr 16, 2:22 PM · Patch-For-Review, SRE, LDAP-Access-Requests
ssingh added a member for WMF-NDA: Michael.
Tue, Apr 16, 2:10 PM
ssingh closed T362602: Requesting kerberos identity for Surbhi Gupta as Resolved.

Marking this as resolved; if kinit doesn't work for you or if there are any issues, please re-open this. Thanks!

Tue, Apr 16, 1:53 PM · Patch-For-Review, SRE-Access-Requests, SRE, Data-Engineering
ssingh added a comment to T362602: Requesting kerberos identity for Surbhi Gupta .

I have created the principal for Surbhi.

btullis@krb1001:~$ sudo sudo manage_principals.py get sg912
get_principal: Principal does not exist while retrieving "sg912@WIKIMEDIA".
btullis@krb1001:~$ sudo manage_principals.py create sg912 --email_address=sgupta@wikimedia.org
Principal successfully created. Make sure to update data.yaml in Puppet.
Successfully sent email to sgupta@wikimedia.org
Tue, Apr 16, 1:19 PM · Patch-For-Review, SRE-Access-Requests, SRE, Data-Engineering
ssingh updated the task description for T362602: Requesting kerberos identity for Surbhi Gupta .
Tue, Apr 16, 1:06 PM · Patch-For-Review, SRE-Access-Requests, SRE, Data-Engineering
ssingh updated the task description for T362602: Requesting kerberos identity for Surbhi Gupta .
Tue, Apr 16, 1:00 PM · Patch-For-Review, SRE-Access-Requests, SRE, Data-Engineering
ssingh added a comment to T328457: Grant all authenticated users access to SQL Lab in Superset.

@BTullis: Sorry if you are not the right person for this but it seems like we are having a related permissions issue in T362533 where @Aitolkyn is unable to use SQL Lab and I suspect that this is related. If not, please let me know. Thanks!

Tue, Apr 16, 12:14 PM · Data-Platform-SRE
ssingh added a comment to T362533: Grant Access to Superset for aitolkyn.

Thanks indeed @Urbanecm_WMF! Nice catch. @Aitolkyn: the contract expiry and date have been updated. If this has been resolved for you, please feel free to close the task.

Tue, Apr 16, 12:13 PM · Patch-For-Review, SRE, LDAP-Access-Requests

Mon, Apr 15

ssingh added a comment to T346722: Sao Paulo, Brazil, South America POP tracking task.

Can we open this task up now, or should we still keep it private? The subtasks will still be private if this is public -- can someone confirm? If yes, we should open this up so that we can use it for tracking, and if not, we should create a new public task.

Mon, Apr 15, 6:33 PM · ops-magru, Patch-For-Review
ssingh updated subscribers of T328457: Grant all authenticated users access to SQL Lab in Superset.

@BTullis: Sorry if you are not the right person for this but it seems like we are having a related permissions issue in T362533 where @Aitolkyn is unable to use SQL Lab and I suspect that this is related. If not, please let me know. Thanks!

Mon, Apr 15, 4:24 PM · Data-Platform-SRE
ssingh added a comment to T362533: Grant Access to Superset for aitolkyn.

@ssingh Thank you for checking! I get the following error when trying to access my tables:

mysql error: SELECT command denied to user 'research'@'10.67.30.187' for table `aitolkyn`.`domain_reverted_added_2019_2023`

Maybe I need to set anything before I can access those?

Mon, Apr 15, 2:39 PM · Patch-For-Review, SRE, LDAP-Access-Requests
ssingh added a comment to T362533: Grant Access to Superset for aitolkyn.

Hi @FNavas-foundation: @Aitolkyn should already have access to Superset as they are part of the analytics_privatedata_users group. @Aitolkyn, can you please confirm?

Mon, Apr 15, 2:21 PM · Patch-For-Review, SRE, LDAP-Access-Requests
ssingh placed T204993: Update certspotter up for grabs.
Mon, Apr 15, 1:51 PM · Traffic
ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

@ssingh one thing that I found that differs between the server NIC and the switch interface is the vendor. In eqiad, I checked 3 nodes (cp1115, 1113 and 1100) and all have W2W as the vendor under Transceiver inventory, while in esams the vendor is FS. Since @ayounsi mentioned this morning that the request was not reaching the switch, I focused on the media type used in esams and in eqiad. It looks like both connections are Direct Attach Copper but from different vendors.

What I would like to test next:

  • use a DAC from FS
  • just use a transceiver and connect a fiber

@Jclark-ctr @VRiley-WMF: when you are next on site, can you please find a DAC cable from FS.com and replace the cable on cp1115? If you have no DAC cable from FS.com, can you please use a 10G transceiver with a fiber?

Mon, Apr 15, 1:16 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad

Thu, Apr 11

ssingh updated the task description for T360430: esams text cp nvme upgrade.
Thu, Apr 11, 2:29 PM · Patch-For-Review, SRE, Traffic, ops-esams, DC-Ops
ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

@Papaul suggested trying a host in codfw, and cp2042 PXE booted successfully. In one of the above messages, @cmooney suggested looking at whether the new QFX5120 switches in esams/drmrs could be why it works in esams.

Thu, Apr 11, 2:03 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

Traffic reimaged 8 text nodes in esams and all of them PXE-booted the first time, without any issues. I think looking at why things worked flawlessly in esams but not in other sites such as eqiad and ulsfo is probably how we should try to get to the bottom of this ticket!

Thu, Apr 11, 1:41 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad

Wed, Apr 10

ssingh updated the task description for T360430: esams text cp nvme upgrade.
Wed, Apr 10, 7:24 PM · Patch-For-Review, SRE, Traffic, ops-esams, DC-Ops
ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

For cp1115 that we tried today, I downgraded the BIOS, NIC and iDRAC firmwares, to match what we have in esams, where 6/6 hosts have been reimaged without any issue (PXE-booting the first time).

Wed, Apr 10, 6:40 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad

Tue, Apr 9

ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

One more thing I will try to do is to successively try all NIC firmwares in 22.x instead of picking the highest supported version but we can't do multiple reimages in a day as it clears the caches and while one host being down is not a big deal, it's not ideal. So I will report back tomorrow.

Tue, Apr 9, 6:28 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

Some updates with the TL;DR that it is still failing for hosts in eqiad and ulsfo:

Tue, Apr 9, 6:23 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

cp4052 BIOS version 1.9.2 also didn't work; no PXE boot. I am going to focus on the install server now and see if we can pick up something there.

Tue, Apr 9, 1:43 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad

Mon, Apr 8

ssingh updated the task description for T360430: esams text cp nvme upgrade.
Mon, Apr 8, 3:39 PM · Patch-For-Review, SRE, Traffic, ops-esams, DC-Ops
ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

cp3069 also did PXE boot successfully, in the first attempt, so it makes it the fourth host in esams to not have any issue. I think maybe focusing on why it works in esams but not in eqiad/ulsfo might be the way forward.

Mon, Apr 8, 2:34 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad

Fri, Apr 5

ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

Continuing to try to isolate the possible causes of this, I noticed when dumping the facter output for the different hosts (the ones that work vs. the ones that don't) that the BIOS versions also seem to vary:

Fri, Apr 5, 3:19 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

We could also consider passing this over to Dell support?

Fri, Apr 5, 2:29 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

Any other opinions/thoughts on how we can try and fix this and where? I am very happy to do the legwork but kind of lost here on what to check next.

Yeah it's very odd alright. That pattern of firmware versions looked so promising - nice sleuthing all the same!

esams is working fine for us without any issues but ulsfo continues to be an issue and that uncertainty extends to magru as well.

100% clutching at straws here but I wonder if the type of switch is having any effect? esams and drmrs have newer QFX5120 switches (I guess small good news here is magru will have same setup as these locations). ulsfo and (most of) eqiad have older QFX5100 devices. I fail to see why that would make such a difference, but we're at the clutching at straws level so maybe?

Eqiad rows E and F, and codfw rows A and B, have the newer model switches too. So if we have the same problem in those places we can rule out it being anything to do with the switch.

Fri, Apr 5, 2:20 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad

Thu, Apr 4

ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

Any other opinions/thoughts on how we can try and fix this and where? I am very happy to do the legwork but kind of lost here on what to check next. The worry continues to be that magru is coming close and we should not be running the cookbook multiple times to get a reimage done. esams is working fine for us without any issues but ulsfo continues to be an issue and that uncertainty extends to magru as well.

Thu, Apr 4, 6:36 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

Update: I ran the firmware-upgrade cookbook on cp4052 and updated its firmware to 6.10.30.20, did a racreset to be absolutely sure, and it still failed for me on the first attempt. It seems my joy that this issue had been resolved was short-lived, so we move on and try to debug it more :)

Thu, Apr 4, 6:28 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

Traffic has been reimaging hosts in esams (we have done three so far for T360430) and we observed that we didn't have this issue on any of those hosts. Relatedly, we reimaged cp4052 last week where we did hit the issue again.

Thu, Apr 4, 5:17 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
ssingh created T361851: db2214 crashed.
Thu, Apr 4, 3:35 PM · SRE, ops-codfw, Patch-For-Review, DBA
ssingh added a comment to T358936: Kubernetes apiserver probe failures on restart.

This happened today as well, at 00:35 UTC, when we were paged for this:

Thu, Apr 4, 12:49 AM · Prod-Kubernetes, serviceops, SRE

Tue, Apr 2

ssingh added a comment to T360430: esams text cp nvme upgrade.

For posterity, an annotated Grafana dashboard that shows incoming traffic to esams after and during the depool and power-off events: https://grafana.wikimedia.org/goto/DbIJc7bSk?orgId=1. Note that the current TTL for dyna.wikimedia.org is five minutes, reduced in T140365. This is only for incoming traffic to Varnish, not the number of connections.

Tue, Apr 2, 1:57 PM · Patch-For-Review, SRE, Traffic, ops-esams, DC-Ops
ssingh updated the task description for T360430: esams text cp nvme upgrade.
Tue, Apr 2, 1:42 PM · Patch-For-Review, SRE, Traffic, ops-esams, DC-Ops

Thu, Mar 28

ssingh added a comment to T360982: Connection failed for a few minutes.

Hi, thanks for reporting this. The timing does seem to match an increase in 503s for this period, but it was transient, caused by an increase in the number of requests; it started at around 06:07 and seems to have resolved at around 06:10. As such, there isn't anything actionable on our end, but the error above is expected in the sense that there was an issue during that period.

Thu, Mar 28, 3:53 PM · Traffic

Mar 21 2024

ssingh added a comment to T360430: esams text cp nvme upgrade.

@RobH: Verified the hosts, serial numbers, racking and the cadence. Looks good!

Mar 21 2024, 1:28 PM · Patch-For-Review, SRE, Traffic, ops-esams, DC-Ops
ssingh added a comment to T360430: esams text cp nvme upgrade.

Hi Rob: Checking if the date/time above has been confirmed by remote hands?

Mar 21 2024, 1:07 AM · Patch-For-Review, SRE, Traffic, ops-esams, DC-Ops

Mar 19 2024

ssingh added a comment to T355811: Feature request: When cumin is running with -b (and -s), it should display the current host being affected.

For another data point: this can also be handy when you are running -b1 -s<something> and want to cancel the execution for any reason; if you know which host was currently being affected, it makes doing so a lot easier.
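
To illustrate the request, here is a minimal Python sketch of a batched runner that reports which hosts it is currently working on. It is not cumin's implementation or API: `run_in_batches`, `action` and `report` are hypothetical names, and `batch_size` and `sleep_seconds` only mirror the idea behind `-b` and `-s`.

```python
import time


def run_in_batches(hosts, action, batch_size=1, sleep_seconds=0.0, report=print):
    """Run action(host) over hosts in batches, announcing each batch first.

    Hypothetical sketch: announcing the batch before acting on it means an
    operator who interrupts the run knows which hosts were being affected.
    """
    done = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        # Announce the hosts in this batch before touching them.
        report(f"running on: {', '.join(batch)}")
        for host in batch:
            action(host)
            done.append(host)
        # Sleep between batches, but not after the last one.
        if start + batch_size < len(hosts) and sleep_seconds > 0:
            time.sleep(sleep_seconds)
    return done
```

With `batch_size=1`, every host is announced before it is touched, which is the case described above.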

Mar 19 2024, 5:46 PM · SRE, Cumin, Infrastructure-Foundations
ssingh added a comment to T360430: esams text cp nvme upgrade.

Rob, once the time/date is confirmed, please let me know here or on IRC and I will send an email to sre@. Thanks!

Mar 19 2024, 5:29 PM · Patch-For-Review, SRE, Traffic, ops-esams, DC-Ops
ssingh added a comment to T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work.

Even from itself? As in what happens if an operator runs authdns-update on a depooled host?

Mar 19 2024, 5:17 PM · Traffic
ssingh added a comment to T360430: esams text cp nvme upgrade.

Remote hands won't have any ability to power down a host other than by pressing the front power button. It would reduce potential complexity if we power down all the hosts for them to work on to prevent confusion. That way they know if it is powered off and matches the list, they can work on it.

Would that adjustment work, and traffic send a power off to those hosts in advance of the work?

Mar 19 2024, 3:41 PM · Patch-For-Review, SRE, Traffic, ops-esams, DC-Ops
ssingh updated subscribers of T360430: esams text cp nvme upgrade.

Thanks for creating the task. In some further discussion with @BBlack today, we decided that we will do the following:

Mar 19 2024, 3:29 PM · Patch-For-Review, SRE, Traffic, ops-esams, DC-Ops

Mar 7 2024

ssingh updated subscribers of T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps.
Mar 7 2024, 4:33 PM · Infrastructure-Foundations, SRE, Traffic

Mar 5 2024

ssingh updated subscribers of T359053: Reimage one of each Traffic hosts before magru.
Mar 5 2024, 6:32 PM · Traffic
ssingh updated subscribers of T359053: Reimage one of each Traffic hosts before magru.

@cmooney reimaged lvs2011 today and it seems to have gone fine. Sharing this here in case we want to check `lvs' in the list above.

Mar 5 2024, 6:32 PM · Traffic

Mar 4 2024

bking awarded T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work a Insectivore token.
Mar 4 2024, 8:14 PM · Traffic
KOfori awarded T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work a Party Time token.
Mar 4 2024, 7:04 PM · Traffic
ssingh closed T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work as Resolved.

We have finished rolling out the changes today, so all state management -- authdns-update, recdns, NTP (Debian installer), authdns-ns[0-2] -- is now managed via confd/confctl, for all DNS hosts.

Mar 4 2024, 6:56 PM · Traffic
ssingh changed the visibility for T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps.
Mar 4 2024, 6:12 PM · Infrastructure-Foundations, SRE, Traffic
ssingh added a project to T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps: Infrastructure-Foundations.
Mar 4 2024, 6:11 PM · Infrastructure-Foundations, SRE, Traffic
ssingh updated the task description for T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work.
Mar 4 2024, 6:02 PM · Traffic
ssingh triaged T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps as Medium priority.
Mar 4 2024, 5:46 PM · Infrastructure-Foundations, SRE, Traffic
ssingh edited projects for T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps, added: SRE; removed WMF-NDA.
Mar 4 2024, 5:46 PM · Infrastructure-Foundations, SRE, Traffic
ssingh updated the task description for T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps.
Mar 4 2024, 2:23 PM · Infrastructure-Foundations, SRE, Traffic
ssingh created T359054: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps.
Mar 4 2024, 2:23 PM · Infrastructure-Foundations, SRE, Traffic
ssingh updated subscribers of T359053: Reimage one of each Traffic hosts before magru.
Mar 4 2024, 2:10 PM · Traffic
ssingh created T359053: Reimage one of each Traffic hosts before magru.
Mar 4 2024, 2:10 PM · Traffic

Feb 29 2024

ssingh added a comment to T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work.

Update: We have merged the service depooling change on dns6001. This means that service depooling -- recdns, ntp, authdns-ns2 -- on dns6001, is now managed via confd.

Feb 29 2024, 7:07 PM · Traffic
ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

Thanks for sharing @ayounsi!

Feb 29 2024, 5:37 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad

Feb 28 2024

ssingh added a project to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting: Traffic.
Feb 28 2024, 10:00 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
ssingh added a comment to T350179: Reimage cookbook on new eqiad hosts stuck at PXE booting.

Hi folks: Just wondering if there is a path forward on this task as we hit the same issue last week while reimaging cp4052. No PXE boot during the first reimage attempt but it worked the second time.

Feb 28 2024, 9:59 PM · SRE, Traffic, SRE-swift-storage, ops-codfw, DC-Ops, ops-eqiad
ssingh added a comment to T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work.

Even from itself? As in what happens if an operator runs authdns-update on a depooled host?

The NAMESERVERS list that is populated by confd affects only the hosts to which we SSH and run authdns-update, not the host itself. So if you depool dns1004 and run authdns-update from there, nothing changes. If you run authdns-update from dns1005 (or anywhere else), it won't touch dns1004. On my end, I think this behaviour makes sense. But is it fine from your perspective of automation and cookbooks?

I'm more worried about the human side of the problem. In my experience people rely a lot on muscle memory and shell history, so they tend to always go to the same host to run a given command. So I think it would be very easy for someone to run authdns-update on dns1001 even if it is depooled, just because they're used to running it there.
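
The behaviour described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the actual authdns-update code: the function name and the host list are hypothetical, and it assumes the local host always updates itself while depooled hosts are skipped only as remote targets.

```python
def authdns_update_targets(local_host, all_hosts, depooled):
    """Return the hosts an authdns-update run from local_host would touch.

    Hypothetical model: the depool state only filters the remote hosts we
    would SSH to; the host running the command is always included.
    """
    remote = [h for h in all_hosts if h != local_host and h not in depooled]
    return [local_host] + remote


# Illustrative host list; dns1004 is depooled.
hosts = ["dns1004", "dns1005", "dns1006"]
print(authdns_update_targets("dns1004", hosts, {"dns1004"}))
print(authdns_update_targets("dns1005", hosts, {"dns1004"}))
```

Under this model, running from the depooled dns1004 still updates every host, while running from dns1005 leaves dns1004 untouched. That is the case the muscle-memory concern is about.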

Feb 28 2024, 6:35 PM · Traffic
ssingh added a comment to T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work.

Do we need to change anything on the sre.dns.netbox cookbook?
It currently runs:

cd {git} && utils/deploy-check.py -g {netbox} --deploy
Feb 28 2024, 6:31 PM · Traffic
ssingh added a comment to T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work.

Even from itself? As in what happens if an operator runs authdns-update on a depooled host?

Feb 28 2024, 6:29 PM · Traffic
ssingh added a comment to T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work.

Status as of today: we are now managing authdns-update state, that is, the list of hosts in /etc/wikimedia-authdns.conf. To depool a host so that it doesn't receive authdns updates (state, not service) you can do something like:

Feb 28 2024, 4:44 PM · Traffic

Feb 22 2024

ssingh added a comment to T358187: Connection errors from puppetmaster1002 to puppetdb.

When I looked at this last night as the alerts were coming in, I noticed that some hosts were not reporting the connection failure but simply the content error; example https://puppetboard.wikimedia.org/report/parse1004.eqiad.wmnet/ef81872fca86746fbbd87800da1da74b64d3839b.

Feb 22 2024, 2:59 PM · Infrastructure-Foundations, Puppet-Infrastructure, SRE

Feb 21 2024

ssingh triaged T358133: Security Issue Access Request for cdobbins as Medium priority.
Feb 21 2024, 6:03 PM · SecTeam-Processed, Security-Team, Security
ssingh created T358133: Security Issue Access Request for cdobbins.
Feb 21 2024, 6:02 PM · SecTeam-Processed, Security-Team, Security

Feb 20 2024

ssingh updated subscribers of T352744: OpenSSL 3.x performance issues.

Thanks to @MoritzMuehlenhoff, we have imported the forward port of OpenSSL 1.1.1 and have built haproxy 2.6 against it. We will be reimaging a cp host to bookworm.

Feb 20 2024, 4:17 PM · SRE-swift-storage, Traffic
ssingh updated subscribers of T352253: Decommission task for old cp hosts (cp1075-1090).

For further context: we have a request from @dr0ptp4kt for running a Blazegraph experiment and we are trying to free up a cp node for him. So we were wondering: if this hardware has still not been decommissioned, can we just bring up a host here?

Feb 20 2024, 4:02 PM · SRE, ops-eqiad, DC-Ops, Traffic
ssingh added a comment to T352253: Decommission task for old cp hosts (cp1075-1090).

Hi dc-ops team: quick question: have these hosts already been hardware decommissioned?

Feb 20 2024, 3:58 PM · SRE, ops-eqiad, DC-Ops, Traffic

Feb 16 2024

ssingh added a comment to T352744: OpenSSL 3.x performance issues.
10:01:40 < sukhe> !log reprepro -C component/haproxy26 include bookworm-wikimedia haproxy_2.6.16-1~bpo11+2_source.changes: T352744
Feb 16 2024, 3:04 PM · SRE-swift-storage, Traffic

Feb 13 2024

ssingh closed T140365: Lower geodns TTLs from 600 (10min) to 300 (5min) as Resolved.

We have rolled this out today. For a complete list of domains affected, see the commit above.

Feb 13 2024, 8:27 PM · Traffic, SRE
ssingh edited projects for T357436: Request donatewiki redirect, added: Fundraising-Backlog; removed Traffic.
Feb 13 2024, 6:36 PM · fundraising-tech-ops, Wikimedia-Apache-configuration, serviceops, Fundraising-Backlog, SRE
ssingh added a comment to T357449: ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify").

The cookbook doesn't reboot the host once in the Debian Installer, it's the Debian Installer that reboots the hosts once the base installation is completed.

Have you checked what was the status of the host during the installation?

Feb 13 2024, 6:04 PM · SRE-tools, Infrastructure-Foundations, Ganeti
ssingh added a comment to T357449: ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify").

One more data point: note that gnt-instance console FQDN is broken because of T309724 so we don't know the exact failure.

Feb 13 2024, 5:47 PM · SRE-tools, Infrastructure-Foundations, Ganeti

Feb 12 2024

ayounsi awarded T140365: Lower geodns TTLs from 600 (10min) to 300 (5min) a Like token.
Feb 12 2024, 7:22 AM · Traffic, SRE

Feb 9 2024

CDanis awarded T140365: Lower geodns TTLs from 600 (10min) to 300 (5min) a Love token.
Feb 9 2024, 7:24 PM · Traffic, SRE