Oh hi.
User Details
- User Since
- Dec 11 2018, 9:39 PM (280 w, 1 d)
- Availability
- Available
- IRC Nick
- sukhe
- LDAP User
- Unknown
- MediaWiki User
- SSingh (WMF) [ Global Accounts ]
Tue, Apr 23
Fri, Apr 19
Thanks @cmooney, looks good! One small update to the above since we will most likely transpose these to hieradata/common/lvs/interfaces.yaml: 10.140.1.3/24 is private1-b4-magru like 10.140.0.2/24 is private1-b3-magru (and not just private-b4-magru).
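As a rough sketch, the corresponding entries in hieradata/common/lvs/interfaces.yaml might then look like this; the structure below is illustrative only, not the file's actual schema:

```yaml
# Illustrative only: subnet names following the convention above
private1-b3-magru: 10.140.0.2/24
private1-b4-magru: 10.140.1.3/24
```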
@Lina_Farid_WMDE: to speed things up, you can also send an email to @KFrancis ( https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) ; kfrancis@wikimedia.org) from your email, requesting an NDA. Thanks!
Hi @KFrancis: @Lina_Farid_WMDE will require an NDA as well, as I don't see their name on the spreadsheet. Thank you as always!
Update: we will need to add a DKIM record for MailChimp, so a patch will follow. Everything else seems to be in order.
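For context, Mailchimp DKIM is typically set up as a CNAME under `_domainkey` pointing at Mailchimp's key host; a hedged sketch of what such a zone change might look like (the selector name and domain below are illustrative defaults from Mailchimp's docs, not the actual patch):

```
; Illustrative only: Mailchimp-style DKIM delegation via CNAME
k1._domainkey.example.org. IN CNAME dkim.mcsv.net.
```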
We fixed it but forgot to close this task so resolving. Thanks @Dzahn!
Thu, Apr 18
Discussed a bit with @EdErhart-WMF on what the goal is here on Slack and will update this task later when there is more clarity.
This request is now merged; please re-open if there are any issues, thanks!
@Kgraessle: The request has been merged; please let us know if there is any issue. Thanks!
Thanks @jcrespo! I should have silenced the alert or restarted the service; both of those are in progress now so we should see this resolve soon.
Wed, Apr 17
Thanks for the task! At least for now, I restarted haproxy so that we don't get this alert and also don't leave it silenced, in case the initial restart (below) was nothing more than a transient one.
@DMburugu: this requires your approval, thanks!
@NBaca-WMF: This needs your approval, thanks!
@Aitolkyn: I am marking this as resolved, but if that's not the case, please re-open it. Thanks!
@Papaul deserves a lot of love for fixing this persistent issue. The 21.x firmware (specifically, Network_Firmware_YK81Y_WN64_21.60.22.11_03) worked in the first attempt when reimaging cp1114. I think we can consider this closed given we have observed the fix on two hosts now.
@Papaul: Thanks for the update! Looks promising indeed, and to actually close this, we should downgrade another host in eqiad and then try it out. Sometimes, if a given host reimages successfully once, it continues to reimage successfully for some period of time afterwards (we don't know how long that is, but at least the same day :).
Thanks for the task @RobH! As in the previous runs, please feel free to leave these for Traffic:
Tue, Apr 16
Thanks @Muehlenhoff! And good to know @Michael that this is resolved; closing this task.
Added to wmf LDAP group (as well as Phabricator). Please try to access Logstash and let us know if there are any issues.
Marking this as resolved; if kinit doesn't work for you or if there are any issues, please re-open this. Thanks!
Thanks indeed @Urbanecm_WMF! Nice catch. @Aitolkyn: the contract expiry and date have been updated. If this has been resolved for you, please feel free to resolve the task.
Mon, Apr 15
Can we open this task up now, or should we still keep it private? The subtasks will still be private if this is public -- can someone confirm? If yes, we should open this up so that we can use it for tracking; if not, we should create a new public task.
Hi @FNavas-foundation: @Aitolkyn should already have access to Superset as they are part of the analytics_privatedata_users group. @Aitolkyn, can you please confirm?
Thu, Apr 11
Traffic reimaged 8 text nodes in esams and all of them PXE-booted the first time, without any issues. I think looking at why things worked flawlessly in esams but not in other sites such as eqiad and ulsfo is probably how we should try to get to the bottom of this ticket!
Wed, Apr 10
For cp1115 that we tried today, I downgraded the BIOS, NIC and iDRAC firmwares, to match what we have in esams, where 6/6 hosts have been reimaged without any issue (PXE-booting the first time).
Tue, Apr 9
One more thing I will try is successively testing all NIC firmwares in 22.x instead of picking the highest supported version. But we can't do multiple reimages in a day, as each one clears the caches, and while one host being down is not a big deal, it's not ideal. So I will report back tomorrow.
Some updates with the TL;DR that it is still failing for hosts in eqiad and ulsfo:
cp4052 BIOS version 1.9.2 also didn't work; no PXE boot. I am going to focus on the install server now and see if we can pick up something there.
Mon, Apr 8
cp3069 also PXE-booted successfully on the first attempt, making it the fourth host in esams without any issue. I think focusing on why it works in esams but not in eqiad/ulsfo might be the way forward.
Fri, Apr 5
Continuing to try to isolate the possible causes of this, I noticed when diffing the facter output between the different hosts (ones that work vs. ones that don't) that the BIOS versions also seem to vary:
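(The actual output isn't shown here.) As an illustration, a comparison like the one described could be produced along these lines; the hostnames and the exact fact key are assumptions, with `dmi.bios.version` being the structured BIOS fact in Facter 3+:

```
# Illustrative: dump the BIOS version fact on a working vs. a failing host, then diff.
# cp3066 (esams, works) and cp4052 (ulsfo, fails) are example hosts only.
ssh cp3066.esams.wmnet 'sudo facter -p dmi.bios.version' > /tmp/bios-working
ssh cp4052.ulsfo.wmnet 'sudo facter -p dmi.bios.version' > /tmp/bios-failing
diff /tmp/bios-working /tmp/bios-failing
```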
Thu, Apr 4
Any other opinions/thoughts on how we can try and fix this, and where? I am very happy to do the legwork but am kind of lost on what to check next. The worry continues to be that magru is coming up soon and we should not have to run the cookbook multiple times to get a reimage done. esams is working fine for us without any issues, but ulsfo continues to be a problem, and that uncertainty extends to magru as well.
Update: I ran the firmware-upgrade cookbook on cp4052 and updated its firmware to 6.10.30.20, did a racreset to be absolutely sure, and it still failed for me on the first attempt. It seems my joy that this issue had been resolved was short-lived, so we move on and try to debug it more :)
Traffic has been reimaging hosts in esams (we have done three so far for T360430) and we observed that we didn't have this issue on any of those hosts. Relatedly, we reimaged cp4052 last week where we did hit the issue again.
This happened today as well, at 00:35 UTC, when we were paged for this:
Tue, Apr 2
For posterity, an annotated Grafana dashboard that shows incoming traffic to esams after and during the depool and power-off events: https://grafana.wikimedia.org/goto/DbIJc7bSk?orgId=1. Note that the current TTL for dyna.wikimedia.org is five minutes, reduced in T140365. This is only for incoming traffic to Varnish, not the number of connections.
Thu, Mar 28
Hi, thanks for reporting this. The timing does seem to match an increase in 503s for this period, but it was transient, due to an increase in the number of requests, and seems to have resolved around ~06:10, with a start time of around ~06:07. As such, there isn't anything actionable on our end; the error above is expected in the sense that it was a real issue for that period.
Mar 21 2024
@RobH: Verified the hosts, serial numbers, racking and the cadence. Looks good!
Hi Rob: Checking if the date/time above has been confirmed by remote hands?
Mar 19 2024
For another data point: this can also be handy when you are running -b1 -s<something> and want to cancel the execution for any reason; if you know which host is currently being affected, it makes doing so a lot easier.
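As an illustration, a batched run like the one described might look like this with cumin (-b sets the batch size, -s the sleep between batches); the host alias and command here are hypothetical:

```
# Illustrative: one host at a time, sleeping 30s between batches.
# Ctrl-C during the sleep aborts before the next host is touched;
# knowing which host is currently in-flight makes that call easier.
sudo cumin -b1 -s30 'A:cp-esams' 'run-puppet-agent -q'
```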
Rob, once the time/date is confirmed, please let me know here or on IRC and I will send an email to sre@. Thanks!
Thanks for creating the task. In some further discussion with @BBlack today, we decided that we will do the following:
Mar 7 2024
Mar 5 2024
@cmooney reimaged lvs2011 today and it seems to have gone fine. Sharing this here in case we want to check off `lvs` in the list above.
Mar 4 2024
We have finished rolling out the changes today, so all state management -- authdns-update, recdns, NTP (Debian installer), authdns-ns[0-2] -- is now managed via confd/confctl, for all DNS hosts.
Feb 29 2024
Update: We have merged the service depooling change on dns6001. This means that service depooling -- recdns, ntp, authdns-ns2 -- on dns6001, is now managed via confd.
Thanks for sharing @ayounsi!
Feb 28 2024
Hi folks: Just wondering if there is a path forward on this task as we hit the same issue last week while reimaging cp4052. No PXE boot during the first reimage attempt but it worked the second time.
Status as of today: we are now managing authdns-update state, that is, the list of hosts in /etc/wikimedia-authdns.conf. To depool a host so that it doesn't receive authdns updates (state, not service) you can do something like:
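(The actual command isn't shown in this feed.) A hedged sketch of what such a depool might look like with confctl; the selector, object name, and hostname below are illustrative assumptions, not verified against the repo:

```
# Illustrative: stop a host from receiving authdns updates (state, not service)
sudo confctl select 'name=dns1004.wikimedia.org' set/pooled=no

# And to pool it back in afterwards:
sudo confctl select 'name=dns1004.wikimedia.org' set/pooled=yes
```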
Feb 22 2024
When I looked at this last night as the alerts were coming in, I noticed that some hosts were not reporting the connection failure but simply the content error; example https://puppetboard.wikimedia.org/report/parse1004.eqiad.wmnet/ef81872fca86746fbbd87800da1da74b64d3839b.
Feb 21 2024
Feb 20 2024
Thanks to @MoritzMuehlenhoff, we have imported the forward port of OpenSSL 1.1.1 and have built haproxy 2.6 against it. We will be reimaging a cp host to bookworm.
For further context: we have a request from @dr0ptp4kt to run a Blazegraph experiment and we are trying to free up a cp node for him. So we were wondering whether, if this hardware has not yet been decommissioned, we can just bring up a host here.
Hi dc-ops team: quick question: have these hosts already been hardware decommissioned?
Feb 16 2024
10:01:40 < sukhe> !log reprepro -C component/haproxy26 include bookworm-wikimedia haproxy_2.6.16-1~bpo11+2_source.changes: T352744
Feb 13 2024
We have rolled this out today. For a complete list of domains affected, see the commit above.
One more data point: note that `gnt-instance console <FQDN>` is broken because of T309724, so we don't know the exact failure.