Oh hi.

User Details
- User Since: Dec 11 2018, 9:39 PM
- Availability: Available
- IRC Nick: sukhe
- LDAP User: Unknown
- MediaWiki User: SSingh (WMF)
Mon, Jun 24
This happened again today, starting at 20:05, and resulted in essentially the same issues we observed over the weekend and in the previous incident. I have not identified the query and will leave that to people who know better, but the steps we took to resolve this were:
Thu, Jun 6
Moving the links worked out well (I think this is the first time?), which is a big takeaway from this task; glad to hear it went nicely!
I went ahead and created silence ID 71789606-ffdc-4244-a12e-d6344b1f1ab9. Sorry for being bold; please remove if required.
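For reference, a silence like this can also be created programmatically against the Alertmanager v2 API; a minimal sketch of the payload (the alert name, dates, and comment here are made up for illustration):

```json
{
  "matchers": [
    { "name": "alertname", "value": "ExampleAlert", "isRegex": false }
  ],
  "startsAt": "2024-06-06T00:00:00Z",
  "endsAt": "2024-06-13T00:00:00Z",
  "createdBy": "sukhe",
  "comment": "Silencing while the underlying issue is investigated"
}
```

POSTing this to the /api/v2/silences endpoint returns the silence ID.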
Tue, Jun 4
On investigation, we found that (cp7001):
Mon, Jun 3
Thanks for the great analysis, Chris!
Fri, May 31
To clarify, there is no change to the configuration of the DNS hosts themselves and the peer list there. This is only for the consumers of P:systemd::timesyncd (which we don't use on the DNS hosts).
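For anyone unfamiliar with the profile: consumers of P:systemd::timesyncd get a peer list rendered into /etc/systemd/timesyncd.conf, roughly along these lines (the hostnames here are illustrative, not the actual peer list):

```ini
; /etc/systemd/timesyncd.conf (sketch of the Puppet-managed output)
[Time]
NTP=ntp1001.example.wmnet ntp2001.example.wmnet
```

The DNS hosts don't use this profile, which is why the change does not touch them.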
May 22 2024
Hi @jhathaway: I wanted to get your input about this. The request here is to add a DKIM record for wikimedia.org so that learn.wiki can allow sending email from comdevteam@wikimedia.org. The CNAMEs above look fine so there are no concerns with that.
If you want to send from the wikimedia.org domain, then yes, that's what you will need. There is no concern with the records themselves.
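For illustration, a DKIM delegation of this kind is usually a CNAME from a selector under _domainkey to a record the email provider hosts; a sketch with a made-up selector and provider name:

```
; hypothetical zone fragment, not the actual record
selector1._domainkey.wikimedia.org.  IN  CNAME  selector1.dkim.example-esp.com.
```

The provider then publishes the actual DKIM public key as a TXT record at the CNAME target, so key rotation happens on their side without further zone changes.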
May 21 2024
Some numbers from cp7001, with the usual caveats around measuring this with openssl speed and commenting on this simply to understand the above-mentioned performance differences:
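The numbers came from openssl speed; a sketch of the kind of invocation involved (the cipher choice here is an assumption for illustration):

```shell
# Measure AES-GCM throughput via the EVP interface, which enables
# AES-NI hardware acceleration where available.
openssl speed -seconds 1 -evp aes-256-gcm

# Same cipher with AES-NI masked off via OPENSSL_ia32cap, to see
# the software-only baseline for comparison.
OPENSSL_ia32cap="~0x200000200000000" openssl speed -seconds 1 -evp aes-256-gcm
```

The usual caveat applies: openssl speed measures the crypto primitive in isolation, not end-to-end TLS performance.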
The CNAME here specifies wikimedia.org, but I think it should be learn.wiki. So instead of:
May 17 2024
Host is depooled:
May 16 2024
Thanks to @cmooney for rolling the above out. For further context, we (Traffic and netops) decided to try out the anycast range in magru for the Wikidough service before doing it for more critical things, such as announcing the ns2.wikimedia.org IP from magru. The timeline on that is still TBD, but this is a nice test to make sure that the configuration was correct.
May 15 2024
This has been alerting for a while and in general generates many alerts on the -ops channel. If this is expected in some way (I have no idea and I haven't looked!), can it at least be silenced? To be clear, this is not paging or anything, so that's fine.
May 9 2024
We have merged this change today and are serving the /.well-known/traffic-advice file with the content noted in the documentation to disable this feature.
$ curl https://en.wikipedia.org/.well-known/traffic-advice
[{ "user_agent": "prefetch-proxy", "disallow": true }]
May 8 2024
On mr1-magru, I see 10.140.1.18 (prometheus7001) and denied by policy, which makes me wonder if we need to run https://netbox.wikimedia.org/extras/scripts/capirca.GetHosts/ for Capirca to generate the ACL?
On https://librenms.wikimedia.org/alerts, I see the following for mr1-magru:
Apr 30 2024
OK, so I finally found why this is failing. For a reason that I don't fully understand, the hieradata/magru/ directory actually needs to exist for the lookup() against hieradata/magru.yaml to work. See the commit that fixes this (and note the commit message!):
Function lookup() did not find a value for the name 'prometheus_nodes'
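In other words, the tree needs both the file and the directory for the site-level lookup to resolve; roughly (the value here is illustrative):

```yaml
# hieradata/magru.yaml -- the file lookup() reads
prometheus_nodes:
  - prometheus7001.magru.wmnet   # illustrative hostname
```

with a hieradata/magru/ directory (even an empty one) alongside it; without the directory, lookup() fails with the error above.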
Apr 29 2024
Sorry, I forgot to reply to this!
Apr 28 2024
The failure above is:
Apr 19 2024
Thanks @cmooney, looks good! One small update to the above since we will most likely transpose these to hieradata/common/lvs/interfaces.yaml: 10.140.1.3/24 is private1-b4-magru like 10.140.0.2/24 is private1-b3-magru (and not just private-b4-magru).
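For clarity, the transposed entries in hieradata/common/lvs/interfaces.yaml would then map names to prefixes something like this (a sketch; the actual layout of that file may differ):

```yaml
# hieradata/common/lvs/interfaces.yaml (sketch)
private1-b3-magru: 10.140.0.2/24
private1-b4-magru: 10.140.1.3/24
```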
@Lina_Farid_WMDE: to speed things up, you can also send an email to @KFrancis ( https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) ; kfrancis@wikimedia.org ) from your email, requesting an NDA. Thanks!
Hi @KFrancis: @Lina_Farid_WMDE will require an NDA as well, as I don't see their name on the spreadsheet. Thank you as always!
Update is that we will need to add a DKIM record for MailChimp, so a patch will follow. Everything else seems to be in order.
We fixed it but forgot to close this task so resolving. Thanks @Dzahn!
Apr 18 2024
Discussed a bit with @EdErhart-WMF on what the goal is here on Slack and will update this task later when there is more clarity.
This request is now merged; please re-open if there are any issues, thanks!
@Kgraessle: The request has been merged; please let us know if there is any issue. Thanks!
Thanks @jcrespo! I should have silenced the alert or restarted the service; both of those are in progress now so we should see this resolve soon.
Apr 17 2024
Thanks for the task! At least for now, I restarted haproxy so that we don't get this alert, and so we don't leave it silenced in case the initial restart (below) was nothing more than a transient one.
@DMburugu: this requires your approval, thanks!
@NBaca-WMF: This needs your approval, thanks!
@Aitolkyn I am marking this as resolved but if that's not the case, please re-open it again thanks!
@Papaul deserves a lot of love for fixing this persistent issue. The 21.x firmware (specifically, Network_Firmware_YK81Y_WN64_21.60.22.11_03) worked on the first attempt when reimaging cp1114. I think we can consider this closed, given that we have observed the fix on two hosts now.
@Papaul: Thanks for the update! Looks promising indeed, and to actually close this, we should downgrade another host in eqiad and then try it out. What sometimes happens is that if a given host reimages successfully once, it continues to reimage successfully for some period of time (we don't know how long, but at least the same day :).
Thanks for the task @RobH! As in the previous runs, please feel free to leave these for Traffic:
Apr 16 2024
Thanks @Muehlenhoff! And good to know @Michael that this is resolved; closing this task.
Added to wmf LDAP group (as well as Phabricator). Please try to access Logstash and let us know if there are any issues.
Marking this as resolved; if kinit doesn't work for you or if there are any issues, please re-open this. Thanks!
Thanks indeed @Urbanecm_WMF! Nice catch. @Aitolkyn: the contract expiry date has been updated. If this has been resolved for you, please feel free to close the task.
Apr 15 2024
Can we open this task up now, or should we still keep it private? The subtasks will still be private if this is made public -- can someone confirm? If yes, we should open this up so that we can use it for tracking; if not, we should create a new public task.
Hi @FNavas-foundation: @Aitolkyn should already have access to Superset as they are part of the analytics_privatedata_users group. @Aitolkyn, can you please confirm?