
Alert for snapshot101[4567] not in mediawiki-installation dsh group
Closed, Resolved (Public)

Description

These hosts show up as alerting from time to time; I'm not sure what's going on with them.

summary: Host snapshot1017 is not in mediawiki-installation dsh group
summary: Host snapshot1016 is not in mediawiki-installation dsh group
summary: Host snapshot1015 is not in mediawiki-installation dsh group
summary: Host snapshot1014 is not in mediawiki-installation dsh group

https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3Dmediawiki-installation%20DSH%20group

Event Timeline

Given that these are snapshot hosts, I think that they are related to the XML Dumps architecture.

The best project tag is therefore probably Dumps-Generation
The most knowledgeable person is likely to be @ArielGlenn - but I believe that the Data Products team is also increasingly involved in the support of this product.
cc: @Milimetric, @xcollazo, @JEbe-WMF

For SRE related support, it will be the Data-Platform-SRE team who gets involved.
Finally, @WDoranWMF and @VirginiaPoundstone would also like to be notified about any support work related to this.

I'm not immediately sure what the check_dsh_groups is needed for, but I can help do more investigation if required.

Thank you for the extensive info @BTullis !

AFAICT the check makes sure we don't leave hosts out of the dsh groups (though the check itself will eventually go away with mw on k8s). It seems to me that the immediate action is to keep hieradata/common/scap/dsh.yaml updated when snapshot hosts are (de)commissioned.

taavi renamed this task from Alert for snapshot100[4567] not in mediawiki-installation dsh group to Alert for snapshot101[4567] not in mediawiki-installation dsh group. Sep 8 2023, 12:25 PM
taavi added a subscriber: taavi.

I'm not immediately sure what the check_dsh_groups is needed for, but I can help do more investigation if required.

That check ensures that hosts with the production mediawiki installation are in the list of hosts to push updates to.
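As a rough illustration of what such a check amounts to, here is a minimal sketch in Python. It is not the actual check_dsh_groups implementation: the function names and the example host mw1001 are hypothetical, and the real check reads its host lists from the production configuration rather than from literals. Only the summary wording is taken from the alerts quoted in the description.

```python
def hosts_missing_from_group(mediawiki_hosts, dsh_group_hosts):
    """Return hosts that have MediaWiki installed but are absent from the dsh group.

    Both arguments are iterables of hostnames; the result is sorted so the
    output is stable between runs.
    """
    return sorted(set(mediawiki_hosts) - set(dsh_group_hosts))


def alert_summaries(missing_hosts, group="mediawiki-installation"):
    """Format one summary line per missing host, mirroring the alert text."""
    return [f"Host {host} is not in {group} dsh group" for host in missing_hosts]


if __name__ == "__main__":
    # Hypothetical inputs: mw1001 is already in the group, the snapshot hosts are not.
    installed = ["mw1001", "snapshot1014", "snapshot1015"]
    in_group = ["mw1001"]
    for line in alert_summaries(hosts_missing_from_group(installed, in_group)):
        print(line)
```

Under these assumptions, a host stops alerting as soon as it appears in the group's host list, which is what the eventual fix relies on.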

Change 955931 had a related patch set uploaded (by Btullis; author: Btullis):

[operations/puppet@production] Add snapshot101[4-7] to the dsh group

https://gerrit.wikimedia.org/r/955931
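The task title writes the hosts as snapshot101[4567] and the patch title as snapshot101[4-7]; both are informal shorthand for the same four hosts. A small sketch of how the two spellings expand, assuming a single bracket group of single digits (the expand_hosts helper is hypothetical and is not a tool used in this task):

```python
import re


def expand_hosts(pattern):
    """Expand shorthand like 'snapshot101[4567]' or 'snapshot101[4-7]' into hostnames.

    Assumes at most one bracket group, containing either a list of single
    characters or a single-digit range written as 'a-b'.
    """
    match = re.fullmatch(r"(.*)\[([^\]]+)\](.*)", pattern)
    if match is None:
        # No bracket group: the pattern is already a plain hostname.
        return [pattern]
    prefix, body, suffix = match.groups()
    digit_range = re.fullmatch(r"(\d)-(\d)", body)
    if digit_range:
        start, end = int(digit_range.group(1)), int(digit_range.group(2))
        parts = [str(n) for n in range(start, end + 1)]
    else:
        parts = list(body)
    return [f"{prefix}{part}{suffix}" for part in parts]


if __name__ == "__main__":
    print(expand_hosts("snapshot101[4-7]"))
```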

Thanks @fgiunchedi and @taavi - I have created a patch to add them to the correct group and added a few reviewers from other teams for added visibility.

It's interesting that you say they only pop up from time to time on alertmanager. I would have thought that the alert would be constant until the hosts are added to the correct group, but I'm not going to dive too deeply into it right now.

The ops-dumps email alias ought to get notified about things like this; that way all the right people will see it.

Change 955931 merged by Btullis:

[operations/puppet@production] Add snapshot101[4-7] to the mediawiki-installation dsh group

https://gerrit.wikimedia.org/r/955931

BTullis claimed this task.
BTullis moved this task from Incoming to Done on the Data-Platform-SRE board.

I have merged this patch, so I believe that we can close this.