
Retire WatchMouse (CA DX APP)
Closed, Resolved
Public

Assigned To
Authored By
lmata
Jan 13 2022, 5:11 PM

Description

The o11y team has discussed this internally and decided to sunset Watchmouse, since it does not appear to provide much value as an external monitoring tool. See T292603 for recent examples.

Historically, the tool was used for the public status page, to expose some KPIs for external availability, and as an external uptime checker. Additional background: T81454, T85829, T89877, T79416.

StatusPage effectively replaced Watchmouse recently (See T202061 and T285769), and the new stack is ready for production.

We also have enough redundancy in our existing Icinga and external testing infrastructure to sunset this tool without losing core functionality, while reducing the range of technology we have to support. In addition, we aim to improve external monitoring as part of our plans and roadmap for alerting. However, this quarter's decommission will not wait on the implementation of that new solution.

Checks currently defined in watchmouse / CA App Synthetic Monitor:

Note: "ops-critical-phone" is actually an email contact for "noc@" and "watchmouse@" (and watchmouse@ is further aliased to maint-announce@)

| Enabled (Y default) | Name | Tags | URL | Type | Contact | Replacement | Retirement Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| | API | | http://en.wikipedia.org/w/api.php | HTTP | ops-critical-phone | existing check in icinga | ✅ deactivated in watchmouse |
| | DNS | wiki_platform | wikipedia.org | DNS | ops-critical-phone | existing check in icinga | ✅ deactivated in watchmouse |
| | Dumps download | | https://dumps.wikimedia.org/backup-index.html | HTTPS | ops-non-critical-mail | added to watchrat | ✅ deactivated in watchmouse |
| | Gerrit | | https://gerrit.wikimedia.org/r/ | HTTPS | wikimedia ops | existing check in icinga | ✅ deactivated in watchmouse |
| | https content - commons | | https://commons.wikimedia.org/wiki/Main_Page | HTTPS | wikimedia ops | existing check in icinga | ✅ deactivated in watchmouse |
| | https services - commons | wiki_platform | https://commons.wikimedia.org/wiki/Main_Page | HTTPS | ops-critical-phone | existing check in icinga | ✅ deactivated in watchmouse |
| | https services - foundationwiki | wiki_platform | https://wikimediafoundation.org/wiki/Home | HTTPS | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| | https services - loginwiki | wiki_platform | https://login.wikimedia.org/wiki/Main_Page | HTTPS | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| | https services - mediawiki | wiki_platform | https://www.mediawiki.org/wiki/MediaWiki | HTTPS | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| | https services - wikibooks | wiki_platform | https://en.wikibooks.org/wiki/Main_Page | HTTPS | ops-critical-phone | existing check in icinga | ✅ deactivated in watchmouse |
| | https services - wikidata | wiki_platform | https://www.wikidata.org/wiki/Wikidata:Main_Page | HTTPS | ops-critical-phone | existing check in icinga | ✅ deactivated in watchmouse |
| | https services - wikinews | wiki_platform | https://en.wikinews.org/wiki/Main_Page | HTTPS | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| | https services - wikipedia | wiki_platform | https://en.wikipedia.org/wiki/Main_Page | HTTPS | ops-critical-phone | existing check in icinga | ✅ deactivated in watchmouse |
| | https services - wikiquote | wiki_platform | https://en.wikiquote.org/wiki/Main_Page | HTTPS | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| | https services - wikisource | wiki_platform | https://en.wikisource.org/wiki/Main_Page | HTTPS | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| | https services - wikiversity | wiki_platform | https://en.wikiversity.org/wiki/Main_Page | HTTPS | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| | https services - wikivoyage | wiki_platform | https://en.wikivoyage.org/wiki/Main_Page | HTTPS | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| | https services - wiktionary | wiki_platform | https://en.wiktionary.org/wiki/Main_Page | HTTPS | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| N | Icinga (disabled) | | http://icinga.wikimedia.org | HTTP | icinga | retire | ✅ deactivated in watchmouse |
| | Images & media (HTTPS) | wiki_platform | https://upload.wikimedia.org/monitoring/backend | HTTPS | ops-critical-phone | existing check in icinga | ✅ deactivated in watchmouse |
| | Images & media | wiki_platform | https://upload.wikimedia.org/monitoring/backend | HTTPS | ops-critical-phone | duplicate of above | ✅ deactivated in watchmouse |
| | IRC RecentChanges | | irc.wikimedia.org:6667 | CONNECT | ops-non-critical-mail | retire | ✅ deactivated in watchmouse |
| | Mail (SMTP) | public | mx1001.wikimedia.org | SMTP | ops-critical-phone | existing check in icinga | ✅ deactivated in watchmouse |
| | Mobile site | wiki_platform | http://en.m.wikipedia.org/wiki/Main_Page | HTTP | wikimedia ops | existing check in icinga | ✅ deactivated in watchmouse |
| | Phabricator | | https://phabricator.wikimedia.org/T2001 | HTTPS | ops-non-critical-mail | existing check in icinga | ✅ deactivated in watchmouse |
| | Static assets (CSS/JS) | wiki_platform | http://meta.wikimedia.org/w/resources/assets/poweredby_mediawiki_88x31.png | HTTP | wikimedia ops | retire | ✅ deactivated in watchmouse |
| | Static assets (HTTPS - CSS/JS) | wiki_platform | https://meta.wikimedia.org/w/resources/assets/poweredby_mediawiki_88x31.png | HTTPS | wikimedia ops | added to watchrat | ✅ deactivated in watchmouse |
| | Wiki commons (s4) | wiki_platform | http://commons.wikimedia.org/wiki/Main_Page | HTTP | ops-critical-phone | existing check in icinga | ✅ deactivated in watchmouse |
| N | Wiki commons (s4) - UNCACHED | wiki_platform | http://commons.wikimedia.org/wiki/Main_Page | HTTP | ops-critical-phone | retire | ✅ deactivated in watchmouse |
| | Wiki platform [[w:de:Main Page]] (s5) | wiki_platform | http://de.wikipedia.org/wiki/Main_Page | HTTP | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| N | Wiki platform [[w:de:Main Page]] (s5) - UNCACHED | wiki_platform | http://de.wikipedia.org/wiki/Main_Page | HTTP | ops-critical-phone | retire | ✅ deactivated in watchmouse |
| | Wiki platform [[w:dsb:Main Page]] (s3) | wiki_platform | http://dsb.wikipedia.org/wiki/Main_Page | HTTP | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| N | Wiki platform [[w:dsb:Main Page]] (s3) - UNCACHED | wiki_platform | http://dsb.wikipedia.org/wiki/Main_Page | HTTP | ops-critical-phone | retire | ✅ deactivated in watchmouse |
| | Wiki platform [[w:en:Main Page]] (s1) | wiki_platform | http://en.wikipedia.org/wiki/Main_Page | HTTP | ops-critical-phone | existing check in icinga | ✅ deactivated in watchmouse |
| N | Wiki platform [[w:en:Main Page]] (s1) - UNCACHED | wiki_platform | http://en.wikipedia.org/wiki/Main_Page | HTTP | ops-critical-phone | retire | ✅ deactivated in watchmouse |
| | Wiki platform [[w:en:Special:Random]] | wiki_platform | http://en.wikipedia.org/wiki/Special:Random | HTTP | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| | Wiki platform [[w:fi:Main Page]] (s2) | wiki_platform | http://fi.wikipedia.org/wiki/Main_Page | HTTP | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| N | Wiki platform [[w:fi:Main Page]] (s2) - UNCACHED | wiki_platform | http://fi.wikipedia.org/wiki/Main_Page | HTTP | ops-critical-phone | retire | ✅ deactivated in watchmouse |
| | Wiki platform [[w:fr:Main Page]] (s6) | wiki_platform | http://fr.wikipedia.org/wiki/Main_Page | HTTP | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| N | Wiki platform [[w:fr:Main Page]] (s6) - UNCACHED | wiki_platform | http://fr.wikipedia.org/wiki/Main_Page | HTTP | ops-critical-phone | retire | ✅ deactivated in watchmouse |
| | Wiki platform [[w:uk:Main Page]] (s7) | wiki_platform | http://uk.wikipedia.org/wiki/Main_Page | HTTP | ops-critical-phone | added to watchrat | ✅ deactivated in watchmouse |
| N | Wiki platform [[w:uk:Main Page]] (s7) - UNCACHED | wiki_platform | http://uk.wikipedia.org/wiki/Main_Page | HTTP | ops-critical-phone | retire | ✅ deactivated in watchmouse |
| | Wikimedia blog | | http://blog.wikimedia.org/ | HTTP | ops-non-critical-mail | existing check in icinga | ✅ deactivated in watchmouse |
| | wikimedia foundation mainpage | wiki_platform | http://wikimediafoundation.org/wiki/Home | HTTP | ops-critical-phone | retire (dupe of above) | ✅ deactivated in watchmouse |
| | donate http | | http://donate.wikimedia.org/ | HTTP | fundraising-critical | retire | ✅ deactivated in watchmouse |
| | donate https | | https://donate.wikimedia.org/ | HTTPS | fundraising-critical | added to watchrat (with email alert routing to fr-tech) | ✅ deactivated in watchmouse |
| | frdata http | | http://frdata.wikimedia.org/ | HTTP | fundraising-critical | retire | ✅ deactivated in watchmouse |
| | frdata https | | https://frdata.wikimedia.org/ | HTTPS | fundraising-critical | added to watchrat | ✅ deactivated in watchmouse |
| | payments https | | https://payments.wikimedia.org/index.php/Special:SystemStatus | HTTPS | fundraising-critical | added to watchrat | ✅ deactivated in watchmouse |
| | payments listener | | https://payments-listener.wikimedia.org/globalcollect | HTTPS | fundraising-critical | added to watchrat | ✅ deactivated in watchmouse |
| | banner load testing chrome8 | | http://en.wikipedia.org/wiki/List_of_collective_nouns | Full-Page | (none) | retire | ✅ deactivated in watchmouse |
| | banner load testing FF3.6 | | http://en.wikipedia.org/wiki/List_of_collective_nouns | Full-Page | (none) | retire | ✅ deactivated in watchmouse |
| | banner load testing ie6 | | http://en.wikipedia.org/wiki/List_of_collective_nouns | Full-Page | (none) | retire | ✅ deactivated in watchmouse |
| | banner load testing ie7 | | http://en.wikipedia.org/wiki/List_of_collective_nouns | Full-Page | (none) | retire | ✅ deactivated in watchmouse |
| | banner load testing ie8 | | http://en.wikipedia.org/wiki/List_of_collective_nouns | Full-Page | (none) | retire | ✅ deactivated in watchmouse |
| | banner load testing safari4 | | http://en.wikipedia.org/wiki/List_of_collective_nouns | Full-Page | (none) | retire | ✅ deactivated in watchmouse |
| | mr1-codfw OOB | | mr1-codfw.oob.wikimedia.org | PING | wikimedia ops | existing check in icinga | ✅ deactivated in watchmouse |
| | mr1-eqiad OOB | | mr1-eqiad.oob.wikimedia.org | PING | wikimedia ops | existing check in icinga | ✅ deactivated in watchmouse |
| | mr1-eqsin OOB | | mr1-eqsin.oob.wikimedia.org | PING | wikimedia ops | existing check in icinga | ✅ deactivated in watchmouse |
| | mr1-esams OOB | | mr1-esams.oob.wikimedia.org | PING | wikimedia ops | existing check in icinga | ✅ deactivated in watchmouse |
| | mr1-ulsfo OOB | | mr1-ulsfo.oob.wikimedia.org | PING | wikimedia ops | existing check in icinga | ✅ deactivated in watchmouse |
| | Ping offload text-lb.codfw | | text-lb.codfw.wikimedia.org | PING | wikimedia ops | existing check in icinga | ✅ deactivated in watchmouse |
| | Ping offload text-lb.eqiad | | text-lb.eqiad.wikimedia.org | PING | wikimedia ops | existing check in icinga | ✅ deactivated in watchmouse |
| | secure.wikimedia.org | | https://secure.wikimedia.org/wikipedia/en/wiki/Main_Page | HTTPS | ops-non-critical-mail | added to watchrat | ✅ deactivated in watchmouse |
| | shop.wikimedia.org | public | http://shop.wikimedia.org/ | HTTP | ops-non-critical-mail | retire | ✅ deactivated in watchmouse |
| | en wiki login -script | | | Script | ops-non-critical-mail | retire | ✅ deactivated in watchmouse |
| | icinga https port open | | icinga.wikimedia.org:443 | CONNECT | Chris Danis | retire | ✅ deactivated in watchmouse |
| | icinga-https | | https://icinga.wikimedia.org/ | HTTPS | Chris Danis | retire | ✅ deactivated in watchmouse |
| | wikimediafoundation.org - script | | | Script | ops-non-critical-mail | retire | ✅ deactivated in watchmouse |

High level todo list:

  • set up replacement static check capability via prometheus blackbox exporter (nicknamed "watchrat")
  • export/import relevant watchmouse checks, audit checks and move needed checks to watchrat config
  • enable alerting
  • disable watchmouse checks
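
For readers unfamiliar with the blackbox exporter, the static checks in the first todo item amount to asking the exporter to probe a URL and reading back its `probe_success` metric. Below is a minimal sketch of that interaction, not the actual watchrat configuration. The exporter address and the helper names are hypothetical. `http_2xx` is the module name used in the exporter's example config, and whether watchrat uses that module is an assumption.

```python
from urllib.parse import urlencode

# Hypothetical exporter address; 9115 is the blackbox exporter's default port.
EXPORTER = "http://localhost:9115"


def probe_url(target: str, module: str = "http_2xx", exporter: str = EXPORTER) -> str:
    """Build the /probe URL that asks the exporter to check one target.

    The module name is an assumption taken from the exporter's example config.
    """
    query = urlencode({"module": module, "target": target})
    return f"{exporter}/probe?{query}"


def probe_succeeded(metrics_text: str) -> bool:
    """Return True if the exporter's text output reports probe_success as 1."""
    for line in metrics_text.splitlines():
        # Comment lines start with '#', so only the sample line matches here.
        if line.startswith("probe_success "):
            return float(line.split()[1]) == 1.0
    # No probe_success sample at all is treated as a failed probe.
    return False


if __name__ == "__main__":
    # One of the URLs the table above lists as "added to watchrat".
    print(probe_url("https://donate.wikimedia.org/"))
```

In practice Prometheus scrapes the `/probe` endpoint itself; the helpers here only show what a single probe request and its result look like.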

Event Timeline

After an internal discussion with the team, we are aiming for a target retirement at the end of Jan 2022.

Change 747550 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] prometheus: add blackbox generic "watchrat" http/s static check support

https://gerrit.wikimedia.org/r/747550

Change 747550 merged by Herron:

[operations/puppet@production] prometheus: add blackbox generic "watchrat" http/s static check support

https://gerrit.wikimedia.org/r/747550

Change 759297 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] watchrat: check URLs from watchmouse not already covered by icinga

https://gerrit.wikimedia.org/r/759297

herron updated the task description.
herron updated the task description.

Change 759302 had a related patch set uploaded (by Herron; author: Herron):

[operations/alerts@master] initial sketch of watchrat alert

https://gerrit.wikimedia.org/r/759302

Change 759297 merged by Herron:

[operations/puppet@production] watchrat: check URLs from watchmouse not already covered by icinga

https://gerrit.wikimedia.org/r/759297

Change 759302 merged by Herron:

[operations/alerts@master] watchrat: add http probe alerting with warning severity

https://gerrit.wikimedia.org/r/759302

herron updated the task description.

Change 761064 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] watchrat: route alerts to irc and noc@

https://gerrit.wikimedia.org/r/761064

Change 761403 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] watchrat: route donate.wm.o alerts to fr-ircmail

https://gerrit.wikimedia.org/r/761403

Change 761064 merged by Herron:

[operations/puppet@production] watchrat: route alerts to irc and noc@

https://gerrit.wikimedia.org/r/761064

Change 761403 merged by Herron:

[operations/puppet@production] watchrat: route donate.wm.o alerts to fr-ircmail

https://gerrit.wikimedia.org/r/761403

herron triaged this task as Medium priority. Feb 10 2022, 5:43 PM
herron updated the task description.

Change 761715 had a related patch set uploaded (by Herron; author: Herron):

[operations/puppet@production] watchrat: add shop.wm.o to url list

https://gerrit.wikimedia.org/r/761715

Change 761715 merged by Herron:

[operations/puppet@production] watchrat: add shop.wm.o to url list

https://gerrit.wikimedia.org/r/761715

herron claimed this task.
herron subscribed.

All checks are now deactivated, resolving!

Change 771009 had a related patch set uploaded (by Herron; author: Herron):

[operations/alerts@master] watchrat: require 3+ sites to agree on error status before alerting

https://gerrit.wikimedia.org/r/771009

Change 771009 merged by jenkins-bot:

[operations/alerts@master] watchrat: require 3+ sites to agree on error status before alerting

https://gerrit.wikimedia.org/r/771009
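
The last change above requires three or more sites to agree on an error before alerting. The merged alert rule is not reproduced in this task, so the following is only a sketch of that quorum idea in Python. The function name and the input shape are hypothetical; the site names are the ones that appear in the checks table.

```python
from typing import Mapping


def should_alert(failed_by_site: Mapping[str, bool], min_sites: int = 3) -> bool:
    """Return True when at least `min_sites` probing sites report a failure.

    `failed_by_site` maps a site name to True if that site's probe of the
    target failed. A single site seeing an error does not trigger an alert.
    """
    failing = sum(1 for failed in failed_by_site.values() if failed)
    return failing >= min_sites


if __name__ == "__main__":
    # Two sites failing: likely a local network issue, so no alert.
    print(should_alert({"eqiad": True, "codfw": True, "esams": False,
                        "ulsfo": False, "eqsin": False}))
    # Three sites failing: enough agreement to alert.
    print(should_alert({"eqiad": True, "codfw": True, "esams": True,
                        "ulsfo": False, "eqsin": False}))
```

Requiring agreement across sites trades a slightly slower alert for fewer false positives caused by connectivity problems at one probing location.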