
Create an alert based on index age for reindexing Commons and Wikidata
Closed, Resolved · Public · 3 Estimated Story Points

Description

User Story:

  • As a searcher on Commons or Wikidata, I want to be able to take advantage of the latest language analysis improvements for all languages in use on those projects.
  • As a Search Platform engineer, I don't want to have to reindex Commons and Wikidata every time any language is updated (it takes too long—multiple days), but I do want non-English language analysis improvements to make it to those projects in a reasonably timely manner.

Most wikis only use language analysis for one language, and that language can be recovered from the wiki config. Commons and Wikidata support many languages and use many analyzers, but their config indicates that their language is English. By default, then, Commons and Wikidata only get reindexed when the English analysis chain is updated, or when we have a reason to reindex everything, neither of which is very often.

A reasonable compromise would be to reindex Commons and Wikidata on a regular schedule. We discussed it in our weekly meeting, and every 3–6 months (2–4 times a year) seems reasonable; every 4 months (3 times a year) seems like a good compromise to start with.

Elastic index names include a Unix timestamp, so determining how old an index is should be straightforward. Using the index timestamp also means the clock is automatically reset if a project gets reindexed sooner for some other reason, so there will be less unnecessary reindexing than if we were to follow a strict schedule like "the first week of every third month".
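
For illustration only, here is a minimal sketch of that age calculation, assuming the timestamp is the trailing numeric component of the index name (e.g. commonswiki_content_1646154992; the exact naming scheme is an assumption for this sketch):

  import time

  def index_age_days(index_name):
      # Assumes names like 'commonswiki_content_1646154992', where the
      # trailing component is the Unix timestamp of the index build.
      timestamp = int(index_name.rsplit("_", 1)[-1])
      return (time.time() - timestamp) / 86400

  # Flag the index if it is older than the proposed 120-day window.
  if index_age_days("commonswiki_content_1646154992") > 120:
      print("commonswiki_content is due for a reindex")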

There are actually several indexes for each wiki project, and we may want to be able to make simple variations on this kind of alert (like reindexing every wiki at least once a year). All projects have a content index, and normally all related indexes are refreshed one right after another, so only checking the content index makes sense, though counterarguments are welcome.

Acceptance Criteria:

  • Any concerns about the schedule (every 4 months / 120 days / ~3 times a year) or letting the content index be the benchmark are discussed and resolved.
  • An alert is set up to send email if the timestamp for the Commons content index is more than 120 days old.
  • An alert is set up to send email if the timestamp for the Wikidata content index is more than 120 days old.
  • Ideally, a phab ticket should be automatically created and assigned to the Discovery-Search backlog.

Event Timeline

@Dzahn's suggestion in the wikimedia-sre IRC room:

"some phab tickets are created automatically by Icinga, using the eventhandler, like the RAID ones: https://phabricator.wikimedia.org/T299871 so you could copy code it uses to create them yourself by running what Icinga runs"

See these places in the operations/puppet repo:

  • modules/profile/manifests/icinga.pp: class { 'icinga::event_handlers::raid'
  • modules/icinga/templates/event_handlers/raid_handler.cfg.erb
  • modules/icinga/manifests/event_handlers/raid.pp
  • modules/icinga/files/raid_handler.py

and then phab_client = get_phabricator_client() in raid_handler.py above.
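
For orientation, a minimal sketch of the ticket-creation step such a handler performs; this is not the actual raid_handler.py code, and it assumes the phabricator Python library plus a conduit API token (the helper below is a hypothetical stand-in for the one named above):

  from phabricator import Phabricator

  def get_phabricator_client():
      # Hypothetical stand-in for the helper referenced above; the real
      # handler reads its host and API token from puppet-managed config.
      return Phabricator(host="https://phabricator.wikimedia.org/api/",
                         token="api-XXXXXXXXXXXXXXXXXXXX")  # placeholder

  def create_reindex_task(wiki, age_days):
      phab_client = get_phabricator_client()
      # maniphest.createtask is the conduit method for filing a new task.
      phab_client.maniphest.createtask(
          title="%s content index is %d days old, please reindex" % (wiki, age_days),
          description="Filed automatically because the index exceeded the "
                      "120-day reindex window.",
      )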

MPhamWMF set the point value for this task to 3. · Feb 7 2022, 4:51 PM

From looking into this, it seems we need two pieces:

  • We need to configure a nagios check that will talk to elasticsearch and return the age of the index, and then have nagios alert when the age is more than allowed. This check needs to be per-datacenter. The general concept is: curl https://search.svc.eqiad.wmnet:9243/commonswiki_content/_settings | jq '. | to_entries | .[].value.settings.index.creation_date' (a rough sketch of such a check follows this list).
  • We need to configure a nagios event handler (see raid_handler notes from dzahn) to receive the alert and create a phabricator ticket
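
A rough sketch of that check, using only the Python standard library; the endpoint and index name are taken from the curl example above, the 120-day threshold from the acceptance criteria, and the exit codes follow nagios conventions (0 OK, 2 CRITICAL, 3 UNKNOWN). This is an illustration, not the deployed check:

  import json
  import sys
  import time
  import urllib.request

  ENDPOINT = "https://search.svc.eqiad.wmnet:9243"  # per-datacenter, as noted above
  INDEX = "commonswiki_content"
  MAX_AGE_DAYS = 120

  def main():
      try:
          with urllib.request.urlopen("%s/%s/_settings" % (ENDPOINT, INDEX)) as resp:
              settings = json.load(resp)
          # The response is keyed by the concrete index name behind the alias;
          # index.creation_date is a Unix timestamp in milliseconds.
          creation_ms = int(
              next(iter(settings.values()))["settings"]["index"]["creation_date"]
          )
      except Exception as exc:
          print("UNKNOWN: could not fetch index settings: %s" % exc)
          return 3
      age_days = (time.time() - creation_ms / 1000) / 86400
      if age_days > MAX_AGE_DAYS:
          print("CRITICAL: %s is %.0f days old (max %d)" % (INDEX, age_days, MAX_AGE_DAYS))
          return 2
      print("OK: %s is %.0f days old" % (INDEX, age_days))
      return 0

  if __name__ == "__main__":
      sys.exit(main())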

Volans (who worked on the raid_handler parts years ago) suggested that skipping icinga altogether might be better. If we instead push the index creation dates (or maybe ages) for these indices into prometheus, we can use alertmanager, which has direct support for creating phabricator tickets from T272453.

Overall this seems like a better approach, in that there are fewer moving pieces for us to maintain or adjust. With the icinga approach we would need to maintain two new python scripts to integrate with icinga and phabricator; with prometheus we only need to adjust the existing collector. Will look into this more today.
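
As a sketch of what "adjust the existing collector" could look like, here is an illustrative custom collector using the prometheus_client library. The metric name, port, and local endpoint are assumptions for this example, not what the production prometheus-wmf-elasticsearch-exporter actually exposes:

  import json
  import time
  import urllib.request

  from prometheus_client import start_http_server
  from prometheus_client.core import REGISTRY, GaugeMetricFamily

  class IndexCreationCollector:
      """Illustrative collector exposing per-index creation dates as a gauge."""

      def __init__(self, endpoint="http://localhost:9200"):
          self.endpoint = endpoint

      def collect(self):
          gauge = GaugeMetricFamily(
              "elasticsearch_index_creation_timestamp_seconds",  # made-up name
              "Unix time at which the index was created",
              labels=["index"],
          )
          with urllib.request.urlopen(self.endpoint + "/_all/_settings") as resp:
              settings = json.load(resp)
          for name, conf in settings.items():
              creation_ms = int(conf["settings"]["index"]["creation_date"])
              gauge.add_metric([name], creation_ms / 1000)
          yield gauge

  if __name__ == "__main__":
      REGISTRY.register(IndexCreationCollector())
      start_http_server(9201)
      while True:
          time.sleep(60)

An alerting rule could then fire on an expression along the lines of time() - elasticsearch_index_creation_timestamp_seconds{index=~"(commonswiki|wikidatawiki)_content"} > 120 * 86400 and be routed to the alertmanager receiver that creates phabricator tasks.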

Change 769123 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] prometheus: Add more per-index metrics for elasticsearch

https://gerrit.wikimedia.org/r/769123

Change 769127 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/alerts@master] Create phab task when indices are too old

https://gerrit.wikimedia.org/r/769127

Change 769131 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] alertmanager: Configure task creation for search-platform

https://gerrit.wikimedia.org/r/769131

Change 769123 merged by Razzi:

[operations/puppet@production] prometheus: Add more per-index metrics for elasticsearch

https://gerrit.wikimedia.org/r/769123

Change 769131 merged by Razzi:

[operations/puppet@production] alertmanager: Configure task creation for search-platform

https://gerrit.wikimedia.org/r/769131

Mentioned in SAL (#wikimedia-operations) [2022-03-10T18:15:22Z] <razzi> systemctl restart prometheus-wmf-elasticsearch-exporter-9200.service on elastic2042 for T300295

Change 769127 merged by jenkins-bot:

[operations/alerts@master] Create phab task when indices are too old

https://gerrit.wikimedia.org/r/769127

Deployed and it created a task for the already-old indices: T303548. This time around they are nicely bundled together, but in the future they might spread out and get multiple tickets over a few days, as they won't all be reindexed on the same day.

That is very cool!!

Thanks, Erik, for helping to test-drive this feature! I'm glad it is useful and working as expected. Re: bundling, the bot currently updates the same (open) task, using the alert group name as its "key" (i.e. the task's title). You can tweak the key/title based on the alert group information; for example, in T300836 we're opening a task per device. Hope that helps!