
Create an alert based on index age for reindexing Commons and Wikidata
Closed, Resolved · Public · 3 Estimated Story Points

Description

User Story:

  • As a searcher on Commons or Wikidata, I want to be able to take advantage of the latest language analysis improvements for all languages in use on those projects.
  • As a Search Platform engineer, I don't want to have to reindex Commons and Wikidata every time any language is updated (it takes too long—multiple days), but I do want non-English language analysis improvements to make it to those projects in a reasonably timely manner.

Most wikis only use language analysis for one language, and that language can be recovered from the wiki config. Commons and Wikidata support many languages and use many analyzers, but their config indicates that their language is English. By default, then, Commons and Wikidata only get reindexed when the English analysis chain is updated, or when we have a reason to reindex everything, neither of which is very often.

A reasonable compromise would be to reindex Commons and Wikidata on a regular schedule. We discussed it in our weekly meeting, and every 3–6 months (2–4 times a year) seems reasonable; every 4 months (3 times a year) seems like a good compromise to start with.

Elastic index names include a Unix timestamp, so determining how old an index is should be straightforward. Using the index timestamp also means the clock is automatically reset if a project gets reindexed sooner for some other reason, so there will be less unnecessary reindexing than if we were to follow a strict schedule like "the first week of every third month".
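
For illustration only, here is a minimal sketch of that age calculation, assuming the timestamp is the trailing numeric component of the index name (e.g. commonswiki_content_1646154992; the exact naming scheme is an assumption for this sketch):

  import time

  def index_age_days(index_name):
      # Assumes names like 'commonswiki_content_1646154992', where the
      # trailing component is the Unix timestamp of the index build.
      timestamp = int(index_name.rsplit("_", 1)[-1])
      return (time.time() - timestamp) / 86400

  # Flag the index if it is older than the proposed 120-day window.
  if index_age_days("commonswiki_content_1646154992") > 120:
      print("commonswiki_content is due for a reindex")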

There are actually several indexes for each wiki project, and we may want to be able to make simple variations on this kind of alert (like reindexing every wiki at least once a year). All projects have a content index, and normally all related indexes are refreshed one right after another, so only checking the content index makes sense, though counterarguments are welcome.

Acceptance Criteria:

  • Any concerns about the schedule (every 4 months / 120 days / ~3 times a year) or letting the content index be the benchmark are discussed and resolved.
  • An alert is set up to send email if the timestamp for the Commons content index is more than 120 days old.
  • An alert is set up to send email if the timestamp for the Wikidata content index is more than 120 days old.
  • Ideally, a phab ticket should be automatically created and assigned to the Discovery-Search backlog.

Event Timeline

@Dzahn's suggestion in the wikimedia-sre IRC room:

"some phab tickets are created automatically by Icinga, using the eventhandler, like the RAID ones: https://phabricator.wikimedia.org/T299871 so you could copy code it uses to create them yourself by running what Icinga runs"

See these places in the operations/puppet repo:

  • modules/profile/manifests/icinga.pp: class { 'icinga::event_handlers::raid'
  • modules/icinga/templates/event_handlers/raid_handler.cfg.erb
  • modules/icinga/manifests/event_handlers/raid.pp
  • modules/icinga/files/raid_handler.py

and then phab_client = get_phabricator_client() in raid_handler.py above.
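
For orientation, a minimal sketch of the ticket-creation step such a handler performs; this is not the actual raid_handler.py code, and it assumes the phabricator Python library plus a conduit API token (the helper below is a hypothetical stand-in for the one named above):

  from phabricator import Phabricator

  def get_phabricator_client():
      # Hypothetical stand-in for the helper referenced above; the real
      # handler reads its host and API token from puppet-managed config.
      return Phabricator(host="https://phabricator.wikimedia.org/api/",
                         token="api-XXXXXXXXXXXXXXXXXXXX")  # placeholder

  def create_reindex_task(wiki, age_days):
      phab_client = get_phabricator_client()
      # maniphest.createtask is the conduit method for filing a new task.
      phab_client.maniphest.createtask(
          title="%s content index is %d days old, please reindex" % (wiki, age_days),
          description="Filed automatically because the index exceeded the "
                      "120-day reindex window.",
      )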

MPhamWMF set the point value for this task to 3. · Feb 7 2022, 4:51 PM

From looking into this, it seems we need two pieces:

  • We need to configure a nagios check that will talk to elasticsearch and return the age of the index, and then have nagios alert when the age is more than allowed. This check needs to be per-datacenter. The general concept is: curl https://search.svc.eqiad.wmnet:9243/commonswiki_content/_settings | jq '. | to_entries | .[].value.settings.index.creation_date' (a rough sketch of such a check follows this list).
  • We need to configure a nagios event handler (see raid_handler notes from dzahn) to receive the alert and create a phabricator ticket
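
A rough sketch of that check, using only the Python standard library; the endpoint and index name are taken from the curl example above, the 120-day threshold from the acceptance criteria, and the exit codes follow nagios conventions (0 OK, 2 CRITICAL, 3 UNKNOWN). This is an illustration, not the deployed check:

  import json
  import sys
  import time
  import urllib.request

  ENDPOINT = "https://search.svc.eqiad.wmnet:9243"  # per-datacenter, as noted above
  INDEX = "commonswiki_content"
  MAX_AGE_DAYS = 120

  def main():
      try:
          with urllib.request.urlopen("%s/%s/_settings" % (ENDPOINT, INDEX)) as resp:
              settings = json.load(resp)
          # The response is keyed by the concrete index name behind the alias;
          # index.creation_date is a Unix timestamp in milliseconds.
          creation_ms = int(
              next(iter(settings.values()))["settings"]["index"]["creation_date"]
          )
      except Exception as exc:
          print("UNKNOWN: could not fetch index settings: %s" % exc)
          return 3
      age_days = (time.time() - creation_ms / 1000) / 86400
      if age_days > MAX_AGE_DAYS:
          print("CRITICAL: %s is %.0f days old (max %d)" % (INDEX, age_days, MAX_AGE_DAYS))
          return 2
      print("OK: %s is %.0f days old" % (INDEX, age_days))
      return 0

  if __name__ == "__main__":
      sys.exit(main())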

Volans (who worked on the raid_handler parts years ago) suggested that skipping icinga altogether might be better. If we instead push the index creation dates (or maybe ages) for these indices into prometheus, we can use alertmanager, which has direct support for creating phabricator tickets from T272453.

Overall this seems like a better approach, in that there are fewer moving pieces for us to maintain or adjust. With the icinga approach we would need to maintain two new python scripts to integrate with icinga and phabricator; with prometheus we only need to adjust the existing collector. Will look into this more today.
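
As a sketch of what "adjust the existing collector" could look like, here is an illustrative custom collector using the prometheus_client library. The metric name, port, and local endpoint are assumptions for this example, not what the production prometheus-wmf-elasticsearch-exporter actually exposes:

  import json
  import time
  import urllib.request

  from prometheus_client import start_http_server
  from prometheus_client.core import REGISTRY, GaugeMetricFamily

  class IndexCreationCollector:
      """Illustrative collector exposing per-index creation dates as a gauge."""

      def __init__(self, endpoint="http://localhost:9200"):
          self.endpoint = endpoint

      def collect(self):
          gauge = GaugeMetricFamily(
              "elasticsearch_index_creation_timestamp_seconds",  # made-up name
              "Unix time at which the index was created",
              labels=["index"],
          )
          with urllib.request.urlopen(self.endpoint + "/_all/_settings") as resp:
              settings = json.load(resp)
          for name, conf in settings.items():
              creation_ms = int(conf["settings"]["index"]["creation_date"])
              gauge.add_metric([name], creation_ms / 1000)
          yield gauge

  if __name__ == "__main__":
      REGISTRY.register(IndexCreationCollector())
      start_http_server(9201)
      while True:
          time.sleep(60)

An alerting rule could then fire on an expression along the lines of time() - elasticsearch_index_creation_timestamp_seconds{index=~"(commonswiki|wikidatawiki)_content"} > 120 * 86400 and be routed to the alertmanager receiver that creates phabricator tasks.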

Change 769123 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] prometheus: Add more per-index metrics for elasticsearch

https://gerrit.wikimedia.org/r/769123

Change 769127 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/alerts@master] Create phab task when indices are too old

https://gerrit.wikimedia.org/r/769127

Change 769131 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/puppet@production] alertmanager: Configure task creation for search-platform

https://gerrit.wikimedia.org/r/769131

Change 769123 merged by Razzi:

[operations/puppet@production] prometheus: Add more per-index metrics for elasticsearch

https://gerrit.wikimedia.org/r/769123

Change 769131 merged by Razzi:

[operations/puppet@production] alertmanager: Configure task creation for search-platform

https://gerrit.wikimedia.org/r/769131

Mentioned in SAL (#wikimedia-operations) [2022-03-10T18:15:22Z] <razzi> systemctl restart prometheus-wmf-elasticsearch-exporter-9200.service on elastic2042 for T300295

Change 769127 merged by jenkins-bot:

[operations/alerts@master] Create phab task when indices are too old

https://gerrit.wikimedia.org/r/769127

Deployed and it created a task for the already-old indices: T303548. This time around they are nicely bundled together, but in the future they might spread out and get multiple tickets over a few days, as they won't all be reindexed on the same day.

That is very cool!!

Thanks, Erik, for helping to test-drive this feature! I'm glad it is useful and working as expected. Re: bundling, the bot currently updates the same (open) task, using the alert group name as its "key" (i.e. the task's title). You can tweak the key/title based on the alert group information; for example, in T300836 we're opening a task per device. Hope that helps!