
SUP: set up alerting for page_change_weighted_tags ingestion
Closed, ResolvedPublic5 Estimated Story Points

Description

Possible approach: Compare the number of events coming in via page_change_weighted_tags with the number of bulk actions modifying the weighted_tags multimap.
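As a sketch of that approach (illustrative only, not the production alerting code; the function name and the 5% tolerance are assumptions), the comparison could look like:

```python
# Flag a gap between incoming page_change_weighted_tags events and the
# bulk actions applied to the weighted_tags multimap. Both counts are
# assumed to be taken over the same time window.

def ingestion_gap(events_in: int, bulk_actions: int, tolerance: float = 0.05) -> bool:
    """Return True if applied bulk actions lag incoming events by more
    than `tolerance` (as a fraction of incoming events)."""
    if events_in == 0:
        # Bulk actions without any incoming events are also suspicious.
        return bulk_actions > 0
    return (events_in - bulk_actions) / events_in > tolerance
```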

AC:

Event Timeline

Gehel triaged this task as High priority. Aug 28 2024, 8:26 AM
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.
Gehel set the point value for this task to 5. Sep 2 2024, 3:44 PM
Gehel updated the task description.

Change #1081172 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/deployment-charts@master] cirrus-streaming-updater: bump to v20241017132903-67693a7

https://gerrit.wikimedia.org/r/1081172

Change #1081172 merged by jenkins-bot:

[operations/deployment-charts@master] cirrus-streaming-updater: bump to v20241017132903-67693a7

https://gerrit.wikimedia.org/r/1081172

What is the state of this? I'm asking because, looking at things on our end, it seems that the "database-and-search-index" pairs are missing the search-index part. The staircase pattern makes me suspect that something goes wrong when the maintenance script, which runs every couple of hours, tries to add new records to both the database and the search index: adding to the database succeeds, adding to the search index seems to fail, leaving the dangling record.

(Attached screenshot: image.png, 319×784 px)

Our code is supposed to cancel the transaction adding the db-row if \CirrusSearch\WeightedTagsUpdater::updateWeightedTags throws, but maybe that is not happening?

(Or maybe the error is somewhere in the deleting step after all. But why the pattern then?)
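To illustrate the intended behaviour described above (a Python sketch for clarity; the real code is PHP, and the class names and method shapes here are assumptions, not the actual implementation):

```python
# The db row should only be committed if the weighted-tags update
# succeeds; if the index update throws, the transaction must roll back.

class SearchIndexError(Exception):
    """Stands in for the exception updateWeightedTags might throw."""


def add_record(db, index, record) -> bool:
    db.begin()
    try:
        db.insert(record)
        index.update_weighted_tags(record)  # may raise SearchIndexError
    except SearchIndexError:
        # Without this rollback, a dangling db row is left behind,
        # which would produce exactly the pattern described above.
        db.rollback()
        return False
    db.commit()
    return True
```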

What is the state of this?

On Wednesday, October 30, we switched to using EventGate for weighted tags emitted from CirrusSearch. At this point we started collecting some metrics under flink_taskmanager_job_task_operator_weighted_tags_tag_prefix_clear & flink_taskmanager_job_task_operator_weighted_tags_tag_prefix_set.
The next step for this ticket is to create a dashboard based on this data, which should help correlate possible errors.
I'll set up this dashboard ASAP so that we have a better understanding of what's happening.

I'm asking because, looking at things on our end, it seems that the "database-and-search-index" pairs are missing the search-index part. The staircase pattern makes me suspect that something goes wrong when the maintenance script, which runs every couple of hours, tries to add new records to both the database and the search index: adding to the database succeeds, adding to the search index seems to fail, leaving the dangling record.

If you noticed that the issues started right after October 10, then it is quite possible that they are related to T377150.

Our code is supposed to cancel the transaction adding the db-row if \CirrusSearch\WeightedTagsUpdater::updateWeightedTags throws, but maybe that is not happening?

(Or maybe the error is somewhere in the deleting step after all. But why the pattern then?)

@pfischer/@Michael a dashboard is up at: https://grafana-rw.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1
I'm not sure how to craft an alert based on these numbers; please let me know if you have ideas. Moving to Blocked/Waiting while we decide whether we want to set up an alert, and how.


Thanks, this is useful!

I think my recommendation would be: let's fix T378983: Add Link recommendation are not being processed by CirrusSearch (November 2024) first, then wait a week or two to see what kind of numbers we're working with, and then we can perhaps set some lower-bound alerts.

@Michael I think this is relatively stable now. Since Search does not own the individual sources of tags, it might be better to have more fine-grained alerts (per tag?) on your side if you want. On our side I might set up a very broad alert to capture only obvious problems (i.e. no tags updated in the last hour), but it might not cover failures specific to a particular tag.
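A minimal sketch of that broad-alert logic (illustrative only; the real alert lives in operations/alerts as a Prometheus rule, and the sample values here simply stand in for the _set/_clear counter increments over the window):

```python
# Fire only when no weighted-tags updates at all were recorded over the
# last hour, regardless of which tag they belong to.

def should_alert(update_counts_last_hour: list[int]) -> bool:
    """Fire when the sum of set/clear counter increments over the last
    hour is zero, i.e. no tags were updated at all."""
    return sum(update_counts_last_hour) == 0
```

By design this will miss a failure confined to a single tag while other tags keep flowing, which is why per-tag alerts on the producer side would still be useful.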

Change #1111300 had a related patch set uploaded (by DCausse; author: DCausse):

[operations/alerts@master] search: add alerts for weighted_tags indexing throughput

https://gerrit.wikimedia.org/r/1111300

Change #1111300 merged by jenkins-bot:

[operations/alerts@master] search: add alerts for weighted_tags indexing throughput

https://gerrit.wikimedia.org/r/1111300