Description
In the past we had repeated problems with the Add Link database and search index getting out of sync, which left users with "no suggestions for this page" errors upon arrival. While we think we have fixed all of those problems, we should set up monitoring so that we notice if a significant number of such errors occur.
Details
Status | Subtype | Assigned | Task
---|---|---|---
Resolved | | Tgr | T283868 Monitor "no suggestion" rate for Add Link tasks
Resolved | | Tgr | T289550 Add Link: Set up cronjob for collecting statsd metrics about dangling search index entries
Event Timeline
Change 697680 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] Add Link: Send "no suggestions found" events to statsd
Change 697680 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] Add Link: Send "no suggestions found" events to statsd
This still needs an addition to the Grafana dashboard, so moving back to In Progress.
Dashboard: https://grafana.wikimedia.org/d/vGq7hbnMz/special-homepage-and-suggested-edits?viewPanel=35&orgId=1
(Shows the number of errors per hour. I guess it would be more informative if we also knew the number of task opens per hour.)
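To illustrate why the number of task opens would help: with both counts, the chart could show a rate instead of a raw count. A minimal sketch, assuming both hourly counts are available; the function and parameter names are made up for illustration and are not part of GrowthExperiments.

```python
from typing import Optional


def no_suggestion_rate(errors_per_hour: int, opens_per_hour: int) -> Optional[float]:
    """Return the share of task opens that ended in a "no suggestions" error.

    Returns None when there were no task opens in the hour, because the
    rate is undefined in that case.
    """
    if opens_per_hour <= 0:
        return None
    return errors_per_hour / opens_per_hour


# Hypothetical numbers: 5 errors out of 200 task opens in one hour.
rate = no_suggestion_rate(5, 200)
```

A rate like this stays comparable between quiet and busy hours, which a raw error count does not.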
I'm thinking we should also chart the number of tasks which are present in the search index but not in the database (can be counted with fixLinkRecommendationData.php --verbose, we could give it a --statsd option, like for listTaskCounts.php). Maybe also the opposite (tasks present in the database but not in the search index) although that's less immediately problematic.
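The two counts described above amount to set differences between the page IDs known to the search index and those known to the database. A rough sketch of that idea follows; it is not the actual fixLinkRecommendationData.php logic, and the function names and metric names are assumptions chosen for illustration. The `name:value|g` string is the standard statsd plain-text format for a gauge.

```python
from typing import Dict, Iterable


def dangling_counts(search_index_ids: Iterable[int], database_ids: Iterable[int]) -> Dict[str, int]:
    """Count records that exist on only one side.

    dangling_search_index: in the search index but missing from the database
    dangling_db: in the database but missing from the search index
    """
    index_ids = set(search_index_ids)
    db_ids = set(database_ids)
    return {
        "dangling_search_index": len(index_ids - db_ids),
        "dangling_db": len(db_ids - index_ids),
    }


def statsd_gauge_line(metric: str, value: int) -> str:
    """Format a gauge in the statsd plain-text line protocol."""
    return f"{metric}:{value}|g"


# Hypothetical page IDs: 1 and 2 are only in the index, 5 is only in the database.
counts = dangling_counts([1, 2, 3, 4], [3, 4, 5])
# The metric prefix below is invented for this example.
lines = [statsd_gauge_line(f"growthexperiments.addlink.{name}", n) for name, n in counts.items()]
```

A gauge fits better than a counter here because each run reports the current number of dangling records rather than an increment.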
Change 702751 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] fixLinkRecommendationData.php: add statsd option
Change 702751 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] fixLinkRecommendationData.php: add statsd option
Change 712924 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):
[mediawiki/extensions/GrowthExperiments@master] AddLink: Use statsd monitoring for errors on server side
Change 713019 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] fixLinkRecommendationData: allow random sampling
Change 713019 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] fixLinkRecommendationData: allow random sampling
Change 714449 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@master] fixLinkRecommendationData: Allow --db-table in dry-run mode
Change 714449 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] fixLinkRecommendationData: Allow --db-table in dry-run mode
Change 715824 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):
[mediawiki/extensions/GrowthExperiments@wmf/1.37.0-wmf.20] fixLinkRecommendationData: Allow --db-table in dry-run mode
Change 715824 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@wmf/1.37.0-wmf.20] fixLinkRecommendationData: Allow --db-table in dry-run mode
Mentioned in SAL (#wikimedia-operations) [2021-09-01T23:24:42Z] <urbanecm@deploy1002> Synchronized php-1.37.0-wmf.20/extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php: 3c7d4ecc699b7c68467a372686f5514375d2b74f: fixLinkRecommendationData: Allow --db-table in dry-run mode (T283868) (duration: 01m 06s)
Change 716755 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):
[operations/puppet@production] Run GrowthExperiments fixLinkRecommendationData --dry-run every day
Tested in production; it seems to work well and takes about 2 minutes to run on all wikis. Dashboards: dangling search index records, dangling DB records (they only contain data from the manual runs, so there is not much to see there until the puppet patch is merged). We have a significant but not large number (hundreds) of both types of dangling records on some wikis.
Change 716755 merged by Jbond:
[operations/puppet@production] Run GrowthExperiments fixLinkRecommendationData --dry-run every day
Change 712924 merged by jenkins-bot:
[mediawiki/extensions/GrowthExperiments@master] AddLink: Use statsd monitoring for errors on server side
... which, as a note to self / others, we should do after MW-1.37-notes (1.37.0-wmf.23; 2021-09-13) is in production to include the logging from https://gerrit.wikimedia.org/r/712924
@Tgr anything else you wanted to add to the dashboard?
It would be nice to figure out how to show a graph instead of a single data point for the daily cronjob based data, but no need to keep this task for that.
Checked in wmf.2 - both the Dangling search index records and Dangling DB records charts are present on the Growth Team/Special:Homepage and Suggested Edits dashboard. No spikes after the wmf.2 deployment.