Page MenuHomePhabricator

CirrusSearch should not send outdated cirrussearch-request events
Closed, ResolvedPublic

Description

It might happen (most probably when running a maintenance) that CirrusSearch sends cirrussearch-request events that have a meta.dt field that is outdated.

Reason might be that CirrusSearch delays sending such requests at the end of a maintenance script.

Such faulty events found in the logs suggests that they relate to a maintenance script running for the GrowthExperiments extension with the hasrecommendation:link keyword.

AC:

  • CirrusSearch should no longer produce outdated events

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

According to @Urbanecm_WMF these queries are probably emitted while running https://github.com/wikimedia/mediawiki-extensions-GrowthExperiments/blob/master/maintenance/refreshLinkRecommendations.php.
Discussing possible fixes it would be ideal if cirrus could detect that it is being run via a maint script and possibly call something like disablePoolCountersAndLogging but perhaps without disabling statsd since the user script might require stats to be emitted.

Change #1017929 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[operations/mediawiki-config@master] Remove buffer on cirrussearch-request log channel

https://gerrit.wikimedia.org/r/1017929

I took a look over the actual event generation, but I can't see why meta.dt would be outdated. Our request logging does cache some things, but the meta information isn't one of them. We fetch the value for meta.dt from wfTimestamp() (global clock) and immediately provide the event to logging. Logging does put the request into a second deferred update, but as long as we are running from inside a deferred update the system guarantees it will run any deferred submitted while running the deferred immediately after (via a scope-stack abstraction).

Once suspicious bit i found is that we have buffering enabled on the logging channel. I don't have proof this is a problem, but turning it off seems reasonable. For a normal web request we only ship a single cirrussearch-request event, buffering that single event for later isn't all that helpful. Additionally we send the event during postsend, so buffering isn't delaying to a useful time. The only thing it's really doing is buffering the request during long-running maintenance scripts, which is potentially causing our problem here.

Change #1017929 merged by jenkins-bot:

[operations/mediawiki-config@master] Remove buffer on cirrussearch-request log channel

https://gerrit.wikimedia.org/r/1017929

Moving to waiting, as we need to wait and see if changing the log buffering fixed the issue or not.

per the data-engineering-alerts list archive these were triggering daily alerts the two weeks prior to 2024-04-10 and haven't been emitted since. This is two days after the fix was applied, which is slightly curious. But I remember something about event refining operating over window of hours, so maybe it took some time to pass. I'm willing to call this complete with the errors stopping.