Investigate EQIAD daily completion suggester rebuild failure
Closed, Resolved | Public | 2 Estimated Story Points

Description

We re-enabled daily completion suggester builds in eqiad, but the TitleSuggestIndexTooOld alert is still firing. {F60772784}

Logstash shows that the job is completing, so we need to figure out what's happening.

Until we fix this, we can't re-enable EQIAD. See also T388538, where these jobs were migrated to k8s.

Creating this ticket to:

  • Investigate/fix completion job failures

Event Timeline

Change #1151727 had a related patch set uploaded (by DCausse; author: DCausse):

[mediawiki/extensions/CirrusSearch@master] Do not check index search stats when checking if an index is live

https://gerrit.wikimedia.org/r/1151727

Change #1151727 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Do not check index search stats when checking if an index is live

https://gerrit.wikimedia.org/r/1151727

 :) (ebernhardson@deploy1003)-~$ kube_env mw-cron eqiad
 :) (ebernhardson@deploy1003)-~$ kubectl get pods | grep OOM
cirrus-build-completion-indices-codfw-s3-29155830-52977           0/3     OOMKilled   0          37h
cirrus-build-completion-indices-codfw-s3-29157270-wnzhd           0/3     OOMKilled   0          13h
cirrus-build-completion-indices-eqiad-s3-29155830-72d5c           0/3     OOMKilled   0          37h
cirrus-build-completion-indices-eqiad-s3-29157270-mn6bp           0/3     OOMKilled   0          13h
 :) (ebernhardson@deploy1003)-~$ kubectl logs cirrus-build-completion-indices-eqiad-s3-29157270-mn6bp mediawiki-main-app | grep -A 1 -B 5 -i kill
hewikisource     100% done...
hewikisource 2025-06-09 08:16:28 Indexing from content index done.
hewikisource 2025-06-09 08:16:29 total hits: 26217
hewikisource 2025-06-09 08:16:29 Indexing 26217 documents from general with batchId: 1749445984
hewikisource     22% done...
/usr/local/bin/mwscriptwikiset: line 112:  1525 Killed                  ${RUNNER} ${CMD} ${wiki} "${@}" 2>&1
      1526 Done                    | ts "${wiki}"

Following up here instead of T388538

It still appears to be having issues: in particular, the s3 job has been OOMKilled a few times recently and isn't completing a full build.

I took a brief look through the charts. Possibly these jobs are using the main_app.requests.auto_compute=true option in the mediawiki chart, but it wasn't clear to me whether there is a way to set per-job requests.

They're not. There has been a per-pod limit of 2GiB of RAM since T395436: Limit CPU usage for mw-on-k8s cli deployments, and there isn't a way to override it per job for now. We can raise that limit temporarily, but there may be an issue with this script handling a very large data structure: the memory consumption graph spikes suddenly to hit the 2GiB limit:
{F62281846}

As far as I can tell, this behavior is there on every run, but there is a little variance in the size of the spike, meaning it won't always hit the limit and get killed. It's also spiking fast enough that the event that results in the OOMKill doesn't always get captured on the graphs.

I'll disable the memory limit as a temporary measure to unblock re-enabling the eqiad search cluster, but we'll want to re-enable it at some point.

Change #1155175 had a related patch set uploaded (by Clément Goubert; author: Clément Goubert):

[operations/deployment-charts@master] mw-cron: Disable memory limit

https://gerrit.wikimedia.org/r/1155175

Change #1155175 merged by jenkins-bot:

[operations/deployment-charts@master] mw-cron: Disable memory limit

https://gerrit.wikimedia.org/r/1155175

pfischer set the point value for this task to 2. Jun 23 2025, 3:42 PM

Here is a relatively minimal reproduction of the OOM we trigger. It fails at around 2.3M cached entries. At a very general level, the problem is that this MediaWiki code assumes a web request that ends in a few seconds at most, not a maintenance script that visits millions of pages in a single execution.

$counter = MediaWiki\MediaWikiServices::getInstance()->getStatsFactory()->getCounter( 'pagestore_linkcache_accesses_total' );
for ( $i = 0; $i < 3000000; ++$i ) {
    // Print the number of buffered samples every 3000 increments
    if ( ($i % 3000) == 2999 ) {
        echo $counter->baseMetric->getSampleCount(), "\n";
    }
    $counter->setLabel( 'reason', 'good' )->setLabel( 'status', 'hit' )->increment();
}

While the above is a reduced form of our maintenance script and does trigger an OOM, after more investigation I'm not certain that is the cause of our memory usage.
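For intuition about why the reproduction above can exhaust memory, here is a toy model in Python. It is not the MediaWiki StatsFactory implementation: the class, its methods, and the flush interval are hypothetical, and the assumption that one sample is retained per increment is inferred only from the growing getSampleCount() output in the reproduction.

```python
class BufferingCounter:
    """Toy counter that keeps one sample per increment until flushed.

    Hypothetical stand-in for illustration; not the StatsFactory API.
    """

    def __init__(self, name):
        self.name = name
        self.samples = []

    def increment(self, labels):
        # Each increment appends a sample instead of bumping a running total.
        self.samples.append((labels, 1))

    def sample_count(self):
        return len(self.samples)

    def flush(self):
        """Aggregate buffered samples into per-label totals and empty the buffer."""
        totals = {}
        for labels, value in self.samples:
            totals[labels] = totals.get(labels, 0) + value
        self.samples.clear()
        return totals


def run(iterations, flush_every=None):
    """Return the peak number of buffered samples over `iterations` increments."""
    counter = BufferingCounter("pagestore_linkcache_accesses_total")
    labels = (("reason", "good"), ("status", "hit"))
    peak = 0
    for i in range(1, iterations + 1):
        counter.increment(labels)
        peak = max(peak, counter.sample_count())
        if flush_every and i % flush_every == 0:
            counter.flush()
    return peak
```

In this model, a short-lived process never notices the buffer, while a long-running loop sees it grow linearly unless something flushes it periodically.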

When looking into what is different about hewikisource, the biggest thing that stands out is the number of redirects. hewikisource has 1.2M redirects, while the next largest wiki with subphrase matching enabled (enwikisource) has only ~200k. The relatively large number of redirects per page results in more suggest documents being generated, and much larger batches being flushed (observed up to 180k suggest docs flushed at a time).

The patch (change #1164293) separates the flushing of suggest docs from the source query. Instead of flushing every time we process N source documents, it flushes every time it sees N suggest documents. With our prod config, that should reduce the maximum number of suggest docs held in memory at a time from 180k to 3k. In a test run against prod data I'm seeing peak memory usage reduced from 2GB to 512MB.
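The difference between the two flushing strategies can be sketched as follows. This is a simplified Python model, not the CirrusSearch UpdateSuggesterIndex code: the function names, the dictionary shape of a source document, and the one-suggest-doc-per-redirect expansion are all assumptions made for illustration.

```python
def expand(source_doc):
    """Hypothetical expansion: one suggest doc for the title plus one per redirect."""
    return [source_doc["title"]] + list(source_doc["redirects"])


def flush_per_source_batch(source_docs, batch_size, flush):
    """Flush after every `batch_size` source docs; returns the peak buffer size."""
    pending, peak = [], 0
    for seen, doc in enumerate(source_docs, start=1):
        pending.extend(expand(doc))
        peak = max(peak, len(pending))
        if seen % batch_size == 0:
            flush(pending)
            pending = []
    if pending:
        flush(pending)
    return peak


def flush_per_suggest_batch(source_docs, batch_size, flush):
    """Flush whenever `batch_size` suggest docs are buffered; returns the peak."""
    pending, peak = [], 0
    for doc in source_docs:
        for suggest in expand(doc):
            pending.append(suggest)
            peak = max(peak, len(pending))
            if len(pending) >= batch_size:
                flush(pending)
                pending = []
    if pending:
        flush(pending)
    return peak


if __name__ == "__main__":
    # Ten pages with 59 redirects each: 60 suggest docs per source doc.
    docs = [
        {"title": f"page{i}", "redirects": [f"r{i}_{j}" for j in range(59)]}
        for i in range(10)
    ]
    sent = []
    print(flush_per_source_batch(docs, 5, sent.extend))   # prints 300
    print(flush_per_suggest_batch(docs, 5, sent.extend))  # prints 5
```

With the first strategy the buffer scales with the number of redirects per page, which is why a redirect-heavy wiki produces much larger batches. With the second, the buffer is capped at the batch size regardless of how many suggest docs each page generates.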

Change #1164293 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] UpdateSuggesterIndex: Avoid holding large in-memory batches

https://gerrit.wikimedia.org/r/1164293

Change #1164293 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] UpdateSuggesterIndex: Avoid holding large in-memory batches

https://gerrit.wikimedia.org/r/1164293

This looks to have worked as expected. I checked by reviewing max(sum by (pod) (container_memory_usage_bytes{namespace="mw-cron", pod=~"cirrus-build-completion-indices-.*", container="mediawiki-main-app"})) for the last 7 days. This shows that peak memory usage of an individual cirrus-build-completion-indices pod decreased from >2GB to ~550MB on July 3rd.

@Clement_Goubert It should be reasonable to re-enable the memory limits now.

That's great, good job on dividing memory usage by 4!

Mentioned in SAL (#wikimedia-operations) [2025-07-09T10:37:04Z] <claime> Restoring memory limits on mw-cron - T395436 - T395465