
Parallelize GrowthExperiments' refreshLinkRecommendations.php
Open, Low, Public

Description

refreshLinkRecommendations.php processes wikis sequentially (or rather, the cronjob calling it via foreachwikiindblist does); within a wiki it goes through topics sequentially, and within a topic it goes through article candidates sequentially until enough tasks are found (a sketch of this flow follows the list below). There are two problems with that:

  • It cannot utilize the available service capacity (with several service instances, each having multiple workers, dozens of calls could be made in parallel without a performance penalty). This is a problem because requests to the service are the bottleneck in the script's running speed, and a significant one: preparing enough tasks for a new wiki would currently take several days.
  • When a new wiki is added, the script would only work on that wiki until it has enough tasks (which takes days), potentially causing the task pool on other wikis to dry up.
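To make the sequential structure concrete, here is a simplified sketch of the current control flow, written as if inside the maintenance script. The helper methods (countTasks(), getTargetTaskCount(), getNextCandidate(), requestRecommendation(), storeTask()) are illustrative assumptions, not the script's actual API:

```php
// Illustrative only: wikis are iterated externally by the cronjob (foreachwikiindblist);
// within one wiki, topics and candidates are processed strictly one at a time,
// with one blocking service request per candidate.
foreach ( $topics as $topic ) {
    while ( $this->countTasks( $topic ) < $this->getTargetTaskCount( $topic ) ) {
        $candidate = $this->getNextCandidate( $topic );
        if ( !$candidate ) {
            break;
        }
        // The HTTP request to the link recommendation service is the bottleneck.
        $recommendation = $this->requestRecommendation( $candidate );
        if ( $recommendation ) {
            $this->storeTask( $candidate, $recommendation );
        }
    }
}
```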

Event Timeline

Some options:

  • Parallel requests via MultiHttpClient (see the sketch after this list). This might help with the first problem but not the second (although if the speedup is sufficient, the second might stop being a problem).
  • Running the script for each wiki in parallel. We don't want to have to update puppet definitions all the time, but maybe something like GNU Parallel could be used? Probably not ideal in terms of CPU, though. Plus, at some point we'll have more wikis than service workers.
  • Have the script run in multiple processes via ForkController. Would require the script to know about wikis other than the one it runs on, which is possible (e.g. GlobalRename jobs do that) but not trivial.
  • Have the script schedule MediaWiki jobs. Might require the script to know about wikis other than the one it runs on, but not necessarily. More fragile and harder to debug, but fewer capacity limits on the MediaWiki side (one maintenance host vs. many job runners).
  • Have the script do only a limited amount of work per wiki or per topic, then move on. Helps with the second problem but not the first. Easy for topics, but wikis don't loop (unless we make the script know about wikis other than the one it runs on).
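A minimal sketch of the first option, batching service requests through MediaWiki's MultiHttpClient. The service endpoint, payload shape and response handling are assumptions for illustration, not the actual service API:

```php
use MediaWiki\MediaWikiServices;

// Illustrative only: send a batch of recommendation requests in parallel
// instead of one blocking request per candidate.
$client = MediaWikiServices::getInstance()->getHttpRequestFactory()
    ->createMultiClient();

$requests = [];
foreach ( $candidateTitles as $i => $title ) {
    $requests[$i] = [
        'method' => 'POST',
        // Hypothetical endpoint and payload; the real request format may differ.
        'url' => $serviceUrl,
        'headers' => [ 'Content-Type' => 'application/json' ],
        'body' => json_encode( [ 'page_title' => $title->getPrefixedDBkey() ] ),
    ];
}

foreach ( $client->runMulti( $requests ) as $i => $request ) {
    $response = $request['response'];
    if ( (int)$response['code'] === 200 ) {
        // Turn the recommendation into a task for $candidateTitles[$i].
    }
}
```

The batch size would have to be tuned to the number of service workers actually available.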

A somewhat related issue is that the script turns task candidates into tasks, and this can get very slow if most task candidates are unsuitable (e.g. on a small wiki where we have already used up the good ones). Some form of the "do a limited amount of work then move on" approach might help here.
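One possible shape of the "do a limited amount of work then move on" idea applied to candidate evaluation; the cap and the helper methods are made up for illustration:

```php
// Illustrative only: stop evaluating candidates for a topic after a fixed budget,
// so a wiki where most candidates are unsuitable cannot monopolize the run.
$maxCandidatesPerTopicPerRun = 200; // assumed value, would need tuning

foreach ( $topics as $topic ) {
    $checked = 0;
    foreach ( $this->getCandidates( $topic ) as $candidate ) {
        if ( ++$checked > $maxCandidatesPerTopicPerRun ) {
            break;
        }
        if ( $this->isSuitable( $candidate ) ) {
            $this->storeTask( $candidate );
        }
    }
}
```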

Loosely related: there should be some kind of back-off when the service request fails, to avoid hammering it when it's overloaded.
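A minimal back-off sketch, assuming a hypothetical requestRecommendation() helper; the attempt count and delays are arbitrary placeholders:

```php
// Illustrative only: retry a failed service request with exponential back-off
// instead of immediately hammering a possibly overloaded service.
$maxAttempts = 5;
$delaySeconds = 1;
$recommendation = null;

for ( $attempt = 1; $attempt <= $maxAttempts; $attempt++ ) {
    $recommendation = $this->requestRecommendation( $candidate );
    if ( $recommendation !== null ) {
        break;
    }
    sleep( $delaySeconds );
    $delaySeconds = min( $delaySeconds * 2, 60 );
}
```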

Was just discussing this with @MMiller_WMF as a candidate for something to work on after the initial release, as we only need the task pool populated for our four target wikis. Given that, @Tgr what do you think about creating the tables in production & enabling link recommendations for those four wikis so we can start filling up the task pool now?

Makes sense. Let's get verbose logging deployed first though so we have a better idea of what the script is doing.

(Btw normal cronjob logging will probably get messed up by running multiple cronjobs in parallel, so we'll have to switch to Logstash before doing that. But for now a log file is easier to review.)

rEGREaaa55a21787e: refreshLinkRecommendations.php: Use per-wiki locks, which I forgot to link to this task, tried to parallelize the script to the extent that different wikis can run separate instances of it at the same time. It did not seem to work though; maybe there is some guard against it in the cronjob infrastructure itself.
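For context, a generic sketch of how a per-wiki lock can be taken using the database's named-lock support; this only illustrates the idea and is not necessarily how the linked patch implements it:

```php
// Illustrative only: an advisory lock scoped to the current wiki lets runs on
// different wikis proceed in parallel while preventing two overlapping runs on
// the same wiki. The lock name and timeout are assumptions.
$dbw = $this->getDB( DB_PRIMARY );
$lockName = 'GrowthExperiments-RefreshLinkRecommendations-' . WikiMap::getCurrentWikiId();

if ( !$dbw->lock( $lockName, __METHOD__, 0 ) ) {
    $this->output( "A previous run is still in progress for this wiki, exiting.\n" );
    return;
}
```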

> (Btw normal cronjob logging will probably get messed up by running multiple cronjobs in parallel, so we'll have to switch to Logstash before doing that. But for now a log file is easier to review.)

Or just put the wiki name in the logfile name, duh. (Would require a small change to how the job is puppetized, I think.)

> It did not seem to work though; maybe there is some guard against it in the cronjob infrastructure itself.

Of course there is, since it's not an actual cronjob but a systemd service. I suppose we'd have to use a service template instead?

This should probably go into a sprint sometime this summer, or in any case before we expand to more wikis beyond T284481: Deploy Add a link to the second set of wikis.

Change 730752 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/puppet@production] growthexperiments: Run refreshLinkRecommendations in parallel

https://gerrit.wikimedia.org/r/730752

I successfully managed to run updateMenteeData.php in parallel. Even though merely running refreshLinkRecommendations in parallel for each DB shard likely won't help much, it's an easy change that will improve things a bit. @Tgr @kostajh, your review would be appreciated.

Change 730752 merged by Legoktm:

[operations/puppet@production] growthexperiments: Run refreshLinkRecommendations in parallel

https://gerrit.wikimedia.org/r/730752

> Some options:
>
>   • Have the script schedule MediaWiki jobs. Might require the script to know about wikis other than the one it runs on, but not necessarily. More fragile and harder to debug, but fewer capacity limits on the MediaWiki side (one maintenance host vs. many job runners).

Without really understanding the limiting constraint here, this would be my recommendation for improving parallelization/concurrency, given that we can set rate limits per job type and the job queue has automatic retries in case the service is overloaded, etc.
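A rough sketch of what the job-based approach could look like; the job type name, parameters and surrounding loop are assumptions, not an existing GrowthExperiments job:

```php
use MediaWiki\MediaWikiServices;

// Illustrative only: the maintenance script merely enqueues small units of work
// (e.g. one topic per job); the job queue then provides per-job-type concurrency
// limits and automatic retries when the service is overloaded.
$jobQueueGroup = MediaWikiServices::getInstance()->getJobQueueGroupFactory()
    ->makeJobQueueGroup( $wikiId );

foreach ( $topics as $topicId ) {
    $jobQueueGroup->push( new JobSpecification(
        'refreshLinkRecommendationsForTopic', // hypothetical job type
        [ 'topic' => $topicId ],
        [ 'removeDuplicates' => true ]
    ) );
}
```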

Change 734565 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/puppet@production] growthexperiments.pp: Remove absented job

https://gerrit.wikimedia.org/r/734565

Change 734565 merged by Dzahn:

[operations/puppet@production] growthexperiments.pp: Remove absented job

https://gerrit.wikimedia.org/r/734565