
Parallelize GrowthExperiments' refreshLinkRecommendations.php
Open, Low, Public

Description

refreshLinkRecommendations.php processes wikis sequentially (or rather, the cronjob calling it via foreachwikiindblist does); within a wiki it goes through topics sequentially, and within a topic it goes through article candidates sequentially until enough tasks are found (a sketch of this flow follows the list below). There are two problems with that:

  • It cannot utilize the available service capacity (with several service instances, each having multiple workers, dozens of calls could be made in parallel without a performance penalty). This is a problem because requests to the service are the bottleneck in the script's running speed, and a significant one: preparing enough tasks for a new wiki would currently take several days.
  • When a new wiki is added, the script would only work on that wiki until it has enough tasks (which takes days), potentially causing the task pool on other wikis to dry up.
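To make the sequential structure concrete, here is a simplified sketch of the current control flow, written as if inside the maintenance script. The helper methods (countTasks(), getTargetTaskCount(), getNextCandidate(), requestRecommendation(), storeTask()) are illustrative assumptions, not the script's actual API:

```php
// Illustrative only: wikis are iterated externally by the cronjob (foreachwikiindblist);
// within one wiki, topics and candidates are processed strictly one at a time,
// with one blocking service request per candidate.
foreach ( $topics as $topic ) {
    while ( $this->countTasks( $topic ) < $this->getTargetTaskCount( $topic ) ) {
        $candidate = $this->getNextCandidate( $topic );
        if ( !$candidate ) {
            break;
        }
        // The HTTP request to the link recommendation service is the bottleneck.
        $recommendation = $this->requestRecommendation( $candidate );
        if ( $recommendation ) {
            $this->storeTask( $candidate, $recommendation );
        }
    }
}
```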

Event Timeline

Some options:

  • Parallel requests via MultiHttpClient (see the sketch after this list). This might help with the first problem but not the second (although if the speedup is sufficient, the second might stop being a problem).
  • Running the script for each wiki in parallel. We don't want to have to update puppet definitions all the time, but maybe something like GNU Parallel could be used? Probably not ideal in terms of CPU, though. Plus, at some point we'll have more wikis than service workers.
  • Have the script run in multiple processes via ForkController. Would require the script to know about wikis other than the one it runs on, which is possible (e.g. GlobalRename jobs do that) but not trivial.
  • Have the script schedule MediaWiki jobs. Might require the script to know about wikis other than the one it runs on, but not necessarily. More fragile and harder to debug, but fewer capacity limits on the MediaWiki side (one maintenance host vs. many job runners).
  • Have the script do only a limited amount of work per wiki or per topic, then move on. Helps with the second problem but not the first. Easy for topics, but wikis don't loop (unless we make the script know about wikis other than the one it runs on).
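A minimal sketch of the first option, batching service requests through MediaWiki's MultiHttpClient. The service endpoint, payload shape and response handling are assumptions for illustration, not the actual service API:

```php
use MediaWiki\MediaWikiServices;

// Illustrative only: send a batch of recommendation requests in parallel
// instead of one blocking request per candidate.
$client = MediaWikiServices::getInstance()->getHttpRequestFactory()
    ->createMultiClient();

$requests = [];
foreach ( $candidateTitles as $i => $title ) {
    $requests[$i] = [
        'method' => 'POST',
        // Hypothetical endpoint and payload; the real request format may differ.
        'url' => $serviceUrl,
        'headers' => [ 'Content-Type' => 'application/json' ],
        'body' => json_encode( [ 'page_title' => $title->getPrefixedDBkey() ] ),
    ];
}

foreach ( $client->runMulti( $requests ) as $i => $request ) {
    $response = $request['response'];
    if ( (int)$response['code'] === 200 ) {
        // Turn the recommendation into a task for $candidateTitles[$i].
    }
}
```

The batch size would have to be tuned to the number of service workers actually available.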

A somewhat related issue is that the script turns task candidates into tasks, and this can get very slow if most task candidates are unsuitable (e.g. on a small wiki where we have already used up the good ones). Some form of the "do a limited amount of work then move on" approach might help here.
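One possible shape of the "do a limited amount of work then move on" idea applied to candidate evaluation; the cap and the helper methods are made up for illustration:

```php
// Illustrative only: stop evaluating candidates for a topic after a fixed budget,
// so a wiki where most candidates are unsuitable cannot monopolize the run.
$maxCandidatesPerTopicPerRun = 200; // assumed value, would need tuning

foreach ( $topics as $topic ) {
    $checked = 0;
    foreach ( $this->getCandidates( $topic ) as $candidate ) {
        if ( ++$checked > $maxCandidatesPerTopicPerRun ) {
            break;
        }
        if ( $this->isSuitable( $candidate ) ) {
            $this->storeTask( $candidate );
        }
    }
}
```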

Loosely related: there should be some kind of back-off when the service request fails, to avoid hammering it when it's overloaded.
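A minimal back-off sketch, assuming a hypothetical requestRecommendation() helper; the attempt count and delays are arbitrary placeholders:

```php
// Illustrative only: retry a failed service request with exponential back-off
// instead of immediately hammering a possibly overloaded service.
$maxAttempts = 5;
$delaySeconds = 1;
$recommendation = null;

for ( $attempt = 1; $attempt <= $maxAttempts; $attempt++ ) {
    $recommendation = $this->requestRecommendation( $candidate );
    if ( $recommendation !== null ) {
        break;
    }
    sleep( $delaySeconds );
    $delaySeconds = min( $delaySeconds * 2, 60 );
}
```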

Was just discussing this with @MMiller_WMF as a candidate for something to work on after the initial release, as we only need the task pool populated for our four target wikis. Given that, @Tgr what do you think about creating the tables in production & enabling link recommendations for those four wikis so we can start filling up the task pool now?

Makes sense. Let's get verbose logging deployed first though so we have a better idea of what the script is doing.

(Btw normal cronjob logging will probably get messed up by running multiple cronjobs in parallel, so we'll have to switch to Logstash before doing that. But for now a log file is easier to review.)

rEGREaaa55a21787e: refreshLinkRecommendations.php: Use per-wiki locks, which I forgot to link to this task, tried to parallelize the script to the extent that different wikis can run separate instances of it at the same time. It did not seem to work though; maybe there is some guard against it in the cronjob infrastructure itself.
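For context, a generic sketch of how a per-wiki lock can be taken using the database's named-lock support; this only illustrates the idea and is not necessarily how the linked patch implements it:

```php
// Illustrative only: an advisory lock scoped to the current wiki lets runs on
// different wikis proceed in parallel while preventing two overlapping runs on
// the same wiki. The lock name and timeout are assumptions.
$dbw = $this->getDB( DB_PRIMARY );
$lockName = 'GrowthExperiments-RefreshLinkRecommendations-' . WikiMap::getCurrentWikiId();

if ( !$dbw->lock( $lockName, __METHOD__, 0 ) ) {
    $this->output( "A previous run is still in progress for this wiki, exiting.\n" );
    return;
}
```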

> (Btw normal cronjob logging will probably get messed up by running multiple cronjobs in parallel, so we'll have to switch to Logstash before doing that. But for now a log file is easier to review.)

Or just put the wiki name in the logfile name, duh. (Would require a small change to how the job is puppetized, I think.)

> It did not seem to work though; maybe there is some guard against it in the cronjob infrastructure itself.

Of course there is, since it's not an actual cronjob but a systemd service. I suppose we'd have to use a service template instead?

This should probably go into a sprint sometime this summer, or in any case before we expand to more wikis beyond T284481: Deploy Add a link to the second set of wikis.

Change 730752 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/puppet@production] growthexperiments: Run refreshLinkRecommendations in parallel

https://gerrit.wikimedia.org/r/730752

I successfully managed to run updateMenteeData.php in parallel. Even though merely running refreshLinkRecommendations in parallel for each DB shard likely won't help much, it's an easy change that will improve things a bit. @Tgr @kostajh, your review would be appreciated.

Change 730752 merged by Legoktm:

[operations/puppet@production] growthexperiments: Run refreshLinkRecommendations in parallel

https://gerrit.wikimedia.org/r/730752

> Some options:
>
>   • Have the script schedule MediaWiki jobs. Might require the script to know about wikis other than the one it runs on, but not necessarily. More fragile and harder to debug, but fewer capacity limits on the MediaWiki side (one maintenance host vs. many job runners).

Without really understanding the limiting constraint here, this would be my recommendation for improving parallelization/concurrency, given that we can set rate limits per job type and the job queue has automatic retries in case the service is overloaded, etc.
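A rough sketch of what the job-based approach could look like; the job type name, parameters and surrounding loop are assumptions, not an existing GrowthExperiments job:

```php
use MediaWiki\MediaWikiServices;

// Illustrative only: the maintenance script merely enqueues small units of work
// (e.g. one topic per job); the job queue then provides per-job-type concurrency
// limits and automatic retries when the service is overloaded.
$jobQueueGroup = MediaWikiServices::getInstance()->getJobQueueGroupFactory()
    ->makeJobQueueGroup( $wikiId );

foreach ( $topics as $topicId ) {
    $jobQueueGroup->push( new JobSpecification(
        'refreshLinkRecommendationsForTopic', // hypothetical job type
        [ 'topic' => $topicId ],
        [ 'removeDuplicates' => true ]
    ) );
}
```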

Change 734565 had a related patch set uploaded (by Urbanecm; author: Urbanecm):

[operations/puppet@production] growthexperiments.pp: Remove absented job

https://gerrit.wikimedia.org/r/734565

Change 734565 merged by Dzahn:

[operations/puppet@production] growthexperiments.pp: Remove absented job

https://gerrit.wikimedia.org/r/734565