
Reduce running time of refreshLinkRecommendations.php to a maximum of 60 minutes
Open, Medium, Public

Description

Long-running scripts that hold a DB connection open cause problems for some DB operations, such as taking a DB host out of load balancing, which require waiting for existing connections to expire. Make sure refreshLinkRecommendations.php (and possibly other Growth-maintained scripts) don't hold connections open for hours.

Event Timeline

Tgr added a subscriber: Ladsgroup.

The script iterates through various things (first the ORES topics, then for each topic up to 10K search results). Either we need to store iteration state somewhere (DB? object stash?), or we need to make the script queue a ton of jobs, or we need to turn the script into a master job that somehow survives for a long time and creates child jobs, GWToolset-style. None of those seem like appealing options.

I don't know how long it takes, but maybe you can queue a job for each topic. You could also run a child process with a --topic argument (see https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/86b7c3bb641384980b17a0c3edec36eea03220a2/addWiki.php#L185). Would that help?

Thanks! That might be a good way to store state between successive runs of scripts.

It has some overlaps with T278103: Parallelize GrowthExperiments' refreshLinkRecommendations.php.

kostajh triaged this task as Medium priority. Feb 7 2022, 10:39 AM

As an example, I depooled a host (db1120) so I could reboot it. 1.5h later, I'm still waiting for refreshLinkRecommendations to drop its connection so that I can actually do the reboot. I have no way of knowing when that will happen. I also don't know how bad/disruptive/etc. it would be to just reboot it anyway.

The script dropped the connection after 2h15m.

> As an example, I depooled a host (db1120) so I could reboot it. 1.5h later, I'm still waiting for refreshLinkRecommendations to drop its connection so that I can actually do the reboot.

Sorry for the script making you wait for 2h+ :-(. For the record/future, it is possible to restart the refreshing script by running systemctl restart mediawiki_job_growthexperiments-refreshLinkRecommendations-sX as root on the currently active maintenance server. This will likely need to happen for multiple DB sections for the script to fully restart itself. Once a restart is done, a new DB connection is acquired (letting you reboot the DB server without harming any connections).

On a related note, I'd like to mention a similar issue with the refresh script. It happened when @JMeybohm worked on T302719. As part of that task, @JMeybohm changed how the script accesses a service (r766780). To make the script use the newly-deployed config, a restart was needed.

As far as I know, it's safe to restart the script at any time (@Tgr would be able to confirm with certainty). Do we have a list of things that can be restarted more-or-less at any time to free connections/services/etc., for similar issues?

> I have no way of knowing when that will happen. I also don't know how bad/disruptive/etc. it would be to just reboot it anyway.

AFAIK, the script should be capable of dealing with various DB errors. It'll probably log a DB failure and retry at a later time. As long as the errors aren't persistent, it shouldn't be noticeable.

There is no easy way to make depooling work inside a running script; you basically need to restart it. What I do with my scripts is run them with a timeout and let the next run pick up from where the previous one left off.
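For illustration, here is a rough sketch of that timeout-and-resume pattern as a MediaWiki maintenance script. The object-stash cursor, the one-hour budget, and the getTopics()/processTopic() stubs are assumptions for the sketch, not the actual refreshLinkRecommendations.php implementation:

```lang=php
<?php
// Illustrative only – not the real refreshLinkRecommendations.php.

use MediaWiki\MediaWikiServices;

$IP = getenv( 'MW_INSTALL_PATH' ) ?: __DIR__ . '/../../..';
require_once "$IP/maintenance/Maintenance.php";

class RefreshWithTimeBudget extends Maintenance {
	// Assumed budget; exit before it is exhausted instead of running for hours.
	private const MAX_RUNTIME = 3600;

	public function execute() {
		$stash = MediaWikiServices::getInstance()->getMainObjectStash();
		$cursorKey = $stash->makeKey( 'refresh-link-recommendations', 'topic-cursor' );

		$topics = $this->getTopics();
		$start = (int)$stash->get( $cursorKey );
		$deadline = time() + self::MAX_RUNTIME;

		for ( $i = $start; $i < count( $topics ); $i++ ) {
			if ( time() >= $deadline ) {
				// Out of budget: remember where we stopped so the next
				// systemd-timer run can resume from this topic, then exit.
				$stash->set( $cursorKey, $i, 24 * 3600 );
				$this->output( "Time budget exhausted at topic #$i, exiting.\n" );
				return;
			}
			$this->processTopic( $topics[$i] );
		}
		$stash->delete( $cursorKey ); // full pass completed, start over next time
	}

	private function getTopics(): array {
		return []; // stand-in for the constant ORES topic list
	}

	private function processTopic( string $topic ): void {
		// stand-in for one topic's worth of candidate search + task generation
	}
}

$maintClass = RefreshWithTimeBudget::class;
require_once RUN_MAINTENANCE_IF_MAIN;
```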

If you want to follow what we are trying to do to make it work inside a running script, see T305016: Think about rdbms reconnection logic and T298485: MW scripts should reload the database config.

> As far as I know, it's safe to restart the script at any time (@Tgr would be able to confirm with certainty). Do we have a list of things that can be restarted more-or-less at any time to free connections/services/etc., for similar issues?

Yeah, it's safe to stop, and safe to prevent from running for a couple of hours. So to a first approximation this looks like a documentation issue (what would be a discoverable way of declaring the script interruptible?), although in the long term I think we should either implement forking as per T299021#7616781, or the load balancer should support getting some sort of "interruptible connection" as per T298485#7631017.

Mentioned in SAL (#wikimedia-operations) [2022-04-29T17:36:48Z] <Amir1> killed bnwiki's refresh links recommendation (T299021)

Mentioned in SAL (#wikimedia-operations) [2022-05-02T05:20:35Z] <Amir1> killed bnwiki's refresh links recommendation (T299021)

Mentioned in SAL (#wikimedia-operations) [2022-05-04T04:08:03Z] <Amir1> killed refresh job of rowiki (T299021)

> I don't know how long it takes, but maybe you can queue a job for each topic. You could also run a child process with a --topic argument (see https://github.com/wikimedia/mediawiki-extensions-WikimediaMaintenance/blob/86b7c3bb641384980b17a0c3edec36eea03220a2/addWiki.php#L185). Would that help?

Maintenance::runChild() doesn't seem to actually involve any child process; it just calls the execute() method of another Maintenance subclass. So it wouldn't help with reloading the configuration.

*Sigh* Why is it named "child"? Anyway, thanks for reminding me. We need to find another solution.
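For comparison, an actual OS-level child process (which would start with freshly loaded configuration and its own DB connections) could be spawned with plain proc_open(), along these lines. The script path, the --topic option, and the topic names are assumptions for illustration, not existing features:

```lang=php
<?php
// Illustrative sketch only: run one topic per child PHP process so that each
// invocation loads the current MediaWiki configuration from scratch.

function runTopicInChildProcess( string $topic ): int {
	$cmd = [
		PHP_BINARY,
		'extensions/GrowthExperiments/maintenance/refreshLinkRecommendations.php',
		'--topic', $topic,
	];
	$proc = proc_open( $cmd, [ 1 => STDOUT, 2 => STDERR ], $pipes );
	if ( !is_resource( $proc ) ) {
		throw new RuntimeException( "Failed to start child for topic $topic" );
	}
	// Wait for the child to finish; its DB connections die with the process.
	return proc_close( $proc );
}

foreach ( [ 'art', 'science', 'history' ] as $topic ) { // placeholder topic names
	$exitCode = runTopicInChildProcess( $topic );
	if ( $exitCode !== 0 ) {
		fwrite( STDERR, "Child for topic $topic exited with $exitCode\n" );
	}
}
```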

Mentioned in SAL (#wikimedia-operations) [2022-05-17T13:02:52Z] <Amir1> killed cawiki's refreshLinkRecommendations.php (T299021)

Mentioned in SAL (#wikimedia-operations) [2022-05-27T09:24:41Z] <Amir1> killed hewiki's refresh link suggestions job (T299021)

Mentioned in SAL (#wikimedia-operations) [2022-05-31T12:59:42Z] <Amir1> killed kowiki's refreshLinkRecommendations.php (T299021)

Mentioned in SAL (#wikimedia-operations) [2022-06-09T09:26:45Z] <Amir1> killed enwiki's refreshlinksrecommandations (T299021)

Mentioned in SAL (#wikimedia-operations) [2022-06-30T13:40:05Z] <Amir1> killed refreshLinkRecommendations.php on arzwiki (T299021)

Mentioned in SAL (#wikimedia-operations) [2022-07-15T09:38:12Z] <Amir1> killed refreshLinkRecommendations.php in testwiki (T299021)

Mentioned in SAL (#wikimedia-operations) [2022-08-19T07:20:33Z] <Amir1> killing cswiki's refreshlinksrecom script T299021

Mentioned in SAL (#wikimedia-operations) [2022-08-24T12:01:52Z] <Amir1> killed refresh links-recomm scripts in rowiki, cswiki, simplewiki, frwiki (T299021)

Maybe we can reduce the value set for FIELD_MIN_TASKS_PER_TOPIC to something like 100; currently it is 500. @KStoller-WMF in practice, that would mean the task pool size for link recommendations would go to something like ~3900 instead of ~19500 per wiki. Based on how many link recommendation tasks newcomers are doing, the reduced limit might be OK; what do you think?
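(For reference, those figures imply the script iterates over roughly 39 topics: 39 × 500 = 19500 and 39 × 100 = 3900 tasks per wiki.)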

@Tgr @Sgs we could also think about queuing a job when a link recommendation is invalidated. We invalidate link recommendation tasks when an article is edited, for example, and we could start a job to regenerate link recommendation metadata for that article. We'd just exclude the case where the most recent edit was a link recommendation. That might also help reduce the execution time of refreshLinkRecommendations.php, since we'd reach the task pool limit faster.
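A hedged sketch of what that could look like, hooked into page saves. The job name, its parameters, and the isAddLinkEdit() check are made up for illustration (they are not existing GrowthExperiments code), and the extension.json registration is omitted:

```lang=php
<?php
// Illustrative sketch: queue a (hypothetical) regeneration job whenever an
// edit invalidates the stored link recommendation for a page.

use MediaWiki\MediaWikiServices;
use MediaWiki\Storage\Hook\PageSaveCompleteHook;

class LinkRecommendationInvalidationHandler implements PageSaveCompleteHook {

	public function onPageSaveComplete( $wikiPage, $user, $summary, $flags,
		$revisionRecord, $editResult
	) {
		// Skip the case called out above: the edit itself was an "add link" edit.
		if ( $this->isAddLinkEdit( $editResult ) ) {
			return;
		}
		$job = new JobSpecification(
			'regenerateLinkRecommendation', // hypothetical job name
			[ 'pageId' => $wikiPage->getId() ],
			[ 'removeDuplicates' => true ],
			$wikiPage->getTitle()
		);
		MediaWikiServices::getInstance()->getJobQueueGroupFactory()
			->makeJobQueueGroup()->lazyPush( $job );
	}

	private function isAddLinkEdit( $editResult ): bool {
		// Placeholder: in reality this would inspect the edit's change tags.
		return false;
	}
}
```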

> Maybe we can reduce the value set for FIELD_MIN_TASKS_PER_TOPIC to something like 100; currently it is 500. @KStoller-WMF in practice, that would mean the task pool size for link recommendations would go to something like ~3900 instead of ~19500 per wiki. Based on how many link recommendation tasks newcomers are doing, the reduced limit might be OK; what do you think?

My only hesitation is that I imagine users will be less likely to receive link results when limiting results with the "And" selection (T305408). But if this change is needed for performance reasons, then I'm not opposed.

Per discussion with @Ladsgroup, this is something we should do regardless of whether the DB connection layer supports dynamic configuration reloading (which seems farther away), as in general the long-term vision for MediaWiki seems to involve short-running maintenance scripts (which make containerization and updates simpler).

@Ladsgroup is there some limit that we should target for maximum run time of the script?

> @Ladsgroup is there some limit that we should target for maximum run time of the script?

Maximum one hour; otherwise auto_schema starts repooling the depooled host without running the schema change.

> @Ladsgroup is there some limit that we should target for maximum run time of the script?

> Maximum one hour; otherwise auto_schema starts repooling the depooled host without running the schema change.

Thanks! Do we have any data on how long the scripts are currently taking? Maybe we should instrument that, so we can keep an eye on it alongside other metrics on the task pool section of our Special:Homepage dashboard.

kostajh renamed this task from Shorten running time of refreshLinkRecommendations.php to Reduce running time of refreshLinkRecommendations.php to a maximum of 60 minutes. Sep 15 2022, 12:30 PM

We don't (it depends on how quickly users consume tasks, whether we are opening a new wiki, etc.), but you can sort of eyeball it by looking at task pool size changes.

> We don't (it depends on how quickly users consume tasks, whether we are opening a new wiki, etc.), but you can sort of eyeball it by looking at task pool size changes.

We can also review the logs under mwmaint1002:/var/log/mediawiki/mediawiki_job_growthexperiments-refreshLinkRecommendations-s*/ manually. They are timestamped, but granted, the formatting doesn't make it easy to tell how long a run took on a given day.

> Maybe we can reduce the value set for FIELD_MIN_TASKS_PER_TOPIC to something like 100; currently it is 500. @KStoller-WMF in practice, that would mean the task pool size for link recommendations would go to something like ~3900 instead of ~19500 per wiki. Based on how many link recommendation tasks newcomers are doing, the reduced limit might be OK; what do you think?

> @Tgr @Sgs we could also think about queuing a job when a link recommendation is invalidated. We invalidate link recommendation tasks when an article is edited, for example, and we could start a job to regenerate link recommendation metadata for that article. We'd just exclude the case where the most recent edit was a link recommendation. That might also help reduce the execution time of refreshLinkRecommendations.php, since we'd reach the task pool limit faster.

I think those are both viable, but the more general solution would be to figure out how we can chunk processing so that the script is restarted and the configuration reloaded every few minutes.

At a high level, the script works like this (a simplified code sketch follows the list):

  • iterate through ORES topics (a constant array)
    • calculate the number of new tasks required for that topic
    • iterate through a CirrusSearch query that returns up to 500 potential task candidates, generating tasks from those that meet various criteria
    • repeat the previous step until we have generated enough tasks or we weren't able to generate any new tasks in the last iteration
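In other words, roughly (a simplified paraphrase, with stand-in helper names rather than the real methods):

```lang=php
<?php
// Simplified paraphrase of the flow described above; the helper functions are
// stand-ins for the real logic, not actual GrowthExperiments methods.

$oresTopics = [ /* the constant array of ORES topics */ ];

foreach ( $oresTopics as $topic ) {
	$needed = tasksRequiredForTopic( $topic );         // stand-in
	while ( $needed > 0 ) {
		// One CirrusSearch query returning up to 500 candidate articles.
		$candidates = searchCandidates( $topic, 500 ); // stand-in
		$generatedThisRound = 0;
		foreach ( $candidates as $candidate ) {
			if ( meetsTaskCriteria( $candidate ) ) {   // stand-in for the various checks
				storeLinkRecommendationTask( $candidate, $topic ); // stand-in
				$needed--;
				$generatedThisRound++;
				if ( $needed === 0 ) {
					break;
				}
			}
		}
		if ( $generatedThisRound === 0 ) {
			break; // nothing new could be generated this iteration; move to the next topic
		}
	}
}
```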

Basically we would either need to keep track of which topic we are processing (the CirrusSearch part doesn't use continuation, although that's only because we needed random-looking results and pseudo-random sorting wasn't available; with c712994 merged, we could improve that and possibly cut down significantly on task generation time), or do each topic in a separate run. I wrote some thoughts in T317290#8238724 in the context of chunking jobs:

> GWToolset has an example of a parent job periodically rescheduling itself and scheduling child jobs to do chunks of the work; I'm not sure it's considered to be a good pattern, though. Another approach is the one taken by GlobalRename, where status is tracked in a DB table and jobs reschedule their own follow-ups until the status table is cleared. That's also not necessarily a good example, as it can be fragile; but maybe a similar approach using a maintenance script to schedule jobs would work.

I think for a maintenance script, "rescheduling itself" is a much less problematic pattern: you fork child processes, make sure the parent process does not use DB connections, and each child can do a small batch of task generation attempts and use a socket to send back results. Locking would also have to happen in the child, since that also opens a DB connection.
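A minimal sketch of that parent/child shape, using pcntl_fork() and a Unix socket pair; the doTaskGenerationBatch() callback and the JSON result format are assumptions for illustration:

```lang=php
<?php
// Illustrative sketch: the parent keeps no DB connections; each forked child
// does one small batch (opening its own DB connection and taking its own
// locks) and reports back over a socket pair before exiting.

function runBatchInChild( callable $doTaskGenerationBatch ): array {
	$pair = stream_socket_pair( STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP );
	if ( $pair === false ) {
		throw new RuntimeException( 'socketpair failed' );
	}
	[ $parentSock, $childSock ] = $pair;
	$pid = pcntl_fork();
	if ( $pid === -1 ) {
		throw new RuntimeException( 'fork failed' );
	}
	if ( $pid === 0 ) {
		// Child: do the batch, send the result back, and exit without
		// returning to the parent loop.
		fclose( $parentSock );
		$result = $doTaskGenerationBatch();
		fwrite( $childSock, json_encode( $result ) . "\n" );
		fclose( $childSock );
		exit( 0 );
	}
	// Parent: just wait for the child's one-line JSON report.
	fclose( $childSock );
	$line = fgets( $parentSock ) ?: '';
	fclose( $parentSock );
	pcntl_waitpid( $pid, $status );
	return json_decode( $line, true ) ?? [];
}

// Usage sketch: keep forking small batches until enough tasks exist.
// $report = runBatchInChild( static fn () => [ 'generated' => 25 ] );
```

Since each child exits after one batch, its DB connection never outlives that batch, so a depool would only ever have to wait for the batch currently in flight.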

So that also seems viable, with no user-visible change, although at the cost of some added complexity.

This is still nice to have, as long-running scripts can cause a number of different issues, but per T298485: MW scripts should reload the database config, the immediate problem caused by the script (that it interfered with the standard DB server depooling process) is now fixed in a different way.

Yeah, I think the one-hour limit is no longer necessary; from the DB side of things, this is fixed.

Now that we have config reloading in place, running and stable, this can be increased to a larger number, maybe up to six hours; more than that is not recommended, as changes to general configuration, MediaWiki code, etc. should be reflected as soon as possible.