
Job queue rising to nearly 3 million jobs
Closed, Resolved, Public

Event Timeline

wikidatawiki has 2,728,526 htmlCacheUpdate jobs queued.

Legoktm triaged this task as Unbreak Now! priority. Mar 4 2017, 7:28 PM
Legoktm added a project: Wikidata.

00:47, 5 March 2017 Sjoerddebruin (talk | contribs) blocked Emijrpbot (talk | contribs) with an expiration time of indefinite (account creation disabled, autoblock disabled) (Please respect the bot policy, too high editing rate)

https://www.wikidata.org/wiki/User_talk:Emijrp#Slow_down_your_bot_please

It was going 600+ epm.

Job queue appears to be dropping now...

Suggestion: when the job queue gets too high, set the maxlag parameter to a higher value; most bots use that as a throttle.
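For context, this is roughly the client side of that mechanism: a minimal Python sketch of a maxlag-aware API request, assuming the action API's standard behaviour of returning a maxlag error (with a Retry-After header) when the server reports too much lag. The endpoint, retry count, and fallback wait below are illustrative, not anything a specific bot actually uses.

```
import time
import requests

API = "https://www.wikidata.org/w/api.php"

def api_get(params, maxlag=5, max_retries=5):
    """Send an API request with maxlag set; back off while the server reports lag."""
    params = dict(params, maxlag=maxlag, format="json")
    for _ in range(max_retries):
        resp = requests.get(API, params=params)
        data = resp.json()
        if data.get("error", {}).get("code") != "maxlag":
            return data
        # Server says it is lagged beyond our maxlag threshold:
        # honour Retry-After if present, otherwise wait a few seconds.
        time.sleep(int(resp.headers.get("Retry-After", 5)))
    raise RuntimeError("Server stayed lagged; giving up for now")

# Example: a read request that pauses automatically when lag is reported.
# data = api_get({"action": "query", "meta": "siteinfo"})
```

If the server reported a higher lag value while the job queue is overloaded, every client running a loop like this would back off without any per-bot changes.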

Legoktm lowered the priority of this task from Unbreak Now! to High. Mar 5 2017, 6:02 AM

Going down slowly...

All my bots follow the maxlag policy, as defined by default in Pywikibot user-config.py.

Can you add a ratelimit to your bot? 20-30 epm would be reasonable for now.

Done. I set put_throttle = 3 (seconds). Anyway, I think I will wait for the job queue to go down.
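For reference, both knobs mentioned in this exchange live in Pywikibot's user-config.py. A short illustrative excerpt (the values here are examples, not necessarily the bot's actual configuration):

```
# user-config.py (excerpt; read by Pywikibot, values here are illustrative)

# Minimum number of seconds to wait between write (put) requests,
# i.e. the rate limit discussed above.
put_throttle = 3

# Send maxlag=5 with every API request, so the bot pauses automatically
# whenever the servers report lag above this threshold.
maxlag = 5
```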

@Lydia_Pintscher not really; I'm monitoring the job queue and it's steadily decreasing in size. We should be ok.

Anything else we need to do here?

@Joe @Lydia_Pintscher Is the Betacommand suggestion feasible?

I think it might be worth attempting to determine the factors that led to the rapid rise.

  • The edit rate didn't seem that high, and several bots running together could easily produce the same rate.
  • The number of sitelinks on these items might have increased the impact on the job queue. Merely creating new items without any sitelinks probably has a lower impact.

@Emijrp For how long did it run at that rate?

@Esc3300 About 3 days (72 hours), at an edit rate of roughly 600 epm; I made about 2.5 million edits. Just a note: my bot's edits add descriptions in dozens of languages.[1] I don't know whether that makes the jobs heavier.

[1] https://www.wikidata.org/w/index.php?title=Special:Contributions/Emijrpbot&dir=prev&offset=20170305004655&target=Emijrpbot

Many seem to be descriptions for category items (something that might not be of much use to Wikipedia).

On a few items I checked, some languages use only the sitelink (en, fr), while others use all the entity data (sv, es). Maybe this is linked to descriptions from Wikidata being used there.

I think it might be worth attempting to determine the factors that led to the rapid rise.

  • The edit rate didn't seem that high, and several bots running together could easily produce the same rate.

I disagree. According to https://wikipulse.herokuapp.com/, the edit rate on Wikidata right now is anywhere from 150-220 epm. Adding an extra 600 epm on top of that is what caused the problem.

The edit rate may have been the issue, but we should still use the tools we have (maxlag) to notify bots that the servers are under high load. If we added a check to the maxlag calculation that looks at the number of JobQueue entries and raises the reported maxlag accordingly, it would prevent bots from causing this issue again. Regardless of whether it's one bot or several causing the spike, the existing maxlag checks could be used to tell all bots to back off.
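A conceptual sketch of that idea follows. This is Python pseudocode with invented thresholds, not actual MediaWiki code (the real maxlag handling lives in PHP); it only illustrates folding the job queue size into the lag value compared against a client's maxlag parameter.

```
# Conceptual sketch only; the limits and scaling factor are invented for
# illustration and are not real MediaWiki configuration values.
JOB_QUEUE_SOFT_LIMIT = 1_000_000   # hypothetical backlog size we tolerate
JOBS_PER_LAG_SECOND = 100_000      # hypothetical conversion of backlog to "lag"

def effective_lag(replication_lag: float, job_queue_size: int) -> float:
    """Lag value to compare against the maxlag a client sent."""
    if job_queue_size <= JOB_QUEUE_SOFT_LIMIT:
        return replication_lag
    # Inflate the reported lag in proportion to the backlog, so bots that
    # send maxlag=5 start backing off while the queue drains.
    queue_lag = (job_queue_size - JOB_QUEUE_SOFT_LIMIT) / JOBS_PER_LAG_SECOND
    return max(replication_lag, queue_lag)

def should_reject(client_maxlag: float, replication_lag: float, job_queue_size: int) -> bool:
    """True if the request should get a maxlag error instead of being served."""
    return effective_lag(replication_lag, job_queue_size) > client_maxlag

# With ~2.7 million queued jobs and these made-up numbers, effective_lag
# reports roughly 17 s, so clients sending maxlag=5 would back off.
```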

Legoktm claimed this task.

If we added a check to the maxlag calculation that looks at the number of JobQueue entries and raises the reported maxlag accordingly, it would prevent bots from causing this issue again.

Sounds reasonable. Can you file a task for that?

Closing this task as the job queue has now returned to normal levels.