
Job queue rising to nearly 3 million jobs
Closed, Resolved, Public

Event Timeline

wikidatawiki has 2,728,526 htmlCacheUpdate jobs queued.

Legoktm triaged this task as Unbreak Now! priority. Mar 4 2017, 7:28 PM
Legoktm added a project: Wikidata.

00:47, 5 March 2017 Sjoerddebruin (talk | contribs) blocked Emijrpbot (talk | contribs) with an expiration time of indefinite (account creation disabled, autoblock disabled) (Please respect the bot policy, too high editing rate)

https://www.wikidata.org/wiki/User_talk:Emijrp#Slow_down_your_bot_please

It was going 600+ epm.

Job queue appears to be dropping now...

Suggestion: when the job queue gets too high, set the maxlag parameter to a higher value; most bots use that as a throttle.
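For context, this is roughly the client side of that mechanism: a minimal Python sketch of a maxlag-aware API request, assuming the action API's standard behaviour of returning a maxlag error (with a Retry-After header) when the server reports too much lag. The endpoint, retry count, and fallback wait below are illustrative, not anything a specific bot actually uses.

```
import time
import requests

API = "https://www.wikidata.org/w/api.php"

def api_get(params, maxlag=5, max_retries=5):
    """Send an API request with maxlag set; back off while the server reports lag."""
    params = dict(params, maxlag=maxlag, format="json")
    for _ in range(max_retries):
        resp = requests.get(API, params=params)
        data = resp.json()
        if data.get("error", {}).get("code") != "maxlag":
            return data
        # Server says it is lagged beyond our maxlag threshold:
        # honour Retry-After if present, otherwise wait a few seconds.
        time.sleep(int(resp.headers.get("Retry-After", 5)))
    raise RuntimeError("Server stayed lagged; giving up for now")

# Example: a read request that pauses automatically when lag is reported.
# data = api_get({"action": "query", "meta": "siteinfo"})
```

If the server reported a higher lag value while the job queue is overloaded, every client running a loop like this would back off without any per-bot changes.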

Legoktm lowered the priority of this task from Unbreak Now! to High. Mar 5 2017, 6:02 AM

Going down slowly...

All my bots follow the maxlag policy, as defined by default in Pywikibot user-config.py.

Can you add a ratelimit to your bot? 20-30 epm would be reasonable for now.

Done. I set put_throttle = 3 (seconds). Anyway, I think I will wait for the job queue to go down.
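For reference, both knobs mentioned in this exchange live in Pywikibot's user-config.py. A short illustrative excerpt (the values here are examples, not necessarily the bot's actual configuration):

```
# user-config.py (excerpt; read by Pywikibot, values here are illustrative)

# Minimum number of seconds to wait between write (put) requests,
# i.e. the rate limit discussed above.
put_throttle = 3

# Send maxlag=5 with every API request, so the bot pauses automatically
# whenever the servers report lag above this threshold.
maxlag = 5
```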

@Lydia_Pintscher not really; I'm monitoring the job queue and it's steadily decreasing in size. We should be ok.

Anything else we need to do here?

@Joe @Lydia_Pintscher Is the Betacommand suggestion feasible?

I think it might be worth attempting to determine the factors that led to the rapid rise.

  • The edit rate didn't seem that high, and several bots running together could easily produce the same rate.
  • The number of sitelinks on these items might have increased the impact on the job queue. Merely creating new items without any sitelinks probably has a lower impact.

@Emijrp For how long did it run at that rate?

@Esc3300 About 3 days (72 hours), at an edit rate of roughly 600 epm; I made about 2.5 million edits. Just a note: my bot's edits add descriptions in dozens of languages.[1] I don't know whether that makes the jobs heavier.

[1] https://www.wikidata.org/w/index.php?title=Special:Contributions/Emijrpbot&dir=prev&offset=20170305004655&target=Emijrpbot

Many seem to be descriptions for category items (something that might not be of much use to Wikipedia).

On a few items I checked, some languages use only the sitelink (en, fr), while others use all the entity data (sv, es). Maybe this is linked to descriptions from Wikidata being used there.

I think it might be worth attempting to determine the factors that led to the rapid rise.

  • The edit rate didn't seem that high, and several bots running together could easily produce the same rate.

I disagree. According to https://wikipulse.herokuapp.com/, the edit rate on Wikidata right now is anywhere from 150-220 epm. Adding an extra 600 epm on top of that is what caused the problem.

The edit rate may have been the issue, but we should still use the tools we have (maxlag) to notify bots that the servers are under high load. If we added a check to the maxlag calculation that looks at the number of JobQueue entries and raises the reported maxlag accordingly, it would prevent bots from causing this issue again. Regardless of whether it's one bot or several causing the spike, the existing maxlag checks could be used to tell all bots to back off.
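A conceptual sketch of that idea follows. This is Python pseudocode with invented thresholds, not actual MediaWiki code (the real maxlag handling lives in PHP); it only illustrates folding the job queue size into the lag value compared against a client's maxlag parameter.

```
# Conceptual sketch only; the limits and scaling factor are invented for
# illustration and are not real MediaWiki configuration values.
JOB_QUEUE_SOFT_LIMIT = 1_000_000   # hypothetical backlog size we tolerate
JOBS_PER_LAG_SECOND = 100_000      # hypothetical conversion of backlog to "lag"

def effective_lag(replication_lag: float, job_queue_size: int) -> float:
    """Lag value to compare against the maxlag a client sent."""
    if job_queue_size <= JOB_QUEUE_SOFT_LIMIT:
        return replication_lag
    # Inflate the reported lag in proportion to the backlog, so bots that
    # send maxlag=5 start backing off while the queue drains.
    queue_lag = (job_queue_size - JOB_QUEUE_SOFT_LIMIT) / JOBS_PER_LAG_SECOND
    return max(replication_lag, queue_lag)

def should_reject(client_maxlag: float, replication_lag: float, job_queue_size: int) -> bool:
    """True if the request should get a maxlag error instead of being served."""
    return effective_lag(replication_lag, job_queue_size) > client_maxlag

# With ~2.7 million queued jobs and these made-up numbers, effective_lag
# reports roughly 17 s, so clients sending maxlag=5 would back off.
```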

Legoktm claimed this task.

If we added a check to the maxlag calculation that looks at the number of JobQueue entries and raises the reported maxlag accordingly, it would prevent bots from causing this issue again.

Sounds reasonable. Can you file a task for that?

Closing this task as the job queue has now returned to normal levels.