
Wikidata Dispatcher and Job Queue are overflowing
Closed, ResolvedPublic

Description

On Wikidata, the job queue is overflowing.

  1. In Special:DispatchStats, enwiki is behind by 3 days!
  2. In Grafana, there are 1.3 million queued jobs.

As Wikidata grows, the job rate keeps increasing rapidly every day.

I think the solution is to dedicate more resources to Wikidata as the number of jobs increases.

Event Timeline


This problem should be solved ASAP.

This is likely a duplicate of several existing tickets, but I don't have them at hand right now.

For example T151681, T110528 for (some of) the dispatch issues.

I don't know how many server resources are dedicated to Wikidata (CPU, memory, etc.), but could they be increased? Perhaps by setting up another job server?

In the future (maybe a year?) more data will be integrated into Wikidata (Commons structured data and Wiktionary lexemes). How is Wikidata going to support that huge amount of new data and bot activity?

1 million queued jobs is absolutely normal. The problem lies somewhere else.

1 million queued jobs is absolutely normal. The problem lies somewhere else.

It is only 1 million because we have shut down almost all bots. It was 3 million a few days ago.[1]

If you check the Wikidata edits Grafana dashboard, you will see how the Wikidata community has reduced its activity[2] to shrink the job queue. I smell a bottleneck here, but the answer we always get is "slow down your bots". We cannot keep Wikidata items updated by running a bunch of bots at a few dozen edits per minute. Wikidata has 30 million items; it is six times larger than English Wikipedia.

[1] https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1
[2] https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&orgId=1

My bot has 25 million edits on Wikidata, so I know the scale. The problem is that we ran too fast for a while, and now the backlog is so big that it takes some time for the infrastructure to work through it. We should have more resources for Wikidata, but resources alone can't fix this; better software is also needed, and we are working on it, for example T151681: DispatchChanges: Avoid long-lasting connections to the master DB

We indeed lowered the number of edits to at least try to make the process go faster. Let's hope this is a wake-up call for all of us.

As more data gets added, the job queue is going to grow even larger, and more and more bots are going to be added. The day that Wiktionary words are added to items, Wikidata is going to overflow.

The core problem is that Wikidata is poorly organized as a community and project. Overall, a big part of the work done is inefficient and counterproductive. At the very least, the addition of descriptions/labels could and should be optimized. Many users act just like bots: using mass-editing tools on their own, they add a single description in a single language to series of items spanning tens of thousands of items. Over the last month, four human users together made 3 million mass-edits just adding descriptions in their native languages (one of them made over 900,000 mass-edits in 10 days to add descriptions). I can show you large series of items where each item has over 50 consecutive edits made by one user in a short time span, just to add descriptions or identical labels in several languages. I'm pretty sure that for our 30+ million items we don't want 200+ edits per item just for the addition of descriptions/labels.

To avoid confusion: dispatch lag is the time that changes sit around before they go into the client wikis' job queue. The time they spend in the job queue is not the issue here! Changes "sit around" because finding the changes relevant for a given wiki, and pushing them to that wiki's job queue, takes time. We are dispatching changes to over 800 wikis.

In any case: one thing we could do is skip old changes. Changes older than a day are unlikely to be seen on the client wiki's recentchanges feed anyway. Or perhaps, instead of skipping, they could go to a slow propagation queue, so new changes are pushed out quickly, but "stale" changes still get propagated eventually.
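Roughly, the "slow propagation queue" idea could look like the sketch below (Python pseudocode with made-up names and a hypothetical change format; the real dispatcher is a PHP maintenance script, so this only illustrates the routing logic):

```
import time

# Illustration only: changes newer than MAX_AGE go to the normal per-wiki
# queue, older ("stale") changes are diverted to a low-priority queue that
# is drained separately, so fresh changes are not stuck behind the backlog.
MAX_AGE_SECONDS = 24 * 3600

def dispatch_batch(changes, client_wiki, fast_queue, slow_queue, now=None):
    """Split a batch of pending changes for one client wiki by age."""
    now = now if now is not None else time.time()
    for change in changes:
        if not change["affects"](client_wiki):
            continue  # this wiki does not use the changed entity
        if now - change["timestamp"] > MAX_AGE_SECONDS:
            slow_queue.append(change)   # propagate eventually
        else:
            fast_queue.append(change)   # propagate promptly

# Toy usage:
fast, slow = [], []
changes = [
    {"timestamp": time.time() - 60,        "affects": lambda wiki: True},
    {"timestamp": time.time() - 3 * 86400, "affects": lambda wiki: True},
]
dispatch_batch(changes, "enwiki", fast, slow)
print(len(fast), "fresh,", len(slow), "stale")   # -> 1 fresh, 1 stale
```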

How sure are we that this problem was predominantly caused by the recently "high" edit rate (or an undesired number of parallel bot runs) at Wikidata? At Wikidata we are trying to get edit rates down, but I am uncomfortable with the notion that Wikidata already operates at the edge of what is technically possible. Recent activity might have been pretty high, but not so high that I would have expected problems of this impact.

I would like to point to the dispatch graphs [1] in Grafana as well. If you look closely, there seems to be a very clear singularity in many of the graphs on 2017-06-28, between 21:15 and 21:30 (Grafana time; no idea whether this is UTC). What happened at that time? If there had been a software or hardware failure, it would probably have been noticed by today, but this does not seem to be the case according to the Grafana charts. I thus speculate that some very expensive new software was deployed at that point, or an update that is badly inefficient compared to the previous state. The history of T151681 (and maybe other tasks) indicates that there was in fact activity around the dispatcher at about that time.

[1] https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&orgId=1

@MisterSynergy I agree that a software or config change is a prime candidate for causing an issue like this. But as far as I can tell from https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2017-06-28 there were no relevant changes deployed on 2017-06-28.

The lag started to increase around 30th July; around that day, the great import of cebwiki items started as well. Now look at when the sharp increase in items ends and when the increase in lag ends. The difference is 4 days, which is the number at the peak of the lag graph. In other words, the mass import of new items (not only from cebwiki) is what made the dispatching slower.

I wouldn't say it makes dispatching slower; the rate is essentially the same, there is just more to dispatch, and currently the dispatch system can only handle so many changes per second / minute / hour.

I'm so happy to see tons of donors' money (bandwidth, database storage, and software engineer time) being wasted on articles that will never be read.

Just note that the median dispatch lag is not horrible (2 minutes) and only English Wikipedia and cebwiki are lagging behind. So the fastest and hackiest solution, in my opinion, is to set up dedicated dispatching just for English Wikipedia. Let me see how hard that will be.

Change 366887 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: increase the maximum time of dispatchChanges cronjob

https://gerrit.wikimedia.org/r/366887

It's not possible to dispatch to one wiki only, but my patch increases the maximum run time of each dispatching job, which means two things: 1) we can have four dispatcher instances instead of three, which is good; 2) if dispatching to big wikis times out because the batch is too large, this might help with that.

Maybe we can run it once manually from terbium? Since it pushes jobs to the job queue, that might not be the best idea in the world.

PokestarFan lowered the priority of this task from High to Medium.Jul 22 2017, 5:12 PM

The job queue is fixed for now and the dispatch queue is catching up; this can be fixed later.

Lucas_Werkmeister_WMDE raised the priority of this task from Medium to High.EditedJul 24 2017, 2:20 PM

Reverting to High. The main reason the dispatch queue is catching up is that the Wikidata administrators asked bot operators to reduce their operations. Once the bots resume, we might get the same problem again (in fact, I'm a bit afraid it might be worse if everyone starts again at the same time, trying to catch up on their own week of backlog as fast as possible).

@PokestarFan, please refrain from re-prioritizing issues, that’s not for you to decide.

More good progress in the last 24 hours. I will notify the community when things are really down; then we need to keep watching whether the dispatch lag increases too fast again. The editing speed of users is visible here.

Still interesting spikes happening. The stalest wiki is wikidatawiki for some reason; is that normal?

Could the situation be improved by limiting the type of changes that are dispatched to various wikis?

  • I noticed that in some wikis, en labels are systematically subscribed to, but not displayed (possibly some inefficiency in their Module:Wikidata).
  • Some bots just "update items", others specify language of labels/descriptions. Could dispatching be improved by systematically indicating language and label?
  • Is there much demand for the recent changes feed in client wikis (other than update of displayed statements)? Personally, I find it hard to read, even for my own edits.

Could the situation be improved by limiting the type of changes that are dispatched to various wikis?

We only dispatch changes to wikis that use the given item. Further filtering, as suggested below, happens on the client side. We could also do it on the repo, but that would not reduce the load, just concentrate it in one place - whatever data needs to be loaded and whatever code needs to run for the filtering, needs to run anyway - on the repo or the client.
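To make the "only dispatch to wikis that use the item" part concrete, here is a minimal sketch (Python, with a hypothetical in-memory subscription map; in Wikibase this information lives in a subscription table on the repo):

```
# Sketch of the repo-side selection described above: a change is only
# pushed to the job queues of wikis subscribed to the changed entity.
# The subscription data here is made up for illustration.
subscriptions = {
    "Q42": {"enwiki", "dewiki", "frwiki"},
    "Q64": {"dewiki"},
}

def target_wikis(entity_id, all_client_wikis):
    """Return the client wikis a change to entity_id must be dispatched to."""
    return subscriptions.get(entity_id, set()) & set(all_client_wikis)

print(target_wikis("Q42", ["enwiki", "dewiki", "cebwiki"]))
# -> enwiki and dewiki; any finer-grained filtering happens on the client
# after the ChangeNotification job is delivered.
```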

  • I noticed that in some wikis, en labels are systematically subscribed to, but not displayed (possibly some inefficiency in their Module:Wikidata).

This is because of language fallback - "en" is the fallback for all languages.
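A simplified sketch of how that fallback plays out (Python; the chain shown is illustrative, not the exact MediaWiki fallback configuration):

```
# When a label is missing in the wiki's own language, the fallback chain is
# consulted, and "en" sits at the end of (almost) every chain, so the English
# label ends up being used - and therefore subscribed to.
labels = {"Q42": {"en": "Douglas Adams"}}       # no "ast" label
fallback_chain = {"ast": ["ast", "es", "en"]}   # simplified, assumed chain

def resolve_label(entity_id, language):
    for lang in fallback_chain.get(language, [language, "en"]):
        label = labels.get(entity_id, {}).get(lang)
        if label is not None:
            return lang, label
    return None, None

print(resolve_label("Q42", "ast"))   # -> ('en', 'Douglas Adams')
```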

  • Some bots just "update items", others specify language of labels/descriptions. Could dispatching be improved by systematically indicating language and label?

We filter based on the actual diff, the summary is irrelevant.

  • Is there much demand for the recent changes feed in client wikis (other than update of displayed statements)? Personally, I find it hard to read, even for my own edits.

I would say yes, as it was considered a precondition to allowing any data re-use. It's also an important safeguard against vandalism on wikidata.

I agree that integration with the RC feed could be greatly improved. Once we have a mechanism to attach arbitrary structured data to revisions, this will hopefully get much better.

Thanks for your response. It helped me understand how the pipeline(s) work.

Stalest wiki now more than one day behind again. Median is fine, but as the stalest wiki is enwiki, that’s not a great consolation.

Stalest wiki now more than one day behind again. Median is fine, but as the stalest wiki is enwiki, that’s not a great consolation.

And what is causing this? Bot page creation or bot edits?

And what is causing this? Bot page creation or bot edits?

From Daniel's explanation and the documentation he provided, it seems that the more pages are linked to an item, the larger the impact.

If page creations are only for another wiki, the impact on the enwiki job queue should be zero.

Cosmetic edits to heavily linked items, such as those for categories, can have a huge impact, especially when the bot updates the descriptions in twenty separate edits.

Edits to country items might be expensive as well.

If the item edited has no site links at all, the impact may be negligible.

Ah, and "linked" in this context may also be through arbitrary access.

Do my conclusions make sense?

especially when the bot updates the description in twenty separate edits.

Is it true that this makes a big difference? Aren’t changes batched together?

I think the current dispatch lag growth is due to ResearchBot's creation of academic paper items. I say that because some days ago my bot was editing at 60 edits/min and ResearchBot was running at a similar rate. I stopped my bot for a few hours and the dispatch lag kept growing, so I concluded that it wasn't my fault. It is strange, because ResearchBot's creations aren't linked to any Wikipedia, but I am inclined to think they are the current cause of the rising lag.

To avoid confusion: dispatch lag is the time that changes sit around before they go into the client wikis' job queue. The time they spend in the job queue is not the issue here! Changes "sit around" because finding the changes relevant for a given wiki, and pushing them to that wiki's job queue, takes time. We are dispatching changes to over 800 wikis.

In any case: one thing we could do is skip old changes. Changes older than a day are unlikely to be seen on the client wiki's recentchanges feed anyway. Or perhaps, instead of skipping, they could go to a slow propagation queue, so new changes are pushed out quickly, but "stale" changes still get propagated eventually.

Can't we skip changes made by bots (or give them very low priority)? What's the point of checking bot edits in recentchanges? We assume they are approved and working properly.

@Emijrp the point is that at least some people should keep an eye on bot edits from time to time, otherwise nobody will notice when a bot goes wrong. Bot edits on Wikidata should be marked as such on client wikis too (let me know if they are not), so people can filter them out. But skipping propagation altogether would mean that Wikidata bots effectively run unsupervised. That would not be good.

Now, if we need an emergency way to cut down the backlog, I agree that it's better to drop bot edits than to drop manual edits.

especially when the bot updates the description in twenty separate edits.

Is it true that this makes a big difference? Aren’t changes batched together?

Changes get batched (coalesced) together on the client side, not before dispatching. So yes, it does make a big difference to the size of the dispatch queue.

We could indeed look into coalescing changes in the dispatcher, when they are pulled from the queue. But since we already put an entire batch of changes into a single ChangeNotification job, that probably won't make much of a difference. I'm also not sure how we'd actually represent coalesced changes in the job, given that we just use the ID to represent a change.
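For what it's worth, a rough sketch of what coalescing a run of edits to the same entity could look like (Python; the field names are illustrative, not the real change schema):

```
from itertools import groupby

# Consecutive changes to the same entity are merged into one logical change
# covering the whole revision range, so twenty single-description edits
# collapse into one update on the receiving side.
def coalesce(changes):
    changes = sorted(changes, key=lambda c: (c["entity"], c["rev"]))
    merged = []
    for entity, group in groupby(changes, key=lambda c: c["entity"]):
        group = list(group)
        merged.append({
            "entity": entity,
            "from_rev": group[0]["rev"],   # oldest revision in the run
            "to_rev": group[-1]["rev"],    # newest revision in the run
            "count": len(group),
        })
    return merged

edits = [{"entity": "Q1", "rev": r} for r in range(100, 120)]  # 20 edits
print(coalesce(edits))   # one coalesced change instead of twenty
```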

@Emijrp the point is that at least some people should keep an eye on bot edits from time to time, otherwise nobody will notice when a bot goes wrong. Bot edits on Wikidata should be marked as such on client wikis too (let me know if they are not), so people can filter them out. But skipping propagation altogether would mean that Wikidata bots effectively run unsupervised. That would not be good.

Now, if we need an emergency way to cut down the backlog, I agree that it's better to drop bot edits than to drop manual edits.

We could set different priority levels for dispatching changes: high for IPs (vandalism is more likely), medium for registered users, and low for bots (a rough sketch of this idea follows after the table). According to WikiScan, in the last 24 hours there were these edits:

Total: 791,322
Users: 142,541
IP:    1,322
IPv6:  141
Bots:  647,459

Bot changes could also be discarded sooner, for example dropped if they have not been dispatched within the first 24 hours.

Changes to descriptions seem lower priority than changes to labels/statements.
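A rough sketch of the prioritization and expiry idea above (Python; this is a proposal, not an existing Wikibase feature, and the change format is made up):

```
import time

# Dispatch anonymous edits first, then registered users, then bots, and
# drop bot changes that have waited longer than 24 hours.
PRIORITY = {"ip": 0, "user": 1, "bot": 2}   # lower value = dispatched first
BOT_MAX_AGE = 24 * 3600

def order_pending_changes(changes, now=None):
    now = now if now is not None else time.time()
    kept = [
        c for c in changes
        if not (c["actor"] == "bot" and now - c["timestamp"] > BOT_MAX_AGE)
    ]
    # Within the same priority class, keep chronological order.
    return sorted(kept, key=lambda c: (PRIORITY[c["actor"]], c["timestamp"]))

pending = [
    {"actor": "bot",  "timestamp": time.time() - 2 * 86400},  # expired, dropped
    {"actor": "bot",  "timestamp": time.time() - 600},
    {"actor": "ip",   "timestamp": time.time() - 300},
    {"actor": "user", "timestamp": time.time() - 100},
]
print([c["actor"] for c in order_pending_changes(pending)])
# -> ['ip', 'user', 'bot']
```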

  • Is there much demand for the recent changes feed in client wikis (other than update of displayed statements)? Personally, I find it hard to read, even for my own edits.

I would say yes, as it was considered a precondition to allowing any data re-use. It's also an important safeguard against vandalism on wikidata.

I agree that integration with the RC feed could be greatly improved. Once we have a mechanism to attach arbitrary structured data to revisions, this will hopefully get much better.

In wikis that routinely read information from dozens of items per article (e.g. frwiki), the Wikidata recent changes feed is just impossible to comprehend.

Supposedly, it works fine in wikis that only use one item per article (dewiki?).

Change 366887 merged by Filippo Giunchedi:
[operations/puppet@production] mediawiki: increase the batch size of dispatchChanges cronjob

https://gerrit.wikimedia.org/r/366887

Quick status summary:

  • Dispatcher batch size has been increased. It seems to have the desired effect: dispatch lag is going down, but only slowly. Maybe we can bump the batch size some more? (A rough scaling sketch follows after this list.)
  • Patches for improving throughput on the receiving end have been merged. This has no impact on dispatch lag, but it does have an impact on client-side change handling.
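As a back-of-the-envelope illustration of why the batch size matters (all numbers below are hypothetical placeholders, not measured values):

```
# Throughput scales roughly with batch size and with the number of
# concurrent dispatcher processes; the figures are made up.
def dispatch_throughput(batch_size, passes_per_hour, dispatchers):
    """Very simplified estimate of changes dispatched per hour."""
    return batch_size * passes_per_hour * dispatchers

before = dispatch_throughput(batch_size=1000, passes_per_hour=30, dispatchers=3)
after  = dispatch_throughput(batch_size=2000, passes_per_hour=30, dispatchers=4)
print(f"~{before} vs ~{after} changes/hour")
```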

Is there a way to meter specifically the update channel for enwiki?

e.g. (wikidata edit > queue 1 > queue 2 > queue 3 > .. > update enwiki : actual size / delay )

Is there a way to meter specifically the update channel for enwiki?

Not really - there is one queue for dispatching to enwiki, one queue for receiving on enwiki, and then there are several queues for processing different kinds of updates in parallel.

Change 370315 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: Another increase of batch size in dispatchChanges cronjob

https://gerrit.wikimedia.org/r/370315

The dispatch lag decreased a lot since yesterday, but it is increasing again since the bot imports for cebwiki and svwiki started again. I hope the above patch helps keep the dispatch lag steady.

Change 370315 merged by Jcrespo:
[operations/puppet@production] mediawiki: Another increase of batch size in dispatchChanges cronjob

https://gerrit.wikimedia.org/r/370315

Now the stalest wiki is at 3 minutes; should we close this?

Good news!

I wonder if we now agree on what people should look for and what should be done if there are delays.

I don't think this is resolved, see https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&orgId=1&from=now-90d&to=now

The median dispatch lag was almost never higher than 200/20s before 2017-06-29 (ever since we started recording this), but it has been consistently higher than 1 minute since that point.

Editing Wikidata is pretty slow for me now.

I'll ask people to have another look here.