
Wikidata Dispatcher and Job Queue are overflowing
Closed, ResolvedPublic

Description

On Wikidata, the job queue is overflowing.

  1. In Special:DispatchStats, enwiki is behind by 3 days!
  2. In Grafana, there are 1.3 million queued jobs.

As Wikidata grows, the job rate keeps increasing rapidly every day.

I think the solution is to dedicate more resources to Wikidata as the number of jobs increases.

Event Timeline


This problem should be solved ASAP.

This is likely a duplicate of several existing tickets, but I don't have them at hand right now.

For example T151681, T110528 for (some of) the dispatch issues.

I don't know how many server resources are dedicated to Wikidata (CPU, memory, etc.), but could they be increased? Perhaps by setting up another job server?

In the future (maybe a year?) more data will be integrated into Wikidata (Commons structured data and Wiktionary lexemes). How is Wikidata going to support that huge amount of new data and bot activity?

1 million queued jobs is absolutely normal. The problem lies somewhere else.

1 million queued jobs is absolutely normal. The problem lies somewhere else.

It is only 1 million because we have shut down almost all bots. It was 3 million a few days ago.[1]

If you check the Wikidata edits Grafana dashboard, you will see how the Wikidata community has reduced its activity[2] to shrink the job queue. I smell a bottleneck here, but the answer we always get is "slow down your bots". We cannot keep Wikidata items updated by running a bunch of bots at a few dozen edits per minute. Wikidata has 30 million items; it is six times larger than English Wikipedia.

[1] https://grafana.wikimedia.org/dashboard/db/job-queue-health?refresh=1m&orgId=1
[2] https://grafana.wikimedia.org/dashboard/db/wikidata-edits?refresh=1m&orgId=1

My bot has 25 million edits on Wikidata, so I know the scale. The problem is that we ran too fast for a while, and now the backlog is so big that it takes some time for the infrastructure to work through it. We should have more resources for Wikidata, but resources alone can't fix this; better software is also needed, and we are working on it, for example T151681: DispatchChanges: Avoid long-lasting connections to the master DB

We indeed lowered the number of edits to at least try to make the process go faster. Let's hope this is a wake-up call for all of us.

As more data gets added, the job queue is going to grow even larger, and more and more bots are going to be added. The day that Wiktionary words are added to items, Wikidata is going to overflow.

The core problem is that Wikidata is poorly organized as a community and project. Overall, a big part of the work done is inefficient and counterproductive. At the very least, the addition of descriptions/labels could and should be optimized. Many users act just like bots: using mass-editing tools on their own, they add a single description in a single language to series of items spanning tens of thousands of items. Over the last month, four human users together made 3 million mass-edits just adding descriptions in their native languages (one of them made over 900,000 mass-edits in 10 days to add descriptions). I can show you large series of items where each item has over 50 consecutive edits made by one user in a short time span, just to add descriptions or identical labels in several languages. I'm pretty sure that for our 30+ million items we don't want 200+ edits per item just for the addition of descriptions/labels.

To avoid confusion: dispatch lag is the time that changes sit around before they go into the client wikis' job queue. The time they spend in the job queue is not the issue here! Changes "sit around" because finding the changes relevant for a given wiki, and pushing them to that wiki's job queue, takes time. We are dispatching changes to over 800 wikis.

In any case: one thing we could do is skip old changes. Changes older than a day are unlikely to be seen on the client wiki's recentchanges feed anyway. Or perhaps, instead of skipping, they could go to a slow propagation queue, so new changes are pushed out quickly, but "stale" changes still get propagated eventually.
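Roughly, the "slow propagation queue" idea could look like the sketch below (Python pseudocode with made-up names and a hypothetical change format; the real dispatcher is a PHP maintenance script, so this only illustrates the routing logic):

```
import time

# Illustration only: changes newer than MAX_AGE go to the normal per-wiki
# queue, older ("stale") changes are diverted to a low-priority queue that
# is drained separately, so fresh changes are not stuck behind the backlog.
MAX_AGE_SECONDS = 24 * 3600

def dispatch_batch(changes, client_wiki, fast_queue, slow_queue, now=None):
    """Split a batch of pending changes for one client wiki by age."""
    now = now if now is not None else time.time()
    for change in changes:
        if not change["affects"](client_wiki):
            continue  # this wiki does not use the changed entity
        if now - change["timestamp"] > MAX_AGE_SECONDS:
            slow_queue.append(change)   # propagate eventually
        else:
            fast_queue.append(change)   # propagate promptly

# Toy usage:
fast, slow = [], []
changes = [
    {"timestamp": time.time() - 60,        "affects": lambda wiki: True},
    {"timestamp": time.time() - 3 * 86400, "affects": lambda wiki: True},
]
dispatch_batch(changes, "enwiki", fast, slow)
print(len(fast), "fresh,", len(slow), "stale")   # -> 1 fresh, 1 stale
```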

How sure are we that this problem was predominantly caused by the recently "high" edit rate (or an undesired number of parallel bot runs) at Wikidata? At Wikidata we are trying to get edit rates down, but I am uncomfortable with the notion that Wikidata already operates at the edge of what is technically possible. Recent activity might have been pretty high, but not so high that I would have expected problems of this impact.

I would like to point to the dispatch graphs [1] in Grafana as well. If you look closely, there seems to be a very clear singularity in many of the graphs on 2017-06-28, between 21:15 and 21:30 (Grafana time; no idea whether this is UTC). What happened at that time? If there had been a software or hardware failure, it would probably have been noticed by today, but this does not seem to be the case according to the Grafana charts. I thus speculate that some very expensive new software was deployed at that point, or an update that is badly inefficient compared to the previous state. The history of T151681 (and maybe other tasks) indicates that there was in fact activity around the dispatcher at about that time.

[1] https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&orgId=1

@MisterSynergy I agree that a software or config change is a prime candidate for causing an issue like this. But as far as I can tell from https://wikitech.wikimedia.org/wiki/Server_Admin_Log#2017-06-28 there were no relevant changes deployed on 2017-06-28.

The lag started to increase around 30th July; around that day, the great import of cebwiki items started as well. Now look at when the sharp increase in items ends and when the increase in lag ends. The difference is 4 days, which is the number at the peak of the lag graph. In other words, the mass import of new items (not only from cebwiki) is what made the dispatching slower.

I wouldn't say it makes dispatching slower; the rate is essentially the same, there is just more to dispatch, and currently the dispatch system can only handle so many changes per second / minute / hour.

I'm so happy to see tons of donors' money (bandwidth, database storage, and software engineer time) being wasted on articles that will never be read.

Just note that the median dispatch lag is not horrible (2 minutes) and only English Wikipedia and cebwiki are lagging behind. So the fastest and hackiest solution, in my opinion, is to set up dedicated dispatching just for English Wikipedia. Let me see how hard that will be.

Change 366887 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: increase the maximum time of dispatchChanges cronjob

https://gerrit.wikimedia.org/r/366887

It's not possible to dispatch to one wiki only, but my patch increases the maximum run time of each dispatching job, which means two things: 1) we can have four dispatcher instances instead of three, which is good; 2) if dispatching to big wikis times out because the batch is too large, this might help with that.

Maybe we can run it once manually from terbium? Since it pushes jobs to the job queue, that might not be the best idea in the world.

PokestarFan lowered the priority of this task from High to Medium.Jul 22 2017, 5:12 PM

The job queue is fixed for now and the dispatch queue is catching up; this can be fixed later.

Lucas_Werkmeister_WMDE raised the priority of this task from Medium to High.EditedJul 24 2017, 2:20 PM

Reverting to High. The main reason the dispatch queue is catching up is that the Wikidata administrators asked bot operators to reduce their operations. Once the bots resume, we might get the same problem again (in fact, I'm a bit afraid it might be worse if everyone starts again at the same time, trying to catch up on their own week of backlog as fast as possible).

@PokestarFan, please refrain from re-prioritizing issues, that’s not for you to decide.

More good progress in the last 24 hours. I will notify the community when things are really down; then we need to keep watching whether the dispatch lag increases too fast again. The editing speed of users is visible here.

Still interesting spikes happening. The stalest wiki is wikidatawiki for some reason; is that normal?

Could the situation be improved by limiting the type of changes that are dispatched to various wikis?

  • I noticed that in some wikis, en labels are systematically subscribed to, but not displayed (possibly some inefficiency in their Module:Wikidata).
  • Some bots just "update items", others specify language of labels/descriptions. Could dispatching be improved by systematically indicating language and label?
  • Is there much demand for the recent changes feed in client wikis (other than update of displayed statements)? Personally, I find it hard to read, even for my own edits.

Could the situation be improved by limiting the type of changes that are dispatched to various wikis?

We only dispatch changes to wikis that use the given item. Further filtering, as suggested below, happens on the client side. We could also do it on the repo, but that would not reduce the load, just concentrate it in one place - whatever data needs to be loaded and whatever code needs to run for the filtering, needs to run anyway - on the repo or the client.
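To make the "only dispatch to wikis that use the item" part concrete, here is a minimal sketch (Python, with a hypothetical in-memory subscription map; in Wikibase this information lives in a subscription table on the repo):

```
# Sketch of the repo-side selection described above: a change is only
# pushed to the job queues of wikis subscribed to the changed entity.
# The subscription data here is made up for illustration.
subscriptions = {
    "Q42": {"enwiki", "dewiki", "frwiki"},
    "Q64": {"dewiki"},
}

def target_wikis(entity_id, all_client_wikis):
    """Return the client wikis a change to entity_id must be dispatched to."""
    return subscriptions.get(entity_id, set()) & set(all_client_wikis)

print(target_wikis("Q42", ["enwiki", "dewiki", "cebwiki"]))
# -> enwiki and dewiki; any finer-grained filtering happens on the client
# after the ChangeNotification job is delivered.
```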

  • I noticed that in some wikis, en labels are systematically subscribed to, but not displayed (possibly some inefficiency in their Module:Wikidata).

This is because of language fallback - "en" is the fallback for all languages.
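A simplified sketch of how that fallback plays out (Python; the chain shown is illustrative, not the exact MediaWiki fallback configuration):

```
# When a label is missing in the wiki's own language, the fallback chain is
# consulted, and "en" sits at the end of (almost) every chain, so the English
# label ends up being used - and therefore subscribed to.
labels = {"Q42": {"en": "Douglas Adams"}}       # no "ast" label
fallback_chain = {"ast": ["ast", "es", "en"]}   # simplified, assumed chain

def resolve_label(entity_id, language):
    for lang in fallback_chain.get(language, [language, "en"]):
        label = labels.get(entity_id, {}).get(lang)
        if label is not None:
            return lang, label
    return None, None

print(resolve_label("Q42", "ast"))   # -> ('en', 'Douglas Adams')
```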

  • Some bots just "update items", others specify language of labels/descriptions. Could dispatching be improved by systematically indicating language and label?

We filter based on the actual diff, the summary is irrelevant.

  • Is there much demand for the recent changes feed in client wikis (other than update of displayed statements)? Personally, I find it hard to read, even for my own edits.

I would say yes, as it was considered a precondition to allowing any data re-use. It's also an important safeguard against vandalism on wikidata.

I agree that integration with the RC feed could be greatly improved. Once we have a mechanism to attach arbitrary structured data to revisions, this will hopefully get much better.

Thanks for your response. It helped me understand how the pipeline(s) work.

Stalest wiki now more than one day behind again. Median is fine, but as the stalest wiki is enwiki, that’s not a great consolation.

Stalest wiki now more than one day behind again. Median is fine, but as the stalest wiki is enwiki, that’s not a great consolation.

And what is causing this? Bot page creation or bot edits?

And what is causing this? Bot page creation or bot edits?

From Daniel's explanation and the documentation he provided, it seems that the more pages are linked to an item, the larger the impact.

If page creations are only for another wiki, the impact on the enwiki job queue should be zero.

Cosmetic edits to heavily linked items, such as those for categories, can have a huge impact, especially when the bot updates the descriptions in twenty separate edits.

Edits to country items might be expensive as well.

If the item edited has no site links at all, the impact may be negligible.

Ah, and "linked" in this context may also be through arbitrary access.

Do my conclusions make sense?

especially when the bot updates the description in twenty separate edits.

Is it true that this makes a big difference? Aren’t changes batched together?

I think the current dispatch lag growth is due to ResearchBot's creation of academic paper items. I say that because some days ago my bot was editing at 60 edits/min and ResearchBot was running at a similar rate. I stopped my bot for a few hours and the dispatch lag kept growing, so I concluded that it wasn't my fault. It is strange, because ResearchBot's creations aren't linked to any Wikipedia, but I am inclined to think they are the current cause of the rising lag.

To avoid confusion: dispatch lag is the time that changes sit around before they go into the client wikis' job queue. The time they spend in the job queue is not the issue here! Changes "sit around" because finding the changes relevant for a given wiki, and pushing them to that wiki's job queue, takes time. We are dispatching changes to over 800 wikis.

In any case: one thing we could do is skip old changes. Changes older than a day are unlikely to be seen on the client wiki's recentchanges feed anyway. Or perhaps, instead of skipping, they could go to a slow propagation queue, so new changes are pushed out quickly, but "stale" changes still get propagated eventually.

Can't we skip changes made by bots (or give them very low priority)? What's the point of checking bot edits in recentchanges? We assume they are approved and working properly.

@Emijrp the point is that at least some people should keep an eye on bot edits from time to time, otherwise nobody will notice when a bot goes wrong. Bot edits on Wikidata should be marked as such on client wikis too (let me know if they are not), so people can filter them out. But skipping propagation altogether would mean that Wikidata bots effectively run unsupervised. That would not be good.

Now, if we need an emergency way to cut down the backlog, I agree that it's better to drop bot edits than to drop manual edits.

especially when the bot updates the description in twenty separate edits.

Is it true that this makes a big difference? Aren’t changes batched together?

Changes get batched (coalesced) together on the client side, not before dispatching. So yes, it does make a big difference to the size of the dispatch queue.

We could indeed look into coalescing changes in the dispatcher, when they are pulled from the queue. But since we already put an entire batch of changes into a single ChangeNotification job, that probably won't make much of a difference. I'm also not sure how we'd actually represent coalesced changes in the job, given that we just use the ID to represent a change.
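For what it's worth, a rough sketch of what coalescing a run of edits to the same entity could look like (Python; the field names are illustrative, not the real change schema):

```
from itertools import groupby

# Consecutive changes to the same entity are merged into one logical change
# covering the whole revision range, so twenty single-description edits
# collapse into one update on the receiving side.
def coalesce(changes):
    changes = sorted(changes, key=lambda c: (c["entity"], c["rev"]))
    merged = []
    for entity, group in groupby(changes, key=lambda c: c["entity"]):
        group = list(group)
        merged.append({
            "entity": entity,
            "from_rev": group[0]["rev"],   # oldest revision in the run
            "to_rev": group[-1]["rev"],    # newest revision in the run
            "count": len(group),
        })
    return merged

edits = [{"entity": "Q1", "rev": r} for r in range(100, 120)]  # 20 edits
print(coalesce(edits))   # one coalesced change instead of twenty
```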

@Emijrp the point is that at least some people should keep an eye on bot edits from time to time, otherwise nobody will notice when a bot goes wrong. Bot edits on Wikidata should be marked as such on client wikis too (let me know if they are not), so people can filter them out. But skipping propagation altogether would mean that Wikidata bots effectively run unsupervised. That would not be good.

Now, if we need an emergency way to cut down the backlog, I agree that it's better to drop bot edits than to drop manual edits.

We could set different priority levels for dispatching changes: high for IPs (vandalism is more likely), medium for registered users, and low for bots (a rough sketch of this idea follows after the table). According to WikiScan, in the last 24 hours there were these edits:

Total: 791,322
Users: 142,541
IP:    1,322
IPv6:  141
Bots:  647,459

Bot changes could also be discarded sooner, for example dropped if they have not been dispatched within the first 24 hours.

Changes to descriptions seem lower priority than changes to labels/statements.
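A rough sketch of the prioritization and expiry idea above (Python; this is a proposal, not an existing Wikibase feature, and the change format is made up):

```
import time

# Dispatch anonymous edits first, then registered users, then bots, and
# drop bot changes that have waited longer than 24 hours.
PRIORITY = {"ip": 0, "user": 1, "bot": 2}   # lower value = dispatched first
BOT_MAX_AGE = 24 * 3600

def order_pending_changes(changes, now=None):
    now = now if now is not None else time.time()
    kept = [
        c for c in changes
        if not (c["actor"] == "bot" and now - c["timestamp"] > BOT_MAX_AGE)
    ]
    # Within the same priority class, keep chronological order.
    return sorted(kept, key=lambda c: (PRIORITY[c["actor"]], c["timestamp"]))

pending = [
    {"actor": "bot",  "timestamp": time.time() - 2 * 86400},  # expired, dropped
    {"actor": "bot",  "timestamp": time.time() - 600},
    {"actor": "ip",   "timestamp": time.time() - 300},
    {"actor": "user", "timestamp": time.time() - 100},
]
print([c["actor"] for c in order_pending_changes(pending)])
# -> ['ip', 'user', 'bot']
```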

  • Is there much demand for the recent changes feed in client wikis (other than update of displayed statements)? Personally, I find it hard to read, even for my own edits.

I would say yes, as it was considered a precondition to allowing any data re-use. It's also an important safeguard against vandalism on wikidata.

I agree that integration with the RC feed could be greatly improved. Once we have a mechanism to attach arbitrary structured data to revisions, this will hopefully get much better.

In wikis that routinely read information from dozens of items per article (e.g. frwiki), the Wikidata recent changes feed is just impossible to comprehend.

Supposedly, it works fine in wikis that only use one item per article (dewiki?).

Change 366887 merged by Filippo Giunchedi:
[operations/puppet@production] mediawiki: increase the batch size of dispatchChanges cronjob

https://gerrit.wikimedia.org/r/366887

Quick status summary:

  • Dispatcher batch size has been increased. It seems to have the desired effect: dispatch lag is going down, but only slowly. Maybe we can bump the batch size some more? (A rough scaling sketch follows after this list.)
  • Patches for improving throughput on the receiving end have been merged. This has no impact on dispatch lag, but it does have an impact on client-side change handling.
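As a back-of-the-envelope illustration of why the batch size matters (all numbers below are hypothetical placeholders, not measured values):

```
# Throughput scales roughly with batch size and with the number of
# concurrent dispatcher processes; the figures are made up.
def dispatch_throughput(batch_size, passes_per_hour, dispatchers):
    """Very simplified estimate of changes dispatched per hour."""
    return batch_size * passes_per_hour * dispatchers

before = dispatch_throughput(batch_size=1000, passes_per_hour=30, dispatchers=3)
after  = dispatch_throughput(batch_size=2000, passes_per_hour=30, dispatchers=4)
print(f"~{before} vs ~{after} changes/hour")
```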

Is there a way to meter specifically the update channel for enwiki?

e.g. (wikidata edit > queue 1 > queue 2 > queue 3 > .. > update enwiki : actual size / delay )

Is there a way to meter specifically the update channel for enwiki?

Not really - there is one queue for dispatching to enwiki, one queue for receiving on enwiki, and then there are several queues for processing different kinds of updates in parallel.

Change 370315 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani):
[operations/puppet@production] mediawiki: Another increase of batch size in dispatchChanges cronjob

https://gerrit.wikimedia.org/r/370315

The dispatch lag decreased a lot since yesterday, but it is increasing again since the bot imports for cebwiki and svwiki started again. I hope the above patch helps keep the dispatch lag steady.

Change 370315 merged by Jcrespo:
[operations/puppet@production] mediawiki: Another increase of batch size in dispatchChanges cronjob

https://gerrit.wikimedia.org/r/370315

Now the stalest wiki is at 3 minutes; should we close this?

Good news!

I wonder if we now agree on what people should look for and what should be done if there are delays.

I don't think this is resolved, see https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&orgId=1&from=now-90d&to=now

The median dispatch lag was almost never higher than 200/20s before 2017-06-29 (ever since we started recording this), but it has been consistently higher than 1 minute since that point.

Editing Wikidata is pretty slow for me now.

I'll ask people to have another look here.