
Move dispatching of wikidata to a dedicated node
Closed, Declined · Public

Description

Dispatching of Wikidata changes happens on terbium, but terbium is pretty unstable for such a vital service. As a long-term solution, the whole dispatching mechanism needs to be dismantled and moved to jobqueue/CP, but as a mid-term solution, and especially in order to avoid the interruptions that happen on terbium for numerous reasons, dispatching should be moved off terbium.

Event Timeline

What makes you think that Terbium is too unstable for this? Terbium seems to always have more than enough spare resources, and the dispatch problems we saw recently seem to correlate with a high number of edits, but not with spikes on Terbium.

RobH triaged this task as Medium priority.May 3 2018, 4:38 PM
RobH subscribed.

As part of SRE clinic duty, I'm reviewing all unassigned, needs-triage tasks in SRE and attempting to determine whether any are critical, or if they are normal priority.

This task appears to be normal priority, and I have set it as such. If anyone on this task disagrees, please comment and correct. Anything with a high priority or above typically requires response ahead of other items, so please ensure you have supporting documentation on why those priorities should be used.

Thanks!

What makes you think that Terbium is too unstable for this? Terbium seems to always have more than enough spare resources, and the dispatch problems we saw recently seem to correlate with a high number of edits, but not with spikes on Terbium.

That is a very valid question, and I failed to explain my intention fully. One big reason is to be able to scale better and utilize more resources: with a dedicated node, we know the limits and have a pretty stable view of things, whereas on terbium it's not okay to take over the whole node, because the number of edits is just too high. Having a dedicated node also means more resources, which seems to be what we need here too.

Ok, in that case this sounds like a valid request to get its own VM (or even a bare-metal server).

Depending on how fast this can be done, this is a good short-term solution for the performance problems. Long term, this sounds like a good idea, as it moves the "real time" service away from terbium, which is probably not the right place for it.

We should keep track of things like T110528: [Task] Use JOIN to find changes relevant for a given wiki, though (which should make this use far fewer resources, as most of the comparison logic would be done with a simple JOIN).

Vvjjkkii renamed this task from Move dispatching of wikidata to a dedicated node to updaaaaaaa.Jul 1 2018, 1:12 AM
Vvjjkkii raised the priority of this task from Medium to High.
Vvjjkkii updated the task description. (Show Details)
Vvjjkkii removed subscribers: Aklapper, gerritbot.
CommunityTechBot renamed this task from updaaaaaaa to Move dispatching of wikidata to a dedicated node.Jul 2 2018, 2:09 PM
CommunityTechBot lowered the priority of this task from High to Medium.
CommunityTechBot updated the task description. (Show Details)
CommunityTechBot added subscribers: Aklapper, gerritbot.
Addshore changed the task status from Open to Stalled.Jun 22 2019, 10:05 PM
Addshore lowered the priority of this task from Medium to Low.
Addshore subscribed.

Going to mark this as stalled.

Also, we haven't had performance issues with dispatching for quite some time now, after fixing something with the db tables that we spotted at the last DC switchover, so maybe these can stay where they are for now.

Going to mark this as stalled.

Also, we haven't had performance issues with dispatching for quite some time now, after fixing something with the db tables that we spotted at the last DC switchover, so maybe these can stay where they are for now.

In general, change dispatching is rather big technical debt that needs to be dropped, for these reasons:

  • It's a SPOF: if the mwmaint1002 node goes down with hardware issues, we can't dispatch at all, and if the node needs to be restarted, dispatching has to stop until that's done.
  • "Noisy neighbor" effect: people run maintenance scripts on the mwmaint node, so dispatching can be choked to death by other scripts, and a bug that eats all of the resources can in turn make running maintenance scripts impossible.
  • The distributed system we designed for this (pulling the wikis using three cronjobs, dispatching to and picking up basically random plus most-stalled wikis) reinvents scheduling; it could use the great jobqueue infrastructure we already have instead.
  • Cronjobs are hard to debug; moving this to the jobqueue makes it easier to debug in logstash.

So I'd like to just drop the whole thing, but first we need to address T220696: [Story] Create better edit summaries for wbeditentity API endpoint, which would let us make all edits done in the UI as just one edit instead of several back-to-back edits. Once that's done, we can switch to having each edit trigger jobs for the wikis that are subscribed to the entity, and those jobs would then trigger parser cache invalidation and the other needed jobs for pages in the client. Does that sound good to you, @Addshore? (Maybe there are other historical reasons to still use cronjob-based dispatching that I'm missing; @daniel, do you know anything else?)

So I'd like to just drop the whole thing, but first we need to address T220696: [Story] Create better edit summaries for wbeditentity API endpoint, which would let us make all edits done in the UI as just one edit instead of several back-to-back edits. Once that's done, we can switch to having each edit trigger jobs for the wikis that are subscribed to the entity, and those jobs would then trigger parser cache invalidation and the other needed jobs for pages in the client.

Why do we need to wait for the edit summaries?
The job queue already has good deduplication, so triggering a job after every edit should not really be an issue.
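The deduplication point can be illustrated with a minimal sketch. This is not the actual MediaWiki JobQueue API; the class and key names are hypothetical, and the point is only that jobs sharing a deduplication key collapse into one, so many rapid edits to one entity cost a single dispatch job:

```python
from collections import OrderedDict

class DedupJobQueue:
    """Toy queue that deduplicates jobs by key. Hypothetical sketch,
    not the MediaWiki JobQueue implementation."""

    def __init__(self):
        self._jobs = OrderedDict()  # dedup_key -> job params

    def push(self, dedup_key, params):
        # A job already queued under the same key is replaced, so
        # back-to-back edits to one entity yield a single job with
        # the latest parameters.
        self._jobs[dedup_key] = params

    def pop_all(self):
        jobs = list(self._jobs.items())
        self._jobs.clear()
        return jobs

q = DedupJobQueue()
for rev in (1, 2, 3):
    q.push("dispatch:Q42", {"entity": "Q42", "latest_rev": rev})
q.push("dispatch:Q64", {"entity": "Q64", "latest_rev": 7})

print(len(q.pop_all()))  # 2 jobs despite 4 pushes
```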

Does it sound good to you @Addshore? (Maybe there are other historical reasons to still use cronjob-based dispatching I'm missing; @daniel, do you know anything else?)

The main benefit of the current system versus the way a job-based system would probably look is the batching, and there are a few different ways we could think about doing the batching.

There are some details in T48643 from back in 2015, but they should probably get another wave of thinking now that we are 4 years down the line.

E.g., do we:

  1. Schedule a job post-edit to push out the change for the one entity to all subscribed clients? Do we do this all in one job, or spin off a job per client?
    • Less batching; not very similar to how we do things now
    • Means we can run multiple jobs for a single wiki at once
  2. Schedule a job post-edit that spawns new jobs for updating each client subscribed to the entity, but then have those branch jobs pull in all available changes?
    • More similar to our current batching process
    • Probably only want to run one branch job per client wiki at a time
  3. Some other odd smush of the above

Another small thing to consider with the job queue: determining how lagged (in terms of minutes and seconds) these jobs are / dispatching is becomes slightly harder, as we don't just have a column in a db table with timestamps we can look at (or maybe we do, if we decide to keep the table?).
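If the table is dropped, one assumed workaround would be carrying the change timestamp inside each job's parameters and deriving lag from the oldest undispatched change. A minimal sketch of that computation (the function and its inputs are hypothetical, not an existing MediaWiki metric):

```python
import time

def dispatch_lag(change_timestamps, now=None):
    """Lag is determined by the oldest undispatched change, in seconds.
    With the cron system this came from a db table column; with jobs
    the change timestamp would have to travel in the job params
    (an assumption, not current behaviour)."""
    now = time.time() if now is None else now
    if not change_timestamps:
        return 0.0  # nothing pending means no lag
    return now - min(change_timestamps)

print(dispatch_lag([100.0, 160.0], now=220.0))  # 120.0
```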

Still some more thinking to be done regarding jobs, but I'd rather us focus on jobs than try to move the dispatching cron job to somewhere else.

So perhaps we leave this ticket alone for now and continue the discussion on T48643: [Story] Dispatching via job queue (instead of cron script)?