
Temporarily run more refreshLinks jobs on Commons
Closed, ResolvedPublic

Description

Wikimedia Commons currently has a large backlog of (I believe) refreshLinks jobs, as a result of some edits (and now-processed edit requests) for highly-used CC license templates: see edit, edit, edit, edit, edit and request, request, request, request. Together, these should result in a large percentage of Commons’ files being re-rendered, and having one templatelinks row each removed from the database (CC T343131).

Currently, the number of links to Template:SDC statement has value (as counted by search – this lags somewhat behind the “real” number, as it depends on a further job, but it should be a decent approximation) is only going down rather slowly; @Nikki estimated that the jobs would take some 10 years to complete at the current rate. Can we increase the rate at which these jobs are run? Discussion in #wikimedia-tech suggests this should be possible in changeprop-jobqueue.

Event Timeline

Per IRC discussion, marking as High priority. @AntiCompositeNumber reports that this results in category changes being slow to propagate (Category:Johann Baptist Hops not showing up in Category:Hops (surname) yet).

Currently, the number of links to Template:SDC statement has value (as counted by search – this lags somewhat behind the “real” number, as it depends on a further job, but it should be a decent approximation)

To put a concrete number out: at the moment, Quarry reports 79158475 links (79.1M), CirrusSearch reports 77979457 links (77.9M). I’m not really sure why Quarry’s number is higher, to be honest – but at least they’re somewhat close to each other. (If we bump the refreshLinks concurrency or do whatever else the right technical thing for this task is, we might have to do the same for some CirrusSearch-related jobs too… I think cirrusSearchLinksUpdate might be the right job type? But I’m not sure at all.)

Also interesting: pages with cc-by-sa-3.0 and *without* SDC statement has value – currently 1289684 (1.2M), almost two weeks after I updated the former template to no longer use the latter. (I’m not trying a Quarry version of this because I expect it would be prohibitively expensive.)

I have processed those edit requests, so at least when a page contains multiple license templates, it will only need to be reparsed once.

Indeed, it looks like the refreshLinks_partitioner rule is easily keeping up with the "upstream" rate of new jobs [0] but the "real" refreshLinks rule on partition 3 (commons) has a rather deep backlog.

Unfortunately, I don't think changeprop offers a way to increase just the concurrency for commons [0], so if we do increase the concurrency for refreshLinks as a whole, we'd need to be comfortable with possibly 8x'ing that at times (number of partitions) in aggregate. That said, since partitioning is by database, the amplification factor is perhaps a bit less concerning.

I am hopeful that some other folks in serviceops might have practical experience with the risks of doing something like this.

[0] In short, there's no support for manual assignment of {topic, partition} to a given consumer.
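To make the “partitioned by database” point concrete, here is a minimal illustrative sketch (an assumption about the scheme, not changeprop’s actual code): if the partition is chosen by hashing the wiki database name, every commonswiki refreshLinks job lands on the same partition and competes for that one consumer’s concurrency, while the concurrency setting itself stays global.

```python
# Illustrative sketch only -- NOT changeprop's actual partitioning code.
# Assumption: the partition is chosen by hashing the wiki database name.
import zlib

NUM_PARTITIONS = 8  # number of refreshLinks partitions mentioned above

def partition_for(db_name: str) -> int:
    """Stable mapping from wiki database name to a partition index."""
    return zlib.crc32(db_name.encode()) % NUM_PARTITIONS

# Every commonswiki job maps to the same single partition (whichever
# index that happens to be), so its backlog can only drain as fast as
# that one partition's consumer allows.
commons_partition = partition_for("commonswiki")
```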

This type of Kafka consumer lag for that job isn't unheard of. In fact, just recently we had way higher consumer lag for commons specifically.

image.png (961×1 px, 91 KB)

This has been happening in eqiad as well - we have since switched over to codfw.

image.png (873×1 px, 77 KB)

That panel is somewhat misleading, by the way. The metric used is kafka_burrow_partition_lag, which isn't the same as consumer lag (despite the title of the panel). It represents the number of messages in that partition that haven't been consumed yet (i.e. the backlog), whereas Kafka consumer lag is about the ACKed consumer processing delay (latest offset vs. the consumer group's committed offset).
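A toy model of the distinction being drawn here (simplified for illustration; Burrow’s real lag evaluation is more involved):

```python
# Simplified illustration, not Burrow's actual implementation.

def offset_backlog(log_end_offset: int, committed_offset: int) -> int:
    """kafka_burrow_partition_lag-style number: messages produced to the
    partition that the consumer group has not yet committed (a backlog count)."""
    return log_end_offset - committed_offset

def processing_delay(now_s: float, committed_msg_timestamp_s: float) -> float:
    """Consumer-lag-as-delay: how far behind in *time* the last committed
    message is, regardless of how many messages that represents."""
    return now_s - committed_msg_timestamp_s
```

A deep backlog drained quickly and a shallow backlog drained slowly can produce similar time delays, which is why reading the backlog count as “consumer lag” is misleading.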

Anyway, an eyeball linear regression of codfw's Kafka consumer lag suggests ~3-4 weeks at the current rate, but the history of that panel says this probably isn't the most interesting metric, as it is not unheard of for message processing to suddenly spike from ~500 jobs per second to 5k jobs per second for a short amount of time.

As another data point, per RefreshLinks JobQueue Job stats, it seems like refreshLinks is servicing jobs at double the insertion rate, and the backlog time is consistent with business as usual.

Unfortunately, I don't think changeprop offers a way to increase just the concurrency for commons

That is correct; the setting applies to all partitioners, so we can't do just commons.

if we do increase the concurrency for refreshLinks as a whole, we'd need to be comfortable with possibly 8x'ing that at times (number of partitions) in aggregate

We'd definitely want to gather more information about the issue at hand before bumping this setting considerably.

@LucasWerkmeister how was the 10 year estimation calculated?

@LucasWerkmeister how was the 10 year estimation calculated?

I've been using the number of results of a hastemplate: search between two points in time to work out the average over that period of time. I did various searches between the 8th and 10th and also checked things like https://templatecount.toolforge.org/ and it didn't seem to be going down much at all, so I mentioned it to Lucas and decided to see how much one search would drop over a few days:

At 14:44 on the 10th, https://commons.wikimedia.org/w/index.php?search=hastemplate%3A%22SDC_statement_has_value%22&ns6=1 had 79,628,867 results.
At 22:33 on the 14th, it had 79,533,769 results, so it went down by ~95,000 in ~106 hours, around 900 per hour (and 900 per hour would be 21,600 per day, 7.8 million per year).

It seems it has finally started to go faster now though:

At 23:18 on the 21st, there were 77,979,524 results (down by ~1,550,000 in ~168 hours, around 9200 per hour).
At 12:52 on the 22nd, there were 77,593,815 results (down by ~385,000 in ~13 hours, around 30,000 per hour).
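The arithmetic behind those per-hour figures, written out (all counts copied from the search results above):

```python
def hourly_rate(count_start: int, count_end: int, hours: float) -> float:
    """Average drop in search results per hour between two observations."""
    return (count_start - count_end) / hours

# 10th 14:44 -> 14th 22:33: ~106 hours
r1 = hourly_rate(79_628_867, 79_533_769, 106)   # ~900/hour
# 14th 22:33 -> 21st 23:18: ~168 hours
r2 = hourly_rate(79_533_769, 77_979_524, 168)   # ~9,200/hour
# 21st 23:18 -> 22nd 12:52: ~13 hours
r3 = hourly_rate(77_979_524, 77_593_815, 13)    # ~30,000/hour

# At ~900/hour: ~21,600/day, ~7.9M/year -- hence the ~10-year estimate
# for an ~79M backlog.
per_year = r1 * 24 * 365
```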

Anyway, an eyeball linear regression of codfw's Kafka consumer lag suggests ~3-4 weeks at the current rate

If it continues at that rate, then that seems fine (and a lot more like what I was expecting).

For the record, I just found another set of templates we want to update on most of the affected files (Cc-by(-sa)-layout should bypass the SDC_statement_exist template) and filed edit requests for them.

Just processed those edit requests.

@Nikki thanks for the explanation. As you note yourself, speed-wise there are many ups and downs.

As an update, Kafka consumer lag for the partition in question has indeed increased again, but it is still at levels seen before in the last 60 days. The refreshLinks job is currently processing jobs as normal; average job processing time fluctuates around the 5-second mark. Kafka lag catch-up will probably speed up considerably at some point, as it has done in the recent past.

akosiaris lowered the priority of this task from High to Low. Nov 29 2024, 2:29 PM

I'll switch to Low; we can keep monitoring over the next few weeks and see how this pans out.

To put a concrete number out: at the moment, Quarry reports 79158475 links (79.1M), CirrusSearch reports 77979457 links (77.9M). I’m not really sure why Quarry’s number is higher, to be honest – but at least they’re somewhat close to each other. (If we bump the refreshLinks concurrency or do whatever else the right technical thing for this task is, we might have to do the same for some CirrusSearch-related jobs too… I think cirrusSearchLinksUpdate might be the right job type? But I’m not sure at all.)

Also interesting: pages with cc-by-sa-3.0 and *without* SDC statement has value – currently 1289684 (1.2M), almost two weeks after I updated the former template to no longer use the latter. (I’m not trying a Quarry version of this because I expect it would be prohibitively expensive.)

The first CirrusSearch now went down from 77.9M to 45.7M (45744139); the second CirrusSearch increased from 1.2M to 5.4M (5419661). I think this supports the assumption that jobs are being processed at an acceptable rate after all (though it might still take a few more months for the backlog to clear out).

Overall, I’m inclined to say that by now it’s safe to just close this task and assume the job queue is doing its job reasonably well. However:

To put a concrete number out: at the moment, Quarry reports 79158475 links (79.1M), CirrusSearch reports 77979457 links (77.9M). I’m not really sure why Quarry’s number is higher, to be honest – but at least they’re somewhat close to each other.

Quarry now reports 51504162 links (51.5M) while CirrusSearch is down to 16948938 (16.9M). This is starting to get weird (and kind of undermines the original goal of these edits which was to reduce the number of rows in the table, not the data in Elasticsearch / OpenSearch).

akosiaris claimed this task.

Overall, I’m inclined to say that by now it’s safe to just close this task and assume the job queue is doing its job reasonably well.

Done.

However:

To put a concrete number out: at the moment, Quarry reports 79158475 links (79.1M), CirrusSearch reports 77979457 links (77.9M). I’m not really sure why Quarry’s number is higher, to be honest – but at least they’re somewhat close to each other.

Quarry now reports 51504162 links (51.5M) while CirrusSearch is down to 16948938 (16.9M). This is starting to get weird (and kind of undermines the original goal of these edits which was to reduce the number of rows in the table, not the data in Elasticsearch / OpenSearch).

Is this something we should open a separate task for? Or is it preferable to just monitor it for now?

I guess the most likely explanation is that the processing of the updates for MariaDB and OpenSearch happens separately, and the MariaDB updates to the templatelinks table are progressing rather more slowly. Between T380544#10346771 (79.1M) and T380544#10924765 (51.5M), almost seven months passed; based on this, we can extrapolate that the remaining 51.5M rows would take another year or so to process. (Or, if we assume the CirrusSearch updates are complete and the 16.9M pages there still have genuine uses of Template:SDC statement has value, then there are 34.6M rows left which would take between 8 and 9 months at the current rate.) That’s not as bad as the “ten years” estimate, and certainly not bad enough for High priority, but I still feel like a somewhat higher rate of these jobs wouldn’t hurt.
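The extrapolation above, spelled out (dates approximated to whole months; all counts from the comments above):

```python
# Quarry templatelinks counts for Template:SDC statement has value
start_rows = 79_158_475   # ~late Nov 2024 (T380544#10346771)
now_rows   = 51_504_162   # ~seven months later (T380544#10924765)
months     = 7

rows_per_month = (start_rows - now_rows) / months        # ~3.95M/month
months_to_drain_all = now_rows / rows_per_month          # ~13 -> "another year or so"

# If the remaining CirrusSearch hits are all genuine uses of the template:
genuine_uses = 16_948_938
months_to_drain_rest = (now_rows - genuine_uses) / rows_per_month  # ~8.7
```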

That’s not as bad as the “ten years” estimate, and certainly not bad enough for High priority, but I still feel like a somewhat higher rate of these jobs wouldn’t hurt.

ACK, but as pointed out above, the software has no capability for doing this just for commons (which is what was required here); it's still a global toggle for all refreshLinks jobs.

Okay, would there be a problem with running more refreshLinks jobs across all wikis? 😇

(I’m not sure how we could evaluate the effect of any such change, beyond checking the number of templatelinks for this one particular template because we happen to know it should be lower… do we track the duration between the root job and a “leaf” job’s completion somewhere? The JobQueue Job dashboard has the refreshLinks p99 normal backlog time, which is stable at 1 h, but I suspect this might be the time between the “leaf” job being scheduled and completed.)

Okay, would there be a problem with running more refreshLinks jobs across all wikis? 😇

How many more? If it is, say, 3%, probably not; but would that even be helpful? If it is, say, 30% or more? Unclear. Back in T370304: Bursts of occasional severe contention on s4 (commonswiki) primary mariadb causing recurrent user-facing outages on all wikis, we already went down from 30 to 25 as a stopgap for issues experienced with s4. That was reverted and we are back to 30 (with 1 being the smallest possible unit, of course).

To give you an idea of a few things that would happen: we would be throwing up to 8xN more requests at the mw-jobrunner cluster, which could end up requiring more resources to avoid increasing latencies for the rest of the jobs (not really that much of a problem in itself, but it's not as easy to reason about with a multiplier of 8 in there). In turn, those requests would also be hitting the databases up to 8xN more times. If N is small, it wouldn't cause issues, but then you probably wouldn't see the effect I assume you desire (which, I assume, is that all the SDC-statement-related jobs finish).
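As a worked example of the 8xN multiplier (the bump itself is hypothetical; only the 8 partitions and the current concurrency of 30 come from the comments above):

```python
PARTITIONS = 8            # refreshLinks partitions (from the discussion above)
CURRENT_CONCURRENCY = 30  # current per-partition setting mentioned above

def worst_case_extra_jobs(new_concurrency: int) -> int:
    """Upper bound on extra concurrent jobs hitting jobrunners/databases
    if every partition bursts at once (the 8xN amplification)."""
    return PARTITIONS * (new_concurrency - CURRENT_CONCURRENCY)

# A hypothetical ~30% bump (30 -> 39) could mean up to 8 * 9 = 72 extra
# concurrent jobs in the worst case, not just 9.
extra = worst_case_extra_jobs(39)
```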

All in all, there is less clarity than you'd probably like regarding what the repercussions would be.

(I’m not sure how we could evaluate the effect of any such change beyond checking the number of templatelinks for this one particular template because we happen to know it should be lower…

That's pretty telling, isn't it? To ask the question in a different way: what would the visible benefit be?

do we track the duration between the root job and a “leaf” job’s completion somewhere?

Care to elaborate a bit on this one? By root and leaf, do you mean the non-partitioned vs. partitioned jobs? Or something else?

The JobQueue Job dashboard has the refreshLinks p99 normal backlog time, which is stable at 1 h, but I suspect this might be the time between the “leaf” job being scheduled and completed.)

I doubt that number is of much use. The maximum histogram bucket is 3600, which is the 1h you are seeing; I wouldn't use it for anything. I'll file a patch, though, to increase the maximum bucket.

https://grafana.wikimedia.org/goto/g3I4GXPNR?orgId=1 (which represents the average) gives somewhat more useful numbers, ranging from less than 1m (very recently, in fact) up to 4 days. But that number tells us hyperswitch's internal startup delay; we use it as a proxy for the normal backlog time, but it's less of a direct metric than we would all like.

I'll file a patch, though, to increase the maximum bucket.

Already being done by @hnowlan in an unrelated patch at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1164422