Page MenuHomePhabricator

Queuing jobs is extremely slow
Open, Needs TriagePublic

Description

Looking at a recent flamegraph of the jobqueue, it seems 23% of the wall time in the jobrunners is just being spent on queuing CirrusSearch jobs. @dcausse looked at these jobs and they don't contain anything large. In fact, the time it takes to write the actual content to Elastic is around 3% while pushing the jobs that contain only metadata is taking 23%.

An edit in wikidata is also similar. XHGui says 2 seconds (out of 3.3 seconds) has been spent on queuing 10 jobs.

According to reverse flamegraphs. 4% of wall time in appservers (handing index.php) and 46% of the wall time on jobrunners is being spent on just queuing the jobs.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

This is verbose logs of jobs being queued, if that's going to help https://logstash.wikimedia.org/goto/a288ef574056abcd7bea7ad72eb2c3a4

it seems the time to establish connection is negligible but time to consume it is not. 40ms is enough time to queue the job in another dc.

Change 724820 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/CirrusSearch@master] Use lazypush for queuing job of onLinksUpdateCompleted hook

https://gerrit.wikimedia.org/r/724820

Change 724822 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/Wikibase@master] Use lazypush for pushing the DispatchChanges job

https://gerrit.wikimedia.org/r/724822

Change 724826 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/Echo@master] Use lazypush for queuing jobs

https://gerrit.wikimedia.org/r/724826

JobQueueEventBus is splitting jobs by type and then sends each job type separately.

@Ottomata I forgot, can we send multiple different streams to eventgate in a single request?

Yes for sure, as long as each one has meta.stream set correctly, eventgate will route them to the right place.

Change 724828 had a related patch set uploaded (by Ppchelko; author: Ppchelko):

[mediawiki/extensions/EventBus@master] Batch together all jobs destined to the same eventgate

https://gerrit.wikimedia.org/r/724828

Change 724843 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] jobqueue: Batch jobs that will end up in the default queue

https://gerrit.wikimedia.org/r/724843

I think batching more jobs together is a good idea but in the case of cirrusSearchElasticaWrite jobs cirrus already uses JobQueueGroup::singleton()->push( $jobs ); pushing 3 jobs at once but still the timings are very high. Do we have a comparison with similar usages and the use of the hasty flag (just to make sure that the time is not spent in eventgate)?

I think batching more jobs together is a good idea but in the case of cirrusSearchElasticaWrite jobs cirrus already uses JobQueueGroup::singleton()->push( $jobs ); pushing 3 jobs at once but still the timings are very high. Do we have a comparison with similar usages and the use of the hasty flag (just to make sure that the time is not spent in eventgate)?

Well, CirrusSearch batches it correctly but core happily split them by type and pushes them separately *put the old man in pain meme here*. That's one thing I'm planning to tackle and fix today.

That being said, I'm happy with experimenting with hasty flag as well. Possibly a bit later to avoid things getting meshed together (so we can isolate and measure impact of each change separately).

Change 724822 merged by jenkins-bot:

[mediawiki/extensions/Wikibase@master] Use lazypush for pushing the DispatchChanges job

https://gerrit.wikimedia.org/r/724822

Change 724820 merged by jenkins-bot:

[mediawiki/extensions/CirrusSearch@master] Use lazypush for queuing job of onLinksUpdateCompleted hook

https://gerrit.wikimedia.org/r/724820

EventGate's hasty param should only be used if you don't want acknowledgement that your events have been successfully persisted. Using it allows EventGate to respond to the HTTP request ASAP, even before the event has been validated against its schema. We'll should have logs, etc. from EventGate if things fail, but MediaWiki (or whatever app) will not get any synchronous feedback about it.

EventGate's hasty param should only be used if you don't want acknowledgement that your events have been successfully persisted. Using it allows EventGate to respond to the HTTP request ASAP, even before the event has been validated against its schema. We'll should have logs, etc. from EventGate if things fail, but MediaWiki (or whatever app) will not get any synchronous feedback about it.

I don't think we should use hasty everywhere as it would be pretty dangerous but I remember that MW is using some eventgate endpoints with hasty enabled (monolog?) and thus was wondering if we see the same slowness there. If not it would perhaps mean that waiting for eventgate to do its stuff is more to blame than HTTP conversations.

Change 725258 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Lazy push WatchlistExpiryJob

https://gerrit.wikimedia.org/r/725258

Change 725259 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/Cognate@master] Lazy push CacheInvalidator job

https://gerrit.wikimedia.org/r/725259

Change 725259 merged by jenkins-bot:

[mediawiki/extensions/Cognate@master] Lazy push CacheInvalidator job

https://gerrit.wikimedia.org/r/725259

Edit: Report of errors moved to T292291

We haven't deployed anything related to this ticket yet but it's probably something with Elastic. I let the search team know so they take a look

jcrespo triaged this task as High priority.EditedOct 1 2021, 12:05 PM

Setting to high because almost half a million extra errors per hour. I will revert if the issue is unrelated to the original reporting and create a separate task instead.

I'm not sure if these are related. It's a problem for sure but we haven't deployed anything related to this ticket. I suggest creating a dedicated ticket and tagging search platform team.

I suggest creating a dedicated ticket and tagging search platform team.

Ok.

jcrespo raised the priority of this task from High to Needs Triage.Oct 1 2021, 12:19 PM

Change 725258 merged by jenkins-bot:

[mediawiki/core@master] watcheditem: Use lazy push for WatchlistExpiryJob

https://gerrit.wikimedia.org/r/725258

Change 724826 abandoned by Ladsgroup:

[mediawiki/extensions/Echo@master] Use lazypush for queuing jobs

Reason:

This one doesn't give much benefit, the ones that do are already fixed now.

https://gerrit.wikimedia.org/r/724826

We ran into this issue again today, we introduced a new job, let's call it A that queues jobs in other wikis, let call them job B. Ratio of job B is sixteen times more than job A. Job B does the actual work of looking up data, inserting data, etc. and yet with basically the same concurrency, backlog of job A is growing without stop. Batching wouldn't help here, the p99 of that job that just inject jobs is 2 minutes.

Maybe, Eventgate-main keeps all jobs in a stream waiting. Treating input for each stream as serial?

Change 725970 had a related patch set uploaded (by Michael Große; author: Michael Große):

[mediawiki/extensions/Wikibase@master] Track time it takes to insert a job in DispatchChanges

https://gerrit.wikimedia.org/r/725970

Change 726513 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[operations/deployment-charts@master] eventgate-main: Increase number of replicas to five (from 3)

https://gerrit.wikimedia.org/r/726513

Change 726513 merged by jenkins-bot:

[operations/deployment-charts@master] eventgate-main: Increase number of replicas to five (from 3)

https://gerrit.wikimedia.org/r/726513

while this definitely has helped but I'm not sure it's the root cause. I have a new hypotheses. Might be the schemas in the eventgate are not cached. Looking at the code I do see that it gets cached in the class but maybe for whatever reason the class gets rebuilt in eventgate-wikimedia.js. A quick double check e.g. by doing tcpdump on the containers would be amazing.

eventgate-main does not use the remote schemas. The repositories are cloned and baked into the docker image.

But I think you are talking about the AJV compiled versions of the schemas? AFAIK those should never be un-cached, as the schemas are meant to be immutable. When we have modified a schema version in the past, we have to restart eventgate to get it to pick up the change.

I don't know the code well enough to judge but I ask for a double check when you have time.

Adding the eventgate replicas managed to reduce the time a lot.

This is not all of it, we deployed the change seven hours in the day so tomorrow we can get a better metric. If it's still CPU throttling and/or have other issues we can maybe increase it a bit more. Also later today we will deploy several changes on the jobqueue batching improvements. Fingers crossed that it would help.

Change 724843 merged by jenkins-bot:

[mediawiki/core@master] jobqueue: Batch jobs that will end up in the default queue

https://gerrit.wikimedia.org/r/724843

Yup. The full results are in:

That's roughly more then twice better than the original value. I wonder if we should increase the replicas even more 🤔

Let's get the batching out of the door for now to see its impact.

Yup. The full results are in:

That's roughly more then twice better than the original value. I wonder if we should increase the replicas even more 🤔

Let's get the batching out of the door for now to see its impact.

Or maybe we should remove a tcp connection from the equation.

Change 727186 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.3] jobqueue: Batch jobs that will end up in the default queue

https://gerrit.wikimedia.org/r/727186

Change 727187 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.38.0-wmf.2] jobqueue: Batch jobs that will end up in the default queue

https://gerrit.wikimedia.org/r/727187

Change 727170 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/extensions/EventBus@master] Set supportsTypeAgnostic to true for the JobQueue class

https://gerrit.wikimedia.org/r/727170

Change 727186 abandoned by Ladsgroup:

[mediawiki/core@wmf/1.38.0-wmf.3] jobqueue: Batch jobs that will end up in the default queue

Reason:

not worth backporting

https://gerrit.wikimedia.org/r/727186

Change 725970 abandoned by Michael Große:

[mediawiki/extensions/Wikibase@master] Track time it takes to insert a job in DispatchChanges

Reason:

https://gerrit.wikimedia.org/r/725970

Change 727187 abandoned by Ladsgroup:

[mediawiki/core@wmf/1.38.0-wmf.2] jobqueue: Batch jobs that will end up in the default queue

Reason:

wmf.2 is gone AFAIK

https://gerrit.wikimedia.org/r/727187

We ran into this issue again today, we introduced a new job, let's call it A that queues jobs in other wikis, let call them job B. Ratio of job B is sixteen times more than job A. Job B does the actual work of looking up data, inserting data, etc. and yet with basically the same concurrency, backlog of job A is growing without stop. Batching wouldn't help here, the p99 of that job that just inject jobs is 2 minutes.

This is exactly what the GlobalUserPageLocalJobSubmitJob job does:

	public function run() {
		$job = new LocalCacheUpdateJob(
			Title::newFromText( 'User:' . $this->params['username'] ),
			$this->params
		);
		foreach ( GlobalUserPage::getEnabledWikis() as $wiki ) {
			JobQueueGroup::singleton( $wiki )->push( $job );
		}
		return true;
	}

Apparently it's taking 5 minutes (some timeout?): https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?viewPanel=3&orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=GlobalUserPageLocalJobSubmitJob&from=now-7d&to=now

Screenshot 2021-10-14 at 12-07-19 JobQueue Job - Grafana.png (1×3 px, 208 KB)

Is this the same issue, should I file a different task for this specifically? Nothing has really changed in the GlobalUserPage code for years, these jobs used to run incredibly quickly.

Increasing the replicas also reduced the save time p75

Wow, let's add more more replicas. I can see the p99 for EventGate response time was reduced too.

Screen Shot 2021-10-14 at 15.36.34.png (1×2 px, 789 KB)

Also not sure if eventgate latency is the root cause, but it can't hurt to make it better.

Or maybe we should remove a tcp connection from the equation.

@Joe you mean skip the eventgate middle-man? I'm all for it...I guess we'd need something in MediaWiki to validate and produce to Kafka...although why not something not MediaWiki PHP that is located with MW? Any reason we can't/shouldn't run an eventgate sidecar along with MW in k8s?

Any reason we can't/shouldn't run an eventgate sidecar along with MW in k8s?

That doesn't remove the TCP connection, but at least it localizes and scales it out a bit?

Did anyone happen to look back at this graph further than this.. ? Because this spikes rather suddenly on the UTC morning of September 13th. It wasn't some sort of gradual increase.

Screenshot 2021-10-15 at 11.14.29.png (385×739 px, 82 KB)

Screenshot 2021-10-15 at 11.18.39.png (361×741 px, 62 KB)

Increasing the replicas also reduced the save time p75

Wow, let's add more more replicas.

I'm fine with it but I will be traveling the next two weeks. Doing so is really easy so if it's possible, please do it.

Also not sure if eventgate latency is the root cause, but it can't hurt to make it better.

My request is a double check. It's not in mediawiki side, I checked in every way I could and trust me I looked at everything.

Eventgate is a bit of obscure code that is hard to debug in production, something like xhgui would be amazing on it, maybe just spawn a replica in debug logging for a minute and look at the logs? I'm not familiar enough with k8s to pull it off.

Change 727170 merged by jenkins-bot:

[mediawiki/extensions/EventBus@master] Set supportsTypeAgnostic to true for the JobQueue class

https://gerrit.wikimedia.org/r/727170

I think you are right, looking at eventgate -> kafka latencies, etc, it does look like the topic balance correlates.

Screen Shot 2021-11-12 at 09.15.46.png (1×3 px, 244 KB)
link

Screen Shot 2021-11-12 at 09.13.53.png (1×2 px, 745 KB)
link (max rtt is better too)

According to https://phabricator.wikimedia.org/T288825#7432335, the balances started on 10-15 and finished on 10-18, which lines up.

Ping @elukey, thanks for the rebalances!