
Run Mediawiki Jobs in the background not at the end of requests
Closed, ResolvedPublic8 Estimated Story Points

Description

Issue:
For reasons that are not entirely clear, we need to limit the number of jobs run per request to MediaWiki to 2 (https://www.mediawiki.org/wiki/Manual:Job_queue). When someone does a lot of edits but few reads, we can end up with a backlog of half a million jobs per Wiki. Users experience this as their data not appearing in their Wikibase, when in fact it is there, just not yet indexed in Search. Examples: T330069#8632297
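
For reference, the per-request limit mentioned above is the $wgJobRunRate setting described on the linked manual page. A minimal LocalSettings.php sketch (where exactly this is set in our stack is not shown here):

// LocalSettings.php (sketch): run at most two queued jobs at the end of each web request.
// With many edits but few reads, jobs are enqueued much faster than this drains them.
$wgJobRunRate = 2;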

We need to automate the process of running the jobs that are stuck in the backlog, which we now do manually.
There are probably a few options to implement this:
1. have a new k8s deployment checking all wikis and running their jobs as needed in the same pod
2. have a new k8s deployment checking for mediawiki jobs and then creating kubernetes jobs to run them
3. use the platform api queue to create kubernetes jobs to run the mediawiki jobs < THE WINNER
4. use the platform api queue to run them in existing mediawiki pods

AC:

  • One Wiki doesn't block another
  • Any change from an individual user will not appear later than 3 mins
  • Don't DDoS ourselves
  • Needs to be run per Wiki, since the backlog gets stuck per Wiki

Event Timeline

I was interested to see how many wikis have a huge job backlog and did some spelunking. It seems most wikis can keep up, but some have amassed a high number of pending jobs (this is the state as of 02/27/2023 at ~3 PM).

Evelien_WMDE set the point value for this task to 8.

For reasons not clear, we need to limit the amount of jobs per request to MediaWiki to 2 (https://www.mediawiki.org/wiki/Manual:Job_queue).

From memory, I believe that even though this is set to 2, only one job will actually get executed at the end of a request.
Not sure if this has been fixed in recent years.

  3. use the platform api queue to create kubernetes jobs to run the mediawiki jobs < THE WINNER

Sounds perfectly fine for a first take on this.
The MediaWiki job running doesn't need to be super resilient.
It does draw another connection between the API and MediaWiki, but I think that is unavoidable one way or another if the jobs are to keep running on sites as things continue to scale.

You could invert the relationship slightly.
Option 3 has the API triggering the job in k8s.
You could have an option 5 where there is a permanently running service which checks with the API what sites need jobs run and creates the k8s jobs itself, meaning the API needs to know less about MediaWiki.
In practice the API could just return a random (or not so random) wiki ID that exists (you could do more complicated things in the future).
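
To make that concrete, a rough sketch of what such a service's loop could look like (the endpoint path and the helper function here are hypothetical, purely for illustration):

// Hypothetical option-5 service: poll the API for a wiki that needs its job
// queue drained, then create a Kubernetes Job for it.
while (true) {
    // Hypothetical endpoint returning e.g. {"wiki": "example.wikibase.cloud"} or null.
    $response = file_get_contents('https://api.example.internal/backend/wiki-needing-job-run');
    $wiki = json_decode($response, true);

    if (is_array($wiki) && isset($wiki['wiki'])) {
        // Hypothetical helper wrapping the Kubernetes client.
        createRunAllJobsJobForWiki($wiki['wiki']);
    }

    sleep(60);
}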

One thing to consider in this decision would be the future and what other things you also might want to do.
For example, something similar to option 5 (or the same service) could be used to schedule pods to reindex Elasticsearch when needed.
This service could also extract the running of update.php out of both the API and MediaWiki, giving it its own resources in k8s pods.

Again, all of this could be done in the API, but it would certainly bring a lot of k8s and MediaWiki concerns into the API on top of everything else that is there.

I'm curious about this statement from the original post:

When someone does a lot of edits, but no reads

I've been struggling with how to scale bots appropriately to handle a relatively large volume of entity/claim creation. I have done some parallelization with Dask, but I'm guessing that just makes things worse for this type of issue. If an item read coupled with edits triggers necessary actions, I can build that into a bot process, even though that will introduce more transactions.

Following up on my previous comment, I tried this type of write/read/write approach in the latest work for the https://eew-edgi.wikibase.cloud Wikibase instance. I needed to work through all the versions of the North American Industry Classification System (NAICS), creating items for the different industries to be linked to and support analysis of regulated facilities. The bot for this using WikibaseIntegrator works through the 6 different years/editions of the NAICS to develop logical items and their ExternalIDs. I first ran through and created the individual 1981 items and then worked over the items again to add in hierarchical relationships through "has part" and "part of" claims. This involved reading each item serially to add in additional claims, with the idea this might trigger something to complete the jobs.

Based on this report, the Wikibase instance once again has thousands of incomplete jobs. If someone reading this has the ability to kick something off to clear the backlog, that would be great!

Based on this report, the Wikibase instance once again has thousands of incomplete jobs.

I just triggered a job that will process the pending jobs (*10365* as of now).

Well, I managed to get a whole backlog of jobs going again in another wikibase.cloud instance. This one is showing over 50K jobs.

In this case, I built out a representation of about 12K government reports (mining technical reports submitted to the Canadian government). We use these to develop mineral resource exploration history in conducting scientific assessments. I am working on how we pull in information derived through an AI engine that is associated with these documents as claims with evidence so that we can build reasoning to determine what we can use. I ran a process here to pull some information on geologic time periods addressed in the documents.

Things are mostly all working (other than search, mouseover, etc.), but it will be cool to get this figured out at some point.

I triggered a job for processing the pending 50k jobs for geokb.wikibase.cloud

Well, I apologize, but I've once again started to clog up the system with edits. If someone can kick off some job processing for geokb.wikibase.cloud again, that would be great.

This time, it's rock types, classification, and I'll be pulling in mineral species connected with those. I'm starting to think more about a strategy on this where we have perhaps multiple interacting wikibase instances, each with some scope of a domain of knowledge they are responsible for. The issue we run into is that, despite some good progress in some communities on developing formal ontologies, much of the information and knowledge we need has not yet been organized this way. Wikibase provides a great platform for doing that, but we'll hit a point where we are pushing too many copies of the same items/claims into more contexts than we really need. Ideally, we'll curate much of this into Wikidata at some point, but there are a ton of semantic dissonance issues to hash out there yet.

I triggered a job for processing the pending 11903 items in the queue.

@Fring - Fantastic! Thank you! I see they cleared out pretty quick, and search is all working as it should.

We talked a little in person about the awkwardness of the k8s client we currently use. This is another place where the importance of T323211 has surfaced.

Glad to see this working its way toward some kind of resolution. In the meantime, would someone mind kicking off a job to complete jobs for the geokb.wikibase.cloud instance? I've got some things backed up that are causing problems in SPARQL responses.

@Skybristol a job is now running for geokb.wikibase.cloud.

Fring removed Fring as the assignee of this task.May 15 2023, 8:30 AM
Fring moved this task from Doing to In Review on the Wikibase Cloud (WB Cloud Sprint 20) board.

Relevant PRs:

Following up on our call, here's what I found:

Selecting jobs on status

Inspecting different kinds of job states locally, the situation still seems ugly. This is what I came up with:

// Look at the first condition Kubernetes reports for the Job
$condition = data_get($job->toArray(), 'status.conditions.0', []);
$status = data_get($condition, 'status', 'False');
$type = data_get($condition, 'type', 'n/a');

// No confirmed condition yet: treat the job as still running
if ($status !== 'True') {
    return true;
}

// A confirmed condition that is neither Complete nor Failed: also still running
return $type !== 'Complete' && $type !== 'Failed';

which probably covers conventional cases, but I have zero idea about what edge cases there are left for us.

Limiting the number of times a job can fail

When a job fails to run and exits non-zero, Kubernetes wants to retry running the job immediately until it has failed 6 times. This number can be configured (see the sketch below). I'm not sure whether it's better to set it to 1 and have Laravel enqueue a new job, or to have Kubernetes retry a couple of times. It probably makes no real difference.
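
For illustration, the retry count is the Job spec's backoffLimit field; a sketch of the fragment we would generate from the queue worker (the surrounding names and values are assumptions):

// Sketch of the Job manifest fragment controlling retries (names/values assumed).
$jobSpec = [
    'apiVersion' => 'batch/v1',
    'kind' => 'Job',
    'metadata' => ['name' => 'run-all-mw-jobs-example'],
    'spec' => [
        // Kubernetes retries a failing pod until this many failures (default 6).
        // Setting it to 1 would leave retrying to the Laravel job instead.
        'backoffLimit' => 1,
        'template' => [ /* pod template that runs the MediaWiki jobs for one wiki */ ],
    ],
];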

Adjusting backoff limits for scheduling failures

I couldn't really find any information about how to easily configure the backoff behavior for the job-controller / scheduler. Not sure if this is meant to be configured, or if we need to add another kubernetes object that controls this?

I also looked into whether it's possible to deduplicate jobs per wiki on Kubernetes level (so we don't have to deal with fiddling with the selector). This could work in theory (we'd have to set .spec.ttlSecondsAfterFinished to 0), but it's currently blocked by the Kubernetes Client library silently swallowing 409 status codes (which is what the API returns in case you try to create the same job twice):

			// Error Handling
			if ($response->getStatusCode() >= 500) {
				$msg = substr((string) $response->getBody(), 0, 1200); // Limit maximum chars
				throw new ApiServerException("Server responded with 500 Error: " . $msg, 500);
			}

			if (in_array($response->getStatusCode(), [401, 403], true)) {
				$msg = substr((string) $response->getBody(), 0, 1200); // Limit maximum chars
				throw new ApiServerException("Authentication Exception: " . $msg, $response->getStatusCode());
			}

			if (isset($requestOptions['stream']) && $requestOptions['stream'] === true) {
				return $response;
			}

			$responseBody = (string) $response->getBody();
			$jsonResponse = json_decode($responseBody, true);

			return is_array($jsonResponse) ? $jsonResponse : $responseBody;

which means we'd need to resort to looking for certain tokens in the response body (or just not look at it at all, accepting all 404-409 errors), which seems overly brittle to me (then again, we already do it).

However, since it makes the Laravel job as dumb as possible, I do like the idea, so I wonder if we can:

  • somehow patch that dreaded client library to bail on any 4xx response
  • use a different client lib right now
  • live with accepting the 40[2,4-9]s

Turns out, the situation is even worse. The error handling code in the client library above never runs; instead a BadRequestException is raised, swallowing the response code:

		} catch (HttpTransferException $e) {
			$response = $e->getResponse();

			$responseBody = (string) $response->getBody();

			if (in_array('application/json', $response->getHeader('Content-Type'), true)) {
				$jsonResponse = json_decode($responseBody, true);

				if ($this->isUpgradeRequestRequired($jsonResponse)) {
					return $this->sendUpgradeRequest($requestUrl, $query);
				}
			}

			throw new BadRequestException($responseBody, 0, $e);
		}

I chose to set the TTL for finished jobs to 0, which effectively removes the need for any filtering: when a job is still present, it's guaranteed it's not done.
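
Roughly, the idea looks like this. Everything below is a sketch: the $client call, the Job wrapper and the "AlreadyExists" token check are assumptions about how the duplicate create surfaces through the BadRequestException shown above, not confirmed behaviour.

// Sketch: one deterministic Job per wiki, cleaned up immediately on completion.
$jobSpec = [
    'apiVersion' => 'batch/v1',
    'kind' => 'Job',
    'metadata' => ['name' => $jobName], // deterministic per wiki, so duplicates collide
    'spec' => [
        // Delete the Job object as soon as it finishes, so "name already exists"
        // reliably means "a run for this wiki is still in progress".
        'ttlSecondsAfterFinished' => 0,
        'template' => [ /* pod template running the MediaWiki jobs */ ],
    ],
];

try {
    $client->jobs()->create(new Job($jobSpec)); // assumed client call, see caveat above
} catch (BadRequestException $e) {
    // Assumption: a rejected duplicate create carries a 409 body containing "AlreadyExists".
    if (strpos($e->getMessage(), 'AlreadyExists') === false) {
        throw $e;
    }
    // Otherwise a job for this wiki already exists and there is nothing to do.
}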

I *think* I have everything set up correctly right now on my local cluster so that the automatic MediaWiki job magic should work, but it doesn't seem to (the job count for my wiki doesn't decrease over time).

When dispatching PollForMediaWikiJobsJob, I see in the api logs:

api-queue-6c857cfb7f-z4qkw api-queue [2023-05-31 16:07:59] production.INFO: MediaWiki Job for wiki "test1test.wbaas.localhost" created with name "n/a".

So I guess something between ProcessMediaWikiJobsJob.php and k8s doesn't quite work out. (I also don't see any jobs created in k8s.)

@Deniz_WMDE The state you looked at was indeed broken as my naming scheme created names that were longer than K8s would allow. This is fixed in https://github.com/wbstack/api/pull/603/commits/3bee2e8832fa6c6fd6c7cb930fd2f91d08a75826
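
For context, the job names in the logs further down look like a fixed prefix plus a SHA-1, which keeps them deterministic per wiki and safely short (Kubernetes names that end up in labels are limited to roughly 63 characters). A sketch, assuming the hash input is the wiki domain:

// Sketch of a length-safe, deterministic Job name per wiki (hash input assumed).
// 'run-all-mw-jobs-' (16 chars) + sha1 (40 chars) = 56 chars, within label limits.
$jobName = 'run-all-mw-jobs-' . sha1($wikiDomain);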

@Deniz_WMDE The state you looked at was indeed broken as my naming scheme created names that were longer than K8s would allow. This is fixed in https://github.com/wbstack/api/pull/603/commits/3bee2e8832fa6c6fd6c7cb930fd2f91d08a75826

With the current state I can happily report that this works now on my local cluster! Amazing!

I tested it using:

note: while I used the new chart, I deployed the api image via skaffold.

I then created a new wiki and created several hundreds of items via wikibase-cli.

In the api-queue logs I could follow how every minute more and more jobs got processed:

api-queue-8d859ffcc-6tvd6 api-queue [2023-06-01 11:28:23][SRNDU5WDY8fhlFysEDf77NT0PTD2ZTf1] Processing: App\Jobs\PollForMediaWikiJobsJob
api-queue-8d859ffcc-6tvd6 api-queue [2023-06-01 11:28:24][SRNDU5WDY8fhlFysEDf77NT0PTD2ZTf1] Processed:  App\Jobs\PollForMediaWikiJobsJob
api-queue-8d859ffcc-6tvd6 api-queue [2023-06-01 11:28:24][1Vee9C56jJrwoOoBKFfjxhRt2JazWVN4] Processing: App\Jobs\ProcessMediaWikiJobsJob
api-queue-8d859ffcc-6tvd6 api-queue [2023-06-01 11:28:24] production.INFO: MediaWiki Job for wiki "test1test.wbaas.localhost" created with name "run-all-mw-jobs-39d082fdfdb938bfbad4fdbefea3301e285f8e93".  
api-queue-8d859ffcc-6tvd6 api-queue [2023-06-01 11:28:24][1Vee9C56jJrwoOoBKFfjxhRt2JazWVN4] Processed:  App\Jobs\ProcessMediaWikiJobsJob

Also the pending job count in these stats dropped to 0 after some time (the following link most probably doesn't work for anyone as-is):
http://test1test.wbaas.localhost/w/api.php?action=query&meta=siteinfo&siprop=statistics

Let's get this to staging :)

Deployment to staging fails because of issues with the GKE provider. New PR here: https://github.com/wmde/wbaas-deploy/pull/931

Fring removed Fring as the assignee of this task.Jun 1 2023, 12:36 PM

This PR fixed the deployment to staging: https://github.com/wmde/wbaas-deploy/pull/935
A similar change is needed for prod.

Also: there seem to be issues with accessing the k8s secrets. Tried to test this on staging and the k8s job kept failing:

│ Warning Failed 5m33s (x12 over 7m43s) kubelet Error: secret "smtp-credentials" not found │

Evelien_WMDE claimed this task.