JobQueue: Unify JobRunner entry points
Open, MediumPublic

Description

Historically, the HTTP-facing script for running jobs on the jobrunners (RunJobs.php) has resided in the wmf-config repository, despite the fact that it is not really configuration material.

Instead, we think the rpc/RunJobs.php and rpc/RunSingleJob.php scripts should be moved to the MediaWiki core repository under the /rpc hierarchy (as this is the current location they are being synced to during deployment).

Rationale:

  • They should be treated as software, not configuration
  • They are not WMF-specific, but rather, specific to MW installs using an HTTP interface for running their jobs asynchronously.
  • They are tightly coupled with the MW JobQueue codebase and infrastructure.
  • Easier sharing with other environments (BetaCluster, MW-Vagrant, etc)

Thoughts? Suggestions? Objections?

As of February 2020 (based on T175146#5882126) we have:

  1. MediaWiki core: Special:RunJobs (default for MW)
    • Protected at run-time by HMAC/SHA1
  2. MediaWiki-EventBus extension: Special:RunSingleJob (unused)
    • Protected at run-time by JWT/sha256
    • Removed in Feb 2020, T244826.
  3. MediaWiki-EventBus extension: /w/rest.php/eventbus/v0/internal/job/execute RunSingleJobHandler (unused)
    • Protected structurally via config (would only be enabled on jobrunner servers).
    • Protected at run-time by SHA1 secret signature.
    • Introduced in Feb 2020, T244826.
  4. WMF: rpc/RunJobs.php (unused)
    • Protected at run-time by REMOTE_ADDR check (localhost-only).
    • Protected structurally as it is only exposed on a non-public pool of app servers.
    • Removed in Jun 2022, https://gerrit.wikimedia.org/r/c/805775.
  5. WMF: rpc/RunSingleJob.php (default for WMF)
    • Not protected at run-time.
    • Protected structurally as it is only exposed on a non-public pool of app servers.

Event Timeline

This hasn't been fully implemented yet; SpecialRunSingleJob still lives in the EventBus extension, not MediaWiki core.

eprodromou claimed this task.
eprodromou subscribed.

@Legoktm that's a good point, but I think it makes sense to have a separate issue about SpecialRunSingleJob.

Not sure this should be closed. RunJobs.php and RunSingleJob.php are still present in (and used from) mediawiki-config, not MediaWiki core.

@mobrovac ha! I misread the thread. There's no substitute for actually checking the repos! Thanks.

Change 385382 abandoned by Mobrovac:
CP-JobQueue: Use the Special:RunSingleJob page to execute jobs

https://gerrit.wikimedia.org/r/385382

As of Feb 2020, the following are at play:

  • MediaWiki core: Special:RunJobs (default for MW)
    • Protected at run-time by HMAC/SHA1
  • MediaWiki-EventBus extension: Special:RunSingleJob (unused?)
    • Protected at run-time by JWT/sha256
  • WMF: rpc/RunJobs.php (unused?)
    • Protected at run-time by REMOTE_ADDR check (localhost-only).
    • Protected structurally as it is only exposed on a non-public pool of app servers.
  • WMF: rpc/RunSingleJob.php
    • Not protected at run-time.
    • Protected structurally as it is only exposed on a non-public pool of app servers.

Change 575392 had a related patch set uploaded (by Aaron Schulz; owner: Aaron Schulz):
[operations/puppet@production] Remove references to obsolete rpc/RunJobs.php endpoint

https://gerrit.wikimedia.org/r/575392

Naike changed the task status from Open to Stalled. May 29 2020, 9:21 AM

Change 805775 had a related patch set uploaded (by D3r1ck01; author: Derick Alangi):

[operations/mediawiki-config@master] rpc: Remove unused RunJobs.php

https://gerrit.wikimedia.org/r/805775

Change 805775 merged by jenkins-bot:

[operations/mediawiki-config@master] rpc: Remove unused RunJobs.php

https://gerrit.wikimedia.org/r/805775

Mentioned in SAL (#wikimedia-operations) [2022-06-21T13:28:38Z] <daniel@deploy1002> Synchronized rpc/: Config: [[gerrit:805775|rpc: Remove unused RunJobs.php (T175146 T243096)]] (duration: 03m 45s)

daniel claimed this task.

RunJobs.php has been deleted

Krinkle reopened this task as Open. Jun 21 2022, 8:08 PM

That's technically true, but not what the task is about.

Krinkle renamed this task from [RfC] Move RunJobs.php to the mediawiki (core) repository to [RfC] Move RunSingleJob.php to the mediawiki (core) repository. Jun 21 2022, 8:08 PM
Krinkle renamed this task from [RfC] Move RunSingleJob.php to the mediawiki (core) repository to JobQueue: Unify JobRunner entry points. Mar 19 2024, 6:04 PM

This needs discussion with someone from service ops about how to safely expose RunSingleJob as a REST endpoint, so it does not become publicly accessible.

Quick brainstorming:

We could introduce a utility method into MediaWiki (not sure where yet) that returns true if the current request is trusted to be coming from the private subnet. The RunSingleJob endpoint would fail if that method returns false.

The implementation of the isTrustedLocalClient method would:

  • return false on a vanilla install
  • use a configuration variable that defines a subnet mask for request IPs (checked after resolving XFF)
  • optionally accept a callback function from the configuration, so we can use a more complex check in production
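
A rough sketch of the check described above, written in Python only to keep it short and self-contained. It is not MediaWiki code: the function name, the parameters and the rule that a configured callback takes precedence over the subnet list are all assumptions made for illustration, and the client IP is assumed to have been resolved from XFF already.

```python
import ipaddress
from typing import Callable, Optional, Sequence


def is_trusted_local_client(
    client_ip: str,
    trusted_subnets: Sequence[str] = (),
    callback: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Return True only if the (already XFF-resolved) client IP is trusted.

    Hypothetical helper: with no configuration at all it returns False,
    which matches the "vanilla install" case.
    """
    # Assumption: a configured callback replaces the subnet check entirely.
    if callback is not None:
        return bool(callback(client_ip))

    try:
        ip = ipaddress.ip_address(client_ip)
    except ValueError:
        # Unparseable addresses are never trusted.
        return False

    # Trusted if the address falls inside any configured subnet.
    for subnet in trusted_subnets:
        if ip in ipaddress.ip_network(subnet, strict=False):
            return True
    return False
```

With this shape, a third-party install would only set the subnet list, while production could pass a callback that implements whatever logic it needs.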

Questions:

  • Would such a generic mechanism be useful? Or should we just have a setting that specifies the subnet mask for hosts that should be allowed to call RunSingleJob specifically?
  • Is it safe enough to rely on the IP address alone? Is that what we are effectively doing right now?
  • SpecialRunJobs requires a secret to be specified; could we just do that for RunSingleJob as well? Why didn't we do that originally?

Krinkle updated the task description.

@daniel Are you thinking about IP subnets as a protection mechanism because you prefer it, or because you think it's what we use today? Of the 5 job runners in the task description, number 4 used this mechanism. However, that one (rpc/RunJobs.php) has not been in use for several years. I've updated the task description to clarify this.

I mention it because it's an easy option to include that would be viable for third parties. For production use at WMF, we could implement whatever logic we want in a callback, or we could just leave the feature disabled on most clusters and only enable it on an internal cluster (by specifying a subnet mask, which could be as permissive or restrictive as we want; it probably doesn't matter).

Change #1019398 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] WIP: add REST handler for running a job

https://gerrit.wikimedia.org/r/1019398

Change #1019399 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] WIP: add a REST endpoint equivalent to SpecialRunJobs.

https://gerrit.wikimedia.org/r/1019399

@daniel Is the REST API safe to remove from the EventBus extension? We may want to do that first.

Looks like RunSingleJobHandler is enabled on jobrunners, but according to Graphite, nothing is hitting that endpoint.

What I did in the patch above is pretty much just pulling that endpoint into core. I also added a few filter mechanisms for the benefit of third parties.

Why do you think that endpoint should be removed first? I think the order doesn't matter...

What I am trying to figure out right now is how we can send API requests to the correct domain. rpc/RunSingleJob.php looks at the "database" field of the payload and then uses MWMultiVersion::getMediaWiki to initialize the correct wiki. If we use the REST API, we have to use the Host header instead. Or we send some other header, like X-Wiki-Name, that gets interpreted by MWMultiVersion::getMediaWiki. But that shouldn't work for external requests.
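
To make the two routing options concrete, here is a minimal sketch in Python (not the real MWMultiVersion logic). The host-to-wiki table, the function name and the `trusted` flag are hypothetical; X-Wiki-Name is the header floated above, and the sketch only honours it for requests already known to be internal.

```python
from typing import Mapping, Optional

# Hypothetical mapping from request host to wiki database name.
HOST_TO_WIKI = {
    "en.wikipedia.org": "enwiki",
    "de.wikipedia.org": "dewiki",
}


def select_wiki(
    headers: Mapping[str, str],
    trusted: bool,
    host_map: Mapping[str, str] = HOST_TO_WIKI,
) -> Optional[str]:
    """Pick the wiki that should process a job request.

    Assumes header names arrive in the exact casing used below.
    """
    # The override header is ignored unless the request is trusted,
    # so external requests cannot pick an arbitrary wiki.
    if trusted:
        override = headers.get("X-Wiki-Name")
        if override and override in host_map.values():
            return override

    # Default: derive the wiki from the Host header (dropping any port).
    host = headers.get("Host", "").split(":")[0].lower()
    return host_map.get(host)
```

For example, `select_wiki({"Host": "en.wikipedia.org", "X-Wiki-Name": "dewiki"}, trusted=False)` falls back to the Host header and ignores the override.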

Where is the code that sends that request? I suppose it's hiding somewhere in mediawiki-services-change-propagation-jobqueue-deploy, but I can't find it.

Why do you think that endpoint should be removed first? I think the order doesn't matter...

I agree the order doesn't matter, but only if the time between them is small. Mainly I just want us to commit to removing it, and asking you to remove it before adding a sixth one is one way of making sure of that. Once we add the new one, the incentive kinda falls and drifts away. The idea is that if we believe it's easy to do and won't take long, perhaps we should just do it. If we believe it would not be easy, then that's a reason to be suspicious of a promise that it'll happen "soon".

Notes to self:

Quick summary of a conversation with Timo:

  • the primary protection mechanism should be "off by default, enabled on internal cluster"
  • to prevent third parties from accidentally making this endpoint public, there should be a second line of defense, like a list of IPs
  • signing requests, like SpecialRunJobs does, is a good mechanism as well. As long as the signature is created by MW, it's easy. It's inconvenient to try and generate a signature outside MW.
  • relying on the Host header for picking the wiki that needs to process a job should be fine, but we need to actually start setting this header in changeprop
  • we can probably just use the RunSingleJob REST endpoint that exists in the EventBus extension. Should we move it into core? Third parties are very unlikely to ever need it.
  • we may want to establish best practices for "private" endpoints in general
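
To illustrate the signing point, here is a hedged Python sketch of what a caller outside MW would have to reproduce. It shows the general HMAC/SHA1-over-sorted-parameters idea with an expiry timestamp. The canonicalisation (sorted keys, URL-encoded) and the parameter names are assumptions for this example; a real client would need to match MW's own encoding byte for byte, which is where the inconvenience comes from.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import urlencode


def sign_params(params: dict, secret: str) -> str:
    """HMAC-SHA1 over the parameters, sorted by key and URL-encoded."""
    canonical = urlencode(sorted(params.items()))
    return hmac.new(secret.encode(), canonical.encode(), hashlib.sha1).hexdigest()


def verify_request(params: dict, secret: str, now: Optional[float] = None) -> bool:
    """Check the signature and expiry of an incoming request."""
    supplied = str(params.get("signature", ""))
    unsigned = {k: v for k, v in params.items() if k != "signature"}

    # Reject requests whose expiry is missing, malformed or in the past.
    try:
        expiry = int(unsigned.get("sigexpiry", 0))
    except (TypeError, ValueError):
        return False
    current = time.time() if now is None else now
    if expiry < current:
        return False

    # Constant-time comparison avoids leaking signature prefixes via timing.
    expected = sign_params(unsigned, secret)
    return hmac.compare_digest(expected.encode(), supplied.encode())
```

The sender would compute `sign_params` over everything except the signature itself and attach the result as the `signature` parameter.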

Change #1033227 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] REST: add support for private modules

https://gerrit.wikimedia.org/r/1033227

@daniel

  • The patches under T365752 add generic support for private modules as a new feature in REST. It seems non-trivial to maintain and support in a generic way. A simpler approach might be a feature flag specific to JobRunner, in which case one could communicate disablement via an HTTP status code from RunSingleJob, rather than by being absent from the registry. This might make sense since that way the manifest isn't variable by user, which would risk/complicate internals around caching, and e.g. generating Swagger specs, as we'd need to be careful not to let the cache be infected by a jobrunner. For special pages and API modules, we don't vary their registration by permission. The pages that are installed are always existent, but they may return a permission error. Anyway, more than the registry aspect pro/con, my main question here is about the cost/benefit of an early abstraction now (which delays the main work of this task, and means there will be multiple moving pieces when we deploy/test this) vs later if/when there is more than 1 user for it (which would contain the business logic in a single class, and allow us to move more rapidly).
  • The patches also introduce support for signed headers. Is that for the RunSingleJob endpoint, as replacement for the current one in production? I thought that the feature flag for private modules (only on for jobrunners in wmf-config) combined with network access check (just in case) would suffice. But maybe this is for RunJobs (which WMF doesn't use) rather than RunSingleJob (which I thought this task is mainly about). I very much like the idea of replacing Special:RunJobs with a (non-private) REST replacement with signed headers. Is that planned as part of this task?

But maybe this is for RunJobs (which WMF doesn't use) rather than RunSingleJob (which I thought this task is mainly about). I very much like the idea of replacing Special:RunJobs with a (non-private) REST replacement with signed headers. Is that planned as part of this task?

Yes

A simpler approach might be a feature flag specific to JobRunner, in which case one could communicate disablement via an HTTP status code from RunSingleJob, rather than by being absent from the registry.

That would be the use case of the trivial "allowed" protection mode mentioned in T365752. The module would be registered but disabled per config.
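
A small Python sketch of the "registered but disabled per config" idea, for illustration only. The class, the route path and the choice of 403 as the status code are hypothetical and not taken from the patches. The point is that the set of routes is the same on every host, and only the response differs.

```python
from dataclasses import dataclass


@dataclass
class Response:
    status: int
    body: str


class RunSingleJobHandler:
    """Hypothetical handler that is always registered but may be disabled."""

    def __init__(self, enabled: bool = False):
        # Off by default; only an internal cluster would turn this on.
        self.enabled = enabled

    def execute(self, job: dict) -> Response:
        if not self.enabled:
            # Disablement is signalled via the status code (403 is an
            # assumption), not by the route being absent.
            return Response(403, "job execution is disabled on this host")
        if "type" not in job:
            return Response(400, "missing job type")
        return Response(200, "ran " + str(job["type"]))


def build_registry(job_runner_enabled: bool) -> dict:
    """The route table does not vary with the flag, so it is safe to cache."""
    return {"/job/execute": RunSingleJobHandler(enabled=job_runner_enabled)}
```

Because `build_registry(True)` and `build_registry(False)` expose the same paths, anything derived from the route list (such as a generated spec) would look the same on jobrunners and on public app servers.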