
Provide a method for internal services to run api requests for private wikis
Open, High, Public

Description

As a developer of an application supporting MediaWiki, I need to be able to make read-only api calls against private wikis.

The CirrusSearch streaming updater, due to replace the existing job-based search updater, needs to perform a specific api call to collect the information that needs to be indexed into the search engine. For the majority of wikis, such as mediawiki.org, we could query the public api. But that would fail on the private wikis.

As an additional constraint, our current expectation is that these api calls will terminate in the job runner cluster, since they are going to replace jobs that exist today and this will keep the load in the same place. Future work with mediawiki-on-k8s may change the exact cluster we submit to. This is strictly due to scaling concerns: the existing updater runs hundreds of jobs per second, and scaling down the job runners while scaling up a different cluster is currently a bit complicated, but will hopefully become easier with k8s soon.

Some options:

  • Some sort of global bot account that has read access everywhere?
  • A new endpoint in mediawiki-config, perhaps /rpc/read-api.php, that is only exposed internally and allows any read only api call.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel triaged this task as High priority. Sep 4 2023, 3:18 PM
Gehel moved this task from needs triage to elastic / cirrus on the Discovery-Search board.
Gehel moved this task from elastic / cirrus to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.

I would maybe create a small authentication extension that uses PHP configuration to define a map of bearer token => (username, MWRestrictions, grants[]).
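For illustration, such a configuration map might look roughly like this (the variable name, keys, and values below are hypothetical, not an existing setting):

  // Hypothetical config: one entry per internal consumer, mapping a pre-shared
  // bearer token to a username, MWRestrictions-style network limits, and grants.
  $wgInternalApiConsumers = [
      [
          'token' => getenv( 'CIRRUS_UPDATER_TOKEN' ), // injected secret, never committed
          'user' => 'CirrusSearch streaming updater',
          'restrictions' => [ 'IPAddresses' => [ '10.64.0.0/12' ] ],
          'grants' => [ 'basic' ], // read-only style grants
      ],
  ];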

If you want to use the job runner cluster, IMO the cleanest way is to make an API endpoint which schedules jobs.

The NetworkAuth extension would cover the need, though we'd likely want to trim it down a bit for production use.

The NetworkAuth extension would cover the need, though we'd likely want to trim it down a bit for production use.

Let's not, the functionality it offers is simple and the implementation method is questionable (it should be using a session provider). And I really think such a system should use grants.

aaron renamed this task from Provide a method for internal services to run api requests in a private context to Provide a method for internal services to run api requests for private wikis. Oct 25 2023, 5:03 PM

I would maybe create a small authentication extension that uses PHP configuration to define a map of bearer token => (username, MWRestrictions, grants[]).

That seems reasonable. The username would need to be a special one that counts as having high/full rights (thus making grants the limiting factor). I'm reading up on our modern OAuth (2.0) codebase...

The required grants seem to mostly be generic read rights (no need to see deleted/suppressed content) though at production volume (no rate limiting). This might require a hidden grant that does not show up in Special:OAuthConsumerRegistration/SpecialMWOAuth.
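(For reference, core can already keep a grant out of the consumer-facing UI by assigning it to the 'hidden' grant group; a sketch, with a made-up grant name:)

  // Sketch only: define an internal read grant and keep it out of the
  // consumer registration UI by assigning it to the 'hidden' group.
  $wgGrantPermissions['internal-read']['read'] = true;
  $wgGrantPermissionGroups['internal-read'] = 'hidden';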

In any case, I think OAuth-based logic should be used instead of something like NetworkAuth (which predates OAuth).

If you want to use the job runner cluster, IMO the cleanest way is to make an API endpoint which schedules jobs.

The job could take care of the read/write cycle instead of using the prop=cirrusbuilddoc API. I suppose it could also carry the blob of metadata, though that would be bulky in the job queue. Some new grant would have to be created to avoid write action throttling. A higher grant would be needed for an API that lets the client provide the new search metadata, and even more so for an API that lets a client post any generic job type to the queue. The grant would have to be hidden from Special:OAuthConsumerRegistration/SpecialMWOAuth as well.

In terms of non-OAuth approaches... it looks like rpc/RunSingleJob.php is kind of deprecated (e.g. T246371), so adding an rpc/QueueJobs.php script would be going in the wrong direction. A similar idea would be to create a QueueJobsHandler to complement the RunSingleJobHandler REST API. That way, a "recompute and update search index data for page X" job could be POSTed based on the update stream.

I'm reading up on our modern OAuth (2.0) codebase...

Grants and MWRestrictions are in core, you wouldn't have to involve OAuth. If the wiki does have the OAuth extension enabled, you can just use owner-only OAuth 2 consumers out of the box, but most private wikis don't; and even the few that do each have their own database, so making sure they each have the same access token (or using different access tokens for different wikis) would be cumbersome. A new extension can just use configuration instead of the database. (Or I suppose such functionality could easily be added to the OAuth extension as well.)
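A rough illustration of how the core pieces mentioned here could be wired together in such an extension (the configuration lookup is assumed; MWRestrictions is the core class, used here in a simplified way):

  // Sketch: check a request against the IP restrictions configured for a consumer.
  $restrictions = MWRestrictions::newFromArray( [
      'IPAddresses' => [ '10.64.0.0/12' ], // would come from extension configuration
  ] );
  if ( !$restrictions->checkIP( $request->getIP() ) ) {
      // Refuse to provide a session for this request.
  }
  // The configured grants would then limit what the resulting user can do,
  // much like OAuth does for its consumers.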

The required grants seem to mostly be generic read rights (no need to see deleted/suppressed content) though at production volume (no rate limiting). This might require a hidden grant that does not show up in Special:OAuthConsumerRegistration/SpecialMWOAuth.

For MediaWiki's ping limiter, there is a highvolume grant. The API gateway just uses the contents of the access token, so if there is rate limiting to bypass there, you'd have to use JWTs for access tokens, with the same format as the OAuth extension.

If you want to use the job runner cluster, IMO the cleanest way is to make an API endpoint which schedules jobs.

The job could take care of the read/write cycle instead of using the prop=cirrusbuilddoc API. I suppose it could also carry the blob of metadata, though that would be bulky in the job queue. Some new grant would have to be created to avoid write action throttling. A higher grant would be needed for an API that lets the client provide the new search metadata, and even more so for an API that lets a client post any generic job type to the queue. The grant would have to be hidden from Special:OAuthConsumerRegistration/SpecialMWOAuth as well.

I don't think running as jobs is going to help here; one of the goals of shifting the cirrus updates to a streaming application was to avoid the impedance mismatch of implementing multi-cluster replication via the job queue. Important functionality around error handling, retries, and backpressure is much harder to implement and reason about in the job queue. Mediawiki-on-k8s land might be moving fast enough (or our updater deployment moving slowly enough) that we can send the api load directly to an mwapi cluster.

If T246371 was done, then the stream updater could just make a request to run a serialized job (provided in the same request) for each page that needs updating. The job would do the backend work of prop=cirrusbuilddoc and also the search index updates. Though the job would have to be registered in $wgJobClasses, nothing would actually enqueue the jobs. The error/retry logic and ability to curtail concurrency would be controlled by the stream updater rather than the generic job runner logic. Would that avoid the job queue issues?
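A rough sketch of what such a never-enqueued job class might look like (all names here are hypothetical; the real job would do the cirrusbuilddoc and index work):

  // Sketch only: a job type that is registered so RunSingleJob can deserialize
  // and execute it, but that nothing ever pushes onto the job queue.
  class BuildAndIndexSearchDocJob extends Job {
      public function __construct( Title $title, array $params ) {
          parent::__construct( 'buildAndIndexSearchDoc', $title, $params );
      }

      public function run() {
          // 1. Build the search document for $this->title (the work behind prop=cirrusbuilddoc).
          // 2. Write the document to the search index.
          // Returning false surfaces a failure in the HTTP response; the stream
          // updater, not the job queue, decides whether to retry.
          return true;
      }
  }

  // Registered so the executor can instantiate it; never enqueued anywhere.
  $wgJobClasses['buildAndIndexSearchDoc'] = BuildAndIndexSearchDocJob::class;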

I don't know if this would confuse some of the generic enqueue/pop stats for the type or for all jobs in general...though that could be fixed by some exclusion if needed (e.g. for "not actually enqueued jobs").

I'm still not sure that would be better off in a job with mediawiki performing the writes.

If I understand your suggestion, it's that the streaming application would submit a job, the job would produce something to kafka, and the streaming application would pick it up and continue on? In that case: we estimated the size of writing this data to kafka and came to roughly 11 TB, based on an estimate of 300 updates/s and an average response size of 20 kB. That ends up being beyond the capacity of our systems.

If that's not the case, I'm not sure how the response would make its way back into the streaming updater. I would really like to avoid having mediawiki send writes directly to elasticsearch via the job queue. Using the job queue to isolate writes to multiple clusters works in exactly the opposite way from how kafka consumers should operate.

With the job queue we enqueue the same job multiple times, with different target cluster parameters, into a single partitioned topic. cpjobqueue reads that topic and writes the jobs out to a second partitioned topic based on job parameters. This allows the cpjobqueue consumers to take backpressure from the write jobs and slow down writes to a single cluster without slowing down the others. Due to cpjobqueue capacity issues we can't actually isolate all the clusters this way; instead we only isolate the cloud replica away from the production clusters. In a proper kafka-native application we submit a single message to kafka, and each cluster has its own consumer application that stays completely isolated from the other clusters.

I mean that the job would never be enqueued into kafka, it would be executed by specifying the serialized job to RunSingleJobHandler, which runs it on the fly and returns a status code. If retries are enabled, it just changes the response status in some cases, but that endpoint itself does not try to re-enqueue the job (in this case a never-enqueued job). The Job class would decide what success/error info is propagated up to the JSON response. None of this would involve touching kafka. The stream updater would just be POSTing requests to the RunSingleJobHandler to "refresh this page" and reading the response.

Hmm, with the jobs providing a response directly it seems plausible. Jobs avoid the issue with private wikis, and it sounds like this is basically turning a job into a pseudo-api (but not mwapi) request. I suspect that, long term, a solution tied more directly to mwapi once mw-on-k8s is fully operational would be more maintainable, but this approach shouldn't have significant maintenance overheads.

So, without T246371 being done yet, it looks like the current apache config on the jobrunners won't let you use RunSingleJobHandler, but only /rpc/* stuff. So, either:

  • an exception would have to be made for the /w/rest.php/eventbus/v0/internal/job/execute endpoint within modules/profile/templates/mediawiki/jobrunner/site.conf.erb, or
  • rpc/RunSingleJob.php would have to be used temporarily by the search updater and migrated later.
aaron removed aaron as the assignee of this task. Dec 4 2023, 7:04 PM
aaron subscribed.

Change 988737 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] [WIP] Pseudo-Job for building documents

https://gerrit.wikimedia.org/r/988737

Change 989253 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] [WIP] Custom session provider to support private wiki reads

https://gerrit.wikimedia.org/r/989253

After reviewing the options here, along with the current state of the mw k8s deployment, it looks like we can drop the requirement to execute via the job runner infrastructure. If that's the case, is running a pseudo-job still the best plan? Some considerations:

  • rpc/RunSingleJob.php is a bit tedious to test with. It only exists within mediawiki-config and is not part of mediawiki proper. It can be hacked into place for dev envs, but adds complications.
  • There is a REST-based job executor available in EventBus, but it doesn't look like it's ever been used in production. In a quick local test it still requires an authenticated user account to run on private wikis (RunSingleJobHandler::needsReadAccess returns true, which causes the REST infra to deny unauthenticated requests).

One option presented earlier was to do something like NetworkAuth, but as a SessionProvider. I've worked up a first draft of this and it seems viable, although I'm not entirely clear on the process through which it creates users (it happens transparently if the returned user does not exist). For demonstration purposes I've submitted the patch to the CirrusSearch repo, but it seems we could create a new extension to hold just this bit, similar to how NetworkAuth was done. Thoughts?
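For readers following along, this is roughly the shape such a session provider takes (a simplified sketch with made-up class, header, and helper names, not the actual patch):

  use MediaWiki\Session\SessionInfo;
  use MediaWiki\Session\SessionProvider;
  use MediaWiki\Session\UserInfo;

  // Sketch only: map a pre-shared header token (plus an allowed-IP check, omitted
  // here) to a system user, similar in spirit to NetworkAuth.
  class ExampleNetworkSessionProvider extends SessionProvider {
      public function provideSessionInfo( WebRequest $request ) {
          $token = $request->getHeader( 'X-Internal-Api-Token' );
          $username = $this->lookupConsumer( $token, $request->getIP() ); // config lookup, not shown
          if ( $username === null ) {
              return null; // let other providers (or anon) handle the request
          }
          return new SessionInfo( SessionInfo::MAX_PRIORITY, [
              'provider' => $this,
              'id' => $this->hashSessionId( $username ),
              // true = "verified"; the user is auto-created if it does not exist yet
              'userInfo' => UserInfo::newFromName( $username, true ),
              'persisted' => false,
              'forceUse' => true,
          ] );
      }

      public function persistsSessionId() { return false; }
      public function canChangeUser() { return false; }
      public function persistSession( \MediaWiki\Session\SessionBackend $session, WebRequest $request ) {}
      public function unpersistSession( WebRequest $request ) {}
  }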

Thoughts?

That sounds like a reasonable plan to me.

Change 988737 abandoned by Ebernhardson:

[mediawiki/extensions/CirrusSearch@master] [WIP] Pseudo-Job for building documents

Reason:

going to go with a custom SessionProvider to handle the login and account auto-creation

https://gerrit.wikimedia.org/r/988737

Change 990241 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/core@master] SessionProvider: add getAutoCreateUserAuthority to interface

https://gerrit.wikimedia.org/r/990241

Change 989253 abandoned by Ebernhardson:

[mediawiki/extensions/CirrusSearch@master] [WIP] Custom session provider to support private wiki reads

Reason:

This will be implemented in a dedicated extension, in Ic36891ab6686dac19b846a1a261e94ae52c43a1b

https://gerrit.wikimedia.org/r/989253

The extension is now written and documented, but still needs to get through code review. It would be worth talking about, though, what level of verification should be applied to the requests. Some options, in order of increasing complexity:

  • NetworkAuth worked strictly off IP addresses.
  • The current NetworkSession implementation uses IP address ranges along with a pre-shared secret (similar-ish to bot passwords) that gets included directly as an HTTP header. This allows multiple users to come from the same IP address ranges. I added this because, in the current use case, the IP addresses will be generic k8s addresses and it seemed we might want to be more specific.
  • Another possibility would be some sort of HMAC over the request parameters, similar to the mediawiki_signature handling in MediaWiki\Extension\EventBus\Rest\EventBodyValidator. This would add some protection against a hostile actor acquiring the token by sniffing traffic and performing arbitrary actions with it. Replay attacks that repeat the full request at a later date would still be possible.
  • Replay attacks could be further limited by requiring a timestamp to be provided. Verification would check that the timestamp is recent enough (60s?) and include the timestamp in the HMAC (see the sketch after this list).
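A sketch of what that last variant could look like on the receiving side (the header names, the 60-second window, and the signed payload are illustrative choices, not the actual implementation):

  // Sketch only: verify an HMAC over the timestamp and raw request content,
  // and reject requests whose timestamp falls outside a short freshness window.
  function verifySignedRequest( WebRequest $request, string $secret ): bool {
      $timestamp = (int)$request->getHeader( 'X-Internal-Api-Timestamp' );
      $signature = (string)$request->getHeader( 'X-Internal-Api-Signature' );

      // Limit replay to a small window around "now".
      if ( abs( time() - $timestamp ) > 60 ) {
          return false;
      }

      // Sign the timestamp together with the request content so neither can be
      // swapped out independently.
      $payload = $timestamp . "\n" . $request->getRawQueryString() . "\n" . $request->getRawPostString();
      $expected = hash_hmac( 'sha256', $payload, $secret );

      return hash_equals( $expected, $signature );
  }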

In general the token shouldn't be crossing networks unencrypted, since endpoints are secured with TLS. Perhaps the token could be sniffed by monitoring the unencrypted localhost traffic into/out of envoy. Even if a hostile actor is able to construct their own valid requests, they will still be limited by the available user rights. The current expectation is that we will limit the available rights to read only.

A token is the same level of protection we have for OAuth 2, and that's used all over the place, sometimes with fairly extensive privileges, so I don't think it's worth going beyond that. It's especially not worth it if the grants are going to be very limited. (You should run it by Security, though.)

Change 990241 merged by jenkins-bot:

[mediawiki/core@master] SessionProvider: add ability to declare maintenance users

https://gerrit.wikimedia.org/r/990241