
Provide a method for internal services to run api requests for private wikis
Open, High, Public

Description

As a developer of an application supporting MediaWiki, I need to be able to make read-only api calls against private wikis.

The CirrusSearch streaming updater, due to replace the existing job-based search updater, needs to perform a specific api call to collect the information that needs to be indexed into the search engine. For the majority of wikis, such as mediawiki.org, we could query the public api. But that would fail on the private wikis.

As an additional constraint, our current expectation is that these api calls will terminate in the job runner cluster, since they are going to replace jobs that exist today and this will keep the load in the same place. Future work with mediawiki-on-k8s may change the exact cluster we submit to. This is strictly due to scaling concerns: the existing updater runs hundreds of jobs per second, and scaling down the job runners while scaling up a different cluster is currently a bit complicated, but will hopefully become easier with k8s soon.

Some options:

  • Some sort of global bot account that has read access everywhere?
  • A new endpoint in mediawiki-config, perhaps /rpc/read-api.php, that is only exposed internally and allows any read only api call.

Event Timeline

Restricted Application added a subscriber: Aklapper.
Gehel triaged this task as High priority. Sep 4 2023, 3:18 PM
Gehel moved this task from needs triage to elastic / cirrus on the Discovery-Search board.
Gehel moved this task from elastic / cirrus to Current work on the Discovery-Search board.
Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search.

I would maybe create a small authentication extension that uses PHP configuration to define a map of bearer token => (username, MWRestrictions, grants[]).
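For illustration, such a configuration map might look roughly like this (the variable name, keys, and values below are hypothetical, not an existing setting):

  // Hypothetical config: one entry per internal consumer, mapping a pre-shared
  // bearer token to a username, MWRestrictions-style network limits, and grants.
  $wgInternalApiConsumers = [
      [
          'token' => getenv( 'CIRRUS_UPDATER_TOKEN' ), // injected secret, never committed
          'user' => 'CirrusSearch streaming updater',
          'restrictions' => [ 'IPAddresses' => [ '10.64.0.0/12' ] ],
          'grants' => [ 'basic' ], // read-only style grants
      ],
  ];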

If you want to use the job runner cluster, IMO the cleanest way is to make an API endpoint which schedules jobs.

The NetworkAuth extension would cover the need, though we'd likely want to trim it down a bit for production use.

The NetworkAuth extension would cover the need, though we'd likely want to trim it down a bit for production use.

Let's not, the functionality it offers is simple and the implementation method is questionable (it should be using a session provider). And I really think such a system should use grants.

aaron renamed this task from Provide a method for internal services to run api requests in a private context to Provide a method for internal services to run api requests for private wikis. Oct 25 2023, 5:03 PM

I would maybe create a small authentication extension that uses PHP configuration to define a map of bearer token => (username, MWRestrictions, grants[]).

That seems reasonable. The username would need to be a special one that counts as having high/full rights (thus making grants the limiting factor). I'm reading up on our modern OAuth (2.0) codebase...

The required grants seem to mostly be generic read rights (no need to see deleted/suppressed content) though at production volume (no rate limiting). This might require a hidden grant that does not show up in Special:OAuthConsumerRegistration/SpecialMWOAuth.
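(For reference, core can already keep a grant out of the consumer-facing UI by assigning it to the 'hidden' grant group; a sketch, with a made-up grant name:)

  // Sketch only: define an internal read grant and keep it out of the
  // consumer registration UI by assigning it to the 'hidden' group.
  $wgGrantPermissions['internal-read']['read'] = true;
  $wgGrantPermissionGroups['internal-read'] = 'hidden';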

In any case, I think OAuth-based logic should be used instead of something like NetworkAuth (which predates OAuth).

If you want to use the job runner cluster, IMO the cleanest way is to make an API endpoint which schedules jobs.

The job could take care of the read/write cycle instead of using the prop=cirrusbuilddoc API. I suppose it could also carry the blob of metadata, though that would be bulky in the job queue. Some new grant would have to be created to avoid write action throttling. A higher grant would be needed for an API that lets the client provide the new search metadata, and even more so for an API that lets a client post any generic job type to the queue. The grant would have to be hidden from Special:OAuthConsumerRegistration/SpecialMWOAuth as well.

In terms of non-OAuth approaches... it looks like rpc/RunSingleJob.php is kind of deprecated (e.g. T246371), so adding an rpc/QueueJobs.php script would be going in the wrong direction. A similar idea would be to create a QueueJobsHandler to complement the RunSingleJobHandler REST API. That way, a "recompute and update search index data for page X" job could be POSTed based on the update stream.

I'm reading up on our modern OAuth (2.0) codebase...

Grants and MWRestrictions are in core, you wouldn't have to involve OAuth. If the wiki does have the OAuth extension enabled, you can just use owner-only OAuth 2 consumers out of the box, but most private wikis don't; and even the few that do each have their own database, so making sure they each have the same access token (or using different access tokens for different wikis) would be cumbersome. A new extension can just use configuration instead of the database. (Or I suppose such functionality could easily be added to the OAuth extension as well.)
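A rough illustration of how the core pieces mentioned here could be wired together in such an extension (the configuration lookup is assumed; MWRestrictions is the core class, used here in a simplified way):

  // Sketch: check a request against the IP restrictions configured for a consumer.
  $restrictions = MWRestrictions::newFromArray( [
      'IPAddresses' => [ '10.64.0.0/12' ], // would come from extension configuration
  ] );
  if ( !$restrictions->checkIP( $request->getIP() ) ) {
      // Refuse to provide a session for this request.
  }
  // The configured grants would then limit what the resulting user can do,
  // much like OAuth does for its consumers.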

The required grants seem to mostly be generic read rights (no need to see deleted/suppressed content) though at production volume (no rate limiting). This might require a hidden grant that does not show up in Special:OAuthConsumerRegistration/SpecialMWOAuth.

For MediaWiki's ping limiter, there is a highvolume grant. The API gateway just uses the contents of the access token, so if there is rate limiting to bypass there, you'd have to use JWTs for access tokens, with the same format as the OAuth extension.

If you want to use the job runner cluster, IMO the cleanest way is to make an API endpoint which schedules jobs.

The job could take care of the read/write cycle instead of using the prop=cirrusbuilddoc API. I suppose it could also carry the blob of metadata, though that would be bulky in the job queue. Some new grant would have to be created to avoid write action throttling. A higher grant would be needed for an API that lets the client provide the new search metadata, and even more so for an API that lets a client post any generic job type to the queue. The grant would have to be hidden from Special:OAuthConsumerRegistration/SpecialMWOAuth as well.

I don't think running as jobs is going to help here; one of the goals of shifting the cirrus updates to a streaming application was to avoid the impedance mismatch of implementing multi-cluster replication via the job queue. Important functionality around error handling, retries, and backpressure is much harder to implement and reason about in the job queue. Mediawiki-on-k8s land might be moving fast enough (or our updater deployment moving slowly enough) that we can send the api load directly to an mwapi cluster.

If T246371 was done, then the stream updater could just make a request to run a serialized job (provided in the same request) for each page that needs updating. The job would do the backend work of prop=cirrusbuilddoc and also the search index updates. Though the job would have to be registered in $wgJobClasses, nothing would actually enqueue the jobs. The error/retry logic and ability to curtail concurrency would be controlled by the stream updater rather than the generic job runner logic. Would that avoid the job queue issues?
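A rough sketch of what such a never-enqueued job class might look like (all names here are hypothetical; the real job would do the cirrusbuilddoc and index work):

  // Sketch only: a job type that is registered so RunSingleJob can deserialize
  // and execute it, but that nothing ever pushes onto the job queue.
  class BuildAndIndexSearchDocJob extends Job {
      public function __construct( Title $title, array $params ) {
          parent::__construct( 'buildAndIndexSearchDoc', $title, $params );
      }

      public function run() {
          // 1. Build the search document for $this->title (the work behind prop=cirrusbuilddoc).
          // 2. Write the document to the search index.
          // Returning false surfaces a failure in the HTTP response; the stream
          // updater, not the job queue, decides whether to retry.
          return true;
      }
  }

  // Registered so the executor can instantiate it; never enqueued anywhere.
  $wgJobClasses['buildAndIndexSearchDoc'] = BuildAndIndexSearchDocJob::class;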

I don't know if this would confuse some of the generic enqueue/pop stats for the type or for all jobs in general...though that could be fixed by some exclusion if needed (e.g. for "not actually enqueued jobs").

I'm still not sure that would be better off in a job with mediawiki performing the writes.

If I understand your suggestion, it's that the streaming application would submit a job, the job would produce something to kafka, and the streaming application would pick it up and continue on? In that case: we estimated the size of writing this data to kafka and came to roughly 11 TB, based on an estimate of 300 updates/s and an average response size of 20 kB. That ends up being beyond the capacity of our systems.

If that's not the case, I'm not sure how the response would make its way back into the streaming updater. I would really like to avoid having mediawiki send writes directly to elasticsearch via the job queue. Using the job queue to isolate writes to multiple clusters works in exactly the opposite way from how kafka consumers should operate.

With the job queue we enqueue the same job multiple times, with different target cluster parameters, into a single partitioned topic. cpjobqueue reads that topic and writes the jobs out to a second partitioned topic based on job parameters. This allows the cpjobqueue consumers to take backpressure from the write jobs and slow down writes to a single cluster without slowing down the others. Due to cpjobqueue capacity issues we can't actually isolate all the clusters this way; instead we only isolate the cloud replica away from the production clusters. In a proper kafka-native application we submit a single message to kafka, and each cluster has its own consumer application that stays completely isolated from the other clusters.

I mean that the job would never be enqueued into kafka, it would be executed by specifying the serialized job to RunSingleJobHandler, which runs it on the fly and returns a status code. If retries are enabled, it just changes the response status in some cases, but that endpoint itself does not try to re-enqueue the job (in this case a never-enqueued job). The Job class would decide what success/error info is propagated up to the JSON response. None of this would involve touching kafka. The stream updater would just be POSTing requests to the RunSingleJobHandler to "refresh this page" and reading the response.

Hmm, with the jobs providing a response directly it seems plausible. Jobs avoid the issue with private wikis, and it sounds like this is basically turning a job into a pseudo-api (but not mwapi) request. I suspect that, long term, a solution tied more directly to mwapi once mw-on-k8s is fully operational would be more maintainable, but this approach shouldn't have significant maintenance overheads.

So, without T246371 being done yet, it looks like the current apache config on the jobrunners won't let you use RunSingleJobHandler, but only /rpc/* stuff. So, either:

  • an exception would have to be made for the /w/rest.php/eventbus/v0/internal/job/execute endpoint within modules/profile/templates/mediawiki/jobrunner/site.conf.erb, or
  • rpc/RunSingleJob.php would have to be used temporarily by the search updater and migrated later.
aaron removed aaron as the assignee of this task. Dec 4 2023, 7:04 PM
aaron subscribed.

Change 988737 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] [WIP] Pseudo-Job for building documents

https://gerrit.wikimedia.org/r/988737

Change 989253 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/extensions/CirrusSearch@master] [WIP] Custom session provider to support private wiki reads

https://gerrit.wikimedia.org/r/989253

After reviewing the options here, along with the current state of the mw k8s deployment, it looks like we can drop the requirement to execute via the job runner infrastructure. If that's the case, is running a pseudo-job still the best plan? Some considerations:

  • rpc/RunSingleJob.php is a bit tedious to test with. It only exists within mediawiki-config and is not part of mediawiki proper. It can be hacked into place for dev envs, but adds complications.
  • There is a REST-based job executor available in EventBus, but it doesn't look like it's ever been used in production. In a quick local test it still requires an authenticated user account to run on private wikis (RunSingleJobHandler::needsReadAccess returns true, which causes the REST infra to deny unauthenticated requests).

One option presented earlier was to do something like NetworkAuth, but as a SessionProvider. I've worked up a first draft of this and it seems viable, although I'm not entirely clear on the process through which it creates users (it happens transparently if the returned user does not exist). For demonstration purposes I've submitted the patch to the CirrusSearch repo, but it seems we could create a new extension to hold just this bit, similar to how NetworkAuth was done. Thoughts?
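For readers following along, this is roughly the shape such a session provider takes (a simplified sketch with made-up class, header, and helper names, not the actual patch):

  use MediaWiki\Session\SessionInfo;
  use MediaWiki\Session\SessionProvider;
  use MediaWiki\Session\UserInfo;

  // Sketch only: map a pre-shared header token (plus an allowed-IP check, omitted
  // here) to a system user, similar in spirit to NetworkAuth.
  class ExampleNetworkSessionProvider extends SessionProvider {
      public function provideSessionInfo( WebRequest $request ) {
          $token = $request->getHeader( 'X-Internal-Api-Token' );
          $username = $this->lookupConsumer( $token, $request->getIP() ); // config lookup, not shown
          if ( $username === null ) {
              return null; // let other providers (or anon) handle the request
          }
          return new SessionInfo( SessionInfo::MAX_PRIORITY, [
              'provider' => $this,
              'id' => $this->hashSessionId( $username ),
              // true = "verified"; the user is auto-created if it does not exist yet
              'userInfo' => UserInfo::newFromName( $username, true ),
              'persisted' => false,
              'forceUse' => true,
          ] );
      }

      public function persistsSessionId() { return false; }
      public function canChangeUser() { return false; }
      public function persistSession( \MediaWiki\Session\SessionBackend $session, WebRequest $request ) {}
      public function unpersistSession( WebRequest $request ) {}
  }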

Thoughts?

That sounds like a reasonable plan to me.

Change 988737 abandoned by Ebernhardson:

[mediawiki/extensions/CirrusSearch@master] [WIP] Pseudo-Job for building documents

Reason:

going to go with a custom SessionProvider to handle the login and account auto-creation

https://gerrit.wikimedia.org/r/988737

Change 990241 had a related patch set uploaded (by Ebernhardson; author: Ebernhardson):

[mediawiki/core@master] SessionProvider: add getAutoCreateUserAuthority to interface

https://gerrit.wikimedia.org/r/990241

Change 989253 abandoned by Ebernhardson:

[mediawiki/extensions/CirrusSearch@master] [WIP] Custom session provider to support private wiki reads

Reason:

This will be implemented in a dedicated extension, in Ic36891ab6686dac19b846a1a261e94ae52c43a1b

https://gerrit.wikimedia.org/r/989253

The extension is now written and documented, but still needs to get through code review. It would be worth talking about, though, what level of verification should be applied to the requests. Some options, in order of increasing complexity:

  • NetworkAuth worked strictly off IP addresses.
  • The current NetworkSession implementation uses IP address ranges along with a pre-shared secret (similar-ish to bot passwords) that gets included directly as an HTTP header. This allows multiple users to come from the same IP address ranges. I added this because, in the current use case, the IP addresses will be generic k8s addresses and it seemed we might want to be more specific.
  • Another possibility would be some sort of HMAC over the request parameters, similar to the mediawiki_signature handling in MediaWiki\Extension\EventBus\Rest\EventBodyValidator. This would add some protection against a hostile actor acquiring the token by sniffing traffic and performing arbitrary actions with it. Replay attacks that repeat the full request at a later date would still be possible.
  • Replay attacks could be further limited by requiring a timestamp to be provided. Verification would check that the timestamp is recent enough (60s?) and include the timestamp in the HMAC (see the sketch after this list).
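A sketch of what that last variant could look like on the receiving side (the header names, the 60-second window, and the signed payload are illustrative choices, not the actual implementation):

  // Sketch only: verify an HMAC over the timestamp and raw request content,
  // and reject requests whose timestamp falls outside a short freshness window.
  function verifySignedRequest( WebRequest $request, string $secret ): bool {
      $timestamp = (int)$request->getHeader( 'X-Internal-Api-Timestamp' );
      $signature = (string)$request->getHeader( 'X-Internal-Api-Signature' );

      // Limit replay to a small window around "now".
      if ( abs( time() - $timestamp ) > 60 ) {
          return false;
      }

      // Sign the timestamp together with the request content so neither can be
      // swapped out independently.
      $payload = $timestamp . "\n" . $request->getRawQueryString() . "\n" . $request->getRawPostString();
      $expected = hash_hmac( 'sha256', $payload, $secret );

      return hash_equals( $expected, $signature );
  }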

In general the token shouldn't be crossing networks unencrypted, since endpoints are secured with TLS. Perhaps the token could be sniffed by monitoring the unencrypted localhost traffic into/out of envoy. Even if a hostile actor is able to construct their own valid requests, they will still be limited by the available user rights. The current expectation is that we will limit the available rights to read only.

A token is the same level of protection we have for OAuth 2, and that's used all over the place, sometimes with fairly extensive privileges, so I don't think it's worth going beyond that. It's especially not worth it if the grants are going to be very limited. (You should run it by Security, though.)

Change 990241 merged by jenkins-bot:

[mediawiki/core@master] SessionProvider: add ability to declare maintenance users

https://gerrit.wikimedia.org/r/990241