Page MenuHomePhabricator

Enable API integration tests to ensure that GET requests will always see the effect of previous POST requests.
Open, NormalPublic8 Story Points

Description

Problem:
A client performs a write action via the API, and wants to perform a subsequent action that requires the first write to have been completed. E.g. 1) create account 2) add groups 3) log in. In this example, the problem is aggravated by the fact that adding user groups requires elevated privileges, so it would have to be done with a logged in client session distinct from the one that does step (3).

Solution 1 (as originally proposed, for context)

Add support for a waitforreplication flag ApiMain. If this flag is set, LBFactory::waitForReplication() is called before the response is sent to the client (unless an error occurred, in which case no wait happens). The response should indicate whether the wait actually happened or not.

To avoid abuse, this behavior can be restricted to POST requests, perhaps only those that also require a token. We could be even more restrictive and require a special permission for this, which can then be restricted to a trusted groups, like the bot and sysop groups.

Solution 2

Instead of a flag for the API, introduce a header field that could be used to make any request to MediaWiki wait until its effects have propagated to all replicas.

Solution 3

Use ChronologyProtector to ensure consistency between the requests. No modification of MediaWiki is necessary.

Background:

The Core Platform Team is working on a suite of API tests. These tests will need to test use cases that span multiple users, with minimal delay between requests, potentially in an environment with replicated databases like the beta cluster. Examples of such use cases to be tested are a user getting a newtalk notification when another user edits their talk page, or a user seeing an edit by another user in their watchlist.

Details

Related Gerrit Patches:

Event Timeline

daniel created this task.Aug 9 2019, 3:55 PM
Restricted Application added a subscriber: Aklapper. · View Herald TranscriptAug 9 2019, 3:55 PM
daniel updated the task description. (Show Details)Aug 9 2019, 3:58 PM

Only for sequences of actions taken by the same user. The use cases I described above are interactions between multiple users.

Anomie moved this task from Unsorted to Needs Code on the MediaWiki-API board.Aug 13 2019, 4:05 PM

To avoid abuse

What sort of abuse would be being avoided? i.e. in what way can a call to LBFactory::waitForReplication() cause problems?

What sort of abuse would be being avoided? i.e. in what way can a call to LBFactory::waitForReplication() cause problems?

Probably nothing too serious. I'm just careful about anything that can force a connection to remain open for a prolonged period of time, especially at times when the servers are busy.

I just realized a complication: deferred updates. When the flag is set, any deferred updates should probably be executed, including ones marked for post-sent, before waiting for replication. Or does that need a separate flag?

Wanting to wait for post-send updates before sending seems like you're introducing a confusing situation. What then if post-send updates are further deferred to the job queue?

Wanting to wait for post-send updates before sending seems like you're introducing a confusing situation. What then if post-send updates are further deferred to the job queue?

That can't really be helped, I guess. If a client wants to wait for the job you to be empty, they can just poll the sites stats until the number of jobs is 0. This doesn't make sense on a live site, but for testing, where the client is the only client, it's useful. But again, the client would want to wait until the changes made by the last job have actually been executed and replicated.

The need is: "I made a change, I now want to wait until all effects of that change are visible to future requests by any client". To my knowledge, this is currently impossible. With a "wait for replication" and "wait for deferred" flag, this would become possible (together with polling the job queue until it's empty). "wait for replication" alone would not be sufficient, especially since the deferred changes will also have to be replicated.

Waiting for replication makes potential sense to me. Waiting for pre-send deferred changes already happens.

But waiting for post-send deferred changes seems highly prone to inconsistency and confusion. MediaWiki's deferred changes mechanisms seem to generally be moving in the opposite direction, towards fewer guarantees on how quickly post-send changes are processed, as we try to reduce the instances of post-send deferred changes being dropped on the floor.

But waiting for post-send deferred changes seems highly prone to inconsistency and confusion. MediaWiki's deferred changes mechanisms seem to generally be moving in the opposite direction, towards fewer guarantees on how quickly post-send changes are processed, as we try to reduce the instances of post-send deferred changes being dropped on the floor.

MediaWiki wants to respond as quickly as possible. For human interaction, this kind of "eventual consistency" is fine. For automated clients, it may be more problematic. For testing, it's terrible - because tests by nature rely on performing an action and immediately trying to verify the results. I don't see how I can write tests for the API if I have no guarantee about at what point the changes I applied should be visible to a subsequent request.

For example, if I make an edit that adds some links to a page, and then try to verify that they are correctly reflected by prop=linkshere, I need to wait until the deferred updates have been processed, and *then* for all replicas to receive the resulting data.

What mechanism do you propose? Just wait and retry? That's going to make slow tests even slower, and doesn't provide any guarantees, just better chances.

You're trying to test something where the behavior isn't guaranteed, it shouldn't be much of a surprise that you're having trouble getting guaranteed behavior.

Rather than an Action API parameter for that use case, I'd suggest you see whether you can make this a feature of your test environment: have it force all deferred changes (API or not) to be run pre-send, plus include a way to directly run all pending jobs and then wait for replication. That may also help you ensure that a post-send deferred change or scheduled job from one test doesn't wind up interfering with the next test somehow.

You're trying to test something where the behavior isn't guaranteed, it shouldn't be much of a surprise that you're having trouble getting guaranteed behavior.

I'm aware of that.
My point is that it would not be hard to provide these guarantees to clients that want them, in return for slower response times. Providing no way for clients to get these guarantees leads to a situation where clients have no choice but to fall into a wait-and-retry pattern, which seems like a bad thing.

Rather than an Action API parameter for that use case, I'd suggest you see whether you can make this a feature of your test environment

By design, the testing environment's only interface to MediaWiki is the API. It's written in JavaScript. I see no way to achieve this without putting code into core. Maybe an extension would work, it could hook into APIAfterExecute and wait for all the things I want to wait for.

That would however mean that we can't reliably run integration tests against core without that extension. Perhaps that's fine for tests. But my feeling is that regular clients run into the same problem, just not frequently enough to complain much about it, or they just wait & retry.

Since links updates being done via the job queue has been the case for a long time, regular clients are probably coded to expect that.

Probably they don't even care, as in my experience the edit is usually the end of the client's process of dealing with a page rather than an intermediate step with further fetching of links data resulting from the edit.

@Anomie: So what would you propose as a way forward for API based testing?

I think an API parameter is probably the wrong way to go.

  • If your testing-extension isn't good enough, for the post-send updates I'd probably look at a custom HTTP header that causes MediaWiki::preOutputCommit() to run all updates rather than just pre-send. That would work the same for the REST API and anything else in the future where you might need that behavior without having to add a confusing parameter everywhere.
  • Plus you'll also need to add some way to wait on the job queue, both for jobs pending at call-time and jobs that those jobs in turn create. Which sounds a lot like T149239: Ensure consistency of secondary data for external consumers, unless you're limiting this to only working on wikis not serving any other traffic as you noted in T230211#5417720.
  • The "wait for replication" bit could probably be included in either or both of those.

Change 530630 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/core@master] (DNM) Allow clients to keep in sync with db updates.

https://gerrit.wikimedia.org/r/530630

Change 530631 had a related patch set uploaded (by Daniel Kinzler; owner: Daniel Kinzler):
[mediawiki/tools/api-testing@master] Add support for X-MW-Sync-Write header.

https://gerrit.wikimedia.org/r/530631

Since links updates being done via the job queue has been the case for a long time, regular clients are probably coded to expect that.

That doesn't appear to be correct. As far as I can tell, LinksUpdate runs post-send as a deferred update. Only cascading updates are posted as jobs.

I think an API parameter is probably the wrong way to go.

  • If your testing-extension isn't good enough, for the post-send updates I'd probably look at a custom HTTP header that causes MediaWiki::preOutputCommit() to run all updates rather than just pre-send. That would work the same for the REST API and anything else in the future where you might need that behavior without having to add a confusing parameter everywhere.

Experimental patch is up, feedback would be welcome.

Tests that write data are not intended to run against busy wikis. Monitoring probes using the same framework would never need to wait for the job queue.

I'll just poll until the job queue is empty. That may still run into trouble when many tests run in parallel, but I don't have a better idea right now.

The primary use cases are running tests against a local dev environment, and running them in CI. The job queue count should work there.

  • The "wait for replication" bit could probably be included in either or both of those.

I included it in the behavior for the proposed X-MW-Sync-Write header.

Krinkle added a subscriber: Krinkle.EditedAug 29 2019, 1:49 PM

Only for sequences of actions taken by the same user. The use cases I described above are interactions between multiple users.

I see, a single API client that acts directly (not a proxy) as multiple users. That seems unlikely for things other than a test framework, but fair nonetheless.

Still, that sounds like a straight-forward case for ChronologyProtector. Before 2015, CP used to be tied to user sessions but not since then (85c0f85e9). It was decoupled for many reasons, but among them to allow standalone services to control their chronology, regardless of whether they are for the same or different users. And also to wait for a user action without having to have its credentials. (e.g. emit an RC event and then a standalone service makes an API request with that cpIndex, doesn't need the credentials for the original editor or smth like that).

I think adding this to core as a synchronous wait would significantly complicate an already very complicated piece of code. It could be inflated to be easier to understand, but doesn't change that it would increase maintenance cost long-term. Best avoided, if possible?

daniel added a comment.EditedAug 29 2019, 3:09 PM

Still, that sounds like a straight-forward case for ChronologyProtector. Before 2015, CP used to be tied to user sessions but not since then (85c0f85e9). It was decoupled for many reasons, but among them to allow standalone services to control their chronology, regardless of whether they are for the same or different users. And also to wait for a user action without having to have its credentials. (e.g. emit an RC event and then a standalone service makes an API request with that cpIndex, doesn't need the credentials for the original editor or smth like that).

So how would that work? Each "user" has their own cookie jar, bound to the user's session on the server. The chronology cookie would have to be shared between them, and updated on every response. I don't see a good way to do that, especially since the JS code may be running multiple requests concurrently (not for things that logically depend on each other obviously, but all clients share the chronology cookie, it would still have to be synced between them somehow).

I think adding this to core as a synchronous wait would significantly complicate an already very complicated piece of code. It could be inflated to be easier to understand, but doesn't change that it would increase maintenance cost long-term. Best avoided, if possible?

Sure, if possible...

I don't think it would add a lot of complexity though. See the POC patch.

Still, that sounds like a straight-forward case for ChronologyProtector. [..]

So how would that work? Each "user" has their own cookie jar, [..]

It's also accepted as request header and query parameter.

This particular use case isn't something we've optimised it for. But, in theory it could be as easy as as giving the requests you want to serialise the same MediaWiki-Chronology-Client-Id header value. We can build towards that if it's not currently that way.

I'd rather do that then introduce another feature separate from it.

[..] complicate an already very complicated piece of code. [..]. Best avoided, if possible?

Sure, if possible...
I don't think it would add a lot of complexity though. See the POC patch.

I can't say it will definitely be safe and never cause issues in prod. Hence, would want it behind a feature flag at least. With CP the client has to wait for a req to come back and then make the next. It enforces that there is a single chain, and breaks it up with natural latency as well. It can't (easily?) be used to make a bunch of parallel requests that would all synchronously lock a server for a while.

CCicalese_WMF triaged this task as Normal priority.Sep 4 2019, 6:29 PM
WDoranWMF set the point value for this task to 8.Sep 4 2019, 6:38 PM
daniel renamed this task from The action API should have a flag that delays the response until all replicas have seen any changes caused by the request to Enable API integration tests to ensure that GET requests will always see the effect of previous POST requests..Sep 4 2019, 8:14 PM
daniel updated the task description. (Show Details)
daniel added a comment.Sep 4 2019, 8:51 PM

This particular use case isn't something we've optimised it for. But, in theory it could be as easy as as giving the requests you want to serialise the same MediaWiki-Chronology-Client-Id header value. We can build towards that if it's not currently that way.

Oh, that sounds nice! I'll give that a try, then.

daniel claimed this task.Sep 6 2019, 11:52 AM
daniel moved this task from Ready to Doing on the Core Platform Team Workboards (Purple) board.

This particular use case isn't something we've optimised it for. But, in theory it could be as easy as as giving the requests you want to serialise the same MediaWiki-Chronology-Client-Id header value. We can build towards that if it's not currently that way.

I have tried this and it doesn't seem to work for me. Here's the relevant code: https://gerrit.wikimedia.org/r/c/mediawiki/tools/api-testing/+/530631/4/actionapi.js#134

Steps to test:
Use CHANGE MASTER TO MASTER_DELAY=1; on the replica to emulate lag (don't forget STOP SLAVE first).
Run ./node_modules/.bin/mocha --timeout 0 test/Categories.js

This passes if there is no lag, but if fails with lag.

In contrast, the test passes with the` X-MW-Sync-Write` patch in core (and PS2 of the api-testing patch).

@Krinkle do you have an idea of how to fix this?

Digging in a bit, the problem seems to be that ChronologyProtector will not work at all with the default setup of MediaWiki, because it ends up using an EmptyBagOStuff for storage. This happens if the 'memStash' field in the load balance config is not set. That config comes from a somewhat cryptic bit of code in MWLBFactory:

if ( $mStash->getQoS( $mStash::ATTR_EMULATION ) > $mStash::QOS_EMULATION_SQL ) {
  $lbConf['memStash'] = $mStash;
}

$mStash is a BagOStuff that is defined by $wgMainStash, which in a vanilla install is 'db-replicated', which turns out to be a ReplicatedBagOStuff containing two SqlBagOStuff instances. Which doesn't pass the test above. And even if it did, it wouldn't work, since it would be subject to the same replication lag it is trying to protect against.

I could make the ChonologyProtector work by setting $wgMainStash = CACHE_ACCEL; However, this didn't seem to result in the desired consistency in the output of the API. In fact, the results are somewhat confusing. I'm still trying to understand what's going on.

In any case, it's confusing that ChronologyProtector is disabled in the default environment, and gets automatically enabled when a "good enough" main stash is configured.

Digging in a bit, the problem seems to be that ChronologyProtector will not work at all with the default setup of MediaWiki, because it ends up using an EmptyBagOStuff for storage. [..]
In any case, it's confusing that ChronologyProtector is disabled in the default environment, and gets automatically enabled when a "good enough" main stash is configured.

Yeah, it is confusing. I'm not sure how this can be changed, though. Open to ideas :)

It's not so much about a "good enough" storage being available.

It's specifically about the storage not being the very database that we're tracking its replication position of. This is a paradox.

We cannot store the last seen offset of the master database, on the master database, and then expect to read from the replica what that position is. The replication that would occur there is the very thing ChronologyProtector helps protect against. When there is only a single database (e.g. local test wiki or CI), this shouldn't be an issue either because there is only one database and thus replica==master, and there is no lag.

If you do have a test setup with replicating databases, you'll need to store the positions outside that DB in order to have CP. Otherwise it wouldn't actually work correctly, and would be as good as not having it turned on, given the replica's view of the last position would always be up-to-date with the position of that same replica and thus be a no-op as well.

It's specifically about the storage not being the very database that we're tracking its replication position of. This is a paradox.

Yea, I figured that out :)

If you do have a test setup with replicating databases, you'll need to store the positions outside that DB in order to have CP. Otherwise it wouldn't actually work correctly, and would be as good as not having it turned on, given the replica's view of the last position would always be up-to-date with the position of that same replica and thus be a no-op as well.

I'm using mediawiki-docker-dev, which has a replicated setup. It also spins up a redis instance, but it doesn't seem to use it. I couldn't figure out how to configure a redis based cache in the docker environment of mediawiki-docker-dev.

I'm using mediawiki-docker-dev, which has a replicated setup. It also spins up a redis instance, but it doesn't seem to use it. I couldn't figure out how to configure a redis based cache in the docker environment of mediawiki-docker-dev.

@daniel there is some documentation here if that helps https://www.mediawiki.org/wiki/MediaWiki-Docker-Dev#Redis

@kostajh ok, thank you! yes, that is very helpful!

daniel added a comment.EditedSep 30 2019, 1:59 PM

So, Chronology Protector appears to do something, but tests still fail when the replica is lagged. Needs more investigation.

Also Chronology Protector does not allow tests to wait for deferred post-send updates to complete. We will need a separate mechanism for that.

Krinkle added a comment.EditedSep 30 2019, 2:13 PM

So, Chronology Protector appears to do something, but tests still fail when the replica is lagged. Needs more investigation.
Also Chronology Protector does not allow tests to wait for deferred post-send updates to complete. We will need a separate mechanism for that.

Would help I think to have a concrete example. Things like inserting RC rows for example, is actually pre-send and thus requires nothing nothing beyond CP.

I'm skeptical of a separate mechanism because I'm not sure what that would assert. What kind of updates do we do post-send that we want the API to reflect between test steps, but not for users in the real world? What are we testing?

If we are not integration-testing the logic of replication and CP, but rather core logic of how updates apply. Then, perhaps it would be easier to not use a replica in this test and/or disable post-send optimisations in MW. Then all you need is serial requests and not even CP is needed.

daniel added a comment.EditedSep 30 2019, 2:17 PM

Then, perhaps it would be easier to not use a replica in this test and/or disable post-send optimisations in MW

Disabling post-send optimisations is exactly what this ticket is about. The question is now whether we want to support this per request, or just globally.

If we use a configuration setting, this would mean that API based tests wouldn't run reliably against a vanilla install.

daniel added a comment.EditedSep 30 2019, 4:23 PM

Things like inserting RC rows for example, is actually pre-send and thus requires nothing nothing beyond CP.

Looking at RecentChange::notifyEdit, I see it calling $rc-save() in a DeferredUpdates::POSTSEND callback. Am I missing something?

Other uses of POSTSEND updates:

  • WikiPage::insertRedirect. Client would have to wait until the redirect becomes effective.
  • User::incEditCount. Can't test the edit count functionality without waiting for the update.
  • WatchedItemStore::updateNotificationTimestamp. Can't test watchlist functionality without waiting for the update.
daniel added a comment.Mon, Nov 4, 4:49 PM

After discussing this with Cindy, we decided to push this back. While we still want a solution that provides some kind of "deterministic mode" for interaction with MediaWiki, this is not a problem for tests in practice if we poll and wait for job execution anyway.