Page MenuHomePhabricator

Misc ChronologyProtector follow-up (Feb 2021)
Closed, ResolvedPublic

Description

Details

  1. Support partial disablement of ChronologyProtector

As part of T212550 we introduced support for setting a MediaWiki-Chronology-Protection=false request header, which would disable the "wait for replication" logic part of ChronologyProtector (whilst still reading and exposing it as touched timestamp, and still writing pos data to the store and reflecting the client ID).

However both the job queue and wikidata use cases have changed to no longer use or need this. Wikidata removed it in T241536. And the job queue stuck to using disableChronologyProtection() to fully disable it instead of just the wait-for part of it.

To consider:

  • Have the header supported but result in disabling CP entirely, thus we can remove the distinction in CP internally. Akin to calling disableChronologyProtection(). No reading for touch timestamps, no saving/reflecting of client IDs in the response.
  • Or drop if we don't want to support it long-term.
  • Or keep as-is.

If we keep it as-is or in simplified form, we should probably document it in the high-level class documentation and perhaps in this task come up with a rationale for when someone would want to do this, and how that's better than any alternative approaches a consumer could utilise (if any).


  1. Support setting clientId via header instead of cookie or query.

The clientId is reflected to a client by default via a cookie or query parameter. However, while we don't advertise or use this anywhere today, we also support supplying it via a MediaWiki-Chronology-Client-Id request header if a client obtained it through other means and/or doesn't want to set a cookie for itself.

To consider:

  • If we know of current or likely use cases where a cookie or query param is not acceptable, then keep it and document the feature as public and stable in the class docs.
  • Or drop it.

  1. Support obtaining past position data even if the client lost its client ID.

We currently obtain position information even if we had to issue a new client ID that hasn't been shared with the client yet. In theory this means if the client hasn't changed their IP address (e.g. not using a mobile, IPv6 VPN, Tor or other volatile IP), but has somehow lost the cookie, then we still try to re-create their client ID, fetch the positions and wait for them if needed.

Assuming this is intentional, I guess this for the cross-origin case where we weren't sure in the past if we could reliably set query params or cookies on subsequent requests. However in the updated multi-dc plan we now require the request to have a sticky DC cookie (or otherwise be routed by wiki farm's traffic layers as a write request) in order to arrive in the DC where the positions were stored.

If we're comfortable going all-in on this, and if I haven't missed another reason for this feature, then we might be able to simplify it so that ChronologyProtector only fetches and waits for positions on startup if it is given a client ID, and otherwise effectively is in "shutdown only" mode where we do store positions and reflect (new) client IDs. This might also help settle point 1 above.


  1. Exposing "offset" (cpPosIndex) separate from "key" (clientId+index+timestamp)

If I understand correctly, the offset stored for the clientId in the CP store mainly served for the purpose of cross-dc replication and waiting for the positions to arrive, before we then wait for the MySQL data associated with those positions to arrive in turn. With the updated model, we no longer need that cross-dc aspect as CP is now only an intra-dc manager.

I wonder if we still need the "index" concept in that case? My gut feeling is yes, as it may help with concurrency so that subsequent requests only wait for the one right before it, so that we don't immediatly overwrite and repurpose a set of positions only. E.g. if you submit request A and B they would get a response with clientId:pos X#1 and X#2 respetively, which seems desirable.

I'm guessing that we can't easily translate that to a simpler storage model, though. E.g. could we use these together as the storage key and let them expire together? That would save the need for locking and merging. On the other hand, if we do that we no longer have a natural place to keep our offset counter, and working around that seems to require at introducing at least as much complexity as we'd save.

What might make sense though, is for the cpPosIndex query parameter to actually reflect the same fully key we use for the cookie. It actually seems like a bug to me that we don't right now. I assume that cross-origin cases right now likely break down if the IP changes (which is why we hand out clientId in the cookie)


  1. Support ChronologyProtector::getTouched()

We currently store a timestamp of when the last write happened and provide it via LBFactory::getChronologyProtectorTouched.

It was introduced in 2016 with change 309410, for use by the FileAnnotations extension (change 309412). The use case was to make use of memcached data where possible, but ignore it if the cache predates the CP touch timestamp. This makes a lot of sense, but the fact that we don't do that anywhere else suggests that perhaps we've since solved this problem by other means. (maybe global keys, check keys, and broadcasted tombstones are relevant here). Note that the FileAnnotations extension was a WMF product experiment and has since been abandoned, so it doesn't matter now.

The method received a breaking change in MW 1.35 as part of change 599478, which means we can somewhat ignore older code anyway as it would need to account for this at least, and as part of that change could then also be migrated to anything else we consider best practices today.

However that breaking change wasn't a coincidence of course, it was to clean up as prep for change 602284, which re-introduced use of this method for PoolWorkArticleView in core for T250248: Fast stale ParserCache responses on PoolCounter contention.

The use is reasonable, it's not hard to support in CP, and I don't see an obvious alternate way to accomplish that there. Again, it just surprises me that we don't do this more often which maybe means the issue is in theory widespread but usually not worth the hassle, or maybe there's another clever way out there that I can't think of right now. @aaron do you see any obvious alternative here? If so, could be another small thing to carve out, but no problem either way I suppose. I just wanted to include it here since we're refactoring it as part of T254634, and I just want to shake it a little and make sure we want to keep it.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript
Krinkle triaged this task as Medium priority.May 13 2022, 7:37 PM

The third one is already done, I would gladly do the rest this quarter. Not sure about the fifth one, needs some thinking. Also thinking of getting rid of server-side storage altogether and simply encrypt the cookie.

Regarding the fourth one. Would it make a difference to wait for the most recent position all the time? Our replication is decently fast and it doesn't really matter.

Krinkle updated the task description. (Show Details)

Change 952146 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Drop partial disablement of CP and setting client id via header

https://gerrit.wikimedia.org/r/952146

Change 952146 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Drop partial disablement of CP and setting client id via header

https://gerrit.wikimedia.org/r/952146

Regarding the fourth one: I started trying to do that but the index piece is used so heavily everywhere that removing it without leaving a mess is not really possible. I'm inclined not to do that.

Regarding the fifth one: I like removing that, ts is not really useful and is basically the same as position, just a different unit. I understand poolcounter needs it and was wondering if we could simply take the LBF's last timestamp of change made (LB::lastPrimaryChangeTimestamp()) and add a bit of time to it (let's say 5seconds?) and completely remove it. But I'm slightly worried it might re-introduce the Michael Jackson effect.

Random observation: The distinction between LBF and CP is not clear, I already moved some methods but still we have methods such as ::appendShutdownCPIndexAsQuery in LBF. Introducing CPFactory might allow us to decouple it properly from LBF (most of db-related parts of CP are done via passing the LB the object in the function call). I'll give this a try.

Change 952318 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Introduce CPFactory to decouple from LBF

https://gerrit.wikimedia.org/r/952318

Change 952353 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Centralize logic behind building ChronologyProtector object

https://gerrit.wikimedia.org/r/952353

Change 952365 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Remove LBF::appendShutdownCPIndexAsQuery()

https://gerrit.wikimedia.org/r/952365

Change 952387 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Rename CP::yieldSessionPrimaryPos to ::getSessionPrimaryPos

https://gerrit.wikimedia.org/r/952387

I understand poolcounter needs it and was wondering if we could simply take the LBF's last timestamp of change made (LB::lastPrimaryChangeTimestamp()) and add a bit of time to it (let's say 5seconds?) and completely remove it. But I'm slightly worried it might re-introduce the Michael Jackson effect.

Scratch that, that didn't make much sense after thinking a bit about it.

Change 953263 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Decouple ChronologyProtector from LBF

https://gerrit.wikimedia.org/r/953263

Change 953264 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Centralize logic behind building ChronologyProtector object

https://gerrit.wikimedia.org/r/953264

Change 953265 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Remove LBF::appendShutdownCPIndexAsQuery()

https://gerrit.wikimedia.org/r/953265

Change 953266 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Rename CP::yieldSessionPrimaryPos to ::getSessionPrimaryPos

https://gerrit.wikimedia.org/r/953266

Change 953263 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Decouple ChronologyProtector from LBF

https://gerrit.wikimedia.org/r/953263

Change 953264 merged by jenkins-bot:

[mediawiki/core@master] Centralize logic behind building ChronologyProtector object

https://gerrit.wikimedia.org/r/953264

Change 953265 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Remove LBF::appendShutdownCPIndexAsQuery()

https://gerrit.wikimedia.org/r/953265

Change 953266 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Rename CP::yieldSessionPrimaryPos to ::getSessionPrimaryPos

https://gerrit.wikimedia.org/r/953266

Change 953965 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Drop LBF::setRequestInfo

https://gerrit.wikimedia.org/r/953965

Change 954259 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Inject CP instead of relying on LBF

https://gerrit.wikimedia.org/r/954259

Change 953965 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Drop LBF::setRequestInfo

https://gerrit.wikimedia.org/r/953965

Change 952353 abandoned by Ladsgroup:

[mediawiki/core@master] Centralize logic behind building ChronologyProtector object

Reason:

Done in I7431dbb52b12437bc chain

https://gerrit.wikimedia.org/r/952353

Change 952318 abandoned by Ladsgroup:

[mediawiki/core@master] rdbms: Introduce CPFactory to decouple from LBF

Reason:

Done in I7431dbb52b12437bc chain

https://gerrit.wikimedia.org/r/952318

Change 954259 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Inject CP instead of relying on LBF

https://gerrit.wikimedia.org/r/954259

Ladsgroup edited projects, added DBA; removed Patch-For-Review.

I think we can call this done. The fourth and fifth ones are not really worth the work and we already made quite a progress in the clean up. Specially, given now much cleaner and simpler LBF.