Page MenuHomePhabricator

Misc ChronologyProtector follow-up (Feb 2021)
Open, Needs TriagePublic

Description

  1. Support partial disablement of ChronologyProtector

As part of T212550 we introduced support for setting a MediaWiki-Chronology-Protection=false request header, which would disable the "wait for replication" logic part of ChronologyProtector (whilst still reading and exposing it as touched timestamp, and still writing pos data to the store and reflecting the client ID).

However both the job queue and wikidata use cases have changed to no longer use or need this. Wikidata removed it in T241536. And the job queue stuck to using disableChronologyProtection() to fully disable it instead of just the wait-for part of it.

To consider:

  • Have the header supported but result in disabling CP entirely, thus we can remove the distinction in CP internally. Akin to calling disableChronologyProtection(). No reading for touch timestamps, no saving/reflecting of client IDs in the response.
  • Or drop if we don't want to support it long-term.
  • Or keep as-is.

If we keep it as-is or in simplified form, we should probably document it in the high-level class documentation and perhaps in this task come up with a rationale for when someone would want to do this, and how that's better than any alternative approaches a consumer could utilise (if any).


  1. Support setting clientId via header instead of cookie or query.

The clientId is reflected to a client by default via a cookie or query parameter. However, while we don't advertise or use this anywhere today, we also support supplying it via a MediaWiki-Chronology-Client-Id request header if a client obtained it through other means and/or doesn't want to set a cookie for itself.

To consider:

  • If we know of current or likely use cases where a cookie or query param is not acceptable, then keep it and document the feature as public and stable in the class docs.
  • Or drop it.

  1. Support obtaining past position data even if the client lost its client ID.

We currently obtain position information even if we had to issue a new client ID that hasn't been shared with the client yet. In theory this means if the client hasn't changed their IP address (e.g. not using a mobile, IPv6 VPN, Tor or other volatile IP), but has somehow lost the cookie, then we still try to re-create their client ID, fetch the positions and wait for them if needed.

Assuming this is intentional, I guess this for the cross-origin case where we weren't sure in the past if we could reliably set query params or cookies on subsequent requests. However in the updated multi-dc plan we now require the request to have a sticky DC cookie (or otherwise be routed by wiki farm's traffic layers as a write request) in order to arrive in the DC where the positions were stored.

If we're comfortable going all-in on this, and if I haven't missed another reason for this feature, then we might be able to simplify it so that ChronologyProtector only fetches and waits for positions on startup if it is given a client ID, and otherwise effectively is in "shutdown only" mode where we do store positions and reflect (new) client IDs. This might also help settle point 1 above.


  1. Exposing "offset" (cpPosIndex) separate from "key" (clientId+index+timestamp)

If I understand correctly, the offset stored for the clientId in the CP store mainly served for the purpose of cross-dc replication and waiting for the positions to arrive, before we then wait for the MySQL data associated with those positions to arrive in turn. With the updated model, we no longer need that cross-dc aspect as CP is now only an intra-dc manager.

I wonder if we still need the "index" concept in that case? My gut feeling is yes, as it may help with concurrency so that subsequent requests only wait for the one right before it, so that we don't immediatly overwrite and repurpose a set of positions only. E.g. if you submit request A and B they would get a response with clientId:pos X#1 and X#2 respetively, which seems desirable.

I'm guessing that we can't easily translate that to a simpler storage model, though. E.g. could we use these together as the storage key and let them expire together? That would save the need for locking and merging. On the other hand, if we do that we no longer have a natural place to keep our offset counter, and working around that seems to require at introducing at least as much complexity as we'd save.

What might make sense though, is for the cpPosIndex query parameter to actually reflect the same fully key we use for the cookie. It actually seems like a bug to me that we don't right now. I assume that cross-origin cases right now likely break down if the IP changes (which is why we hand out clientId in the cookie)


  1. Support ChronologyProtector::getTouched()

We currently store a timestamp of when the last write happened and provide it via LBFactory::getChronologyProtectorTouched.

It was introduced in 2016 with change 309410, for use by the FileAnnotations extension (change 309412. The use case was to make use of memcached data where possible, but ignore it if the cache predates the CP touch timestamp. This makes a lot of sense, but the fact that we don't do that anywhere else suggests that perhaps we've since solved this problem by other means. (maybe global keys, check keys, and broadcasted tombstones are relevant here). Note that the extension was WMF product experiment that has since been abandoned, so it might not matter right now.

The method received a breaking change in MW 1.35 as part of change 599478, which means we can somewhat ignore older code anyway as it would need to account for this at least, and as part of that change could then also be migrated to anything else we consider best practices today.

However that breaking change wasn't a coincidence of course, it was to clean up as prep for change 602284, which introduced new use of the method for PoolWorkArticleView in core for T250248: Fast stale ParserCache responses on PoolCounter contention.

The use is reasonable, it's not hard to support in CP, and I don't see an obvious alternate way to accomplish that there. Again, it just surprises me that we don't do this more often which maybe means the issue is in theory widespread but usually not worth the hassle, or maybe there's another clever way out there that I can't think of right now. @aaron do you see any obvious alternative here? If so, could be another small thing to carve out, but no problem either way I suppose. I just wanted to include it here since we're refactoring it as part of T254634, and I just want to shake it a little and make sure we want to keep it.