Misc ChronologyProtector follow-up (Feb 2021)
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	Krinkle
	Feb 24 2021, 11:40 PM

Description

1. Do we need to support partial disablement via the MediaWiki-Chronology-Protection header?
2. Do we need to support the MediaWiki-Chronology-Client-Id header?
3. Do we need to support cookie loss? Removed: T314434: Avoid ChronologyProtector queries on majory of pageviews that have no recent positions
4. Do we need the nested storage by offset?
5. Do we need ChronologyProtector::getTouched?

Details

Support partial disablement of ChronologyProtector

As part of T212550 we introduced support for setting a MediaWiki-Chronology-Protection=false request header, which would disable the "wait for replication" logic part of ChronologyProtector (whilst still reading and exposing it as touched timestamp, and still writing pos data to the store and reflecting the client ID).

However both the job queue and wikidata use cases have changed to no longer use or need this. Wikidata removed it in T241536. And the job queue stuck to using disableChronologyProtection() to fully disable it instead of just the wait-for part of it.

To consider:

Have the header supported but result in disabling CP entirely, thus we can remove the distinction in CP internally. Akin to calling disableChronologyProtection(). No reading for touch timestamps, no saving/reflecting of client IDs in the response.
Or drop if we don't want to support it long-term.
Or keep as-is.

If we keep it as-is or in simplified form, we should probably document it in the high-level class documentation and perhaps in this task come up with a rationale for when someone would want to do this, and how that's better than any alternative approaches a consumer could utilise (if any).

Support setting clientId via header instead of cookie or query.

The clientId is reflected to a client by default via a cookie or query parameter. However, while we don't advertise or use this anywhere today, we also support supplying it via a MediaWiki-Chronology-Client-Id request header if a client obtained it through other means and/or doesn't want to set a cookie for itself.

To consider:

If we know of current or likely use cases where a cookie or query param is not acceptable, then keep it and document the feature as public and stable in the class docs.
Or drop it.

Support obtaining past position data even if the client lost its client ID.

We currently obtain position information even if we had to issue a new client ID that hasn't been shared with the client yet. In theory this means if the client hasn't changed their IP address (e.g. not using a mobile, IPv6 VPN, Tor or other volatile IP), but has somehow lost the cookie, then we still try to re-create their client ID, fetch the positions and wait for them if needed.

Assuming this is intentional, I guess this for the cross-origin case where we weren't sure in the past if we could reliably set query params or cookies on subsequent requests. However in the updated multi-dc plan we now require the request to have a sticky DC cookie (or otherwise be routed by wiki farm's traffic layers as a write request) in order to arrive in the DC where the positions were stored.

If we're comfortable going all-in on this, and if I haven't missed another reason for this feature, then we might be able to simplify it so that ChronologyProtector only fetches and waits for positions on startup if it is given a client ID, and otherwise effectively is in "shutdown only" mode where we do store positions and reflect (new) client IDs. This might also help settle point 1 above.

Exposing "offset" (cpPosIndex) separate from "key" (clientId+index+timestamp)

If I understand correctly, the offset stored for the clientId in the CP store mainly served for the purpose of cross-dc replication and waiting for the positions to arrive, before we then wait for the MySQL data associated with those positions to arrive in turn. With the updated model, we no longer need that cross-dc aspect as CP is now only an intra-dc manager.

I wonder if we still need the "index" concept in that case? My gut feeling is yes, as it may help with concurrency so that subsequent requests only wait for the one right before it, so that we don't immediatly overwrite and repurpose a set of positions only. E.g. if you submit request A and B they would get a response with clientId:pos X#1 and X#2 respetively, which seems desirable.

I'm guessing that we can't easily translate that to a simpler storage model, though. E.g. could we use these together as the storage key and let them expire together? That would save the need for locking and merging. On the other hand, if we do that we no longer have a natural place to keep our offset counter, and working around that seems to require at introducing at least as much complexity as we'd save.

What might make sense though, is for the cpPosIndex query parameter to actually reflect the same fully key we use for the cookie. It actually seems like a bug to me that we don't right now. I assume that cross-origin cases right now likely break down if the IP changes (which is why we hand out clientId in the cookie)

Support ChronologyProtector::getTouched()

We currently store a timestamp of when the last write happened and provide it via LBFactory::getChronologyProtectorTouched.

It was introduced in 2016 with change 309410, for use by the FileAnnotations extension (change 309412). The use case was to make use of memcached data where possible, but ignore it if the cache predates the CP touch timestamp. This makes a lot of sense, but the fact that we don't do that anywhere else suggests that perhaps we've since solved this problem by other means. (maybe global keys, check keys, and broadcasted tombstones are relevant here). Note that the FileAnnotations extension was a WMF product experiment and has since been abandoned, so it doesn't matter now.

The method received a breaking change in MW 1.35 as part of change 599478, which means we can somewhat ignore older code anyway as it would need to account for this at least, and as part of that change could then also be migrated to anything else we consider best practices today.

However that breaking change wasn't a coincidence of course, it was to clean up as prep for change 602284, which re-introduced use of this method for PoolWorkArticleView in core for T250248: Fast stale ParserCache responses on PoolCounter contention.

The use is reasonable, it's not hard to support in CP, and I don't see an obvious alternate way to accomplish that there. Again, it just surprises me that we don't do this more often which maybe means the issue is in theory widespread but usually not worth the hassle, or maybe there's another clever way out there that I can't think of right now. @aaron do you see any obvious alternative here? If so, could be another small thing to carve out, but no problem either way I suppose. I just wanted to include it here since we're refactoring it as part of T254634, and I just want to shake it a little and make sure we want to keep it.

Details

Subject	Repo	Branch	Lines +/-
rdbms: Inject CP instead of relying on LBF	mediawiki/core	master	+51 -53
rdbms: Introduce CPFactory to decouple from LBF	mediawiki/core	master	+125 -76
Centralize logic behind building ChronologyProtector object	mediawiki/core	master	+100 -75
rdbms: Drop LBF::setRequestInfo	mediawiki/core	master	+0 -4
rdbms: Remove LBF::appendShutdownCPIndexAsQuery()	mediawiki/core	master	+13 -16
rdbms: Rename CP::yieldSessionPrimaryPos to ::getSessionPrimaryPos	mediawiki/core	master	+7 -7
Centralize logic behind building ChronologyProtector object	mediawiki/core	master	+99 -70
rdbms: Decouple ChronologyProtector from LBF	mediawiki/core	master	+118 -127
rdbms: Drop partial disablement of CP and setting client id via header	mediawiki/core	master	+3 -20

Customize query in gerrit

Related Objects
Search...

Status	Assigned	Task
Open	pmiazga	T302623 FY2022-2023: Improve Backend Pageview Timing
Open	Ladsgroup	T343098 [epic] Data Persistence Hypothesis WE 3.2.1
Resolved	Ladsgroup	T275713 Misc ChronologyProtector follow-up (Feb 2021)

Event Timeline

Krinkle created this task.Feb 24 2021, 11:40 PM

Restricted Application added a project: Platform Engineering. · View Herald TranscriptFeb 24 2021, 11:40 PM

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

BPirkle moved this task from Inbox to Tracking/Watching on the Platform Engineering board.Mar 2 2021, 9:19 PM

Krinkle moved this task from Untriaged to Rdbms library on the MediaWiki-libs-Rdbms board.Jun 30 2021, 6:18 PM

Krinkle added a project: Performance-Team.Aug 30 2021, 3:40 AM

Krinkle moved this task from Inbox, needs triage to To-do: Goals, prioritized next 4 Quarters on the Performance-Team board.

Krinkle triaged this task as Medium priority.May 13 2022, 7:37 PM

Krinkle moved this task from To-do: Goals, prioritized next 4 Quarters to Backlog: Maintenance, non-prioritized on the Performance-Team board.Oct 4 2022, 8:31 PM

Krinkle added a parent task: T302623: FY2022-2023: Improve Backend Pageview Timing.

Ladsgroup subscribed.Jan 11 2023, 1:27 AM

Krinkle mentioned this in T254634: Determine and implement multi-dc strategy for ChronologyProtector.Jan 23 2023, 7:27 PM

Krinkle assigned this task to Ladsgroup.Aug 10 2023, 10:22 PM

Krinkle edited projects, added MediaWiki-Platform-Team (Radar); removed Performance-Team, Platform Engineering.

The third one is already done, I would gladly do the rest this quarter. Not sure about the fifth one, needs some thinking. Also thinking of getting rid of server-side storage altogether and simply encrypt the cookie.

Regarding the fourth one. Would it make a difference to wait for the most recent position all the time? Our replication is decently fast and it doesn't really matter.

Krinkle updated the task description. (Show Details)Aug 11 2023, 12:12 AM

Krinkle updated the task description. (Show Details)

Change 952146 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Drop partial disablement of CP and setting client id via header

https://gerrit.wikimedia.org/r/952146

gerritbot added a project: Patch-For-Review.Aug 24 2023, 9:05 AM

Krinkle updated the task description. (Show Details)Aug 24 2023, 11:41 PM

Change 952146 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Drop partial disablement of CP and setting client id via header

https://gerrit.wikimedia.org/r/952146

Maintenance_bot removed a project: Patch-For-Review.Aug 25 2023, 12:10 AM

Ladsgroup added a parent task: T343098: [epic] Data Persistence Hypothesis WE 3.2.1.Aug 25 2023, 6:53 AM

Ladsgroup updated the task description. (Show Details)Aug 25 2023, 7:44 AM

Regarding the fourth one: I started trying to do that but the index piece is used so heavily everywhere that removing it without leaving a mess is not really possible. I'm inclined not to do that.

Regarding the fifth one: I like removing that, ts is not really useful and is basically the same as position, just a different unit. I understand poolcounter needs it and was wondering if we could simply take the LBF's last timestamp of change made (LB::lastPrimaryChangeTimestamp()) and add a bit of time to it (let's say 5seconds?) and completely remove it. But I'm slightly worried it might re-introduce the Michael Jackson effect.

Random observation: The distinction between LBF and CP is not clear, I already moved some methods but still we have methods such as ::appendShutdownCPIndexAsQuery in LBF. Introducing CPFactory might allow us to decouple it properly from LBF (most of db-related parts of CP are done via passing the LB the object in the function call). I'll give this a try.

Change 952318 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Introduce CPFactory to decouple from LBF

https://gerrit.wikimedia.org/r/952318

gerritbot added a project: Patch-For-Review.Aug 25 2023, 9:30 AM

Change 952353 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Centralize logic behind building ChronologyProtector object

https://gerrit.wikimedia.org/r/952353

Change 952365 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Remove LBF::appendShutdownCPIndexAsQuery()

https://gerrit.wikimedia.org/r/952365

Change 952387 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Rename CP::yieldSessionPrimaryPos to ::getSessionPrimaryPos

https://gerrit.wikimedia.org/r/952387

I understand poolcounter needs it and was wondering if we could simply take the LBF's last timestamp of change made (LB::lastPrimaryChangeTimestamp()) and add a bit of time to it (let's say 5seconds?) and completely remove it. But I'm slightly worried it might re-introduce the Michael Jackson effect.

Scratch that, that didn't make much sense after thinking a bit about it.

Change 953263 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Decouple ChronologyProtector from LBF

https://gerrit.wikimedia.org/r/953263

Change 953264 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] Centralize logic behind building ChronologyProtector object

https://gerrit.wikimedia.org/r/953264

Change 953265 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Remove LBF::appendShutdownCPIndexAsQuery()

https://gerrit.wikimedia.org/r/953265

Change 953266 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Rename CP::yieldSessionPrimaryPos to ::getSessionPrimaryPos

https://gerrit.wikimedia.org/r/953266

ReleaseTaggerBot added a project: MW-1.41-notes (1.41.0-wmf.25; 2023-09-05).Aug 30 2023, 2:01 AM

Change 953263 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Decouple ChronologyProtector from LBF

https://gerrit.wikimedia.org/r/953263

Change 953264 merged by jenkins-bot:

[mediawiki/core@master] Centralize logic behind building ChronologyProtector object

https://gerrit.wikimedia.org/r/953264

Change 953265 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Remove LBF::appendShutdownCPIndexAsQuery()

https://gerrit.wikimedia.org/r/953265

Change 953266 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Rename CP::yieldSessionPrimaryPos to ::getSessionPrimaryPos

https://gerrit.wikimedia.org/r/953266

Change 953965 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Drop LBF::setRequestInfo

https://gerrit.wikimedia.org/r/953965

Change 954259 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Inject CP instead of relying on LBF

https://gerrit.wikimedia.org/r/954259

Change 953965 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Drop LBF::setRequestInfo

https://gerrit.wikimedia.org/r/953965

Change 952353 abandoned by Ladsgroup:

[mediawiki/core@master] Centralize logic behind building ChronologyProtector object

Reason:

Done in I7431dbb52b12437bc chain

https://gerrit.wikimedia.org/r/952353

Change 952318 abandoned by Ladsgroup:

[mediawiki/core@master] rdbms: Introduce CPFactory to decouple from LBF

Reason:

Done in I7431dbb52b12437bc chain

https://gerrit.wikimedia.org/r/952318

Change 954259 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Inject CP instead of relying on LBF

https://gerrit.wikimedia.org/r/954259

ReleaseTaggerBot edited projects, added MW-1.41-notes (1.41.0-wmf.26; 2023-09-12); removed MW-1.41-notes (1.41.0-wmf.25; 2023-09-05).Sep 5 2023, 6:00 PM

I think we can call this done. The fourth and fifth ones are not really worth the work and we already made quite a progress in the clean up. Specially, given now much cleaner and simpler LBF.

Ladsgroup moved this task from Triage to Done on the DBA board.Sep 5 2023, 6:24 PM

Krinkle updated the task description. (Show Details)Sep 6 2023, 12:51 AM

Misc ChronologyProtector follow-up (Feb 2021)Closed, ResolvedPublicActions

Description

Details

Details

Related ObjectsSearch...

Event Timeline

Misc ChronologyProtector follow-up (Feb 2021)
Closed, ResolvedPublic
Actions

Related Objects
Search...