
Avoid ChronologyProtector queries on the majority of pageviews that have no recent positions
Closed, Resolved · Public

Description

Prior work:

This inefficiency was identified by @tstarling during T279664, while assessing why a synthetic page view was slower in Codfw than in Eqiad. CP was not the cause, but it did come up as one of the things we do on every page view, and that seems wrong.

Background

As part of my analysis with @aaron at T254634, we arrived at a simplified roll-out plan for Multi-DC. For ChronologyProtector, this mainly meant dropping support for always-on stateless ChronologyProtector. This was effectively done in 2020 already, but is summarised below:

Drop support for protecting chronology for cross-domain redirects where we can neither set a cookie nor a query parameter (except login.wikimedia.org, ref T91820).

We were unable to find real-world examples of this scenario. That is, a scenario where a user 1) performs a database write in the primary DC, but not via a POST request with a sticky-dc cookie, or with such stickiness but followed by a cross-domain request, and 2) the next secondary-DC request happens mechanically or otherwise very rapidly, and 3) is unable to receive a query parameter, and 4) has a hard requirement to see the prior DB write in that subsequent request.

Basically, our other measures have obsoleted this:

  • sticky-dc cookie after POST requests and other DB writes.
  • loginwiki routing, ref T91820.
  • token store, ref T278392.

What remains

Much of the related code has since been removed (including the requirement to replicate the temporary position value across DCs, the need to first wait for the position store and then wait for MySQL replication, and the WaitConditionLoop-based BagOStuff callback for OutputPage).

However, what we forgot is that we still generate a clientId when LBFactory instantiates the ChronologyProtector class, and there's nothing in place to prevent it from treating that as an instruction to check the position store and wait for something that doesn't exist.
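
For illustration, a minimal sketch of that pattern (hypothetical names and simplified derivation; the real logic lives in Wikimedia\Rdbms\ChronologyProtector):

// Minimal sketch, not the actual MediaWiki code: a client ID derived from
// IP + User-Agent leads to a position-store query on every page view, even
// when no cpPosIndex cookie or query parameter was supplied.
function sketchGetStartupPositions( BagOStuff $store, ?string $suppliedClientId, string $ip, string $agent ) {
	// Fallback ID derived purely from request metadata (always available).
	$clientId = $suppliedClientId ?? md5( $ip . "\n" . $agent );

	// Because $clientId is never null, this get() fires on every page view,
	// even though for most anonymous views nothing was ever stored here.
	$key = $store->makeGlobalKey( 'Wikimedia\Rdbms\ChronologyProtector', $clientId, 'v2' );
	$value = $store->get( $key );

	return is_array( $value ) ? $value : false;
}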

From #wikimedia-perf 2022-08-01

<TimStarling> on an anonymous request, MW generates a random session ID and then fetches that session from Kask
<TimStarling> no Set-Cookie header
<TimStarling> after it's done talking to kask for 4ms (in the strace log I made), it connects to nutcracker and asks for ChronologyProtector positions

Event Timeline

Change 819753 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Remove unconditional ChronologyProtector query from pageviews

https://gerrit.wikimedia.org/r/819753

Change 819754 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Remove obsolete cross-dc wait from ChronologyProtector

https://gerrit.wikimedia.org/r/819754

Change 819759 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Remove synchronous pre-send waitForReplication fallback

https://gerrit.wikimedia.org/r/819759

From today's AU chat with @aaron and @tstarling, Aaron suspects there may still be a need for the stateless CP position discovery.

TODO for me: log a message when we fetch positions even though no clientId came from a cookie/query source, and log another message if we found positions in that case (normally we only log when we didn't find positions, but we can temporarily add a warning here for positions found without a supplied clientId).

Then we can let that run for a few days or weeks and identify what if anything might be benefiting or relying on this.
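
Roughly what I have in mind for that logging, as a sketch (the variable names and wording are placeholders, not the eventual patch):

// Sketch only: log when positions are fetched without a supplied client ID,
// and warn if positions were actually found in that case.
// $logger is a PSR-3 logger; $clientIdWasSupplied and $positions are
// hypothetical locals standing in for the real state.
if ( !$clientIdWasSupplied ) {
	$logger->info( 'ChronologyProtector fetched positions without cookie/query clientId' );
	if ( $positions ) {
		// Normally we only log when positions are missing; temporarily also
		// warn when positions are found under a presumed (derived) client ID.
		$logger->warning( 'ChronologyProtector found positions under presumed clientID (T314434)' );
	}
}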

[…]
As expected, there is now an increase in "get" misses on the Grafana: Memcache dashboard, due to the unresolved T314434.

Screenshot 2022-08-19 at 14.00.55.png (534×618 px, 110 KB)

Change 819753 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Remove unconditional ChronologyProtector query from pageviews

https://gerrit.wikimedia.org/r/819753

To keep in mind later:

Duplicate get(): "global:Wikimedia\Rdbms\ChronologyProtector:5492c7bb49467184f4f3a09b4f4e21b4:v2" fetched 2 times

It seems we're sometimes actually fetching the value twice during the same GET request (this was from viewing a diff on enwiki via mwdebug1002). I'll try to reproduce that as part of iterating on the above patch, and then see if we can fix that at the same time.
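
One possible angle for that duplicate fetch, just a sketch of request-scoped memoisation (not necessarily what the eventual fix will look like):

// Hypothetical sketch: remember the result of the first position-store
// lookup for the rest of the request, so a second code path re-uses it
// instead of issuing another memcached get() for the same key.
class StartupPositionMemo {
	private $loaded = false;
	private $positions = null;

	public function get( BagOStuff $store, string $key ) {
		if ( !$this->loaded ) {
			$this->loaded = true;
			$value = $store->get( $key );
			$this->positions = is_array( $value ) ? $value : null;
		}
		return $this->positions;
	}
}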

Change 819754 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Remove obsolete cross-dc wait from ChronologyProtector

https://gerrit.wikimedia.org/r/819754

Change 819759 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Remove synchronous pre-send waitForReplication fallback

https://gerrit.wikimedia.org/r/819759

Change 829187 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Log warning if ChronologyProtector finds data under presumed clientID

https://gerrit.wikimedia.org/r/829187

Change 829187 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Log warning if ChronologyProtector finds data under presumed clientID

https://gerrit.wikimedia.org/r/829187

Change 832624 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Limit CP warning message for T314434 to initial 10s window

https://gerrit.wikimedia.org/r/832624

Change 832624 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Limit CP warning message for T314434 to initial 10s window

https://gerrit.wikimedia.org/r/832624

Analysis from https://logstash.wikimedia.org/goto/95b803b2cb6e4f792bf776190a25ba47

  • Logstash dashboard: mediawiki.
  • Query: channel:DBReplication normalized.message:T314434
  • Filters:
    • servergroup:appserver (ignore API requests, where it's the caller's responsibility to reflect cookies if CP is desired, plus many false positives from cloud hosts where IPs are shared between unrelated processes).
    • NOT url:"action=raw" (static requests don't need CP).
    • AND NOT server:"login.wikimedia.org" AND NOT url:"Special:CentralAutoLogin" (routes to the primary DC as per Multi-DC; to be determined whether it actually needs CP for anything but the first local>loginwiki redirect, where a query param can be used).
    • AND NOT host:mw2* (not sure why it engages in Codfw)
    • AND NOT (clientAgent:*crawler*)
    • AND referer:exists (focus on subsequent page loads within a browsing session)

This leaves us with ~10K log entries per hour. I've not seen any obvious pattern there; they seem like normal page views with a referer and a common user agent from a major browser. I suspect it might have something to do with a cross-wiki write, e.g. where something is written to a different wiki in another tab or via a gadget, so we happen to find data but don't strictly need it for anything in particular.

My hypothesis that it is rare for CP data to be found without a cookie set is incorrect. However, we're no closer to determining whether these scenarios need/want/expect CP to kick in, or why the cookie isn't there.

Change 870733 had a related patch set uploaded (by Krinkle; author: Aaron Schulz):

[mediawiki/core@master] rdbms: clean up LoadBalancer/ChronologyProtector primary pos methods

https://gerrit.wikimedia.org/r/870733

Change 870733 merged by jenkins-bot:

[mediawiki/core@master] rdbms: clean up LoadBalancer/ChronologyProtector primary pos methods

https://gerrit.wikimedia.org/r/870733

Analysis from https://logstash.wikimedia.org/goto/95b803b2cb6e4f792bf776190a25ba47

[…]

My hypothesis that it is rare for CP data to be found without a cookie set is incorrect. However, we're no closer to determining whether these scenarios need/want/expect CP to kick in, or why the cookie isn't there.

Chatting with @aaron about this, there's a way we can whittle this down a bit to better avoid the false positives that are making it hard to tell whether there's a real problem or not.

We currently store in the position store which DB the position relates to, but in that context "DB" refers to the host/cluster, not the individual wiki DB. This means that cross-domain requests still wrongly trigger this warning. E.g. you make a write on one wiki, then send a request for some action=raw JavaScript from Meta-Wiki; that Meta-Wiki request has no cookie on it, but it also doesn't need one, yet the MW side currently warns because it finds the deterministic position-store key for the would-be cookie and decides to wait for replication.

We can (temporarily) store the domain name that we're setting the cookie for in the memc value, and then only emit the warning on subsequent requests if they're actually for the same domain name. When that's not the case, we know the position data is irrelevant. This should also speed up page views for logged-in users, by avoiding replication-lag waits on post-edit page views during cross-wiki requests.
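
Something along these lines; a sketch where the 'domain' field is made up for illustration:

// Sketch only (the 'domain' field name is hypothetical): when writing the
// positions, also record which domain the cpPosIndex cookie was set for,
// and on later cookie-less requests only treat the stored positions as
// relevant if the domain matches.
function sketchPositionsRelevantHere( $storedValue, string $currentDomain ): bool {
	if ( !is_array( $storedValue ) ) {
		return false; // nothing stored, nothing to wait for
	}
	// Cross-domain case described above (e.g. action=raw on Meta-Wiki after
	// an edit on another wiki): same derived client ID, different domain.
	return ( $storedValue['domain'] ?? null ) === $currentDomain;
}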

Change 940987 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Update outdated docs regarding ChronologyProtector

https://gerrit.wikimedia.org/r/940987

I can consistently reproduce it trying to look up the most recent position even on the most benign requests without a cpPosIndex cookie set (e.g. https://logstash.wikimedia.org/goto/64b1b6d20661a6136938daa886ca7aa9). @TK-999 and I debugged this and found what was wrong. The patch is coming; it would remove quite a lot of load from memcached and make a rather decent dent in appserver load and response time (https://performance.wikimedia.org/arclamp/svgs/daily/2023-07-23.excimer-wall.index.svgz?s=Chronology&x=175.4&y=1413 says 0.26%).

Basically, there is no cookie, so clientId is null, which makes us create $this->clientId ourselves. Then we still look up the position in memcached (why? What are we planning to wait for when there is no cookie set?), and while wasting a lot of resources we don't find anything in almost all cases. But given that memcached's TTL is 60 seconds (not 10) and the client ID is only based on IP and user agent, you can easily end up in situations where something does show up from memcached and makes us wait for positions we don't really need to wait for.
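
The direction of the fix, roughly, as a simplified sketch (the actual change may differ in detail):

// Simplified sketch, not the literal diff: bail out before the memcached
// lookup when the request carried no cpPosIndex cookie or query parameter.
function sketchLoadPositions( BagOStuff $store, string $key, ?string $suppliedPosIndex ) {
	if ( $suppliedPosIndex === null ) {
		// Nothing was explicitly handed to us, so there is nothing
		// meaningful to wait for: skip the position store entirely.
		return false;
	}
	$value = $store->get( $key );
	return is_array( $value ) ? $value : false;
}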

I have many more ideas on how to improve this.

Change 941002 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@master] rdbms: Avoid making wasteful memcached calls in CP

https://gerrit.wikimedia.org/r/941002

There are many, many explanations for this volume of logs:

  • Cross-wiki navigation (the most common case from looking at the logs):
    • The timestamps and positions are clustered by section, e.g. keyed s1, s2, DEFAULT, s4, etc. You make an edit on one wiki, then make a request to another wiki in the same section.
    • This is also worse than you think: ExternalStorage clusters also have their positions marshalled, so any edit technically has a chance of showing up somewhere else.
  • CGNATs, school networks with the exact same IP and UA, etc.
  • Time drift: we set the cookie with the expiry in UTC and send it to the browser; the browser then decides based on the system clock whether to send the cookie back. It can think the cookie is expired while it's not.
    • I made mw1452 log the time diff for a minute or two; values like 9.99849s or 9.8603s were not uncommon.
  • Browser containers:
    • If someone edits in a private window and visits in their main browser, that triggers a log entry.
    • Firefox has containers; someone can edit in one container and open an article in another.

For now, I suggest simply removing this; if users complain about issues, we can bring it back. I highly doubt this could lead to any sort of data corruption: the time diffs I collected never went below one second, and our replication lag has consistently been around 200ms.

On top:

  • I honestly find the server-side storage a bit weird; I think it should be doable to eventually migrate everything client-side. I know it might be hard to store all the data in a cookie, but I think we can trim it down.
  • Besides, I think there is a bug in the storage of the position ID in general: $data[self::FLD_WRITE_INDEX] is not set properly. In the logs, the value shows up as 1 or 9 or 2007. A DB position of 9? That's new to me.
  • I honestly find the server-side storage a bit weird; I think it should be doable to eventually migrate everything client-side. I know it might be hard to store all the data in a cookie, but I think we can trim it down.

It's less about the size, and more about not opening up attack scenarios that hold database connections waiting for future positions. The data would need to be encrypted; it could be similar to how we use HMAC here already. The reason we use HMAC for the offset is that the same issue existed a layer up in the past, because ChronologyProtector used to support async replication of the position store (yes, waiting for the position store to replicate data about database replication; anyway, that code is all gone now).

The expectation is that timeouts will always be at least an order of magnitude longer than the time the fast/no-op wait action takes, so we wouldn't want to permit crafting requests that always reach the timeout.
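
For context, a sketch of the kind of integrity check meant here (illustrative only; not the actual cookie format or key handling in MediaWiki):

// Illustrative only: sign client-held data with an HMAC so that clients
// cannot craft arbitrary "future" indexes that would force the server to
// wait until the timeout on every request.
function signClientValue( string $value, string $secret ): string {
	return $value . '@' . hash_hmac( 'sha1', $value, $secret );
}

function verifyClientValue( string $signed, string $secret ) {
	[ $value, $mac ] = array_pad( explode( '@', $signed, 2 ), 2, '' );
	return hash_equals( hash_hmac( 'sha1', $value, $secret ), $mac ) ? $value : null;
}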

  • Besides, I think there is a bug in the storage of the position ID in general: $data[self::FLD_WRITE_INDEX] is not set properly. In the logs, the value shows up as 1 or 9 or 2007. A DB position of 9? That's new to me.

The GTID or binlog positions are stored in the FLD_POSITIONS key. What's stored in FLD_WRITE_INDEX is a local increment, which acts somewhat like a parity bit. It is simply 1 most of the time for servergroup:appserver:

Screenshot 2023-07-25 at 02.13.52.png (452×910 px, 70 KB)

For bot/API sessions, it can easily get up to a few dozen or a few hundred. The log entry with 2007 is an outlier from a bot with the user agent "Java 7.0 to MediaWikiAPI" that seems to be performing rapid requests to /w/api.php?action=purge.

There is only one CP store entry per user; the increment is stored within it, and we overwrite it each time we store the DB positions if there was already an entry there. This way, if you make concurrent requests, we don't wait forever, or for data from (relatively speaking) future requests, but only wait if the stored entry was relevant to the request you're coming from.
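
In code terms, the gating works roughly like this (simplified sketch; the field name 'writeIndex' stands in for FLD_WRITE_INDEX):

// Simplified sketch of how the write index gates the wait: the stored entry
// is only honoured when its index has caught up with the index the client
// was told about (via cpPosIndex), so a stale or unrelated entry under the
// same derived client ID does not cause a wait.
function sketchShouldUseStoredPositions( $storedValue, int $clientClaimedIndex ): bool {
	if ( !is_array( $storedValue ) ) {
		return false;
	}
	$storedIndex = (int)( $storedValue['writeIndex'] ?? 0 ); // stand-in for FLD_WRITE_INDEX
	return $storedIndex >= $clientClaimedIndex;
}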

Change 941002 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Avoid making wasteful memcached calls in CP

https://gerrit.wikimedia.org/r/941002

Change 942000 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.41.0-wmf.19] rdbms: Avoid making wasteful memcached calls in CP

https://gerrit.wikimedia.org/r/942000

Change 942001 had a related patch set uploaded (by Ladsgroup; author: Amir Sarabadani):

[mediawiki/core@wmf/1.41.0-wmf.18] rdbms: Avoid making wasteful memcached calls in CP

https://gerrit.wikimedia.org/r/942001

Change 942001 abandoned by Ladsgroup:

[mediawiki/core@wmf/1.41.0-wmf.18] rdbms: Avoid making wasteful memcached calls in CP

Reason:

wmf.19 is deployed everywhere.

https://gerrit.wikimedia.org/r/942001

Change 942000 merged by jenkins-bot:

[mediawiki/core@wmf/1.41.0-wmf.19] rdbms: Avoid making wasteful memcached calls in CP

https://gerrit.wikimedia.org/r/942000

Mentioned in SAL (#wikimedia-operations) [2023-07-27T11:51:10Z] <ladsgroup@deploy1002> Started scap: Backport for [[gerrit:942000|rdbms: Avoid making wasteful memcached calls in CP (T314434)]]

Mentioned in SAL (#wikimedia-operations) [2023-07-27T11:52:40Z] <ladsgroup@deploy1002> ladsgroup: Backport for [[gerrit:942000|rdbms: Avoid making wasteful memcached calls in CP (T314434)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)

Mentioned in SAL (#wikimedia-operations) [2023-07-27T12:00:05Z] <ladsgroup@deploy1002> Finished scap: Backport for [[gerrit:942000|rdbms: Avoid making wasteful memcached calls in CP (T314434)]] (duration: 08m 54s)

  • I honestly find the server-side storage a bit weird; I think it should be doable to eventually migrate everything client-side. I know it might be hard to store all the data in a cookie, but I think we can trim it down.

It's less about the size, and more about not opening up attack scenarios that hold database connections waiting for future positions. The data would need to be encrypted; it could be similar to how we use HMAC here already. The reason we use HMAC for the offset is that the same issue existed a layer up in the past, because ChronologyProtector used to support async replication of the position store (yes, waiting for the position store to replicate data about database replication; anyway, that code is all gone now).

The expectation is that timeouts will always be at least an order of magnitude longer than the time the fast/no-op wait action takes, so we wouldn't want to permit crafting requests that always reach the timeout.

Maybe we can encrypt it to avoid crafting of future position cookies?

  • Besides, I think there is a bug in the storage of the position ID in general: $data[self::FLD_WRITE_INDEX] is not set properly. In the logs, the value shows up as 1 or 9 or 2007. A DB position of 9? That's new to me.

The GTID or binlog positions are stored in the FLD_POSITIONS key. What's stored in FLD_WRITE_INDEX is a local increment, which acts somewhat like a parity bit. It is simply 1 most of the time for servergroup:appserver:

Screenshot 2023-07-25 at 02.13.52.png (452×910 px, 70 KB)

For bot/API sessions, it can easily get up to a few dozen or a few hundred. The log entry with 2007 is an outlier from a bot with the user agent "Java 7.0 to MediaWikiAPI" that seems to be performing rapid requests to /w/api.php?action=purge.

There is only one CP store entry per user; the increment is stored within it, and we overwrite it each time we store the DB positions if there was already an entry there. This way, if you make concurrent requests, we don't wait forever, or for data from (relatively speaking) future requests, but only wait if the stored entry was relevant to the request you're coming from.

Yup, I saw that, but at the same time we also have this: CP::mergePositions() takes $clientPosIndex by reference and is called from CP::persistSessionReplicationPositions(), so any change gets reflected down the line. CP::persistSessionReplicationPositions() in turn takes that variable as an argument and is called in LBF::shutdownChronologyProtector, with cpIndex being passed by reference as well. That is itself called from LBF::shutdown().

So it goes around; I thought it might have been mistakenly assigned somewhere, but I can't find it. Regardless, the flow of data is not really clear, which I will fix later.
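
To make that by-reference flow concrete, a stripped-down sketch with hypothetical, simplified signatures (not the real method bodies):

// Hypothetical, heavily simplified: shows how the index assigned inside
// mergePositions() propagates back up through by-reference parameters.
function sketchMergePositions( array $storedValue, &$clientPosIndex ): array {
	$clientPosIndex = ( $storedValue['writeIndex'] ?? 0 ) + 1;
	$storedValue['writeIndex'] = $clientPosIndex;
	return $storedValue;
}

function sketchPersistPositions( &$cpIndex ) {
	$stored = [ 'writeIndex' => 1 ]; // pretend this came from the store
	sketchMergePositions( $stored, $cpIndex );
}

function sketchShutdown( &$cpIndex ) {
	// LBF::shutdown() -> shutdownChronologyProtector() -> persistSessionReplicationPositions()
	sketchPersistPositions( $cpIndex );
}

$cpIndex = null;
sketchShutdown( $cpIndex );
// $cpIndex is now 2 in this toy example, and would end up in the cookie.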

Change 940987 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Update outdated docs regarding ChronologyProtector

https://gerrit.wikimedia.org/r/940987

[…] I think there is a bug in storage of the position id in general: $data[self::FLD_WRITE_INDEX] is not set properly. In the logs, the value shows up as 1 or 9 or 2007. db position of 9? That's new to me.

The GTID or binlog positions are stored in the FLD_POSITIONS key. What's stored in FLD_WRITE_INDEX is a local increment, which acts like a parity bit kind of. It is simply 1 most of the time […]

So it goes around, I thought it might have been mistakenly assigned somewhere but I cna't find it, regardless the flow of data is not really clear which I will fix later.

The increment takes place here:

ChronologyProtector.php@mergePositions
		$clientPosIndex = ( $storedValue[self::FLD_WRITE_INDEX] ?? 0 ) + 1;

In other words, if there was already a position store entry during this request, we overwrite it and increment the index to ensure it won't apply to other requests from the same or an older time origin. Otherwise (most cases), there was no value yet, or it has since expired or been evicted, and we initialise it with 1.


Let's measure the impact! The date of the deployment was July 27.

The graphs are from Grafana: Memcache, which plots Prometheus stats from MediaWiki's Memcached cluster.

The ChronologyProtector demand was largely querying for keys that don't exist, so I'll first look at the impact there:

Screenshot 2023-08-02 at 18.55.24.png (400×1 px, 58 KB)

The cache-miss req/s rate went from 31K/s to 17K/s, a nearly 50% drop. On a per-server level, that translates to a drop in miss rate from 1.5-2K/s to now ~0.6K/s.

Screenshot 2023-08-02 at 18.59.01.png (544×648 px, 72 KB)

Naturally, by removing this "miss" demand, the cache-hit ratio for the Memcached service as a whole improves quite a bit. It gave us almost an extra 9, going up from 0.925 to 0.973, an increase of about 5 percentage points (from 92.5% to 97.3%):

Screenshot 2023-08-02 at 19.00.12.png (534×1 px, 99 KB)

Change 819753 abandoned by Krinkle:

[mediawiki/core@master] rdbms: Remove unconditional ChronologyProtector query from pageviews

Reason:

Superseded by https://gerrit.wikimedia.org/r/941002 (If7082f62f2d6cdedb7a505ac68ba79f08634034d)

https://gerrit.wikimedia.org/r/819753

Change 957372 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Minor clean up in ChronologyProtector

https://gerrit.wikimedia.org/r/957372

Change 957372 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Minor clean up in ChronologyProtector

https://gerrit.wikimedia.org/r/957372