Page MenuHomePhabricator

Avoid ChronologyProtector queries on majory of pageviews that have no recent positions
Open, MediumPublic

Description

Prior work:

This inefficiency was identified by @tstarling during T279664 to assess why a synthetic page view was slower in Codfw than Eqiad. While CP was not the cause, it did come up as one of the things we do on each page view, and this seems wrong.

Background

As part of my analysis with @aaron at T254634, we arrived on a simplified roll-out plan for Multi-DC. For ChronologyProtector this was mainly that we dropped support for always-on stateless ChronologyProtector. This was effectively done in 2020 already, but summarised below:

Drop support for protecting chronology for cross-domain redirects where we can neither set a cookie nor a query parameter (except login.wikimedia.org, ref T91820).

We were unable to find real-world examples of this scenario. That is, as scenario where a user 1) performs database write in the primary DC but not a POST request with sticky-dc cookie, or with such stickiness but being followed by a cross-domain request and 2) the next secondary DC request happens mechanically or otherwise in really rapid manner and 3) is unable to receive a query parameter and 4) has a hard requirement to see the prior db write in this subsequent request.

Basically, our other measures have obsoleted this:

  • sticky-dc cookie after POST requests and other DB writes.
  • loginwiki routing, ref T91820.
  • token store, ref T278392.

What remains

Much of the related code for this was all removed (including the requirement to replicate the temporary position value across DCs and to first wait-for the position store and then wait-for MySQL replication, and WaitConditionLoop-based BagOStuff callback for OutputPage).

However what we forgot, is that we still generate a clientId when LBFactory instantiates the ChronologyProtector class, and there's nothing in place to prevent it from seeing that as instruction to check the position store and wait for something that doesn't exist.

From #wikimedia-perf 2022-08-01

<TimStarling> on an anonymous request, MW generates a random session ID and then fetches that session from Kask
<TimStarling> no Set-Cookie header
<TimStarling> after it's done talking to kask for 4ms (in the strace log I made), it connects to nutcracker and asks for ChronologyProtector positions

Event Timeline

Change 819753 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Remove unconditional ChronologyProtector query from pageviews

https://gerrit.wikimedia.org/r/819753

Change 819754 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Remove obsolete cross-dc wait from ChronologyProtector

https://gerrit.wikimedia.org/r/819754

Change 819759 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Remove synchronous pre-send waitForReplication fallback

https://gerrit.wikimedia.org/r/819759

From today's AU chat with @aaron and @tstarling, Aaron suspects there may still be a need for the stateless CP position discovery.

TODO for me: Log a message when we fetch positions when there was no clientId from cookie/query source, and log another message if we found positions in that case (normally we only log if we didn't find positions, but we can temporarily add a warning message here if we found positions without a clientId supplied).

Then we can let that run for a few days or weeks and identify what if anything might be benefiting or relying on this.

[…]
As expected, there is now an increase in "get" misses on Grafana: Memcache dashboard due to the unresolved T314434.

Screenshot 2022-08-19 at 14.00.55.png (534×618 px, 110 KB)

Change 819753 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Remove unconditional ChronologyProtector query from pageviews

https://gerrit.wikimedia.org/r/819753

To keep in mind later:

Duplicate get(): "global:Wikimedia\Rdbms\ChronologyProtector:5492c7bb49467184f4f3a09b4f4e21b4:v2" fetched 2 times

It seems we're sometimes actually fetching the value twice during the same GET request (this was from viewing a diff on enwiki via mwdebug1002). I'll try to reproduce that as part of iterating on the above patch, and then see if we can fix that at the same time.

Change 819754 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Remove obsolete cross-dc wait from ChronologyProtector

https://gerrit.wikimedia.org/r/819754

Change 819759 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Remove synchronous pre-send waitForReplication fallback

https://gerrit.wikimedia.org/r/819759

Change 829187 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Log warning if ChronologyProtector finds data under presumed clientID

https://gerrit.wikimedia.org/r/829187

Change 829187 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Log warning if ChronologyProtector finds data under presumed clientID

https://gerrit.wikimedia.org/r/829187

Change 832624 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] rdbms: Limit CP warning message for T314434 to initial 10s window

https://gerrit.wikimedia.org/r/832624

Change 832624 merged by jenkins-bot:

[mediawiki/core@master] rdbms: Limit CP warning message for T314434 to initial 10s window

https://gerrit.wikimedia.org/r/832624

Analysis from https://logstash.wikimedia.org/goto/95b803b2cb6e4f792bf776190a25ba47

  • Logstash dashboard: mediawiki.
  • Query: channel:DBReplication normalized.message:T314434
  • Filters:
    • servergroup:appserver (ignore API requests where it's caller responsibility to reflect cookies if CP is desired, plus many false positives from clouds where IPs are shared between unrelated processes).
    • NOT url:"action=raw" (static requests don't need CP).
    • AND NOT server:"login.wikimedia.org" AND NOT url:"Special:CentralAutoLogin" (routes to primary DC as per Multi-DC, to be determined if it actually needs CP for all but the first local>loginwiki redirect where query param can be used)
    • AND NOT host:mw2* (not sure why it engages in Codfw)
    • AND NOT (clientAgent:*crawler*)
    • AND referer:exists (focus on subsequent page loads within a browsing session)

This leaves us with ~10K log entries per hour. I've not seen any obvious pattern there, they seem like normal page views with a referer and a common user agent from a major browser. I suspect it might have something to do with a cross-wiki write, e.g. where in another tab or with a gadget something is written to a different wiki and so we happen to find data but don't strictly need it for anything in particular.

My hypothesis that it is rare for CP data to be found without a cookie set, is incorrect. However, we're no closer to determining whether these scenarios need/want/expect CP to kick in, and why it is that the cookie isn't there.

Change 870733 had a related patch set uploaded (by Krinkle; author: Aaron Schulz):

[mediawiki/core@master] rdbms: clean up LoadBalancer/ChronologyProtector primary pos methods

https://gerrit.wikimedia.org/r/870733

Change 870733 merged by jenkins-bot:

[mediawiki/core@master] rdbms: clean up LoadBalancer/ChronologyProtector primary pos methods

https://gerrit.wikimedia.org/r/870733