
Investigate INM Satisfaction alert as of 2021-12-17
Closed, ResolvedPublic

Description

On Thu 16 Dec 2021, the train for 1.38.0-wmf.13 rolled out to all wikis (refs T293954).

Two days later, a Grafana alert started firing for insufficient "ResourceLoader: INM Satisfaction". I missed this at the time, and it has been firing on and off since then. The INM Satisfaction ratio is the percentage of "If-None-Match" requests from end-user web browsers that our CDN is able to answer with a "304 Not Modified" response. After two days of churn following a train deploy, this is usually well above 80%.
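For context, that check boils down to comparing the entity tag the browser sends back in If-None-Match against the E-Tag we would compute for the currently deployed module versions. A minimal illustrative sketch (not the actual Varnish or ResourceLoader code; the function name is made up):

<?php
// Illustrative only: a request counts toward "INM satisfaction" when the
// entity tag the browser sends back still matches the one we would emit now.
function isInmSatisfied( string $ifNoneMatch, string $currentEtag ): bool {
	$normalize = static function ( string $tag ): string {
		// Strip an optional weak-validator prefix ("W/") before comparing.
		$tag = trim( $tag );
		return strncmp( $tag, 'W/', 2 ) === 0 ? substr( $tag, 2 ) : $tag;
	};
	// If-None-Match may carry a comma-separated list of tags.
	foreach ( explode( ',', $ifNoneMatch ) as $tag ) {
		if ( $normalize( $tag ) === $normalize( $currentEtag ) ) {
			return true; // respond with 304 Not Modified
		}
	}
	return false; // versions changed; respond with 200 and fresh content
}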

https://grafana.wikimedia.org/d/000000402/resourceloader-alerts?orgId=1&from=1638316800000&to=1641167999000

Screenshot 2022-01-05 at 16.59.28.png (620×908 px, 70 KB)

After this last train, the ratio never went back above 70%, despite there being no other train deployments for almost a month afterwards. We know that browsers' HTTP caches rarely last longer than two weeks, so it makes no sense that it remains stuck at 70% satisfaction.

The alert panel has a useful info text:

If this ratio is low, it may indicate a problem with cache churn on Varnish or HTTP clients. As of writing (Mar 2019) it is typically well above 80%.

After a deployment that affects most traffic (eg. MW branch to group2, or major backport), then it is normal to drop to below 30%. This is expected. It should take less than 2 days to reach 80%.

As such, it seems likely that the wmf.13 train included a change that causes a module to have an unstable version hash.

Event Timeline

My first suspect, based on past incidents with INM satisfaction and the perflogbot that Ori created long ago (e.g. T181773, T133971), was modules with a high build rate whose versions are flapping. This seems like the most likely, if not only, reason for INM requests to not be met with a 304 Not Modified after several weeks without major deployments. Specifically because this metric does not merely reflect cache-hit ratio, but the hit ratio among browsers that already have a recent copy of the module and then found that copy to be unsatisfactory (as opposed to a cache miss due to not having a relevant entry yet, or not any more, which could alternatively be explained by a change in device or browser).

"Highest avg build rate" graph:
https://grafana.wikimedia.org/d/000000430/resourceloader-modules-overview?viewPanel=34&orgId=1&from=1639262938501&to=1641400091106

Nothing stands out here though: there isn't a module that has risen to a high build rate, not around 17 December, nor anywhere else in this time frame. But a module with a changing version hash doesn't have to have a high build rate. After all, we observe changes in the manifest directly, separately from whether there is demand for the module's content. If a module isn't regularly used, it might not have a high build rate. There is also still good Varnish caching on top of this, which reduces the backend rate a lot.

Next, I'll take a few captures of the startup response from enwiki and compare them after a few hours to see if there are flapping changes. Basically, I re-created perflogbot (source code at P18403).
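The script is essentially a loop that fetches the startup manifest, extracts the registered module versions, and diffs them against the previous capture. A rough sketch of that idea (the real source is at P18403; the URLs, regex-based parsing, and state-file layout here are simplifying assumptions). The captures below are the output of such runs.

<?php
// rlstartupdiff.php (sketch) - compare module versions in the startup
// manifest between runs. Parsing is simplified; see P18403 for the real thing.
$targets = [
	'enwiki-canonical' => 'https://en.wikipedia.org/w/load.php?modules=startup&only=scripts&raw=1&skin=vector',
	'enwiki-mobile' => 'https://en.m.wikipedia.org/w/load.php?modules=startup&only=scripts&raw=1&skin=minerva',
];
foreach ( $targets as $label => $url ) {
	$js = file_get_contents( $url );
	// The startup response registers modules as [ name, version, ... ] tuples.
	preg_match_all( '/\["([^"]+)","([0-9a-z!]+)"/', $js, $matches, PREG_SET_ORDER );
	$current = [];
	foreach ( $matches as $m ) {
		$current[ $m[1] ] = $m[2];
	}
	$stateFile = "/tmp/rlstartupdiff-$label.json";
	$previous = is_readable( $stateFile )
		? json_decode( file_get_contents( $stateFile ), true )
		: [];
	foreach ( $current as $name => $version ) {
		if ( isset( $previous[$name] ) && $previous[$name] !== $version ) {
			echo "[$label] New version: $name `{$previous[$name]}` => `$version`\n";
		}
	}
	file_put_contents( $stateFile, json_encode( $current ) );
}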

5 Jan 18:00 UTC
[enwiki-mobile] New version: mediawiki.rcfilters.filters.ui `1kl4j` => `4wftv`
[enwiki-mobile] New version: ext.discussionTools.ReplyWidget `9ijyw` => `1h3dh`
6 Jan 13:00 UTC
[enwiki-canonical] New version: mediawiki.rcfilters.filters.ui `4i5iw` => `m3fd6`
[enwiki-canonical] New version: ext.discussionTools.ReplyWidget `1kths` => `1ibz4`
[enwiki-mobile] New version: mediawiki.rcfilters.filters.ui `4wftv` => `1cr6w`
[enwiki-mobile] New version: ext.discussionTools.ReplyWidget `1h3dh` => `ik6k7`

[enwiki-canonical] New version: skins.vector.es6 `1x9ns!` => `15h3z!`

[enwiki-canonical] New version: ext.gadget.charinsert-core `dhs1c` => `joicv`

[enwiki-canonical] New version: ext.centralNotice.choiceData `sqmwc` => `1cear`
[enwiki-mobile] New version: ext.centralNotice.choiceData `sqmwc` => `1cear`
6 Jan 15:00 UTC
[enwiki-canonical] New version: ext.discussionTools.ReplyWidget `1ibz4` => `1ik08`
[enwiki-mobile] New version: ext.discussionTools.ReplyWidget `ik6k7` => `16yts`
(Edit) 6 Jan 20:00 UTC
[enwiki-canonical] New version: mediawiki.rcfilters.filters.ui `m3fd6` => `19ah2`
[enwiki-canonical] New version: ext.discussionTools.ReplyWidget `1ik08` => `166m1`
[enwiki-mobile] New version: mediawiki.rcfilters.filters.ui `1cr6w` => `5mtsg`
[enwiki-mobile] New version: ext.discussionTools.ReplyWidget `16yts` => `wvifd`

The entries for mediawiki.rcfilters.filters.ui and ext.discussionTools.ReplyWidget look suspicious, as there were no relevant deployments (code, config, or l10n) and no relevant MediaWiki-namespace edits on enwiki to messages included by these modules.

Both of them do use a package file with a versionCallback, which I'll look into next. cc @kostajh and @matmarex in case you know of a likely cause of fluctuation in these modules.
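For context on the mechanism: a packageFiles entry can declare both a callback (which produces the bundled content) and a versionCallback (a cheaper function whose return value is hashed into the module version). The declaration shape is roughly as follows; the module and hook names here are made up for illustration:

$wgResourceModules['ext.example.widget'] = [
	'localBasePath' => __DIR__,
	'packageFiles' => [
		'init.js',
		[
			'name' => 'data.json',
			// Produces the actual bundled content when the module is requested.
			'callback' => 'ExampleHooks::getWidgetData',
			// Cheaper summary hashed into the module version for the startup
			// manifest; if its return value fluctuates, the version hash churns.
			'versionCallback' => 'ExampleHooks::getWidgetDataSummary',
		],
	],
];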

I see that for the DiscussionTools ReplyWidget, the getTermsOfUseMessagesVersion callback uses page_touched information (possibly inspired by T189229). I ran a query to identify recent null edits, cascading link updates, and other purges happening in the MediaWiki namespace on enwiki (Quarry 40713).
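The Quarry query itself is SQL; an equivalent check from PHP against the page table would look roughly like this (a sketch only, with an arbitrary one-week lookback window):

// Sketch of the kind of lookup behind Quarry 40713: recently "touched"
// pages in the MediaWiki namespace, which would invalidate the overrides.
$dbr = wfGetDB( DB_REPLICA );
$res = $dbr->select(
	'page',
	[ 'page_title', 'page_touched' ],
	[
		'page_namespace' => NS_MEDIAWIKI,
		'page_touched > ' . $dbr->addQuotes( $dbr->timestamp( time() - 7 * 86400 ) ),
	],
	__METHOD__,
	[ 'ORDER BY' => 'page_touched DESC', 'LIMIT' => 50 ]
);
foreach ( $res as $row ) {
	echo "{$row->page_title}\t{$row->page_touched}\n";
}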

Krinkle renamed this task from "Investigate cause of lower ceiling on INM Satisfaction metric as of 2021-12-17" to "Investigate INM Satisfaction alert as of 2021-12-17". Jan 6 2022, 3:07 PM
Krinkle triaged this task as Medium priority.
Krinkle added a project: DiscussionTools.

I have nothing to add, it is probably related to the parsing of messages.

I've also used very similar code for other modules – including 'mediawiki.action.edit.preview' in MediaWiki core, and 'ext.flow.templating' in Flow. I assume they're used infrequently enough that they didn't come up in your analysis. But if that's not the case and these are unaffected by the issue, then maybe there's some significant difference that can explain it.

@matmarex Thanks, that helps rule out some of the common factors. I don't think the usage plays a role in this case since these versions are computed for the startup manifest regardless of when/whether the module itself is requested. One exception is if there is a bug in the regular (non-version) callback causing the versionCallback to fluctuate, in which case usage frequency would play a role.

@Krinkle two questions for you:

  1. Are you able to share what impact the Editing Team might expect this issue to have on the speed with which people see the Reply and New Discussion Tools open?
  2. Would it be accurate for us to think y'all, the Performance Team, are investigating a fix for this issue?

    I ask the above in an effort to understand the impact of this issue and subsequently how, if at all, the Editing Team should prioritize work on this...

Are you able to share what impact the Editing Team might expect this issue to have

@ppelberg The immediate impact for us is that this alert remains firing, and we remain blind to early detection of various other kinds of incidents that would manifest through this metric.

The impact for DiscussionTools is that its software is repeatedly getting re-bundled on our servers and then repeatedly re-downloaded by browsers, e.g. for every talk page session, instead of utilising the browser cache as we normally do. The discarding of browser cache when no code has changed indeed results in slower load times and more bandwidth/energy consumption.

Would it be accurate for us to think y'all, the Performance Team, are investigating a fix for this issue?

Not at this time. I've confirmed and narrowed down the cause to the DT extension, and in my previous comment also highlighted the most likely part within the DT code that causes the continuous cache churn.

Given that no other extensions are affected at this time, and that there are no recent changes in core in this area that I'm aware of, I'm assuming this is not caused by something in the core infrastructure that I can fix. If investigation does show the bug is caused by something faulty in RL that DT makes use of, I'm happy to pick it up again at that point.

Mentioned in SAL (#wikimedia-operations) [2022-01-27T04:01:44Z] <Krinkle> grafana: Temporarily silence resourceloader alert for INM satisfaction ratio, pending T298520.

Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.
Krinkle moved this task from Confirmed Problem to External on the MediaWiki-ResourceLoader board.

Reminder: the INM ratio is still underperforming, down by the same 10-20%.

Screenshot 2022-02-16 at 11.10.59.png (440×900 px, 76 KB)

... and from the same module churning its version hash:

# 11:12 GMT -> 11:17 GMT
[enwiki-mobile] Changed module: ext.discussionTools.ReplyWidget `1yjep` => `1kqcu`

# 11:17 GMT -> 11:29 GMT
[enwiki-canonical] Changed module: mediawiki.rcfilters.filters.ui `149ki` => `14s6a`
[enwiki-canonical] Changed module: ext.discussionTools.ReplyWidget `h521i` => `1drw7`
[enwiki-mobile] Changed module: ext.discussionTools.ReplyWidget `1kqcu` => `114lk`

# 11:29 GMT -> 12:29 GMT
[enwiki-canonical] Changed module: mediawiki.rcfilters.filters.ui `14s6a` => `f9m7p`
[enwiki-canonical] Changed module: ext.discussionTools.ReplyWidget `1drw7` => `3ros9`
[enwiki-mobile] Changed module: mediawiki.rcfilters.filters.ui `154b0` => `nqwiz`
[enwiki-mobile] Changed module: ext.discussionTools.ReplyWidget `114lk` => `dy9e6`

I'm unaware of what might have changed with rcfilters.filters.ui, sorry.

I looked into this today. I'll document some steps here, because I started out having no idea what I was looking for.


I was able to reproduce the problem with the changing versions locally, in the browser console:

$ mw.loader.moduleRegistry['ext.discussionTools.ReplyWidget'].version
'19904'

(refreshed the page)

$ mw.loader.moduleRegistry['ext.discussionTools.ReplyWidget'].version
'z6gwn'

I added debug logging to ResourceLoaderModule::getVersionHash():

diff --git a/includes/resourceloader/ResourceLoaderModule.php b/includes/resourceloader/ResourceLoaderModule.php
index 0db870ead52..e8057c1fc63 100644
--- a/includes/resourceloader/ResourceLoaderModule.php
+++ b/includes/resourceloader/ResourceLoaderModule.php
@@ -912,6 +912,10 @@ abstract class ResourceLoaderModule implements LoggerAwareInterface {
                                $str = json_encode( $summary );
                        }

+                       if ( $this->getName() === 'ext.discussionTools.ReplyWidget' ) {
+                               wfDebugLog( 'xxx', $str );
+                       }
+
                        $this->versionHash[$contextHash] = ResourceLoader::makeHash( $str );
                }
                return $this->versionHash[$contextHash];

Copied the logged data out of the log, prettified and compared:

image.png (2×3 px, 365 KB)


Apparently, the "touched" timestamps of the on-wiki message overrides are changing. This is particularly surprising, because the pages containing the overrides (e.g. "MediaWiki:Wikimedia-discussiontools-replywidget-terms-click" in my case) do not exist.

They are generated by code here: https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/DiscussionTools/+/4613ae78e7eff0b9820973588baa332833ba38b4/includes/ResourceLoaderData.php#129

			$msg = [
				// Include the text of the message, in case the canonical translation changes
				$message->plain(),
				// Include the page touched time, in case the on-wiki override is invalidated
				Title::makeTitle( NS_MEDIAWIKI, ucfirst( $message->getKey() ) )->getTouched(),
			];

I added debug logging to Title::getTouched() and related methods:

diff --git a/includes/Title.php b/includes/Title.php
index 12d81018d5d..857d2359fef 100644
--- a/includes/Title.php
+++ b/includes/Title.php
@@ -3730,6 +3730,8 @@ class Title implements LinkTarget, PageIdentity, IDBAccessObject {
 	 * @return string|false Last-touched timestamp
 	 */
 	public function getTouched( $flags = self::READ_NORMAL ) {
+		wfDebugLog( 'xxx', __METHOD__ );
+		wfDebugLog( 'xxx', $this->getPrefixedText() );
 		if ( is_object( $flags ) ) {
 			wfDeprecatedMsg(
 				__METHOD__ . ' was called with a ' . get_class( $flags )
@@ -3740,6 +3742,7 @@ class Title implements LinkTarget, PageIdentity, IDBAccessObject {
 		}
 
 		$touched = $this->getFieldFromPageStore( 'page_touched', $flags );
+		wfDebugLog( 'xxx', $touched );
 		return MWTimestamp::convert( TS_MW, $touched );
 	}
 
@@ -4102,6 +4105,8 @@ class Title implements LinkTarget, PageIdentity, IDBAccessObject {
 	 * @return string|false
 	 */
 	private function getFieldFromPageStore( $field, $flags ) {
+		wfDebugLog( 'xxx', __METHOD__ );
+		wfDebugLog( 'xxx', $this->getPrefixedText() );
 		$flags |= ( $flags & self::GAID_FOR_UPDATE ) ? self::READ_LATEST : 0; // b/c
 
 		$pageStore = MediaWikiServices::getInstance()->getPageStore();
@@ -4109,7 +4114,7 @@ class Title implements LinkTarget, PageIdentity, IDBAccessObject {
 		if ( !in_array( $field, $pageStore->getSelectFields(), true ) ) {
 			throw new InvalidArgumentException( "Unknown field: $field" );
 		}
-
+		wfDebugLog( 'xxx', $this->mArticleID );
 		if ( $flags === self::READ_NORMAL && $this->mArticleID === 0 ) {
 			// page does not exist
 			return false;
@@ -4122,8 +4127,10 @@ class Title implements LinkTarget, PageIdentity, IDBAccessObject {
 		$page = $pageStore->getPageByReference( $this, $flags );
 
 		if ( $page instanceof PageStoreRecord ) {
+			wfDebugLog( 'xxx', 'return the field' );
 			return $page->getField( $field );
 		} else {
+			wfDebugLog( 'xxx', 'return false' );
 			// page does not exist
 			return false;
 		}

This revealed confusion:

2022-03-01 23:53:38 ubuntu15 mediawiki: Title::getTouched
2022-03-01 23:53:38 ubuntu15 mediawiki: MediaWiki:Wikimedia-discussiontools-replywidget-terms-click
2022-03-01 23:53:38 ubuntu15 mediawiki: Title::getFieldFromPageStore
2022-03-01 23:53:38 ubuntu15 mediawiki: MediaWiki:Wikimedia-discussiontools-replywidget-terms-click
2022-03-01 23:53:38 ubuntu15 mediawiki: -1
2022-03-01 23:53:38 ubuntu15 mediawiki: return false
2022-03-01 23:53:38 ubuntu15 mediawiki:

Several things are not as they should be here:

  • When the page does not exist, $this->mArticleID is -1, while the code in getFieldFromPageStore() expects it to be 0. This is not the cause of the problem though, as the function still ends up returning false.
  • When $touched is false, its value is passed to MWTimestamp::convert(), which treats false as the current time.

Looks like the bug in Title::getTouched() was introduced in this change: rMWfd3a6952303b: Title: use PageStore instead of LinkCache. Previously it returned false for non-existent pages, and that seems to be the intended behavior (it's not stated clearly in the doc comment, but it says that false is a possible return value, and the old code obviously worked that way).
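The fix would then roughly amount to not letting the false value reach MWTimestamp::convert(). Something like this (a sketch of the shape of the fix, not the exact patch that got merged):

public function getTouched( $flags = self::READ_NORMAL ) {
	$touched = $this->getFieldFromPageStore( 'page_touched', $flags );
	// Restore the pre-PageStore behaviour: boolean false for a page that
	// does not exist, instead of letting MWTimestamp::convert() treat
	// false as "now" and thereby churn every derived version hash.
	return $touched === false ? false : MWTimestamp::convert( TS_MW, $touched );
}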

Krinkle added a project: Platform Engineering.

Thanks @matmarex. That's a great find. Handing this over to Daniel with the understanding that Title::getTouched has regressed as a result of T285389: Make Title use PageStore to access the page table, seemingly without a test, and I guess not noticed by anything else until now. Although it's not unlikely that there is more cache churning going on in various places as a result of this, RL is the most likely one to notice it, due to crossing the boundary from server to client.

Change 767546 had a related patch set uploaded (by Daniel Kinzler; author: Daniel Kinzler):

[mediawiki/core@master] Fix Title::getTouched() for non-existing pages.

https://gerrit.wikimedia.org/r/767546

Change 767546 merged by jenkins-bot:

[mediawiki/core@master] title: Restore boolean false Title::getTouched() for nonexistent page

https://gerrit.wikimedia.org/r/767546

Fix is merged. Let's see if the metrics get better when this rides the train next week.

The alert remains in would-be firing range:

Screenshot 2022-04-06 at 17.40.36.png (618×924 px, 76 KB)

When running the same analysis as before, it seems ext.discussionTools.ReplyWidget is probably fixed, but mediawiki.rcfilters.filters.ui remains in heavy churn.

[16:22 BST] $ php rlstartupdiff.php 
...
[16:22 BST] $

[17:38 BST] $ php rlstartupdiff.php 
[enwiki-mobile] Changed module: mediawiki.rcfilters.filters.ui `vlrn3` => `18j7t`
[enwiki-canonical] Changed module: mediawiki.rcfilters.filters.ui `1v6ze` => `nn8lk`
[17:38 BST] $

[17:41 BST] $ php rlstartupdiff.php 
[enwiki-canonical] Changed module: mediawiki.rcfilters.filters.ui `nn8lk` => `okmxc`
[enwiki-mobile] Changed module: mediawiki.rcfilters.filters.ui `18j7t` => `9y43d`
Krinkle added a subscriber: daniel.

The fact that the version hashes differ between platforms in the samples at T298520#7835852 is suspect. Perhaps something is poisoning the cache, where some of the data is e.g. indirectly altered by MobileFrontend for m-dot only and then ends up served on the other end, oscillating based on who wins the race each time its memcached key expires, or based on which server we reach and which platform filled the APCu key first.

w/krinkle.php
<?php
define( 'MW_NO_SESSION', 1 );
require_once __DIR__ . '/../multiversion/MWMultiVersion.php';
require MWMultiVersion::getMediaWiki( 'includes/WebStart.php' );
$ctx = ResourceLoaderContext::newDummyContext();
echo json_encode( ChangesListSpecialPage::getRcFiltersConfigSummary($ctx), JSON_PRETTY_PRINT);
echo "\n";

I served this from mwdebug1002 on test.wikipedia and test.m.wikipedia, and did not find a difference in its response. However, I did find a smoking gun:

{
    "RCFiltersChangeTags": [
        {
            "name": "twinkle",
            "labelMsg": true,
            "label": "(tag-twinkle)",
            "descriptionMsg": true,
            "description": "(tag-twinkle-description)",
            "cssClass": "mw-tag-twinkle",
            "hits": "6245"
        },
        {
            "name": "mobile web edit",
            "labelMsg": true,
            "label": "(tag-mobile web edit)",
            "descriptionMsg": true,
            "description": "(tag-mobile web edit-description)",
            "cssClass": "mw-tag-mobile_web_edit",
            "hits": "6208"
        },
        ..
    ],
    "StructuredChangeFiltersEditWatchlistUrl": "/wiki/Special:EditWatchlist"
}

The hit counter is included in the data from the versionCallback. That seems rather problematic, as it is constantly in flux. Passing back to @kostajh to assess if and where hits is needed. E.g. maybe it's unused, or only used server-side, or we can process whatever the client needs and then make it so that it's only used server-side.
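If it is indeed unused, the fix would conceptually just be dropping the volatile key before the tag data is exported, roughly like this (a sketch, not the actual patch linked below; $tagData is a stand-in name for the structure shown above):

// Sketch: strip the ever-changing usage counter from the exported tag data,
// so the versionCallback summary (and thus the version hash) stays stable.
foreach ( $tagData['RCFiltersChangeTags'] as &$tag ) {
	unset( $tag['hits'] );
}
unset( $tag );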

Change 777841 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/core@master] [WIP] changetags: Fix mediawiki.rcfilters.filters.ui version churn

https://gerrit.wikimedia.org/r/777841

I am pretty sure that the "hits" value is unused. At one point the list would be sorted by the hit count (this was removed in rMWb6f412ea2a9e: RCFilters: Don't call ChangeTags::tagUsageStatistics() for now), which explains why it's there, but the actual value is not used and it seems that it has never been used (or at least I couldn't find any commits that used it). I think we should remove it.

Change 777841 merged by jenkins-bot:

[mediawiki/core@master] changetags: Remove "hits" key from exported RCFilters module data

https://gerrit.wikimedia.org/r/777841

Monitoring confirms that our INM satisfaction ratio has recovered and is now well above the threshold for a healthy state:

https://grafana.wikimedia.org/d/000000402/resourceloader-alerts?orgId=1&from=1651712400000&to=1651930200000

Screenshot 2022-05-07 at 15.36.22.png (682×897 px, 72 KB)

I've also confirmed anecdotally that running rlstartupdiff.php at 15-minute intervals for two hours yielded no results that would indicate cache churn, and indeed no module version changes at all during this time frame today.