
MetricsPlatform: InstrumentConfigFetcher: Make fetching asynchronous
Closed, Resolved · Public · 8 Estimated Story Points

Description

Background

Present

The logic for fetching and pre-processing experiment configs from xLab is located in InstrumentConfigFetcher (herein "the fetcher"). When $wgMetricsPlatformEnableExperimentConfigsFetching is truthy and a logged-in user navigates to a page, the MetricsPlatform extension:

  1. Fetches the experiment configs from the WAN cache
  2. If there was a cache miss in (1), or the cached value needs to be regenerated, fetches the experiment configs from xLab and updates the WAN cache
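The two steps above are a classic cache-aside pattern. A minimal Python sketch of the current synchronous flow (the cache key name and the fetch_from_xlab() helper are illustrative, not the extension's real API):

```python
import time

cache = {}  # stand-in for the WAN cache: key -> (value, expiry timestamp)
TTL = 60


def fetch_from_xlab():
    """Stand-in for the synchronous HTTP request to xLab."""
    return [{"slug": "example-experiment"}]


def get_experiment_configs(now=None):
    now = now or time.time()
    key = "MetricsPlatform:experiments"
    entry = cache.get(key)
    if entry and entry[1] > now:
        # (1) Cache hit and still fresh: serve from cache
        return entry[0]
    # (2) Miss or stale: fetch synchronously and update the cache.
    # This fetch is what delays the response to the user.
    value = fetch_from_xlab()
    cache[key] = (value, now + TTL)
    return value
```

In the real extension, WANObjectCache's getWithSetCallback() wraps this pattern and adds stampede protection and cross-DC purging.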

T398422 Present _ Mermaid Chart-2025-07-02-104039.png

We leverage WANObjectCache (provided by MediaWiki Core) to do the above, as well as to avoid stampedes and to manage experiment config expiry.

However, there is a performance problem: (2) delays the app server responding to the user by a non-trivial amount of time if the application server isn't in the eqiad DC. The regeneration callback time graph for the 'MetricsPlatform' WAN cache key group shows two distinct bands forming around 10-25 ms and 100-250 ms. I believe that these bands map to "requests from eqiad to eqiad" and "requests from ulsfo/esams/drmrs/magru to eqiad", respectively. This banding is also clear from the data collected by the experiment config fetchers running on every cache node.

Future

We make the fetcher asynchronous. The MetricsPlatform extension will still fetch experiment configs from the WAN cache but the WAN cache will be updated by a maintenance script running every minute.

T398422 Future _ Mermaid Chart-2025-07-02-105813.png

Pros

  • We delay the app server responding to the user by O(1 ms)
  • The delay is equitable for all users as it's the same for any app server in any DC

Cons

Notes

  1. This can be implemented by:
    1. Creating a maintenance script that fetches experiment configs from xLab and updates the cached value in the WAN cache
    2. Creating a periodic job that runs the maintenance script
    3. Updating the MetricsPlatform extension to only ever fetch experiment configs from the WAN cache
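The asynchronous design in the steps above decouples the request path from the fetch entirely. A minimal Python sketch, assuming a hypothetical cache key and a fetch_from_xlab() stand-in for the HTTP call:

```python
cache = {}  # stand-in for the WAN cache


def fetch_from_xlab():
    """Stand-in for the HTTP request to xLab."""
    return [{"slug": "example-experiment"}]


def update_configs_job():
    # Run by the periodic job (the UpdateConfigs.php analog):
    # fetch from xLab and refresh the cached value.
    cache["MetricsPlatform:experiments"] = fetch_from_xlab()


def get_experiment_configs():
    # Request path: read-only. A miss returns an empty list instead of
    # triggering a synchronous fetch, so the user never waits on xLab.
    return cache.get("MetricsPlatform:experiments", [])
```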

AC

  • When $wgMetricsPlatformEnableStreamConfigsFetching is truthy, instrument configs are still fetched as part of the request
  • When $wgMetricsPlatformEnableExperimentConfigsFetching is truthy, experiment configs are still fetched as part of the request

These first two AC preserve the current local development experience, rather than requiring developers to create a periodic job in their local development environments.

Event Timeline

phuedx triaged this task as High priority. Jul 2 2025, 11:36 AM
phuedx moved this task from Incoming to Backlog on the Test Kitchen board.

Change #1167187 had a related patch set uploaded (by Phuedx; author: Phuedx):

[mediawiki/extensions/MetricsPlatform@master] DNM: Make updating configs asynchronous

https://gerrit.wikimedia.org/r/1167187

phuedx set the point value for this task to 8. Jul 9 2025, 8:23 AM

Testing Instructions

  • Confirm that experiment configs are still fetched during the request when $wgMetricsPlatformEnableExperimentConfigsFetching is truthy
  • Confirm that, when $wgMetricsPlatformEnableExperimentConfigsFetching is falsy:
    • Experiment configs aren't fetched during the request
    • Experiment configs are only fetched when running docker compose exec mediawiki php maintenance/run.php ./extensions/MetricsPlatform/maintenance/UpdateConfigs.php
  • Confirm that experiment configs persist between page reloads

You can confirm experiment configs are or aren't being fetched and persisted in the backing store in two different ways:

  1. In the logs of your local xLab instance via docker compose logs -f xlab
  2. In your browser console via console.log( mw.config.get( 'wgMetricsPlatformUserExperiments' ) ); (because the user will be enrolled in the experiments most recently fetched from xLab)

Change #1167187 merged by jenkins-bot:

[mediawiki/extensions/MetricsPlatform@master] InstrumentConfigsFetcher: Make updating configs asynchronous

https://gerrit.wikimedia.org/r/1167187

Change #1171205 had a related patch set uploaded (by Phuedx; author: Phuedx):

[operations/puppet@production] mw::maintenance: ExperimentationLab periodic job

https://gerrit.wikimedia.org/r/1171205

Change #1171251 had a related patch set uploaded (by Clare Ming; author: Phuedx):

[mediawiki/extensions/MetricsPlatform@wmf/1.45.0-wmf.10] InstrumentConfigsFetcher: Make updating configs asynchronous

https://gerrit.wikimedia.org/r/1171251

Change #1171251 merged by jenkins-bot:

[mediawiki/extensions/MetricsPlatform@wmf/1.45.0-wmf.10] InstrumentConfigsFetcher: Make updating configs asynchronous

https://gerrit.wikimedia.org/r/1171251

Mentioned in SAL (#wikimedia-operations) [2025-07-21T20:17:41Z] <cjming@deploy1003> Started scap sync-world: Backport for [[gerrit:1171251|InstrumentConfigsFetcher: Make updating configs asynchronous (T398422)]]

Mentioned in SAL (#wikimedia-operations) [2025-07-21T20:41:41Z] <cjming@deploy1003> cjming: Backport for [[gerrit:1171251|InstrumentConfigsFetcher: Make updating configs asynchronous (T398422)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-07-21T21:04:06Z] <cjming@deploy1003> Finished scap sync-world: Backport for [[gerrit:1171251|InstrumentConfigsFetcher: Make updating configs asynchronous (T398422)]] (duration: 46m 25s)

This deployment coincides with a sudden increase in load on our mainstash database cluster:
https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&from=2025-07-21T12:17:48.161Z&to=2025-07-22T10:06:34.613Z&timezone=utc&var-site=eqiad&var-group=core&var-shard=ms1&var-shard=ms2&var-shard=ms3&var-role=$__all

Seems to be mostly reads:

Captura de pantalla 2025-07-22 a las 12.08.53.png

And it matches the end of the deployment:

21:04 cjming@deploy1003: Finished scap sync-world: Backport for InstrumentConfigsFetcher: Make updating configs asynchronous (T398422) (duration: 46m 25s)

Can we revert this please?

The queries look like this; it doesn't make sense to run them against a database rather than a cache:

root@db2142[(none)]> show full processlist;
| 2013123165 | wikiuser2023         | 10.194.233.168:53048 | mainstash | Query       |       0 | starting                                                      | SELECT /* SqlBagOStuff::fetchBlobs  */  keyname,value,exptime  FROM `objectstash`    WHERE keyname = 'global:MetricsPlatform:instrument:1' AND (exptime >= '20250722101519') |    0.000 |
| 2013123166 | wikiuser2023         | 10.194.206.2:43934   | mainstash | Query       |       0 | init                                                          | SELECT /* SqlBagOStuff::fetchBlobs  */  keyname,value,exptime  FROM `objectstash`    WHERE keyname = 'global:MetricsPlatform:instrument:1' AND (exptime >= '20250722101519') |    0.000 |
| 2013123167 | wikiuser2023         | 10.194.154.169:45766 | mainstash | Query       |       0 | Writing to net                                                | SELECT /* SqlBagOStuff::fetchBlobs  */  keyname,value,exptime  FROM `objectstash`    WHERE keyname = 'global:MetricsPlatform:instrument:1' AND (exptime >= '20250722101519') |    0.000 |
| 2013123168 | wikiuser2023         | 10.194.136.165:34100 | mainstash | Query       |       0 | starting                                                      | SELECT /* SqlBagOStuff::fetchBlobs  */  keyname,value,exptime  FROM `objectstash`    WHERE keyname = 'global:MetricsPlatform:instrument:1' AND (exptime >= '20250722101519') |    0.000 |

Thanks, let's see if traffic goes back to normal

The queries look like this; it doesn't make sense to run them against a database rather than a cache.

My patch should have made reading from the WAN cache fail over to reading from the MainStash when the cached value needed regenerating but it didn't. I apologise for the disruption.

OK. After noodling on this some more, my plan is as follows:

  1. Introduce the $wgMetricsPlatformConfigsCacheType variable
    • The variable controls the cache that the configs fetcher uses. It should refer to an entry in $wgObjectCaches
    • The variable defaults to the main stash. This makes local development easier
  2. Update the configs fetcher such that if there's a cache miss, then write the empty list for a week to stop stampedes on the main stash if the UpdateConfigs.php maintenance script fails
  3. Define a new cache in $wgObjectCaches
    • The cache composes the local cluster cache and the main stash with reads going to the local cluster cache first
  4. Set $wgMetricsPlatformConfigsCacheType to the name of the new cache defined in (3)
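The composed cache in step (3) reads from the local cluster cache first and falls back to the main stash. A minimal Python sketch of that read path (class and key names are illustrative, not actual $wgObjectCaches entries):

```python
class TieredReadCache:
    """Reads go to the local tier first, falling back to the stash tier.
    A hit in the stash warms the local tier so subsequent reads stay local."""

    def __init__(self, local, stash):
        self.local = local
        self.stash = stash

    def get(self, key):
        value = self.local.get(key)
        if value is None:
            value = self.stash.get(key)
            if value is not None:
                # Warm the local tier so the next read avoids the stash
                self.local[key] = value
        return value
```

This keeps steady-state reads on the cheap local tier, with the main stash acting as the durable source the periodic job writes to.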

Change #1172066 had a related patch set uploaded (by Phuedx; author: Phuedx):

[mediawiki/extensions/MetricsPlatform@master] Revert "Revert "InstrumentConfigsFetcher: Make updating configs asynchronous""

https://gerrit.wikimedia.org/r/1172066

Change #1172279 had a related patch set uploaded (by Phuedx; author: Phuedx):

[operations/mediawiki-config@master] MetricsPlatform: Disable synchronous configs fetching

https://gerrit.wikimedia.org/r/1172279

OK. After noodling on this some more, my plan is as follows:

  1. Introduce the $wgMetricsPlatformConfigsCacheType variable
    • The variable controls the cache that the configs fetcher uses. It should refer to an entry in $wgObjectCaches
    • The variable defaults to the main stash. This makes local development easier
  2. Update the configs fetcher such that if there's a cache miss, then write the empty list for a week to stop stampedes on the main stash if the UpdateConfigs.php maintenance script fails
  3. Define a new cache in $wgObjectCaches
    • The cache composes the local cluster cache and the main stash with reads going to the local cluster cache first
  4. Set $wgMetricsPlatformConfigsCacheType to the name of the new cache defined in (3)

In the end, I opted to make InstrumentConfigsFetcher accept the local cluster cache and the main stash as dependencies. In the case where both the cache and the main stash miss, we want to cache a valid, empty value for a short amount of time to minimise the read load on the main stash and to allow for replication lag and/or the UpdateConfigs.php script to run. I had planned to use MultiWriteBagOStuff to compose the local cluster cache and main stash in (3) above, but this would have meant that the valid, empty value would also have been written to the main stash, potentially overwriting a write that is currently being replicated.
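That double-miss behaviour (negative caching in the local tier only) can be sketched as follows. A simplification in Python; the TTL values and key names are hypothetical, not the extension's real configuration:

```python
import time

NEGATIVE_TTL = 60   # short-lived empty value; the real value is a deployment choice
POSITIVE_TTL = 300  # likewise illustrative


def get_configs(local, stash, now=None):
    now = now or time.time()
    entry = local.get("configs")
    if entry and entry[1] > now:
        return entry[0]
    value = stash.get("configs")
    if value is None:
        # Double miss: cache a valid, empty value briefly in the LOCAL tier
        # only, so we don't overwrite a stash write that is still replicating,
        # and so repeated misses don't stampede the main stash.
        local["configs"] = ([], now + NEGATIVE_TTL)
        return []
    local["configs"] = (value, now + POSITIVE_TTL)
    return value
```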

Change #1172637 had a related patch set uploaded (by Phuedx; author: Phuedx):

[mediawiki/extensions/MetricsPlatform@master] ext.xLab: Introduce mw.xLab.getAssignments()

https://gerrit.wikimedia.org/r/1172637

Change #1172066 merged by Dr0ptp4kt:

[mediawiki/extensions/MetricsPlatform@master] Revert "Revert "InstrumentConfigsFetcher: Make updating configs asynchronous""

https://gerrit.wikimedia.org/r/1172066

Change #1172279 merged by jenkins-bot:

[operations/mediawiki-config@master] MetricsPlatform: Disable synchronous configs fetching

https://gerrit.wikimedia.org/r/1172279

Mentioned in SAL (#wikimedia-operations) [2025-07-30T13:06:52Z] <sgimeno@deploy1003> Started scap sync-world: Backport for [[gerrit:rECIR1173920da501|Growth: remove conditional user options for get-started-notification]], [[gerrit:1172279|MetricsPlatform: Disable synchronous configs fetching (T398422)]]

Mentioned in SAL (#wikimedia-operations) [2025-07-30T13:08:59Z] <sgimeno@deploy1003> sgimeno, phuedx: Backport for [[gerrit:rECIR1173920da501|Growth: remove conditional user options for get-started-notification]], [[gerrit:1172279|MetricsPlatform: Disable synchronous configs fetching (T398422)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Mentioned in SAL (#wikimedia-operations) [2025-07-30T13:18:18Z] <sgimeno@deploy1003> Finished scap sync-world: Backport for [[gerrit:rECIR1173920da501|Growth: remove conditional user options for get-started-notification]], [[gerrit:1172279|MetricsPlatform: Disable synchronous configs fetching (T398422)]] (duration: 11m 26s)

From the task description:

[…] We leverage WANObjectCache (provided by MediaWiki Core)

[…] Regeneration callback time for 'MetricsPlatform' graph shows two distinct bands forming around 10-25 ms and 100-250 ms. I believe that these bands map to "requests from eqiad to eqiad" and "requests from ulsfo/esams/drmrs/magru to eqiad", respectively. […]

Did you mean "requests from codfw to eqiad" as the second group?

These numbers are measured by MediaWiki as the time taken for the HTTP request to the MPIC service. There are no MediaWiki app servers in any of ulsfo/esams/drmrs/magru; these are caching data centres only. The codfw data centre is not mentioned here, but our two application data centres are eqiad and codfw. (It doesn't matter to MediaWiki where the end user is but, indeed, some cache clusters map to eqiad and some to codfw, based on whichever is closest, given that we are now multi-DC.)

There's something else on that dashboard that may be useful to note (Grafana / WANObjectCache by keygroup / MetricsPlatform):

Screenshot 2025-08-01 at 15.31.21.png

There are hardly any cache misses, fewer than a crude TTL expiry would explain, and no spikes either. That's great, of course, but there's more to it than that.

This is because your data is automatically asynchronously refreshed by WANObjectCache, before it expires. Let's hide the "hit fresh" rate to see the "hit refresh" rate more clearly:

Screenshot 2025-08-01 at 15.31.24.png

These "hit refresh" events are calls where the cache has not yet expired, and the value is thus returned as normal. But because we're close to the expiry (worthRefreshExpiring) or the key is popular (worthRefreshPopular), random chance schedules a DeferredUpdate to refresh the data. This can (if the key is popular enough) prevent end users from ever facing a cache miss, and it also avoids a stampede on the backend, since only a handful of random threads will fetch the data.
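The "refresh with increasing probability as expiry approaches" idea can be sketched in a few lines of Python. This is a simplification of WANObjectCache's worthRefreshExpiring behaviour, not its exact curve; the window parameter is an assumption for illustration:

```python
import random


def worth_refresh_expiring(age, ttl, window=0.1):
    """Decide whether to schedule an async refresh for a still-valid entry.
    As the entry approaches expiry, the chance of refreshing rises from 0
    to 1 across the final `window` fraction of the TTL."""
    remaining = ttl - age
    if remaining <= 0:
        return True   # already expired: must regenerate
    if remaining > ttl * window:
        return False  # plenty of life left: no refresh needed
    # Chance rises linearly from 0 to 1 across the final window, so only
    # a handful of concurrent threads will win the dice roll and refresh.
    chance = 1 - remaining / (ttl * window)
    return random.random() < chance
```

Because each request rolls independently, popular keys are refreshed before they expire while only a few threads ever pay the regeneration cost.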

Screenshot 2025-08-01 at 15.31.29.png

Note how the "regeneration callback" heatmap is a lot fuller (more samples) than the "Total miss-compute" heatmap. One might assume that the "regeneration callback" is a strict subset (in time span, not in sample count) of "Total miss", because the latter measures "get + callback + set", and that's completely correct. But in terms of sample count it is a superset, because regenerations are not exclusive to "real" cache misses: we also invoke the regen callback for you from these async refreshes.

Change #1175928 had a related patch set uploaded (by Clare Ming; author: Clare Ming):

[operations/puppet@production] Alertmanager: add receiver and routing for experiment-platform tasks

https://gerrit.wikimedia.org/r/1175928

Change #1175928 merged by RLazarus:

[operations/puppet@production] Alertmanager: add receiver and routing for experiment-platform tasks

https://gerrit.wikimedia.org/r/1175928

Thank you very much for the information and your insight.

From the task description:

[…] We leverage WANObjectCache (provided by MediaWiki Core)

[…] Regeneration callback time for 'MetricsPlatform' graph shows two distinct bands forming around 10-25 ms and 100-250 ms. I believe that these bands map to "requests from eqiad to eqiad" and "requests from ulsfo/esams/drmrs/magru to eqiad", respectively. […]

Did you mean "requests from codfw to eqiad" as the second group?

I did.

This is because your data is automatically asynchronously refreshed by WANObjectCache, before it expires. <snip />

These "hit refresh" events are calls where the cache has not yet expired, and the value is thus returned as normal. But because we're close to the expiry (worthRefreshExpiring) or the key is popular (worthRefreshPopular), random chance schedules a DeferredUpdate to refresh the data. This can (if the key is popular enough) prevent end users from ever facing a cache miss, and it also avoids a stampede on the backend, since only a handful of random threads will fetch the data.

TIL.

Are the "miss compute" events situations in which the data is synchronously refreshed? If so, that's still a (smaller than I initially thought) sample of requests that would suffer a performance penalty. Generally speaking this is bad and, in the case where these requests come from users who are part of an experiment, it could introduce bias. My thinking was that moving the regeneration step to a periodic job improves performance for the end user at the cost of increased complexity and, potentially, users flickering in and out of an experiment: in the case of a transient network issue, or in the case of eviction from both the local cluster cache and the main stash.

I could well be over-indexing on the wrong thing though and would appreciate any thoughts you have on the implementation.

Mentioned in SAL (#wikimedia-operations) [2025-08-07T18:50:25Z] <cjming@deploy1003> mwscript-k8s job started: extensions/MetricsPlatform/maintenance/UpdateConfigs.php --wiki aawiki # Test run for T398422

Change #1171205 merged by RLazarus:

[operations/puppet@production] mw::maintenance: ExperimentationLab periodic job

https://gerrit.wikimedia.org/r/1171205