Background
Present
The logic for fetching and pre-processing experiment configs from xLab is located in InstrumentConfigFetcher (herein "the fetcher"). When $wgMetricsPlatformEnableExperimentConfigsFetching is truthy and a logged-in user navigates to a page, the MetricsPlatform extension:
1. Fetches the experiment configs from the WAN cache
2. If there was a cache miss in (1), or the cached value needs to be regenerated, fetches the experiment configs from xLab and updates the WAN cache
We leverage WANObjectCache (provided by MediaWiki Core) to do the above as well as to avoid stampedes and manage experiment configs expiry.
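To make the current flow concrete, here is a minimal sketch of the read-through pattern described above. It is a toy in-process stand-in written in Python, not MediaWiki code: the class `ReadThroughCache` and the injected clock are hypothetical. Its `get_with_set_callback` method only loosely mirrors the shape of `WANObjectCache::getWithSetCallback()`, and it omits stampede protection entirely.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ReadThroughCache:
    """Toy stand-in for a WAN cache (hypothetical; not MediaWiki's API)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # key -> (value, expiry timestamp)
        self._entries: Dict[str, Tuple[Any, float]] = {}

    def get_with_set_callback(
        self, key: str, ttl: float, regenerate: Callable[[], Any]
    ) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            # Step 1: cache hit, no call to the remote service.
            return entry[0]
        # Step 2: miss or expired value. The caller (the web request)
        # blocks here while the callback talks to the remote service.
        value = regenerate()
        self._entries[key] = (value, now + ttl)
        return value
```

The point of the sketch is the blocking call in step 2: whatever latency the regeneration callback has is added directly to the request that triggered it.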
However, there is a performance problem: step (2) delays the app server's response to the user by a non-trivial amount of time if the app server isn't in the eqiad DC. The graph of regeneration callback time for the 'MetricsPlatform' WAN cache key group shows two distinct bands, at roughly 10-25 ms and 100-250 ms. I believe that these bands map to "requests from eqiad to eqiad" and "requests from ulsfo/esams/drmrs/magru to eqiad", respectively. The same banding is visible in the data collected by the experiment config fetchers running on every cache node.
Future
We make the fetcher asynchronous. The MetricsPlatform extension will still fetch experiment configs from the WAN cache, but the WAN cache will be updated by a maintenance script that runs every minute rather than by the request itself.
Pros
- We delay the app server's response to the user by only O(1 ms)
- The delay is equitable for all users as it's the same for any app server in any DC
Cons
- We increase the fragility of the system
  - If there is a cache miss or the cached value would otherwise be regenerated, it is not regenerated.
    - This risk is mitigated by allowing the cached value to go stale rather than be deleted when it expires. This is in line with how fetching experiment configs works on the cache nodes, where the previous experiment configs are only replaced when different experiment configs are successfully fetched.
Notes
- This can be implemented by:
  - Creating a maintenance script that fetches experiment configs from xLab and updates the cached value in the WAN cache
  - Creating a periodic job that runs the maintenance script
  - Updating the MetricsPlatform extension to only ever fetch experiment configs from the WAN cache
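The steps above, together with the stale-value mitigation listed under Cons, can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the proposed implementation: `ConfigCache`, `CACHE_KEY`, `refresh_experiment_configs` and `get_experiment_configs` are hypothetical names, and returning an empty list on a cache miss is an assumption about what the request path would do.

```python
from typing import Any, Callable, Dict, List, Optional


class ConfigCache:
    """Toy stand-in for the WAN cache; values go stale but are never deleted."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}

    def get(self, key: str) -> Optional[Any]:
        return self._values.get(key)

    def set(self, key: str, value: Any) -> None:
        self._values[key] = value


CACHE_KEY = "experiment-configs"  # hypothetical key name


def refresh_experiment_configs(
    cache: ConfigCache, fetch: Callable[[], List[dict]]
) -> bool:
    """Body of the periodic job. Returns True if the cache was updated."""
    try:
        configs = fetch()  # assumed to raise when xLab is unreachable
    except Exception:
        return False  # fetch failed: keep serving the stale value
    if configs == cache.get(CACHE_KEY):
        return False  # nothing changed: leave the cached value alone
    cache.set(CACHE_KEY, configs)
    return True


def get_experiment_configs(cache: ConfigCache) -> List[dict]:
    """Request path: read from the cache only, never call xLab."""
    configs = cache.get(CACHE_KEY)
    # Assumption: a miss yields "no experiments" rather than a fetch.
    return configs if configs is not None else []
```

In this shape the request path never waits on xLab, and a failed or unchanged fetch leaves the previous configs in place, which matches the behaviour described for the cache nodes.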
AC
- When $wgMetricsPlatformEnableStreamConfigsFetching is truthy, instrument configs are still fetched as part of the request
- When $wgMetricsPlatformEnableExperimentConfigsFetching is truthy, experiment configs are still fetched as part of the request
These first two AC preserve the current local development experience, so that developers are not required to create a periodic job in their local development environment.





