Provide way to configure or purge cache of VCS data on Special:Version or simply remove caching.
Open, Needs Triage · Public

Description

In https://gerrit.wikimedia.org/r/#/c/388049 I proposed a setting allowing configuration of the time that VCS data is cached for on Special:Version.
The plan would be to set the cache time for beta and other testing sites to be much lower than on production sites.

Discussion on the patch raised the question of whether caching this VCS data is actually beneficial.

To move forward with the patch, either by making the cache time configurable or by providing a maintenance script to purge the cache, we need an answer to that question.
If the caching is not beneficial then we can simply remove it.

Tagging Performance-Team as they might be the correct team to answer this?

Event Timeline

Addshore created this task. Nov 9 2017, 11:55 AM
Restricted Application added a subscriber: Aklapper. Nov 9 2017, 11:55 AM

In Wikimedia production, we have precached JSON files in $IP/cache/gitinfo:

legoktm@tin:/srv/mediawiki-staging/php-1.31.0-wmf.7/cache/gitinfo$ cat info-skins-Timeless.json 
{"head": "45493adbd89e4b57ea2dc00a9410a0ad0a325ef4", "remoteURL": "https://gerrit.wikimedia.org/r/mediawiki/skins/Timeless", "branch": "45493adbd89e4b57ea2dc00a9410a0ad0a325ef4", "headCommitDate": "1509915450", "headSHA1": "45493adbd89e4b57ea2dc00a9410a0ad0a325ef4", "@directory": "/srv/mediawiki-staging/php-1.31.0-wmf.7/skins/Timeless"}

If those file lookups are too slow in production, we should just make one JSON file that is the combination of all of the individual ones to only require one stat. On most other wikis that don't do the pre-caching, it's 3-4 stat calls per extension/skin I think.

What generates these?

Krinkle removed a project: Performance-Team. Nov 13 2017, 9:36 PM
Krinkle added a subscriber: Krinkle.

@Addshore Performance-Team doesn't have any particular preference about whether the cache age should be configurable, or whether to approach the problem in a different way (like Lego described). I will mention that in either case, it'd be nice if Beta worked similarly to production.

Untagging Performance-Team for now as this isn't an area we maintain, and in terms of best practices, it seems fine either way.

What generates these?

scap! https://phabricator.wikimedia.org/source/scap/browse/master/scap/tasks.py;3e59e84e070c53e35cbd0224648031e429e0865a$106

ExtensionDistributor also generates similar JSON files for the tarballs it produces, so that the proper VCS info shows up on Special:Version.

Change 388049 had a related patch set uploaded (by Krinkle; owner: Addshore):
[mediawiki/core@master] Configurable Special:Version VCS info cache time

https://gerrit.wikimedia.org/r/388049

Seb35 added a subscriber: Seb35. Jan 22 2018, 10:11 AM

On my dev install, I deactivated the vcs cache - I have a "deployment" commit removing the $cache->set, but I would be happy if a better solution is found.

I noticed that the existing parameter $wgGitInfoCacheDirectory is read but never written by MediaWiki itself (the gitinfo.json files are only generated by ExtensionDistributor or scap, if I understand correctly). Hence there are currently two caches:

  • a first general cache with $cache->get/set in SpecialVersion
  • and a second specialized cache with gitinfo.json files in GitInfo

To obtain a more general situation useful both in prod and in dev, perhaps we could remove the first general cache – which is hardly purgeable – and rely entirely on the specialized cache already in place, and activate writing of this specialized cache (already implemented in GitInfo::precomputeValues(), just not called). With such a setup it will be easy to deactivate or purge the cache in both environments.

The single cache file could be a performance improvement, but it remains to be implemented.

It could be a file gitinfo.json in $wgGitInfoCacheDirectory to avoid conflicts with the existing scheme with info-$dir-$subdir-$subsubdir.json files. It could be a general dictionary with keys "$dir/$subdir/$subsubdir".
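
The single-dictionary layout could be sketched as follows. This is only an illustration in Python (the language scap is written in), not MediaWiki or scap code: the function name `combine_gitinfo`, the `base_dir` parameter, and the choice to derive keys from each file's "@directory" field are assumptions.

```python
import json
import os
from pathlib import Path


def combine_gitinfo(cache_dir, base_dir):
    """Merge every info-*.json file in cache_dir into one dictionary.

    Keys are repository paths relative to base_dir (for example
    "skins/Timeless"), derived from each file's "@directory" field.
    Hypothetical helper, for illustration only.
    """
    combined = {}
    # The pattern "info-*.json" deliberately skips a combined gitinfo.json.
    for path in sorted(Path(cache_dir).glob("info-*.json")):
        with open(path, encoding="utf-8") as f:
            info = json.load(f)
        directory = info.get("@directory")
        if directory is None:
            continue  # no repository path, so no key can be derived
        key = os.path.relpath(directory, base_dir).replace(os.sep, "/")
        combined[key] = info
    return combined
```

A reader of the combined file would then need a single file lookup and one dictionary access per extension or skin, instead of one file lookup each.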

I’m not sure whether locking needs to be handled explicitly during writes. The CDB library simply writes to a temporary file and then renames it, but there each language is independent of the others. In our case, rendering Special:Version loops over the git directories, and if the single cache file were rewritten for each git directory, highly concurrent writes could conflict, and the file might rarely reach its full size. For Wikimedia, scap could of course be adapted to generate this single cache file in a safe environment.

A better implementation in GitInfo would probably be to compute the entire cache file, write it to a temporary file, and then rename that file to its real name. In that case, though, GitInfo would need to keep all git directories in a static property, and some method called at the end of Special:Version rendering would have to trigger the write.
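
The write-then-rename idea could look roughly like this. It is a minimal Python sketch, not the GitInfo implementation; the function name `write_json_atomically` is hypothetical, and the sketch assumes the temporary file and the target live on the same filesystem so that the rename is atomic.

```python
import json
import os
import tempfile


def write_json_atomically(data, target_path):
    """Write data as JSON to target_path without exposing a partial file.

    The content goes to a temporary file in the same directory, which is
    then renamed over the target. Hypothetical helper, for illustration.
    """
    target_dir = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk first
        os.replace(tmp_path, target_path)  # atomic rename over the target
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this approach, concurrent readers see either the old complete file or the new complete file. Concurrent writers can still overwrite each other's result, which is why computing the whole dictionary once before writing matters.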

The first cache is volatile, its values are automatically populated and expired. It works out of the box by default.

The second cache is a way to disable the automatic caching principle. Its most important aspect is that it is manually maintained and will never expire or regenerate itself in production. This is similar to the LocalisationCache['recache'] attribute. One of the reasons the second cache works this way is that large wiki farms, like the Wikimedia Foundation's, don't deploy the .git directory in the first place. So even if the cost of recomputing were acceptable, it would not work, because the data needed to compute it does not exist on the web servers. It only exists on the deployment servers, where the gitinfo.json files are generated, and we deploy those instead. So in many ways the optional $wgGitInfoCacheDirectory feature is not a cache in front of GitInfo; it disables GitInfo and replaces it.

These approaches are different enough that I cannot recommend merging them. I think it may be possible to merge them safely, but I suspect the end-result would become somewhat risky to maintain, and without a clear benefit.

Going back to the original problem:

  1. Version information is often outdated on Beta Cluster.
  2. Version information is often outdated on other sites that use Git for deployment.

The solution to problem 1 is that Beta Cluster should use the same method as production. That will automatically solve the issue.

For problem 2, we could re-evaluate whether it makes sense to cache GitInfo for 24 hours by default. I think 5 minutes would work just as well. For medium-traffic sites, that would still avoid disk access most of the time. Really high-traffic sites like WMF and Wikia should disable it and use $wgGitInfoCacheDirectory as part of their deployment system, similar to LocalisationCache.
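
To illustrate why a short expiry still avoids most disk access, here is a minimal sketch of a time-limited cache in Python. It is not MediaWiki's cache API; the class `TTLCache`, its `get_or_compute` method, and the injectable `clock` parameter are hypothetical names used only for this example.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self._entries = {}  # key -> (expiry_time, value)

    def get_or_compute(self, key, compute):
        """Return the cached value, or call compute() if missing or expired."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self.ttl, value)
        return value
```

With `ttl_seconds=300`, the expensive lookup runs at most once every five minutes per key, however many requests arrive in between, and stale version information disappears within five minutes of a deployment.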

Krinkle moved this task from To triage to Special:Version on the MediaWiki-Special-pages board.
Krinkle moved this task from Limbo to Watching on the Performance-Team (Radar) board.

Change 513729 had a related patch set uploaded (by Seb35; owner: Seb35):
[mediawiki/core@master] New maintenance script rebuildVersionCache

https://gerrit.wikimedia.org/r/513729

Seb35 added a comment. Jun 1 2019, 9:24 PM

The proposed patch creates a maintenance script purging the cache in Special:Version (this task) and optionally creating the gitinfo.json files (T131003). If preferred, the proposed maintenance script could be split into different scripts.

Purging the cache can resolve this issue while keeping the cache in place for large sites.

Would it be possible to backport this maintenance script to the 1.31.x branch (long-term support)? Clearing my cache does not make version updates visible.

I recently updated an extension. rebuildVersionCache.php worked perfectly and I was able to verify the updated version info.

+1 for @Lady_G2016's suggestion. I'm using 1.31 too, and some extensions don't show their new version after I updated them.

Change 388049 abandoned by Addshore:
Configurable Special:Version VCS info cache time

https://gerrit.wikimedia.org/r/388049

This issue also happens with MediaWiki 1.34 (the lockdown extension still shows the old version number after uploading new files).