
Add file-based cached implementation of SiteStore
Closed, Invalid (Public)


Version: 1.21.x
Severity: enhancement



Event Timeline

bzimport raised the priority of this task to Medium. Nov 22 2014, 1:33 AM
bzimport set Reference to bz45532.
bzimport added a subscriber: Unknown Object (MLST).

For deployment on high-traffic sites like Wikipedia, there should be an implementation of the SiteStore interface backed by CDB. This would allow fast access to the site information, similar to the way the interwiki mappings are currently stored using CDB.

  • Bug 48932 has been marked as a duplicate of this bug.

CDB is not really desired anymore. The recommendation is instead to cache the site data in a static file, such as JSON (?), rather than loading it from memcached.

Suggest this approach:

  • SiteStore class remains as is
  • CachingSiteStore decorator is added
  • CachingSiteStore has a field of type "general caching interface", which might be an existing interface or a new one
  • An implementation of this "general caching interface" is created (or re-used if it exists) that does the type of caching deemed best at this time for the site info on the WMF cluster
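The decorator approach listed above could look roughly like the following minimal sketch. All names here (SimpleCache, ArrayCache, SiteLookupSketch, SlowSiteStore, CachingSiteStoreSketch) are hypothetical and only illustrate the pattern; they are not existing MediaWiki classes, and the real SiteStore interface has more methods than the single one shown.

```php
<?php
// Hypothetical names throughout; this only illustrates the decorator idea.

// Stand-in for the "general caching interface" from the list above.
interface SimpleCache {
	public function get( $key ); // returns null on a miss
	public function set( $key, $value );
}

// Trivial in-process cache, used here only so the sketch is runnable.
class ArrayCache implements SimpleCache {
	private $data = array();

	public function get( $key ) {
		return isset( $this->data[$key] ) ? $this->data[$key] : null;
	}

	public function set( $key, $value ) {
		$this->data[$key] = $value;
	}
}

// Reduced stand-in for the read side of SiteStore.
interface SiteLookupSketch {
	public function getSites();
}

// Pretend backend that counts how often it is actually queried.
class SlowSiteStore implements SiteLookupSketch {
	public $loads = 0;

	public function getSites() {
		$this->loads++;
		return array( 'enwiki' => array( 'group' => 'wikipedia' ) );
	}
}

// The decorator: same interface, delegates to the wrapped store on a miss.
class CachingSiteStoreSketch implements SiteLookupSketch {
	private $store;
	private $cache;

	public function __construct( SiteLookupSketch $store, SimpleCache $cache ) {
		$this->store = $store;
		$this->cache = $cache;
	}

	public function getSites() {
		$sites = $this->cache->get( 'sites' );
		if ( $sites === null ) {
			$sites = $this->store->getSites();
			$this->cache->set( 'sites', $sites );
		}
		return $sites;
	}
}

$inner = new SlowSiteStore();
$store = new CachingSiteStoreSketch( $inner, new ArrayCache() );
$first = $store->getSites();
$second = $store->getSites();
echo $inner->loads, "\n"; // the wrapped store was queried once
```

Because the decorator only depends on the cache interface, the caching backend (file, memcached, or something else) can be swapped without touching the store itself.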

Change 174874 merged by jenkins-bot:
Implement SiteListFileCache and rebuild script

Some thoughts:

I'm not sure the file should really be a cache. If we have the site info in a file, just use that file, period. It can and probably should be read-only.

We may want two or three files (corresponding to $wgInterwikiScopes).

This file is going to be read and parsed a *lot*, so file format and encoding matter. I think we should at least consider & benchmark CDB, JSON, and CSV.

By the way, if we don't need Unicode, we should try to bypass UTF-8 decoding, which tends to be slow.
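A rough way to start the benchmark suggested above is sketched here. It is an assumption-laden illustration: the site data is synthetic, the field names are made up, only decoding is timed, the CSV parser assumes fields contain no commas or quotes, and CDB is left out because it needs an extra extension. Real numbers would have to come from the actual site data on the target runtime.

```php
<?php
// Synthetic data; the 'group' and 'language' fields are illustrative assumptions.
$sites = array();
for ( $i = 0; $i < 1000; $i++ ) {
	$sites["site$i"] = array( 'group' => 'wikipedia', 'language' => "l$i" );
}

// Encode the same data once per format.
$json = json_encode( $sites );

$csv = '';
foreach ( $sites as $id => $row ) {
	$csv .= "$id,{$row['group']},{$row['language']}\n";
}

// Naive CSV decoder: assumes no commas, quotes, or newlines inside fields.
function decodeCsv( $csv ) {
	$sites = array();
	foreach ( explode( "\n", trim( $csv ) ) as $line ) {
		list( $id, $group, $language ) = explode( ',', $line );
		$sites[$id] = array( 'group' => $group, 'language' => $language );
	}
	return $sites;
}

// Runs $callback $rounds times and returns the elapsed wall-clock seconds.
function timeIt( $callback, $rounds ) {
	$start = microtime( true );
	for ( $i = 0; $i < $rounds; $i++ ) {
		$callback();
	}
	return microtime( true ) - $start;
}

$jsonTime = timeIt( function () use ( $json ) {
	json_decode( $json, true );
}, 100 );
$csvTime = timeIt( function () use ( $csv ) {
	decodeCsv( $csv );
}, 100 );

printf( "json: %.4fs, csv: %.4fs\n", $jsonTime, $csvTime );
```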

If we make it a file store (suggest a name?), that might work, but the SiteStore interface is not very suitable, since it contains a bunch of write methods that don't make sense in a file store.

I think we need something like:

public function getSites();

public function getSite( $siteGlobalId );

If it's JSON, then for getSite we could load the array mapping $globalSiteId => $siteData and lazily initialize the Site objects when requested. There are numerous places where we want just one Site object or just a few. This doesn't necessarily fit well with the design of the SiteList class, though.

Good idea to benchmark CDB, JSON, and other options.
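The lazy initialization idea could be sketched as follows. SiteSketch and LazySiteLookupSketch are hypothetical stand-ins, not the real Site or SiteList classes, and the JSON layout (global site ID mapped to a row with a 'group' field) is an assumption made for the example.

```php
<?php
// Hypothetical stand-in for the real Site class.
class SiteSketch {
	public $globalId;
	public $group;

	public function __construct( $globalId, $group ) {
		$this->globalId = $globalId;
		$this->group = $group;
	}
}

class LazySiteLookupSketch {
	private $rows;            // $globalSiteId => raw site data
	private $sites = array(); // objects built so far, keyed by global ID

	public function __construct( $json ) {
		$rows = json_decode( $json, true );
		$this->rows = is_array( $rows ) ? $rows : array();
	}

	public function getSite( $siteGlobalId ) {
		if ( !isset( $this->rows[$siteGlobalId] ) ) {
			return null;
		}
		// Build the object only on first request, then reuse it.
		if ( !isset( $this->sites[$siteGlobalId] ) ) {
			$row = $this->rows[$siteGlobalId];
			$this->sites[$siteGlobalId] = new SiteSketch( $siteGlobalId, $row['group'] );
		}
		return $this->sites[$siteGlobalId];
	}

	// Number of objects instantiated so far (for illustration only).
	public function countBuilt() {
		return count( $this->sites );
	}
}

$json = '{"enwiki":{"group":"wikipedia"},"dewiki":{"group":"wikipedia"},'
	. '"enwiktionary":{"group":"wiktionary"}}';
$lookup = new LazySiteLookupSketch( $json );
$site = $lookup->getSite( 'enwiki' );
```

Only the requested entry is turned into an object; the other rows stay as plain arrays until something asks for them.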

Tobi_WMDE_SW set Security to None.
Tobi_WMDE_SW added a subscriber: Wikidata.
Tobi_WMDE_SW subscribed.

Note that SiteListFileCache was merged in Iaee4c1f9fb5d54e, but is not hooked up anywhere, as far as I can see. More work is needed to start using this.

During the SprintStart meeting today, we split the remaining work into several tasks:

T77990 Split of site lookup for readonly access
T77991 Decide on file format: CDB, JSON, CSV
T77993 Implement new interface based on pre-generated files (one file per group and one file with everything indexed by global site id)
-> load everything when local lookup is needed
T77994 Script that generates files
T77995 Minimize access to local-id-lookup
Not a blocker, but should be done at some point: T77997 Investigate how to best move the sites component out of core

Static PHP files (which can be created at deploy time if needed, as the l10n CDB files are on the WMF cluster) are much better than JSON for server-side performance. HHVM and PHP5/7 can both cache PHP bytecode and eliminate the file read and parse stages that JSON requires.
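A minimal sketch of the static PHP file approach, assuming the site data fits in a plain nested array: a build step writes the array with var_export(), and request-time code loads it with include. The file name and array layout are illustrative assumptions, and whether the bytecode is actually cached depends on the runtime's opcode cache configuration.

```php
<?php
// Illustrative data; the field names are assumptions for this sketch.
$sites = array(
	'enwiki' => array( 'group' => 'wikipedia', 'language' => 'en' ),
	'dewiki' => array( 'group' => 'wikipedia', 'language' => 'de' ),
);

// Build step (for example at deploy time): write a PHP file that returns the array.
$file = sys_get_temp_dir() . '/sites-sketch-' . getmypid() . '.php';
file_put_contents( $file, "<?php\nreturn " . var_export( $sites, true ) . ";\n" );

// Request time: include the file. With an opcode cache enabled, the compiled
// form can be reused across requests instead of re-parsing the data each time.
$loaded = include $file;

// Clean up the temporary file used by this sketch.
unlink( $file );

echo count( $loaded ), " sites loaded\n";
```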