RFC: Overhaul Interwiki map, unify with Sites and WikiMap
Open, Medium, Public

Description

Proposal updated 2016-05-11, see below for the original RFC

Status

Next Steps

  • split CDB from SQL implementation
  • implement array-based InterwikiLookup (loads from multiple JSON or PHP files)
    • indexes should be generated on the fly, if not present in the loaded data
    • proposed structure: P3044
    • that InterwikiLookup implementation should also implement SiteLookup. Alternatively, only implement SiteLookup, and provide an adapter (SiteLookupInterwikiLookup) that implements InterwikiLookup on top of a SiteLookup.
  • implement maintenance script that can convert between different interwiki representations.
    • use InterwikiLookup for (multiple) input sources (db/files), InterwikiStore for output
    • we want an InterwikiStore that can write the new array structure (as JSON or PHP)
    • we want an InterwikiStore that can write the old CDB structure (as CDB or PHP)
  • Provide a config variable for specifying which files to read interwiki info from. If not set, use old settings and old interwiki storage.
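The array-based lookup from the steps above could look roughly like the following. This is a sketch in Python rather than PHP, for brevity. The data layout (a "sites" list with "prefix" and "url" keys, plus an optional "by_prefix" index) and all class and method names are invented for illustration; the structure actually proposed is in P3044 and is not reproduced here.

```python
import json
from pathlib import Path


class ArrayInterwikiLookup:
    """Hypothetical array-backed lookup; not the MediaWiki InterwikiLookup API."""

    def __init__(self, data_sets):
        # data_sets: list of dicts; later entries override earlier ones.
        self._sites = []
        self._by_prefix = {}
        for data in data_sets:
            self._merge(data)

    @classmethod
    def from_files(cls, paths):
        # Only JSON is handled in this sketch; the task also mentions PHP files.
        return cls([json.loads(Path(p).read_text(encoding="utf-8")) for p in paths])

    def _merge(self, data):
        sites = data.get("sites", [])
        index = data.get("by_prefix")
        if index is None:
            # Index not present in the loaded data: generate it on the fly.
            index = {site["prefix"]: i for i, site in enumerate(sites)}
        offset = len(self._sites)
        for prefix, i in index.items():
            self._by_prefix[prefix] = offset + i
        self._sites.extend(sites)

    def is_valid(self, prefix):
        return prefix in self._by_prefix

    def fetch(self, prefix):
        pos = self._by_prefix.get(prefix)
        return None if pos is None else self._sites[pos]
```

Passing the file list in from a config variable, and falling back to the old storage when it is unset, would sit outside this class.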

Questions

Later

  • decide on how wikis on the WMF cluster should load their interwiki config
    • proposal: three files: family (shared by e.g. all wikipedias), language (shared by e.g. all english wikis), and local.
  • create a script that generates the family, language, and local files for all the wikis (as JSON or PHP) based on config. Should work like dumpInterwiki.
    • check this: generating CDB based on the relevant family/language/local file for a given wiki should return the same CDB as dumpInterwiki for that site.
  • create a deployment process that generates PHP files from the checked-in JSON files, for faster loading.
  • action=siteinfo&siprop=interwikimap could be ported to Sites and expose more information. Distinction from SiteMatrix is becoming somewhat unclear then.
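To illustrate the three-file proposal above, here is a minimal Python sketch of how family, language, and local data might be layered. It assumes each file decodes to a mapping from prefix to definition and that a later layer replaces a whole entry from an earlier one; neither assumption is settled by this task, and the function name is hypothetical.

```python
def merge_layers(family, language, local):
    """Combine three prefix -> definition mappings; later layers win."""
    merged = {}
    for layer in (family, language, local):
        for prefix, definition in layer.items():
            # Copy so the result does not share dicts with the input layers.
            merged[prefix] = dict(definition)
    return merged
```

Under these assumptions, the consistency check mentioned above amounts to comparing the merged mapping for a wiki with what dumpInterwiki produces for the same wiki.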

Original RFC

We currently have three systems in core that provide information about other sites: Interwiki, WikiMap, and SiteStore. The information they provide is frequently inconsistent (with each other as well as between wikis), and none of them provides a good interface for maintaining the information. This RFC proposes a path to fix this.

Historically, Interwiki was used for linking to other wikis from wikitext, while WikiMap helps with linking to other wikis programmatically. Sites/SiteStore/SiteLookup was introduced to allow access to other wikis' APIs, and was intended to replace the old interwiki system.

This proposal builds on the idea that information about other sites is configuration, not content. There is no need to have it in the database at all, or in any way mutable by the application. This proposal assumes that reading (and caching) local files is faster than loading from a database server (or memcached).

Objectives

  • allow us to use the more flexible Sites system instead of the crusty Interwiki system
  • allow us to use Sites and Interwiki side by side, based on the same data
  • allow interwiki mappings / site definitions to be maintained in files, not in the database. This is easier to maintain via git and puppet (or vim).
  • Preserve the legacy interface for interwiki links (static methods in Interwiki)
  • make Sites at least as performant as the current multi-cache hodge-podge implementation of Interwiki.
  • Make WikiMap consistent with Interwiki and SiteLookup

Requirements

Outline

Refactor:

  • Create a new interface InterwikiLookup with all the public methods from Interwiki (see I7d7424345)
  • Create ClassicInterwikiLookup (better name needed), which implements InterwikiLookup using the code currently in Interwiki. "Classic" because it's basically the old code, and implements the old storage backends (sql, cdb, ...). (see I7d7424345)
  • Make the public static methods in Interwiki delegate to a singleton instance of InterwikiLookup, remove everything else from the class. (see I7d7424345)
  • Add missing Interwiki concepts to the Site class, e.g. the "local" flag ("local" could be implemented as a group)
    • Allow sites to be a member of multiple groups (e.g. "wikipedia" and "english" for enwiki).
  • re-implement DBSiteStore without dependency on ORMTable. (done in I7e7ca257)
  • Reduce the complexity of Sites & co: remove SiteObject and SiteSQLStore; Consider dropping SiteList in favor of a more powerful SiteLookup interface.
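The first three refactoring steps can be sketched as follows, in Python for brevity. The legacy class keeps only static entry points and forwards each call to one shared lookup instance. The method names loosely mirror isValidInterwiki and fetch; the dict-backed "classic" class and the set_lookup hook are assumptions made for this sketch, not the code in I7d7424345.

```python
class ClassicInterwikiLookup:
    """Stand-in for the legacy backends (sql, cdb, ...); here just a dict."""

    def __init__(self, table):
        self._table = dict(table)

    def is_valid_interwiki(self, prefix):
        return prefix in self._table

    def fetch(self, prefix):
        return self._table.get(prefix)


class Interwiki:
    """Legacy facade: static entry points only, all logic lives in the lookup."""

    _lookup = None

    @classmethod
    def set_lookup(cls, lookup):
        # Hypothetical hook; lets configuration or tests swap the implementation.
        cls._lookup = lookup

    @classmethod
    def _get_lookup(cls):
        if cls._lookup is None:
            raise RuntimeError("no InterwikiLookup configured")
        return cls._lookup

    @staticmethod
    def is_valid_interwiki(prefix):
        return Interwiki._get_lookup().is_valid_interwiki(prefix)

    @staticmethod
    def fetch(prefix):
        return Interwiki._get_lookup().fetch(prefix)
```

Existing callers keep using the static methods, while the backing implementation can be replaced without touching them.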

Migrate to Sites:

  • Create an adapter, SiteLookupInterwikiLookup, implementing InterwikiLookup based on a SiteLookup.
  • Migrate usages of SiteStore to SiteLookup.
  • Provide a script for importing information from an InterwikiLookup into the sites table. Can be used for migrating from interwiki in the database or CDB (as generated by dumpInterwiki.php) to sites in the database.
  • Switch the singleton used by the static methods in Interwiki to use SiteLookupInterwikiLookup instead of ClassicInterwikiLookup (should be configurable)
  • Make WikiMap look up wiki info in a SiteLookup before (after?) checking in $wgConf (optional?). (done in I8186140ae)
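The adapter from the first step of this list might look roughly like this. It is a Python sketch under stated assumptions: the Site fields, the get_sites method, and the reading of "local" as group membership are simplifications invented here and do not reflect the real Site or SiteLookup interfaces.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Site:
    """Hypothetical, heavily reduced stand-in for a Site object."""
    global_id: str
    page_url: str  # e.g. "//en.wikipedia.org/wiki/$1"
    interwiki_ids: list = field(default_factory=list)
    groups: set = field(default_factory=set)


class InMemorySiteLookup:
    """Minimal SiteLookup stand-in that serves a fixed list of sites."""

    def __init__(self, sites):
        self._sites = list(sites)

    def get_sites(self):
        return list(self._sites)


class SiteLookupInterwikiLookup:
    """Answers interwiki-style queries using data from a SiteLookup."""

    def __init__(self, site_lookup):
        # Build the prefix index once; a site may carry several prefixes.
        self._by_prefix = {
            prefix: site
            for site in site_lookup.get_sites()
            for prefix in site.interwiki_ids
        }

    def is_valid_interwiki(self, prefix: str) -> bool:
        return prefix in self._by_prefix

    def get_url(self, prefix: str, title: str) -> Optional[str]:
        site = self._by_prefix.get(prefix)
        return None if site is None else site.page_url.replace("$1", title)

    def is_local(self, prefix: str) -> bool:
        # Assumes the "local" flag is modelled as a group, as suggested above.
        site = self._by_prefix.get(prefix)
        return site is not None and "local" in site.groups
```

With such an adapter in place, switching the singleton behind the static Interwiki methods becomes a matter of configuration.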

File-based backend:

  • FileSiteLookup implements a SiteLookup that will simply load site definitions from a list of local files
    • support at least JSON (easy to maintain) and PHP (code, not serialized data; fast with accelerator cache). Go by file extension.
  • Make an export script that can generate JSON site definition files:
    • Export from a SiteLookup or InterwikiLookup (needs an adapter that implements a SiteLookup based on an InterwikiLookup)
    • Export all, or a list of groups
    • Export only the ones that differ from the ones defined in a list of given files. This can be used to generate files that contain only the local overrides / additions to a common list of site definitions.
  • Provide a script for writing information from an InterwikiLookup to a JSON file. This can be used to port the output of dumpInterwiki.php to JSON.
  • Make a maintenance script that generates a PHP file with site definitions from a list of JSON (and PHP) files.
    • the generated PHP file would contain indexes for all IDs and groups as well as the main data structure.
  • Switch the default implementation of SiteLookup from DBSiteStore to FileSiteLookup.
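The "export only the ones that differ" step could be sketched as below, in Python. It assumes site definitions are plain mappings keyed by prefix; both function names are hypothetical.

```python
import json


def local_overrides(current, base):
    """Keep only definitions that are absent from, or differ from, the base set."""
    return {
        prefix: definition
        for prefix, definition in current.items()
        if base.get(prefix) != definition
    }


def to_json(definitions):
    # Sorted keys keep the generated file stable, which helps when diffing in git.
    return json.dumps(definitions, indent=2, sort_keys=True)
```

This sketch cannot express that a base entry should be removed locally; whether that is needed is left open here.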

Performance:

  • Add methods for fetching Site objects for a group to SiteLookup
  • Make CachingSiteStore (resp CachingSiteLookup) cache individual groups. Use "siblings" group for sister projects of the same language.
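Per-group caching, as suggested above, might be sketched like this in Python. The get_group method, the cache key format, and the dict used as a cache are assumptions for illustration; cache duration and invalidation are not modelled.

```python
class StaticGroupLookup:
    """Hypothetical backend: group name -> list of site ids; counts its calls."""

    def __init__(self, groups):
        self._groups = groups
        self.calls = 0

    def get_group(self, group):
        self.calls += 1
        return list(self._groups.get(group, []))


class CachingGroupLookup:
    """Caches each group separately, so one group can be loaded without the rest."""

    def __init__(self, inner, cache=None):
        self._inner = inner
        self._cache = {} if cache is None else cache

    def get_group(self, group):
        key = "sites:group:" + group
        if key not in self._cache:
            # Cache miss: ask the wrapped lookup once and remember the answer.
            self._cache[key] = self._inner.get_group(group)
        return self._cache[key]
```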

Configuration

  • InterwikiLookup implementation to use. Default: SiteLookupInterwikiLookup
    • For ClassicInterwikiLookup, use the old interwiki settings controlling CDB usage etc.
  • SiteLookup implementation to use. Default: FileSiteLookup
    • by default, read DefaultInterwiki.json (maintained in git) and LocalInterwiki.json (shipped empty).
    • on the WMF cluster, each wiki uses three JSON files: a common file, one per family, and one for local overrides per wiki.
    • on the WMF cluster, use PHP files for speed. The PHP file could be generated per-wiki, combining the common, family, and local JSON files. This essentially replaces the functionality of dumpInterwiki.php.
  • Caching:
    • which cache (possibly none for PHP files)
    • duration
    • groups to cache separately (all?)

Open Questions

  • should Site objects always be fully loaded/instantiated? Or would it be better to be able to ask for individual "aspects" of a site, e.g. paths, dbname, ids, etc?
  • should Site objects relay information from wikifarm configuration (wgConf)? Or should Sites be kept entirely separate from configuration? WikiMap already combines information from these two sources. But the old interwiki map is completely separate from wgConf.
  • Should SiteMatrix continue to work based on wgConf, or should it be ported to use Sites? Or combine both? Currently it has problems with Wikimedia-specific configurations, e.g. for special language codes.
  • should the JSON structure for describing sites have a narrow specification, or be flexible towards additions?
  • action=siteinfo&siprop=interwikimap could be ported to Sites and expose more information. Distinction from SiteMatrix is becoming somewhat unclear then.

Event Timeline

@daniel, my understanding from our previous conversations (and discussions about a new column on the TechCom-RFC board) is that this is "on track". As of right now, you're not waiting on TechCom for approval before continuing development (or understanding what next steps should be) and that everyone (including TechCom) is happy that implementation is underway. Is that a fair characterization?

(I'm asking because I was recently asked about status on this RFC)

I have proposed T149535: Refactoring the Interwiki Map: status and outlook for the developer summit in January. If you are interested in such a session, please comment on the ticket.

kchapman subscribed.

Does not currently have anyone stepping up to implement, moving to the backlog for now.

Actually, work on SiteLookup and Interwiki is now in the Wikidata backlog. I discussed this with @Ladsgroup and @Lydia_Pintscher last week. The RFC still needs an update, so backlog is appropriate.

Addshore subscribed.

Not sure if this belongs on the campsite board yet, so removing for now.
This is still on the Wikidata radar of course, but this ticket needs more before it can actually be worked on (task breakdown & what not)

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)

daniel lowered the priority of this task from High to Medium. Dec 9 2020, 12:01 PM

Any chance this is going to be taken up again?

For Parsoid purposes, the interwiki mapping needs to be bidirectional -- that is, we need to be able not just to go from enwiki:$1 to //en.wikipedia.org/wiki/$1, but also, given a URL https://en.wikipedia.org/wiki/Foo, to look up that this matches the enwiki prefix.

The current InterwikiLoadPrefix hook system doesn't allow us to do this, for reasons detailed in T270444: Parsoid needs a bidirectional interwiki map (and hooks).

It sure would be nice if the interwiki system was replaced by something which made what Parsoid wants to do easier.
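For illustration, a bidirectional map over static data could be sketched as follows in Python. It assumes every URL pattern contains "$1", treats http:, https: and protocol-relative URLs as equivalent, and returns the first matching prefix. The class and method names are made up, and the sketch does not address the hook problem described in T270444.

```python
import re


def _strip_scheme(url):
    # Treat http:, https: and protocol-relative URLs alike.
    return re.sub(r"^https?:", "", url)


class BidirectionalInterwikiMap:
    def __init__(self, prefix_to_pattern):
        self._forward = dict(prefix_to_pattern)
        self._reverse = []
        for prefix, pattern in self._forward.items():
            # Turn "//host/wiki/$1" into a regex that captures the title part.
            before, _, after = _strip_scheme(pattern).partition("$1")
            regex = re.compile(re.escape(before) + "(.+)" + re.escape(after))
            self._reverse.append((prefix, regex))

    def url_for(self, prefix, title):
        return self._forward[prefix].replace("$1", title)

    def prefix_for(self, url):
        """Return (prefix, title) for a URL, or None if no pattern matches."""
        target = _strip_scheme(url)
        for prefix, regex in self._reverse:
            match = regex.fullmatch(target)
            if match:
                return prefix, match.group(1)
        return None
```

Several prefixes pointing at the same URL pattern would need a tie-breaking rule, which this sketch leaves out.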