RFC: Overhaul Interwiki map, unify with Sites and WikiMap
Open, High, Public

Description

Proposal updated 2016-05-11, see below for the original RFC

Status

Next Steps

  • split CDB from SQL implementation
  • implement array-based InterwikiLookup (loads from multiple JSON or PHP files)
    • indexes should be generated on the fly, if not present in the loaded data
    • proposed structure: P3044
    • that InterwikiLookup implementation should also implement SiteLookup. Alternatively, only implement SiteLookup, and provide an adapter (SiteLookupInterwikiLookup) that implements InterwikiLookup on top of a SiteLookup.
  • implement maintenance script that can convert between different interwiki representations.
    • use InterwikiLookup for (multiple) input sources (db/files), InterwikiStore for output
    • we want an InterwikiStore that can write the new array structure (as JSON or PHP)
    • we want an InterwikiStore that can write the old CDB structure (as CDB or PHP)
  • Provide a config variable for specifying which files to read interwiki info from. If not set, use old settings and old interwiki storage.
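A minimal sketch of the array-based lookup described above, in Python for brevity rather than PHP. The data layout (a "sites" map keyed by global id, a "prefixes" list per site, and an optional "by_prefix" index) is an assumption made for illustration; it is not the structure proposed in P3044. The point is only that the index is used if the loaded data ships one and generated on the fly otherwise.

```python
import json
from pathlib import Path


def build_prefix_index(sites):
    """Map each interwiki prefix to the global id of the site that owns it."""
    index = {}
    for global_id, site in sites.items():
        for prefix in site.get("prefixes", []):
            index[prefix] = global_id
    return index


class ArrayInterwikiLookup:
    """Hypothetical array-based lookup; field names are illustrative only."""

    def __init__(self, data):
        self.sites = data.get("sites", {})
        # Use a precomputed index if present, otherwise generate it on the fly.
        self.by_prefix = data.get("by_prefix") or build_prefix_index(self.sites)

    @classmethod
    def from_files(cls, paths):
        """Load several JSON files; entries in later files replace earlier ones."""
        sites = {}
        for path in paths:
            loaded = json.loads(Path(path).read_text(encoding="utf-8"))
            sites.update(loaded.get("sites", {}))
        # Indexes from individual files are ignored here and rebuilt after merging.
        return cls({"sites": sites})

    def fetch(self, prefix):
        """Return the site record for a prefix, or None if the prefix is unknown."""
        global_id = self.by_prefix.get(prefix)
        return None if global_id is None else self.sites.get(global_id)
```

A generated PHP file could carry the index alongside the data so that nothing has to be computed at load time, while hand-maintained JSON files could omit it.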

Questions

  • is this a good plan? (see below for rationale)
  • how does interwiki/site info relate to local wiki config (wgConf/SiteMatrix/WikiMap)?
  • should all information always be loaded? (see also T114772: [Task] Implement SiteIdMapper service)
  • do we need caching?
  • do we need to support new features also for the SQL based InterwikiLookup?
    • needs: interwiki_ids table, interwiki_groups table, and blob field with JSON or an interwiki_props table.
  • Should SiteMatrix continue to work based on wgConf, or should it be ported to use Sites? Or combine both? Currently it has problems with Wikimedia-specific configurations, e.g. for special language codes.

Later

  • decide on how wikis on the WMF cluster should load their interwiki config
    • proposal: three files: family (shared by e.g. all wikipedias), language (shared by e.g. all english wikis), and local.
  • create a script that generates the family, language, and local files for all the wikis (as JSON or PHP) based on config. Should work like dumpInterwiki.
    • check this: generating CDB based on the relevant family/language/local file for a given wiki should return the same CDB as dumpInterwiki for that site.
  • create a deployment process that generates PHP files from the checked-in JSON files, for faster loading.
  • action=siteinfo&siprop=interwikimap could be ported to Sites and expose more information. Distinction from SiteMatrix is becoming somewhat unclear then.

Original RFC

We currently have three systems in core that provide information about other sites: Interwiki, WikiMap, and SiteStore. The information they provide is frequently inconsistent (with each other as well as between wikis), and none of them provides a good interface for maintaining the information. This RFC proposes a path to fix this.

Historically, Interwiki was used for linking to other wikis from wikitext, while WikiMap helps with linking to other wikis programmatically. Sites/SiteStore/SiteLookup was introduced to allow access to other wikis' APIs, and was intended to replace the old interwiki system.

This proposal builds on the idea that information about other sites is configuration, not content. There is no need to have it in the database at all, or in any way mutable by the application. This proposal assumes that reading (and caching) local files is faster than loading from a database server (or memcached).

Objectives

  • allow us to use the more flexible Sites system instead of the crusty Interwiki system
  • allow us to use Sites and Interwiki side by side, based on the same data
  • allow interwiki mappings / site definitions to be maintained in files, not in the database. This is easier to maintain via git and puppet (or vim).
  • Preserve the legacy interface for interwiki links (static methods in Interwiki)
  • make Sites at least as performant as the current multi-cache hodge-podge implementation of Interwiki.
  • Make WikiMap consistent with Interwiki and SiteLookup

Requirements

Outline

Refactor:

  • Create a new interface InterwikiLookup with all the public methods from Interwiki (see I7d7424345)
  • Create ClassicInterwikiLookup (better name needed), which implements InterwikiLookup using the code currently in Interwiki. "Classic" because it's basically the old code, and implements the old storage backends (sql, cdb, ...). (see I7d7424345)
  • Make the public static methods in Interwiki delegate to a singleton instance of InterwikiLookup, remove everything else from the class. (see I7d7424345)
  • Add missing Interwiki concepts to the Site class, e.g. the "local" flag ("local" could be implemented as a group)
    • Allow sites to be a member of multiple groups (e.g. "wikipedia" and "english" for enwiki).
  • re-implement DBSiteStore without dependency on ORMTable. (done in I7e7ca257)
  • Reduce the complexity of Sites & co: remove SiteObject and SiteSQLStore; Consider dropping SiteList in favor of a more powerful SiteLookup interface.
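The first three refactoring steps can be sketched as follows, in Python rather than PHP. The method names are snake_case stand-ins chosen for this example, and the in-memory ClassicInterwikiLookup is a placeholder for the real storage backends; none of this is taken from I7d7424345.

```python
from typing import Optional, Protocol


class InterwikiLookup(Protocol):
    """The service interface extracted from the old static class."""

    def fetch(self, prefix: str) -> Optional[dict]: ...

    def is_valid_interwiki(self, prefix: str) -> bool: ...


class ClassicInterwikiLookup:
    """Placeholder for the old backends (sql, cdb, ...); here just a dict."""

    def __init__(self, rows: dict):
        self._rows = rows

    def fetch(self, prefix):
        return self._rows.get(prefix)

    def is_valid_interwiki(self, prefix):
        return prefix in self._rows


class Interwiki:
    """Legacy facade: static methods only, all work is delegated."""

    _lookup: Optional[InterwikiLookup] = None

    @staticmethod
    def set_lookup(lookup):
        # Swapping the instance is how a different backend would be configured.
        Interwiki._lookup = lookup

    @staticmethod
    def fetch(prefix):
        return Interwiki._lookup.fetch(prefix)

    @staticmethod
    def is_valid_interwiki(prefix):
        return Interwiki._lookup.is_valid_interwiki(prefix)
```

Existing callers keep using the static methods unchanged, while new code can take an InterwikiLookup as a dependency and be tested against a fake.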

Migrate to Sites:

  • Create an adapter, SiteLookupInterwikiLookup, implementing InterwikiLookup based on a SiteLookup.
  • Migrate usages of SiteStore to SiteLookup.
  • Provide a script for importing information from an InterwikiLookup into the sites table. Can be used for migrating from interwiki in the database or CDB (as generated by dumpInterwiki.php) to sites in the database.
  • Switch the singleton used by the static methods in Interwiki to use SiteLookupInterwikiLookup instead of ClassicInterwikiLookup (should be configurable)
  • Make WikiMap look up wiki info in a SiteLookup before (after?) checking in $wgConf (optional?). (done in I8186140ae)
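A rough sketch of the adapter idea, in Python for illustration. DictSiteLookup and the site field names ("page_url", "api_url", "groups", "interwiki_ids") are hypothetical; only the iw_url, iw_api, and iw_local keys mirror columns of the existing interwiki table. Treating "local" as a group follows the suggestion made in the Refactor list.

```python
class DictSiteLookup:
    """Hypothetical SiteLookup backed by a dict of global id -> site record."""

    def __init__(self, sites):
        self._sites = sites

    def get_sites(self):
        return self._sites


class SiteLookupInterwikiLookup:
    """Answers interwiki-prefix queries from site records."""

    def __init__(self, site_lookup):
        self._rows = {}
        for global_id, site in site_lookup.get_sites().items():
            row = {
                "global_id": global_id,
                "iw_url": site["page_url"],
                "iw_api": site.get("api_url", ""),
                # Assumption: the "local" flag is modelled as group membership.
                "iw_local": "local" in site.get("groups", []),
            }
            # Every interwiki id of the site resolves to the same row.
            for prefix in site.get("interwiki_ids", []):
                self._rows[prefix] = row

    def fetch(self, prefix):
        return self._rows.get(prefix)

    def is_valid_interwiki(self, prefix):
        return prefix in self._rows
```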

File-based backend:

  • FileSiteLookup implements a SiteLookup that will simply load site definitions from a list of local files
    • support at least JSON (easy to maintain) and PHP (code, not serialized data; fast with accelerator cache). Go by file extension.
  • Make an export script that can generate JSON site definition files:
    • Export from a SiteLookup or InterwikiLookup (needs an adapter that implements a SiteLookup based on an InterwikiLookup)
    • Export all, or a list of groups
    • Export only the ones that differ from the ones defined in a list of given files. This can be used to generate files that contain only the local overrides / additions to a common list of site definitions.
  • Provide a script for writing information from an InterwikiLookup to a JSON file. This can be used to port the output of dumpInterwiki.php to JSON.
  • Make a maintenance script that generates a PHP file with site definitions from a list of JSON (and PHP) files.
    • the generated PHP file would contain indexes for all IDs and groups as well as the main data structure.
  • Switch the default implementation of SiteLookup from DBSiteStore to FileSiteLookup.
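The "export only the ones that differ" step could look roughly like this, sketched in Python. It assumes site definitions are plain dicts keyed by global id and compares whole entries; a real export script might instead emit only the changed fields of an entry. The function names are made up for this example.

```python
import json


def diff_sites(current, base):
    """Return the entries of `current` that are missing from, or differ in, `base`."""
    return {
        global_id: site
        for global_id, site in current.items()
        if base.get(global_id) != site
    }


def to_json(sites):
    """Serialize with stable key order so the file diffs cleanly in git."""
    return json.dumps(sites, indent=4, sort_keys=True)
```

Given the full set of definitions for one wiki and the shared defaults, diff_sites yields the local overrides and additions, which can then be written to that wiki's own file.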

Performance:

  • Add methods for fetching Site objects for a group to SiteLookup
  • Make CachingSiteStore (or CachingSiteLookup) cache individual groups. Use "siblings" group for sister projects of the same language.
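One way to picture per-group caching, as a Python sketch. The get_sites_in_group method and the InMemorySiteLookup class are assumptions made for this example, and a plain dict stands in for whatever cache backend would actually be used.

```python
class InMemorySiteLookup:
    """Hypothetical backend; counts calls so the caching effect is visible."""

    def __init__(self, sites):
        self._sites = sites
        self.calls = 0

    def get_sites_in_group(self, group):
        self.calls += 1
        return {
            global_id: site
            for global_id, site in self._sites.items()
            if group in site.get("groups", [])
        }


class CachingSiteLookup:
    """Caches each group separately instead of the full site list."""

    def __init__(self, inner):
        self._inner = inner
        self._groups = {}

    def get_sites_in_group(self, group):
        if group not in self._groups:
            self._groups[group] = self._inner.get_sites_in_group(group)
        return self._groups[group]
```

A request that only needs the sister projects of one language would then load a single small group rather than every known site.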

Configuration

  • InterwikiLookup implementation to use. Default: SiteLookupInterwikiLookup
    • For ClassicInterwikiLookup, use the old interwiki settings controlling CDB usage etc.
  • SiteLookup implementation to use. Default: FileSiteLookup
    • per default, read DefaultInterwiki.json (maintained in git) and LocalInterwiki.json (shipped empty).
    • on the WMF cluster, each wiki uses three JSON files: a common file, one per family, and one for local overrides per wiki.
    • on the WMF cluster, use PHP files for speed. The PHP file could be generated per-wiki, combining the common, family, and local JSON files. This essentially replaces the functionality of dumpInterwiki.php.
  • Caching:
    • which cache (possibly none for PHP files)
    • duration
    • groups to cache separately (all?)
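Combining the common, family, and local files can be sketched as a layered deep merge, shown here in Python. The record fields are invented for the example, and the rule that lists are replaced rather than concatenated is a design choice of this sketch, not something the proposal specifies.

```python
from functools import reduce


def deep_merge(base, override):
    """Return a new dict in which values from `override` win; nested dicts merge."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale by the later layer.
            merged[key] = value
    return merged


def merge_layers(layers):
    """Merge site definitions in order, e.g. [common, family, local]."""
    return reduce(deep_merge, layers, {})
```

A per-wiki generated file would then simply be the serialized result of merge_layers for that wiki's three inputs.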

Open Questions

  • should Site objects always be fully loaded/instantiated? Or would it be better to be able to ask for individual "aspects" of a site, e.g. paths, dbname, ids, etc?
  • should Site objects relay information from wikifarm configuration (wgConf)? Or should Sites be kept entirely separate from configuration? WikiMap already combines information from these two sources. But the old interwiki map is completely separate from wgConf.
  • Should SiteMatrix continue to work based on wgConf, or should it be ported to use Sites? Or combine both? Currently it has problems with Wikimedia-specific configurations, e.g. for special language codes.
  • should the JSON structure for describing sites have a narrow specification, or be flexible towards additions?
  • action=siteinfo&siprop=interwikimap could be ported to Sites and expose more information. Distinction from SiteMatrix is becoming somewhat unclear then.

Can you expand on why you think it's impractical to keep it in the database? We'd have some frontend for users to edit the data (aka Special:Interwiki).

Because the information we want to maintain is getting increasingly complex. To represent it in a database, we would need several tables, or serialized blobs. This makes the information rather tricky to maintain by hand. A config file with a flexible syntax like JSON makes this much easier to maintain. And optionally loading such structures from a generated PHP file would be a lot faster than reading from the database, if the file is in the PHP code cache.

Would normal wiki use cases require all this? Or would we want something smaller that can then be extended by other extensions?

I think especially for "normal" 3rd party wikis, it would be a lot easier to maintain a JSON file than to manually fiddle rows into the database. Especially if the database structure becomes more complex to accommodate things like the API path.

Manually fiddling anything, even json, is pretty unappealing for these. We want interfaces. Can you put an interface on it? If so I don't really care what you do.

On-wiki tools. On-wiki tools. On-wiki tools.

@Isarra A nice UI for managing interwiki mappings would be good to have, but is beyond the scope of this proposal. For that, it should not matter how the info is stored.

I can understand that people want on-wiki interfaces for managing configuration, but I find it rather scary. It's *much* easier to hack a wiki account than to get access to the web server.

But we already have an on-wiki interface for managing configuration. We like it. We don't want to lose it. Please don't take this away.

As to cracking wiki user accounts, if the interface is part of an
extension that no one is obliged to install, as it is now, everything
can be held very secure.

Purodha

@Isarra The idea is not to take it away, but whether it will continue working depends a lot on how it is implemented. Also, that interface is, as far as I know, only for interwiki prefixes. With the growing need for federation between wikis, we need more and more information about foreign wikis (namespaces, API path, etc), which would make the on-wiki interface a lot more complex.

Anyway, it should not be hard to port the basic functionality for configuring interwiki prefixes to the new backend.

jayvdb added a subscriber: jayvdb.Nov 26 2015, 9:52 PM
hoo added a subscriber: hoo.Dec 15 2015, 1:30 AM

Wikimedia Developer Summit 2016 ended two weeks ago. This task is still open. If the session in this task took place, please make sure 1) that the session Etherpad notes are linked from this task, 2) that followup tasks for any actions identified have been created and linked from this task, 3) to change the status of this task to "resolved". If this session did not take place, change the task status to "declined". If this task itself has become a well-defined action which is not finished yet, drag and drop this task into the "Work continues after Summit" column on the project workboard. Thank you for your help!

Krinkle changed the title from "Overhaul Interwiki map, unify with Sites and WikiMap." to "RFC: Overhaul Interwiki map, unify with Sites and WikiMap".Feb 3 2016, 9:29 PM
daniel added a subscriber: Bene.Mar 8 2016, 8:33 PM

We currently have three systems in core that provide information about other sites: Interwiki, WikiMap, and SiteStore.

As a beginner who wants to understand this better, I went looking for examples of each. I found these links. Please add whatever is accurate/useful, to the task description! Thanks.

(I'm not sure about these 2...)


The information they provide is frequently inconsistent (between each other as well as between wikis)

Examples of this, might also help?
(I tried to find some, by glance-comparisons of 3 versions of Special:SiteMatrix with each other, and 3 versions of Special:Interwiki with each other, and all seemed identical. (using meta/mediawiki/outreach))

Qgil removed a subscriber: Qgil.Mar 17 2016, 8:25 PM

Change 285018 had a related patch set uploaded (by Daniel Kinzler):
Introduce InterwikiTest

https://gerrit.wikimedia.org/r/285018

Change 285018 merged by jenkins-bot:
Introduce InterwikiTest

https://gerrit.wikimedia.org/r/285018

@daniel nominated this RFC for discussion at next week's ArchCom-RFC meeting (E171), which seemed like a good idea to everyone at the time. The last RFC meeting we had on this subject was October, and there's now patch 285018 awaiting review in Gerrit.

https://gerrit.wikimedia.org/r/#/c/250150/ is probably the first step in implementing this, and I was planning to merge it by the end of this week.

Does this mean that the interwiki map at Meta will be disbanded?

Nemo_bis edited the task description. (Show Details)May 9 2016, 7:02 AM
Nemo_bis edited the task description. (Show Details)May 9 2016, 7:05 AM
daniel added a comment.May 9 2016, 9:06 PM

Does this mean that the interwiki map at Meta will be disbanded?

No. The public interfaces will stay around. Perhaps they will be extended or slightly modified, but I at least have no plan to get rid of them.

One major question we have to answer at some point though is how meta-information about other wikis on the same cluster (their domain, API path, interwiki prefix, etc) relates to the configuration of those wikis. As far as I know, the SiteMatrix is currently built from looking at the configuration of all the wikis on the Wikimedia cluster, not from information in the interwiki system. I'm not clear yet on if and how the two should be integrated with each other.

But however that may play out, a site matrix will be available on meta.

daniel added a comment.May 9 2016, 9:14 PM

As a beginner who wants to understand this better, I went looking for examples of each. I found these links. Please add whatever is accurate/useful, to the task description! Thanks.

It's more about the code than Special pages. The first thing to be refactored would be https://doc.wikimedia.org/mediawiki-core/master/php/classInterwiki.html#details

SiteMatrix also provides an API that is relevant: https://meta.wikimedia.org/w/api.php?action=help&modules=sitematrix

(I'm not sure about these 2...)

Yes, that thing, but more relevantly, its DB-based implementation: https://doc.wikimedia.org/mediawiki-core/master/php/DBSiteStore_8php.html

No, this here: https://doc.wikimedia.org/mediawiki-core/master/php/classWikiMap.html

daniel edited the task description. (Show Details)May 11 2016, 5:32 PM
RobLa-WMF lowered the priority of this task from "Normal" to "Low".
RobLa-WMF raised the priority of this task from "Low" to "Normal".May 11 2016, 11:33 PM

We discussed this in E171 today. Full notes are E171#2016

The summary:

  • question discussed: which backends should InterwikiLookup support? (robla, 21:10:54)
  • i imagine every wiki would read three files actually (and perform a deep merge): one with info shared across the family, one with info shared across the language, and one with local overrides for the specific wiki (DanielK_WMDE, 21:22:54)
  • aude: also can interwiki ids be renamed? daniel: you can add prefixes. (DanielK_WMDE, 21:23:42)
  • an entry can have multiple global ids. they act as aliases. only one of them would be used as a key in the file, making it the *canonical* global id. (DanielK_WMDE, 21:24:05)
  • <aude> another thing we should have is configuration for sorting order of interwiki ids (maintained in a sane place) (DanielK_WMDE, 21:33:00)
  • LINK: https://meta.wikimedia.org/wiki/MediaWiki:Interwiki_config-sorting_order-native-languagename (aude, 21:33:23)
  • LINK: https://meta.wikimedia.org/wiki/MediaWiki:Interwiki_config-sorting_order-native-languagename-firstword (aude, 21:33:26)
  • <TimStarling> anyway, yes, the JSON format you propose looks very extensible and will presumably meet our needs (DanielK_WMDE, 21:33:39)
  • LINK: https://meta.wikimedia.org/wiki/MediaWiki:Interwiki_config-sorting_order-native-languagename-firstword (DanielK_WMDE, 21:34:21)
  • <TimStarling> I don't want to have m:Interwiki_map anymore (DanielK_WMDE, 21:36:27)
  • Tim is not convinced that interwiki info should be maintained by hand as json. Perhaps we still want dumpInterwiki (or equivalent) (DanielK_WMDE, 21:50:15)
  • Tim thinks we need to figure out what information can be taken from wgConf, and what should come from elsewhere, and how to maintain it. But it's not a blocker for now, we can figure it out later (DanielK_WMDE, 21:53:55)
  • next week's meeting: E184 RFC: Requirements for change propagation (T102476) (robla, 21:57:21)
  • Tim thinks it's ok to go ahead with implementing the proposed next steps, as they are non-threatening. But should we have a formal last call? (DanielK_WMDE, 22:02:40)

We agreed that there's no reason to go to last call, because we weren't making a final decision.

@daniel, my understanding from our previous conversations (and discussions about a new column on the ArchCom-RfC board) is that this is "on track". As of right now, you're not waiting on ArchCom for approval before continuing development (or understanding what next steps should be) and that everyone (including ArchCom) is happy that implementation is underway. Is that a fair characterization?

(I'm asking because I was recently asked about status on this RFC)

daniel moved this task from proposed to tracking on the WMDE-TLA-Team board.
daniel added a comment.Nov 1 2016, 5:07 PM

I have proposed T149535: Refactoring the Interwiki Map: status and outlook for the developer summit in January. If you are interested in such a session, please comment on the ticket.

cscott added a subscriber: cscott.Nov 4 2016, 8:33 PM
dcausse added a subscriber: dcausse.
dcausse edited the task description. (Show Details)Nov 15 2016, 4:25 PM
dcausse edited the task description. (Show Details)Nov 18 2016, 2:11 PM
daniel moved this task from Inbox to Project on the User-Daniel board.Jan 5 2017, 7:03 PM
Koavf added a subscriber: Koavf.Mar 14 2017, 10:50 PM