Analyze template substitution mechanism of CommonsHelper and ForTheCommonGood tools
Open, Normal, Public

Description

Which parts of the template switching mechanisms could be reused or at least be an inspiration for our tool?

Please look at the two tools and come up with suggestions for what you would like to see in the first version of the FileImporter.

ForTheCommonGood: https://en.wikipedia.org/wiki/User:This,_that_and_the_other/For_the_Common_Good
CommonsHelper: https://bitbucket.org/magnusmanske/commonshelper

Tobi_WMDE_SW moved this task from Proposed to Todo on the WMDE-QWERTY-Team-Board board.
Tobi_WMDE_SW updated the task description. Jul 25 2017, 3:49 PM
Lea_WMDE updated the task description. Jul 31 2017, 10:02 AM
TTO added a comment. Edited Jul 31 2017, 12:21 PM

I had an hour-long train ride with nothing to do, so I wrote up a guide to the approach used by For the Common Good.

From what I have been able to tell, FtCG's logic has been pretty well received by users. In my experience of using the tool myself, little manual cleanup of the wikitext is required, although it must be emphasised that the tool is not fully automatic.


In order to adjust to the idiosyncrasies of each project, FtCG uses a "local wiki data" file, similar to the data pages used by CommonsHelper2 and stored on Meta. This file stores:

  • A regular expression fragment that matches the local name of the "Information" template.
  • Regular expression fragments that match each of the standard parameters to the "Information" template (description, date, source, author, permission, other versions).
  • Regular expression fragments that match the local equivalents of the "Summary" and "Licensing" headers.
  • A regular expression that matches {{copy to Commons}}-type templates, so they can be removed.
  • The name of the local equivalent of the {{now Commons}} template.
  • Edit summaries and deletion summaries for the local wiki. (These might work as regular MediaWiki i18n messages.)
  • Zero or more "potential problems", which consist of a regular expression and a warning message. If the file description page matches the regex, the entire FtCG user interface turns yellow, and the associated warning message is shown to the user. For example, {{non-free is a potential problem on enwiki, with warning message "The file appears to be non-free. Commons cannot accept non-free files." There are ten other potential problems for enwiki.
  • Zero or more "replacements", each of which consists of a "look for" regular expression and a "replace with" string. These are straightforward find-and-replace operations, with the "magic" string %%OriginalUploader%% in the "replace with" string being replaced with the username of the uploader of the earliest revision of the file (this is very useful when manipulating own-work license templates). For example, for enwiki, = *I .*created this (image|work) entirely by myself.? is replaced by = {{own work by original uploader}}.
  • Zero or more "self-license replacements". These are identical to replacements, except that when a self-license replacement occurs, FtCG treats the file as being authored by the original uploader (see below). For example, {{PD-self([^\}]*)}} is a self-license replacement, replaced with {{PD-user|%%OriginalUploader%%|en}}.
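
To make the last three kinds of entries concrete, here is a minimal Python sketch of how potential problems, replacements and self-license replacements could be applied. It is not FtCG's actual code: the LocalWikiData class, the function names, the case-insensitive matching and the choice to treat the "replace with" string as literal text (no group references) are all assumptions made for illustration. The regexes are the enwiki examples quoted above, with braces escaped for Python.

```python
import re
from dataclasses import dataclass, field


@dataclass
class LocalWikiData:
    """Hypothetical container for a subset of the per-wiki settings."""
    potential_problems: list = field(default_factory=list)         # (regex, warning message)
    replacements: list = field(default_factory=list)               # (look for, replace with)
    self_license_replacements: list = field(default_factory=list)  # (look for, replace with)


def find_warnings(wikitext, data):
    """Return the warning message of every potential problem whose regex matches."""
    return [message for pattern, message in data.potential_problems
            if re.search(pattern, wikitext, re.IGNORECASE)]


def _substitute(wikitext, rules, original_uploader):
    """Apply each rule in order; return the new text and the number of substitutions."""
    total = 0
    for pattern, template in rules:
        replacement = template.replace("%%OriginalUploader%%", original_uploader)
        # Assumption: the replacement is inserted literally, so a function is
        # used to stop re from interpreting backslashes in it.
        wikitext, count = re.subn(pattern, lambda m: replacement, wikitext,
                                  flags=re.IGNORECASE)
        total += count
    return wikitext, total


def apply_replacements(wikitext, data, original_uploader):
    """Run replacements, then self-license replacements.

    The second return value is True if any self-license replacement fired,
    i.e. the file should be treated as authored by the original uploader.
    """
    wikitext, _ = _substitute(wikitext, data.replacements, original_uploader)
    wikitext, hits = _substitute(wikitext, data.self_license_replacements,
                                 original_uploader)
    return wikitext, hits > 0


# Example data, using the enwiki examples from the list above.
enwiki = LocalWikiData(
    potential_problems=[(r"\{\{non-free",
                         "The file appears to be non-free. "
                         "Commons cannot accept non-free files.")],
    replacements=[(r"= *I .*created this (image|work) entirely by myself.?",
                   "= {{own work by original uploader}}")],
    self_license_replacements=[(r"\{\{PD-self([^\}]*)\}\}",
                                "{{PD-user|%%OriginalUploader%%|en}}")],
)
```

Keeping the self-license rules in a separate list means the "treat the uploader as author" flag falls out of the substitution count, with no extra bookkeeping per rule.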

You can see some real FtCG local wiki data files here.

I would recommend that FileImporter uses a similar system of local configuration parameters, stored somewhere where they are easily editable by trusted users (i.e. not in a Git repository, and not requiring SWAT deploys). While it might be tempting to (ab)use the localisation system for much of this, that would have the disadvantage that non-admins and users from other wikis would not be able to make changes to a wiki's configuration.


The process carried out by FtCG to transform the local wikitext into something Commons will accept is as follows. Places where local wiki data is used are marked LWD:

  1. Identify potential problems and warn the user (see above).
  2. Replace the copy to Commons regex (LWD) with "".
  3. Transform summary and licensing headers using the regular expression fragments (LWD).
  4. Add interwiki prefix to wikilinks.
  5. Pipe links if they are not already piped (this prevents the interwiki prefix being displayed).
  6. Comment out categories. (Why not delete them? So the user can manually decide what to do with them. Most local files have no directly-applied categories in any case.)
  7. Perform replacements (LWD).
  8. Perform self-license replacements (LWD).
  9. If there is no Information template (detected using regex fragment in LWD), generate one. This is one of the most popular features of FtCG: see note 1 for how it works.
  10. Otherwise, transform the Information template's parameter names (LWD) to Commons' English names, and wrap the description in an {{en|...}} or similar template.
  11. If there is no licensing header, add one before the first template that follows the Information template.
  12. Append an "original upload log", a table containing timestamp, dimensions, linked username, and comment for each revision of the local file (being sure to check whether this info is RevDel'd).
  13. Replace sequences of three or more newlines with two newlines.

That's it!
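
As an illustration of the purely mechanical steps (4, 5, 6 and 13), here is a rough Python sketch. It is an approximation written for this discussion, not a port of FtCG: the function names and regexes are my own, and skipping category and file links when adding the prefix is an assumption, since the steps above do not say how those links are treated.

```python
import re

# Simple wikilinks: [[Target]] or [[Target|label]], without nested brackets.
LINK = re.compile(r"\[\[([^\[\]|]+)(\|[^\[\]]*)?\]\]")
CATEGORY = re.compile(r"\[\[\s*Category\s*:[^\[\]]*\]\]", re.IGNORECASE)
# Assumption: links into these namespaces are left alone.
SKIPPED_NAMESPACES = ("category:", "file:", "image:")


def prefix_and_pipe_links(wikitext, prefix):
    """Steps 4 and 5: add an interwiki prefix and pipe links that are not piped."""
    def fix(match):
        target, label = match.group(1).strip(), match.group(2)
        if target.lower().startswith(SKIPPED_NAMESPACES):
            return match.group(0)
        if label is None:
            # Piping keeps the prefix out of the displayed link text.
            label = "|" + target
        return "[[" + prefix + target + label + "]]"
    return LINK.sub(fix, wikitext)


def comment_out_categories(wikitext):
    """Step 6: wrap each category link in an HTML comment instead of deleting it."""
    return CATEGORY.sub(lambda m: "<!-- " + m.group(0) + " -->", wikitext)


def collapse_newlines(wikitext):
    """Step 13: replace runs of three or more newlines with two."""
    return re.sub(r"\n{3,}", "\n\n", wikitext)
```

For example, with an assumed prefix of ":en:", prefix_and_pipe_links("See [[Paris]].", ":en:") gives "See [[:en:Paris|Paris]].", while a link that already has a label only gains the prefix.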


Note 1: How we generate an {{Information}} template when none is present:

  1. Start with == {{int:filedesc}} ==\n{{Information\n
  2. The Description parameter is set to any "loose" lines of text picked up on the file description page (i.e. any text not in headings or templates), joined with double newlines and wrapped in {{en|...}} or equivalent.
  3. The Date parameter is set to {{according to EXIF data|<exif date of earliest file version>}}, or if EXIF date is absent or set to year 0000, {{original upload date|<date when earliest file version was uploaded>}}.
  4. The Source parameter is set to {{own work by original uploader}}. This is a controversial part of FtCG, as uploaders are expected to manually delete this if the file is not own work. In my experience, though, a large proportion of files being transferred are own work, and you don't want to be manually typing {{own work by original uploader}} all the time.
  5. The Author parameter is set to the original uploader if a self-license replacement occurred, or otherwise left blank.
  6. The Permission and Other_versions parameters are left blank.
  7. Close the template: }} and prepend it to the description page, removing the "loose" text that was moved to the description parameter.
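
The recipe in this note could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: build_information is a hypothetical helper, the "|Name=value" line layout is my own choice, dates are passed in as plain strings, the author is written as a bare username, and collecting the "loose" lines from the page is left to the caller.

```python
def build_information(loose_lines, lang, exif_date, upload_date,
                      original_uploader, self_licensed):
    """Build a header plus {{Information}} block from the pieces described in note 1.

    loose_lines: text lines found outside headings and templates (may be empty).
    exif_date:   EXIF date of the earliest file version as a string, or None.
    upload_date: upload date of the earliest file version, as a string.
    """
    description = ""
    if loose_lines:
        # Join the loose text with blank lines and wrap it in a language template.
        description = "{{" + lang + "|" + "\n\n".join(loose_lines) + "}}"

    # Fall back to the upload date if EXIF is absent or set to year 0000.
    if exif_date and not exif_date.startswith("0000"):
        date = "{{according to EXIF data|" + exif_date + "}}"
    else:
        date = "{{original upload date|" + upload_date + "}}"

    # The author is only filled in when a self-license replacement occurred.
    author = original_uploader if self_licensed else ""

    return (
        "== {{int:filedesc}} ==\n"
        "{{Information\n"
        "|Description=" + description + "\n"
        "|Date=" + date + "\n"
        "|Source={{own work by original uploader}}\n"
        "|Author=" + author + "\n"
        "|Permission=\n"
        "|Other_versions=\n"
        "}}\n"
    )
```

The caller would then prepend the returned block to the description page and remove the loose text that was moved into the Description parameter.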

Like I said, while this recipe can certainly be improved, it is a proven recipe, and I offer it to you as a starting point for your work.