[WLM] Redirect script from Wikipedia to Commons
Closed, Resolved | Public | 1 Estimated Story Points

Description

The Commons Upload Wizard needs some parameters that come from the Wikipedia page but can't be determined inside the template that generates the link to the Upload Wizard. This can be solved with a redirector script: the Wikipedia link points to the script, which generates the parameters and redirects to the Upload Wizard.

The script must be installed on Tools Lab.

The script should do the following steps:

  1. Read the GET parameters "lat", "lon", "campaign", "pagename" and "id".
  2. Validate that the campaign, pagename and id parameters are not empty; otherwise display an error page.
  3. Validate with the Commons API that the campaign exists (cache positive Commons API results for at least 1 day).
  4. Use the Wikipedia API to download the Wikipedia page source with the "pagename" parameter (maybe cache the page text for 10 minutes).
  5. Use the JSON template configuration to determine supported template names and the name of the ID template parameter ("ID" or "Nummer").
  6. Iterate through the templates on the page, looking for valid configured template names until you find the one whose ID parameter matches the GET "id" parameter. If none is found, display an error page.
  7. Check whether the ID matches the pattern for valid IDs and whether the ID is unique on the page (not used in other valid templates). If the ID is not unique, display an error page. If the ID is not valid, store the invalid ID.
  8. Determine the most precise category: first try the "Commonscat" parameter of the matched template, then look for the "Commonscat" template in the "Weblinks" section of the page source, and finally fall back to querying the API for the categories of the page, checking them recursively against the sub-categories of "Liste (Kulturdenkmale in Deutschland)" (cache the category tree).
  9. Build the redirect URL. Parameters are explained below. Parameters must be URL-encoded correctly.
  10. Log the redirect URL to a file (can be used for statistics later).
  11. Redirect to the URL (I suggest HTTP status "301 Moved Permanently").
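
As a rough illustration of steps 1 and 2, here is a minimal Python sketch of the parameter check. The names `read_params` and `MissingParameterError` are made up for this example and are not part of any existing code. The sketch assumes the GET values arrive as a plain dict, and that "lat" and "lon" may be empty because step 2 only names the other three.

```python
REQUIRED = ("campaign", "pagename", "id")
OPTIONAL = ("lat", "lon")


class MissingParameterError(ValueError):
    """Raised when a required GET parameter is absent or empty."""


def read_params(query):
    """Pick the five known parameters out of a dict of GET values.

    Whitespace-only values count as empty. Raises MissingParameterError
    if any required parameter is empty, so the caller can show the
    error page.
    """
    params = {name: (query.get(name) or "").strip()
              for name in REQUIRED + OPTIONAL}
    missing = [name for name in REQUIRED if not params[name]]
    if missing:
        raise MissingParameterError("missing or empty: " + ", ".join(missing))
    return params
```

The caller would catch `MissingParameterError` and render the error page instead of continuing with step 3.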

URL for valid IDs (caps are placeholders):
https://commons.wikimedia.org/wiki/Special:UploadWizard?campaign=CAMPAIGN&categories=hiddencat,CATEGORY&fields[]=VALID_ID&lat=LAT&lon=LON&objref=de|PAGENAME|VALID_ID

URL for invalid IDs (caps are placeholders):
https://commons.wikimedia.org/wiki/Special:UploadWizard?campaign=CAMPAIGN&categories=hiddencat,CATEGORY&lat=LAT&lon=LON&objref=de|PAGENAME|INVALID_ID
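
The two URL shapes above could be built along these lines. This is only a sketch: the function name `build_redirect_url` and its argument names are invented for the example. It relies on `urllib.parse.urlencode` for the URL encoding required by step 9, which also percent-encodes the `[]`, `,` and `|` characters; the assumption is that the Upload Wizard decodes those like any other query string.

```python
from urllib.parse import urlencode

UPLOAD_WIZARD = "https://commons.wikimedia.org/wiki/Special:UploadWizard"


def build_redirect_url(campaign, category, lat, lon, pagename,
                       monument_id, id_is_valid):
    """Build the Upload Wizard URL; fields[] is only sent for valid IDs."""
    params = [
        ("campaign", campaign),
        ("categories", "hiddencat," + category),
    ]
    if id_is_valid:
        params.append(("fields[]", monument_id))
    params += [
        ("lat", lat),
        ("lon", lon),
        # objref is "de|PAGENAME|ID" for both valid and invalid IDs.
        ("objref", "|".join(["de", pagename, monument_id])),
    ]
    return UPLOAD_WIZARD + "?" + urlencode(params)
```

Empty "lat" or "lon" values are passed through as empty parameters here; whether they should be dropped instead is left open.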

You can have a look at the existing Python bot code for inspiration.

Event Timeline

gabriel-wmde assigned this task to KasiaWMDE.
gabriel-wmde raised the priority of this task from to High.
gabriel-wmde updated the task description.

Even though it's just a redirect script, I'd recommend using the Silex micro framework instead of plain PHP for better structure and code quality (Silex provides routing, parameter handling, validation, templating and basic dependency injection). We could also use URL paths instead of GET parameters.

After some research and talking to @Tobi_WMDE_SW, we decided to save development time for steps 5-8 by creating a Python script that uses the existing Python code (TemplateChecker, TemplateReplacer, CategoryMapper). This script will be called as a subprocess from the PHP redirect script; it will read the wiki text on STDIN and write a JSON response (found template, template position, ID is unique, ID is valid) to STDOUT.
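
To make the JSON response more concrete, here is a hedged sketch of how the Python side might assemble it. It does not use TemplateChecker or the other existing classes, and it skips the wikitext parsing entirely: it assumes the parser has already produced a list of `(template_name, id_value)` pairs in page order, restricted to configured template names. The JSON key names, the function names and the `VALID_ID` pattern are all placeholders, since the task does not define them.

```python
import json
import re
import sys

# Placeholder pattern: the task does not say what a valid ID looks like.
VALID_ID = re.compile(r"\d+")


def build_response(templates, wanted_id):
    """Describe the template whose ID equals wanted_id.

    templates: list of (template_name, id_value) pairs in page order.
    """
    positions = [i for i, (_, id_value) in enumerate(templates)
                 if id_value == wanted_id]
    first = positions[0] if positions else None
    return {
        "found_template": templates[first][0] if positions else None,
        "template_position": first,
        # Unique means exactly one configured template carries this ID.
        "id_is_unique": len(positions) == 1,
        "id_is_valid": VALID_ID.fullmatch(wanted_id) is not None,
    }


def write_response(templates, wanted_id, out=sys.stdout):
    """Write the response as a single JSON line, as the caller expects."""
    out.write(json.dumps(build_response(templates, wanted_id)) + "\n")
```

The PHP script would then decode that single line from the subprocess output and choose between the error page and the two URL variants.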

The main reason for this solution is that, at the moment, there is no standalone PHP parser for wikitext links, headings and templates that works like mwparserfromhell. When T27984 is fixed, we might switch from calling the Python script to the official MediaWiki parser.