
detect all MediaWiki sites on the IWM
Closed, Resolved · Public

Description

The current way to detect whether a site uses the MediaWiki API is to query a normal article page (e.g. the one given via the IWM) and search the HTML head for a generator meta tag (e.g. <meta name="generator" content="MediaWiki 1.25wmf14" /> on the English Wikipedia). The following sites don't have that tag:

None of those sites expose where api.php can be found either. That information would have helped, and it is also necessary to reliably begin using the site via a bot.
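The head-tag check described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the pywikibot implementation. The class name WikiHeadParser and the helper inspect_head are hypothetical. Reading the api.php location from a <link rel="EditURI"> element is an assumption about what "the information where the api.php can be found" refers to; MediaWiki emits such a link by default, but the task description does not name it.

```python
from html.parser import HTMLParser


class WikiHeadParser(HTMLParser):
    """Collect the generator meta tag and the EditURI link from a page.

    Hypothetical helper for illustration; not part of pywikibot.
    """

    def __init__(self):
        super().__init__()
        self.generator = None  # e.g. "MediaWiki 1.25wmf14"
        self.edituri = None    # href of <link rel="EditURI">, if present

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "generator":
            self.generator = attrs.get("content")
        elif tag == "link" and (attrs.get("rel") or "").lower() == "edituri":
            # Assumption: this href points at api.php?action=rsd.
            self.edituri = attrs.get("href")


def inspect_head(html):
    """Return (generator, edituri); either may be None if the tag is absent."""
    parser = WikiHeadParser()
    parser.feed(html)
    return parser.generator, parser.edituri
```

A site for which inspect_head returns (None, None) is in the situation this task describes: neither the generator nor the API location can be read from the page, so some other detection step is needed.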

Details

Related Gerrit Patches:
pywikibot/core : master : Add site_detect.load_site
pywikibot/core : master : Add site detection tests

Event Timeline

jayvdb created this task. Feb 4 2015, 9:43 PM
jayvdb raised the priority of this task from to Medium.
jayvdb updated the task description.
jayvdb added a subscriber: jayvdb.
Restricted Application added a subscriber: Aklapper. Feb 4 2015, 9:43 PM
XZise updated the task description. Feb 4 2015, 11:11 PM
XZise set Security to None.
XZise added a subscriber: XZise. Feb 4 2015, 11:15 PM

The question is what to do with those sites. In theory they could be predefined instead of detected automatically. The meta/generator tag is only needed as a starting point when the site is unknown. In fact, I don't think it is as necessary as determining where api.php is. Finding api.php is better than reading the generator, because it lets pywikibot test whether the site actually behaves like MediaWiki. Doing so would also reveal the version, making the generator information obsolete.

In summary, I think that instead of focusing on meta/generator, it should try to determine where api.php is and do a test siteinfo request. If that works, the API is probably MediaWiki and the bot can start working.
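A rough sketch of that test siteinfo request, assuming the api.php URL is already known. It uses the standard action=query&meta=siteinfo API call and treats a "generator" value starting with "MediaWiki " as success. The function names siteinfo_url, mediawiki_version and probe are hypothetical, and the injectable fetch parameter exists only so the logic can be exercised without network access; this is not the code from the related patches.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def siteinfo_url(api_url):
    """Build a siteinfo query URL for the given api.php location."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "general",
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


def mediawiki_version(payload):
    """Return the version from a decoded siteinfo reply, or None."""
    try:
        generator = payload["query"]["general"]["generator"]
    except (KeyError, TypeError):
        return None
    prefix = "MediaWiki "
    if not isinstance(generator, str) or not generator.startswith(prefix):
        return None
    return generator[len(prefix):]


def probe(api_url, fetch=None):
    """Request siteinfo and return the MediaWiki version, or None.

    fetch is a hypothetical hook: a callable taking a URL and returning
    the response body as text. It defaults to a plain urllib request.
    """
    if fetch is None:
        def fetch(url):
            with urlopen(url, timeout=10) as response:
                return response.read().decode("utf-8")
    try:
        payload = json.loads(fetch(siteinfo_url(api_url)))
    except (OSError, ValueError):
        # Network failure, bad URL, or a body that is not valid JSON.
        return None
    return mediawiki_version(payload)
```

Because the version comes back from the API itself, a successful probe makes the generator tag redundant, which is the point made above.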

Omegat added a subscriber: Omegat. Feb 15 2015, 6:42 AM

As XZise mentioned, for the sites that do not contain the meta/generator tag, I tried to find the endpoint manually. Of the 8 sites above, 5 have a MW API endpoint that was easily detected using the detect_site_type script. But I couldn't find the endpoint for the following:

http://www.otterstedt.de/wiki/index.php/Hauptseite
http://esperanto.blahus.cz/cxej/vikio/index.php/$1
http://www.werelate.org/wiki/Main_Page

Change 186339 had a related patch set uploaded (by Maverick):
Load MW sites without using family file

https://gerrit.wikimedia.org/r/186339

Change 220452 had a related patch set uploaded (by John Vandenberg):
Add site detection tests

https://gerrit.wikimedia.org/r/220452

Change 220452 merged by jenkins-bot:
Add site detection tests

https://gerrit.wikimedia.org/r/220452

jayvdb closed this task as Resolved. Sep 3 2015, 12:01 AM
jayvdb assigned this task to Omegat.

This was resolved with https://gerrit.wikimedia.org/r/#/c/230512/, which was easier because we have switched to requests, but https://gerrit.wikimedia.org/r/#/c/186339/22/ did the initial work to get this working in httplib2.