Extension to resolve external links and apply SpamBlacklist
Open, Needs Triage, Public

Description

External links are always a hot topic; for instance, a lot of work goes into maintaining the global spam blacklist. A silly aspect of our blacklisting is that we cannot handle URL shorteners, because MediaWiki only parses the URL as written.

URL shorteners are horrible, but many{{cn}} websites handle them better than we do: for instance, Twitter fetches the target URL and displays it, and StackExchange detects the redirect and suggests using the actual URL.

As a minimum viable product, a new extension could be written that:

  • for every new external link added, makes a HEAD request to check for status code and location (recursively, by following up to 5? 301/302 redirects);
  • lets other extensions hook into the results, in particular by passing the target URL to SpamBlacklist.
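
The two points above can be sketched roughly as follows. This is only an illustration in Python, not MediaWiki code: the names `resolve`, `on_link_added`, `listeners` and `Resolution` are hypothetical, the limit of 5 hops is the tentative figure from the list, and a real extension would use MediaWiki's own HTTP and hook facilities. The HEAD function is injectable so that the logic can be exercised without network access.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin
import urllib.error
import urllib.request

MAX_REDIRECTS = 5            # the task says "up to 5?"; treat it as tunable
REDIRECT_CODES = {301, 302}  # only the status codes named in the task

# A HEAD function maps a URL to (status code, Location header or None).
HeadFn = Callable[[str], Tuple[int, Optional[str]]]


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop can be inspected."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def head_once(url: str, timeout: float = 5.0) -> Tuple[int, Optional[str]]:
    """Send one HEAD request; network failures (URLError) propagate."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status, response.headers.get("Location")
    except urllib.error.HTTPError as err:  # unfollowed 3xx, 4xx and 5xx
        return err.code, err.headers.get("Location")


@dataclass
class Resolution:
    original_url: str
    final_url: str
    status: int
    chain: List[str] = field(default_factory=list)


def resolve(url: str, head: HeadFn = head_once,
            max_redirects: int = MAX_REDIRECTS) -> Resolution:
    """Follow 301/302 redirects hop by hop, up to max_redirects."""
    original, chain = url, [url]
    status, location = head(url)
    hops = 0
    while status in REDIRECT_CODES and location and hops < max_redirects:
        url = urljoin(url, location)  # Location may be a relative reference
        chain.append(url)
        hops += 1
        status, location = head(url)
    # If status is still a redirect here, the hop limit was reached.
    return Resolution(original, url, status, chain)


# Hypothetical hook point: other code (e.g. a spam blacklist check)
# registers a callable that receives every Resolution.
listeners: List[Callable[[Resolution], None]] = []


def on_link_added(url: str, head: HeadFn = head_once) -> Resolution:
    result = resolve(url, head)
    for listener in listeners:
        listener(result)
    return result
```

A blacklist check would then register a listener and match `result.final_url` (and possibly every entry of `result.chain`) against its patterns, instead of matching only the URL as written.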

The next steps to make the extension behave well enough for Wikimedia deployment would probably be:

  • when parsing a page, queue the resolution of existing external links so that the results are cached and available next time;
  • allow MediaWiki to use some service or specific server to perform the HTTP requests (WMF may want to perform such requests only from certain proxy servers);
  • when saving an edit, also take into account the target and status code of existing external links, where these have already been cached.
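
A minimal sketch of how these steps could fit together, again in Python and purely illustrative: `LinkResolutionCache` and its methods are hypothetical names, the in-memory dict and queue stand in for whatever cache and job queue MediaWiki would actually use, and the `resolver` callable is the place where a dedicated service or proxy server would be plugged in.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, Tuple

Result = Tuple[int, str]  # (status code, final URL)


class LinkResolutionCache:
    def __init__(self, resolver: Callable[[str], Result]):
        # resolver performs the HTTP work; it could call a proxy or service.
        self._resolver = resolver
        self._cache: Dict[str, Result] = {}
        self._pending: Deque[str] = deque()

    def on_parse(self, urls: Iterable[str]) -> None:
        """While parsing a page: never resolve inline, only queue unknowns."""
        for url in urls:
            if url not in self._cache and url not in self._pending:
                self._pending.append(url)

    def run_jobs(self, limit: int = 10) -> int:
        """Resolve up to `limit` queued links; return how many were done."""
        done = 0
        while self._pending and done < limit:
            url = self._pending.popleft()
            self._cache[url] = self._resolver(url)
            done += 1
        return done

    def on_save(self, urls: Iterable[str]) -> Dict[str, Result]:
        """When saving an edit: return whatever is already cached."""
        return {url: self._cache[url] for url in urls if url in self._cache}
```

With this split, parsing stays fast because it only enqueues work, and the save path can pass cached targets of existing links to other checks without issuing new requests.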

After this, other extensions, or other features in the same extension, could allow things like:

  • provide an API to query for broken links (or for external links by 404/301/302/whatever status code);
  • upon parsing, automatically rewrite all URL shortener links to their target in the HTML output;
  • similarly, replace broken links with web.archive.org links (a proof of concept is ArchiveLinks::rewriteLinks()).
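
As a rough illustration of these follow-up features (not the approach of ArchiveLinks::rewriteLinks(), which is not reproduced here), the Python sketch below assumes a mapping from each written URL to its cached (status code, target URL) pair. The function names, the choice of 404 and 410 as "broken", and the regex-based attribute matching are all assumptions made for brevity; a real implementation would work on the parser output rather than on raw HTML strings.

```python
import html
import re
from typing import Dict, List, Tuple

# Assumption: prefixing a URL like this leads to an archived copy, if any.
WAYBACK_PREFIX = "https://web.archive.org/web/"
BROKEN_CODES = {404, 410}  # assumption: which statuses count as "broken"
HREF_RE = re.compile(r'href="([^"]*)"')

Resolved = Dict[str, Tuple[int, str]]  # written URL -> (status, target URL)


def links_with_status(resolved: Resolved, status: int) -> List[str]:
    """Query helper: all known links whose final status code matches."""
    return sorted(url for url, (code, _) in resolved.items() if code == status)


def rewrite_links(markup: str, resolved: Resolved) -> str:
    """Point shortener links at their target and broken links at an archive."""

    def replace(match: "re.Match[str]") -> str:
        url = html.unescape(match.group(1))
        if url not in resolved:
            return match.group(0)  # never resolved: leave untouched
        status, target = resolved[url]
        if status in BROKEN_CODES:
            new_url = WAYBACK_PREFIX + url
        elif target != url:
            new_url = target
        else:
            return match.group(0)
        return 'href="%s"' % html.escape(new_url, quote=True)

    return HREF_RE.sub(replace, markup)
```

Links that were never resolved are left as they are, so the rewrite degrades gracefully when the cache is cold.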

Event Timeline

Nemo_bis created this task. Jun 2 2016, 8:15 AM
Restricted Application added a project: Internet-Archive. Jun 2 2016, 8:15 AM
Restricted Application added subscribers: Zppix, Aklapper.
Quiddity removed a subscriber: Crosswiki.
Restricted Application added a subscriber: JEumerus. Jun 2 2016, 4:19 PM

Why was I subscribed to this?

Because you are seemingly interested in https://www.mediawiki.org/wiki/Archived_Pages . You can unsubscribe of course. :) Thanks for your time.

I am. I am working closely with the people at IA and the Community Tech team at WMF to develop InternetArchiveBot.

But skimming this ticket, it seems to be more about resolving URL shorteners and the blacklist.